ACL'99 Workshop

Unsupervised Learning in Natural Language Processing

University of Maryland
June 21, 1999

Endorsed by SIGNLL, the Association for Computational Linguistics (ACL) Special Interest Group (SIG) on Natural Language Learning.

Workshop Description

Many of the successes achieved with learning techniques in natural language processing (NLP) have come within the supervised paradigm, in which models are trained from data annotated with the target concepts to be learned. In some cases this annotation comes essentially for free: in language modeling for speech recognition, for instance, the target concepts are the words themselves, so raw text corpora suffice. More often it does not. The first successful part-of-speech taggers were made possible by the existence of the Brown corpus (Francis, 1964), a million-word data set which had been laboriously hand-tagged a quarter of a century earlier. Likewise, progress in statistical parsing required the development of the Penn Treebank (Marcus et al., 1993), the product of many staff-years of effort. While annotated data should certainly be exploited when available, the future success of learning for natural language systems cannot depend on a paradigm that requires a large annotated data set for each new problem or application. Annotation is prohibitively expensive in time and expertise, and the resulting corpora are often restricted to a particular domain, application, or genre.

Thus, long-term progress in NLP is likely to depend on unsupervised and weakly supervised learning techniques, which do not require large annotated data sets. Unsupervised learning uses raw, unannotated data to discover underlying structure, patterns, and principles. Weakly supervised learning applies supervised learning to small annotated data sets in order to seed learning from much larger unannotated ones. Because these techniques can identify new and unanticipated correlations in data, they have the additional advantage of feeding new insights back into more traditional lines of basic research.

Unsupervised and weakly supervised methods have been used successfully in several areas of NLP, including acquiring verb subcategorization frames (Brent, 1993; Manning, 1993), part-of-speech tagging (Brill, 1997), word sense disambiguation (Yarowsky, 1995), and prepositional phrase attachment (Ratnaparkhi, 1998). The goal of this workshop is to discuss, promote, and present new research results (positive and negative) in the use of such methods in NLP. We encourage submissions on work applying learning to any area of language interpretation or production in which the training data does not come fully annotated with the target concepts to be learned, including:

  • Fully unsupervised algorithms
  • 'Weakly supervised' learning: bootstrapping models from small sets of annotated data
  • 'Indirectly supervised' learning, in which end-to-end task evaluation drives learning in an embedded language interpretation module
  • Exploratory data analysis techniques applied to linguistic data
  • Unsupervised adaptation of existing models in changing environments
  • Quantitative and qualitative comparisons of results obtained with supervised and unsupervised learning approaches
Position papers on the pros and cons of supervised vs. unsupervised learning will also be considered.

Format for Submission

Paper submissions may take the form of extended abstracts or full papers, not to exceed six (6) pages. Authors of extended abstracts should note the short interval between notification of acceptance and the final paper deadline. Up to two additional pages may be allocated for the final paper, depending on space constraints.

Authors are requested to submit one electronic version of their papers *or* four hardcopies. Please submit hardcopies only if electronic submission is impossible. Submissions in Postscript or PDF format are strongly preferred.

If possible, please conform to the traditional two-column ACL Proceedings format. Style files can be downloaded from:

Email submissions should be sent to:

Hard copy submissions should be sent to:

Andrew Kehler
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025


Important Dates

  • Paper submission deadline: March 26
  • Notification of acceptance: April 16
  • Camera ready papers due: April 30


Workshop Organizers

  • Andrew Kehler (SRI International)
  • Andreas Stolcke (SRI International)

Program Committee

  • Michael Brent (Johns Hopkins University)
  • Eric Brill (Johns Hopkins University)
  • Rebecca Bruce (University of North Carolina at Asheville)
  • Eugene Charniak (Brown University)
  • Michael Collins (AT&T Laboratories)
  • Marie desJardins (SRI International)
  • Moises Goldszmidt (SRI International)
  • Andrew Kehler (SRI International)
  • John Lafferty (Carnegie Mellon University)
  • Lillian Lee (Cornell University)
  • Chris Manning (University of Sydney)
  • Andrew McCallum (Carnegie Mellon University and Just Research)
  • Ray Mooney (University of Texas, Austin)
  • Srini Narayanan (ICSI, Berkeley)
  • Fernando Pereira (AT&T Laboratories)
  • David Powers (Flinders University of South Australia)
  • Adwait Ratnaparkhi (IBM Research)
  • Dan Roth (University of Illinois at Urbana-Champaign)
  • Andreas Stolcke (SRI International)
  • Janyce Wiebe (New Mexico State University)
  • Dekai Wu (Hong Kong University of Science and Technology)
  • David Yarowsky (Johns Hopkins University)