Workshop on Unsupervised Learning in Natural Language Processing


Unsupervised Segmentation of Japanese
Lillian Lee, Cornell University

The problem of word segmentation arises in Japanese language processing because Japanese text lacks word delimiters (imagine English text without spaces between the words). Typical approaches to the Japanese word segmentation problem rely either on large lexicons combined with morphological analysis or on models trained on annotated corpora.

We present work in progress on an extremely simple, dictionary-less method that relies only on statistics gathered from unannotated corpora. Preliminary results show large improvements over rule-based morphological analyzers across a variety of error metrics.

This is joint work with Rie Ando (Cornell University).