Workshop on Unsupervised Learning
in Natural Language Processing
INVITED TALK
Unsupervised
Segmentation of Japanese
Lillian Lee, Cornell University
The problem of word segmentation arises in Japanese language
processing because Japanese text lacks word delimiters (imagine
English text without spaces between the words). Typical approaches to
the Japanese word segmentation problem rely either on large lexicons
combined with morphological analysis or on models trained on
annotated corpora.
We present work in progress on an extremely simple, dictionary-less
method that relies only on statistics gathered from unannotated
corpora. Preliminary results show large improvements over rule-based
morphological analyzers across a variety of error metrics.
This is joint work with Rie Ando (Cornell University).