On the Empirical Complexity of Text Classification Problems
by Madani, Omid; Raghavan, Hema; Jones, Rosie
Technical Report 567
Institution: SRI International
Address: 333 Ravenswood Ave, Menlo Park, CA 94025
Different learning problems, in particular high-dimensional ones such as text classification, can require widely different amounts of training before a classifier generalizes well. The amount of training can be measured by the number of training instances required to reach adequate accuracy or by the number of features the classifier effectively uses. We define several measures of learning difficulty and explore their utility in approximately capturing the inherent complexity of text classification problems. These measures can be computed efficiently for real-world problems on which linear classifiers are effective. Under these measures, we observe a close relationship (a high positive correlation) between feature complexity and instance complexity.
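As a rough illustration of the instance-based notion above, the following minimal sketch reads a learning curve and returns the smallest training-set size that reaches a target accuracy, then correlates two lists of per-problem complexity values. It is not the report's definition of its measures: the function names `instances_to_reach` and `pearson`, the accuracy threshold, and all numbers are hypothetical and chosen only for illustration.

```python
from math import sqrt


def instances_to_reach(curve, target):
    """Return the smallest training-set size whose accuracy meets `target`.

    `curve` is a list of (num_training_instances, accuracy) pairs.
    Returns None if no point on the curve reaches the target.
    (Hypothetical helper, not taken from the report.)
    """
    for n, acc in sorted(curve):
        if acc >= target:
            return n
    return None


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up learning curve for a single classification problem.
curve = [(100, 0.70), (200, 0.85), (400, 0.90)]
needed = instances_to_reach(curve, target=0.85)

# Made-up complexity values for three problems, used only to exercise pearson().
instance_complexity = [1, 2, 3]
feature_complexity = [2, 4, 6]
correlation = pearson(instance_complexity, feature_complexity)
```

A correlation close to 1 across many problems would correspond to the kind of relationship between feature complexity and instance complexity that the abstract describes.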
Madani, Omid: Senior Computer Scientist