Classification Example - Email Importance

1. Introduction

This example uses the Classification Framework to classify email as either important or not important.

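Classification is based on the following set of features: senderInAddressBook, usuallyRepliesToSender, usuallyRepliesToDomain, usuallyRepliesToSenderIfCc, usuallyRepliesIfAttachment, userSenderRelation, containRequestQuestion, aboutTopic, fromImportantPerson, and partOfThread.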

The classifier requires supervised training, so we provide it with a set of emails of known importance. The result of the training is a model that can be used to predict the importance of new email.

2. Use the Entity Class to Create the Object for Classification

We start by using the Entity class to represent the email objects that we want to classify. For each email, we create an Entity object with a unique email Id and a list of attributes (senderInAddressBook, usuallyRepliesToSender, usuallyRepliesToDomain, usuallyRepliesToSenderIfCc, usuallyRepliesIfAttachment, userSenderRelation, containRequestQuestion, aboutTopic, fromImportantPerson, partOfThread). The code fragment below shows the construction.

    // Each variable passed to Attribute holds the value of one feature of the email.
    List<IAttribute> attributes = new ArrayList<IAttribute>();
    attributes.add(new Attribute(senderInAddressBook));
    attributes.add(new Attribute(usuallyRepliesToSender));
    attributes.add(new Attribute(usuallyRepliesToDomain));
    attributes.add(new Attribute(usuallyRepliesToSenderIfCc));
    attributes.add(new Attribute(usuallyRepliesIfAttachment));
    attributes.add(new Attribute(userSenderRelation));
    attributes.add(new Attribute(containRequestQuestion));
    attributes.add(new Attribute(aboutTopic));
    // The second argument gives this attribute a weight of 3 instead of the default 1.
    attributes.add(new Attribute(fromImportantPerson, 3));
    attributes.add(new Attribute(partOfThread));
    emailObj = new Entity(emailId, attributes);

Because the classifier processes all features as String terms within the same namespace, feature values must be unique across attributes. For the attribute usuallyRepliesToSender, we represent its Boolean value (whether or not the user usually replies to this sender's email) with the strings "usuallyRepliesToSender_yes" and "usuallyRepliesToSender_no". For the attribute aboutTopic, we use the string values "aboutTopic_project", "aboutTopic_meeting", "aboutTopic_friends", and "aboutTopic_misc" to represent the possible types of subject matter discussed in an email.
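One way to keep feature values unique is to prefix every value with the name of its attribute. The sketch below illustrates that naming scheme only. FeatureValues and its two methods are hypothetical helpers written for this example and are not part of the Classification Framework.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this example; not a framework class.
class FeatureValues {
    // Encodes a Boolean feature as "<name>_yes" or "<name>_no".
    static String encodeBoolean(String featureName, boolean value) {
        return featureName + (value ? "_yes" : "_no");
    }

    // Encodes a categorical feature as "<name>_<category>".
    static String encodeCategory(String featureName, String category) {
        return featureName + "_" + category;
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<>();
        values.add(encodeBoolean("usuallyRepliesToSender", true));
        values.add(encodeCategory("aboutTopic", "meeting"));
        // Prints [usuallyRepliesToSender_yes, aboutTopic_meeting]
        System.out.println(values);
    }
}
```

The resulting strings are what we would pass to the Attribute constructor, so that "yes" for one attribute can never collide with "yes" for another.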

Use the Classification Options Management GUI to set the tokenizer to Simple so that the framework will not filter or modify any of the attribute values. 

By default, the classification framework treats all attributes equally (weight 1), but individual attributes can be given higher weights. In this example, we have chosen to give a weight of 3 to the attribute fromImportantPerson.
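The default-weight rule can be pictured as a lookup table that falls back to 1 for any attribute without an explicit weight. The sketch below is only an illustration of that rule. AttributeWeights is a hypothetical class, and we make no claim about how the framework stores or applies weights internally.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of "weight 1 unless stated otherwise"; not a framework class.
class AttributeWeights {
    static final int DEFAULT_WEIGHT = 1;

    private final Map<String, Integer> weights = new HashMap<>();

    // Records an explicit weight for one attribute.
    void setWeight(String attributeName, int weight) {
        weights.put(attributeName, weight);
    }

    // Returns the explicit weight if one was set, otherwise the default of 1.
    int weightOf(String attributeName) {
        return weights.getOrDefault(attributeName, DEFAULT_WEIGHT);
    }

    public static void main(String[] args) {
        AttributeWeights weights = new AttributeWeights();
        weights.setWeight("fromImportantPerson", 3);
        System.out.println(weights.weightOf("fromImportantPerson")); // prints 3
        System.out.println(weights.weightOf("partOfThread"));        // prints 1
    }
}
```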

3. Training and Classification

The classifier must be trained beforehand to develop a classification model. One way to obtain such a model is to pass the classifier a list of email objects with known importance. To do this, we create a HashMap with two entries, one for "important" and another for "not-important". The key of each entry is an importance label, and the value is the list of emails with that importance. In the code fragment below, trainingData is such a HashMap.
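The layout of the training map can be sketched as follows. To keep the example self-contained, plain String Ids stand in for the Entity objects, and the exact type parameters that the framework expects for trainingData are an assumption. TrainingDataBuilder and addExample are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for this example; not a framework class.
class TrainingDataBuilder {
    // Adds one labeled example, creating the list for its label on first use.
    static <E> void addExample(Map<String, List<E>> trainingData, String label, E email) {
        trainingData.computeIfAbsent(label, key -> new ArrayList<>()).add(email);
    }

    public static void main(String[] args) {
        // String Ids stand in for Entity objects in this sketch.
        Map<String, List<String>> trainingData = new HashMap<>();
        addExample(trainingData, "important", "email-001");
        addExample(trainingData, "important", "email-002");
        addExample(trainingData, "not-important", "email-003");

        System.out.println(trainingData.get("important"));     // prints [email-001, email-002]
        System.out.println(trainingData.get("not-important")); // prints [email-003]
    }
}
```

With real Entity objects in place of the Ids, a map built this way has the two-entry shape described above.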

    IClassifierProcessorFactory factory = ClassifierProcessorFactoryLocator.getFactory();
    IClassifierProcessor processor = factory.getDefaultProcessor();
    processor.train(trainingData);

For classification, we build a list which contains the email objects of unknown importance. In the code fragment below, dataToClassify is such a list.

    IClassifierResults results = processor.classify(dataToClassify);

The returned value is an IClassifierResults object that contains the two importance categories and a list of email Ids for each category.

To process the results, we make use of the API utility class ResultsUtil. The following code fragment gets the more probable importance category for a given email.

    String importance = ResultsUtil.getClasses(results, emailId, 1).get(0).getClassLabel();

The importance variable will contain either "important" or "not-important", matching the labels used as keys in the training data.
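Asking for one class and taking the first element amounts to picking the top-ranked label for an email. The sketch below shows that selection on a stand-in data structure. The per-label scores, the TopClasses class, and the topLabels method are all hypothetical. We do not know how the framework represents scores, so this is not a description of ResultsUtil itself.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for ranking class labels; not a framework class.
class TopClasses {
    // Returns up to n labels, ordered from highest to lowest score.
    static List<String> topLabels(Map<String, Double> scores, int n) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(scores.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            labels.add(entries.get(i).getKey());
        }
        return labels;
    }

    public static void main(String[] args) {
        // Made-up scores for a single email, for illustration only.
        Map<String, Double> scores = new HashMap<>();
        scores.put("important", 0.8);
        scores.put("not-important", 0.2);
        System.out.println(topLabels(scores, 1).get(0)); // prints important
    }
}
```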