Clustering API

The Clustering component includes an interface-based API for clustering entities and inspecting the results.

The interface used to cluster is shown first, followed by the interfaces it references.

/**
 * Client clustering interface.
 */
public interface IClusterProcessor {
    /**
     * Get information about the cluster processor.
     * @return information about the cluster processor
     */
    IClusterProcessorInfo getClusterProcessorInfo();

    /**
     * Cluster the entities.
     * <p/>
     * Clusters the entities using default options.
     *
     * @param entities collection to be clustered
     * @return clustered results - may be empty, never null
     */
    List<IClusterResult> cluster(List<? extends IEntity> entities);

    /**
     * Cluster the entities.
     *
     * @param options clustering options.
     * @param entities collection to be clustered
     * @return clustered result - may be empty, never null
     */
    List<IClusterResult> cluster(Map<String, Object> options, List<? extends IEntity> entities);
}

/**
 * Used to access information about a cluster processor.
 */
public interface IClusterProcessorInfo {
    /**
     * Get the cluster processor's id
     *
     * @return uniquely identifies the clustering engine. Never empty.
     */
    String getId();

    /**
     * Get the cluster processor's display name.
     *
     * @return the display name - never empty
     */
    String getDisplayName();

    /**
     * Get the type of cluster processor
     *
     * @return the processor type
     */
    ClusterProcessorType getType();
}

/**
 * Specifies the type of cluster processor.
 * <p/>
 * TextBased processors consume entities whose attribute values contain arbitrary text.
 * TextBased classifiers have many text preprocessing options.
 * <p/>
 * Spatial classifiers consume entities whose attribute values each contain a single
 * textual representation of a point on a plane.
 */
public enum ClusterProcessorType {
    TextBased, Spatial
}

/**
 * IClusterResult
 */
public interface IClusterResult {
    /**
     * Get a description for the cluster.
     *
     * @return cluster description determined by the clustering engine, or an empty string if not known.
     */
    String getDescription();

    /**
     * Get a list of the entity ids for the cluster.
     *
     * @return list of entity ids. Never null.
     */
    List<String> getEntityIds();

    /**
     * Get additional information about the cluster.
     * <p/>
     * Support for this method is algorithm dependent.
     * Implementations that do not support it should return {@link NullClusterResultsAnalysis#INSTANCE};
     *
     * @return the results analysis
     */
    IClusterResultsAnalysis getResultsAnalysis();

    /**
     * Get additional information about the spatial cluster.
     * <p/>
     * Support for this method is algorithm dependent.
     * Implementations that do not support it should return {@link NullClusterResultsAnalysis#INSTANCE};
     *
     * @return the results analysis
     */
    ISpatialClusterResultsAnalysis getSpatialResultsAnalysis();
}

/**
 * Contains information about the cluster.
 */
public interface IClusterResultsAnalysis {
    /**
     * The score for the cluster.
     * <p/>
     * This method is implementation dependent, and may not be supported by all algorithms. If not supported, zero
     * should be returned.
     *
     * @return a relative score for the cluster; the higher the score, the stronger the belief of the algorithm
     * in the quality of the cluster.
     */
    double getScore();

    /**
     * Get the entity id to {@link IEntityAnalysis} map.
     * <p/>
     * This method is implementation dependent, and may not be supported by all algorithms. If not supported, an empty
     * map should be returned.
     *
     * @return Map: key = the entity id, value = the entity analysis entry.
     */
    Map<String, IEntityAnalysis> getEntityToEntityAnalysis();
}
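
Because getScore() is a relative value that is zero when an algorithm does not support it, a caller may want to rank clusters by score while leaving unscored results in their original order. The following is a minimal sketch of that idea, not part of the framework: the nested interfaces are reduced stand-ins that repeat only the methods used, the result() helper is hypothetical, and the score values are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class RankClustersSketch {
    /** Reduced stand-ins for the documented interfaces; only the methods used here are repeated. */
    interface IClusterResultsAnalysis { double getScore(); }
    interface IClusterResult {
        String getDescription();
        List<String> getEntityIds();
        IClusterResultsAnalysis getResultsAnalysis();
    }

    /** Returns a new list ordered by descending score; ties (e.g. all zero) keep their original order. */
    static List<IClusterResult> byScoreDescending(List<IClusterResult> results) {
        List<IClusterResult> sorted = new ArrayList<>(results);
        sorted.sort(Comparator.comparingDouble(
                (IClusterResult r) -> r.getResultsAnalysis().getScore()).reversed());
        return sorted;
    }

    /** HYPOTHETICAL helper that builds a fixed result for the demo. */
    static IClusterResult result(String description, double score, String... ids) {
        return new IClusterResult() {
            public String getDescription() { return description; }
            public List<String> getEntityIds() { return Arrays.asList(ids); }
            public IClusterResultsAnalysis getResultsAnalysis() { return () -> score; }
        };
    }

    public static void main(String[] args) {
        // The scores below are invented; real values come from the clustering algorithm.
        List<IClusterResult> ranked = byScoreDescending(Arrays.asList(
                result("Cars", 0.4, "7", "9", "8"),
                result("Hamlet", 0.9, "1", "6", "5")));
        for (IClusterResult r : ranked) {
            System.out.println(r.getDescription() + " " + r.getEntityIds());
        }
    }
}
```

The sort is stable, so when every algorithm returns zero the list comes back in the order the processor produced it.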

/**
 * IEntityAnalysis
 */
public interface IEntityAnalysis {
    /**
     * Get the entity's weight.
     * <p/>
     * The entity's weight is implementation dependent, but would typically be a relative indicator of the entity's
     * relevance.
     *
     * @return the entity's weight or zero if not supported.
     */
    double getWeight();

    /**
     * Get the corresponding attribute analysis entries for the entity.
     *
     * @return the corresponding attribute analysis entries
     */
    List<IAttributeAnalysis> getAttributeAnalysis();
}

/**
 * IAttributeAnalysis
 */
public interface IAttributeAnalysis {
    /**
     * Get the text for the attribute.
     * <p/>
     * This may differ from the text returned from <code>IAttribute.getValue()</code> due to preprocessing
     * for text replacement, or if the attribute was given a zero weight.
     *
     * @return the text that corresponds to the offsets returned from {@link #getTokens()}. May be empty, never null.
     */
    String getText();

    /**
     * Get the extended token data.
     *
     * @return list of extended token data. May be empty, never null.
     */
    List<IExtTokenOffsets> getTokens();
}

/**
 * Defines an object to be processed.
 */
public interface IEntity {
    /**
     * Get the unique id for the entity.
     *
     * @return the entity's unique id
     */
    String getId();

    /**
     * Get the attributes for the entity.
     *
     * @return the entity's attributes
     */
    List<IAttribute> getAttributes();
}

/**
 * Defines a feature of an {@link IEntity}.
 */
public interface IAttribute {
    /**
     * The value of the attribute.
     *
     * @return the attribute's text, non-null.
     */
    String getValue();

    /**
     * Get the weight for the attribute.
     *
     * @return the attribute weight.
     */
    double getWeight();

    /**
     * Set a weight for the attribute.
     *
     * @param weight the attribute's weight
     */
    void setWeight(double weight);
}
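
Callers supply the input to cluster() as IEntity objects, so an application typically needs its own implementations of IEntity and IAttribute. The following is a minimal sketch under stated assumptions: the nested interfaces repeat the signatures listed above so the example compiles on its own, and SimpleEntity and SimpleAttribute are hypothetical names, not classes shipped with the framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class EntitySketch {
    // Stand-ins that repeat the documented signatures so the sketch is self-contained.
    interface IAttribute { String getValue(); double getWeight(); void setWeight(double weight); }
    interface IEntity { String getId(); List<IAttribute> getAttributes(); }

    /** HYPOTHETICAL attribute implementation: fixed text, mutable weight. */
    static final class SimpleAttribute implements IAttribute {
        private final String value;
        private double weight;

        SimpleAttribute(String value, double weight) {
            if (value == null) throw new IllegalArgumentException("value must be non-null");
            this.value = value;
            this.weight = weight;
        }
        public String getValue() { return value; }
        public double getWeight() { return weight; }
        public void setWeight(double weight) { this.weight = weight; }
        @Override public String toString() {
            return "Attribute{value='" + value + "', weight=" + weight + '}';
        }
    }

    /** HYPOTHETICAL entity implementation: an id plus a defensive copy of its attributes. */
    static final class SimpleEntity implements IEntity {
        private final String id;
        private final List<IAttribute> attributes;

        SimpleEntity(String id, List<? extends IAttribute> attributes) {
            this.id = id;
            this.attributes = new ArrayList<IAttribute>(attributes);
        }
        public String getId() { return id; }
        public List<IAttribute> getAttributes() { return Collections.unmodifiableList(attributes); }
        @Override public String toString() {
            return "Entity{id='" + id + "', attributes=" + attributes + '}';
        }
    }

    public static void main(String[] args) {
        IEntity entity = new SimpleEntity("1", Arrays.asList(
                new SimpleAttribute("William Shakespeare", 1.0),
                new SimpleAttribute("Hamlet", 2.0)));
        System.out.println(entity);
    }
}
```

The attribute list is copied on construction and exposed read-only, so later changes to the caller's list do not alter an entity that has already been handed to a processor; weights remain adjustable through setWeight().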

/**
 * IClusterProcessorFactory
 */
public interface IClusterProcessorFactory {
    /**
     * Get supported clustering processor ids.
     * <p/>
     * @param type the type(s) of processors to return. Calling this method without any parameters
     * will return the ids for all processors. NOTE: different processor types require different input data.
     * @return a list of supported cluster processor ids
     */
    List<String> getProcessorIds(ClusterProcessorType... type);

    /**
     * Get default cluster processor id.
     *
     * @return the default cluster processor id
     */
    String getDefaultProcessorId();

    /**
     * Gets an instance of the specified cluster processor.
     * <p/>
     * Whether or not the same (or a new) instance of the specified processor is returned each time this method is
     * called is implementation dependent.
     *
     * @param clusterProcessorId the unique id for the cluster processor
     * @return the cluster processor instance
     */
    IClusterProcessor getProcessor(String clusterProcessorId);
}
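
A common pattern with a factory like this is to honor a caller's preferred processor id when it is available and otherwise fall back to the default. The sketch below illustrates that pattern only; the nested types are reduced stand-ins for the documented interfaces, and resolveProcessorId() and stubFactory() are hypothetical helpers. The stub ignores the type filter and treats its first id as the default.

```java
import java.util.Arrays;
import java.util.List;

final class ProcessorSelectionSketch {
    // Reduced stand-ins for the documented types; getProcessor() is left out for brevity.
    enum ClusterProcessorType { TextBased, Spatial }
    interface IClusterProcessorFactory {
        List<String> getProcessorIds(ClusterProcessorType... type);
        String getDefaultProcessorId();
    }

    /** Returns preferredId if the factory supports it, otherwise the factory's default id. */
    static String resolveProcessorId(IClusterProcessorFactory factory, String preferredId) {
        // Calling getProcessorIds() with no arguments asks for the ids of all processors.
        List<String> available = factory.getProcessorIds();
        return available.contains(preferredId) ? preferredId : factory.getDefaultProcessorId();
    }

    /** HYPOTHETICAL in-memory factory for the demo; ignores the type filter. */
    static IClusterProcessorFactory stubFactory(String... ids) {
        return new IClusterProcessorFactory() {
            public List<String> getProcessorIds(ClusterProcessorType... type) { return Arrays.asList(ids); }
            public String getDefaultProcessorId() { return ids[0]; }
        };
    }

    public static void main(String[] args) {
        IClusterProcessorFactory factory = stubFactory("lingo", "lda", "katz");
        System.out.println(resolveProcessorId(factory, "lda"));
        System.out.println(resolveProcessorId(factory, "unknown-id"));
    }
}
```

With a real factory, the resolved id would then be passed to getProcessor(String).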

Entity Processing

Each entity to be clustered is typically converted into a single bag-of-words. An attribute's weight determines the number of times the attribute's value is copied into the bag-of-words. The framework provides a number of preprocessing options that can be set using the Clustering Options Management GUI, which is described later in this document.
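
To make the weighting rule concrete, the following is a minimal sketch and not the framework's implementation. It assumes whitespace tokenization, lower-casing, and that a weight is rounded to the nearest whole number of copies (so a zero weight drops the value); the class, the Attr type, and the method name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of weight-driven bag-of-words construction. */
final class BagOfWordsSketch {

    /** Minimal stand-in for an attribute: a text value plus a weight. */
    static final class Attr {
        final String value;
        final double weight;
        Attr(String value, double weight) { this.value = value; this.weight = weight; }
    }

    /**
     * Builds a token -> count map for one entity.
     * ASSUMPTION: the weight is rounded to the nearest integer and the value is
     * copied that many times; zero or negative weights contribute nothing.
     */
    static Map<String, Integer> toBagOfWords(List<Attr> attributes) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (Attr attr : attributes) {
            int copies = (int) Math.max(0, Math.round(attr.weight));
            for (int i = 0; i < copies; i++) {
                for (String token : attr.value.trim().toLowerCase().split("\\s+")) {
                    if (!token.isEmpty()) {
                        bag.merge(token, 1, Integer::sum);
                    }
                }
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        List<Attr> entity = new ArrayList<>();
        entity.add(new Attr("William Shakespeare", 1.0));
        entity.add(new Attr("Hamlet", 2.0));
        // prints {william=1, shakespeare=1, hamlet=2}
        System.out.println(toBagOfWords(entity));
    }
}
```

Under these assumptions, doubling an attribute's weight doubles the count of each of its tokens, which is how a weight of 2.0 makes "Hamlet" count twice as much as either word of the author's name.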

Coding Example

The following snippet from the clustering example program ClusteringExample shows the clustering API calls.

The example can be run by unzipping the deployable zip and executing the cluster.bat file in the examples directory. The example runs all clustering algorithms and can output quite a bit of text. You may wish to redirect the output to a text file (e.g., cluster.bat > cluster.out.txt) and view it in a text editor.

See the JavaDoc for more information on the API.


import java.util.List;
// Imports of the framework's clustering types are omitted from this snippet.

public class ClusteringExample {
    public static void main(String[] args) {
        final String margin = " ";

        IClusterProcessorFactory factory = ClusterProcessorFactoryLocator.getFactory();
        List<String> processorIds = factory.getProcessorIds(ClusterProcessorType.TextBased);
        List<IEntity> inputData = getEntitiesToCluster();
        logMsg("Available clustering processors: " + processorIds);
        logMsg("Default processor: [" + factory.getDefaultProcessorId() + ']');
        logMsg("Input data");
        logMsg(formatEntities(inputData, margin));
        logMsg("");
        // Run every text-based processor over the same input, using default options.
        for (String processorId : processorIds) {
            logMsg("Using [" + processorId + ']');
            IClusterProcessor processor = factory.getProcessor(processorId);
            List<IClusterResult> results = processor.cluster(inputData);
            logMsg(new ClusterResultsFormatter(results).setClusterInput(inputData).format());
            logMsg("");
        }
    }

    // The helper methods getEntitiesToCluster(), formatEntities() and logMsg() are not shown in this snippet.
}

The following is console output from the example program:
Available clustering processors: [lingo, lda, katz]
Default processor: [lingo]
Input data
Entity{id='1', attributes=[Attribute{value='William Shakespeare', weight=1.0}, Attribute{value='Hamlet', weight=2.0}]}
Entity{id='2', attributes=[Attribute{value='Shakespeare', weight=1.0}, Attribute{value='Julius Caesar', weight=1.0}]}
Entity{id='3', attributes=[Attribute{value='PAL', weight=1.0}, Attribute{value='osama bin laden sited in
...

Using [lingo]
ClusterResult{description='Cars', entityIds=[7, 9, 8]}
ClusterResult{description='Hamlet', entityIds=[1, 6, 5]}
ClusterResult{description='Osama bin Laden Sited in Sector', entityIds=[3, 10]}
ClusterResult{description='Julius Caesar', entityIds=[2, 5, 4]}

Using [lda]
ClusterResult{description='Hyundai', entityIds=[9]}
ClusterResult{description='car', entityIds=[8]}
ClusterResult{description='Roman', entityIds=[6]}
ClusterResult{description='Caesar', entityIds=[2, 4]}
ClusterResult{description='Winfrey', entityIds=[3, 10]}
ClusterResult{description='Hamlet', entityIds=[1, 5]}
ClusterResult{description='Chevy', entityIds=[7]}

Using [katz]
ClusterResult{description='Centroid Entity ID=3', entityIds=[3, 10]}
ClusterResult{description='Centroid Entity ID=5', entityIds=[2, 1, 6, 5, 4]}
ClusterResult{description='Centroid Entity ID=9', entityIds=[7, 9, 8]}

Configuration

The framework is initialized using Spring configuration files. The following is the main configuration file.

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:util="http://www.springframework.org/schema/util"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
           http://www.springframework.org/schema/util http://www.springframework.org/schema/util/spring-util-3.0.xsd"
       default-autowire="no" default-lazy-init="true">

    <import resource="directories.xml"/>
    <import resource="resources.xml"/>

    <import resource="optionMap.xml"/>
    <import resource="optionMapEntryDescriptors-textPrep.xml"/>

    <import resource="clustering-directories.xml"/>
    <import resource="clustering-optionMapEntryDescriptors.xml"/>

    <import resource="clustering-lda-descriptor.xml"/>
    <import resource="clustering-lda-optionMapEntryDescriptors.xml"/>
    <import resource="clustering-lingo-optionMapEntryDescriptors.xml"/>
    <import resource="clustering-lingo-descriptor.xml"/>
    <import resource="clustering-katz-optionMapEntryDescriptors.xml"/>
    <import resource="clustering-katz-descriptor.xml"/>

    <import resource="clustering-options-ui-display-controller.xml"/>

    <bean class="com.sri.dolphin.learning.clustering.ClusterProcessorFactory"
          id="clustering.cluster-processor-factory"
          init-method="init">
        <property name="clusterProcessorDescriptors">
            <description>
                The first entry in LinkedHashSet is the default.
            </description>
            <util:set set-class="java.util.LinkedHashSet">
                <ref bean="clustering-lingo-descriptor.descriptor"/>
                <ref bean="clustering-lda-descriptor.descriptor"/>
                <ref bean="clustering-katz-descriptor.descriptor"/>
            </util:set>
        </property>
    </bean>

    <bean id="clustering.clusterService" class="com.sri.dolphin.learning.clustering.ClusterService" abstract="true"
          init-method="init">
        <property name="resourceManager" ref="resources.languageManager"/>
    </bean>

</beans>
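
The configuration declares the descriptor set as a java.util.LinkedHashSet and notes that its first entry is the default. That works because a LinkedHashSet iterates in insertion order, so the order of the ref elements presumably decides which processor is the default. The short sketch below demonstrates only the collection behavior; the class and method names are hypothetical, and plain strings stand in for the descriptor beans.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

final class DefaultDescriptorSketch {
    /** Returns the first element in iteration order, or null for an empty set. */
    static String firstEntry(Set<String> descriptors) {
        return descriptors.isEmpty() ? null : descriptors.iterator().next();
    }

    public static void main(String[] args) {
        // Strings stand in for the descriptor beans, in the same order as the <ref> elements.
        Set<String> descriptors = new LinkedHashSet<>(Arrays.asList("lingo", "lda", "katz"));
        System.out.println(firstEntry(descriptors)); // prints lingo

        // Listing lda first would make it the first entry instead.
        Set<String> reordered = new LinkedHashSet<>(Arrays.asList("lda", "lingo", "katz"));
        System.out.println(firstEntry(reordered)); // prints lda
    }
}
```

A plain HashSet gives no such ordering guarantee, which is why the set-class attribute matters here.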

Clustering Options Management GUI

The Clustering Options Management GUI is a small utility that allows a user to easily view all options that are available for a given algorithm and change their default settings.

The GUI can be run by unzipping the deployable zip and executing the cluster-options-ui.bat file that is contained within the examples directory.

[Figure: ClusterOptionsGUI]

The buttons at the bottom of the GUI perform the following actions:

Saved changes are persisted to an XML file that is used for subsequent clustering requests. Options can also be overridden per clustering request by passing a Map of options to the cluster method at runtime.

A brief description of a field can be displayed by hovering the mouse cursor over a field's label.
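
The per-request override described above can be pictured as layering a small Map over the saved defaults. The following is a sketch of that idea only: how the framework actually combines the two is not documented here, the option keys "maxClusters" and "useStopwords" are hypothetical, and effectiveOptions() is an illustrative helper rather than a framework method.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

final class OptionOverrideSketch {
    /** Returns the defaults with per-request entries layered on top; the inputs are not modified. */
    static Map<String, Object> effectiveOptions(Map<String, Object> savedDefaults,
                                                Map<String, Object> requestOverrides) {
        Map<String, Object> merged = new HashMap<>(savedDefaults);
        merged.putAll(requestOverrides);
        return Collections.unmodifiableMap(merged);
    }

    public static void main(String[] args) {
        // HYPOTHETICAL option keys; the real keys are those shown in the options GUI.
        Map<String, Object> saved = new HashMap<>();
        saved.put("maxClusters", 10);
        saved.put("useStopwords", true);

        Map<String, Object> perRequest = new HashMap<>();
        perRequest.put("maxClusters", 3);

        Map<String, Object> effective = effectiveOptions(saved, perRequest);
        // prints 3 true
        System.out.println(effective.get("maxClusters") + " " + effective.get("useStopwords"));
    }
}
```

In real use, a map like perRequest would be passed as the first argument of cluster(Map<String, Object> options, List<? extends IEntity> entities).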

Resource Editor GUI

The Resource Editor GUI is a small utility that allows a user to view and edit the resource files supported by the framework.

The framework uses the resource files for entity preprocessing when the Standard tokenizer is selected and the specific resource feature is enabled. The following resource file types are currently supported: stopwords, stoplabels, wordAliases, and textReplacements. Documentation for each resource file type appears at the top of the English version of the resource file.
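
As an illustration of what a stopwords resource is generally used for, the sketch below drops stopwords from a piece of text before it would be clustered. It is not the framework's preprocessing code: it assumes whitespace tokenization and lower-casing, the stopword list is invented rather than taken from the shipped resource file, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopwordFilterSketch {
    /** Splits on whitespace, lower-cases, and drops any token found in the stopword set. */
    static List<String> removeStopwords(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // ASSUMPTION: illustrative stopwords, not the contents of the shipped resource file.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of", "a"));
        // prints [tragedy, hamlet]
        System.out.println(removeStopwords("The Tragedy of Hamlet", stopwords));
    }
}
```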

The GUI can be run by unzipping the deployable zip and executing the resource-editor-ui.bat file that is contained within the examples directory.



The user selects the language using the drop-down list at the top left of the screen and the resource type from the drop-down list at the top right.

The buttons at the bottom of the GUI perform the following actions:

If the text in the editor has been modified, both drop-down lists are disabled and the Save and Reset buttons are enabled. The user can Undo or Redo text modifications using the options in the Edit menu. The editor will not allow a resource file to be saved if it would fail validation.