CALO Express (CE) Desktop Harvesting

CALO Express harvests user data from the desktop to create a semantic data store containing relational information about the structure of commonly used objects such as email, calendar, and documents. Tagging can be used to create links between objects (e.g., between a meeting and related documents), thus providing the means to layer additional semantic information on top of the harvested data.

The CE data store can be leveraged for a wide range of purposes; the main focus within the PAL program has been to build a model of a given user's interests and preferences.

CE provides two APIs to access the contents of the semantic data store:

1. The open-source Lucene API provides direct data store access. The Lucene API enables read-write access to any field, but requires greater knowledge of the internal structure of the data store.

2. A higher-level SearchEngine API, built on top of the Lucene API, provides query access to the data store through an object-oriented wrapper. This high-level API provides read-only access.

This document describes how to configure the CALO Express harvester, how to interact with the harvested data through CE interfaces, and how to access the data store using the direct access and high-level APIs.

Semantic data harvested by CE

Outlook 2003 & 2007: Email, Email Attachment, Calendar Event, Contact, Task, Note

Thunderbird Email: Email, Email Attachment

Files (Microsoft Office 2003 & 2007): PowerPoint slides, Word documents, Excel spreadsheets

Files (Other): PDF, HTML, Text

Harvesting Configuration

The CE harvester works behind the scenes, running only when the computer is idle.

Harvesting configuration setup

CE provides a user-friendly configuration dialog to select the file types and file folders, Outlook folders, and Thunderbird folders to scan as part of the harvesting process.


[Figure: ceFileTypes]

Viewing harvesting statistics

 

The same dialog gives statistics on the number and types of items harvested.

[Figure: ceSearch]

CE Interface to Access the Data Store

Querying and viewing harvested items

Users can enter English-like queries in the CE Search box.

[Figure: CaloTraySearchBox.png]

Query results are displayed in a Windows File Explorer window. Note that the results include items not normally found in File Explorer windows, such as email, contacts, and individual slides from presentations.


Tagging

CE provides a "tagging" mechanism for all harvested data. Items can be dragged from Windows File Explorer, PowerPoint, Outlook, or Thunderbird to the CALO Express icon in the Windows system tray and tagged via the "Tag As" action. Items in CE Search Results can be tagged by using the "Tag As" action in the left panel or in the right-click menu.

Tag management

The CE Tag Manager lets you edit, create, combine, and delete tags. Rules can be defined to automatically tag items when harvested.

Double-clicking a tag in the Tag Manager shows all items with that tag.


Low-Level Direct Data Store Access

The Lucene API gives low-level, read-write access to the CE data store. The data store is a set of files in standard Lucene format located at "C:\Documents and Settings\YOUR-USERNAME\caloExpressPersistence\ver-XX\lucene\index". Custom data store access code can be written using the Lucene bindings for the language of your choice, available from lucene.apache.org.
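As a small illustration of the location described above, the following sketch assembles the index directory path from a user name and a version number. The class and method names are hypothetical, and the values "bob" and "96" are illustrative only (they mirror the usage messages in the sample programs later in this document); the actual "ver-XX" number depends on the installed CE build.

```java
public class CeIndexLocation {
    /**
     * Builds the index directory path in the layout described above.
     * Both arguments are supplied by the caller; this helper does not
     * check that the directory actually exists.
     */
    public static String indexDirectory(String userName, String version) {
        return "C:\\Documents and Settings\\" + userName
                + "\\caloExpressPersistence\\ver-" + version
                + "\\lucene\\index";
    }

    public static void main(String[] args) {
        // Illustrative values only; substitute your own user name and version.
        System.out.println(indexDirectory("bob", "96"));
    }
}
```

The resulting string can be passed as the first argument to the direct-access sample programs.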

Data Store Record Format

Each record in the Lucene data store represents one "Item" harvested from a desktop "Document," where "Document" means a file, an object in Outlook, or a Thunderbird email. The mapping between Items and Documents is often, but not always, one-to-one. A single PowerPoint file generates multiple Items: one for the document itself and one for each slide. A single email can also generate multiple Items: one for the email and one for each email attachment.

Each record contains fields common to all items as well as fields specific to its item type.
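To make the Item-to-Document mapping concrete, the sketch below groups item ids by the documentPath they were harvested from, so that a presentation and its slides end up in one group. It is plain Java with no CE or Lucene dependency. The class name is hypothetical and the slide ids in the demo are invented for illustration; real ids are produced by the harvester.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ItemsByDocument {
    /**
     * Groups item ids by documentPath.
     * Each input entry is a two-element array: {id, documentPath}.
     * Insertion order of documents and ids is preserved.
     */
    public static Map<String, List<String>> group(List<String[]> items) {
        Map<String, List<String>> byDocument = new LinkedHashMap<String, List<String>>();
        for (String[] item : items) {
            String id = item[0];
            String documentPath = item[1];
            List<String> ids = byDocument.get(documentPath);
            if (ids == null) {
                ids = new ArrayList<String>();
                byDocument.put(documentPath, ids);
            }
            ids.add(id);
        }
        return byDocument;
    }

    public static void main(String[] args) {
        String deck = "FileDoc:c:\\documents\\calo.ppt";
        List<String[]> items = new ArrayList<String[]>();
        items.add(new String[] {deck, deck});                 // the presentation itself
        items.add(new String[] {"hypothetical-slide-1", deck}); // invented slide id
        items.add(new String[] {"hypothetical-slide-2", deck}); // invented slide id
        System.out.println(group(items));
    }
}
```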

Common fields for all items

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| id | string | Unique identifier of the item | "FileDoc:c:\documents\presentation.txt" |
| documentPath | string | Location of the object from which this item was harvested. If multiple items were harvested from the same document, they share the same documentPath. | "FileDoc:c:\documents\presentation.txt" |
| type | string | Item type. The type and extension fields together define the kind of item. | File, Outlook |
| extension | string | Item extension. See type. | .txt, Email |
| subject | string | Subject line or document filename | readme.txt readme, Re: Meet tomorrow? |
| Title | string | Title extracted from the content of the item (null if not found) | Business review |
| Folder | string | Folder containing this item | "CaloIris" |
| created | DateTime (Lucene format) | Item creation date | 0000fviqzn0z |
| modified | DateTime (Lucene format) | Item last-modified date | 0000fviqzn0z |
| recency | DateTime (Lucene format) | The timestamp that best represents "recency" for this item type (e.g., LastModified for files, the sent date for email, the event start time for calendar events) | 0000fviqzn0z |
| length | long | Item length in bytes | 000000000261 |
| directParentPath | string | Path of the immediate parent folder | FileFldr:c:\documents |
| parentPath | string | Path of a parent folder in the item's full path (one value for every folder in the path) | FileFldr:c:\documents, FileFldr:c:\ |
| searchText | string | The words extracted from this item that CE matches when it performs searches. All strings are trimmed, lower-cased, and truncated to a maximum length. Semicolon-separated list. | readme.txt; readme; before; installing; this; software; please; read; this |
| sentences | string | Sentences from this item. Words and sentences are truncated to a certain length. Semicolon-separated list. | Bill works at SRI International.;He is a senior software engineer. |
| shortSummary | string | A human-readable summary of the text of this item | CALO Express is a lightweight version of CALO... CE provides reusable data harvesting... |
| popularWords | string | A short list of the most popular words in this item. All strings are trimmed and lower-cased. Semicolon-separated list. | ports;used;makes;tcp;slides;displaying;uses;outgoing;corresponding |
| userHarvested | string | Indicates whether this item was added to the index because the user explicitly performed an action to add it. Note: the field is not present if false. | T |
| tagSort_{tagName} | string | One tagSort_{tagName} field is present for each tag applied to the item (e.g., all items with the tag "important" have a "tagSort_important" field). The value always starts with 'x' and gives a human-readable indication of why the item was tagged. | ximportant: "email this week about San Diego" |
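The DateTime values in the table (for example 0000fviqzn0z) can be decoded without Lucene. The sketch below assumes the encoding implied by the sample programs later in this document, which build range bounds as "000" + DateField.dateToString(...), together with the assumption that Lucene 3.0's DateField writes the timestamp as base-36 milliseconds since the epoch, left-padded with zeros to nine characters. The class name CeDateCodec is hypothetical, and the helper makes no attempt to handle negative timestamps or dates too large for nine base-36 digits.

```java
import java.util.Date;

public class CeDateCodec {
    private static final String PREFIX = "000";   // prefix used by the sample programs
    private static final int BASE36_LENGTH = 9;   // assumed DateField string length

    /** Encodes a date as "000" followed by nine base-36 digits of epoch milliseconds. */
    public static String encode(Date date) {
        String digits = Long.toString(date.getTime(), Character.MAX_RADIX);
        StringBuilder sb = new StringBuilder(PREFIX);
        for (int i = digits.length(); i < BASE36_LENGTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    /** Decodes a stored value back to a date; leading zeros are harmless when parsing. */
    public static Date decode(String value) {
        return new Date(Long.parseLong(value, Character.MAX_RADIX));
    }

    public static void main(String[] args) {
        // Decode the example value from the table and print it in the local time zone.
        System.out.println(decode("0000fviqzn0z"));
    }
}
```

Because the strings are fixed-length and zero-padded, lexicographic order matches chronological order, which is what allows a string range query over fields such as receivedOn.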

 

Slide

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| position | string | Slide number within the presentation | 21 |
| presentation | string | Full path of the presentation | c:\documents\calo.ppt |
| presentationName | string | Name of the presentation | calo.ppt |

 

Outlook and Thunderbird Email

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| To | string | List of "to" addresses. All keywords that could match a search for an email's recipient name or address. | Steve Hardt steve.hardt steve.hardt@sri.com |
| cc | string | List of "cc" addresses. All keywords that could match a search for an email's cc name or address. | Steve Hardt steve.hardt steve.hardt@sri.com Sandra Smith sandra.smith sandra.smith@sri.com |
| bcc | string | List of "bcc" addresses. All keywords that could match a search for an email's bcc name or address. | Steve Hardt steve.hardt steve.hardt@sri.com Sandra Smith sandra.smith sandra.smith@sri.com |
| receivedOn | DateTime (Lucene format) | Date the email was received | 0000fvbalqg0 |
| sentOn | DateTime (Lucene format) | Date the email was sent | 0000fvbalml4 |
| senderEmailAddress | string | Email address of the sender | sandra.smith@sri.com |
| senderName | string | Name of the sender if known, or an empty string if only senderEmailAddress is available | Sandra Smith |
| sentOnBehalfOfName | string | Name of the person on whose behalf the email was sent | David Martin |
| Unread | string | Whether this email is unread | False |
| importance | string | Importance of this email. Possible values are olimportancelow, olimportancenormal, and olimportancehigh. | olimportancenormal |

 

Outlook and Thunderbird Email attachment

Field

Type

Description

Examples

emailAttachmentEmailPath

string

Path of email attachment

OutlookDoc:sriemail|\\Personal Folders\Inbox\Notes |olMailItem|00..

 

Outlook contact

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| firstName | string | Contact's first name | John |
| fullName | string | Contact's full name | John Smith |
| jobTitle | string | Contact's job title | Program Director |
| birthday | DateTime (Lucene format) | Birthday of this contact | 000sb8398u80 |
| managerName | string | Name of this contact's manager | David Martin |
| assistantName | string | Name of this contact's assistant | Sandra Smith |
| businessAddress | string | Full business address for this contact | 333 Ravenswood Avenue, Menlo Park, CA 94025 USA |
| businessAddressCity | string | Business address city | Menlo Park |
| businessAddressCountry | string | Business address country | USA |
| businessAddressPostalCode | string | Business address postal code | 94025 |
| businessAddressState | string | Business address state | CA |
| businessAddressStreet | string | Business address street | 333 Ravenswood Avenue |
| businessHomePage | string | Business home page | www.sri.com |
| businessTelephoneNumber | string | Business phone number | (650) 859 8888 |
| companyName | string | Name of company | SRI |
| email1Address | string | Email address | John.smith@sri.com |
| email1DisplayName | string | Email display name | John Smith |
| homeTelephoneNumber | string | Home telephone number | (650) 859 6578 |
| imAddress | string | IM address | johnsmith |
| webPage | string | Web page for this contact | http://home.johnsmith.com |

 

Outlook calendar event

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| allDayEvent | string | Whether this is an all-day event | True |
| busyStatus | string | Status for this event | olFree |
| endTime | DateTime (Lucene format) | End time for this event | 0000fvroy1s0 |
| isConflict | string | Whether there is a conflict on the calendar for this event | False |
| isRecurring | string | Whether this is a recurring event | True |
| location | string | Location for this event | EJ228 |
| optionalAttendees | string | List of optional attendees for this event | David Martin |
| organizer | string | Event organizer | John Smith |
| reminderSet | string | Whether a reminder is set for this event | False |
| requiredAttendees | string | List of required attendees for this event | Bob Low; Luis Arvez |
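Several fields hold more than one value in a single string. The common fields searchText, sentences, and popularWords are documented as semicolon-separated lists, and the requiredAttendees example above has the same shape. A minimal sketch for splitting such a value follows; the class name is hypothetical, and the sketch assumes that individual entries never contain a semicolon themselves, which may not hold for the sentences field.

```java
import java.util.ArrayList;
import java.util.List;

public class SemicolonList {
    /** Splits a semicolon-separated field value into trimmed, non-empty entries. */
    public static List<String> parse(String value) {
        List<String> entries = new ArrayList<String>();
        if (value == null) {
            return entries;
        }
        for (String part : value.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                entries.add(trimmed);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Example value taken from the requiredAttendees row above.
        for (String attendee : parse("Bob Low; Luis Arvez")) {
            System.out.println(attendee);
        }
    }
}
```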

 

Outlook note

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| categories | string | Categories for this note | work |
| color | string | Color for this note | olYellow |
| isConflict | string | Whether there is a conflict | False |

 

Outlook task

| Field | Type | Description | Examples |
| --- | --- | --- | --- |
| owner | string | Owner of this task | David Martin |
| complete | string | Whether the task is complete | False |
| actualWork | string | Actual work for this task | 0 |
| billingInformation | string | Billing information for this task | SRI |
| cardData | string | Data for the card | VISA |
| categories | string | List of categories for this task | Work |
| companies | string | List of companies for this task | SRI |
| contactNames | string | List of contact names | David Martin |
| dateCompleted | DateTime (Lucene format) | Completion date for this task | 000sb8398u80 |
| delegationState | string | State of delegation | Oltasknotdelegated |
| delegator | string | Name of delegator | Steve Jones |
| dueDate | DateTime (Lucene format) | Due date for this task | 0000f7h7gps0 |
| importance | string | Importance of this task | Olimportancenormal |
| isConflict | string | Whether there is a conflict | False |
| isRecurring | string | Whether this is a recurring task | False |
| mileage | string | Mileage for this task | 18k |
| ownership | string | Ownership status for this task | olnewtask |
| precentComplete | string | Percentage of completeness for this task | 0 |
| responseState | string | Response state | Oltasksimple |
| role | string | Role for this task | Manager |
| sensitivity | string | Sensitivity of this task | Olnormal |
| startDate | DateTime (Lucene format) | Starting date of this task | 000sb8398u80 |
| status | string | Status of this task | Oltasknotstarted |
| statusOnCompletionRecipients | string | List of people to be notified when the task is completed | Tim Youl |
| statusUpdateRecipients | string | List of people to be notified when the task is updated | Archie Bunker |
| teamTask | string | Whether this is a team task | False |
| totalWork | string | Total work for this task | 0 |

 

RecentEmail_Direct sample program

This sample program queries for all email received during the past week and prints the subject line of each email, using direct data store access instead of the SearchEngine API.

/** To compile and run this example, download Apache Lucene from http://lucene.apache.org, and
 *  compile/run with lucene-core-3.0.0.jar in the CLASSPATH.
 */

package cedemo;

import org.apache.lucene.search.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.FSDirectory;
 
import java.util.Date;
import java.util.Calendar;
import java.io.File;

public class RecentEmail_Direct {
    /** Prints the subject lines of all email received in the last week.
        Pass in location of index directory
    */

    public static void main(String[] argv) {
        if (argv.length == 0) {
          System.out.println("Must include index directory, generally something like c:\\documents and settings\\bob\\caloExpressPersistence\\ver-96\\lucene\\index");
          System.exit(1);
        }

        String indexDir = argv[0];
 
        BooleanQuery query = new BooleanQuery();
        Query emailTermQuery = new TermQuery(new Term("extension","email"));
        query.add(new BooleanClause(emailTermQuery, BooleanClause.Occur.MUST));
        String now = "000" + DateField.dateToString(new Date());
        Calendar weekAgoCalendar = Calendar.getInstance();
        weekAgoCalendar.add(Calendar.DATE,-7);
        String weekAgo = "000" + DateField.dateToString(weekAgoCalendar.getTime());
        Query receivedOnQuery = new TermRangeQuery("receivedOn",weekAgo,now,true,true);
        query.add(new BooleanClause(receivedOnQuery, BooleanClause.Occur.MUST));

        try {

            IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)), true);
            Searcher searcher = new IndexSearcher(reader);
            // Return 1000 docs max
            TopDocs topDocs = searcher.search(query,1000);
            ScoreDoc[] hits = topDocs.scoreDocs;

            System.out.println("Found " + hits.length +
                    " emails received in the last week, here are the subject lines:");

            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                String subject = doc.get("subject");

                System.out.println(subject);
            }

            // Release the index files once the results have been printed.
            searcher.close();
            reader.close();
        }

        catch (Exception e)
        {
            System.err.println("Error accessing Lucene index at " + indexDir + ": " + e.toString());
        }
    }
}

 

TaggedItems_Direct sample program

The following sample program reads tags using direct data store access. It queries the Lucene data store for items with a specific tag and prints the item IDs.

/** To compile and run this example, download Apache Lucene from http://lucene.apache.org, and
 *  compile/run with lucene-core-3.0.*.jar in the CLASSPATH.
 */

package cedemo;

import org.apache.lucene.search.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.document.Document;
import org.apache.lucene.store.FSDirectory;

import java.io.File;

public class TaggedItems_Direct {
    /** Query for items tagged with a specific tag and print out the item ids.
        Pass in location of index directory as first argument, name of tag as second argument
    */

    public static void main(String[] argv) {
        if (argv.length < 2) {

          System.out.println("First argument is index directory, generally something like c:\\documents and settings\\bob\\caloExpressPersistence\\ver-96\\lucene\\index.");

          System.out.println("Second argument is tag to search for.");

          System.exit(1);
        }

        String indexDir = argv[0];
        String tagName = argv[1];

        Query query = new PrefixQuery(new Term("tagSort_" + tagName,"x"));

        try {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File(indexDir)), true);
            Searcher searcher = new IndexSearcher(reader);
            // Return 1000 docs max
            TopDocs topDocs = searcher.search(query,1000);
            ScoreDoc[] hits = topDocs.scoreDocs;

            System.out.println("Found " + hits.length +
                    " items tagged " + tagName + ".  Here are the item Ids:");

            for (ScoreDoc hit : hits) {
                Document doc = searcher.doc(hit.doc);
                String id = doc.get("id");

                System.out.println(id);
            }

            // Release the index files once the results have been printed.
            searcher.close();
            reader.close();
        }

        catch (Exception e)
        {
            System.err.println("Error accessing Lucene index at " + indexDir + ": " + e.toString());
        }
    }
}
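The query above relies on two conventions from the field table: each tag is stored in a field named tagSort_{tagName}, and the field value always starts with 'x'. The sketch below captures both in plain Java, with no Lucene dependency. The class and method names are hypothetical, and treating everything after the leading 'x' as the human-readable part is an assumption based on the single example value in the table.

```java
public class TagSortFields {
    private static final String FIELD_PREFIX = "tagSort_";
    private static final String VALUE_MARKER = "x";

    /** Returns the name of the field that stores the given tag. */
    public static String fieldName(String tagName) {
        return FIELD_PREFIX + tagName;
    }

    /**
     * Returns the text after the leading 'x' marker,
     * or null if the value does not look like a tag value.
     */
    public static String reason(String fieldValue) {
        if (fieldValue == null || !fieldValue.startsWith(VALUE_MARKER)) {
            return null;
        }
        return fieldValue.substring(VALUE_MARKER.length());
    }

    public static void main(String[] args) {
        System.out.println(fieldName("important"));
        System.out.println(reason("ximportant: \"email this week about San Diego\""));
    }
}
```

Since every tag value starts with the same marker, a prefix query on "x" against fieldName(tagName) matches exactly the items that carry that tag, which is the approach the sample program takes.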

 

High-Level SearchEngine API

CE provides a high-level, read-only SearchEngine API to access the CE data store from third-party applications written in Java or C#. First, a query is constructed using CaloQueryFactory methods. Then, an ISearchEngine.search() call returns a list of ISearchResultItem objects. Refer to the Javadocs for a description of the structure of ISearchResultItem.

RecentEmail_SearchEngine sample program

This sample program queries the semantic data store for all email received in the last week and prints the subject lines.

/** To compile and run this example:
 * 1) Install CALO Express
 * 2) Add the following jar files from "C:\Program Files\SRI\CALO Express\Java"
 *    to your CLASSPATH:
 *    expressCommon.jar
 *    log4j-1.2.14.jar
 *    xmlrpc-common-3.1.2.jar
 *    xmlrpc-client-3.1.2.jar
 *    xmlrpc-server-3.1.2.jar
 *    ws-commons-util-1.0.2.jar
 *    commons-logging-1.1.jar
 * 3) Download Apache Lucene from http://lucene.apache.org
 * 4) Add lucene-core-3.0.*.jar to your CLASSPATH
 */

package cedemo;

import org.apache.lucene.search.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.document.DateField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.FSDirectory;
import java.util.Date;
import java.util.Calendar;
import java.util.List;
import java.util.ArrayList;
import java.io.File;
import com.sri.calo.caloItem.search.ISearchEngine;
import com.sri.calo.caloItem.search.SearchEngineSingleton;
import com.sri.calo.caloItem.search.SearchConstants;
import com.sri.calo.caloItem.query.CaloQueryFactory;
import com.sri.calo.caloItem.query.ICaloQueryClause;
import com.sri.calo.caloItem.query.ICaloQuery;
import com.sri.calo.caloItem.item.IEmailItem;
import com.sri.calo.caloItem.item.ISearchResultItem;
import com.sri.calo.caloCommon.CaloConstants;

public class RecentEmail_SearchEngine {
    /** Prints the subject lines of all email received in the last week.
        Use high-level ISearchEngine API to access CALO Express data store.
    */

    public static void main(String[] argv) {
        // Construct the query
        List<ICaloQueryClause> clauses = new ArrayList<ICaloQueryClause>();
        // Constrain type to be email
        ICaloQueryClause typeClause =
            CaloQueryFactory.ConstructTypeConstraint(IEmailItem.class);
        clauses.add(typeClause);
        // Constrain received-on to be last week.
        Calendar weekAgoCalendar = Calendar.getInstance();
        weekAgoCalendar.add(Calendar.DATE,-7);
        Date startDate = weekAgoCalendar.getTime();
        Date endDate = new Date();
        ICaloQueryClause dateClause =
            CaloQueryFactory.ConstructDateConstraint(
                SearchConstants.RECEIVED_ON_FIELD, startDate, endDate);
        clauses.add(dateClause);
        // The complete query
        ICaloQuery query = CaloQueryFactory.ConstructQuery(clauses);

        try {
            ISearchEngine searchEngine = SearchEngineSingleton.singleton();
            // Typed list so that the for-each loop over ISearchResultItem compiles.
            List<ISearchResultItem> results =
                searchEngine.search(query, SearchConstants.MAX_SEARCH_SIZE);

            System.out.println("Found " + results.size() +
                    " emails received in the last week, here are the subject lines:");

            for(ISearchResultItem result: results) {
                // Since we constrained the search to just IEmailItem objects, it is safe to cast here.
                IEmailItem emailItem = (IEmailItem)result;

                System.out.println(emailItem.getSubject());
            }
        }

        catch (Exception e)
        {
            System.err.println("Error performing SearchEngine search: " + e.toString());
        }
    }
}