Semantic Extraction: Wikipedia Extraction Example

Source Code and Required Jar/Data Files

Introduction

The following example shows how to use the Apache Axis Project to fetch articles from Wikipedia and then extract semantic data from them using the PAL Semantic Extraction APIs. It also shows how to convert the HTML markup returned by Wikipedia into plain text using the HTML Parser project.

Calling the Wikipedia API

Our first task is to call the Wikipedia web services API to search for candidate articles matching a keyword. Fortunately, the Apache Axis Project makes this easy to do. We start by running the WSDL2Java command-line tool to generate a set of stubs for accessing the web service:

% java org.apache.axis.wsdl.WSDL2Java http://wikipedia-lab.org:8080/WikipediaOntologyAPIv3/Service.asmx

This call generates a collection of classes representing the types, ports, and services defined in the WSDL document. The basic idea is to use the ServiceLocator object to find the desired service and then call the getCandidatesFromKeyword method.

private List<Candidate> getCandidates(String keyword) {
    List<Candidate> list = new ArrayList<Candidate>();
    try {
        // Locate the SOAP service using the stubs generated by WSDL2Java
        ServiceLocator locator = new ServiceLocator();
        ServiceSoap service = locator.getServiceSoap();
        GetCandidatesFromKeywordResponseGetCandidatesFromKeywordResult result = service.getCandidatesFromKeyword(keyword, "English");
        // The result wraps a .NET DataSet diffgram, so we pull out the raw message elements
        MessageElement[] messages = result.get_any();
        if (messages.length >= 2) {
            MessageElement diffgram = messages[1];
            MessageElement newDataSet = diffgram.getChildElement(new QName("","NewDataSet"));
            List children = newDataSet.getChildren();
            // Each child is a Table element describing one candidate article
            for (Object child : children) {
                MessageElement table = (MessageElement) child;
                String lti_to_id = table.getChildElement(new QName("","lti_to_id")).getValue();
                String name = table.getChildElement(new QName("","name")).getValue();
                String lti_count = table.getChildElement(new QName("","lti_count")).getValue();
                list.add(new Candidate(Integer.parseInt(lti_to_id), name, Integer.parseInt(lti_count)));
            }
        }
    } catch (ServiceException e) {
        log.error("ServiceException getting candidates for keyword " + keyword, e);
    } catch (RemoteException e) {
        log.error("RemoteException getting candidates for keyword " + keyword, e);
    }
    return list;
}

The code above is somewhat involved because we have to parse the result object ourselves to extract the desired fields. The parsing is made trickier by the fact that the web service returns a .NET DataSet, a type with no direct Java equivalent. After parsing the results, we return a list of Candidate objects.
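
The Candidate class itself is not generated by WSDL2Java; it is a simple value object of our own. A minimal sketch is shown below, assuming it just holds the three fields parsed above (the field and accessor names, apart from getName, are illustrative):

public class Candidate {

    private final int id;       // parsed from the lti_to_id field
    private final String name;  // the article name, used later to build the Wikipedia URL
    private final int count;    // parsed from the lti_count field

    public Candidate(int id, String name, int count) {
        this.id = id;
        this.name = name;
        this.count = count;
    }

    public int getId() { return id; }
    public String getName() { return name; }
    public int getCount() { return count; }
}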

Getting Wikipedia URLs

The next step is to build a list of page URLs from the list of Candidate objects generated above. Each URL is simply the concatenation of the prefix "http://en.wikipedia.org/wiki/" and the candidate's name.

private List<String> getUrls(List<Candidate> candidates) {
    List<String> list = new ArrayList<String>();
    for (Candidate candidate : candidates) {
        list.add("http://en.wikipedia.org/wiki/" + candidate.getName());
    }
    return list;
}

Getting the Article Text

Now that we have a list of article URLs, we need a method that fetches each page and converts it to plain text. The StringBean class from the HTML Parser project makes this easy.

private String getContent(String url) {
    StringBean sb = new StringBean();
    sb.setLinks(false);                   // omit link URLs from the extracted text
    sb.setReplaceNonBreakingSpaces(true); // convert non-breaking spaces to ordinary spaces
    sb.setCollapse(true);                 // collapse runs of whitespace
    sb.setURL(url);                       // fetch and parse the page
    return sb.getStrings();               // return the plain text extracted from the page
}

Initializing the Semantic Extraction Engine

Next we need to initialize the summarization engine by creating a new FindAndSummarize object, setting the working directory, and calling startup.

private void initialize() {
    fas = new FindAndSummarize();
    fas.setWorkDir("data");
    fas.startup();
}

Note that we can also supply training examples at this point, such as additional person names, location names, or name/type pairs:

    fas.learnPersonName("Abu Viddi");
    fas.learnLocationName("Mosal");
    fas.learnNameAndType("al-Shabab", "Organization");

For a complete discussion on these options, see the PAL Semantic Extraction Framework document in the parent folder.

Extracting the Content

The article content is processed by calling the semanticExtraction method, which returns an extraction wrapper object. The wrapper contains a ThingMap that maps each type to a list of things of that type. In the example below, we simply walk the lists for every type and print out their contents.

private void summarizeContent(String content) {
    // Run the semantic extraction engine over the plain-text article
    IExtractWrapper wrapper = fas.semanticExtraction(content);
    // The ThingMap associates each type (Person, Acronym, ...) with the list of things of that type
    HashMap<String,List> thingMap = wrapper.getThingMap();
    for (List list : thingMap.values()) {
        for (Object obj : list) {
            Thing thing = (Thing) obj;
            System.out.println("type = " + thing.getType() +
                    ", value = " + thing.getValue() +
                    ", position = " + thing.getCharPosition());
        }
    }
}
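
Putting it all together, a simple driver can chain the helper methods shown above. The sketch below is illustrative rather than part of the PAL API; the summarize method name and the exact order of calls are our own assumptions:

public void summarize(String keyword) {
    initialize();
    List<Candidate> candidates = getCandidates(keyword);
    List<String> urls = getUrls(candidates);
    for (String url : urls) {
        System.out.println("Summarizing content for " + url);
        String content = getContent(url);
        summarizeContent(content);
    }
}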

Example Output

And here is what the output looks like for the Wikipedia article on Iran:

Summarizing content for http://en.wikipedia.org/wiki/Iran
type = Acronym, value = UN, position = 4731
type = Acronym, value = NAM, position = 4735
type = Acronym, value = OIC, position = 4740
type = Acronym, value = OPEC, position = 4748
type = Person, value = Mohammed Mossadegh, position = 37381
type = Person, value = Dwight D. Eisenhower, position = 37692
type = Person, value = Hassan Pakravan., position = 38402
...