edu.psu.ist.youseer
Class Worker

java.lang.Object
  extended by edu.psu.ist.youseer.Worker
All Implemented Interfaces:
java.lang.Runnable

public class Worker
extends java.lang.Object
implements java.lang.Runnable

Title:

Description: This is the basic unit of execution. Each worker is responsible for parsing a document and generating the corresponding Solr document. During processing, the SubmitterDocument is passed to the CustomeExtractor to check whether the user has implemented any specific extraction functions.

Copyright: Copyright Madian Khabsa @ Penn State(c) 2009

Company: Penn State


Field Summary
private  SubmitterDocument doc
           
private  ARCSubmitter parent
           
 
Constructor Summary
Worker(ARCSubmitter parent, SubmitterDocument doc)
           
 
Method Summary
 java.lang.String GenerateDocument()
          Generates a Solr document for the processed ARC record using the tags from the configuration file
private static java.lang.String getTitle(net.htmlparser.jericho.Source source)
          Extracts the title from a text document using the Jericho parser
private  boolean InsertToDB(java.lang.String result)
          Inserts a log entry into the database recording that the current document was not submitted to the index
 boolean ProcessBinaryDocument()
          Processes the binary document, converts it to plain text using Apache Tika, and then extracts the title of the file
 boolean ProcessTextDocument()
          Processes the text document, extracts the title, and strips the HTML tags
 void run()
           
 java.lang.String sendPostCommand(java.lang.String command, java.lang.String url)
          Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM.
 java.lang.String StripHTML(java.lang.String rawString)
          Strips the HTML tags from the text.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

parent

private ARCSubmitter parent

doc

private SubmitterDocument doc
Constructor Detail

Worker

public Worker(ARCSubmitter parent,
              SubmitterDocument doc)
Method Detail

run

public void run()
Specified by:
run in interface java.lang.Runnable
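
Because Worker implements java.lang.Runnable, it can be handed to a thread or an executor. How ARCSubmitter actually schedules its workers is not documented on this page; the sketch below only illustrates one plausible arrangement, a fixed thread pool with one task per document. DocumentTask, processAll, and WorkerPoolSketch are hypothetical stand-ins, not part of YouSeer.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class WorkerPoolSketch {

    /** Hypothetical stand-in for Worker: one Runnable per document. */
    static class DocumentTask implements Runnable {
        private final String document;
        private final AtomicInteger processed;

        DocumentTask(String document, AtomicInteger processed) {
            this.document = document;
            this.processed = processed;
        }

        @Override
        public void run() {
            // A real Worker would parse the document and build the Solr document here.
            if (document != null) {
                processed.incrementAndGet();
            }
        }
    }

    /** Runs one task per document on a fixed pool and returns how many completed. */
    static int processAll(List<String> documents, int threads) {
        AtomicInteger processed = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String document : documents) {
            pool.execute(new DocumentTask(document, processed));
        }
        pool.shutdown(); // accept no new tasks, let queued ones finish
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
        return processed.get();
    }
}
```

Keeping each task independent of the others means a failure in one document does not have to stop the rest of the batch.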

InsertToDB

private boolean InsertToDB(java.lang.String result)
Inserts a log entry into the database recording that the current document was not submitted to the index

Parameters:
result - String The exception error message
Returns:
boolean

ProcessTextDocument

public boolean ProcessTextDocument()
Processes the text document, extracts the title, and strips the HTML tags

Returns:
boolean

ProcessBinaryDocument

public boolean ProcessBinaryDocument()
Processes the binary document, converts it to plain text using Apache Tika, and then extracts the title of the file

Returns:
boolean

GenerateDocument

public java.lang.String GenerateDocument()
Generates a Solr document for the processed ARC record using the tags from the configuration file

Returns:
String
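
The exact output of GenerateDocument depends on the configuration file, which is not shown here. As a rough illustration only, a Solr XML update message wraps each document in <add><doc> and emits one <field> element per name/value pair. The class name, method names, and the "title" field below are assumptions made for the example.

```java
import java.util.Map;

public class SolrAddXmlSketch {

    /** Escapes the characters that are not allowed as-is in XML text and attributes. */
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Builds a single-document Solr add command from field name/value pairs. */
    static String toAddCommand(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            xml.append("<field name=\"")
               .append(escapeXml(field.getKey()))
               .append("\">")
               .append(escapeXml(field.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }
}
```

Escaping matters because extracted titles and body text routinely contain ampersands and angle brackets that would otherwise produce malformed XML.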

StripHTML

public java.lang.String StripHTML(java.lang.String rawString)
Strips the HTML tags from the text. This depends on the HTMLStripReader class that ships with Solr.

Parameters:
rawString - String
Returns:
String
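
The method itself relies on Solr's HTMLStripReader. For readers without Solr on the classpath, the following sketch shows the general idea with plain regular expressions instead. It is a deliberately simpler substitute, not the technique used by this class, and it does not decode character entities or cope with malformed markup. All names are hypothetical.

```java
import java.util.regex.Pattern;

public class StripHtmlSketch {
    // Script and style elements carry no readable text, so drop them with their content.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // Any remaining tag, including ones that span several lines.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the text with tags removed and runs of whitespace collapsed. */
    static String stripHtml(String rawString) {
        String text = SCRIPT_OR_STYLE.matcher(rawString).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```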

sendPostCommand

public java.lang.String sendPostCommand(java.lang.String command,
                                        java.lang.String url)
                                 throws java.lang.Exception
Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM.

Parameters:
command - String the command to be sent
url - String the URL of the server
Returns:
String The result of the submit
Throws:
java.lang.Exception
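
The original implementation is not reproduced on this page. A minimal sketch of posting a command string to a URL with the JDK's HttpURLConnection might look like the following; the text/xml content type and the class and method names are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class PostCommandSketch {

    /** Posts the command as the request body and returns the response body as text. */
    static String sendPost(String command, String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true); // we are going to write a request body
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(command.getBytes(StandardCharsets.UTF_8));
            }
            // On HTTP errors the body is only available through the error stream.
            InputStream in = conn.getResponseCode() < 400
                    ? conn.getInputStream()
                    : conn.getErrorStream();
            if (in == null) {
                return "";
            }
            try (InputStream stream = in) {
                return readAll(stream);
            }
        } finally {
            conn.disconnect();
        }
    }

    /** Reads the whole stream and decodes it as UTF-8. */
    private static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Returning the error body instead of throwing lets the caller log the server's explanation when a submit is rejected.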

getTitle

private static java.lang.String getTitle(net.htmlparser.jericho.Source source)
Extracts the title from a text document using the Jericho parser

Parameters:
source - Source
Returns:
String
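
The method above works on a Jericho Source. To show the idea without that dependency, the sketch below pulls the first <title> element out of raw HTML with a regular expression. This is a simplified substitute rather than the class's own approach, and returning null when no title is found is an assumption; the page does not say what getTitle does in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSketch {
    // Case-insensitive, and DOTALL so a title may span several lines.
    private static final Pattern TITLE =
            Pattern.compile("(?is)<title\\b[^>]*>(.*?)</title\\s*>");

    /** Returns the text of the first title element, or null if there is none. */
    static String getTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        // Collapse line breaks and indentation inside the title.
        return m.group(1).replaceAll("\\s+", " ").trim();
    }
}
```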