java.lang.Object edu.psu.ist.youseer.Worker
public class Worker
Title:
Description: This is the basic unit of execution. Each worker is responsible for parsing one document and generating the corresponding Solr document. During processing, the SubmitterDocument is passed to the CustomeExtractor to check whether the user has implemented any custom extraction functions.
Copyright: Copyright Madian Khabsa @ Penn State(c) 2009
Company: Penn State
Field Summary | |
---|---|
private SubmitterDocument | doc |
private ARCSubmitter | parent |
Constructor Summary | |
---|---|
Worker(ARCSubmitter parent, SubmitterDocument doc) | |
Method Summary | |
---|---|
java.lang.String | GenerateDocument(): Generates the Solr document for the processed ARC record using the tags from the configuration file. |
private static java.lang.String | getTitle(net.htmlparser.jericho.Source source): Extracts the title from a text document using the Jericho parser. |
private boolean | InsertToDB(java.lang.String result): Inserts a log entry into the database recording that the current document was not submitted to the index. |
boolean | ProcessBinaryDocument(): Processes the binary document, converts it to plain text using Apache Tika, and then extracts the title of the file. |
boolean | ProcessTextDocument(): Processes the text document, extracts the title, and strips the HTML tags. |
void | run() |
java.lang.String | sendPostCommand(java.lang.String command, java.lang.String url): Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM. |
java.lang.String | StripHTML(java.lang.String rawString): Strips the HTML tags from the text. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
---|
private ARCSubmitter parent
private SubmitterDocument doc
Constructor Detail |
---|
public Worker(ARCSubmitter parent, SubmitterDocument doc)
Method Detail |
---|
public void run()
Specified by: run in interface java.lang.Runnable
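Because `run()` is specified by `java.lang.Runnable`, a worker can be handed to a thread or an executor. The following is a minimal sketch of that pattern, not the project's actual code: `SketchWorker`, its `documentId` field, and the shared `processed` queue are hypothetical stand-ins for `Worker`, `SubmitterDocument`, and whatever `ARCSubmitter` does with the results.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for Worker: one instance handles exactly one document.
final class SketchWorker implements Runnable {
    private final String documentId;        // assumption: stands in for SubmitterDocument
    private final Queue<String> processed;  // assumption: stands in for the parent submitter

    SketchWorker(String documentId, Queue<String> processed) {
        this.documentId = documentId;
        this.processed = processed;
    }

    @Override
    public void run() {
        // A real worker would parse the document and build a Solr document here.
        processed.add(documentId);
    }
}

final class SketchWorkerDemo {
    // Runs one worker per document id on a fixed-size pool and waits for completion.
    static Queue<String> processAll(List<String> documentIds, int threads) {
        Queue<String> processed = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String id : documentIds) {
            pool.execute(new SketchWorker(id, processed));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(processAll(List.of("a", "b", "c"), 2).size());
    }
}
```

One worker per document keeps each unit of work independent, so the pool size can be tuned without changing the worker itself.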
private boolean InsertToDB(java.lang.String result)
Parameters:
result - String, the exception error message
public boolean ProcessTextDocument()
public boolean ProcessBinaryDocument()
public java.lang.String GenerateDocument()
Parameters:
doc - SubmitterDocument
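As an illustration of what a generated Solr document can look like, here is a minimal sketch that builds a Solr XML `<add>` command from a map of field names to values. It is an assumption about the output shape, not the method's actual implementation: the class `SolrAddXmlSketch`, its method names, and the field names used in `main` are hypothetical, and the real method takes its tags from the configuration file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds a Solr XML update message of the form
// <add><doc><field name="...">...</field></doc></add>.
final class SolrAddXmlSketch {
    // Escapes the five XML special characters so values are safe to embed.
    static String escapeXml(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Emits one <field> element per map entry, in the map's iteration order.
    static String buildAddCommand(Map<String, String> fields) {
        StringBuilder xml = new StringBuilder("<add><doc>");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            xml.append("<field name=\"").append(escapeXml(e.getKey())).append("\">")
               .append(escapeXml(e.getValue()))
               .append("</field>");
        }
        return xml.append("</doc></add>").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "doc-1");            // assumption: illustrative field names
        fields.put("title", "Tom & Jerry");
        System.out.println(buildAddCommand(fields));
    }
}
```

Escaping matters here because extracted titles and body text routinely contain `&` and `<`, which would otherwise produce malformed XML.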
public java.lang.String StripHTML(java.lang.String rawString)
Parameters:
rawString - String
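For readers unfamiliar with tag stripping, the idea can be sketched with the standard library alone. This is a simplified regex-based stand-in, not the class's implementation (which may rely on the Jericho parser instead); `StripHtmlSketch` and `stripHtml` are hypothetical names, and the sketch does not decode character entities such as `&amp;`.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based approximation of HTML tag stripping.
final class StripHtmlSketch {
    // Whole <script> and <style> elements, including their contents.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag or comment-like construct between angle brackets.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripHtml(String rawString) {
        String text = SCRIPT_OR_STYLE.matcher(rawString).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Collapse the gaps left by removed tags into single spaces.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Hello <b>world</b></p>"));
    }
}
```

A regex is adequate for a sketch but is fragile on malformed markup, which is one reason a dedicated HTML parser is usually preferred for real crawled pages.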
public java.lang.String sendPostCommand(java.lang.String command, java.lang.String url) throws java.lang.Exception
Parameters:
command - String, the command to be sent
url - String, the URL of the server
Throws:
java.lang.Exception
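A POST of this kind can be sketched with `java.net.HttpURLConnection`. The code below is an assumption about the general approach, not the method's actual body: `PostCommandSketch` and `sendPost` are hypothetical names, and the `text/xml` content type is a guess based on the command being a Solr XML message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: posts a command string and returns the response body.
final class PostCommandSketch {
    static String sendPost(String command, String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // Assumption: the server expects an XML body encoded as UTF-8.
            conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(command.getBytes(StandardCharsets.UTF_8));
            }
            // Error responses are delivered on a separate stream.
            int status = conn.getResponseCode();
            InputStream stream =
                status >= 400 ? conn.getErrorStream() : conn.getInputStream();
            if (stream == null) {
                return "";
            }
            try (InputStream in = stream) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buffer.write(chunk, 0, n);
                }
                return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PostCommandSketch <url> <command>
        System.out.println(sendPost(args[1], args[0]));
    }
}
```

Returning the response body as a string lets the caller inspect the server's reply, for example to decide whether a failure should be logged.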
private static java.lang.String getTitle(net.htmlparser.jericho.Source source)
Parameters:
source - Source
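The title extraction step can be approximated without Jericho. The sketch below uses a regular expression over the raw HTML string as a stand-in for the parser-based approach the method actually uses; `TitleSketch` and `extractTitle` are hypothetical names, and unlike the real method it takes a `String` rather than a Jericho `Source`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based stand-in for parser-based title extraction.
final class TitleSketch {
    private static final Pattern TITLE = Pattern.compile(
        "<title\\b[^>]*>(.*?)</title\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the first <title> content with whitespace collapsed,
    // or null when the document has no title element.
    static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("<html><head><title>My Page</title></head></html>"));
    }
}
```

Returning `null` for a missing title leaves the fallback policy (empty string, file name, URL) to the caller.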