java.lang.Object
  edu.psu.ist.youseer.ARCSubmitter
public class ARCSubmitter
Title: ARCSubmitter
Description: Reads the configuration file, initializes the database, and creates the tables if they don't exist. The default database is SQLite; to use a server-side database instead, modify the configuration file and provide the connection string for that database. This class creates a thread pool of Worker runnables to handle the ARC records retrieved from each ARC file. The order of tasks is as follows:
1) Parse the input parameters.
2) Parse the configuration files and build a configuration object.
3) Establish the database connection.
4) Iterate through the input folder, depth first, and process all the new ARC files in it.
5) For each ARC record in an ARC file, create a SubmitterDocument object and submit it to the thread pool for processing.
After an entire ARC file has been processed, the log is inserted into the database and the thread waits for the thread pool to finish executing before reading the next ARC file. When all the files in the folder have been processed, the thread waits for a specified period (default is 5 minutes) before scanning the folder again for new ARC files. The submitter can therefore run in parallel with the crawler and pick up new dumps as they appear.
Copyright: Copyright Madian Khabsa @ Penn State(c) 2009
Company: Penn State
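The submit-then-wait flow in the description can be sketched as follows. This is a minimal illustration, not the class's actual code: `WorkerPoolSketch`, `processFile`, and the use of plain strings in place of SubmitterDocument objects are all hypothetical, and the one-minute timeout is an arbitrary choice.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class WorkerPoolSketch {
    /**
     * Submits one task per record to a fixed-size pool, then blocks until
     * the pool has drained. Returns the number of records processed.
     * The records are plain strings standing in for ARC records.
     */
    static long processFile(List<String> records, int threadsCount) {
        // Queue holding the jobs that are waiting for a free worker thread.
        BlockingQueue<Runnable> waitQueue = new LinkedBlockingQueue<>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                threadsCount, threadsCount, 0L, TimeUnit.MILLISECONDS, waitQueue);
        AtomicLong count = new AtomicLong();

        for (String record : records) {
            pool.execute(() -> {
                // A real worker would parse the record and send it to the index.
                if (record != null) {
                    count.incrementAndGet();
                }
            });
        }

        // Stop accepting new jobs and wait for the queued ones to finish
        // before the caller moves on to the next file.
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return count.get();
    }
}
```

Creating a fresh pool per file keeps the "wait until this file is done" step simple, at the cost of thread start-up on every file; a long-lived pool with a per-file latch would be an alternative.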
| Field Summary | |
|---|---|
java.lang.String |
CacheFolder
The virtual path that will have the ARC files in it |
SubmitterConfig |
Config
Configuration object |
long |
Count
Number of documents submitted so far |
java.util.Vector<SubmitterDocument> |
IndexedDocs
List of documents that have been indexed but not yet inserted to the database |
static java.lang.String |
LINE_SEP
Line separator |
java.lang.String |
OrgiginalPart
The root folder of the ARC files |
java.util.concurrent.ExecutorService |
threadExecutor
|
int |
threadsCount
The number of threads for processing the documents, default is 1 |
java.lang.String |
URL
URL of the index |
java.util.concurrent.BlockingQueue<java.lang.Runnable> |
WaitQueue
Queue containing the waiting jobs in the thread pool |
| Constructor Summary | |
|---|---|
ARCSubmitter()
|
|
| Method Summary | |
|---|---|
private void |
FlushIndexedDocs()
Insert all the processed URLs (ARC records) to the database. |
private boolean |
InsertToDB(java.io.File fi)
Inserts this file to the database when the submitter completes processing all its records |
private boolean |
IsIndexed(java.io.File fi)
Check whether the file has been already submitted to the index or not |
static void |
main(java.lang.String[] args)
|
void |
ProcessFolder(java.lang.String path)
Process a folder full of ARC files, or subfolders containing ARC files. |
byte[] |
ReadBinaryDocument(org.archive.io.arc.ARCRecord record,
int offset,
int recordLength)
Reads the content of the ARC record from the ARC file |
java.lang.String |
ReadTextDocument(org.archive.io.arc.ARCRecord record,
int offset)
Reads a text document from the ARC record |
java.lang.String |
sendPostCommand(java.lang.String command,
java.lang.String url)
Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM |
boolean |
setupDBConnection()
Sets up the database connection and creates the mandatory tables |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public java.lang.String OrgiginalPart
public java.lang.String CacheFolder
public java.lang.String URL
public long Count
public SubmitterConfig Config
public int threadsCount
public java.util.concurrent.ExecutorService threadExecutor
public java.util.concurrent.BlockingQueue<java.lang.Runnable> WaitQueue
public java.util.Vector<SubmitterDocument> IndexedDocs
public static final java.lang.String LINE_SEP
| Constructor Detail |
|---|
public ARCSubmitter()
| Method Detail |
|---|
public static void main(java.lang.String[] args)
args - String[] Command-line arguments to process
Parameter 1: The URL of the indexer
Parameter 2: The folder to process which contains the ARC files
Parameter 3: The virtual directory to which the ARC files will be mapped
Parameter 4: Number of threads to run
Parameter 5: [OPTIONAL] The time the submitter waits before iterating through the folder again
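A small sketch of how these five parameters might be validated. The `SubmitterArgs` class and its field names are hypothetical, and the unit of the optional wait is an assumption (minutes, to match the 5-minute default mentioned in the class description); the source does not state it.

```java
class SubmitterArgs {
    /** Assumption: the optional wait is given in minutes, matching the 5-minute default. */
    static final long DEFAULT_WAIT_MINUTES = 5;

    final String indexerUrl;   // Parameter 1
    final String arcFolder;    // Parameter 2
    final String virtualDir;   // Parameter 3
    final int threadsCount;    // Parameter 4
    final long waitMinutes;    // Parameter 5 (optional)

    private SubmitterArgs(String indexerUrl, String arcFolder, String virtualDir,
                          int threadsCount, long waitMinutes) {
        this.indexerUrl = indexerUrl;
        this.arcFolder = arcFolder;
        this.virtualDir = virtualDir;
        this.threadsCount = threadsCount;
        this.waitMinutes = waitMinutes;
    }

    /** Parses four required arguments plus one optional wait period. */
    static SubmitterArgs parse(String[] args) {
        if (args.length < 4 || args.length > 5) {
            throw new IllegalArgumentException(
                    "usage: <indexer-url> <arc-folder> <virtual-dir> <threads> [wait]");
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        int threads = Integer.parseInt(args[3]);
        if (threads < 1) {
            throw new IllegalArgumentException("threads must be at least 1");
        }
        long wait = args.length == 5 ? Long.parseLong(args[4]) : DEFAULT_WAIT_MINUTES;
        return new SubmitterArgs(args[0], args[1], args[2], threads, wait);
    }
}
```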
public boolean setupDBConnection()
throws java.lang.Exception
java.lang.Exception
private boolean InsertToDB(java.io.File fi)
fi - File the file to be inserted into the database. Only the full path of the file is inserted
private boolean IsIndexed(java.io.File fi)
fi - File The file to be checked
public void ProcessFolder(java.lang.String path)
path - String the absolute path of the containing folder
private void FlushIndexedDocs()
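The depth-first scan that ProcessFolder performs could look roughly like the sketch below. It is an illustration under stated assumptions: `FolderScanSketch` and `collect` are hypothetical names, the in-memory set stands in for the database lookup done by IsIndexed, and the `.arc` / `.arc.gz` file-name filter is a guess, since the source does not say how ARC files are recognized.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

class FolderScanSketch {
    /**
     * Walks the folder depth first and appends every ARC file whose
     * absolute path is not yet in alreadyIndexed to out.
     */
    static void collect(File folder, Set<String> alreadyIndexed, List<File> out) {
        File[] entries = folder.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        Arrays.sort(entries); // stable order across runs
        for (File entry : entries) {
            if (entry.isDirectory()) {
                collect(entry, alreadyIndexed, out); // descend before continuing
            } else if (looksLikeArc(entry)
                    && !alreadyIndexed.contains(entry.getAbsolutePath())) {
                out.add(entry);
            }
        }
    }

    /** Assumption: ARC files end in .arc or .arc.gz. */
    static boolean looksLikeArc(File f) {
        String name = f.getName().toLowerCase();
        return name.endsWith(".arc") || name.endsWith(".arc.gz");
    }
}
```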
public byte[] ReadBinaryDocument(org.archive.io.arc.ARCRecord record,
int offset,
int recordLength)
record - ARCRecord the record whose content is to be read
offset - int the offset at which the content of the document begins
recordLength - int
public java.lang.String ReadTextDocument(org.archive.io.arc.ARCRecord record,
int offset)
record - ARCRecord the record whose content is to be read
offset - int the offset at which the content of the document begins
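As a rough sketch of reading text that starts at a given offset, the example below works on a plain `java.io.InputStream` rather than on ARCRecord, so that it has no dependency on the archive library. `RecordReaderSketch` and `readText` are hypothetical names, and decoding as UTF-8 is an assumption; the source does not say which charset ReadTextDocument uses.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RecordReaderSketch {
    /** Skips offset bytes, then reads the rest of the stream as UTF-8 text. */
    static String readText(InputStream record, int offset) {
        try {
            skipFully(record, offset);
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = record.read(chunk)) != -1) {
                body.write(chunk, 0, n);
            }
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** InputStream.skip may skip fewer bytes than asked, so loop until done or EOF. */
    private static void skipFully(InputStream in, long toSkip) throws IOException {
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);
            if (skipped > 0) {
                toSkip -= skipped;
            } else if (in.read() == -1) {
                return; // stream ended before the offset was reached
            } else {
                toSkip--; // consumed one byte with read()
            }
        }
    }
}
```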
public java.lang.String sendPostCommand(java.lang.String command,
java.lang.String url)
throws java.lang.Exception
command - String the command to be sent
url - String the URL of the server
java.lang.Exception
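A POST helper of this shape can be sketched with `java.net.HttpURLConnection`. This is not the original sendPostCommand code: `PostSketch` and `sendPost` are hypothetical names, and the `text/xml` content type is an assumption, since the source does not state what kind of command body is sent.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PostSketch {
    /** Sends command as the POST body to url and returns the response body. */
    static String sendPost(String command, String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // Assumption: the command is XML; adjust the content type as needed.
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(command.getBytes(StandardCharsets.UTF_8));
            }

            int status = conn.getResponseCode();
            if (status < 200 || status >= 300) {
                throw new IOException("Server returned HTTP " + status);
            }

            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream body = new ByteArrayOutputStream();
                byte[] chunk = new byte[4096];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    body.write(chunk, 0, n);
                }
                return new String(body.toByteArray(), StandardCharsets.UTF_8);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Checking the status code before reading the body turns a failed request into an exception that carries the HTTP status, which is usually easier to diagnose than an error raised while opening the response stream.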