edu.psu.ist.youseer
Class ARCSubmitter

java.lang.Object
  extended by edu.psu.ist.youseer.ARCSubmitter

public class ARCSubmitter
extends java.lang.Object

Title: ARCSubmitter

Description: Reads the configuration file, initializes the database, and creates the tables if they do not exist. The default database is SQLite; to use a server-side database instead, modify the configuration file and provide the connection string for that database. This class creates a thread pool of Worker runnables to handle the ARC records retrieved from each ARC file. The order of tasks is as follows:
1) Parse the input parameters.
2) Parse the configuration files and build a configuration object.
3) Establish the database connection.
4) Iterate through the input folder, depth first, and process all the new ARC files in it.
5) For each ARC record in an ARC file, create a SubmitterDocument object and submit it to the thread pool for processing.
After an entire ARC file has been processed, the log is inserted into the database and the thread waits for the thread pool to finish executing before reading the next ARC file. Once all the files in the folder have been processed, the thread waits for a specified period (5 minutes by default) before scanning the folder again for new ARC files. The submitter can therefore run in parallel with the crawler and pick up new dumps as they arrive.
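
The submit-then-wait cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual ARCSubmitter code: the class PoolCycleSketch, the method processBatch, and the use of plain strings in place of SubmitterDocument objects are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the per-file cycle: hand one task per record to a
// thread pool, then block until the pool has drained.
class PoolCycleSketch {

    // Processes every record of one "file" on a fixed-size pool and
    // returns how many records were handled.
    static long processBatch(List<String> records, int threadsCount) {
        ExecutorService pool = Executors.newFixedThreadPool(threadsCount);
        AtomicLong count = new AtomicLong();
        for (String record : records) {
            // Stand-in for submitting a SubmitterDocument to a Worker.
            pool.execute(() -> count.incrementAndGet());
        }
        pool.shutdown(); // accept no further tasks
        try {
            // Wait for the queued tasks before moving on to the next file.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                pool.shutdownNow();
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return count.get();
    }

    public static void main(String[] args) {
        long handled = processBatch(Arrays.asList("a", "b", "c"), 2);
        System.out.println("records handled: " + handled);
    }
}
```

The sketch creates a new pool per batch for simplicity; a long-running submitter may instead keep one pool alive and track outstanding tasks through its wait queue.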

Copyright: Copyright Madian Khabsa @ Penn State(c) 2009

Company: Penn State


Field Summary
 java.lang.String CacheFolder
          The virtual path that will contain the ARC files
 SubmitterConfig Config
          Configuration object
 long Count
          Number of documents submitted so far
 java.util.Vector<SubmitterDocument> IndexedDocs
          List of documents that have been indexed but not yet inserted to the database
static java.lang.String LINE_SEP
          Line separator
 java.lang.String OrgiginalPart
          The root folder of the ARC files
 java.util.concurrent.ExecutorService threadExecutor
           
 int threadsCount
          The number of threads for processing the documents, default is 1
 java.lang.String URL
          URL of the index
 java.util.concurrent.BlockingQueue<java.lang.Runnable> WaitQueue
          Queue containing the waiting jobs in the thread pool
 
Constructor Summary
ARCSubmitter()
           
 
Method Summary
private  void FlushIndexedDocs()
          Insert all the processed URLs (ARC records) to the database.
private  boolean InsertToDB(java.io.File fi)
          Inserts this file to the database when the submitter completes processing all its records
private  boolean IsIndexed(java.io.File fi)
          Checks whether the file has already been submitted to the index
static void main(java.lang.String[] args)
           
 void ProcessFolder(java.lang.String path)
          Process a folder full of ARC files, or subfolders containing ARC files.
 byte[] ReadBinaryDocument(org.archive.io.arc.ARCRecord record, int offset, int recordLength)
          Reads the content of the ARC record from the ARC file
 java.lang.String ReadTextDocument(org.archive.io.arc.ARCRecord record, int offset)
          Reads a text document from the ARC record
 java.lang.String sendPostCommand(java.lang.String command, java.lang.String url)
          Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM
 boolean setupDBConnection()
          Sets up the database connection and creates the mandatory tables
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

OrgiginalPart

public java.lang.String OrgiginalPart
The root folder of the ARC files


CacheFolder

public java.lang.String CacheFolder
The virtual path that will contain the ARC files


URL

public java.lang.String URL
URL of the index


Count

public long Count
Number of documents submitted so far


Config

public SubmitterConfig Config
Configuration object


threadsCount

public int threadsCount
The number of threads for processing the documents, default is 1


threadExecutor

public java.util.concurrent.ExecutorService threadExecutor

WaitQueue

public java.util.concurrent.BlockingQueue<java.lang.Runnable> WaitQueue
Queue containing the waiting jobs in the thread pool


IndexedDocs

public java.util.Vector<SubmitterDocument> IndexedDocs
List of documents that have been indexed but not yet inserted to the database


LINE_SEP

public static final java.lang.String LINE_SEP
Line separator

Constructor Detail

ARCSubmitter

public ARCSubmitter()
Method Detail

main

public static void main(java.lang.String[] args)
Parameters:
args - String[] Command-line arguments to process:
Parameter 1: The URL of the indexer
Parameter 2: The folder that contains the ARC files to process
Parameter 3: The virtual directory under which the ARC files will be mapped
Parameter 4: The number of threads to run
Parameter 5: [OPTIONAL] The time the submitter waits before iterating through the folder again
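
As an illustration of the argument order, the parameters could be read along these lines. The class SubmitterArgsSketch and its field names are hypothetical, and the unit of the optional fifth parameter is assumed here to be minutes, matching the 5-minute default mentioned in the class description; the real implementation may differ.

```java
// Hypothetical holder for the five command-line parameters.
class SubmitterArgsSketch {
    final String indexUrl;    // Parameter 1: URL of the indexer
    final String arcFolder;   // Parameter 2: folder containing the ARC files
    final String virtualDir;  // Parameter 3: virtual directory for the ARC files
    final int threads;        // Parameter 4: number of threads to run
    final long waitMinutes;   // Parameter 5 (optional): unit assumed to be minutes

    SubmitterArgsSketch(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException(
                "usage: <indexUrl> <arcFolder> <virtualDir> <threads> [wait]");
        }
        indexUrl = args[0];
        arcFolder = args[1];
        virtualDir = args[2];
        threads = Integer.parseInt(args[3]);
        // Fall back to the documented default of 5 minutes.
        waitMinutes = args.length > 4 ? Long.parseLong(args[4]) : 5L;
    }
}
```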

setupDBConnection

public boolean setupDBConnection()
                          throws java.lang.Exception
Sets up the database connection and creates the mandatory tables

Returns:
boolean true if connected, false otherwise
Throws:
java.lang.Exception

InsertToDB

private boolean InsertToDB(java.io.File fi)
Inserts this file to the database when the submitter completes processing all its records

Parameters:
fi - File the file to be inserted into the database. Only the full path of the file is inserted
Returns:
boolean true if the insert was committed, otherwise false

IsIndexed

private boolean IsIndexed(java.io.File fi)
Checks whether the file has already been submitted to the index

Parameters:
fi - File The file to be checked
Returns:
boolean True if found in the database, false otherwise

ProcessFolder

public void ProcessFolder(java.lang.String path)
Processes a folder of ARC files, or of subfolders containing ARC files. The folder is explored depth first. For each file, the processor checks whether the file has been indexed before and, if so, ignores it. If the file is new, it iterates through all the ARC records in the file. For each ARC record (which contains the downloaded document and its metadata), it reads the document and the metadata and creates a SubmitterDocument object. This object is submitted to the thread pool, which is responsible for parsing, extracting, and handling the data before submitting it to the index. After processing a file and before moving to the next one, it waits for the processing thread pool to finish executing and commits all the URLs to the database.

Parameters:
path - String the absolute path of the containing folder
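
The depth-first exploration with the skip-if-indexed check can be sketched as below. This is only an illustration: the class FolderWalkSketch is hypothetical, and an in-memory set of absolute paths stands in for the database lookup performed by IsIndexed.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collect, depth first, the files not yet indexed.
class FolderWalkSketch {

    // Returns the files under root whose absolute path is not in alreadyIndexed.
    static List<File> findNewFiles(File root, Set<String> alreadyIndexed) {
        List<File> found = new ArrayList<>();
        walk(root, alreadyIndexed, found);
        return found;
    }

    private static void walk(File dir, Set<String> alreadyIndexed, List<File> found) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or not readable
        }
        for (File entry : entries) {
            if (entry.isDirectory()) {
                // Descend fully into the subfolder before continuing.
                walk(entry, alreadyIndexed, found);
            } else if (!alreadyIndexed.contains(entry.getAbsolutePath())) {
                found.add(entry); // new file: would be opened and its records read
            }
        }
    }
}
```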

FlushIndexedDocs

private void FlushIndexedDocs()
Inserts all the processed URLs (ARC records) into the database. The database will contain the URL, file type, indexing time, containing folder, and the record offset within the ARC file


ReadBinaryDocument

public byte[] ReadBinaryDocument(org.archive.io.arc.ARCRecord record,
                                 int offset,
                                 int recordLength)
Reads the content of the ARC record from the ARC file

Parameters:
record - ARCRecord the record whose content is to be read
offset - int the offset at which the content of the document begins
recordLength - int the length of the record
Returns:
byte[] the byte array containing the document data
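
A rough sketch of reading content that starts at a given offset is shown below. It works on a plain java.io.InputStream rather than an ARCRecord, and it assumes that recordLength counts the whole record including the bytes before offset, so that the content is recordLength - offset bytes long. Both the class RecordReadSketch and that interpretation of recordLength are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: skip the header bytes, then read the document body.
class RecordReadSketch {

    static byte[] readContent(InputStream in, int offset, int recordLength)
            throws IOException {
        // skip() may skip fewer bytes than requested, so loop until done.
        long toSkip = offset;
        while (toSkip > 0) {
            long skipped = in.skip(toSkip);
            if (skipped <= 0) {
                if (in.read() < 0) {
                    break; // end of stream reached before the offset
                }
                skipped = 1;
            }
            toSkip -= skipped;
        }
        // Assumption: the content length is recordLength minus the offset.
        int remaining = Math.max(0, recordLength - offset);
        ByteArrayOutputStream out = new ByteArrayOutputStream(remaining);
        byte[] buf = new byte[4096];
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) {
                break; // stream ended early; return what was read
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```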

ReadTextDocument

public java.lang.String ReadTextDocument(org.archive.io.arc.ARCRecord record,
                                         int offset)
Reads a text document from the ARC record

Parameters:
record - ARCRecord the record whose content is to be read
offset - int the offset at which the content of the document begins
Returns:
String the content of the text document

sendPostCommand

public java.lang.String sendPostCommand(java.lang.String command,
                                        java.lang.String url)
                                 throws java.lang.Exception
Sends a POST request to the server. Courtesy of Grant Ingersoll @ IBM

Parameters:
command - String the command to be sent
url - String the URL of the server
Returns:
String The result of the submit
Throws:
java.lang.Exception
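
For readers unfamiliar with the pattern, a POST of this kind can be sketched with the standard java.net.HttpURLConnection class. This is not the original implementation: the class PostSketch and the text/xml content type are assumptions, and error handling is reduced to propagating IOException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: send a command string as the body of a POST request
// and return the server's response as text.
class PostSketch {

    static String post(String command, String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true); // a request body will be written
        // Assumed content type; the real server may expect something else.
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(command.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder response = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                response.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return response.toString();
    }
}
```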