We present here a user manual for running and maintaining a web-based system for performing document triage over a corpus of PDF files. We describe the processes for installing, executing, and maintaining the system.
We designed the system to let developers run machine learning natural language processing experiments on their data. We expect that specialized components will need to be designed for specific document classification tasks in order to optimize performance, and the system is built to support exactly this development process.
At this point, we make extensive use of the ClearTk 'evaluation' (cleartk-eval) module (see https://code.google.com/p/cleartk/wiki/Modules). ClearTk-Eval provides pipeline functions within UIMA that automatically run N-fold cross-validation tests, calculate features such as Tf-Idf and Mutual Information, and compute standard performance metrics (Precision, Recall, F-Score, etc.). The reason we currently work with ClearTk is that it is written in Java and provides a common development platform for other popular machine learning libraries such as SVMlight, LIBSVM, LIBLINEAR, OpenNLP MaxEnt, Weka and Mallet. Thus, using (and perhaps helping to extend) the ClearTk API could provide a wide range of tools for SciKnowMine.
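As a minimal sketch of the ClearTk programming model (assuming the pre-2.0 org.cleartk.classifier package layout documented on the wiki above; the class and helper names are illustrative and not part of the skmTriage codebase), a document classification annotator looks roughly like this:

import java.util.ArrayList;
import java.util.List;

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.cleartk.classifier.CleartkAnnotator;
import org.cleartk.classifier.Feature;
import org.cleartk.classifier.Instance;

// Illustrative sketch only: a bag-of-words document classifier in the
// ClearTk 1.x style; not taken from the skmTriage codebase.
public class UnigramTriageAnnotator extends CleartkAnnotator<Boolean> {

  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // Build one bag-of-words instance per document.
    List<Feature> features = new ArrayList<Feature>();
    for (String token : jCas.getDocumentText().split("\\s+")) {
      features.add(new Feature("unigram", token.toLowerCase()));
    }
    if (this.isTraining()) {
      // Training: write a labeled instance via the configured data writer.
      Boolean label = readGoldLabel(jCas); // hypothetical helper
      this.dataWriter.write(new Instance<Boolean>(label, features));
    } else {
      // Classification: ask the trained classifier for a label.
      Boolean isRelevant = this.classifier.classify(features);
      // ... record isRelevant on the CAS as the application requires.
    }
  }

  // Hypothetical helper: real code would read the gold label from the CAS.
  private Boolean readGoldLabel(JCas jCas) {
    return Boolean.FALSE;
  }
}

At training time, ClearTk serializes such instances through a data writer for the chosen library (SVMlight, LIBSVM, etc.); at classification time, the same annotator consults the trained model.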
Note that this work currently requires working with source code, since developers must create their own ClearTk features and pipelines. This is a work in progress, and we describe it here in its early, developmental form. These commands are not packaged in the skmTriage-1.1.5-SNAPSHOT installer, but they are available from within the codebase and may be run from the command line using the following structure:
java -classpath skmTriage-1.1.5-SNAPSHOT-jar-with-dependencies.jar <path.to.command> <arguments>
java -classpath skmTriage-1.1.5-SNAPSHOT-jar-with-dependencies.jar
edu.isi.bmkeg.skm.triage.cleartk.bin.PreprocessTriageScores
-triageCorpus <triage-corpus-name> -targetCorpus <target-corpus-name>
-dir <target-directory> -prop <proportion-of-docs-held-out>
-l <login> -p <password> -db <database> -wd <workingDirectory>
-targetCorpus NAME : The target corpus that the triage corpus is linked to
-triageCorpus NAME : The triage corpus to be evaluated
-dir NAME : The directory to which the ML data will be extracted
-prop FLOAT : The proportion of documents to be held out (e.g., 0.1)
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-wd WORKINGDIR : Working directory
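For example, using the corpus names from the directory listing below, an invocation might look like the following (the directory, database name, and credentials are placeholders to be replaced with local values):

java -classpath skmTriage-1.1.5-SNAPSHOT-jar-with-dependencies.jar \
    edu.isi.bmkeg.skm.triage.cleartk.bin.PreprocessTriageScores \
    -triageCorpus Hiroaki_Onda -targetCorpus Allele_Phenotype \
    -dir /tmp/triageData -prop 0.1 \
    -l dbUser -p dbPass -db skmDB -wd /tmp/skmWork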
This runs through the available text from each paper in the system and extracts it to the named directory in the following structure:
+ directory
  |-+ <target_corpus_name> (e.g., 'Allele_Phenotype')
    |-+ <triage_corpus_name> (e.g., 'Hiroaki_Onda')
      |-+ test
      |  |-+ in.txt
      |  |-+ out.txt
      |-+ train
         |-+ in.txt
         |-+ out.txt
Each in.txt and out.txt file contains data formatted in the following way:
88177 < > During neural development , programmed cell death ... <all-text-from-the-paper> ... the insets in i and l.
88302 < > Mitochondrial dysfunction has long been implicated ... <all-text-from-the-paper> ... to generate sufficient ATP .
Each line starts with the database id of the article citation, followed by the text of the article, extracted from the PDF and converted to XML by the LAPDF-Text library (and then rendered to plain text through the JATS XSLT system). This provides a standard format for the data that may be processed by any NLP text mining system.
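As a simple illustration, the following minimal Java sketch reads such a file, assuming only that each line begins with the numeric id and that everything after the first whitespace is the article text:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Illustrative reader for the in.txt/out.txt line format shown above.
public class TriageFileReader {
  public static void main(String[] args) throws IOException {
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = in.readLine()) != null) {
      // Split the leading citation id from the rest of the line.
      String[] parts = line.split("\\s+", 2);
      String citationId = parts[0];
      String text = (parts.length > 1) ? parts[1] : "";
      System.out.println(citationId + " -> " + text.length() + " chars");
    }
    in.close();
  }
}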
We provide the RunEvaluationAcrossFeatures class as a method for running evaluations across multiple feature annotators. This is essentially a scripting program that runs the PreprocessTriageScores command a number of times, followed by a set of feature extraction pipelines. This is a current focus of work within the system and is likely to change.
java -classpath skmTriage-1.1.5-SNAPSHOT-jar-with-dependencies.jar
edu.isi.bmkeg.skm.triage.cleartk.bin.RunEvaluationAcrossFeatures
-triageCorpus <triage-corpus-name> -targetCorpus <target-corpus-name>
-dir <target-directory> -prop <proportion-of-docs-held-out>
-nRepeats <number-of-times-whole-process-will-repeat>
-l <login> -p <password> -db <database> -wd <workingDirectory>
-targetCorpus NAME : The target corpus that the triage corpus is linked to
-triageCorpus NAME : The triage corpus to be evaluated
-dir NAME : The directory to which the ML data will be extracted
-prop FLOAT : The proportion of documents to be held out (e.g., 0.1)
-nRepeats INT : The number of times the whole pipeline will be run
-db DBNAME : Database name
-l LOGIN : Database login
-p PASSWD : Database password
-wd WORKINGDIR : Working directory
Currently, the system runs through a number of feature sets to check the performance of document classification pipelines. These include (1) unigrams, (2) bigrams, (3) combined unigram and bigram data, and (4) Tf-Idf counts from unigrams only.
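To make the feature sets concrete, the fragment below hand-rolls the combined unigram-plus-bigram construction over whitespace tokens; this is an illustrative sketch, not the extraction code used by the pipelines:

import java.util.ArrayList;
import java.util.List;
import org.cleartk.classifier.Feature;

// Illustrative sketch: building a combined unigram + bigram feature set
// by hand (not the skmTriage implementation).
public class NGramFeatures {
  public static List<Feature> extract(String documentText) {
    List<Feature> features = new ArrayList<Feature>();
    String[] tokens = documentText.toLowerCase().split("\\s+");
    for (int i = 0; i < tokens.length; i++) {
      features.add(new Feature("unigram", tokens[i])); // feature set (1)
      if (i + 1 < tokens.length) {
        features.add(new Feature("bigram", tokens[i] + "_" + tokens[i + 1])); // feature set (2)
      }
    }
    return features; // together, these give the combined feature set (3)
  }
}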
The system generates a tab-delimited file called results.txt in the output directory, in which the evaluation metrics are reported.
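For reference, the reported metrics follow their standard definitions; a small helper computing them from raw counts might look like this (no assumption is made here about the column layout of results.txt):

// Standard definitions of the reported metrics, computed from raw
// true-positive (tp), false-positive (fp) and false-negative (fn) counts.
public class Metrics {
  public static double precision(int tp, int fp) {
    return (tp + fp == 0) ? 0.0 : (double) tp / (tp + fp);
  }
  public static double recall(int tp, int fn) {
    return (tp + fn == 0) ? 0.0 : (double) tp / (tp + fn);
  }
  public static double fScore(int tp, int fp, int fn) {
    double p = precision(tp, fp), r = recall(tp, fn);
    return (p + r == 0.0) ? 0.0 : 2.0 * p * r / (p + r);
  }
}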