The SciKnowMine Triage Application

We present here a user manual for running and maintaining a web-based system for peforming document triage given a corpus of PDF files. We will describe processes for installation, execution and maintenance of the system.

  1. Installation Manual
  2. System Organization
  3. Command Line - Set up
  4. Command Line - Working with Data
  5. Command Line - Reporting Functions
  6. Command Line - Deleting Data
  7. Command Line - Machine Learning
  8. Command Line Tools - Running Experiments
  9. Web Application - Running the System
  10. Web Application - Extracting text using LAPDF-Text
  11. Web Application - Performing the triage task
  12. Web Application - The Base Digital Library

2. System Organization

The triage task is primarily concerned with sorting documents from an input document set assigned to a curator (called a 'triage corpus') into a set of categories (where each category actually designates a set of documents that are each called a 'target corpus'). The data that assigns each scientific article from triage corpus to it's target corpus is it's 'in-out-code' which can be one of three values: 'in', 'out' or 'unclassified'.

This simple construct forms the basis of the system and provides a relatively straightforward way to attach additional cues and information about each article's possible inclusion in a target corpus based on NLP analysis of the document's contents.

The format of PDF file names.

Each PDF file being processed should start with it's pubmed id, followed by an underscore and a single letter denoting if it is to be included in a target corpus. Thus some examples of possible filenames are as follows:

This indicates that the article with the PubMed id 19763139 is a member of the target corpus denoted by the code 'A'. These codes are set when you create the target corpus.