The LA-PDFText Command Line Application

Instructions for installing the LA-PDFText system.

  1. Installation Manual
  2. Running Commands
  3. Recommended Use of the System

2. Running the Commands

Usage of the tool as a software application is currently based on a set of command-line functions.

$ blockStatistics input-dir-or-file [output-dir]

Running this command on a PDF file or directory will generate statistics about each chunk in each pdf document .

$ blockify input-dir-or-file [output-dir]

Running this command on a PDF file or directory will attempt to generate one XML document per file with unnannotated text chunks .

$ blockifyClassify input-dir-or-file [output-dir] [rule-file]

Running this command on a PDF file or directory will attempt to generate one XML document per file with text chunks annotated with section.

$ debugChunkFeatures input-dir-or-file [output-dir] [rule-file]

Running this command on a PDF file or directory will generate a CSV file for the PDF that can act as a template for rule files. All available features are listed .

$ extractFullText input-dir-or-file [output-dir] [rule-file] [sec1 ... secN]

Running this command on a PDF file or directory will attempt to extract uninterrupted two-column- formatted text of the main narrative section of the paper with one font change (i.e. for papers that use a smaller font for methods sections). Figure legends are moved to the end of the paper (but included), and tables are dropped.

$ imagifyBlocks input-dir-or-file [output-dir]

Running this command on a PDF file or directory will attempt to generate one image per page with text chunks drawn out with labels describing the predominant Font + Style of each block. This is helpful in developing rule files.

$ imagifySections input-dir-or-file [output-dir] [rule-file]

Running this command on a PDF file or directory will attempt to generate one image per page with text chunks drawn out with section labels. This is intended to provide a debugging tool.

$ watchPdfFolder COMMAND dir-to-be-watched output-dir [rule-file]

This program maintains a watcher on this directory to execute the denoted command on any PDF files added to the directory. The system will then delete the appropriate files and folders when the originating PDF file is removed.

Note that this function has not been fully tested and may fail.