Instructions for installing the LA-PDFText system.
Usage of the tool as a software application is currently based on a set of command-line functions.

$ blockStatistics input-dir-or-file [output-dir]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory

Running this command on a PDF file or directory will generate statistics about each chunk in each PDF document.

$ blockify input-dir-or-file [output-dir]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory

Running this command on a PDF file or directory will attempt to generate one XML document per file with unannotated text chunks.

$ blockifyClassify input-dir-or-file [output-dir] [rule-file]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory
rule-file (optional or '-') - the full path to the rule file

Running this command on a PDF file or directory will attempt to generate one XML document per file with text chunks annotated with section labels.
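
The following sketch shows one way to call blockifyClassify with a rule file while leaving the output directory unspecified. It assumes that '-' acts as a placeholder for an omitted optional argument, which is how the "(optional or '-')" notation above reads; the paths are hypothetical, and run_or_show is an illustrative helper, not part of LA-PDFText.

```shell
# Illustrative helper (not part of LA-PDFText): run a command if its
# executable is on the PATH, otherwise just print what would be run.
run_or_show() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# Hypothetical paths; replace them with your own.
pdf_dir=/data/papers
rule_file=/data/rules/my-rule-file

# Assumption: '-' fills the output-dir slot so that the rule file
# lands in the third position.
run_or_show blockifyClassify "$pdf_dir" - "$rule_file"
```

When the tool is not installed, the helper only prints the command line, which makes it easy to check the argument order before running anything.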

$ debugChunkFeatures input-dir-or-file [output-dir] [rule-file]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory
rule-file (optional or '-') - the full path to the rule file

Running this command on a PDF file or directory will generate a CSV file for the PDF that can act as a template for rule files. All available features are listed.

$ extractFullText input-dir-or-file [output-dir] [rule-file] [sec1 ... secN]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory
rule-file (optional or '-') - the full path to the rule file
sec1 ... secN (optional) - a list of section names to be included in the dump

Running this command on a PDF file or directory will attempt to extract the uninterrupted text of the main narrative section of a two-column-formatted paper, allowing for one font change (i.e. for papers that use a smaller font for methods sections). Figure legends are included but moved to the end of the paper, and tables are dropped.
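
Because the section names come after two optional positional arguments, the argument order is easy to get wrong. The sketch below assembles an extractFullText command line from the synopsis above and prints it without running it. It assumes that '-' can stand in for an omitted output-dir or rule-file; the helper name full_text_cmd, the path, and the section names "methods" and "results" are placeholders, not names confirmed by LA-PDFText.

```shell
# Illustrative helper: print an extractFullText command line.
# Usage: full_text_cmd INPUT [OUTPUT_DIR] [RULE_FILE] [SEC...]
full_text_cmd() {
  input=$1
  output=${2:--}   # default to '-' when no output directory is given
  rules=${3:--}    # default to '-' when no rule file is given
  # Drop the first three arguments so that only section names remain.
  if [ "$#" -gt 3 ]; then shift 3; else shift "$#"; fi
  printf 'extractFullText %s %s %s' "$input" "$output" "$rules"
  for sec in "$@"; do
    printf ' %s' "$sec"
  done
  printf '\n'
}

# Hypothetical path and section names.
full_text_cmd /data/papers/example.pdf - - methods results
# prints: extractFullText /data/papers/example.pdf - - methods results
```

The printed line is for inspection only; paths containing spaces would need quoting before the line is pasted into a shell.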

$ imagifyBlocks input-dir-or-file [output-dir]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory

Running this command on a PDF file or directory will attempt to generate one image per page with text chunks drawn out with labels describing the predominant Font + Style of each block. This is helpful in developing rule files.

$ imagifySections input-dir-or-file [output-dir] [rule-file]

input-dir-or-file - the full path to the PDF file or directory to be extracted
output-dir (optional or '-') - the full path to the output directory
rule-file (optional or '-') - the full path to the rule file

Running this command on a PDF file or directory will attempt to generate one image per page with text chunks drawn out with section labels. This is intended to provide a debugging tool.
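
Several of the commands above are described as aids for writing and debugging rule files. One plausible order for using them on a single PDF is sketched below: inspect the blocks, dump the chunk features as a rule template, then check the resulting section labels. The ordering, the helper name rule_dev_plan, and all paths are assumptions made for illustration; the function only prints the commands.

```shell
# Illustrative helper: print a suggested sequence of rule-development
# commands for one PDF. Arguments: PDF path, output dir, rule file.
rule_dev_plan() {
  pdf=$1
  out=$2
  rules=$3
  # 1. Draw blocks with font and style labels.
  echo "imagifyBlocks $pdf $out"
  # 2. Dump chunk features to a CSV that can serve as a rule template.
  echo "debugChunkFeatures $pdf $out"
  # 3. After editing the rule file, draw blocks with section labels.
  echo "imagifySections $pdf $out $rules"
}

# Hypothetical paths; replace them with your own.
rule_dev_plan /data/papers/example.pdf /data/out /data/rules/my-rule-file
```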

$ watchPdfFolder COMMAND dir-to-be-watched output-dir [rule-file]

COMMAND - the command to be executed:
dir-to-be-watched - the full path to the directory to be watched
output-dir (optional or '-') - the full path to the output directory
rule-file (optional or '-') - the full path to the rule file

This program maintains a watcher on the given directory and executes the denoted command on any PDF files added to it. The system then deletes the corresponding files and folders when the originating PDF file is removed.
Note that this function has not been fully tested and may fail.