LAPDF-Text searches subdirectories recursively for PDF files and extracts text from each one according to the formatting layout of the document. You should therefore organize your PDFs ahead of time so that papers with similar layouts sit together: for example, all two-column papers, or all papers from the same journal. This way, you can develop your own rule files for each group to help extract text accurately.
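Before running anything, it can help to see how your PDFs are currently distributed across folders. The following is a minimal Python sketch, not part of LAPDF-Text itself; the function name `group_pdfs_by_folder` is made up for illustration. It walks a directory tree and lists the PDFs found in each folder, so you can check that each folder holds papers with a similar layout.

```python
from collections import defaultdict
from pathlib import Path


def group_pdfs_by_folder(root):
    """Map each folder under `root` to the sorted PDF file names it contains."""
    groups = defaultdict(list)
    # rglob("*") descends into every subdirectory of root.
    for path in Path(root).rglob("*"):
        # Match the extension case-insensitively (.pdf, .PDF, ...).
        if path.is_file() and path.suffix.lower() == ".pdf":
            groups[path.parent].append(path.name)
    return {folder: sorted(names) for folder, names in groups.items()}
```

Calling `group_pdfs_by_folder("papers")` (where `papers` is a placeholder for your own top-level folder) returns a dictionary keyed by folder, which you can print to confirm the grouping before developing rule files.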
imagify functions to check that the system can read the PDFs correctly and eyeball the outputs.
As an example, download this paper: Makki et al. 2010, PLoS Biology 8:e1000441
Then run imagifySections on it (i.e., using the PDF file as the only parameter). This should generate 12 image files that show each page, with each block drawn as a rectangle and a baseline classification added. Running imagifySections is always a good sanity check to make sure that the system is working.
blockifyClassify to get XML
If you're happy with the results of the imagify runs, as shown above, you should then try the two blockify commands. These generate XML output that you can parse and read as necessary.
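A quick way to get a feel for the generated XML is to tally which element tags it contains. The sketch below uses only Python's standard library and makes no assumption about the schema that the blockify commands produce; the function name `count_tags` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(source):
    """Count how often each element tag occurs in an XML file.

    `source` may be a file name or an open file object.
    """
    tree = ET.parse(source)
    # iter() visits the root element and all of its descendants.
    return Counter(element.tag for element in tree.getroot().iter())
```

Passing the path of one of the generated XML files to `count_tags` gives a rough picture of the document structure, which is a reasonable starting point before writing a more specific parser.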
extractFullText to get plain text
Note that this command attempts to order the blocks so that text that does not form part of the main narrative is placed at the end of the file. Accurate block classification is therefore essential for the text to be ordered correctly.
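To see why classification accuracy affects ordering, consider the toy Python sketch below. It is not LAPDF-Text's actual ordering logic: the labels (`body`, `header`, `figure_legend`) and the function `order_blocks` are invented for illustration. It simply moves every block whose label is not in the narrative set to the end, while preserving reading order within each group.

```python
def order_blocks(blocks, narrative_labels):
    """Return (label, text) blocks with narrative blocks first.

    Illustrative only: labels and ordering rule are assumptions,
    not the behaviour of the real extractFullText command.
    """
    # sorted() is stable, so blocks with the same key keep their
    # original relative order. False sorts before True.
    return sorted(blocks, key=lambda block: block[0] not in narrative_labels)


blocks = [
    ("header", "Journal name"),
    ("body", "First paragraph."),
    ("figure_legend", "Figure 1."),
    ("body", "Second paragraph."),
]
ordered = order_blocks(blocks, {"body"})
```

In this sketch, a body paragraph that was wrongly labeled `figure_legend` would be pushed to the end of the output, out of its place in the narrative.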
If you run the debugChunkFeatures command on the PDF file, the system will generate a CSV file that you can use as a template for developing rules. If you open the CSV file in Excel, each row is a separate block and each column is one of the features used to enact classification rules on each block. It is possible to code each rule column to improve the quality of the rules being used to classify each block. We will describe this process in more detail in another Wiki page on this site presently, but the file generated should provide a working model from which you can try to build your own rule files.
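If you prefer to inspect the CSV file programmatically rather than in Excel, a small Python sketch like the following lists the column names and counts the rows (one per block). It assumes only that the file has a header row; the function name `summarize_features` is hypothetical and no particular column names are assumed.

```python
import csv


def summarize_features(csv_file):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(csv_file)
    # Reading all rows also populates reader.fieldnames from the header.
    rows = list(reader)
    return list(reader.fieldnames or []), len(rows)
```

To use it on a real file, open the file with `open(path, newline="")` as recommended by the `csv` module and pass the file object to `summarize_features`.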
Once you have started to develop your own rule files, iterate through steps 1-4 to improve performance for each set of documents with different formatting. With a small investment of time in developing rule files for your own documents, this should allow you to extract text from PDF files accurately.
The SciKnowMine interface provides additional functionality for using LA-PDFText, including a web interface for hacking rule files. Check it out here: sciKnowMine