Lapdftext

LA-PDFText is a system for accurately extracting text from PDF-based research articles, with an interface for improving performance where needed. The system is open source and provides a simple baseline function for extracting text from primary research articles using rules that developers can customize. The baseline works quite well for most applications, although it may occasionally extract the wrong text; in those cases you can 'hack' your own rules to improve performance.

BMKEG

The Biomedical Knowledge Engineering Research Group (BMKEG) is part of the Intelligent Systems Division at the University of Southern California's Information Sciences Institute.


Welcome to the LA-PDFText Technical Documentation.

The system is intended to provide a practical low-level methodology for extracting text from PDF documents (primarily scientific papers) for incorporation into text mining workflows.

This website provides technical documentation for the LA-PDFText project in the form of separate manuals, published as static online blogs. The documentation will evolve over time as we develop the system and its capabilities.

  1. Installation Manual
  2. Running Commands
  3. Recommended Use of the System

If you have any questions, feedback, or issues concerning this work, please contact us at gully@usc.edu.