MeatballWiki


ParsingPDF

Modern techniques for parsing and understanding PDFs combine basic PDF stream parsing, image processing for embedded images, and a model that reserializes the text and organizes it hierarchically. Because PDF is designed for print, text that is semantically continuous may be physically disconnected on the page; for instance, a passage may be split across columns or page breaks.
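As a rough illustration of the reserialization step, here is a minimal Python sketch that puts already-extracted text blocks back into reading order. It is not any particular library's method. The `Block` type, the `reading_order` and `reserialize` functions, the fixed equal-width column count, and the top-left coordinate origin are all assumptions made for the example. Real documents usually need column detection and smarter heuristics or a trained model.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical extracted text block; not a real library type."""
    page: int    # zero-based page index
    x0: float    # left edge of the block
    y0: float    # top edge, assumed to grow downward from the top of the page
    text: str


def reading_order(blocks, page_width, columns=2):
    """Sort blocks by page, then column (left to right), then top to bottom.

    Assumes equal-width columns; a block belongs to the column that
    contains its left edge.
    """
    column_width = page_width / columns

    def key(block):
        column = int(block.x0 // column_width)
        column = max(0, min(column, columns - 1))  # clamp stray coordinates
        return (block.page, column, block.y0)

    return sorted(blocks, key=key)


def reserialize(blocks, page_width, columns=2):
    """Join block texts into one string in the assumed reading order."""
    ordered = reading_order(blocks, page_width, columns)
    return " ".join(block.text for block in ordered)


# Blocks arrive in arbitrary order, as they might from a content stream.
blocks = [
    Block(page=0, x0=310, y0=80, text="and continues in the right column,"),
    Block(page=1, x0=50, y0=80, text="then finishes on the next page."),
    Block(page=0, x0=50, y0=80, text="A sentence starts in the left column"),
]

print(reserialize(blocks, page_width=600))
```

Sorting on a (page, column, vertical position) key is the simplest way to stitch together text split across columns and page breaks. Note that native PDF coordinates place the origin at the bottom-left, so the vertical key would need to be flipped when working with raw coordinates.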

Libraries