So you want to parse a PDF?
Suppose you have an appetite for tilting at windmills. Let's say you love pain. Well then why not write a PDF parser today?
The ideal world: how the specification should work
Conceptually, parsing a PDF is fairly simple:
- First, locate the version header comment at the start of the file
- Next, locate the pointer to the cross-reference table
- Then use the cross-reference table to find all object offsets
- Finally, locate and build the trailer dictionary, which points to the catalog dictionary
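As a minimal sketch of the first step only, assuming a well-behaved file whose header comment sits at byte 0 and uses single-digit major and minor versions, the version can be read with a regular expression. The function name read_version and the sample bytes are made up for illustration.

```python
import re

# Matches a header comment such as b"%PDF-1.7".
# Assumption: single-digit major and minor version numbers.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def read_version(data: bytes) -> tuple[int, int]:
    """Return (major, minor), assuming the header is at the very start."""
    match = HEADER_RE.match(data)  # match() only succeeds at offset 0
    if match is None:
        raise ValueError("no %PDF-x.y header at the start of the data")
    return int(match.group(1)), int(match.group(2))


sample = b"%PDF-1.7\n16 0 obj\n620\nendobj\n"
print(read_version(sample))  # (1, 7)
```

This only covers the ideal case described above: it rejects any input that does not begin with the header comment.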
Introduction to PDF objects
A PDF object wraps some valid PDF content (numbers, strings, dictionaries, etc.) and labels it with an object number and a generation number. The content is surrounded by the obj/endobj markers. For example, a simple number may have its own PDF object:
16 0 obj
620
endobj
This declares that object 16 with generation 0 contains the number 620.
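To make the structure concrete, here is a rough sketch that pulls the object number, generation number and raw content out of an object like the one above. It is deliberately naive: it assumes the content never contains the bytes "endobj" itself, which will not hold for every real file. The name parse_object is hypothetical.

```python
import re

# object number, generation number, "obj", content, "endobj"
# Assumption: the content does not itself contain b"endobj".
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\s*(.*?)\s*endobj", re.DOTALL)


def parse_object(data: bytes) -> tuple[int, int, bytes]:
    """Return (object number, generation number, raw content bytes)."""
    match = OBJ_RE.match(data)
    if match is None:
        raise ValueError("data does not start with an 'N G obj ... endobj' block")
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(parse_object(b"16 0 obj\n620\nendobj"))  # (16, 0, b'620')
```

The content is returned as raw bytes; turning b'620' into a number, or a dictionary into a real data structure, is a separate tokenizing step.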
A PDF file is effectively a graph of objects that may reference each other. Objects reference other objects using indirect references. These have the format "16 0 R", which indicates that the content should be found in object 16 (generation number 0). In this case that would point to object 16, which contains the number 620. It is up to producer applications to split file content into objects as they wish, though the specification requires that certain object types be indirect.
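One way to picture the graph is as a lookup table keyed by object and generation number, with references followed until a real value turns up. The sketch below assumes the objects have already been parsed into Python values; the Ref class, the resolve function and the example dictionary are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """An indirect reference such as "16 0 R" (hypothetical representation)."""
    number: int
    generation: int


def resolve(value, objects):
    """Follow indirect references until a direct value is reached."""
    seen = set()
    while isinstance(value, Ref):
        if value in seen:
            raise ValueError("circular reference")
        seen.add(value)
        value = objects[(value.number, value.generation)]
    return value


# Object 17 refers to object 16 instead of holding the number directly.
objects = {
    (16, 0): 620,
    (17, 0): {"Length": Ref(16, 0)},
}
print(resolve(objects[(17, 0)]["Length"], objects))  # 620
```

The seen set is a guard against reference cycles, which a parser that wants to survive malformed input would need to handle somehow.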
Finding the cross-reference offset
To avoid the need to scan the entire file, PDFs declare a cross-reference table (xref). This is an index pointing to where each object in the file lives.
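The idea can be sketched with a plain dictionary standing in for the index. This is not how the xref table is stored in the file; it only illustrates why knowing the offsets means a parser can jump straight to an object. The names xref and read_object_at are hypothetical, and the offset is computed from the sample bytes rather than read from a real table.

```python
data = b"%PDF-1.7\n16 0 obj\n620\nendobj\n"

# Hypothetical in-memory index: (object number, generation) -> byte offset.
xref = {(16, 0): data.index(b"16 0 obj")}


def read_object_at(data: bytes, offset: int) -> bytes:
    """Return the bytes from offset up to and including the next endobj."""
    end = data.index(b"endobj", offset) + len(b"endobj")
    return data[offset:end]


print(xref[(16, 0)])  # 9
print(read_object_at(data, xref[(16, 0)]))  # b'16 0 obj\n620\nendobj'
```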
Each file ends with a pointer to the cross-reference table:
<< %trailer >>
startxref
116
%%EOF
This tells the parser to jump to byte offset 116 to find the xref table (or stream). In theory this pointer is right at the end of the file, according to the specification: