The big guide to working with PDFs for students and knowledge workers
PDF documents are the de facto standard for research papers. The vast majority of research papers you download or receive from colleagues will be PDFs. In addition, PDF is one of the standard file types that you can provide to a printing house to have them produce your conference poster, business cards, wedding invitations, whatever.
Working with PDFs is clearly important, but no one is really taught what they can do with PDFs, or how to solve problems with PDF documents they encounter in the real world. In this guide I demonstrate the process of fixing a bad PDF, and then working with it effectively.
Imagine you’ve found a chapter you need in a book at the reference library, but it’s nearly closing time and you have to hurry. You scan the chapter into a PDF two pages at a time to speed things up, and in your hurry you haven’t put the book in exactly straight, so some pages are skewed. And imagine also that the library’s ancient scanner doesn’t do Optical Character Recognition (OCR), so all you’ve gotten is a PDF full of photographs of the book pages.
A PDF just like this one, for example (it’s cited in full at the end of this guide). This came straight out of my library, and I’ll be using it throughout this guide as my example document. I can’t print it properly because the pages are doubled and the text would come out small. I don’t really like how the pages are skewed. And the worst thing is that I can’t highlight and copy text from it because the text doesn’t exist! What do!
What do, indeed. Repairing such a PDF can take up to four steps, in this order:
1. Cutting the doubled page images apart
2. De-skewing the pages
3. Performing OCR to add a text layer
4. Excising unneeded pages
Then you will want to do some actual work with the PDF:
1. Unlocking it if it is secured
2. Making export-ready annotations
3. Extracting your annotations for use elsewhere
We’ll begin by repairing this PDF, then use it to demonstrate the working steps.
I use a cross-platform Java tool called Briss to cut and crop PDFs.
The example PDF has two page images on a single document page, which is not ideal. In this format we have less control over how we print it, and reading it on a tablet will involve more scrolling and zooming. We need to cut each page in half.
Briss will ask you if you want to exclude any pages from the stacking process, which you’ll see in a minute. Just choose Cancel.
Briss will stack the most similar pages on top of each other, and then attempt to generate bounding boxes that adequately capture the contents of each page. Sometimes, as in this case, it gets confused by the PDF and will make the wrong boxes. I’ll delete these boxes by right-clicking and choosing Delete rectangle, and make my own instead.
To make a box, simply left-click and drag one out over the text you want to keep. Briss will maintain the page order if you work from left to right and top to bottom. Here I’ve opted to preserve much of the surrounding page because it will come in handy after de-skewing.
You can see above that my boxes are different sizes. If you crop a PDF with differently-sized boxes, the zoom level will jump around to accommodate the page size as you navigate the document. It’s better to make all the bounding boxes the same size.
To do this, select all of your boxes (either by Ctrl-clicking them or right-clicking and choosing Select Rectangle) and take note of their measurements.
I want all of the boxes to be 170 x 249 mm, so in the top menu bar, select Rectangle → Set size (selected) and enter 170 249.
After dragging the boxes around to get the best positioning, you can preview the final result by selecting Action → Preview. You may decide that you want to make your boxes bigger or smaller, or move them around. Now is a good time to compare the preview to the original document to ensure that the pages are in the right order.
Once you’re done you can output the cut PDF by selecting Action → Crop PDF.
Here is our cut PDF. Normally you would crop the document much closer to the content, especially if you intended to read it on a tablet screen. However I have kept the excess because I want some buffer area for a second cropping after de-skewing, which is the next step.
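As an aside: if your scan is clean enough that you don’t need hand-drawn bounding boxes at all, and you simply want every page chopped straight down the middle, MuPDF’s mutool can do that from the command line. It’s not part of my workflow here, just a sketch, and it assumes exactly two book pages sit side by side on each document page:
mutool poster -x 2 INPUT.pdf OUTPUT.pdf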
I use an online tool called PDFescape to manually de-skew pages.
Now that the page images are on their own document pages, we can de-skew the pages. Note that de-skewing always makes the text blurrier, and for this reason I often skip de-skewing altogether. However, I list it here for completeness.
Once your file is uploaded, choose the Page tab and click the More button to reveal the Deskew feature.
Click the Deskew button to begin. De-skewing with PDFescape involves drawing a horizontal line along a part of the page that is supposed to be horizontally level. A line along the ascenders of the columns is good:
So is a known-horizontal line, like a table or figure border.
You can download your PDF by clicking the green icon on the left bar. When you receive it you will notice that the text has become a little blurrier because of the rotation process. You will have to decide for yourself whether this is aesthetically acceptable to you, and whether it will affect the accuracy of OCR in the next step.
After another cropping step with Briss to remove the excess margins, here is a comparison between the original skewed PDF and the unskewed PDF.
Since I believe that incurring eye strain and slower reading speeds from blurry text is a poor trade for an unskewed page, I will use the skewed PDF for the next step.
I use a PDF viewer called PDF-XChange Viewer to perform OCR.
At this point the document is still just photos of pages, and the text is not selectable, copyable, or highlightable. OCR reads the letter forms in the image and generates a text layer that allows you to do all that sweet stuff. If your document already has a text layer but it’s not very good (you get gibberish when you try to copy and paste text from the PDF, for example), replacing it by doing another OCR pass will fix it.
You can get to the OCR dialog by selecting Document → OCR Pages.
In this dialog you can choose the language and accuracy (High is better, naturally), as well as what to do with the PDF. Since the PDF has no existing text layer, our output type will be Preserve original content and add text layer.
If we wanted to replace the existing text layer we would choose Convert page content to image only and add text as a layer. Selecting this opens up the Image Quality dropdown. Most PDFs fall in the range of about 96–200 DPI, so cranking it higher does nothing except bloat the file and make it scroll more slowly. A bit of calculation (page width in pixels at 100% zoom, divided by the width in inches reported by Adobe Reader under File → Properties) found that this PDF is just a touch under 150 DPI, for example.
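As a made-up illustration of that arithmetic (these numbers are invented, not measured from the example PDF): a page that renders 1035 pixels wide at 100% zoom, on a page Adobe Reader reports as 6.9 inches wide, works out to 1035 ÷ 6.9 ≈ 150 DPI.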
Once you’re ready, hit Okay and marvel at the genius of technology.
You can now use the text selection tool to highlight text, and when you copy and paste it to a new document, it’s accurate indeed! Even the skewed text is mostly accurate, although you might get some miscodings like a one being interpreted as an ell.
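If you ever have a whole stack of scans to OCR, it may be worth knowing that the open-source command-line tool ocrmypdf can do this step in bulk. It isn’t part of the workflow in this guide, so treat the following as a sketch rather than a tested recipe; it assumes English text and a document with no existing text layer (there is also a --deskew option that would fold the previous step into this one):
ocrmypdf -l eng INPUT.pdf OUTPUT.pdf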
Here is the example document after OCR. Now it’s time to prune the document of unnecessary pages.
I use a command-line tool called PDF Toolkit Free to excise pages.
I don’t awfully care about the last three pages of my document, so I want to remove them and be left with the relevant pages only.
PDF Toolkit is a bit different because it’s a command-line tool and has no graphical interface. I prefer these because they’re faster and can be automated, but they’re less straightforward when you first begin using them.
The easiest way to launch the command prompt is to Shift + Right Click an empty space anywhere in a folder or on the desktop, and choose Open command window here.
Ta da, the command prompt. I’m not going to rehash the PDFtk command-line manual, but here are some commands that I use most often. Just type them in as they appear, substituting the input and output filenames for your own. Dragging a file into the command prompt window automatically enters its full address, and that’s the easiest way to reference it.
“Take all the pages between the 2nd page and the end of INPUT.pdf (inclusive) and copy them into a new PDF called OUTPUT.pdf.”
pdftk INPUT.pdf cat 2-end output OUTPUT.pdf
“Copy all pages from INPUT.pdf to OUTPUT.pdf, except for page 5.”
pdftk INPUT.pdf cat 1-4 6-end output OUTPUT.pdf
“Copy pages 1–3 of INPUT1.pdf into OUTPUT.pdf, and then add all of INPUT2.pdf to the end of OUTPUT.pdf.”
pdftk A=INPUT1.pdf B=INPUT2.pdf cat A1-3 B output OUTPUT.pdf
“Combine multiple input PDFs into a single output PDF, in the order given.”
pdftk INPUT1.pdf INPUT2.pdf cat output OUTPUT.pdf
“Compile every PDF in the working directory into a single output PDF, in alphabetical order.”
pdftk *.pdf cat output OUTPUT.pdf
Wondering why it’s called cat? It stands for catenate. cat is so handy!
“Rotate the first page of INPUT.pdf 90° clockwise, and output all pages to OUTPUT.pdf.”
pdftk INPUT.pdf cat 1right 2-end output OUTPUT.pdf
“Rotate the first page of INPUT.pdf 180°, and output to OUTPUT.pdf.”
pdftk INPUT.pdf cat 1south output OUTPUT.pdf
“Split each page of INPUT.pdf into individual single-page PDF documents.”
pdftk INPUT.pdf burst
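And here is the shape of the command I use for the pruning step below. The page count is a made-up placeholder, purely for illustration; substitute your document’s real length.
“Keep pages 1 to 19 of a 22-page INPUT.pdf, which drops the last three pages, and write them to OUTPUT.pdf.”
pdftk INPUT.pdf cat 1-19 output OUTPUT.pdf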
Here is the example document after removing the last three pages. Now we have a totally functional PDF, hurrah! We can make annotations on it, but if we want to make exportable annotations, we’ve got to do it special like.
I use Smallpdf’s unlocker to unlock PDFs.
Sometimes the PDFs you download from journal sites are secured to prevent editing or printing. This also prevents you from making annotations, which ruins what’s so great about working with PDFs!
Simply upload the protected PDF, and you’ll be able to download an unprotected copy that you can annotate and print as much as you want. My example PDF isn’t secured, so I don’t need to do anything here.
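If you would rather not upload a document to a third-party site, the open-source qpdf tool can do the same job offline. This is a minimal sketch, assuming the file is only restricted by an owner password (one that limits printing and editing) and not by a user password you would need to open it:
qpdf --decrypt SECURED.pdf UNLOCKED.pdf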
I use a tool I wrote called PDF Note Automator to make export-ready annotations.
Normally, text that’s been highlighted in a PDF doesn’t get exported when you create a data file of your annotations. It’s the location of the highlighting that gets saved, not the highlighted text itself. Bleh!
You can get around this problem by copying the text before you highlight it, and then adding the text to a pop-up note attached to the highlighting. This is a dull and repetitive process, so I wrote PDF Note Automator to do it for me.
Insert is the default key; choose something comfortable.
Fwoosh! So fast! It’s made a Note annotation, which is a highlight plus a pop-up note of text. Note that a highlight by itself has no text in the comments list, and therefore won’t be useful to us when exported. I’ve also added all the different text annotations to the first page of the PDF as it stands at the end of this step, so that we can have a look at how FDF extraction fares in the next step.
I use a tool I wrote called FDF Text Extractor to pull out my annotations.
Making lots of annotations is rather pointless if you can’t access them efficiently. You might want to put your excerpts into the notes box of a paper stored in your reference manager, like in Zotero. You might be using a reference manager like Docear, in which your text annotations are central to the program’s organisation and function. Or maybe you just want to save your excerpts online for other people to browse, like I do with books.
My tool FDF Text Extractor automatically pulls all text annotations from an exported data file. Let’s do it!
Open your annotated PDF in Adobe Reader and open the Comment sidebar. In the same row as the Find box in the comments list, there is an Options button. Click it and select Export All to Data File.
If you open the FDF in Notepad or some other text editor, you can see that it’s full of gobbledygook that crowds out your annotations. That’s why we need to extract it.
Drag and drop the resulting FDF onto FDF Text Extractor.exe, just like you were placing a file into a folder. If you have many FDFs to extract you can just drag and drop all of them at once.
A .TXT file with the same name as the PDF will appear in the PDF’s directory.
If you open it you will find that your annotations are there, ready for use in your workflow! We’ve gone from a hopelessly crowded data file to just the stuff you care about.
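If you’re curious what an extraction like this boils down to, here is a very rough shell approximation. It assumes the notes are stored as plain parenthesised /Contents strings with no escaped brackets or UTF-16 text, which real FDF files don’t always guarantee, so treat it as a peek at the principle rather than a substitute for the tool:
grep -ao '/Contents ([^)]*)' ANNOTATIONS.fdf | sed 's|/Contents (||; s|)$||' > ANNOTATIONS.txt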
We’ve come a long way. We started with a PDF that was little better than a folder of photographs of a book, and turned it into a full-featured document, one that’s searchable and capable of the same use as any other PDF file. I’ve tried to demonstrate the basic process that you will use every time you generate a new PDF, and I hope this long guide was helpful to you.
Agosti D & Alonso LE (2000). The ALL protocol. In: Ants: Standard Methods for Measuring and Monitoring Biodiversity, Smithsonian Institution Press, Washington DC.
Fisher BL, Malsch AKF, Gadagkar R, Delabie JHC, Vasconcelos HL & Majer JD (2000). Applying the ALL protocol. In: Ants: Standard Methods for Measuring and Monitoring Biodiversity, Smithsonian Institution Press, Washington DC.