Postprocessing of Scanned Documents

I often process scanned documents: I fix up the scans, run the document through OCR, add a table of contents, etc. I work on a Linux workstation in a folder with the following structure:

┌ .
│ ├ in/
│ ├ out/
│ ├ upscale/
│ └ input.pdf
└ generátor-obsahu.py

The first step is to get bitmap images of the individual pages. If the input is a PDF document, I use the pdftoppm utility from the poppler-utils package:

pdftoppm -r 300 input.pdf str -png

Then I move these PNGs into the in/ folder.
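
The move can also be scripted. Below is a minimal Python sketch of this step, assuming pdftoppm was run with the root name str as above, so the pages are named str-1.png, str-2.png and so on (zero-padded when there are more pages). The function name move_pages and its parameters are made up for illustration.

```python
import shutil
from pathlib import Path


def move_pages(src=".", dest="in", prefix="str"):
    """Move page images named <prefix>-*.png from src into dest.

    Returns the names of the moved files in sorted order.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create in/ if it is missing
    moved = []
    # The glob is not recursive, so files already inside dest are left alone.
    for page in sorted(Path(src).glob(f"{prefix}-*.png")):
        target = dest_dir / page.name
        shutil.move(str(page), str(target))
        moved.append(target.name)
    return moved
```

Calling move_pages() from the working folder has the same effect as moving the files by hand; other files in the folder, such as input.pdf, are not touched.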

Sometimes I upscale the pages. I haven't found a good upscaler for text, so I use the C++ version of waifu2x:

waifu2x-converter-cpp -i in/ -o upscale/ -m noise-scale --scale-ratio 2 -f png

Then comes the main part of scan processing – splitting pages, deskewing them, adding or removing borders, etc. I use a GUI tool called Scan Tailor for that. The original version seems to be unmaintained, but there are two forks – Scan Tailor Universal and Scan Tailor Advanced. I save Scan Tailor's output into the out/ folder.

Then I combine all the images into one PDF file using the img2pdf tool:

img2pdf --output sd.pdf out/*.tif

Next I run the resulting document through OCR with OCRmyPDF (two parallel jobs, the highest optimization level, forced OCR on every page, English as the language):

ocrmypdf -j 2 --optimize 3 -fl eng sd.pdf sd_ocr.pdf
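
The combine and OCR steps can also be chained from a script. The following is only a sketch under assumptions: the helper names build_commands and run_all are invented here, the options are copied from the two commands above, and the page images are expected as .tif files in out/.

```python
import subprocess
from pathlib import Path


def build_commands(out_dir="out", stem="sd", lang="eng", jobs=2):
    """Return the img2pdf and ocrmypdf command lines as argument lists."""
    # Sort explicitly so the page order does not depend on directory order.
    pages = sorted(str(p) for p in Path(out_dir).glob("*.tif"))
    if not pages:
        raise FileNotFoundError(f"no .tif pages found in {out_dir}/")
    combine = ["img2pdf", "--output", f"{stem}.pdf", *pages]
    # -f forces OCR on every page, -l selects the OCR language.
    ocr = ["ocrmypdf", "-j", str(jobs), "--optimize", "3",
           "-f", "-l", lang, f"{stem}.pdf", f"{stem}_ocr.pdf"]
    return [combine, ocr]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Running run_all(build_commands()) from the working folder should produce sd.pdf and sd_ocr.pdf. Building the argument lists separately from running them makes it easy to print and inspect the commands first.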

The last step is to add the table of contents to the metadata of the PDF file. I use my own custom script – generátor-obsahu.py – to generate the input for the command-line version of PDFtk, although you could also write that input by hand. My script reads a tab-delimited representation of the table of contents from standard input and writes the PDFtk input to standard output:

python ../generátor-obsahu.py < sd.toc > sd.pdftk.toc &&
pdftk sd_ocr.pdf update_info sd.pdftk.toc output sd_final.pdf
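
The script itself is not shown here. As a rough idea of what such a generator could look like, here is a minimal sketch. The input layout is an assumption made for illustration (leading tabs give the nesting depth, followed by the title, a tab, and the page number), and so is the function name toc_to_pdftk; the real script may work differently. The output uses the bookmark records that pdftk prints with dump_data and reads back with update_info.

```python
import sys


def toc_to_pdftk(lines):
    """Convert tab-delimited TOC lines into pdftk bookmark records.

    Assumed (hypothetical) input format, one entry per line:
    <leading tabs = nesting depth><title><TAB><page number>
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        stripped = line.lstrip("\t")
        level = len(line) - len(stripped) + 1  # no leading tab means level 1
        title, page = stripped.rsplit("\t", 1)
        records.append(
            "BookmarkBegin\n"
            f"BookmarkTitle: {title.strip()}\n"
            f"BookmarkLevel: {level}\n"
            f"BookmarkPageNumber: {int(page)}\n"
        )
    return "".join(records)


# Act as a filter only when input is piped or redirected in.
if __name__ == "__main__" and not sys.stdin.isatty():
    sys.stdout.write(toc_to_pdftk(sys.stdin))
```

With this layout, a line consisting of Preface, a tab, and 3 becomes a level-1 bookmark pointing at page 3, and the same line with one leading tab becomes a level-2 bookmark. One caveat: pdftk's update_info expects non-ASCII characters as XML numeric entities, so titles with accented letters may call for update_info_utf8 instead.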

The resulting file sd_final.pdf should be more pleasant to read than a raw scan.