I often process scanned documents: I fix up the scans, run the document through OCR, add a table of contents, and so on. I work on a Linux workstation in a folder with the following structure:
┌ .
│ ├ in/
│ ├ out/
│ ├ upscale/
│ └ input.pdf
└ generátor-obsahu.py
The first step is to get bitmap images of the individual pages. If the input is a PDF document, I use the pdftoppm utility from the poppler-utils package:
pdftoppm -r 300 input.pdf str -png
Then I move these PNGs into the in/ folder.
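The move can also be scripted. Here is a minimal sketch in Python, assuming pdftoppm named the pages with the str prefix used above (for example str-01.png); the helper name move_pages and its parameters are made up for this illustration:

```python
import shutil
from pathlib import Path


def move_pages(src=".", dest="in", prefix="str"):
    """Move page images such as str-01.png from src into dest.

    Assumption: pdftoppm wrote the pages as <prefix>-<number>.png.
    Returns the names of the moved files in sorted order.
    """
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Collect the matches first so the listing is not affected by the moves.
    pages = sorted(Path(src).glob(f"{prefix}-*.png"))
    for page in pages:
        shutil.move(str(page), str(dest_dir / page.name))
    return [page.name for page in pages]
```

Calling move_pages() from the working folder moves every matching PNG into in/ and leaves other files alone.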
Sometimes I upscale the pages. I haven't found a good upscaler for text, so I use the C++ version of waifu2x:
waifu2x-converter-cpp -i in/ -o upscale/ -m noise-scale --scale-ratio 2 -f png
Then comes the main part of scan processing: splitting pages, deskewing them, adding or removing borders, and so on. I use a GUI tool called Scan Tailor for that. The original version seems to be unmaintained, but there are two forks: Scan Tailor Universal and Scan Tailor Advanced.
I save Scan Tailor's output into the out/ folder.
Then I combine all the images into one PDF file using the img2pdf tool:
cd out && img2pdf --output ../sd.pdf *.tif && cd ..
Next, I run the resulting document through OCR. I use OCRmyPDF for that:
ocrmypdf -j 2 --optimize 3 -fl eng sd.pdf sd_ocr.pdf
The last step is to add the table of contents to the metadata of the PDF file. I use my own script, generátor-obsahu.py, to generate the input for the command-line version of PDFtk; you could also write that input by hand. My script takes a tab-delimited representation of the table of contents as input and is run like this:
cat sd.toc | python ../generátor-obsahu.py > sd.pdftk.toc &&
pdftk sd_ocr.pdf update_info sd.pdftk.toc output sd_final.pdf
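To give an idea of what such a generator does, here is a rough sketch. It is not the actual generátor-obsahu.py, and the input format is an assumption: one entry per line in the form title, tab, page number, with the number of leading tabs giving the nesting depth. The output uses the bookmark records that pdftk prints with dump_data and accepts with update_info. The function name toc_to_pdftk is made up for this example.

```python
def toc_to_pdftk(lines):
    """Convert tab-delimited TOC lines into pdftk bookmark records.

    Assumed input format (hypothetical): "<title>\t<page>", where each
    leading tab nests the entry one level deeper.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        body = line.lstrip("\t")
        level = len(line) - len(body) + 1  # no leading tab means level 1
        title, page = body.rsplit("\t", 1)
        records += [
            "BookmarkBegin",
            f"BookmarkTitle: {title.strip()}",
            f"BookmarkLevel: {level}",
            f"BookmarkPageNumber: {int(page)}",
        ]
    return "\n".join(records) + "\n"


# Small in-memory example; a real filter would read sys.stdin instead.
example = "Preface\t3\n\tNotes\t5\n"
print(toc_to_pdftk(example.splitlines()), end="")
```

To use it as a filter in the pipeline above, replace the example at the bottom with sys.stdout.write(toc_to_pdftk(sys.stdin)). Titles with non-ASCII characters may need pdftk's update_info_utf8 operation instead of update_info.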
The resulting file sd_final.pdf should be more pleasant to read than a raw scan.