Open-source Scanning – Julian Foad

For the last few years I have been scanning most of the printed documents I receive: bank statements, important receipts, event tickets to keep as souvenirs. From time to time I also scan batches of old documents such as my university notes.

I envision a free software solution that makes it easy.

SimpleScan, PDF-Shuffler, gscan2pdf

The default solution for scanning documents on Ubuntu, SimpleScan, is not great. It is very useful, and I have used it to scan the vast majority of my paper documents for the last few years, but it is annoyingly crude. Main omissions and problems:

OCR support.
Load a previously saved document and scan more pages into it. (For correcting mistakes.)
Scan a stack of pages and then divide them into separate documents. (For speed.)
Undo.
Automatic page size detection, rotation (right angles or de-skew), blank page skipping, and so on.
Basic UI usability (e.g. selecting and re-ordering pages doesn’t work well; forgets the crop size and position).
Generally stable, but with some serious bugs, e.g. deleting one page while another is scanning leads to data corruption.
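To make the blank page skipping idea concrete, here is a minimal sketch in Python. It is not how SimpleScan or gscan2pdf work. It assumes each page is already available as rows of 8-bit greyscale pixels, and the names `ink_fraction`, `is_blank` and `skip_blank_pages`, as well as both threshold values, are illustrative guesses that would need tuning on real scans.

```python
from typing import List

# Assumption: a page is a list of rows of 8-bit greyscale pixels
# (0 = black, 255 = white).
Page = List[List[int]]


def ink_fraction(page: Page, dark_threshold: int = 128) -> float:
    """Return the fraction of pixels darker than dark_threshold."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    dark = sum(1 for row in page for pixel in row if pixel < dark_threshold)
    return dark / total


def is_blank(page: Page, max_ink: float = 0.002) -> bool:
    """Treat a page as blank when almost none of its pixels are dark.

    max_ink is a hypothetical tolerance for dust and scanner noise.
    """
    return ink_fraction(page) <= max_ink


def skip_blank_pages(pages: List[Page]) -> List[Page]:
    """Keep only the pages that appear to carry some content."""
    return [page for page in pages if not is_blank(page)]
```

A real implementation would probably ignore the page margins and punch holes, and should offer Undo in case it drops a page that only looked blank.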

To add more pages to a previously saved PDF document, I use PDF-Shuffler. It too is useful but crude, lacking:

Undo.
Basic UI usability (e.g. multiple selection of pages doesn’t work properly; doesn’t remember its own window position and size; doesn’t provide a dead-simple way to save the result back to the same file).

I tried gscan2pdf mainly for its OCR promise (see below), but I also liked that it can do automatic corrections such as de-skew. However, I found those features nearly impossible to configure: they are basically undocumented, and their settings are presented as bare numbers with no indication of how to choose useful values or how they interact.

On the plus side, gscan2pdf scans from my HP Photosmart C7280 all-in-one device at about twice the speed SimpleScan does, keeping up with the scanner’s speed, while SimpleScan makes the scanner pause several times per page. I don’t know if this is because it is using different scan settings and receiving less data, or because it is more efficient at receiving the data.

OCR

Most of the documents for which I expect OCR to work were printed by a computer in a nice regular clean typeface. In this century it seems ridiculous that the computer still can’t read what it wrote.

For OCR I have tried gscan2pdf, with both the Tesseract and Cuneiform OCR engines. The results I have obtained have been decidedly poor: good enough for generating an index of the words found in a document, but not good enough that I would want to display the recognised text.
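As an illustration of what "good enough for an index" could mean, here is a small Python sketch that builds a word-to-pages index from OCR output. It assumes the OCR text is already available as one string per page; the function name `build_word_index` and the rule of keeping only alphabetic runs of three or more letters are my own assumptions, not something gscan2pdf does.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

# Assumption: runs of 3+ ASCII letters are worth indexing; shorter
# fragments are more likely to be OCR noise.
WORD_RE = re.compile(r"[A-Za-z]{3,}")


def build_word_index(pages: List[str]) -> Dict[str, Set[int]]:
    """Map each lower-cased word to the set of 1-based page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in WORD_RE.findall(text):
            index[word.lower()].add(page_number)
    return dict(index)


if __name__ == "__main__":
    # Hypothetical OCR output with a typical misread ("statemcnt").
    ocr_pages = ["Bank statemcnt for March", "Receipt from the bank"]
    for word, page_numbers in sorted(build_word_index(ocr_pages).items()):
        print(word, sorted(page_numbers))
```

Misread words simply end up as extra index entries, which is tolerable for searching but shows why the same text is not fit for display.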

By contrast, the OCR performed by what I understand is a version of ABBYY FineReader embedded in my Fujitsu ScanSnap N1800 scanner gives excellent results, almost perfectly transcribing most printed pages. A notable exception is that it swaps bold for italic and italic for bold. It only OCRs one page a minute, and the scanner’s UI software is dreadful, but this does show what can be achieved by good OCR software.

Free Software Improvements

What’s needed for a good free software solution? Two main things:

Decent basic UI
OCR that just works

For the former, adding Undo to a combination of SimpleScan and PDF-Shuffler would form a good basis in terms of features and rough UI design, with maybe some ideas from gscan2pdf as well. This seems a run-of-the-mill software design and programming task.

For the OCR, I expect some intensive work is needed. It need not be totally out of reach for a group of keen volunteers, but it might be more likely to happen if there were some investment by one of the big players such as Canonical, IBM or Google. (Google has improved Tesseract a lot, but it is still not a complete OCR solution.)

And then could come enhancements such as:

Automatic rotation, de-skew, blank page skip, cropping, …
Detecting a small palette of colours (often black, white and one other)
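The palette detection idea could be sketched roughly as follows in Python. This is only one possible approach and rests on several assumptions: pixels arrive as (R, G, B) tuples, colours are grouped by coarse quantisation, and the names `quantise` and `dominant_palette` and the `step`, `coverage` and `max_colours` values are all illustrative.

```python
from collections import Counter
from typing import List, Optional, Tuple

Colour = Tuple[int, int, int]  # (R, G, B), each channel 0-255


def quantise(colour: Colour, step: int = 64) -> Colour:
    """Snap each channel to the centre of its bucket so near-identical shades merge."""
    return tuple(min(255, (c // step) * step + step // 2) for c in colour)


def dominant_palette(
    pixels: List[Colour],
    coverage: float = 0.98,
    max_colours: int = 4,
) -> Optional[List[Colour]]:
    """Return the few quantised colours that cover most of the page.

    Colours are taken from most to least common until they account for
    `coverage` of all pixels. Returns None when more than `max_colours`
    would be needed, i.e. the page is probably a photo or full-colour image.
    """
    if not pixels:
        return []
    counts = Counter(quantise(p) for p in pixels)
    needed = coverage * len(pixels)
    palette: List[Colour] = []
    covered = 0
    for colour, count in counts.most_common():
        palette.append(colour)
        covered += count
        if covered >= needed:
            return palette
        if len(palette) >= max_colours:
            return None
    return None
```

A page that comes back with two or three colours could then be saved with an indexed palette instead of full colour, which should give smaller and cleaner files.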

Some features such as automatic rotation are closely associated with OCR.

If Linux distributions shipped with such a solution ready to go, I think it would help general home and small office users to accept them as systems that do what they would regard as basic computer tasks.