head tap: fedora: scan image to text

Wednesday, May 21, 2014

so... i needed to convert hard copies of statements, sent by snail mail, into plain text files. i need them in digital form so i can compare them with other data that was sent to me digitally. the procedure below works best for me so far (you'll need imagemagick and tesseract installed):

1. scan the hard copy of the statement at the highest resolution available

2. save as jpeg image file format

3. convert jpeg file to tif using imagemagick

# convert sample_doc.jpg sample_doc.tif

4. run optical character recognition (ocr) using tesseract

# tesseract sample_doc.tif sample_doc

5. manually verify ocr output
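
with more than one statement to process, steps 3 and 4 can be wrapped in a small shell function. this is only a sketch: the name ocr_scan is made up for illustration, and it assumes the convert and tesseract commands behave exactly as in the steps above.

```shell
# hypothetical helper (not part of imagemagick or tesseract):
# runs step 3 and step 4 for every jpeg given on the command line
ocr_scan() {
    for jpg in "$@"; do
        base="${jpg%.*}"                            # sample_doc.jpg -> sample_doc
        convert "$jpg" "$base.tif" || return 1      # step 3: jpeg to tif
        tesseract "$base.tif" "$base" || return 1   # step 4: writes $base.txt
        echo "check $base.txt against $jpg"         # step 5 stays manual
    done
}
```

calling ocr_scan sample_doc.jpg leaves sample_doc.tif and sample_doc.txt next to the original scan. the function stops at the first file that fails to convert.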

here's a sample of the scanned image and its text file output. compare the rightmost values. (i had to censor some details which were accurately converted by tesseract hahaha)
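
comparing the rightmost values by eye gets tedious, so part of the check can be scripted. a rough sketch, assuming the value of interest is the last whitespace-separated field on each line and that both files list their entries in the same order. the helper names last_column and compare_rightmost are made up for illustration.

```shell
# hypothetical helper: print the last field of every non-empty line
last_column() {
    awk 'NF { print $NF }' "$1"
}

# hypothetical helper: diff the rightmost values of two text files
# returns 0 when they all match, 1 when at least one differs
compare_rightmost() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    last_column "$1" > "$a"
    last_column "$2" > "$b"
    status=0
    diff "$a" "$b" || status=$?    # diff prints the values that differ
    rm -f "$a" "$b"
    return "$status"
}
```

for example, compare_rightmost sample_doc.txt other_data.txt (the second file name is a placeholder for the digitally transmitted data) prints nothing when every rightmost value matches. it doesn't replace the manual check in step 5, since ocr can also misread text outside the last column.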

need met ;)

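i remember using a pdf-to-text tool before, but pdftotext from xpdf didn't work with my scanned image file. it only extracts text that is already embedded in a pdf and doesn't do ocr, so a scan gives it nothing to extract. also, i had problems with lower resolution scans, which is why step 1 uses the highest resolution.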

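the jpeg-to-tif conversion in step 3 is needed because the tesseract i used works with tif files. (tesseract builds that read images through leptonica can usually take jpeg or png directly, in which case step 3 can be skipped.)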
