Wednesday, May 21, 2014

fedora: scan image to text

so... i needed to convert hard copies of statements sent by snail mail into plain text files. i need them in digital form so i can compare them against other data that was transmitted digitally. the procedure below works best for me so far (you'll need imagemagick and tesseract installed):

1. scan the hard copy of the statement at the highest resolution the scanner offers
2. save it as a jpeg image file
3. convert the jpeg file to tif using imagemagick
   # convert sample_doc.jpg sample_doc.tif
4. run optical character recognition (ocr) using tesseract; the text ends up in sample_doc.txt
   # tesseract sample_doc.tif sample_doc
5. manually verify the ocr output
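
if there are several scans, steps 3 and 4 can be wrapped in a small script. this is only a rough sketch, not something from my actual setup: the function name ocr_scan is made up, and it assumes convert and tesseract are on the PATH and behave as in the commands above.

```shell
#!/bin/sh
# ocr_scan (hypothetical helper name): convert one scanned jpeg to tif,
# then run tesseract on the tif. prints the name of the resulting text file.
ocr_scan() {
    jpg="$1"
    base="${jpg%.*}"                            # sample_doc.jpg -> sample_doc
    convert "$jpg" "$base.tif" || return 1      # step 3: jpeg to tif
    tesseract "$base.tif" "$base" || return 1   # step 4: writes $base.txt
    echo "$base.txt"
}

# process every file given on the command line
for f in "$@"; do
    ocr_scan "$f"
done
```

saved as, say, ocr_scan.sh (again a made-up name), it would be run as `sh ocr_scan.sh sample_doc.jpg`. step 5 still applies: the output needs a manual check before comparing it with anything.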

here's a sample of the scanned image and its text file output. compare the rightmost values. (i had to censor some details, which tesseract converted accurately hahaha)


need met ;)

i remember using a pdf-to-text tool before, but pdftotext from xpdf didn't work with my scanned image file: it only pulls out text that is already embedded in a pdf and does no ocr of its own. i also had problems with lower-resolution scans.

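the jpeg-to-tif conversion in step 3 is needed because tesseract works with tif files. (tesseract builds that read images through leptonica can also accept other formats such as png and jpeg, so the step may be optional depending on your install.)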
