1. scan the hard copy of the statement at the highest resolution the scanner supports
2. save as jpeg image file format
3. convert jpeg file to tif using imagemagick
# convert sample_doc.jpg sample_doc.tif
4. run optical character recognition (ocr) using tesseract
# tesseract sample_doc.tif sample_doc
5. manually verify ocr output
here's a sample of the scanned image and its digital text file output. compare the rightmost values. (i had to censor some details, which tesseract converted accurately hahaha)
need met ;)
i remember using a pdf-to-text tool before, but xpdf's pdftotext didn't work with my scanned image file: it only extracts text already embedded in a pdf and does no ocr. also, i had problems with lower-resolution scans.
the jpeg-to-tif conversion in step 3 is needed because tesseract (at least the version i used) works with tif files.
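for more than one scan, steps 3 and 4 could be wrapped in a small shell script. this is only a rough sketch, not something from my original run: the names strip_ext and ocr_scan are made up for illustration, and it assumes imagemagick's convert and tesseract are both installed and on the PATH.

```shell
#!/bin/sh
# rough sketch: run steps 3 and 4 on every jpeg given as an argument.
# assumes imagemagick's `convert` and `tesseract` are on the PATH.
# strip_ext and ocr_scan are hypothetical helper names.

# drop the file extension: scans/sample_doc.jpg -> scans/sample_doc
strip_ext() {
    printf '%s\n' "${1%.*}"
}

# convert one jpeg to tif, then ocr the tif
ocr_scan() {
    base=$(strip_ext "$1")
    convert "$1" "$base.tif" || return 1       # step 3: jpeg -> tif
    tesseract "$base.tif" "$base" || return 1  # step 4: writes $base.txt
}

# loop over all files given on the command line
for f in "$@"; do
    ocr_scan "$f" || echo "failed: $f" >&2
done
```

saved as, say, ocr_scans.sh, it would be run as `sh ocr_scans.sh sample_doc.jpg`. step 5 still applies: the text output needs to be checked by hand.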