Search is one of the features I value most in software. I hate having to remember stuff. That’s one of the reasons why my bachelor thesis is about Enterprise Search, and why I hate receiving official documents as letters.
Sure, you can file them in folders by date or by category, like insurance, contracts etc. But in the end, a “Ctrl + F” (or Cmd + F) is all you want.
While there are a lot of services out there that offer OCR for scanned documents, I chose to implement something myself. Why? Because I want it cheap, I want to learn something, and of course as an engineer I like to reinvent the wheel (cause that’s what we do, right?).
As mentioned, cost was one of the driving factors, so the idea is to rely mainly on the free tiers of different cloud offerings to build a private OCR service.
This first post is just about sketching out the idea, and subsequent blog posts will show the implementation.
The idea
Since my problem starts in the analog world, I must first digitize the letters I receive. I have one of those clumsy big “multifunction printers” with an integrated document scanner. The printer can scan documents and send them via email, put them on a USB stick, or put them on an SMB share. I’ve chosen the email “output” for my project.
OCR is not a trivial topic, and while I claimed that engineers like to reinvent the wheel, I really wouldn’t like to implement an OCR engine. During my master’s degree I got in touch with Tesseract. Tesseract is an open source OCR tool/library which delivers very decent OCR results. Since it already ships with a command line tool, there is no need to do any implementation work on the OCR side.
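To make the "no implementation work" point concrete, here is a minimal sketch of calling the Tesseract command line tool from Python. It assumes that `tesseract` is installed and on the PATH, and that the scanned page is already available as an image file (for example a PNG or TIFF). The function names and the file names in the usage note are made up for illustration and are not part of the actual project.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, language="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang> pdf

    The trailing "pdf" asks Tesseract to write a searchable PDF
    to <outputbase>.pdf instead of a plain text file.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", language, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, language="eng"):
    """Run the Tesseract CLI and return the path of the PDF it writes.

    Assumption: the tesseract binary and the requested language data
    are installed on the machine that executes this function.
    """
    command = build_tesseract_command(image_path, output_base, language)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
    return Path(f"{output_base}.pdf")
```

Calling `ocr_to_searchable_pdf("letter.png", "letter", language="deu")` would then be expected to leave a searchable `letter.pdf` next to the scan, provided the German language data is installed.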
The next step would be to execute Tesseract every time an email with an attachment is sent from my printer. Therefore we need some execution environment which can run Tesseract. Running a Docker container 24/7 seems a little bit much for getting a letter every other day. I chose Azure Functions on a Consumption plan, also because the plan includes a number of free executions.
Ok, document as PDF in an email attachment, Tesseract running inside an Azure Function, so the next step is about connecting both. My first thought was to use Microsoft Power Automate or IFTTT, but both platforms offer HTTP requests only in a paid plan. Since I’m using OneDrive to store the PDF files, I will use webhooks from OneDrive to call the Azure Function. To put the files into OneDrive I will use Microsoft Power Automate (formerly Microsoft Flow).
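The receiving side of such a webhook can be sketched without any Azure specifics. OneDrive webhooks go through Microsoft Graph subscriptions: when a subscription is created, the endpoint gets a request carrying a `validationToken` query parameter and has to echo it back as plain text, and later change notifications arrive as JSON with a `value` array. The sketch below assumes exactly that behaviour. The function names and the shared secret are hypothetical, and wiring this into the actual Azure Function is left out.

```python
import json

# Hypothetical shared secret, set as clientState when creating the subscription.
EXPECTED_CLIENT_STATE = "my-shared-secret"


def trusted_notifications(body):
    """Return the notifications from a JSON body whose clientState matches."""
    try:
        payload = json.loads(body or "{}")
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return [
        item
        for item in payload.get("value", [])
        if isinstance(item, dict) and item.get("clientState") == EXPECTED_CLIENT_STATE
    ]


def handle_webhook(query_params, body):
    """Return (status, content_type, response_body) for an incoming request."""
    token = query_params.get("validationToken")
    if token is not None:
        # Subscription handshake: echo the token back as plain text.
        return 200, "text/plain", token
    for notification in trusted_notifications(body):
        # Placeholder: this is where the OCR step would be triggered.
        pass
    # Acknowledge quickly; the real work should not block the response.
    return 202, "text/plain", ""
```

Keeping the handler a plain function of query parameters and body makes it easy to test locally before it is deployed anywhere.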
To set up the webhook I could create a CLI tool, but since I want to improve my Angular/React skills, I’m going to create an SPA for it. Why an SPA for such a simple thing? Well, serving a bunch of HTML, CSS and JS files is free, whereas running a server 24/7 does cost something.
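Whatever ends up registering the webhook, SPA or CLI, essentially has to send one request. As a rough sketch in Python, and assuming the Microsoft Graph subscriptions API, the request body could be built like this. The function name, the one-day lifetime and the example URL are assumptions for illustration. The result would be sent as JSON in an authenticated POST to `https://graph.microsoft.com/v1.0/subscriptions`, which is not shown here.

```python
from datetime import datetime, timedelta, timezone


def build_subscription_request(notification_url, client_state,
                               now=None, lifetime=timedelta(days=1)):
    """Build the JSON body for a Graph subscription on the OneDrive root.

    Assumptions: the drive root is the watched resource, "updated" is the
    change type, and the lifetime is an arbitrary illustrative value.
    """
    now = now or datetime.now(timezone.utc)
    # Normalize to UTC so the trailing "Z" in the timestamp is accurate.
    expires = (now + lifetime).astimezone(timezone.utc)
    return {
        "changeType": "updated",
        "notificationUrl": notification_url,
        "resource": "/me/drive/root",
        "expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "clientState": client_state,
    }
```

Since subscriptions expire, the same tool would also need a way to renew them, which is one more argument for a small UI over a one-off script.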
To wrap things up, the following image gives an overview of the different components.
