Tesseract is an open-source optical character recognition engine and one of the most popular and highest-quality OCR libraries.
OCR uses artificial intelligence to find and recognize text in images.
Tesseract looks for patterns in pixels, letters, words, and sentences. It uses a two-step approach called adaptive recognition: one pass over the data for character recognition, and a second pass to fill in any letters it was not sure about with letters that fit the word or sentence context.
Tesseract OCR
The main task was to recognize receipts from photos.
Tesseract OCR was used as the primary tool. The library's pros: more than 192 trained language models, several recognition modes (image as a single word, as a text block, vertical text), and easy setup. Since Tesseract OCR is written in C++, a third-party wrapper from GitHub was used.
The difference between versions is in the trained models (version 4 is more accurate, so I used it).
Text recognition requires a data file for each language, one file per language. Download here.
The better the image quality (size, contrast, lighting), the better the recognition result.
Image preprocessing with the OpenCV library was also added before recognition. Since OpenCV is written in C++ and there was no suitable wrapper for our case, I wrote my own wrapper around the library with the image-processing functions we needed. The main difficulty was choosing the right filters for the image processing. It is also possible to detect receipt/text outlines, but this has not been researched enough. The result improved by 5–10%.
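OpenCV cannot be called from Swift directly (hence the custom wrapper), and the article does not show the actual filter pipeline. As a rough illustration of the kind of preprocessing that helps OCR, here is a minimal sketch that uses Apple's Core Image instead of OpenCV to convert the photo to grayscale and boost contrast; the filter choice and values are assumptions, not the project's real pipeline.

import UIKit
import CoreImage

// Illustrative preprocessing before OCR: grayscale + contrast boost.
// Core Image stands in for the OpenCV wrapper described above.
func preprocessForOCR(_ image: UIImage) -> UIImage? {
    guard let ciImage = CIImage(image: image) else { return nil }

    // Remove color information and raise contrast so characters stand out.
    let filtered = ciImage.applyingFilter("CIColorControls", parameters: [
        kCIInputSaturationKey: 0.0,   // grayscale
        kCIInputContrastKey: 1.4      // mild contrast boost (value tuned by eye)
    ])

    let context = CIContext()
    guard let cgImage = context.createCGImage(filtered, from: filtered.extent) else { return nil }
    return UIImage(cgImage: cgImage)
}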
Params:
language — the language of the text on the image; several languages can be combined by joining them with “+”.
pageSegmentationMode — how the text is laid out on the image (a usage sketch follows below).
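The article does not show the call site for these parameters. Below is a minimal sketch of how they are typically set with the TesseractOCRiOS (G8Tesseract) wrapper, assuming that is the GitHub wrapper mentioned above and that the downloaded .traineddata files sit in a “tessdata” folder in the app bundle.

import UIKit
import TesseractOCR

// Minimal Tesseract invocation (sketch, assuming the TesseractOCRiOS wrapper).
func recognize(receipt image: UIImage) -> String? {
    // Several languages can be combined with "+", e.g. "eng+rus".
    guard let tesseract = G8Tesseract(language: "eng+rus") else { return nil }

    // pageSegmentationMode controls how the text layout is interpreted;
    // .auto is a reasonable default for a whole receipt photo.
    tesseract.pageSegmentationMode = .auto

    tesseract.image = image
    tesseract.recognize()          // synchronous recognition
    return tesseract.recognizedText
}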
Tesseract alone was about 70% accurate on a perfect image; with bad lighting or quality, accuracy dropped to about 30%.
As this result was insufficient, I decided to use Apple's Vision framework. I used it to find text blocks and then recognize each of them. The result was about 5% more accurate, but there were errors caused by duplicated blocks.
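The VNTextObservation values passed to the slicing function below come from Vision's text rectangle detection. A minimal sketch of that step follows (simplified, with an illustrative function name; error handling omitted, not the project's actual code).

import UIKit
import Vision

// Detect text rectangles with Vision and hand the observations to the slicer below.
func detectTextBlocks(in image: UIImage, completion: @escaping ([VNTextObservation]) -> Void) {
    guard let cgImage = image.cgImage else { return completion([]) }

    let request = VNDetectTextRectanglesRequest { request, _ in
        let observations = request.results?.compactMap { $0 as? VNTextObservation } ?? []
        completion(observations)
    }
    request.reportCharacterBoxes = false   // only whole block boxes are needed for slicing

    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    DispatchQueue.global(qos: .userInitiated).async {
        try? handler.perform([request])
    }
}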
private func sliceImage(text: [VNTextObservation], onImageWithBounds bounds: CGRect) {
    CATransaction.begin()
    var slices = [UIImage]()
    for wordObservation in text {
        // Convert the normalized Vision bounding box into image pixel coordinates.
        let wordBox = boundingBox(forRegionOfInterest: wordObservation.boundingBox, withinImageBounds: bounds)
        if !wordBox.isNull {
            // Crop the detected block out of the source image.
            guard let slice = self.image.cgImage?.cropping(to: wordBox) else { continue }
            slices.append(UIImage(cgImage: slice))
        }
    }
    // Hand the cropped pieces off for per-slice recognition.
    self.sliceCompletion(slices)
    CATransaction.commit()
}
The cons of this solution were:
1) Recognition speed: it decreased by up to 4 times (this can be mitigated by running recognition in multiple threads).
2) Some text blocks were recognized more than once.
3) Text is recognized from right to left, so the right side of the receipt is recognized before the left side (a possible mitigation for 2) and 3) is sketched below).
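One way to work around the duplicated blocks and the right-to-left order is to sort and deduplicate the observations before slicing. This is an assumption on my part, not the article's code: the sketch orders observations top-to-bottom, then left-to-right, and drops boxes that overlap an already accepted one.

import Vision

// Order observations for reading and drop near-duplicate boxes (sketch, not the article's code).
func orderAndDeduplicate(_ observations: [VNTextObservation]) -> [VNTextObservation] {
    // Vision bounding boxes are normalized with the origin at the bottom-left,
    // so a larger minY means the text sits higher on the image.
    let sorted = observations.sorted {
        if abs($0.boundingBox.minY - $1.boundingBox.minY) > 0.01 {
            return $0.boundingBox.minY > $1.boundingBox.minY   // top rows first
        }
        return $0.boundingBox.minX < $1.boundingBox.minX       // then left to right
    }

    var result = [VNTextObservation]()
    for candidate in sorted {
        let isDuplicate = result.contains { existing in
            let intersection = existing.boundingBox.intersection(candidate.boundingBox)
            let area = candidate.boundingBox.width * candidate.boundingBox.height
            return area > 0 && (intersection.width * intersection.height) / area > 0.8
        }
        if !isDuplicate { result.append(candidate) }
    }
    return result
}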
Another approach to text recognition is Google's ML Kit on Firebase. It was the most precise (~90%), but its main cons are support for Latin characters only and difficulty processing text that is split within one line (the name on the right, the price on the left).
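For comparison, here is a minimal sketch of on-device text recognition with the Firebase ML Kit Vision SDK of that era. The pod name, types, and setup are my assumptions about a typical integration (FirebaseMLVision with a configured Firebase app), not the project's actual code.

import UIKit
import FirebaseMLVision

// On-device text recognition with Firebase ML Kit (Latin script only, as noted above).
func recognizeWithMLKit(_ uiImage: UIImage, completion: @escaping (String) -> Void) {
    let vision = Vision.vision()
    let textRecognizer = vision.onDeviceTextRecognizer()
    let visionImage = VisionImage(image: uiImage)

    textRecognizer.process(visionImage) { result, error in
        guard error == nil, let result = result else { return completion("") }
        // Blocks -> lines keeps some layout information, e.g. for pairing
        // an item name and its price that share one line on the receipt.
        let lines = result.blocks.flatMap { $0.lines.map { $0.text } }
        completion(lines.joined(separator: "\n"))
    }
}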
Summing up, text recognition on images is a feasible task, but it comes with difficulties. The main problem is image quality (size, lighting, contrast), which can be mitigated by filtering. When using Vision or ML Kit for text recognition, there were problems with the recognition order and with processing text split within a single line. The recognized text can be corrected manually and still be useful; when recognizing text from receipts, the total is recognized well and does not need corrections.