I was given the task of researching and prototyping various computer vision algorithms to determine whether the text appearing within a given video frame was readable or not.
My readability algorithms had access to the source text image in addition to the video frame, where the same text appeared somewhere within the frame, somewhat degraded. The text in the video frame might be washed out, obstructed, blurred, shrunken or skewed, have less contrast and/or have color shifts. The job of my automatic readability assessment algorithm was to judge whether the source text appearing in the video frame was still readable.
The first problem was to find the size, location, orientation and skew of the source text within the video frame. Because I was limited to one camera and one video stream I tried some of the traditional feature matching algorithms. (We later switched to a 3-D scene analysis technique once we added more cameras.)
Based on the feature matching between the source and video frame images, a homography matrix could be computed that summarizes where and how the source text was warped within the video frame. There are a number of techniques for doing this, which are provided by the OpenCV library. I started off using the SIFT (Scale-Invariant Feature Transform) method but soon found out that the portion of the algorithm which computes Euclidean distance measures (in floating-point) was patented and could not be used without paying a royalty. The same goes for the SURF (Speeded-Up Robust Features) method, which improves upon SIFT but still uses Euclidean distance measures.
My colleague, Dr. Slowich, researched some feature matching methods using non-Euclidean distance measures. FAST (Features from Accelerated Segment Test), BRIEF (Binary Robust Independent Elementary Features), ORB (Oriented FAST and Rotated BRIEF) and BRISK (Binary Robust Invariant Scalable Keypoints) all use Hamming distances to get around the SIFT/SURF patents, with performance improvements that eliminate the need for floating-point computations. However, the accuracy of these non-Euclidean approaches doesn't quite match the results from SIFT and SURF. Slowich determined that BRISK provided the best results for our use case, so I adopted that method. Instead of comparing images in color, I reduced the images to grey-scale before comparing in my version of BRISK to increase the performance a bit for real-time video processing. Here are the top 100 matches for one case:
These feature matching methods (including BRISK) still produce a lot of bad matches which lie outside the majority of good matches. So, to smartly separate the bad from the good, I ran the BRISK results (sorted by match scores) through the RANSAC (RANdom SAmple Consensus) algorithm. [There is also an improved PROSAC (PROgressive SAmple Consensus) algorithm, but I never tried it.] From this cleaned-up set of matches it was possible to compute a fairly accurate homography matrix. Then, using an inverse perspective transform and perspective dewarping, I was able to extract a facsimile image from the video frame that closely matched the size and dimensions of the source image. From this, it seemed like it would be possible to do side-by-side image quality assessments that would also provide a measure of text readability.
There are two double-stimulus measures of image quality assessment (IQA) commonly in use: PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural SIMilarity). I tried both and neither provided useful results. The problem for PSNR, in particular, was that the source and facsimile images need to have pixel-perfect alignment and very similar pixel luminosities for the statistical assessment measure to have much meaning. Both PSNR and SSIM were intended to measure the degradation between the source and facsimile images due to compression artifacts alone (e.g., JPEG or H.264). PSNR was removed from consideration within a day because it relies on absolute error differences. Since SSIM uses relative distance measures, I put a bit more effort into it. First, both images were reduced to grey-scale to remove color balance differences. Then, I used histogram matching and correction to reduce contrast differences. However, SSIM still did not provide very useful measures of readability except when the facsimile was blurred. I gave up on this approach within a week or so.
I then considered using an OCR reader library like Tesseract where the OCR reader would compare the amount of the text read in the video frame against the text read from the source. However, I had previously completed a study of Tesseract with many degraded text image examples and found out that Tesseract does not do well with low contrast, inverted colors or skewed text that would otherwise be perfectly readable by humans. Tesseract was also huge and slow. So, I decided to bypass the OCR reader approach.
During my testing with SSIM, I noticed that the human eye is quite insensitive to contrast and color differences when it comes to readability. What really counts is the pixel height of the rendered glyphs within the image. If the height of a rendered Latin glyph is less than N pixels, it is usually unreadable. (Note that Mandarin glyphs, being considerably more dense, would require a larger readability threshold than Latin glyphs.) This observation gave me a new idea.
There is a class of algorithms that can find text within a scene without actually reading the text. Text has a unique pattern which humans can easily recognize even if it is in a language that one cannot read. Text usually consists of line-stroke glyph blocks arranged along a common baseline. There are scene text detection algorithms that have been trained to recognize this pattern. I looked at a number of these algorithms including TextBoxes, TextBoxes++, SegLink, DeepReg and EAST. Of these, the EAST (Efficient and Accurate Scene Text) DNN (Deep Neural Network) was the fastest and, judging from the published examples and the online demo, it could detect low contrast, inverted color and skewed text with good accuracy.
In the original 2017 paper by Xinyu Zhou et al., the algorithm used the PVANet CNN (Convolutional Neural Network) architecture, organized like this:
Altman then ported it to Google's TensorFlow, swapped in Microsoft's ResNet-50 as the backbone network, and put the code on GitHub. That made it even easier for me to use this EAST detector. In addition, the OpenCV 4.2 library added a DNN module that could read Intel's Optimized DAG (Directed Acyclic Graph) Model format generated by the OpenVINO toolkit. So, I used OpenVINO to convert the TensorFlow version of EAST into Intel's Optimized Model format, which could then be imported into OpenCV 4.2.
This worked quite well (though OpenCV 4.2 did not support GPU acceleration at that point). The results with my tests were better than expected. The text height estimations from EAST were nearly perfect, and I used those estimates to measure text readability. In addition, I no longer needed to do the inverse perspective transform and perspective dewarping. All I really needed was the homography matrix, from which I could get the amount of overall text shrinkage between the source text and the same text in the video frame. And then, with a bit of scaling, compute the rendered text height as shown in the video frame.
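The shrinkage-and-scaling computation can be sketched like this. The homography values and the readability threshold are illustrative, and the scale estimate assumes the homography's perspective terms are small, so that its upper-left 2x2 block behaves like an affine transform.

```python
import numpy as np

# Illustrative homography mapping source-image coordinates into
# video-frame coordinates (roughly a 2x shrink plus a translation).
H = np.array([[0.50, 0.02, 120.0],
              [-0.01, 0.48, 60.0],
              [0.00, 0.00, 1.0]])

# Approximate the overall shrinkage from the linear part of the transform:
# sqrt(|det|) gives the average length scale of an area-preserving estimate.
scale = np.sqrt(abs(np.linalg.det(H[:2, :2])))

source_glyph_height = 24           # glyph height in the source image, pixels
rendered_height = source_glyph_height * scale

MIN_READABLE_HEIGHT = 8            # illustrative "N" for Latin glyphs
readable = rendered_height >= MIN_READABLE_HEIGHT
```

Here a 24-pixel source glyph renders at roughly 12 pixels in the frame, above the illustrative threshold, so the text would be judged readable.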
Only one minor tweak was needed: some of the smallest text occurred at very low frequency, and small amounts of very small text were usually of low importance. So, I computed a histogram of text segments sorted by height and added a condition that ignores the smallest text segment heights of low frequency. I then ran hundreds of readability tests comparing the EAST results against human perception, and the agreement was very good.
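That low-frequency small-text filter can be sketched as follows: bucket the detected glyph heights into a histogram, then walk up from the smallest bucket and discard buckets whose counts fall below a cutoff, stopping at the first well-populated one. The sample heights, bin width, and count cutoff are all illustrative.

```python
import numpy as np

# Glyph heights (pixels) reported by the text detector for one frame:
# a couple of tiny, rare segments, a dominant cluster, and one large item.
heights = np.array([4, 5, 18, 18, 17, 18, 17, 17, 18, 30])

bin_width, min_count = 4, 3
edges = np.arange(0, heights.max() + bin_width + 1, bin_width)
counts, _ = np.histogram(heights, bins=edges)

# Walk up from the smallest bucket; everything below the first
# well-populated bucket is treated as ignorable small text.
cutoff = 0
for i, c in enumerate(counts):
    if c >= min_count:
        break
    cutoff = edges[i + 1]

filtered = heights[heights >= cutoff]
```

Note that the rare 30-pixel segment survives: only *small* low-frequency text is discarded, since large text is important regardless of how little of it there is.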
When OpenCV 4.3 was released, the OpenCV DNN module had been upgraded to handle GPU acceleration via CUDA. Since Nvidia's CUDA is tightly coupled to the GPU and driver in use on a particular computer, and OpenCV is tightly coupled to the particular version of CUDA in use, it was not possible to install a prebuilt OpenCV library. It was necessary to specify a number of CMake flags describing the GPU, driver and CUDA library in use and then build the OpenCV library from source. I did just that and, with a few additional modifications to the code, I got the EAST DNN to run with GPU acceleration. On my Dell Precision laptop equipped with an Nvidia Quadro M2200 GPU and the Nvidia CUDA 10.3 library, I observed an 8X improvement in the EAST detection speed, running at about 11 FPS.
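A CUDA-enabled source build is configured along these lines. The flags shown are illustrative rather than my exact invocation; `CUDA_ARCH_BIN` must match the GPU's compute capability (5.2 for a Maxwell-class Quadro M2200), and the DNN CUDA backend lives in the opencv_contrib modules.

```shell
# Illustrative CMake configuration for a CUDA-accelerated OpenCV DNN build,
# run from a build directory alongside the opencv and opencv_contrib sources.
cmake -D CMAKE_BUILD_TYPE=Release \
      -D WITH_CUDA=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D CUDA_ARCH_BIN=5.2 \
      -D WITH_CUBLAS=ON \
      -D OPENCV_EXTRA_MODULES_PATH=../opencv_contrib/modules \
      ../opencv
```

At runtime, the CUDA path is then selected on the loaded network with `net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)` and `net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)`.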
Since then, better DNN scene text detectors have been released, including the DB detector (real-time scene text detection with Differentiable Binarization, 2019) by Minghui Liao et al., which can detect text along a curved path. However, the EAST detector was good enough for my purposes at the time.