Introduction
The field of computer vision has seen many successful applications of artificial neural networks to various tasks, and further improvements require thinking outside the box. At Visionary.ai, we consider image preprocessing part of any computer vision task; any shortcomings of the original preprocessing, such as noise, blur, or lack of sharpness, will have to be bridged by the Artificial Neural Network (ANN). Decoupling a complicated task into smaller, simpler tasks makes it easier for the ANN to converge.
We propose a new approach to image signal processing. Instead of classical preprocessing followed by an ANN for a specific task, we propose substituting the classical preprocessing algorithms with an ANN that enhances the image, leaving an easier task for the original ANN. We show that our approach improves the performance of a detector, suggesting that current computer vision approaches do not generalize well to imperfect images, even when noise augmentations are added to the training dataset.
We compared the detection results of three pipelines: a classical ISP (image signal processor), a classical ISP with a classical denoising algorithm, and our denoising ANN. We also introduce a novel loss, coined “focus loss”, and show its effect on training.
The test case chosen for our research is license plate recognition in low light. We used off-the-shelf license plate recognition software to test and compare the ISPs, using digit recognition recall and precision as the metrics.
In every image, there are areas of higher importance; specifically, when the task at hand is detection, the objects of interest are self-explanatory. The focus loss is an additional loss, which may or may not share components with the original loss, but unlike it, is active only in the regions of higher interest. We performed four experiments: our regular loss alone, and our regular loss combined with the focus loss at three weighting ratios: 1:1, 1:2, and 1:5.
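As a concrete illustration, below is a minimal sketch of how such a composite loss could be implemented. The text does not specify the loss components, so an L1 reconstruction loss and a binary region-of-interest mask are assumed here; all names are illustrative, and focus_weight corresponds to the second term of the weighting ratio (e.g., 2.0 for the 1:2 experiment).

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, roi_mask, focus_weight=2.0):
    """Regular (global) loss plus a focus loss over regions of interest.

    pred, target: (N, C, H, W) image tensors.
    roi_mask:     (N, 1, H, W) binary mask, 1 inside regions of interest
                  (e.g., license plates), 0 elsewhere.
    focus_weight: weight of the focus term relative to the regular loss,
                  e.g., 2.0 for the 1:2 weighting ratio.
    """
    # Regular loss: mean absolute error over the whole image.
    regular = F.l1_loss(pred, target)

    # Focus loss: the same criterion, but active only inside the mask.
    masked_diff = torch.abs(pred - target) * roi_mask
    n_focus = (roi_mask.sum() * pred.shape[1]).clamp(min=1)
    focus = masked_diff.sum() / n_focus

    return regular + focus_weight * focus
```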
Data Collection
Special datasets were collected to improve the network’s performance on license plate recognition. Bursts of images of idle cars were taken to capture the effects of artificial light on license plates in the expected environment. Because of the dynamic nature of these scenes, the dataset was carefully cleaned.
Video datasets were added to the training to support dynamic inputs. Instead of the usual burst methodology, the videos were captured in daylight and used as the ground truth. The noise characteristics were learned from low-light datasets of the new camera, and synthetic noise was added to the videos to serve as the inputs during training.
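The noise model is not specified in the text; a common choice for this kind of synthesis is signal-dependent Gaussian noise (a Poisson-Gaussian approximation) whose parameters are fitted to the camera’s low-light captures. The sketch below illustrates that assumption; the parameters a and b are hypothetical.

```python
import numpy as np

def add_synthetic_noise(clean, a, b, rng=None):
    """Add signal-dependent synthetic noise to a clean daylight frame.

    clean: float array with values in [0, 1] (the ground-truth frame).
    a, b:  noise parameters estimated from the camera's low-light
           captures; the per-pixel variance is modeled as a * x + b.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Heteroscedastic Gaussian noise: variance grows with signal level.
    std = np.sqrt(np.clip(a * clean + b, 0.0, None))
    noisy = clean + rng.normal(size=clean.shape) * std
    return np.clip(noisy, 0.0, 1.0)
```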
Additionally, bursts of images of parked cars in low light were captured to increase the volume of cars and license plates in the training datasets. An example can be seen in Fig. 4.
In total, 163 license plates were collected and labeled, out of which 143 were part of the training dataset and 21 were part of the test dataset.
Test Benchmark
The test benchmark was created using 28 videos of cars driving on light-isolated roads, collected specifically for evaluation and not used in the training set. The videos include a total of 112 different cars, with each car appearing in 10 frames on average. The videos were manually labeled to create the ground truth.
The test consists of processing the videos with the three approaches mentioned above, passing each processed frame to ParkPaw’s commercial license plate recognition service, and calculating the mean precision and mean recall of the digit recognition for comparison and final evaluation.
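For concreteness, the evaluation loop might look like the sketch below. The ParkPaw service is accessed here through a generic recognize callable, since its actual API is not described in the text; the digit-matching rule (multiset intersection) is likewise an assumption.

```python
def digit_precision_recall(detected, ground_truth):
    """Precision/recall of recognized digits for a single frame.

    detected, ground_truth: strings of digits. Matching digits by
    multiset intersection is an assumption; the exact matching rule
    is not described in the text.
    """
    tp = sum(min(detected.count(d), ground_truth.count(d))
             for d in set(detected))
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

def evaluate(frames, labels, recognize):
    """Mean precision and recall over all processed frames.

    recognize: callable wrapping the license plate recognition service;
               it returns the digit string read from a frame.
    """
    scores = [digit_precision_recall(recognize(frame), gt)
              for frame, gt in zip(frames, labels)]
    mean_precision = sum(p for p, _ in scores) / len(scores)
    mean_recall = sum(r for _, r in scores) / len(scores)
    return mean_precision, mean_recall
```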
Some videos are of convoys of cars, where the cars’ headlights illuminate each other to varying degrees. Thus, a sub-benchmark was created from the benchmark that includes only the darkest examples, such as lone cars or the last car in a convoy.
Results
Here we summarize the results of the experiments and compare them to classical approaches. We present the loss graphs of the four experiments in Fig. 5 and the mean precision and recall in Table 1. Additionally, we compare the results on a subset of the benchmark, including only the darkest and hardest examples, in Table 2.
The focus loss proved beneficial: not only did it improve the loss in the focus areas, it also improved the overall loss, as shown in Fig. 5.
Additionally, the visual results from the various denoisers can be viewed in Fig. 6.
Fig. 6 (below): Visual comparison of different ISPs: no denoiser, classical denoiser, and Visionary.ai’s denoiser.
The most successful version of the denoiser is the 1:2 focus loss experiment. With a mean precision of 88.17% and a mean recall of 87.32%, it shows a decrease of 49.66% in false detections and 47.71% in missed detections compared to the results obtained without a denoiser. Compared to a classical denoiser, we show a significant decrease of 25.36% and 24.39% in false and missed detections, respectively.
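These decrease figures can be related to the precision and recall numbers if one assumes they are relative reductions in error rate, where the false-detection rate is taken as 1 - precision and the miss rate as 1 - recall; the text does not state the exact formula, so this reading is an assumption.

```python
def relative_decrease(ours, baseline):
    """Relative decrease in error rate between two pipelines.

    For false detections, pass precisions; for missed detections,
    pass recalls. The error rate is assumed to be 1 - metric.
    """
    return 1.0 - (1.0 - ours) / (1.0 - baseline)

# Hypothetical usage: with our mean precision of 88.17% and a baseline
# precision p0, relative_decrease(0.8817, p0) gives the decrease in
# false detections.
```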
When we compare the performance of the denoisers on the hardest, darkest examples, the additional benefit of our denoiser becomes more apparent. Table 2 shows a decrease of 46.55% in falsely detected digits and 50.16% in missed detections compared to the classical denoiser, and a decrease of 79.84% and 82.31% in false and missed detections, respectively, compared to detections without a denoiser.
Conclusion
The focus loss proved beneficial for the training of the networks: not only did it improve the loss in the focus areas, it also improved the overall global loss, as shown in Fig. 5. Furthermore, the benefits of the new focus loss to the networks’ performance are consistent with the weight of the focus term in the loss. Thus, the experiments with the weighting ratios of 1:2 and 1:5 show the best results, both loss-wise and detection-wise, as shown in Table 1.
In conclusion, we showcase the importance of preprocessing for classical computer vision tasks. Our novel image processing method improved the performance of off-the-shelf license plate recognition software in low-light conditions. Compared with an ISP using a classical denoiser, we were able to extract more information out of the camera, achieving a decrease of 46.55% in false detections and of 50.16% in missed detections.