Submitted by Oliver Hamilton – Director of Computer Vision – COSMONiO
Deep learning is an incredibly flexible technology that can be used for a plethora of tasks. In this post I’m going to show how three of the most common supervised methods can be used to detect and localise a cat in CCTV footage.
The problem: I want to be notified when a cat enters the house.
The first decision is whether to solve this with hardware or with software.
I could use a multitude of different sensors with an Arduino or ESP8266 and integrate it with the existing catflap. While this would be an interesting project, it would also likely have the appearance of a ‘hobby’ grade solution rather than a production level device.
There is already a sensor in place that I can make use of, a CCTV camera, which is just a fancy array of light sensors.
Solution: Use Machine Learning (ML) based Computer Vision (CV) to warn when a cat enters the house.
ML and CV are broad fields which provide many approaches. I’ll focus on three common supervised machine learning approaches: classification, object detection, and segmentation.
Each of these approaches was designed to solve different challenges but we can use any of them to achieve the goal of sensing when a cat enters the house.
For each case I need to build a dataset to train a model. I’m going to be using the COSMONiO NOUS platform for both annotation and training of all three models. The table below indicates how much effort is required to annotate the images for each task type. The higher the annotation effort, the more information we are imparting to the network; this added knowledge allows us to extract more data from the network.
| | Classification | Object Detection | Segmentation |
|---|---|---|---|
| Localisation Accuracy | None* | High | High + Shape |
To create a dataset for a classification task we simply need to apply a ‘global’ label to each image in the training set. This tells the network that a given image either contains a ‘cat’ or ‘no cat’. We provide no other information than this; it’s up to the network to learn which features are useful for classifying the images into each class.
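Because the label applies to the whole image, the dataset can be as simple as a list of (image path, label) pairs. A minimal sketch, assuming a hypothetical folder layout where each class has its own directory (e.g. `root/cat/` and `root/no_cat/`):

```python
from pathlib import Path

def build_classification_labels(root):
    """Pair each image with a 'global' label taken from its folder name.

    Assumes a hypothetical layout like root/cat/*.jpg and root/no_cat/*.jpg;
    the folder name becomes the image's single whole-image label.
    """
    pairs = []
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            for image in sorted(class_dir.glob("*.jpg")):
                pairs.append((str(image), class_dir.name))
    return pairs
```

No geometry is recorded anywhere: this is exactly the “one label per image” annotation effort the table above refers to.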
Unlike classification, we specifically tell the object detection network where cats are within the image. Building this dataset takes more effort as we have to draw a bounding box around the cat in each image of the training set. The output is also a bounding box which allows us to track the motion of the cat.
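Each annotation is now a box plus a class name rather than a whole-image label. A minimal sketch of writing these out in the `path,x1,y1,x2,y2,class_name` CSV layout that the keras-retinanet CSV generator accepts (the file names and coordinates below are made up for illustration):

```python
import csv

def write_detection_annotations(rows, csv_path):
    """Write bounding-box annotations as path,x1,y1,x2,y2,class_name rows.

    Coordinates are absolute pixels, with (x1, y1) the top-left corner and
    (x2, y2) the bottom-right corner of the box around the cat.
    """
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for image_path, (x1, y1, x2, y2), label in rows:
            assert x1 < x2 and y1 < y2, "box must have positive width/height"
            writer.writerow([image_path, x1, y1, x2, y2, label])

# e.g. one frame with the cat boxed between (120, 80) and (340, 260):
# write_detection_annotations(
#     [("frames/frame_0001.jpg", (120, 80, 340, 260), "cat")], "train.csv")
```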
This is the most time consuming of the three. Here the dataset consists of delineations around the object of interest. This is a time-consuming process where we need to draw an outline around the cat. However, we are now imparting a significant amount of information to the network. Each segmentation outline not only locates the cat within an image, like object detection, it also provides information about the object’s shape and size.
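One concrete example of the extra information a segmentation outline carries: its enclosed area gives the object’s apparent size in the frame, which a bounding box only approximates. A minimal sketch using the shoelace formula on the outline’s vertices:

```python
def polygon_area(outline):
    """Area enclosed by a segmentation outline, via the shoelace formula.

    `outline` is a list of (x, y) vertices in drawing order; the result is
    in square pixels, so apparent size can be compared between frames.
    """
    area = 0.0
    n = len(outline)
    for i in range(n):
        x1, y1 = outline[i]
        x2, y2 = outline[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0

# A 10x10 pixel square outline encloses an area of 100:
# polygon_area([(0, 0), (10, 0), (10, 10), (0, 10)])  # -> 100.0
```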
Training Inception-v3 to create a ‘cat’ / ‘no cat’ model, it quickly reaches 98% accuracy. On this unseen image (of our cat wearing his cone-of-shame after a fight) it correctly classifies it as, in fact, containing a cat with 94% confidence (top left of image).
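That confidence figure comes from applying a softmax to the network’s raw output scores (logits), which turns them into values that sum to one. A minimal sketch with hypothetical logits for one frame:

```python
import math

def softmax_confidence(logits, labels):
    """Turn raw network outputs (logits) into per-class confidences."""
    shifted = [v - max(logits) for v in logits]  # shift for numerical stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical logits where the 'cat' score dominates:
scores = softmax_confidence([2.9, 0.2], ["cat", "no cat"])
# scores["cat"] is roughly 0.94
```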
Previously I said that classification does not provide any localisation. While it’s true that the network doesn’t directly output any localisation data, we can still recover something. Using Grad-CAM++ we can delve into the network and highlight which regions of the image the classification network found most important to its decision to apply the label ‘cat’.
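To give a feel for what these methods compute, here is a minimal sketch of plain Grad-CAM (the simpler predecessor of Grad-CAM++): each channel of the last convolutional layer is weighted by the average gradient of the class score with respect to it, and the positive parts of the weighted sum form the heatmap. The feature maps and gradients would come from the trained network’s forward and backward passes; here they are just arrays:

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from last-conv-layer activations and gradients.

    feature_maps: (C, H, W) activations of the last conv layer.
    gradients:    (C, H, W) gradient of the 'cat' score w.r.t. those maps.
    Returns an (H, W) map, scaled to [0, 1], of the most influential regions.
    """
    weights = gradients.mean(axis=(1, 2))              # per-channel importance
    cam = np.tensordot(weights, feature_maps, axes=1)  # weighted sum over C
    cam = np.maximum(cam, 0)                           # keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

The resulting low-resolution map is upsampled to the input size and overlaid on the image to produce visualisations like the one above.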
These images should be taken with a pinch of salt. It is possible the network learns to classify images based on a completely unexpected feature in the image; in a different use-case, the presence of a shadow rather than the object itself might be the feature the network learns to recognise.
If accurate localisation and tracking is important then object detection is the better solution.
Using the same source images and videos we can quickly rebuild a new dataset to train Keras RetinaNet, this time drawing a bounding box around the cat in each image.
Looking at these results you can clearly see this provides more definitive localisation data. This can then easily be used to extract further information such as movement – is he heading towards or away from the cat flap?
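Given the boxes from two consecutive frames, the direction question reduces to comparing box-centre distances from the cat flap. A minimal sketch, where the flap position is an assumed fixed pixel coordinate (it would be measured once for a static CCTV camera):

```python
def heading_towards(prev_box, curr_box, flap_xy):
    """Decide whether the cat is moving towards the cat flap.

    Boxes are (x1, y1, x2, y2) in pixels; flap_xy is the flap's (assumed)
    fixed position in the frame. Returns True when the box centre has
    moved closer to the flap between the two frames.
    """
    def centre(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2, (y1 + y2) / 2)

    def dist_sq(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    return dist_sq(centre(curr_box), flap_xy) < dist_sq(centre(prev_box), flap_xy)
```

In practice you would smooth the centres over several frames before deciding, since per-frame detections jitter.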
Finally, moving up the information content tree, we get to segmentation. Generally, creating segmentation datasets is a more laborious task. It typically takes longer to draw the outline of an object than it does to draw a box or just add a label to the whole image. In some cases we can make use of traditional Computer Vision tools such as GrabCut to make life a little easier! Here we are going to train the good old U-Net implemented in PyTorch.
Segmentation models typically produce a mask that highlights the pixels of the image that belong to a particular class. These can then be processed in several ways, here we generate a contour from the mask.
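The mask-to-contour step can be done thoroughly with tools like OpenCV’s `findContours`; a minimal NumPy sketch of the underlying idea is to keep only mask pixels that have at least one background 4-neighbour:

```python
import numpy as np

def mask_to_contour(mask):
    """Extract the boundary pixels of a binary segmentation mask.

    A pixel is on the contour if it belongs to the mask but at least one
    of its 4-neighbours does not (i.e. the mask minus its interior).
    """
    m = mask.astype(bool)
    padded = np.pad(m, 1, constant_values=False)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &   # up & down neighbours
                padded[1:-1, :-2] & padded[1:-1, 2:])    # left & right
    return m & ~interior
```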
Like object detection, segmentation models enable accurate localisation of the object within the image. They also accurately outline the cat, enabling detailed information about its structure to be extracted, such as, in this case, the tail.
You’ve seen how different deep learning architectures can be used in different ways to solve a problem, and how, depending on the level of fidelity needed and the effort available to build a dataset, you can take different approaches.
When embarking on a production level project there will be other factors that I have not covered here, such as inference speed, processing hardware limitations and power consumption limitations for edge deployment.
For more information about the COSMONiO NOUS platform used to annotate and train all these models visit https://www.cosmonio.com/
For more information about edge deployment check out https://goto50.ai/2020/04/15/what-is-edge-computing-and-how-it-empowers-computer-vision/
If you want to have a go at edge deployment, follow this guide