Let’s use this image of a cake as an example to illustrate the three primary types of computer vision “recipes”. Side note: these “recipes” are often called “architectures”.
The three primary types of computer vision are classification, detection, and segmentation. While you don’t necessarily need to remember these names, it is helpful to understand what each is capable of doing, so you know which one you might want to use on your own images and videos.
Classification is the simplest type of computer vision. Think of it as just adding a tag or a label to the image to say what’s in the image.
Classification models can have one or many tags, which are called classes. Here’s how that looks using our cake image:
Binary classification model - yes/no or true/false that a class is present in the image
Multi-label classification model - yes/no or true/false that each of one or more classes (or labels) is present in the image
So a simple binary classification model to detect cake in our example image might look like this:
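In code terms, the output of a classification model is just a yes/no answer per class. The snippet below is a minimal sketch of that idea, not a real model: the `scores` dictionary, the class names other than cake, and the 0.5 threshold are all assumptions made up for illustration.

```python
# Hypothetical confidence scores for one image (made-up numbers,
# standing in for whatever a trained model would actually return).
scores = {"cake": 0.97, "candles": 0.88, "balloons": 0.04}


def binary_classify(score: float, threshold: float = 0.5) -> bool:
    """Binary classification: a single yes/no answer for one class."""
    return score >= threshold


def multi_label_classify(scores: dict, threshold: float = 0.5) -> dict:
    """Multi-label classification: an independent yes/no answer per class."""
    return {label: score >= threshold for label, score in scores.items()}


has_cake = binary_classify(scores["cake"])  # one tag: is there cake?
labels = multi_label_classify(scores)       # one tag per class

print(has_cake)
print(labels)
```

Note that neither result says anything about where the cake is or how big it is. The model only attaches tags to the image as a whole.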
The next type of CV is detection, often also called object detection. It’s pretty easy to create and can be used very flexibly.
With a detection model, the goal is to draw a box around the object you’re looking for in the image (these are called bounding boxes, as they represent the “bounds” of the object within the image). Whereas classification can only give you a “yes/no” answer, detection gives you that same answer plus the size and location of the cake in the image.
Like classification, detection models can have one or more classes of object. So you could train a model to just draw boxes around cake, or one that attempts to do the same for both cake and candles at the same time.
So a detection model to just detect cake might look like this:
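To make the “size and location” part concrete, here is a rough sketch of what a detection result could look like as data. The `Detection` class, its field names, and the pixel coordinates are hypothetical; real detection tools use their own formats, though most of them boil down to a label, a confidence, and four box coordinates.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One bounding box, in pixel coordinates (origin at top-left of the image)."""
    label: str
    confidence: float
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    @property
    def width(self) -> int:
        return self.x_max - self.x_min

    @property
    def height(self) -> int:
        return self.y_max - self.y_min

    @property
    def area(self) -> int:
        return self.width * self.height

    @property
    def center(self) -> tuple:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


# Made-up detections for the cake image: one cake and two candles.
detections = [
    Detection("cake", 0.95, 120, 200, 520, 480),
    Detection("candle", 0.81, 250, 130, 270, 200),
    Detection("candle", 0.78, 370, 135, 390, 200),
]

cake = next(d for d in detections if d.label == "cake")
candle_count = sum(1 for d in detections if d.label == "candle")

print(cake.width, cake.height, cake.area, cake.center)
print(candle_count)
```

Because each box carries a position and a size, you can count objects per class or compare where they sit in the image, which a classification tag alone cannot do.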
This leaves the third and most advanced type of CV: segmentation, more formally called image segmentation. Segmentation takes all the pixels that make up an image and groups them based on whether or not they’re a part of the object you’re looking for. This gives you the precise group of pixels that represent your object, opening up more advanced kinds of analysis that we’ll get into later.
Just like classification and detection, segmentation models can have one or more classes of object.
A simple segmentation model to detect cake might look like this:
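As data, a segmentation result is usually thought of as a mask: a grid the same size as the image where each cell says whether that pixel belongs to the object. The tiny grid below is an invented example (a real mask would have one cell per image pixel), and the helper names are assumptions for illustration.

```python
# A tiny 5x6 mask: 1 marks a cake pixel, 0 marks background.
# Values are made up for illustration, not real model output.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]


def pixel_count(mask) -> int:
    """Number of pixels that belong to the object."""
    return sum(sum(row) for row in mask)


def bounding_box(mask):
    """Tightest (x_min, y_min, x_max, y_max) around the object, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return (min(xs), min(ys), max(xs), max(ys))


cake_pixels = pixel_count(mask)
coverage = cake_pixels / (len(mask) * len(mask[0]))  # share of the image that is cake
box = bounding_box(mask)

print(cake_pixels, round(coverage, 3), box)
```

A mask contains everything a bounding box does (the box can be derived from it, as `bounding_box` shows) plus the exact shape and area of the object.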
Segmentation is where things get really interesting. The human eye can differentiate multiple examples of the same object at once and our brains can understand that those objects are distinct from one another. When you look at a lush green tree, you can see all of the leafy area as one big blob of leaves, but you are also able to pick out an individual leaf and know that it’s separate from the other leaves nearby.
With segmentation, we can train a model to do the same thing. This is known as instance segmentation.
With instance segmentation, you can teach a model to understand the difference between one instance of cake and a separate instance of cake in the same image. It might look like this:
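One way to picture the difference in the output: instead of a mask that only says “cake” or “not cake”, each pixel carries an ID for the specific cake it belongs to. The sketch below uses an invented grid with two cakes and only illustrates the output format. It says nothing about how a model arrives at those IDs.

```python
from collections import Counter

# Instance mask: 0 is background, 1 is the first cake, 2 is the second cake.
# Values are made up for illustration.
instance_mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 2, 2, 0],
    [0, 1, 1, 0, 2, 2, 0],
    [0, 1, 1, 0, 0, 0, 0],
]

# Pixel area of each separate cake, keyed by instance ID.
instance_areas = Counter(v for row in instance_mask for v in row if v != 0)
num_instances = len(instance_areas)

# Collapsing the IDs gives back the plain "cake vs. not cake" mask,
# which can report total cake area but not how many cakes there are.
semantic_mask = [[1 if v else 0 for v in row] for row in instance_mask]
total_cake_pixels = sum(sum(row) for row in semantic_mask)

print(dict(instance_areas), num_instances, total_cake_pixels)
```

The collapsed mask still knows there are 10 cake pixels in total, but only the instance mask can tell you they belong to two separate cakes of different sizes.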