We prefer the term “media” as a broad label for both still imagery (i.e. pictures) and motion video. This avoids confusion around the word “imagery”, which is often used to describe only still images and not videos.

When it comes to media, there are a lot of sources out there, and the number is growing every day. To keep things simple, we’ve narrowed it down to a few key concepts you should know as you think about your media and how it can be used with computer vision.

Sensor type

Every type of camera, whether it’s for still images or full-motion video, is first and foremost a sensor for collecting light. The lens focuses incoming light onto the sensor, and that information is stored for later use. At its simplest, that’s what every camera does.

That said, there’s obviously a lot of science involved that we don’t need to get into here. But the important thing to know about sensors is that they usually specialize in capturing a few specific types (or wavelengths) of light.

In short, red-green-blue (RGB) is the most common, and it’s what you’d likely consider “normal” for digital pictures and video. The pictures you take on your phone and the ones you see online are almost always RGB images. But there are other sensors out there that can capture additional types of light, such as infrared (which some night-vision devices rely on).
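
To make the idea of “types of light” a little more concrete, here is a minimal Python sketch of how a single pixel might be described for two different sensors. The class name SensorPixel, the band names, and the intensity values are all hypothetical and chosen only for illustration; real sensors and file formats store this information in their own ways.

```python
from dataclasses import dataclass


@dataclass
class SensorPixel:
    """One pixel: a measured intensity (0-255 here) for each band the sensor records."""
    bands: dict  # hypothetical layout: band name -> intensity

    def has_band(self, name: str) -> bool:
        # True if the sensor that produced this pixel recorded the given band.
        return name in self.bands


# A standard RGB pixel records three bands of visible light.
rgb_pixel = SensorPixel({"red": 200, "green": 120, "blue": 40})

# A hypothetical four-band sensor that also records near-infrared light.
rgb_nir_pixel = SensorPixel(
    {"red": 200, "green": 120, "blue": 40, "near_infrared": 180}
)

print(rgb_pixel.has_band("near_infrared"))      # False
print(rgb_nir_pixel.has_band("near_infrared"))  # True
```

The point of the sketch is simply that a model can only learn from the bands the sensor actually recorded: if the information was never captured, it isn’t in the media.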

When thinking about your media, consider: What am I looking for in this media? What do I want the model to be able to detect? Does the sensor I’m using to capture the media allow for that?

A fair amount of the time, standard RGB imagery will do the job. But if you're looking for something "hidden" from the human eye, you may need to rely on another type of sensor that can detect that kind of light. Think of x-rays, for example: an RGB image of someone's broken arm would only show, well, an arm, not the broken bones hiding within.


Resolution

You’ve probably heard the term “megapixels” used for cameras. A megapixel is one million pixels, and it’s a common unit of measure for media resolution. See our dictionary article on resolution for a deeper dive.

In short, every digital sensor captures media at a specific resolution, and that resolution is measured in pixels. Typically, the higher the resolution, the more detail you can see, but also the more storage space the media takes up.
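
As a rough illustration of the trade-off between detail and storage, the following Python sketch works out the megapixel count and the uncompressed size of a single frame. The function names are hypothetical, and the size calculation assumes three color channels at one byte each, with no compression; formats such as JPEG usually produce much smaller files.

```python
def megapixels(width_px: int, height_px: int) -> float:
    """Resolution in megapixels: total pixel count divided by one million."""
    return width_px * height_px / 1_000_000


def uncompressed_bytes(
    width_px: int,
    height_px: int,
    channels: int = 3,           # assumption: RGB
    bytes_per_channel: int = 1,  # assumption: 8 bits per channel
) -> int:
    """Storage needed for one frame if every pixel value is stored as-is."""
    return width_px * height_px * channels * bytes_per_channel


# A 4000 x 3000 pixel still image.
print(megapixels(4000, 3000))          # 12.0
print(uncompressed_bytes(4000, 3000))  # 36000000 (about 36 MB)

# One frame of 1920 x 1080 video.
print(megapixels(1920, 1080))          # 2.0736
```

Doubling both the width and the height quadruples the pixel count, which is why storage needs grow quickly as resolution goes up.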

When thinking about your media, consider: What resolution do I need in order to see the object(s) I’m interested in? Is the resolution consistent, or does it change?
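
One simple way to answer the consistency question is to tally the resolutions across your media. The sketch below assumes you already have each item’s width and height as a pair of numbers; the function names and the sample values are hypothetical.

```python
from collections import Counter


def resolution_counts(sizes):
    """Count how many media items share each (width, height) resolution."""
    return Counter(sizes)


def is_consistent(sizes) -> bool:
    """True if every item has the same resolution (or the list is empty)."""
    return len(set(sizes)) <= 1


# Made-up example: two full-HD frames and one smaller frame.
sizes = [(1920, 1080), (1920, 1080), (1280, 720)]

print(is_consistent(sizes))                     # False
print(resolution_counts(sizes).most_common(1))  # [((1920, 1080), 2)]
```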


Metadata

Metadata is a broad term for any information that describes your media rather than being part of its visual content. Every sensor collects different types of metadata, but there are some common ones that almost everyone uses, such as a timestamp or resolution. Geospatial media will include metadata about geographic location.

When thinking about your media, consider: What other information does this sensor collect? Do I have access to that information? Would this information be helpful to understanding the problem I’m trying to solve with computer vision?
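
To show how metadata can help you sort through media before any model sees it, here is a minimal Python sketch. The MediaMetadata class, its field names, and the sample records are all hypothetical; the metadata you actually have will depend on your sensor and file format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class MediaMetadata:
    """A hypothetical record of common metadata for one media file."""
    filename: str
    captured_at: datetime              # timestamp
    width_px: int                      # resolution
    height_px: int
    latitude: Optional[float] = None   # geographic location, if recorded
    longitude: Optional[float] = None

    @property
    def is_geotagged(self) -> bool:
        # Geospatial media carries both coordinates.
        return self.latitude is not None and self.longitude is not None


# Made-up sample records for illustration only.
records = [
    MediaMetadata("frame_001.jpg",
                  datetime(2023, 5, 1, 9, 30, tzinfo=timezone.utc),
                  1920, 1080, latitude=38.9, longitude=-77.0),
    MediaMetadata("frame_002.jpg",
                  datetime(2023, 5, 1, 21, 15, tzinfo=timezone.utc),
                  1920, 1080),
]

# Keep only the media that includes a geographic location.
geotagged = [r.filename for r in records if r.is_geotagged]
print(geotagged)  # ['frame_001.jpg']

# Keep only the media captured between 06:00 and 18:00 (UTC).
daytime = [r.filename for r in records if 6 <= r.captured_at.hour < 18]
print(daytime)    # ['frame_001.jpg']
```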
