Computer Vision
It's time to look at another subfield of Artificial Intelligence. Welcome to Computer Vision.
Computer Vision is all about a machine's ability to see. In this pathway so far, we've encountered a few examples. Do you remember that model which could tell the difference between photos of cats and photos of dogs? Or the Text-to-Image models which learn by looking at giant datasets of images?
What we haven't really talked about, though, is how these models are able to do this. It isn't as though computers have eyes... so how are they able to look at images, and interpret what they see?
Computers don't have eyes. But what they do have are cameras and sensors.
You've probably used a digital camera, like the one on your phone, more times than you can count. But have you ever stopped and wondered how that camera works?
When light enters the camera, it hits an image sensor. This sensor is divided into tiny squares – one for each pixel in the final image. The light hitting each square is measured and recorded as numbers that describe its color and brightness.
These numbers can then be used to reproduce the image on a digital screen. You just need to make sure the pixels on that screen are the same color and brightness as the light that entered the camera.
They can also be given to an Artificial Intelligence, which can use these numbers to 'see'. The technical term for this particular process is image acquisition.
So, when we say that an AI is looking at photos of cats and dogs, and learning to tell the difference between them, what it's really doing is analyzing grids of numerical values which represent the color and brightness of individual pixels.
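To make this concrete, here's a minimal sketch (assuming the Pillow and NumPy libraries, and a hypothetical photo file called cat.jpg) showing that, to a program, an image really is just a grid of numbers:

```python
from PIL import Image
import numpy as np

# Open a photo and convert it into a grid of numbers.
image = Image.open("cat.jpg").convert("RGB")  # 'cat.jpg' is a made-up example file
pixels = np.array(image)

# Each pixel is three numbers: red, green and blue brightness, from 0 to 255.
print(pixels.shape)  # e.g. (480, 640, 3) – height, width, color channels
print(pixels[0, 0])  # e.g. [182 154 101] – the color of the top-left pixel
```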
Scientists call this particular process image interpretation. It's the step that sets an AI apart from your phone's camera, which can capture an image but can't understand it.
This interpretation is usually done by a specialized type of neural network called a Convolutional Neural Network. We'll look at those in a lot more detail next time – for now, all you really need to know is that these networks are great at learning to interpret images.
Convolutional Neural Networks can be taught to perform a few different types of image interpretation.
Image classification involves looking at an image and working out what it's showing as a whole – labeling a photo as 'cat' or 'dog', for example. A closely related task, object detection, goes a step further by drawing labeled boxes around the individual objects that appear in an image.
Image segmentation, on the other hand, involves dividing an image into patches of pixels that mark the exact positions of the objects it contains.
There are plenty of other examples, but another interesting one is pose estimation, which analyzes the poses of humans in photos. For example, in a photo from a security camera, does a person look threatening or not?
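To give a feel for what image classification looks like in practice, here's a hedged sketch using PyTorch and its torchvision library to run a pretrained model on a photo. The model choice and the file name dog.jpg are illustrative assumptions, not part of the lesson:

```python
import torch
from PIL import Image
from torchvision import models

# Load a pretrained image classification model (ResNet-18, trained on ImageNet).
weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

# Prepare the photo the way the model expects: resized, cropped and normalized.
image = Image.open("dog.jpg")                     # 'dog.jpg' is a made-up example file
batch = weights.transforms()(image).unsqueeze(0)  # add a batch dimension

# The model outputs a score for each of ImageNet's 1,000 categories.
with torch.no_grad():
    scores = model(batch)
best = scores.argmax().item()
print(weights.meta["categories"][best])  # e.g. 'golden retriever'
```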
Convolutional Neural Networks
Last time, we mentioned Convolutional Neural Networks (CNNs) – a special type of neural network which a lot of the best computer vision systems are based on. Now, we're going to take a look at how these networks actually work.
A CNN is defined by some special hidden layers, which each use something called a kernel. This is basically just a tiny filter that can slide back and forth across a digital image, examining little patches of pixels as it goes.
In the diagram below, you can see how a kernel might examine a series of 3x3 patches as it slides from position 1, to position 2, to position 3.
Importantly, a kernel will always be trained to look for a particular pattern as it slides across an image. For example, a kernel might be trained to look for a corner, or a dot, or maybe a pattern like the one you can see below.
For every patch of pixels it slides across, the kernel will output a numerical score. This score describes how closely that particular patch of pixels matched the pattern it was looking for.
By the time it's scanned every patch of pixels, we'll have a sheet of numerical scores. This sheet is like a map of the image. Where the numbers are high, we know that the pattern occurred.
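Here's a minimal sketch of that sliding-and-scoring process, written in plain NumPy with a made-up image and a hand-written kernel (a real CNN would learn its kernel values during training):

```python
import numpy as np

# A tiny grayscale "image": high numbers are bright pixels.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 9, 9, 9, 0],
    [0, 9, 0, 0, 0],
    [0, 9, 0, 0, 0],
    [0, 0, 0, 0, 0],
])

# A 3x3 kernel that responds to a bright top-left corner shape.
kernel = np.array([
    [1,  1,  1],
    [1, -1, -1],
    [1, -1, -1],
])

# Slide the kernel over every 3x3 patch and record a score for each position.
h, w = image.shape
scores = np.zeros((h - 2, w - 2))
for row in range(h - 2):
    for col in range(w - 2):
        patch = image[row:row + 3, col:col + 3]
        scores[row, col] = (patch * kernel).sum()

# High scores mark where the pattern occurred – this sheet is our 'map'.
print(scores)
```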
In a convolutional neural network, this kernel-scanning doesn't only happen once.
Each hidden layer has its own kernels – usually many of them – each trained to look for a particular pattern. Every kernel that scans the image produces another sheet of numerical scores.
Typically, the kernels in the first few layers will search for relatively simple patterns, like edges, corners and dots. As we get deeper, the kernels search for complex patterns, like shapes or textures. Some might even search for specific objects, like faces, cars, or buildings.
By the time all these kernels have scanned the image, we might have thousands of sheets of scores. Each sheet is known as a feature map, and together they form a large, complex numerical model which represents the image as a whole.
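In code, those stacked layers can be just a few lines. Here's a minimal sketch in PyTorch – the layer sizes are arbitrary illustrative choices, not a recommended architecture:

```python
import torch
from torch import nn

# A small CNN: each Conv2d layer holds many learnable kernels,
# and each kernel produces one sheet of scores (a feature map).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),   # 16 kernels scan the RGB image
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3),  # 32 kernels scan those 16 feature maps
    nn.ReLU(),
)

# A fake batch of one 64x64 RGB image, just to show the shapes.
image = torch.randn(1, 3, 64, 64)
features = model(image)
print(features.shape)  # torch.Size([1, 32, 60, 60]) – 32 feature maps
```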
In other words, a CNN is just a way to turn all that intangible data in a real-life image into an interpretable model that computers are able to work with.
CNNs aren't the only way to go about this, but they're a really popular approach to Computer Vision. These models can learn to use their kernels in all kinds of contexts, like recognizing faces, interpreting medical scans, reading handwritten documents, and more.
Other types of perception
Along with vision, humans have a number of other senses. We can hear things, smell things, taste things, touch things. But can computers do the same?
Machine Listening is a catch-all term for computer hearing systems. These systems are based on the same principles as vision, only a microphone sensor is used in place of a camera.
Machine Olfaction is the term for smelling systems, which use sensors to detect airborne chemicals. We sometimes call these 'electronic noses' – one model developed in Sydney, Australia, can tell the difference between whiskies just by giving each one a sniff.
As well as electronic noses, scientists have also developed electronic tongues. We call this field Machine Taste – where electronic noses detect airborne chemicals, these tongues detect chemicals in solids and liquids instead.
Last but not least, we have Machine Touch. There are a few different ways to go about this. Some researchers have developed whisker-like sensors that measure pressure at the tip of each whisker.
More complex approaches use electronic skin, which can detect patterns of pressure over wider areas thanks to an embedded array of sensors. The most cutting-edge examples of electronic skin can even use their sense of touch to measure an object's temperature.
All of these different modes of perception have different real-world uses.
As we've already seen, Computer Vision is great at sorting through images. It's also a useful tool in security systems, with its ability to identify human intruders automatically.
Machine Listening can also be used for security, as it automatically reacts to loud or suspicious sounds. It's also important for Natural Language Processing. If you want an AI to understand a voice command, it needs to be able to hear it.
Machine Olfaction can detect dangerous chemicals, like carbon monoxide, in the environment. Machine Taste can be used in food testing, checking for signs of contamination. And Machine Touch is great for factory robots – a sense of touch makes it easier to handle items.
In addition to all these human-like senses, some AI models can also perceive the world around them using methods that humans aren't capable of.
LiDAR (which stands for Light Detection and Ranging) is a technology that uses lasers to generate 3D maps of its surroundings.
It works by sending out laser pulses, then measuring how long each pulse takes to bounce back after hitting an object. It's great for something like an AI-driven car, which needs to keep track of moving objects all around it.
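The distance calculation behind this is simple: light travels at a known speed, so the round-trip time of a pulse tells you how far away the object is. Here's a quick sketch with a made-up echo time:

```python
# Speed of light in meters per second.
SPEED_OF_LIGHT = 299_792_458

def lidar_distance(round_trip_seconds: float) -> float:
    """Distance to an object, given a laser pulse's round-trip time.

    The pulse travels out and back, so we halve the total distance.
    """
    return SPEED_OF_LIGHT * round_trip_seconds / 2

# A pulse that echoes back after 200 nanoseconds (a made-up reading)
# hit something about 30 meters away.
print(lidar_distance(200e-9))  # ≈ 29.98 meters
```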
Basically, there are lots of different ways for an AI to perceive its surroundings. But they all come down to a similar principle: some kind of sensor will receive an input, then translate this input into a digital version. This digital version is what the AI is able to work with.