Kinnu

Machine Learning

What is machine learning?

Last time, we saw how symbolic AI has a major limitation. It relies on manual programming – and manual programming can only take us so far. In order to build more complex AIs, we need to use a different approach.

But what's the alternative to manual programming?

To answer that question, we can look at a human brain. Your own brain (obviously) wasn't manually programmed. Instead, you filled it with facts and ideas via a process of gradual learning. Taste an apple? That taste was added to your brain. Ride a bike? That skill was automatically stored.

Over the last few decades, many scientists have taken the same approach to AI. They call this approach machine learning, and it's literally changed the world.

We'll get into the complexities of machine learning later, but here's a simple definition. An AI is said to 'learn' when you give it an input – like a big list of numbers, or some photos, or videos – and it changes its behavior afterwards. As we said, this is how humans learn as well. We experience an input, then we change.

Let's run with that photo example. Imagine you've built an AI that struggles to tell the difference between cats and dogs. You decide to show it a pile of animal photos and tell it over and over: "this one’s a cat, this one’s a cat, this one’s a dog, this one’s a dog..."

Cats and dogs.

Your AI starts to notice patterns in the photos, like the shape of the ears or the whiskers. And by the end, it's pretty good at recognizing cats and dogs on its own.

Learning isn't the only way to get an AI to recognize cats. Alternatively, a scientist could write symbolic rules ("fluffy tail", "pointy ears", "long whiskers", etc.), then manually build these rules into the AI's design.

But it's often a lot faster, and a lot more effective, to take a learning approach instead.

Let's think about another example: an AI that's designed to translate text. Instead of manually programming all the words from every language, you can train the AI on a pile of texts which have already been translated. As it works through those texts, and compares the languages, it can learn how words and phrases in one language map onto another.

This is exactly what Google did in 2016. They built an AI, then trained it on millions of translated texts. Whenever you load up Google Translate, you're interacting with an impressive, self-taught AI.

Google Translate. Image (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

Generalization

So, we know that machine learning will often be faster and easier than manual programming. But there's also another important benefit. If it's done correctly, this approach can actually allow a model to 'think outside the box'.

Imagine that you manually programmed an AI to close your window when it’s rainy, and to open your window when it’s sunny. Then one day it snows, and the AI doesn’t know what to do.

This is another big drawback of symbolic AI. However much time you spend on the programming, you might struggle to think of every possible scenario. And if you forget a scenario – if you don't program a rule – your AI model won't know how to respond.

Snow. Image via Pexels

Compare that to machine learning, where an AI model has watched thousands of videos of people opening and closing their windows.

None of these videos have snow in them – but the AI has noticed some patterns. People usually close their windows when something falls from the sky. Especially when that something is cold and wet.

Based on these patterns, the AI is able to extrapolate: the snow appears to be cold and wet, so it must be time to close the window.

That snow scenario is an example of generalization. In the context of AI, this term refers to a model's ability to handle data it hasn't seen before, as opposed to just handling the data it originally learned from.

Humans are pretty good at this. Imagine showing a child some pictures of dogs: big dogs, small dogs, furry dogs, short-haired dogs. Later, in the park, they see a breed of dog that wasn't in the photos. But they still know it's a dog, because they have a 'generalized' understanding of the data.

Image via Pexels

Machine learning allows an AI model to gain a generalized understanding too. This is extremely useful – as we said, it's hard to predict (and manually program) every possible scenario that an AI model might face.

Parameters and loss

Like everything in AI, machine learning is essentially a computer program which mimics a human-like process. But how does this program work?

You can think of it as a template with some numbers attached. AI scientists refer to these numbers as parameters.

A (very) simple template with parameters.

When data is fed into this template, those parameters automatically adjust. The numbers will either get higher or lower, depending on the nature of the data. Just imagine someone sitting there, adjusting dials up and down, as the different bits of data come through.

We'll look at an example in a second. But in simple terms: when we say that a machine is 'learning', we really just mean that these parameters are being nudged up and down.

Parameters changing up and down.

Here's that example we promised. Let's imagine you're building an AI model which can predict the price of a pizza based on its size.

Your template has parameters for all the different sizes: 10-inch pizza, 11-inch pizza, 12-inch pizza, and so on. These parameters all have prices attached, which you've estimated yourself.

Estimated pizza prices.

You then give the AI some menus from local pizza restaurants. As each menu comes through, the AI tweaks its parameters, and the price for each pizza size gradually goes up or down.

For example, if one menu showed a 10-inch pizza for $8.50, the AI would tweak that original parameter down a couple of notches. If the next menu showed a 10-inch pizza for $9.50, it tweaks the number up again. Every time it does this, it gets closer and closer to an average price for a 10-inch pizza in your area.

Adjusting prices up and down.
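
Here's roughly what that nudging looks like in code – a minimal sketch in Python, with made-up starting prices and menu data. The 'learning rate' is an assumption we've added to set the size of each notch:

```python
# A minimal sketch of 'learning' as parameter adjustment (made-up data).
# Each parameter is our current price estimate for one pizza size.
parameters = {10: 10.00, 11: 11.00, 12: 12.00}

learning_rate = 0.1  # how big each 'notch' of adjustment is

# Made-up training data: (size in inches, observed price) from local menus.
menus = [(10, 8.50), (10, 9.50), (11, 10.00), (12, 11.00)]

for size, observed_price in menus:
    gap = observed_price - parameters[size]
    parameters[size] += learning_rate * gap  # nudge toward the observed price

print(parameters)  # the estimates have drifted toward the real menu prices
```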

By the end, the AI will have a new set of numbers: it dropped from $10.00 to $9.00 for a 10-inch, from $11.00 to $10.00 for an 11-inch, and from $12.00 to $11.00 for a 12-inch.

Pizza prices.

In other words, this AI has 'learned' some average prices for pizzas being sold in your area. And if you asked it to estimate the cost of, say, a 14-inch pizza, it would hopefully give you a decent answer. In this case, maybe $13.00.

You can check if that number is accurate by looking in another pizza menu. You glance through the options, and find that this particular pizza restaurant is selling 14-inch pizzas for $13.25. That's not exactly the same as your AI's estimate, but all things considered, it's pretty close.

There's actually a name for this 'gap' between the AI's prediction ($13.00) and the real-life data ($13.25). Scientists call it the loss, and they measure it with something called a loss function – an important part of machine learning.

The smaller the loss, the better your machine has learned. In an ideal scenario, the loss would be zero – your AI would have predicted $13.25 in the first place.

To reduce the loss for this particular AI, you could give it more menus to learn from. In theory, with every piece of data it encounters, it will get closer and closer to the 'truth'. It's not always that simple, but as a general rule, larger datasets usually lead to more effective AI models.
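
Here's one simple way that gap might be measured in code. Mean squared error is a common choice of loss function, though real projects pick whichever loss suits the task:

```python
# A sketch of measuring the 'gap' with a loss function.
def mean_squared_error(predictions, actuals):
    gaps = [(p - a) ** 2 for p, a in zip(predictions, actuals)]
    return sum(gaps) / len(gaps)

# The AI predicted $13.00 for a 14-inch pizza; the menu said $13.25.
print(mean_squared_error([13.00], [13.25]))  # 0.0625 - a small loss
```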

Pizza by Valerio Capello, English Wikipedia (CC BY-SA 3.0) <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons

Supervised learning

When you want an AI to learn from a dataset, like in that pizza example we talked about last time, there are a few different approaches you can take.

One of these approaches is supervised learning. This is when you train your AI using a labeled set of data. When we say 'labeled', we mean that the data has all been carefully arranged into something called input-output pairs.

That's what we did with those pizzas last time. The input side of the pair was the size of each pizza (e.g. 10-inch) and the output side of the pair was the price of each pizza (e.g. $9.00).

In this example, our labeled data is essentially just a list of these size-price (input-output) pairs.

An input-output pair will hopefully have some kind of relationship or rule. In our pizza example, the rule went like this: the output number (price) was always one less than the input number (size).

Pizza prices.

In a lot of cases, we won't actually know that relationship or rule ourselves. Instead, we want the AI to look at the input-output pairs, and search for patterns that help it to establish the relationship or rule by itself.

Once it's established the rule, it can start using it. For example, we could give an input number to our pizza AI: "5-inch". Using the rule above, the AI could predict the output number: "$4.00".

In plainer language, we could ask the AI, "How much do you think a 5-inch pizza would cost?" and it could reply, "It will probably cost $4.00".
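
You can picture that learned rule as a tiny function – a sketch, assuming the model really has settled on the 'one less than the size' rule:

```python
# The learned pizza rule as a function, assuming the model really did
# settle on "price is one less than the size".
def predict_price(size_in_inches):
    return size_in_inches - 1.0

print(predict_price(5))  # 4.0 - "It will probably cost $4.00"
```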

Here's another example. We could give our AI a bunch of emails to look at. Half of these emails are labeled 'spam', and the other half are labeled 'not spam'. This time, the input is the email itself, and the output is either 'spam' or 'not spam'.

Using this labeled data, the AI can now learn some rules or patterns that distinguish spam from non-spam. For example, it might notice that an input email with more spelling mistakes, or strange punctuation, is typically linked to an output label saying 'spam'.

Later, whenever we give this AI a random email, it can use these rules to sort it. Lots of spelling mistakes? Or unusual punctuation? The email is sent to spam.

Spam or not spam? Image: gmcgcc, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
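
As a concrete sketch, here's what that kind of supervised training might look like using the scikit-learn library's naive Bayes classifier – one of many possible approaches. The emails and labels here are invented, and a real spam filter would learn from many thousands of examples:

```python
# A minimal supervised-learning sketch using scikit-learn (invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "WIN a FREEE prize!! click now!!",    # spam-like: odd spelling, punctuation
    "Meeting moved to 3pm, see agenda",   # normal
    "You have recieved $$$, claim here",  # spam-like
    "Lunch tomorrow? Let me know",        # normal
]
labels = ["spam", "not spam", "spam", "not spam"]

# Turn each email (input) into word counts the model can learn from.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)  # learn patterns linking inputs to output labels

new_email = vectorizer.transform(["Claim your FREEE prize now!!"])
print(model.predict(new_email))  # hopefully: ['spam']
```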

Supervised learning is great in scenarios like that email example – when there's a clear pair of inputs and outputs.

Here are some other examples: when you give your AI some text in one language (input) then get a translated version (output). When you play a piece of music to your AI (input) then get the title of that song (output).

And there's that cat and dog example we mentioned earlier on: you can give the AI a photo of an animal (input) then get the name of that animal (output).

Cats and dogs.

This isn't the only approach to machine learning. We'll look at some other types later. But it's a very common approach right now, with thousands of different uses.

Unsupervised learning

The main alternative to supervised learning is (can you guess it?) unsupervised learning.

This time, you won't be giving your AI a set of nicely labeled pairs. Instead, you'll be giving it an unstructured pile of raw, unlabeled data.

This data might actually follow some interesting rules and patterns. But as a human, you don't know what they are. The data is too messy, too large, too confusing. There's certainly nothing as intuitive here as a simple input-output pair.

So you ask your AI "are there any patterns in here?" and see if it can learn anything useful.

Typically, an unsupervised learning model will use a technique called clustering. This technique involves sorting data into groups based on apparent similarities and differences.

It's like giving a child a handful of marbles, then asking them to sort them. They may start to sort them by size, or color, or opacity, or weight, or whatever other patterns they come up with. An AI can do the same with data, seeking patterns, then learning from them.

Here's a real-world example. Imagine that you run a gym. You ask an AI to look for patterns in a database of all your members.

It finds some interesting clusters: people who live in the east of the city seem more interested in taking yoga classes, whereas people who live in the west of the city are more interested in taking spin classes. Why? You have no idea. But later, when you open a new gym in the east, you make sure it specializes in yoga.

Clustered data. Image: BogdanShevchenko (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons
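
Here's a rough idea of what clustering looks like in code, using scikit-learn's k-means algorithm (one common clustering technique). The gym data is invented, with each member boiled down to two numbers:

```python
# An unsupervised clustering sketch with scikit-learn (made-up gym data).
from sklearn.cluster import KMeans

# Each row is one member: [how far east they live (km), yoga classes taken]
members = [
    [8.0, 12], [7.5, 10], [9.1, 15],   # east-side, yoga-heavy
    [1.2, 1],  [0.8, 0],  [2.0, 2],    # west-side, little yoga
]

# No labels here - we just ask for two groups and let the model find them.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
print(kmeans.fit_predict(members))  # e.g. [0 0 0 1 1 1]
```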

Another technique that a model might use is association. This involves looking for connections between pieces of data. Often, these connections are sequential: if X happens, then Y tends to happen next.

As another example, imagine that you run a streaming service like Netflix. You ask an AI to look for patterns in a database of all your members. It finds an association: after watching a scary horror movie, it's common for people to calm down with their favorite rom com.

This knowledge helps you adjust your service – you make sure that rom coms are typically suggested after horrors.
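
Real association mining uses dedicated algorithms (Apriori is a classic one), but the core idea can be sketched with a simple tally. The viewing histories below are invented:

```python
# A bare-bones association sketch: what do viewers watch after a horror film?
from collections import Counter

# Invented viewing histories, as (title, genre) pairs in watch order.
histories = [
    [("It", "horror"), ("Notting Hill", "romcom")],
    [("The Ring", "horror"), ("Love Actually", "romcom")],
    [("Alien", "horror"), ("Die Hard", "action")],
]

follow_ups = Counter()
for history in histories:
    # Walk through each pair of consecutive viewings.
    for (title, genre), (next_title, next_genre) in zip(history, history[1:]):
        if genre == "horror":
            follow_ups[next_genre] += 1

print(follow_ups)  # Counter({'romcom': 2, 'action': 1})
```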

Unsupervised learning is incredibly powerful for exploratory analysis. It can identify all kinds of rules and patterns that you never would have thought of on your own. But remember: these rules will sometimes be bizarre and useless.

Your AI might notice that people born in November like movies about dogs, while people born in September prefer movies which feature cats – but only if those cats are black. For whatever reason, this pattern might genuinely be present in your data.

Black cat by Dmitry Makeev (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

But is it useful? That's for you to decide.

It's worth pointing out that unsupervised learning generally needs a massive dataset to work well. It's also more computationally complex than supervised learning – it takes a lot more computing power.

There's a trade-off here. Unsupervised learning is more work for the computer, but it's much less work for a human. Supervised learning, on the other hand, is easier for the computer, but it takes time and effort on the human side to sort and label the data.

A scientist will generally choose between them based on the ultimate goal of their project. If they're looking to find some unexpected patterns in a giant dataset, they'll use unsupervised. If they just want to find some simple relationships between inputs and outputs, they'll use supervised learning instead.

Reinforcement learning

Along with supervised learning and unsupervised learning, another popular approach to machine learning is something called reinforcement learning.

This one is pretty different from the others, as it doesn't require a dataset. Instead, you're going to be putting your AI in a closed environment, where it can learn via rewards and punishments.

These rewards and punishments are just positive values and negative values which are assigned to different actions. Whenever the AI performs an action, the corresponding value will tell the AI whether to perform that action again.

Fundamentally, it's just like operant conditioning. But instead of using rewards and punishments to train rats and monkeys, it's being used to train an AI.

Reinforcement learning would be a great approach if you wanted to teach an AI model how to play a game of chess.

Chess by User:Cburnett (CC BY-SA 3.0) <http://creativecommons.org/licenses/by-sa/3.0/>, via Wikimedia Commons

You'd need to build an environment where the AI can play chess matches over and over and over. Each win is associated with a positive value; each loss is associated with a negative value.

By the time it's played thousands of matches, this AI might have learned which sets of moves are most likely to lead to a positive, winning outcome. It might also have learned which moves to avoid, or which tactics are too risky to pull off.

Reinforcement learning can also be used in other scenarios, like teaching a robot how to walk. Assign a positive value to a successful step, and a negative value if the robot falls to the floor.
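
To give a flavor of how those values drive learning, here's a minimal sketch of Q-learning – one classic reinforcement learning algorithm – on a toy version of the walking task. The environment is invented: the robot stands on a line of positions, reaching the end earns a reward, and stepping off the start counts as a fall:

```python
# A toy Q-learning sketch: positions 0..4, goal at 4 (+1), fall at -1 (-1).
import random

GOAL, FALL = 4, -1
ACTIONS = [+1, -1]  # step forward or step back

# The Q-table stores a learned value for each (position, action) pair.
Q = {(pos, a): 0.0 for pos in range(5) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration

for episode in range(500):
    pos = 0
    while pos not in (GOAL, FALL):
        # Mostly pick the best-known action, sometimes explore at random.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(pos, a)])
        new_pos = pos + action
        reward = 1 if new_pos == GOAL else (-1 if new_pos == FALL else 0)
        future = 0 if new_pos in (GOAL, FALL) else max(
            Q[(new_pos, a)] for a in ACTIONS)
        # The core update: nudge the value toward reward + discounted future.
        Q[(pos, action)] += alpha * (reward + gamma * future - Q[(pos, action)])
        pos = new_pos

# After training, the best action at every position should be "step forward".
print([max(ACTIONS, key=lambda a: Q[(p, a)]) for p in range(4)])  # [1, 1, 1, 1]
```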

In general, you'd use reinforcement learning if you wanted your AI to learn how to deal with dynamic, real world situations. The kinds of situations that can't easily be summed up in a database.

Another example would be training a self-driving car. Instead of getting that AI to look for patterns in data about collisions and traffic codes, it might work better to build an artificial environment where the AI can simulate driving around, and learn through trial and error.

Self-driving car by Grendelkhan (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

Having said that, there might also be value in getting an AI model to learn some basic principles using supervised learning, then graduating it to reinforcement learning later. This approach was actually how AlphaGo was trained to play the game of Go.

AlphaGo initially learned from a dataset: it observed millions of real-life moves made by expert human Go players. This was a labeled dataset – for each board position (input), the model could see which move an expert chose to play (output).

After absorbing enough basic rules and patterns through this supervised stage, the model moved to an environment where it played thousands of Go matches against versions of itself, getting better and better through reinforcement.

This combined approach worked wonders. By the end, AlphaGo had learned enough to defeat the best Go players in the world.

Effectiveness of data

A lot of people are (rightly) excited about modern machine learning.

As we talked about earlier, it lets us build complex, powerful models without needing to program all their rules by hand. It also lets us build flexible, adaptive models, which can change and evolve, and handle unexpected problems.

But there's one important challenge that we need to be aware of: availability of data.

Data. By Chiffre01 (CC BY-SA 4.0) <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

While reinforcement learning doesn't need a dataset, supervised learning and unsupervised learning do. And this can't just be any dataset – it needs to be relevant to the task that you want your AI to learn.

As well as a relevant dataset, you'll also need a large dataset. AI learns best when it's able to repeat things thousands and thousands of times.

It's hard to put an exact number on the amount of data that you'll need. One general rule is that you'll need ten pieces of data (for example, ten photos) for each parameter in your model. If you had a model with 200 parameters? You'd need 2000 pieces of data.

That rule is pretty arbitrary. It all depends on the model, and what you want it to learn. But these approximate numbers still help to highlight how much data is often needed.

If you do manage to find enough usable data, you'll want to divide it into parts. The first part will be used as training data, which your AI model can learn from.

The second part will be some test data, which you can give to the AI after the learning stage, and check how well it performs.

As another general rule: of your total data, 80% should be used for training, and 20% for testing. So if you had 2000 photos in total, you'd use 1600 to train your AI, and 400 to test it later on.
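
In code, that split can be as simple as shuffling and slicing. Here's a sketch, pretending each item is a labeled photo:

```python
# A sketch of an 80/20 train/test split (pretend each item is a labeled photo).
import random

data = [f"photo_{i}.jpg" for i in range(2000)]

random.shuffle(data)  # shuffle first, so both parts are representative
split_point = int(len(data) * 0.8)

training_data = data[:split_point]  # 1600 items for the AI to learn from
test_data = data[split_point:]      # 400 items to check performance later

print(len(training_data), len(test_data))  # 1600 400
```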

If you can find all that data, that's amazing news. AI scientists often talk about the 'unreasonable effectiveness of data'. They're basically saying: when you do have enough data, it will probably be more effective than you'd ever expect.

But the opposite is true as well. When you don't have that data, you're seriously going to struggle. It's an important drawback to modern AI. It's still pretty hard to develop models for problems that don't have much data.

Imagine, for example, that you wanted to train an AI model to spot the symptoms of a rare disease. But you can only find about twenty case studies of people who have had that disease. That's nowhere near enough data – machine learning wouldn't be possible.

Medical data. Image via Pexels