Natural Language Processing
Talking to machines
In this pathway so far, we've traveled from the days of Charles Babbage and Ada Lovelace, through the era of symbols and expert systems, all the way to the age of modern neural networks.
Later, we'll continue this journey and take a look at the future of AI. But first, let's spend a little bit more time in the present. A little bit more time in the AI spring that we're living through right now.
This period has seen Artificial Intelligence branch out into a number of subfields. Over the next few tiles, we'll be learning all about vision, robotics, and generative AI. But first, we're going to take a look at Natural Language Processing.
Natural Language Processing (NLP) is a catch-all term which is used to describe a computer's ability to understand and respond to real, human language.
This is a pretty big deal. Imagine you wanted to build an AI which could analyze news articles, or translate novels, or identify typos in essays. It wouldn't be able to do any of those things if it didn't know how to understand and respond to language.
And how about voice assistants? When you say: "Alexa, play my favorite song," that AI needs to understand human language if it's going to know what you're asking for. It also needs to understand language if it's going to form a meaningful reply, like: "Of course. Now playing your favorite song."
So yeah. NLP is important. But here's the thing: understanding language is easier said than done.
Every language has a set of grammar rules. But those rules are often strange and irregular. In English, you can tweak a verb like "walk" into a past tense version by adding -ed: "walked". But for another verb, "go", you can't add -ed. Instead, the past tense is "went".
And what about slang and abbreviations? Think of the people who use "gonna" instead of "going to", or "y'all" instead of "you all".
Then there's vocabulary. English has, by some counts, more than a million words, but often the exact same word is used to mean totally different things. We call these words homonyms. For example, a good AI would need to know the difference between "give the bell a ring" and "give your partner a ring".
And what about subtext? When we say something, there's often a secondary meaning hidden underneath the words. When your friend turns up and you say, "That's an... interesting hat..." what do you actually mean?
A basic form of NLP can be achieved using rule-based approaches.
Imagine a simple voice assistant. It could be programmed with a list of possible commands, like "play a song" or "tell me the time." When somebody asks it to do something, it just checks that list, then responds.
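To make that concrete, here's a minimal sketch of what such a rule-based check might look like in Python. The commands and replies are made up for illustration, and a real assistant would have a much longer list (plus speech recognition in front of it).

```python
# A minimal sketch of a rule-based assistant: it only knows the commands
# in its list, and responds with a canned reply.
# (Commands and replies here are invented for illustration.)

COMMANDS = {
    "play a song": "Of course. Now playing your favorite song.",
    "tell me the time": "Sure. Here's the current time.",
}

def respond(user_input: str) -> str:
    request = user_input.lower().strip()
    for command, reply in COMMANDS.items():
        if command in request:  # crude keyword matching
            return reply
    return "Sorry, I don't know how to do that."

print(respond("Hey, can you play a song?"))  # -> canned "now playing" reply
print(respond("Order me a pizza"))           # -> "Sorry, I don't know how to do that."
```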
As we talked about earlier, the ELIZA chatbot used a rule-based approach to simulate the dialogue of a psychotherapist, basically just spotting key words and patterns, then generating relevant responses.
But as always, this kind of rule-based approach can only take us so far. Nowadays, the most complex NLP models are generally neural networks. With a large enough dataset of written text, they can learn all the subtleties of natural language, like nuance, slang and subtext.
Language pre-processing
To make Natural Language Processing easier, an AI will often use a technique called language pre-processing. In simple terms, this means cleaning the mess from a piece of natural language, and stripping it down to a simpler, more manageable form.
Imagine saying to Alexa: "Hey, Alexa, can you play my favorite song again?" Language pre-processing might cut that down to "Alexa play favorite song". That second version still holds all the key information from the original, but with the messiness cleaned up, it's easier for a computer to interpret.
Think of it like digging up a fossil, and brushing away all the excess soil until you're left with nothing but a nice clean bone. Or sorry, let us rephrase that: think dig up fossil, brush away soil, left with nice clean bone.
Tokenization is a core technique in language pre-processing. It involves breaking a text into individual units. These units often correspond to words or chunks of words – but for the sake of clarity, let's treat a token as a single, individual word.
For example, with a sentence like "the birds are searching for better food", tokenization would break it down into ["the", "birds", "are", "searching", "for", "better", "food"].
This step is important, because it converts a loose stream of text into tight, individual elements. Each of these elements (or tokens) can then be analyzed one-by-one. For example, the AI might associate the token "birds" with certain properties, like "feathers", "wings", "beaks".
This approach is much easier for an AI model than attempting to analyze the entire text at once.
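If you're curious what tokenization looks like in practice, here's a rough sketch in Python. It simply lowercases the text and pulls out the words, which is enough for our purposes, though real tokenizers are usually more sophisticated and often work with sub-word chunks.

```python
# A rough sketch of word-level tokenization: lowercase the text and
# pull out runs of letters, dropping punctuation along the way.

import re

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("The birds are searching for better food")
print(tokens)
# ['the', 'birds', 'are', 'searching', 'for', 'better', 'food']
```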
Another important pre-processing technique is stemming. This involves taking a word and cutting off its ending (like -ing, or -ed). This reduces it to a stem – a more basic form of the word.
For example, ["the", "birds", "are", "searching", "for", "better", "food"] becomes ["the", "bird", "are", "search", "for", "better", "food"].
A related technique is lemmatization. Instead of removing the endings of words, it changes words into simpler versions that still mean more or less the same. It might turn "better" into "good", or convert a conjugated verb to its base grammatical form.
For example, ["the", "birds", "are", "search", "for", "better", "food"] becomes ["the", "birds", "be", "search", "for", "good", "food"].
Both of these techniques help to standardize words, which makes them easier to understand.
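Here's a rough sketch of both techniques using the NLTK library. This assumes you've installed NLTK and downloaded its WordNet data, and the exact outputs can vary a little between versions.

```python
# A sketch of stemming and lemmatization with NLTK
# (assumes `pip install nltk`; the lemmatizer needs the WordNet data).

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

tokens = ["the", "birds", "are", "searching", "for", "better", "food"]

# Stemming: chop off endings like -ing and -s.
print(stemmer.stem("searching"))  # 'search'
print(stemmer.stem("birds"))      # 'bird'

# Lemmatization: map each word to a dictionary base form.
# (Here we tell the lemmatizer the part of speech by hand;
#  a real pipeline would tag it automatically.)
print(lemmatizer.lemmatize("better", pos="a"))     # 'good'
print(lemmatizer.lemmatize("searching", pos="v"))  # 'search'
print(lemmatizer.lemmatize("are", pos="v"))        # 'be'
```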
One more important pre-processing technique is something called stop-word removal. This involves deleting any filler words (like "and" or "the") which don't add much meaning to the text.
For example, ["the", "bird", "be", "search", "for", "good", "food"] becomes ["bird", "search", "good", "food"].
And there we have it. Through a series of simple pre-processing techniques, that natural sentence ("the birds are searching for better food") has been reduced to a format that's much simpler and easier for a computer to work with.
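As a final sketch, here's what stop-word removal might look like in Python, using a small hand-made stop-word list. Real NLP libraries ship much longer lists; this one is just illustrative.

```python
# A sketch of stop-word removal: drop any token that appears in a
# (deliberately tiny, illustrative) list of filler words.

STOP_WORDS = {"the", "a", "an", "and", "be", "are", "is", "for", "to", "of"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    return [t for t in tokens if t not in STOP_WORDS]

tokens = ["the", "bird", "be", "search", "for", "good", "food"]
print(remove_stop_words(tokens))
# ['bird', 'search', 'good', 'food']
```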
Feature extraction
Last time, we saw how language pre-processing could be used to turn a piece of natural language into a cleaner, simpler form. This makes it easier for an AI to identify meaning.
For example, if someone asked a voice assistant "What's the weather like tomorrow?", it could pre-process that text into ["weather", "tomorrow"]. The word "weather" tells the model that the user wants a forecast, while "tomorrow" gives a clear timeframe.
It won't always be that straightforward, though.
Instead of giving a simple voice command, imagine using an AI to analyze the contents of an online article. Even with the help of language pre-processing, the AI still ends up with a body of words that it doesn't really know what to do with. In cases like this, the AI will need to use an approach which scientists call feature extraction.
Feature extraction is a way for an AI to look at a piece of complicated language, like an article or a book, then pick out the most important features.
There are a few different ways to go about this. But they're all based around a similar principle. The AI needs to turn the natural language into some kind of graph, or statistical model – something that uses numbers. After all, computers work best with numbers. That's their equivalent of language.
In other words, this is a translation exercise. Feature extraction takes human language (text), and translates it into some kind of numerical model.
One common approach to feature extraction is word embedding.
This approach starts by taking a piece of text. For example, an online article. After pre-processing the text, the AI will take all the tokens in the article, then plot them in a multidimensional graph.
In this graph, every token is represented as a different point. And the AI is able to derive patterns and meanings from their positions. For example, related tokens might be clustered together, while unrelated tokens are positioned further apart.
We're heavily over-simplifying here. Word embedding is extremely complex. But for the purpose of this pathway, that's the general idea.
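Still, if you'd like to see the mechanics, here's a sketch using the gensim library (assuming it's installed). On a toy corpus this small the vectors are basically meaningless – real systems train on millions of sentences – but the code shows how each token becomes a point, and how the distances between points can be read off as similarity scores.

```python
# A sketch of training word embeddings with gensim
# (assumes `pip install gensim`; in older gensim versions the
#  `vector_size` parameter is called `size`).

from gensim.models import Word2Vec

sentences = [
    ["the", "birds", "are", "searching", "for", "better", "food"],
    ["the", "birds", "are", "flying", "to", "warmer", "places"],
    ["good", "food", "is", "hard", "to", "find"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=42)

# Every token now maps to a point (a 50-dimensional vector) in the "graph".
print(model.wv["birds"][:5])  # first few coordinates of the vector

# Distances between points become similarity scores.
print(model.wv.similarity("birds", "food"))
```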
Let's take a look at an example.
Imagine you used word embedding on an article. Afterwards, you end up with a multidimensional graph where "AI" is clustered with "exciting" and "revolutionary". But for another article, you end up with a graph where "AI" is clustered with "worrying" and "dangerous".
When the AI interprets both of these graphs, it should be able to identify which article is pro-AI, and which article is a lot less keen. There's actually a name for this: sentiment analysis. That's when a model is used to analyze the general mood of a text.
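If you want to experiment with sentiment analysis yourself, libraries like Hugging Face's transformers wrap the whole process up for you. Here's a sketch; it assumes the library is installed, downloads a default pretrained model on first use, and the scores shown are illustrative.

```python
# A sketch of sentiment analysis with the Hugging Face `transformers`
# library (assumes `pip install transformers`; the first call downloads
# a default pretrained model, so it needs an internet connection).

from transformers import pipeline

classifier = pipeline("sentiment-analysis")

print(classifier("AI is exciting and revolutionary."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99}]

print(classifier("AI is worrying and dangerous."))
# e.g. [{'label': 'NEGATIVE', 'score': 0.99}]
```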
This kind of process wouldn't be possible without feature extraction. As we said, you don't really need it to interpret simple text ("What's the weather like tomorrow?") but for longer, more complex NLP tasks, it's a really useful approach.
Generation
Natural Language Processing (NLP) isn't just about turning human language into a form that computers can understand. It's also about the opposite process: turning computer language into a form that humans can understand.
Just think about a chatbot or a voice assistant. It might know what you mean when you ask it, "What's the best way to boil an egg?" but that's not very useful if it doesn't know how to reply.
This process of turning computer language into human language is called Natural Language Generation (NLG). Meanwhile, all that stuff we learned about earlier – language pre-processing and feature extraction – is generally known as Natural Language Understanding (NLU).
Essentially, the process of NLG is the reverse of NLU. The computer will start with some numbers, data, graphs or tables, then translate this data into coherent words and sentences.
There are a few different ways to go about this, though.
The most basic approach is a rule-based template. This is where an AI is programmed with a set of template sentences, which function like fill-in-the-blanks. For example: "The weather today is [condition] with a high of [temperature]."
Using a template like this one, it's pretty easy for the AI to look at some data (like a weather report) then output a meaningful sentence. But this approach is pretty limited. Apart from those template sentences, the AI can't say anything else.
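Here's a sketch of that fill-in-the-blanks idea in Python. The weather values are just illustrative data standing in for a real weather report.

```python
# A sketch of template-based generation: the "AI" just fills the blanks
# in a canned sentence with values from its data.

TEMPLATE = "The weather today is {condition} with a high of {temperature}."

weather_report = {"condition": "sunny", "temperature": "24°C"}  # illustrative data

print(TEMPLATE.format(**weather_report))
# "The weather today is sunny with a high of 24°C."
```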
That rule-based template is a form of symbolic programming. And as you'd probably expect, more advanced approaches to Natural Language Generation will generally use machine learning techniques instead.
One of the most popular examples of this is something called a statistical model. This involves training an AI to identify patterns between words – more specifically, to identify which words are most likely to follow which others.
For example, imagine that the AI has already generated three words: "The dog is..." Statistically, these words are more likely to be followed by a word like "barking" or "running", rather than a word like "flying" or "delicious".
Like with any machine learning, statistical models are only as good as their data. But luckily, there's a lot of natural language out there. Think of all the millions of books that you could give to an AI model, or the billions of webpages, or news articles.
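Here's a toy sketch of the idea: a bigram model that counts which word tends to follow which in a tiny made-up training text, then predicts the most likely next word. Real language models use vastly more data and much longer context, but the principle is similar.

```python
# A sketch of a simple statistical (bigram) model: count which word
# tends to follow which, then predict the most frequent follower.

from collections import Counter, defaultdict

corpus = (
    "the dog is barking . the dog is running . "
    "the cat is sleeping . the dog is barking again ."
).split()

# Count how often each word follows each other word.
follows: dict[str, Counter] = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word: str) -> str:
    # Pick the follower seen most often in the training text.
    return follows[word].most_common(1)[0][0]

print(predict_next("is"))  # 'barking' (seen most often after "is")
```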
Google Translate uses a neural network which was trained to predict likely sequences of words after analyzing a dataset of texts in various languages. ChatGPT uses a neural network which was trained using texts from the internet.
There's actually a name for NLG models like ChatGPT: Large Language Models (LLMs). They're so good at generating natural language that they can arguably pass the Turing Test.