Generative AI
Large Language Models
In the last few years, Generative AI (or Gen AI, for short) has become a bit of a buzzword.
It's used to describe any AI model which can study a dataset, then use what it's learned to create something totally new. For example, it might study a thousand paintings, then generate a painting of its own. Or it might study a million bedtime stories, then write a new one completely from scratch.
To give you some context, this is totally different from something like descriptive AI, which explains and classifies existing data. Where descriptive AI is designed to explain, generative AI is specifically designed to create.
The most famous example of generative AI is ChatGPT, which exploded onto the global scene towards the end of 2022. It's an example of a particular subtype of AI, which is known as a Large Language Model (LLM).
We mentioned LLMs last time. They're essentially just complex neural networks which specialize in natural language. As a general rule, we call it a 'Large' Language Model when it uses at least a billion parameters. The latest version of ChatGPT is said to have more than a trillion.
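Where do all those parameters come from? Roughly speaking, every connection between the network's neurons has its own adjustable number. Here's a back-of-the-envelope sketch in Python, using made-up layer sizes (nothing like ChatGPT's real architecture) to show how quickly the counts add up:

```python
def dense_params(n_in, n_out):
    # A fully connected layer has one weight per input-output
    # connection, plus one bias value per output neuron.
    return n_in * n_out + n_out

# A toy three-layer network (hypothetical sizes)
layers = [(512, 2048), (2048, 2048), (2048, 512)]
total = sum(dense_params(a, b) for a, b in layers)
print(total)  # over six million parameters, from just three small layers
```

Even this tiny toy network has millions of parameters; scale the layers up (and stack hundreds of them) and you can see how a real model reaches the billions.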
These vast neural networks are trained on a colossal volume of data, sometimes called a corpus. This corpus might be made up of news articles, text messages, song lyrics, novels, screenplays, textbooks, emails, poems...
Some of the world's most powerful LLMs were trained on vast swathes of text scraped from the public internet.
You might be wondering what 'GPT' stands for: Generative Pre-trained Transformer.
'Generative' and 'pre-trained' are self-explanatory. But transformer is a term that you might not have seen before. It refers to a specific learning framework that helps to improve the quality of an LLM's outputs.
We won't get into all the details. But in simple terms, transformer architecture uses a mathematical mechanism called attention to keep track of the overall context of a text, rather than just the last few words it has written.
For example, if you used an LLM to write an email, and the first sentence mentioned the recipient's name, the transformer architecture helps it remember this context, and keep using that name throughout the rest of the email.
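The attention mechanism at the heart of a transformer lets every word 'look at' every other word in the text. Here's a toy sketch in Python using numpy, with made-up 4-dimensional embeddings for three token positions (real models use thousands of dimensions and much longer texts):

```python
import numpy as np

def attention(queries, keys, values):
    """Scaled dot-product attention: each output position is a
    weighted mix of ALL the value vectors, so context from anywhere
    in the text can influence what comes next."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)        # relevance of each position to each other
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ values, weights

# Three token positions with toy 4-dimensional embeddings
x = np.random.default_rng(0).normal(size=(3, 4))
output, weights = attention(x, x, x)  # self-attention: tokens attend to each other
```

Because every row of weights sums to 1, each output is a weighted average over the whole context, which is how a name mentioned in the first sentence can still shape the last one.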
It's worth pointing out that ChatGPT isn't the only transformer LLM on the market. Anthropic's Claude is another example. So is Google's Gemini.
It has to be said: even the world's best LLMs still struggle to compete with human writing in more expressive or artistic contexts. They lack the kind of empathy and emotion which the very best writing is based on.
But for simple writing, LLMs are extremely powerful tools. You can use them to write emails, or summarize articles, or rephrase complicated text. You can use them to draft to-do lists, or write computer code, or even just have a quick chat.
Interestingly, the latest version of ChatGPT has also been trained to work with images and audio. In other words, it's no longer strictly an LLM – we can call this model multimodal, which means it can work with more than one type of data.
Case study: Turing Test
Before moving on from LLMs, here's an interesting question. Is an AI model like ChatGPT capable of passing the Turing Test? In this orb, we'll find out firsthand.
There are a few variations of the Turing Test, but we'll be using one that looks like this. A machine and a human have a conversation. A second human (that's you!) reads the messages. At the end, you'll be asked a question: of the two participants, can you tell which one was a machine?
The conversation will start on the page after this one. The participants are labeled Person A and Person B. But just to repeat: only one is a real person.
Person A: "Hey, quick question – if robots had favorite movies, what kind of films do you think they’d be into?"
Person B: "Interesting. Honestly, I have no idea. Science fiction, maybe? They'd probably like seeing themselves on screen."
Person A: "Yeah, I can see that. Probably big fans of The Terminator or Wall-E, right? Imagine a robot getting super invested in robot characters."
Person B: "Ha, yeah. Like, deciding that the Terminator was the ideal role model. Or developing a crush on Wall-E."
Person A: "Exactly! A robot trying to act all cool like the Terminator... but then secretly fangirling over Wall-E’s personality. That would be hilarious."
Conversation continued on next page.
Person B: "Yep, it sure would be! How about you, though? Do you agree that robots would like science fiction films, or is there another genre they might prefer?"
Person A: "Hmm, I feel like robots might also get into documentaries, you know? Something all factual and logical, like ‘How It's Made.’ Maybe they’d enjoy learning about how the world works – or how they work."
Person B: "I like this. It's like a human watching a documentary about the digestive system. It's important to know how your body works."
Conversation end.
There you have it. A conversation between a machine and a human. But which one is which? Take a moment to think. Make a guess in your mind. When you're ready to find out the actual answer, please continue to the page after this one.
Here's the answer: Person A was actually ChatGPT, while Person B was a real-life human.
Maybe you guessed that correctly. Maybe you got it wrong. But either way, it probably wasn't easy. Would you have been confident enough to bet your savings on it?
Just to be clear, none of this means that ChatGPT is sentient, or conscious, or an Artificial General Intelligence. It's still nothing more than an Artificial Narrow Intelligence using a neural network to mimic natural language.
But it does this very effectively. In one recent study, using a similar test, GPT was misidentified as a human 54% of the time. Compare that to something like the Eliza chatbot, way back in the 1960s. In the same test, it was only misidentified 22% of the time.
Text-to-Image models
Large Language Models, like ChatGPT, are probably the most striking and well known type of modern generative AI. But not far behind is another type: Text-to-Image models.
You've probably heard of at least one of them. Midjourney, DALL-E, Stable Diffusion...
These models can take a piece of text (a prompt) and use it to generate a picture. For example, if you prompted it to generate "a picture of a purple mountain," that's exactly what it would do.
Just like those Large Language Models we looked at last time, these text-to-image models are extremely powerful and complex. The neural network behind the DALL-E model uses several billion parameters.
Training a Text-to-Image model will generally involve a massive dataset of images with descriptive captions.
This is a great example of supervised learning: training a model using a labeled set of input-output pairs. The descriptive caption is the input, while the image is the output.
This approach allows the model to learn the relationship between words ("a ginger cat beside a book on a windowsill") and visuals (the matching image).
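Here's a minimal sketch of what such labeled input-output pairs might look like in Python. The captions and the tiny 2x2 "images" are entirely made up – a real dataset holds full-sized images and far more examples:

```python
# Each training example pairs an input (the caption)
# with an output (the image, here a stand-in grid of pixel values).
dataset = [
    {"caption": "a ginger cat beside a book on a windowsill",
     "image": [[210, 140], [90, 200]]},   # hypothetical pixel values
    {"caption": "a purple mountain at sunset",
     "image": [[80, 60], [120, 45]]},
]

for example in dataset:
    x, y = example["caption"], example["image"]  # input-output pair
    # ...a real training loop would now nudge the model so that
    # feeding in x produces something closer to y.
```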
It's worth pointing out that a dataset like this isn't easy to put together. One example is Microsoft's Common Objects in Context (COCO), which features more than 100,000 different images, each with five descriptive captions. All of these captions had to be manually written by a team of human annotators.
Text-to-Image models can also be built using something called a Generative Adversarial Network (GAN).
This approach involves a pair of neural networks which are designed to work against each other. The first neural network (the Generator) produces AI generated images. The second network (the Discriminator) looks at these images, along with some real-life images, and tries to guess which ones are which.
It's almost as though the two networks are playing a game. If the Generator manages to fool the Discriminator, it gets positive feedback. If it fails, the Discriminator gets positive feedback. Over time, both networks get better and better at generating and detecting AI generated images.
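The scoring rule of that 'game' can be sketched in a few lines of Python. This is a deliberate simplification – real GANs use gradient-based loss functions rather than simple win/lose points – but the competitive logic is the same:

```python
def gan_feedback(image_is_real, discriminator_says_real):
    """Score one round of the Generator-vs-Discriminator game.
    Returns (generator_score, discriminator_score), where 1 means
    that network receives positive feedback this round."""
    if image_is_real:
        # A real image: only the Discriminator's judgment is tested.
        return (0, 1) if discriminator_says_real else (0, 0)
    if discriminator_says_real:
        return (1, 0)  # the fake fooled the Discriminator: Generator wins
    return (0, 1)      # the fake was caught: Discriminator wins
```

Run over thousands of rounds, feedback like this pushes the Generator toward ever more convincing images, and the Discriminator toward ever sharper detection.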
The GAN approach isn't only used for Text-to-Image models. It can also be used to generate text. But it seems to work especially well for images.
Now, it's important to remember that when a Text-to-Image model learns to generate images, it's not actually 'aware' of what those images look like. At least, not in the same way as a human would be if they were drawing, or taking a photo.
Instead, every image is converted into some kind of numerical structure which the computer is able to work with. There are different ways to go about this, but one simple example would be a map of pixels, each with a number that describes that pixel's color.
In other words, what's happening behind the scenes of a Text-to-Image model is similar to an LLM. An image is treated as a complex pattern of numbers, exactly the same as an LLM might treat a piece of text.
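Here's that pixel-map idea in miniature, using a made-up 2x2 grayscale image (real images have millions of pixels, usually with three color values each):

```python
# A tiny grayscale "image": each number is one pixel's
# brightness, from 0 (black) to 255 (white).
image = [
    [0, 255],
    [128, 64],
]

# To a model, the picture is just this pattern of numbers.
flat = [pixel for row in image for pixel in row]
print(flat)  # [0, 255, 128, 64]
```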
The applications of Text-to-Image models are extremely vast and varied.
Just like text generation from an LLM, these images lack the emotional depth and conceptual complexity that you would find in human-made artwork. But if you're willing to overlook that fact...
With the right choice of prompt, you can use them to generate digital paintings, product blueprints, marketing materials, greetings cards, fashion designs, game assets, memes, and more.
Work is even being done to harness these models in medical settings. In the future, a surgeon could potentially describe a patient's symptoms to one of these models, and receive a diagram which helps them to plan the surgery.
Other generative AI
Generative AI isn't only used for text and image generation. There are also lots of other models out there, which can produce a range of content, including speech, videos and music.
Text-to-Speech models are able to turn written text into human-like spoken words. They're used all over the place, from voice assistants, to audiobooks, to tools for visually impaired people.
There are lots of different ways to build these models. But they're generally based on deep learning. If you train a model on thousands of written texts, plus audio files of people reading those texts, it can learn to sound like a real person talking.
As always, these models aren't aware that they're 'talking'. Just like LLMs, and Text-to-Image models, they treat their output (audio) as a piece of numerical data.
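For example, here's what a plain musical tone looks like as data, using Python's built-in math module. It's a toy illustration – real Text-to-Speech models work with far richer numerical representations of sound – but it shows that 'audio' is ultimately just a long list of numbers:

```python
import math

# One second of a 440 Hz tone (the note A),
# sampled 8,000 times per second.
sample_rate = 8000
samples = [math.sin(2 * math.pi * 440 * t / sample_rate)
           for t in range(sample_rate)]

# Each number describes the position of a speaker cone at one
# instant; played back in sequence, the numbers become sound.
```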
Text-to-Video models can generate moving visual content after receiving a written prompt. They're essentially just a more advanced and complicated version of a Text-to-Image model, maybe with some Text-to-Speech thrown in if the video is meant to have audio.
This particular type of generative AI isn't quite as popular as the others. That's mainly because it lacks consistency: even the best models have a tendency to produce slightly strange, uncanny videos.
But that could potentially change in the next few years, as more advanced models, trained on better datasets, are developed. If that does happen, it could have a major impact on fields like entertainment and advertising.
Text-to-Music models are one more type of modern generative AI. They've been trained to generate pieces of music by looking at the patterns and structures of existing compositions.
For example, you could input a prompt like "generate a calm, relaxing melody," and the model would generate a piece of music accordingly.
One of the most well-known examples of this is a model called Suno AI. But Suno is pretty controversial. In June 2024, it was actually sued by a group of record labels, including Universal, Sony, and Warner Music.
Why? Because Suno was supposedly trained on a dataset which featured songs from famous artists like Mariah Carey, Green Day, and Michael Jackson. According to the record labels, Suno's use of these songs was breaking copyright laws.
Suno isn't the only AI with a dataset that's come under scrutiny.
A number of artists have also been trying to take Text-to-Image models (like Midjourney) to court, after claiming that these models were trained on their art without getting any prior permission.
In 2023, a group of authors including George RR Martin, John Grisham, and Jodi Picoult made similar claims against OpenAI, the company behind ChatGPT.
As things stand, the use of copyrighted content in AI training is still a bit of a gray area. There isn't just the legal question to think about, but the ethical question too. Is it fair to use a creator's work without asking for prior permission?
For now, let's put a pin in that question. We'll be taking a proper look at AI ethics later in the pathway.