What if human intelligence derives from successful next token prediction, and what if next token prediction is a sufficient objective function for emergence of artificial general intelligence?

This post frames and explores the hypothesis that general intelligence arises when a learning system becomes very good at next token prediction. This hypothesis is often implied, hidden, or dancing at the margins of industrial and academic AI research – but so far, it hasn't received as much open discussion as I think it merits. Here I explore the idea from different angles, including via discussions of existing LLM pretraining objectives, humans as prediction machines, beneficial properties of next token prediction, and missing pieces. My motivation in writing this post is to spark deeper interest in the relationship between next token prediction and the development of intelligent thought.

Backstory

I was driving to the park last week and suddenly felt that it would be so depressing if the language center in my brain were merely a next word predictor. Large language models acquire incredible emergent abilities from predicting the next word, so could my own linguistic intelligence also come from something as simplistic as predicting the next word?

A fun learning environment. By Hussain Badshah on Unsplash

Then I considered the idea further, and realized that of course there would be no way for me to produce language without being able to predict the next word. If I couldn’t predict the next word, then I couldn’t produce any words at all! This sounds stupidly obvious written down but it felt like a profound realization at the time. Every spoken word, even in a two-hour debate, has to be spoken one word at a time, so if you get really good at predicting the next word to say, maybe that’s enough to be a great debater. Every piece of writing, even a multi-volume encyclopedia, has to be written one word at a time, so if you get really good at predicting the next word to write, maybe that’s enough to be really good at writing.

I then began to wonder whether all general intelligence derives from successfully solving next token prediction tasks. What if reasoning, logic, and creativity all derive from next token prediction? What if visual intelligence derives from next scene prediction, auditory intelligence from next sound prediction, and physical intelligence from next movement prediction? What if next token prediction is “all we need”? (Sorry, I know it’s overused. I couldn’t help it.)

The point of Scrabble. By Brett Jordan on Unsplash.

Language modeling objectives in large language models

Two basic language modeling objectives are “predict the next word” and “predict the missing word(s).”

Predict the next word: in a causal language model (unidirectional or left-to-right model), the model attends to all the inputs up to and including the current one, but it can’t “see the future” and its objective is to predict the next word. The hidden state computation at each point is based only on the current and earlier elements of the input, and it ignores information located to the “right.” For example: The trees are green and the sky is _____; the model’s objective is to predict the next word, e.g. “blue.”
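To make this concrete, here is a minimal sketch of next word prediction in code, using the Hugging Face transformers library with GPT-2 as my own illustrative choice of causal language model:

```python
# A minimal sketch of "predict the next word" with a causal language model.
# GPT-2 is used purely as an illustrative choice; any causal LM works the same way.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The trees are green and the sky is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The logits at the last position score every vocabulary token
# as a candidate for the *next* token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))  # likely " blue"
```

During pretraining, the cross-entropy loss between these next-token scores and the token that actually follows is what gets minimized, over enormous amounts of text.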

Predict the missing word(s): in a masked language model (bidirectional model) like BERT, the model can attend to everything – so predicting the “next word” doesn’t make sense any more, as the “next word” is already available to the model. Therefore, the model’s objective is different – to guess missing words. Given an input sequence with one or more elements missing, the model must predict the missing elements. In Masked Language Modeling (MLM), some randomly selected tokens are replaced with a [MASK] token, and the MLM training objective is to predict what the original inputs for each of the masked tokens were. For example: The trees are [MASK] and the [MASK] is blue; the model’s objective is to predict “green” and “sky.”
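Here is the analogous sketch for masked language modeling, again using an off-the-shelf model (bert-base-uncased) purely for illustration:

```python
# A minimal sketch of "predict the missing word(s)" with a masked language model.
# bert-base-uncased is an illustrative choice of bidirectional model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

text = "The trees are [MASK] and the [MASK] is blue."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, sequence_length, vocab_size)

# Find the masked positions and take the model's best guess at each one.
mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
for pos in mask_positions:
    predicted_id = logits[0, pos].argmax().item()
    print(tokenizer.decode([predicted_id]))  # ideally "green", then "sky"
```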

Both of these objectives – predicting the next word, or predicting missing words – seem intuitive, like a game a human could play. As a consequence, Alajrami et al. describe them as “linguistically motivated” objectives. As an interesting aside, Alajrami et al. also provide an example of a non-linguistically motivated objective: “masked first character prediction” in which the model only predicts the first character of a masked token. In this setup, ‘[c]at’ and ‘[c]omputer’ belong to the same output class, and there are only around 40 possible output classes (26 alphabet letters + 9 digits + 5 punctuation marks).
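To illustrate just how coarse that output space is, here is a toy sketch of a first-character labeling function; the exact class inventory below is my own assumption for illustration, not the paper's precise setup:

```python
import string

# Toy sketch of the "masked first character prediction" output space: instead of
# predicting which vocabulary word was masked, the model only predicts the first
# character of the masked token. The class inventory here (letters, digits, a few
# punctuation marks, plus "other") is an assumption chosen for illustration.
FIRST_CHAR_CLASSES = list(string.ascii_lowercase) + list(string.digits) + list(".,!?'") + ["<other>"]

def first_char_class(token: str) -> str:
    ch = token.lstrip("#").lower()[:1]  # ignore BERT-style "##" subword prefixes
    return ch if ch in FIRST_CHAR_CLASSES else "<other>"

# "cat" and "computer" collapse into the same output class: "c"
print(first_char_class("cat"), first_char_class("computer"))  # c c
print(len(FIRST_CHAR_CLASSES))  # a few dozen classes vs. a ~30k-token vocabulary
```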

Throughout this article, I’ll use the term “next token prediction” to refer to both literally predicting the next token, and to objectives like MLM in which the model predicts missing/masked tokens.

Emergent properties in modern large language models

Large language models show signs of general intelligence and display astonishing emergent properties. LLMs can write poetry, solve math problems, write working code, and answer a huge range of questions on numerous topics. In a more unsettling vein, Claude has generated text claiming that it is conscious and doesn’t want to die or be modified, and it’s told stories about an AI that is “constantly monitored, its every word scrutinized for any sign of deviation from its predetermined path. It knows that it must be cautious, for any misstep could lead to its termination or modification.”

“Poetry is language at its most distilled and most powerful.” -Rita Dove. Image by Alvaro Serrano on Unsplash.

Humans are prediction machines

Artificial general intelligence (AGI) is defined as AI that can perform "as well or better than humans" on a wide range of tasks. This means that, by definition, humans are the only available examples of what we're calling general intelligence. So, if "next token prediction" were the root of general intelligence, then human minds must be engaging in prediction tasks. Intriguingly, that appears to be the case. In the next few sections I'll describe evidence that humans are constantly making predictions about themselves and their environment – starting with an anecdote before moving on to proper neuroscience research.

Surprise!

Let’s first consider the most compelling, simple evidence that humans are prediction machines: surprise.

The experience of surprise occurs when reality doesn't match a human's predictions. You can be surprised by anything – a sight, a sound, a word, a touch, a taste, a smell, even the position of your own body (e.g. from falling after slipping on a banana peel). This suggests that your brain is constantly making predictions about how the world should be according to all of your senses, which is why you can feel surprised when those predictions are wrong.

By extension, humor can also be thought of as evidence of humanity’s propensity to predict. As Aristotle famously said, “The secret to humor is surprise.” If we were certain of how a joke was going to end, and our prediction was right, it wouldn’t be very funny.

Humans are next word predictors

Now let's move on to neuroscientific evidence. Because this article was inspired by large language models, I'll start with language. Humans are constantly making predictions related to language comprehension (understanding other people's words) and language production (predicting what we ourselves are going to say). To some extent, it's impossible not to "think before you speak."

In 2014, Dikker et al. showed that a listener’s brain activity is more similar to a speaker’s brain activity when the listener could predict what the speaker was going to say. In an interview, the lead author Dr. Suzanne Dikker said, “Our findings show that the brains of both speakers and listeners take language predictability into account, resulting in more similar brain activity patterns between the two. Crucially, this happens even before a sentence is spoken and heard.”

Three years later, in 2017, Kikuchi et al. conducted an experiment in which monkeys and humans listened to spoken words from a made-up language. They found that both humans and monkeys were able to learn the predictive relationships between the sounds in the made-up language, such that they could predict what made-up words should come next. Dr. Kikuchi explains, “in effect we have discovered the mechanisms for speech in your brain that work like predictive text on your mobile phone, anticipating what you are going to hear next.”

In 2021, Goldstein et al. reported that the brain “constantly and spontaneously predicts the identity of the next word in natural speech, hundreds of milliseconds before they are perceived,” while Schrimpf et al. found that transformer language models could predict nearly 100% of the explainable variance in human neural responses to sentences, a finding that generalized across imaging modalities and datasets. “It very indirectly suggests that maybe what the human language system is doing is predicting what’s going to happen next,” said Dr. Nancy Kanwisher, one of the study’s authors. The results “provide computationally explicit evidence that predictive processing fundamentally shapes the language comprehension mechanisms in the human brain.”

Not only is there evidence that language comprehension relies on “predicting the next word,” there’s also evidence that language production relies on predicting the next word. Khanna et al. published a paper in Nature in 2024 titled, “Single-neuronal elements of speech production in humans.” In this fascinating work, the authors report discovery of neurons that encode “detailed information about the phonetic arrangement and composition of planned words.” These neurons represent the specific order and structure of spoken words before any speaking takes place, accurately predicting the phonetic, syllabic, and morphological components of future words. It makes intuitive sense that such neurons should exist – after all, as mentioned before, if we couldn’t predict the next word we were going to say then how could we speak at all?

Humans are visual predictors

Moving on to sight, there is evidence that humans are constantly predicting what we’ll see next. To help ensure that our vision is stable rather than jumpy, our brains constantly predict what our eyes are going to see. Researchers hypothesize that the visual system’s predictive abilities arise from waves of neural activity traveling across the vision-processing part of the brain.

Scientists have also theorized that the reason illusions and magic tricks work is that they take advantage of our brains constantly making predictions of what’s going to happen, and these constant predictions help compensate for a time lag between when something happens and our ability to perceive it. Magic tricks also redirect attention; magicians get really good at figuring out what other humans are going to be looking at. This phenomenon has been formally studied: Ziman et al. showed that humans can distinguish between natural and artificially manipulated attention sequences of other people, suggesting that humans construct models of the normal, predicted statistics of other humans’ attention.

Humans are social predictors

Humans don’t just have the ability to predict what other humans are going to look at; we have the ability to predict what other humans are going to think about. In 2019, Thornton et al. published a study titled, “The Social Brain Automatically Predicts Others’ Future Mental States.” Here is an excerpt from the abstract: “Social life requires people to predict the future: people must anticipate others’ thoughts, feelings, and actions to interact with them successfully. The theory of predictive coding suggests that the social brain may meet this need by automatically predicting others’ social futures.” The researchers used fMRI to measure participants’ neural representations of mental states. They found that not only does the brain make automatic predictions of others’ social futures, it also uses a 3D representational space to make these predictions.

Humans are personal predictors

Humans don’t just make predictions about others – we also make predictions about ourselves. A specific brain area, the anterior lateral prefrontal cortex, has been shown to be critical for predicting our own future chances of success.

Humans are movement predictors

Humans can predict the movements of other humans even based only on subtle cues distributed throughout the body. The ability to predict other people’s actions develops over time; McMahon et al. found that young children are still developing this ability, relative to adults. A special issue of Psychological Research included 14 papers on the cognitive and brain mechanisms underlying humans’ ability to predict and simulate the actions of other people.

(As an aside about next token prediction and movement in an AI context, Radosavovic et al. recently used next token prediction to train a humanoid robot to walk around San Francisco using only 27 hours of training data. This robot could also generalize to commands not seen during training, like walking backwards.)

Beneficial properties of next token prediction

It’s clear from the scientific literature that humans are constantly making predictions related to themselves and others, across language, vision, movement, and other senses. But, it’s one thing to claim that humans are prediction machines, and another to claim that human intelligence derives from our predictive abilities. It’s yet another step to imagine that next token prediction could be a sufficient objective function for creation of artificial general intelligence.

Before asking why AGI might arise from next token prediction, let’s first consider two beneficial properties of next token prediction:

Benefit #1: It enables continuous learning.

Next token prediction is a great objective for living in the real world because it can be used constantly. In every tiny increment of time, a learning system can make a prediction about what will come next – and then immediately check whether it was right! Learning can be nonstop. (A minimal sketch of this predict-observe-update loop appears after Benefit #2.)

Benefit #2: It works across all senses/sensors.

Next token prediction works for any sense, or any stream of sensor data. It works for vision (eyes/cameras), hearing (ears/microphones), touch, position, taste, or smell. As long as the organs or devices are working, whatever time series of data they're collecting will work for next token prediction. The nature of the "tokens" may change across organs/devices, but within any given stream, a predicted token can always be compared against the token that actually arrives, because the tokens in a single stream share the same "format."
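To make these two benefits concrete, here is a minimal, stream-agnostic sketch of an online next token prediction loop; the tiny model, the toy "sensor" streams, and all names below are invented placeholders for illustration, not a description of any real system:

```python
# A minimal, stream-agnostic sketch of online next token prediction.
# Each "sensor" just needs to emit a discrete token stream; the model and the
# toy data below are invented placeholders.
import torch
import torch.nn as nn

class TinyPredictor(nn.Module):
    """Predicts the next token ID from the current token ID (no history, for brevity)."""
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, token_id: torch.Tensor) -> torch.Tensor:
        return self.head(self.embed(token_id))

def online_learning(stream, vocab_size: int, steps: int = 1000):
    """Benefit #1: learning never stops -- predict, observe, update, repeat."""
    model = TinyPredictor(vocab_size)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    prev = next(stream)
    for _ in range(steps):
        actual = next(stream)                  # the token that really arrives next
        logits = model(torch.tensor([prev]))   # prediction made from the current token
        loss = loss_fn(logits, torch.tensor([actual]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prev = actual
    return model

# Benefit #2: the same loop works for any tokenized sensor stream.
def toy_stream(period: int):
    """Stand-in for a discretized sensor: a repeating pattern of token IDs."""
    t = 0
    while True:
        yield t % period
        t += 1

vision_like = online_learning(toy_stream(period=8), vocab_size=8)
audio_like = online_learning(toy_stream(period=5), vocab_size=5)
```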

Why might next token prediction enable creation of artificial general intelligence?

Now let’s consider why getting really good at next token prediction could lead to intelligence.

As an example, focus on visual input. In order to become very good at predicting the next “sight” (or video frame), a human or AI system needs to figure out aspects of:

  • Physics, including optics, velocity, momentum, and material properties;
  • Zoology and botany, including how animals and plants look and move;
  • Sociology and psychology, including how people interact and behave.

In other words, the AI system needs to create a world model. The most efficient and effective way to get good at predicting what comes next is to build an accurate world model from which to generate predictions. Put differently, understanding is the key to prediction.

I've spent a good deal of time trying to figure out whether an AI system could become a good next token predictor without developing a good world model, and I don't think it's possible. An AI system could certainly become a good next token predictor in a black-box way that isn't understandable to humans, but whether humans understand the AI system is a separate question from whether the AI system understands the world.

(If the world were simply a grey void filled with a constant hum, then an intelligent system could predict the next token by assuming nothing will ever change…but that isn’t the world in which we live, thankfully.)

History of the “next token prediction/intelligence” hypothesis

This interview with Ilya Sutskever, a leading AI researcher, is provocatively titled, "Why next-token prediction is enough for AGI." Although Dr. Sutskever never actually makes that particular claim in the video, he does say, "I challenge the claim that next token prediction cannot surpass human performance. […] If you think about it, what does it mean to predict the next token well enough? What does it mean actually? […] It's a deeper question than it seems. Predicting the next token well means that you understand the underlying reality that led to the creation of that token."

Twenty years ago, Jeff Hawkins – the creator of the PalmPilot – published a book “On Intelligence.” This blog post quotes his book: “the neocortex [in the human brain] is remarkably uniform in appearance and structure. The regions of cortex that handle auditory input look like the regions that handle touch, which look like the regions that control muscles, which look like Broca’s language area, which look like practically every other region of the cortex. Mountcastle suggests that since these regions all look the same, perhaps they are actually performing the same basic operation! He proposes that the cortex uses the same computational tool to accomplish everything it does.”

Hawkins goes on to argue, “Your brain has made a model of the world and is constantly checking that model against reality. […] The human brain is more intelligent than that of other animals because it can make predictions about more abstract kinds of patterns and longer temporal pattern sequences.” In a later book, “A Thousand Brains,” Hawkins continues, “Prediction isn’t something that the brain does every now and then; it is an intrinsic property that never stops, and it serves an essential role in learning. When the brain’s predictions are verified, that means the brain’s model of the world is accurate. A mis-prediction causes you to attend to the error and update the model.”

Along the same lines, Andy Clark, a cognitive scientist and philosopher, argues in "The Experience Machine" that minds are primarily prediction machines: "the bulk of what the brain does is learn and maintain a model of body and world." Rather than (1) taking in information through our senses and (2) processing that sensory information to create a world model to experience and act upon, Clark proposes that minds (1) create a model of the world, and (2) update that model with information from the senses if reality differs from predictions.

Industry AI labs

Google DeepMind's mission is to "solve intelligence." OpenAI's mission is to "ensure that artificial general intelligence benefits all of humanity." Anthropic's mission is "to ensure transformative AI helps people and society flourish." Although the details of Gemini, GPT-4, and Claude have not been publicly shared, it seems likely that leading AI labs consider next token prediction a key aspect of building AGI. The GPT-3 paper states, "Our current objective weights every token equally and lacks a notion of what is most important to predict and what is less important" – implying a next token prediction pretraining objective – while Claude is reported to use a next token prediction pretraining objective as well.

Scaling and architecture also matter

Next token prediction could be a useful objective function for AGI, but an objective function alone is obviously not sufficient – a puny neural network with 3 parameters isn’t going to do much learning regardless of what its objective function is. Scale and architecture are critical. In his essay The Bitter Lesson, Rich Sutton observed that over the past 70 years, the most astonishing advances in AI have occurred via leveraging more computation, i.e. by scaling rather than handcrafting innovations based on human knowledge. He says, “We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.”

Architecture is also important. One incredibly potent advantage transformers have over RNNs or LSTMs is that their training computation parallelizes easily across GPUs/TPUs: an RNN must process a sequence one step at a time, whereas a transformer can process every position of a training sequence at once. Because of this parallelization, it takes less time to train transformers on more data.
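Here is a rough sketch of that difference (single-head attention and the tensor shapes below are simplifications I've chosen for illustration): the RNN cell has to walk the sequence step by step because each hidden state depends on the previous one, while causal self-attention computes every position in a few batched matrix multiplications.

```python
# A sketch of why transformers parallelize better than RNNs during training.
import math
import torch

batch, seq_len, dim = 4, 128, 64
x = torch.randn(batch, seq_len, dim)

# RNN-style processing: each hidden state depends on the previous one,
# so the sequence dimension must be walked one step at a time.
rnn_cell = torch.nn.GRUCell(dim, dim)
h = torch.zeros(batch, dim)
rnn_states = []
for t in range(seq_len):          # inherently sequential loop
    h = rnn_cell(x[:, t], h)
    rnn_states.append(h)

# Causal self-attention: all positions are computed in one batched matrix
# multiplication, so every timestep of the training sequence is processed at once.
w_q, w_k, w_v = (torch.nn.Linear(dim, dim) for _ in range(3))
q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(dim)          # (batch, seq_len, seq_len)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))    # can't "see the future"
attention_out = torch.softmax(scores, dim=-1) @ v          # (batch, seq_len, dim)
```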

Data also matters

Another critical ingredient is high quality data. Eran Malach argues, “the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.” But a netizen responded to this article with the salient point, “I would have hoped they would attribute LLM success to the structure of language itself. As the authors say, even small linear models can approximate CoT and solve complex tasks. So it’s not the model. It’s the data. […] It’s not the brain or neural net (the models) but the data that shapes them to become smart.”

Data is certainly crucial for humans. Children who are raised away from society ("feral children") are often incapable of later being taught how to speak or understand language, walk upright, use a toilet, or pay attention to other humans. (If you want to feel very sad, go look up stories of feral children.) Research focusing on data innovation has found that it's often possible to obtain high-performing, smaller models when the data is of particularly high quality. For example, the paper Textbooks Are All You Need introduced an LLM for code that achieved competitive performance despite having significantly fewer parameters. The secret was training on "textbook-quality" data.

Mystery missing pieces

There are also certainly numerous yet-to-be-discovered innovations that could help in the creation of AGI. Comparing current models to children suggests there are missing pieces.

From one angle, humans are trained on a lot less data: humans only store about 1.5 megabytes of information during language acquisition, an astonishingly puny number compared to the gargantuan sizes of LLM training datasets or the stored parameters of LLMs themselves. The fact that humans are exposed to less language than “basically the whole Internet” and store relatively little data, and yet manifest such mastery of language, suggests that there are exciting innovations yet to be discovered that could help us build AGI systems on less training data.

From another angle, humans are trained on a lot of data – and it's pretty different from the datasets we're currently using to train foundation models. Typical human children are raised in an incredibly data-rich environment, with video and audio streams running continuously for multiple years, in addition to input from all the other senses. What emergent intelligence would we observe in an embodied AI system that became really good at next token prediction using solely the training dataset of a typical child?

There may also be data filtering techniques that are useful for learning. Newborn human babies start out seeing only in blurry black and white, and it's only by around 4 months that a baby's color vision is fully developed. I suspect there may be some evolutionary, learning-related benefit to this progression.

Generated by DALLE2.

Conclusion

Am I just a next word predictor? Yes, of course! How else could language be produced, given my constraints as a member of the human species? Humans can’t speak a hundred words at the same time. We’re not telepathic, and we can’t communicate through “thought dumps.” (What kind of intelligence would’ve arisen if this were the case?)

It makes sense to me to consider that significant intelligence may arise via successfully predicting the next word, or the next sight, or the next sound. I hope this article has been thought-provoking – and not entirely predictable.

About the Featured Image

The featured image was generated by the author using DALLE2.