TTT models are an emerging AI architecture that could push generative models past the limits of transformers, processing far more data without the same computational cost.
After years of the transformer’s dominance in AI, the search for new architectures is underway.
Transformers are the foundation of OpenAI’s video-generating model Sora, as well as text-generating models such as Claude from Anthropic, Gemini from Google, and GPT-4o from OpenAI.
However, they are running up against technical hurdles, particularly when it comes to computation.
Transformers are not particularly effective at processing and analyzing large quantities of data, at least when operating on off-the-shelf hardware.
This is resulting in significant and potentially unsustainable increases in power demand as companies construct and expand infrastructure to meet the needs of transformers.
Test-time training (TTT) is a promising architecture developed over the course of a year and a half by researchers at Stanford, UC San Diego, UC Berkeley, and Meta.
The research team asserts that TTT models are capable of processing significantly more data than transformers, and they can do so without consuming nearly as much computing capacity.
Transformers’ hidden state
The “hidden state” is a fundamental component of transformers, and it is essentially a lengthy catalog of data. As the transformer processes data, it adds entries to the hidden state to “remember” what it has just processed.
For example, if the model is working through a book, the hidden state will consist of representations of words (or portions of words).
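In rough code terms, you can picture the hidden state as a list that gains an entry for every token the model reads. The sketch below is purely illustrative; the names and sizes are invented rather than taken from any particular implementation.

```python
# Illustrative sketch only: the hidden state modeled as a list that gains one
# entry per processed token, much like a transformer's growing key/value cache.
# EMBED_DIM, process_token, and the random stand-in embeddings are all invented.
import numpy as np

EMBED_DIM = 64                         # hypothetical embedding size
rng = np.random.default_rng(0)
hidden_state: list[np.ndarray] = []    # the "lookup table"

def process_token(token_id: int) -> None:
    """Append a stand-in representation of the token to the hidden state."""
    hidden_state.append(rng.standard_normal(EMBED_DIM))

for token_id in range(10_000):         # e.g. the tokens of a book
    process_token(token_id)

print(len(hidden_state))               # 10000: the table grows with every token read
```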
Yu Sun, a post-doc at Stanford and a co-contributor to the TTT research, stated to TechCrunch, “If you consider a transformer to be an intelligent entity, then the lookup table, which is its hidden state, is the transformer’s brain.”
“This specialized brain facilitates the well-established capabilities of transformers, including in-context learning.”
The hidden state is part of what makes transformers so powerful. But it also hinders them.
For a transformer to “say” even a single word about a book it has just read, the model would have to scan its entire lookup table, a task as computationally demanding as rereading the entire book.
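To make that cost concrete, here is a rough sketch assuming a plain attention-style lookup (a simplification, not the researchers’ code): producing one output token means comparing a query against every stored entry, so the work per token grows with everything already read.

```python
# Rough cost sketch under a simplified attention-style lookup; the table of
# random vectors stands in for a book the model has already read.
import numpy as np

EMBED_DIM = 64
rng = np.random.default_rng(0)
table = [rng.standard_normal(EMBED_DIM) for _ in range(10_000)]

def next_token_scores(query: np.ndarray) -> np.ndarray:
    # One dot product per stored entry: O(len(table)) work for a single output token.
    return np.array([query @ entry for entry in table])

scores = next_token_scores(rng.standard_normal(EMBED_DIM))
print(scores.shape)  # (10000,): every cached entry is touched to "say" one word
```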
Therefore, Sun and his team proposed replacing the hidden state with a machine learning model itself, like nested dolls of AI, or a model within a model.
It may sound a bit technical, but the basic idea is that, unlike a transformer’s lookup table, the TTT model’s internal machine learning model does not grow as it processes additional data.
Instead, it encodes the data it processes into representative variables known as weights, which is what makes TTT models highly performant. No matter how much data a TTT model processes, the size of its internal model stays the same.
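Here is a minimal sketch of that idea, assuming a toy self-supervised reconstruction loss as a stand-in objective (this is not the authors’ code or their exact loss): the growing table is replaced by a small weight matrix that is nudged with a gradient step for each new piece of input, so memory stays constant no matter how long the stream gets.

```python
# Minimal test-time-training-style sketch: the "memory" is a fixed-size weight
# matrix W, updated by one gradient step per input on a toy reconstruction loss
# L(W) = ||W x - x||^2. Sizes, learning rate, and loss are illustrative assumptions.
import numpy as np

EMBED_DIM, LR = 64, 0.01
W = np.zeros((EMBED_DIM, EMBED_DIM))     # the inner model's weights: fixed size

def update(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One test-time gradient step: dL/dW = 2 (W x - x) x^T."""
    grad = 2 * np.outer(W @ x - x, x)
    return W - LR * grad

def read(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Query the compressed memory for the current input."""
    return W @ x

rng = np.random.default_rng(0)
for _ in range(10_000):                  # an arbitrarily long stream of tokens
    x = rng.standard_normal(EMBED_DIM)
    W = update(W, x)                     # absorb the token into the weights

print(W.shape)  # (64, 64): unchanged, no matter how much data streamed through
```

The shape check at the end is the point: the memory is the weight matrix itself, so it never grows the way a lookup table does.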
Sun believes that future TTT models could efficiently analyze billions of pieces of data, including words, images, audio recordings, and videos, far beyond what today’s models can handle.
Sun said, “Our system can say X words about a book without the computational complexity of rereading the book X times. Sora and other large video models based on transformers can only process 10 seconds of video, because they only have a lookup table ‘brain.’ Our eventual goal is to develop a system that can process a long video resembling the visual experience of a human life.”
Skepticism about TTT models
So, will TTT models eventually replace transformers? They could. But it is too early to say for certain.
TTT models are not a drop-in replacement for transformers. And the researchers developed only two small models for the study, so TTT as a method is hard to compare right now with some of the larger transformer implementations out there.
Mike Cook, a senior lecturer in the informatics department at King’s College London who was not involved in the TTT research, stated, “I believe it is a thoroughly intriguing innovation. If the data supports the assertion that it provides efficiency gains, that is excellent news. However, I am unable to determine whether it is superior to existing architectures.”
“When I was an undergraduate, an old professor of mine would tell a joke: ‘How do you solve any computer science problem? Add a layer of abstraction.’ Adding a neural network inside a neural network certainly reminds me of that.”
Still, the accelerating pace of research into transformer alternatives points to a growing recognition that a breakthrough is needed.
This week, the AI startup Mistral released Codestral Mamba, a model based on state space models (SSMs), an alternative to the transformer. Like TTT models, SSMs are more computationally efficient than transformers and can scale to larger volumes of data.
AI21 Labs is also exploring SSMs, as is Cartesia, which developed some of the first SSMs as well as Codestral Mamba’s namesakes, Mamba and Mamba-2.
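For context, the defining trait of an SSM is a fixed-size state that is updated recurrently as input streams in. The sketch below shows a generic linear SSM recurrence, not Codestral Mamba’s specific architecture; all matrices and sizes are arbitrary placeholders.

```python
# Generic linear state space model recurrence (an illustration, not Mamba itself):
# h_t = A h_{t-1} + B x_t, y_t = C h_t. The state h has a fixed size, so each
# step costs constant work and memory, unlike a growing attention cache.
import numpy as np

STATE_DIM, INPUT_DIM = 16, 8                              # arbitrary sizes
rng = np.random.default_rng(0)
A = 0.05 * rng.standard_normal((STATE_DIM, STATE_DIM))    # state transition
B = rng.standard_normal((STATE_DIM, INPUT_DIM))           # input projection
C = rng.standard_normal((INPUT_DIM, STATE_DIM))           # output projection

h = np.zeros(STATE_DIM)                 # fixed-size state, however long the sequence
for _ in range(100_000):                # an arbitrarily long input stream
    x = rng.standard_normal(INPUT_DIM)
    h = A @ h + B @ x                   # constant work per step
    y = C @ h                           # per-step output

print(h.shape)  # (16,): the state never grows with sequence length
```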
If these efforts succeed, generative AI could become even more accessible and widespread than it is now, for better or worse.