Measure Intelligence by Speed

A metric to track the exponential growth of AI

Haifeng Jin

Interview

100%

50%

100%

Say we have two candidates. The interviewer gives each of them 45 minutes to solve the same tough coding problem. The first candidate finishes 100% of the problem on time, while the second one struggles a little and only finishes 50% of it. Normally, at this point, we would just fail the second candidate. However, the interviewer says, "Let's double the time for the second candidate." The second candidate then uses another 45 minutes to finish 100% of the problem. In the end, the interviewer says, "Since both of you finished 100% of the problem, I will give both of you the same score."

Is this fair?

Benchmark	Gemini 2.5 Pro	OpenAI o3	OpenAI o4-mini	...
Humanity's Last Exam	21.6%	20.3%	14.3%	...
GPQA (single)	86.4%	83.3%	81.4%	...
GPQA (multiple)	—	—	—	...
AIME (single)	88.0%	88.9%	92.7%	...
AIME (multiple)	—	—	—	...
LiveCodeBench	69.0%	72.0%	75.8%	...
Aider Polyglot	82.2%	79.6%	72.0%	...
SWE-bench (single)	59.6%	69.1%	68.1%	...
SWE-bench (multiple)	67.2%	—	—	...
SimpleQA	54.0%	48.6%	19.3%	...
...	...	...	...	...

Why does it matter?

Customer Service

Autonomous Driving

Why now?

2012

AlexNet

ChatGPT

Llama

Qwen

DeepSeek

Let's pretrain!

Let's fine-tune!

Let's just serve!

Let's buy tokens!

And then came the ChatGPT moment: a single model that can do all sorts of tasks. People said, let's pretrain our own large language models so that we do not need to pay OpenAI. Then Llama was released to the public as an open-source model, and people started to think, let's just fine-tune it. We don't have to train our own models; it's too expensive. Then the models became more and more capable, and the Qwen models were among them. These models can solve a lot of problems out of the box without any fine-tuning. People started to say, let's drop our training infra completely and just serve the model as-is. Then came the DeepSeek moment. Its per-token price was extremely low, which forced competitors like Google and OpenAI to lower their token prices to stay competitive. People said, let's just buy tokens. Maintaining a team of engineers for our serving infra is too expensive. Every time I look at my paycheck, I think, I am too expensive. Why don't you just fire me and buy tokens instead?

What did we learn?

Buy Tokens!

More Capable

More General

A Mental Shift

Model

Service

This is a shift from using models to using services. By model, I mean the neural architecture and its parameters: something that is purely static and could be written on a piece of paper. By service, I mean running the models on GPUs to produce tokens and putting that behind an API. We used to interact with models directly: we pretrained them, fine-tuned them, and served them. But now we only use a well-encapsulated service to get tokens, and we no longer know the hardware, the framework, or the models behind it. For example, GPT-5 dynamically routes to different models based on the query, so we don't even know which model is used. Now that the usage pattern has shifted from using models to using services, the evaluation should shift as well. And to properly evaluate a service, we have to include speed, because speed is an integral part of a service.

Why now?

Speed Metric for AI

Tokens

Tasks

per

Second

More TPS == Faster ?

Test-Time Scaling

Smaller Model
More Tokens

==

Larger Model
Fewer Tokens

Implications

Benchmarks != intelligence

Tokens != Tasks

The first is that benchmark datasets can no longer measure intelligence. A less intelligent model can finish a task in 20 steps, while a more intelligent model can finish the same task in 3 efficient steps. They are obviously at different intelligence levels, but both complete the task, so they get the same benchmark score. So benchmarks alone can no longer measure intelligence. The second implication is that it decouples the number of tokens from the number of tasks. To do the same task, different models may use different numbers of tokens. So, how long does this AI service take to finish my task? The tokens-per-second metric can no longer answer that. Let's see if the actual data matches these claims.
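To make the second implication concrete, here is a minimal sketch of why a higher tokens-per-second figure does not imply a faster task. The token counts and throughput numbers are hypothetical, chosen only for illustration, and the helper `task_seconds` is not from the talk.

```python
def task_seconds(tokens_per_task: float, tokens_per_second: float) -> float:
    """Wall time to finish one task, ignoring queueing and network latency."""
    return tokens_per_task / tokens_per_second


# Hypothetical numbers, for illustration only.
# A smaller model that streams quickly but needs a long chain of thought.
small_model = task_seconds(tokens_per_task=20_000, tokens_per_second=200)
# A larger model that streams slowly but solves the task concisely.
large_model = task_seconds(tokens_per_task=3_000, tokens_per_second=60)

print(f"small model: {small_model:.0f} s per task at 200 tokens/s")
print(f"large model: {large_model:.0f} s per task at 60 tokens/s")
```

With these made-up numbers, the service with more than three times the tokens per second still takes twice as long to finish the task.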

A New Metric

Intelligence
Goodput

Intelligence Time

Disambiguation

Time: Wall Time

Intelligence:
Delegate to benchmarks

Intelligence Goodput:

$$ G = \frac{\sum\limits_{i=1}^{n} w_i s_i}{\sum\limits_{i=1}^{n} w_i \cdot t} $$

Here is how we measure it in practice. It is pretty similar to how we calculate a GPA. In this equation, $G$ is the intelligence goodput. We run the service through $n$ benchmarks; benchmark $i$ has a weight $w_i$, and the score achieved on it is $s_i$. We take the weighted average of all the benchmark scores and divide it by the total wall time elapsed during the exams, which is the $t$ here. I know this is too simple. Hardly anyone would believe this is from someone with a PhD. However, I am also an engineer, and I tend to go for the simplest solution that works. It accurately captures the intelligence in the output and filters out everything else, because if you look into the benchmark datasets, they are mostly choice-based questions. We only score the final result. Any other tokens, like a long chain of thought, do not contribute to the scores at all.
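The equation can be sketched in a few lines of Python. This is only an illustration of the formula as written; the function name, the example scores, the weights, and the time unit are assumptions, not values from the talk.

```python
from typing import Sequence


def intelligence_goodput(
    scores: Sequence[float],
    weights: Sequence[float],
    wall_time: float,
) -> float:
    """G = (sum_i w_i * s_i) / (t * sum_i w_i).

    scores    -- s_i, the score achieved on each benchmark
    weights   -- w_i, the weight given to each benchmark
    wall_time -- t, total wall time spent on all benchmarks (any fixed unit)
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equal in length")
    if wall_time <= 0:
        raise ValueError("wall_time must be positive")
    weighted_average = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return weighted_average / wall_time


# Hypothetical example: two benchmarks, the first weighted three times
# as heavily as the second, finished in 4 time units overall.
print(intelligence_goodput(scores=[90.0, 50.0], weights=[3.0, 1.0], wall_time=4.0))
```

Because only the final scores and the wall time enter the calculation, a service that spends extra tokens on reasoning is penalized through the time term and nowhere else.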

Models	Intelligence	Time	Intelligence Goodput
Grok 4 Fast	60	2.7d	254.88
GPT-5 Medium	66	3.8d	202.85
Gemini 2.5 Flash	54	3.1d	199.27
GPT-5 High	68	7.9d	100.00
Gemini 2.5 Pro	60	7.5d	92.67
Claude 4.5 Sonnet	63	7.9d	91.75
Grok 4	65	40.2d	18.71
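The last column of the table looks like intelligence divided by time, scaled so that GPT-5 High equals 100. That scaling is an inference from the numbers, not something stated in the talk. The sketch below recomputes the column from the rounded values shown in the table, so the results only approximate the published figures (they land within a few percent).

```python
# (intelligence score, wall time in days), copied from the table above.
results = {
    "Grok 4 Fast": (60, 2.7),
    "GPT-5 Medium": (66, 3.8),
    "Gemini 2.5 Flash": (54, 3.1),
    "GPT-5 High": (68, 7.9),
    "Gemini 2.5 Pro": (60, 7.5),
    "Claude 4.5 Sonnet": (63, 7.9),
    "Grok 4": (65, 40.2),
}


def relative_goodput(table: dict, baseline: str) -> dict:
    """Score per day for each model, scaled so the baseline model is 100.

    Assumption: this mirrors how the table's last column was normalized.
    """
    base_score, base_days = table[baseline]
    base = base_score / base_days
    return {
        name: 100.0 * ((score / days) / base)
        for name, (score, days) in table.items()
    }


relative = relative_goodput(results, baseline="GPT-5 High")
for name, value in sorted(relative.items(), key=lambda item: -item[1]):
    print(f"{name:<18} {value:7.2f}")
```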

Reduce Verbosity

Limitations

1. Complex engineering setup

2. Expensive to run

3. Ignored tokens

Another Problem with TPS:

Multi-Modal

A normal day in

1990

MS - DOS Version 6.22
(C) Copyright Microsoft Corp 1981 - 1990.

Human-Computer Interaction

2020

2025

Human-Computer Interaction

Text

Audio

Image

Video

???

Measure the speed of AI

We can put these metrics into four quadrants based on two axes. The x-axis goes from single-modal to multi-modal. The y-axis goes from quantitative to qualitative. The tokens-per-second metric is in the bottom-left quadrant. It is single-modal because it only works with text today, and it is quantitative because it only measures the number of tokens. The intelligence goodput metric is in the top-left quadrant. It is still single-modal because it only works with text today, but it is qualitative because it measures the intelligence in the output. The top-right quadrant is where we want to be: a speed metric that works with multi-modal inputs and outputs and also measures the intelligence in the output. So, is it possible to make intelligence goodput work with multi-modal output? As long as we have benchmark datasets for multi-modal tasks, it is possible. For example, we would need a benchmark dataset that evaluates the intelligence of an image generator. It does not exist today, and I highly doubt it will ever be created. So maybe we should explore what is in the bottom-right quadrant: a speed metric that works across modalities but only measures the quantity of the output. Maybe we can find something interesting there.

Intelligence
Bandwidth

KiloBytes per Second
(KB/s)

The x-axis is time, that is, when the model was released to the public. The y-axis is intelligence bandwidth in KB/s. Each dot is a model. I collected the most popular models from 2023 to 2025, including text, image, and video generation models. There are a few observations we can make here. First, text models are pretty low in KB/s. They mostly range from 0 KB/s to 3 KB/s, which is as expected. Second, the Gemini 2.5 Flash image generator is an outlier, much higher than the other image generators. The third observation is the most interesting one: video generation models, like Veo 3, are lower than most of the image generators today. I believe this is mostly because the serving technology is not yet well optimized, since video generation is rather new compared to other modalities like text and image generation.
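The talk does not spell out how the KB/s figure is computed. A minimal sketch, assuming intelligence bandwidth is simply the size of the generated output in bytes divided by the wall time to produce it, with 1 KB taken as 1000 bytes. The function name and all sample sizes and timings are hypothetical.

```python
def bandwidth_kb_per_s(output_bytes: int, seconds: float) -> float:
    """Output size over wall time, in KB/s (assuming 1 KB = 1000 bytes)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return output_bytes / 1000 / seconds


# Hypothetical text response: 3000 bytes of UTF-8 generated in 2 seconds.
text_output = ("hello " * 500).encode("utf-8")
print(bandwidth_kb_per_s(len(text_output), seconds=2.0))

# Hypothetical image response: a 1.5 MB file generated in 10 seconds.
print(bandwidth_kb_per_s(1_500_000, seconds=10.0))
```

Measuring bytes rather than tokens is what lets one number cover text, images, and video, at the cost of saying nothing about the quality of those bytes.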

Good Metric

Number of Transistors

Compute Performance (FLOPS)

Network Bandwidth

Intelligence Bandwidth (KB/s)

Growth Pattern

Moore's Law

Huang's Law

Nielsen's Law

[???]'s Law

And we are not stopping at a few experimental results and observations. A good metric should help us discover new growth patterns. For example, the number of transistors, as a simple metric, helped us discover Moore's Law, which says that the number of transistors on a chip doubles every two years. The compute performance of AI, measured in floating-point operations per second, led to Huang's Law, proposed by Jensen Huang, the CEO of NVIDIA, which says that FLOPS grow faster than Moore's Law predicts. Network bandwidth led to Nielsen's Law: the internet bandwidth we use increases 50% every year. Are these laws useful? Of course they are. For example, Nielsen's Law could predict the emergence of YouTube by predicting when internet bandwidth would be high enough for users to download videos easily. So, what about intelligence bandwidth? Is there a new growth pattern we can discover about AI?

Jin's Law

The peak AI output rate (KB/s)
doubles every year.

Human AI Interaction

Self-Paced

Text

Image

Fixed Speed

Audio

Video

There are two main types of human-AI interaction: self-paced and fixed-speed. Text and images are self-paced, where users go at their own speed. Audio and video are fixed-speed, where users just let the output play at a fixed playback speed. For self-paced modalities, AI's output speed needs to exceed human reading speed to enable real-time interaction. Take text, for example. We as humans can read at about 200 to 300 words per minute, while AI can generate 14,000 words per minute, which is way faster than human reading speed. So it allows us to have real-time text-based interactions with AI. For fixed-speed modalities, AI's output speed needs to match the playback speed for real-time interaction. Take audio, for example. As long as the model can generate 1 second of audio in 1 second, we can have real-time audio interactions with AI. Today, the best model can generate 100 seconds of audio in 1 second, so we can already have audio interactions with AI. For images and videos, we are not there yet. But we can use Jin's Law to predict when it will happen.
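The two real-time conditions can be written as simple ratios. This is a sketch using the figures quoted in the talk (words per minute for text, 100 seconds of audio in 1 second, and the 8 seconds of video in about 60 seconds from the predictions slide); the function names are made up for illustration.

```python
def self_paced_headroom(ai_rate: float, human_rate: float) -> float:
    """How many times faster the AI produces output than a human consumes it."""
    return ai_rate / human_rate


def realtime_factor(media_seconds: float, generation_seconds: float) -> float:
    """Seconds of playable media produced per second of generation.

    A value of at least 1.0 means generation keeps up with playback.
    """
    return media_seconds / generation_seconds


# Text: 14,000 words per minute against a fast reader at 300 words per minute.
print(f"text headroom: {self_paced_headroom(14_000, 300):.1f}x")

# Audio: 100 seconds generated in 1 second.
print(f"audio real-time factor: {realtime_factor(100, 1):.1f}")

# Video: 8 seconds generated in about 60 seconds.
print(f"video real-time factor: {realtime_factor(8, 60):.2f}")
```

Text and audio clear the threshold of 1.0 by a wide margin, while video is still well below it.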

Predictions

Images in text responses
in 1 year

Predictions

Real-time video interactions
in 3 years

2025: 8s generated in ~60s
2028: 8s generated in 8s = $\frac{64}{2^3}$
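The arithmetic behind this prediction can be sketched as follows, assuming output speed doubles every year as Jin's Law states. The function name is hypothetical; the slide rounds the measured ~60 seconds up to 64 so that three doublings land exactly on 8 seconds.

```python
import math


def years_until_realtime(
    generation_seconds: float,
    playback_seconds: float,
    doubling_period_years: float = 1.0,
) -> float:
    """Years until generation time shrinks to playback time.

    Assumes generation speed doubles once per doubling period.
    """
    if generation_seconds <= playback_seconds:
        return 0.0  # already real-time
    doublings = math.log2(generation_seconds / playback_seconds)
    return doublings * doubling_period_years


print(years_until_realtime(64, 8))            # the slide's rounded figure
print(round(years_until_realtime(60, 8), 2))  # the measured ~60 s
```

If the doubling period turns out to be longer than one year, the same function gives the revised estimate by changing `doubling_period_years`.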

Limitations

1. The metrics are too simple

2. The doubling period

3. The growth plateau

There are three main limitations of this work. First, the approximations. The three methods we have proposed to approximate usefulness are all pretty straightforward. They can definitely be improved. We are just here to point out the problem and provide a basic solution to kick-start the discussion. They are nowhere near perfect. Second, the doubling period. Currently, we estimate that intelligence bandwidth doubles every year, but this estimate is based on the limited number of samples we could collect from the past 3 years. As we collect more data, we may arrive at a more accurate estimate of the doubling period. Third, the growth of intelligence bandwidth may plateau someday. Just as many people say that Moore's Law is coming to an end, there will be an end to Jin's Law. It will not last forever either. There are mainly two risks.

Risks to exponential growth

1. AI bubbles

2. Energy supply

The number one risk to the exponential growth is the AI bubble. We saw some hundred-million-dollar packages from Meta to poach talent from other big tech companies. This is not sustainable. If the bubble bursts, the growth of AI will slow down. Thank you, everyone, for rejecting the 100 million packages to stay with Google. The second risk to the exponential growth is the energy supply. We are building data centers that consume more energy than we can produce. You have all heard of the Stargate project from OpenAI. Projects like this are all over the country. And 20% of them will be delayed in connecting to the power grid due to the shortage of power supply. So there are risks that the exponential growth may not be sustainable.

Takeaway

Processing speed is
an essential component of
intelligence.

Training Recipe

Model

Data

} AI {

Framework

Kernels & Compiler

Hardware

I would like to conclude my presentation with this slide. We used to think that the training recipe, the model architecture, and the data are the most important things for improving the intelligence of AI. But now we know that the frameworks, the kernels and compilers, and the hardware are equally important, because they all contribute to the processing speed of AI. Even if you are just making AI more efficient for a specific application, you are improving its intelligence. So, if you ever ask yourself, am I helping improve the intelligence of AI? With everything we have discussed today, the answer should be: yes!
