I am not an AI algorithm engineer, but the recent developments in China's AI models—from last month's rising star, Deepseek, to this month's newcomer, Manus—have had a significant impact on the demand for AI chips (at least within China). Therefore, today I will attempt a simple analysis based on my current understanding of their potential impact on AI computing power demand.
Firstly, let’s try to analyze the negative impact of Deepseek’s optimization algorithms on NVIDIA chip demand. The following discussion will focus on three key technologies used by Deepseek: MLA (Multi-head Latent Attention), MTP (Multi-token Prediction), and Sparse MoE (Mixture of Experts).
We know that large language models rely on relationships between any two tokens as the basis for prediction. Before Deepseek, most large language models used MHA (Multi-head Attention), which captures relationships between tokens through multiple dimensions rather than a single scalar value. For example, one dimension might indicate whether two tokens belong to the same language, another might represent their semantic association, and another might determine whether they have a logical negation relationship. By using such multi-dimensional representations, models can capture the complex and diverse relationships between tokens more comprehensively. In mainstream large language models, 10~30 heads are typically used to ensure sufficient information representation.
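For readers who prefer code, here is a minimal NumPy sketch of standard multi-head attention; the number of heads, dimensions, and random weights below are purely illustrative, not any particular model's configuration:

```python
# Minimal multi-head attention (MHA) sketch: each head scores token-to-token
# relationships along its own learned projection. All sizes are illustrative.
import numpy as np

def multi_head_attention(x, num_heads=4):
    """x: (seq_len, d_model) token embeddings; returns (seq_len, d_model)."""
    d_model = x.shape[1]
    d_head = d_model // num_heads
    rng = np.random.default_rng(0)
    outputs = []
    for _ in range(num_heads):
        # Each head gets its own Q/K/V projections, so it can capture a different
        # kind of relationship (language, semantics, negation, ...).
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d_head)               # (seq_len, seq_len) pairwise scores
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
        outputs.append(weights @ V)                      # weighted mix of value vectors
    return np.concatenate(outputs, axis=-1)              # heads concatenated back to d_model

tokens = np.random.default_rng(1).standard_normal((100, 64))  # 100 tokens, d_model = 64
print(multi_head_attention(tokens).shape)                     # -> (100, 64)
```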
However, constrained by limited GPGPU computing power, China's Deepseek adopted an MLA (Multi-head Latent Attention) approach, which retains only the most critical feature vectors in the matrix, thereby reducing the number of vectors that need to be computed. To give a simplified example (ignoring low-rank joint compression and RoPE computation): in a traditional large language model, for an input sequence of 100 tokens, each token needs to compute relationships with the other 99 tokens; in contrast, MLA compresses the original 100-token sequence into only ~5 of the most important feature vectors and then performs the matrix computations on those 5 vectors, reducing the computational load by approximately 95%. A knowledgeable reader might ask: theoretically, the computational cost of attention is O(n²d), where n is the sequence length, so reducing n from 100 to 5 should decrease FLOPs by (100² − 5²) / 100² = 99.75%. In practical engineering, however, additional computation is needed for selecting the key feature vectors, multi-layer aggregation, the feedforward networks, and layer normalization. So while the theoretical reduction is 99.75%, the actual reduction is around 95%, which is still a significant improvement.
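As a toy illustration of this simplified picture (not Deepseek's actual MLA implementation, which relies on low-rank joint compression of keys and values), imagine mixing the 100 token vectors down to 5 latent vectors before computing pairwise attention scores:

```python
# Toy sketch of the simplified example above: compress 100 token vectors into
# 5 latent vectors, then compare the pairwise-score counts. Purely illustrative.
import numpy as np

seq_len, d_model, num_latents = 100, 64, 5
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))            # 100 token vectors

# Hypothetical learned compression: mix the 100 token vectors into 5 latent vectors.
W_compress = rng.standard_normal((num_latents, seq_len))
latents = W_compress @ x                               # (5, d_model)

# Pairwise attention scores before vs. after compression.
full_pairs = seq_len ** 2                              # 100 x 100 = 10,000
latent_pairs = num_latents ** 2                        # 5 x 5 = 25
print(f"theoretical reduction: {(full_pairs - latent_pairs) / full_pairs:.2%}")  # 99.75%
```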
Next, we need to introduce the concept of the KV Cache (Key-Value Cache). Anyone familiar with AI hardware knows that, besides the GPGPU itself, the biggest bottleneck in computing power is HBM (High Bandwidth Memory). This is directly related to the KV Cache, a data structure that stores the keys and values used in attention mechanisms. When predicting the next token, a large language model must compute relationships between the current token and all previous tokens. This involves using the current token's query (Q) vector to search for matching key (K) vectors from past tokens and extracting the corresponding value (V) vectors based on relevance. To speed up this process, keys and values must be stored in a highly accessible location, i.e., HBM. The complexity of the KV Cache arises from several factors: 1) keys and values transform at each layer, requiring separate storage for each layer; 2) if multiple heads are used, keys and values need to be stored per layer, per head, and per token.
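Here is a minimal sketch of such a cache, organized per layer, per head, and per token, together with a rough size estimate. The layer count, head count, and FP16 assumption below are illustrative, not Deepseek's actual configuration:

```python
# Minimal KV cache sketch plus a back-of-the-envelope size estimate.
# All shapes and dtypes are assumptions for illustration.
num_layers, num_heads, d_head, bytes_per_elem = 32, 16, 128, 2  # FP16 assumed

class KVCache:
    def __init__(self):
        # cache[layer][head] holds one (key, value) pair per generated token
        self.cache = [[[] for _ in range(num_heads)] for _ in range(num_layers)]

    def append(self, layer, head, k, v):
        self.cache[layer][head].append((k, v))

    def num_tokens(self):
        return len(self.cache[0][0])

def kv_bytes(num_tokens):
    # 2 tensors (K and V) x layers x heads x tokens x head dim x bytes per element
    return 2 * num_layers * num_heads * num_tokens * d_head * bytes_per_elem

print(f"KV cache for a 32k-token context: {kv_bytes(32_768) / 1e9:.1f} GB")  # ~8.6 GB
```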
Since HBM capacity is limited, Deepseek's MLA (Multi-head Latent Attention) technology helps compress keys and values, thereby reducing the storage required for the KV Cache. However, it is important to note that MLA only reduces HBM usage during the inference stage, not the training stage, because during training the KV Cache is not the main content stored in HBM. In addition to the KV Cache, HBM primarily stores:
· Model parameters
· Optimizer states
· Gradients
· Token vocabularies (which can reach the millions in multimodal models)
Among these, optimizer states consume the most memory, typically 3~5 times the size of the model parameters. For example, in the classic Adam optimizer, each parameter requires additional storage for momentum (the first-order moment) and variance (the second-order moment), plus some auxiliary states, which is how the total reaches 3~5 times the parameter size.
For example, for Deepseek R1's 671-billion-parameter model, the optimizer states alone, under the assumption of int8 storage, would require a memory footprint of 5 times the model's parameter size. In reality, some components still require higher precision (e.g., FP16/FP32) to maintain numerical stability and prevent gradient explosion or vanishing, so actual memory usage may exceed this estimate.
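A quick back-of-the-envelope calculation under these assumptions (int8 weights, FP16 gradients, and optimizer states at the upper end of the 3~5x rule of thumb; the per-component byte counts are my assumptions, not measured values):

```python
# Rough training-time HBM estimate for a model of DeepSeek R1's scale.
# Byte counts per component are assumptions for illustration only.
params = 671e9                      # 671B parameters

weights_gb   = params * 1 / 1e9     # int8 weights: 1 byte per parameter (assumed)
grads_gb     = params * 2 / 1e9     # FP16 gradients: 2 bytes per parameter (assumed)
optimizer_gb = params * 5 / 1e9     # optimizer states at ~5x parameter size (upper end of 3~5x)

print(f"weights:   {weights_gb:,.0f} GB")
print(f"gradients: {grads_gb:,.0f} GB")
print(f"optimizer: {optimizer_gb:,.0f} GB")
print(f"total before activations / KV cache: {weights_gb + grads_gb + optimizer_gb:,.0f} GB")
```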
On the other hand, during the inference stage, optimizer states and gradients are not stored, so MLA can effectively reduce KV Cache-related memory consumption. Nevertheless, if Deepseek later develops a native multimodal model, its vocabulary size could expand by 10-20x, offsetting some of these HBM savings.
Next, let's look at another technology adopted by Deepseek: MTP (Multi-token Prediction). As the name suggests, multi-token prediction allows the model to predict multiple upcoming tokens at once. Current industry tests show that predicting fewer than 5 tokens per step achieves optimal results; beyond this, accuracy drops significantly. In practice, Deepseek chose to predict the next 2 tokens per step. As a result, compared to predicting 1 token at a time, predicting 2 tokens at a time reduces floating-point computation demand by approximately 30%. It is not a 50% reduction because, when the predicted tokens are incorrect, the model needs to go back and recompute, which still consumes meaningful computational power. Additionally, MTP helps reduce data transmission latency: compared to the original method, where each token required sending its own data packet, 2 tokens are now sent per packet, halving the total number of data packets. This alleviates the network congestion caused by routing data through switches.
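A toy simulation of the "not 50%" argument: each forward pass proposes 2 tokens, and a rejected second token has to be regenerated in the next pass. The acceptance rate below is my own assumption, chosen so the savings land near the ~30% figure; it is not a number reported by Deepseek:

```python
# Toy simulation: predict 2 tokens per forward pass; if the 2nd token is wrong,
# it must be regenerated, so the savings fall short of 50%.
import random

random.seed(0)
num_tokens = 100_000
acceptance_rate = 0.45   # assumed fraction of 2nd-token predictions that are kept

baseline_steps = num_tokens          # baseline: 1 token per forward pass
mtp_steps = 0
produced = 0
while produced < num_tokens:
    mtp_steps += 1                   # one forward pass proposes 2 tokens
    produced += 1                    # 1st token is always kept
    if random.random() < acceptance_rate:
        produced += 1                # 2nd token kept
    # else: 2nd token rejected and regenerated in the next pass

print(f"compute saved vs. 1 token per step: {1 - mtp_steps / baseline_steps:.0%}")  # ~30%
```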
Last but not least, let's explain the Sparse MoE architecture adopted by Deepseek. Before Deepseek, traditional large language models primarily used a Dense MoE architecture, in which a relatively small number of large experts are activated per step. For example, each layer may contain 100 experts, with 80 of them activated per step. This approach does not require complex routing optimizations and ensures stable performance. Deepseek's Sparse MoE architecture takes the opposite approach: it has a large number of small experts, but only a few are activated at a time. For example, instead of activating 80 out of 100 experts, Sparse MoE activates only 8 out of 100. To determine which 8 experts should be activated, a separate gating model is needed to assign each token to the appropriate experts. This method also helps reduce computing power demand.
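Here is a minimal sketch of such top-k gating, where a small gating network scores all 100 experts for each token and only the 8 highest-scoring experts are activated (sizes and weights are illustrative):

```python
# Minimal top-k sparse gating sketch: score all experts, activate only the top 8.
import numpy as np

num_experts, top_k, d_model = 100, 8, 64
rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, num_experts))   # gating network weights

def route(token_embedding):
    logits = token_embedding @ W_gate                  # one score per expert
    top_ids = np.argsort(logits)[-top_k:]              # indices of the 8 highest-scoring experts
    weights = np.exp(logits[top_ids] - logits[top_ids].max())
    weights /= weights.sum()                           # softmax over the selected experts only
    return top_ids, weights                            # which experts fire, and how to mix them

token = rng.standard_normal(d_model)
experts, mix = route(token)
print("activated experts:", sorted(experts.tolist()))  # 8 of the 100 experts
```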
However, the drawbacks of Sparse MoE are also evident. Since only a limited number of experts are activated at a time, some experts may not be trained sufficiently over time. To solve this, a balancing network is required during the training stage to spread expert activations as evenly as possible, and penalties must be applied to discourage uneven expert utilization. During the inference stage, if certain experts are accessed far more frequently than others, those expert nodes can become bottlenecks in the system. To address this problem, Deepseek developed the EPLB (Expert Parallel Load Balancer) technique, which dynamically adjusts GPGPU resources to allocate additional copies of high-load experts during inference.
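To make the load-balancing idea concrete, here is an illustrative sketch (not Deepseek's actual EPLB code) that counts how often each expert is hit and gives the hottest experts extra replicas on spare GPU slots:

```python
# Illustrative load-balancing sketch: replicate the most-requested experts
# onto spare GPU slots so they stop being bottlenecks during inference.
from collections import Counter

def plan_replicas(expert_hits: Counter, spare_slots: int) -> dict:
    """Assign extra replicas to the most-requested experts, one per spare slot."""
    replicas = {expert: 1 for expert in expert_hits}        # every expert starts with 1 copy
    for expert, _ in expert_hits.most_common(spare_slots):
        replicas[expert] += 1                               # hottest experts get a 2nd copy
    return replicas

# Made-up traffic pattern: experts 7 and 42 are much hotter than the rest.
hits = Counter({7: 5_000, 42: 4_200, 3: 800, 19: 750, 88: 600})
print(plan_replicas(hits, spare_slots=2))   # {7: 2, 42: 2, 3: 1, 19: 1, 88: 1}
```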
Now, let's estimate the potential impact of Deepseek's optimization algorithms on NVIDIA chip demand in China:
Here, I assume an optimistic scenario in which all urban residents in China (~900 million people) eventually become accustomed to using AI chatbots for daily searches. This means a Daily Active User (DAU) base of 900 million. Additionally, I assume that the concurrent usage rate of chatbots is 1%, meaning that at any given moment there are 900 million × 1% = 9 million concurrent sessions.
According to NVIDIA's latest published data on the Deepseek R1 model (DeepSeek-R1 Now Live With NVIDIA NIM), an NVIDIA H200 GPU, using FP8 precision for inference, can achieve a throughput of 484 tokens per second. Deepseek's official paper indicates that its chatbot outputs an average of 20~22 tokens per second. From this, we can calculate that each NVIDIA H200 GPU can handle roughly 484 / 21 ≈ 23 concurrent sessions. Thus, to support 9 million concurrent chatbot users, a total of 9 million / 23 ≈ 390,000 NVIDIA H200 GPUs would be required.
Similarly, Deepseek's official paper states that, after applying its full suite of optimizations, an NVIDIA H800 GPU (the China-specific edition) achieves a throughput of 1,850 tokens per second using FP8 precision for inference. From this, we can calculate that each NVIDIA H800 GPU can handle 1,850 / 21 ≈ 88 concurrent sessions. Thus, to support 9 million concurrent chatbot users, only 9 million / 88 ≈ 102,000 NVIDIA H800 GPUs would be required. This represents a roughly 75% reduction in the number of GPUs required compared to the scenario without Deepseek's optimizations!
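The arithmetic behind these two scenarios, using the throughput figures quoted above and my DAU and concurrency assumptions:

```python
# Reproducing the GPU-count arithmetic above. Throughput figures are the ones
# quoted in the text; the DAU and concurrency inputs are my assumptions.
dau = 900e6                      # assumed daily active users
concurrency_rate = 0.01          # assumed 1% concurrent usage
concurrent_users = dau * concurrency_rate          # 9 million concurrent sessions

tokens_per_user = 21             # ~20-22 tokens/s per chatbot session

for gpu, throughput in [("H200 (NIM, FP8)", 484), ("H800 (DeepSeek-optimized, FP8)", 1850)]:
    sessions_per_gpu = throughput / tokens_per_user
    gpus_needed = concurrent_users / sessions_per_gpu
    print(f"{gpu}: ~{sessions_per_gpu:.0f} sessions/GPU -> ~{gpus_needed:,.0f} GPUs")
```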
Nevertheless, this is not the only surprise China's AI startups have brought to the industry this year. Just a few weeks ago, another Chinese AI startup, Manus, launched a remarkably smooth AI agent product. This has not only generated considerable buzz in the industry but also led investors to ponder: in the foreseeable future, how much additional AI compute demand could a popular AI agent product create? Here, I will provide a general framework for analysis, for readers' reference.
The Manus AI agent's processing flow is as follows:
When a user submits a query, a planner model first generates a master plan and then breaks it down into step-by-step plans. Each plan contains multiple tasks, and each task calls different functions (such as a web browser, a Linux system, or Claude's coding capability) to complete subtasks. The output of each subtask is stored in a file system (which requires a sandbox system), so that users can review the results of each subtask later. Finally, since Manus integrates Claude's coding functionality, the AI agent's final output can be customized by users into different formats (such as apps, document reports, or web pages).
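Based purely on the description above, a heavily simplified and hypothetical sketch of such an agent loop might look like this (every class, function, and tool name here is invented for illustration; this is not Manus's actual architecture or API):

```python
# Hypothetical agent-loop sketch: planner -> steps -> tasks -> tool calls ->
# sandboxed file system -> final output. Everything here is a stand-in.
from dataclasses import dataclass

@dataclass
class Task:
    id: str
    tool: str          # e.g. "browser", "shell", "code"
    instruction: str

@dataclass
class Step:
    tasks: list

def planner_model(query: str) -> list[Step]:
    # Stand-in for the planner LLM: returns a fixed two-step plan for illustration.
    return [
        Step(tasks=[Task("t1", "browser", f"search the web for: {query}")]),
        Step(tasks=[Task("t2", "code", "summarize findings into a report")]),
    ]

def run_tool(task: Task) -> str:
    # Stand-in for real tool calls (web browser, Linux shell, code model, ...).
    return f"[{task.tool}] completed: {task.instruction}"

def run_agent(query: str) -> str:
    sandbox_fs = {}                                   # stand-in for the sandboxed file system
    for step in planner_model(query):                 # master plan broken into steps
        for task in step.tasks:                       # each step contains multiple tasks
            sandbox_fs[task.id] = run_tool(task)      # persist every subtask's output
    return "\n".join(sandbox_fs.values())             # final, user-reviewable output

print(run_agent("compare the top 3 EV makers' Q4 deliveries"))
```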
The key factor here is that the entire process contains dozens of small tasks, many of which require calling vision large language models to capture and interact with web browsers. Compared to text-based models, vision models require significantly more output throughput (measured in tokens per second). I mentioned earlier that Deepseek's chatbot has an average output speed of ~21 tokens per second; a vision AI agent, however, could require 500~1,000 tokens per second. Of course, Manus is not a pure vision AI agent but rather a hybrid text-and-vision agent, so I make a conservative assumption here that Manus has an average output speed of ~200 tokens per second. Under this assumption, a single NVIDIA H800 GPU, even after Deepseek's full suite of optimizations, can handle only 1,850 / 200 ≈ 9 concurrent sessions.
Moreover, users' actual tests of Manus show that a complex task typically takes the Manus AI agent 20-30 minutes to complete, whereas a chatbot query usually takes only 4-5 minutes. In theory, this means the concurrent usage rate of AI agents should be about 5~6 times higher than that of chatbots. Here, I conservatively assume that AI agent users have a concurrent usage rate of ~5% (5 times the chatbot rate). Assuming that all ~200 million white-collar workers in China become daily active users (DAU) of AI agents in the future, the number of concurrent sessions would be 200 million × 5% = 10 million. For reference, China's WPS (the Chinese equivalent of Microsoft Office) already has over 600 million monthly active users, so this assumption does not seem overly aggressive. Using the same GPU calculation as before, to support 10 million concurrent AI agent users, a total of 10 million / 9 ≈ 1.1 million NVIDIA H800 GPUs would be required. This means that, compared to the earlier Deepseek chatbot case, the number of GPUs required increases by nearly 10×!
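Rerunning the same sizing arithmetic with the AI agent assumptions above (all inputs are my assumptions, not measured data):

```python
# AI-agent version of the sizing arithmetic. All inputs are assumptions from the text.
agent_dau = 200e6                 # assumed: China's white-collar workers as DAU
agent_concurrency_rate = 0.05     # assumed: 5% concurrent usage (5x the chatbot rate)
tokens_per_agent_session = 200    # assumed: hybrid text+vision agent output speed
h800_throughput = 1850            # tokens/s per DeepSeek-optimized H800 (FP8), as above

concurrent_sessions = agent_dau * agent_concurrency_rate          # 10 million
sessions_per_gpu = h800_throughput / tokens_per_agent_session     # ~9
gpus_needed = concurrent_sessions / sessions_per_gpu

print(f"~{sessions_per_gpu:.0f} agent sessions/GPU -> ~{gpus_needed/1e6:.1f} million H800s")
```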
This shows that although Deepseek's full set of optimization algorithms has reduced the AI computing power demand for large language model inference by ~75%, AI agents like Manus could increase inference-related AI compute demand by ~1,000%! Net-net, the future demand for AI computing power will still increase significantly. Of course, one critical question remains: when will AI agents become commonly used apps among the general public? I take the optimistic view that, given the fast iteration speed of Chinese software engineers, a blockbuster AI agent application will definitely emerge in China in the near future.
Special thanks to a good friend of mine from the University of Toronto for his contribution to this article : )