The fastest RTX 40 could be 5 times more powerful than the RTX 3090 Ti

A little over a week ago, on Monday of last week, we covered a controversial and, on our part, entirely speculative topic: three hypotheses about how NVIDIA could reorganize the internal structure of the Ada Lovelace architecture and how that would affect the RTX 40. Today a leak suggests where Huang's company is heading and, above all, what performance the fastest RTX 40 could reach.

All three hypotheses shared one premise: the SM will change in Ada Lovelace. As we anticipated, the architecture will have little in common with what was seen in Hopper, which would confirm that NVIDIA is taking two entirely different approaches for the two architectures and that the next step is clearly an MCM chiplet design.

Ada Lovelace’s internal changes for the RTX 40

Once again the leaker Kopite7kimi is on the prowl, and the leak that has just been revealed contains one of the hypotheses we considered last week. Specifically, the improvements in the architecture that will power the RTX 40 center on an internal reorganization of the FP32 and INT32 units, where NVIDIA's move is the most logical and perhaps the least risky: combine all shaders into a single engine that handles both integers and floats.

In other words, there would be one complete group of shaders for FP32 and INT32. That could produce a higher shader count than expected, an impressive headline number that means less in real-world performance, as happened with the RTX 30.

1. Double the subcore to improve 2*FP32 efficiency.
2. There is 4*FP32 expansion space.
That’s my thought about ADA. pic.twitter.com/HAt48SP5RT

— kopite7kimi (@kopite7kimi) May 5, 2022

To understand the changes we have to go back to Pascal versus Turing, since that is where the first change took place: Turing gave each SM partition 16 FP32 operations and 16 INT32 operations per clock cycle on separate lanes. Ampere left that split behind. NVIDIA gave up the dedicated integer lanes to promote FP32 in every SM, so the second group can run either type and a partition can reach 32 FP32 operations per cycle. This is where the controversy over the “false” shader count arose, since NVIDIA doubled the number of FP32 operations, yes, but not the number of shaders as such.
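The lane counts above can be turned into a small back-of-the-envelope model. This is only a sketch: the function name and the `int32_fraction` parameter are hypothetical, and it assumes that Ampere's second group of 16 lanes runs either FP32 or INT32 in any given cycle, never both.

```python
# Rough per-clock FP32 model for ONE SM partition.
# Lane counts come from the paragraph above; everything else is an
# illustrative assumption, not NVIDIA's own description.
DEDICATED_FP32_LANES = 16   # always available for FP32
SECOND_GROUP_LANES = 16     # INT32-only on Turing, FP32-or-INT32 on Ampere


def avg_fp32_per_clock(arch: str, int32_fraction: float = 0.0) -> float:
    """Average FP32 operations per clock for one partition.

    int32_fraction is the share of cycles in which the second group
    is busy with integer work (0.0 = never, 1.0 = always).
    """
    if not 0.0 <= int32_fraction <= 1.0:
        raise ValueError("int32_fraction must be between 0 and 1")
    if arch == "turing":
        # The second group can never help with FP32.
        return float(DEDICATED_FP32_LANES)
    if arch == "ampere":
        # The second group adds FP32 only in cycles free of INT32 work.
        return DEDICATED_FP32_LANES + SECOND_GROUP_LANES * (1.0 - int32_fraction)
    raise ValueError(f"unknown architecture: {arch}")


if __name__ == "__main__":
    for share in (0.0, 0.5, 1.0):
        print(f"INT32 share {share:.0%}: "
              f"Turing {avg_fp32_per_clock('turing', share):.0f}, "
              f"Ampere {avg_fp32_per_clock('ampere', share):.0f} FP32 ops/clock")
```

Under this model the doubled figure only shows up when no integer work is in flight, which is the heart of the shader-count controversy.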

The fastest RTX 40 performance

The next step is to unify both engines into one with a very clear objective: to improve efficiency. Logically there will be no FP64, but there will be a dedicated FP32 and INT32 group that is also scalable, and this is where it gets really interesting.

Although the diagram shows a single group for these units, a closer look reveals two; they are unified as one in terms of function, not in terms of total count. The information leaked today suggests that these two groups could actually grow to as many as four. Given that the floating-point and integer units can work at the same time, the speculation is a whopping 100 TFLOPS at worst and up to 200 TFLOPS at best.

This idea is based on some certain information I can’t tell you now.
So 100T, 150T or 200TFLOPS is possible.

— kopite7kimi (@kopite7kimi) May 5, 2022

To put it in context, an RTX 3090 Ti currently reaches 40 TFLOPS, and that already includes the double-counting system discussed above. If NVIDIA used two unified FP32 and INT32 groups, the supposed RTX 4090 would be more than twice as fast as the company's current top of the range (100 / 40 = 2.5 times), while with four of them performance would shoot up to 5 times (200 / 40).
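The arithmetic behind these multiples is easy to check. The sketch below uses the usual peak-throughput estimate of two FLOPs per shader per clock (one fused multiply-add); the RTX 3090 Ti core count and boost clock are public spec-sheet values that do not appear in this article, so treat them as assumptions, and the function names are purely illustrative.

```python
# Back-of-the-envelope check of the TFLOPS figures in the text.
# Assumption (not from the article): RTX 3090 Ti = 10752 FP32 shaders
# at a 1.86 GHz boost clock, per public spec sheets.
RTX_3090_TI_SHADERS = 10752
RTX_3090_TI_BOOST_GHZ = 1.86

# Figures floated in the leak quoted in the article.
RUMORED_TFLOPS = (100.0, 150.0, 200.0)


def peak_fp32_tflops(shaders: int, clock_ghz: float) -> float:
    """Peak FP32 rate: 2 FLOPs (one FMA) per shader per clock."""
    return 2 * shaders * clock_ghz / 1000.0


def speedup(rumored_tflops: float, baseline_tflops: float = 40.0) -> float:
    """How many times faster a rumored figure is than the baseline."""
    return rumored_tflops / baseline_tflops


if __name__ == "__main__":
    baseline = peak_fp32_tflops(RTX_3090_TI_SHADERS, RTX_3090_TI_BOOST_GHZ)
    print(f"RTX 3090 Ti peak estimate: {baseline:.1f} TFLOPS")  # roughly 40
    for tflops in RUMORED_TFLOPS:
        print(f"{tflops:.0f} TFLOPS -> {speedup(tflops):.2f}x the 40 TFLOPS baseline")
```

These are theoretical peaks only; as the RTX 30 showed, a larger number on paper does not translate one-to-one into game performance.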

Logically, that would imply a monstrously large chip, which we are unlikely to see, but it indicates that NVIDIA has an ace up its sleeve, possibly not for Ada Lovelace but for its successors.