GDDR6X at the limit? Over 100 degrees measured in the chip on the GeForce RTX 3080 FE! | Investigative
The latest memory chips, such as Micron’s GDDR6X modules on the GeForce RTX 3080, allow Tjunction, the chip temperature used internally for protective mechanisms such as clocking down, to be read out with suitable software. That is a nice addition in itself, and in principle anyone could do it. But knowing what this value really represents and how high the readings turn out could leave many people alarmed and anxious. This is precisely why the value is not exposed in the normal sensor loop. The Founders Edition cooler is not bad per se, but the fact that NVIDIA clocks the memory at only 19 Gbps is certainly also due to thermal reasons.
As a reminder: in the launch article I wrote that the memory could hardly be clocked stably above 20 Gbps and even suddenly slowed down again at the limit. Several colleagues have observed this behavior too, so I wanted to get to the bottom of the cause. As a little food for thought, here again is the infrared image of the back of the circuit board, with the 84 °C hotspot at the hottest point, where a memory module is located. And yes, this is not due to the memory alone, which stays cool enough in other places, but also to the heating from the six NVDD voltage converters, which sit far too close.
Thermal resistances and temperatures in different places
Interestingly, Micron is completely silent on GDDR6(X): even the “Device Thermal Information” enclosed with the GDDR6 documentation annoyingly still ends with GDDR5. For its GDDR5 modules the manufacturer specifies a maximum Tjunction of 100 °C, which seems entirely plausible and fits the specified maximum “Operation Temperature” of 95 °C. But this is exactly where the uncertainty begins as to which temperature applies where, why, and how hot things really get.
When I asked around among colleagues, for example in R&D departments, the consensus was that the maximum temperature before possible destruction of the chip begins should be 120 °C, and that the maximum Tjunction should be specified as 105 °C for GDDR6 or even 110 °C for GDDR6X. But let’s first look at the thermal scheme of such a GDDR6X module. An interesting starting point is PT, the maximum power that is supplied as electrical energy and almost completely released again as heat (see red arrow).
That should be around 2.5 to 3 watts per module. This sounds like little at first, but given the small feature size and the resulting heat density it is a substantial figure, especially if the circuit board underneath is already quite hot. Even though the memory module may look quite large as a package, the chip itself is rather tiny. The package simply needs a lot of space for all the connections, and it is also meant to remain backwards compatible:
This is also where TJ, i.e. Tjunction, comes into play. Maximum chip temperature and maximum power dissipation are directly related here. This is exactly the value that AMD, for example, also reports as the memory temperature in the sensor loop. I asked AMD at the time and learned that it is not an average across all modules but the absolute peak value, i.e. the Tjunction of the hottest module on a card. The values marked with the other two red arrows are also important: PB (Pboard), the power dissipated via the board, and PC (Pcase), the heat dissipated via the top of the package.
In addition, there are the thermal resistances of the individual layers, and of the layers that belong together, as guide values for the path upwards and for the path downwards through the board, as well as the ambient (air) temperatures TA (Tair) above and below, which can differ if a water block comes into play on top. But more on that in a moment.
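To make the relationship between these quantities concrete, here is a minimal sketch of the scheme as a two-path thermal network: one path from the junction up through the case, one down through the board. The function name and all resistance values are hypothetical placeholders chosen for illustration. They are not Micron specifications and not measurements from this article.

```python
def junction_temperature(p_total, r_top, r_bottom, t_air_top, t_air_bottom):
    """Steady-state Tjunction for one heat source with two parallel paths.

    p_total      : electrical power turned into heat (W), PT in the text
    r_top        : summed thermal resistance junction -> case -> air (K/W)
    r_bottom     : summed thermal resistance junction -> board -> air (K/W)
    t_air_top    : ambient temperature above the package (deg C)
    t_air_bottom : ambient temperature below the board (deg C)

    Returns (t_junction, p_case, p_board).
    """
    g_top = 1.0 / r_top        # thermal conductance of the upward path
    g_bottom = 1.0 / r_bottom  # thermal conductance of the downward path
    # Heat balance at the junction:
    # (Tj - Ta_top) * g_top + (Tj - Ta_bottom) * g_bottom = p_total
    t_j = (p_total + g_top * t_air_top + g_bottom * t_air_bottom) / (g_top + g_bottom)
    p_case = (t_j - t_air_top) * g_top          # PC, heat leaving via the top
    p_board = (t_j - t_air_bottom) * g_bottom   # PB, heat leaving via the board
    return t_j, p_case, p_board


if __name__ == "__main__":
    # Hypothetical example: 3 W per module, both paths 30 K/W, 40 deg C air on both sides.
    t_j, p_case, p_board = junction_temperature(3.0, 30.0, 30.0, 40.0, 40.0)
    print(f"Tjunction = {t_j:.1f} C, Pcase = {p_case:.2f} W, Pboard = {p_board:.2f} W")
```

The sketch also shows why a hot board matters: raising `t_air_bottom` (or, in reality, the board temperature near the voltage converters) raises Tjunction directly, even if the module's own power stays the same.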
The crux for a tester like me
Until now the crux has been, on the one hand, the very sparse (public) availability of specifications and, on the other hand, the lack of an (official) way to measure inside a module yourself. But wait! In the meantime I can also read out the GDDR6X temperatures, more precisely the temperature of the hottest module. For certain reasons I will not go into detail here, especially since the software suitable for this is intended only for internal use by engineers. Even though I am not subject to any NDA in this regard, I will act as if I were and will neither offer nor redistribute anything for public download. This is simply a matter of honor and of source protection where signed software is concerned, so there is no point in asking.
Test system and setup
As always, I “tropicalized” the back of the circuit board, i.e. coated it with a transparent varnish that is used in industry to protect against environmental factors such as high humidity and whose emissivity has been measured at approx. 0.95 and is therefore known. If a factor of 1 were applied here instead, the reported temperature would be significantly lower. The wafer-thin special film attached to the bench table has a transmission factor of approx. 0.97, which I also take into account in the measurement. This lets me carry out a sound temperature analysis of the relevant surfaces with the Optris PI640, since the resolution of the built-in bolometer is sufficiently high at 640 x 480 real measuring points.
With the tropicalization already described, I would classify the measured values as reliable. Even with the good equipment I would allow for a tolerance of about 0.5 to 1 degree, but no more. The load is Witcher 3 in UHD, which I let run for 30 minutes before I finally measure. The room temperature is 22 °C and, as an exception, the setup is open because I need a constant ambient temperature. You could already see the difference between the memory modules on the circuit board (picture at the top). Please keep in mind the 84 °C from above and from the launch article.
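Why the emissivity and transmission factors matter can be illustrated with a simplified correction. The sketch below assumes a total-radiation (Stefan-Boltzmann, T^4) model and assumes that both the reflected background and the film are at ambient temperature. A real bolometer camera works in a limited spectral band with its own calibration, so this is only an approximation of what the camera software does, and the function name is hypothetical.

```python
KELVIN_OFFSET = 273.15


def true_surface_temp_c(apparent_c, emissivity=0.95, transmission=0.97, ambient_c=22.0):
    """Correct a blackbody-equivalent reading for emissivity and film transmission.

    Simplified signal model (assumption, not the camera's exact algorithm):
        T_app^4 = tau*eps*T_obj^4 + (1 - tau*eps)*T_amb^4
    where eps is the surface emissivity and tau the film transmission.
    """
    t_app = apparent_c + KELVIN_OFFSET
    t_amb = ambient_c + KELVIN_OFFSET
    eff = emissivity * transmission  # combined factor tau * eps
    t_obj4 = (t_app ** 4 - (1.0 - eff) * t_amb ** 4) / eff
    return t_obj4 ** 0.25 - KELVIN_OFFSET


if __name__ == "__main__":
    reading = 80.0  # hypothetical uncorrected reading in deg C
    print(f"eps=1.00, tau=1.00: {true_surface_temp_c(reading, 1.0, 1.0):.1f} C")
    print(f"eps=0.95, tau=0.97: {true_surface_temp_c(reading):.1f} C")
```

With both factors set to 1 the reading passes through unchanged. With the values used here (0.95 and 0.97) the corrected surface temperature comes out higher than the raw reading, which is the effect described above.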
Measurement of Tjunction in memory
The chart now shows the development of the memory temperature, which I was able to read out with the in-house software. After warm-up, everything remains constant from minute 8 onward, so this can be taken as the final temperature, which no longer changes even after 30 minutes. The hottest module on the IR image sits in the immediate vicinity of the voltage converters and reaches an internal Tjunction of 104 °C. This results in a delta of 20 degrees between the chip and the underside of the board.
Earlier experiments with a water block and many backplate padding variants have revealed other interesting influences. If you cool just the memory on the back with a good pad between the backplate and the circuit board, Tboard drops by up to 4 degrees, while Tjunction drops by 1 to 2 degrees. With water cooling, Tcase is also significantly lower than Tboard, which could make rear cooling all the more interesting. With the RTX 3080 FE, however, it is advisable to cool only the significantly hotter voltage converters, because these are in the immediate vicinity. Incidentally, if you object that I removed the backplate, I can reassure you: even in the completely assembled original state, the hottest module still reaches 104 °C internally.
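The arithmetic behind these figures is simple enough to write down. The sketch below uses only numbers quoted in this article: the measured 104 °C and 84 °C, and the limits reported by colleagues (105 °C for GDDR6, 110 °C for GDDR6X, 120 °C before possible damage). The variable and function names are made up for illustration, and the limits are hearsay values rather than official specifications.

```python
T_JUNCTION_C = 104.0       # hottest module, read out internally
T_BOARD_HOTSPOT_C = 84.0   # IR hotspot on the underside of the board

# Limits as reported by colleagues, not official vendor specifications.
LIMITS_C = {
    "GDDR6 Tjunction max": 105.0,
    "GDDR6X Tjunction max": 110.0,
    "possible damage": 120.0,
}


def headroom(measured_c, limits):
    """Return the remaining margin in degrees for each named limit."""
    return {name: limit - measured_c for name, limit in limits.items()}


delta_chip_to_board = T_JUNCTION_C - T_BOARD_HOTSPOT_C

if __name__ == "__main__":
    print(f"Delta chip to board underside: {delta_chip_to_board:.0f} K")
    for name, margin in headroom(T_JUNCTION_C, LIMITS_C).items():
        print(f"Headroom to {name}: {margin:.0f} K")
```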
Summary and conclusion
It is no secret that memory modules can get significantly hotter inside than the outer surface on top of the package or the underside of the circuit board would suggest. If you take the maximum Tjunction of 110 °C for GDDR6X, the remaining 6 degrees up to the suspected throttling point are really not a big cushion. But even such a high value is no reason for premature panic if you understand how all the temperatures relate to each other.
Unfortunately, NVIDIA and the board partners are very tight-lipped about exactly how this value is used to regulate performance (throttling) or to trigger safety features such as shutdowns, but they will certainly not have gone to the effort for nothing. For my part, I will also read out the GDDR6(X) memory temperatures of the new Ampere cards in all upcoming tests. That, too, is a matter of honor.