NVIDIA has been pushing hard into machine learning (ML). Calling that a trick up AMD's sleeve is not very accurate. On top of 2x FP16 performance on Turing, NVIDIA also has the Tensor Cores, which can do a lot of ML work "for free", while on AMD it costs shader performance. More machine learning tasks in mainstream software/games only play to NVIDIA's advantages in Turing, if anything.
We've all seen the marketing. But nobody has shown an actual reproducible benchmark. So, all those Tensor Cores: by how many TFLOPs do they boost the 2080 Ti's FP16 (27 TFLOPs)? Is the card's actual total FP16 performance 30 TFLOPs? I don't think so, otherwise it would be marketed as such. Or in reverse: if you take a general ML workload and run it on a 27 TFLOPs GPU, then run it via the Tensor Cores, how much faster will it be?

Spoiler: Well, Anand did this. Titan V has 30 TFLOPs of FP16 and 640 Tensor Cores. They account for additional throughput equal to 10.7 TFLOPs of FP16. Scaling that by Tensor Core count:

RTX 2060: 240 Tensor Cores, around 4 TFLOPs of FP16 used for ML, for a total FP16 ML figure of 16.9 TFLOPs.
RTX 2070: 288 Tensor Cores, around 4.8 TFLOPs of FP16 used for ML, for a total FP16 ML figure of 19.7 TFLOPs.
RTX 2080: 368 Tensor Cores, around 6.1 TFLOPs of FP16 used for ML, for a total FP16 ML figure of 26.2 TFLOPs.
RTX 2080 Ti: 544 Tensor Cores, around 9 TFLOPs of FP16 used for ML, for a total FP16 ML figure of 35.9 TFLOPs.

Those are not exactly crazy high values in comparison to Vega 64 with 25 TFLOPs of FP16 and Radeon 7 with 27 TFLOPs of FP16.

Now, as for the actual statement that running something on Tensor Cores comes "for free": it kind of does, but how much would it cost AMD? The RTX 2080, which is comparable in price to Radeon 7, has 6.1 TFLOPs of FP16 "for free" outside the shaders. For Radeon 7 that means it will have 20.9 TFLOPs left after doing the same ML workload (a sacrifice of 22.6% of shader time/performance).
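For anyone who wants to check the numbers in the post above, here is a minimal sketch of the same arithmetic. It assumes, as the post does, that Tensor Core throughput scales linearly with core count from the Titan V reference point (640 cores, about 10.7 TFLOPs of FP16-equivalent throughput). The function and variable names are made up for illustration, and this is not a benchmark, just the back-of-the-envelope estimate written out.

```python
# Back-of-the-envelope estimate only. Assumption: tensor throughput scales
# linearly with Tensor Core count, with Titan V as the reference point.
TITAN_V_TENSOR_CORES = 640
TITAN_V_TENSOR_TFLOPS = 10.7  # FP16-equivalent figure quoted in the post


def tensor_tflops(cores):
    """Estimated FP16-equivalent TFLOPs contributed by `cores` Tensor Cores."""
    return cores * TITAN_V_TENSOR_TFLOPS / TITAN_V_TENSOR_CORES


def shader_share_lost(shader_tflops, ml_tflops):
    """Fraction of shader throughput spent if the ML work runs on shaders."""
    return ml_tflops / shader_tflops


# Tensor Core counts as listed in the post.
cards = {"RTX 2060": 240, "RTX 2070": 288, "RTX 2080": 368, "RTX 2080 Ti": 544}

for name, cores in cards.items():
    print(f"{name}: ~{tensor_tflops(cores):.2f} TFLOPs from Tensor Cores")

# Radeon 7 (27 TFLOPs FP16) doing the RTX 2080's ~6.1 TFLOPs of ML on shaders.
radeon7_fp16 = 27.0
ml_load = 6.1
print(f"Radeon 7 left over: {radeon7_fp16 - ml_load:.1f} TFLOPs")
print(f"Shader share lost: {shader_share_lost(radeon7_fp16, ml_load):.1%}")
```

Run unrounded, the linear scaling gives slightly higher values for the RTX 2080 and 2080 Ti than the 6.1 and 9 quoted, so those figures look rounded down; it does not change the conclusion.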
Could they not also do a 12GB card as well? 12GB would be enough to cover needs for longer, for a bit more than 8GB.
I think the issue is that no one is making 2GB or 3GB stacks of HBM2 memory. As far as I know it's 4GB only. I have no idea of the costs involved, but it might end up being too costly to change, and that is why we are only getting 16GB of HBM2 at this stage.