AMD's Rome is rumored to be designed with 8 Zen2 dies - Can it look like this?

Discussion in 'Frontpage news' started by HWgeek, Sep 22, 2018.

  1. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    I just saw this article and thought: what could it look like if they choose to go with 8 dies + 1 I/O die?
    AMD's EPYC Rome 64C illustration by me:
    https://i.**********/wBXbr9Xj/AMD-_EPYC_7nm.jpg
    What do you think?
    I think it could be great: better yields, a cheaper cost per CPU, better heat dissipation and one memory channel per die, making it harder for Intel to compete.
    https://www.notebookcheck.net/AMD-s...ond-with-triple-die-Cooper-Lake.331644.0.html
     
    Last edited: Sep 22, 2018
  2. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,128
    Likes Received:
    971
    GPU:
    Inno3D RTX 3090
    Sounds completely implausible, but weirder things have happened.
     
  3. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    I think speculations like this belong in OnnA's thread rather than FP news.
    And if this were real... AMD's CPU division would start to look a little megalomaniac.
    I would also expect MCM for GPUs to show up here, but it looks like that's a different kind of story.
     
  4. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    Maybe this is why Intel has so much trouble with 10nm, because the yields are poor, and that's why AMD chose this path (8 cores per die): to overcome that problem and be first to market with 7nm chips.
    Would Infinity Fabric have a problem with such a design?
     
    Last edited: Sep 22, 2018

  5. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Considering the size of that IF logic in the middle, probably not a problem. But IF is not all-to-all communication, it is a channel between 2 points. So, either a lot of channels, or each die connecting to this "hub".
    That is, if the IF clock were decoupled from the IMC clock and AMD got it up to a higher clock... (as those EPYC chips are not exactly paired with highly clocked memory).
    Before, it was 1-2 hops for anything to reach any CPU. Here it would always be 2 hops. But double the IF clock and improve latency by doing the same thing in fewer cycles, and you have a winner.
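
    A toy calculation of that trade-off (the hop count, cycles per hop and clocks below are assumed round numbers for illustration, not real EPYC figures):

    def fabric_latency_ns(hops: int, cycles_per_hop: int, clock_ghz: float) -> float:
        # One fabric cycle takes 1/clock_ghz nanoseconds, so:
        return hops * cycles_per_hop / clock_ghz

    today   = fabric_latency_ns(2, 60, 1.33)   # ~90 ns: 2 hops at a memory-coupled fabric clock
    doubled = fabric_latency_ns(2, 60, 2.66)   # ~45 ns: the same 2 hops at double the clock
    print(f"{today:.0f} ns vs {doubled:.0f} ns")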
     
  6. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    Just remembered that IBM did something crazy like this while I was in high school :)
    IBM's POWER CPUs:
    [IMG]
     
    Evildead666 likes this.
  7. H83

    H83 Ancient Guru

    Messages:
    5,508
    Likes Received:
    3,034
    GPU:
    XFX Black 6950XT
    Wouldn't it be better to make a 16-core CPU die and then glue 4 of those together to achieve the same objective without such a complicated design? I ask this because the communication between all those dies is a potential nightmare.
    Also, the link provided says Intel's answer is to glue two 28-core CPUs together; how the hell are they going to cool those???
     
  8. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    Don't forget that in the current EPYC and TR 2, each die is connected to every other die.
    So for 4 of them, there are 3 I/O links per die used for communication with the other dies.

    The reason for this is that even with non-local memory access, the data is only ONE hop away in all cases, so the latency stays somewhat acceptable.

    With 8 dies it would be very difficult to implement properly. You can't have 8*7/2 = 28 copies of the high-speed interconnect on the package; that's just too much for the substrate, as well as an insane number of connections on the dies themselves to support 7 links at once! That's also a lot of transistors to route everything. Impractical.

    If you instead allow two hops, the number would be 8*3/2 = 12 copies of the high-speed interconnect (compared to just 6 in EPYC), so doable... but it would increase latency considerably.
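
    Just to sanity-check that counting, a throwaway Python sketch (the helper names are mine, nothing AMD-specific):

    from math import comb

    def full_mesh_links(n_dies: int) -> int:
        # Every die wired to every other die: n choose 2 links.
        return comb(n_dies, 2)

    def partial_mesh_links(n_dies: int, links_per_die: int) -> int:
        # Each die exposes a fixed number of links; every link joins two dies.
        return n_dies * links_per_die // 2

    print(full_mesh_links(4))        # 6  -> 4-die EPYC, fully connected
    print(full_mesh_links(8))        # 28 -> a full mesh of 8 dies, impractical
    print(partial_mesh_links(8, 3))  # 12 -> 8 dies with 3 links each, the 2-hop layout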

    16 cores per die at 7nm should not be a problem whatsoever! Intel is capable of doing 28 of them at 14nm, so I don't see why AMD can't do 16 on a much smaller process.

    P.S.
    Mouse-drawn schematic of how a hypothetical 8-die / 2-hop NUMA chip could look:
    [IMG]

    Note that only 3 of the interconnects "intersect", which allows a somewhat simpler substrate design.
    Also, by moving the dies around a bit, the maximum line length could be shortened, resulting in a more stable signal, a higher interconnect frequency and lower latency.

    But don't forget: 2 hops...
    Even if it's not needed today, I can imagine 8-die chips of the future using such a configuration; it's the only way to keep expanding computing power!
     
    Last edited: Sep 22, 2018
    Lilith likes this.
  9. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    That's why there is a 9th piece of silicon in the middle of the 8 CPU dies. All of them would likely connect to the central one, maybe with an interface twice as wide as 4-die EPYC's, since that still saves one connection per die. And it boosts the transfer rate.
    The question is, since the central die is huge compared to the CPU dies, does it contain all the IMCs? That would be a pretty impressive trick.
     
  10. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    That design would be horribly inefficient, basically going back to the old concept of an FSB and a memory controller in the north bridge, from the times of the Pentium 4/Core 2 or Athlon XP (the A64 already had an IMC, which is why it was so much faster).
    Such an idea would be instantly rejected by any modern silicon engineer...

    You want each "Uncore" (L3 cache, Memory Controller) to be directly linked to RAM for the lowest possible latency.

    In that "central hub" concept none of the cores would have direct access to RAM, and also chip-to chip would ALWAYS be Two-hops.
    In my drawing with 12 links, at least 12 inter-chip communication lanes are direct, with the other 16 possible communication routes are 2-hop. It's still better than all 28 being 2-hop, with ALL non-local memory.

    Note: The number 28 comes from the combinations formula nCr = n! / (r! * (n - r)!), in this case 8! / (2! * (8 - 2)!) = 28.
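
    For the curious, those route counts can be double-checked with a toy graph search. The specific 12-link wiring below (a ring of 8 dies plus links between opposite dies) is my own assumption, picked only because it gives a 2-hop worst case; it is not necessarily the exact layout in the drawing above:

    from collections import deque
    from itertools import combinations

    N = 8
    adj = {d: {(d - 1) % N, (d + 1) % N, (d + 4) % N} for d in range(N)}  # 3 links per die

    def hops(src: int, dst: int) -> int:
        # Breadth-first search: fewest die-to-die hops between src and dst.
        seen, queue = {src}, deque([(src, 0)])
        while queue:
            die, dist = queue.popleft()
            if die == dst:
                return dist
            for nxt in adj[die] - seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

    dists = [hops(a, b) for a, b in combinations(range(N), 2)]
    print(dists.count(1), dists.count(2), max(dists))   # 12 direct, 16 two-hop, worst case 2 hops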
     

  11. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Fotce @500/600 8x1p
    I made that image; I used a die diagram as a reference and tried to remove the I/O blocks and put them in the 9th die in the middle.
    I think you are right, and that's why AMD would use the 9th die to connect all the dies and reduce the latency.
    https://i.**********/V64myXj5/950px-amd_zen_octa-core_die_shot_annotated.png
     
  12. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    HWgeek, with fast digital electronics the issue is not the physical distance, but how many times a signal needs to "stop and forward".
    Electrical signals travel at roughly 1/3 of the speed of light, so distances inside a computer aren't what slows things down. The problem is that distance DEGRADES signal quality, so the signal arrives at the destination different from how it was sent.

    For example, PCIe works equally fast whether the video card is placed right next to the CPU at 20 cm or a meter away through long riser cables, up to the point where the signal is so degraded that it stops working completely!

    Inside a chip, where everything is very close together, the degradation is almost non-existent, so that's not an issue... BUT it requires logic to read the signal and forward it to the next stage.

    For example: Core ALU -> L1 cache -> L2 cache -> L3 cache -> Memory controller -> DRAM inner controller -> RAM data lines.
    Each step introduces extra latency.

    Your "design" moves the Memory controller away from L3 cache and into another chip, so another step (or two even) is needed:
    ... L3 cache-> Infinity Fabric node (Core Die) -> Infinity fabric node (Center Die)-> Memory controller -> ...
    Same goes for PCIe.
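
    To put rough numbers on that, a sketch with entirely made-up, equal-cost steps (10 ns per hand-off is not a measured figure for any real chip; it just shows how the extra stops add up):

    ON_DIE_PATH = ["L3 cache", "memory controller", "DRAM"]
    HUB_PATH    = ["L3 cache", "IF node (core die)", "IF node (center die)",
                   "memory controller", "DRAM"]

    def path_latency_ns(path, ns_per_step=10):
        # Each hand-off between neighbouring stages costs one step of latency.
        return (len(path) - 1) * ns_per_step

    print(path_latency_ns(ON_DIE_PATH), path_latency_ns(HUB_PATH))   # 20 vs 40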

    That is a very bad idea today, and as I said, it's turning back the clock.
    Not to mention they would have to completely re-engineer the Zen die, which currently contains both the DDR4 controller AND the PCIe links.

    Basically the entire chip would need to be rebuilt from the ground up to move those functions out of it and into something else.
    I hope you now understand why this will not happen, not now, not ever, and why those rumors are complete gibberish from people who don't understand how these things work.

    Let me show you what the Zen 2 chip will look like inside:
    [IMG]

    It is what makes the most sense from an engineering standpoint: it's the most economically viable, it's the easiest to redesign, and it can be done quickly.
    Most if not all of the logic is kept the same; just the internal addressing is changed to support doubling all the core resources, while keeping everything outside the same.
     
    HandR and HWgeek like this.
  13. user1

    user1 Ancient Guru

    Messages:
    2,777
    Likes Received:
    1,299
    GPU:
    Mi25/IGP
    The website I saw this idea floating around on claimed the "hub" is 14nm, which might explain why.

    There is a reason to do something like this, mainly that newer processes are getting more expensive, and using smaller chips would greatly improve effective yields.

    If this were a real chip design, I doubt it's Rome; I'd guess it's something much bigger than 64 cores. Then the latency trade-off might be worth it.
     
    Last edited: Sep 22, 2018
  14. Evildead666

    Evildead666 Guest

    Messages:
    1,309
    Likes Received:
    277
    GPU:
    Vega64/EKWB/Noctua
    The problem with the memory controller in the north bridge was mostly due to the distance from the CPU and the terribly slow interconnect to the NB.
    If you have the north bridge on the package but not on the die, it's still a very short hop, and if you use an interposer, you can have a pretty wide and fast channel to the CPUs.
    Trace length is the killer here: keep it short and fast/wide.

    I'd say they'll probably keep the memory controllers on die but offload the interconnects, and really just make it a fast switching hub between the CPUs.

    Also, if you decide to keep the three comms lanes per CPU die, you can have three bi-directional channels to the comms die, potentially allowing concurrent requests to other CPUs and alleviating some of the latency.
     
  15. Evildead666

    Evildead666 Guest

    Messages:
    1,309
    Likes Received:
    277
    GPU:
    Vega64/EKWB/Noctua
    This brings back the yield problem though.
    Bigger chips are more complex, so there is more to go wrong, and yields drop.
    If they could just shrink Zen to 7nm, that would give quite a size reduction.
    They could stitch two 7nm dies together for the desktop chips, like Presler (Pentium D) did, except they would talk directly through an interposer rather than going through a distant NB.
    That would give 16 cores on desktop on AM4/5. Not sure you could adequately feed that with two DDR4 memory controllers, but maybe with DDR5?
    Smaller dies = better yields = lower costs (a rough sketch of that arithmetic follows at the end of this post).
    I doubt AMD will go back to making big dies right now; they seem to be in good shape to keep going down the MCM route, eventually with GPUs too.
    There's no reason why the CPU comms chip should be that much different from a GPU comms chip.
    In fact, on an APU I would expect it to communicate between the CPU and GPU, maybe even being the interface for the memory as well in that case... (although the APU is better served with it all integrated into one chip, the one case where a separate SKU is a better choice).
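
    The promised sketch of the "smaller dies = better yields" point, using the textbook Poisson yield model; the defect density and die areas are assumed round numbers, not real 7nm figures:

    from math import exp

    DEFECT_DENSITY = 0.2   # defects per cm^2 (assumed)

    def die_yield(area_cm2: float, d0: float = DEFECT_DENSITY) -> float:
        # Chance that a die of the given area has zero defects.
        return exp(-d0 * area_cm2)

    print(f"{die_yield(4.0):.0%} of ~400 mm^2 monolithic dies come out clean")   # ~45%
    print(f"{die_yield(0.5):.0%} of ~50 mm^2 chiplets come out clean")           # ~90%
    # Good chiplets can be binned and mixed on the package, so the small-die
    # yield applies per chiplet rather than needing eight good dies in one spot.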
     
    Luc likes this.

  16. Luc

    Luc Active Member

    Messages:
    94
    Likes Received:
    57
    GPU:
    RX 480 | Gt 710
    Radeon engineers have already talked about active interposers connecting two dies at low latency, but it would be expensive.

    Nvidia is using chips between cards in servers to do this kind of work, with nice results.

    If AMD manages to build an active interposer with a hub that can do the trick at low power, latency and cost, then it will be possible, but I'll wait to see it.
     
  17. MegaFalloutFan

    MegaFalloutFan Maha Guru

    Messages:
    1,048
    Likes Received:
    203
    GPU:
    RTX4090 24Gb
    No, not like that.
     
  18. Astyanax

    Astyanax Ancient Guru

    Messages:
    17,035
    Likes Received:
    7,378
    GPU:
    GTX 1080ti

    No, it wouldn't.
    Off-CPU implementations had higher latency and lower throughput because of the back and forth over PCB traces to a separate chip; the IOX structure is far better than this.
     
  19. Lilith

    Lilith Member Guru

    Messages:
    125
    Likes Received:
    71
    GPU:
    nvidia GTX 1660ti
    Mouse-drawn schematic of how a hypothetical 8-die / 2-hop NUMA chip could look:
    [IMG]

    Far be it from me to be the one to say this, but that is the cutest schematic drawing I've ever seen :rolleyes: lol. Sorry, carry on :p
     
    sverek likes this.
  20. Anthony Paull

    Anthony Paull Guest

    Messages:
    1
    Likes Received:
    0
    GPU:
    GTX 1080ti 11GB
    Lmfao. Got egg on your face, mate. We've had confirmation from AMD that this is how it will be done. I'm not going to go into why you're wrong about latency, but rest assured it won't be as you say ;)
    https://arstechnica.com/gadgets/201...uture-7nm-gpus-with-pcie-4-zen-2-zen-3-zen-4/
     
