AMD's Rome is rumored to be designed with 8 Zen2 dies - Can it look like this?

Discussion in 'Frontpage news' started by HWgeek, Sep 22, 2018.

  1. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    I just saw this article and thought: what could it look like if they choose to go with 8 dies + 1 I/O die?
    AMD's EPYC Rome 64C illustration by me:
    https://i.**********/wBXbr9Xj/AMD-_EPYC_7nm.jpg
    What do you think?
    I think it could be great: better yields, a cheaper cost per CPU, better heat dissipation and one memory channel per die, making it harder for Intel to compete.
    https://www.notebookcheck.net/AMD-s...ond-with-triple-die-Cooper-Lake.331644.0.html
     
    Last edited: Sep 22, 2018
  2. PrMinisterGR

    PrMinisterGR Ancient Guru

    Messages:
    8,128
    Likes Received:
    971
    GPU:
    Inno3D RTX 3090
    Sounds completely implausible, but weirder things have happened.
     
  3. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    I think speculations like this belong in OnnA's thread rather than FP news.
    And if this were real... AMD's CPU division would start to look a little megalomaniac.
    I would also expect MCM for GPUs to show up here, but it looks like that's a different kind of story.
     
  4. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    Maybe this is why Intel has so much trouble with 10nm, because the yields are poor, and that's why AMD chose this path (8 cores per die): to overcome that problem and be first to market with 7nm chips.
    Would Infinity Fabric have a problem with such a design?
     
    Last edited: Sep 22, 2018

  5. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Considering the size of that IF logic in the middle, probably not a problem. But IF is not all-to-all communication, it is a channel between 2 points. So, either a lot of channels, or each die connecting to this "hub".
    That is, if the IF clock were decoupled from the IMC clock and AMD got it up to a higher clock... (as those EPYC chips are not exactly paired with highly clocked memory).
    Before, it was 1-2 hops for anything to reach any CPU. Here it would always be 2 hops. But double the IF clock and improve latency by doing the same thing in fewer cycles, and you have a winner.
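
    A toy calculation of that trade-off (the hop count, cycles per hop and clocks below are assumed round numbers for illustration, not real EPYC figures):

    def fabric_latency_ns(hops: int, cycles_per_hop: int, clock_ghz: float) -> float:
        # One fabric cycle takes 1/clock_ghz nanoseconds, so:
        return hops * cycles_per_hop / clock_ghz

    today   = fabric_latency_ns(2, 60, 1.33)   # ~90 ns: 2 hops at a memory-coupled fabric clock
    doubled = fabric_latency_ns(2, 60, 2.66)   # ~45 ns: the same 2 hops at double the clock
    print(f"{today:.0f} ns vs {doubled:.0f} ns")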
     
  6. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Force @500/600 8x1p
    Just remembered that IBM did something crazy like this while I was in high school :)
    IBM's POWER CPUs:
    [IMG]
     
    Evildead666 likes this.
  7. H83

    H83 Ancient Guru

    Messages:
    5,508
    Likes Received:
    3,034
    GPU:
    XFX Black 6950XT
    Wouldn't it be better to make a 16-core CPU die and then glue 4 of those together to achieve the same objective without such a complicated design? I ask this because the communication between all those dies is a potential nightmare.
    Also, the link provided says Intel's answer is to glue two 28-core CPUs together; how the hell are they going to cool those???
     
  8. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    Don't forget that in the current EPYC and TR 2, each die is connected to every other die.
    So for 4 of them, there are 3 I/O links per die used for communication with the other dies.

    The reason for this is that even with non-local memory access, the data is only ONE hop away in all cases, so the latency stays somewhat acceptable.

    With 8 dies it would be very difficult to implement properly. You can't have 8*7/2 = 28 copies of the high-speed interconnect on the package; that's just too much for the substrate, as well as an insane number of connections on the dies themselves to support 7 links at once! That's also a lot of transistors to route everything. Impractical.

    If you instead allow two hops, the number would be 8*3/2 = 12 copies of the high-speed interconnect (compared to just 6 in EPYC), so doable... but it would increase latency considerably.
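
    Just to sanity-check that counting, a throwaway Python sketch (the helper names are mine, nothing AMD-specific):

    from math import comb

    def full_mesh_links(n_dies: int) -> int:
        # Every die wired to every other die: n choose 2 links.
        return comb(n_dies, 2)

    def partial_mesh_links(n_dies: int, links_per_die: int) -> int:
        # Each die exposes a fixed number of links; every link joins two dies.
        return n_dies * links_per_die // 2

    print(full_mesh_links(4))        # 6  -> 4-die EPYC, fully connected
    print(full_mesh_links(8))        # 28 -> a full mesh of 8 dies, impractical
    print(partial_mesh_links(8, 3))  # 12 -> 8 dies with 3 links each, the 2-hop layout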

    16 cores per die at 7nm should not be a problem whatsoever! Intel is capable of doing 28 of them at 14nm, so I don't see why AMD can't do 16 on a much smaller process.

    P.S.
    Mouse-drawn schematic of how a hypothetical 8-die / 2-hop NUMA chip could look:
    [IMG]

    Note that only 3 of the interconnects "intersect", which allows a somewhat simpler substrate design.
    Also, by moving the dies around a bit, the maximum line length could be shortened, resulting in a more stable signal, a higher interconnect frequency and lower latency.

    But don't forget: 2 hops...
    Even if it's not needed today, I can imagine 8-die chips of the future using such a configuration; it's the only way to keep expanding computing power!
     
    Last edited: Sep 22, 2018
    Lilith likes this.
  9. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    That's why there is a 9th piece of silicon in the middle of the 8 CPU dies. All of them would likely connect to the central one, maybe with an interface twice as wide as 4-die EPYC's, since that still saves one connection per die. And it boosts the transfer rate.
    The question is, since the central die is huge compared to the CPU dies, does it contain all the IMCs? That would be a pretty impressive trick.
     
  10. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    That design would be horribly inefficient, basically going back to the old concept of an FSB and a memory controller in the north bridge, from the times of the Pentium 4/Core 2 or Athlon XP (the A64 already had an IMC, which is why it was so much faster).
    Such an idea would be instantly rejected by any modern silicon engineer...

    You want each "Uncore" (L3 cache, Memory Controller) to be directly linked to RAM for the lowest possible latency.

    In that "central hub" concept none of the cores would have direct access to RAM, and also chip-to chip would ALWAYS be Two-hops.
    In my drawing with 12 links, at least 12 inter-chip communication lanes are direct, with the other 16 possible communication routes are 2-hop. It's still better than all 28 being 2-hop, with ALL non-local memory.

    Note: The number 28 comes from the combinations formula nCr = n! / (r! * (n - r)!), in this case 8! / (2! * (8 - 2)!) = 28.
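
    For the curious, those route counts can be double-checked with a toy graph search. The specific 12-link wiring below (a ring of 8 dies plus links between opposite dies) is my own assumption, picked only because it gives a 2-hop worst case; it is not necessarily the exact layout in the drawing above:

    from collections import deque
    from itertools import combinations

    N = 8
    adj = {d: {(d - 1) % N, (d + 1) % N, (d + 4) % N} for d in range(N)}  # 3 links per die

    def hops(src: int, dst: int) -> int:
        # Breadth-first search: fewest die-to-die hops between src and dst.
        seen, queue = {src}, deque([(src, 0)])
        while queue:
            die, dist = queue.popleft()
            if die == dst:
                return dist
            for nxt in adj[die] - seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))

    dists = [hops(a, b) for a, b in combinations(range(N), 2)]
    print(dists.count(1), dists.count(2), max(dists))   # 12 direct, 16 two-hop, worst case 2 hops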
     

  11. HWgeek

    HWgeek Guest

    Messages:
    441
    Likes Received:
    315
    GPU:
    Gigabyte 6200 Turbo Fotce @500/600 8x1p
    I made that image; I used a die diagram as a reference and tried to remove the I/O blocks and put them in the 9th die in the middle.
    I think you are right, and that's why AMD would use the 9th die to connect all the dies and reduce the latency.
    https://i.**********/V64myXj5/950px-amd_zen_octa-core_die_shot_annotated.png
     
  12. wavetrex

    wavetrex Ancient Guru

    Messages:
    2,462
    Likes Received:
    2,574
    GPU:
    ROG RTX 6090 Ultra
    HWgeek, with fast digital electronics the issue is not the physical distance, but how many times a signal needs to "stop and forward".
    Electrical signals travel at roughly 1/3 of the speed of light, so distances inside a computer aren't what slows things down. The problem is that distance DEGRADES signal quality, so the signal arrives at the destination different from how it was sent.

    For example, PCIe works equally fast whether the video card is placed right next to the CPU at 20 cm or a meter away through long riser cables, up to the point where the signal is so degraded that it stops working completely!

    Inside a chip, where everything is very close together, the degradation is almost non-existent, so that's not an issue... BUT it requires logic to read the signal and forward it to the next stage.

    For example: Core ALU -> L1 cache -> L2 cache -> L3 cache -> Memory controller -> DRAM inner controller -> RAM data lines.
    Each step introduces extra latency.

    Your "design" moves the Memory controller away from L3 cache and into another chip, so another step (or two even) is needed:
    ... L3 cache-> Infinity Fabric node (Core Die) -> Infinity fabric node (Center Die)-> Memory controller -> ...
    Same goes for PCIe.
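
    To put rough numbers on that, a sketch with entirely made-up, equal-cost steps (10 ns per hand-off is not a measured figure for any real chip; it just shows how the extra stops add up):

    ON_DIE_PATH = ["L3 cache", "memory controller", "DRAM"]
    HUB_PATH    = ["L3 cache", "IF node (core die)", "IF node (center die)",
                   "memory controller", "DRAM"]

    def path_latency_ns(path, ns_per_step=10):
        # Each hand-off between neighbouring stages costs one step of latency.
        return (len(path) - 1) * ns_per_step

    print(path_latency_ns(ON_DIE_PATH), path_latency_ns(HUB_PATH))   # 20 vs 40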

    That is a very bad idea today, and as I said, it's turning back the clock.
    Not to mention they would have to completely re-engineer the Zen die, which currently contains both the DDR4 controller AND the PCIe links.

    Basically the entire chip would need to be rebuilt from the ground up to move those functions out of it and into something else.
    I hope you now understand why this will not happen, not now, not ever, and why those rumors are complete gibberish from people who don't understand how these things work.

    Let me show you what the Zen 2 chip will look like inside:
    [IMG]

    It is what makes the most sense from an engineering standpoint: it's the most economically viable, it's the easiest to redesign, and it can be done quickly.
    Most if not all of the logic is kept the same; just the internal addressing is changed to support doubling all the core resources, while keeping everything outside the same.
     
    HandR and HWgeek like this.
  13. user1

    user1 Ancient Guru

    Messages:
    2,777
    Likes Received:
    1,299
    GPU:
    Mi25/IGP
    The website I saw this idea floating around on claimed the "hub" is 14nm, which might explain why.

    There is a reason to do something like this, mainly that newer processes are getting more expensive, and using smaller chips would greatly improve effective yields.

    If this were a real chip design, I doubt it's Rome; I'd guess it's something much bigger than 64 cores. Then the latency trade-off might be worth it.
     
    Last edited: Sep 22, 2018
  14. Evildead666

    Evildead666 Guest

    Messages:
    1,309
    Likes Received:
    277
    GPU:
    Vega64/EKWB/Noctua
    The problem with the memory controller in the north bridge was mostly due to the distance from the CPU and the terribly slow interconnect to the NB.
    If you have the north bridge on the package but not on the die, it's still a very short hop, and if you use an interposer, you can have a pretty wide and fast channel to the CPUs.
    Trace length is the killer here: keep it short and fast/wide.

    I'd say they'll probably keep the memory controllers on die but offload the interconnects, and really just make it a fast switching hub between the CPUs.

    Also, if you decide to keep the three comms lanes per CPU die, you can have three bi-directional channels to the comms die, potentially allowing concurrent requests to other CPUs and alleviating some of the latency.
     
  15. Evildead666

    Evildead666 Guest

    Messages:
    1,309
    Likes Received:
    277
    GPU:
    Vega64/EKWB/Noctua
    This brings back the yield problem though.
    Bigger chips are more complex, so there is more to go wrong, and yields drop.
    If they could just shrink Zen to 7nm, that would give quite a size reduction.
    They could stitch two 7nm dies together for the desktop chips, like Presler (Pentium D) did, except they would talk directly through an interposer rather than going through a distant NB.
    That would give 16 cores on desktop on AM4/5. Not sure you could adequately feed that with two DDR4 memory controllers, but maybe with DDR5?
    Smaller dies = better yields = lower costs (a rough sketch of that arithmetic follows at the end of this post).
    I doubt AMD will go back to making big dies right now; they seem to be in good shape to keep going down the MCM route, eventually with GPUs too.
    There's no reason why the CPU comms chip should be that much different from a GPU comms chip.
    In fact, on an APU I would expect it to communicate between the CPU and GPU, maybe even being the interface for the memory as well in that case... (although the APU is better served with it all integrated into one chip, the one case where a separate SKU is a better choice).
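
    The promised sketch of the "smaller dies = better yields" point, using the textbook Poisson yield model; the defect density and die areas are assumed round numbers, not real 7nm figures:

    from math import exp

    DEFECT_DENSITY = 0.2   # defects per cm^2 (assumed)

    def die_yield(area_cm2: float, d0: float = DEFECT_DENSITY) -> float:
        # Chance that a die of the given area has zero defects.
        return exp(-d0 * area_cm2)

    print(f"{die_yield(4.0):.0%} of ~400 mm^2 monolithic dies come out clean")   # ~45%
    print(f"{die_yield(0.5):.0%} of ~50 mm^2 chiplets come out clean")           # ~90%
    # Good chiplets can be binned and mixed on the package, so the small-die
    # yield applies per chiplet rather than needing eight good dies in one spot.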
     
    Luc likes this.

  16. Luc

    Luc Active Member

    Messages:
    94
    Likes Received:
    57
    GPU:
    RX 480 | Gt 710
    Radeon engineers have already talked about active interposers connecting two dies at low latency, but it would be expensive.

    Nvidia is using chips between cards in servers to do this kind of work, with nice results.

    If AMD manages to build an active interposer with a hub that can do the trick at low power, latency and cost, then it will be possible, but I'll wait to see it.
     
  17. MegaFalloutFan

    MegaFalloutFan Maha Guru

    Messages:
    1,048
    Likes Received:
    203
    GPU:
    RTX4090 24Gb
    No, not like that.
     
  18. Astyanax

    Astyanax Ancient Guru

    Messages:
    17,035
    Likes Received:
    7,378
    GPU:
    GTX 1080ti

    No, it wouldn't.
    Off-CPU implementations had higher latency and lower throughput because of the back and forth over PCB traces to a separate chip; the IOX structure is far better than this.
     
  19. Lilith

    Lilith Member Guru

    Messages:
    125
    Likes Received:
    71
    GPU:
    nvidia GTX 1660ti
    Mouse-drawn schematic of how a hypothetical 8-die / 2-hop NUMA chip could look:
    [IMG]

    Far be it from me to be the one to say this, but that is the cutest schematic drawing I've ever seen :rolleyes: lol. Sorry, carry on :p
     
    sverek likes this.
  20. Anthony Paull

    Anthony Paull Guest

    Messages:
    1
    Likes Received:
    0
    GPU:
    GTX 1080ti 11GB
    Lmfao. Got egg on your face, mate. We've had confirmation from AMD that this is how it will be done. I'm not going to go into why you're wrong about latency, but rest assured it won't be as you say ;)
    https://arstechnica.com/gadgets/201...uture-7nm-gpus-with-pcie-4-zen-2-zen-3-zen-4/
     
