AMD EPYC Processors Integrated into New NVIDIA DGX A100

Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jun 10, 2020.

  1. Hilbert Hagedoorn

    Hilbert Hagedoorn Don Vito Corleone Staff Member

    Messages:
    40,765
    Likes Received:
    9,178
    GPU:
    AMD | NVIDIA
  2. dampflokfreund

    dampflokfreund Member Guru

    Messages:
    173
    Likes Received:
    13
    GPU:
    8600/8700M Series
    AMD+Nvidia is a very powerful combination...
     
  3. Kaarme

    Kaarme Ancient Guru

    Messages:
    2,353
    Likes Received:
    969
    GPU:
    Sapphire 390
    Nvidia already announced this a while back. I guess AMD was waiting for a slow news day to have something to tell.
     
  4. Noisiv

    Noisiv Ancient Guru

    Messages:
    7,765
    Likes Received:
    1,119
    GPU:
    2070 Super
That's $14,000 of the $200,000 machine that everyone wants going to AMD. Minus any discount given to Nvidia.
     

  5. schmidtbag

    schmidtbag Ancient Guru

    Messages:
    5,801
    Likes Received:
    2,232
    GPU:
    HIS R9 290
Not that I really care, but aren't 128 cores overkill for a GPGPU server? CPUs don't tend to work very hard while the GPUs are crunching big numbers, and you don't gain any usable PCIe lanes by adding a 2nd socket. But Nvidia must know what they're doing - the price premium going from single-socket to dual-socket EPYC is hefty (though, amusingly, still super cheap compared to Intel).
     
  6. Fox2232

    Fox2232 Ancient Guru

    Messages:
    11,803
    Likes Received:
    3,357
    GPU:
    6900XT+AW@240Hz
For a moment, I took the word "integrated" seriously there, and I was wondering whether there is some chip that uses Nvidia's IP in the form of NVLink.
     
  7. Bagus Hanindhito

    Bagus Hanindhito New Member

    Messages:
    2
    Likes Received:
    2
    GPU:
    NVIDIA Titan Xp
Not really overkill, considering the workload. Last year, I did a lot of profiling on multiple server configurations running the MLPerf benchmark ("a SPEC-like benchmark for machine learning"). The CPUs do the data preprocessing and batching before sending the batches to the GPUs, which perform the major calculations. I saw dual Xeon Platinum CPUs with 24 cores each sitting at around 60% utilization in a 4 x NVIDIA Tesla V100 system - and that was with only a single user. The DGX-A100, I believe, supports multiple users doing multiple different things, so CPU performance also plays an important role here, especially in feeding enough data to the 8 x A100 GPUs. Finally, the number of PCIe lanes also matters: they connect the GPUs to the CPUs as well as to the NVMe storage and Mellanox NICs. Having 128 lanes of PCIe 4.0 should be really helpful.
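The feed-rate argument above can be sanity-checked with a back-of-envelope sketch. All rates below are assumed, illustrative numbers - not measurements from MLPerf or the DGX A100:

```python
# Back-of-envelope check: how many CPU cores does it take to keep
# 8 GPUs fed with preprocessed batches? Both rates are assumptions
# chosen for illustration, not measured figures.
gpus = 8
batches_per_sec_per_gpu = 40.0   # assumed rate each GPU consumes batches
batches_per_sec_per_core = 2.5   # assumed per-core preprocessing rate

# Cores needed so CPU-side preprocessing matches total GPU demand
cores_needed = gpus * batches_per_sec_per_gpu / batches_per_sec_per_core
print(cores_needed)  # 128.0
```

Under these hypothetical rates, a 16:1 ratio of GPU consumption to per-core preprocessing throughput already consumes all 128 cores, which is the shape of the argument for a dual-socket configuration.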
     
  8. schmidtbag

    schmidtbag Ancient Guru

    Messages:
    5,801
    Likes Received:
    2,232
    GPU:
    HIS R9 290
Understood; I have a PC built for BOINC and I'm aware some projects can have temporarily heavy CPU utilization while preparing the workload. But it's only really a problem when all workloads start at the same time. Once the GPU workloads are running, there isn't a whole lot of CPU usage going on. So, as long as the workloads are different enough that they don't all complete at the same time, that ought to give the CPU enough breathing room between starting/completing each workload. In other words, you'll only see most of the cores utilized when you first initialize the system; CPU usage becomes more "spread out" as each workload completes.
Of course, I'm making a lot of assumptions here. It's very possible that all the workloads Nvidia intends to run will complete at the same time (or close to it). It's also possible that there is cross-communication with the rest of the system while the GPU workloads are running, which would raise CPU usage throughout processing. But even then... 256 threads? I know this is some powerful hardware Nvidia is working on, but it just seems very surprising to me that a single 64-core Epyc would be a bottleneck.

Also, I'm only questioning the core count; I totally understand Nvidia's demand for the 128 PCIe 4.0 lanes, and I'd set the same priority in their shoes. But whether you use dual 64-core Epycs or a single 32-core one, you still get 128 usable PCIe lanes. So, if a single 64-core Epyc isn't going to be a bottleneck, doubling up on them seems to be a very hefty expense for no real gain. But like I said, Nvidia obviously knows what they're doing if they're spending that much more.
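The lane arithmetic both posters lean on can be tallied up in a quick sketch. The device counts below are assumptions for illustration (the real DGX A100 routes much of this through PCIe switches rather than directly off the CPUs):

```python
# Rough PCIe 4.0 lane tally for a DGX-A100-class box.
# Device counts are illustrative assumptions, not the real topology.
lane_budget = {
    "A100 GPUs":     8 * 16,  # 8 GPUs at x16 each
    "NVMe storage":  4 * 4,   # assumed: 4 drives at x4
    "Mellanox NICs": 6 * 16,  # assumed: 6 HCAs at x16
}

total_demand = sum(lane_budget.values())
usable_lanes = 128  # what single- or dual-socket EPYC exposes to devices

print(total_demand)                 # 240
print(total_demand > usable_lanes)  # True -> lanes are the scarce resource
```

Even with generous assumptions, device demand exceeds the 128 usable lanes, which is why lane count (rather than core count) is the uncontested part of the argument.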
     
    Last edited: Jun 10, 2020
  9. Gomez Addams

    Gomez Addams Member Guru

    Messages:
    177
    Likes Received:
    97
    GPU:
    Titan RTX, 24GB
It probably is a bit of overkill, but DGX systems are intended for containerized usage, for the most part, where many users are running lots of different jobs on them. A dual-CPU system with these processors could support a LOT of users, and that's what they are intended for. This is THE ideal machine for GPU computing as a service on a large scale.
     