AMD Naples 32-core Zen-Processors photos

Warrax · Dec 11, 2016

Ziggymac said: ↑

Yeah, but with each core running at 150MHz.

minehunter & solitaire can both be run at 60fps...at the same time!!!

They are running at a 1.4GHz base clock and 2.8GHz turbo.

Frances · Dec 11, 2016

eTheBlack said: ↑

PGA is pins on CPU, LGA is pins in socket

PGA is male players on the golf course and LGA is females getting played on the course.

yasamoka · Dec 11, 2016

0blivious said: ↑

I sure hope the mainstream Zen chips don't have pins on them. Having them on the motherboard is so much safer.

No, it is never safer having them on the motherboard. Never bent CPU socket pins on a motherboard before? It's an utter nightmare.

David3k · Dec 11, 2016

Caesar said: ↑

Isn't the L3 Cache that feeds the L2 cache and L2 feeds L1????

Wow....

It's only a "bank"..... much of the capacity depends on the RAM to be addressed to. Would be better to pour the 512MB capacity to L1 cache.....!!!


You can't make an L1 too big: it's so highly associative that even a 128KB L1 cache needs a LOT of connections. Making a 1MB cache with that same associativity would make the die size GARGANTUAN.
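One rough way to see why associativity gets expensive is to count how many tags must be compared on every lookup. The sketch below is only an illustration: the 64-byte line size and the 8-way figure are assumptions, not numbers from this post, and `cache_geometry` is a hypothetical helper name.

```python
def cache_geometry(size_bytes, ways, line_bytes=64):
    """Return line count, set count, and tag comparisons per lookup.

    line_bytes=64 is an assumption; the post does not give a line size.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    # Every way in the selected set is compared in parallel on a lookup.
    return {"lines": lines, "sets": sets, "tag_compares": ways}


KB = 1024
for label, size in (("128KB", 128 * KB), ("1MB", 1024 * KB)):
    eight_way = cache_geometry(size, ways=8)
    # Fully associative: a single set, so every line is compared per lookup.
    fully = cache_geometry(size, ways=size // 64)
    print(label, "8-way:", eight_way, "fully associative:", fully)
```

With a fixed number of ways, a bigger cache only adds sets; it is pushing the way count up that multiplies the parallel comparisons and wiring per lookup.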

The Zen L3 is a "Victim" cache, meaning that when data is evicted from the L2 and L1, it lands in the L3. This still provides the benefit of a shorter fetch from the L3 (vs fetching from RAM), but also cuts latency on the CPU's reads directly from RAM, because the L3 isn't involved.

When data is fetched from memory, the CPU checks the L1, then the succeeding caches, to see if it has the data it needs before fetching from RAM. The most used bits are held as copies in the L1 Data/Instruction Cache (closest to the CPU, lowest latency), the bits that are just slightly less used than those in the L1 are in the L2 Cache (slightly higher latency than L1), and so on.

Typically, when new data comes in from RAM to the caches, it evicts the least used copies of data to make room, and as long as the data keeps getting re-used by operations enough, it stays in one of the caches. Also, each lower cache traditionally holds redundant copies of the data in the cache above it. (L1's current contents are also in the L2, L2's are in the L3, and so on until RAM.)
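The traditional lookup-and-evict behaviour described above can be sketched as a toy model. This is a minimal illustration, not how any real CPU is implemented: the caches are fully associative, the capacities are tiny made-up numbers, and `LRUCache` and `access` are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative cache that evicts the least recently used line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, addr):
        """Return True on a hit; on a miss, insert addr, evicting the LRU line if full."""
        if addr in self.lines:
            self.lines.move_to_end(addr)  # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # drop the least recently used line
        self.lines[addr] = True
        return False


def access(levels, addr):
    """Probe L1, L2, L3 in order and report where the data was first found.

    Every level is filled on the way, so lower levels keep redundant copies.
    """
    found = "RAM"
    for name, cache in levels:
        hit = cache.touch(addr)
        if hit and found == "RAM":
            found = name
    return found


levels = [("L1", LRUCache(2)), ("L2", LRUCache(4)), ("L3", LRUCache(8))]
print([access(levels, a) for a in [0, 0, 1, 2, 0]])
# → ['RAM', 'L1', 'RAM', 'RAM', 'L2']
```

Address 0 is pushed out of the two-line L1 by addresses 1 and 2, but its redundant copy is still in the L2, so the last access is served from there.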

Here's where Zen's L3 is different from traditional x86 CPUs: it doesn't store every read from RAM the CPU demands, nor does it prefetch data before it's requested; it only holds data that was pushed out from the L1/L2. This is brilliant: because it's a "victim" cache, it doesn't have to hold L2 data inside (because that data is still in the L2), basically increasing its potential capacity by carrying less redundant data.

Though this style of cache would usually be less effective, the large L2 of Zen negates its downside (smaller L2 caches would be evicting so frequently that the victim cache is likely to drop data before the CPU needs it again), giving you the best of both worlds: the L3 has data that the L2 doesn't, but doesn't have to hold "live" L2 contents.
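A victim-style L3 as described above can be sketched in a few lines. This is a toy model under stated assumptions: fills from RAM go straight into the L2, only lines evicted from the L2 enter the L3, and an L3 hit moves the line back up. The class name `VictimHierarchy` and the tiny capacities are made up for illustration, and the real Zen policy may differ in detail.

```python
from collections import OrderedDict


class VictimHierarchy:
    """Toy L2 + victim L3: the L3 only ever receives lines evicted from the L2."""

    def __init__(self, l2_lines, l3_lines):
        self.l2 = OrderedDict()
        self.l3 = OrderedDict()
        self.l2_lines = l2_lines
        self.l3_lines = l3_lines

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            victim, _ = self.l2.popitem(last=False)  # LRU line leaves the L2
            if len(self.l3) >= self.l3_lines:
                self.l3.popitem(last=False)  # oldest victim is dropped
            self.l3[victim] = True  # the victim lands in the L3
        self.l2[addr] = True

    def access(self, addr):
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:
            del self.l3[addr]  # move the line back up, keep a single copy
            self._fill_l2(addr)
            return "L3"
        self._fill_l2(addr)  # fill from RAM bypasses the L3
        return "RAM"


h = VictimHierarchy(l2_lines=2, l3_lines=4)
print([h.access(a) for a in [0, 1, 2, 0]])
# → ['RAM', 'RAM', 'RAM', 'L3']
```

Because no address is ever held in both levels, this toy hierarchy can keep up to 2 + 4 distinct lines, which is the capacity gain the post is pointing at.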

The L3 is also shared, so if Core A works hard on a small subset of data and instructions, then switches to another task, the data from its caches is "flushed" into the L3, which Core B can then pick up to quickly fill its pipeline without having to read it all the way from RAM.
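The Core A to Core B hand-off described above could be modelled roughly like this. It is a sketch of the post's description only: private caches and the shared L3 are plain sets with no capacity limit, and `Core`, `load`, and `switch_task` are hypothetical names.

```python
class Core:
    """Toy core with private caches (L1/L2 lumped together) and a shared L3."""

    def __init__(self, name, shared_l3):
        self.name = name
        self.private = set()  # stands in for this core's L1/L2
        self.l3 = shared_l3   # the same set object is shared by all cores

    def load(self, addr):
        if addr in self.private:
            return "private"
        if addr in self.l3:
            self.l3.discard(addr)  # victim-style: the line moves up
            self.private.add(addr)
            return "L3"
        self.private.add(addr)
        return "RAM"

    def switch_task(self):
        # Model the "flush" from the post: private contents become L3 victims.
        self.l3.update(self.private)
        self.private.clear()


shared = set()
core_a, core_b = Core("A", shared), Core("B", shared)
core_a.load(10)
core_a.load(11)
core_a.switch_task()
print(core_b.load(10), core_b.load(12))
# → L3 RAM
```

Core B finds address 10 in the shared L3 because Core A left it there, while address 12, which nobody touched, still has to come from RAM.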

Sorry for this being so "tl;dr" but I'm quite excited to see new ideas in the CPU world. (And deep down, I know Hilbert is just as excited. Come on, admit it, man.)

DLD · Dec 11, 2016

I read this, breathless!

@David3k

Wow! You aren't a "member guru", you're an AncientMahaMaximus Guru here! Thank you, David.

yasamoka · Dec 11, 2016

David3k said: ↑

The Zen L3 is a "Victim" cache, meaning that when data is evicted from the L2 and L1, it lands in the L3. This still provides the benefit of a shorter fetch from the L3 (vs fetching from RAM), but also cuts latency on the CPU's reads directly from RAM, because the L3 isn't involved.


Great processor architecture post!

The victim cache is a stroke of genius, the more I think of it. You pay less of a penalty to load data from RAM into the L2 cache, you save on L3 cache utilization while data is in the L2 (there is exactly one copy of the data across the levels), and you pay the rest of the penalty that you would have paid in a conventional cache architecture (load into L3, then load into L1), or a bit less if there is a direct connection from RAM to the L2.

Stairmand · Dec 11, 2016

"Never bent CPU socket pins on a motherboard before?"

No I haven't because I'm not an idiot. You must be pretty ham-fisted to do that.

yasamoka · Dec 11, 2016

Stairmand said: ↑

"Never bent CPU socket pins on a motherboard before?"

No I haven't because I'm not an idiot. You must be pretty ham-fisted to do that.

Don't be a dick eh? Sheesh, the kinds we get around here these days...

I've had CPU socket pins bent due to excessive pressure from Intel's stock cooler. Some motherboards even ship with bent pins, and it's Hell to return them (since companies such as ASUS see this as pretty much the most common reason for motherboard returns, especially when it's user error). I've seen countless cases of bent CPU socket pins on other motherboards, though not my own, often with reasons unknown to their users (ranging from excessive pressure to not dropping in the CPU perfectly).

KissSh0t · Dec 11, 2016

I've bent cpu pins on the cpu before!

Fixed it though..... through pure sweat and tears.....

Broke a ram stick too trying to plug it in, since then I hate ram sticks with tall cooling units.

OCZ Never Again.

*stares off into the distance*

tsunami231 · Dec 12, 2016

Warrax said: ↑

They are running at a 1.4GHz base clock and 2.8GHz turbo.

Those are, imo, horrible speeds. I do hope server software is actually written to make use of all those cores, unlike consumer-grade software, where 32 cores would be a total waste.

I really do hope Zen has STP/MTP that actually competes with Intel.

PrMinisterGR · Dec 12, 2016

David3k said: ↑

The L3 is also shared, so if Core A works hard on a small subset of data and instructions, then switches to another task, the data from its caches is "flushed" into the L3, which Core B can then pick up to quickly fill its pipeline without having to read it all the way from RAM.


The L3 is shared between each group of four Zen cores. Unlike Intel's, it's not shared with all cores in a CPU. This sounds "bad", until you see that the Intel design uses 2MB/core and connects to those 8 cores using 16-way associativity. Zen cores get 2MB/core too, but they get 16-way associativity per 4 cores. In ideal situations that could translate to up to 80% lower latency in cache access, in addition to that cache being a victim cache. The L2 is also double the size vs Skylake (512KB vs 256KB) and it has double the associativity (8-way vs 4-way). The L3 and the whole physical design are made in a way that all the cores have the same average latency when "talking" to the L3.
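To make the size and associativity figures above a bit more concrete, here is a sketch of how a set-associative cache splits an address into tag, set index, and byte offset. The 64-byte line size is an assumption (it is not stated in this thread), and `num_sets` and `split_address` are hypothetical helper names.

```python
LINE_BYTES = 64  # assumed cache line size, not stated in the thread
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Number of sets in a set-associative cache of the given size."""
    return size_bytes // (line_bytes * ways)


def split_address(addr, size_bytes, ways, line_bytes=LINE_BYTES):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets = num_sets(size_bytes, ways, line_bytes)
    offset = addr % line_bytes
    set_index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, set_index, offset


print(num_sets(512 * KB, 8))   # Zen L2 figures from the post → 1024
print(num_sets(256 * KB, 4))   # Skylake L2 figures from the post → 1024
print(num_sets(8 * MB, 16))    # one CCX L3, 16-way → 8192
print(split_address(0x12345, 512 * KB, 8))  # → (1, 141, 5)
```

Under that line-size assumption, doubling both the size and the way count leaves the number of sets unchanged, so each set simply has twice as many places to keep a line before something must be evicted.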

Another tidbit I found very interesting is this one:

Anandtech said:

We are told that the operations used in Zen for the uOp cache are ‘pretty dense’, and equivalent to x86 operations in most cases.

It seems that the CPU's "internal" language is not that far away from x86 itself, which is actually quite peculiar.

Another little peculiarity, linked to the L3 discussion above:

Anandtech said:

It is worth noting that a single CCX (CPU Complex, 4 Zen cores) has 8 MB of cache, and as a result the 8-core Zen being displayed by AMD at the current events involves two CPU Complexes. This affords a total of 16 MB of L3 cache, albeit in two distinct parts. This means that the true LLC (Last Level Cache) for the entire chip is actually DRAM, although AMD states that the two CCXes can communicate with each other through the custom fabric which connects both the complexes, the memory controller, the IO, the PCIe lanes etc.

I know that the fabric that connects multiple cores is different than an interposer that can take HBM, but I bet that the design of having an out-of-the-CPU memory as LLC was on purpose so that something between the processor complex and RAM can be used. I won't be surprised to see Zen CPUs with HBM on, it sounds almost like a drop in solution.

icedman · Dec 12, 2016

Stairmand said: ↑

"Never bent CPU socket pins on a motherboard before?"

No I haven't because I'm not an idiot. You must be pretty ham-fisted to do that.

I had one or two pins from the mobo actually stick to my CPU while making a swap with my 3770K, and flushed $200 down the toilet because of it. LGA sucks balls imo.

Amx85 · Dec 12, 2016

I can't believe a 180W TDP given its base frequency... I expect between 125-140W. Another thing: they don't have 512MB of L3 cache. From the start we saw that the Zen design has 1MB per thread, or 2MB per core. Desktop Zen will have a total of 16MB of L3 cache on the 8-core model, since each complex (4 cores) has 8MB of L3. So the Zen Opteron has a total of 80MB of cache (64MB L3 + 16MB L2) per CPU.
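The cache totals above can be checked with a few lines of arithmetic. The 2MB of L3 per core is from this post and the 512KB of L2 per core is the figure quoted earlier in the thread; `cache_totals_mb` is just a hypothetical helper name.

```python
L3_PER_CORE_MB = 2     # 2MB of L3 per core (8MB per 4-core complex)
L2_PER_CORE_KB = 512   # L2 size per core, as quoted earlier in the thread


def cache_totals_mb(cores):
    """Return (L3 MB, L2 MB, combined MB) for a chip with the given core count."""
    l3 = cores * L3_PER_CORE_MB
    l2 = cores * L2_PER_CORE_KB // 1024
    return l3, l2, l3 + l2


print(cache_totals_mb(8))   # 8-core desktop part → (16, 4, 20)
print(cache_totals_mb(32))  # 32-core Naples → (64, 16, 80)
```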

chispy · Dec 12, 2016

I wonder if these CPUs will be overclockable; at least 3.8GHz~4.0GHz would be nice.

schmidtbag · Dec 12, 2016

chispy said: ↑

I wonder if these CPUs will be overclockable; at least 3.8GHz~4.0GHz would be nice.

Though that would be cool, I'm willing to bet they won't be overclockable. First of all, it's a server CPU. These are meant to be stable, accurate, and reliable. Speed is not the #1 priority. This is why the vast majority of Opterons and Xeons don't have unlocked multipliers.

Second, running 32 cores at 4GHz is a recipe for disaster. 288W (using an 8-pin CPU connector) is the safest amount of power you can draw for a single CPU. Yes, you can exceed that, but there are a lot of caveats involved. But in most cases if you exceed that wattage by too much or for too long, one of the following will happen:
* The CPU will starve for electricity and become unstable
* You'll overload the PSU and damage it
* The solder joints or the connectors of the cables will melt
There is no way 32 cores at 4GHz could run at less than 300W. Right now the best thing we can compare these CPUs to are Xeons. If a 10-core 3GHz Xeon uses around 140W, just tripling the core count alone while keeping the same frequency would exceed 300W (yes, I'm aware wattage doesn't scale linearly, but the outlook still doesn't look great). 3.8GHz isn't going to make much of a difference vs 4.
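As a back-of-the-envelope check of the estimate above, here is the naive scaling written out. It assumes power grows linearly with both core count and clock speed, which is a deliberate simplification (as noted, real wattage does not scale linearly), and `scaled_power` is a hypothetical helper name.

```python
def scaled_power(base_watts, base_cores, base_ghz, cores, ghz):
    """Naive estimate: scale power linearly with core count and frequency."""
    return base_watts * (cores / base_cores) * (ghz / base_ghz)


# Reference point from the post: a 10-core 3GHz Xeon at around 140W.
tripled = scaled_power(140, 10, 3.0, cores=30, ghz=3.0)
naples_3ghz = scaled_power(140, 10, 3.0, cores=32, ghz=3.0)
naples_4ghz = scaled_power(140, 10, 3.0, cores=32, ghz=4.0)

print(f"30 cores @ 3GHz: about {tripled:.0f}W")
print(f"32 cores @ 3GHz: about {naples_3ghz:.0f}W")
print(f"32 cores @ 4GHz: about {naples_4ghz:.0f}W")
```

Even the most forgiving of these figures lands well above the 288W and 300W marks mentioned above, which is the point of the comparison.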

Anyway, it's a nice thought but don't count on it.

chispy · Dec 12, 2016

schmidtbag said: ↑

Though that would be cool, I'm willing to bet they won't be overclockable. First of all, it's a server CPU. These are meant to be stable, accurate, and reliable. Speed is not the #1 priority. This is why the vast majority of Opterons and Xeons don't have unlocked multipliers.


That was wishful thinking from an overclocker's point of view, but you are correct, as I have seen a 24-pin PSU cable melt in front of my eyes while overclocking CPUs on LN2.

Athlonite · Dec 20, 2016

0blivious said: ↑

I sure hope the mainstream Zen chips don't have pins on them. Having them on the motherboard is so much safer /s.

You forgot the /s after your comment, so I fixed it for you.
