AVX-512 Is an Intel Gimmick To Win Benchmarks and should die a painful death

Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jul 13, 2020.

  1. bobblunderton

    bobblunderton Master Guru

    Messages:
    420
    Likes Received:
    199
    GPU:
    EVGA 2070 Super 8gb
    Well, thank you for the value-add there - appreciated!
    AMD just dropped 3DNow! not that long ago, come to think of it - I read about it not too long back. Might have been with Ryzen, or it could have been one of the FX line. 3DNow! was basically AMD's answer to MMX, from what I remember. For both vendors this stuff turned into SSE somewhere around the Pentium II or Pentium III, with various versions up to around 4.2, which I believe was released with Nehalem / 1st gen i7. I know about 4.2 because Noita required it in the beta for world-seeding and creating the terrain. If you only had SSE 4.1 in the chip, like a Conroe or Wolfdale or a closely related derivative of that architecture, you ended up spawning into a never-ending abyss. They did fix the problem and added a work-around for processors without the two or three ops added in 4.2 over 4.1, for what it's worth.

    More on-topic, I wish both AMD and Intel could offer me some sort of 'math performance enhancer' I could use to speed up BeamNG Drive's floating point computations. Almost like the Xeon Phi cards, but something that won't make fire inside the computer or cost an organ or two's worth of cash. When I test my map out with traffic enabled, 15~25 vehicles brings a decent Ryzen chip to its knees! While it shames GTA's game engine for physics and driving, it requires so much computational horsepower. I'd almost pay anything to have more, but that isn't really realistic these days.
    Mr. Addams, I hope you get your wish - I hope they do upgrade the FPUs and release it across all the chip lines; we could really use it not just for gaming and simulation, but for AI and research too. It would be better for all of us.
    Though if we do get our wish - your wish - however it works out, I'll need to hire someone to beat DX11 draw call limits into submission, because with more CPU power I will run face-first into a draw call brick wall fast... [cue lots of obscene words about DX11's limits on rendering] ... grrr :x

    If someone's vacationing on an island where they have DX12 rendering, please send me a postcard.
     
    Gomez Addams likes this.
  2. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Is a benchmark something that does not represent the way the work is done in the real world? To me, such a benchmark would have zero value unless I could use exactly the same code/binaries to do the actual work.
     
  3. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Mixing AVX code with legacy code is simply lazy programming. You can't just add a few mnemonics to 50-year-old code and expect your throughput to radically improve. For best results, you need to reimplement the entire algorithm in the SIMD paradigm, and you need to break the work into multiple parallel threads that run on separate processor cores. It's OK to rely on auto-vectorization and loop unrolling in the optimizer, and threads are much easier in modern C++ than they were in the 1990s. We have a freaking Cray Y-MP vector supercomputer on our desktops today - you just need to use it to its full potential!
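    As a rough illustration of the threading half (a minimal sketch with made-up names, not tied to any particular codebase), splitting a simple loop across std::thread workers looks like this, with the inner loop kept trivial enough for the optimizer to auto-vectorize:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Scale a large array in parallel: each worker gets a contiguous chunk,
    // and the tight inner loop is trivially auto-vectorizable (SSE2/AVX).
    void scale_all(float* data, std::size_t n, float factor)
    {
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        const std::size_t chunk = (n + workers - 1) / workers;

        for (unsigned w = 0; w < workers; ++w) {
            const std::size_t begin = w * chunk;
            const std::size_t end   = std::min(n, begin + chunk);
            if (begin >= end) break;
            pool.emplace_back([=] {
                for (std::size_t i = begin; i < end; ++i)  // auto-vectorized at -O2/-O3 or /O2
                    data[i] *= factor;
            });
        }
        for (auto& t : pool) t.join();
    }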

    Improving FPU performance and simplifying the CPU are mutually exclusive goals.

    It's a real-world algorithm for 3D particle systems, but CPU implementations are limited to computer graphics / video effects software like Blender, Autodesk 3ds Max, Adobe After Effects, etc. - these are not found on your typical home PC.

    I'm also not sure if that specific algorithm was vectorized to the maximum extent possible. Auto-vectorization is often used as the first step in optimisation - you tweak your C/C++ code and data to allow the optimizer to produce the fastest SSE2/AVX assembly code. You can also take the resulting assembly and convert it to SSE2/AVX compiler intrinsics in your C/C++ code, then further tweak the algorithm to achieve better performance. It's quite hard work though, even for a very qualified engineer.
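    For illustration, the same kind of inner loop rewritten with AVX2/FMA compiler intrinsics might look like this (a hedged sketch with made-up function names, assuming the length is a multiple of 8 for brevity):

    Code:
    #include <immintrin.h>
    #include <cstddef>

    // out[i] = a[i] * scale + b[i], eight floats per iteration with AVX2/FMA.
    // Assumes n is a multiple of 8; a real version needs a scalar tail loop.
    // Build with -mavx2 -mfma (GCC/Clang) or /arch:AVX2 (MSVC).
    void fma_arrays(const float* a, const float* b, float* out, std::size_t n, float scale)
    {
        const __m256 vscale = _mm256_set1_ps(scale);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);            // unaligned load of 8 floats
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vr = _mm256_fmadd_ps(va, vscale, vb);   // fused multiply-add per lane
            _mm256_storeu_ps(out + i, vr);
        }
    }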
     
    Last edited: Jul 20, 2020
  4. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    So a fair benchmark is actually a real-world set of workloads in existing software, right?
    Like running 1000 tasks (actions) in sequence as they would be done by a user. Apparently, many of those workloads and actions would not benefit from it, and the improvement to the entire workflow would not be exactly big as a result. But that would still be an adequate representation of the benefit users can expect in day-to-day work.
     

  5. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    BTW, there is a recent presentation from the same author:


    He compares the sequential read speed of PCIe 4.0 SSDs - which is around 5 GByte/s - to the throughput of basic text processing algorithms, like BASE64 encoding (used in the SMTP protocol), UTF-8 validation, JSON parsing, and general text-to-number conversion.
    Not surprisingly, many legacy implementations achieve only 100-300 MByte/s of real-world throughput.

    And SIMD implementations of these base algorithms improve parsing performance by nearly 1.5 orders of magnitude - that is, 3 GByte/s or more.
    These are based on individual experiments posted on his blog, complete with code examples (a minimal simdjson usage sketch follows the list):

    Encoding binary in ASCII very fast
    Fast float parsing in practice
    Parsing numbers in C++: streams, strtod, from_chars
    How expensive is it to parse numbers from a string in C++?
    JSON parsing: simdjson vs. JSON for Modern C++
    We released simdjson 0.3: the fastest JSON parser in the world is even better!
    How fast is getline in C++?
    Reusing a thread in C++ for better performance
    Cost of a thread in C++ under Linux
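    For reference, here is roughly what using simdjson's DOM API looks like - based on the quick-start from the 0.3-era documentation, assuming exceptions are enabled; twitter.json is the sample file shipped with the repository:

    Code:
    #include <iostream>
    #include "simdjson.h"

    int main() {
        simdjson::dom::parser parser;
        // Parses and validates the whole document with SIMD kernels
        // (SSE4/AVX2 code paths are selected at runtime). Throws on error.
        simdjson::dom::element tweets = parser.load("twitter.json");
        std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
        return 0;
    }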
     
    Last edited: Jul 19, 2020
  6. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Not quite. These are two separate classes of performance tests, synthetic and real-world.

    Synthetic tests / microbenchmarks either measure some specific general-purpose algorithm, or stress a specific hardware block to measure its maximum output. Think about generic 'compression' or 'video encoding' benchmarks, or memory/cache bandwidth and pixel/texel fill rate tests.

    Real-world tests either simulate actual applications by essentially re-creating their workflow and data sets, or directly measure the actual running time, graphics frame rate, etc. of specific popular applications by using their scripting engines. 3DMark is the most popular simulation test, while PCMark and game engine benchmarks use built-in scripting to profile real-world applications.
     
    Last edited: Jul 19, 2020
  7. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    I do not need benchmarks to guess that AVX-512 could do roughly twice as much work as AVX2 per cycle under optimal conditions. It may be held back by cache, the actual data structure, and so on.
    But what I question is the actual usefulness even for people whose workloads partly use such instructions. IIRC, running even AVX2 resulted in lower clocks for Intel's CPUs due to hitting the TDP wall.
    (Mind that productivity workloads are about power efficiency too. And in the end, the number of actual FLOPS on actual data will be about the same as with AVX2 by the time the workload is over.)
     
  8. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Well, the message is simple: if you are unable to optimize your specific workloads to run faster, then don't use SIMD. At the same time, it's quite possible to achieve an order-of-magnitude better performance even in trivial cases if you make a concerted refactoring effort.
     
    Alessio1989 likes this.
  9. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Are you implying that it is the x87 era vs. AVX-512? Because I did not.
     
  10. Carfax

    Carfax Ancient Guru

    Messages:
    3,973
    Likes Received:
    1,463
    GPU:
    Zotac 4090 Extreme
    Here's an actual application that Intel used to compare scalar to SIMD performance. The application was originally written by Microsoft and was intended to demonstrate how DX12 asynchronous compute could use spare GPU cycles to do particle simulation. It uses 10,000 particles and simulates up to 100,000,000 interactions every tick.

    That's some damn good scaling, and it makes it obvious how AVX2 will be heavily leveraged by game developers for various physics simulations in the next-gen consoles. :D

    [Image: Intel's scalar vs. SIMD particle simulation benchmark chart]
     

  11. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    No. The x87 era is over, x87 commands are not supported in x64 long mode where SSE2 is required by the AMD64 specs, and even 32-bit platforms now require SSE2 support (since Windows 8 and Fedora 29). Any recent compiler uses SSE2 by default for all calculations and data transfer operations.
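    As a small illustration of that default (assuming any mainstream x86-64 compiler at normal optimization levels):

    Code:
    // Built as 64-bit code (e.g. g++ -O2 on x86-64 or the MSVC x64 toolset),
    // this plain scalar multiply is emitted as an SSE instruction such as mulss
    // operating on xmm registers, not an x87 fmul - no special flags required.
    float scale(float value, float factor)
    {
        return value * factor;
    }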


    I'm talking about scalar code vs. vectorized code. Vectorization - i.e. converting from the scalar (one value at a time) to the vector (multiple values) paradigm - involves changing the data layout of your variables/structures to use wide registers holding multiple values, and changing your control flow to do more work per iteration before branching.
    You can either rely on the compiler/optimizer to auto-vectorize your scalar code and unroll the loops, or you can refactor your code to use Intel SSE2/AVX2/AVX-512 intrinsics and vectorize all your data to the maximum register width.
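    The classic example of that data-layout change is going from an array-of-structures to a structure-of-arrays (a sketch with hypothetical particle fields, not taken from the demo discussed above):

    Code:
    #include <cstddef>
    #include <vector>

    // Scalar-friendly layout: one struct per particle (array of structures).
    struct ParticleAoS { float x, y, z, mass; };

    // Vector-friendly layout: one contiguous array per field (structure of arrays).
    // A single AVX load now fills a register with 8 x-coordinates (16 with AVX-512).
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };

    // With SoA the per-field loop is contiguous and trivially auto-vectorizable,
    // and it can later be rewritten with intrinsics at full register width.
    void integrate_x(ParticlesSoA& p, const std::vector<float>& vx, float dt)
    {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += vx[i] * dt;
    }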


    On Skylake-X, AVX-512 vs. AVX2 should give you a ~1.7x improvement in the ideal case - so even if thermal throttling kicks in, the AVX-512 path should still be faster. And if there is no such improvement with either auto-vectorization or manual coding, it's probably because your code is still not optimal and you need to refactor it further.
    Unfortunately this often takes a lot of time and effort, and may require you to change APIs or data layouts, which is not always possible - so sometimes there is no choice but to skip these optimisations and wait for a better opportunity. That could be Willow Cove / Golden Cove processors, which offer significant IPC improvements and may also include faster AVX-512 FMA units.
     
    Last edited: Jul 20, 2020
  12. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    x87 is no longer a thing when targeting the 64-bit extension unless you explicitly use it via assembly. Basic 64-bit FPU instructions target SSE2.

    x87 is supported in long mode too (chapter 2, middle of the first page of the chapter: https://www.amd.com/system/files/TechDocs/26569_APM_V5.pdf). But yes, with SSE2 there are usually no good reasons to use the old FPU ISA.
     
    Last edited: Jul 20, 2020
  13. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Sure, but Microsoft does not officially support x87, MMX, or 3DNow! intrinsics or inline assembly on the x64 platform, and it's not possible to compile such code with the 64-bit MSVC toolset - you have to use MASM and object-file linking.

    https://docs.microsoft.com/en-us/wi...programming-for-game-developers#assembly_code
    https://docs.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list

    It might still be supported by Linux/GCC, though I fail to see the point.
     
    Last edited: Jul 20, 2020
  14. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
  15. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    ...Hyperbole. Because @DmitryKo made it sound like it is either AVX-512 or no wide-data optimization at all, while in reality we are comparing the world of AVX2 with AVX-512.
    It still remains true that to push data through AVX2, a CPU core eats more power. And that increases again with AVX-512.
    If you hold an 8C/16T CPU strictly to 65W, the clock goes down the more data you put through the CPU cores.
    In other words, AVX-512 can do more, if you are willing to pay more.

    The i9-10900K has a base TDP of 125W, at which its base clock is 3.7GHz. In the 95W TDP mode the base clock is 3.3GHz.
    I doubt that those are AVX-512 clocks.
     

  16. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    In my book, 'SIMD' on x86 processors would encompass MMX, SSE/SSE2/SSE4, AVX/AVX2/FMA, and AVX-512F/BW/DQ/VL.

    That's exactly why Intel has been widening the data path and adding more registers - even if you have to throttle down the frequency by 30%, you're still processing 16 single-precision values per instruction (32 FLOPs for a single FMA), so it's a sizeable gain overall. Newer architectures and process nodes should improve it further with better power management.


    The blog post above mentions OpenSSL. From the code comments, their AVX2 (VEX-256) integer path is exactly 2 times faster than the AVX (VEX-128) integer path on Haswell/Skylake-X; AVX-512 is still ~1.5 times faster than AVX2, but they thought it wasn't enough to compensate for thermal throttling on Skylake-X.
    But the SSE2/AVX versions do not even provide much gain compared to the scalar integer versions, except on Bulldozer.
    https://github.com/openssl/openssl/...a0d1b6/crypto/poly1305/asm/poly1305-x86_64.pl
     
  17. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    https://www.intel.com/content/dam/w...on-e5-v3-advanced-vector-extensions-paper.pdf
    Some power figures for AVX2. I really wonder about AVX-512, around which we dance but refuse to talk.

    Just so this isn't confusing for some: the image shows relative power draw (relative because it is energy consumed per unit of delivered performance) and the performance improvement from using wider data processing.
    Please note that it is 1.02 for AVX2 over SSE 4.2. This means that while AVX2 can process 2.8 times more data than SSE 4.2 per unit of time, it eats 2.8 × 1.02 = 2.856x more energy. (And that's while Intel has their AVX clock offset, otherwise performance would be a clean 2x/4x/8x multiple as the vector width increases.)

    To put it bluntly, if you have a 4C CPU with AVX-512 vs. an 8C CPU with AVX2, both could likely do about the same work at the same clock. But the 4C CPU would be so far outside its power-efficiency sweet spot that performance per watt would tank.
    Then add that people have to pay for those transistors even if they have almost no use for them, and the same workload could be processed via AVX2.

    And therefore one needs a real comparison in a TDP-constrained scenario, because AVX-512 is practically meant for productivity, and there power draw per delivered result matters.
    So, if we took that 4C AVX-512-enabled CPU and restrained it to the same TDP as, let's say, a 6C CPU with AVX2, would it win? And if it did, by how much?
     
  18. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    A 4-core AVX-512 CPU should be ~33% faster than a 6-core AVX2 (VEX-256) CPU: four cores doing twice the per-instruction work are roughly equivalent to eight AVX2 cores, and 8/6 ≈ 1.33.

    Provided you can parallelize your SIMD algorithm and scale to all available cores - and parallelization is more complex than vectorization - you'd probably need at least twice the number of active cores to compensate for the narrower data path, smaller register file, and thread synchronisation overhead, since the ~20% frequency increase going from AVX-512 to AVX2 is largely negated by the 10-20% frequency drop from employing the additional cores.

    These ratios come from the specification update for Xeon Gold/Platinum (Cascade Lake) processors, which provides base and turbo boost frequencies for each number of active cores in AVX2/AVX-512 modes (to keep thermals within the specified TDP rating).

    https://www.intel.com/content/www/u...s/xeon/2nd-gen-xeon-scalable-spec-update.html

     
    Last edited: Jul 21, 2020
    Alessio1989 likes this.
  19. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    33% above 6C is 8C.

    Does SIMD care about cores (CUs)? It does not care much on GPUs. Why would a native SIMD workload need sync across cores?
    (Your example shows that a CPU with double the cores in AVX2 mode has an equal or higher clock than a CPU with fewer cores in AVX-512 mode. Not that it matters much, as they are not equal in TDP. And Intel's TDP is not something I trust, even if those values look much more realistic than what their desktop marketing states. On top of all that there is binning, where a CPU of the same generation with more cores can have a higher turbo than a CPU with fewer cores at exactly the same TDP.)

    It would likely need a real-world test. But their paper basically states that while AVX2 improves absolute performance per core, it does not actually improve performance per watt. Sadly, I could not find a similar paper for AVX-512.
    But the test could be done even with a single CPU, by measuring time to finish and power consumed with AVX2 vs. AVX-512.
     
  20. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Because continuously tearing down and restarting threads/fibers for each new batch of work would be very expensive compared to any form of inter-thread synchronisation.
    You'd want to keep your threads/fibers running, continuously sending them new data and getting back the results (a minimal sketch of that pattern follows the links below).

    https://lemire.me/blog/2020/01/30/cost-of-a-thread-in-c-under-linux/
    https://lemire.me/blog/2020/06/10/reusing-a-thread-in-c-for-better-performance/
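    A minimal sketch of that pattern - one reusable worker fed through a queue, in the spirit of the posts above but not their exact code:

    Code:
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    // One long-lived worker that keeps accepting jobs, avoiding the cost of
    // creating and destroying a thread for every batch of work.
    class Worker {
    public:
        Worker() : thread_([this] { run(); }) {}
        ~Worker() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            thread_.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();  // the batch runs on the already-running thread
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        bool done_ = false;
        std::thread thread_;  // declared last so it starts after the other members
    };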

    You can compare the same CPU with a different number of active cores - the frequency ratios above still stand. These are US$3000 server CPUs; it's as good as you can possibly get.

    No, it doesn't. AVX2 does more work per second at the same wattage, so it does improve performance per watt.
     
    Last edited: Jul 25, 2020
