AVX-512 Is an Intel Gimmick To Win Benchmarks and should die a painful death

Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jul 13, 2020.

  1. bobblunderton

    bobblunderton Master Guru

    Messages:
    420
    Likes Received:
    199
    GPU:
    EVGA 2070 Super 8gb
    Well, thank you for the value-add there - appreciated!
    AMD just dropped 3DNow! not that long ago, come to think of it - I read about it not too long back. Might have been with Ryzen, or it could have been one of the FX line. 3DNow! was basically AMD's answer to MMX, from what I remember. For both vendors this stuff turned into SSE somewhere around the Pentium II or Pentium III, with various versions up to around 4.2, which I believe was released with Nehalem / 1st gen i7. I know about 4.2 because Noita required it in the beta for world-seeding and creating the terrain. If you only had SSE 4.1 in the chip, like a Conroe or Wolfdale or a closely related derivative of that architecture, you ended up spawning into a never-ending abyss. They did fix the problem and added a work-around for processors without the two or three ops added in 4.2 over 4.1, for what it's worth.

    More on-topic, I wish both AMD and Intel could offer me some sort of 'math performance enhancer' I could use to speed up BeamNG Drive's floating point computations. Almost like the Xeon Phi cards, but something that won't make fire inside the computer or cost an organ or two's worth of cash. When I test my map out with traffic enabled, 15~25 vehicles brings a decent Ryzen chip to its knees! While it shames GTA's game engine for physics and driving, it requires so much computational horsepower. I'd almost pay anything to have more, but that isn't really realistic these days.
    Mr. Addams, I hope you get your wish - I hope they do upgrade the FPUs and release it across all the chip lines; we could really use it not just for gaming and simulation, but for AI and research too. It would be better for all of us.
    Though if we do get our wish - your wish - however it works out, I'll need to hire someone to beat DX11 draw call limits into submission, because with more CPU power I will run face-first into a draw call brick wall fast... [cue lots of obscene words about DX11's limits on rendering] ... grrr :x

    If someone's vacationing on an island where they have DX12 rendering, please send me a postcard.
     
    Gomez Addams likes this.
  2. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Is a benchmark something that does not represent the way the work is done in the real world? To me, such a benchmark would have zero value unless I could use exactly the same code/binaries to do the actual work.
     
  3. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Mixing AVX code with legacy code is simply lazy programming. You can't just add a few mnemonics to 50-year-old code and expect your throughput to radically improve. For best results, you need to reimplement the entire algorithm in the SIMD paradigm, and you need to break the work into multiple parallel threads that run on separate processor cores. It's OK to rely on auto-vectorization and loop unrolling in the optimizer, and threads are much easier in modern C++ than they were in the 1990s. We have a freaking Cray Y-MP vector supercomputer on our desktops today - you just need to use it to its full potential!
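    As a rough illustration of the threading half (a minimal sketch with made-up names, not tied to any particular codebase), splitting a simple loop across std::thread workers looks like this, with the inner loop kept trivial enough for the optimizer to auto-vectorize:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    // Scale a large array in parallel: each worker gets a contiguous chunk,
    // and the tight inner loop is trivially auto-vectorizable (SSE2/AVX).
    void scale_all(float* data, std::size_t n, float factor)
    {
        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        const std::size_t chunk = (n + workers - 1) / workers;

        for (unsigned w = 0; w < workers; ++w) {
            const std::size_t begin = w * chunk;
            const std::size_t end   = std::min(n, begin + chunk);
            if (begin >= end) break;
            pool.emplace_back([=] {
                for (std::size_t i = begin; i < end; ++i)  // auto-vectorized at -O2/-O3 or /O2
                    data[i] *= factor;
            });
        }
        for (auto& t : pool) t.join();
    }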

    Improving FPU performance and simplifying the CPU are mutually exclusive goals.

    It's a real-world algorithm for 3D particle systems, but CPU implementations are limited to computer graphics / video effects software like Blender, Autodesk 3ds Max, Adobe After Effects, etc. - these are not found on your typical home PC.

    I'm also not sure if that specific algorithm was vectorized to the maximum extent possible. Auto-vectorization is often used as the first step in optimisation - you tweak your C/C++ code and data to allow the optimizer to produce the fastest SSE2/AVX assembly code. You can also take the resulting assembly and convert it to SSE2/AVX compiler intrinsics in your C/C++ code, then further tweak the algorithm to achieve better performance. It's quite hard work though, even for a very qualified engineer.
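    For illustration, the same kind of inner loop rewritten with AVX2/FMA compiler intrinsics might look like this (a hedged sketch with made-up function names, assuming the length is a multiple of 8 for brevity):

    Code:
    #include <immintrin.h>
    #include <cstddef>

    // out[i] = a[i] * scale + b[i], eight floats per iteration with AVX2/FMA.
    // Assumes n is a multiple of 8; a real version needs a scalar tail loop.
    // Build with -mavx2 -mfma (GCC/Clang) or /arch:AVX2 (MSVC).
    void fma_arrays(const float* a, const float* b, float* out, std::size_t n, float scale)
    {
        const __m256 vscale = _mm256_set1_ps(scale);
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);            // unaligned load of 8 floats
            __m256 vb = _mm256_loadu_ps(b + i);
            __m256 vr = _mm256_fmadd_ps(va, vscale, vb);   // fused multiply-add per lane
            _mm256_storeu_ps(out + i, vr);
        }
    }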
     
    Last edited: Jul 20, 2020
  4. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    So a fair benchmark is actually a real-world set of workloads in existing software, right?
    Like running 1000 tasks (actions) in sequence as they would be done by a user. Apparently, many of those workloads and actions would not benefit from it, and the improvement to the entire workflow would not be exactly big as a result. But that would still be an adequate representation of the benefit users can expect in day-to-day work.
     

  5. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    BTW, there is a recent presentation from the same author:


    He compares the sequential read speed of PCIe 4.0 SSDs - which is around 5 GByte/s - to the throughput of basic text processing algorithms, like BASE64 encoding (used in the SMTP protocol), UTF-8 validation, JSON parsing, and general text-to-number conversion.
    Not surprisingly, many legacy implementations achieve only 100-300 MByte/s of real-world throughput.

    And SIMD implementations of these base algorithms improve parsing performance by nearly 1.5 orders of magnitude - that is, 3 GByte/s or more.
    These are based on individual experiments posted on his blog, complete with code examples (a minimal simdjson usage sketch follows the list):

    Encoding binary in ASCII very fast
    Fast float parsing in practice
    Parsing numbers in C++: streams, strtod, from_chars
    How expensive is it to parse numbers from a string in C++?
    JSON parsing: simdjson vs. JSON for Modern C++
    We released simdjson 0.3: the fastest JSON parser in the world is even better!
    How fast is getline in C++?
    Reusing a thread in C++ for better performance
    Cost of a thread in C++ under Linux
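    For reference, here is roughly what using simdjson's DOM API looks like - based on the quick-start from the 0.3-era documentation, assuming exceptions are enabled; twitter.json is the sample file shipped with the repository:

    Code:
    #include <iostream>
    #include "simdjson.h"

    int main() {
        simdjson::dom::parser parser;
        // Parses and validates the whole document with SIMD kernels
        // (SSE4/AVX2 code paths are selected at runtime). Throws on error.
        simdjson::dom::element tweets = parser.load("twitter.json");
        std::cout << tweets["search_metadata"]["count"] << " results." << std::endl;
        return 0;
    }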
     
    Last edited: Jul 19, 2020
  6. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Not quite. These are two separate classes of performance tests, synthetic and real-world.

    Synthetic tests / microbenchmarks either measure some specific general-purpose algorithm, or stress a specific hardware block to measure its maximum output. Think about generic 'compression' or 'video encoding' benchmarks, or memory/cache bandwidth and pixel/texel fill rate tests.

    Real-world tests either simulate actual applications by essentially re-creating their workflow and data sets, or directly measure the actual running time, graphics frame rate, etc. of specific popular applications by using their scripting engines. 3DMark is the most popular simulation test, while PCMark and game engine benchmarks use built-in scripting to profile real-world applications.
     
    Last edited: Jul 19, 2020
  7. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    I do not need benchmarks to guess that AVX-512 could do roughly twice as much work as AVX2 per cycle under optimal conditions. It may be held back by cache, the actual data structure, and so on.
    But what I question is the actual usefulness even for people whose workloads partly use such instructions. IIRC, running even AVX2 resulted in lower clocks for Intel's CPUs due to hitting the TDP wall.
    (Mind that productivity workloads are about power efficiency too. And in the end, the number of actual FLOPS on actual data will be about the same as with AVX2 by the time the workload is over.)
     
  8. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Well, the message is simple: if you are unable to optimize your specific workloads to run faster, then don't use SIMD. At the same time, it's quite possible to achieve an order-of-magnitude better performance even in trivial cases if you make a concerted refactoring effort.
     
    Alessio1989 likes this.
  9. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    Are you implying that it is the x87 era vs. AVX-512? Because I did not.
     
  10. Carfax

    Carfax Ancient Guru

    Messages:
    3,973
    Likes Received:
    1,463
    GPU:
    Zotac 4090 Extreme
    Here's an actual application that Intel used to compare scalar to SIMD performance. The application was originally written by Microsoft and was intended to demonstrate how DX12 asynchronous compute could use spare GPU cycles to do particle simulation. It uses 10,000 particles and simulates up to 100,000,000 interactions every tick.

    That's some damn good scaling, and it makes it obvious how AVX2 will be heavily leveraged by game developers for various physics simulations in the next-gen consoles. :D

    [Image: Intel's scalar vs. SIMD particle simulation benchmark chart]
     

  11. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    No. The x87 era is over, x87 commands are not supported in x64 long mode where SSE2 is required by the AMD64 specs, and even 32-bit platforms now require SSE2 support (since Windows 8 and Fedora 29). Any recent compiler uses SSE2 by default for all calculations and data transfer operations.
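    As a small illustration of that default (assuming any mainstream x86-64 compiler at normal optimization levels):

    Code:
    // Built as 64-bit code (e.g. g++ -O2 on x86-64 or the MSVC x64 toolset),
    // this plain scalar multiply is emitted as an SSE instruction such as mulss
    // operating on xmm registers, not an x87 fmul - no special flags required.
    float scale(float value, float factor)
    {
        return value * factor;
    }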


    I'm talking about scalar code vs. vectorized code. Vectorization - i.e. converting from the scalar (one value at a time) to the vector (multiple values) paradigm - involves changing the data layout of your variables/structures to use wide registers holding multiple values, and changing your control flow to do more work per iteration before branching.
    You can either rely on the compiler/optimizer to auto-vectorize your scalar code and unroll the loops, or you can refactor your code to use Intel SSE2/AVX2/AVX-512 intrinsics and vectorize all your data to the maximum register width.
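    The classic example of that data-layout change is going from an array-of-structures to a structure-of-arrays (a sketch with hypothetical particle fields, not taken from the demo discussed above):

    Code:
    #include <cstddef>
    #include <vector>

    // Scalar-friendly layout: one struct per particle (array of structures).
    struct ParticleAoS { float x, y, z, mass; };

    // Vector-friendly layout: one contiguous array per field (structure of arrays).
    // A single AVX load now fills a register with 8 x-coordinates (16 with AVX-512).
    struct ParticlesSoA {
        std::vector<float> x, y, z, mass;
    };

    // With SoA the per-field loop is contiguous and trivially auto-vectorizable,
    // and it can later be rewritten with intrinsics at full register width.
    void integrate_x(ParticlesSoA& p, const std::vector<float>& vx, float dt)
    {
        for (std::size_t i = 0; i < p.x.size(); ++i)
            p.x[i] += vx[i] * dt;
    }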


    On Skylake-X, AVX-512 vs. AVX2 should give you a ~1.7x improvement in the ideal case - so even if thermal throttling kicks in, the AVX-512 path should still be faster. And if there is no such improvement with either auto-vectorization or manual coding, it's probably because your code is still not optimal and you need to refactor it further.
    Unfortunately this often takes a lot of time and effort, and may require you to change APIs or data layouts, which is not always possible - so sometimes there is no choice but to skip these optimisations and wait for a better opportunity. That could be Willow Cove / Golden Cove processors, which offer significant IPC improvements and may also include faster AVX-512 FMA units.
     
    Last edited: Jul 20, 2020
  12. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
    x87 is no longer a thing when targeting the 64-bit extension unless you explicitly use it via assembly. Basic 64-bit FPU instructions target SSE2.

    x87 is supported in long mode too (chapter 2, middle of the first page of the chapter: https://www.amd.com/system/files/TechDocs/26569_APM_V5.pdf). But yes, with SSE2 there are usually no good reasons to use the old FPU ISA.
     
    Last edited: Jul 20, 2020
  13. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Sure, but Microsoft does not officially support x87, MMX, or 3DNow! intrinsics or inline assembly on the x64 platform, and it's not possible to compile such code with the 64-bit MSVC toolset - you have to use MASM and object-file linking.

    https://docs.microsoft.com/en-us/wi...programming-for-game-developers#assembly_code
    https://docs.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list

    It might still be supported by Linux/GCC, though I fail to see the point.
     
    Last edited: Jul 20, 2020
  14. Alessio1989

    Alessio1989 Ancient Guru

    Messages:
    2,959
    Likes Received:
    1,246
    GPU:
    .
  15. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    ...Hyperbole. Because @DmitryKo made it sound like it is either AVX-512 or no wide-data optimization at all, while in reality we are comparing the world of AVX2 with AVX-512.
    It still remains true that to push data through AVX2, a CPU core eats more power. And that increases again with AVX-512.
    If you hold an 8C/16T CPU strictly to 65W, the clock goes down the more data you put through the CPU cores.
    In other words, AVX-512 can do more, if you are willing to pay more.

    The i9-10900K has a base TDP of 125W, at which its base clock is 3.7GHz. In the 95W TDP mode the base clock is 3.3GHz.
    I doubt that those are AVX-512 clocks.
     

  16. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    In my book, 'SIMD' on x86 processors would encompass MMX, SSE/SSE2/SSE4, AVX/AVX2/FMA, and AVX-512F/BW/DQ/VL.

    That's exactly why Intel has been widening the data path and adding more registers - even if you have to throttle down the frequency by 30%, you're still processing 16 single-precision values per instruction (32 FLOPs for a single FMA), so it's a sizeable gain overall. Newer architectures and process nodes should improve it further with better power management.


    The blog post above mentions OpenSSL. From the code comments, their AVX2 (VEX-256) integer path is exactly 2 times faster than the AVX (VEX-128) integer path on Haswell/Skylake-X; AVX-512 is still ~1.5 times faster than AVX2, but they thought it wasn't enough to compensate for thermal throttling on Skylake-X.
    But the SSE2/AVX versions do not even provide much gain compared to the scalar integer versions, except on Bulldozer.
    https://github.com/openssl/openssl/...a0d1b6/crypto/poly1305/asm/poly1305-x86_64.pl
     
  17. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    https://www.intel.com/content/dam/w...on-e5-v3-advanced-vector-extensions-paper.pdf
    Some power figures for AVX2. I really wonder about AVX-512, around which we dance but refuse to talk.

    Just so this isn't confusing for some: the image shows relative power draw (relative because it is energy consumed per unit of delivered performance) and the performance improvement from using wider data processing.
    Please note that it is 1.02 for AVX2 over SSE 4.2. This means that while AVX2 can process 2.8 times more data than SSE 4.2 per unit of time, it eats 2.8 × 1.02 = 2.856x more energy. (And that's while Intel has their AVX clock offset, otherwise performance would be a clean 2x/4x/8x multiple as the vector width increases.)

    To put it bluntly, if you have a 4C CPU with AVX-512 vs. an 8C CPU with AVX2, both could likely do about the same work at the same clock. But the 4C CPU would be so far outside its power-efficiency sweet spot that performance per watt would tank.
    Then add that people have to pay for those transistors even if they have almost no use for them, and the same workload could be processed via AVX2.

    And therefore one needs a real comparison in a TDP-constrained scenario, because AVX-512 is practically meant for productivity, and there power draw per delivered result matters.
    So, if we took that 4C AVX-512-enabled CPU and restrained it to the same TDP as, let's say, a 6C CPU with AVX2, would it win? And if it did, by how much?
     
  18. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    A 4-core AVX-512 CPU should be ~33% faster than a 6-core AVX2 (VEX-256) CPU: four cores doing twice the per-instruction work are roughly equivalent to eight AVX2 cores, and 8/6 ≈ 1.33.

    Provided you can parallelize your SIMD algorithm and scale to all available cores - and parallelization is more complex than vectorization - you'd probably need at least twice the number of active cores to compensate for the narrower data path, smaller register file, and thread synchronisation overhead, since the ~20% frequency increase going from AVX-512 to AVX2 is largely negated by the 10-20% frequency drop from employing the additional cores.

    These ratios come from the specification update for Xeon Gold/Platinum (Cascade Lake) processors, which provides base and turbo boost frequencies for each number of active cores in AVX2/AVX-512 modes (to keep thermals within the specified TDP rating).

    https://www.intel.com/content/www/u...s/xeon/2nd-gen-xeon-scalable-spec-update.html

     
    Last edited: Jul 21, 2020
    Alessio1989 likes this.
  19. Fox2232

    Fox2232 Guest

    Messages:
    11,808
    Likes Received:
    3,371
    GPU:
    6900XT+AW@240Hz
    33% above 6C is 8C.

    Does SIMD care about cores (CUs)? It does not care much on GPUs. Why would a native SIMD workload need sync across cores?
    (Your example shows that a CPU with double the cores in AVX2 mode has an equal or higher clock than a CPU with fewer cores in AVX-512 mode. Not that it matters much, as they are not equal in TDP. And Intel's TDP is not something I trust, even if those values look much more realistic than what their desktop marketing states. On top of all that there is binning, where a CPU of the same generation with more cores can have a higher turbo than a CPU with fewer cores at exactly the same TDP.)

    It would likely need a real-world test. But their paper basically states that while AVX2 improves absolute performance per core, it does not actually improve performance per watt. Sadly, I could not find a similar paper for AVX-512.
    But the test could be done even with a single CPU, by measuring time to finish and power consumed with AVX2 vs. AVX-512.
     
  20. DmitryKo

    DmitryKo Master Guru

    Messages:
    450
    Likes Received:
    165
    GPU:
    ASRock RX 7800 XT
    Because continuously tearing down and restarting threads/fibers for each new batch of work would be very expensive compared to any form of inter-thread synchronisation.
    You'd want to keep your threads/fibers running, continuously sending them new data and getting back the results (a minimal sketch of that pattern follows the links below).

    https://lemire.me/blog/2020/01/30/cost-of-a-thread-in-c-under-linux/
    https://lemire.me/blog/2020/06/10/reusing-a-thread-in-c-for-better-performance/
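    A minimal sketch of that pattern - one reusable worker fed through a queue, in the spirit of the posts above but not their exact code:

    Code:
    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    // One long-lived worker that keeps accepting jobs, avoiding the cost of
    // creating and destroying a thread for every batch of work.
    class Worker {
    public:
        Worker() : thread_([this] { run(); }) {}
        ~Worker() {
            { std::lock_guard<std::mutex> lk(m_); done_ = true; }
            cv_.notify_one();
            thread_.join();
        }
        void submit(std::function<void()> job) {
            { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
            cv_.notify_one();
        }
    private:
        void run() {
            for (;;) {
                std::function<void()> job;
                {
                    std::unique_lock<std::mutex> lk(m_);
                    cv_.wait(lk, [this] { return done_ || !jobs_.empty(); });
                    if (done_ && jobs_.empty()) return;
                    job = std::move(jobs_.front());
                    jobs_.pop();
                }
                job();  // the batch runs on the already-running thread
            }
        }
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::function<void()>> jobs_;
        bool done_ = false;
        std::thread thread_;  // declared last so it starts after the other members
    };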

    You can compare the same CPU with a different number of active cores - the frequency ratios above still stand. These are US$3000 server CPUs; it's as good as you can possibly get.

    No, it doesn't. AVX2 does more work per second at the same wattage, so it does improve performance per watt.
     
    Last edited: Jul 25, 2020
