Discussion in 'Frontpage news' started by Hilbert Hagedoorn, Jul 13, 2020.
The x265 benchmark is based on, and named after, the x265 open-source codec. The benchmark literally tests a CPU's performance when running the x265 codec.
I'm genuinely surprised that you were unaware of this, as you're usually very tech savvy. I guess you don't do much encoding.
x265 wiki page.
The point is, if you're using CPU encoding you're aiming for quality, not speed of encode, and the AVX-512 optimizations don't really help (and can even hurt) unless you're encoding at 4K or higher resolutions.
Ordinarily I would agree with you on this, because you're correct that encoding for quality typically requires branchier and more serial processing. However, with Intel's SVT (Scalable Video Technology) codecs, that paradigm seems to have changed: they allow much greater parallelization without affecting quality, while simultaneously achieving a huge performance gain.
I cannot understand why AVX-512 is suddenly such overkill while AVX/AVX2 is perfectly OK.
AVX-512 is a continuation of the previous SIMD extension model, where major revisions like SSE and AVX introduce wider registers and increase the register count, adding new commands and/or updating existing mnemonics to operate on the new registers and data types, while interim revisions introduce a small number of application-specific commands.
Likewise, AVX-512F and the VL/BW/DQ, VBMI/VBMI2 etc. extensions mostly update existing SSE/SSE2/SSE4 and AVX/AVX2/FMA instructions using a new, streamlined encoding format (the EVEX prefix), while CD, VNNI, IFMA, GFNI, VAES etc. add application-specific extensions. Overall it's probably 25% new commands; the rest is upgraded mnemonics and existing SSE/AVX commands (including 128-bit and 256-bit EVEX-encoded variants).
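To make the "mostly upgraded mnemonics" point concrete, here is a plain C sketch of the per-lane semantics that an EVEX-encoded masked add provides. The function name and the 8-lane width are just for illustration; real hardware does all lanes in a single instruction:

```c
#include <stdint.h>

/* Scalar sketch of AVX-512 merge-masked add semantics: for each lane i,
 * the result takes a[i] + b[i] when mask bit i is set, otherwise it
 * keeps the value from src. Real AVX-512 does all lanes at once; this
 * loop just spells out the per-lane rule. */
void masked_add_epi32(int32_t dst[8], const int32_t src[8],
                      const int32_t a[8], const int32_t b[8],
                      uint8_t mask)
{
    for (int i = 0; i < 8; i++)
        dst[i] = ((mask >> i) & 1) ? a[i] + b[i] : src[i];
}
```

Per-lane masking like this is one of the genuinely new capabilities the EVEX encoding adds on top of the legacy VEX-encoded operations.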
It doesn't mean FPUs will be wasting transistors on unnecessarily wide blocks either. Initial Skylake-X / Cascade Lake implementations combine two 256-bit execution blocks to process 512-bit commands; advanced implementations would use 512-bit-wide execution blocks to improve throughput (like Haswell and AMD Zen 2 did with 256-bit AVX2 commands).
Also, AVX-512 will only arrive on the mainstream desktop in early 2021 with the 14 nm Rocket Lake S (Willow Cove). It was supposed to be generally available with Cannon Lake S in 2017, but the continued failure of Intel's 10 nm process node resulted in the cancellation of the Palm Cove / Sunny Cove desktop processors (except HEDT versions).
Right now Intel hasn't even started optimizing their AVX-512 implementations; it's mostly for developers and early adopters. This will only change after AMD's Zen 4 architecture introduces AVX-512 (if the rumour is true).
Exactly. AVX-512 is going to improve and become better over time with continued development and adoption, and smaller process nodes. I think AVX-512 will hit its stride when Intel switches over to 7 nm and has a lot more room to work with in terms of widening the architecture and adding more cache, not to mention that DDR5 and PCIe 5.0 will also be in full swing by then.
AVX2 support came with the 256-bit-wide FPUs on Haswell / Intel 4000-series processors (well, most of them; the low-end ones like Celeron had support for it, and for virtualization, cut out).
Full-speed AVX2 came to AMD via Zen 2 on the Ryzen 3xxx (non-APU) 7 nm processors. The 14 nm and 12 nm Zen/Zen+ chips (which also covers the APU-only 3xxx models like the 3200G/3400G) do support AVX2, but execute each 256-bit operation as two 128-bit ops at half the throughput.
AVX-512 support on Intel was limited to the server/workstation and HEDT markets.
If a user has borderline cooling, AVX2 or AVX-512 will surely send temps into the stratosphere; if memory serves correctly, that's why Prime95 used to overheat and throttle (or even BSOD) Intel systems!
AMD's FX series and the like (Jaguar, Bulldozer, Piledriver, and derivatives of that architecture) used a module topology for the processor.
This 'module' had TWO Arithmetic Logic Units (ALUs) and one 256-bit FPU that could split into 2x 128-bit FPUs when two threads required FPU work. This would, for the most part, cut performance in half on a processor whose FPUs were rather weak to begin with. So in reality, not only were the FPUs not nearly as good as those in Haswell and later Intel chips, there were also only half as many FPUs in an AMD FX-series chip (including the consoles that used the architecture) compared to Intel chips or even Phenom II chips (the Phenom II had 128-bit FPUs, but one per ALU).
This was the reason that AMD's FX series, which wasn't very powerful to start with (and in some cases was slower than the great-for-their-day Phenom II line), was fairly reviled in the news/review-site media, and it wasn't very good from the get-go even with the scheduler enhancements and some new instruction sets. It seems AMD never learned from Cyrix's undoing: Cyrix shipped poor-performing FPUs on their 6x86 line, not much better than those in their 486/586 models since they did NOT upgrade them, only to be killed off by the Quake craze. I had a Cyrix chip in 1995~1996, which I overclocked back in 1996 to help with Quake performance, because yes, it was that bad. Cyrix 6x86 32-bit Windows app performance was above par, but the game performance is what really put the nails in the coffin for them, sadly. Thankfully, Zen 2 FPU performance is very good, generally no less than Intel's for AI and physics from everything I can see (3700X and 3950X tested, with 'meh' RAM).
There was even a lawsuit over that. A 'core' technically dates back to the ALU in pre-Pentium days, when some chips had a defective/disabled math co-processor, or weren't made with one to begin with to cut costs. The owner of such a machine could buy a 287, 387, or 487 chip to help speed up math and performance in 3D games like Duke 3D and Doom, and in games with lots of units such as Warcraft and Command & Conquer. The 487 was actually a fully functional chip that disabled the on-board 486 SX entirely, taking over ALL operations, unlike previous co-processor solutions.
In 2014 I upgraded my 2.8 GHz Athlon II X4 processor to an FX-6300. I was so disappointed at the poor performance of the processor that I almost threw the machine out of the 2nd-floor window of my house a few times. Three weeks later I went out, spent $1,200~1,500 (I forget exactly), and built a new Z97 Haswell system, which I just replaced a year ago with Zen 2. While toasty hot and in need of delidding, the 4790K was literally double the performance of the 6300, and even better in AI/physics situations.
In a nutshell, the only person (or company) Intel should blame for the lack of mass AVX-512 adoption is themselves, for artificially segmenting it to only the highest tiers of the market. Outside of a few select purpose-built or niche apps, it's never going to get used if no one has support for it. It'd be like writing a book in a made-up language that no one but you knew.
AVX, which stands for Advanced Vector Extensions, is there to facilitate quick and painless multimedia and math ops in a multitude of software such as browsers, media players, and games. The CPU can get a lot MORE work done with the same amount of power, and therefore having it can save energy and extend (albeit marginally) battery life. I believe most or all of these 'shortcut' instructions, so to speak, go through the FPU.
If anyone has corrections on any of the above, please quote and correct if you wish to take the time. I have tried my best to make sure I got all the facts right before typing this book out, though. Good topic!
*This part added, as I am not 100% sure, but I believe the reason Prime95 would 'almost melt' post-Sandy-Bridge non-soldered Intel processors was that the 'Small FFTs' test would use the highest level of AVX supported by the processor. Haswell going to a 256-bit FPU/AVX2 just exacerbated the situation. So I added a $100 air cooler and delidded / did the liquid-metal dance so that I didn't burn down the building the next time I used AVX. The Zen 2 chips are soldered like the later Intel 8xxx chips and hence don't have as much of an issue with heat, though they still get warm during AVX ops.
Source: 25~26 years of computing, since I had an AM486 DX/2-66 chip to do it on. Also, I live/eat/breathe/sleep modding BeamNG.drive, which is an entirely physics-based driving simulator with soft-body physics. You'll recognize this name if you're into downloading mods for it, as I've made the Los Injurus City map mod and the Roane County, TN map mods (among a few) for the game over the last few years. Good floating-point performance in my computer is an absolute MUST. That said, for users who love the game, don't buy more than a 3900X for it; you'll be wasting your money. In fact, I'd recommend sticking with either a 3600 or 3700X for best value while still running decent amounts of traffic. An Intel 10600K would do nicely (if you can get it for a good price; either the AMD or Intel 6-core chips work) if you're a die-hard Intel fan or just don't want to anger your Windows installation when changing teams/brands.
--That is all!
Yes. Intel already widened some execution blocks to 512 bits, but only on the highest-end Xeon Platinum (Skylake SP) server processors, which have an additional dedicated AVX-512 FMA unit.
The rest of the current lineup fuses two 256-bit blocks to function as a single AVX-512 FMA unit, so the issue rate for AVX-512 commands is much lower and actual performance is roughly the same as AVX2 despite the wider data path.
So there is room for all kinds of similar optimizations in future 7 nm Golden Cove / Ocean Cove parts, with additional fused or dedicated execution blocks and back-end/scheduler improvements (more issue ports, larger μop caches, etc.).
Initial AMD implementations (Zen 4?) will probably use 256-bit blocks as well.
I'm not sure what the future of AVX-512 is, since Intel is moving to implement the AMX extensions with Sapphire Rapids-based Xeons in 2021. This new ISA will take over ML acceleration, which will leave all the AI additions to AVX-512 hanging in the air. The consumer/workstation CPUs will probably end up with AVX-512 as a single fused unit to save die area, and that's where it will stay for good if Intel decides at some point to bring the new matrix extensions down to the mass market.
The HEDT Intel CPUs based on Skylake Xeons have 2 AVX-512 units. I remember when the original Core i9 series launched, there was confusion as to whether parts like the 7820X had one or two AVX-512 units: they were marketed as having only one, but the benchmarks showed two, and Intel had to clarify that they all had two.
Intel Core i7-7820X
Intel Core i9-7900X
Intel Core i9-7980XE
The more recent ones also have two:
A lot of the performance problems with AVX-512 are caused by the smaller caches and the extreme power consumption from 14nm, both of which will be significantly ameliorated with smaller process nodes.
I'm pretty sure that Ice Lake and Tiger Lake will both have a single AVX-512 unit though. Kind of weird, but in the AnandTech benchmarks the Ice Lake part destroys Whiskey Lake and Kaby Lake in this particular benchmark, which is optimized for both AVX2 and AVX-512. I would have thought that Ice Lake's single AVX-512 unit would get only slightly better performance than Whiskey Lake and Kaby Lake, but they were totally destroyed!
Yeah, I wouldn't be surprised at this either. AMD is always very conservative when it comes to adopting new extensions, while Intel is extremely cavalier!
I agree with what you wrote, for the most part. The only thing I would add is that alternative floating-point instructions were introduced because the x87 model of floating-point computation is rather inefficient and cumbersome. It was designed for an off-chip co-processor and was later translated into on-chip instructions. Its stack-based register model is long past its sell-by date. I am glad to see better alternatives arise, but they need to get all these various instruction sets under better control. Some are so rarely used they are just a waste of chip real estate now.
I stand corrected. Speaks volumes about internal development and documentation processes at Intel.
Lower-end Xeon Silver/Gold do have only a single AVX-512 FMA unit though.
This is not a strictly AVX-512 benchmark: it's just some minimal C code, so end performance is heavily dependent on the ability of a specific compiler and optimizer to correctly unroll the loop and auto-vectorize the code to issue SSE2, AVX2, or AVX-512 instructions in an optimal order.
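For what it's worth, the kind of loop such a minimal benchmark boils down to looks something like this (the names are illustrative, not the benchmark's actual code). Whether the compiler emits SSE2, AVX2, or AVX-512 for it is entirely up to the optimizer, since the C source never mentions any of them:

```c
#include <stddef.h>

/* A textbook auto-vectorization candidate: no branches, no
 * cross-iteration dependencies, unit-stride accesses, and restrict
 * pointers so the compiler knows x and y don't alias. At -O2/-O3 a
 * compiler is free to process 4, 8, or 16 floats per instruction
 * depending on the target ISA it chooses. */
void saxpy(float *restrict y, const float *restrict x,
           float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Small changes to such code (an added branch, possible aliasing) can disable vectorization entirely, which is exactly why results vary so much between compilers.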
The Willow Cove architecture seems to offer enough improvement to warrant a disproportionately large hike in this test.
A proper benchmark would be designed to smooth out intricate architectural differences, with code paths hand-coded using native AVX intrinsics and a large dataset.
Anyway, server/HEDT parts like the Core i9-X score 40,000+ points in 3DPM v2.1, or 5 times as much as Tiger Lake U, which is expected of mobile parts. The mainstream desktop Rocket Lake S should be considerably faster.
You have to think from a kernel software point of view to understand. From someone like Linus's perspective, adding more instructions isn't always better; he would probably much rather just have faster AVX2 than a new set of instructions, hence "garbage", since exploiting new extensions requires updated software and fragments software support for older CPUs.
It's not like AVX-512 REALLY enables you to do anything much better for most programs that a faster AVX2 wouldn't be able to give you.
Perhaps an antiquated point of view, but a fair point.
It's a bit premature to dismiss SSE2/AVX, which has been around for some 20 years now, so it's not going to disappear any time soon.
Intel AMX is not even released, and the only 'accelerator' currently defined is TMUL (matrix multiply).
AVX-512F/VL/BW/DQ/VBMI/VBMI2 is mostly an extension of the existing SSE2/AVX2 command set, with the new EVEX encoding allowing twice as many 128/256-bit XMM/YMM registers and 512-bit-wide ZMM registers. The new commands/mnemonics in CD/VNNI/IFMA/GFNI/VAES etc. make up only a small subset of the entire legacy command set.
AVX-512 is exactly 'a faster AVX2': there is no simple way to substantially speed up SSE2/AVX2 when most commands are already executed in a single cycle and much higher CPU frequencies are not attainable.
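A back-of-the-envelope sketch of why wider registers are the main remaining lever: for a simple streaming loop, the instruction count scales directly with register width (lane arithmetic only, ignoring memory effects; the function is purely illustrative):

```c
/* Number of loop iterations needed to process n 32-bit float values
 * with a SIMD register of the given width in bits. Doubling the
 * register width halves the instruction count for the same work. */
int simd_iterations(int n_floats, int reg_bits)
{
    int lanes = reg_bits / 32;              /* float lanes per register */
    return (n_floats + lanes - 1) / lanes;  /* round up */
}
```

Processing 1024 floats takes 256 iterations with 128-bit SSE registers, 128 with 256-bit AVX, and 64 with 512-bit ZMM registers, which is the entire premise of 'a faster AVX2'.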
If you've already optimized for AVX2, building a CPU with faster AVX2 capabilities doesn't require you to do anything to get that performance bump. You can increase the number of SIMD units without changing the instruction set or adding new instructions (this is basically what Linus suggests in his rant). Not saying it's the "correct" direction to go, but it's a fair criticism considering the implementation issues with AVX-512 (mainly throttling).
There are already two (Skylake S), three (Sunny Cove U), or four (Cascade Lake SP) FPU/FMA units, and AVX2 performance remains virtually the same.
Additional blocks mostly sit idling, because each SIMD command has multiple dependencies on registers/memory and you cannot schedule them all to run in parallel, unless you maintain multiple coherent copies of every register internally (and you practically cannot).
At the same time, moving to 512-bit registers gives you an instant 50-70% improvement on HEDT models with two 512-bit FPU/FMA blocks.
Lazy programmer's heaven, but that's not how things work in real life.
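A scalar C sketch of the dependency-chain point, using a simple sum reduction as the example: with one accumulator every add waits on the previous one, so extra execution units have nothing to do; only breaking the chain into independent accumulators lets multiple units actually work in parallel (the same trick applies lane-wise with SIMD registers):

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous
 * add, forming a serial chain. Extra execution units cannot help,
 * no matter how many the chip has. */
double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent accumulators break the chain into four shorter
 * ones, which the scheduler can issue to separate units in parallel. */
double sum_parallel(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)  /* leftover tail elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

The second version is the kind of restructuring a compiler or programmer has to do explicitly; simply bolting more FPU/FMA ports onto the chip does nothing for code shaped like the first version.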
That's all well and good, but it is wise to read the document in question.
I meant via more SIMD units, but whatever floats your boat. I don't speak from the perspective of the "core"; I speak from the level of the chip. If you add more cores instead of bigger cores, you get the uplift, with its own set of parallelization problems of course.
Edit: to add, this is probably why Linus doesn't like it.
I've read the entire series back at the time of the supposed controversy, and the author concludes that he can't really see the problem.
Running heavy computational loads on multiple threads per core will engage thermal throttling in any case; this is not limited to AVX-512 instructions.
The dangers of AVX-512 throttling: a 3% impact
Trying harder to make AVX-512 look bad: my quantified and reproducible results
AVX-512 throttling: heavy instructions are maybe not so dangerous
Per-core frequency scaling and AVX-512: an experiment
AVX-512: when and how to use these new instructions
Also, the impact was measured on Skylake-X. The next-generation Sunny Cove (10 nm Ice Lake) has additional buffers and caches, and Willow Cove (10 nm Tiger Lake / 14 nm Rocket Lake) and then Golden Cove (Alder Lake / Sapphire Rapids) should have substantial IPC improvements as well.
It would be interesting to see how these experiments behave on these more recent architectures (fortunately the source code has been published on GitHub).
It's not an architectural change and it doesn't improve per-core performance, whereas Torvalds argues that Intel needs to invest in a better FPU architecture to improve their floating-point performance (but fails to see how AVX-512 is one of these architectural improvements he calls for).
So do I.
AVX is only useful when its all you need.
This is not trivial, and I imagine it is a real concern; I frankly do not have a dog in this fight.
Preferring more cores over fatter cores (that require the use of new instructions) isn't really a controversial or new opinion. So I don't really see what the core issue is with Linus's opinion, other than "I don't like it".