New RX 580 has display driver crashing frequently; any other troubleshooting steps?

Discussion in 'Videocards - AMD Radeon' started by Deleted member 282649, Aug 21, 2020.

  1. I recently bought a new RX 580 GPU. In games, I usually get TDRs (the game freezes, the screen goes black for under a minute, then returns with the game either still running or crashed). It normally happens after 5-10 minutes, though sometimes I can go up to an hour before it does.

    Strangely, I've noticed it hasn't happened at all in VR, though I'm not sure if that's just because I don't play as long (normally 30-minute sessions). I've only tried Beat Saber and Blade & Sorcery.

    I have a replacement GPU arriving at some point (I've tried every other solution I could think of, and I suspect the GPU is defective), but I'd like to know if there's anything else I can try.

    Things I've tried and other notes:

    • This happens on both the gaming and compute BIOS
    • Happens on a few randomly-attempted VBIOSes
    • Can reproduce it very quickly with OCCT and the GPU Memtest
    • Happens at random points with FFXIV and Age of Empires 2 DE (both with their native renderers and DXVK)
    • Tried V19 Enterprise graphics drivers (back before 2020; I think it was Q4)
    • Happens on both WDDM 2.6 (with enterprise drivers above) and 2.7
    • Happens on both 20.8.1 and 20.8.2 AMD drivers
    • Happens on both PCI-E slots (PCI-E 3.0 x8 and x16)
    • Happens at both stock voltages/clocks and with voltages maxed to 1.2V (stock clocks) across all states in Wattman, on both core and memory, along with the power limit maxed at +30%
    • No thermal throttling (CPU and GPU are below 80C even under stress-tests)
    • Doesn't seem to happen with general desktop usage (web browsing mostly) on Windows or Linux
    • Has happened ever since I got the GPU
    • Happens at both stock and XMP RAM speeds/timings
    • Memtest passed 4 passes with no errors
    • Tried swapping which PSU connector feeds the GPU's 8-pin
    • Can't quite remember if this happened on Windows 10 LTSC/1809 with WDDM 2.5
    • Windows, DirectX, Vulkan, VC++ redists, chipset drivers, and BIOS are all up-to-date
    • I have a 4K display connected over HDMI at 60Hz

    Specs:

     
    Last edited by a moderator: Aug 21, 2020
  2. The_Amazing_X Master Guru

    Messages:
    708
    Likes Received:
    233
    GPU:
    Red Devil V64
    Step by step:

    1 - DDU
    2 - Use the most up-to-date drivers
    3 - Clear CMOS
    4 - Use both cables from the PSU if possible
    5 - If you have a junk PSU, get a properly rated one
     
  3. Chastity Ancient Guru

    Messages:
    3,733
    Likes Received:
    1,656
    GPU:
    Nitro 5700XT/6800M
    If DDU doesn't clean everything, there's also the AMD Cleanup Utility.
     
  4. MerolaC Ancient Guru

    Messages:
    4,359
    Likes Received:
    1,073
    GPU:
    AsRock RX 6700XT
    @The_Amazing_X
    I agree there.
    The only thing left to try at this point is a new PSU.
     

  5. This seems to still happen even with a new GPU.

    If I do a regular stress test on both the CPU and GPU (either the Power option in OCCT, or separate prime95 + MSI Kombustor tests), stability seems fine (no driver crashing or display blanking). If the PSU were the problem, I'd expect it to fail these stress tests too.

    If I run the GPU memory test in OCCT, it crashes consistently within 3 seconds. I press the start button; right around 3 seconds the animations in the OCCT window stop, about 2 seconds after that the mouse cursor stops responding, and then my display blanks and the driver restarts. This happens consistently, regardless of Wattman options (defaults, increased power limit at defaults, maxed voltages on all states, etc.).
    • poclmembench runs fine and doesn't cause a crash
    • GPU-Z render test runs fine and PCI-E reaches expected speeds (x16 3.0)
    • GPU load stress-tests don't seem to cause an issue (Gputest in Linux, Furmark GL tests in Windows, MSI Kombustor GL and VK burn-in tests on Windows)
    • FurMark-Donut-6500MB test in MSI Kombustor works fine (throws total GPU memory usage to 7200MB)
    • Tried plugging computer straight into wall plug with different cable

    Edit: For fun, I even tossed both RX 580s into the computer to mess around with Crossfire a bit. My PSU handled the stress test of both GPUs no problem, but my power line conditioner (rated around 600-650W) eventually tripped.
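    For rough context on why the conditioner tripped, here's a back-of-envelope tally. The board powers are nominal specs (RX 580 reference board power is about 185W, the 2700X TDP is 105W); the 75W for the rest of the system and the ~90% PSU efficiency are my own assumptions:

```shell
# Nominal numbers: RX 580 board power ~185 W each, Ryzen 7 2700X TDP 105 W.
gpu=185
cpu=105
rest=75                      # board, RAM, drives, fans (rough assumption)
total=$((2*gpu + cpu + rest))
echo "DC-side estimate: ${total} W"     # 550 W
wall=$((total * 100 / 90))              # assume ~90% PSU efficiency
echo "Wall-side estimate: ${wall} W"    # ~611 W, near the 600-650 W trip point
```

    So two RX 580s under combined load landing right around the conditioner's rating is plausible even before transient spikes.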

    Edit 2: I tried the enterprise drivers again and they seemed stable in the OCCT test. I went back to the latest drivers and OCCT crashed the driver in 1 second. Then I went back to the enterprise driver again, where OCCT ran fine until I stopped the VRAM test after almost 8 minutes (4 iterations). Eventually I'll make another post with more detailed info, but it looks like it's the driver.

    Edit 3: Both GPUs survive 1 iteration of the OCCT VRAM test. The old GPU shows 80-130 errors; the new GPU showed 0 errors in 1 iteration (I plan to run an overnight test). FFXIV had a display driver crash and a fatal DX error with the old GPU.

    Edit 4: The new GPU now randomly shows errors in OCCT. I disabled XMP and the BIOS Fast Boot option (I saw a thread where someone reported memory errors with W10 hybrid boot) and now it seems OK again, but I'm not sure if this is consistent.

    Edit 5: OCCT seems completely busted on newer AMD drivers (newer than the V19Q4 enterprise drivers): it only allocates about 4GB of VRAM during the VRAM test before killing the driver. This happens with either a single RX 580, or with both installed and Crossfire disabled. With Crossfire enabled, the test still allocates only 4GB of VRAM on the primary GPU and can complete an iteration, but that iteration reports exactly 2147483647 errors (which doesn't sound right). HWInfo64 reports no GPU memory errors.
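    A side note on that error count (my own observation): 2147483647 is exactly the largest value a signed 32-bit integer can hold, which points to a saturated or overflowed counter in the test rather than a real error total:

```shell
# 2^31 - 1, the maximum value of a signed 32-bit counter:
echo $(( (1 << 31) - 1 ))    # prints 2147483647
```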

    OCCT VRAM test is broken on the following drivers:
    • 20.7.2-july14 (Adrenalin)
    • 20.5.1-june10 (Adrenalin)
    • 20.q2.1-june24 (Enterprise, new control panel)

    OCCT VRAM test works on the following drivers:
    • 19.q4.1-dec4 (Enterprise)
    • 20.q2 (Enterprise, old control panel)
    • 20.4.2-may25 (Adrenalin, new control panel)
    I'm thinking the WDDM level may also have something to do with this (on all the broken drivers WDDM is 2.7, whereas on all the working ones it's 2.6). I will test on LTSC, which maxes out at 2.5.
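    In case anyone wants to compare their own setup, the WDDM level a given driver exposes can be checked from a Windows command prompt via DxDiag's text dump (a sketch; the report filename is arbitrary):

```shell
:: Dump DxDiag's report to a text file, then pull out the driver model line,
:: which reads like "Driver Model: WDDM 2.7" under each display device.
dxdiag /t dxdiag_report.txt
findstr /C:"Driver Model" dxdiag_report.txt
```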
     
    Last edited by a moderator: Aug 25, 2020
  6. Managed to debug this. It turns out the GPU is fine by itself, but there are driver-related issues with OCCT, and on top of that I had some unrelated stability issues of my own (either Fast Boot in the firmware settings, or XMP).

    • Any driver at or newer than 20.5.1-june10 (Adrenalin) and 20.q2.1-june24 (Enterprise), at least on my RX 580, breaks OCCT's video memory test: it only allocates up to 4GB of VRAM before crashing the driver.
    • Drivers at or older than 20.4.2-may25 (Adrenalin) and 20.q2 (Enterprise) work fine with OCCT's video memory test.
    • This is reproducible on two different RX 580s, on both 1809/LTSC and 2004 Windows 10 versions.
    • poclmembench, FurMark, MSI Kombustor, and games all seemingly work fine regardless of driver (old or new), and all allocate video memory fine beyond 4GB.
    • FFXIV seems 100% stable with DXVK and hours of gameplay. It crashed relatively quickly with DX11 and Crossfire regardless of driver version. Haven't tested DX11 with a single RX 580 for too long (performance is better with DXVK).
     
  7. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
    fwiw, I recently moved from an OC'd, BIOS-modded 7950 that was rock stable and never once threw a driver crash at me. I upgraded to an RX 580, and over the last few months I've easily seen a hundred or more AMD driver crashes or TDRs.

    Often it's me mucking about with settings that aren't stable, as this card seems very sensitive to different memory loads (fine in one workload, instant crash in another). The driver crash does still happen at stock clocks and voltages, but there it's usually well enough behaved. I do feel, however, that driver crashing is just how the 580 likes to handle any little niggle.
     
    Last edited: Sep 3, 2020
  8. bobblunderton Master Guru

    Messages:
    420
    Likes Received:
    199
    GPU:
    EVGA 2070 Super 8gb
    Drop the GPU core clock by 25MHz and test; repeat until stable. You don't have to touch the voltage, but you can raise it a notch or two in MSI Afterburner (you can get it here on Guru3D, in the utilities section).
    When it no longer hitches and no longer crashes to desktop, you've found the stable clock speed. To be clear, this is the GPU core clock, not the CPU core clock.
    This issue affects most Windows 10 Polaris users to some degree, with some cards being 'better' than others - not crashing, or not crashing as badly or as often. It seems limited to Polaris-core cards (it could affect Vega too, but I haven't heard of similar issues there). Among the dozen-plus users in my circle, every Polaris card exhibited some degree of the same issues: crashing, display driver resets, application hitching or complete lock-ups, and lost work in 3D modeling applications for no apparent reason.
    If you look at your graphs, you'll see the driver pushes the core clock too high on Windows 10.
    This is why I ditched my RX 480 8GB last year: it kept crashing constantly and I lost too much work, so I jumped ship, spent way too much ($572), and now have a 2070 Super - no more crashing, and I can work!

    No, seriously: Polaris owners, please take heed if it feels like you have a bad GPU or just lots of random hitching/crashing. This fixes it 90% of the time!
    If I recall correctly, you'll have to re-apply your settings after each reboot, as they may not 'stick' between boot-ups, but this should keep the card stable. Once the crashes go away you may still have hitching; just keep lowering the core clock until that goes away too. You shouldn't need to touch memory clocks, voltages, and so forth - only the core clock.
    My RX 480 8GB shipped with a 1288MHz core clock; keeping it around 1200-1225MHz kept it stable all the time and barely affected FPS, since the card is memory-bandwidth starved most of the time. For some dumb reason, at all-stock settings, the driver was pushing my card 15MHz over its nominal clock, up to 1303MHz (though rarely) or more, eventually causing a crash.
    Oddly enough, the same card was perfectly fine in a Windows 7 machine. Some Polaris cards (especially the 4xx series) even get so bad that they scramble the desktop. While this may look like a defective GPU, it's actually the driver misusing the card. It happens even WITHOUT RivaTuner / MSI Afterburner installed, so a third-party application can be ruled out.
    Windows 10's High Performance power plan can exacerbate the issue, but lowering the GPU core clock in small steps and re-testing - lowering further if it still crashes, leaving it alone once stable - should resolve it.

    *My old trusty Radeon 7850 2GB (Pitcairn, I think), which shipped at 850MHz, could run 1000MHz all day at stock voltage and never crash - none of the issues the Polaris cards above had... well, no issues aside from mud textures, because '2GB' of VRAM...
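    For anyone on Linux wanting to try the same 25MHz step-down without Afterburner, the amdgpu driver exposes an overdrive interface via sysfs. This is only a sketch: the card index, state number, clock, and voltage are placeholders, and overdrive must be enabled first (e.g. amdgpu.ppfeaturemask=0xffffffff on the kernel command line) - read your own card's table before writing anything:

```shell
# Inspect the current sclk state table first (Polaris exposes states 0-7):
cat /sys/class/drm/card0/device/pp_od_clk_voltage

# Lower the top state by 25 MHz at its stock voltage (run as root;
# 1315 MHz / 1150 mV are placeholder values - use your card's own numbers):
echo "s 7 1315 1150" > /sys/class/drm/card0/device/pp_od_clk_voltage
echo "c" > /sys/class/drm/card0/device/pp_od_clk_voltage   # commit the change
```

    Like the Afterburner route, this does not persist across reboots unless you script it.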
     
    Last edited: Sep 3, 2020
    PieEyedPiper likes this.
  9. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
    Having a nice lol here, bobblunderton! Espionage is an experienced member and knows what he's doing - re: mentioning CPU core clocks :) But that certainly is a thorough post that I hope will help someone.

    I have never seen AMD drivers boost an RX 580 beyond its rated specs, but anything is possible. I'll be making a separate post soon regarding the RX 580, but I have not encountered any drivers forcing the card to "over-boost", as it were, and crash. Do you have more information regarding that scenario? Driver versions, configs, etc.?

    I agree that lowering the core clock will likely solve the constant crashing OP is experiencing, but by the same token, so would an RMA. I'm just not fond of owning hardware that won't perform as advertised... which leads me to my next thread :p

    Somewhat interesting information here regarding TDR delay:
    https://forums.guru3d.com/threads/gpu-problem-video-tdr-failure.433745/
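    For reference, the TDR window that thread discusses is controlled by a documented registry value; widening it is a workaround, not a fix (the default delay is 2 seconds):

```shell
:: Run from an elevated Command Prompt, then reboot.
:: TdrDelay is the number of seconds the GPU may stall before Windows resets
:: the display driver (default 2). Raising it to 10 only papers over hitches.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 10 /f
```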
     
  10. bobblunderton Master Guru

    Messages:
    420
    Likes Received:
    199
    GPU:
    EVGA 2070 Super 8gb
    Adjusting the TDR window to be more lenient doesn't fix the underlying cause of the hitching (if it's the widespread Polaris issue I described), but it can work around some borderline cases. It was one of the things I tried on my Polaris card; it didn't solve the problem for me, but that's not to say it couldn't help some folks. Anything's worth a shot when your computer likes to go belly up, though.
    I don't know folks here by name, so when I write help posts I have to assume the lowest common denominator of knowledge in the reader. A very technical post doesn't help as much if only the 10% of readers who can follow it benefit from it.

    The driver versions I used were from 2019, including but not limited to 19.7.3 and 19.8.1. The issue persisted from August to December of 2019, until I finally got fed up and bought an NVIDIA card (and all the crashing stopped). I used DDU and all that, and even tried pulling the NVMe drive, sticking a SATA drive in, and installing a separate copy of the OS to see if that fixed it; it did not. The system was otherwise stable outside of the display driver, and nothing ran out of spec or overclocked, whether in Windows, Linux, or when tested with another graphics card. The system in question was an AM4 X570 with then-current chipset drivers, on (legit) Windows 1903 LTSB/LTSC, with a 3700X processor at the time, if that matters. The other affected systems, run by other users, were mostly Intel Skylake-based high-end and mainstream systems, plus some X370/B450/B350 boards.
    In the end I couldn't afford to lose any more time on it and had no choice but to throw money at it. Replacing the GPU removed the issue, and the computer has been rock solid since. If this were just a gaming PC I would have put up with lowering the core clocks (even though that should be considered 'faulty') to balance out the card running over its rated speed (so you're really not missing anything anyway). But it's more a workstation for game development than a gaming PC (I do game once in a while too - usually not shooty-shooty stuff unless it's a retro FPS; more into simulation and RPGs, so I don't need the highest FPS in the world). I was just out of patience... SO out of patience that I dropped $572 on a graphics card that won't crash, without a second thought.
    These errors are more widespread than people may realize, since folks may assume it's just a defective card and replace it without a second thought. I thought so too, until I tested the card on a Windows 7 system and it was fine. Putting it in any Windows 10 machine reproduced the error sooner or later on MOST systems.
    I wrote to AMD almost a year ago about this, but it didn't get me anywhere. In the end, the Radeon graphics division lost a customer, and I'll be very hesitant to go back.
    The problem was so pronounced last year that an independent computer shop I frequented, near the college in the next town, won't even sell Radeon cards anymore. The owner had way too much trouble with them; almost all were coming back on warranty claims, and the error I described happened to be the most common issue on the Polaris cards.
    A car is only as good as its driver - a video card isn't much different in that regard.
     

  11. One thing I found odd about both of these RX 580s is the default voltage table: it goes up to 1181 at state 5, but the last two states, which run higher clocks, have a lower voltage of 1150.

    [IMG]

    I haven't had any instability though, so I'm not sure if it's a problem. Does anyone know if that voltage table is like that for a reason?
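    As a quick sanity check on that table shape: only the 1181/1150 figures come from my card; the other state values below are made-up placeholders just to give the script a full table. A non-monotonic state/voltage curve is easy to flag:

```shell
# Flag any DPM state whose voltage is lower than the preceding state's.
# Only states 5-7 (1181, 1150, 1150 mV) are real; the rest are placeholders.
volts="750 800 900 1000 1100 1181 1150 1150"
prev=0; i=0
for v in $volts; do
  if [ "$v" -lt "$prev" ]; then
    echo "state $i ($v mV) is below state $((i-1)) ($prev mV)"
  fi
  prev=$v; i=$((i+1))
done
```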

    As for what was causing the instability in the first place (aside from the OCCT issue, which also happens on the 6.2 betas), I'm thinking it was my RAM settings. I spent a few days playing with various timings and found that the XMP timings don't work (3600MHz; I guess my 2700X can't handle it), and that the popular Ryzen DRAM timing calculator gave barely usable numbers. I settled on 3266MHz, ran an overnight memory test, and after over a week of gaming I haven't seen a single GPU driver crash.

    On a different matter: I previously had two RX 580s and an RX 560 from XFX; the drivers were fine, but all 3 cards couldn't handle 4K@60Hz without the signal becoming unstable. I had them in 3 different motherboards, and even in a TB2 eGPU box on a MacBook, with the same instability on all of them. I spent months trying to figure out why and went through several cables, and eventually found that a CVT-RB 4K@60Hz mode worked fine (this was fun to figure out on Linux with Wayland, and required a paid app on macOS). I eventually got a GTX 1060, and it handled the display no problem at the default 4K@60Hz. There are a few posts in other communities that mention 4K instability as well, so I had assumed Polaris as a whole was just flawed.
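    For anyone fighting the same 4K@60 HDMI instability on Linux under X11, the CVT-RB mode can be generated and applied with cvt and xrandr. This is only a sketch: the output name HDMI-A-0 is a placeholder, and you should use the exact timings cvt prints on your own machine rather than the ones shown:

```shell
# Generate a reduced-blanking (CVT-RB) modeline for 3840x2160 @ 60 Hz:
cvt -r 3840 2160 60

# Add a new mode using the timings cvt printed, then assign it to the HDMI
# output (check your output's real name first with: xrandr -q):
xrandr --newmode "3840x2160R" 533.25 3840 3888 3920 4000 2160 2163 2168 2222 +hsync -vsync
xrandr --addmode HDMI-A-0 "3840x2160R"
xrandr --output HDMI-A-0 --mode "3840x2160R"
```

    Reduced blanking lowers the pixel clock versus the standard CVT timing, which is likely why the marginal HDMI link held up with it. On Wayland this has to be done through compositor-specific configuration instead.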

    This SAPPHIRE RX 580 handles my display at the default 4K@60Hz fine most of the time, but rarely I have to hot-plug it when the signal gets unstable. It's a lot better than the XFX GPUs though, and it seems to only affect my display (not VR). If it weren't better, I probably would have returned the GPU and gone NVIDIA. Out of all the AMD GPUs I've had over the years, that's the most problematic issue I've run into.
     
  12. pogostickio Master Guru

    Messages:
    808
    Likes Received:
    67
    GPU:
    DannyD's GTX 1080ti
    Just as a reference, my RX 580 shows the same anomaly of a higher voltage at a lower state, though in my case it's state 4 that shows 1193 at stock. Maybe you're right that it's a bug in the driver. I'm currently on 20.8.3. I don't have the same issues you're experiencing, and I hope you find the solution. The RX 580 is a Sapphire Pulse 8GB.
    GPU 17-09-20.png
     
    Deleted member 282649 likes this.
  13. Undying Ancient Guru

    Messages:
    25,206
    Likes Received:
    12,611
    GPU:
    XFX RX6800XT 16GB
    It seems to be similar for every 580. I didn't have any issues with this card, and the fan profiles were finally fixed in 20.9.1.

    [IMG]
     
    Last edited: Sep 17, 2020
    Deleted member 282649 likes this.
  14. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
  15. Undying Ancient Guru

    Messages:
    25,206
    Likes Received:
    12,611
    GPU:
    XFX RX6800XT 16GB

  16. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
    XFX RX 580 GTS Black Edition
     
  17. Undying Ancient Guru

    Messages:
    25,206
    Likes Received:
    12,611
    GPU:
    XFX RX6800XT 16GB
    With only one 8-pin you could be power limited. My XTR has an 8-pin and a 6-pin.
     
  18. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
    Yes, it seems 8+8 isn't really available anymore. However, I'm not sure what 8+6 would have to do with vcore maxing out at state 4 in the stock BIOS.

    EDIT: Ohhh... maybe my previous post made it sound like I was trying to OC when I said "This is as far as I got with it". To be clear, I only meant that the thread I linked was the only information I found regarding the stock BIOS having a seemingly odd vcore.

    fwiw, I agree it's likely I'm power limited when I'm OC'ing. 1435MHz @ 1.2V is the best it can do.

    Some further postulation: I always thought that maaaaybe the reason for this vcore setting is that that state is used when switching modes or coming out of display sleep, and the card just likes to have the extra juice for those kinds of transitions. But that doesn't explain why my card often displays static/snow for a second or two when waking up... *shrug*
     
    Last edited: Sep 23, 2020
    Undying likes this.
  19. What resolution are you running at, and what connection are you using?

    I had frequent display static and cut-outs on screen wake-ups, and even in general use, on 3 different XFX Polaris GPUs (2 RX 580s, 1 RX 560) when using a 4K@60Hz display over HDMI, regardless of OS (Windows, Linux, and even macOS with an eGPU). With this Sapphire RX 580, it happens a lot less frequently. I had a GTX 1060 laptop that handled the display with no issues.

    I don't believe I did much overclocking at all on the XFX GPUs; the only thing I did was max the power limit (I think +50%). I believe I also had two GTS Black Editions.
     
  20. PieEyedPiper Master Guru

    Messages:
    628
    Likes Received:
    10
    GPU:
    RX 580
    1920x1200 @ 60Hz over DVI.

    I have a TV as a secondary 1080p output (media/movies) that uses up the HDMI port. The TV doesn't have issues when waking or switching modes that I've noticed, but since I'm not running anything at 4K, it wouldn't be apples to apples.
     
