I have a question for Vega 64 owners. I have a Sapphire Nitro+ edition, which I have undervolted, with +50% added to the power limit. Yesterday I had a hard lock while playing Witcher 3: the screen froze on the last game image, the music kept playing, buzzed for a moment and then carried on normally, and eventually the screen turned black and the monitor reported no signal. The whole system locked up, not only the graphics card; for example, my G15 keyboard screen froze too. My question is: how do you recognise whether a crash/lock/BSOD etc. is related to the undervolting or to something else? I would like to know whether the problem happened because of my tampering or whether it was, for example, a driver issue. When the computer booted up again, the profile I was using was not loaded and I had to apply it again. I have since raised the voltage a bit and played the game for another hour without any problem. This is the first time I've had this problem in the past two weeks, but it's also the first time I have played Witcher 3 for an extended period. Before that I played Doom for a few days with the undervolting profile on without any issues. I have GPU-Z running in the background and I have noticed that the power draw sensor occasionally spikes to "unreal" numbers: for example, it fluctuates between 200-240W and then jumps to something like 1500W for a second. I do not know if this is an error in the readings or if the card really is trying to fry itself. I've read that people have similar problems due to a bad PSU; I have an 850W Seasonic unit with a Titanium rating, so I suppose I should be safe in this regard. The card is connected with two separate cables.
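On the power-draw spikes described above: one way to look at them is to separate the readings that sit near the normal range from the isolated outliers. This is a minimal Python sketch, not a diagnosis. The function name `split_power_readings`, the `factor` threshold and the sample values are all assumptions made for illustration; nothing here reads GPU-Z directly, so you would have to feed it numbers from your own log.

```python
from statistics import median


def split_power_readings(watts, factor=3.0):
    """Separate plausible power readings from likely sensor glitches.

    A reading is treated as a glitch when it exceeds `factor` times the
    median of all readings. `factor` is an illustrative guess, not a
    value taken from any Vega or GPU-Z specification.
    """
    if not watts:
        return [], []
    ceiling = factor * median(watts)
    plausible = [w for w in watts if w <= ceiling]
    glitches = [w for w in watts if w > ceiling]
    return plausible, glitches


if __name__ == "__main__":
    # Hypothetical samples in watts, shaped like the readings in the post.
    samples = [212.0, 231.5, 240.0, 1500.0, 205.3, 238.8]
    plausible, glitches = split_power_readings(samples)
    print("highest plausible reading:", max(plausible))
    print("suspected glitches:", glitches)
```

If the outliers are single isolated samples far above anything around them, a sensor misread is the more likely explanation, but this sketch cannot prove that on its own.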
Event Viewer will tell you if the driver crashed or not. Generally, though, driver crashes are due to overclocking or undervolting, and they can also be caused by unstable RAM or CPU overclocks. Your GPU could be stable, but if your RAM is not, that can also cause GPU driver crashes.
I have checked the event log and there was indeed an entry about a driver crash, but it also said the driver was reloaded successfully, yet the screen remained black. Hmm... I played yesterday with the raised voltage and had no issues. I suppose I will have to give it time and see if something happens again.
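For anyone who wants to pick the driver entries out of a long event log, here is a minimal Python sketch that filters already-exported records. It does not talk to Windows itself. The record layout (`source`, `id`, `time` keys) is hypothetical, and the defaults are an assumption: as far as I know, a display driver timeout and recovery is usually logged with source `Display` and event ID 4101, but check that against your own log.

```python
def driver_timeout_events(events, source="Display", event_id=4101):
    """Return the records that look like display-driver timeout entries.

    `events` is a list of dicts with hypothetical keys "source" and "id".
    The default source and event ID are assumptions; adjust them to match
    what your own Event Viewer actually shows.
    """
    return [
        e for e in events
        if e.get("source") == source and e.get("id") == event_id
    ]


if __name__ == "__main__":
    # Hypothetical exported records, for illustration only.
    exported = [
        {"source": "Display", "id": 4101, "time": "21:14:03"},
        {"source": "Kernel-Power", "id": 41, "time": "21:20:47"},
        {"source": "Service Control Manager", "id": 7036, "time": "21:21:10"},
    ]
    for event in driver_timeout_events(exported):
        print(event["time"], event["source"], event["id"])
```

A matching entry only tells you the driver reset; it does not say whether the undervolt, the driver itself, or something else triggered it.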
I suspect your PSU. How old is it? Most GPUs are built very well unless you have a bad no-name brand or an abused one. I only had (and still have, just not installed) unlocked 6950s in CrossFire, and they ate power like crazy with the power limit set high. I could feel the heat coming off the power cables during hard gaming (open case), so they pull hard on the PSU. Plus it's still hot out, so try to keep your PSU cool as well.
I have a brand new Seasonic 850W with a Titanium rating. I do not think that it was caused by the PSU. I found the error about the driver crash, so I suppose the undervolting or a driver problem was the cause. I have raised the voltage a bit and will test further to see if it happens again.
If the driver crashes, that means either something is wrong with the driver or the card is unstable at the current clocks/volts. Seeing as you undervolted, it seems pretty clear-cut.
I'm new to this, so I'm trying to understand how things work and how I can detect the source of such problems. I have never undervolted or overclocked before. Well, at least not in the past 10 years. Vega is extraordinary in this regard, as you can get very good performance with reduced power consumption for very little effort. I researched this for a very long time, and with the recent price cuts I decided to buy AMD again after so many years of running Nvidia, and to enjoy my FreeSync monitor to its fullest potential at last. So far I love the card, at least the Sapphire N+ I have. I cannot comment on other brands.
Make sure you do not touch any of the memory clocks/voltages; most (if not all) aftermarket cards have Hynix HBM chips that do not like OC/undervolting.
Let me correct that for you: some have Hynix, but not all do, so you have to verify which HBM memory you have. For example, all PowerColor Red Devil Vega 64 cards have Samsung HBM, and if I remember correctly some of the PowerColor Vega 56 cards also have Samsung HBM.
GPU-Z identifies the memory as Samsung. I cannot say whether the detection is correct or not, but I have undervolted and overclocked the memory and it is stable so far. I have upped the voltage a bit after the "incident", though.
There are reports on Reddit of Red Devils having Hynix chips, so there could be a few unlucky people there. Anyway, it's still a silicon lottery regarding OC/UV.
Only on the PowerColor Vega 56. It depends on the card: some have Hynix and some have Samsung, but all PowerColor Vega 64 cards have Samsung. But anyway, back to what you're asking: it is hard to tell at times. I can get anything from program crashes and hard program freezes to a driver crash that recovers, or just a plain old restart. I have had all of these from undervolting, overvolting, memory overclocks, core overclocks, or a combination, as well as from slightly unstable RAM clocks/timings and/or an unstable CPU overclock.
Do you have your CPU or memory OC'ed? Also, do you use one PEG cable with both connectors, or two separate ones for the graphics card?
WattMan is really buggy! As a user you won't know when your settings reset. Sometimes PCI devices, the GPU, etc. call config files in kernel libs, so the driver trips up, or a power-saving feature or locked BIOS messes things up. I would remove any third-party tools, run DDU in safe mode, and reinstall the driver so it loads default settings; then apply the default voltages as manual settings and restart the PC. After that, change to auto voltage or set your own. Then always just reset WattMan, reapply the default manual voltage and restart the PC. Your settings are unstable somewhere. A buggy software OC is only good for finding a stable OC that you can at some point apply through the BIOS, hopefully without bricking the card so badly that it can't be reflashed, and unfortunately AMD's BIOS editor and flasher are not for casual use. See also https://forums.guru3d.com/threads/overdriventool-tool-for-amd-gpus.416116/
If it is not your PSU, then I recommend you reset the GPU to stock, don't OC it, and do hard game testing with only the CPU + RAM OC for a while. If that is stable, reset the CPU and RAM to stock, then OC only the GPU and game test some more. Also check your MB chipset heatsink temps; the MB VRMs might not be getting enough airflow, for example. Then you can at least narrow down where the problem may be coming from. You need to isolate it as much as possible. An OC is easy to set and hard to stabilize, even more so with many components pushed to their limits. Same for cars and drugs.
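The isolation approach above can be written down as a small bookkeeping helper. This is a rough Python sketch under two loud assumptions: there is a single culprit, and a run that stayed stable really clears every component that was tweaked during it (which is only as trustworthy as the length of your test). The function name `narrow_suspects` and the component labels are made up for illustration.

```python
def narrow_suspects(runs):
    """Narrow down which tweaked component is the likely cause of crashes.

    `runs` is a list of (tweaked_components, was_stable) pairs, where
    `tweaked_components` is a set of labels such as {"cpu", "ram"}.
    Assumes a single culprit: it must appear in every crashed run and
    in no stable run. An empty result means either that nothing has
    crashed yet or that no tweak explains the crashes (for example a
    PSU, driver or temperature problem).
    """
    cleared = set()
    suspects = None
    for tweaked, was_stable in runs:
        if was_stable:
            cleared |= set(tweaked)
        elif suspects is None:
            suspects = set(tweaked)
        else:
            suspects &= set(tweaked)
    if suspects is None:
        return set()
    return suspects - cleared


if __name__ == "__main__":
    # Hypothetical test log following the order suggested in the post.
    log = [
        ({"cpu", "ram"}, True),   # GPU at stock, CPU + RAM OC: stable
        ({"gpu"}, False),         # CPU + RAM at stock, GPU tweaked: crashed
    ]
    print(narrow_suspects(log))
```

With more than one unstable component the single-culprit assumption breaks down, so treat the output as a hint for what to test next, not as a verdict.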
Set your settings like mine (but keep your own GPU/HBM MHz) and then give it a go. You can edit the WattMan save file, as well as the OC tools' files, in Notepad/Notepad++, e.g. to get the fan curve you need, etc. For me WattMan and ReLive are good and stable additional software (I know that for some it's bloatware).