System Stability Problem for Three Months.

Discussion in 'General Hardware' started by HellSpawn, Aug 30, 2008.

  1. HellSpawn

    HellSpawn Member

    Messages:
    47
    Likes Received:
    0
    GPU:
    HIS ATi Radeon X1900 XT@X
    Hey Guys,

    I am having one big problem, and it is driving me insane. I've been trying to figure out what is wrong with my system for the past month or two.

    Before I describe the problem, here are the system components:

    Opteron 165
    DFI LanParty UT CFX3200 DR/G
    4 x 1GB A-DATA Vitesta DDR 400 @ 3-4-4-8@2T, 2.65v
    MSI Radeon HD 2600 XT Diamond
    OCZ 600W PSU
    4x 160GB HD SATA [one of them uses IDE to SATA converter]

    Here is my dilemma. I use this computer as my server, mainly for backups. I overclocked it a little: the CPU ran at 2200 MHz [9x245] and the RAM at 200 MHz [166 MHz divider]. For 3 months the RAM voltage was set to 2.8v in the BIOS, with timings of 2.5-3-3-8@2T. The HT link ran fine at 1225 MHz. All other voltages were left at their default values. Temperatures were mostly acceptable: the CPU ran between 50 and 60 C, and the RD580 between 30 and 40 C.
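
    For reference, the clock arithmetic behind those settings can be sketched as below. This is a minimal sketch, assuming the usual Athlon 64/Opteron (K8) scheme in which the memory clock is the CPU clock divided by a whole-number divider derived from the stock 200 MHz reference. The function name `k8_clocks`, the 5x HT multiplier (inferred from 1225 / 245), and the 166.67 MHz nominal value for the "166" divider are assumptions, not values read from the BIOS.

```python
import math


def k8_clocks(ref_mhz, cpu_mult, ht_mult, mem_setting_mhz):
    """Derive CPU, HT link, and memory clocks from BIOS-style settings.

    Assumption: the memory divider is a whole number chosen from the
    stock 200 MHz reference clock and rounded up, as on K8 CPUs.
    """
    cpu_mhz = ref_mhz * cpu_mult
    ht_mhz = ref_mhz * ht_mult
    # Whole-number divider, based on stock reference and the memory setting.
    divider = math.ceil(cpu_mult * 200 / mem_setting_mhz)
    mem_mhz = cpu_mhz / divider
    return cpu_mhz, ht_mhz, divider, mem_mhz


# Overclocked settings from the post: 9x245, HT 5x (inferred), "166" divider.
cpu, ht, div, mem = k8_clocks(245, 9, 5, 166.67)
print(f"CPU {cpu} MHz, HT {ht} MHz, divider {div}, memory {mem:.1f} MHz")
```

    With the overclocked settings this gives a 2205 MHz CPU, a 1225 MHz HT link, and a divider of 11, which puts the memory at roughly 200 MHz, matching the figures quoted in the post.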

    Two or three months ago I noticed that the system locks up after about a day of use. It runs F@H 24/7, with one single-core CPU client and one GPUv2 client. After 24~28 hours of that, the system just locks up. I thought the voltages might be too low or too high, so I raised and lowered the CPU, NB, and HT voltages, testing each value for a day. I even decreased the HT link multiplier to 4x and then 3x. I have also tried leaving all values on Auto.

    Then I set everything back to its default values: CPU back to 1.8GHz, memory back to 3-4-4-8@2T, and voltages at default. I ran the F@H clients again, and 24-25 hours later the system locked up.

    Now I've realized something that I should have noticed a while ago. When I set the RAM voltage to 2.6v and checked the actual value in the BIOS, it read 2.75v. Then I checked with it set to 2.8v, and the BIOS read 2.95v. So now I am wondering: were the RAM modules running at 2.95v all this time?
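
    The gap between the set and the reported RAM voltage can be worked out in a few lines. This is only a sketch using the two setting/reading pairs quoted above; the helper name `overvolt_offsets` is made up for illustration, and the BIOS hardware-monitor reading itself may not be accurate.

```python
# (BIOS setting, BIOS hardware-monitor reading) pairs from the post, in volts.
readings = [(2.60, 2.75), (2.80, 2.95)]


def overvolt_offsets(pairs):
    """Return reading minus setting for each pair, rounded to 0.01 V."""
    return [round(actual - requested, 2) for requested, actual in pairs]


offsets = overvolt_offsets(readings)
print(offsets)
```

    Both pairs come out 0.15 V above the requested value, which suggests a constant offset rather than a one-off misreading.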

    Then I ran Memtest for 24~26 hours, using the full test, with no lockups and no errors at all. I also installed Windows 2008, and before installing the OS I ran the Memory Test in its Extended mode, which took nearly an hour. It passed without a problem. I've also run Orthos for 24-25 hours. Nothing.

    Now I am wondering what is wrong with the system. I think I've tried everything over the past 2-3 months. What else could I try? What did I miss? What might be the problem?

    I am going to try unplugging two of my modules and letting it run F@H for a couple of days. But this is driving me nuts. Do you have any suggestions?
     
  2. HellSpawn

    I realized that I forgot to mention one little detail.

    I've always tried to run the memory at 200MHz, so I chose the dividers accordingly. Now I have set the memory divider to Auto. It is running at DDR333, 2.5-4-4-12@2T. Let's see if it freezes or not.

    I am still open to suggestions. I cannot figure this out. I am pretty sure you guys have some ideas but are not willing to write them out. Laziness?
     
  3. Psychlone

    Psychlone Ancient Guru

    Messages:
    3,686
    Likes Received:
    2
    GPU:
    Radeon HD5970 Engineering
    Your math all figures correctly, so the only thing that I can add here is temperatures.
    Most motherboards have a thermal threshold limit that will shut down the PC if the CPU reaches that temperature. For some motherboards there is an actual value that can be set in BIOS, and for other motherboards, the value is set at 65-68*C...so if your Opty 165 is reaching up into the high 60's, it could be the thermal threshold kicking in and shutting it down.

    If you've cleared Memtest86+ on each individual stick, and you've passed 24 hours of Orthos blend at priority 10, then you have a stable system - and it has to be something else.

    Ever thought of looking through your processes and seeing if you recognize them all? How about some software or service that's set to shut down a component after a certain period of time? Hibernation disabled? HDDs set to Sleep?

    Sounds like you have a solid overclock, so I wouldn't change anything there - with the exception of your HT (most motherboards aren't stable at much over 1000MHz - in fact, the A8R32-MVP is the only one that will hit 1500MHz stable) - but again, passing a 24 hour run of Orthos is sufficient to pretty much guarantee stability at your current speeds.

    The last hardware thing that you could check is your PSU - download OCCT and run that...it will give you graphs at the end that show the 12V, 3.3V and 5V rails. If there's much fluctuation over the course of an hour, then you know the PSU is beginning to fail (we're talking +/- 2% ripple as per those charts)
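
    A rough way to apply that rule of thumb to logged rail voltages is sketched below. The function name `rail_report` and the sample readings are hypothetical (they are not from this thread), and the 2% threshold is just the figure mentioned above; the ATX specification itself allows +/- 5% on the positive rails, and software sensor readings are only approximate.

```python
# Nominal voltages for the three main positive rails.
NOMINAL_RAILS = {"12V": 12.0, "5V": 5.0, "3.3V": 3.3}


def rail_report(samples, tolerance_pct=2.0):
    """Summarize logged rail voltages.

    samples: dict mapping rail name -> list of voltage readings.
    Returns, per rail, the worst deviation from nominal (in percent),
    the max-min swing (in volts), and whether it stays within tolerance.
    """
    report = {}
    for rail, values in samples.items():
        nominal = NOMINAL_RAILS[rail]
        worst_pct = 100.0 * max(abs(v - nominal) for v in values) / nominal
        report[rail] = {
            "worst_deviation_pct": round(worst_pct, 2),
            "swing_v": round(max(values) - min(values), 3),
            "within_tolerance": worst_pct <= tolerance_pct,
        }
    return report


# Made-up readings for illustration only.
example = {
    "12V": [12.10, 12.04, 11.71, 12.08],
    "5V": [5.02, 5.03, 5.01, 5.02],
    "3.3V": [3.31, 3.30, 3.29, 3.31],
}
report = rail_report(example)
for rail, stats in report.items():
    print(rail, stats)
```

    In this made-up data only the 12V rail dips more than 2% from nominal, so it would be the one to look at first.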

    I'd be for checking those other variables - all in the OS or other software itself...you'll probably find your answer there.

    It's too bad that the system you're talking about is only set up as your server - I have an Opty 165 that's been running at 3.2GHz (1.375V), 583MHz RAM and 1424MHz HT for over 2 years now...excellent setup if you ask me, and perhaps a bit overkill to overclock it for actual server purposes.

    Good luck!

    Psychlone
     
  4. HellSpawn

    Thank you for your response, Psychlone.

    Currently I have a fresh install of Windows 2008 with all the latest drivers. I do not have any other software installed yet.

    I've made sure the shutdown temperature is set to 80 C, so I do not think that is the problem, and I've made sure that the power options are set to their maximum values. It is weird that this happens after 24-40 hours of 100% load.

    I thought HT might be the reason, but I set it below 1000MHz. So I think that rules it out as well?

    One more update: the system failed again with the memory divider on Auto. It froze after nearly 40 hours of use. I have unplugged two modules and am running F@H now. Next, I am going to run OCCT with everything set to Auto in the BIOS.

    I guess I am not as lucky as you are. My CPU cannot go beyond 2.6GHz with 4 modules. Having four modules holds me back a bit; it forces me to run them at DDR333 speeds with a 2T command rate. BTW, I am very jealous of your CPU.

    I overclocked as far as I could at the default voltage. I use it to encode my homemade videos and DVDs for my iPod touch and Creative Zen, so I will try my best to get the highest frequency possible at the lowest temperature :).

    UPDATE:

    I plugged all 4 modules back in, set the BIOS to its Default Performance settings, and then enabled Memory Hole and HW Mapping so that it can detect 4 GB. I've set up a 24-hour OCCT run. I will post the results here if it finishes successfully.
     
    Last edited: Sep 2, 2008

  5. Psychlone

    And you're *sure* that your temps aren't reaching up to AMD's thermal threshold of 68*C?? Cause even if your board is set to 80*C, there's a good chance that it may shut down anyway...and another thing - voltages...Opteron CPUs *hate* high voltage - you either have a good overclocking chip that takes very little voltage increase (in my case, .025V above stock for 3.2GHz), or you don't...adding extra voltage does these server chips no good at all. Maybe try setting it back down a bit to where it's stable using 1.35V (only you know where that MHz is...) and see if that helps the situation, if not, move on to the below suggestions.

    About the components set to shut down to save power: your Power Settings don't really cover everything the way they should (well, I can't say that for sure since I have no experience with Windows 2008 - but assuming that it's almost the same as any Vista or XP install, the same applies)
    Go into your Device Mangler (erm...Manager) ;) And select each component in the menu there - check for a Power Management tab (not everything is going to have this specialized tab, but some will) - and ensure that 'Allow the computer to turn off this device to save power' is NOT ticked...you'll need to go through every device available in the Device Manager to ensure one of them isn't shutting down because of this single setting.

    If you've set your HT to 1000MHz or below - that's out of the equation and won't pose any issues at all, so I wouldn't worry about your HT any more.

    Setting AUTO on several options in BIOS, regardless of the BIOS or the motherboard, will sometimes lead to instability...try setting everything to a known stable variable (and I mean EVERYTHING) and see if that helps out the issue (on my A8R32-MVP Deluxe, there wasn't a single BIOS option that wasn't set manually - literally *nothing* was left on AUTO)

    On a side note, overclocking with all 4 banks populated is always going to leave you with less of an overclock...the memory controller can't push the throughput to 4 banks that it can through 2...but we probably shouldn't be talking about overclocking here - aside from dropping your overclock a few notches.
    By all rights, you should have a completely stable system if it's passing all these stability tests - something else is holding you up...did you happen to check the PSU rails using OCCT yet?? I'd be interested in seeing if one or more of your rails is fluctuating more than the manufacturer intended, which could be the entire problem.

    Good luck!

    Psychlone
     
  6. HellSpawn

    I would like it to be fully stable at the default settings before I OC again.

    You've got a point about leaving things on Auto. I had never left the settings on Auto before. I thought I might have set something wrong, so that's why I left them on Auto today.

    I went through everything in the Device Manager and unchecked every "Allow the computer to turn off this device to save power" box. The only devices that had the option were the Ethernet adapters.

    Right now I have all four modules plugged in, running at their stock values, which happen to be 2.5-4-4-8@2T at 2.5v. I know that heat is not a factor because I have an Antec P182 case with a Thermalright XP-120 heatsink and a 120mm fan with good CFM. The Opty 165 is currently running 9x200 @ 1.3v with 5x HT. It is also sitting right next to the window. The BIOS reads the temperature at around 40 C, but HWMonitor and OCCT read around 55 C at 100% load.

    [IMG]

    Here is a screenshot from 5 minutes before this post. I will post the graphs as soon as the run is finished.

    Update: Dumb move on my end. I accidentally launched HWMonitor, which caused a "false" CPU temperature reading, so the test stopped. I am running a 1-hour test to get the graphs. Then I will let it run for 24 hours.
     
    Last edited: Sep 2, 2008
  7. HellSpawn

    Here are some of the graphs:

    [IMG: seven OCCT graphs]

    I have now started the OCCT infinite test. Let's see if it freezes up on me. I am open to suggestions about testing these modules or anything related. I have all four of them plugged in. I can post screenshots of the BIOS if you guys need more info.
     
  8. HellSpawn

    After 17 hours of operation, including 12 hrs 32 mins and 7 seconds of the OCCT run, the computer froze up again... phs..

    I do not know what to try. Should I take two modules out and try again, or should I set all the options in the BIOS to their proper values? :(
     
