Another look at HPET High Precision Event Timer

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by Bukkake, Sep 18, 2012.

  1. HeavyHemi

    HeavyHemi Ancient Guru

    Messages:
    6,954
    Likes Received:
    960
    GPU:
    GTX1080Ti

    Wow...are you really that thick? I specifically said it was the FIRST POST IN THIS THREAD. The thread we are in now. Here's the link, post one: https://forums.guru3d.com/threads/another-look-at-hpet-high-precision-event-timer.368604/
    Holy cow. WTF are you even babbling at me with? Calm down and read for content. Like I said, this thread is a running gag that keeps recycling every so often because nobody reads back more than a page, if that. 8 friggen years of the same crap over and over. Derp.
     
  2. aufkrawall2

    aufkrawall2 Ancient Guru

    Messages:
    1,811
    Likes Received:
    490
    GPU:
    3060 TUF
    Some components like Explorer definitely got more sluggish with 1903 vs. 1809, but that doesn't mean there is a general performance degradation across applications.
     
  3. Groot

    Groot Member

    Messages:
    26
    Likes Received:
    4
    GPU:
    GTX 1080
  4. mbk1969

    mbk1969 Ancient Guru

    Messages:
    12,469
    Likes Received:
    10,596
    GPU:
    GF RTX 3060TI
    Funny thing is, the assembly code in both screenshots is completely the same.
     

  5. Groot

    Groot Member

    Messages:
    26
    Likes Received:
    4
    GPU:
    GTX 1080
    :D Just exercising his right to be human but it does show the new way.
     
    Last edited: Mar 22, 2020
  6. BetA

    BetA Ancient Guru

    Messages:
    4,404
    Likes Received:
    348
    GPU:
    G1-GTX980@1400Mhz
    yeah, i know, he did say:

    I'm sure if I ask him he could give me some more information on this.

    Best Regards
     
  7. mbk1969

    mbk1969 Ancient Guru

    Messages:
    12,469
    Likes Received:
    10,596
    GPU:
    GF RTX 3060TI
    He described the difference in words, so we can trust him. I am just mildly curious (and too lazy to disassemble the code myself).
     
  8. Groot

    Groot Member

    Messages:
    26
    Likes Received:
    4
    GPU:
    GTX 1080
    Before, 1607
    Code:
    Old way, TSC divide by 1024
    
            mov     r11, [7FFE03B8H]   ; qpcbias
            rdtsc                      ; Read TSC to EDX:EAX
            shl     rdx, 32
            or      rdx, rax           ; EDX:EAX to RDX
    ;===============================
            lea     rax, [rdx+r11]     ; rax = tsc + bias
            mov     cl, [7FFE03C7H]    ; (10 for me)
            shr     rax, cl            ; divide by 1024
            mov     [QPC], rax         ; store result
    
    1903
    Code:
    New way, convert TSC to 10MHz
    
            mov     r11, [7FFE03B8H]   ; qpcbias
            rdtscp                     ; Read TSC to EDX:EAX (also loads IA32_TSC_AUX into ECX)
            shl     rdx, 32
            or      rdx, rax           ; EDX:EAX to RDX
    ;-----------------------------
            mov     rax, [r9+8H]       ; Magic Number, (10000000 * 2^64) / TSC Frequency
            mov     rcx, [r9+10H]      ; Offset (zero for me)
            mul     rdx                ; RDX:RAX = RAX * RDX; high half (RDX) is TSC converted to 10MHz
            add     rdx, rcx           ; Apply offset (none for me)
    ;-----------------------------
            lea     rax, [rdx+r11]     ; rax = tsc + bias
            mov     cl, [7FFE03C7H]    ; (0 for me)
            shr     rax, cl            ; zero shift
            mov     [QPC], rax         ; store result
     
    I've left out some conditional code and renamed things to make the comparison simpler. There are no serializing instructions in the earlier code, but that may matter less since the TSC resolution is being cut so much. A 32-bit OS would be somewhat more convoluted because of the multiply.
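
    For anyone who would rather follow the arithmetic than the assembly, here is a minimal Python sketch of the two paths above. The function and parameter names are made up for illustration, the 3.0 GHz TSC frequency is only an example value, and the conditional code is left out just as in the listings, so treat it as a model of the math rather than of what Windows actually executes.

```python
MASK64 = (1 << 64) - 1  # emulate 64-bit register wraparound


def qpc_old(tsc, bias=0, shift=10):
    """Old path: (tsc + bias) >> shift, i.e. TSC divided by 2**shift."""
    return ((tsc + bias) & MASK64) >> shift


def qpc_new(tsc, magic, offset=0, bias=0, shift=0):
    """New path: scale TSC by a 64-bit fixed-point factor, then add offset and bias."""
    scaled = (tsc * magic) >> 64  # high half of the 128-bit product (RDX after MUL)
    return ((scaled + offset + bias) & MASK64) >> shift


# Example values only: a 3.0 GHz TSC that has counted 6,000,000,000 ticks (2 seconds).
TSC_HZ = 3_000_000_000
magic = (10_000_000 << 64) // TSC_HZ  # (10000000 * 2^64) / TSC frequency

print(qpc_old(6_000_000_000))         # 5859375, counting at 3 GHz / 1024
print(qpc_new(6_000_000_000, magic))  # 19999999, counting at 10 MHz
```

    With a shift of 10 the old counter on this example machine ticks at about 2.93 MHz, while the new one always ticks at 10 MHz regardless of the TSC frequency.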

    Hope it helps.
     
    Nastya, BetA and mbk1969 like this.
  9. Smough

    Smough Master Guru

    Messages:
    825
    Likes Received:
    195
    GPU:
    GTX 1660
    So is it "better" or just the same? Or slower? Also, 1709 and 1803 use the old way; it's from 1809 onwards that the QPF is 10 MHz.
     
    Last edited: Mar 23, 2020
  10. janos666

    janos666 Maha Guru

    Messages:
    1,125
    Likes Received:
    202
    GPU:
    MSI RTX3080 10Gb
    Same hardware source, probably mostly the same code, roughly 2-3 times higher effective frequency... Why would you assume it to be worse?
     

  11. Smough

    Smough Master Guru

    Messages:
    825
    Likes Received:
    195
    GPU:
    GTX 1660
    Well, the QPF being higher in theory leads to more latency. Also, why was it raised in the first place? If it's for security reasons or whatever, then the user should still have the right to choose, not have it shoved down their throat. Since this Spectre & Meltdown obsession started, it seems like you must accept this security even if you may want to get rid of it.
     
  12. mbk1969

    mbk1969 Ancient Guru

    Messages:
    12,469
    Likes Received:
    10,596
    GPU:
    GF RTX 3060TI

    One note. Can the value of (10000000 * 2^64) be stored in a 64-bit register? 10000000 left-shifted by 64 bits will leave only zeros, imo.

    PS Second note. This is the code for QPC. And I was questioning the code of QPF.
     
    Last edited: Mar 23, 2020
  13. mbk1969

    mbk1969 Ancient Guru

    Messages:
    12,469
    Likes Received:
    10,596
    GPU:
    GF RTX 3060TI
    It was for time maintenance (synchronization) reasons. There was a link here or in the stand-by memory fix thread.
    Anyway, if it is still TSC then there is nothing to worry about. A small increase in QPF can't cause huge problems in existing code.
     
  14. Groot

    Groot Member

    Messages:
    26
    Likes Received:
    4
    GPU:
    GTX 1080
    One would have to test it.

    It's not the frequency but the time taken to get a result.

    Yes and yes. 64-bit integer division (DIV) takes a 128-bit dividend, with RDX as the upper 64 bits and RAX as the lower 64 bits. Integer-dividing 10,000,000 by the TSC frequency alone will usually give zero, so we multiply 10,000,000 by 2^64, which in this case simply means putting it in RDX (with RAX zeroed). It also means we don't have to divide the later product by 2^64 (SHR 64); we just take it straight from RDX instead.

    QPF is just a hard coded value, no calculation done. Could easily be something else if wanted.
    Maybe something like
    Code:
            mov     rcx,TSCF           ; Time Stamp Counter Frequency
            mov     rdx,10000000       ; The 10MHz QPF MS wants
            xor     eax,eax            ;
            div     rcx                ;
            mov     [MagicNumber],rax  ; Store the result in HalT and share for use by QPC
                                       ; note, maybe some adjustment for rounding or not?
    
                                       ; Example, if TSCF = 3.0GHz then result is 0xDA740DA740DA74
    
                                       ; If TSC reads 6,000,000,000 then QPC adjusted would be
            mov     rcx,6000000000     ; TSC
            mov     rax,[MagicNumber]  ; 0xDA740DA740DA74
            mul     rcx                ; rdx = 19,999,999
                                       ; as QPC = QPF * TSC / TSCF
                                       ; 10,000,000 * 6,000,000,000 / 3,000,000,000 = 20,000,000
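
    The example numbers in the listing above can be checked with arbitrary-precision integers. This is a small Python sketch, not anything taken from Windows: the rounded-up variant (magic_ceil) is purely hypothetical and only there to illustrate the "adjustment for rounding" remark, and the variable names are made up.

```python
QPF = 10_000_000        # target QPC frequency from the post
TSC_HZ = 3_000_000_000  # example TSC frequency (3.0 GHz)

# DIV with RDX = QPF and RAX = 0 divides the 128-bit value QPF * 2^64.
magic_floor, remainder = divmod(QPF << 64, TSC_HZ)

# Hypothetical rounded-up variant, only to show the effect of rounding.
magic_ceil = magic_floor + (1 if remainder else 0)


def scale(tsc, magic):
    """Return the high 64 bits of tsc * magic (what MUL leaves in RDX)."""
    return (tsc * magic) >> 64


print(hex(magic_floor))                   # 0xda740da740da74
print(scale(6_000_000_000, magic_floor))  # 19999999
print(scale(6_000_000_000, magic_ceil))   # 20000000
```

    The truncated factor lands one count short of the exact 20,000,000 in this example, which is the 19,999,999 shown in the listing; rounding the factor up recovers the exact figure here, though it could in principle overshoot by one count in other cases.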
    
     
  15. mbk1969

    mbk1969 Ancient Guru

    Messages:
    12,469
    Likes Received:
    10,596
    GPU:
    GF RTX 3060TI
    I suspect you are talking about different latencies. You meant that, due to changes in the latest Win10 builds, the QueryPerformanceCounter function itself takes a bit longer to execute (in the TSC code path). I don't really get what latency Smough meant, because he mentioned only the QueryPerformanceFrequency function, as if the new result of 10 MHz would increase latency compared to the old 3.something MHz.
     

  16. Groot

    Groot Member

    Messages:
    26
    Likes Received:
    4
    GPU:
    GTX 1080
    Some QPC results with Haswell and TSC

    Code:
               QPC Calls/s   |    Relative
                 Millions    |   Speed to W7
    W7  SP1         192      |      100%
    W10 1703        192      |      100%
    W10 1709        118      |       61%
    W10 1903         87      |       45%     
    It seems 1709 introduced serializing for the TSC read, and calling RtlQPC directly runs a little quicker than QPC, at 122 million calls per second. Results may vary with different HW / code paths and cache retention. For comparison, HPET runs at around 1.67 million calls per second over a single I/O path, while TSC can be read concurrently on every thread. Better not let them get out of sync, though.
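
    To put those rates in terms of time per call, here is a small Python sketch that simply converts the figures quoted above (millions of calls per second) into nanoseconds per call and a percentage of the Windows 7 figure. Nothing is measured here; the dictionary and function names are made up for the example.

```python
# Figures quoted in the post, in millions of calls per second.
calls_millions = {
    "W7 SP1": 192,
    "W10 1703": 192,
    "W10 1709": 118,
    "W10 1903": 87,
    "HPET": 1.67,
}


def ns_per_call(millions_per_second):
    """1e9 ns per second divided by (millions * 1e6) calls per second."""
    return 1_000 / millions_per_second


baseline = calls_millions["W7 SP1"]
for name, rate in calls_millions.items():
    print(f"{name:9s} {ns_per_call(rate):7.1f} ns/call  {100 * rate / baseline:5.1f}% of W7")
```

    On these numbers the TSC path goes from roughly 5 ns per call to roughly 11.5 ns per call, while an HPET read costs around 600 ns.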
     
  17. Marctraider

    Marctraider Member

    Messages:
    18
    Likes Received:
    6
    GPU:
    670GTX
    Seems everyone is overthinking this, and Fr33thy doesn't seem to know what forcing the maximum 0.5 ms timer resolution does. Some of his explanations have merit, but I'm also seeing lower CPU-Z scores when changing to useplatformtick yes (5% ish). Could also be a weird thing on the CPU-Z side, though.

    Forcing the maximum timer resolution can actually slightly decrease throughput (obviously, but probably within margin of error on high-end systems), but it increases the granularity of a lot of things that depend on the timer resolution.

    Most prominently in-game fps limiters. So let's say you have a 120 Hz screen and are trying to run consistently at 240 fps (with an in-game cap): you will get much closer to that 240 fps target due to the increased precision/granularity the timer gives, resulting in less tearing and smoother frame pacing. I observed this behavior in both CSGO and various CryEngine games.
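
    A deliberately simplified Python model of that effect: assume a limiter that can only wait in whole timer ticks, so each frame's wait is rounded up to the next tick. Real limiters often mix sleeping with busy-waiting, so the numbers below are illustrative rather than measurements, and the function name is made up for the example.

```python
import math


def capped_fps(target_fps, timer_resolution_ms):
    """Frame rate if each frame's wait is rounded up to a whole number of timer ticks."""
    frame_ms = 1000.0 / target_fps
    ticks = math.ceil(frame_ms / timer_resolution_ms)
    return 1000.0 / (ticks * timer_resolution_ms)


# Default 15.625 ms tick, then 1.0 ms and 0.5 ms timer resolution.
for res in (15.625, 1.0, 0.5):
    print(f"{res:6.3f} ms timer -> {capped_fps(240, res):6.1f} fps")
```

    In this model a 240 fps cap comes out at 200 fps with a 1.0 ms timer and about 222 fps with a 0.5 ms timer, which is the "closer to target" behavior described above.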

    But I don't see how this actually improves FPS, even in cpu-bound games, unless somehow they are really badly programmed and rely on timer tick for their speed.

    Also, he advises using disabledynamictick yes. So far all my tests have shown slightly worse results across the board, and again lowered throughput in games/benchmarks.

    Soon I'll have a new Z390 board installed with custom bios with HPET exposed, will be interesting to see if I can reproduce his findings.
     
  18. X7007

    X7007 Ancient Guru

    Messages:
    1,736
    Likes Received:
    51
    GPU:
    Sapphire 6900XT
    Can you say what exactly you saw with disabledynamictick yes?
     
  19. Marctraider

    Marctraider Member

    Messages:
    18
    Likes Received:
    6
    GPU:
    670GTX
    I see decreased throughput in CPU-heavy benchmarking. It's only a percent or two at most, but it is always negative, never positive. And this is to be expected: look at Linux, for instance, where you can compile the kernel with a 1000 Hz tick rate versus a lower tick rate or tickless, and there is a difference between throughput and, I guess, 'deterministic' latency. It won't actually improve input latency in games, for example. As for mouse polling results, I have never seen them get better due to this tweak either. Last but not least, dynamic ticks have been implemented since Windows 8, and who knows how much software and how many drivers are now written with this mode in mind. Going back to 'legacy' mode could actually do more harm than good.

    On a completely different note, some have argued it is better to leave HPET ON in the BIOS and disable it in Windows. It looks like this is ONLY the case when you use 'useplatformtick yes'. If you're not using that option, you won't see a regression keeping HPET off in the BIOS.

    But why use useplatformtick yes anyway? It causes stuttering in certain conditions, and I've seen consistently lower single-threaded benchmark scores in CPU-Z with useplatformtick yes at 0.5/1.0 ms Windows timer resolution. Not to mention another bug that makes windows move jerkily once music is playing, because with this option on it appears to adjust the Windows timer resolution dynamically on the fly instead of lowering it to a fixed value while an audio stream is playing.

    Really can't recommend doing this just because a tool shows 0.500/1.000 instead of 0.496/0.997.
    It is a debug command for a reason.
     
    Last edited: Apr 25, 2020
    enkoo1, aufkrawall2 and artina90 like this.
  20. loopy7

    loopy7 New Member

    Messages:
    8
    Likes Received:
    4
    GPU:
    1070
    I completely agree, and have attached an image I have posted elsewhere that mirrors your findings. That's not to say it will work exactly the same for everyone, but it does show that following a guide step-by-step can make things worse, especially if you don't know which step caused the performance reduction.
     


    enkoo1, Marctraider and artina90 like this.
