MSI AB / RTSS development news thread

Discussion in 'MSI AfterBurner Application Development Forum' started by Unwinder, Feb 20, 2017.

  1. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    Fantastic to see my programming suggestion implemented. Raster synchronized frame buffer flipping!

    I'll need to rewrite the Low-Lag VSYNC ON HOWTO eventually, because this probably eliminates the need for the "frame rate differential".

    I suspect there may be a system-dependent variable (e.g. slower systems unable to do it properly), but that's an exercise left to the intrepid tweakers of the various forums to discover (including Blur Busters).

    It may take a while before someone volunteers to start extensively testing this with a 1000fps camera -- but I'll raise awareness now with a news post telling everyone about the new RTSS feature. Keep an eye out for it on the front page of Blur Busters.

    P.S. WE NEED 1000FPS TESTS ON THESE NEW SETTINGS
    Since benchmarking these advanced (bleeding-edge) settings is usually more technical than what Guru3D normally covers in the news... if any of you have a 1000fps camera and want to write an article for Blur Busters, or a co-teamed Guru3D/Blur Busters article -- I'm open to ideas. I'd pitch in funds if needed to incentivize more freelance 1000fps tests! We all need delicious data, data, data. Open to ideas for more 1000fps high-speed camera tests (brief or extensive, volunteered or paid) to validate all these new RTSS settings -- mark@blurbusters.com
     
    Last edited: May 28, 2018
  2. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    This can only mean that the 3D scene in your test is rather simple for your CPU but complex enough for your GPU, so it takes more time to render a frame on the GPU side than to prepare it on the CPU side. In other words, the CPU is preparing and submitting frames faster than the GPU can render them, so the DX runtimes and drivers start pre-rendering frames. In this case, once the pre-rendering queue is full, the minimum interval between two Present() calls is defined by the time required to get a pre-rendered frame out of the queue and ready for presentation. So if you add some fixed delay on the CPU side, the next frame will simply come out of the pre-rendering queue a bit faster without affecting the frametime. CPU influence on frametime in this case can be minimal, with 90% of it bottlenecked by the GPU, and you'll need to throttle the CPU rendering loop and make it sleep _more_ than the GPU rendering time to flush the queue.
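    A toy model of that behaviour (a simplified illustration, not the actual RTSS or driver implementation): with a 2ms CPU frame, a 10ms GPU frame and a 3-deep pre-render queue, an added CPU-side delay is simply absorbed by the queue, and the on-screen frame interval stays GPU-bound:

        #include <algorithm>
        #include <cstdio>
        #include <deque>

        int main() {
            const double cpuMs   = 2.0;   // CPU frame preparation time
            const double gpuMs   = 10.0;  // GPU render time per frame
            const double delayMs = 5.0;   // extra delay injected on the CPU side
            const size_t queueCap = 3;    // driver pre-render queue depth

            std::deque<double> inFlight;  // GPU completion times of queued frames
            double tCpu = 0.0, gpuFreeAt = 0.0, prevFlip = 0.0;

            for (int frame = 0; frame < 8; ++frame) {
                tCpu += cpuMs + delayMs;                      // prepare frame (+ delay)
                if (inFlight.size() >= queueCap) {            // queue full: Present()
                    tCpu = std::max(tCpu, inFlight.front());  // blocks until a slot frees
                    inFlight.pop_front();
                }
                double done = std::max(tCpu, gpuFreeAt) + gpuMs; // GPU renders when free
                gpuFreeAt = done;
                inFlight.push_back(done);
                std::printf("frame %d flips at %6.1f ms (interval %4.1f ms)\n",
                            frame, done, done - prevFlip);
                prevFlip = done;
            }
            // With cpuMs + delayMs < gpuMs, the interval settles at gpuMs (10 ms):
            // the injected delay is absorbed by the queue. Only a delay that makes
            // the CPU loop slower than the GPU (e.g. delayMs = 12.0) drains the
            // queue and actually changes the frametime.
        }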
     
  3. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    In addition to the previous post, I think I'll just add one more performance counter to the new RTSS performance profiler: CPU wait time. It will show how much CPU time RTSS spends inside the busy-wait loop when the framerate limiter, throttling, or scanline sync mode is active. This way you'll be able to see that in GPU-bottlenecked cases 90% of each frametime can actually be spent waiting, and see how much throttle time needs to be added to achieve a similar effect.
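    For illustration, a minimal sketch of how such a counter can be derived (a toy example, not RTSS internals): timestamp the busy-wait separately from the rest of the frame and report its share of the frametime:

        #include <chrono>
        #include <cstdio>
        #include <thread>

        int main() {
            using clk = std::chrono::steady_clock;
            const auto frameBudget = std::chrono::microseconds(16667); // 60 fps cap
            auto frameStart = clk::now();
            for (int frame = 0; frame < 5; ++frame) {
                std::this_thread::sleep_for(std::chrono::milliseconds(3)); // "game work"
                auto waitStart = clk::now();
                while (clk::now() - frameStart < frameBudget) { /* busy-wait */ }
                auto frameEnd = clk::now();
                double waitMs  = std::chrono::duration<double, std::milli>(frameEnd - waitStart).count();
                double frameMs = std::chrono::duration<double, std::milli>(frameEnd - frameStart).count();
                std::printf("frame %d: %.2f ms total, %.0f%% spent waiting\n",
                            frame, frameMs, 100.0 * waitMs / frameMs);
                frameStart = frameEnd;
            }
        }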
     
  4. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    Thanks, it was actually fun programming, which took me back to childhood, the Z80 demoscene, and beam racing on that platform. But I'm afraid I don't have the proper hardware for the testing you're offering, so I'm not interested in doing a review from my side. Probably @Hilbert Hagedoorn is interested.
     
    yasamoka and mdrejhon like this.

  5. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    Here is an example of configuring RTSS to sync the framerate to double the refresh rate:

    [Screenshot: RTSS scanline sync configuration and OSD]

    In this case the refresh rate is 60Hz and the display mode is 1920x1080 with 1125 scanlines total. Both SyncScanline0 and SyncScanline1 are set to 1 (so frames are synchronized to scanlines 1 and 563) and SyncTimeout is set to 2 (so the wait timeout is the refresh period of 16.667ms divided by 2). SyncInfo is enabled in the OSD, so all that info is also displayed in the bottom part of the OSD (right below the performance profiler). There are two clearly visible tearlines in two fixed positions: one near the top of the screen from synchronization to scanline 1 (frame presentation is not immediate, so it actually takes a few more scanlines after the sync event to flush GPU command buffers and actually present the frame), and one from synchronization to scanline 563 (in the third row of the checkerboard). You can move the tearlines down by increasing SyncScanline0 and SyncScanline1 in step, and you can make the bottom tearline disappear in the 5xx range (which is almost equal to presenting in the VBlank interval; in our case we just do it a few scanlines above it).
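    For reference, these are plain key=value entries in the RTSS profile text file. A minimal sketch of the double-refresh-rate setup described above, assuming the usual INI-style profile format (the exact file path and section header may differ between RTSS versions):

        ; RTSS profile file (e.g. Profiles\Global) -- INI-style text
        ; Display mode: 1920x1080 @ 60Hz, VTotal = 1125 scanlines
        [Framerate]            ; section name assumed
        SyncScanline0=1        ; first sync point: scanline 1
        SyncScanline1=1        ; second point lands at 1 + 1125/2 = scanline 563
        SyncTimeout=2          ; busy-wait timeout = refresh period 16.667ms / 2
        SyncInfo=1             ; show sync diagnostics in the OSD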
     
    The1 likes this.
  6. gatecrasher

    gatecrasher Guest

    Messages:
    20
    Likes Received:
    0
    GPU:
    8GB GTX 1070
    SyncTimeout=1 seems to work well for me at 60Hz, but not 100Hz.
    At 100Hz, it keeps alternating between ~48 / ~108 FPS.

     
  7. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    The timeout is calibrated once during 3D application startup, or at runtime on a display mode (VTotal) change; dynamic refresh rate changes won't be handled. Also, using it together with the framerate limiter is not the best idea.
     
  8. gatecrasher

    gatecrasher Guest

    Messages:
    20
    Likes Received:
    0
    GPU:
    8GB GTX 1070
    That was with the framerate limiter, v-sync, and G-Sync disabled.
     
  9. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    I don't think so. Leave it for other testers please.
     
  10. gatecrasher

    gatecrasher Guest

    Messages:
    20
    Likes Received:
    0
    GPU:
    8GB GTX 1070
     

  11. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    Once again, leave power-user settings tuning to power users, and do not expect me to comment on your claims, please.
     
  12. RealNC

    RealNC Ancient Guru

    Messages:
    5,100
    Likes Received:
    3,377
    GPU:
    4070 Ti Super
    I get very heavy tearing with this:

    The tear line is not hidden. Instead, there's a "band" about 30% of the screen height, inside of which tearing happens. The tear line sometimes jumps to other parts of the screen as well.
     
  13. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    Gatecrasher, we help newbies (well, those who are new to "beamracing" concepts) at Blur Busters Forums if you want to participate in this RTSS testing thread.

    Just trying to be nice to people while also freeing up Unwinder's time to focus on the fun/advanced stuff. This is really a bleeding edge feature that has a lot of untested behaviours.

    Because of my beamrace debugging, I'm extremely good at interpreting tearline-jitter behavior. Please film your tearline jittering (use horizontal panning motion) and post your RTSS config file in the Blur Busters thread, to see if it just needs a config file tweak. If I discover an issue that requires a (hopefully simple) programmatic solution, I'll obviously relay it here in useful programmer-speak.

    Since tearline position is proportional to time, via the horizontal scanrate
    (e.g. a 160KHz scanrate means a 1/160000sec delay moves the tearline down 1 pixel):

    -- RealNC, you told me on the Blur Busters Forums you were testing at 60Hz, so mathematically:
    -- That's a 1/60sec refresh cycle (let's skip VBI time for simplicity)
    -- You say the tearline jitter is 30% of screen height

    (1/60sec * 30%) = 5 milliseconds.

    So it seems like your system is not achieving microsecond pageflip accuracy.

    Maybe the game is fudging pageflip timings in large millisecond steps. 1080p 60Hz is a ~67KHz horizontal scanrate (~67,000 pixel rows per second), so a 1/67000sec delay moves the tearline down 1 pixel.

    Since the VBI is usually roughly 45 scanlines (1080p often has a Vertical Total of 1125, even for DisplayPort and DVI), framepacing needs an accuracy of 45/67000ths of a second (<1ms) for the tearline to jitter harmlessly inside the VBI between consecutive refresh cycles. So keying off the final scanline (e.g. #1079, or the exact instant InVBlank becomes true), combined with sufficient accuracy, should hide the tearline successfully.
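    Here's the same arithmetic as a tiny calculator, for anyone who wants to plug in their own numbers (a sketch; the constants are just this post's 1080p 60Hz example):

        #include <cstdio>

        int main() {
            const double refreshHz = 60.0;
            const int    vTotal    = 1125;               // scanlines incl. VBI
            const int    vActive   = 1080;
            const double scanrate  = refreshHz * vTotal; // 67,500 lines/sec
            const double jitterFraction = 0.30;          // observed: 30% of screen height

            // Jitter band -> pageflip timing error (~5 ms in RealNC's case)
            double errorMs = jitterFraction * vActive / scanrate * 1000.0;
            // VBI size -> accuracy needed to hide the tearline (~0.67 ms)
            double vbiMs = (vTotal - vActive) / scanrate * 1000.0;

            std::printf("pageflip timing error ~= %.2f ms\n", errorMs);
            std::printf("VBI jitter budget     ~= %.2f ms\n", vbiMs);
        }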

    Also, I have discovered it is best to Flush() before sleeping/looping/busywaiting towards the timed Present(), because that reduces tearline jitter (it makes the buffer flip more accurate because the GPU is idle at Present() time).

    Programmatically it's tricky to get zero-pixel or one-pixel tearline jitter (but I have successfully done it in my apps; while I haven't yet tested this new RTSS feature due to time constraints, I have certainly spent quite a lot of time on precision-timed buffer flips, as seen in my videos), since on some systems it requires busylooping one CPU core with precisely calculated time delays, as many systems only have millisecond-precision thread sleepers or events.

    Keying to the top scanline is not recommended because the GPU does background processing (e.g. the Windows desktop compositing thread) near raster line #0, so I try to time my beam-raced Present() elsewhere where possible (e.g. the bottom edge). You will get bigger raster jitter when keying to the top edge, so it's always better to key to the bottom edge for two big reasons:
    (A) GPU is less busy when raster is near bottom.
    (B) You've got VBI time as a jitter safety margin to hide tearlines.

    Also, doing one buffer flip per refresh cycle (...Present() or glutSwapBuffers()...) seems to produce much more raster jitter than doing four to ten buffer flips per refresh cycle, because the GPU likes to start background processing during idle times. I have found it favourable to do occasional dummy page flips (duplicate buffers) when running low frame rates during beam-raced tearlines, to guarantee that the next tearline gets near-pixel-exact stability. In many cases, I needed only one extra buffer flip, less than 1 millisecond before the new-frame buffer flip, to generate a pixel-exact-stability tearline (on recent multicore systems) -- as long as the flip wasn't timed near the top edge of the screen (the "busy GPU background processing" region).

    Tearline jitter is an excellent benchmark of pageflip timing accuracy -- sometimes it's the GPU's/driver's/etc fault, but I have noticed that even an 8-year-old Intel GPU can pageflip with <0.25ms accuracy if I'm very creative. So 5ms of jitter indicates a weakness somewhere (game fault? RTSS config file issue? a flaw? These causes are hard to trace sometimes, but hopefully debug data will narrow this down greatly).

    The QueryDisplayConfig() API can return the VBI size (Vertical Total) and scanrate, if one wants additional data to improve accuracy. I've got source code available upon request, if anyone wants it. That said, it looks like Unwinder is already detecting the Vertical Total ("Sync Total" in the debug data).
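    For anyone curious, a minimal sketch of pulling those two numbers out of QueryDisplayConfig() (first active target path only, error handling trimmed; link against user32):

        #include <windows.h>
        #include <cstdio>
        #include <vector>

        int main() {
            UINT32 numPaths = 0, numModes = 0;
            if (GetDisplayConfigBufferSizes(QDC_ONLY_ACTIVE_PATHS,
                                            &numPaths, &numModes) != ERROR_SUCCESS)
                return 1;
            std::vector<DISPLAYCONFIG_PATH_INFO> paths(numPaths);
            std::vector<DISPLAYCONFIG_MODE_INFO> modes(numModes);
            if (QueryDisplayConfig(QDC_ONLY_ACTIVE_PATHS, &numPaths, paths.data(),
                                   &numModes, modes.data(), nullptr) != ERROR_SUCCESS)
                return 1;
            for (const auto& mode : modes) {
                if (mode.infoType != DISPLAYCONFIG_MODE_INFO_TYPE_TARGET)
                    continue;
                const auto& sig = mode.targetMode.targetVideoSignalInfo;
                double hScanKHz = (double)sig.hSyncFreq.Numerator /
                                  sig.hSyncFreq.Denominator / 1000.0;
                std::printf("VTotal=%u (active %u lines), scanrate=%.1f kHz\n",
                            sig.totalSize.cy, sig.activeSize.cy, hScanKHz);
                break;  // first active target is enough for this demo
            }
        }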

    BTW, Unwinder, I discovered something during my programming. Do you busyloop polling the raster register? My tests have shown that busyloop-polling the raster register is bad, and that millisecond sleeping is bad too. Instead (on Windows) I use QueryDisplayConfig() to get the horizontal scanrate and vertical total (VBI size), and use that data to "predict" the next time the scanline will hit a specific position. That reduces the number of raster polls -- or even eliminates the raster poll completely (as I did recently). Then use a microsecond delayer where possible; on systems that don't have true-efficiency microsecond sleep, use a millisecond sleep followed by a sub-millisecond busywait. That helped my tearline beamracing test in this YouTube video. Yes, you saw the video right -- high-level garbage-collected C# programming with ten-microsecond pageflip accuracy:

    [Embedded YouTube video: beam-raced tearline demo]
    And for this specific app, I don't even use D3DKMTGetScanLine()... I simply use microsecond time offsets from de-jittered VBI timestamps. That's why it works on both PC and Mac, since I'm currently writing the world's first cross-platform beamraced tearline demo. I'm hoping to release the source code (C#) in a few weeks. (Primary cause of delay: debating whether to first make a submission to a contest such as Assembly -- if any volunteer/coder is interested in polishing this cross-platform "Tearlines Are Rasters" concept up to proper modern demoscene standards...)
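    To make the "millisecond sleep followed by a sub-millisecond busywait" idea concrete, here is a minimal Windows sketch (WaitUntilQpc and its 2ms margin are my own hypothetical illustration, not RTSS code; targetQpc would come from the predicted-scanline timing described above):

        #include <windows.h>

        // Coarse Sleep() for most of the interval, then spin on
        // QueryPerformanceCounter() for microsecond-level accuracy.
        void WaitUntilQpc(LONGLONG targetQpc)
        {
            LARGE_INTEGER freq, now;
            QueryPerformanceFrequency(&freq);
            QueryPerformanceCounter(&now);

            // Sleep down to ~2 ms before the target; Sleep() granularity is
            // ~1 ms at best, and only after timeBeginPeriod(1).
            double msLeft = 1000.0 * (targetQpc - now.QuadPart) / freq.QuadPart;
            if (msLeft > 2.0)
                Sleep((DWORD)(msLeft - 2.0));

            // Burn the remainder on one CPU core.
            do {
                QueryPerformanceCounter(&now);
            } while (now.QuadPart < targetQpc);
            // ...the precision-timed Present() goes immediately after this.
        }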

    Obviously, this might be TMI (Too Much Information), but since there are apparently beam racing (raster interrupt) veterans here, I thought I'd better share it -- hope y'all learn lots about the "VSYNC OFF Tearlines Are Rasters" concept. :D Hopefully it also assists in debugging the cause (original game fault? config file fault? RTSS fault?), which can be quite obscure.

    ----

    Advanced Tweak:
    If you are already a Timing & Resolution expert (Custom Resolution Utility tweaking) and your monitor is capable of Large Vertical Totals (e.g. VT1500 at 1080p, for a 420-line VBI), then this can hide tearline jitter between refresh cycles much better, while providing a "Quick Frame Transport (QFT)" boost from the faster scanout. This is essentially similar to what HDMI 2.1's new "QFT" feature does, but it can work on DVI and DisplayPort. Do not attempt this unless you are already a CRU expert and are prepared to reboot Windows into Safe Mode a few times after failed CRU attempts.

    NOTE: FreeSync (and G-SYNC) are an easier, "natural" method of QFT. When running at 60fps on a 240Hz FreeSync display, the Vertical Total (Scan Total) is already approximately ~4000-4400, because FreeSync is simply a variable-size VBI -- basically .InVBlank stays true until Present(), then goes false on the NEXT scanline after Present()... unless the delay is too long (below FreeSync range), in which case the display begins scanning out a repeat refresh cycle without waiting for Present(). That said, QFT for lowering VSYNC ON input lag is possible on fixed-Hz displays too (by timing Present() near the end of the VBI instead of the beginning). On VRR you just run at max Hz and framecap to a lower rate for a "low-lag" low framerate; it's harder to recreate that low-lag effect on fixed-Hz displays, but it can be reproduced via the QFT technique + late-VBI flips. Very few non-VRR displays support large VTs useful for QFT, but this is useful background knowledge anyway, since RTSS's new feature now theoretically makes QFT-compatible techniques possible (to be tested).
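    To put numbers on the Large Vertical Total benefit (a hedged arithmetic sketch using this thread's 1080p 60Hz figures; variable names are mine):

        #include <cstdio>

        int main() {
            const double refreshHz = 60.0, vActive = 1080.0;
            const double vTotals[] = { 1125.0, 1500.0 };  // standard VT vs. large VT
            for (double vTotal : vTotals) {
                double scanoutMs = 1000.0 / refreshHz * (vActive / vTotal);
                double vbiMs     = 1000.0 / refreshHz - scanoutMs;
                std::printf("VT%.0f: scanout %.2f ms, VBI %.2f ms (%.0f blanking lines)\n",
                            vTotal, scanoutMs, vbiMs, vTotal - vActive);
            }
            // VT1125 -> 16.00 ms scanout, 0.67 ms VBI (45 lines)
            // VT1500 -> 12.00 ms scanout, 4.67 ms VBI (420 lines): the QFT effect
        }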

    Unwinder:
    Suggestion #1: You might want to also display the horizontal scanrate in the debug data. It helps determine Present() timing accuracy: the visually observed tearline jitter amplitude divided by the horizontal scanrate is the timing error margin of the framepacing. Tearline jitter is an excellent "visualization" of framepacing error margin!
     
    Last edited: May 30, 2018
    yasamoka and CaptaPraelium like this.
  14. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    For scanline RTSS tests, more recommendations:

    To Tweakers, Easy Tip #1:
    Modify the config to time the tearline (scanline) at or near the bottom edge, not the top edge. See if it improves results.
    Because:
    (A) The GPU is less busy with background tasks (e.g. Windows compositor) when the scanout is near the bottom.
    (B) The blanking interval gives you a jitter margin to hide tearlines.

    To Tweakers, Easy Tip #2:
    Please change the refresh rate via the Control Panel, not via a monitor button.
    Some gaming monitors include a refresh-rate-change button. I get framepacing problems from driver bugs and Windows bugs when refresh rate changes are forced by dynamically-changing monitor EDIDs (it behaves as if you rapidly unplugged & replugged a monitor within a fraction of a second). Some monitors use a background app for refresh-rate management, but others use dynamic EDID changes instead. I've even sometimes gotten bluescreens (Windows 10 frownyfaces) when doing that in the middle of a 3D application.

    To Tweakers, Easy Tip #3:
    Please use a single monitor for tests.
    Mixed-Hz multimonitor setups can cause beamracing / framepacing problems (and input lag problems!) with some graphics drivers when used with RTSS. Temporarily disable secondary monitors for now and re-enable them later. Yes, that means you, competitive players -- displaying static HOWTO pages in Notepad on a secondary 60Hz monitor can actually lag down your 240Hz eSports monitor a bit -- so be forewarned.
     
    Last edited: May 30, 2018
    CaptaPraelium and cookieboyeli like this.
  15. RealNC

    RealNC Ancient Guru

    Messages:
    5,100
    Likes Received:
    3,377
    GPU:
    4070 Ti Super
    A bit more testing revealed that it works perfectly fine in older games or non-demanding games. As soon as GPU load climbs a bit (>30-40%), tearing becomes unpredictable.

    This is also influenced by GPU clocks. If you run a non-demanding game and the GPU clocks down, the tearing "snaps" between wildly different positions. In these cases, forcing maximum performance helps prevent this.

    For modern games, setting graphics options to minimum and running at low resolution allows this sync method to work. Basically, once GPU load drops to around 20% or so, this starts to become usable.
     
    Last edited: May 30, 2018

  16. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    That also happens to me with complex beamraced graphics if I forget to Flush() before I busywait towards my Present().

    Unwinder, are you pre-emptively Flush()-ing before waiting towards your precision-timed Present()? (Or are we at the full, unfortunate mercy of the game's programming?)

    NOTE: Pre-emptive Flush() does lower the maximum framerate, but when one is capping to a lower framerate, the pre-emptive flushes don't hurt that ability, and actually improve framepacing precision (I found it necessary when beamracing more graphically complex material). So they can be useful, and can actually force the GPU to speed up if power management has merrily slowed the GPU down to match the reduced workload (it happens on laptops sometimes). So giddyup the GPU with an early Flush() so it's idle when the precision-timed Present() occurs. While it eats more power, and your max framerate can fall slightly, the framepacing can become "par excellence" (pixel-accurate tearlines). It's a tradeoff (top framerate capability versus ultra-precision framepacing accuracy), so maybe make pre-emptive Flush() an adjustable config-file setting + also automatically enable it whenever scanline mode is active. And if there is already a setting for pre-emptive Flush(), apologies, I missed the feature! Let me know.

    NOTE2: Tearlines have a downwards vertical offset roughly corresponding to the amount of time Present() takes. An unflushed Present() with lots of pending GPU draw commands will vary hugely in time, potentially explaining RealNC's experience. This is why pre-emptive Flush() reduces Present() time to simply the approximate time it takes for the GPU to memory-copy the framebuffer (a much more fixed time interval, independent of GPU load). If Present() takes 0.25ms (after pre-flushing with Flush()...), the tearline is often shifted downwards by ~0.2-0.3ms worth of horizontal scanrate. If you are aiming for an exact tearline position, pre-Flush() and performance-profile your Present() time, then subtract that from the target scanline (e.g. scanline #1000 if your Present() takes an 80-scanline delay) and write the new value to the config file. It will continually vary depending on Present() frequency, but it's still a useful approximate tearline-offset compensator for beamraced tearlines. Or just eyeball it (watch the tearline position) and calibrate manually. Just additional "beam race experience" sharing here for background info.
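    As a hedged illustration of that compensation (variable names are mine; the 0.25ms figure is just the example above):

        #include <cstdio>

        int main() {
            const double scanrate    = 67500.0; // lines/sec (60Hz x VTotal 1125)
            const double presentMs   = 0.25;    // profiled pre-flushed Present() time
            const int    desiredLine = 1080;    // where the flip should land (bottom)

            int offsetLines = (int)(presentMs / 1000.0 * scanrate + 0.5); // ~17 lines
            int syncTarget  = desiredLine - offsetLines; // value to write to the config
            std::printf("Present() ~%.2fms -> fire the sync at scanline %d\n",
                        presentMs, syncTarget);
        }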

    Suggestions to Unwinder:
    -- Strongly Recommended:
    (big benefit for little programming effort)
    If not already done, maybe add a pre-emptive Flush() setting (and automatically enable it whenever scanline flipping mode is active). An easy change, I hope.
    -- Recommended: (moderate benefit for moderate programming effort)
    This won't help all systems, but some systems really benefit. For more impeccable framepacing (pixel-accurate tearlines), use automatic repeat-Present() for fast-rendertime material at low framerates; see the sketch after this list. Basically, when framerate-capping fast-rendertime games (e.g. 1ms-2ms rendertimes) to 60fps (16.7ms frametimes), if you're doing beamraced tearlines or microsecond present timing, it is useful to repeat-Present() duplicate frames to keep the tearline jitter pixel-exact. An idling GPU starts working on background tasks (e.g. Windows compositors, driver-triggered GPU garbage collection of disposed memory, GPU-powered virus scanning, crypto-mining apps, and other background shaders) between frames. Keeping the GPU busy with redundant repeat-Present() calls prevents those background GPU tasks from ever jittering your tearlines (that's how I am able to get darn-near pixel-exact tearlines). Repeat-flipping a duplicate frame has no tearlines, so this keeps tearline jitter pixel-accurate on the critical actual new-frame Present(). Do the redundant presents approximately every 1ms, and make sure to performance-profile how quickly Present() returns for a duplicate framebuffer, so you don't accidentally mistime your actual new-frame Present() because of a last-minute redundant Present(). NOTE: This may consume more power and heat the GPU a little (not as much as full GPU load) despite low capped framerates, but it quite noticeably improves tearline lock on ultrashort-rendertime material capped to really low framerates -- see my demo of the framepacing precision achieved with a sufficiently high Present() rate -- tearlines start jittering a lot more with longer multimillisecond idling-GPU intervals due to GPU background processing.
    -- Optional: (ease of calibration)
    Maybe a hotkey (e.g. Ctrl+Shift+Alt+PgUp/PgDn) to increment/decrement the scanline setting, for easier and quicker manual calibration of tearline position. That way, once calibrated, the new settings can be saved to the config (by the user). Given performance / refresh rate / vertical total differences, this lets users experiment by eyeballing tearline behavior changes in real time -- to more quickly move the tearline jitter-area to a more favourable screen region (or hide it between refresh cycles). I added such a hotkey to my rasterdemo for tearline-offset calibration.
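    To make the repeat-Present() idea concrete, here is a heavily simplified sketch (D3D11/DXGI-style; PacedPresent and dummyPresentCost are my own hypothetical names, and it assumes a present path where re-presenting the last frame is cheap and tear-free -- profile that cost on your own system):

        #include <windows.h>
        #include <dxgi.h>

        // Keep the GPU from idling (and starting background work) by re-presenting
        // the last frame about once per millisecond, then fire the real flip at
        // the beam-raced target time.
        void PacedPresent(IDXGISwapChain* swap, LONGLONG targetQpc,
                          LONGLONG dummyPresentCost /* profiled, in QPC ticks */)
        {
            LARGE_INTEGER now;
            for (;;) {
                QueryPerformanceCounter(&now);
                // Stop dummy-flipping once one more could delay the real flip.
                if (now.QuadPart + 2 * dummyPresentCost >= targetQpc)
                    break;
                swap->Present(0, 0);  // duplicate frame: no new rendering, no tearline
                Sleep(1);             // ~1 ms cadence between dummy flips
            }
            while (now.QuadPart < targetQpc)      // final sub-ms busy-wait
                QueryPerformanceCounter(&now);
            swap->Present(0, 0);                  // the real new-frame flip
        }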
     
    Last edited: May 30, 2018
  17. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    The tear line is not supposed to be hidden with the settings you're using (i.e. synchronizing frame presentation to rasterizer scanline 1); quite the opposite, the tearline is absolutely supposed to be visible at the top of the screen in this case. I gave the example config with topmost-scanline synchronization to provide unified settings for any display mode, to take the non-beginner-friendly VTotal variable (i.e. the total count of scanlines, which you have to know to synchronize presentation to the bottom part of the frame properly) out of the equation, and to let you easily see the initial result (tearline position) and the tweaking result (the tearline moving down as the target scanline increases).
    Also, you're misunderstanding the functionality and the expected effect. With those settings you can only expect the game engine to call Present() at a fixed frequency on the desired scanlines (i.e. always with a fixed time offset from the vertical blanking interval). But frame presentation is not immediate; the actual page flip and frame buffer presentation take place later, asynchronously, when the GPU finishes rendering the frame (the current or a queued one) and gets it ready to be presented. So even in the ideal case there will be a delta of a few scanlines between the synchronization event (i.e. submitting Present() to the 3D API) and the actual frame presentation event (identified by the tearline position). That delta depends entirely on 3D scene complexity and your GPU's performance; it should be minimal and static for similar GPU workloads across frames, and it can increase with increasing GPU workload. From the tearline position you can also clearly see how and when exactly GPU power management takes place: an NVIDIA GPU may start rendering a 3D application at maximum clocks (so the delta between the presentation call and the page flip is minimal), then reduce clocks to a lower performance level to save power (so frame rendering takes more time and the tearline starts moving down).
    That's exactly what you've observed in your posts.
     
    Last edited: May 30, 2018
  18. RealNC

    RealNC Ancient Guru

    Messages:
    5,100
    Likes Received:
    3,377
    GPU:
    4070 Ti Super
    When trying to hide the tear lines inside the vertical blanking field (by syncing to scanline 1400, for example, at 2560x1440 with 1481 total scanlines), half of the "tear bar" wraps around to the top again, so you get two bands of heavy tearing (top and bottom). Modern games tend to have scene complexity that varies a lot, even in the same location, depending on where you look. (You look north in an RPG, for example, and get 30% GPU load; you turn around and get 70%...)
     
  19. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    Of course not; I avoid forcible graphics pipeline flushing like the plague because it is a pure killer for GPU performance (and especially multi-GPU performance). And no game engine ever forcibly flushes GPU command buffers either. GPU and 3D API architectures are built around asynchronous CPU/GPU processing, so direct GPU command buffer flushing functionality (like glFlush) is normally provided for debugging purposes only, and using it in retail projects is a bad programming habit.
    The implementation is also highly 3D-API-specific: for example, it is just a single glFlush call in OpenGL, but such functionality is totally missing in DirectX 8 (so it requires API-specific tricks like locking the framebuffer, which indirectly stalls the pipeline and flushes the command buffers). For DirectX 9-11 it can be implemented with busy-waited GPU queries, and it becomes a total killer with low-level APIs like DX12 and Vulkan, which are designed from the start to process multiple command buffers in parallel.
    Summarizing: this kind of functionality is easily implementable and can be reasonable for _your_own_ simple 2D applications with rather simple graphics implemented in a single API of your choice, but injecting such a thing into an arbitrary third-party game 3D engine and forcing the pipeline to stall there is a "no go", I'm afraid.
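    For illustration, this is roughly what such a forced flush looks like (a simplified sketch, no error handling). In OpenGL it is the single call glFlush(); in D3D11 it is the busy-waited event-query pattern mentioned above:

        #include <d3d11.h>

        // Flush queued commands, then spin until the GPU has drained them.
        // This is exactly the pipeline stall being warned about above.
        void FlushAndWait(ID3D11Device* dev, ID3D11DeviceContext* ctx)
        {
            D3D11_QUERY_DESC qd = { D3D11_QUERY_EVENT, 0 };
            ID3D11Query* q = nullptr;
            dev->CreateQuery(&qd, &q);
            ctx->End(q);     // fence: signaled when preceding commands complete
            ctx->Flush();    // submit everything queued so far to the GPU
            while (ctx->GetData(q, nullptr, 0, 0) == S_FALSE)
                ;            // busy-wait until the GPU reaches the fence
            q->Release();
        }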
     
  20. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,198
    Likes Received:
    6,866
    Which can only mean that you need to upgrade your GPU to use that, sorry.
     
