MSI AB / RTSS development news thread

Discussion in 'MSI AfterBurner Application Development Forum' started by Unwinder, Feb 20, 2017.

  1. jiminycricket

    jiminycricket Master Guru

    Messages:
    203
    Likes Received:
    4
    GPU:
    GTX 1080
    Can throttle time be used to reduce input lag in VSync off scenario by dynamically adjusting the FPS cap to prevent GPU boundedness at all times?
     
  2. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    That's the plan. It's there for g-sync, freesync, or plain old vsync off without any VRR. You could still use it with vsync I suppose, but with vsync you really want to be reaching the frame cap at all times to begin with.
     
    Last edited: May 23, 2018
    jiminycricket likes this.
  3. D2 Ultima

    D2 Ultima Guest

    Messages:
    40
    Likes Received:
    0
    GPU:
    GTX 1080N x2
    Correct me if I'm wrong, but... does this essentially mean you've made software gsync when at a frame limit? I know it won't be the same, but... this looks so good
     
  4. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Huh? This doesn't have anything to do with gsync. This means that instead of "60", you can specify "16667" to cap to 60FPS (or "16669" to cap to 59.99FPS).
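
    As a rough sketch of the arithmetic behind those numbers: a frametime limit in microseconds is one second divided by the target framerate. The helper names below are made up for illustration and are not RTSS settings.

```python
def fps_to_frametime_us(fps: float) -> int:
    """Convert a target framerate to a frametime in whole microseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return round(1_000_000 / fps)


def frametime_us_to_fps(frametime_us: int) -> float:
    """Convert a frametime in microseconds back to a framerate."""
    if frametime_us <= 0:
        raise ValueError("frametime_us must be positive")
    return 1_000_000 / frametime_us


print(fps_to_frametime_us(60))     # 16667
print(fps_to_frametime_us(59.99))  # 16669
```

    Working in microseconds is what makes fractional caps such as 59.99FPS expressible at all.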
     

  5. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Hi Mark,

    Thanks for the detailed post, I'm really pleased to read it and would love to see more posts like that instead of the typical "I cannot overclock/overvolt, fix your crap". Also, thanks for offering to share your source code, but there is absolutely no need for that - I already use the Direct3D kernel mode thunk interface (i.e. D3DKMT) for other functionality, so it is not a problem to implement alternate framerate synchronization modes. I'll add the ability to sync the framerate to a specific scanline index, or to the start or end of the vertical blanking interval, to this beta.
     
    yasamoka, VAlbomb, mdrejhon and 5 others like this.
  6. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    I decided to delay beta 1 a bit because I've just appended it with the following functionality per Mark's request:

    o Added power user oriented profile setting, allowing you to synchronize framerate to up to two independent scanline indices per refresh interval. Combining with user configurable scanline wait timeout, those settings provide experienced users low input lag adaptive VSync or FastSync functionality on any hardware

    Once I finish my own internal testing of it, I'll upload it to the forums. I don't think it will take long, stay tuned!
     
  7. D2 Ultima

    D2 Ultima Guest

    Messages:
    40
    Likes Received:
    0
    GPU:
    GTX 1080N x2
    Gsync is technology that refreshes the screen when a new frame is ready. There is no tearing or input lag when below a certain refresh rate, but without gsync if you lock frames below your refresh rate you can still likely get screen tearing when fullscreened in games and such. But if he's locking frames to a frametime interval, then every frame should come exactly at the correct time (once a FPS lock is achieved). This should eliminate tearing instead of dealing with slight frametime variances which can occur normally (though, extremely rare once using the existing RTSS lock and usually a result of an outside influence). But with the low-lag adaptive vsync feature to scanline timeouts as explained before my reply, it looks a lot more like that might be much more functional than what I was thinking.
     
  8. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Oh. I thought you were talking about the part that you quoted :p

    Yes, syncing the frame limiter to the vblank period should in theory give you the same input lag g-sync/freesync can deliver when you maintain a perfectly stable frame rate. Well, almost the same. There seems to be some driver latency and variance between the reported scanout position and actual scanout position, which turned up in emulators that recently tried to implement something similar (synchronized rendering to the current scanline, aka "beam chasing.") I assume this is why Unwinder is also adding the "user configurable scanline wait timeout" option, so that you can compensate for that latency.

    If this works as well as we hope it does, then this could in future be promoted into a non-power user setting. Just a "tear-free vsync OFF" checkbox in RTSS for example. Disable vsync, check that box, and you should get what it says on the label: vsync off without tearing :) (Or at least with tearing that is confined to the very top or very bottom of the screen.)
     
    cookieboyeli likes this.
  9. gedo

    gedo Master Guru

    Messages:
    310
    Likes Received:
    43
    GPU:
    RX 6700 XT 12GB
    Call it RT-sync. :)
     
    cookieboyeli and Andy_K like this.
  10. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Nope, scanline wait timeout is added to provide functionality similar to NVIDIA's adaptive VSync. In traditional VSync without a timeout you're always just waiting for the next VBlank interval, so at 60Hz refresh rate it may drop to 30 FPS if frametime is more than 16.7 ms. With the timeout implementation you'll never wait for VBlank longer than the timeout, so it will effectively mean that sync is disabled once framerate drops below the refresh rate.
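
    The 30 FPS drop and the effect of a timeout can be illustrated with a deliberately simplified steady-state model. This is only a sketch, not how RTSS implements it: it assumes every frame starts exactly at a VBlank and that the timeout is counted from that moment, and the function name is hypothetical.

```python
import math


def vsync_frametime_us(render_us: int, refresh_us: int, timeout_us: int = 0) -> int:
    """Steady-state frametime under a simplified VBlank-wait model.

    Assumptions (not taken from RTSS itself): the frame starts at a VBlank
    and the timeout is measured from the start of the frame.
    timeout_us == 0 means "always wait for the next VBlank".
    """
    # Traditional VSync: hold the frame until the first VBlank after rendering.
    synced = math.ceil(render_us / refresh_us) * refresh_us
    if timeout_us == 0:
        return synced
    # With a timeout we stop waiting once the timeout has elapsed,
    # but a frame can never be shown before it has finished rendering.
    return max(render_us, min(synced, timeout_us))


refresh = 16667  # 60Hz refresh period in microseconds
for render in (10000, 20000):
    plain = vsync_frametime_us(render, refresh)
    adaptive = vsync_frametime_us(render, refresh, timeout_us=refresh)
    print(render, round(1_000_000 / plain, 1), round(1_000_000 / adaptive, 1))
```

    In this model a 20 ms frame is held to the second VBlank without a timeout, while with the timeout it is presented as soon as it is ready.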
     

  11. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    RTSS 7.2.0 beta 1 is online:

    http://www.guru3d.com/files-details/rtss-rivatuner-statistics-server-download.html

    · Added On-Screen Display performance profiler. Power users may enable it to measure and visualize CPU and GPU performance overhead added by On-Screen Display rendering. Two performance profiling modes are available:
    o Compact mode provides basic and the most important CPU prepare (On-Screen Display hypertext formatting, parsing and tessellation), CPU rendering and total CPU times, as well as GPU rendering time (currently supported for Direct3D9+ and OpenGL applications only)
    o Full mode provides additional and more detailed per-stage CPU times
    · Improved built-in framerate limiter:
    o Added power user oriented profile setting, allowing you to specify the limit directly as a target frametime with 1 microsecond precision
    o Added power user oriented profile setting, allowing you to adjust throttle time. Throttle time adjustment is aimed to reduce input lag when framerate is below the target limit or without limiting the framerate
    o Added power user oriented profile setting, allowing you to synchronize framerate to up to two independent scanline indices per refresh interval. Combining with user configurable scanline wait timeout, those settings provide experienced users low input lag adaptive VSync or double VSync functionality on any hardware
    · Various On-Screen Display optimizations and improvements:
    o Added adjustable minimum refresh period for On-Screen Display renderer. The period is set to 10 milliseconds by default, so now the On-Screen Display is not allowed to be refreshed more frequently than 100 times per second. Such implementation allows keeping smooth animation when On-Screen Display contents are being updated on each frame (e.g. when displaying realtime frametime graph) without wasting too much CPU time on it
    o Added alternate GPU copy based Vector2D On-Screen Display rendering mode implementation for Direct3D1x applications. New mode provides up to 5x Vector2D performance improvement on NVIDIA graphics cards, however it is disabled on AMD hardware due to slow implementation of CopySubresourceRegion in AMD display drivers
    o Vector2D rendering mode is now forcibly disabled in Vulkan applications on AMD graphics cards due to insanely slow implementation of vkCmdClearAttachments in AMD display drivers
    o Revamped geometry batching and vertex buffer usage strategy in pure Direct3D12 On-Screen Display renderer (currently used in Halo Wars 2 only)
    o Added Vector2D rendering mode support to pure Direct3D12 On-Screen Display renderer
    o Optimized On-Screen Display hypertext parsing and tessellation implementation
    o Optimized state changes in OpenGL On-Screen Display rendering implementation
    o Optimized state changes in Direct3D1x On-Screen Display rendering implementation
    o Solid rectangles and line primitives in Direct3D8 and Direct3D9 On-Screen Display rendering implementations are now rendered from vertex buffer instead of user memory
    o Improved OpenGL framebuffer dimensions detection when framebuffer coordinate space is selected
    · Fixed On-Screen Display rendering in wrong colors when Vector2D mode is selected and Direct3D1x applications use 10-bit framebuffer
    · Fixed Vulkan fence synchronization issue, which could cause GPU-limited Vulkan applications to hang due to attempt to reuse busy command buffer
    · Active busy-wait loop in the framerate limiter module is now forcibly interrupted during unloading the hooks library to minimize the risk of deadlocking 3D application when dynamically closing RivaTuner Statistics Server during 3D application runtime
    · Improved synchronization in 32-bit hook uninstallation routines
    · Updated profiles list


    A few notes about new toys for power users:

    New performance profiler

    Performance profiler can be enabled by setting PerformanceProfiler field in [OSD] section to 1 (basic mode) or 2 (detailed mode). "Show own statistics" must be enabled in RTSS to see the profiler. The following performance counters are available for detailed mode:

    CPU acquire – CPU time spent on acquiring access to the 3D API. This CPU time depends on the 3D API used by the application. In most cases it is zero. For D3D12 applications displaying OSD in D3D11on12 mode, it is the CPU time spent on acquiring the D3D11on12 wrapper for rendering. In Vulkan applications asynchronously presenting frames from the compute queue (e.g. DOOM or Wolfenstein II on AMD cards), it is the CPU time spent on synchronizing graphics and compute queues. For OpenGL applications it can be nonzero if the application is forcibly flushing the pipeline at the end of each frame with glFlush. The CPU acquire stage is executed on each frame.

    CPU prepare – CPU time spent on preparing OSD contents for rendering. This CPU time doesn’t depend on the 3D API used by the application, it entirely depends on the amount of text/graphs you’re displaying in OSD. CPU prepare time is divided into the following substages: init, parse and tessellate. Init is CPU time spent on formatting RTSS's own OSD contents (i.e. formatting its own framerate counters, scanning hypertext and replacing the framerate macro with real formatted framerate values, formatting performance counters, benchmark statistics etc). Parse is CPU time spent on parsing the resulting OSD hypertext (including the hypertext supplied by OSD clients like MSI AB or HwInfo), processing hypertext formatting tags and preparing OSD contents as a collection of text with attributes to be tessellated on the next stage. Tessellate is CPU time spent on converting parsed OSD text and attributes to renderable form (a collection of vector rects for each symbol for vector 2D/3D OSD rendering modes or a collection of textured quads for each symbol for raster 3D mode). The CPU prepare stage is executed only on the frames when OSD contents are refreshed, i.e. if you’re displaying OSD with a framerate counter and the default refresh period in RTSS properties (500 ms), then OSD is refreshed and this stage is executed just twice per second.

    CPU render – CPU time spent on rendering OSD. This CPU time depends on the 3D API used by the application and on the OSD rendering mode selected in RTSS (Vector2D, Vector3D or Raster3D). CPU render time is divided into the following substages: save, submit and restore. Save is CPU time spent on saving 3D rendering pipeline state before rendering OSD. This substage entirely depends on the 3D API used by the application, for example state changes are most expensive for Direct3D9 applications (especially pure Direct3D9 ones). Low-level 3D APIs (pure Direct3D12 or Vulkan) do not require saving pipeline state, so this CPU time is zero. Vector2D OSD rendering mode also doesn’t require saving and restoring rendering pipeline state, so it is zero in this case too. Submit is CPU time spent on filling vertex buffers with previously tessellated OSD geometry and submitting it to the 3D API. Restore is CPU time spent on restoring previously saved 3D rendering pipeline state after drawing OSD. The CPU render stage is executed on each frame.

    CPU capture – CPU time spent on capturing framebuffer contents. This stage is executed, and this time is nonzero, during video capture only.

    CPU flush – CPU time spent on the final stage of flushing the OSD renderer and returning control to the application’s 3D API. This time is the D3D11on12 wrapper flushing time for D3D12 applications displaying OSD in D3D11on12 mode. For applications using different 3D APIs it is zero. This stage is executed on each frame.

    CPU total – total CPU time including all stages listed above.

    GPU render – GPU time spent on rendering OSD. This performance counter is currently collected only for Direct3D9, Direct3D10, Direct3D11 and OpenGL applications, and for Direct3D12 applications displaying the overlay in D3D11on12 mode. GPU render time profiling is currently not supported for Vulkan and pure Direct3D12 applications.

    New scanline sync based framerate limiter


    Before you start experimenting with new sync mode, it is recommended to enable diagnostic scanline sync related info in OSD by setting SyncInfo field in [OSD] section to 1. "Show own statistics" must be also enabled in RTSS to see it. New scanline sync based framerate limiter is controlled by the following values:

    SyncDisplay – name of logical display device to be synchronized with. Currently it is a primary display name.

    SyncScanline0 – index of the first scanline for framerate synchronization. No synchronization is performed when it is set to zero, otherwise this is treated as scanline index starting from top of the frame. E.g. SyncScanline0=1 means that the frame will be synchronized with the top (or more precisely the second scanline, because indices are zero based) scanline and SyncScanline0=1000 means that the frame will be synchronized with scanline 1000 (which is located in the bottom part of screen if we use 1080p mode with 1125 scanlines total).

    SyncScanline1 – index of the second scanline for framerate synchronization. Defining two independent synchronization points per refresh allows us to get the functionality of double VSync (i.e. you get a smooth 2xRefreshRate framerate). No synchronization is performed when it is set to zero, otherwise this is treated as an index starting from the middle of the frame. E.g. SyncScanline1=1 with 1125 scanlines in total means that the frame will be synchronized with scanline 562 (1125/2) + 1 = 563, and SyncScanline1=400 means that the frame will be synchronized with scanline 562 (1125/2) + 400 = 962 (which is located in the bottom part of the screen if we use 1080p mode with 1125 scanlines total).

    SyncTimeout – allows adjusting the timeout for scanline synchronization. The timeout provides functionality similar to NVIDIA’s Adaptive VSync, meaning that you may forcibly disable synchronization when framerate drops below the refresh rate. The timeout can be specified either explicitly in microseconds (e.g. SyncTimeout=16667 for 60Hz refresh rate), or you can let RTSS benchmark and calibrate it automatically and set it to 1/N of the refresh time (when SyncTimeout=N is in the [1,8] range).
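
    The way these values map to actual sync points can be sketched as follows. This is only an illustration of the rules described above, with hypothetical function names. In particular, RTSS measures the refresh time itself by benchmarking, whereas this sketch simply takes the refresh rate as a parameter.

```python
from typing import Optional


def scanline0_target(sync_scanline0: int) -> Optional[int]:
    """SyncScanline0 is an index from the top of the frame; 0 disables it."""
    return sync_scanline0 if sync_scanline0 != 0 else None


def scanline1_target(sync_scanline1: int, total_scanlines: int) -> Optional[int]:
    """SyncScanline1 is an index from the middle of the frame; 0 disables it."""
    if sync_scanline1 == 0:
        return None
    return total_scanlines // 2 + sync_scanline1


def timeout_us(sync_timeout: int, refresh_hz: float) -> Optional[int]:
    """Resolve SyncTimeout to microseconds; 0 means no timeout.

    Values 1..8 are read as N, giving 1/N of the refresh time.
    Anything larger is taken as an explicit timeout in microseconds.
    """
    if sync_timeout == 0:
        return None
    if 1 <= sync_timeout <= 8:
        return round(1_000_000 / refresh_hz / sync_timeout)
    return sync_timeout


print(scanline1_target(1, 1125))    # 563
print(scanline1_target(400, 1125))  # 962
print(timeout_us(2, 60))            # 8333
```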

    Summarizing, you may start experiments with scanline sync with the following presets:

    For traditional VSync with low input lag:

    Code:
    SyncScanline0=1
    SyncScanline1=0
    SyncTimeout=0
    
    In this case the tearline position is fixed at the top of the frame, and you can move it down by increasing the SyncScanline0 value.

    For adaptive VSync with low input lag on 60Hz refresh rate:

    Code:
    SyncScanline0=1
    SyncScanline1=0
    SyncTimeout=16667
    
    or calibrate timeout automatically:

    Code:
    SyncScanline0=1
    SyncScanline1=0
    SyncTimeout=1
    
    For double VSync (i.e. 2x refresh rate framerate, 120FPS for 60Hz refresh rate)

    Code:
    SyncScanline0=1
    SyncScanline1=1
    SyncTimeout=0
    
    In this case the tearlines will be at the top and in the middle of the frame, and you can move them down by increasing the SyncScanline0 and SyncScanline1 values together. To control the timeout in this case, use either an explicit value:

    Code:
    SyncScanline0=1
    SyncScanline1=1
    SyncTimeout=8333
    
    or calibrate timeout automatically:

    Code:
    SyncScanline0=1
    SyncScanline1=1
    SyncTimeout=2
    
     
    Last edited: May 29, 2018
    CaptaPraelium, vukk, mdrejhon and 5 others like this.
  12. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Thanks!

    I just tested the ThrottleTime setting, and it does not appear to be working for me. ThrottleTime=400 and Limit=0 for example should result in a difference of a couple of FPS (compared to uncapped.) But there is no change. There is also no latency change (which I can get by setting "Limit" to 1FPS below what the game runs at when uncapped.)

    I have to raise the value to something like 8000 to see an effect (at which point there's unpredictable results, and some heavy frame skipping.)

    ThrottleTime is in microseconds, right? Hoping I'm not missing something obvious here :)
     
  13. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Throttle time is in microseconds, so 400 adds 0.4ms to each frame (when no framerate limit is enabled or when frametime is > target with framerate limit enabled). So that couple of FPS you're expecting depends on your actual framerate and average frametimes. Enable frametime display to see that it is working.
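
    To see why the FPS change depends on the starting framerate, here is a small sketch of the idealized arithmetic. It assumes the throttle time is simply added to every frametime, as described above, and the function name is made up for illustration.

```python
def fps_with_throttle(base_fps: float, throttle_us: int) -> float:
    """Framerate after adding a fixed throttle time to every frame."""
    base_frametime_us = 1_000_000 / base_fps
    return 1_000_000 / (base_frametime_us + throttle_us)


# The same 400 microseconds costs more FPS the higher the framerate is.
for fps in (30, 60, 144):
    print(fps, round(fps_with_throttle(fps, 400), 2))
```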
     
  14. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    OK, it seems it isn't working only in the situation it would be useful in: when the GPU is maxed out :-/ (99% or 100%).

    There is no change in frame times. It only starts to work once you set really high values. With values up to about ~7000 or so, the frame time doesn't change. At about 8000 or so, it starts working, with a big jump in the frame time.

    In games where the GPU isn't fully saturated, or when lowering GPU-intensive settings, then it's working as intended. However, in this case, this setting is of course not needed anymore, since input lag only starts appearing when the GPU is completely maxed.

    So there seems to be a difference between setting an FPS cap and setting ThrottleTime. In some games it's very easy to get a stable frame time to test this with. Witcher 3 for example when pausing the game and disabling "hardware mouse cursor" in order to get an in-engine rendered cursor. With a cap of 90FPS but the game maxing out the GPU and only running at 77.4FPS, there is increased input lag. If I alt+tab to RTSS and set a cap of 77FPS (which is just 0.4FPS lower), all input lag goes away completely. I was hoping the throttle time would be effectively doing the same thing, but apparently it doesn't in the case where there's high GPU load.

    I tested these games: Witcher 3, Fallout 4, Skyrim Special Edition, Resident Evil 7.

    Witcher 3 is the easiest to test with, because of the 100% stable frame times when you hit ESC to bring up the game menu, and in-engine rendered mouse cursor to test input lag with.
     
  15. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Well, I added exactly what you asked. Your expectations of magic from adding fixed throttle time to each frame are wrong then.
     

  16. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Well, it's not exactly magic. If the normal frame cap works, one would expect that throttling would do the same. There doesn't seem to be a functional difference between the two :-/

    A cap of 10ms for example would block for 0.2ms on a frame that was presented 9.8ms since the last present call. A throttle time of 0.2ms would do exactly the same thing. There should be no difference.

    If that is not the case, then I would have to apologize for sending you on a wild goose chase :-(
     
  17. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    In case of ideal frame pacing from the application side, when we have ideal frame times and always see 9.8 - yes, there will be no difference. Otherwise we'll see a fluctuating (0.2 + applicationFrametime) ms frametime with throttling vs a static 10 ms frametime when using the limiter.
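
    A small sketch of that difference, using made-up frametimes in microseconds. It assumes the limiter pads every frame up to the target and the throttle adds a fixed delay to every frame; both function names are hypothetical.

```python
from typing import List


def limited(frametimes_us: List[int], target_us: int) -> List[int]:
    """Framerate limiter model: each frame is padded up to the target."""
    return [max(t, target_us) for t in frametimes_us]


def throttled(frametimes_us: List[int], throttle_us: int) -> List[int]:
    """Throttle model: a fixed delay is added to each frame."""
    return [t + throttle_us for t in frametimes_us]


app = [9800, 9500, 9900, 9800]  # application frametimes, not perfectly paced
print(limited(app, 10_000))     # static 10 ms frametimes
print(throttled(app, 200))      # frametimes still fluctuate with the application
```

    Only when every application frametime is exactly 9800 do the two lists come out identical.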
     
  18. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Yep! But it's not what I'm seeing. Here are the frame times with ThrottleTime=0:

    https://i.imgur.com/gEM7ihy.jpg

    And here with ThrottleTime=4000:

    https://i.imgur.com/xzcnOeT.jpg

    They should be 4ms higher. But they aren't.
     
  19. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
  20. RealNC

    RealNC Ancient Guru

    Messages:
    4,959
    Likes Received:
    3,235
    GPU:
    4070 Ti Super
    Yes. In-game limiter is off, RTSS limiter is off, vsync off (both in-game and in NVCP,) g-sync off.

    Edit:
    Actually, I take back my comment about high GPU load. This seems to happen even with non-maxed out GPU. It's enough for it to be on the high side (over 80% or so.)
     
    Last edited: May 28, 2018
