MSI AB / RTSS development news thread

Discussion in 'MSI AfterBurner Application Development Forum' started by Unwinder, Feb 20, 2017.

  1. RealNC

    RealNC Ancient Guru

    Messages:
    4,954
    Likes Received:
    3,233
    GPU:
    4070 Ti Super
    In any event, thank you for adding this. It works well in games like CS:GO @ 60Hz. So for people who can't afford g-sync/freesync monitors and are stuck with their 60Hz display but want tear-free vsync off, this is worth trying out.
     
    Last edited: May 30, 2018
  2. RealNC

    RealNC Ancient Guru

    Messages:
    4,954
    Likes Received:
    3,233
    GPU:
    4070 Ti Super
    Just had a weird thought about the ThrottleTime feature. What would happen if, in the case where ThrottleTime is triggered, instead of blocking and then presenting, you presented first and blocked later?

    Other than that, the whole ThrottleTime experiment appears to be a complete dead end. How far would you be able to go in having an "adaptive frame limiter" in RTSS? Going back to the original thinking of why such a thing seemed useful: when the frame limiter triggers, you get the best possible latency. If it doesn't, you get a latency jump, unless you lower the frame limiter's cap even more. Even if the cap is only barely reached (by as little as 0.1 FPS), the full benefits of capping are there. My fixed throttle time suggestion for getting an "always capped" effect turns out to have been a very naive idea.

    Do you think there's another way to achieve a "cap is always active" effect? Like RTSS adapting the cap dynamically depending on the average FPS? Thinking of a bSmoothFramerate-like effect (Unreal Engine) here.
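    To make that concrete, here's a rough sketch of the kind of thing I mean (purely hypothetical, not something RTSS does): track recent frame times and keep the FPS cap just below what the game can currently sustain, so the limiter is always the thing that ends the frame.
    Code:
    // Hypothetical adaptive frame cap sketch (not an RTSS feature).
    // Keep the cap period slightly above the game's typical frame time,
    // so the limiter is always the component that paces the frame.
    #include <algorithm>
    #include <deque>

    class AdaptiveCap
    {
    public:
        // Call once per frame with the measured render time in seconds;
        // returns the frame period (seconds) the limiter should pace to.
        double Update(double renderTimeSec)
        {
            history_.push_back(renderTimeSec);
            if (history_.size() > kWindow)
                history_.pop_front();

            // Use a high percentile of recent render times so occasional
            // slow frames pull the cap down and keep it "always active".
            std::deque<double> sorted(history_);
            std::sort(sorted.begin(), sorted.end());
            double p90 = sorted[sorted.size() * 9 / 10];

            targetPeriod_ = p90 * 1.05;   // 5% headroom above typical frame time
            return targetPeriod_;
        }

    private:
        static constexpr size_t kWindow = 120;   // ~2 seconds of history at 60 fps
        std::deque<double> history_;
        double targetPeriod_ = 1.0 / 60.0;
    };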
     
  3. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    That's exactly how it works now, and it applies to both the framerate limiter and ThrottleTime. In both cases, when the Present() call is intercepted, RTSS calls Present first and then waits (either a fixed time in the case of ThrottleTime or a variable time in the case of the framerate limiter). Busy waiting before calling Present() is only used for synchronization with scanlines.
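    In simplified form (an illustration of the two orderings only, not the actual RTSS code):
    Code:
    // Illustration of the two orderings only, not the actual RTSS code.
    #include <chrono>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    void Present() { /* stand-in for the hooked API Present() call */ }
    bool RasterAtTargetScanline() { return true; /* stand-in for scanline check */ }

    // Framerate limiter / ThrottleTime path: present first, then wait
    // (a variable time for the limiter, a fixed time for ThrottleTime).
    void PresentThenWait(Clock::time_point& deadline, Clock::duration period)
    {
        Present();
        std::this_thread::sleep_until(deadline);
        deadline += period;
    }

    // Scanline sync path: busy wait for the target scanline first, then
    // present, so the buffer swap lands at the chosen point of the refresh.
    void WaitThenPresent()
    {
        while (!RasterAtTargetScanline())
            std::this_thread::yield();   // Sleep(0)-style CPU yield between polls
        Present();
    }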
     
  4. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    I don't use QueryDisplayConfig or any other API calls to determine VTotal, because display mode detection/reporting APIs are normally rather expensive in CPU time and not well suited to frequent realtime usage. They are a good choice for your own application, where you can calibrate/detect everything at the initialization stage, before starting your own render loop, and be sure the settings won't change later, so you don't need to track them in realtime. But they are not the best choice for overlays, which intercept another application's rendering, do all their work inside Present() hooks, and should add as little CPU overhead as possible.
    So I use nothing but the current scanline detection and detect/calibrate VTotal in realtime while waiting for the target scanline (the maximum scanline reached before entering the VBlank interval is constantly monitored and latched as VTotal).
    On the second question: yes, I use a busy wait loop polling D3DKMT to get the current scanline, and I interleave the polling with a CPU yield via Sleep(0).
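    Roughly like this, in simplified form (not the actual implementation; it assumes the D3DKMT declarations from the Windows SDK/WDK and an adapter handle already opened, e.g. via D3DKMTOpenAdapterFromHdc):
    Code:
    // Simplified sketch of the idea, not the actual RTSS implementation.
    // The D3DKMT* entry points live in gdi32.dll (link gdi32.lib or
    // resolve them with GetProcAddress).
    #include <windows.h>
    #include <d3dkmthk.h>

    // Busy wait until the raster reaches targetLine, yielding the CPU
    // between polls; the maximum scanline seen before entering VBlank
    // is latched as the realtime VTotal estimate while we wait.
    UINT WaitForTargetScanline(D3DKMT_HANDLE hAdapter,
                               D3DDDI_VIDEO_PRESENT_SOURCE_ID sourceId,
                               UINT targetLine,
                               UINT* vtotalEstimate)
    {
        D3DKMT_GETSCANLINE scan = {};
        scan.hAdapter = hAdapter;
        scan.VidPnSourceId = sourceId;

        for (;;)
        {
            if (D3DKMTGetScanLine(&scan) != 0)   // STATUS_SUCCESS == 0
                break;

            if (!scan.InVerticalBlank && scan.ScanLine > *vtotalEstimate)
                *vtotalEstimate = scan.ScanLine; // calibrate while waiting

            if (!scan.InVerticalBlank && scan.ScanLine >= targetLine)
                break;

            Sleep(0);                            // CPU yield between polls
        }
        return scan.ScanLine;
    }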
     

  5. Hakuryuu

    Hakuryuu Guest

    Messages:
    18
    Likes Received:
    1
    GPU:
    MSI GTX 760 Twin Frozr
    I can't comment on any technical stuff yet, but I just want to say that Unwinder is probably on the brink of a great discovery here. Hats off to you, and keep up the good work.
     
    cookieboyeli likes this.
  6. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    Fair enough -- the performance degradation percentage must be rather severe (e.g. framerate halving).

    Just so you know, my multiple-Present() feature was piled on as an additional improvement on top of the Flush() feature of my beamracing app, but multiple-Present() alone can actually also serve as an alternative to doing any Flush() -- so doing that alone can end up being a "moderate programming effort for a huge improvement" (since it's an improvement relative to the no-Flush() situation in your case, rather than an improvement relative to the Flush() situation in my case).

    In a situation where the precision wait before a Present() takes much longer than a repeat-Present() of a duplicate framebuffer (~1/8000 sec on my GTX 1080 Ti), this can be a compromise alternative to doing a Flush(), if done strategically only in those situations where you end up having to precision-wait a few milliseconds and want to keep the GPU "busy".

    I had found that I needed to use both Flush() and the keep-GPU-busy trick to stabilize tearlines in a varying-GPU-load situation. But I hadn't tried keep-GPU-busy alone (without Flush()) to see if it's as good as doing Flush() and keep-GPU-busy concurrently at locking tearlines in a varying-GPU-load situation.

    Some of my rasterdemo experiments use nearly 100% GPU power, due to the high frameslice rate (7000 slices/second), yet I still get near-pixel-exact tearlines after offset-compensating. That said, my draw commands are indeed simple (at this time).

    You don't have to implement this feature, but at least now you know that keeping the GPU busy during multi-millisecond idles between frames makes a big difference in stabilizing the tearline position to practically pixel-exact (or within a tight range) despite varying GPU load. A "Good To Know" detail, anyway.

    Yes, true. For me, I only do it at initialization time since VTotal doesn't change. (Well, on VRR it does, but QueryDisplayConfig only returns the minimum VTotal, i.e. the VTotal of the highest framerate when running in VRR.)

    I'm impressed you're doing realtime VTotal guessing. Essentially it's equivalent to the time ratio of INVBlank == false (the active portion) versus the full refresh cycle interval -- which works out to a time-based ratio of 1080:1125 for a VT1125 signal.
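    As a worked example of that ratio: if the raster spends 96% of each refresh interval with INVBlank == false and the mode has 1080 active lines, then VTotal ≈ 1080 / 0.96 = 1125, i.e. exactly the 1080:1125 ratio of a VT1125 signal.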

    That would be high-performance and also compatible with variable refresh rate -- successful guess of Vertical Total used during variable refresh rate on a frame-to-frame basis. Good going!

    I found out that busylooping on D3DKMTGetScanLine in my own apps actually lowered my framerate a little bit -- because it briefly puts a lock on graphics calls while it retrieves the scanline. (If you performance-profile D3DKMTGetScanLine, you'll notice the call takes approximately 1 microsecond on a GeForce GTX Titan (at least the 2014 original) -- a surprisingly "expensive" and "slow" API call!) But it may not matter or apply to RTSS.

    The WinUAE emulator author (whom I helped implement beamracing) told me it "stresses" the GPU a bit to busyloop on the scanline.

    For your case, it probably doesn't matter because the GPU is essentially idling anyway, and you're simply polling it only to idle to the next Present() -- so it's not problematic in this case.

    Though it would be interesting to see if there are any benefits (power savings on a Kill-A-Watt wattage measurement device) from precision-sleeping before starting the busyloop, perhaps via sub-millisecond NtSetTimerResolution(), which allows sub-millisecond sleeping on modern systems. That could reduce the busywait to tens of microseconds, by sleeping right up until just a few scanlines before the desired raster. I don't know if it's worth it or whether there would be side effects.

    Just warning that ScanLine polling is "expensive". Maybe not as expensive as Flush() (at least per-call) but it's a surprisingly expensive API call at least on some GPUs --

    From my conversations with the WinUAE author, it might actually amplify tearline jittering under certain GPU loads. Under multithreaded GPU loads, the continual ScanLine polling can delay background draw-command processing to the point where a Present() that wasn't preceded by a Flush() takes a more variable amount of time (big tearline jitter).

    WinUAE improved this amplified tearline (raster) jittering by inserting a precision sleep before beginning to busyloop on the ScanLine (limiting ScanLine busyloops to 1 ms or 500 us). This may make 30%-50% GPU loads (RealNC's case) much more stable.


    This might be useful to test: one Flush(), followed by a precision microsleep/nanosleep (on the CPU), followed by a sub-microsecond busywait, may actually stress the GPU less than a continual busyloop on ScanLine -- but this is only a theory and has not yet been fully performance-profiled. However, either way is very "expensive".

    For ~40% render times / ~40% GPU load at a 60 fps cap, this means GPU stress may be in this descending order (a sketch of approach 4 follows the list):

    1 Flush()
    2 Busyloop ScanLine (stress magnitude ~10ms)
    3 Thread sleep then Busyloop ScanLine (stress magnitude ~1ms)
    4 Thread sleep then Busyloop CPU then Busyloop ScanLine (stress magnitude ~10us)
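    A rough sketch of approach 4 (untested illustration only, leaving out the optional Flush() step from #1), using NtSetTimerResolution for the sub-millisecond sleep:
    Code:
    // Untested sketch of approach 4: raise the timer resolution, sleep most
    // of the way, busy-wait the last stretch on QPC, and only then poll the
    // scanline once or twice instead of busylooping on it.
    #include <windows.h>
    #include <winternl.h>   // NTSTATUS

    typedef NTSTATUS (NTAPI* NtSetTimerResolutionFn)(ULONG DesiredResolution,
                                                     BOOLEAN SetResolution,
                                                     PULONG CurrentResolution);

    void HybridWaitUntil(LONGLONG targetQpc, LONGLONG qpcFreq)
    {
        // 1. Request ~0.5 ms timer resolution (units of 100 ns) so Sleep()
        //    can get close to the target without burning CPU or GPU time.
        static NtSetTimerResolutionFn setRes = (NtSetTimerResolutionFn)
            GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtSetTimerResolution");
        ULONG actual = 0;
        if (setRes)
            setRes(5000, TRUE, &actual);

        // 2. Coarse sleep until roughly 1 ms before the target time.
        LARGE_INTEGER now;
        QueryPerformanceCounter(&now);
        double msLeft = (targetQpc - now.QuadPart) * 1000.0 / qpcFreq;
        if (msLeft > 2.0)
            Sleep((DWORD)(msLeft - 1.0));

        // 3. Fine busy-wait on the CPU clock only (no GPU interference).
        do {
            QueryPerformanceCounter(&now);
        } while (now.QuadPart < targetQpc);

        // 4. Only now poll the raster position (D3DKMTGetScanLine) once or
        //    twice to confirm the scanline, instead of busylooping on it.
    }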

    #1 and #2 appear to swap in certain situations, but it's untested which situations make that happen. Different GPUs and different drivers may have different efficiency behaviours. Efficiencies seem much better on newer 1080 Tis. But it's still a surprisingly expensive API.
     
    Last edited: May 31, 2018
  7. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    This.

    We're thankful this feature was implemented, and it's genuinely useful with fast-frametime games.

    It won't work perfectly in many games, but it does with quite a few popular ones!

    This will be very useful as an additional low-lag "VSYNC ON" mode alternative for stutterless ULMB that doesn't look jittery (Strobe mode massively amplifies microstutters, as there's no display motion blur to hide the jitteriness/microstutters).

    Simply put, a low-lag "VSYNC ON" via tearingless VSYNC OFF via successfully beam-raced VBI framebuffer swapping.

    This has created a new GPU-independent sync option versus NVIDIA Fast Sync and AMD Enhanced Sync (and may actually work better than those in some older games, due to perfect stutterless refresh rate lock).
     
    Last edited: May 30, 2018
  8. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    Oh, and:

    Also, when it's time to begin ScanLine polling, try to poll only once or twice per scanline. Poll throttling helps speed up background GPU draw commands. Use QPC or RDTSC busyloops between ScanLine polls when it's time to poll. This relieves GPU interference.

    Each poll may stall the GPU by only a fraction of a microsecond, but busyloops on ScanLine sure do build up (delaying some pending draw commands slightly).

    Busyloops are a necessary evil for beamraced tearlines, but this sparing approach helped me a lot (when I was still using D3DKMTGetScanLine).
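    For illustration, the poll spacing can look something like this (simplified sketch, not WinUAE's actual code; readScanline stands in for a D3DKMTGetScanLine wrapper):
    Code:
    // Sketch of the poll-throttling idea: call the scanline query at most
    // about once per scanline duration and spin on QueryPerformanceCounter
    // in between, so the GPU-side query isn't hammered continuously.
    #include <windows.h>

    // scanlineQpcTicks: QPC ticks per scanline, e.g. qpcFreq / (refreshHz * VTotal)
    UINT PollScanlineThrottled(UINT targetLine, LONGLONG scanlineQpcTicks,
                               UINT (*readScanline)())   // wraps D3DKMTGetScanLine
    {
        LARGE_INTEGER next, now;
        QueryPerformanceCounter(&next);

        UINT line = readScanline();
        while (line < targetLine)
        {
            next.QuadPart += scanlineQpcTicks;           // one poll per scanline
            do {
                QueryPerformanceCounter(&now);           // cheap CPU-side spin
            } while (now.QuadPart < next.QuadPart);
            line = readScanline();
        }
        return line;
    }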

    WinUAE emulator beamracing is capable of 80% GPU load (fuzzylines HLSL shaders enabled simultaneously with beamracing) with near-raster-exact pageflips (only single-pixel tearline jitter in color filter debug mode). Granted, it is using pre-emptive flushes too, but a large improvement occurred in WinUAE via these poll-sparing techniques. I helped author Toni with a lot of this. :)
     
    Last edited: May 31, 2018
  9. Gris

    Gris Guest

    Messages:
    3
    Likes Received:
    0
    I have RTSS v7.02 but can't update it because it will not stop or uninstall. If I attempt to close it, it just starts up again a few seconds later. I tried a search but didn't find anything on this problem. Any recommendations?
     
  10. cookieboyeli

    cookieboyeli Master Guru

    Messages:
    304
    Likes Received:
    47
    GPU:
    Gigabyte 1070 @2126
    Close MSI Afterburner; RTSS is part of it, so while Afterburner is running it auto-restarts RTSS instantly. (Which is very handy for testing blocking changes.)

    Also, use Geek Uninstaller to uninstall stuff. Think of it like the DDU of uninstalling.
     

  11. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    I experimented with throttling at the initial implementation stage and used an adjustable minimum poll interval between two D3DKMTGetScanLine calls, but I don't expose it in the config because it doesn't seem to affect tearline stability at all. Most likely it makes sense to throttle between D3DKMTGetScanLine calls if you monitor the raster position constantly, or perhaps immediately after presenting the frame, but that's not the case for the RTSS implementation. RTSS starts polling the raster position immediately before the application presents the frame, then returns control to the application after presentation. But I can give access to the minimum polling period adjustment in the next beta to let experienced users play with it.
     
    The1 likes this.
  12. mdrejhon

    mdrejhon Member Guru

    Messages:
    128
    Likes Received:
    136
    GPU:
    4 Flux Capacitors in SLI
    Hmmmm.

    Guess it's not affecting the RTSS workflow much. WinUAE may have been using a draw workflow much more tightly coupled with the polls, which made it vulnerable to amplified raster jitter.

    I guess we're out of tools to try in the toolbox (unless you dare do a brief test of Flush() for 50%-GPU-load situations, e.g. capping framerates to half the GPU's potential). Even if you don't release this feature, I'd love to see if it has any effect on stabilizing tearlines in the 50% GPU load situation, so as to visually verify the benefits versus drawbacks.

    Even with the current beta, advanced end users can still attempt Large Vertical Totals (big VBIs) to hide the tearline jitter offscreen between refresh cycles. This works very well on my BenQ Zowie XL2720Z (one of my monitors), which supports VT1500 (1080 visible + 420 VBI scanlines) at 120 Hz via ToastyX CRU. This is good for strobed modes, which amplify microstutter/tearing visibility (due to the lack of display motion blur), so low-latency versions of "VSYNC ON" really improve how nice strobing looks (ULMB/LightBoost/ELMB/DyAc/MBR).
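    To put numbers on that: at 120 Hz with VT1500, each scanline lasts 1 / (120 x 1500) ≈ 5.6 microseconds, so the 420-line VBI gives roughly 2.3 ms of blanking per refresh cycle in which the buffer swap (and any tearline jitter) can hide, compared to only a few hundred microseconds with typical reduced-blanking 1080p timings.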
     
    Last edited: Jun 6, 2018
  13. kx11

    kx11 Ancient Guru

    Messages:
    4,832
    Likes Received:
    2,639
    GPU:
    RTX 4090
    Is it possible to get Arabic displayed correctly in RTSS?!

    This is what it looks like when I used Arabic:

    [IMG]
     
    Last edited: Jun 6, 2018
  14. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Of course it has a noticeable positive effect on stabilizing tearlines. The irony is that the amount of tearline jittering directly depends on GPU performance, and it is most noticeable on relatively slow, 1-2 generation old GPUs in modern, graphically heavy applications; but on the other side, forcibly flushing the graphics pipeline also hurts performance on such systems much more than on faster GPUs, because it prevents any pre-rendering. So I see such an option as a potential source of trouble, which can give more cons than pros if it's enabled blindly. Heh, for example, even the vector 2D OSD rendering implementation, which has been clearly documented as a much slower compatibility mode for many years, is still being enabled and used by many YouTubers. And I'm receiving feedback claiming that "the OSD kills performance" and getting demands to fix it.
     
  15. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Nope, only the ANSI charset is available in the OSD texture, to keep it as compact as possible.
     

  16. RealNC

    RealNC Ancient Guru

    Messages:
    4,954
    Likes Received:
    3,233
    GPU:
    4070 Ti Super
    Only half-kidding here:
    Code:
    i_want_bad_performance=1
    
    (Inspired by a configuration setting on a Linux distribution called "I_WANT_TO_RUIN_MY_SYSTEM_AND_I_WONT_COMPLAIN_ABOUT_IT=YES").
     
    mdrejhon likes this.
  17. kx11

    kx11 Ancient Guru

    Messages:
    4,832
    Likes Received:
    2,639
    GPU:
    RTX 4090

    I see.
     
  18. Unwinder

    Unwinder Ancient Guru Staff Member

    Messages:
    17,127
    Likes Received:
    6,691
    Well, that's exactly what we had for enabling unofficial overclocking on AMD graphics cards, which used the old AMD debug PowerPlay programming API (deprecated and promised to disappear one day). It didn't work at all; a lot of people just tend to ignore CFG entry names completely and tune the system by blindly copy/pasting values from some online guides or randomly enabling some options.
    As a compromise solution, I think I'll provide an option for flushing the 3D pipeline in betas only (so Mark or anyone else can play with it) and rip it out of the final release.
     
  19. JonasBeckman

    JonasBeckman Ancient Guru

    Messages:
    17,564
    Likes Received:
    2,961
    GPU:
    XFX 7900XTX M'310
    Why not keep it in and just leave it out of the config file, since users edit that directly? Maybe a pop-up or some notification beyond the tooltip text when the option is toggled on?
    Or perhaps not; it's probably going to be ignored even if you stick a timer or something on that prompt. It's been tried before in other programs, after all. Much like that config setting in the post above, I suppose.

    Maybe keeping it beta only is the best option to ensure it's mostly only used by more knowledgeable or advanced users willing to take a risk or understand what the option does. :)
     
  20. RealNC

    RealNC Ancient Guru

    Messages:
    4,954
    Likes Received:
    3,233
    GPU:
    4070 Ti Super
    I don't know, man. To me it sounds like you're compromising RTSS because of stupid people. So basically, they win.
     
