Do games still CPU prerender when you set and reach an FPS limit?

Discussion in 'Videocards - NVIDIA GeForce Drivers Section' started by BlindBison, Jun 10, 2021.

  1. BlindBison

    BlindBison Ancient Guru

    Messages:
    2,404
    Likes Received:
    1,128
    GPU:
    RTX 3070
    Something I was wondering about recently after reading CaptaPraelium’s writeup on future frame rendering in BFV: https://www.reddit.com/r/BattlefieldV/comments/9vte98/future_frame_rendering_an_explanation/. So, assuming uncapped FPS, when we’re GPU bound the CPU can work ahead a bit and prepare frame buffers for the GPU.

    This helps achieve high GPU utilization (the GPU doesn’t have to sit idle waiting for the CPU to prepare the next frame buffer, since one is already in a queue ready to go) and, it sounds like, guards against inconsistencies on the CPU side if the CPU falls behind for a moment. Of course, the known trade-off of having a prerender queue is added frames of latency, so a shorter CPU queue when GPU bound can reduce input latency.

    Here’s my question though — it’s been said before in this forum that if you reach your FPS limit you are no longer GPU bound but CPU bound instead, and I’ve seen it said that the CPU can’t prerender ahead if you’re capping FPS with, say, RTSS.

    Assuming CaptaPraelium’s writeup is correct though, this doesn’t make sense to me — I would think that, assuming the CPU is fast enough, it could still fill up the prerender queue, no? For example, say the CPU completes the frame buffer in 3 milliseconds and the GPU takes 7 milliseconds to do its work. Assuming we have a 60 FPS limit (16.7 ms), why couldn’t the CPU start working on the next frame buffer while the GPU does its work? If the CPU finished the second frame buffer in 3 ms as well, it might even have time to work on the next, but I expect I’m perhaps misunderstanding something. Is it that the CPU doesn’t release the frame buffer until the specified time, so it only does its work every 16.7 ms or some such?

    I’m wondering now if there could be CPU-side limiters and GPU-side limiters, but I’m unsure. Thanks!

    EDIT: The other thing that makes me wonder if I’m misunderstanding how prerendering works (perhaps CaptaPraelium is too) is that when I’ve read about it elsewhere, others have said the CPU sends batches of frame buffers of various sizes, rather than having the queue set up as described in the writeup, so really I’m just sort of confused about how this all works technically, I suppose.

    I know Unwinder commented to me once in the past that an RTSS frame limit won’t affect the flip queue directly, but he said limiting framerate surely can indirectly reduce flip queue load. Thanks!
     
    Last edited: Jun 10, 2021
  2. Astyanax

    Astyanax Ancient Guru

    Messages:
    16,996
    Likes Received:
    7,337
    GPU:
    GTX 1080ti
    Yes
     
    BlindBison likes this.
  3. Kelutrel

    Kelutrel Member

    Messages:
    43
    Likes Received:
    41
    GPU:
    MSI RTX 2080
    In normal conditions the GPU and CPU times are decoupled. For example, the GPU will render and send an image to the screen every 10 milliseconds, while the CPU recalculates the following frame's data and sends it to the GPU every 5 milliseconds. The extra data would just be ignored by the GPU (in optimal conditions, without any queuing): the GPU would only consider the latest and most up-to-date frame data and discard the rest.

    First, let's clarify a few things:
    - You are GPU bound when the GPU takes 10 milliseconds and the CPU takes 5 milliseconds for each frame. You are seeing 100fps and the GPU is limiting you.
    - You are CPU bound when the GPU takes 10 milliseconds and the CPU takes 20 milliseconds for each frame. You are seeing 50fps and the CPU is limiting you.

    Then the answers in normal conditions without FPS capping:
    - Why couldn’t the CPU start working on the next frame buffer while the GPU does its work? : In normal conditions it does. The CPU and GPU are completely decoupled and work continuously.
    - Is it that the CPU doesn’t release the frame buffer until the specified time so it only does its work every 16.7 ms or some such? : In normal conditions the CPU is decoupled, so it keeps precalculating the following frame's data even when the GPU is currently busy.

    If you CAP your FPS, the CPU stops calculating the following frames' data continuously and instead starts on the following frame's data at specific intervals, like a pulse (sketched in code below). So if you cap your FPS at 100, the CPU will only recalculate the following frame's data at the 0ms-10ms-20ms-30ms-... points in time. The GPU is still decoupled and will try to render whatever frame data it has available, if any, and will not render anything if there is no new frame data available.

    Then the answers for the case with 100 FPS capping become:
    - Why couldn’t the CPU start working on the next frame buffer while the GPU does its work? : Because the application is asking the CPU to recalculate the following frame's data at each 10 ms step only, and not continuously.
    - Is it that the CPU doesn’t release the frame buffer until the specified time so it only does its work every 16.7 ms or some such? : The GPU and CPU are still decoupled, so the CPU recalculates the following frame's data at each 10 ms step, and the GPU works with whatever frame data is available.
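    To make the "pulse" idea concrete, here is a minimal C++ sketch (hypothetical loop and function names, not any real engine's code) of a CPU capped at 100 FPS: the per-frame work takes 5 ms, but a new frame is only started at each 10 ms point.

    ```cpp
    #include <chrono>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    // Hypothetical stand-in for the CPU's per-frame work: simulate the game
    // state and submit the frame's render commands (~5 ms of CPU time here).
    void prepareAndSubmitFrame() {
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    }

    int main() {
        const auto frameInterval = std::chrono::milliseconds(10);  // 100 FPS cap
        auto nextPulse = Clock::now();
        for (;;) {
            prepareAndSubmitFrame();  // finishes in ~5 ms...
            nextPulse += frameInterval;
            // ...and uncapped, the loop would immediately start on the next
            // frame. Capped, the CPU instead idles until the next 10 ms point.
            std::this_thread::sleep_until(nextPulse);
        }
    }
    ```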

    This is valid for most games and engines out there, but obviously specific optimisations and features may slightly change these behaviours.
     
    Solfaur, Ohmer and BlindBison like this.
  4. BlindBison

    BlindBison Ancient Guru

    Messages:
    2,404
    Likes Received:
    1,128
    GPU:
    RTX 3070

  5. BlindBison

    BlindBison Ancient Guru

    Messages:
    2,404
    Likes Received:
    1,128
    GPU:
    RTX 3070
    @RealNC Sorry to bother you, but I figured I'd tag you in here since historically I remember seeing you comment on this sort of thing. Thanks!
     
  6. RealNC

    RealNC Ancient Guru

    Messages:
    4,893
    Likes Received:
    3,168
    GPU:
    RTX 4070 Ti Super
    You're neither CPU nor GPU bound when the FPS limiter is active (meaning the FPS cap is reached.) The limiter artificially blocks the CPU from executing the game's render thread.

    In a real CPU-bound scenario, the CPU would simply be too slow and thus fall behind the GPU, and that can also produce rather severe frame time inconsistencies. A limiter, on the other hand, only makes the CPU fall behind the GPU; it does not introduce a stuttery mess, because the CPU actually is fast enough to deliver each frame on time and is merely artificially blocked from working on more than one frame at a time.

    The CPU will still prerender ahead like this if MPRF (max pre-rendered frames) is 2 or higher. If it's 1, then it won't. With 1, the queue holds only one frame, which is the current frame. It has to finish first before the next frame can be prepared.

    However, games do other things before they start preparing render commands. One of these is reading player input. The exact timing of that has implications for input lag. Even with a queue size of 1, the game will read player input and then start work on preparing the next frame based on that input. But it can't actually put anything in the queue since it's full. So it has to wait. If it needs to wait for 5 ms, then player input is already 5 ms old before work on the next frame can even start. Not good.
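    As a hedged pseudo-C++ illustration of that ordering problem (all names made up, no real engine or API implied):

    ```cpp
    #include <chrono>
    #include <thread>

    // Hypothetical stand-ins, just to make the ordering visible.
    void readInput()        { /* poll input devices */ }
    void submitRenderWork() { /* build and queue render commands */ }
    void waitForQueueSlot() { std::this_thread::sleep_for(std::chrono::milliseconds(5)); }

    // Queue full, no limiter: input goes stale while the game waits.
    void gameFrame() {
        readInput();         // input sampled now...
        waitForQueueSlot();  // ...but we block ~5 ms until the queue drains,
        submitRenderWork();  // so this frame is built on 5 ms old input.
    }
    ```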

    A frame limiter solves this. See below.

    The frame is "released" when it's presented. The frame presentation function of the API in use (DirectX, OpenGL or Vulkan) is where a limiter like RTSS sits: it intercepts the game's frame presentation call.

    The thing is that when the game calls a function, the execution thread is transferred to that function. The game's own code cannot continue running until the called function has finished executing, aka "when the function has returned." Once the function returns, the game's code continues executing, which usually means it can start preparing the next frame, which also means reading player input for that next frame.

    So what happens if you enable the FPS limiter in RTSS? It prevents the frame presentation function from returning too soon. It "blocks" the game code that called that function, preventing it from starting work on the next frame. How long it blocks depends on the FPS limit. So even though RTSS doesn't directly affect the prerender queue size, it prevents the game from doing any work that could fill up that queue.

    Just as important, it prevents the game from reading player input too soon. With a frame limiter active, the game doesn't read input first and then wait until the queue is ready to accept the next frame. It first waits (the frame limiter forces it to), then reads player input and immediately starts submitting render work to the queue. Player input is "fresher" that way, lowering input lag.
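    In code terms, the hook might look roughly like this (a minimal sketch with made-up names; RTSS's actual hooking and timing code is more sophisticated and not public). The key point is that the wait happens inside the present call, before control returns to the game:

    ```cpp
    #include <chrono>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    constexpr auto kFrameInterval = std::chrono::microseconds(16667);  // ~60 FPS cap

    void realPresent() { /* stand-in for the API's actual present call */ }

    // The game calls this believing it is the API's present function.
    void hookedPresent() {
        static auto nextRelease = Clock::now() + kFrameInterval;
        realPresent();  // pass the finished frame through to the API as usual
        // Don't return until the frame interval has elapsed. While blocked
        // here, the game can't read input or start preparing the next frame,
        // so the prerender queue never gets the chance to fill up.
        std::this_thread::sleep_until(nextRelease);
        nextRelease += kFrameInterval;  // schedule the next release point
        // (a real limiter would also resync nextRelease if the game falls
        // behind the cap, instead of letting the schedule drift)
    }
    ```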

    Some games already do this on their own, usually esports titles like CS:GO. They know the state of the queue and avoid reading player input too soon. Many "normal" games don't seem to bother though; they just read input at the most convenient time.

    Finally, a frame limiter that is built into the game itself can do even better than external limiters. RTSS, for example, can only reduce input lag by blocking the present call. That already gives lower input lag, but a built-in limiter can do better still by letting other work happen first before reading player input, and only blocking at exactly the right moment before that. Most in-game limiters do that. I've run into a couple of games in the past whose built-in limiters weren't better than RTSS; can't even remember which games anymore. So in my experience, the vast majority of in-game limiters do the right thing.

    But there's an even better way that almost no games use: keep track of the frame time history and predict how long it's gonna take to render the current frame. With this approach, input lag is lowered even further by waiting longer before rendering the next frame. I only know of one game that does this: ezQuake. It uses this approach to eliminate vsync lag. Instead of predicting, you can simply provide a configuration option to set the frame delay; RetroArch, for example, does this. If an NES emulator can render a frame in 1 ms (usually the case on modern CPUs, which can run NES games at over 1000 FPS), then you can wait 15 ms at the start of each 16.7 ms frame, and then read input and render within the remaining 1.7 ms. This lowers input lag by a whole 15 ms.
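    A minimal sketch of that idea (hypothetical names; not ezQuake's or RetroArch's actual code). The frame's deadline is known, the render time is predicted (or configured), and input is read as late as safely possible:

    ```cpp
    #include <chrono>
    #include <thread>

    using Clock = std::chrono::steady_clock;

    // Assumed worst-case render time. A real implementation would derive this
    // from frame time history, or expose it as a config option like RetroArch.
    std::chrono::microseconds predictRenderTime() {
        return std::chrono::microseconds(1700);  // ~1.7 ms, as in the NES example
    }

    void readInput()   { /* poll input devices */ }
    void renderFrame() { /* emulate/render one frame */ }

    // One frame with "frame delay": sleep first, sample input as late as possible.
    void runFrame(Clock::time_point deadline) {
        std::this_thread::sleep_until(deadline - predictRenderTime());
        readInput();    // input is now only ~1.7 ms old at the deadline...
        renderFrame();  // ...and rendering still finishes in time.
    }
    ```

    The obvious design risk is a misprediction: if the frame takes longer than predicted, the deadline is missed, which is part of why so few games bother.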

    But anyway, we got off topic here. The TL;DR is that frame limiters prevent the prerender queue from filling up, regardless of its capacity. But they also prevent the game from reading player input too early, thus decreasing input lag by shifting the game's input sampling later, closer to the frame it actually affects.

    I'm not an expert either, but regardless of how exactly the technicalities work out: if the queue is full, then something, somewhere, somehow has to block at some point. Otherwise you couldn't prevent more render commands from being submitted than the queue can actually handle. A queue is a queue. When it's full, you have to wait for its size to decrease. No way around that. This is generally described as "backpressure." Be it vsync backpressure, GPU backpressure, or prerender queue backpressure, it always has a negative impact on input lag, and frame limiters are a good way to prevent it. When the FPS target is not reached, something like NVIDIA's "ultra low latency" setting can kick in and prevent or mitigate it.
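    As a toy illustration of backpressure (a generic bounded queue in C++, nothing driver-specific): once the queue is full, the producer has no choice but to block until the consumer frees a slot.

    ```cpp
    #include <condition_variable>
    #include <cstddef>
    #include <mutex>
    #include <queue>

    template <typename T>
    class BoundedQueue {
    public:
        explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

        // Producer side (conceptually, the game thread submitting frames):
        // blocks while the queue is full. This is the backpressure.
        void push(T item) {
            std::unique_lock<std::mutex> lock(mutex_);
            notFull_.wait(lock, [this] { return items_.size() < capacity_; });
            items_.push(std::move(item));
            notEmpty_.notify_one();
        }

        // Consumer side (conceptually, the GPU draining the queue).
        T pop() {
            std::unique_lock<std::mutex> lock(mutex_);
            notEmpty_.wait(lock, [this] { return !items_.empty(); });
            T item = std::move(items_.front());
            items_.pop();
            notFull_.notify_one();
            return item;
        }

    private:
        std::queue<T> items_;
        std::mutex mutex_;
        std::condition_variable notFull_, notEmpty_;
        const std::size_t capacity_;
    };
    ```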

    Yep. It doesn't alter the queue size at all, but it does keep the queue from filling up. You can even set the "low latency" setting in the NVCP to "off", then use Profile Inspector to set max pre-rendered frames to something ridiculous like 7 or 8 or whatever the maximum is. Input lag is going to be really bad and very inconsistent. But as soon as you activate the RTSS limiter, as long as the cap is reached, all the extra input lag is gone and it behaves as if you're using the "ultra" low latency setting.
     
  7. BlindBison

    BlindBison Ancient Guru

    Messages:
    2,404
    Likes Received:
    1,128
    GPU:
    RTX 3070
    Thanks a lot, this was a very in-depth explanation, which is awesome. Much appreciated!
     
  8. Kelutrel

    Kelutrel Member

    Messages:
    43
    Likes Received:
    41
    GPU:
    MSI RTX 2080
    BlindBison likes this.
  9. janos666

    janos666 Ancient Guru

    Messages:
    1,645
    Likes Received:
    405
    GPU:
    MSI RTX3080 10Gb
    I thought the common knowledge was that NV_v3 was about the same as RTSS, but in-game was slightly better. The in-game limiter can obviously vary from game to game, but "good in-game" could be assumed in general.
     
    BlindBison likes this.
  10. BlindBison

    BlindBison Ancient Guru

    Messages:
    2,404
    Likes Received:
    1,128
    GPU:
    RTX 3070
    I’ll have to give this another read-over, but I thought RodroG (the Reddit link tester/OP) found the RTSS and driver limiters were about the same, while in-game limiters "could" sometimes be lower latency (such as in Overwatch and BFV).

    RodroG also made a thread here on Guru3D where that was the finding to my memory.
     

  11. carb1de

    carb1de Member

    Messages:
    14
    Likes Received:
    0
    GPU:
    RTX 4080
    I've never looked into it in any depth, other than just "feeling" the difference between the RTSS FPS cap, the NVCP FPS cap, and what's now called the NVCP "low latency mode". I've used no empirical evidence other than the frametime overlay from RTSS, but it's always "felt" better to find out what your min/max FPS is on a per-game basis, then use RTSS to cap that game at or around the min FPS. My (uneducated) thought process has always been that limiting FPS to the lows gives the GPU some breathing space, so that it never actually hits the lows it would have hit unconstrained. Seems to work, but again, only judging by a running frame time graph.

    In my mind it makes sense for the GPU to always have resources in reserve rather than running flat out at all times and then potentially getting hit by a scene that demands more computational power than is available, causing intermittent stutter.
     
  12. hearnia_2K

    hearnia_2K Active Member

    Messages:
    87
    Likes Received:
    5
    GPU:
    PNY RTX 2080 Super

    Doesn't this all assume we're not using a proper scheduler? As I understand it, with a proper scheduler the GPU can work out how long the rendering will take and prepare the data as late as possible, to provide the most up-to-date information to the display. I also thought the same thing was possible from the CPU perspective: if the CPU knows it can prep the frame in 2 ms but the GPU takes 6 ms, then rather than the CPU calculating constantly, and thus creating 2 frames that get discarded, the CPU will just prep the data just in time for the GPU to pick it up. This type of scheduling makes sense in a frame-rate-limited scenario, for example with a fixed refresh rate display, but also when the workload is unbalanced between CPU and GPU.
     
  13. Kelutrel

    Kelutrel Member

    Messages:
    43
    Likes Received:
    41
    GPU:
    MSI RTX 2080
    Afaik the approach you describe is possible but very rarely used, even in modern videogames. As far as I know, and RealNC also confirmed, CS:GO uses this approach.
    The advantage is that you may reduce input latency by a couple of ms on average, in special cases and under optimal conditions, but the complexity of the engine pipeline and the involved code increases quite a bit, since you are loosely re-coupling the CPU and GPU and adding the risk of lowering the frame rate and increasing latency when the predictions are wrong. So most games don't find it convenient to implement.
     
    BlindBison likes this.
