FlightGear Rembrandt deferred rendering performance
The configurable deferred pipeline has great potential, but poor performance.
ALS works exceptionally well; ThorstenR has done amazing things with it without costing any frame rate at all. Rembrandt’s shadows flicker too much, and a shadow map of the kind Thorsten has added to ALS is arguably[2] a much better approach than Rembrandt’s. What I personally like about Rembrandt is its support for light sources.
Long term I don’t know what the right answer is, and it’s not really up to me anyway, so I’m just presenting my findings.
I’m investigating this for two reasons. The first is to work out how to get post-processing effects that aren’t possible with forward rendering. The second is performance: it has long been recognized that the deferred pipeline is slow. I would expect a performance hit, but it seems disproportionate. I have a reasonable system, and over the last few months I’ve changed graphics cards (GTX460 to R9 290); what struck me as strange was that the new card made very little difference to the frame rate. Watching the card in a GPU monitor (MSI Afterburner) I saw GPU activity of around 1%, so something was obviously wrong.
This work required a lot of study of the code, the configuration and the notes[1].
I’m using ALS (FG 3.5 from git) as the comparison.
To get a reasonable baseline I always started from the same in-air position, removed one thing at a time from the deferred pipeline, and measured the result. All measurements were taken with the same OSG debug elements on screen.
Step 1: Modify default-pipeline.xml
I started by removing bloom and ambient occlusion, but it made no difference. I then fiddled about with the texture buffers, in case something was stalling the GPU because the buffers weren’t in the right format. I’m not a GLSL/GPU expert, but I understand enough about graphics to know that this is often the cause. In this case it wasn’t.
Next, more of the stages had to go. I also removed everything unnecessary from the pipeline (conditions, 16-bit buffers, etc.) in case these were affecting OSG’s processing. This made no difference.
I carried on until there was almost nothing left. Figure(1) below shows the deferred pipeline reduced to just three stages, and yet each stage takes much longer than its forward-rendering equivalent.
At this point I was confused. Something was clearly odd: usually hacking out half of a pipeline gains you some performance, but here it made no noticeable difference.
Step 2: OSG multithreading
I tried each of the threading options. Some improved things slightly, but the change was the same for both the forward and the deferred pipeline. So no magic solution here (and after studying the OSG documentation and mailing lists I wasn’t expecting one).
Step 3: Adjust the C++
So I’ve got a cut-down, three-stage deferred pipeline; now I need to look at the code.
Since the number of stages doesn’t look like the problem, and the rendering pipeline is now similar to the forward one, I should be getting the same frame rate (I’m not). If so, it can’t be the GPU; or rather it could be, but before coming back to that I first needed to remove anything that looked possibly unnecessary, such as a massive inefficiency or a thread-locking wait state. So I removed everything that wasn’t essential: conditions, accesses to the property tree, even the odd cull visitor. None of it made any difference to performance.
Step 4: Shaders
As the deferred pipeline uses a different set of shaders, I went through each one and checked it for anything wrong. I wasn’t expecting to find anything, partly because I don’t properly understand shaders, but it all looked OK. So I then removed the shaders from the pipeline completely (i.e. I didn’t understand them, so let’s get rid of them and see what happens).
What happened was odd; I got a black screen but the performance was about the same.
Step 5: Running out of ideas
At this point I’ve got a pipeline with the bare minimum of stages, almost nothing being done in the shaders, and the C++ cut down to the bare minimum. Staring at a black screen with nothing but the OSG stats, I get that awful feeling that I’m missing something obvious and looking in the wrong place. But where else was there to look?
By this point I’d found sim/rendering/draw-mask, so I turned everything off, which left only the skydome; and there at the top of the screen was a (pretty much constant) 30fps. See Figure(2) below (where the FPS is lower because of the on-screen OSG statistics).
How could drawing effectively nothing give me a 30Hz frame rate? It should be 60Hz, because that’s my monitor’s refresh rate. Then it hit me like a hungry TV presenter: this has to be related to vsync. Somehow it has to be.
Figure(2) Drawing nothing is taking a long time
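The arithmetic behind the vsync hunch, under the simplest model of vsync (double buffering, where a buffer swap blocks until the next vertical blank, and one swap per frame; triple buffering and adaptive vsync behave differently): every frame then takes a whole number n of refresh periods T.

```latex
\[
T = \frac{1000\ \text{ms}}{60} \approx 16.67\ \text{ms}, \qquad
t_{\text{frame}} = nT, \qquad
f = \frac{60}{n}\ \text{Hz}, \quad n = 1, 2, 3, \dots
\]
\[
f \in \{60,\ 30,\ 20,\ 15,\ 12,\ \dots\}\ \text{Hz}
\]
```

So a near-empty scene should land on n = 1 (60 fps); a steady 30 fps means n = 2, i.e. two refresh periods per frame, which suggests something is waiting for vsync more than once per frame.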
Step 6: Vsync investigations
What I’m looking for now is something that causes each camera to wait for vsync, as I’m 99.9% certain that this is why it’s slow.
The first logical thing was to find out whether vsync was turned on, so I had to work out what it is called in OSG and how to set it. Having found that, I studiously went through the code to work out where to add it to the init(), and, in a “we already thought of that” way, it was already there.
So I added the following to the command line: --prop:/sim/rendering/vsync-enable=false --prop:/sim/frame-rate-throttle-hz=60
The results in Figure(3) show that there is indeed a wait for vsync per camera in OSG. I’m not sure this is correct behaviour and will probably ask about it on the OSG mailing lists, but this is progress.
Figure(3) Drawing nothing is fast
With vsync turned off I’m getting what I think is reasonable performance. My GPU is showing far more activity and the cooling fan now comes on (before these changes GPU activity was pretty much 1%).
Rather like a poorly written murder mystery, the clues were there all along: the frame time of the cameras in Figures(1) and (2) is very close to 16.67ms, the vsync period at 60Hz. I should have realised this earlier, but I didn’t.
I’m not totally sure this is the right solution, or even whether it applies to other graphics cards, but for me, with shadows turned off, I’m getting 42fps (before, I was getting around 15fps). With shadows I get around 30fps. That’s good, as 30fps is the minimum I can cope with (even though I usually have shadows turned off except for screenshots).
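Converting those frame rates to frame times and comparing them with the 60Hz refresh period T makes the before figure look suspicious in hindsight: around 15 fps is exactly four refresh periods per frame, the kind of whole-number multiple that vsync quantisation produces, whereas 42 fps is not a multiple of the period, which is what you would expect once frames are no longer locked to the vertical blank. This is consistent with the vsync explanation rather than proof of it.

```latex
\[
\begin{aligned}
T &= \frac{1000}{60}\ \text{ms} \approx 16.67\ \text{ms} \\
15\ \text{fps} &\;\Rightarrow\; \frac{1000}{15}\ \text{ms} \approx 66.7\ \text{ms} = 4T \\
42\ \text{fps} &\;\Rightarrow\; \frac{1000}{42}\ \text{ms} \approx 23.8\ \text{ms} \approx 1.43\,T
\end{aligned}
\]
```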
Next steps (for me) with deferred rendering
Now I’m trying to figure out how to adapt the deferred rendering pipeline so that I can do post-processing.
[1] Project Rembrandt: http://wiki.flightgear.org/Project_Rembrandt (French: http://wiki.flightgear.org/Fr/Projet_Rembrandt)
[2] There are many arguments, and deeply held views, about forward vs. deferred rendering. I take the view that we should let users choose what they like.