Optimizing my quad renderer with Batch Rendering

We can get a 6-8x speedup in performance by restructuring our program and thinking in terms of data

Feb 01, 2025

OpenGL Batch Rendering - Graphics and GPU Programming - Tutorials - GameDev.net

required skills: c/c++, opengl

Before we begin, I would like to share the performance difference between my initial renderer, built purely for prototyping, and my new `batched` renderer.

Note: you can view all the source code here. This will give you a better idea of how I have gone about doing things in this game.

Hardware Specs

CPU: AMD Ryzen™ 9 5900HS
GPU: Integrated graphics that come with the processor.

Compilation Specs

clang, -O0 (optimizations turned off)

I am rendering 10,000 quads with varying colors across the screen. This is as intensive as I could make my renderer at this stage, and it has been sufficient to show how well the new batched renderer works.

I have optimizations turned off to demonstrate the absolute worst case, while accepting that the remaining performance issues now lie elsewhere in the code.

Naive Renderer

frametime: 17ms

fps: 58

To give you an idea of what is going on behind the scenes, there are around 11,666 draw calls (70,000 total elements / 6 elements per draw call), each taking roughly 0.3 microseconds (though not exactly).

Batched Renderer

frametime: 12ms

fps: 83

And to give you an idea as to why we have a performance increase, let’s take a look at RenderDoc.

Here we can see that we have around 5 draw calls.

But, that’s not… a lot?

Yes, though I would be remiss if I did not mention that up till now 2 things are happening:

I have profiling turned on. I am calculating how long each matrix multiplication takes (for another experiment)
I have compiler optimizations turned off.

Let’s turn off profiling

Naive Renderer

frametime: 13ms

fps: 75

Our naive renderer went from 58 fps to 75 fps, a decent speedup.

Batched Renderer

frametime: 5ms

fps: 176

So, we immediately jumped from 83 fps to 176 fps. Compared to both the earlier batched run and our naive renderer, that is a pretty big speedup.

But that’s not all.

Let’s see how things look when we are running with compiler optimizations enabled. As I mentioned before, my code is horribly unoptimized. Despite some sane programming choices, I am not doing anything special to make sure the cpu side of my code is fast. This means that I have no vectorization: none of my tens of thousands of per-frame matrix multiplications use SIMD, it’s all scalar, and I have receipts (the disassembly) to prove that. I also have no threading whatsoever, so all of this code is single threaded.

So, that said, let’s see how the renderers compare with optimizations enabled.

Let’s see how we perform at, say, -O2.

Naive Renderer

frametime: 10ms

fps: 91

So, again we have had some speedup, but not a lot.

Now you’re about to see just how much of a bottleneck that renderer was.

spoilers, you’re not expecting this

Batched Renderer

frametime: 1ms (we probably are lower than this)

fps: 852

Yup, your eyes do not deceive you.

So after turning on compiler optimizations, the batched renderer now runs *checks notes* roughly 9x faster than the naive renderer at the same optimization level (852 fps versus 91 fps).

For context, with the amount of things I am rendering, I am confident that nothing rendering-related I do in this 2d game will cause any sort of issue. The bottleneck is all on the cpu now. In fact, I would go as far as to say that this renderer is now beyond anything I would need in a 2d game, and is suitable for when I move to making 3d games (but that would be jumping ahead).
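For reference, the speedups at each stage can be worked out directly from the fps figures reported above. The snippet below is only a small arithmetic check; the helper names `speedup` and `print_speedups` are made up for illustration and are not part of the renderer.

```c
#include <stdio.h>

/* How many times faster a run at `fps_new` is than a run at `fps_old`. */
static double speedup(double fps_new, double fps_old) {
    return fps_new / fps_old;
}

/* Prints the three naive-versus-batched comparisons using the fps figures above. */
static void print_speedups(void) {
    printf("-O0, profiling on : %.1fx\n", speedup(83.0, 58.0));   /* about 1.4x */
    printf("-O0, profiling off: %.1fx\n", speedup(176.0, 75.0));  /* about 2.3x */
    printf("-O2, profiling off: %.1fx\n", speedup(852.0, 91.0));  /* about 9.4x */
}
```

The gap between the two renderers widens at every step, which fits the idea that the naive renderer spends most of its time on draw-call overhead that the compiler cannot optimize away.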

Now then, let’s see how I did this.

I have previously touched on how a batched renderer works, and you can also infer what a batch renderer does from its name and from how the draw calls look in what I have shown here.

There are generally 2 main steps involved in setting up a batched renderer.

Set up your opengl data and specs
Here you mainly specify your opengl buffers. This is you specifying, ahead of time, what the data you will be sending for drawing will look like: how much data you send and your vertex attributes (vertex positions and colors).
Rendering your opengl data
This step is very similar to what we do in a normal naive renderer, except that with batch rendering you explicitly fill a buffer, and when it comes to drawing you send all your data together, as a tightly packed array, in a single draw call. We saw the benefit of this above: each draw call carries a fixed overhead for setting things up, so it is much, much faster to send one large chunk of data all together than it is to send a small amount with every draw call.
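To put a number on "a single draw call", here is a minimal sketch of how many draw calls a batched frame needs. The function name `batched_draw_calls` is hypothetical; it only captures the counting, on the assumption that every full batch and one trailing partial batch each cost one draw call, whereas a naive renderer issues at least one draw call per quad.

```c
#include <stddef.h>

/* One draw call per full batch, plus one more if a partial batch is left over. */
static size_t batched_draw_calls(size_t quad_count, size_t batch_size) {
    return (quad_count + batch_size - 1) / batch_size; /* integer ceiling division */
}
```

With the 10,000 quads and batch size of 2000 used in this post, `batched_draw_calls(10000, 2000)` gives 5, which lines up with the roughly 5 draw calls seen in the capture.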

Let’s look at the code for a batched renderer

I have 3 main functions:

gl_setup_colored_quad_optimized()
gl_draw_colored_quad_optimized()
gl_cq_flush() (I will explain this later)

gl_setup_colored_quad_optimized()
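A minimal sketch of what a setup step like this has to decide, not the actual function from the repository. The batch size (2000 quads) and the 6 vertices per quad come from this post; the component counts (2 floats per position, 3 per color) and the names `BatchSpec` and `batch_spec` are assumptions made for illustration. The OpenGL calls are left as comments because they need a live context to run.

```c
#include <stddef.h>

#define BATCH_QUADS      2000u /* batch size quoted in this post */
#define VERTS_PER_QUAD   6u    /* two triangles per quad, no index buffer */
#define FLOATS_PER_POS   2u    /* assumption: x, y */
#define FLOATS_PER_COLOR 3u    /* assumption: r, g, b */

/* Hypothetical description of what gets allocated once at startup. */
typedef struct {
    size_t max_vertices; /* vertices that fit in one batch */
    size_t pos_bytes;    /* size of the position buffer on the gpu */
    size_t color_bytes;  /* size of the color buffer on the gpu */
} BatchSpec;

static BatchSpec batch_spec(void) {
    BatchSpec s;
    s.max_vertices = (size_t)BATCH_QUADS * VERTS_PER_QUAD;
    s.pos_bytes    = s.max_vertices * FLOATS_PER_POS   * sizeof(float);
    s.color_bytes  = s.max_vertices * FLOATS_PER_COLOR * sizeof(float);
    return s;
}

/*
 * With a live OpenGL context, the setup would then be roughly:
 *   glGenVertexArrays(1, &vao);  glBindVertexArray(vao);
 *   glGenBuffers(1, &pos_vbo);   glBindBuffer(GL_ARRAY_BUFFER, pos_vbo);
 *   glBufferData(GL_ARRAY_BUFFER, s.pos_bytes, NULL, GL_DYNAMIC_DRAW);
 *   glVertexAttribPointer(0, FLOATS_PER_POS, GL_FLOAT, GL_FALSE, 0, (void *)0);
 *   glEnableVertexAttribArray(0);
 *   ...and the same again for the color buffer on attribute location 1.
 */
```

Passing `NULL` to `glBufferData` only reserves the storage; the actual vertex data is uploaded later, each time the batch is flushed.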

After we set up our batch renderer, we can begin using it just as we did our naive renderer, by simply making calls to `gl_draw_colored_quad_optimized`.

The calls we make look the same as before

There is a slight difference though. We also need to call `gl_cq_flush` at the end of the frame. That process is simple enough, and happens once we have finished drawing all the elements in our batched renderer.

gl_cq_flush(renderer);

Now, let’s take a look at how drawing works.

gl_draw_colored_quad_optimized()

There are some more notable additions in this function. Firstly, you may notice that we calculate scaling and translations on the cpu now. This is admittedly required (as far as I am aware), since a single draw call now covers many quads and one model matrix uniform can no longer describe each of them. We want to control where each individual quad is drawn, alongside its color and size, and we lose that if we do not move that calculation to the cpu.
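As a simplified illustration of moving that work to the cpu, the sketch below scales and translates the 6 vertices of a unit quad directly, without building a matrix. The renderer in this post uses matrix multiplications for this, so treat the code as a stand-in for the idea only; `Vec2`, `UNIT_QUAD` and `quad_to_world` are hypothetical names.

```c
typedef struct { float x, y; } Vec2;

/* Unit quad centred on the origin, as two triangles (6 vertices). */
static const Vec2 UNIT_QUAD[6] = {
    {-0.5f, -0.5f}, { 0.5f, -0.5f}, { 0.5f,  0.5f},
    {-0.5f, -0.5f}, { 0.5f,  0.5f}, {-0.5f,  0.5f},
};

/* Scale, then translate each vertex: the job a model matrix did in the shader. */
static void quad_to_world(Vec2 center, Vec2 size, Vec2 out[6]) {
    for (int i = 0; i < 6; ++i) {
        out[i].x = UNIT_QUAD[i].x * size.x + center.x;
        out[i].y = UNIT_QUAD[i].y * size.y + center.y;
    }
}
```

For example, a quad centred at (100, 50) with size (20, 10) ends up with its first vertex at (90, 45) and its third at (110, 55). These already-transformed vertices are what gets written into the batch.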

A small tradeoff of doing this is that we are now cpu limited. I find that given the results, this is an acceptable tradeoff.

Then you may notice multiple array_insert operations. I have 2 arrays

cq_pos_batch
tightly packed array of vertex position data
cq_color_batch
tightly packed array of vertex color data

These array_insert operations simply load data into those arrays. There is definitely a better, “cleaner” way of doing this, but I want things to be explicit and easy to parse, so that you find it easy to read and understand.

Lastly, you may notice that if we hit our batch count, we call `gl_cq_flush`. In batch rendering, we specify a fixed-size array on the gpu as our batch buffer, transfer the data into it, and let our shaders process it in a single draw call. Once we hit the batch size during our draw operations, meaning our batch buffer is full, we need to flush it. This means we send the data over to the gpu and draw it.
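A minimal sketch of what "tightly packed" means for the two arrays. The field names mirror `cq_pos_batch` and `cq_color_batch` from above, but the layout (2 floats per position, 3 per color), the tiny capacity and the helper `batch_push_quad` are assumptions for illustration, not the actual array_insert code.

```c
#include <stddef.h>

#define MAX_QUADS      2u /* tiny capacity, for illustration only */
#define VERTS_PER_QUAD 6u

typedef struct {
    float  cq_pos_batch[MAX_QUADS * VERTS_PER_QUAD * 2];   /* x, y per vertex */
    float  cq_color_batch[MAX_QUADS * VERTS_PER_QUAD * 3]; /* r, g, b per vertex */
    size_t vertex_count;                                   /* vertices stored so far */
} QuadBatch;

/* Append one quad's 6 vertices to both arrays. Returns 0 if the batch is full. */
static int batch_push_quad(QuadBatch *b, const float xy[12], const float rgb[3]) {
    if (b->vertex_count + VERTS_PER_QUAD > MAX_QUADS * VERTS_PER_QUAD) return 0;
    for (size_t v = 0; v < VERTS_PER_QUAD; ++v) {
        size_t i = b->vertex_count + v;           /* index of this vertex in the batch */
        b->cq_pos_batch[i * 2 + 0]   = xy[v * 2 + 0];
        b->cq_pos_batch[i * 2 + 1]   = xy[v * 2 + 1];
        b->cq_color_batch[i * 3 + 0] = rgb[0];    /* same color repeated per vertex */
        b->cq_color_batch[i * 3 + 1] = rgb[1];
        b->cq_color_batch[i * 3 + 2] = rgb[2];
    }
    b->vertex_count += VERTS_PER_QUAD;
    return 1;
}
```

Because every vertex sits directly after the previous one with no padding, each array can be handed to the gpu as one contiguous block.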

This is what’s happening in RenderDoc.

Our batch size is 2000 quads, and each quad has 6 vertices, which is why we see 12,000 items being sent here. This draw happens in the middle of the frame.

We also flush at the end of the frame. This deals with the case where we do not have a full buffer but have no more elements to render; in that case we send whatever data we have to the gpu.

Let’s see what happens when we flush data.

gl_cq_flush()
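A minimal sketch of the flush logic, with a counter standing in for the real draw call so it can run without an OpenGL context. The names `SketchRenderer`, `sketch_flush`, `sketch_draw_quad` and `sketch_render_frame` are hypothetical, and skipping the draw when nothing is queued is an assumption rather than something confirmed by the original code.

```c
#include <stddef.h>

#define BATCH_QUADS    2000u /* batch size quoted in this post */
#define VERTS_PER_QUAD 6u

typedef struct {
    size_t quad_count;     /* quads queued since the last flush */
    size_t draw_calls;     /* stand-in counter for real draw calls */
    size_t vertices_drawn; /* total vertices submitted so far */
} SketchRenderer;

/* Submit whatever is queued as one draw call, then reset the batch. */
static void sketch_flush(SketchRenderer *r) {
    if (r->quad_count == 0) return; /* assumption: nothing queued, skip the draw */
    size_t vertex_count = r->quad_count * VERTS_PER_QUAD;
    /* Real version (needs a GL context), roughly:
     *   glBindBuffer(GL_ARRAY_BUFFER, pos_vbo);
     *   glBufferSubData(GL_ARRAY_BUFFER, 0, pos_bytes_used, cq_pos_batch);
     *   ...same for the color buffer...
     *   glDrawArrays(GL_TRIANGLES, 0, (GLsizei)vertex_count);
     */
    r->draw_calls     += 1;
    r->vertices_drawn += vertex_count;
    r->quad_count      = 0;
}

/* Queue one quad and flush as soon as the batch is full. */
static void sketch_draw_quad(SketchRenderer *r) {
    r->quad_count += 1;
    if (r->quad_count == BATCH_QUADS) sketch_flush(r);
}

/* One frame: queue every quad, then flush the partial batch at the end. */
static size_t sketch_render_frame(size_t quads) {
    SketchRenderer r = {0, 0, 0};
    for (size_t i = 0; i < quads; ++i) sketch_draw_quad(&r);
    sketch_flush(&r);
    return r.draw_calls;
}
```

Under these assumptions `sketch_render_frame(10000)` reports 5 draw calls, all triggered mid-frame by full batches, while 10,500 quads would need a sixth draw call from the end-of-frame flush.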

I left out the shader code, but I will still show you how that differs from the shader code of a naive renderer

vertex_shader

There is a small difference, in that all the vertices of a quad were scaled and translated on the cpu, so we no longer have a model matrix in our shader.
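To make that difference concrete, here is a hedged sketch of what the two vertex shaders could look like, written as C string constants the way shader source is often embedded. These are not the shaders from the repository: the GLSL version, the attribute and uniform names (`a_pos`, `a_color`, `u_model`, `u_view_proj`, `u_color`) and the use of a combined view-projection matrix are all assumptions.

```c
/* Assumed naive shader: the model matrix is a uniform set before every draw call. */
static const char *NAIVE_VERTEX_SHADER =
    "#version 330 core\n"
    "layout(location = 0) in vec2 a_pos;\n"
    "uniform mat4 u_model;\n"
    "uniform mat4 u_view_proj;\n"
    "uniform vec3 u_color;\n"
    "out vec3 v_color;\n"
    "void main() {\n"
    "    v_color = u_color;\n"
    "    gl_Position = u_view_proj * u_model * vec4(a_pos, 0.0, 1.0);\n"
    "}\n";

/* Assumed batched shader: positions arrive already scaled and translated. */
static const char *BATCHED_VERTEX_SHADER =
    "#version 330 core\n"
    "layout(location = 0) in vec2 a_pos;   // transformed on the cpu\n"
    "layout(location = 1) in vec3 a_color; // per-vertex color from the batch\n"
    "uniform mat4 u_view_proj;\n"
    "out vec3 v_color;\n"
    "void main() {\n"
    "    v_color = a_color;\n"
    "    gl_Position = u_view_proj * vec4(a_pos, 0.0, 1.0);\n"
    "}\n";
```

Both versions hand the same `v_color` output to the next stage, so a fragment shader that simply writes `v_color` out works with either one.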

fragment_shader

The fragment shader is unchanged.

Conclusion

That about wraps up this article. In quite a lot of ways, this was a challenging task, made more daunting by the lack of a straightforward tutorial. That is why I wrote this: it covers everything for the most common case of quad rendering.

Game development is an extremely vast field, with all sorts of differing cases. This is partly where the challenge of finding resources comes from. It is also what makes it extremely rewarding when you finally achieve the goal you set out to do. I learnt quite a lot when I set out to implement batch rendering, and that is only one of the most fundamental items in game development. I can only imagine how much room there is for me to learn and improve, and that is what excites me about this and why I love making games from scratch. There is a huge amount of opportunity for you to improve here, even if you are building some extremely simple 2d game as a hobby after work or on the weekends.

Talha’s Substack
