This blog series is part of the write-up assignments for my Real-Time Game Rendering class in the Master of Entertainment Arts & Engineering program at the University of Utah. The series will focus on C++, the Direct3D 11 API, and HLSL.

In this post, I will talk about a small modification that I made to my shader files.


Currently, the vertex shader transforms a position to projected space using three matrices from our constant buffers: local to world, world to camera, and camera to projected. Each 4x4 matrix-by-vector multiply costs 16 floating-point multiplications, so the three transforms cost 16 * 3 = 48 multiplications per vertex (and the three matrices occupy 16 * 4 * 3 = 192 bytes). We can easily reduce the calculation by combining all three matrices into one on the CPU and putting the result in our per-draw-call constant buffer. However, do we still need the individual matrices for anything else? What about the shaders from my previous post that require the world position?
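To make the idea concrete, here is a minimal C++ sketch showing that transforming a position by the three matrices one at a time gives the same result as transforming it once by their product. The `Vec4` and `Mat4` types and every function name below are hypothetical stand-ins, not my engine's math library, and the sketch assumes column vectors (`M * v`), which matches the multiplication order used in this post.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical minimal types for illustration only.
struct Vec4 { float x, y, z, w; };

struct Mat4 {
    // m[row][col], zero-initialized; vectors are treated as columns (M * v).
    std::array<std::array<float, 4>, 4> m{};
};

Mat4 MakeIdentity() {
    Mat4 result;
    for (int i = 0; i < 4; ++i) result.m[i][i] = 1.0f;
    return result;
}

// Matrix * column vector: 16 multiplications and 12 additions.
Vec4 Transform(const Mat4& a, const Vec4& v) {
    const float in[4] = { v.x, v.y, v.z, v.w };
    float out[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            out[r] += a.m[r][c] * in[c];
    return { out[0], out[1], out[2], out[3] };
}

// Matrix * matrix, done on the CPU instead of per vertex.
Mat4 Concatenate(const Mat4& a, const Mat4& b) {
    Mat4 result;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            for (int k = 0; k < 4; ++k)
                result.m[r][c] += a.m[r][k] * b.m[k][c];
    return result;
}

bool NearlyEqual(const Vec4& a, const Vec4& b, float eps = 1e-4f) {
    return std::fabs(a.x - b.x) < eps && std::fabs(a.y - b.y) < eps &&
           std::fabs(a.z - b.z) < eps && std::fabs(a.w - b.w) < eps;
}

// Usage: three per-vertex transforms versus one combined transform.
void CheckConcatenationMatchesStepwise() {
    Mat4 localToWorld = MakeIdentity();       // translate by (1, 2, 3)
    localToWorld.m[0][3] = 1.0f;
    localToWorld.m[1][3] = 2.0f;
    localToWorld.m[2][3] = 3.0f;
    Mat4 worldToCamera = MakeIdentity();      // translate by (0, 0, -10)
    worldToCamera.m[2][3] = -10.0f;
    Mat4 cameraToProjected = MakeIdentity();  // simple scale, not a real projection
    cameraToProjected.m[0][0] = 0.5f;
    cameraToProjected.m[1][1] = 2.0f;

    const Vec4 position{ 1.0f, 1.0f, 1.0f, 1.0f };
    const Vec4 stepwise = Transform(cameraToProjected,
        Transform(worldToCamera, Transform(localToWorld, position)));
    const Mat4 localToProjected = Concatenate(cameraToProjected,
        Concatenate(worldToCamera, localToWorld));
    assert(NearlyEqual(stepwise, Transform(localToProjected, position)));
}
```

The two matrix-by-matrix products happen once on the CPU, so each vertex only pays for a single matrix-by-vector multiply.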

Before answering those questions, let’s look at the shader instruction count to see roughly how many instructions this will save us. Originally, we multiplied the position by the three matrices one at a time, moving through each space in turn. The disassembly of our standard vertex shader looks like this:

#line 72
mov r0.xyz, v0.xyzx  // r0.x <- vertexPosition_local.x; r0.y <- vertexPosition_local.y; r0.z <- vertexPosition_local.z
mov r0.w, l(1.000000)  // r0.w <- vertexPosition_local.w

#line 73
mul r1.xyzw, r0.xxxx, cb2[0].xyzw
mul r2.xyzw, r0.yyyy, cb2[1].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb2[2].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb2[3].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw  // r0.x <- vertexPosition_world.x; r0.y <- vertexPosition_world.y; r0.z <- vertexPosition_world.z; r0.w <- vertexPosition_world.w

#line 78
mul r1.xyzw, r0.xxxx, cb0[0].xyzw
mul r2.xyzw, r0.yyyy, cb0[1].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb0[2].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb0[3].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw  // r0.x <- vertexPosition_camera.x; r0.y <- vertexPosition_camera.y; r0.z <- vertexPosition_camera.z; r0.w <- vertexPosition_camera.w

#line 82
mul r1.xyzw, r0.xxxx, cb0[4].xyzw
mul r2.xyzw, r0.yyyy, cb0[5].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r2.xyzw, r0.zzzz, cb0[6].xyzw
add r1.xyzw, r1.xyzw, r2.xyzw
mul r0.xyzw, r0.wwww, cb0[7].xyzw
add o0.xyzw, r0.xyzw, r1.xyzw

#line 88
mov o1.xyzw, v1.xyzw

#line 89
// Approximately 25 instruction slots used

To avoid recalculating the world-to-projected matrix for every draw call, I created a static variable in my Graphics namespace and set it when the camera information is submitted each frame.

// Set world to projected (once per frame)
s_transform_worldToProjected = constantData_perFrame.g_transform_cameraToProjected * constantData_perFrame.g_transform_worldToCamera;

After I changed my constant buffer to pass only the local-to-projected matrix to the GPU, the instruction count dropped tremendously (from 25 to 9)! That is pretty close to what we expected, since we removed two of the three matrix-vector multiplications, which is about 2/3 of the floating-point work.

// Updated C++ code
// Update the transform
constantData_perDrawCall.g_transform_localToProjected = s_transform_worldToProjected * objectTransMat;
// update constant buffer
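
As a rough illustration of how the per-frame and per-draw-call steps fit together, here is a small self-contained sketch. `Matrix4x4`, `TransformCache`, and their members are hypothetical names made up for this example (my actual code uses a static variable in the Graphics namespace, as shown above); the point is only that the camera matrices are concatenated once per frame, no matter how many objects are drawn.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the engine's matrix type (row-major, column vectors).
struct Matrix4x4 {
    std::array<float, 16> e{};  // zero-initialized
    static Matrix4x4 Identity() {
        Matrix4x4 result;
        result.e[0] = result.e[5] = result.e[10] = result.e[15] = 1.0f;
        return result;
    }
};

Matrix4x4 operator*(const Matrix4x4& a, const Matrix4x4& b) {
    Matrix4x4 result;
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                result.e[row * 4 + col] += a.e[row * 4 + k] * b.e[k * 4 + col];
    return result;
}

// Hypothetical helper that caches the world-to-projected matrix.
class TransformCache {
public:
    // Call once per frame, when the camera information is submitted.
    void SubmitCamera(const Matrix4x4& worldToCamera,
                      const Matrix4x4& cameraToProjected) {
        m_worldToProjected = cameraToProjected * worldToCamera;
        ++m_cameraConcatenations;
    }

    // Call once per draw call; only one matrix multiply per object.
    Matrix4x4 LocalToProjected(const Matrix4x4& localToWorld) const {
        return m_worldToProjected * localToWorld;
    }

    int CameraConcatenations() const { return m_cameraConcatenations; }

private:
    Matrix4x4 m_worldToProjected = Matrix4x4::Identity();
    int m_cameraConcatenations = 0;
};

// Usage: one camera submission, several draw calls.
void ExampleFrame() {
    TransformCache cache;
    cache.SubmitCamera(Matrix4x4::Identity(), Matrix4x4::Identity());
    for (int object = 0; object < 3; ++object) {
        const Matrix4x4 localToProjected = cache.LocalToProjected(Matrix4x4::Identity());
        (void)localToProjected;  // would be copied into the per-draw-call constant buffer
    }
    assert(cache.CameraConcatenations() == 1);
}
```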

Let’s take a look at the new disassembly and its instruction count below.

#line 70
mul r0.xyzw, v0.xxxx, cb2[0].xyzw
mul r1.xyzw, v0.yyyy, cb2[1].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw
mul r1.xyzw, v0.zzzz, cb2[2].xyzw
add r0.xyzw, r0.xyzw, r1.xyzw
mul r1.xyzw, cb2[3].xyzw, l(1.000000, 1.000000, 1.000000, 1.000000)
add o0.xyzw, r0.xyzw, r1.xyzw

#line 77
mov o1.xyzw, v1.xyzw

#line 78
// Approximately 9 instruction slots used

Other Information

What about other information? I needed the local to world matrix to calculate the vertex’s world position, and the world to camera matrix to determine the z depth from my camera. I can think of some options right now.

  1. Simply keep the local to world matrix inside our per-drawcall buffer so that if any shader needs it, it can still use it.
  2. Inside my per-frame constant buffer, store the inverses of the world-to-camera and camera-to-projected matrices, and use them to recover the vertex’s world position when needed. However, how many shaders are actually going to use these? They are only set once per frame, but they take up 16 * 4 * 2 = 128 bytes of memory; will it be worth it? More importantly, the camera-to-projected transform is not affine, so it cannot be inverted with the cheaper affine shortcut. A standard perspective projection matrix is still invertible, but it needs a general 4x4 inverse.
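
For reference, option 1 could look roughly like the CPU-side struct below. This is only a sketch: `Float4x4` and `ConstantData_PerDrawCall` are hypothetical names and the member order is an assumption, not my exact code. The `static_assert`s check the byte sizes discussed above, and that the struct size is a multiple of 16 bytes, which Direct3D 11 requires for constant buffers.

```cpp
#include <cstddef>

// Hypothetical plain 4x4 float matrix: 16 floats * 4 bytes = 64 bytes.
struct Float4x4 {
    float m[16];
};

// Hypothetical CPU-side mirror of the per-draw-call constant buffer (option 1).
struct ConstantData_PerDrawCall {
    Float4x4 g_transform_localToProjected;  // used by every vertex shader
    Float4x4 g_transform_localToWorld;      // kept for shaders that need world position
};

static_assert(sizeof(Float4x4) == 64, "16 floats * 4 bytes each");
static_assert(sizeof(ConstantData_PerDrawCall) == 128, "two matrices, 64 bytes each");
static_assert(sizeof(ConstantData_PerDrawCall) % 16 == 0,
              "Direct3D 11 constant buffer sizes must be a multiple of 16 bytes");
static_assert(offsetof(ConstantData_PerDrawCall, g_transform_localToWorld) == 64,
              "second matrix starts right after the first");
```

Keeping the local-to-world matrix costs 64 extra bytes per draw call, but shaders that only need the projected position simply ignore it.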

After some thinking, I decided to just go with approach 1 for now. In the future, maybe I’ll modify the constant buffers so that not every shader has to receive everything.