Scene setup

So far we have developed techniques to draw a triangle on the canvas given the coordinates of its vertexes, and some equations to transform the 3D coordinates of a triangle to 2D canvas coordinates. In this chapter we’ll explore how to represent and manipulate objects made of triangles.

We’ll use a cube for this; it’s not the simplest 3D object we can make out of triangles, but it will be convenient to illustrate some issues. The edges of the cube are 2 units long and are parallel to the coordinate axes, and it’s centered on the origin:

These are the coordinates of its vertexes:

\(A = (1, 1, 1)\)

\(B = (-1, 1, 1)\)

\(C = (-1, -1, 1)\)

\(D = (1, -1, 1)\)

\(E = (1, 1, -1)\)

\(F = (-1, 1, -1)\)

\(G = (-1, -1, -1)\)

\(H = (1, -1, -1)\)

The sides of the cube are square, however, and we only know how to deal with triangles. No problem: every polygon can be decomposed in a set of triangles. So we’ll just use two triangles to represent each square side of the cube.

Of course, not any set of three vertexes of the cube describe a triangle on the surface of the cube (for example, AGH is “inside” the cube), so the coordinates of its vertexes aren’t enough to describe it; we also need a list of triangles made of these vertexes:

A, B, C
A, C, D
E, A, D
E, D, H
F, E, H
F, H, G
B, F, G
B, G, C
E, F, B
E, B, A
C, G, H
C, H, D

This suggests a structure we can use to represent any object made of triangles: a list of vertex coordinates and a list of triangles which specify what sets of three vertexes make up triangles of the object.

Each entry in the Triangles list may hold additional information about the triangles; for example, we can store the color of each triangle there.

Since the most natural way to store this information is in two lists, we’ll use list indexes to refer to the vertexes in the vertex list. So the cube above would be represented like this:

0 = ( 1,  1,  1)
1 = (-1,  1,  1)
2 = (-1, -1,  1)
3 = ( 1, -1,  1)
4 = ( 1,  1, -1)
5 = (-1,  1, -1)
6 = (-1, -1, -1)
7 = ( 1, -1, -1)

 0 = 0, 1, 2, red
 1 = 0, 2, 3, red
 2 = 4, 0, 3, green
 3 = 4, 3, 7, green
 4 = 5, 4, 7, blue
 5 = 5, 7, 6, blue
 6 = 1, 5, 6, yellow
 7 = 1, 6, 2, yellow
 8 = 4, 5, 1, purple
 9 = 4, 1, 0, purple
10 = 2, 6, 7, cyan
11 = 2, 7, 3, cyan

Rendering an object with this representation is quite easy: you first project every vertex, storing them in a temporary “projected vertexes” list; then you go through the triangle list rendering each individual triangle. A first approximation would look like this:

RenderObject(vertexes, triangles) {
    projected = []
    for V in vertexes {
    for T in triangles {
        RenderTriangle(T, projected)

RenderTriangle(triangle, projected) {

We can’t just apply this algorithm to the cube as we have it and expect things to look reasonable, because some of its vertexes are behind the camera; which, as we have seen, is a recipe for weird things. In fact, the camera is inside the cube.

So we’ll just move the cube. To achieve this, we just need to move each vertex of the cube in the same direction. Let’s call this vector \(T\), for Translation. Let’s just translate the cube 7 units forward, making sure it’s completely in front of the camera, and 1.5 units to the left, to make it more interesting. Since “forward” is the direction of \(\vec{Z_+}\) and “left” is \(\vec{X-}\), the translation vector is simply

\[ \vec{T} = \begin{pmatrix} -1.5 \\ 0 \\ 7 \end{pmatrix} \]

To obtain the translated version \(V'\) of each point \(V\) in the cube, we just need to add the translation vector:

\[ V' = V + \vec{T} \]

At this point we can take the cube, translate each vertex, and then apply the algorithm above to get our first 3D cube at last:

Source code and live demo >>

Models and instances

What if we want to render two cubes?

A naive approach would be to create a new set of vertexes and triangles describing a second cube. This works, but it wastes a lot of memory. What if we wanted to render a million cubes?

A smarter approach is to think in terms of models and instances. A model is a set of vertexes and triangles that describes how a certain object is (think “a cube is made of the following set of vertexes and triangles.”) An instance of a model, on the other hand, is a concrete instantiation of that model in some position on the scene (think “there’s a cube at (0, 0, 5)”).

The advantage of this second approach is that each object on the scene needs to be defined only once, and then an arbitrary number of instances can be placed on the scene just by describing their positions within the scene.

This is a rough approximation of how a scene like this would be described:

model {
    name = cube
    vertexes {
    triangles {

instance {
    model = cube
    position = (0, 0, 5)

instance {
    model = cube
    position = (1, 2, 3)

In order to render this, we just go through the list of instances; for each instance, make a copy of the model’s vertices, apply the instance’s position to them, and then proceed as before:

RenderScene() {
    for I in scene.instances {

RenderInstance(instance) {
    projected = []
    model = instance.model
    for V in model.vertexes {
        V' = V + instance.position
    for T in model.triangles {
        RenderTriangle(T, projected)

Note that for this to work, the coordinates of the vertexes on the model should be defined in a coordinate system that “makes sense” for the object (this is called model space). For example, the cube was defined such that its center was (0, 0, 0); this means when we say “a cube located at (1, 2, 3)” we mean “a cube centered around (1, 2, 3)”. There are no hard rules to define a model space; it’s mostly application dependent. For example, if you have the model of a person, it’s sensible to place the origin at their feet. The transformed vertexes are now expressed in the “absolute” coordinate system of the scene (called world space).

Here are our two cubes:

Source code and live demo >>

Model transform

The scene definition as described above is quite limiting; in particular, since you can only specify the position of a cube, you can instantiate as many as you want, but they would all be oriented in the same way. In general, we want to have more control over the instances; we also want to specify their orientation and possibly their scale.

Conceptually, we can define a model transform with exactly these three elements: a scaling factor, a rotation around the model space origin, and a translation to a specific point on the scene:

instance {
    model = cube
    transform {
        scale = 1.5
        rotation = <45 degrees around the Y axis>
        translation = (1, 2, 3)

The previous version of the pseudocode can be easily extended to accommodate the new transforms. However, the order in which the transforms are applied is important; in particular, the translation must be done last. Here’s a \(45^\circ\) rotation around the origin followed by a translation along the Z axis:

And here’s the translation applied before the rotation:

We can write a more generic version of RenderInstance():

RenderInstance(instance) {
    projected = []
    model = instance.model
    for V in model.vertexes {
        V' = ApplyTransform(V, instance.transform);
    for T in model.triangles {
        RenderTriangle(T, projected)

The ApplyTransform() method looks like this:

ApplyTransform(vertex, transform) {
    V1 = vertex * transform.scale
    V2 = V1 * transform.rotation
    V3 = V2 + transform.translation
    return V3

The rotation is expressed as a 3x3 matrix; if you’re not familiar with rotation matrices, assume for now that any 3D rotation can be expressed as the multiplication of the point and a 3x3 matrix. See the Linear Algebra appendix for details.

Camera transform

The previous sections explored how we can position instances of models at different points in the scene. In this section we’ll explore how to move and rotate the camera within the scene.

Imagine you’re floating in the middle of a completely empty coordinate system. Everything is black. Suddenly, a red cube appears exactly in front of you:

(TODO: image)

A second later, the cube moves one unit towards you:

(TODO: image)

But… did the cube really move one unit towards you? Or did you move one unit towards the cube?

(TODO: image)

Since there are no points of reference at all, and the coordinate system isn’t visible, there’s no way to tell just by looking at what you see.

Now the cube rotates around you \(45^\circ\) clockwise. Or does it? Maybe it was you who rotated around it \(45^\circ\) counterclockwise? Again, there’s no way to tell:

(TODO: image)

What this thought experiment shows is that there’s no difference between moving the camera around a fixed scene, and keeping the camera fixed while rotating and translating the scene around it!

The advantage of this clearly self-centered vision of the universe is that by keeping the camera fixed at the origin and pointing at \(\vec{Z_+}\) we can immediately use the projection equations derived in the previous chapter without any modification. The coordinate system of the camera is called camera space.

Let’s assume the camera also has a transform attached to it, consisting of translation and rotation (we’ll ignore the scale). In order to render the scene from the point of view of the camera, we need to apply the opposite transforms to each vertex of the scene:

V1 = V - camera.translation
V2 = V1 * inverse(camera.rotation)
V3 = perspective_projection(V2)

The transform matrix

Let’s take a step back and consider what happens to a vertex \(V\) in model space until it’s projected into the canvas point \((cx, cy)\).

We first apply the model transform, to go from model space to world space:

V1 = V * instance.rotation
V2 = V1 * instance.scale
V3 = V2 + instance.translation

Then we apply the camera transform, to go from world space to camera space:

V4 = V3 - camera.translation
V5 = V4 * inverse(camera.rotation)

Next we apply the perspective equations:

vx = V5.x * d / V5.z
vy = V5.y * d / V5.z

And we finally map the viewport coordinates to canvas coordinates:

cx = vx * cw / vw
cy = vy * ch / vh

As you can see, it’s a lot of computation and a lot of intermediate values to be computed per vertex. Wouldn’t it be nice if we could reduce all of that to a single matrix multiplication - take \(V\), multiply it by some matrix, and obtain \(cx\) and \(cy\) directly?

Let’s express the transforms in terms of functions that take a vertex and return a transformed vertex. Let \(C_T\) and \(C_R\) be the camera translation and rotation, \(I_R\), \(I_S\) and \(I_T\) the instance rotation, scale and translation, \(P\) the perspective projection, and \(M\) the viewport-to-canvas mapping. If \(V\) is the original vertex and \(V'\) is the point on the canvas, we can express all the equations above this way:

\[ V' = M(P(C_R^{-1}(C_T^{-1}(I_T(I_S(I_R(V))))))) \]

Ideally, we’d like a single transform that does whatever the series of original transform does, but which has a much simpler expression:

\[ F = M \cdot P \cdot C_R^{-1} \cdot C_T^{-1} \cdot I_T \cdot I_S \cdot I_R \]

\[ V' = F(V) \]

Finding a single matrix that represents \(F\) isn’t trivial. The main problem is that the transforms are expressed in different ways: a translation is the sum of the point and a vector, a rotation and a scale are the product of the point and a 3x3 matrix, and the perspective projection involves a division. But if we could express all the transforms in the same way, and such a way had an easy mechanism to compose transforms, we’ll get what we want.

Homogeneous coordinates

Consider the expression \(A = (1, 2, 3)\). Does \(A\) represent a 3D point or a 3D vector? There’s no way to know without having some additional context.

But let’s adopt the following convention: we’ll add a fourth component to the representation, called \(w\). If \(w = 0\), we’re talking about a vector. If \(w = 1\), we’re talking about a point. So the point \(A\) is unambiguously represented as \(A = (1, 2, 3, 1)\) and the vector \(\vec{A}\) is represented as \((1, 2, 3, 0)\). Since points and vectors share the same representation, these are called homogeneous coordinates 1.

This representation makes a lot of geometrical sense. For example, subtracting two points yields a vector:

\[ (8, 4, 2, 1) - (3, 2, 1, 1) = (5, 2, 1, 0) \]

Adding two vectors yields a vector:

\[ (0, 0, 1, 0) + (1, 0, 0, 0) = (1, 0, 1, 0) \]

In the same way, it’s easy to see that adding a point and a vector yields a point, multiplying a vector by a scalar yields a vector, and so on.

So what do coordinates with \(w\) other than \(0\) or \(1\) represent? They also represent points; in fact, any point in 3D has an infinite number of representations in homogeneous coordinates. What matters is the ratio between the coordinates and the \(w\) value; that is, \((1, 2, 3, 1)\) and \((2, 4, 6, 2)\) represent the same point, as does \((-3, -6, -9, -3)\).

Of all of these representations, we call the one with \(w = 1\) the canonical representation of the point in homogeneous coordinates; converting any other representation to its canonical representation or to its cartesian coordinates is trivial:

\[ \begin{pmatrix} x \\ y \\ z \\ w \end{pmatrix} = \begin{pmatrix}{c} x \over w \\ y \over w \\ z \over w \\ 1 \end{pmatrix} = \begin{pmatrix} x \over w \\ y \over w \\ z \over w \end{pmatrix} \]

So we can convert cartesian coordinates to homogeneous coordinates, and back to cartesian coordinates. But how does this help us find a single representation for all the transforms?

Homogeneous rotation matrix

Let’s begin with a rotation matrix. Expressing a cartesian 3x3 rotation matrix in homogeneous coordinates is trivial; since the \(w\) coordinate of the point shouldn’t change, we add a column to the right, a row to the bottom, fill them with zeros and place a \(1\) in the lower-right element to keep the value of \(w\):

\[ \begin{pmatrix} A & B & C \\ D & E & F \\ G & H & I \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} \rightarrow \begin{pmatrix} A & B & C & 0 \\ D & E & F & 0 \\ G & H & I & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix} \]

Homogeneous scale matrix

A scaling matrix is also trivial in homogeneous coordinates, and it’s constructed in the same way as the rotation matrix:

\[ \begin{pmatrix} S_x & 0 & 0 \\ 0 & S_y & 0 \\ 0 & 0 & S_z \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x \cdot S_x \\ y \cdot S_y \\ z \cdot S_z \end{pmatrix} \rightarrow \begin{pmatrix} S_x & 0 & 0 & 0 \\ 0 & S_y & 0 & 0 \\ 0 & 0 & S_z & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x \cdot S_x \\ y \cdot S_y \\ z \cdot S_z \\ 1 \end{pmatrix} \]

Homogeneous translation matrix

These were easy; they were already represented as matrix multiplications in cartesian coordinates, we just had to add a \(1\) to preserve the \(w\) coordinate. But what can we do with a translation, which we had expressed as an addition in cartesian coordinates?

We’re looking for a 4x4 matrix such that

\[ \begin{pmatrix} T_x \\ T_y \\ T_z \\ 0 \end{pmatrix} + \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} A & B & C & D \\ E & F & G & H \\ I & J & K & L \\ M & N & O & P \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x + T_x \\ y + T_y \\ z + T_z \\ 1 \end{pmatrix} \]

Let’s focus on getting \(x + T_x\) first. This value is the result of multiplying the first row of the matrix by the point, that is,

\[ \begin{pmatrix} A & B & C & D \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = x + T_x \]

If we expand the vector multiplication, we get

\[ Ax + By + Cz + D = x + T_x \]

And from here we can deduce that \(A = 1\), \(B = C = 0\), and \(D = T_x\).

Following a similar reasoning for the rest of the coordinates, we arrive at the following matrix expression for the translation:

\[ \begin{pmatrix} T_x \\ T_y \\ T_z \\ 0 \end{pmatrix} + \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x + T_x \\ y + T_y \\ z + T_z \\ 1 \end{pmatrix} \]

Homogeneous projection matrix

Sums and multiplications are easy to express as products of matrices and vectors, which are, after all, sums and multiplications. But the perspective projection equations have a division by \(z\). How can we express that?

You may be tempted to think dividing by \(z\) is the same as multiplying by \(1/z\), which is indeed true; but it’s useless in this case, because the \(z\) coordinate of a specific point can’t be in the projection matrix that will be applied to every point.

Fortunately, homogeneous coordinates do have one instance of a division: the division by the \(w\) coordinate when converting back to cartesian coordinates. So if we can manage to make the \(z\) coordinate of the original point appear as the \(w\) coordinate of the “projected” point, we’ll get the projected \(x\) and \(y\) once we convert the point back to cartesian coordinates:

\[ \begin{pmatrix} A & B & C & D \\ E & F & G & H \\ I & J & K & L \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x.d \\ y.d \\ z \end{pmatrix} \rightarrow \begin{pmatrix} x.d \over z \\ y.d \over z \end{pmatrix} \]

Note that this matrix is \(3x4\); it can be multiplied by a 4-element vector (the transformed 3D point in homogeneous coordinates) and it will yield a 3-element vector (the projected 2D point in homogeneous coordinates) which is then converted to 2D cartesian coordinates dividing by \(w\). This gives exactly the values of \(x'\) and \(y'\) we were looking for.

The missing element here is \(z'\), which we know equals \(d\) by definition.

Using a reasoning similar to the one we employed to deduce the translation matrix, we can express the perspective projection as

\[ \begin{pmatrix} d & 0 & 0 & 0 \\ 0 & d & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} x.d \\ y.d \\ z \end{pmatrix} \rightarrow \begin{pmatrix} x.d \over z \\ y.d \over z \end{pmatrix} \]

Homogeneous Viewport-to-Canvas matrix

The last step is mapping the projected point on the viewport to the canvas. This is just a 2D scaling transform with \(S_x = c_w \over v_w\) and \(S_y = c_h \over v_h\). This matrix is thus

\[ \begin{pmatrix} c_w \over v_w & 0 & 0 \\ 0 & c_w \over v_w & 0 \\ 0 & 0 & 1 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} x.c_w \over v_w \\ y.c_h \over v_h \\ z \end{pmatrix} \]

In fact it’s easy to combine this with the projection matrix, to get a simple 3D-to-Canvas matrix:

\[ \begin{pmatrix} \over vw & 0 & 0 & 0 \\ 0 & \over vh & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix} . \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix} = \begin{pmatrix} \over vw \\ \over vh \\ z \end{pmatrix} \rightarrow \begin{pmatrix} ({x.d \over z})({cw \over vw}) \\ ({y.d \over z})({ch \over vh}) \end{pmatrix} \]

Practical matters

For practical reasons, we won’t be using the projection matrix. Instead, we’ll use the Model and Camera transforms, and then convert their results back to cartesian coordinates as follows:

\[ x' = { \over z.vw} \]

\[ y' = { \over z.vh} \]

This lets us do some more operations in 3D before projecting the points, which can’t be expressed as matrix transforms.

The transform matrix revisited

Since we now can express every 3D transform the original vertex \(V\) undergoes before being projected as a 4x4 matrix, we can trivially compose all these transforms in a single 4x4 matrix by multiplying them:

\[ F = C_R^{-1}.C_T^{-1}.I_T.I_S.I_R \]

And then transforming a vertex is just a matter of computing the following product:

\[ V' = F.V \]

Furthermore, we can decompose the transform in two parts:

\[ M_{Camera} = C_R^{-1}.C_T^{-1} \]

\[ M_{Model} = I_T.I_S.I_R \]

\[ M = M_{Camera}.M_{Model} \]

These matrices don’t need to be computed for every vertex (that’s the point of using a matrix after all). In fact, they don’t necessarily need to be computed for every frame.

\(M_{Camera}\) may change every frame; it depends on the camera position and orientation, so if the camera is moving or turning, it needs to be recomputed. Once computed, though, it remains constant for every object drawn in the frame, so it would be computed at most once per frame.

\(M_{Model}\) depends on the transform of a model’s instance, and as such the matrix you use will change once per object in the scene; however, it will remain constant for objects that don’t move (e.g. trees, buildings) so it can be precomputed and stored in the scene itself. For objects that move (e.g. cars in a racing game) it still needs to be computed every time they move (usually every frame).

A very high level of the scene rendering pseudocode would look like this:

RenderModel(model, transform) {
    projected = []
    for V in model.vertexes {
        projected.append(ProjectVertex(transform * V))
    for T in model.triangles {
        RenderTriangle(T, projected)

RenderScene() {
    MCamera = MakeCameraMatrix(camera.position, camera.orientation)

    for I in scene.instances {
        M = MCamera*I.transform
        RenderModel(I.model, M)

We can now draw a scene containing several instances of different models, possibly moving around and rotating, and we can move the camera throughout the scene.

Source code and live demo >>

It’s good progress, but we still have two important limitations. First, moving the camera around means we can have objects behind it, which causes all sorts of problems. Second, the rendering doesn’t look so great: it’s still wireframe.

In the next chapter we’ll deal with objects that shouldn’t be visible, and then we’ll spend the rest of the book making the rendered objects look better.

<< Perspective projection · Clipping >>
Computer Graphics from scratch · Introduction · Table of contents · Common concepts
Part I: Raytracing · Basic ray tracing · Light · Shadows · Reflection · Arbitrary camera · Beyond the basics · Raytracer pseudocode
Part II: Rasterization · Lines · Filled triangles · Shaded triangles · Perspective projection · Scene setup · Clipping · Hidden surface removal · Shading · Textures
Found an error? Everything is in Github.

  1. Homogeneous coordinates have a far deeper and far more involved geometric interpretation, but that’s outside the scope of this book; here we’ll just use them as a convenient tool with certain properties.