A projection is fundamental to cameras, mapping a 3D space onto a 2D image to render geometry. The projection matrix is widely used in computer graphics to encode such a transform between spaces. It is a linear transform that preserves straight lines, which both looks natural and is important for fast rasterization, unlike a fisheye projection, for example, which is non-linear.

A projection matrix is a 4×4 homogeneous matrix and can be pre-multiplied with other transformation matrices. Multiplying a point in world space, of the form $(x, y, z, 1)$, by a projection matrix produces clip space coordinates. For some projection matrices, a “perspective normalise” divide is required to convert from clip space to normalised device coordinates (NDC). The $(x, y)$ coordinates can be scaled by the image resolution for coordinates in pixels. This process is included in the discussion of spaces.
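
The final scaling step can be sketched as below. The function name `ndc_to_pixels` is hypothetical, and keeping $+Y$ up with the origin at the bottom-left of the image is an assumption; some APIs and image formats flip $y$.

```python
def ndc_to_pixels(x_ndc, y_ndc, width, height):
    """Map x and y from the -1 to 1 NDC range onto 0..width and 0..height."""
    # Shift -1..1 to 0..1, then scale by the image resolution.
    return ((x_ndc * 0.5 + 0.5) * width, (y_ndc * 0.5 + 0.5) * height)


# The NDC origin lands in the middle of a 640x480 image,
# and the (1, 1) corner lands on the image corner.
centre = ndc_to_pixels(0.0, 0.0, 640, 480)
corner = ndc_to_pixels(1.0, 1.0, 640, 480)
```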

The projection matrix implicitly defines a viewing volume and image boundaries at the -1 to 1 planes in NDC. That is, the image is formed by $(x, y)$ points inside the -1 to 1 range after being transformed by the projection matrix. Geometry outside the range is “clipped”, discussed later. $z$ may also be constrained to the -1 to 1 range for precision reasons with depth testing. When projected back into world space, these boundaries create a typical cube or frustum shaped viewing volume shown in many projection visualizations.

There are two common projection matrices used in 3D graphics: orthographic and perspective. An orthographic projection is commonly seen in mathematical diagrams as it preserves relative lengths in addition to straight lines. It is also useful in modelling packages to align geometry. A perspective projection looks more natural because objects in the distance become smaller, just like with typical rectilinear lenses and the human eye.

Orthographic

The orthographic matrix gives a cuboid viewing volume and is really just a scale matrix to frame the scene. The image is eventually formed by geometry in the -1 to 1 range after the projection matrix is applied. To define an orthographic projection, left, right, top and bottom distances ($L$, $R$, $T$, $B$) are chosen to map to the image borders. An orthographic matrix scales these down to the -1 to 1 range. It also performs a translation if the borders are not symmetric. Near and far distances ($N$, $F$) for the depth range are also chosen, particularly so that objects behind a camera are not drawn, but also to help hidden surface removal methods such as the depth buffer.

$\begin{bmatrix} \frac{2}{R-L} & 0 & 0 & \frac{L+R}{L-R} \\ 0 & \frac{2}{T-B} & 0 & \frac{B+T}{B-T} \\ 0 & 0 & \frac{2}{N-F} & \frac{N+F}{N-F} \\ 0 & 0 & 0 & 1 \end{bmatrix}$

The component $\frac{2}{N-F}$ is negative which inverts $z$ components so that the camera looks towards $-Z$.

Because the bottom row is $(0, 0, 0, 1)$ the depth range is scaled linearly, unlike in the perspective matrix later. This affects the depth buffer’s precision.
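
As a minimal sketch, the matrix above can be built and checked in plain Python. The helper names `orthographic` and `transform` and the example bounds are assumptions for illustration, and matrices are stored row-major to match the layout written above.

```python
def orthographic(L, R, B, T, N, F):
    """Row-major 4x4 orthographic matrix, term for term as written above."""
    return [
        [2 / (R - L), 0, 0, (L + R) / (L - R)],
        [0, 2 / (T - B), 0, (B + T) / (B - T)],
        [0, 0, 2 / (N - F), (N + F) / (N - F)],
        [0, 0, 0, 1],
    ]


def transform(m, v):
    """Multiply a 4x4 matrix by a column vector (x, y, z, w)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


# An asymmetric volume: x in [0, 4], y in [-1, 1], depth 1 to 11 in front
# of the camera, which looks towards -Z.
m = orthographic(0, 4, -1, 1, 1, 11)
near_corner = transform(m, [0, -1, -1, 1])   # left, bottom, near
far_corner = transform(m, [4, 1, -11, 1])    # right, top, far
```

The two opposite corners of the cuboid should land on opposite corners of the -1 to 1 cube, up to floating point rounding, and $w$ stays at 1 so no divide is needed.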

In some cases it is desirable to create a projection which matches world space units with width and height in pixels of the final image, as in the following matrix. Note $B=\mathsf{height}$ rather than $T$ so that $+Y$ is down. While this doesn’t make much sense for computer graphics it is essential for text alignment as we read from top to bottom.

$\begin{array}{ll} L=0 & R=\mathsf{width} \\ B=\mathsf{height} & T=0 \\ N=-1 & F=1 \end{array}$
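
As a worked check, substituting these values into the orthographic matrix above gives:

$\begin{bmatrix} \frac{2}{\mathsf{width}} & 0 & 0 & -1 \\ 0 & -\frac{2}{\mathsf{height}} & 0 & 1 \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$

The pixel position $(0, 0)$ maps to $(-1, 1)$ in NDC, the top-left corner of the image, and $(\mathsf{width}, \mathsf{height})$ maps to $(1, -1)$, the bottom-right corner.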

Perspective

A perspective matrix is typically symmetric, defined by a vertical field of view, $\mathsf{fov}_y$, an aspect ratio, $a$, which is discussed later, and near and far ($N$, $F$) boundaries. A perspective projection is rarely asymmetric, so the more general frustum viewing volume is not provided here.

$f = \operatorname{cot}(\frac{\mathsf{fov}_y}{2}) = \frac{1}{\tan(\frac{\mathsf{fov}_y}{2})}$

$a = \frac{\mathsf{width}}{\mathsf{height}}$

$\begin{bmatrix} \frac{f}{a} & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & \frac{N+F}{N-F} & \frac{2NF}{N-F} \\ 0 & 0 & -1 & 0 \end{bmatrix}$

Note the $(0, 0, -1, 0)$ bottom row which makes the $w$ component of a transformed vector dependent on $z$. This is what causes objects to become smaller in the distance as $x$ and $y$ are divided by $w$ during the perspective divide, discussed later.
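
The matrix above might be written out as follows. This is a sketch only: the helper names `perspective` and `transform` are hypothetical, the field of view is assumed to be in radians, and the layout is row-major to match the matrix as printed.

```python
import math


def perspective(fov_y, a, N, F):
    """Row-major 4x4 symmetric perspective matrix; fov_y is in radians."""
    f = 1.0 / math.tan(fov_y / 2.0)
    return [
        [f / a, 0, 0, 0],
        [0, f, 0, 0],
        [0, 0, (N + F) / (N - F), 2 * N * F / (N - F)],
        [0, 0, -1, 0],
    ]


def transform(m, v):
    """Multiply a 4x4 matrix by a column vector (x, y, z, w)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


m = perspective(math.radians(90), 1.0, 1.0, 10.0)

# Two points with the same x and y, at distances 2 and 8 in front of the camera.
near = transform(m, [1.0, 1.0, -2.0, 1.0])
far = transform(m, [1.0, 1.0, -8.0, 1.0])
```

Because of the bottom row, `near[3]` and `far[3]` hold the distances 2 and 8, so the further point ends up with smaller $x$ and $y$ once divided by $w$.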

Clip Space and the Perspective Divide

After a vector is transformed by a projection matrix it is in clip space. It is a 4D space and called clip space because this is where geometry is “clipped” at the borders of the image. Some form of clipping is necessary as it would be inefficient to perform computation on geometry outside the image. Clipping is more important to rasterizers as triangles are transformed into image space for pixel–polygon intersection tests (although the intersections are generated) rather than testing in world space as a raytracer does. Exactly why clipping is performed in clip space becomes more apparent after the perspective divide is introduced.

The perspective divide “normalizes” a clip space vector by dividing by its $w$ component so that the new $w$ is 1:

$v_\mathsf{NDC} = \frac{v_\mathsf{clip}}{v_{\mathsf{clip}_w}}$

As said earlier, this scales down objects in the distance in the case of a perspective projection.
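
A small sketch of the divide, using hand-picked clip space vectors rather than the output of a real matrix (the function name `perspective_divide` is an assumption):

```python
def perspective_divide(clip):
    """Divide every component by w so that the resulting w is 1."""
    x, y, z, w = clip
    return (x / w, y / w, z / w, 1.0)


# The same clip space x and y at two different w values, i.e. two depths.
close = perspective_divide((1.0, 1.0, 0.0, 2.0))
distant = perspective_divide((1.0, 1.0, 0.0, 8.0))

# With an orthographic matrix w is already 1, so the divide changes nothing.
unchanged = perspective_divide((0.25, -0.5, 0.1, 1.0))
```

The vector with the larger $w$ lands closer to the centre of the image, which is the shrinking with distance described above.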

Precision

In addition to scaling $x$ and $y$ the perspective divide also affects $z$, creating a hyperbolic mapping which has some beneficial properties for precision when comparing $z$ values. When truncated to integers, the number 3.1 is not less than 3.2 as only the 3 is compared. The resolution of possible values is one, i.e. numbers have to be at least one apart before they are distinct. Using linear $z$ the resolution is the same for objects close to the camera as those way off in the distance. However high precision in the distance is often not necessary as geometry there is sparse, while detailed geometry is drawn up close. By scaling $z$ so that possible values are closer together near the camera, the precision is better optimized for typical scenes.
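
To make the hyperbolic mapping concrete, the sketch below evaluates the third and fourth rows of the perspective matrix for a point at a given distance in front of the camera. The function name `ndc_depth` and the values $N = 0.1$, $F = 100$ are assumptions chosen for illustration.

```python
def ndc_depth(distance, N, F):
    """NDC z for a point `distance` in front of the camera (eye z = -distance)."""
    z = -distance
    clip_z = (N + F) / (N - F) * z + 2 * N * F / (N - F)
    clip_w = -z
    return clip_z / clip_w


N, F = 0.1, 100.0
at_near = ndc_depth(N, N, F)              # approximately -1
at_far = ndc_depth(F, N, F)               # approximately  1
halfway = ndc_depth((N + F) / 2, N, F)    # already very close to 1
```

Setting the result to zero gives a distance of $\frac{2NF}{N+F}$, roughly 0.2 with these values, so about half of the NDC depth range is spent on the first tenth of a unit beyond the near plane.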

Clipping

One method is to clip polygons at the image boundaries, as in the image below. However the resulting shapes may be quads and additional vertices and triangles are needed. Clipping geometry is expensive. Alternatively the rasterizer may simply not produce fragments for pixel positions outside the image. I.e. triangles completely outside the image are culled and those that intersect it are still sent to the rasterizer which can efficiently ignore their area outside the image. This works fine for triangles bordering the $X$ and $Y$ boundaries, but there are a few problems in the $Z$ direction.

Clipping triangles to the image borders

The perspective projection preserves straight lines, with the exception of lines that cross the $z=0$ plane, for example a triangle with some vertices in front of the camera and some behind. The point behind will have its $(x, y)$ position inverted and edge interpolation will be incorrect. This makes it impossible to perform triangle clipping in image space or NDC and elevates the importance of clipping in clip space, as it’s named. Triangles that bridge the $z=0$ and $z=N$ region and are within the $X$ and $Y$ image boundaries must be clipped to the near plane before the perspective divide.
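
The core step of near plane clipping can be sketched for a single edge. This is not a full triangle clipper, and it assumes the convention where visible points satisfy $-w \le z \le w$ in clip space, so the near plane is $z + w = 0$. The function names are hypothetical. Linear interpolation is valid here because the perspective divide has not happened yet.

```python
def near_distance(v):
    """z + w: zero on the near plane, positive in front of it."""
    return v[2] + v[3]


def clip_edge_to_near(a, b):
    """Return the part of edge a-b in front of the near plane, or None."""
    da, db = near_distance(a), near_distance(b)
    if da < 0 and db < 0:
        return None          # entirely behind the near plane
    if da >= 0 and db >= 0:
        return a, b          # entirely in front, nothing to clip
    # Parameter along the edge where z + w crosses zero.
    t = da / (da - db)
    hit = tuple(a[i] + t * (b[i] - a[i]) for i in range(4))
    return (a, hit) if da >= 0 else (hit, b)


# One vertex in front of the camera (w = 2) and one behind it (w = -1).
a = (0.0, 0.0, 1.0, 2.0)
b = (3.0, 0.0, -5.0, -1.0)
kept_a, kept_b = clip_edge_to_near(a, b)
```

The replacement vertex `kept_b` lies on the near plane and has a positive $w$, so the later divide no longer inverts its $(x, y)$ position.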

Triangles which intersect the near and far clipping planes may be rasterized without geometry clipping and clipped by discarding fragments outside the depth range.

Clipping implementation is discussed in more detail by Fabian Giesen at his blog.

Aspect Ratio

The aspect ratio is the ratio between the width and height of the image, specifically width:height or $\mathsf{width}/\mathsf{height}$. For example, some common resolutions are 640×480, 1680×1050 and 1920×1080 with aspect ratios of 4:3, 16:10 and 16:9 respectively. The projection normally encodes the image aspect ratio so that, after conversion to NDC, scaling to the image resolution produces an undistorted image. A seeming alternative may be to apply the aspect ratio when scaling NDC, but clipping must be performed in clip space so the aspect ratio must be applied beforehand.
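
A quick way to see the effect is to project a point that is equally far right and up from the view axis and check that it lands equally far from the image centre in pixels. The sketch below assumes a symmetric perspective projection with the $\frac{f}{a}$ and $f$ scale terms; the function name and the 90 degree field of view are illustrative choices.

```python
import math


def project_to_pixel_offset(point, fov_y, width, height):
    """Offset from the image centre, in pixels, of an eye space point."""
    x, y, z = point
    f = 1.0 / math.tan(fov_y / 2.0)
    a = width / height
    # x, y and w rows of the perspective matrix applied to (x, y, z, 1).
    clip_x, clip_y, clip_w = f / a * x, f * y, -z
    ndc_x, ndc_y = clip_x / clip_w, clip_y / clip_w
    # Scale the -1 to 1 NDC range to half the resolution on each axis.
    return ndc_x * width / 2.0, ndc_y * height / 2.0


dx, dy = project_to_pixel_offset((1.0, 1.0, -5.0), math.radians(90), 1920, 1080)
```

With the aspect ratio in the matrix, `dx` and `dy` come out equal, so a square in front of the camera stays square in the 16:9 image.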

Projection Matrix