Camera basics

Table of contents

  1. What is a camera?
    1. The pinhole camera
  2. Sensor
    1. Coordinates
    2. Technologies
      1. CCD
      2. CMOS
    3. Color
  3. Lens
    1. Distortion
    2. Chromatic aberration
  4. Aperture
    1. Zoom lens vs prime lens
  5. Shutter
    1. Mechanical shutter
    2. Electronic shutter
  6. Photography basics

What is a camera?

A modern definition of a camera is any device capable of collecting light rays coming from a scene, and recording an image of it. The sensor used for the recording can be either digital (e.g. CMOS, CCD), or analog (film).

The pinhole camera

The term camera is derived from the Latin term camera obscura, literally translating to “dark room”. The earliest examples of cameras were just that: a hole in a room or box, projecting an image onto a flat surface.

Using only a small hole (pinhole) blocks off most of the light, but also constrains the geometry of rays, leading to a 1-to-1 relationship between a point on the sensor (or wall!) and a direction. Given a 3D point $(x,y,z)$ in space, the point on the sensor $(u, v)$ is:

\[\begin{cases} u = f \frac{x}{z}\\ v = f \frac{y}{z} \end{cases}\]

in which $f$ is the focal length: the distance from the pinhole to the sensor. Zooming in corresponds to increasing the focal length. Conversely, short focal lengths are associated with wide-angle photography.

The equation shows that multiple 3D coordinates fall onto the same sensor point; cameras turn the 3D world into a flat, 2D image. Let’s make the sensor coordinate system more general, by introducing an origin $(u_0,v_0)$ and allowing different focal lengths along $x$ and $y$, which is necessary to describe sensors with non-square pixels. The complete pinhole camera model can then be summarized by a single matrix multiplication:

\[\begin{bmatrix} uw\\ vw\\ w \end{bmatrix} = \begin{bmatrix} f_x & 0 & u_0\\ 0 & f_y & v_0\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x\\ y\\ z \end{bmatrix} := K \begin{bmatrix} x\\ y\\ z \end{bmatrix}\]

The matrix $K$ is known as the intrinsic parameters matrix. Let’s complete our model by adding an arbitrary rotation/translation to the world coordinate system. A single matrix multiplication can relate world coordinates $(x_w,y_w,z_w)$ to camera-centric coordinates $(x,y,z)$:

\[\begin{bmatrix} x\\ y\\ z \end{bmatrix} = \begin{bmatrix} R_{11} & R_{12} & R_{13} & t_x\\ R_{21} & R_{22} & R_{23} & t_y\\ R_{31} & R_{32} & R_{33} & t_z\\ \end{bmatrix} \begin{bmatrix} x_w\\ y_w\\ z_w\\ 1 \end{bmatrix} := \begin{bmatrix} R | t \end{bmatrix} \begin{bmatrix} x_w\\ y_w\\ z_w\\ 1 \end{bmatrix}\]

where $R$ is an orthogonal rotation matrix, and $t$ a translation vector. The $\begin{bmatrix}R|t \end{bmatrix}$ matrix is known as the extrinsic parameters matrix. We can combine intrinsic and extrinsic parameters in a single equation:

\[\begin{bmatrix} uw\\ vw\\ w \end{bmatrix} = K \begin{bmatrix} R|t \end{bmatrix} \begin{bmatrix} x_w\\ y_w\\ z_w\\ 1 \end{bmatrix}\]
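To make the bookkeeping concrete, here is a minimal Python/numpy sketch of the full projection $K[R|t]$. The intrinsic and extrinsic values are arbitrary placeholders, not calibrated parameters of any real camera.

```python
import numpy as np

# Example (made-up) intrinsic parameters: focal lengths and principal point in pixels.
fx, fy = 800.0, 800.0
u0, v0 = 320.0, 240.0
K = np.array([[fx, 0.0, u0],
              [0.0, fy, v0],
              [0.0, 0.0, 1.0]])

# Example extrinsic parameters: camera aligned with the world axes,
# world origin 2 units in front of the camera.
R = np.eye(3)
t = np.array([0.0, 0.0, 2.0])
Rt = np.hstack([R, t.reshape(3, 1)])   # 3x4 matrix [R|t]

def project(point_world):
    """Project a 3D world point to pixel coordinates (u, v)."""
    p = np.append(point_world, 1.0)    # homogeneous world coordinates
    uvw = K @ Rt @ p                   # [u*w, v*w, w]
    return uvw[:2] / uvw[2]            # perspective division

print(project(np.array([0.1, -0.2, 1.0])))  # -> approximately [346.7, 186.7]
```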

When using more than one camera, it is useful to have a single world coordinate system while letting each camera have its own sensor coordinate system. As explained in the next section, if intrinsic and extrinsic parameters are known for every camera looking at the scene, 3D reconstruction can be achieved through triangulation.

Sensor

Coordinates

Continuous sensor coordinates make sense when simply projecting an image or recording it with a film. If using a digital sensor, a natural choice for the sensor coordinate system is the pixel indices. Those discrete, unitless values can be related to the physical sensor by defining an equivalent focal length in pixel units:
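For example, assuming square pixels of physical pitch $s$ (expressed in the same unit as $f$), the focal length in pixel units would simply be:

\[f_{\rm px} = \frac{f}{s}\]

Sensors with non-square pixels would give two distinct values, $f_x = f / s_x$ and $f_y = f / s_y$, matching the two focal lengths appearing in $K$.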

The image plane is an imaginary construct sitting in front of the sensor, at one focal length (in pixels) away from the camera’s coordinate system. Because it sits in front of the camera, the image is upright again.

It is common to choose the $z$ axis to point toward the scene, and the $y$ axis to point downward. This matches the conventional downward-pointing vertical coordinates in pixel coordinates, with $(u,v)=(0,0)$ in the top-left corner.

Technologies

source: https://www.automate.org/vision/blogs/ccd-vs-cmos-image-sensors-which-are-better

We’ll focus on the two main families of digital sensors: CCD and CMOS. In both families, the actual light sensing is based on the electron-hole pair generation in MOS photodiodes. The main difference is how this charge is converted to a signal, with tradeoffs in complexity, signal-to-noise ratio and readout speed.

The ISO of a sensor is a metric characterizing its sensitivity to light. The same metric is used for both analog and digital sensors. The ISO of a film is a function of its chemistry, while the ISO of a digital sensor is set by the gain applied to its signal. Standard ISO values follow a logarithmic scale: 100, 200, 400, 800, 1600, 3200, etc.

CCD

source: https://www.princetoninstruments.com/learn/camera-fundamentals/ccd-the-basics

In CCD sensors, the generated charges in the photodiodes are accumulated under a potential well, controlled by a voltage on the gate.

Charges can be moved to a neighboring pixel by performing a specific sequence on the gates. By shifting the charges all the way to the edge of the sensor, individual pixel values can be read out sequentially.

Advantages of CCD sensors include the simplicity of their design, and the large surface dedicated to sensing light. One disadvantage is the readout speed bottleneck caused by using a single decoding unit.

CMOS

source: Coath, Rebecca, et al. “Advanced pixel architectures for scientific image sensors.” (2009).

In a CMOS sensor, each pixel is in charge (pun intended) of collecting light and converting it to a signal. The added complexity made them impractical compared to CCD for a long time, but they have now fully caught up.

The main principle is as follows: the charge accumulated by the photodiode directly controls the gate of an amplifier. In other terms, the current induced by incoming light charges up the gate capacitance of the amplifier. This charge remains until a reset is initiated.

The output value is read by selecting the pixel. Usually, an entire row is read out at once.

With a 4-transistor architecture, the exposure time can be controlled by decoupling the photodiode from the amplifier’s gate on command.

Color

The most common way of capturing color images with a digital sensor is a Bayer filter, interleaving color filters in front of pixels in this pattern:

source: https://en.wikipedia.org/wiki/Bayer_filter

For every red or blue pixel, there are two green ones. This is to mimic the human eye’s increased sensitivity to green light.
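As a small illustration, here is a numpy sketch splitting a raw mosaic into its four Bayer planes. It assumes an RGGB layout; real sensors may use other orderings (BGGR, GRBG, ...), so the indexing would change accordingly.

```python
import numpy as np

def split_bayer_rggb(raw):
    """Split a raw Bayer mosaic (2D array, even height/width) into R, G1, G2, B planes."""
    r  = raw[0::2, 0::2]   # red pixels
    g1 = raw[0::2, 1::2]   # green pixels on red rows
    g2 = raw[1::2, 0::2]   # green pixels on blue rows
    b  = raw[1::2, 1::2]   # blue pixels
    return r, g1, g2, b

raw = np.arange(16, dtype=np.uint16).reshape(4, 4)   # tiny fake mosaic
r, g1, g2, b = split_bayer_rggb(raw)
print(r.shape, g1.shape, g2.shape, b.shape)          # each plane is 2x2
```

Note that half of the extracted planes are green, reflecting the 2:1:1 ratio described above. Reconstructing a full-resolution color image from these planes is the job of a demosaicing algorithm.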

Without getting into the hellscape of color spaces, here is a standard formula to convert between RGB (red, green, blue) and YUV (luminance, chrominance) values:

\[\begin{bmatrix} Y\\ U\\ V \end{bmatrix} = \begin{bmatrix} 0.299 & 0.587 & 0.114\\ -0.14713 & -0.28886 & 0.436\\ 0.615 & -0.51499 & -0.10001 \end{bmatrix} \begin{bmatrix} R\\ G\\ B \end{bmatrix}\]

The luminance $Y$ can be thought of as a grayscale value. The coefficients in the matrix show that green values have twice the impact of red ones, and that blue values are the weakest.
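The matrix translates directly into code; here is a minimal numpy sketch, assuming RGB values stored as floats (integer images would need scaling first).

```python
import numpy as np

# Same coefficients as the matrix above.
RGB_TO_YUV = np.array([
    [ 0.299,    0.587,    0.114  ],
    [-0.14713, -0.28886,  0.436  ],
    [ 0.615,   -0.51499, -0.10001],
])

def rgb_to_yuv(image):
    """image: (H, W, 3) array of RGB values; returns (H, W, 3) YUV values."""
    return image @ RGB_TO_YUV.T   # per-pixel matrix multiplication

pixel = np.array([[[1.0, 0.0, 0.0]]])   # a single pure-red pixel
print(rgb_to_yuv(pixel))                # -> Y = 0.299, U = -0.14713, V = 0.615
```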

Lens

Pinhole cameras presented earlier capture very little light, needing long exposure times (sometimes hours!). They also suffer from blurry details, and vignetting toward the borders of the image: the hole’s effective size reduces as the incident angle increases.

To gather more light, a lens can be used. The goal of the lens is to take light rays emitted by a point in the scene, and focus those rays back into a single point on the sensor:

The lens equation provides a relationship between the object distance $d_o$ and the image distance behind the lens $d_i$:

\[\frac{1}{f} = \frac{1}{d_o} + \frac{1}{d_i}\]

where $f$ is the focal length of the lens. Note how $d_i$ tends toward $f$ as $d_o$ tends toward infinity: for very far objects, the adequate distance between the lens and the image plane is equal to the focal length. This brings us back to the pinhole camera model, in which the focal length was simply the distance between the hole and the sensor.
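A few lines of Python make this limiting behavior explicit; a 50 mm lens is assumed, and the object distances are arbitrary.

```python
def image_distance(f, d_o):
    """Solve 1/f = 1/d_o + 1/d_i for d_i (all distances in meters)."""
    return 1.0 / (1.0 / f - 1.0 / d_o)

f = 0.05  # 50 mm lens
for d_o in (0.5, 1.0, 5.0, 100.0):
    print(f"d_o = {d_o:6.1f} m  ->  d_i = {image_distance(f, d_o) * 1000:.2f} mm")
# d_i shrinks from 55.56 mm toward 50 mm (the focal length) as d_o grows.
```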

As scene objects get closer to the camera, the lens needs to be moved away from the sensor to keep them in focus. This also causes a slight zoom effect, familiar to seasoned photographers.

The plane of focus (or focus point) is the part of the scene that is in perfect focus:

source: https://greatbigphotographyworld.com/depth-of-field-how-what-when/

When a light-emitting point is either in front of or behind this plane, it shows up as a blurry spot on the sensor, called the circle of confusion. When this circle of confusion is no larger than a pixel, the scene’s point is still considered to be in focus. This defines a region of the scene in which blur is imperceptible: this region is delimited by a plane in front of the plane of focus, and one behind it. The distance between those two planes is called the depth of field.
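To make the geometry concrete, here is a hedged sketch estimating the circle of confusion with the thin lens equation and similar triangles. The aperture diameter, focus distance and pixel pitch are illustrative values, not taken from a specific camera.

```python
def image_distance(f, d_o):
    """Solve 1/f = 1/d_o + 1/d_i for d_i (distances in meters)."""
    return 1.0 / (1.0 / f - 1.0 / d_o)

def circle_of_confusion(f, A, d_focus, d_o):
    """Blur diameter on the sensor for a point at d_o when focused at d_focus."""
    d_sensor = image_distance(f, d_focus)   # where the sensor sits
    d_i = image_distance(f, d_o)            # where this point actually converges
    return A * abs(d_sensor - d_i) / d_i    # similar triangles over the light cone

f = 0.05                      # 50 mm lens
A = 0.05 / 2.8                # aperture diameter at f/2.8
pixel_pitch = 4e-6            # 4 um pixels (illustrative)
for d_o in (1.8, 2.0, 2.5):   # lens focused at 2 m
    c = circle_of_confusion(f, A, d_focus=2.0, d_o=d_o)
    print(f"d_o = {d_o} m -> blur = {c * 1e6:5.1f} um, in focus: {c <= pixel_pitch}")
```

Sweeping $d_o$ until the blur exceeds the pixel pitch gives the near and far limits of the depth of field.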

Distortion

As lenses don’t exactly bend light rays following the pinhole camera model, they introduce distortion. This is modeled as a shift in $(u, v)$ coordinates between the ideal pinhole model, and the observed coordinates. Barrel distortion is the most familiar type of distortion, often visible in wide angle photography.

There are plenty of lens distortion models, with varying complexity and number of parameters. One model that is surprisingly simple and effective is the division model:

\[\begin{cases} u_u = u_{cd} + (u - u_{cd})\alpha\\ v_u = v_{cd} + (v - v_{cd})\alpha \end{cases}\]

with $(u_{cd}, v_{cd})$ being the center of distortion, $(u, v)$ the distorted sensor coordinate, and $(u_u, v_u)$ its undistorted counterpart, matching a pinhole model. The distortion coefficient $\alpha$ is a function of the radial distance from the center of distortion:

\[\alpha = \frac{1}{1+k_1 r^2 + k_2 r^4}\]

with:

\[r = \sqrt{(u-u_{cd})^2 + (v-v_{cd})^2}\]
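The model is a direct transcription into code. The distortion center and coefficients below are made-up example values; real ones come from calibration (a small negative $k_1$ corresponds to mild barrel distortion).

```python
def undistort_division(u, v, u_cd, v_cd, k1, k2):
    """Map a distorted sensor coordinate (u, v) to its pinhole-model counterpart."""
    r2 = (u - u_cd) ** 2 + (v - v_cd) ** 2       # squared radial distance
    alpha = 1.0 / (1.0 + k1 * r2 + k2 * r2 ** 2)
    return u_cd + (u - u_cd) * alpha, v_cd + (v - v_cd) * alpha

# A point near the image corner gets pushed slightly outward (barrel distortion).
print(undistort_division(600.0, 400.0, u_cd=320.0, v_cd=240.0, k1=-1e-7, k2=0.0))
```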

Chromatic aberration

The refractive index of the material used in the lens can slightly differ as a function of the wavelength of the incoming light, causing separation of colors:

source: https://www.studiobinder.com/blog/what-is-chromatic-aberration-effect/

These effects can be tackled by combining multiple optical elements. Software correction is also possible, if the lens was properly calibrated beforehand.

Aperture

By using a diaphragm, the amount of light entering the lens can be controlled, effectively emulating a lens of a smaller diameter. The opening left by the diaphragm is called the aperture:

source: https://www.adorama.com/alc/camera-basics-aperture/

Aperture values are often expressed as f-numbers, defined as the ratio between the focal length of the lens and the aperture diameter:

\[f_{\rm number} = \frac{f}{d_{\rm aperture}}\]

This quantity is directly related to the light density reaching the sensor, and lets a photographer estimate the amount of light captured, independently of the focal length.
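A couple of helper functions illustrate the relationship (using the definition above; the lens values are arbitrary examples):

```python
import math

def aperture_diameter(f_mm, f_number):
    """Physical opening of the diaphragm for a given focal length and f-number."""
    return f_mm / f_number

def stops_between(n_from, n_to):
    """Exposure difference, in stops, going from f/n_from to f/n_to.
    Light gathered scales with aperture area, i.e. with 1/N^2."""
    return math.log2((n_from / n_to) ** 2)

print(aperture_diameter(50, 2.0))   # 25.0 mm opening on a 50 mm lens at f/2
print(stops_between(2.8, 2.0))      # ~ +1 stop: f/2 gathers about twice the light of f/2.8
```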

When scene points are out of focus, their circle of confusion takes the shape of the aperture. This is known as a bokeh effect, and is especially visible for scenes containing distinct, bright points. Notice how the shape of the blades is visible in this picture:

source: https://clideo.com/resources/what-is-bokeh-photography-effect

Zoom lens vs prime lens

Lenses capable of zooming are very common, but introduce significant complexity. This leads either to an increased price, or to compromises in sharpness or aperture. A prime lens, on the other hand, does not offer zoom capabilities, but often has superior image quality and calibration.

source: https://www.slrlounge.com/glossary/prime-lens-definition/

Shutter

Light sensors integrate light continuously. To obtain a useful image, the exposure start and end times need to be well defined.

Mechanical shutter

A straightforward way of blocking all light coming to the sensor is to hide it behind a curtain. In early photography, this was done manually by sliding a plate in front of the sensor or the lens. For more precise control, mechanical shutters were developed, with carefully controlled timing.

A popular type of mechanical shutter is a focal plane shutter, in which two curtains are moving in front of the sensor. The first curtain starts the exposure, and the second curtain ends it. The exposure duration is modulated by changing the distance between the two curtains:

source: https://www.youtube.com/watch?v=CmjeCchGRQo

As different parts of the sensor are exposed at different times, they capture different instants. This is known as the rolling shutter effect, and leads to distorted images.
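A toy timing model shows where the skew comes from; the curtain travel time and exposure below are made-up values, not specifications of a real shutter.

```python
# With a focal plane (or row-by-row electronic) shutter, each sensor row starts
# its exposure slightly later than the previous one.
rows = 4000                    # sensor height in pixels
curtain_travel_s = 1 / 250     # time for the curtains to sweep the whole sensor
exposure_s = 1 / 1000          # gap between the two curtains

for row in (0, 2000, 3999):
    start = row / rows * curtain_travel_s
    print(f"row {row:4d}: exposed from {start * 1e3:.2f} ms to {(start + exposure_s) * 1e3:.2f} ms")
# A fast-moving subject changes position between the first and last rows,
# which is what produces the skewed images.
```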

Leaf shutters are an often-overlooked alternative to focal plane shutters, implemented directly in the lens, near the diaphragm. The carefully-designed shape of the leaves ensures a consistent exposure time over the whole area:

source: Hasselblad

Although they add some complexity, leaf shutters don’t suffer from rolling shutter artifacts, as all parts of the sensor are exposed simultaneously.

Electronic shutter

Controlling the exposure duration electronically is becoming a new standard, and eliminates the need for moving parts in front of the sensor. Electronic shutters have been implemented in both CCD and CMOS sensors.

Global shutter is the holy grail of exposure strategies, in which each pixel is shuttered simultaneously, fully eliminating rolling shutter artifacts. Very high speed photography benefits from this; for example, the Sony α9 III offers shutter speeds of 1/80,000 of a second.

Photography basics

Photography mainly comes down to setting three parameters on the camera:

  • Aperture
  • Exposure time (shutter speed)
  • ISO

Each parameter can be converted to a $\log_2$ scale. A common name for a unit on that scale is a stop. For example, increasing exposure by one stop can be achieved by doubling the exposure time, doubling the ISO, or widening the aperture diameter by a factor of $\sqrt{2}$.
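A small helper makes the $\log_2$ bookkeeping explicit; the reference settings are arbitrary.

```python
import math

def stops_of_light(f_number, exposure_time_s, iso, ref=(4.0, 1 / 125, 100)):
    """Exposure relative to a reference setting, in stops (+1 = twice the light/brightness)."""
    ref_n, ref_t, ref_iso = ref
    return (2 * math.log2(ref_n / f_number)        # wider aperture -> more light
            + math.log2(exposure_time_s / ref_t)   # longer exposure -> more light
            + math.log2(iso / ref_iso))            # higher ISO -> brighter image

print(stops_of_light(4.0, 1 / 125, 100))   # 0.0 by definition
print(stops_of_light(2.8, 1 / 125, 100))   # ~ +1 stop from opening the aperture
print(stops_of_light(4.0, 1 / 250, 200))   # halved time, doubled ISO: ~0 net change
```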

Unless shooting in a studio, the photographer has no control over the amount of light available, and has to make choices about the three settings. While arbitrarily increasing all three sounds like an easy way to get enough exposure, there are tradeoffs to consider:

  • Increasing exposure time: more motion blur
  • Increasing aperture: more out-of-focus blur (less depth of field)
  • Increasing ISO: more noise

Any modern camera measures the amount of light available and offers automatic tuning of the three settings. Seasoned photographers often opt for full manual control, but a good compromise is to fix two settings and let the camera choose the third. In shutter priority mode (Tv or S), the user chooses the shutter speed, and the aperture is decided by the camera. In aperture priority mode (Av or A), the user sets the aperture, and the shutter speed is chosen automatically.