In the past months I have spent the evenings and weekends on a little project: a JPEG decoder and encoder, written in Rust.

First, I should drop a little disclaimer: at the time I’m writing this post, I have successfully decoded multiple test images, but these are fairly standard type images, so more exotic and advanced parts of the JPEG and JFIF standard are yet to be implemented properly, or even at all. Therefore, current program design decisions, as well as explanations of formats and techniques, *may* be executed poorly, as they are based on what I currently know, and the subset of functionality I have implemented. Hopefully, I am not too far off.

When I started working on this project, I had not decided how this post should be. Should it be a step-by-step kind of guide, or more of a writeup of a working program? Initially, I wanted the former, but I changed my mind as I was progressing, because I struggeled with understanding how the decoding process worked, which lead to strange design decisions, commits going back and fourth, and weird naming conventions. I do not believe this process would make a pleasant reading experience for anyone. Instead I want to write a post on how it has turned out so far. This first post will cover the background needed to understand this mess that is JPEG.

## Why?

But first, why am I writing this? The choice of writing a JPEG encoder/decoder is somewhat arbitrary. In fact, if I knew what I know now, I think I would have chosen a different format than JPEG. This is mostly because this was supposed to be a weekend, or maybe weeklong, project; as I’m writing this, there is approximately five weeks since the initial git commit.

The choice of using Rust, however, is not arbitrary. I have found Rust to be an expressive, performant, and fun language to use; it allows high level abstractions and patterns, while still keeping that raw, low-level feeling you get from e.g C. I guess the project of writing a JPEG encoder/decoder was a great excuse for writing a non-trivial program in Rust.

And as always, I want to improve my writing. Writing is hard! Feedback is of course very welcome, even though there is no comment section here.

Let us look a little closer on JPEG.

## JPEG?

JPEG is a method for lossy image compression; it is not, as you might believe, an image format. There are however image formats which use JPEG, such as JPEG/Exif — this is what your digital camera spits out (unless you are only capturing RAW) — and JPEG/JFIF — the file format we will work with. JFIFis the most common format of transmitting images on the web. This is great knowledge to pull out when someone says “JPEG file” or something similiar: “Uhmm, actually … “.

Let’s see how a JFIF file is laid out, and then what the JPEG data format looks like.

## The JFIF Part

Apart from actual image data, the JFIF file contains data such as image dimensions, comment, different tables, and more. A JFIF file consists of *segments* . Each segment contains a marker, a length, and data. The marker is two bytes, and is used to identify the segment. The length is two bytes, and specifies how long the segment is, excluding the marker bytes, but including the length bytes. The data fills the rest of the segment, according to the length.

For instance, the marker for “Comment” is `0xfffe`

, making the bytes for specifying “Hello, World!” as a comment:

ff fe 00 10 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 00

The format of the data varies with the different segment types. When implementing the decoder, I simply read parts of the JPEG specification, which is available here (pdf). See page 38 (marked as page 34) for an overview of the format. The format of different markers follow for the next 13 pages.

There is no need to go into too much detail just yet. We can simply start with saying that an image consists of a *frame* , which again consists of one or more *scans* . Each scan contains one or more *entropy-coded segments* (ECS). So far, the images I have tested have contained one frame, one scan, and one ECS, which is a good starting point. Complexity does not always have to be payed for up front.

## The JPEG Part

Let’s have a look at the image data — this is after all the most exciting part.

## Encoding from 10000 meters

In essence, this is how JPEG encoding works:

- Split the image into 8x8 blocks. In case the image is not perfectly divided into 8x8 blocks, extend the borders of the image such that it is
- Convert the block to frequency domain, using the Discrete Cosine Transform
- Reorder the block to a “zigzag” ordering
- Quantize frequency coefficients
- Encode the block using huffman coding

In order to understand why we are doing this, we need to take a closer look at frequency transforms, and huffman coding.

## Discrete Cosine Transform

The Discrete Cosine Transform(DCT) creates a representation of a signal as a sum of cosines of different amplitude and frequency. I do not feel I am able to explain this good enough, but you are encouraged to look at the Wikipedia page for Fourier Series , which contains some great animation and images; for our use, DCT is basically the same thing. If this went over your head, do not worry. Understanding exactly *how* it works is not as important as understanding *why* we would like to use it.

So what does DCT have to do with images? Instead of looking at an image as a grid of pixels, we can interpret the image as a two dimentional signal. For instance, say we have a grayscale image of size 8x1 px, and that the image data, where each pixel is a number between `0`

(black) and `255`

(white), looks like this:

[0, 32, 64, 96, 128, 160, 192, 224]

The image looks like this (scaled by 3200%):

We can interpret this image as mathematical function; in this case it is rather easy: $f(x,y) = 32x$. So how does the DCT of this signal look? Like this:

[896.0, -583.1, 0.0, -61.0, 0.0, -18.2, 0.0, -4.6]

If we go backwards, using the *inverse* DCT, we get the exact same image data as the ones we fed into the DCT; no information is lost. This does not look too impressive; sure — we got some zeroes here and there, but there seems to still be some data which needs to be saved. What if we increase our image to be 8x8, instead of 8x1?

Signal 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 0.0 32.0 64.0 96.0 128.0 160.0 192.0 224.0 After DCT 896.0 -583.1 0.0 -61.0 0.0 -18.2 0.0 -4.6 -0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 0.0 0.0 -0.0 0.0 -0.0 0.0 0.0 0.0 -0.0 0.0 0.0 -0.0 -0.0 -0.0 -0.0 -0.0 0.0 0.0 -0.0 0.0 -0.0 -0.0 0.0 0.0 0.0 -0.0

Now it is clear that for some images — or image blocks — there are great opportunities to minimize the size of the encoded data. There are even more tricks to this, such as quantization, which we will have a look at later.

## Quantization

Quantization is the only lossy step in our 5 steps, and so it is this step that controls the level of compression. Quantization is the process of mapping a set of values to a smaller set, and it is done with a quantization matrix, which is the same size as an image block: 8x8. Predefined matrices are suggested in the JPEG standard, but each image encodes its own quantization matrices.

Quantization is applied after DCT, and is used to reduce the information of the coefficients in every image block. By making almost equal numbers equal, we decrease the number of unique numbers, and increase the compression ratio enabled by huffman conding.

Let us predend blocks are 2x2 instead of 8x8. Say the data we got back from the DTC is

$$G = \begin{bmatrix} 230 & 68 \\ 99 & 72 \end{bmatrix} $$

A quantization matrix could be

$$Q = \begin{bmatrix} 10 & 11 \\ 12 & 13 \end{bmatrix} $$

We take each component of $G$ and divide it by its corresponding component of $Q$, and round the numbers to the nearest integer:

$$B = \begin{bmatrix} \frac{230}{10} & \frac{68}{11}\\ \frac{99}{12} & \frac{72}{13} \end{bmatrix} = \begin{bmatrix} 23 & 6 \\ 8 & 6 \end{bmatrix} $$

And we are done. The data from $B$ is what is passed down to the next step.

We can also see that if we are going the other way, taking the inner product, with the same quantization matrix, we get

$$G’ = \begin{bmatrix} 230 & 66 \\ 96 & 78 \end{bmatrix} $$

## Huffman Coding

So we found a way to represent our image block with a lot of similar numbers — `0`

in our case. How can we take advantage of this? If we use 32 bit integers, the size of the number is 32 bits, no matter the number! Or is it?

Huffman Coding is a coding scheme used for lossless coding. The gist of the scheme is to code numbers as bit strings, with the property that no code can be the prefix of another code (if `01`

is a code, `011`

can not be a code), and that frequent numbers should have a shorter code than less frequent numbers.

Say we want to encode this (totally random) list of bytes

[2, 7, 1, 8, 2, 8, 1, 8, 2, 8]

We can count the number of occurences of each number in the list,

Number | Occurences ------------------- 1 | 2 2 | 3 7 | 1 8 | 4

and create the codes,

Number | Code ------------------- 8 | 0 2 | 10 1 | 111 7 | 110

making our data

// data 2 7 1 8 2 8 1 8 2 8 // coded data 10 110 111 0 10 0 111 0 10 0 // squash together 1011011101001110100 // look at each byte 10110111 01001110 100????? // pad with 1 10110111 01001110 10011111 // which is the same as 183 78 159

Effectively, we coded 10 bytes as 3 bytes, by taking advanage of the fact that some numbers are more frequent than others.

## Back to the 10000 meter view

Now that we have some control over roughly what happens, we can take another look at the whole procedure, with some additional steps.

So, we split the image into blocks, and each block is more or less processed by itself. This is done because of the principle of locality: pixels within a block are likely to be somewhat similar.

Next, we transform the image into frequency domain, so we can make it easier to encode the numbers. Then, we reorder the blocks, with the goal of getting long runs of zeroes. We also take the inner division (element-by-element division) of our frequencies and a quantization matrix, in order to make the coefficients somewhat similar. This is the lossy part. At last, we use huffman coding to write out the image data, taking advantage of the fact that we are writing lots of similiar numbers.

And that is pretty much it — from a 10000 meters view.

## Conclusion

We have looked at how a JFIF file looks like, and roughly how the image data is encoded. In Part 2 we will start implementing this, in actual, runnable, (hopefully) working, Rust code.

If you found some things confusing, or simply want better explanations than I can give, check of the Wiki page for JPEG ; it explains the encoding process using a 8x8 sample block.

Read part 2here

–

### Errata

- Quantization was listed before zigzagging the data. This was the wrong way around.