One of the most fascinating topics in the field of machine learning is image recognition (IR). Examples of systems that employ IR include computer login programs that use fingerprint or retinal identification, and airport security systems that scan passenger faces looking for individuals on some sort of wanted list. The MNIST data set is a collection of simple images that can be used to experiment with IR algorithms. This article presents and explains a relatively simple C# program that introduces you to the MNIST data set, which in turn acquaints you with IR concepts.
It’s unlikely you’ll need to use IR in most software applications, but I think you might find the information in this article useful for four different reasons. First, there’s no better way to understand the MNIST data set and IR concepts than by experimenting with actual code. Second, having a basic grasp of IR will help you understand the capabilities and limitations of real, sophisticated IR systems. Third, several of the programming techniques explained in this article can be used for different, more common tasks. And fourth, you might just find IR interesting in its own right.
The best way to see where this article is headed is to take a look at the demo program in Figure 1. The demo program is a classic Windows Forms application. The button control labeled Load Images reads into memory a standard image recognition data set called the MNIST data set. The training portion of the data set, which is what the demo loads, consists of 60,000 handwritten digits from 0 through 9 that have been digitized. The demo has the ability to display the currently selected image as a bit-mapped image (on the left of Figure 1), and as a matrix of pixel values in hexadecimal form (on the right).
Figure 1 Displaying MNIST Images
In the sections that follow, I’ll walk you through the code for the demo program. Because the demo is a Windows Forms application, much of the code is related to UI functionality, and is contained in multiple files. I focus here on the logic. I refactored the demo code into a single C# source file that’s available at msdn.microsoft.com/magazine/msdnmag0614. To compile the download, you can save it on your local computer as MnistViewer.cs, then create a new Visual Studio project and add the file to your project. Alternatively, you can launch a Visual Studio command shell (which knows the location of the C# compiler), then navigate to the directory where you saved the download, and issue the command csc.exe /target:winexe MnistViewer.cs to create the executable MnistViewer.exe. Before you can run the demo program, you’ll need to download and save two MNIST data files, as I explain in the next section, and edit the demo source code to point to the location of those two files.
The terminology used in IR literature tends to vary quite a bit. Image recognition might also be called image classification, pattern recognition, pattern matching or pattern classification. Although these terms do have different meanings, they’re sometimes used interchangeably, which can make searching the Internet for relevant information a bit difficult.
The MNIST Data Set
The Modified National Institute of Standards and Technology (MNIST for short) data set was created by IR researchers to act as a benchmark for comparing different IR algorithms. The basic idea is that if you have an IR algorithm or software system you want to test, you can run your algorithm or system against the MNIST data set and compare your results with previously published results for other systems.
The data set consists of a total of 70,000 images: 60,000 training images (used to create an IR model) and 10,000 test images (used to evaluate the accuracy of the model). Each MNIST image is a digitized picture of a single handwritten digit character. Each image is 28 x 28 pixels in size. Each pixel value is between 0, which represents white, and 255, which represents black. Intermediate pixel values represent shades of gray. Figure 2 shows the first eight images in the training set. The actual digit that corresponds to each image is obvious to humans, but identifying the digits is a very difficult challenge for computers.
Figure 2 First Eight MNIST Training Images
Curiously, the training data and the test data are each stored in two files rather than in a single file. One file contains the pixel values for the images and the other contains the label information (0 through 9) for the images. Each of the four files also contains header information, and all four files are stored in a binary format that has been compressed using the gzip format.
Notice in Figure 1 that the demo program uses only the 60,000-item training set. The format of the test set is identical to that of the training set. The primary repository for the MNIST files is currently located at yann.lecun.com/exdb/mnist. The training pixel data is stored in file train-images-idx3-ubyte.gz and the training label data is stored in file train-labels-idx1-ubyte.gz. To run the demo program, you need to go to the MNIST repository site and download and unzip the two training data files. To unzip the files, I used the free, open source 7-Zip utility.
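If you'd rather stay inside .NET than use an external utility, the unzip step can be sketched with the GZipStream class. This is a minimal sketch and not part of the demo program. The class name GzipSketch and the method name Decompress are hypothetical, and the two paths are whatever locations you chose when downloading.

```csharp
using System.IO;
using System.IO.Compression;

// Hypothetical helper, not part of the demo program.
public static class GzipSketch
{
  // Decompresses a .gz file (for example, train-images-idx3-ubyte.gz)
  // into an uncompressed file at outputPath.
  public static void Decompress(string gzPath, string outputPath)
  {
    using (FileStream input = File.OpenRead(gzPath))
    using (GZipStream gzip = new GZipStream(input, CompressionMode.Decompress))
    using (FileStream output = File.Create(outputPath))
    {
      gzip.CopyTo(output); // stream the decompressed bytes to disk
    }
  }
}
```

You would call Decompress once for the pixel file and once for the label file, then point the demo at the two output files.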
Creating the MNIST Viewer
To create the MNIST demo program, I launched Visual Studio and created a new C# Windows Forms project named MnistViewer. The demo has no significant .NET version dependencies so any version of Visual Studio should work.
After the template code loaded into the Visual Studio editor, I set up the UI controls. I added two TextBox controls (textBox1, textBox2) to hold the paths to the two unzipped training files. I added a Button control (button1) and gave it a label of Load Images. I added two more TextBox controls (textBox3, textBox4) to hold the values of the current image index and the next image index. Using the Visual Studio designer, I set the initial values of these controls to “NA” and “0,” respectively.
I added a ComboBox control (comboBox1) for the image magnification value. Using the designer, I went to the control’s Items collection and added the strings “1” through “10.” I added a second Button control (button2) and gave it a label of Display Next. I added a PictureBox control (pictureBox1) and set its BackColor property to ControlDark so that the control’s outline could be seen. I set the PictureBox size to 280 x 280 to allow a magnification of up to 10 times (recall an MNIST image is 28 x 28 pixels). I added a fifth TextBox (textBox5) to display the hex values of an image, then set its Multiline property to True and its Font property to Courier New, 8.25 pt., and expanded its size to 606 x 412. And, finally, I added a ListBox control (listBox1) for logging messages.
After placing the UI controls onto the Windows Form, I added three class-scope fields:
The first two strings point to the locations of the unzipped training data files. You’ll need to edit these two strings to run the demo. The third field is an array of program-defined DigitImage objects.
I edited the Form constructor slightly to place the file paths into textBox1 and textBox2, and give the magnification an initial value of 6:
I used the ActiveControl property to set the initial focus onto the button1 control, just for convenience.
Creating a Class to Hold an MNIST Image
I created a small container class to represent a single MNIST image, as shown in Figure 3. I named the class DigitImage but you might want to rename it to something more specific, such as MnistImage.
Figure 3 A DigitImage Class Definition
I declared all class members with public scope for simplicity and removed normal error checking to keep the size of the code small. Fields width and height could’ve been omitted because all MNIST images are 28 x 28 pixels, but adding the width and height fields gives the class more flexibility. Field pixels is an array-of-arrays-style matrix. Unlike many languages, C# has a true multidimensional array and you might want to use it instead. Each cell value is type byte, which is just an integer value between 0 and 255. Field label is also declared as type byte, but could’ve been type int or char or string.
The DigitImage class constructor accepts values for width, height, the pixels matrix, and the label, and just copies those parameter values to the associated fields. I could’ve copied the pixel values by reference instead of by value, but that could lead to unwanted side effects if the source pixel values changed.
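A container class along the lines just described could be sketched as follows. This is a reconstruction from the description in the text rather than the original listing, so treat the member order and the comments as assumptions.

```csharp
// A sketch of the container class described in the text.
public class DigitImage
{
  public int width;        // 28 for MNIST
  public int height;       // 28 for MNIST
  public byte[][] pixels;  // pixels[row][column]; 0 = white, 255 = black
  public byte label;       // the digit, 0 through 9

  public DigitImage(int width, int height, byte[][] pixels, byte label)
  {
    this.width = width;
    this.height = height;
    this.label = label;

    // Copy by value so that later changes to the caller's matrix
    // do not affect this image.
    this.pixels = new byte[height][];
    for (int i = 0; i < height; ++i)
    {
      this.pixels[i] = new byte[width];
      for (int j = 0; j < width; ++j)
        this.pixels[i][j] = pixels[i][j];
    }
  }
}
```

Because the constructor allocates its own matrix, the caller can reuse a single scratch matrix for every image it reads, which is what makes the copy-by-value choice convenient when loading data.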
Loading the MNIST Data
I double-clicked on the button1 control to register its event handler. The event handler farms most of the work to the method LoadData:
The LoadData method is listed in Figure 4. LoadData opens both the pixel and label files and reads them simultaneously. The method begins by creating a local 28 x 28 matrix of pixel values. The handy .NET BinaryReader class is designed specifically for reading binary files.
Figure 4 The LoadData Method
The format of the MNIST training pixels file has an initial magic integer (32 bits) that has value 2051, followed by the number of images as an integer, followed by the number of rows and the number of columns as integers, followed by the 60,000 images x 28 x 28 pixels = 47,040,000 byte values. So, after opening the binary files, the first four integers are read using the ReadInt32 method. For example, the number of images is read by calling ReadInt32 on the reader for the pixel file and then converting the byte order of the result.
Interestingly, the MNIST files store integer values in big endian format (used by some non-Intel processors) rather than the more usual little endian format that’s most commonly used on hardware that runs Microsoft software. So, if you’re using normal PC-style hardware, to view or use any of the integer values, they must be converted from big endian to little endian. This means reversing the order of the four bytes that make up the integer. For example, the magic number 2051 in big endian form is the byte sequence 00000000 00000000 00001000 00000011.
That same value stored in little endian form is 00000011 00001000 00000000 00000000.
Notice it’s the four bytes that must be reversed, rather than the entire 32-bit sequence. There are many ways to reverse bytes. I used a high-level approach that leverages the .NET BitConverter class, rather than using a low-level, bit-manipulation approach:
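The BitConverter approach can be sketched as follows. The class name EndianSketch is hypothetical and the method name ReverseBytes is my own label for the helper. The sketch assumes only the standard BitConverter.GetBytes, Array.Reverse and BitConverter.ToInt32 calls.

```csharp
using System;

// Hypothetical helper class; the method mirrors the approach described in the text.
public static class EndianSketch
{
  public static int ReverseBytes(int v)
  {
    byte[] intAsBytes = BitConverter.GetBytes(v); // four bytes in machine order
    Array.Reverse(intAsBytes);                    // reverse whole bytes, not bits
    return BitConverter.ToInt32(intAsBytes, 0);
  }
}
```

For example, a little endian machine that reads the big endian magic number with ReadInt32 sees the value 0x03080000, and ReverseBytes turns that into 2051.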
Method LoadData reads, but doesn’t use, the header information. You might want to check the four values (2051, 60000, 28, 28) to verify the file hasn’t been damaged. After opening both files and reading the header integers, LoadData reads 28 x 28 = 784 consecutive pixel values from the pixel file and stores those values, then reads a single label value from the label file and combines it with the pixel values into a DigitImage object, which it then stores into the class-scope trainData array. Notice there’s no explicit image ID. Each image has an implicit index ID, which is the image’s zero-based position in the sequence of images.
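The loading steps just described, along with the suggested header check, can be sketched as shown below. This is not the original Figure 4 listing. MnistSample is a hypothetical minimal stand-in for DigitImage, MnistLoaderSketch is a hypothetical class name, and the integers are assembled with bit shifts (the low-level alternative mentioned earlier) so the code doesn't depend on the machine's byte order. The sketch also assumes the label file header holds two integers, a magic number followed by the item count, which comes from the MNIST format description rather than from this article.

```csharp
using System;
using System.IO;

// Hypothetical minimal stand-in for the article's DigitImage class.
public class MnistSample
{
  public byte[][] pixels;
  public byte label;
}

public static class MnistLoaderSketch
{
  // Reads a 32-bit big endian integer regardless of the machine's byte order.
  static int ReadBigEndianInt32(BinaryReader br)
  {
    byte[] b = br.ReadBytes(4);
    if (b.Length < 4) throw new EndOfStreamException("Truncated header");
    return (b[0] << 24) | (b[1] << 16) | (b[2] << 8) | b[3];
  }

  public static MnistSample[] LoadData(string pixelFile, string labelFile)
  {
    using (BinaryReader brImages = new BinaryReader(File.OpenRead(pixelFile)))
    using (BinaryReader brLabels = new BinaryReader(File.OpenRead(labelFile)))
    {
      // Pixel file header: magic, image count, rows, columns.
      int magic = ReadBigEndianInt32(brImages);
      int numImages = ReadBigEndianInt32(brImages);
      int numRows = ReadBigEndianInt32(brImages);
      int numCols = ReadBigEndianInt32(brImages);

      // Label file header (assumed layout): magic, label count.
      ReadBigEndianInt32(brLabels);
      int numLabels = ReadBigEndianInt32(brLabels);

      if (magic != 2051)
        throw new InvalidDataException("Unexpected magic number in pixel file");
      if (numLabels != numImages)
        throw new InvalidDataException("Image and label counts differ");

      MnistSample[] result = new MnistSample[numImages];
      for (int di = 0; di < numImages; ++di)
      {
        // Each image is stored row by row, one byte per pixel.
        byte[][] pixels = new byte[numRows][];
        for (int i = 0; i < numRows; ++i)
          pixels[i] = brImages.ReadBytes(numCols);

        MnistSample sample = new MnistSample();
        sample.pixels = pixels;
        sample.label = brLabels.ReadByte(); // one label byte per image
        result[di] = sample;
      }
      return result;
    }
  }
}
```

The position of each element in the returned array serves as the implicit image ID mentioned above.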
Displaying an Image
I double-clicked on the button2 control to register its event handler. The code to display an image is shown in Figure 5.
Figure 5 Displaying an MNIST Image
The index of the image to display is fetched from the textBox4 (next image index) control, then a reference to the image is pulled from the trainData array. You might want to add a check to make sure the image data has been loaded into memory before trying to access an image. The image is displayed in two ways, first in a visual form in the PictureBox control, and second, as hexadecimal values in the large TextBox control. A PictureBox control’s Image property can accept a Bitmap object and then render the object. Very nice! You can think of a Bitmap object as essentially an image. Note that there’s a .NET Image class, but it’s an abstract base class that’s used to define the Bitmap class. So the key to displaying an image is to generate a Bitmap object from the program-defined DigitImage object. This is done by helper method MakeBitmap, which is listed in Figure 6.
Figure 6 The MakeBitmap Method
The method isn’t long but it is a bit subtle. The Bitmap constructor accepts a width and a height as integers. For basic MNIST data the source image is always 28 by 28, and each dimension is multiplied by the magnification value. If the magnification value is 3, then the Bitmap image will be (28 * 3) by (28 * 3) = 84 by 84 pixels in size, and each 3-by-3 square in the Bitmap will represent one pixel of the original image.
Supplying the values for a Bitmap object is done indirectly through a Graphics object. Inside the nested loop, the current pixel value is complemented by 255 so that the resulting image will be a black/gray digit against a white background. Without complementing, the image would be a white/gray digit against a black background. To make a grayscale color, the same values for the red, green, and blue parameters are passed to the FromArgb method. An alternative is to pass the pixel value to just one of the RGB parameters to get a colored image (shades of red, green or blue) rather than a grayscale image.
The FillRectangle method paints an area of the Bitmap object. The first parameter is a brush that carries the color. The second and third parameters are the x and y coordinates of the upper-left corner of the rectangle. Notice that x is the left-right (horizontal) coordinate, which corresponds to column index j into the source image’s pixel matrix, and y is the up-down (vertical) coordinate, which corresponds to row index i. The fourth and fifth parameters to FillRectangle are the width and height of the rectangular area to paint, starting from the corner specified by the second and third parameters.
For example, suppose the current pixel to be displayed is at i = 2 and j = 5 in the source image, and has value = 200 (representing a dark gray). If the magnification value is set to 3, the Bitmap object will be 84-by-84 pixels in size. The FillRectangle method would start painting at x = (5 * 3) = column 15 and y = (2 * 3) = row 6 of the Bitmap, and paint a 3-by-3 pixel rectangle with color (55,55,55) = dark gray.
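The drawing logic described above can be sketched as follows. This is a reconstruction from the description, not the original Figure 6 listing. The class name MnistBitmapSketch is hypothetical, the method takes a raw pixel matrix rather than a DigitImage to keep the block self-contained, and the sketch assumes System.Drawing is available, as it is in a Windows Forms project.

```csharp
using System.Drawing;

// Hypothetical class name; assumes System.Drawing (Windows Forms environment).
public static class MnistBitmapSketch
{
  // pixels[row][column] holds values 0 (white) through 255 (black).
  public static Bitmap MakeBitmap(byte[][] pixels, int mag)
  {
    int height = pixels.Length;
    int width = pixels[0].Length;
    Bitmap result = new Bitmap(width * mag, height * mag);
    using (Graphics gr = Graphics.FromImage(result))
    {
      for (int i = 0; i < height; ++i)     // i = row, maps to y
      {
        for (int j = 0; j < width; ++j)    // j = column, maps to x
        {
          // Complement so the digit is dark on a light background.
          int gray = 255 - pixels[i][j];
          using (SolidBrush sb = new SolidBrush(Color.FromArgb(gray, gray, gray)))
          {
            // Paint one mag-by-mag square per source pixel.
            gr.FillRectangle(sb, j * mag, i * mag, mag, mag);
          }
        }
      }
    }
    return result;
  }
}
```

With a magnification of 3, a source pixel at row 2, column 5 with value 200 ends up as a 3-by-3 square whose upper-left corner is at x = 15, y = 6 and whose color is (55,55,55), matching the worked example.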
Displaying an Image’s Pixel Values
If you refer back to the code in Figure 5, you’ll see that helper method PixelValues is used to generate the hexadecimal representation of an image’s pixel values. The method is short and simple:
The method constructs one long string with embedded newline characters, using string concatenation for simplicity. When the string is placed into a TextBox control that has its Multiline property set to True, the string will be displayed as shown in Figure 1. Although hexadecimal values may be a bit more difficult to interpret than base 10 values, hexadecimal values format more nicely.
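A hex-formatting helper of this kind could be sketched as shown below. Note that this sketch swaps in a StringBuilder in place of the plain string concatenation described above, which avoids repeated string allocation for 784 values. The class name HexDumpSketch is hypothetical, and the method takes a raw pixel matrix rather than a DigitImage so that the block stands alone.

```csharp
using System;
using System.Text;

// Hypothetical class name; uses StringBuilder rather than string concatenation.
public static class HexDumpSketch
{
  public static string PixelValues(byte[][] pixels)
  {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < pixels.Length; ++i)
    {
      for (int j = 0; j < pixels[i].Length; ++j)
      {
        // "X2" formats a byte as two uppercase hex digits, e.g. 200 -> "C8".
        sb.Append(pixels[i][j].ToString("X2")).Append(' ');
      }
      sb.Append(Environment.NewLine); // one text line per image row
    }
    return sb.ToString();
  }
}
```

Every value occupies exactly two characters plus a space, which is why the columns line up in a fixed-width font such as Courier New.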
Where to from Here?
Image recognition is a problem that’s conceptually simple but extremely difficult in practice. A good first step toward understanding IR is to be able to visualize the well-known MNIST data set as shown in this article. If you look at Figure 1, you’ll see that any MNIST image is really nothing more than 784 values with an associated label, such as “4.” So image recognition boils down to finding some function that accepts 784 values as inputs and returns, as output, 10 probabilities representing the likelihoods that the inputs mean 0 through 9, respectively.
A common approach to IR is to use some form of neural network. For example, you could create a neural network with 784 input nodes, a hidden layer of 1,000 nodes and an output layer with 10 nodes. Such a network would have a total of (784 * 1000) + (1000 * 10) + (1000 + 10) = 795,010 weights and bias values to determine. Even with 60,000 training images, this would be a very difficult problem. But there are several fascinating techniques you can use to help get a good image recognizer. These techniques include using a convolutional neural network and generating additional training images using elastic distortion.
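The weight-and-bias arithmetic above generalizes to any fully connected network, and a short sketch makes it easy to try other layer sizes. The class and method names here are hypothetical, and the sketch assumes every node outside the input layer has exactly one bias value.

```csharp
// Hypothetical helper for counting the parameters of a fully connected network.
public static class NetworkSizeSketch
{
  // layerSizes lists the node counts from input layer to output layer.
  public static int CountWeightsAndBiases(int[] layerSizes)
  {
    int total = 0;
    for (int k = 1; k < layerSizes.Length; ++k)
    {
      total += layerSizes[k - 1] * layerSizes[k]; // weights between adjacent layers
      total += layerSizes[k];                     // one bias per non-input node
    }
    return total;
  }
}
```

Calling CountWeightsAndBiases(new int[] { 784, 1000, 10 }) gives 795,010, the figure quoted above.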
Dr. James McCaffrey works for Microsoft Research in Redmond, Wash. He has worked on several Microsoft products including Internet Explorer and Bing. McCaffrey can be reached at firstname.lastname@example.org.
Thanks to the following technical expert for reviewing this article: Wolf Kienzle (Microsoft Research)