The EMNIST Dataset
In deep learning and machine learning, one common task is the recognition of handwritten characters. In this article I’ll begin discussing how to read the EMNIST database, located here, using Python and keras/Tensorflow.
It’s possible to experiment with handwriting recognition with very little effort and few lines of code if one is willing to live with the datasets that come with keras. For example, using this YouTube video tutorial, it’s possible to load the built-in dataset with just a few lines of code:
import tensorflow as tf

mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)
Those first two lines hide a tremendous load of complexity. To make use of the EMNIST dataset, we’ll have to understand the structure of the data files. The structure is described in a scholarly paper here. The remainder of this article describes my method for reading this dataset, which is much bigger than the built-in keras dataset, and preparing it for use in a keras/Tensorflow neural network.
A Single Image
We can all see that the character shown is the numeral 5. But in the EMNIST dataset, that single character is stored as a list of lists of bytes, binary data ranging from 0 (black background) to 255 (the brightest possible pixel value). Let’s get started reading the EMNIST files.
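To make the "list of lists of bytes" idea concrete, here is a minimal sketch that slices a handful of raw bytes into rows of pixel values. The six byte values and the 2x3 size are made up for illustration; real EMNIST images are much larger.

```python
# Hypothetical 2x3 "image": six raw bytes, stored row by row
raw = bytes([0, 128, 255, 64, 0, 32])
rows, cols = 2, 3

# Slice the flat byte string into one list of pixel values per row.
# Iterating over a bytes object yields integers in the range 0..255.
image = [list(raw[r * cols:(r + 1) * cols]) for r in range(rows)]
print(image)
```

Each inner list is one row of the image, and each number is the brightness of one pixel.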
Reading the Image File Header
Reading the EMNIST file header gives us several important pieces of information and, if we’re clever about it, the position where the actual character data starts. Because the data are stored as bytes, it’s necessary to open the files in binary mode and explicitly convert the bytes we read into integers.
def read_image_file_header(filename):
    with open(filename, 'rb') as f:
        int.from_bytes(f.read(4), byteorder='big')  # magic number, discard it
        count = int.from_bytes(f.read(4), byteorder='big')  # number of samples in data set
        rows = int.from_bytes(f.read(4), byteorder='big')  # rows per image
        columns = int.from_bytes(f.read(4), byteorder='big')  # columns per image
        pos = f.tell()  # current position used as offset later when reading data
    return pos, count, rows, columns
First we read a magic number that is useless to us. Next we read the number of samples by reading the next four bytes as a single 32-bit integer, noting that the byte order is big-endian. Next we read the numbers of rows and columns in each character in a similar fashion. Finally, we remember where the file pointer is located so that when we come back to actually read characters, we know where to start reading. We return all four of these critical values in a tuple.
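To try the header logic without downloading anything, here is a minimal sketch that builds a tiny fake header in memory and parses all four fields in one call to struct.unpack. The function name parse_idx_image_header and the sample values are my own illustration and not part of the EMNIST distribution; 2051 is the magic number used by MNIST-style image files.

```python
import io
import struct

def parse_idx_image_header(stream):
    """Read the 16-byte header of an MNIST-style image file from a binary stream."""
    # '>IIII' means four big-endian unsigned 32-bit integers
    magic, count, rows, columns = struct.unpack('>IIII', stream.read(16))
    # stream.tell() is now the offset of the first pixel byte
    return magic, count, rows, columns, stream.tell()

# Hypothetical header: 3 images of 28x28 pixels
fake_header = struct.pack('>IIII', 2051, 3, 28, 28)
magic, count, rows, columns, offset = parse_idx_image_header(io.BytesIO(fake_header))
print(magic, count, rows, columns, offset)
```

Unlike the function above, this version keeps the magic number, which can be handy as a sanity check that the right kind of file was opened.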
Reading Character Data from the File
Now we’re ready to code a function that reads the image file and saves the data to a (big) file in a format usable by keras.
import numpy as np
import tensorflow as tf

def load_save_images(inputfilename, byte_offset, outputfilename, cols, rows, count):
    list_data = []
    with open(inputfilename, 'rb') as infile:
        infile.seek(byte_offset)
        for n in range(count):
            image_matrix = [[0 for x in range(cols)] for y in range(rows)]
            for r in range(rows):
                for c in range(cols):
                    byte = infile.read(1)
                    # note the [c][r] order: this transposes the image and
                    # only works because rows == cols for these images
                    image_matrix[c][r] = float(ord(byte))
            list_data.append(image_matrix)
            # show progress
            if n % 5000 == 0:
                print("... " + str(n))
    print('converting to numpy array')
    list_data = np.array(list_data)
    print('normalizing')
    list_data = tf.keras.utils.normalize(list_data, axis=1)
    print('saving')
    np.save(outputfilename, list_data)
The keras/Tensorflow packages expect a numpy array. We’ll start toward that by creating an empty list with list_data = [].
Next we open the file in binary mode and move the file pointer to the byte offset value returned by the read_image_file_header() method. Then we start a loop to get each character (240,000 of them for the EMNIST training dataset).
We initialize a 2D matrix with all zeros of the proper width and height:
image_matrix = [[0 for x in range(cols)] for y in range(rows)]
for n in range(count):
    image_matrix = [[0 for x in range(cols)] for y in range(rows)]
    for r in range(rows):
        for c in range(cols):
            byte = infile.read(1)
            image_matrix[c][r] = float(ord(byte))
    list_data.append(image_matrix)
Then we read each byte in each row and each column inside a double loop, convert its byte value to a float, and add the completed 2D matrix to the list_data list.
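Reading one byte at a time is easy to follow but slow for hundreds of thousands of images. As an alternative, and not the method used in this article, the same result can probably be had by handing the whole block of pixel bytes to numpy at once. The sketch below assumes the pixel bytes have already been read into a bytes object; bytes_to_images is a hypothetical helper name, and the tiny 2x3 images are made up so the result is easy to check.

```python
import numpy as np

def bytes_to_images(raw, count, rows, cols):
    """Turn a flat bytes object of pixel data into a (count, cols, rows) float array."""
    pixels = np.frombuffer(raw, dtype=np.uint8, count=count * rows * cols)
    images = pixels.reshape(count, rows, cols)
    # Swap the last two axes to mirror image_matrix[c][r] = ... in the double loop
    return images.transpose(0, 2, 1).astype(np.float64)

# Hypothetical data: two 2x3 "images" with pixel values 0..11
raw = bytes(range(12))
images = bytes_to_images(raw, count=2, rows=2, cols=3)
print(images.shape)
print(images[0])
```

Because the axis swap is explicit here, this version does not depend on the images being square.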
Final Steps
The list of character data needs some final transformation before it’s ready to import into keras/Tensorflow. We need to do two things before saving the data: first convert the list to a numpy array, and then normalize it.
list_data = np.array(list_data)
list_data = tf.keras.utils.normalize(list_data, axis=1)
np.save(outputfilename, list_data)
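As I understand it, tf.keras.utils.normalize performs an L2 normalization by default, so with axis=1 every vector running along axis 1 is scaled to unit length. The numpy sketch below is my own approximation of that behavior for illustration; l2_normalize is a hypothetical helper, not a keras function, and the sample values are invented.

```python
import numpy as np

def l2_normalize(x, axis=1):
    """Scale x so that each vector along `axis` has an L2 norm of 1."""
    norms = np.linalg.norm(x, ord=2, axis=axis, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero (all-black) vectors unchanged
    return x / norms

# One hypothetical 2x2 "image"; along axis 1 the vectors are [3, 4] and [0, 0]
batch = np.array([[[3.0, 0.0],
                   [4.0, 0.0]]])
normalized = l2_normalize(batch, axis=1)
print(normalized)
```

The practical effect is that pixel values end up between 0 and 1 instead of between 0 and 255.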
The resulting file will be 1,505,280,128 bytes with 8-byte floats, but is prepared in such a way that it can be loaded into keras/Tensorflow with a single line of code.
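The file size follows from simple arithmetic: 240,000 images times 28 x 28 pixels times 8 bytes per float comes to 1,505,280,000 bytes, and the remaining 128 bytes are consistent with the small header numpy writes at the start of a .npy file. Here is a minimal sketch of that calculation and of the round trip through np.save and np.load, using a tiny stand-in array and a hypothetical file name rather than the real output file.

```python
import os
import tempfile
import numpy as np

# Payload size for 240,000 images of 28x28 pixels stored as 8-byte floats
count, rows, cols = 240_000, 28, 28
payload_bytes = count * rows * cols * 8
print(payload_bytes)

# Round trip on a tiny stand-in array (hypothetical file name)
small = np.zeros((4, rows, cols), dtype=np.float64)
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'sample_images.npy')
    np.save(path, small)
    # Whatever is left after the raw array data is the .npy header
    header_bytes = os.path.getsize(path) - small.nbytes
    loaded = np.load(path)  # the "single line" that reads the data back

print(header_bytes, loaded.shape)
```

The exact header size can vary with the numpy version and the array shape, so I would treat it as something to measure rather than assume.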