Machine learning applications often need to process a lot of data. In an earlier post, I showed how to read the large EMNIST dataset for deep learning of handwriting recognition. In that article, I used a very naive approach to reading in this large dataset, and it performed poorly. So I decided to optimize it for speed.
The results were great: the new version of the load_save_images function now runs about 8.5 times faster than before. The entire speed gain comes from two major changes:
- using numpy for all operations except the initial file read
- reading the file in “one gulp” instead of a byte at a time
To isolate and benchmark the performance of the two approaches, I created a new Python module that compares them directly. Here it is:
import numpy as np
import tensorflow as tf
import time


def read_image_file_header(filename):
    f = open(filename, 'rb')
    int.from_bytes(f.read(4), byteorder='big')  # magic number, discard it
    count = int.from_bytes(f.read(4), byteorder='big')  # number of samples in data set
    rows = int.from_bytes(f.read(4), byteorder='big')  # rows per image
    columns = int.from_bytes(f.read(4), byteorder='big')  # columns per image
    pos = f.tell()  # current position used as offset later when reading data
    f.close()
    return pos, count, rows, columns


def load_save_images1(inputfilename, byte_offset, outputfilename, cols, rows, count):
    start = time.time()
    print('reading whole data file')
    infile = open(inputfilename, 'rb')
    infile.seek(byte_offset)
    b0 = infile.read(count * rows * cols)
    infile.close()
    print('creating array from data')
    b2 = np.fromstring(b0, dtype='uint8')
    print('reshaping the array')
    b2.shape = (count, rows, cols)
    print('rearranging image arrays')
    for n in range(count):
        b2[n] = np.fliplr(b2[n])
        b2[n] = np.rot90(b2[n])
        # show progress
        if n % 10000 == 0:
            print("... " + str(n))
    print('normalizing')
    b2 = tf.keras.utils.normalize(b2, axis=1)
    print('saving')
    np.save(outputfilename, b2)
    end = time.time()
    elapsed = end - start
    print("elapsed " + str(elapsed))


def load_save_images2(inputfilename, byte_offset, outputfilename, cols, rows, count):
    start = time.time()
    list_data = []
    infile = open(inputfilename, 'rb')
    infile.seek(byte_offset)
    for n in range(count):
        image_matrix = [[0 for x in range(cols)] for y in range(rows)]
        for r in range(rows):
            for c in range(cols):
                byte = infile.read(1)
                image_matrix[c][r] = float(ord(byte))
        list_data.append(image_matrix)
        # show progress
        if n % 10000 == 0:
            print("... " + str(n))
    infile.close()
    print('converting to numpy array')
    list_data = np.array(list_data)
    print('normalizing')
    list_data = tf.keras.utils.normalize(list_data, axis=1)
    print('saving')
    np.save(outputfilename, list_data)
    end = time.time()
    elapsed = end - start
    print("elapsed " + str(elapsed))


# _____ MAIN PROGRAM STARTS HERE _____
input_filename = 'emnist-digits-train-images-idx3-ubyte'
offset, sample_count, rows_per_image, columns_per_image = read_image_file_header(input_filename)
load_save_images1(input_filename, offset, 'images_train_array', columns_per_image, rows_per_image, sample_count)
load_save_images2(input_filename, offset, 'images_train_array', columns_per_image, rows_per_image, sample_count)
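One small note on the array-creation step: the binary mode of np.fromstring has been deprecated in newer NumPy releases in favor of np.frombuffer. A minimal sketch of the equivalent step, assuming b0, count, rows, and cols are as in load_save_images1 (the .copy() is needed because np.frombuffer returns a read-only view of the bytes, and the flip/rotate loop writes into the array in place):

import numpy as np

# b0 holds the raw bytes from the single infile.read(count * rows * cols) call
b2 = np.frombuffer(b0, dtype='uint8').copy()  # copy because frombuffer gives a read-only view
b2.shape = (count, rows, cols)                # same reshape as before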
The Main Difference
Fast
b0 = infile.read(count * rows * cols)
Slow
for n in range(count):
    image_matrix = [[0 for x in range(cols)] for y in range(rows)]
    for r in range(rows):
        for c in range(cols):
            byte = infile.read(1)
That’s one big file read instead of 188,160,000 single-byte reads (240,000 images × 28 × 28 bytes). The operating system buffers do what they can, but one big read is going to win every time.
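If you want to see how much of the gap comes from the read pattern alone, here is a small stand-alone sketch (not part of the original module) that times one big read against byte-at-a-time reads on a scratch file; the file name and byte count are made up for illustration:

import os
import time

SCRATCH = 'read_benchmark.bin'   # hypothetical scratch file
N = 1_000_000                    # far fewer bytes than EMNIST, just to keep the test quick

with open(SCRATCH, 'wb') as f:
    f.write(os.urandom(N))       # fill the file with N random bytes

start = time.time()
with open(SCRATCH, 'rb') as f:
    data = f.read(N)             # one big read
print('one gulp:       ' + str(time.time() - start))

start = time.time()
with open(SCRATCH, 'rb') as f:
    data = bytearray(N)
    for i in range(N):
        data[i] = ord(f.read(1))  # one byte per call, like the slow loader
print('byte at a time: ' + str(time.time() - start))

os.remove(SCRATCH)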
In the topmost code (load_save_images1), you’ll see that I have to reshape the 1D byte array that results from the one big file read, and then loop through the 240,000 2D arrays that result, flipping each one and rotating it 90 degrees. All of this is still far faster than the byte-by-byte alternative.
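As an aside, if I am reading the flip-and-rotate correctly, np.fliplr followed by np.rot90 is the same as transposing each image, so the per-image Python loop could in principle be collapsed into a single vectorized transpose. A sketch, using a small made-up stack of arrays in place of the real EMNIST data:

import numpy as np

# small stand-in for the (count, rows, cols) image stack
b2 = np.arange(2 * 3 * 3, dtype='uint8').reshape(2, 3, 3)

# per-image loop, as in load_save_images1
looped = np.empty_like(b2)
for n in range(b2.shape[0]):
    looped[n] = np.rot90(np.fliplr(b2[n]))

# one vectorized transpose of the last two axes
vectorized = b2.transpose(0, 2, 1)

print(np.array_equal(looped, vectorized))  # prints True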
Results
The fast code runs in 13.9 seconds on my machine, while the slow code requires 119.1 seconds. That’s a performance gain of more than 8.5x.
It just goes to show: if you can do all, or almost all, of the work with numpy arrays, that is the way to go for large machine learning datasets.