Faster Performance Using numpy Arrays

Machine learning applications often need to process a lot of data. In an earlier post, I showed how to read the large EMNIST dataset for deep-learning handwriting recognition. In that article, I used a very naive approach to reading this large dataset, and it performed poorly. So I decided to optimize it for speed.

The results were great: the new version of the load_save_images function now runs 8.5 times faster than before. The entire speed gain is due to two major changes:

  • using numpy for all operations except the initial file read
  • reading the file in “one gulp” instead of a byte at a time

To isolate and benchmark the two approaches, I created a new Python module that compares them directly. Here it is:

import numpy as np
import tensorflow as tf
import time


def read_image_file_header(filename):
    f = open(filename, 'rb')
    int.from_bytes(f.read(4), byteorder='big')  # magic number, discard it
    count = int.from_bytes(f.read(4), byteorder='big')  # number of samples in data set
    rows = int.from_bytes(f.read(4), byteorder='big')  # rows per image
    columns = int.from_bytes(f.read(4), byteorder='big')  # columns per image
    pos = f.tell()  # current position used as offset later when reading data
    f.close()
    return pos, count, rows, columns


def load_save_images1(inputfilename, byte_offset, outputfilename, cols, rows, count):
    start = time.time()
    print('reading whole data file')
    infile = open(inputfilename, 'rb')
    infile.seek(byte_offset)
    b0 = infile.read(count * rows * cols)
    infile.close()
    print('creating array from data')
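    # np.fromstring is deprecated for binary data; frombuffer returns a
    # read-only view, so copy it to allow the in-place edits below
    b2 = np.frombuffer(b0, dtype='uint8').copy()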
    print('reshaping the array')
    b2.shape = (count, rows, cols)
    print('rearranging image arrays')
    for n in range(count):
        b2[n] = np.fliplr(b2[n])
        b2[n] = np.rot90(b2[n])
        # show progress
        if n % 10000 == 0:
            print("... " + str(n))
    print('normalizing')
    b2 = tf.keras.utils.normalize(b2, axis=1)
    print('saving')
    np.save(outputfilename, b2)
    end = time.time()
    elapsed = end - start
    print("elapsed " + str(elapsed))


def load_save_images2(inputfilename, byte_offset, outputfilename, cols, rows, count):
    start = time.time()
    list_data = []
    infile = open(inputfilename, 'rb')
    infile.seek(byte_offset)
    for n in range(count):
        image_matrix = [[0 for x in range(cols)] for y in range(rows)]
        for r in range(rows):
            for c in range(cols):
                byte = infile.read(1)
                # store transposed ([c][r], not [r][c]); assumes square images
                image_matrix[c][r] = float(ord(byte))
        list_data.append(image_matrix)
        # show progress
        if n % 10000 == 0:
            print("... " + str(n))
    infile.close()
    print('converting to numpy array')
    list_data = np.array(list_data)
    print('normalizing')
    list_data = tf.keras.utils.normalize(list_data, axis=1)
    print('saving')
    np.save(outputfilename, list_data)
    end = time.time()
    elapsed = end - start
    print("elapsed " + str(elapsed))


# _____ MAIN PROGRAM STARTS HERE _____

input_filename = 'emnist-digits-train-images-idx3-ubyte'
offset, sample_count, rows_per_image, columns_per_image = read_image_file_header(input_filename)

load_save_images1(input_filename, offset, 'images_train_array', columns_per_image, rows_per_image, sample_count)
load_save_images2(input_filename, offset, 'images_train_array', columns_per_image, rows_per_image, sample_count)
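
As a side note on the header, the four big-endian 4-byte integers that read_image_file_header pulls out one at a time could also be unpacked in a single call with the standard struct module. The sketch below is only an illustration: the function name read_image_file_header_struct and the demo header values are made up here, and it assumes the same 16-byte layout (magic number, count, rows, columns) that the listing above reads.

```python
import os
import struct
import tempfile


def read_image_file_header_struct(filename):
    """Hypothetical variant of read_image_file_header using one 16-byte read."""
    with open(filename, 'rb') as f:
        # Four big-endian unsigned 32-bit integers: magic, count, rows, columns.
        _magic, count, rows, columns = struct.unpack('>IIII', f.read(16))
        pos = f.tell()  # offset of the first image byte
    return pos, count, rows, columns


# Self-check against a small made-up header (values are placeholders).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(struct.pack('>IIII', 2051, 5, 28, 28))
    demo_path = tmp.name
header = read_image_file_header_struct(demo_path)
os.remove(demo_path)
print(header)
```

The return value has the same shape as in the original function, so it could be dropped into the main program unchanged.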

The Main Difference

Fast

b0 = infile.read(count * rows * cols)

Slow

for n in range(count):
    image_matrix = [[0 for x in range(cols)] for y in range(rows)]
    for r in range(rows):
        for c in range(cols):
            byte = infile.read(1)

That’s one big file read instead of 188,160,000 single-byte reads. The operating system buffers do what they can, but one big read is going to win every time.
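
To see the difference without downloading EMNIST, here is a minimal sketch that times both read styles on a small temporary file of random bytes. The helper names read_one_gulp and read_byte_at_a_time and the 1,000-image size are assumptions chosen for illustration, not part of the original module, and the timings will vary from machine to machine.

```python
import os
import tempfile
import time

import numpy as np


def read_one_gulp(path, n_bytes):
    """Read n_bytes with a single call and wrap them in a uint8 array."""
    with open(path, 'rb') as f:
        return np.frombuffer(f.read(n_bytes), dtype=np.uint8)


def read_byte_at_a_time(path, n_bytes):
    """Read n_bytes one at a time, as the slow version does."""
    values = []
    with open(path, 'rb') as f:
        for _ in range(n_bytes):
            values.append(ord(f.read(1)))
    return np.array(values, dtype=np.uint8)


# Synthetic stand-in for the image data: 1,000 images of 28 x 28 bytes.
n_bytes = 1000 * 28 * 28
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(n_bytes))
    path = tmp.name

t0 = time.perf_counter()
fast = read_one_gulp(path, n_bytes)
t1 = time.perf_counter()
slow = read_byte_at_a_time(path, n_bytes)
t2 = time.perf_counter()
os.remove(path)

print("one gulp: %.4f s, byte at a time: %.4f s" % (t1 - t0, t2 - t1))
```

Both functions return the same bytes; only the number of read calls differs.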

In the full listing above, you’ll see that I have to reshape the 1D byte array produced by the one big file read, then loop through the resulting 240,000 2D arrays, flipping each one and rotating it 90 degrees. Even with that extra work, it is still far faster than the byte-by-byte alternative.
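
One possible further step, not benchmarked here: for a 2D array, np.fliplr followed by np.rot90 is the same as a transpose, so the per-image loop could in principle be replaced by a single axis swap over the whole batch. The sketch below checks that equivalence on a tiny made-up batch; the array sizes and variable names are illustrative only.

```python
import numpy as np

# Small made-up batch: 3 "images" of 4 x 4 pixels.
images = np.arange(3 * 4 * 4, dtype=np.uint8).reshape(3, 4, 4)

# Per-image loop, as in load_save_images1.
looped = images.copy()
for n in range(len(looped)):
    looped[n] = np.fliplr(looped[n])
    looped[n] = np.rot90(looped[n])

# Whole-batch alternative: swap the row and column axes of every image.
swapped = np.ascontiguousarray(images.transpose(0, 2, 1))

print(np.array_equal(looped, swapped))
```

Note that the in-place loop only works for square images, because a rotated non-square image no longer fits the slot it came from.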

Results

The fast code runs in 13.9 seconds on my machine, while the slow code requires 119.1 seconds. That’s a performance gain of more than 8.5x.
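
Both functions repeat the same start/end timing code. A small context manager is one way to factor that out; the sketch below is an assumption-laden illustration (the name timed and the stand-in workload are invented here), and it also recomputes the speedup from the two timings reported above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed('demo', results):
    sum(range(100000))  # stand-in workload, not the real image loader

# Speedup from the timings reported above: slow seconds / fast seconds.
speedup = 119.1 / 13.9
print("demo took %.4f s; reported speedup %.2fx" % (results['demo'], speedup))
```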

It just goes to show: if you can do all, or almost all, of the work with numpy arrays, that is the way to go with large machine learning datasets.
