Deep Neural Network for MNIST Classification

The dataset is called MNIST and is the standard benchmark for handwritten digit recognition. It provides 70,000 images (28x28 pixels) of handwritten digits (one digit per image).

The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.

Each image is 28 pixels x 28 pixels, or 784 pixels in total. We can think of each image as a 28x28 matrix whose values range from 0 to 255, corresponding to the intensity (shade of grey) of that pixel.

Since we are not using CNNs, the approach for a deep feedforward neural network is to "flatten" each image into a vector of length 784. Therefore, for each image, we have 784 inputs.
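As a minimal sketch of what flattening means (using a made-up random array in place of a real MNIST image), reshaping the 28x28 grid gives a single vector of 784 values:

import numpy as np

# A stand-in for one 28x28 grayscale image (values from 0 to 255)
fake_image = np.random.randint(0, 256, size=(28, 28))

# "Flattening" turns the 28x28 grid into a single vector of 784 inputs
flat_image = fake_image.reshape(784)
print(flat_image.shape)  # (784,)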

Import the relevant packages

In [1]:
import numpy as np
import tensorflow as tf
# tfds has a large number of datasets ready for modeling
import tensorflow_datasets as tfds

Load and Preprocess Data

Load Data

In [2]:
# tfds.load actually loads a dataset (or downloads and then loads if that's the first time you use it) 
# In our case, we are interested in MNIST; the name of the dataset is the only mandatory argument
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
  • with_info=True also returns a second object (here, mnist_info) containing information about the version, features, and number of samples
  • as_supervised=True will load the dataset in a 2-tuple structure (input, target)
  • Alternatively, as_supervised=False would return a dictionary of features, but we prefer to have our inputs and targets separated
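If you want to verify what mnist_info contains, a quick optional check like the sketch below prints the split sizes; MNIST ships with 60,000 training and 10,000 test examples:

# Optional sanity check on the info object returned by tfds.load
print(mnist_info.splits['train'].num_examples)  # 60000
print(mnist_info.splits['test'].num_examples)   # 10000
print(mnist_info.features)                      # describes the image and label features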

Split the dataset

Once we have loaded the dataset, we can extract the training and test sets with the built-in references. By default, the MNIST dataset in TFDS comes with train and test splits, but no validation split, so we must create one on our own.

In [3]:
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']

# Define the number of validation samples as a % of the train samples
# Use mnist_info (so we don't have to count the observations)
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
# Cast this number to an integer, as a float may cause an error along the way
num_validation_samples = tf.cast(num_validation_samples, tf.int64)

# Store the number of test samples in a dedicated variable (instead of using the mnist_info one)
num_test_samples = mnist_info.splits['test'].num_examples
# Again, we'd prefer an integer (rather than the default float)
num_test_samples = tf.cast(num_test_samples, tf.int64)

Scale the data

Normally, we would like to scale our data in some way to make the training more numerically stable. In this case, we will simply prefer to have inputs between 0 and 1.

In [4]:
# Define a function called scale that takes an MNIST image and its label
def scale(image, label):
    # Make sure the value is a float
    image = tf.cast(image, tf.float32)
    # Since the possible values for the inputs are 0 to 255 (256 different shades of grey)
    # If we divide each element by 255, we would get the desired result => all elements will be between 0 and 1 
    image /= 255.
    return image, label

# The method .map() allows us to apply a custom transformation to a given dataset
# Scale the whole train dataset; the validation portion will be carved out of it later
scaled_train_and_validation_data = mnist_train.map(scale)

# Finally, scale and batch the test data
# Scale it so it has the same magnitude as the train and validation
# There is no need to shuffle it, because we won't be training on the test data
# There would be a single batch, equal to the size of the test data
test_data = mnist_test.map(scale)

Shuffle the data

We shuffle in case the data is ordered in a certain way. Imagine the data is ordered and we have 10 batches, each containing only a single digit. This would confuse the stochastic gradient descent (SGD) algorithm, because each batch would be homogeneous within itself but completely different from all the other batches, causing the loss to vary wildly between batches. Since we are batching, we should shuffle the data to make it as randomly spread as possible.

In [5]:
# This BUFFER_SIZE parameter is for cases when we're dealing with enormous datasets
# Then we can't shuffle the whole dataset in one go because we can't fit it all in memory
# So instead TF only stores BUFFER_SIZE samples in memory at a time and shuffles them
# if (BUFFER_SIZE=1) => no shuffling will actually happen
# if (BUFFER_SIZE >= num samples) => shuffling is uniform (will happen all at once)
# BUFFER_SIZE in between - a computational optimization to approximate uniform shuffling
BUFFER_SIZE = 10000

# We can use the shuffle method and specify the buffer size
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)

Extract training and validation data

Once we have scaled and shuffled the data, we can proceed to actually extracting the train and validation portions. Our validation data would be equal to 10% of the training set, which we've already calculated.

We will be using mini-batch gradient descent to train our model, which is the most efficient way to perform deep learning, as the tradeoff between accuracy and speed is optimal. Therefore, we must set a batch size and prepare our data for batching.

Batch size = 1 = Stochastic gradient descent (SGD)

Batch size = # samples = (single batch) GD

1 < batch size < # samples = (mini-batch) GD

So, we want a batch size that is relatively small compared to the dataset, but reasonably large, so that we preserve the underlying dependencies.

Since we will only be forward propagating (and not backpropagating) on the validation data, we do not need to batch it. Batching is useful because the weights are updated only once per batch rather than after every sample, which reduces the noise in the training updates; in exchange, we only see an average loss and accuracy per batch. During validation and testing, however, we want the exact loss and accuracy, so we should take all of the values at once. Moreover, forward propagation alone is not computationally expensive, so it is cheap to calculate the exact values. The model still expects the data in batch form, though, so we will place the whole validation set into a single batch.
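As a quick arithmetic check (assuming the 60,000-example training split and the 10% validation share defined above), the batch size chosen below implies 54,000 / 150 = 360 weight updates per epoch, which is exactly the 360/360 counter that appears in the training log later on:

# Rough sanity check, assuming the splits defined above
train_samples = 60000 - 6000    # training examples left after removing validation
print(train_samples // 150)     # 360 mini-batches (i.e., weight updates) per epoch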

In [6]:
# Use the .take() method to take the first num_validation_samples samples for the validation dataset
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)

# Similarly, the train_data is everything else, so skip as many samples as there are in the validation dataset
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)

# Determine the batch size
BATCH_SIZE = 150

# Batch the train data
# This will be very helpful when we train, as we will be able to iterate over the different batches
train_data = train_data.batch(BATCH_SIZE)

# We will have a single batch with a batch size equal to the total # of validation samples
# This way, the model takes the whole validation dataset at once
validation_data = validation_data.batch(num_validation_samples)

# Batch the test data in the same way as the validation dataset
test_data = test_data.batch(num_test_samples)

# Our validation data must have the same shape and object properties as the train and test data
# The MNIST data is iterable and in 2-tuple format
# Therefore, we must extract and convert the validation inputs and targets accordingly
# Takes next batch (it is the only batch)
# Because as_supervised=True, we've got a 2-tuple structure
validation_inputs, validation_targets = next(iter(validation_data))

Model

Outline the model

We have 784 inputs in our input layer. We combine them linearly and apply a nonlinearity to get our first hidden layer. Two hidden layers with 50 nodes each are already enough to reach a decent accuracy, but we will try to see if we can do better.

There are 10 digits => 10 classes => 10 output units.

In [7]:
input_size = 784
output_size = 10
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 5000
    
# Define how the model will look
model = tf.keras.Sequential([
    
    # The first layer (the input layer)
    # Each observation is 28x28x1 pixels, therefore it is a tensor of rank 3
    # Flatten the images
    # The 'Flatten' layer takes our 28x28x1 tensor and orders it into a vector of shape (28x28x1,) = (784,)
    # This allows us to actually create a feed forward neural network
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)), # input layer
    
    # tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
    # It takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 3rd hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 4th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 5th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 6th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 7th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 8th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 9th hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 10th hidden layer
    
    # The final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

We provide the inputs from the input layer to the model; each hidden layer calculates the dot product of its inputs and the weights, adds the bias, and applies the ReLU activation function.

The outputs will be compared to the targets. Conceptually, both should be expressed over the 10 classes (the sparse loss chosen below handles the one-hot encoding of the targets for us). When creating a classifier, we would like to see the probability of each digit, so we use a softmax activation function for the output layer.
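To make the Dense computation concrete, here is a purely illustrative toy calculation (the tiny sizes and random weights are made up) of output = activation(dot(input, weights) + bias) with a ReLU activation:

# Toy illustration of what one Dense layer computes (sizes and values are made up)
x = tf.constant([[0.2, 0.7, 0.1]])        # 1 sample with 3 "inputs"
W = tf.random.normal((3, 4))              # weights: 3 inputs -> 4 units
b = tf.zeros((4,))                        # biases, one per unit
output = tf.nn.relu(tf.matmul(x, W) + b)  # output = relu(dot(input, W) + bias)
print(output.shape)                       # (1, 4)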

Choose the optimizer and the loss function

We define the optimizer we'd like to use, the loss function, and the metrics we are interested in obtaining at each iteration.

One of the best choices we have for the optimizer is the adaptive moment estimation (Adam).

We would like to employ a loss function that is suited to classifiers. Our targets are still integer labels (we did not one-hot encode them as a preprocessing step), while the model outputs one value per class; sparse_categorical_crossentropy handles this mismatch by applying the one-hot encoding for us, so the output and target shapes end up matching.
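As a small illustration of the difference (the numbers are made up), sparse_categorical_crossentropy accepts integer targets directly, while plain categorical_crossentropy would expect one-hot encoded targets; both give the same loss for equivalent targets:

# Illustrative only: the two losses agree for equivalent targets
probs = tf.constant([[0.1, 0.8, 0.1]])                      # softmax output for one sample
print(tf.keras.losses.sparse_categorical_crossentropy(
    tf.constant([1]), probs))                               # integer target: class 1
print(tf.keras.losses.categorical_crossentropy(
    tf.constant([[0., 1., 0.]]), probs))                    # same target, one-hot encoded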

In [8]:
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Training

Train the model we have built. At each epoch, we will validate. Within each epoch:

  1. At the beginning of the epoch, the training loss will be set to 0.
  2. The algorithm will iterate over a preset number of batches, all from the training set.
  3. The weights and biases will be updated as many times as there are batches.
  4. We will get a value for the loss function, indicating how the training is going.
  5. We will also see a training accuracy.
  6. At the end of the epoch, the algorithm will forward propagate the whole validation set and calculate the validation accuracy.

When we reach the maximum number of epochs, the training will be over.

In [9]:
# Determine the maximum number of epochs
NUM_EPOCHS = 10

# We fit the model, specifying the training data, the total number of epochs,
# and the validation data we just created in the format: (inputs,targets)
model.fit(train_data, epochs=NUM_EPOCHS, validation_data=(validation_inputs, validation_targets), validation_steps=1, verbose=2)
Epoch 1/10
360/360 - 837s - loss: 1.1751 - accuracy: 0.5719 - val_loss: 0.4082 - val_accuracy: 0.8422
Epoch 2/10
360/360 - 1011s - loss: 0.3195 - accuracy: 0.9209 - val_loss: 0.2415 - val_accuracy: 0.9592
Epoch 3/10
360/360 - 915s - loss: 0.2226 - accuracy: 0.9507 - val_loss: 0.1505 - val_accuracy: 0.9673
Epoch 4/10
360/360 - 838s - loss: 0.1544 - accuracy: 0.9664 - val_loss: 0.2151 - val_accuracy: 0.9593
Epoch 5/10
360/360 - 962s - loss: 0.2175 - accuracy: 0.9537 - val_loss: 0.1260 - val_accuracy: 0.9712
Epoch 6/10
360/360 - 991s - loss: 0.1172 - accuracy: 0.9735 - val_loss: 0.1568 - val_accuracy: 0.9723
Epoch 7/10
360/360 - 1006s - loss: 0.1487 - accuracy: 0.9711 - val_loss: 0.1629 - val_accuracy: 0.9625
Epoch 8/10
360/360 - 987s - loss: 0.1017 - accuracy: 0.9760 - val_loss: 0.1258 - val_accuracy: 0.9717
Epoch 9/10
360/360 - 1062s - loss: 0.0829 - accuracy: 0.9811 - val_loss: 0.1045 - val_accuracy: 0.9773
Epoch 10/10
360/360 - 1004s - loss: 0.0784 - accuracy: 0.9830 - val_loss: 0.0754 - val_accuracy: 0.9837
Out[9]:
<tensorflow.python.keras.callbacks.History at 0x63d340690>
  • We can see that each epoch took roughly 14-18 minutes to complete.
  • Note that the training loss decreases across epochs, and it decreases less and less with each epoch.
  • Note that the training accuracy increases with each epoch, and it increases less and less with each epoch, following the same trend as the training loss, as expected.
  • The validation loss fluctuates between epochs but has not started to increase consistently, so the model does not appear to be overfitting.
  • The validation accuracy for the last epoch is ~98%, which is a great result.

Test the model

After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters based on validation results gradually overfits the validation dataset.

The test is the absolute final instance, so we test once we are completely done with adjusting our model. If we adjust our model after testing, we will start overfitting the test dataset, which will defeat its purpose.

In [10]:
test_loss, test_accuracy = model.evaluate(test_data)
      1/Unknown - 43s 43s/step - loss: 0.1206 - accuracy: 0.9748
In [11]:
# We can apply some nice formatting if we want to
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))
Test loss: 0.12. Test accuracy: 97.48%

After some fine-tuning, I decided to brute-force the algorithm and created 10 hidden layers with 5,000 hidden units each.

hidden_layer_size = 5000
batch_size = 150
NUM_EPOCHS = 10

All activation functions are ReLU.

Due to the width and depth of the network, it took my computer 3 hours and 50 minutes to train. However, this yielded 97.5% test accuracy. Since this is about a percentage point below the validation accuracy, we may have slightly overfit the model.

Some of the results that leading academics achieved on the MNIST (using different methodologies) can be seen here: https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results

The credit for building the MNIST dataset used here goes to Yann LeCun, Corinna Cortes, and Christopher Burges.