The dataset is called MNIST and is the classic benchmark for handwritten digit recognition. It provides 70,000 images (28x28 pixels) of handwritten digits (one digit per image).
The goal is to write an algorithm that detects which digit is written. Since there are only 10 digits (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), this is a classification problem with 10 classes.
Each image is 28 pixels by 28 pixels, or 784 pixels in total. We can think of each input as a 28x28 matrix, where the values range from 0 to 255 and correspond to the intensity of the grey shade of that pixel.
Without using CNNs, the approach for deep feedforward neural networks is to "flatten" each image into a vector of length 784. Therefore, for each image, we have 784 inputs, as sketched below.
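As a minimal illustration of what flattening means (a standalone NumPy sketch with a random stand-in array, not an actual MNIST image):
import numpy as np
# A 28x28 grid of pixel intensities (0-255) becomes a single vector of 784 inputs
fake_image = np.random.randint(0, 256, size=(28, 28))
flattened = fake_image.reshape(-1)
print(fake_image.shape, '->', flattened.shape)  # (28, 28) -> (784,)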
import numpy as np
import tensorflow as tf
# tfds has a large number of datasets ready for modeling
import tensorflow_datasets as tfds
# tfds.load actually loads a dataset (or downloads and then loads if that's the first time you use it)
# In our case, we are interested in the MNIST; the name of the dataset is the only mandatory argument
mnist_dataset, mnist_info = tfds.load(name='mnist', with_info=True, as_supervised=True)
Once we have loaded the dataset, we can extract the training and test sets with the built-in references. By default, TensorFlow (TF) provides training and test splits, but no validation split, so we must create one on our own.
mnist_train, mnist_test = mnist_dataset['train'], mnist_dataset['test']
# Define the number of validation samples as a % of the train samples
# Use mnist_info (so we don't have to count the observations)
num_validation_samples = 0.1 * mnist_info.splits['train'].num_examples
# Cast this number to an integer, as a float may cause an error along the way
num_validation_samples = tf.cast(num_validation_samples, tf.int64)
# Store the number of test samples in a dedicated variable (instead of using the mnist_info one)
num_test_samples = mnist_info.splits['test'].num_examples
# Again, we'd prefer an integer (rather than the default float)
num_test_samples = tf.cast(num_test_samples, tf.int64)
Normally, we would like to scale our data in some way to make the result more numerically stable. In this case we will simply prefer to have inputs between 0 and 1.
# Define a function called: scale, that will take an MNIST image and its label
def scale(image, label):
    # Make sure the value is a float
    image = tf.cast(image, tf.float32)
    # Since the possible values for the inputs are 0 to 255 (256 different shades of grey)
    # If we divide each element by 255, we would get the desired result => all elements will be between 0 and 1
    image /= 255.
    return image, label
# The method .map() allows us to apply a custom transformation to a given dataset
# Scale the whole train dataset (the validation data will be split off from it later)
scaled_train_and_validation_data = mnist_train.map(scale)
# Finally, scale and batch the test data
# Scale it so it has the same magnitude as the train and validation
# There is no need to shuffle it, because we won't be training on the test data
# There would be a single batch, equal to the size of the test data
test_data = mnist_test.map(scale)
We shuffle in case the data is ordered in a certain way. Imagine the data is ordered and we have 10 batches, each containing only a single digit. This would confuse the stochastic gradient descent (SGD) algorithm, because each batch would be homogeneous within itself but completely different from all the other batches, causing the loss to vary wildly between updates. Since we are batching, we should shuffle the data to make it as randomly spread as possible.
# This BUFFER_SIZE parameter is for cases when we're dealing with enormous datasets
# Then we can't shuffle the whole dataset in one go because we can't fit it all in memory
# So instead TF only stores BUFFER_SIZE samples in memory at a time and shuffles them
# if (BUFFER_SIZE=1) => no shuffling will actually happen
# if (BUFFER_SIZE >= num samples) => shuffling is uniform (will happen all at once)
# BUFFER_SIZE in between - a computational optimization to approximate uniform shuffling
BUFFER_SIZE = 10000
# We can use the shuffle method and specify the buffer size
shuffled_train_and_validation_data = scaled_train_and_validation_data.shuffle(BUFFER_SIZE)
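To get a feel for how the buffer size affects shuffling, here is a tiny toy sketch, separate from the MNIST pipeline (the dataset of ten integers is made up purely for illustration):
toy = tf.data.Dataset.range(10)
print(list(toy.shuffle(buffer_size=1).as_numpy_iterator()))   # order unchanged: [0, 1, ..., 9]
print(list(toy.shuffle(buffer_size=3).as_numpy_iterator()))   # only locally shuffled
print(list(toy.shuffle(buffer_size=10).as_numpy_iterator()))  # a uniformly random permutation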
Once we have scaled and shuffled the data, we can proceed to actually extracting the train and validation portions. Our validation data would be equal to 10% of the training set, which we've already calculated.
We will be using mini-batch gradient descent to train our model, which is the standard way to train deep networks, as it offers a good tradeoff between the accuracy of each update and training speed. Therefore, we must set a batch size and prepare our data for batching.
Batch size = 1 = Stochastic gradient descent (SGD)
Batch size = # samples = (single batch) GD
1 < batch size < # samples = (mini-batch) GD
So, we want a batch size that is relatively small compared to the dataset, but reasonably large, so that each batch still preserves the underlying dependencies in the data.
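For a rough sense of scale (assuming the batch size of 150 chosen below and the 10% validation split from above):
# 60,000 train examples - 6,000 (10%) validation examples = 54,000 training samples
# 54,000 / 150 = 360 batches, i.e. 360 weight updates per epoch
print((60000 - 6000) // 150)  # 360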
Since we will only be forward propagating (not backpropagating) on the validation data, we do not need to batch it for efficiency. Batching is useful during training because the weights are updated once per batch rather than once per sample, which reduces the noise in the training updates; the price is that we only get an average loss and accuracy per batch. During validation and testing, however, we want the exact loss and accuracy, so we should take all of the values at once. Moreover, forward propagation alone does not require much computational power, so calculating the exact values is not expensive. The model still expects the data in batch form, though, which is why we will batch the validation and test sets into a single batch each.
# Use the .take() method to take the first num_validation_samples samples for the validation set
# (later we will batch them into a single batch whose size equals the total number of validation samples)
validation_data = shuffled_train_and_validation_data.take(num_validation_samples)
# Similarly, the train_data is everything else, so skip as many samples as there are in the validation dataset
train_data = shuffled_train_and_validation_data.skip(num_validation_samples)
# Determine the batch size
BATCH_SIZE = 150
# Batch the train data
# This will be very helpful when we train, as we will be able to iterate over the different batches
train_data = train_data.batch(BATCH_SIZE)
# We will have a single batch with a batch size equal to the total # of validation samples
# This way, the model takes the whole validation dataset at once
validation_data = validation_data.batch(num_validation_samples)
# Batch the test data in the same way as the validation dataset
test_data = test_data.batch(num_test_samples)
# Our validation data must have the same shape and object properties as the train and test data
# The MNIST data is iterable and in 2-tuple format
# Therefore, we must extract and convert the validation inputs and targets accordingly
# Takes next batch (it is the only batch)
# Because as_supervised=True, we've got a 2-tuple structure
validation_inputs, validation_targets = next(iter(validation_data))
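As a quick (purely diagnostic) check, we can inspect the shapes of the extracted tensors; with a 10% split of the 60,000 training images, we expect 6,000 validation samples:
print(validation_inputs.shape)   # (6000, 28, 28, 1)
print(validation_targets.shape)  # (6000,)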
We have 784 inputs in our input layer. We combine them and add a nonlinearity to get our first hidden layer. Two hidden layers with 50 nodes each are enough to achieve x% accuracy, but we will try to see if we can do better.
There are 10 digits => 10 classes => 10 output units.
input_size = 784
output_size = 10
# Use the same hidden layer size for all hidden layers. Not a necessity.
hidden_layer_size = 5000
# Define how the model will look
model = tf.keras.Sequential([
# The first layer (the input layer)
# Each observation is 28x28x1 pixels; therefore, it is a tensor of rank 3
# Flatten the images
# The 'Flatten' layer takes our 28x28x1 tensor and flattens it into a vector of length 28x28x1 = 784
# This allows us to actually create a feed forward neural network
tf.keras.layers.Flatten(input_shape=(28, 28, 1)), # input layer
# tf.keras.layers.Dense is basically implementing: output = activation(dot(input, weight) + bias)
# It takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 3rd hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 4th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 5th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 6th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 7th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 8th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 9th hidden layer
tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 10th hidden layer
# The final layer is no different; we just make sure to activate it with softmax
tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])
We feed the inputs from the input layer into the model; each hidden layer calculates the dot product of its inputs and weights, adds the bias, and applies the ReLU activation function.
The outputs will be compared to the targets. Conceptually, both are treated in one-hot (10-element) form; the loss function will take care of that encoding for us. When creating a classifier, we would like to see the probability of each digit being the right label, so we use a softmax activation function for the output layer.
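For intuition, here is what one hidden layer and the output layer compute, written out in plain NumPy with toy sizes and random weights (a sketch of the operations only, not the actual Keras implementation):
x = np.random.rand(784)                        # one flattened image
W = np.random.randn(784, 50) * 0.01            # weights of a 50-unit hidden layer
b = np.zeros(50)                               # biases
hidden = np.maximum(0, x @ W + b)              # ReLU(dot(input, weights) + bias)
W_out = np.random.randn(50, 10) * 0.01         # weights of the 10-unit output layer
logits = hidden @ W_out + np.zeros(10)
probabilities = np.exp(logits) / np.exp(logits).sum()  # softmax: 10 probabilities summing to 1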
We define the optimizer we'd like to use, the loss function, and the metrics we are interested in obtaining at each iteration.
One of the best choices we have for the optimizer is the adaptive moment estimation (Adam).
We would like to employ a loss function that is suited to classifiers: cross-entropy. Our targets are still plain integers (0-9), because we did not one-hot encode them as a preprocessing step, while the model's output is a 10-element vector of probabilities. Using sparse_categorical_crossentropy handles exactly this situation: it effectively applies the one-hot encoding for us, so the output and target shapes match, as the model and optimizer expect.
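To make the "sparse" part concrete, here is a small illustrative comparison with made-up predictions for three samples and three classes: sparse_categorical_crossentropy takes integer targets directly, while categorical_crossentropy would require them to be one-hot encoded first; both yield the same per-sample losses.
y_true_int = tf.constant([2, 0, 1])                # integer labels, like our MNIST targets
y_true_onehot = tf.one_hot(y_true_int, depth=3)    # what categorical_crossentropy expects
y_pred = tf.constant([[0.1, 0.2, 0.7],
                      [0.8, 0.1, 0.1],
                      [0.2, 0.6, 0.2]])            # softmax-style outputs
print(tf.keras.losses.sparse_categorical_crossentropy(y_true_int, y_pred))
print(tf.keras.losses.categorical_crossentropy(y_true_onehot, y_pred))  # same values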
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
We now train the model we have built. At the end of each epoch, the model is validated on the validation data. When we reach the maximum number of epochs, the training is over.
# Determine the maximum number of epochs
NUM_EPOCHS = 10
# We fit the model, specifying the training data, the total number of epochs,
# and the validation data we just created in the format: (inputs,targets)
model.fit(train_data, epochs=NUM_EPOCHS, validation_data=(validation_inputs, validation_targets), validation_steps=1, verbose=2)
After training on the training data and validating on the validation data, we test the final prediction power of our model by running it on the test dataset that the algorithm has NEVER seen before.
It is very important to realize that fiddling with the hyperparameters based on validation performance gradually overfits the validation dataset.
The test is the absolute final instance, so we test once we are completely done with adjusting our model. If we adjust our model after testing, we will start overfitting the test dataset, which will defeat its purpose.
test_loss, test_accuracy = model.evaluate(test_data)
# We can apply some nice formatting if we want to
print('Test loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))
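As a final illustrative usage example, we can look at the raw softmax output for a few test images; the predicted digit is the index of the highest probability:
for images, labels in test_data.take(1):                         # the single test batch
    probabilities = model.predict(images[:5])                    # 5 rows of 10 probabilities each
    print(np.argmax(probabilities, axis=1), labels[:5].numpy())  # predicted vs. true digits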
After some fine tuning, I decided to brute-force the algorithm and created 10 hidden layers with 5000 hidden units each.
hidden_layer_size = 5000
BATCH_SIZE = 150
NUM_EPOCHS = 10
All activation functions are ReLU.
Due to the width and depth of the network, it took my computer 3 hours and 50 minutes to train it. However, this yielded 97.5% accuracy. Since this is about a percentage point below the validation accuracy, we may have slightly overfit the model.
Some of the results that leading academics achieved on the MNIST (using different methodologies) can be seen here: https://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results
The credit for building the MNIST dataset that was used goes to Yann LeCun, Corinna Cortes, and Christopher Burges.