Step 8:  

Repeat the process to add two more convolution layers, increasing the number of filters in each subsequent layer. This is useful because as we move up the layer hierarchy, the features to be encoded become more complex, so more filters give a better representation of those complex features in the original image. It's common to double the filter count at each layer because GPUs handle power-of-two sizes more efficiently; you don't have to double it, but doubling is the fast, conventional choice.
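
The doubling pattern used in the network below can be written out directly. This is just a minimal sketch of the arithmetic (the base count of 8 matches the first convolution layer in the code that follows):

numFilters = 8 * 2.^(0:2)    % returns [8 16 32], the filter counts for the three layers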

clear all;
close all;
clc;


[XTrain,YTrain] = digitTrain4DArrayData;
size(XTrain)      %images
size(YTrain)      %correct answer labels


XTrain=1-XTrain;  % Reverse the black and white colors.  Save and run the program to see the difference.

perm = randperm(size(XTrain,4),20);  % Pick 20 random image indices from XTrain
for i = 1:20
    subplot(4,5,i);
    imshow(XTrain(:,:,:,perm(i)));
end
 

layers = [
    imageInputLayer([28 28 1])                  % 28-by-28 grayscale images

    convolution2dLayer(3,8,'Padding','same')    % 8 filters, 3-by-3
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)             % 28x28 -> 14x14

    convolution2dLayer(3,16,'Padding','same')   % 16 filters (doubled)
    batchNormalizationLayer
    reluLayer
    maxPooling2dLayer(2,'Stride',2)             % 14x14 -> 7x7

    convolution2dLayer(3,32,'Padding','same')   % 32 filters (doubled again)
    batchNormalizationLayer
    reluLayer
    averagePooling2dLayer(7)                    % 7x7 -> 1x1 (mean pooling)
    % The layer array is completed (and closed with ];) after the
    % remaining layers are added in the next step.
 
Notice that after the last convolution layer we used average (mean) pooling instead of max pooling. Either is acceptable. Because the two earlier max-pooling layers have already reduced the 28-by-28 input to 7-by-7 feature maps, a 7-by-7 average pool collapses each feature map to a single value: a 7-to-1 reduction in each spatial dimension.
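
To see what that layer computes, here is a minimal sketch (the matrix values are illustrative only): with a 7-by-7 input and a 7-by-7 window, average pooling simply takes the mean of the whole map.

A = rand(7);     % stand-in for one 7-by-7 feature map
mean(A(:))       % the single value averagePooling2dLayer(7) would produce for this map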

In the next step you will add a fully connected layer. Use the help command to see how to set it up before you look at the next step (help fullyConnectedLayer).