You split your data into training and testing sets so that you can both teach the neural network and check what it has learned. The training set lets the network learn the rules that cause an object to be identified with one label rather than another, i.e. what makes a shirt identifiable as a shirt and not a pair of pants. Each image is fed in as a grid of pixel values, and the network learns which patterns in those values go with which label. Once the network has learned these rules, it is shown data that it has never seen before: the testing data. There would be no point in testing the network's knowledge on the data it was trained on, because it has already seen that data and been told its labels. It would be like pointing at a shirt, telling a person that it is a shirt, and then asking them what type of clothing it is.
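The split described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `train_test_split`, the 20% test fraction, and the stand-in data are all hypothetical, not taken from the assignment.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle (image, label) pairs and split them into train and test lists.

    Hypothetical helper for illustration; the 20% test fraction is an assumption.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test items are held back for testing; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in data: ten fake "images" named by number, each with a made-up label.
data = [(f"image_{i}", i % 2) for i in range(10)]
train, test = train_test_split(data)

# No item appears in both sets, so the test set is genuinely unseen data.
overlap = set(train) & set(test)
```

Because the two lists never share an item, accuracy on `test` measures how well the learned rules carry over to new images rather than how well the network memorized the training ones.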
The purpose of the relu function is to set all negative outputs to zero while passing positive outputs through unchanged. This is done so that the overall results are not skewed or pulled downward: if negative values were kept, they could cancel out positive values when the outputs are summed in the next layer. The softmax function converts the outputs of the last layer into probabilities, each between zero and one, that add up to one. The most likely label for the data is then simply the one with the largest probability. There are 10 neurons in the last layer of the neural network because there are 10 labels that the data could be assigned, one neuron per label.
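As a rough sketch of what these two functions compute, here are plain-Python versions. The function names and the input values are made up for illustration; a real network would use the library's built-in versions.

```python
import math

def relu(values):
    """Replace every negative value with 0.0 and keep the rest unchanged."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up example values.
hidden = relu([-2.0, 0.5, 3.0])        # the negative entry becomes 0.0
probs = softmax([1.0, 2.0, 3.0])       # three probabilities that sum to 1
predicted = probs.index(max(probs))    # index of the most likely label
```

Softmax keeps the ordering of the raw scores, so the largest score always ends up with the largest probability.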
The optimizer function repeatedly adjusts the weights inside the network; the pixel values themselves do not change. As this is done over time, the network's predictions on the training data fit the labels more accurately. The loss function is used to calculate how good or how bad an answer is. The answer being evaluated by the loss function is the set of probabilities the network outputs, which is compared with the correct label. The sparse_categorical_crossentropy loss function is used because what is being predicted here are categories, and the labels are stored as integers rather than as one-hot vectors.
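The relationship between the loss and the optimizer can be sketched as follows. This is a simplified illustration, not how the library implements it: the loss is written for a single example, and the "optimizer" is plain gradient descent on a toy one-weight problem chosen only for this example.

```python
import math

def sparse_categorical_crossentropy(probs, label):
    """Loss for one example: small when the correct label gets high probability."""
    return -math.log(probs[label])

def gradient_step(weight, gradient, learning_rate=0.1):
    """One optimizer update: move the weight a little against the gradient."""
    return weight - learning_rate * gradient

# Made-up predictions for an example whose correct label is 1.
good_loss = sparse_categorical_crossentropy([0.05, 0.90, 0.05], label=1)
bad_loss = sparse_categorical_crossentropy([0.90, 0.05, 0.05], label=1)

# Toy problem (an assumption for illustration): minimize (w - 3)^2,
# whose gradient with respect to w is 2 * (w - 3).
w = 0.0
for _ in range(50):
    w = gradient_step(w, 2.0 * (w - 3.0))
```

The confident correct prediction gets a much lower loss than the confident wrong one, and the repeated small updates move `w` toward the value that minimizes the toy loss.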
There are 50,000 images in the training set, each 28 by 28 pixels.
There are 10 possible labels, corresponding to the options 0-9.
There are 10,000 images in the test set, each 28 by 28 pixels.
Probability output of the model for the image at index 30 in the x_test set. The digit at that index had its highest probability on the fourth label.
The fourth label (index 3, since the labels start at 0) corresponds to the digit “3”, so “3” is the model's most likely prediction for the image at index 30.
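Reading the predicted digit off a probability output can be sketched like this. The probability values below are invented for illustration and are not the model's actual output for index 30.

```python
# Hypothetical probabilities for one image, one entry per label 0-9.
probabilities = [0.01, 0.01, 0.02, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

# The predicted label is the index holding the largest probability.
predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)

# Labels start at 0, so index 3 is the fourth label and stands for the digit 3.
position = predicted_index + 1
```

Here `predicted_index` is 3 and `position` is 4, which is why the fourth label and the digit “3” refer to the same prediction.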