Wednesday, December 4, 2013

Classifying Handwritten Digits

Six weeks into Andrew Ng's Machine Learning class on Coursera, I found a Kaggle competition to classify handwritten digits that's almost identical in nature to one of the programming assignments. This seemed like a good chance for further practice in implementing neural networks.

I loaded the Kaggle training data and modified code I already had for a 3-layer neural network:

• 1 hidden layer of 100 nodes, regularization parameter lambda = 0.3
• Accuracy of ~ 92% after 100 iterations

To tune the regularization parameter lambda, I plotted several candidate values against their corresponding error rates (% of examples misclassified). With the caveat that I ran only 10 iterations for each value, lambda = 1 appeared to minimize the cross-validation error. Retraining the network with lambda = 1 increased accuracy to ~ 93.5%. Further training seemed only to increase the variance.
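
The lambda sweep described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the competition: `pick_lambda` and the `cv_error_for` callback are hypothetical names, and the callback is assumed to train a network with the given lambda (for example, for 10 iterations) and return its misclassification rate on the cross-validation set.

```python
from typing import Callable, Dict, Sequence, Tuple


def pick_lambda(
    candidates: Sequence[float],
    cv_error_for: Callable[[float], float],
) -> Tuple[float, Dict[float, float]]:
    """Return the candidate lambda with the lowest cross-validation error.

    cv_error_for is a hypothetical callback supplied by the caller: it is
    assumed to train a network with the given lambda and return the
    percentage of cross-validation examples that were misclassified.
    """
    if not candidates:
        raise ValueError("need at least one candidate lambda")
    # Evaluate every candidate once and keep the errors for plotting.
    errors = {lam: cv_error_for(lam) for lam in candidates}
    # The best lambda is the one whose recorded error is smallest.
    best = min(errors, key=errors.get)
    return best, errors
```

Calling `pick_lambda([0.01, 0.03, 0.1, 0.3, 1, 3, 10], cv_error_for)` would return the winning value together with the full error table, which is what gets plotted. The candidate grid here is only an example; the post does not list the values that were tried.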

Larger Networks: accuracy ~ 96%

• 2 hidden layers of 200 nodes each, lambda = 0.01
• Accuracy of ~ 94.5% after 130 iterations
• 1 hidden layer of 500 nodes, lambda = 0.01
• Accuracy of ~ 96.5% after 75 iterations

Of the handful of networks I tried, a single hidden layer of 500 nodes seemed to learn the training examples most efficiently. The 75 iterations completed in about an hour and a half.
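
To get a feel for how different these networks are in size, here is a small sketch that counts the weights in each one. It assumes 784 inputs (28 x 28 pixel images, as in the Kaggle digit data) and 10 output classes; neither number is stated in the post, and `count_weights` is a hypothetical helper.

```python
from typing import Sequence


def count_weights(layer_sizes: Sequence[int]) -> int:
    """Count the weights in a fully connected network.

    Each layer contributes (inputs + 1) * outputs weights; the + 1 is the
    bias term feeding every unit in the next layer.
    """
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


N_INPUTS = 784   # assumption: 28 x 28 pixel images
N_OUTPUTS = 10   # assumption: one output unit per digit, 0 through 9

for hidden in ([100], [200, 200], [500]):
    sizes = [N_INPUTS, *hidden, N_OUTPUTS]
    print(hidden, count_weights(sizes))
# [100] 79510
# [200, 200] 199210
# [500] 397510
```

Under these assumptions, the single layer of 500 nodes has roughly twice as many weights as the two layers of 200, most of them in the connections from the input pixels.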

As with the initial network, the regularization parameter lambda was chosen by minimizing the error on the cross-validation data. After 75 iterations, the network had almost perfect accuracy on the training data (>99.5%), but its accuracy on the cross-validation data and on the held-out test data hovered around 96%.
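
The comparison behind that observation is simple enough to sketch. The helpers below are hypothetical and assume predictions and labels are plain sequences of digit labels; a large positive gap between training and cross-validation accuracy is the usual sign of high variance (overfitting).

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the percentage of predictions that match the true labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("need at least one example")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


def generalization_gap(train_accuracy: float, cv_accuracy: float) -> float:
    """Return training accuracy minus cross-validation accuracy, in points."""
    return train_accuracy - cv_accuracy
```

With the rounded figures from this post, `generalization_gap(99.5, 96.0)` comes to 3.5 percentage points.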

At this point, the network appeared large enough to learn the training data; it just wasn't generalizing well enough to new data. I tried to lower the variance by running more iterations and by adding the previously held-out test data to the training set, with little success. A few last thoughts:

• I'm not sure how to reason about the optimal size and structure of a neural network, given a dataset.  There must be a good way to approach this by taking a subset of data (for reasonable processing time) and running through a number of informed guesses.
• I calibrated lambda using random initial parameters.  Does it help to reevaluate the regularization along the way, as the network is trained more and becomes more prone to overfitting?
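
The first idea, trying a number of informed guesses on a subset of the data, might look something like the sketch below. It is only one possible approach, not something that was run for this post: `subsample`, `rank_architectures`, and the `cv_error_for` callback are all hypothetical, and the callback is assumed to train a network with the given hidden layer sizes on the subset and return its cross-validation error.

```python
import random
from typing import Callable, List, Sequence, Tuple


def subsample(examples: Sequence, fraction: float, seed: int = 0) -> List:
    """Return a random subset of the examples, without replacement."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so every run sees the same subset
    k = max(1, int(len(examples) * fraction))
    return rng.sample(list(examples), k)


def rank_architectures(
    candidates: Sequence[List[int]],
    train_subset: Sequence,
    cv_error_for: Callable[[List[int], Sequence], float],
) -> List[Tuple[float, List[int]]]:
    """Score each list of hidden layer sizes and sort from best to worst.

    cv_error_for is a hypothetical callback: it is assumed to train a
    network with the given hidden layers on train_subset and return the
    resulting cross-validation error (lower is better).
    """
    results = [(cv_error_for(hidden, train_subset), hidden) for hidden in candidates]
    return sorted(results, key=lambda pair: pair[0])
```

Because every candidate is trained on the same small subset, the ranking is cheap to compute, though there is no guarantee it carries over to the full dataset.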

Submission: accuracy ~ 96%

A standing of 170th place is, of course, nothing to write home about, but it's gratifying after just six weeks of coursework.