#### Quick Start: accuracy ~ 93.5%

I loaded the Kaggle training data and modified code I already had for a 3-layer neural network:

- 1 hidden layer of 100 nodes, regularization parameter lambda = 0.3
- Accuracy of ~ 92% after 100 iterations
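The setup above can be sketched in NumPy. Everything here is a hypothetical reconstruction, not the actual code: sigmoid activations, batch gradient descent with L2 regularization, and a tiny synthetic two-class dataset stand in for the Kaggle MNIST CSV (784 inputs, 10 output classes).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, hidden=100, lam=0.3, iters=100, lr=0.5, seed=0):
    """Batch gradient descent on a 1-hidden-layer (3-layer) network."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = y.shape[1]                      # number of classes (one-hot labels)
    W1 = rng.normal(0, 0.1, (d, hidden))
    W2 = rng.normal(0, 0.1, (hidden, k))
    for _ in range(iters):
        # forward pass
        A1 = sigmoid(X @ W1)
        A2 = sigmoid(A1 @ W2)
        # backward pass (cross-entropy-style delta at the sigmoid output)
        d2 = A2 - y
        d1 = (d2 @ W2.T) * A1 * (1 - A1)
        # gradient step with L2 regularization term (lambda/n * W)
        W2 -= lr * (A1.T @ d2 / n + lam / n * W2)
        W1 -= lr * (X.T @ d1 / n + lam / n * W1)
    return W1, W2

def predict(X, W1, W2):
    return np.argmax(sigmoid(sigmoid(X @ W1) @ W2), axis=1)

# Tiny synthetic two-class problem in place of the 42,000 MNIST rows.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 0.3, (50, 4)), rng.normal(1, 0.3, (50, 4))])
labels = np.array([0] * 50 + [1] * 50)
y = np.eye(2)[labels]

W1, W2 = train(X, y, hidden=10, lam=0.3, iters=200)
acc = (predict(X, W1, W2) == labels).mean()
```

The real run simply scales this up: 784 input features, 100 hidden nodes, 10 output classes, and lambda = 0.3.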

#### Larger Networks: accuracy ~ 96%

- 2 hidden layers of 200 nodes each, lambda = 0.01
- Accuracy of ~ 94.5% after 130 iterations

- 1 hidden layer of 500 nodes, lambda = 0.01
- Accuracy of ~ 96.5% after 75 iterations

Of the handful of networks I tried, the single hidden layer of 500 nodes seemed to learn the training examples most efficiently. The 75 iterations completed in about an hour and a half.

As with the initial network, the regularization parameter lambda was chosen by minimizing the error against cross validation data. After 75 iterations, the network had almost perfect accuracy on the training data (>99.5%), but the accuracy on cross validation and segregated test data hovered around 96%.
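The lambda calibration amounts to training once per candidate value and keeping the value with the lowest cross-validation error. As a self-contained sketch (the candidate grid, the stand-in logistic-regression model, and the synthetic data are all hypothetical):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lam, iters=300, lr=0.5):
    """L2-regularized logistic regression as a small stand-in model."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        g = X.T @ (sigmoid(X @ w) - y) / n + lam / n * w
        w -= lr * g
    return w

def error(X, y, w):
    """Fraction of misclassified examples."""
    return ((sigmoid(X @ w) > 0.5) != y).mean()

# Noisy synthetic data split into training / cross-validation sets.
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (300, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(float)
Xtr, ytr, Xcv, ycv = X[:200], y[:200], X[200:], y[200:]

# Train once per candidate lambda; keep the lowest CV error.
candidates = [0.0, 0.01, 0.1, 0.3, 1.0, 3.0, 10.0]
cv_err = {lam: error(Xcv, ycv, fit_logreg(Xtr, ytr, lam)) for lam in candidates}
best_lam = min(cv_err, key=cv_err.get)
```

With the full network the loop is the same, just with each candidate requiring a full training run.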

At this point, the network appeared sufficiently large to learn the training data -- it just wasn't generalizing well enough for new data. I tried to lower the variance by running more iterations and including the previously segregated test data as training data, with little success. A few last thoughts:

- I'm not sure how to reason about the optimal size and structure of a neural network, given a dataset. There must be a good way to approach this by taking a subset of data (for reasonable processing time) and running through a number of informed guesses.

- I calibrated lambda using random initial parameters. Does it help to reevaluate the regularization along the way, as the network is trained more and becomes more prone to overfitting?
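One way to act on the first thought above: train each candidate architecture on a small subset of the data and compare cross-validation errors before committing to a full run. This is a hypothetical sketch with synthetic data; the training loop mirrors the gradient descent described earlier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_nn(X, y, hidden, lam=0.01, iters=150, lr=0.5, seed=0):
    """1-hidden-layer network trained with batch gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    k = y.shape[1]
    W1 = rng.normal(0, 0.1, (d, hidden))
    W2 = rng.normal(0, 0.1, (hidden, k))
    for _ in range(iters):
        A1 = sigmoid(X @ W1)
        A2 = sigmoid(A1 @ W2)
        d2 = A2 - y
        d1 = (d2 @ W2.T) * A1 * (1 - A1)
        W2 -= lr * (A1.T @ d2 / n + lam / n * W2)
        W1 -= lr * (X.T @ d1 / n + lam / n * W1)
    return W1, W2

def cv_error(Xcv, labels_cv, W1, W2):
    pred = np.argmax(sigmoid(sigmoid(Xcv @ W1) @ W2), axis=1)
    return (pred != labels_cv).mean()

# Synthetic two-class data, shuffled, standing in for an MNIST subset.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 0.5, (60, 6)), rng.normal(1, 0.5, (60, 6))])
labels = np.array([0] * 60 + [1] * 60)
idx = rng.permutation(120)
X, labels = X[idx], labels[idx]
y = np.eye(2)[labels]

# Small subset for fast training; the remainder gives the CV error.
Xsub, ysub = X[:80], y[:80]
Xcv, lcv = X[80:], labels[80:]
errors = {h: cv_error(Xcv, lcv, *train_nn(Xsub, ysub, h)) for h in (5, 20, 50)}
best_hidden = min(errors, key=errors.get)
```

The subset keeps each candidate run cheap; the winning size can then be retrained on the full data.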