This post jots down some key concepts from the book “Deep Learning” by Ian Goodfellow et al. and the “Structuring Machine Learning Projects” and “Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization” courses by Andrew Ng.

**Under/overflow** are numerical computing errors caused by the finite precision a data type’s bit count provides. Underflow is when a number close to zero is rounded down to zero. Overflow is when a large-magnitude number is wrongly approximated as infinity (not defined).

The *softmax* function (used to predict probabilities in multinoulli distributions) is an example of a function sensitive to under/overflow.
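As a sketch of the standard fix: subtracting the maximum input before exponentiating leaves the softmax result unchanged mathematically, but keeps every exponent at or below zero so `exp` cannot overflow.

```python
import numpy as np

def stable_softmax(z):
    # Shifting by max(z) does not change the result, since the shift
    # cancels between numerator and denominator, but it prevents
    # exp() from overflowing for large inputs.
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([1000.0, 1000.0])   # naive exp(1000) would overflow to inf
print(stable_softmax(z))         # [0.5 0.5]
```

The naive version `np.exp(z) / np.exp(z).sum()` returns `nan` for this input.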

**Poor conditioning** is a characteristic of a function whose output changes drastically under small changes (e.g., due to rounding) in its inputs. An example given in the book: when the ratio of a matrix’s largest to smallest eigenvalue magnitude is large, matrix inversion is very sensitive to error in the input.

**Lipschitz continuous** is a property of a function whose rate of change is bounded by the Lipschitz constant L. This property helps guarantee convergence during gradient descent, and functions can often be slightly adjusted to satisfy it.
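Written out, the standard definition is:

```latex
|f(x) - f(y)| \le L \, \lVert x - y \rVert \qquad \text{for all } x, y
```

so the function cannot change faster than L times the distance between its inputs.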

Convex functions have a positive semi-definite Hessian everywhere. The Hessian is the Jacobian of the gradient, i.e., the matrix of second partial derivatives.
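A small sketch of the check, using the convex quadratic f(x) = x₀² + 3x₁² (my example, not from the book): its Hessian is the constant matrix diag(2, 6), and all of its eigenvalues are non-negative, confirming positive semi-definiteness.

```python
import numpy as np

def hessian_f(x):
    # Second partial derivatives of f(x) = x0**2 + 3*x1**2,
    # computed by hand; for a quadratic the Hessian is constant.
    return np.array([[2.0, 0.0],
                     [0.0, 6.0]])

H = hessian_f(np.zeros(2))
eigvals = np.linalg.eigvalsh(H)
print(eigvals)                   # [2. 6.] -- all >= 0, so PSD, so f is convex
```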

A data set in a machine learning problem is divided into training, development (validation) and test sets. While building the algorithm, the adjustments to make vary according to which of these sets the algorithm is performing poorly on:

- To fit the training set on the cost function:
  - bigger network
  - different optimization algorithms such as Adam or RMSProp
- To fit the dev set:
  - regularize
  - bigger training set
- To fit the test set:
  - bigger dev set
- To fit real-life data:
  - change the data set or change the cost function

When evaluating, we usually use recall and precision, but if we need a single-value metric we can combine the two with the F1 score, their harmonic mean. Some metrics need to be optimized, while for others it suffices to merely satisfy a threshold.
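The combination can be sketched from raw counts (hypothetical numbers, function name my own):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)   # of all positive predictions, how many were right
    recall = tp / (tp + fn)      # of all actual positives, how many were found
    return 2 * precision * recall / (precision + recall)

# e.g. 8 true positives, 2 false positives, 4 false negatives:
# precision = 0.8, recall = 2/3, F1 ~ 0.727
print(round(f1_score(8, 2, 4), 3))
```

Note how the harmonic mean is dragged toward the weaker of the two numbers, which is the point of using it over an arithmetic mean.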

We assume that a machine learning algorithm’s performance is capped by human-level performance (otherwise we cannot evaluate it, because we do not know the ground truth). When improving performance, one can

- get more labeled data from humans
- better insight on error analysis
- bias/variance analysis (bias is the error gap between training and human-level performance, while variance is the gap between training and dev performance)

Error analysis is done manually, by inspecting false positives and false negatives and recording statistics about the types of errors (which class is confused with which?).

If we suffer from data mismatch (a difference between the training and real-life data distributions), it might help to add noise to the training data so that it better resembles the real-life data.

Transfer learning is when we use a large data set to initialize the network, then retrain it on a small, otherwise insufficient data set of interest. In practice, we train on the large data set, replace or add the last layers, and retrain those on the new data.
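A minimal sketch of the idea, with entirely hypothetical toy data and weights: the first layer stands in for the pretrained network and stays frozen, while a new last layer is trained from scratch on the small target set.

```python
import numpy as np

# Toy target data set: label is the sign of the first input feature.
X = np.array([[-2.0,  1.0],
              [-1.0, -1.0],
              [ 1.0,  2.0],
              [ 2.0, -1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

W1 = np.eye(2)                       # stands in for pretrained weights (frozen)
w2 = np.zeros(2)                     # new last layer, trained from scratch

def features(X):
    return np.maximum(X @ W1, 0.0)   # frozen ReLU layer: W1 is never updated

F = features(X)
for _ in range(500):                 # logistic-regression updates on w2 only
    p = 1.0 / (1.0 + np.exp(-F @ w2))
    w2 -= 0.5 * F.T @ (p - y) / len(y)

acc = ((1.0 / (1.0 + np.exp(-F @ w2)) > 0.5) == y).mean()
print(acc)
```

Only `w2` receives gradient updates; in a real framework the same effect comes from marking the pretrained layers as non-trainable.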

An epoch is one full training pass through the whole training set.

In mini-batch gradient descent, instead of doing gradient descent on the whole training data at once, the training set is divided into subsets and a gradient descent step is performed on one subset at a time. The loss, instead of decreasing steadily, trends downwards with noisy oscillations.

If the mini-batch size is 1, it is called stochastic gradient descent. Stochastic gradient descent never fully converges; it keeps oscillating around the minimum. We also lose all the speedup from vectorisation. At the other extreme, batch gradient descent is disadvantageous when the data set is very large, since each single step takes too much time.
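The loop structure can be sketched on a toy linear-regression problem (hypothetical sizes and learning rate); batch size 256 would give batch gradient descent, 1 would give stochastic gradient descent:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                              # noise-free toy labels

def minibatch_sgd(batch_size, epochs=20, lr=0.1):
    w = np.zeros(3)
    n = len(X)
    for _ in range(epochs):                 # one epoch = one full pass
        perm = rng.permutation(n)           # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            # squared-error gradient on this mini-batch only
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

w = minibatch_sgd(batch_size=32)
print(np.round(w, 2))                       # close to [ 1. -2. 0.5]
```

Each epoch here performs 256/32 = 8 noisy but cheap update steps instead of one exact, expensive full-batch step.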