Neural Network Basics

These days there are more deep learning resources and tutorials out there than one can count! Still, it would be remiss for a blog such as this to gloss over the basics. Let us quickly run through the fundamental ideas behind artificial neural networks.

Read More

Classification

I realize that the order of posts here seems to lack rhyme or reason. I have no justification to offer, but these posts are better late than never! Here we lay another block of the foundation by discussing classification, logistic regression, and finally, generalized linear models.

Read More

Regression

Let’s do a quick review of basic regression, to lay the framework for future posts. The goal of regression, one of the principal problems in supervised learning, is to predict the value(s) of one or more continuous target variables $y$, given a corresponding vector of input variables $x$.
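As a minimal sketch of the simplest case - fitting a linear model by least squares - here is a NumPy example (the synthetic data and parameter values are purely illustrative):

```python
import numpy as np

# Illustrative synthetic data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

# Design matrix with a bias column; solve for the weights by least squares
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.0, 2.0]
```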

Read More

Probabilistic PCA

Today let’s return to principal component analysis (PCA). Previously we saw how PCA can be expressed as a variance-maximizing projection of the data onto a lower-dimensional space, and how that is equivalent to a reconstruction-error-minimizing projection. Now we will show that PCA is also the maximum likelihood solution to a continuous latent variable model (sketched below), which brings several useful benefits:

  1. It can be solved efficiently with expectation-maximization (EM), avoiding explicit construction of the covariance matrix (though regular PCA can also sidestep this via singular value decomposition (SVD)).
  2. It handles missing values in the dataset.
  3. It permits a Bayesian treatment of PCA.
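For reference, the generative model in question (in the usual notation, with latent $\mathbf{z}$, weight matrix $\mathbf{W}$, mean $\boldsymbol{\mu}$, and noise variance $\sigma^2$) is

$$ \mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \qquad \mathbf{x} \mid \mathbf{z} \sim \mathcal{N}(\mathbf{W}\mathbf{z} + \boldsymbol{\mu}, \sigma^2 \mathbf{I}), $$

so that marginally $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{W}\mathbf{W}^\top + \sigma^2 \mathbf{I})$.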
Read More

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a generative probabilistic model for discrete data. The objective is to find a lower-dimensional representation of the data while preserving its salient statistical structure - a complicated way to describe clustering. It is commonly used in NLP applications as a topic model, where we are interested in discovering the common topics in a set of documents.
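As a quick sketch of LDA in action as a topic model (using scikit-learn; the toy corpus and the choice of two topics are purely illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus: two "pets" documents and two "finance" documents
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors traded stocks and bonds",
]

# LDA operates on word counts, not raw text
counts = CountVectorizer().fit_transform(docs)

# Fit a two-topic model; each document gets a distribution over topics
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics)  # rows sum to 1: per-document topic mixtures
```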

Read More

Minimum Description Length

The minimum description length (MDL) principle is an approach to model selection. It is underpinned by the beautifully simple idea of learning as compression: any pattern or regularity in the data can be exploited to compress it. Hence we can equate the two concepts - the more we can compress the data, the more we know about it!
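In its simplest (two-part code) form, MDL says to choose the hypothesis $H$ minimizing

$$ L(H) + L(D \mid H), $$

where $L(H)$ is the length, in bits, of the description of the hypothesis itself, and $L(D \mid H)$ is the length of the data when encoded with the help of the hypothesis. A complex model pays a price in the first term; a poor fit pays a price in the second.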

Read More

Expectation Maximization

In the last two posts we have seen two examples of the expectation maximization (EM) algorithm at work, finding maximum-likelihood solutions for models with latent variables. Now we derive EM for general models, and demonstrate how it maximizes the log-likelihood.
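The key identity behind the derivation: for any distribution $q(\mathbf{Z})$ over the latent variables,

$$ \ln p(\mathbf{X} \mid \boldsymbol{\theta}) = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p), \qquad \mathcal{L}(q, \boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})}, $$

where the KL term measures the divergence between $q(\mathbf{Z})$ and the posterior $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})$. The E step maximizes the bound $\mathcal{L}$ with respect to $q$, and the M step maximizes it with respect to $\boldsymbol{\theta}$.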

Read More

Gaussian Mixtures and EM

Continuing from last time, I will discuss Gaussian mixtures as another clustering method. We assume that the data is generated from several Gaussian components with separate parameters, and we would like to assign each observation to its most likely Gaussian parent. It is a more flexible and probabilistic approach to clustering, and will provide another opportunity to discuss expectation-maximization (EM). Lastly, we will see how K-means is a special case of Gaussian mixtures!
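As a sketch of what this looks like in practice with scikit-learn (the synthetic two-blob data and the component count are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: two well-separated Gaussian blobs in 2-D
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(100, 2)),
])

# Fit a two-component mixture by EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)       # hard assignment to the most likely component
resp = gmm.predict_proba(X)   # soft assignments (responsibilities)
```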

Read More

K-means and EM

Clustering is an unsupervised learning problem in which we try to identify groupings of similar data points, i.e. learn the structure of our data. Today I will introduce K-means, a popular and simple clustering algorithm. Our true motivation will be to use this as a gentle introduction to clustering and the expectation maximization (EM) algorithm. In subsequent posts we will build on this foundation, moving to Gaussian mixtures and finally to latent Dirichlet allocation (LDA).
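To make the algorithm concrete, here is a minimal NumPy sketch of the standard (Lloyd's) iteration - an assignment step followed by an update step (the function and variable names are my own, and it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """A minimal sketch of Lloyd's algorithm for K-means."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```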

Read More

Due-tos

This post is a throwback to the methodology behind one of my first analytics projects at System1. The due-to is a simple name for a simple idea - isolating the effects of individual key performance indicators (KPIs) on a business metric, like gross profit. Sometimes - most times, even - data science doesn’t have to be that sophisticated.

Read More

Principal Component Analysis

Principal component analysis (PCA) is a widely used technique for dimensionality reduction, which in turn enables practical applications such as compression and data visualization - problems that, at their core, reduce to dimensionality reduction.
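As a minimal sketch, PCA can be implemented in a few lines of NumPy by projecting the centered data onto its top singular directions (the function name and interface are illustrative):

```python
import numpy as np

def pca(X, n_components):
    """A minimal PCA sketch: project centered data onto the top principal axes."""
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T
```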

Read More

Singular Value Decomposition

Singular value decomposition (SVD) is a crucially important method underpinning the mathematics behind all kinds of applications in machine learning. At its core, it is a linear algebra technique that separates any matrix $A$ into simple pieces.
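Concretely, for any real $m \times n$ matrix $A$ of rank $r$,

$$ A = U \Sigma V^\top = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^\top, $$

where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal with nonnegative singular values $\sigma_i$: a sum of simple rank-one pieces.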

Read More

Ensemble Learning

Ensemble methods express the simple and effective idea of strength in numbers. They combine multiple models together in order to improve performance over a single model in isolation. Here we will discuss some common methods in use today.
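As one concrete flavor, here is a sketch of bagging with scikit-learn - averaging many trees trained on bootstrap resamples of the data (the synthetic dataset and estimator counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic classification problem
X, y = make_classification(n_samples=500, random_state=0)

# A single tree vs. a bag of 50 trees (BaggingClassifier bags trees by default)
single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(n_estimators=50, random_state=0)

print(cross_val_score(single, X, y).mean())  # single-model accuracy
print(cross_val_score(bagged, X, y).mean())  # typically higher for the ensemble
```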

Read More

Nonparametric Density Estimation

While trying to make good on my promise of another example of kernels - kernel density estimation - I thought it made sense to also discuss some simple nonparametric density estimation methods more generally.
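As a preview, here is a minimal NumPy sketch of a Gaussian kernel density estimate - one kernel "bump" per sample, averaged (the function name and bandwidth are illustrative):

```python
import numpy as np

def gaussian_kde(samples, xs, bandwidth=0.3):
    """A minimal kernel density estimate: average one Gaussian bump per sample."""
    # One kernel centered at each sample, evaluated at each query point
    diffs = (xs[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    # Average over samples and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth

# Illustrative usage on a bimodal sample
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
xs = np.linspace(-6, 6, 100)
density = gaussian_kde(samples, xs)
```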

Read More

Kernels

Here I’ll briefly introduce the idea of kernels, a powerful mathematical concept that allows us to easily incorporate nonlinearity in models.
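In brief, a kernel computes an inner product in some feature space without ever constructing the features explicitly:

$$ k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^\top \boldsymbol{\phi}(\mathbf{x}'), $$

as with the Gaussian (RBF) kernel $k(\mathbf{x}, \mathbf{x}') = \exp\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2\right)$, whose implicit feature space is infinite-dimensional.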

Read More

Support Vector Machines

Continuing with the theme of the last post, here I’ll cover support vector machines (SVMs) from the ground up. Relatively speaking, there are many more resources out there on this - most discussions of SVMs do cover the geometry, happily - but it is nevertheless worth deriving from scratch and comparing to our earlier result.

Read More

Geometry of Logistic Regression

It always helps to have an intuition for the geometric meaning of a model. This is typically emphasized for common models such as linear regression, but there are relatively few such discussions of logistic regression. It is especially interesting here to draw the distinction between logistic regression and support vector machines.

Read More