Deep Learning — Week 3, Part 1

In the last week, we took a plunge into the core concepts of Deep Learning and the framework of a Neural Network. We also touched upon the basics of an objective function and its use, the kinds of objective functions and how Gradient Descent plays a pivotal role in rectifying the mistakes/error while predicting a value.

In this article, we shall feast our mind and thought on :

  1. Principal Component Analysis (PCA)
  2. Single Value Decomposition (SVD).
  3. Types of Gradient Descent.

1. Principal Component Analysis

Principal component analysis is a dimensionality reduction method that helps in tackling the dimensions in large data sets. The main idea behind PCA is to transform a large set of variables into a smaller one which however, still holds the information from the large variables. The reduction often would trade blows with the accuracy. The trick lies in the fact that only a small percentage of this accuracy needs to be traded off for simplicity of variables. There are a few steps involved in PCA, they are :

i) Standardization

As the name suggests, the aim is to make the range of the continuous variables between certain values so that each one of them contributes some value within the same range that the other variables contribute to. PCA happens to be very sensitive to the variances of initial variables. What that means is, if there are huge differences between the ranges of the initial variables, those large variables ( with larger values ) will dominate over the small ones. Standardization is achieved by :

This brings all variables at the same level, with each contributing values within a certain range/scale.

ii) Covariance Matrix

Covariance is a metric to judge the strength of dependence or relationship between variables/attributes. This step is quite crucial, as it eliminates those variables having a high correlation but contain redundant information. Let’s assume a 3-dimensional dataset with variables x, y and z. The matrix can be listed out as :

Covariance of a variable with itself is its variance. Hence, the primary diagonal becomes the variances of x , y and z. Covariance is also commutative, hence the lower and upper triangle entries will be symmetric along the diagonal.

iii) Eigen vectors and Eigen Values

This step might be the main and crucial step towards a successful PCA calculation. Principal Components are these new variables constructed as linear combinations of the larger variables. These variables are calculated in such a way so as to minimize any sort of correlation between them and information is maintained. Let’s assume a 10 dimensional dataset. PCA will yield 10 principal components but majority of the information would be squeezed into the first component and then gradually onto the rest, as shown below :

An important thing to understand here is that these components convey no sense or meaning to an analysis. On a mathematical note, these components represent the directions of data with maximum variance — lines that capture more data/information. Principal components can be thought as new axes that provide an optimum angle to visualize and evaluate data and fit it efficiently over a model.

How does PCA construct these components?

Let’s assume a 2-dimensional dataset with 2 variables x , y. The eigen vectors and eigen values are as follows :

If we rank the eigen values in descending value, λ1>λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.

After having calculated the components, compute the percentage of variance accounted for , by each component and then divide the eigen value of each component by the total. PC1 in the above example accounts for 96% of the variance while PC2 will account for 4%

In the next step, one chooses whether to keep all components, or discard the ones with lower eigen values. The remaining vectors are clubbed together to form a feature vector. In the above example, if we drop v2, we’re only left with v1 as the main data matrix which constitutes for 96% of our data.

The last step is to recast the data along the principal components axes. This is done to reorient the data from the original axes( dimensions ) to orient it through the principal axes.

2. Single Value Decomposition

Single Value Decomposition deals with the method of decomposing a matrix into 3 matrices such that :

  • A is an m × n matrix
  • U is an m × n orthogonal matrix
  • S is an n × n diagonal matrix
  • V is an n × n orthogonal matrix

Relation between these matrices is as follows :

Suppose we have two, two-dimensional vectors, x₁=(x, y), and x₂=(x₂, y₂). We can fit an ellipse with major axis, a, and minor axis, b, to these two vectors as shown in the figure. But to make things easier on ourselves and save typing, we write out the equations using matrix algebra.

We can construct an ellipse of any size and orientation by stretching and rotating a unit circle.

Let x’=(x’, y’) be the transformed coordinates:

where R is a rotation matrix:

If we write this out term-by-term, based on the formula :

For a 2-Dimensional dataset, we’d get :

What we need to understand is that this rotation is clockwise. The equation for a unit circle would be :

If the equation is simplified a little more, we’d get :

Finally, if we go back to the original equation :

The singular values are the axes of the least squares fitted ellipsoid.

A — collection of points.

V — Orientation of the ellipsoid.

U — projections of matrix A on the axes that are now defined by the single values.

Do let us know, if you want a detailed explanation on the applications of SVD and how it works in the background.

3. Gradient Descent

In the previous article , we’d covered about Gradient Descent and how it works. Here, we shall look into the 3 major kinds of gradient descent algorithms.

i) Batch Gradient Descent

The most straightforward and common type of gradient descent is the batch gradient descent. It calculates the error for each example in the training set. After all examples are covered, the parameters are updated. This process is called a training epoch. The advantage of this form is that it isn’t a heavy load on a system and is quite efficient. It produces a stable error gradient with no spikes in values after every epoch. However, the main disadvantage lies in the fact that the training set must reside on the memory and must be always available to the algorithm; this results in slow access speeds and time complexity.

ii) Stochastic Gradient Descent

The Stochastic Gradient Descent updates the parameters according to the gradient/error from a single training example. Quite obviously, the frequent updation of parameters gives a very detailed rate of improvement that can be tallied better than the batch gradient descent. However, one must note that, frequent updation gives room for more computations, which increases the cost, hence making it more expensive over the Batch Descent. This frequent updation might also result in noisy gradients which cause errors to spike up or down, instead of decreasing.

iii) Mini Batch Gradient Descent

A combination of both , Stochastic and Batch, this method simply breaks the training set into smaller groups/batches and performs an update on all of these. This combines the efficiency of Batch Descent and reliability of Stochastic Descent. This method is highly used in Deep Learning Situations.

Conclusion

The motive of the article is to describe in brief the 3 mentioned topics. Do comment and let us know if you need a more detailed article on any of the topics, we’d be happy to oblige.

In part 2 of this article, we shall deal with Auto Encoders.

Cheers,
Team FACE

We are the departmental forum for Computer Science & Engineering, at Amrita School of Engineering , Bengaluru.