Deep Learning — Week 3, Part 1

Last week, we took a plunge into the core concepts of Deep Learning and the framework of a neural network. We also touched upon the basics of an objective function and its use, the kinds of objective functions, and how Gradient Descent plays a pivotal role in reducing the error a model makes while predicting a value.

In this article, we shall feast our minds on:

  1. Principal Component Analysis (PCA)
  2. Singular Value Decomposition (SVD)
  3. Types of Gradient Descent

1. Principal Component Analysis

i) Standardization

This brings all variables to the same scale, typically zero mean and unit variance, so that no variable dominates the analysis simply because its values happen to be numerically larger.
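As a quick illustration, here is a minimal sketch of standardization in NumPy, using a small made-up table of three variables (the numbers are purely illustrative):

    import numpy as np

    # Made-up measurements of three variables x, y and z (five samples).
    X = np.array([[2.5, 2.4, 0.5],
                  [0.5, 0.7, 1.5],
                  [2.2, 2.9, 0.3],
                  [1.9, 2.2, 1.1],
                  [3.1, 3.0, 0.2]])

    # Standardize: subtract each column's mean and divide by its standard
    # deviation, so every variable contributes on the same scale.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    print(X_std.mean(axis=0))   # ~0 for every column
    print(X_std.std(axis=0))    # 1 for every column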

ii) Covariance Matrix

The covariance of a variable with itself is its variance, so the main diagonal holds the variances of x, y and z. Covariance is also symmetric, cov(x, y) = cov(y, x), so the entries of the lower and upper triangles mirror each other across the diagonal.
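Continuing the same sketch, the covariance matrix of the standardized variables is one call away (np.cov treats rows as variables by default, so we pass rowvar=False for our column layout):

    # Covariance matrix of the standardized data (variables are columns).
    cov = np.cov(X_std, rowvar=False)
    print(cov)   # 3 x 3: variances on the diagonal, symmetric off-diagonal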

iii) Eigen vectors and Eigen Values

An important thing to understand here is that the principal components on their own carry no real-world meaning, since each one is a linear combination of the original variables. On a mathematical note, these components represent the directions of the data with maximum variance, i.e. the lines that capture the most information. Principal components can be thought of as new axes that provide an optimum angle to visualize and evaluate the data and fit it efficiently to a model.

How does PCA construct these components?

If we rank the eigenvalues in descending order, λ1 > λ2, then the eigenvector v1 corresponding to λ1 gives the first principal component (PC1) and the eigenvector v2 corresponding to λ2 gives the second (PC2).
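Continuing the sketch from above, np.linalg.eigh is the natural choice here because a covariance matrix is symmetric; we then sort the eigenpairs so that PC1 comes first:

    # Eigen decomposition of the symmetric covariance matrix.
    eig_vals, eig_vecs = np.linalg.eigh(cov)

    # Sort the eigenpairs in descending order of eigenvalue, so that the
    # first column of eig_vecs is the direction of maximum variance (PC1).
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]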

After the components have been calculated, compute the percentage of variance accounted for by each component: divide the eigenvalue of each component by the sum of all the eigenvalues. In the example above, PC1 accounts for 96% of the variance while PC2 accounts for the remaining 4%.
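In code, continuing the sketch, this is a one-liner:

    # Percentage of the total variance captured by each principal component.
    explained = eig_vals / eig_vals.sum()
    print(np.round(explained * 100, 1))   # percentages, summing to 100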

In the next step, one chooses whether to keep all the components or to discard those with lower eigenvalues. The retained eigenvectors are stacked together to form the feature vector. In the example above, if we drop v2, we are left with v1 alone as the feature vector, which still captures 96% of the variance in our data.

The last step is to recast the data along the principal component axes. This reorients the data from the original axes (dimensions) to the principal axes, and is done by multiplying the standardized data by the feature vector.
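Continuing the sketch one last time, recasting the data is a single matrix multiplication with the feature vector (here we keep the top k = 2 components purely for illustration):

    # Keep the top k eigenvectors as the feature vector and project the data.
    k = 2
    feature_vector = eig_vecs[:, :k]      # shape (3, k)
    X_pca = X_std @ feature_vector        # the data recast along PC1..PCk
    print(X_pca.shape)                    # (5, k)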

2. Singular Value Decomposition

  • A is an m × n matrix (the data we want to decompose)
  • U is an m × n matrix with orthonormal columns
  • S is an n × n diagonal matrix whose entries are the singular values
  • V is an n × n orthogonal matrix

The relation between these matrices is as follows:

    A = U S Vᵀ

In other words, any m × n matrix A can be factored into these three pieces.
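A quick way to see this relation in code is NumPy's reduced SVD, whose output shapes match the ones listed above (the matrix A below is just random, illustrative data):

    import numpy as np

    # An arbitrary m x n matrix (random, purely illustrative).
    A = np.random.default_rng(0).normal(size=(6, 3))

    # Reduced SVD: U is m x n, s holds the n singular values, Vt is n x n.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Rebuild A from the three factors: A = U S V^T.
    print(np.allclose(A, U @ np.diag(s) @ Vt))   # True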

Suppose we have two two-dimensional vectors, x₁ = (x₁, y₁) and x₂ = (x₂, y₂). We can fit an ellipse with major axis a and minor axis b to these two vectors. To make things easier on ourselves and save typing, we write out the equations using matrix algebra.

We can construct an ellipse of any size and orientation by stretching and rotating a unit circle.

Let x′ = (x′, y′) be the transformed coordinates, obtained by undoing the rotation and the stretching:

    x′ = x R⁻¹ M⁻¹

where R is a rotation matrix:

    R = |  cos θ   sin θ |
        | −sin θ   cos θ |

and M = diag(a, b) is the stretching matrix that scales the unit circle along its axes. If we write this out term-by-term, for a two-dimensional dataset we get:

    x′ = (x cos θ + y sin θ) / a
    y′ = (y cos θ − x sin θ) / b

What we need to understand is that this rotation of the data is clockwise. The equation for a unit circle is:

    x′ x′ᵀ = 1

which, written out, is simply:

    x′² + y′² = 1

Substituting the expressions above and simplifying a little more, we get the equation of a rotated ellipse:

    ((x cos θ + y sin θ) / a)² + ((y cos θ − x sin θ) / b)² = 1

Finally, going back to the original relation: if we stack our data points as the rows of A, then A R⁻¹ M⁻¹ = U is a matrix of points lying on the unit circle, and therefore

    A = U M R = U S Vᵀ

with S = M and Vᵀ = R. This is exactly the singular value decomposition of the data matrix.

The singular values are the axes of the least squares fitted ellipsoid.

A — the collection of data points (one point per row).

V — the orientation of the ellipsoid.

U — the projections of the points in A onto the axes defined by the singular values.
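To connect the algebra back to the ellipse picture, here is a small sketch under made-up values of a, b and the rotation angle: we sample points on an ellipse with known axes and check that the singular values of the resulting data matrix recover those axes, up to a factor of √(N/2) that comes from summing over the N sampled points:

    import numpy as np

    # Made-up ellipse: semi-axes a and b, rotated by an angle theta.
    a, b, theta = 3.0, 1.0, 0.4
    N = 100
    t = np.linspace(0.0, 2 * np.pi, N, endpoint=False)

    # Build the data matrix A: points on a unit circle (rows), stretched by
    # diag(a, b) and then rotated by R -- the same construction as above.
    circle = np.column_stack([np.cos(t), np.sin(t)])
    R = np.array([[np.cos(theta),  np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])
    A = circle @ np.diag([a, b]) @ R

    # The singular values recover the ellipse axes, up to the sqrt(N/2)
    # factor that comes from summing cos^2 and sin^2 over the N samples.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    print(s / np.sqrt(N / 2))   # approximately [3.0, 1.0] = [a, b]

The rows of Vt returned by the decomposition likewise line up (up to sign) with the rotated axis directions, which is the "orientation of the ellipsoid" mentioned above.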

Do let us know if you would like a detailed explanation of the applications of SVD and how it works under the hood.

3. Gradient Descent

i) Batch Gradient Descent

Batch gradient descent computes the gradient of the objective function over the entire training set before making a single parameter update. Each step follows the true gradient and is very stable, but it becomes expensive when the dataset is large.

ii) Stochastic Gradient Descent

Stochastic gradient descent (SGD) updates the parameters after every individual training example. The updates are cheap and frequent but noisy, which makes the path towards the minimum jittery, yet this very noise often helps the model escape shallow local minima.

iii) Mini Batch Gradient Descent

Mini-batch gradient descent is the middle ground: the gradient is computed over small batches of examples (commonly 32 to 256), combining the stability of batch updates with the speed of stochastic updates. It is the most widely used variant when training neural networks, as sketched below.
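The three variants differ only in how much data is used to form each parameter update. Here is a minimal sketch for linear regression with a squared-error objective; the function name, learning rate and data are made up for illustration. Leaving batch_size unset gives batch gradient descent, batch_size = 1 gives stochastic gradient descent, and anything in between gives mini-batch gradient descent.

    import numpy as np

    def gradient_descent(X, y, lr=0.1, epochs=100, batch_size=None):
        """Gradient descent on mean squared error for a linear model y ~ X @ w.

        batch_size=None -> batch gradient descent (whole dataset per update)
        batch_size=1    -> stochastic gradient descent (one example per update)
        otherwise       -> mini-batch gradient descent
        """
        N, d = X.shape
        w = np.zeros(d)
        batch_size = N if batch_size is None else batch_size
        rng = np.random.default_rng(0)

        for _ in range(epochs):
            order = rng.permutation(N)                          # shuffle every epoch
            for start in range(0, N, batch_size):
                batch = order[start:start + batch_size]
                Xb, yb = X[batch], y[batch]
                grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)  # MSE gradient
                w -= lr * grad                                  # parameter update
        return w

    # Made-up data: y = 3*x1 - 2*x2 plus a little noise.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 2))
    y = X @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=200)

    print(gradient_descent(X, y))                   # batch gradient descent
    print(gradient_descent(X, y, batch_size=1))     # stochastic gradient descent
    print(gradient_descent(X, y, batch_size=32))    # mini-batch gradient descent

All three calls should land close to the true weights; the difference lies in how smooth the path to the minimum is and how much computation each update costs.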

Conclusion

In part 2 of this article, we shall deal with Autoencoders.

Cheers,
Team FACE

We are the departmental forum for Computer Science & Engineering at Amrita School of Engineering, Bengaluru.