
Lecture 03: Loss Function and Optimization

Recall the challenges of recognition (besides viewpoint variation):
1. Illumination
2. Deformation
3. Occlusion
4. Clutter
5. Intraclass Variation
A loss function tells how good our current classifier is.
In the loss function, x is the image and y is the integer label.
The multiclass SVM loss is a generalization of the binary SVM.
A simple way to measure the loss in multiclass SVM: for each incorrect class, compare its score with the correct class's score and sum up the differences, where only classes scored better than (or within the margin of) the correct class contribute. The hinge loss adds +1 as the margin: max(0, s_j - s_{y_i} + 1).
How is the added margin value of 1 chosen? → It can be absorbed into the scale of W, so its exact value does not matter much.
L_i = sum over j ≠ y_i of max(0, s_j - s_{y_i} + 1): loss for the i-th example
L = (1/N) * sum over i of L_i: average loss over the whole training set
Suppose that we found a W such that L = 0. Is this W unique? → No; other W's also give L = 0 (e.g., 2W).
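A minimal sketch of the multiclass SVM (hinge) loss for a single example, assuming NumPy arrays; the function name and the shapes of `W`, `x`, and `y` are illustrative choices, not from the notes.

```python
import numpy as np

def svm_loss_single(W, x, y):
    """Multiclass SVM (hinge) loss for one example.
    W: (C, D) weights, x: (D,) flattened image, y: integer label."""
    scores = W.dot(x)                                   # class scores s = Wx
    margins = np.maximum(0, scores - scores[y] + 1.0)   # hinge with margin 1
    margins[y] = 0                                      # correct class contributes nothing
    return margins.sum()
```

Scaling W (e.g., using 2W) only widens margins that are already satisfied, which is why a W with L = 0 is not unique.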
Adding a regularization term λR(W) to the loss function L(W) prevents overfitting, so the model also does well on test data. In other words, it discourages the loss from being driven all the way to 0 on the training data. (Prefer the simpler model, by Occam's Razor; it works better on the test data; λ is a hyperparameter.)
There are several kinds of regularization for preventing overfitting (a minimal sketch of the first two follows this list):
1. L1 Regularization
2. L2 Regularization
3. Elastic Net (L1 + L2)
4. Max Norm Regularization
5. Dropout
6. Fancier: Batch Normalization, Stochastic Depth
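A minimal sketch of the L1 and L2 penalties from the list; the total loss then adds λ·R(W) to the data loss, with `lam` playing the role of the hyperparameter λ.

```python
import numpy as np

def l2_reg(W):
    # L2 regularization: R(W) = sum of squared weights (prefers small, spread-out weights)
    return np.sum(W * W)

def l1_reg(W):
    # L1 regularization: R(W) = sum of absolute weights (encourages sparsity)
    return np.sum(np.abs(W))

# Total loss: L(W) = data_loss + lam * l2_reg(W), with lam a hyperparameter.
```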
Multinomial Logistic Regression (Softmax Classifier)
Uses probabilities via the softmax function.
Exponentiate the scores → normalize (so they sum to 1) → take the softmax probability of the correct class → loss = -log(softmax value).
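A minimal sketch of the softmax loss for one example, following the exp → normalize → -log recipe above; shifting by the max score is a numerical-stability detail not mentioned in the notes.

```python
import numpy as np

def softmax_loss_single(scores, y):
    """Softmax (cross-entropy) loss given the class scores of one example."""
    shifted = scores - np.max(scores)         # stability shift; probabilities are unchanged
    exp_scores = np.exp(shifted)              # exponentiate
    probs = exp_scores / np.sum(exp_scores)   # normalize so the probabilities sum to 1
    return -np.log(probs[y])                  # -log of the correct class probability
```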
Recap
We have some dataset of (x, y)
We have a score function: s = f(x, W) = Wx, as an example
We have a loss function (e.g., SVM or Softmax) that, together with regularization, gives the full loss
Thus, the procedure is as follows (a vectorized sketch is below):
1. Compute the score function
2. Measure the data loss
3. Add the regularization loss on W to get the total loss
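Putting the recap together, a vectorized sketch of the full loss over N examples using the SVM data loss and L2 regularization; the shapes (X is N x D, W is D x C) and the function name are assumptions for illustration.

```python
import numpy as np

def full_svm_loss(W, X, y, lam=1e-3):
    """Step 1: scores, Step 2: data loss, Step 3: add regularization loss."""
    N = X.shape[0]
    scores = X.dot(W)                                # (N, C) score function s = f(x, W)
    correct = scores[np.arange(N), y][:, None]       # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + 1.0)  # hinge margins
    margins[np.arange(N), y] = 0                     # the correct class does not contribute
    data_loss = margins.sum() / N                    # average data loss
    reg_loss = lam * np.sum(W * W)                   # L2 regularization loss on W
    return data_loss + reg_loss
```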
How do we find the W with the best (lowest) loss?
1. Random Search → BAD
2. Gradient Descent → BAD and GOOD (depends on...)
In 1-dimension, the derivative of a function is what we've learned
In multiple dimension, the gradient is the vector of partial derivatives along each dimension
The slope in any direction is the dot product of the direction with the gradient
The direction of steepest descent is the negative gradient
Remember the idea of how to compute the gradient numerically: nudge each entry of W slightly and re-evaluate the loss.
The flaw of this approach is that it is too slow for a large-scale CNN.
** Every single entry of W needs its own loss evaluations, so a W with millions of parameters can require millions of computations per gradient.
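A minimal sketch of the numerical gradient via centered finite differences, which makes the cost concrete: every entry of W triggers its own loss evaluations. `loss_fn` stands for any loss as a function of W, e.g. the ones sketched above.

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Approximate dL/dW entry by entry with a centered difference."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        loss_plus = loss_fn(W)     # loss with this weight nudged up
        W[idx] = old - h
        loss_minus = loss_fn(W)    # loss with this weight nudged down
        W[idx] = old               # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad
```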
3. Stochastic Gradient Descent (SGD)
The full sum over all N examples is expensive when N is large.
Thus, approximate the sum using a minibatch of examples (32 / 64 / 128 are common sizes).
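A minimal sketch of the SGD loop with minibatch sampling; `loss_and_grad`, the learning rate, and the step count are placeholders, not values from the lecture.

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, lr=1e-3, batch_size=64, num_steps=1000):
    """Vanilla stochastic gradient descent using minibatches."""
    N = X.shape[0]
    for step in range(num_steps):
        batch = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, grad = loss_and_grad(W, X[batch], y[batch])        # approximate loss/gradient
        W -= lr * grad                                           # step along the negative gradient
    return W
```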
4. Other update rules
Following the analytic gradient heads directly down the steepest-descent path toward the minimum, yet some optimizers that do not take that direct path (such as Adam) can still reach the minimum faster.
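A sketch of the Adam update rule mentioned above (Kingma & Ba, 2015); the default hyperparameters shown are the commonly used ones, not values from the notes.

```python
import numpy as np

def adam_update(W, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # second-moment (scale) estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction, t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```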
In summary:
1. Numerical Gradient: approximate, slow, easy to write
2. Analytic Gradient: exact, fast, error-prone
Thus, always use the analytic gradient in practice, but check the implementation with the numerical gradient. This is called a gradient check.
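A minimal gradient-check sketch: compare an analytic gradient against the numerical one (e.g., from the finite-difference sketch earlier) via relative error.

```python
import numpy as np

def gradient_check(analytic_grad, numeric_grad):
    """Max relative error; roughly 1e-7 or below usually indicates a correct analytic gradient."""
    num = np.abs(analytic_grad - numeric_grad)
    den = np.maximum(np.abs(analytic_grad) + np.abs(numeric_grad), 1e-12)
    return np.max(num / den)
```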
Classifying images directly from raw pixels often does not work well, so a two-stage approach is used:
1. Compute various feature representations of the image (different kinds of quantities relating to its appearance)
2. Concatenate the different feature vectors into one
The feature representation of the image is then fed into a linear classifier.
Examples of image features (a minimal feature-pipeline sketch follows this list):
1. Motivation: a feature transform (e.g., into polar coordinates) can make classes separable that a linear classifier could not separate in the original pixel space
2. Histogram of Oriented Gradients: divide the image into 8x8 pixel regions and compute the dominant edge direction within each region
3. Bag of Words
Build a codebook (extract random patches, cluster the patches to form a codebook of "visual words")
Encode images using the codebook
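A minimal sketch of the feature pipeline described above, using a simple color histogram as the feature; HOG or bag-of-words features would be concatenated the same way. The bin count, value range, and function names are assumptions.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel color histogram for an (H, W, 3) uint8 image."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    return np.concatenate(feats).astype(np.float32)

def extract_features(image, extractors):
    """Concatenate several feature vectors (e.g., color histogram, HOG, bag of words)."""
    return np.concatenate([f(image) for f in extractors])

# The concatenated feature vector, rather than raw pixels, is fed into the linear classifier.
```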