
Lecture 03: Loss Function and Optimization

Recall the challenges of recognition (besides viewpoint variation):
1. Illumination
2. Deformation
3. Occlusion
4. Clutter
5. Intraclass Variation
A loss function tells how good our current classifier is.
In the loss function, x is the image and y is the integer label.
The multiclass SVM loss is a generalization of the binary SVM.
A simple way to measure the loss in multiclass SVM: for each incorrect class, compare its score with the correct class's score and sum up the differences, where only classes scored better than (or within the margin of) the correct class contribute. The hinge loss adds +1 as the margin: max(0, s_j - s_{y_i} + 1).
How is the added margin value of 1 chosen? → It can be absorbed into the scale of W, so its exact value does not matter much.
L_i = sum over j ≠ y_i of max(0, s_j - s_{y_i} + 1): loss for the i-th example
L = (1/N) * sum over i of L_i: average loss over the whole training set
Suppose that we found a W such that L = 0. Is this W unique? → No; other W's also give L = 0 (e.g., 2W).
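A minimal sketch of the multiclass SVM (hinge) loss for a single example, assuming NumPy arrays; the function name and the shapes of `W`, `x`, and `y` are illustrative choices, not from the notes.

```python
import numpy as np

def svm_loss_single(W, x, y):
    """Multiclass SVM (hinge) loss for one example.
    W: (C, D) weights, x: (D,) flattened image, y: integer label."""
    scores = W.dot(x)                                   # class scores s = Wx
    margins = np.maximum(0, scores - scores[y] + 1.0)   # hinge with margin 1
    margins[y] = 0                                      # correct class contributes nothing
    return margins.sum()
```

Scaling W (e.g., using 2W) only widens margins that are already satisfied, which is why a W with L = 0 is not unique.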
Adding a regularization term λR(W) to the loss function L(W) prevents overfitting, so the model also does well on test data. In other words, it discourages the loss from being driven all the way to 0 on the training data. (Prefer the simpler model, by Occam's Razor; it works better on the test data; λ is a hyperparameter.)
There are several kinds of regularization for preventing overfitting (a minimal sketch of the first two follows this list):
1. L1 Regularization
2. L2 Regularization
3. Elastic Net (L1 + L2)
4. Max Norm Regularization
5. Dropout
6. Fancier: Batch Normalization, Stochastic Depth
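A minimal sketch of the L1 and L2 penalties from the list; the total loss then adds λ·R(W) to the data loss, with `lam` playing the role of the hyperparameter λ.

```python
import numpy as np

def l2_reg(W):
    # L2 regularization: R(W) = sum of squared weights (prefers small, spread-out weights)
    return np.sum(W * W)

def l1_reg(W):
    # L1 regularization: R(W) = sum of absolute weights (encourages sparsity)
    return np.sum(np.abs(W))

# Total loss: L(W) = data_loss + lam * l2_reg(W), with lam a hyperparameter.
```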
Multinomial Logistic Regression (Softmax Classifier)
Uses probabilities via the softmax function.
Exponentiate the scores → normalize (so they sum to 1) → take the softmax probability of the correct class → loss = -log(softmax value).
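A minimal sketch of the softmax loss for one example, following the exp → normalize → -log recipe above; shifting by the max score is a numerical-stability detail not mentioned in the notes.

```python
import numpy as np

def softmax_loss_single(scores, y):
    """Softmax (cross-entropy) loss given the class scores of one example."""
    shifted = scores - np.max(scores)         # stability shift; probabilities are unchanged
    exp_scores = np.exp(shifted)              # exponentiate
    probs = exp_scores / np.sum(exp_scores)   # normalize so the probabilities sum to 1
    return -np.log(probs[y])                  # -log of the correct class probability
```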
Recap
We have some dataset of (x, y)
We have a score function: s = f(x, W) = Wx, as an example
We have a loss function (e.g., SVM or Softmax) that, together with regularization, gives the full loss
Thus, the procedure is as follows (a vectorized sketch is below):
1. Compute the score function
2. Measure the data loss
3. Add the regularization loss on W to get the total loss
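Putting the recap together, a vectorized sketch of the full loss over N examples using the SVM data loss and L2 regularization; the shapes (X is N x D, W is D x C) and the function name are assumptions for illustration.

```python
import numpy as np

def full_svm_loss(W, X, y, lam=1e-3):
    """Step 1: scores, Step 2: data loss, Step 3: add regularization loss."""
    N = X.shape[0]
    scores = X.dot(W)                                # (N, C) score function s = f(x, W)
    correct = scores[np.arange(N), y][:, None]       # (N, 1) correct-class scores
    margins = np.maximum(0, scores - correct + 1.0)  # hinge margins
    margins[np.arange(N), y] = 0                     # the correct class does not contribute
    data_loss = margins.sum() / N                    # average data loss
    reg_loss = lam * np.sum(W * W)                   # L2 regularization loss on W
    return data_loss + reg_loss
```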
How do we find the W with the best (lowest) loss?
1. Random Search → BAD
2. Gradient Descent → BAD and GOOD (depends on...)
In 1-dimension, the derivative of a function is what we've learned
In multiple dimension, the gradient is the vector of partial derivatives along each dimension
The slope in any direction is the dot product of the direction with the gradient
The direction of steepest descent is the negative gradient
Remember the idea of how to compute the gradient numerically: nudge each entry of W slightly and re-evaluate the loss.
The flaw of this approach is that it is too slow for a large-scale CNN.
** Every single entry of W needs its own loss evaluations, so a W with millions of parameters can require millions of computations per gradient.
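A minimal sketch of the numerical gradient via centered finite differences, which makes the cost concrete: every entry of W triggers its own loss evaluations. `loss_fn` stands for any loss as a function of W, e.g. the ones sketched above.

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Approximate dL/dW entry by entry with a centered difference."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        loss_plus = loss_fn(W)     # loss with this weight nudged up
        W[idx] = old - h
        loss_minus = loss_fn(W)    # loss with this weight nudged down
        W[idx] = old               # restore the original value
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
        it.iternext()
    return grad
```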
3. Stochastic Gradient Descent (SGD)
The full sum over all N examples is expensive when N is large.
Thus, approximate the sum using a minibatch of examples (32 / 64 / 128 are common sizes).
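A minimal sketch of the SGD loop with minibatch sampling; `loss_and_grad`, the learning rate, and the step count are placeholders, not values from the lecture.

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, lr=1e-3, batch_size=64, num_steps=1000):
    """Vanilla stochastic gradient descent using minibatches."""
    N = X.shape[0]
    for step in range(num_steps):
        batch = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
        loss, grad = loss_and_grad(W, X[batch], y[batch])        # approximate loss/gradient
        W -= lr * grad                                           # step along the negative gradient
    return W
```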
4. Other update rules
Following the analytic gradient heads directly down the steepest-descent path toward the minimum, yet some optimizers that do not take that direct path (such as Adam) can still reach the minimum faster.
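A sketch of the Adam update rule mentioned above (Kingma & Ba, 2015); the default hyperparameters shown are the commonly used ones, not values from the notes.

```python
import numpy as np

def adam_update(W, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2    # second-moment (scale) estimate
    m_hat = m / (1 - beta1 ** t)               # bias correction, t counts steps from 1
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```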
In summary:
1. Numerical Gradient: approximate, slow, easy to write
2. Analytic Gradient: exact, fast, error-prone
Thus, always use the analytic gradient in practice, but check the implementation with the numerical gradient. This is called a gradient check.
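A minimal gradient-check sketch: compare an analytic gradient against the numerical one (e.g., from the finite-difference sketch earlier) via relative error.

```python
import numpy as np

def gradient_check(analytic_grad, numeric_grad):
    """Max relative error; roughly 1e-7 or below usually indicates a correct analytic gradient."""
    num = np.abs(analytic_grad - numeric_grad)
    den = np.maximum(np.abs(analytic_grad) + np.abs(numeric_grad), 1e-12)
    return np.max(num / den)
```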
Classifying images directly from raw pixels often does not work well, so a two-stage approach is used:
1. Compute various feature representations of the image (different kinds of quantities relating to its appearance)
2. Concatenate the different feature vectors into one
The feature representation of the image is then fed into a linear classifier.
Examples of image features (a minimal feature-pipeline sketch follows this list):
1. Motivation: a feature transform (e.g., into polar coordinates) can make classes separable that a linear classifier could not separate in the original pixel space
2. Histogram of Oriented Gradients: divide the image into 8x8 pixel regions and compute the dominant edge direction within each region
3. Bag of Words
Build a codebook (extract random patches, cluster the patches to form a codebook of "visual words")
Encode images using the codebook
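A minimal sketch of the feature pipeline described above, using a simple color histogram as the feature; HOG or bag-of-words features would be concatenated the same way. The bin count, value range, and function names are assumptions.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Per-channel color histogram for an (H, W, 3) uint8 image."""
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0] for c in range(3)]
    return np.concatenate(feats).astype(np.float32)

def extract_features(image, extractors):
    """Concatenate several feature vectors (e.g., color histogram, HOG, bag of words)."""
    return np.concatenate([f(image) for f in extractors])

# The concatenated feature vector, rather than raw pixels, is fed into the linear classifier.
```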