NOTE: Having good theoretical knowledge is amazing, but implementing it in code in a real-world deep learning project is a completely different thing. You might get different and unexpected results based on different problems and datasets. So as a bonus, I am also adding links to the various courses which have helped me a lot in my journey to learn Data Science and ML, and to experiment with and compare different optimization strategies, which led me to write this article on comparisons between different optimizers while implementing deep learning. Below are some of the resources which have helped me a lot to become what I am today. I am personally a fan of DataCamp; I started with it, I am still learning through DataCamp, and I keep doing new courses. They seriously have some exciting courses. Do check them out.
Also, I have noticed that DataCamp is having a SALE (75% off) on all the courses. So this would literally be the best time to grab a yearly subscription (which I have), which gives unlimited access to all the courses and other things on DataCamp, and make fruitful use of your time sitting at home during this pandemic. So go for it folks and happy learning; make the best use of this quarantine time and come out of this pandemic stronger and more skilled.
1) This is the link to the course by DataCamp on Deep learning in Python using the Keras package, or you can definitely start with Building CNNs for image processing using Keras. If understanding deep learning and AI fundamentals is what you want right now, then the above 2 courses are the best deep learning courses you can find out there to learn the fundamentals of deep learning and also implement it in Python. These were my first deep learning courses, and they helped me a lot to properly understand the basics.
3) Machine learning in Python using Scikit-learn — This course will teach you how to implement supervised learning algorithms in Python with different datasets.
4) Data wrangling and manipulating Data Frames using Pandas — This amazing course will help you perform data wrangling and data pre-processing in Python. A data scientist spends most of their time doing pre-processing and data wrangling, so this course might come in handy for beginners.
6) Recently DataCamp has started a new program where they provide various real-world projects and problem statements to help data enthusiasts build a strong practical data science foundation along with their courses. So try any of these projects out. It is surely very exciting and will help you learn faster and better. Recently I completed a project on Exploring the evolution of Linux and it was an amazing experience.
7) R users, don't worry, I also have some hand-picked best R courses for you to get started with building data science and machine learning foundations, and to do it side by side using this amazing Data Science with R course, which will teach you the complete fundamentals. Trust me, this one is worth your time and energy.
8) This course is also one of the best for understanding the basics of machine learning in R, called Machine Learning Toolbox.
9) All data science projects start from exploring the data, and it is one of the most important tasks for a data scientist to know the dataset inside out, so this lovely course on Exploratory Data Analysis using R is what you need to start any data analytics and data science project. Also, this course on Statistical Modelling in R would be useful for all the aspiring data scientists like me. Statistics is the foundation of data science.
P.S: I am still using DataCamp and keep doing courses in my free time. I actually encourage readers to try out any of the above courses as per their interest, to get started and build a good foundation in Machine Learning and Data Science. The best thing about these courses by DataCamp is that they explain things in a very elegant and different manner, with a balanced focus on practical as well as conceptual knowledge, and at the end there is always a case study. This is what I love the most about them. These courses are truly worth your time and money. They would surely help you understand and implement deep learning and machine learning in a better way, in Python or R. I am damn sure you will love them, and I am claiming this from my personal opinion and experience.
What are Optimization Algorithms?
Types of optimization algorithms?
Optimization algorithms fall into 2 major categories:
- First-Order Optimization Algorithms — These algorithms minimize or maximize a Loss function E(x) using its Gradient values with respect to the parameters. The most widely used first-order optimization algorithm is Gradient Descent. The first-order derivative tells us whether the function is decreasing or increasing at a particular point; basically, it gives us a line which is tangential to a point on its Error Surface.
What is the Gradient of a function?
Hence, summing up: a derivative is defined for a function of a single variable, whereas a Gradient is defined for a function of multiple variables and is the vector of its partial derivatives. Now let's not get further into Calculus and Physics.
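As a minimal illustration of this difference (the two toy functions below are assumptions chosen purely for the example), the sketch approximates a derivative and a gradient numerically with NumPy:

```python
import numpy as np

# Derivative: a single number for a single-variable function f(x) = x**2
def f(x):
    return x ** 2

h = 1e-6
x0 = 3.0
derivative = (f(x0 + h) - f(x0 - h)) / (2 * h)   # ~6.0, the slope at x0

# Gradient: a vector of partial derivatives for a multi-variable function
# E(w1, w2) = w1**2 + 3 * w2**2, a toy "error surface"
def E(w):
    return w[0] ** 2 + 3 * w[1] ** 2

def numerical_gradient(func, w, h=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = h
        grad[i] = (func(w + step) - func(w - step)) / (2 * h)
    return grad

w0 = np.array([1.0, 2.0])
print(derivative)                    # a scalar slope
print(numerical_gradient(E, w0))     # [2., 12.] -> one partial derivative per parameter
```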
2. Second-Order Optimization Algorithms — Second-order methods use the second-order derivative, also called the Hessian, to minimize or maximize the Loss function. The Hessian is a matrix of second-order partial derivatives. Since the second derivative is costly to compute, second-order methods are not used much. The second-order derivative tells us whether the first derivative is increasing or decreasing, which hints at the function's curvature. The second-order derivative provides us with a quadratic surface which touches the curvature of the Error Surface.
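As a hedged sketch of what "using the Hessian" means (the hand-picked quadratic loss below is just an illustrative assumption), a single Newton-style second-order step combines the gradient with the curvature and lands exactly on the minimum of a quadratic surface:

```python
import numpy as np

# Toy quadratic loss E(w) = 0.5 * w^T A w - b^T w, whose minimum is at A^{-1} b
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, 2.0])

def gradient(w):
    return A @ w - b      # first-order information (slope)

def hessian(w):
    return A              # second-order information (curvature), constant here

w = np.array([5.0, -3.0])                               # arbitrary starting point
w_new = w - np.linalg.solve(hessian(w), gradient(w))    # w <- w - H^{-1} * gradient

print(w_new)              # equals np.linalg.solve(A, b): the exact minimizer in one step
```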
Some Advantages of Second-Order Optimization over First Order —
You can read more about second-order optimization algorithms here: https://web.stanford.edu/class/msande311/lecture13.pdf
So which Order Optimization Strategy to use?
- The First-Order Optimization techniques are easy to compute and less time consuming, converging pretty fast on large datasets.
- Second-Order techniques are faster only when the second-order derivative is known; otherwise, these methods are always slower and costlier to compute in terms of both time and memory.
Although, Newton's second-order optimization technique can sometimes outperform first-order Gradient Descent techniques, because second-order techniques do not get stuck on paths of slow convergence around saddle points, whereas Gradient Descent sometimes gets stuck there and does not converge.
Now, what are the different types of Optimization Algorithms used in Neural Networks?
“Oh Gradient Descent — Find the Minima, control the variance and then update the Model’s parameters and finally lead us to Convergence”
After this, we propagate backwards through the Network carrying Error terms and updating Weight values using Gradient Descent: we calculate the gradient of the Error function E with respect to the Weights W (the parameters), and update the parameters (here, the Weights) in the direction opposite to the Gradient of the Loss function w.r.t. the Model's parameters.
The image above shows the process of Weight updates in the direction opposite to the Gradient Vector of the Error w.r.t. the Weights of the Network. The U-shaped curve is the Error plotted against the Weight, and the Gradient is its slope. As one can notice, if the Weight (W) values are too small or too large then we have large Errors, so we want to update and optimize the weights such that they are neither too small nor too large, so we descend downwards, opposite to the Gradient, until we find a local minimum.
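A minimal sketch of this update rule in plain NumPy (the toy error function, starting weight, and learning rate of 0.1 are illustrative assumptions, not the article's exact setup):

```python
def error_gradient(W):
    # Gradient of a toy error E(W) = (W - 3)**2 with respect to the weight W
    return 2 * (W - 3.0)

W = 10.0        # initial weight
eta = 0.1       # learning rate

for step in range(100):
    grad = error_gradient(W)
    W = W - eta * grad      # move opposite to the gradient of the error

print(W)        # converges towards the minimum at W = 3
```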
Variants of Gradient Descent-
1. Stochastic gradient descent
2. Mini Batch Gradient Descent
The advantages of using Mini Batch Gradient Descent are —
- It reduces the variance in the parameter updates, which can ultimately lead us to much better and more stable convergence.
- Can make use of highly optimized matrix optimizations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient.
- Commonly, mini-batch sizes range from 50 to 256, but can vary as per the application and problem being solved.
- Mini-batch gradient descent is typically the algorithm of choice when training a neural network nowadays; a minimal sketch of such a training loop follows below.
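A rough sketch of a mini-batch gradient descent loop for a simple linear model (the synthetic data, batch size, and learning rate below are all illustrative assumptions):

```python
import numpy as np

# Synthetic regression data: y = 2x + 1 plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0
eta, batch_size, epochs = 0.05, 64, 20

for epoch in range(epochs):
    indices = rng.permutation(len(X))           # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb
        # Gradients of the mean squared error over this mini-batch only
        grad_w = 2 * np.mean(err * xb)
        grad_b = 2 * np.mean(err)
        w -= eta * grad_w
        b -= eta * grad_b

print(w, b)     # approaches (2, 1)
```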
The challenges faced while using Gradient Descent and its variants are —
- Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, i.e. small baby steps towards the optimal parameter values which minimize the loss; this directly affects the overall training time, which becomes too large. A learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
- Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
- Another key challenge of minimizing the highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous sub-optimal local minima. Actually, the difficulty arises not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
Optimizing the Gradient Descent
Momentum — just as a ball rolling down a hill gathers momentum and moves faster and faster, the same thing happens with our parameter updates —
- It leads to faster and stable convergence.
- Reduced Oscillations
The momentum term γ increases the updates for dimensions whose gradients point in the same direction and reduces the updates for dimensions whose gradients change direction. This means the updates are larger mainly along the relevant directions, which reduces unnecessary parameter updates and leads to faster, more stable convergence with reduced oscillations.
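A minimal sketch of the momentum update (the momentum term γ = 0.9, learning rate, and toy gradient below are assumptions chosen just for illustration):

```python
import numpy as np

def grad(w):
    # Gradient of a toy elongated bowl E(w) = 0.5 * (10 * w[0]**2 + w[1]**2)
    return np.array([10.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
v = np.zeros_like(w)        # velocity: accumulated past gradients
eta, gamma = 0.01, 0.9      # learning rate and momentum term

for step in range(200):
    v = gamma * v + eta * grad(w)   # gradients pointing the same way keep adding up
    w = w - v

print(w)    # approaches the minimum at [0, 0] with fewer oscillations than plain SGD
```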
Nesterov accelerated gradient
What actually happens is that as we approach the minima, i.e. the lowest point on the curve, the momentum is pretty high, and it doesn't slow down at that point; the high momentum could cause us to miss the minima entirely and continue moving up. This problem was noticed by Yurii Nesterov.
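A hedged sketch of the Nesterov "look-ahead" idea (the toy gradient and hyper-parameters are assumptions): the gradient is evaluated at the position the momentum is about to carry us to, so the update can brake before overshooting the minima.

```python
def grad(w):
    # Gradient of a toy error E(w) = 0.5 * w**2, minimum at w = 0
    return w

w, v = 5.0, 0.0
eta, gamma = 0.1, 0.9

for step in range(100):
    lookahead = w - gamma * v               # peek at where the momentum will take us
    v = gamma * v + eta * grad(lookahead)   # gradient evaluated at the look-ahead point
    w = w - v

print(w)    # approaches 0; the look-ahead correction slows the update near the minimum
```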
Adagrad — It uses a different learning rate for every parameter θ at every time step, based on the past gradients which were computed for that parameter.
- Its main weakness is that its learning rate η is always decreasing and decaying.
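A minimal sketch of this per-parameter adaptation in the spirit of Adagrad (the ε smoothing term, base learning rate, and toy gradient are illustrative assumptions):

```python
import numpy as np

def grad(w):
    # Toy gradient: one frequently-changing and one slowly-changing direction
    return np.array([2.0 * w[0], 0.2 * w[1]])

w = np.array([5.0, 5.0])
G = np.zeros_like(w)        # running sum of squared past gradients, per parameter
eta, eps = 0.5, 1e-8

for step in range(200):
    g = grad(w)
    G += g ** 2                             # accumulate history for each parameter
    w -= eta / (np.sqrt(G) + eps) * g       # effective learning rate keeps shrinking

print(w)    # each parameter gets its own, steadily decaying learning rate
```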
What improvements have we made so far —
- We are calculating different learning Rates for each parameter.
- We are also calculating momentum.
- Preventing vanishing (decaying) learning rates.
Since we are calculating individual learning rates for each parameter, why not calculate individual momentum changes for each parameter and store them separately? This is where a new modified technique and improvement comes into play, called Adam.
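A hedged sketch of the Adam update combining both ideas (β1, β2, and ε below follow the commonly used defaults; the toy gradient and learning rate are assumptions for illustration):

```python
import numpy as np

def grad(w):
    # Gradient of a toy error E(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([3.0, -2.0])
m = np.zeros_like(w)        # per-parameter momentum (1st moment of gradients)
v = np.zeros_like(w)        # per-parameter squared-gradient history (2nd moment)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g          # individual momentum for each parameter
    v = beta2 * v + (1 - beta2) * g ** 2     # individual adaptive scaling for each parameter
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)    # approaches the minimum at [0, 0]
```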
Visualization of the Optimization Algorithms
Which optimizer should we use?
Adam works well in practice and outperforms other Adaptive techniques.
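In practice, frameworks like Keras (used in the courses mentioned above) let you swap between these optimizers with a single argument. A minimal sketch, assuming a toy classification model and synthetic data chosen purely for illustration:

```python
import numpy as np
from tensorflow import keras

# Synthetic data: 1000 samples, 20 features, 3 classes (purely illustrative)
X = np.random.rand(1000, 20)
y = keras.utils.to_categorical(np.random.randint(0, 3, size=1000), num_classes=3)

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])

# Swap "adam" for "sgd", "rmsprop", "adagrad", ... to compare optimizers
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

model.fit(X, y, batch_size=64, epochs=5, verbose=0)   # mini-batch training
```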
References —
- Optimizing Gradient Descent: http://sebastianruder.com/optimizing-gradient-descent/
- Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V, … Ng, A. Y. (2012). Large Scale Distributed Deep Networks. NIPS 2012: Neural Information Processing Systems. http://doi.org/10.1109/ICDAR.2011.95
- Ioffe, S., & Szegedy, C. (2015). Batch Normalization : Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv Preprint arXiv:1502.03167v3.
- Qian, N. (1999). On the momentum term in gradient descent learning algorithms. Neural Networks : The Official Journal of the International Neural Network Society, 12(1), 145–151. http://doi.org/10.1016/S0893-6080(98)00116-6
- Kingma, D. P., & Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations
- Zaremba, W., & Sutskever, I. (2014). Learning to Execute, 1–25. Retrieved from http://arxiv.org/abs/1410.4615
- Zhang, S., Choromanska, A., & LeCun, Y. (2015). Deep learning with Elastic Averaging SGD. Neural Information Processing Systems Conference (NIPS 2015). Retrieved from http://arxiv.org/abs/1412.6651
- Darken, C., Chang, J., & Moody, J. (1992). Learning rate schedules for faster stochastic gradient search. Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop, (September). http://doi.org/10.1109/NNSP.1992.253713