In linear regression with one variable, the hypothesis is
hθ(x) = θ0 + θ1x
where the hypothesis is the best-fitting line for the data. Choosing theta values that produce this best-fitting line is the core of building the model, so we need a way to measure how good any given pair of theta values is.
A cost function is that measure: it is the quantity we minimize to determine the theta values. In linear regression the cost function is the sum of squared errors over the training set, and gradient descent is an iterative method for finding the minimum of that function.
The cost function in linear regression is
J(θ) = 1/(2m) Σᵢ (hθ(xᵢ) − yᵢ)²
Here hθ(xᵢ) − yᵢ is the difference between the hypothesis and the correct output y; we square these errors and sum them over all training examples. m is the number of training examples, and dividing by m normalizes the cost (the extra factor of 2 simply cancels when we take the derivative).
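To make this concrete, here is a minimal sketch of the cost function on a hypothetical three-example data set (the data and the `cost` helper are illustrative, not part of the model above):

```python
import numpy as np

# Toy data: m = 3 examples, with a leading column of ones so that
# np.dot(X, theta) computes h(theta) = theta0 + theta1 * x in one step.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])   # exactly fits theta = [1, 1]

def cost(theta, X, y):
    m = len(y)
    residual = np.dot(X, theta) - y       # h_theta(x_i) - y_i for every i
    return np.sum(residual ** 2) / (2 * m)

print(cost(np.array([1.0, 1.0]), X, y))   # perfect fit -> 0.0
print(cost(np.array([0.0, 1.0]), X, y))   # off by 1 everywhere -> 0.5
```

A perfect fit gives a cost of exactly zero; any other theta gives a positive cost, which is what gradient descent will drive down.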
Gradient descent is the iterative method that minimizes this cost function and thereby finds the best-fitting hypothesis.
Gradient descent starts with initial values of θ0 and θ1 and evaluates the cost function for them. θ0 and θ1 are then updated repeatedly until gradient descent reaches the global optimum of the cost function (for linear regression the cost function is convex, so the only optimum is the global one).
repeat until convergence: θⱼ := θⱼ − α ∂/∂θⱼ J(θ0, θ1), where α is the learning rate. The learning rate sets the size of each step toward the optimum. If α is too small, gradient descent will be slow; if α is too large, it will keep overshooting the minimum. If θ0 and θ1 are already at the optimum, the derivatives are 0 and the update leaves them unchanged.
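For this cost function the partial derivative works out to ∂/∂θⱼ J = (1/m) Σᵢ (hθ(xᵢ) − yᵢ) xᵢⱼ, which in vector form is Xᵀ(Xθ − y)/m. A single update step on a hypothetical toy data set (the numbers here are illustrative) looks like this:

```python
import numpy as np

# Toy data with a bias column of ones, as in the cost function above.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = np.zeros(2)
alpha = 0.1
m = len(y)

# gradient of J(theta): X^T (X.theta - y) / m
gradient = np.dot(X.T, np.dot(X, theta) - y) / m
theta = theta - alpha * gradient
print(theta)   # one step from [0, 0] toward the optimum [1, 1]
```

Repeating this update is exactly the loop the full implementation below performs.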
import numpy as np
import random

def gradient_descent(x, y, theta, alpha, m, iteration):
    # x has a leading column of ones, so np.dot(x, theta) computes the
    # hypothesis h(theta) = theta0 + theta1 * x for every example at once.
    # For example, with theta = [0, 1] and
    # x = [[1, 101], [1, 122], [1, 143], [1, 214]],
    # each row gives h(theta) = theta0 + theta1 * x.
    x_trans = x.transpose()
    for i in range(0, iteration):
        hypothesis = np.dot(x, theta)
        # difference between prediction and correct output
        loss = hypothesis - y
        # cost function J(theta)
        cost = np.sum(loss ** 2) / (2 * m)
        # gradient of J(theta) with respect to theta
        gradient = np.dot(x_trans, loss) / m
        theta = theta - alpha * gradient
    return theta

def generate_data(num_points, bias, variance):
    x = np.zeros(shape=(num_points, 2))
    y = np.zeros(shape=num_points)
    for i in range(0, num_points):
        x[i][0] = 1            # bias column
        x[i][1] = i
        y[i] = (i + bias) + random.uniform(0, 1) * variance
    return x, y

# generate 100 points with a bias of 25 and variance of 10 as a bit of noise
x, y = generate_data(100, 25, 10)
m, n = np.shape(x)
num_iterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradient_descent(x, y, theta, alpha, m, num_iterations)
print(theta)
import matplotlib.pyplot as plt

def hypothesis(x, theta):
    return np.dot(x, theta)

# plot the data points against the feature column, then the fitted line
plt.plot(x[:, 1], y, 'ro')
plt.plot(x[:, 1], hypothesis(x, theta), 'orange')
plt.show()
This is the traditional (batch) gradient descent, and it is often too slow when training on a large data set. There are other approaches for reaching the optimum.
Some of them are the normal equation, stochastic gradient descent, momentum, AdaGrad, and the Adam optimizer. In real-world problems the data is often sparse, and in such settings the Adam optimizer tends to perform well.
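Of these alternatives, the normal equation is the simplest to illustrate: it solves for theta in closed form as θ = (XᵀX)⁻¹Xᵀy, with no learning rate and no iterations. A minimal sketch on a hypothetical toy data set (illustrative numbers, not the generated data above):

```python
import numpy as np

# Toy data with a bias column; y = 1 + x exactly, so theta should be [1, 1].
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

# Normal equation: solve (X^T X) theta = X^T y directly.
theta = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print(theta)   # -> close to [1., 1.]
```

This is exact and fast for a small number of features, but inverting or solving with XᵀX becomes expensive as the feature count grows, which is where the iterative methods above take over.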