When talking about linear regression (with only one variable), the hypothesis is

βπ (π₯) = π0 + π1π₯

where the hypothesis is the best-fitting line for the data. In this equation, the values θ0 and θ1 determine the line, so fitting a linear regression model means finding the theta values that produce the best fit for our data.
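As a minimal sketch of the hypothesis, here is the line evaluated for one input; the theta values below are made up purely for illustration:

```python
# Hypothetical theta values, chosen only to illustrate the formula
theta0, theta1 = 2.0, 0.5

def h(x):
    # h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

print(h(4.0))  # 2.0 + 0.5 * 4.0 = 4.0
```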

A cost function is the way to determine those theta values: it measures how badly the model fits, and we pick the thetas that minimize it. In linear regression the cost function is the sum of squared errors over the training set, and gradient descent is an iterative method for finding the minimum of that function.

Cost Function

The cost function in linear regression is defined by

π½(π) = 1/2π ββπ(π₯π) β π¦π)^2

Here hθ(xᵢ) − yᵢ is the difference between the hypothesis and the correct output y, the squared errors are summed over all training examples, and m is the number of training examples. Dividing by m normalizes the cost, and the extra factor of 2 simplifies the derivative.
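The cost function can be computed directly from this definition. The tiny dataset below is made up for illustration; its first column is the bias term, matching the design-matrix layout used in the code later in this post:

```python
import numpy as np

def cost(theta, x, y):
    # J(theta) = 1/(2m) * sum((h(x_i) - y_i)^2), with h(x) = x . theta
    m = len(y)
    predictions = x.dot(theta)
    return np.sum((predictions - y) ** 2) / (2 * m)

# Made-up data where y = x exactly
x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

print(cost(np.array([0.0, 1.0]), x, y))  # perfect fit -> 0.0
print(cost(np.array([0.0, 0.0]), x, y))  # (1 + 4 + 9) / (2 * 3) = 2.333...
```

A perfect fit drives the cost to zero, which is exactly what gradient descent will try to approach.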

Gradient descent is an iterative method for minimizing the cost function, and thus for finding the hypothesis that best fits the data.

Gradient descent starts with initial values of θ0 and θ1 and evaluates the cost function at those values. If it has reached the optimum it stops; otherwise θ0 and θ1 keep changing until the global optimum of the cost function is reached (for this convex cost, the only optimum is the global one).

repeat until convergence ππ = ππ β πΌ π/πππ π½(π0,π1) where Ξ± is the learning steep. This learning steep is the step forward to global optimum. If Ξ± is too slow gradient descent will be slow but if Ξ± is too large it will keep overshooting. If the ΞΈ0 and ΞΈ1 are already on the optimum that the derivatives will give 0.

```python
import numpy as np
import random

def gradient_descent(x, y, theta, alpha, m, iterations):
    # x is the design matrix; its first column is all ones, so
    # np.dot(x, theta) gives the hypothesis h(theta) = theta0 + theta1 * x
    # e.g. theta = [0, 1] and x = [[1, 101], [1, 122], [1, 143], [1, 214]]
    x_trans = x.transpose()
    for i in range(0, iterations):
        hypothesis = np.dot(x, theta)
        # difference between prediction and correct output
        loss = hypothesis - y
        # cost function J(theta)
        cost = np.sum(loss ** 2) / (2 * m)
        # gradient of J(theta) with respect to theta
        gradient = np.dot(x_trans, loss) / m
        # update step
        theta = theta - alpha * gradient
    return theta

def generate_data(num_points, bias, variance):
    x = np.zeros(shape=(num_points, 2))
    y = np.zeros(shape=num_points)
    for i in range(0, num_points):
        x[i][0] = 1
        x[i][1] = i
        y[i] = (i + bias) + random.uniform(0, 1) * variance
    return x, y

# generate 100 points with a bias of 25 and a variance of 10 as a bit of noise
x, y = generate_data(100, 25, 10)
m, n = np.shape(x)
num_iterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradient_descent(x, y, theta, alpha, m, num_iterations)
print(theta)
```
```python
import matplotlib.pyplot as plt

def hypothesis(x, theta):
    return np.dot(x, theta)

# plot only the feature column x[:, 1], not the bias column
plt.plot(x[:, 1], y, 'ro')
plt.plot(x[:, 1], hypothesis(x, theta), color='orange')
plt.show()
```

This is traditional (batch) gradient descent, and it is often too slow when training on a large data set. There are other approaches for reaching the optimum.

Some are the normal equation, stochastic gradient descent, momentum, AdaGrad, and the Adam optimizer. Since real-world data is often very sparse, the Adam optimizer frequently performs best in practice.
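As a minimal sketch of one of these alternatives, stochastic gradient descent updates theta one training example at a time instead of over the whole batch. The `sgd` helper and the toy data below are written for this sketch, not taken from any library:

```python
import numpy as np

def sgd(x, y, theta, alpha, epochs, seed=0):
    # Stochastic gradient descent: one update per example, in shuffled order
    rng = np.random.default_rng(seed)
    m = len(y)
    for _ in range(epochs):
        for i in rng.permutation(m):
            # gradient of the single-example squared error
            loss = x[i].dot(theta) - y[i]
            theta = theta - alpha * loss * x[i]
    return theta

# Made-up noise-free data following y = 1 + x
x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
theta = sgd(x, y, np.zeros(2), alpha=0.05, epochs=500)
print(theta)  # should approach [1, 1]
```

Because each step looks at a single example, SGD is much cheaper per update on large data sets, at the cost of a noisier path to the optimum.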

Reference

Code is from https://stackoverflow.com/a/17796231/2950010