When talking about linear regression with only one variable, the hypothesis is

hθ(x) = θ0 + θ1x

where the hypothesis is the best-fitting line for the data. In this equation, choosing the theta values that create the best-fit line for our linear regression model is the important part, so we need a way to determine the best values for those thetas.

A cost function is the way to determine the theta values: it is the quantity we minimize. In linear regression the cost function is the sum of squared errors over the training set, and gradient descent is an iterative method for finding the minimum of that function.

**Cost Function**

The cost function in linear regression is determined by

J(θ) = 1/(2m) Σ (h(xi) − yi)^2

where h(xi) is the hypothesis for the i-th training example and yi is the correct output; we sum the squared differences over all examples. m is the number of training examples, so dividing by m normalizes the cost, and the extra factor of 2 simplifies the derivative later.
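The formula above can be checked directly in code. This is a minimal sketch (the `cost` helper and the toy data are my own, not part of the original post):

```
import numpy as np

def cost(x, y, theta):
    """J(theta) = 1/(2m) * sum((h(xi) - yi)^2)."""
    m = len(y)
    loss = np.dot(x, theta) - y   # h(theta) - y for every example
    return np.sum(loss ** 2) / (2 * m)

# Toy data lying exactly on y = 1 + 2x, so the best thetas give zero cost
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost(x, y, np.array([1.0, 2.0])))  # 0.0
print(cost(x, y, np.array([0.0, 0.0])))  # (1 + 9 + 25) / 6
```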

**Gradient Descent**

Gradient descent is an iterative method for minimizing the cost function, which gives us the hypothesis that best fits the data.

Gradient descent first starts with initial values of θ0 and θ1 and evaluates the cost function at those thetas. θ0 and θ1 then keep changing, step by step, until gradient descent reaches the global optimum of the cost function.

repeat until convergence: θj = θj − α ∂/∂θj J(θ0, θ1), where α is the learning rate. The learning rate sets the size of each step toward the optimum: if α is too small, gradient descent will be slow, but if α is too large, it will keep overshooting the minimum. If θ0 and θ1 are already at the optimum, the derivatives are 0 and the thetas stop changing.
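Working out that derivative for the squared-error cost gives ∂/∂θj J(θ0, θ1) = 1/m Σ (h(xi) − yi)·xij, which is exactly what each update uses. A minimal sketch of one simultaneous update of all thetas (the `step` helper and the toy data are my own):

```
import numpy as np

def step(x, y, theta, alpha):
    """One simultaneous gradient-descent update of every theta."""
    m = len(y)
    loss = np.dot(x, theta) - y                 # h(theta) - y for every example
    gradient = np.dot(x.transpose(), loss) / m  # d/d(theta_j) J, for all j at once
    return theta - alpha * gradient

# At the optimum the gradient is zero, so theta stops changing
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
theta = np.array([1.0, 2.0])   # already the best fit for y = 1 + 2x
print(step(x, y, theta, 0.1))  # [1. 2.]
```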

```
import numpy as np
import random

def gradient_descent(x, y, theta, alpha, m, iterations):
    # x is transposed once so it can be multiplied by the loss vector below
    # multiplying x by theta gives the hypothesis: if theta = [theta0, theta1]
    # and each row of x is [1, xi], then h(theta) = theta0 + theta1 * xi
    x_trans = x.transpose()
    for i in range(0, iterations):
        hypothesis = np.dot(x, theta)
        # difference between the prediction and the correct output
        loss = hypothesis - y
        # cost function, useful for monitoring convergence
        cost = np.sum(loss ** 2) / (2 * m)
        gradient = np.dot(x_trans, loss) / m
        theta = theta - alpha * gradient
    return theta

def generate_data(num_points, bias, variance):
    x = np.zeros(shape=(num_points, 2))
    y = np.zeros(shape=num_points)
    for i in range(0, num_points):
        x[i][0] = 1
        x[i][1] = i
        y[i] = (i + bias) + random.uniform(0, 1) * variance
    return x, y

# generate 100 points with a bias of 25 and a variance of 10 as a bit of noise
x, y = generate_data(100, 25, 10)
m, n = np.shape(x)
num_iterations = 100000
alpha = 0.0005
theta = np.ones(n)
theta = gradient_descent(x, y, theta, alpha, m, num_iterations)
print(theta)
```

```
import matplotlib.pyplot as plt

def hypothesis(x, theta):
    return np.dot(x, theta)

# plot the data points and the fitted line (x[:, 1] holds the feature values)
plt.plot(x[:, 1], y, 'ro')
plt.plot(x[:, 1], hypothesis(x, theta), color='orange')
plt.show()
```

This is the traditional (batch) gradient descent, and it is often too slow when training on a large data set. There are other approaches for reaching the optimum.

Some of them are the normal equation, stochastic gradient descent, momentum, AdaGrad, and the Adam optimizer. Since real-world data is often very sparse, the Adam optimizer frequently performs best in practice.
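As an illustration of one of these alternatives, here is a minimal sketch of stochastic gradient descent, which updates theta using a single randomly chosen example per step instead of the whole training set (the function and the toy data are my own, not from the referenced answer):

```
import numpy as np
import random

def stochastic_gradient_descent(x, y, theta, alpha, iterations):
    """Update theta from one randomly picked training example per step."""
    m = len(y)
    for _ in range(iterations):
        i = random.randrange(m)
        loss = np.dot(x[i], theta) - y[i]    # error on a single example
        theta = theta - alpha * loss * x[i]  # gradient from that example only
    return theta

random.seed(0)
x = np.array([[1.0, xi] for xi in range(10)])
y = np.array([1.0 + 2.0 * xi for xi in range(10)])
theta = stochastic_gradient_descent(x, y, np.ones(2), 0.01, 5000)
print(theta)  # close to [1, 2]
```

Each step is much cheaper than a full pass over the data, which is why this scales better to large training sets, at the price of a noisier path toward the optimum.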

**Reference**

Code is from https://stackoverflow.com/a/17796231/2950010

Machine Learning course by Andrew Ng (YouTube lectures)
