Logistic Regression

Logistic regression is a statistical model that relates one or more independent variables to an outcome. The outcome is measured with a dichotomous variable, that is, a variable with only two possible values.

In logistic regression the outcome is binary. For example, a tumour can be coded as 0 if it is benign and 1 if it is malignant.

The goal of logistic regression is to find the best fit that describes the relationship between a dichotomous outcome and its predictors. The outcome is the dependent variable (Y-axis) and the predictors are the independent variables (X-axis).

The best fit described above is given by a logistic function. The most commonly used logistic function is the sigmoid function, an S-shaped curve represented as 1 / (1 + e^-x), where e is the base of the natural logarithm.
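As a small illustration of the sigmoid function, here is a minimal sketch in plain Python. The function name sigmoid and the sample inputs are my own choices for the example and do not come from any library.

```python
import math

def sigmoid(x):
    """Logistic (sigmoid) function: 1 / (1 + e^-x)."""
    # Branch on the sign of x so that math.exp never receives a large
    # positive argument, which would raise OverflowError.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 maps to exactly 0.5.
for x in (-5, -1, 0, 1, 5):
    print(x, sigmoid(x))
```

The output is always between 0 and 1, which is why it can be read as a probability.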

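Logistic regression uses an equation that is very similar to linear regression. The difference is that linear regression returns an unbounded numeric value, whereas logistic regression returns a probability between 0 and 1, which is then thresholded (commonly at 0.5) to give a binary output of 0 or 1.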

It is represented as y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

Here y is the model output (the probability that the outcome is 1, which is rounded to give the binary result), b0 is the bias and b1 is the weight, or coefficient, of x. To find the best values for the coefficients, a numerical method such as a quasi-Newton method can be used.
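To make the equation concrete, the sketch below evaluates it for a single predictor and turns the result into a class label. The values of b0 and b1 are hypothetical, picked only for illustration (they are not fitted to any data), and the 0.5 threshold is a common convention that I am assuming here.

```python
import math

def predict_proba(x, b0, b1):
    """y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))"""
    z = b0 + b1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def predict_class(x, b0, b1, threshold=0.5):
    """Turn the probability into a binary output of 0 or 1."""
    return 1 if predict_proba(x, b0, b1) >= threshold else 0

# Hypothetical coefficients, not estimated from data
b0, b1 = -4.0, 2.0

for x in (1.0, 2.0, 3.0):
    print(x, predict_proba(x, b0, b1), predict_class(x, b0, b1))
```

With these coefficients the probability crosses 0.5 at x = 2, because that is where b0 + b1*x equals 0.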

If the cost function of linear regression is used for logistic regression, the result is a non-convex function, meaning there can be many local minima. Instead, the cost function for logistic regression can be defined as follows.

If y = 1 the cost is -log(h(x)), and if y = 0 the cost is -log(1 - h(x)), where h(x) is the predicted probability. These can be combined into one equation: -y*log(h(x)) - (1 - y)*log(1 - h(x)). If y = 1, the factor (1 - y) becomes 0 and only the first part of the equation is left. If y = 0, the first part of the equation becomes 0 and only the second part is left.
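The per-example cost can be checked numerically. This is a minimal sketch in which example_cost is a name I made up, and h stands for the predicted probability h(x).

```python
import math

def example_cost(y, h):
    """Cost of one example with true label y (0 or 1) and predicted
    probability h, where 0 < h < 1 (math.log fails at exactly 0)."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# When y = 1 only the -log(h) part remains; when y = 0 only -log(1 - h).
print(example_cost(1, 0.9), -math.log(0.9))
print(example_cost(0, 0.9), -math.log(1 - 0.9))

# A confident wrong prediction costs far more than a confident right one.
print(example_cost(1, 0.1) > example_cost(1, 0.9))
```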

The overall cost function is the average of the above expression over all m training examples, that is, 1/m times the sum of the per-example costs. There are other ways to arrive at a cost function. This one can be derived from maximum likelihood estimation, which gives a principled way of finding the parameters.
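The sketch below computes the average cost and minimises it for one predictor. Note that it uses plain gradient descent rather than a quasi-Newton method, purely to keep the example free of dependencies. The toy data, the learning rate and the number of steps are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(b0, b1, xs, ys):
    """Average of -y*log(h) - (1 - y)*log(1 - h) over all examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(b0 + b1 * x)
        total += -y * math.log(h) - (1 - y) * math.log(1 - h)
    return total / len(xs)

def fit(xs, ys, lr=0.1, steps=2000):
    """Plain gradient descent on the average cost (not quasi-Newton)."""
    b0, b1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # derivative of the cost w.r.t. z
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / m
        b1 -= lr * g1 / m
    return b0, b1

# Made-up toy data; the classes overlap so the optimum is finite
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

initial_cost = cost(0.0, 0.0, xs, ys)
b0, b1 = fit(xs, ys)
final_cost = cost(b0, b1, xs, ys)
print(initial_cost, final_cost, b0, b1)
```

Because this cost is convex in b0 and b1, a small enough learning rate lowers it at every step, which is the property the squared-error cost would not guarantee here.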

Python Code:

# Using multivariate classification dataset http://archive.ics.uci.edu/ml/datasets/Heart+Disease
# Using http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/reprocessed.hungarian.data

# The first step is to know your data. Take your time getting to know it
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

head = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 
        'restecg', 'thalach', 'exange', 'oldpeak', 'slope', 
        'ca', 'thal', 'y']

dataframe = pd.read_csv('heart-disease.csv', sep=' ', names = head)
dataframe = dataframe.dropna()

print(dataframe.shape)
print(list(dataframe.columns))

# Merging classes 2, 3 and 4 into class 1 (disease present), leaving 0 (no disease)
u = dataframe['y'].unique()
print(u)
dataframe['y'] = np.where(dataframe['y'] == 2, 1, dataframe['y'])
dataframe['y'] = np.where(dataframe['y'] == 3, 1, dataframe['y'])
dataframe['y'] = np.where(dataframe['y'] == 4, 1, dataframe['y'])

u = dataframe['y'].unique()
print(u)

mean_by_y = dataframe.groupby('y').mean()
mean_by_sex = dataframe.groupby('sex').mean()
mean_by_age = dataframe.groupby('age').mean()

pd.crosstab(dataframe.age, dataframe.y).plot(kind = 'bar')
plt.title('Frequency per age')
plt.xlabel('Age')
plt.ylabel('Frequency')

pd.crosstab(dataframe.sex, dataframe.y).plot(kind = 'bar')
plt.title('Frequency per sex')
plt.xlabel('Sex')
plt.ylabel('Frequency')
# plt.show()

print(mean_by_y)
print(mean_by_sex)
print(mean_by_age)

print(dataframe['y'].value_counts())
sns.countplot(x = 'y', data = dataframe, palette='hls')
# plt.show()

y = ['y']
X=[i for i in dataframe if i not in y]

# Using RFE
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# sklearn.cross_validation was removed; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split

logreg = LogisticRegression()
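# X holds the 13 predictor names; recent scikit-learn requires the keyword argument
rfe = RFE(logreg, n_features_to_select=len(X))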
rfe = rfe.fit(dataframe[X], dataframe.y )

print(rfe.support_)
print(rfe.ranking_)

# Since every feature is selected (all True), keep all of them
col = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 
        'restecg', 'thalach', 'exange', 'oldpeak', 'slope', 
        'ca', 'thal']

X=dataframe[col]
y=dataframe['y']

# import statsmodels.api as sm

# logit_model=sm.Logit(y,X)
# result=logit_model.fit()
# print(result.summary())

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

from sklearn import model_selection
from sklearn.model_selection import cross_val_score
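# random_state only takes effect with shuffle=True (recent scikit-learn raises an error otherwise)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)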
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))

# Reference:
# https://towardsdatascience.com/building-a-logistic-regression-in-python-step-by-step-becd4d56c9c8