Loan Application Status Prediction

8 min readApr 19, 2021

Loan Application Status Prediction using Logistic Regression

So yes, what is Logistic Regression? and how to do prediction for Loan Application Status.

What is Machine Learning?

Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence.

Machine learning is divided into 2 parts:

a) Supervised Learning: Predictive Models

1. Regression Models

2. Classification Models

b) Unsupervised Learning: Pattern/structure Recognition(Unlabeled Data)

1.Clustering

2.Association

3. Dimensionality Reduction

a) Supervised Learning:

Regression Models

Linear Regression: Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression

2. Classification Model

Types of Classification Models:

Logistic Regression: Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of target or dependent variable is dichotomous, which means there would be only two possible classes.
Naive Bayes
K-Nearest Neighbours
Decision Tree
Random Forest Classifier
Support Vector Machine

So, Logistics Regression is an algorithm that is used to predict the target variable which consists of 2 possible cases only.

Load Application Status Prediction is a task that can be done based on historical information of the customer and bank. By checking the dataset already existed regarding the status of the Load Application and creating a model will help us to Predict the further Loan Application Status.

Dataset includes details of applicants who have applied for loan. The dataset includes details like credit history, loan amount, their income, dependents etc.

Will build a model that can predict whether the loan of the applicant will be approved or not on the basis of the details provided in the dataset.

Independent Variables:

- Loan_ID

- Gender

- Married

- Dependents

- Education

- Self_Employed

- ApplicantIncome

- CoapplicantIncome

- Loan_Amount

- Loan_Amount_Term

- Credit History

- Property_Area

Dependent Variable (Target Variable):

- Loan_Status

Lets start creating model.

There are few steps will be following to create model for Loan Application Status Prediction

1> Data cleansing and Wrangling 2> Define the metrics for which model is getting optimized. 3> Feature Engineering 4> Data Pre-processing 5> Feature Selection 6> Split the data into training and test data sets. 7> Model Selection 8> Model Validation 9> Interpret the result 10> Saving Model

Firstly,

Import Dataset and Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings(‘ignore’)

data = pd.read_csv(‘loan_prediction.csv’)

Will, review the data within the CSV file

data

After checking data will check description and get to know minimum value, max value, standard deviation etc.:

data.describe()

To get number of rows and columns will use shape:

data.shape

614- row, 13-columns

Datatypes of data:

data.dtypes

we are having mix type of datatypes for all columns, need to encode the data

In each type of columns in how data are categorized into how many sections:

print(“Number of Categories: “)
for ColName in data[[‘Loan_ID’, ‘Gender’, ‘Married’, ‘Dependents’, ‘Education’,
‘Self_Employed’,’Property_Area’, ‘Loan_Status’]]:
print(“{} = {}”.format(ColName,len(data[ColName].unique())))

Loan_Status is target variable with 2 types of data. sol we can solve it by logistic regression

Observation: This represents number of categories of particular type of data.
As, Loan_Status which is going to be target variable is having 2 types of data so one will be using Logistic Regression

EDA PROCESS:

Checking null values and handling it:

data.isnull().sum()

Null values are present and need to handle it

After Handling null values,

As data are of all mixed types so need to Encode the data using Ordinal Encoder:

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()

for i in data.columns:
if data[i].dtypes == “object”:
data[i]=enc.fit_transform(data[i].values.reshape(-1,1))

data

Correlation between Feature variable and Target Variable:

corr_matrix_hmap=data.corr()
plt.figure(figsize=(22,20))
sns.heatmap(corr_matrix_hmap,annot=True,linewidths=0.1,fmt=”0.2f”)
plt.show()

corr_matrix_hmap[“Loan_Status”].sort_values(ascending=False)

Observation: Most highly corelated variable is : Credit_History, Least is: Education

plt.figure(figsize=(10,5))
data.corr()[‘Loan_Status’].sort_values(ascending=False).drop([‘Loan_Status’]).plot(kind=’bar’,color=’c’)
plt.xlabel(‘Feature’,fontsize=14)
plt.ylabel(‘Column with Target Name’,fontsize=14)
plt.title(‘correlation’,fontsize=18)
plt.show()

DATA CLEANING

Checking Outliers:

#checking for outliers
data.iloc[:,:].boxplot(figsize=[20,8])
plt.subplots_adjust(bottom=0.25)
plt.show()

Applicant_Income if having maximum outliers

Handling Outliers:

# Removing Outliers
from scipy.stats import zscore
z= np.abs(zscore(data))

threshold = 3
print(np.where(z< 3))

#removing outliers
data_new = data[(z< 3).all(axis=1)]

Rechecking of removal of outliers:

data.shape

Before removing Outliers

#After removing outliers
data_new.shape

After removing Outliers

data=data_new

Seperating Independent Variables and Dependent Variables(Target Variables)

# x= independent variable
x = data.iloc[:,0:-1]
x.head()

#y = target variable = Loan_Status
y = data.iloc[:,-1]
y.head()

Checking skewness:

No need to handle skewness if it between -0.5 to 0.5

x.skew()

Max skewness= Self_Employed then , Applicant Income and then,so on

Handling Skewess:

#Method for removing skew

from sklearn.preprocessing import power_transform
z = power_transform(x[0:])
data_new= pd.DataFrame(z,columns=x.columns)

x = data_new

#after removing skewness
x.skew()

Observation: As after checking corelation and skewness, Self_Employed is not that much corelated infact negatively corelated, so one cant drop that column

Visualization:

data.columns

For Categorical type of data:

import seaborn as sns
alpha = sns.countplot(x=”Loan_Status”,data=data)
print(data[“Loan_Status”].value_counts())

Loan_Status which is out target variable contains 2 values, 0 and 1

For Numerical type of data:

df_visual= x[[‘ApplicantIncome’, ‘CoapplicantIncome’, ‘LoanAmount’,
‘Loan_Amount_Term’, ‘Credit_History’]].copy()

import seaborn as sns
sns.distplot(df_visual[‘ApplicantIncome’],kde=True)

Observation: Overfitting is Absent

sns.distplot(df_visual[‘CoapplicantIncome’],kde=True)

Observation: Overfitting is Present

sns.distplot(df_visual[‘Credit_History’],kde=True)

Scaling of Values:

As the range of data is varying we can use Min-Max scaler to limit the ranfe of data between 0 to 1

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings(‘ignore’)

x=mms.fit_transform(x)

MODEL TRAINING:

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.33,random_state = 42)

from sklearn.linear_model import LogisticRegression
lm = LogisticRegression()

lm.fit(x_train,y_train)

lm.score(x_train,y_train)

Predictions:

#predict the values
pred=lm.predict(x_test)
print(“Predicted Allitation”,pred)
print(“Actual Allitation”,y_test)

print(‘Accuracy Score:’,accuracy_score(y_test,pred))

Accuracy Score without removing overfitting and underfitting

Finding Best Random State to get best Accuracy:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
maxAccu=0
maxRS=0
for i in range(1,200):
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.33,random_state = i)
LR = LogisticRegression()
LR.fit(x_train,y_train)
predrf = LR.predict(x_test)
acc =accuracy_score(y_test,predrf)
if acc > maxAccu:
maxAccu = acc
maxRS = i

print(“Best score is: “,maxAccu,”on Random_state”,maxRS)

Train-Test model based on Best Random State:

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.33,random_state = 12)
LR = LogisticRegression()
LR.fit(x_train,y_train)
predrf = LR.predict(x_test)

Confusion Matrix for Logistic Regression:

from sklearn.metrics import confusion_matrix

from sklearn.metrics import classification_report
print(‘Accuracy Score:’, accuracy_score(y_test,predrf))
print(‘Confusion Matrix:’, confusion_matrix(y_test,predrf))
print(‘Classification Report:’, classification_report(y_test,predrf))

Confusion Matrix for Decision Tree Classifier:

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
preddt = dt.predict(x_test)
print(‘Accuracy Score:’, accuracy_score(y_test,preddt))
print(‘Confusion Matrix:’, confusion_matrix(y_test,preddt))
print(‘Classification Report:’, classification_report(y_test,preddt))

Finding best fold:

pred_train = LR.predict(x_train)
pred_test =LR.predict(x_test)
Train_accuracy = accuracy_score(y_train,pred_train)
Test_accuracy = accuracy_score(y_test,pred_test)
maxAccu=0
maxRS=0

from sklearn.model_selection import cross_val_score
for j in range(2,16):
cv_score=cross_val_score(LR,x,y,cv=j)
cv_mean = cv_score.mean()
if cv_mean > maxAccu:
maxAccu = cv_mean
maxRS = j
print(f”At cross fold {j} cv score is {cv_mean} and accuracy score training is {Train_accuracy} and accuracy for the testing is {Test_accuracy}”)

print(“\n”)

Cross Validation for Logistic Regrssion:

from sklearn.model_selection import cross_val_score
cv_score=cross_val_score(LR,x,y,cv=j)
cv_mean = cv_score.mean()
print(“Cross validation score for Logistic Regression”,cv_mean)

Cross Validation for Decision Tree Classifier:

from sklearn.model_selection import cross_val_score
cv_score=cross_val_score(dt,x,y,cv=j)
cv_mean = cv_score.mean()
print(“Cross validation score for Decision Tree”,cv_mean)

Found the Accuracy of Other algorithms and it went like:

Observation: Minimum difference between Accuracy and Cross validation is of Decision Tree Classifier which, will be proven the Best Alogorithm for this model.

REGULARIZTION:

To mitigate the problem of overfitting and underfitting Regularization Methods are used: Lasso, Ridge or ElasticNet.

from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings(‘ignore’)

from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
parameters = {‘alpha’:[.0001,.001,.01,.1,1,10],’random_state’:list(range(0,10))}
EN=ElasticNet()
clf=GridSearchCV(EN,parameters)
clf.fit(x_train,y_train)

print(clf.best_params_)