Flight Price Prediction

Apr 19, 2021

Flight Price Prediction using Linear Regression

So, what is linear regression, and how can we use it to predict flight prices?

What is Machine Learning?

Machine learning is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence.

Machine learning is commonly divided into two broad categories:

a) Supervised Learning: Predictive Models

1. Regression Models

2. Classification Models

b) Unsupervised Learning: Pattern/Structure Recognition (Unlabeled Data)

1. Clustering

2. Association

3. Dimensionality Reduction

a) Supervised Learning:

1. Regression Models

Linear Regression: Linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression.

2. Classification Models

Types of Classification Models:

Logistic Regression
Naive Bayes
K-Nearest Neighbours
Decision Tree
Random Forest Classifier
Support Vector Machine

So, linear regression is an algorithm used to predict a target variable that takes a range of numerical (continuous) values, as opposed to categorical data.
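To make the idea concrete, here is a minimal sketch of simple linear regression with scikit-learn. The toy data (`x_toy`, `y_toy`) is invented for illustration and follows y = 3x + 5 exactly; it is not part of the flight dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one explanatory variable, y = 3*x + 5 with no noise
x_toy = np.arange(10, dtype=float).reshape(-1, 1)
y_toy = 3.0 * x_toy.ravel() + 5.0

# Fit a straight line to the data
toy_model = LinearRegression()
toy_model.fit(x_toy, y_toy)

slope = toy_model.coef_[0]          # learned coefficient for x
intercept = toy_model.intercept_    # learned constant term

# Predict a numerical target for an unseen input
prediction = toy_model.predict(np.array([[12.0]]))[0]
print(slope, intercept, prediction)
```

The model recovers the slope and intercept of the line and returns a continuous value for a new input, which is what separates regression from classification.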

Problem:

Flight ticket prices can be hard to guess: the price we see today for a flight may be quite different when we check the same flight tomorrow. Travellers often say that flight ticket prices are unpredictable. The dataset used here provides prices of flight tickets for various airlines and city pairs between March and June of 2019.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:

Airline: The name of the airline.

Date_of_Journey: The date of the journey

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight

(Target Variable) Price: The price of the ticket
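Several of the features above are stored as text, so they may need to be converted into numbers before modelling. The sketch below is only an illustration and rests on assumptions: that Date_of_Journey looks like "24/03/2019" and Duration looks like "2h 50m". The helper `duration_to_minutes`, the new column names, and the sample rows are hypothetical.

```python
import pandas as pd

def duration_to_minutes(text):
    """Convert a string such as '2h 50m', '19h' or '45m' to total minutes."""
    hours, minutes = 0, 0
    for part in text.split():
        if part.endswith("h"):
            hours = int(part[:-1])
        elif part.endswith("m"):
            minutes = int(part[:-1])
    return hours * 60 + minutes

# Hypothetical sample rows; the real values may be formatted differently
sample = pd.DataFrame({
    "Date_of_Journey": ["24/03/2019", "01/05/2019"],
    "Duration": ["2h 50m", "19h"],
})

# Split the journey date into numeric day and month columns
journey = pd.to_datetime(sample["Date_of_Journey"], format="%d/%m/%Y")
sample["Journey_Day"] = journey.dt.day
sample["Journey_Month"] = journey.dt.month

# Express the duration as a single number of minutes
sample["Duration_Min"] = sample["Duration"].apply(duration_to_minutes)
print(sample)
```

Numeric columns like these keep the natural ordering of dates and durations, which a plain category code would lose.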

Let's start creating the model.

We will follow these steps to create a model for flight price prediction.

Steps used:

1> Data cleansing and wrangling
2> Define the metric for which the model is optimized
3> Feature engineering
4> Data pre-processing
5> Feature selection
6> Split the data into training and test data sets
7> Model selection
8> Model validation
9> Interpret the result
10> Save the model
11> Reload the model for prediction on test.csv
12> Clean the data in test.csv
13> Predict prices
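Steps 10 and 11 (saving and reloading the model) could be sketched with joblib as follows. This is a minimal illustration, not the author's code: the tiny model, the temporary directory, and the file name `flight_price_model.joblib` are assumptions made for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in for the trained flight price model
x_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(x_demo, y_demo)

# Step 10: save the fitted model to disk (file name is an assumption)
model_path = os.path.join(tempfile.mkdtemp(), "flight_price_model.joblib")
joblib.dump(model, model_path)

# Step 11: reload it later, e.g. before predicting on test.csv
reloaded = joblib.load(model_path)

# The reloaded model should give the same predictions as the original
same_predictions = np.allclose(model.predict(x_demo), reloaded.predict(x_demo))
print(same_predictions)
```

Whatever cleaning and encoding was applied to the training data has to be applied to test.csv in the same way before calling `predict` on the reloaded model.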

Firstly,

Import Dataset and Libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

# Raw string so the backslash in the path is not treated as an escape character
data = pd.read_csv(r'flight_price\Data_Train.csv')

Let's review the data in the CSV file:

data

To get the number of rows and columns, we use shape:

data.shape

rows=10683, columns=11

Datatypes of data:

data.dtypes

EDA PROCESS:

data.describe()

Observation: describe() shows statistics only for Price, because all the other columns hold categorical (text) data.

Min-Max: Because the gap between the minimum and maximum values is large, the data needs to be scaled.

Checking Null Values:

data.isnull().sum()

Observation: Route and Total_Stops contain null values, which need to be handled.

After handling null values:
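The code used to handle the null values is not shown here. One common option, assumed in this sketch, is to drop the few affected rows with `dropna`. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with missing Route / Total_Stops values
toy = pd.DataFrame({
    "Route": ["DEL-BOM", np.nan, "BLR-DEL"],
    "Total_Stops": ["non-stop", "1 stop", np.nan],
    "Price": [3897, 7662, 13882],
})

print(toy.isnull().sum())   # count of nulls per column before cleaning

# Drop every row where Route or Total_Stops is missing
cleaned = toy.dropna(subset=["Route", "Total_Stops"]).reset_index(drop=True)

print(cleaned.isnull().sum())   # all zeros after cleaning
```

Dropping rows is reasonable when only a handful of records are affected; with many missing values, filling them in (for example with the most frequent category) may be a better choice.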

Number of categories in each column:

data.columns

print("Number of Categories: ")
for ColName in data[['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
                     'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
                     'Additional_Info', 'Price']]:
    # Count the distinct values in each column
    print("{} = {}".format(ColName, len(data[ColName].unique())))

Visualization of categories and data:

For categorical data we can use countplot, and for numerical data we can use distplot (deprecated in recent seaborn releases in favor of histplot and displot).

import seaborn as sns
alpha = sns.countplot(x="Airline", data=data, dodge=False)
print(data["Airline"].value_counts())

Jet Airways has the largest number of flights in the dataset.

alpha = sns.countplot(x="Source", data=data)
print(data["Source"].value_counts())

Data Encoding:

Because most of the columns are categorical, they need to be encoded as numbers.

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()

# Replace every text (object) column with integer category codes
for i in data.columns:
    if data[i].dtype == "object":
        data[i] = enc.fit_transform(data[[i]]).ravel()

data
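To see what the encoder does, here is a small sketch on an invented Airline column. It also shows one-hot encoding with `pd.get_dummies` as an alternative; that alternative is not used in this article and is included only for comparison.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the Airline column
toy = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Jet Airways", "IndiGo"]})

# OrdinalEncoder sorts the categories and assigns each one a code 0, 1, 2, ...
encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy[["Airline"]]).ravel()
categories = list(encoder.categories_[0])
print(categories)   # ['Air India', 'IndiGo', 'Jet Airways']
print(codes)        # [1. 0. 2. 1.]

# Alternative: one-hot encoding creates one 0/1 column per category
one_hot = pd.get_dummies(toy["Airline"])
print(one_hot.shape)   # (4, 3)
```

Ordinal codes imply an order (Air India < IndiGo < Jet Airways) that has no real meaning for airlines, and a linear model will treat that order as informative. One-hot encoding avoids this at the cost of more columns.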

Correlation of the features with the target variable:

corr_matrix_hmap = data.corr()
plt.figure(figsize=(22,20))
sns.heatmap(corr_matrix_hmap, annot=True, linewidths=0.1, fmt="0.2f")
plt.show()

Visualization of correlation using a heatmap

corr_matrix_hmap["Price"].sort_values(ascending=False)

plt.figure(figsize=(10,5))
data.corr()['Price'].sort_values(ascending=False).drop(['Price']).plot(kind='bar', color='c')
plt.xlabel('Feature', fontsize=14)
plt.ylabel('Correlation with Price', fontsize=14)
plt.title('Correlation', fontsize=18)
plt.show()

Most strongly correlated with price: Route (suggesting that the chosen route has a major influence on the price)

Least correlated with price: Departure time (Dep_Time)

Negatively correlated with price: Total_Stops during the journey

Separating the independent variables and the target variable:

# x = independent variables (every column except the last)
x = data.iloc[:,0:-1]
x.head()

# y = target variable = Price (the last column)
y = data.iloc[:,-1]
y.head()

SCALING the data using Min-Max Scaler:

It scales each feature to the range 0 to 1.

from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings('ignore')

# Scale the feature matrix x (reassigning `data` here would leave x and y unscaled)
x = pd.DataFrame(mms.fit_transform(x), columns=x.columns, index=x.index)
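Min-max scaling maps each value v to (v - min) / (max - min). The sketch below checks this formula against MinMaxScaler on a few invented prices; the numbers are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical prices as a single-column feature matrix
prices = np.array([[2000.0], [5000.0], [8000.0], [14000.0]])

# Manual min-max scaling: (value - min) / (max - min)
manual = (prices - prices.min()) / (prices.max() - prices.min())

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(prices)

print(manual.ravel())   # [0.   0.25 0.5  1.  ]
print(scaled.ravel())   # identical values
```

In a stricter workflow the scaler would be fitted on the training split only and then applied to the test split with `transform`, so that no information from the test data leaks into training.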

MODEL TRAINING:

Finding Best Random State:

from sklearn.linear_model import LinearRegression
maxAccu = 0
maxRS = 0
# Try several train/test splits and keep the random_state with the highest R2 score
for i in range(1, 200):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.33, random_state=i)
    LR = LinearRegression()
    LR.fit(x_train, y_train)
    predrf = LR.predict(x_test)
    score = r2_score(y_test, predrf)  # this is an R2 score, not a mean squared error
    if score > maxAccu:
        maxAccu = score
        maxRS = i

print("Best score is: ", maxAccu, "on Random_state", maxRS)

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.33,random_state = 192)

LR = LinearRegression()
LR.fit(x_train,y_train)
predrf = LR.predict(x_test)

print('r2 Score:', r2_score(y_test, predrf))

Because this is a regression problem, we evaluate the model with the R2 score rather than a confusion matrix, which applies only to classification.
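For reference, the R2 score is 1 minus the ratio of the residual sum of squares to the total sum of squares. The following sketch computes it by hand on invented values and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.0])

# Residual sum of squares: squared errors of the predictions
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: squared deviations from the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                  # 0.975
print(r2_score(y_true, y_pred))   # same value
```

A score of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.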

Finding Best fold:

pred_train = LR.predict(x_train)
pred_test = LR.predict(x_test)
Train_accuracy = r2_score(y_train, pred_train)
Test_accuracy = r2_score(y_test, pred_test)
maxAccu = 0
maxRS = 0

from sklearn.model_selection import cross_val_score
# Compare the mean cross-validation score for 2 to 15 folds
for j in range(2, 16):
    cv_score = cross_val_score(LR, x, y, cv=j)
    cv_mean = cv_score.mean()
    if cv_mean > maxAccu:
        maxAccu = cv_mean
        maxRS = j

    print(f"At cross fold {j} cv score is {cv_mean} and R2 score for training is {Train_accuracy} and R2 score for testing is {Test_accuracy}")
    print("\n")

Observation: At fold 5, the difference between the cross-validation score and the test score is smallest, so we choose cv=5.
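As a self-contained illustration of 5-fold cross-validation, the sketch below uses synthetic data rather than the flight dataset. The explicit `KFold` object with shuffling is an assumption added for the example; passing `cv=5` directly, as above, uses unshuffled folds for regressors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 2))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Five folds: each fold is used once for validation and four times for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print(fold_scores)          # one R2 score per fold
print(fold_scores.mean())   # the mean is the figure compared with the test score
```

If the per-fold scores vary widely, a single train/test split is likely to give an unreliable estimate of model quality.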

Regularization:

To mitigate overfitting, regularization methods such as Lasso, Ridge, or ElasticNet are used. They add a penalty on the size of the coefficients, which keeps the model from fitting noise in the training data.
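The effect of the penalty can be seen in a small sketch on synthetic data, where only the first two of five features influence the target. The data and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic data: features 0 and 1 matter, features 2-4 are pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)   # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: can set coefficients to zero

print("OLS:  ", np.round(ols.coef_, 3))
print("Ridge:", np.round(ridge.coef_, 3))
print("Lasso:", np.round(lasso.coef_, 3))

# Count how many coefficients Lasso removed entirely
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print(n_zero_lasso)
```

Ridge keeps every feature but shrinks its coefficient, while Lasso drops the noise features altogether, which is why Lasso is sometimes also used for feature selection.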

from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('ignore')

# Lasso can shrink the coefficients of features that do not affect y to exactly zero
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
parameters = {'alpha':[.0001,.001,.01,.1,1,10], 'random_state':list(range(0,10))}
ls = Lasso()
clf = GridSearchCV(ls, parameters)
clf.fit(x_train, y_train)

print(clf.best_params_)

ls = Lasso(alpha=1,random_state=0)
ls.fit(x_train,y_train)
ls.score(x_train,y_train)
pred_ls=ls.predict(x_test)

lss=r2_score(y_test,pred_ls)
lss

#cross_validation_mean = cv_mean
#cross_validation_score= cv_score

cross_validation_score = cross_val_score(ls,x,y,cv=5)
cross_validation_mean = cross_validation_score.mean()
cross_validation_mean

Observation: The difference between the mean cross-validation score and the R2 score on the test set should be as small as possible. A small gap indicates that the model performs consistently across different splits of the data.

Ensemble Techniques:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
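# Older scikit-learn versions named these criteria 'mse'/'mae' and accepted max_features='auto'
parameters = {'criterion':['squared_error','absolute_error'], 'max_features':[1.0,"sqrt","log2"]}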

rf = RandomForestRegressor()
clf=GridSearchCV(rf,parameters)
clf.fit(x_train,y_train)
print(clf.best_params_)