A Simple Approach to Linear Regression

Linear regression is the next step up from correlation. It is used when we want to predict the value of one variable from the value of another. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable).

For new data scientists, however, it can get overwhelming. So in this blog I have shared an iterative approach, which is also roughly how we do it in large companies.

It is assumed that the reader can follow the code; all conceptual explanations are omitted intentionally to keep the focus on the code, which is adequately documented. The reader is assumed to be a beginner in data science but familiar with statistics and Python.

Let's begin.

Media Company Case Study

Problem Statement: A digital media company (similar to Voot, Hotstar, Netflix, etc.) had launched a show. Initially, the show got a good response, but then witnessed a decline in viewership. The company wants to figure out what went wrong.

In [317]:

# Importing all required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [318]:

# Importing the dataset and dropping the stray unnamed column the CSV carries
media = pd.read_csv('mediacompany.csv')
media = media.drop('Unnamed: 7', axis=1)

In [319]:

#Let's explore the top 5 rows
media.head()

Out[319]:

In [320]:

# Converting date to Pandas datetime format
media['Date'] = pd.to_datetime(media['Date'])

In [321]:

media.head()

Out[321]:

In [322]:

# Deriving "days since the show started"
from datetime import date
d0 = date(2017, 2, 28)
d1 = media.Date
delta = d1 - d0
media['day']= delta

In [323]:

media.head()

Out[323]:

In [324]:

# Cleaning days: the subtraction above yields timedeltas,
# so extract the integer day count directly
media['day'] = media['day'].dt.days

In [325]:

media.head()

Out[325]:

In [326]:

# days vs Views_show
media.plot.line(x='day', y='Views_show')

Out[326]:

In [327]:

# Scatter Plot (days vs Views_show)
colors = (0,0,0)
area = np.pi*3
plt.scatter(media.day, media.Views_show, s=area, c=colors, alpha=0.5)
plt.title('Views_show vs day')
plt.xlabel('day')
plt.ylabel('Views_show')
plt.show()

In [328]:

# plot for days vs Views_show and days vs Ad_impression on twin y-axes
fig = plt.figure()
host = fig.add_subplot(111)
par1 = host.twinx()
host.set_xlabel("Day")
host.set_ylabel("Views_show")
par1.set_ylabel("Ad_impression")
color1 = plt.cm.viridis(0)
color2 = plt.cm.viridis(0.5)
p1, = host.plot(media.day, media.Views_show, color=color1, label="Views_show")
p2, = par1.plot(media.day, media.Ad_impression, color=color2, label="Ad_impression")
host.legend(handles=[p1, p2], loc='best')
# Colour each axis label to match its line
host.yaxis.label.set_color(p1.get_color())
par1.yaxis.label.set_color(p2.get_color())
plt.savefig("pyplot_multiple_y-axis.png", bbox_inches='tight')

In [329]:

# Derived Metrics
# Weekdays are taken such that 1 corresponds to Sunday and 7 to Saturday
# Generate the weekday variable
media['weekday'] = (media['day']+3)%7
media.weekday.replace(0,7, inplace=True)
media['weekday'] = media['weekday'].astype(int)
media.head()

Out[329]:
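The `(day + 3) % 7` trick is easy to get wrong, so here is a quick standalone check of the mapping, assuming (as the comment above states) that 1 means Sunday and 7 means Saturday:

```python
# Day 1 of the show maps to weekday 4 (Wednesday, since the show launched on
# 2017-03-01), and the 0 produced by day 4 is replaced with 7 (Saturday),
# matching the notebook's convention.
days = list(range(1, 8))
weekday = [((d + 3) % 7) or 7 for d in days]
print(weekday)  # [4, 5, 6, 7, 1, 2, 3]
```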

In [330]:

# Putting feature variable to X
X = media[['Visitors','weekday']]
# Putting response variable to y
y = media['Views_show']

In [331]:

from sklearn.linear_model import LinearRegression

In [332]:

# Creating a LinearRegression object, lm
lm = LinearRegression()

In [333]:

# fit the model to the training data
lm.fit(X,y)

Out[333]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
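Once fitted, the sklearn object exposes the learned parameters via `intercept_` and `coef_`. A tiny standalone sketch on made-up data (not the media dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 exactly
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

lm = LinearRegression().fit(X, y)
print(lm.intercept_, lm.coef_[0])  # 1.0 2.0 (up to floating-point error)
```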

In [334]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_1 = sm.OLS(y,X).fit()
print(lm_1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.485
Model: OLS Adj. R-squared: 0.472
Method: Least Squares F-statistic: 36.26
Date: Fri, 09 Mar 2018 Prob (F-statistic): 8.01e-12
Time: 10:27:35 Log-Likelihood: -1042.5
No. Observations: 80 AIC: 2091.
Df Residuals: 77 BIC: 2098.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -3.862e+04 1.07e+05 -0.360 0.720 -2.52e+05 1.75e+05
Visitors 0.2787 0.057 4.911 0.000 0.166 0.392
weekday -3.591e+04 6591.205 -5.448 0.000 -4.9e+04 -2.28e+04
==============================================================================
Omnibus: 2.684 Durbin-Watson: 0.650
Prob(Omnibus): 0.261 Jarque-Bera (JB): 2.653
Skew: 0.423 Prob(JB): 0.265
Kurtosis: 2.718 Cond. No. 1.46e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.46e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [335]:

# Create weekend variable, with value 1 on weekends and 0 on weekdays
# (days where day % 7 is 4 or 5 fall on Saturday/Sunday)
def cond(i):
    if i % 7 == 5 or i % 7 == 4:
        return 1
    return 0

media['weekend'] = [cond(i) for i in media['day']]
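A hypothetical alternative, not used in the notebook, is to derive the flag straight from the `Date` column, assuming Saturday and Sunday count as the weekend. Standalone sketch:

```python
import pandas as pd

# Fri, Sat, Sun, Mon around the show's run
dates = pd.Series(pd.to_datetime(['2017-03-03', '2017-03-04',
                                  '2017-03-05', '2017-03-06']))
# dayofweek: Monday=0 ... Saturday=5, Sunday=6
weekend = (dates.dt.dayofweek >= 5).astype(int)
print(weekend.tolist())  # [0, 1, 1, 0]
```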

In [336]:

media.head()

Out[336]:

In [337]:

# Putting feature variable to X
X = media[['Visitors','weekend']]
# Putting response variable to y
y = media['Views_show']

In [338]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_2 = sm.OLS(y,X).fit()
print(lm_2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.500
Model: OLS Adj. R-squared: 0.487
Method: Least Squares F-statistic: 38.55
Date: Fri, 09 Mar 2018 Prob (F-statistic): 2.51e-12
Time: 10:27:35 Log-Likelihood: -1041.3
No. Observations: 80 AIC: 2089.
Df Residuals: 77 BIC: 2096.
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -8.833e+04 1.01e+05 -0.875 0.384 -2.89e+05 1.13e+05
Visitors 0.1934 0.061 3.160 0.002 0.071 0.315
weekend 1.807e+05 3.15e+04 5.740 0.000 1.18e+05 2.43e+05
==============================================================================
Omnibus: 1.302 Durbin-Watson: 1.254
Prob(Omnibus): 0.521 Jarque-Bera (JB): 1.367
Skew: 0.270 Prob(JB): 0.505
Kurtosis: 2.656 Cond. No. 1.41e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.41e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [339]:

# Putting feature variable to X
X = media[['Visitors','weekend','Character_A']]
# Putting response variable to y
y = media['Views_show']

In [340]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_3 = sm.OLS(y,X).fit()
print(lm_3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.586
Model: OLS Adj. R-squared: 0.570
Method: Least Squares F-statistic: 35.84
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.53e-14
Time: 10:27:35 Log-Likelihood: -1033.8
No. Observations: 80 AIC: 2076.
Df Residuals: 76 BIC: 2085.
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const -4.722e+04 9.31e+04 -0.507 0.613 -2.33e+05 1.38e+05
Visitors 0.1480 0.057 2.586 0.012 0.034 0.262
weekend 1.812e+05 2.89e+04 6.281 0.000 1.24e+05 2.39e+05
Character_A 9.542e+04 2.41e+04 3.963 0.000 4.75e+04 1.43e+05
==============================================================================
Omnibus: 0.908 Durbin-Watson: 1.600
Prob(Omnibus): 0.635 Jarque-Bera (JB): 0.876
Skew: -0.009 Prob(JB): 0.645
Kurtosis: 2.488 Cond. No. 1.42e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [341]:

# Create lag variable: the previous day's views
media['Lag_Views'] = np.roll(media['Views_show'], 1)
# np.roll wraps around, so the first row received the last day's value (108961);
# reset it to 0 since day 1 has no previous day
media.Lag_Views.replace(108961, 0, inplace=True)
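A pandas-native alternative worth knowing: `Series.shift(1)` produces the same lag without `np.roll`'s wrap-around, so no hard-coded value needs to be cleaned up afterwards. Standalone sketch on a toy series:

```python
import pandas as pd

views = pd.Series([100, 120, 90, 150])
lag = views.shift(1).fillna(0).astype(int)  # first day has no previous day
print(lag.tolist())  # [0, 100, 120, 90]
```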

In [342]:

media.head()

Out[342]:

In [343]:

# Putting feature variable to X
X = media[['Visitors','Character_A','Lag_Views','weekend']]
# Putting response variable to y
y = media['Views_show']

In [344]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_4 = sm.OLS(y,X).fit()
print(lm_4.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.740
Model: OLS Adj. R-squared: 0.726
Method: Least Squares F-statistic: 53.46
Date: Fri, 09 Mar 2018 Prob (F-statistic): 3.16e-21
Time: 10:27:36 Log-Likelihood: -1015.1
No. Observations: 80 AIC: 2040.
Df Residuals: 75 BIC: 2052.
Df Model: 4
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const -2.98e+04 7.43e+04 -0.401 0.689 -1.78e+05 1.18e+05
Visitors 0.0659 0.047 1.394 0.167 -0.028 0.160
Character_A 5.527e+04 2.01e+04 2.748 0.008 1.52e+04 9.53e+04
Lag_Views 0.4317 0.065 6.679 0.000 0.303 0.560
weekend 2.273e+05 2.4e+04 9.467 0.000 1.79e+05 2.75e+05
==============================================================================
Omnibus: 1.425 Durbin-Watson: 2.626
Prob(Omnibus): 0.491 Jarque-Bera (JB): 0.821
Skew: -0.130 Prob(JB): 0.663
Kurtosis: 3.423 Cond. No. 1.44e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.44e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [345]:

plt.figure(figsize = (20,10))        # Size of the figure
sns.heatmap(media.corr(),annot = True)

Out[345]:


In [346]:

# Putting feature variable to X
X = media[['weekend','Character_A','Views_platform']]
# Putting response variable to y
y = media['Views_show']

In [347]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_5 = sm.OLS(y,X).fit()
print(lm_5.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.602
Model: OLS Adj. R-squared: 0.586
Method: Least Squares F-statistic: 38.24
Date: Fri, 09 Mar 2018 Prob (F-statistic): 3.59e-15
Time: 10:27:37 Log-Likelihood: -1032.3
No. Observations: 80 AIC: 2073.
Df Residuals: 76 BIC: 2082.
Df Model: 3
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
const -1.205e+05 9.97e+04 -1.208 0.231 -3.19e+05 7.81e+04
weekend 1.781e+05 2.78e+04 6.410 0.000 1.23e+05 2.33e+05
Character_A 7.062e+04 2.6e+04 2.717 0.008 1.89e+04 1.22e+05
Views_platform 0.1507 0.048 3.152 0.002 0.055 0.246
==============================================================================
Omnibus: 4.279 Durbin-Watson: 1.516
Prob(Omnibus): 0.118 Jarque-Bera (JB): 2.153
Skew: 0.061 Prob(JB): 0.341
Kurtosis: 2.206 Cond. No. 2.03e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.03e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [348]:

# Putting feature variable to X
X = media[['weekend','Character_A','Visitors']]
# Putting response variable to y
y = media['Views_show']

In [349]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_6 = sm.OLS(y,X).fit()
print(lm_6.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.586
Model: OLS Adj. R-squared: 0.570
Method: Least Squares F-statistic: 35.84
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.53e-14
Time: 10:27:37 Log-Likelihood: -1033.8
No. Observations: 80 AIC: 2076.
Df Residuals: 76 BIC: 2085.
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const -4.722e+04 9.31e+04 -0.507 0.613 -2.33e+05 1.38e+05
weekend 1.812e+05 2.89e+04 6.281 0.000 1.24e+05 2.39e+05
Character_A 9.542e+04 2.41e+04 3.963 0.000 4.75e+04 1.43e+05
Visitors 0.1480 0.057 2.586 0.012 0.034 0.262
==============================================================================
Omnibus: 0.908 Durbin-Watson: 1.600
Prob(Omnibus): 0.635 Jarque-Bera (JB): 0.876
Skew: -0.009 Prob(JB): 0.645
Kurtosis: 2.488 Cond. No. 1.42e+07
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.42e+07. This might indicate that there are
strong multicollinearity or other numerical problems.

In [350]:

# Putting feature variable to X
X = media[['weekend','Character_A','Visitors','Ad_impression']]
# Putting response variable to y
y = media['Views_show']

In [351]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_7 = sm.OLS(y,X).fit()
print(lm_7.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.803
Model: OLS Adj. R-squared: 0.792
Method: Least Squares F-statistic: 76.40
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.10e-25
Time: 10:27:38 Log-Likelihood: -1004.1
No. Observations: 80 AIC: 2018.
Df Residuals: 75 BIC: 2030.
Df Model: 4
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const -2.834e+05 6.97e+04 -4.067 0.000 -4.22e+05 -1.45e+05
weekend 1.485e+05 2.04e+04 7.296 0.000 1.08e+05 1.89e+05
Character_A -2.934e+04 2.16e+04 -1.356 0.179 -7.24e+04 1.38e+04
Visitors 0.0144 0.042 0.340 0.735 -0.070 0.099
Ad_impression 0.0004 3.96e-05 9.090 0.000 0.000 0.000
==============================================================================
Omnibus: 4.808 Durbin-Watson: 1.166
Prob(Omnibus): 0.090 Jarque-Bera (JB): 4.007
Skew: 0.476 Prob(JB): 0.135
Kurtosis: 3.545 Cond. No. 1.32e+10
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.32e+10. This might indicate that there are
strong multicollinearity or other numerical problems.

In [352]:

# Putting feature variable to X
X = media[['weekend','Character_A','Ad_impression']]
# Putting response variable to y
y = media['Views_show']

In [353]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_8 = sm.OLS(y,X).fit()
print(lm_8.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.803
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 103.0
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.05e-26
Time: 10:27:38 Log-Likelihood: -1004.2
No. Observations: 80 AIC: 2016.
Df Residuals: 76 BIC: 2026.
Df Model: 3
Covariance Type: nonrobust
=================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------
const -2.661e+05 4.74e+04 -5.609 0.000 -3.61e+05 -1.72e+05
weekend 1.51e+05 1.88e+04 8.019 0.000 1.14e+05 1.89e+05
Character_A -2.99e+04 2.14e+04 -1.394 0.167 -7.26e+04 1.28e+04
Ad_impression 0.0004 3.69e-05 9.875 0.000 0.000 0.000
==============================================================================
Omnibus: 4.723 Durbin-Watson: 1.169
Prob(Omnibus): 0.094 Jarque-Bera (JB): 3.939
Skew: 0.453 Prob(JB): 0.139
Kurtosis: 3.601 Cond. No. 9.26e+09
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.26e+09. This might indicate that there are
strong multicollinearity or other numerical problems.

In [354]:

#Ad impression in million
media['ad_impression_million'] = media['Ad_impression']/1000000
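The unit change does more than tidy the coefficients: putting features on comparable scales shrinks the design matrix's condition number, which is why it drops from 9.26e+09 in lm_8 to 9.26e+03 in lm_10 below. A quick numpy check on a hypothetical impressions-sized column:

```python
import numpy as np

x = np.linspace(1e6, 8e7, 50)                       # raw impression counts
X_raw = np.column_stack([np.ones(50), x])
X_scaled = np.column_stack([np.ones(50), x / 1e6])  # in millions

print(np.linalg.cond(X_raw) > np.linalg.cond(X_scaled))  # True
```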

In [355]:

# Putting feature variable to X
X = media[['weekend','Character_A','ad_impression_million','Cricket_match_india']]
# Putting response variable to y
y = media['Views_show']

In [356]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_9 = sm.OLS(y,X).fit()
print(lm_9.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.803
Model: OLS Adj. R-squared: 0.793
Method: Least Squares F-statistic: 76.59
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.02e-25
Time: 10:27:39 Log-Likelihood: -1004.0
No. Observations: 80 AIC: 2018.
Df Residuals: 75 BIC: 2030.
Df Model: 4
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -2.633e+05 4.8e+04 -5.484 0.000 -3.59e+05 -1.68e+05
weekend 1.521e+05 1.9e+04 7.987 0.000 1.14e+05 1.9e+05
Character_A -3.196e+04 2.19e+04 -1.457 0.149 -7.57e+04 1.17e+04
ad_impression_million 363.7938 37.113 9.802 0.000 289.861 437.727
Cricket_match_india -1.396e+04 2.74e+04 -0.510 0.612 -6.85e+04 4.06e+04
==============================================================================
Omnibus: 5.270 Durbin-Watson: 1.161
Prob(Omnibus): 0.072 Jarque-Bera (JB): 4.560
Skew: 0.468 Prob(JB): 0.102
Kurtosis: 3.701 Cond. No. 9.32e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.32e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [357]:

# Putting feature variable to X
X = media[['weekend','Character_A','ad_impression_million']]
# Putting response variable to y
y = media['Views_show']

In [358]:

import statsmodels.api as sm
# Unlike sklearn, statsmodels doesn't automatically fit a constant,
# so use sm.add_constant(X) to add one.
X = sm.add_constant(X)
# create a fitted model in one line
lm_10 = sm.OLS(y,X).fit()
print(lm_10.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Views_show R-squared: 0.803
Model: OLS Adj. R-squared: 0.795
Method: Least Squares F-statistic: 103.0
Date: Fri, 09 Mar 2018 Prob (F-statistic): 1.05e-26
Time: 10:27:39 Log-Likelihood: -1004.2
No. Observations: 80 AIC: 2016.
Df Residuals: 76 BIC: 2026.
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
const -2.661e+05 4.74e+04 -5.609 0.000 -3.61e+05 -1.72e+05
weekend 1.51e+05 1.88e+04 8.019 0.000 1.14e+05 1.89e+05
Character_A -2.99e+04 2.14e+04 -1.394 0.167 -7.26e+04 1.28e+04
ad_impression_million 364.4670 36.909 9.875 0.000 290.957 437.977
==============================================================================
Omnibus: 4.723 Durbin-Watson: 1.169
Prob(Omnibus): 0.094 Jarque-Bera (JB): 3.939
Skew: 0.453 Prob(JB): 0.139
Kurtosis: 3.601 Cond. No. 9.26e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.26e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In [359]:

# Making predictions using the model
X = media[['weekend','Character_A','ad_impression_million']]
X = sm.add_constant(X)
Predicted_views = lm_10.predict(X)

In [360]:

from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(media.Views_show, Predicted_views)
r_squared = r2_score(media.Views_show, Predicted_views)

In [361]:

print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 4677651616.25
r_square_value : 0.802643446858
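An MSE in the billions is hard to interpret on its own; its square root (RMSE) is in the same units as Views_show. A quick standalone computation using the value printed above:

```python
import numpy as np

mse = 4677651616.25        # Mean_Squared_Error printed above
rmse = np.sqrt(mse)
print(round(rmse))         # roughly 68,000 views of typical prediction error
```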

In [362]:

#Actual vs Predicted
c = [i for i in range(1,81,1)]
fig = plt.figure()
plt.plot(c,media.Views_show, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,Predicted_views, color="red", linewidth=2.5, linestyle="-")
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Views', fontsize=16) # Y-label

Out[362]:


In [363]:

# Error terms
c = [i for i in range(1,81,1)]
fig = plt.figure()
plt.plot(c,media.Views_show-Predicted_views, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Views_show-Predicted_views', fontsize=16) # Y-label

Out[363]:


In [364]:

# Making predictions using the model
X = media[['weekend','Character_A','Visitors']]
X = sm.add_constant(X)
Predicted_views = lm_6.predict(X)

In [365]:

from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(media.Views_show, Predicted_views)
r_squared = r2_score(media.Views_show, Predicted_views)

In [366]:

print('Mean_Squared_Error :' ,mse)
print('r_square_value :',r_squared)
Mean_Squared_Error : 9815432480.45
r_square_value : 0.585873408098

In [367]:

#Actual vs Predicted
c = [i for i in range(1,81,1)]
fig = plt.figure()
plt.plot(c,media.Views_show, color="blue", linewidth=2.5, linestyle="-")
plt.plot(c,Predicted_views, color="red", linewidth=2.5, linestyle="-")
fig.suptitle('Actual and Predicted', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Views', fontsize=16) # Y-label

Out[367]:


In [368]:

# Error terms
c = [i for i in range(1,81,1)]
fig = plt.figure()
plt.plot(c,media.Views_show-Predicted_views, color="blue", linewidth=2.5, linestyle="-")
fig.suptitle('Error Terms', fontsize=20) # Plot heading
plt.xlabel('Index', fontsize=18) # X-label
plt.ylabel('Views_show-Predicted_views', fontsize=16) # Y-label

Out[368]:


Hope this was helpful.

I will update a link to code and dataset shortly.

Reach me on Maddy Anand or visit www.maddyanand.com.

I write about Books, Climate, Air pollution, Research, Data Science, Entrepreneurship, Startups, Tech, Learning & Career
