# For this analysis, we will need the following libraries and modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()  # in newer seaborn versions, sns.set_theme() is the preferred equivalent
# Load the data from a .csv in the same folder
raw_data = pd.read_csv('1.04. Real-life example.csv')
# Explore the top 5 rows of the df
raw_data.head()
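Before going further, it can also help to check the dimensions and the type of each column; a quick sketch (our addition, not part of the original walkthrough):
# Check the number of observations/features and the dtype of each column
print(raw_data.shape)
raw_data.dtypes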
We would like to predict the price of a used car based on its specifications. Potential regressors in this data are:
1) Brand: A BMW is generally more expensive than a Toyota.
2) Mileage: The more a car has been driven, the cheaper it should be.
3) Engine Volume: Sports cars have larger engines than economy cars.
4) Year of Production: The older the car, the cheaper it is, with the exception of vintage vehicles.
The rest are categorical variables, which we will deal with on a case-by-case basis.
# By default, only descriptives for the numerical variables would be shown
# To include the categorical ones, we should specify this with an argument
raw_data.describe(include='all')
# We could drop a column from the dataset like this (e.g. 'Registration'),
# but for now we keep all features
#data = raw_data.drop(['Registration'],axis=1)
#data.describe(include='all')
data = raw_data
# data.isnull() # shows a df with the information whether a data point is null
# Since True = the data point is missing, while False = the data point is not missing, we can sum them
# This will give us the total number of missing values feature-wise
data.isnull().sum()
# Drop all missing values
# This is not always recommended; however, removing less than 5% of the observations is generally acceptable
data_no_mv = data.dropna(axis=0)
# Check the descriptives without the missing values
data_no_mv.describe(include='all')
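As a sanity check on the 5% rule of thumb, we can compute the share of observations we just dropped; a minimal sketch:
# Share of rows removed by dropna() relative to the original data
removed_share = 1 - len(data_no_mv) / len(data)
print(f"Share of observations removed: {removed_share:.2%}")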
Notice the min and max values with respect to the mean and the quartiles for each variable. Namely, the max values for Price, Mileage, and EngineV are very high, and the min value for Year is very low compared to the rest of the values.
# Plot the distribution of 'Price'
# Note: sns.distplot is deprecated (and removed in recent seaborn releases);
# sns.histplot(data_no_mv['Price'], kde=True) is the modern equivalent
sns.distplot(data_no_mv['Price'])
# Declare a variable that will be equal to the 98th percentile of the 'Price' variable
q = data_no_mv['Price'].quantile(0.98)
# Then we can create a new df with the condition that all prices must be below the 98th percentile of 'Price'
data_1 = data_no_mv[data_no_mv['Price']<q]
# In this way we have essentially removed the top 2% of observations with respect to 'Price'
data_1.describe(include='all')
# We can check the PDF once again to ensure that the result is still distributed in the same way overall
sns.distplot(data_1['Price'])
# We can treat the other numerical variables in a similar way
sns.distplot(data_no_mv['Mileage'])
q = data_1['Mileage'].quantile(0.99)
data_2 = data_1[data_1['Mileage']<q]
sns.distplot(data_2['Mileage'])
sns.distplot(data_no_mv['EngineV'])
# For 'EngineV' a fixed cutoff works better than a percentile:
# values above 6.5 appear to be erroneous or placeholder entries rather than real engine volumes
data_3 = data_2[data_2['EngineV']<6.5]
sns.distplot(data_3['EngineV'])
sns.distplot(data_no_mv['Year'])
# Remove the lowest 1% of 'Year': vintage cars are not representative of the market we are modeling
q = data_3['Year'].quantile(0.01)
data_4 = data_3[data_3['Year']>q]
sns.distplot(data_4['Year'])
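Since the same trimming pattern repeats for each variable, it could also be factored into a small helper; a sketch (the name trim_quantile is ours, purely illustrative):
# Hypothetical helper: keep rows whose column value lies strictly inside the given quantiles
def trim_quantile(df, column, lower=None, upper=None):
    mask = pd.Series(True, index=df.index)
    if upper is not None:
        mask &= df[column] < df[column].quantile(upper)
    if lower is not None:
        mask &= df[column] > df[column].quantile(lower)
    return df[mask]
# Usage, e.g.: data_2 = trim_quantile(data_1, 'Mileage', upper=0.99)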
# Reset the index and drop the column containing the old index
data_cleaned = data_4.reset_index(drop=True)
# Let's see what's left
data_cleaned.describe(include='all')
# We could simply use plt.scatter() for each of Year, EngineV, and Mileage
# But since Price is on the 'y' axis of all three plots, it makes sense to plot them side by side (so we can compare them)
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3)) #sharey -> share 'Price' as y
ax1.scatter(data_cleaned['Year'],data_cleaned['Price'])
ax1.set_title('Price and Year')
ax2.scatter(data_cleaned['EngineV'],data_cleaned['Price'])
ax2.set_title('Price and EngineV')
ax3.scatter(data_cleaned['Mileage'],data_cleaned['Price'])
ax3.set_title('Price and Mileage')
plt.show()
sns.distplot(data_cleaned['Price'])
# Transform 'Price' with a log transformation
log_price = np.log(data_cleaned['Price'])
# Then add it to our data frame
data_cleaned['log_price'] = log_price
data_cleaned
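To quantify what the transformation buys us, we can compare the skewness of the raw and log prices (a symmetric distribution has skewness near 0); a quick sketch:
# Skewness before and after the log transformation
print('Price skew:', round(data_cleaned['Price'].skew(), 2))
print('log_price skew:', round(data_cleaned['log_price'].skew(), 2))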
# Let's check the three scatters once again
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['Year'],data_cleaned['log_price'])
ax1.set_title('Log Price and Year')
ax2.scatter(data_cleaned['EngineV'],data_cleaned['log_price'])
ax2.set_title('Log Price and EngineV')
ax3.scatter(data_cleaned['Mileage'],data_cleaned['log_price'])
ax3.set_title('Log Price and Mileage')
plt.show()
# Since we will be using the log price variable, we can drop the old 'Price' one
data_cleaned = data_cleaned.drop(['Price'],axis=1)
Now we will deal with the rest of the regression assumptions, starting with multicollinearity:
# Let's quickly see the columns of our data frame
data_cleaned.columns.values
from statsmodels.stats.outliers_influence import variance_inflation_factor
# To make this as easy as possible, we declare a variable containing all the features
# we want to check for multicollinearity
# Since our categorical data is not yet preprocessed, we will only take the numerical variables
variables = data_cleaned[['Mileage','Year','EngineV']]
# we create a new data frame which will include all the VIFs
# note that each variable has its own variance inflation factor as this measure is variable specific (not model specific)
vif = pd.DataFrame()
# here we make use of the variance_inflation_factor, which will basically output the respective VIFs
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
# Finally, I like to include names so it is easier to explore the result
vif["Features"] = variables.columns
# Let's explore the result
vif
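One aside (ours, not part of the original walkthrough): variance_inflation_factor uses the design matrix exactly as given, so without a constant column the auxiliary regressions run without an intercept and the VIFs tend to come out inflated. A sketch of the variant with an intercept:
# Add a constant column before computing the VIFs (the 'const' row itself can be ignored)
variables_const = sm.add_constant(variables)
vif_const = pd.DataFrame()
vif_const['VIF'] = [variance_inflation_factor(variables_const.values, i)
                    for i in range(variables_const.shape[1])]
vif_const['Features'] = variables_const.columns
vif_const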
# 'Year' has the highest VIF here, so we drop it to ease the multicollinearity problem
data_no_multicollinearity = data_cleaned.drop(['Year'],axis=1)
# Create dummies for all categorical variables; drop_first avoids the dummy variable trap
data_with_dummies = pd.get_dummies(data_no_multicollinearity, drop_first=True)
# Check the result
data_with_dummies.head()
# Display all column names, so we can conveniently reorder them below
data_with_dummies.columns.values
cols = ['log_price', 'Mileage', 'EngineV', 'Brand_BMW',
'Brand_Mercedes-Benz', 'Brand_Mitsubishi', 'Brand_Renault',
'Brand_Toyota', 'Brand_Volkswagen', 'Body_hatch', 'Body_other',
'Body_sedan', 'Body_vagon', 'Body_van', 'Engine Type_Gas',
'Engine Type_Other', 'Engine Type_Petrol', 'Registration_yes', 'Model_100', 'Model_11', 'Model_116', 'Model_118', 'Model_120',
'Model_19', 'Model_190', 'Model_200', 'Model_210', 'Model_220',
'Model_230', 'Model_25', 'Model_250', 'Model_300', 'Model_316',
'Model_318', 'Model_320', 'Model_323', 'Model_325', 'Model_328',
'Model_330', 'Model_335', 'Model_4 Series Gran Coupe', 'Model_428',
'Model_4Runner', 'Model_5 Series', 'Model_5 Series GT',
'Model_520', 'Model_523', 'Model_524', 'Model_525', 'Model_528',
'Model_530', 'Model_535', 'Model_540', 'Model_545', 'Model_550',
'Model_6 Series Gran Coupe', 'Model_630', 'Model_640', 'Model_645',
'Model_650', 'Model_730', 'Model_735', 'Model_740', 'Model_745',
'Model_750', 'Model_760', 'Model_80', 'Model_9', 'Model_90',
'Model_A 140', 'Model_A 150', 'Model_A 170', 'Model_A 180',
'Model_A1', 'Model_A3', 'Model_A4', 'Model_A4 Allroad', 'Model_A5',
'Model_A6', 'Model_A6 Allroad', 'Model_A7', 'Model_A8',
'Model_ASX', 'Model_Amarok', 'Model_Auris', 'Model_Avalon',
'Model_Avensis', 'Model_Aygo', 'Model_B 170', 'Model_B 180',
'Model_B 200', 'Model_Beetle', 'Model_Bora', 'Model_C-Class',
'Model_CL 180', 'Model_CL 500', 'Model_CL 55 AMG', 'Model_CL 550',
'Model_CL 63 AMG', 'Model_CLA 200', 'Model_CLA 220',
'Model_CLA-Class', 'Model_CLC 180', 'Model_CLC 200',
'Model_CLK 200', 'Model_CLK 220', 'Model_CLK 230', 'Model_CLK 240',
'Model_CLK 280', 'Model_CLK 320', 'Model_CLK 430', 'Model_CLS 350',
'Model_CLS 400', 'Model_CLS 500', 'Model_CLS 63 AMG',
'Model_Caddy', 'Model_Camry', 'Model_Captur', 'Model_Caravelle',
'Model_Carina', 'Model_Carisma', 'Model_Celica', 'Model_Clio',
'Model_Colt', 'Model_Corolla', 'Model_Corolla Verso',
'Model_Cross Touran', 'Model_Dokker', 'Model_Duster',
'Model_E-Class', 'Model_Eclipse', 'Model_Eos', 'Model_Espace',
'Model_FJ Cruiser', 'Model_Fluence', 'Model_Fortuner',
'Model_G 320', 'Model_G 500', 'Model_G 55 AMG', 'Model_G 63 AMG',
'Model_GL 320', 'Model_GL 350', 'Model_GL 420', 'Model_GL 450',
'Model_GL 500', 'Model_GL 550', 'Model_GLC-Class',
'Model_GLE-Class', 'Model_GLK 220', 'Model_GLK 300',
'Model_GLS 350', 'Model_Galant', 'Model_Golf GTI', 'Model_Golf II',
'Model_Golf III', 'Model_Golf IV', 'Model_Golf Plus',
'Model_Golf V', 'Model_Golf VI', 'Model_Golf VII',
'Model_Golf Variant', 'Model_Grand Scenic', 'Model_Grandis',
'Model_Hiace', 'Model_Highlander', 'Model_Hilux', 'Model_I3',
'Model_IQ', 'Model_Jetta', 'Model_Kangoo', 'Model_Koleos',
'Model_L 200', 'Model_LT', 'Model_Laguna', 'Model_Lancer',
'Model_Lancer Evolution', 'Model_Lancer X',
'Model_Lancer X Sportback', 'Model_Land Cruiser 100',
'Model_Land Cruiser 105', 'Model_Land Cruiser 200',
'Model_Land Cruiser 76', 'Model_Land Cruiser 80',
'Model_Land Cruiser Prado', 'Model_Latitude', 'Model_Logan',
'Model_Lupo', 'Model_M5', 'Model_M6', 'Model_MB', 'Model_ML 250',
'Model_ML 270', 'Model_ML 280', 'Model_ML 320', 'Model_ML 350',
'Model_ML 400', 'Model_ML 430', 'Model_ML 500', 'Model_ML 550',
'Model_ML 63 AMG', 'Model_Master', 'Model_Matrix', 'Model_Megane',
'Model_Modus', 'Model_Multivan', 'Model_New Beetle',
'Model_Outlander', 'Model_Outlander XL', 'Model_Pajero',
'Model_Pajero Pinin', 'Model_Pajero Sport', 'Model_Pajero Wagon',
'Model_Passat B3', 'Model_Passat B4', 'Model_Passat B5',
'Model_Passat B6', 'Model_Passat B7', 'Model_Passat B8',
'Model_Passat CC', 'Model_Phaeton', 'Model_Pointer', 'Model_Polo',
'Model_Previa', 'Model_Prius', 'Model_Q3', 'Model_Q5', 'Model_Q7',
'Model_R 320', 'Model_R8', 'Model_Rav 4', 'Model_S 140',
'Model_S 250', 'Model_S 300', 'Model_S 320', 'Model_S 350',
'Model_S 400', 'Model_S 430', 'Model_S 500', 'Model_S 550',
'Model_S 600', 'Model_S 63 AMG', 'Model_S 65 AMG', 'Model_S4',
'Model_S5', 'Model_S8', 'Model_SL 500 (550)', 'Model_SL 55 AMG',
'Model_SLK 200', 'Model_SLK 350', 'Model_Sandero',
'Model_Sandero StepWay', 'Model_Scenic', 'Model_Scion',
'Model_Scirocco', 'Model_Sequoia', 'Model_Sharan', 'Model_Sienna',
'Model_Smart', 'Model_Space Star', 'Model_Space Wagon',
'Model_Sprinter 208', 'Model_Sprinter 210', 'Model_Sprinter 211',
'Model_Sprinter 212', 'Model_Sprinter 213', 'Model_Sprinter 311',
'Model_Sprinter 312', 'Model_Sprinter 313', 'Model_Sprinter 315',
'Model_Sprinter 316', 'Model_Sprinter 318', 'Model_Sprinter 319',
'Model_Symbol', 'Model_Syncro', 'Model_T3 (Transporter)',
'Model_T4 (Transporter)', 'Model_T4 (Transporter) ',
'Model_T5 (Transporter)', 'Model_T5 (Transporter) ',
'Model_T6 (Transporter)', 'Model_T6 (Transporter) ', 'Model_TT',
'Model_Tacoma', 'Model_Tiguan', 'Model_Touareg', 'Model_Touran',
'Model_Trafic', 'Model_Tundra', 'Model_Up', 'Model_V 250',
'Model_Vaneo', 'Model_Vento', 'Model_Venza', 'Model_Viano',
'Model_Virage', 'Model_Vista', 'Model_Vito', 'Model_X1',
'Model_X3', 'Model_X5', 'Model_X5 M', 'Model_X6', 'Model_X6 M',
'Model_Yaris', 'Model_Z3', 'Model_Z4']
# To implement the reordering, we will create a new df, which is equal to the old one but with the new order of features
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()
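Hard-coding 200+ column names is brittle; a programmatic alternative (a sketch, assuming get_dummies kept the numerical columns first and appended the dummies in order) would be:
# Put 'log_price' first and keep the remaining columns in their existing order
cols_auto = ['log_price'] + [c for c in data_with_dummies.columns if c != 'log_price']
# data_preprocessed = data_with_dummies[cols_auto]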
# The target (dependent variable) is 'log_price'
targets = data_preprocessed['log_price']
# The inputs are everything BUT the dependent variable, so we can simply drop it
inputs = data_preprocessed.drop(['log_price'],axis=1)
# Import the scaling module
from sklearn.preprocessing import StandardScaler
# Create a scaler object
scaler = StandardScaler()
# Fit the inputs (calculate the mean and standard deviation feature-wise)
scaler.fit(inputs)
# Scale the features and store them in a new variable (the actual scaling procedure)
inputs_scaled = scaler.transform(inputs)
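As a side note (ours): the fit and transform steps can be collapsed into one call, and standardizing the dummies is a debatable choice; it does not change the predictions of an unregularized linear regression, but the dummies lose their clean 0/1 interpretation:
# Equivalent one-liner for the two steps above
# inputs_scaled = scaler.fit_transform(inputs)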
# Import the module for the split
from sklearn.model_selection import train_test_split
# Split the variables with an 80-20 split and some random state
# To have the same split each time, use random_state = 365
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=365)
# Create a linear regression object
reg = LinearRegression()
# Fit the regression with the scaled TRAIN inputs and targets
reg.fit(x_train,y_train)
# Let's check the outputs of the regression
# I'll store them in y_hat as this is the 'theoretical' name of the predictions
y_hat = reg.predict(x_train)
# The simplest way to compare the targets (y_train) and the predictions (y_hat) is to plot them on a scatter plot
# The closer the points to the 45-degree line, the better the prediction
plt.scatter(y_train, y_hat)
# Let's also name the axes
plt.xlabel('Targets (y_train)',size=18)
plt.ylabel('Predictions (y_hat)',size=18)
# We need to make sure the scales of the x-axis and the y-axis are the same
# Otherwise we wouldn't be able to interpret the '45-degree line'
plt.xlim(6,13)
plt.ylim(6,13)
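# (Sketch, our addition) The 45-degree reference line itself could be drawn with:
# plt.plot([6, 13], [6, 13], c='red', lw=1)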
plt.show()
# Another useful check of our model is a residual plot
# Residuals = targets - predictions
# We can plot the PDF of the residuals and check for anomalies
sns.distplot(y_train - y_hat)
# Include a title
plt.title("Residuals PDF", size=18)
# Find the R-squared of the model
r2 = reg.score(x_train,y_train)
r2
# Number of observations is the shape along axis 0
n = x_train.shape[0]
# Number of features (predictors, p) is the shape along axis 1
p = x_train.shape[1]
# We find the Adjusted R-squared using the formula
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
from sklearn.feature_selection import f_regression
f_regression(x_train, y_train)
p_values = f_regression(x_train, y_train)[1].round(3)
p_values
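Keep in mind that f_regression fits a separate simple regression per feature, so these p-values are univariate and do not come from our multivariate model. For model-based p-values, one option (a sketch using the already-imported statsmodels) is:
# Fit the same regression with statsmodels to get per-coefficient p-values
# ols_results = sm.OLS(y_train, sm.add_constant(x_train)).fit()
# ols_results.summary()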
# Obtain the bias (intercept) of the regression
reg.intercept_
# Obtain the weights (coefficients) of the regression
reg.coef_
# Create a regression summary where we can compare them with one-another
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary['p-values'] = p_values
reg_summary
# Check the different categories in the 'Brand' variable
data_cleaned['Brand'].unique()
# Check the different categories in the 'Body' variable
data_cleaned['Body'].unique()
# Check the different categories in the 'Engine Type' variable
data_cleaned['Engine Type'].unique()
# Find the predicted values using inputs from the test data
y_hat_test = reg.predict(x_test)
# Create a scatter plot with the test targets and the test predictions
# Including the argument 'alpha' introduces proportional opacity to the graph
# The more saturated the color, the higher the concentration of points (like a heat map)
plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets (y_test)',size=18)
plt.ylabel('Predictions (y_hat_test)',size=18)
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()
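To quantify the out-of-sample performance beyond the eyeball test, we can also compute the test R-squared; a one-line sketch:
# R-squared on the test set (compare with the train R-squared from above)
reg.score(x_test, y_test)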
# Finally, let's manually check these predictions
# To obtain the actual prices, we take the exponential of the log_price
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf.head()
# We can also include the test targets in that data frame (so we can manually compare them)
df_pf['Target'] = np.exp(y_test)
df_pf
# Note that 'Target' is now mostly NaNs: after the split, y_test kept its original
# (shuffled) index, while df_pf has a fresh 0-based index, so the assignment misaligned the rows
y_test
y_test = y_test.reset_index(drop=True)
# Check the result
y_test.head()
# Let's overwrite the 'Target' column with the appropriate values
# Again, we need the exponential of the test log price
df_pf['Target'] = np.exp(y_test)
df_pf
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
# Finally, it makes sense to see how far off we are from the result percentage-wise
# Here, we take the absolute difference in %, so we can easily order the data frame
df_pf['Difference%'] = np.absolute(df_pf['Residual']/df_pf['Target']*100)
df_pf
# Exploring the descriptives here gives us additional insights
df_pf.describe()
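If a single headline number is preferred, standard error metrics can be computed on the same frame; a sketch using sklearn.metrics:
from sklearn.metrics import mean_absolute_error
# Mean absolute error of the predictions, in the original price units
mean_absolute_error(df_pf['Target'], df_pf['Prediction'])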
# Sometimes it is useful to check these outputs manually
# To see all rows, we use the relevant pandas syntax
pd.options.display.max_rows = 999
# Moreover, to make the output easier to read, we can display the result with only 2 digits after the decimal point
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Finally, we sort by difference in % and manually check the model
df_pf.sort_values(by=['Difference%'])
How to potentially improve our model further: