Car Sale Analysis

Importing the relevant libraries

In [1]:
# For this analysis, we will need the following libraries and modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set()

Loading the raw data

In [2]:
# Load the data from a .csv in the same folder
raw_data = pd.read_csv('1.04. Real-life example.csv')

# Explore the top 5 rows of the df
raw_data.head()
Out[2]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model
0 BMW 4200.0 sedan 277 2.0 Petrol yes 1991 320
1 Mercedes-Benz 7900.0 van 427 2.9 Diesel yes 1999 Sprinter 212
2 Mercedes-Benz 13300.0 sedan 358 5.0 Gas yes 2003 S 500
3 Audi 23000.0 crossover 240 4.2 Petrol yes 2007 Q7
4 Toyota 18300.0 crossover 120 2.0 Petrol yes 2011 Rav 4

We would like to predict the price of a used car depending on its specifications. Potential regressors in this data are:

1) Brand: A BMW is generally more expensive than a Toyota.
2) Mileage: The more a car is driven, the cheaper it should be.
3) Engine Volume: Sports cars have larger engines than economy cars.
4) Year of Production: The older the car, the cheaper it is, with the exception of vintage vehicles.

The rest are categorical variables which we will deal with on a case-by-case basis.

Preprocessing

Exploring the descriptive statistics of the variables

In [3]:
# By default, only descriptives for the numerical variables would be shown
# To include the categorical ones, we should specify this with an argument
raw_data.describe(include='all')
Out[3]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model
count 4345 4173.000000 4345 4345.000000 4195.000000 4345 4345 4345.000000 4345
unique 7 NaN 6 NaN NaN 4 2 NaN 312
top Volkswagen NaN sedan NaN NaN Diesel yes NaN E-Class
freq 936 NaN 1649 NaN NaN 2019 3947 NaN 199
mean NaN 19418.746935 NaN 161.237284 2.790734 NaN NaN 2006.550058 NaN
std NaN 25584.242620 NaN 105.705797 5.066437 NaN NaN 6.719097 NaN
min NaN 600.000000 NaN 0.000000 0.600000 NaN NaN 1969.000000 NaN
25% NaN 6999.000000 NaN 86.000000 1.800000 NaN NaN 2003.000000 NaN
50% NaN 11500.000000 NaN 155.000000 2.200000 NaN NaN 2008.000000 NaN
75% NaN 21700.000000 NaN 230.000000 3.000000 NaN NaN 2012.000000 NaN
max NaN 300000.000000 NaN 980.000000 99.990000 NaN NaN 2016.000000 NaN
  • The categorical variables don't have some types of numerical descriptives
  • The numerical variables don't have some types of categorical descriptives
  • Each variable has a different number of observations, implying that there are some missing values.
  • There are more than 300 models of cars, meaning we would have more than 300 dummy variables, which would be hard to implement in a regression. Also, a lot of the information from 'Model' could be engineered from 'Brand', 'Year', and 'EngineV', so we won't be losing too much variability. Therefore, we can drop 'Model'.
  • Registration has almost all 'yes' entries, so this variable probably won't be very useful
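The near-constant 'Registration' claim is easy to verify with value_counts(). A minimal sketch on a hypothetical miniature frame (not the actual dataset):

```python
import pandas as pd

# Hypothetical miniature of the dataset, just to illustrate the check
toy = pd.DataFrame({'Registration': ['yes'] * 9 + ['no']})

# Share of each category; a near-constant column carries little information
shares = toy['Registration'].value_counts(normalize=True)
print(shares)  # 'yes' dominates, so a dummy derived from it would be nearly constant
```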

Determining the variables of interest

In [4]:
# We considered dropping the near-constant 'Registration' column,
# but we'll keep the full dataset for now
#data = raw_data.drop(['Registration'],axis=1)
#data.describe(include='all')
data = raw_data

Dealing with missing values

In [5]:
# data.isnull() # shows a df with the information whether a data point is null 
# Since True = the data point is missing, while False = the data point is not missing, we can sum them
# This will give us the total number of missing values feature-wise
data.isnull().sum()
Out[5]:
Brand             0
Price           172
Body              0
Mileage           0
EngineV         150
Engine Type       0
Registration      0
Year              0
Model             0
dtype: int64
In [6]:
# Drop all missing values
# This is not always recommended; however, when we remove less than 5% of the data, it is usually acceptable
data_no_mv = data.dropna(axis=0)
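Before committing to dropna, it is worth checking what share of rows it will cost. A quick sketch of that check on toy data (the threshold logic, not the real dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values scattered across two columns
toy = pd.DataFrame({'Price': [1.0, np.nan, 3.0, 4.0],
                    'EngineV': [2.0, 2.5, np.nan, 3.0]})

# Fraction of rows containing at least one missing value
lost = toy.isnull().any(axis=1).mean()
print(f'{lost:.0%} of rows would be dropped')  # keep an eye on this share
```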
In [7]:
# Check the descriptives without the missing values
data_no_mv.describe(include='all')
Out[7]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model
count 4025 4025.000000 4025 4025.000000 4025.000000 4025 4025 4025.000000 4025
unique 7 NaN 6 NaN NaN 4 2 NaN 306
top Volkswagen NaN sedan NaN NaN Diesel yes NaN E-Class
freq 880 NaN 1534 NaN NaN 1861 3654 NaN 188
mean NaN 19552.308065 NaN 163.572174 2.764586 NaN NaN 2006.379627 NaN
std NaN 25815.734988 NaN 103.394703 4.935941 NaN NaN 6.695595 NaN
min NaN 600.000000 NaN 0.000000 0.600000 NaN NaN 1969.000000 NaN
25% NaN 6999.000000 NaN 90.000000 1.800000 NaN NaN 2003.000000 NaN
50% NaN 11500.000000 NaN 158.000000 2.200000 NaN NaN 2007.000000 NaN
75% NaN 21900.000000 NaN 230.000000 3.000000 NaN NaN 2012.000000 NaN
max NaN 300000.000000 NaN 980.000000 99.990000 NaN NaN 2016.000000 NaN

Notice the min and max values with respect to the mean and the quartiles for each variable. Namely, the max values for Price, Mileage, and EngineV are very high, and the min value for Year is very low compared to the rest of the values.

Exploring the PDFs

  • The probability density function (PDF) of a variable will show us how that variable is distributed
  • This will make it very easy to spot anomalies, such as outliers
  • The PDF is often the basis on which we decide whether we want to transform a feature
In [8]:
sns.distplot(data_no_mv['Price'])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c9eef50>
  • For optimal results, we would be looking for a normal distribution.
  • Price, however, has an exponential one, which will surely be a problem for our regression.
  • Price has 75% of values below or equal to 21,900 but the max value is 300,000, so obviously there are some outliers present.

Dealing with outliers

  • We can deal with the problem easily by removing a small share (e.g. the top 1-2%) of the problematic samples
  • Here, the outliers are situated around the higher prices (right side of the graph)
  • Logic should also be applied
  • This is a dataset about used cars, therefore one can imagine how $300,000 is an excessive price
  • Outliers are a big issue for OLS, thus we must deal with them in some way
In [9]:
# Declare a variable that will be equal to the 98th percentile of the 'Price' variable
q = data_no_mv['Price'].quantile(0.98)
# Then we can create a new df, with the condition that all prices must be below the 98th percentile of 'Price'
data_1 = data_no_mv[data_no_mv['Price']<q]
# In this way we have essentially removed the top 2% of the data about 'Price'
data_1.describe(include='all')
Out[9]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model
count 3943 3943.000000 3943 3943.000000 3943.000000 3943 3943 3943.000000 3943
unique 7 NaN 6 NaN NaN 4 2 NaN 300
top Volkswagen NaN sedan NaN NaN Diesel yes NaN E-Class
freq 880 NaN 1515 NaN NaN 1818 3572 NaN 188
mean NaN 16840.689820 NaN 166.739538 2.738415 NaN NaN 2006.197312 NaN
std NaN 16332.966734 NaN 102.042365 4.980975 NaN NaN 6.640292 NaN
min NaN 600.000000 NaN 0.000000 0.600000 NaN NaN 1969.000000 NaN
25% NaN 6900.000000 NaN 95.000000 1.800000 NaN NaN 2002.000000 NaN
50% NaN 11250.000000 NaN 160.000000 2.200000 NaN NaN 2007.000000 NaN
75% NaN 20800.000000 NaN 230.000000 3.000000 NaN NaN 2011.000000 NaN
max NaN 103333.000000 NaN 980.000000 99.990000 NaN NaN 2016.000000 NaN
  • The max Price is now ~103,000, which is still far away from the mean, but acceptably closer.
In [10]:
# We can check the PDF once again to ensure that the result is still distributed in the same way overall
sns.distplot(data_1['Price'])
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c23fc9b10>
  • There are far fewer outliers now
In [11]:
# We can treat the other numerical variables in a similar way
sns.distplot(data_no_mv['Mileage'])
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c241edd50>
In [12]:
q = data_1['Mileage'].quantile(0.99)
data_2 = data_1[data_1['Mileage']<q]
In [13]:
sns.distplot(data_2['Mileage'])
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c22d76510>
  • At this point, this plot seems to have a pretty normal distribution
In [14]:
sns.distplot(data_no_mv['EngineV'])
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c22f3d050>
  • The situation with engine volume is very strange
  • In such cases it makes sense to manually check what may be causing the problem
  • In our case the issue comes from the fact that most missing values are indicated with 99.99 or 99
  • There are also some incorrect entries like 75
  • A simple Google search can indicate the natural domain of this variable
  • Car engine volumes are usually (always?) below 6.5
  • This is a prime example of the fact that a domain expert (a person working in the car industry) may find it much easier to determine problems with the data than an outsider
In [15]:
data_3 = data_2[data_2['EngineV']<6.5]
sns.distplot(data_3['EngineV'])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c24352c90>
  • Following this graph, we realize we can actually treat EngineV as a categorical variable
  • Even so, we won't, but that could be something else to try
In [16]:
sns.distplot(data_no_mv['Year'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c244af090>
  • Finally, the situation with 'Year' is similar to 'Price' and 'Mileage'
  • However, the outliers are on the low end (i.e., some vintage cars)
In [17]:
# Remove the bottom 1% of 'Year', since vintage cars are not what our model aims to capture
q = data_3['Year'].quantile(0.01)
data_4 = data_3[data_3['Year']>q]
sns.distplot(data_4['Year'])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c245dcc10>
In [18]:
# Reset the index and drop the column containing the old index
data_cleaned = data_4.reset_index(drop=True)
In [19]:
# Let's see what's left
data_cleaned.describe(include='all')
Out[19]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model
count 3826 3826.000000 3826 3826.000000 3826.000000 3826 3826 3826.000000 3826
unique 7 NaN 6 NaN NaN 4 2 NaN 289
top Volkswagen NaN sedan NaN NaN Diesel yes NaN E-Class
freq 848 NaN 1454 NaN NaN 1772 3464 NaN 181
mean NaN 17171.386333 NaN 162.166231 2.441777 NaN NaN 2006.615787 NaN
std NaN 16396.219278 NaN 94.831169 0.947973 NaN NaN 6.067533 NaN
min NaN 800.000000 NaN 0.000000 0.600000 NaN NaN 1988.000000 NaN
25% NaN 7200.000000 NaN 94.000000 1.800000 NaN NaN 2003.000000 NaN
50% NaN 11500.000000 NaN 159.000000 2.200000 NaN NaN 2008.000000 NaN
75% NaN 21000.000000 NaN 227.000000 3.000000 NaN NaN 2011.000000 NaN
max NaN 103333.000000 NaN 435.000000 6.300000 NaN NaN 2016.000000 NaN
  • Overall, we've removed roughly 200 observations (about 5% of the data without missing values) that proved to be problematic.

Checking the OLS assumptions

  • The categorical variables will be included as dummies, so we don't need to worry about them when checking the assumptions
  • The continuous variables (Price, Year, EngineV, Mileage) are those that are likely to be more challenging and cause us more problems

Linearity

In [20]:
# We could simply use plt.scatter() for each of Year, EngineV, and Mileage
# But since Price is the 'y' axis of all the plots, it makes sense to plot them side-by-side (so we can compare them)
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3)) #sharey -> share 'Price' as y
ax1.scatter(data_cleaned['Year'],data_cleaned['Price'])
ax1.set_title('Price and Year')
ax2.scatter(data_cleaned['EngineV'],data_cleaned['Price'])
ax2.set_title('Price and EngineV')
ax3.scatter(data_cleaned['Mileage'],data_cleaned['Price'])
ax3.set_title('Price and Mileage')

plt.show()
  • These patterns are not linear so we should not run a linear regression without transforming one or more variables.
In [21]:
sns.distplot(data_cleaned['Price'])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1c24942f50>
  • From the subplots and the PDF of price, we can easily determine that 'Price' is exponentially distributed
  • Therefore, its relationships with the other features are not linear but rather exponential
  • A good transformation in that case is a log transformation
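To see why a log transformation helps, here is a small synthetic demonstration (not the car data): if y depends exponentially on x, then log(y) is exactly linear in x:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 5, 500)
# An exponential relationship, like Price vs. the other features
y = np.exp(1.2 * x + 0.5)

# Correlation of x with y itself is weaker than with log(y),
# which is perfectly linear in x by construction
corr_raw = np.corrcoef(x, y)[0, 1]
corr_log = np.corrcoef(x, np.log(y))[0, 1]
print(corr_raw, corr_log)  # corr_log is 1.0 up to floating-point error
```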

Relaxing the assumptions

In [22]:
# Transform 'Price' with a log transformation
log_price = np.log(data_cleaned['Price'])

# Then add it to our data frame
data_cleaned['log_price'] = log_price
data_cleaned
Out[22]:
Brand Price Body Mileage EngineV Engine Type Registration Year Model log_price
0 BMW 4200.0 sedan 277 2.0 Petrol yes 1991 320 8.342840
1 Mercedes-Benz 7900.0 van 427 2.9 Diesel yes 1999 Sprinter 212 8.974618
2 Mercedes-Benz 13300.0 sedan 358 5.0 Gas yes 2003 S 500 9.495519
3 Audi 23000.0 crossover 240 4.2 Petrol yes 2007 Q7 10.043249
4 Toyota 18300.0 crossover 120 2.0 Petrol yes 2011 Rav 4 9.814656
... ... ... ... ... ... ... ... ... ... ...
3821 Renault 6800.0 sedan 152 1.6 Petrol yes 2007 Megane 8.824678
3822 Volkswagen 11500.0 van 163 2.5 Diesel yes 2008 T5 (Transporter) 9.350102
3823 Toyota 17900.0 sedan 35 1.6 Petrol yes 2014 Corolla 9.792556
3824 BMW 6500.0 sedan 1 3.5 Petrol yes 1999 535 8.779557
3825 Volkswagen 13500.0 van 124 2.0 Diesel yes 2013 T5 (Transporter) 9.510445

3826 rows × 10 columns

In [23]:
# Let's check the three scatters once again
f, (ax1, ax2, ax3) = plt.subplots(1, 3, sharey=True, figsize =(15,3))
ax1.scatter(data_cleaned['Year'],data_cleaned['log_price'])
ax1.set_title('Log Price and Year')
ax2.scatter(data_cleaned['EngineV'],data_cleaned['log_price'])
ax2.set_title('Log Price and EngineV')
ax3.scatter(data_cleaned['Mileage'],data_cleaned['log_price'])
ax3.set_title('Log Price and Mileage')

plt.show()
  • The scatter plots now show clear linear relationships
  • This is some good linear regression material
  • Alternatively we could have transformed each of the independent variables
In [24]:
# Since we will be using the log price variable, we can drop the old 'Price' one
data_cleaned = data_cleaned.drop(['Price'],axis=1)

Now we will deal with the rest of the assumptions:

  • No Endogeneity: No correlation between residuals and independent variables (we'll discuss the residuals after the regression is created)
  • Normality and Homoscedasticity:
    • Normality: assumed for a big sample following the Central Limit Theorem
    • Zero mean of the distribution of errors: accomplished through including the intercept in the regression
    • Homoscedasticity: generally holds, as we can see in the above graphs (already implemented log transformation)
  • No Autocorrelation: Our observations are not coming from time-series or panel data; they are simply a snapshot of the current situation on a second-hand car sales website - each row comes from a different customer who is willing to sell their car through the platform. Logically, there is no reason for the observations to be dependent on each other, so we are safe.
  • Multicollinearity: Since an older car would typically have more mileage, there are grounds to check for multicollinearity (e.g. between Year and Mileage)

Multicollinearity

In [25]:
# Let's quickly see the columns of our data frame
data_cleaned.columns.values
Out[25]:
array(['Brand', 'Body', 'Mileage', 'EngineV', 'Engine Type',
       'Registration', 'Year', 'Model', 'log_price'], dtype=object)
In [26]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# To make this as easy as possible to use, we declare a variable where we put
# all features where we want to check for multicollinearity
# since our categorical data is not yet preprocessed, we will only take the numerical ones
variables = data_cleaned[['Mileage','Year','EngineV']]

# we create a new data frame which will include all the VIFs
# note that each variable has its own variance inflation factor as this measure is variable specific (not model specific)
vif = pd.DataFrame()

# here we make use of the variance_inflation_factor, which will basically output the respective VIFs 
vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
# Finally, I like to include names so it is easier to explore the result
vif["Features"] = variables.columns
In [27]:
# Let's explore the result
vif
Out[27]:
VIF Features
0 3.899033 Mileage
1 10.307533 Year
2 7.637076 EngineV
  • Since Year has the highest VIF, I will remove it from the model
  • This will drive the VIF of other variables down
  • So even though EngineV also seems to have a high VIF, once 'Year' is gone, that will no longer be the case
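For intuition, each VIF is 1/(1 - R²) from regressing that feature on the remaining ones. A hand-rolled sketch on synthetic data (with an intercept in the auxiliary regression, so these are 'centered' VIFs and will differ somewhat from the statsmodels numbers above unless a constant column is included there):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
mileage = rng.uniform(0, 300, n)
# 'Year' deliberately made correlated with mileage, as in the real data
year = 2016 - mileage / 25 + rng.normal(0, 1, n)
engine_v = rng.uniform(1, 5, n)  # independent of the other two
X = np.column_stack([mileage, year, engine_v])

def vif(X, i):
    # Regress column i on the remaining columns (plus an intercept),
    # then convert the resulting R^2 into a variance inflation factor
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

print([round(vif(X, i), 2) for i in range(3)])  # mileage and year share a high VIF
```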
In [28]:
data_no_multicollinearity = data_cleaned.drop(['Year'],axis=1)

Create dummy variables

  • To include the categorical data in the regression, let's create dummies
  • There is a very convenient method called 'get_dummies' which does that seamlessly for all categorical variables in the dataset
  • It is extremely important that we drop one of the dummies, otherwise we will introduce multicollinearity
    • If we have N categories for a feature, we have to create N-1 dummies, otherwise the Nth category could be perfectly predicted by the other N-1 dummies.
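A tiny illustration of drop_first on a single 3-category feature (hypothetical values): 3 categories produce 2 dummies, with the dropped category acting as the baseline.

```python
import pandas as pd

# One categorical feature with 3 categories
toy = pd.DataFrame({'Body': ['sedan', 'van', 'crossover', 'sedan']})

# drop_first=True drops the alphabetically-first category ('crossover'),
# which becomes the baseline encoded as all-zeros
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))  # ['Body_sedan', 'Body_van']
```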
In [29]:
data_with_dummies = pd.get_dummies(data_no_multicollinearity, drop_first=True)
data_with_dummies.head()
Out[29]:
Mileage EngineV log_price Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen Body_hatch ... Model_Vito Model_X1 Model_X3 Model_X5 Model_X5 M Model_X6 Model_X6 M Model_Yaris Model_Z3 Model_Z4
0 277 2.0 8.342840 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 427 2.9 8.974618 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 358 5.0 9.495519 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 240 4.2 10.043249 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 120 2.0 9.814656 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 306 columns

Rearrange a bit

  • To make our data frame more organized, we prefer to place the dependent variable at the beginning of the df
  • Since each problem is different, that must be done manually
  • We can display all possible features and then choose the desired order
In [30]:
data_with_dummies.columns.values
Out[30]:
array(['Mileage', 'EngineV', 'log_price', 'Brand_BMW',
       'Brand_Mercedes-Benz', 'Brand_Mitsubishi', 'Brand_Renault',
       'Brand_Toyota', 'Brand_Volkswagen', 'Body_hatch', 'Body_other',
       'Body_sedan', 'Body_vagon', 'Body_van', 'Engine Type_Gas',
       'Engine Type_Other', 'Engine Type_Petrol', 'Registration_yes',
       'Model_100', 'Model_11', 'Model_116', 'Model_118', 'Model_120',
       'Model_19', 'Model_190', 'Model_200', 'Model_210', 'Model_220',
       'Model_230', 'Model_25', 'Model_250', 'Model_300', 'Model_316',
       'Model_318', 'Model_320', 'Model_323', 'Model_325', 'Model_328',
       'Model_330', 'Model_335', 'Model_4 Series Gran Coupe', 'Model_428',
       'Model_4Runner', 'Model_5 Series', 'Model_5 Series GT',
       'Model_520', 'Model_523', 'Model_524', 'Model_525', 'Model_528',
       'Model_530', 'Model_535', 'Model_540', 'Model_545', 'Model_550',
       'Model_6 Series Gran Coupe', 'Model_630', 'Model_640', 'Model_645',
       'Model_650', 'Model_730', 'Model_735', 'Model_740', 'Model_745',
       'Model_750', 'Model_760', 'Model_80', 'Model_9', 'Model_90',
       'Model_A 140', 'Model_A 150', 'Model_A 170', 'Model_A 180',
       'Model_A1', 'Model_A3', 'Model_A4', 'Model_A4 Allroad', 'Model_A5',
       'Model_A6', 'Model_A6 Allroad', 'Model_A7', 'Model_A8',
       'Model_ASX', 'Model_Amarok', 'Model_Auris', 'Model_Avalon',
       'Model_Avensis', 'Model_Aygo', 'Model_B 170', 'Model_B 180',
       'Model_B 200', 'Model_Beetle', 'Model_Bora', 'Model_C-Class',
       'Model_CL 180', 'Model_CL 500', 'Model_CL 55 AMG', 'Model_CL 550',
       'Model_CL 63 AMG', 'Model_CLA 200', 'Model_CLA 220',
       'Model_CLA-Class', 'Model_CLC 180', 'Model_CLC 200',
       'Model_CLK 200', 'Model_CLK 220', 'Model_CLK 230', 'Model_CLK 240',
       'Model_CLK 280', 'Model_CLK 320', 'Model_CLK 430', 'Model_CLS 350',
       'Model_CLS 400', 'Model_CLS 500', 'Model_CLS 63 AMG',
       'Model_Caddy', 'Model_Camry', 'Model_Captur', 'Model_Caravelle',
       'Model_Carina', 'Model_Carisma', 'Model_Celica', 'Model_Clio',
       'Model_Colt', 'Model_Corolla', 'Model_Corolla Verso',
       'Model_Cross Touran', 'Model_Dokker', 'Model_Duster',
       'Model_E-Class', 'Model_Eclipse', 'Model_Eos', 'Model_Espace',
       'Model_FJ Cruiser', 'Model_Fluence', 'Model_Fortuner',
       'Model_G 320', 'Model_G 500', 'Model_G 55 AMG', 'Model_G 63 AMG',
       'Model_GL 320', 'Model_GL 350', 'Model_GL 420', 'Model_GL 450',
       'Model_GL 500', 'Model_GL 550', 'Model_GLC-Class',
       'Model_GLE-Class', 'Model_GLK 220', 'Model_GLK 300',
       'Model_GLS 350', 'Model_Galant', 'Model_Golf GTI', 'Model_Golf II',
       'Model_Golf III', 'Model_Golf IV', 'Model_Golf Plus',
       'Model_Golf V', 'Model_Golf VI', 'Model_Golf VII',
       'Model_Golf Variant', 'Model_Grand Scenic', 'Model_Grandis',
       'Model_Hiace', 'Model_Highlander', 'Model_Hilux', 'Model_I3',
       'Model_IQ', 'Model_Jetta', 'Model_Kangoo', 'Model_Koleos',
       'Model_L 200', 'Model_LT', 'Model_Laguna', 'Model_Lancer',
       'Model_Lancer Evolution', 'Model_Lancer X',
       'Model_Lancer X Sportback', 'Model_Land Cruiser 100',
       'Model_Land Cruiser 105', 'Model_Land Cruiser 200',
       'Model_Land Cruiser 76', 'Model_Land Cruiser 80',
       'Model_Land Cruiser Prado', 'Model_Latitude', 'Model_Logan',
       'Model_Lupo', 'Model_M5', 'Model_M6', 'Model_MB', 'Model_ML 250',
       'Model_ML 270', 'Model_ML 280', 'Model_ML 320', 'Model_ML 350',
       'Model_ML 400', 'Model_ML 430', 'Model_ML 500', 'Model_ML 550',
       'Model_ML 63 AMG', 'Model_Master', 'Model_Matrix', 'Model_Megane',
       'Model_Modus', 'Model_Multivan', 'Model_New Beetle',
       'Model_Outlander', 'Model_Outlander XL', 'Model_Pajero',
       'Model_Pajero Pinin', 'Model_Pajero Sport', 'Model_Pajero Wagon',
       'Model_Passat B3', 'Model_Passat B4', 'Model_Passat B5',
       'Model_Passat B6', 'Model_Passat B7', 'Model_Passat B8',
       'Model_Passat CC', 'Model_Phaeton', 'Model_Pointer', 'Model_Polo',
       'Model_Previa', 'Model_Prius', 'Model_Q3', 'Model_Q5', 'Model_Q7',
       'Model_R 320', 'Model_R8', 'Model_Rav 4', 'Model_S 140',
       'Model_S 250', 'Model_S 300', 'Model_S 320', 'Model_S 350',
       'Model_S 400', 'Model_S 430', 'Model_S 500', 'Model_S 550',
       'Model_S 600', 'Model_S 63 AMG', 'Model_S 65 AMG', 'Model_S4',
       'Model_S5', 'Model_S8', 'Model_SL 500 (550)', 'Model_SL 55 AMG',
       'Model_SLK 200', 'Model_SLK 350', 'Model_Sandero',
       'Model_Sandero StepWay', 'Model_Scenic', 'Model_Scion',
       'Model_Scirocco', 'Model_Sequoia', 'Model_Sharan', 'Model_Sienna',
       'Model_Smart', 'Model_Space Star', 'Model_Space Wagon',
       'Model_Sprinter 208', 'Model_Sprinter 210', 'Model_Sprinter 211',
       'Model_Sprinter 212', 'Model_Sprinter 213', 'Model_Sprinter 311',
       'Model_Sprinter 312', 'Model_Sprinter 313', 'Model_Sprinter 315',
       'Model_Sprinter 316', 'Model_Sprinter 318', 'Model_Sprinter 319',
       'Model_Symbol', 'Model_Syncro', 'Model_T3 (Transporter)',
       'Model_T4 (Transporter)', 'Model_T4 (Transporter) ',
       'Model_T5 (Transporter)', 'Model_T5 (Transporter) ',
       'Model_T6 (Transporter)', 'Model_T6 (Transporter) ', 'Model_TT',
       'Model_Tacoma', 'Model_Tiguan', 'Model_Touareg', 'Model_Touran',
       'Model_Trafic', 'Model_Tundra', 'Model_Up', 'Model_V 250',
       'Model_Vaneo', 'Model_Vento', 'Model_Venza', 'Model_Viano',
       'Model_Virage', 'Model_Vista', 'Model_Vito', 'Model_X1',
       'Model_X3', 'Model_X5', 'Model_X5 M', 'Model_X6', 'Model_X6 M',
       'Model_Yaris', 'Model_Z3', 'Model_Z4'], dtype=object)
  • To make the code a bit more parametrized, let's declare a new variable that will contain the preferred order
  • Conventionally, the most intuitive order is: dependent variable, independent numerical variables, dummies
In [31]:
cols = ['log_price', 'Mileage', 'EngineV', 'Brand_BMW',
       'Brand_Mercedes-Benz', 'Brand_Mitsubishi', 'Brand_Renault',
       'Brand_Toyota', 'Brand_Volkswagen', 'Body_hatch', 'Body_other',
       'Body_sedan', 'Body_vagon', 'Body_van', 'Engine Type_Gas',
       'Engine Type_Other', 'Engine Type_Petrol', 'Registration_yes', 'Model_100', 'Model_11', 'Model_116', 'Model_118', 'Model_120',
       'Model_19', 'Model_190', 'Model_200', 'Model_210', 'Model_220',
       'Model_230', 'Model_25', 'Model_250', 'Model_300', 'Model_316',
       'Model_318', 'Model_320', 'Model_323', 'Model_325', 'Model_328',
       'Model_330', 'Model_335', 'Model_4 Series Gran Coupe', 'Model_428',
       'Model_4Runner', 'Model_5 Series', 'Model_5 Series GT',
       'Model_520', 'Model_523', 'Model_524', 'Model_525', 'Model_528',
       'Model_530', 'Model_535', 'Model_540', 'Model_545', 'Model_550',
       'Model_6 Series Gran Coupe', 'Model_630', 'Model_640', 'Model_645',
       'Model_650', 'Model_730', 'Model_735', 'Model_740', 'Model_745',
       'Model_750', 'Model_760', 'Model_80', 'Model_9', 'Model_90',
       'Model_A 140', 'Model_A 150', 'Model_A 170', 'Model_A 180',
       'Model_A1', 'Model_A3', 'Model_A4', 'Model_A4 Allroad', 'Model_A5',
       'Model_A6', 'Model_A6 Allroad', 'Model_A7', 'Model_A8',
       'Model_ASX', 'Model_Amarok', 'Model_Auris', 'Model_Avalon',
       'Model_Avensis', 'Model_Aygo', 'Model_B 170', 'Model_B 180',
       'Model_B 200', 'Model_Beetle', 'Model_Bora', 'Model_C-Class',
       'Model_CL 180', 'Model_CL 500', 'Model_CL 55 AMG', 'Model_CL 550',
       'Model_CL 63 AMG', 'Model_CLA 200', 'Model_CLA 220',
       'Model_CLA-Class', 'Model_CLC 180', 'Model_CLC 200',
       'Model_CLK 200', 'Model_CLK 220', 'Model_CLK 230', 'Model_CLK 240',
       'Model_CLK 280', 'Model_CLK 320', 'Model_CLK 430', 'Model_CLS 350',
       'Model_CLS 400', 'Model_CLS 500', 'Model_CLS 63 AMG',
       'Model_Caddy', 'Model_Camry', 'Model_Captur', 'Model_Caravelle',
       'Model_Carina', 'Model_Carisma', 'Model_Celica', 'Model_Clio',
       'Model_Colt', 'Model_Corolla', 'Model_Corolla Verso',
       'Model_Cross Touran', 'Model_Dokker', 'Model_Duster',
       'Model_E-Class', 'Model_Eclipse', 'Model_Eos', 'Model_Espace',
       'Model_FJ Cruiser', 'Model_Fluence', 'Model_Fortuner',
       'Model_G 320', 'Model_G 500', 'Model_G 55 AMG', 'Model_G 63 AMG',
       'Model_GL 320', 'Model_GL 350', 'Model_GL 420', 'Model_GL 450',
       'Model_GL 500', 'Model_GL 550', 'Model_GLC-Class',
       'Model_GLE-Class', 'Model_GLK 220', 'Model_GLK 300',
       'Model_GLS 350', 'Model_Galant', 'Model_Golf GTI', 'Model_Golf II',
       'Model_Golf III', 'Model_Golf IV', 'Model_Golf Plus',
       'Model_Golf V', 'Model_Golf VI', 'Model_Golf VII',
       'Model_Golf Variant', 'Model_Grand Scenic', 'Model_Grandis',
       'Model_Hiace', 'Model_Highlander', 'Model_Hilux', 'Model_I3',
       'Model_IQ', 'Model_Jetta', 'Model_Kangoo', 'Model_Koleos',
       'Model_L 200', 'Model_LT', 'Model_Laguna', 'Model_Lancer',
       'Model_Lancer Evolution', 'Model_Lancer X',
       'Model_Lancer X Sportback', 'Model_Land Cruiser 100',
       'Model_Land Cruiser 105', 'Model_Land Cruiser 200',
       'Model_Land Cruiser 76', 'Model_Land Cruiser 80',
       'Model_Land Cruiser Prado', 'Model_Latitude', 'Model_Logan',
       'Model_Lupo', 'Model_M5', 'Model_M6', 'Model_MB', 'Model_ML 250',
       'Model_ML 270', 'Model_ML 280', 'Model_ML 320', 'Model_ML 350',
       'Model_ML 400', 'Model_ML 430', 'Model_ML 500', 'Model_ML 550',
       'Model_ML 63 AMG', 'Model_Master', 'Model_Matrix', 'Model_Megane',
       'Model_Modus', 'Model_Multivan', 'Model_New Beetle',
       'Model_Outlander', 'Model_Outlander XL', 'Model_Pajero',
       'Model_Pajero Pinin', 'Model_Pajero Sport', 'Model_Pajero Wagon',
       'Model_Passat B3', 'Model_Passat B4', 'Model_Passat B5',
       'Model_Passat B6', 'Model_Passat B7', 'Model_Passat B8',
       'Model_Passat CC', 'Model_Phaeton', 'Model_Pointer', 'Model_Polo',
       'Model_Previa', 'Model_Prius', 'Model_Q3', 'Model_Q5', 'Model_Q7',
       'Model_R 320', 'Model_R8', 'Model_Rav 4', 'Model_S 140',
       'Model_S 250', 'Model_S 300', 'Model_S 320', 'Model_S 350',
       'Model_S 400', 'Model_S 430', 'Model_S 500', 'Model_S 550',
       'Model_S 600', 'Model_S 63 AMG', 'Model_S 65 AMG', 'Model_S4',
       'Model_S5', 'Model_S8', 'Model_SL 500 (550)', 'Model_SL 55 AMG',
       'Model_SLK 200', 'Model_SLK 350', 'Model_Sandero',
       'Model_Sandero StepWay', 'Model_Scenic', 'Model_Scion',
       'Model_Scirocco', 'Model_Sequoia', 'Model_Sharan', 'Model_Sienna',
       'Model_Smart', 'Model_Space Star', 'Model_Space Wagon',
       'Model_Sprinter 208', 'Model_Sprinter 210', 'Model_Sprinter 211',
       'Model_Sprinter 212', 'Model_Sprinter 213', 'Model_Sprinter 311',
       'Model_Sprinter 312', 'Model_Sprinter 313', 'Model_Sprinter 315',
       'Model_Sprinter 316', 'Model_Sprinter 318', 'Model_Sprinter 319',
       'Model_Symbol', 'Model_Syncro', 'Model_T3 (Transporter)',
       'Model_T4 (Transporter)', 'Model_T4 (Transporter) ',
       'Model_T5 (Transporter)', 'Model_T5 (Transporter) ',
       'Model_T6 (Transporter)', 'Model_T6 (Transporter) ', 'Model_TT',
       'Model_Tacoma', 'Model_Tiguan', 'Model_Touareg', 'Model_Touran',
       'Model_Trafic', 'Model_Tundra', 'Model_Up', 'Model_V 250',
       'Model_Vaneo', 'Model_Vento', 'Model_Venza', 'Model_Viano',
       'Model_Virage', 'Model_Vista', 'Model_Vito', 'Model_X1',
       'Model_X3', 'Model_X5', 'Model_X5 M', 'Model_X6', 'Model_X6 M',
       'Model_Yaris', 'Model_Z3', 'Model_Z4']
In [32]:
# To implement the reordering, we will create a new df, which is equal to the old one but with the new order of features
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()
Out[32]:
log_price Mileage EngineV Brand_BMW Brand_Mercedes-Benz Brand_Mitsubishi Brand_Renault Brand_Toyota Brand_Volkswagen Body_hatch ... Model_Vito Model_X1 Model_X3 Model_X5 Model_X5 M Model_X6 Model_X6 M Model_Yaris Model_Z3 Model_Z4
0 8.342840 277 2.0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 8.974618 427 2.9 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 9.495519 358 5.0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 10.043249 240 4.2 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 9.814656 120 2.0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 306 columns

Linear regression model

Declare the inputs and the targets

In [33]:
# The target (dependent variable) is 'log_price'
targets = data_preprocessed['log_price']

# The inputs are everything BUT the dependent variable, so we can simply drop it
inputs = data_preprocessed.drop(['log_price'],axis=1)

Scale the data

In [34]:
# Import the scaling module
from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()
# Fit the inputs (calculate the mean and standard deviation feature-wise)
scaler.fit(inputs)
Out[34]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [35]:
# Scale the features and store them in a new variable (the actual scaling procedure)
inputs_scaled = scaler.transform(inputs)
  • The dummy variables are also standardized here, so they lose their 0/1 interpretation; this complicates interpreting the coefficients, but should not hurt the model's predictive performance
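If keeping the 0/1 meaning of the dummies matters, one common alternative (sketched here with plain pandas; not what this notebook does) is to standardize only the continuous columns:

```python
import pandas as pd

# Hypothetical miniature of the preprocessed data
toy = pd.DataFrame({'Mileage': [100.0, 200.0, 300.0],
                    'EngineV': [1.6, 2.0, 3.0],
                    'Brand_BMW': [1, 0, 0]})

numeric = ['Mileage', 'EngineV']  # leave the dummies untouched
scaled = toy.copy()
scaled[numeric] = (toy[numeric] - toy[numeric].mean()) / toy[numeric].std(ddof=0)
print(scaled)
# Mileage/EngineV now have mean 0 and std 1, while Brand_BMW keeps its 0/1 meaning
```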

Train Test Split

In [36]:
# Import the module for the split
from sklearn.model_selection import train_test_split

# Split the variables with an 80-20 split and some random state
# To have the same split each time, use random_state = 365
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=365)

Create the regression

In [37]:
# Create a linear regression object
reg = LinearRegression()
# Fit the regression with the scaled TRAIN inputs and targets
reg.fit(x_train,y_train)
Out[37]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
  • This is actually a log-linear regression, as the dependent variable is the log of 'Price'
  • We want to check the regression by plotting the predicted values against the observed values
In [38]:
# Let's check the outputs of the regression
# I'll store them in y_hat as this is the 'theoretical' name of the predictions
y_hat = reg.predict(x_train)
In [39]:
# The simplest way to compare the targets (y_train) and the predictions (y_hat) is to plot them on a scatter plot
# The closer the points are to the 45-degree line, the better the prediction
plt.scatter(y_train, y_hat)
# Let's also name the axes
plt.xlabel('Targets (y_train)',size=18)
plt.ylabel('Predictions (y_hat)',size=18)
# We need to make sure the scales of the x-axis and the y-axis are the same
# Otherwise we wouldn't be able to interpret the '45-degree line'
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()
  • Our result is not perfect but definitely not random
  • The points are situated around the 45-degree line, so our model has passed this first check
In [40]:
# Another useful check of our model is a residual plot
# Residuals = targets - predictions
# We can plot the PDF of the residuals and check for anomalies
# Note: distplot is deprecated in newer seaborn versions;
# sns.histplot(y_train - y_hat, kde=True) is the modern equivalent
sns.distplot(y_train - y_hat)

# Include a title
plt.title("Residuals PDF", size=18)
Out[40]:
Text(0.5, 1.0, 'Residuals PDF')
  • Ideally, the residuals in this plot should be (approximately) normally distributed
  • This is because the regression assumptions state that the errors are normally distributed with a mean of 0 --> since the residuals are estimates of the errors, we expect the same of them
  • In our case there is a much longer tail on the left side, i.e. many large negative residuals (far from the mean)
  • Given the definition of a residual (y_train - y_hat), negative values mean y_hat (the prediction) is much higher than y_train (the target) --> some predictions heavily overestimate the targets, while they rarely underestimate them
  • This is food for thought to improve our model
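The left-tail asymmetry can also be quantified rather than eyeballed, e.g. with the sample skewness of the residuals (negative skew = longer left tail). A sketch with simulated residuals; in the notebook, y_train - y_hat could be passed to skew directly:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Mostly well-behaved residuals plus a small cluster of large negative
# ones, mimicking the long left tail seen in the plot above
resid = np.concatenate([rng.normal(0.0, 0.3, 950), rng.normal(-2.0, 0.5, 50)])

print(skew(resid) < 0)  # True: the distribution is left-skewed
```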
In [41]:
# Find the R-squared of the model
r2 = reg.score(x_train,y_train)
r2
Out[41]:
0.8347161853185299
  • The appropriate measure to use here is the adjusted R-squared
In [42]:
# Number of observations is the shape along axis 0
n = x_train.shape[0]
# Number of features (predictors, p) is the shape along axis 1
p = x_train.shape[1]

# We find the Adjusted R-squared using the formula
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
Out[42]:
0.8164113329300592
  • The adjusted R^2 is a bit lower than the regular R^2, which shows that at least one of the included variables does not increase the explanatory power of the regression
  • We can use feature selection to identify that variable, remove it, and further simplify our model
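The formula above can be wrapped in a small helper. Plugging in the R^2 from this run together with the train-set shape implied by the outputs (n = 3060 observations, p = 305 features, i.e. 306 preprocessed columns minus 'log_price' — an inference from the numbers, since the shapes are not printed) reproduces the adjusted value:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes R^2 for the number of predictors p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# r2 from Out[41]; n and p are inferred, not shown in the notebook
print(adjusted_r2(0.8347161853185299, 3060, 305))
```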

Feature Selection

In [60]:
from sklearn.feature_selection import f_regression
In [61]:
f_regression(x_train, y_train)
Out[61]:
(array([1.76068144e+03, 6.23344807e+02, 1.45893327e+01, 2.59454872e+01,
        1.76510386e+01, 1.18643960e+02, 8.39780982e+01, 3.13007519e+01,
        5.92802666e+01, 8.38777989e+00, 5.79453632e+01, 1.03804394e+02,
        4.29209798e+01, 1.83643828e+01, 1.23316590e+00, 9.30208253e+00,
        9.08970229e+02, 2.36498989e+01, 1.17908948e-14, 3.22393208e-01,
        1.99821289e-03, 2.07573186e+00, 6.76666574e+00, 1.01431727e+01,
        3.71523133e+00, 2.61334565e+00, 5.19098970e-01, 4.18362006e+00,
        1.17908948e-14, 1.17908948e-14, 1.17225278e-14, 5.57624782e+00,
        2.30302970e+01, 1.87914562e+01, 5.30811184e-01, 2.28915986e-02,
        6.28764284e-01, 1.60638823e+00, 8.46476912e-01, 1.17225278e-14,
        1.17225278e-14, 2.22227996e+00, 2.82353443e+00, 5.61381468e+00,
        5.71612162e-01, 2.19559098e+00, 1.17225278e-14, 5.00342254e+01,
        3.71916875e+00, 3.70137924e+00, 2.89878008e+00, 1.12737963e+00,
        2.92506588e-01, 9.56094613e+00, 5.43503573e+00, 5.68813460e-01,
        1.57207667e+01, 1.08627691e+00, 5.29529730e-01, 1.15163974e-01,
        6.48630613e-01, 2.80997666e-01, 3.04970911e-01, 2.37129388e+01,
        3.78549878e-01, 1.93461722e+00, 7.32163714e+00, 1.17908948e-14,
        3.15111222e+00, 1.15351417e-01, 1.53715759e+01, 1.06725375e+00,
        2.17545005e-01, 1.32562525e-01, 1.64552736e+01, 3.27595922e+00,
        1.71072018e+01, 3.14136204e+01, 4.26085368e-01, 9.72790601e+00,
        3.62527843e+00, 9.75421752e-01, 1.02488220e+00, 8.69076359e-02,
        8.57756683e-02, 4.52046536e-01, 3.82149402e-01, 6.34761219e-03,
        1.56495790e+00, 7.36802500e-02, 2.07573186e+00, 4.44765793e+00,
        1.48390866e+00, 5.42485029e+00, 9.19023597e-01, 5.00168738e+00,
        1.62694619e+00, 9.51515011e+00, 4.12308719e+00, 2.22227996e+00,
        1.92658512e+00, 2.94070450e-01, 2.41010841e-01, 3.60102521e+00,
        3.87960674e-01, 2.70934601e+00, 8.95931397e-02, 5.29529730e-01,
        5.49771195e-02, 1.17225278e-14, 2.31770968e+00, 6.26971966e+00,
        7.09173759e+00, 5.91107212e+00, 1.03790624e+01, 6.56362967e+00,
        6.49771064e-01, 7.22618634e-01, 5.55992987e+00, 9.97555832e+00,
        3.60433704e-01, 4.76476597e+00, 7.70254568e+00, 7.46483356e+00,
        2.93209582e-01, 2.90565736e-04, 2.12978860e-01, 2.44754992e-01,
        7.99662311e+00, 1.74098529e-01, 2.54478357e-01, 5.14963391e+00,
        7.49893685e+00, 3.87823577e-02, 7.18660456e-02, 1.36834120e+00,
        1.08867531e+01, 1.08273219e+01, 1.66993442e+00, 1.05944267e+01,
        3.13802723e+01, 8.13336018e-01, 4.04441802e+00, 2.09611817e+00,
        2.04540925e+00, 7.46776293e+00, 4.27863595e+01, 3.07332295e+00,
        1.18736641e+00, 1.17458107e+01, 3.79215336e+01, 1.17908948e-14,
        1.14277506e+01, 3.69612167e+01, 4.14391308e+01, 4.24122106e-01,
        1.78375824e+00, 1.11910141e-01, 1.85853725e+00, 1.03441823e-02,
        3.11967109e-01, 4.19726299e-02, 1.18650288e-03, 1.23940733e+01,
        1.44792320e+01, 1.95958409e+00, 3.37202724e-03, 1.82498688e-02,
        6.75715838e+01, 1.00951618e-02, 1.75539360e+00, 1.45336690e+00,
        2.70339291e+01, 1.85975344e+01, 1.84403360e-03, 3.93375330e+00,
        1.15351417e-01, 3.06825937e+00, 3.78549878e-01, 1.05516292e+02,
        3.87933050e+00, 1.30087745e-01, 5.02561796e+01, 1.75821962e-03,
        2.74142818e+00, 3.95648189e+00, 1.29818352e+01, 4.86978403e+00,
        4.17818569e+00, 2.35042741e+00, 5.67814370e-02, 1.17908948e-14,
        5.14359974e-01, 1.50621055e+01, 8.45281333e-01, 1.17225278e-14,
        2.94028928e+00, 9.64960873e-01, 1.89178117e+00, 1.08041040e-01,
        1.06574101e-01, 8.61441183e+00, 3.01815916e-01, 2.56244628e+01,
        2.17943303e-01, 9.50066142e-02, 2.13251464e+00, 6.88974238e+00,
        2.30596381e-01, 2.95644289e+00, 5.21697729e+00, 3.19358170e+01,
        2.36043428e+01, 6.30623574e+01, 7.86983947e-01, 1.11685591e+01,
        6.66197355e+00, 9.55655842e+00, 2.21739112e+00, 1.17225278e-14,
        5.23301811e+00, 1.17225278e-14, 1.58902258e-01, 6.79184628e+00,
        1.84401679e+01, 7.94270433e+01, 1.17225278e-14, 1.73588258e+01,
        6.17253753e+00, 8.36072077e-01, 3.27309852e+00, 1.90310547e-01,
        3.76987412e-02, 1.67630131e+01, 3.04912953e-01, 8.13010147e-02,
        9.59429018e+00, 8.40320733e+00, 1.76511797e+00, 5.56824566e+00,
        4.37820104e+00, 4.05799299e-02, 1.17225278e-14, 4.49640194e-04,
        4.15238016e-01, 2.38904715e+00, 3.92037231e-01, 3.00895880e-01,
        1.15113429e+00, 1.84962043e-02, 1.13310872e+01, 3.19865007e-02,
        1.33146664e+00, 1.17225278e-14, 7.37367981e-01, 2.60805278e+00,
        1.80543482e+00, 5.13084093e+00, 1.50195080e+00, 2.93209582e-01,
        1.10024986e-14, 1.55131923e-01, 1.79400140e-02, 2.91035818e-02,
        1.17908948e-14, 1.47549124e+00, 2.87836838e-01, 8.95931397e-02,
        7.90700934e-03, 2.39596599e-01, 1.50832891e-01, 6.06026164e+00,
        2.09272002e-02, 6.16003901e+00, 1.03352285e+01, 5.96460259e+00,
        4.97422981e-01, 3.20431392e-02, 1.62853856e-01, 8.11495902e-02,
        1.46954802e+00, 2.58340700e-01, 1.02442797e+01, 4.82643128e+01,
        9.05380860e-01, 3.55251458e+00, 6.84385919e+00, 7.62065588e-02,
        1.63411523e+01, 4.31723451e-01, 8.78546415e+00, 4.37197753e+00,
        2.78015599e+00, 6.47541931e-01, 1.55131923e-01, 4.75548295e+00,
        5.78467546e+00, 2.23540083e+01, 8.06437178e+01, 8.23282676e+00,
        7.54980272e+01, 4.01886144e+00, 3.61087305e+00, 1.17908948e-14,
        1.68439937e+00]),
 array([2.59555216e-304, 2.25998861e-125, 1.36324856e-004, 3.72497035e-007,
        2.72958981e-005, 3.92167544e-027, 8.93240893e-020, 2.40499120e-008,
        1.83322652e-014, 3.80432561e-003, 3.56723826e-014, 5.37348629e-024,
        6.66260033e-011, 1.88067529e-005, 2.66879538e-001, 2.30855729e-003,
        4.66500096e-175, 1.21389338e-006, 1.00000000e+000, 5.70214636e-001,
        9.64348257e-001, 1.49760516e-001, 9.33238416e-003, 1.46292880e-003,
        5.40106718e-002, 1.06071132e-001, 4.71281358e-001, 4.09018317e-002,
        1.00000000e+000, 1.00000000e+000, 1.00000000e+000, 1.82680568e-002,
        1.67114924e-006, 1.50523964e-005, 4.66321809e-001, 8.79749287e-001,
        4.27870460e-001, 2.05096753e-001, 3.57623485e-001, 1.00000000e+000,
        1.00000000e+000, 1.36135179e-001, 9.29937854e-002, 1.78811957e-002,
        4.49677232e-001, 1.38509149e-001, 1.00000000e+000, 1.86486737e-012,
        5.38836072e-002, 5.44602239e-002, 8.87482448e-002, 2.88418805e-001,
        5.88658101e-001, 2.00552663e-003, 1.98013441e-002, 4.50788872e-001,
        7.50952205e-005, 2.97379988e-001, 4.66860348e-001, 7.34363223e-001,
        4.20665313e-001, 5.96086542e-001, 5.80823116e-001, 1.17507078e-006,
        5.38426123e-001, 1.64355377e-001, 6.85074374e-003, 1.00000000e+000,
        7.59746766e-002, 7.34155313e-001, 9.02445111e-005, 3.01648276e-001,
        6.40950125e-001, 7.15814954e-001, 5.10564003e-005, 7.03999469e-002,
        3.62793654e-005, 2.27050691e-008, 5.13965144e-001, 1.83180440e-003,
        5.70018203e-002, 3.23409858e-001, 3.11443823e-001, 7.68165906e-001,
        7.69637696e-001, 5.01415482e-001, 5.36501189e-001, 9.36503419e-001,
        2.11036462e-001, 7.86070146e-001, 1.49760516e-001, 3.50302485e-002,
        2.23257953e-001, 1.99169608e-002, 3.37807260e-001, 2.53944555e-002,
        2.02222839e-001, 2.05603583e-003, 4.23881838e-002, 1.36135179e-001,
        1.65233649e-001, 5.87663334e-001, 6.23512211e-001, 5.78380517e-002,
        5.33419756e-001, 9.98646439e-002, 7.64715374e-001, 4.66860348e-001,
        8.14634264e-001, 1.00000000e+000, 1.28011628e-001, 1.23335563e-002,
        7.78460165e-003, 1.51030996e-002, 1.28789509e-003, 1.04559046e-002,
        4.20257221e-001, 3.95352840e-001, 1.84387822e-002, 1.60178468e-003,
        5.48309954e-001, 2.91238190e-002, 5.54783053e-003, 6.32774025e-003,
        5.88210504e-001, 9.86401039e-001, 6.44475380e-001, 6.20828247e-001,
        4.71696118e-003, 6.76523999e-001, 6.13975578e-001, 2.33206394e-002,
        6.20931206e-003, 8.43893551e-001, 7.88657028e-001, 2.42188255e-001,
        9.79646856e-004, 1.01148021e-003, 1.96364960e-001, 1.14665214e-003,
        2.30943745e-008, 3.67206815e-001, 4.44052599e-002, 1.47776148e-001,
        1.52768086e-001, 6.31747766e-003, 7.13043897e-011, 7.96868304e-002,
        2.75947286e-001, 6.17856417e-004, 8.32618780e-010, 1.00000000e+000,
        7.32630012e-004, 1.35423027e-009, 1.40684393e-010, 5.14936261e-001,
        1.81787908e-001, 7.38002801e-001, 1.72894077e-001, 9.18996427e-001,
        5.76516484e-001, 8.37685589e-001, 9.72524029e-001, 4.36988541e-004,
        1.44488764e-004, 1.61659245e-001, 9.53697334e-001, 8.92547967e-001,
        2.96720056e-016, 9.19974068e-001, 1.85299107e-001, 2.28081747e-001,
        2.13071976e-007, 1.66534602e-005, 9.65750403e-001, 4.74155136e-002,
        7.34155313e-001, 7.99351497e-002, 5.38426123e-001, 2.33052757e-024,
        4.89741735e-002, 7.18365991e-001, 1.66846886e-012, 9.66556338e-001,
        9.78799481e-002, 4.67801506e-002, 3.19538856e-004, 2.74049688e-002,
        4.10330036e-002, 1.25352106e-001, 8.11673260e-001, 1.00000000e+000,
        4.73312299e-001, 1.06227003e-004, 3.57963176e-001, 1.00000000e+000,
        8.64965773e-002, 3.26017856e-001, 1.69101902e-001, 7.42408008e-001,
        7.44101041e-001, 3.35997831e-003, 5.82786353e-001, 4.39291534e-007,
        6.40644764e-001, 7.57927232e-001, 1.44306743e-001, 8.71232572e-003,
        6.31116320e-001, 8.56371177e-002, 2.24356435e-002, 1.73998378e-008,
        1.24274760e-006, 2.78799487e-015, 3.75083437e-001, 8.41952973e-004,
        9.89534909e-003, 2.01031089e-003, 1.36566604e-001, 1.00000000e+000,
        2.22300243e-002, 1.00000000e+000, 6.90197029e-001, 9.20194612e-003,
        1.80777642e-005, 8.41406314e-019, 1.00000000e+000, 3.18027634e-005,
        1.30279664e-002, 3.60594670e-001, 7.05226526e-002, 6.62689073e-001,
        8.46062217e-001, 4.34475724e-005, 5.80859061e-001, 7.75561307e-001,
        1.96954440e-003, 3.77226357e-003, 1.84086557e-001, 1.83515737e-002,
        3.64836434e-002, 8.40364336e-001, 1.00000000e+000, 9.83083732e-001,
        5.19371128e-001, 1.22291559e-001, 5.31277217e-001, 5.83361372e-001,
        2.83396486e-001, 8.91829639e-001, 7.71614971e-004, 8.58069136e-001,
        2.48634641e-001, 1.00000000e+000, 3.90572616e-001, 1.06425405e-001,
        1.79156529e-001, 2.35740151e-002, 2.20465821e-001, 5.88210504e-001,
        1.00000000e+000, 6.93706032e-001, 8.93458567e-001, 8.64551377e-001,
        1.00000000e+000, 2.24575079e-001, 5.91649108e-001, 7.64715374e-001,
        9.29150226e-001, 6.24532733e-001, 6.97767720e-001, 1.38805295e-002,
        8.84986987e-001, 1.31201473e-002, 1.31873514e-003, 1.46521543e-002,
        4.80688353e-001, 8.57944868e-001, 6.86570732e-001, 7.75764815e-001,
        2.25510669e-001, 6.11299077e-001, 1.38513152e-003, 4.53217998e-012,
        3.41418188e-001, 5.95500342e-002, 8.93839498e-003, 7.82524204e-001,
        5.42072988e-005, 5.11193933e-001, 3.05984748e-003, 3.66169176e-002,
        9.55412662e-002, 4.21055433e-001, 6.93706032e-001, 2.92810628e-002,
        1.62255377e-002, 2.36994467e-006, 4.61750552e-019, 4.14216033e-003,
        5.85571089e-018, 4.50820788e-002, 5.74969452e-002, 1.00000000e+000,
        1.94438788e-001]))
In [62]:
p_values = f_regression(x_train, y_train)[1].round(3)
p_values
Out[62]:
array([0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ,
       0.004, 0.   , 0.   , 0.   , 0.   , 0.267, 0.002, 0.   , 0.   ,
       1.   , 0.57 , 0.964, 0.15 , 0.009, 0.001, 0.054, 0.106, 0.471,
       0.041, 1.   , 1.   , 1.   , 0.018, 0.   , 0.   , 0.466, 0.88 ,
       0.428, 0.205, 0.358, 1.   , 1.   , 0.136, 0.093, 0.018, 0.45 ,
       0.139, 1.   , 0.   , 0.054, 0.054, 0.089, 0.288, 0.589, 0.002,
       0.02 , 0.451, 0.   , 0.297, 0.467, 0.734, 0.421, 0.596, 0.581,
       0.   , 0.538, 0.164, 0.007, 1.   , 0.076, 0.734, 0.   , 0.302,
       0.641, 0.716, 0.   , 0.07 , 0.   , 0.   , 0.514, 0.002, 0.057,
       0.323, 0.311, 0.768, 0.77 , 0.501, 0.537, 0.937, 0.211, 0.786,
       0.15 , 0.035, 0.223, 0.02 , 0.338, 0.025, 0.202, 0.002, 0.042,
       0.136, 0.165, 0.588, 0.624, 0.058, 0.533, 0.1  , 0.765, 0.467,
       0.815, 1.   , 0.128, 0.012, 0.008, 0.015, 0.001, 0.01 , 0.42 ,
       0.395, 0.018, 0.002, 0.548, 0.029, 0.006, 0.006, 0.588, 0.986,
       0.644, 0.621, 0.005, 0.677, 0.614, 0.023, 0.006, 0.844, 0.789,
       0.242, 0.001, 0.001, 0.196, 0.001, 0.   , 0.367, 0.044, 0.148,
       0.153, 0.006, 0.   , 0.08 , 0.276, 0.001, 0.   , 1.   , 0.001,
       0.   , 0.   , 0.515, 0.182, 0.738, 0.173, 0.919, 0.577, 0.838,
       0.973, 0.   , 0.   , 0.162, 0.954, 0.893, 0.   , 0.92 , 0.185,
       0.228, 0.   , 0.   , 0.966, 0.047, 0.734, 0.08 , 0.538, 0.   ,
       0.049, 0.718, 0.   , 0.967, 0.098, 0.047, 0.   , 0.027, 0.041,
       0.125, 0.812, 1.   , 0.473, 0.   , 0.358, 1.   , 0.086, 0.326,
       0.169, 0.742, 0.744, 0.003, 0.583, 0.   , 0.641, 0.758, 0.144,
       0.009, 0.631, 0.086, 0.022, 0.   , 0.   , 0.   , 0.375, 0.001,
       0.01 , 0.002, 0.137, 1.   , 0.022, 1.   , 0.69 , 0.009, 0.   ,
       0.   , 1.   , 0.   , 0.013, 0.361, 0.071, 0.663, 0.846, 0.   ,
       0.581, 0.776, 0.002, 0.004, 0.184, 0.018, 0.036, 0.84 , 1.   ,
       0.983, 0.519, 0.122, 0.531, 0.583, 0.283, 0.892, 0.001, 0.858,
       0.249, 1.   , 0.391, 0.106, 0.179, 0.024, 0.22 , 0.588, 1.   ,
       0.694, 0.893, 0.865, 1.   , 0.225, 0.592, 0.765, 0.929, 0.625,
       0.698, 0.014, 0.885, 0.013, 0.001, 0.015, 0.481, 0.858, 0.687,
       0.776, 0.226, 0.611, 0.001, 0.   , 0.341, 0.06 , 0.009, 0.783,
       0.   , 0.511, 0.003, 0.037, 0.096, 0.421, 0.694, 0.029, 0.016,
       0.   , 0.   , 0.004, 0.   , 0.045, 0.057, 1.   , 0.194])
  • If a variable has a p-value > 0.05, we can disregard it as redundant (not significant); note, however, that f_regression runs a separate univariate test per feature, so even a low p-value does not tell us how useful or important that variable is in the full model
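A minimal sketch of using f_regression as a crude filter (toy data, not the car dataset): the first feature drives the target, the second is pure noise.

```python
import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(42)
x = rng.normal(size=(200, 2))
y = 3.0 * x[:, 0] + rng.normal(size=200)  # only feature 0 matters

F, p = f_regression(x, y)
keep = p < 0.05  # crude univariate filter
print(bool(keep[0]))  # True: the informative feature survives
```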

Finding the weights and bias

In [43]:
# Obtain the bias (intercept) of the regression
reg.intercept_
Out[43]:
295271601566.04913
In [44]:
# Obtain the weights (coefficients) of the regression
reg.coef_
Out[44]:
array([-3.42013244e-01,  4.71096963e-02, -2.73881723e+12,  1.55135632e+12,
        2.14702126e+12, -1.05825320e+11, -3.81614744e+12,  4.61157477e+12,
       -6.64366447e-02, -2.47802734e-02, -9.10339355e-02, -7.34863281e-02,
       -5.18798828e-02, -9.10644531e-02, -2.86865234e-02, -1.19873047e-01,
        3.11401367e-01, -4.64764931e+11,  1.12852481e+13,  3.52325439e-02,
        4.17938232e-02, -8.01467896e-03, -1.62354331e+11, -3.68549829e+11,
       -3.68549829e+11, -2.60672259e+11, -2.60672259e+11, -2.60672259e+11,
        8.03730594e+12,  3.00664832e+12,  8.80732154e+12,  1.95312500e-02,
        5.31005859e-02,  1.06018066e-01,  1.80358887e-02,  1.82495117e-02,
        2.32086182e-02,  4.77905273e-02,  3.47900391e-02,  1.16416367e+13,
        5.72933076e+12,  6.34224883e+10,  1.84020996e-02,  4.48608398e-02,
        1.48162842e-01,  5.04455566e-02, -1.41364946e+13,  6.83898926e-02,
        5.41381836e-02,  1.10443115e-01,  7.21130371e-02,  2.02636719e-02,
        2.65655518e-02,  5.96313477e-02,  4.29687500e-02,  2.73132324e-02,
        6.53076172e-02,  1.47094727e-02,  1.75781250e-02,  9.79614258e-02,
        4.42810059e-02,  4.67834473e-02,  3.25927734e-02,  1.10473633e-01,
        3.56445312e-02, -2.94290033e+11, -1.14816858e+11, -1.31035825e+12,
       -3.19215266e+11, -2.60672259e+11, -4.11997326e+11, -1.84347221e+11,
       -2.08176175e+11, -3.98209724e+11, -8.77305737e+11, -2.40349700e+11,
       -4.64764931e+11, -1.45721766e+12, -5.62341120e+11, -2.08176175e+11,
       -6.55982081e+11, -6.09371044e+11, -5.18955241e+11,  3.34413694e+11,
        8.96812178e+10,  3.40288173e+11,  8.96812178e+10, -1.84347221e+11,
       -4.87354214e+11, -3.19215266e+11, -4.23780595e+11, -7.92302407e+11,
       -1.32045467e+12, -1.84347221e+11, -4.11997326e+11, -3.19215266e+11,
       -1.84347221e+11, -3.19215266e+11, -3.19215266e+11, -1.84347221e+11,
       -1.84347221e+11, -1.84347221e+11, -1.84347221e+11, -2.60672259e+11,
       -1.84347221e+11, -2.60672259e+11, -1.84347221e+11, -1.84347221e+11,
       -3.19215266e+11, -5.21922550e+12, -4.11997326e+11, -1.84347221e+11,
       -2.60672259e+11, -2.60672259e+11, -3.09814202e+12,  7.02864459e+11,
       -1.62354331e+11, -6.69793066e+11,  1.09822249e+11, -6.58109797e+11,
        8.96812178e+10, -3.03538503e+11, -7.46030944e+11,  4.27614451e+11,
        6.34224883e+10, -2.99697311e+11, -1.62354331e+11, -3.24453822e+11,
       -2.42107883e+12, -3.52004687e+11, -4.23780595e+11, -2.29543646e+11,
        1.67668472e+11, -3.62655377e+11,  6.34224883e+10, -1.84347221e+11,
       -5.52463018e+11, -4.11997326e+11, -1.84347221e+11, -5.52463018e+11,
       -5.20935354e+11, -1.84347221e+11, -3.68549829e+11, -2.60672259e+11,
       -1.84347221e+11, -2.60672259e+11, -5.52463018e+11, -2.60672259e+11,
       -1.84347221e+11, -2.60672259e+11, -1.26518193e+12,  3.05418030e+12,
       -5.99159519e+11, -1.26867928e+12, -1.46379009e+12, -6.69793066e+11,
       -1.07887767e+12, -1.11945745e+12, -9.46610482e+11, -7.33625524e+11,
       -3.80306328e+11, -4.31059562e+11,  1.09822249e+11,  2.28314049e+11,
        2.10073452e+11,  2.33154297e-02,  6.34224883e+10, -1.83850117e+12,
       -1.29635919e+12, -2.56604022e+11, -7.86282645e+11, -5.99159519e+11,
       -5.94574757e+11, -1.73159387e+12, -4.31059562e+11, -1.73159387e+12,
       -2.48937444e+11,  2.53192033e+11,  6.34224883e+10,  3.57314968e+11,
        6.34224883e+10,  1.26795224e+11,  5.43915858e+11, -1.14816858e+11,
       -4.28874675e+11, -4.23780595e+11,  8.14819336e-02,  2.60620117e-02,
       -1.84347221e+11, -3.19215266e+11, -3.19215266e+11, -1.08471062e+13,
       -5.52463018e+11, -8.42573614e+11, -3.19215266e+11,  1.25047199e+13,
       -2.60672259e+11, -1.84347221e+11, -3.68549829e+11, -3.44090176e+11,
        6.34224883e+10, -1.04687121e+12, -1.14816858e+11, -1.33695416e+12,
       -4.23780595e+11, -1.28911314e+12, -1.19042376e+12, -5.56349917e+11,
       -2.48937444e+11, -1.26518193e+12, -1.71406032e+12, -1.15859655e+12,
       -1.07887767e+12, -2.18700536e+12, -2.32227974e+12, -2.06356770e+12,
       -6.69793066e+11, -1.33695416e+12, -7.92302407e+11,  5.35460088e+12,
       -2.28449884e+12,  3.65780024e+12,  1.26795224e+11, -2.68683977e+11,
       -4.49064487e+11, -9.00987454e+11, -1.04198805e+13, -2.40349700e+11,
        4.45582938e+11, -3.68549829e+11, -1.84347221e+11, -1.84347221e+11,
       -6.10610812e+11, -7.12665902e+11, -3.68549829e+11, -2.60672259e+11,
       -1.13088296e+12, -6.10610812e+11, -4.11997326e+11, -3.19215266e+11,
       -1.84347221e+11, -1.69997364e+11, -8.60938731e+11, -2.08176175e+11,
       -1.84347221e+11, -1.84347221e+11, -2.60672259e+11, -1.84347221e+11,
       -2.81058837e+11, -1.14816858e+11, -5.37058424e+11,  6.34224883e+10,
       -7.33625524e+11, -1.31926154e+13, -9.92683355e+11,  8.96812178e+10,
       -1.84347221e+11, -5.56349917e+11, -2.48937444e+11, -1.84347221e+11,
       -2.32713874e+12, -1.84347221e+11, -3.19215266e+11, -2.60672259e+11,
        1.26092758e+12, -3.68549829e+11, -6.63629916e+11, -1.84347221e+11,
       -2.60672259e+11, -2.60672259e+11, -1.84347221e+11, -3.62655377e+11,
       -2.99697311e+11, -2.99697311e+11, -1.81438853e+12, -7.92302407e+11,
       -2.20686993e+12, -1.76513546e+12, -6.69793066e+11, -7.92302407e+11,
       -2.68683977e+11,  1.09822249e+11, -1.23309652e+12, -2.39593742e+12,
       -1.30327290e+12, -9.51922846e+11,  1.41742823e+11, -4.23780595e+11,
       -4.51261397e+11, -1.84347221e+11, -7.33625524e+11,  1.67668472e+11,
       -4.11997326e+11, -2.48937444e+11,  6.34224883e+10, -2.24146816e+12,
        6.28662109e-02,  7.95288086e-02,  2.36419678e-01,  6.57958984e-02,
        1.62353516e-01,  4.72412109e-02,  2.36901615e+11, -5.50521900e+11,
        3.82080078e-02])
  • Note that many of these weights are barely interpretable, if at all — the absurdly large values on the dummy coefficients (on the order of 1e12) are a symptom of near-perfect multicollinearity among the many dummy columns
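For the weights that are interpretable, remember the target is log price, so a coefficient w on a standardized feature acts multiplicatively: one standard deviation more of that feature multiplies the predicted price by exp(w). For example, with the Mileage weight of roughly -0.34 from the regression summary:

```python
import numpy as np

w_mileage = -0.34  # Mileage weight from the regression summary
factor = np.exp(w_mileage)
print(round(factor, 3))  # 0.712 -> one std more mileage ~ 29% lower price
```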
In [64]:
# Create a regression summary where we can compare them with one-another
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary['p-values'] = p_values
reg_summary
Out[64]:
Features Weights p-values
0 Mileage -0.34 0.00
1 EngineV 0.05 0.00
2 Brand_BMW -2738817232235.43 0.00
3 Brand_Mercedes-Benz 1551356317399.53 0.00
4 Brand_Mitsubishi 2147021258533.68 0.00
5 Brand_Renault -105825320085.92 0.00
6 Brand_Toyota -3816147441217.25 0.00
7 Brand_Volkswagen 4611574773601.37 0.00
8 Body_hatch -0.07 0.00
9 Body_other -0.02 0.00
10 Body_sedan -0.09 0.00
11 Body_vagon -0.07 0.00
12 Body_van -0.05 0.00
13 Engine Type_Gas -0.09 0.00
14 Engine Type_Other -0.03 0.27
15 Engine Type_Petrol -0.12 0.00
16 Registration_yes 0.31 0.00
17 Model_100 -464764931410.55 0.00
18 Model_11 11285248119855.59 1.00
19 Model_116 0.04 0.57
20 Model_118 0.04 0.96
21 Model_120 -0.01 0.15
22 Model_19 -162354330815.04 0.01
23 Model_190 -368549828573.53 0.00
24 Model_200 -368549828573.51 0.05
25 Model_210 -260672259336.75 0.11
26 Model_220 -260672259336.73 0.47
27 Model_230 -260672259336.76 0.04
28 Model_25 8037305940739.12 1.00
29 Model_250 3006648316611.04 1.00
30 Model_300 8807321538408.15 1.00
31 Model_316 0.02 0.02
32 Model_318 0.05 0.00
33 Model_320 0.11 0.00
34 Model_323 0.02 0.47
35 Model_325 0.02 0.88
36 Model_328 0.02 0.43
37 Model_330 0.05 0.20
38 Model_335 0.03 0.36
39 Model_4 Series Gran Coupe 11641636718966.64 1.00
40 Model_428 5729330763983.54 1.00
41 Model_4Runner 63422488324.06 0.14
42 Model_5 Series 0.02 0.09
43 Model_5 Series GT 0.04 0.02
44 Model_520 0.15 0.45
45 Model_523 0.05 0.14
46 Model_524 -14136494624408.63 1.00
47 Model_525 0.07 0.00
48 Model_528 0.05 0.05
49 Model_530 0.11 0.05
50 Model_535 0.07 0.09
51 Model_540 0.02 0.29
52 Model_545 0.03 0.59
53 Model_550 0.06 0.00
54 Model_6 Series Gran Coupe 0.04 0.02
55 Model_630 0.03 0.45
56 Model_640 0.07 0.00
57 Model_645 0.01 0.30
58 Model_650 0.02 0.47
59 Model_730 0.10 0.73
60 Model_735 0.04 0.42
61 Model_740 0.05 0.60
62 Model_745 0.03 0.58
63 Model_750 0.11 0.00
64 Model_760 0.04 0.54
65 Model_80 -294290033141.79 0.16
66 Model_9 -114816857995.02 0.01
67 Model_90 -1310358246676.97 1.00
68 Model_A 140 -319215266151.05 0.08
69 Model_A 150 -260672259336.75 0.73
70 Model_A 170 -411997326181.60 0.00
71 Model_A 180 -184347221495.45 0.30
72 Model_A1 -208176174524.81 0.64
73 Model_A3 -398209723987.15 0.72
74 Model_A4 -877305736644.15 0.00
75 Model_A4 Allroad -240349699935.40 0.07
76 Model_A5 -464764931410.48 0.00
77 Model_A6 -1457217663414.56 0.00
78 Model_A6 Allroad -562341120334.42 0.51
79 Model_A7 -208176174524.78 0.00
80 Model_A8 -655982081396.48 0.06
81 Model_ASX -609371043825.39 0.32
82 Model_Amarok -518955241391.95 0.31
83 Model_Auris 334413694097.35 0.77
84 Model_Avalon 89681217813.18 0.77
85 Model_Avensis 340288172891.43 0.50
86 Model_Aygo 89681217813.16 0.54
87 Model_B 170 -184347221495.46 0.94
88 Model_B 180 -487354214082.32 0.21
89 Model_B 200 -319215266151.04 0.79
90 Model_Beetle -423780594516.67 0.15
91 Model_Bora -792302407281.55 0.04
92 Model_C-Class -1320454668086.05 0.22
93 Model_CL 180 -184347221495.46 0.02
94 Model_CL 500 -411997326181.57 0.34
95 Model_CL 55 AMG -319215266151.03 0.03
96 Model_CL 550 -184347221495.45 0.20
97 Model_CL 63 AMG -319215266151.01 0.00
98 Model_CLA 200 -319215266151.02 0.04
99 Model_CLA 220 -184347221495.45 0.14
100 Model_CLA-Class -184347221495.44 0.17
101 Model_CLC 180 -184347221495.46 0.59
102 Model_CLC 200 -184347221495.46 0.62
103 Model_CLK 200 -260672259336.75 0.06
104 Model_CLK 220 -184347221495.45 0.53
105 Model_CLK 230 -260672259336.75 0.10
106 Model_CLK 240 -184347221495.45 0.77
107 Model_CLK 280 -184347221495.46 0.47
108 Model_CLK 320 -319215266151.03 0.81
109 Model_CLK 430 -5219225502074.14 1.00
110 Model_CLS 350 -411997326181.57 0.13
111 Model_CLS 400 -184347221495.44 0.01
112 Model_CLS 500 -260672259336.72 0.01
113 Model_CLS 63 AMG -260672259336.72 0.01
114 Model_Caddy -3098142015756.04 0.00
115 Model_Camry 702864459329.49 0.01
116 Model_Captur -162354330815.01 0.42
117 Model_Caravelle -669793066100.54 0.40
118 Model_Carina 109822249156.14 0.02
119 Model_Carisma -658109797083.30 0.00
120 Model_Celica 89681217813.18 0.55
121 Model_Clio -303538502710.41 0.03
122 Model_Colt -746030943953.17 0.01
123 Model_Corolla 427614450664.11 0.01
124 Model_Corolla Verso 63422488324.05 0.59
125 Model_Cross Touran -299697310797.93 0.99
126 Model_Dokker -162354330815.01 0.64
127 Model_Duster -324453821568.56 0.62
128 Model_E-Class -2421078830141.75 0.01
129 Model_Eclipse -352004687228.10 0.68
130 Model_Eos -423780594516.67 0.61
131 Model_Espace -229543645888.08 0.02
132 Model_FJ Cruiser 167668472036.50 0.01
133 Model_Fluence -362655377266.19 0.84
134 Model_Fortuner 63422488324.05 0.79
135 Model_G 320 -184347221495.45 0.24
136 Model_G 500 -552463017543.17 0.00
137 Model_G 55 AMG -411997326181.55 0.00
138 Model_G 63 AMG -184347221495.45 0.20
139 Model_GL 320 -552463017543.18 0.00
140 Model_GL 350 -520935353520.54 0.00
141 Model_GL 420 -184347221495.45 0.37
142 Model_GL 450 -368549828573.50 0.04
143 Model_GL 500 -260672259336.73 0.15
144 Model_GL 550 -184347221495.44 0.15
145 Model_GLC-Class -260672259336.73 0.01
146 Model_GLE-Class -552463017543.17 0.00
147 Model_GLK 220 -260672259336.73 0.08
148 Model_GLK 300 -184347221495.46 0.28
149 Model_GLS 350 -260672259336.71 0.00
150 Model_Galant -1265181926202.12 0.00
151 Model_Golf GTI 3054180296414.70 1.00
152 Model_Golf II -599159518773.91 0.00
153 Model_Golf III -1268679284745.92 0.00
154 Model_Golf IV -1463790092600.70 0.00
155 Model_Golf Plus -669793066100.56 0.52
156 Model_Golf V -1078877671612.73 0.18
157 Model_Golf VI -1119457447861.46 0.74
158 Model_Golf VII -946610482215.27 0.17
159 Model_Golf Variant -733625524327.85 0.92
160 Model_Grand Scenic -380306328223.65 0.58
161 Model_Grandis -431059561941.23 0.84
162 Model_Hiace 109822249156.17 0.97
163 Model_Highlander 228314049094.30 0.00
164 Model_Hilux 210073451603.96 0.00
165 Model_I3 0.02 0.16
166 Model_IQ 63422488324.05 0.95
167 Model_Jetta -1838501166431.22 0.89
168 Model_Kangoo -1296359186114.97 0.00
169 Model_Koleos -256604022077.30 0.92
170 Model_L 200 -786282644912.17 0.18
171 Model_LT -599159518773.90 0.23
172 Model_Laguna -594574757180.82 0.00
173 Model_Lancer -1731593868374.51 0.00
174 Model_Lancer Evolution -431059561941.24 0.97
175 Model_Lancer X -1731593868374.49 0.05
176 Model_Lancer X Sportback -248937444317.97 0.73
177 Model_Land Cruiser 100 253192033378.95 0.08
178 Model_Land Cruiser 105 63422488324.06 0.54
179 Model_Land Cruiser 200 357314968491.15 0.00
180 Model_Land Cruiser 76 63422488324.06 0.05
181 Model_Land Cruiser 80 126795223762.63 0.72
182 Model_Land Cruiser Prado 543915858302.91 0.00
183 Model_Latitude -114816857994.99 0.97
184 Model_Logan -428874675185.92 0.10
185 Model_Lupo -423780594516.68 0.05
186 Model_M5 0.08 0.00
187 Model_M6 0.03 0.03
188 Model_MB -184347221495.48 0.04
189 Model_ML 250 -319215266151.01 0.12
190 Model_ML 270 -319215266151.04 0.81
191 Model_ML 280 -10847106237087.32 1.00
192 Model_ML 320 -552463017543.21 0.47
193 Model_ML 350 -842573613553.34 0.00
194 Model_ML 400 -319215266151.04 0.36
195 Model_ML 430 12504719949947.73 1.00
196 Model_ML 500 -260672259336.72 0.09
197 Model_ML 550 -184347221495.45 0.33
198 Model_ML 63 AMG -368549828573.51 0.17
199 Model_Master -344090175692.22 0.74
200 Model_Matrix 63422488324.05 0.74
201 Model_Megane -1046871207206.31 0.00
202 Model_Modus -114816857994.99 0.58
203 Model_Multivan -1336954157429.41 0.00
204 Model_New Beetle -423780594516.69 0.64
205 Model_Outlander -1289113141512.88 0.76
206 Model_Outlander XL -1190423762981.27 0.14
207 Model_Pajero -556349917365.08 0.01
208 Model_Pajero Pinin -248937444317.96 0.63
209 Model_Pajero Sport -1265181926202.07 0.09
210 Model_Pajero Wagon -1714060324365.40 0.02
211 Model_Passat B3 -1158596547953.80 0.00
212 Model_Passat B4 -1078877671612.76 0.00
213 Model_Passat B5 -2187005363961.35 0.00
214 Model_Passat B6 -2322279743819.72 0.38
215 Model_Passat B7 -2063567701376.43 0.00
216 Model_Passat B8 -669793066100.53 0.01
217 Model_Passat CC -1336954157429.43 0.00
218 Model_Phaeton -792302407281.51 0.14
219 Model_Pointer 5354600875148.19 1.00
220 Model_Polo -2284498841685.09 0.02
221 Model_Previa 3657800243375.29 1.00
222 Model_Prius 126795223762.62 0.69
223 Model_Q3 -268683977157.74 0.01
224 Model_Q5 -449064486590.42 0.00
225 Model_Q7 -900987454449.04 0.00
226 Model_R 320 -10419880498281.34 1.00
227 Model_R8 -240349699935.38 0.00
228 Model_Rav 4 445582937669.98 0.01
229 Model_S 140 -368549828573.50 0.36
230 Model_S 250 -184347221495.45 0.07
231 Model_S 300 -184347221495.46 0.66
232 Model_S 320 -610610812292.25 0.85
233 Model_S 350 -712665902409.43 0.00
234 Model_S 400 -368549828573.51 0.58
235 Model_S 430 -260672259336.73 0.78
236 Model_S 500 -1130882959361.13 0.00
237 Model_S 550 -610610812292.23 0.00
238 Model_S 600 -411997326181.58 0.18
239 Model_S 63 AMG -319215266151.01 0.02
240 Model_S 65 AMG -184347221495.44 0.04
241 Model_S4 -169997363872.27 0.84
242 Model_S5 -860938730529.01 1.00
243 Model_S8 -208176174524.80 0.98
244 Model_SL 500 (550) -184347221495.46 0.52
245 Model_SL 55 AMG -184347221495.45 0.12
246 Model_SLK 200 -260672259336.74 0.53
247 Model_SLK 350 -184347221495.47 0.58
248 Model_Sandero -281058837077.96 0.28
249 Model_Sandero StepWay -114816857995.00 0.89
250 Model_Scenic -537058423742.73 0.00
251 Model_Scion 63422488324.05 0.86
252 Model_Scirocco -733625524327.83 0.25
253 Model_Sequoia -13192615423932.46 1.00
254 Model_Sharan -992683355374.79 0.39
255 Model_Sienna 89681217813.19 0.11
256 Model_Smart -184347221495.48 0.18
257 Model_Space Star -556349917365.09 0.02
258 Model_Space Wagon -248937444317.97 0.22
259 Model_Sprinter 208 -184347221495.46 0.59
260 Model_Sprinter 210 -2327138740717.86 1.00
261 Model_Sprinter 211 -184347221495.46 0.69
262 Model_Sprinter 212 -319215266151.04 0.89
263 Model_Sprinter 213 -260672259336.74 0.86
264 Model_Sprinter 311 1260927576142.38 1.00
265 Model_Sprinter 312 -368549828573.51 0.23
266 Model_Sprinter 313 -663629915683.12 0.59
267 Model_Sprinter 315 -184347221495.46 0.77
268 Model_Sprinter 316 -260672259336.74 0.93
269 Model_Sprinter 318 -260672259336.72 0.62
270 Model_Sprinter 319 -184347221495.45 0.70
271 Model_Symbol -362655377266.21 0.01
272 Model_Syncro -299697310797.93 0.89
273 Model_T3 (Transporter) -299697310797.96 0.01
274 Model_T4 (Transporter) -1814388526623.73 0.00
275 Model_T4 (Transporter) -792302407281.55 0.01
276 Model_T5 (Transporter) -2206869930228.97 0.48
277 Model_T5 (Transporter) -1765135464196.19 0.86
278 Model_T6 (Transporter) -669793066100.56 0.69
279 Model_T6 (Transporter) -792302407281.54 0.78
280 Model_TT -268683977157.75 0.23
281 Model_Tacoma 109822249156.18 0.61
282 Model_Tiguan -1233096523357.24 0.00
283 Model_Touareg -2395937424772.16 0.00
284 Model_Touran -1303272901299.11 0.34
285 Model_Trafic -951922846399.65 0.06
286 Model_Tundra 141742823121.09 0.01
287 Model_Up -423780594516.68 0.78
288 Model_V 250 -451261396600.11 0.00
289 Model_Vaneo -184347221495.46 0.51
290 Model_Vento -733625524327.88 0.00
291 Model_Venza 167668472036.50 0.04
292 Model_Viano -411997326181.57 0.10
293 Model_Virage -248937444317.96 0.42
294 Model_Vista 63422488324.05 0.69
295 Model_Vito -2241468155278.27 0.03
296 Model_X1 0.06 0.02
297 Model_X3 0.08 0.00
298 Model_X5 0.24 0.00
299 Model_X5 M 0.07 0.00
300 Model_X6 0.16 0.00
301 Model_X6 M 0.05 0.04
302 Model_Yaris 236901614923.49 0.06
303 Model_Z3 -550521900025.61 1.00
304 Model_Z4 0.04 0.19
  • The bigger the weight (in absolute terms), the bigger the impact on log_price
  • Mileage seems to have the largest impact, which makes sense
  • All of our numerical variables have p-values of 0.000, so they are statistically significant and should be kept in the model

    Continuous variables:
  • Positive weight: positive correlation between log_price and feature
  • Negative weight: negative correlation between log_price and feature
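Since the target is log_price, a weight is most naturally read on the multiplicative scale: holding everything else equal, a one-unit increase in a feature multiplies the predicted price by exp(weight). A minimal sketch (the weight value below is hypothetical, not taken from the table):

```python
import numpy as np

# Hypothetical weight on a (standardized) continuous feature, e.g. Mileage
w = -0.45

# With log_price as the target, a one-unit increase in the feature
# multiplies the predicted price by exp(w), all else equal
multiplier = np.exp(w)
pct_change = (multiplier - 1) * 100
print(f"price multiplier: {multiplier:.3f} ({pct_change:.1f}% change)")
```

This is why a negative weight corresponds to a negative relationship: exp(w) < 1, so the predicted price shrinks.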
In [46]:
# Check the different categories in the 'Brand' variable
data_cleaned['Brand'].unique()
Out[46]:
array(['BMW', 'Mercedes-Benz', 'Audi', 'Toyota', 'Renault', 'Volkswagen',
       'Mitsubishi'], dtype=object)
  • In this way, we can see which 'Brand' is actually the benchmark
  • Looking at the weights table, we can see that Audi was the dropped brand
  • Therefore, when all the brand dummies are 0, the observation is an Audi
  • Therefore, Audi is the benchmark

    Dummy variables:
  • Positive weight: respective category (Brand) is more expensive than the benchmark (Audi)
  • Negative weight: respective category (Brand) is less expensive than the benchmark (Audi)

  • For example, BMW is more expensive than Audi, while Mitsubishi is less expensive than Audi
  • Clearly, we cannot directly compare the weights of continuous variables and dummy variables, as dummy weights are only interpretable relative to their respective benchmarks
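The benchmark logic can be illustrated on a toy frame (this is a sketch, not the actual dataset): pd.get_dummies with drop_first=True drops the first category alphabetically, and a row of all zeros then encodes that dropped category.

```python
import pandas as pd

# Toy data: 'Audi' is first alphabetically, so drop_first=True drops it
brands = pd.DataFrame({'Brand': ['Audi', 'BMW', 'Mitsubishi', 'Audi']})
dummies = pd.get_dummies(brands, drop_first=True)

print(dummies.columns.tolist())  # ['Brand_BMW', 'Brand_Mitsubishi']

# The first row is an Audi, so all of its dummies are 0 -> the benchmark
print(dummies.iloc[0].sum())
```

Each dummy weight then measures the price difference relative to that all-zeros benchmark category.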
In [47]:
# Check the different categories in the 'Body' variable
data_cleaned['Body'].unique()
Out[47]:
array(['sedan', 'van', 'crossover', 'vagon', 'other', 'hatch'],
      dtype=object)
  • 'crossover' was the dropped category and therefore the benchmark
  • It seems that, generally, crossover is the most expensive type of body, which makes sense
In [48]:
# Check the different categories in the 'Engine Type' variable
data_cleaned['Engine Type'].unique()
Out[48]:
array(['Petrol', 'Diesel', 'Gas', 'Other'], dtype=object)
  • 'Diesel' was the dropped category and therefore the benchmark
  • It seems that, generally, diesel is the most expensive type of engine, which makes sense

Testing

  • Now that we have trained and fine-tuned our model, we can proceed to testing it
  • Our test inputs are 'x_test' and the test outputs are 'y_test'; the algorithm has not seen them yet
  • We will feed the model these inputs and compare the predictions with the targets
  • If the predictions are far off, we will know that our model has overfitted
In [49]:
# Find the predicted values using inputs from the test data
y_hat_test = reg.predict(x_test)
In [50]:
# Create a scatter plot with the test targets and the test predictions
# Including the argument 'alpha' introduces proportional opacity to the graph
# The more saturated the color, the higher the concentration of points (like a heat map)
plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets (y_test)',size=18)
plt.ylabel('Predictions (y_hat_test)',size=18)
plt.xlim(6,13)
plt.ylim(6,13)
plt.show()
  • Overall, the points seem to hover around the 45-degree line, which is good
  • For higher prices, we have a higher concentration of values around the 45-degree line. Therefore, our model is very good at predicting higher prices.
  • However, for lower prices, the values are much more scattered, meaning that we are not getting as many of the prices right.
  • From the opacity, we can see that most of the points are indeed very close to the 45-degree line
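The visual impression from the scatter plot can be summarized with a single number: reg.score(x_test, y_test) returns the test R-squared. A minimal sketch on synthetic data (the notebook's reg and x_test are assumed, so toy stand-ins are used here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = x @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=100)

# Fit on the first 80 rows, score on the held-out 20
reg_toy = LinearRegression().fit(x[:80], y[:80])
print(f"test R^2: {reg_toy.score(x[80:], y[80:]):.3f}")
```

A test R-squared far below the training one would be another sign of overfitting, complementing the scatter plot.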
In [51]:
# Finally, let's manually check these predictions
# To obtain the actual prices, we take the exponential of the log_price
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf.head()
/Users/Sujay/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: RuntimeWarning: overflow encountered in exp
  This is separate from the ipykernel package so we can avoid doing imports until
Out[51]:
Prediction
0 18159.439748
1 3061.537021
2 13747.681742
3 12996.990953
4 18445.409313
In [52]:
# We can also include the test targets in that data frame (so we can manually compare them)
df_pf['Target'] = np.exp(y_test)
df_pf
Out[52]:
Prediction Target
0 18159.439748 NaN
1 3061.537021 7900.0
2 13747.681742 NaN
3 12996.990953 NaN
4 18445.409313 NaN
... ... ...
761 7871.534480 NaN
762 33011.195840 2700.0
763 9113.322891 NaN
764 5935.946300 NaN
765 56318.690432 6950.0

766 rows × 2 columns

  • Note that we have a lot of missing values
  • There is no reason to have ANY missing values, though
  • This suggests that something is wrong with the data frame / indexing
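The mechanism behind those NaNs can be reproduced on toy data: pandas aligns a Series to a DataFrame column by index labels, not by position, so a y_test that kept its pre-split labels cannot align with a freshly built 0..n-1 index (the numbers below are made up for illustration).

```python
import numpy as np
import pandas as pd

# After a train/test split, y_test keeps its original row labels
y_test_toy = pd.Series([1.0, 2.0, 3.0], index=[2533, 3654, 3105])

# The predictions frame was built with a fresh 0..n-1 index
df_toy = pd.DataFrame({'Prediction': [1.1, 1.9, 3.2]})

# Assignment aligns on labels, not position -> no labels match -> all NaN
df_toy['Target'] = np.exp(y_test_toy)
print(df_toy['Target'].isna().all())

# Resetting the index restores positional alignment
df_toy['Target'] = np.exp(y_test_toy.reset_index(drop=True))
print(df_toy['Target'].notna().all())
```

This is exactly the fix applied in the next cell: reset the index and drop the old one.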
In [53]:
y_test
Out[53]:
2533    10.114559
3654     8.294050
3105     9.814656
1481     9.655026
1335     9.560997
          ...    
2306     8.839277
542     10.184900
1017     9.609049
1325     8.517193
33      11.119883
Name: log_price, Length: 766, dtype: float64
  • After displaying y_test, we find the issue: the old indices from before the train/test split were preserved
  • Therefore, to get a proper result, we must reset the index and drop the old one
In [54]:
y_test = y_test.reset_index(drop=True)

# Check the result
y_test.head()
Out[54]:
0    10.114559
1     8.294050
2     9.814656
3     9.655026
4     9.560997
Name: log_price, dtype: float64
In [55]:
# Let's overwrite the 'Target' column with the appropriate values
# Again, we need the exponential of the test log price
df_pf['Target'] = np.exp(y_test)
df_pf
Out[55]:
Prediction Target
0 18159.439748 24700.0
1 3061.537021 4000.0
2 13747.681742 18300.0
3 12996.990953 15600.0
4 18445.409313 14200.0
... ... ...
761 7871.534480 6900.0
762 33011.195840 26500.0
763 9113.322891 14899.0
764 5935.946300 5000.0
765 56318.690432 67500.0

766 rows × 2 columns

  • We seem to have fixed the issue
  • Additionally, we can calculate the difference between the targets and the predictions
  • Note that this is actually the residual (we already plotted the residuals)
  • Since OLS is an algorithm that minimizes the sum of squared errors (residuals), this comparison makes a lot of sense
In [56]:
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
In [57]:
# Finally, it makes sense to see how far off we are from the result percentage-wise
# Here, we take the absolute difference in %, so we can easily order the data frame
df_pf['Difference%'] = np.absolute(df_pf['Residual']/df_pf['Target']*100)
df_pf
Out[57]:
Prediction Target Residual Difference%
0 18159.439748 24700.0 6540.560252 26.480001
1 3061.537021 4000.0 938.462979 23.461574
2 13747.681742 18300.0 4552.318258 24.876056
3 12996.990953 15600.0 2603.009047 16.685955
4 18445.409313 14200.0 -4245.409313 29.897249
... ... ... ... ...
761 7871.534480 6900.0 -971.534480 14.080210
762 33011.195840 26500.0 -6511.195840 24.570550
763 9113.322891 14899.0 5785.677109 38.832654
764 5935.946300 5000.0 -935.946300 18.718926
765 56318.690432 67500.0 11181.309568 16.564903

766 rows × 4 columns

In [58]:
# Exploring the descriptives here gives us additional insights
df_pf.describe()
Out[58]:
Prediction Target Residual Difference%
count 7.660000e+02 766.000000 7.660000e+02 766.000000
mean inf 17806.382337 -inf inf
std NaN 16381.051873 NaN NaN
min 0.000000e+00 800.000000 -inf 0.057087
25% 7.791258e+03 7500.000000 -1.581104e+03 9.049756
50% 1.181369e+04 12050.000000 4.428694e+02 18.762869
75% 2.044732e+04 21875.000000 3.121130e+03 34.643019
max inf 102800.000000 5.017373e+04 inf
  • Min Difference% = 0.06%, which is pretty spot-on
  • The mean and max show inf: the overflow warning from np.exp earlier produced infinite predictions for a few observations, so those summary statistics are not usable here
  • The percentiles, however, tell us that for most of the predictions we got pretty close: 75% of results were within a 35% difference
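The "75% within a 35% difference" reading can also be checked directly rather than eyeballed from describe(). A sketch on toy numbers (in the notebook, df_pf['Difference%'] would be used instead of the hand-typed Series):

```python
import pandas as pd

# Toy stand-in for df_pf['Difference%'] (absolute % differences)
diff_pct = pd.Series([0.06, 9.0, 18.8, 34.6, 57.8, 512.7])

# The 75th percentile: 75% of predictions are at least this close
print(diff_pct.quantile(0.75))

# Alternatively, the share of predictions within a chosen tolerance
print((diff_pct <= 35).mean())
```

Either view gives a quick, quantitative sense of how often the model lands near the true price.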
In [59]:
# Sometimes it is useful to check these outputs manually
# To see all rows, we use the relevant pandas syntax
pd.options.display.max_rows = 999
# Moreover, to make the dataset clear, we can display the result with only 2 digits after the dot 
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Finally, we sort by difference in % and manually check the model
df_pf.sort_values(by=['Difference%'])
Out[59]:
Prediction Target Residual Difference%
484 14691.61 14700.00 8.39 0.06
285 8789.91 8800.00 10.09 0.11
521 8511.16 8500.00 -11.16 0.13
183 13182.34 13200.00 17.66 0.13
282 17025.93 16999.00 -26.93 0.16
591 9175.83 9200.00 24.17 0.26
387 45341.81 45200.00 -141.81 0.31
194 8928.33 8900.00 -28.33 0.32
556 5781.48 5800.00 18.52 0.32
461 12851.85 12800.00 -51.85 0.41
551 12120.44 12179.00 58.56 0.48
222 9546.02 9500.00 -46.02 0.48
375 2363.46 2350.00 -13.46 0.57
15 2565.52 2550.00 -15.52 0.61
503 28318.82 28500.00 181.18 0.64
718 7251.60 7300.00 48.40 0.66
700 42865.93 43163.25 297.32 0.69
319 4564.60 4600.00 35.40 0.77
707 4887.55 4850.00 -37.55 0.77
702 8265.42 8200.00 -65.42 0.80
35 39974.95 39600.00 -374.95 0.95
33 4746.43 4700.00 -46.43 0.99
428 3938.78 3900.00 -38.78 0.99
56 10196.48 10299.00 102.52 1.00
137 8557.00 8650.00 93.00 1.08
292 16206.64 16030.00 -176.64 1.10
738 10087.53 10200.00 112.47 1.10
569 25211.89 25500.00 288.11 1.13
317 11851.24 12000.00 148.76 1.24
584 5431.20 5500.00 68.80 1.25
506 43413.59 44000.00 586.41 1.33
370 25211.89 25555.00 343.11 1.34
46 13909.74 14100.00 190.26 1.35
17 7555.21 7450.00 -105.21 1.41
681 1420.97 1400.00 -20.97 1.50
262 9439.41 9300.00 -139.41 1.50
750 16246.25 16000.00 -246.25 1.54
132 9352.24 9500.00 147.76 1.56
595 20159.64 20500.00 340.36 1.66
614 9042.40 9200.00 157.60 1.71
507 60984.29 59900.00 -1084.29 1.81
117 4988.81 4900.00 -88.81 1.81
405 34629.24 34000.00 -629.24 1.85
125 11816.57 11600.00 -216.57 1.87
129 23940.17 23500.00 -440.17 1.87
715 13442.34 13699.00 256.66 1.87
205 14705.96 15000.00 294.04 1.96
566 6023.54 6150.00 126.46 2.06
263 6609.11 6750.00 140.89 2.09
601 22511.68 23000.00 488.32 2.12
268 45988.43 47000.00 1011.57 2.15
554 20537.19 21000.00 462.81 2.20
552 8798.50 8599.00 -199.50 2.32
73 14648.63 15000.00 351.37 2.34
645 9569.35 9350.00 -219.35 2.35
93 37847.52 38764.00 916.48 2.36
502 23408.44 23999.00 590.56 2.46
197 14114.99 14500.00 385.01 2.66
12 8954.53 9200.00 245.47 2.67
451 54960.37 56500.00 1539.63 2.73
340 2773.98 2700.00 -73.98 2.74
519 6318.78 6500.00 181.22 2.79
603 15321.68 14899.00 -422.68 2.84
657 8257.36 8500.00 242.64 2.85
526 8636.75 8900.00 263.25 2.96
441 3915.77 3800.00 -115.77 3.05
600 49434.84 51000.00 1565.16 3.07
609 14534.63 14999.00 464.37 3.10
415 14648.63 14200.00 -448.63 3.16
508 32404.33 33500.00 1095.67 3.27
720 11122.34 11500.00 377.66 3.28
156 10127.01 9800.00 -327.01 3.34
560 9569.35 9900.00 330.65 3.34
653 14981.37 15500.00 518.63 3.35
323 29475.68 28500.00 -975.68 3.42
108 12505.19 12950.00 444.81 3.43
442 26088.52 25200.00 -888.52 3.53
730 7764.65 7500.00 -264.65 3.53
48 3313.55 3200.00 -113.55 3.55
123 58820.09 61000.00 2179.91 3.57
344 7132.20 7400.00 267.80 3.62
164 10107.25 10500.00 392.75 3.74
532 15823.48 15250.00 -573.48 3.76
457 8561.17 8900.00 338.83 3.81
624 24821.01 23900.00 -921.01 3.85
41 7481.79 7200.00 -281.79 3.91
475 23928.48 22999.00 -929.48 4.04
618 42657.13 41000.00 -1657.13 4.04
745 38331.07 40000.00 1668.93 4.17
87 7662.95 8000.00 337.05 4.21
81 9284.00 9700.00 416.00 4.29
186 6412.02 6700.00 287.98 4.30
221 12726.95 12200.00 -526.95 4.32
270 8034.63 7700.00 -334.63 4.35
512 8038.55 7700.00 -338.55 4.40
149 7167.11 7500.00 332.89 4.44
403 11938.36 12500.00 561.64 4.49
324 10136.90 9700.00 -436.90 4.50
146 10407.74 10900.00 492.26 4.52
204 9068.93 9500.00 431.07 4.54
49 12073.19 12650.00 576.81 4.56
180 31376.70 32900.00 1523.30 4.63
580 9940.84 9500.00 -440.84 4.64
303 6912.77 6600.00 -312.77 4.74
642 26447.64 25250.00 -1197.64 4.74
654 3667.76 3500.00 -167.76 4.79
6 47750.40 45500.00 -2250.40 4.95
544 9256.84 9750.00 493.16 5.06
91 6832.23 7200.00 367.77 5.11
182 11035.79 10499.00 -536.79 5.11
496 14843.03 14100.00 -743.03 5.27
472 35208.94 37200.00 1991.06 5.35
133 12889.55 13621.43 731.88 5.37
66 6538.49 6200.00 -338.49 5.46
462 8789.91 9300.00 510.09 5.48
359 12758.06 13500.00 741.94 5.50
219 7181.13 7600.00 418.87 5.51
660 13137.36 12450.00 -687.36 5.52
630 19217.69 18200.00 -1017.69 5.59
430 1637.12 1550.00 -87.12 5.62
64 13680.72 14500.00 819.28 5.65
60 19558.49 18500.00 -1058.49 5.72
391 15547.77 16500.00 952.23 5.77
88 6399.51 6800.00 400.49 5.89
470 23408.44 22100.00 -1308.44 5.92
411 9777.17 10400.00 622.83 5.99
34 16502.10 17555.00 1052.90 6.00
349 15869.91 16900.00 1030.09 6.10
619 42345.84 39900.00 -2445.84 6.13
555 13694.08 12900.00 -794.08 6.16
11 2954.33 3150.00 195.67 6.21
250 37773.67 35555.00 -2218.67 6.24
362 13028.76 13900.00 871.24 6.27
114 58447.90 55000.00 -3447.90 6.27
206 12652.60 13500.00 847.40 6.28
705 14777.94 13900.00 -877.94 6.32
365 38406.01 41000.00 2593.99 6.33
83 6275.73 6700.00 424.27 6.33
587 14142.59 13300.00 -842.59 6.34
408 8428.45 9000.00 571.55 6.35
414 11187.71 11950.00 762.29 6.38
662 10764.37 11500.00 735.63 6.40
541 11678.91 12500.00 821.09 6.57
467 13869.04 13000.00 -869.04 6.68
191 13902.94 14900.00 997.06 6.69
247 9513.45 10200.00 686.55 6.73
695 4196.90 4500.00 303.10 6.74
708 7153.13 6700.00 -453.13 6.76
607 12389.71 11600.00 -789.71 6.81
331 1583.66 1700.00 116.34 6.84
301 40053.10 43000.00 2946.90 6.85
216 12615.58 11800.00 -815.58 6.91
640 14648.63 13700.00 -948.63 6.92
644 15939.80 14900.00 -1039.80 6.98
74 16381.67 15300.00 -1081.67 7.07
254 16056.98 17300.00 1243.02 7.19
85 8764.20 9450.00 685.80 7.26
26 15994.38 14900.00 -1094.38 7.34
228 26864.13 29000.00 2135.87 7.37
748 9343.12 8700.00 -643.12 7.39
713 14239.59 13250.00 -989.59 7.47
71 22033.19 20500.00 -1533.19 7.48
728 21522.79 19999.00 -1523.79 7.62
318 8867.51 9600.00 732.49 7.63
89 12167.88 11300.00 -867.88 7.68
661 4148.01 4500.00 351.99 7.82
751 16934.73 15700.00 -1234.73 7.86
696 7768.44 7199.00 -569.44 7.91
413 40683.85 37700.00 -2983.85 7.91
489 6906.02 7500.00 593.98 7.92
163 4209.22 3900.00 -309.22 7.93
295 7540.47 8200.00 659.53 8.04
50 25746.83 28000.00 2253.17 8.05
126 38612.85 42000.00 3387.15 8.06
468 5404.74 5000.00 -404.74 8.09
130 8185.10 8910.53 725.43 8.14
516 10276.45 11200.00 923.55 8.25
127 12952.64 11962.00 -990.64 8.28
444 7481.79 6900.00 -581.79 8.43
636 93948.35 102800.00 8851.65 8.61
307 11782.00 12900.00 1118.00 8.67
553 10685.81 11700.00 1014.19 8.67
90 39125.27 36000.00 -3125.27 8.68
160 11520.32 10600.00 -920.32 8.68
170 11089.81 10200.00 -889.81 8.72
14 18699.32 20500.00 1800.68 8.78
424 46326.49 50800.00 4473.51 8.81
390 13221.02 14500.00 1278.98 8.82
638 9476.36 10400.00 923.64 8.88
383 6538.49 5999.00 -539.49 8.99
724 11736.07 12900.00 1163.93 9.02
121 21649.27 23800.00 2150.73 9.04
70 9091.10 10000.00 908.90 9.09
574 12902.15 11818.95 -1083.20 9.16
746 20527.16 18800.00 -1727.16 9.19
656 92853.82 85000.00 -7853.82 9.24
529 11520.32 12700.00 1179.68 9.29
42 26759.40 29500.00 2740.60 9.29
339 5122.11 5650.00 527.89 9.34
548 1749.53 1600.00 -149.53 9.35
634 13130.95 14500.00 1369.05 9.44
283 12851.85 14200.00 1348.15 9.49
297 41063.04 37500.00 -3563.04 9.50
103 4524.66 5000.00 475.34 9.51
200 9311.24 8500.00 -811.24 9.54
448 10571.64 11700.00 1128.36 9.64
753 2414.79 2200.00 -214.79 9.76
381 8928.33 9900.00 971.67 9.81
664 40386.96 44800.00 4413.04 9.85
226 10633.76 11800.00 1166.24 9.88
419 15761.79 17500.00 1738.21 9.93
128 8105.56 9000.00 894.44 9.94
466 23592.04 26200.00 2607.96 9.95
394 2878.86 3200.00 321.14 10.04
143 28071.01 25500.00 -2571.01 10.08
585 11862.82 13200.00 1337.18 10.13
736 11220.53 12500.00 1279.47 10.24
678 15247.05 16999.00 1751.95 10.31
154 3199.08 2900.00 -299.08 10.31
404 15054.70 16800.00 1745.30 10.39
346 32658.48 36500.00 3841.52 10.52
458 16992.71 18999.00 2006.29 10.56
201 63012.38 70500.00 7487.62 10.62
655 11520.32 12900.00 1379.68 10.70
568 8928.33 9999.00 1070.67 10.71
238 16943.00 18988.13 2045.13 10.77
357 9532.04 10700.00 1167.96 10.92
193 27048.40 30385.00 3336.60 10.98
621 8712.99 9800.00 1087.01 11.09
363 31995.55 35999.00 4003.45 11.12
717 2976.05 3350.00 373.95 11.16
353 16607.18 18700.00 2092.82 11.19
300 7636.81 8600.00 963.19 11.20
538 30027.70 27000.00 -3027.70 11.21
271 10468.90 11800.00 1331.10 11.28
325 54960.37 62000.00 7039.63 11.35
380 14142.59 12700.00 -1442.59 11.36
685 16827.57 18999.00 2171.43 11.43
169 10535.57 11900.00 1364.43 11.47
348 12049.63 10800.00 -1249.63 11.57
436 7478.14 6700.00 -778.14 11.61
683 8395.59 9500.00 1104.41 11.63
67 15900.94 18000.00 2099.06 11.66
513 10146.81 11500.00 1353.19 11.77
236 5023.03 5700.00 676.97 11.88
374 11003.51 12500.00 1496.49 11.97
589 8185.10 9299.00 1113.90 11.98
54 9762.86 11100.00 1337.14 12.05
597 84050.44 74999.00 -9051.44 12.07
327 55962.33 49900.00 -6062.33 12.15
533 6680.48 5950.00 -730.48 12.28
155 8330.25 9500.00 1169.75 12.31
698 49944.34 57000.00 7055.66 12.38
237 19086.77 21800.00 2713.23 12.45
28 26890.38 23900.00 -2990.38 12.51
152 7825.55 8950.00 1124.45 12.56
173 11453.02 13100.00 1646.98 12.57
240 8444.92 7500.00 -944.92 12.60
371 54161.16 62000.00 7838.84 12.64
412 45076.91 39999.00 -5077.91 12.70
139 4887.55 5600.00 712.45 12.72
409 4359.83 5000.00 640.17 12.80
659 3838.15 3399.00 -439.15 12.92
620 56539.12 65000.00 8460.88 13.02
747 7917.79 7000.00 -917.79 13.11
756 21272.04 18800.00 -2472.04 13.15
501 6732.88 5950.00 -782.88 13.16
550 29939.85 34500.00 4560.15 13.22
225 19827.72 17500.00 -2327.72 13.30
174 6058.94 6999.00 940.06 13.43
347 18947.48 16700.00 -2247.48 13.46
704 12043.75 10600.00 -1443.75 13.62
273 8297.77 7300.00 -997.77 13.67
278 26037.61 22900.00 -3137.61 13.70
52 9921.44 11500.00 1578.56 13.73
99 7577.38 8800.00 1222.62 13.89
586 6719.74 5900.00 -819.74 13.89
562 5127.11 4500.00 -627.11 13.94
332 13337.73 15500.00 2162.27 13.95
445 20297.92 17800.00 -2497.92 14.03
761 7871.53 6900.00 -971.53 14.08
439 12751.83 14850.00 2098.17 14.13
151 17583.55 20500.00 2916.45 14.23
440 17125.98 19999.00 2873.02 14.37
534 9247.80 10800.00 1552.20 14.37
175 10982.04 9600.00 -1382.04 14.40
27 11441.84 9999.00 -1442.84 14.43
293 14309.29 12500.00 -1809.29 14.47
68 24460.07 28600.00 4139.93 14.48
565 5642.04 6600.00 957.96 14.51
671 2523.28 2200.00 -323.28 14.69
276 5457.78 6400.00 942.22 14.72
267 54960.37 64500.00 9539.63 14.79
157 7153.13 8400.00 1246.87 14.84
112 10107.25 8800.00 -1307.25 14.86
739 8980.80 10550.00 1569.20 14.87
482 26088.52 22700.00 -3388.52 14.93
596 6634.97 7800.00 1165.03 14.94
637 9931.14 11700.00 1768.86 15.12
279 12203.58 10600.00 -1603.58 15.13
417 11531.58 13600.00 2068.42 15.21
124 5341.77 6300.00 958.23 15.21
176 6106.46 5300.00 -806.46 15.22
673 55175.48 65099.00 9923.52 15.24
289 82990.14 71999.00 -10991.14 15.27
212 11068.17 9599.00 -1469.17 15.31
106 28821.01 24990.00 -3831.01 15.33
476 3114.31 2700.00 -414.31 15.34
608 10780.15 12750.00 1969.85 15.45
716 10306.60 12200.00 1893.40 15.52
396 48549.75 42000.00 -6549.75 15.59
47 84379.40 99999.00 15619.60 15.62
296 7518.41 6500.00 -1018.41 15.67
165 16104.09 19100.00 2995.91 15.69
187 7076.70 8400.00 1323.30 15.75
168 4051.92 3500.00 -551.92 15.77
721 9148.99 7900.00 -1248.99 15.81
570 5889.75 7000.00 1110.25 15.86
232 10428.09 9000.00 -1428.09 15.87
605 11588.02 9999.00 -1589.02 15.89
758 12191.67 14499.00 2307.33 15.91
464 34663.07 29900.00 -4763.07 15.93
298 19482.23 16800.00 -2682.23 15.97
579 15426.77 13300.00 -2126.77 15.99
234 7894.63 9400.00 1505.37 16.01
706 7294.21 8700.00 1405.79 16.16
689 4127.80 3550.00 -577.80 16.28
418 15808.04 18900.00 3091.96 16.36
55 7719.29 9250.00 1530.71 16.55
217 17523.56 21000.00 3476.44 16.55
765 56318.69 67500.00 11181.31 16.56
224 7337.08 8800.00 1462.92 16.62
251 12335.38 14800.00 2464.62 16.65
531 5750.51 6900.00 1149.49 16.66
3 12996.99 15600.00 2603.01 16.69
58 6412.02 7700.00 1287.98 16.73
321 15777.19 18950.00 3172.81 16.74
364 12726.95 10900.00 -1826.95 16.76
328 7489.10 9000.00 1510.90 16.79
692 9064.51 10900.00 1835.49 16.84
63 36504.40 43900.00 7395.60 16.85
493 15381.64 18500.00 3118.36 16.86
38 10166.65 8700.00 -1466.65 16.86
682 4323.79 3700.00 -623.79 16.86
734 7890.78 6750.00 -1140.78 16.90
257 6355.91 7650.00 1294.09 16.92
5 53165.48 64000.00 10834.52 16.93
686 8420.22 7200.00 -1220.22 16.95
315 3040.68 2600.00 -440.68 16.95
393 23248.97 28000.00 4751.03 16.97
500 15441.85 18600.00 3158.15 16.98
518 41506.53 49999.00 8492.47 16.99
181 7427.19 8950.00 1522.81 17.01
449 3554.91 4300.00 745.09 17.33
72 16309.84 13900.00 -2409.84 17.34
244 54960.37 66500.00 11539.63 17.35
459 19255.26 23300.00 4044.74 17.36
376 11810.80 14300.00 2489.20 17.41
497 14777.94 17900.00 3122.06 17.44
392 33011.20 40000.00 6988.80 17.47
604 10509.88 12750.00 2240.12 17.57
623 16119.83 13700.00 -2419.83 17.66
104 13828.47 16800.00 2971.53 17.69
687 11187.71 9500.00 -1687.71 17.77
667 13054.23 15900.00 2845.77 17.90
13 8209.12 10000.00 1790.88 17.91
195 7181.13 8750.00 1568.87 17.93
633 13350.76 16300.00 2949.24 18.09
386 9815.44 11990.00 2174.56 18.14
97 13422.66 16400.00 2977.34 18.15
714 8919.62 10900.00 1980.38 18.17
499 89296.65 75500.00 -13796.65 18.27
330 9225.25 11300.00 2074.75 18.36
594 19577.60 23999.00 4421.40 18.42
665 19577.60 23999.00 4421.40 18.42
402 3342.80 4100.00 757.20 18.47
246 28125.89 34500.00 6374.11 18.48
264 10832.91 13300.95 2468.04 18.56
743 54960.37 67500.00 12539.63 18.58
230 24918.16 21000.00 -3918.16 18.66
598 6172.41 5200.00 -972.41 18.70
764 5935.95 5000.00 -935.95 18.72
199 20238.55 24900.00 4661.45 18.72
138 5399.47 6650.00 1250.53 18.81
527 4044.01 3400.00 -644.01 18.94
30 31964.32 39500.00 7535.68 19.08
272 17378.70 21500.00 4121.30 19.17
159 21209.81 26250.00 5040.19 19.20
509 14114.99 17500.00 3385.01 19.34
369 10226.39 12700.00 2473.61 19.48
629 2987.70 2499.00 -488.70 19.56
196 13991.48 11700.00 -2291.48 19.59
407 41791.23 52000.00 10208.77 19.63
495 17982.96 22400.00 4417.04 19.72
40 19501.27 24300.00 4798.73 19.75
669 12814.25 10700.00 -2114.25 19.76
741 10780.15 9000.00 -1780.15 19.78
43 13475.20 16800.00 3324.80 19.79
118 28821.01 35999.00 7177.99 19.94
111 10326.75 12900.00 2573.25 19.95
86 11408.36 9500.00 -1908.36 20.09
460 18053.35 15000.00 -3053.35 20.36
680 9375.10 11800.00 2424.90 20.55
210 10241.38 12900.00 2658.62 20.61
178 20238.55 25500.00 5261.45 20.63
80 18159.44 22900.00 4740.56 20.70
192 8691.75 7200.00 -1491.75 20.72
735 25113.59 20800.00 -4313.59 20.74
379 7489.10 9450.00 1960.90 20.75
147 10530.42 13300.00 2769.58 20.82
540 11242.47 9300.00 -1942.47 20.89
406 29032.87 24000.00 -5032.87 20.97
360 7028.49 8900.00 1871.51 21.03
335 12541.88 15900.00 3358.12 21.12
563 10623.39 13500.00 2876.61 21.31
588 28513.06 23500.00 -5013.06 21.33
209 7001.09 8900.00 1898.91 21.34
672 15854.42 20200.00 4345.58 21.51
723 22555.69 28787.85 6232.16 21.65
697 12783.00 10500.00 -2283.00 21.74
39 22162.67 18200.00 -3962.67 21.77
432 2033.46 2600.00 566.54 21.79
732 34866.77 44600.00 9733.23 21.82
135 42553.11 54500.00 11946.89 21.92
688 12197.63 9999.00 -2198.63 21.99
652 20278.11 25999.00 5720.89 22.00
492 28849.17 37000.00 8150.83 22.03
29 5838.21 7500.00 1661.79 22.16
306 30115.80 38700.00 8584.20 22.18
214 8789.91 11300.00 2510.09 22.21
483 25076.83 20500.00 -4576.83 22.33
223 5815.45 7500.00 1684.55 22.46
166 8832.94 11403.99 2571.05 22.55
446 47517.81 61500.00 13982.19 22.74
184 8330.25 10800.00 2469.75 22.87
227 9343.12 7600.00 -1743.12 22.94
583 8858.85 11500.00 2641.15 22.97
333 12191.67 9900.00 -2291.67 23.15
422 8257.36 6700.00 -1557.36 23.24
520 2466.03 2000.00 -466.03 23.30
367 7666.70 9999.00 2332.30 23.33
740 36772.74 48000.00 11227.26 23.39
1 3061.54 4000.00 938.46 23.46
641 7779.83 6300.00 -1479.83 23.49
694 9229.76 12100.00 2870.24 23.72
627 17361.74 13995.00 -3366.74 24.06
309 7337.08 5900.00 -1437.08 24.36
140 13060.61 10500.00 -2560.61 24.39
453 15854.42 21000.00 5145.58 24.50
525 9015.95 11950.00 2934.05 24.55
762 33011.20 26500.00 -6511.20 24.57
252 13182.34 17500.00 4317.66 24.67
675 15441.85 20500.00 5058.15 24.67
136 16222.47 13000.00 -3222.47 24.79
2 13747.68 18300.00 4552.32 24.88
593 2372.71 1900.00 -472.71 24.88
326 24033.87 32000.00 7966.13 24.89
427 51429.20 68500.00 17070.80 24.92
161 9758.09 12999.00 3240.91 24.93
9 41143.32 32900.00 -8243.32 25.06
20 22271.15 17800.00 -4471.15 25.12
450 25658.98 20500.00 -5158.98 25.17
322 8153.19 10900.00 2746.81 25.20
722 13462.04 18000.00 4537.96 25.21
162 4485.07 6000.00 1514.93 25.25
241 17480.83 23400.00 5919.17 25.30
478 5639.28 4500.00 -1139.28 25.32
122 34969.07 27900.00 -7069.07 25.34
547 10407.74 8300.00 -2107.74 25.39
291 1004.67 1350.00 345.33 25.58
77 20081.05 26999.00 6917.95 25.62
423 7430.82 10000.00 2569.18 25.69
384 34969.07 27800.00 -7169.07 25.79
342 26447.64 21000.00 -5447.64 25.94
109 8691.75 6900.00 -1791.75 25.97
729 30709.78 41500.00 10790.22 26.00
148 16278.02 22000.00 5721.98 26.01
269 29504.48 39900.00 10395.52 26.05
260 12480.79 9900.00 -2580.79 26.07
455 21365.73 28900.00 7534.27 26.07
32 9457.87 7500.00 -1957.87 26.10
266 10196.48 13800.00 3603.52 26.11
399 9467.11 7500.00 -1967.11 26.23
726 29475.68 40000.00 10524.32 26.31
308 11035.79 15000.00 3964.21 26.43
0 18159.44 24700.00 6540.56 26.48
651 8586.29 11700.00 3113.71 26.61
53 32278.00 44000.00 11722.00 26.64
294 10891.25 8600.00 -2291.25 26.64
141 24979.07 19700.00 -5279.07 26.80
416 56705.00 77500.00 20795.00 26.83
395 11314.06 15500.00 4185.94 27.01
443 9532.04 7500.00 -2032.04 27.09
44 29032.87 39900.00 10867.13 27.24
737 2290.75 1800.00 -490.75 27.26
481 5742.09 7899.00 2156.91 27.31
571 7511.07 5900.00 -1611.07 27.31
188 72597.66 57000.00 -15597.66 27.36
610 3486.16 4800.00 1313.84 27.37
635 9849.04 7700.00 -2149.04 27.91
231 9873.12 13708.50 3835.38 27.98
336 11430.67 15900.00 4469.33 28.11
113 31777.57 24800.00 -6977.57 28.14
754 24991.27 19500.00 -5491.27 28.16
31 5815.45 8100.00 2284.55 28.20
463 11046.57 15400.00 4353.43 28.27
248 10402.66 8100.00 -2302.66 28.43
576 8217.14 11500.00 3282.86 28.55
522 16278.02 22800.00 6521.98 28.61
368 3412.06 2650.00 -762.06 28.76
302 7764.65 10900.00 3135.35 28.76
22 3205.33 4500.00 1294.67 28.77
372 48431.36 37500.00 -10931.36 29.15
179 8424.33 11900.00 3475.67 29.21
648 4954.83 7000.00 2045.17 29.22
410 21840.39 30900.00 9059.61 29.32
733 1059.07 1500.00 440.93 29.40
452 3437.14 2650.00 -787.14 29.70
505 3375.60 2600.00 -775.60 29.83
341 27662.81 21300.00 -6362.81 29.87
4 18445.41 14200.00 -4245.41 29.90
676 3784.19 5400.00 1615.81 29.92
755 6865.67 9800.00 2934.33 29.94
742 12677.33 18100.00 5422.67 29.96
202 3900.50 3000.00 -900.50 30.02
8 15040.01 21500.00 6459.99 30.05
663 30263.21 43500.00 13236.79 30.43
613 9007.15 6900.00 -2107.15 30.54
65 20497.12 15700.00 -4797.12 30.55
435 21365.73 30900.00 9534.27 30.86
725 6214.75 9000.00 2785.25 30.95
150 4793.01 3650.00 -1143.01 31.32
431 28736.70 41900.00 13163.30 31.42
190 11046.57 8400.00 -2646.57 31.51
78 11576.71 8800.00 -2776.71 31.55
120 4351.32 3300.00 -1051.32 31.86
24 23940.17 35138.29 11198.12 31.87
116 44988.96 66200.00 21211.04 32.04
101 10576.80 8000.00 -2576.80 32.21
575 47448.25 69999.00 22550.75 32.22
277 28736.70 21700.00 -7036.70 32.43
284 9122.23 13500.00 4377.77 32.43
82 18409.42 27300.00 8890.58 32.57
420 18168.31 13700.00 -4468.31 32.62
561 12664.96 18800.00 6135.04 32.63
712 2693.89 4000.00 1306.11 32.65
514 4645.55 6900.00 2254.45 32.67
421 12395.76 18500.00 6104.24 33.00
275 32042.45 47900.00 15857.55 33.11
373 13448.90 10100.00 -3348.90 33.16
233 19983.23 15000.00 -4983.23 33.22
343 22402.03 16800.00 -5602.03 33.35
564 2131.04 3200.00 1068.96 33.40
397 5315.75 8000.00 2684.25 33.55
361 18901.28 28500.00 9598.72 33.68
401 11736.07 17700.00 5963.93 33.69
437 3506.64 5300.00 1793.36 33.84
535 5420.60 8200.00 2779.40 33.90
545 11897.63 17999.00 6101.37 33.90
389 8034.63 6000.00 -2034.63 33.91
377 6035.31 4500.00 -1535.31 34.12
511 11678.91 8700.00 -2978.91 34.24
480 24033.87 17900.00 -6133.87 34.27
606 7558.90 11500.00 3941.10 34.27
486 4700.31 3500.00 -1200.31 34.29
626 8730.03 6500.00 -2230.03 34.31
625 59079.15 89999.00 30919.85 34.36
632 14763.52 22500.00 7736.48 34.38
290 15994.38 11900.00 -4094.38 34.41
316 8603.08 6400.00 -2203.08 34.42
398 32451.83 49500.00 17048.17 34.44
92 14080.57 21500.00 7419.43 34.51
280 9420.99 7000.00 -2420.99 34.59
185 72704.08 53999.00 -18705.08 34.64
310 18626.42 28500.00 9873.58 34.64
599 3840.03 5900.00 2059.97 34.91
274 15351.63 23600.00 8248.37 34.95
314 9384.26 6950.00 -2434.26 35.03
207 12764.29 19700.00 6935.71 35.21
639 13494.95 20900.00 7405.05 35.43
690 2641.79 1950.00 -691.79 35.48
36 27287.18 42500.00 15212.82 35.79
352 12037.87 18800.00 6762.13 35.97
447 23592.04 36900.00 13307.96 36.06
546 18845.99 29500.00 10654.01 36.12
577 6922.90 10900.00 3977.10 36.49
79 10774.88 16999.00 6224.12 36.61
684 4111.71 6500.00 2388.29 36.74
757 8636.75 6299.00 -2337.75 37.11
487 2763.17 4400.00 1636.83 37.20
536 6399.51 10200.00 3800.49 37.26
488 17446.72 28000.00 10553.28 37.69
98 14128.78 10250.00 -3878.78 37.84
107 10759.11 17500.00 6740.89 38.52
647 32293.76 23300.00 -8993.76 38.60
23 5549.14 4000.00 -1549.14 38.73
763 9113.32 14899.00 5785.68 38.83
709 12505.19 8999.00 -3506.19 38.96
329 26785.55 44000.00 17214.45 39.12
573 3154.10 5200.00 2045.90 39.34
539 7233.91 11999.00 4765.09 39.71
299 8113.48 13500.00 5386.52 39.90
76 9513.45 6800.00 -2713.45 39.90
628 28708.65 47777.00 19068.35 39.91
142 11057.37 7900.00 -3157.37 39.97
699 15487.15 25800.00 10312.85 39.97
567 8403.79 6000.00 -2403.79 40.06
334 59833.96 99999.00 40165.04 40.17
677 13896.16 9900.00 -3996.16 40.37
612 6318.78 10600.00 4281.22 40.39
425 8038.55 13500.00 5461.45 40.46
218 25162.69 17900.00 -7262.69 40.57
631 5941.75 10000.00 4058.25 40.58
646 11822.34 19900.00 8077.66 40.59
590 22888.53 38600.00 15711.47 40.70
19 4679.70 7900.00 3220.30 40.76
693 19837.41 33500.00 13662.59 40.78
557 56208.80 39900.00 -16308.80 40.87
25 8884.84 6300.00 -2584.84 41.03
537 47657.23 81000.00 33342.77 41.16
759 24627.85 42000.00 17372.15 41.36
485 4969.37 8500.00 3530.63 41.54
701 15113.63 25900.00 10786.37 41.65
649 18000.53 12700.00 -5300.53 41.74
382 13092.53 22500.00 9407.47 41.81
358 24893.84 42900.00 18006.16 41.97
208 13066.99 9200.00 -3866.99 42.03
731 15321.68 26600.00 11278.32 42.40
582 7555.21 5300.00 -2255.21 42.55
131 13086.14 22900.00 9813.86 42.86
338 15143.17 10600.00 -4543.17 42.86
426 26824.81 47000.00 20175.19 42.93
213 10256.40 18000.00 7743.60 43.02
615 17480.83 31000.00 13519.17 43.61
504 9274.93 16500.00 7225.07 43.79
524 28208.41 50250.00 22041.59 43.86
51 12269.31 21900.00 9630.69 43.98
100 26551.16 47500.00 20948.84 44.10
177 14463.84 26000.00 11536.16 44.37
265 8807.10 6100.00 -2707.10 44.38
473 8285.63 14900.00 6614.37 44.39
256 9796.28 17777.00 7980.72 44.89
622 3046.62 2100.00 -946.62 45.08
345 10811.78 7450.00 -3361.78 45.12
288 9166.88 6300.00 -2866.88 45.51
400 5457.78 3750.00 -1707.78 45.54
7 12383.66 8500.00 -3883.66 45.69
650 31715.57 58500.00 26784.43 45.79
203 1986.35 3700.00 1713.65 46.31
198 6415.16 11999.00 5583.84 46.54
337 15351.63 28900.00 13548.37 46.88
253 20022.30 38000.00 17977.70 47.31
530 21055.04 40000.00 18944.96 47.36
643 36718.92 70000.00 33281.08 47.54
581 6615.56 12800.00 6184.44 48.32
366 16246.25 31500.00 15253.75 48.42
578 10468.90 20300.00 9831.10 48.43
158 10843.50 7300.00 -3543.50 48.54
454 32246.49 21700.00 -10546.49 48.60
37 2306.46 4500.00 2193.54 48.75
471 9060.08 17777.00 8716.92 49.03
320 19434.73 38500.00 19065.27 49.52
75 17643.76 35000.00 17356.24 49.59
558 18535.70 37000.00 18464.30 49.90
543 3866.37 7750.00 3883.63 50.11
355 11122.34 7400.00 -3722.34 50.30
304 6029.42 4000.00 -2029.42 50.74
249 5447.13 3600.00 -1847.13 51.31
542 25261.18 52300.00 27038.82 51.70
239 6828.90 4500.00 -2328.90 51.75
242 22877.35 47600.00 24722.65 51.94
668 29764.94 62000.00 32235.06 51.99
388 24483.97 51000.00 26516.03 51.99
62 2131.04 1400.00 -731.04 52.22
95 7825.55 16500.00 8674.45 52.57
490 5592.67 12000.00 6407.33 53.39
167 12990.65 27900.00 14909.35 53.44
115 27716.89 18000.00 -9716.89 53.98
59 5736.49 12500.00 6763.51 54.11
119 26165.06 16900.00 -9265.06 54.82
710 12299.30 27500.00 15200.70 55.28
602 9644.41 6200.00 -3444.41 55.55
229 7991.59 18500.00 10508.41 56.80
658 11787.76 7500.00 -4287.76 57.17
144 6300.30 3999.00 -2301.30 57.55
69 14204.87 9000.00 -5204.87 57.83
110 22910.89 14500.00 -8410.89 58.01
172 11990.94 28700.00 16709.06 58.22
429 35381.27 85555.00 50173.73 58.64
691 5505.96 3450.00 -2055.96 59.59
258 4595.91 11600.00 7004.09 60.38
616 11176.79 6900.00 -4276.79 61.98
57 11275.45 30500.00 19224.55 63.03
134 21596.48 60500.00 38903.52 64.30
61 7734.38 4700.00 -3034.38 64.56
255 13409.56 39500.00 26090.44 66.05
287 10960.61 32500.00 21539.39 66.28
235 8314.00 5000.00 -3314.00 66.28
727 7423.56 22200.00 14776.44 66.56
153 2479.31 7700.00 5220.69 67.80
18 11851.24 6999.00 -4852.24 69.33
354 29403.81 17100.00 -12303.81 71.95
189 6368.34 3700.00 -2668.34 72.12
617 22380.16 12999.00 -9381.16 72.17
313 24603.81 13999.00 -10604.81 75.75
261 13896.16 7900.00 -5996.16 75.90
517 4439.31 2500.00 -1939.31 77.57
281 16072.67 9000.00 -7072.67 78.59
351 14197.94 7800.00 -6397.94 82.02
21 8937.06 4900.00 -4037.06 82.39
145 23777.07 12500.00 -11277.07 90.22
211 9352.24 4900.00 -4452.24 90.86
434 6693.55 3500.00 -3193.55 91.24
350 9578.70 5000.00 -4578.70 91.57
666 16135.57 8350.00 -7785.57 93.24
479 9113.32 4700.00 -4413.32 93.90
243 20637.71 10500.00 -10137.71 96.55
105 10960.61 5500.00 -5460.61 99.28
515 8577.91 4300.00 -4277.91 99.49
10 0.00 19800.00 19800.00 100.00
491 0.00 17990.00 17990.00 100.00
16 0.00 2899.00 2899.00 100.00
465 0.00 3500.00 3500.00 100.00
356 0.00 29999.00 29999.00 100.00
474 0.00 5000.00 5000.00 100.00
378 0.00 37700.00 37700.00 100.00
305 0.00 5200.00 5200.00 100.00
311 0.00 4100.00 4100.00 100.00
749 3428.76 1700.00 -1728.76 101.69
259 5391.56 2600.00 -2791.56 107.37
523 3589.80 1699.00 -1890.80 111.29
510 12156.01 5500.00 -6656.01 121.02
760 11633.37 5000.00 -6633.37 132.67
572 5431.20 2300.00 -3131.20 136.14
711 21691.60 8900.00 -12791.60 143.73
94 9485.62 3799.00 -5686.62 149.69
220 8893.52 3500.00 -5393.52 154.10
469 6892.55 2700.00 -4192.55 155.28
549 5775.83 2250.00 -3525.83 156.70
719 11565.41 4500.00 -7065.41 157.01
670 12014.38 4500.00 -7514.38 166.99
494 10196.48 3650.00 -6546.48 179.36
84 7511.07 2500.00 -5011.07 200.44
703 35693.61 11200.00 -24493.61 218.69
752 7021.63 2200.00 -4821.63 219.16
528 2813.54 800.00 -2013.54 251.69
312 7167.11 1800.00 -5367.11 298.17
744 11920.89 2900.00 -9020.89 311.07
45 15261.94 2850.00 -12411.94 435.51
456 14705.96 2500.00 -12205.96 488.24
286 40053.10 4100.00 -35953.10 876.90
611 27852.56 2500.00 -25352.56 1014.10
245 29047.05 2400.00 -26647.05 1110.29
592 inf 20000.00 -inf inf
438 inf 8500.00 -inf inf
102 inf 7300.00 -inf inf
477 inf 18900.00 -inf inf
96 inf 17200.00 -inf inf
171 inf 52222.00 -inf inf
498 inf 999.00 -inf inf
559 inf 37500.00 -inf inf
679 inf 1800.00 -inf inf
674 inf 2700.00 -inf inf
215 inf 4100.00 -inf inf
385 inf 8300.00 -inf inf
433 inf 5600.00 -inf inf
  • The observations with the largest Difference% values also have extremely low observed prices (around 3,000 or less)
    • The residuals for all of these outliers are negative, i.e. the predicted prices are higher than the actual ones
    • A likely explanation is that we are missing an important factor which drives the price of the car lower, such as:
      • The model of the car
      • Damage to the car
  • The rows where Prediction is 0.00 or inf (with a Difference% of 100 or inf) show the model broke down entirely on a handful of observations, likely from exponentiating extreme log-price estimates; these deserve separate inspection
  • Otherwise, the predictions are fairly accurate
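For reference, the comparison table above can be reconstructed with a few pandas operations. This is a sketch with made-up values standing in for the model's log-price predictions (`y_hat_test`) and the held-out log targets (`y_test`); the column definitions mirror the table: Residual = Target − Prediction and Difference% = |Residual / Target| × 100.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the model's predictions on the test set and
# the corresponding targets, both on the log-price scale (in the notebook
# these would come from reg.predict(x_test) and y_test)
y_hat_test = np.array([9.2, 8.7, 10.1])
y_test = pd.Series([9.0, 8.9, 10.4])

# Undo the log transform to get prices back in the original units
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf['Target'] = np.exp(y_test.reset_index(drop=True))

# Residual and absolute percentage difference, matching the table's columns
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
df_pf['Difference%'] = np.absolute(df_pf['Residual'] / df_pf['Target'] * 100)

# Sorting by Difference% puts the worst predictions at the bottom,
# which is exactly the tail of the table shown above
df_pf = df_pf.sort_values(by=['Difference%'])
```

Sorting by Difference% rather than by the raw residual is a deliberate choice: a 1,000 residual is negligible on a 50,000 car but enormous on a 2,000 one.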

Ways we could potentially improve the model further:

  1. Use a different set of variables
  2. Remove a bigger part of the outliers
  3. Use different kinds of transformations
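For point 2, a common approach is to drop all observations above a chosen quantile of the problematic variable; tightening the quantile removes a bigger part of the outliers. A minimal sketch, using a hypothetical toy DataFrame in place of `raw_data`:

```python
import pandas as pd

# Hypothetical toy data standing in for raw_data['Price']
df = pd.DataFrame({'Price': [4200.0, 7900.0, 13300.0, 23000.0,
                             18300.0, 300000.0]})

# Keep only observations below the 99th percentile of Price;
# lowering the quantile (e.g. to 0.95) removes a bigger share of outliers
q = df['Price'].quantile(0.99)
data_cleaned = df[df['Price'] < q]
```

For point 3, one could analogously experiment with transformations of other variables, e.g. a log transform of Mileage, and compare the resulting residual tables.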