Vehicle Price Prediction (Linear Regression)

I decided to pick a random dataset on Kaggle to try my hand with, and I went with the Australian Price Prediction dataset.

Importing Libraries

Let's start by importing NumPy and pandas, and then our dataset.

We will load our dataset from Kaggle:
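The loading cell itself is not shown here, so the following is only a minimal sketch of what it might look like. The real Kaggle file path is unknown, so an in-memory CSV with made-up rows stands in for it; on Kaggle the `read_csv` argument would instead be the path of the dataset's CSV file.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows standing in for the real file (assumption: the real data is a CSV).
# On Kaggle you would pass the dataset's file path to read_csv instead.
sample_csv = io.StringIO(
    "Brand,Year,Seats,Price\n"
    "Toyota,2019,5 Seats,25990\n"
    "Kia,2021,,31500\n"
)

df = pd.read_csv(sample_csv)
print(df.shape)
```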

Inspecting Data

Let's see what we are working with:

df.info()

RangeIndex: 16734 entries, 0 to 16733
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Brand              16733 non-null  object 
 1   Year               16733 non-null  float64
 2   Model              16733 non-null  object 
 3   Car/Suv            16706 non-null  object 
 4   Title              16733 non-null  object 
 5   UsedOrNew          16733 non-null  object 
 6   Transmission       16733 non-null  object 
 7   Engine             16733 non-null  object 
 8   DriveType          16733 non-null  object 
 9   FuelType           16733 non-null  object 
 10  FuelConsumption    16733 non-null  object 
 11  Kilometres         16733 non-null  object 
 12  ColourExtInt       16733 non-null  object 
 13  Location           16284 non-null  object 
 14  CylindersinEngine  16733 non-null  object 
 15  BodyType           16452 non-null  object 
 16  Doors              15130 non-null  object 
 17  Seats              15029 non-null  object 
 18  Price              16731 non-null  object 
dtypes: float64(1), object(18)
memory usage: 2.4+ MB

Print some useful info:

Shape: (16734, 19)

Columns: ['Brand', 'Year', 'Model', 'Car/Suv', 'Title', 'UsedOrNew', 'Transmission', 'Engine', 'DriveType', 'FuelType', 'FuelConsumption', 'Kilometres', 'ColourExtInt', 'Location', 'CylindersinEngine', 'BodyType', 'Doors', 'Seats', 'Price']
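Output in this shape can be produced with two `print` calls. This is a small sketch on a made-up frame, not the notebook's actual cell:

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
df = pd.DataFrame(
    {"Brand": ["Toyota", "Kia"], "Year": [2019.0, 2021.0], "Price": ["25990", "31500"]}
)

print(f"Shape: {df.shape}")                # (rows, columns)
print(f"Columns: {df.columns.tolist()}")   # column names as a plain list
```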

Let's look at a preview of our dataset:

It appears we can drop some features that seem like they won't help us very much in making predictions:
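Comparing the original column list with the one printed further down, the columns that disappear are Title, ColourExtInt, and Location. A minimal sketch of that drop, on a made-up one-row frame, might be:

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the dataset.
df = pd.DataFrame(
    {
        "Brand": ["Toyota"],
        "Title": ["2019 Toyota Corolla"],
        "ColourExtInt": ["White / Black"],
        "Location": ["Sydney, NSW"],
        "Price": ["25990"],
    }
)

# Columns inferred from the before/after column lists in this post.
cols_to_drop = ["Title", "ColourExtInt", "Location"]
df = df.drop(columns=cols_to_drop)

print(df.columns)
```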

Let's see which columns have records with missing values:

Index(['Brand', 'Year', 'Model', 'Car/Suv', 'UsedOrNew', 'Transmission',
       'Engine', 'DriveType', 'FuelType', 'FuelConsumption', 'Kilometres',
       'CylindersinEngine', 'BodyType', 'Doors', 'Seats', 'Price'],
      dtype='object')

Brand                   1
Year                    1
Model                   1
Car/Suv                28
UsedOrNew               1
Transmission            1
Engine                  1
DriveType               1
FuelType                1
FuelConsumption         1
Kilometres              1
CylindersinEngine       1
BodyType              282
Doors                1604
Seats                1705
Price                   3
dtype: int64

Inspect data with missing values:
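Both the per-column counts and the rows behind them are short pandas expressions. The sketch below uses a made-up frame; the variable names are illustrative only.

```python
import pandas as pd

# Toy frame with made-up values and a few gaps.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", "Kia", None],
        "Seats": ["5 Seats", None, None],
        "Price": ["25990", None, "18000"],
    }
)

# Number of missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Rows that have at least one missing value, for a closer look.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```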

It seems we are missing a few values. Since we want Price as our target to train with, we will drop the records where it is missing.
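Dropping only the rows that lack a target can be done with `dropna` and its `subset` argument. A minimal sketch on made-up data (resetting the index is optional and an assumption here):

```python
import pandas as pd

# Toy frame with made-up values; one row has no price.
df = pd.DataFrame(
    {"Brand": ["Toyota", "Kia", "Ford"], "Price": ["25990", None, "18000"]}
)

# Keep only rows where the target column is present.
df = df.dropna(subset=["Price"]).reset_index(drop=True)

print(len(df))
```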

Take a peek at what our data is lookin' like now:

Check the number of unique values for each feature:

Brand                   76
Year                    45
Model                  781
Car/Suv                618
UsedOrNew                3
Transmission             3
Engine                 106
DriveType                5
FuelType                 9
FuelConsumption        157
Kilometres           14261
CylindersinEngine       11
BodyType                10
Doors                   13
Seats                   13
Price                 3794
dtype: int64
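The unique-value table above and the missing-value recount that follows are both one-liners in pandas. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame(
    {
        "Transmission": ["Automatic", "Manual", "Automatic"],
        "Doors": ["4 Doors", None, "4 Doors"],
    }
)

# Distinct values per column (missing values are not counted by default).
unique_counts = df.nunique()
print(unique_counts)

# Missing values per column, to confirm what is still left to handle.
missing_counts = df.isnull().sum()
print(missing_counts)
```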

Brand                   0
Year                    0
Model                   0
Car/Suv                27
UsedOrNew               0
Transmission            0
Engine                  0
DriveType               0
FuelType                0
FuelConsumption         0
Kilometres              0
CylindersinEngine       0
BodyType              281
Doors                1602
Seats                1703
Price                   0
dtype: int64

Starting Our Data Preprocessing

As we can see, we have quite a few missing Seats and Doors values, so let's tackle these two first. There are two ways to go. We could strip any non-numeric characters out of the values and coerce them into floats, so that missing or non-numeric values become NaN, and then fill every NaN with the mean of the values that are present. Or we could take all the records that do have a value, parse the seat count from each as a float to work out the totals and the mean, and fill the missing values in using the same string format as the rest of the column; later on we can strip the text and coerce the whole column into floats. The second option is the route I decided to go with.
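For comparison, the first option (not the route taken here) could be sketched roughly like this. It assumes the raw values look like "5 Seats", which is a guess at the format, and the toy Series is made up.

```python
import pandas as pd

# Made-up values; the "5 Seats" text format is an assumption.
seats = pd.Series(["5 Seats", "7 Seats", None, "-"], name="Seats")

# Pull out the first run of digits; rows without any digits become NaN.
digits = seats.str.extract(r"(\d+)", expand=False)

# Coerce to floats, then fill the gaps with the mean of the known values.
seats_num = pd.to_numeric(digits, errors="coerce")
seats_filled = seats_num.fillna(seats_num.mean())

print(seats_filled)
```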

Total of records with seat values 15,028
Total of records with NO seat values: 1,703

Max Seats: 22.0
Min Seats: 2.0
Mean Seats: 5.1

The mean looks good to me for filling in our missing values.
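The chosen route could look something like the sketch below: parse the numbers from the rows that have a value, report the stats, then fill the gaps with text in the same format. The "5 Seats" format, the rounding of the mean, and all variable names are assumptions, and the data is made up.

```python
import pandas as pd

# Made-up values; the "5 Seats" text format is an assumption.
df = pd.DataFrame({"Seats": ["5 Seats", "7 Seats", None, "2 Seats"]})

# Parse the seat count only from records that actually have a value.
has_seats = df["Seats"].notna()
seat_values = (
    df.loc[has_seats, "Seats"].str.extract(r"(\d+)", expand=False).astype(float)
)

print(f"Total of records with seat values: {int(has_seats.sum()):,}")
print(f"Total of records with NO seat values: {int((~has_seats).sum()):,}")
print(f"Max Seats: {seat_values.max()}")
print(f"Min Seats: {seat_values.min()}")
print(f"Mean Seats: {seat_values.mean():.1f}")

# Fill the gaps with the rounded mean, written in the column's own text format
# so the whole column can be stripped and coerced to floats in one pass later.
fill_text = f"{round(seat_values.mean())} Seats"
df["Seats"] = df["Seats"].fillna(fill_text)
```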

Let's see where we are at:

Brand                   0
Year                    0
Model                   0
Car/Suv                27
UsedOrNew               0
Transmission            0
Engine                  0
DriveType               0
FuelType                0
FuelConsumption         0
Kilometres              0
CylindersinEngine       0
BodyType              281
Doors                1602
Seats                   0
Price                   0
dtype: int64

Now let's figure out what we should do with the missing door values:

It appears we can do the same as we did for seats

Total of records with Door values 15,028
Total of records with NO door values: 1,703

Max Doors: 22.0
Min Doors: 2.0
Mean Doors: 4.0

Sweet, the mean looks good to me again, so we'll use that.
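Since Doors gets the same treatment as Seats, the steps could be wrapped in a small helper. `fill_with_mean_text` is a hypothetical name, not something from the notebook, and the "4 Doors" format and the rounding are assumptions; the data is made up.

```python
import pandas as pd


def fill_with_mean_text(series: pd.Series, unit: str) -> pd.Series:
    """Hypothetical helper: fill gaps with the rounded mean, e.g. '4 Doors'."""
    # Parse the number out of every value that is present.
    numbers = series.dropna().str.extract(r"(\d+)", expand=False).astype(float)
    # Write the rounded mean back in the same text format as the column.
    return series.fillna(f"{round(numbers.mean())} {unit}")


# Made-up values; the "4 Doors" text format is an assumption.
df = pd.DataFrame({"Doors": ["4 Doors", None, "2 Doors", "5 Doors"]})
df["Doors"] = fill_with_mean_text(df["Doors"], "Doors")

print(df["Doors"].tolist())
```

The same call with `"Seats"` as the unit would cover the Seats column.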

Let's see how we are doing:

Brand                  0
Year                   0
Model                  0
Car/Suv               27
UsedOrNew              0
Transmission           0
Engine                 0
DriveType              0
FuelType               0
FuelConsumption        0
Kilometres             0
CylindersinEngine      0
BodyType             281
Doors                  0
Seats                  0
Price                  0
dtype: int64

It seems we have a relatively small number of records with missing "Car/Suv" values, so we can probably get away with dropping these:

Same for BodyType:
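Both drops can be done in a single `dropna` call by listing the two columns in `subset`. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame with made-up values; one gap in each column.
df = pd.DataFrame(
    {
        "Car/Suv": ["SUV", None, "Sedan", "Hatchback"],
        "BodyType": ["SUV", "Wagon", None, "Hatchback"],
    }
)

# Drop any row missing either value.
df = df.dropna(subset=["Car/Suv", "BodyType"])

print(df.isnull().sum())
```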

Let's check the state of our dtypes:

Brand                 object
Year                 float64
Model                 object
Car/Suv               object
UsedOrNew             object
Transmission          object
Engine                object
DriveType             object
FuelType              object
FuelConsumption       object
Kilometres            object
CylindersinEngine     object
BodyType              object
Doors                 object
Seats                 object
Price                 object
dtype: object

To continue, check out the full notebook on Kaggle.