Vehicle Price Prediction (Linear regression)
So I decided to pick a random dataset on Kaggle to try my hand at, and I went with the Australian Price Prediction dataset.
Importing Libraries
Let's start by importing NumPy and pandas, and our dataset.
We will load our dataset from Kaggle:
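A minimal sketch of this step might look like the following. The notebook's actual file path is not shown here, so the in-memory `SAMPLE_CSV` is a hypothetical stand-in; on Kaggle you would pass the real CSV path to `pd.read_csv` instead.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the Kaggle CSV file; replace with the real path.
SAMPLE_CSV = io.StringIO(
    "Brand,Year,Price\n"
    "Toyota,2019,25990\n"
    "Mazda,2021,POA\n"
)

# Read the CSV into a DataFrame and show the first rows.
df = pd.read_csv(SAMPLE_CSV)
print(df.head())
```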
Inspecting Data
Let's see what we are working with:
RangeIndex: 16734 entries, 0 to 16733
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Brand              16733 non-null  object
 1   Year               16733 non-null  float64
 2   Model              16733 non-null  object
 3   Car/Suv            16706 non-null  object
 4   Title              16733 non-null  object
 5   UsedOrNew          16733 non-null  object
 6   Transmission       16733 non-null  object
 7   Engine             16733 non-null  object
 8   DriveType          16733 non-null  object
 9   FuelType           16733 non-null  object
 10  FuelConsumption    16733 non-null  object
 11  Kilometres         16733 non-null  object
 12  ColourExtInt       16733 non-null  object
 13  Location           16284 non-null  object
 14  CylindersinEngine  16733 non-null  object
 15  BodyType           16452 non-null  object
 16  Doors              15130 non-null  object
 17  Seats              15029 non-null  object
 18  Price              16731 non-null  object
dtypes: float64(1), object(18)
memory usage: 2.4+ MB
Print some useful info:
Shape: (16734, 19)
Columns: ['Brand', 'Year', 'Model', 'Car/Suv', 'Title', 'UsedOrNew', 'Transmission', 'Engine', 'DriveType', 'FuelType', 'FuelConsumption', 'Kilometres', 'ColourExtInt', 'Location', 'CylindersinEngine', 'BodyType', 'Doors', 'Seats', 'Price']
Let's look at a preview of our dataset:
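The summaries above are the kind of output you get from `DataFrame.info()`, `DataFrame.shape`, and `DataFrame.columns`. A small sketch on made-up rows (the three-row frame is purely illustrative, not the real data):

```python
import pandas as pd

# Tiny illustrative frame; the real dataset has 16,734 rows and 19 columns.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", "Mazda", None],
        "Year": [2019.0, 2021.0, 2015.0],
        "Price": ["25990", "POA", None],
    }
)

# Column names, non-null counts, and dtypes in one report.
df.info()

# Row/column counts and the column names as a plain list.
print("Shape:", df.shape)
print("Columns:", df.columns.tolist())
```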
It appears we can drop some features that probably won't help us much when making predictions (Title, ColourExtInt, and Location):
Let's see which columns have records with missing values:
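Comparing the column list before and after this step, the columns that disappear are `Title`, `ColourExtInt`, and `Location`. One way to drop them, sketched on a made-up frame:

```python
import pandas as pd

# Illustrative rows only; the column names match the dataset.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", "Mazda"],
        "Title": ["2019 Toyota Corolla", "2021 Mazda CX-5"],
        "ColourExtInt": ["White / Black", "Red / Grey"],
        "Location": ["Sydney, NSW", "Perth, WA"],
        "Price": ["25990", "38500"],
    }
)

# Drop the free-text and location columns we do not plan to model with.
df = df.drop(columns=["Title", "ColourExtInt", "Location"])
print(df.columns.tolist())
```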
Index(['Brand', 'Year', 'Model', 'Car/Suv', 'UsedOrNew', 'Transmission', 'Engine', 'DriveType', 'FuelType', 'FuelConsumption', 'Kilometres', 'CylindersinEngine', 'BodyType', 'Doors', 'Seats', 'Price'], dtype='object')
Brand                   1
Year                    1
Model                   1
Car/Suv                28
UsedOrNew               1
Transmission            1
Engine                  1
DriveType               1
FuelType                1
FuelConsumption         1
Kilometres              1
CylindersinEngine       1
BodyType              282
Doors                1604
Seats                1705
Price                   3
dtype: int64
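Both outputs above can be produced from `isnull()`. This is a sketch on a small made-up frame, not necessarily the exact cells from the notebook:

```python
import pandas as pd

# Illustrative rows with a few gaps.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", None, "Kia"],
        "Seats": ["5 Seats", None, None],
        "Price": ["25990", "38500", "19990"],
    }
)

# Names of the columns that contain at least one missing value.
cols_with_missing = df.columns[df.isnull().any()]
print(cols_with_missing)

# Number of missing values in each of those columns.
missing_counts = df[cols_with_missing].isnull().sum()
print(missing_counts)
```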
Inspect data with missing values:
It seems we are missing a few prices. Since Price is the target we want to train on, we will drop the records where it is missing.
Take a peek at what our data is lookin' like now:
Check the number of unique values for each feature:
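Dropping only the rows that lack a target can be done with `dropna(subset=...)`. A sketch on made-up rows (resetting the index afterwards is my own choice here, not something the notebook is known to do):

```python
import pandas as pd

# Illustrative rows; one record has no price.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", "Mazda", "Kia"],
        "Price": ["25990", None, "19990"],
    }
)

# Keep only rows where the target column is present.
df = df.dropna(subset=["Price"]).reset_index(drop=True)
print(df)
```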
Brand                   76
Year                    45
Model                  781
Car/Suv                618
UsedOrNew                3
Transmission             3
Engine                 106
DriveType                5
FuelType                 9
FuelConsumption        157
Kilometres           14261
CylindersinEngine       11
BodyType                10
Doors                   13
Seats                   13
Price                 3794
dtype: int64
And the missing values per column now:
Brand                   0
Year                    0
Model                   0
Car/Suv                27
UsedOrNew               0
Transmission            0
Engine                  0
DriveType               0
FuelType                0
FuelConsumption         0
Kilometres              0
CylindersinEngine       0
BodyType              281
Doors                1602
Seats                1703
Price                   0
dtype: int64
Starting Our Data Preprocessing
As we can see, we have quite a few missing Seats and Doors values, so let's tackle these two first. There are two options. One is to strip any non-numeric characters out of the column and coerce it to float, so that missing or non-numeric values become NaN, and then fill every NaN with the mean of the values that are present. The other is to take the records that do have a value, read their seat count as a float to compute summary statistics, fill the missing entries using the same string format, and strip the text and coerce to float later. This is the route I decided to go with.
Total of records with seat values: 15,028
Total of records with NO seat values: 1,703
Max Seats: 22.0
Min Seats: 2.0
Mean Seats: 5.1
The mean looks good to me for filling in our missing values.
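The second route could be sketched roughly as follows. The `"5 Seats"` string format and the rounding of the mean to a whole number are assumptions on my part, and the five-row column is made up for illustration:

```python
import pandas as pd

# Illustrative column; the "<n> Seats" format is an assumption.
df = pd.DataFrame({"Seats": ["5 Seats", "7 Seats", None, "2 Seats", None]})

# Pull the digits out of each string and coerce to float (missing stays NaN).
seat_numbers = pd.to_numeric(
    df["Seats"].str.extract(r"(\d+)", expand=False), errors="coerce"
)

print(f"Total of records with seat values: {seat_numbers.notna().sum():,}")
print(f"Total of records with NO seat values: {seat_numbers.isna().sum():,}")
print(f"Max Seats: {seat_numbers.max()}")
print(f"Min Seats: {seat_numbers.min()}")
print(f"Mean Seats: {seat_numbers.mean():.1f}")

# Fill the gaps with the rounded mean, kept in the original string format
# so the text can be stripped and coerced to float in one pass later.
mean_seats = int(round(float(seat_numbers.mean())))
df["Seats"] = df["Seats"].fillna(f"{mean_seats} Seats")
print(df["Seats"].tolist())
```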
Let's see where we are at:
Brand                   0
Year                    0
Model                   0
Car/Suv                27
UsedOrNew               0
Transmission            0
Engine                  0
DriveType               0
FuelType                0
FuelConsumption         0
Kilometres              0
CylindersinEngine       0
BodyType              281
Doors                1602
Seats                   0
Price                   0
dtype: int64
Now let's figure out what we should do with the missing door values:
It appears we can do the same as we did for seats:
Total of records with Door values: 15,028
Total of records with NO door values: 1,703
Max Doors: 22.0
Min Doors: 2.0
Mean Doors: 4.0
Sweet, the mean looks good to me again, so we'll use that.
Let's see how we are doing:
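One thing worth noting: the printed door summary repeats the seat counts (15,028 and 1,703), while the missing-value table lists 1,602 missing Doors, so it may be worth deriving the counts from the Doors column itself. A reusable helper makes that harder to get wrong. This is only a sketch; `fill_with_mean_label` is a hypothetical name, and the `"4 Doors"` format and the rounding are assumptions:

```python
import pandas as pd


def fill_with_mean_label(column: pd.Series, label: str) -> pd.Series:
    """Fill missing entries with the rounded mean, formatted as '<n> <label>'."""
    # Extract the digits from each string and coerce to float.
    numbers = pd.to_numeric(
        column.str.extract(r"(\d+)", expand=False), errors="coerce"
    )
    # Report counts taken from this column, not from a previous one.
    print(f"Total of records with {label} values: {column.notna().sum():,}")
    print(f"Total of records with NO {label} values: {column.isna().sum():,}")
    print(f"Mean {label}: {numbers.mean():.1f}")
    mean_value = int(round(float(numbers.mean())))
    return column.fillna(f"{mean_value} {label}")


# Illustrative column; the real data has far more rows.
doors = pd.Series(["4 Doors", None, "5 Doors", "2 Doors", "4 Doors"], name="Doors")
filled = fill_with_mean_label(doors, "Doors")
print(filled.tolist())
```

The same helper could be called with `"Seats"` as the label, which keeps the two columns on one code path.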
Brand                   0
Year                    0
Model                   0
Car/Suv                27
UsedOrNew               0
Transmission            0
Engine                  0
DriveType               0
FuelType                0
FuelConsumption         0
Kilometres              0
CylindersinEngine       0
BodyType              281
Doors                   0
Seats                   0
Price                   0
dtype: int64
It seems we have a relatively small number of records with missing "Car/Suv" values, so we can probably get away with dropping these:
Same for BodyType:
Let's check the state of our dtypes:
Brand                 object
Year                 float64
Model                 object
Car/Suv               object
UsedOrNew             object
Transmission          object
Engine                object
DriveType             object
FuelType              object
FuelConsumption       object
Kilometres            object
CylindersinEngine     object
BodyType              object
Doors                 object
Seats                 object
Price                 object
dtype: object
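Both drops can be expressed with `dropna(subset=...)`. A sketch on made-up rows, doing the two columns in a single call (the notebook may well do them one at a time):

```python
import pandas as pd

# Illustrative rows; two records are missing one of the two columns.
df = pd.DataFrame(
    {
        "Brand": ["Toyota", "Mazda", "Kia", "Ford"],
        "Car/Suv": ["SUV", None, "Hatchback", "Ute"],
        "BodyType": ["SUV", "Sedan", None, "Ute / Tray"],
    }
)

# Drop any row missing either value; a row needs both to be kept.
df = df.dropna(subset=["Car/Suv", "BodyType"])
print(df.isnull().sum())
```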
To continue, check out the full notebook on Kaggle.