Predictive Modeling with Decision Trees
Welcome to this tutorial on building a predictive model with Decision Trees in Python! In the following steps, we will go from raw data to a tuned predictive model. Whether you are a budding data scientist or an experienced analyst, this guide is designed to give you a practical understanding of how to implement a Decision Tree Regressor using the popular scikit-learn library.
We will start by importing necessary libraries that help in data manipulation and machine learning. Following that, we'll define functions to evaluate our model's performance. The heart of this tutorial lies in understanding the data, which involves loading, cleaning, and selecting the right features for our model.
After preparing the data, we will split it into training and validation sets. This is a crucial step: the model is trained on one part of the data and then tested on rows it has never seen. Once we've trained our initial model, we'll evaluate it using Mean Absolute Error (MAE), the average absolute difference between the predicted and actual prices, so a lower MAE means better predictions.
But our work doesn't stop there. We'll look into optimizing our model by tuning one of its parameters, the maximum number of leaf nodes, which controls the size of the decision tree. We'll retrain our model with the best value we find and assess the performance gain.
By the end of this tutorial, you'll have a working model that can predict house prices and a solid understanding of how to evaluate and improve your machine learning models. Let's dive into the world of predictive modeling and unlock the insights hidden in our data!
Step 1: Import Necessary Libraries
First, we need to import the required libraries that will enable us to handle data, build a machine learning model, and evaluate its performance.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
Step 2: Define Functions to Calculate MAE and Print Predictions
Here we define two functions. The first, get_mae, trains a tree with a given maximum number of leaf nodes and returns its mean absolute error (MAE) on the validation data. The second, print_predictions, prints a model's predictions next to the actual values, along with their MAE, for either the first five rows or the entire dataset.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes and return its MAE on the validation set.
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=1)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
def print_predictions(model, size, X_dataframe_as_prediction_input, actual_value_dataframe):
    # size == "head" scores only the first five rows; any other value scores every row.
    if size == "head":
        X_dataframe_as_prediction_input = X_dataframe_as_prediction_input.head()
        actual_value_dataframe = actual_value_dataframe.head()
    preds = model.predict(X_dataframe_as_prediction_input)
    actuals = actual_value_dataframe.tolist()
    mae = mean_absolute_error(actuals, preds)
    print(f"Predictions : {preds}")
    print(f"Actual Values : {actuals}")
    print(f"MAE {size}: {mae}")
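To make the metric concrete, here is a minimal sketch that computes MAE by hand and compares it with scikit-learn's mean_absolute_error. The prices and predictions are hypothetical numbers chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual prices and predictions, chosen only for illustration.
actuals = np.array([200_000, 150_000, 320_000, 410_000])
preds = np.array([210_000, 140_000, 300_000, 450_000])

# MAE is the mean of the absolute differences between prediction and truth.
manual_mae = np.mean(np.abs(actuals - preds))
sklearn_mae = mean_absolute_error(actuals, preds)

print(manual_mae)   # 20000.0
print(sklearn_mae)  # 20000.0
```

The individual errors are 10,000, 10,000, 20,000 and 40,000, so the average is 20,000: on these four homes the predictions are off by 20,000 on average.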
Step 3: Load and Prepare Data
Next, we load our dataset from a CSV file and clean it by dropping rows with missing values. We also preview the column names to understand our data better.
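# iowa_file_path must be set to the location of your housing CSV; this tutorial does not specify the path.
home_data = pd.read_csv(iowa_file_path)
home_data = home_data.dropna(axis=0)  # drop every row that has at least one missing value
print(home_data.columns)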
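Dropping rows is a blunt tool, so it can help to see how many rows it removes. The sketch below uses a small toy frame with hypothetical values (not the tutorial's dataset) to show what dropna(axis=0) does.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; NaN marks a missing entry.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "BuildingArea": [80.0, np.nan, 150.0, 120.0],
    "Price": [500_000, 650_000, np.nan, 700_000],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every row that has at least one missing value.
cleaned = toy.dropna(axis=0)
print(len(toy), "->", len(cleaned))  # 4 -> 2
```

Two of the four rows contain a missing value, so only half of the toy data survives. Running the same before-and-after count on your own data shows how much the cleaning step costs.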
Step 4: Set Target and Features
We choose the column we want to predict, 'Price', as the target and select the feature columns that we will use for making our predictions.
y = home_data['Price']
# Column names, including the spellings 'Lattitude' and 'Longtitude', must match home_data.columns exactly.
feature_columns = ['Distance', 'Landsize', 'BuildingArea', 'Rooms', 'Bathroom', 'Lattitude', 'Longtitude']
X = home_data[feature_columns]
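Selecting a column that does not exist raises a KeyError, so a quick check of the requested names against the frame can save some debugging. This is a minimal sketch on a toy frame; the frame and the `requested` list are hypothetical stand-ins for home_data and feature_columns.

```python
import pandas as pd

# Toy frame standing in for home_data; the values are hypothetical.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Distance": [2.5, 7.0, 12.1],
    "Price": [500_000, 650_000, 700_000],
})

requested = ["Rooms", "Distance", "Landsize"]

# Any requested name that is absent from the frame would raise a KeyError on selection.
missing = [col for col in requested if col not in toy.columns]
print(missing)  # ['Landsize']

# Select only the columns that exist, and pull out the target separately.
available = [col for col in requested if col in toy.columns]
X_toy = toy[available]
y_toy = toy["Price"]
print(X_toy.shape, y_toy.shape)  # (3, 2) (3,)
```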
Step 5: Split Data for Training and Validation
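We split our data into training and validation sets. The model is fit on the training set only, so its error on the validation set estimates how well it is likely to perform on unseen data.
# random_state=1 is an assumed seed, chosen to match the models in this tutorial; any fixed integer makes the split reproducible.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)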
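To see what the split produces, here is a small sketch on synthetic data. The frame, the column names and the value ranges are made up for the demonstration; only the train_test_split call mirrors what the tutorial relies on.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: 100 rows, two hypothetical features.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "Rooms": rng.integers(1, 6, size=100),
    "Distance": rng.uniform(1, 30, size=100),
})
y_demo = pd.Series(rng.uniform(300_000, 900_000, size=100), name="Price")

# With no test_size given, scikit-learn holds out 25% of the rows.
# random_state fixes the shuffle so the split is reproducible.
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

print(len(X_tr), len(X_va))  # 75 25

# Features and targets stay aligned: each part keeps matching row labels.
print(X_tr.index.equals(y_tr.index))  # True
```

If a different ratio suits your data better, train_test_split accepts a test_size argument, for example test_size=0.2.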
Step 6: Train the Model
We then create and train our model with the training data.
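# No max_leaf_nodes is set here, so the tree grows without a size limit; random_state=1 matches get_mae.
iowa_model = DecisionTreeRegressor(random_state=1)
iowa_model.fit(train_X, train_y)
print("Training complete")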
Step 7: Make Predictions and Evaluate the Model
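We use our trained model to make predictions and evaluate it using MAE. Passing "head" limits the output to the first five validation rows, so the MAE printed here is a quick spot check; pass any other string to score the whole validation set.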
print_predictions(iowa_model, "head", val_X, val_y)
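Why evaluate on validation rows rather than on the rows used for training? The sketch below, on synthetic data with a hypothetical linear price signal plus noise, suggests the reason: a tree with no size limit can reproduce its training targets exactly, so its in-sample error says little about new data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a hypothetical linear price signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 50_000 * X_demo[:, 0] + 20_000 * X_demo[:, 1] + rng.normal(0, 30_000, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

# An unconstrained tree keeps splitting until it reproduces the training targets.
model = DecisionTreeRegressor(random_state=1)
model.fit(X_tr, y_tr)

train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
val_mae = mean_absolute_error(y_va, model.predict(X_va))

print(f"In-sample MAE:  {train_mae:.1f}")
print(f"Validation MAE: {val_mae:.1f}")
```

The in-sample MAE comes out at zero because every training row ends up in its own leaf, while the validation MAE reflects the noise the tree has memorized.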
Step 8: Optimize Model with Best Tree Size
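To improve our model's performance, we search for the tree size, set through max_leaf_nodes, that gives the lowest validation MAE. Small trees tend to underfit and very large trees tend to overfit, so we try several candidate sizes and keep the best one.
# Candidate sizes are example values (an assumption); adjust the list to suit your data.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Map each candidate size to the validation MAE of a tree trained with that size.
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
print(f"\nOptimized Tree size: {best_tree_size}")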
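Printing the score for every candidate, not just the winner, makes the underfit-to-overfit trend visible. The following self-contained sketch runs the same kind of search on synthetic data; the data and the candidate list are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a hypothetical linear price signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 50_000 * X_demo[:, 0] + rng.normal(0, 30_000, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

candidates = [2, 5, 25, 50, 100, 200]  # hypothetical candidate sizes

# Train one tree per candidate size and record its validation MAE.
scores = {}
for n_leaves in candidates:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=1)
    tree.fit(X_tr, y_tr)
    scores[n_leaves] = mean_absolute_error(y_va, tree.predict(X_va))
    print(f"max_leaf_nodes={n_leaves:>3}  validation MAE={scores[n_leaves]:,.0f}")

# The best size is the key with the smallest validation MAE.
best = min(scores, key=scores.get)
print("Best size:", best)
```

Because the best size is picked using the validation set, its validation MAE is a slightly optimistic estimate; a third held-out set would give a cleaner final number.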
Step 9: Retrain Model with Optimized Tree Size
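With the best tree size found, we retrain our model using that size, aiming for a lower validation error than the initial model.
optimized_iowa_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
optimized_iowa_model.fit(train_X, train_y)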
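One way to confirm that the size limit took effect is to ask the fitted tree how many leaves it has. This is a small sketch on synthetic data; `best_size` is a hypothetical value standing in for best_tree_size.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a hypothetical linear price signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 50_000 * X_demo[:, 0] + rng.normal(0, 30_000, size=200)

best_size = 25  # hypothetical value standing in for best_tree_size

unconstrained = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
constrained = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=1).fit(X_demo, y_demo)

# get_n_leaves() reports the number of leaf nodes in a fitted tree.
print("Unconstrained leaves:", unconstrained.get_n_leaves())
print("Constrained leaves:  ", constrained.get_n_leaves())
```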
Step 10: Evaluate Optimized Model
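Finally, we evaluate our optimized model on the validation set to see whether the tuning improved its predictive accuracy. This call scores the same first five validation rows as Step 7, so the two MAE values can be compared directly.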
print_predictions(optimized_iowa_model, "head", val_X, val_y)
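An MAE computed on five rows can swing a lot from one sample to the next, so a comparison over the whole validation set is usually more trustworthy. The sketch below, on synthetic data with a hypothetical tree size of 25, prints both figures side by side for a baseline tree and a size-limited one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a hypothetical linear price signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 50_000 * X_demo[:, 0] + rng.normal(0, 30_000, size=300)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

baseline = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
tuned = DecisionTreeRegressor(max_leaf_nodes=25, random_state=1).fit(X_tr, y_tr)

# For each model, score the first five validation rows and the full validation set.
results = {}
for name, model in [("baseline", baseline), ("tuned", tuned)]:
    head_mae = mean_absolute_error(y_va[:5], model.predict(X_va[:5]))
    full_mae = mean_absolute_error(y_va, model.predict(X_va))
    results[name] = (head_mae, full_mae)
    print(f"{name:>8}: MAE on 5 rows = {head_mae:,.0f}, MAE on all rows = {full_mae:,.0f}")
```

With the tutorial's own helper, the equivalent is to pass any string other than "head" as the size argument so that every validation row is scored.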