print("Random Forest")
Random forest is a supervised learning model used for both classification and regression tasks. In this chapter, we'll focus on classification. As the name suggests, a random forest uses a group of decision trees to make its predictions.
A decision tree is simply a method of making decisions using a tree-like structure. It looks something like this:
The decision tree starts with a root node, and each node has at most two child nodes. The illustration above shows a job offer decision. Starting from the root node, we go to the right child if the answer is "Yes" and to the left child if it is "No", repeating this at each node until we reach a node with no children, called a leaf node.
In the random forest model, several such trees each make a prediction, and the prediction with the highest count among the trees becomes the final prediction. This aggregation of trees is a technique known as the ensemble method.
With the ensemble method, each decision tree makes its own prediction, and the predictions are aggregated to identify the most popular result.
Let's try to explain this with an illustration.
Consider a boxing match (Fury vs. Usyk), which typically has three judges. After the 12th round, each judge submits an independent scorecard, and the contestant with the majority vote wins: if two of the three judges score the fight for one boxer, that's 2/3, so that boxer wins. Random forests work the same way: each tree makes an independent prediction, and the prediction made by the majority of the trees wins.
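To make the voting idea concrete, here is a minimal sketch in Python; the individual tree predictions are made up purely for illustration:

```python
from collections import Counter

# Hypothetical predictions from three individual decision trees for one patient
tree_predictions = ["positive", "negative", "positive"]

# The forest's final prediction is simply the majority vote among the trees
final_prediction = Counter(tree_predictions).most_common(1)[0][0]

print(final_prediction)  # "positive", because 2 of the 3 trees agree
```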
Playground
Play time! retzam-ai.vercel.app. In this chapter, we trained a model to predict whether a patient will test positive for diabetes. We used a random forest; check it out directly here.
Disclaimer: This is not a medical diagnosis.
Enter the patient's details as shown below.
The model will then predict whether the patient will test positive for diabetes.
The image below shows the predictions for the model and the performance report.
Hands-On
We'll use Python for the hands-on section, so a little bit of Python programming experience will help. If you are not too familiar with Python, follow along anyway; the comments are very explicit and detailed.
We'll use Google Colaboratory as our code editor; it is easy to use and requires zero setup. Here is an article to get started.
Here is a link to our organization on GitHub, github.com/retzam-ai, where you can find the code for all the models and projects we work on. We are excited to see you contribute to our project repositories.
For this demo project, we used a Kaggle dataset of about 100,000 patient records, available here. We'll train a model that predicts whether a patient will test positive for diabetes.
For the complete code for this tutorial, check the pdf here.
Data Preprocessing
Create a new Colab notebook.
Go to Kaggle here to download the patients dataset.
Import the dataset to the project folder.
Import the dataset using pandas, as shown in the sketch below.
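A minimal sketch of the pandas import might look like this; the file name is an assumption, so use the name of the CSV you actually downloaded from Kaggle and uploaded to Colab:

```python
import pandas as pd

# Load the downloaded Kaggle CSV into a DataFrame
# (file name is an assumption; match it to the file you uploaded)
df = pd.read_csv("diabetes_prediction_dataset.csv")

# Take a quick look at the first few rows and the column types
print(df.head())
print(df.info())
```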
We plot histograms to check which features affect the outcome the most or the least. This helps us determine which features to use in training our model and which ones to discard.
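A sketch of this step is below. It assumes the DataFrame df from the previous step and a binary outcome column named diabetes; the actual column names are in the pdf:

```python
import matplotlib.pyplot as plt

# For every numeric feature, compare its distribution for positive vs negative outcomes.
# Features whose distributions barely differ between the two classes tell us little
# about the outcome and are candidates to discard.
for column in df.select_dtypes("number").columns:
    if column == "diabetes":
        continue
    plt.hist(df[df["diabetes"] == 1][column], label="diabetes", alpha=0.7, density=True)
    plt.hist(df[df["diabetes"] == 0][column], label="no diabetes", alpha=0.7, density=True)
    plt.title(column)
    plt.xlabel(column)
    plt.ylabel("Probability")
    plt.legend()
    plt.show()
```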
We then split our dataset into training and test sets using an 80%-20% split.
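One common way to do the 80/20 split is with scikit-learn's train_test_split; this sketch assumes the DataFrame df from above:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows as a test set; the remaining 80% is used for training.
# random_state makes the split reproducible.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(len(train), len(test))
```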
We then scale the dataset. X_train holds the feature vectors, and y_train holds the output, or outcome. The scale_dataset function oversamples and scales the dataset; the pdf document has detailed comments on each line.
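The exact scale_dataset function is in the pdf; a sketch of what it might do, standard scaling plus oversampling the minority class with imbalanced-learn's RandomOverSampler, is shown below:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler

def scale_dataset(dataframe, oversample=False):
    # Assumes the last column is the outcome and the remaining columns are numeric features
    X = dataframe[dataframe.columns[:-1]].values
    y = dataframe[dataframe.columns[-1]].values

    # Scale every feature to zero mean and unit variance
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Optionally oversample the minority class so both classes are equally represented
    if oversample:
        ros = RandomOverSampler()
        X, y = ros.fit_resample(X, y)

    data = np.hstack((X, np.reshape(y, (-1, 1))))
    return data, X, y

# Oversample only the training set; the test set stays untouched
train_data, X_train, y_train = scale_dataset(train, oversample=True)
test_data, X_test, y_test = scale_dataset(test, oversample=False)
```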
Performance Review
First, we'll need to make predictions with our newly trained model using our test dataset.
Then we'll compare those predictions with the actual outputs/targets in the test dataset to measure the performance of our model.
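Assuming a RandomForestClassifier was trained on X_train and y_train as in the pdf, the prediction and reporting step could look like this sketch:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Train the random forest on the preprocessed training set
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)

# Predict the outcome for every patient in the test set
y_pred = rf_model.predict(X_test)

# Compare predictions against the actual targets: precision, recall, f1-score, accuracy
print(classification_report(y_test, y_pred))
```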
From the image above we can see the classification report.
The accuracy is still above 90%; for our 3rd model in a row, we've gotten accuracy above 96%.
Don't forget to play in the playground and compare the models' classification reports and predictions across each dataset.
End of hands-on
We've completed classification models in supervised learning. Congratulations, we just completed a section of our journey!
Up next we'll talk about regression models in supervised learning tasks.
Let's keep on!