Supervised Learning Regression

print("Supervised learning regression")

In supervised learning, regression is used to predict continuous values. The target we're predicting is quantitative (a number), not a category.

In our previous chapters, we talked about supervised learning classification, which is used when the data fall into groups (classes), also known as qualitative data. Something like age groups (toddler, teenager, adult) or height classes (short, average, tall).

Regression, on the other hand, works with data that are continuous, like age (21, 22, 23, 24...) and height (1.70, 1.71, 1.72, 1.73...). So to make predictions on this type of data, we use regression. Say you want to predict the age of a person or tomorrow's temperature, regression is your guy 🤖.

There are a lot of models used for regression tasks, and you know... more are being created/invented. We've already met some of them in supervised learning classification: models like K-Nearest Neighbors and Random Forest can be used for both classification and regression tasks.
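To see how a model like K-Nearest Neighbors can do regression, here is a minimal sketch in plain Python. Instead of letting the neighbors vote on a class, it averages their numeric targets. The function name `knn_predict` and the house sizes and prices are made up for illustration, and a real project would more likely use a library implementation.

```python
def knn_predict(train_x, train_y, query, k=2):
    """Predict a number for `query` by averaging the targets of its k nearest neighbors."""
    # Pair each feature with its target and sort the pairs by distance to the query.
    by_distance = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - query))
    nearest = by_distance[:k]
    # Classification would take a majority vote here; regression takes the mean.
    return sum(target for _, target in nearest) / k


# Hypothetical data: house size in square meters -> price in thousands.
sizes = [50, 60, 70, 80, 90, 100]
prices = [150, 180, 210, 240, 270, 300]

predicted_price = knn_predict(sizes, prices, query=75, k=2)
print(predicted_price)  # average of the prices for the 70 and 80 sized houses
```

The output is a continuous value (a price), not a class label, which is what makes this a regression task.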

When evaluating a trained regression model, the two main sources of error to look at are bias and variance.

Bias refers to the error due to overly simplistic models that miss important relationships in the data, also called underfitting. Say a model predicts the price of a house from its size alone, without considering other factors like location, number of rooms, and amenities. That's basically bias in regression tasks.

Variance refers to the error due to overly complex models that capture noise in the training data, also called overfitting. Say a model predicts the price of a house by factoring in minute (fancy word 🙂) and irrelevant features like the color of the paint, the swimming pool tiles, etc. At this point, those features just add noise to the model. In our classification tasks, we tried to find the most important features to train our model on and removed the rest; that's the same idea used to keep variance down here in regression.

The key to building effective regression models lies in finding the right balance between bias and variance to minimize the total error and ensure good generalization to new data. So the model is neither underfitting nor overfitting, just balanced 🤖.
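Here is a small sketch of that trade-off in plain Python. Everything in it is an assumption made for illustration: the toy data follow the trend y = 2x, with noise of +2 or -2 added to the training targets, and the three "models" are stand-ins picked to show the idea. One always predicts the average (too simple), one memorizes the training points (too complex), and one fits a straight line with least squares (somewhere in between).

```python
def mae(y_true, y_pred):
    """Mean absolute error between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): the underlying trend is y = 2x.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [4, 2, 8, 6, 12, 10]          # 2x plus alternating noise of +2 / -2
test_x = [1.4, 2.4, 3.4, 4.4, 5.4]      # unseen points
test_y = [2 * x for x in test_x]        # noise-free, to keep the arithmetic simple

# Too simple (high bias): ignore the feature and always predict the average target.
mean_y = sum(train_y) / len(train_y)

def predict_mean(x):
    return mean_y

# Too complex (high variance): memorize the training set and return the target
# of the single closest training point, noise included.
def predict_memorizer(x):
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[closest]

# Balanced: a straight line fitted with ordinary least squares.
mean_x = sum(train_x) / len(train_x)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def predict_line(x):
    return intercept + slope * x

results = {}
for name, model in [("mean", predict_mean),
                    ("memorizer", predict_memorizer),
                    ("line", predict_line)]:
    train_error = mae(train_y, [model(x) for x in train_x])
    test_error = mae(test_y, [model(x) for x in test_x])
    results[name] = (train_error, test_error)
    print(f"{name:<10} train MAE = {train_error:.2f}   test MAE = {test_error:.2f}")
```

On this toy data the memorizer gets a perfect score on the training set but does worse than the line on unseen points, while the always-average model does poorly on both. That is overfitting and underfitting side by side.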

Bias and variance apply to both regression and classification tasks when checking the performance of trained models.

To measure the performance of regression models, we use metrics like R-squared (R²) and Mean Absolute Error (MAE). You can read more about regression performance metrics here.
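As a rough sketch of what these two metrics compute, here they are written out in plain Python using their standard definitions. The helper names and the numbers are made up for illustration, and in practice you'd usually reach for a library implementation.

```python
def mean_absolute_error(y_true, y_pred):
    """Average size of the prediction errors, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the target's variation explained by the model (1.0 is a perfect fit)."""
    mean_true = sum(y_true) / len(y_true)
    # Error left over after using the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Error you'd get by always predicting the mean of the true values.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical true values and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 9.5]

mae_value = mean_absolute_error(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print(f"MAE = {mae_value:.3f}")  # MAE = 0.500
print(f"R2  = {r2_value:.3f}")   # R2  = 0.925
```

A lower MAE is better, since it means the predictions are closer to the true values on average. A higher R² is better, with 1.0 meaning the predictions match the true values exactly.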

We'll talk about one regression model in this series, used primarily for regression tasks: