Supervised learning is a cornerstone of machine learning where labeled data (data paired with the correct answer) is used to train a model to make predictions on unseen data.
Think of it as teaching a child with examples.
You give a computer lots of labeled data (for instance, pictures labeled “cat” or “dog”), so it can learn what makes each category different.
Over time, the computer figures out the patterns in these labeled examples.
Then, when it sees new data—like a picture it’s never seen—it can guess the correct label based on what it learned.
Essentially, supervised learning means the computer is “supervised” by having the right answers during its training, just like a student learning with the help of a teacher’s marked examples.
In this section, we will walk you through the end-to-end process—from gathering and preparing your data to choosing a model and training it with code.
Acquiring Data
You might download a publicly available dataset (e.g., from Kaggle) or collect your own.
Data can arrive in various formats: CSVs, Excel, databases, or even text files.
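As a minimal sketch, pandas can read most of these formats with a single call. The inline CSV below is a made-up placeholder so the snippet runs as-is; in practice you would point `read_csv` at your own file path or URL.

```python
import io
import pandas as pd

# A small inline CSV stands in for a real file so this example runs on its own.
csv_text = """size_sqft,bedrooms,price
1200,3,250000
950,2,180000
1800,4,420000
"""
df = pd.read_csv(io.StringIO(csv_text))

# Other common loaders (paths shown are placeholders):
# df = pd.read_excel("data.xlsx")
# df = pd.read_sql("SELECT * FROM houses", connection)

print(df.head())
print(df.info())
```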
Handling Missing Values
Detect: Check for NaN (Not a Number) or blank entries.
Deal With Them: Drop the affected rows or fill (impute) the gaps, because some ML models can’t handle missing data at all, while others degrade in accuracy if large chunks are missing.
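Here is a small, hedged sketch of both steps using pandas; the column names and values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy DataFrame with a few gaps
df = pd.DataFrame({
    "size_sqft": [1200, np.nan, 950, 1800],
    "price": [250000, 310000, np.nan, 420000],
})

# Detect: count missing entries per column
print(df.isna().sum())

# Option 1: drop any row that contains a missing value
df_dropped = df.dropna()

# Option 2: impute, e.g., fill each column's gaps with that column's mean
df_imputed = df.fillna(df.mean(numeric_only=True))
```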
Categorical Encoding
If you have columns like “City” or “Color,” you can’t directly feed them as strings into most models.
Label Encoding: Assign each category a numeric code (e.g., Red=0, Blue=1, Green=2).
One-Hot Encoding: Create separate binary columns (e.g., is_Red, is_Blue, is_Green), each set to 1 or 0. This approach prevents the model from assuming a numeric relationship (like Green > Red).
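A quick sketch of both encodings in pandas, reusing the Red/Blue/Green example from above (the sample data is made up):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# Label encoding: each category becomes an integer code (assigned in order of appearance)
df["Color_code"] = pd.factorize(df["Color"])[0]

# One-hot encoding: one binary column per category (is_Blue, is_Green, is_Red)
df_onehot = pd.get_dummies(df["Color"], prefix="is")

print(df)
print(df_onehot)
```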
Train-Test Split
Why: Holding back part of the data lets you evaluate the model on examples it never saw during training.
How: Often a simple 80/20 or 70/30 split. In Python, you can do:
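For example, using scikit-learn's train_test_split (Iris is used here only as a stand-in dataset; any feature matrix X and target vector y will work):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# test_size=0.2 gives the common 80/20 split; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```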
Linear Regression
Use Case: Predicting a continuous value (e.g., house price, sales revenue).
How It Works: Finds a line (or hyperplane in multiple dimensions) that best fits your data, minimizing the difference between predicted and actual values.
Advantages: Simple, fast, easily interpretable.
Limitations: Not ideal for highly complex relationships or data with lots of nonlinear patterns.
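As a minimal illustration of that interpretability, here is a tiny fit on made-up house-size data; the numbers are invented and only meant to show how the slope and intercept read off the fitted line.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size (sqft) vs. price
X = np.array([[800], [1000], [1200], [1500], [1800]])
y = np.array([150000, 185000, 220000, 275000, 330000])

model = LinearRegression()
model.fit(X, y)

# The fitted line is easy to interpret: one slope, one intercept
print("price per extra sqft:", model.coef_[0])
print("baseline price:", model.intercept_)
print("predicted price for 1,300 sqft:", model.predict([[1300]])[0])
```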
Decision Trees
Use Case: Can do both regression (predict a number) and classification (predict a category).
How It Works: Splits the data into branches based on feature values, creating a tree structure that aims to group similar outcomes together.
Advantages: Easy to visualize, handles nonlinear data well, and works with both numerical and categorical features.
Limitations: Prone to overfitting if the tree grows too large.
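A short sketch of a decision tree classifier, again using Iris as a stand-in dataset; note the explicit max_depth, which is one way to keep the tree from growing too large.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# max_depth limits how far the tree can grow, which helps curb overfitting
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))
```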
If you suspect your data has a linear trend (e.g., weight vs. height), a linear model might suffice.
If your data is more complex or has clear splits (e.g., if-then rules), a decision tree could capture the patterns better.
Let’s illustrate a basic regression approach with the popular Iris dataset—though Iris is commonly used for classification, you can adapt a similar workflow for house prices or any other regression/classification tasks.
Note: If using a house price dataset, simply replace the loading steps. For the Iris dataset, we’ll pretend we want to predict the petal_length based on other features.
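A sketch of that workflow is below. It loads Iris through scikit-learn (where the column is named "petal length (cm)" rather than petal_length), splits the data, fits a linear regression, and reports the mean squared error.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load Iris as a DataFrame
iris = load_iris(as_frame=True)
df = iris.frame

# Target: petal length; features: every other measurement
y = df["petal length (cm)"]
X = df.drop(columns=["petal length (cm)", "target"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print("Mean Squared Error:", mse)
```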
Look at the MSE value. Lower is better. Compare it against the scale of your target variable. For example, if your petal_length ranges from 1 to 6, an MSE of 0.2 might be decent, while 5.0 would be terrible.
If using Decision Trees, consider tuning parameters like max_depth or min_samples_split to avoid overfitting.
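One common way to tune those parameters is a small grid search with cross-validation; the parameter values below are arbitrary starting points, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Reuse the petal-length regression setup from above
iris = load_iris(as_frame=True)
df = iris.frame
y = df["petal length (cm)"]
X = df.drop(columns=["petal length (cm)", "target"])

# Try several depth and split-size settings to limit overfitting
param_grid = {"max_depth": [2, 3, 5, None], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
```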