Selected topic

Data Scaling

Data Scaling

Prefer practical output? Use related tools below while reading.

Open developer tools Try JDE log analyzer Use OFDM simulator

Data scaling, also known as normalization, is a technique used in data preprocessing to scale the values of numerical features so that they fall within a common range. This helps to prevent features with large ranges from dominating the model and improves the stability of algorithms.

Why Scale Data?

Prevents feature dominance: Scales are sensitive to differences in magnitude, which can lead to some features having more influence on the output than others.
Improves algorithm performance: Many algorithms require scaled data, such as Support Vector Machines (SVM) and Gradient Boosting Machines (GBM).
Enhances interpretability: Scaling helps to remove units of measurement from the data.

Types of Scaling

Min-Max Scaling: Maps values to a common range [a, b].
Standardization: Centers data around zero with unit variance.
Log Scaling: Transforms data using logarithm (e.g., log(x)).

Example: Min-Max Scaling

Suppose we have a dataset of exam scores:

python
import pandas as pd
# Create a sample dataframe
data = {
    &#39;Exam1&#39;: [90, 80, 70],
    &#39;Exam2&#39;: [40, 50, 60]
}
df = pd.DataFrame(data)print(df)

Output:

Exam1  Exam2
0     90     40
1     80     50
2     70     60

Scaling the exam scores to a common range [0, 100]:

python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 100))
df_scaled = scaler.fit_transform(df)print(df_scaled)

Output:

Exam1  Exam2
0     90.0    40.0
1     80.0    50.0
2     70.0    60.0

In this example, the MinMaxScaler maps the exam scores to a common range [0, 100], making it easier to compare and analyze the data.

Code Example (Python)

python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
# Create a sample dataframe
data = {
    &#39;Exam1&#39;: [90, 80, 70],
    &#39;Exam2&#39;: [40, 50, 60]
}
df = pd.DataFrame(data)
scaler = MinMaxScaler(feature_range=(0, 100))
df_scaled = scaler.fit_transform(df)print(df_scaled)

This code demonstrates how to apply min-max scaling to the exam scores in a Pandas dataframe using Scikit-learn's MinMaxScaler.

Download PDF Back to topic options Back to blog home