Data Standardization
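Data standardization is a data preprocessing technique that rescales each numerical feature of a dataset to have a mean of 0 and a standard deviation of 1. Unlike min-max scaling, it does not confine values to a fixed range such as [0, 1]. Its purpose is to put features measured in different units on a comparable scale.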
Consider a dataset with two features, Age and Salary. The raw data looks like this:

| Age | Salary |
| --- | --- |
| 25 | $40000 |
| 30 | $60000 |
| 35 | $80000 |
| ... | ... |
Let's say we want to standardize these features so that each has a mean of 0 and a standard deviation of 1.
For Age, we calculate the mean (μ) and standard deviation (σ) over the full dataset:
μ = (25 + 30 + 35 + ...) / n ≈ 45
σ ≈ 5
Similarly, for Salary, we get:
μ ≈ $60000
σ ≈ $12000
Each value x is then standardized with the z-score formula:
x' = (x - μ) / σ
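The formula can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's own implementation: the helper name `standardize` is hypothetical, the population standard deviation is an assumed choice, and the sample holds only the three Age values visible in the table, so its mean and standard deviation differ from the μ ≈ 45 and σ ≈ 5 quoted for the full dataset.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread, so the formula is undefined.
        raise ValueError("cannot standardize a constant feature (sigma is 0)")
    return [(x - mu) / sigma for x in values]


# Only the three rows shown in the table; the rest of the dataset is elided.
ages = [25, 30, 35]
z_ages = standardize(ages)
```

After the transformation, `z_ages` has a mean of 0 and a population standard deviation of 1, which is the defining property of standardized data.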
Applying this formula to our example data:
For Age:
25 - 45 ≈ -20 ( Age - mean )
(-20) / 5 ≈ -4 ( standardized age )
30 - 45 ≈ -15
(-15) / 5 ≈ -3
35 - 45 ≈ -10
(-10) / 5 ≈ -2
...and so on.
For Salary:
$40000 - $60000 ≈ -$20000 ( Salary - mean )
(-$20000) / $12000 ≈ -1.67 ( standardized salary )
$60000 - $60000 ≈ $0
($0) / $12000 ≈ 0
...
The standardized values for the first three rows are:

| Standardized Age | Standardized Salary |
| --- | --- |
| -4 | -1.67 |
| -3 | 0 |
| -2 | 1.67 |
| ... | ... |
In this example, both Age and Salary are now expressed as the number of standard deviations from their respective means, which makes it easier to compare and analyze them together.
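The arithmetic of the worked example can be checked with a short script. This sketch takes the approximate parameters quoted in the text (μ ≈ 45, σ ≈ 5 for Age; μ ≈ $60000, σ ≈ $12000 for Salary) as given; the function and constant names are illustrative only.

```python
def z_score(x, mu, sigma):
    """Apply x' = (x - mu) / sigma to a single value."""
    return (x - mu) / sigma


# Approximate parameters quoted in the text for the full (elided) dataset.
AGE_MU, AGE_SIGMA = 45, 5
SALARY_MU, SALARY_SIGMA = 60000, 12000

# The three rows visible in the raw data table.
rows = [(25, 40000), (30, 60000), (35, 80000)]

standardized = [
    (
        z_score(age, AGE_MU, AGE_SIGMA),
        round(z_score(salary, SALARY_MU, SALARY_SIGMA), 2),
    )
    for age, salary in rows
]
# standardized == [(-4.0, -1.67), (-3.0, 0.0), (-2.0, 1.67)]
```

A salary equal to the mean maps to 0, and salaries equally far above and below the mean map to values of equal magnitude and opposite sign.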
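In Python, the same transformation can be applied with scikit-learn:

```python
from sklearn.preprocessing import StandardScaler

# Assume 'data' is your DataFrame with numeric Age and Salary columns
# (Salary stored as numbers, without a '$' prefix).
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data[['Age', 'Salary']])
```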
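StandardScaler is scikit-learn's transformer for z-score standardization. With its default settings it centers each feature to a mean of 0 and scales it to unit variance, and fit_transform returns a NumPy array rather than a DataFrame.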
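To see what the scaler learns and produces, the following sketch fits it on a small array. It assumes NumPy and scikit-learn are installed and uses only the three rows visible in the raw data table, so the learned mean and scale differ from the approximate values quoted for the full dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative input: columns are Age and Salary (three visible rows only).
X = np.array([[25, 40000], [30, 60000], [35, 80000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each standardized column should have mean 0 and standard deviation 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)  # population standard deviation (ddof=0)

# Parameters learned during fit, one entry per feature.
learned_means = scaler.mean_
learned_scales = scaler.scale_

# The original values can be recovered from the standardized ones.
X_restored = scaler.inverse_transform(X_scaled)
```

Because the fitted parameters are stored on the scaler, the same `scaler.transform` call can later be applied to new rows so that they are scaled consistently with the data used for fitting.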