Data Validation
=====================================
Data validation is a step in data preprocessing that helps ensure the quality and accuracy of the data. It involves checking the data for completeness (no required values are missing), consistency (values agree with each other and follow one format), and correctness (values fall within plausible bounds). The goal is to identify and correct errors or inconsistencies before the data is used for analysis.
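As a rough illustration of those three kinds of checks, the sketch below uses pandas on a small made-up table. The column names (id, email, score) and the 0 to 100 score range are assumptions chosen only for this example.

```python
import pandas as pd

# Hypothetical sample data: the columns and the 0-100 score range are
# assumptions for illustration only.
records = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "email": ["a@example.com", None, "b@example.com", "c@example.com"],
    "score": [88, 92, 92, 130],
})

# Completeness: count missing values in each column.
missing_per_column = records.isna().sum()

# Consistency: flag rows whose id repeats an earlier row's id.
duplicate_ids = records["id"].duplicated()

# Correctness: flag scores outside the assumed 0-100 range.
out_of_range = ~records["score"].between(0, 100)

print(missing_per_column)
print(records[duplicate_ids | out_of_range])
```

Each check produces a count or a boolean mask, so the problem rows can be inspected before deciding whether to fix or drop them.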
Suppose we have a dataset with a column named age. We want to ensure that all ages are within the valid range of 0 to 150 years.
```python
import pandas as pd

# Create a sample DataFrame
data = {'name': ['John', 'Alice', 'Bob'],
        'age': [25, -1, 200]}
df = pd.DataFrame(data)

# Define a function for data validation
def validate_age(age):
    """Return True if age is within the valid range of 0 to 150, inclusive."""
    return 0 <= age <= 150

# Apply the validation function to the 'age' column
df['age_valid'] = df['age'].apply(validate_age)
```
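In this example, the validation function is applied to every value in the age column. It returns True when the age falls within the valid range (0 to 150, inclusive) and False otherwise, and the results are stored in a new age_valid column. For the sample data, John's age of 25 is valid, while -1 and 200 are flagged as invalid.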
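For larger tables, the same range check can usually be written without a Python-level function. The sketch below is one possible alternative rather than the only way to do it: it uses the pandas method Series.between, which is inclusive on both ends by default, and then splits the rows into those that pass and those that fail. The names valid_rows and invalid_rows are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Alice", "Bob"],
                   "age": [25, -1, 200]})

# Vectorized range check; both bounds are inclusive by default.
df["age_valid"] = df["age"].between(0, 150)

# Boolean indexing separates rows that pass from rows that need review.
valid_rows = df[df["age_valid"]]
invalid_rows = df[~df["age_valid"]]

print(invalid_rows[["name", "age"]])
```

A missing age (NaN) fails the between check and so lands in invalid_rows. If missing values need separate treatment, handle them explicitly before the range check.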