🧹 Data Cleaning with Dropna
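Missing data is a common problem in real-world datasets. Pandas provides `DataFrame.dropna()` to remove rows or columns that contain null values, and its `axis`, `how`, `subset`, and `thresh` parameters control exactly what gets dropped.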
Combined with filling values, removing duplicates, and fixing data types, it covers most of a typical cleaning pipeline, so it is well worth mastering!
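As a minimal sketch of the row versus column distinction, the snippet below uses a small hypothetical frame (`toy`, with made-up columns `a` and `b`, not the dataset from this post) and relies only on the documented `axis` parameter of `dropna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the axis parameter
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, 6.0],
})

rows_dropped = toy.dropna()        # axis=0 (default): drop rows containing any NaN
cols_dropped = toy.dropna(axis=1)  # axis=1: drop columns containing any NaN

print(rows_dropped.shape)  # (2, 2)
print(cols_dropped.shape)  # (3, 1)
```

Note that `dropna()` returns a new DataFrame here, so `toy` itself is left unchanged.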
💻 Code Example:

    import numpy as np
    import pandas as pd

    # Simulate a messy pynfinity user dataset
    raw_data = {
        "user_name": ["santoshtvk", "dhruv", None, "tvk", "alice", "alice"],
        "score": [95, np.nan, 80, 77, np.nan, 60],
        "course": ["Python", "AI", "DevOps", None, "Python", "Python"],
        "joined": ["2024-01-15", "2024-02-20", "invalid_date", "2024-03-01", "2024-01-15", "2024-01-15"],
    }
    df = pd.DataFrame(raw_data)
    print("Raw shape:", df.shape)

    # 1. Inspect missing values
    print("\nMissing per column:\n", df.isnull().sum())

    # 2. Drop rows where EITHER critical column is missing
    #    (dropna defaults to how="any"; pass how="all" to drop only when both are missing)
    df_clean = df.dropna(subset=["user_name", "course"])

    # 3. Fill numeric missing values with the column median
    df_clean = df_clean.copy()  # work on an explicit copy, not a view of df
    df_clean["score"] = df_clean["score"].fillna(df_clean["score"].median())

    # 4. Remove duplicates (keep first occurrence)
    df_clean = df_clean.drop_duplicates(subset=["user_name", "course"])
    print("After dedup:", df_clean.shape)

    # 5. Fix data types
    #    astype(int) truncates fractions: the 77.5 median fill becomes 77
    df_clean["score"] = df_clean["score"].astype(int)
    #    errors="coerce" turns unparseable dates into NaT instead of raising
    df_clean["joined"] = pd.to_datetime(df_clean["joined"], errors="coerce")

    # 6. Rename + reset index
    df_clean = df_clean.rename(columns={"user_name": "username"}).reset_index(drop=True)

    print("\nClean dataset:")
    print(df_clean)
    print("\nDtypes:\n", df_clean.dtypes)

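Step 2 above depends on how `dropna()` treats the columns listed in `subset`. The sketch below compares the default `how="any"` with `how="all"` and with `thresh`. The `users` frame and its values are hypothetical; only the column names mirror the example.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the example above
users = pd.DataFrame({
    "user_name": ["ann", None, None, "bob"],
    "course": ["Python", "AI", None, None],
})

# how="any" (the default): drop a row if EITHER subset column is null
either_missing = users.dropna(subset=["user_name", "course"])

# how="all": drop a row only if BOTH subset columns are null
both_missing = users.dropna(subset=["user_name", "course"], how="all")

# thresh=1: keep rows that have at least one non-null value
at_least_one = users.dropna(thresh=1)

print(len(either_missing))  # 1
print(len(both_missing))    # 3
print(len(at_least_one))    # 3
```

Which variant fits depends on the data: `how="any"` is stricter and suits columns that must all be present, while `how="all"` or `thresh` keeps partially filled rows for later imputation.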
Keep exploring and happy coding! 💻