
Missing data handling

by jhleeatl 2024. 5. 2.

In the previous post, I solved many different types of Python questions on the Titanic dataset. Today, I want to look more closely at how to calculate the average age, which was the 3rd question in the previous post. Since the 'Age' column contains null values, I will consider how to handle these missing values.

 

The most commonly used methods for handling missing data in practice are as follows:

  1. Mean Imputation: Replace missing values with the mean of the observed values in the column. This method is typically used when only a small share of values is missing. It keeps the column mean unchanged but reduces the variance of a continuous variable.
  2. Median Imputation: Replace missing values with the median of the observed values in the column. The median is robust to outliers, so this method suits skewed data.
  3. Mode Imputation: For categorical variables, replace missing values with the mode (the most frequently occurring value) of the column.
  4. Regression Imputation: Fit a regression model on the rows where the value is observed and predict the missing values from other variables. Because it uses relationships in the data, the filled values can be more realistic than a single constant.
  5. Multiple Imputation: Create several completed datasets, each with the missing values filled by plausible values from an imputation model, and then combine the results. This method reflects the uncertainty in the imputed values.

The choice among these methods depends on the characteristics of the data, the proportion of missing values, and the analytical objectives.
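
Methods 4 and 5 are not used in this post, but to make the idea of regression imputation concrete, here is a minimal sketch. It assumes a hypothetical toy DataFrame with columns `x` and `y` (not the Titanic data) and uses a straight-line fit from `np.polyfit` as a stand-in for whatever regression model would be used in practice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'x' is fully observed, 'y' has two missing values.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],
})

observed = toy["y"].notna()

# Fit a straight line y = slope * x + intercept on the rows where y is known.
slope, intercept = np.polyfit(toy.loc[observed, "x"], toy.loc[observed, "y"], deg=1)

# Predict y for every row, then use the predictions only where y is missing.
predicted = slope * toy["x"] + intercept
toy["y_filled"] = toy["y"].fillna(predicted)

print(toy)
```

Unlike mean, median, or mode imputation, each missing row gets its own fill value here, driven by that row's `x`.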

 

I am going to analyze the data using methods 1, 2, and 3 in this post.

 

 

First, I load the data and print df['Age'].

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
a = df['Age']
print(a)

 

 

Result

 

 

First of all, I wanted to check how many null values are in the column, so I used the code below to count them.

 

.isna() checks whether each element in a Pandas Series or DataFrame is a missing value. When called, it returns True for each element that is missing and False otherwise (boolean).

 

print(a.isna().sum())

#result 177
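
Since the proportion of missing values matters when choosing a method, it can help to look at the share as well as the count. The sketch below uses a hypothetical toy Series in place of df['Age']. It relies on the fact that `True` counts as 1, so the sum of the mask is the count and the mean of the mask is the fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series standing in for df['Age'].
ages = pd.Series([22.0, np.nan, 38.0, np.nan, 35.0])

mask = ages.isna()            # boolean Series: True where the value is missing
n_missing = mask.sum()        # True counts as 1, so this is the number of missing values
share_missing = mask.mean()   # mean of a boolean mask = fraction of missing values

print(f'Missing count : {n_missing}')
print(f'Missing share : {share_missing:.1%}')
```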

 

And this is the average age (null values excluded):

df = pd.read_csv(file_path)
a = df['Age']
print(a.mean())  # mean() skips NaN by default

#result = 29.69911764705882

 


 

 

Mean Imputation

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
age = df['Age']
avg_age = age.mean() # Average value
na_age = age.fillna(avg_age) # Mean Imputation

print(f'Null Value : {na_age.isnull().sum()}') # check Null value
print(f'Mean imp   : {na_age.mean()}')

##Result
Null Value : 0
Mean imp  : 29.69911764705882
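
The mean after mean imputation is identical to the mean computed with the nulls excluded. This is expected: filling with the mean cannot move the mean, but it does pull the spread inward. A minimal sketch on a hypothetical toy Series (not the Titanic data) shows both effects:

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with two missing values.
ages = pd.Series([20.0, 30.0, np.nan, 40.0, np.nan])

filled = ages.fillna(ages.mean())   # mean imputation

# The mean is unchanged because every filled value equals the mean.
print(f'Mean before : {ages.mean()}')
print(f'Mean after  : {filled.mean()}')

# The standard deviation shrinks because the filled values add no spread.
print(f'Std before  : {ages.std()}')
print(f'Std after   : {filled.std()}')
```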

 

 

Median Imputation

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
age = df['Age']
median_age = age.median() # Median value

na_age = age.fillna(median_age) # Median Imputation

print(f'Null Value  : {na_age.isnull().sum()}') # check Null value
print(f'Median imp  : {na_age.mean()}') # mean age after median imputation

##result
Null Value  : 0
Median imp  : 29.36158249158249

 

 

Mode Imputation

 

Writing the code for the mode was a little harder than for the other imputations. pandas does provide Series.mode(), but it returns a Series rather than a single value because there can be ties, so here I used value_counts() to find the mode value instead.

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
age = df['Age']

count_value = age.value_counts() #counting the value
print(count_value)

### result
Age
24.00    30
22.00    27
18.00    26
19.00    25
28.00    25
         ..
36.50     1
55.50     1
0.92      1
23.50     1
74.00     1

 

After I got the list of ages and the number of people at each age, I added idxmax() to get the age with the highest count.

 

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
age = df['Age']
max_count_value = age.value_counts().idxmax() # age with the highest count (mode)


print(max_count_value)


##Result
24.0
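
As a cross-check, pandas also offers `Series.mode()` as a more direct route to the same value. The sketch below uses a hypothetical toy Series. Note that `mode()` returns a Series (sorted, with NaN ignored by default), so `.iloc[0]` is needed to get a single value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series: 24.0 appears three times, 22.0 twice.
ages = pd.Series([24.0, 22.0, 24.0, np.nan, 18.0, 22.0, 24.0])

modes = ages.mode()           # Series of the most frequent value(s)
mode_value = modes.iloc[0]    # first entry; the smallest one if there is a tie

# Same answer via the value_counts() + idxmax() approach.
via_counts = ages.value_counts().idxmax()

print(mode_value)
print(via_counts)
```

When several values tie for the highest count, `mode()` returns all of them, so picking `.iloc[0]` is a choice, not the only reasonable one.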

 

Using this max_count_value, I calculate the average age again.

 

import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
age = df['Age']
max_count_value = age.value_counts().idxmax()   # most frequent age (mode)
mode_age = age.fillna(max_count_value)          # replace null values with the mode
avg_mode = mode_age.mean()                      # average after mode imputation

print(f'Null Value  : {mode_age.isnull().sum()}') # check Null value
print(f'Mode impt : {avg_mode}')


##Result
Null Value  : 0
Mode impt : 28.566969696969696

 

 

 

Through these three imputation methods, I got three slightly different averages, as shown below. Working through the handling of missing values was a useful way to see how the choice of fill value changes the result.

 

Mean imp    : 29.69911764705882
Median imp  : 29.36158249158249
Mode imp    : 28.566969696969696
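
The three runs can also be folded into one helper so the comparison is side by side. This is only a sketch: `mean_after_imputation` is a hypothetical name, and the toy Series is made up so the arithmetic is easy to follow (observed mean 24, median 19, mode 18).

```python
import numpy as np
import pandas as pd

def mean_after_imputation(s: pd.Series) -> dict:
    """Return the column mean after filling NaN with the mean, median, and mode."""
    fill_values = {
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().iloc[0],   # first mode if there are ties
    }
    return {name: s.fillna(value).mean() for name, value in fill_values.items()}

# Hypothetical right-skewed toy data with two missing values.
ages = pd.Series([18.0, 18.0, 20.0, 40.0, np.nan, np.nan])

results = mean_after_imputation(ages)
for name, value in results.items():
    print(f'{name:<6} imp : {value}')
```

On this toy data the mean drops as the fill value drops from mean to median to mode, which is the same ordering as in the Titanic results above.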
