
Data preprocessing

by jhleeatl 2024. 5. 10.

 

Data preprocessing refers to the process of cleaning, transforming, and preparing data before it is analyzed. 

This process involves improving the quality of the data and transforming it into a suitable format for analysis, thereby enhancing the performance of analytical models.

 


The main tasks involved in data preprocessing include:

1. Data Cleaning: This involves handling missing values, outliers, and noise in the data to ensure data quality.

2. Data Transformation: Data is transformed to a format suitable for analysis, which may include encoding categorical data, scaling, and normalization.

3. Data Integration: Integrating data from multiple sources and transforming it into a consistent format.

4. Data Reduction: Lowering the complexity of the data, and therefore the computational cost, for example by reducing its dimensionality.
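The four tasks above can be sketched in pandas roughly as follows. This is only a minimal illustration: the orders and customers tables, their column names, and the choice of median filling, one-hot encoding, min-max scaling, and column dropping are all assumptions made for the example, not a prescribed workflow.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [10.0, None, 30.0, 50.0],
    "channel": ["web", "store", "web", "store"],
    "note": ["a", "b", "c", "d"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# 1. Cleaning: fill the missing amount with the column median
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# 2. Transformation: one-hot encode the categorical column, min-max scale the amount
orders = pd.get_dummies(orders, columns=["channel"])
amount_min, amount_max = orders["amount"].min(), orders["amount"].max()
orders["amount_scaled"] = (orders["amount"] - amount_min) / (amount_max - amount_min)

# 3. Integration: combine both sources on the shared key
merged = orders.merge(customers, on="customer_id", how="left")

# 4. Reduction: drop a column that is not needed for the analysis
reduced = merged.drop(columns=["note"])
print(reduced)
```

Dropping a column is the simplest form of reduction. Real projects often use feature selection or techniques such as PCA instead.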



 

Python

After loading the data, review the table first. Here are some functions you can use to inspect it:

 

df.info()

df.info()

#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB

 

 

df.describe()

df.describe()
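As a small sketch of what df.describe() returns, the example below uses a hypothetical four-row table with invented values (not the real dataset shown in the df.info() output). By default only numeric columns are summarized. Passing include="all" also summarizes non-numeric columns.

```python
import pandas as pd

# Hypothetical rows, values invented for illustration
df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 4.0],
    "day": ["Sun", "Sun", "Sat", "Fri"],
})

# Numeric columns only: count, mean, std, min, quartiles, max
summary = df.describe()
print(summary)

# Include non-numeric columns: adds unique, top (most frequent value), and freq
summary_all = df.describe(include="all")
print(summary_all.loc[["count", "unique", "top", "freq"], "day"])
```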

 

 

 

If the table contains NULL (missing) values, you can detect them with df.isnull() or df.isna(). Both return a Boolean table of the same shape as the original, in which True marks a missing value. df.isnull().sum() shows the number of missing values in each column.
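A minimal sketch of these checks, using a hypothetical three-row table with a few missing values inserted on purpose:

```python
import numpy as np
import pandas as pd

# Hypothetical data with deliberately missing entries
df = pd.DataFrame({
    "total_bill": [16.99, np.nan, 21.01],
    "tip": [1.01, 1.66, np.nan],
    "day": ["Sun", None, "Sat"],
})

mask = df.isnull()                     # Boolean DataFrame, True where a value is missing
per_column = df.isnull().sum()         # number of missing values in each column
total = int(df.isnull().sum().sum())   # number of missing values in the whole table

# Keep only the rows that have at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]

print(per_column)
print(total)
print(rows_with_missing)
```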

 


 

The astype() method is used to change the data type of columns in a DataFrame. It allows you to convert the data type of a column to a desired format.

 

For example, you can convert integer data to float, convert data to strings, or convert data to categorical format using this method.

 

Here's an example of how to use the astype() method:

 

import pandas as pd

# Example DataFrame creation
data = pd.DataFrame({'A': [1, 2, 3],
                     'B': ['4', '5', '6']})

# Checking the data types of the columns in the DataFrame
print(data.dtypes)
# Output:
# A     int64
# B    object
# dtype: object

# Converting the 'A' column to float
data['A'] = data['A'].astype(float)

# Converting the 'B' column to integer
data['B'] = data['B'].astype(int)

# Checking the modified DataFrame
print(data.dtypes)
# Output:
# A    float64
# B      int64
# dtype: object
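The string and categorical conversions mentioned above can be sketched the same way. The column names here are hypothetical:

```python
import pandas as pd

# Hypothetical DataFrame for illustration
data = pd.DataFrame({"A": [1, 2, 3], "day": ["Sun", "Sat", "Sun"]})

# Converting integers to strings
data["A_str"] = data["A"].astype(str)

# Converting a text column to the categorical dtype
data["day"] = data["day"].astype("category")

print(data.dtypes)
print(data["day"].cat.categories)  # the distinct categories found in the column
```

A categorical column stores each distinct value once, which can save memory when a column has many repeated values.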

 

 

iloc, loc

 

.iloc[row, column] : select by integer position

 

data.iloc[0,2]
# A specific value can be selected by its row and column positions (both start at 0)

 

import pandas as pd

# Creating a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)

# Selecting specific rows and columns using iloc
selected_data = df.iloc[1:4, 0:2]  # Selecting rows from index 1 to 3 and columns from index 0 to 1
print(selected_data)

 

 

 

.loc[row, column] : select by label (name)

 

data.loc['row_label', 'column_name']
# Specific data can also be selected through row labels and column names

 

import pandas as pd

# Creating a sample DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# Selecting specific rows and columns using loc
selected_data = df.loc['b':'d', 'A':'B']  # Selecting rows with labels 'b' to 'd' and columns from 'A' to 'B'
print(selected_data)
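One difference between the two is easy to miss: iloc slices exclude the end position, while loc slices include the end label. The sketch below reuses the sample DataFrame from above to show that the two selections return the same block:

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50], "C": [100, 200, 300, 400, 500]},
    index=["a", "b", "c", "d", "e"],
)

# Position-based: rows 1..3 and columns 0..1 (the end of each slice is excluded)
by_position = df.iloc[1:4, 0:2]

# Label-based: rows 'b'..'d' and columns 'A'..'B' (the end of each slice is included)
by_label = df.loc["b":"d", "A":"B"]

print(by_position.equals(by_label))

# A single cell, selected both ways
print(df.iloc[0, 2], df.loc["a", "C"])
```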

 

 


 

Selecting the columns

 

# To select an entire column, use a full slice (:) for the rows.
data.loc[:, 'column_name']

# Alternatively, you can also use DataFrame['column_name'] to select the same values.
data['column_name']

# When selecting multiple columns, you can use a list to specify the columns.
data[['column_name1', 'column_name2', 'column_name3']]

# You can select the columns in the order you desire when selecting multiple columns.
data[['column_name3', 'column_name1', 'column_name2']]

 

 

When selecting 2 or more cells, you can use a list to specify the rows or columns.

# Selecting two columns from one row.
data.loc['row_name', ['column_name1', 'column_name2']]

# Selecting one column from two rows.
data.loc[['row_name1', 'row_name2'], 'column_name1']

# Using a slice, you can specify a range to select.
data.loc['row_name', 'column_name1':]  # 'column_name1': ==> Means from 'column_name1' to the last column.
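The selection patterns above use placeholder names. A runnable sketch with a hypothetical DataFrame (row labels x, y, z and columns A, B, C are invented for the example) looks like this:

```python
import pandas as pd

# Hypothetical DataFrame for illustration
data = pd.DataFrame(
    {"A": [1, 2, 3], "B": [10, 20, 30], "C": [100, 200, 300]},
    index=["x", "y", "z"],
)

one_col = data["A"]                     # a single column comes back as a Series
reordered = data[["C", "A"]]            # a list of columns comes back as a DataFrame, in the given order
two_cells = data.loc["x", ["A", "B"]]   # two columns from row 'x'
two_rows = data.loc[["x", "z"], "A"]    # one column from rows 'x' and 'z'
from_b = data.loc["x", "B":]            # row 'x', from column 'B' to the last column

print(reordered)
print(two_cells)
print(from_b)
```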

 

 


 

 

Using Boolean Indexing:

1. Filtering with a single condition:

 

Selecting rows based on a condition set on a specific column.

# Filtering rows where the 'age' column is 30 or older.
df[df['age'] >= 30]

 

 

2. Filtering with multiple conditions:

 

Combining multiple conditions for complex filtering.

# Filtering rows where the 'age' column is 30 or older and the 'gender' column is 'Male'.
df[(df['age'] >= 30) & (df['gender'] == 'Male')]

 

 

3. Filtering specific columns based on conditions:

Selecting only specific columns for rows that satisfy the condition.

# Selecting the 'name' column for rows where the 'age' column is 30 or older.
df.loc[df['age'] >= 30, 'name']

 

 

4. Filtering using isin():

Selecting rows whose value in a column matches any value in a given list.

# Filtering rows where the 'gender' column is either 'Male' or 'Female'.
df[df['gender'].isin(['Male', 'Female'])]
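The four filters above can be tried on a small table. The names, ages, and genders below are invented for illustration. Note that each condition needs its own parentheses, and that conditions are combined with & (and), | (or), and ~ (not) rather than Python's and, or, not.

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "name": ["Kim", "Lee", "Park", "Choi"],
    "age": [25, 30, 41, 35],
    "gender": ["Male", "Female", "Male", "Female"],
})

older = df[df["age"] >= 30]                                   # single condition
older_men = df[(df["age"] >= 30) & (df["gender"] == "Male")]  # two conditions combined with &
older_names = df.loc[df["age"] >= 30, "name"]                 # condition plus column selection
not_male = df[~df["gender"].isin(["Male"])]                   # ~ negates the isin() mask

print(older)
print(older_men)
print(older_names)
print(not_male)
```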

 

 

What is the isin() method?

 

A method of Series and DataFrame objects that checks whether each element is among a given set of values and returns Boolean (True/False) results. It is useful for quickly filtering or selecting data based on desired conditions.

 

How to use isin():

# Checking a Series (a single DataFrame column) for one value.
import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': ['apple', 'banana', 'orange', 'grape', 'melon']}

df = pd.DataFrame(data)

# Checking which rows of the 'B' column are 'banana' (returns a Boolean Series).
result = df['B'].isin(['banana'])
print(result)

 

 

Checking for multiple values in a Series or DataFrame column.

# Marking rows where the 'A' column is either 2 or 4 (returns a Boolean Series).
result = df['A'].isin([2, 4])
print(result)

 

 

Using isin() for multiple columns in a DataFrame.

# Checking each column against its own list of values (returns a Boolean DataFrame; no rows are removed yet).
result = df.isin({'A': [1, 3], 'B': ['apple', 'orange']})
print(result)
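Because isin() only returns True/False values, the result still has to be used as a mask to filter rows. A minimal sketch follows. The value lists differ slightly from the example above so that the "all columns match" and "any column matches" cases give different results.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5],
                   "B": ["apple", "banana", "orange", "grape", "melon"]})

# Boolean DataFrame: each column is checked against its own list
mask = df.isin({"A": [1, 3], "B": ["apple", "banana"]})

both = df[mask.all(axis=1)]    # rows where every column matched
either = df[mask.any(axis=1)]  # rows where at least one column matched

# Filtering with a Series mask from a single column
a_rows = df[df["A"].isin([2, 4])]

print(both)
print(either)
print(a_rows)
```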
