
Project - Bank Service Status

by jhleeatl 2024. 5. 20.

Analysis and Improvement Project for Bank Service Status

 

I have started a new project. Our team consists of four members, and I have taken on the role of team leader. Together we are carrying out an "Analysis and Improvement Project for Bank Service Status."

 

We downloaded the data from Kaggle.

 

https://www.kaggle.com/datasets/khanmdsaifullahanjar/bank-user-dataset

 

Bank User Dataset: this dataset contains user behaviors contributing to their credit score.

 


 

Project Name   Analysis and Improvement Project for Bank Service Status
Description    Identify the current status and issues of the bank's services, analyze customer data to segment customers and surface problems, and provide insights through visualizations along with recommendations for improvement.

 

Category      Description
Language      Python, SQL
Library       Pandas, Matplotlib, Seaborn, NumPy
Visual Tool   Python, Canva

 


Project Result

 

 


 

Project Key Points

  1. Data Collection and Preprocessing
    •   Data preprocessing to ensure the accuracy and reliability of customer data
  2. Service Status Analysis
    •   Assess the current status through key service metrics
  3. Customer Segmentation/Loan Segmentation
    •   Segment based on age, occupation, income, etc.
    •   Analyze usage characteristics for each segment
  4. Identification of Issues
    •   Identify key issues such as missed due dates, payment delays, and outstanding debts
    •   Derive problems and patterns
  5. Deriving Insights and Visualization
    •   Derive insights based on analysis results
    •   Prepare materials to be communicated to relevant departments using visualization tools
    •   Develop improvement plans and detailed execution strategies

 


 

Exploratory Data Analysis (EDA)

Data Information

Import Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

file_path = "/content/sample_data/Bank Data.csv"
df = pd.read_csv(file_path)
df

 

 

Information

df.info()

#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44332 entries, 0 to 44331
Data columns (total 27 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        44332 non-null  object 
 1   Customer_ID               44332 non-null  object 
 2   Month                     44332 non-null  object 
 3   Name                      39868 non-null  object 
 4   Age                       44332 non-null  object 
 5   SSN                       44332 non-null  object 
 6   Occupation                44332 non-null  object 
 7   Annual_Income             44332 non-null  object 
 8   Monthly_Inhand_Salary     37674 non-null  float64
 9   Num_Bank_Accounts         44332 non-null  int64  
 10  Num_Credit_Card           44332 non-null  int64  
 11  Interest_Rate             44332 non-null  int64  
 12  Num_of_Loan               44332 non-null  object 
 13  Type_of_Loan              39312 non-null  object 
 14  Delay_from_due_date       44332 non-null  int64  
 15  Num_of_Delayed_Payment    41230 non-null  object 
 16  Changed_Credit_Limit      44332 non-null  object 
 17  Num_Credit_Inquiries      43442 non-null  float64
 18  Credit_Mix                44332 non-null  object 
 19  Outstanding_Debt          44332 non-null  object 
 20  Credit_Utilization_Ratio  44332 non-null  float64
 21  Credit_History_Age        40383 non-null  object 
 22  Payment_of_Min_Amount     44332 non-null  object 
 23  Total_EMI_per_month       44332 non-null  float64
 24  Amount_invested_monthly   42313 non-null  object 
 25  Payment_Behaviour         44332 non-null  object 
 26  Monthly_Balance           43832 non-null  object 
dtypes: float64(4), int64(4), object(19)
memory usage: 9.1+ MB

 

 

The dataset consists of 50,000 entries across 27 columns. Upon examining the data, we discovered several issues: most of the columns contained malformed values, and handling the outliers and missing values turned out to be very complex.

 

df.head()

 

 

 

df.shape

#result
(50000, 27)

 

 

After reviewing all the columns, we decided which items to use and which format each should be converted to. We made the following decisions:

Column                     Dtype     Convert to
Month                      object    object
Age                        object    int64
Num_Bank_Accounts          int64     int64
Type_of_Loan               object    object
Num_Credit_Inquiries       float64   int64
Credit_History_Age         object    float64
SSN                        object    object
Num_Credit_Card            int64     int64
Delay_from_due_date        int64     int64
Num_of_Delayed_Payment     object    int64
Amount_invested_monthly    object    float64
Annual_Income              object    float64
Name                       object    object
Interest_Rate              int64     int64
Changed_Credit_Limit       object    float64
Credit_Mix                 object    category
Credit_Utilization_Ratio   float64   float64
Monthly_Balance            object    float64
Occupation                 object    object
Monthly_Inhand_Salary      float64   float64
Num_of_Loan                object    int64
Outstanding_Debt           object    float64
Total_EMI_per_month        float64   float64
Payment_Behaviour          object    object
ID                         object    object
Customer_ID                object    object
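
As a sketch, a plan like this can be applied in one pass with a dtype mapping once the text cleanup described in the next section is done (the mapping below covers only a few columns, for illustration):

# Illustrative: apply target dtypes from the conversion table above.
# Assumes the object columns have already been cleaned of stray
# characters ('_', '-', etc.), as shown in the preprocessing section.
target_dtypes = {
    'Age': 'int64',
    'Annual_Income': 'float64',
    'Num_Credit_Inquiries': 'int64',
    'Changed_Credit_Limit': 'float64',
    'Credit_Mix': 'category',
    'Monthly_Balance': 'float64',
}
df = df.astype(target_dtypes)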

 

 


Data Preprocessing

Listing the preprocessing for every column would be excessive, so I will describe a few representative steps.

 

 

Num_of_Loan

 

Looking at the unique values of the Num_of_Loan column, we found special characters that effectively act as missing-value markers: '-' and '_'. Because of these, Num_of_Loan could not be converted to int, so the following code removes them.

 

 

# Get rid of '-' and '_' and convert from object to int
df['Num_of_Loan'] = df['Num_of_Loan'].str.replace('_', '')
df['Num_of_Loan'] = df['Num_of_Loan'].str.replace('-', '')
df['Num_of_Loan'] = df['Num_of_Loan'].astype(int)

 

After running this code, the column's dtype is converted to int.
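
A quick sanity check (a sketch) confirms the conversion:

print(df['Num_of_Loan'].dtype)          # expected: int64
print(df['Num_of_Loan'].unique()[:10])  # sample of the cleaned values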

 

 

After removing all the missing-value markers in this manner, I created a box plot of the counts to check for outliers.

 

import matplotlib.pyplot as plt

# Boxplot
plt.figure(figsize=(8, 6))
plt.boxplot(df['Num_of_Loan'])
plt.title('Boxplot of Num_of_Loan')
plt.ylabel('Value')
plt.show()

 

The result was peculiar: Num_of_Loan should equal the number of entries in Type_of_Loan, which contained at most 9 loans, yet the box plot showed values far beyond that.
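
This can be verified directly from Type_of_Loan; the sketch below counts the comma-separated loan types per row:

# Count comma-separated loan types per row (rows with no loans count as 0)
loan_counts = df['Type_of_Loan'].str.split(',').str.len().fillna(0)
print(loan_counts.max())  # 9, as noted above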

 

 

import matplotlib.pyplot as plt

# Filter the dataframe to include only values less than or equal to 15
filtered_df = df[df['Num_of_Loan'] <= 15]

# Create a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(filtered_df['Num_of_Loan'])
plt.title('Boxplot of Num_of_Loan (<= 15)')
plt.ylabel('Value')
plt.show()

 

 

Since the recorded values were not accurate, I counted the entries in Type_of_Loan and rebuilt Num_of_Loan from them.

 

# Rebuild Num_of_Loan as the number of comma-separated loan types;
# rows with no Type_of_Loan get 0
df['Num_of_Loan'] = df['Type_of_Loan'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
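
A short check (sketch) that the rebuilt column now stays within the expected range:

print(df['Num_of_Loan'].max())                       # expected: 9
print(df['Num_of_Loan'].value_counts().sort_index()) # distribution of loan counts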

 

 

 

 

Outliers in all columns were handled using the code below.

### Data preprocessing
## 'Name'
df['Name'] = df['Name'].str.replace(r'[-,."]', ' ', regex=True)  # strip stray punctuation

## 'Age'
df['Age'] = df['Age'].str.replace('_','')
df['Age'] = df['Age'].astype(int)
cond = (df['Age']>80) | (df['Age']<0)
df.loc[cond, 'Age'] = np.nan

## 'SSN'
df['SSN'] = df['SSN'].replace('#F%$D@*&8', np.nan)

## 'Occupation'
df['Occupation'] = df['Occupation'].replace('_______', np.nan)

## 'Annual_Income'
df['Annual_Income'] = df['Annual_Income'].str.replace(r'[-_]','', regex=True)
df['Annual_Income'] = df['Annual_Income'].astype('float')
df.loc[df['Annual_Income']>=180000,'Annual_Income'] = np.nan

## 'Monthly_Inhand_Salary'
# n/a

## 'Num_Bank_Accounts'
df.loc[df['Num_Bank_Accounts'] > 15, 'Num_Bank_Accounts'] = np.nan

## 'Num_Credit_Card'
cond = df['Num_Credit_Card'] >= 12
df.loc[cond,'Num_Credit_Card'] = np.nan

## 'Interest_Rate'
df.loc[df['Interest_Rate']>40,'Interest_Rate'] = np.nan

## 'Num_of_Loan'
df['Num_of_Loan'] = df['Num_of_Loan'].str.replace(r'[-_]','', regex=True)
df['Num_of_Loan'] = df['Num_of_Loan'].astype('int')
df.loc[df['Num_of_Loan']>=10,'Num_of_Loan'] = np.nan

## 'Type_of_Loan'
df['Type_of_Loan'] = df['Type_of_Loan'].str.replace('and ','')

## 'Delay_from_due_date'
# n/a

## 'Num_of_Delayed_Payment'
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].str.replace(r'[-_]','', regex=True)
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].astype('float')
df.loc[df['Num_of_Delayed_Payment']>30,'Num_of_Delayed_Payment'] = np.nan

## 'Changed_Credit_Limit'
# Keep only values that are numeric once the decimal point is dropped
cond = df['Changed_Credit_Limit'].str.replace('.', '', regex=False).str.isnumeric()
df.loc[~cond, 'Changed_Credit_Limit'] = np.nan
df['Changed_Credit_Limit'] = df['Changed_Credit_Limit'].astype('float')

## 'Num_Credit_Inquiries'
df.loc[df['Num_Credit_Inquiries'] > 20, 'Num_Credit_Inquiries'] = np.nan

## 'Credit_Mix'
df['Credit_Mix'] = df['Credit_Mix'].replace('_', np.nan)

## 'Outstanding_Debt'
df['Outstanding_Debt'] = df['Outstanding_Debt'].str.replace('_','')
df['Outstanding_Debt'] = df['Outstanding_Debt'].astype(float)

## 'Credit_Utilization_Ratio'
# n/a

## 'Credit_History_Age'
# n/a

## 'Payment_of_Min_Amount'
# delete
del df['Payment_of_Min_Amount']

## 'Total_EMI_per_month'
df.loc[df['Total_EMI_per_month']>1800,'Total_EMI_per_month'] = np.nan

## 'Amount_invested_monthly'
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].str.replace(r'[-_]','', regex=True)
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].astype('float')
df.loc[df['Amount_invested_monthly']>=2250,'Amount_invested_monthly'] = np.nan

## 'Payment_Behaviour'
df['Payment_Behaviour'] = df['Payment_Behaviour'].replace('!@9#%8', np.nan)

## 'Monthly_Balance'
df['Monthly_Balance'] = df['Monthly_Balance'].replace('__-333333333333333333333333333__', np.nan)
df['Monthly_Balance'] = df['Monthly_Balance'].astype('float')
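
Since every outlier above was replaced with np.nan, this step introduces new missing values. A quick check (sketch) shows where they are concentrated before group-filling:

# Missing values remaining after outlier handling
print(df.isna().sum().sort_values(ascending=False).head(10))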

 

 

The null values generated this way were then filled within groups keyed by Customer_ID. For example, if a customer's Occupation was missing in the October record but present in the September, November, and December records, it was filled from those months using the code below.

 

def fillna_by_group(data, col, by):
    """Fill missing values in `col` from other rows of the same Customer_ID."""
    df = data
    cond = df[col].isna()
    if by == 'mode':
        # pd.Series.mode returns an ndarray when a group has ties (or is empty)
        t = df.groupby('Customer_ID')[col].agg(pd.Series.mode)
        df.loc[cond, col] = df.loc[cond, 'Customer_ID'].apply(
            lambda x: t[x][0] if (type(t[x]) == np.ndarray) & (len(t[x]) > 0) else t[x])
    else:
        if by == 'mean':
            t = df.groupby('Customer_ID')[col].mean()
        elif by == 'max':
            t = df.groupby('Customer_ID')[col].max()
        df.loc[cond, col] = df.loc[cond, 'Customer_ID'].apply(lambda x: t[x])

 

 

## mode
mode_list = ['Name', 'SSN', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Credit_History_Age', 'Payment_Behaviour']

for i in mode_list:
    fillna_by_group(df, i, 'mode')
    # Cells that received an ndarray (tied or empty mode) are reset to NaN
    a = df[i].apply(lambda x: type(x) == np.ndarray)
    df.loc[a, i] = np.nan

## mean
mean_list = ['Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate',
             'Num_of_Loan', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
             'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']

for i in mean_list:
    fillna_by_group(df, i, 'mean')

## max
# Age
fillna_by_group(df, 'Age', 'max')

 

 

Through this, the handling of missing values and outliers was completed.

 

 
df.info()

#result

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 26 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   ID                        50000 non-null  object 
 1   Customer_ID               50000 non-null  object 
 2   Month                     50000 non-null  object 
 3   Name                      49996 non-null  object 
 4   Age                       50000 non-null  float64
 5   SSN                       50000 non-null  object 
 6   Occupation                50000 non-null  object 
 7   Annual_Income             50000 non-null  float64
 8   Monthly_Inhand_Salary     49972 non-null  float64
 9   Num_Bank_Accounts         50000 non-null  float64
 10  Num_Credit_Card           50000 non-null  float64
 11  Interest_Rate             50000 non-null  float64
 12  Num_of_Loan               50000 non-null  float64
 13  Type_of_Loan              44296 non-null  object 
 14  Delay_from_due_date       50000 non-null  int64  
 15  Num_of_Delayed_Payment    49996 non-null  float64
 16  Changed_Credit_Limit      50000 non-null  float64
 17  Num_Credit_Inquiries      50000 non-null  float64
 18  Credit_Mix                49916 non-null  object 
 19  Outstanding_Debt          50000 non-null  float64
 20  Credit_Utilization_Ratio  50000 non-null  float64
 21  Credit_History_Age        49996 non-null  object 
 22  Total_EMI_per_month       50000 non-null  float64
 23  Amount_invested_monthly   49996 non-null  float64
 24  Payment_Behaviour         50000 non-null  object 
 25  Monthly_Balance           49996 non-null  float64
dtypes: float64(15), int64(1), object(10)
memory usage: 9.9+ MB

 

 


Data Analysis

 

Customer Segment

 

Based on the preprocessed data, I attempted various analyses. The focus was on understanding the current state of the bank's services and on identifying customer segments, loan segments, and segments relating customers to loans.

 

# Select the columns
customer_segment = df[['Age', 'Monthly_Inhand_Salary', 'Annual_Income',
                       'Num_Bank_Accounts', 'Num_Credit_Card', 'Delay_from_due_date',
                       'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']]

# Correlation matrix
correlation_matrix = customer_segment.corr()

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='YlGnBu', fmt=".2f")
plt.title('Customer Segment Heatmap')
plt.show()

 

 

Analyzing this heatmap, the most significant value observed was the 0.63 correlation between Monthly_Inhand_Salary and Amount_invested_monthly.

 

I created scatter plots to examine the data in more detail.

 

# Scatter
plt.figure(figsize=(10, 7))
plt.scatter(df['Monthly_Inhand_Salary'], df['Amount_invested_monthly'], color='b', alpha=0.7)
plt.title('Scatter Plot: Monthly Inhand Salary vs. Amount Invested Monthly')
plt.xlabel('Monthly Inhand Salary')
plt.ylabel('Amount Invested Monthly')
plt.grid(True)
plt.show()

 

 

 

 

import pandas as pd
import matplotlib.pyplot as plt

# Select the columns of interest
data = df[['Occupation', 'Age', 'Monthly_Inhand_Salary', 'Customer_ID', 'Annual_Income',
           'Num_Bank_Accounts', 'Num_Credit_Card', 'Delay_from_due_date', 'Total_EMI_per_month',
           'Amount_invested_monthly', 'Monthly_Balance']]

# Keep only the numeric columns
numeric_columns = data.select_dtypes(include=['number']).columns

# Histograms
data[numeric_columns].hist(bins=15, figsize=(15, 10), layout=(3, 4))
plt.tight_layout()
plt.show()

 

 

 

 

 

# Occupation counts
occupation_counts = df['Occupation'].value_counts()

# colour
colors = sns.color_palette('pastel', len(occupation_counts))

# pie chart
plt.figure(figsize=(10, 7))
plt.pie(occupation_counts, labels=occupation_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Number of People by Occupation')
plt.axis('equal')
plt.show()

 

 

 

 

Loan Segment

 

 

 

 

 

Customer & Loan Segment

 

 

 

 

 

 


Conclusion

 

1. Distribution of Data: Customer occupations, loans, and salaries by occupation are all evenly distributed.
   
2. Credit Utilization Rate Management:
    - Overall rate: 34%
    - High-risk group: 5.9%
    - Management is necessary across all age groups, in line with demographic trends.

3. Age Group Analysis (see the sketch after this list):
   - Individuals in their 50s have the highest average monthly salary, investment, and bank balance of any age group.
   - No bad credit ratings were found among individuals in their 50s.
   - Although this group leads in credit ratings, salary, investments, and bank balances, its overall balance contribution is lower because the group is smaller.

4. Investment Rates: Many customers show a low investment rate relative to their salary.
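
A sketch of how the age-group comparison in point 3 could be reproduced (grouping by decade is my assumption; the column names come from the dataset):

# Compare key metrics by age decade (e.g. 50 covers ages 50 to 59)
decade = (df['Age'] // 10 * 10).astype(int)
print(df.groupby(decade)[['Monthly_Inhand_Salary',
                          'Amount_invested_monthly',
                          'Monthly_Balance']].mean())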

 


Insights


Service Status Analysis:

  1. Customer Demographics:
    •   Very high proportion of customers aged 20 to 40.
    •   Even distribution across occupations, loan products, and salary levels.
  2. Credit Ratings and Delinquency:
    •   Low credit ratings are associated with significantly higher delinquency days, requiring continuous monitoring.
  3. Credit Utilization Rate:
    •   High average credit utilization rate (the recommended level for individuals is below 30%); these figures can be recomputed as sketched below.
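
A minimal sketch of that computation, assuming the preprocessed df:

# Average utilization, plus the share of records above the 30% guideline
# and above the 35% risk threshold cited in the loan segment below
print(df['Credit_Utilization_Ratio'].mean())
print((df['Credit_Utilization_Ratio'] > 30).mean() * 100, '% of records above 30%')
print((df['Credit_Utilization_Ratio'] > 35).mean() * 100, '% of records above 35%')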

 

Customer Segments:

  1. Targeting Low Investment Customers:
    •   Focus on customers whose investment rate is low relative to their salary for new financial products or marketing initiatives (a way to flag them is sketched after this list).
  2. Maintaining Services for 50s Customers:
    •   Continue providing existing services to customers in their 50s, who have high investment rates, salaries, and bank balances.
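
A minimal sketch of how such low-investment customers could be flagged (the 10% cutoff is an illustrative assumption, not a figure from the analysis):

# Investment rate: monthly investment relative to in-hand salary
invest_rate = df['Amount_invested_monthly'] / df['Monthly_Inhand_Salary']
low_investors = df[invest_rate < 0.10]  # hypothetical 10% threshold
print(len(low_investors), 'records invest less than 10% of salary')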

Loan Segment:

  1. Risk Reduction:
    •   Explore options to reduce the share of customers with risky credit utilization above 35%, who account for 34% of the total.

 

 


 

Feedback Summary:

  1. Strengths:
    •   Outlier and Missing Data Handling: A lot of effort was put into removing outliers and missing values, with thorough consideration for each column, which was well done.
    •   Visualization Efforts: Using scatter plots to visually segment specific customers by keywords and heatmaps was highly effective.
    •   Exploratory Data Analysis (EDA) and Visualization: The EDA and visualizations were excellent, with interesting attempts to analyze salary and investment amounts.
  2. Areas for Improvement:
    •   Column Descriptions: There was a lack of detailed explanations for the columns, which was a drawback.
    •   Clarity of Goals: Although the data aimed to present a big picture, the goals were not well-defined. Each analysis segment was good, but the overall direction was unclear.
    •   Communication: The message being conveyed was somewhat lacking. Having more detailed and specific objectives would have been beneficial.
    •   Detail on High-Potential Segments: While the analysis on high-income, low-investment individuals was interesting, more details on these specific segments would have been valuable.
    •   Number of Columns: There were too many columns, making the data difficult to manage and interpret.
  3. Additional Notes:
    •   The data provided insights into individual behavior patterns.
    •   The EDA and visualizations were very well done.