Analysis and Improvement Project for Bank Service Status
I recently started a new project as the leader of a four-person team. Together we are working on the "Analysis and Improvement Project for Bank Service Status."
We downloaded the data from Kaggle.
https://www.kaggle.com/datasets/khanmdsaifullahanjar/bank-user-dataset
Project Name | Description |
--- | --- |
Analysis and Improvement Project for Bank Service Status | Identify the current status and issues of the bank's services, segment and analyze customer data to surface problems, and deliver insights through visualizations along with recommendations for improvement |
Category | Description |
--- | --- |
Language | Python, SQL |
Libraries | pandas, Matplotlib, seaborn, NumPy |
Visualization Tools | Python, Canva |
Project Result
Project Key Points
- Data Collection and Preprocessing
  - Preprocess the data to ensure the accuracy and reliability of customer data
- Service Status Analysis
  - Assess the current status through key service metrics
- Customer Segmentation / Loan Segmentation
  - Segment based on age, occupation, income, etc.
  - Analyze usage characteristics for each segment
- Identification of Issues
  - Identify key issues such as payment deadlines, delays, and unpaid debts
  - Derive problems and patterns
- Deriving Insights and Visualization
  - Derive insights based on the analysis results
  - Prepare materials for relevant departments using visualization tools
  - Develop improvement plans and detailed execution strategies
Exploratory Data Analysis (EDA)
Data Information
Import Data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
file_path = "/content/sample_data/Bank Data.csv"
df = pd.read_csv(file_path)
df
Information
df.info()
#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44332 entries, 0 to 44331
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 44332 non-null object
1 Customer_ID 44332 non-null object
2 Month 44332 non-null object
3 Name 39868 non-null object
4 Age 44332 non-null object
5 SSN 44332 non-null object
6 Occupation 44332 non-null object
7 Annual_Income 44332 non-null object
8 Monthly_Inhand_Salary 37674 non-null float64
9 Num_Bank_Accounts 44332 non-null int64
10 Num_Credit_Card 44332 non-null int64
11 Interest_Rate 44332 non-null int64
12 Num_of_Loan 44332 non-null object
13 Type_of_Loan 39312 non-null object
14 Delay_from_due_date 44332 non-null int64
15 Num_of_Delayed_Payment 41230 non-null object
16 Changed_Credit_Limit 44332 non-null object
17 Num_Credit_Inquiries 43442 non-null float64
18 Credit_Mix 44332 non-null object
19 Outstanding_Debt 44332 non-null object
20 Credit_Utilization_Ratio 44332 non-null float64
21 Credit_History_Age 40383 non-null object
22 Payment_of_Min_Amount 44332 non-null object
23 Total_EMI_per_month 44332 non-null float64
24 Amount_invested_monthly 42313 non-null object
25 Payment_Behaviour 44332 non-null object
26 Monthly_Balance 43832 non-null object
dtypes: float64(4), int64(4), object(19)
memory usage: 9.1+ MB
The dataset consists of 50,000 entries. On inspection, we found numerous issues: most of the columns contained malformed values, and handling the outliers and missing values turned out to be quite involved.
df.head()
df.shape
#result
(50000, 27)
After reviewing all the columns, we decided which ones to use and what type each should be converted to:
Column | Dtype | Convert to |
--- | --- | --- |
Month | Object | Object |
Age | Object | int64 |
Num_Bank_Accounts | int64 | int64 |
Type_of_Loan | Object | Object |
Num_Credit_Inquiries | float64 | int64 |
Credit_History_Age | Object | float64 |
SSN | Object | Object |
Num_Credit_Card | int64 | int64 |
Delay_from_due_date | int64 | int64 |
Num_of_Delayed_Payment | Object | int64 |
Amount_invested_monthly | Object | float64 |
Annual_Income | Object | float64 |
Name | Object | Object |
Interest_Rate | int64 | int64 |
Changed_Credit_Limit | Object | float64 |
Credit_Mix | Object | Category |
Credit_Utilization_Ratio | float64 | float64 |
Monthly_Balance | Object | float64 |
Occupation | Object | Object |
Monthly_Inhand_Salary | float64 | float64 |
Num_of_Loan | Object | int64 |
Outstanding_Debt | Object | float64 |
Total_EMI_per_month | float64 | float64 |
Payment_Behaviour | Object | Object |
ID | Object | Object |
Customer_ID | Object | Object |
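One non-obvious conversion in this table is Credit_History_Age, which holds strings such as "22 Years and 1 Months". As a minimal sketch (assuming that string pattern, and writing to a hypothetical Credit_History_Age_Months column rather than overwriting the original), the Object → float64 conversion could look like this:
import re
import numpy as np

def credit_history_to_months(value):
    # Parse strings like "22 Years and 1 Months" into a float month count;
    # missing or unparseable values become NaN.
    if not isinstance(value, str):
        return np.nan
    m = re.match(r'(\d+)\s+Years?\s+and\s+(\d+)\s+Months?', value)
    return float(int(m.group(1)) * 12 + int(m.group(2))) if m else np.nan

df['Credit_History_Age_Months'] = df['Credit_History_Age'].apply(credit_history_to_months)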
Data Preprocessing
Listing every column's preprocessing here would be excessive, so I will describe a few representative steps.
Num_of_Loan
The unique values of the Num_of_Loan column revealed special characters, '-' and '_', that effectively mark missing values; because of them, the column could not be converted to int. The following code removes them.
# Strip the '-' and '_' placeholders, then convert from object to int
df['Num_of_Loan'] = df['Num_of_Loan'].str.replace(r'[-_]', '', regex=True)
df['Num_of_Loan'] = df['Num_of_Loan'].astype(int)
After running this code, the column's dtype is int.
With the placeholder characters removed, I drew a box plot of the counts to check for outliers.
import matplotlib.pyplot as plt
# Boxplot
plt.figure(figsize=(8, 6))
plt.boxplot(df['Num_of_Loan'])
plt.title('Boxplot of Num_of_Loan')
plt.ylabel('Value')
plt.show()
The result was peculiar: the box plot showed implausibly large values, even though Num_of_Loan should equal the number of entries in Type_of_Loan, which never lists more than 9 loans.
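As a quick check of that observation, the comma-separated entries in Type_of_Loan can be counted directly; a small sketch:
# Count the comma-separated loan types per row; per the observation above,
# the maximum should be around 9.
type_counts = df['Type_of_Loan'].dropna().apply(lambda s: len(s.split(',')))
print(type_counts.max())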
import matplotlib.pyplot as plt
# Filter the dataframe to include only values less than or equal to 15
filtered_df = df[df['Num_of_Loan'] <= 15]
# Create a box plot
plt.figure(figsize=(8, 6))
plt.boxplot(filtered_df['Num_of_Loan'])
plt.title('Boxplot of Num_of_Loan (<= 15)')
plt.ylabel('Value')
plt.show()
Since the recorded values were clearly unreliable, I rebuilt Num_of_Loan by counting the loan types listed in Type_of_Loan.
# Rebuild Num_of_Loan as the number of comma-separated entries in Type_of_Loan (0 when missing)
df['Num_of_Loan'] = df['Type_of_Loan'].apply(lambda x: len(x.split(',')) if isinstance(x, str) else 0)
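A quick look at the rebuilt column is a useful sanity check; the counts should now top out around 9 rather than at the extreme values seen earlier:
# Distribution of the rebuilt loan counts
print(df['Num_of_Loan'].value_counts().sort_index())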
The outliers and placeholder values in all columns were handled with the consolidated script below (written against the raw data, so a few steps repeat what was shown above).
### Data preprocessing
## 'Name'
df['Name'] = df['Name'].str.replace(r'[-,."]', ' ', regex=True)
## 'Age'
df['Age'] = df['Age'].str.replace('_','')
df['Age'] = df['Age'].astype(int)
cond = (df['Age']>80) | (df['Age']<0)
df.loc[cond, 'Age'] = np.nan
## 'SSN'
df['SSN'] = df['SSN'].replace('#F%$D@*&8', np.nan)
## 'Occupation'
df['Occupation'] = df['Occupation'].replace('_______', np.nan)
## 'Annual_Income'
df['Annual_Income'] = df['Annual_Income'].str.replace(r'[-_]','', regex=True)
df['Annual_Income'] = df['Annual_Income'].astype('float')
df.loc[df['Annual_Income']>=180000,'Annual_Income'] = np.nan
## 'Monthly_Inhand_Salary'
# n/a
## 'Num_Bank_Accounts'
df.loc[df['Num_Bank_Accounts'] > 15, 'Num_Bank_Accounts'] = np.nan
## 'Num_Credit_Card'
cond = df['Num_Credit_Card'] >= 12
df.loc[cond,'Num_Credit_Card'] = np.nan
## 'Interest_Rate'
df.loc[df['Interest_Rate']>40,'Interest_Rate'] = np.nan
## 'Num_of_Loan'
df['Num_of_Loan'] = df['Num_of_Loan'].str.replace(r'[-_]','', regex=True)
df['Num_of_Loan'] = df['Num_of_Loan'].astype('int')
df.loc[df['Num_of_Loan']>=10,'Num_of_Loan'] = np.nan
## 'Type_of_Loan'
df['Type_of_Loan'] = df['Type_of_Loan'].str.replace('and ','')
## 'Delay_from_due_date'
# n/a
## 'Num_of_Delayed_Payment'
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].str.replace(r'[-_]','', regex=True)
df['Num_of_Delayed_Payment'] = df['Num_of_Delayed_Payment'].astype('float')
df.loc[df['Num_of_Delayed_Payment']>30,'Num_of_Delayed_Payment'] = np.nan
## 'Changed_Credit_Limit'
cond = df['Changed_Credit_Limit'].str.replace('.', '', regex=False).str.isnumeric()
df.loc[~cond, 'Changed_Credit_Limit'] = np.nan
df['Changed_Credit_Limit'] = df['Changed_Credit_Limit'].astype('float')
## 'Num_Credit_Inquiries'
df.loc[df['Num_Credit_Inquiries'] > 20, 'Num_Credit_Inquiries'] = np.nan
## 'Credit_Mix'
df['Credit_Mix'] = df['Credit_Mix'].replace('_', np.nan)
## 'Outstanding_Debt'
df['Outstanding_Debt'] = df['Outstanding_Debt'].str.replace('_','')
df['Outstanding_Debt'] = df['Outstanding_Debt'].astype(float)
## 'Credit_Utilization_Ratio'
# n/a
## 'Credit_History_Age'
# n/a
## 'Payment_of_Min_Amount'
# delete
del df['Payment_of_Min_Amount']
## 'Total_EMI_per_month'
df.loc[df['Total_EMI_per_month']>1800,'Total_EMI_per_month'] = np.nan
## 'Amount_invested_monthly'
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].str.replace(r'[-_]','', regex=True)
df['Amount_invested_monthly'] = df['Amount_invested_monthly'].astype('float')
df.loc[df['Amount_invested_monthly']>=2250,'Amount_invested_monthly'] = np.nan
## 'Payment_Behaviour'
df['Payment_Behaviour'] = df['Payment_Behaviour'].replace('!@9#%8', np.nan)
## 'Monthly_Balance'
df['Monthly_Balance'] = df['Monthly_Balance'].replace('__-333333333333333333333333333__', np.nan)
df['Monthly_Balance'] = df['Monthly_Balance'].astype('float')
The null values produced above were then filled within groups keyed on Customer_ID. For example, if a customer's Occupation was missing for one month but present in their other monthly records, it was filled from those records using the code below.
def fillna_by_group(data, col, by):
    df = data
    cond = df[col].isna()
    if by == 'mode':
        # Per-customer mode; agg returns a scalar for a single mode and an
        # ndarray when the group has multiple modes or no valid value
        t = df.groupby('Customer_ID')[col].agg(pd.Series.mode)
        df.loc[cond, col] = df.loc[cond, 'Customer_ID'].apply(
            lambda x: t[x][0] if isinstance(t[x], np.ndarray) and len(t[x]) > 0 else t[x])
    else:
        if by == 'mean':
            t = df.groupby('Customer_ID')[col].mean()
        elif by == 'max':
            t = df.groupby('Customer_ID')[col].max()
        df.loc[cond, col] = df.loc[cond, 'Customer_ID'].apply(lambda x: t[x])
## mode
mode_list = ['Name', 'SSN', 'Occupation', 'Type_of_Loan', 'Credit_Mix', 'Credit_History_Age', 'Payment_Behaviour']
for i in mode_list:
    fillna_by_group(df, i, 'mode')
    # Groups with no valid mode leave empty arrays behind; reset them to NaN
    a = df[i].apply(lambda x: type(x) == np.ndarray)
    df.loc[a, i] = np.nan
## mean
mean_list = ['Annual_Income', 'Monthly_Inhand_Salary', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Interest_Rate',
             'Num_of_Loan', 'Num_of_Delayed_Payment', 'Changed_Credit_Limit', 'Num_Credit_Inquiries',
             'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']
for i in mean_list:
    fillna_by_group(df, i, 'mean')
## max
# Age: fill with the maximum value observed for that customer
fillna_by_group(df, 'Age', 'max')
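To confirm the group-based fills behaved as intended, a simple check of the remaining null counts is helpful; a minimal sketch:
# Remaining nulls per column after the group-based filling; columns whose
# group had no valid value at all will still show a small residue.
print(df.isna().sum().sort_values(ascending=False).head(10))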
Through this, the handling of missing values and outliers was completed.
df.info()
#result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 50000 non-null object
1 Customer_ID 50000 non-null object
2 Month 50000 non-null object
3 Name 49996 non-null object
4 Age 50000 non-null float64
5 SSN 50000 non-null object
6 Occupation 50000 non-null object
7 Annual_Income 50000 non-null float64
8 Monthly_Inhand_Salary 49972 non-null float64
9 Num_Bank_Accounts 50000 non-null float64
10 Num_Credit_Card 50000 non-null float64
11 Interest_Rate 50000 non-null float64
12 Num_of_Loan 50000 non-null float64
13 Type_of_Loan 44296 non-null object
14 Delay_from_due_date 50000 non-null int64
15 Num_of_Delayed_Payment 49996 non-null float64
16 Changed_Credit_Limit 50000 non-null float64
17 Num_Credit_Inquiries 50000 non-null float64
18 Credit_Mix 49916 non-null object
19 Outstanding_Debt 50000 non-null float64
20 Credit_Utilization_Ratio 50000 non-null float64
21 Credit_History_Age 49996 non-null object
22 Total_EMI_per_month 50000 non-null float64
23 Amount_invested_monthly 49996 non-null float64
24 Payment_Behaviour 50000 non-null object
25 Monthly_Balance 49996 non-null float64
dtypes: float64(15), int64(1), object(10)
memory usage: 9.9+ MB
Data Analysis
Customer Segment
Based on the preprocessed data, I carried out several analyses, focusing on the current state of the bank's services and on three views of the data: customer segments, loan segments, and the relationship between the two.
# Select the numeric columns for the customer view
Customer_Segment = df[['Age', 'Monthly_Inhand_Salary', 'Annual_Income', 'Num_Bank_Accounts', 'Num_Credit_Card', 'Delay_from_due_date', 'Total_EMI_per_month', 'Amount_invested_monthly', 'Monthly_Balance']]
# Correlation matrix
correlation_matrix = Customer_Segment.corr()
# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='YlGnBu', fmt=".2f")
plt.title('Customer Segment Heatmap')
plt.show()
The most notable value in this heatmap was the 0.63 correlation between Monthly_Inhand_Salary and Amount_invested_monthly.
I created scatter plots to examine the data in more detail.
# Scatter
plt.figure(figsize=(10, 7))
plt.scatter(df['Monthly_Inhand_Salary'], df['Amount_invested_monthly'], color='b', alpha=0.7)
plt.title('Scatter Plot: Monthly Inhand Salary vs. Amount Invested Monthly')
plt.xlabel('Monthly Inhand Salary')
plt.ylabel('Amount Invested Monthly')
plt.grid(True)
plt.show()
import pandas as pd
import matplotlib.pyplot as plt
# Columns for the distribution overview
data = df[['Occupation', 'Age', 'Monthly_Inhand_Salary', 'Customer_ID', 'Annual_Income',
           'Num_Bank_Accounts', 'Num_Credit_Card', 'Delay_from_due_date', 'Total_EMI_per_month',
           'Amount_invested_monthly', 'Monthly_Balance']]
# Keep only the numeric columns
numeric_columns = data.select_dtypes(include=['number']).columns
# Histograms of each numeric column
data[numeric_columns].hist(bins=15, figsize=(15, 10), layout=(3, 4))
plt.tight_layout()
plt.show()
# Occupation counts
occupation_counts = df['Occupation'].value_counts()
# colour
colors = sns.color_palette('pastel', len(occupation_counts))
# pie chart
plt.figure(figsize=(10, 7))
plt.pie(occupation_counts, labels=occupation_counts.index, autopct='%1.1f%%', startangle=140, colors=colors)
plt.title('Number of People by Occupation')
plt.axis('equal')
plt.show()
Loan Segment
Customer & Loan Segment
Conclusion
1. Distribution of Data: Customer occupations, loans, and salaries by occupation are all evenly distributed.
2. Credit Utilization Rate Management:
- Overall rate: 34%
- High-risk group: 5.9%
- Management is needed across all age groups, in line with the demographic distribution.
3. Age Group Analysis (see the sketch after this list):
- Individuals in their 50s have the highest average monthly salary, investment amount, and bank balance of any group.
- No bad credit ratings were found among individuals in their 50s.
- Although this group leads in credit ratings, salary, investments, and balances, its overall weight is small because relatively few customers fall in this age group.
4. Investment Rates: Many customers show a low investment rate relative to their salary.
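As referenced in point 3, the age-group comparison can be reproduced with a groupby over age bands. This is a hypothetical sketch: the Age_Group column and its bin edges are illustrative assumptions, not part of the original analysis code.
# Hypothetical sketch: compare average salary, investment, and balance by age band
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 20, 30, 40, 50, 60, 80],
                         labels=['<20', '20s', '30s', '40s', '50s', '60+'])
print(df.groupby('Age_Group', observed=True)[
    ['Monthly_Inhand_Salary', 'Amount_invested_monthly', 'Monthly_Balance']
].mean())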
Insights
Service Status Analysis:
- Customer Demographics:
- Very high proportion of customers aged 20 to 40.
- Even distribution across occupations, loan products, and salary levels.
- Credit Ratings and Delinquency:
- Low credit ratings are associated with significantly higher delinquency days and require continuous monitoring (see the sketch after this list).
- Credit Utilization Rate:
- High average credit utilization rate (recommended for individuals: less than 30%).
Customer Segments:
- Targeting Low Investment Customers:
- Focus on customers with low investment rates relative to their salary for new financial products or marketing initiatives.
- Maintaining Services for 50s Customers:
- Continue providing existing services to customers in their 50s, who have high investment rates, salaries, and bank balances.
Loan Segment:
- Risk Reduction:
- Explore options to reduce the share of risky credit utilization rates above 35%, which account for 34% of the total (see the sketch after this list).
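Two of the figures cited above can be checked with short computations on the preprocessed frame. This is a hedged sketch: it assumes Credit_Mix is the "credit rating" referred to and that the risk threshold is a utilization ratio above 35.
# Mean delinquency days by credit rating (assuming Credit_Mix is the rating)
print(df.groupby('Credit_Mix')['Delay_from_due_date'].mean())

# Share of records whose credit utilization ratio exceeds 35%
risky_share = (df['Credit_Utilization_Ratio'] > 35).mean()
print(f"{risky_share:.1%} of records exceed 35% credit utilization")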
Feedback Summary:
- Strengths:
- Outlier and Missing Data Handling: A lot of effort was put into removing outliers and missing values, with thorough consideration for each column, which was well done.
- Visualization Efforts: Using scatter plots to visually segment specific customers by keywords and heatmaps was highly effective.
- Exploratory Data Analysis (EDA) and Visualization: The EDA and visualizations were excellent, with interesting attempts to analyze salary and investment amounts.
- Areas for Improvement:
- Column Descriptions: There was a lack of detailed explanations for the columns, which was a drawback.
- Clarity of Goals: Although the data aimed to present a big picture, the goals were not well-defined. Each analysis segment was good, but the overall direction was unclear.
- Communication: The message being conveyed was somewhat lacking. Having more detailed and specific objectives would have been beneficial.
- Detail on High-Potential Segments: While the analysis on high-income, low-investment individuals was interesting, more details on these specific segments would have been valuable.
- Number of Columns: There were too many columns, making the data difficult to manage and interpret.
- Additional Notes:
- The data provided insights into individual behavior patterns.
- The EDA and visualizations were very well done.