
Titanic data analysis

by jhleeatl 2024. 4. 30.

Today I had a personal assignment: a set of coding questions about the passengers in the Titanic dataset.

The data comes from Kaggle.

 

https://www.kaggle.com/competitions/titanic/data?select=train.csv

 


I wrote Python code based on this data. It is still lacking, and my approach to coding is inefficient and crude. However, the purpose of this study is to practice handling tabular data in Python and to learn to code. Through this practice, I hope to develop a more concise and efficient coding style.

 

 

Problem 1: Loading the Data

  • Question) Load the Titanic data and store it in a variable called df. Then, examine the contents of the data.
import pandas as pd
file_path = "/Users/junhyunlee/Desktop/data/titanic/train.csv"
df = pd.read_csv(file_path)
print(df.head())    # first five rows
df.info()           # column names, dtypes, and non-null counts

 

 

Problem 2: Calculating the Number of Survivors

  • Question) Calculate and output the total number of survivors and the number of fatalities on the Titanic.
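a = df['Survived']
def cnt_survival(a):
    count = 0
    for i in a:
        if i == 1:              # 1 means the passenger survived
            count += 1
    return count
survivors = cnt_survival(a)
print(survivors)                # result: 342 survivors
print(len(a) - survivors)       # result: 549 fatalities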
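
For comparison, pandas can produce both counts without a hand-written loop. This is only a minimal sketch, assuming `Survived` contains nothing but 0 and 1; `sample` is a small made-up frame standing in for `df`, not the Kaggle data.

```python
import pandas as pd

# Hypothetical stand-in for df, used only for illustration.
sample = pd.DataFrame({"Survived": [0, 1, 1, 0, 0]})

# value_counts() tallies how often each distinct value appears.
counts = sample["Survived"].value_counts()

# .get() falls back to 0 if a value never occurs in the column.
survivors = int(counts.get(1, 0))
fatalities = int(counts.get(0, 0))
print(survivors, fatalities)
```

With the real data, replacing `sample` with `df` should give the same survivor count as the loop above.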

 

 

Problem 3: Calculating the Average Age

  • Question) Calculate and output the average age of Titanic passengers.
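s = df['Age'].dropna()      # drop missing ages so they are not counted
def avg_age(s):
    if len(s) == 0:         # avoid dividing by zero when no ages are known
        return 0
    # divide by the number of known ages, so total and count always match
    return sum(s) / len(s)
print(avg_age(s))           #result = 29.69911764705882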
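
The same average can probably be obtained with the built-in `mean()` method, which skips missing values by default. A minimal sketch, where `sample` is a made-up frame standing in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for df; None becomes NaN in the Age column.
sample = pd.DataFrame({"Age": [22.0, None, 30.0, 38.0]})

# mean() ignores NaN by default (skipna=True), so no dropna() is needed.
mean_age = sample["Age"].mean()
print(mean_age)
```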

 

Problem 4: Calculating the Number of Female Survivors

  • Question) Calculate and output the number of female survivors among Titanic passengers.
zipped = zip(df['Sex'], df['Survived'])
def cnt(zipped):
    count = 0
    for sex, survived in zipped:
        if sex == 'female' and survived == 1:
            count +=1
    return count

print(cnt(zipped))          #result = 233
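
As an alternative to zipping two columns, a boolean mask can express both conditions at once. This is a sketch under the assumption that `Sex` holds the strings 'female' and 'male'; `sample` is a made-up frame, not the Kaggle data.

```python
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({
    "Sex": ["female", "male", "female", "female", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# Each comparison yields a True/False Series; & combines them row by row.
mask = (sample["Sex"] == "female") & (sample["Survived"] == 1)

# True counts as 1 when summed, so the sum is the number of matching rows.
female_survivors = int(mask.sum())
print(female_survivors)
```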

 

Problem 5: Finding the Passenger with the Most Family Members

  • Question) Among passengers with families, find the passenger with the most family members.
df['family'] = df['SibSp'] + df['Parch']
max_f = max(df['family'])
max_family = df[df['family'] == max_f]
print(max_family[['Name', 'family','SibSp', 'Parch']])

#result 
                                  Name  family  SibSp  Parch
159         Sage, Master. Thomas Henry      10      8      2
180       Sage, Miss. Constance Gladys      10      8      2
201                Sage, Mr. Frederick      10      8      2
324           Sage, Mr. George John Jr      10      8      2
792            Sage, Miss. Stella Anna      10      8      2
846           Sage, Mr. Douglas Bullen      10      8      2
863  Sage, Miss. Dorothy Edith "Dolly"      10      8      2

 

Problem 6: Extracting Passengers of a Specific Age Group

  • Question) Extract the names of passengers aged 20 or younger to complete a dictionary where the passengers' names are keys and their ages are values.
zipped = dict(zip(df['Name'], df['Age']))
result = {}
for name, age in zipped.items():
    if age <= 20:
        result[name] = age
print(result)

#result = {'Palsson, Master. Gosta Leonard': 2.0, 'Nasser, Mrs. Nicholas (Adele Achem)': 14.0, 'Sandstrom, Miss. Marguerite Rut': 4.0, 'Saundercock, Mr. William Henry': 20.0, 'Vestrom, Miss. Hulda Amanda Adolfina': 14.0, 'Rice, Master. Eugene': 2.0, 'McGowan, Miss. Anna "Annie"': 15.0, 'Palsson, Miss. Torborg Danira': 8.0, 'Fortune, Mr. Charles Alexander': 19.0, 'Vander Planke, Miss. Augusta Maria': 18.0, 'Nicola-Yarred, Miss. Jamila': 14.0, 'Laroche, Miss. Simonne Marie Anne Andree': 3.0, 'Devaney, Miss. Margaret Delia': 19.0, 'Arnold-Franchi,
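
One possible shortcut is to filter the rows first and build the dictionary only from what remains. A minimal sketch with made-up names and ages standing in for `df`; note that comparing a missing age (NaN) with 20 is False, so those rows drop out on their own.

```python
import pandas as pd

# Hypothetical stand-in for df; the names are placeholders.
sample = pd.DataFrame({
    "Name": ["Passenger A", "Passenger B", "Passenger C", "Passenger D"],
    "Age": [2.0, 35.0, None, 20.0],
})

# Keep only rows where Age is 20 or less; NaN <= 20 evaluates to False.
young = sample[sample["Age"] <= 20]

# Pair each remaining name with its age.
result = dict(zip(young["Name"], young["Age"]))
print(result)
```

Because dictionary keys are unique, two passengers with exactly the same name would collapse into one entry in either version.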

 

 

Problem 7: Finding the Cabin Class with the Most Passengers

  • Question) Find the cabin class that had the most passengers on the Titanic.
a = df['Pclass']
def count(a):
    class_1 = 0
    class_2 = 0
    class_3 = 0
    for i in a:
        if i == 1:
            class_1 += 1
        elif i == 2:
            class_2 += 1
        elif i == 3:
            class_3 += 1
    return class_1, class_2, class_3
class_1, class_2, class_3 = count(a)
print("Pclass_1:",class_1)
print("Pclass_2:",class_2)
print("Pclass_3:",class_3)

#result 
Pclass_1: 216
Pclass_2: 184
Pclass_3: 491
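
The counts above show the answer by eye, but the largest class can also be picked out in code. A minimal sketch using `value_counts()` and `idxmax()`, with a made-up `sample` frame standing in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for df.
sample = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Count passengers per class, whatever class values are present.
counts = sample["Pclass"].value_counts()

# idxmax() returns the label (the class) with the highest count.
top_class = int(counts.idxmax())
top_count = int(counts.max())
print(top_class, top_count)
```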

 

Problem 8: Printing Information of the Passenger with the Highest Fare

  • Question) Find the passenger on the Titanic who paid the highest fare.
a = df['Fare']
max_fare = max(a)
s = df[df['Fare'] == max_fare]
print(s[['Name', 'Fare']])

#result
                                   Name      Fare
258                    Ward, Miss. Anna  512.3292
679  Cardeza, Mr. Thomas Drake Martinez  512.3292
737              Lesurer, Mr. Gustave J  512.3292

Problem 9: Calculating the Survival Rate for Each Gender

  • Question) Calculate the survival rate for each gender (male/female) on the Titanic.
sm = df[(df['Sex'] == 'male') & (df['Survived'] == 1)]['Sex'].count()
tm = df[df['Sex'] == 'male']['Sex'].count()
sfm = df[(df['Sex'] == 'female') & (df['Survived'] == 1)]['Sex'].count()
tfm = df[df['Sex'] == 'female']['Sex'].count()

print(f'Male survival rate: {sm/tm}')
print(f'Female survival rate: {sfm/tfm}')

#result
Male survival rate: 0.18890814558058924
Female survival rate: 0.7420382165605095

 

Problem 10: Finding the Most Common Departure Port

  • Question) Find the port from which the most passengers departed, and output the number of passengers who departed from that port.
s = df['Embarked'].dropna()

def count_emb(s):
    cnt_s = 0
    cnt_c = 0
    cnt_q = 0
    for i in s:
        if i == 'S':
            cnt_s += 1
        elif i == 'C':
            cnt_c += 1
        elif i == 'Q':
            cnt_q += 1
    max_cnt = max(cnt_s, cnt_c, cnt_q)
    if max_cnt == cnt_s:
        return cnt_s, 'S'
    elif max_cnt == cnt_c:
        return cnt_c, 'C'
    else:
        return cnt_q, 'Q'

max_value, max_name = count_emb(s)
print(max_name, max_value)

#result = S 644
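
The if/elif chain works for three known ports, but it has to be extended by hand if another value appears. One alternative is `collections.Counter` from the standard library, sketched below on a made-up list standing in for `df['Embarked']`.

```python
from collections import Counter
import math

# Hypothetical values standing in for df['Embarked']; nan marks a missing port.
embarked = ["S", "C", "S", "Q", math.nan, "S", "C"]

# Keep only real port codes (strings), dropping missing values.
ports = [p for p in embarked if isinstance(p, str)]

# most_common(1) returns a list holding the (value, count) pair with the top count.
max_name, max_value = Counter(ports).most_common(1)[0]
print(max_name, max_value)
```

With the real data, `Counter(df['Embarked'].dropna())` should work the same way.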

 

 

I spent the whole day working through these questions, and I still don't know whether my answers are correct.

I plan to compare them with the correct answers this Friday and look for areas to improve.

Solving these questions was my first chance to manipulate data directly in Python.

I found SQL more intuitive and easier to handle than Python. I'm still at a basic level, and I noticed many areas where I'm lacking. While solving today's questions I learned about various features of Python, and I think I need to use it much more to become familiar with Python code.
