본문 바로가기
Study Note/Python

Join, Merge, Concat, Append, and Pivot table

by jhleeatl 2024. 5. 23.

 

1. Using the join Method

The join method in pandas is used to combine two dataframes based on their index. Here is the syntax:

 

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

 

  • other: The dataframe to join with.
  • on: Column or index level names to join on. Default is the index.
  • how: Type of join to perform ('left', 'right', 'outer', 'inner'). Default is 'left'.
  • lsuffix: Suffix to use for overlapping columns in the left dataframe.
  • rsuffix: Suffix to use for overlapping columns in the right dataframe.
  • sort: Sort the resulting dataframe by the join keys. Default is False.

 

 

Example

import pandas as pd

# Sample dataframes
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=['K0', 'K1', 'K2'])

df2 = pd.DataFrame({
    'C': ['C0', 'C1', 'C2'],
    'D': ['D0', 'D1', 'D2']
}, index=['K0', 'K2', 'K3'])

# Join based on index
result = df1.join(df2, how='outer')
print(result)

 


2. Using the merge Method

The merge method in pandas is more versatile and allows you to join dataframes based on specific columns. Here is the syntax:

 

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

 

  • left: The left dataframe.
  • right: The right dataframe.
  • how: Type of join to perform ('left', 'right', 'outer', 'inner'). Default is 'inner'.
  • on: Column or index level names to join on.
  • left_on: Column names from the left dataframe to join on.
  • right_on: Column names from the right dataframe to join on.
  • left_index: Use the index from the left dataframe as the join key.
  • right_index: Use the index from the right dataframe as the join key.
  • sort: Sort the resulting dataframe by the join keys. Default is False.
  • suffixes: Suffix to use for overlapping columns.
  • indicator: Add a column to the output DataFrame called _merge with information on the source of each row.

 

Example

# Sample dataframes
df1 = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

df2 = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K4'],
    'C': ['C0', 'C1', 'C2', 'C4'],
    'D': ['D0', 'D1', 'D2', 'D4']
})

# Merge on 'key' column
result = pd.merge(df1, df2, on='key', how='inner')
print(result)

 

 

 

join: Use this method when you want to combine dataframes based on their indices.

merge: Use this method when you need to join dataframes on one or more columns.

 

 

 


 

Concat

The concat function in pandas can be used to concatenate pandas objects along a particular axis with optional set logic along the other axes.

 

Syntax:

pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)

 

 

Parameters:

  • objs: A sequence or mapping of pandas objects to be concatenated.
  • axis: The axis to concatenate along (default is 0, which means concatenation by rows).
  • join: How to handle indexes on other axis ('outer', 'inner'). Default is 'outer'.
  • ignore_index: If True, do not use the index values along the concatenation axis. Default is False.
  • keys: If multiple levels are concatenated, create a hierarchical index using the provided keys.
  • levels: Specific levels (unique) to use for constructing a MultiIndex.
  • names: Names for the levels in the resulting hierarchical index.
  • verify_integrity: Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
  • sort: Sort non-concatenation axis if it is not already aligned when join='outer'. Default is False.
  • copy: If False, do not copy data unnecessarily. Default is True.

 

Example Usages

Concatenating DataFrames by Rows (Default Behavior)

This is useful when you have multiple dataframes with the same columns and you want to stack them on top of each other.


Input

import pandas as pd

# Sample dataframes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})

# Concatenate along rows (axis=0)
result = pd.concat([df1, df2])
print(result)

 

 

Output

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

 

 

Concatenating DataFrames by Columns

This is useful when you have multiple dataframes with the same index and you want to join them side by side.

 

Input

# Concatenate along columns (axis=1)
result = pd.concat([df1, df2], axis=1)
print(result)

 

Output

    A   B   A   B
0  A0  B0  A3  B3
1  A1  B1  A4  B4
2  A2  B2  A5  B5

 

 

Concatenating with Different Indexes

If the dataframes have different indexes, concat will align them by index.

 

Input

df3 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=[0, 1, 2])
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B4']}, index=[2, 3, 4])

# Concatenate along rows with different indexes
result = pd.concat([df3, df4], axis=1)
print(result)

 

 

Output

     A    B
0    A0  NaN
1    A1  NaN
2    A2    B2
3   NaN    B3
4   NaN    B4

 

 

Using keys to Create a Hierarchical Index

You can use the keys parameter to create a hierarchical index that can be useful for differentiating the source dataframe of each row.

 

# Concatenate with keys
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print(result)

 

 

Concat

        A   B
df1 0  A0  B0
    1  A1  B1
    2  A2  B2
df2 0  A3  B3
    1  A4  B4
    2  A5  B5

 

 


 

append Method Overview

Syntax:

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

 

Parameters:

  • other: The data to append. Can be a dataframe, series, dictionary, or a list of these.
  • ignore_index: If True, do not use the index labels. The resulting dataframe will be reindexed. Default is False.
  • verify_integrity: If True, checks if the new dataframe has duplicate indices. Default is False.
  • sort: Sort columns if the columns of the other dataframe are not aligned. Default is False.

Returns: A new dataframe containing the combined data.

Example Usages

Appending DataFrames

Appending one dataframe to another:

import pandas as pd

# Sample dataframes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']})
df2 = pd.DataFrame({'A': ['A3', 'A4', 'A5'], 'B': ['B3', 'B4', 'B5']})

# Append df2 to df1
result = df1.append(df2)
print(result)

 

Output:

 

    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

 


 

pivot_table Function

The pivot_table function in pandas allows you to create a spreadsheet-style pivot table as a dataframe. It aggregates data and allows for multiple levels of grouping and summarization.

 

Syntax:

pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)

 

Parameters:

  • data: The dataframe to pivot.
  • values: The column(s) to aggregate.
  • index: The column(s) to set as the index of the resulting pivot table.
  • columns: The column(s) to use to create the columns of the resulting pivot table.
  • aggfunc: The aggregation function(s) to use (e.g., 'mean', 'sum', 'count'). Default is 'mean'.
  • fill_value: Value to replace missing values.
  • margins: Add all row/columns totals. Default is False.
  • dropna: Do not include columns whose entries are all NaN. Default is True.
  • margins_name: Name of the row/column that will contain the totals when margins is True. Default is 'All'.
  • observed: This only applies if any of the groupers are categoricals. If True: only show observed values for categorical groupers. Default is False.
  • sort: Sort the resulting DataFrame by the index and columns. Default is True.

Example Usages

Basic Pivot Table

import pandas as pd

# Sample dataframe
data = {
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'small', 'large', 'small', 'large', 'small', 'large'],
    'D': [1, 2, 3, 4, 5, 6, 7, 8]
}
df = pd.DataFrame(data)

# Create a pivot table
pivot_table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc='sum')
print(pivot_table)

 

Output

C        large  small
A   B                
bar one      6      5
    two      8      7
foo one      2      1
    two      4      3

 


Using Multiple Aggregation Functions

pivot_table = pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'], aggfunc=['sum', 'mean'])
print(pivot_table)

 

 

Output

        sum        mean     
C      large small large small
A   B                        
bar one     6     5   6.0   5.0
    two     8     7   8.0   7.0
foo one     2     1   2.0   1.0
    two     4     3   4.0   3.0

 

 

Using pivot_table, you can transform your data into a more structured and easily analyzable format, making it easier to generate insights and reports.