pandas - merging multiple DataFrames

ghz 昨天 ⋅ 5 views

This is a multi-part question. I just can't seem to combine everything together. The goal is to to create one DataFrame (guessing using MultiIndex) that I can access as follows:

ticker = 'GOLD'
date = pd.to_datetime('1978/03/31')
current_bar = df.ix[ticker].ix[date]

Can I then just say: current_bar.Last ?

Anyway, here are the files, and how I load them.

In [108]: df = pd.read_csv('GOLD.csv', parse_dates='Date', index_col='Date')
In [109]: df
Out[109]: 
            Exp       Last     Volume
Date
1978-03-30  198002    995.6    54
1978-03-31  198002    999.5    78

In [110]: df2 = pd.read_csv('SPX.csv', parse_dates='Date', index_col='Date')
In [111]: df2
Out[111]: 
            Exp       Last     Volume
Date
1978-03-30  198003    215.5    25
1978-03-31  198003    214.1    99

Ideally, I want it to look like this (I think):

ticker      GOLD                            SPX
values      Exp       Last     Volume       Exp       Last     Volume
Date
1978-03-30  198002    995.6    54           198003    215.5    25
1978-03-31  198002    999.5    78           198003    214.1    99
  1. I guess my questions are:
    • How do I make this Hierarchical (the actual data has 20+ identical columns for each file)
    • How do I then combine the files (I have about 100 that need to all go in 1 DataFrame)
    • Is my assumption correct that I can then just do: current_bar.Last to get values?

Answer

You are on the right track with the idea of using a MultiIndex to combine your data. The goal is to merge multiple DataFrames, each with a Date index, into a single DataFrame with a MultiIndex that includes both the ticker (like 'GOLD', 'SPX', etc.) and the Date. This structure allows you to access data as you described, using both the ticker and the date.

Let's break it down step by step:

1. How to make the DataFrame hierarchical (MultiIndex)?

You want the result to have a MultiIndex with two levels: the ticker and the Date. Each DataFrame you load will have the Date as its index, and the columns will contain the relevant data like Exp, Last, Volume, etc.

The key part of this is creating a MultiIndex from your individual DataFrames by adding a ticker level. Here's how you can achieve this.

2. How to combine the files?

You can combine the individual DataFrames using pd.concat(). The idea is to:

  • Read each CSV file and assign a ticker (you can use the filename or a manually assigned value).
  • Stack these DataFrames with ticker as the first level in the MultiIndex.
  • Ensure that each DataFrame’s Date is preserved as the second level of the index.

3. Accessing the data: current_bar.Last

Once you create the MultiIndex, you can access the data via df.xs(ticker) to get the data for a particular ticker, and then use .loc[date] to access the data for a specific date.

Full Example:

import pandas as pd
import os

# Initialize an empty list to hold all the DataFrames
df_list = []

# List of tickers and CSV filenames
tickers = ['GOLD', 'SPX']  # Example tickers

# Assuming files are named like 'GOLD.csv', 'SPX.csv'
for ticker in tickers:
    # Read the CSV for each ticker, setting the ticker as a new column
    df = pd.read_csv(f'{ticker}.csv', parse_dates=['Date'], index_col='Date')
    
    # Add a column for the ticker
    df['ticker'] = ticker
    
    # Append to the list of DataFrames
    df_list.append(df)

# Concatenate all DataFrames into one
df_combined = pd.concat(df_list)

# Set a MultiIndex (ticker, Date)
df_combined.set_index(['ticker', df_combined.index], inplace=True)

# Reorder columns if needed
df_combined = df_combined[['Exp', 'Last', 'Volume', 'ticker']]  # Example reordering

# Display the result
print(df_combined)

Explanation:

  1. Reading files:

    • Each file is read individually using pd.read_csv, with Date as the index.
    • I manually assign the ticker (e.g., 'GOLD', 'SPX') to a new column, which will later be used for the MultiIndex.
  2. Creating the MultiIndex:

    • I concatenate the DataFrames using pd.concat(df_list).
    • Then I set the MultiIndex using set_index(['ticker', df_combined.index]). This creates a two-level index with ticker as the first level and Date as the second level.
  3. Accessing the data:

    • You can access the data for a particular ticker and date combination using the .loc[] method, or you can use .xs(ticker) for easier extraction of a specific ticker’s data.

    For example:

    # Access the Last price for 'GOLD' on '1978-03-31'
    ticker = 'GOLD'
    date = pd.to_datetime('1978/03/31')
    current_bar = df_combined.xs(ticker).loc[date]
    print(current_bar['Last'])  # Access 'Last' value
    

Expected Output:

The output would look like this:

                     Exp   Last  Volume ticker
ticker    Date
GOLD      1978-03-30  198002  995.6     54   GOLD
          1978-03-31  198002  999.5     78   GOLD
SPX       1978-03-30  198003  215.5     25   SPX
          1978-03-31  198003  214.1     99   SPX

4. Accessing current_bar.Last:

Once the DataFrame is structured as shown above, you can access the Last column for any ticker and date like this:

# Access 'Last' for a given ticker and date
ticker = 'GOLD'
date = pd.to_datetime('1978/03/31')
current_bar = df_combined.xs(ticker).loc[date]
print(current_bar['Last'])

This will return 999.5, which is the Last value for 'GOLD' on '1978-03-31'.

5. Handling Multiple Files:

To handle all your files efficiently, you can dynamically read all CSV files in a directory, assign each file a ticker, and then concatenate them into the final DataFrame.

import glob

df_list = []
# Use glob to list all CSV files
csv_files = glob.glob('path_to_csvs/*.csv')  # Adjust path as needed

for file in csv_files:
    ticker = os.path.basename(file).split('.')[0]  # Get ticker from file name
    df = pd.read_csv(file, parse_dates=['Date'], index_col='Date')
    df['ticker'] = ticker
    df_list.append(df)

# Concatenate and create MultiIndex
df_combined = pd.concat(df_list)
df_combined.set_index(['ticker', df_combined.index], inplace=True)
df_combined = df_combined[['Exp', 'Last', 'Volume']]  # Reorder columns as needed

# Now the DataFrame is ready

Final Notes:

  • MultiIndex: You’re using a MultiIndex to allow you to access data both by ticker and Date.
  • Performance: If you have many files, using pd.concat() is efficient, but be cautious about memory usage when concatenating very large datasets.
  • Accessing Data: After setting the MultiIndex, you can access specific data using .xs(ticker) and .loc[date], as shown above.

This approach allows you to manage and query large, multi-file time-series data efficiently.