I have data sets representing travel times past given nodes. The data is in one CSV file per node in this format: node name, datetime, irrelevant field, mac address
I'm reading them into one DataFrame in Pandas:
dfs = [pd.read_csv(f, names=CSV_COLUMNS, parse_dates=[1]) for f in files]
return pd.concat(dfs)
What I want to do is find the time difference between a MAC address' appearance at one node and the next. Right now I'm looping over the resulting DataFrame, which isn't efficient and isn't working: every way I've tried to sort the data causes a problem.
- I can't sort it by MAC and date and time because I need to preserve the direction of travel (sorting by date and time makes all travel look like it's in the positive direction).
- Sorting by MAC alone keeps the nodes in order (because they are pushed into the file in node order).
While I may be able to figure out the sorting problem, the larger issue is I'm new to Pandas and I bet there's a right way to do this in Pandas. What I want at the end of processing is a data set that shows travel time (timediff.total_seconds() or similar) for every pair of nodes that a MAC traveled directly between. That last bit is important: for a layout where the nodes are A, B and C, most travel will be A-B or B-C (or the reverse), but it is possible some MACs won't register at B and will go A to C. It's also possible some of the appearances will be orphans where a MAC appears at a node but never shows up at another node.
Answer
To tackle your problem efficiently using Pandas, let's break it down step-by-step:
Goal:
You want to calculate the travel time between nodes for each MAC address, capturing the time difference between their appearances at consecutive nodes. Your final DataFrame should have the travel time (timediff.total_seconds()) for each pair of nodes directly traveled between.
Plan:
- Data Loading: You are already loading multiple CSV files into a list of DataFrames (dfs) and concatenating them into a single DataFrame.
- Sorting and Indexing: Sort by mac_address and datetime so each MAC's appearances are in chronological order; because consecutive rows are then consecutive appearances, the direction of travel is preserved.
- Travel Time Calculation: Calculate the time difference between consecutive appearances for each mac_address.
- Filtering Direct Travel: Keep only direct travel between consecutive appearances, skipping orphans (appearances where a MAC never shows up at another node).
Approach:
- Combine the DataFrames: You are already concatenating the data into a single DataFrame.
- Sorting: We'll sort by mac_address and datetime, keeping track of the node sequence.
- Calculate the Time Difference: For each MAC address, we'll calculate the time difference between consecutive appearances.
- Filter Valid Pairs: Ensure that we only keep rows where the MAC address appears at two consecutive nodes.
Let's implement this:
Step-by-Step Code:
import pandas as pd
# Assuming CSV_COLUMNS = ['node_name', 'datetime', 'irrelevant_field', 'mac_address']
# and `files` is the list of file paths
# Step 1: Load and combine CSV files into a single DataFrame
dfs = [pd.read_csv(f, names=CSV_COLUMNS, parse_dates=[1]) for f in files]
df = pd.concat(dfs)
# Step 2: Sort by 'mac_address' and 'datetime' to preserve travel order
df = df.sort_values(by=['mac_address', 'datetime'])

# Step 3: For each MAC address, pull the next appearance's node and timestamp
# onto the current row; shift(-1) stays inside each group, so a MAC's last
# appearance gets NaN/NaT rather than data from another MAC
df['next_node'] = df.groupby('mac_address')['node_name'].shift(-1)
df['next_time'] = df.groupby('mac_address')['datetime'].shift(-1)

# Step 4: Travel time in seconds from this node to the next one
df['time_diff'] = (df['next_time'] - df['datetime']).dt.total_seconds()

# Step 5: Filter out rows with no next node (orphans or the end of a journey);
# .copy() avoids SettingWithCopyWarning when columns are added below
df_valid_travel = df.dropna(subset=['next_node']).copy()

# Step 6: Create a column to represent the travel between two consecutive nodes
df_valid_travel['node_pair'] = df_valid_travel['node_name'] + '->' + df_valid_travel['next_node']
# Step 7: You may want to drop irrelevant columns or organize the DataFrame
df_result = df_valid_travel[['mac_address', 'node_pair', 'time_diff']]
# Step 8: Optionally, you can aggregate the results for each unique MAC address and node pair
df_travel_times = df_result.groupby(['mac_address', 'node_pair']).agg(
total_travel_time=('time_diff', 'sum'),
num_trips=('time_diff', 'count')
).reset_index()
# Now, df_travel_times will contain the total travel time and number of trips for each MAC address and node pair.
Explanation of the Code:
- Data Loading and Concatenation: We load the CSV files into DataFrames and concatenate them into one large DataFrame (df).
- Sorting: We sort by mac_address and datetime to maintain the sequence of events for each MAC address.
- Next Node and Time: Shifting node_name and datetime by -1 within each mac_address group puts each appearance's immediate successor on the same row.
- Time Difference Calculation: Subtracting datetime from next_time gives a Timedelta, and dt.total_seconds() converts it to seconds. Because the subtraction happens on a single row, each travel time lines up with the node pair on that same row.
- Filtering Orphans: We drop rows where next_node is NaN, which marks a MAC's last (or only) appearance, i.e. it never shows up at another node afterwards. A short snippet after this list shows how to pull out the true orphans separately.
- Node Pair: We concatenate the current node and the next node to create a node_pair column. Since the pair is built from chronologically consecutive appearances, 'A->B' and 'B->A' stay distinct, so the direction of travel is preserved; a skipped node simply shows up as a pair like 'A->C'.
- Aggregation: Optionally, we aggregate the total travel time for each MAC address and node pair, and count the number of trips.
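If you also want to inspect the true orphans (MAC addresses that appear exactly once and therefore never pair with anything), a minimal sketch, reusing the combined df from Step 1:

# transform('size') broadcasts each MAC's row count back onto its rows,
# so this keeps only MACs with a single appearance anywhere
appearances = df.groupby('mac_address')['mac_address'].transform('size')
orphans = df[appearances == 1]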
Result:
- df_result: contains the mac_address, the node_pair (e.g., 'A->B', 'B->C'), and the time_diff (travel time in seconds).
- df_travel_times: an aggregated DataFrame that shows the total travel time and number of trips for each MAC address and node pair.
Example of Output (df_travel_times):

mac_address | node_pair | total_travel_time | num_trips
------------|-----------|-------------------|----------
00:11:22:33 | A->B      | 1200.0            | 5
00:11:22:33 | B->C      | 1500.0            | 3
00:44:55:66 | A->C      | 1800.0            | 2
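To see the edge cases from your question in action, here is a self-contained run on invented data (the MACs, nodes, and timestamps are made up for illustration): one MAC travels A->B, one skips B and goes A->C, one travels in the reverse direction B->A, and one is an orphan that only ever appears at C.

import pandas as pd

sample = pd.DataFrame({
    'node_name': ['A', 'B', 'A', 'C', 'B', 'A', 'C'],
    'datetime': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 08:20',  # m1: A->B
        '2024-01-01 09:00', '2024-01-01 09:30',  # m2: A->C (skips B)
        '2024-01-01 10:00', '2024-01-01 10:15',  # m3: B->A (reverse direction)
        '2024-01-01 11:00',                      # m4: orphan at C
    ]),
    'mac_address': ['m1', 'm1', 'm2', 'm2', 'm3', 'm3', 'm4'],
})

sample = sample.sort_values(by=['mac_address', 'datetime'])
sample['next_node'] = sample.groupby('mac_address')['node_name'].shift(-1)
sample['next_time'] = sample.groupby('mac_address')['datetime'].shift(-1)
sample['time_diff'] = (sample['next_time'] - sample['datetime']).dt.total_seconds()
valid = sample.dropna(subset=['next_node']).copy()
valid['node_pair'] = valid['node_name'] + '->' + valid['next_node']
print(valid[['mac_address', 'node_pair', 'time_diff']])
# m1: A->B 1200.0, m2: A->C 1800.0, m3: B->A 900.0; the orphan m4 is dropped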
Performance Considerations:
- Efficient Sorting: Sorting by mac_address and datetime is the most computationally expensive step, but it is necessary for calculating the time difference.
- GroupBy Operations: Pandas groupby() is efficient, but if the dataset grows significantly, consider using Dask or another parallelized library to handle larger datasets.
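Before reaching for Dask, one cheap pandas-level optimization is worth trying: MAC addresses repeat heavily, so storing them as a categorical dtype reduces memory and typically speeds up both the sort and the groupby. A minimal sketch, applied to the combined df before Step 2; the rest of the pipeline is unchanged:

# Categorical values compare, sort, and hash as small integer codes
# rather than full strings; the sorted order is the same as for strings
df['mac_address'] = df['mac_address'].astype('category')
df = df.sort_values(by=['mac_address', 'datetime'])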
Let me know if you need further assistance or optimization for larger datasets!