Split out nested json/dictionary from Pandas dataframe into sepa

ghz 昨天 ⋅ 1 views

Split out nested json/dictionary from Pandas dataframe into separate columns

I have a problem that I cannot find a solution for - so here comes the request for assistance.

I receive an export from a DB that looks like this (of course, more than one line in reality):

"created_at","country","query_success","query_result"
"2022-08-18 08:38:38","Germany",True,"{""servers"": {""windows"": 0, ""linux"": 0}, ""workstations"": {""windows"": 0, ""mac"": 0}}"

I import it into Pandas in this way:

df = pd.read_csv('data.csv', index_col='created_at', parse_dates=True)

Which turns it into this:

created_at          country query_success   query_result
2022-08-18 08:38:38 Germany True            {"servers": {"windows": 0, "linux": 0}, "workstations": {"windows": 0, "mac": 0}}

The problem I'm trying to resolve is the json/dictionary that populates the query_result column.

What I'd like to do would be to create and populate four new columns based on this data.

server_windows
server_linux
workstation_windows
workstation_mac

I've done quite some googling and have seen some solutions that uses the ast module but can't seem to get it right. It could potenially be due to the it being two nested dictionaries/json structures?

Answer

The issue you're facing is that the query_result column contains a string representing a JSON object, and you'd like to parse it into separate columns. While this is possible using Python’s ast.literal_eval or json.loads, the correct approach depends on ensuring that the nested structure within the JSON is correctly flattened.

In your case, the string in the query_result column represents a nested JSON, and we need to parse and flatten it into the desired columns.

Here's how you can do it:

Step 1: Convert the string in query_result to an actual dictionary

Since the query_result is stored as a string, we need to first convert it into a dictionary. You can use ast.literal_eval to safely evaluate the string into a Python dictionary or json.loads to do it directly if it’s a valid JSON format.

We'll use json.loads because the structure is valid JSON.

Step 2: Flatten the nested dictionary into separate columns

We'll flatten the JSON structure and create new columns in the DataFrame. Here's the step-by-step process:

Full Solution:

import pandas as pd
import json

# Your sample data
data = '''"created_at","country","query_success","query_result"
"2022-08-18 08:38:38","Germany",True,"{""servers"": {""windows"": 0, ""linux"": 0}, ""workstations"": {""windows"": 0, ""mac"": 0}}"
'''

# Save data to a CSV file
with open('data.csv', 'w') as f:
    f.write(data)

# Read the CSV into a pandas DataFrame
df = pd.read_csv('data.csv', index_col='created_at', parse_dates=True)

# Step 1: Convert the 'query_result' column from a string to a dictionary
df['query_result'] = df['query_result'].apply(json.loads)

# Step 2: Flatten the dictionary and create new columns
df['server_windows'] = df['query_result'].apply(lambda x: x['servers'].get('windows', None))
df['server_linux'] = df['query_result'].apply(lambda x: x['servers'].get('linux', None))
df['workstation_windows'] = df['query_result'].apply(lambda x: x['workstations'].get('windows', None))
df['workstation_mac'] = df['query_result'].apply(lambda x: x['workstations'].get('mac', None))

# Optionally, drop the 'query_result' column
df.drop(columns=['query_result'], inplace=True)

# Show the result
print(df)

Explanation:

  1. Reading the Data: The data is read into a pandas DataFrame.
  2. Converting query_result: The query_result column, which contains the string representation of the dictionary, is converted to an actual dictionary using json.loads.
  3. Creating New Columns: We use the .apply() method with a lambda function to extract the values from the nested dictionaries (servers and workstations).
  4. Dropping query_result: The original query_result column is dropped after extracting the needed information, but you can keep it if you'd like to retain the raw data.

Output:

The resulting DataFrame will look like this:

                     country  query_success  server_windows  server_linux  workstation_windows  workstation_mac
created_at                                                                                                 
2022-08-18 08:38:38  Germany           True               0             0                    0                0

Notes:

  • The use of .apply() with a lambda function is efficient for extracting values from the nested dictionaries.
  • The .get() method is used to handle the case where a key might be missing, ensuring that the code won't throw an error if any expected key doesn't exist in the JSON.

This approach should work for any similar data structure with nested dictionaries. If your dataset grows or you have more nested levels, you can extend this method by applying more logic to flatten further or to handle missing data.