Python get XML siblings into dictionary

ghz 11hours ago ⋅ 5 views

I have an xml that looks like this:

<root>
    <G>
        <G1>1</G1>
        <G2>some text</G2>
        <G3>some text</G3>
        <GP>
            <GP1>1</GP1>
            <GP2>a</GP2>
            <GP3>a</GP3>
        </GP>
        <GP>
            <GP1>2</GP1>
            <GP2>b</GP2>
            <GP3>b</GP3>
        </GP>
        <GP>
            <GP1>3</GP1>
            <GP2>c</GP2>
            <GP3>c</GP3>
        </GP>
    </G>
    <G>
        <G1>2</G1>
        <G2>some text</G2>
        <G3>some text</G3>
        <GP>
            <GP1>1</GP1>
            <GP2>aa</GP2>
            <GP3>aa</GP3>
        </GP>
        <GP>
            <GP1>2</GP1>
            <GP2>bb</GP2>
            <GP3>bb</GP3>
        </GP>
        <GP>
            <GP1>3</GP1>
            <GP2>cc</GP2>
            <GP3>cc</GP3>
        </GP>
    </G>
    <G>
        <G1>3</G1>
        <G2>some text</G2>
        <G3>some text</G3>
        <GP>
            <GP1>1</GP1>
            <GP2>aaa</GP2>
            <GP3>aaa</GP3>
        </GP>
        <GP>
            <GP1>2</GP1>
            <GP2>bbb</GP2>
            <GP3>bbb</GP3>
        </GP>
        <GP>
            <GP1>3</GP1>
            <GP2>ccc</GP2>
            <GP3>ccc</GP3>
        </GP>
    </G>
</root>

I'm trying to transform this XML into a nested dictionary called "G":

{ 1: {G1: 1,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: a,
                GP3: a},
            2: {GP1: 2,
                GP2: b,
                GP3: b},
            3: {GP1: 3,
                GP2: c,
                GP3: c}}
      },
  2: {G1: 2,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: aa,
                GP3: aa},
            2: {GP1: 2,
                GP2: bb,
                GP3: bb},
            3: {GP1: 3,
                GP2: cc,
                GP3: cc}}
      },
  3: {G1: 3,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: aaa,
                GP3: aaa},
            2: {GP1: 2,
                GP2: bbb,
                GP3: bbb},
            3: {GP1: 3,
                GP2: ccc,
                GP3: ccc}}
      }
    }

My code works fine for the elements directly under "G" (G1, G2, etc.), but for GP I get one of three wrong results: only a single record, all of the records but with the same content duplicated several times, or all 9 GP elements under one single GP in the dictionary. Here is my code:

    f = 'path to file'
    tree = ET.parse(f)
    root = tree.getroot()
    self.tree = tree
    self.root = root
    gs = len(self.tree.getiterator('G'))
    g = {}
    for i in range(0, gs):
        d = {}
        for elem in self.tree.getiterator('G')[i]:
            if elem.text == "\n      " and elem.tag not in ['GP']:
                    dd = {}
                    for parent in elem:
                        if parent.text == "\n        ":
                            ddd = {}
                            for child in parent:
                                ddd[child.tag] = child.text
                            dd[parent.tag] = ddd
                        else:
                            dd[parent.tag] = parent.text
                    d[elem.tag] = dd
            else:
                d[elem.tag] = elem.text
        g[i+1] = d

    # Build GP
    count = 0
    gp = {}
    for elem in self.tree.getiterator('GP'):
        d = {}
        for parent in elem:
            if parent.text == "\n      ":
                dd = {}
                for child in parent:
                    dd[child.tag] = child.text
                d[parent.tag] = dd
            else:
                d[parent.tag] = parent.text
        count += 1
        gp[count] = d
    g["GP"] = gp

Answer

Your issue stems from how you're parsing the <GP> elements. You collect them globally, outside of their parent G elements, which produces the problems you describe: duplicated entries, GP elements that no longer line up with their parent G, and an incorrect overall structure.

Let's break down the issues:

  1. Duplicate GP Entries: The second loop (getiterator('GP')) walks every GP in the document, so none of them is tied to its parent G. All 9 end up numbered 1 to 9 under a single top-level g["GP"] key.
  2. Misaligned GP Structure: Inside the first loop, GP is excluded from the nested branch, so d['GP'] receives only the element's whitespace text and is overwritten by each following GP. The parent-child relationship between each G and its own GP elements is lost.
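
The difference between a document-wide search and a search scoped to one G can be checked directly. This is a minimal sketch on a trimmed-down stand-in for your XML (two G blocks with two GP each); SAMPLE, all_gp and gp_per_g are names made up for the illustration:

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the XML in the question: two G blocks, two GP each.
SAMPLE = """
<root>
    <G>
        <G1>1</G1>
        <GP><GP1>1</GP1><GP2>a</GP2></GP>
        <GP><GP1>2</GP1><GP2>b</GP2></GP>
    </G>
    <G>
        <G1>2</G1>
        <GP><GP1>1</GP1><GP2>aa</GP2></GP>
        <GP><GP1>2</GP1><GP2>bb</GP2></GP>
    </G>
</root>
"""

root = ET.fromstring(SAMPLE)

# Document-wide search: every GP in the file, with no record of its parent G.
all_gp = list(root.iter('GP'))

# Scoped search: only the GP elements that are direct children of each G.
gp_per_g = [len(g_elem.findall('GP')) for g_elem in root.findall('G')]

print(len(all_gp))  # 4
print(gp_per_g)     # [2, 2]
```

The flat list is what ends up under one key in your second loop; the per-G counts are what the nested dictionary needs.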

Solution

We need to correctly parse the XML and keep GP elements nested within their respective G entries. The trick is to loop through the G elements, find the nested GP elements inside each G, and organize them properly.

Here's how you can modify your code to handle that:

Updated Code:

import xml.etree.ElementTree as ET

f = 'path to file'
tree = ET.parse(f)
root = tree.getroot()

# Initialize the result dictionary for "G"
g = {}

# Iterate over each 'G' element
for i, g_elem in enumerate(root.findall('G'), start=1):
    d = {}
    gp = {}  # collects this G's own GP blocks, keyed 1, 2, 3, ...

    # Iterate over direct children of the 'G' element (G1, G2, G3, GP, etc.)
    for elem in g_elem:
        if elem.tag == 'GP':
            # Each GP block becomes one dictionary of its children (GP1, GP2, GP3)
            gp[len(gp) + 1] = {child.tag: child.text for child in elem}
        else:
            # For other elements like G1, G2, G3
            d[elem.tag] = elem.text.strip() if elem.text else None

    # Attach the numbered GP blocks after the scalar fields
    d['GP'] = gp

    # Add the 'G' dictionary to the result
    g[i] = d

print(g)

Explanation:

  1. Iterating over the G Elements:

    • We use root.findall('G') to get all G elements under the root and iterate over them.
    • enumerate(root.findall('G'), start=1) ensures that we start from 1 when building the dictionary keys (as required in your example).
  2. Handling Nested GP Elements:

    • When a GP tag is encountered, we build one dictionary from its children (GP1, GP2, GP3), using the tag names as keys and the text as values.
    • Each of these dictionaries is stored in gp under the next free number (1, 2, 3), and the numbering starts again for every G.
    • The numbered GP entries are attached under the 'GP' key of the respective G entry.
  3. Handling Other Elements (G1, G2, G3):

    • For each G element, the other tags (such as G1, G2, G3) are directly added to the dictionary with their text content.
  4. Stripping Text:

    • I used strip() on the text to remove extra whitespace and newlines. If the element has no text, None is assigned instead.
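
To try the approach without a file on disk, the same steps can be wrapped in a function and fed an inline string. This is only a sketch under assumptions: groups_to_dict and the shortened sample are hypothetical names and data for the illustration, and the GP values are stripped too, which is a small extra on top of the steps described here.

```python
import xml.etree.ElementTree as ET


def _text(elem):
    """Return the element's stripped text, or None if it has no text."""
    return elem.text.strip() if elem.text else None


def groups_to_dict(root):
    """Map each <G> under root to a dict, numbering G and GP blocks from 1."""
    result = {}
    for i, g_elem in enumerate(root.findall('G'), start=1):
        record = {}
        gp_blocks = {}
        for elem in g_elem:
            if elem.tag == 'GP':
                # One dictionary per GP block, keyed by its position within this G.
                gp_blocks[len(gp_blocks) + 1] = {c.tag: _text(c) for c in elem}
            else:
                record[elem.tag] = _text(elem)
        record['GP'] = gp_blocks
        result[i] = record
    return result


# Small inline sample (hypothetical, shortened from the question's XML).
sample = ET.fromstring(
    "<root>"
    "<G><G1>1</G1><G2>some text</G2>"
    "<GP><GP1>1</GP1><GP2>a</GP2></GP>"
    "<GP><GP1>2</GP1><GP2>b</GP2></GP>"
    "</G>"
    "</root>"
)
parsed = groups_to_dict(sample)
print(parsed[1]['GP'][2])  # {'GP1': '2', 'GP2': 'b'}
```

For the real file, you would pass ET.parse(f).getroot() instead of the inline sample.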

Example Output:

After running the code, the resulting dictionary will be structured correctly as follows:

{
    1: {
        'G1': '1',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'a', 'GP3': 'a'},
            2: {'GP1': '2', 'GP2': 'b', 'GP3': 'b'},
            3: {'GP1': '3', 'GP2': 'c', 'GP3': 'c'}
        }
    },
    2: {
        'G1': '2',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'aa', 'GP3': 'aa'},
            2: {'GP1': '2', 'GP2': 'bb', 'GP3': 'bb'},
            3: {'GP1': '3', 'GP2': 'cc', 'GP3': 'cc'}
        }
    },
    3: {
        'G1': '3',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'aaa', 'GP3': 'aaa'},
            2: {'GP1': '2', 'GP2': 'bbb', 'GP3': 'bbb'},
            3: {'GP1': '3', 'GP2': 'ccc', 'GP3': 'ccc'}
        }
    }
}

Key Points:

  • Associating GP with its Parent G: The critical fix is to ensure that each GP is associated with its parent G entry.
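  • Using findall instead of getiterator: findall('G') returns only the direct children with that tag, whereas getiterator searches the whole subtree. getiterator was also deprecated and then removed in Python 3.9; iter() is its replacement when a subtree search is what you want.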
  • Improving Readability: By stripping excess whitespace and handling cases where there is no text, the code becomes more robust and easier to understand.
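
How far each lookup reaches is easy to verify on a tiny tree. The snippet below is a sketch with a made-up one-line document; direct, nested and anywhere are illustrative names only:

```python
import xml.etree.ElementTree as ET

# Made-up minimal document: one G holding two GP blocks.
root = ET.fromstring(
    "<root><G><GP><GP1>1</GP1></GP><GP><GP1>2</GP1></GP></G></root>"
)

direct = root.findall('GP')       # GP is not a direct child of root
nested = root.findall('G/GP')     # path form: GP children of G children
anywhere = list(root.iter('GP'))  # any depth below root

print(len(direct), len(nested), len(anywhere))  # 0 2 2
```

Calling findall('GP') on a single G element, as the code in this answer does, keeps the search limited to that G's own GP blocks.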

This should now give you the nested dictionary structure you're aiming for.