I have an XML file that looks like this:
<root>
  <G>
    <G1>1</G1>
    <G2>some text</G2>
    <G3>some text</G3>
    <GP>
      <GP1>1</GP1>
      <GP2>a</GP2>
      <GP3>a</GP3>
    </GP>
    <GP>
      <GP1>2</GP1>
      <GP2>b</GP2>
      <GP3>b</GP3>
    </GP>
    <GP>
      <GP1>3</GP1>
      <GP2>c</GP2>
      <GP3>c</GP3>
    </GP>
  </G>
  <G>
    <G1>2</G1>
    <G2>some text</G2>
    <G3>some text</G3>
    <GP>
      <GP1>1</GP1>
      <GP2>aa</GP2>
      <GP3>aa</GP3>
    </GP>
    <GP>
      <GP1>2</GP1>
      <GP2>bb</GP2>
      <GP3>bb</GP3>
    </GP>
    <GP>
      <GP1>3</GP1>
      <GP2>cc</GP2>
      <GP3>cc</GP3>
    </GP>
  </G>
  <G>
    <G1>3</G1>
    <G2>some text</G2>
    <G3>some text</G3>
    <GP>
      <GP1>1</GP1>
      <GP2>aaa</GP2>
      <GP3>aaa</GP3>
    </GP>
    <GP>
      <GP1>2</GP1>
      <GP2>bbb</GP2>
      <GP3>bbb</GP3>
    </GP>
    <GP>
      <GP1>3</GP1>
      <GP2>ccc</GP2>
      <GP3>ccc</GP3>
    </GP>
  </G>
</root>
I'm trying to transform this XML into a nested dictionary called "G":
{ 1: {G1: 1,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: a,
                GP3: a},
            2: {GP1: 2,
                GP2: b,
                GP3: b},
            3: {GP1: 3,
                GP2: c,
                GP3: c}}
     },
  2: {G1: 2,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: aa,
                GP3: aa},
            2: {GP1: 2,
                GP2: bb,
                GP3: bb},
            3: {GP1: 3,
                GP2: cc,
                GP3: cc}}
     },
  3: {G1: 3,
      G2: some text,
      G3: some text,
      GP: { 1: {GP1: 1,
                GP2: aaa,
                GP3: aaa},
            2: {GP1: 2,
                GP2: bbb,
                GP3: bbb},
            3: {GP1: 3,
                GP2: ccc,
                GP3: ccc}}
     }
}
My code works fine for the elements directly under "G" (G1, G2, etc.), but for GP I get one of three wrong results: only a single record, all records but with the same data duplicated several times, or all 9 GP elements under one single GP key in the dictionary. Here is my code:
f = 'path to file'
tree = ET.parse(f)
root = tree.getroot()
self.tree = tree
self.root = root
gs = len(self.tree.getiterator('G'))
g = {}
for i in range(0, gs):
    d = {}
    for elem in self.tree.getiterator('G')[i]:
        if elem.text == "\n " and elem.tag not in ['GP']:
            dd = {}
            for parent in elem:
                if parent.text == "\n ":
                    ddd = {}
                    for child in parent:
                        ddd[child.tag] = child.text
                    dd[parent.tag] = ddd
                else:
                    dd[parent.tag] = parent.text
            d[elem.tag] = dd
        else:
            d[elem.tag] = elem.text
    g[i+1] = d
# Build GP
count = 0
gp = {}
for elem in self.tree.getiterator('GP'):
    d = {}
    for parent in elem:
        if parent.text == "\n ":
            dd = {}
            for child in parent:
                dd[child.tag] = child.text
            d[parent.tag] = dd
        else:
            d[parent.tag] = parent.text
    count += 1
    gp[count] = d
g["GP"] = gp
Answer
Your issue stems from how you're parsing the <GP> elements. Specifically, you're collecting the GP tags globally, outside of their parent G elements, and this leads to a few problems: duplicate entries, misalignment between the GP elements and their corresponding parent G, and an incorrect overall structure.
Let's break down the issues:
- Duplicate GP Entries: Since you are collecting all GP elements globally in the second part of your code, you're not associating the GP elements with their corresponding G parent, which leads to duplication or misalignment.
- Misaligned GP Structure: You're adding all GP elements to the single gp dictionary, but this doesn't preserve the parent-child relationship that each GP has under its specific G.
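To make the difference between a global lookup and a per-parent lookup concrete, here is a minimal sketch. The reduced XML string is hypothetical sample data (two G groups with two GP records each), not your file, and it uses iter(), which is the current replacement for getiterator().

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical sample: two G groups with two GP records each.
xml_text = """
<root>
  <G><G1>1</G1><GP><GP1>1</GP1></GP><GP><GP1>2</GP1></GP></G>
  <G><G1>2</G1><GP><GP1>1</GP1></GP><GP><GP1>2</GP1></GP></G>
</root>
"""
root = ET.fromstring(xml_text)

# iter() walks the whole subtree, so it sees every GP in the document.
global_count = len(list(root.iter('GP')))

# findall() with a bare tag name only matches direct children of each G.
per_group_counts = [len(g_elem.findall('GP')) for g_elem in root.findall('G')]

print(global_count)      # 4
print(per_group_counts)  # [2, 2]
```

The global count mixes records from every group, while the per-group counts keep each GP tied to the G it belongs to.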
Solution
We need to parse the XML so that the GP elements stay nested within their respective G entries. The trick is to loop through the G elements, find the GP elements inside each G, and organize them properly.
Here's how you can modify your code to handle that:
Updated Code:
import xml.etree.ElementTree as ET

f = 'path to file'
tree = ET.parse(f)
root = tree.getroot()

# Initialize the result dictionary for "G"
g = {}

# Iterate over each 'G' element that is a direct child of the root
for i, g_elem in enumerate(root.findall('G'), start=1):
    d = {}
    gp = {}  # GP records that belong to this 'G' only
    # Iterate over direct children of the 'G' element (G1, G2, G3, GP, etc.)
    for elem in g_elem:
        if elem.tag == 'GP':
            # Each 'GP' element is one record; number it by position within this 'G'
            gp[len(gp) + 1] = {
                child.tag: child.text.strip() if child.text else None
                for child in elem
            }
        else:
            # For other elements like G1, G2, G3
            d[elem.tag] = elem.text.strip() if elem.text else None
    # Attach the collected GP records once, after all children are processed
    d['GP'] = gp
    # Add the 'G' dictionary to the result
    g[i] = d

print(g)
Explanation:
- Iterating over the G elements:
  - We use root.findall('G') to get all G elements directly under the root and iterate over them.
  - enumerate(root.findall('G'), start=1) ensures that we start from 1 when building the dictionary keys (as required in your example).
- Handling nested GP elements:
  - Each GP element found directly under a G is one record. It is converted into a dictionary, with its inner tags (e.g., GP1, GP2, GP3) as keys and their corresponding text as values.
  - The records are numbered from 1 within each G, so the numbering restarts for every G.
  - The GP entries are nested under the GP key of their respective G entry.
- Handling other elements (G1, G2, G3):
  - For each G element, the other tags (such as G1, G2, G3) are added directly to the dictionary with their text content.
- Stripping text:
  - I used strip() on the text to remove extra whitespace and newlines. If the element has no text, None is assigned instead.
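The steps above handle one level of children under each GP. The inner loops in your original code suggest that deeper nesting might occur; if so, a recursive helper is one way to generalize. This is only a sketch under that assumption: element_to_value and groups_to_dict are hypothetical names, the inline sample with the X element is made up for illustration, and the helper assumes sibling tags are unique below a GP.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem):
    """Return stripped text for a leaf element, or a dict of its children."""
    children = list(elem)
    if not children:
        return elem.text.strip() if elem.text else None
    # Assumption: sibling tags are unique here; duplicates would overwrite each other.
    return {child.tag: element_to_value(child) for child in children}


def groups_to_dict(root):
    """Build {1: {..., 'GP': {1: {...}, ...}}, ...} from the G elements under root."""
    result = {}
    for i, g_elem in enumerate(root.findall('G'), start=1):
        entry = {}
        records = {}
        for elem in g_elem:
            if elem.tag == 'GP':
                # Number GP records by their position within this G
                records[len(records) + 1] = element_to_value(elem)
            else:
                entry[elem.tag] = element_to_value(elem)
        entry['GP'] = records
        result[i] = entry
    return result


# Hypothetical sample with one extra level of nesting under GP2
sample = ET.fromstring(
    "<root><G><G1>1</G1>"
    "<GP><GP1>1</GP1><GP2><X>deep</X></GP2></GP>"
    "</G></root>"
)
parsed = groups_to_dict(sample)
print(parsed)
# {1: {'G1': '1', 'GP': {1: {'GP1': '1', 'GP2': {'X': 'deep'}}}}}
```

Checking for child elements instead of comparing text against a whitespace string avoids depending on how the file happens to be indented.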
Example Output:
After running the code, the resulting dictionary will be structured correctly as follows:
{
    1: {
        'G1': '1',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'a', 'GP3': 'a'},
            2: {'GP1': '2', 'GP2': 'b', 'GP3': 'b'},
            3: {'GP1': '3', 'GP2': 'c', 'GP3': 'c'}
        }
    },
    2: {
        'G1': '2',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'aa', 'GP3': 'aa'},
            2: {'GP1': '2', 'GP2': 'bb', 'GP3': 'bb'},
            3: {'GP1': '3', 'GP2': 'cc', 'GP3': 'cc'}
        }
    },
    3: {
        'G1': '3',
        'G2': 'some text',
        'G3': 'some text',
        'GP': {
            1: {'GP1': '1', 'GP2': 'aaa', 'GP3': 'aaa'},
            2: {'GP1': '2', 'GP2': 'bbb', 'GP3': 'bbb'},
            3: {'GP1': '3', 'GP2': 'ccc', 'GP3': 'ccc'}
        }
    }
}
Key Points:
- Associating GP with its parent G: The critical fix is to ensure that each GP is associated with its parent G entry.
- Using findall instead of getiterator: findall with a plain tag name matches only direct children, which keeps each lookup scoped to the current element. getiterator was deprecated and then removed in Python 3.9; when you do need every descendant, iter is its replacement.
- Improving readability: By stripping excess whitespace and handling cases where there is no text, the code becomes more robust and easier to understand.
This should now give you the nested dictionary structure you're aiming for.