Retrieve comment from specific XML node in Python

ghz 17hours ago ⋅ 2 views

I have the following "example.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>tag2<!-- comment = “this is the tag1 comment”--></tag2>
    <tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag3>
  </tag1>
</root>

I'd like to retrieve the comment of a specific node. For now, I'm only able to retrieve all comments from the file, using the following:

from lxml import etree

tree = etree.parse("example.xml")
comments = tree.xpath('//comment()')
print(comments)

As expected, this returns all the above comments from the file in a list:

[<!-- comment = \u201cthis is the tag1 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

However, how and where do I specify the node whose comment I want to retrieve? For example, how can I specify tag2 so that only <!-- comment = \u201cthis is the tag1 comment\u201d--> is returned?

EDIT

I have a use case where I need to iterate over each node of the XML file. If the iterator comes to a node that has more than one child with a comment, it returns all the comments of its children. For example, consider the following "example2.xml" file:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = “this is the tag3 comment”--></tag3>
      <tag4>tag4<!-- comment = “this is the tag4 comment”--></tag4>
    </tag2>
  </tag1>
</root>

If I follow the same steps as above, when the loop iterates at tag1/tag2, it returns all of the comments for tag3 and tag4.

I.e.:

from lxml import etree

tree = etree.parse("example2.xml")
comments = tree.xpath('tag1[1]/tag2//comment()')
print(comments)

returns

[<!-- comment = \u201cthis is the tag3 comment\u201d-->, <!-- comment = \u201cthis is the tag4 comment\u201d-->]

My two questions are therefore:

  1. How can I just return the comment of the direct node rather than including any of its children?
  2. As the result is returned in the form of a list, how can I retrieve the value/text of the comment from said list?

Answer

To address your questions:

1. Retrieving a Comment for a Specific Node Only (Directly)

In example.xml the comment sits inside tag2, so in XPath terms it is a child node of tag2, not a sibling. The following-sibling:: axis would return nothing here. Instead, give the path to the element and end it with the comment() node test, which uses the default child axis:

from lxml import etree

tree = etree.parse("example.xml")

# Retrieve the comment that is a direct child of tag2
comment = tree.xpath('//tag2/comment()')
print(comment)  # Outputs: [<!-- comment = “this is the tag1 comment”-->]

2. Retrieving Comments Without Including Children

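To exclude nested comments, end the path with /comment() (short for /child::comment()) rather than //comment(). The // step expands to the descendant-or-self axis, which is why tag2//comment() also returns the comments inside tag3 and tag4.

In example2.xml, tag2 has no comment of its own, so the direct query returns an empty list. Point the path at tag3 to get only that element's comment:

from lxml import etree

tree = etree.parse("example2.xml")

# Direct comments of tag2 under the first tag1: none, they belong to tag3 and tag4
print(tree.xpath('//tag1[1]/tag2/comment()'))  # Outputs: []

# Direct comments of tag3 only
comments = tree.xpath('//tag1[1]/tag2/tag3/comment()')
print(comments)  # Outputs: [<!-- comment = “this is the tag3 comment”-->]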
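The difference between the child step and the descendant step can be checked without any file on disk. The sketch below is only an illustration: XML_DOC is an assumed inline copy of the first tag1 branch of example2.xml (with plain ASCII quotes), and the variable names are made up for this example.

```python
from lxml import etree

# Assumed inline copy of the first <tag1> branch of example2.xml,
# so the sketch runs without reading a file.
XML_DOC = """<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = "this is the tag3 comment"--></tag3>
      <tag4>tag4<!-- comment = "this is the tag4 comment"--></tag4>
    </tag2>
  </tag1>
</root>"""

root = etree.fromstring(XML_DOC)
tag2 = root.find("tag1/tag2")

# Descendant step: every comment anywhere below tag2.
nested = tag2.xpath(".//comment()")

# Child step: only comments that are direct children of tag2 (none here).
direct = tag2.xpath("comment()")

# Child step on tag3: just tag3's own comment.
tag3_direct = tag2.xpath("tag3/comment()")

print(len(nested), len(direct), len(tag3_direct))
```

Calling xpath() on an element rather than on the tree makes the path relative to that element, which is convenient when the node has already been located some other way.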

3. Extracting the Text of Comments

xpath() returns a list of comment nodes. Read each node's .text attribute to get its content. The value includes any whitespace just inside <!-- and -->, so call .strip() if you don't want it:

# Extract and print the text of each comment, without surrounding whitespace
comment_texts = [comment.text.strip() for comment in comments]
print(comment_texts)
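As a self-contained sketch of the text extraction, the snippet below parses a one-element document from a string (an assumed, simplified stand-in for example.xml) and shows the raw .text value, the stripped value, and the XPath string() function, which returns the text of the first matching node directly instead of a list.

```python
from lxml import etree

# Assumed, simplified stand-in for example.xml (ASCII quotes for portability).
root = etree.fromstring(
    '<root><tag2>tag2<!-- comment = "this is the tag1 comment"--></tag2></root>'
)

comments = root.xpath("//tag2/comment()")

# .text keeps the whitespace that follows "<!--" in the source.
raw_texts = [c.text for c in comments]

# .strip() removes that surrounding whitespace.
clean_texts = [c.text.strip() for c in comments]

# string() converts the first node of the result to its text value,
# so no list indexing is needed when exactly one comment is expected.
first_text = root.xpath("string(//tag2/comment())")

print(raw_texts)
print(clean_texts)
print(first_text)
```

The list form is the safer default when a node may hold several comments; string() silently ignores everything after the first match.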

Use Case Example

Here is how you can retrieve comments while iterating over every element of the file, so that each element reports only the comments that are its direct children:

from lxml import etree

tree = etree.parse("example2.xml")

# Iterate over all elements and retrieve only their direct comments
for node in tree.xpath('//*'):
    comments = node.xpath('comment()')
    comment_texts = [comment.text.strip() for comment in comments]
    if comment_texts:  # skip elements that have no comment of their own
        print(f"Comments for node {node.tag}: {comment_texts}")

Output Example

For example2.xml, the above code will output:

Comments for node tag3: ['comment = “this is the tag3 comment”']
Comments for node tag4: ['comment = “this is the tag4 comment”']
Comments for node tag3: ['comment = “this is the tag3 comment”']
Comments for node tag4: ['comment = “this is the tag4 comment”']
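An alternative worth considering is to go the other way round: collect every comment once and ask each one for its parent with getparent(). The sketch below is one possible way to do that, not the only one. XML_DOC is an assumed inline copy of example2.xml with ASCII quotes, and comments_by_path is a hypothetical name for a dict that maps each element's path (from the tree's getpath() method) to the comments it directly contains.

```python
from lxml import etree

# Assumed inline copy of example2.xml (ASCII quotes), so no file is needed.
XML_DOC = """<root>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = "this is the tag3 comment"--></tag3>
      <tag4>tag4<!-- comment = "this is the tag4 comment"--></tag4>
    </tag2>
  </tag1>
  <tag1>
    <tag2>
      <tag3>tag3<!-- comment = "this is the tag3 comment"--></tag3>
      <tag4>tag4<!-- comment = "this is the tag4 comment"--></tag4>
    </tag2>
  </tag1>
</root>"""

root = etree.fromstring(XML_DOC)
tree = root.getroottree()

# Hypothetical helper structure: element path -> list of its direct comments.
comments_by_path = {}
for comment in root.xpath("//comment()"):
    parent = comment.getparent()
    if parent is None:
        # Comments outside the root element have no parent element.
        continue
    path = tree.getpath(parent)
    comments_by_path.setdefault(path, []).append(comment.text.strip())

for path, texts in comments_by_path.items():
    print(path, "->", texts)
```

Because getpath() produces a distinct path for each element, the two tag3 elements end up under separate keys even though they share a tag name.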

Key Insights:

  1. End the path with the child step comment() (for example //tag2/comment()) to select only the comments that sit directly inside a node; //comment() also descends into the node's children.
  2. Use .text to extract the text value from comment nodes returned by xpath.
  3. Carefully construct your xpath to limit the scope to direct children or specific nodes, avoiding unwanted nested results.