lxml xml parsing with html tags inside xml tags

<xml>
<maintag>    
<content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content>
</maintag>
</xml>

The XML file that I regularly parse may have HTML tags inside the content tag, as shown above.

Here is how I parse the file:

from io import StringIO
from lxml import etree

# xmlFile holds the XML document as a string
parser = etree.XMLParser(remove_blank_text=False)
tree = etree.parse(StringIO(xmlFile), parser)
for item in tree.iter('maintag'):
  my_content = item.find('content').text
  #print(my_content)
  #output: lorem

As a result, my_content is 'lorem' instead of what I'd like to see: ' lorem ipsum dolor sit and so on'

How can I read the content as ' lorem ipsum dolor sit and so on'?

Note: the content tag may contain other HTML tags instead of strong, or none at all.

Answer

It seems you're trying to load tabular data into a NumPy array and automatically generate column names without knowing how many columns there are in advance. Let’s break down the problem and the possible solutions.

Issue Breakdown:

Automatic Column Naming: np.genfromtxt can generate column names for you, and you want that to happen without manually inspecting the file first (i.e., without having to know how many columns there are).
Using names=True: Using names=True forces np.genfromtxt to interpret the first row of the file as column names. However, you don’t want to consume the first row if it's not meant to be the header.
Empty names: Setting names='' doesn’t work as intended because np.genfromtxt expects a sequence of column names, and you can't have an empty sequence.
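
To make the names=True point concrete, here is a minimal sketch. The in-memory sample text is made up for illustration and has no header row, so the first data row is consumed as field names and one row of data is lost.

```python
import io
import numpy as np

# Hypothetical sample data with NO header row (three data rows).
text = "10,20,30\n40,50,60\n70,80,90\n"

# Default behaviour: every row is treated as data.
plain = np.genfromtxt(io.StringIO(text), delimiter=',')

# names=True: the first row is read as field names, not as data.
with_header = np.genfromtxt(io.StringIO(text), delimiter=',', names=True)

print(plain.shape)              # all three rows are present
print(with_header.shape)        # one row fewer: the first row became the names
print(with_header.dtype.names)  # field names taken from the first row
```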

Solution:

Here’s a solution where we can automatically generate the column names even when you don’t know how many columns are present in the data:

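Use np.genfromtxt with dtype=None, either without setting names or with names=None.
When the columns have different types, NumPy returns a structured array and assigns default field names ('f0', 'f1', 'f2', etc.) based on the number of columns. When every column has the same type, it returns a plain 2-D array with no field names.
Manipulate the column names after reading the data, if needed.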

Solution Steps:

import numpy as np
from numpy.lib import recfunctions as rfn

# Load the data with dtype=None to let numpy guess the types
data = np.genfromtxt('your_file.csv', delimiter=',', dtype=None, names=None)

# Columns of different types give a structured array with fields 'f0', 'f1', ...
# Columns that all share one type give a plain 2-D array without field names,
# so convert that case to a structured array (its fields are also named 'f0', 'f1', ...)
if data.dtype.names is None:
    data = rfn.unstructured_to_structured(data)

print(data)

# If you want to assign your own names (like 'col0', 'col1', ...), do it like this:
num_columns = len(data.dtype.names)  # number of fields (columns)
column_names = [f'col{i}' for i in range(num_columns)]  # generates 'col0', 'col1', 'col2', ...
data.dtype.names = column_names

print(data.dtype.names)  # This will show your custom column names

# Example of accessing data by column name
print(data['col0'])  # Accessing the first column by name

Explanation:

Load the data: with dtype=None, np.genfromtxt infers a type for each column. When the columns have different types, the result is a structured array, and because no names were supplied its fields are named 'f0', 'f1', 'f2', etc. When every column has the same type, the result is a plain 2-D array with no field names.
Get the number of columns: for a structured array, the number of columns is the number of fields in a record; len(data[0]) and len(data.dtype.names) both give it.
Assign custom column names: once the data is in a structured array, you can assign more meaningful names to the columns by setting data.dtype.names to your desired list of names. The list must have exactly one name per field.

Additional Notes:

Flexible column naming: Even if you don't know how many columns there are, NumPy generates default field names (f0, f1, f2, ...) whenever it returns a structured array, and you can modify them afterwards.
Structured array: When the columns have different types, data is a structured NumPy array, so you can access columns via their names (data['f0'], or data['col0'] after renaming).
Handling Mixed Data Types: If the file contains mixed data types (e.g., integers, floats, strings), setting dtype=None lets NumPy guess the type of each column. If every column has the same type, NumPy returns a plain 2-D array of that type (for example an integer or float64 array) with no field names, so data.dtype.names is None.
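
The two cases described in these notes can be handled by one small helper. The following is only a sketch under stated assumptions: load_with_names is a hypothetical name, the sample inputs are made up, encoding='utf-8' is assumed, and the input is assumed to have at least two columns. It uses numpy.lib.recfunctions to turn a plain array into a structured one and to rename the fields.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn


def load_with_names(source, delimiter=','):
    """Hypothetical helper: always return a structured array with
    fields named 'col0', 'col1', ... (assumes at least two columns)."""
    data = np.genfromtxt(source, delimiter=delimiter, dtype=None,
                         names=None, encoding='utf-8')
    if data.dtype.names is None:
        # Every column had the same type, so genfromtxt returned a plain
        # array; convert it (default field names are 'f0', 'f1', ...).
        data = rfn.unstructured_to_structured(data)
    # Map the default names 'f0', 'f1', ... to 'col0', 'col1', ...
    mapping = {old: f'col{i}' for i, old in enumerate(data.dtype.names)}
    return rfn.rename_fields(data, mapping)


# Made-up sample inputs: one with mixed types, one purely numeric.
mixed = load_with_names(io.StringIO("1,abc,2.5\n2,def,3.5\n"))
numeric = load_with_names(io.StringIO("1,2,3\n4,5,6\n"))

print(mixed.dtype.names)
print(mixed['col1'])
print(numeric.dtype.names)
print(numeric['col0'])
```

rename_fields returns a view with the new names rather than modifying the dtype in place, so the array returned by genfromtxt is left untouched.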

This solution should resolve your issue, letting you automatically generate column names without prior introspection of the file and still manipulate the column names afterward.