How to extract all <p> elements with their corresponding <h2>?


I am trying to get all the <p> elements that come after each <h2>.

I know how to do this when there is only one <p> after an <h2>, but not when there are multiple <p> elements.

Here's an example of the webpage:

<h2>Heading Text1</h2>

<p>Paragraph1</p>
<p>Paragraph2</p>

<h2>Heading Text2</h2>

<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
....

I need to get all paragraphs grouped under their headings, e.g. Paragraphs 1 and 2 belong under Heading Text1.

I'm trying to do this with BeautifulSoup in Python; I've been trying for days and googling, without success.

How can this be done?

Answer

To extract all <p> elements that follow each <h2> and group them by their respective <h2> headings using BeautifulSoup in Python, you can approach the problem as follows:

Steps:

  1. Find all <h2> tags: You will first find all <h2> headings in the document.
  2. Extract following <p> tags: For each <h2>, gather all subsequent <p> elements until the next <h2> or the end of the document.
  3. Handle multiple <p> tags: Ensure that all paragraphs under each <h2> are collected together.

Solution Using BeautifulSoup

Here's how you can implement it:

from bs4 import BeautifulSoup

html = """
<h2>Heading Text1</h2>
<p>Paragraph1</p>
<p>Paragraph2</p>
<h2>Heading Text2</h2>
<p>Paragraph3</p>
<p>Paragraph4</p>
<p>Paragraph5</p>
"""

# Parse the HTML
soup = BeautifulSoup(html, 'html.parser')

# Find all <h2> elements
headings = soup.find_all('h2')

# Initialize a list to store the results
result = []

# Loop through each <h2> heading
for heading in headings:
    # Create a dictionary for each heading and its associated paragraphs
    section = {'heading': heading.get_text(), 'paragraphs': []}
    
    # Get the next sibling elements
    sibling = heading.find_next_sibling()
    
    while sibling and sibling.name != 'h2':  # Continue until we find another <h2>
        if sibling.name == 'p':  # If it's a <p> tag, add to the list
            section['paragraphs'].append(sibling.get_text())
        sibling = sibling.find_next_sibling()
    
    # Append the section to the result list
    result.append(section)

# Print the results
for section in result:
    print(f"Heading: {section['heading']}")
    for para in section['paragraphs']:
        print(f"  - {para}")

Explanation:

  1. soup.find_all('h2'): This finds all the <h2> headings in the document.
  2. Loop through headings: For each <h2>, find_next_sibling() retrieves the element that immediately follows it, and the loop then walks forward one sibling at a time.
  3. Collect <p> tags: Each sibling is checked; if it is a <p>, its text is appended to that heading's list of paragraphs.
  4. Stopping condition: The loop stops once we encounter another <h2> (which indicates the start of a new section) or reach the end of the document.
  5. Store and print results: For each heading, the associated paragraphs are stored in a dictionary, and then the results are printed.
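
The sibling-walking loop described in steps 2-4 above can also be written more compactly with find_next_siblings() and itertools.takewhile. This is just an equivalent sketch reusing the soup object from the example above, not a different technique:

from itertools import takewhile

result = []
for heading in soup.find_all('h2'):
    # All following sibling tags, cut off at the next <h2>
    section_tags = takewhile(lambda tag: tag.name != 'h2', heading.find_next_siblings())
    paragraphs = [tag.get_text() for tag in section_tags if tag.name == 'p']
    result.append({'heading': heading.get_text(), 'paragraphs': paragraphs})

This builds the same list of dictionaries as the loop above, so the printing code and the output below are unchanged.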

Output:

For the provided HTML:

Heading: Heading Text1
  - Paragraph1
  - Paragraph2
Heading: Heading Text2
  - Paragraph3
  - Paragraph4
  - Paragraph5

Key Points:

  • Handling multiple <p> tags: The loop ensures that all paragraphs (<p>) after each <h2> are included.
  • Grouping paragraphs under their respective headings: By iterating through siblings and stopping when a new <h2> is encountered, the paragraphs are grouped under the correct heading.
  • General approach: The same pattern works for any number of headings and paragraphs in the document.

This should solve your problem of collecting all paragraphs related to each heading.
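
If you are scraping a live webpage rather than an HTML string, the same grouping logic applies once the page is downloaded. Below is a minimal sketch, assuming the requests library is available; the URL is a placeholder, so substitute the page you actually want to scrape:

import requests
from bs4 import BeautifulSoup

# Placeholder URL: replace with the page you are actually scraping
response = requests.get("https://example.com/some-article")
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')
# From here, the heading/paragraph loop shown above works unchanged

One caveat: on real pages the <p> tags are not always direct siblings of the <h2> (each section may be wrapped in a <div>, for example), in which case the sibling-based loop needs to be adapted to the page's actual structure.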