Beginner to Scraping, keep on getting empty lists

ghz yesterday ⋅ 3 views

I've decided to take a swing at web scraping using Python (with lxml and requests). The webpage I'm trying to scrape to learn is: http://www.football-lineups.com/season/Real_Madrid/2013-2014

What I want to scrape is the table on the left of the webpage (the table with the scores and formations used). Here is the code I'm working with:

from lxml import html
import requests
page = requests.get("http://www.football-lineups.com/season/Real_Madrid/2013-2014")
tree = html.fromstring(page.text)
competition = tree.xpath('//*[@id="sptf"]/table/tbody/tr[2]/td[4]/font/text()')
print(competition)

The XPath I used is the one I copied from Chrome's DevTools. The code should return the competition of the first match in the table (i.e. La Liga); in other words, the second row, fourth column entry (there is a random second column in the web layout, I don't know why). However, when I run the code, I get back an empty list. Where might this code be going wrong?

Answer

Your issue lies in the way you're using the XPath and how the HTML is being processed. There are a few possible reasons why you’re getting an empty list as a result:

  1. XPath Issue:
    The XPath //*[@id="sptf"]/table/tbody/tr[2]/td[4]/font/text() was copied from Chrome's Elements panel, which shows the DOM after the browser has repaired the page. Browsers insert a <tbody> element into every <table>, even when the server never sent one, so the /tbody/ step in a copied XPath often matches nothing in the raw HTML that lxml parses.

  2. HTML Parsing:
    Sometimes, the page might have dynamic content (like JavaScript-rendered elements) that doesn’t appear in the raw HTML response, but is added after the page is fully rendered by the browser. The requests library does not execute JavaScript, so the data you’re looking for might be missing or structured differently in the raw HTML.

  3. HTML Structure:
    The HTML structure of the page could be different from what you expect, meaning the XPath you're using might be incorrect or not targeting the right elements.
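The first and third points are frequently the same bug: an XPath copied from DevTools contains a `tbody` step that exists only in the browser's repaired DOM. A self-contained sketch with made-up markup (not the real site) shows the effect:

```python
from lxml import html

# A minimal stand-in for raw server markup: note there is no <tbody>,
# even though Chrome's inspector would show one after parsing.
raw = """
<div id="sptf">
  <table>
    <tr><th>#</th><th>Date</th><th>Opponent</th><th>Competition</th></tr>
    <tr><td>1</td><td>18 Aug</td><td>Real Betis</td><td><font>La Liga</font></td></tr>
  </table>
</div>
"""
tree = html.fromstring(raw)

# The DevTools-copied path insists on a tbody element and finds nothing:
print(tree.xpath('//*[@id="sptf"]/table/tbody/tr[2]/td[4]/font/text()'))  # []

# A tbody-agnostic path matches the same cell:
print(tree.xpath('//*[@id="sptf"]/table//tr[2]/td[4]/font/text()'))  # ['La Liga']
```

lxml's HTML parser, unlike a browser, does not invent `<tbody>` elements, which is why the two queries disagree.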

Steps to Debug:

1. Check the Raw HTML:

To make sure you’re parsing the correct content, you can check the raw HTML of the page before applying the XPath:

from lxml import html
import requests

page = requests.get("http://www.football-lineups.com/season/Real_Madrid/2013-2014")
tree = html.fromstring(page.text)

# Print the first 500 characters of the raw HTML to inspect the structure
print(page.text[:500])

Inspect the raw HTML to confirm that the data you want is there and whether the XPath needs to be adjusted.

2. Test the XPath in Isolation:

Open the website in Chrome and use the Developer Tools (press F12) to inspect the structure of the page. Use the Elements tab to locate the part of the HTML that contains the data you're looking for.

In Chrome DevTools, you can test your XPath directly by typing this into the console:

$x('//*[@id="sptf"]/table/tbody/tr[2]/td[4]/font')

This will help you verify if the XPath you’re using is correct and if it matches the page structure.

3. Try an Adjusted XPath:

Sometimes the data you're trying to scrape is wrapped in different tags than you expect, or the copied path contains steps (such as tbody) that exist only in the browser's repaired DOM. A more flexible approach is a relative XPath that doesn't insist on the exact nesting.

For instance, you can iterate over every row (tr) under the table and read the fourth cell of each:

# Scrape all rows under the 'sptf' table; '//tr' also matches rows
# that are not wrapped in a <tbody> in the raw HTML
matches = tree.xpath('//*[@id="sptf"]/table//tr')

# Iterate through the rows and get the competition in the fourth column
for match in matches:
    competition = match.xpath('td[4]/font/text()')
    if competition:
        print(competition[0])

This will return the text of all the font elements inside the fourth column of each row in the table.
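If the markup varies between rows (some sites drop the `font` wrapper on certain cells), `text_content()` is a more forgiving way to read a cell, since it gathers all text beneath it regardless of nesting. A sketch on hypothetical markup:

```python
from lxml import html

# Hypothetical rows: one wraps the text in <font>, one does not
raw = """
<table id="results">
  <tr><td>1</td><td><font>La Liga</font></td></tr>
  <tr><td>2</td><td>Champions League</td></tr>
</table>
"""
tree = html.fromstring(raw)

for row in tree.xpath('//table[@id="results"]//tr'):
    cells = row.xpath('td[2]')
    if cells:
        # text_content() returns the concatenated text of all descendants,
        # so it works with or without the <font> wrapper
        print(cells[0].text_content().strip())
```

Here a pure `td[2]/font/text()` query would miss the second row, while `text_content()` reads both.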

4. Handle Missing or Dynamic Data:

If the page uses JavaScript to render data, consider a tool that actually executes JavaScript, such as Selenium (which drives a real browser) or Splash (a scriptable headless browser), and scrape the rendered content. Here's how you can use Selenium to scrape content after the JavaScript has run:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from lxml import html

# Selenium 4.6+ downloads and manages the driver binary automatically;
# on older versions, pass the chromedriver path via a Service object
driver = webdriver.Chrome()

# Get the page
driver.get("http://www.football-lineups.com/season/Real_Madrid/2013-2014")

# Wait explicitly until the target table is present in the DOM (up to 10 s)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "sptf"))
)

# Get the page source after JavaScript execution
page_source = driver.page_source

# Use lxml to parse the rendered HTML
tree = html.fromstring(page_source)

# The rendered DOM does contain <tbody>, but '//tr' stays robust either way
matches = tree.xpath('//*[@id="sptf"]/table//tr')
for match in matches:
    competition = match.xpath('td[4]/font/text()')
    if competition:
        print(competition[0])

# Close the browser once done
driver.quit()

Summary:

  • Check the raw HTML: Ensure the content you need is actually present in the HTML.
  • Test XPath: Validate the XPath in the browser’s developer tools.
  • Try a more general XPath: Instead of targeting specific elements directly, try scraping the entire table and parsing the rows.
  • Handle dynamic content: If the content is rendered by JavaScript, use Selenium to scrape the rendered page.

By debugging with these steps, you should be able to pinpoint the exact problem and get your web scraper working!