Why does my Scrapy code return an empty array?

I am building a web scraper for wunderground.com, but my code returns "[]" for inches_rain and humidity. Can anyone see why this is happening?

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
import time

from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        info_set = Selector(response).xpath('//div[@id="current"]')
        list = []
        for i in info_set:
            item = WundergroundScraperItem()
            item['description'] = i.xpath('div/div/div/div/span/text()').extract()
            item['description'] = item['description'][0]
            item['humidity'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
            item['inches_rain'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
            list.append(item)
        return list

I also know that the humidity and inches_rain items are set to the same xpath, but that should be correct because once the information is in an array I just set them to certain values from the array.

Answer

The humidity and inches_rain fields come back as an empty list ([]) because the XPath expression they share most likely matches nothing in the HTML that Scrapy receives. .extract() returns [] whenever a selector matches no nodes, so the problem lies in how the expression lines up with the actual structure of the page.

Here are the main things to check and fix:

1. XPath Expression Problem

The XPath expressions for humidity and inches_rain are identical in your code:

item['humidity'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
item['inches_rain'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()

Sharing one path is not in itself what produces the empty list, but it is fragile: you end up depending on the position of each value in the extracted list. A more likely culprit is the tbody step. Browsers insert <tbody> into tables when they render a page, so a path copied from the developer tools can contain a step that does not exist in the raw HTML Scrapy downloads. Compare your path against the page source (not the inspector view) to confirm.

The values for humidity and rain are also likely to sit in different table rows or cells, so a separate expression for each, keyed on the row's label, is more robust.
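
To illustrate how a tbody step copied from the browser can empty the result, here is a minimal sketch. Browsers add <tbody> when building the DOM, while the raw HTML often has none. The fragment is made up (it is not the real wunderground.com markup), and the standard library's xml.etree.ElementTree is used as a stand-in for Scrapy's selectors so the snippet runs without a network request:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment as a server might send it: the table has no <tbody>.
RAW_HTML = """
<div id="current">
  <table>
    <tr><td>Humidity</td><td>80%</td></tr>
    <tr><td>Rain</td><td>0.10 in</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from a browser inspector, which shows an inserted <tbody>.
with_tbody = [td.text for td in root.findall("./table/tbody/tr/td")]

# The same path without the tbody step.
without_tbody = [td.text for td in root.findall("./table/tr/td")]

print(with_tbody)     # []
print(without_tbody)  # ['Humidity', '80%', 'Rain', '0.10 in']
```

In Scrapy the equivalent fix would be to drop tbody from the path, or to use // so the expression works whether or not the element is present, for example .//table//tr/td.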

2. Correct XPath for Humidity and Rain

Here’s a revised approach assuming you're scraping from a typical weather page. You should double-check the page's actual HTML structure and adjust the XPath expressions accordingly.

Humidity might be found in one table cell.
Rain (in inches) could be in a separate table cell or span, and needs a different XPath.

For example, let's say humidity is located in a cell with text like "Humidity: 80%" and rain is in a cell with the text "Rain: 0.10 inches". You would modify your XPath like this:

# Example of revised XPath selectors for humidity and inches of rain.
# "i" is already the div#current node, so the paths start with ".//" rather than
# repeating div[@id="current"], which would look for a nested div with the same id.
item['humidity'] = i.xpath('.//table//tr[td[contains(., "Humidity")]]/td[2]//text()').extract()
item['inches_rain'] = i.xpath('.//table//tr[td[contains(., "Rain")]]/td[2]//text()').extract()

Here’s the logic:

For humidity, you're selecting the row containing the text Humidity and extracting the second <td> for the value.
Similarly, for rain, you would look for a row containing Rain and extract the second <td> for the value.
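
The same label-to-value idea can also be applied once for the whole table instead of once per field. The sketch below is an alternative under assumptions, not the page's confirmed layout: the markup is invented, rows_to_dict is a hypothetical helper, and xml.etree.ElementTree again stands in for Scrapy's selectors so it runs offline:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page may label or nest these cells differently.
TABLE_HTML = """
<table>
  <tr><td>Humidity</td><td><span>80%</span></td></tr>
  <tr><td>Rain</td><td><span>0.10 in</span></td></tr>
</table>
"""


def rows_to_dict(table):
    """Map the text of each row's first cell to the text of its second cell."""
    values = {}
    for row in table.iter("tr"):
        cells = row.findall("td")
        if len(cells) < 2:
            continue  # skip header or spacer rows
        label = "".join(cells[0].itertext()).strip()
        value = "".join(cells[1].itertext()).strip()
        values[label] = value
    return values


readings = rows_to_dict(ET.fromstring(TABLE_HTML))
print(readings.get("Humidity"))  # 80%
print(readings.get("Rain"))      # 0.10 in
print(readings.get("Wind"))      # None
```

Looking values up by label avoids depending on their position in one long extracted list, which is what the original shared XPath relied on.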

3. Ensure Correct Item Structure

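In addition to fixing the XPath, check what the spider actually receives. The HTML Scrapy downloads can differ from what the browser shows, for instance when values are filled in by JavaScript after the page loads. Running scrapy shell with the page URL lets you try expressions interactively against the real response.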

For example:

Check whether i.xpath(...) actually returns the elements you're looking for (use .extract() or .get() to debug).
Ensure that you’re not accidentally extracting an empty list because the XPath does not match any elements.
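
As a rough way to apply these two checks inside the spider, a small helper can flag fields that came back empty. This is only a sketch: empty_fields is a hypothetical name, and a plain dict with made-up values stands in for the Scrapy item.

```python
def empty_fields(item):
    """Return the names of fields whose value is None, '' or an empty list."""
    return [
        name
        for name, value in item.items()
        if value is None or value == "" or value == []
    ]


# A plain dict stands in for the scraped item here (made-up values).
scraped = {"description": "Clear", "humidity": [], "inches_rain": []}
missing = empty_fields(scraped)
print(missing)  # ['humidity', 'inches_rain']
```

Inside parse(), a non-empty result could be passed to self.logger.warning(...) together with response.url, so that a non-matching XPath shows up in the crawl log instead of silently producing [].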

Revised Code:

# -*- coding: utf-8 -*-
import scrapy

from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        # response.xpath() can be called directly; wrapping the response in Selector is not needed
        info_set = response.xpath('//div[@id="current"]')
        items = []  # renamed from "list" to avoid shadowing the built-in
        for i in info_set:
            item = WundergroundScraperItem()

            # Scraping the description (modify this part depending on the actual HTML structure)
            item['description'] = i.xpath('div/div/div/div/span/text()').extract()
            item['description'] = item['description'][0] if item['description'] else None

            # Scraping humidity. "i" is already the div#current node, so the path
            # starts with ".//" instead of repeating div[@id="current"].
            item['humidity'] = i.xpath('.//table//tr[td[contains(., "Humidity")]]/td[2]//text()').extract()
            item['humidity'] = item['humidity'][0] if item['humidity'] else None  # Assign None if no data

            # Scraping inches of rain (same pattern, different label)
            item['inches_rain'] = i.xpath('.//table//tr[td[contains(., "Rain")]]/td[2]//text()').extract()
            item['inches_rain'] = item['inches_rain'][0] if item['inches_rain'] else None  # Assign None if no data

            items.append(item)

        return items

Notes:

XPath Fixes: I've adjusted the XPath for humidity and rain based on a typical structure where the text for these data points is in table rows.
Handling Empty Data: If the XPath doesn't return any data, the code now checks whether the list is empty and assigns None (or other default values) when the data is missing.
Debugging: You can print out the values of item['humidity'] and item['inches_rain'] to debug whether the correct data is being extracted.

Final Considerations:

Inspect the Website: Use the browser’s developer tools to ensure the correct structure of the HTML (or inspect the actual response of the scrapy spider by saving the HTML in a file and opening it).
Handling Edge Cases: There might be additional cases where values are missing or need parsing, so be sure to check edge cases like missing Rain or Humidity values.
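
For the parsing part of those edge cases, one possible approach is to pull the first number out of each extracted string and fall back to None. The sketch below assumes the values arrive as strings such as "80%" or "0.10 in"; those formats, and the parse_number name, are assumptions rather than something confirmed from the page.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    if not text:
        return None
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


print(parse_number("80%"))      # 80.0
print(parse_number("0.10 in"))  # 0.1
print(parse_number("--"))       # None
print(parse_number(None))       # None
```

Returning None for missing or placeholder values keeps the item consistent with the None defaults used for empty XPath results.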
