I am building a web scraper for wunderground.com, but my code returns "[]" for inches_rain and humidity. Can anyone see why this is happening?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
import time
from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        info_set = Selector(response).xpath('//div[@id="current"]')
        list = []
        for i in info_set:
            item = WundergroundScraperItem()
            item['description'] = i.xpath('div/div/div/div/span/text()').extract()
            item['description'] = item['description'][0]
            item['humidity'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
            item['inches_rain'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
            list.append(item)
        return list
I also know that the humidity and inches_rain items are set to the same xpath, but that should be correct because once the information is in an array I just set them to certain values from the array.
Answer
The issue you're encountering, where the humidity and inches_rain fields return an empty list ([]), is likely related to how you're using the XPath expressions and how the data is structured on the webpage you're scraping.
Here are the main things to check and fix:
1. XPath Expression Problem
The XPath expressions for both humidity and inches_rain are the same in your code:
item['humidity'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
item['inches_rain'] = i.xpath('div/table/tbody/tr/td/span/span/text()').extract()
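Sharing one expression can work if you pick the individual values out of the resulting list afterwards, as you describe, but it does not explain the empty result: [] means the expression matches nothing in the HTML that Scrapy received. Inspect the page structure (using your browser's developer tools, and ideally the raw page source as well) and check that the path really leads to the humidity and rainfall values. One thing to look for is the tbody step in your path: browsers insert a <tbody> element into tables when they build the DOM even if the page source does not contain one, so a path copied from the developer tools may not match the raw HTML.
The elements for humidity and rain are also likely located in different table rows or cells within the page, so a separate, more specific XPath expression for each field is easier to keep correct.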
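To illustrate how a path can match in the browser yet return [] in a scraper, here is a minimal sketch. It uses the standard library's ElementTree as a stand-in for Scrapy's selectors so that it runs without Scrapy installed, and the markup is a made-up, well-formed sample, not the real wunderground.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw page source: there is no <tbody> in the markup itself.
RAW_HTML = """
<div id="current">
  <table>
    <tr><td>Humidity</td><td>80%</td></tr>
    <tr><td>Rain</td><td>0.10 in</td></tr>
  </table>
</div>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path as it might be copied from browser developer tools,
# which display an implied <tbody> inside the table.
with_tbody = root.findall("table/tbody/tr/td")

# The same path without the <tbody> step matches the raw markup.
without_tbody = root.findall("table/tr/td")

print(len(with_tbody))                     # 0
print([td.text for td in without_tbody])   # ['Humidity', '80%', 'Rain', '0.10 in']
```

If the real page behaves like this sample, dropping tbody from the path (or using // to skip over it) would be enough to get matches; whether it does is something only the raw response can confirm.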
2. Correct XPath for Humidity and Rain
Here’s a revised approach assuming you're scraping from a typical weather page. You should double-check the page's actual HTML structure and adjust the XPath expressions accordingly.
- Humidity might be found in one table cell.
- Rain (in inches) could be in a separate table cell or span, and needs a different XPath.
For example, let's say humidity is located in a cell with text like "Humidity: 80%" and rain is in a cell with the text "Rain: 0.10 inches". You would modify your XPath like this:
# Example of revised XPath selectors for humidity and inches of rain.
# i is already the div with id "current", so the paths start from it with ".//".
item['humidity'] = i.xpath('.//table//tr[td[contains(text(),"Humidity")]]/td[2]/text()').extract()
item['inches_rain'] = i.xpath('.//table//tr[td[contains(text(),"Rain")]]/td[2]/text()').extract()
Here’s the logic:
- For humidity, you're selecting the row containing the text Humidity and extracting the second <td> for the value.
- Similarly, for rain, you look for a row containing Rain and extract the second <td> for the value.
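The label-based lookup described above can be tried on a small sample before running the spider. This is a sketch under assumptions: the markup is hypothetical, value_for is an illustrative helper name, and it uses the standard library's ElementTree instead of Scrapy so that it runs on its own. ElementTree's XPath subset has no contains(), so the sketch matches the label cell's text exactly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample of a "label / value" table.
SAMPLE = """<div id="current"><table>
<tr><td>Humidity</td><td>80%</td></tr>
<tr><td>Rain</td><td>0.10 in</td></tr>
</table></div>"""


def value_for(root, label):
    """Return the text of the second cell in the row labelled `label`, or None."""
    # Select the row that has a <td> whose text equals the label,
    # then take the second <td> in that row.
    cell = root.find(f".//tr[td='{label}']/td[2]")
    return cell.text if cell is not None else None


root = ET.fromstring(SAMPLE)
print(value_for(root, "Humidity"))  # 80%
print(value_for(root, "Rain"))      # 0.10 in
print(value_for(root, "Wind"))      # None
```

Anchoring on the label rather than on a long positional path keeps the lookup working if rows are reordered, and a missing row shows up as None instead of a wrong value.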
3. Ensure Correct Item Structure
In addition to fixing the XPath, you may want to check the structure of the response you're getting. Sometimes the Selector or response.xpath() results might need more refinement.
For example:
- Check whether i.xpath(...) actually returns the elements you're looking for (use .extract() or .get() to debug).
- Ensure that you’re not accidentally extracting an empty list because the XPath does not match any elements.
Revised Code:
# -*- coding: utf-8 -*-
import scrapy
from wunderground_scraper.items import WundergroundScraperItem


class WundergroundComSpider(scrapy.Spider):
    name = "wunderground"
    allowed_domains = ["www.wunderground.com"]
    start_urls = (
        'http://www.wunderground.com/q/zmw:10001.5.99999',
    )

    def parse(self, response):
        info_set = response.xpath('//div[@id="current"]')
        items = []  # avoid shadowing the built-in name "list"
        for i in info_set:
            item = WundergroundScraperItem()
            # Scraping the description (you can modify this part depending on the actual HTML structure)
            item['description'] = i.xpath('div/div/div/div/span/text()').extract()
            item['description'] = item['description'][0] if item['description'] else None
            # Scraping humidity; i is already div#current, so search below it with ".//"
            item['humidity'] = i.xpath('.//table//tr[td[contains(text(),"Humidity")]]/td[2]/text()').extract()
            item['humidity'] = item['humidity'][0] if item['humidity'] else None  # Assign None if no data
            # Scraping inches of rain
            item['inches_rain'] = i.xpath('.//table//tr[td[contains(text(),"Rain")]]/td[2]/text()').extract()
            item['inches_rain'] = item['inches_rain'][0] if item['inches_rain'] else None  # Assign None if no data
            items.append(item)
        return items
Notes:
- XPath Fixes: I've adjusted the XPath for humidity and rain based on a typical structure where the text for these data points is in table rows.
- Handling Empty Data: If the XPath doesn't return any data, the code now checks whether the list is empty and assigns None (or other default values) when the data is missing.
- Debugging: You can print out the values of item['humidity'] and item['inches_rain'] to debug whether the correct data is being extracted.
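The repeated "first element or None" check can be pulled into a small helper. This is only a sketch: first_or_default is an illustrative name, not part of Scrapy. Scrapy's selector lists also provide .get(), which returns None when nothing matches, so the helper is mainly useful if you already hold a plain list from .extract().

```python
def first_or_default(values, default=None):
    """Return the first extracted string, stripped of whitespace,
    or `default` when the list is empty."""
    if not values:
        return default
    return values[0].strip()


print(first_or_default([" 80% "]))         # 80%
print(first_or_default([]))                # None
print(first_or_default([], default="n/a")) # n/a
```

With it, an assignment could look like item['humidity'] = first_or_default(i.xpath(...).extract()).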
Final Considerations:
- Inspect the Website: Use the browser’s developer tools to ensure the correct structure of the HTML (or inspect the actual response of the scrapy spider by saving the HTML in a file and opening it).
- Handling Edge Cases: There might be additional cases where values are missing or need parsing, so be sure to check edge cases like missing Rain or Humidity values.
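For the parsing side of those edge cases, here is a hedged sketch of turning scraped strings into numbers. The input formats ("80%", "0.10 in") are assumptions about what the page might show, and parse_number is an illustrative helper name.

```python
import re

# Matches an optionally signed integer or decimal number.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_number(text):
    """Return the first number found in `text` as a float,
    or None if `text` is None or contains no number."""
    if text is None:
        return None
    match = _NUMBER.search(text)
    return float(match.group()) if match else None


print(parse_number("80%"))      # 80.0
print(parse_number("0.10 in"))  # 0.1
print(parse_number("--"))       # None
print(parse_number(None))       # None
```

Returning None for both a missing field and an unparseable one keeps the item structure the same in every case; if you need to tell the two apart, keep the raw string alongside the parsed value.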
Let me know if you need further adjustments or explanations!