I have a Scrapy spider that tries to pick out content after submitting a form, but the output pages I get from the spider are extremely inconsistent. All of the pages I am crawling have data in them when I visit them in my web browser. Scrapy gets past the form and reaches the result page, but most of the time it finds no result there, even though the result page actually exists. It does, however, find the last page every time, so it does seem to be a problem with sessions.
Here's the code for my spider:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest, Request
from scrapy.shell import inspect_response


class MaharashtraSpider(scrapy.Spider):
    name = "maharashtra2"
    allowed_domains = ["mahavat.gov.in"]
    start_urls = (
        'http://mahavat.gov.in/',
    )

    def parse(self, response):
        return Request('http://mahavat.gov.in/Tin_Search/Tinsearch.jsp',
                       callback=self.parse_form)

    def parse_form(self, response):
        base_no = 27020000034
        no = base_no
        for i in range(100):
            yield FormRequest.from_response(
                response,
                formname='f1',
                formdata={
                    "tin": "%sC" % no,
                    "pan": "",
                    "rc_no": "",
                    "fptecno": "",
                    "bptecno": "",
                    "DEALERNAME": "",
                    "Submit": "SEARCH"
                },
                callback=self.result_page)
            no += 97  # The difference between pages with content is 97

    def result_page(self, response):
        url = response.xpath('//a[@class="search-head"]/@href').extract()[0]
        url = response.urljoin(url)
        yield Request(url, callback=self.process)

    def process(self, response):
        x = response.xpath("//td/text()").extract()
        x = [x[i].strip() for i in range(1, len(x), 2)]
        print "Dealer_Name = ", x[0]
        print "Tin_Number = ", x[1]
        # inspect_response(response, self)
What am I doing wrong?
It seems like a session problem and NOT an AJAX problem, because there isn't any XHR request in the POST.
Also, I have a somewhat hacky way of getting around the problem, but it's super slow.
Here's the code for my hacky version:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest, Request
# from scrapy.shell import inspect_response
from VATs.items import VatsItem


class MaharashtraSpider(scrapy.Spider):
    name = "maharashtra"
    allowed_domains = ["mahavat.gov.in"]
    start_urls = (
        'http://mahavat.gov.in/',
    )

    def __init__(self, **kwargs):
        super(MaharashtraSpider, self).__init__(**kwargs)
        self.base_no = 27020000034 - 97

    def parse(self, response):
        yield Request('http://mahavat.gov.in/Tin_Search/Tinsearch.jsp',
                      callback=self.parse_form, dont_filter=True)

    def parse_form(self, response):
        self.base_no += 97
        yield FormRequest.from_response(
            response,
            formname='f1',
            formdata={
                "tin": "%sC" % self.base_no,
                "pan": "",
                "rc_no": "",
                "fptecno": "",
                "bptecno": "",
                "DEALERNAME": "",
                "Submit": "SEARCH"
            },
            callback=self.result_page)

    def result_page(self, response):
        try:
            url = response.xpath(
                '//a[@class="search-head"]/@href').extract()[0]
            url = response.urljoin(url)
            yield Request(url, callback=self.process)
        except IndexError:
            yield Request('http://mahavat.gov.in/', callback=self.parse, dont_filter=True)

    def process(self, response):
        x = response.xpath("//td/text()").extract()
        x = [x[i].strip() for i in range(1, len(x), 2)]
        if x[0] != '':
            item = VatsItem()
            item["Dealer_Name"] = x[0]
            item["Tin_Number"] = x[1]
            item["Effective_Canceled_Date"] = x[2]
            item["Address1"] = x[3]
            item["Street_Name"] = x[4]
            item["Address2"] = x[5]
            item["Address3"] = x[6]
            item["Taluka_Name"] = x[7]
            item["District_Name"] = x[8]
            item["City_Name"] = x[9]
            item["State_Name"] = x[10]
            item["Pin_Code"] = x[11]
            item["Old_RC_No"] = x[12]
            item["Location_Name"] = x[13]
            item["Act_Name"] = x[14]
            yield item
        yield Request('http://mahavat.gov.in/', callback=self.parse, dont_filter=True)
        # inspect_response(response, self)
This one does work, but it is really slow!
Could someone help me out and tell me why the first script doesn't work?
And also, is there a way to make the second script faster?
Answer
The issue you're encountering seems to be related to session management. In your first script you fire off 100 form submissions with FormRequest.from_response(), but the session state shared between those requests is not handled correctly, which causes the inconsistent results.
Problem with the First Script:
- Session management: by default, Scrapy keeps a single cookie session for the whole spider, so all 100 form submissions share one server-side session. The site appears to store the search result in that session, so each new search overwrites the previous one, and most result pages are already empty by the time they are downloaded. That is also why the last search is found every time.
- Request timing: Scrapy downloads requests concurrently, so the order in which result pages are fetched does not match the order in which the forms were submitted, which makes the overwriting worse.
Fix for the First Script:
You can give each search its own session by setting the cookiejar key in the request's meta (handled by Scrapy's built-in CookiesMiddleware) and passing the same cookiejar along to the follow-up request for the result page. Additionally, ensure that dont_filter=True is used where a URL might otherwise be dropped by Scrapy's duplicate filter.
Here's how you can tweak the first script:
import scrapy
from scrapy.http import FormRequest, Request


class MaharashtraSpider(scrapy.Spider):
    name = "maharashtra2"
    allowed_domains = ["mahavat.gov.in"]
    start_urls = ('http://mahavat.gov.in/',)

    def parse(self, response):
        return Request('http://mahavat.gov.in/Tin_Search/Tinsearch.jsp',
                       callback=self.parse_form, dont_filter=True)

    def parse_form(self, response):
        base_no = 27020000034
        no = base_no
        for i in range(100):
            yield FormRequest.from_response(
                response,
                formname='f1',
                formdata={
                    "tin": "%sC" % no,
                    "Submit": "SEARCH"
                },
                callback=self.result_page,
                dont_filter=True,        # Ensures requests are not filtered as duplicates
                meta={'cookiejar': i}    # Give every search its own cookie session
            )
            no += 97  # Adjust the increment as needed

    def result_page(self, response):
        try:
            url = response.xpath('//a[@class="search-head"]/@href').extract()[0]
        except IndexError:
            # No result link for this TIN; skip it instead of restarting the crawl
            self.logger.warning("No result found on %s", response.url)
            return
        url = response.urljoin(url)
        yield Request(url, callback=self.process,
                      meta={'cookiejar': response.meta['cookiejar']})  # Stay in the same session

    def process(self, response):
        x = response.xpath("//td/text()").extract()
        x = [x[i].strip() for i in range(1, len(x), 2)]
        if x[0] != '':
            yield {
                "Dealer_Name": x[0],
                "Tin_Number": x[1],
                "Effective_Canceled_Date": x[2],
                "Address1": x[3],
                "Street_Name": x[4],
                "Address2": x[5],
                "Address3": x[6],
                "Taluka_Name": x[7],
                "District_Name": x[8],
                "City_Name": x[9],
                "State_Name": x[10],
                "Pin_Code": x[11],
                "Old_RC_No": x[12],
                "Location_Name": x[13],
                "Act_Name": x[14]
            }
Key Changes:
- Cookie sessions: meta={'cookiejar': i} gives each form submission its own cookie session, and passing the same cookiejar to the follow-up request keeps the result page in that session.
- dont_filter=True: ensures Scrapy doesn't filter out URLs that are repeated in the crawl.
- Simplified FormRequest: the unused fields were removed from the form submission (from_response() fills the remaining inputs from the page); adjust the form data as needed.
Speeding up the Second Script:
For the second script, where you have a "hacky" version that works but is slow, you can improve the speed in several ways:
- Parallel requests: Scrapy supports concurrency out of the box. You can increase the number of concurrent requests to speed up the crawl. In your settings.py, adjust the following settings:

  CONCURRENT_REQUESTS = 32               # increase this as needed
  DOWNLOAD_DELAY = 0.5                   # reduce the delay between requests if necessary
  CONCURRENT_REQUESTS_PER_DOMAIN = 16    # adjust this as needed
- Optimize item processing: instead of printing or post-processing results in the process method, yield the items directly and let Scrapy's pipelines handle them. Also avoid re-crawling the starting page http://mahavat.gov.in/ after every iteration unless it is necessary.
- Avoid repeated form submissions: in the second script you resubmit the form once per pass, stepping the TIN by a fixed increment (base_no += 97). It is more efficient to generate all the TINs up front and submit the searches concurrently, each in its own session, rather than chaining one search per round trip; see the sketch after this list.
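For illustration, here is a minimal sketch of the second spider reworked along those lines, combining it with the cookiejar fix from above: all searches are issued up front from the search page, each in its own cookie session, TINs with no result are simply skipped, and items are yielded directly instead of bouncing back to the homepage. The spider name here is made up, the TIN range is an assumption, and VatsItem and the field names come from your original script:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.http import FormRequest, Request
from VATs.items import VatsItem


class MaharashtraFastSpider(scrapy.Spider):
    name = "maharashtra_fast"                # hypothetical name for the reworked spider
    allowed_domains = ["mahavat.gov.in"]
    start_urls = ('http://mahavat.gov.in/Tin_Search/Tinsearch.jsp',)

    def parse(self, response):
        base_no = 27020000034
        for i in range(100):                 # assumed range; adjust to the TINs you need
            yield FormRequest.from_response(
                response,
                formname='f1',
                formdata={"tin": "%sC" % (base_no + i * 97), "Submit": "SEARCH"},
                callback=self.result_page,
                dont_filter=True,
                meta={'cookiejar': i})       # one cookie session per search

    def result_page(self, response):
        links = response.xpath('//a[@class="search-head"]/@href').extract()
        if links:                            # skip TINs with no result instead of restarting
            yield Request(response.urljoin(links[0]), callback=self.process,
                          meta={'cookiejar': response.meta['cookiejar']})

    def process(self, response):
        cells = [c.strip() for c in response.xpath("//td/text()").extract()][1::2]
        if cells and cells[0]:
            item = VatsItem()
            item["Dealer_Name"] = cells[0]
            item["Tin_Number"] = cells[1]
            # ... fill the remaining fields exactly as in the original process() ...
            yield item                       # no extra request back to the homepage

With this structure the crawl speed is limited mainly by the concurrency settings rather than by the one-search-per-round-trip chain.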
Here are a couple of further suggestions for optimizing it:
- Use the built-in scrapy.downloadermiddlewares.retry middleware to retry failed requests instead of handling retries manually (see the settings sketch below).
- Avoid yielding a Request for the same URL repeatedly, which is slow and unnecessary.
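The retry middleware is enabled by default, so usually only its settings need adjusting. A minimal settings.py sketch (the values here are just suggestions):

# settings.py -- RetryMiddleware is on by default
RETRY_ENABLED = True
RETRY_TIMES = 3                                  # retry each failed request up to 3 times
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]     # server errors and timeouts worth retrying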
Let me know if you need further clarification or more detailed improvements on any part of the script!