I am trying to follow the tutorial here to scrape data from Remax.com. At the moment I am just interested in getting the sqft of a particular home. However, I get this error:
Error during requests to https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html : HTTPSConnectionPool(host='www.remax.com', port=443): Max retries exceeded with url: /realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-28b8e2248942> in <module>()
1 raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
----> 2 html = BeautifulSoup(raw_html, 'html.parser')
3 for i, li in enumerate(html.select('li')):
4 print(i, li.text)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
190 if hasattr(markup, 'read'): # It's a file-type object.
191 markup = markup.read()
--> 192 elif len(markup) <= 256 and (
193 (isinstance(markup, bytes) and not b'<' in markup)
194 or (isinstance(markup, str) and not '<' in markup)
TypeError: object of type 'NoneType' has no len()
Here is my entire code thus far:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200
            and content_type is not None
            and content_type.find('html') > -1)

def log_error(e):
    # Simple error logger from the tutorial; prints the error to stdout.
    print(e)

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
I am pretty new to web scraping, so I am not sure how to fix this. Any suggestions would be greatly appreciated.
Answer
The error you are encountering is due to an SSL certificate verification issue when making an HTTPS request to the URL. The error message:

SSLError: bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)

happens when requests is unable to verify the SSL certificate of the website (in this case, https://www.remax.com). This can happen for several reasons, such as issues with the server's SSL certificate or problems with your local environment. The TypeError further down is only a downstream symptom: simple_get() catches the exception and returns None, and passing None to BeautifulSoup is what raises object of type 'NoneType' has no len().
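Separately from the SSL problem, it's worth guarding against None before handing the result to BeautifulSoup, so a failed download produces a readable message instead of a traceback. A minimal sketch using your existing simple_get():

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')

if raw_html is None:
    # simple_get() returns None on any request failure, so bail out here
    # with a clear message rather than crashing inside BeautifulSoup.
    raise SystemExit('Failed to download the page; see the logged error above.')

html = BeautifulSoup(raw_html, 'html.parser')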
Here are some possible solutions to fix the problem:
1. Disable SSL Verification (Not Recommended for Production)
One quick way to bypass SSL verification issues is to disable it using the verify=False argument in the get() function. This is generally not recommended for production environments because it can make your requests vulnerable to security risks (like man-in-the-middle attacks). However, it's useful for testing purposes:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

# is_good_response() and log_error() are the same helpers as in your
# original code; they are omitted here for brevity.

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True, verify=False)) as resp:  # Disable SSL verification here
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
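One caveat: with verify=False, requests emits an InsecureRequestWarning on every call. While testing, you can silence it through urllib3 (the standard suppression idiom; don't carry this into production code):

import urllib3

# Suppress the InsecureRequestWarning that requests emits whenever
# certificate verification is disabled with verify=False.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)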
2. Ensure Your Environment's SSL Certificates Are Up to Date
Sometimes SSL verification issues happen because your local machine's SSL certificates are outdated. You can try updating the certificates in your environment (a sketch for using certifi's bundle directly follows this list):

- For macOS: run the following command to update your certificates (replace 3.x with your Python version):
  /Applications/Python\ 3.x/Install\ Certificates.command
- For Windows: ensure that you have the latest version of certifi, the package that requests uses for SSL certificates. You can update certifi with:
  pip install --upgrade certifi
- For Linux: update the ca-certificates package:
  sudo apt-get update
  sudo apt-get install --reinstall ca-certificates
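After upgrading certifi, you can also point requests explicitly at certifi's CA bundle instead of disabling verification entirely; a minimal sketch:

import certifi
from requests import get

# Verify against certifi's freshly upgraded CA bundle rather than a
# possibly stale system certificate store.
resp = get('https://www.remax.com', verify=certifi.where())
print(resp.status_code)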
3. Use a Session Object for Better Handling of Connections
Another method is to use a requests.Session object, which can persist settings like headers or authentication across multiple requests and reuses the underlying TCP connection between them. While this doesn't directly address SSL errors, it makes repeated requests to the same host more efficient.
from requests import Session
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    session = Session()
    try:
        with closing(session.get(url, stream=True, verify=False)) as resp:  # Disable SSL verification here
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
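A Session also lets you attach a retry policy via requests' HTTPAdapter and urllib3's Retry, which is worth knowing given the "Max retries exceeded" wording in your error (that message appears whenever requests exhausts its retries, even when the underlying problem is the SSL handshake). The counts and backoff below are illustrative choices, not canonical values:

from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = Session()

# Retry transient failures up to 3 times with exponential backoff
# (0.5s, 1s, 2s), also retrying on common transient server errors.
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retry))
session.mount('http://', HTTPAdapter(max_retries=retry))

For the connection reuse to pay off, create the session once and pass it into simple_get() rather than constructing a new one on every call.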
4. Check if the Website Blocks Web Scraping
Some websites employ measures to block automated scraping attempts, including blocking certain user agents, checking for bot-like behavior, or using CAPTCHAs. To work around this, you can modify the headers of your request to mimic a real browser (which can sometimes bypass basic blocks).
Here’s how you can set a custom User-Agent header:
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    try:
        with closing(get(url, headers=headers, stream=True, verify=False)) as resp:  # Add user-agent header
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
5. Check for HTTPS Support
Ensure that you are able to make HTTPS requests at all. Some websites require secure connections and may reject plain HTTP requests or connections that use outdated versions of SSL/TLS.
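Outdated TLS stacks are a common cause of handshake failures; you can check which OpenSSL build your Python links against (TLS 1.2 support arrived in OpenSSL 1.0.1, and many sites now require at least that):

import ssl

# Print the OpenSSL version Python was compiled against; handshake
# errors are common when this predates 1.0.1 (i.e., no TLS 1.2).
print(ssl.OPENSSL_VERSION)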
If the error persists, consider checking that you are using the latest versions of requests, beautifulsoup4, and certifi. You can update them using:
pip install --upgrade requests beautifulsoup4 certifi
Final Notes
- SSL verification: If you're disabling SSL verification, keep in mind that you might be vulnerable to MITM attacks. For anything involving sensitive data, fix the underlying certificate problem rather than bypassing verification.
- Scraping Restrictions: Some websites block scraping, so ensure you're not violating the website's terms of service. It's always a good idea to check the site's robots.txt file to see if scraping is allowed (you can automate this check, as sketched below).
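Python's standard library can automate the robots.txt check via urllib.robotparser; a minimal sketch:

from urllib.robotparser import RobotFileParser

# Fetch and parse the site's robots.txt, then ask whether a generic
# crawler ('*') is allowed to fetch the listing URL we want to scrape.
rp = RobotFileParser()
rp.set_url('https://www.remax.com/robots.txt')
rp.read()
url = 'https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html'
print(rp.can_fetch('*', url))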
Let me know if you need further assistance!