Web-scraping Remax.com in Python


I am trying to follow the tutorial here to scrape data from Remax.com. At the moment I am just interested in getting the square footage (sqft) of a particular home. However, I get this error:

Error during requests to https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html : HTTPSConnectionPool(host='www.remax.com', port=443): Max retries exceeded with url: /realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-28b8e2248942> in <module>()
      1 raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
----> 2 html = BeautifulSoup(raw_html, 'html.parser')
      3 for i, li in enumerate(html.select('li')):
      4         print(i, li.text)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, **kwargs)
    190         if hasattr(markup, 'read'):        # It's a file-type object.
    191             markup = markup.read()
--> 192         elif len(markup) <= 256 and (
    193                 (isinstance(markup, bytes) and not b'<' in markup)
    194                 or (isinstance(markup, str) and not '<' in markup)

TypeError: object of type 'NoneType' has no len()

Here is my entire code thus far:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers['Content-Type'].lower()
    return (resp.status_code == 200 
            and content_type is not None 
            and content_type.find('html') > -1)

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)

I am pretty new to web scraping, so I am not sure how to fix this. Any suggestions would be greatly appreciated.

Answer

The error you are encountering is due to an SSL certificate verification issue when making an HTTPS request to the URL. The error message:

SSLError: bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)

This appears when requests cannot verify the SSL certificate of the website (here, https://www.remax.com), whether because of a problem with the server's certificate chain or because of an outdated certificate store in your local environment. The TypeError from BeautifulSoup is only a downstream symptom: simple_get caught the SSLError and returned None, and BeautifulSoup cannot parse None.
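
Whichever of the fixes below you apply, it is also worth guarding against the None return from simple_get, so a failed request fails loudly instead of surfacing as this confusing TypeError:

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
if raw_html is None:
    # simple_get already logged the underlying request error
    raise SystemExit('Request failed; see the logged error above.')
html = BeautifulSoup(raw_html, 'html.parser')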

Here are some possible solutions to fix the problem:

1. Disable SSL Verification (Not Recommended for Production)

One quick way to bypass SSL verification issues is to disable it using the verify=False argument in the get() function. This is generally not recommended for production environments because it can make your requests vulnerable to security risks (like man-in-the-middle attacks). However, it's useful for testing purposes:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True, verify=False)) as resp:  # Disable SSL verification here
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

def log_error(e):
    # Minimal logger so this snippet runs on its own; swap in the
    # logging module if you prefer
    print(e)

# is_good_response() is unchanged from your original script

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
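
One side effect of verify=False is that requests emits an InsecureRequestWarning on every call. While testing, you can silence it via urllib3 (the library requests uses under the hood):

import urllib3

# Suppress the warning requests raises when verify=False is used
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)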

2. Ensure Your Environment's SSL Certificates Are Up to Date

Sometimes SSL verification fails because your local machine's certificate store is outdated. You can try updating the certificates in your environment (a quick way to confirm which CA bundle is in use is sketched after this list):

  • For macOS: Run the following command to update your certificates:

    /Applications/Python\ 3.x/Install\ Certificates.command
    
  • For Windows: Ensure that you have the latest version of certifi, the package that requests uses for SSL certificates. You can update certifi with:

    pip install --upgrade certifi
    
  • For Linux: Update the ca-certificates package:

    sudo apt-get update
    sudo apt-get install --reinstall ca-certificates
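
After updating, you can confirm which CA bundle certifi ships and pass it to requests explicitly instead of disabling verification. A minimal check, assuming certifi is installed:

import certifi
from requests import get

print(certifi.where())  # path to the CA bundle certifi provides

# Verify against that bundle explicitly
resp = get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html',
           verify=certifi.where())
print(resp.status_code)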
    

3. Use a Session Object for Better Connection Handling

Another option is a requests.Session object, which persists settings such as headers, cookies, and authentication across multiple requests. A Session does not by itself fix SSL errors, but it gives you one place to configure connection behavior, including a retry policy for the "Max retries exceeded" part of your error (see the sketch after this snippet).

from requests import Session
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    session = Session()
    try:
        with closing(session.get(url, stream=True, verify=False)) as resp:  # Disable SSL verification here
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

# is_good_response() and log_error() are the same helpers as above

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)
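
Here is a minimal sketch of attaching a retry policy to a Session using urllib3's Retry helper; the retry counts and status codes are illustrative, not recommendations:

from requests import Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = Session()
# Retry transient failures a few times with exponential backoff
retry = Retry(total=3, backoff_factor=1,
              status_forcelist=[429, 500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retry))

resp = session.get('https://www.remax.com/', verify=False)
print(resp.status_code)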

4. Check if the Website Blocks Web Scraping

Some websites employ measures to block automated scraping attempts, including blocking certain user agents, checking for bot-like behavior, or using CAPTCHAs. To work around this, you can modify the headers of your request to mimic a real browser (which can sometimes bypass basic blocks).

Here’s how you can set a custom User-Agent header:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    try:
        with closing(get(url, headers=headers, stream=True, verify=False)) as resp:  # Add user-agent header
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

# is_good_response() and log_error() are the same helpers as above

raw_html = simple_get('https://www.remax.com/realestatehomesforsale/25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
html = BeautifulSoup(raw_html, 'html.parser')
for i, li in enumerate(html.select('li')):
    print(i, li.text)

5. Check for HTTPS Support

Make sure your environment can negotiate a modern TLS connection. Some sites reject clients that only speak outdated SSL/TLS versions, which can happen with older Python or OpenSSL builds. You can inspect what your interpreter is linked against and what gets negotiated, as sketched below.
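
A quick diagnostic using only the standard library (the host and timeout here are just examples):

import socket
import ssl

print(ssl.OPENSSL_VERSION)  # the OpenSSL build Python is linked against

# Perform a TLS handshake by hand and report what was negotiated;
# a certificate error raised here points at your local CA store
ctx = ssl.create_default_context()
with socket.create_connection(('www.remax.com', 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname='www.remax.com') as tls:
        print(tls.version())  # e.g. 'TLSv1.2' or 'TLSv1.3'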

If the error persists, make sure you are running the latest versions of requests, beautifulsoup4, and certifi. You can update them with:

pip install --upgrade requests beautifulsoup4 certifi

Final Notes

  • SSL verification: If you disable SSL verification, keep in mind that you are exposed to man-in-the-middle attacks. Treat verify=False as a debugging aid and solve the certificate issue properly for anything beyond testing.
  • Scraping Restrictions: Some websites block scraping, so make sure you are not violating the site's terms of service. It's always a good idea to check the site's robots.txt file to see what is allowed (a programmatic check is sketched below).
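
The robots.txt check can be automated with the standard library's urllib.robotparser; a minimal sketch:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.remax.com/robots.txt')
rp.read()

url = ('https://www.remax.com/realestatehomesforsale/'
       '25-montage-way-laguna-beach-ca-92651-gid100012499996.html')
print(rp.can_fetch('*', url))  # True only if the rules allow this fetch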

Let me know if you need further assistance!