Web Scraping Table from 'Dune.com' with Python3 and bs4

ghz · 11 hours ago

I am trying to web scrape table data from Dune.com (https://dune.com/queries/1144723). When I inspect the page, I can clearly see the <table></table> element, but when I run the following code, soup.find('table') returns None.

import bs4
import requests

data = []

r=requests.get('https://dune.com/queries/1144723/1954237')
soup=bs4.BeautifulSoup(r.text, "html5lib")

table = soup.find('table')

How can I successfully find this table data?

Answer

The issue you're facing is that the table data you're trying to scrape is dynamically loaded via JavaScript after the page is initially rendered. This means that when you make a request with requests, you are getting the raw HTML that does not include the table data, because the JavaScript responsible for rendering that table hasn't been executed yet.
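You can see this for yourself without a browser: the HTML that requests receives is essentially an empty JavaScript app shell, so BeautifulSoup has no `<table>` to find. A minimal illustration (the HTML string below is a stand-in for `r.text`, not Dune's actual markup):

```python
from bs4 import BeautifulSoup

# Stand-in for r.text: a JS app shell with no server-rendered <table>
raw_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

soup = BeautifulSoup(raw_html, "html.parser")
print(soup.find('table'))  # -> None: the table only exists after the JavaScript runs
```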

Solution: Use a Headless Browser with Selenium

To handle this, you'll need to use a tool that can execute JavaScript, like Selenium or Playwright. Selenium allows you to load the page, execute JavaScript, and then extract the dynamically rendered content.

Here's how you can use Selenium along with BeautifulSoup to scrape the table data from the Dune.com page:

1. Install Required Libraries:

First, install the necessary libraries:

pip install selenium beautifulsoup4 requests

Additionally, Selenium needs a browser driver (e.g., ChromeDriver for Chrome or geckodriver for Firefox). If you are on Selenium 4.6 or newer, Selenium Manager downloads a matching driver automatically and you can skip this step. Otherwise:

  • Download ChromeDriver: https://chromedriver.chromium.org/downloads
  • Download geckodriver (for Firefox): https://github.com/mozilla/geckodriver/releases

Make sure the WebDriver binary is accessible in your system's PATH, or specify the path to the driver explicitly in your code.

2. Code Example with Selenium and BeautifulSoup:

Here’s how you can modify your code to scrape the table data:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Set up headless browser using Chrome
chrome_options = Options()
chrome_options.add_argument("--headless=new")  # Run in headless mode (no browser window)

# Selenium 4 passes the driver path via a Service object. With Selenium >= 4.6
# you can simply call webdriver.Chrome(options=chrome_options) and let
# Selenium Manager locate a matching driver automatically.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)

# Visit the Dune query page
url = 'https://dune.com/queries/1144723/1954237'
driver.get(url)

# Wait for the JavaScript to load the table (adjust sleep time if needed)
time.sleep(5)  # You may need to adjust this depending on how long the page takes to load

# Get page source after JavaScript has rendered the table
html = driver.page_source

# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(html, 'html.parser')

# Find the table element
table = soup.find('table')

# If table is found, extract the rows and columns
if table:
    rows = table.find_all('tr')
    data = []

    for row in rows:
        cells = row.find_all(['th', 'td'])  # include header cells as well as data cells
        cells = [cell.get_text(strip=True) for cell in cells]  # cell text without surrounding whitespace
        data.append(cells)

    # Print the data or process it further
    for row in data:
        print(row)
else:
    print("Table not found!")

# Quit the browser session
driver.quit()

Explanation:

  1. Headless Browser with Selenium:

    • We set up a headless browser (which means it runs without opening a GUI window) using Chrome and Selenium.
    • We point Selenium at the ChromeDriver binary ('/path/to/chromedriver'); if the driver is on your PATH, or you are on Selenium 4.6+ with Selenium Manager, you can omit the path entirely.
  2. Loading the Page:

    • driver.get(url) loads the Dune.com page.
    • We use time.sleep(5) to wait for the JavaScript to render the table. You can adjust this delay depending on how fast the table loads.
  3. Scraping with BeautifulSoup:

    • After the page has been loaded and JavaScript has rendered the table, we use driver.page_source to get the HTML source.
    • We then use BeautifulSoup to parse the HTML and extract the data from the table.
  4. Extracting Table Data:

    • We find all the rows in the table with table.find_all('tr') and then extract the text from each cell in each row.
    • Finally, the data is stored in a list data, and we print it row by row.
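As a side note, once you have the rendered HTML, pandas can parse every `<table>` in the document in one call, which saves the manual row loop. A sketch, assuming pandas and a parser backend such as lxml are installed (the HTML string is a stand-in for `driver.page_source`):

```python
from io import StringIO
import pandas as pd

# Stand-in for driver.page_source after the table has rendered
html = """<table>
  <tr><th>wallet</th><th>amount</th></tr>
  <tr><td>0xabc</td><td>12.5</td></tr>
</table>"""

# read_html returns one DataFrame per <table> found; the <th> row becomes the header
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df.columns.tolist())  # -> ['wallet', 'amount']
```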

Notes:

  • You may need to adjust the time.sleep(5) delay if the table takes longer to load.
  • If you're using Firefox, you can replace webdriver.Chrome with webdriver.Firefox and similarly use geckodriver instead of chromedriver.
  • The table might have pagination or infinite scrolling, which could require handling JavaScript events like clicks or scrolling. If that’s the case, you might need to use more advanced Selenium features like WebDriverWait or interact with the page elements before scraping.

By using Selenium, you can ensure that all dynamic content, including JavaScript-rendered tables, is available for scraping.