I am trying to scrape table data from Dune.com (https://dune.com/queries/1144723). When I inspect the page, I can clearly see the `<table></table>` element, but when I run the following code, `soup.find('table')` returns None.
import bs4
import requests

data = []
r = requests.get('https://dune.com/queries/1144723/1954237')
soup = bs4.BeautifulSoup(r.text, "html5lib")
table = soup.find('table')
How can I successfully find this table data?
Answer
The issue is that the table you're trying to scrape is loaded dynamically by JavaScript after the initial HTML arrives. When you fetch the page with `requests`, you get only that raw HTML, which does not include the table data, because the JavaScript responsible for rendering the table never runs.
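You can see this effect offline with a minimal sketch. The HTML below is an invented stand-in for what a JavaScript-heavy page typically serves before any scripts run: an empty root `div` and a script tag, with no `<table>` markup at all.

```python
import bs4

# Invented stand-in for the pre-render HTML a JS-heavy site returns:
# an empty app container plus a script tag, and no <table> anywhere.
shell_html = """
<html>
  <head><title>Query results</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
"""

soup = bs4.BeautifulSoup(shell_html, "html.parser")
print(soup.find('table'))  # -> None: the table only exists after JavaScript runs
```

This is exactly what `requests` hands you, which is why parsing it finds no table.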
Solution: Use a Headless Browser with Selenium
To handle this, you'll need to use a tool that can execute JavaScript, like Selenium or Playwright. Selenium allows you to load the page, execute JavaScript, and then extract the dynamically rendered content.
Here's how you can use Selenium along with BeautifulSoup to scrape the table data from the Dune.com page:
1. Install Required Libraries:
First, install the necessary libraries:
pip install selenium beautifulsoup4 requests
Additionally, you will need to download a WebDriver (e.g., ChromeDriver for Chrome or geckodriver for Firefox).
- Download ChromeDriver: https://sites.google.com/a/chromium.org/chromedriver/
- Download geckodriver (for Firefox): https://github.com/mozilla/geckodriver/releases
Make sure the WebDriver binary is accessible in your system's PATH, or specify the path to the driver explicitly in your code.
2. Code Example with Selenium and BeautifulSoup:
Here’s how you can modify your code to scrape the table data:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Set up a headless Chrome browser (runs without opening a browser window)
chrome_options = Options()
chrome_options.add_argument("--headless")

# In Selenium 4, the driver path is passed via a Service object;
# omit it entirely if chromedriver is on your PATH
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)
# Visit the Dune query page
url = 'https://dune.com/queries/1144723/1954237'
driver.get(url)
# Wait for the JavaScript to load the table (adjust sleep time if needed)
time.sleep(5) # You may need to adjust this depending on how long the page takes to load
# Get page source after JavaScript has rendered the table
html = driver.page_source
# Use BeautifulSoup to parse the page source
soup = BeautifulSoup(html, 'html.parser')
# Find the table element
table = soup.find('table')
# If table is found, extract the rows and columns
if table:
    rows = table.find_all('tr')
    data = []
    for row in rows:
        # include 'th' as well as 'td' so the header row is captured too
        cols = row.find_all(['th', 'td'])
        cols = [col.text.strip() for col in cols]  # cell text, trimmed
        data.append(cols)
    # Print the data or process it further
    for row in data:
        print(row)
else:
    print("Table not found!")

# Quit the browser session
driver.quit()
Explanation:
- Headless browser with Selenium: Chrome runs in headless mode, meaning no GUI window is opened. We point Selenium at the ChromeDriver binary; if chromedriver is already on your PATH, you can omit the path.
- Loading the page: `driver.get(url)` loads the Dune.com page, and `time.sleep(5)` gives the JavaScript time to render the table. Adjust this delay depending on how fast the table loads.
- Scraping with BeautifulSoup: once the table has been rendered, `driver.page_source` returns the full HTML, which BeautifulSoup can parse just like a `requests` response.
- Extracting table data: `table.find_all('tr')` returns the rows; we extract the text from each cell, collect it in the `data` list, and print it row by row.
Notes:
- You may need to increase the `time.sleep(5)` delay if the table takes longer to load.
- If you're using Firefox, replace `webdriver.Chrome` with `webdriver.Firefox` and use geckodriver instead of chromedriver.
- The table might have pagination or infinite scrolling, which could require handling JavaScript events like clicks or scrolling. In that case, you may need more advanced Selenium features like `WebDriverWait`, or to interact with page elements before scraping.
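Independently of Selenium, the extraction step itself can be verified offline against a static snippet. The sketch below uses an invented sample table (the column names are made up for illustration) and also grabs `<th>` cells so the header row comes through as text rather than an empty list:

```python
from bs4 import BeautifulSoup

# Invented sample: a small table in the shape a rendered results page might have.
sample = """
<table>
  <tr><th>token</th><th>volume</th></tr>
  <tr><td>ETH</td><td>120</td></tr>
  <tr><td>USDC</td><td>95</td></tr>
</table>
"""

def extract_rows(html):
    """Return a list of rows, each a list of stripped cell strings."""
    table = BeautifulSoup(html, "html.parser").find('table')
    if table is None:
        return []
    # include 'th' so header rows come out as text instead of empty lists
    return [[cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
            for row in table.find_all('tr')]

print(extract_rows(sample))
# [['token', 'volume'], ['ETH', '120'], ['USDC', '95']]
```

The same `extract_rows` helper can be pointed at `driver.page_source` once Selenium has rendered the real page.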
By using Selenium, you can ensure that all dynamic content, including JavaScript-rendered tables, is available for scraping.