Python web scraping using Beautiful Soup is not working


I want to scrape all the product categories. Each product category has this container HTML:

  <div class="TempoCategoryTileV2-tile"><img alt="" aria-hidden="true" tabindex="-1" itemprop="image" src="//i5.walmartimages.com/dfw/4ff9c6c9-deda/k2-_c3162a27-dbb6-46df-8b9f-b5b52ea657b2.v1.jpg?odnWidth=168&amp;odnHeight=210&amp;odnBg=ffffff" class="TempoCategoryTileV2-tile-img display-block">
<div class="TempoCategoryTileV2-tile-content-one text-center">
    <div class="TempoCategoryTileV2-tile-linkText">
        <div style="overflow: hidden;">
            <div>Toyland</div>
        </div>
    </div>
</div><a class="TempoCategoryTileV2-tile-overlay" id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-tileLink-1" aria-label="Toyland" href="/cp/toys/4171?povid=14503+%257C+contentZone12+%257C+2017-11-01+%257C+1+%257C+HP+FC+Toys" data-uid="zir3SFhh" tabindex="" data-tl-id="HomePage-contentZone12-FeaturedCategoriesCuratedV2-categoryTile-1-link" style="background-image: url(&quot;about:blank&quot;);"></a></div>

What I want to get is the text and image of each category, so I used this Python script:

 import time
 import requests
 from bs4 import BeautifulSoup as soup

 Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'
 uClient = ''
 while uClient == '':
     try:
         start = time.time()
         uClient = requests.get(Walmarthome)

         print("Relax we are getting the data...")

     except requests.exceptions.RequestException:
         print("Connection refused by the server...")
         print("Let me sleep for 7 seconds")
         print("ZZzzzz...")
         time.sleep(7)
         print("Was a nice sleep, now let me continue...")
         continue

 page_html = uClient.content
 # close the connection once we have the body
 uClient.close()
 page_soup = soup(page_html, "html.parser")

 productcategories = page_soup.find_all("div", {"class": "TempoCategoryTileV2 Grid-col u-size-1-2 u-size-1-3-s u-size-1-4-m u-size-1-5-l u-size-1-6-xl"})
 print(productcategories)
 for categorycontainer in productcategories:
     categorycard = categorycontainer.find("div", {"class": "TempoCategoryTileV2-tile-linkText"})
     if categorycard is not None:
         print("getting link")
         print(categorycard)

But when I run it, all I get is this output:

 "Relax we are getting the data..." 
 []

For some reason it's not getting the content from the page. What am I doing wrong, and how can I fix this?

Answer

The issue most likely stems from how the page is loaded. Some websites, including Walmart's homepage, load their content dynamically with JavaScript after the initial HTML arrives. The static HTML you fetch with requests therefore doesn't contain the product categories at all: they are inserted into the page by JavaScript afterwards, which is why your find_all call returns an empty list.
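You can verify this before reaching for heavier tools. A minimal sketch, assuming only requests and the URL from your script: fetch the raw HTML and check whether the tile class ever appears in it.

import requests

Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'
raw_html = requests.get(Walmarthome).text

# If the tiles are injected by JavaScript, this prints False:
# the class name never occurs in the server-rendered HTML.
print("TempoCategoryTileV2-tile" in raw_html)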

Here’s how you can fix this:

Solutions:

  1. Use Selenium to Render JavaScript: Since the page is dynamic, you can use Selenium to drive a real browser, letting JavaScript render the page fully before you scrape it. This way, you get the complete content that is missing from the initial HTML fetched by requests.

  2. Use a Headless Browser: Running the browser headless, i.e. without a graphical interface, is faster and well suited to scripted scraping; Solution 1 below combines this with Selenium.

  3. Look for API Endpoints: Inspect the network traffic (using your browser's developer tools) to check whether the data you need is loaded from an API endpoint. If it is, you can call that API directly with requests and avoid rendering the page entirely.


Solution 1: Using Selenium with a Headless Browser

First, you need to install Selenium and ChromeDriver (or another browser driver).

  1. Install the necessary libraries:

pip install selenium

  2. Download ChromeDriver (if you're using Chrome):

    • Get the version matching your browser from the official ChromeDriver downloads page.

  3. Update your script to use Selenium:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import time

# Set up headless Chrome
chrome_options = Options()
chrome_options.add_argument("--headless")  # Ensure the browser window does not open
chrome_options.add_argument("--disable-gpu")  # Disables GPU hardware acceleration
chrome_options.add_argument("--no-sandbox")  # Bypass OS security model
chrome_options.add_argument("--window-size=1920x1080")  # Set window size

# Initialize the WebDriver (Selenium 4 syntax; executable_path was removed in Selenium 4,
# and with Selenium 4.6+ you can omit the Service path and let Selenium Manager find the driver)
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'), options=chrome_options)

# Walmart homepage URL
Walmarthome = 'https://www.walmart.com/?povid=14503+%7C+contentZone1+%7C+2017-10-27+%7C+1+%7C+header+logo'

# Fetch the page
driver.get(Walmarthome)

# Wait for some time to ensure JavaScript has finished loading content
time.sleep(5)

# Now, get the page source after JavaScript has rendered it
page_html = driver.page_source
driver.quit()  # Close the browser after scraping

# Parse the HTML using BeautifulSoup
page_soup = BeautifulSoup(page_html, "html.parser")

# Find all product categories
productcategories = page_soup.find_all("div", {"class": "TempoCategoryTileV2-tile"})

# Check if we found the categories
if productcategories:
    print("Found categories:")
    for categorycontainer in productcategories:
        categoryname = categorycontainer.find("div", {"class": "TempoCategoryTileV2-tile-linkText"})
        categoryimg = categorycontainer.find("img")
        if categoryname and categoryimg:
            print("Category Name:", categoryname.get_text(strip=True))
            print("Category Image URL:", categoryimg['src'])
else:
    print("No categories found.")

Key Changes:

  1. Selenium WebDriver: We are now using Selenium to open the page and render it fully before extracting the HTML. This ensures that all content loaded by JavaScript is included.
  2. driver.get(): This loads the page in the headless browser.
  3. driver.page_source: After JavaScript renders the page, we read the resulting HTML from this property.
  4. Waiting for content: We added time.sleep(5) to ensure JavaScript has time to execute and load the content. You can adjust the sleep time depending on how long it takes for the page to load.

Solution 2: Inspecting API Requests

You can also check the network activity in your browser’s developer tools (press F12 or Ctrl+Shift+I on most browsers, then go to the "Network" tab) to see if the categories are being fetched via an API request.

  • Look for requests that return JSON or XML data, which might contain the product categories.
  • Once you find the API endpoint, you can request it directly with requests.get() and extract the categories without needing Selenium, as sketched below.
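For illustration, here is a rough sketch of that approach. The endpoint URL, headers, and JSON field names below are hypothetical placeholders, not a real Walmart API; substitute whatever you actually observe in the Network tab.

import requests

# Hypothetical endpoint copied from the browser's Network tab (placeholder, not a real URL)
api_url = "https://www.walmart.com/some/endpoint/you/found"

# Many endpoints reject requests without browser-like headers, so reuse at least the User-Agent
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(api_url, headers=headers)
response.raise_for_status()  # Fail loudly on HTTP errors
data = response.json()

# The key names here are placeholders; adjust them to match the JSON you see
for category in data.get("categories", []):
    print(category.get("name"), category.get("imageUrl"))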

Notes:

  • Selenium requires a browser driver: For Chrome, you need to download ChromeDriver, but you can also use Firefox or other browsers if needed.
  • Headless Browsing: This runs the browser without opening a GUI, making it faster and suitable for server-side automation.
  • Page Load Time: If the page is very slow to load, increase the sleep time or, more reliably, use Selenium’s WebDriverWait, as in the sketch below.
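For reference, a minimal sketch of the WebDriverWait approach, reusing driver and Walmarthome from the script in Solution 1; it waits until at least one category tile is present instead of sleeping for a fixed five seconds.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(Walmarthome)

# Block for up to 15 seconds until at least one tile is in the DOM,
# then continue immediately instead of always sleeping a fixed interval
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, "TempoCategoryTileV2-tile"))
)
page_html = driver.page_source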

Let me know if you need more assistance!