How to scrape more than one page of critic reviews from Rotten Tomatoes?
To scrape more than one page of critic reviews from Rotten Tomatoes, you'll typically need to paginate through the website. However, it's important to note that scraping websites like Rotten Tomatoes may violate their terms of service, so be sure to review their terms before proceeding.
Assuming you have permission to scrape or are working with publicly available data via an API (if available), here's an example approach using requests and BeautifulSoup in Python to scrape multiple pages of critic reviews:
Steps:

1. **Find the URL structure:** Rotten Tomatoes might use pagination for reviews, so the URLs might look like:
   - https://www.rottentomatoes.com/m/your_movie/reviews?min=0
   - https://www.rottentomatoes.com/m/your_movie/reviews?min=20

   The `min` parameter might control the starting offset for reviews (0, 20, 40, etc.).
2. **Send requests to multiple pages:** By changing the `min` parameter, you can scrape additional pages.
3. **Extract the data:** Use `BeautifulSoup` to parse and extract the review data from the HTML.
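The first two steps can be sketched as a small helper that builds the paginated URLs. This assumes 20 reviews per page and a `min` offset parameter; verify both against the live site before relying on them:

```python
def build_review_urls(movie_slug, num_pages):
    """Generate paginated review URLs.

    Assumes 20 reviews per page and a `min` offset query parameter --
    inspect the actual site to confirm the pagination scheme.
    """
    base = f"https://www.rottentomatoes.com/m/{movie_slug}/reviews"
    return [f"{base}?min={page * 20}" for page in range(num_pages)]

urls = build_review_urls("your_movie", 3)
print(urls[0])  # https://www.rottentomatoes.com/m/your_movie/reviews?min=0
print(urls[1])  # https://www.rottentomatoes.com/m/your_movie/reviews?min=20
```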
Code Example
```python
import requests
from bs4 import BeautifulSoup

def get_reviews(movie_slug, page=1):
    # Each page shows 20 reviews; `min` is the starting offset.
    url = f'https://www.rottentomatoes.com/m/{movie_slug}/reviews?min={(page - 1) * 20}'

    # Send a request to the page
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve page {page}")
        return []

    soup = BeautifulSoup(response.text, 'html.parser')
    reviews = []

    # Find all review containers (the class names below are examples and
    # will need to be adjusted based on the actual HTML structure)
    review_containers = soup.find_all('div', class_='sc-16ede01-2')

    for container in review_containers:
        # Extract individual review details (this part will vary)
        name_tag = container.find('span', class_='sc-16ede01-6')
        text_tag = container.find('span', class_='sc-16ede01-8')
        reviews.append({
            'critic_name': name_tag.get_text(strip=True) if name_tag else 'Unknown',
            'review_text': text_tag.get_text(strip=True) if text_tag else 'No review text',
        })

    return reviews

def scrape_multiple_pages(movie_slug, num_pages=5):
    all_reviews = []
    for page in range(1, num_pages + 1):
        print(f"Scraping page {page}...")
        all_reviews.extend(get_reviews(movie_slug, page))
    return all_reviews

# Example usage
movie_slug = 'the_batman_2022'  # Replace with the actual movie slug
reviews = scrape_multiple_pages(movie_slug, num_pages=3)

# Print out all reviews
for review in reviews:
    print(f"Critic: {review['critic_name']}\nReview: {review['review_text']}\n")
```
Explanation:

- **URL structure:** The URL in `get_reviews()` includes a query parameter `min` that sets the review offset: `min=0` for the first page (reviews 1-20), `min=20` for the second page (reviews 21-40), and so on. The `min` value increases by 20 for each subsequent page.
- **Request and parsing:** `requests.get()` fetches the HTML content of the page. The `BeautifulSoup` object then parses this HTML, and the review text and critic names are extracted using the appropriate CSS classes. You will need to inspect the Rotten Tomatoes HTML structure to find the correct class names for the review content.
- **Scraping multiple pages:** `scrape_multiple_pages()` iterates through the pages by incrementing the `page` argument and calling `get_reviews()` for each one.
- **Adjusting for Rotten Tomatoes HTML:** The `find_all()` and `find()` calls extract the data based on the HTML structure. You will need to inspect the Rotten Tomatoes page source to identify the correct HTML tags and classes.
Handling Errors and Rate Limiting:

- **Error handling:** You should handle situations where pages fail to load or the structure changes. The code includes basic checks for the HTTP status and missing elements.
- **Rate limiting:** If you're scraping multiple pages, make sure not to overwhelm the server. Consider adding a delay (`time.sleep()`) between requests to prevent getting blocked.
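A common pattern is a fixed polite delay between pages plus exponential backoff when a request fails. A minimal sketch of the backoff schedule (the helper name and default values are illustrative, not from any library):

```python
def backoff_delays(retries, base=1.0, cap=30.0):
    """Exponential backoff schedule in seconds, capped so a run of
    failures never produces an unreasonably long wait."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]

print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]

# In the scraping loop, pair the backoff with a fixed pause between pages:
#   import time
#   for page in range(1, num_pages + 1):
#       reviews = get_reviews(movie_slug, page)
#       time.sleep(2)  # pause between pages so you don't hammer the server
```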
Alternative: Using an API (if available)
If Rotten Tomatoes provides an API (either publicly or via a partner program), it is recommended to use the API instead of scraping HTML. APIs are more structured and reliable, and typically respect the website's terms of service.
Conclusion:
By following this approach, you can scrape multiple pages of critic reviews from Rotten Tomatoes. Adjust the scraping logic to match the actual HTML structure of the Rotten Tomatoes page, and ensure that you're respecting the site's robots.txt and terms of service.