Web Requests for SEO using Python
Making Web Requests for SEO Data Scraping
In the world of SEO, gathering and analyzing data from web pages is crucial for optimizing site performance. Whether you’re analyzing metadata, on-page elements, or schema markup, efficiently scraping a webpage can provide valuable insights. Before diving into parsing HTML code, we first need to make a request to the desired URL. One library that stands out for this is cloudscraper, a powerful tool for accessing Cloudflare-protected sites without triggering bans or encountering restrictions.
For those working on web scraping with Python, I recommend checking out my detailed guide: “6 Essential Tips for Web Scraping with Python,” where I share the key tricks and strategies for effective scraping. Here’s a step-by-step approach to making your first web request and collecting SEO data using cloudscraper and BeautifulSoup.
Why Cloudscraper Over Requests?
While the Requests library is a popular choice for making HTTP requests, it often faces challenges when dealing with Cloudflare-protected websites. That’s where cloudscraper excels. It bypasses Cloudflare’s security measures, making it perfect for SEO specialists who need reliable access to data without being blocked.
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("<your_url>")
soup = BeautifulSoup(html.text, 'html.parser')
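Before parsing, it’s worth a quick check that the request actually succeeded; since cloudscraper builds on requests, the response exposes the usual status code:
# Sanity check: make sure the page was fetched before parsing
if html.status_code != 200:
    print(f"Request failed with status code {html.status_code}")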
Scraping Key SEO Elements
Now that you have the HTML content, the next step is extracting valuable SEO data. Here’s how to scrape the most important metadata for SEO audits:
1. Meta Title
The meta title is crucial as it directly affects your website’s search engine ranking. Extracting the meta title helps ensure it’s optimized for the target keywords.
metatitle = soup.find('title').get_text()
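As a quick follow-up check, you can flag titles that are missing or longer than the roughly 60 characters that typically display in search results; the threshold here is a common guideline, not a hard rule:
title_tag = soup.find('title')
if title_tag is None:
    print("Warning: no <title> tag found")
elif len(title_tag.get_text()) > 60:  # ~60 characters is a common display guideline
    print(f"Title may be truncated in search results ({len(title_tag.get_text())} characters)")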
2. Meta Description
The meta description is an important snippet of information that helps users understand what your page is about. Extracting this data ensures it aligns with the page’s content and includes the right keywords.
metadescription = soup.find('meta', attrs={'name':'description'})["content"]
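Note that the one-liner above raises an error when a page has no meta description, so a slightly more defensive version is useful for audits; returning None for a missing description is just one possible convention:
desc_tag = soup.find('meta', attrs={'name': 'description'})
metadescription = desc_tag.get("content") if desc_tag else None  # None marks a missing description
if metadescription is None:
    print("Warning: no meta description found")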
3. Robots Directives
Robots directives play a critical role in instructing search engines on how to crawl and index your content. This can directly impact how your website appears in search engine results.
robots_directives = soup.find('meta', attrs={'name':'robots'})["content"].split(",")
4. Viewport & Mobile Optimization
With the importance of mobile-first indexing, extracting the viewport data ensures that the page is optimized for different devices, especially mobile.
viewport = soup.find('meta', attrs={'name':'viewport'})["content"]
Scraping International SEO Elements
If you’re working on international SEO, scraping the language and alternate tag data becomes essential to ensure your content is correctly indexed across different regions.
HTML Language
html_language = soup.find('html')["lang"]
Canonicals & Hreflangs
To avoid duplicate content issues, it’s important to make sure that your canonical tags are properly set up, and hreflang tags are in place for international versions.
canonical = soup.find('link', attrs={'rel':'canonical'})["href"]
list_hreflangs = [[a['href'], a["hreflang"]] for a in soup.find_all('link', href=True, hreflang=True)]
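With both values extracted, one simple audit heuristic (an assumption on my part, not a strict rule) is to check whether the canonical URL also appears among the hreflang alternates, since a self-referencing hreflang entry is usually expected:
hreflang_urls = [href for href, lang in list_hreflangs]
if list_hreflangs and canonical not in hreflang_urls:
    print("Canonical URL is not listed among the hreflang alternates")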
Scraping Structured Data for SEO
Structured data, especially Schema markup, provides essential context to search engines about the content on a webpage. For example, breadcrumbs and organization schema can enhance how your page appears in rich results.
import json
json_schema = soup.find('script', attrs={'type':'application/ld+json'})
json_file = json.loads(json_schema.get_text())
This snippet extracts the first JSON-LD block on the page and parses it into a Python object for analysis.
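Since pages often carry several JSON-LD blocks (organization, breadcrumbs, article, and so on), a fuller sketch collects all of them and lists their @type values; skipping blocks that fail to parse is an assumption added for robustness:
schemas = []
for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
    try:
        schemas.append(json.loads(script.get_text()))
    except json.JSONDecodeError:
        continue  # skip blocks that are not valid JSON
schema_types = [s.get('@type') for s in schemas if isinstance(s, dict)]
print(schema_types)  # e.g. ['Organization', 'BreadcrumbList']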
Content Scraping
Whether you’re auditing for content length, keyword usage, or content duplication, scraping text and headers can help you perform a detailed SEO audit.
Paragraphs & Headings
paragraphs = [p.get_text() for p in soup.find_all('p')]
headers = [[str(h)[1:3], h.get_text()] for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]
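From these two lists you can already derive simple audit metrics, such as an approximate word count and a check that the page has exactly one H1; both checks below are illustrative rather than definitive:
word_count = sum(len(p.split()) for p in paragraphs)  # rough word count from paragraph text
h1_count = sum(1 for level, text in headers if level == 'h1')
print(f"Approximate word count: {word_count}")
if h1_count != 1:
    print(f"Expected exactly one H1, found {h1_count}")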
Bonus: Image Optimization
Alt text optimization is key to image SEO. You can scrape all the image URLs and their alt texts to analyze whether the images on the page are optimized for search.
images = [[img['src'], img.get('alt', '')] for img in soup.find_all('img')]
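From that list it’s straightforward to flag images with missing or empty alt text, the usual candidates for optimization:
missing_alt = [src for src, alt in images if not alt.strip()]
print(f"{len(missing_alt)} of {len(images)} images have no alt text")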
Troubleshooting Web Requests and Scraping
Web scraping can often run into hurdles. Here are a few common issues and how to resolve them:
- Cloudflare Blocks: If you get blocked frequently, try rotating your user agent or using proxies along with cloudscraper.
- Solution: cloudscraper can accept custom browser settings, and because it builds on requests you can attach proxies to the scraper session. Here’s how to set it up:
scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False},
    delay=10  # Adjust the delay to avoid getting blocked
)
scraper.proxies = {'http': 'http://your-proxy', 'https': 'https://your-proxy'}
- Missing Data: Sometimes, scraping results might be incomplete, with certain elements such as images or structured data missing.
- Solution: Use a different parser, such as lxml with BeautifulSoup, which may handle complex or malformed HTML more effectively:
soup = BeautifulSoup(html.text, 'lxml')
- Dynamic Content: Some pages load content dynamically using JavaScript, which cloudscraper cannot directly handle.
- Solution: For such cases, using a headless browser like Selenium or Playwright may be necessary to render the JavaScript before parsing:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("<your_url>")
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
driver.quit()  # close the browser once the rendered HTML is captured
Integrating with SEO Reporting Tools
Scraping data is just one step in your SEO workflow. To truly benefit, you should integrate these scraped insights with your SEO reporting tools.
1. Google Sheets Integration
Export your scraped data into Google Sheets for easy collaboration and tracking. You can use the gspread library to automate this process:
import gspread
from oauth2client.service_account import ServiceAccountCredentials
# Authenticate and connect to Google Sheets
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('path_to_credentials.json', scope)
client = gspread.authorize(credentials)
# Select the Google Sheet and write data
sheet = client.open('SEO Audit Data').sheet1
sheet.update('A1', [[metatitle, metadescription, ", ".join(robots_directives)]])  # join the directives list so it fits in a single cell
2. Automating Reports with Google Data Studio
Once the data is in Google Sheets, you can link it to Google Data Studio to create automated visual reports. This allows your clients or team members to view SEO progress with charts and graphs in real time.
3. Integration with SEO Tools
You can also import the scraped data into SEO tools such as Screaming Frog, Moz, or SEMrush for further analysis. For example, configuring Screaming Frog’s custom extraction feature to capture the same fields lets you compare its crawl results with your scraped data, ensuring your on-page SEO stays aligned.
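If your tool of choice accepts file imports rather than an API connection, a simple option is to write the scraped fields to a CSV first; the file name and columns below are just an illustrative layout, not a required format:
import csv

# Illustrative export: one row per audited URL, adjust the columns to your workflow
with open('seo_audit.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'description', 'canonical'])
    writer.writerow(['<your_url>', metatitle, metadescription, canonical])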
Conclusion
Cloudscraper and BeautifulSoup offer a powerful combination for SEO professionals looking to extract essential data from web pages. Whether you’re auditing for SEO performance, reviewing meta tags, or analyzing structured data, Python makes the process efficient. With proper troubleshooting, integrating with reporting tools, and automating the workflow, you can streamline your SEO processes and focus more on actionable insights.