
Making Web Requests for SEO Data Scraping

In the world of SEO, gathering and analyzing data from web pages is crucial for optimizing site performance. Whether you’re analyzing metadata, on-page elements, or schema markup, efficiently scraping a webpage can provide valuable insights. Before diving into parsing HTML, we first need to make a request to the desired URL. One library that stands out for this is cloudscraper, a powerful tool for accessing Cloudflare-protected sites without triggering bans or encountering restrictions.

For those working on web scraping with Python, I recommend checking out my detailed guide: “6 Essential Tips for Web Scraping with Python,” where I share the key tricks and strategies for effective scraping. Here’s a step-by-step approach to making your first web request and collecting SEO data using cloudscraper and BeautifulSoup.

Why Cloudscraper Over Requests?

While the Requests library is a popular choice for making HTTP requests, it often faces challenges when dealing with Cloudflare-protected websites. That’s where cloudscraper excels. It bypasses Cloudflare’s security measures, making it perfect for SEO specialists who need reliable access to data without being blocked.

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper() 
html = scraper.get("<your_url>")
soup = BeautifulSoup(html.text, 'html.parser')
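
Since cloudscraper returns a standard requests response, it’s worth confirming the request succeeded before parsing. A minimal check on the html response from above:

# Stop early on 4xx/5xx responses instead of parsing an error page
html.raise_for_status()
print(html.status_code)  # expect 200 for a successful fetch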

Scraping Key SEO Elements

Now that you have the HTML content, the next step is extracting valuable SEO data. Here’s how to scrape the most important metadata for SEO audits:

1. Meta Title

The meta title is crucial as it directly affects your website’s search engine ranking. Extracting the meta title helps ensure it’s optimized for the target keywords.

metatitle = soup.find('title').get_text()

2. Meta Description

The meta description is an important snippet of information that helps users understand what your page is about. Extracting this data ensures it aligns with the page’s content and includes the right keywords.

metadescription = soup.find('meta', attrs={'name':'description'})["content"]

3. Robots Directives

Robots directives play a critical role in instructing search engines on how to crawl and index your content. This can directly impact how your website appears in search engine results.

robots_directives = soup.find('meta', attrs={'name':'robots'})["content"].split(",")

4. Viewport & Mobile Optimization

With the importance of mobile-first indexing, extracting the viewport data ensures that the page is optimized for different devices, especially mobile.

viewport = soup.find('meta', attrs={'name':'viewport'})["content"]
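
Note that soup.find() returns None when a tag is missing, so the one-liners above will raise an error on pages that lack a given meta tag. If you want these lookups to fail gracefully, one option is a small helper like this (the get_meta_content name is just illustrative):

def get_meta_content(soup, name):
    # Return the content of a <meta name=...> tag, or None if it's absent
    tag = soup.find('meta', attrs={'name': name})
    return tag.get('content') if tag else None

metadescription = get_meta_content(soup, 'description')
robots = get_meta_content(soup, 'robots')
robots_directives = robots.split(',') if robots else []
viewport = get_meta_content(soup, 'viewport')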

Scraping International SEO Elements

If you’re working on international SEO, scraping the language and alternate tag data becomes essential to ensure your content is correctly indexed across different regions.

HTML Language

html_language = soup.find('html')["lang"]

Canonicals & Hreflangs

To avoid duplicate content issues, it’s important to make sure that your canonical tags are properly set up, and hreflang tags are in place for international versions.

canonical = soup.find('link', attrs={'rel':'canonical'})["href"]
list_hreflangs = [[a['href'], a["hreflang"]] for a in soup.find_all('link', href=True, hreflang=True)]
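
With those values in hand, a quick sanity check is to confirm the canonical points at the URL you requested and to list each hreflang pair. A short illustrative snippet (url here is assumed to be the address you passed to scraper.get()):

# 'url' is the address you originally requested (assumption for this example)
if canonical != url:
    print(f"Canonical mismatch: {canonical} vs {url}")
for href, hreflang in list_hreflangs:
    print(f"{hreflang}: {href}")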

Scraping Structured Data for SEO

Structured data, especially Schema markup, provides essential context to search engines about the content on a webpage. For example, breadcrumbs and organization schema can enhance how your page appears in rich results.

import json
json_schema = soup.find('script', attrs={'type':'application/ld+json'})
json_file = json.loads(json_schema.get_text())

This snippet extracts the first JSON-LD block on the page so you can analyze its structured data.
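
Pages often carry several JSON-LD blocks (breadcrumbs, organization, product, and so on), and soup.find() only returns the first. One way to collect them all, with a guard against malformed JSON:

schemas = []
for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
    try:
        schemas.append(json.loads(script.get_text()))
    except json.JSONDecodeError:
        pass  # skip blocks that aren't valid JSON
# Inspect which schema types are present on the page
types = [s.get('@type') for s in schemas if isinstance(s, dict)]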

Content Scraping

Whether you’re auditing for content length, keyword usage, or content duplication, scraping text and headers can help you perform a detailed SEO audit.

Paragraphs & Headings

paragraphs = [p.get_text() for p in soup.find_all('p')]
headers = [[h.name, h.get_text()] for h in soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])]
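
From those lists, a rough content-length figure for the audit is one join away. For example:

# Approximate word count across all paragraph text
word_count = len(" ".join(paragraphs).split())
print(f"Words on page: {word_count}")
print(f"H1 tags: {sum(1 for level, _ in headers if level == 'h1')}")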

Bonus: Image Optimization

Alt text optimization is key to image SEO. You can scrape all the image URLs and their alt texts to analyze whether the images on the page are optimized for search.

images = [[img.get('src', ''), img.get('alt', '')] for img in soup.find_all('img')]
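
A quick follow-up check flags images that are missing alt text entirely, which is usually the first fix in an image SEO audit:

# Collect sources of images whose alt attribute is empty or missing
missing_alt = [src for src, alt in images if not alt.strip()]
print(f"{len(missing_alt)} of {len(images)} images have no alt text")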

Troubleshooting Web Requests and Scraping

Web scraping can often run into hurdles. Here are a few common issues and how to resolve them:

  1. Cloudflare Blocks: If you get blocked frequently, try rotating your user agent or using proxies along with cloudscraper.
    • Solution: cloudscraper accepts a custom browser fingerprint, and proxies can be set on the underlying requests session. Here’s how to set it up:
      scraper = cloudscraper.create_scraper(
          browser={
              'browser': 'chrome',
              'platform': 'windows',
              'mobile': False
          },
          delay=10  # Adjust delay to avoid getting blocked
      )
      # Proxies are configured on the session itself, as with any requests.Session
      scraper.proxies = {'http': 'http://your-proxy', 'https': 'https://your-proxy'}
  2. Missing Data: Scraping results can come back incomplete, with elements such as images or structured data absent from the parsed tree.
    • Solution: Try a different parser, such as lxml, with BeautifulSoup; it often handles complex or malformed HTML more effectively:
      soup = BeautifulSoup(html.text, 'lxml')
  3. Dynamic Content: Some pages load content dynamically using JavaScript, which cloudscraper cannot directly handle.
    • Solution: For such cases, using headless browsers like Selenium or Playwright might be necessary to render JavaScript.
      from selenium import webdriver

      driver = webdriver.Chrome()
      driver.get("<your_url>")
      html = driver.page_source  # fully rendered HTML, after JavaScript has run
      driver.quit()
      soup = BeautifulSoup(html, 'html.parser')

Integrating with SEO Reporting Tools

Scraping data is just one step in your SEO workflow. To truly benefit, you should integrate these scraped insights with your SEO reporting tools.

1. Google Sheets Integration

Export your scraped data into Google Sheets for easy collaboration and tracking. You can use the gspread library to automate this process:

import gspread
from oauth2client.service_account import ServiceAccountCredentials

# Authenticate and connect to Google Sheets
scope = ['https://spreadsheets.google.com/feeds', 'https://www.googleapis.com/auth/drive']
credentials = ServiceAccountCredentials.from_json_keyfile_name('path_to_credentials.json', scope)
client = gspread.authorize(credentials)

# Select the Google Sheet and write data
sheet = client.open('SEO Audit Data').sheet1
sheet.update('A1', [[metatitle, metadescription, ", ".join(robots_directives)]])  # join the directives list so each cell holds a plain string

2. Automating Reports with Google Data Studio

Once the data is in Google Sheets, you can link it to Google Data Studio to create automated visual reports. This allows your clients or team members to view SEO progress with charts and graphs in real-time.

3. Integration with SEO Tools

You can also import the scraped data into SEO tools such as Screaming Frog, Moz, or SEMrush for further analysis. For example, comparing your scraped metadata against Screaming Frog’s custom extraction results lets you cross-check your findings with a full crawl, ensuring that your on-page SEO stays consistent.
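
Most of these tools accept CSV uploads, so a simple export covers the hand-off. A minimal sketch using Python’s built-in csv module (the column set is illustrative, and url is assumed to be the address you scraped):

import csv

with open('seo_audit.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'description', 'robots'])
    # 'url' is assumed to hold the address you scraped (assumption for this example)
    writer.writerow([url, metatitle, metadescription, ", ".join(robots_directives)])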

Conclusion

Cloudscraper and BeautifulSoup offer a powerful combination for SEO professionals looking to extract essential data from web pages. Whether you’re auditing for SEO performance, reviewing meta tags, or analyzing structured data, Python makes the process efficient. With proper troubleshooting, integrating with reporting tools, and automating the workflow, you can streamline your SEO processes and focus more on actionable insights.


Daniel Dye

Daniel Dye is the President of NativeRank Inc., a premier digital marketing agency that has grown into a powerhouse of innovation under his leadership. With a career spanning decades in the digital marketing industry, Daniel has been instrumental in shaping the success of NativeRank and its impressive lineup of sub-brands, including MarineListings.com, LocalSEO.com, MarineManager.com, PowerSportsManager.com, NikoAI.com, and SearchEngineGuidelines.com.

Before becoming President of NativeRank, Daniel served as the Executive Vice President at both NativeRank and LocalSEO for over 12 years. In these roles, he was responsible for maximizing operational performance and achieving the financial goals that set the foundation for the company’s sustained growth. His leadership has been pivotal in establishing NativeRank as a leader in the competitive digital marketing landscape.

Daniel’s extensive experience includes his tenure as Vice President at GetAds, LLC, where he led digital marketing initiatives that delivered unprecedented performance. Earlier in his career, he co-founded Media Breakaway, LLC, demonstrating his entrepreneurial spirit and deep understanding of the digital marketing world.

In addition to his executive experience, Daniel has a strong technical background. He began his career as a TAC 2 NOC Engineer at Qwest (now CenturyLink) and as a Human Interface Designer at 9MSN, where he honed his skills in user interface design and network operations.

Daniel’s educational credentials are equally impressive. He holds an Executive MBA from the Quantic School of Business and Technology and has completed advanced studies in Architecture and Systems Engineering from MIT. His commitment to continuous learning is evident in his numerous certifications in Data Science, Machine Learning, and Digital Marketing from prestigious institutions like Columbia University, edX, and Microsoft.

With a blend of executive leadership, technical expertise, and a relentless drive for innovation, Daniel Dye continues to propel NativeRank Inc. and its sub-brands to new heights, making a lasting impact in the digital marketing industry.

