
Optimizing SERP Scraping and On-Page Meta Analysis with Python and Oxylabs API

Search engine optimization (SEO) is an ever-evolving field where staying on top of indexation and meta titles is critical for improving rankings and click-through rates. Because Google handles metadata dynamically, it is essential to know whether the search engine results pages (SERPs) accurately represent your site’s information or whether Google is rewriting your meta titles.

In this article, we’ll demonstrate how to use Oxylabs’ API alongside Python libraries like Requests, Cloudscraper, and BeautifulSoup to scrape SERPs and analyze discrepancies between the on-page titles and those shown in the SERPs. This will help you understand how Google interprets your pages and guide you in optimizing meta titles and H1 tags for better performance.

Why Scrape SERPs and Meta Information?

There are several scenarios where SERP scraping and analysis can be invaluable:

  1. Indexation Issues: Identifying URLs that aren’t indexed helps you take action by submitting those pages for indexation via Google Search Console or investigating deeper on-page or technical SEO issues.
  2. Meta Title Discrepancies: Google often rewrites meta titles based on what it believes will improve user experience. By analyzing when and why these changes occur, you can adjust your meta titles to better align with both SEO and user experience principles.
  3. H1 as Meta Title: In cases where Google uses the H1 tag as a meta title, you can assess whether your H1 tags are optimized for both SEO and human readability. Adjusting H1 tags can enhance the appearance of your pages in the SERPs and improve click-through rates.

The Python Script: A Step-by-Step Guide

Let’s break down how you can use a Python script to automate this process, using Oxylabs’ API and scraping tools like BeautifulSoup and Cloudscraper.

1. Extracting URLs from Your Sitemap

The first step is to gather the URLs you want to analyze. You can easily extract them from a sitemap using Python:

from bs4 import BeautifulSoup
import requests
 
# Fetching sitemap and parsing URLs
r = requests.get("https://www.yoursite.com/sitemap.xml")
xml = r.text
soup = BeautifulSoup(xml, 'xml')
urls_list = [x.text for x in soup.find_all("loc")]

This code retrieves your sitemap, parses it with BeautifulSoup, and stores all the URLs in a list.
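
If your sitemap.xml is actually a sitemap index that points to several child sitemaps, you can flatten it into a single list first. Below is a minimal sketch of that approach, assuming the standard sitemap index structure; the index URL is a placeholder, and it reuses the same libraries as the snippet above:

from bs4 import BeautifulSoup
import requests

# Hypothetical sitemap index URL - replace with your own
index_url = "https://www.yoursite.com/sitemap_index.xml"
index_soup = BeautifulSoup(requests.get(index_url).text, 'xml')

# Each <loc> at the index level points to a child sitemap
urls_list = []
for sitemap_loc in index_soup.find_all("loc"):
    child_soup = BeautifulSoup(requests.get(sitemap_loc.text).text, 'xml')
    # Collect the page URLs listed in each child sitemap
    urls_list.extend(loc.text for loc in child_soup.find_all("loc"))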

2. Scraping SERPs and URLs

With the URLs in hand, we can scrape the SERPs using Oxylabs’ API and the actual pages themselves using Cloudscraper. This enables us to compare meta titles in both locations and check if Google is altering any of the titles or using the H1 tag instead.

import cloudscraper

list_comparison = []
scraper = cloudscraper.create_scraper()  # Reuse one Cloudscraper session for all page fetches

for url in urls_list:
    indexation = False
    metatitle_coincidence = False
    metatitle_coincidence_h1 = False

    # Oxylabs API payload for SERP scraping (a site: query restricted to this URL)
    payload = {
        'source': 'google_search',
        'domain': 'com',
        'query': 'site:' + url,
        'parse': 'true'
    }

    # Requesting SERP data from Oxylabs
    response = requests.post(
        'https://realtime.oxylabs.io/v1/queries',
        auth=('<your_username>', '<your_password>'),
        json=payload,
    )

    # Parsing the organic results (guarding against empty responses)
    results = response.json().get("results", [])
    organic = results[0].get("content", {}).get("results", {}).get("organic", []) if results else []

    for result in organic:
        if result["url"].endswith(url):
            indexation = True
            html = scraper.get(url)
            soup = BeautifulSoup(html.text, 'html.parser')

            metatitle = soup.find('title').get_text() if soup.find('title') else ""
            h1 = soup.find('h1').get_text() if soup.find('h1') else ""

            # Check whether the SERP title matches the on-page meta title or H1
            metatitle_coincidence = result["title"] == metatitle
            metatitle_coincidence_h1 = result["title"] == h1

            list_comparison.append([url, indexation, result["title"], metatitle, h1, metatitle_coincidence, metatitle_coincidence_h1])
            break

    if not indexation:
        list_comparison.append([url, indexation, "", "", "", False, False])
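
If you are running this across a large sitemap, the requests to both Oxylabs and your own pages can hit rate limits or transient errors. One option is to wrap the Oxylabs call in a small retry helper; the sketch below is illustrative (the function name, retry count, and delays are assumptions, not part of Oxylabs’ API) and could stand in for the direct requests.post call inside the loop:

import time
import requests

def fetch_serp(payload, retries=3, pause=2):
    # Illustrative helper: retry the Oxylabs request on transient failures
    response = None
    for attempt in range(retries):
        response = requests.post(
            'https://realtime.oxylabs.io/v1/queries',
            auth=('<your_username>', '<your_password>'),
            json=payload,
        )
        if response.status_code == 200:
            return response
        # Back off a little longer on each failed attempt (2s, 4s, 6s)
        time.sleep(pause * (attempt + 1))
    return response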

3. Exporting Results to Excel

Once we have scraped and analyzed the meta titles, H1 tags, and indexation status, we can export this data to an Excel file for further analysis.

import pandas as pd

# Creating a DataFrame and exporting to Excel
df = pd.DataFrame(list_comparison, columns=["URL", "Indexation", "Metatitle SERPs", "Metatitle", "H1", "Metatitle Coincidence", "H1 - Metatitle Coincidence"])
df.to_excel('serp_comparison.xlsx', header=True, index=False)

This will generate a comprehensive report, allowing you to identify indexation issues, meta title discrepancies, and the usage of H1 tags as titles.
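
Before opening the spreadsheet, you can also filter the DataFrame directly to surface the rows that need attention. For example, using the column names from the export above:

# Indexed pages where the SERP title differs from the on-page meta title
rewritten = df[df["Indexation"] & ~df["Metatitle Coincidence"]]

# Pages that did not appear in the SERPs at all
not_indexed = df[~df["Indexation"]]

print(f"{len(rewritten)} titles rewritten, {len(not_indexed)} URLs not indexed")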

Analyzing the Results

Once you have the data, you can draw conclusions that will inform your next SEO actions:

  1. Manual Indexation: For URLs that are not indexed, manually submitting them to Google Search Console is a good short-term fix. For the long term, check if Googlebot can crawl the pages properly, optimize on-page content, and enhance internal and external linking.

  2. Meta Title Adjustments: For URLs where Google alters the meta title, consider shortening overly long titles or making them more relevant to the content and user experience (a quick length check is sketched after this list).

  3. Optimizing H1 Tags: If Google is using your H1 tag as the meta title, make sure the H1 is user-friendly and not over-optimized for search engines. This is a good opportunity to refine H1 tags to improve both SEO and user engagement.
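
Since length is one common trigger for title rewrites, a quick check against a rough character budget can help you prioritize which pages to edit first. Here is a minimal sketch using the DataFrame from the export step; the 60-character threshold is an approximation, not a documented Google limit:

# Flag on-page titles that are likely to be truncated or rewritten
df["Title Length"] = df["Metatitle"].str.len()
too_long = df[df["Title Length"] > 60]
print(too_long[["URL", "Metatitle", "Title Length"]])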

Final Thoughts

With tools like Oxylabs, Cloudscraper, and BeautifulSoup, you can gain a deeper understanding of how Google is interpreting your site’s metadata. By comparing SERP results with on-page meta titles and H1 tags, you can identify discrepancies and take targeted actions to optimize your site’s performance in search results. Whether it’s improving indexation or refining meta titles, this approach gives you the insights needed to make data-driven decisions for better SEO outcomes.


Daniel Dye

Daniel Dye is the President of NativeRank Inc., a premier digital marketing agency that has grown into a powerhouse of innovation under his leadership. With a career spanning decades in the digital marketing industry, Daniel has been instrumental in shaping the success of NativeRank and its impressive lineup of sub-brands, including MarineListings.com, LocalSEO.com, MarineManager.com, PowerSportsManager.com, NikoAI.com, and SearchEngineGuidelines.com.

Before becoming President of NativeRank, Daniel served as the Executive Vice President at both NativeRank and LocalSEO for over 12 years. In these roles, he was responsible for maximizing operational performance and achieving the financial goals that set the foundation for the company’s sustained growth. His leadership has been pivotal in establishing NativeRank as a leader in the competitive digital marketing landscape.

Daniel’s extensive experience includes his tenure as Vice President at GetAds, LLC, where he led digital marketing initiatives that delivered unprecedented performance. Earlier in his career, he co-founded Media Breakaway, LLC, demonstrating his entrepreneurial spirit and deep understanding of the digital marketing world. In addition to his executive experience, Daniel has a strong technical background. He began his career as a TAC 2 NOC Engineer at Qwest (now CenturyLink) and as a Human Interface Designer at 9MSN, where he honed his skills in user interface design and network operations.

Daniel’s educational credentials are equally impressive. He holds an Executive MBA from the Quantic School of Business and Technology and has completed advanced studies in Architecture and Systems Engineering from MIT. His commitment to continuous learning is evident in his numerous certifications in Data Science, Machine Learning, and Digital Marketing from prestigious institutions like Columbia University, edX, and Microsoft. With a blend of executive leadership, technical expertise, and a relentless drive for innovation, Daniel Dye continues to propel NativeRank Inc. and its sub-brands to new heights, making a lasting impact in the digital marketing industry.

