Automating Web Scraping and TF-IDF Analysis Using Python

In this guide, we will walk through how to scrape content from web pages, process that content, and extract the most relevant terms using TF-IDF (Term Frequency-Inverse Document Frequency). This approach is useful for identifying keywords and understanding the important topics from any website’s content, which can be applied to SEO or content strategy.

Step 1: Scraping Content Using Cloudscraper and BeautifulSoup

The first step is to scrape content from the desired web pages. We will focus on extracting <p> tags, which typically contain the main body of text. This content will then be processed with TextBlob to prepare it for term analysis.

Here’s a Python script that uses cloudscraper (a requests-based client designed to get past Cloudflare's anti-bot page, though it does not succeed on every site) and BeautifulSoup for parsing the HTML:

import cloudscraper
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

# List of web pages to scrape
list_pages = ["<insert_your_pages_here>"]

# Initialize the scraper
scraper = cloudscraper.create_scraper()

# List to store processed content
list_content = []

# Scraping each page
for page in list_pages:
    response = scraper.get(page, timeout=30)  # Get page content (timeout in seconds)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract all <p> content as one lowercased string
    content = " ".join(paragraph.get_text().lower() for paragraph in soup.find_all('p'))
    
    # Append the processed text to the list
    list_content.append(tb(content))
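
The loop above stops with an exception as soon as one page fails to load. The following is a minimal sketch of a more forgiving fetch step. The helper name fetch_html is hypothetical, and the sketch assumes only that the scraper behaves like a requests-style session (a get method that accepts a timeout, and a response with raise_for_status and text), which is the case for the object returned by cloudscraper.create_scraper().

```python
def fetch_html(scraper, url, timeout=10):
    """Return the page HTML as a string, or None if the request fails.

    `scraper` is assumed to be a requests-style session, such as the
    object returned by cloudscraper.create_scraper().
    """
    try:
        response = scraper.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except Exception as exc:  # broad on purpose: network and anti-bot errors
        print(f"Skipping {url}: {exc}")
        return None
    return response.text
```

If you skip failed pages this way, also record which URLs succeeded, because the later steps pair each entry of list_content with the URL at the same position in list_pages.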

Step 2: Extracting Key Terms with TF-IDF

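Now that we have the text content from the pages, we can compute term frequencies and apply the TF-IDF algorithm to find the most important terms. TF-IDF highlights terms that appear frequently within a page but are less common across the entire set of pages, making them more meaningful. Because IDF compares pages against each other, the results are only meaningful when list_pages contains several pages; with a single page, every word receives the same negative IDF and the ranking is inverted.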

First, we define helper functions to compute the term frequency (TF), document frequency (DF), and the final TF-IDF score:

import math
from textblob import TextBlob as tb

# Function to compute term frequency (TF)
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

# Function to count how many documents contain the word
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

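# Function to compute inverse document frequency (IDF).
# The 1 + in the denominator is a smoothing term; it makes the IDF zero or
# negative for words that appear in all or all but one of the documents.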
def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

# Function to compute TF-IDF
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
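
To get a feel for the numbers these helpers produce, here is a small worked example. It is a sketch only: it repeats the same formulas on plain lists of words instead of TextBlob objects (an assumption made so it runs without extra packages), and the three documents are invented for illustration.

```python
import math

# Same formulas as the helpers above, but taking plain lists of words
# (an assumption made here so the example runs without TextBlob).
def tf(word, words):
    return words.count(word) / len(words)

def n_containing(word, docs):
    return sum(1 for words in docs if word in words)

def idf(word, docs):
    return math.log(len(docs) / (1 + n_containing(word, docs)))

def tfidf(word, words, docs):
    return tf(word, words) * idf(word, docs)

# Three made-up documents, already lowercased and split into words
docs = [
    "boats for sale near the marina".split(),
    "engine repair for boats".split(),
    "marina fees for dock rental".split(),
]

engine_score = tfidf("engine", docs[1], docs)  # appears in 1 of 3 documents
boats_score = tfidf("boats", docs[1], docs)    # appears in 2 of 3 documents
for_score = tfidf("for", docs[1], docs)        # appears in all 3 documents

print(f"engine: {engine_score:.3f}")  # 0.101
print(f"boats:  {boats_score:.3f}")   # 0.000
print(f"for:    {for_score:.3f}")     # -0.072
```

The word that is unique to the second document scores highest, while words shared with other documents drop to zero or below. Only the ordering of scores within a page matters, not their absolute size.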

With these functions defined, we can now loop through the scraped content to extract the top five terms for each page:

# List to store the URL, word, and TF-IDF score
list_words_scores = [["URL", "Word", "TF-IDF Score"]]

# Calculate TF-IDF for each word in each document
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    
    # Store the top 5 terms for each page
    for word, score in sorted_words[:5]:
        list_words_scores.append([list_pages[i], word, score])
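
The loop above rescans every document for each word it scores, which becomes slow as the number of pages grows. One possible alternative, sketched below, counts document frequencies once and reuses them. The function name top_terms is hypothetical, and it assumes each document is a plain list of words (with TextBlob, list(blob.words) would give one). It applies the same TF and IDF formulas as the helper functions.

```python
import math
from collections import Counter

def top_terms(docs, n=5):
    """Return, for each document, its n highest-scoring (word, score) pairs.

    `docs` is a list of word lists, one per document.
    """
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(word for words in docs for word in set(words))
    results = []
    for words in docs:
        counts = Counter(words)  # occurrences of each word in this document
        scores = {
            word: (count / len(words)) * math.log(len(docs) / (1 + doc_freq[word]))
            for word, count in counts.items()
        }
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        results.append(ranked[:n])
    return results
```

Calling top_terms([list(blob.words) for blob in list_content]) would return one list of (word, score) pairs per page, in the same order as list_content. A page with no words simply yields an empty list.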

Step 3: Exporting Results to an Excel File

Finally, we will export the results (URL, top words, and their TF-IDF scores) to an Excel file for further analysis using the pandas library. Writing .xlsx files from pandas also requires an Excel writer package such as openpyxl to be installed:

import pandas as pd

# Convert the results into a DataFrame, using the first row as column names
df = pd.DataFrame(list_words_scores[1:], columns=list_words_scores[0])

# Export the DataFrame to an Excel file
df.to_excel('tfidf_results.xlsx', index=False)

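The resulting Excel file will have three columns: the URL of the page, the word, and its corresponding TF-IDF score. A higher score means the term is more distinctive for that page relative to the other pages. The scores are not normalized to a 0-to-1 range, so compare them within a page rather than reading them as absolute values; with the IDF formula used here, a word that appears on every page receives a negative score.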

Final Thoughts

By automating the process of scraping and analyzing text content with TF-IDF, you can uncover valuable insights into the most important keywords on any set of web pages. This technique is particularly useful for SEO professionals, content creators, and anyone looking to optimize web content.

This Python script can be easily customized for different types of data extraction and processing. You could extend it further by incorporating other NLP techniques, exporting data in different formats, or analyzing more types of web page content such as headers or meta tags.
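
As one example of a different export format, the same rows could be written to a CSV file with Python's built-in csv module. This is a minimal sketch; the function name write_scores_csv and the output filename are placeholders, and it assumes the rows are shaped like list_words_scores, with the header row first.

```python
import csv

def write_scores_csv(rows, path):
    """Write rows (header row first) to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

For instance, write_scores_csv(list_words_scores, "tfidf_results.csv") would produce a file that opens in any spreadsheet program without needing an Excel writer package.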


Daniel Dye

