Automating Web Scraping and TF-IDF Analysis Using Python
In this guide, we will walk through how to scrape content from web pages, process that content, and extract the most relevant terms using TF-IDF (Term Frequency-Inverse Document Frequency). This approach is useful for identifying keywords and surfacing the most important topics across a set of pages, which can inform SEO or content strategy.
Step 1: Scraping Content Using Cloudscraper and BeautifulSoup
The first step is to scrape content from the desired web pages. We will focus on extracting <p> tags, which typically contain the main body of text. This content will then be processed with TextBlob to prepare it for term analysis.
Here’s a Python script that uses cloudscraper (which bypasses anti-bot systems like Cloudflare) and BeautifulSoup for parsing the HTML:
```python
import cloudscraper
from bs4 import BeautifulSoup
from textblob import TextBlob as tb

# List of web pages to scrape
list_pages = ["<insert_your_pages_here>"]

# Initialize the scraper
scraper = cloudscraper.create_scraper()

# List to store processed content
list_content = []

# Scrape each page
for page in list_pages:
    content = ""
    html = scraper.get(page)  # Get page content
    soup = BeautifulSoup(html.text, 'html.parser')

    # Extract the text of every <p> tag
    for paragraph in soup.find_all('p'):
        content += " " + paragraph.text.lower()

    # Append the processed text to the list
    list_content.append(tb(content))
```
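The loop above assumes every request succeeds. If you want the script to survive a blocked or broken URL, a minimal defensive variation might look like this (the status check, try/except, and fetched_pages bookkeeping are assumptions added here, not part of the original script):

```python
# A minimal, defensive variation of the fetch loop
fetched_pages = []  # Kept in sync with list_content in case a page fails

for page in list_pages:
    content = ""
    try:
        html = scraper.get(page)
        html.raise_for_status()  # Raise on HTTP errors such as 403 or 500
    except Exception as exc:  # Broad catch keeps the sketch simple
        print(f"Skipping {page}: {exc}")
        continue

    soup = BeautifulSoup(html.text, 'html.parser')
    for paragraph in soup.find_all('p'):
        content += " " + paragraph.text.lower()

    fetched_pages.append(page)
    list_content.append(tb(content))
```

If any page gets skipped, use fetched_pages in place of list_pages when labeling the results in Step 2, so URLs and scores stay aligned.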
Step 2: Extracting Key Terms with TF-IDF
Now that we have the text content from the pages, we can compute term frequencies and apply the TF-IDF algorithm to find the most important terms. TF-IDF helps to highlight terms that appear frequently within a page but are less common across the entire set of pages, making them more meaningful.
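For reference, the scores computed below follow the textbook definitions (the 1 added to the document count mirrors the code and guards against division by zero):

TF(t, d) = (number of times t appears in d) / (total number of words in d)
IDF(t, D) = log( |D| / (1 + number of documents in D containing t) )
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)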
First, we define helper functions to compute the term frequency (TF), document frequency (DF), and the final TF-IDF score:
```python
import math
from textblob import TextBlob as tb

# Term frequency (TF): how often the word appears in this document,
# normalized by document length
def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

# Document frequency: how many documents contain the word
def n_containing(word, bloblist):
    return sum(1 for blob in bloblist if word in blob.words)

# Inverse document frequency (IDF); the +1 avoids division by zero
def idf(word, bloblist):
    return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

# TF-IDF: term frequency weighted by rarity across the document set
def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)
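```

As a quick sanity check, the functions can be run on a few toy documents (the sample sentences here are invented purely for illustration):

```python
# Hypothetical toy documents, purely for illustration
doc1 = tb("python makes web scraping simple and scraping is fun")
doc2 = tb("content strategy depends on keywords and topics")
doc3 = tb("search engines reward relevant and useful pages")
docs = [doc1, doc2, doc3]

print(tfidf("scraping", doc1, docs))  # Positive: frequent in doc1, absent elsewhere
print(tfidf("and", doc1, docs))       # Negative: "and" appears in every document
```

Note that with this smoothed IDF, a word present in every document gets a negative score, which is exactly why common filler words sink to the bottom of the ranking.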
With these functions defined, we can now loop through the scraped content to extract the top five terms for each page:
```python
# List to store the URL, word, and TF-IDF score
list_words_scores = [["URL", "Word", "TF-IDF Score"]]

# Calculate TF-IDF for each word in each document
for i, blob in enumerate(list_content):
    scores = {word: tfidf(word, blob, list_content) for word in blob.words}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)

    # Store the top 5 terms for each page
    for word, score in sorted_words[:5]:
        list_words_scores.append([list_pages[i], word, score])
```
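One caveat: blob.words still includes stopwords and stray tokens, which can crowd the top five despite the IDF weighting. A possible refinement, sketched here with a hand-picked stopword set (an illustrative assumption; NLTK's stopwords corpus is a more complete option), is to filter tokens before scoring. This variant replaces the loop above:

```python
# Optional refinement: skip stopwords and non-alphabetic tokens before scoring
# (this short stopword set is illustrative, not exhaustive)
stopwords = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "for", "on", "with"}

for i, blob in enumerate(list_content):
    scores = {
        word: tfidf(word, blob, list_content)
        for word in blob.words
        if word not in stopwords and word.isalpha()
    }
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    for word, score in sorted_words[:5]:
        list_words_scores.append([list_pages[i], word, score])
```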
Step 3: Exporting Results to an Excel File
Finally, we will export the results (URL, top words, and their TF-IDF scores) into an Excel file for further analysis. This is achieved using the pandas library:
```python
import pandas as pd

# Convert the results into a DataFrame
df = pd.DataFrame(list_words_scores)

# Export the DataFrame to an Excel file
df.to_excel('tfidf_results.xlsx', header=False, index=False)
```
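Note that to_excel depends on an Excel writer engine such as openpyxl being installed (pip install openpyxl). If a plain-text file is enough, a CSV export avoids the extra dependency:

```python
# Alternative: write a CSV instead of an Excel file (no extra engine required)
df.to_csv('tfidf_results.csv', header=False, index=False)
```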
The resulting Excel file will have three columns: the URL of the page, the word, and its corresponding TF-IDF score. The higher the score, the more distinctive that term is for that specific page relative to the rest of the set.
Final Thoughts
By automating the process of scraping and analyzing text content with TF-IDF, you can uncover valuable insights into the most important keywords on any set of web pages. This technique is particularly useful for SEO professionals, content creators, and anyone looking to optimize web content.
This Python script can be easily customized for different types of data extraction and processing. You could extend it further by incorporating other NLP techniques, exporting data in different formats, or analyzing more types of web page content such as headers or meta tags.
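As an example of that last idea, a hypothetical extension of the scraping loop could fold headings and the meta description into each page's text (the tag selection below is an assumption; adjust it to the pages you target):

```python
# Hypothetical extension: include headings and the meta description
# alongside the <p> text when building each page's content
for page in list_pages:
    content = ""
    html = scraper.get(page)
    soup = BeautifulSoup(html.text, 'html.parser')

    # Headings often carry the page's key topics
    for tag in soup.find_all(['h1', 'h2', 'h3', 'p']):
        content += " " + tag.text.lower()

    # The meta description, if present, summarizes the page
    meta = soup.find('meta', attrs={'name': 'description'})
    if meta and meta.get('content'):
        content += " " + meta['content'].lower()

    list_content.append(tb(content))
```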