
Using Python and Linear Algebra to Simplify Redirect Management for Large Websites


Redirect management for large websites can be a complex and time-consuming task, especially when you have thousands of URLs that need efficient handling. Whether it’s for SEO, site migrations, or content restructuring, having an automated way to map old URLs to new URLs can save significant time while ensuring a seamless user experience. Here, I’ll show you how to leverage Python and linear algebra to create a flexible, scalable redirect management solution.

In this guide, we’ll walk through:

Setting Up a Redirect Mapping with Python
Using Weighted Redirects for Priority Pages
Applying Similarity-Based Matching for Partial URL Matches
Outputting .htaccess Redirects for Easy Upload
By the end, you’ll have a Python script that can manage redirects at scale, ensuring high accuracy and easy customization.

Step 1: Setting Up the Redirect Mapping
To start, let’s create two lists of URLs: one containing the old URLs and the other the corresponding new URLs. In this basic setup, we’ll use an identity matrix as a 1:1 mapping. Each row stands for an old URL, each column for a new URL, and a 1 on the diagonal pairs every old URL with the new URL at the same position in its list. If your pages are directly equivalent, this matrix defines exactly one redirect per pair.

Here’s how it looks in Python:

import numpy as np
import pandas as pd

# Sample data: replace these with your actual lists of URLs
old_urls = ["/old-page-1", "/old-page-2", "/old-page-3", "/old-page-4"]
new_urls = ["/new-page-1", "/new-page-2", "/new-page-3", "/new-page-4"]

# Check that the lists are the same length
if len(old_urls) != len(new_urls):
    raise ValueError("The number of old URLs must match the number of new URLs.")

# Create a 1:1 mapping matrix (rows are old URLs, columns are new URLs)
mapping_matrix = np.identity(len(old_urls), dtype=int)
mapping_df = pd.DataFrame(mapping_matrix, columns=new_urls, index=old_urls)
This matrix now holds a 1:1 mapping, meaning each old URL corresponds directly to a new URL. But in most cases, this one-to-one mapping isn’t enough, especially for large sites. So, let’s make this mapping more dynamic by adding weighted redirects and similarity-based matching.
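As a quick check of the 1:1 mapping, the following minimal sketch reads the redirect pairs back out of an identity matrix. It assumes three placeholder URLs, and the name `pairs` is illustrative rather than part of the script above.

```python
import numpy as np

# Placeholder URLs for illustration only
old_urls = ["/old-page-1", "/old-page-2", "/old-page-3"]
new_urls = ["/new-page-1", "/new-page-2", "/new-page-3"]

mapping_matrix = np.identity(len(old_urls), dtype=int)

# Each row holds a single 1; its column index points at the target URL
rows, cols = np.nonzero(mapping_matrix)
pairs = [(old_urls[i], new_urls[j]) for i, j in zip(rows, cols)]

for old, new in pairs:
    print(f"{old} -> {new}")
```

Because the matrix is the identity, every old URL pairs with the new URL at the same list position.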

Step 2: Adding Weighted Redirects for Priority Pages
Some URLs are more valuable than others. For instance, your homepage or high-traffic category pages might need higher priority in a redirect mapping. To handle this, we can assign weights to each URL, allowing us to prioritize certain pages in the redirect process. These weights reflect how much traffic a page should get or its SEO importance.

Here’s how you can assign weights to prioritize critical pages:

# Assign weights to prioritize certain URLs (higher weight = higher priority)
old_weights = [0.8, 1.2, 0.5, 1.0]
new_weights = [1.0, 1.0, 1.0, 1.5]

# Adjust the mapping to accommodate weights
mapping_matrix = np.zeros((len(old_urls), len(new_urls)))

for i, old_weight in enumerate(old_weights):
    for j, new_weight in enumerate(new_weights):
        mapping_matrix[i, j] = old_weight * new_weight
In this setup, each entry in the mapping_matrix represents the weighted importance of redirecting from a specific old_url to a new_url. This weighted setup lets us emphasize critical URLs when performing redirects.
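In linear algebra terms, the nested loops compute the outer product of the two weight vectors. One caveat is that weights alone do not say which pages match: every row peaks at the new URL with the largest weight. The sketch below shows one possible way to handle that, assuming a separate match matrix (here simply the identity mapping from Step 1) that is multiplied element-wise with the weights. The names `weight_matrix`, `match_matrix`, and `weighted_mapping` are illustrative.

```python
import numpy as np

old_weights = np.array([0.8, 1.2, 0.5, 1.0])
new_weights = np.array([1.0, 1.0, 1.0, 1.5])

# Outer product: entry [i, j] equals old_weights[i] * new_weights[j]
weight_matrix = np.outer(old_weights, new_weights)

# On its own, every row peaks at the highest-weighted new URL (column 3)
print(weight_matrix.argmax(axis=1))

# Assumed match matrix: the 1:1 identity mapping from Step 1
match_matrix = np.identity(len(old_weights))

# The element-wise product keeps a weight only where a match exists
weighted_mapping = match_matrix * weight_matrix
best_targets = weighted_mapping.argmax(axis=1)
print(best_targets)
```

With this combination, the weights rank matched pairs by importance without changing which pages are paired.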

Step 3: Using Similarity-Based Matching for Partial URL Matches
For large sites, some pages might not have an exact equivalent but still need relevant redirects. To address this, we’ll use cosine similarity to create partial matches. Cosine similarity measures how close two vectors are in terms of orientation, making it useful for matching URLs based on patterns, such as similar categories or keywords.

Here’s how to apply similarity-based matching:

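from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Turn each URL into a vector of token counts ("/old-page-1" -> old, page, 1).
# The vectors need more than one dimension: the cosine similarity of two
# single positive numbers, such as the weights from Step 2, is always 1.0.
vectorizer = CountVectorizer(token_pattern=r"[A-Za-z0-9]+")
vectorizer.fit(old_urls + new_urls)
old_vectors = vectorizer.transform(old_urls)
new_vectors = vectorizer.transform(new_urls)

# mapping_matrix[i, j] is the similarity between old_urls[i] and new_urls[j]
mapping_matrix = cosine_similarity(old_vectors, new_vectors)
Using cosine similarity on these token vectors, we can score how closely each old URL matches each new URL. With the sample data, /old-page-1 and /new-page-1 share two of their three tokens and score about 0.67, while /old-page-1 and /new-page-2 share only one and score about 0.33. This keeps unrelated pages from being connected. The resulting mapping_matrix captures these similarity scores.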
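To make the cosine score concrete, here is a minimal standard-library sketch that computes it by hand from URL tokens. The helpers `tokenize` and `cosine` are hypothetical names, and splitting paths into alphanumeric tokens is an assumption about what counts as a "pattern". Note that the vectors must have more than one dimension, because the cosine similarity of two single positive numbers is always 1.0.

```python
import math
import re
from collections import Counter


def tokenize(url):
    """Split a URL path into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())


def cosine(url_a, url_b):
    """Cosine similarity between the token-count vectors of two URLs."""
    a, b = Counter(tokenize(url_a)), Counter(tokenize(url_b))
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a URL without tokens matches nothing
    return dot / (norm_a * norm_b)


print(round(cosine("/old-page-1", "/new-page-1"), 2))  # 0.67
print(round(cosine("/old-page-1", "/new-page-2"), 2))  # 0.33
```

The first pair shares the tokens "page" and "1" out of three tokens each, giving 2/3. The second pair shares only "page", giving 1/3.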

Step 4: Generating the Redirects in .htaccess Format
Finally, we’ll convert the mappings to .htaccess redirects. Here’s how to format and output them:

# Redirect generation based on highest similarity scores
redirects = []
for i, old_url in enumerate(old_urls):
    if old_url:
        new_url_index = np.argmax(mapping_matrix[i])
        new_url = new_urls[new_url_index]
        similarity_score = mapping_matrix[i, new_url_index]

        # Set a threshold for relevance; adjust as needed
        if similarity_score > 0.5:
            # Apache does not allow a comment on the same line as a directive,
            # so the score goes on its own comment line above the rule
            redirects.append(f"# Similarity: {similarity_score:.2f}")
            redirects.append(f"Redirect 301 {old_url} {new_url}")

# Output redirect rules
redirects_text = "\n".join(redirects)
print("Generated .htaccess Redirects:\n")
print(redirects_text)

# Optionally, write to an .htaccess file
# (mode "w" overwrites any existing .htaccess in the current directory)
with open(".htaccess", "w") as file:
    file.write(redirects_text + "\n")
This snippet finds the new_url with the highest similarity score for each old_url. Any match above a set threshold is deemed relevant and included in the .htaccess file as a 301 redirect.
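Old URLs whose best score falls below the threshold get no rule at all, so it can help to collect them for manual review. The sketch below is one way to do that, assuming the scores arrive as a list of rows. The function name `split_by_threshold` is hypothetical. Each score is written on its own comment line because Apache does not allow a comment on the same line as a directive.

```python
def split_by_threshold(old_urls, new_urls, scores, threshold=0.5):
    """Return (rule_lines, unmatched_old_urls) for a matrix of scores."""
    rule_lines, unmatched = [], []
    for old_url, row in zip(old_urls, scores):
        # Index of the best-scoring new URL for this old URL
        best = max(range(len(row)), key=lambda j: row[j])
        if row[best] > threshold:
            rule_lines.append(f"# Similarity: {row[best]:.2f}")
            rule_lines.append(f"Redirect 301 {old_url} {new_urls[best]}")
        else:
            unmatched.append(old_url)
    return rule_lines, unmatched


# Placeholder data for illustration only
rules, needs_review = split_by_threshold(
    ["/a", "/b"], ["/x", "/y"], [[0.9, 0.1], [0.2, 0.3]]
)
print("\n".join(rules))
print("Needs manual review:", needs_review)
```

Here "/a" gets a rule pointing at "/x", while "/b" lands in the review list because its best score of 0.3 does not clear the threshold.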

Customizations and Further Expansions
The script above is a flexible starting point, but you can easily extend it further. Here are some ideas:

Multi-Domain Redirects: Modify the matrix to account for subdomains or cross-domain redirects.
Keyword-Based Matching: Use NLP or string pattern matching to fine-tune the redirect mappings.
Additional Priority Levels: Add multiple thresholds for different types of pages, such as essential pages or informational content.
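As a rough illustration of the last idea, the sketch below looks up a relevance threshold per page type. The page types, the threshold values, and the helper `is_relevant` are all hypothetical. It assumes essential pages should rarely be left without a redirect and therefore get a lower bar, which may not suit every site.

```python
# Hypothetical page types and thresholds; tune these for your own site
THRESHOLDS = {"essential": 0.3, "informational": 0.6}
DEFAULT_THRESHOLD = 0.5


def is_relevant(score, page_type):
    """Return True if the score clears the threshold for this page type."""
    return score > THRESHOLDS.get(page_type, DEFAULT_THRESHOLD)


print(is_relevant(0.4, "essential"))      # True
print(is_relevant(0.4, "informational"))  # False
```

A check like this could stand in for the fixed 0.5 comparison in Step 4.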
Conclusion
Using Python and linear algebra can make redirect management faster and more scalable. This approach helps keep redirects accurate and SEO-friendly, so you can maintain a high-quality user experience even during major site transitions. By implementing weighted and similarity-based redirects, you’ll have a solution that grows with your website and adapts to changing needs.


Jakub Niedzwiecki

Jake’s journey in SEO began in 2018, and since then, he has crafted and implemented unique strategies for hundreds of companies. With an English degree from Colorado State University, Jake brings a refined communication style to his SEO work, making complex concepts accessible and engaging. What sets Jake apart is his passion for engineering and his deep understanding of advanced programming languages, including Python. This combination of technical expertise and a strong foundation in language allows him to create SEO strategies that are both innovative and effective, blending creativity with analytical precision. Jake’s ability to merge his linguistic skills with his technical knowledge enables him to develop SEO strategies that not only drive traffic but also resonate with target audiences, ensuring long-term success for the companies he works with.
