Find cannibalising internal links in bulk using Screaming Frog

Cannibalising internal links occur when more than one link uses the same anchor text but points to different pages. This applies specifically to links with target keywords in the anchor text, so you can ignore generic links such as "click here" or "buy now". Cannibalising links can be difficult to find in bulk, simply because a typical website has a huge number of internal links, which makes them hard to analyse.

Therefore, I have created a simple script to find all the cannibalising internal links on a site in seconds using Screaming Frog's handy All Anchor Text report. This will require you to use Python, but don't worry, you don't need to know how to code. It's literally a copy and paste job. For this task, you will need:

  • Screaming Frog, to crawl your site and export the All Anchor Text report
  • Python, installed on your machine (the script uses the pandas library, covered below)

At the time of writing, Screaming Frog doesn't have a built-in report for this. However, they are releasing tons of super useful new stuff all the time, so who knows what they might release in the future?

Why is duplicated anchor text harmful?

A good metaphor is to think of your internal links as doorways to the other pages on your website. The anchor text is the sign that is on the door. Ideally, a user should know what is on the other side of the link just by seeing the anchor text and the destination page should match their expectations. This is why anchor text is an extremely powerful relevance signal for search engines. If multiple pages are linked to with the same keyword-rich anchor text then search engines will have a harder time knowing which page should be the ranking page.

How to find cannibalising internal links

The first step is to crawl your whole website using Screaming Frog and then go to "Bulk Export > Links > All Anchor Text", as shown in the screenshot below:

This will download every link that Screaming Frog has found on the website as a CSV. You could use this sheet to analyse the internal links directly, but there will likely be a huge number of them - especially if it's an ecommerce site. Many computers would struggle to handle this much data. Therefore, I have created a script to do the hard work for you.
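As an aside, if the export is too large to open comfortably in a spreadsheet, pandas can load just the columns the analysis actually needs. A minimal sketch below uses a toy stand-in for the export (the column names are assumed to match the All Anchor Text CSV; real data would be read from the file instead):

```python
import io
import pandas as pd

# Toy stand-in for a large All Anchor Text export (column names assumed).
csv_text = """Source,Destination,Anchor,Alt Text,Link Position,Status
https://example.com/,https://example.com/widgets,blue widgets,,Content,200
https://example.com/blog,https://example.com/widgets,blue widgets,,Content,200
"""

# Load only the columns the analysis needs; usecols keeps memory usage
# down on exports with millions of rows.
cols = ["Source", "Destination", "Anchor", "Alt Text", "Link Position"]
data = pd.read_csv(io.StringIO(csv_text), usecols=cols)
print(list(data.columns))  # only the requested columns survive
```

On a real export you would swap `io.StringIO(csv_text)` for `'input.csv'`.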

Once you have installed Python, create a new script and paste the following code into it:

# import pandas with shortcut 'pd'
import pandas as pd

# Define your website domain
domain = ''  # Add your website domain here

# read_csv function which is used to read the required CSV file
data = pd.read_csv('input.csv', dtype={"Target": "string", "Rel": "string"})

# Remove junk columns
data.drop(columns=['Size (Bytes)', 'Status', 'Target', 'Rel', 'Path Type', 'Link Path', 'Link Origin'], inplace=True)

# Merge anchor text and alt text into a single column
data["Anchors and Alts"] = data["Alt Text"].fillna('') + data["Anchor"].fillna('')
data.drop(columns=['Alt Text', 'Anchor'], inplace=True)

#Remove external links and social profiles
data = data[data["Destination"].str.contains(domain, na=False)]
data = data[~data["Destination"].str.contains('facebook.com')]
data = data[~data["Destination"].str.contains('linkedin.com')]
data = data[~data["Destination"].str.contains('twitter.com')]

#Remove duplicate navigation and header links
navtable = data.loc[data['Link Position'] == 'Navigation'] #create new table for nav links
data.drop(data.index[data['Link Position'] == 'Navigation'], inplace=True) #drop nav links from data
navtable = navtable.drop_duplicates(subset = ['Destination', 'Anchors and Alts'], keep = 'last').reset_index(drop = True) #de-dupe nav links

headertable = data.loc[data['Link Position'] == 'Header'] #create new table for header links
data.drop(data.index[data['Link Position'] == 'Header'], inplace=True) #drop header links from data
headertable = headertable.drop_duplicates(subset = ['Destination', 'Anchors and Alts'], keep = 'last').reset_index(drop = True) #de-dupe header links

data = pd.concat([data, headertable, navtable]) #rejoin tables

data = data.drop_duplicates(subset = ['Destination', 'Anchors and Alts']).reset_index(drop = True) #remove duplicate links
data = data[data.duplicated('Anchors and Alts', keep=False)].sort_values('Anchors and Alts') #keep only links whose anchor text appears more than once

#write to csv
data.to_csv('output.csv', index=False, header=True)
print("output.csv created")

Note: if this does not work, you may need to install the pandas library first, e.g. with pip install pandas.

The empty domain variable at the top of the script should be set to your website domain before running it.

The script works through the following steps:

  • Removes irrelevant columns, as these are not needed for your analysis
  • Merges the alt text and anchor text columns for easier analysis
  • Strips out external links, which are also irrelevant to your analysis
  • De-duplicates navigation and header links to avoid cluttering your report
  • Removes duplicate links with the same anchor text and destination
  • Returns only links where the anchor text is duplicated
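To see what the final filtering step does, here is a minimal sketch with made-up data: the anchor "blue widgets" points at two different pages, so both rows are flagged, while the unique "contact us" anchor is dropped.

```python
import pandas as pd

# Made-up example links: "blue widgets" points at two different URLs.
data = pd.DataFrame({
    "Destination": ["/widgets-a", "/widgets-b", "/contact"],
    "Anchors and Alts": ["blue widgets", "blue widgets", "contact us"],
})

# Same logic as the script: keep=False marks every copy of a repeated
# anchor, so all cannibalising rows survive the filter.
dupes = data[data.duplicated("Anchors and Alts", keep=False)].sort_values("Anchors and Alts")
print(dupes)
```

This is why the script de-duplicates identical (destination, anchor) pairs first: otherwise the same link repeated on many pages would look like cannibalisation.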

So to run your script, just follow these steps:

  1. Crawl your website using Screaming Frog
  2. Export the All Anchor Text report
  3. Copy the export into the same folder as your script and rename it to input.csv
  4. Update the domain name in the script and run it

Within seconds a new file output.csv will be created in your script folder. This will contain all the internal links with duplicate anchor text to different URLs. As simple as that!
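If you want a quick view of the worst offenders, you could count how many different pages each anchor points to. The sketch below uses toy rows in the shape of output.csv (the column names are assumed to match the script's output; real data would be read with pd.read_csv('output.csv')):

```python
import io
import pandas as pd

# Toy rows in the shape of output.csv (column names assumed).
csv_text = """Source,Destination,Anchors and Alts,Link Position
/home,/widgets-a,blue widgets,Content
/blog,/widgets-b,blue widgets,Content
/home,/red-a,red widgets,Content
/blog,/red-b,red widgets,Content
/faq,/red-c,red widgets,Content
"""
data = pd.read_csv(io.StringIO(csv_text))

# For each anchor, count how many different pages it links to; the
# largest counts are the worst cannibalisation offenders.
summary = (data.groupby("Anchors and Alts")["Destination"]
               .nunique()
               .sort_values(ascending=False))
print(summary)
```

Anchors at the top of that summary are the ones to fix first.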

Have a go and let me know what you think.