You might not realize it, but many websites make the same content available via different URLs. This is not popular with the search engines, and it’s one part of what’s referred to as duplicate content.
There are a few different kinds of duplicate content, but in this blog post you’ll learn:
- the basics of what duplicate content and canonicalization are
- how to figure out if your website has duplicate content problems
- how you can use free tools to find and solve some of your duplicate content issues
What is duplicate content and canonicalization?
“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”
“Many sites make the same HTML content or files available via different URLs. […] To gain more control over how your URLs appear in search results…we recommend that you pick a canonical (preferred) URL as the preferred version of the page. You can indicate your preference to Google in a number of ways. We recommend them all, though none of them is required (if you don’t indicate a canonical URL, we’ll identify what we think is the best version).”
How might this apply to your website? Let’s find out…
Duplicate content for your whole domain
If the same page/content on your website can be accessed via many different URLs, you’re potentially suffering from duplicate content. This can happen if you haven’t set the preferred domain, if you display a link to a pdf version of the page, or if you haven’t indicated your preferred URL structure for your website, etc. For example, does Google know if you prefer your URLs to be:
- with or without www
- with or without trailing slash /
- with or without file name – for example: index.php or .html
- which page version to use if you’re inconsistent with uppercase and lowercase in the urls
For example: if all of these urls show the same page, then you have a problem:
…unless you’ve implemented 301 redirects to the preferred version, told Google which version it should index, etc (more about that further down).
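To make the idea of "one preferred version" concrete, here is a minimal Python sketch that collapses the variants listed above into a single form. The preferences it applies (no www, no trailing slash, no index file name, lowercase paths) and the example.com addresses are assumptions for illustration only; pick whatever matches your own site, and note that lowercasing paths is only safe if your server treats urls as case-insensitive.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this sketch: no "www", lowercase paths,
# no index file name, and no trailing slash (except for the home page).
INDEX_FILES = ("index.php", "index.html", "index.htm")

def preferred_url(url: str) -> str:
    """Collapse common url variants into one preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.lower()
    # Drop a trailing index file name such as /index.php
    for name in INDEX_FILES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Drop the trailing slash, but keep "/" for the home page
    path = path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

# Hypothetical variants of the same page
variants = [
    "http://www.example.com/About/",
    "http://example.com/about",
    "http://example.com/about/index.php",
    "http://EXAMPLE.com/about/",
]
unique = {preferred_url(u) for u in variants}
print(unique)
```

If the set ends up with a single entry, all the variants point at the same page, and every variant except the preferred one is a candidate for a 301 redirect.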
Duplicate content due to dynamic url parameters
Sometimes your content management system adds different dynamic url parameters to the original url. Google might then index each url as an individual page, even though it’s not.
For example, if you allow comments on your blog (and you should), you usually have links leading directly to every comment on your blog. If you’re using WordPress, this is how the urls to the same page might look:
If you’re using Joomla with more than one menu, each link to a page will have different url parameters. Plus, you will probably also display a print and pdf version of each page. In the eyes of the search engines, all of this is duplicate content.
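As a rough illustration of why these urls are really the same page, the Python sketch below removes parameters that don't change the content, plus the #fragment. The parameter list is an assumption: "replytocom" is the WordPress comment parameter discussed in this post, while "print" is only a hypothetical placeholder, and the example.com addresses are made up.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters assumed not to change the page content (adjust for your CMS).
IGNORED_PARAMS = {"replytocom", "print"}

def strip_ignored_params(url: str) -> str:
    """Return the url without ignored parameters and without the #fragment."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in IGNORED_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

print(strip_ignored_params("http://example.com/my-post/?replytocom=42#respond"))
print(strip_ignored_params("http://example.com/?p=5&replytocom=42"))
```

Parameters that do select different content (like "p" in the second call) are kept, so only the noise is removed.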
Keep on reading and you’ll find out how to check and fix this for your own site.
How to find duplicate content on your website
There are a few different ways to find duplicate content, and here you’ll learn about some quick and easy solutions.
Duplicate content – your domain
A good tool for checking if you have a duplicate content problem on a domain level is “Search Masters Redirect check” (a free online duplicate content checker tool).
Check your website, and then keep the page open so you can refer back to it later in this blog post, when it’s time to fix it.
Another way to check is to do a search in Google for a specific page on your site. If you come up with more than one result, then you need to look into why. For example, do a search for exactly this:
You’ll see that you get 2 results – one for the .html version, and one for the .pdf version of the page. This is a common Joomla issue, and something you need to be aware of if that’s your CMS of choice. Further down, we’ll talk about how you can prevent this from happening on your site.
If you want to check if a specific filetype is indexed for your website, you can use the ‘filetype’ search operator. Like this (replace yourdomain.com with your actual domain, and pdf with the filetype you’re looking for):
Useful to know: “What file types can Google index?”
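If you want to run this kind of check for several file types, a small Python helper can build the search links for you. This is only a sketch: it assumes the query combines the "site" and "filetype" operators, and the function name and the yourdomain.com placeholder are made up for the example.

```python
from urllib.parse import urlencode

def filetype_search_url(domain: str, filetype: str) -> str:
    """Build a Google search link for one file type on one domain."""
    query = f"site:{domain} filetype:{filetype}"
    return "https://www.google.com/search?" + urlencode({"q": query})

for filetype in ("pdf", "doc"):
    print(filetype_search_url("yourdomain.com", filetype))
```

Open each printed link in your browser and see whether any results come back.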
Duplicate content due to dynamic url parameters
Your sitemap only shows the pages on your site that you want the search engines to know about. However, they will index more pages on your site due to issues with your preferred domain and dynamic url parameters. Your sitemap is not a bullet-proof way to tell the search engines what you want them to list in the search results.
If you already know about a dynamic url parameter that gets added on your website, for example for your comments in WordPress, you can check this with a simple Google search. You don’t need to know all of your url parameters here; it’s just a quick check for individual parameter issues.
Just type this into Google:
Replace “joomlatips.com” with your own website address, and replace “replytocom” with the dynamic url you suspect might cause duplicate content issues. You will then see something like this:
As you can see, the “replytocom” parameter has been blocked via robots.txt already, but that doesn’t stop Google from indexing it (106 results for something that shouldn’t be indexed). Blocking a url in robots.txt only stops Google from crawling it, so the url can still show up in the search results, just without a description. Not good, in any way.
What content is sending you traffic via search engines?
One final thing you can do, and keep track of, is to check which urls are already sending you organic traffic.
This is just a tip for you to get an additional idea about the structure of your content, and maybe see pages that you don’t want to have indexed in the search engines.
In your Google Analytics account, go to Traffic Sources > Sources > Search > Organic.
Go through the list and check if something stands out to you, and add it to your list of urls to fix.
So, now you’ve checked your domain in general, and the dynamic url parameters that you know of. Time to get a complete list of all your indexed urls, including the ones you’re gonna fix to avoid duplicate content issues both now and in the future.
How to get a list of all pages Google has indexed on your website
To figure out which pages have been indexed already, you want a list of all indexed urls for your website. Some people suggest using Screaming Frog, which is one of my favorite tools for many reasons, but I personally prefer another way for finding indexed duplicate content.
It’s a bit tricky, but follow these instructions and you’ll be fine:
Step 1: Install the browser plugin SEO Quake (and make sure it’s enabled in your browser)
Step 2: Go to Google preferences. Turn off Instant results (so you can change the results per page). Then set your search results per page to 100 (or less if you think your website has fewer pages indexed).
Step 3: In Google search, type in “site:yourwebsiteurl.com” (replace “yourwebsiteurl.com” with your actual website url). You’ll then only see pages from your website in the search result.
Step 4: Under the search box, you’ll see SEO Quake information. This is where you can export a list of all your indexed urls to a csv file that you can open in Excel later. Click on the “Save” button and download the csv file to your computer.
Step 5: Open the csv file in Excel (or your favorite spreadsheet program). Voila! You can now sort your indexed urls and easily see which ones you do not want to have listed in the search results.
Make sure you keep the urls you want to remove from Google’s search result, and continue reading for instructions on how to fix all the duplicate content issues you have now discovered.
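If your list is long, sorting by hand gets tedious. As a sketch of an alternative, the Python snippet below groups the exported urls by page and shows only the pages that appear under more than one url. It makes a few assumptions you should check against your own file: that the export is comma-separated, and that the url column is called "Url" (both are guesses, not something the export is guaranteed to use). The sample data is made up.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit

def find_duplicate_pages(csv_text: str, url_column: str = "Url") -> dict:
    """Group urls that differ only by www, trailing slash or parameters."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        url = row[url_column].strip()
        parts = urlsplit(url)
        host = parts.netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        # The key ignores the query string and the trailing slash
        key = host + (parts.path.rstrip("/") or "/")
        groups[key].append(url)
    # Keep only pages that were indexed under more than one url
    return {page: urls for page, urls in groups.items() if len(urls) > 1}

# Hypothetical export; in practice, read the text from your csv file
sample = """Url
http://example.com/post/
http://www.example.com/post
http://example.com/post/?replytocom=7
http://example.com/contact
"""
duplicates = find_duplicate_pages(sample)
for page, urls in duplicates.items():
    print(page, len(urls))
```

To use it on your real export, read the file contents into a string first, for example with open("your-export.csv", encoding="utf-8").read().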
How to fix your duplicate content problems
Now you know exactly what your problems are with your website’s duplicate page content, right? Good, let’s fix it.
Set the preferred version of your domain (www vs non-www)
First, decide which version of the url you want (with or without www). Then make sure all other versions redirect to the preferred version…and tell Google about it:
1. Redirect your domain from the www version to the non-www version (or vice versa, depending on what you prefer). If your website is running on Apache, this can be done with a 301 redirect in your .htaccess file. If you’re unsure, contact your hosting provider and they’ll help you.
2. Set your preferred domain in Google Webmaster Tools. Read about how to set your preferred domain in Google Webmaster Tools.
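Once the redirect is in place, it's worth verifying that the non-preferred version really answers with a 301 (permanent) redirect and not a 302. The Python sketch below is one way to check, using only the standard library. The function names and the example.com addresses are made up for illustration, and the check only looks at the first response, not at a whole redirect chain.

```python
import http.client
from urllib.parse import urlsplit

def head(url: str):
    """Send one HEAD request without following redirects.

    Returns the status code and the Location header (or None).
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("HEAD", parts.path or "/")
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()

def redirects_to_preferred(status, location, preferred_host: str) -> bool:
    """True if the response is a 301 redirect to the preferred host."""
    if status != 301 or not location:
        return False
    return urlsplit(location).netloc.lower() == preferred_host.lower()

# Usage (needs network access; replace with your own domain):
# status, location = head("http://www.example.com/")
# print(redirects_to_preferred(status, location, "example.com"))
```

If the function returns False, look at the status and Location values it was given: a 302, or a Location that still contains the non-preferred host, usually means the redirect rule needs another look.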
Remove pages from Google’s search results
Based on the Excel work you did earlier, you should now have a list of urls that you want to remove from the Google search result page. This is how you do it:
Log into your Google Webmaster Tools account, and go to Optimization > Remove URLs. You can now enter the urls that you want to remove, one by one.
First, add your url and click “Continue”.
Next, choose the option “Remove page from search result and cache”.
Click “Submit Request”, and repeat this process until you’ve added all urls you want to remove from the search engine results.
On the “Remove URLs” page you will also be able to see the status of each removal request, so you can see when all submitted urls have been deleted from the search results.
Tell Google which dynamic url parameters to ignore in the future
Use parameter handling in Google Webmaster Tools to tell Google about any parameters you would like ignored:
“If your site publishes content that can be reached via multiple URLs, you can gain more control over how your URLs appear in search results by specifying a canonical (preferred) version of the URL. Using the parameter handling tool is one way to do this…”
In your Google Webmaster Tools account, go to Configuration > URL Parameters. There you will see a list of parameters that Google has already picked up for your website. Go through them, and click Edit to change the option for each parameter.
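Before you go through the parameters one by one, it can help to know which ones actually show up in your indexed urls. Here is a small Python sketch that counts parameter names across a list of urls, for example the list you exported earlier. The sample urls and the "print" parameter are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def count_parameters(urls) -> Counter:
    """Count in how many urls each parameter name appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # A set, so a parameter repeated in one url is counted once
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts

indexed_urls = [
    "http://example.com/post/?replytocom=1",
    "http://example.com/post/?replytocom=2&print=1",
    "http://example.com/about",
]
for name, count in count_parameters(indexed_urls).most_common():
    print(name, count)
```

The parameters at the top of the output are the ones most worth reviewing first.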
Watch this video by Google for more information:
There is one more practical implementation you can do to make sure the search engines understand which version of your content is the right one: implement canonicalization on your website. (read what Google says about canonicalization)
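In practice, canonicalization usually means a link tag with rel="canonical" in the head of each page, pointing at the preferred url. To check whether a page already has one, you can look at the page source, or use a short Python sketch like the one below. It relies only on the standard library; the class and function names and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")

def find_canonical(html: str):
    """Return the canonical url declared in the HTML, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page = '<html><head><link rel="canonical" href="http://example.com/about"></head></html>'
print(find_canonical(page))
```

If the function returns None for one of your pages, that page has no canonical tag yet.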
Canonicalization for WordPress
If you’re using WordPress, the canonical tag is automatically integrated…but not for everything.
If you want to take it one step further, and also have full control over the canonical tag for individual posts, I warmly recommend Yoast’s SEO plugin.
Canonicalization for Joomla
For Joomla, canonicalization is not built into the core of the CMS. As you learnt earlier, you’ll also have a problem with duplicate content if you’re displaying the pdf icon for your articles.
Some warmly recommended Joomla extensions you can use to implement canonical tags are:
- RSSeo for Joomla (tutorial)
- AceSEF for Joomla (thanks for the tip Sean from Salyris Studios)
If you have suggestions for how to practically implement canonical tags for your CMS (including WordPress and Joomla), please let me know in the comments!
More reading about duplicate content
Besides all the links in the content above, here are some other great articles about duplicate content and canonicalization:
- What is Duplicate Content?
- Duplicate Content in a Post-Panda World and Fat Pandas and Thin Content
- Are You Making These 7 Panda-Punishing Content Mistakes?
- Duplicate Content: Block, Redirect or Canonical
- Learn SEO – Canonicalization
- About rel="canonical"
- Using meta tags to block access to your site – noindex tag
- Indexation for SEO: Real Numbers in 5 Easy Steps
- Ranking factor dilution – a big problem for Joomla SEO
Know other great articles on the topic? Let me know in the comments below.
Did this tutorial help you? Was it easy to understand? I’d love to get your feedback in the comments below!