Tutorial: How to find and fix duplicate content on your website

Posted on Aug 24, 2012 in Technical SEO | 10 comments

Duplicate content and canonicalization

You might not realize it, but many websites make the same content available via different URLs. This is not popular with the search engines, and it’s one part of what’s referred to as duplicate content.

There are a few different kinds of duplicate content, but in this blog post you’ll learn:

  • the basics of what duplicate content and canonicalization are
  • how to figure out if your website has duplicate content problems
  • how you can use free tools to find and solve some of your duplicate content issues

Please note: In this blog post, you’ll learn the basics of how to deal with duplicate content on your own website. We will not talk about duplicate content that arises when someone copies your content and publishes it on another site. We will not go into the specific issues with an ecommerce site either, as that is a whole different beast of duplicate content issues. Let’s keep it simple for now…


What is duplicate content and canonicalization?

This is Google’s own definition of duplicate content:

“Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin.”

This is Google’s own definition of canonicalization:

“Many sites make the same HTML content or files available via different URLs. [...] To gain more control over how your URLs appear in search results…we recommend that you pick a canonical (preferred) URL as the preferred version of the page. You can indicate your preference to Google in a number of ways. We recommend them all, though none of them is required (if you don’t indicate a canonical URL, we’ll identify what we think is the best version).”

How might this apply to your website? Let’s find out…

Duplicate content for your whole domain

If the same page/content on your website can be accessed via many different URLs, you’re potentially suffering from duplicate content. This can happen if you haven’t set the preferred domain, haven’t indicated your preferred URL structure for your website, or display a link to a PDF version of the page, etc. For example, does Google know if you prefer your URLs to be:

  • with or without www
  • with or without trailing slash /
  • with or without file name – for example: index.php or .html
  • which page version to use if you’re inconsistent with uppercase and lowercase in the urls

For example: if all of these URLs show the same page, then you have a problem:

  • http://yourwebsite.com/category/thepage
  • http://yourwebsite.com/category/ThePage
  • http://yourwebsite.com/category/thepage/
  • http://yourwebsite.com/category/thepage.html
  • http://www.yourwebsite.com/category/thepage
  • http://www.yourwebsite.com/category/thepage/
  • http://www.yourwebsite.com/category/thepage.html

…unless you’ve implemented 301 redirects to the preferred version, told Google which version it should index, etc (more about that further down).
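To make the overlap concrete, here is a minimal Python sketch that maps each of the example URLs above to one preferred form. The rules it applies (drop www, lowercase the path, drop the trailing slash and any .html or index.php ending) are assumptions picked for illustration, not rules Google applies, and `normalize` is a hypothetical helper name. Lowercasing paths is only safe if your server treats them case-insensitively.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Map a URL variant to one preferred form (assumed preferences only)."""
    parts = urlsplit(url)

    # Assumption: the non-www host is the preferred one.
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]

    # Assumption: paths are case-insensitive on this server.
    path = parts.path.lower()

    # Drop a file name or extension that points at the same page.
    for suffix in ("/index.php", "/index.html", ".html"):
        if path.endswith(suffix):
            path = path[: -len(suffix)]
            break

    # Assumption: URLs without a trailing slash are preferred.
    path = path.rstrip("/") or "/"

    return urlunsplit((parts.scheme, host, path, parts.query, ""))


variants = [
    "http://yourwebsite.com/category/thepage",
    "http://yourwebsite.com/category/ThePage",
    "http://yourwebsite.com/category/thepage/",
    "http://yourwebsite.com/category/thepage.html",
    "http://www.yourwebsite.com/category/thepage",
    "http://www.yourwebsite.com/category/thepage/",
    "http://www.yourwebsite.com/category/thepage.html",
]

canonical = {normalize(u) for u in variants}
print(canonical)  # {'http://yourwebsite.com/category/thepage'}
```

All seven variants collapse into a single URL, which is exactly the version you would want the search engines to index.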

Duplicate content due to dynamic url parameters

Sometimes your content management system adds different dynamic url parameters to the original url. Google might then index each url as an individual page, even though it’s not.

For example, if you allow comments on your blog (and you should), you usually have links leading directly to every comment on your blog. If you’re using WordPress, this is how the urls to the same page might look:

  • http://yourwebsite.com/blog/yourblogpost/
  • http://yourwebsite.com/blog/yourblogpost/?replytocom=123
  • http://yourwebsite.com/blog/yourblogpost/?replytocom=456

If you’re using Joomla with more than one menu, each link to a page will have different url parameters. Plus, you will probably also display a print and pdf version of each page. All of these count as duplicate content in the eyes of the search engines.
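If you want to see which of your urls are really the same page once the extra parameters are gone, a small Python sketch like the one below can help. It assumes that “replytocom” never changes the page content; the `IGNORED_PARAMS` set and the function name are hypothetical, so adjust them to the parameters your own CMS adds.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page shows.
IGNORED_PARAMS = {"replytocom"}


def strip_ignored_params(url, ignored=IGNORED_PARAMS):
    """Return the URL without the query parameters listed in `ignored`."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


print(strip_ignored_params("http://yourwebsite.com/blog/yourblogpost/?replytocom=123"))
# http://yourwebsite.com/blog/yourblogpost/
```

Parameters that are not in the ignore list are kept, so a url like `?page=2&replytocom=456` still ends up as `?page=2`.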

Keep on reading and you’ll find out how to check and fix this for your own site.


How to find duplicate content on your website

There are a few different ways to find duplicate content, and here you’ll learn about some quick and easy solutions.

Duplicate content – your domain

A good tool for checking if you have a duplicate content problem on a domain level is “Search Masters Redirect check” (a free online duplicate content checker tool).

Duplicate content checker tool – redirect check for your website domain

Check your website, and then keep the page open so you can refer back to it later in this blog post, when it’s time to fix it.

Another way to check is to do a search in Google for a specific page on your site. If you come up with more than one result, then you need to look into why. For example, do a search for exactly this:

site:people.joomla.org/guidelines

You’ll see that you get 2 results – one for the .html version, and one for the .pdf version of the page. This is a common Joomla issue, and something you need to be aware of if that’s your CMS of choice. Further down, we’ll talk about how you can prevent this from happening on your site.

If you want to check if a specific filetype is indexed for your website, you can use the ‘filetype’ search operator. Like this (replace yourdomain.com with your actual domain, and pdf with the filetype you’re looking for):

site:yourdomain.com filetype:pdf

Useful to know: “What file types can Google index?”

Duplicate content due to dynamic url parameters

Before you continue: Double-check your sitemap. Are all your pages listed only once in your sitemap? Or are you telling Google to index multiple urls of the same page? Solve this first.

Your sitemap only shows the pages on your site that you want the search engines to know about. However, they will index more pages on your site due to issues with your preferred domain and dynamic url parameters. Your sitemap is not a bullet-proof way to tell the search engines what you want them to list in the search results.
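If your sitemap is long, checking it by eye gets tedious. The Python sketch below reads a sitemap in the standard sitemaps.org XML format and reports urls that appear more than once. It treats urls that differ only in uppercase/lowercase or a trailing slash as the same page, which is an assumption you may want to change; `duplicate_sitemap_urls` is a hypothetical helper name and the example sitemap is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by standard sitemaps.org sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def duplicate_sitemap_urls(sitemap_xml):
    """Return sitemap URLs listed more than once, ignoring case and trailing slash."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    # Assumption: case and trailing slash do not distinguish pages.
    counts = Counter(url.rstrip("/").lower() for url in urls)
    return sorted(url for url, seen in counts.items() if seen > 1)


example = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://yourwebsite.com/blog/yourblogpost/</loc></url>
  <url><loc>http://yourwebsite.com/blog/yourblogpost</loc></url>
  <url><loc>http://yourwebsite.com/about/</loc></url>
</urlset>"""

print(duplicate_sitemap_urls(example))
# ['http://yourwebsite.com/blog/yourblogpost']
```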

If you already know about a dynamic url parameter that gets added on your website, for example for your comments in WordPress, you can check this with a simple Google search. You don’t need to know all your url parameters here; it’s just a quick check for individual parameter issues.

Just type this into Google:

site:joomlatips.com inurl:replytocom

Replace “joomlatips.com” with your own website address, and replace “replytocom” with the dynamic url parameter you suspect might cause duplicate content issues. You will then see something like this:

Wordpress comment url results in duplicate content

As you can see, the “replytocom” parameter has been blocked via robots.txt already, but that doesn’t stop Google from indexing it (106 results for something that shouldn’t be indexed). You just won’t have any descriptions for the urls in the search result. Not good, in any way.

What content is sending you traffic via search engines?

One final thing you can do, and keep track of, is to check which urls are already sending you organic traffic.

This is just a tip for you to get an additional idea about the structure of your content, and maybe see pages that you don’t want to have indexed in the search engines.

In your Google Analytics account, go to Traffic Sources > Sources > Search > Organic.

Go through the list and check if something stands out to you, and add it to your list of urls to fix.

So, now you’ve checked your domain in general, and the dynamic url parameters that you know of. Time to get a complete list of all your indexed urls, including the ones you’re gonna fix to avoid duplicate content issues both now and in the future.

How to get a list of all pages Google has indexed on your website

To figure out which pages have been indexed already, you want a list of all indexed urls for your website. Some people suggest using Screaming Frog, which is one of my favorite tools for many reasons, but I personally prefer another way of finding indexed duplicate content.

It’s a bit tricky, but follow these instructions and you’ll be fine:

Step 1: Install the browser plugin SEO Quake (and make sure it’s enabled in your browser)

Step 2: Go to Google preferences. Turn off Instant results (so you can change the results per page). Then set your search results per page to 100 (or less if you think your website has fewer pages indexed).

Google preferences - set number of results per page

Step 3: In Google search, type in “site:yourwebsiteurl.com” (replace “yourwebsiteurl.com” with your actual website url). You’ll then only see pages from your website in the search result.

Step 4: Under the search box, you’ll see SEO Quake information. This is where you can export a list of all your indexed urls to a csv file that you can open in Excel later. Click on the “Save” button and download the csv file to your computer.

SEO Quake helps you create a list of all your indexed urls in Google

Step 5: Open the csv file in Excel (or your favorite spreadsheet program). Voila! You can now sort your indexed urls and easily see which ones you do not want to have listed in the search results.

Make sure you keep the list of urls you want to remove from Google’s search results, and continue reading for instructions on how to fix all the duplicate content issues you have now discovered.
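If you would rather not sort the list by hand, here is a rough Python sketch that reads the exported csv and picks out the urls most likely to be duplicates: anything with a query string, and anything ending in .pdf. The column name `Url` and the comma delimiter are assumptions, so check the header row of your own export first; the sample data is made up.

```python
import csv
import io
from urllib.parse import urlsplit


def flag_suspect_urls(csv_text, url_column="Url", delimiter=","):
    """Return indexed URLs that carry a query string or point to a PDF."""
    suspects = []
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        url = (row.get(url_column) or "").strip()
        parts = urlsplit(url)
        if parts.query or parts.path.lower().endswith(".pdf"):
            suspects.append(url)
    return suspects


# Made-up sample; a real export will have more columns.
sample = """Url
http://yourwebsite.com/blog/yourblogpost/
http://yourwebsite.com/blog/yourblogpost/?replytocom=123
http://yourwebsite.com/guidelines.pdf
"""

for url in flag_suspect_urls(sample):
    print(url)
```

To use it on a real file, read the file contents into a string and pass that in place of `sample`. Treat the result as a list of candidates to review, not as a list to remove blindly.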


How to fix your duplicate content problems

Now you know exactly what your problems are with your website’s duplicate page content, right? Good, let’s fix it.

Set the preferred version of your domain (www vs non-www)

First, decide which version of the url you want (with or without www). Then make sure all other versions redirect to the preferred version…and tell Google about it:

1. Redirect your domain from the www version to the non-www version (or vice versa, depending on what you prefer). If your website is running on Apache, this can be done with a 301 redirect in your .htaccess file. If you’re unsure, contact your hosting provider and they’ll help you.

2. Set your preferred domain in Google Webmaster Tools. Read about how to set your preferred domain in Google Webmaster Tools.
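Once the redirect is in place, you can check it yourself instead of relying on an online tool. The Python sketch below is one possible way to do that, using only the standard library: `fetch_status` sends a HEAD request without following redirects, and `redirects_to_preferred` decides whether the answer is a 301 pointing at the host you chose. Both function names are hypothetical, and the example assumes the non-www version is your preferred one.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url):
    """Return (status code, Location header) for a URL without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=10)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("HEAD", target)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def redirects_to_preferred(status, location, preferred_host):
    """True if the response is a 301 whose Location is on the preferred host."""
    if status != 301 or not location:
        return False
    # A relative Location has no host, so it is reported as False here.
    return urlsplit(location).hostname == preferred_host.lower()


# Example with made-up response values (no network needed):
print(redirects_to_preferred(301, "http://yourwebsite.com/", "yourwebsite.com"))  # True
print(redirects_to_preferred(302, "http://yourwebsite.com/", "yourwebsite.com"))  # False
```

On a live site you would call `fetch_status("http://www.yourwebsite.com/")` and pass the two values it returns into `redirects_to_preferred`. A 302 or a 200 on the non-preferred version means the redirect still needs fixing.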

Remove pages from Google’s search results

Based on the Excel work you did earlier, you should now have a list of urls that you want to remove from the Google search result page. This is how you do it:

Log into your Google Webmaster Tools account, and go to Optimization > Remove URLs. You can now enter the urls that you want to remove, one by one.

Remove urls from Google search result

The above screenshot shows you the first step – add your url and click “Continue”.

Remove urls from Google search result - step 2

In the screenshot to the right, you see the next step. Choose the highlighted option “Remove page from search result and cache”.

Click “Submit Request”, and repeat this process until you’ve added all urls you want to remove from the search engine results.

On the “Remove URLs” page you will also be able to see the status of each removal request, so you can see when all submitted urls have been deleted from the search results.

Tell Google which dynamic url parameters to ignore in the future

Use parameter handling in Google Webmaster Tools to tell Google about any parameters you would like ignored:

“If your site publishes content that can be reached via multiple URLs, you can gain more control over how your URLs appear in search results by specifying a canonical (preferred) version of the URL. Using the parameter handling tool is one way to do this…”

In your Google Webmaster Tools account, go to Configuration > URL Parameters. There you will see a list of parameters that Google has already picked up for your website. Go through them, and click Edit to change the option for each parameter.

Url parameters in Google Webmaster Tools

Help Google crawl your site more efficiently by indicating how to handle the parameters in your URLs
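Before you edit the parameters one by one, it can help to know which ones actually show up in your indexed urls, and how often. Here is a minimal Python sketch that counts parameter names across a list of urls. The function name and the sample urls are hypothetical; in practice you would feed it your own list of indexed urls.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit


def count_parameters(urls):
    """Count in how many URLs each query parameter name appears."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # Use a set so a parameter repeated within one URL is counted once.
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts


indexed = [
    "http://yourwebsite.com/blog/yourblogpost/",
    "http://yourwebsite.com/blog/yourblogpost/?replytocom=123",
    "http://yourwebsite.com/blog/yourblogpost/?replytocom=456",
    "http://yourwebsite.com/blog/otherpost/?print=1",
]

for name, seen in count_parameters(indexed).most_common():
    print(name, seen)
```

The parameters at the top of the output are the ones worth looking at first.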

Use canonicalization!

There is one more practical step you can take to make sure the search engines understand which version of your content is the right one: implement canonicalization on your website. (Read what Google says about canonicalization.)
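To check whether a page already declares a canonical url, you can look for a link tag with rel="canonical" in its HTML. The Python sketch below does this with the standard library’s HTML parser; `CanonicalFinder` and `find_canonical` are hypothetical names, and the page snippet is made up for the example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = """
<html><head>
<title>Your blog post</title>
<link rel="canonical" href="http://yourwebsite.com/blog/yourblogpost/" />
</head><body></body></html>
"""

print(find_canonical(page))  # http://yourwebsite.com/blog/yourblogpost/
```

If every url variant of a page returns the same canonical url, the search engines have a clear hint about which version you prefer.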

Canonicalization for WordPress

If you’re using WordPress, the canonical tag is automatically integrated…but not for everything.

If you want to take it one step further, and also have full control over the canonical tag for individual posts, I warmly recommend Yoast’s SEO plugin.

Canonicalization for Joomla

For Joomla, canonicalization is not built into the core of the CMS. As you learnt earlier, you’ll also have a problem with duplicate content if you’re displaying the pdf icon for your articles.

Some warmly recommended Joomla extensions you can use to implement canonical tags are:

If you have suggestions for how to practically implement canonical tags for your CMS (including WordPress and Joomla), please let me know in the comments!


More reading about duplicate content

Besides all the links in the content above, here are some other great articles about duplicate content and canonicalization:

Know other great articles on the topic? Let me know in the comments below.


Did this tutorial help you? Was it easy to understand? I’d love to get your feedback in the comments below!

About

Tess is a location-independent Swedish emarketing tigress, the founder of For the Love of SEO and the owner of JoomlaTips. Follow Tess on Twitter: @tessneale / @joomlatips / @fortheloveofseo and Google Plus: Tess Neale

10 Comments

  1. It’s this sort of thoughtful, well organised article that makes you stand out. You’ve put a lot of work into this, and it comes across as highly professional and- somehow!- full of personality.

    Useful information for any of us doing our own web pages. Thanks for putting in the time to make your site well worth browsing. I trust you will reap the business results you deserve from such a dedicated application of time and effort.

    • Aawhh, thank you Dan! :) I have a lot of knowledge to share, and what’s the point of sharing it if it doesn’t help people? Keeping it simple is the key on this website – jargon free and easy instructions to follow.

      Yes I am putting a lot of work into each blog post, and getting comments like yours really makes it all worth it! Well-written, personal, and inspirational. Thank you so much for taking the time to give me your feedback!

      Love and light,
      Tess

  2. OK, I have been attacking the issue of duplicate content. When I went to Search Masters’ Redirect check, it said that I had eighteen 406 errors. These included index/ and default/, which sounded rather alarming because I wondered what was actually being picked up by Google.

    Then I read that 406 errors probably wouldn’t affect how Google saw the site, and I was even more confused when Webmaster Tools made no mention of any 406 errors but did say that there were 52 404 errors and 852 Not Found, and that I had a warning.

    When I followed the instructions and looked at how many items were indexed, there were only 500-odd and not the 5,000 items with separate individual URLs that I would have expected. So do I have an issue of items not being indexed, duplicates, or what? I am confused, not to say worried.

    My feedback is this. I think what you are doing is brilliant Tess because I am understanding very very slightly some of the issues. No-one has explained them to me before and the web designers have been particularly disinterested in anything to do with SEO believing that Google will find it almost whatever they do. I will continue to try and understand and apply what you are teaching, Tess.

    • Hi Jane,

      It will be a bit overwhelming the first time you look at all the crawl errors in Google Webmaster Tools (GWT). Regarding what the different errors/warnings mean, take a look at https://support.google.com/webmasters/bin/answer.py?hl=en&answer=40132

      Because the description of the 406 error isn’t very clear on the link above, I’ll give you a hint: it has to do with the different versions of your home page returning something other than a 200, 301 or 404 response. This means that you should decide which version of your home page is the correct one (www or non-www, with or without index.php, etc) and everything else should be redirected (301) to that version, or alternatively return a 404 (not found) response. Your web developer should definitely know what to do with these instructions. :)

      Regardless of whether Google picks up the errors in the search results, it’s always a very good idea to sort out the issues listed in GWT. Start with one thing; most of the time it’s something global on your site. This means that many of those errors can be sorted out with just one solution (you’ll be surprised how many errors you will solve if your developer sorts out the issue mentioned above). :)

      In general, don’t be worried, just make it right. Take it slow, and one thing at a time. Do not rush…then you’ll just feel stressed, and it’s not a life or death situation we’re having here. ;)

      Start with one particular issue, and read everything you can about it. If you can not figure out how to solve it, come back here and write me a detailed description of where you see the issue and what you’ve done to try and solve it. I’m sure we’ll figure something out.

      Keep up the good work! Slowly but surely always wins. ;)

      Love and light,
      Tess

    • Hi Krit,

      Thanks for the tip! I updated the url in your comment, as it seems they recently moved the page to a new url.

      The product page doesn’t really say much about the product, but it seems to be a simple canonical plugin supporting a few CCK extensions.

      I’m not adding it to the list in the blog post yet, since I haven’t tried it and you’re the first one recommending it. Would love to hear more feedback on it before I endorse it. :)

      Thanks,
      Tess

  3. Nice article Tess. Do you know if there is a bulk method of doing the ‘Remove pages from Google’s search results’ process as I have a few hundred urls to remove?

    Cheers
    Gavin

    • Hi Gavin,

      Unfortunately there is no bulk method, as far as I know (hopefully I’m wrong).

      It’s really a pity, because it’s a common situation to be in – having a ton of urls to remove…but having to send them one by one.

      Sorry I couldn’t help more, but maybe someone else reading this knows a trick?

      Cheers,
      Tess

  4. Tess, thank you for the details. I tried site:domain.com and Google says they have 23,900 results for my domain. However, I can only see 580 URLs when I go through all the pages.

    Any idea?

    • Hi Bilal,

      If you go to the very last page of the search results, you’ll see something like “Search again with the omitted results included”. Click that and you’ll get most of the pages included.

      The fact that you have so many pages extra points towards a duplicate/irrelevant content issue. Go through the process with SEO Quake (mentioned in this blog post) and you’ll be able to get a list of your pages (not a complete list, but can be used as a good guideline).

      Work your way through your site and search results, as outlined in this blog post. Then leave it for a week or two, to let the search results update. Then do it all over again. At some point you’ll be done, but take it bit by bit to not get overwhelmed.

      I hope this helps? :)

      Cheers,
      Tess
