Drupal SEO - using robots.txt to avoid content duplication

By fiLi • Feb 22nd, 2007 • Category: Drupal, SEO

Google really doesn’t like duplicate content, so it is advisable to prevent the Google crawler from reaching the same content on your site through more than one URL. Since Drupal offers many ways of reaching the same content, you should block certain URLs and URL paths by defining the right robots.txt.

Up till Drupal 5, the Drupal installation didn’t come with a built-in robots.txt, so make sure you go over the bundled robots.txt if you’re running Drupal 5 or later, or add a robots.txt file if you’re running a version lower than 5.

Here’s my suggested robots.txt, which makes minor adjustments to the one that comes with Drupal 5 (notice, for example, that it disallows feeds and taxonomy pages):

User-agent: Googlebot
Crawl-delay: 10

# Directories
Disallow: /tracker/
Disallow: /xtracker/
Disallow: /user/
Disallow: /book/export/
Disallow: /forward/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
Disallow: /comment/
Disallow: /taxonomy/
Disallow: /popular/
Disallow: */feed*
Disallow: */comment/reply/
Disallow: /comments*
Disallow: /frontpage*
Disallow: /aggregator*
Disallow: /aggregator2*
Disallow: *?sort=asc&order=Time*
Disallow: *comment-*

# Files
Disallow: /rss.xml
Disallow: /feed
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
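
Before deploying rules like these, it can help to check which paths they would actually block. Here is a minimal Python sketch, assuming Google-style wildcard matching ('*' matches any run of characters, a trailing '$' anchors the end of the URL, everything else is a prefix match). The names rule_to_regex and is_blocked are made up for illustration, and the sketch ignores Allow rules and per-agent groups, so treat it as a rough sanity check rather than a replacement for testing in Google's webmaster tools.

```python
import re


def rule_to_regex(rule):
    """Translate one robots.txt path rule into a compiled regex.

    Assumption: Google-style matching, where '*' is a wildcard and a
    trailing '$' means "the URL must end here".
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each '*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(path, disallow_rules):
    """Return True if any non-empty Disallow rule matches the path."""
    return any(
        bool(rule) and rule_to_regex(rule).match(path) is not None
        for rule in disallow_rules
    )


# A few rules taken from the robots.txt above.
RULES = ["/user/", "/taxonomy/", "*/feed*", "/comments*", "*?sort=asc&order=Time*"]

if __name__ == "__main__":
    samples = [
        "/user/login",
        "/blog/feed",
        "/taxonomy/term/5",
        "/taxonomy_menu/3/93/1/4",  # not covered by the /taxonomy/ prefix
        "/my-first-post",
    ]
    for path in samples:
        print(path, "blocked" if is_blocked(path, RULES) else "allowed")
```

Note that plain prefix rules match literally, so Disallow: /taxonomy/ does not cover a path such as /taxonomy_menu/.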

18 Responses »

  1. [...] If you define feed aliases, remember that they will contain duplicate content for search engines that might hurt your SE rankings, so you’ll have to add a disallow rule for those in your Drupal robots.txt. [...]

  2. This is definitely helpful for some client of mine currently on drupal.

    Thanks.

  3. [...] Drupal SEO - using robots.txt to avoid content duplication | fiLi’s tech (tags: drupal seo) [...]

  4. Hello, thanks for this very specific advice, nice to get someone who doesn’t just tell you to use a module but then not recommend how.

    BUT I have a concern, I had pathauto and path enabled from the start (I am just new to this) and clean URLs, but I also have taxonomy menu, which I am very pleased with.

    But so far when I navigate (I have good taxonomy-driven menus on every page), and I have only added two nodes (stories) so far, if I go to one from the menu the URL is
    http://www.morethanoil.com/taxonomy_menu/3/93/1/4
    If I go from the frontpage link the URL is
    http://www.morethanoil.com/lowdown-gas-coal-concept/25/jun/2007

    I would like it if both links were the latter, but I haven’t fathomed that out. However, if I were to Disallow: /taxonomy/, as you suggest, won’t it screw up the recognition of clicks inside my site via menu navigation?

    I’ll appreciate any comment you have,
    thanks
    Zaph

  5. Thanks…

    Robots.txt has nothing to do with the recognition of clicks. It doesn’t affect how your site works for visitors at all. It just asks the Google crawler not to crawl these pages. You should be fine with this robots.txt.

    BTW - I would drop all dates from any items in your pathauto. It looks bad, it complicates things later on, and it has no real added value.

  6. Oh good, so I’ll go ahead with that. And I already took your advice to drop the dates, thanks.

    I am also compiling ‘steps taken’ help info and will credit you with considerable help here if I publish it anywhere.

    Re clicks within sites, and user bookmarks, do google and other SEs recognise or count that in some way at all?

    Thanks
    Zaph
    PS: If you read fiction, I’d like to send you something by snail mail in appreciation of your helpful site. Let me know, zaph@morethanoil.com

  7. Nope, as far as I know no major SE player uses clicks or user bookmarks for SERP rankings.

  8. To block all the Drupal feeds you need to do something like this:
    Disallow: /*/feed$

    That would still leave the main RSS feed exposed to search engines (/rss.xml by default). Blog search engines want your main feed…

  9. yeah, you’re right about */feed$, but I also hide the rss.xml from the Googlebot. I updated the posted robots.txt to fit the one I have on my sites.
    Thanks for the comment.

  10. I used to block even the main feed from Googlebot, but I wasn’t able to find out if Google Blogsearch has its own spider. Google Blogsearch is looking for the main RSS feed ( http://www.google.com/help/about_blogsearch.html ) so I started unblocking it. To avoid having rss.xml duplicate the front page content, I create a custom front page and block /node$ from robots (/node is a duplicate of the default front page).

  11. How do I stop paging URLs from getting indexed? I’m getting a duplicate content error because I have taxonomy menu for taxonomy navigation. Also, the content is displayed on the homepage as well as at the content link itself. Paging links look like [domain name]/[category name]/[article title]/?page=2, and these links are also getting indexed in search engines. Can we write robots.txt so that it does not crawl the paging links, i.e. so that paging links are not followed? Currently we have patched the paging to work with POST rather than the GET method.

  12. Dheeraj - robots.txt isn’t the only way to prevent indexing. You can also insert a noindex meta tag into the pages you don’t wish to have indexed:

    <meta name="googlebot" content="noindex,nofollow" />

    For paging, I believe you can just do “Disallow: /*?page=”. I suggest you experiment with robots.txt in the Google webmaster tools to see which rules block the pages you don’t want indexed.

    Good luck, let me know if that helped.

  13. If you block *all* of your Drupal paging URLs your search engine traffic might drop. Search engine crawlers won’t be able to reach all of your content. Better to only selectively block the specific paginated sections that are causing problems…

  14. Drupalzilla - what’s in paged URLs that isn’t in direct nodes? what kind of unique content do the paginated sections provide?

  15. Unless you have custom Views set up, spiders will not be able to find all of your content if you block paginated content.
    When you create nodes and promote them to the front page they will paginate like this: http://example.com/node?page=2.  
    If you don’t promote posts to the front page, you can access them through taxonomy terms, but those are paginated also, for example, http://example.com/category?page=2.
    So if you block off all paginated pages, search engines will only be able to find the first 10 posts on your home page and the first 10 posts under any taxonomy term.
    There are other ways for nodes to be displayed, but they are all paginated with ?page=…   
    Post #20 in category X will be blocked from search engines unless you have created a specific link to it from another page.  I’ve blocked dynamic pages on a Drupal site as a test (two different sites). 
    As soon as I ended the experiment, search engine referrals went up significantly.  I usually create a custom front page on my Drupal sites, and block /node (which includes all front page pagination).  I let crawlers get to the content through taxonomy terms pages because they are listed with other posts on a similar keyword theme.

  16. [...] SEO : The top Drupal SEO modules and how to use them“, make sure you’re “Drupal SEO: using robots.txt to avoid content duplication” and add a few more general website SEO [...]

  17. Not only can you use robots.txt to prevent content duplication, but you can use your .htaccess file to resolve any canonical domain issues, Google Webmaster Central to pick your preferred domain and the Global Redirect module to get rid of trailing slashes.

    Hope that helps!

  18. If you block *all* of your Drupal paging URLs your search engine traffic might drop.
    That is right Drupalzilla.
    And if you want to increase it, just continue to use Drupal :)
