Wordpress SEO : using robots.txt to avoid content duplication
By fiLi • Mar 10th, 2007 • Category: SEO, Wordpress
Google really doesn’t like content duplication on sites, so it is advisable to prevent the Google crawler from reaching the same content on your site through more than one URL. Since Wordpress offers many ways of reaching your content, you should block certain URLs and URL paths by defining the right robots.txt.
Here’s my suggestion for the Wordpress robots.txt :
User-agent: Googlebot
# Disallow all directories and files within
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
# Disallow parsing individual post feeds, categories and trackbacks..
Disallow: */trackback/
Disallow: */feed/
Disallow: /category/*
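To see what these rules would actually catch before uploading them, a small script can help. The following is only a rough sketch: it assumes Google-style pattern matching (where * matches any run of characters and a trailing $ anchors the end of the URL), the function names are made up for illustration, and the sample paths are hypothetical. Python's built-in urllib.robotparser is not used here because it does not interpret these wildcards.

```python
import re

# The Disallow patterns from the robots.txt suggested above.
DISALLOW = [
    "/cgi-bin/", "/wp-admin/", "/wp-includes/",
    "/*.php$", "/*.js$", "/*.inc$", "/*.css$",
    "*/trackback/", "*/feed/", "/category/*",
]

def rule_to_regex(rule):
    """Translate one pattern into a regex (assumed Google-style semantics):
    '*' matches any run of characters, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def blocked_by(path):
    """Return every rule whose pattern matches the start of the given path."""
    return [rule for rule in DISALLOW if rule_to_regex(rule).match(path)]

if __name__ == "__main__":
    # Hypothetical paths; substitute URLs from your own blog.
    for path in ["/", "/index.php", "/2007/03/my-post/",
                 "/2007/03/my-post/feed/", "/category/seo/"]:
        print(path, "->", blocked_by(path) or "allowed")
```

Running it on a handful of your own URLs is a quick way to spot surprises, such as /index.php being blocked by the /*.php$ rule while / stays reachable.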
Be extremely careful when implementing this. For example, some Wordpress installations have Gallery2 embedded, which, for reasons unknown, likes to run with main.php in the URL (even with URL rewriting enabled!), so the .php rule above would block those pages. Furthermore, if your blog is in a sub-directory of your domain and you change the robots.txt for the entire domain, note that you might block essential pages in other sub-directories. I imagine this is the reason why robots.txt isn’t included as part of the default Wordpress installation.
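For a blog living in a sub-directory, one possible approach is to prefix every rule with that directory so the rest of the domain is left alone. The sketch below is an illustration only: the /blog prefix and the helper names scope_rules and render are assumptions, not anything Wordpress provides.

```python
# Rules written for a blog at the domain root (taken from the list above).
ROOT_RULES = [
    "/cgi-bin/", "/wp-admin/", "/wp-includes/",
    "/*.php$", "/*.js$", "/*.inc$", "/*.css$",
    "*/trackback/", "*/feed/", "/category/*",
]

def scope_rules(rules, prefix):
    """Prefix each rule with the blog's sub-directory (hypothetical '/blog')."""
    prefix = prefix.rstrip("/")
    scoped = []
    for rule in rules:
        if rule.startswith("*"):
            # '*/feed/' becomes '/blog/*/feed/' so it cannot match other directories.
            scoped.append(prefix + "/" + rule)
        else:
            scoped.append(prefix + rule)
    return scoped

def render(rules, agent="Googlebot"):
    """Build the robots.txt text for a single user-agent."""
    lines = ["User-agent: " + agent]
    lines += ["Disallow: " + rule for rule in rules]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render(scope_rules(ROOT_RULES, "/blog")))
```

Note one trade-off of this choice: under Google-style matching, a scoped rule such as /blog/*/feed/ would cover individual post feeds but not /blog/feed/ itself, so check the result against your own URLs.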
As explained by my fellow bloggers who trackbacked, you also need to take care with the agents you block, and it would be wise to target bots specifically instead of using the problematic * symbol in the "user-agent" field.
AdSense and robots.txt (Part 2)…
In AdSense and robots.txt (Part 1) I described the basic syntax for the robots.txt file. Today we look at how the robots.txt file can affect your AdSense income if you’re not careful with how you declare the exclusion rules.
Mediabot: The AdSense Cra…
You’re absolutely right. Thanks for the helpful comments.
I tried your suggestion about the robots.txt. I am not sure if it was the reason, but my Google PageRank went from 2 to 4. The only other reason I can think of is that I moved my blog from the URL http://www.blog.frostfox.com to http://www.frostfox.com/blog.
Consolidating your incoming links ups your page rank score. Google.com/webmaster has some tools to help you with consolidation….
I generally avoid any exclude tags on the site…The dangers outweigh the advantages for me…Any case studies on this one?
Frostfox - Yeah, I believe Lonnie is right, but you should know that PageRank only updates once every 3-4 months, so it’s not something that you get immediate results on. But redirecting several reachable URLs into one is very good practice, especially if Google penalized you for duplicate content.
Lonnie - Yeah, there are, and those are all over the SEO blogosphere. You can start with SEOBook and see his self-report as well as the incoming trackbacks. As long as you closely monitor the robots.txt performance in Google Webmaster Tools, I believe you’ll be all right.
I have been playing around with the Google webmaster thing for a bit now.
Two things about your robots.txt: first, you have a spelling mistake, “indididual” should be “individual”; second, isn’t disallowing files that end with .php a bad idea? Your index for your site is index.php.
Thanks for the spelling correction
:$
Your question about index.php is actually what content duplication is about. Some blogs allow the exact same page to appear through index.php and their main blog path “/” and that’s something you want to avoid.
Nice post dude.. You will want to check out Wordpress robots.txt for more examples.
[...] Google doesn’t like duplicate text and lowers your PR on pages that it indexes several times; there are also pages that you really don’t want Google to index at all. As a result, you definitely want to prevent Google from reaching pages through more than one address (Wordpress actually allows this), and you can do that with the robots.txt file. Tips for changing the file can be found in the post “How to use the robots.txt file to avoid content duplication”. [...]
Yeah, I later found your post through various bloggers on the net (JohnTP etc.). That’s a good comprehensive post you wrote there…
[...] Using Feedburner? then you should disable your feeds from search engine indexing. After you’re done tweaking your robots.txt to avoid content duplication and sorting your .htaccess to setup all the redirects you should also take care of the content duplication happening with your Feedburner feed. [...]
Hey Fili - thanks for the advice; unfortunately by blocking all PHP files I stopped Google from accessing my home page (the Google Webmaster Tools said that
Glad I could help.
I believe this next plugin will take care of that problem for you (and a few other duplicate content issues) :
Permalink Redirect
[...] following fiLi’s advice for using robots.txt to avoid content duplication, I started to edit my robots.txt file. I won’t list the file contents here - suffice to say [...]
Thanks again fiLi - that plugin looks really useful. M
[...] just about the same thing, but might be a little more tricky to set up correctly. Check out “Wordpress SEO : using robots.txt to avoid content duplication” for [...]
[...] Robots.txt : Either robots.txt doesn’t exist at all (LT), it has something that does nothing (MKS), or it’s useless (TCI). Use robots.txt to avoid content duplication. [...]
[...] kinda like that, but it doesn’t seem to cover everything. Fili’s Tech has an article on wordpress seo for wordpress too, and I like his ideas. So I ended up with something like this: # Disallow all directories and [...]
[...] Wordpress SEO : using robots.txt to avoid content duplication [...]
Great blog and nice post Fili.. I like how you are keeping it simple. I recently changed my robots.txt from the WordPress robots.txt example to a simpler version; perhaps it’s time for a follow-up article?
[...] I just came across a robots file mentioned on filination.com, which also gives a brief explanation. Anyone who is new to the robots.txt file can use it as a reference: User-agent: Googlebot [...]
Hey you should check out the Updated WordPress SEO robots.txt!
[...] a combination between the files on this robots.txt post at Connected Internet and this robots.txt post from Filination. If you use these you’ll manage to maximise your Page Rank on all of your [...]
This robots.txt file looks very simple, and thanks for some of the points you made. Now I have to go back through all the sites I visited for the robots.txt file and create my own file based on the convincing information they provide. Hopefully, after experimenting a little, I will find out which is best for a Wordpress site. Thank you for the information.
If an RSS feed is the Yahoo backdoor, is a Blog Googles?…
Though the answer is in a book I wrote this July, the question is still asked of me repeatedly. Why does it work for some sites and not others? And how come some blogs get indexed in a day and then are dropped, and others stay in Google indefinitely?…
[...] one suggestion for the WordPress robots.txt from Fili’s Tech: User-agent: [...]
[...] to the last line in the file. There have also been other bloggers such as Everton, 20 steps and FiLi who have created a robots.txt and seen a marked increase in their blog traffic. Trust me, Google [...]
[...] robots.txt - again, using this file can help you avoid duplicate content problems. It is worth reading Fili’s post on the subject. [...]
[...] Tip: The robots.txt file controls what search engine robots may or may not index on your blog. This article contains the ideal robots.txt for a Wordpress blog. Read more… [...]
Many thanks for your precious info..
Grazie mille and greetings from Italy.
[...] WordPress SEO: using robots.txt to avoid content duplication [...]