Avoiding (and Fixing) Duplicate Content With .HTACCESS
So you just got done putting together the greatest website in the world, chock full of awesome share-worthy content and cutting-edge graphics. You submit your website to all the right places including Google, Bing and all the right social media websites. One month later, you aren't ranking for anything, and have no idea why. One reason could be: you have duplicate content hiding somewhere on your site and don't know it.
What is Duplicate Content?
Duplicate content generally refers to substantive blocks of content within or across domains that either completely match other content or are appreciably similar. Mostly, this is not deceptive in origin. - Source
Google knows that not all duplicate content is deceptive, but the search engine can still make mistakes and penalize you for having duplicate content on your site (more on that in another article). Common CMSs like Joomla and WordPress have code written into their backend that generates additional URLs all on their own without politely letting you know. So how do you fix duplicate content you can't even see or know exists? We're getting there. First, let's understand what .htaccess is and what it does.
What is an .HTACCESS File?
An .htaccess file's technical definition is "a directory-level configuration file supported by several web servers, that allows for decentralized management of web server configuration." Sound technical? We said it was, but it doesn't have to be. The real-world definition for an .htaccess file would be something like "the text file that tells the server how to handle requests for your site, including which page to send back for a given URL." Sound better? You're welcome ;)
How Do I Edit My .HTACCESS File?
Editing your .htaccess file is simple. You can use Dreamweaver, Notepad (Windows), TextEdit (Mac) or any plain-text editor of your choice. You will need an FTP (file transfer protocol) program installed on your computer to download the .htaccess file from your server. Some of the FTP programs we recommend are WinSCP for Windows PCs, or FileZilla and Transmit for Mac users. Once you have an FTP program installed, it's time to locate the file, download it and begin editing. Log in to FTP using the server username and password for your website. The file normally lives in your site's root folder; because its name starts with a dot, you may need to turn on "show hidden files" in your FTP program to see it. If you do not have your login information, contact your web host for details. NOTE: Everyone should have the FTP username and password for their website. Once you get this from your host, save it safely and securely on your computer.
Time to Edit!
OK. If you made it this far and are still with us after all the FTP and server nerd talk, great job! You are a real trooper and deserve a pat on the back *virtual back pat*. Don't worry, we will explain in detail every rule you are about to add to your .htaccess file.
IMPORTANT! - Modifying your .htaccess file without care and attention to detail can redirect users to the wrong areas of your site, or prevent your site from displaying at all. Before adding anything to or removing anything from your .htaccess file, we highly recommend downloading and storing a backup of this file. Create a folder on your desktop called "old htaccess" and save it there.
Still with us? Let's get to work.
1. Force "www" On Your Domain
Your website should always go to one place or the other. Having a website that answers on multiple top-level versions is not only disorganized, but can also have SEO consequences. Let's use the home page as an example. Type your existing domain name into the address bar of a browser. Try to access your site with both the "www" and "non-www" versions, meaning type in www.yourdomain.com and yourdomain.com (replace yourdomain with your actual domain name). If you can access both versions, so can a search engine, and it's possible it is counting the two versions as duplicates. If this happens on the home page, all the other pages in your site are affected too, which means you could have hundreds or thousands of pages that are exactly alike on two different URLs. See below for the code to force www in .htaccess. This code should be installed below the RewriteBase line in your file (the file also needs a RewriteEngine On line somewhere above it). If your file does not have a RewriteBase line, contact your web host for further instructions on where to add the rules below.
RewriteCond %{HTTP_HOST} !^www\.YOURDOMAIN\.com$ [NC]
RewriteRule ^(.*)$ http://www.YOURDOMAIN.com/$1 [R=301,L]
Look confusing? Don't worry, you made it this far. There's no turning back now. So looking at the above code, there is a good chance you are wondering "What the hell am I looking at?" If you look closely, each piece is telling the server something different. RewriteCond is the condition: it checks the host name in the request, and the leading ! means "does not match", so the rule only fires when the host is anything other than www.YOURDOMAIN.com. The [NC] flag makes that comparison case-insensitive. RewriteRule is the action: ^(.*)$ captures whatever path was requested, and the $1 at the end of the new URL is a placeholder that gets replaced with that captured path, so every page on the site is redirected to its www twin rather than to the home page. R=301 tells browsers and search engines that this is a permanent change and that they should index this version, and L tells the server to stop processing further rewrite rules for this request. See, it's not THAT confusing. Before we move on, you did remember to replace the YOURDOMAIN text in the code above with your actual domain name, correct? Just checking :) Onwards, brave souls!
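For orientation, here is a rough sketch of how the top of the file might be laid out once this rule is in. It assumes your site lives at the root of the domain (hence RewriteBase /) and that your host has mod_rewrite enabled. The IfModule wrapper and the commented-out non-www alternative are our additions for illustration, not something your file must contain. Pick one direction only.

```apache
# Hypothetical skeleton; adapt it to whatever your existing file already contains.
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /

    # Option A: force www (the same rule as above).
    RewriteCond %{HTTP_HOST} !^www\.YOURDOMAIN\.com$ [NC]
    RewriteRule ^(.*)$ http://www.YOURDOMAIN.com/$1 [R=301,L]

    # Option B: force non-www. Use this INSTEAD of Option A, never both,
    # or the two rules will bounce visitors back and forth.
    # RewriteCond %{HTTP_HOST} ^www\.YOURDOMAIN\.com$ [NC]
    # RewriteRule ^(.*)$ http://YOURDOMAIN.com/$1 [R=301,L]
</IfModule>
```

If your site is served over HTTPS, the http:// in the target URL would need to become https://; we kept http:// here only to match the rule above.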
2. Remove Index.php From Within the URL
Index.php is a tricky file. The index file is what the server hands out by default when someone requests a folder on your site, so it receives the very first request and decides what to load next. Since it sits in the root directory (highest in the file hierarchy) of your site, search engines can pick up this file and display it as part of your URLs. Why is this bad? Again, we go to the test of your website's pages. Try to access any page other than the home page of your site, but this time with index.php in the middle of the URL. Example: www.YOURDOMAIN.com/index.php/SUBPAGE and www.YOURDOMAIN.com/SUBPAGE. Do both versions display the same page as if nothing happened? If they do, apply the following code in your .htaccess file, beneath the one you just added.
# remove index.php within the URL
RedirectMatch permanent index\.php/(.*) http://www.YOURDOMAIN.com/$1
There are a few exceptions to this rule. If your site is set up using HTML instead of PHP, you will need to edit the code above to the following:
# remove index.html within the URL
RedirectMatch permanent index\.html/(.*) http://www.YOURDOMAIN.com/$1
Adding the rule above will stop index.php or index.html from appearing in your URLs and creating duplicate pages. It is important to know that the above rule may not work in all CMS and cart-based websites, so test to see if the redirect worked by opening a page that displayed the index in its URL and refreshing the page to ensure it is gone and everything is working correctly. If not, deleting the rule will set everything back the way it was (clear your browser cache before re-testing, since browsers remember 301 redirects).
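If the RedirectMatch line misbehaves on your CMS, one alternative worth trying is the same redirect written with mod_rewrite, so that it is processed together with the other rewrite rules in the file. This is only a sketch, assuming your duplicate URLs look like /index.php/SUBPAGE. The THE_REQUEST check is our addition: it makes the rule react only to URLs a visitor actually requested, not to internal rewrites the CMS performs behind the scenes.

```apache
# Alternative to the RedirectMatch line above (use one or the other, not both).
# Only fire when the browser itself asked for /index.php/...
RewriteCond %{THE_REQUEST} \s/index\.php/ [NC]
# Capture everything after index.php/ and redirect to the clean URL.
RewriteRule ^index\.php/(.*)$ http://www.YOURDOMAIN.com/$1 [R=301,L]
```

For an HTML site, the same idea would apply with index\.html in both lines. As with every rule here, test it and remove it if your pages stop loading.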
3. Remove Index.php From Root URL
If you were paying close attention in example number two, we stated that you should look at the subpages of your site, not the home page. The rule we are going to add now will get rid of index.php from the home page of your site, and kill the chance of your home page having duplicates.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html) [NC]
RewriteRule ^index\.php$ http://www.YOURDOMAIN.com/ [R=301,L]
...and of course the index.html version:
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html) [NC]
RewriteRule ^index\.html$ http://www.YOURDOMAIN.com/ [R=301,L]
Be sure to test that your site is functioning normally after adding whichever version you need. If it is not, delete the rule and contact your web host to have them create a custom .htaccess rule for your site.
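A word on what the condition is doing, since it looks stranger than it is. THE_REQUEST holds the raw request line the browser sent (something like "GET /index.php HTTP/1.1"), so the redirect only fires when a visitor really asked for the index file, and not when the server quietly serves index.php for a request to /. Without that check the redirect could loop forever. Because the condition already accepts both php and html, the two versions could probably be merged into one, as in this sketch (our own consolidation, not a required change):

```apache
# Fire only when the visitor literally requested /index.php or /index.html.
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html) [NC]
# One rule for both file types: send the visitor to the bare home page URL.
RewriteRule ^index\.(php|html)$ http://www.YOURDOMAIN.com/ [R=301,L]
```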
4. Remove index.php at the End of the URL and Change to /
This one isn't a requirement, but can help if the above two rules do not work. Think of it like a failsafe, along with helping to let users know they are on the actual top level URL of a page.
RewriteCond %{THE_REQUEST} ^GET\ /.*/index\.(php|html)\ HTTP
RewriteRule (.*)index\.(php|html)$ /$1 [R=301,L]
The .htaccess code above should work for both the html and php versions. Remember, testing is your friend.
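Here is the same rule again with comments, in case you want to see what each part does. The only change is (GET|HEAD) in place of GET, which is our own tweak so that tools sending HEAD requests get the same redirect; other request types such as form submissions are deliberately left alone. Treat it as a sketch to test, not a drop-in guarantee.

```apache
# Match the raw request line, e.g. "GET /SUBPAGE/index.php HTTP/1.1".
# (GET|HEAD) is our tweak; the rule above matches GET only.
RewriteCond %{THE_REQUEST} ^(GET|HEAD)\ /.*/index\.(php|html)\ HTTP
# $1 is everything before the index file, so /SUBPAGE/index.php becomes /SUBPAGE/
RewriteRule (.*)index\.(php|html)$ /$1 [R=301,L]
```

Note that the condition requires at least one folder before the index file, so this rule covers subpages only; the home page is handled by the rule from step 3.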
5. Removing HOME and HOME.html from the URL
When using HTML for your site, often the home page is called "home.html". Without the proper redirects, this version can get indexed alongside the plain domain URL that has no home.html at the end. What does that mean? More duplicate pages on your site = bad for SEO. The code below is for websites coded in HTML:
RewriteCond %{HTTP_HOST} ^www\.YOURDOMAIN\.com$
RewriteRule ^home\.html$ http://www.YOURDOMAIN.com/ [R=301,L]
CMS-based websites such as WordPress and Joomla are just as bad. When you create a post, article, page or menu item called home, the CMS will actually use that name in the URL of your site. Even if you don't manually set it up, the template you install could already have a home page set to display with the home text in the URL. See below for the CMS .htaccess rules to remove /home from the URL.
RewriteCond %{HTTP_HOST} ^www\.YOURDOMAIN\.com$
RewriteRule ^home$ http://www.YOURDOMAIN.com/ [R=301,L]
If you look closely, you can see that both sets of code start with a RewriteCond on the www host name. That means they only fire for requests that arrive on www.YOURDOMAIN.com; visitors on the non-www version are first sent to www by the rule from step 1 and then caught here, so either way the server only ends up displaying one page. It's OK, you can thank us later.
So there you have it! You made it to the end, and defeated a lot of duplicate content along the way! If you are worried about whether you have the right code installed, we have included a sample from our own website's .htaccess file, to help you see what it should look like in the end.
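If you would rather catch the www and non-www hosts, and both /home and /home.html, in a single rule, something along these lines may work. It is a sketch of our own, assuming the same YOURDOMAIN placeholder as above; the optional (www\.)? group and the optional (\.html)? group are what widen the match.

```apache
# Matches both YOURDOMAIN.com and www.YOURDOMAIN.com, in any letter case.
RewriteCond %{HTTP_HOST} ^(www\.)?YOURDOMAIN\.com$ [NC]
# Matches /home as well as /home.html and sends both to the home page.
RewriteRule ^home(\.html)?$ http://www.YOURDOMAIN.com/ [R=301,L]
```

To save non-www visitors an extra hop, this pair would need to sit above the force-www rule from step 1; placed below it, it still works, just with two redirects instead of one.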
RewriteCond %{HTTP_HOST} !^www\.webdesignandcompany\.com$ [NC]
RewriteRule ^(.*)$ http://www.webdesignandcompany.com/$1 [R=301,L]
##### remove index.php within the URL #####
RedirectMatch permanent index.php/(.*) http://www.webdesignandcompany.com/$1
##### remove index.php at the end of the URL and change to / #####
RewriteCond %{THE_REQUEST} ^GET\ /.*/index\.(php|html)\ HTTP
RewriteRule (.*)index\.(php|html)$ /$1 [R=301,L]
##### Remove index.php from root URL #####
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /index\.(php|html) [NC]
RewriteRule ^index\.php$ http://www.webdesignandcompany.com/ [R=301,L]
Like this post? Follow me on Google+ at plus.google.com/+DavidKley/ for more SEO tips and Content Marketing advice.