What is a Robots.txt File?
A robots.txt file is a set of directives that tells search engine crawlers which parts of your website they may crawl. It typically sits in the root folder of your website. When a crawler visits your website, it reads the robots.txt file and crawls the website accordingly. Most modern search engines honor the robots.txt file.
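To make this concrete, here is a minimal Python sketch that uses the standard library's urllib.robotparser to read a robots.txt file and decide whether a URL may be crawled. The robots.txt content, the ExampleBot user agent, and the example.com URLs are made up for illustration; real crawlers such as Googlebot apply their own matching logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""

def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_crawlable(ROBOTS_TXT, "ExampleBot", "https://www.example.com/cgi-bin/form.cgi"))
print(is_crawlable(ROBOTS_TXT, "ExampleBot", "https://www.example.com/blog/post-1/"))
```

The first call reports the URL as blocked because its path starts with /cgi-bin/; the second URL matches no rule and is therefore crawlable.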
What are Invalid Pages and Folders?
Invalid pages and folders are files on your website that are near duplicates of each other. They offer little value to the user and communicate nothing in particular about your website to search engines.
Example: logged-in dashboard CSS files or WordPress comment trackbacks.
Our job as a webmaster is to block all such invalid pages and folders from search engine access.
Why do we do this?
The reason is to optimize crawl budget. As per Google, every website has a limited crawl budget. Our job as webmasters is to make the best use of this limited crawl bandwidth given by Google.
We do this by blocking invalid pages and folders. This lets Google and other search engines spend more time crawling the important pages on our site, so our most recent pages get indexed and our top money pages get recrawled for fresh content.
How to check if all invalid pages and folders are blocked by robots.txt?
Use DeepCrawl to crawl all URLs not yet blocked by robots.txt
DeepCrawl is an SEO tool which crawls your website and finds SEO issues.
To find all invalid URLs not yet blocked by the robots.txt file, we need to let DeepCrawl crawl the website in a certain way.
Once the crawl is done, visit the All Pages section in the sidebar. Click the download button on the right to export all URLs quickly.
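Once you have the exported URL list, grouping it by top-level folder is a quick way to spot folders that generate a suspicious number of pages. The sketch below assumes the export has been reduced to a plain list of URLs; the function name and sample URLs are hypothetical and not part of DeepCrawl.

```python
from collections import Counter
from urllib.parse import urlsplit

def count_by_top_folder(urls):
    """Count URLs per top-level folder, e.g. /tag/ or /search/."""
    counts = Counter()
    for url in urls:
        path = urlsplit(url).path
        segments = [segment for segment in path.split("/") if segment]
        # A first segment counts as a folder when something follows it
        # or when the path ends with a slash; otherwise group under "/".
        if segments and (len(segments) > 1 or path.endswith("/")):
            counts["/" + segments[0] + "/"] += 1
        else:
            counts["/"] += 1
    return counts

# Hypothetical sample of exported URLs.
sample_urls = [
    "https://www.example.com/",
    "https://www.example.com/tag/seo/",
    "https://www.example.com/tag/seo-tips/",
    "https://www.example.com/search/shoes",
    "https://www.example.com/about.html",
]

for folder, total in count_by_top_folder(sample_urls).most_common():
    print(folder, total)
```

Folders with unexpectedly high counts are good candidates for a closer look before you decide whether to block them.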
To get a complete list of folders and pages to be blocked, start by asking yourself these questions:
- Have we launched any widgets or subsections on our pages which create numerous pages on the fly?
- If you have a payment wall, have you disallowed your cart, checkout, wishlist, and thank-you pages?
- Does your site have numerous tags most of which look spammy?
- Does your internal search create many URLs unworthy of crawl and indexation?
- Do you want to stop Google from crawling filter or faceted navigation pages?
- Have we launched any mirror sites in folders?
- Have you blocked your developer or user testing mirror sites completely?
- What are the crawlers you want to stop crawling the website?
- Do we have any login and membership pages which we don’t want search engines to crawl?
- Do any popups or galleries create URLs which search engines don’t need to crawl?
- Are there any query-string-based URLs which you don’t want Google to crawl?
- Have you blocked unnecessary JS, CSS, BMP, PNG, JPG, GIF, and XML files?
- Are there any unnatural URL patterns found in your DeepCrawl audit which need further investigation?
- Are there any URLs being crawled by Google as per your log files which are invalid?
- What are the common files and folders your CMS generates which are not worthy of search engine crawl? Example for WordPress – /cgi-bin/, /xmlrpc.php, /wp-admin/
- Have you disallowed all the URLs above across multiple languages and regions?
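Several of the questions above (internal search, filters, faceted navigation, query strings) come down to finding which query parameters generate many URLs. The following is a rough sketch, assuming you have a plain list of crawled URLs; the function name and sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

def count_query_parameters(urls):
    """Count how many URLs carry each query parameter name."""
    counts = Counter()
    for url in urls:
        query = urlsplit(url).query
        # keep_blank_values=True keeps bare parameters such as "?pt".
        names = {name for name, _ in parse_qsl(query, keep_blank_values=True)}
        counts.update(names)
    return counts

# Hypothetical crawled URLs.
crawled_urls = [
    "https://www.example.com/shoes?color=red&size=9",
    "https://www.example.com/shoes?color=blue",
    "https://www.example.com/search?pt",
    "https://www.example.com/about/",
]

for name, total in count_query_parameters(crawled_urls).most_common():
    print(name, total)
```

Parameters that appear on many URLs but do not change the content in a meaningful way are the usual candidates for a Disallow rule.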
Popular CMS and their typical robots.txt file
How to fix this?
You can fix this issue by using the appropriate robots.txt directives to block crawler access to these invalid folders and pages.
Step 1: Make a list of all folders and pages you want to block by asking yourself the questions mentioned above. Check your log files and DeepCrawl URLs to see which URLs and folders need to be blocked. Once you make a list you can move on to step 2.
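For the log file part of Step 1, a short script can list the paths Googlebot actually requests. This is only a sketch: it assumes your server writes the common "combined" access log format, the sample lines are invented, and it trusts the user-agent string, which can be spoofed.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer and user agent of a
# combined-format access log line (an assumption about your log format).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_paths(lines):
    """Count requested paths for lines whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            counts[match.group("path")] += 1
    return counts

GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

# Invented log lines for illustration.
sample_log = [
    f'66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /tag/seo/ HTTP/1.1" 200 2326 "-" "{GOOGLEBOT}"',
    f'203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /blog/post-1/ HTTP/1.1" 200 5120 "-" "{BROWSER}"',
    f'66.249.66.1 - - [10/Oct/2023:13:57:12 +0000] "GET /tag/seo/ HTTP/1.1" 200 2326 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Oct/2023:13:58:40 +0000] "GET /search?pt HTTP/1.1" 200 812 "-" "{GOOGLEBOT}"',
]

for path, hits in googlebot_paths(sample_log).most_common():
    print(path, hits)
```

Paths that Googlebot requests often but that you consider invalid belong on the list you carry into Step 2.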
Step 2: Use the following directives to block these pages.
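Disallow: This directive blocks search engine crawlers from certain URLs and folders. For example, Disallow: /wp-admin/ blocks everything under the /wp-admin/ folder.
Allow: This directive lets the crawler access a certain folder or URL. For example, Allow: /wp-admin/admin-ajax.php lets crawlers fetch that one file even when /wp-admin/ is disallowed.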
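The two directives can be tried out together with Python's standard urllib.robotparser. This is a sketch with a hypothetical ExampleBot user agent and example.com URLs. Note that Python's parser applies the first rule that matches, so the narrower Allow line is listed first here; Google picks the most specific rule regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical WordPress-style rules: open one file inside a blocked folder.
RULES = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rule_parser = RobotFileParser()
rule_parser.parse(RULES.splitlines())

def can_crawl(path):
    """Check a path on a made-up host against the rules above."""
    return rule_parser.can_fetch("ExampleBot", "https://www.example.com" + path)

for path in ["/wp-admin/admin-ajax.php", "/wp-admin/options.php", "/blog/"]:
    print(path, "allowed" if can_crawl(path) else "blocked")
```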
Wildcards: Wildcards let you give more specific instructions to Google.
- * matches any sequence of characters. For instance, if you want to stop search engines from crawling all PHP pages, a directive like Disallow: /*.php blocks every URL whose path contains .php.
- $ designates the end of the URL. If you want to block all internal search page URLs which end with ?pt, use Disallow: /*?pt$. Without the *, Disallow: /?pt$ would match only the single URL /?pt.
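To see how the two wildcards behave, the sketch below translates a robots.txt path pattern into a regular expression and tests a few paths against it. It is an illustration of the matching rules described above, not Google's actual implementation, and the helper names are made up.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    "*" becomes "any sequence of characters"; a trailing "$" anchors
    the pattern to the end of the URL. Everything else is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    regex = ".*".join(literal_parts)
    return re.compile(regex + ("$" if anchored else ""))

def pattern_matches(pattern, path):
    """Return True if the pattern matches from the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None

checks = [
    ("/*.php", "/blog/page.php?id=1"),
    ("/*.php", "/index.html"),
    ("/*?pt$", "/search?pt"),
    ("/*?pt$", "/search?pt=1"),
    ("/?pt$", "/search?pt"),
]
for pattern, path in checks:
    print(pattern, path, pattern_matches(pattern, path))
```

The last check shows why the * matters: /?pt$ on its own does not match /search?pt, because nothing in the pattern accounts for the characters between the leading slash and the question mark.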
How to block a subfolder in a folder?
If you want to block all member profiles listed under the folder /members/profiles/ from search engines but allow access to everything else under /members/, use a Disallow directive for the subfolder first, followed by an Allow directive for the parent folder.
Following is Google’s official guide on directives.
The path indicates the folder or set of URLs the rule covers. Each path is mapped to the URLs it matches, that is, the URLs the rule will be applied to. The third column shows which URLs the path does not match.
Note that when Disallow and Allow directives both match a URL, the most specific rule, meaning the one with the longest matching path, is honored.
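The "most specific rule wins" behavior can be sketched in a few lines of Python using the /members/ example. This simplified version handles plain path prefixes only (no wildcards), and it assumes that a tie between equally long rules goes to Allow; the function and rule list are illustrative, not Google's code.

```python
def is_allowed(rules, path):
    """Apply the longest matching rule; Allow wins a tie in length.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". A path that matches no rule is allowed.
    """
    best = None
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = best is None or len(prefix) > len(best[1])
        tie_for_allow = (
            best is not None
            and len(prefix) == len(best[1])
            and directive == "allow"
        )
        if longer or tie_for_allow:
            best = (directive, prefix)
    return best is None or best[0] == "allow"

# Block member profiles but keep the rest of /members/ crawlable.
member_rules = [
    ("disallow", "/members/profiles/"),
    ("allow", "/members/"),
]

for path in ["/members/profiles/jane", "/members/directory", "/about/"]:
    print(path, "allowed" if is_allowed(member_rules, path) else "blocked")
```

Because /members/profiles/ is the longer prefix, it overrides the broader Allow: /members/ rule for profile URLs, while other /members/ URLs stay crawlable.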
Step 3: Upload the new robots.txt file to the root folder of your website. A robots.txt file applies only to the host it is served from, so if you are running an online store at store.mydomain.com, upload the file to that subdomain. You should then be able to find your robots.txt at store.mydomain.com/robots.txt.
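As a quick check of where a crawler will look, this small sketch derives the robots.txt location for any page URL. The function name and the page URLs are hypothetical; only the store.mydomain.com host comes from the example above.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://store.mydomain.com/cart/checkout?step=2"))
print(robots_txt_url("https://www.mydomain.com/blog/post-1/"))
```

The two calls return different locations, which illustrates that the store subdomain needs its own robots.txt file.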
Step 4: Inform Google about your updated robots.txt through webmaster tools.
Log in to Google Search Console and visit robots.txt in the sidebar.
Click on “submit”
Click on “Submit” on the pop-up to let Google know that your robots.txt file has been updated.
Step 5: Test that the folders and URLs you intended to block are actually blocked. Use the Google robots.txt Tester for this purpose.