A sitemap is a list of URLs that tells search engines about all the pages on your website. Sitemaps help search engine crawlers access deep-level pages on your site. Sitemaps come in two forms –
- XML Sitemap – A sitemap with a .xml extension. To inform search engines about an XML sitemap, submit it via Google Search Console and Bing Webmaster Tools. Example: https://www.ecommerceyogi.com/sitemap_index.xml
- HTML Sitemap – A sitemap which is an HTML page with links to all other pages on the site. An HTML sitemap is often called a Site Directory. Search engine crawlers visit this page and crawl through the links. Example: https://www.zomato.com/directory
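For reference, an XML sitemap follows the sitemaps.org protocol. A minimal file (with made-up example URLs) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/products/</loc>
  </url>
</urlset>
```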
What is a clean sitemap?
A clean sitemap contains only valid URLs that you want search engines to index. Google assigns every website a crawl budget. A well-optimized website uses this limited crawl budget effectively by serving only worthy pages in its sitemap. To do so, you must remove pages and URLs that don't serve any purpose.
A clean sitemap is devoid of the following URLs –
- Remove 4XX or expired content – A healthy sitemap removes expired content from itself automatically. Serving broken links in your sitemap only gives search engines dead ends and roadblocks that offer no content value.
- Remove 5XX error pages – A 5XX error is thrown when a server fails to deliver content. Either fix these issues on your site or remove the pages from your sitemap and stop linking to them.
- Replace 3XX redirect pages – When you serve 301 and 302 redirects in your sitemap, search engine crawlers have to perform multiple hops before reaching the destination page. This depletes your crawl budget, which could mean crawlers miss other valuable pages on your site. When a URL sits in a chain of redirects, crawlers might stop following it after a few hops, leaving the destination page undiscovered. Replace these 3XX redirect URLs with their destination URLs.
- Remove non-canonical URLs – A canonical URL is the original source of content, while a non-canonical URL is a duplicate page that points to the original. You want to send search engine crawlers to the canonical URL, since search engines index only the original source of content.
- Remove URLs with a noindex tag – When you don't want search engines to index a piece of content on your site, you use the "noindex" tag. The primary objective of a sitemap is to give search engines valuable, index-worthy content. A sitemap should list only the pages you want search engines to crawl and index, so remove all pages that carry a "noindex" tag.
- Remove URLs blocked by robots.txt – A robots.txt file prevents search engine crawlers from accessing site folders and pages that don't offer significant information about the website. URLs blocked by robots.txt should not be part of a sitemap, since search engines wouldn't crawl them anyway.
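The exclusion rules above can be summed up as a single predicate. Here is a minimal sketch in Python, assuming you have already gathered each URL's status code, meta robots value, canonical target, and robots.txt status (the function name and parameters are hypothetical, for illustration only):

```python
def should_keep_in_sitemap(status_code, has_noindex=False,
                           blocked_by_robots=False,
                           url=None, canonical_url=None):
    """Return True only for URLs worth serving in a sitemap."""
    if status_code != 200:
        # Drops 3XX redirects, 4XX expired/broken pages, 5XX errors.
        return False
    if has_noindex:
        # Pages you told search engines not to index don't belong here.
        return False
    if blocked_by_robots:
        # Crawlers can't fetch these anyway.
        return False
    if url and canonical_url and canonical_url != url:
        # Non-canonical duplicate; the canonical URL should be listed instead.
        return False
    return True
```

A URL passes only when it returns 200 OK, is indexable, is crawlable, and is its own canonical.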
How to make your sitemap clean?
A quick way to gauge whether a sitemap is clean is to check the indexed-to-submitted ratio for your sitemaps in Search Console. If the ratio is far below 100%, you should start getting suspicious. In one instance I found Google indexing only 20% of a sitemap. On further inquiry, I found that the redirect URLs had not been replaced with their destination URLs. Hence the bad percentages.
However, this graph doesn't get us to the bottom of the issue. For that, you need a website crawler like DeepCrawl or Screaming Frog. The following steps show how to create a clean sitemap using DeepCrawl.
Step 1: Run a complete crawl of your site using DeepCrawl
Step 2: Visit the Links > Sitemaps section in the left sidebar.
Step 3: Click on Non-indexable URLs in Sitemap under the Links > Sitemaps section in the sidebar. These are URLs that search engines cannot index because they show one of the following: a 4XX error, a 5XX error, a non-canonical URL, or a noindex tag.
Scan these URLs and see which of the 4XX and 5XX pages are down because of a technical glitch. If there is a technical issue and the page is supposed to exist, work on bringing it back up. If the URLs have expired, remove them from the sitemap.
Step 4: Click on Disallowed/Malformed URLs under the Links > Sitemaps section. These are URLs mentioned in the sitemap but blocked from crawler access via robots.txt. Remove these URLs from your sitemap.
Step 5: We also need to replace 3XX redirect URLs in the sitemap. To do this, visit All Sitemap Links in the sidebar and filter for:
- Target status code greater than 299
- Target status code less than 400
Replace these 301 and 302 redirect URLs with their destination URLs.
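Once you have exported the redirecting URLs and their targets, the replacement step can be automated. A small sketch, assuming a `redirect_map` dict built from your crawler's export (the structure is hypothetical):

```python
def resolve_redirects(url, redirect_map, max_hops=5):
    """Follow a chain of redirects in the map, up to max_hops,
    and return the final destination URL."""
    hops = 0
    while url in redirect_map and hops < max_hops:
        url = redirect_map[url]
        hops += 1
    return url

# Example: a two-hop redirect chain collapses to its final destination.
redirect_map = {
    "https://example.com/old": "https://example.com/interim",
    "https://example.com/interim": "https://example.com/new",
}
```

You would then swap each redirecting sitemap entry for `resolve_redirects(entry, redirect_map)`. The `max_hops` cap mirrors crawler behavior: crawlers, too, give up on long redirect chains.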
How to fix your sitemap?
Although removing these URLs helps you clean your sitemaps, it is not the complete solution.
Set automated rules – Ideally, you want to write your sitemap rule engine so that such URLs don't qualify for the sitemap in the first place. See if you can implement a rule that allows a URL into the sitemap only when it returns a 200 OK status code and doesn't carry special attributes like a noindex tag. Likewise, see if you can implement a rule that discards duplicate URL patterns and robots.txt-blocked URLs from entering the sitemap.
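As an illustration, such a rule engine could filter pages before the XML is ever written. A sketch, assuming `pages` is a list of dicts exported from your CMS or crawler (the field names are hypothetical):

```python
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Write sitemap XML containing only pages that pass the rules:
    200 OK, indexable, crawlable, and self-canonical."""
    entries = []
    for p in pages:
        if p.get("status") != 200:
            continue  # drop 3XX/4XX/5XX pages
        if p.get("noindex") or p.get("blocked_by_robots"):
            continue  # drop noindex and robots.txt-blocked pages
        canonical = p.get("canonical") or p["url"]
        if canonical != p["url"]:
            continue  # drop non-canonical duplicates
        entries.append(f"  <url><loc>{escape(p['url'])}</loc></url>")
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries) + "\n</urlset>")
```

Regenerating the sitemap through a gate like this keeps it clean automatically as pages expire, redirect, or get noindexed.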
Stop linking to such URLs – Removing invalid URLs from sitemaps alone doesn't solve the problem completely. Google can still access a URL if any of your site's pages link to it. You will have to minimize linking to such pages.