How to Fix Index Coverage Issues in Google Search Console


Google may be the ultimate arbiter of whether an online business succeeds or fails, but they aren’t making their decisions haphazardly. They set forth guidelines for how sites should function for accessibility and usability, and they provide a wealth of tools to help site owners diagnose and fix any issues that can crop up.

One of those tools is the Google Search Console (formerly Webmaster Tools), which provides dozens of different kinds of reports about pretty much anything relevant to Google’s analysis of your site. Among the many different reports they provide is the Index Coverage report.

The index coverage report is pretty important. So much so that if pages end up with errors on them, Google will even send you an email letting you know and asking you to fix them.

So, if you’ve received that kind of email, or you’re just concerned about issues you see in your report, here’s how to handle it.

What is the Index Coverage Report?

The index coverage report is a simple report that shows the number of URLs on your site that are known to Google.

These URLs are divided into four categories:

  • Valid. These are pages that have no issues, are indexed just fine, and are not causing problems.
  • Excluded. These are pages that don’t need to be indexed, like 404 pages, pages with redirects, and non-canonical URLs.
  • Valid, With Warnings. These are pages that may have issues, but the issues aren’t enough to prevent indexation, so they’re still indexed.
  • Errors. These are pages that have an error on them preventing indexation for one reason or another.

Each type has sub-categories. For example, “excluded” pages can be 404 pages, pages with redirects on them, pages specifically excluded with a “noindex” tag, and pages that have a canonical URL pointing to a different version.

There are three “phases” to Google indexing and ranking a webpage. Just because Google knows about your site (and the pages on it) doesn’t mean they’ve added it to their index and given it a rank.

  • Discovery. Google uses things like internal and external links, sitemaps, and live URL tests to discover pages they might potentially want to add to their index.
  • Crawling. Google visits the URL and does a deep dive analysis, checking for issues that might violate content or SEO policies or technical errors.
  • Indexing. This is where Google breaks down the page into component elements, does language analysis, and figures out where it would slot into the search results for various queries. Only after this is complete can your pages show up in search results.

To view your index coverage report, simply log into your Google Search Console account, choose your website property, and click the Index subheading on the left, then the Coverage report. On the report, you can click each of the four categories to see how many URLs on your site fall into each and scroll down to a table further dividing them into sub-categories.

What to Look For in the Index Coverage Report

What should you look for, and what should be concerning in an index coverage report?

First and most obviously, errors. Any page that has a glaring error on it should be diagnosed and fixed or noindexed, so it doesn’t show up as an error but as an excluded page instead.

The different kinds of errors include:

  • Server Errors. If Google tries to index your page and the server is down, that’s obviously a problem. You’ll need to talk to your web host about this one.
  • Redirect Errors. Redirects are common online, but sometimes they can end up in long chains, chains that break, or loops. Fixing these is important for pages you need to keep available.
  • Submitted URL Blocked. If you have a page in your sitemap, but your robots.txt file blocks access to the page, this shows up as an error. Remove one of the two, depending on whether or not you want the page to be found.
  • Submitted URL Marked "noindex". Same as above, except the block is a noindex directive in the page's meta tags or HTTP headers.
  • 4XX Errors. Soft 404s (pages that return a 200 status but look like error pages to Google), real 404s, 401s, 403s, and other 4XX issues all cause errors you'll need to fix if you want the pages to be visible.
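To see how the redirect errors above arise, here's a minimal Python sketch that follows a chain of redirects and flags loops or chains that run too long. The URLs and the hop limit are illustrative assumptions, not values Google publishes:

```python
def check_redirect_chain(redirects, start, max_hops=5):
    """Follow a URL -> target mapping (None means the URL serves content).

    Returns ('ok', final_url), ('loop', url), or ('too_long', url).
    """
    seen = set()
    url = start
    hops = 0
    while redirects.get(url) is not None:
        if url in seen:
            return ("loop", url)
        seen.add(url)
        url = redirects[url]
        hops += 1
        if hops > max_hops:
            return ("too_long", url)
    return ("ok", url)

# Hypothetical site structure with a healthy chain and a redirect loop:
redirects = {
    "/old-page": "/new-page",
    "/new-page": None,          # serves content directly
    "/a": "/b",
    "/b": "/a",                 # redirect loop
}

print(check_redirect_chain(redirects, "/old-page"))  # ('ok', '/new-page')
print(check_redirect_chain(redirects, "/a"))         # ('loop', '/a')
```

Anything that comes back as a loop or an over-long chain is the kind of URL Google reports as a redirect error.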

Once you’ve fixed pages with errors on them – or noindexed them – you can move on to warnings.

Pages with warnings fall into two categories. The first is pages that are blocked with robots.txt directives but are still indexed because another website linked to them. If you want the page to be visible, remove it from robots.txt. If you want it blocked, remove it from robots.txt and add a noindex tag to the page itself.
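As a quick illustration, a noindex directive can be set in the page's HTML head (the comment explains why robots.txt blocking must also be removed):

```html
<!-- In the page's <head>: tells Google not to index this page.
     Google must be able to crawl the page to see this tag, which is
     why the page also has to come out of robots.txt. -->
<meta name="robots" content="noindex">
```

For non-HTML files like PDFs, the equivalent is the `X-Robots-Tag: noindex` HTTP response header.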

The second is pages that are indexed without content on them.

Either the page is empty, or the content is somehow blocked or cloaked to Google. For example, content generated solely by client-side scripts may not render when Google visits. You'll want to, again, either noindex the page or make the content visible to Google.

Next, you can dig into your excluded pages.

This is where most of the work will happen since sites usually don’t run into actual errors outside of gross misconfigurations, but exclusions are extremely common. So, rather than cover it here, we’ll cover it in the next section.

Finally, valid pages are generally safe to ignore; they're indexed and have no issues.

You may see some that say "Indexed, not submitted in sitemap," which simply means the page isn't listed in your sitemap. This usually happens when your sitemap regenerates more slowly than you publish, so Google finds content published since the last sitemap update. That's not a problem and usually goes away once the pages are added to the sitemap.
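For reference, a sitemap is just an XML file listing your URLs; a minimal one looks like this (the domain, path, and date are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yoursite.com/new-post/</loc>
    <lastmod>2022-08-03</lastmod>
  </url>
</urlset>
```

If a page is indexed but missing from this file, Google found it through links instead, which is harmless.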

You can check to make sure these pages are pages you want to be indexed, and if they aren’t, you can remove them from the index. This may be relevant in the case of Attachment pages, Tag pages, or System pages, but usually, a good configuration will keep those pages hidden anyway.

Handling Excluded Pages

Pages that are excluded from the index for one reason or another are all lumped together, but there are actually a ton of different reasons why they may be excluded. Sometimes they’re valid reasons, and sometimes they’re issues you want to fix. What are the different exclusions, and how can you handle them?

1. Excluded by NoIndex Tag

If the page has a noindex tag on it, it won’t be indexed. If you don’t want the page indexed, that’s fine.

If you do want the page indexed, you’ll need to find and remove the tag and request indexation.

2. Blocked by Page Removal Tool

Google provides a URL Removal Tool that is designed to be a quick way to pull pages from search results if they expose sensitive content, were hit by a hacker, or otherwise contain something you don't want indexed.

You submit the URL for removal; Google purges it from their index and removes it from search results… temporarily. The removal request expires after 90 days. Unfortunately, this tool is often misused.

3. Blocked by Robots.txt

Robots.txt tells Google which pages it shouldn't crawl. This keeps pages out of the index as long as external links don't point Google at the page anyway.
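For example, a typical robots.txt blocks crawling of specific directories while pointing crawlers at your sitemap. The paths and domain here are placeholders:

```text
User-agent: *
Disallow: /cart/
Disallow: /admin/

Sitemap: https://www.yoursite.com/sitemap.xml
```

Every path listed under Disallow will show up in the "Blocked by robots.txt" bucket if Google discovers it.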

It’s pretty easy to verify that everything in your robots.txt file should be there, at least, and remove anything you want to be indexed.

4. Blocked by 401

The 401 error is “unauthorized access.” If Google tries to access a page that requires a login, for example, they’ll get an unauthorized access error.

This is usually fine, which is why it’s an excluded page; you’re usually blocking access to pages users need to log in to see or pay to access or that you just don’t want the public to see at all.

5. Discovered or Crawled but Not Indexed

These are two different statuses that basically just mean Google got partway through the process but stopped for one reason or another.

Usually, they’ll re-check the URL and index it eventually, so you can leave these alone unless your pages persist in this state for weeks or months.

6. Canonicalization

When duplicate content appears, Google will look for a canonical URL in your page’s metadata.
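The canonical URL is declared with a link tag in the page's head; the URL here is a placeholder:

```html
<!-- In the <head> of every duplicate (and of the canonical page itself),
     pointing at the version you want indexed. -->
<link rel="canonical" href="https://www.yoursite.com/original-page/">
```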


You’ll end up with four possibilities.

  • Alternate page with proper canonical URL. This URL in the report has a canonical URL pointing to a different URL, and that different URL is the one that is indexed. This is fine, and how it’s supposed to work.
  • Duplicate without user-selected canonical URL. This means you have duplicate pages and don’t mark the right canonical version. You’ll want to go through and specify canonical URLs.
  • Duplicate, Google chose a different canonical URL than you did. If you have two identical pages, A and B, and you put canonicalization on B, saying that A is canon, but Google thinks that B should be canon, the URL will end up here. Usually, you want to swap the canonicalization unless you have a good reason not to (like Google somehow choosing the HTTP instead of the HTTPS version as canonical).
  • Duplicate, submitted URL not canonical. This happens when you submit a URL as canon, but Google already has a different one marked as canon, and they think they know better than you, so they put the one you submitted here. Not usually an issue, really, just more canonicalization to sort out.

Canonicalization can be pretty tricky to manage, and sometimes Google gets stuck on choosing the wrong pages. That's why these are all marked as non-errors: Google will eventually sort it out, though it can take some wrangling.

7. 404 Pages

If someone links to a page that doesn’t exist, that URL will show up here as a 404 page.

Unless it’s a URL you thought existed and should exist, you can mostly ignore this. It’s working as intended.

8. Pages with Redirects

If a page has a redirect on it, that page shouldn’t be indexed since users can’t land on it anyway. Again, working as intended.

One thing that may be concerning is that you might see some very important pages here, and you'll freak out wondering why they aren't indexed. For example, your homepage. The key is in the details, though. See, did you know that:

  • http://yoursite.com
  • http://www.yoursite.com
  • https://yoursite.com
  • https://www.yoursite.com

…are all the same page? This is one page, your homepage, with four URLs assigned to it. To Google, that means four duplicate pages. This isn't a duplicate content issue (Google is smart enough to know what's going on here), so basically, they pick the one you want to be canon and index it while putting the other three in exclusions.

The canon version should usually be the HTTPS version, and it’s up to you if you want the www or not. The other versions should be excluded to avoid duplicates in the index. Since you should have a .htaccess rule redirecting users to the right one, they’re categorized as redirects.
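On an Apache server, that .htaccess rule might look like this sketch, assuming mod_rewrite is enabled; swap in your own domain and preferred www choice:

```apache
# 301-redirect every http:// and non-www request to the https://www version.
RewriteEngine On
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\. [NC]
RewriteRule ^(.*)$ https://www.yoursite.com/$1 [L,R=301]
```

With this in place, the three non-canonical variants all resolve to the canonical URL, and Google files them under "Page with redirect."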

What Should You Do Next?

The biggest thing to be concerned about with the index coverage report is actual errors preventing indexation of important pages. Once you’ve fixed those, you can address any warnings. These should all be relatively simple issues to fix and are usually just the result of minor misconfiguration.

From there, you may want to export a CSV of all of your excluded URLs and figure out if any of them need attention. 90% of them won’t, but now and then, you might find a few pages that have fallen through the cracks. In those cases, it’s important to sort it out.

Do you have a specific issue you can’t seem to solve? If so, let us know in the comments, and we’ll be happy to see if we can give you some tips.

David Curtis
David Curtis is the founder and CEO of Blue Pig Media. With twenty years of successful execution in sales, marketing and operations, for both clients and vendors, he has a bottom line ROI driven mentality rooted in metrics driven performance across highly competitive global corporate initiatives.
