Typically these kinds of issues are caused by one or more of the following
reasons:
- Robots.txt - This text file which sits in the root of your website's folder communicates a certain number of guidelines to search engine crawlers. For instance, if your robots.txt file has this line in it; User-agent: * Disallow: / it's basically telling every crawler on the web to take a hike and not index ANY of your site's content.
- .htaccess - This is an invisible file which also resides in your WWW or public_html folder. You can toggle visibility in most modern text editors and FTP clients. A badly configured htaccess can do nasty stuff like infinite loops, which will never let your site load.
- Meta Tags - Make sure that the page(s) that's not getting indexed doesn't have these meta tags in the source code: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
- Sitemaps- Your sitemap isn't updating for some reason, and you keep feeding the old/broken one in Webmaster Tools. Always check, after you have addressed the issues that were pointed out to you in the webmaster tools dashboard, that you've run a fresh sitemap and re-submit that.
- URL Parameters - Within the Webmaster Tools there's a section where you can set URL parameters which tells Google what dynamic links you do not want to get indexed. However, this comes with a warning from Google: "Incorrectly configuring parameters can result in pages from your site being dropped from our index, so we don't recommend you use this tool unless necessary."
- You don't have enough Pagerank - lolwut? Matt Cutts revealed in an interview with Eric Enge that the number of pages Google crawls is roughly proportional to your pagerank.
- Connectivity or DNS issues - It might happen that for whatever reason Google's spiders cannot reach your server when they try and crawl. Perhaps your host is doing maintenance on their network, or you've just moved your site to a new home, in which case the DNS delegation can stuff up the crawlers access.
- Inherited issues - You might have registered a domain which had a life before you. I've had a client who got a new domain (or so they thought) and did everything by the book. Wrote good content, nailed the on-page stuff, had a few nice incoming links, but Google refused to index them, even though it accepted their sitemap. After some investigating, it turned out that the domain was used several years before that, and part of a big linkspam farm. We had to file a reconsideration request with Google.
Some other obvious reasons that your site or pages might not get
indexed is because they consist of scraped content, are involved with
shady linkfarm tactics, or simply add 0 value to the web in Google's
opinion (think thin affiliate landing pages for example).
Does anyone have anything to add to this post? I think I've covered most of the indexation problems, but there's always someone smarter in the room.
Does anyone have anything to add to this post? I think I've covered most of the indexation problems, but there's always someone smarter in the room.