If your website is to rank, it must first be indexed. And to be indexed, it must first be crawled.

Crawling and indexing are two processes that fundamentally contribute to the SEO performance of your website. In this guide, we explain what the individual processes mean and how you can optimize crawling and indexing for the best possible ranking.

Crawling

Crawling is the basis of indexing. The crawler – also called a spider or bot – goes through websites and determines the content of your website (crawling) so that it can then be added to the search index (indexing) and assessed in terms of its relevance for a search query and a user (ranking) .

With crawling management, you control the search engine’s crawler in such a way that all SEO-relevant pages, i.e. the links that are crucial for indexing and ranking, are crawled as often as possible. The indexing management then controls which of the crawled pages are actually indexed, i.e. should appear in the search results. Among the pages that were included in the search engine’s index, the ranking factors then determine which page appears where in the search results.

Google allocates a certain “crawl budget” to every website. This term describes the time the bot spends crawling content on the website. If you expose “unnecessary” pages to the bot, crawl budget is wasted, i.e. ranking-relevant pages may receive too little of it.

Which pages are SEO relevant?

Crucial for rankings are above all the URLs that offer a suitable entry point from organic search. For example, if you want to buy a pair of ski boots and search for “Salomon ski boots”, a suitable entry page would be a shop page displaying a selection of Salomon ski boots. Or a product detail page, if you searched for something more specific such as “salomon X pro”.

If you are searching for “burglary protection”, a suitable entry page would be, for example, an advice page on that topic.

Typical SEO relevant page types are:

  1. Homepage
  2. Category Pages
  3. Product detail pages
  4. Article Pages
  5. Brand Pages
  6. SEO landing pages
  7. Magazines, guides and blogs

Are product pages relevant to the ranking?

Product pages are practically always SEO-relevant: every online shop wants to attract users via organic search. The question, however, is: do you want to optimize all of your products and product variants? And if so, do you have the resources to do so? Depending on the size of your online shop, in practice there is often not enough time or money to work through every product and variant individually.

As a result, you should prioritize your product pages for optimization:

  • Which articles are your top sellers?
  • Which articles have a particularly good margin or high price?
  • Which products drive your organic traffic?
  • For which products and product variants is there high demand in search, i.e. search volume?

Around 2,600 users search for “nike free run” every month, while the color variants red, blue and black are hardly searched for at all. In this case it is sufficient to have the main article indexed; the color variants are not relevant for the ranking.

Which pages should be crawled even though they shouldn’t rank?

Why should a page be essential for SEO if it is not supposed to rank? Because it links to ranking-relevant pages. These pages can offer added value for the user, but they are not suitable entry pages. The crawler still has to visit them in order to find other ranking-relevant pages on the website via these “link pages”.

The classic example are paginations, which can often be found on category or overview pages. The products or articles of a category or topic are displayed on several pages and linked to one another via numbered navigation (pagination).

This is especially useful when there is too much content to display on one page, for example to keep loading times short. Pagination pages are relevant for crawling because they link to many products and articles. However, they are not relevant for ranking: users are better off starting on page 1, where the bestsellers or the latest articles are usually linked.

Another example are so-called tag pages in WordPress. These are often not optimized and are not a good entry point for users, because they merely list various articles. On the other hand, they contain many links to articles that are ranking-relevant.


Which page types are not SEO relevant?

Every now and then there are pages that do not even need to be crawled: they neither contain ranking-relevant content nor link to any. Login pages are a good example. Users can only access the pages behind the login area if they are logged in. Shopping carts in online shops are another example.

Another example are filter pages that are not ranking-relevant. Numerous online shops and marketplaces offer their users many filter options, most of which can be combined. Some of these are certainly relevant for search, especially if users actually search for them, for example “men’s shoes brown” (1,370 searches per month). Other combinations, however, are not searched for and are therefore not ranking-relevant, for example “men’s brown striped shoes” (no search queries). A page with a “brown” filter and a “striped” filter would therefore not be relevant for ranking. It is also irrelevant for crawling, because it does not contain any links that cannot already be found on the generic “men’s shoes” page.

Internal search result pages are usually not crawling-relevant either, because all pages linked there should also be linked elsewhere on your page.

Classic examples of non-crawling-relevant pages:

  • Shopping carts
  • Login pages
  • Filter URLs without ranking relevance
  • Product variants without ranking relevance
  • Internal search results pages

Crawling best practice

All pages that are relevant for ranking, indexing and crawling must therefore be made accessible to the search engine bot. All other pages should be hidden from it.

Which instruments can be used to control the crawling of the site?

You can use various tools to control the crawling of a website. Some of them serve more to ensure sufficient crawling (positive crawling control), others to exclude certain pages from crawling (negative crawling control).

Sitemaps

Basically, a bot follows every link it finds on a website. This means that with a clean internal link structure, the crawler will reliably find your pages. As already mentioned, however, Google assigns every website a certain crawl budget that you cannot directly influence. You therefore never know exactly how often the crawler will visit a page, or how many and which pages it will crawl.

For this reason, a sitemap is very helpful. A sitemap is a file in which you can list the individual web pages on your website. This is how you let Google and other search engines know how the content of your website is structured. Search engine web crawlers like Googlebot read this file to help crawl your website more intelligently.

A sitemap does not guarantee that all of the content it contains will actually be crawled and indexed. But you can use it to support the crawler in its work.

When should you use a sitemap?

A sitemap plays an essential role in indexing a website. For small and medium-sized projects with few sub-pages and with good internal links, it is no problem for the crawler to find and read all the pages on the website.

With large and extensive projects, however, there is a risk that search engine robots will overlook new pages in a domain.

The reasons for this can be:

  • The website is very extensive, i.e. it contains many sub-pages (e.g. online shop, classifieds portal)
  • The website is very dynamic, with a lot of content that changes frequently (e.g. large online shops)
  • The individual content pages are poorly linked or even separated from one another
  • The website is new and there are only a few external inbound links that refer to individual pages of the website

What requirements does a sitemap have to meet?

The sitemap is stored in the root directory of the website so that it can be easily found by the crawler.
Example: https://www.ihrewebsite.de/sitemap.xml

The following formal requirements apply to the sitemaps:

  • contain absolute URLs (e.g. https://www.ihrewebsite.de/)
  • be encoded in UTF-8
  • contain only ASCII characters
  • be a maximum of 50 MB
  • contain a maximum of 50,000 URLs
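
A minimal sitemap that meets these requirements could look like this (URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.ihrewebsite.de/</loc>
    <lastmod>2021-05-01</lastmod>
  </url>
  <url>
    <loc>https://www.ihrewebsite.de/kategorie/</loc>
  </url>
</urlset>
```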

Large sitemaps should therefore be broken down into several smaller sitemaps. These then have to be linked from an index sitemap.
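
An index sitemap that bundles several smaller sitemaps could be sketched like this (the file names are assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.ihrewebsite.de/sitemap-produkte.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://www.ihrewebsite.de/sitemap-kategorien.xml</loc>
  </sitemap>
</sitemapindex>
```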

What types of sitemaps are there?

A basic distinction is made between HTML sitemaps and XML sitemaps.

The two types of sitemaps owe their names to the file format in which they are saved.

HTML sitemap

An HTML sitemap is mostly used to orient users within a website and is linked internally. The website contains a separate subpage on which the individual web addresses (URLs) within the website are listed. The user clicks on a URL and is taken directly to the desired page within the website. It is therefore comparable to a table of contents.

XML sitemap

An XML sitemap differs in structure from an HTML sitemap. It is written in a special format and contains additional metadata about each URL, such as the date of the last update, frequency of changes, importance of the URL.

How can I make the sitemap accessible to the bot?

So that the crawler can find and read the sitemap of a website, you should make the sitemap discoverable in two ways:

1. Through the robots.txt

Store the link to the sitemap in the robots.txt of your website. Since the bot always looks at the instructions in the robots.txt first, you ensure that it also regularly crawls the most important pages of your website via the sitemap.
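
The sitemap reference in the robots.txt is a single line, for example:

```
Sitemap: https://www.ihrewebsite.de/sitemap.xml
```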

2. Via the Google Search Console

You can submit one or more sitemaps via the tab “Sitemaps” in the left navigation bar of the Search Console. The advantage of an additional submission in the Search Console is that Google gives evaluations of the processed URLs from the sitemaps. For example, you can display how many of the URLs submitted via the sitemap have actually been indexed.

Which URLs should you include in a sitemap?

In principle, only ranking-relevant URLs belong in the sitemap; you want to make sure that these are actually crawled. Leave out all other pages. In particular, the following pages should not be included:

  • redirected pages (status code 301/302)
  • unreachable pages (status code 404/410)
  • URLs with the meta robots value noindex
  • URLs whose rel=“canonical” points to a different URL (not to themselves)
  • search result and tag pages
  • paginations
  • pages with restricted access (password-protected pages, status code 403, etc.)

When does it make sense to create multiple sitemaps?

Since a sitemap has no direct influence on the ranking of a website, it is suitable in combination with the Search Console as a control instrument to check whether all relevant URLs have been indexed. To make such an analysis particularly easy, it is advisable to create different sitemaps for different page types.

All of these sitemaps are then bundled in the aforementioned “index sitemap”. Instead of the individual sitemaps, it is this index sitemap that is referenced in the robots.txt and the Google Search Console, and it serves the bot as a central starting point for all sitemaps.

Another use case are image or video sitemaps, if you host your images and videos yourself and want to achieve rankings with them. In that case, add all images to an image sitemap and link it in the index sitemap.

How do I create a sitemap?

There are several ways to create a sitemap. Most CMS systems and shop systems already have a function for creating sitemaps.

If you do not use a CMS (Content Management System) and would like to create your sitemap “yourself”, there are numerous sitemap generators.

robots.txt

The so-called Robots Exclusion Standard Protocol regulates how you can use a robots.txt file to influence the behavior of search engine robots on your domain. This protocol has now become a quasi-standard.

It is true that indexing rules can also be set for search engines in individual HTML files with the help of a meta tag, but this only applies to that individual HTML file, not to other resources such as images. In a central robots.txt, on the other hand, you can define rules for entire directories and directory trees, regardless of the file and link structure of your web project. Since there is no binding documentation, search engines do not always interpret the robots.txt and its syntax consistently. The additional use of meta tags in HTML files is therefore recommended in cases of undesired indexing, i.e. when a robot has ignored or misinterpreted the robots.txt.

The robots.txt file tells the search engine which pages or files on a website it can and cannot crawl. Individual pages, entire directories or certain file types can be excluded from crawling. It is important to know that the bot initially assumes that it is allowed to crawl the entire website. It must therefore be explicitly forbidden to crawl individual pages or file types.

If a page is to be excluded from indexing, robots.txt is not a suitable means. If you prohibit the crawler from accessing parts of your site via the robots.txt, it knows these pages exist but cannot read them. It cannot see, for example, whether you have stored meta robots information there that prohibits indexing.

The robots.txt is also only of limited use for crawling control. If other sites, or you yourself, link to pages blocked in your robots.txt, Google assumes they must be relevant, precisely because they are linked. In the end they may be indexed after all, because the crawler could not read whether they belong in the index or not; you forbade it to do so in the robots.txt. You can recognize such blocked pages in Google search by the fact that instead of a meaningful description, the snippet under the URL reads: “No information is available for this page.”

You can find out which URLs on your website are blocked from crawling in robots.txt, but were still indexed, in the Search Console under “Coverage”:

It is important to regularly check the messages in the Search Console and, if necessary, make improvements to the website so that search engines can crawl the website without problems.

Where is the robots.txt stored?

The robots.txt file must always be stored in the root directory of a website, e.g. http://ihrewebsite.de/robots.txt.

It should be noted that the robots.txt only applies to the host on which the file is stored and to the corresponding protocol.

Example: http://ihrewebsite.de/robots.txt

not valid for

http://shop.ihrewebsite.de/ (because shop. is a subdomain)
https://ihrewebsite.de/ (because the protocol here is https)

valid for

http://ihrewebsite.de/
http://ihrewebsite.de/kategorie/

In theory, you can also store a robots.txt under an IP address as the host name. However, it is then only valid for that specific IP, not automatically for all websites hosted on it; for that, the IP address would have to be explicitly used as the host name for those websites. It is therefore better to store a separate robots.txt for each host name, as you may have different crawling specifications for each of them.

The instructions in the robots.txt

The standard syntax of robots.txt is structured as follows:

User-agent: which user agent or bot is being addressed?
Disallow: what is excluded from crawling?
Allow: what may still be crawled (as an exception to a Disallow rule)?

The disallow and allow instructions can refer to the entire website or to individual subdomains, directories or URLs.
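
A small example (directory names are placeholders): all crawlers are excluded from the shopping cart directory, while a single file inside it remains crawlable:

```
User-agent: *
Disallow: /warenkorb/
Allow: /warenkorb/info.html
```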

Which bots can be controlled via the robots.txt?

In the robots.txt file, both individual crawlers and all crawlers can be addressed. This is mainly used to manage crawler traffic, for example to prevent server overload. If too many bots send requests to your server, i.e. if they call up too many pages at the same time, this can overload it. So if you notice that the load is getting too high, blocking individual bots via robots.txt can be one of several measures.

In addition to Googlebot or the Bing bot, there are also tools with their own crawlers, for example Screaming Frog or ahrefs.com. Note that blocking these bots can make it harder to evaluate a website, since important SEO tools can then no longer crawl it.

Unfortunately, robots.txt offers hardly any protection against malware bots, as these generally do not adhere to its specifications. Reputable crawlers, on the other hand, respect the information in the robots.txt.

When does it make sense to use robots.txt?

From an SEO point of view, there are rather few sensible use cases for robots.txt, because other crawling-control instruments have proven to be more reliable and easier to control. Nevertheless, you can use the robots.txt in the following cases:

  1. You are currently developing a new website and do not want to have it crawled at first because it is still under development.
  2. You want to exclude certain areas or file types of your website from crawling and can ensure that these are not linked internally or externally.
  3. You want to prohibit individual tool bots from crawling.

The robots.txt is a very powerful tool, so consider carefully what you exclude in it. It is best to include as many instructions as necessary and as few as possible.
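
For case 3, you address the tool’s crawler by its user-agent name, for example (check the exact name in the tool provider’s documentation):

```
User-agent: AhrefsBot
Disallow: /
```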

Procedure recommended by Google regarding the robots.txt

To ensure that certain pages are not indexed by Google, a “ban” via the robots.txt works very unreliably. If Googlebot discovers such a page via an external link, for example, the URL can still end up in the index, even though its content was never crawled.

In order to reliably prevent your web pages from ending up in the Google index, this must be indicated on the corresponding page with the noindex meta tag.

In other words: to reliably keep pages out of the Google index, access must not be blocked in the robots.txt (otherwise the crawler never sees the tag), and the noindex meta tag must be set.

However, this does not work for non-HTML resources such as PDF files or videos, because these cannot contain a meta element. In this case, the X-Robots-Tag HTTP header should be used.

Can I use robots.txt to prevent my website from being crawled?

There is no guarantee that search engines will adhere to the prohibitions in a robots.txt. The vast majority of modern search engine robots take its presence into account, read it and follow its instructions. Robots that crawl the web with malicious intent, however, are unlikely to comply.

Indexing

Once a URL is crawled, indexing management controls which URLs may actually be included in the search index, and only these URLs can ultimately achieve rankings. If a page is not crawled, the bot cannot read its indexing settings either.

The following tools are available for indexing:

  • Meta robots / X-Robots-Tag “noindex”
  • Canonical tag
  • 301 redirects
  • Google Search Console “Remove URL” function

Meta Robots / X-Robots

The most important means of controlling indexing are the Meta-Robots and X-Robots information. The robots information (not to be confused with the robots.txt) tells the crawler whether a page can be included in the index or not.

By default, search engines assume that they can call up any document and make it searchable via Google search. Accordingly, the control of crawlers using robots information is only necessary if something is explicitly not desired.

The robots meta tag allows a detailed, page-specific approach, in which you specify how a particular page should be indexed and displayed to users in Google search results. Place the robots meta tag in the <head> section of each page.
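
A minimal robots meta tag, blocking indexing for all crawlers, looks like this:

```html
<meta name="robots" content="noindex">
```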

The robots meta tag in the example above tells search engines not to display the page in question in search results. The value robots of the name attribute indicates that the directive applies to all crawlers. If you want to target a specific crawler, replace robots with the name of that crawler. Crawlers identify themselves via their user agent, which they use to request pages; Google’s standard web crawler has the user-agent name Googlebot. To prevent only Googlebot from indexing your page, address it by that name in the meta tag.
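
For example, to address only Googlebot:

```html
<meta name="googlebot" content="noindex">
```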

Use robots meta tag

Possible values in the robots meta tag are:

  • noindex: The page should not be findable via Google search.
  • nofollow: Do not follow the (internal and external) links on this page.
  • none: Equivalent to noindex, nofollow.
  • noarchive: The page should not be stored as a copy in the search engine cache (temporary storage). This has no influence on whether the page can appear in web search.
  • nosnippet: No description text (snippet) is shown for the page in the search results.
  • notranslate: No translation of the page is offered in the search results.

X-Robots-Tag

Meta robots only works for pages that have a <head>, i.e. for HTML pages. Non-HTML content, such as PDF files, can instead be excluded from indexing via the X-Robots-Tag. Here, rules are defined on the server side, in the .htaccess file on Apache servers, as to how certain files or file types are to be handled. If you do not define indexing information for a URL on your website, search engines automatically assume that it may be included in the index.
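
A sketch for an Apache .htaccess file (assuming the mod_headers module is enabled) that excludes all PDF files from indexing via the X-Robots-Tag HTTP header:

```apache
<Files ~ "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>
```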

The canonical tag

The canonical tag is one of the most important instruments for ambitious SEO. With the canonical tag, you can solve the common problem of so-called duplicate content.

Search engines rate duplicate content negatively, as it offers the Internet user no added value. For indexing, each piece of content should only be accessible under a single URL. If you want to make the content available on other pages as well, the second URL must point to the original page and mark it as the main source. Otherwise, the same content counts as duplicate content.

At least one of the two pages will then be removed from the index by Google. To avoid this, so-called canonical tags are used: a canonical tag is added in the head area of the HTML code of the duplicate.

The canonical tag is a specification in the source code of a website. It points to a standard resource, the canonical URL, for pages with identical or nearly identical content. If the canonical URL is specified correctly, search engines only use the original source for indexing. This prevents the same content on different pages from being treated by Google as duplicate content.

So with the canonical tag you are telling Google: “I am aware that this content is duplicate, only index the original”. The best optimized URL should always be specified as the “original”.

On the “duplicates”, the tag is then implemented according to the following scheme:

<link rel="canonical" href="https://www.ihrewebsite.de/original/">

The same canonical tag can also be included on several pages if, for example, there are several duplicates of an original.

The URL that the canonical points to is marked as the original. It should appear in the search results and must therefore be indexable. But be careful: the canonical target URL must not carry “noindex”, because these two signals contradict each other and give the crawler no clear instruction on how to handle the URL.

If the canonical points to itself (self-referencing canonical), i.e. to the source URL, this has no real effect. In some cases, however, it may be easier to use canonical tags on all URLs, regardless of whether the pages are similar or not.

When should you use the Canonical Tag?

You should use the canonical tag if the content on your pages is very similar or even duplicates.

Pagination Pages

The pagination pages of a category are typically not duplicates of one another, as each of them displays different products. Therefore, pagination pages should not carry a canonical tag pointing to page 1. The first page itself is an exception: sometimes pagination can only be implemented in such a way that there is both a category page without parameters and a separate page 1. These two URLs really are duplicates, as they list the same products or articles. That is why www.ihrewebsite.de/kategorie?page=1 should carry a canonical tag pointing to the category www.ihrewebsite.de/kategorie.
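
Page 1 with parameter would then carry a canonical tag like this in its head area (URLs are illustrative):

```html
<link rel="canonical" href="https://www.ihrewebsite.de/kategorie">
```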

Product variants

If product variants cannot be excluded from crawling, the option remains to exclude them from indexing. The advantage: you can display all individual product variants in a category without producing duplicate content. In this variant, you use the main product as the canonical URL; it then represents the only SEO-relevant URL that should appear in the search results. The other article variants point to the main article via canonical tag.

Parameter URLs

Parameter URLs are often an identical copy of the actual URL but appear to the search engine as different pages. The problem occurs particularly with filtering, internal search pages, session IDs or print versions of pages. Usually these URLs are not SEO-relevant, so you should exclude them from crawling in order to use your crawl budget effectively. If that is not possible, you can at least exclude them from indexing using the canonical tag.

Example: https://www.ihrewebsite.de/kategorie?session-id=52345

This URL delivers a duplicate of

https://www.ihrewebsite.de/kategorie and should therefore point to that URL via the canonical tag.

Pages that are assigned to multiple categories

Sometimes articles or products are reachable via different categories and can thus be accessed under multiple directories. To avoid the resulting duplicates, each piece of content should only ever be accessible via one URL. You can still link the articles or products from several categories: the user can then navigate through the different categories of your shop or website, but always ends up at the same URL when clicking on an article or product.

Canonical tags and hreflang

If a website uses hreflang, the respective URLs should either refer to themselves with a canonical tag or not use canonicals at all. If a URL’s canonical points to a different language version instead, Google receives contradictory signals: while the hreflang tag indicates that a separate language version exists, the canonical tag would declare that version the original URL.
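
A consistent setup for a German/English page pair, in which each URL canonicalizes to itself and lists both language alternates (URLs are placeholders):

```html
<!-- in the head of https://www.ihrewebsite.de/en/ -->
<link rel="canonical" href="https://www.ihrewebsite.de/en/">
<link rel="alternate" hreflang="de" href="https://www.ihrewebsite.de/de/">
<link rel="alternate" hreflang="en" href="https://www.ihrewebsite.de/en/">
```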

External duplicate content

External duplicate content can arise when contributions are published across multiple domains. Making your website accessible via multiple hostnames can also lead to a duplicate content problem.

Example: you have registered yourwebsite.de and your-website.de. If the same content can be reached under both hostnames, this is duplicate content, and Google does not know which of your pages it should rank. The same applies if your website can be reached both with and without www., or via both http and https.

At the beginning of 2017, Google made the use of a secure HTTPS connection for websites an important ranking factor. Since then, Google has preferred HTTPS pages as canonical URLs. The canonical tag should therefore point from the HTTP version to the HTTPS page, not the other way around.

Redirects

Another means of indexing management are redirects.

The most frequently used are status code 301 and status code 302.

Status code 301 denotes a “permanent redirect”. The search engine is informed that the content previously found at URL A is now permanently located at URL B. As a result, the search engine removes the redirected URL A from the index and indexes the redirect target URL B instead.

Status code 302, by contrast, denotes a “temporary redirect”. Here the search engine is informed that the content of the previously indexed URL A is only temporarily located at another URL B. The redirecting URL A therefore remains indexed; the redirect target URL B is usually not indexed.

When to use redirects

Whenever you move a URL permanently, set up a 301 redirect. If you only move a URL temporarily, use a 302 redirect. Another application of the 302 redirect are URLs that lead to an area of the website that requires login: if a user who is not logged in clicks such a link, they are sent to the login page via 302 redirect. The result: the target URL remains indexed, while the login page is not.

If you move a URL, remember not only to set up a redirect but also to adjust all internal links, so that the old URL is no longer linked internally. This saves loading time and crawl budget.

The 301 redirect (permanent redirect)

The 301 redirect is a way to permanently redirect a URL. It is used to redirect old URLs that are no longer valid to new URLs. The big advantage of the 301 redirect is that it passes on practically 100 percent of the link juice and sends a clear signal to search engines that the requested page can be found permanently under a different URL.

The 301 redirect can be implemented on Apache servers, for example, by adapting the .htaccess file or via PHP.

This code is used in the .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.com [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

If the 301 redirect is implemented via PHP, the code looks like this. It is placed directly at the top of the redirecting document, before any output:

<?php
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://www.domain.de/der-neue-name.php");
header("Connection: close");
?>

Removing URLs

Sometimes a URL has to be removed from the Google index as quickly as possible, for example because illegal content or content subject to a cease-and-desist is visible there. For such cases, Google offers a tool in the Search Console for removing URLs from the index. However, the following points must be observed:

Such an exclusion only applies for approx. 90 days. After that, your content can appear again in Google search results (see also the information on permanent removal below).

Clearing the cache or excluding a URL from search results does not change Googlebot’s crawling schedule or caching behavior. If you request a temporary block of a URL, Google will continue to crawl it, provided it still exists and is not blocked by another method, such as a “noindex” tag. It is therefore possible for your page to be crawled and cached again before you remove or password-protect it, and for it to reappear in search results after the temporary block expires.

If Googlebot cannot reach your URL, it assumes the page no longer exists, and the validity period of your blocking request ends. If a page is later found at this URL, it is treated as a new page, which can again appear in Google search results.

Permanently remove URL

The URL removal tool only removes content temporarily. To permanently exclude content or a URL from Google search, take one or more of the following measures:

  • Remove or update the content on your website (images, pages or directories). Then check whether your web server returns HTTP status code 404 (not found) or 410 (gone). Non-HTML files such as PDFs should be completely removed from your server.
  • Block access to the content, for example with a password.
  • Mark the page with the “noindex” meta tag to prevent it from being indexed. This method is less secure than the others.

Conclusion

As soon as a website exceeds the size of a small homepage, one of the most important tasks is keeping its content as completely and as up-to-date as possible in the Google index. Because the resources for crawling and storing web pages are limited, Google applies individual limits to each domain: how many URLs are crawled per day, and how many of these pages are included in the index?

Large websites quickly reach these limits. It is therefore important to use the available resources as productively as possible through smart crawl and indexing management.