If you want your website to rank, it must first be indexed. And for it to be indexed, it must first be crawled.

Crawling and indexing are two processes that fundamentally contribute to the SEO performance of your website. In this guide, we’ll explain what each process means and how to optimize crawling and indexing for the best possible ranking.

An introduction to crawling

Crawling is the basis for indexing. The crawler, also called a spider or bot, goes through web pages and determines their content. That content is then included in the search index (indexing) and assessed for its relevance to various search queries and their users (ranking).

With crawling management, you steer the search engine’s crawler so that all SEO-relevant pages are crawled as often as possible. SEO-relevant pages are those that matter for indexing and ranking, or that contain links to such pages.

Indexing management then controls which of the crawled pages are actually indexed, i.e. appear in the search results. Among the pages included in the index of the search engine, the ranking factors then determine which page appears in which position in the search results.

Google allocates a certain “crawl budget” to each website. This term describes the time the bot spends crawling content on the website. So if you show the bot “unnecessary” pages, crawl budget is “wasted”, and ranking-relevant pages may get too little of it.

How to identify SEO relevant pages

The URLs that offer a suitable entry point from organic search are decisive for rankings. For example, if you want to buy a pair of ski boots and search for “Salomon ski boots”, a suitable entry page would be one that displays a selection of Salomon ski boots in a store. If you searched for something more specific, such as “Salomon X pro”, a product detail page would be the better entry point.

If you search for “burglary protection,” for example, a suitable entry page would be one like this:

[Image: identifying SEO-relevant pages]

Typical SEO relevant page types are:

  1. Home pages
  2. Category pages
  3. Product detail pages
  4. Article pages
  5. Brand pages
  6. SEO landing pages
  7. Magazines, guides and blogs

Are product pages ranking relevant?

Product pages are practically always SEO relevant. Every online store would like to get users via organic search. The question, however, is:

Do you want to optimize all your products and product variants?

If so, do you have the resources to do so?

In practice, depending on the size of your online store, there is very often not enough time or money to work through all products and variants one by one.

You should therefore prioritize which product pages to optimize:

  • Which items are your top sellers?
  • Which items have a particularly good margin or a high price?
  • Which products drive organic traffic?
  • For which products and product options is there high demand in search, i.e. search volume?

For example, about 2,600 users per month search for “nike free run”, while the color variants red, blue and black are not searched for. In this case, it is sufficient to have the main article indexed; the color variants are not relevant for ranking.

[Image: search volume for “nike free run”]

Why let pages crawl if they are not supposed to rank at all?

Why should a page be essential for SEO if it is not supposed to rank? Because it links to ranking relevant pages. These pages have added value for the user, but they are not suitable entry pages. The crawler still has to look at these pages in order to find other ranking-relevant pages of the website via these “link pages”.

The classic example is pagination, which is often found on category or overview pages. The products or articles of a category or topic are displayed on several pages and linked to each other via a numbered navigation (pagination).

[Image: crawling pages that are not meant to rank]

Pagination is especially useful when there is too much content to display on one page, and it also helps to reduce loading time.

Pagination pages are relevant for crawling because they link to many products and articles. However, they are not relevant for the ranking:

It is better for the user to enter on page 1, because the bestsellers or newest articles are often linked there.

Another example is the so-called tag page in WordPress. Tag pages are often not optimized and are not a good entry point from search because they only list different items. On the other hand, they contain many links to articles that are ranking relevant.

Example of a tag page:

[Image: example of a tag page relevant for crawling]

Which page types are not SEO relevant?

Every now and then there are pages that don’t need to be crawled at all. They neither contain nor point to content that is relevant for ranking. Login pages are a good example of this.

Users can access pages behind the login area only if they are logged in. The shopping carts of online stores fall into the same category.

Another example is filter pages that are not ranking relevant. Numerous online stores and marketplaces offer their users many filter options, most of which can be combined.

Some of them are certainly relevant to search, especially if users also search for them, for example “herrenschuhe braun” (brown men’s shoes, 1,370 searches per month).

However, others are not searched for by users and are therefore not ranking relevant, for example “herrenschuhe braun gestreift” (brown striped men’s shoes, no search queries).

A page with the filters “brown” and “striped” would therefore not be relevant for ranking. It is also not relevant for crawling because it does not contain any links that cannot be found on the generic page for “herrenschuhe”.

Internal search results pages are usually not crawling relevant either, because all pages linked there should also be linked elsewhere on your site.

Classic examples of pages that do not need to be crawled:

  • Shopping carts
  • Login pages
  • Filter URLs without ranking relevance
  • Product variants without ranking relevance
  • Internal search result pages

Best practice for crawling your website

Make all pages that are relevant for ranking, indexing and crawling accessible to the search engine bot. You can usually block the remaining pages for the crawler.

However, before you exclude pages, always make sure again that the page is really not relevant. Otherwise, large parts of your site may become inaccessible and lose visibility.

How to control crawling on your website

You can use various tools to control how your website is crawled. Some of them serve to ensure sufficient crawling (positive crawling control), others to exclude certain pages from crawling (negative crawling control).

Crawling control with sitemaps

Basically, a bot follows every link it finds on a website. This means that if you have a clean internal link structure, the crawler will reliably find your pages. As already mentioned, Google allocates a certain crawl budget to each website, which cannot be influenced. Therefore, you do not know exactly how often the crawler will visit a page and how many and which pages it will crawl.

  • Crawl Budget:

    The crawl budget is related to how many requests the Google bot thinks your site can handle without affecting it too much.

For this reason, a sitemap is very helpful. A sitemap is a file where you can list the individual web pages of your website. This way you let Google and other search engines know how the content of your website is structured and which of it you consider relevant.

Search engine web crawlers like the Google bot read this file to crawl your website more intelligently. A sitemap does not guarantee that all the content specified in it will actually be crawled and indexed. But you can use it to support the crawler in its work.

When should you use a sitemap?

A sitemap plays an essential role in the indexing of a website. For small and medium-sized projects with few subpages and good internal linking, the crawler has no problem finding and reading all pages of the website. With large and extensive projects, however, there is a risk that search engine bots will overlook new pages of a domain.

The reasons for this may be:

  • The website is very extensive, i.e. it contains many subpages (e.g. online store, classifieds portal)
  • The website is very dynamic, with a lot of content that changes frequently (e.g. large online stores)
  • The individual content pages are poorly linked or even separated from each other
  • The website is new and there are few external inbound links pointing to individual pages of the website

What are the requirements for a sitemap?

The sitemap is placed in the root directory of the website so that it can be easily found by the crawler.

Example: https://www.ihrewebsite.de/sitemap.xml

Every sitemap must meet the following formal requirements. It must:

  • contain absolute URLs (e.g. https://www.ihrewebsite.de/)
  • be encoded in UTF-8
  • contain only ASCII characters in its URLs (other characters must be escaped)
  • be no larger than 50 MB (uncompressed)
  • contain no more than 50,000 URLs

Larger sitemaps should therefore be split into several smaller sitemaps, which are then linked from an index sitemap.
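The splitting step can be sketched in a few lines of Python. This is a minimal illustration, not a complete generator: the file names (sitemap-1.xml, sitemap-index.xml) and the function name are assumptions made up for the example, and only the URL limit is enforced, not the file-size limit.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL limit named in the text


def split_into_sitemaps(urls, base="https://www.ihrewebsite.de/", max_urls=MAX_URLS):
    """Return {filename: xml_string} for the child sitemaps plus one index sitemap."""
    files = {}
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for n, start in enumerate(range(0, len(urls), max_urls), start=1):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = f"sitemap-{n}.xml"  # hypothetical naming scheme
        files[name] = ET.tostring(urlset, encoding="unicode")
        # The index sitemap links every child sitemap by its absolute URL.
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = base + name
    files["sitemap-index.xml"] = ET.tostring(index, encoding="unicode")
    return files
```

Only the index file then needs to be made known to search engines; the child sitemaps are found through it.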

What are the different types of sitemaps?

Basically, a distinction is made between HTML sitemaps and XML sitemaps.

The two types of sitemaps owe their name to the file format in which they are saved.

HTML Sitemap

An HTML sitemap is mostly used to orient users within a website and is internally linked. The user can click on a URL there and go directly to the desired page within the website.

It is therefore comparable to a table of contents.

XML Sitemap

An XML sitemap differs in structure from an HTML sitemap.

It is written in a special format and contains additional metadata about each URL, such as the date of the last update, the change frequency and the importance of the URL.
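To make the format more tangible, here is a small Python sketch that writes one entry per URL together with the three metadata fields mentioned above. The function name and the input tuple layout are assumptions for this example; the element names (loc, lastmod, changefreq, priority) come from the sitemap protocol.

```python
from datetime import date
from xml.etree import ElementTree as ET


def build_sitemap(pages):
    """pages: iterable of (absolute_url, last_modified, change_frequency, priority)."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod, changefreq, priority in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod.isoformat()  # YYYY-MM-DD
        ET.SubElement(url, "changefreq").text = changefreq        # e.g. "daily"
        ET.SubElement(url, "priority").text = str(priority)       # 0.0 to 1.0
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap([("https://www.ihrewebsite.de/", date(2024, 1, 15), "daily", 0.8)]))
```

Search engines treat these fields as hints at most, so the values are worth keeping accurate rather than optimistic.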

How does the sitemap become accessible to the bot?

In order for the crawler to find and read the sitemap of a web page, you should make the sitemap discoverable in two ways:

  1. Via the robots.txt
    Store the link to the sitemap in the robots.txt of your website. Since the bot always looks at the instructions in robots.txt first, this ensures that it also crawls the most important pages of your website regularly via the sitemap.
  2. Via Google Search Console
    You can submit one or more sitemaps via the “Sitemaps” tab in the left navigation bar of Search Console. The advantage of an additional submission in Search Console is that Google reports on how it processed the URLs from the sitemaps. For example, you can see how many of the URLs submitted via the sitemap were actually indexed.

[Image: submitting a sitemap in Search Console]
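If you want to check which sitemaps a robots.txt actually announces, Python’s standard library can read the Sitemap lines for you. The robots.txt content below is an assumed example; site_maps() is available from Python 3.8 onwards.

```python
from urllib.robotparser import RobotFileParser

# Assumed example content of https://www.ihrewebsite.de/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow:

Sitemap: https://www.ihrewebsite.de/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
```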

Which URLs you should include in a sitemap

As a rule, only ranking-relevant URLs should be included in the sitemap. After all, you want to make sure that they are actually crawled. Leave out all other pages.

The following pages should not be included:

  • redirected pages (status code 301/302)
  • inaccessible pages (status code 404/410)
  • URLs with the meta robots specification noindex
  • URLs whose rel=”canonical” points to a different URL (not to themselves)
  • Search results/Tags
  • Paginations
  • Pages with restricted access (password protected pages, status code 403 etc.)
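Several of these exclusion rules can be checked mechanically. The following Python sketch assumes you already have crawl data for each URL; the Page record and its field names are hypothetical and only serve the example. Search results, tag pages and paginations are not covered, because they have to be recognized by their URL pattern on your specific site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:  # hypothetical crawl record, not a real library type
    url: str
    status: int
    noindex: bool = False
    canonical: Optional[str] = None


def belongs_in_sitemap(page: Page) -> bool:
    if page.status != 200:
        return False  # covers redirects (301/302), 403, 404 and 410
    if page.noindex:
        return False  # meta robots or X-Robots-Tag noindex
    if page.canonical not in (None, page.url):
        return False  # canonical points to a different URL
    return True
```

Running a filter like this before each sitemap export keeps the sitemap aligned with what you actually want indexed.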

When does it make sense to create multiple sitemaps?

A sitemap has no direct influence on the ranking of a website. In combination with Search Console, however, it is a useful tool for checking whether all relevant URLs have been indexed.

To make such an evaluation particularly easy, it is recommended to create different sitemaps for different page types. All these sitemaps are then bundled in the “index sitemap” mentioned above.

The index sitemap, rather than the individual sitemaps, is then stored in robots.txt and Google Search Console and serves the bot as a central starting point for all sitemaps.

Another use case is image or video sitemaps, if you host your images and videos yourself and want to achieve rankings with them. In that case, list all images in an image sitemap and link it in the index sitemap as well.

How to create a sitemap

There are several ways to create a sitemap. Most content management systems and store systems already have a function for creating sitemaps.

If you do not use a CMS and want to create your sitemap “yourself”, there are numerous sitemap generators.

Crawling control via the robots.txt file

The Robots Exclusion Protocol specifies that robots.txt is the first file a bot requests on a website. This lets you control access to your own website. The protocol has since become the standard that reputable crawlers follow.

You can also give search engines instructions with a meta tag in individual HTML files. However, such a tag applies only to that one HTML file and, at most, to the pages that can be reached from it via links, but not to other resources such as images.

In a central robots.txt file, on the other hand, you can specify rules for directories and directory trees independently of the file and link structure of your website. Because the protocol was for a long time only a de facto standard without a binding specification, search engines do not always interpret robots.txt and its syntax uniformly.

Using meta tags in HTML files in addition is therefore recommended as a safeguard against unwanted indexing in case the robots.txt file is not interpreted or is interpreted incorrectly.

The robots.txt file tells the search engine which pages or files of a website may be crawled and which may not. Individual pages, entire directories or even certain file types can be excluded from crawling.

It is important to know that the bot initially assumes that it is allowed to crawl the entire website. Crawling individual pages or file types must therefore be explicitly forbidden.

If a page is to be excluded from indexing, robots.txt is not a suitable tool. If you prohibit the crawler from accessing parts of your site via robots.txt, it can still discover these URLs, but it cannot read the pages. This means that the crawler cannot see whether you have stored meta robots information that prohibits indexing, for example.

The robots.txt is also only conditionally suitable for crawling control. If other websites, or even you yourself, link to pages that are blocked in robots.txt, Google assumes that they must be relevant, since they are being linked to. In the end, they might be indexed after all, because the crawler could not read whether they should be indexed or not. After all, you have forbidden it to do so in the robots.txt.

You can recognize such blocked pages in Google search by the fact that instead of a meaningful description under the URL it says: “No information is available for this page.”

In Search Console, under “Coverage”, you can find out which URLs of your website are blocked from crawling in robots.txt but were indexed anyway.

It is important to check the messages in Search Console regularly and make improvements to the website if necessary, so that search engines can crawl the website without any problems.

Where is the robots.txt stored?

The robots.txt file must always be placed in the root directory of a website, e.g. http://ihrewebsite.de/robots.txt.

Note that a robots.txt file is only valid for the host on which it is stored and for the corresponding protocol.

http://ihrewebsite.de/robots.txt

is not valid for

http://shop.ihrewebsite.de/ (since shop. is a subdomain and therefore a different host)

https://ihrewebsite.de/ (since the protocol here is https)

but is valid for

http://ihrewebsite.de/

In theory, you can also serve a robots.txt file from an IP address used as the host name. It is then valid only for that specific IP address and not automatically for all websites hosted on it; for those, the file must be made available under each hostname explicitly. It is therefore better to store a separate robots.txt for each hostname, especially since you may have different crawling rules for the individual hostnames.
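The “one robots.txt per protocol and host” rule can be expressed as a small helper that derives the responsible robots.txt URL for any page URL. The function name is an assumption for this sketch.

```python
from urllib.parse import urlsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Scheme and host (including any subdomain and port) decide which file applies.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_txt_url("http://ihrewebsite.de/kategorie?page=2"))
print(robots_txt_url("http://shop.ihrewebsite.de/produkt"))
print(robots_txt_url("https://ihrewebsite.de/"))
```

The three calls yield three different robots.txt locations, which is exactly why rules stored under one host do not carry over to another.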

The instructions in robots.txt

The standard syntax of robots.txt is as follows:

User-agent: Which user agent or bot is being addressed?
Disallow: What is excluded from crawling?
Allow: What is still allowed to be crawled?

The Disallow and Allow statements can apply to the entire host or to individual directories or URLs. Subdomains need their own robots.txt, since the file is only valid for the host on which it is stored.
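To see how these three instructions interact, you can evaluate a rule set with Python’s built-in robots.txt parser. The rules and paths below are an assumed example, and AhrefsBot is used as the user agent token of a tool crawler. One caveat: Python’s parser applies the first matching rule, whereas Google applies the most specific one, so the Allow line is listed before the broader Disallow line here to get the same result in both.

```python
from urllib.robotparser import RobotFileParser

# Assumed example rules, not a recommendation for any specific site.
RULES = """\
User-agent: *
Allow: /shop/sale/
Disallow: /shop/

User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "https://www.ihrewebsite.de"
print(parser.can_fetch("Googlebot", site + "/shop/sale/boots"))  # True: explicitly allowed
print(parser.can_fetch("Googlebot", site + "/shop/cart"))        # False: inside /shop/
print(parser.can_fetch("Googlebot", site + "/blog/"))            # True: no rule matches
print(parser.can_fetch("AhrefsBot", site + "/blog/"))            # False: bot fully blocked
```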

Which bots can be controlled via robots.txt?

In the robots.txt file, both individual and all crawlers can be addressed. This is mainly used to control crawler traffic, for example to prevent server overloads.

If too many bots make requests to your server, i.e. they call up too many pages at the same time, this can overload your server. So if you notice that the loads are getting too high, blocking individual bots using robots.txt could be one of several measures.

Besides the Googlebot or the Bing bot, there are also tools with their own crawlers. For example, Screaming Frog or ahrefs.com have their own bots. One thing to keep in mind is that blocking these bots can make it difficult to evaluate a website if important SEO tools cannot crawl a page.

Unfortunately, the robots.txt can hardly be used to protect against malicious bots, as they usually do not adhere to the specifications. Reputable crawlers, on the other hand, respect the information in robots.txt.

When does it make sense to use robots.txt?

From an SEO point of view, there are rather few sensible use cases for robots.txt, because other means of crawling control have proven to be more reliable and easier to control. Nevertheless, you can use robots.txt in the following cases:

  1. You are developing a new website and do not want to have it crawled yet because it is still under development.
  2. You want to exclude certain areas or file types of your website from crawling and can ensure that they are not linked internally or externally.
  3. You want to prohibit individual tool bots from crawling.

The robots.txt is a very powerful tool. Therefore, you should think very carefully about what you exclude in robots.txt. It is best to include only as many instructions as necessary and as few as possible.

Procedure recommended by Google regarding robots.txt

A “ban” via robots.txt is a very unreliable way to keep certain pages out of Google’s index. If the Google bot sees the URL via an external link, for example, the URL may be indexed anyway.

To reliably prevent your pages from ending up in Google’s index, mark the page in question with the noindex meta tag.

This means that in order to reliably remove pages from the Google index, the noindex meta tag must be set and access must not be prohibited in robots.txt, so that the crawler can actually read the tag.

However, this does not work for non-HTML content such as PDF files or videos, because these cannot contain a meta element. In this case, use the X-Robots-Tag HTTP header instead.
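The interplay between robots.txt and noindex can be summarized as a small decision function. This is a simplified sketch of the reasoning in this section, not a description of Google’s internal logic; the function name and the wording of the results are assumptions.

```python
def indexing_outcome(blocked_by_robots_txt: bool, has_noindex: bool) -> str:
    """Simplified outcome for a URL that search engines know about via links."""
    if blocked_by_robots_txt:
        # The crawler never fetches the page, so a noindex on it goes unseen.
        return "may still be indexed, without a description"
    if has_noindex:
        # The crawler can read the page and therefore sees the noindex.
        return "kept out of the index"
    return "eligible for indexing"


for blocked in (True, False):
    for noindex in (True, False):
        print(blocked, noindex, "->", indexing_outcome(blocked, noindex))
```

The combination to avoid is the first one: a page that is both blocked and marked noindex, because the block hides the noindex.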

Can the robots.txt file be used to prevent crawling of a website?

There is no guarantee that search engines will adhere to the prohibitions in robots.txt. The vast majority of crawlers of modern search engines take into account the presence of a robots.txt file, read it and follow the instructions. Bots that search the web with bad intentions probably don’t comply.

Introduction to indexing a website

When a URL is crawled, you can use indexing management to control which URLs may actually be included in the search index. And only these URLs can achieve rankings in the end.

If a page is not crawled, the bot cannot detect the indexing settings either.

The following tools are available as part of the indexing process:

  • Meta Robots/X-Robots “noindex”
  • Canonical tag
  • 301 redirects
  • Google Search Console “Remove URL” function

The use of Meta Robots & X-Robots

The most important means of controlling indexing are the meta-robots and the X-robots specifications. The robots information (not to be confused with robots.txt) tells the crawler whether a page may be included in the index or not.

By default, search engines assume that they are allowed to retrieve any document and make it discoverable via Google search. Accordingly, the control of crawlers by means of robots specifications is only necessary if something is explicitly not desired.

The Robots meta tag allows you to take a detailed, page-specific approach to specifying how a particular page should be indexed and displayed to users in Google search results.

Place the robots meta tag in the <head> section of the respective page as follows:

<meta name="robots" content="noindex">

The robots meta tag in the example above tells search engines not to display the page in question in search results. The value of the name (robots) attribute indicates that the instruction applies to all crawlers. If you want to target a specific crawler, replace the robots value of the name attribute with the name of the corresponding crawler. Certain crawlers are also called user agents.

A crawler uses its user agent to request a page. Google’s default web crawler has the user agent name Googlebot. If you just want to prevent Googlebot from indexing your page, update the tag as follows:

<meta name="googlebot" content="noindex">

Use Robots Meta Tag

Possible specifications in the metatag are:

  • noindex: The page should not be findable via Google search.
  • nofollow: Do not follow the (internal and external) links on this page.
  • none: Equivalent to noindex, nofollow.
  • noarchive: The page should not be stored as a copy in the search engine cache. This has no influence on whether the page can appear in web search.
  • nosnippet: No text snippet (description text) is displayed for the page in the search results.
  • notranslate: No translation of the page is offered in the search results.
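If you want to check which of these instructions a page actually carries, you can read the meta tags from its HTML. The following Python sketch uses only the standard library; the class and function names are assumptions for this example. It treats both noindex and none as blocking and considers the generic robots tag as well as a crawler-specific one.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of all meta tags whose name is in `names`."""

    def __init__(self, names):
        super().__init__()
        self.names = names
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() in self.names:
            content = attr.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(",") if d.strip())


def is_noindex(html: str, crawler: str = "googlebot") -> bool:
    parser = RobotsMetaParser({"robots", crawler.lower()})
    parser.feed(html)
    return bool(parser.directives & {"noindex", "none"})


print(is_noindex('<head><meta name="robots" content="noindex, nofollow"></head>'))
```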


The meta robots tag only works for pages that have a <head> section, i.e. HTML pages. Non-HTML content can be excluded from indexing by means of the X-Robots-Tag HTTP header.

This includes PDF files, among others. Here, rules for how certain files or file types are to be handled are defined on the server side, in the .htaccess file on Apache servers. If you do not define indexing information for a URL, search engines automatically assume that it may be included in the index.
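Whether a file is excluded this way can be verified by looking at the response headers the server sends for it. The sketch below assumes you already have the headers as name/value pairs; the function name is made up for the example. It handles the plain form (noindex, nofollow) and the crawler-specific form (googlebot: noindex).

```python
# Directives that carry their own colon and must not be mistaken for a crawler name.
DIRECTIVES_WITH_VALUE = {"unavailable_after", "max-snippet", "max-image-preview", "max-video-preview"}


def header_blocks_indexing(headers, crawler="googlebot"):
    """headers: iterable of (name, value) pairs from an HTTP response."""
    for name, value in headers:
        if name.lower() != "x-robots-tag":
            continue
        prefix, sep, rest = value.partition(":")
        prefix = prefix.strip().lower()
        if sep and "," not in prefix and prefix not in DIRECTIVES_WITH_VALUE:
            # Form "googlebot: noindex": the rule targets one crawler only.
            if prefix != crawler.lower():
                continue
            value = rest
        directives = {d.strip().lower() for d in value.split(",")}
        if directives & {"noindex", "none"}:
            return True
    return False


pdf_headers = [("Content-Type", "application/pdf"), ("X-Robots-Tag", "noindex, nofollow")]
print(header_blocks_indexing(pdf_headers))
```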

How the Canonical Tag Works

The canonical tag is one of the most important tools for the ambitious SEO. With the canonical tag, you can solve the common problem of so-called duplicate content.

Search engines evaluate duplicate content negatively because it offers no added value for the user. Ideally, each piece of content is therefore accessible under a single URL only. If you want to make the content available under other URLs as well, each additional URL must point to the original page and identify it as the primary source. Otherwise, the same content counts as duplicate content.

At least one of the two pages will then be removed from the index by Google. To keep control over which one, you use the canonical tag, which is added to the head section of the HTML code.

The canonical tag is a specification in the source code of a page. It points to a standard resource, the one canonical URL, for pages with the same or almost the same content.

If a canonical URL is marked up correctly, search engines generally use only the original source for indexing. This prevents Google from treating the same content on different pages as duplicate content.

So with the Canonical tag you are telling Google “I am aware that this content is duplicate, index only the original”. The best optimized URL should always be specified as the “original”.

The tag is then implemented in the duplicate according to the following scheme:

<link rel="canonical" href="https://www.ihrewebsite.de/original/">

The same canonical tag can be included on multiple pages, for example, if there are multiple duplicates of an original.

The URL pointed to by the canonical is marked as original. This should be displayed in the search results and must therefore be provided with the meta robots specification “index”.

But beware! The target URL must not be marked with “noindex”, because these two signals are opposite and do not provide the crawler with clear instructions on how to handle the URL.

If the canonical points to itself (self-referencing canonical), i.e. to the source URL, this has no real effect. However, in some cases, it may be easier to implement if Canonical tags are specified on all URLs, regardless of whether they are similar pages or not.
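To audit which of these cases applies to a given page, you can read the canonical tag from its HTML and compare it with the page’s own URL. This Python sketch is a simplified check with assumed names; it compares URLs as exact strings and does not resolve relative href values.

```python
from html.parser import HTMLParser


class CanonicalParser(HTMLParser):
    """Remembers the href of the first <link rel="canonical"> it encounters."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rels and self.canonical is None:
            self.canonical = attr.get("href")


def canonical_status(page_url: str, html: str) -> str:
    parser = CanonicalParser()
    parser.feed(html)
    if parser.canonical is None:
        return "no canonical"
    if parser.canonical == page_url:
        return "self-referencing"
    return f"points to {parser.canonical}"


html = '<head><link rel="canonical" href="https://www.ihrewebsite.de/original/"></head>'
print(canonical_status("https://www.ihrewebsite.de/kopie/", html))
```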

[Image: indexing with the canonical tag]

When should you use the Canonical Tag?

You should use the canonical tag if content on your pages is very similar or even identical.

Examples of the use of the Canonical tag:

Pagination pages

The pagination pages of a URL are typically not duplicates because they display different products. Therefore, pagination pages should not have a canonical tag pointing to page 1. The first page itself is an exception. Sometimes pagination can only be implemented in such a way that there is both a category page without parameters and a page 1.

These two URLs are in fact duplicates, since they list the same products or articles. Therefore, you should set a canonical tag on www.ihrewebsite.de/kategorie?page=1 that points to www.ihrewebsite.de/kategorie.

Product variants

If product variants cannot be excluded from crawling, the option remains to exclude them from indexing. The advantage is that this way you can display all individual product variants in one category without producing duplicate content.

In this variant, you use the main product as the canonical URL. It then represents the only relevant URL for SEO to be displayed in the search results. The other article variants then point to the main article via canonical tag.

Parameter URLs

Parameter URLs are often an identical copy of the actual URL, but represent different pages to the search engine. The problem occurs especially with filtering, internal search pages, session IDs or print versions of pages.

As a rule, these URLs are not SEO relevant. So, you should exclude them from crawling in order to use your crawling budget effectively. If this is not possible, you can at least exclude them from indexing using the Canonical tag.

Example: https://www.ihrewebsite.de/kategorie?session-id=52345

This URL represents a duplicate to

https://www.ihrewebsite.de/kategorie and should therefore refer to https://www.ihrewebsite.de/kategorie via the canonical tag.
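The canonical target for such parameter URLs can often be derived by stripping the parameters that do not change the content. The following Python sketch illustrates this for the session ID example and for the page=1 case from the pagination section; which parameters are safe to ignore is an assumption you need to verify for your own site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page displays.
IGNORED_PARAMS = {"session-id"}


def canonical_target(url: str) -> str:
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key not in IGNORED_PARAMS and (key, value) != ("page", "1")
    ]
    # Rebuild the URL without the ignored parameters and without a fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_target("https://www.ihrewebsite.de/kategorie?session-id=52345"))
print(canonical_target("https://www.ihrewebsite.de/kategorie?page=2"))
```

Page 2 keeps its parameter because, as described above, further pagination pages are not duplicates of the category page.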

Pages assigned to multiple categories

Sometimes items or products are assigned to different categories and are therefore accessible under several directories, which creates duplicate content. To prevent this, each piece of content should only ever be accessible via one URL.

You can still link the items or products from multiple categories. The user can then navigate through the various categories of your store or website, but always lands on the same URL when clicking on an article or product.

Canonical tags and hreflang

If a website uses hreflang, the respective URLs should either refer to themselves via canonical tag or not use canonicals at all.

If the canonical tag instead points to a different language version, Google receives conflicting signals: the hreflang tag says that another language version exists, while the canonical tag declares that other version to be the original URL.

External Duplicate Content

External duplicate content can occur when posts are published across multiple domains. Making your website accessible via multiple hostnames can also lead to a duplicate content problem.

Example: You have registered two domains, such as ihrewebsite.de and ihrewebsite.com. If the same content is accessible under both hostnames, this is duplicate content, and Google does not know which of your pages to rank.

The same is true if your website can be reached both with and without www., or under both http and https.

Google announced in 2014 that a secure HTTPS connection is a ranking signal, and it prefers HTTPS pages as canonical URLs. The canonical tag should therefore point from the HTTP page to the HTTPS page, not vice versa.

How to use redirects

Redirects are another means of indexing management. The most frequently used are status code 301 and status code 302.

Status code 301 is a “permanent forwarding”. The search engine is told that the content previously found on URL A is now permanently found on URL B. As a result, the search engine will remove the redirected URL A from the index and index the redirection target URL B instead.

Status code 302, on the other hand, is a “temporary forwarding”. Here, the search engine is informed that the content of the previously indexed URL A can be found only temporarily on another URL B. The forwarding URL A thus remains indexed, the forwarding destination URL B is usually not indexed.

When to use redirects

If you are moving a URL permanently, you should always set up a 301 redirect. If you move a URL only temporarily, you can use a 302 redirect.

Another application of the 302 redirect is URLs that lead to an area of the website that requires the user to be logged in. If a user who is not logged in clicks on such a link, they are redirected to the login page via a 302 redirect.

As a result, the originally linked URL remains indexed, while the login page is not.

When moving a URL, remember to not only set up a redirect, but also adjust all internal links so that the old URL is no longer linked internally. This saves loading time and crawling budget.
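Adjusting internal links means replacing every link to a redirected URL with the final target of the redirect, including across redirect chains. A minimal Python sketch, assuming you have your redirects as a simple old-to-new mapping (the paths and function names are made up for the example):

```python
def final_target(url, redirects, max_hops=10):
    """Follow the redirect mapping until a URL is reached that is not redirected."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)  # guards against redirect loops
        url = redirects[url]
    return url


def update_internal_links(links, redirects):
    return [final_target(link, redirects) for link in links]


# Hypothetical redirect map with a two-step chain.
redirects = {"/alt": "/neu", "/neu": "/final"}
print(update_internal_links(["/alt", "/kontakt"], redirects))
```

Linking directly to the final target spares both users and crawlers the intermediate hops.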

The 301 redirect (301 forwarding)

The 301 redirect is a way to permanently redirect a URL. This redirect is used to redirect old URLs that are no longer valid to new URLs.

The great advantage of the 301 redirect is that this redirect passes on practically 100 percent of the link juice and sends a clear signal to search engines that the requested page can be found permanently under a different URL.

The 301 redirect can be implemented on Apache servers, for example, by modifying the .htaccess file or via PHP.

This code is used for the .htaccess file:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^domain\.com$ [NC]
RewriteRule ^(.*)$ http://www.domain.com/$1 [R=301,L]

If the 301 redirect is implemented via PHP, the code looks like this. It is placed directly in the source code of the redirecting document, before any other output is sent.

<?php
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://www.domain.de/der-neue-name.php");
header("Connection: close");
exit;
?>

Remove URLs

Sometimes a URL has to be removed from the Google index as quickly as possible, e.g. because it shows content that is illegal or the subject of a legal warning.

For such cases, Google offers a tool in Search Console to remove URLs from the index.

However, the following points should be noted:

Such exclusion is valid only for about 90 days. After that, your information will be displayed again in the Google search results.

Clearing the cache or excluding a URL from search results does not change the crawling schedule or the caching behavior of the Google bot. If you request that a URL be temporarily blocked, Google will continue to crawl your URL if it is present and not blocked by another method, such as a “noindex” tag.

Therefore, it is possible that your page will be crawled and cached once again before you remove it or protect it with a password. So, it may appear in the search results again after your temporary exclusion expires.

If your URL is inaccessible to the Google bot, it assumes that the page no longer exists. The validity period of your blocking request will therefore be terminated. If a page is later found again under this URL, it will be considered a new page, which may also be included in Google search results.

Remove URL permanently

The URL removal tool removes URLs only temporarily. If you want to permanently exclude content or a URL from Google search, do at least one of the following:

  • Remove or update the content on your website such as images, pages or directories. After that, check if your web server returns HTTP status code 404 (not found) or 410 (deleted). Non-HTML files such as PDFs should be completely removed from your server.
  • Block access to the content, e.g. with a password.
  • Mark the page with the “noindex” meta tag so that it is not indexed. This method is less safe than the others.

Conclusion on crawling & indexing

As soon as a website grows beyond the size of a small homepage, one of the most important tasks is to ensure that its content is represented in the Google index as completely and as up to date as possible. Since the resources for capturing and storing web pages are limited, Google applies individual limits per domain: how many URLs are crawled per day, and how many of those pages make it into the index.

Large websites quickly reach these limits. Therefore, it is important to use the available resources as productively as possible with smart crawl and indexing management.