Why do we study web crawlers?

Website crawling and indexing: theory and practice

Through crawling and indexing, websites are included in the search engine's index and thus get the opportunity to rank on the search results pages (SERPs). When creating and optimizing websites, you therefore need to make sure they can be crawled easily if their pages are to appear in the organic search results. This ensures that all relevant content of the website ends up in the search engine index. Fundamentally, a web crawler can only index content that it can find, which is why a website should always have a hierarchy that is as flat as possible and well-thought-out internal links. Engage thoroughly with the topic of crawling and indexing so that you can feed the search engines with relevant content as effectively as possible. This blog article will help you do just that.


What does crawling mean?

Crawling is the discovery of publicly accessible websites by special software. One of the best-known crawlers, also known as bots, is the Googlebot, which searches the Internet for all available pages on behalf of the Google search engine. To do this, a crawler calls up websites and follows all internal and external links in order to index as many pages as possible. To find the content and the links on a page, the web crawler reads the page's source code. If a website is password-protected or returns an error page, the crawler cannot access it and its source code is not read. Such a page will not be included in the index.

 

"It starts with crawling, it's practically the entry gate into the Google Index."
Rene Dhemant

 

What does indexing mean?

The data collected during crawling is indexed by the search engine operators and thus made available to the search engines. The index is the store of all crawled pages that have not been excluded by the website operators or classified as irrelevant by the search engines. The index forms the data basis that is accessed whenever a user makes a search query.

Such a search query sets a complex algorithm in motion in order to deliver the best possible results. The order of these results is referred to as the ranking. Which website lands in position 1 of the organic results is determined by various ranking factors.

Both crawling and indexing belong to the area of technical SEO.

How do I find out if my page is indexed?

If a new subpage is created on a website, in most cases the aim is for users to find it via the search engines. As we already know, the page has to be crawled and indexed for this. You can easily find out whether this was successful with a site query or via the Google Search Console.

Site query

In the search engine's search field, you can filter the search results by entering so-called search operators. These are entered directly in the search field together with your search query. “site” is one such search operator and helps you examine the indexing of your web pages or of specific URLs. Simply enter “site:www.domain.com” in the search field to display all indexed pages, or test a specific subpage with “site:https://www.domain.com/unterseite”. If the page appears in the search results, it has been crawled and indexed.
Image: site-abfrage.png

URL inspection in the Search Console

As Google already suggests in the site query, indexing can also be checked via the Search Console. The prerequisite, however, is that you have access to the website property. You can then check individual URLs in the search field of the Google Search Console. The coverage report contains all information about the crawling and indexing of the page:

Can pages drop out of the index?

If a URL has been included in the search engine's index, this is no guarantee that it will stay there. Various factors can cause pages to be removed from the index:
  • Password-protected: If access to the page has been blocked by a password, the page can no longer be crawled. A page that is accessible neither to the crawler nor to users is removed from the index.
  • Status codes 4xx & 5xx: If a page can no longer be found or accessed due to an error (client error 4xx or server error 5xx), it will sooner or later be removed from the index. Important pages affected in this way should, wherever possible, be redirected with the status code 301 in order to avoid losing rankings (see the example after this list).
  • Noindex: If the website operator places the meta robots tag “noindex” in the page's source code, this tells the search engine that the page should be removed from the index. If the tag is present from the beginning, the page will not be included in the index in the first place.
  • Violation of webmaster guidelines: Last but not least, Google, for example, penalizes pages that violate its webmaster guidelines by excluding them from the index.
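
How such a 301 redirect is set up depends on the web server or CMS in use. As a minimal sketch, assuming an Apache web server with mod_alias and purely hypothetical paths, a redirect in the .htaccess file could look like this:

  # Permanently redirect a removed page to its replacement (placeholder paths)
  Redirect 301 /old-page/ https://www.example.de/new-page/

The important point is the 301 status code, which signals a permanent move and lets the search engine transfer the old URL's ranking signals to the new one.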


 

What is meant by crawl budget & index budget?

For websites with few pages, and for websites whose URLs are usually indexed on the first day anyway, the crawl budget is only of minor importance. For large websites with several thousand URLs, however, it is important to optimize the crawl budget. But what is a crawl budget, and what are the crawl frequency and the crawl demand?
Put simply, the crawl budget is the number of URLs the bot can and wants to crawl on a website. This budget is made up of the crawl frequency (what the bot can crawl) and the crawl demand (what the bot wants to crawl).

The crawl frequency is the number of requests per second that the bot makes on a website while crawling. The following applies: the faster the loading time and the fewer server errors, the higher the frequency. Technical optimization of the site therefore has a clearly positive effect on the crawl frequency.
Search engine bots prefer to crawl pages that are popular. Popularity is determined by a number of factors such as links, dwell time and bounce rate. Pages that are classified as less popular and/or outdated are crawled less often or not at all.

The crawl demand is therefore a value that assesses how important it is for individual pages to be crawled regularly. Pages with little added value have a negative effect on crawling and indexing, which means that good content is only found later. In detail, little added value means duplicate content, soft 404 errors or spam. Optimizing a variety of factors is therefore necessary in order to improve both levers. This is the only way to ensure that all pages that are supposed to be indexed can actually be crawled.

And be careful: the crawl budget does get used up at some point. It can be strained by relaunches, changes to the URL structure or redirect chains, all of which force the URLs of a website to be crawled again and again. If the crawl budget is exhausted, important pages may not be crawled and therefore not be included in the search engine index, and consequently not be found by users either.

The crawl budget is distinct from the index budget. The latter refers to the number of pages of a single domain that are included in the search engines' index. This number is also limited, and only URLs that are crawled regularly have a chance of staying in the index.
Do you want to delve deeper into the subject? In our morefire Bar Chat, Rene Dhemant gives an insight into the topics of crawl budget and indexing.

 

How important are crawling & indexing in search engine optimization?

Search engine optimization aims to optimize a page so that, for certain keywords, it ideally lands in position 1 on the search results page. The basic building block of any such optimization is that the page can be crawled and indexed. As an SEO, you must therefore always pay attention to the crawlability of the website and be able to identify possible weaknesses so as not to endanger the indexing of the page. For SEOs it is especially important to steer the crawler in such a way that all relevant pages can be found and indexed. If pages are less relevant or cause duplicate content, the web crawler can be notified of this in various ways.

How can I control the crawling & indexing of my website?

In addition to passive influences such as page performance (loading speed, server errors, etc.), a webmaster can actively control the crawling and thus influence the indexing. This can be done in several different ways.

Crawling control via the robots.txt

With the help of a robots.txt, which must always be in the root directory of a domain (www.example.de/robots.txt), crawlers can be given various instructions:
  • Exclude individual crawlers from the entire site or from individual directories
  • Provide a reference to the address of one or more XML sitemaps
The directives in a robots.txt are only a recommendation and are not necessarily respected by search engines. You can find detailed information about the possibilities of a robots.txt under: robots.txt - What is it and how do I use it?
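
A minimal sketch of what such a robots.txt could look like, with purely hypothetical bot names and paths:

  # Keep all crawlers out of an internal directory (placeholder path)
  User-agent: *
  Disallow: /internal/

  # Exclude a single (hypothetical) crawler from the entire site
  User-agent: ExampleBot
  Disallow: /

  # Point crawlers to the XML sitemap
  Sitemap: https://www.example.de/sitemap.xml

As described above, well-behaved bots follow these rules, but they remain a recommendation rather than an enforcement mechanism.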

Prevent indexing with Noindex

The “noindex” meta tag is either implemented in the <head> section of a page, where it looks like this: <meta name="robots" content="noindex">, or it is returned as part of the HTTP response header. Crawl budget is consumed when such a page is accessed, but the page is not indexed. The noindex directive is binding, which means that the page will be removed from the search engine's index after the next crawl. Such tags are useful, for example, on the following pages:

  • URLs with parameters, created e.g. by filter functions
  • Search results pages
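
As a sketch, the HTTP header variant mentioned above uses the X-Robots-Tag; the response for a hypothetical search results page could then contain:

  HTTP/1.1 200 OK
  Content-Type: text/html; charset=UTF-8
  X-Robots-Tag: noindex

This variant is particularly useful for resources such as PDFs, where there is no HTML <head> in which to place a meta tag.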

Control indexing through canonicals

Unlike the noindex directive, canonicals are not binding on search engines, which means that there is no guarantee that search engines will follow the recommendations.
Canonicals look like this: <link rel="canonical" href="https://www.example.de/original-page/"> and are likewise implemented in the <head> section of a page.
Canonicals differ from the noindex directive in that the point here is not to remove a page from the index, but to recommend which URL should be indexed instead of the page that was just requested.
This is useful, for example, in an online shop when filter functions create duplicate content of category pages. You can read more about this in our morefire blog article The Canonical Tag!
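
A minimal sketch, using the placeholder shop URLs from the practical example further below: the filter page names the category page as its canonical URL.

  <!-- In the <head> of https://www.beispiel-shop.de/kategorie?filter-farben -->
  <link rel="canonical" href="https://www.beispiel-shop.de/kategorie">

The search engine is thereby asked to index the category page and to treat the filter URL as a variant of it.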

Crawling control via the Search Console

In the Search Console, for example, you can exclude URL parameters and reduce the crawl frequency.
URLs with certain parameters are excluded under “Legacy tools and reports” - “URL parameters”. This is useful, for example, to keep parameter URLs generated by filter settings on the website out of the index. It is important to mention that, first, these settings only apply to Google and have no relevance for other search engines, and second, the problem should ideally be solved by other means (robots.txt, noindex, canonicals) or by avoiding the generation of filter URLs on the website altogether, so that the emergency solution via the Google Search Console is not needed in the first place.

Under “Legacy tools and reports” - “More information” - “Crawl rate settings”, a maximum crawl frequency can also be set. Values from a few to many requests per second can be selected here. The crawl frequency should only be limited if Googlebot is slowing down the site's server. Attention: this setting is only valid for 90 days and must then be made again. The same applies here: this is only an emergency solution! If crawlers are slowing down the server, the priority should be to optimize the server performance.


Practical example: Avoid duplicate content with parameter URLs

Suppose there is a category page: https://www.beispiel-shop.de/kategorie
And several filter URLs, such as these:

  • https://www.beispiel-shop.de/kategorie?filter-farben
  • https://www.beispiel-shop.de/kategorie?filter-preis

This creates duplicate content, because apart from the products displayed, all three URLs are identical (meta data, heading, text, etc.). Here are some advantages and disadvantages of the four different variants:

Search Console

Advantages:

  • Binding for Google

Disadvantages:

  • Relatively complicated configuration
  • Only applies to Google, has no relevance for other search engines

robots.txt

Advantages:

  • Valid for all search engines

Disadvantages:

  • It is only a recommendation and is not binding

noindex

Advantages:

  • Binding method for removing pages from the index
  • Valid for all search engines

Disadvantages:

  • No reference to the relevant (canonical) page possible

Canonical

Advantages:

  • Valid for all search engines
  • Reference to the relevant page, in this case the category page

Disadvantages:

  • Only a recommendation; usually followed, but not always

 

In this case I would set the filter URLs to noindex. This ensures that only the category page is included in the index and that no duplicate content arises.

However, there are two alternatives to this.
First, some CMS systems can be set up so that filtering does not change the URL. However, this is only possible with a few CMS systems and requires extensive technical know-how.

Second, there is the option of optimizing individual filter pages.
Instead of setting a URL (example: https://www.beispiel-shop.de/kategorie?filter-farben) to noindex, you can also give it an individual title tag, a meta description, an H1 heading and a specific text. The page is then no longer a duplicate of the actual category page and can even be linked to and used to generate rankings of its own. Whether this is feasible also depends largely on the CMS and the technology used; a rough sketch of such a page follows below.
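
As a rough sketch, with all titles, texts and the filter URL as made-up placeholders, the markup of such an individually optimized filter page could look like this:

  <!-- https://www.beispiel-shop.de/kategorie?filter-farben (placeholder URL) -->
  <head>
    <title>Red shirts | Example Shop</title>
    <meta name="description" content="Discover our selection of red shirts in all sizes.">
  </head>
  <body>
    <h1>Red shirts</h1>
    <p>Short introductory text written specifically for the colour filter ...</p>
    <!-- ... filtered product list ... -->
  </body>

With its own title, description, heading and text, the filter URL no longer competes with the category page as a duplicate and can target its own keyword.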

Conclusion

It is very important for website operators to steer how search engines crawl their site and to control the indexing of the individual URLs. There are many ways to do this, from a lean, flat page hierarchy to excluding individual pages from the index. In this article I have explained the theoretical basics and illustrated one possible implementation with a practical example. The topic is very complex, and depending on the use case individual solutions have to be worked out, with the goal of getting relevant URLs into the index and keeping non-relevant URLs and duplicates out of it.


Claudia moved to Cologne after studying media communication in Würzburg and works as an SEO consultant at morefire. She feels most at home in technical SEO, and when she is not working she is usually on boards of all kinds, whether skateboard, snowboard or surfboard.