Search systems, search engines and spider robots. The future of search engines

How search engine robots work

A search robot (spider, bot) is a small program capable of visiting millions of websites and scanning gigabytes of text without operator involvement. Reading pages and saving text copies of them is the first stage of indexing new documents. It should be noted that search engine robots do not perform any processing of the data they receive; their task is only to store textual information.


List of search robots

Of all the search engines that crawl the Runet, Yandex has the largest collection of bots. The following bots are responsible for indexing:

  • the main indexing robot that collects data from web pages;
  • a bot capable of recognizing mirrors;
  • Yandex search robot that indexes images;
  • a spider browsing the pages of sites accepted in the YAN;
  • robot scanning favicon icons;
  • several spiders that determine the availability of site pages.

Google's main search robot collects textual information. It mostly looks at HTML files, and analyzes JS and CSS at regular intervals; it can accept any content type allowed for indexing. Google also has a spider that controls the indexing of images, as well as a search robot that supports the functioning of mobile search.

See the site through the eyes of a search robot

To correct code errors and other shortcomings, a webmaster can find out how the search robot sees the site. Google provides this option: go to the webmaster tools and open the "scan" tab. In the window that opens, select the line "browse as Googlebot", then enter the address of the page under study in the search form (without the domain or the http:// protocol).

By selecting the "get and display" command, the webmaster can visually assess the state of the site page. To do this, click the "request to display" checkbox. A window opens with two versions of the web document, so the webmaster learns how a regular visitor sees the page and in what form it is available to the search spider.

Tip! If the web document being analyzed is not yet indexed, you can use the "add to index" >> "crawl only this URL" command. The spider will analyze the document within a few minutes, and the web page will soon appear in the search results. The monthly limit is 500 indexing requests.

How to influence indexing speed

Having found out how search robots work, a webmaster can promote his site much more efficiently. One of the main problems of many young web projects is poor indexing: search engine robots are reluctant to visit non-authoritative Internet resources.
It has been established that indexing speed depends directly on how intensively a site is updated. Regularly adding unique text materials will attract the search engine's attention.

To speed up indexing, you can use social bookmarking services and Twitter. It is also recommended to generate a Sitemap and upload it to the root directory of the web project.
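As an illustration, here is a minimal Python sketch of generating such a Sitemap file for upload to the site root. The page list, change frequencies and file name are placeholders, not a prescription.

# Minimal sitemap.xml generator (sketch). The URLs and change
# frequencies below are illustrative placeholders.
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Render (url, changefreq) pairs as sitemap XML."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for url, changefreq in pages:
        lines += ["  <url>",
                  f"    <loc>{escape(url)}</loc>",
                  f"    <changefreq>{changefreq}</changefreq>",
                  "  </url>"]
    lines.append("</urlset>")
    return "\n".join(lines)

pages = [("https://example.com/", "daily"),
         ("https://example.com/articles/", "weekly")]
with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write(build_sitemap(pages))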

Looking through the server logs, you can sometimes observe excessive interest in a site from search robots. If the bots are useful (for example, the indexing bots of search engines), all you can do is watch, even if the load on the server increases. But there are also many secondary robots whose access to the site is not required. For myself and for you, dear reader, I have collected this information and turned it into a convenient table.
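Before the table, here is a minimal sketch of this kind of log review in Python. It assumes a combined-format access log named access.log and a hand-picked list of bot names; both are illustrative.

# Count visits per crawler in a combined-format access log (sketch).
# The log path and the bot-name list are assumptions for illustration.
import re
from collections import Counter

BOT_NAMES = ["Googlebot", "YandexBot", "bingbot", "AhrefsBot", "Slurp"]

def count_bot_visits(log_path):
    counts = Counter()
    ua_at_end = re.compile(r'"([^"]*)"$')  # the user agent is the last quoted field
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = ua_at_end.search(line.rstrip())
            if not match:
                continue
            user_agent = match.group(1).lower()
            for bot in BOT_NAMES:
                if bot.lower() in user_agent:
                    counts[bot] += 1
    return counts

print(count_bot_visits("access.log"))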

Who are search robots

A search bot (also called a robot, crawler or spider) is nothing more than a program that searches for and scans the content of sites by following the links on their pages. Search robots are not only for search engines: the Ahrefs service, for example, uses spiders to improve its backlink data, and Facebook scrapes page code to display link reposts with titles, pictures and descriptions. Web scraping is the collection of information from various resources.

Using spider names in robots.txt

As you can see, any serious project related to content search has its own spiders. And sometimes it becomes an urgent task to restrict the access of certain spiders to the site or to individual sections of it. This can be done through the robots.txt file in the root directory of the site. I wrote more about setting up robots.txt earlier, and I recommend reading it.

Please note that search robots can ignore the robots.txt file and its directives; directives are only guidelines for bots.

To set directives for a particular search robot, use a section addressed to that robot's user agent. Sections for different spiders are separated by a blank line.

User-agent: Googlebot
Allow: /

The above is an example of addressing Google's main crawler.
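To see how a well-behaved bot interprets such directives, here is a small sketch using Python's standard urllib.robotparser module. The rules are a made-up example in which Googlebot is allowed everywhere and AhrefsBot is banned from the whole site.

# Check what a given user agent may fetch under a robots.txt (sketch).
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: AhrefsBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/blog/post-1"))   # True
print(parser.can_fetch("AhrefsBot", "/blog/post-1"))   # False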

Initially I planned to add entries to the table showing how search bots identify themselves in server logs. But since this data is of little importance for SEO, and each agent token can have several types of log records, I decided to make do with just the names of the bots and their purpose.

Google search robots

  • Googlebot - the main crawler-indexer for PC and smartphone-optimized pages
  • Mediapartners-Google - AdSense ad network robot
  • APIs-Google - user agent of Google APIs
  • AdsBot-Google - checks the quality of ads on web pages designed for PC
  • AdsBot-Google-Mobile - checks the quality of ads on web pages designed for mobile devices
  • Googlebot Image (Googlebot) - indexes images on site pages
  • Googlebot News (Googlebot) - looks for pages to add to Google News
  • Googlebot Video (Googlebot) - indexes video content
  • AdsBot-Google-Mobile-Apps - checks the quality of ads in apps for Android devices, works on the same principles as the regular AdsBot

Yandex search robots

  • Yandex - when this agent token is specified in robots.txt, the directives apply to all Yandex bots
  • YandexBot - the main indexing robot
  • YandexDirect - downloads information about the content of YAN partner sites
  • YandexImages - indexes site images
  • YandexMetrika - Yandex.Metrica robot
  • YandexMobileBot - downloads documents to check whether they have a mobile layout
  • YandexMedia - robot indexing multimedia data
  • YandexNews - Yandex.News indexer
  • YandexPagechecker - microdata validator
  • YandexMarket - Yandex.Market robot
  • YandexCalendar - Yandex.Calendar robot
  • YandexDirectDyn - generates dynamic banners (Direct)
  • YaDirectFetcher - downloads pages with advertisements to check their availability and clarify their topics (YAN)
  • YandexAccessibilityBot - downloads pages to check their availability for users
  • YandexScreenshotBot - takes a snapshot (screenshot) of the page
  • YandexVideoParser - Yandex.Video service spider
  • YandexSearchShop - downloads YML files of product catalogs
  • YandexOntoDBAPI - object answer robot that downloads dynamic data

Other popular search bots

  • Baiduspider - spider of the Chinese search engine Baidu
  • cliqzbot - robot of the anonymous search engine Cliqz
  • AhrefsBot - Ahrefs bot (link analysis)
  • Genieo - robot of the Genieo service
  • bingbot - Bing search engine crawler
  • Slurp - Yahoo search engine crawler
  • DuckDuckBot - DuckDuckGo web crawler
  • facebot - Facebook web crawling robot
  • WebAlta (WebAlta Crawler/2.0) - WebAlta search crawler
  • BomboraBot - scans pages involved in the Bombora project
  • CCBot - Nutch-based crawler that uses the Apache Hadoop project
  • MSNBot - MSN search engine bot
  • Mail.Ru - Mail.Ru search engine crawler
  • ia_archiver - scrapes data for the Alexa service
  • Teoma - Ask service bot

There are a great many search bots; I have selected only the most popular and well-known ones. If you have encountered bots that crawl sites aggressively and persistently, please mention them in the comments and I will add them to the table as well.


There are more than a hundred million resources on the Internet, and millions of the pages we need will never be known to us. How do we find the drop we need in this ocean? This is where the search engine comes to our aid. Its spider, and only it, knows what lies where on the web.

Internet search engines are sites specially designed to help you find the information you need on the World Wide Web. All search engines share three main functions:

- search engines "search" the Internet for the given keywords;
- the addresses found are indexed by the search engine together with the words;
- the indexed web pages form the database from which search engines serve users who search for keywords or combinations of them.

The first search engines received up to 2,000 queries per day and indexed hundreds of thousands of pages. Today the number of queries per day runs into the hundreds of millions, and the number of indexed pages into the tens of millions.

Search engines before the World Wide Web

The first Internet search engines were the Gopher and Archie programs. They indexed files located on servers connected to the Internet, repeatedly cutting the time needed to find the necessary documents. In the late 1980s, the ability to work on the Internet came down to the ability to use Archie, Gopher, Veronica and similar search programs.

Today the web has become the most popular part of the Internet, and the majority of Internet users search only the World Wide Web (WWW).

Robot spider

The robot program used in search engines, also called a "spider", performs the process of building a list of the words found on a web resource's pages. This process is called web crawling. The search spider then looks through many other pages, building and recording a list of useful words, i.e. words that carry some meaning or weight.

A spider's journey through the web starts with the largest servers and the most popular web pages. Having crawled such a site and indexed all the words it found, it goes off to crawl other sites via the links it discovered. In this way, the robot spider covers the entire web space.
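Here is a minimal Python sketch of that crawling loop: take a URL from the queue, fetch the page, collect its links and queue the new ones. Real spiders add politeness delays, robots.txt checks and distributed storage; the start URL and page limit here are placeholders.

# Breadth-first crawl sketch: fetch pages, collect links, queue them.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Gather the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    queue, seen, crawled = deque([start_url]), {start_url}, 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable page: skip it
        crawled += 1
        print("crawled:", url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

crawl("https://example.com/")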

Google's founders, Sergey Brin and Larry Page, give an example of how Google's spiders work. Several run at once: a search starts with three spiders, each of which supports up to 300 page connections at a time. At peak load, four spiders can process up to a hundred pages per second, generating traffic of about 600 kilobytes/sec. By the time you read this, those numbers may well seem ridiculous.

Keywords for the search engine robot

Usually the owner of a web resource wants to appear in the search results for particular search words. These words are called keywords, and they define the essence of a web page's content. Meta tags help here: they offer the search robot a choice of keywords to use when indexing the page. But we do not recommend filling meta tags with popular queries unrelated to the content of the page itself. Search engine bots fight this phenomenon, and you will be lucky if the engine merely ignores meta tags whose keywords do not correspond to the content of the pages.

Meta tags are a very useful tool when the keywords in them are repeated several times in the text of the page. But do not overdo it: there is a chance the robot will take the page for a doorway.
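As an illustration of what a robot reads here, the following sketch pulls the title and the keywords meta tag out of a page with Python's standard HTMLParser; the HTML sample is invented.

# Extract the <title> and the keywords meta tag from HTML (sketch).
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.keywords = ""
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
            self.keywords = attrs.get("content", "")
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

reader = MetaReader()
reader.feed('<html><head><title>Spider robots</title>'
            '<meta name="keywords" content="search, robot, spider">'
            '</head><body>...</body></html>')
print(reader.title, "|", reader.keywords)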

Search engine indexing algorithms

Search engine algorithms all aim at an effective final result, but each takes a different approach. Lycos search robots index the words in the title, the links, the hundred most frequently used words on the page, and every word from the first 20 lines of page content.

Googlebot takes into account the location of a word on the page (within the body element). Words in structural sections such as subheadings, the title and the meta tags are marked as especially important, while the articles "a", "an" and "the" are excluded.
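As a toy illustration of this kind of location-based weighting, the sketch below scores words higher when they occur in the title or headings and skips stop words. The weights and the stop-word list are invented; this is not Googlebot's actual algorithm.

# Toy word weighting by page section (sketch; invented weights).
from collections import Counter

STOP_WORDS = {"a", "an", "the"}
WEIGHTS = {"title": 5, "heading": 3, "meta": 3, "body": 1}

def score_words(sections):
    """sections: mapping of section name to its text."""
    scores = Counter()
    for section, text in sections.items():
        weight = WEIGHTS.get(section, 1)
        for word in text.lower().split():
            if word not in STOP_WORDS:
                scores[word] += weight
    return scores

page = {"title": "Search robots", "heading": "How a spider works",
        "body": "The spider follows the links on a page"}
print(score_words(page).most_common(5))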

Other search engines may approach the indexing of the words used in users' search queries somewhat differently.

Search engine robots, sometimes referred to as spiders or crawlers, are software modules that search for web pages. How do they work? What do they really do? Why are they important?

With all the buzz around search engine optimization and search engine index databases, you might think that robots must be great and powerful beings. Not true. Search engine robots have only basic capabilities, similar to those of early browsers in terms of what information they can recognize on a site. Like early browsers, robots simply cannot do certain things: they do not understand frames, Flash animations, images or JavaScript; they cannot enter password-protected sections or click every button on a site; they can get stuck indexing dynamic URLs and can be very slow, to the point of stopping, and are powerless when faced with JavaScript navigation.

How do search engine robots work?

Web crawlers should be thought of as automated data mining programs that surf the web in search of information and links to information.

When you register a web page with a search engine through its Submit a URL page, a new URL is added to the robot's queue of sites to visit. Even if you do not register a page, many robots will find your site anyway, because other sites link to yours. This is one of the reasons why it is important to build link popularity and place links on other thematic resources.

When they come to your site, robots first check whether a robots.txt file is present. This file tells robots which sections of your site should not be indexed; usually these are directories containing files the robot has no interest in or should not know about.

Robots store and collect links from every page they visit and later follow those links to other pages. The entire World Wide Web is built of links: the original idea behind the Internet was that it should be possible to follow links from one place to another. This is how robots move.

The ingenuity of real-time page indexing rests with the search engines' engineers, who invented the methods used to evaluate the information collected by the robots. Once embedded in the search engine's database, the information becomes available to users performing searches. When a search engine user enters a query, a series of quick calculations is made to make sure that the set of sites returned really is the most relevant answer.
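The core data structure behind those quick calculations is an inverted index: a prebuilt mapping from each word to the documents containing it, consulted at query time instead of scanning pages. A toy version follows, with invented documents and a naive match-count ranking.

# Toy inverted index and query (sketch; naive ranking by word matches).
from collections import defaultdict

docs = {1: "search robots crawl the web",
        2: "spiders follow links between pages",
        3: "robots index pages for the search engine"}

index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

def search(query):
    hits = defaultdict(int)  # doc id -> number of query words matched
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(search("search robots"))  # docs 1 and 3 each match both words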

You can see which pages of your site the search robot has already visited from the server log files or from the results of statistical processing of a log file. By identifying the robots, you can see when they visited your site, which pages they took and how often. Some robots are easily identified by name, like Google's Googlebot; others are more hidden, like Inktomi's Slurp. Still other robots may turn up in the logs that you cannot identify at once; some of them may even be human-operated browsers.

In addition to identifying unique crawlers and counting their visits, statistics can also show you aggressive, bandwidth-eating crawlers, or crawlers that you do not want visiting your site.
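One way to spot the bandwidth eaters, sketched below, is to sum the response bytes per user agent in a combined-format log; the log path is a placeholder.

# Sum response bytes per user agent in a combined-format log (sketch).
import re
from collections import Counter

# status code, byte count, then (after the referer) the final quoted user agent
FIELDS = re.compile(r'" (\d{3}) (\d+|-) .* "([^"]*)"$')

def bytes_per_agent(log_path):
    totals = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = FIELDS.search(line.rstrip())
            if match and match.group(2) != "-":
                totals[match.group(3)] += int(match.group(2))
    return totals

for agent, sent in bytes_per_agent("access.log").most_common(10):
    print(f"{sent:>12} bytes  {agent}")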

How do they read the pages of your website?

When a crawler visits a page, it scans the visible text, the content of various tags in the page's source code (the title tag, meta tags, etc.), and the hyperlinks on the page. Judging by the words in the links, the search engine decides what the page is about. Many factors "play a role" in calculating the key points of a page, and each search engine has its own algorithm for evaluating and processing the information. Depending on how the robot is configured, the information is indexed and then delivered to the search engine's database.

After that, the information delivered to the search engine's index databases becomes part of the engine's ranking process. When a visitor makes a query, the search engine goes through its entire database to return a final list relevant to the search query.

Search engine databases are carefully processed and kept up to date. If you are already in the database, robots will visit you periodically to pick up any changes to your pages and make sure they hold the latest information. The number of visits depends on the settings of the search engine, which can vary according to its type and purpose.

Sometimes search robots are unable to index a website. If your site has crashed or is receiving a large number of visitors, the robot may fail in its attempt to index it. When this happens, the site cannot be re-indexed until the robot returns, which depends on how often it visits. In most cases, robots that could not reach your pages will try again later, in the hope that your site will soon be available.

Many crawlers cannot be identified when you read the logs: they may be visiting you while the logs report someone using a Microsoft browser, and so on. Some robots identify themselves with the name of a search engine (googlebot) or of its clone (Scooter = AltaVista).


Search engine databases are modified at various times. Even directories with secondary search results use robot data as the content of their websites.

In fact, search engines do not use robots only for the above. There are robots that check databases for new content, revisit old database content, check whether links have changed, download entire sites for browsing, and so on.

For this reason, reading the log files and keeping track of the search engine results helps you keep an eye on the indexing of your projects.

A search robot is a special program of a search engine designed to enter into the database (to index) the sites found on the Internet and their pages. Other names are also used: crawler, spider, bot, automatic indexer, ant, webcrawler, webscutter, webrobot, webspider.

Principle of operation

A search robot is a browser-type program. It constantly scans the network: it visits indexed (already known to it) sites, follows links from them and finds new resources. When a new resource is found, the robot adds it to the search engine's index. The search robot also indexes updates on sites, at a fixed frequency: a site that is updated once a week will be visited by a spider at that frequency, while content on news sites can be indexed within minutes of publication. If no links from other resources lead to a site, then to attract search robots the resource must be added through a special form (Google Webmaster Center, Yandex Webmaster Panel, etc.).
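As a rough illustration of that revisit logic, the sketch below schedules the next visit from the observed update profile of a site. The profiles, intervals and site list are invented.

# Schedule a spider's next visit from a site's update profile (sketch).
from datetime import datetime, timedelta

REVISIT = {"weekly": timedelta(days=7),
           "daily": timedelta(days=1),
           "news": timedelta(minutes=10)}

def next_visit(last_visit, update_profile):
    """Return when the spider should come back to the site."""
    return last_visit + REVISIT[update_profile]

now = datetime.now()
for site, profile in [("example.com/blog", "weekly"),
                      ("example.com/news", "news")]:
    print(site, "->", next_visit(now, profile))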

Types of search robots

Yandex spiders:

  • Yandex/1.01.001 (I) - the main indexing bot,
  • Yandex/1.01.001 (P) - indexes pictures,
  • Yandex/1.01.001 (H) - finds site mirrors,
  • Yandex/1.03.003 (D) - determines whether a page added through the webmaster panel matches the indexing parameters,
  • YaDirectBot/1.0 (I) - indexes resources from the Yandex advertising network,
  • Yandex/1.02.000 (F) - indexes site favicons.

Google Spiders:

  • Googlebot is the main robot,
  • Googlebot News - crawls and indexes news,
  • Google Mobile - indexes websites for mobile devices,
  • Googlebot Images - searches and indexes images,
  • Googlebot Video - indexes videos,
  • Google AdsBot - checks the quality of the landing page,
  • Google Mobile AdSense and Google AdSense - index the sites of the Google advertising network.

Other search engines also use several types of robots that are functionally similar to those listed.


