Search engine robots: search engines, their robots and spiders

A search robot is a special search engine program designed to find sites on the Internet and enter them and their pages into the search engine's database (indexing). Other names are also in use: crawler, spider, bot, automatic indexer, ant, webcrawler, webscutter, webrobot, webspider.

Principle of operation

A search robot is a browser-type program. It constantly scans the network: it visits indexed (already known to it) sites, follows links from them, and finds new resources. When a new resource is discovered, the robot adds it to the search engine's index. The robot also indexes updates on sites, at a frequency that it records for each site. For example, a site that is updated once a week will be visited by a spider at roughly that frequency, while content on news sites can be indexed within minutes of publication. If no links from other resources lead to a site, then to attract search robots the resource must be submitted through a special form (Google Webmaster Center, Yandex Webmaster Panel, etc.).
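As a sketch of this crawl loop, here is a toy model in Python: the "web" is just a dictionary mapping page addresses to the links found on them (a stand-in for real HTTP fetching, with invented URLs), and the robot walks it breadth-first, indexing every page it discovers.

```python
from collections import deque

# A toy "web": each URL maps to the list of links found on that page.
# This replaces real HTTP fetching purely for illustration.
TOY_WEB = {
    "a.example/": ["a.example/page1", "b.example/"],
    "a.example/page1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def crawl(seed):
    """Breadth-first crawl: visit known pages, follow links, queue new ones."""
    index = []              # order in which pages were indexed
    seen = {seed}
    frontier = deque([seed])
    while frontier:
        url = frontier.popleft()
        index.append(url)                  # "add it to the search engine's index"
        for link in TOY_WEB.get(url, []):  # links extracted from the page
            if link not in seen:           # only previously unknown resources are queued
                seen.add(link)
                frontier.append(link)
    return index

print(crawl("a.example/"))  # ['a.example/', 'a.example/page1', 'b.example/', 'c.example/']
```

Note that, exactly as the paragraph above says, a page such as `c.example/` is only found because some already-known page links to it; a page with no inbound links would never enter the frontier.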

Types of search robots

Yandex spiders:

  • Yandex/1.01.001 (I) - the main indexing bot,
  • Yandex/1.01.001 (P) - indexes pictures,
  • Yandex/1.01.001 (H) - finds site mirrors,
  • Yandex/1.03.003 (D) - determines whether a page added from the webmaster panel matches the indexing parameters,
  • YaDirectBot/1.0 (I) - indexes resources from the Yandex advertising network,
  • Yandex/1.02.000 (F) - indexes site favicons.

Google Spiders:

  • Googlebot - the main robot,
  • Googlebot News - crawls and indexes news,
  • Google Mobile - indexes websites for mobile devices,
  • Googlebot Images - searches for and indexes images,
  • Googlebot Video - indexes videos,
  • Google AdsBot - checks the quality of landing pages,
  • Google Mobile AdSense and Google AdSense - index the sites of the Google advertising network.

Other search engines also use several types of robots that are functionally similar to those listed.
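In practice, a robot identifies itself through the User-Agent header of its HTTP requests, so a site owner can tell bots apart in the server logs. The sketch below is a hypothetical log-analysis helper that classifies a visitor by matching well-known User-Agent substrings; the substrings used (Googlebot, YandexBot, etc.) are the bots' modern names, which differ from the older version strings listed above, and vendors do change them over time.

```python
# Hypothetical helper: classify a visitor by its User-Agent string.
# More specific names are listed before more general ones on purpose.
BOT_PATTERNS = {
    "Googlebot-News": "Google news crawler",
    "Googlebot-Image": "Google image crawler",
    "Googlebot": "Google main crawler",      # checked after the specific variants
    "YandexImages": "Yandex image crawler",
    "YandexBot": "Yandex main crawler",
}

def classify_agent(user_agent):
    for needle, label in BOT_PATTERNS.items():  # dicts keep insertion order
        if needle in user_agent:
            return label
    return "regular visitor"

print(classify_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

Because the patterns are checked in order, a User-Agent containing "Googlebot-News" is reported as the news crawler rather than falling through to the generic "Googlebot" match.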

A search engine is a site that specializes in finding information that matches the user's query criteria. The main task of such sites is to organize and structure the information on the network.

Most people who use the services of a search engine never wonder exactly how the machine digs the necessary information out of the depths of the Internet.

For an ordinary network user, the principles behind search engines are not critical: the algorithms that drive the system can satisfy the needs of a person who has no idea how to craft an optimized query. But for a web developer, and for specialists involved in website optimization, at least a basic understanding of the structure and principles of search engines is essential.

Each search engine operates on precise algorithms that are kept strictly confidential and known only to a small circle of employees. But when designing or optimizing a site, it is essential to take into account the general rules by which search engines function, and those rules are discussed in this article.

Although each search engine (PS) has its own structure, on closer study they can all be reduced to the same basic, generalized components:

Indexing module

The indexing module includes three auxiliary components (bots):

1. Spider - downloads pages and filters the text stream, extracting all internal hyperlinks from it. In addition, the spider saves the download date, the server response header, and the URL (the page address).

2. Crawler (crawling spider) - analyzes all the links on a page and, based on this analysis, determines which pages to visit and which are not worth visiting. In the same way, the crawler finds new resources to be processed by the search engine.

3. Indexer - analyzes the Internet pages downloaded by the spider. The page is divided into blocks and analyzed by the indexer using morphological and lexical algorithms. Various parts of a web page come under the indexer's analysis: headings, text, and other service information.

All documents processed by this module are stored in the search engine's database, called the system index. In addition to the documents themselves, the database contains the necessary service data: the result of careful processing of these documents, which the search engine relies on to fulfill user requests.
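To make the spider/crawler/indexer split concrete, here is a minimal sketch using only Python's standard library: an HTML parser that pulls out exactly the pieces described above, the hyperlinks (food for the crawler's list) and the page title (one of the blocks the indexer analyzes). The sample HTML is invented for illustration.

```python
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect hyperlinks and the page title from raw HTML."""
    def __init__(self):
        super().__init__()
        self.links = []      # what the crawler needs
        self.title = ""      # one block the indexer analyzes
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

html = ('<html><head><title>Demo</title></head>'
        '<body><a href="/p1">one</a><a href="/p2">two</a></body></html>')
p = PageParser()
p.feed(html)
print(p.title, p.links)  # Demo ['/p1', '/p2']
```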

Search server

The next, very important component of the system is the search server, whose task is to process a user request and generate the search results page.

When processing a user's request, the search server calculates a relevance rating of the selected documents with respect to that request. This rating determines the position a web page will take in the search results. Each document that matches the search criteria is displayed on the results page as a snippet.
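A toy illustration of this ranking step, with an intentionally crude scoring function (raw term counts): real search servers combine far richer signals, so treat this only as a sketch of the idea, with invented documents.

```python
def relevance(doc, query):
    """Toy relevance score: how many times the query's words occur
    in the document. Real engines use far richer signals."""
    words = doc.lower().split()
    return float(sum(words.count(term) for term in query.lower().split()))

docs = {
    "doc1": "search robots index pages for the search engine",
    "doc2": "spiders crawl the web following links",
}
# Sort documents by descending relevance to the query.
ranked = sorted(docs, key=lambda d: relevance(docs[d], "search robots"), reverse=True)
print(ranked)  # ['doc1', 'doc2']
```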

A snippet is a short description of a page, including its title, link, keywords, and brief text. From the snippet, the user can judge how relevant the pages selected by the search engine are to his query.

The most important criterion that the search server uses when ranking query results is the TCI (thematic citation index) indicator already familiar to us.

All of the described components of a search engine are expensive and very resource-intensive. The performance of a search engine directly depends on how effectively these components interact.




    Hello friends! Today you will learn how the Yandex and Google search robots work and what function they perform in website promotion. So let's go!

    Search engines perform this work in order to find, out of a million sites, the ten web projects that give a high-quality, relevant response to the user's query. Why only ten? Because a search results page consists of only ten positions.

    Search robots are friends of both webmasters and users

    Why visits by search robots matter for a site is already clear, but what does the user get out of it? Exactly this: the user opens only those sites that respond to his request in full.

    A search robot is a very flexible tool: it can find even a site that has just been created, whose owner has not yet done any promotion. That is why this bot was called a spider: it can stretch its legs and get anywhere on the virtual web.

    Can you control a search robot in your own interests?

    There are times when certain pages do not appear in search. This is usually because the page has not yet been indexed by a search robot. Of course, sooner or later the robot will notice the page. But that takes time, sometimes quite a lot of it. Here, though, you can help the search robot visit the page sooner.

    To do this, you can place links to your site in special directories or lists and on social networks: in general, everywhere the search robot practically lives. Social networks, for example, are updated every second. Announce your site there, and the search robot will come to it much faster.

    From this follows one main rule: if you want search engine bots to visit your site, feed them new content regularly. When they notice that the content is being updated and the site is developing, they will visit your Internet project much more often.

    Each search robot remembers how often your content changes. It evaluates not only quality but also time intervals. If the material on a site is updated once a month, the robot will come to the site once a month.

    Thus, if the site is updated once a week, the search robot will come once a week. If you update the site every day, the robot will visit it every day or every other day. Some sites are indexed within minutes of an update: social media, news aggregators, and sites that publish several articles a day.
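This revisit-frequency behavior can be sketched as a toy scheduler: estimate the average interval between observed content changes and revisit at roughly that interval. The clamping bounds and the one-week default below are assumptions for illustration, not anything a real search engine documents.

```python
from datetime import datetime, timedelta

def next_visit_interval(change_timestamps):
    """Toy recrawl scheduler: revisit roughly as often as the content
    has changed on average, clamped between 1 hour and 30 days."""
    if len(change_timestamps) < 2:
        return timedelta(days=7)  # assumed default for unknown sites
    gaps = [b - a for a, b in zip(change_timestamps, change_timestamps[1:])]
    average = sum(gaps, timedelta()) / len(gaps)
    return max(timedelta(hours=1), min(average, timedelta(days=30)))

# A site whose content changed once a day is revisited daily.
daily = [datetime(2024, 1, d) for d in (1, 2, 3, 4)]
print(next_visit_interval(daily))  # 1 day, 0:00:00
```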

    How do you give a robot a task, or forbid it something?

    At the very beginning we learned that search engines have several robots performing different tasks: one looks for pictures, another for links, and so on.

    You can control any robot with a special file, robots.txt. The robot begins its acquaintance with a site from this file. In it, you can specify whether the robot may index the site and, if so, which sections. All these instructions can be written for one robot or for all of them.
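For example, Python's standard urllib.robotparser module can read such a file and answer, for a given robot and URL, whether fetching is allowed. The robots.txt content, the domain, and the "BadBot" name below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# A sample robots.txt: all robots may index everything except /private/,
# and the hypothetical "BadBot" is blocked entirely.
robots_txt = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("Googlebot", "https://example.com/page"))       # True
print(rp.can_fetch("Googlebot", "https://example.com/private/x"))  # False
print(rp.can_fetch("BadBot", "https://example.com/page"))          # False
```

Well-behaved robots consult exactly these rules before indexing; the file is advisory, so it restrains polite bots rather than enforcing anything.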

    Website promotion training

    I talk about the finer points of SEO promotion in the Google and Yandex search engines in more detail over Skype. I have raised the traffic of all my web projects and get excellent results from it. I can teach you too, if you're interested!

    Friends, I greet you again! Now we will look at what search robots are and talk in detail about the Google search robots and how to be friends with them.

    First you need to understand what search robots (also called spiders) are in general. What job do search engine spiders do?

    These are programs that check websites. They look through all the posts and pages on your blog and collect information, which they then pass to the database of the search engine they work for.

    You do not need to know the entire list of search robots; the most important thing is to know that Google now has two main spiders, called "Panda" and "Penguin". They fight low-quality content and junk links, and you need to know how to repel their attacks.

    The Google Panda search robot was created to promote only high-quality material in search. All sites with low-quality content are lowered in the search results.

    This spider first appeared in 2011. Before it appeared, any site could be promoted by publishing large amounts of text in articles and stuffing in huge numbers of keywords. Together, these two techniques pushed low-quality content to the top of the search results, while good sites went down.

    "Panda" immediately put things in order by checking all the sites and put everyone in their rightful places. Although she struggles with low-quality content, even small sites with quality articles can be promoted now. Although it was useless to promote such sites before, they could not compete with the giants who have a large amount of content.

    Now let's figure out how to avoid "Panda" sanctions. First we must understand what it does not like. I already wrote above that it fights bad content, but what kind of text counts as bad? Let's figure that out, so as not to publish it on our sites.

    The Google search robot strives to ensure that only high-quality materials are served to searchers in this engine. If you have articles that contain little information and look unattractive, urgently rewrite those texts so that "Panda" does not get to you.

    Quality content can be either long or short, but if the spider sees a long article with a lot of information, it will benefit the reader more.

    Next comes duplication, in other words plagiarism. If you think you can rewrite other people's articles for your blog, you can immediately put an end to your site. Copying is severely punished with a filter, and plagiarism is very easy to check; I wrote an article on how to check texts for uniqueness.
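One common way such duplication is detected (a sketch of the general shingling idea, not Google's actual algorithm) is to compare the word shingles of two texts; a Jaccard similarity near 1.0 suggests copying. The sample sentences are invented.

```python
def shingles(text, k=3):
    """All k-word shingles (overlapping word windows) of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def similarity(a, b, k=3):
    """Jaccard similarity between shingle sets: |A & B| / |A | B|."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

original = "search robots crawl the web and index pages for search engines"
copy = "search robots crawl the web and index pages for search engines"
fresh = "write unique articles so that readers and spiders both benefit"
print(similarity(original, copy))   # 1.0
print(similarity(original, fresh))  # 0.0
```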

    The next thing to watch for is oversaturation of the text with keywords. Anyone who thinks he can write an article out of the same keywords over and over and take first place in the search results is very much mistaken. I have an article on how to check pages for relevance; be sure to read it.

    Another thing that can attract "Panda" to you is old articles that are morally outdated and no longer bring traffic to the site. They need to be updated.

    There is also the Google search robot "Penguin". This spider fights spam and junk links on your site. It also tracks down links purchased from other resources. Therefore, to have nothing to fear from this robot, do not buy links; instead, publish high-quality content so that people link to you on their own.

    Now let's formulate what needs to be done to make the site look perfect through the eyes of a search robot:

    • To make quality content, first study the topic well before writing an article, and make sure people are really interested in that topic.
    • Use concrete examples and pictures; this makes an article lively and interesting. Break the text into small paragraphs to make it easy to read. For example, if you open a page of jokes in a newspaper, which ones do you read first? Naturally, everyone first reads the short texts, then the longer ones, and only last the long walls of text.
    • "Panda's" favorite nitpick is an article that has lost relevance because it contains outdated information. Keep track of updates and refresh your texts.
    • Watch the density of keywords. I wrote above how to determine this density, and in the service I mentioned you will get the exact number of keys required.
    • Do not plagiarize. Everyone knows you cannot steal other people's things; text is no different. You will answer for theft by falling under a filter.
    • Write texts of at least two thousand words; such an article looks informative through the eyes of search engine robots.
    • Do not go off topic on your blog. If you run a blog about making money on the Internet, you do not need to publish articles about air guns. This can lower the rating of your resource.
    • Design your articles beautifully, divide them into paragraphs, and add pictures, so that reading is pleasant and visitors do not want to leave the site quickly.
    • If you do buy links, point them at your most interesting and useful articles, the ones people will actually read.
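As a small illustration of the keyword-density point from the list above, here is a sketch of how such a density could be computed. The thresholds people recommend vary, and the "exact number of keys" mentioned in the article is not reproduced here; the sample sentence is invented.

```python
import re

def keyword_density(text, keyword):
    """Share of the words in `text` that equal the given single-word keyword."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

text = "robots index pages and robots follow links"
print(round(keyword_density(text, "robots"), 3))  # 0.286 (2 of 7 words)
```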

    Well, now you know what kind of work search engine robots do, and you can be friends with them. Most importantly, you have now studied the Google search robots "Panda" and "Penguin" in detail.

    1.1.1. Search engine components

    Information on the Web is not only added to but also constantly changed, and no one announces these changes to anyone. There is no single system for entering information that is simultaneously available to all Internet users. Therefore, search engines were created to structure information and provide users with convenient means of searching for data.

    Search engines come in different types. Some search information that people have entered into them: these are directories in which editors enter information about sites, brief descriptions, or reviews, and the search runs over those descriptions.

    Others collect information on the Web using special programs. These are the search engines proper, consisting, as a rule, of three main components:

    Agent;

    Index;

    Search engine.

    The agent, more commonly called a spider or robot (in the English literature: spider, crawler), traverses the network, or a certain part of it, in search of information. This robot keeps a list of addresses (URLs) that it can visit and index, and at intervals set for each search engine it downloads the documents behind those links and analyzes them. The resulting page content is saved by the robot in a more compact form and passed to the Index. If a new link is found while analyzing a page (document), the robot adds it to its list. Therefore, any document or site that is linked to can be found by the robot. Conversely, if a site, or any part of it, has no external links pointing to it, the robot may never find it.

    A robot is not just a collector of information. It has fairly well-developed "intelligence". Robots can search for sites on a particular subject, generate lists of sites sorted by traffic, extract and process information from existing databases, and follow links of various nesting depths. In any case, they pass all the information they find to the search engine's database (the Index).

    Search robots come in various types:

    • Spider - a program that downloads web pages in the same way as the user's browser does. The difference is that the browser displays the information contained on the page (text, graphics, etc.), while the spider has no visual components and works directly with the HTML text of the page (similar to what you see if you turn on HTML source view in your browser).

    • Crawler ("traveling" spider) - extracts all the links present on a page. Its task is to determine where the spider should go next, based on those links or on a predefined list of addresses. By following the links it finds, the crawler searches for new documents still unknown to the search engine.

    • Indexer - parses a page into its component parts and analyzes them. Various page elements are selected and analyzed: text, headings, structural and style features, special service HTML tags, and so on.

    The Index is the part of the search engine in which information is searched. It contains all the data passed to it by the robots, so the size of the index can reach hundreds of gigabytes. In effect, the index holds copies of all pages visited by the robots. If a robot detects a change on a page it has already indexed, it sends updated information to the Index. The new version should replace the existing one, but in some cases not only does the new page appear in the Index, the old page remains as well.

    The search engine is the interface through which a visitor interacts with the Index. Through the interface, users enter queries and receive responses, and site owners register their sites (such registration is another way to tell the robot your site's address). When processing a query, the search engine selects the matching pages and documents from among the many millions of indexed resources and arranges them in order of importance, or relevance, to the query.
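The Index-plus-interface pair described in this section can be sketched as a tiny inverted index in Python. The AND semantics for multi-word queries and the sample pages are simplifying assumptions; real indexes store far more per term (positions, weights, and so on).

```python
from collections import defaultdict

def build_index(pages):
    """Invert {url: text} into {word: set of urls containing it}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return pages containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

pages = {
    "a": "search robots crawl pages",
    "b": "robots follow links",
    "c": "pages about search",
}
idx = build_index(pages)
print(sorted(search(idx, "search pages")))  # ['a', 'c']
```

When a robot re-fetches a changed page, a real system must also remove the page's old postings before inserting the new ones, which is exactly the stale-entry problem the Index paragraph above mentions.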


