What is Data Mining? Data Mining technology

OLAP systems give the analyst a means of testing hypotheses when analyzing data: the analyst's main task is to generate hypotheses, which he then tests on the basis of his knowledge and experience. However, knowledge resides not only in the person but also in the accumulated data being analyzed. Such knowledge is contained in a huge amount of information that a person is not able to explore on his own. Because of this, there is a risk of missing hypotheses that could bring significant benefit.

To detect "hidden" knowledge, special methods of automatic analysis are used, with the help of which one has to practically extract knowledge from the "blockages" of information. The term “data mining (DataMining)” or “data mining” has been assigned to this direction.

There are many definitions of DataMining that complement each other. Here are some of them.

Data Mining is the process of discovering non-trivial and practically useful patterns in databases. (BaseGroup)

Data Mining is the process of extracting, exploring and modeling large amounts of data to discover previously unknown patterns in order to achieve business advantage. (SAS Institute)

Data Mining is a process whose purpose is to discover new significant correlations, patterns and trends by sifting through large amounts of stored data using pattern recognition techniques together with statistical and mathematical methods. (Gartner Group)

Data Mining is the study and discovery by a "machine" (algorithms, artificial intelligence tools) of knowledge hidden in raw data that is previously unknown, non-trivial, practically useful and available for human interpretation. (A. Barsegyan, "Technologies for Data Analysis")

DataMining is the process of discovering useful knowledge about business. (N.M. Abdikeev "KBA")

Properties of discoverable knowledge

Consider the properties of the knowledge to be discovered.

  • Knowledge must be new, previously unknown. The effort spent on discovering knowledge that is already known to the user does not pay off. Therefore, it is new, previously unknown knowledge that is of value.
  • Knowledge must be non-trivial. The results of the analysis should reflect non-obvious, unexpected patterns in the data, the so-called hidden knowledge. Results that could be obtained by simpler means (for example, by visual inspection) do not justify the use of powerful Data Mining methods.
  • Knowledge should be practically useful. The knowledge found must be applicable, including to new data, with a sufficiently high degree of reliability. Its usefulness lies in the fact that this knowledge can bring some benefit when it is applied.
  • Knowledge must be accessible to human understanding. The patterns found must be logically explainable, otherwise there is a possibility that they are random. In addition, the discovered knowledge should be presented in a human-understandable form.

In DataMining, models are used to represent the acquired knowledge. Types of models depend on the methods of their creation. The most common are: rules, decision trees, clusters and mathematical functions.

Data Mining Tasks

Recall that Data Mining technology is based on the concept of patterns, that is, regularities. Data Mining tasks are solved by discovering these regularities, which are hidden from the naked eye. Different types of patterns, which can be expressed in a human-readable form, correspond to particular Data Mining tasks.

There is no consensus on which tasks should be attributed to Data Mining. Most authoritative sources list the following: classification, clustering, prediction, association, visualization, analysis and detection of deviations, estimation, link analysis, and summarization.

The purpose of the description that follows is to give an overview of Data Mining tasks, to compare some of them, and also to present some of the methods by which these tasks are solved. The most common Data Mining tasks are classification, clustering, association, prediction and visualization. Tasks are thus divided according to the type of information they produce; this is the most general classification of Data Mining tasks.

Classification

Classification is the task of dividing a set of objects or observations into a priori given groups, called classes, within each of which the objects are assumed to be similar to one another, having roughly the same properties and features. The solution is obtained by analyzing the values of attributes (features).

Classification is one of the most important Data Mining tasks. It is used in marketing, in assessing the creditworthiness of borrowers, in determining customer loyalty, in pattern recognition, in medical diagnostics and in many other applications. If the analyst knows the properties of the objects of each class, then when a new observation is assigned to a certain class, these properties automatically carry over to it.

If the number of classes is limited to two, we speak of binary classification, to which many more complex problems can be reduced. For example, instead of distinguishing credit-risk levels such as "High", "Medium" or "Low", one can use only two: "Issue" or "Refuse".

Many different models are used for classification in Data Mining: neural networks, decision trees, support vector machines, k-nearest neighbours, covering algorithms and others. They are built using supervised learning, in which the output variable (the class label) is given for each observation. Formally, classification is based on partitioning the feature space into regions, within each of which the multidimensional vectors are considered identical. In other words, if an object falls into a region of space associated with a certain class, it belongs to that class.
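Below is a minimal sketch of the binary "Issue"/"Refuse" credit example, using a decision tree classifier from scikit-learn; the features, data and class labels are invented purely for illustration.

```python
# A minimal sketch of binary credit classification ("Issue"/"Refuse"),
# trained with supervised learning on labelled observations.
# The features and data below are invented purely for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Each row: [income (conventional units), age (years)]; label: 1 = "Issue", 0 = "Refuse"
X = [[20, 25], [55, 40], [70, 33], [15, 52], [90, 45], [30, 29], [60, 61], [25, 35]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# A new observation falls into the region of feature space associated with a class.
print(model.predict([[65, 38]]))    # e.g. [1] -> "Issue"
print(model.score(X_test, y_test))  # accuracy on held-out observations
```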

Clustering

Short description. Clustering is a logical continuation of the idea of classification. This task is more complicated: the peculiarity of clustering is that the classes of objects are not predefined. The result of clustering is the division of objects into groups.

An example of a method for solving the clustering problem is "unsupervised" training of a special kind of neural network, Kohonen self-organizing maps.

Association (Associations)

Short description. In the course of solving the problem of searching for association rules, patterns are found between related events in the dataset.

The difference between association and the two previous Data Mining tasks is that the patterns are sought not among the properties of a single analyzed object but among several events that occur together. The best-known algorithm for finding association rules is the Apriori algorithm.

Sequence or sequential association

Short description. The sequence allows you to find temporal patterns between transactions. The task of a sequence is similar to an association, but its goal is to establish patterns not between simultaneously occurring events, but between events connected in time (i.e., occurring at some specific interval in time). In other words, the sequence is determined by the high probability of a chain of events related in time. In fact, an association is a special case of a sequence with zero time lag. This DataMining problem is also called the sequential pattern problem.

Sequence rule: after event X, event Y will occur within a certain time.

Example. After buying an apartment, the new residents purchase a refrigerator within two weeks in 60% of cases, and a TV set within two months in 50% of cases. The solution to this problem is widely used in marketing and management, for example in Customer Lifecycle Management.
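As a toy illustration of such a rule, the sketch below estimates how often event Y follows event X within a given time window; the event log, customer IDs and the 14-day window are invented.

```python
# Toy sketch: the confidence of a sequence rule "after X, Y occurs within N days",
# computed over a small event log. The log and the 14-day window are invented.
from datetime import date, timedelta

events = [  # (customer_id, event, date)
    (1, "apartment", date(2020, 1, 10)), (1, "refrigerator", date(2020, 1, 20)),
    (2, "apartment", date(2020, 2, 1)),  (2, "tv", date(2020, 3, 15)),
    (3, "apartment", date(2020, 3, 5)),  (3, "refrigerator", date(2020, 3, 12)),
]

def sequence_confidence(events, x, y, window_days):
    """Share of customers with event x who also had event y within the window."""
    window = timedelta(days=window_days)
    x_dates = {c: d for c, e, d in events if e == x}
    hits = sum(
        1 for c, start in x_dates.items()
        if any(e == y and start <= d <= start + window for cc, e, d in events if cc == c)
    )
    return hits / len(x_dates) if x_dates else 0.0

print(sequence_confidence(events, "apartment", "refrigerator", 14))  # 2 of 3 -> ~0.67
```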

Regression, forecasting (Forecasting)

Short description. As a result of solving the forecasting problem, missing or future values of the target numerical indicators are estimated on the basis of the features of historical data.

To solve such problems, methods of mathematical statistics, neural networks, etc. are widely used.
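A minimal sketch of such a forecast, fitting a linear regression to a historical series with scikit-learn; the monthly sales figures are invented for illustration.

```python
# Minimal forecasting sketch: estimate future values of a numeric indicator
# from historical data with linear regression. The sales series is invented.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)   # historical time axis (months 1..12)
sales = np.array([100, 104, 109, 115, 118, 124, 131, 135, 142, 147, 153, 160])

model = LinearRegression().fit(months, sales)
print(model.predict([[13], [14]]))          # forecast for the next two months
```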

Additional tasks

Detection of deviations or outliers (Deviation Detection), outlier analysis

Short description. The purpose of solving this problem is the detection and analysis of data that differs most from the general set of data, the identification of so-called uncharacteristic patterns.

Estimation

The task of estimation is reduced to predicting continuous values ​​of a feature.

Link analysis (LinkAnalysis)

The task of finding dependencies in a data set.

Visualization (Visualization, GraphMining)

As a result of visualization, a graphic image of the analyzed data is created. To solve the visualization problem, graphical methods are used to show the presence of patterns in the data.

An example of visualization methods is the presentation of data in 2-D and 3-D dimensions.

Summarization

A task whose purpose is to describe specific groups of objects from the analyzed data set.

Quite close to the above classification is the division of DataMining tasks into the following: research and discovery, prediction and classification, explanation and description.

Automatic research and discovery (free search)

Task example: discovery of new market segments.

To solve this class of problems, methods of cluster analysis are used.

Prediction and classification

Sample problem: predict sales growth based on current values.

Methods: regression, neural networks, genetic algorithms, decision trees.

The tasks of classification and forecasting make up the group of so-called inductive modelling, which results in the study of the analyzed object or system. In the process of solving these problems, a general model or hypothesis is built on the basis of a data set.

Explanation and description

Sample problem: characterizing customers by demographics and purchase histories.

Methods: decision trees, rule systems, association rules, link analysis.

An example of such a rule: if the client's income is more than 50 conventional units and his age is more than 30 years, then the client belongs to the first class.

Comparison of clustering and classification

| Characteristic | Classification | Clustering |
| --- | --- | --- |
| Controllability of learning | Controlled | Uncontrolled |
| Strategy | Learning with a teacher (supervised) | Learning without a teacher (unsupervised) |
| Presence of a class label | Each observation of the training set is accompanied by a label indicating the class to which it belongs | The class labels of the training set are unknown |
| Basis for classification | New data are classified on the basis of the training set | A set of data is given in order to establish the existence of classes or clusters of data |

Application areas of Data Mining

It should be noted that today DataMining technology is most widely used in solving business problems. Perhaps the reason is that it is in this direction that the return on using DataMining tools can be, according to some sources, up to 1000%, and the costs of its implementation can quickly pay off.

We will look in detail at four main areas of application of Data Mining technology: business, state-level problems, scientific research, and the Web.

Application of Data Mining for solving business problems. Main areas: banking, finance, insurance, CRM, manufacturing, telecommunications, e-commerce, marketing, the stock market and others. Typical tasks:

  • whether to issue a loan to a client;
  • market segmentation;
  • attraction of new clients;
  • credit card fraud detection.

Application of Data Mining for solving state-level problems. Main directions: searching for tax evaders; a tool in the fight against terrorism.

Application of Data Mining for scientific research. Main areas: medicine, biology, molecular genetics and genetic engineering, bioinformatics, astronomy, applied chemistry, research on drug addiction, and others.

Application of Data Mining for solving Web tasks. Main directions: search engines, hit counters and others.

E-commerce

In the field of e-commerce, Data Mining is used to classify customers. This classification allows companies to identify specific groups of customers and conduct marketing policy in accordance with the identified interests and needs of customers. Data Mining technology for e-commerce is closely related to Web Mining technology.

The main tasks of Data Mining in industrial production:

  • complex system analysis of production situations;
  • short-term and long-term forecasting of the development of production situations;
  • development of options for optimization solutions;
  • predicting the quality of a product depending on certain parameters of the technological process;
  • detection of hidden trends and patterns in the development of production processes;
  • forecasting patterns of development of production processes;
  • detection of hidden factors of influence;
  • detection and identification of previously unknown relationships between production parameters and factors of influence;
  • analysis of the environment of interaction of production processes and forecasting of changes in its characteristics;
  • visualization of analysis results, preparation of preliminary reports and projects of feasible solutions with estimates of the reliability and efficiency of possible implementations.

Marketing

Data Mining is widely used in the field of marketing. It helps answer the basic marketing questions: "What is being sold?", "How is it being sold?", "Who is the consumer?"

In the lecture on classification and clustering problems, the use of cluster analysis for solving marketing problems, such as consumer segmentation, is described in detail.

Another common set of methods for solving marketing problems are methods and algorithms for searching for association rules.

The search for temporal patterns is also successfully used here.

Retail

In retail, as in marketing, the following are applied:

  • algorithms for searching for association rules (to determine frequently occurring sets of goods that buyers purchase at the same time); identifying such rules helps to place goods on the shelves of trading floors, to develop strategies for purchasing goods and for their placement in warehouses, etc.;
  • time sequences, for example to determine the required amount of inventory in the warehouse;
  • classification and clustering methods to identify groups or categories of customers, knowledge of which contributes to the successful promotion of goods.

Stock market

Here is a list of stock market problems that can be solved using Data Mining technology:

  • forecasting future values of financial instruments and indicators from their past values;
  • forecasting the trend (future direction of movement: growth, fall, flat) of a financial instrument and its strength (strong, moderately strong, etc.);
  • identifying the cluster structure of the market, industry or sector according to a certain set of characteristics;
  • dynamic portfolio management;
  • volatility forecasting;
  • risk assessment;
  • predicting the onset of a crisis and forecasting its development;
  • selection of assets, etc.

In addition to the areas of activity described above, DataMining technology can be applied in a wide variety of business areas where there is a need for data analysis and a certain amount of retrospective information has been accumulated.

Application of DataMining in CRM

One of the most promising applications of DataMining is the use of this technology in analytical CRM.

CRM (Customer Relationship Management) - customer relationship management.

When these technologies are used together, knowledge mining is combined with "extracting money" from customer data.

An important aspect of the work of the marketing and sales departments is the preparation of a holistic view of customers: information about their features and characteristics and the structure of the customer base. CRM uses so-called customer profiling, which gives a complete view of all the necessary information about customers.

Customer profiling includes the following components: customer segmentation, customer profitability, customer retention, customer response analysis. Each of these components can be explored using DataMining, and analyzing them together as profiling components can result in knowledge that cannot be obtained from each individual characteristic.

Web Mining

Web Mining can be translated as "data mining on the Web". Web Intelligence is ready to "open a new chapter" in the rapid development of e-business. The ability to determine the interests and preferences of each visitor by observing his behaviour is a serious and critical competitive advantage in the e-commerce market.

WebMining systems can answer many questions, for example, which of the visitors is a potential client of the Web store, which group of customers of the Web store brings the most income, what are the interests of a particular visitor or group of visitors.

Methods

Classification of methods

There are two groups of methods:

  • statistical methods based on the use of average accumulated experience, which is reflected in retrospective data;
  • cybernetic methods, including many heterogeneous mathematical approaches.

The disadvantage of such a classification is that both statistical and cybernetic algorithms in one way or another rely on a comparison of statistical experience with the results of monitoring the current situation.

The advantage of such a classification is its convenience for interpretation - it is used in describing the mathematical tools of the modern approach to extracting knowledge from arrays of initial observations (operational and retrospective), i.e. in Data Mining tasks.

Let's take a closer look at the above groups.

Statistical Data Mining methods

These methods comprise four interrelated sections:

  • preliminary analysis of the nature of statistical data (testing the hypotheses of stationarity, normality, independence, homogeneity, evaluation of the type of distribution function, its parameters, etc.);
  • identification of links and patterns (linear and non-linear regression analysis, correlation analysis, etc.);
  • multivariate statistical analysis (linear and non-linear discriminant analysis, cluster analysis, component analysis, factor analysis, etc.);
  • dynamic models and forecast based on time series.

The arsenal of statistical Data Mining methods can be divided into four groups of methods:

  1. Descriptive analysis and description of initial data.
  2. Relationship analysis (correlation and regression analysis, factor analysis, analysis of variance).
  3. Multivariate statistical analysis (component analysis, discriminant analysis, multivariate regression analysis, canonical correlations, etc.).
  4. Time series analysis (dynamic models and forecasting).

Cybernetic Data Mining Methods

The second direction of Data Mining is a set of approaches united by the idea of ​​computer mathematics and the use of artificial intelligence theory.

This group includes the following methods:

  • artificial neural networks (recognition, clustering, forecast);
  • evolutionary programming (including algorithms of the method of group accounting of arguments);
  • genetic algorithms (optimization);
  • associative memory (search for analogues, prototypes);
  • fuzzy logic;
  • decision trees;
  • expert knowledge processing systems.

Cluster analysis

The purpose of clustering is to search for existing structures.

Clustering is a descriptive procedure, it does not draw any statistical conclusions, but it provides an opportunity to conduct exploratory analysis and study the "structure of the data".

The very concept of "cluster" is defined ambiguously: each study has its own "clusters". The concept of a cluster (cluster) is translated as "cluster", "bunch". A cluster can be described as a group of objects that have common properties.

There are two characteristics of a cluster:

  • internal homogeneity;
  • external isolation.

A question that analysts ask in many problems is how to organize data into visual structures, i.e. how to build taxonomies.

Initially, clustering was most widely used in such sciences as biology, anthropology, and psychology. For a long time, clustering has been little used to solve economic problems due to the specifics of economic data and phenomena.

Clusters can be non-overlapping (exclusive) or overlapping.

It should be noted that as a result of applying various methods of cluster analysis, clusters of various shapes can be obtained. For example, clusters of a "chain" type are possible, when clusters are represented by long "chains", elongated clusters, etc., and some methods can create arbitrary-shaped clusters.

Various methods may aim to create clusters of certain sizes (eg small or large) or assume clusters of different sizes in the data set. Some cluster analysis methods are particularly sensitive to noise or outliers, while others are less so. As a result of applying different clustering methods, different results can be obtained, this is normal and is a feature of the operation of a particular algorithm. These features should be taken into account when choosing a clustering method.

Let us give a brief description of approaches to clustering.

Algorithms based on data partitioning (partitioning algorithms), including iterative ones:

  • division of objects into k clusters;
  • iterative redistribution of objects to improve the clustering.

Hierarchical algorithms (hierarchy algorithms):

  • agglomerative: each object is initially a separate cluster; clusters are then joined with one another, forming ever larger clusters, and so on.

Methods based on the concentration of objects (density-based methods):

  • based on the connectivity of objects;
  • ignore noise and find clusters of arbitrary shape.

Grid-based methods:

  • quantization of objects in grid structures.

Model methods (Model-based):

  • using the model to find the clusters that best fit the data.

Methods of cluster analysis. Iterative methods

With a large number of observations, hierarchical methods of cluster analysis are not suitable. In such cases, non-hierarchical methods based on division are used, which are iterative methods of splitting the original population. During the division process, new clusters are formed until the stopping rule is met.

Such non-hierarchical clustering consists of dividing a data set into a certain number of distinct clusters. There are two approaches. The first is to define the boundaries of the clusters as the densest areas in the multidimensional space of the initial data, i.e. to define a cluster wherever there is a large "concentration of points". The second approach is to minimize a measure of dissimilarity between the objects within each cluster.

The k-means algorithm

The most common of the non-hierarchical methods is the k-means algorithm, also called fast cluster analysis. A full description of the algorithm can be found in Hartigan and Wong (1978). Unlike hierarchical methods, which require no preliminary assumptions about the number of clusters, to use this method one must have a hypothesis about the most probable number of clusters.

The k-means algorithm builds k clusters located as far apart from each other as possible. The main type of problem solved by the k-means algorithm assumes that hypotheses about the number of clusters exist, and the clusters should be as different as possible. The choice of the number k may be based on previous research, theoretical considerations or intuition.

The general idea of the algorithm: the observations are assigned to a given, fixed number k of clusters in such a way that the cluster means (over all variables) differ from one another as much as possible.

Description of the algorithm

1. Initial distribution of objects into clusters.

  • The number k is chosen, and at the first step k points are taken as the "centres" of the clusters.
  • Each cluster corresponds to one centre.

The initial centroids can be chosen as follows:

  • selecting k observations so as to maximize the initial distance between them;
  • random selection of k observations;
  • selection of the first k observations.

As a result, each object is assigned to a specific cluster.

2. Iterative process.

The centres of the clusters are computed; from this point on, a centre is taken to be the coordinate-wise mean of the objects in the cluster. The objects are then redistributed again.

The process of calculating centers and redistributing objects continues until one of the following conditions is met:

  • cluster centers have stabilized, i.e. all observations belong to the cluster they belonged to before the current iteration;
  • the number of iterations is equal to the maximum number of iterations.

Figure: an example of the operation of the k-means algorithm for k = 2 (the image is not reproduced here).

The choice of the number of clusters is a complex issue. If there are no assumptions about this number, it is recommended to create 2 clusters, then 3, 4, 5, etc., comparing the results.

Checking the quality of clustering

After obtaining the results of cluster analysis using the k-means method, one should check the correctness of the clustering (i.e., evaluate how the clusters differ from each other).

To do this, average values ​​for each cluster are calculated. Good clustering should produce very different means for all measurements, or at least most of them.
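A minimal sketch of this procedure with scikit-learn, followed by the quality check described above (comparing per-cluster means); the 2-D points are invented for illustration.

```python
# A minimal k-means sketch (k = 2) followed by the quality check suggested
# above: comparing the per-cluster means of every feature. Points are invented.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],   # one dense group of points
              [5.0, 5.2], [5.3, 4.9], [4.8, 5.1]])  # another dense group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment of each observation
print(km.cluster_centers_)  # coordinate means ("centres") of the clusters

# Quality check: the cluster means should differ noticeably on most features.
for k in range(2):
    print(k, X[km.labels_ == k].mean(axis=0))
```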

Advantages of the k-means algorithm:

  • ease of use;
  • speed of use;
  • clarity and transparency of the algorithm.

Disadvantages of the k-means algorithm:

  • the algorithm is too sensitive to outliers, which can distort the means; a possible solution to this problem is to use a modification of the algorithm, the k-medians algorithm;
  • the algorithm can be slow on large databases; a possible solution to this problem is to work with a sample of the data.

Bayesian networks

In probability theory, the concept of information dependency is modeled by conditional dependency (or strictly: lack of conditional independence), which describes how our confidence in the outcome of some event changes when we gain new knowledge about the facts, given that we already knew some set of other facts.

It is convenient and intuitive to represent dependencies between elements by means of a directed path connecting these elements in a graph. If the relationship between elements x and y is not direct but runs through a third element z, then it is logical to expect an element z on the path between x and y. Such intermediary nodes "cut off" the dependence between x and y, i.e. they model a situation of conditional independence between them given known values of the direct factors of influence. Bayesian networks are such a modelling language; they serve to describe conditional dependencies between the concepts of a certain subject area.

Bayesian networks are graphical structures for representing probabilistic relationships between a large number of variables and for performing probabilistic inference based on those variables. "Naive" Bayesian classification is a fairly transparent and understandable classification method. It is called "naive" because it proceeds from the assumption of mutual independence of the features.

Classification properties:

1. Using all variables and defining all dependencies between them.

2. Having two assumptions about variables:

  • all variables are equally important;
  • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.

There are two main scenarios for using Bayesian networks:

1. Descriptive analysis. The subject area is displayed as a graph, the nodes of which represent concepts, and the directed arcs displayed by arrows illustrate the direct relationships between these concepts. The relationship between x and y means that knowing the value of x helps you make a better guess about the value of y. The absence of a direct connection between concepts models the conditional independence between them, given the known values ​​of a certain set of "separating" concepts. For example, a child's shoe size is obviously related to a child's ability to read through age. Thus, a larger shoe size gives more confidence that the child is already reading, but if we already know the age, then knowing the shoe size will no longer give us additional information about the child's ability to read.


As another, opposite, example, consider such initially unrelated factors as smoking and a cold. But if we know a symptom, for example, that a person suffers from a morning cough, then knowing that a person does not smoke increases our confidence that a person has a cold.

2. Classification and forecasting. The Bayesian network, allowing for the conditional independence of a number of concepts, makes it possible to reduce the number of joint distribution parameters, making it possible to estimate them confidently on the available data volumes. So, with 10 variables, each of which can take 10 values, the number of joint distribution parameters is 10 billion - 1. If we assume that only 2 variables depend on each other between these variables, then the number of parameters becomes 8 * (10-1) + (10 * 10-1) = 171. Having a model of joint distribution that is realistic in terms of computational resources, we can predict the unknown value of a concept as, for example, the most probable value of this concept with known values ​​of other concepts.
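To make the independence assumption concrete, here is a toy, from-scratch sketch of naive Bayesian classification: class scores are obtained by multiplying per-feature conditional probabilities with the class prior. The tiny categorical credit dataset and the add-one smoothing are chosen only for illustration.

```python
# Toy naive Bayes sketch: multiply the class prior by per-feature conditional
# probabilities, which is valid only under the feature-independence assumption.
# The categorical dataset and add-one smoothing are chosen for illustration.
from collections import Counter, defaultdict

# Each observation: (income_level, age_group) -> credit class
data = [(("high", "30+"), "issue"), (("high", "30+"), "issue"),
        (("low", "30+"), "refuse"), (("low", "under30"), "refuse"),
        (("high", "under30"), "issue"), (("low", "30+"), "issue")]

class_counts = Counter(label for _, label in data)
feature_values = defaultdict(set)     # feature index -> observed values
cond_counts = defaultdict(Counter)    # (feature index, class) -> value counts
for features, label in data:
    for i, value in enumerate(features):
        feature_values[i].add(value)
        cond_counts[(i, label)][value] += 1

def predict(features):
    scores = {}
    for label, n in class_counts.items():
        score = n / len(data)                     # prior P(class)
        for i, value in enumerate(features):
            # P(value | class) with add-one (Laplace) smoothing
            score *= (cond_counts[(i, label)][value] + 1) / (n + len(feature_values[i]))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict(("high", "30+")))   # expected: "issue"
```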

The following advantages of Bayesian networks as a Data Mining method are noted:

  • dependencies between all variables are defined in the model, which makes it easy to handle situations in which the values of some variables are unknown;
  • Bayesian networks are quite simple to interpret and, at the predictive modelling stage, make it easy to carry out "what if" scenario analysis;
  • the Bayesian approach makes it possible to naturally combine patterns derived from the data with, for example, expert knowledge obtained explicitly;
  • the use of Bayesian networks avoids the problem of overfitting, that is, excessive complication of the model, which is a weakness of many methods (for example, decision trees and neural networks).

The naive Bayesian approach has the following disadvantages:

  • multiplying conditional probabilities is correct only when all input variables really are statistically independent; although the method often shows fairly good results even when the condition of statistical independence is violated, in theory such situations should be handled by more complex methods based on training Bayesian networks;
  • continuous variables cannot be processed directly; they must be converted to an interval scale so that the attributes are discrete, but such transformations can sometimes lead to the loss of meaningful patterns;
  • the classification result in the naive Bayesian approach is affected only by the individual values of the input variables; the combined influence of pairs or triples of values of different attributes is not taken into account. Taking it into account could improve the predictive accuracy of the classification model, but it would also increase the number of variants to be tested.

Artificial neural networks

Artificial neural networks (hereinafter, neural networks) can be synchronous or asynchronous. In synchronous neural networks, only one neuron changes its state at each moment of time; in asynchronous networks, the state changes at once for a whole group of neurons, as a rule for an entire layer. Two basic architectures can be distinguished: layered and fully connected networks. The key concept in layered networks is the layer: one or more neurons whose inputs receive the same common signal. Layered neural networks are networks in which the neurons are divided into separate groups (layers) so that information is processed layer by layer. In layered networks, the neurons of the i-th layer receive input signals, transform them, and pass them through branch points to the neurons of the (i+1)-th layer, and so on up to the k-th layer, which produces the output signals for the interpreter and the user. The number of neurons in each layer is not related to the number of neurons in other layers and can be arbitrary. Within one layer data are processed in parallel, while across the network as a whole processing proceeds sequentially, from layer to layer. Layered neural networks include, for example, multilayer perceptrons, radial basis function networks, the cognitron, the neocognitron, and associative memory networks. However, the signal is not always fed to all neurons of a layer: in the cognitron, for example, each neuron of the current layer receives signals only from the neurons close to it in the previous layer.

Layered networks, in turn, can be single-layer and multi-layer.

A single-layer network is a network consisting of one layer.

A multilayer network is a network with several layers.

In a multilayer network, the first layer is called the input layer, the subsequent layers are called internal or hidden, and the last layer is the output layer. Thus the intermediate layers are all the layers of a multilayer neural network except the input and output layers. The input layer of the network implements the connection with the input data, the output layer with the output. Neurons can therefore be input, output and hidden. The input layer is made up of input neurons, which receive the data and distribute it to the inputs of the neurons of the network's hidden layer. A hidden neuron is a neuron located in a hidden layer of the neural network. The output neurons, which make up the output layer of the network, produce the results of the neural network.
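A minimal NumPy sketch of this layered processing: the input layer distributes the data, each hidden layer transforms it and passes the result on, and the output layer produces the network's response. The layer sizes, random weights and sigmoid activation are chosen arbitrarily for illustration.

```python
# Layered forward pass: input layer -> hidden layers -> output layer.
# Layer sizes, random weights and the sigmoid activation are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def layer(n_inputs, n_neurons):
    """Weights and biases of one fully connected layer."""
    return rng.normal(size=(n_inputs, n_neurons)), np.zeros(n_neurons)

def forward(x, layers):
    for i, (w, b) in enumerate(layers):
        z = x @ w + b
        # sigmoid on hidden layers, identity on the output layer
        x = 1 / (1 + np.exp(-z)) if i < len(layers) - 1 else z
    return x

layers = [layer(3, 5), layer(5, 4), layer(4, 1)]  # 3 inputs -> 5 -> 4 -> 1 output
x = np.array([[0.2, -1.0, 0.7]])                  # one observation with 3 input features
print(forward(x, layers))                         # the network's output signal
```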

In fully connected networks each neuron transmits its output signal to the rest of the neurons, including itself. The output signals of the network can be all or some of the output signals of neurons after several clock cycles of the network.

All input signals are fed to all neurons.

Training of neural networks

Before a neural network can be used, it must be trained. The process of training a neural network consists in adjusting its internal parameters to a specific task. The neural network algorithm is iterative; its steps are called epochs or cycles. An epoch is one iteration of the learning process, including the presentation of all examples from the training set and, possibly, a check of the quality of training on a control set. Training is carried out on the training sample. The training sample includes the input values and the corresponding output values of the dataset. During training, the neural network finds certain dependencies of the output fields on the input ones. We are thus faced with the question of which input fields (features) need to be used. Initially the choice is made heuristically; the number of inputs can then be changed.

A further difficulty is the question of the number of observations in the dataset. Although there are some rules describing the relationship between the required number of observations and the size of the network, their correctness has not been proven. The number of observations needed depends on the complexity of the problem being solved. As the number of features grows, the required number of observations grows non-linearly; this problem is called the "curse of dimensionality". If there is not enough data, it is recommended to use a linear model.

The analyst must determine the number of layers in the network and the number of neurons in each layer. Next, values of the weights and biases must be assigned that minimize the decision error. The weights and biases are adjusted automatically so as to minimize the difference between the desired and the actual output signals, which is called the learning error. The learning error of the constructed neural network is calculated by comparing the output values with the target (desired) values. The error function is formed from the differences obtained.

The error function is an objective function that has to be minimized in the process of supervised learning of the neural network. The error function makes it possible to evaluate the quality of the neural network during training; for example, the sum of squared errors is often used. The ability to solve the assigned tasks depends on the quality of the neural network's training.

Overfitting of a neural network

When training neural networks, a serious difficulty often arises called the overfitting problem. Overfitting (overtraining) is the excessive fitting of the neural network to a specific set of training examples, in which the network loses its ability to generalize. Overfitting occurs when training goes on too long, when there are not enough training examples, or when the neural network structure is overcomplicated. It is connected with the fact that the choice of the training set is random. From the first steps of training the error decreases; at subsequent steps, in order to reduce the error (objective function), the parameters are adjusted to the characteristics of the training set. However, in this case the "adjustment" is made not to the general patterns of the series but to the features of its part, the training subset, and the accuracy of the forecast decreases. One way of dealing with overfitting is to divide the training sample into two sets (training and test). The neural network is trained on the training set, and the constructed model is checked on the test set; these sets must not intersect. At each step the parameters of the model change, but a constant decrease in the value of the objective function occurs precisely on the training set. By splitting the set into two, we can observe the change in the forecast error on the test set in parallel with the observations on the training set. For some number of steps the prediction error decreases on both sets; however, at a certain step the error on the test set begins to increase while the error on the training set continues to decrease. This moment is considered the beginning of overfitting.
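The sketch below illustrates this check on invented, synthetic data: the set is split into a training and a test part, the network is trained epoch by epoch with scikit-learn's MLPRegressor, and both errors are printed so that the point where the test error starts to grow can be spotted.

```python
# Overfitting check: train epoch by epoch and watch training vs. test error.
# The synthetic data and the network size are chosen arbitrarily for illustration.
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(50,), learning_rate_init=0.01, random_state=0)

for epoch in range(1, 201):
    net.partial_fit(X_train, y_train)                     # one training epoch
    train_err = mean_squared_error(y_train, net.predict(X_train))
    test_err = mean_squared_error(y_test, net.predict(X_test))
    if epoch % 50 == 0:   # the epoch where test_err starts rising marks overfitting
        print(epoch, round(train_err, 4), round(test_err, 4))
```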

Data mining tools

Development in the Data Mining sector of the global software market is carried out both by world-famous leaders and by new emerging companies. Data Mining tools can be offered either as a standalone application or as an add-on to a main product. The latter option is chosen by many software market leaders. It has already become a tradition for developers of universal statistical packages to include, in addition to traditional methods of statistical analysis, a certain set of Data Mining methods in the package. These are packages such as SPSS (SPSS, Clementine), Statistica (StatSoft), SAS Institute (SAS Enterprise Miner). Some OLAP vendors also offer a set of Data Mining methods, for example the Cognos family of products. There are vendors that include Data Mining solutions in the functionality of a DBMS: Microsoft (Microsoft SQL Server), Oracle, IBM (IBM Intelligent Miner for Data).

Bibliography

  1. Abdikeev N.M., Danko T.P., Ildemenov S.V., Kiselev A.D. Reengineering of Business Processes. MBA Course. Moscow: Eksmo, 2005. 592 p.
  2. Abdikeev N.M., Kiselev A.D. Knowledge Management in Corporations and Business Reengineering. Moscow: Infra-M, 2011. 382 p. ISBN 978-5-16-004300-5.
  3. Barsegyan A.A., Kupriyanov M.S., Stepanenko V.V., Kholod I.I. Methods and Models of Data Analysis: OLAP and Data Mining. St. Petersburg: BHV-Petersburg, 2004. 336 p. ISBN 5-94157-522-X.
  4. Dyuk V., Samoilenko A. Data Mining: A Training Course. St. Petersburg: Piter, 2001. 386 p.
  5. Chubukova I.A. Data Mining Course. http://www.intuit.ru/department/database/datamining/
  6. Ian H. Witten, Eibe Frank, Mark A. Hall. Data Mining: Practical Machine Learning Tools and Techniques (Third Edition). Morgan Kaufmann. ISBN 978-0-12-374856-0.
  7. Petrushin V.A., Khan L. Multimedia Data Mining and Knowledge Discovery.

Data Mining

Data Mining is a methodology and a process of discovering, in the large amounts of data accumulated in companies' information systems, knowledge that is previously unknown, non-trivial, practically useful and accessible to interpretation, and that is necessary for decision-making in various areas of human activity. Data Mining is one of the stages of the broader Knowledge Discovery in Databases methodology.

The knowledge discovered in the process of Data Mining must be non-trivial and previously unknown. Non-triviality means that such knowledge cannot be discovered by simple visual analysis. It should describe relationships between the properties of business objects, predict the values of some features based on others, and so on. The found knowledge should be applicable to new objects.

The practical usefulness of the knowledge stems from the possibility of using it to support management decision-making and to improve the company's performance.

Knowledge should be presented in a form that is understandable to users who do not have special mathematical training. For example, the logical constructions “if, then” are most easily perceived by a person. Moreover, such rules can be used in various DBMS as SQL queries. In the case when the extracted knowledge is not transparent to the user, there should be post-processing methods that allow them to be brought to an interpretable form.
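As a hedged sketch of applying such an "if ... then ..." rule to new data, the example below expresses the earlier rule ("if income > 50 and age > 30, then class 1") as a pandas filter; in a DBMS the same condition would become the WHERE clause of an SQL query. The customer table and column names are invented.

```python
# Applying a discovered "if ... then ..." rule to new data with pandas.
# The rule and the customer table are invented for illustration; in a DBMS the
# condition would be a WHERE clause: ... WHERE income > 50 AND age > 30.
import pandas as pd

customers = pd.DataFrame({
    "income": [20, 55, 70, 15, 90],
    "age":    [25, 40, 33, 52, 45],
})

# Rows satisfying the rule are assigned class 1, the rest class 0.
customers["class"] = customers.eval("income > 50 and age > 30").astype(int)
print(customers.query("income > 50 and age > 30"))   # the objects covered by the rule
```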

Data Mining is not one method but a combination of a large number of different knowledge discovery methods. All tasks solved by Data Mining methods can be conditionally divided into six types.

Data mining is multidisciplinary in nature, as it includes elements of numerical methods, mathematical statistics and probability theory, information theory and mathematical logic, artificial intelligence and machine learning.

The tasks of business analysis are formulated in different ways, but the solution of most of them comes down to one or another Data Mining task or to a combination of them. For example, risk assessment is a solution to a regression or classification problem, market segmentation is clustering, demand stimulation is association rules. In fact, Data Mining tasks are elements from which you can "assemble" the solution to most real business problems.

To solve the above problems, various methods and algorithms of Data Mining are used. Since Data Mining has developed and continues to develop at the intersection of such disciplines as mathematical statistics, information theory, machine learning and databases, it is quite natural that most Data Mining algorithms and methods have been developed on the basis of various methods from these disciplines. For example, the k-means clustering algorithm was borrowed from statistics.

We welcome you to the Data Mining Portal - a unique portal dedicated to modern Data Mining methods.

Data Mining technologies are a powerful tool of modern business analytics and data exploration for discovering hidden patterns and building predictive models. Data Mining, or knowledge mining, is based not on speculative reasoning but on real data.

Fig. 1. Scheme of application of Data Mining

Problem Definition - Problem definition: data classification, segmentation, building predictive models, forecasting.
Data Gathering and Preparation - Data collection and preparation, cleaning, verification, removal of duplicate records.
Model Building - Building a model, assessing accuracy.
Knowledge Deployment - Application of the model to solve the problem.

Data Mining is used to implement large-scale analytical projects in business, marketing, the Internet, telecommunications, industry, geology, medicine, pharmaceuticals and other areas.

Data Mining allows you to start the process of finding significant correlations and relationships by sifting through a huge amount of data using modern pattern recognition methods and unique analytical technologies, including decision and classification trees, clustering, neural network methods and others.

A user who discovers data mining technology for the first time is amazed at the abundance of methods and efficient algorithms that allow finding approaches to solving difficult problems related to the analysis of large amounts of data.

In general, Data Mining can be described as a technology designed to search large amounts of data for non-obvious, objective and practically useful patterns.

Data Mining is based on effective methods and algorithms developed for the analysis of unstructured data of large volume and dimension.

The key point is that data of large volume and high dimension appear to be devoid of structure and relationships. The goal of data mining technology is to identify these structures and find patterns where, at first glance, chaos and arbitrariness reign.

Here is an actual example of the application of data mining in the pharmaceutical and drug industries.

Drug interactions are a growing problem facing modern healthcare.

Over time, the number of prescribed drugs (over the counter and all kinds of supplements) increases, making it more and more likely that interactions between drugs can cause serious side effects that doctors and patients are unaware of.

This area refers to post-clinical studies, when the drug is already on the market and is being used extensively.

Clinical studies refer to the evaluation of the effectiveness of the drug, but poorly take into account the interactions of this drug with other drugs on the market.

Researchers at Stanford University in California studied the FDA (Food and Drug Administration) database of drug side effects and found that two commonly used drugs - the antidepressant paroxetine and pravastatin, used to lower cholesterol levels - increase risk of developing diabetes if taken together.

A similar analysis study based on FDA data identified 47 previously unknown adverse interactions.

This is remarkable, with the caveat that many of the negative effects noted by patients remain undetected. It is precisely in such cases that this kind of search over data shows itself at its best.

Upcoming Data Mining courses at the StatSoft Academy of Data Analysis in 2020

We start our acquaintance with Data Mining using the wonderful videos of the Academy of Data Analysis.

Be sure to watch our videos and you will understand what Data Mining is!

Video 1. What is Data Mining?


Video 2: Data Mining Overview: Decision Trees, Generalized Predictive Models, Clustering, and More



Before starting a research project, we must organize the process of obtaining data from external sources; we will now show how this is done.

The video will introduce you to the unique STATISTICA In-place database processing technology and the connection of Data Mining with real data.

Video 3. Interacting with databases: a graphical interface for building SQL queries and In-place database processing technology



Now we will get acquainted with interactive drilling technologies that are effective in conducting exploratory data analysis. The term drilling itself reflects the connection between Data Mining technology and geological exploration.

Video 4. Interactive Drilling: Exploration and Graphing Methods for Interactive Data Exploration



Now we will get acquainted with the analysis of associations (association rules), these algorithms allow you to find relationships that exist in real data. The key point is the efficiency of algorithms on large amounts of data.

The result of association analysis algorithms, for example the Apriori algorithm, is a set of rules relating the objects under study, found with a given confidence, for example 80%.

In geology, these algorithms can be applied in the exploration analysis of minerals, for example, how feature A is related to features B and C.

You can find concrete examples of such solutions via the links on our site.

In retail, Apriori algorithms or their modifications allow you to explore the relationship of different products, for example, when selling perfumes (perfume - varnish - mascara, etc.) or products of different brands.

The analysis of the most interesting sections on the site can also be effectively carried out using association rules.

So check out our next video.

Video 5. Association rules


Let us give examples of the application of Data Mining in specific areas.

Internet trading:

  • analysis of customer trajectories from visiting the site to purchasing goods
  • evaluation of service efficiency, analysis of failures due to lack of goods
  • linking products that are of interest to visitors

Retail: Analysis of customer information based on credit cards, discount cards, etc.

Typical retail tasks solved by Data Mining tools:

  • shopping cart analysis;
  • creation of predictive models and classification models of buyers and purchased goods;
  • creation of buyer profiles;
  • CRM, assessment of customer loyalty of different categories, planning of loyalty programs;
  • time series research and time dependencies, selection of seasonal factors, evaluation of the effectiveness of promotions on a large range of real data.

The telecommunications sector opens up unlimited opportunities for the application of data mining methods, as well as modern big data technologies:

  • classification of clients based on key characteristics of calls (frequency, duration, etc.), SMS frequency;
  • identification of customer loyalty;
  • definition of fraud, etc.

Insurance:

  • risk analysis. By identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a known case when an insurance company discovered that the amounts paid out on the applications of people who are married are twice the amounts on the applications of single people. The company responded to this by revising its discount policy for family customers.
  • fraud detection. Insurance companies can reduce fraud by looking for stereotypes in claims claims that characterize relationships between lawyers, doctors, and claimants.

The practical application of data mining and the solution of specific problems is presented in our next video.

Webinar 1. Webinar "Practical tasks of Data Mining: problems and solutions"


Webinar 2. Webinar "Data Mining and Text Mining: Examples of Solving Real Problems"



You can get deeper knowledge on the methodology and technology of data mining at StatSoft courses.

Ministry of Education and Science of the Russian Federation

Federal State Budgetary Educational Institution of Higher Professional Education

"NATIONAL RESEARCH TOMSK POLYTECHNICAL UNIVERSITY"

Institute of Cybernetics

Direction Informatics and Computer Engineering

Department of VT

Test

in the discipline of informatics and computer technology

Topic: Data Mining Methods

Introduction

data mining. Basic concepts and definitions

1 Stages in the data mining process

2 Components of data mining systems

3 Data mining methods in Data Mining

Data Mining Methods

1 Derivation of association rules

2 Neural network algorithms

3 Nearest neighbor and k-nearest neighbor methods

4 Decision trees

5 Clustering algorithms

6 Genetic algorithms

Applications

Manufacturers of Data Mining Tools

Criticism of methods

Conclusion

Bibliography

Introduction

The result of the development of information technology is a colossal amount of data accumulated in electronic form and growing at a rapid pace. At the same time, the data, as a rule, have a heterogeneous structure (texts, images, audio, video, hypertext documents, relational databases). Data accumulated over a long period may contain patterns, trends and relationships that are valuable information for planning, forecasting, decision-making and process control. However, a person is physically unable to analyze such volumes of heterogeneous data effectively. The methods of traditional mathematical statistics have long claimed the role of the main tool of data analysis; however, they do not allow new hypotheses to be synthesized and can only be used to confirm pre-formulated hypotheses and for "rough" exploratory analysis, which forms the basis of online analytical processing (OLAP). Often it is the formulation of a hypothesis that turns out to be the most difficult task when conducting analysis for subsequent decision-making, since not all patterns in the data are obvious at first glance. For this reason, data mining technologies are considered one of the most important and promising topics for research and application in the information technology industry. Data mining here is understood as the process of discovering new, correct and potentially useful knowledge in large data sets. MIT Technology Review described Data Mining as one of the ten emerging technologies that will change the world.

1. Data Mining. Basic concepts and definitions

Data Mining is the process of discovering previously unknown, non-trivial, practically useful and accessible knowledge in raw data, which is necessary for making decisions in various areas of human activity.

The essence and purpose of Data Mining technology can be formulated as follows: it is a technology that is designed to search for non-obvious, objective and practical patterns in large amounts of data.

Non-obvious patterns are patterns that cannot be detected by standard methods of information processing or by an expert.

Objective laws should be understood as laws that are fully consistent with reality, in contrast to expert opinion, which is always subjective.

This concept of data analysis suggests that:

§ data may be inaccurate, incomplete (contain gaps), contradictory, heterogeneous, indirect, and at the same time have gigantic volumes; therefore, understanding data in specific applications requires significant intellectual effort;

§ the data analysis algorithms themselves may have “elements of intelligence”, in particular, the ability to learn from precedents, that is, to draw general conclusions based on particular observations; the development of such algorithms also requires considerable intellectual effort;

§ The processes of processing raw data into information and information into knowledge cannot be performed manually and require automation.

Data Mining technology is based on the concept of patterns (templates) that reflect fragments of multidimensional relationships in the data. These patterns are regularities, inherent in subsamples of the data, that can be expressed concisely in a human-readable form.

The search for templates is carried out by methods that are not limited by a priori assumptions about the structure of the sample and the type of distributions of the values ​​of the analyzed indicators.

An important feature of Data Mining is that the patterns sought are non-standard and non-obvious. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that, instead of checking interdependencies presupposed by users, they are able to find such interdependencies on their own on the basis of the available data and to build hypotheses about their nature. There are five standard types of patterns identified by Data Mining methods:

association - high probability of connection of events with each other. An example of an association is items in a store, often purchased together;

sequence - a high probability of a chain of events connected in time. An example of a sequence is a situation where, within a certain period of time after the acquisition of one product, another will be purchased with a high degree of probability;

Classification - there are signs that characterize the group to which this or that event or object belongs;

clustering - a pattern similar to classification and differing from it in that the groups themselves are not specified - they are detected automatically in the process of data processing;

· temporal patterns - the presence of patterns in the dynamics of the behavior of certain data. A typical example of a temporal pattern is seasonal fluctuations in demand for certain goods or services.

1.1 Steps in the Data Mining Process

Traditionally, the following stages are distinguished in the process of data mining:

1. Study of the subject area, as a result of which the main goals of the analysis are formulated.

2. Data collection.

3. Data preprocessing (a pandas sketch of this stage is given after this list):

a. Data cleaning: elimination of contradictions and random "noise" from the original data.

b. Data integration: combining data from several possible sources in a single repository.

c. Data transformation: the data are converted to a form suitable for analysis. Data aggregation, attribute discretization, data compression and dimensionality reduction are often used.

4. Data analysis. Within this stage, mining algorithms are applied to extract patterns.

5. Interpretation of the found patterns. This stage may include visualization of the extracted patterns and identification of the really useful patterns on the basis of some utility function.

6. Use of the new knowledge.

1.2 Components of mining systems

Typically, the following main components are distinguished in data mining systems:

1. Database, data warehouse or other repository of information. This can be one or several databases, a data warehouse, spreadsheets or other types of repositories over which cleaning and integration can be performed.

2. Database or data warehouse server. This server is responsible for retrieving the relevant data based on the user's request.

3. Knowledge base. This is the domain knowledge that indicates how to search for patterns and how to evaluate their usefulness.

4. Knowledge mining service. It is an integral part of the data mining system and contains a set of functional modules for tasks such as characterization, association search, classification, cluster analysis and variance analysis.

5. Pattern evaluation module. This component computes measures of the interestingness or usefulness of patterns.

6. Graphical user interface. This module is responsible for communication between the user and the data mining system and for visualizing the patterns in various forms.

1.3 Data Mining Methods

Most of the analytical methods used in Data Mining technology are well-known mathematical algorithms and methods. New in their application is the possibility of their use in solving certain specific problems, due to the emerging capabilities of hardware and software. It should be noted that most of the Data Mining methods were developed within the framework of the theory of artificial intelligence. Consider the most widely used methods:

1. Derivation of association rules.

2. Neural network algorithms, the idea of which is based on an analogy with the functioning of nervous tissue: the initial parameters are treated as signals that are transformed in accordance with the existing connections between "neurons", and the response of the entire network is treated as the answer resulting from the analysis of the initial data.

3. Selection of a close analogue of the original data from the already available historical data, also called the nearest neighbor method.

4. Decision trees - a hierarchical structure based on a set of questions that require a "Yes" or "No" answer.

5. Cluster models, used to group similar events into groups based on the similar values of several fields in a data set.

In the next chapter, we will describe these methods in more detail.

2. Data mining methods

2.1 Derivation of association rules

Association rules are rules of the form "if...then...". Searching for such rules in a data set reveals hidden relationships in seemingly unrelated data. One of the most frequently cited examples of the search for association rules is the problem of finding stable relationships in a shopping cart. This problem is to determine which products are purchased together by the customers, so that marketers can appropriately place these products in the store to increase sales.

Association rules are defined as statements of the form {X1, X2, …, Xn} -> Y, meaning that Y can be present in a transaction provided that X1, X2, …, Xn are present in the same transaction. Note that the word "can" implies that the rule is not an identity but holds only with some probability. In addition, Y can be a set of items, not just a single item. The probability of finding Y in a transaction that contains the items X1, X2, …, Xn is called the confidence of the rule. The percentage of transactions containing all the items of the rule out of the total number of transactions is called its support. The confidence level that a rule's confidence must exceed for the rule to be considered interesting is called interestingness.
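To make these measures concrete, here is a minimal sketch in Python (not part of the original text; the basket data and function names are invented for illustration) of how support and confidence could be computed over a list of transactions:

def support(transactions, itemset):
    # Fraction of transactions that contain every item of the itemset
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Estimated probability that the consequent is present, given that the antecedent is present
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    if not with_antecedent:
        return 0.0
    return sum(1 for t in with_antecedent if consequent <= set(t)) / len(with_antecedent)

# Toy market-basket data, invented for the example
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "butter"},
]
print(support(baskets, {"milk", "bread"}))       # 0.5
print(confidence(baskets, {"milk"}, {"bread"}))  # about 0.67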

There are different types of association rules. In its simplest form, association rules report only the presence or absence of an association. Such rules are called Boolean Association Rules. An example of such a rule is “customers who purchase yogurt also purchase low-fat butter.”

Rules that collect several association rules together are called multilevel or generalized association rules. When constructing such rules, the items are usually grouped according to a hierarchy, and the search is carried out at the highest conceptual level. For example, "customers who buy milk also buy bread." In this example, milk and bread sit at the top of a hierarchy containing various types and brands, but searching at the lower levels would not turn up interesting rules.

A more complex type of rule is the quantitative association rule. These rules involve quantitative (e.g., price) or categorical (e.g., gender) attributes and take the form {x1, x2, …, xn} -> y, where the xi and y are conditions on attribute values or value ranges. For example, "customers who are between 30 and 35 years old with an income of more than 75,000 a year buy cars worth more than 20,000."

The above types of rules do not affect the fact that transactions, by their nature, are time dependent. For example, searching before a product is listed for sale or after it has disappeared from the market will adversely affect the support threshold. With this in mind, the concept of attribute lifetime is introduced in the search algorithms for Temporal Association Rules.

The problem of finding association rules can be broadly decomposed into two parts: searching for frequently occurring sets of elements, and generating rules based on the found frequently occurring sets. Previous research has, for the most part, followed these lines and extended them in various directions.

Since the advent of the Apriori algorithm, it has been the one most commonly used in the first step. Many improvements, for example in speed and scalability, aim to refine Apriori, in particular to curb its tendency to generate too many candidates for the most frequently occurring itemsets. Apriori generates itemsets using only the large itemsets found in the previous step, without revisiting the transactions. The modified AprioriTid algorithm improves on Apriori by using the database only on the first pass; calculations in subsequent steps use only the data created in the first pass, which is much smaller than the original database, and this yields a large performance gain. A further improved version of the algorithm, called AprioriHybrid, can be obtained by using Apriori on the first few passes and then, on later passes, when the k-th candidate sets can already be placed entirely in the computer's memory, switching to AprioriTid.
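For orientation only, the following sketch shows the candidate-generation-and-pruning idea of the basic Apriori step described above; it is a simplified illustration rather than the optimized AprioriTid or AprioriHybrid variants, and the data and threshold are invented:

from itertools import combinations
from collections import defaultdict

def apriori(transactions, min_support):
    # Return every itemset whose support (fraction of transactions) is at least min_support
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}

    items = {frozenset([i]) for t in transactions for i in t}
    level = frequent(items)          # frequent 1-itemsets
    result = dict(level)

    k = 2
    while level:
        prev = list(level)
        # Candidates are unions of frequent (k-1)-itemsets; prune any candidate
        # that has an infrequent (k-1)-subset (the Apriori property)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

baskets = [{"milk", "bread"}, {"milk", "butter"},
           {"milk", "bread", "butter"}, {"bread"}]
print(apriori(baskets, min_support=0.5))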

Further efforts to improve the Apriori algorithm are related to the parallelization of the algorithm (Count Distribution, Data Distribution, Candidate Distribution, etc.), its scaling (Intelligent Data Distribution, Hybrid Distribution), the introduction of new data structures, such as trees of frequently occurring elements (FP-growth ).

The second step is characterized mainly by confidence and interestingness. Newer modifications add the multilevel, quantitative and temporal aspects described above to traditional Boolean rules. An evolutionary algorithm is often used to find the rules.

2.2 Neural network algorithms

Artificial neural networks appeared as a result of applying a mathematical apparatus to the study of the functioning of the human nervous system with the aim of reproducing it, namely the nervous system's ability to learn and correct errors, which should allow us to model, albeit rather crudely, the work of the human brain. The main structural and functional unit of a neural network is the formal neuron, shown in Fig. 1, where x0, x1, ..., xn are the components of the vector of input signals, w0, w1, ..., wn are the weights of the neuron's input signals, and y is the neuron's output signal.

Fig. 1. Formal neuron: synapses (1), adder (2), converter (3).

A formal neuron consists of three types of elements: synapses, an adder and a converter. A synapse characterizes the strength of the connection between two neurons.

The adder performs the addition of the input signals previously multiplied by the corresponding weights. The converter implements the function of one argument - the output of the adder. This function is called the activation function or transfer function of the neuron.
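A formal neuron of this kind can be sketched in a few lines of Python; the sigmoid used here is just one possible choice of transfer function, and the input and weight values are invented for illustration:

import math

def formal_neuron(inputs, weights,
                  activation=lambda s: 1.0 / (1.0 + math.exp(-s))):
    # Adder: weighted sum of the input signals
    s = sum(w * x for w, x in zip(weights, inputs))
    # Converter: transfer (activation) function applied to the adder output
    return activation(s)

# y = f(w0*x0 + w1*x1 + w2*x2) for some illustrative numbers
print(formal_neuron([1.0, 0.5, -1.0], [0.2, 0.8, 0.1]))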

The formal neurons described above can be combined in such a way that the output signals of some neurons serve as inputs to others. The resulting set of interconnected neurons is called an artificial neural network (ANN) or, for short, a neural network.

There are the following three general types of neurons, depending on their position in the neural network:

· input neurons, to which the input signals are applied. Such neurons usually have a single input with unit weight and no bias, and the output value of the neuron equals the input signal;

· output neurons, whose output values represent the resulting output signals of the neural network;

· hidden neurons, which have no direct connections with the input signals, while the values of their output signals are not output signals of the ANN.

According to the structure of interneuronal connections, two classes of ANNs are distinguished:

· Feedforward ANNs, in which the signal propagates only from input neurons to output neurons.

· Recurrent ANNs, that is, ANNs with feedback. In such ANNs, signals can be transmitted between any neurons, regardless of their position in the network.

There are two general approaches to training ANNs:

· supervised learning (learning with a teacher);

· unsupervised learning (learning without a teacher).

Supervised learning involves the use of a pre-formed set of training examples. Each example contains a vector of input signals and a corresponding vector of reference output signals, which depend on the task at hand. This set is called the training set (training sample). Training the neural network amounts to changing the weights of the ANN connections so that, for a given vector of input signals, the output signals of the ANN differ as little as possible from the required output values.
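As a rough sketch of such supervised weight adjustment, the following example trains a single linear neuron with a simple delta rule rather than a full multilayer network; the training data and learning-rate value are invented:

import random

def train_neuron(training_set, epochs=100, lr=0.1):
    # training_set: list of (input_vector, target_output) pairs
    n_inputs = len(training_set[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = 0.0
    for _ in range(epochs):
        for x, target in training_set:
            y = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = target - y                      # difference from the reference output
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Learn y close to 2*x1 - x2 from a handful of invented examples
data = [([1, 0], 2), ([0, 1], -1), ([1, 1], 1), ([2, 1], 3)]
print(train_neuron(data))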

In unsupervised learning, the connection weights are adjusted either as a result of competition between neurons or taking into account the correlation of the output signals of the neurons between which there is a connection. In the case of unsupervised learning, the training set is not used.

Neural networks are used to solve a wide range of problems, such as planning payloads for space shuttles and forecasting exchange rates. However, they are not often used in data mining systems because of the complexity of the model (knowledge encoded in the weights of several hundred interneuronal connections is practically impossible for a person to analyze and interpret) and the long training time on a large training set. On the other hand, neural networks offer such advantages for data analysis tasks as resistance to noisy data and high accuracy.

2.3 Nearest neighbor and k-nearest neighbor methods

The nearest neighbor algorithm and the k-nearest neighbors algorithm (KNN) are based on the similarity of objects. The nearest neighbor algorithm selects, among all known objects, the object that is as close as possible (according to some distance metric between objects, for example the Euclidean distance) to the new, previously unknown object. The main problem with the nearest neighbor method is its sensitivity to outliers in the training data.

The described problem can be avoided by the KNN algorithm, which selects from all observations the k nearest neighbors most similar to the new object. A decision about the new object is then made based on the classes of these nearest neighbors. An important task of this algorithm is the selection of the coefficient k, the number of records that will be considered similar. A modification of the algorithm in which a neighbor's contribution is weighted by its proximity to the new object (the k-weighted nearest neighbors method) makes it possible to achieve greater classification accuracy. The k-nearest neighbors method also makes it possible to assess the accuracy of the forecast: for example, if all k nearest neighbors have the same class, then the probability that the object being checked has this class is very high.
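A minimal sketch of the KNN decision procedure might look as follows (Euclidean distance and unweighted majority voting; the training points are invented):

import math
from collections import Counter

def knn_classify(train, new_point, k=3):
    # train: list of (feature_vector, class_label) pairs
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # The k records most similar to the new object
    neighbours = sorted(train, key=lambda pair: dist(pair[0], new_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

points = [([1, 1], "A"), ([1, 2], "A"),
          ([5, 5], "B"), ([6, 5], "B"), ([5, 6], "B")]
print(knn_classify(points, [2, 1], k=3))   # expected: "A"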

Among the features of the algorithm, it is worth noting its resistance to anomalous outliers, since the probability of such a record falling among the k nearest neighbors is small; if this does happen, its influence on the vote (especially a weighted vote, for k > 2) is also likely to be insignificant, and consequently its influence on the outcome of the classification will be small as well. Further advantages are simplicity of implementation, ease of interpreting the algorithm's result, and the possibility of modifying the algorithm by using the most appropriate combination functions and metrics, which makes it possible to tune the algorithm to a specific task. The KNN algorithm also has a number of disadvantages. First, the data set used by the algorithm must be representative. Second, the model cannot be separated from the data: all examples must be used to classify each new example. This feature severely limits the use of the algorithm.

2.4 Decision trees

The term "decision trees" refers to a family of algorithms based on the representation of classification rules in a hierarchical, sequential structure. This is the most popular class of algorithms for solving data mining problems.

A family of algorithms for constructing decision trees makes it possible to predict the value of a parameter for a given case based on a large amount of data on other similar cases. Typically, algorithms of this family are used to solve problems that make it possible to divide all initial data into several discrete groups.

When applying decision tree algorithms to a set of initial data, the result is displayed as a tree. Such algorithms make it possible to carry out several levels of such separation, breaking the resulting groups (tree branches) into smaller ones based on other features. The division continues until the values to be predicted are the same (or, in the case of a continuous value of the predicted parameter, close) for all received groups (leaves of the tree). It is these values that are used to make predictions based on this model.

The operation of algorithms for constructing decision trees is based on the use of regression and correlation analysis methods. One of the most popular algorithms of this family is CART (Classification and Regression Trees), based on the division of data in a tree branch into two child branches; further division of one branch or another depends on how much initial data is described by this branch. Some other similar algorithms allow you to split a branch into more child branches. In this case, the division is made on the basis of the highest correlation coefficient for the data described by the branch between the parameter according to which the division occurs and the parameter that must be further predicted.
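Assuming the scikit-learn library is available, a CART-style tree can be built and inspected roughly as follows; the loan-repayment data here are invented purely for illustration:

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented data: [age, income] -> whether a loan was repaid (1) or not (0)
X = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 90], [28, 25]]
y = [0, 1, 1, 0, 1, 0]

# DecisionTreeClassifier implements a CART-style algorithm:
# each node splits the data into two child branches
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 70]]))   # predicted class for a new case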

The popularity of the approach is associated with visibility and comprehensibility. But decision trees are fundamentally incapable of finding the “best” (most complete and accurate) rules in the data. They implement the naive principle of successive viewing of signs and actually find parts of real patterns, creating only the illusion of a logical conclusion.

2.5 Clustering algorithms

Clustering is the task of partitioning a set of objects into groups called clusters. The main difference between clustering and classification is that the list of groups is not clearly defined and is determined in the course of the algorithm.

The application of cluster analysis in general terms is reduced to the following steps:

· selection of a sample of objects for clustering;

· definition of the set of variables by which the objects in the sample will be evaluated and, if necessary, normalization of the variable values;

· calculation of the values of the similarity measure between objects;

· application of the cluster analysis method to create groups of similar objects (clusters);

· presentation of the results of the analysis.

After receiving and analyzing the results, it is possible to adjust the selected metric and clustering method until an optimal result is obtained.

Among the clustering algorithms, hierarchical and flat groups are distinguished. Hierarchical algorithms (also called taxonomy algorithms) do not build a single partition of the sample into disjoint clusters, but a system of nested partitions. Thus, the output of the algorithm is a tree of clusters, the root of which is the entire sample, and the leaves are the smallest clusters. Flat algorithms build one partition of objects into non-intersecting clusters.

Another classification of clustering algorithms divides them into crisp and fuzzy algorithms. Crisp (non-overlapping) algorithms assign a cluster number to each object of the sample, that is, each object belongs to exactly one cluster. Fuzzy (overlapping) algorithms assign to each object a set of real values showing the degree of the object's membership in each cluster. Thus, each object belongs to each cluster with some degree of membership.

There are two main types of hierarchical clustering algorithms: bottom-up and top-down algorithms. Top-down algorithms work on a top-down basis: first, all objects are placed in one cluster, which is then divided into smaller and smaller clusters. More common are bottom-up algorithms, which initially place each object in a separate cluster and then merge the clusters into larger and larger ones until all the objects of the sample are contained in a single cluster. In this way a system of nested partitions is constructed. The results of such algorithms are usually presented in the form of a tree.

The disadvantage of hierarchical algorithms is the system of complete partitions, which may be redundant in the context of the problem being solved.

Let us now consider flat algorithms. The simplest algorithms in this class are the quadratic error algorithms. The clustering problem for these algorithms can be viewed as the construction of an optimal partition of the objects into groups, where optimality is defined as the requirement to minimize the root-mean-square partitioning error:

e = Σ_j Σ_{x_i ∈ S_j} || x_i − c_j ||²,

where S_j is the set of objects assigned to cluster j and c_j is the "center of mass" of cluster j (the point whose coordinates are the mean values of the characteristics over that cluster).

The most common algorithm in this category is the k-means method. This algorithm builds a given number of clusters located as far apart from one another as possible. The work of the algorithm is divided into several stages:

1. Randomly choose k points that serve as the initial "centers of mass" of the clusters.

2. Assign each object to the cluster with the nearest "center of mass".

3. Recompute the "centers of mass" of the clusters according to their current composition.

4. If the stopping criterion is not satisfied, return to step 2.

The minimum change of the root-mean-square error is usually chosen as the stopping criterion. It is also possible to stop the algorithm if at step 2 no objects moved from one cluster to another. The disadvantages of this algorithm include the need to specify the number of clusters in advance.
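The steps above can be sketched directly; the stopping criterion used here is "no center moved", and the sample points are invented:

import random

def k_means(points, k, iterations=100):
    # points: list of equal-length numeric tuples
    centres = random.sample(points, k)
    clusters = []
    for _ in range(iterations):
        # Step 2: assign each object to the cluster with the nearest centre
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((pa - ca) ** 2 for pa, ca in zip(p, centres[i])))
            clusters[j].append(p)
        # Step 3: recompute each centre as the mean of its cluster
        new_centres = [tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centres[i]
                       for i, c in enumerate(clusters)]
        # Step 4: stop when no centre has moved
        if new_centres == centres:
            break
        centres = new_centres
    return centres, clusters

data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
print(k_means(data, k=2))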

The most popular fuzzy clustering algorithm is the c-means algorithm. It is a modification of the k-means method. Algorithm steps:

1. Choose an initial fuzzy partition of the n objects into k clusters by choosing a membership matrix U of size n x k.

2. Using the matrix U, find the value of the fuzzy error criterion:

E(X, U) = Σ_i Σ_k U_ik · || x_i − c_k ||²,

where the sums run over all objects i and all clusters k, and c_k is the "center of mass" of fuzzy cluster k:

c_k = ( Σ_i U_ik · x_i ) / ( Σ_i U_ik ).

3. Regroup the objects so as to reduce this value of the fuzzy error criterion.

4. Return to step 2 until the changes in the matrix U become insignificant.

This algorithm may not be suitable if the number of clusters is not known in advance, or if it is necessary to uniquely attribute each object to one cluster.
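For illustration, here is a rough fuzzy c-means sketch following the steps above; it uses the common variant of the criterion with a fuzzifier m = 2 and a small tolerance on the changes in U, and all data and names are invented:

import random

def fuzzy_c_means(points, k, m=2.0, iterations=100, eps=1e-4):
    n, dim = len(points), len(points[0])
    # Step 1: random initial membership matrix U, each row normalised to sum to 1
    U = []
    for _ in range(n):
        row = [random.random() for _ in range(k)]
        s = sum(row)
        U.append([u / s for u in row])

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) or 1e-12

    centres = []
    for _ in range(iterations):
        # Centres of mass of the fuzzy clusters (weights U[i][j] ** m)
        centres = []
        for j in range(k):
            w = [U[i][j] ** m for i in range(n)]
            total = sum(w)
            centres.append(tuple(sum(w[i] * points[i][d] for i in range(n)) / total
                                 for d in range(dim)))
        # Steps 2-3: recompute memberships so as to reduce the fuzzy error criterion
        new_U = []
        for i in range(n):
            d = [dist2(points[i], centres[j]) for j in range(k)]
            new_U.append([1.0 / sum((d[j] / d[l]) ** (1.0 / (m - 1)) for l in range(k))
                          for j in range(k)])
        # Step 4: stop when the changes in U become insignificant
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(k))
        U = new_U
        if change < eps:
            break
    return centres, U

data = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centres, U = fuzzy_c_means(data, k=2)
print(centres)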

The next group of algorithms are those based on graph theory. The essence of such algorithms is that the sample of objects is represented as a graph G = (V, E), whose vertices correspond to objects and whose edges have a weight equal to the "distance" between objects. The advantages of graph clustering algorithms are their clarity, relative ease of implementation and the possibility of introducing various improvements based on geometric considerations. The main algorithms are the connected components algorithm, the minimum spanning tree algorithm and the layer-by-layer clustering algorithm.

The connected components algorithm depends on a distance threshold R: edges longer than R are removed from the graph, and the remaining connected components form the clusters. To select the parameter R, a histogram of the distribution of pairwise distances is usually constructed. In problems with a well-defined cluster structure of the data, the histogram has two peaks: one corresponds to intra-cluster distances, the other to inter-cluster distances. The parameter R is chosen from the zone of the minimum between these peaks. At the same time, it is rather difficult to control the number of clusters by means of a distance threshold.

The minimum spanning tree algorithm first builds a minimum spanning tree on the graph and then successively removes the edges with the highest weight. The layer-by-layer clustering algorithm is based on identifying the connected components of the graph at a certain level of distances between objects (vertices). The distance level is set by a distance threshold c; for example, if the distances between objects satisfy 0 ≤ d(x_i, x_j) ≤ 1, then 0 ≤ c ≤ 1.

The layer-by-layer clustering algorithm generates a sequence of subgraphs of the graph G that reflect the hierarchical relationships between clusters:

G_0 ⊆ G_1 ⊆ … ⊆ G_m,

where G_t = (V, E_t) is the graph at threshold level c_t, with E_t the set of edges whose length does not exceed c_t; c_t is the t-th distance threshold; m is the number of hierarchy levels; G_0 = (V, ∅), where ∅ is the empty set of graph edges, obtained at c_0 = 0; and G_m = G, that is, the graph of objects without any restriction on the distances (edge lengths), since c_m = 1.

By changing the distance thresholds (c_0, …, c_m), where 0 = c_0 < c_1 < … < c_m = 1, it is possible to control the depth of the hierarchy of the resulting clusters. Thus, the layer-by-layer clustering algorithm is able to create both a flat partition of the data and a hierarchical one.
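A simple way to sketch layer-by-layer clustering is to recompute the connected components for each threshold c_t, keeping only the edges whose length does not exceed c_t; the union-find helper and the one-dimensional points below are illustrative only:

def layered_clustering(points, thresholds, dist):
    # For each threshold c_t keep only edges with dist <= c_t and
    # return the connected components of the resulting graph G_t
    def components(edges, n):
        parent = list(range(n))
        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        for a, b in edges:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
        groups = {}
        for v in range(n):
            groups.setdefault(find(v), []).append(v)
        return list(groups.values())

    n = len(points)
    all_edges = [(i, j, dist(points[i], points[j]))
                 for i in range(n) for j in range(i + 1, n)]
    levels = []
    for c in thresholds:
        edges = [(i, j) for i, j, d in all_edges if d <= c]
        levels.append(components(edges, n))
    return levels

# Two tight one-dimensional groups; thresholds normalised to [0, 1]
pts = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
print(layered_clustering(pts, [0.0, 0.15, 1.0], dist=lambda a, b: abs(a - b)))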

Clustering achieves the following goals:

· improving the understanding of the data by identifying structural groups: dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision making by applying a separate analysis method to each cluster;

· compact data storage: instead of storing the entire sample, one typical observation from each cluster can be kept;

· detection of new, atypical objects that did not fall into any cluster.

Usually, clustering is used as an auxiliary method in data analysis.

2.6 Genetic algorithms

Genetic algorithms are among the universal optimization methods that allow solving problems of various types (combinatorial, general problems with and without restrictions) and varying degrees of complexity. At the same time, genetic algorithms are characterized by the possibility of both single-criteria and multi-criteria search in a large space, the landscape of which is not smooth.

This group of methods uses an iterative process of evolving a sequence of generations of models that includes the operations of selection, mutation and crossover. At the beginning of the algorithm, the population is formed randomly. To assess the quality of the encoded solutions, a fitness function is used, which is needed to compute the fitness of each individual. Based on the results of evaluating the individuals, the fittest of them are selected for crossover. By applying the genetic crossover operator to the selected individuals, offspring are created whose genetic information is formed by exchanging chromosomal information between the parent individuals. The created descendants form a new population, and some of the descendants mutate, which is expressed as a random change of their genotypes. The stage that includes the sequence "population evaluation" - "selection" - "crossover" - "mutation" is called a generation. The evolution of a population consists of a sequence of such generations.
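The generation loop described above can be sketched as follows; truncation selection and one-point crossover are used here only to keep the illustration short, and the "one-max" fitness function is invented for the example:

import random

def genetic_algorithm(fitness, genome_length, population_size=30,
                      generations=50, mutation_rate=0.05):
    # Genomes are bit strings; one generation = evaluation -> selection -> crossover -> mutation
    population = [[random.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)   # evaluation
        parents = scored[:population_size // 2]                  # simple truncation selection
        offspring = []
        while len(offspring) < population_size:
            a, b = random.sample(parents, 2)
            cut = random.randint(1, genome_length - 1)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]                             # mutation
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)

# "One-max": fitness is simply the number of ones in the genome
best = genetic_algorithm(fitness=sum, genome_length=20)
print(best, sum(best))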

The following algorithms for selecting individuals for crossover are distinguished:

· Panmixia. Both individuals that make up a parent pair are chosen at random from the entire population, and any individual can become a member of several pairs. This approach is universal, but the efficiency of the algorithm decreases as the population grows.

· Selection. Only individuals with fitness not lower than the average can become parents. This approach provides faster convergence of the algorithm.

· Inbreeding. The method is based on forming a pair on the basis of close kinship. Kinship here refers to the distance between members of the population, both in the sense of the geometric distance between individuals in the parameter space and in the sense of the Hamming distance between genotypes; accordingly, genotypic and phenotypic inbreeding are distinguished. The first member of the pair is chosen at random, and the second is, with higher probability, the individual closest to it. Inbreeding can be characterized by the concentration of the search in local nodes, which in effect splits the population into separate local groups around regions of the landscape suspected of containing an extremum.

· Outbreeding. A pair is formed on the basis of distant kinship, that is, from the most distant individuals. Outbreeding is aimed at preventing the algorithm from converging to already found solutions, forcing it to explore new, unexplored regions.

Algorithms for the formation of a new population:

· Selection with displacement. Of all individuals with the same genotype, preference is given to the one whose fitness is higher. Two goals are thus achieved: the best solutions found, with different chromosome sets, are not lost, and sufficient genetic diversity is constantly maintained in the population. Displacement forms a new population of widely differing individuals instead of individuals clustering around the currently found solution. This method is used for multi-extremal problems.

· Elite selection. Elite selection methods guarantee that the best members of the population survive selection: some of the best individuals pass to the next generation without any changes. The fast convergence provided by elite selection can be compensated by a suitable method of selecting parent pairs, and outbreeding is often used for this. It is precisely the combination "outbreeding - elite selection" that is one of the most effective.

· Tournament selection. Tournament selection runs n tournaments to select n individuals. Each tournament is built on selecting k elements from the population and choosing the best individual among them. Tournament selection with k = 2 is the most common (a sketch of this scheme follows the list).
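As a small illustration of the tournament scheme referenced above (population, fitness function and parameters are invented):

import random

def tournament_selection(population, fitness, n, k=2):
    # Run n tournaments of size k; each tournament contributes its fittest member
    selected = []
    for _ in range(n):
        contenders = random.sample(population, k)
        selected.append(max(contenders, key=fitness))
    return selected

# Select 4 parents from a toy population of numbers, using the value itself as fitness
print(tournament_selection(list(range(10)), fitness=lambda x: x, n=4, k=2))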

One of the most in-demand applications of genetic algorithms in the field of Data Mining is the search for an optimal model (a search for an algorithm that matches the specifics of a particular domain). Genetic algorithms are primarily used to optimize the topology and weights of neural networks, but they can also be used as a standalone tool.

3. Applications

Data Mining technology has a really wide range of applications, being, in fact, a set of universal tools for analyzing data of any type.

Marketing

One of the very first areas where data mining technologies were applied was the field of marketing. The task with which the development of Data Mining methods began is called shopping cart analysis.

This task consists of identifying products that buyers tend to purchase together. Knowledge of the shopping basket is needed for advertising campaigns, for forming personal recommendations to customers, for developing a strategy for stocking goods and for deciding how to lay them out on the sales floor.

Also in marketing, such tasks are solved as determining the target audience of a particular product for its more successful promotion; research on time patterns that helps businesses make inventory decisions; creation of predictive models, which enables enterprises to recognize the nature of the needs of various categories of customers with certain behavior; predicting customer loyalty, which allows you to identify in advance the moment of customer departure when analyzing his behavior and, possibly, prevent the loss of a valuable customer.

Industry

One of the important areas in this area is monitoring and quality control, where, using analysis tools, it is possible to predict equipment failure, the appearance of malfunctions, and plan repair work. Predicting the popularity of certain features and knowing which features are usually ordered together helps to optimize production, orienting it to the real needs of consumers.

Medicine

In medicine, data analysis is also used quite successfully. An example of tasks can be the analysis of examination results, diagnostics, comparison of the effectiveness of treatments and drugs, analysis of diseases and their spread, identification of side effects. Data mining technologies such as association rules and sequential patterns have been successfully used to identify relationships between drug use and side effects.

Molecular genetics and genetic engineering

Perhaps the most acute and at the same time clearly posed task of discovering regularities in experimental data arises in molecular genetics and genetic engineering. Here it is formulated as the identification of markers, understood as genetic codes that control certain phenotypic traits of a living organism. Such codes may contain hundreds, thousands or more related items. Analysis of such data has also yielded the relationships discovered by geneticists between changes in the human DNA sequence and the risk of developing various diseases.

Applied chemistry

Data mining methods are also used in the field of applied chemistry. Here, the question often arises of elucidating the features of the chemical structure of certain compounds that determine their properties. This task is especially relevant in the analysis of complex chemical compounds, the description of which includes hundreds and thousands of structural elements and their bonds.

Fight against crime

In security, Data Mining tools have been used only relatively recently, but practical results have already been obtained that confirm the effectiveness of data mining in this area. Swiss scientists have developed a system for analyzing protest activity in order to predict future incidents, and a system for tracking emerging cyber threats and the actions of hackers around the world; the latter makes it possible to forecast cyber threats and other information security risks. Data Mining methods are also successfully used to detect credit card fraud: by analyzing past transactions that later turned out to be fraudulent, a bank identifies certain stereotypes of such fraud.

Other applications

· Risk analysis. For example, by identifying combinations of factors associated with paid claims, insurers can reduce their liability losses. There is a well-known case in the United States when a large insurance company found that the amounts paid out on the applications of people who are married are twice the amount on the applications of single people. The company has responded to this new knowledge by revisiting its general family discount policy.

· Meteorology. Weather prediction using neural network methods, in particular Kohonen self-organizing maps.

· Personnel policy. Analysis tools help HR departments to select the most successful candidates based on the analysis of their resume data, model the characteristics of ideal employees for a particular position.

4. Producers of Data Mining Tools

Data Mining tools traditionally belong to expensive software products. Therefore, until recently, the main consumers of this technology were banks, financial and insurance companies, large trading enterprises, and the main tasks requiring the use of Data Mining were the assessment of credit and insurance risks and the development of a marketing policy, tariff plans and other principles of working with clients. In recent years, the situation has undergone certain changes: relatively inexpensive Data Mining tools and even free distribution systems have appeared on the software market, which has made this technology available to small and medium-sized businesses.

Among the paid tools and systems for data analysis, the leaders are SAS Institute (SAS Enterprise Miner), SPSS (SPSS, Clementine) and StatSoft (STATISTICA Data Miner). Well-known solutions are also offered by Angoss (Angoss KnowledgeSTUDIO), IBM (IBM SPSS Modeler), Microsoft (Microsoft Analysis Services) and Oracle (Oracle Data Mining).

The choice of free software is also varied. There are both universal analysis tools, such as JHepWork, KNIME, Orange and RapidMiner, and specialized tools, such as Carrot2 (a framework for clustering text data and search query results), Chemicalize.org (a solution in the field of applied chemistry) and NLTK (Natural Language Toolkit, a natural language processing tool).

5. Criticism of methods

The results of Data Mining depend largely on the level of data preparation, and not on the "wonderful capabilities" of some algorithm or set of algorithms. About 75% of the work on Data Mining consists of collecting data, which takes place even before analysis tools are applied. Careless use of the tools leads to a waste of the company's potential, and sometimes of millions of dollars.

The opinion of Herb Edelstein, a world-famous expert in the field of Data Mining, Data Warehousing and CRM: “A recent study by Two Crows showed that Data Mining is still at an early stage of development. Many organizations are interested in this technology, but only a few are actively implementing such projects. Another important point became clear: the process of implementing Data Mining in practice turns out to be more complicated than expected. The teams were carried away by the myth that Data Mining tools are easy to use. It is assumed that it is enough to run such a tool on a terabyte database, and useful information will instantly appear. In fact, a successful data mining project requires an understanding of the essence of the activity, knowledge of the data and the tools, as well as of the process of data analysis.” Thus, before using Data Mining technology, it is necessary to carefully analyze the limitations imposed by the methods and the critical issues associated with it, as well as to soberly assess the capabilities of the technology. The critical questions include the following:

1. Technology cannot provide answers to questions that have not been asked. It cannot replace the analyst, but only gives him a powerful tool to facilitate and improve his work.

2. The complexity of the development and operation of the Data Mining application.

Since this technology is a multidisciplinary field, to develop an application that includes Data Mining, it is necessary to involve specialists from different fields, as well as to ensure their high-quality interaction.

3. User qualification.

Various Data Mining tools have a different degree of "friendliness" of the interface and require a certain user skill. Therefore, the software must correspond to the user's level of training. The use of Data Mining should be inextricably linked with the improvement of the user's skills. However, there are currently few Data Mining specialists who are well versed in business processes.

4. Extracting useful information is impossible without a good understanding of the essence of the data.

Careful model selection and interpretation of the dependencies or patterns that are found are required. Therefore, working with such tools requires close cooperation between a domain expert and a specialist in Data Mining tools. The resulting models must be well integrated into business processes so that they can be evaluated and updated. Recently, Data Mining systems have been supplied as part of data warehousing technology.

5. Complexity of data preparation.

Successful analysis requires high-quality data preprocessing. According to analysts and database users, the preprocessing process can take up to 80% of the entire Data Mining process.

Thus, for the technology to pay off, considerable effort and time must be spent on preliminary data analysis, model selection and model adjustment.

6. A large percentage of false, unreliable or useless results.

With the help of Data Mining technologies, you can find really very valuable information that can give a significant advantage in further planning, management, and decision making. However, the results obtained using Data Mining methods quite often contain false and meaningless conclusions. Many experts argue that Data Mining tools can produce a huge amount of statistically unreliable results. To reduce the percentage of such results, it is necessary to check the adequacy of the obtained models on test data. However, it is impossible to completely avoid false conclusions.

7. High cost.

A high-quality software product is the result of significant labor costs on the part of the developer. Therefore, Data Mining software is traditionally referred to as expensive software products.

8. Availability of sufficient representative data.

Data mining tools, unlike statistical ones, theoretically do not require a strictly defined amount of historical data. This feature can cause the detection of unreliable, false models and, as a result, making incorrect decisions based on them. It is necessary to control the statistical significance of the discovered knowledge.


Conclusion

A brief description of the areas of application has been given, along with criticism of Data Mining technology and the opinions of experts in this field.

List of literature

1. Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. 2nd ed. University of Illinois at Urbana-Champaign.

2. Michael J. A. Berry. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 2nd ed.

3. Siu Ning Lam. Discovering Association Rules in Data Mining. Department of Computer Science, University of Illinois at Urbana-Champaign.

What is Data Mining

The corporate database of any modern enterprise usually contains a set of tables that store records about certain facts or objects (for example, about goods, their sales, customers, invoices). As a rule, each entry in such a table describes a particular object or fact. For example, an entry in the sales table reflects the fact that such and such a product was sold to such and such a client by such and such a manager at that time, and by and large contains nothing but this information. However, the accumulation of a large number of such records accumulated over several years can become a source of additional, much more valuable information that cannot be obtained on the basis of one specific record, namely, information about patterns, trends or interdependencies between any data. Examples of such information are information about how sales of a particular product depend on the day of the week, time of day or season, which categories of buyers most often purchase a particular product, what proportion of buyers of one specific product purchases another specific product, which category of customers most often does not repay the loan on time.

This kind of information is usually used in forecasting, strategic planning and risk analysis, and its value to the enterprise is very high. Apparently, this is why the process of searching for it came to be called Data Mining (the English word "mining" refers to the extraction of minerals, and searching for patterns in a huge set of factual data is indeed akin to this). The term Data Mining denotes not so much a specific technology as the very process of searching for correlations, trends, relationships and patterns by means of various mathematical and statistical algorithms: clustering, creating subsamples, regression and correlation analysis. The purpose of this search is to present data in a form that clearly reflects business processes, as well as to build a model that can be used to predict processes that are critical for business planning (for example, the dynamics of demand for certain goods or services, or the dependence of their purchase on particular consumer characteristics).

Note that traditional mathematical statistics, which for a long time remained the main tool for data analysis, as well as online analytical processing (OLAP) tools, which we have already written about many times (see the materials on this topic on our CD), cannot always be successfully used to solve such problems. Typically, statistical methods and OLAP are used to test pre-formulated hypotheses. However, it is often the formulation of the hypothesis that turns out to be the most difficult task when carrying out business analysis for subsequent decision making, since not all patterns in the data are obvious at first glance.

The basis of modern Data Mining technology is the concept of patterns that reflect regularities inherent in subsamples of the data. Patterns are searched for by methods that do not use any a priori assumptions about these subsamples. While statistical analysis or OLAP applications usually formulate questions like "What is the average number of unpaid invoices among customers of this service?", Data Mining, as a rule, answers questions like "Is there a typical category of customers who do not pay their bills?". It is the answer to the second question that often provides a more non-trivial approach to marketing policy and to the organization of work with clients.

An important feature of Data Mining is the non-standard and non-obviousness of the patterns being sought. In other words, Data Mining tools differ from statistical data processing tools and OLAP tools in that instead of checking interdependencies that users presuppose, they are able to find such interdependencies on their own based on the available data and build hypotheses about their nature.

It should be noted that the use of Data Mining tools does not exclude the use of statistical tools and OLAP tools, since the results of data processing using the latter, as a rule, contribute to a better understanding of the nature of the patterns that should be sought.

Initial data for Data Mining

The use of Data Mining is justified if there is a sufficiently large amount of data, ideally contained in a correctly designed data warehouse (in fact, data warehouses themselves are usually created to solve analysis and forecasting problems related to decision support). We have also written repeatedly about the principles of building data warehouses; the relevant materials can be found on our CD, so we will not dwell on this issue. We only recall that the data in the warehouse form a regularly replenished set, common to the entire enterprise, that makes it possible to reconstruct a picture of its activities at any point in time. Note also that the warehouse data structure is designed so that queries against it execute as efficiently as possible. However, there are Data Mining tools that can search for patterns, correlations and trends not only in data warehouses but also in OLAP cubes, that is, in sets of pre-processed statistical data.

Types of patterns revealed by Data Mining methods

According to V.A. Dyuk, there are five standard types of patterns identified by Data Mining methods:

Association - a high probability of connecting events with each other (for example, one product is often purchased together with another);

Sequence - a high probability of a chain of events related in time (for example, within a certain period after the purchase of one product, another will be purchased with a high degree of probability);

Classification - there are signs that characterize the group to which this or that event or object belongs (usually, certain rules are formulated based on the analysis of already classified events);

Clustering is a pattern similar to classification and differs from it in that the groups themselves are not set in this case - they are detected automatically during data processing;

Temporal patterns - the presence of patterns in the dynamics of the behavior of certain data (a typical example is seasonal fluctuations in demand for certain goods or services) used for forecasting.

Data mining methods in Data Mining

Today there are quite a large number of different methods of data mining. Based on the above classification proposed by V.A. Dyuk, among them are:

Regression, dispersion and correlation analysis (implemented in most modern statistical packages, in particular in the products of SAS Institute, StatSoft, etc.);

Methods of analysis in a specific subject area based on empirical models (often used, for example, in inexpensive financial analysis tools);

Neural network algorithms, the idea of which is based on an analogy with the functioning of nervous tissue: the initial parameters are treated as signals that are transformed in accordance with the existing connections between "neurons", and the response of the entire network is treated as the answer resulting from the analysis of the initial data. The connections in this case are created by so-called network training on a large sample containing both the initial data and the correct answers;

Algorithms - the choice of a close analogue of the original data from the already available historical data. Also called the nearest neighbor method;

Decision trees - a hierarchical structure based on a set of questions that imply the answer "Yes" or "No"; despite the fact that this method of data processing does not always ideally find existing patterns, it is quite often used in forecasting systems due to the clarity of the response received;

Cluster models (sometimes also called segmentation models) are used to group similar events into groups based on the similar values of several fields in a data set; they are also very popular in the creation of forecasting systems;

Limited search algorithms that calculate the frequencies of combinations of simple logical events in subgroups of data;

Evolutionary programming - search and generation of an algorithm that expresses the interdependence of data, based on an initially specified algorithm, modified in the search process; sometimes the search for interdependencies is carried out among any certain types of functions (for example, polynomials).

More details about these and other Data Mining algorithms, as well as the tools that implement them, can be found in the book "Data Mining: a training course" by V.A. Dyuk and A.P. Samoylenko. Today it is one of the few books in Russian devoted to this problem.

Leading manufacturers of data mining tools

Data Mining tools, like most Business Intelligence tools, traditionally belong to expensive software tools - the price of some of them reaches several tens of thousands of dollars. Therefore, until recently, the main consumers of this technology were banks, financial and insurance companies, large trading enterprises, and the main tasks requiring the use of Data Mining were the assessment of credit and insurance risks and the development of a marketing policy, tariff plans and other principles of working with clients. In recent years, the situation has undergone certain changes: relatively inexpensive Data Mining tools from several manufacturers have appeared on the software market, which made this technology available to small and medium-sized businesses that had not thought about it before.

Modern Business Intelligence tools include report generators, analytical data processing tools, BI solution development tools (BI Platforms) and the so-called Enterprise BI Suites - enterprise-wide data analysis and processing tools that allow you to carry out a set of actions related to data analysis and reporting, and often include an integrated set of BI tools and BI application development tools. The latter, as a rule, contain both reporting tools and OLAP tools, and often Data Mining tools.

According to Gartner Group analysts, Business Objects, Cognos, Information Builders are the leaders in the enterprise-scale data analysis and processing market, and Microsoft and Oracle also claim leadership (Fig. 1). As for the development tools for BI solutions, the main contenders for leadership in this area are Microsoft and SAS Institute (Fig. 2).

Note that Microsoft's Business Intelligence tools are relatively inexpensive products available to a wide range of companies. That is why we are going to consider some practical aspects of using Data Mining using the products of this company as an example in the subsequent parts of this article.

Literature:

1. Dyuk V.A. Data Mining - data mining. - http://www.olap.ru/basic/dm2.asp .

2. Dyuk V.A., Samoylenko A.P. Data Mining: training course. - St. Petersburg: Peter, 2001.

3. B. de Ville. Microsoft Data Mining. Digital Press, 2001.


