Dimensionality reduction: the essence of the problem and various methods for solving it

  • In statistics, machine learning, and information theory, dimensionality reduction is a data transformation that consists in reducing the number of variables by obtaining a set of principal variables. The transformation can be divided into feature selection and feature extraction.


As a result of studying the material of chapter 5, the student should:

know

  • basic concepts and problems of dimension reduction;
  • approaches to solving the problem of transformation of feature space;

be able to

  • use the principal component method to move to standardized orthogonal features;
  • evaluate the decrease in the information content of data with a decrease in the dimension of the feature space;
  • solve the problem of constructing optimal multidimensional scales for the study of objects;

master

  • dimensionality reduction methods for solving applied problems of statistical analysis;
  • the skills of interpreting variables in the transformed feature space.

Basic concepts and problems of dimension reduction

At first glance it may seem that the more information about the objects of study, in the form of a set of features characterizing them, is used to create a model, the better. However, too much information can reduce the effectiveness of data analysis. There is even a term, the "curse of dimensionality", that characterizes the problems of working with high-dimensional data. The need to reduce the dimension in one form or another arises in the solution of various statistical problems.

Non-informative features are a source of additional noise and affect the accuracy of the estimation of model parameters. In addition, datasets with a large number of features may contain groups of correlated variables. The presence of such groups of features means duplication of information, which can distort the specification of the model and affect the quality of the estimation of its parameters. The higher the dimension of the data, the greater the amount of computation required for its algorithmic processing.

Two directions can be distinguished in reducing the dimension of the feature space, according to the variables used: the selection of features from the existing initial set, and the formation of new features by transforming the initial data. Ideally, a reduced representation of the data should have a dimension corresponding to the dimension inherent in the data (the intrinsic dimension).

The search for the most informative features characterizing the phenomenon under study is an obvious way to reduce the dimension of the problem that does not require transformation of the original variables. This makes it possible to make the model more compact and to avoid the losses associated with the interfering effect of uninformative features. The selection of informative features consists in finding the best subset of the set of all initial variables. The criterion of "best" can be either the highest modeling quality for a given dimension of the feature space, or the smallest data dimension at which a model of a given quality can be built.

The direct solution of the problem of creating the best model requires enumerating all possible combinations of features, which is usually prohibitively laborious. Therefore, one usually resorts to forward or backward feature selection. In forward selection procedures, variables from the initial set are added sequentially until the required quality of the model is achieved. In algorithms of successive reduction of the original feature space (backward selection), the least informative variables are removed step by step until the information content of the model has decreased to an acceptable level.
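To make the procedure concrete, a minimal sketch of greedy forward selection is given below. The scoring function model_quality, the stopping rule and the target quality are placeholders introduced only for this example and are not prescribed by the text.

import numpy as np

def forward_selection(X, y, model_quality, target_quality):
    """Greedy forward selection: repeatedly add the feature that most
    improves model_quality(X[:, subset], y); stop when the target quality
    is reached or no candidate improves the model."""
    n_features = X.shape[1]
    selected, best_quality = [], -np.inf
    while len(selected) < n_features:
        candidates = [j for j in range(n_features) if j not in selected]
        # score each remaining feature added to the current subset
        quality, best_j = max((model_quality(X[:, selected + [j]], y), j)
                              for j in candidates)
        if quality <= best_quality:
            break                              # no further improvement
        selected.append(best_j)
        best_quality = quality
        if best_quality >= target_quality:
            break                              # required model quality reached
    return selected, best_quality

Backward selection is symmetric: it starts from the full feature set and repeatedly removes the variable whose deletion degrades the quality of the model least.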

It should be borne in mind that the informativeness of features is relative. The selection should provide high informativeness of the set of features as a whole, not just high total informativeness of its constituent variables. Thus, the presence of correlation between features reduces their overall informativeness because of the duplication of information common to them. Therefore, adding a new feature to those already selected increases informativeness only to the extent that it contains useful information absent from the previously selected variables. The simplest situation is the selection of mutually orthogonal features, for which the selection algorithm is extremely simple: the variables are ranked by their informativeness, and the shortest prefix of this ranking that provides the required informativeness is used.

The limitation of feature selection methods for reducing the dimension of space is the assumption that the necessary features are directly present in the initial data, which usually turns out to be incorrect. An alternative approach to dimensionality reduction is to transform the features into a reduced set of new variables. In contrast to the selection of initial features, the formation of a new feature space involves the creation of new variables, which are usually functions of the original features. These variables, which are not directly observable, are often referred to as latent, or hidden. In the process of their creation, these variables can be endowed with various useful properties, such as orthogonality. In practice, the initial features are usually interrelated, so transforming their space into an orthogonal one generates new feature coordinates that are free from the effect of duplicating information about the objects under study.

Displaying objects in a new orthogonal feature space makes it possible to see how useful each of the features is in terms of the differences between these objects. If the coordinates of the new basis are ordered according to the variance characterizing the scatter of their values over the observations under consideration, then it becomes obvious that, from a practical point of view, some features with small variances are useless, since the objects are practically indistinguishable in these features compared to their differences in the more informative variables. In such a situation we can speak of degeneration of the original feature space of k variables, whose real dimension m may be less than the original (m < k).

The reduction of the feature space is accompanied by a certain decrease in the information content of the data, but the level of acceptable reduction can be determined in advance. Feature extraction projects a set of initial variables into a space of lower dimension. Compressing the feature space to two or three dimensions can be useful for data visualization. Thus, the process of forming a new feature space usually leads to a smaller set of really informative variables. A better model can be built on them, since it relies on a smaller number of the most informative features.

The formation of new variables based on the original ones is used for latent semantic analysis, data compression, classification and pattern recognition, increasing the speed and efficiency of learning processes. Compressed data is usually used for further analysis and modeling.

One of the important applications of feature space transformation and dimension reduction is the construction of synthetic latent categories based on measured feature values. These latent features can characterize certain general aspects of the phenomenon under study, integrating the particular properties of the observed objects, which makes it possible to build integral indicators at various levels of generalization of the information.

Feature space reduction methods also play an essential role in studying the duplication of information in the initial features, which leads to "inflation" of the variance of the estimated coefficients of regression models. The transition to new variables, ideally orthogonal and meaningfully interpretable, is an effective modeling tool under multicollinearity of the initial data.

The transformation of the initial feature space into an orthogonal one is convenient for solving classification problems, since it allows one to justifiably apply certain measures of proximity or difference between objects, such as the Euclidean distance or the squared Euclidean distance. In regression analysis, building the regression equation on the principal components makes it possible to solve the problem of multicollinearity.
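As a hedged illustration of the last remark, the sketch below fits a regression on the leading principal components of the standardized features instead of on the original correlated variables. The number of retained components n_comp, and the routine itself, are assumptions made for the example; it is a sketch of principal component regression, not a reproduction of a specific procedure from the text.

import numpy as np

def pcr_fit(X, y, n_comp):
    """Principal component regression: project standardized features onto
    their leading principal components (which are orthogonal, so there is
    no multicollinearity among them) and fit least squares in that basis."""
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Xs = (X - mean) / std
    eigval, eigvec = np.linalg.eigh(np.cov(Xs, rowvar=False))
    order = np.argsort(eigval)[::-1][:n_comp]      # leading components
    W = eigvec[:, order]                           # projection matrix
    Z = Xs @ W                                     # orthogonal component scores
    A = np.column_stack([np.ones(len(y)), Z])      # intercept + scores
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)

    def predict(X_new):
        return beta[0] + (((X_new - mean) / std) @ W) @ beta[1:]

    return predict, beta, W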

In multivariate statistical analysis, each object is described by a vector whose dimension is arbitrary (but the same for all objects). However, a person can directly perceive only numerical data or points on a plane. It is already much more difficult to analyze clusters of points in three-dimensional space. Direct perception of higher-dimensional data is impossible. Therefore, it is quite natural to want to move from a multivariate sample to low-dimensional data so that “you can look at it”.

In addition to the desire for visibility, there are other motives for reducing the dimension. Factors on which the variable of interest to the researcher does not depend only hinder the statistical analysis. First, collecting information about them consumes resources. Second, as can be proved, their inclusion in the analysis worsens the properties of statistical procedures (in particular, it increases the variance of estimates of parameters and characteristics of distributions). Therefore, it is desirable to get rid of such factors.

Let us discuss, from the point of view of dimensionality reduction, the example of using regression analysis to forecast sales considered in subsection 3.2.3. First, in that example it was possible to reduce the number of independent variables from 17 to 12. Second, it was possible to construct a new factor, a linear function of the 12 remaining factors, which predicts sales volume better than all other linear combinations of the factors. Therefore, we can say that as a result the dimension of the problem decreased from 18 to 2: there remained one independent factor (the linear combination given in subsection 3.2.3) and one dependent factor, sales volume.

When analyzing multivariate data, one usually considers not one but many problems, in particular with different choices of the independent and dependent variables. Therefore, consider the dimensionality reduction problem in the following formulation. A multivariate sample is given. It is required to move from it to a set of vectors of smaller dimension, preserving the structure of the initial data as much as possible and, if possible, without losing the information contained in the data. The task is made specific within the framework of each particular dimensionality reduction method.

The principal component method is one of the most commonly used dimensionality reduction methods. Its main idea is to sequentially identify the directions in which the data have the greatest spread. Let the sample consist of vectors identically distributed with the vector X = (x(1), x(2), …, x(n)). Consider linear combinations

Y(λ(1), λ(2), …, λ(n)) = λ(1)x(1) + λ(2)x(2) + … + λ(n)x(n),

λ²(1) + λ²(2) + … + λ²(n) = 1.

Here the vector λ = (λ(1), λ(2), …, λ(n)) lies on the unit sphere in n-dimensional space.

In the principal component method, the direction of maximum scatter is found first of all, i.e. the λ at which the variance of the random variable Y(λ) = Y(λ(1), λ(2), …, λ(n)) reaches its maximum. This vector λ defines the first principal component, and the quantity Y(λ) is the projection of the random vector X onto the axis of the first principal component.

Then, in terms of linear algebra, one considers the hyperplane in n-dimensional space perpendicular to the first principal component and projects all elements of the sample onto this hyperplane. The dimension of the hyperplane is 1 less than the dimension of the original space.

In the hyperplane under consideration, the procedure is repeated: the direction of the greatest spread in it, i.e. the second principal component, is found. Then the hyperplane perpendicular to the first two principal components is taken; its dimension is 2 less than the dimension of the original space. The next iteration follows.

From the point of view of linear algebra, we are talking about constructing a new basis in n-dimensional space, whose orts are principal components.

The variance corresponding to each new principal component is smaller than for the previous one. Usually one stops when it falls below a given threshold. If k principal components have been selected, this means that it was possible to pass from the n-dimensional space to a k-dimensional one, i.e. to reduce the dimension from n to k practically without distorting the structure of the source data.
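The sketch below implements this scheme through the eigendecomposition of the sample covariance matrix, which is mathematically equivalent to the successive search for directions of maximum variance described above; the variance threshold used to choose k is an assumption made for the example.

import numpy as np

def pca_reduce(X, var_threshold=0.95):
    """Project X onto the k leading principal components, where k is the
    smallest number of components whose variances add up to at least
    var_threshold of the total variance."""
    Xc = X - X.mean(axis=0)                        # center the data
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigval)[::-1]               # sort by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = np.cumsum(eigval) / eigval.sum()
    k = int(np.searchsorted(explained, var_threshold)) + 1
    scores = Xc @ eigvec[:, :k]                    # coordinates in the new basis
    return scores, eigvec[:, :k], eigval[:k]

Plotting scores[:, 0] against scores[:, 1] gives exactly the projection onto the plane of the first two principal components mentioned next.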

For visual data analysis, the projections of the original vectors onto the plane of the first two principal components are often used. Usually the data structure is then clearly visible: compact clusters of objects and separately located vectors stand out.

The principal component method is one of the methods of factor analysis. Various factor analysis algorithms are united by the fact that in all of them there is a transition to a new basis in the original n-dimensional space. The concept of "factor loading" is important; it describes the role of an initial factor (variable) in the formation of a certain vector of the new basis.
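A short sketch of obtaining loadings from the decomposition returned by pca_reduce above follows. The convention loading = eigenvector × sqrt(eigenvalue), appropriate for standardized variables, is one common choice and is stated here as an assumption rather than as the definition used in the text.

import numpy as np

def factor_loadings(eigvec, eigval):
    """One common convention: loading = eigenvector * sqrt(eigenvalue), so
    that for standardized variables a loading approximates the correlation
    between an original variable and the corresponding component."""
    return eigvec * np.sqrt(np.maximum(eigval, 0.0))

Each row of the resulting matrix describes how strongly one initial variable participates in each vector of the new basis, which is what the grouping of factors described below is based on.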

A new idea compared to the principal component method is that, based on the loadings, the factors are divided into groups. One group combines factors that have a similar effect on the elements of the new basis. Then it is recommended to leave one representative from each group. Sometimes, instead of choosing a representative by calculation, a new factor is formed that is central to the group in question. Dimension reduction occurs in the transition to the system of factors that are the representatives of the groups; the remaining factors are discarded.

The described procedure can be carried out not only with the help of factor analysis. We are talking about cluster analysis of features (factors, variables). Various cluster analysis algorithms can be used to divide features into groups. It is enough to introduce a distance (proximity measure, difference indicator) between features. Let X and Y be two features. The difference d(X, Y) between them can be measured using sample correlation coefficients:

d1(X, Y) = 1 − |rn(X, Y)|,  d2(X, Y) = 1 − |ρn(X, Y)|,

where rn(X, Y) is the sample linear Pearson correlation coefficient and ρn(X, Y) is Spearman's sample rank correlation coefficient.
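A hedged sketch of such a feature clustering is given below: the features are grouped by the distance d1 = 1 − |r| and one representative is kept per group. The use of average-linkage hierarchical clustering, the number of groups and the choice of representative are assumptions made for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def group_features(X, n_groups):
    """Cluster the features (columns of X) by the distance d = 1 - |r|,
    where r is the sample Pearson correlation, and pick one representative
    per group: the feature closest on average to the rest of its group."""
    R = np.corrcoef(X, rowvar=False)
    D = 1.0 - np.abs(R)                            # the difference d1(X, Y) above
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method="average")
    labels = fcluster(Z, t=n_groups, criterion="maxclust")
    representatives = []
    for g in np.unique(labels):
        members = np.where(labels == g)[0]
        mean_d = D[np.ix_(members, members)].mean(axis=1)
        representatives.append(int(members[np.argmin(mean_d)]))
    return labels, representatives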

Multidimensional scaling. An extensive class of multidimensional scaling methods is based on the use of distances (measures of proximity, indicators of difference) d(X, Y) between features X and Y. The main idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together adequately describe the object. In this case, relations between objects are replaced by relations between the points representing them: data on the similarity of objects by the distances between points, data on superiority by the mutual arrangement of points.

In practice, a number of different multidimensional scaling models are used. All of them face the problem of estimating the true dimension of the factor space. Let us consider this problem using the example of processing data on the similarity of objects by metric scaling.

Let there be n objects O(1), O(2), …, O(n), and for each pair of objects O(i), O(j) let a measure of their similarity s(i, j) be given. We assume that always s(i, j) = s(j, i). The origin of the numbers s(i, j) is irrelevant for describing how the algorithm works. They could be obtained by direct measurement, with the use of experts, by calculation from a set of descriptive characteristics, or in some other way.

In Euclidean space, the n objects under consideration must be represented by a configuration of n points, with the Euclidean distance d(i, j) between the corresponding points. The degree of correspondence between the set of objects and the set of points representing them is determined by comparing the similarity matrix ||s(i, j)|| and the distance matrix ||d(i, j)||. The metric similarity functional has the form

S = Σ |s(i, j) − d(i, j)|,

where the sum is taken over all pairs of objects. The geometric configuration must be chosen so that the functional S reaches its minimum value.
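The sketch below looks for such a configuration numerically. For a smooth optimization it minimizes the squared variant of the functional, Σ (s(i, j) − d(i, j))², which is also the quantity behind the mean squared error α(m) discussed next; the learning rate, the number of iterations and the random initialization are assumptions made for the example.

import numpy as np

def metric_mds(S, m, n_iter=2000, lr=0.01, seed=0):
    """Find a configuration of points in m-dimensional Euclidean space whose
    pairwise distances approximate the similarity matrix S, by gradient
    descent on the squared functional: sum over i < j of (S[i, j] - d(i, j))**2."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    X = rng.normal(size=(n, m))                    # random initial configuration
    iu = np.triu_indices(n, k=1)                   # pairs i < j
    for _ in range(n_iter):
        diff = X[:, None, :] - X[None, :, :]       # pairwise coordinate differences
        dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
        resid = np.zeros_like(dist)
        resid[iu] = dist[iu] - S[iu]               # d(i, j) - s(i, j) for i < j
        resid = resid + resid.T
        grad = ((resid / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1) + 1e-12)
    stress = float(((dist[iu] - S[iu]) ** 2).sum())
    return X, stress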

Comment. In non-metric scaling, instead of the proximity of the proximity measures and the distances themselves, one considers the proximity of the orderings on the set of proximity measures and on the set of corresponding distances. Instead of the functional S, analogues of the Spearman and Kendall rank correlation coefficients are used. In other words, non-metric scaling assumes that the proximity measures are measured on an ordinal scale.

Let the Euclidean space have dimension m. Consider the minimum mean squared error

α(m) = min (2 / (n(n − 1))) Σ (s(i, j) − d(i, j))²,

where the minimum is taken over all possible configurations of n points in m-dimensional Euclidean space and the sum is over all pairs of objects. It can be shown that this minimum is attained on some configuration. It is clear that as m grows the quantity α(m) decreases monotonically (more precisely, it does not increase). It can be shown that for m > n − 1 it is equal to 0 (if s(i, j) is a metric). To increase the possibilities of meaningful interpretation, it is desirable to work in a space of the smallest possible dimension. In this case, however, the dimension must be chosen so that the points represent the objects without large distortions. The question arises: how to rationally choose the dimension, i.e. the natural number m?

Within the framework of deterministic data analysis, there seems to be no reasonable answer to this question. Therefore, it is necessary to study the behavior of α(m) in certain probabilistic models. If the proximity measures s(i, j) are random variables whose distribution depends on the "true dimension" m0 (and, possibly, on some other parameters), then in the classical mathematical-statistical style we can pose the problem of estimating m0, look for consistent estimates, and so on.

Let us start building probabilistic models. We assume that the objects are points in a Euclidean space of dimension k, where k is large enough. That the "true dimension" equals m0 means that all these points lie on a hyperplane of dimension m0. Let us assume for definiteness that the set of points under consideration is a sample from a circular normal distribution with variance σ²(0). This means that the objects O(1), O(2), …, O(n) are mutually independent random vectors, each of which is constructed as ζ(1)e(1) + ζ(2)e(2) + … + ζ(m0)e(m0), where e(1), e(2), …, e(m0) is an orthonormal basis in the subspace of dimension m0 in which the considered points lie, and ζ(1), ζ(2), …, ζ(m0) are mutually independent one-dimensional normal random variables with mathematical expectation 0 and variance σ²(0).

Consider two models for obtaining the proximity measures s(i, j). In the first of them, s(i, j) differ from the Euclidean distances between the corresponding points because the points are known with distortions. Let c(1), c(2), …, c(n) be the points under consideration. Then

s(i, j) = d(c(i) + ε(i), c(j) + ε(j)), i, j = 1, 2, …, n,

where d is the Euclidean distance between points in k-dimensional space and the vectors ε(1), ε(2), …, ε(n) are a sample from a circular normal distribution in k-dimensional space with zero mathematical expectation and covariance matrix σ²(1)I, where I is the identity matrix. In other words, ε(i) = η(i, 1)e(1) + η(i, 2)e(2) + … + η(i, k)e(k), where e(1), e(2), …, e(k) is an orthonormal basis in k-dimensional space and {η(i, t), i = 1, 2, …, n, t = 1, 2, …, k} is a set of mutually independent one-dimensional random variables with zero mathematical expectation and variance σ²(1).

In the second model, the distortions are imposed directly on the distances themselves:

s(i, j) = d(c(i), c(j)) + ε(i, j), i, j = 1, 2, …, n, i ≠ j,

where {ε(i, j), i, j = 1, 2, …, n} are mutually independent normal random variables with mathematical expectation 0 and variance σ²(1).

It has been shown that, for both of the formulated models, the minimum mean squared error α(m) converges in probability, as n → ∞, to

f(m) = f1(m) + σ²(1)(k − m), m = 1, 2, …, k.

The function f(m) is linear on the intervals [1, m0] and [m0, k], and it decreases faster on the first interval than on the second. It follows that the statistic

m* = Arg min (α(m + 1) − 2α(m) + α(m − 1))

is a consistent estimate of the true dimension m0.

So, a recommendation follows from the probabilistic theory: use m* as the estimate of the dimension of the factor space. Note that such a recommendation was formulated as a heuristic by one of the founders of multidimensional scaling, J. Kruskal, who proceeded from the experience of practical use of multidimensional scaling and from computational experiments. The probabilistic theory made it possible to substantiate this heuristic recommendation.
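A hedged numerical illustration of this recommendation follows. It reuses the metric_mds sketch above to approximate α(m), simulates data according to the second distortion model, and computes the second differences on which the statistic m* is built; the sample size, the noise level, the gradient-descent approximation of the minimum and the reading of the bend of the α(m) curve are all assumptions made for the example.

import numpy as np

def alpha_curve(S, k_max):
    """Approximate alpha(m), the minimum mean squared error of a metric
    scaling configuration, for m = 1..k_max, using the metric_mds sketch
    above as the numerical minimizer."""
    n = S.shape[0]
    n_pairs = n * (n - 1) / 2
    return np.array([metric_mds(S, m)[1] / n_pairs for m in range(1, k_max + 1)])

def second_differences(alphas):
    """Discrete second differences alpha(m + 1) - 2*alpha(m) + alpha(m - 1);
    the statistic m* in the text is built on them, the sharpest bend of the
    decreasing alpha(m) curve marking the estimated dimension."""
    return alphas[2:] - 2 * alphas[1:-1] + alphas[:-2]

# Illustration: points on a hyperplane of dimension m0 = 3 inside a
# k = 10 dimensional space, proximity measures distorted as in the second model.
rng = np.random.default_rng(1)
n, k, m0, sigma1 = 40, 10, 3, 0.05
C = np.zeros((n, k))
C[:, :m0] = rng.normal(size=(n, m0))               # true configuration
D = np.sqrt(((C[:, None, :] - C[None, :, :]) ** 2).sum(-1))
noise = sigma1 * rng.normal(size=(n, n))
S = D + np.triu(noise, 1) + np.triu(noise, 1).T    # symmetric distorted distances
print(alpha_curve(S, k_max=6))                     # a visible drop is expected near m = m0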


Keywords

MATHEMATICS / APPLIED STATISTICS / MATHEMATICAL STATISTICS / GROWTH POINTS / PRINCIPAL COMPONENT ANALYSIS / FACTOR ANALYSIS / MULTIDIMENSIONAL SCALING / ESTIMATION OF DATA DIMENSION / ESTIMATION OF MODEL DIMENSION

Abstract of a scientific article in mathematics; authors: Alexander Ivanovich Orlov, Evgeny Veniaminovich Lutsenko.

One of the "growth points" of applied statistics is methods for reducing the dimension of the space of statistical data. They are increasingly used in the analysis of data in specific applied research, for example, sociological. Let us consider the most promising dimensionality reduction methods. The principal component method is one of the most commonly used of them. For visual data analysis, projections of the original vectors onto the plane of the first two principal components are often used; usually the data structure is then clearly visible, with compact clusters of objects and separately located vectors standing out. The principal component method is one of the methods of factor analysis. The new idea compared to the principal component method is that, based on the loadings, the factors are divided into groups. One group combines factors that have a similar effect on the elements of the new basis. It is then recommended to leave one representative from each group. Sometimes, instead of choosing a representative by calculation, a new factor is formed that is central to the group in question. Dimension reduction occurs in the transition to the system of factors that are the representatives of the groups; the remaining factors are discarded. An extensive class of multidimensional scaling methods is based on the use of distances (proximity measures, difference indicators) between features. The main idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together adequately describe the object. As an example of the application of probabilistic-statistical modeling and of the results of statistics of non-numerical data, we justify the consistency of the estimate of the dimension of the data space in multidimensional scaling previously proposed by Kruskal for heuristic reasons. A number of works on estimating the dimension of models (in regression analysis and in the theory of classification) are considered. Information about dimensionality reduction algorithms in automated system-cognitive analysis is given.



The text of the scientific work on the topic "Methods for reducing the dimension of the space of statistical data"

UDC 519.2:005.521:633.1:004.8

01.00.00 Physical and mathematical sciences

METHODS FOR REDUCING THE DIMENSION OF THE SPACE OF STATISTICAL DATA

Orlov Alexander Ivanovich
Doctor of Economics, Doctor of Technical Sciences, Candidate of Physical and Mathematical Sciences, Professor
RSCI SPIN-code: 4342-4994
Bauman Moscow State Technical University, Russia, 105005, Moscow, 2nd Baumanskaya st., 5, [email protected]

Lutsenko Evgeny Veniaminovich
Doctor of Economics, Candidate of Technical Sciences, Professor
RSCI SPIN-code: 9523-7101
Kuban State Agrarian University, Krasnodar, Russia, [email protected]

One of the "growth points" of applied statistics is the methods of reducing the dimension of the space of statistical data. They are increasingly used in the analysis of data in specific applied research, for example, sociological. Let us consider the most promising methods of dimensionality reduction. Principal component analysis is one of the most commonly used dimensionality reduction methods. For visual data analysis, projections of the original vectors onto the plane of the first two principal components are often used. Usually the data structure is then clearly visible: compact clusters of objects and separately located vectors stand out. Principal component analysis is one of the methods of factor analysis. A new idea compared to the principal component method is that, based on the loadings, the factors are divided into groups. One group combines factors that have a similar effect on the elements of the new basis. It is then recommended to leave one representative from each group. Sometimes, instead of choosing a representative by calculation, a new factor is formed that is central to the group in question. Dimension reduction occurs in the transition to the system of factors that are the representatives of the groups; the remaining factors are discarded. An extensive class of multidimensional scaling methods is based on the use of distances (measures of proximity, indicators of difference) between features. The main idea of this class of methods is to represent each object as a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together adequately describe the object. As an example of the application of probabilistic-statistical modeling and the results of statistics of non-numerical data, we justify the consistency of the estimate of the dimension of the data space in multidimensional scaling, previously proposed by Kruskal from heuristic considerations. A number of works on estimating the dimensions of models (in regression analysis and in the theory of classification) are considered. Information about dimensionality reduction algorithms in automated system-cognitive analysis is given.

Keywords: MATHEMATICS, APPLIED STATISTICS, MATHEMATICAL STATISTICS, GROWTH POINTS, PRINCIPAL COMPONENT METHOD, FACTOR ANALYSIS, MULTIDIMENSIONAL SCALING, ESTIMATION OF DATA DIMENSION, ESTIMATION OF MODEL DIMENSION

1. Introduction

As already noted, one of the “growth points” of applied statistics is the methods of reducing the dimension of the statistical data space. They are increasingly used in the analysis of data in specific applied research, for example, sociological. Let us consider the most promising methods of dimensionality reduction. As an example of the application of probabilistic-statistical modeling and the results of statistics of non-numerical data, we will justify the consistency of the estimate of the dimension of space, previously proposed by Kruskal from heuristic considerations.

In multivariate statistical analysis, each object is described by a vector whose dimension is arbitrary (but the same for all objects). However, a person can directly perceive only numerical data or points on a plane. It is already much more difficult to analyze clusters of points in three-dimensional space. Direct perception of higher-dimensional data is impossible. Therefore, it is quite natural to want to move from a multivariate sample to low-dimensional data so that "one can look at it". For example, a marketer can visually see how many different types of consumer behavior there are (i.e. how many market segments it is expedient to single out) and which consumers (with what properties) fall into them.

In addition to the desire for visibility, there are other motives for reducing the dimension. Those factors on which the variable of interest to the researcher does not depend only hinder the statistical analysis. Firstly, financial, time and human resources are spent on collecting information about them. Secondly, as can be proved, their inclusion in the analysis worsens the properties of statistical procedures (in particular, it increases the variance of estimates of parameters and characteristics of distributions). Therefore, it is desirable to get rid of such factors.

When analyzing multivariate data, one usually considers not one but many problems, in particular with different choices of the independent and dependent variables. Therefore, consider the dimensionality reduction problem in the following formulation. A multivariate sample is given. It is required to move from it to a set of vectors of smaller dimension, preserving the structure of the initial data as much as possible and, if possible, without losing the information contained in the data. The task is made specific within the framework of each particular dimensionality reduction method.

2. Principal component method

It is one of the most commonly used dimensionality reduction methods. Its main idea is to sequentially identify the directions in which the data have the greatest spread. Let the sample consist of vectors identically distributed with the vector X = (x(1), x(2), …, x(n)). Consider linear combinations

Y(λ(1), λ(2), …, λ(n)) = λ(1)x(1) + λ(2)x(2) + … + λ(n)x(n),

λ²(1) + λ²(2) + … + λ²(n) = 1.

Here the vector λ = (λ(1), λ(2), …, λ(n)) lies on the unit sphere in n-dimensional space.

In the principal component method, the direction of maximum scatter is found first of all, i.e. the λ at which the variance of the random variable Y(λ) = Y(λ(1), λ(2), …, λ(n)) reaches its maximum. This vector λ specifies the first principal component, and the value Y(λ) is the projection of the random vector X onto the axis of the first principal component.

Then, in terms of linear algebra, a hyperplane in n-dimensional space is considered, perpendicular to the first principal component, and all elements of the sample are projected onto this hyperplane. The dimension of the hyperplane is 1 less than the dimension of the original space.

In the hyperplane under consideration, the procedure is repeated: the direction of the greatest spread in it, i.e. the second principal component, is found. Then the hyperplane perpendicular to the first two principal components is taken; its dimension is 2 less than the dimension of the original space. The next iteration follows.

From the point of view of linear algebra, we are talking about constructing a new basis in an n-dimensional space, the orts of which are the principal components.

The variance corresponding to each new principal component is smaller than for the previous one. Usually one stops when it falls below a given threshold. If k principal components are selected, this means that it was possible to pass from the n-dimensional space to a k-dimensional one, i.e. to reduce the dimension from n to k practically without distorting the structure of the source data.

For visual data analysis, the projections of the original vectors onto the plane of the first two principal components are often used. Usually the data structure is then clearly visible: compact clusters of objects and separately located vectors stand out.

3. Factor analysis

Principal component analysis is one of the methods of factor analysis. Various factor analysis algorithms are united by the fact that in all of them there is a transition to a new basis in the original n-dimensional space. The concept of “factor load” is important, which is used to describe the role of the initial factor (variable) in the formation of a certain vector from a new basis.

A new idea compared to the principal component method is that, based on the loadings, the factors are divided into groups. One group combines factors that have a similar effect on the elements of the new basis. Then it is recommended to leave one representative from each group. Sometimes, instead of choosing a representative by calculation, a new factor is formed that is central to the group in question. Dimension reduction occurs in the transition to the system of factors that are the representatives of the groups; the remaining factors are discarded.

The described procedure can be carried out not only with the help of factor analysis. We are talking about cluster analysis of features (factors, variables). To divide features into groups, various cluster analysis algorithms can be used. It is enough to enter the distance (proximity measure, difference indicator) between features. Let X and Y be two features. The difference d(X,Y) between them can be measured using sample correlation coefficients:

d1(X, Y) = 1 − |rn(X, Y)|,  d2(X, Y) = 1 − |ρn(X, Y)|, where rn(X, Y) is Pearson's sample linear correlation coefficient and ρn(X, Y) is Spearman's sample rank correlation coefficient.

4. Multidimensional scaling.

An extensive class of multidimensional scaling methods is based on the use of distances (measures of proximity, indicators of difference) d(X, Y) between features X and Y. The main idea of this class of methods is to represent each object by a point in a geometric space (usually of dimension 1, 2 or 3) whose coordinates are the values of hidden (latent) factors that together adequately describe the object. In this case, relations between objects are replaced by relations between the points representing them: data on the similarity of objects by the distances between points, data on superiority by the mutual arrangement of points.

5. The problem of estimating the true dimension of the factor space

In the practice of sociological data analysis, a number of different multidimensional scaling models are used. All of them face the problem of estimating the true dimension of the factor space. Let's consider this problem using the example of processing data on the similarity of objects using metric scaling.

Let there be n objects O(1), O(2), …, O(n), and for each pair of objects O(i), O(j) let a measure of their similarity s(i, j) be given. We assume that always s(i, j) = s(j, i). The origin of the numbers s(i, j) does not matter for the description of the operation of the algorithm. They could be obtained by direct measurement, with the use of experts, by calculation from a set of descriptive characteristics, or in some other way.

In Euclidean space, the n objects under consideration must be represented by a configuration of n points, with the Euclidean distance d(i, j) between the corresponding points. The degree of correspondence between the set of objects and the set of points representing them is determined by comparing the similarity matrix ||s(i, j)|| and the distance matrix ||d(i, j)||. The metric similarity functional has the form

S = Σ |s(i, j) − d(i, j)|,

where the sum is taken over all pairs of objects. The geometric configuration must be chosen so that the functional S reaches its minimum value.

Comment. In non-metric scaling, instead of the proximity of the measures of proximity and distances themselves, the proximity of orderings on the set of measures of proximity and the set of corresponding distances is considered. Instead of the functional S, analogues of the Spearman and Kendall rank correlation coefficients are used. In other words, non-metric scaling assumes that proximity measures are measured on an ordinal scale.

Let the Euclidean space have dimension m. Consider the minimum mean squared error

α(m) = min (2 / (n(n − 1))) Σ (s(i, j) − d(i, j))²,

where the minimum is taken over all possible configurations of n points in m-dimensional Euclidean space and the sum is over all pairs of objects. It can be shown that this minimum is attained on some configuration. It is clear that as m increases, the value of α(m) decreases monotonically (more precisely, does not increase). It can be shown that for m > n − 1 it is equal to 0 (if s(i, j) is a metric). To increase the possibilities of meaningful interpretation, it is desirable to work in a space of the smallest possible dimension. In this case, however, the dimension must be chosen so that the points represent the objects without large distortions. The question arises: how to rationally choose the dimension of the space, i.e. the natural number m?

6. Models and methods for estimating the dimension of the data space

Within the framework of deterministic data analysis, there seems to be no reasonable answer to this question. Therefore, it is necessary to study the behavior of α(m) in certain probabilistic models. If the proximity measures s(i, j) are random variables whose distribution depends on the "true dimension" m0 (and, possibly, on some other parameters), then we can pose the problem of estimating m0 in the classical mathematical-statistical style, look for consistent estimates, etc.

Let us start building probabilistic models. We assume that the objects are points in a Euclidean space of dimension k, where k is large enough. The fact that the "true dimension" is equal to m0 means that all these points lie on a hyperplane of dimension m0. Let us assume for definiteness that the set of points under consideration is a sample from a circular normal distribution with variance σ²(0). This means that the objects O(1), O(2), …, O(n) are mutually independent random vectors, each of which is constructed as

ζ(1)e(1) + ζ(2)e(2) + … + ζ(m0)e(m0),

where e(1), e(2), …, e(m0) is an orthonormal basis in the subspace of dimension m0 in which the considered points lie, and ζ(1), ζ(2), …, ζ(m0) are mutually independent one-dimensional normal random variables with mathematical expectation 0 and variance σ²(0).

Consider two models for obtaining the proximity measures s(i, j). In the first of them, s(i, j) differ from the Euclidean distances between the corresponding points because the points are known with distortions. Let c(1), c(2), …, c(n) be the points under consideration. Then

s(i, j) = d(c(i) + ε(i), c(j) + ε(j)), i, j = 1, 2, …, n,

where d is the Euclidean distance between points in k-dimensional space and the vectors ε(1), ε(2), …, ε(n) are a sample from a circular normal distribution in k-dimensional space with zero mathematical expectation and covariance matrix σ²(1)I, where I is the identity matrix. In other words,

ε(i) = η(i, 1)e(1) + η(i, 2)e(2) + … + η(i, k)e(k),

where e(1), e(2), …, e(k) is an orthonormal basis in k-dimensional space and {η(i, t), i = 1, 2, …, n, t = 1, 2, …, k} is a set of mutually independent one-dimensional random variables with zero mathematical expectation and variance σ²(1).

In the second model, the distortions are imposed directly on the distances themselves:

s(i, j) = d(c(i), c(j)) + ε(i, j), i, j = 1, 2, …, n, i ≠ j,

where {ε(i, j), i, j = 1, 2, …, n} are mutually independent normal random variables with mathematical expectation 0 and variance σ²(1).

It has been shown that, for both of the formulated models, the minimum mean squared error α(m) converges in probability, as n → ∞, to

f(m) = f1(m) + σ²(1)(k − m), m = 1, 2, …, k.

The function f(m) is linear on the intervals [1, m0] and [m0, k], and it decreases faster on the first interval than on the second. It follows that the statistic

m* = Arg min (α(m + 1) − 2α(m) + α(m − 1))

is a consistent estimate of the true dimension m0.

So, a recommendation follows from the probabilistic theory: use m* as the estimate of the dimension of the factor space. Note that such a recommendation was formulated as a heuristic by one of the founders of multidimensional scaling, J. Kruskal, who proceeded from the experience of practical use of multidimensional scaling and from computational experiments. The probabilistic theory made it possible to substantiate this heuristic recommendation.

7. Model dimension estimation

If the possible subsets of features form an expanding family, for example, if the degree of a polynomial is being estimated, then it is natural to introduce the term "model dimension" (this concept is in many respects similar to the concept of data space dimension used in multidimensional scaling). The author of this article has written a number of works on estimating the dimension of a model, which are worth comparing with the works on estimating the dimension of the data space discussed above.
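As a hedged illustration of the notion of model dimension, the sketch below chooses the degree of a regression polynomial by a held-out validation error; this generic criterion is an assumption introduced for the example and is not the specific estimate studied in the works discussed here.

import numpy as np

def choose_polynomial_degree(x, y, max_degree=10, val_fraction=0.3, seed=0):
    """Fit polynomials of increasing degree on a training part of the data
    and return the degree (the model dimension) with the smallest mean
    squared error on the held-out validation part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    errors = []
    for deg in range(1, max_degree + 1):
        coeffs = np.polyfit(x[train], y[train], deg)
        pred = np.polyval(coeffs, x[val])
        errors.append(np.mean((pred - y[val]) ** 2))
    return int(np.argmin(errors)) + 1              # degrees start at 1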

The first such work was done by the author of this article during a business trip to France in 1976. It studied one estimate of the model dimension in regression, namely, the estimate of the degree of a polynomial under the assumption that the dependence is described by a polynomial. This estimate was known in the literature, but it was later erroneously attributed to the author of this article, who only studied its properties; in particular, he found that it is not consistent and found its limiting geometric distribution. Other, already consistent, estimates of the dimension of the regression model were proposed and studied in a later article. This cycle was completed by a work containing a number of clarifications.

The most recent publication on this topic includes a discussion of the results of a Monte Carlo study of the rate of convergence in the limit theorems obtained by the author.
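As a hedged aside on what such a Monte Carlo study can look like, the sketch below measures, for a simple case, how the Kolmogorov distance between the distribution of a standardized sample mean and the standard normal law shrinks as the sample size grows; the particular distribution, sample sizes and number of replications are assumptions made for the example.

import numpy as np
from scipy import stats

def clt_convergence_rate(sample_sizes=(10, 40, 160, 640), n_rep=5000, seed=0):
    """Monte Carlo estimate of the Kolmogorov (sup-norm) distance between the
    law of a standardized mean of exponential variables and N(0, 1)."""
    rng = np.random.default_rng(seed)
    distances = {}
    for n in sample_sizes:
        x = rng.exponential(size=(n_rep, n))
        z = (x.mean(axis=1) - 1.0) * np.sqrt(n)    # exp(1) has mean 1 and standard deviation 1
        distances[n] = float(stats.kstest(z, "norm").statistic)
    return distances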

Methodologically similar estimates of the model dimension in the problem of splitting mixtures (part of the theory of classification) are considered in the article.

The estimates of dimension in multidimensional scaling considered above are studied in the works cited. In the same works, the limiting behavior of the characteristics of the principal component method was established (using the asymptotic theory of the behavior of solutions of extremal statistical problems).

8. Algorithms for Dimension Reduction in Automated System Cognitive Analysis

In automated system-cognitive analysis (ASC-analysis), another method of dimensionality reduction is proposed and implemented in the "Eidos" system. It is described in sections 4.2, "Description of algorithms for basic cognitive operations of system analysis (BCOSA)", and 4.3, "Detailed algorithms for BCOSA (ASC-analysis)", of the work cited. Let us give a short description of two algorithms, BKOSA-4.1 and BKOSA-4.2.

BKOSA-4.1. "Abstraction of factors (reducing the dimension of the semantic space of factors)"

Using the method of successive approximations (iterative algorithm), under given boundary conditions, the dimension of the attribute space is reduced without a significant reduction in its volume. The criterion for stopping the iterative process is the achievement of one of the boundary conditions.

BKOSA-4.2. "Abstracting classes (reducing the dimension of the semantic space of classes)"

Using the method of successive approximations (iterative algorithm), under given boundary conditions, the dimension of the class space is reduced without a significant reduction in its volume. The criterion for stopping the iterative process is the achievement of one of the boundary conditions.

All the algorithms actually implemented in the version of the Eidos system current at the time that work was prepared (2002) are given at: http://lc.kubagro.ru/aidos/aidos02/4.3.htm

The essence of the algorithms is as follows.

1. The amount of information contained in the values of the factors about the transition of an object to the states corresponding to the classes is calculated.

2. The value of a factor value for differentiating objects by classes is calculated. This value is simply the variability of the informativeness of the factor's values (there are many quantitative measures of variability: the mean deviation from the mean, the standard deviation, etc.). In other words, if a factor value on average contains little information about whether or not an object belongs to a class, then this value is not very valuable, and if it contains a lot, then it is valuable.

3. The value of the descriptive scales for differentiating objects by classes is calculated. In the works of E.V. Lutsenko this is currently done as the average of the values of the gradations of the scale.

4. Then Pareto optimization of the factor values and descriptive scales is carried out:

the factor values (gradations of the descriptive scales) are ranked in descending order of value, and the least valuable ones, lying to the right of the point where the tangent to the Pareto curve has a 45° slope, are removed from the model;

the factors (descriptive scales) are ranked in descending order of value, and the least valuable ones, lying to the right of the point where the tangent to the Pareto curve has a 45° slope, are removed from the model.

As a result, the dimension of the space built on the descriptive scales is significantly reduced due to the removal of scales that correlate with each other; in effect, this is an orthonormalization of the space in the information metric.
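A minimal sketch of this kind of value-based Pareto cutoff is given below. The informativeness matrix, the use of the standard deviation as the measure of value, and the reading of the 45° tangent rule on normalized axes are stated as assumptions for the example, not as the exact Eidos implementation.

import numpy as np

def pareto_select(info, keep_min=1):
    """info[v, c] = amount of information in feature value v about class c.
    The value of a feature value is taken as the variability (standard
    deviation) of its informativeness over classes; values are kept up to
    the point where the normalized cumulative-value curve has slope 1
    (the 45-degree tangent rule)."""
    value = info.std(axis=1)                       # variability of informativeness
    order = np.argsort(value)[::-1]                # rank by decreasing value
    cum = np.cumsum(value[order]) / value.sum()    # normalized Pareto curve
    x = np.arange(1, len(value) + 1) / len(value)
    slopes = np.gradient(cum, x)                   # local slope of the curve
    below = np.flatnonzero(slopes < 1.0)
    cutoff = int(below[0]) if below.size else len(value)
    cutoff = max(cutoff, keep_min)
    return order[:cutoff]                          # indices of retained feature values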

This process can be repeated, i.e. made iterative; in the current version of the "Eidos" system, iterations are started manually.

The information space of classes is orthonormalized similarly.

Scales and their gradations can be numeric (in which case interval values are processed), and they can also be textual (ordinal or even nominal).

Thus, with the help of BKOSA (ASK-analysis) algorithms, the dimension of space is reduced as much as possible with minimal loss of information.

A number of other dimensionality reduction algorithms have been developed for the analysis of statistical data in applied statistics. The objectives of this article do not include a description of the entire variety of such algorithms.

Literature

1. Orlov A.I. Growth points of statistical methods // Polythematic Network Electronic Scientific Journal of the Kuban State Agrarian University. 2014. No. 103. P. 136-162.

2. Kruskal J. The relationship between multidimensional scaling and cluster analysis // Classification and Cluster. Moscow: Mir, 1980. P. 20-41.

3. Kruskal J.B., Wish M. Multidimensional scaling // Sage University Paper Series: Quantitative Applications in the Social Sciences. 1978. No. 11.

4. Harman G. Modern Factor Analysis. Moscow: Statistika, 1972. 489 p.

5. Orlov A.I. Notes on the theory of classification // Sociology: Methodology, Methods, Mathematical Models. 1991. No. 2. P. 28-50.

6. Orlov A.I. Basic results of the mathematical theory of classification // Polythematic Network Electronic Scientific Journal of the Kuban State Agrarian University. 2015. No. 110. P. 219-239.

7. Orlov A.I. Mathematical methods of the theory of classification // Polythematic Network Electronic Scientific Journal of the Kuban State Agrarian University. 2014. No. 95. P. 23-45.

8. Terekhina A.Yu. Data Analysis by Multidimensional Scaling Methods. Moscow: Nauka, 1986. 168 p.

9. Perekrest V.T. Nonlinear Typological Analysis of Socio-Economic Information: Mathematical and Computational Methods. Leningrad: Nauka, 1983. 176 p.

10. Tyurin Yu.N., Litvak B.G., Orlov A.I., Satarov G.A., Shmerling D.S. Analysis of Non-Numerical Information. Moscow: Scientific Council of the USSR Academy of Sciences on the Complex Problem "Cybernetics", 1981. 80 p.

11. Orlov A.I. A general view on the statistics of objects of non-numerical nature // Analysis of Non-Numerical Information in Sociological Research. Moscow: Nauka, 1985. P. 58-92.

12. Orlov A.I. Limiting distribution of one estimate of the number of basis functions in regression // Applied Multivariate Statistical Analysis. Scientific Notes on Statistics, vol. 33. Moscow: Nauka, 1978. P. 380-381.

13. Orlov A.I. Estimation of model dimension in regression // Algorithmic and Software Support of Applied Statistical Analysis. Scientific Notes on Statistics, vol. 36. Moscow: Nauka, 1980. P. 92-99.

14. Orlov A.I. Asymptotics of some estimates of model dimension in regression // Applied Statistics. Scientific Notes on Statistics, vol. 45. Moscow: Nauka, 1983. P. 260-265.

15. Orlov A.I. On estimation of the regression polynomial // Zavodskaya Laboratoriya. Diagnostika Materialov. 1994. Vol. 60. No. 5. P. 43-47.

16. Orlov A.I. Some probabilistic questions of the theory of classification // Applied Statistics. Scientific Notes on Statistics, vol. 45. Moscow: Nauka, 1983. P. 166-179.

17. Orlov A.I. On the development of the statistics of nonnumerical objects // Design of Experiments and Data Analysis: New Trends and Results. Moscow: ANTAL, 1993. P. 52-90.

18. Orlov A.I. Dimension reduction methods // Appendix 1 to: Tolstova Yu.N. Fundamentals of Multidimensional Scaling: A Textbook for Universities. Moscow: KDU Publishing House, 2006. 160 p.

19. Orlov A.I. Asymptotics of solutions of extremal statistical problems // Analysis of Non-Numerical Data in System Research. Collection of Works, issue 10. Moscow: All-Union Scientific Research Institute for System Research, 1982. P. 4-12.

20. Orlov A.I. Organizational and Economic Modeling: a textbook in 3 parts. Part 1: Non-Numerical Statistics. Moscow: Bauman MSTU Publishing House, 2009. 541 p.

21. Lutsenko E.V. Automated System-Cognitive Analysis in the Management of Active Objects (a system theory of information and its application in the study of economic, socio-psychological, technological and organizational-technical systems): monograph (scientific edition). Krasnodar: KubGAU, 2002. 605 p. http://elibrary.ru/item.asp?id=18632909

Dimension reduction (Data reduction)

In analytical technologies, data dimensionality reduction is understood as the transformation of data into the form most convenient for analysis and interpretation. It is usually achieved by reducing the volume of the data, the number of features used, and the variety of their values.

The analyzed data are often incomplete, in the sense that they poorly reflect the dependencies and patterns of the business processes under study. The reasons may be an insufficient number of observations or the absence of features that reflect the essential properties of the objects. In this case, data enrichment is applied.

Dimension reduction is applied in the opposite case, when the data are redundant. Redundancy arises when the analysis problem can be solved with the same level of efficiency and accuracy using data of smaller dimension. This makes it possible to reduce the time and computational cost of solving the problem and to make the data and the results of their analysis more interpretable and understandable for the user.

Reducing the number of observations is applied when a solution of comparable quality can be obtained on a smaller sample, thereby cutting computational and time costs. This is especially relevant for algorithms that do not scale well, where even a small reduction in the number of records yields a significant gain in computation time.
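
A minimal sketch of reducing the number of observations by simple random sampling, using pandas; the data set, the sampling fraction, and the column name are illustrative assumptions, not taken from the article.

```python
import pandas as pd

# Illustrative data set: 100,000 synthetic records (an assumption for demonstration).
df = pd.DataFrame({"value": range(100_000)})

# Draw a 10% simple random sample; random_state makes the draw reproducible.
sample = df.sample(frac=0.10, random_state=42)

print(len(df), len(sample))  # 100000 10000
```

Whether such a sample preserves solution quality should, of course, be checked on the task at hand, for example by comparing model accuracy on the full and reduced sets.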

It makes sense to reduce the number of features when the information needed for a high-quality solution of the problem is contained in a subset of the features, so that not all of them have to be used. This is especially true for correlated features. For example, the features "Age" and "Work experience" essentially carry the same information, so one of them can be excluded.
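
As a hedged illustration of the "Age" / "Work experience" example, the sketch below generates two strongly correlated synthetic columns, measures their Pearson correlation, and drops one of them when the correlation exceeds a threshold; the data, the 0.9 threshold, and the column names are assumptions made for demonstration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.integers(25, 65, size=1_000)
# Work experience is roughly age minus ~22 years plus noise, so it is highly correlated with age.
experience = age - 22 + rng.normal(0, 2, size=1_000)

df = pd.DataFrame({"age": age, "experience": experience})

corr = df["age"].corr(df["experience"])  # Pearson correlation coefficient
print(f"correlation = {corr:.3f}")

# If the two features are nearly collinear, keep only one of them (the threshold is illustrative).
if abs(corr) > 0.9:
    df = df.drop(columns=["experience"])

print(df.columns.tolist())
```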

The most effective means of reducing the number of features are factor analysis and principal component analysis.
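
A minimal sketch of principal component analysis with scikit-learn, assuming standardized numeric features; the synthetic data and the choice of three components are assumptions for illustration. The explained_variance_ratio_ attribute also gives a simple way to judge how much information content is lost by the reduction.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# 500 objects described by 10 partially redundant features (illustrative synthetic data):
# 4 independent "base" features plus 6 noisy linear combinations of them.
base = rng.normal(size=(500, 4))
X = np.hstack([base, base @ rng.normal(size=(4, 6)) + 0.05 * rng.normal(size=(500, 6))])

# Principal components are computed on standardized features.
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=3)            # keep three components (an assumed target dimension)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                      # (500, 3)
print(pca.explained_variance_ratio_.sum())  # share of total variance retained
```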

Reducing the diversity of feature values makes sense, for example, when the precision of the data representation is excessive and integer values can be used instead of real values without compromising the quality of the model. At the same time, the amount of memory occupied by the data and the computational costs decrease.
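
A hedged sketch of reducing the diversity of feature values: real-valued measurements are rounded to whole numbers and stored in a narrower integer type, which shrinks memory use; the column, the value range, and the chosen type are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Illustrative column of real-valued measurements stored as 64-bit floats.
df = pd.DataFrame({"height_cm": rng.normal(170, 10, size=100_000)})

before = df["height_cm"].memory_usage(index=False)

# Round to whole centimetres and downcast to 16-bit integers:
# fewer distinct values and roughly a quarter of the original memory for this column.
df["height_cm"] = df["height_cm"].round().astype(np.int16)

after = df["height_cm"].memory_usage(index=False)
print(before, after)
```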

The subset of data obtained as a result of dimensionality reduction should inherit from the original set as much information as is needed to solve the problem with the given accuracy, and the computational and time cost of the reduction itself should not outweigh the benefits obtained from it.

An analytical model built on the reduced data set should be easier to process, implement, and understand than a model built on the original set.

The choice of a dimensionality reduction method is based on a priori knowledge about the problem being solved and the expected results, as well as on limitations on time and computing resources.


