Methodological guide "Statistical Analysis and Data Visualization with R"

Course program

Elements of programming in R

  • Descriptive statistics and visualization
  • For example: what is more important, the average receipt ("average check") or the typical receipt?

Cluster analysis

  • What problem is being solved. Divide a group of objects into subgroups.
  • Task example. Segmentation of sites, identification of similar sites.
  • Studied methods. Hierarchical cluster analysis, k-means method, k-medoid method.

Testing statistical hypotheses

  • What problem is being solved. Compare two groups of objects.
  • Task example. A/B testing of user behavior on different versions of site pages.
  • Studied methods. Proportion test, Student's t-test, Levene's test, Wilcoxon-Mann-Whitney test.

Linear regression analysis

  • Task example. Estimate how much prices for used cars have fallen after the increase in customs duties.
  • Studied methods. Variable selection, collinearity, influential observations, residual analysis. Nonparametric regression (kernel smoothing). Predicting short series with a seasonal component using linear regression

Forecasting

  • What problem is being solved. Build a time series forecast.
  • Task example. Predict site traffic 6 months ahead.
  • Studied method. Exponential smoothing.

Machine Learning (Pattern Recognition)

  • Task example. Recognize the gender and age of each site visitor.
  • Studied methods. The k-nearest neighbors method, classification trees (CART), random forests, gradient boosting machines.

Course grades

Students will complete 14 laboratory assignments. The grade for the course is determined by the following rule:

  • Excellent - all assignments passed;
  • Good - all assignments passed except one;
  • Satisfactory - all assignments passed except two;
  • Unsatisfactory - in all other cases.

A laboratory assignment proceeds as follows:

  • the student is given a data set and a question;
  • the student answers the question, supporting their claims with tables, graphs, and a script written in the R language;
  • the student answers additional questions.

Question example. Suggest parameters that ensure optimal operation of the random forest algorithm when recognizing a brand of wine from the results of chemical analysis.
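For orientation, a minimal sketch of such an analysis in R (the randomForest package is assumed, and `wine` is a hypothetical data frame with a factor column `brand`; the parameter values below are illustrative, not the expected answer):

library(randomForest)
# Hypothetical data: wine brands plus chemical measurements
fit <- randomForest(brand ~ ., data = wine, ntree = 500, mtry = 3)
print(fit)   # the OOB error estimate helps compare ntree/mtry settings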

What you need to know to take the course

It is assumed that course participants have already taken a course in probability theory.

Literature

  • Shipunov, Baldin, Volkova, Korobeinikov, Nazarova, Petrov, Sufiyanov. Visual Statistics: Using R.
  • Mastitsky, Shitikov. Statistical Analysis and Data Visualization with R.
  • Bishop. Pattern Recognition and Machine Learning.
  • James, Witten, Hastie, Tibshirani. An Introduction to Statistical Learning: With Applications in R.
  • Hastie, Tibshirani, Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed.
  • Crawley. The R Book.
  • Kabacoff. R in Action: Data Analysis and Graphics with R.

Teachers

List of lectures

Introduction to R: basic commands. Median, quantiles, and quartiles. Histogram. Bar chart. Pie chart. Scatterplot. Scatterplot matrix. The use of color in charts. Box-and-whisker plots (boxplots). The typical observation of a sample: arithmetic mean, median, or trimmed mean. Choosing a measure of the typical value adequate to the data being analyzed. The lognormal distribution. Outliers and extreme observations.

Hierarchical cluster analysis. Clusters, distances between objects, distances between clusters. The algorithm for constructing a dendrogram. The scree/elbow criterion. Data standardization. Typical mistakes in data preparation. Interpretation of results.

The k-means method. Random number generators and the generator seed. Visualization of the k-means algorithm. Methods for determining the number of clusters. The NbClust library. The scree/elbow criterion. Multidimensional scaling for cluster visualization.

Testing statistical hypotheses. Goodness-of-fit, homogeneity, and independence hypotheses; hypotheses about distribution parameters.

Testing statistical hypotheses. Type I and type II errors, p-value and significance level, the algorithm for testing a statistical hypothesis and interpreting the results. The normality hypothesis. The Shapiro-Wilk and Kolmogorov-Smirnov tests. Minor deviations from normality. Comparison of samples. Independent and paired samples. Choosing between Student's t-test, the Mann-Whitney-Wilcoxon test, and Mood's test. Varieties of Student's t-test and comparison of variances. Visualization in comparisons. One-sided and two-sided tests.

Testing statistical hypotheses. Comparison of samples. Independent and paired samples. Choosing between Student's t-test, the Mann-Whitney-Wilcoxon test, and Mood's test. Varieties of Student's t-test and comparison of variances. Visualization in comparisons. One-sided and two-sided tests. Independence. Pearson, Kendall, and Spearman correlation coefficients; typical errors in studying the relationship between two phenomena. Visual verification of conclusions.

Linear regression analysis. The model, interpretation of coefficient estimates, and the multiple coefficient of determination. Interpretation of the multiple coefficient of determination and restrictions on its scope of application. Identification of the most significant predictors and assessment of each predictor's contribution. Algorithms for correcting the constructed models. Collinearity.

Linear Regression Analysis: Forecasting Short Time Series.

Forecasting based on a regression model with seasonal indicator (dummy, structural) variables. Trend, seasonal components, structural change of the series, outliers. Taking logarithms as a technique for converting multiplicative seasonality into additive. Indicator variables. Overfitting.

Linear regression: residual analysis. Violations of the model constraints of the Gauss-Markov theorem. Residual analysis. Specification error. Multicollinearity, tolerance, and VIF. Checking the constancy of the residual variance. Correcting models when the distribution of residuals deviates from normality. Cook's distance and leverage. The Durbin-Watson statistic. Reducing the number of seasonal adjustments.

Exponential smoothing. The Holt-Winters method. Local trend, local seasonality.

Terminology: Machine Learning, Artificial Intelligence, Data Mining and Pattern Recognition.

The k-nearest neighbors method. Consistency of the method. Lazy learning. Feature selection. Cross-validation; k-fold cross-validation. Overfitting. Training and test sets.

The k-nearest neighbors method: examples. Determining the number of nearest neighbors. Using a contingency table to assess the quality of the method.

CART classification trees. Geometric representation. Representation as a set of logical rules. Tree representation. Nodes, parents and children, terminal nodes. Thresholds. The rpart library. Node impurity measures: Gini, entropy, classification error. Stopping rules for tree growing. The rpart.plot library.

Last time (in November 2014; I am very ashamed that the continuation took so long!) I talked about the basic features of the R language. Despite the presence of all the usual control constructs, such as loops and conditional blocks, the classical approach to data processing based on iteration is far from the best choice, since loops in R are extraordinarily slow. So now I will tell you how to actually work with data so that the calculation process does not force you to drink too many cups of coffee while awaiting the result. In addition, I will spend some time on modern data visualization tools in R, because the convenience of presenting the results of data processing is, in practice, no less important than the results themselves. Let's start simple.

Vector operations

As we remember, the base type in R is not a number at all, but a vector, and the basic arithmetic operations act on vectors element by element:

> x <- 1:6; y <- 11:17
> x + y
[1] 12 14 16 18 20 22 18
> x > 2
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE
> x * y
[1] 11 24 39 56 75 96 17
> x / y
[1] 0.09090909 0.16666667 0.23076923 0.28571429 0.33333333 0.37500000 0.05882353

Everything is quite simple here, but it is natural to ask: what happens when the lengths of the vectors do not match? If we, say, write k <- 2, will x * k correspond to multiplying a vector by a number in the mathematical sense? The short answer is yes. In the more general case, when the lengths of the vectors do not match, the shorter vector is simply recycled, i.e. continued by repetition:

> z <- c(1, 0.5)
> x * z
[1] 1 1 3 2 5 3
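Note that when the longer length is not an exact multiple of the shorter one (as with x of length 6 and y of length 7 above), R still recycles the shorter vector but issues a warning:

> 1:6 + 1:4
[1] 2 4 6 8 6 8
Warning message:
In 1:6 + 1:4 : longer object length is not a multiple of shorter object length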

The same is true for matrices.

> x <- matrix(1:4, 2, 2); y <- matrix(rep(2, 4), 2, 2)
> x * y
     [,1] [,2]
[1,]    2    6
[2,]    4    8
> x / y
     [,1] [,2]
[1,]  0.5  1.5
[2,]  1.0  2.0

In this case, "normal" matrix multiplication, as opposed to element-wise, looks like this:

> x %*% y
     [,1] [,2]
[1,]    8    8
[2,]   12   12

All this, of course, is very good, but what do we do when we need to apply our own functions to the elements of vectors or matrices — that is, how can this be done without a loop? The approach R takes to this problem is very similar to what we are used to in functional languages: it is all reminiscent of the map function in Python or Haskell.
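Base R even ships a Map() wrapper in the same spirit (a small illustration; Map() is essentially mapply() without simplification of the result):

> Map(function(x, y) x + y, 1:3, 4:6)
[[1]]
[1] 5

[[2]]
[1] 7

[[3]]
[1] 9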

The useful lapply function and its friends

The first function in this family is lapply. It applies a given function to each element of a list or vector. The result is always a list, regardless of the type of the argument. The simplest example, using an anonymous (lambda) function:

> q <- lapply(c(1, 2, 4), function(x) x^2)
> q
[[1]]
[1] 1

[[2]]
[1] 4

[[3]]
[1] 16

If the function to be applied to a list or vector requires more than one argument, the extra arguments can be passed through lapply:

> q <- lapply(c(1, 2, 4), function(x, y) x^2 + y, 3)

With a list, the function works in a similar way:

> x <- list(a=rnorm(10), b=1:10)
> lapply(x, mean)

Here, rnorm(10) generates ten draws from the standard normal distribution (mean 0, standard deviation 1), and mean computes the average. The sapply function is exactly like lapply, except that it tries to simplify the result. For example, if each element of the result has length 1, a vector is returned instead of a list:

> sapply(c(1, 2, 4), function(x) x^2)
[1]  1  4 16

If the result is a list of vectors of the same length, the function returns a matrix; if simplification is impossible, it returns an ordinary list, just like lapply.

> x <- list(1:4, 5:8)
> sapply(x, function(x) x^2)
     [,1] [,2]
[1,]    1   25
[2,]    4   36
[3,]    9   49
[4,]   16   64

To work with matrices, it is convenient to use the apply function:

> x <- matrix(rnorm(50), 5, 10)
> apply(x, 2, mean)
> apply(x, 1, sum)

Here we first create a matrix of five rows and ten columns, then compute the column means and then the row sums. To complete the picture, it should be noted that computing sums and means over rows and columns is so common that R provides the special functions rowSums, rowMeans, colSums, and colMeans for the purpose.
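For example, for the matrix x created above, the following pairs are equivalent (the specialized functions are also noticeably faster):

> rowSums(x)    # same as apply(x, 1, sum)
> colMeans(x)   # same as apply(x, 2, mean)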
The apply function can also be used for multidimensional arrays:

> arr <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
> apply(arr, c(1, 2), mean)

The last call can be replaced with a more readable version:

> rowMeans(arr, dims = 2)

Let's move on to the mapply function, which is a multidimensional analogue of lapply . Let's start with a simple example that can be found right in the standard R documentation:

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

As you can see, here we are applying the rep function to a set of parameters that are generated from two sequences. The rep function itself simply repeats the first argument the number of times specified as the second argument. So the previous code is simply equivalent to the following:

> list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))

Sometimes it is necessary to apply a function to some part of an array. This can be done using the tapply function. Let's consider the following example:

> x <- c(rnorm(10, 1), runif(10), rnorm(10, 2))
> f <- gl(3, 10)
> tapply(x, f, mean)

First we create a vector whose parts are drawn from random variables with different distributions; then we generate a factor vector, which is nothing more than ten ones, then ten twos, and the same number of threes. We then compute the mean for each corresponding group. By default, tapply tries to simplify the result; this can be turned off by specifying the simplify=FALSE parameter.

> tapply(x, f, range, simplify=FALSE)

When talking about the apply functions, one usually also mentions the split function, which splits a vector into parts, similarly to tapply. Thus, if we call split(x, f), we get a list of three vectors. So the lapply/split pair works just like tapply with simplify set to FALSE:

> lapply(split(x, f), mean)

The split function is also useful outside of working with vectors: it can also be used to work with data frames. Consider the following example (I borrowed it from Coursera's R Programming course):

> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> s <- split(airquality, airquality$Month)
> lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))

Here we are working with a dataset that contains air-quality information (ozone, solar radiation, wind, temperature in degrees Fahrenheit, month, and day). We can easily report monthly averages using split and lapply, as shown in the code. Using sapply, however, gives a more convenient result:

> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
                5         6          7          8        9
Ozone          NA        NA         NA         NA       NA
Solar.R        NA 190.16667 216.483871         NA 167.4333
Wind     11.62258  10.26667   8.941935   8.793548  10.1800

As you can see, some of the values are undefined (the reserved value NA is used for this). This means that some (at least one) of the values in the Ozone and Solar.R columns were themselves undefined. In this sense, the colMeans function behaves quite correctly: if any values are undefined, the mean is undefined as well. The problem can be solved by making the function ignore NA values with the na.rm=TRUE parameter:

> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm=TRUE))
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000

Why do we need so many functions for such similar tasks? I suspect every second reader will ask this question. All these functions are in fact trying to solve the problem of processing vector data without using loops. It is one thing to achieve high processing speed, and quite another to retain at least some of the flexibility and control that constructs such as loops and conditional statements provide.

Data visualization

The R system is unusually rich in data visualization tools, and here I face a difficult choice: what to talk about when the area is so vast? If in programming there is some basic set of functions without which nothing can be done, in visualization there is a huge number of different tasks, and each (as a rule) can be solved in several ways, each with its pros and cons. Moreover, there are always many options and packages for solving these problems in various ways.
A lot has been written about the standard plotting tools in R, so here I would like to talk about something more interesting: the ggplot2 package, which has become increasingly popular in recent years. Let's talk about it.

To get started with ggplot2, you need to install the library using the install.packages("ggplot2") command. Next, we load it for use:

> library("ggplot2") > head(diamonds) carat cut color clarity depth table price x y z 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 3 0.23 Good E VS1 56.9 65 4.05 4.07 2.31 4 0.29 Premium i vs2 62.4 58 334 4.20 4.23 2.63 5 0.31 Good J Si2 63.3 58 335 4.34 4.35 2.75 6 0.24 Very GOOD J VVS2 62.8 3.94 3.96 2.48> HEAD (MTCARS) MPG cla. Am Gear Carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 WAG 21.0 6 160 110 3.9 2.875 17.02 0 1 4 DATSUN 710 22.8 4 108 93 3.85 2.320 18.61 1 4 1 Hornet 4 Drive 68 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

The diamonds and mtcars data sets ship with the ggplot2 package and are what we will be working with now. The first is clear enough: data on diamonds (clarity, color, price, etc.). The second set contains road-test data (miles per gallon, number of cylinders, ...) for cars from 1973-1974, from the American magazine Motor Trend. More information about the data (such as their dimensions) can be obtained by typing ?diamonds or ?mtcars.
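For instance, a quick look at the size of each set:

> dim(diamonds)
[1] 53940    10
> dim(mtcars)
[1] 32 11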

For visualization, the package provides many functions, of which qplot will be the most important for us now. The ggplot function gives much more control over the process; everything that can be done with qplot can also be done with ggplot. Let's consider a simple example:

> qplot(clarity, data=diamonds, fill=cut, geom="bar")

The same effect can be achieved with the ggplot function:

> ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

However, the qplot call looks simpler. In Fig. 1 you can see the resulting plot: the number of diamonds of each cut quality (cut) as a function of clarity (clarity).

Now let's plot the cars' mileage per unit of fuel against their weight. The resulting scatterplot is shown in Fig. 2.

> qplot(wt, mpg, data=mtcars)

You can also add a color mapping of the quarter-mile acceleration time (qsec):

> qplot(wt, mpg, data=mtcars, color=qsec)

When visualizing, you can also transform data:

> qplot(log(wt), mpg - 10, data=mtcars)

In some cases, a discrete color scale looks more representative than a continuous one. For example, if we want to encode the number of cylinders with color instead of the acceleration time, we need to indicate that the value is discrete (Fig. 3):

> qplot(wt, mpg, data=mtcars, color=factor(cyl))

You can also change the size of the points using, for example, size=3. If you are going to print graphs on a black-and-white printer, it is better not to use colors but to vary the marker shape by the factor instead; this is done by replacing color=factor(cyl) with shape=factor(cyl).
The plot type is specified with the geom parameter; in the case of scatterplots, the value of this parameter is "point".
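A combined sketch of these options (illustrative):

> qplot(wt, mpg, data=mtcars, shape=factor(cyl), size=I(3))

Here I() keeps 3 as a literal point size instead of mapping it as an aesthetic with its own legend.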

Now let's say we just want to build a histogram of the number of cars for each cylinder value:

> qplot(factor(cyl), data=mtcars, geom="bar")
> qplot(factor(cyl), data=mtcars, geom="bar", color=factor(cyl))
> qplot(factor(cyl), data=mtcars, geom="bar", fill=factor(cyl))

The first call simply draws the three bars for the different cylinder values. Note that the first attempt to add color will not give the expected result: the black bars remain black and only acquire a colored outline. The last qplot call, however, produces a beautiful colored chart, as shown in Fig. 4.

A terminological note is in order here. Strictly speaking, what we have just built is not a histogram: a histogram usually refers to a similar display for continuous data. In English, "bar chart" (which is what we just made) and "histogram" are two different concepts (see the relevant Wikipedia articles). Here, with a somewhat heavy heart, I will use the word "histogram" for both, trusting that the nature of the data speaks for itself.

Returning to Fig. 1: ggplot2 provides several useful options for bar positioning (the default value is position="stack"):

> qplot(clarity, data=diamonds, geom="bar", fill=cut, position="dodge") > qplot(clarity, data=diamonds, geom="bar", fill=cut, position="fill") > qplot(clarity, data=diamonds, geom="bar", fill=cut, position="identity")

The first of these options places the bars side by side, as shown in Fig. 5; the second shows the share of diamonds of each cut quality within the total number of diamonds of a given clarity (Fig. 6).

Now consider an example of a real histogram:

> qplot(carat, data=diamonds, geom="histogram", binwidth=0.1)
> qplot(carat, data=diamonds, geom="histogram", binwidth=0.05)

Here the binwidth parameter sets how wide each bin of the histogram is; the histogram shows how many observations fall into each range. The results are shown in Fig. 7 and 8.

Sometimes, when we need to fit a model (linear or, say, polynomial), we can do it right in qplot and see the result. For example, we can plot mpg as a function of weight wt right on top of the scatterplot:

> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"))

By default, local polynomial regression (method="loess") is used as the model. The result will look as shown in Fig. 9, where the dark gray band is the standard error; it is displayed by default and can be turned off with se=FALSE.

If we want to try fitting a linear model to these data, we can do so by simply specifying method=lm (Fig. 10).
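That is, the full call might look like this (a sketch for the ggplot2 version current at the time of writing, which passed method through to the smoothing layer):

> qplot(wt, mpg, data=mtcars, geom=c("point", "smooth"), method="lm")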

And finally, of course, you need to show how to build pie charts:

> t <- ggplot(mtcars, aes(x=factor(1), fill=factor(cyl))) + geom_bar(width=1)
> t + coord_polar(theta="y")

Here we use the more flexible ggplot function. It works like this: first we build a chart showing the proportion of cars with different numbers of cylinders in the overall total (Fig. 11), then we convert the chart to polar coordinates (Fig. 12).

Instead of a conclusion

Now we've got the hang of using R. What's next? Clearly, only the most basic features of ggplot2 and the questions related to vectorization have been covered here. There are a few good books on R worth mentioning, and they are certainly worth consulting more often than the services of a certain corporation of very obsessive kindness. First, there is Norman Matloff's The Art of R Programming. If you already have experience programming in R, The R Inferno by Patrick Burns will come in handy. The classic Software for Data Analysis by John Chambers is also quite appropriate.

As for visualization in R, there is the good R Graphics Cookbook by Winston Chang. The ggplot2 examples in this article were taken from Tutorial: ggplot2. See you in the next article, "Data Analysis and Machine Learning in R"!

"STATISTICAL ANALYSIS AND DATA VISUALIZATION USING R grass roots fruits foliage Heidelberg - London - Togliatti 2014,..."


S.E. Mastitsky, V.K. Shitikov

STATISTICAL ANALYSIS AND

DATA VISUALIZATION WITH R


Heidelberg – London – Togliatti

2014

© Sergey Eduardovich Mastitsky, Vladimir Kirillovich Shitikov

Website: http://r-analytics.blogspot.com

This work is distributed under the Creative Commons license "Attribution – NonCommercial – ShareAlike 4.0 International". Under this license, you may freely copy, distribute, and modify this work, provided that the authors and the source are clearly identified. If you modify this work or build upon it, you may distribute the result only under the same or a similar license. It is forbidden to use this work for commercial purposes without the consent of the authors. For more information about the license, visit www.creativecommons.com

Please cite this book as follows:

Mastitsky S.E., Shitikov V.K. (2014) Statistical Analysis and Data Visualization with R.

E-book, available at:

http://r-analytics.blogspot.com

FOREWORD

1. MAIN COMPONENTS OF THE STATISTICAL ENVIRONMENT R
1.1. History and basic principles of organization of the R environment
1.2. Working with the R command console interface
1.3. Working with the R Commander package menu
1.4. Objects, packages, functions, devices

2. DESCRIPTION OF THE R LANGUAGE
2.1. Data types in R
2.2. Vectors and matrices
2.3. Factors
2.4. Lists and tables
2.5. Importing data into R
2.6. Representation of dates and times; time series
2.7. Organization of calculations: functions, branches, loops
2.8. Vectorized calculations in R using apply functions

3. BASIC GRAPHICAL FEATURES OF R
3.1. Scatterplots with plot() and the parameters of plotting functions
3.2. Histograms, kernel density functions, and the cdplot() function
3.3. Box-and-whisker plots
3.4. Pie and bar charts
3.5. Cleveland dot plots and one-dimensional scatterplots

4. DESCRIPTIVE STATISTICS AND DISTRIBUTION FITTING

FOREWORD

One of the main tools for understanding the world is the processing of data received by a person from various sources. The essence of modern statistical analysis is an interactive process consisting of the study, visualization and interpretation of the flow of incoming information.

The history of the last 50 years is also the history of the development of data analysis technology.

One of the authors fondly recalls the late 1960s and his first paired-correlation program, which was typed with metal pins into the 150-cell "operational field" of a Promin-2 computer weighing more than 200 kg.

Nowadays, high-performance computers and affordable software make it possible to implement the full cycle of the information-processing workflow, which in general consists of the following steps:

° accessing the data to be processed (loading it from various sources and assembling a set of interrelated source tables);

° editing the loaded indicators (replacing or deleting missing values, converting features into a more convenient form);

° annotating the data (so as to remember what each piece of data represents);

° obtaining general information about the structure of the data (computing descriptive statistics to characterize the indicators under analysis);

° graphical presentation of the data and calculation results in a clear, informative form (a picture is indeed sometimes worth a thousand words);

° data modeling (finding dependencies and testing statistical hypotheses);

° presentation of the results (preparing tables and charts of publication quality).

In conditions where dozens of application-program packages are at the user's service, the problem of choice is real (sometimes tragic, if we recall Buridan's ass): which data analysis software should be preferred for one's practical work? The decision usually takes into account the specifics of the problem being solved, the efficiency of setting up the processing algorithms, the cost of purchasing the programs, and the analyst's tastes and personal preferences. At the same time, for example, the template-driven Statistica, with its mechanical set of menu buttons, cannot always satisfy a creative researcher who prefers to control the course of the computational process independently. Commercial computing systems that include high-level command languages, such as Matlab, SPSS, and others, make it possible to combine various types of analysis, access intermediate results, control the style of data display, add one's own software modules, and prepare final reports in the required form. An excellent alternative to them is the free R software environment, a modern and constantly developing general-purpose statistical platform.



Today, R is the undisputed leader among freely distributed statistical analysis systems, as evidenced, for example, by the fact that in 2010 the R system won the annual Bossie Awards open-source software competition in several categories. Leading universities around the world, along with analysts at the largest companies and research centers, constantly use R for scientific and technical calculations and for creating large information projects. The widespread teaching of statistics based on the packages of this environment, and the full support of the scientific community, have made the inclusion of R scripts gradually a universally recognized "standard" both in journal publications and in informal communication among scientists worldwide.

The main obstacle for Russian-speaking users in mastering R is, of course, that almost all documentation for this environment exists in English. Only since 2008, through the efforts of A. Shipunov, E. Baldin, S. Petrov, I. Zaryadov, A. Bukhovets, and other enthusiasts, have methodological manuals and books appeared in Russian (references to them can be found in the bibliography at the end of this book, together with links to educational resources whose authors make their own contribution to promoting R among Russian-speaking users).

This manual summarizes a set of methodological posts published by one of the authors since 2011 in the blog "R: Data Analysis and Visualization" (http://r-analytics.blogspot.com). It seemed expedient to us, for the readers' convenience, to present all this somewhat fragmented material in a concentrated form, and also to expand some sections for completeness.

The first three chapters contain detailed instructions for working with the interactive components of R, a detailed description of the language and basic graphical features of the environment.

This part of the book is quite accessible to beginners in programming, although a reader already familiar with the R language may find interesting code fragments there, or use the provided descriptions of graphical parameters as a reference.

The following chapters (4-8) describe common procedures for data processing and the building of statistical models, illustrated by several dozen examples. These include a short description of the analysis algorithms, the main results obtained, and their possible interpretation. We tried, as far as possible, to avoid the "ritual" turns of phrase characteristic of numerous manuals on applied statistics, the quoting of well-known theorems, and multi-level calculation formulas. The emphasis is placed primarily on practical use, so that readers, guided by what they have read, can analyze their own data and present the results to colleagues.

The sections of this part are arranged in order of increasing complexity of the material presented.

Chapters 4 and 5 are aimed at the reader whose interest in statistics is limited to an initial university course. In chapters 6 and 7, within the framework of the unified theory of general linear models, analysis of variance and regression analysis are presented, along with various algorithms for studying models and identifying their structure. Chapter 8 is devoted to some modern methods of building and analyzing generalized regression models.

Since researchers are constantly interested in spatial analysis and in displaying results on geographic maps and diagrams, Chapter 9 provides some examples of such visualization techniques.

We address our methodological guide to students, graduate students, and both young and established scientists who want to learn to analyze and visualize data using the R environment. We hope that by the end of this guide you will have some understanding of how R works, where to get further information, and how to deal with both simple and fairly complex data analysis tasks.

Files with the R scripts for all chapters of the book, as well as the source-data tables needed to run them, are freely available for download from the GitHub repository https://github.com/ranalytics/r-tutorials, as well as from the website of the Institute of Ecology of the Volga Basin of the Russian Academy of Sciences at http://www.ievbras.ru/ecostat/Kiril/R/Scripts.zip.

It should be noted that the text of this manual is presented in the authors' own edition; therefore, despite our best efforts, it may contain typos, grammatical inaccuracies, and infelicitous phrases. We will be grateful to you, the reader, for reporting these and any other detected shortcomings by e-mail to [email protected] We will also be grateful for any other comments and suggestions regarding this work.


1. MAIN COMPONENTS OF THE STATISTICAL ENVIRONMENT R

1.1. History and basic principles of organization of the R environment

The statistical analysis and data visualization system R consists of the following main parts:

° the high-level programming language R, which allows various operations on objects, vectors, matrices, lists, etc. to be implemented in a single line;

° a large set of data processing functions collected into separate packages;

° an advanced support system, including updates of environment components, online help, and various educational resources intended both for initial learning of R and for subsequent consultation on difficulties that arise.

The journey began in 1993, when two young New Zealand scientists, Ross Ihaka and Robert Gentleman, announced their new development, which they called R. They took as their basis the programming language of the advanced commercial statistical data processing system S-PLUS and created a free open-source implementation, which differs from its progenitor in its easily extensible modular architecture. Soon there arose a distributed system for storing and distributing R packages, known by the acronym CRAN (Comprehensive R Archive Network, http://cran.r-project.org), whose main organizing idea is constant expansion, collective testing, and prompt dissemination of applied data-processing tools.

It turned out that this product of the continuous, well-coordinated efforts of a powerful "collective intelligence" of thousands of selfless developers is much more effective than commercial statistical programs, whose licenses can cost several thousand dollars. Since R is the favorite language of professional statisticians, all recent achievements of statistical science quickly become available to R users worldwide in the form of add-on libraries. No commercial statistical analysis system is developing as rapidly today. R has a large army of users who report bugs to the authors of the add-on libraries and of the R system itself, and these are promptly fixed.

The R computing language, although it requires some effort to master, as well as remarkable search skills and an encyclopedic memory, allows you to quickly perform calculations that are almost "as inexhaustible as the atom" in their diversity. As of July 2014, enthusiasts around the world had written 6,739 add-on libraries for R, containing 137,506 functions (see http://www.rdocumentation.org), which significantly expand the basic capabilities of the system. It is very difficult to imagine a class of statistical methods not yet implemented as an R package, including, of course, the entire "gentleman's set": linear and generalized linear models, nonlinear regression models, design of experiments, time series analysis, classical parametric and nonparametric tests, Bayesian statistics, cluster analysis, and smoothing methods. With the help of powerful visualization tools, the results of analysis can be summarized in various graphs and charts. In addition to traditional statistics, the developed functionality includes a large set of algorithms for numerical mathematics, optimization methods, the solution of differential equations, pattern recognition, and so on. R is used by geneticists and sociologists, linguists and psychologists, chemists and physicians, specialists in GIS and Web technologies.

The "proprietary" documentation on R is very voluminous and not always sensibly written (according to the strange tradition of English-language literature, too many words are spent on describing trivial truths, while important points are skimmed over in a tongue twister). However, in addition to this, the world's leading publishers (Springer, Cambridge University Press and Chapman & Hall / CRC) or simply individual groups of enthusiasts have published a huge number of books describing various aspects of data analysis in R (see, for example, the list of references on the website "Encyclopedia of Psychodiagnostics", http://psylab.info/R: Literature). In addition, there are several active international and Russian R user forums where anyone can ask for help with a problem. In the list of references, we provide a couple of hundred books and Internet links that we advise you to refer to. Special attention while studying R.

Practical training in R consists of (a) mastering the constructs of the R language and becoming familiar with the specifics of calling the functions that perform data analysis, and (b) acquiring skills in working with programs that implement specific methods of data analysis and visualization.

The choice of R user interface tools is not clear-cut and depends strongly on users' tastes. There is no consensus even among authoritative experts.

Some believe that there is nothing better than the standard R console interface. Others think that for convenient work it is worth installing one of the available integrated development environments (IDEs) with a rich set of button menus; for example, an excellent option is the free RStudio IDE. Below we focus on the console version and on working with R Commander, but in further searches the reader may be helped by the overview of various IDEs presented in the appendix to the book by Shipunov et al. (2014).

One of the R experts, Joseph Rickert, believes that the process of learning R can be divided into the following stages (for more details, see his article on inside-r.org):

1. Familiarity with the general culture of the R community and the programming environment in which the R language was developed and operates. Visiting the main and auxiliary resources and mastering a good introductory textbook. Installing R on the user's computer and executing the first test scripts.

2. Reading data from standard operating system files and confidently using R-functions to perform a limited set of statistical analysis procedures familiar to the user.

3. Using the basic structures of the R language to write simple programs. Writing one's own functions. Familiarity with the data structures R can work with and with more advanced features of the language. Working with databases, web pages, and external data sources.

4. Writing complex programs in the R language. Independent development and a deep understanding of the structure of objects of the so-called S3 and S4 classes.

5. Development of professional programs in the R language. Independent creation of additional library modules for R.

Most casual R users stop at stage 3, because the knowledge gained by then is quite sufficient for them to perform statistical tasks within their main professional field.

It is at roughly that level of depth that we describe the R language within the framework of this guide.

Installing and configuring the base R statistical environment is very easy. As of July 2014, the current version is R 3.1.1 for 32- and 64-bit Windows (distributions for all other common operating systems are also available). You can download the distribution of the system, together with a basic set of 29 packages (54 megabytes), completely free of charge from the main project site http://cran.r-project.org or from the Russian mirror http://cran.gis-lab.info. Installing the system from the downloaded distribution causes no difficulties and requires no special comments.

For the convenience of storing scripts, source data, and calculation results, it is worth setting aside a special working directory on the user's computer. It is highly undesirable to use Cyrillic characters in the name of the working directory.

It is advisable to set the path to the working directory and some other options by editing, with any text editor, the system file Rprofile.site (for example, C:\Program Files\R\R-3.1.1\etc\Rprofile.site; it may be located elsewhere on your computer). In the example below, the modified lines are marked in green. In addition to specifying the working directory, these lines define a link to the Russian source for downloading R packages and the automatic start of R Commander.

Listing of the Rprofile.site file:

# Anything after the "#" comment character is ignored by the environment
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")
# Set the type of help display
# options(help_type="text")
options(help_type="html")
# Set the location of the local library
# .Library.site <- file.path(chartr("\\", "/", R.home()), "site-library")
# Start the R Commander menu when the environment loads
# Prefix with "#" signs if Rcmdr is not needed
local({
  old <- getOption("defaultPackages")
  options(defaultPackages = c(old, "Rcmdr"))
})
# Define the CRAN mirror
local({
  r <- getOption("repos")
  r["CRAN"] <- "http://cran.gis-lab.info"
  options(repos = r)
})
# Define the path to the working directory (any other on your computer)
setwd("D:/R/Process/Resampling")

As far as a "good introductory textbook" is concerned, any of our recommendations will be subjective. Nevertheless, the officially recognized introduction to R by W. Venables and D. Smith (Venables, Smith, 2014) and the book by R. Kabacoff (Kabacoff, 2011) should be mentioned, partly because Russian translations of them exist. We should also note the traditional "instruction for dummies" (Meys, Vries, 2012) and the manual by Lam (2010), written with enviable Dutch pedantry. Of the Russian-language introductory courses, the most complete are the books by I. Zaryadov (2010a) and A. Shipunov et al. (2014).

1.2. Working with the R console interface

The R statistical environment executes any set of meaningful R language instructions contained in a script file or entered as a sequence of commands from the console. Working with the console can be difficult for modern users accustomed to button menus, because the syntax of individual commands has to be memorized. However, once some skill has been acquired, it turns out that many data processing procedures can be performed faster and with less effort than, say, in the same Statistica package.

The R console is a dialog window in which the user enters commands and sees the results of their execution. This window appears as soon as the environment is started (for example, after clicking the R shortcut on the desktop). In addition, the standard R graphical user interface (RGui) includes a script editor window and pop-up windows with graphical information (figures, diagrams, etc.).

In command mode, R can work, for example, like a normal calculator:

To the right of the prompt character, the user can enter an arbitrary arithmetic expression, press the Enter key, and immediately get the result.

For example, in the second command in the figure above, we used the factorial and sine functions, as well as the built-in number π. The results obtained in text form can be selected with the mouse and copied via the clipboard into any text file of the operating system (for example, a Word document).
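For instance, a console session of this kind might look as follows (illustrative input and output):

> 2 + 3 * 4
[1] 14
> factorial(5) * sin(pi/6)
[1] 60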

When working with RGui, we recommend always creating a file with a script (that is, a sequence of R commands that perform certain actions). As a rule, this is a plain text file with any name (for definiteness, preferably with the *.r extension), which can be created and edited with an ordinary editor such as Notepad. If this file exists, it is best placed in the working directory; then, after starting R and choosing the menu item "File → Open Script", the contents of the file appear in the "R Editor" window. You can execute the whole sequence of script commands via the menu item "Edit → Run All".

You can also select with the mouse any meaningful fragment of the prepared script (from a single variable name to the entire content) and run that block. This can be done in four ways: from the main menu, from the context menu, with the Ctrl+R key combination, or with a toolbar button.

In the figure shown, the following actions were performed:

° the R object gadm was downloaded from the free Internet resource Global Administrative Areas (GADM), with data on the territorial division of the Republic of Belarus;

° the Latinized names of cities were replaced by their commonly used equivalents;

° using the spplot() function of the sp package, an administrative map of the republic was displayed in the graphics window; it can be copied to the clipboard via the menu or saved as a standard meta- or raster graphics file.

We will consider the meaning of the individual operators in more detail in the following sections; here we only note that by selecting the fragment gadm@data in the script and running it, we obtain in the console window the entire data table of the object, while running gadm@data$NAME_1 gives us the list of administrative-center names before and after modification.
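A minimal sketch of the script described above (the download address and the replacement names are hypothetical placeholders; GADM distributes ready-made .RData files per country):

# Load the S4 object "gadm" (illustrative URL):
con <- url("http://gadm.org/data/rda/BLR_adm1.RData")  # hypothetical address
load(con); close(con)
# Replace the Latinized region names with common equivalents (illustrative):
gadm@data$NAME_1 <- as.factor(c("Brest", "Vitebsk", "Gomel",
                                "Grodno", "Minsk", "Mogilev"))
library(sp)
spplot(gadm, "NAME_1")  # administrative map of the republic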

Thus, the R Editor makes it easy to navigate through a script, edit and execute any combination of commands, and search for and replace parts of the code. The RStudio IDE mentioned above additionally offers syntax highlighting, automatic code completion, "packaging" of command sequences into functions for later use, work with Sweave or TeX documents, and other operations useful to the advanced user.

R has extensive built-in help materials that can be accessed directly from RGui.

Issuing the help.start() command from the console opens a page in your Internet browser that provides access to all help resources: basic manuals, authors' materials, answers to frequent questions, lists of changes, links to help on other R objects, etc.:

Help on individual functions can be obtained using the following commands:

° help("foo") or? foo - help on function foo (quotes are optional);

° help.search("foo") or ?? foo - search for all help files containing foo;

° example("foo") – examples of using the foo function;

° RSiteSearch("foo") – search for links in online manuals and mailing list archives;

° apropos("foo", mode="function") – list of all functions with combination foo;

° vignette("foo") - A list of manuals on the topic foo.

1.3. Working with the R Commander package menu

A convenient tool for a novice user mastering calculations in R is R Commander, a platform-independent button-menu GUI implemented in the Rcmdr package. It allows a large set of statistical analysis procedures to be carried out without prior study of the command-language functions, while unwittingly encouraging that study, since it displays every executed instruction in a special window.

You can install Rcmdr, like any other extension, from the R console menu "Packages → Install package(s)", but it is better to run the command:

install.packages("Rcmdr", dependencies=TRUE)

where enabling the dependencies option guarantees installation of the full set of other packages that may be required when processing data through the Rcmdr menu.

R Commander is launched when the Rcmdr package is loaded, via the "Packages → Load package" menu or the library(Rcmdr) command. If for some reason you decide to analyze data exclusively in R Commander, then to load this graphical shell automatically at R startup you need to edit the Rprofile.site file as shown in Section 1.1.

We will consider work in R Commander using an example of correlation analysis of data on the infection of the bivalve mollusk Dreissena polymorpha with the ciliate Conchophthirus acuminatus in three lakes of Belarus (Mastitsky S.E. // BioInvasions Records. 2012. V. 1. P. 161–169). In the table of initial data, which we download from the figshare website, we will be interested in two variables: the length of the mollusk shell (ZMlength, mm) and the number of ciliates found in the mollusk (CAnumber). This example is discussed in detail in Chapters 4 and 5, so here we will not dwell on the meaning of the analysis but focus on the technique of working with Rcmdr.

Next, we specify the data loading mode and the Internet link address in pop-up windows. It is easy to see that we could just as well have loaded the same data from a local text file, an Excel workbook, or a database table. To make sure the data loaded correctly (or to edit them if necessary), click the "View data" button.

[Figures: the data-organization dialog window and a fragment of the loaded table]

At the second stage, in the "Statistics" menu, select "Correlation test":

We select a pair of correlated variables and obtain in the output window the Pearson correlation coefficient (r = 0.467), the achieved level of statistical significance (p-value < 2.2e-16), and the 95% confidence limits.


The results obtained can be easily copied from the output window via the clipboard.

Now let us obtain a graphical image of the correlation dependence. We select a scatterplot of CAnumber versus ZMlength and supply it with marginal range plots, a linear least-squares trend line (green), and a local-regression smoothing line (red) with its confidence region (dotted lines). For each of the three lakes (the Lake variable), the experimental points are drawn with different symbols.


[Figure: the graph copied from the R Commander graphics window]

As the equivalent of all the button presses in the R Commander menus, the corresponding R instructions appear in the script window.

In our case, they look like this:

Clams <- read.table("http://figshare.com/media/download/98923/97987",
   header=TRUE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)
cor.test(Clams$CAnumber, Clams$ZMlength,
   alternative="two.sided", method="pearson")
scatterplot(CAnumber ~ ZMlength | Lake, reg.line=lm, smooth=TRUE,
   spread=TRUE, boxplots="xy", span=0.5, ylab="Number of ciliates",
   xlab="Shell length", by.groups=FALSE, data=Clams)

The script itself, the output results, or both can be saved to files and reproduced at any time. The same result can be obtained without launching R Commander by loading the saved file through the R console.

By and large, without knowing the R language constructs (or simply not wanting to burden one's memory with them), using Rcmdr one can process data with almost all the basic statistical methods. It offers parametric and non-parametric tests, methods for fitting various continuous and discrete distributions, analysis of multivariate contingency tables, univariate and multivariate analysis of variance, principal component analysis and clustering, various forms of generalized regression models, and so on. The well-developed apparatus for analyzing and testing the resulting models is worthy of careful study.

A detailed description of the technique of working with R Commander, as well as the implementation of data processing algorithms, can be found in the manuals (Larson-Hall, 2009; Karp, 2014).

However, just as sign language cannot replace human communication in a natural language, knowledge of the R language itself greatly expands the boundaries of the user's capabilities and makes interaction with the R environment pleasant and exciting. Here the automatic generation of scripts in R Commander can be an excellent way for the reader to become acquainted with the operators of the R language and learn the specifics of calling individual functions. We devote the following chapters of the guide to data processing procedures at the level of language constructs only.

1.4. Objects, packages, functions, devices

The R language belongs to the family of so-called high-level object-oriented programming languages. For a non-specialist, a strict definition of the concept of an "object" is rather abstract. For simplicity, however, everything created in the process of working with R may be called an object.

There are two main types of objects:

1. Objects intended for data storage ("data objects") are individual variables, vectors, matrices and arrays, lists, factors, data tables;

2. Functions ("function objects") are named programs designed to create new objects or perform certain actions on them.

Objects of the R environment intended for collective and free use are bundled into packages, united by related subject matter or data processing methods. There is a certain difference between the terms "package" and "library". The term "library" refers to a directory that may contain one or more packages. The term "package" refers to a collection of functions, HTML manual pages, and sample data objects intended for testing or learning.

Packages are installed into a specific directory of the operating system or, in uninstalled form, can be stored and distributed as *.zip archive files for Windows (the package version must match your specific version of R).
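For example, you can check where these directories are on a particular computer (the output below is illustrative and will differ from machine to machine):

> .libPaths()    # the "libraries": directories where packages are installed
[1] "C:/Program Files/R/R-3.1.1/library"
> library()      # lists the packages available in those libraries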

Full information about the package (version, main topic, authors, dates of changes, licenses, other functionally related packages, a complete list of functions with indication of their purpose, etc.) can be obtained by the command

library(help=package_name), for example:

library(help=Matrix)

All R packages fall into one of three categories: base, recommended, and other packages installed by the user.

You can get a list of them on a specific computer by issuing the library() command or:

installed.packages(priority = "base")
installed.packages(priority = "recommended")
# Get the complete list:
packlist <- rownames(installed.packages())
# Send the information to the clipboard in Excel-compatible format:
write.table(packlist, "clipboard", sep="\t", col.names=NA)

Base and recommended packages are usually included in the R installation file.

Of course, there is no need to install many different packages "in reserve" right away.

To install a package, it is enough to select the "Packages → Install package(s)" menu item in the R Console window or to enter, for example, the command:

install.packages(c("vegan", "xlsReadWrite", "car"))

Packages can be downloaded, for example, from the Russian mirror http://cran.gis-lab.info, for which it is convenient to edit the Rprofile.site file as shown in Section 1.1.

Another option for installing packages is to go to the site http://cran.gis-lab.info/web/packages, select the desired package in the form of a zip file and download it to the selected folder on your computer.

In this case, you can first review all the information about the package, in particular the description of the functions included in it, and decide how much you need it. Then execute the menu item "Packages → Install package(s) from local zip files".

When the RGui console starts, only some of the base packages are loaded. To initialize any other package before using its functions directly, you need to enter the command library(package_name).

You can determine which packages are loaded at each moment of the ongoing session by issuing the command:

sessionInfo()
R version 2.13.2 (2011-09-30)
Platform: i386-pc-mingw32/i386 (32-bit)

other attached packages:
[1] vegan_2.0-2   permute_0.6-3

loaded via a namespace (and not attached):
[1] grid_2.13.2     lattice_0.19-33 tools_2.13.2

The following table lists (perhaps not exhaustively) the packages that were used in the scripts presented in this book:

"Base" packages:
° base – basic R language constructs
° compiler – the R bytecode compiler
° datasets – a set of data tables for testing and demonstrating functions
° graphics – basic graphics functions
° grDevices – graphics device drivers, color palettes, fonts
° grid – functions for creating graphic layers
° methods – components of object-oriented programming (classes, methods)
° splines – functions for working with regression splines of various types
° stats – basic statistical analysis functions
° stats4 – statistical functions based on S4 class methods
° tcltk – user interface components (menus, selection boxes, etc.)
° tools – information support, administration, and documentation
° utils – various utilities for debugging, I/O, archiving, and so on

"Recommended" packages:
° boot – various bootstrap and jackknife routines
° class – various non-hierarchical classification and recognition algorithms
° cluster – partitioning and hierarchical clustering algorithms
° codetools – analysis and verification of R code
° foreign – reading and writing files in various formats (DBF, SPSS, DTA, Stata)
° KernSmooth – kernel smoothing functions
° lattice – graphics functions of extended functionality (Sarkar, 2008)
° MASS – a set of data and statistical functions (Venables, Ripley, 2002)
° Matrix – matrix and vector operations
° mgcv – generalized additive and mixed-effects models
° nlme – linear and nonlinear mixed-effects models
° nnet – feed-forward neural networks
° rpart – construction of classification and regression trees
° spatial – kriging functions and analysis of the spatial distribution of points
° survival – survival analysis (the Cox model, etc.)

Packages installed in the course of the work:
° adegenet – algorithms for analyzing genetic distances
° arm – analysis of regression models, a companion to the book (Gelman, Hill, 2007)
° car – procedures related to applied regression analysis
° corrplot – graphical display of correlation matrices
° fitdistrplus – fitting of statistical distributions
° FWDselect, packfor – selection of a set of informative variables in regression models
° gamair – datasets for testing additive models
° geosphere – estimation of geographic distances
° ggplot2 – an advanced graphics package with high functionality
° DAAG – data analysis and graphics functions for the book (Maindonald, Braun, 2010)
° Hmisc – Harrell's set of functions
° HSAUR2 – companion to the book (Everitt, Hothorn, 2010)
° ISwR – introductory statistical analysis with R
° jpeg – working with JPEG graphics files
° lars – special types of regression (LARS, lasso, etc.)
° lavaan – confirmatory analysis and structural equation models
° lmodel2 – implementation of type I and type II regression models (MA, SMA, RMA)
° maptools – tools for geographic maps
° mice – procedures for analyzing and imputing missing values
° moments – functions for calculating sample moments
° nortest – tests of the normal distribution hypothesis
° outliers – analysis of outliers in data
° pastecs – analysis of spatial and time series in ecology
° pls – regression on principal components
° pwr – estimation of the statistical power of hypotheses
° reshape – flexible transformation of data tables
° robustbase – robust methods for building regression models
° rootSolve – finding the roots of functions
° scales – selection of color scales
° sem – structural equation models
° semPlot – visualization of structural relationships
° sm – density estimation and smoothing methods
° sp – classes and methods for spatial data
° spatstat – methods of spatial statistics, model fitting
° spdep – spatial dependencies: geostatistical methods and modeling
° stargazer – display of information about statistical models in various formats
° vcd – visualization of categorical data
° vegan – calculations in community ecology (measures of similarity, diversity, and nestedness; ordination and multivariate analysis)

If we try to load a package that is not yet installed in R, or try to use the functions of a package that has not yet been loaded, we will receive system messages:

sem(model, data = PoliticalDemocracy)
Error: could not find function "sem"
library(lavaan)
Error in library(lavaan) : there is no package called 'lavaan'

The script below takes a list of packages from the user and determines which of them merely need to be loaded and which must first be installed. Understanding how the script works requires knowledge of the R language constructs described in the next section, but the interested reader can return to these commands later.

instant_pkgs <- function(pkgs) {
  # Install packages that are not yet installed:
  pkgs_miss <- pkgs[!(pkgs %in% installed.packages()[, "Package"])]
  if (length(pkgs_miss) > 0) {
    install.packages(pkgs_miss)
  }
  # Attach packages that are not yet loaded:
  attached <- search()
  attached_pkgs <- attached[grepl("package", attached)]
  need_to_attach <- pkgs[!(pkgs %in% gsub("package:", "", attached_pkgs))]
  if (length(need_to_attach) > 0) {
    for (i in 1:length(need_to_attach))
      require(need_to_attach[i], character.only = TRUE)
  }
}
# Call example:

instant_pkgs(c("base", "jpeg", "vegan"))

You can get a list of the functions of each package, for example, by issuing the command:

ls(pos = "package:vegan") Note: ls() is a general purpose function for listing objects in a given environment. The command above sets the vegan package as such an environment. If this command is issued without parameters, we will get a list of objects created during the current session.

You can get the list of arguments (formal parameters) of any function in a loaded package by issuing the args() command.

For example, for the lm() function, which fits linear models and which we will use extensively later, the following parameters are defined:

args(lm)
function (formula, data, subset, weights, na.action, method = "qr",
    model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
    contrasts = NULL, offset, ...)

If you enter a command consisting only of the name of a function without parentheses (for example, IQR, which calculates the interquartile range), you get the source code of the function:

IQR
function (x, na.rm = FALSE)
diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE))

An advanced user can modify this code and "redirect" the standard function call to his or her own version.
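For example, here is a minimal sketch (not part of the original text) of such a redefinition: a copy of IQR that removes missing values by default and, being created in the global environment, masks the standard function:

# A user-defined IQR that defaults to na.rm = TRUE; defined in the
# global environment, it masks stats::IQR for subsequent calls.
IQR <- function(x, na.rm = TRUE)
    diff(quantile(as.numeric(x), c(0.25, 0.75), na.rm = na.rm, names = FALSE))

IQR(c(1, 2, NA, 10))   # works even though the vector contains NA
rm(IQR)                # delete the copy to restore the standard function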

However, if we try to look in the same way at the code of the predict() function, which calculates the predicted values of a linear model, we get:

predict
function (object, ...)
UseMethod("predict")

In this case predict() is a generic ("universal") function: depending on which model object is supplied as input (lm for linear regression, glm for Poisson or logistic regression, lme for a mixed effects model, etc.), the corresponding method for obtaining the predicted values is dispatched.

In particular, the following methods are implemented for this function:

methods("predict") predict.ar* predict.Arima* predict.arima0* predict.glm predict.HoltWinters* predict.lm predict.loess* predict.mlm predict.nls* predict.poly predict.ppr* predict.prcomp* predict.princomp* predict.smooth.spline* predict.smooth.spline.fit* predict.StructTS* Non-visible functions are asterisked In S3 style, a method is actually a function that is called by another generic function, such as print() , plot() , or summary() , depending on the class of the object supplied as input. At the same time, the class attribute is responsible for "object orientation", which ensures correct dispatching and calling the necessary method for this object. So the "function-method" for obtaining the predicted values ​​of the generalized linear model will have a call to predict.glm(), when smoothing with splines - predict.smooth.spline(), etc. Detailed information the S3 OOP model can be found in the S3Methods help section, and the more advanced S4 OOP model in the Methods section.

Finally, let us look at some simple ways of saving the results obtained during an R session (a short usage sketch follows the list):

° sink(file=filename) – outputs the results of the execution of subsequent commands to a file with the given name in real time; to stop the redirection, execute sink() without parameters;

° save(file=filename, list of objects to be saved) – saves the specified objects in a binary XDR-format file that can be used on any operating system;

° load(file=filename) – restores the saved objects in the current environment;

° save.image(file=filename) – saves all objects created in the course of the work in an R-specific rda file.
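The following minimal sketch (not from the original text; the file names are arbitrary) shows how these commands work together:

sink("session_log.txt")        # redirect subsequent output to a file
summary(rnorm(100))            # this summary goes to session_log.txt
sink()                         # stop the redirection

x <- 1:10; y <- x^2
save(x, y, file = "xy.rda")    # store selected objects in XDR binary format
rm(x, y)                       # remove them from the workspace
load("xy.rda")                 # ...and restore them from the file

save.image(file = "all.rda")   # store every object of the current session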

An example of transferring the generated table with data to the clipboard in a format compatible with the structure of an Excel sheet was given above in this section. Chapter 6 will show an example of transferring data from a linear model object to a Word file.

The R environment can generate a pixel image of the required quality for almost any display resolution or print device, as well as save the resulting graphic windows in files of various formats. There is a driver function for each graphics output device: you can type help(Devices) for a complete list of drivers.

The most commonly used graphics devices are:

° windows() – a graphics window under Windows (screen, printer, or metafile);

° png(), jpeg(), bmp(), tiff() - output to a bitmap file of the corresponding format;

° pdf(), postscript() – output of graphic information to a PDF or PostScript file.

When you have finished working with an output device, disable its driver with the dev.off() command. Several graphics output devices can be active simultaneously, with switching between them: see, for example, the corresponding section in the book by Shipunov et al. (2012, p. 278).
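As an illustration (a sketch; the file name and dimensions are arbitrary), a plot can be written to a PNG file as follows:

png("scatter.png", width = 800, height = 600)   # open the bitmap device
plot(cars$speed, cars$dist,                      # cars is a built-in dataset
     xlab = "Speed", ylab = "Stopping distance")
dev.off()   # close the device so that the file is written completely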

2. DESCRIPTION OF THE R LANGUAGE

2.1. R data types

All data objects (and hence variables) in R can be divided into the following classes (object types):

° numeric – objects that include integers (integer) and real numbers (double);

° logical - logical objects that take only two values: FALSE (abbreviated as F) and TRUE (T);

° character – character objects (values of variables are specified in double or single quotes).

In R, names of various objects (functions or variables) can be formed from both Latin and Cyrillic letters, but keep in mind that "а" (Cyrillic) and "a" (Latin) are two different names. In addition, the R environment is case-sensitive, i.e. lowercase and uppercase letters are distinguished. Variable names (identifiers) in R must begin with a letter (or a dot) and consist of letters, digits, dots, and underscores.
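A few examples of valid and invalid names (a small illustration, not from the original text):

my.var_1 <- 5     # valid: letters, digits, a dot and an underscore
.hidden <- 1      # valid: a name may start with a dot
# 2nd.var <- 3    # error: a name must not start with a digit
X <- 10; x <- 20  # X and x are two different objects (case matters)
X == x            # FALSE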

Using the ?name command, you can check whether a variable or function with the given name exists.

Whether an object belongs to a particular class can be checked with the functions is.numeric(object_name), is.integer(name), is.logical(name), is.character(name); to convert an object to another type, use the functions as.numeric(name), as.integer(name), as.logical(name), as.character(name).
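A short illustration of these checks and conversions:

x <- "3.14"
is.character(x)     # TRUE
y <- as.numeric(x)  # conversion of a string to a number
is.numeric(y)       # TRUE
as.integer(TRUE)    # 1: a logical value converts to an integer
as.character(42)    # "42"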

There are a number of special objects in R:

° Inf - positive or negative infinity (usually the result of dividing a real number by 0);

° NA - "missing value" (Not Available);

° NaN - "not a number" (Not a Number).

You can check whether a variable contains one of these special values with the functions is.finite(name), is.na(name), and is.nan(name).
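A short illustration of how these values arise and how to detect them:

1/0                       # Inf
-1/0                      # -Inf
0/0                       # NaN
x <- c(1, NA, 3)
is.na(x)                  # FALSE TRUE FALSE
is.nan(0/0)               # TRUE
is.finite(c(1, Inf, NA))  # TRUE FALSE FALSE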

An R expression is a combination of elements such as an assignment operator, arithmetic or logical operators, object names, and function names. The result of executing an expression, as a rule, is immediately displayed in the command or graphics window. However, when an assignment operation is performed, the result is stored in the corresponding object and is not displayed on the screen.

As an assignment operator in R, you can use either the symbol "=" or the two-character arrows "<-" (assigning the value to the object on the left) and "->" (assigning the value to the object on the right). Using "<-" is considered good programming style.
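A small illustration:

x <- 10   # the preferred assignment style
10 -> y   # assigns the value to the object on the right
z = 10    # also legal, but "<-" is considered better style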

R expressions are organized in a script by lines. You can enter several commands on one line, separating them with the “;“ symbol. One command can also be placed on two (or more) lines.

Objects of type numeric can form expressions using the traditional arithmetic operations: + (addition), - (subtraction), * (multiplication), / (division), ^ (exponentiation), %/% (integer division), %% (remainder of division). The operations have the usual precedence: exponentiation is performed first, then multiplication or division, then addition or subtraction. Parentheses may be used in expressions, and the operations inside them have the highest precedence.
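A few examples:

7 %/% 2      # integer division: 3
7 %% 2       # remainder of division: 1
2 + 3 * 4    # 14: multiplication before addition
(2 + 3) * 4  # 20: parentheses take precedence
2^10         # 1024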

Boolean expressions can be composed using the following logical operators:

° "Equal" == ° "Not equal" != ° "Less than" ° "Greater than" ° "Less than or equal" = ° "Greater than or equal" = ° "Logical AND" & ° "Logical OR" | ° "Logical NOT" !
