**One. Factors and levels**

1、A simple, direct understanding of factors and levels

A factor can be thought of as a vector that carries extra information, that is, factor = vector + levels (although internally the two are implemented differently). The levels are the **distinct values** that occur in the vector. Take the following code as an example:

    > x <- c(5, 12, 13, 12)
    > x
    [1]  5 12 13 12
    > xf <- factor(x)
    > xf
    [1] 5  12 13 12
    Levels: 5 12 13

Note, however, that the length of a factor is defined as the length of the data, not the number of levels.

    > length(xf)
    [1] 4
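
To make "vector + levels" more concrete, the short sketch below inspects the parts of a factor with base R's levels(), nlevels() and as.integer(). It only reuses the x from the example above; the variable names lv, codes and orig are introduced here for illustration.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)

lv <- levels(xf)         # the distinct values, stored as character strings
codes <- as.integer(xf)  # internal integer codes: positions within levels(xf)

print(lv)
print(codes)
print(nlevels(xf))       # number of levels
print(length(xf))        # length of the data

# Recover the original numbers from the labels, not from the codes
orig <- as.numeric(as.character(xf))
print(orig)
```

Note that as.integer(xf) returns the codes rather than the original values, which is why the conversion back goes through as.character().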

2、Adding, deleting, modifying and querying factors (**adding**)

To add a level to a factor, the level has to be declared in advance; unlike with a matrix or a list, you cannot simply assign a new value.

    > x <- c(5,12,13,12)
    > xf <- factor(x)
    > xff <- factor(x, levels = c(5, 12, 13, 88))
    > xff
    [1] 5  12 13 12
    Levels: 5 12 13 88

For example, assigning a value that is not among the declared levels produces an invalid-level warning:

    > xff[3] <- 6
    Warning message:
    In `[<-.factor`(`*tmp*`, 3, value = 6) :
      invalid factor level, NA generated
    > xff
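
If the factor already exists, one possible way to add a level afterwards is to extend levels() before assigning. This is a minimal sketch using base R only; the value 88 simply mirrors the example above.

```r
x <- c(5, 12, 13, 12)
xf <- factor(x)

# Declare the new level first ...
levels(xf) <- c(levels(xf), "88")

# ... after which assigning it is legal and produces no warning
xf[3] <- "88"
print(xf)
print(levels(xf))
```

The data length is unchanged; only the set of allowed values grows.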

**Two. Common functions for factors**

1、The tapply() function

A typical call is tapply(x, f, g), where x is a vector, f is a factor or a list of factors, and g() is the function to be applied to x.

How tapply() executes: it first splits x into groups according to the factor f, which yields several subvectors. It then applies g() to each subvector and returns the results arranged by group, as a vector or matrix (more generally, an array).

    > ages <- c(25,26,55,37,21,42)
    > affils <- c("R", "D", "D", "R", "U", "D")
    > tapply(ages, affils, mean)
     D  R  U 
    41 31 21 

This example computes the average age for each party affiliation (Democrat, Republican, and unaffiliated).
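
The two steps described above (group first, then apply) can be reproduced by hand with split() and sapply(). This is only an illustrative sketch of the idea, not a description of how tapply() is implemented internally; groups and by_hand are names made up for this example.

```r
ages <- c(25, 26, 55, 37, 21, 42)
affils <- c("R", "D", "D", "R", "U", "D")

groups <- split(ages, affils)    # step 1: one subvector per level
by_hand <- sapply(groups, mean)  # step 2: apply the function to each subvector

print(by_hand)
print(tapply(ages, affils, mean))  # same group means
```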

The next example goes a step further. When there are two or more factors, they have to be passed as a list of factors.

    > d <- data.frame(list(gender = c("M", "M", "F", "M", "F", "F"),
    +                      age = c(47, 59, 21, 32, 33, 24),
    +                      income = c(55000, 88000, 32450, 76500, 123000, 45650)))
    > d
      gender age income
    1      M  47  55000
    2      M  59  88000
    3      F  21  32450
    4      M  32  76500
    5      F  33 123000
    6      F  24  45650
    > d$over25 <- ifelse(d$age > 25, 1, 0)
    > d
      gender age income over25
    1      M  47  55000      1
    2      M  59  88000      1
    3      F  21  32450      0
    4      M  32  76500      1
    5      F  33 123000      1
    6      F  24  45650      0
    > tapply(d$income, list(d$gender, d$over25), mean)
          0         1
    F 39050 123000.00
    M    NA  73166.67

The program above computes the average income broken down by gender and age (two factors). The data is therefore divided into four subvectors:

- Men aged 25 or under
- Women aged 25 or under
- Men over 25
- Women over 25

**The neat trick here is adding an over25 column that gives a simple age split, which makes the later tapply() call much easier.**
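
The NA in the result above appears because no man in the data is 25 or younger, so that subvector is empty. One quick way to check group sizes is to cross-tabulate the two factors with table(). The sketch below rebuilds d so that it runs on its own; counts and means are names introduced for illustration.

```r
d <- data.frame(gender = c("M", "M", "F", "M", "F", "F"),
                age = c(47, 59, 21, 32, 33, 24),
                income = c(55000, 88000, 32450, 76500, 123000, 45650))
d$over25 <- ifelse(d$age > 25, 1, 0)

# How many rows fall into each gender / over25 combination
counts <- table(d$gender, d$over25)
print(counts)

# The mean is NA exactly where the combination has no rows
means <- tapply(d$income, list(d$gender, d$over25), mean)
print(is.na(means))
```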

2、The split() function

**split() groups a vector according to factor levels and returns a list.** We continue with the data frame d from above.

    > split(d$income, list(d$gender, d$over25))
    $F.0
    [1] 32450 45650

    $M.0
    numeric(0)

    $F.1
    [1] 123000

    $M.1
    [1] 55000 88000 76500

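Another example concerns abalone gender: with split() we can quickly find the positions at which each gender occurs. Here g is a vector of gender codes (its values can be read back from the output below).

    > g <- c("M", "F", "F", "I", "M", "M", "F")
    > split(1:7, g)
    $F
    [1] 2 3 7

    $I
    [1] 4

    $M
    [1] 1 5 6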

3、The by() function

by() is similar to tapply(), but it can be applied not only to vectors but also to matrices and data frames. The following example uses by() for a regression analysis. The data file comes from the link provided with the textbook (unfortunately, the data there is rather incomplete).

    > aba2 <- read.csv("E:/files_for_R/abalone.data", header = FALSE)
    > # read.table vs. read.csv: read.table splits fields on whitespace by default, read.csv on ","
    > colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
    > by(aba2, aba2$gender, function(m) lm(m[, 2] ~ m[, 3]))
    aba2$gender: F

    Call:
    lm(formula = m[, 2] ~ m[, 3])

    Coefficients:
    (Intercept)       m[, 3]  
        0.04288      1.17918  

    ---------------------------------------------------------
    aba2$gender: I

    Call:
    lm(formula = m[, 2] ~ m[, 3])

    Coefficients:
    (Intercept)       m[, 3]  
        0.02997      1.21833  

    ---------------------------------------------------------
    aba2$gender: M

    Call:
    lm(formula = m[, 2] ~ m[, 3])

    Coefficients:
    (Intercept)       m[, 3]  
        0.03653      1.19480  

The data file provided with the book has no header row, so I added the column names myself, as follows.

    colnames(aba2) <- c("gender", "length", "diameter", "height", "wholewt", "shuckedwt", "viscwt", "shellwt", "rings")
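
Since the abalone file may not be at hand, here is a self-contained sketch of the same by() pattern on a small made-up data frame. Everything in it (toy, grp, x, y and the numbers) is hypothetical and invented purely for illustration. It fits one regression per group and then collects the coefficients; it uses a formula with data = m instead of the column indexing m[, 2] ~ m[, 3] shown above.

```r
# Hypothetical toy data: each group follows an exact straight line
toy <- data.frame(grp = rep(c("A", "B"), each = 4),
                  x = c(1, 2, 3, 4, 1, 2, 3, 4),
                  y = c(3, 5, 7, 9, 0, 3, 6, 9))  # A: y = 1 + 2x, B: y = -3 + 3x

# One linear model per level of grp; m is the sub data frame for that level
fits <- by(toy, toy$grp, function(m) lm(y ~ x, data = m))

# Collect intercept and slope of every fit into one matrix (one column per group)
coefs <- sapply(fits, coef)
print(round(coefs, 4))
```

Collecting the coefficients this way is often more convenient than reading them off the printed by() output.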

**Three. Working with tables**

1、The table-related functions in R

So far we have met two functions whose names involve "table": read.table() and table(). read.table() reads data files, and its default separator is "" (that is, any whitespace). table() processes a factor or a list of factors and produces a contingency table, that is, a record of frequency counts.

2、Using table() in detail

First, let's create a data frame like this:

    > ct <- data.frame(
    +   Vote.for.X = factor(c("yes", "yes", "no", "not sure", "no")),
    +   Voted.for.X = factor(c("yes", "no", "no", "yes", "no"))
    + )
    > ct
      Vote.for.X Voted.for.X
    1        yes         yes
    2        yes          no
    3         no          no
    4   not sure         yes
    5         no          no

Processing it with table() gives the following frequency table.

    > cttab <- table(ct)
    > cttab
              Voted.for.X
    Vote.for.X no yes
      no        2   0
      not sure  0   1
      yes       1   1

Similarly, for three-dimensional data, table() prints the result as a series of two-dimensional tables. I will not give a separate example here.
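
For readers who want to see the three-factor case anyway, here is a minimal sketch. The third column, Region, is a hypothetical addition made up for this illustration (as are the names ct3 and tab3). table() then returns a three-dimensional array that prints as one two-dimensional table per level of the third factor.

```r
ct3 <- data.frame(
  Vote.for.X  = factor(c("yes", "yes", "no", "not sure", "no")),
  Voted.for.X = factor(c("yes", "no", "no", "yes", "no")),
  Region      = factor(c("east", "west", "east", "west", "east"))  # hypothetical
)

tab3 <- table(ct3)
print(dim(tab3))         # one extent per factor
print(tab3)              # printed as one 2-D slice per Region
print(tab3[, , "east"])  # a single slice, indexed like an array
```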

3、Matrix-like and array-like operations on tables

3.1 Accessing cell frequencies

Indexing here works just as it does for a matrix. We again use cttab from above as the example.

    > class(cttab)
    [1] "table"
    > cttab[,1]
          no not sure      yes 
           2        0        1 
    > cttab[1,1]
    [1] 2

3.2 Scaling the cell frequencies proportionally

    > cttab/5
              Voted.for.X
    Vote.for.X  no yes
      no       0.4 0.0
      not sure 0.0 0.2
      yes      0.2 0.2
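
Dividing by 5 works here because the table holds exactly five observations. A more general sketch, assuming you want each cell as a proportion of the grand total, divides by sum(cttab) or uses base R's prop.table(). The data frame is rebuilt so the block runs on its own; p1 and p2 are names chosen for this example.

```r
ct <- data.frame(
  Vote.for.X  = factor(c("yes", "yes", "no", "not sure", "no")),
  Voted.for.X = factor(c("yes", "no", "no", "yes", "no"))
)
cttab <- table(ct)

p1 <- cttab / sum(cttab)  # divide every cell by the total count
p2 <- prop.table(cttab)   # base R helper that does the same thing

print(p2)
```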

3.3 Getting the marginal values of a table

- The marginal value of a variable: the sum of the cell counts over all other variables while that variable is held constant.
- A straightforward way to compute it is with the apply() function.

    > apply(cttab, 1, sum)
          no not sure      yes 
           2        1        2 

- An even more direct way is the addmargins() function, which appends the marginal values for both dimensions at once.

    > addmargins(cttab)
              Voted.for.X
    Vote.for.X no yes Sum
      no        2   0   2
      not sure  0   1   1
      yes       1   1   2
      Sum       3   2   5
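
The column marginals can be obtained the same way by applying over the second dimension, and base R's margin.table() gives either marginal directly. A small sketch, rebuilding cttab so that it runs standalone (row_tot and col_tot are illustrative names):

```r
ct <- data.frame(
  Vote.for.X  = factor(c("yes", "yes", "no", "not sure", "no")),
  Voted.for.X = factor(c("yes", "no", "no", "yes", "no"))
)
cttab <- table(ct)

row_tot <- apply(cttab, 1, sum)  # marginal of Vote.for.X
col_tot <- apply(cttab, 2, sum)  # marginal of Voted.for.X

print(row_tot)
print(col_tot)
print(margin.table(cttab, 2))    # same column totals, returned as a table
```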

**4、Extended example: finding the highest-frequency cells in a table**

The function can be designed as follows:

- Convert the table to a data frame, which adds a Freq column holding the frequency of each combination (done with as.data.frame()).
- Sort the rows by frequency (with the order() function).
- Take the first k rows, as required.

The code is as follows:

    tabdom <- function(tbl, k) {
      # create a data frame representing tbl, adding a Freq column
      tablframe <- as.data.frame(tbl)
      # determine the proper positions of the frequencies in an ordered frequency list
      tblfreord <- order(tablframe$Freq, decreasing = TRUE)
      # rearrange the data frame, get the first k rows
      dom <- tablframe[tblfreord, ][1:k, ]
      return(dom)
    }
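
As a usage sketch, the function can be tried on the cttab table built earlier. The block below repeats the definition and the data so that it runs on its own; top is a name introduced here.

```r
tabdom <- function(tbl, k) {
  tablframe <- as.data.frame(tbl)                         # adds the Freq column
  tblfreord <- order(tablframe$Freq, decreasing = TRUE)   # row order, largest first
  tablframe[tblfreord, ][1:k, ]                           # keep the first k rows
}

ct <- data.frame(
  Vote.for.X  = factor(c("yes", "yes", "no", "not sure", "no")),
  Voted.for.X = factor(c("yes", "no", "no", "yes", "no"))
)
cttab <- table(ct)

top <- tabdom(cttab, 3)  # the three most frequent cells
print(top)
```

Several cells here share a frequency of 1, so which of them make the cut depends on how order() breaks ties. Also note that if k exceeds the number of cells, indexing with 1:k yields NA rows; head(tablframe[tblfreord, ], k) would be one way to avoid that.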