
Data Frames

A data frame is an S object (data structure) that combines features of matrices and lists: it is a list of variables that all contain the same number of observations, laid out like a matrix. Typically, the variables are sets of measurements on a collection of cases, so that each row of the data frame holds the measurements for one case (subject), and each column holds the measurements for all subjects on a single variable.
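
The dual nature described above can be checked directly. The following is a minimal sketch written for R, the open-source dialect of S; the names cases, height and weight and the values are made up for illustration, and the same calls are assumed (not verified here) to behave alike in S-PLUS.

```r
# A small data frame: 4 cases (rows), 2 variables (columns).
cases <- data.frame(height = c(150, 160, 170, 180),
                    weight = c(50, 60, 70, 80))

# List-like behaviour: each component is one variable.
is.list(cases)        # TRUE
length(cases)         # number of variables
names(cases)          # variable names
cases$height          # extract one variable as a vector

# Matrix-like behaviour: rows are cases, columns are variables.
dim(cases)            # number of rows, number of columns
nrow(cases)
ncol(cases)
```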

Creating Data Frames

The data.frame function: if x, y and z are S datasets of the same length, we can combine them into a data frame directly:
```
	X <- data.frame(x,y,z)
```
Here is a simple example, illustrating the fact that the objects don't all have to be the same type. Columns of a data frame can be numeric, character, or logical vectors, or other objects such as factors.
```
        x <- 1:5
        y <- letters[1:5]
        z <- rnorm(5)

        X <- data.frame(x,y,z)

        X
          x y          z 
        1 1 a -0.8358069
        2 2 b  0.8515970
        3 3 c -0.1151393
        4 4 d -0.7857153
        5 5 e  0.2684005
```
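To see what type each column ended up with, the list-like structure can be used again. This sketch assumes R; the logical vector w is an addition for illustration. How character vectors are stored differs between implementations and versions: S-PLUS and older R releases convert them to factors by default, while R 4.0.0 and later keep them as character unless stringsAsFactors = TRUE is given.

```r
x <- 1:5
y <- letters[1:5]
z <- rnorm(5)
w <- x > 2                     # a logical vector, added for illustration

X <- data.frame(x, y, z, w)

# Apply class() to each column in turn.
sapply(X, class)

# Request factor storage for character columns explicitly.
Xf <- data.frame(x, y, z, w, stringsAsFactors = TRUE)
is.factor(Xf$y)
levels(Xf$y)
```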
Similarly, if our data are already in the form of a matrix X, we can convert it to a data frame:
```
	X <- data.frame(X)
```
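A slightly fuller sketch of the matrix case, assuming R. The matrix M, its values and the column names "A" and "B" are invented for illustration; a separate name is used for the matrix so that both objects can be inspected.

```r
# A 3 x 2 matrix with column names (values are made up).
M <- matrix(1:6, nrow = 3, ncol = 2,
            dimnames = list(NULL, c("A", "B")))

X <- data.frame(M)

is.matrix(M)          # TRUE
is.data.frame(X)      # TRUE
names(X)              # column names carried over from the matrix
X$B                   # columns are now accessible by name
```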
The read.table function can be used to create a data frame directly from a Unix file containing a data table. The data table should be stored in plain text format, with rows corresponding to cases and columns to variables. In typical use, the names of the variables are on the first line of the file:
```
	X <- read.table("filename",header=T)
```
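For a self-contained illustration, the sketch below first writes a small table to a temporary file and then reads it back. It assumes R; the file contents and the variable names A and B are made up, and tempfile, writeLines and unlink are used only to keep the example runnable without an existing data file.

```r
# Write a small whitespace-separated table to a temporary file.
# The first line holds the variable names.
path <- tempfile(fileext = ".txt")
writeLines(c("A B",
             "1 2.5",
             "2 3.5",
             "3 4.5"), path)

# header = TRUE tells read.table to take the names from the first line.
X <- read.table(path, header = TRUE)

names(X)
dim(X)
X

unlink(path)          # remove the temporary file
```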
If a data frame has been saved in S dump format (i.e. using the data.dump function), then it can be reloaded with the data.restore function.
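
The data.dump and data.restore functions belong to S-PLUS and are not available in R. As a rough analogue only, and not the mechanism described above, a similar save-and-reload round trip can be sketched in R with dput and dget; the data frame and file here are made up for illustration.

```r
X <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

# Write a text representation of X to a temporary file ...
path <- tempfile(fileext = ".R")
dput(X, file = path)

# ... and rebuild the data frame from that file.
X2 <- dget(path)

all.equal(X, X2)      # compare the original with the reloaded copy

unlink(path)          # remove the temporary file
```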

Subscripting

A data frame may be subscripted as if it were a matrix object:

	X["row",]  # select the row labeled "row"
	X[,"col"]  # select the column labeled "col"
	X[2,]      # select row 2
	X[,3]      # select column 3
	X[2,3]     # select the element in row 2, column 3.
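
The forms above can be tried on a small data frame with row labels. This is a sketch assuming R; the row names "r1" to "r3", the column names and the values are invented for illustration.

```r
X <- data.frame(A = c(10, 20, 30),
                B = c("u", "v", "w"),
                C = c(TRUE, FALSE, TRUE),
                row.names = c("r1", "r2", "r3"))

X["r2", ]     # the row labeled "r2" (still a data frame)
X[, "A"]      # the column labeled "A" (a plain vector)
X[2, ]        # row 2, the same row as X["r2", ]
X[, 3]        # column 3, i.e. C
X[2, 3]       # a single element

# List-style access to a column gives the same vector.
X$A
X[["A"]]
```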

Attaching Data Frames

It is often useful to use the attach command to facilitate access to the columns of a data frame directly by column name. Suppose the data frame X has column names "A" and "B":

	attach(X)
	plot(A,B)

vs.

	plot(X[,"A"],X[,"B"])

Note: if there is a variable named "A" in the .Data directory, then a reference to "A" will select it rather than the desired column of the data frame, unless the search order is modified. To detach a data frame, use the search function to find the position of the data frame in the search list, and then use the detach function to remove it from the search list.
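
The attach, search and detach steps, including the masking problem from the note, can be sketched as follows. This assumes R, where the workspace (global environment) plays the role of the .Data directory; the data and the helper names mean_A, mean_B and pos are made up for illustration.

```r
X <- data.frame(A = c(1, 2, 3), B = c(2, 4, 6))
A <- c(100, 200, 300)  # a workspace variable with the same name as a column

attach(X)              # put X on the search list (R reports that A is masked)
search()               # "X" now appears among the entries

mean_B <- mean(B)      # B is found in the attached data frame: 4
mean_A <- mean(A)      # the workspace A is found first: 200, not 2

# Find the position of X in the search list, then remove it.
pos <- match("X", search())
detach(pos = pos)

search()               # "X" is no longer listed
```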

Use in Statistical Modelling

Many S functions are set up to take advantage of the data frame format. If X is a data frame with columns "A" and "B", then

	plot(X)

has a special meaning: plot recognizes that X is a data frame and uses a plotting method written for data frames, and

	X.lm <- lm(A ~ B,data=X)

causes S to fit a "linear model", i.e. a regression line, to the specified columns of X.
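
A minimal worked sketch of the model fit, assuming R. The data are made up and lie exactly on the line A = 1 + 2 * B, so the estimated intercept and slope should come out as 1 and 2 up to rounding error.

```r
# Made-up data lying exactly on the line A = 1 + 2 * B.
X <- data.frame(B = c(1, 2, 3, 4, 5))
X$A <- 1 + 2 * X$B

# The formula names columns of X; data = X says where to look for them.
X.lm <- lm(A ~ B, data = X)

coef(X.lm)            # intercept and slope
fitted(X.lm)          # fitted values, one per case
```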

Albyn Jones
Tue Jun 25 11:03:47 PDT 1996