Data frames are far and away the most useful structures in R in routine use for handling tabular data. They are a type of list vector, about which more later, so don’t think about it yet. They act a little bit like a matrix. You’ve already exhausted many of the surprising parts of R by the time you get to data frames, so they’re pretty well-behaved, considering.
A feature of R that is useful for pedagogy but otherwise unholy is that the default search path (at least for my install!) is polluted with example datasets from the
datasets package. Run
data() at the interpreter to take a peek. Most of these are data frames. I use the
Formaldehyde frame below.
If anyone ever suggests that you use
attach() while working with data frames, consider them pityingly for a moment and proceed as before. Do not take their advice.
Creating data frames
From text data
read.table() function, which is used to read data from delimited text files, gives you data frames. The syntax you will almost always want to use to get data into R is
read.table('path/to/file', header=TRUE, sep='\t') (substituting your favorite separator, of course).
read.table isn’t the zippiest option for giant data sets but don’t worry about it until you’re sure you’re already worried about it.
Some things that you should know about
- String values are, by default, treated as
factors, not as character atomic vectors. If the strings in your data file describe elements of a set of discrete possibilities, this is often convenient and desirable. If you are not expecting it, this behavior may threaten to ruin your career. Pass
stringsAsFactors=FALSEto leave your character vectors alone.
- Lines starting with
#will be silently ignored.
- Any values which are the string
NAwill be understood as a missing value, unless you pass
- The strings
Infwill be interpreted appropriately if they appear in columns that otherwise appear to be numeric.
If they don’t already have them, you should give your data files column headers that are explanatory but easy to type, because it makes a lot of the access semantics for data frames much nicer.
The R Data Import/Export page elaborates on the above at greater length.
?read.table is also helpful. “An Introduction to R” warns that:
R input facilities are simple and their requirements are fairly strict and even rather inflexible. There is a clear presumption by the designers of R that you will be able to modify your input files using other tools, such as file editors or Perl to fit in with the requirements of R.
“Generally,” they add, with a straight face, “[modifying your input file with Perl] is very simple.” Snort. Ahem.
From the clipboard
Grabbing data from the clipboard is slightly wacky.
read.table('clipboard',…) works on Windows (so don’t name your input files ‘clipboard’). Mac users will sigh and resort to the inelegant but functional
write.table('clipboard',…) works similarly; Mac users should do
write.table(pipe('pbcopy'),…) to write data back. (The
pbcopy shell commands ship with OS X and are useful in many non-R contexts!)
From vectors in your workspace
Let’s take an example data set and put it back together:
> Formaldehyde carb optden 1 0.1 0.086 2 0.3 0.269 3 0.5 0.446 4 0.6 0.538 5 0.7 0.626 6 0.9 0.782 > mycarb <- Formaldehyde$carb > OD <- Formaldehyde$optden > my.new.frame <- data.frame(carb=mycarb, od=OD) > my.new.frame carb od 1 0.1 0.086 2 0.3 0.269 3 0.5 0.446 4 0.6 0.538 5 0.7 0.626 6 0.9 0.782
Accessing data in frames
Frames use MATLAB-like matrix addressing, as
frame[row,col]. More conveniently!,
frame$colname gives you the column vector
colname (as a vector);
frame[foo,] gives you the
footh row (as a single-row data frame). You can index both thusly:
You can see a list of columns with
names(frame). You rename columns by, spookily, assigning into
names(frame). Do you know how and why this works? Please educate me.
> a <- Formaldehyde > a carb optden 1 0.1 0.086 2 0.3 0.269 3 0.5 0.446 4 0.6 0.538 5 0.7 0.626 6 0.9 0.782 > names(a)  "carb" "optden" > names(a) <- "OD" > names(a)  "carb" "OD" > a carb OD 1 0.1 0.086 2 0.3 0.269 3 0.5 0.446 4 0.6 0.538 5 0.7 0.626 6 0.9 0.782 > a$OD  0.086 0.269 0.446 0.538 0.626 0.782
Would you like to know how many rows you have in your data frame? Use
nrow(foo). Do not use
length(foo), which unhelpfully tells you how many columns you have (it is describing the length of the list object containing the column vectors!); if you actually want that number, you can use
ncol(foo) for clarity.
dim(foo) will give you a vector containing both dimensions.
Adding data to frames
Adding columns is easy:
a <- Formaldehyde; a$bar <- seq(6) creates a new column
a$bar <- seq(100) fails (since 100 does not divide 6 evenly), BUT
a$bar <- 1:2 works silently, repeating the sequence (1,2) down the column, so fuck everything.
Don’t add data to the frame a row at a time. It is slower than molasses and just as deadly; this is the second level of the R inferno. Create your data frame from lists that are big enough to contain all of the data you expect the frame to contain, perhaps filling with
NA. You may, however, merge two data frames vertically using