DATA FRAMES

Preamble and Editorial

There is plenty to say about data frames because they are the primary data
structure in R. Some of what follows is essential knowledge. Some of it will be
satisfactorily learned for now if you remember that "R can do that." I will try
to point out which parts are which. Set aside some time. This is a long one! If
you break this one up into multiple sessions, always save your workspace when
you quit.
> setwd("Rspace")                      # if you have this directory
> rm(list=ls())                        # clear the workspace

A Note About Data Management. We can hardly discuss data frames without talking about data management. How do you get your data in? How do you edit them once they're in? I'm sorry to report that this is one area in which R is particularly poor. The facilities in R for data management are, to say the least, clumsy and inadequate. On top of that, there doesn't seem to be any move afoot to improve them. If I were a programmer, this is where I'd be working to improve R. The single most common "excuse" I've heard from people for not adopting R is lack of data management tools.

Now don't get me wrong. R does contain very powerful data management tools, and you can accomplish just about any data management task from within R. It's just not the way most people want to work with their data. Most people (so they tell me anyway) find a command line interface a clumsy way to manipulate a (large) data set. I get that. People working with other statistics software packages are used to a spreadsheet-like interface for entering and editing data. I've worked with that interface in SPSS, and I personally find it clunky and awkward. I'd much rather use a modern spreadsheet to manage my data, and that's what I do in R. For some reason, other people want the spreadsheet interface integrated, even if it's "clunky and awkward."

R does have a spreadsheet-like data editor. It is invoked by functions such as edit(), fix(), data.entry(), and maybe a couple others. I don't use these functions, and I'm not going to discuss them. Here's why. They just flat out don't work on my system. They're not awkward or clumsy, they generate error messages! I am at the moment sitting beside a Windows XP computer running R 3.1.2, and the functions are working there, but they are awkward and illogical. So even when these data management functions do work, they are just not convenient (or particularly safe!) ways to manage data.
Bottom line--you are probably going to end up using a spreadsheet or some other third-party software to manage larger data sets. I will show you a little of how to do that in this tutorial and the next one. I should also say that R can be set up to work with data base management software such as SQL, whatever that is. I don't know how to do that, and I've read mixed reviews of its effectiveness. It also sounds like you better be running Windows if you want to make it work, but I haven't really looked into it, and don't plan to.

Final note: R keeps data in RAM, so if you plan to work with really, really large data sets, you're going to have to interact with some sort of data base software, or have lots and lots of RAM. I have 4 GB in my system and have worked with data sets that have tens of thousands of cases and scores of variables. Having all the data in RAM makes R very fast. However, available RAM is the limiting factor in how large a data set you can work with entirely within R.

Definition and Examples (essential)

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case. As we shall see, a "case" is not necessarily the same as an experimental subject or unit, although they are often the same. Technically, in R a data frame is a list of column vectors, although there is only one reason why you might need to remember such an arcane thing. Unlike an array, the data you store in the columns of a data frame can be of various types. I.e., one column might be a numeric variable, another might be a factor, and a third might be a character variable. All columns have to be the same length (contain the same number of data items, although some of those data items may be missing values).

Let's say we've collected data on one response variable or DV from 15
subjects, who were divided into three experimental groups called control
("contr"), treatment one ("treat1"), and treatment two ("treat2"). We might be
tempted to table the data as follows.
contr   treat1   treat2
---------------------------
 22       32       30
 18       35       28
 25       30       25
 25       42       22
 20       31       33
---------------------------

While this is a perfectly acceptable table, it is NOT a data frame, because values on our one response variable have been divided into three columns (and so have values on the grouping or independent variable). A data frame has the name of the variable at the top of the column, and values of that variable in the column under the variable name. So the data above should be tabled as follows.

scores   group
----------------
  22     contr
  18     contr
  25     contr
  25     contr
  20     contr
  32     treat1
  35     treat1
  30     treat1
  42     treat1
  31     treat1
  30     treat2
  28     treat2
  25     treat2
  22     treat2
  33     treat2
----------------

This is a proper data frame. It does not matter in what order the columns appear, as long as each column contains values of one variable, and every recorded value of that variable is in that column.

In a previous tutorial we used the data object "women" as an example of a
data frame.
> women
   height weight
1      58    115
2      59    117
3      60    120
4      61    123
5      62    126
6      63    129
7      64    132
8      65    135
9      66    139
10     67    142
11     68    146
12     69    150
13     70    154
14     71    159
15     72    164

In this data frame we have two numeric variables and no real explanatory variables (IVs) or response variables (DVs). Notice when R prints out a data frame, it numbers the rows. These numbers are for convenience only and are not part of the data, and I'll have much more to say about them shortly.

We can refer to any value, or subset of values, in this data frame using the
already familiar notation.
> women[12,2]                          # row 12, column 2; note the square brackets
[1] 150
> women[8,]                            # row 8, all columns (blank index means "all")
  height weight
8     65    135
> women[1:5,]                          # rows 1 through 5, all columns
  height weight
1     58    115
2     59    117
3     60    120
4     61    123
5     62    126
> women[,2]                            # all rows, column 2
 [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
> women[c(1,3,7,13),]                  # rows 1, 3, 7, and 13, all columns
   height weight
1      58    115
3      60    120
7      64    132
13     70    154
> women[c(1,3,7,13),1]                 # rows 1, 3, 7, and 13, column 1
[1] 58 60 64 70

Here's the catch. Those index numbers do NOT necessarily correspond to the numbers you see printed at the left of the data frame. This can be confusing, and it is something you need to keep in mind. I will explain in a moment.

Another built-in data object that is a data frame is "warpbreaks". This data
frame contains 54 cases, so I will print out only every third one. I do this
with the sequence function, since this function creates a vector just as the
c() function did in the above examples.
> warpbreaks[seq(from=1, to=54, by=3),]
   breaks wool tension
1      26    A       L
4      25    A       L
7      51    A       L
10     18    A       M
13     17    A       M
16     35    A       M
19     36    A       H
22     18    A       H
25     28    A       H
28     27    B       L
31     19    B       L
34     41    B       L
37     42    B       M
40     16    B       M
43     21    B       M
46     20    B       H
49     17    B       H
52     15    B       H

In this data frame we have one numeric variable (number of breaks), and two categorical variables (type of wool and tension on the wool). We don't have to look at the data frame itself to get this information. We can also use the str() function, which displays a breakdown of the structure of a data frame (or other data object).

> str(warpbreaks)
'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...

Here are two more handy functions for finding out what's in a data frame.

> head(warpbreaks)                     # see the first six rows of data
  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L
> summary(warpbreaks)                  # see a summary of each of the variables
     breaks      wool   tension
 Min.   :10.00   A:27   L:18
 1st Qu.:18.25   B:27   M:18
 Median :26.00          H:18
 Mean   :28.15
 3rd Qu.:34.00
 Max.   :70.00

Another example is the data object "sleep".

> sleep
   extra group ID
1    0.7     1  1
2   -1.6     1  2
3   -0.2     1  3
4   -1.2     1  4
5   -0.1     1  5
6    3.4     1  6
7    3.7     1  7
8    0.8     1  8
9    0.0     1  9
10   2.0     1 10
11   1.9     2  1
12   0.8     2  2
13   1.1     2  3
14   0.1     2  4
15  -0.1     2  5
16   4.4     2  6
17   5.5     2  7
18   1.6     2  8
19   4.6     2  9
20   3.4     2 10

Here we have two variables, the change in sleep time a subject got ("extra"), and what drug the subject received ("group"). There is also a subject identifier (ID), indicating that the first 10 cases and last 10 cases are the same subjects. In this data frame, the first variable (the dependent variable, DV, response variable, etc.) is numeric and the second (the independent variable, IV, explanatory variable, grouping variable, etc.) is categorical, even though the categorical variable is coded as a number.
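When a grouping variable arrives coded as plain numbers, a quick check and conversion might look like the sketch below. The data frame `d` and its columns `score` and `grp` are made up for illustration; only data.frame(), is.factor(), factor(), levels(), and summary() are standard R.

```r
# Hypothetical data: 'grp' is a grouping variable stored as the numbers 1 and 2.
d <- data.frame(score = c(0.7, -1.6, 1.9, 0.8),
                grp   = c(1, 1, 2, 2))

is.factor(d$grp)        # FALSE: R sees an ordinary numeric column
summary(d$grp)          # numeric summary (min, quartiles, mean, max)

d$grp <- factor(d$grp)  # tell R that 'grp' is categorical

is.factor(d$grp)        # TRUE
levels(d$grp)           # the levels are now the character strings "1" and "2"
summary(d$grp)          # a frequency table: two cases at each level
```

The conversion is stored back into the data frame with `d$grp <- ...`; calling factor() without assigning the result would leave `d` unchanged.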
Once again, it does not matter in what order the columns occur. Put the IV in the first column, the DV in the third column, and the subject ID between them, if you want. However, if categorical variables are coded as numbers (a common practice),
R will not know this until you tell it. Has R been told that "group" is a factor
in this data frame? The str() function is one handy way to
find out.
> str(sleep)
'data.frame':	20 obs. of  3 variables:
 $ extra: num  0.7 -1.6 -0.2 -1.2 -0.1 3.4 3.7 0.8 0 2 ...
 $ group: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ ID   : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...

In this case, the fact that "group" is a factor is stored internally in the data frame, but that will not always be the case. So it's worth taking a look to make sure things you intend to be factors are being interpreted as factors by R. You can do this with str(), but you can also do it with summary(), because numeric variables and factors are summarized differently.

> summary(sleep)
     extra        group        ID
 Min.   :-1.600   1:10   1      :2
 1st Qu.:-0.025   2:10   2      :2
 Median : 0.950          3      :2
 Mean   : 1.540          4      :2
 3rd Qu.: 3.400          5      :2
 Max.   : 5.500          6      :2
                         (Other):8

Notice that numeric variables ("extra") are summarized with numerical summary statistics, while factors are summarized with a frequency table. In these data, there are 10 subjects in group 1 and 10 subjects in group 2. There are also two subjects with ID 1, two subjects with ID 2, etc.

An Ambiguous Case (essential)

Entering data into a data frame sometimes involves making a tough decision as
to what your variables are. The following example is from a built-in data object
called "anorexia". This data set is not in the libraries that are loaded by
default when R starts, so to see it, the first thing we need to do is attach the
correct library to the search path. Let's see how that works.
> search()
 [1] ".GlobalEnv"        "tools:RGUI"        "package:stats"
 [4] "package:graphics"  "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"   "Autoloads"
[10] "package:base"

This is the default search path, the one you have right after R starts. (It will be a little different in different operating systems.) We want to see an object in the MASS library (or package), which is not currently in the search path. So to get it into the search path, do this.

> library(MASS)
> search()
 [1] ".GlobalEnv"        "package:MASS"      "tools:RGUI"
 [4] "package:stats"     "package:graphics"  "package:grDevices"
 [7] "package:utils"     "package:datasets"  "package:methods"
[10] "Autoloads"         "package:base"

Notice we have added "package:MASS" to the search path in position 2. This means if we request an R object, R will look first in the global environment (the workspace), and if the object is not found there, R will look next in MASS, then in RGUI, then in stats, and so on, until the object either is found or R runs out of places to look for it.

The "anorexia" data frame is 72 cases long, so to conserve space we will look at only every fifth row of it.

> data(anorexia)                       # put a copy in your workspace
> anorexia[seq(from=1, to=72, by=5),]
   Treat Prewt Postwt
1   Cont  80.7   80.2
6   Cont  88.3   78.1
11  Cont  77.6   77.4
16  Cont  77.3   77.3
21  Cont  85.5   88.3
26  Cont  89.0   78.8
31   CBT  79.9   76.4
36   CBT  80.5   82.1
41   CBT  70.0   90.9
46   CBT  84.2   83.9
51   CBT  83.3   85.2
56    FT  83.8   95.2
61    FT  79.6   76.7
66    FT  81.6   77.8
71    FT  86.0   91.7

The data frame contains data from women who underwent treatment for anorexia. In the first column we have the treatment variable ("Treat"). The second column contains the pretreatment body weight in pounds ("Prewt"). The third column contains the posttreatment body weight in pounds ("Postwt"). So where is the ambiguity? Here's the awkward question.
In our analysis of these data, do we wish to treat weight as two variables ("Prewt" and "Postwt") each measured once on each subject, or as one variable (call it "weight") measured twice on each subject? The data frame is currently arranged as if the plan was for an analysis of covariance, with "Postwt" being the response, "Treat" the explanatory variable, and "Prewt" the covariate. Prewt and Postwt are treated as two variables.

If the plan was for a repeated measures ANOVA, then the data frame is wrong, because in this case, "weight" is ONE variable measured twice ("pre" and "post") on each woman. In this analysis, we would also need to add a "subject" identifier to the data frame, since each subject would have two lines, a "pre" line and a "post" line. (NOTE: There is an optional package that can be downloaded from CRAN which will do repeated measures ANOVA on data in this format. Google EZANOVA for details. The package is called ez.)

It's not a disaster. The data frame is easy enough to rearrange on the fly, and we will do so below.

FYI, this is how we get the MASS package out of the search path if
we no longer need it, which we don't. (Don't remove "anorexia" from your
workspace, however.)
> detach("package:MASS")

Creating a Data Frame in R (essential)

The easiest way--and the usual way--of getting a data frame into the R workspace is to read it in from a file. We will do that below (a few sections from now). Sometimes it becomes necessary to create one at the console, however. Here are the steps involved:
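A minimal sketch of those steps, using made-up values and hypothetical object names (`id`, `score`, `grp`, `toy`): enter each variable as its own vector, keeping the values in the same case order, then bind the vectors together with data.frame(). Here c() stands in for interactive entry so the lines can be pasted as they are.

```r
# Step 1: one vector per variable, all the same length and in the same case order.
id    <- c("s1", "s2", "s3")     # hypothetical case labels
score <- c(12.5, 9.0, NA)        # NA marks a missing value
grp   <- c("a", "b", "a")        # hypothetical group codes

# Step 2: combine the vectors; each one becomes a column named after the vector.
toy <- data.frame(id, score, grp)

# Step 3: check the result.
dim(toy)      # number of rows, then number of columns
names(toy)    # the column names, taken from the vector names
toy
```

Because the columns are matched up purely by position, a value left out of one vector shifts everything after it onto the wrong case, which is why the alignment matters so much.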
You may remember these data from the "Objects" tutorial.
name   age   hgt   wgt   race    year   SAT
Bob     21    70   180   Cauc    Jr     1080
Fred    18    67   156   Af.Am   Fr     1210
Barb    18    64   128   Af.Am   Fr      840
Sue     24    66   118   Cauc    Sr     1340
Jeff    20    72   202   Asian   So      880

Let's make a data frame of this. The method used here is a somewhat unforgiving method of entering data. I will make an intentional mistake and show you how to correct it below. However, the values for each of the variables have to remain aligned. I.e., Bob is age 21, 70 in. tall, weighs 180 lbs., etc. If you get the data values out of order in any given vector, or if you leave one out, for now all I can say is, "Start again!"

> ls()
[1] "anorexia"
> name = scan(what="character")
1: Bob Fred Barb Sue Jeff              # Remember: press Enter twice to end data entry.
6:
Read 5 items
> age = scan()
1: 21 18 18 24 20
6:
Read 5 items
> hgt = scan()
1: 70 67 64 66 72
6:
Read 5 items
> wgt = scan()                         # I am making a mistake intentionally here.
1: 180 156 128 1118 202
6:
Read 5 items
> race = scan(what="character")        # One value here is being recorded as missing, NA.
1: Cauc Af.Am NA Cauc Asian
6:
Read 5 items
> year = scan(what="character")
1: Jr Fr Fr Sr So
6:
Read 5 items
> SAT = scan()                         # One value here is being recorded as missing, NA.
1: 1080 1210 840 NA 880
6:
Read 5 items
> my.data = data.frame(name, age, hgt, wgt, race, year, SAT)
> my.data
  name age hgt  wgt  race year  SAT
1  Bob  21  70  180  Cauc   Jr 1080
2 Fred  18  67  156 Af.Am   Fr 1210
3 Barb  18  64  128  <NA>   Fr  840
4  Sue  24  66 1118  Cauc   Sr   NA
5 Jeff  20  72  202 Asian   So  880

Tah dah! It's as "simple" as that. You wouldn't want to have to do that with a large data set, however, and that's why we'll learn how to read them in from a file. DON'T clean up your workspace. We will carry this example over into the next section.

> ls()                                 # Messy! But leave it that way for now.
[1] "age"      "anorexia" "hgt"      "my.data"  "name"     "race"
[7] "SAT"      "wgt"      "year"

Accessing Information Inside a Data Frame (essential)

First, let's look at a few functions that allow us to get general
information about a data frame...
> dim(my.data)                         # Get size in rows by columns.
[1] 5 7
> names(my.data)                       # Get the names of variables in the data frame.
[1] "name" "age"  "hgt"  "wgt"  "race" "year" "SAT"
> str(my.data)                         # See the internal structure of the data frame.
'data.frame':	5 obs. of  7 variables:
 $ name: Factor w/ 5 levels "Barb","Bob","Fred",..: 2 3 1 5 4
 $ age : num  21 18 18 24 20
 $ hgt : num  70 67 64 66 72
 $ wgt : num  180 156 128 1118 202
 $ race: Factor w/ 3 levels "Af.Am","Asian",..: 3 1 NA 3 2
 $ year: Factor w/ 4 levels "Fr","Jr","So",..: 2 1 1 4 3
 $ SAT : num  1080 1210 840 NA 880

These are self-explanatory, with the exception of str(). First, notice that our character variables were entered into the data frame as factors. This is standard in versions of R before 4.0.0, but it may not be what you want. (From R 4.0.0 on, data.frame() keeps character vectors as character unless you add the option stringsAsFactors=TRUE.) Second, notice on the lines giving info about factors that there are strange numbers at the ends of those lines. You don't have to worry about these. What R is telling you is that factors are coded internally in R as numbers. R will keep it all straight for you, so don't sweat the details.

The summary() function is VERY useful here.
> summary(my.data)
   name        age            hgt            wgt            race    year
 Barb:1   Min.   :18.0   Min.   :64.0   Min.   : 128.0   Af.Am:1   Fr:2
 Bob :1   1st Qu.:18.0   1st Qu.:66.0   1st Qu.: 156.0   Asian:1   Jr:1
 Fred:1   Median :20.0   Median :67.0   Median : 180.0   Cauc :2   So:1
 Jeff:1   Mean   :20.2   Mean   :67.8   Mean   : 356.8   NA's :1   Sr:1
 Sue :1   3rd Qu.:21.0   3rd Qu.:70.0   3rd Qu.: 202.0
          Max.   :24.0   Max.   :72.0   Max.   :1118.0
      SAT
 Min.   : 840
 1st Qu.: 870
 Median : 980
 Mean   :1002
 3rd Qu.:1112
 Max.   :1210
 NA's   :1

Let's take a look. There is a variable called "name", which R is summarizing as a factor. We probably don't really want that, because it's not a grouping variable, but for now no harm no foul. There is a variable called "age", which is numeric, a variable called "hgt", which is numeric, and a variable called "wgt", which is numeric. Do you see any problems here? The "age" and "hgt" variables look entirely reasonable as far as min and max values are concerned, but wgt does not. Maximum wgt is 1118 lbs. Really? Something clearly went wrong here, and we are going to have to track it down and fix it!

The variables "race" and "year" are factors or categorical variables. See any problems? Yes, there is a missing value (NA) in "race" that didn't occur in the original data table. Something else we're going to have to fix. Finally, "SAT", also numeric, has a missing value that we are going to have to track down. This is the advantage of using summary(). It shows which variables have values missing, and you can look at the range of the numeric variables and see if there is anything suspicious, like a body weight of 1118 lbs.

There are four ways to get at the data inside a data frame, and this is NOT
one of them.
> SAT
[1] 1080 1210  840   NA  880

That only seemed to work, because remember when you created the data frame, you started by putting a vector called "SAT" into the workspace. THAT'S what you're seeing now! You are NOT seeing the SAT variable from inside the data frame. R looks in your workspace FIRST, so that is the "SAT" that it came up with. Confusing, right? So that we don't remain confused, let's erase all those vectors EXCEPT "age", which we will keep to illustrate something that you will need to remember about R.

> ls()
[1] "age"      "anorexia" "hgt"      "my.data"  "name"     "race"
[7] "SAT"      "wgt"      "year"
> rm(hgt, name, race, SAT, wgt, year)  ##### DON'T erase my.data!
> ls()
[1] "age"      "anorexia" "my.data"

Now if we try to see SAT as we did above...

> SAT
Error: object 'SAT' not found

...we get an error. R will not look inside data frames for variables unless you tell it to. Here are the four ways to do that.
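As a compact preview, here are the four approaches side by side on a made-up data frame. The names `toy`, `x`, `y`, `m1`, `m2`, `m4`, and `ct` are hypothetical and exist only for this sketch; the functions themselves are standard R.

```r
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))  # hypothetical data

# 1. The $ operator: data frame name, dollar sign, variable name.
m1 <- mean(toy$x)

# 2. with(): evaluate an expression using the columns of the data frame.
m2 <- with(toy, mean(x))

# 3. The data= argument of a function that accepts a formula.
ct <- cor.test(~ x + y, data = toy)

# 4. attach(): put the data frame on the search path, then detach it when done.
attach(toy)
m4 <- mean(x)
detach(toy)

c(m1, m2, m4)   # the same mean, reached three different ways
```

The first three leave the search path alone, which is one reason to prefer them when a workspace already holds objects whose names might collide with column names.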
A data frame is a list of column vectors. We can extract items from inside
it by using the usual list indexing device, $. To do this, type the name of the
data frame, a dollar sign, and the name of the variable you want to work
with. You can leave spaces around the $ if you want to. Or not.
> my.data $ SAT
[1] 1080 1210  840   NA  880
> my.data$SAT
[1] 1080 1210  840   NA  880

This can certainly be a nuisance, because it will mean that in some commands you have to type the data frame name multiple times. An example is the command that calculates a correlation.

> cor(my.data$hgt, my.data$wgt)
[1] -0.2531835

In this case, you can use the with() function to tell R where to get the data.

> with(my.data, cor(hgt, wgt))         # syntax: with(dataframe.name, function(arguments))
[1] -0.2531835

It doesn't save much typing in this example, but there are cases where that will save a LOT of typing! Notice the syntax of this function. You type the name of the data frame first, followed by a comma, followed by the function you want to execute, then you close the parentheses on with().

As we will learn later, some functions, especially significance tests, take
what's called a formula interface. When that's the case, there is (almost) always
a data= option to specify the name of the data frame where the variables are to be
found. I'll just show you an example for now. We'll have plenty of time to
examine the formula interface later. For now, all you need to be aware of is the
tilde, which is always present in a formula. In this case, the formula starts
with the tilde (the squiggly line), which is unusual.
> cor.test( ~ hgt + wgt, data=my.data) # syntax: function(formula, data=dataframe.name)

	Pearson's product-moment correlation

data:  hgt and wgt
t = -0.4533, df = 3, p-value = 0.6811
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9281289  0.8100218
sample estimates:
       cor
-0.2531835

Finally, there is the dreaded attach()
function. This attaches the data frame to your search path (in position 2) so
that R will know to look there for data objects that are referenced by name.
Some people use this device routinely when working with data frames, but it
can cause problems, and we are about to see one.
> attach(my.data)

	The following object(s) are masked _by_ .GlobalEnv :

	 age

Say what? When an object is masked (or shadowed) by the global environment, that means there is a data object in the workspace that has this name AND there is a variable inside the data frame that has this name. I can now ask directly for any variable inside the data frame EXCEPT age.

> SAT                                  # same as my.data$SAT (well, almost)
[1] 1080 1210  840   NA  880
> mean(wgt)                            # same as mean(my.data$wgt)
[1] 356.8
> table(year)                          # same as table(my.data$year)
year
Fr Jr So Sr
 2  1  1  1
> age
[1] 21 18 18 24 20

You might think you are seeing my.data$age here, but YOU ARE NOT! You're seeing "age" from the workspace, BECAUSE THAT'S WHERE R LOOKS FIRST. In this case, both copies of "age" are the same, but that won't always be true.

> age = 112                            # modifies the first copy it finds
> age
[1] 112
> my.data$age
[1] 21 18 18 24 20

The assignment changed the value of "age" in the workspace, because that is the first "age" R saw, but did not change the value of "age" in the data frame. If we remove age from the workspace, R will then search inside the data frame for it, because the data frame is attached in position 2 of the search path.

> rm(age)
> age
[1] 21 18 18 24 20

The lesson is, when you get one of these masking (or shadowing) conflicts, WATCH OUT! Be extra careful to know which version of the variable you're working with. This has tripped up many an R user, including me. This is why you want to keep your workspace as clean as possible. The best strategy here is to remove the masking variable from the workspace. If you want to keep it, at least rename it and then remove the conflicting version from the workspace. You'll eventually be sorry if you don't!

One more lesson...
> detach(my.data)

When you're done with an attached data frame, ALWAYS detach it. This will remove it from the search path so that R will no longer look inside it for variables. You'll have to go back to using $ to reference variables inside the data frame after it is detached. This isn't necessary if you're going to quit your R session right away. Quitting detaches everything that was attached. But if you're going to continue working, detach data frames you no longer need. Otherwise, your search path will get messy, and you'll get more and more masking conflicts as other objects are attached.

DON'T erase my.data. We still need it.

Data Frame Indexing and Row Names (critical)

This will cost you BIGTIME eventually if you don't pay close attention!
This drove me nuts for awhile until I figured out what was happening.
> ls()                                 # Still there?
[1] "anorexia" "my.data"
> my.data
  name age hgt  wgt  race year  SAT
1  Bob  21  70  180  Cauc   Jr 1080
2 Fred  18  67  156 Af.Am   Fr 1210
3 Barb  18  64  128  <NA>   Fr  840
4  Sue  24  66 1118  Cauc   Sr   NA
5 Jeff  20  72  202 Asian   So  880

Let's talk about those line numbers at the leftmost verge of the printed data frame. THEY ARE NOT NUMBERS. Let me repeat that. THEY ARE NOT NUMBERS. They are row names. They are character values. So the rows and columns of this data frame are NAMED as follows:

> dimnames(my.data)
[[1]]
[1] "1" "2" "3" "4" "5"

[[2]]
[1] "name" "age"  "hgt"  "wgt"  "race" "year" "SAT"

What's the big deal? Look at the printed data frame. Suppose we wanted to extract Barb's weight.
That's the value in row 3 and column 4, so we could get it this way.
> my.data[3,4]                         # Remember to use square brackets for indexing.
[1] 128

"Yeah, so?" We could also get it this way...

> my.data[3,"wgt"]
[1] 128

...and this way...

> my.data["3","wgt"]
[1] 128

Those last two ways seem to be the same, BUT THEY ARE NOT!!!

Let's sort the data frame using the age variable. Sorting a data frame is
done using the order() function. Remember how
it worked when we sorted a vector? If a call to the
order() function is put in place of the row
index, the data frame will be sorted on whatever variable is specified inside
that function. You will have to use the full name of the variable; i.e., you
will have to use the $ notation. (Why?) Otherwise, R will be looking in your
workspace for a variable called "age", not finding it, and giving a "not
found" error. It happens to me a lot, so you might as well just get used to it!
> my.data[order(my.data$age),]
  name age hgt  wgt  race year  SAT
2 Fred  18  67  156 Af.Am   Fr 1210
3 Barb  18  64  128  <NA>   Fr  840
5 Jeff  20  72  202 Asian   So  880
1  Bob  21  70  180  Cauc   Jr 1080
4  Sue  24  66 1118  Cauc   Sr   NA

Observe the row names! They have also been sorted, haven't they? Let's save this into a new data object so we can play with it a bit.

> my.data[order(my.data$age),] -> my.data.sorted     # Did you remember up arrow?
> my.data.sorted
  name age hgt  wgt  race year  SAT
2 Fred  18  67  156 Af.Am   Fr 1210
3 Barb  18  64  128  <NA>   Fr  840
5 Jeff  20  72  202 Asian   So  880
1  Bob  21  70  180  Cauc   Jr 1080
4  Sue  24  66 1118  Cauc   Sr   NA

Now let's try to extract Barb's weight from this new data frame.

> my.data.sorted[3,4]                  ### Wrong!
[1] 202
> my.data.sorted[3,"wgt"]              ### Also wrong!
[1] 202
> my.data.sorted["3","wgt"]            ### Correct!
[1] 128
> my.data.sorted[2,4]                  ### Also correct!
[1] 128

Confused yet?

Here's what you have to remember. Those numbers that often print out on the
left side of a data frame ARE NOT NUMBERS. They're row names--character values.
So data frames
have both row and column names, whether you like it or not! The point becomes
clearer when we give the rows actual names. Let's erase the names from my.data
and then re-enter them as row names.
> rm(my.data.sorted)                   # Get rid of that first.
> my.data$name = NULL                  # This is how you erase a variable from a data frame.
> my.data
  age hgt  wgt  race year  SAT
1  21  70  180  Cauc   Jr 1080
2  18  67  156 Af.Am   Fr 1210
3  18  64  128  <NA>   Fr  840
4  24  66 1118  Cauc   Sr   NA
5  20  72  202 Asian   So  880
> rownames(my.data) = c("Bob","Fred","Barb","Sue","Jeff")
> my.data
     age hgt  wgt  race year  SAT
Bob   21  70  180  Cauc   Jr 1080
Fred  18  67  156 Af.Am   Fr 1210
Barb  18  64  128  <NA>   Fr  840
Sue   24  66 1118  Cauc   Sr   NA
Jeff  20  72  202 Asian   So  880
> my.data["Barb", "wgt"]               # Makes getting Barb's weight a lot easier!
[1] 128

Notice the numbers are gone now because we have actual row names. And OF COURSE they sort with the rest of the data frame, just as the "number" row names did above.

> my.data[order(my.data$age),]
     age hgt  wgt  race year  SAT
Fred  18  67  156 Af.Am   Fr 1210
Barb  18  64  128  <NA>   Fr  840
Jeff  20  72  202 Asian   So  880
Bob   21  70  180  Cauc   Jr 1080
Sue   24  66 1118  Cauc   Sr   NA

It would be absolutely silly if they didn't! Just remember: Data frames ALWAYS have row names. Sometimes those row names just happen to look like numbers. It's the row names that print out to your console when you ask to see the data frame, or any part of it, and NOT the index numbers. (R Studio shows you both when you ask to View a data frame.) (NOTE: All row names have to be unique. You can't have two Barbs, for obvious reasons.)

Don't remove my.data yet. We still need it.

Modifying a Data Frame (pretty important)

Rule number one with a bullet:
Don't modify a data frame while it is attached. While this isn't strictly against the law, it's a bad idea and can get very confusing as to exactly what it is you've modified. I could try to explain it, but I'm not sure I understand it myself. So just don't do it! (An attached data frame is a copy of the data frame in the workspace, not the actual data frame in the workspace. Modifications will be made to the actual data frame in the workspace, but not to the attached copy.)

The time will come when you want to change a data frame in some way. Here are
some examples of how to do that. You may have noticed that Sue (in the my.data
data frame) is a wee bit on the chunky side. This was an innocent mistake. I
really didn't do that on purpose. How do we fix it? The value was supposed to
be 118, but let's change it to 135 just for kicks.
> ls()                                 # Still there?
[1] "anorexia" "my.data"
> my.data
     age hgt  wgt  race year  SAT
Bob   21  70  180  Cauc   Jr 1080
Fred  18  67  156 Af.Am   Fr 1210
Barb  18  64  128  <NA>   Fr  840
Sue   24  66 1118  Cauc   Sr   NA
Jeff  20  72  202 Asian   So  880
> my.data["Sue", "wgt"] = 135
> my.data
     age hgt wgt  race year  SAT
Bob   21  70 180  Cauc   Jr 1080
Fred  18  67 156 Af.Am   Fr 1210
Barb  18  64 128  <NA>   Fr  840
Sue   24  66 135  Cauc   Sr   NA
Jeff  20  72 202 Asian   So  880

That's all there is to it. Use any kind of indexing you like. Let's use numerical indexing to give Sue her correct weight, and while we're at it, let's fix those missing values, too.

> my.data[4,3] = 118
> my.data[3, "race"] = "Af.Am"
> my.data["Sue", 6] = 1340
> my.data
     age hgt wgt  race year  SAT
Bob   21  70 180  Cauc   Jr 1080
Fred  18  67 156 Af.Am   Fr 1210
Barb  18  64 128 Af.Am   Fr  840
Sue   24  66 118  Cauc   Sr 1340
Jeff  20  72 202 Asian   So  880

Just remember that "wgt" is now in column 3, since the row names don't count as a column.

Now, I have a confession to make. I neglected to detach my.data before I
made those changes. Here are the consequences.
> SAT                                  # sees the attached copy
[1] 1080 1210  840   NA  880
> my.data$SAT                          # sees the copy in the workspace
[1] 1080 1210  840 1340  880
> wgt
[1]  180  156  128 1118  202
> race
[1] Cauc  Af.Am <NA>  Cauc  Asian
Levels: Af.Am Asian Cauc

Ack! The attached copy has not been changed. But the copy in the workspace has been changed. Here's the fix. (Won't work if you weren't as stupid as I was and didn't have my.data attached while you were making those modifications.)

> detach(my.data)                      # will just give an error if not attached
> attach(my.data)                      # previous attached copy tossed; new attached copy created
> SAT
[1] 1080 1210  840 1340  880
> detach(my.data)

I have to warn you about modifying data frames. It's always a good idea to make a backup copy in the workspace first, because there are some commands that modify data frames that, if they go wrong, can really screw things up! But let's live dangerously. Suppose we wanted "wgt" to be in kilograms instead of pounds. Easy enough...

> my.data$wgt / 2.2
[1] 81.81818 70.90909 58.18182 53.63636 91.81818
> my.data                              # Nothing has changed yet. Why not?
     age hgt wgt  race year  SAT
Bob   21  70 180  Cauc   Jr 1080
Fred  18  67 156 Af.Am   Fr 1210
Barb  18  64 128 Af.Am   Fr  840
Sue   24  66 118  Cauc   Sr 1340
Jeff  20  72 202 Asian   So  880
> my.data$wgt / 2.2 -> my.data$wgt     # Aha! It has to be stored back into my.data.
> my.data
     age hgt      wgt  race year  SAT
Bob   21  70 81.81818  Cauc   Jr 1080
Fred  18  67 70.90909 Af.Am   Fr 1210
Barb  18  64 58.18182 Af.Am   Fr  840
Sue   24  66 53.63636  Cauc   Sr 1340
Jeff  20  72 91.81818 Asian   So  880
> my.data$wgt = round(my.data$wgt, 1)  # A little rounding for good measure.
> my.data
     age hgt  wgt  race year  SAT
Bob   21  70 81.8  Cauc   Jr 1080
Fred  18  67 70.9 Af.Am   Fr 1210
Barb  18  64 58.2 Af.Am   Fr  840
Sue   24  66 53.6  Cauc   Sr 1340
Jeff  20  72 91.8 Asian   So  880

Now that we've rounded them off, we've lost the original weight data in pounds.
> my.data$wgt * 2.2
[1] 179.96 155.98 128.04 117.92 201.96

We could have avoided this by making a backup copy of my.data first, or by putting the new weight in kilograms into a new column in the data frame. Let's see how to create a new column in the data frame.
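Both safeguards can be sketched on a small stand-in data frame. This is only an illustration: the names toy, toy.backup, and wgt.kg are made up here and are not part of the tutorial's workspace.

```r
# Hypothetical stand-in for my.data, with weights in pounds
toy <- data.frame(wgt = c(180, 156, 128),
                  row.names = c("Bob", "Fred", "Barb"))

# Safeguard 1: keep a backup copy before modifying anything
toy.backup <- toy

# Safeguard 2: write the converted values to a NEW column,
# leaving the original pounds column untouched
toy$wgt.kg <- round(toy$wgt / 2.2, 1)

toy                                   # wgt and wgt.kg side by side
identical(toy$wgt, toy.backup$wgt)    # the pounds data are still intact
```

Creating a column works the same way for any new variable, as the next command shows.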
> my.data$IQ = c(115, 122, 100, 144, 96)
> my.data
     age hgt  wgt  race year  SAT  IQ
Bob   21  70 81.8  Cauc   Jr 1080 115
Fred  18  67 70.9 Af.Am   Fr 1210 122
Barb  18  64 58.2 Af.Am   Fr  840 100
Sue   24  66 53.6  Cauc   Sr 1340 144
Jeff  20  72 91.8 Asian   So  880  96

Just name it and assign values to the name in a vector. The new vector has to be the same length as the other variables already in the data frame.

> ls()
[1] "anorexia" "my.data"

Keep all of that. We're going to be referring to my.data in the next tutorial.

Missing Values (kinda important, so listen up!)

Do this.
> data(Cars93, package="MASS")         # Get data from MASS without attaching MASS first.
> str(Cars93)                          # Lots of output not shown!

This is a data frame with 93 observations on 27 variables. You can see what the variables represent by looking at the help page for this data set: help(Cars93, package="MASS"). We're interested in the variable "Luggage.room" in particular, which is the trunk space in cubic feet, to the nearest cubic foot.

> attach(Cars93)
> summary(Luggage.room)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   6.00   12.00   14.00   13.89   15.00   22.00   11.00 

This is a numeric variable, so we get the summary we are accustomed to by now. But what are those NAs? Whether we like it or not, data sets often have missing values, and we need to know how to deal with them. R's standard code for missing values is "NA", for "not available". The number associated with NA is a frequency. There are 11 cases in this data frame in which "Luggage.room" is a missing value. If you looked at the help page, you know why.

Some functions fail to work when there are missing values, but this can
(almost always) be fixed with a simple option.
> mean(Luggage.room)
[1] NA
> mean(Luggage.room, na.rm=TRUE)
[1] 13.89024
> mean(Luggage.room, na.rm=T)
[1] 13.89024

There is no mean when some of the values are missing, so the "na.rm" option removes them when set to TRUE (must be all caps, but the shorter form T also works provided you haven't assigned another value to it). If you want to clean the data set by removing casewise all cases with missing values on any variable (not always a good idea!), use the na.omit() function.

> na.omit(Cars93)                      # Output not shown.

I will not reproduce the output here because it is extensive, but it is also instructive, so take a look at it. Scroll the console window backwards to see all of it. Of course, to use this cleaned data frame, you would have to assign it to a new data object.

The which() function does not work to
identify which of the values are missing if you test with == NA. Use is.na() instead.
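A minimal sketch of why the == test comes up empty, using a made-up vector x (hypothetical, not from Cars93): any comparison with NA is itself NA, and which() reports only the positions that are TRUE.

```r
x <- c(11, NA, 14, NA)   # hypothetical vector with two missing values

x == NA                  # every comparison with NA is itself NA, never TRUE
which(x == NA)           # which() keeps only TRUE positions, so nothing comes back
is.na(x)                 # a proper TRUE/FALSE vector
which(is.na(x))          # positions of the missing values
sum(is.na(x))            # how many values are missing
```

The same thing happens with the real data.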
> which(Luggage.room == NA)
integer(0)
> is.na(Luggage.room)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[23] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[67] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
[89]  TRUE FALSE FALSE FALSE FALSE
> which(is.na(Luggage.room))
 [1] 16 17 19 26 36 56 57 66 70 87 89

Finally, some data sets come with other codes for missing values. 999 is a common missing value code, as are blank spaces. Blanks are a very bad idea. If you find a data set with blanks in it, it may have to be edited in a text editor or spreadsheet before the file can be read into R. It depends on how the file is formatted. In some cases, R will automatically assign NA to blank values, but in other cases it will not. Other missing value codes are not a problem, as they can be recoded.
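If the code is known before the file is read, the recoding can also happen at import time. Here is a minimal sketch, assuming a hypothetical CSV in which 999 marks a missing value. The na.strings argument of read.csv() lists the strings to be read as NA; the names raw, cars, car, and luggage are made up for illustration.

```r
# Hypothetical file contents, supplied inline through the text= argument
raw <- "car,luggage
A,11
B,999
C,14
"

# Treat both 999 and the usual NA as missing while reading
cars <- read.csv(text = raw, na.strings = c("999", "NA"))

is.na(cars$luggage)                 # the 999 came in as NA
mean(cars$luggage, na.rm = TRUE)    # mean of the two real values
```

For a vector that is already in the workspace, ifelse() does the recoding.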
> ifelse(is.na(Luggage.room), 999, Luggage.room) -> temp
> temp
 [1]  11  15  14  17  13  16  17  21  14  18  14  13  14  13  16 999 999
[18]  20 999  15  14  17  11  13  14 999  16  11  11  15  12  12  13  12
[35]  18 999  18  21  10  11   8  12  14  11  12   9  14  15  14   9  19
[52]  22  16  13  14 999 999  12  15   6  15  11  14  12  14 999  14  14
[69]  16 999  17   8  17  13  13  16  18  14  12  10  15  14  10  11  13
[86]  15 999  10 999  14  15  14  15
> # first we'll mess it up
> # and then we'll fix it
> ifelse(temp == 999, NA, temp) -> fixed
> fixed
 [1] 11 15 14 17 13 16 17 21 14 18 14 13 14 13 16 NA NA 20 NA 15 14 17 11
[24] 13 14 NA 16 11 11 15 12 12 13 12 18 NA 18 21 10 11  8 12 14 11 12  9
[47] 14 15 14  9 19 22 16 13 14 NA NA 12 15  6 15 11 14 12 14 NA 14 14 16
[70] NA 17  8 17 13 13 16 18 14 12 10 15 14 10 11 13 15 NA 10 NA 14 15 14
[93] 15

The ifelse() function is very handy for recoding a data vector, so let me take a moment to explain it. Inside the parentheses, the first thing you give is a test. In the second of these commands above, where we are going from the messed up variable back to "fixed", the test was "if any value of temp is equal to 999". Notice the double equals sign meaning "equal". (I still get this wrong a lot!) The second thing you give is how to recode those values, and finally you tell what to do with the values that don't pass the test. So the whole command reads like this: "If any value of temp is equal to 999, assign it the value NA, else assign it the value that is currently in temp."

In the first instance of the function, we had to use is.na, since nothing can
really be "equal to" something that is not available! Try these, and say them
in words as you're typing them.
> ifelse(fixed == 10, 0, 100)          # Output not shown.
> ifelse(fixed > 10, 100, 0)           # Output not shown.
> ifelse(fixed > 10, "big", "small")   # Output not shown.

If you stored that last one, it would create a character vector. Don't forget to clean up your workspace and search path!
> rm(Cars93)                           # and anything else other than anorexia and my.data
> detach(Cars93)                       # removing it does not detach it!

Inline Data Entry In R (optional)

(NOTE: This may or may not work. I've just tested it in R 3.1.2 on both a Mac running Snow Leopard and a Windows XP machine, and it worked in both cases. Some of my students claim to have problems with it, especially in RStudio. I've been unable to duplicate those problems.)

Those of you who are old enough to have used SPSS in a version where you had to type commands into a batch file for execution may remember inline data entry. You typed BEGIN DATA (as I recall), typed your data into a table-like format, and then typed END DATA. Is there anything like that in R? Sort of.

Open a script window: File > New Script in Windows, or File > New Document on
a Mac. In that script window, type exactly this. Include a blank line at the end.
You can create white space by either tabbing or spacing on a Mac, but in Windows
you must create white space by spacing with the spacebar. (The help page
suggests otherwise, but I have been unable to get the Windows version to
recognize tabs as white space.) You can edit freely as you are typing.
new.dataframe = read.table(header=T, text="
name age hgt wgt race  year SAT
Bob  21  70  180 Cauc  Jr   1080
Fred 18  67  156 Af.Am Fr   1210
Barb 18  64  128 Af.Am Fr   840
Sue  24  66  118 Cauc  Sr   1340
Jeff 20  72  202 Asian So   880
")

Then, in Windows, go to the Edit menu, and choose Run all. On a Mac, highlight the whole thing in the script window with your mouse, go to the Edit menu, and choose Execute. That should put a data frame in your workspace called new.dataframe. Check it to make sure it's sound. (NOTE: In Linux, R scripts are created in a text editor such as vim or gedit, saved, and then read into R by using the source() function at the Console command prompt.)

On my Mac, the script looked like this. (In Windows the script window is much plainer, and the lines are not numbered.)

[Image: the script window on a Mac]

You can save the contents of this window as an R script, which you can always reopen and modify if necessary.

Further hint: I just got those data into R by copying and pasting the command and data above directly into the R Console. So that worked. It led me to wonder if I could copy and paste an HTML table-formatted object from a web page. I just tried it, and it caused R to crash, so I can't recommend it! (That's only the second time in 12 years that R has crashed on me.) However, here's what did work. I copied the contents of the table on the web page and pasted it into a text editor (I used TextWrangler on a Mac). Then I added the necessary R commands in the text editor and copied and pasted it into the R Console. Sure beats typing in that data myself!

Subsetting a Data Frame (optional)

We will use a data frame called USArrests for this exercise.
> data(USArrests)
> head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

Here is another useful function for looking at a data frame. The head() function shows the first six lines of data (cases) inside a data frame. There is also a tail() function that shows the last six lines, and the number of lines shown can be changed with an option (see the help pages). In this case we have a data frame with row names set to state names and containing variables that give the crime rates (per 100,000 population) for Murder, Assault, and Rape, as well as the percentage of the population that lives in urban areas. These data are from 1973 so are not current.

Because state names are used as row names, to see the data for any state,
all we have to do is be able to spell the name of the state.
> USArrests["Pennsylvania",]           # No column index, so all columns displayed.
             Murder Assault UrbanPop Rape
Pennsylvania    6.3     106       72 14.9

We do not have to figure out what the index number would be for that row. Thus, explicit row names can be very handy. To display the entire row of data for PA, we just leave out the column index, but THE COMMA STILL HAS TO BE THERE! Otherwise, you are trying to index a two-dimensional data object using only one index, and R will tell you to knock it off!

Let's answer the following questions from these data.
> min(USArrests$Murder)                # What is the minimum murder rate?
[1] 0.8
> which(USArrests$Murder == 0.8)       # Which line of the data is that?
[1] 34
> USArrests[34,]                       # Give me the data from that line.
             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
> USArrests[USArrests$Murder==min(USArrests$Murder),]   # All at once (showing off).
             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
>
> which(USArrests$Murder < 4.0)        # Gives the result in a vector.
 [1]  7 12 15 19 23 29 34 39 41 44 45 49
> USArrests[which(USArrests$Murder < 4.0),]   # Use that vector as an index.
              Murder Assault UrbanPop Rape
Connecticut      3.3     110       77 11.1
Idaho            2.6     120       54 14.2
Iowa             2.2      56       57 11.3
Maine            2.1      83       51  7.8
Minnesota        2.7      72       66 14.9
New Hampshire    2.1      57       56  9.5
North Dakota     0.8      45       44  7.3
Rhode Island     3.4     174       87  8.3
South Dakota     3.8      86       45 12.8
Utah             3.2     120       80 22.9
Vermont          2.2      48       32 11.2
Wisconsin        2.6      53       66 10.8
>
> summary(USArrests$UrbanPop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  32.00   54.50   66.00   65.54   77.75   91.00 
> USArrests[which(USArrests$UrbanPop >= 77.75),]
              Murder Assault UrbanPop Rape
Arizona          8.1     294       80 31.0
California       9.0     276       91 40.6
Colorado         7.9     204       78 38.7
Florida         15.4     335       80 31.9
Hawaii           5.3      46       83 20.2
Illinois        10.4     249       83 24.0
Massachusetts    4.4     149       85 16.3
Nevada          12.2     252       81 46.0
New Jersey       7.4     159       89 18.8
New York        11.1     254       86 26.1
Rhode Island     3.4     174       87  8.3
Texas           12.7     201       80 25.5
Utah             3.2     120       80 22.9

Suppose we wanted to work with data only from these states. How can we
extract them from the data frame and make a new data frame that contains only
those states? I'm glad you asked.
> subset(USArrests, subset=(UrbanPop >= 77.75)) -> high.urban
> high.urban
              Murder Assault UrbanPop Rape
Arizona          8.1     294       80 31.0
California       9.0     276       91 40.6
Colorado         7.9     204       78 38.7
Florida         15.4     335       80 31.9
Hawaii           5.3      46       83 20.2
Illinois        10.4     249       83 24.0
Massachusetts    4.4     149       85 16.3
Nevada          12.2     252       81 46.0
New Jersey       7.4     159       89 18.8
New York        11.1     254       86 26.1
Rhode Island     3.4     174       87  8.3
Texas           12.7     201       80 25.5
Utah             3.2     120       80 22.9

The subset() function does the trick. The syntax is a little squirrelly, so let me go through it. The first thing you give is the name of the data frame. That is followed by the subset= option. Then inside of parentheses (which actually aren't necessary) give the test that defines the subset. Store the output into a new data object so that you can then work with it. Functions that take a data= option can also take a subset option, so it's a useful thing to know.

Of course you can also use subset() without an assignment if you just want to display the results. This eliminates the need to do the fancy indexing tricks above. Or you can use the fancy indexing tricks with an assignment to get the same result stored in a new data object. Whatever paddles your canoe. In R there are generally multiple ways to accomplish things, and this is a good example.

You can clean up your workspace now, but KEEP anorexia and my.data.

Stacking and Unstacking (optional)

Suppose someone has retained your services as a data analyst and gives you
his data (from an Excel file or something) in this format.
contr   treat1   treat2
---------------------------
 22       32       30
 18       35       28
 25       30       25
 25       42       22
 20       31       33
---------------------------

If you're working for free, you can yell at him and make him do it the right way, but if you're being paid, you probably shouldn't. Here's how to deal with it. First, let's get these data into a "data frame" in this format, and I will leave out the command prompts so that you can just copy and paste these three lines directly into R.

### start copying here
wrong.data = data.frame(contr = c(22,18,25,25,20),
                        treat1 = c(32,35,30,42,31),
                        treat2 = c(30,28,25,22,33))
### stop copying here and paste into R

> wrong.data
  contr treat1 treat2
1    22     32     30
2    18     35     28
3    25     30     25
4    25     42     22
5    20     31     33

Now do this.

> stack(wrong.data) -> correct.data
> correct.data
   values    ind
1      22  contr
2      18  contr
3      25  contr
4      25  contr
5      20  contr
6      32 treat1
7      35 treat1
8      30 treat1
9      42 treat1
10     31 treat1
11     30 treat2
12     28 treat2
13     25 treat2
14     22 treat2
15     33 treat2
> colnames(correct.data) = c("scores","groups")
> head(correct.data)
  scores groups
1     22  contr
2     18  contr
3     25  contr
4     25  contr
5     20  contr
6     32 treat1

And there you go. Now you have a proper data frame. There is also an unstack() function that does the reverse of this, and it will work automatically on a data frame that has been created by stack(), but otherwise is a little trickier to use. You probably won't have to use it much, so I'll refer you to the help page if you ever need it (and good luck to you!).

You can remove these data objects. We won't use them again.

Going From Wide to Long and Long to Wide (eventually you'll probably need to know this)

I mentioned this above under "An Ambiguous Case." There are two kinds of data
frames in R, and in most statistical software: wide ones and long ones. If you
deleted the "anorexia" data frame from your workspace, it's easy enough to get
back. Here's how to fetch
the "anorexia" data again (and we'll do it without attaching the MASS
package this time).
> data(anorexia, package="MASS")

What we are about to do is a little confusing until you get some experience with it, so it will be necessary to be able to see what's happening. The anorexia data frame is too long to print to a single console screen without causing it to scroll, so I'm going to cut it down to only nine cases, three from each group. This will help us to see the difference between wide and long data frames without constantly scrolling the console window.

> anor = anorexia[c(1,2,3,27,28,29,56,57,58),]
> anor
   Treat Prewt Postwt
1   Cont  80.7   80.2
2   Cont  89.4   80.1
3   Cont  91.8   86.4
27   CBT  80.5   82.2
28   CBT  84.9   85.6
29   CBT  81.5   81.4
56    FT  83.8   95.2
57    FT  83.3   94.3
58    FT  86.0   91.5

I also shortened up the name of our data frame, because we're going to be typing it a lot. This is a wide data frame (wide format). It's wide because each line of the data frame contains information on ONE SUBJECT, even though that subject was measured multiple times (twice) on weight (Prewt, Postwt). So all the data for each subject goes on ONE LINE, even though we could interpret this as a repeated measures design, or longitudinal data.

In a long format data frame, each value of weight (if we consider that as a single variable) would define a case. So each of these subjects would have two lines in such a data frame, one for the subject's Prewt, and one for her Postwt. A wide data frame would be used, for example, in analysis of covariance. A long data frame would be used in repeated measures analysis of variance.

Do we have to retype the data frame to get from wide to long? Fortunately not! Because R has a function called reshape() that will do the work for us. It is not an easy function to understand, however (and don't count on the
help page being a whole lot of help!). So let me illustrate it, and then I will
explain what's happening.
> reshape(data=anor, direction="long",
+         varying=c("Prewt","Postwt"), v.names="Weight",
+         idvar="subject", ids=row.names(anor),
+         timevar="PrePost", times=c("Prewt","Postwt")
+        ) -> anor.long
> anor.long
          Treat PrePost Weight subject
1.Prewt    Cont   Prewt   80.7       1
2.Prewt    Cont   Prewt   89.4       2
3.Prewt    Cont   Prewt   91.8       3
27.Prewt    CBT   Prewt   80.5      27
28.Prewt    CBT   Prewt   84.9      28
29.Prewt    CBT   Prewt   81.5      29
56.Prewt     FT   Prewt   83.8      56
57.Prewt     FT   Prewt   83.3      57
58.Prewt     FT   Prewt   86.0      58
1.Postwt   Cont  Postwt   80.2       1
2.Postwt   Cont  Postwt   80.1       2
3.Postwt   Cont  Postwt   86.4       3
27.Postwt   CBT  Postwt   82.2      27
28.Postwt   CBT  Postwt   85.6      28
29.Postwt   CBT  Postwt   81.4      29
56.Postwt    FT  Postwt   95.2      56
57.Postwt    FT  Postwt   94.3      57
58.Postwt    FT  Postwt   91.5      58

In this example, the first argument I gave to the reshape() function was the name of the data frame to be reshaped, and that was given in the data= argument. Then I specified the direction= argument as "long" so that the data frame would be converted TO a long format.

In the second line of this command, I specified varying= as a vector of variable names in anor that correspond to the repeated measures or longitudinal measures (the time-varying variables). These values will be given in one column in the new data frame, so I named that new column using the v.names= argument.

A long data frame needs two things that a wide one does not have. One of those things is a column identifying the subject (case or experimental unit) from which the data in a row of the data frame come. This is necessary because each subject will have multiple rows of data in a long data frame. So I used the idvar= argument to specify the name of this new column that would identify the subjects. I then used ids= to specify how the subjects were to be named. I told it to use the row names from anor, which is a sensible thing to do.
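To see why the subject column matters, here is a small sketch on a made-up two-subject data frame. The names wide, long, grp, pre, post, score, and when are all hypothetical. The call uses the same arguments as the command above, including timevar= and times=, which are explained next.

```r
# Hypothetical wide data: one row per subject, two measurements each
wide <- data.frame(grp  = c("a", "b"),
                   pre  = c(10, 12),
                   post = c(11, 15))

long <- reshape(data = wide, direction = "long",
                varying = c("pre", "post"), v.names = "score",
                idvar = "subject", ids = row.names(wide),
                timevar = "when", times = c("pre", "post"))

long                  # now two rows per subject
table(long$subject)   # each subject id appears once per measurement
```

Without the subject column there would be no way to tell which two rows belong to the same person.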
The other thing a long format data frame needs that a wide one does not is a variable giving the condition (or time) in which the subject is being measured for this particular row of data. In the wide format, this information is in the column (variable) names, but that will no longer be true in the long format. We need to know which measure is Prewt and which measure is Postwt for each subject, since these will be on different rows of the data frame in long format. I named this new variable using the timevar= argument, and I gave its possible values in a vector using the times= argument. The order in which those values should be listed is the same as the order in which the corresponding columns occur in the wide data frame. Finally, I closed the parentheses on the reshape() function and assigned the output to a new data object. Done! Whew! This can also be made to work if you have more than one repeated measures variable, in which case all I can say is may the saints be with you! Surely there must be an easier syntax for this!! If the data frame results from a
reshape() command, then it can be converted back very simply.
All you have to do is this.
> reshape(anor.long)
         Treat subject Prewt Postwt
1.Prewt   Cont       1  80.7   80.2
2.Prewt   Cont       2  89.4   80.1
3.Prewt   Cont       3  91.8   86.4
27.Prewt   CBT      27  80.5   82.2
28.Prewt   CBT      28  84.9   85.6
29.Prewt   CBT      29  81.5   81.4
56.Prewt    FT      56  83.8   95.2
57.Prewt    FT      57  83.3   94.3
58.Prewt    FT      58  86.0   91.5

The row names have gone a little screwy, but all the correct information is there. This isn't very useful actually, because we already have the data in wide format in the data frame anor, which we were smart enough not to overwrite. So let's see how to convert from long to wide the hard way. First, we will get rid of those ridiculous row names.
> rownames(anor.long) <- as.character(1:18)   # Just do it!
> anor.long
   Treat PrePost Weight subject
1   Cont   Prewt   80.7       1
2   Cont   Prewt   89.4       2
3   Cont   Prewt   91.8       3
4    CBT   Prewt   80.5      27
5    CBT   Prewt   84.9      28
6    CBT   Prewt   81.5      29
7     FT   Prewt   83.8      56
8     FT   Prewt   83.3      57
9     FT   Prewt   86.0      58
10  Cont  Postwt   80.2       1
11  Cont  Postwt   80.1       2
12  Cont  Postwt   86.4       3
13   CBT  Postwt   82.2      27
14   CBT  Postwt   85.6      28
15   CBT  Postwt   81.4      29
16    FT  Postwt   95.2      56
17    FT  Postwt   94.3      57
18    FT  Postwt   91.5      58

And now for the reshaping. I won't bother storing it.

> reshape(data=anor.long, direction="wide",
+         v.names=c("Weight"),
+         idvar="subject",
+         timevar="PrePost"
+        )
  Treat subject Weight.Prewt Weight.Postwt
1  Cont       1         80.7          80.2
2  Cont       2         89.4          80.1
3  Cont       3         91.8          86.4
4   CBT      27         80.5          82.2
5   CBT      28         84.9          85.6
6   CBT      29         81.5          81.4
7    FT      56         83.8          95.2
8    FT      57         83.3          94.3
9    FT      58         86.0          91.5

We didn't quite recover the original table, but then we probably didn't really want to. The first two arguments name the data frame we are reshaping and tell the direction we are reshaping TO. The next argument, v.names=, gives the name of the time-varying variable that will be split into two (or more) columns. The idvar= argument gives the name of the variable that is the subject identifier. Finally, the timevar= argument gives the name of the variable that contains the conditions under which the longitudinal information was collected; i.e., there were two weights, a Prewt and a Postwt. Notice these values were used to name the two new columns of Weight data. Want a mnemonic to help you remember all that? Yeah, me too!

Clean up. Get rid of everything EXCEPT my.data.

Working With Spreadsheets and CSV Files (optional)

Even if you can get the spreadsheet-like data editing interface in R to work for you, it's still really no great shakes. Even when I'm in Windows (where it works), I use a spreadsheet to manage my data files, especially larger ones. I'm going to type the data in my.data into a spreadsheet. I use OpenOffice Calc.
You can use whatever. ![]() At this point, I can copy and paste any one of those columns into scan(). That's handy, but it's not why I created the spreadsheet. (Notice that Calc wouldn't let me type SAT into a column header. It kept insisting it was Sat, abbreviation for Saturday. I HATE software that thinks it knows what I want! That's why I don't use Excel, but the open source spreadsheets are getting just as bad. Don't presume you know what I want. JUST DO WHAT I'M TELLING YOU!!! Gasp! I am so sick and tired of software--and operating systems--being written for morons! My computer is not a phone, it's not an iPad, it's a computer. Stop turning it into a toy! And if I want SAT to be Sat, I'll damn well type it that way! There! That will do no good whatsoever, but at least I vented.) Now I'm going to save that as a CSV file. (And once again, my computer will nag the crap out of me--in Excel especially. "Are you really sure you want to do this? You're going to lose your formatting." Just do what I tell you to do and SHUT UP! Anyone who doesn't know what a CSV file is and that it contains no formatting can use a damn typewriter!) To do that I pulled down the File menu (you're on your own with that idiot ribbon bar in Excel!), I chose Save As..., and in the dialog box I entered mydata.csv as the file name, I specified where to save (Desktop--I'll deal with it from there), and I chose File type: Text CSV. I chose to edit the filter settings because who knows how they might have them set? Then I clicked Save. Then it nagged me, and I clicked Keep Current Format (because there was no choice that said Do What I Tell You--I'm The Human Here!). In the filter settings I made sure the Field delimiter was set to a comma (which is what it should always be because, hey, CSV means comma separated values), and I removed the Text delimiter. Then I clicked OK. Then I clicked away another warning popup. 
(See how hard they make this because every idiot has to be able to use a computer these days?) Here's what the CSV file actually looks like, and if you don't want to have
to deal with a nagging spreadsheet, you can just type this into a plain text
editor. You can even save it with a .txt extension. R won't care. ("THANK YOU"
to the people who write the R software for not treating me like I'm
feebleminded!)
rownames,age,wgt,race,year,"SAT"
Bob,21,180,Cauc,Jr,1080
Fred,18,156,Af.Am,Fr,1210
Barb,18,128,Af.Am,Fr,840
Sue,24,118,Cauc,Sr,1340
Jeff,20,202,Asian,So,880

Drop this file into your working directory (Rspace), and then read it into R like this.

> my.newdata = read.csv(file="mydata.csv", row.names="rownames")
> # notice there is no annoying message telling you this has been done!
> my.newdata
     age wgt  race year  SAT
Bob   21 180  Cauc   Jr 1080
Fred  18 156 Af.Am   Fr 1210
Barb  18 128 Af.Am   Fr  840
Sue   24 118  Cauc   Sr 1340
Jeff  20 202 Asian   So  880

Yay! R even dealt with those annoying quotes around SAT. Since "rownames" was the first column in the file, you could also have set that option as row.names=1.

Now suppose somehow your CSV file gets some whitespace in it. This could happen due to mistyping in the spreadsheet, or because you typed it that way intentionally into a text editor. (It would be easier in that case just to leave out the commas and use read.table.) (NOTE: SPSS data files tend to have variable names and value labels padded with white space, an idiot programming practice if ever there was one!) If the file looks something like this...

rownames, age, wgt, race, year, SAT
Bob, 21, 180, Cauc, Jr, 1080
Fred, 18, 156, Af.Am, Fr, 1210
Barb, 18, 128, Af.Am, Fr, 840
Sue, 24, 118, Cauc, Sr, 1340
Jeff, 20, 202, Asian, So, 880

Do this...

> my.newdata = read.csv(file="mydata.csv", row.names=1, strip.white=T)

I personally think using a spreadsheet for a data file this small would be like driving in a tack with a sledge hammer, but it's up to you. A spreadsheet comes in very handy for dealing with large data files, however.

Before you quit today, clean everything out of your workspace EXCEPT my.data, then save the workspace when you quit.

revised 2016 January 21