Intro, Data, Reshape, Plots (WISC)
This session we shall work through some basics. The general idea here is to get a set of scripts in place that can be used whenever one obtains and is getting familiar with a new repeated measures data set.
Prelim - Loading libraries used in this script.
library(psych) #for general use functions
library(car) #for general use functions
library(GGally) #for a specific plot
library(lattice) #for plotting
library(tidyverse) #ggplot2, stringr, dplyr, readr, forcats, tidyr, purrr
Reading in Repeated Measures Data
The first step, and often the most frustrating part of data analysis, is … getting the data into the program!
Recently, the trend is that everyone exchanges files in .csv format. This usually works well and is a good practice to follow.
Our example makes use of data from McArdle and colleagues that have been used in many papers. These are classic data from Osborne & Suddick (1972) as described in McArdle & Epstein (1987), McArdle (1988), McArdle & Aber (1990), McArdle & Nesselroade (1994), McArdle (1990, 2001)…. They are the basis for the Longitudinal Research Institute Workshops, and are used here with permission.
Getting the data into the program
To get the data, we set the working directory and read in a data file. (Note that in R markdown documents, these must be done in the same code chunk).
#Setting the working directory
setwd("~/Desktop/") #Collaborator 1's Computer
#setwd("~/Desktop/") #Collaborator 2's Computer
# Reading in the Data
#set filepath for data file
filepath <- "https://raw.githubusercontent.com/The-Change-Lab/collaborations/main/GrowthModeling/wisc3raw.csv"
#read in the .csv file using the url() function
wisc3raw <- read.csv(file=url(filepath), header=TRUE)
Note that the working directory has been set to a specific location
on Collaborator 1’s computer. For convenience, an additional
setwd() line is included that is commented out (using
# - or can comment blocks of text using Cmd-Shift-c within
RStudio). This set-up allows for easy passing back and forth among
multiple collaborators (or across home and work computers) who have the
data housed in an idiosyncratic location. If seeking help with your
scripts, it may be useful to send the script and data file, along with
this kind of set-up. It makes it much easier to give help when the data
read-in process is easy.
For reading in other file types see … http://www.statmethods.net/input/importingdata.html.
Clean-up of data file (“data hygiene”)
Once the data are in, there are often a few bits of clean-up that need to be done to make a file easy to work with. Driven by necessity, individual R developers have made a wide variety of convenience functions. The trick is finding them. If you think that someone else may have ever had to do something similar, use a search engine to find it.
For example, R is case-sensitive. Sometimes variables are named with a mix of lower-case and upper-case letters. Depending on the fluidity of these names, it is sometimes convenient to rename the variables to all lower case. The consistency of naming often reduces typographical coding errors (although the front-end engines, like R-Studio, are quite good at helping now).
Let's have a look at the variable names in our example data.
## [1] "id" "verb1" "verb2" "verb4" "verb6" "perfo1"
## [7] "perfo2" "perfo4" "perfo6" "info1" "comp1" "simi1"
## [13] "voca1" "info6" "comp6" "simi6" "voca6" "momed"
## [19] "grad" "constant"
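The listing above is what a call to names() (or equivalently colnames()) produces:

```r
#list the variable names in the data frame
names(wisc3raw)
```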
Sidebar - a little trick to get the variable names into a format that is easy to cut and paste back into R code. This is super helpful when working with lots of names.
## c("id", "verb1", "verb2", "verb4", "verb6", "perfo1", "perfo2",
## "perfo4", "perfo6", "info1", "comp1", "simi1", "voca1", "info6",
## "comp6", "simi6", "voca6", "momed", "grad", "constant")
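The pasteable output above is what dput() produces; it prints any R object as code that recreates it, which is handy for building variable-name vectors:

```r
#print the variable names as pasteable R code
dput(colnames(wisc3raw))
```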
Sometimes data have a few variables that have upper-case names. Everything can be converted to lower-case …
# Set all variable names to lowercase
var_names <- tolower(colnames(wisc3raw))
colnames(wisc3raw) <- var_names
A few other idiosyncratic tendencies of mine are (a) the naming of
the analysis unit (e.g., person, family), because my collection of
scripts is written in a particular way, and (b) the ordering of the
variables so that id and time are always in the left-most columns,
because I find this easier when trying to locate specific observations
while scrolling through the data file. Thus, I often change the
identification variable to id and reorder the columns. In
these example data, the id variable is already named as
such (e.g., http://www.statmethods.net/management/variables.html),
and the column ordering is ok (e.g., google search for “reordering
columns in R”).
Make sure you know what is going on with missing data values - and recode as necessary. (e.g., http://www.statmethods.net/input/missingdata.html)
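As a sketch of such recoding (the wisc3raw data are complete, so the -99 code here is purely an illustration):

```r
#example: recoding a numeric missing-data code (here -99) to NA
#-99 is a hypothetical code for illustration only
scores <- c(24.42, -99, 39.61, 55.64)
scores[scores == -99] <- NA
scores
```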
Outputting a clean data file for future use
Depending on work-flow, it may be useful to create a “working file” - both for use with R, and for passing to other programs.
To output a comma delimited data file …
#Output a comma delimited data file;
write.csv(wisc3raw, file="NewData.csv", row.names=FALSE, na="")
Note that by default the write.csv() function will include an extra column of row numbers and will notate missing data with an NA. Both are inconvenient for other programs, so we have included the row.names=FALSE and na="" options to keep the .csv file “clean”.
To output some other file types see … http://www.statmethods.net/input/exportingdata.html
Manipulating the Repeated Measures Data (Long Data and Wide Data)
Behavioral science tends to use relational data structures - in basic form, spreadsheets. Typically the data are stored/configured as a data frame (a “fancy” matrix) with multiple rows and multiple columns. Two main schema are used to accommodate repeated measures data - “Wide Format” and “Long Format”. Different analysis and plotting functions require different kinds of data input. Thus, it is imperative that one can convert the data back and forth between wide and long formats.
There are lots of ways to do this. We illustrate one way.
First, let's subset down to the variables we need.
#making a vector of variable names (id, time-varying, time-invariant)
var_names_sub <- c("id",
"verb1","verb2","verb4","verb6",
"perfo1","perfo2","perfo4","perfo6",
"momed","grad")
#subsetting
wiscraw <- wisc3raw[,var_names_sub]
#looking at the data
head(wiscraw)
## id verb1 verb2 verb4 verb6 perfo1 perfo2 perfo4 perfo6 momed grad
## 1 1 24.42 26.98 39.61 55.64 19.84 22.97 43.90 44.19 9.5 0
## 2 2 12.44 14.38 21.92 37.81 5.90 13.44 18.29 40.38 5.5 0
## 3 3 32.43 33.51 34.30 50.18 27.64 45.02 46.99 77.72 14.0 1
## 4 4 22.69 28.39 42.16 44.72 33.16 29.68 45.97 61.66 14.0 1
## 5 5 28.23 37.81 41.06 70.95 27.64 44.42 65.48 64.22 11.5 0
## 6 6 16.06 20.12 38.02 39.94 8.45 15.78 26.99 39.08 14.0 1
Reshape Wide to Long
One way to go from wide to long is using the
reshape() function (not to be confused with the
reshape2 package).
#reshaping wide to long
wisclong <- reshape(data=wiscraw,
varying = c("verb1","verb2","verb4","verb6",
"perfo1","perfo2","perfo4","perfo6"),
timevar=c("grade"),
idvar=c("id"),
direction="long", sep="")
#sorting for easy viewing
#reorder by id and grade
wisclong <- wisclong[order(wisclong$id,wisclong$grade), ]
#looking at the data
head(wisclong, 8)
## id momed grad grade verb perfo
## 1.1 1 9.5 0 1 24.42 19.84
## 1.2 1 9.5 0 2 26.98 22.97
## 1.4 1 9.5 0 4 39.61 43.90
## 1.6 1 9.5 0 6 55.64 44.19
## 2.1 2 5.5 0 1 12.44 5.90
## 2.2 2 5.5 0 2 14.38 13.44
## 2.4 2 5.5 0 4 21.92 18.29
## 2.6 2 5.5 0 6 37.81 40.38
BE CAREFUL WITH RESHAPE. THE ARGUMENTS NEEDED ARE NOT ALWAYS STRAIGHTFORWARD.
Make sure that the missing data has been treated properly.
Reshape Long to Wide
One way to go from long to wide is using the
reshape() function (not to be confused with the
reshape2 package).
#reshaping long to wide
wiscwide <- reshape(data=wisclong,
timevar=c("grade"),
idvar=c("id"),
v.names=c("verb","perfo"),
direction="wide", sep="_")
#reordering columns for easy viewing
wiscwide <- wiscwide[,c("id",
"verb_1","verb_2","verb_4","verb_6",
"perfo_1","perfo_2","perfo_4","perfo_6",
"momed","grad")]
#looking at the data
head(wiscwide)
## id verb_1 verb_2 verb_4 verb_6 perfo_1 perfo_2 perfo_4 perfo_6 momed grad
## 1.1 1 24.42 26.98 39.61 55.64 19.84 22.97 43.90 44.19 9.5 0
## 2.1 2 12.44 14.38 21.92 37.81 5.90 13.44 18.29 40.38 5.5 0
## 3.1 3 32.43 33.51 34.30 50.18 27.64 45.02 46.99 77.72 14.0 1
## 4.1 4 22.69 28.39 42.16 44.72 33.16 29.68 45.97 61.66 14.0 1
## 5.1 5 28.23 37.81 41.06 70.95 27.64 44.42 65.48 64.22 11.5 0
## 6.1 6 16.06 20.12 38.02 39.94 8.45 15.78 26.99 39.08 14.0 1
BE CAREFUL WITH RESHAPE. THE ARGUMENTS NEEDED ARE NOT ALWAYS STRAIGHTFORWARD.
Also, make sure that the missing data has been treated properly.
Note: A more general reshaping solution is provided by Hadley Wickham’s reshape2 package through the melt() and cast() functions (these have since been superseded by tidyr’s pivot_longer() and pivot_wider()). It is very useful when one has complicated data structures. Always keep track of the length of your data, so that no observations get lost or merged improperly. We have not yet done this here, but in general everything seems to be moving toward the tidy universe, so that will be useful to do eventually.
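For reference, a tidyverse sketch of the same wide-to-long and long-to-wide conversions, using tidyr (loaded with the tidyverse above). The ".value" special name splits columns like verb1 into a measure (verb) and a grade:

```r
#tidyverse equivalents of the reshape() calls above
#wide to long with pivot_longer()
wisclong_tidy <- wiscraw %>%
  pivot_longer(cols = c(starts_with("verb"), starts_with("perfo")),
               names_to = c(".value", "grade"),
               names_pattern = "([a-z]+)([0-9])") %>%
  mutate(grade = as.numeric(grade))
#long to wide with pivot_wider()
wiscwide_tidy <- wisclong_tidy %>%
  pivot_wider(names_from = grade,
              values_from = c(verb, perfo),
              names_sep = "_")
```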
Ok - now we have both wide and long format data sets and can start describing the data.
Yay!
Describing the Data
Once the wide and long data sets are in place, we can begin describing and plotting the data.
I cannot emphasize enough the importance of these steps for understanding one’s data!
Some descriptives and plots are produced from wide data, some from long data. Having both in place facilitates learning about the data.
Continually keep in mind what portions of the data-box are being described (e.g., persons, variables, occasions).
Basic Descriptives of All the Data
Most often we are using the functions in the psych and
ggplot2 packages.
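The two summaries below were presumably produced with psych::describe() applied to the wide and long files, respectively:

```r
#descriptives for all variables in the wide file
psych::describe(wiscwide)
#descriptives for all variables in the long file
psych::describe(wisclong)
```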
## vars n mean sd median trimmed mad min max range skew
## id 1 204 102.50 59.03 102.50 102.50 75.61 1.00 204.00 203.00 0.00
## verb_1 2 204 19.59 5.81 19.34 19.50 5.41 3.33 35.15 31.82 0.13
## verb_2 3 204 25.42 6.11 25.98 25.40 6.57 5.95 39.85 33.90 -0.06
## verb_4 4 204 32.61 7.32 32.82 32.42 7.18 12.60 52.84 40.24 0.23
## verb_6 5 204 43.75 10.67 42.55 43.46 11.30 17.35 72.59 55.24 0.24
## perfo_1 6 204 17.98 8.35 17.66 17.69 8.30 0.00 46.58 46.58 0.35
## perfo_2 7 204 27.69 9.99 26.57 27.34 10.51 7.83 59.58 51.75 0.39
## perfo_4 8 204 39.36 10.27 39.09 39.28 10.04 7.81 75.61 67.80 0.15
## perfo_6 9 204 50.93 12.48 51.76 51.07 13.27 10.26 89.01 78.75 -0.06
## momed 10 204 10.81 2.70 11.50 11.00 2.97 5.50 18.00 12.50 -0.36
## grad 11 204 0.23 0.42 0.00 0.16 0.00 0.00 1.00 1.00 1.30
## kurtosis se
## id -1.22 4.13
## verb_1 -0.05 0.41
## verb_2 -0.34 0.43
## verb_4 -0.08 0.51
## verb_6 -0.36 0.75
## perfo_1 -0.11 0.58
## perfo_2 -0.21 0.70
## perfo_4 0.59 0.72
## perfo_6 0.18 0.87
## momed 0.01 0.19
## grad -0.30 0.03
## vars n mean sd median trimmed mad min max range skew
## id 1 816 102.50 58.93 102.50 102.50 75.61 1.00 204.00 203.00 0.00
## momed 2 816 10.81 2.69 11.50 11.00 2.97 5.50 18.00 12.50 -0.36
## grad 3 816 0.23 0.42 0.00 0.16 0.00 0.00 1.00 1.00 1.31
## grade 4 816 3.25 1.92 3.00 3.19 2.22 1.00 6.00 5.00 0.28
## verb 5 816 30.34 11.86 28.46 29.39 11.33 3.33 72.59 69.26 0.71
## perfo 6 816 33.99 16.14 33.14 33.34 18.14 0.00 89.01 89.01 0.34
## kurtosis se
## id -1.20 2.06
## momed 0.03 0.09
## grad -0.28 0.01
## grade -1.43 0.07
## verb 0.33 0.42
## perfo -0.43 0.56
Note the n.
What slices of the box are being used to make the calculations?
Basic Descriptives for the Repeated Measures Data
Now, just concentrating on the repeated measures of verbal ability.
Sample-level descriptives = all persons and occasions (Verbal Ability)
This step is useful to get a general view of what verbal ability scores look like, but note that we are ignoring time, and not considering how the repeated measures are nested within individuals.
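The pooled summary below can be obtained by applying describe() to the stacked variable in the long file:

```r
#sample-level descriptives, pooling all persons and occasions
psych::describe(wisclong$verb)
```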
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 816 30.34 11.86 28.46 29.39 11.33 3.33 72.59 69.26 0.71 0.33
## se
## X1 0.42
#histogram
wisclong %>%
ggplot(aes(x=verb)) +
geom_histogram(aes(y=after_stat(density)),
binwidth=2.5, fill="white", color="black") +
geom_density(color="red") +
xlab("Verbal Ability (Grade 1 to 6)") + ylab("Density")
Sample-level descriptives across Time (Verbal Ability)
Note that our variable is actually “multivariate” because we have repeated measures.
We should really consider the time-splits when we are doing descriptives.
Thus, we are interested in Verbal Ability across Time = all persons faceted by grade.
Note that we start descriptives with MEANS and VARIANCES
#sample descriptives by occasion
#in the wide file
psych::describe(wiscwide[,c("verb_1","verb_2","verb_4","verb_6")])
## vars n mean sd median trimmed mad min max range skew
## verb_1 1 204 19.59 5.81 19.34 19.50 5.41 3.33 35.15 31.82 0.13
## verb_2 2 204 25.42 6.11 25.98 25.40 6.57 5.95 39.85 33.90 -0.06
## verb_4 3 204 32.61 7.32 32.82 32.42 7.18 12.60 52.84 40.24 0.23
## verb_6 4 204 43.75 10.67 42.55 43.46 11.30 17.35 72.59 55.24 0.24
## kurtosis se
## verb_1 -0.05 0.41
## verb_2 -0.34 0.43
## verb_4 -0.08 0.51
## verb_6 -0.36 0.75
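The by-grade summaries that follow were presumably produced from the long file with describeBy(), which splits the descriptives by a grouping variable:

```r
#sample descriptives by occasion, in the long file
psych::describeBy(wisclong$verb, group = wisclong$grade)
```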
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 204 19.59 5.81 19.34 19.5 5.41 3.33 35.15 31.82 0.13 -0.05 0.41
## ------------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 204 25.42 6.11 25.98 25.4 6.57 5.95 39.85 33.9 -0.06 -0.34 0.43
## ------------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 204 32.61 7.32 32.82 32.42 7.18 12.6 52.84 40.24 0.23 -0.08 0.51
## ------------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 204 43.75 10.67 42.55 43.46 11.3 17.35 72.59 55.24 0.24 -0.36
## se
## X1 0.75
#histogram faceted by grade
wisclong %>%
ggplot(aes(x=verb)) +
geom_histogram(binwidth=5, pad = TRUE, fill="white", color="black") +
xlab("Verbal Ability") +
facet_grid(grade ~ .)
#boxplot by grade
#use factor() to convert "time" from numeric to categorical
wisclong %>%
ggplot(aes(x=factor(grade), y=verb)) +
geom_boxplot(notch = TRUE) +
stat_summary(fun="mean", geom="point", shape=23, size=3, fill="white") +
xlab("Grade") + ylab("Verbal Ability")
#Density distribution by grade
wisclong %>%
ggplot(aes(x=verb)) +
geom_density(aes(group=factor(grade),
colour=factor(grade), fill=factor(grade)), alpha=0.3) +
guides(colour="none",
fill=guide_legend(title="Grade")) +
labs(x="Verbal Ability", y="Density")
Notice in these plots how much “change” there is at the sample level across grades.
Is that expected?
Then to the COVARIANCES (Correlations)
Above, we looked at the means and variances. Because these are repeated measures, we also have covariances.
# Correlations
cor(wiscwide[,c("verb_1","verb_2","verb_4","verb_6")], use="complete.obs", method="spearman")
## verb_1 verb_2 verb_4 verb_6
## verb_1 1.0000000 0.7184755 0.7336648 0.6563948
## verb_2 0.7184755 1.0000000 0.7552318 0.7386984
## verb_4 0.7336648 0.7552318 1.0000000 0.8016574
## verb_6 0.6563948 0.7386984 0.8016574 1.0000000
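The covariances themselves can be obtained analogously with cov():

```r
#covariances among the repeated measures of verbal ability
cov(wiscwide[,c("verb_1","verb_2","verb_4","verb_6")], use="complete.obs")
```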
see … http://stats.stackexchange.com/questions/8071/how-to-choose-between-pearson-and-spearman-correlation … Spearman is monotonic (i.e., ranks), Pearson is linear.
Corresponding plot …
#see also ...
#http://gettinggeneticsdone.blogspot.com/2011/07/scatterplot-matrices-in-r.html
#look for ggally or gpairs
#from library(car)
car::scatterplotMatrix(~ verb_1 + verb_2 + verb_4 + verb_6,
data=wiscwide)
Note that none of these plots show mean or variance differences - all use automatic, data-based ranges.
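A similar scatterplot matrix can be made with GGally (loaded at the top but not yet used), which adds densities on the diagonal and correlations in the upper panels:

```r
#scatterplot matrix using GGally
GGally::ggpairs(wiscwide[,c("verb_1","verb_2","verb_4","verb_6")])
```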
Individual-level descriptives across Time (Verbal Ability)
Note that our interest is often in individual development, rather than sample development. We need to consider how each individual is changing over time.
Thus, we are interested in Verbal Ability across Time = individual persons.
Note that we tend to do this visually rather than by looking at numbers (a collection of individual vectors).
#Using library(lattice)
#Plotting intraindividual change
lattice::xyplot(verb ~ grade, groups=id,
data=wisclong, type="l",
main="Verbal Ability Trajectories")
#Using library(ggplot2) ... see also http://ggplot.yhathq.com/docs/index.html
#Plotting intraindividual change
wisclong %>%
ggplot(aes(x=grade, y=verb, group=id)) +
geom_point() +
geom_line() +
ylim(0,80) +
xlab("Grade") +
ylab("Verbal Ability") +
scale_x_continuous(breaks=seq(1,6, by=1))
Sometimes the “blob” gets too dense. This can be fixed by selecting only a subset of persons.
#plot a random subsample
#set seed for easy replication
set.seed(1234)
#select a sample of ids, n=20
#note: sampling from the long file's id column can draw the same id twice;
#sample(unique(wisclong$id), 20) would avoid duplicates
smallsamp <- sample(wisclong$id,20)
smallsamp
## [1] 71 26 156 162 100 25 26 182 151 82 20 68 96 46 144 1 166 138 53
## [20] 49
wisclong_randsub <- wisclong[wisclong$id %in% smallsamp, ]
#Plotting intraindividual change
wisclong_randsub %>%
ggplot(aes(x=grade, y=verb, group=id)) +
geom_point() +
geom_line() +
xlab("Grade") +
ylab("Verbal Ability") + ylim(0,80) +
scale_x_continuous(breaks=seq(1,6, by=1))
#plot the people with id<=20, in this case it is 20 persons
wisclong %>%
filter(id <= 20) %>%
ggplot(aes(x=grade, y=verb, group=id)) +
geom_point() +
geom_line() +
xlab("Grade") +
ylab("Verbal Ability") + ylim(0,80) +
scale_x_continuous(breaks=seq(1,6, by=1))
Let's clean up the aesthetics a bit.
wisclong %>%
filter(id <= 20 & !is.na(verb)) %>%
ggplot(aes(x=grade, y=verb, group=id, color=factor(id))) +
geom_point() +
geom_line() +
xlab("Grade") +
ylab("Verbal Ability") + ylim(0,80) +
scale_x_continuous(breaks=seq(1,6, by=1)) +
theme(legend.position = "none")
It is also sometimes useful to look at the collection of individual-level plots. Here we use the facet_wrap() functionality in ggplot.
#old-school style using library(lattice)
lattice::xyplot(verb~grade|id, data=wisclong_randsub, as.table=TRUE)
#ggplot2 style
wisclong %>%
filter(id <=20) %>%
ggplot(aes(x=grade, y=verb, group=id)) +
geom_point() +
geom_line(data=wisclong[wisclong$id <= 20 & !is.na(wisclong$verb), ]) +
xlab("Grade") +
ylab("Verbal Ability") + ylim(0,80) +
scale_x_continuous(breaks=seq(1,6, by=1)) +
facet_wrap( ~ id)
Some other aesthetics to get to the formal APA style (note: publication data viz is evolving).
#ggplot version .. see also http://ggplot.yhathq.com/docs/index.html
wisclong_randsub %>%
ggplot(aes(x=grade, y=verb, group=id)) +
geom_point() +
geom_line() +
xlab("Grade") +
ylab("WISC Verbal Score") + ylim(0,100) +
scale_x_continuous(breaks=seq(1,6,by=1)) +
#title
ggtitle("Intraindividual Change in Verbal Ability") +
#theme with white background
theme_classic() +
#increase font size of axis and point labels
theme(axis.title = element_text(size = rel(1.5)),
axis.text = element_text(size = rel(1.2)),
plot.title = element_text(hjust = 0.5, face = "bold"),
legend.position = "none")
Saving the plot file. See also … http://www.cookbook-r.com/Graphs/Output_to_a_file/
#ggsave(filename = default_name(plot), plot = last_plot(), device = default_device(filename),
# path = NULL, scale = 1, width = par("din")[1], height = par("din")[2],
# units = c("in", "cm", "mm"), dpi = 300, limitsize = TRUE, ...)
ggsave(filename = "wiscverbal.png", width = 5, height = 5, dpi=300)
Now we have a good set of strategies to apply when looking at new longitudinal data.
Thank You For Playing!