P.Mean: Data management in R versus SAS (created 2012-08-27).


Someone on LinkedIn was arguing that data management is easier in SAS than in R. Claims like this are often subjective: what is easier for one person may be more difficult for another. You also have to decide what "easier" means: more efficient (uses less computer time), faster to program, less likely to need debugging, or simpler for a non-statistician to understand and cross-check. It's probably some mix of these. So anyway, this person on LinkedIn challenged the group to come up with a "simple" way in R to replicate a common data management scenario. Here's my response to that challenge.

Here's the description of the problem.

Derive treatment-emergent AEs from the original AE dataset. I assume severity code is never missing. Toss out AE records where the severity did not get worse after baseline, and keep only one AE record for each medical code (and person, of course). Severity is 1, 2, 3 for mild, medium, and severe. Bodysys and prefterm are "Body System" and "Preferred Term", your medical coding.

And here's his code.

data base1 ;
set ae ;
if visit<=2 ;

proc sort data=base1 ;
by patid bodysys prefterm severity ;

data base2 ;
set base1 ;
by patid bodysys prefterm severity ;
if last.prefterm ;
rename severity=severe_b ;

data post1 ;
set ae ;
if visit>2 ;

proc sort data=post1 ;
by patid bodysys prefterm severity ;

data post2 ;
set post1 ;
by patid bodysys prefterm severity ;
if last.prefterm ;
rename severity=severe_p ;

data ae2 ;
merge base2 post2 ;
by patid bodysys prefterm ;
if severe_p > severe_b or severe_b=. ;

If you ask me, the SAS code looks pretty cryptic. I'm assuming that visit<=2 means that there are multiple baseline measurements. Why not show explicitly that the baseline measurements are visits 1 and 2? Similarly, the followup visits should be designated explicitly as 3, 4, and 5. (I'm making an assumption here, but since the SAS code is so cryptic, I'm forced to make that assumption.)

Also, what's the purpose of "if last.prefterm"? As I remember, it is to ensure that the severity chosen is the worst severity. You could simplify things a bit here by using the output statement in proc means, perhaps. But let's not quibble too much. This is intended as a hypothetical example.

Okay, here's a solution in R.

Assume that ae is a four-dimensional array with dimensions:
1: patid
2: bodysys
3: prefterm
4: visit

Here's the R code.

baseline.dat <- ae[ , , ,1:2]
baseline.max <- apply(baseline.dat,1:3,max)

followup.dat <- ae[ , , ,3:5]
followup.max <- apply(followup.dat,1:3,max)

decline.dat <- followup.max - baseline.max
decline.dat[decline.dat<=0] <- NA

Now, I'm "cheating" here because unlike the statement in the problem there will be lots of missing data. Not all patients will have exactly 5 visits and not all patients will have symptoms on all body systems. I presume that some of the missing data is implicit in the lack of all combinations of patid bodysys prefterm and visit in the original data set. If you put NA in for those combinations, add an argument na.rm=TRUE to the two apply functions, everything should work fine. If you need to store both the baseline and followup severity rather than the difference, the code needs a bit more modification.

Now, let's compare the two programs.

My program is probably less efficient because it stores all possible combinations of the four factors. It may be GROSSLY inefficient, depending on how many combinations of the factors are possible. Without more details I can only guess. I think my program was a bit faster to write (fewer lines, anyway). Certainly it is easier to debug. I also think it is simpler for a non-statistician to understand and cross-check. The sticking point might be the apply function. But ask yourself if apply is more cryptic than last.prefterm. For what it's worth, I followed the SAS example of not using any comment statements, but in any programming language, a sufficient number of comment statements will make even a very cryptic program easy to follow.

These are all subjective calls, of course. You also need to factor in the time needed to convert from a two-dimensional file to a four-dimensional array, or vice versa. The reshape package can vastly simplify this.
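For example, here's a rough sketch using the reshape2 package (a later rewrite of reshape). The data frame name ae.long and its column names are my assumptions about how the raw data might be laid out:

library(reshape2)
# long data frame to four-dimensional array; combinations with no record come back as NA
ae <- acast(ae.long, patid ~ bodysys ~ prefterm ~ visit, value.var="severity")
# and back again
ae.long2 <- melt(ae, varnames=c("patid","bodysys","prefterm","visit"), value.name="severity")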

Now, let's assume that you are constrained to solve the problem using only two-dimensional arrays. You need to create a new variable that is a combination of the three factors:

combo <- paste(patid, bodysys, prefterm, sep=",")

Later, when you need to split these apart, the strsplit function should work nicely.
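Something along these lines should do it (the object name pieces is just for illustration):

pieces <- do.call(rbind, strsplit(as.character(combo), split=","))
# each column comes back as character; convert patid back to numeric if needed
patid <- pieces[ ,1]
bodysys <- pieces[ ,2]
prefterm <- pieces[ ,3]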

Then assume that ae is a data frame with three variables, combo, visit, and severity.

# worst severity during the baseline visits, one row per patid/bodysys/prefterm combination
tmp1 <- ae[ae$visit <= 2, ]
baseline.max <- tapply(tmp1$severity, tmp1$combo, max)
baseline.dat <- data.frame(combo=names(baseline.max), severity.baseline=as.vector(baseline.max))

# worst severity during the followup visits
tmp2 <- ae[ae$visit > 2, ]
followup.max <- tapply(tmp2$severity, tmp2$combo, max)
followup.dat <- data.frame(combo=names(followup.max), severity.followup=as.vector(followup.max))

# merge and keep only the AEs that got worse
tmp <- merge(baseline.dat, followup.dat)
decline.dat <- tmp[tmp$severity.followup > tmp$severity.baseline, ]

I haven't run this code, but you get the general idea. You might be able to simplify it by creating a new variable:

tim <- ifelse(ae$visit <=2,"Baseline","Followup")

though this encourages you to recast the problem as an array with more than two dimensions again. You end up with an array holding the maximum baseline severities in one column (or layer) and the maximum followup severities in another. I'm not sure whether that is good or bad here, but a good data analyst should not feel restricted by the rectangular grid that SAS imposes on you. The more choices you have, the better.
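Here's a rough sketch of that approach (sev.max and decline are names I made up); tapply with two grouping factors does the recasting in one step:

# rows are the patid/bodysys/prefterm combinations, columns are Baseline and Followup
sev.max <- tapply(ae$severity, list(ae$combo, tim), max)
decline <- sev.max[ ,"Followup"] - sev.max[ ,"Baseline"]
decline[decline <= 0] <- NA

Notice that this version needs no sorting and no merge.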

This approach loses some of the simplicity of the arrays, but that's the price you pay when you force all of your data into a rectangular grid. It does have the advantage of being more efficient than the previous R example, since it does not have to store every possible combination of the factors.

So is R better than SAS for data management? It depends on a lot of factors, and it would not be too hard to find examples where SAS shines and R chokes (and vice versa). But for this example, I believe that R is clearly superior. Why? Two reasons. First, pulling out the maximum value is a single line in R (apply or tapply). In SAS, you have to sort and rely on the last. variable, or use the output statement in proc means. Second, you don't have to sort the files before merging them.

If you free yourself from the restriction of thinking of every problem as fitting into a rectangular grid with observations and variables, then you can gain even more efficiencies in R. The solution with the four-dimensional array, for example, does not even need to merge the data.

This page was written by Steve Simon and is licensed under the Creative Commons Attribution 3.0 United States License.