Merging in R (July 26, 2005)

This page is moving to a new website.

Merging is tricky in any program, and you have to be sure that you know what you are doing. I have found that keeping an index variable will allow me to backtrack and find information in the original files if needed. Here's a simple example of how this would work.

> # randomly select 20 letters without replacement.
> x0 <- sample(letters,20)
> x1 <- sample(letters,20)
> x0
[1] "k" "e" "f" "d" "c" "h" "z" "l" "w" "a" "p" "q" "y"
[14] "x" "v" "t" "m" "r" "b" "n"
> x1
[1] "w" "f" "g" "j" "d" "e" "o" "b" "c" "u" "v" "q" "x"
[14] "i" "n" "z" "m" "t" "k" "l"
> # The intersect functions lets us know which letters
> # are in both x0 and x1.
> intersect(x0,x1)
[1] "k" "e" "f" "d" "c" "z" "l" "w" "q" "x" "v" "t" "m"
[14] "b" "n"
> # The setdiff function lets us know which letters
> # are in x0 but not x1 (reverse the arguments to
> # get letters in x1 but not in x1).
> setdiff(x0,x1)
[1] "h" "a" "p" "y" "r"
> setdiff(x1,x0)
[1] "g" "j" "o" "u" "i"
> # By default, the merge function includes only those
> # values in common with the two files. Note that the
> # id1 and id2 values tells us that the first row of
> # x2 comes from the 19th and 8th rows of x0 and x1,
> # respectively. Notice also that the merged file is
> # sorted by the intersection of the two files.
> x2 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20))
> x2
   x id0 id1
 1 b  19   8
 2 c   5   9
 3 d   4   5
 4 e   2   6
 5 f   3   2
 6 k   1  19
 7 l   8  20
 8 m  17  17
 9 n  20  15
10 q  12  12
11 t  16  18
12 v  15  11
13 w   9   1
14 x  14  13
15 z   7  16
> # To include all values, including mismatches, specify
> # all=T. Notice now that the merged file is sorted by the
> # values of x0, followed by the sorted values of x1 that
> # are not found in x0. An NA for an index value tells you
> # that the letter is not found in the original x0 or x1
> # vector.
> x3 <- merge(list(x=x0,id0=1:20),list(x=x1,id1=1:20),all=T)
> x3
   x id0 id1
 1 a  10  NA
 2 b  19   8
 3 c   5   9
 4 d   4   5
 5 e   2   6
 6 f   3   2
 7 h   6  NA
 8 k   1  19
 9 l   8  20
10 m  17  17
11 n  20  15
12 p  11  NA
13 q  12  12
14 r  18  NA
15 t  16  18
16 v  15  11
17 w   9   1
18 x  14  13
19 y  13  NA
20 z   7  16
21 g  NA   3
22 i  NA  14
23 j  NA   4
24 o  NA   7
25 u  NA  10

Merging files when there are duplicate values for the key leads to some further trickiness.

> # Use the argument replace=T to get a few duplicate letters.
> y0 <- sample(letters,20,replace=T)
> y1 <- sample(letters,20,replace=T)
> y0
 [1] "r" "p" "d" "r" "l" "s" "t" "y" "z" "t" "a" "v" "m" "l"
[15] "u" "o" "n" "m" "x" "g"
> y1
 [1] "z" "g" "n" "r" "y" "n" "m" "v" "q" "b" "p" "t" "a" "c"
[15] "z" "l" "d" "x" "b" "r"
> # For the letters "l" and "m" which appear twice in y0,
> # each one is paired with the matching row in y1. For
> # the letter "n" which appears twice in y1, each one
> # is paired with the matching row in y0. For the letter
> # "r" which appears twice in both y0 and y1, each row
> # is paired with each other row, creating four rows in
> # the new file.
> y2 <- merge(list(y=y0,id0=1:20),list(y=y1,id1=1:20))
> y2
   y id0 id1
 1 a  11  13
 2 d   3  17
 3 g  20   2
 4 l   5  16
 5 l  14  16
 6 m  13   7
 7 m  18   7
 8 n  17   3
 9 n  17   6
10 p   2  11
11 r   1   4
12 r   4   4
13 r   1  20
14 r   4  20
15 t   7  12
16 t  10  12
17 v  12   8
18 x  19  18
19 y   8   5
20 z   9   1
21 z   9  15

The %in% operator is also useful.

> # the %in% operator compares each element of the first
> # set against all the elements of the second set and
> # returns TRUE if there is a match.
> x0 %in% x1
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
 [9]  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
[17]  TRUE FALSE  TRUE  TRUE
> # This function is not symmetric, of course,
> x1 %in% x0
 [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
 [9]  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
[17]  TRUE  TRUE  TRUE  TRUE
> # Using the %in% function, you can re-create the
> # intersect function,
> x0[x0 %in% x1]
[1] "k" "e" "f" "d" "c" "z" "l" "w" "q" "x" "v" "t" "m"
[14] "b" "n"
> # or the setdiff function,
> x0[!x0 %in% x1]
[1] "h" "a" "p" "y" "r"

A less intuitive version of the %in% operator is the match function. Additional information on merging can be found at