Question
Consider the following data.table
s. The first defines a set of regions with
start and end positions for each group 'x':
library(data.table)
d1 <- data.table(x = letters[1:5], start = c(1,5,19,30, 7), end = c(3,11,22,39,25))
setkey(d1, x, start)
# x start end
# 1: a 1 3
# 2: b 5 11
# 3: c 19 22
# 4: d 30 39
# 5: e 7 25
The second data set has the same grouping variable 'x', and positions 'pos' within each group:
d2 <- data.table(x = letters[c(1,1,2,2,3:5)], pos = c(2,3,3,12,20,52,10))
setkey(d2, x, pos)
# x pos
# 1: a 2
# 2: a 3
# 3: b 3
# 4: b 12
# 5: c 20
# 6: d 52
# 7: e 10
Ultimately I'd like to extract the rows in 'd2' where 'pos' falls within the
range defined by 'start' and 'end', within each group x
. The desired result
is
# x pos start end
# 1: a 2 1 3
# 2: a 3 1 3
# 3: c 20 19 22
# 4: e 10 7 25
The start/end positions for any group x
will never overlap but there may be
gaps of values not in any region.
Now, I believe I should be using a rolling join. From what i can tell, I cannot use the "end" column in the join.
I've tried
d1[d2, roll = TRUE, nomatch = 0, mult = "all"][start <= end]
and got
# x start end
# 1: a 2 3
# 2: a 3 3
# 3: c 20 22
# 4: e 10 25
which is the right set of rows I want; However "pos" has become "start" and the original "start" has been lost. Is there a way to preserve all the columns with the roll join so i could report "start", "pos", "end" as desired?
Answer
Overlap joins was
implemented with commit
1375
in data.table v1.9.3, and is
available in the current stable release,
v1.9.4. The
function is called foverlaps
. From
NEWS:
Overlap joins
#528 is now here, finally!! Except fortype="equal"
andmaxgap
andminoverlap
arguments, everything else is implemented. Check out?foverlaps
and the examples there on its usage. This is a major feature addition todata.table
.
Let's consider x, an interval defined as [a, b]
, where a <= b
, and y,
another interval defined as [c, d]
, where c <= d
. The interval y is said
to overlap x at all, iff d >= a
and c <= b
1. And y is
entirely contained within x, iff a <= c,d <= b
2. For the
different types of overlaps implemented, please have a look at ?foverlaps
.
Your question is a special case of an overlap join: in d1
you have true
physical intervals with start
and end
positions. In d2
on the other
hand, there are only positions (pos
), not intervals. To be able to do an
overlap join, we need to create intervals also in d2
. This is achieved by
creating an additional variable pos2
, which is identical to pos
(d2[, pos2 := pos]
). Thus, we now have an interval in d2
, albeit with identical
start and end coordinates. This 'virtual, zero-width interval' in d2
can
then be used in foverlap
to do an overlap join with d1
:
require(data.table) ## 1.9.3
setkey(d1)
d2[, pos2 := pos]
foverlaps(d2, d1, by.x = names(d2), type = "within", mult = "all", nomatch = 0L)
# x start end pos pos2
# 1: a 1 3 2 2
# 2: a 1 3 3 3
# 3: c 19 22 20 20
# 4: e 7 25 10 10
by.y
by default is key(y)
, so we skipped it. by.x
by default takes
key(x)
if it exists, and if not takes key(y)
. But a key doesn't exist for
d2
, and we can't set the columns from y
, because they don't have the same
names. So, we set by.x
explicitly.
The type of overlap is within , and we'd like to have all matches, only if there is a match.
NB: foverlaps
uses data.table's binary search feature (along with roll
where necessary) under the hood, but some function arguments (types of
overlaps, maxgap, minoverlap etc..) are inspired by the function
findOverlaps()
from the Bioconductor package
IRanges
,
an excellent package (and so is
GenomicRanges
,
which extends IRanges
for Genomics).
So what's the advantage?
A benchmark on the code above on your data results in foverlaps()
slower
than Gabor's answer (Timings: Gabor's data.table solution = 0.004 vs foverlaps
= 0.021 seconds). But does it really matter at this granularity?
What would be really interesting is to see how well it scales - in terms of
both speed and memory. In Gabor's answer, we join based on the key column
x
. And then filter the results.
What if d1
has about 40K rows and d2
has a 100K rows (or more)? For each
row in d2
that matches x
in d1
, all those rows will be matched and
returned, only to be filtered later. Here's an example of your Q scaled only
slightly:
Generate data:
require(data.table)
set.seed(1L)
n = 20e3L; k = 100e3L
idx1 = sample(100, n, TRUE)
idx2 = sample(100, n, TRUE)
d1 = data.table(x = sample(letters[1:5], n, TRUE),
start = pmin(idx1, idx2),
end = pmax(idx1, idx2))
d2 = data.table(x = sample(letters[1:15], k, TRUE),
pos1 = sample(60:150, k, TRUE))
foverlaps:
system.time({
setkey(d1)
d2[, pos2 := pos1]
ans1 = foverlaps(d2, d1, by.x=1:3, type="within", nomatch=0L)
})
# user system elapsed
# 3.028 0.635 3.745
This took ~ 1GB of memory in total, out of which ans1
is 420MB. Most of the
time spent here is on subset really. You can check it by setting the argument
verbose=TRUE
.
Gabor's solutions:
## new session - data.table solution
system.time({
setkey(d1, x)
ans2 <- d1[d2, allow.cartesian=TRUE, nomatch=0L][between(pos1, start, end)]
})
# user system elapsed
# 15.714 4.424 20.324
And this took a total of ~3.5GB.
I just noted that Gabor already mentions the memory required for intermediate
results. So, trying out sqldf
:
# new session - sqldf solution
system.time(ans3 <- sqldf("select * from d1 join
d2 using (x) where pos1 between start and end"))
# user system elapsed
# 73.955 1.605 77.049
Took a total of ~1.4GB. So, it definitely uses less memory than the one shown above.
[The answers were verified to be identical after removing pos2
from ans1
and setting key on both answers.]
Note that this overlap join is designed with problems where d2
doesn't
necessarily have identical start and end coordinates (ex: genomics, the field
where I come from, where d2
is usually about 30-150 million or more rows).
foverlaps()
is stable, but is still under development, meaning some
arguments and names might get changed.
NB: Since I mentioned GenomicRanges
above, it is also perfectly capable of
solving this problem. It uses interval
trees
under the hood, and is quite memory efficient as well. In my benchmarks on
genomics data, foverlaps()
is faster. But that's for another (blog) post,
some other time.