Question
In a data.frame
(or data.table
), I would like to "fill forward" NAs with
the closest previous non-NA value. A simple example, using vectors (instead of
a data.frame
) is the following:
> y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
I would like a function fill.NAs()
that allows me to construct yy
such
that:
> yy
[1] NA NA NA 2 2 2 2 3 3 3 4 4
I need to repeat this operation for many (total ~1 Tb) small sized
data.frame
s (~30-50 Mb), where a row is NA is all its entries are. What is a
good way to approach the problem?
The ugly solution I cooked up uses this function:
last <- function (x){
x[length(x)]
}
fill.NAs <- function(isNA){
if (isNA[1] == 1) {
isNA[1:max({which(isNA==0)[1]-1},1)] <- 0 # first is NAs
# can't be forward filled
}
isNA.neg <- isNA.pos <- isNA.diff <- diff(isNA)
isNA.pos[isNA.diff < 0] <- 0
isNA.neg[isNA.diff > 0] <- 0
which.isNA.neg <- which(as.logical(isNA.neg))
if (length(which.isNA.neg)==0) return(NULL) # generates warnings later, but works
which.isNA.pos <- which(as.logical(isNA.pos))
which.isNA <- which(as.logical(isNA))
if (length(which.isNA.neg)==length(which.isNA.pos)){
replacement <- rep(which.isNA.pos[2:length(which.isNA.neg)],
which.isNA.neg[2:max(length(which.isNA.neg)-1,2)] -
which.isNA.pos[1:max(length(which.isNA.neg)-1,1)])
replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
} else {
replacement <- rep(which.isNA.pos[1:length(which.isNA.neg)], which.isNA.neg - which.isNA.pos[1:length(which.isNA.neg)])
replacement <- c(replacement, rep(last(which.isNA.pos), last(which.isNA) - last(which.isNA.pos)))
}
replacement
}
The function fill.NAs
is used as follows:
y <- c(NA, 2, 2, NA, NA, 3, NA, 4, NA, NA)
isNA <- as.numeric(is.na(y))
replacement <- fill.NAs(isNA)
if (length(replacement)){
which.isNA <- which(as.logical(isNA))
to.replace <- which.isNA[which(isNA==0)[1]:length(which.isNA)]
y[to.replace] <- y[replacement]
}
Output
> y
[1] NA 2 2 2 2 3 3 3 4 4 4
... which seems to work. But, man, is it ugly! Any suggestions?
Answer
You probably want to use the na.locf()
function from the
zoo package to carry the last
observation forward to replace your NA values.
Here is the beginning of its usage example from the help page:
library(zoo)
az <- zoo(1:6)
bz <- zoo(c(2,NA,1,4,5,2))
na.locf(bz)
1 2 3 4 5 6
2 2 1 4 5 2
na.locf(bz, fromLast = TRUE)
1 2 3 4 5 6
2 1 1 4 5 2
cz <- zoo(c(NA,9,3,2,3,2))
na.locf(cz)
2 3 4 5 6
9 3 2 3 2