--- title: "igapfill: A console-based interface for the R package gapfill" author: "Inder Tecuapetla" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{igapfill: A console-based interface for the R package gapfill} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} library(terra) library(igapfill) library(heatmaply) library(htmltools) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # WD <- "/home/series_tiempo/Escritorio/Rtest/igapfill/PNG" ``` The functions and methods of `igapfill` provide an intuitive, user-friendly, console-based interface that allows us to fill missing values of time series of satellite images (TSSI); the main engine for filling these missing values is the spatio-temporal approach implemented in the `R` package `gapfill`. Throughout this note, we will employ the Garnica dataset, also included in `igapfill`, to showcase the use of this package. ## Garnica dataset This dataset is comprised of two `.tif` files which store a TSSI recorded over the National Park _Cerro de Garnica_ located in Michoacan, Mexico. The file `garnica_250m_16_days_NDVI.tif` contains 572 layers of the Normalized Differenced Vegetation Index (NDVI). The accompanying file `garnica_250m_16_days_pixel_reliability.tif` contains information about the quality of each pixel of the 572 NDVI layers. Both datasets are spatial subsets extracted from larger images belonging to the MOD13Q1 NASA's product. Due to the temporal resolution of this product -16 days- the 572 layers correspond to a time series of images starting on the 49-th Julian date of 2000 and ending on the 353-rd Julian date of 2024. We can access to this dataset as follows: ```{r datasetUploading} ndvi_path <- system.file("extdata", "garnica_250m_16_days_NDVI.tif", package = "igapfill") reliability_path <- system.file("extdata", "garnica_250m_16_days_pixel_reliability.tif", package = "igapfill") spatNDVI <- rast(ndvi_path) spatRELIABILITY <- rast(reliability_path) ``` Although by definition the NDVI takes values on the interval `(-1,1)`, NASA uses `1e4` as an scale factor and distributes the NDVI MOD13Q1 product through files that when read in the `R` environment, the stored information has the `INT4S` datatype. Thus `spatNDVI` has values ranging from `-1e4` to `1e4`. ## mvSieve: A sieve with the amount of missing values in TSSI How much of the information contained in the Garnica NDVI TSSI can be used with high reliability in a scientific study? `mvSieve` counts the amount of missing values in a TSSI and hence it gives us a fair estimate of the overall quality of this type of datasets. Now, prior to use `mvSieve`, let's see how to use the pixel realiability product to yield a highly reliable NDVI product. For scientific purposes, it is mandatory to filter out those pixels with low reliability from a TSSSI dataset. Quality layers, such as those provided in `spatRELIABILITY`, allow us to pursuit this task. Specifically, for the Garnica dataset, the quality flags are 0 (Good data), 1 (Marginal data), 2 (Snow/clouds) and 3 (No data). It is known that a reliable scientific dataset based on TSSI can be built upon pixels with Good and Marginal data quality flags. In the Appendix, there is code to generate the product `garnica_250m_16_days_NDVI_QA` that results from filtering out ( _masking_ ) those NDVI pixels whose reliability is either equal to Snow/clouds or No data.^[We have saved the layers of this product in sub-directories organized by years, the first being 2000 and the last one being 2024.] ```{r mSieve-prelim, echo=FALSE} garnicaNDVIQAdir <- paste0( getwd(), "/garnica/250m_16_days_NDVI_QA" ) dir.create( path = garnicaNDVIQAdir, recursive = TRUE ) YEARS <- 2000:2024 ndviNAMES <- names(spatNDVI) reliabilityNAMES <- names(spatRELIABILITY) ndviQANAMES <- sapply(ndviNAMES, function(s) paste0( strsplit(s, ".tif")[[1]][1], "_QA.tif" )) for(i in 1:20){ if(!dir.exists( paste0( garnicaNDVIQAdir, "/", YEARS[1] ) )){ dir.create( paste0( garnicaNDVIQAdir, "/", YEARS[1] ) ) } TEMPndvi <- subset(spatNDVI,i) TEMPreliability <- subset(spatRELIABILITY,i) TEMPndvi[ TEMPreliability >= 2] <- NA writeRaster(TEMPndvi, filename = paste0( garnicaNDVIQAdir, "/", YEARS[1], "/", ndviQANAMES[i] ), datatype = "INT2S", overwrite = TRUE) } for (i in 2:length(YEARS)) { if(!dir.exists( paste0( garnicaNDVIQAdir, "/", YEARS[i] ) )){ dir.create( paste0( garnicaNDVIQAdir, "/", YEARS[i] ) ) } for(j in 1:23){ count <- 20 + ((i-2)*23) + j TEMPndvi <- subset(spatNDVI, count) TEMPreliability <- subset(spatRELIABILITY,count) TEMPndvi[ TEMPreliability >= 2 ] <- NA writeRaster(TEMPndvi, filename = paste0( garnicaNDVIQAdir, "/", YEARS[i], "/", ndviQANAMES[count] ), datatype = "INT2S", overwrite = TRUE) } } ``` The function `mvSieve()` can now be applied to the `NDVI_QA` product. Observe that `mvSieve`'s argument `dirs` is equal to a character vector pointing to the sub-directories containing the TSSI files. The remaining arguments are straightforward: `filesPerDir=23`, `startPeriod=2001`, `endPeriod=2024` and `colNames` is a character vector defined below:^[Although the TSSI starts in 2000, there are only 20 images for that year. Starting in 2001, every year has 23 NDVI images. Hence in our example `startPeriod=2001`.] ```{r mSieve-usage, warning=FALSE, message=FALSE} garnicaNDVIQAdir <- paste0( getwd(), "/garnica/250m_16_days_NDVI_QA" ) dir.create( path = garnicaNDVIQAdir, recursive = TRUE ) DIRS <- list.dirs(path = garnicaNDVIQAdir)[-1] COLNAMES <- paste0( "DoY-", seq(1,365,by=16) ) missingValueSieve <- mvSieve(dirs=DIRS[-1], filesPerDir = 23, startPeriod = 2001, endPeriod = 2024, colNames = COLNAMES) ``` `missingValueSieve` can be construed as a missing values matrix and thanks to the `R` package `heatmaply` we can display this kind of matrices in a graphical form. Below is the missing value matrix of the Garnica dataset as a heatmap. ```{r mSieve-heatmaply, echo=FALSE, fig.align='center'} out <- heatmaply((missingValueSieve/max(missingValueSieve)) * 100, limits = c(0,100), colors = cool_warm, dendrogram = "none", xlab = "", ylab = "", width = 900, height = 650, main = "% of missing values in NDVI TS of Cerro de Garnica (2001-2024)", scale = "none", draw_cellnote=TRUE, cellnote_textposition = "middle center", cellnote_size = 10, margins = c(60, 100, 40, 20), titleX = FALSE, hide_colorbar = TRUE, labRow = rownames(missingValueSieve), labCol = colnames(missingValueSieve), heatmap_layers = theme(axis.line = element_blank())) html_out <- as.tags(out) scrollable_out <- tags$div( class = "scrollable", style = "width: 100%;", html_out ) scrollable_out ``` This heatmap allows us to spot a recurrent feature in NDVI TS, i.e. that during the period of vegetation growth, this kind of datasets usually exhibit a fair amount of information loss. In particular, for the Garnica dataset, there is a non-negligible amount of missing values in the interval between the 145-th day of the year (DoY) and the 289-th. That is, for this example, between May and October -arguably the season of vegetation growth- there is an important information loss. The spatio-temporal approach of `gapfill` is applied to **blocks** of $d \times d$ images. It is clear that in order to reduce significantly the amount of missing values in the entire dataset, `gapfill` has to be applied sequentially on a series of blocks whose dimensions can be of order $d=2,3,4,\ldots$. Based on a heatmap such as the one above, it may be not too difficult to select a block of $2\times 2$, images from which to start reducing the amount of overall missing values in the dataset. For instance, we can define `block0` as the $2\times 2$ set comprised of the images acquired in the 113-th and 129-th DoY of 2011 and 2012. This block-selection task can become automatic due to the `minmaxBlock()` function, though. ## minmaxBlock: establishing blocks with minimal (or maximal) missingness For those studies in which the task of establishing a block of images with the smallest (or largest) amount of missing values requires something more than a quick visual inspection we have developed `minmaxBlock()`. This function has three arguments, `sieve`, `type` and `rank`. In order to set the first argument, we can use a missing values matrix, such as `missingValueSieve`. The argument `type` defines whether you seek for a block with minimal or a maximal amount of missing values. The last argument is a `numeric` controlling the size of the search grid on which the block with the minimal or maximal amount of missing values will be found. In what follows we provide details for the case `type="min"`, the other case is analogous. `minmaxBlock` searches for the minimal $2\times 2$, non-zero, sub-matrix of a general missing values matrix with entries denoted as $m_{i,j}$. For any given $2\times 2$ block we have defined `blockMissingness` as $\log( \texttt{cumsum} (m_{i,j}) )$ for $1\leq i,j\leq 2$. The minimal block is defined as that block with the minimum block missingness. We preferred the `cumsum` function rather than others, such as `cumprod` or `det`, because a general missing values matrix could have a large amount of zeros. The search for the minimal block runs over the values of a non-zero numeric vector of length equal to the value of the `rank` argument; for simplicity we call this vector as the _rank vector_. The entries of this vector are non-zero values of the missing values matrix provided by the `sieve` argument. Consider a given entry of the rank vector and notice that this number can occupy any of the four entries of a $2 \times 2$ sub-matrix of the original missing values matrix `sieve`. Correspondingly, based on this value, four different $2 \times 2$ matrices are defined and the block missingness of each of these matrices is calculated. This process is applied to each value of the rank vector. Having calculated all the possible blocks, and their corresponding missingness, we define the minimal block of the `sieve` as the $2\times 2$ block with the smallest missingness. Below it is established that for the Garnica's missing values matrix, the $2\times 2$ minimal block is comprised by the images acquired in the 113-rd and 129-th DoY of 2011 and 2012. ```{r minmaxBlock} SIEVE <- (missingValueSieve/max(missingValueSieve)) * 100 minmaxBlock(sieve=SIEVE, rank=10) ``` In the next section we show how to fill the missing values of this block. ## Using gapfill to fill missing values in TSSI Now that we have found a block of images to apply the spatio-temporal approach introduced in the `R` package `gapfill` let's make a sub-directory ( _block0_ ) in which we make safe copies of the images to be filled. ```{r block0} DIRS <- list.dirs(path = garnicaNDVIQAdir)[-1] files2011 <- list.files(path = DIRS[12], pattern = ".tif$", full.names = TRUE) files2012 <- list.files(path = DIRS[13], pattern = ".tif$", full.names = TRUE) block0DIR <- paste0( getwd(), "/garnica/block0" ) if( !dir.exists(block0DIR) ){ dir.create( block0DIR ) } file.copy(from=files2011[8:9], to=block0DIR) file.copy(from=files2012[8:9], to=block0DIR) ``` ### `create_dirs()` Speaking of safety, we have found extremely useful to create a set of sub-directories in which we can store all the auxiliary files that are generated through the process of setting arguments for applying `gapfill` functions; we encourage to the users to avoid saving auxiliary files in hard-to-remember locations of their systems. With `create_dirs()`, as shown below, we can create this directory tree. ```{r create-dirs, message=FALSE, echo=FALSE} # knitr::include_graphics(normalizePath(paste0(WD, "/create_dirs.png"))) knitr::include_graphics("figs/create_dirs.png") ``` The directory tree with _block0_ as its root looks as follows ```{r dirTREE, echo=FALSE} # knitr::include_graphics(normalizePath(paste0(WD, "/garnica_block0_directoryTree.png"))) knitr::include_graphics("figs/garnica_block0_directoryTree.png") ``` ### `dimsReport()` In order to avoid memory overflow and/or to reduce execution time, it is recommended to know the spatial dimensions of the images to be processed with `gapfill`; obviously, having a fair understanding of your systems characteristics is highly recommended too. In addition to information about the spatial dimensions (number of rows and columns) of the images, `dimsReport()` returns all the divisors of these dimensions. A direct application of this function returns something as shown below: ```{r dimsReport, echo=FALSE} # knitr::include_graphics(normalizePath(paste0(WD, "/dimsReport.png"))) knitr::include_graphics("figs/dimsReport.png") ``` ```{r dimsReport-ignore, include=FALSE, eval=FALSE} knitr::include_graphics(normalizePath(paste0(WD, "/dimsReport.png"))) html_out <- as.tags(out) scrollable_out <- tags$div( class = "scrollable", style = "width: 100%;", html_out ) scrollable_out ``` ### `sort_split()` Depending on system characteristics, the execution time of the `gapfill`'s functions tends to increase in direct relationship with the images dimension. We have found that an effective way of speeding up the gap-filling process consists of splitting the original images into smaller spatio-temporal chunks. The `sort_split()` function takes care of the splitting process through a console-based application. That is, the user is asked to pass a series of arguments. By accepting the default arguments of `sort_split()`, we ensure that the general workflow of `igapfill()` is followed but the user has the flexibility of changing this workflow. When its arguments `nrow_split` and `ncol_split` are not provided, `sort_split()` integrates `dimsReport()` in its execution as it can be seen below: ```{r sort_split, include=FALSE, eval=FALSE} # knitr::include_graphics(normalizePath(paste0(WD, "/sort_split_3.png"))) ```