---
title: "CodeDepends: Static analysis and dependency detection for R code"
author: "Gabriel Becker"
output:
  rmarkdown::html_vignette:
    fig_width: 8
    fig_height: 6
vignette: >
  %\VignetteIndexEntry{CodeDependsIntro}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

# Introduction

The CodeDepends package provides a flexible framework for statically analyzing R code (i.e., without evaluating it). It also contains higher-level functionality for: detecting dependencies between R code blocks or expressions, "tree-shaking" (pruning a script down to only the expressions necessary to evaluate a given expression), plotting variable usage timelines, and more.

# The workhorses: readScript and getInputs

The primary functions for basic code analysis are `readScript`, which reads in R scripts of various forms (including .R and .Rmd files), and `getInputs`, which performs the low-level code analysis.

The `readScript` function returns a `Script` object (essentially a list of `ScriptNodes` representing the top-level expressions in the script). This can then be passed to `getInputs`, which in that case returns a `ScriptInfo` object. A `ScriptInfo` object can be thought of as a list of `ScriptNodeInfo` objects, each holding information about one of those top-level expressions.

R expressions can also be passed directly to `getInputs`, which returns a single `ScriptNodeInfo` object in that case. While in practice users will generally call `getInputs` on entire scripts, passing expressions directly is useful for testing and illustration.

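
As a minimal sketch of the script-level workflow described above, the following writes a small three-line script to a temporary file, reads it with `readScript`, and analyzes it with `getInputs`. The script contents and variable names are made up for illustration, and the slot names `inputs` and `outputs` are assumed to match the fields printed for a `ScriptNodeInfo` object.

```r
library(CodeDepends)

# A throwaway script written to a temporary file (contents are illustrative only)
f <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "y <- x * 2",
             "z <- sum(y)"), f)

sc <- readScript(f)      # Script: one ScriptNode per top-level expression
info <- getInputs(sc)    # ScriptInfo: one ScriptNodeInfo per top-level expression

length(info)             # number of top-level expressions analyzed

# Information about the second expression, `y <- x * 2`
node <- info[[2]]
node@inputs              # variables the expression reads
node@outputs             # variables the expression creates
```

Indexing the `ScriptInfo` object with `[[` is used here only to look at one expression at a time; printing `info` as a whole shows the same information for every expression.
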
As stated above, `ScriptNodeInfo` objects are the units of information about single expressions being analyzed. They collect various pieces of information extracted from examining the expression itself:

```{r scriptnodeinfo}
library(CodeDepends)
getInputs(quote(x <- y + rnorm(10, sd = z)))
```

As we can see, the information includes any string literals used in the expression, split into file and non-file strings based on whether the string appears to point to an existing path at analysis time with respect to the `basedir` argument (which defaults to the current directory). It also contains any libraries loaded by the code (via `library`, `require`, or `requireNamespace` calls).

Next are the inputs and outputs of the expression, which are the variables used by the expression and created by the expression (via assignment), respectively. By default, these lists will not include symbols used in ways that mean they are non-standardly evaluated (e.g., within the construction of a `ggplot2` plot object). These non-standard evaluation variables are collected separately (as `nsevalVars`).

Variables whose values are updated (i.e., ones that are assigned new values which depend on their existing value) are collected separately. These updates can take a large number of forms, including:

```{r updateexprs, eval=FALSE}
x = x + 5
rownames(x) = 5
x[1:3] = 5
x = lapply(1:5, function(i) x[i]^2)
x$y = 5
```

In all of the above cases, the variable `x` will be listed in both the `updated` and `inputs` categories, but *NOT* in the `outputs` category.

Next are the functions which were called by the expression. These include those invoked as functionals, e.g., via the `apply` family or the `mutate_*` and `summarize_*` families.

We note here that the functions list is actually a `logical` vector, indicating whether the function was locally defined within the script (`TRUE`), defined within a package (`FALSE`), or unknown (`NA`). The names of the vector indicate the names of the functions.
Currently, functions will always be unknown if a single expression is analyzed directly. Function provenance detection is only applied to full scripts.

Finally, the list of removed variables, the side effects that `CodeDepends` is able to detect, and a copy of the code complete the list of information extracted.

## Symbols within formulas

Symbols within formulas are treated specially when analyzing code, based on the `formulaInputs` argument to `getInputs`. If `FALSE` (the default), they are assumed to be evaluated non-standardly (e.g., in the context of a `data.frame`); if `TRUE`, they are counted as standard inputs. Currently there is no capacity for mixing these behaviors within a single call to `getInputs`.

# Input collectors, function handlers, and customization

The `getInputs` function accepts a `collector` argument, which essentially specifies a state tracker to be used when walking the code to collect inputs, functions called, etc. For largely historical reasons, input collectors are roughly defined as the output from the `inputCollector` constructor, rather than as a more formal class.

When creating an input collector, various behavior can be customized, primarily in the form of "function handlers", which specify behavior when analyzing calls to specific functions. This is, for example, how `CodeDepends` knows that some arguments within certain functions are non-standardly evaluated.

CodeDepends ships with a robust set of default handlers, but these can be overridden or supplemented with custom handlers by specifying them when constructing the collector, either via the `...` arguments or as a list. In both cases, the names are the names of the functions the handlers should be used on.

```{r custhandler}
## Custom handler for calls to library(): print a greeting,
## then defer to the default handler
col = inputCollector(library = function(e, collector, ...) {
    print(paste("Hello", asVarName(e)))
    defaultFuncHandlers$library(e, collector, ...)
})
getInputs(quote(library(CodeDepends)), collector = col)
```

`inputCollector` also accepts arguments which control what is counted as an input when processing expressions. The `inclPrevOutput` argument specifies whether output variables should be included as inputs to subsequent expressions when processing multiple expressions as a single block (e.g., when they are wrapped in `{}`). The `checkLibrarySymbols` argument controls how symbols which appear to be resolved within libraries are handled, and the `funcsAsInputs` argument controls how functions which are called are handled. The default behavior is for all of these to be `FALSE`.

# Dependency detection and script visualization

`CodeDepends` can visualize code in various ways.

## Variable dependency graphs

We can create the graph of dependencies between variables via the `makeVariableGraph` function:

```{r variablegraph}
f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
g = makeVariableGraph( info = getInputs(sc))
if(require(Rgraphviz))
    plot(g)
```

## Call graphs

We can also create call graphs for functions or entire packages:

```{r callgraphs}
gg = makeCallGraph("package:CodeDepends")
if(require(Rgraphviz)) {
    gg = layoutGraph(gg, layoutType = "circo")
    graph.par(list(nodes = list(fontsize=55)))
    renderGraph(gg) ## could also call plot directly
}
```

## Variable definition timelines

Finally, we can display timelines for when variables are defined, redefined, and used:

```{r timelines}
f = system.file("samples", "results-multi.R", package = "CodeDepends")
sc = readScript(f)
dtm = getDetailedTimelines(sc, getInputs(sc))
plot(dtm)

# A big/long function
info = getInputs(arima0)
dtm = getDetailedTimelines(info = info)
plot(dtm, var.cex = .7, mar = 4, srt = 30)
```
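
The object returned by `getDetailedTimelines` can also be inspected directly instead of plotted. The sketch below assumes that it is a data frame with one row per combination of variable and step, and with columns named `step`, `used`, `defined`, and `var`. The three-line script is made up for illustration.

```r
library(CodeDepends)

# A throwaway script written to a temporary file (contents are illustrative only)
f <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10",
             "y <- x * 2",
             "z <- sum(y)"), f)

sc <- readScript(f)
dtm <- getDetailedTimelines(sc, getInputs(sc))

names(dtm)               # column names (assumed: step, used, defined, var)

# Rows describing where `y` is defined and used across the steps
dtm[dtm$var == "y", ]
```

Looking at the rows for a single variable this way can help when a plotted timeline for a long script is too dense to read.
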