--- title: "Workflow functions" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Overview} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} library(ggtrace) ``` ## Workflows for interacting with ggplot internals ```{r} library(ggplot2) packageVersion("ggplot2") ``` **NOTE**: Making the most out of these workflow functions requires a hint of knowledge about ggplot internals, namely the fact that **ggproto objects** like [Stat](https://ggplot2.tidyverse.org/reference/ggplot2-ggproto.html#stats) and [Geom](https://ggplot2.tidyverse.org/reference/ggplot2-ggproto.html#geoms) exists, and that these [ggprotos have methods](https://ggplot2-book.org/spring1.html#methods) that step in at different parts of the ggplot build/render pipeline to modify the data. If you are completely new to these concepts, you should at least watch [Thomas Lin Pedersen](https://twitter.com/thomasp85)'s talk on [Extending your ability to extend ggplot2](https://www.rstudio.com/resources/rstudioconf-2020/extending-your-ability-to-extend-ggplot2/) before proceeding. ### A walkthrough with `geom_smooth()` Say we want to learn more about how `geom_smooth()` layer works, exactly ```{r} geom_smooth() ``` To do this, we're going to adopt the example from the [ggplot2 internals chapter of the ggplot book](https://ggplot2-book.org/internals.html) ```{r} p <- ggplot(mpg, aes(displ, hwy, color = drv)) + geom_point(position = position_jitter(seed = 1116)) + geom_smooth(method = "lm", formula = y ~ x) + facet_wrap(vars(year)) + ggtitle("A plot for expository purposes") p ``` Let's focus on the Stat ggproto. We see that `geom_smooth()` uses the `StatSmooth` ggproto ```{r} class( geom_smooth()$stat ) identical(StatSmooth, geom_smooth()$stat) ``` The bulk of the work by a Stat is done in the `compute_*` family of methods, which are essentially just functions. We'll focus on `compute_group` here: ```{r} # ggproto methods wrap over the actual function and print extra info class( StatSmooth$compute_group ) # Use `get_method` to pull out just the function component class( get_method(StatSmooth$compute_group) ) # StatSmooth inherits `compute_layer`/`compute_panel` and defines `compute_group` get_method_inheritance(StatSmooth) ``` ### Inspect Here we introduce our first workflow function `ggtrace_inspect_n()`, which takes a ggplot as the first argument and a ggproto method as the second argument, returning the number of times the ggproto method has been called in the ggplot's evaluation: ```{r} ggtrace_inspect_n(x = p, method = StatSmooth$compute_group) ``` As we might have guessed, `StatSmooth$compute_group` is called for each fitted line (each group) in the plot. But if `StatSmooth$compute_group` is essentially a function, what does it return? We can answer that with another workflow function `ggtrace_inspect_return()`, which shares a similar syntax: ```{r} return_val <- ggtrace_inspect_return(x = p, method = StatSmooth$compute_group) dim(return_val) head(return_val) ``` Note that `ggtrace_inspect_return()` only gave us 1 dataframe, corresponding to the return value of `StatSmooth$compute_group` the _first time_ it was called. This comes from the default value of the third argument `cond` being set to `quote(._counter_ == 1)`. Here, `._counter_` is an internal variable that keeps track of how many times the method has been called. It's available for all workflow functions and you can read more in the [**Tracing context**](https://yjunechoe.github.io/ggtrace/reference/ggtrace_inspect_return.html#tracing-context) section of the docs. If we instead wanted to get the return value of `StatSmooth$compute_group` for the third group of the second panel, for example, we can do so in one of two ways: 1. Set the value of `cond` to an expression that evaluates to true for that panel and group: ```{r} return_val_2_3_A <- ggtrace_inspect_return( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3) ) ``` 2. Find the counter value when that condition is satisfied with `ggtrace_inspect_which()`, and then simply check for the value of `._counter_` back in `ggtrace_inspect_return()`: ```{r} ggtrace_inspect_which( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3) ) return_val_2_3_B <- ggtrace_inspect_return( x = p, method = StatSmooth$compute_group, cond = 6L # shorthand for `quote(._counter_ == 6L)` ) ``` These two approaches work the same: ```{r} identical(return_val_2_3_A, return_val_2_3_B) ``` ### Capture Okay, so we know what `StatSmooth$compute_group` returns, but how does this return value change with different input? More generally put, how does `StatSmooth$compute_group` behave under different contexts? We _could_ answer this by making a bunch of different plots using `geom_smooth()` and repeating the inspection workflow. Alternatively, we can capture a call to `StatSmooth$compute_group` and extract it as a function with `ggtrace_capture_fn()`: ```{r} captured_fn_2_3 <- ggtrace_capture_fn( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3) ) ``` `captured_fn_2_3` is essentially a snapshot of the `compute_group` when it is called for the third group of the second panel. Simply calling `captured_fn_2_3` gives us the expected return value: ```{r} identical(return_val_2_3_A, captured_fn_2_3()) ``` But the true power of the "capture" workflow functions lies in the ability to interact with what has been captured. In the case of `ggtrace_capture_fn()`, the returned function has all of the arguments passed to it at its execution stored in the formals. In other words, it is "pre-filled" with its original values, which we can inspect with `formals()`: ```{r, R.options = list(width = 70)} # Just showing their type/class for space sapply( formals(captured_fn_2_3) , class) ``` This makes it very convenient for us to explore its behavior with different arguments passed to it. For example, when `flipped_aes = TRUE`, we get `xmin` and `xmax` columns replacing `ymin` and `ymax`: ```{r} head( captured_fn_2_3(flipped_aes = TRUE) ) ``` In this sense, we can effectively simulate what happens in `geom_smooth(orientation = "y")` without needing to construct an entirely different ggplot. For another example, when we set the confidence interval to 10% with `level = 0.1`, the `ymin` and `ymax` values deviate less from the `y` value: ```{r} head( captured_fn_2_3(level = 0.1) ) ``` Lastly, let's talk about the `data` variable we've been using inside the `cond` argument of some of these workflow functions. What is `data$group` and `data$PANEL`? How do you know what `data` looks like? The answer is actually simple: it's an argument passed to `StatSmooth$compute_group`. We saw earlier that it's stored in `formals(captured_fn_2_3)`, but to target it explicitly we can also use `ggtrace_inspect_args()`: ```{r} args_2_3 <- ggtrace_inspect_args( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3) ) identical(names(args_2_3), names(formals(captured_fn_2_3))) args_2_3$data ``` We see that `PANEL` and `group` columns conveniently give us information about the panel and group that `compute_group` is doing calculations over. ### Highjack Once we have some understanding of how `StatSmooth$compute_group` works, we may want to test some hypotheses about what would happen to the resulting graphical output if the method returned something else. Let's revisit our examples from the Capture workflow. What if the third group of the second panel calculated a more conservative confidence interval (`level = 0.1`)? What is this effect on the graphical output? To answer this question, we use `ggtrace_highjack_return()` to have a method return an entirely different value. First we store the modified return value in some variable: ```{r} modified_return_smooth <- captured_fn_2_3(level = 0.1) ``` Then we target the same group inside `cond` and pass `modified_return_smooth` to the `value` argument: ```{r} ggtrace_highjack_return( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3), value = modified_return_smooth ) ``` The confidence band is now nearly invisible for that fitted line because it's only capturing a 10% confidence interval! Here's another example where we make the method fit predictions from a loess regression instead. To achieve this directly, we use `ggtrace_highjack_args()` here and set the `values` to `list(method = "loess")`: ```{r} ggtrace_highjack_args( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3), values = list(method = loess) ) ``` Lastly, `ggtrace_highjack_return()` exposes an internal function called `returnValue()` in the `value` argument, which simply returns the original return value. Passing the `value` argument an [**expression**](https://adv-r.hadley.nz/expressions.html) computing on `returnValue()` allows on-the-fly modifications to the graphical output. For example, we can "intercept" the dataframe output of a ggproto method, do data wrangling on it, and have the method return that new dataframe instead. Here, we hack the data for the group to make it look like there's an absurd degree of heteroskedasticity: ```{r, message=FALSE, warning=FALSE} library(dplyr) ggtrace_highjack_return( x = p, method = StatSmooth$compute_group, cond = quote(data$PANEL[1] == 2 && data$group[1] == 3), value = quote({ returnValue() %>% mutate( ymin = y - se * seq(0, 10, length.out = n()), ymax = y + se * seq(0, 10, length.out = n()) ) }) ) ```