Introduction to ivs

library(ivs)
library(clock)
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)

Introduction

ivs (pronounced “eye-vees”) is a package dedicated to working with intervals in a generic way. It introduces a new type, the interval vector, which is generally referred to as an iv. An iv is typically created from two parallel vectors representing the starts and ends of the intervals, like this:

# Interval vector of integers
iv(1:5, 7:11)
#> <iv<integer>[5]>
#> [1] [1, 7)  [2, 8)  [3, 9)  [4, 10) [5, 11)

# Interval vector of dates
starts <- as.Date("2019-01-01") + 0:2
ends <- starts + c(2, 5, 10)

iv(starts, ends)
#> <iv<date>[3]>
#> [1] [2019-01-01, 2019-01-03) [2019-01-02, 2019-01-07) [2019-01-03, 2019-01-13)

The neat thing about interval vectors is that they are generic, so you can create them from any comparable type that is supported by vctrs. For example, the integer64 type from the bit64 package:

start <- bit64::as.integer64("900000000000")
end <- start + 1234

iv(start, end)
#> <iv<integer64>[1]>
#> [1] [900000000000, 900000001234)

Or the month-precision year-month-day type from clock:

start <- year_month_day(c(2019, 2020), c(1, 3))
end <- year_month_day(c(2020, 2020), c(2, 6))

iv(start, end)
#> <iv<year_month_day<month>>[2]>
#> [1] [2019-01, 2020-02) [2020-03, 2020-06)

The rest of this vignette explores some of the useful things that you can do with ivs.

Structure

As mentioned above, ivs are created from two parallel vectors representing the starts and ends of the intervals.

x <- iv(1:3, 4:6)
x
#> <iv<integer>[3]>
#> [1] [1, 4) [2, 5) [3, 6)

You can access the starts with iv_start() and the ends with iv_end():

iv_start(x)
#> [1] 1 2 3
iv_end(x)
#> [1] 4 5 6

You can use an iv as a column in a data frame or tibble and it’ll work just fine!

tibble(x = x)
#> # A tibble: 3 × 1
#>           x
#>   <iv<int>>
#> 1    [1, 4)
#> 2    [2, 5)
#> 3    [3, 6)

Right-open intervals

The only interval type that is supported by ivs is a right-open interval, i.e. [a, b). While this might seem restrictive, it rarely ends up being a problem in practice, and often it aligns with the easiest way to express a particular interval. For example, consider an interval that spans the entire day of 2019-01-02. If you wanted to represent this interval with second precision with a right-open interval, you’d do [2019-01-02 00:00:00, 2019-01-03 00:00:00). This nicely captures the exclusive “end” of the interval as the start of the next day. This also means it exactly aligns with the start of the next day’s interval, [2019-01-03 00:00:00, 2019-01-04 00:00:00).

If you wanted to represent this with a closed interval, you might do [2019-01-02 00:00:00, 2019-01-02 23:59:59]. Not only is this a bit awkward, it can also cause issues if the precision changes! Say you wanted to up the precision on this interval from second level precision to millisecond precision. The right-open interval wouldn’t have to change at all since the end of that interval is set to 2019-01-03 00:00:00 and anything before that is fair game. But the closed interval can’t be naively changed from 2019-01-02 23:59:59 to 2019-01-02 23:59:59.000, as you’d lose the 999 milliseconds in that last second. Extra care would have to be taken to set the milliseconds to 2019-01-02 23:59:59.999.
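As a small sketch of the precision argument above (the timestamps are made up for illustration), a right-open day interval built from POSIXct endpoints already covers sub-second values without any change to its end:

```r
library(ivs)

# The whole day of 2019-01-02 as a right-open interval
day <- iv(
  as.POSIXct("2019-01-02 00:00:00", tz = "UTC"),
  as.POSIXct("2019-01-03 00:00:00", tz = "UTC")
)

# Half a second before midnight, and midnight itself
late <- as.POSIXct("2019-01-02 23:59:59", tz = "UTC") + 0.5
midnight <- as.POSIXct("2019-01-03 00:00:00", tz = "UTC")

in_late <- iv_between(late, day)
in_late
#> [1] TRUE

in_midnight <- iv_between(midnight, day)
in_midnight
#> [1] FALSE
```

The exclusive end does the work here: anything strictly before 2019-01-03 00:00:00 is inside the interval, whatever its precision.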

If you still aren’t convinced, I’d encourage you to take a look at these resources that also advocate for right-open intervals:

Empty intervals

In ivs, it is required that start < end when generating an interval vector. This means that intervals like [5, 2) are invalid, and that an “empty” interval like [5, 5) is invalid too. Practically, I’ve found that attempting to allow these ends up resulting in more implementation headaches than anything else, and they don’t end up having very many uses.
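A quick way to see this rule in action is to try both cases. This sketch wraps the calls in try() only so that the script keeps running after each error:

```r
library(ivs)

# Both of these violate `start < end`, so `iv()` raises an error
reversed <- try(iv(5, 2), silent = TRUE)
empty <- try(iv(5, 5), silent = TRUE)

inherits(reversed, "try-error")
#> [1] TRUE
inherits(empty, "try-error")
#> [1] TRUE
```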

Finding overlaps

One of the most compelling reasons to use this package is that it tries to make finding overlapping intervals as easy as possible. iv_locate_overlaps() takes two ivs and returns a data frame containing information about where they overlap. It works somewhat like base::match() in that for each element of needles, it looks for a match in all of haystack. Unlike match(), it actually returns all of the overlaps rather than just the first.

# iv_pairs() is a useful way to create small ivs from individual intervals
needles <- iv_pairs(c(1, 5), c(3, 7), c(10, 12))
needles
#> <iv<double>[3]>
#> [1] [1, 5)   [3, 7)   [10, 12)

haystack <- iv_pairs(c(0, 6), c(13, 15), c(0, 2), c(7, 8), c(4, 5))
haystack
#> <iv<double>[5]>
#> [1] [0, 6)   [13, 15) [0, 2)   [7, 8)   [4, 5)

locations <- iv_locate_overlaps(needles, haystack)
locations
#>   needles haystack
#> 1       1        1
#> 2       1        3
#> 3       1        5
#> 4       2        1
#> 5       2        5
#> 6       3       NA

The $needles column of the result is an integer vector showing where to slice needles to generate the intervals that overlap the intervals in haystack described by the $haystack column. When a needle doesn’t overlap with any intervals in the haystack, an NA location is returned. An easy way to align both needles and haystack using this information is to pass everything to iv_align(), which will automatically perform the slicing and store the results in another data frame:

iv_align(needles, haystack, locations = locations)
#>    needles haystack
#> 1   [1, 5)   [0, 6)
#> 2   [1, 5)   [0, 2)
#> 3   [1, 5)   [4, 5)
#> 4   [3, 7)   [0, 6)
#> 5   [3, 7)   [4, 5)
#> 6 [10, 12) [NA, NA)
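To make the slicing described above concrete, here is a rough sketch of doing it by hand with `[`. It is meant as an illustration of what the location columns represent, not as a description of how iv_align() is implemented:

```r
library(ivs)

needles <- iv_pairs(c(1, 5), c(3, 7), c(10, 12))
haystack <- iv_pairs(c(0, 6), c(13, 15), c(0, 2), c(7, 8), c(4, 5))
locations <- iv_locate_overlaps(needles, haystack)

# Slice each input by its own location column.
# An `NA` location produces a missing interval.
sliced_needles <- needles[locations$needles]
sliced_haystack <- haystack[locations$haystack]

sliced_needles
sliced_haystack
```

Both results have one element per row of locations, so they line up pairwise.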

If you just wanted to know if an interval in needles overlapped any interval in haystack, then you can use iv_overlaps(), which returns a logical vector.

iv_overlaps(needles, haystack)
#> [1]  TRUE  TRUE FALSE

By default, iv_locate_overlaps() will detect if there is any kind of overlap between the two inputs, but there are various other types of overlaps that you can detect. For example, you can check if needles “contains” haystack:

locations <- iv_locate_overlaps(
  needles, 
  haystack, 
  type = "contains", 
  no_match = "drop"
)

iv_align(needles, haystack, locations = locations)
#>   needles haystack
#> 1  [1, 5)   [4, 5)
#> 2  [3, 7)   [4, 5)

I’ve also used no_match = "drop" to drop all of the needles that don’t have any matching overlaps.

You can also check for the reverse, i.e. if needles is “within” the haystack:

locations <- iv_locate_overlaps(
  needles, 
  haystack, 
  type = "within", 
  no_match = "drop"
)

iv_align(needles, haystack, locations = locations)
#>   needles haystack
#> 1  [1, 5)   [0, 6)

Precedes / Follows

Two other functions that are related to iv_locate_overlaps() are iv_locate_precedes() and iv_locate_follows().

# Where does `needles` precede `haystack`?
locations <- iv_locate_precedes(needles, haystack)
locations
#>   needles haystack
#> 1       1        2
#> 2       1        4
#> 3       2        2
#> 4       2        4
#> 5       3        2

This returns a data frame of the same structure as iv_locate_overlaps(), so you can use it with iv_align().

iv_align(needles, haystack, locations = locations)
#>    needles haystack
#> 1   [1, 5) [13, 15)
#> 2   [1, 5)   [7, 8)
#> 3   [3, 7) [13, 15)
#> 4   [3, 7)   [7, 8)
#> 5 [10, 12) [13, 15)

# Where does `needles` follow `haystack`?
locations <- iv_locate_follows(needles, haystack)

iv_align(needles, haystack, locations = locations)
#>    needles haystack
#> 1   [1, 5) [NA, NA)
#> 2   [3, 7)   [0, 2)
#> 3 [10, 12)   [0, 6)
#> 4 [10, 12)   [0, 2)
#> 5 [10, 12)   [7, 8)
#> 6 [10, 12)   [4, 5)

If you are only interested in the closest interval in haystack that the needle precedes or follows, set closest = TRUE.

locations <- iv_locate_follows(
  needles = needles, 
  haystack = haystack, 
  closest = TRUE,
  no_match = "drop"
)

iv_align(needles, haystack, locations = locations)
#>    needles haystack
#> 1   [3, 7)   [0, 2)
#> 2 [10, 12)   [7, 8)

Allen’s Interval Algebra

Maintaining Knowledge about Temporal Intervals is a great paper by James Allen that outlines an interval algebra that completely describes how any two intervals are related to each other (i.e. if one interval precedes, overlaps, or is met-by another interval). The paper describes 13 relations that make up this algebra, which are faithfully implemented in iv_locate_relates() and iv_relates(). These relations are valuable in theory because they are distinct (i.e. any two intervals are related by exactly 1 of the 13 relations), but on their own they are often too restrictive to be practical. iv_locate_overlaps(), iv_locate_precedes(), and iv_locate_follows() each combine several of the individual relations into three broad ideas that I find most useful. If you want to learn more about this, I’d encourage you to read the help documentation for iv_locate_relates().
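As a small sketch of the difference between an individual relation and the broader helpers, consider an interval that ends exactly where another begins. The type names used here ("meets" and "precedes") are assumed to match those listed in the help page for iv_locate_relates():

```r
library(ivs)

x <- iv_pairs(c(1, 5))
y <- iv_pairs(c(5, 7), c(3, 9))

# `x` ends exactly where `[5, 7)` starts, so it "meets" it
meets <- iv_relates(x, y, type = "meets")
meets
#> [1] TRUE

# Allen's "precedes" requires a gap between the intervals
strict <- iv_relates(x, y, type = "precedes")
strict
#> [1] FALSE

# The broader `iv_locate_precedes()` also counts the abutting interval
loc <- iv_locate_precedes(x, y)
loc
#>   needles haystack
#> 1       1        1
```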

Between-ness

Often you just want to know if a vector of values falls between the bounds of an interval. This is particularly common with dates, where you might want to know if a sale you made occurred while any commercial was being run.

sales <- as.Date(c("2019-01-01", "2020-05-10", "2020-06-10"))

commercial_starts <- as.Date(c(
  "2019-10-12", "2020-04-01", "2020-06-01", "2021-05-10"
))
commercial_ends <- commercial_starts + 90

commercials <- iv(commercial_starts, commercial_ends)

sales
#> [1] "2019-01-01" "2020-05-10" "2020-06-10"
commercials
#> <iv<date>[4]>
#> [1] [2019-10-12, 2020-01-10) [2020-04-01, 2020-06-30) [2020-06-01, 2020-08-30)
#> [4] [2021-05-10, 2021-08-08)

You can check if a sale was made while any commercial was being run with iv_between(), which works like %in% and is similar to iv_overlaps():

tibble(sales = sales) %>%
  mutate(commercial_running = iv_between(sales, commercials))
#> # A tibble: 3 × 2
#>   sales      commercial_running
#>   <date>     <lgl>             
#> 1 2019-01-01 FALSE             
#> 2 2020-05-10 TRUE              
#> 3 2020-06-10 TRUE

You can find the commercials that were airing when the sale was made with iv_locate_between() and iv_align():

iv_align(sales, commercials, locations = iv_locate_between(sales, commercials))
#>      needles                 haystack
#> 1 2019-01-01                 [NA, NA)
#> 2 2020-05-10 [2020-04-01, 2020-06-30)
#> 3 2020-06-10 [2020-04-01, 2020-06-30)
#> 4 2020-06-10 [2020-06-01, 2020-08-30)

If you aren’t looking for the %in%-like behavior of iv_between(), and instead want to detect pairwise whether the i-th value falls within the i-th interval, you can use iv_pairwise_between():

x <- c(1, 5, 10, 12)
x
#> [1]  1  5 10 12

y <- iv_pairs(c(0, 6), c(7, 9), c(10, 12), c(10, 12))
y
#> <iv<double>[4]>
#> [1] [0, 6)   [7, 9)   [10, 12) [10, 12)

iv_pairwise_between(x, y)
#> [1]  TRUE FALSE  TRUE FALSE

Keep in mind that the intervals are half-open, so 12 doesn’t fall within the interval [10, 12)! This is different from dplyr::between(), which includes both bounds.
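A minimal side-by-side comparison of the two behaviors at the right endpoint:

```r
library(ivs)
library(dplyr, warn.conflicts = FALSE)

# `between()` includes both bounds
closed <- between(12, 10, 12)
closed
#> [1] TRUE

# The right-open interval excludes its end
half_open <- iv_pairwise_between(12, iv(10, 12))
half_open
#> [1] FALSE
```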

Counting overlaps

Sometimes you just need the counts of the number of overlaps rather than the actual locations of them. For example, say your business has a subscription service and you’d like to compute a rolling monthly count of the total number of subscriptions that are active (i.e. in January 2019, how many subscriptions were active?). Customers are only allowed to have one subscription active at once, but they may cancel it and reactivate it at any time. If a customer was active at any point during the month, then they are counted in that month.

enrollments <- tribble(
  ~name,      ~start,          ~end,
  "Amy",      "1, Jan, 2017",  "30, Jul, 2018",
  "Franklin", "1, Jan, 2017",  "19, Feb, 2017",
  "Franklin", "5, Jun, 2017",  "4, Feb, 2018",
  "Franklin", "21, Oct, 2018", "9, Mar, 2019",
  "Samir",    "1, Jan, 2017",  "4, Feb, 2017",
  "Samir",    "5, Apr, 2017",  "12, Jun, 2018"
)

# Parse these into "day" precision year-month-day objects
enrollments <- enrollments %>%
  mutate(
    start = year_month_day_parse(start, format = "%d, %b, %Y"),
    end = year_month_day_parse(end, format = "%d, %b, %Y"),
  )

enrollments
#> # A tibble: 6 × 3
#>   name     start      end       
#>   <chr>    <ymd<day>> <ymd<day>>
#> 1 Amy      2017-01-01 2018-07-30
#> 2 Franklin 2017-01-01 2017-02-19
#> 3 Franklin 2017-06-05 2018-02-04
#> 4 Franklin 2018-10-21 2019-03-09
#> 5 Samir    2017-01-01 2017-02-04
#> 6 Samir    2017-04-05 2018-06-12

Even though we have day precision information, we only actually need month precision intervals to answer this question. We’ll use calendar_narrow() from clock to convert our "day" precision dates to "month" precision ones. We’ll also add 1 month to each end to reflect the fact that the end of an interval is open (remember, ivs are half-open), so the month in which a subscription ended is still counted.

enrollments <- enrollments %>%
  mutate(
    start = calendar_narrow(start, "month"),
    end = calendar_narrow(end, "month") + 1L
  )

enrollments
#> # A tibble: 6 × 3
#>   name     start        end         
#>   <chr>    <ymd<month>> <ymd<month>>
#> 1 Amy      2017-01      2018-08     
#> 2 Franklin 2017-01      2017-03     
#> 3 Franklin 2017-06      2018-03     
#> 4 Franklin 2018-10      2019-04     
#> 5 Samir    2017-01      2017-03     
#> 6 Samir    2017-04      2018-07

enrollments <- enrollments %>%
  mutate(active = iv(start, end), .keep = "unused")

enrollments
#> # A tibble: 6 × 2
#>   name                 active
#>   <chr>      <iv<ymd<month>>>
#> 1 Amy      [2017-01, 2018-08)
#> 2 Franklin [2017-01, 2017-03)
#> 3 Franklin [2017-06, 2018-03)
#> 4 Franklin [2018-10, 2019-04)
#> 5 Samir    [2017-01, 2017-03)
#> 6 Samir    [2017-04, 2018-07)

To answer this question, we are going to need to create a sequential vector of months that span the entire range of intervals. This starts at the smallest start and goes to the largest end. Because the end is half-open, there won’t be any hits for that month, so we won’t include it.

bounds <- range(enrollments$active)
lower <- iv_start(bounds[[1]])
upper <- iv_end(bounds[[2]]) - 1L

months <- tibble(month = seq(lower, upper, by = 1))

months
#> # A tibble: 27 × 1
#>    month       
#>    <ymd<month>>
#>  1 2017-01     
#>  2 2017-02     
#>  3 2017-03     
#>  4 2017-04     
#>  5 2017-05     
#>  6 2017-06     
#>  7 2017-07     
#>  8 2017-08     
#>  9 2017-09     
#> 10 2017-10     
#> # … with 17 more rows

Now we need to add a column to months to represent the number of subscriptions that were active in that month. To do this we can use iv_count_between(). It works like iv_between() and iv_locate_between() but returns an integer vector corresponding to the number of times the i-th “needle” value fell between any of the values in the “haystack”.

months %>%
  mutate(count = iv_count_between(month, enrollments$active)) %>%
  print(n = Inf)
#> # A tibble: 27 × 2
#>    month        count
#>    <ymd<month>> <int>
#>  1 2017-01          3
#>  2 2017-02          3
#>  3 2017-03          1
#>  4 2017-04          2
#>  5 2017-05          2
#>  6 2017-06          3
#>  7 2017-07          3
#>  8 2017-08          3
#>  9 2017-09          3
#> 10 2017-10          3
#> 11 2017-11          3
#> 12 2017-12          3
#> 13 2018-01          3
#> 14 2018-02          3
#> 15 2018-03          2
#> 16 2018-04          2
#> 17 2018-05          2
#> 18 2018-06          2
#> 19 2018-07          1
#> 20 2018-08          0
#> 21 2018-09          0
#> 22 2018-10          1
#> 23 2018-11          1
#> 24 2018-12          1
#> 25 2019-01          1
#> 26 2019-02          1
#> 27 2019-03          1

There are also iv_count_overlaps(), iv_count_precedes(), and iv_count_follows() for working with two ivs at once.
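As a brief sketch of these counting variants, here they are applied to the needles and haystack from the overlaps section, redefined so the example stands alone:

```r
library(ivs)

needles <- iv_pairs(c(1, 5), c(3, 7), c(10, 12))
haystack <- iv_pairs(c(0, 6), c(13, 15), c(0, 2), c(7, 8), c(4, 5))

# Number of intervals in `haystack` that each needle overlaps
n_overlaps <- iv_count_overlaps(needles, haystack)
n_overlaps
#> [1] 3 2 0

# Number of intervals in `haystack` that each needle precedes
n_precedes <- iv_count_precedes(needles, haystack)
n_precedes
#> [1] 2 2 1

# Number of intervals in `haystack` that each needle follows
n_follows <- iv_count_follows(needles, haystack)
n_follows
#> [1] 0 1 4
```

These counts match the number of rows per needle in the corresponding iv_locate_*() results, with 0 reported for needles that have no match.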

Grouping by overlaps

One common operation when working with interval vectors is merging all the overlapping intervals within a single interval vector. This removes all the redundant information, while still maintaining the full range covered by the iv. For this, you can use iv_groups() which computes the minimal set of interval “groups” that contain all of the intervals in x.

x <- iv_pairs(c(1, 5), c(5, 7), c(9, 11), c(10, 13), c(12, 13))
x
#> <iv<double>[5]>
#> [1] [1, 5)   [5, 7)   [9, 11)  [10, 13) [12, 13)

iv_groups(x)
#> <iv<double>[2]>
#> [1] [1, 7)  [9, 13)

By default, this also groups abutting intervals, which aren’t considered to overlap but don’t have any values between them either. If you don’t want this, use the abutting argument.

iv_groups(x, abutting = FALSE)
#> <iv<double>[3]>
#> [1] [1, 5)  [5, 7)  [9, 13)

With .by

Grouping overlapping intervals is often a useful way to create a new variable to group on with dplyr’s .by argument. For example, consider the following problem where you have multiple users racking up costs across multiple systems. Each date range represents the period over which the corresponding cost was accrued, and the ranges don’t overlap for a given (user, system) pair.

costs <- tribble(
  ~user, ~system, ~from, ~to, ~cost,
  1L, "a", "2019-01-01", "2019-01-05", 200.5,
  1L, "a", "2019-01-12", "2019-01-13", 15.6,
  1L, "b", "2019-01-03", "2019-01-10", 500.3,
  2L, "a", "2019-01-02", "2019-01-03", 25.6,
  2L, "c", "2019-01-03", "2019-01-04", 30,
  2L, "c", "2019-01-05", "2019-01-07", 66.2
)

costs <- costs %>%
  mutate(
    from = as.Date(from),
    to = as.Date(to)
  ) %>%
  mutate(range = iv(from, to), .keep = "unused")

costs
#> # A tibble: 6 × 4
#>    user system  cost                    range
#>   <int> <chr>  <dbl>               <iv<date>>
#> 1     1 a      200.  [2019-01-01, 2019-01-05)
#> 2     1 a       15.6 [2019-01-12, 2019-01-13)
#> 3     1 b      500.  [2019-01-03, 2019-01-10)
#> 4     2 a       25.6 [2019-01-02, 2019-01-03)
#> 5     2 c       30   [2019-01-03, 2019-01-04)
#> 6     2 c       66.2 [2019-01-05, 2019-01-07)

Now let’s say you don’t care about the system anymore, and instead want to sum up the costs for any overlapping date ranges for a particular user. iv_groups() can give us an idea of what the non-overlapping ranges would be for each user:

costs %>%
  reframe(range = iv_groups(range), .by = user)
#> # A tibble: 4 × 2
#>    user                    range
#>   <int>               <iv<date>>
#> 1     1 [2019-01-01, 2019-01-10)
#> 2     1 [2019-01-12, 2019-01-13)
#> 3     2 [2019-01-02, 2019-01-04)
#> 4     2 [2019-01-05, 2019-01-07)

But how can we sum up the costs? For this, we need to turn to iv_identify_group() which allows us to identify the group that each range falls in. This will give us something to group on so we can sum up the costs.

costs2 <- costs %>%
  mutate(range = iv_identify_group(range), .by = user)

# `range` has been updated with the corresponding group
costs2
#> # A tibble: 6 × 4
#>    user system  cost                    range
#>   <int> <chr>  <dbl>               <iv<date>>
#> 1     1 a      200.  [2019-01-01, 2019-01-10)
#> 2     1 a       15.6 [2019-01-12, 2019-01-13)
#> 3     1 b      500.  [2019-01-01, 2019-01-10)
#> 4     2 a       25.6 [2019-01-02, 2019-01-04)
#> 5     2 c       30   [2019-01-02, 2019-01-04)
#> 6     2 c       66.2 [2019-01-05, 2019-01-07)

# So now we can group on that to summarise the cost
costs2 %>%
  summarise(cost = sum(cost), .by = c(user, range))
#> # A tibble: 4 × 3
#>    user                    range  cost
#>   <int>               <iv<date>> <dbl>
#> 1     1 [2019-01-01, 2019-01-10) 701. 
#> 2     1 [2019-01-12, 2019-01-13)  15.6
#> 3     2 [2019-01-02, 2019-01-04)  55.6
#> 4     2 [2019-01-05, 2019-01-07)  66.2

Minimal ivs

iv_groups() is a critical function in this package because its defaults also produce what is known as a minimal iv. A minimal interval vector has no overlapping intervals, has no abutting intervals, and is ordered on both start and end.

Minimal interval vectors are nice because they cover the range of an interval vector in the most compact form possible. They are also nice to know about because the set operations described in the set operations section below all return minimal interval vectors.
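As a small sketch of what “minimal” means in practice, the following groups an unsorted iv with overlapping and abutting intervals, and then checks that grouping the result a second time changes nothing:

```r
library(ivs)

x <- iv_pairs(c(10, 13), c(1, 5), c(5, 7), c(12, 13))

minimal <- iv_groups(x)
minimal
#> <iv<double>[2]>
#> [1] [1, 7)   [10, 13)

# Grouping an already minimal iv leaves it unchanged
unchanged <- all(iv_groups(minimal) == minimal)
unchanged
#> [1] TRUE
```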

Splitting on endpoints

While iv_groups() generates fewer intervals than you began with, it is sometimes useful to go the other way and generate more intervals by splitting on all the overlapping endpoints. This is what iv_splits() does. Both operations end up generating a result that contains completely disjoint intervals, but they go about it in very different ways.

Let’s look back at our first iv_groups() example:

x <- iv_pairs(c(1, 5), c(5, 7), c(9, 11), c(10, 13), c(12, 13))
x
#> <iv<double>[5]>
#> [1] [1, 5)   [5, 7)   [9, 11)  [10, 13) [12, 13)

Notice that [9, 11) overlaps [10, 13) which in turn overlaps [12, 13). If we looked at the sorted unique values of the endpoints (i.e. c(9, 10, 11, 12, 13)) and then paired these up like [9, 10), [10, 11), [11, 12), [12, 13), then we will have nicely split on the endpoints, generating a disjoint set of intervals that we refer to as “splits”. iv_splits() returns these intervals.

iv_splits(x)
#> <iv<double>[6]>
#> [1] [1, 5)   [5, 7)   [9, 10)  [10, 11) [11, 12) [12, 13)
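The endpoint pairing described above can be sketched by hand for the three overlapping intervals. This is an illustration of the idea only, not how iv_splits() is implemented, and it only works here because the three intervals form one connected run with no gaps:

```r
library(ivs)

# The three overlapping intervals from the example above
g <- iv_pairs(c(9, 11), c(10, 13), c(12, 13))

# Sorted unique endpoints: 9, 10, 11, 12, 13
endpoints <- sort(unique(c(iv_start(g), iv_end(g))))

# Pair each endpoint with the next one
manual <- iv(head(endpoints, -1), tail(endpoints, -1))
manual
#> <iv<double>[4]>
#> [1] [9, 10)  [10, 11) [11, 12) [12, 13)

same <- all(manual == iv_splits(g))
same
#> [1] TRUE
```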

With .by

Splitting an iv into its disjoint pieces is another operation that works nicely with .by. Consider this data set containing details about a number of guests that arrived at your party. You’ve been meticulous, so you’ve got their arrival and departure times logged (don’t ask me why, maybe it’s for COVID-19 contact tracing purposes).

guests <- tibble(
  arrive = as.POSIXct(
    c("2008-05-20 19:30:00", "2008-05-20 20:10:00", "2008-05-20 22:15:00"),
    tz = "UTC"
  ),
  depart = as.POSIXct(
    c("2008-05-20 23:00:00", "2008-05-21 00:00:00", "2008-05-21 00:30:00"),
    tz = "UTC"
  ),
  name = list(
    c("Mary", "Harry"),
    c("Diana", "Susan"),
    "Peter"
  )
)

guests <- unnest(guests, name) %>%
  mutate(iv = iv(arrive, depart), .keep = "unused")

guests
#> # A tibble: 5 × 2
#>   name                                          iv
#>   <chr>                                 <iv<dttm>>
#> 1 Mary  [2008-05-20 19:30:00, 2008-05-20 23:00:00)
#> 2 Harry [2008-05-20 19:30:00, 2008-05-20 23:00:00)
#> 3 Diana [2008-05-20 20:10:00, 2008-05-21 00:00:00)
#> 4 Susan [2008-05-20 20:10:00, 2008-05-21 00:00:00)
#> 5 Peter [2008-05-20 22:15:00, 2008-05-21 00:30:00)

Let’s figure out who was at your party at any given point throughout the night. To do this, we’ll need to break iv up into all possible disjoint intervals that mark either an arrival or departure. Like with iv_groups(), iv_splits() can show us those disjoint intervals, but this doesn’t help us map them back to each guest.

iv_splits(guests$iv)
#> <iv<datetime<UTC>>[5]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> [2] [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> [3] [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> [4] [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> [5] [2008-05-21 00:00:00, 2008-05-21 00:30:00)

Instead, we’ll need iv_identify_splits(), which identifies which of the splits overlap with each of the original intervals and returns a list of the results which works nicely as a list-column. This is a little easier to understand if we first look at a single guest:

# Mary's arrival/departure times
guests$iv[[1]]
#> <iv<datetime<UTC>>[1]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 23:00:00)

# The first start and last end correspond to Mary's original times,
# but we've also broken her stay up by the departures/arrivals of
# everyone else
iv_identify_splits(guests$iv)[[1]]
#> <iv<datetime<UTC>>[3]>
#> [1] [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#> [2] [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#> [3] [2008-05-20 22:15:00, 2008-05-20 23:00:00)

Since this generates a list-column, we’ll also immediately use tidyr::unnest() to expand it out.

guests2 <- guests %>%
  mutate(iv = iv_identify_splits(iv)) %>%
  unnest(iv) %>%
  arrange(iv)

guests2
#> # A tibble: 15 × 2
#>    name                                          iv
#>    <chr>                                 <iv<dttm>>
#>  1 Mary  [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#>  2 Harry [2008-05-20 19:30:00, 2008-05-20 20:10:00)
#>  3 Mary  [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#>  4 Harry [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#>  5 Diana [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#>  6 Susan [2008-05-20 20:10:00, 2008-05-20 22:15:00)
#>  7 Mary  [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#>  8 Harry [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#>  9 Diana [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 10 Susan [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 11 Peter [2008-05-20 22:15:00, 2008-05-20 23:00:00)
#> 12 Diana [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 13 Susan [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 14 Peter [2008-05-20 23:00:00, 2008-05-21 00:00:00)
#> 15 Peter [2008-05-21 00:00:00, 2008-05-21 00:30:00)

Now that we have the splits for each guest, we can group by iv and summarize to figure out who was at the party at any point throughout the night.

guests2 %>%
  summarise(n = n(), who = list(name), .by = iv)
#> # A tibble: 5 × 3
#>                                           iv     n who      
#>                                   <iv<dttm>> <int> <list>   
#> 1 [2008-05-20 19:30:00, 2008-05-20 20:10:00)     2 <chr [2]>
#> 2 [2008-05-20 20:10:00, 2008-05-20 22:15:00)     4 <chr [4]>
#> 3 [2008-05-20 22:15:00, 2008-05-20 23:00:00)     5 <chr [5]>
#> 4 [2008-05-20 23:00:00, 2008-05-21 00:00:00)     3 <chr [3]>
#> 5 [2008-05-21 00:00:00, 2008-05-21 00:30:00)     1 <chr [1]>

Set operations

There are a number of set theoretical operations that you can use on ivs. These include iv_set_complement(), iv_set_union(), iv_set_intersect(), and iv_set_difference().

iv_set_complement() works on a single iv, while all the others work on two ivs at a time. All of these functions return a minimal interval vector. The easiest way to think about these functions is to imagine iv_groups() being called on each of the inputs first (to reduce them down to their minimal form) before applying the operation.

iv_set_complement() computes the set complement of the intervals in a single iv.

x <- iv_pairs(c(1, 3), c(2, 5), c(10, 12), c(13, 15))
x
#> <iv<double>[4]>
#> [1] [1, 3)   [2, 5)   [10, 12) [13, 15)

iv_set_complement(x)
#> <iv<double>[2]>
#> [1] [5, 10)  [12, 13)

By default, iv_set_complement() uses the smallest/largest values of its input as the bounds to compute the complement over, but you can supply bounds explicitly with lower and upper:

iv_set_complement(x, lower = 0, upper = Inf)
#> <iv<double>[4]>
#> [1] [0, 1)    [5, 10)   [12, 13)  [15, Inf)

iv_set_union() takes the union of two ivs. It is essentially a call to vctrs::vec_c() followed by iv_groups(). It answers the question, “Which intervals are in x or y?”

y <- iv_pairs(c(-5, 0), c(1, 4), c(8, 10), c(15, 16))

x
#> <iv<double>[4]>
#> [1] [1, 3)   [2, 5)   [10, 12) [13, 15)
y
#> <iv<double>[4]>
#> [1] [-5, 0)  [1, 4)   [8, 10)  [15, 16)

iv_set_union(x, y)
#> <iv<double>[4]>
#> [1] [-5, 0)  [1, 5)   [8, 12)  [13, 16)
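The description of iv_set_union() as a combine step followed by iv_groups() can be checked directly. This sketch redefines x and y so that it stands alone, and compares the two-step version against the result of iv_set_union():

```r
library(ivs)

x <- iv_pairs(c(1, 3), c(2, 5), c(10, 12), c(13, 15))
y <- iv_pairs(c(-5, 0), c(1, 4), c(8, 10), c(15, 16))

# Combine both inputs, then merge overlapping and abutting intervals
combined <- iv_groups(vctrs::vec_c(x, y))
combined
#> <iv<double>[4]>
#> [1] [-5, 0)  [1, 5)   [8, 12)  [13, 16)

same <- all(combined == iv_set_union(x, y))
same
#> [1] TRUE
```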

iv_set_intersect() takes the intersection of two ivs. It answers the question, “Which intervals are in x and y?”

iv_set_intersect(x, y)
#> <iv<double>[1]>
#> [1] [1, 4)

iv_set_difference() takes the asymmetrical difference of two ivs. It answers the question, “Which intervals are in x but not y?”

iv_set_difference(x, y)
#> <iv<double>[3]>
#> [1] [4, 5)   [10, 12) [13, 15)

Pairwise set operations

The set operations described above all treat x and y as two complete “sets” of intervals and operate on the intervals as a group. Occasionally it is useful to have pairwise equivalents of these operations that, say, take the intersection of the i-th interval of x and the i-th interval of y.

One case in particular comes from combining iv_locate_overlaps() with iv_pairwise_set_intersect(). Here you might want to know not only where two ivs overlap, but also what that intersection was for each value in x.

starts <- as.Date(c("2019-01-05", "2019-01-20", "2019-01-25", "2019-02-01"))
ends <- starts + c(5, 10, 3, 5)
x <- iv(starts, ends)

starts <- as.Date(c("2019-01-02", "2019-01-23"))
ends <- starts + c(5, 6)
y <- iv(starts, ends)

x
#> <iv<date>[4]>
#> [1] [2019-01-05, 2019-01-10) [2019-01-20, 2019-01-30) [2019-01-25, 2019-01-28)
#> [4] [2019-02-01, 2019-02-06)
y
#> <iv<date>[2]>
#> [1] [2019-01-02, 2019-01-07) [2019-01-23, 2019-01-29)

iv_set_intersect() isn’t very useful to answer this particular question, because it first merges all overlapping intervals in each input.

iv_set_intersect(x, y)
#> <iv<date>[2]>
#> [1] [2019-01-05, 2019-01-07) [2019-01-23, 2019-01-29)

Instead, we can find the overlaps and align them, and then pairwise intersect the results:

locations <- iv_locate_overlaps(x, y, no_match = "drop")
overlaps <- iv_align(x, y, locations = locations)

overlaps %>%
  mutate(intersect = iv_pairwise_set_intersect(needles, haystack))
#>                    needles                 haystack                intersect
#> 1 [2019-01-05, 2019-01-10) [2019-01-02, 2019-01-07) [2019-01-05, 2019-01-07)
#> 2 [2019-01-20, 2019-01-30) [2019-01-23, 2019-01-29) [2019-01-23, 2019-01-29)
#> 3 [2019-01-25, 2019-01-28) [2019-01-23, 2019-01-29) [2019-01-25, 2019-01-28)

Note that the pairwise set operations come with a number of restrictions that limit their usage in many cases. For example, iv_pairwise_set_intersect() requires that x[i] and y[i] overlap, otherwise they would result in an empty interval, which isn’t allowed.

iv_pairwise_set_intersect(iv(1, 5), iv(6, 9))
#> Error in `iv_pairwise_set_intersect()`:
#> ! Can't take the intersection of non-overlapping intervals.
#> ℹ This would result in an empty interval.
#> ℹ Location 1 contains non-overlapping intervals.

See the documentation page of iv_pairwise_set_intersect() for a complete list of restrictions for all of the pairwise set operations.
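One way to stay within this restriction, sketched here with made-up intervals, is to check which pairs overlap with iv_pairwise_overlaps() first and only intersect those:

```r
library(ivs)

x <- iv_pairs(c(1, 5), c(1, 5))
y <- iv_pairs(c(3, 9), c(6, 9))

# Which pairs overlap? Only the first one does.
ok <- iv_pairwise_overlaps(x, y)
ok
#> [1]  TRUE FALSE

# Intersect only the overlapping pairs
result <- iv_pairwise_set_intersect(x[ok], y[ok])
result
#> <iv<double>[1]>
#> [1] [3, 5)
```

This is the same idea as the iv_locate_overlaps() and iv_align() approach shown earlier, just applied to inputs that are already aligned pairwise.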

Missing intervals

Missing intervals are allowed in ivs. You can generate them by supplying vectors to iv() or iv_pairs() that contain missing values in either input.

x <- iv_pairs(c(1, 5), c(3, NA), c(NA, 3))
x
#> <iv<double>[3]>
#> [1] [1, 5)   [NA, NA) [NA, NA)

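The defaults of all functions in ivs treat missing intervals in one of two ways: match-like operations, such as iv_locate_overlaps() and iv_set_intersect(), treat missing intervals as overlapping other missing intervals, while pairwise operations, such as iv_pairwise_set_intersect(), treat missing intervals as infectious, so the result is missing wherever either input is missing.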

y <- iv_pairs(c(NA, NA), c(0, 2))
y
#> <iv<double>[2]>
#> [1] [NA, NA) [0, 2)

# Match-like operations treat missing intervals as overlapping
iv_locate_overlaps(x, y)
#>   needles haystack
#> 1       1        2
#> 2       2        1
#> 3       3        1

iv_set_intersect(x, y)
#> <iv<double>[2]>
#> [1] [1, 2)   [NA, NA)

# Pairwise operations treat missing intervals as infectious
z <- iv_pairs(c(1, 2), c(1, 4))

iv_pairwise_set_intersect(y, z)
#> <iv<double>[2]>
#> [1] [NA, NA) [1, 2)