Using mdbplyr

Introduction

This repository contains mdbplyr, an R package that provides a disciplined, lazy dplyr-style interface for MongoDB aggregation pipelines.

mdbplyr sits between raw mongolite usage and broader compatibility layers. Compared with writing aggregation JSON by hand, it lets you express supported queries with tidy verbs and inspect the generated pipeline before execution. Compared with approaches that try to hide MongoDB behind table-like semantics, it stays explicit about scope, keeps translation conservative, and fails clearly when a verb or expression is outside the supported subset.

The practical advantages are:

Usage examples

Loading data

Suppose there is a running MongoDB instance on localhost with default port and no authentication. The code below loads dplyr::starwars into a collection named starwars.

library(dplyr)
library(mongolite)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_collection$drop()
starwars_collection$insert(dplyr::starwars)

starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars)
)

Once the collection is loaded, starwars_tbl is the lazy table used in the examples below.

library(dplyr)
library(mdbplyr)

starwars_collection <- mongolite::mongo(
  collection = "starwars",
  db = "mdbplyr"
)

starwars_tbl <- tbl_mongo(starwars_collection) %>% 
  infer_schema()

Schema

mdbplyr uses a schema to know which fields are available in a collection. This matters especially for:

The most reliable approach is to pass schema = ... explicitly when creating the lazy table:

starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = c("name", "species", "height", "mass", "homeworld")
)

When writing the schema by hand is inconvenient, infer_schema() can populate it from the first document in the collection:

starwars_tbl <- tbl_mongo(starwars_collection) |>
  infer_schema()

This is convenient for exploratory work, but it has an important limitation: it only sees one document. If the collection is heterogeneous, fields that do not appear in the first document may still need to be added manually.

You can inspect the currently known fields with:

schema_fields(starwars_tbl)

Basic inspection

Inspect the known schema and the generated pipeline without executing the query.

schema_fields(starwars_tbl)

starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  show_query()

Streaming with cursor()

When you want a MongoDB iterator instead of an eagerly materialized tibble, open the lazy query with cursor().

iter <- starwars_tbl |>
  filter(species == "Human", height > 180) |>
  select(name, height, mass) |>
  cursor()

iter$page(10)

Supported verbs

Each subsection below shows one of the supported dplyr-like verbs on the starwars collection.

filter()

starwars_tbl |>
  filter(species == "Droid", height > 100) |>
  collect()

select()

starwars_tbl |>
  select(name, species, homeworld) |>
  collect()

Selecting dotted paths preserves nested MongoDB structure by default. It does not flatten nested fields unless you explicitly ask for that:

sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  collect()

rename()

starwars_tbl |>
  rename(character_name = name, planet = homeworld) |>
  collect()

mutate()

starwars_tbl |>
  mutate(height_m = height / 100, bmi_like = mass / (height_m * height_m)) |>
  select(name, height, mass, height_m, bmi_like) |>
  collect()

transmute()

starwars_tbl |>
  transmute(name = name, height_m = height / 100) |>
  collect()

arrange()

starwars_tbl |>
  arrange(desc(height), name) |>
  select(name, height) |>
  slice_head(n = 10) |>
  collect()

group_by()

starwars_tbl |>
  group_by(species)

summarise()

starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    avg_height = mean(height),
    max_mass = max(mass)
  ) |>
  arrange(desc(n)) |>
  collect()

slice_head()

starwars_tbl |>
  select(name, species) |>
  slice_head(n = 5) |>
  collect()

slice_tail()

starwars_tbl |>
  select(name, species) |>
  slice_tail(n = 5) |>
  collect()

flatten_fields()

Use flatten_fields() when you explicitly want nested object leaves to become flat tibble columns. By default the output names are the schema dot paths.

sensor_tbl |>
  select(`message.timestamp`, `message.measurements`) |>
  flatten_fields() |>
  collect()

You can also target a specific nested root and optionally rename the flattened output columns:

sensor_tbl |>
  flatten_fields(
    `message.measurements`,
    names_fn = function(x) gsub(".", "_", x, fixed = TRUE)
  ) |>
  collect()

unwind_array()

Use unwind_array() when a document field contains an array and you want one output row per array element.

orders_tbl |>
  unwind_array(items) |>
  collect()

If array elements are nested objects, unwind_array() and flatten_fields() can be chained:

orders_tbl |>
  unwind_array(items) |>
  flatten_fields(items) |>
  collect()

New in v0.4.0

The features in this section were added in mdbplyr v0.4.0. Each heading is marked (New from v0.4.0) so it is easy to tell them apart from the longer-standing subset above.

Server version awareness (New from v0.4.0)

mdbplyr now records the MongoDB server version when it can probe it, and some features are gated on a minimum version (for example median() needs 7.0+ and window functions need 5.0+). When the version cannot be determined it is left unknown and the feature is allowed through, so the server still has the final say.

# Probed automatically from a live connection, or set it explicitly:
starwars_tbl <- tbl_mongo(
  starwars_collection,
  schema = names(dplyr::starwars),
  server_version = "7.0"
)

mongo_server_version(starwars_tbl)

Computed group_by() keys (New from v0.4.0)

Grouping is no longer limited to bare field names. A named computed key is translated into the MongoDB $group._id.

starwars_tbl |>
  group_by(height_band = floor(height / 50) * 50) |>
  summarise(n = n(), avg_mass = mean(mass)) |>
  arrange(height_band) |>
  collect()

More summarise() accumulators (New from v0.4.0)

summarise() gained sd(), var(), first(), last(), and n_distinct(), alongside the original n(), sum(), mean(), min(), and max().

starwars_tbl |>
  group_by(species) |>
  summarise(
    n = n(),
    sd_height = sd(height),
    var_mass = var(mass),
    first_name = first(name),
    distinct_worlds = n_distinct(homeworld)
  ) |>
  collect()

median() and quantile(probs = ...) are also supported, but compile to $percentile and therefore require MongoDB 7.0+:

starwars_tbl |>
  group_by(species) |>
  summarise(
    median_height = median(height),
    p90_mass = quantile(mass, probs = 0.9)
  ) |>
  collect()

tidyselect helpers in select() (New from v0.4.0)

When the schema is known, select() accepts name-based tidyselect helpers (starts_with(), ends_with(), contains(), matches(), everything(), all_of(), any_of()), ranges, and negation. (where() is intentionally not supported, because column types are unknown without reading data.)

starwars_tbl |>
  select(name, starts_with("home"), ends_with("_color")) |>
  collect()

# Drop columns with negation, or select from a character vector:
starwars_tbl |>
  select(-ends_with("_color")) |>
  collect()

cols <- c("name", "height", "mass")
starwars_tbl |>
  select(all_of(cols)) |>
  collect()

Pure bare-name selections still work exactly as before, including nested root paths such as `message.measurements`.

coalesce() (New from v0.4.0)

coalesce() compiles to $ifNull, returning the first non-missing value.

starwars_tbl |>
  mutate(world = coalesce(homeworld, "unknown")) |>
  select(name, world) |>
  collect()

across() (New from v0.4.0)

across() works in both mutate() and summarise(). It expands a column selection and a function spec into one assignment per (column, function) pair. .fns accepts a bare function name, a ~ lambda (using .x), or a named list of those, and .names glue ({.col}, {.fn}) controls the output names.

# One aggregate over several columns
starwars_tbl |>
  group_by(species) |>
  summarise(across(c(height, mass), mean, na.rm = TRUE)) |>
  collect()

# A named list of aggregates -> {.col}_{.fn} output names
starwars_tbl |>
  group_by(species) |>
  summarise(across(c(height, mass), list(avg = mean, max = max))) |>
  collect()
# A lambda applied to each selected column, with custom output names
starwars_tbl |>
  mutate(across(c(height, mass), ~ .x / 100, .names = "{.col}_scaled")) |>
  select(name, height_scaled, mass_scaled) |>
  collect()

Window functions (New from v0.4.0)

Window functions in mutate() / transmute() compile to MongoDB $setWindowFields and therefore require MongoDB 5.0+. They partition by the current group_by() keys.

Ranking functions take the column to order by (optionally wrapped in desc()):

starwars_tbl |>
  group_by(species) |>
  mutate(height_rank = min_rank(desc(height))) |>
  select(name, species, height, height_rank) |>
  collect()

Cumulative and offset windows take their order from a preceding arrange() (and the stage reorders the output by that key):

starwars_tbl |>
  arrange(height) |>
  mutate(cumulative_mass = cumsum(mass)) |>
  select(name, height, mass, cumulative_mass) |>
  collect()
starwars_tbl |>
  arrange(height) |>
  mutate(
    prev_height = lag(height),
    next_height = lead(height, default = 0)
  ) |>
  select(name, height, prev_height, next_height) |>
  collect()

row_number() is available as an alias for the existing 1:n() row-numbering.

Joins (New from v0.4.0)

inner_join(), left_join(), semi_join(), and anti_join() compile to MongoDB $lookup against another collection in the same database. The right-hand side must be a plain tbl_mongo (a collection reference with a known schema and no lazy operations).

films_tbl <- tbl_mongo(
  mongolite::mongo(collection = "films", db = "mdbplyr"),
  schema = c("film_id", "title", "director")
)

A mutating join flattens the match by default (one row per matched pair, with colliding columns suffixed):

appearances_tbl |>
  inner_join(films_tbl, by = "film_id") |>
  collect()

left_join() keeps unmatched left rows. Filtering joins keep the left columns and only test whether a match exists:

appearances_tbl |>
  semi_join(films_tbl, by = c("movie_id" = "film_id")) |>
  collect()

Pass unnest = FALSE to keep the matched documents as a native nested array column instead of flattening them:

appearances_tbl |>
  left_join(films_tbl, by = "film_id", unnest = FALSE) |>
  collect()

Notes

The examples above stay within the currently supported subset:

If a query falls outside that subset, mdbplyr is designed to fail explicitly rather than guess or silently change execution semantics.