orderly bundles

If you have an orderly report that takes a very long time, or needs to run in parallel, you might need to send it to run on another computer. There are a number of ways of achieving this. The simplest might be to clone the source tree to another computer, run the reports there and use one of a number of possible approaches to sync the outputs between computers. However, there will be cases where that is not ideal, and you want to move much less data around.

This vignette describes a way of parcelling together all dependencies of an orderly report into a zip file (a “bundle”) that can be distributed to another machine (e.g., via scp, rsync or a shared file system), run there, and returned. It does not provide a transparent approach to using high-performance computing with orderly as we feel that the specific circumstances are too varied to support this directly.

Overview

Using orderly bundles relies on a few assumptions and conventions.

First, we assume that you will be running your exported report on another machine (otherwise you would have access to the orderly tree) and that your report takes really quite a long time.

Second, we assume you have your own way of getting the bundled reports to your other machine and the completed bundles back again — we expect that the details here will be specific to your needs and situation and that the overhead of doing this will be trivial compared with the cost of running the report.

Third, we assume that you will deal with all issues around queuing, locking and fault tolerance. From orderly’s point of view work will be exported from orderly and at this point you’re in control - we expect it to come back computed at some point, though we do not enforce that.

Fourth, that the machine running the report can be trusted to actually run the report - if you have set up an orderly server that is safe from interactively run reports, don’t allow importing from anyone’s laptops if you want to preserve this.

Fifth, that you trust your remote machine with your data, and that the remote machine trusts your orderly archive enough to run arbitrary code on it.
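As an illustration of the third assumption, here is a minimal sketch of one way a worker might "claim" a bundle so that two workers do not run the same one. It is not part of orderly: the function name claim_bundle and the idea of a separate "running" directory are hypothetical. It relies on a file rename within one file system usually being atomic, which may not hold on some network file systems.

```r
# Try to claim a bundle by moving it into `running_dir` (a hypothetical
# directory used only by this sketch). Returns the new path on success, or
# NULL if the move failed, for example because another worker moved the
# file first.
claim_bundle <- function(bundle_path, running_dir) {
  dir.create(running_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(running_dir, basename(bundle_path))
  # file.rename() returns FALSE (with a warning) if the source has gone
  ok <- suppressWarnings(file.rename(bundle_path, dest))
  if (ok) dest else NULL
}
```

A worker would then only run bundles for which claim_bundle() returns a path, and skip the rest.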

The process

We will use the orderly demo example, and pack up the use_dependency report.

path <- orderly::orderly_example("demo")

The use_dependency report depends on the other report, which we run and commit first:

id <- orderly::orderly_run("other", parameters = list(nmin = 0),
                           echo = FALSE, root = path)
## [ name       ]  other
## [ id         ]  20210922-102117-69521688
## [ sources    ]  functions.R
## [ parameter  ]  nmin: 0
## [ start      ]  2021-09-22 10:21:17
## [ data       ]  source => extract: 20 x 2
## [ parameter  ]  nmin: 0
## [ end        ]  2021-09-22 10:21:17
## [ elapsed    ]  Ran report in 0.04658222 secs
## [ artefact   ]  summary.csv: 3fac8347e152c84c96e6676413c718b7
## [ ...        ]  graph.png: 1adf6a0f2e70dd8c25d3d8ba702d299e
orderly::orderly_commit(id, root = path)
## [ commit     ]  other/20210922-102117-69521688
## [ copy       ]
## [ import     ]  other:20210922-102117-69521688
## [ success    ]  :)
## [1] "/tmp/RtmpLyoPhu/file3f3d14ee64cc7/archive/other/20210922-102117-69521688"

We need a place that we’ll put the bundles:

path_bundles <- tempfile()

Now we can pack up use_dependency so that it can be run elsewhere:

bundle <- orderly::orderly_bundle_pack(path_bundles, "use_dependency",
                                       root = path)
## [ name       ]  use_dependency
## [ id         ]  20210922-102117-ab4ad8a0
## [ depends    ]  other@20210922-102117-69521688:summary.csv -> incoming.csv
## [ start      ]  2021-09-22 10:21:17
## [ bundle pack ]  20210922-102117-ab4ad8a0
bundle
## $id
## [1] "20210922-102117-ab4ad8a0"
## 
## $path
## [1] "/tmp/RtmpLyoPhu/file3f3d118cfa1e7/20210922-102117-ab4ad8a0.zip"

orderly_bundle_pack has created a zip file. The format of this file is internal to orderly (it will likely change and will at some point become resistant to tampering), but contains:

zip::zip_list(bundle$path)
##                                      filename compressed_size uncompressed_size
## 1                   20210922-102117-ab4ad8a0/               0                 0
## 2              20210922-102117-ab4ad8a0/meta/               0                 0
## 3    20210922-102117-ab4ad8a0/meta/config.rds            1461              1456
## 4      20210922-102117-ab4ad8a0/meta/info.rds            3422              3417
## 5  20210922-102117-ab4ad8a0/meta/manifest.rds             286               281
## 6   20210922-102117-ab4ad8a0/meta/session.rds           21888             21883
## 7              20210922-102117-ab4ad8a0/pack/               0                 0
## 8  20210922-102117-ab4ad8a0/pack/incoming.csv             542               888
## 9   20210922-102117-ab4ad8a0/pack/orderly.yml             214               358
## 10     20210922-102117-ab4ad8a0/pack/script.R             175               220
##              timestamp permissions    crc32 offset
## 1  2021-09-22 09:21:16         755 00000000      0
## 2  2021-09-22 09:21:16         755 00000000     55
## 3  2021-09-22 09:21:16         644 04ce8ea6    115
## 4  2021-09-22 09:21:16         644 1cfc1843   1662
## 5  2021-09-22 09:21:16         644 f652c2d9   5168
## 6  2021-09-22 09:21:16         644 1caad66a   5542
## 7  2021-09-22 09:21:16         755 00000000  27517
## 8  2021-09-22 09:21:16         644 e3c70810  27577
## 9  2021-09-22 09:21:16         644 93fda3f3  28207
## 10 2021-09-22 09:21:16         644 58329b6b  28508

The subdirectory pack contains the report working directory, all code and dependencies, etc, while meta contains additional information required to run the report. In particular the incoming.csv file in the pack directory contains the dependency imported from other.
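If you want a quick sanity check on a transferred bundle before running it, something like the following sketch could be applied to the filename column of zip::zip_list(). The function name bundle_layout_ok is hypothetical, and because the bundle format is internal to orderly and may change, this only mirrors the layout shown in the listing above.

```r
# Check that a character vector of zip entry names has a single top-level
# directory containing both "meta" and "pack" entries. This reflects the
# listing above only; it is not an official description of the format.
bundle_layout_ok <- function(filenames) {
  filenames <- filenames[nzchar(filenames)]
  parts <- strsplit(filenames, "/", fixed = TRUE)
  top <- unique(vapply(parts, function(p) p[[1]], character(1)))
  if (length(top) != 1) {
    return(FALSE)
  }
  # Second path component of each entry, where there is one
  second <- unlist(lapply(parts, function(p) if (length(p) >= 2) p[[2]]))
  all(c("meta", "pack") %in% second)
}
```

For example, bundle_layout_ok(zip::zip_list(bundle$path)$filename) would be expected to return TRUE for the bundle created above.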

Then copy this zip file somewhere else to run it (details vary based on your system, and moving the file is not necessary to run it, though it will be the most likely situation).
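How you move the file is up to you. If the destination is reachable as an ordinary path (for example a shared file system), a minimal base R sketch might look like the following. The function name copy_bundle is hypothetical and not part of orderly; the checksum comparison is just one way of confirming that the copy arrived intact.

```r
# Copy a bundle zip into `dest_dir` and confirm that the copy is
# byte-identical to the original by comparing MD5 checksums.
copy_bundle <- function(bundle_path, dest_dir) {
  stopifnot(file.exists(bundle_path))
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dest_dir, basename(bundle_path))
  # overwrite = FALSE: refuse to replace a bundle that is already there
  ok <- file.copy(bundle_path, dest, overwrite = FALSE)
  if (!ok) {
    stop("Failed to copy bundle to ", dest)
  }
  src_md5 <- unname(tools::md5sum(bundle_path))
  dest_md5 <- unname(tools::md5sum(dest))
  if (!identical(src_md5, dest_md5)) {
    stop("Checksum mismatch after copying ", basename(bundle_path))
  }
  dest
}
```

With the example above this would be called as copy_bundle(bundle$path, dest_dir), where dest_dir is whatever incoming directory your other machine reads from.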

Once the files have been moved we can run it with:

workdir <- tempfile()
res <- orderly::orderly_bundle_run(bundle$path, workdir)
## [ start      ]  2021-09-22 10:21:17
## 
## > d <- read.csv("incoming.csv", stringsAsFactors = FALSE)
## 
## > png("graph.png")
## 
## > par(mar = c(15, 4, 0.5, 0.5))
## 
## > barplot(setNames(d$number, d$name), las = 2)
## 
## > dev.off()
## png 
##   2 
## 
## > info <- orderly::orderly_run_info()
## 
## > saveRDS(info, "info.rds")
## [ end        ]  2021-09-22 10:21:17
## [ elapsed    ]  Ran report in 0.01021791 secs
## [ artefact   ]  graph.png: 1adf6a0f2e70dd8c25d3d8ba702d299e
## [ ...        ]  info.rds: 53a86dfc9d8c9828b7926a8eec9271ff

Here workdir is the directory in which you want the report to be run. This can be the same directory that contains the incoming zip file if you want, but that will make it harder to tell which bundles have already been run.
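One way to keep track of what has been run is to keep incoming bundles and results in separate directories and compare file names. The sketch below assumes that the result zip has the same file name as the incoming bundle (that is, the report id followed by .zip); check res$path in your own setup before relying on this. The function name pending_bundles is hypothetical.

```r
# List bundle zip files in `incoming_dir` that have no same-named zip in
# `done_dir`. ASSUMPTION: a completed bundle keeps the file name of the
# incoming bundle ("<id>.zip").
pending_bundles <- function(incoming_dir, done_dir) {
  incoming <- list.files(incoming_dir, pattern = "\\.zip$")
  done <- list.files(done_dir, pattern = "\\.zip$")
  file.path(incoming_dir, setdiff(incoming, done))
}
```

Each path returned could then be passed to orderly::orderly_bundle_run with done_dir as the workdir.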

This creates another zip file, which this time contains the results of running the report.

The result can be imported into orderly by calling orderly::orderly_bundle_import with the path to the zip file:

orderly::orderly_bundle_import(res$path, root = path)
## [ import     ]  use_dependency:20210922-102117-ab4ad8a0

The copy of use_dependency is now in the archive and can be used like any other orderly report:

orderly::orderly_list_archive(path)
##             name                       id
## 1          other 20210922-102117-69521688
## 2 use_dependency 20210922-102117-ab4ad8a0
orderly::orderly_graph("other", root = path)
## other [20210922-102117-69521688]
## └──use_dependency [20210922-102117-ab4ad8a0]

Limitations