--- title: "Getting started with crawlee" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with crawlee} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ```{r setup} library(crawlee) ``` ## The mental model crawlee mirrors the architecture of [Crawlee](https://crawlee.dev) in pure R. A **crawler** owns: - a **request queue** — a deduplicating, resumable list of URLs to visit; - one or more **handlers** — functions run on each fetched page; - a **dataset** — the structured records your handlers produce. You build a crawler with `crawler()` and configure it with `cr_*` verbs that compose through the native pipe (`|>`). ## A minimal crawl ```{r, eval = FALSE} resultado <- crawler("https://example.com") |> cr_options(delay = 0.5, max_depth = 2) |> cr_use_http() |> cr_on_html(function(ctx) { ctx$push_data(list( url = ctx$request$url, titulo = ctx$page |> rvest::html_element("h1") |> rvest::html_text2() )) ctx$enqueue_links() }) |> cr_run() |> cr_collect() ``` ## The handler context Every handler receives a context object, conventionally named `ctx`: | Element | Description | |---------|-------------| | `ctx$request` | The current request (`url`, `label`, `depth`, ...). | | `ctx$response` | The raw `httr2` response. | | `ctx$page` | The parsed page (`xml_document`) for HTML/XML, else `NULL`. | | `ctx$push_data(data)` | Append a record (list or data frame) to the dataset. | | `ctx$enqueue_links(...)` | Discover and enqueue links from the page. | | `ctx$log` | Logging helpers (`info()`, `success()`, `warn()`, `error()`). | ### Controlling link discovery `enqueue_links()` accepts `glob`, `include`/`exclude` patterns and a `same_domain` flag (on by default), so you only follow the links you care about: ```{r, eval = FALSE} ctx$enqueue_links( glob = "*/blog/*", exclude = "*/tag/*", label = "article" ) ``` Requests enqueued with a `label` are routed to the matching handler registered with `cr_on_html(..., label = "article")`. ## Reproducibility The request queue deduplicates URLs by a normalised key (see `cr_normalize_url()`), so the same page is never fetched twice and crawls are deterministic. Persistent, resumable storage backends (DuckDB, Parquet) are on the roadmap. ```