---
title: "Introduction-to-rPandas"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{introduction-to-rPandas}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(eval=FALSE,
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup, include=FALSE}
library(rPandas)
```


## Introduction
rPandas is an R package designed to serve as a translation layer, allowing R users to leverage the power and speed of the Python pandas library without ever leaving their R session.

For R users familiar with dplyr or data.table, the rPandas syntax will feel natural. The package provides a set of R functions (e.g., rp_filter(), rp_select()) that:

Capture R code (like carat > 1).

Translate it into an equivalent Python/pandas command string.

Execute the Python code in the background via the reticulate package.

Return the resulting data.frame back to your R session.

This allows you to write R code while the data processing is handled by Python's pandas library.

## Setup and Installation
### Python Dependencies
rPandas depends on a working Python installation and the pandas library. The reticulate package handles the R‑to‑Python connection.

### Quick installation (default environment)
You can install pandas into a dedicated R environment by running:

```{r}
reticulate::py_install("pandas")
```
#### Troubleshooting the Python environment

If you encounter issues (e.g., “pandas not found”), rPandas provides a built‑in health check:

```{r}
# Run this if you have connection issues
rp_check_env()
```

The health check prints the Python path reticulate is using and whether `pandas` is installed.

##### Finding and selecting the correct Python environment

You can list all available conda environments (if you use Conda) with:

```{r}
reticulate::conda_list()
```

Once you identify the desired environment (e.g., the first one), you can tell reticulate to use it **before loading rPandas**:

```{r}
# Replace with the path from conda_list()
reticulate::use_python(python = reticulate::conda_list()$python[1], required = TRUE)

# Or, if you prefer to use a conda environment by name:
reticulate::use_condaenv("your_environment_name", required = TRUE)
```

For system Python installations, simply provide the path to the Python binary:

```{r}
reticulate::use_python("/usr/local/bin/python3", required = TRUE)
```

After setting the environment, rerun `rp_check_env()` to confirm everything is ready. For more detailed guidance, see the [reticulate documentation](https://rstudio.github.io/reticulate/).

> **Note:** The code chunks above are not executed when building the vignette (they are meant to be run interactively by the user).

## The Core Verbs
All examples will use the diamonds dataset from the ggplot2 package.

```{r eval=TRUE}
# Make sure ggplot2 is installed to access the data
data(diamonds, package = "ggplot2")
head(diamonds)

```
**1. rp_filter():**
This verb filters rows based on a logical expression, similar to dplyr::filter().
It supports common logical operators: & (AND), | (OR), ! (NOT), %in%, and %notin%

```{r} 
# Simple condition
v1 <- rp_filter(diamonds, carat > 1)
print(head(v1))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  1.17 Very Good     J      I1  60.2    61  2774 6.83 6.90 4.13
#> 2  1.01   Premium     F      I1  61.8    60  2781 6.39 6.36 3.94
#> 3  1.01      Fair     E      I1  64.5    58  2788 6.29 6.21 4.03
#> 4  1.01   Premium     H     SI2  62.7    59  2788 6.31 6.22 3.93
#> 5  1.05 Very Good     J     SI2  63.2    56  2789 6.49 6.45 4.09
#> 6  1.05      Fair     J     SI2  65.8    59  2789 6.41 6.27 4.18

# AND: multiple conditions
v2 <- rp_filter(diamonds, carat > 1 & cut == "Ideal")
print(head(v2))
#>   carat   cut color clarity depth table price    x    y    z
#> 1  1.01 Ideal     I      I1  61.5    57  2844 6.45 6.46 3.97
#> 2  1.02 Ideal     H     SI2  61.6    55  2856 6.49 6.43 3.98
#> 3  1.02 Ideal     I      I1  61.7    56  2872 6.44 6.49 3.99
#> 4  1.02 Ideal     J     SI2  60.3    54  2879 6.53 6.50 3.93
#> 5  1.01 Ideal     I      I1  61.5    57  2896 6.46 6.45 3.97
#> 6  1.02 Ideal     I      I1  61.7    56  2925 6.49 6.44 3.99

# OR: use | (pipe)
v3 <- rp_filter(diamonds, color == "D" | color == "E")
print(head(v3))
#>   carat     cut color clarity depth table price    x    y    z
#> 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.22    Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
#> 5  0.20 Premium     E     SI2  60.2    62   345 3.79 3.75 2.27
#> 6  0.32 Premium     E      I1  60.9    58   345 4.38 4.42 2.68


# NOT: use !
v4 <- rp_filter(diamonds, !(price > 10000))
print(head(v4))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48


# %in% operator
v5 <- rp_filter(diamonds, color %in% c("D", "E", "F"))
print(head(v5))
#>   carat     cut color clarity depth table price    x    y    z
#> 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.22    Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
#> 5  0.22 Premium     F     SI1  60.4    61   342 3.88 3.84 2.33
#> 6  0.20 Premium     E     SI2  60.2    62   345 3.79 3.75 2.27

# %notin% (if implemented)
v6 <- rp_filter(diamonds, color %notin% c("D", "E", "F"))
print(head(v6))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 2  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 3  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
#> 4  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
#> 5  0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
#> 6  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
```

**2. rp_select():**
This verb selects specific columns by name, similar to dplyr::select().

```{r}
# Select three columns
v4 <- rp_select(diamonds, carat, cut, price)
print(head(v4))
#>   carat       cut price
#> 1  0.23     Ideal   326
#> 2  0.21   Premium   326
#> 3  0.23      Good   327
#> 4  0.29   Premium   334
#> 5  0.31      Good   335
#> 6  0.24 Very Good   336
```

**3. rp_sort():**
This verb sorts the data frame by one or more columns, similar to dplyr::arrange().
```{r}
# Sort by price (ascending by default)
v8 <- rp_sort(diamonds, price)
print(head(v8))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47

# Use desc() to sort in descending order
v9 <- rp_sort(diamonds, desc(price))
print(head(v9))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  2.29   Premium     I     VS2  60.8    60 18823 8.50 8.47 5.16
#> 2  2.00 Very Good     G     SI1  63.5    56 18818 7.90 7.97 5.04
#> 3  1.51     Ideal     G      IF  61.7    55 18806 7.37 7.41 4.56
#> 4  2.07     Ideal     G     SI2  62.5    55 18804 8.20 8.13 5.11
#> 5  2.00 Very Good     H     SI1  62.8    57 18803 7.95 8.00 5.01
#> 6  2.29   Premium     I     SI1  61.8    59 18797 8.52 8.45 5.24

# Sort by multiple columns
v10 <- rp_sort(diamonds, cut, desc(price))
print(head(v10))
#>   carat  cut color clarity depth table price     x     y    z
#> 1  2.01 Fair     G     SI1  70.6    64 18574  7.43  6.64 4.69
#> 2  2.02 Fair     H     VS2  64.5    57 18565  8.00  7.95 5.14
#> 3  4.50 Fair     J      I1  65.8    58 18531 10.23 10.16 6.72
#> 4  2.00 Fair     G     VS2  67.6    58 18515  7.65  7.61 5.16
#> 5  2.51 Fair     H     SI2  64.7    57 18308  8.44  8.50 5.48
#> 6  3.01 Fair     I     SI2  65.8    56 18242  8.99  8.94 5.90
```


**4. rp_mutate():**
This verb creates new columns or modifies existing ones, similar to dplyr::mutate().
You can also remove columns using the to_remove argument.
```{r}

# Create a new column
v11 <- rp_mutate(diamonds, price_per_carat = price / carat)
print(head(v11))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
#>   price_per_carat
#> 1        1417.391
#> 2        1552.381
#> 3        1421.739
#> 4        1151.724
#> 5        1080.645
#> 6        1400.000

# Create multiple columns
v12 <- rp_mutate(
  diamonds, 
  price_per_carat = price / carat,
  depth_pct = depth / 100
)
print(head(v12))
#>   carat       cut color clarity depth table price    x    y    z
#> 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
#>   price_per_carat depth_pct
#> 1        1417.391     0.615
#> 2        1552.381     0.598
#> 3        1421.739     0.569
#> 4        1151.724     0.624
#> 5        1080.645     0.633
#> 6        1400.000     0.628

# Remove one or more columns
v13 <- rp_mutate(diamonds, to_remove = c("table", "depth"))
print(head(v13))
#>   carat       cut color clarity price    x    y    z
#> 1  0.23     Ideal     E     SI2   326 3.95 3.98 2.43
#> 2  0.21   Premium     E     SI1   326 3.89 3.84 2.31
#> 3  0.23      Good     E     VS1   327 4.05 4.07 2.31
#> 4  0.29   Premium     I     VS2   334 4.20 4.23 2.63
#> 5  0.31      Good     J     SI2   335 4.34 4.35 2.75
#> 6  0.24 Very Good     J    VVS2   336 3.94 3.96 2.48
```


**5. rp_summarize():**
This verb collapses a data frame into a summary, often after grouping. It is similar to dplyr::summarise().
```{r}

# Summarize the entire data frame
v14 <- rp_summarize(diamonds, avg_price = mean(price), max_carat = max(carat))
print(v14)
#>    price carat
#> 1 3932.8   NaN
#> 2    NaN  5.01


# Group by one column (unquoted)
v15 <- rp_summarize(diamonds, avg_price = mean(price), .by = cut)
print(v15)
#>         cut avg_price
#> 1      Fair  4358.758
#> 2      Good  3928.864
#> 3 Very Good  3981.760
#> 4   Premium  4584.258
#> 5     Ideal  3457.542

# Group by multiple columns (unquoted)
v16 <- rp_summarize(
  diamonds, 
  avg_price = mean(price), 
  count = n(),
  .by = c(cut, color)
)
print(head(v16))
#>    cut color avg_price count
#> 1 Fair     D  4291.061   163
#> 2 Fair     E  3682.312   224
#> 3 Fair     F  3827.003   312
#> 4 Fair     G  4239.255   314
#> 5 Fair     H  5135.683   303
#> 6 Fair     I  4685.446   175

# Grouping also accepts quoted column names
v17 <- rp_summarize(
  diamonds, 
  avg_price = mean(price), 
  .by = c("cut", "color")
)
print(head(v17))
#>    cut color avg_price
#> 1 Fair     D  4291.061
#> 2 Fair     E  3682.312
#> 3 Fair     F  3827.003
#> 4 Fair     G  4239.255
#> 5 Fair     H  5135.683
#> 6 Fair     I  4685.446
```

**6. rp_calculate()– Apply Multiple Functions to Multiple Columns**
This verb is a powerful alternative to rp_summarize. It applies the same set of functions (e.g., "mean", "sd") to every selected column.
```{r}

# Apply two functions to two columns, grouped by 'cut'
v13 <- rp_calculate(
  diamonds,
  price, carat,
  the.functions = c("mean", "sd"),
  .by = cut
)
print(head(v13))
#>         cut price.mean price.std carat.mean carat.std
#> 1      Fair   4358.758  3560.387  1.0461366 0.5164043
#> 2      Good   3928.864  3681.590  0.8491847 0.4540544
#> 3 Very Good   3981.760  3935.862  0.8063814 0.4594354
#> 4   Premium   4584.258  4349.205  0.8919549 0.5152616
#> 5     Ideal   3457.542  3808.401  0.7028370 0.4328763
```

**7. rp_first_k_rows() and rp_last_k_rows() – First/Last K Rows per Group**
These verbs extract the first or last k rows from the whole data frame, or from each group when .by is provided.
```{r} 
# First 3 rows overall
v19 <- rp_first_k_rows(diamonds, k = 3)
print(v19)
#>   carat     cut color clarity depth table price    x    y    z
#> 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
#> 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
#> 3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31

# Last 2 rows per group (cut and clarity)
v20 <- rp_last_k_rows(diamonds, k = 2, .by = c(cut, clarity))
print(head(v20))
#>   carat     cut color clarity depth table price    x    y    z
#> 1  0.70    Fair     J    VVS1  67.6    54  1691 5.56 5.41 3.71
#> 2  0.50    Fair     D    VVS1  65.9    64  1792 4.92 5.03 3.28
#> 3  0.52    Fair     F      IF  64.6    58  2144 5.04 5.17 3.30
#> 4  0.47    Fair     D      IF  60.6    60  2211 5.09 4.98 3.05
#> 5  0.55    Good     F      IF  60.8    60  2266 5.26 5.36 3.23
#> 6  0.54 Premium     F      IF  61.9    60  2391 5.26 5.21 3.24

# Both quoted and unquoted group specifications work
v21 <- rp_first_k_rows(diamonds, k = 1, .by = c("cut", "color"))
print(v21)
#>    carat       cut color clarity depth table price    x    y    z
#> 1   0.23     Ideal     E     SI2  61.5  55.0   326 3.95 3.98 2.43
#> 2   0.21   Premium     E     SI1  59.8  61.0   326 3.89 3.84 2.31
#> 3   0.23      Good     E     VS1  56.9  65.0   327 4.05 4.07 2.31
#> 4   0.29   Premium     I     VS2  62.4  58.0   334 4.20 4.23 2.63
#> 5   0.31      Good     J     SI2  63.3  58.0   335 4.34 4.35 2.75
#> 6   0.24 Very Good     J    VVS2  62.8  57.0   336 3.94 3.96 2.48
#> 7   0.24 Very Good     I    VVS1  62.3  57.0   336 3.95 3.98 2.47
#> 8   0.26 Very Good     H     SI1  61.9  55.0   337 4.07 4.11 2.53
#> 9   0.22      Fair     E     VS2  65.1  61.0   337 3.87 3.78 2.49
#> 10  0.23     Ideal     J     VS1  62.8  56.0   340 3.93 3.90 2.46
#> 11  0.22   Premium     F     SI1  60.4  61.0   342 3.88 3.84 2.33
#> 12  0.30     Ideal     I     SI2  62.0  54.0   348 4.31 4.34 2.68
#> 13  0.30      Good     I     SI2  63.3  56.0   351 4.26 4.30 2.71
#> 14  0.23 Very Good     E     VS2  63.8  55.0   352 3.85 3.92 2.48
#> 15  0.23 Very Good     G    VVS2  60.4  58.0   354 3.97 4.01 2.41
#> 16  0.23 Very Good     D     VS2  60.5  61.0   357 3.96 3.97 2.40
#> 17  0.23 Very Good     F     VS1  60.9  57.0   357 3.96 3.99 2.42
#> 18  0.23      Good     F     VS1  58.2  59.0   402 4.06 4.08 2.37
#> 19  0.31      Good     H     SI1  64.0  54.0   402 4.29 4.31 2.75
#> 20  0.26      Good     D     VS2  65.2  56.0   403 3.99 4.02 2.61
#> 21  0.23     Ideal     G     VS1  61.9  54.0   404 3.93 3.95 2.44
#> 22  0.22   Premium     D     VS2  59.3  62.0   404 3.91 3.88 2.31
#> 23  0.30   Premium     J     SI2  59.3  61.0   405 4.43 4.38 2.61
#> 24  0.30     Ideal     D     SI1  62.5  57.0   552 4.29 4.32 2.69
#> 25  0.31   Premium     G     SI1  61.8  58.0   553 4.35 4.32 2.68
#> 26  0.30   Premium     H     SI1  62.9  59.0   554 4.28 4.24 2.68
#> 27  0.96      Fair     F     SI2  66.3  62.0  2759 6.27 5.95 4.07
#> 28  0.81     Ideal     F     SI2  58.8  57.0  2761 6.14 6.11 3.60
#> 29  0.91      Fair     H     SI2  64.4  57.0  2763 6.11 6.09 3.93
#> 30  0.77     Ideal     H     VS2  62.0  56.0  2763 5.89 5.86 3.64
#> 31  0.72      Good     G     VS2  59.7  60.5  2776 5.80 5.84 3.47
#> 32  0.84      Fair     G     SI1  55.1  67.0  2782 6.39 6.20 3.47
#> 33  1.05      Fair     J     SI2  65.8  59.0  2789 6.41 6.27 4.18
#> 34  0.90      Fair     I     SI1  67.3  59.0  2804 5.93 5.84 3.96
#> 35  0.75      Fair     D     SI2  64.6  57.0  2848 5.74 5.72 3.70

```

**8. rp_count() – Count Rows (Overall or by Group)**
This verb returns the number of rows in the data frame, optionally by groups.
```{r}
# Total row count
v22 <- rp_count(diamonds)
print(v22)
#>       n
#> 1 53940

# Count per group
v23 <- rp_count(diamonds, .by = cut)
print(v23)
#>         cut     n
#> 1      Fair  1610
#> 2      Good  4906
#> 3 Very Good 12082
#> 4   Premium 13791
#> 5     Ideal 21551

# Count per combination of multiple groups
v24 <- rp_count(diamonds, .by = c(cut, color))
print(head(v24))
#>    cut color   n
#> 1 Fair     D 163
#> 2 Fair     E 224
#> 3 Fair     F 312
#> 4 Fair     G 314
#> 5 Fair     H 303
#> 6 Fair     I 175
```

## Chaining Verbs with the Pipe
All rPandas functions are "pipe-friendly" and use .data as their first argument, allowing you to chain operations together using magrittr's %>% pipe (or the native R |>).

```{r}
# Load the pipe
v25 <- diamonds |>
  rp_filter(carat > 1 & color == "D") |>
  rp_mutate(price_per_carat = price / carat) |>
  rp_summarize(avg_ppc = mean(price_per_carat), .by = cut) |>
  rp_sort(desc(avg_ppc))

print(head(v25))
#>         cut  avg_ppc
#> 1     Ideal 7546.163
#> 2 Very Good 6789.316
#> 3   Premium 6548.397
#> 4      Good 5784.918
#> 5      Fair 5414.87
```

## Viewing the Python Code
A key feature of rPandas is its role as a learning and translation tool. Every user-facing verb has a return.as argument that lets you inspect the Python code it generates.

You can set return.as to:

**"result" (default):** Returns the final R data frame.

**"code":** Returns the generated Python command as a string.

**"all":** Returns a list containing both the result and the code.

```{r}
# See the code for a simple filter
rp_filter(diamonds, carat > 1 & price < 400, return.as = "code")
#> [1] "df.query('(carat > 1) and (price < 400)')"

# See the code for a mutate
rp_mutate(diamonds, ppc = price / carat, return.as = "code")
#> [1] "df.assign(ppc = lambda x: (x['price'] / x['carat']))"

# See the code for a complex summary
rp_summarize(
  diamonds,
  avg_price = mean(price), 
  count = n(),
  .by = c(cut, color),
  return.as = "code"
)
#> [1] "df.groupby(['cut', 'color'], as_index=False, observed=True).agg(avg_price = ('price', 'mean'), count = ('price', 'size'))"
```

## Customizing the Generated Code with table_name
By default, when you request return.as = "code", the generated Python code uses the placeholder rpandas_df_in as the DataFrame name. This is because the function only receives the 
data object, not its name. However, you can provide a custom name using the table_name argument (available in all verbs). This is especially useful when you want to copy‑paste the 
code into a Python script or notebook.
```{r}
# Default placeholder
rp_filter(diamonds, carat > 1, return.as = "code")
#> [1] "df.query('carat > 1')"

# With custom table name
rp_filter(diamonds, carat > 1, table_name = "diamonds", return.as = "code")
#> [1] "diamonds.query('carat > 1')"

```
The output changes from:
**"rpandas_df_in.query('(carat > 1)')"** to: **"diamonds.query('(carat > 1)')"**.
This works for all verbs – rp_select, rp_mutate, rp_summarize, etc. Simply pass table_name = "your_data_frame_name" as an argument.