---
title: "ONNX Model Import"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{ONNX Model Import}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(eval = FALSE, collapse = TRUE, comment = "#>")
library(ggmlR)
```

ggmlR includes a built-in, zero-dependency ONNX loader: a hand-written protobuf
parser in C. It loads compatible ONNX models and runs inference on the CPU or a
Vulkan GPU, with no Python, TensorFlow, or ONNX Runtime required.

> **Note:** The examples below require a valid `.onnx` model file.
> Replace `"path/to/model.onnx"` with the actual path on your system.

```{r, eval=FALSE}
library(ggmlR)
```

---

## 1. Load and inspect a model

```{r, eval=FALSE}
model <- onnx_load("path/to/model.onnx")

# Model summary (layers, ops, parameters)
onnx_summary(model)

# Input tensor info (name, shape, dtype)
onnx_inputs(model)
```

---

## 2. Run inference

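Inputs are passed as a named list of R arrays. Each name should match an input
name reported by `onnx_inputs(model)`; `input_name` below is a placeholder.
Image inputs are in NCHW order (matching the ONNX model's expected layout).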

```{r, eval=FALSE}
# Random image batch — replace with real data
input <- array(runif(1 * 3 * 224 * 224), dim = c(1L, 3L, 224L, 224L))

result <- onnx_run(model, list(input_name = input))

cat("Output shape:", paste(dim(result[[1]]), collapse = " x "), "\n")
```
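Image readers in R usually return height x width x channel arrays, so they need reordering before they fit the NCHW layout described above. The following is a minimal base R sketch; `to_nchw()` is a hypothetical helper, not part of ggmlR, and it only rearranges the dimensions as R sees them.

```{r, eval=FALSE}
# Hypothetical helper (not part of ggmlR): reorder an H x W x C image
# into a 1 x C x H x W array.
to_nchw <- function(img) {
  stopifnot(length(dim(img)) == 3L)
  chw <- aperm(img, c(3L, 1L, 2L))   # H x W x C -> C x H x W
  array(chw, dim = c(1L, dim(chw)))  # prepend a batch dimension of 1
}

# Stand-in image data; replace with a decoded, normalised image
img   <- array(runif(224 * 224 * 3), dim = c(224L, 224L, 3L))
input <- to_nchw(img)
dim(input)
```

Element `input[1, c, h, w]` equals `img[h, w, c]`, so channel planes stay intact. Any mean/std normalisation the model expects still has to be applied separately.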

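For models with multiple inputs, add one named entry per input to the list:

```{r, eval=FALSE}
# Placeholder token ids; in practice these come from the model's tokenizer
tokens <- c(101L, 7592L, 2088L, 102L)

result <- onnx_run(model, list(
  input_ids      = array(as.integer(tokens), dim = c(1L, length(tokens))),
  attention_mask = array(1L, dim = c(1L, length(tokens)))
))
```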

---

## 3. GPU inference

By default ggmlR tries Vulkan first and falls back to CPU automatically.
To force a specific backend:

```{r, eval=FALSE}
# Check what's available
if (ggml_vulkan_available()) {
  cat("Vulkan GPU ready\n")
  ggml_vulkan_status()
}

# Load with explicit device
model_gpu <- onnx_load("path/to/model.onnx", device = "vulkan")
model_cpu <- onnx_load("path/to/model.onnx", device = "cpu")
```

Weights are transferred to the GPU once at load time. Repeated calls to
`onnx_run()` do not re-transfer weights.
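Because the weight transfer happens once at load time, a fair backend comparison times only repeated runs, ideally after a few untimed warm-up calls. The sketch below uses base R only; `time_runs()` is a hypothetical helper, and the workload is a stand-in so the code runs without a model.

```{r, eval=FALSE}
# Hypothetical helper (not part of ggmlR): time `n` calls to `fn`
# after `warmup` untimed calls.
time_runs <- function(fn, n = 10L, warmup = 2L) {
  for (i in seq_len(warmup)) fn()
  elapsed <- numeric(n)
  for (i in seq_len(n)) {
    t0 <- proc.time()[["elapsed"]]
    fn()
    elapsed[i] <- proc.time()[["elapsed"]] - t0
  }
  elapsed  # seconds per call
}

# Stand-in workload. With a loaded model, pass something like
# function() onnx_run(model_gpu, list(input_name = input)) instead.
timings <- time_runs(function() sum(sqrt(seq_len(1e5))), n = 5L)
summary(timings)
```

Running the same helper on `model_gpu` and `model_cpu` gives per-call timings that can be compared directly.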

---

## 4. Dynamic input shapes

Some models accept variable-length inputs. To fix a concrete shape for such an
input, pass `input_shapes` at load time:

```{r, eval=FALSE}
model <- onnx_load("path/to/bert.onnx",
                    input_shapes = list(input_ids = c(1L, 128L)))
```
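Once a sequence length is fixed this way, each input presumably has to match it, which means padding or truncating token sequences. The following is a base R sketch under that assumption; `pad_tokens()` is a hypothetical helper, and the pad id of `0L` is an assumption that depends on the tokenizer.

```{r, eval=FALSE}
# Hypothetical helper (not part of ggmlR): pad or truncate token ids to
# `max_len` and build the matching attention mask (1 = real token, 0 = pad).
pad_tokens <- function(tokens, max_len, pad_id = 0L) {
  n    <- min(length(tokens), max_len)
  ids  <- rep(as.integer(pad_id), max_len)
  mask <- integer(max_len)
  ids[seq_len(n)]  <- as.integer(tokens[seq_len(n)])
  mask[seq_len(n)] <- 1L
  list(
    input_ids      = array(ids,  dim = c(1L, max_len)),
    attention_mask = array(mask, dim = c(1L, max_len))
  )
}

# Placeholder token ids, padded to the length fixed at load time
inputs <- pad_tokens(c(101L, 2023L, 102L), max_len = 128L)
dim(inputs$input_ids)
```

The resulting list has the same structure as the multi-input example in section 2, so it can be passed as the second argument to `onnx_run()` if the model's input names are `input_ids` and `attention_mask`.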

---

## 5. FP16 inference

Run in half-precision for faster GPU inference:

```{r, eval=FALSE}
model_fp16 <- onnx_load("path/to/model.onnx", dtype = "f16")
result <- onnx_run(model_fp16, list(input_name = input))  # input_name: placeholder, as in section 2
```
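Half precision carries roughly three significant decimal digits, so it is worth checking how far FP16 outputs drift from an FP32 reference. The sketch below is base R only; `compare_outputs()` is a hypothetical helper, and `signif(ref, 3)` is a crude stand-in for FP16 rounding so the code runs without a model.

```{r, eval=FALSE}
# Hypothetical helper (not part of ggmlR): summarise the difference between
# a reference output and a lower-precision output of the same shape.
compare_outputs <- function(ref, test) {
  stopifnot(identical(dim(ref), dim(test)))
  abs_err <- abs(ref - test)
  c(
    max_abs    = max(abs_err),
    max_rel    = max(abs_err / pmax(abs(ref), 1e-6)),
    top1_match = as.numeric(which.max(ref) == which.max(test))
  )
}

# Stand-in data; with real models, use the first output of the FP32 and
# FP16 runs on the same input.
ref    <- array(c(0.1234567, 2.345678, -1.456789, 0.0012345), dim = c(1L, 4L))
approx <- signif(ref, 3)
stats  <- compare_outputs(ref, approx)
stats
```

For classifiers, an unchanged top-1 index is often the more meaningful check; acceptable absolute error depends on the model.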

---

## 6. Supported operators

ggmlR supports 50+ ONNX operators, including:

- **Convolution:** Conv, ConvTranspose, MaxPool, AveragePool, GlobalAveragePool
- **Linear:** Gemm, MatMul, Linear
- **Activations:** Relu, Sigmoid, Tanh, Gelu, HardSigmoid, Mish, Clip, Elu
- **Normalization:** BatchNormalization, LayerNormalization, GroupNormalization
- **Shape ops:** Reshape, Transpose, Flatten, Squeeze, Unsqueeze, Concat, Split, Slice, Gather, ScatterElements
- **Elementwise:** Add, Sub, Mul, Div, Pow, Sqrt, Exp, Log, Abs, Neg
- **Reduction:** ReduceMean, ReduceSum, ReduceMax
- **Attention:** Attention (fused), MultiHeadAttention
- **Quantized:** QLinearConv, QLinearMatMul, DynamicQuantizeLinear
- **Other:** Cast, Pad, Resize, Dropout (identity at inference), LSTM, GRU, Einsum

Custom fused ops: **RelPosBias2D** (BoTNet).
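Given a model's operator types as a character vector, the comparison against this list can be sketched in base R. The `supported_ops` vector is copied by hand from the list above and `unsupported_ops()` is a hypothetical helper; because the list is not exhaustive, an operator flagged here may still be supported.

```{r, eval=FALSE}
# Operators named in the list above (copied by hand, not queried from ggmlR)
supported_ops <- c(
  "Conv", "ConvTranspose", "MaxPool", "AveragePool", "GlobalAveragePool",
  "Gemm", "MatMul", "Linear",
  "Relu", "Sigmoid", "Tanh", "Gelu", "HardSigmoid", "Mish", "Clip", "Elu",
  "BatchNormalization", "LayerNormalization", "GroupNormalization",
  "Reshape", "Transpose", "Flatten", "Squeeze", "Unsqueeze", "Concat",
  "Split", "Slice", "Gather", "ScatterElements",
  "Add", "Sub", "Mul", "Div", "Pow", "Sqrt", "Exp", "Log", "Abs", "Neg",
  "ReduceMean", "ReduceSum", "ReduceMax",
  "Attention", "MultiHeadAttention",
  "QLinearConv", "QLinearMatMul", "DynamicQuantizeLinear",
  "Cast", "Pad", "Resize", "Dropout", "LSTM", "GRU", "Einsum",
  "RelPosBias2D"
)

# Hypothetical helper: operator types that do not appear in the list
unsupported_ops <- function(model_ops) {
  setdiff(unique(model_ops), supported_ops)
}

# Example op list; obtain the real one from the model by whatever means
# you have (see the debugging tips in section 8)
missing_ops <- unsupported_ops(c("Conv", "Relu", "Conv", "NonMaxSuppression"))
missing_ops
```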

---

## 7. Examples

For full working examples with real ONNX Zoo models see:

```{r, eval=FALSE}
# GPU vs CPU benchmark across multiple models
# inst/examples/benchmark_onnx.R

# FP16 inference benchmark
# inst/examples/benchmark_onnx_fp16.R

# Run all supported ONNX Zoo models
# inst/examples/test_all_onnx.R

# BERT sentence similarity
# inst/examples/bert_similarity.R
```

---

## 8. Debugging tips

If a model fails to load or produces wrong results:

1. **Check operator support:** print the model's op list with Python's
   `onnx` package and compare it against the operator list in section 6.

2. **Verify protobuf field numbers:** the built-in parser is hand-written;
   an unexpected field can cause silent mis-parsing.

3. **NaN tracing:** use the eval callback for per-node inspection rather
   than scanning intermediate tensors after compute. By then their buffers
   may have been reused, so a post-compute scan gives false readings.

4. **Repeated-run aliasing:** `ggml_backend_sched` aliases intermediate
   buffers over weight buffers. ggmlR calls `sched_alloc_and_load()` before
   each compute to reset allocation. If you see correct results on the first
   run but garbage on subsequent runs, this aliasing is the likely cause.
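
A coarse first check for the last two symptoms can be done on final outputs alone. The sketch below assumes `onnx_run()` returns a list of numeric arrays, as the `result[[1]]` indexing in section 2 suggests. `all_finite()` and `check_repeatable()` are hypothetical helpers, and `fake_run()` stands in for a real inference call. This does not replace per-node tracing with the eval callback.

```{r, eval=FALSE}
# Hypothetical helper: TRUE for each output that contains no NaN/Inf/NA
all_finite <- function(outputs) {
  vapply(outputs, function(x) all(is.finite(x)), logical(1))
}

# Hypothetical helper: call `fn` n times and report whether every result
# matches the first one (within all.equal()'s default tolerance)
check_repeatable <- function(fn, n = 3L) {
  first <- fn()
  for (i in seq_len(n - 1L)) {
    if (!isTRUE(all.equal(first, fn()))) return(FALSE)
  }
  TRUE
}

# Stand-in for function() onnx_run(model, list(input_name = input))
fake_run <- function() list(logits = array(c(0.1, 0.7, 0.2), dim = c(1L, 3L)))

finite_ok     <- all_finite(fake_run())
repeatable_ok <- check_repeatable(fake_run)
```

If the outputs are finite on the first run but `check_repeatable()` returns `FALSE` for a deterministic model with identical inputs, tip 4 is the place to start.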