---
title: "Working with Quantized Models"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Working with Quantized Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
```

## Introduction

Quantization reduces model size and memory usage by representing weights with
fewer bits. GGML supports various quantization formats that can significantly
reduce memory requirements while maintaining acceptable accuracy.

## Supported Data Types

ggmlR supports the following data types:
```{r data-types}
library(ggmlR)

# Standard floating point
GGML_TYPE_F32   # 32-bit float (4 bytes per element)
GGML_TYPE_F16   # 16-bit float (2 bytes per element)

# Integer
GGML_TYPE_I32   # 32-bit integer

# Quantized types
GGML_TYPE_Q4_0  # 4-bit quantization, type 0
GGML_TYPE_Q4_1  # 4-bit quantization, type 1
GGML_TYPE_Q8_0  # 8-bit quantization
```

## Memory Savings

Quantization provides significant memory savings:
```{r memory-comparison}
ctx <- ggml_init(64 * 1024 * 1024)

# Create tensors of same logical size with different types
n <- 1000000  # 1M elements

f32_tensor <- ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n)
f16_tensor <- ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n)
q8_tensor  <- ggml_new_tensor_1d(ctx, GGML_TYPE_Q8_0, n)
q4_tensor  <- ggml_new_tensor_1d(ctx, GGML_TYPE_Q4_0, n)

cat("Memory usage for", n, "elements:\n")
cat("  F32:", ggml_nbytes(f32_tensor) / 1024^2, "MB\n")
cat("  F16:", ggml_nbytes(f16_tensor) / 1024^2, "MB\n")
cat("  Q8_0:", ggml_nbytes(q8_tensor) / 1024^2, "MB\n")
cat("  Q4_0:", ggml_nbytes(q4_tensor) / 1024^2, "MB\n")

ggml_free(ctx)
```

## Quantization Functions

### Initialize Quantization Tables

Before quantizing, initialize the quantization tables:
```{r init-quant}
# Initialize quantization (required before first use)
ggml_quantize_init(GGML_TYPE_Q4_0)
ggml_quantize_init(GGML_TYPE_Q8_0)
```

### Quantize Data

Use `ggml_quantize_chunk()` to quantize floating-point data:
```{r quantize-data}
ctx <- ggml_init(16 * 1024 * 1024)

# Create source data (F32)
n <- 256  # Must be multiple of block size (32 for Q4_0)
src <- ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n)
ggml_set_f32(src, rnorm(n))

# Extract numeric data from tensor
src_data <- ggml_get_f32(src)

# Quantize to Q4_0
quantized <- ggml_quantize_chunk(
  type = GGML_TYPE_Q4_0,
  src = src_data,
  nrows = 1,
  n_per_row = n
)

cat("Original size:", length(src_data) * 4, "bytes\n")  # F32 = 4 bytes
cat("Quantized size:", length(quantized), "bytes\n")
cat("Compression ratio:", round(ggml_nbytes(src) / length(quantized), 1), "x\n")

ggml_free(ctx)
```

### Dequantize Data

To convert quantized data back to float:
```{r dequantize}
# Q4_0 dequantization
q4_data <- quantized  # From previous example
dequantized <- dequantize_row_q4_0(q4_data, n)

# Compare with original
error <- mean(abs(src_data - dequantized))
cat("Mean absolute error:", error, "\n")
```

## Block Sizes and Alignment

Quantized types have specific block sizes:
```{r block-info}
# Get block information for quantized types
q4_info <- ggml_quant_block_info(GGML_TYPE_Q4_0)
cat("Q4_0 block size:", q4_info$blck_size, "elements\n")
cat("Q4_0 type size:", q4_info$type_size, "bytes per block\n")

q8_info <- ggml_quant_block_info(GGML_TYPE_Q8_0)
cat("Q8_0 block size:", q8_info$blck_size, "elements\n")
cat("Q8_0 type size:", q8_info$type_size, "bytes per block\n")

# Check if type is quantized
cat("\nIs Q4_0 quantized?", ggml_is_quantized(GGML_TYPE_Q4_0), "\n")
cat("Is F32 quantized?", ggml_is_quantized(GGML_TYPE_F32), "\n")
```

## Using Quantized Tensors in Computations

GGML automatically handles dequantization during computation:
```{r compute-quantized}
ctx <- ggml_init(32 * 1024 * 1024)

# Create quantized weight matrix (e.g., for neural network)
weight_rows <- 256
weight_cols <- 128

# In practice, you would load pre-quantized weights
# Here we create F32 weights and the computation handles mixed types
weights <- ggml_new_tensor_2d(ctx, GGML_TYPE_F32, weight_cols, weight_rows)
input <- ggml_new_tensor_1d(ctx, GGML_TYPE_F32, weight_cols)

# Matrix-vector multiplication works with mixed types
output <- ggml_mul_mat(ctx, weights, input)

graph <- ggml_build_forward_expand(ctx, output)

# Initialize data
ggml_set_f32(weights, rnorm(weight_rows * weight_cols, sd = 0.1))
ggml_set_f32(input, rnorm(weight_cols))

ggml_graph_compute(ctx, graph)

cat("Output shape:", ggml_tensor_shape(output), "\n")
cat("Output sample:", head(ggml_get_f32(output), 5), "\n")

ggml_free(ctx)
```

## Available Dequantization Functions

ggmlR provides dequantization for all GGML quantized types:
```{r dequant-functions}
# Standard quantization
# dequantize_row_q4_0()  - 4-bit, type 0
# dequantize_row_q4_1()  - 4-bit, type 1
# dequantize_row_q5_0()  - 5-bit, type 0
# dequantize_row_q5_1()  - 5-bit, type 1
# dequantize_row_q8_0()  - 8-bit, type 0

# K-quants (better quality)
# dequantize_row_q2_K()  - 2-bit K-quant
# dequantize_row_q3_K()  - 3-bit K-quant
# dequantize_row_q4_K()  - 4-bit K-quant
# dequantize_row_q5_K()  - 5-bit K-quant
# dequantize_row_q6_K()  - 6-bit K-quant
# dequantize_row_q8_K()  - 8-bit K-quant

# I-quants (importance matrix)
# dequantize_row_iq2_xxs(), dequantize_row_iq2_xs(), dequantize_row_iq2_s()
# dequantize_row_iq3_xxs(), dequantize_row_iq3_s()
# dequantize_row_iq4_nl(), dequantize_row_iq4_xs()

# Special types
# dequantize_row_tq1_0()  - Ternary quantization
# dequantize_row_tq2_0()
```

## Importance Matrix Quantization

Some quantization types require an importance matrix for better quality:
```{r imatrix}
# Check if type requires importance matrix
cat("Q4_0 requires imatrix:", ggml_quantize_requires_imatrix(GGML_TYPE_Q4_0),
    "\n")

# I-quants typically require importance matrix for best results
# The imatrix captures which weights are most important for model quality
```

## Cleanup

Always free quantization resources when done:
```{r cleanup}
# Free quantization tables
ggml_quantize_free()
```

## Performance Considerations

### When to Use Quantization

- **Large models**: Quantization is essential for running large language models
- **Memory-constrained environments**: Reduce memory footprint by 2-8x
- **Inference**: Quantization is primarily used for inference, not training

### Choosing Quantization Type

| Type | Bits | Quality | Speed | Use Case |
|------|------|---------|-------|----------|
| Q8_0 | 8 | High | Fast | When quality matters |
| Q4_K | 4 | Good | Fast | Balanced choice |
| Q4_0 | 4 | Medium | Fastest | Maximum compression |
| Q2_K | 2 | Lower | Fast | Extreme compression |

### Tips

1. **Start with Q4_K or Q5_K** for a good balance of quality and size
2. **Use Q8_0** when quality is critical
3. **Test accuracy** after quantization on your specific use case
4. **Align tensor sizes** to block sizes for optimal performance

## See Also

- `vignette("vulkan-backend")` for GPU acceleration
- `vignette("multi-gpu")` for distributed inference