Introduction to R package copulaSim

What is the usage of the R package copulaSim?

This package serves to simulate jointly distributed patient-level data from historical data based on the copula invariance property.

Why do we need to perform virtual patient simulation?

To consistently optimize clinical trial designs and data analysis methods through trial simulation, we need to simulate multivariate mixed-type virtual patient data independent of designs and analysis methods under evaluation. To make the outcome of optimization more realistic, we should utilize relevant empirical patient-level data when it is available.

The challenges faced when simulating small data

When simulating small empirical data, the underlying marginal distributions and their dependence structure cannot be understood or verified thoroughly due to the limited sample size.

Solution: copula invariance property

To resolve this issue, we use the copula invariance property to generate the joint distribution without making a strong parametric assumption. The theoretical background is addressed below.

Sklar’s Theorem

The idea of copula was first introduced by Dr. Abe Sklar in 1959 in the field of statistics. He proposed a theorem, which is later called Sklar’s theorem. This theorem essentially consists of two parts. First, the copula function can be used to describe the relationship between the joint and marginal distributions. This function assigns the value of joint distribution to each ordered pair of values of marginal distributions. That is, the coupla function maps the range of joint distribution from a d-dimensional ball to a unit line. The second part is that the copula function can be uniquely determined for every joint distribution.

Why can the copula function be employed to generate joint distribution?

Each joint density can be viewed as the product of marginal densities multiplied by copula density. The copula density, which is defined as the partial derivative of the copula function, contains all the information about the dependence structure of the joint distribution. As a result, the joint distribution can be flexibly constructed by copula dependency and marginal distributions.

The implementation of copula into a new R package

To share this finding with the community, we have implemented the copula algorithm into a new R package entitled copulaSim. The copulaSim package is designed to perform virtual patient simulation. The idea of the copula simulation algorithm is given in the following. Based on the copula invariance property, the dependence structure of the joint distribution can be well preserved when performing quantile transformation. Because of this feature, the copula simulation algorithm allows for the simulated data to resemble the empirical data.

The demonstration of R Package copulaSim

Generate empirical data:

Assume that the single-arm, 5-dimensional empirical data follows multivariate normal data, and let the sample size of empirical data be 30.

library(mvtnorm)
arm1 <- rmvnorm(n = 30, mean  = rep(10, 5), sigma = diag(5) + 0.5)
test_data <- as.data.frame(cbind(1:30, rep(1, 30), arm1))
colnames(test_data) <- c("id","arm",paste0("time_", 1:5))
knitr::kable((test_data), "simple")

id	arm	time_1	time_2	time_3	time_4	time_5
1	1	11.646910	10.269294	10.481858	11.210719	10.090970
2	1	6.627463	8.317597	9.015460	9.574590	7.978692
3	1	10.773175	10.175516	10.487008	9.766223	11.746998
4	1	9.144505	9.348546	9.813288	9.634444	8.353821
5	1	11.520665	9.453226	10.269135	9.331822	9.627985
6	1	9.657934	10.185403	9.124125	9.730094	9.247632
7	1	6.954266	8.553138	7.376362	7.053906	7.832456
8	1	8.725410	9.969419	11.386591	8.857343	9.638618
9	1	10.219469	8.959432	10.036200	8.549264	12.132358
10	1	11.995693	8.897448	8.208588	8.067165	9.118994
11	1	10.105470	11.162182	10.956714	13.127127	11.821425
12	1	9.923194	11.248229	10.475311	9.990856	8.646211
13	1	9.051581	9.270937	7.823429	9.915488	7.917853
14	1	10.324965	10.447948	10.343337	11.329486	7.466363
15	1	10.751450	8.252985	10.579884	10.301893	10.175539
16	1	11.156088	9.999761	9.638360	8.909357	8.695054
17	1	9.775169	8.812014	10.392023	11.253106	10.642096
18	1	10.884094	9.192141	10.126902	10.351798	11.504468
19	1	10.916468	10.423716	11.082694	11.297520	12.444548
20	1	10.078414	8.432021	9.948595	9.750644	9.451321
21	1	11.523814	11.341755	11.695170	12.323437	8.786644
22	1	10.878517	7.820464	10.442991	10.250747	9.547050
23	1	10.540325	12.465634	10.704040	12.488671	9.934841
24	1	11.487375	10.196161	9.989142	9.223528	11.231917
25	1	9.254913	10.702573	10.184756	7.749516	10.799756
26	1	8.894291	6.957514	7.319530	7.972501	8.272946
27	1	10.979493	9.881441	9.284832	12.328774	11.101806
28	1	11.732922	12.032556	12.015052	11.676075	10.880316
29	1	9.994325	9.410948	10.061306	10.074411	10.925945
30	1	8.569415	10.020435	11.014066	9.317199	9.464022

Load package copulaSim

library(copulaSim)

Use function copula.sim to generate ONE simulated dataset for the emperical data

Argument	Definition	Assigned Value
data.input	The empirical data	test_data[,-c(1,2)]
id.vec	ID fo individual patient in the input data	test_data$id
arm.vec	The column to identify the arm in clinical trial	test_data$arm
n.patient	The targeted number of patients in each simulated dataset	50
n.simulation	The number of simulated datasets	1
seed	The random seed to reproduce the simulation study	2022
validation.type	Specify hypothesis test to detect the difference between empirical data and simulated data	“energy”
verbose	Whether to print message for simulation process or not	TRUE

As a means of avoiding extreme sampling results when performing one simulated dataset, it is advised to use “energy” or “ball” in the argument validation.type. The purpose of doing this is to perform data validation, which ensures the similarity between empirical data and the simulated data based on 2-sample test.
Below shows the 2-sample test result, which indicates that the joint distribution of the simulated data is not significantly different from the empirical data.

## Generate 1 simulated dataset
simu_S1 <- copula.sim(data.input = test_data[,-c(1,2)], 
                      id.vec = test_data$id, 
                      arm.vec = test_data$arm,
                      n.patient = 50 , 
                      n.simulation = 1, 
                      seed = 2022, 
                      validation.type = "energy",
                      verbose = TRUE)

## Simulate 1th Dataset
## p.value for energy test: 0.7970
## Compelete simulating 1th Dataset in 0.160 seconds

## Obtain the simulated long-form dataset
simu_S1$data.simul

## # A tibble: 250 x 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1      10.9       1
##  2     2     1       1 time_1       8.86      1
##  3     3     1       1 time_1      11.5       1
##  4     4     1       1 time_1      11.5       1
##  5     5     1       1 time_1       9.49      1
##  6     6     1       1 time_1      10.0       1
##  7     7     1       1 time_1      11.6       1
##  8     8     1       1 time_1      10.7       1
##  9     9     1       1 time_1       8.70      1
## 10    10     1       1 time_1      10.0       1
## # ... with 240 more rows
## # i Use `print(n = ...)` to see more rows

Also, from the marginal density plot, we can see that the copula simulated data is close to the empirical data.

library(dplyr)
## Obtain the empirical long-form dataset
empir <- simu_S1$data.input.long %>% mutate(cate = "empirical_n30") %>% rename(data = data.input)

## Produce the marginal density plot
simul <- simu_S1$data.simul %>%  mutate(cate = "copulaSim_n50") %>% 
         rename(data = data.sim) %>% select(-sim.id)  
library(ggplot2)
rbind(empir, simul) %>% filter(grepl('time', col.name)) %>%
  ggplot(aes(x = data, color = cate, fill = cate)) +
  facet_wrap(.~col.name, ncol = 5) +
  geom_density(alpha = 0.001, size = 1)

Use function extract.data.sim to convert simulated data into a list of wide-form matrices

## Converting the long-form simulated dataset to wide-form 
simu.wide <- extract.data.sim(simu_S1)
simu.wide

## $`sim.id=1`
## $`sim.id=1`$`arm=1`
##          time_1    time_2    time_3    time_4    time_5
##  [1,] 10.882214 11.782795 11.829682 12.703234 11.771197
##  [2,]  8.864297 11.552918 10.559735  9.090521  7.671552
##  [3,] 11.523731 12.326746 11.935101 13.116663 10.956906
##  [4,] 11.523183 10.244910 11.000751 10.149085 12.430483
##  [5,]  9.487384  8.182731 10.364411  9.830494 10.671266
##  [6,] 10.035734  8.311126  9.336391  9.740772 11.024167
##  [7,] 11.562668 10.436134 10.377385 10.904602 11.773561
##  [8,] 10.670770  8.180897  7.347419  8.631056  9.295954
##  [9,]  8.698251 10.044597 10.345788 11.213242  8.345151
## [10,] 10.040009 10.006756  9.221086  9.319113 10.990252
## [11,]  9.974107  8.174987 10.482672  7.990362 12.036490
## [12,] 10.883196  9.346975  9.689017  7.736601  7.896831
## [13,] 10.001058 10.186662 10.542313  8.694998  9.621085
## [14,] 11.785938 11.164226 11.745467 12.366167 11.335162
## [15,] 10.835325  9.707887  9.069775  7.798309 10.854649
## [16,]  9.977064  7.935337  7.578375  7.916174 10.012469
## [17,]  8.620151  7.922876  7.360444  7.720451  7.554585
## [18,]  8.634307  7.255901  7.383075  7.432532  7.574158
## [19,] 10.953475  8.457960 10.621095  8.524596  9.482929
## [20,]  8.081317  9.195779  9.176780  7.841376  7.934947
## [21,] 10.883318  9.406466 10.312328 11.348170 11.454102
## [22,]  6.859692  8.343532 10.003129 10.688375  8.688254
## [23,]  9.721303 10.214915 11.010897 11.308270  9.628388
## [24,] 10.912065 11.174474 10.067575  9.795591  8.964425
## [25,]  9.130176 10.185375 11.028622  9.953608 10.663402
## [26,] 10.282695  8.585448  7.351914  8.037196  9.523494
## [27,] 10.434761  9.433771  7.576724  8.857205  8.692707
## [28,] 10.878189  9.370089 10.469696 11.976034 11.323951
## [29,] 11.043796  9.361647 10.507122 11.480949 11.394239
## [30,]  7.973582  9.906306 10.484769 12.285020  8.703697
## [31,] 11.422629 11.245840 11.043007 10.156698  8.315459
## [32,] 10.763738 10.583369  9.445984  9.592568  9.205381
## [33,] 10.978934  8.422350  9.332877  9.319909 10.381483
## [34,] 10.095110 10.192373 10.486172 10.837454 11.811422
## [35,]  9.738053 10.030813 11.868960 11.528727 10.953413
## [36,] 11.557554  8.504001 10.036275  8.866637 10.804163
## [37,] 10.758517  8.301460 10.040082  8.866884  8.685943
## [38,]  9.020668  8.593177 10.180643  9.596501 11.766536
## [39,] 10.880619  9.888545 10.371159  8.619562 10.887735
## [40,]  9.738026  8.905980 10.491806  8.543275 10.808045
## [41,] 10.250263 10.432757 10.552412 10.321520  9.637336
## [42,]  8.774332  9.186204 10.358860  8.694322 11.239469
## [43,] 10.080631 11.216590  9.423501  8.864013  7.873485
## [44,] 10.023380  8.260732 10.360511  9.469259  9.380528
## [45,]  9.142095  7.169126  9.704404  8.219107  9.629918
## [46,]  7.012989  8.295393 10.086290  9.737844  9.501242
## [47,] 11.551832 10.436350 10.401349  9.244385  8.672310
## [48,]  9.986050 10.317522  8.869056  9.324915  8.661655
## [49,] 10.044765  9.443939  8.041164  8.107614  8.650004
## [50,] 10.978597 10.017485 10.985975  9.959680  9.505250

Use function compare.copula.sim to perform the comparison between empirical data and multiple simulated datasets.

To increase the data variability, we suggest that the data validation option should be turned off when producing multiple simulated datasets. That is, the argument for validation.type should be set to “none”.

## Generate 100 simulated datasets
simu_S100 <- copula.sim(data.input = test_data[,-c(1,2)], 
                         id.vec = test_data$id, 
                         arm.vec = test_data$arm,
                         n.patient = 50 , 
                         n.simulation = 100, 
                         seed = 2022, 
                         validation.type = "none",
                         verbose = FALSE)

## Compare the marginal mean via the function compare.copula.sim 
compare <- compare.copula.sim(simu_S100)
knitr::kable(compare$mean.comparison, "simple")

marginal.name	arm	empir.sample	simu.sample	n.simu	empir.mean	simu.mean	simu.mean.low.lim	simu.mean.upp.lim	simu.mean.RB	simu.mean.SB	simu.mean.RMSE	empir.sd	simu.sd
time_1	1	30	50	100	10.1363	10.1790	9.9230	10.4897	0.0042	0.1138	0.1589	1.3158	1.1974
time_2	1	30	50	100	9.7400	9.7401	9.4410	10.0268	0.0000	0.1071	0.1482	1.2368	1.1104
time_3	1	30	50	100	10.0092	10.0045	9.7583	10.2418	-0.0005	0.0970	0.1275	1.1526	1.0543
time_4	1	30	50	100	10.0469	10.0435	9.7241	10.3945	-0.0003	0.1083	0.1819	1.4916	1.3700
time_5	1	30	50	100	9.8493	9.8444	9.5266	10.1315	-0.0005	0.1005	0.1751	1.3968	1.3143

Employ function new.arm.copula.sim to simulate new multivariate datasets with shifted mean vector from existing empirical data.

This function assumes that the dependence structure of the new-arm data and that of the empirical data are the same.

## Generate Empirical Data 
##  Assume that the single-arm, 3-dimensional empirical data follows multivariate normal data
arm1 <- rmvnorm(n = 80, mean  = c(10,10.5,11), sigma = diag(3) + 0.5)
test_data2 <- as.data.frame(cbind(1:80, rep(1,80), arm1))
colnames(test_data2) <- c("id", "arm", paste0("time_", 1:3))

## Generate 1 simulated datasets with one empirical arm and two new-arms
## The mean difference between empirical arm and 
## (i) the 1st new arm is assumed to be 2.5, 2.55, and 2.6 at each time point
## (ii) the 2nd new arm is assumed to be 4.5, 4.55, and 4.6 at each time point

newARM <- new.arm.copula.sim(data.input = test_data2[,-c(1,2)], 
                             id.vec = test_data2$id,
                             arm.vec = test_data2$arm, 
                             n.patient = 100 , 
                             n.simulation = 1, 
                             seed = 2022,
                             shift.vec.list = list(c(2.5,2.55,2.6), c(4.5,4.55,4.6)),
                             verbose = FALSE)

## Obtain the simulated long-form dataset
newARM$data.simul

## # A tibble: 900 x 6
##       id   arm col.num col.name data.sim sim.id
##    <dbl> <dbl>   <dbl> <chr>       <dbl>  <int>
##  1     1     1       1 time_1       9.63      1
##  2     2     1       1 time_1       9.65      1
##  3     3     1       1 time_1       8.24      1
##  4     4     1       1 time_1      12.1       1
##  5     5     1       1 time_1       9.12      1
##  6     6     1       1 time_1      10.0       1
##  7     7     1       1 time_1       8.56      1
##  8     8     1       1 time_1      10.6       1
##  9     9     1       1 time_1      10.5       1
## 10    10     1       1 time_1       8.48      1
## # ... with 890 more rows
## # i Use `print(n = ...)` to see more rows

## Verify the mean difference
newARM$data.simul %>%  
  group_by(.data$arm, .data$col.num) %>%
  summarise(N = n(), Mean = mean(.data$data.sim), SD = sd(.data$data.sim))

## # A tibble: 9 x 5
## # Groups:   arm [3]
##     arm col.num     N  Mean    SD
##   <dbl>   <dbl> <int> <dbl> <dbl>
## 1     1       1   100  9.71 1.01 
## 2     1       2   100 10.2  1.26 
## 3     1       3   100 10.8  1.23 
## 4     2       1   100 12.0  0.975
## 5     2       2   100 12.7  1.30 
## 6     2       3   100 13.3  1.14 
## 7     3       1   100 14.0  0.975
## 8     3       2   100 14.7  1.30 
## 9     3       3   100 15.3  1.14

- Authored by Pei-Shan Yen

- CRAN page: https://CRAN.R-project.org/package=copulaSim

- github page: https://github.com/psyen0824/copulaSim

Acknowledgement

This research project and the development of the R package are supported by AbbVie Experiential Internship Program. I am also grateful to Dr. Xuemin Gu, Dr. Jenny Jiao, and Dr. Jane Zhang at the Eyecare Clinical Statistics Team for valuable comments on this work.