Simulation from the bivariate negative binomial and bi- and trivariate logarithmic series distribution

Almut E. D. Veraart

2021-02-22

The multivariate negative binomial distribution.

Recall that a random variable \(X\) has (univariate) negative binomial law with parameters \(\kappa>0, 0<p<1\), i.e., \(X\sim \text{NB}(\kappa, p)\) if its probability mass function is given by \[ P(X=x) = {\kappa+x-1 \choose x} p^x(1-p)^{\kappa}, \quad x \in \{0,1,\ldots\}. \]

A random vector \({\bf X}=(X_1,\dots,X_n)'\) is said to follow the multivariate negative binomial distribution with parameters \(\kappa, p_1, \dots, p_n\) if its probability mass function is given by \[ P({\bf X}={\bf x})=\frac{\Gamma(x_1+\cdots+x_n+\kappa)}{x_1! \cdots x_n! \Gamma(\kappa)}p_1^{x_1}\cdots p_n^{x_n}(1-p_1-\cdots-p_n)^{\kappa}, \] where, for \(i=1,\dots,n\), \(x_i\in\{0,1,\dots\}\), \(0<p_i<1\) such that \(\sum_{i=1}^np_i<1\) and \(\kappa>0\).

We note that the function stats::rnbinom can be used to simulate from the univariate negative binomial distribution. The trawl package introduces the function Bivariate_NBsim which generates samples from the bivariate negative binomial distribution. The simulation algorithm proceeds in two steps: First, we simulate \(X_1\) from the univariate negative binomial distribution NB(\(\kappa\),\(p_1/(1-p_2)\)). Second, we simulate \(X_2|X_1=x_1\) from the univariate negative binomial distribution NB(\(\kappa+x_1,p_2\)), see for instance (Dunn 1967).

###Example

set.seed(1)
kappa<- 3
p1 <- 0.1
p2 <- 0.85
p <- p1+p2
p0 <-1-p1-p2

N<- 10000

#Simulate from the bivariate negative binomial distribution
y <- trawl::Bivariate_NBsim(N,kappa,p1,p2)

#Compare the empirical and theoretical mean of the first component
base::mean(y[,1])
#> [1] 5.9653
kappa*p1/(1-p)
#> [1] 6

#Compare the empirical and theoretical variance of the first component
stats::var(y[,1])
#> [1] 18.0047
kappa*p1*(1-p2)/(1-p)^2
#> [1] 18

#Compare the empirical and theoretical mean of the second component
base::mean(y[,2])
#> [1] 50.9742
kappa*p2/(1-p)
#> [1] 51

#Compare the empirical and theoretical variance of the second component
stats::var(y[,2])
#> [1] 926.3358
kappa*p2*(1-p1)/(1-p)^2
#> [1] 918

#Compare the empirical and theoretical correlation between the two components
stats::cor(y[,1],y[,2])
#> [1] 0.7891007
(p1*p2/(p0+p1)*(p0+p2))^(1/2)
#> [1] 0.7141428

The multivariate logarithmic series distribution

We say that a vector \({\bf X}=(X_1,\dots,X_n)'\) follows the multivariate logarithmic series distribution (LSD), see, e.g., (Patil and Bildikar 1967). \({\bf X} \sim \text{LSD}(p_1,\ldots,p_n)\), where \(0<p_i<1, p:=\sum_{i=1}^np_i <1\) if for \({\bf x}\in \mathbb{N}_0^n \setminus \{{\bf 0} \}\), if its probability mass function is given by \[ P({\bf X}={\bf x})=\frac{\Gamma(x_1+\cdots+x_n)}{x_1! \cdots x_n!}\frac{p_1^{x_1}\cdots p_n^{x_n}}{\{-\ln(1-p)\}}. \] Note that each component \(X_i\) follows the modified univariate logarithmic distribution with parameters \(\tilde p_i = p_i/(1-p+p_i)\) and \(\delta_i= \ln(1-p+p_i)/\ln(1-p)\), i.e., \(X_i\sim\text{ModLSD}(\tilde p_i, \delta_i)\) with \[ P(X_i =x_i) = \left\{ \begin{array}{ll} \delta_i, & \text{for } x_i=0\\ (1-\delta_i) \frac{1}{x_i}\frac{\tilde p_i^{x_i}}{\{-\ln(1-\tilde p_i)\}}, & \text{for } x_i \in \mathbb{N}. \end{array} \right. \] Simulations from the univariate LSD can be carried out using the function Runuran::urlogarithmic. The trawl package implements the functions Bivariate_LSDsim and Trivariate_LSDsim to simulate from both the bivariate and the trivariate logarithmic series distribution.

Simulating from the bivariate logarithmic series distribution

The function Bivariate_NBsim can be used to simulate from the bivariate logarithmic series distribution. To this end, note that the probability mass function of a random vector \({\bf X}=(X_1,X_2)'\) following the bivariate logarithmic series distribution with parameters \(0<p_1, p_2<1\) with \(p:=p_1+p_2<1\) is given by \[P(X_1=x_1,X_2=x_2)=\frac{\Gamma(x_1+x_2)}{x_1!x_2!} \frac{p_1^{x_1}p_2^{x_2}}{\{-\ln(1-p)\}},\] for \(x_1,x_2=0,1,2,\dots\) such that \(x_1+x_2>0\).

The simulation proceeds in two steps: First, \(X_1\) is simulated from the modified logarithmic distribution with parameters \(\tilde p_1=p_1/(1-p_2)\) and \(\delta_1=\ln(1-p_2)/\ln(1-p)\). Then we simulate \(X_2\) conditional on \(X_1\). We note that \(X_2|X_1=x_1\) follows the logarithmic series distribution with parameter \(p_2\) when \(x_1=0\), and the negative binomial distribution with parameters \((x_1,p_2)\) when \(x_1>0\).

Example

Next we provide an example of a simulation from the bivariate LSD and we showcase the functions ModLSD_Mean, ModLSD_Var, BivLSD_Cor and BivLSD_Cov which compute the mean and the variance of the univariate modified LSD and the correlation and covariance of the bivariate LSD, respectively.

set.seed(1)
p1<-0.15
p2<-0.3

N<-10000

#Simulate N realisations from the bivariate LSD 
y<-trawl::Bivariate_LSDsim(N, p1, p2)

#Compute the empirical and theoretical mean of the first component
base::mean(y[,1])
#> [1] 0.4575
trawl::ModLSD_Mean(base::log(1-p2)/base::log(1-p1-p2),p1/(1-p2))
#> [1] 0.45619

#Compute the empirical and theoretical mean of the second component
base::mean(y[,2])
#> [1] 0.8995
trawl::ModLSD_Mean(base::log(1-p1)/base::log(1-p1-p2),p2/(1-p1))
#> [1] 0.91238

#Compute the empirical and theoretical variance of the first component
stats::var(y[,1])
#> [1] 0.3686306
trawl::ModLSD_Var(base::log(1-p2)/base::log(1-p1-p2),p1/(1-p2))
#> [1] 0.3724961

#Compute the empirical and theoretical variance of the second component
stats::var(y[,2])
#> [1] 0.5690567
trawl::ModLSD_Var(base::log(1-p1)/base::log(1-p1-p2),p2/(1-p1))
#> [1] 0.5776045

##Compute the empirical and theoretical correlation between the two components
stats::cor(y[,1],y[,2])
#> [1] -0.3961486
trawl::BivLSD_Cor(p1,p2)
#> [1] -0.3608673

##Compute the empirical and theoretical covariance between the two components
stats::cov(y[,1],y[,2])
#> [1] -0.1814394
trawl::BivLSD_Cov(p1,p2)
#> [1] -0.1673877

Simulating from the trivariate logarithmic series distribution

The function Trivariate_NBsim can be used to simulate from the trivariate logarithmic series distribution. The simulation proceeds in two steps: First, \(X_1\) is simulated from the modified logarithmic distribution with parameters \(\tilde p_1=p_1/(1-p_2-p_3)\) and \(\delta_1=\ln(1-p_2-p_3)/\ln(1-p)\). Then we simulate \((X_2,X_3)'\) conditional on \(X_1\). We note that \((X_2,X_3)'|X_1=x_1\) follows the bivariate logarithmic series distribution with paramaters \((p_2,p_3)\) when \(x_1=0\), and the bivariate negative binomial distribution with parameters \((x_1,p_2,p_3)\) when \(x_1>0\).

Example

set.seed(1)
p1<-0.15
p2<-0.25
p3<-0.55

N<- 10000

#Simulate N realisations from the bivariate LSD 
y<-trawl::Trivariate_LSDsim(N, p1, p2, p3)

#Compute the empirical and theoretical mean of the first component
base::mean(y[,1])
#> [1] 1.0152
trawl::ModLSD_Mean(base::log(1-p2-p3)/base::log(1-p1-p2-p3),p1/(1-p2-p3))
#> [1] 1.001425

#Compute the empirical and theoretical mean of the second component
base::mean(y[,2])
#> [1] 1.6857
trawl::ModLSD_Mean(base::log(1-p1-p3)/base::log(1-p1-p2-p3),p2/(1-p1-p3))
#> [1] 1.669041

#Compute the empirical and theoretical mean of the third component
base::mean(y[,3])
#> [1] 3.7275
trawl::ModLSD_Mean(base::log(1-p1-p2)/base::log(1-p1-p2-p3),p3/(1-p1-p2))
#> [1] 3.67189

#Compute the empirical and theoretical variance of the first component
stats::var(y[,1])
#> [1] 2.999069
trawl::ModLSD_Var(base::log(1-p2-p3)/base::log(1-p1-p2-p3),p1/(1-p2-p3))
#> [1] 3.002847

#Compute the empirical and theoretical variance of the second component
stats::var(y[,2])
#> [1] 7.255041
trawl::ModLSD_Var(base::log(1-p1-p3)/base::log(1-p1-p2-p3),p2/(1-p1-p3))
#> [1] 7.228548

#Compute the empirical and theoretical variance of the third component
stats::var(y[,3])
#> [1] 30.87613
trawl::ModLSD_Var(base::log(1-p1-p2)/base::log(1-p1-p2-p3),p3/(1-p1-p2))
#> [1] 30.5799

#Computing the bivariate covariances and correlations
#Cor(X1,X2):
delta <- base::log(1-p3)/base::log(1-p1-p2-p3)
hatp1 <-p1/(1-p3)
hatp2<-p2/(1-p3)

stats::cov(y[,1],y[,2])
#> [1] 3.316309
trawl::BivModLSD_Cov(delta,hatp1,hatp2)
#> [1] 3.335704

stats::cor(y[,1],y[,2])
#> [1] 0.7109545
trawl::BivModLSD_Cor(delta,hatp1,hatp2)
#> [1] 0.6041258

#Cor(X1,X3):
delta <- log(1-p2)/log(1-p1-p2-p3)
hatp1 <-p1/(1-p2)
hatp2<-p3/(1-p2)

stats::cov(y[,1],y[,3])
#> [1] 7.287671
trawl::BivModLSD_Cov(delta,hatp1,hatp2)
#> [1] 7.338549

stats::cor(y[,1],y[,3])
#> [1] 0.7573281
trawl::BivModLSD_Cor(delta,hatp1,hatp2)
#> [1] 0.7220044

#Cor(X2,X3):
delta <- log(1-p1)/log(1-p1-p2-p3)
hatp1 <-p2/(1-p1)
hatp2<-p3/(1-p1)

stats::cov(y[,2],y[,3])
#> [1] 12.31899
trawl::BivModLSD_Cov(delta,hatp1,hatp2)
#> [1] 12.23092
stats::cor(y[,2],y[,3])
#> [1] 0.8230829
trawl::BivModLSD_Cor(delta,hatp1,hatp2)
#> [1] 0.7969081

##References

Dunn, James E. 1967. “Characterization of the Bivariate Negative Binomial Distribution.” Journal of the Arkansas Academy of Science 21.
Patil, Ganapati P., and Sheela Bildikar. 1967. “Multivariate Logarithmic Series Distribution as a Probability Model in Population and Community Ecology and Some of Its Statistical Properties.” Journal of the American Statistical Association 62 (318): 655–74.