hdpGLM

Usage

Estimation

The function hdpGLM estimates a semi-parametric Bayesian regression model. The syntax is similar to other R functions such as lm(), glm(), and lmer().

Here is a toy example. Suppose we are studying how income inequality affects support for policies that help alleviate poverty in a given country A. Suppose further that (1) the effect of inequality varies between groups of people: for some, inequality increases support for welfare policies, while for others it decreases that support; and (2) we don’t know which individual belongs to which group. The data set welfare contains simulated data for this example.

## loading and looking at the data
welfare = read.csv2('welfare.csv')
head(welfare)
#>       support inequality     income   ideology
#> 1 -18.5649610  0.3392724  0.1425111  1.9225985
#> 2  -9.3905812 -0.9906646 -0.5117102  0.2483346
#> 3   0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5  -2.4834411  0.1000455  0.8322192  0.1321378
#> 6 -11.4187853 -0.9543883 -0.8810503  0.2916444
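
To make the setup concrete, the sketch below simulates data with two latent groups whose inequality effects have opposite signs, then compares a pooled regression with regressions run inside each group. It is a base-R illustration only: the variable names, sample size, and slopes (2 and -1.5) are assumptions chosen for the example, not the code used to generate the welfare data set.

```r
## simulated stand-in data: NOT the welfare data set used in this article
set.seed(1)
n = 2000
group = sample(c(1, 2), n, replace = TRUE)   ## latent group, unobserved in practice
inequality = rnorm(n)
slope = ifelse(group == 1, 2, -1.5)          ## hypothetical group-specific effects
support = -4 + slope * inequality + rnorm(n)

## a pooled regression ignores the groups and returns a single slope
pooled = coef(lm(support ~ inequality))[["inequality"]]

## regressions within each (normally unknown) group recover the two effects
by_group = sapply(split(data.frame(support, inequality), group),
                  function(d) coef(lm(support ~ inequality, data = d))[["inequality"]])

pooled
by_group
```

The pooled slope falls somewhere between the two group-specific effects and describes neither group well, which is the situation the mixture model is meant to address when group membership is not observed.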

Now, suppose that inequality increases support for welfare among women but decreases it among men, and that we didn’t collect data on gender (male versus female). We can still estimate the model and recover the coefficients for each group even though gender wasn’t observed: the function hdpGLM estimates a semi-parametric Bayesian generalized linear model using a Dirichlet process mixture. Let’s estimate the model. The example uses few MCMC iterations to keep it fast; real applications should use a much larger number.

library(hdpGLM)
#> 
#> ## ===============================================================
#> ## Hierarchial Dirichlet Process Generalized Linear Model (hdpGLM)
#> ## ===============================================================
#> 
#> Author: Diogo Ferrari
#> Usage : https://github.com/DiogoFerrari/hdpGLM
#> 
#> Attaching package: 'hdpGLM'
#> The following object is masked _by_ '.GlobalEnv':
#> 
#>     welfare

## estimating the model
mcmc = list(burn.in=10, ## MCMC burn-in period
            n.iter =500) ## number of MCMC iterations to keep
mod = hdpGLM(support ~ inequality + income + ideology, data=welfare,
             mcmc=mcmc)

## printing the outcome
summary(mod)
#>  
#> -------------------------------- 
#> dpGLM model object
#> 
#> Maximum number of clusters activated during the estimation: 10
#> Number of MCMC iterations: 500
#> burn-in: 10
#> -------------------------------- 
#> 
#> Summary statistics of clusters with data points
#> 
#> --------------------------------
#> Coefficients for cluster 1 (cluster label 1)
#> 
#>               Post.Mean Post.Median  HPD.lower HPD.upper
#> 1 (Intercept) -3.870705   -3.870752 -3.9347272 -3.801209
#> 2  inequality  1.996214    1.995899  1.9312844  2.059735
#> 3      income  3.851003    3.852709  3.7751987  3.913612
#> 4    ideology -8.307562   -8.305992 -8.3709154 -8.238262
#> 5       sigma  1.003149    1.003833  0.9203266  1.097709
#> 
#> --------------------------------
#> Coefficients for cluster 2 (cluster label 3)
#> 
#>                Post.Mean Post.Median  HPD.lower HPD.upper
#> 1 (Intercept) -3.8158824  -3.8165950 -3.8868877 -3.743373
#> 2  inequality -1.5287681  -1.5269775 -1.6060558 -1.458008
#> 3      income  3.8833961   3.8829753  3.8005268  3.944675
#> 4    ideology -8.2567409  -8.2570198 -8.3354394 -8.186184
#> 5       sigma  0.9684549   0.9745855  0.8464886  1.094251
#> 
#> --------------------------------

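The summary function prints the results cluster by cluster. The header of each block shows the label of the estimated cluster. The column Post.Mean is the average of the posterior distribution for each linear coefficient in that cluster, Post.Median is its median, and HPD.lower and HPD.upper are the bounds of the highest posterior density interval.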
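
For readers unfamiliar with these summaries, the following base-R sketch computes a posterior mean, median, and highest posterior density interval from a vector of draws. The draws are simulated stand-ins, the helper hpd() is a hypothetical name, and the 0.95 level is an assumption; the package may compute its intervals differently.

```r
## hypothetical posterior draws for one coefficient (stand-in for MCMC output)
set.seed(2)
draws = rnorm(5000, mean = 2, sd = 0.03)

## shortest interval containing a share `prob` of the draws
hpd = function(x, prob = 0.95) {
    x = sort(x)
    n = length(x)
    m = ceiling(prob * n)                 ## number of draws inside the interval
    width = x[m:n] - x[1:(n - m + 1)]     ## widths of all candidate intervals
    i = which.min(width)                  ## start of the narrowest one
    c(lower = x[i], upper = x[i + m - 1])
}

post = c(Post.Mean = mean(draws), Post.Median = median(draws), hpd(draws))
post
```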

The function classify can be used to classify the data points into clusters based on the estimation.

welfare_clustered = classify(welfare, mod)
head(welfare_clustered)
#>   Cluster     support inequality     income   ideology
#> 1       1 -18.5649610  0.3392724  0.1425111  1.9225985
#> 2       1  -9.3905812 -0.9906646 -0.5117102  0.2483346
#> 3       1   0.9276234 -2.2318510 -0.3856288 -1.3619216
#> 4       1 -12.3594498 -3.0079501 -0.9440585 -0.2088675
#> 5       3  -2.4834411  0.1000455  0.8322192  0.1321378
#> 6       1 -11.4187853 -0.9543883 -0.8810503  0.2916444
tail(welfare_clustered)
#>      Cluster     support  inequality     income   ideology
#> 1995       3  -1.5230053 1.055855140 -0.7295937 -0.7067871
#> 1996       3   0.4814892 0.582588091  2.0051082  0.3090389
#> 1997       3 -14.1929956 0.391164197 -0.9607449  0.7765482
#> 1998       1  -8.2396789 0.074437376  1.2020300  1.0874928
#> 1999       3 -23.1583753 0.434223018 -0.6176438  2.0387294
#> 2000       3  -7.2075582 0.008355317 -0.4538951  0.2268072
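
Once the data points carry a Cluster column, ordinary data-frame tools can describe the clusters. The sketch below counts observations per cluster and averages two variables within each one. To stay self-contained it rebuilds only the six rows printed by head() above as a stand-in; with the full welfare_clustered object the same two calls would apply.

```r
## stand-in for welfare_clustered: the six rows printed by head() above
welfare_clustered = data.frame(
    Cluster    = c(1, 1, 1, 1, 3, 1),
    support    = c(-18.5649610, -9.3905812, 0.9276234,
                   -12.3594498, -2.4834411, -11.4187853),
    inequality = c(0.3392724, -0.9906646, -2.2318510,
                   -3.0079501, 0.1000455, -0.9543883))

## number of data points assigned to each cluster label
sizes = table(welfare_clustered$Cluster)
sizes

## within-cluster averages of the outcome and the covariate
aggregate(cbind(support, inequality) ~ Cluster, data = welfare_clustered, FUN = mean)
```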

There is a series of built-in functions, with various options, to plot the results. The example below shows two of those options. The parameter separate plots the posterior samples for each cluster separately, and the parameter ncols controls how many columns to use for the panels in the figure (to see more, run help(plot.hdpGLM) and help(plot.dpGLM)).

plot(mod, separate=TRUE, ncols=4)
#> 
#> 
#> Generating plot...

Estimating Context-dependent Latent Heterogeneity

To continue the previous toy example, suppose that we are analyzing data from many countries, and we suspect that the latent heterogeneity is different in each country. The effect of inequality on support for welfare may be gender-specific only in some countries (contexts). Or maybe the way it is gender-specific varies from country to country. Suppose we didn’t have data on gender, but we collected information on countries’ gender gap in welfare provision. Let’s look at this new data set.

## loading and looking at the data
welfare = read.csv2('welfare2.csv')
head(welfare)
#>       support inequality     income   ideology country gap
#> 1 -18.5649610  0.3392724  0.1425111  1.9225985       0 0.1
#> 2  -9.3905812 -0.9906646 -0.5117102  0.2483346       0 0.1
#> 3   0.9276234 -2.2318510 -0.3856288 -1.3619216       0 0.1
#> 4 -12.3594498 -3.0079501 -0.9440585 -0.2088675       0 0.1
#> 5  -2.4834411  0.1000455  0.8322192  0.1321378       0 0.1
#> 6 -11.4187853 -0.9543883 -0.8810503  0.2916444       0 0.1
tail(welfare)
#>         support inequality     income    ideology country        gap
#> 3195  0.3190583 -0.7504798 -0.7839583  0.92300705       4 -0.8280808
#> 3196 -1.3837239  0.6620435 -1.5566268  0.05634618       4 -0.8280808
#> 3197 -1.3820016 -0.4298706 -1.0945688  0.71559078       4 -0.8280808
#> 3198  0.6878775  0.5450604  2.6175887 -1.94844469       4 -0.8280808
#> 3199 -7.9282930  1.7846004  1.6755823  1.29160208       4 -0.8280808
#> 3200 -1.7472485  0.5030992 -0.5395479  0.20109879       4 -0.8280808

The variable country indicates the country (context) of the observation, and the variable gap is the gender gap in welfare provision in the respective country. The estimation is similar to the previous example, but now there is a second formula for the context-level variables. Again, the example below uses few MCMC iterations; practical applications need many more.

## estimating the model
mcmc = list(burn.in=1,  ## MCMC burn-in period
            n.iter =50) ## number of MCMC iterations to keep
mod = hdpGLM(support ~ inequality + income + ideology,
             support ~ gap,
             data=welfare, mcmc=mcmc)

summary(mod)
#>  
#> -------------------------------- 
#> hdpGLM Object 
#> 
#> Maximum number of clusters activated during the estimation: 1
#> Number of MCMC iterations: 50
#> Burn-in: 1
#> 
#> Number of contexts : 5
#> 
#> Number of clusters (summary across contexts): 
#> 
#>   Average Std.Dev Median Min. Max.
#> 1     4.4 2.50998      4    2    7
#> -------------------------------- 
#> 
#> 
#> Summary statistics of clusters with data points in each context
#> 
#> --------------------------------
#> Coefficients and clusters for context 1
#> 
#>               Post.Mean Post.Median  HPD.lower HPD.upper Cluster
#> 1 (Intercept) -3.868662   -3.861583 -3.9290318 -3.798271       1
#> 2  inequality  1.886711    1.978894  0.6154336  2.089487       1
#> 3      income  3.862804    3.857908  3.7773563  3.915118       1
#> 4    ideology -8.308469   -8.309497 -8.3620989 -8.208680       1
#> 5 (Intercept) -3.844991   -3.820484 -4.4052116 -3.729841       2
#> 6  inequality -1.541230   -1.542952 -1.7368677 -1.435435       2
#> 7      income  3.784585    3.874054  2.3898101  3.961620       2
#> 8    ideology -8.210759   -8.246714 -8.3338233 -7.649155       2
#> 
#> --------------------------------
#> Coefficients and clusters for context 2
#> 
#>                  Post.Mean  Post.Median   HPD.lower    HPD.upper Cluster
#> 1  (Intercept)  0.63608663  0.604670753  0.41029726  0.942699177       1
#> 2   inequality -0.12138831 -0.108093774 -0.40613166  0.279662242       1
#> 3       income  0.45736125  0.417579163 -0.01262893  1.171451339       1
#> 4     ideology -1.70865481 -1.726107460 -1.97041383 -1.473791372       1
#> 5  (Intercept)  0.53682754  0.731760184 -0.64231366  1.111578020       3
#> 6   inequality  1.74552607  1.450023124  1.07143994  3.688079653       3
#> 7       income  0.80874654  0.837610679 -0.20647338  1.148847108       3
#> 8     ideology -2.37718039 -2.155949404 -4.29533114 -1.664332621       3
#> 9  (Intercept) -0.63139312 -0.503583218 -3.62710068  0.562613785       4
#> 10  inequality -0.79426488 -0.824701886 -2.08984009  3.438943387       4
#> 11      income -0.53472107 -0.456450211 -1.59652147  0.002176292       4
#> 12    ideology -2.27940507 -2.462499162 -2.76188279 -0.026772553       4
#> 13 (Intercept)  1.12125288  1.257220338 -0.69650480  1.771953611       5
#> 14  inequality  1.38417195  1.022015295 -0.15059523  5.050242200       5
#> 15      income  0.98666478  1.019371479 -1.13356465  1.968422485       5
#> 16    ideology -1.74677095 -1.708224941 -2.88300146 -1.025584783       5
#> 17 (Intercept) -0.20886965  0.009925016 -2.59654348  0.805611907       6
#> 18  inequality  0.41888686  0.437839820 -0.16994929  1.329009024       6
#> 19      income  1.98442241  1.895909152  1.17828313  3.701977148       6
#> 20    ideology  0.17874351  0.363307964 -0.75737087  0.746287842       6
#> 21 (Intercept) -0.24678165 -0.168657129 -0.58021111  0.303767798       7
#> 22  inequality -1.23890960 -1.301398409 -1.56077743 -0.335725958       7
#> 23      income -0.66892121 -0.590477057 -0.93619549 -0.323270787       7
#> 24    ideology -2.34515640 -2.263825003 -3.44243570 -2.038109880       7
#> 25 (Intercept)  1.20814371  1.228993733 -0.34773277  3.594721925       8
#> 26  inequality  0.56717093  0.508505593 -0.56218367  2.389555119       8
#> 27      income -0.02564722  0.023059663 -1.60916985  1.299396581       8
#> 28    ideology -1.56338357 -1.640618600 -4.88340856  2.888463515       8
#> 
#> --------------------------------
#> Coefficients and clusters for context 5
#> 
#>                  Post.Mean Post.Median   HPD.lower   HPD.upper Cluster
#> 1  (Intercept) -0.02686690  0.13078790 -0.70029790  0.53387554       1
#> 2   inequality -0.97521350 -1.01336311 -1.70284239 -0.35517116       1
#> 3       income -0.22894193 -0.18808926 -0.53443826  0.19315455       1
#> 4     ideology -2.86237789 -2.97920765 -3.14518515 -2.32104264       1
#> 5  (Intercept) -0.72447855 -1.40105266 -2.67524614  3.62087172       3
#> 6   inequality -0.97316633 -1.31895707 -5.79684552  3.91003308       3
#> 7       income -0.05196567 -0.13506523 -5.25240456  2.41733471       3
#> 8     ideology -2.88248807 -2.81261567 -6.18261552 -0.30288393       3
#> 9  (Intercept) -0.52425479 -0.38858075 -2.02600816  0.33243122       4
#> 10  inequality -1.88715246 -1.91371057 -2.58109392 -1.18523595       4
#> 11      income  0.83083529  0.78597937  0.01398649  1.64532686       4
#> 12    ideology -2.70547277 -2.87241622 -3.57526128 -1.42456075       4
#> 13 (Intercept) -0.15694516 -0.01286858 -1.58116759  0.36959814       5
#> 14  inequality  1.10151451  1.00100861  0.66532012  2.07622598       5
#> 15      income -0.18605699 -0.26879225 -0.59814051  0.64102542       5
#> 16    ideology -2.01875818 -2.05180220 -2.38379513 -1.28705230       5
#> 17 (Intercept) -1.54357827 -1.35678302 -2.44433039 -0.96914425       6
#> 18  inequality  0.97147989  0.88855510 -0.41955736  3.39702063       6
#> 19      income  0.36492918  0.48759398 -3.27541908  4.11682621       6
#> 20    ideology -1.25674177 -1.57054387 -2.77081793  0.05202434       6
#> 21 (Intercept)  0.15880179  0.24507248 -0.49161709  1.20491647       7
#> 22  inequality -2.68247221 -2.93118774 -3.57423898  1.23683307       7
#> 23      income -0.35965954 -0.22632917 -1.55233219  0.23160660       7
#> 24    ideology -2.48535641 -2.45139757 -3.02374869 -1.98444742       7
#> 25 (Intercept)  1.09596299  0.36824254 -0.54674248  8.94090691      11
#> 26  inequality -2.23269644 -2.76108101 -3.54904185  4.39777554      11
#> 27      income -0.43147861 -0.51498426 -1.45026420  1.16096401      11
#> 28    ideology -1.98467186 -1.98724789 -2.79785624 -0.86356507      11
#> 
#> --------------------------------
#> Coefficients and clusters for context 3
#> 
#>                  Post.Mean  Post.Median   HPD.lower  HPD.upper Cluster
#> 1  (Intercept) -1.46483940 -1.370030756 -2.26995487 -1.2106692       2
#> 2   inequality -0.64015322 -0.620787099 -0.94632901 -0.4237996       2
#> 3       income -2.67498762 -2.906339977 -3.14482045 -1.4933246       2
#> 4     ideology -0.05577040  0.001567111 -0.40670455  0.1577710       2
#> 5  (Intercept) -0.10885735 -0.084663417 -0.43810694  0.1895083       3
#> 6   inequality  0.43032221  0.450939421  0.06892038  0.7098216       3
#> 7       income -3.10610256 -3.110755621 -3.35750943 -2.9238722       3
#> 8     ideology  2.07609097  2.061962738  1.73770325  2.3337093       3
#> 9  (Intercept) -0.03200613 -0.197482712 -0.43864540  2.0379839       4
#> 10  inequality -0.66741034 -0.587308884 -2.05051498 -0.4594999       4
#> 11      income -3.87035648 -4.189285531 -4.40078976  0.4202151       4
#> 12    ideology  0.33347229  0.493833456 -2.40601285  0.7632697       4
#> 13 (Intercept) -0.82938281 -1.006901964 -1.82869478  2.7803631       6
#> 14  inequality -1.54374777 -1.299174077 -3.75647552  0.7881480       6
#> 15      income -3.49053947 -4.157891028 -5.33515726  4.3383841       6
#> 16    ideology  0.66132994  0.484406333  0.08076867  6.2075277       6
#> 
#> --------------------------------
#> Coefficients and clusters for context 4
#> 
#>                 Post.Mean Post.Median  HPD.lower   HPD.upper Cluster
#> 1 (Intercept) -0.01257292 -0.03446332 -0.5346842  1.54338283       4
#> 2  inequality -1.23824126 -1.19203300 -2.2818380 -0.89394873       4
#> 3      income -2.36258475 -2.65087135 -3.0118591  1.24603303       4
#> 4    ideology -0.82168807 -0.89039757 -1.2414280 -0.03623593       4
#> 5 (Intercept)  0.11138372  0.15699415 -0.1274362  0.58086213       6
#> 6  inequality -1.04534097 -1.08310677 -1.5223982 -0.69896446       6
#> 7      income -2.52401127 -2.60204631 -3.1632441 -2.42770899       6
#> 8    ideology -0.04910296  0.08289552 -0.3548482  0.21480141       6
#> 
#> --------------------------------
#> Context-level coefficients:
#>                Description  Post.Mean HPD.lower HPD.upper
#> 1     Intercept of beta[0] -0.1230822 -3.894788  2.221448
#> 2     Intercept of beta[1] -0.3784623 -3.640756  2.356387
#> 3     Intercept of beta[2] -0.4321821 -3.341328  2.135596
#> 4     Intercept of beta[3] -1.1616771 -3.568604  1.398443
#> 5 Effect of gap on beta[0] -0.2463572 -3.163778  2.136121
#> 6 Effect of gap on beta[1]  0.3928379 -1.460944  2.259470
#> 7 Effect of gap on beta[2] -0.5822625 -2.343928  1.904400
#> 8 Effect of gap on beta[3]  0.6022640 -1.958404  2.204238
#> 
#> --------------------------------

The summary contains more information now. As before, the column Cluster indicates the estimated clusters. Each block of coefficients corresponds to one country (context), so the same cluster label can appear with different estimates in different contexts. The final table (Context-level coefficients) shows the marginal effect of the context-level feature (gap) on each linear coefficient. Details of the interpretation can be found in Ferrari (2020).
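
As a rough reading aid for the context-level table, the sketch below plugs the posterior means into a linear expression to get the implied expected coefficient at a given value of gap. It rests on two assumptions that the summary does not state outright: that the context-level model is linear in gap (as the row labels "Intercept of beta[1]" and "Effect of gap on beta[1]" suggest), and that beta[1] is the coefficient on inequality, the first covariate in the formula. The variable names are hypothetical.

```r
## posterior means copied from the context-level table above
tau_intercept = -0.3784623   ## Intercept of beta[1]
tau_gap       =  0.3928379   ## Effect of gap on beta[1]

## gap values seen in the head() and tail() of the data
gap = c(0.1, -0.8280808)

## implied expected coefficient, ASSUMING a linear context-level model
expected_beta1 = tau_intercept + tau_gap * gap
expected_beta1
```

These are point calculations from a very short chain, and the HPD intervals for both rows include zero, so they should be read as an illustration of the arithmetic and not as substantive findings.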

There is a series of built-in functions to visualize the output. The function plot_tau() displays the estimated effects of the context-level variables.

plot_tau(mod)

The function plot_pexp_beta() displays the association between the context-level features and the latent heterogeneity in the effect of the linear coefficients in each context. The parameter smooth.line plots a line representing the linear association between the context-level feature (gap) and the posterior averages of the marginal effects in each cluster. The parameter ncol.beta controls the number of columns in the figure for the panels. For more options, see help(plot_pexp_beta).

plot_pexp_beta(mod, smooth.line=TRUE, ncol.beta=2)
#> 
#> 
#> Generating plots ...
#> Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
#> of ggplot2 3.3.4.
#> ℹ The deprecated feature was likely used in the hdpGLM package.
#>   Please report the issue at <https://github.com/DiogoFerrari/hdpGLM/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> `geom_smooth()` using formula = 'y ~ x'
#> `geom_smooth()` using formula = 'y ~ x'
