The nlpembeds package provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI) and singular value decomposition (SVD) embeddings. In the biomedical and clinical setting, one challenge is the huge size of databases of electronic health records (EHRs), e.g. when analyzing data of millions of patients over tens of years. To address this, the package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices of tens of gigabytes of data, representing millions of patients over tens of years. PMI-SVD embeddings are extensively used, e.g. in Hong C. (2021).
Co-occurrence, PMI and SVD are methods to produce vector representations of concepts, i.e. embeddings, similar to the popular word2vec method (Mikolov et al. 2013). However, while word2vec is highly parallelized, it may lack stability, i.e. running the same computation twice can produce significantly different results. To improve stability, one can increase the number of iterations, but it may be preferable to compute co-occurrence, PMI (or more precisely, single-shift positive PMI), and SVD instead (Levy and Goldberg 2014).
As outlined above, the nlpembeds package aims to provide efficient methods to compute co-occurrence matrices and PMI-SVD. The monthly co-occurrence matrices are the computational bottleneck of the analysis; they are computed efficiently using the RcppAlgos package and sparse matrices, and the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data.
Consider the following data object:
library(nlpembeds)
df_ehr = data.frame(Patient = c(1, 1, 2, 1, 2, 1, 1, 3, 4),
                    Month = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    Parent_Code = c('C1', 'C2', 'C2', 'C1', 'C1', 'C1', 'C2',
                                    'C3', 'C4'),
                    Count = 1:9)
df_ehr
#> Patient Month Parent_Code Count
#> 1 1 1 C1 1
#> 2 1 1 C2 2
#> 3 2 1 C2 3
#> 4 1 2 C1 4
#> 5 2 2 C1 5
#> 6 1 3 C1 6
#> 7 1 3 C2 7
#> 8 3 4 C3 8
#> 9 4 4 C4 9
This represents our patient-level data: Patient is the patient identifier, Parent_Code is the diagnosis that was performed (or medication prescribed, etc.), Month is the year/month of the diagnosis, and Count is the number of times the diagnosis was performed during that month.
We can compute the monthly co-occurrence matrix, which gives us:
spm_cooc = build_df_cooc(df_ehr)
spm_cooc
#> 4 x 4 sparse Matrix of class "dgCMatrix"
#> C1 C2 C3 C4
#> C1 16 7 . .
#> C2 . 12 . .
#> C3 . . 8 .
#> C4 . . . 9
What the data represents is the following: the monthly co-occurrence between two codes is defined as the minimum of their two counts within a month. The co-occurrences are first computed for each patient and each month, and then sum-aggregated.
Here, let’s consider the co-occurrence between codes C1 and C2, and decompose the computation to make it easily understandable. The codes co-occurred in months 1 and 3 in patient 1, giving the following per-month co-occurrence matrices:
cooc_1 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 1), min_code_freq = 0)
cooc_1
#> 2 x 2 sparse Matrix of class "dgCMatrix"
#> C1 C2
#> C1 1 1
#> C2 . 2
cooc_2 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 3))
cooc_2
#> 2 x 2 sparse Matrix of class "dgCMatrix"
#> C1 C2
#> C1 6 6
#> C2 . 7
When summed up, we get a co-occurrence of 7 between C1 and C2 (1 + 6). Note that although codes C3 and C4 were both present during month 4, they were not recorded in the same patient, so their co-occurrence is 0.
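As a quick sanity check of this definition, here is a base R sketch, independent of build_df_cooc:
# patient-months where both C1 and C2 are present
c1 = subset(df_ehr, Parent_Code == 'C1')
c2 = subset(df_ehr, Parent_Code == 'C2')
both = merge(c1, c2, by = c('Patient', 'Month'))
# minimum of the two counts per patient-month, summed over patient-months
sum(pmin(both$Count.x, both$Count.y))
#> [1] 7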
We can then compute the PMI on that object:
m_pmi = get_pmi(spm_cooc)
m_pmi
#> C1 C2 C3 C4
#> C1 0.5791377 -0.0564856 0.000000 0.000000
#> C2 -0.0564856 0.6735661 0.000000 0.000000
#> C3 0.0000000 0.0000000 1.998096 0.000000
#> C4 0.0000000 0.0000000 0.000000 1.880313
Here, the PMI between C1 and C2 is log(7 * 59 / ((16 + 7) * (12 + 7))) = -0.056, and the PMI of C1 with itself is log(16 * 59 / (16 + 7) ^ 2) = 0.579. The marginal count of a code is its diagonal count plus its off-diagonal co-occurrences (e.g. 16 + 7 = 23 for C1), and 59 is the sum of all marginals.
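To make these formulas concrete, here is a minimal manual check of the two values above, using the standard PMI definition (a sketch, not the package implementation):
# symmetrize the upper-triangular co-occurrence matrix
m_full = as.matrix(spm_cooc)
m_full = m_full + t(m_full) - diag(diag(m_full))
marginals = rowSums(m_full) # e.g. 16 + 7 = 23 for C1
total = sum(marginals)      # 59 here
log(m_full['C1', 'C2'] * total / (marginals[['C1']] * marginals[['C2']])) # ~ -0.056
log(m_full['C1', 'C1'] * total / marginals[['C1']] ^ 2)                   # ~ 0.579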
I acknowledge here Dr. Doudou Zhou and Dr. Ziming Gan for their contributions to the implementation of the PMI and SVD computations.
We then call SVD on the PMI matrix, and feed the m_svd object (i.e. the embedding matrix) to downstream analyses, such as knowledge graphs.
m_svd = get_svd(m_pmi, embedding_dim = 2)
m_svd
#> [,1] [,2]
#> C1 -3.543017e-01 6.735043e-01
#> C2 7.579192e-01 3.148406e-01
#> C3 3.808060e-16 -3.031175e-17
#> C4 4.321832e-16 -8.587719e-18
The SVD is computed with randomized SVD (Erichson et al. 2016), an efficient approximation of truncated SVD in which only the first principal components are returned, using the rsvd package. The rsvd author suggests that the number of requested dimensions k should satisfy k < n / 4, where n is the number of features, for the approximation to be efficient; otherwise one should rather use standard SVD or truncated SVD. Furthermore, we select only the positive principal vectors (further details in the documentation of the function).
For EHR data, I recommend an SVD rank between 1000 and 2000 when analyzing codified and natural language processing (NLP) data (15k - 25k features), and less if only analyzing codified data (300 - 800 dimensions for 5k - 10k codified features). Known pairs can be used to compute AUCs and select the best performing rank, e.g. with the fit_embeds_kg function from the kgraph package or with the kgraph Shiny app. The kgraph package also includes a dataset of known pairs of NLP concepts derived from PrimeKG (Chandak, Huang, and Zitnik 2023) (further introduced in the vignette).
When selecting the number of dimensions, one should balance two objectives: a higher AUC and a lower number of dimensions, as a large number of dimensions can lead to overfitting while a smaller number acts as a regularization. The aim is thus to identify the range of dimensions after which increasing the number of dimensions no longer significantly increases the AUC. For generalization and reproducibility in other cohorts, it may be better to accept a slightly smaller AUC in exchange for a lower number of dimensions. As an example, an edge case would be an AUC of 0.732 at 1000 dimensions and 0.75 at 2000 dimensions; in my opinion, either choice would be fine in that case, and perhaps a more thorough investigation could be performed.
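As a rough sketch of this selection procedure, one could compute embeddings at several candidate ranks and evaluate each against known pairs. Here m_pmi_full and df_known_pairs are hypothetical objects, and the AUC evaluation (e.g. with fit_embeds_kg from kgraph) is only indicated, not prescribed:
# candidate SVD ranks for a large PMI matrix m_pmi_full (assumed to exist)
candidate_dims = c(300, 500, 800, 1000, 1500, 2000)
l_svds = lapply(candidate_dims, function(k) get_svd(m_pmi_full, embedding_dim = k))
names(l_svds) = candidate_dims
# for each embedding in l_svds, compute the AUC on df_known_pairs
# (e.g. with kgraph::fit_embeds_kg) and keep the smallest rank after which
# the AUC stops improving meaningfully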
To efficiently compute co-occurrence matrices from large databases of tens of gigabytes, two important optimizations are available: batching by patients from a SQL database, and parallelization of the computation over multiple cores.
To perform batching by patients, we want to have our input file as a SQL file, which we read by batches of patients, aggregating the results. This is performed with the sql_cooc function.
To demonstrate how it is used, we first write the example object above as a SQL database. It is important to name the columns and the SQL table as here (i.e. four columns Patient, Month, Parent_Code and Count, and table name df_monthly). Also, we index the table by patients, and write the unique codes as a second table, df_uniq_codes (this is optional: it is done automatically in sql_cooc if the table df_uniq_codes is not found in the input SQL file and the autoindex parameter is set to TRUE). It is a good idea to also order your table by patients before writing it, as it will make read accesses more efficient; some of my experiments showed it can make the computation up to 4 times faster.
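For instance, a minimal sketch of such an ordering (illustrative only; the toy example below is small enough that we write it as is):
# order rows by patient so that each patient batch is a contiguous read
df_ordered = df_ehr[order(df_ehr$Patient), ]
# then pass df_ordered instead of df_ehr to dbWriteTable() below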
library(RSQLite)
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr, overwrite = TRUE)
###
# optional, done automatically by sql_cooc if table 'df_uniq_codes' not found
# and parameter autoindex set to TRUE
dbExecute(test_db, "CREATE INDEX patient_idx ON df_monthly (Patient)")
#> [1] 0
df_uniq_codes = unique(df_ehr['Parent_Code'])
dbWriteTable(test_db, 'df_uniq_codes', df_uniq_codes, overwrite = TRUE)
###
dbDisconnect(test_db)
We can then call the sql_cooc function on that filename, and specify a new filename for the output. For each batch, information about the current memory usage is printed, which can be helpful for debugging memory usage and for following the progress of the computation from a log file.
output_db_path = tempfile()
sql_cooc(input_path = test_db_path, output_path = output_db_path)
#> Batch size: 1752
#> Batch completed
#> Full sparse matrix size: 1912
Once the computation is complete, we read the co-occurrence sparse matrix from the output SQL file.
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
#> V1 V2 value
#> 1 C1 C1 16
#> 2 C1 C2 7
#> 3 C2 C2 12
#> 4 C3 C3 8
#> 5 C4 C4 9
As previously, we can then feed it to get_pmi:
m_pmi = get_pmi(spm_cooc)
m_pmi
#> C1 C2 C3 C4
#> C1 0.5791377 -0.0564856 0.000000 0.000000
#> C2 -0.0564856 0.6735661 0.000000 0.000000
#> C3 0.0000000 0.0000000 1.998096 0.000000
#> C4 0.0000000 0.0000000 0.000000 1.880313
Or transform it into a classic matrix:
spm_cooc = build_spm_cooc_sym(spm_cooc)
m_cooc = as.matrix(spm_cooc)
m_cooc
#> C1 C2 C3 C4
#> C1 16 7 0 0
#> C2 7 12 0 0
#> C3 0 0 8 0
#> C4 0 0 0 9
That is the minimal call. Two important parameters come into play here:
- n_batch: the number of patients to include in each batch
- n_cores: the number of cores on which the computation will be parallelized
For both parameters, the higher the value, the faster the computation and the more RAM required. The user can fine-tune these parameters to fit the machine specifications and optimize performance. As an example, n_batch = 300 and n_cores = 6 should be able to run on 16 GB machines if the number of codes is not too large (e.g. if only considering rolled-up codified data).
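For example, a sketch of a call making these two parameters explicit (values to be tuned to your machine):
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         n_batch = 300, n_cores = 6, overwrite_output = TRUE)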
When the number of unique codes is large, e.g. when analyzing NLP concepts, one may want to first subset the codes by providing a dictionary of codes to include.
Three parameters are available here:
- exclude_code_pattern: pattern of code prefixes to exclude (will be used in SQL, appended by ‘%’). For example, ‘AB’.
- exclude_dict_pattern: used in combination with the codes dictionary defined by codes_dict_fpaths. Pattern of code prefixes to exclude, except if they are found in the dictionary (will be used in grep, prefixed by ‘^’). For example, ‘C[0-9]’.
- codes_dict_fpaths: used in combination with exclude_dict_pattern. File paths of the dictionaries defining codes that should not be excluded by exclude_dict_pattern. The first column of each file must define the code identifiers.
To demonstrate the behavior, we will rename the first two codes in the object above to two NLP concepts present in the dictionaries provided with the package.
library(magrittr) # provides the %<>% compound assignment pipe
df_ehr$Parent_Code %<>% ifelse(. == 'C1', 'C0000545', .)
df_ehr$Parent_Code %<>% ifelse(. == 'C2', 'C0000578', .)
df_ehr
#> Patient Month Parent_Code Count
#> 1 1 1 C0000545 1
#> 2 1 1 C0000578 2
#> 3 2 1 C0000578 3
#> 4 1 2 C0000545 4
#> 5 2 2 C0000545 5
#> 6 1 3 C0000545 6
#> 7 1 3 C0000578 7
#> 8 3 4 C3 8
#> 9 4 4 C4 9
We write it as a SQL file.
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr)
dbDisconnect(test_db)
The code is then called like this:
codes_dict_fpaths = list.files(system.file('dictionaries',
                                           package = 'nlpembeds'),
                               full.names = TRUE)
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         exclude_dict_pattern = 'C[0-9]',
         codes_dict_fpaths = codes_dict_fpaths,
         autoindex = TRUE, overwrite_output = TRUE)
#> Writing SQL index on patients
#> Writing table of unique codes df_uniq_codes
#> Batch size: 1768
#> Batch completed
#> Full sparse matrix size: 1752
Finally, we read the co-occurrence sparse matrix from that SQL file.
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
#> V1 V2 value
#> 1 C0000545 C0000545 16
#> 2 C0000545 C0000578 7
#> 3 C0000578 C0000578 12
As previously, we can then feed it to get_pmi, or transform it into a classic matrix.
If using a CentOS server, nlpembeds may not install correctly and you may need to use v2.4.0 of RcppAlgos. One can install it this way, before installing nlpembeds:
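For example, one possible way to do so is with the remotes package (a sketch; adapt to your setup):
# install RcppAlgos v2.4.0 before installing nlpembeds
remotes::install_version('RcppAlgos', version = '2.4.0')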
The n_batch patients are parallelized on the n_cores available, so n_batch needs to be larger than n_cores, and it is best to have at least 2-3 patients per core. The aggregated sparse matrix will logarithmically increase in size, so if you are close to the RAM limits your job may fail only after several hours.
The number of codes has a quadratic impact on memory, so depending on the number of codes considered you can optimize n_batch and n_cores. Assuming 250 GB of RAM available and ~30k codes, I would recommend n_batch = 75 and n_cores = 25, and estimate that the co-occurrence matrix for ~60k patients should take about 3 hours. You should be able to scale the parameters more or less linearly with the amount of RAM, meaning that with 500 GB available you could use n_batch = 150 and n_cores = 50 and it should run in about 1h30. If only considering 10-15k codes, you can most likely use n_batch = 600 and n_cores = 20. If considering more than 30k codes, you will need to reduce n_batch and/or n_cores.
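As an illustration of the first scenario above (hypothetical file paths, ~250 GB of RAM and ~30k codes):
# large run sketch with hypothetical input/output paths
sql_cooc(input_path = 'monthly_counts.sqlite', output_path = 'monthly_cooc.sqlite',
         n_batch = 75, n_cores = 25, autoindex = TRUE)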