The nlpembeds package provides efficient methods to compute co-occurrence matrices, pointwise mutual information (PMI) and singular value decomposition (SVD) embeddings. In the biomedical and clinical setting, one challenge is the huge size of databases of electronic health records (EHRs), e.g. when analyzing data of millions of patients over tens of years. To address this, the package provides functions to efficiently compute monthly co-occurrence matrices, which is the computational bottleneck of the analysis, by using the RcppAlgos package and sparse matrices. Furthermore, the functions can be called on SQL databases, enabling the computation of co-occurrence matrices of tens of gigabytes of data, representing millions of patients over tens of years. PMI-SVD embeddings are extensively used, e.g. in Hong C. (2021).
Co-occurrence, PMI and SVD are methods to produce vector representations of concepts, i.e. embeddings, similar to the popular word2vec method (Mikolov et al. 2013). However, while word2vec is highly parallelized, it may lack stability, i.e. running the same computation twice can produce significantly different results. To improve stability, one can increase the number of iterations, but it may be preferable to compute co-occurrence, PMI (or more precisely, single-shift positive PMI), and SVD instead (Levy and Goldberg 2014).
As outlined above, the nlpembeds package aims to provide efficient methods to compute co-occurrence matrices and PMI-SVD. The monthly co-occurrence matrices are the computational bottleneck of the analysis; they are computed efficiently using the RcppAlgos package and sparse matrices, and the functions can be called on SQL databases, enabling the computation of co-occurrence matrices from tens of gigabytes of data.
Consider the following data object:
library(nlpembeds)
df_ehr = data.frame(Patient = c(1, 1, 2, 1, 2, 1, 1, 3, 4),
                    Month = c(1, 1, 1, 2, 2, 3, 3, 4, 4),
                    Parent_Code = c('C1', 'C2', 'C2', 'C1', 'C1', 'C1', 'C2',
                                    'C3', 'C4'),
                    Count = 1:9)
df_ehr
#> Patient Month Parent_Code Count
#> 1 1 1 C1 1
#> 2 1 1 C2 2
#> 3 2 1 C2 3
#> 4 1 2 C1 4
#> 5 2 2 C1 5
#> 6 1 3 C1 6
#> 7 1 3 C2 7
#> 8 3 4 C3 8
#> 9 4 4 C4 9
This represents our patient-level data: Patient is the patient identifier, Parent_Code is the diagnosis that was performed (or medication prescribed, etc.), Month is the year/month of the diagnosis, and Count is the number of times the diagnosis was performed during that month.
We can compute the monthly co-occurrence matrix, which gives us:
spm_cooc = build_df_cooc(df_ehr)
spm_cooc
#> 4 x 4 sparse Matrix of class "dgCMatrix"
#> C1 C2 C3 C4
#> C1 16 7 . .
#> C2 . 12 . .
#> C3 . . 8 .
#> C4 . . . 9
What the data represents is the following: the monthly co-occurrence between two codes is defined as the minimum of their two counts within a month. The co-occurrences are first computed for each patient and each month, and then sum-aggregated.
Here, let’s consider the co-occurrence between codes C1 and C2, and decompose the computation to make it easily understandable. The codes co-occurred in months 1 and 3 in patient 1, giving the following per-month co-occurrence matrices:
cooc_1 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 1), min_code_freq = 0)
cooc_1
#> 2 x 2 sparse Matrix of class "dgCMatrix"
#> C1 C2
#> C1 1 1
#> C2 . 2
cooc_2 = build_df_cooc(subset(df_ehr, Patient == 1 & Month == 3))
cooc_2
#> 2 x 2 sparse Matrix of class "dgCMatrix"
#> C1 C2
#> C1 6 6
#> C2 . 7
When summed up, we get a co-occurrence of 7 between C1 and C2 (1 + 6). Note that although codes C3 and C4 were both present during month 4, they were not recorded in the same patient, so their co-occurrence is 0.
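As a quick sanity check of this definition, here is a base R sketch, independent of build_df_cooc:
# patient-months where both C1 and C2 are present
c1 = subset(df_ehr, Parent_Code == 'C1')
c2 = subset(df_ehr, Parent_Code == 'C2')
both = merge(c1, c2, by = c('Patient', 'Month'))
# minimum of the two counts per patient-month, summed over patient-months
sum(pmin(both$Count.x, both$Count.y))
#> [1] 7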
We can then compute the PMI on that object:
m_pmi = get_pmi(spm_cooc)
m_pmi
#> C1 C2 C3 C4
#> C1 0.5791377 -0.0564856 0.000000 0.000000
#> C2 -0.0564856 0.6735661 0.000000 0.000000
#> C3 0.0000000 0.0000000 1.998096 0.000000
#> C4 0.0000000 0.0000000 0.000000 1.880313
Here, the PMI between C1 and C2 is log(7 * 59 / ((16 + 7) * (12 + 7))) = -0.056, and the PMI of C1 with itself is log(16 * 59 / (16 + 7) ^ 2) = 0.579. The marginal count of a code is its diagonal count plus its off-diagonal co-occurrences (e.g. 16 + 7 = 23 for C1), and 59 is the sum of all marginals.
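To make these formulas concrete, here is a minimal manual check of the two values above, using the standard PMI definition (a sketch, not the package implementation):
# symmetrize the upper-triangular co-occurrence matrix
m_full = as.matrix(spm_cooc)
m_full = m_full + t(m_full) - diag(diag(m_full))
marginals = rowSums(m_full) # e.g. 16 + 7 = 23 for C1
total = sum(marginals)      # 59 here
log(m_full['C1', 'C2'] * total / (marginals[['C1']] * marginals[['C2']])) # ~ -0.056
log(m_full['C1', 'C1'] * total / marginals[['C1']] ^ 2)                   # ~ 0.579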
I acknowledge here Dr. Doudou Zhou and Dr. Ziming Gan for their contributions to the implementation of the PMI and SVD computations.
We then call SVD on the PMI matrix, and feed the m_svd object (i.e. the embedding matrix) to downstream analyses, such as knowledge graphs.
m_svd = get_svd(m_pmi, embedding_dim = 2)
m_svd
#> [,1] [,2]
#> C1 -3.543017e-01 6.735043e-01
#> C2 7.579192e-01 3.148406e-01
#> C3 3.808060e-16 -3.031175e-17
#> C4 4.321832e-16 -8.587719e-18
The SVD is computed with randomized SVD (Erichson et al. 2016), an efficient approximation of truncated SVD in which only the first principal components are returned, using the rsvd package. The rsvd author suggests that the number of requested dimensions k should satisfy k < n / 4, where n is the number of features, for the approximation to be efficient; otherwise one should rather use standard SVD or truncated SVD. Furthermore, we select only the positive principal vectors (further details in the documentation of the function).
For EHR data, I recommend an SVD rank between 1000 and 2000 when analyzing codified and natural language processing (NLP) data (15k - 25k features), and less if only analyzing codified data (300 - 800 dimensions for 5k - 10k codified features). Known pairs can be used to compute AUCs and select the best performing rank, e.g. with the fit_embeds_kg function from the kgraph package or with the kgraph Shiny app. The kgraph package also includes a dataset of known pairs of NLP concepts derived from PrimeKG (Chandak, Huang, and Zitnik 2023) (further introduced in the vignette).
When selecting the number of dimensions, one should balance two objectives: a higher AUC and a lower number of dimensions, as a large number of dimensions can lead to overfitting while a smaller number acts as a regularization. The aim is thus to identify the range of dimensions after which increasing the number of dimensions no longer significantly increases the AUC. For generalization and reproducibility in other cohorts, it may be better to accept a slightly smaller AUC in exchange for a lower number of dimensions. As an example, an edge case would be an AUC of 0.732 at 1000 dimensions and 0.75 at 2000 dimensions; in my opinion, either choice would be fine in that case, and perhaps a more thorough investigation could be performed.
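As a rough sketch of this selection procedure, one could compute embeddings at several candidate ranks and evaluate each against known pairs. Here m_pmi_full and df_known_pairs are hypothetical objects, and the AUC evaluation (e.g. with fit_embeds_kg from kgraph) is only indicated, not prescribed:
# candidate SVD ranks for a large PMI matrix m_pmi_full (assumed to exist)
candidate_dims = c(300, 500, 800, 1000, 1500, 2000)
l_svds = lapply(candidate_dims, function(k) get_svd(m_pmi_full, embedding_dim = k))
names(l_svds) = candidate_dims
# for each embedding in l_svds, compute the AUC on df_known_pairs
# (e.g. with kgraph::fit_embeds_kg) and keep the smallest rank after which
# the AUC stops improving meaningfully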
To efficiently compute co-occurrence matrices from large databases of tens of gigabytes, two important optimizations are available: batching by patients from a SQL database, and parallelization of the computation over multiple cores.
To perform batching by patients, we want to have our input file as a SQL file, which we read by batches of patients, aggregating the results. This is performed with the sql_cooc function.
To demonstrate how it is used, we first write the example object above as a SQL database. It is important to name the columns and the SQL table as here (i.e. four columns Patient, Month, Parent_Code and Count, and table name df_monthly). Also, we index the table by patients, and write the unique codes as a second table, df_uniq_codes (this is optional: it is done automatically in sql_cooc if the table df_uniq_codes is not found in the input SQL file and the autoindex parameter is set to TRUE). It is a good idea to also order your table by patients before writing it, as it will make read accesses more efficient; some of my experiments showed it can make the computation up to 4 times faster.
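For instance, a minimal sketch of such an ordering (illustrative only; the toy example below is small enough that we write it as is):
# order rows by patient so that each patient batch is a contiguous read
df_ordered = df_ehr[order(df_ehr$Patient), ]
# then pass df_ordered instead of df_ehr to dbWriteTable() below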
library(RSQLite)
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr, overwrite = TRUE)
###
# optional, done automatically by sql_cooc if table 'df_uniq_codes' not found
# and parameter autoindex set to TRUE
dbExecute(test_db, "CREATE INDEX patient_idx ON df_monthly (Patient)")
#> [1] 0
df_uniq_codes = unique(df_ehr['Parent_Code'])
dbWriteTable(test_db, 'df_uniq_codes', df_uniq_codes, overwrite = TRUE)
###
dbDisconnect(test_db)
We can then call the sql_cooc function on that filename, and specify a new filename for the output. For each batch, information about the current memory usage is printed, which can be helpful for debugging memory usage and for following the progress of the computation from a log file.
output_db_path = tempfile()
sql_cooc(input_path = test_db_path, output_path = output_db_path)
#> Batch size: 1752
#> Batch completed
#> Full sparse matrix size: 1912
Once the computation is complete, we read the co-occurrence sparse matrix from the output SQL file.
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
#> V1 V2 value
#> 1 C1 C1 16
#> 2 C1 C2 7
#> 3 C2 C2 12
#> 4 C3 C3 8
#> 5 C4 C4 9
As previously, we can then feed it to get_pmi:
m_pmi = get_pmi(spm_cooc)
m_pmi
#> C1 C2 C3 C4
#> C1 0.5791377 -0.0564856 0.000000 0.000000
#> C2 -0.0564856 0.6735661 0.000000 0.000000
#> C3 0.0000000 0.0000000 1.998096 0.000000
#> C4 0.0000000 0.0000000 0.000000 1.880313
Or transform it into a classic matrix:
spm_cooc = build_spm_cooc_sym(spm_cooc)
m_cooc = as.matrix(spm_cooc)
m_cooc
#> C1 C2 C3 C4
#> C1 16 7 0 0
#> C2 7 12 0 0
#> C3 0 0 8 0
#> C4 0 0 0 9
That is the minimal call. Two important parameters come into play here:
- n_batch: the number of patients to include in each batch
- n_cores: the number of cores on which the computation will be parallelized
For both parameters, the higher the value, the faster the computation and the more RAM required. The user can fine-tune these parameters to fit the machine specifications and optimize performance. As an example, n_batch = 300 and n_cores = 6 should be able to run on 16 GB machines if the number of codes is not too large (e.g. if only considering rolled-up codified data).
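For example, a sketch of a call making these two parameters explicit (values to be tuned to your machine):
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         n_batch = 300, n_cores = 6, overwrite_output = TRUE)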
When the number of unique codes is large, e.g. when analyzing NLP concepts, one may want to first subset the codes by providing a dictionary of codes to include.
Three parameters are available here:
- exclude_code_pattern: pattern of code prefixes to exclude (will be used in SQL, appended by ‘%’). For example, ‘AB’.
- exclude_dict_pattern: used in combination with the codes dictionary defined by codes_dict_fpaths. Pattern of code prefixes to exclude, except if they are found in the dictionary (will be used in grep, prefixed by ‘^’). For example, ‘C[0-9]’.
- codes_dict_fpaths: used in combination with exclude_dict_pattern. File paths of the dictionaries defining codes that should not be excluded by exclude_dict_pattern. The first column of each file must define the code identifiers.
To demonstrate the behavior, we will rename the first two codes in the object above to two NLP concepts present in the dictionaries provided with the package.
library(magrittr) # provides the %<>% compound assignment pipe
df_ehr$Parent_Code %<>% ifelse(. == 'C1', 'C0000545', .)
df_ehr$Parent_Code %<>% ifelse(. == 'C2', 'C0000578', .)
df_ehr
#> Patient Month Parent_Code Count
#> 1 1 1 C0000545 1
#> 2 1 1 C0000578 2
#> 3 2 1 C0000578 3
#> 4 1 2 C0000545 4
#> 5 2 2 C0000545 5
#> 6 1 3 C0000545 6
#> 7 1 3 C0000578 7
#> 8 3 4 C3 8
#> 9 4 4 C4 9
We write it as a SQL file.
test_db_path = tempfile()
test_db = dbConnect(SQLite(), test_db_path)
dbWriteTable(test_db, 'df_monthly', df_ehr)
dbDisconnect(test_db)
The code is then called like this:
codes_dict_fpaths = list.files(system.file('dictionaries',
                                           package = 'nlpembeds'),
                               full.names = TRUE)
sql_cooc(input_path = test_db_path, output_path = output_db_path,
         exclude_dict_pattern = 'C[0-9]',
         codes_dict_fpaths = codes_dict_fpaths,
         autoindex = TRUE, overwrite_output = TRUE)
#> Writing SQL index on patients
#> Writing table of unique codes df_uniq_codes
#> Batch size: 1768
#> Batch completed
#> Full sparse matrix size: 1752
Finally, we read the co-occurrence sparse matrix from that SQL file.
test_db = dbConnect(SQLite(), output_db_path)
spm_cooc = dbGetQuery(test_db, 'select * from df_monthly;')
dbDisconnect(test_db)
spm_cooc
#> V1 V2 value
#> 1 C0000545 C0000545 16
#> 2 C0000545 C0000578 7
#> 3 C0000578 C0000578 12
As previously, we can then feed it to get_pmi, or transform it into a classic matrix.
If using a CentOS server, nlpembeds may not install correctly and you may need to use v2.4.0 of RcppAlgos. One can install it this way, before installing nlpembeds:
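For example, one possible way to do so is with the remotes package (a sketch; adapt to your setup):
# install RcppAlgos v2.4.0 before installing nlpembeds
remotes::install_version('RcppAlgos', version = '2.4.0')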
The n_batch patients are parallelized on the n_cores available, so n_batch needs to be larger than n_cores, and it is best to have at least 2-3 patients per core. The aggregated sparse matrix will logarithmically increase in size, so if you are close to the RAM limits your job may fail only after several hours.
The number of codes has a quadratic impact on memory, so depending on the number of codes considered you can optimize n_batch and n_cores. Assuming 250 GB of RAM available and ~30k codes, I would recommend n_batch = 75 and n_cores = 25, and estimate that the co-occurrence matrix for ~60k patients should take about 3 hours. You should be able to scale the parameters more or less linearly with the amount of RAM, meaning that with 500 GB available you could use n_batch = 150 and n_cores = 50 and it should run in about 1h30. If only considering 10-15k codes, you can most likely use n_batch = 600 and n_cores = 20. If considering more than 30k codes, you will need to reduce n_batch and/or n_cores.
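As an illustration of the first scenario above (hypothetical file paths, ~250 GB of RAM and ~30k codes):
# large run sketch with hypothetical input/output paths
sql_cooc(input_path = 'monthly_counts.sqlite', output_path = 'monthly_cooc.sqlite',
         n_batch = 75, n_cores = 25, autoindex = TRUE)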