--- title: "How to best manage computationally heavy analyses" output: github_document #rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Computer_Capacity} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` Many NLP analyses require a lot of computational resources; but standard sized datasets in psychology can be analysed with *text* on a standard laptop. It is rather the computation of the pre-trained language models that require a huge amount of computational resources. Wolf et al., (2020) exemplify this by pointing out that the RoBERTa language model: >"was trained on 160 GB of text using 1024 32GB V100. On Amazon-Web-Services cloud computing (AWS), such a pretraining would cost approximately 100K USD." (p. 2) Hence, there are a lot of computations behind the word embeddings. In *text*, the most computationally heavy and time consuming elements are the process of retrieving word embeddings using ```textEmbedLayersOutput``` (which is also used in ```textEmbed```). Retrieving word embeddings for a standard dataset with a few hundred participants may take between 15 minutes to an hour. Hence, it is worth planning analyses. A few time and resource management advice include: 1. *Testing:* Before you run the analyses on your entire dataset, ensure that everything first runs smoothly on a small part of the data set (e.g., 20 rows of your data). Note that this should not be a process of testing different setting, but only to see that everything works. 2. *Timekeeping:* When running different analyses have the computer take time (see code example below), so that you get a better understanding of how long time different analyses take. 3. *Scheduling:* Run those analyses that take longer time over a coffee break or over night. So for example, it might be worth retrieving all word embeddings at a separate time. ```{r timing_Computer_Capacity, eval = FALSE, warning=FALSE, message=FALSE} library(text) # Save starting time T1 <- Sys.time() textEmbed(Language_based_assessment_data_8_10[1,1], layers = 12, decontexts = FALSE) # Save stoping time T2 <- Sys.time() # Compute time taken to run above function T2-T1 ``` ### Your system's capacity Thinking about your computer's memory capacity may become important if you have a lot of data and use multiple layers with many dimensions. For example consider that one sentence of 10 words/tokens, which are each represented by 12 layers a 768 dimensions results in 92 160 values (i.e., 10 x 12 x 768). To avoid running out of memory and get analyses to run faster, consider to: 1. Only retrieve the layers that you plan to use (e.g., ```layers = 11:12```) rather than retrieving all layers (i.e., ```layers = 'all'```). 2. Only retrieve tokens if you are planing to use them; otherwise set ```return_tokens = FALSE```. 3. Do not ask for decontextualized word embeddings if you are not going to us them (e.g., in plotting). ### References Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., & Funtowicz, M. (2019). [Huggingface’s transformers: State-of-the-art natural language processing.](https://arxiv.org/abs/1910.03771)