library(BIDistances)
#> Warning: replacing previous import 'Rcpp::LdFlags' by 'RcppParallel::LdFlags'
#> when loading 'DataVisualizations'
This package contains various functions for distance measures useful for bioinformatic data.
Installation using GitHub
library(remotes)
install_github("Mthrun/BIDistances")
The cosine distance is a distance measure based on the cosine similarity. Let \(A\) be the data matrix and \(A_i\), \(A_j\) two row vectors of \(A\). The cosine similarity is then defined as \(s(i,j) = \cos(\theta) = \frac{\mathbf{A_i} \cdot \mathbf{A_j}}{\|\mathbf{A_i}\| \, \|\mathbf{A_j}\|}\), where \(\theta\) is the angle between \(A_i\) and \(A_j\), and the cosine distance as \(d(i,j) = \max(s) - s(i,j)\).
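The definition above can be written out in a few lines of base R. This is only a minimal sketch, not the package's implementation: the function name `cosine_distance_sketch` is hypothetical, and it assumes that the maximum of \(s\) is taken over all pairs of rows and that no row of \(A\) is the zero vector.

```r
# Hypothetical helper (not part of BIDistances): cosine distance between all rows of A.
cosine_distance_sketch <- function(A) {
  A <- as.matrix(A)
  norms <- sqrt(rowSums(A^2))            # Euclidean norm of every row
  S <- (A %*% t(A)) / (norms %o% norms)  # cosine similarity s(i, j)
  max(S) - S                             # assumption: max(s) is taken over all pairs
}

A <- rbind(c(1, 0), c(0, 1), c(1, 1))
D <- cosine_distance_sketch(A)
print(D)
```

Rows 1 and 2 are orthogonal, so their similarity is 0 and their distance equals the maximum similarity found in the matrix.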
The Dist2All function calculates the distances from a given point \(x\) to all points (rows) of a given data matrix \(A\). Various distance measures can be chosen for the calculation, e.g. Euclidean, Manhattan (city block), Mahalanobis, or Bhattacharyya; for a complete list see parallelDist. The distance measure is specified with the method argument. The function returns a vector of the distances from point \(x\) to all points in \(A\) (distToAll), as well as the indices of the k nearest neighbors in ascending order of distance (KNN) for the chosen distance measure.
data(Hepta)
V = Dist2All(Hepta$Data[1,],Hepta$Data, method = "euclidean", knn=3)
# Vector of distances from Hepta$Data[1,] to all rows in Hepta$Data (including itself)
print(V$distToAll)
#> [1] 0.00000000 0.08058895 0.04781043 0.09214454 0.03886835 0.09699465
#> [7] 0.09783282 0.06061589 0.10267484 0.08356635 0.12395455 0.13170909
#> [13] 0.09367107 0.07311107 0.10489254 0.07201004 0.16944962 0.08932404
#> [19] 0.13545775 0.03665085 0.12492320 0.13485587 0.05292353 0.03097211
#> [25] 0.14614814 0.08677934 0.02394002 0.11479889 0.03230518 0.14903284
#> [31] 0.13893677 0.14948497 2.52446917 3.01913398 3.21579110 3.17836461
#> [37] 3.15769630 2.57796199 3.31600855 3.42089925 3.83891632 3.58352663
#> [43] 3.82807942 3.74241501 3.66695458 2.56495911 3.48798635 2.51574735
#> [49] 2.47210300 3.08088830 3.29044933 2.51797162 2.77927187 3.11074797
#> [55] 3.09722767 3.55496535 3.33477595 3.09117752 3.14815058 2.98838821
#> [61] 2.90743022 2.93734164 2.66675926 2.64259662 3.36701734 3.90809765
#> [67] 3.30677234 2.66204951 2.20011186 3.05133415 3.51307757 2.65792595
#> [73] 3.32292220 2.87368651 2.72029774 2.61406504 2.92254677 2.83595565
#> [79] 3.55531532 2.53112234 2.76796248 2.90106297 3.11617310 3.60091602
#> [85] 3.55476025 2.31093773 2.62329978 3.22935873 2.56992529 3.40588864
#> [91] 3.90022679 2.81386462 3.02034146 3.05283900 2.13176098 3.07322426
#> [97] 3.49643390 3.33818899 3.28321841 2.64631876 3.34413644 3.69818991
#> [103] 2.86346004 3.65485742 3.78012860 3.58415974 2.65018304 3.56255550
#> [109] 3.65163561 2.91422029 3.07258132 2.45181926 2.29991462 3.20917355
#> [115] 3.70924494 2.59280107 2.97424022 2.83887470 3.53219603 2.70771842
#> [121] 3.03205030 3.31160172 2.47996181 3.05245948 3.12721819 3.63906971
#> [127] 3.07121966 2.40720597 2.77981952 3.75378880 3.93878434 2.63787864
#> [133] 3.57013739 3.00944011 3.00081140 3.10025752 2.44570366 3.09900684
#> [139] 2.94566780 3.22610410 3.77257806 2.95948219 3.04835200 3.29707317
#> [145] 2.38829944 3.36077136 3.68833648 2.18316289 2.99890839 2.81540383
#> [151] 2.42404613 3.81733227 2.92926568 3.45549966 3.21561093 3.37903200
#> [157] 2.41146632 3.09742210 2.93177839 3.02379783 3.01282943 2.31164299
#> [163] 2.92613725 3.30081802 2.89988712 2.83634572 2.95293088 3.38450777
#> [169] 2.22953148 2.83342086 3.52553473 2.32071642 2.65455358 2.52694921
#> [175] 2.78506782 3.55896170 2.21862698 3.10491516 2.20840668 2.95602706
#> [181] 3.02296244 3.87704358 3.02381731 3.93379495 3.70924221 3.03949680
#> [187] 3.13826953 3.00121181 2.97098494 2.90194795 3.67516270 3.42685363
#> [193] 3.65196565 2.40343230 3.17742347 2.80353846 3.04065098 2.98600351
#> [199] 3.22565744 3.16701313 2.52899115 3.72693787 2.57746647 3.77579621
#> [205] 2.94798545 3.06495823 2.52541787 2.76796966 3.30391078 2.95077124
#> [211] 3.67311616 2.69897901
# Indices of the k nearest neighbors according to the Euclidean distance
print(V$KNN)
#> [1] 1 27 24
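To make the two return values easier to read, the same computation can be sketched in base R for the Euclidean case. The helper `dist_to_all_sketch` is hypothetical and only mirrors the element names seen in the output above (distToAll, KNN); it is not how the package computes its result.

```r
# Hypothetical helper (not part of BIDistances): Euclidean distances from x to all rows of A.
dist_to_all_sketch <- function(x, A, knn = 3) {
  A <- as.matrix(A)
  d <- sqrt(rowSums(sweep(A, 2, x)^2))  # subtract x from every row, then take row norms
  list(distToAll = d,                   # distances in row order of A
       KNN = order(d)[seq_len(knn)])    # indices of the knn smallest distances
}

A <- rbind(c(0, 0), c(3, 4), c(1, 0), c(0, 2))
res <- dist_to_all_sketch(A[1, ], A, knn = 3)
print(res$distToAll)
print(res$KNN)
```

As in the Hepta example, the query point itself appears as its own nearest neighbor with distance 0 when it is one of the rows of the data matrix.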
For a given \([1:n, 1:d]\) data matrix \(A\) with \(n\) cases and \(d\) variables, the function calculates the symmetric \([1:n, 1:n]\) distance matrix for a chosen distance measure. The method argument specifies the distance measure (Euclidean by default).
Options for method include:
‘euclidean’, ‘sqEuclidean’, ‘binary’, ‘cityblock’, ‘maximum’, ‘canberra’, ‘cosine’, ‘chebychev’, ‘jaccard’, ‘mahalanobis’, ‘minkowski’, ‘manhattan’, ‘braycur’.
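The shape and symmetry of such a distance matrix can be checked with base R alone. The sketch below uses stats::dist as a stand-in for the package function and random data as an illustrative assumption.

```r
# Stand-in example using base R: a 5 x 3 data matrix gives a 5 x 5 distance matrix.
set.seed(1)
A <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3)  # n = 5 cases, d = 3 variables
D <- as.matrix(stats::dist(A, method = "euclidean"))

print(dim(D))          # n x n
print(isSymmetric(D))  # d(i, j) equals d(j, i)
print(all(diag(D) == 0))
```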
For the method ‘minkowski’, the parameter dim can be used to specify the value of \(p\) in \(\left( \sum_{i=1}^{d} |A_{j i} - A_{l i}|^p \right)^{1/p}\), the distance between rows \(j\) and \(l\) of \(A\).
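The role of \(p\) is easiest to see on a pair of points. The following sketch evaluates the Minkowski formula directly; `minkowski_sketch` is a hypothetical helper, and stats::dist from base R is used only as a reference value.

```r
# Hypothetical helper: Minkowski distance between two vectors for a given p.
minkowski_sketch <- function(a, b, p) {
  sum(abs(a - b)^p)^(1 / p)
}

A <- rbind(c(0, 0), c(3, 4))
d1 <- minkowski_sketch(A[1, ], A[2, ], p = 1)  # Manhattan: 3 + 4
d2 <- minkowski_sketch(A[1, ], A[2, ], p = 2)  # Euclidean: sqrt(9 + 16)
d3 <- minkowski_sketch(A[1, ], A[2, ], p = 3)

# Reference value from base R for the same p
d3_ref <- as.numeric(stats::dist(A, method = "minkowski", p = 3))
print(c(d1, d2, d3, d3_ref))
```

With \(p = 1\) the formula reduces to the Manhattan distance and with \(p = 2\) to the Euclidean distance.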
The fractional distance function uses the formula of the Minkowski metric to calculate the distances and allows fractional values \(p \in (0,1]\), which can be useful for high-dimensional data [Aggrawal et al., 2001].
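A pairwise version of this idea might look as follows. The sketch is an assumption about the general technique, not the package's code: `fractional_distance_sketch` is a hypothetical name, and the loop-based implementation favors readability over speed.

```r
# Hypothetical helper: pairwise Minkowski-type distances with a fractional exponent p.
fractional_distance_sketch <- function(A, p = 0.5) {
  stopifnot(p > 0, p <= 1)  # p = 0 is excluded because of the exponent 1 / p
  A <- as.matrix(A)
  n <- nrow(A)
  D <- matrix(0, n, n)
  for (j in seq_len(n)) {
    for (l in seq_len(n)) {
      D[j, l] <- sum(abs(A[j, ] - A[l, ])^p)^(1 / p)
    }
  }
  D
}

A <- rbind(c(0, 0), c(1, 1), c(4, 0))
D <- fractional_distance_sketch(A, p = 0.5)
print(D)
```

For \(p = 0.5\), the points (0, 0) and (1, 1) are as far apart as (0, 0) and (4, 0), which illustrates how strongly a fractional exponent penalizes differences spread over several variables.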
The term frequency-inverse document frequency (Tf-idf) is a statistical measure of the relevance of a term \(t\) to a document \(d\) in a collection of documents \(D\). The Tfidf-distance between two documents \(d_i, d_j \in D\) is then the absolute difference between their Tfidf-values.
An example use for bioinformatic data is the calculation of distances between genes using the Tfidf-distance, based on GO terms (Gene Ontology terms). For this, a matrix \(A\) with \(n\) genes as rows and \(m\) GO terms as columns is used, where genes are interpreted as documents and GO terms as terms [Thrun, 2022].
data(Hearingloss_N109)
V = Tfidf_dist(Hearingloss_N109$FeatureMatrix_Gene2Term, tf_fun = mean)
# Get distances
dist = V$dist
# Get weights
TfidfWeights = V$TfidfWeights
By default, the (augmented) term frequency is calculated using the mean of the non-zero entries; a different function can be specified with the argument tf_fun.
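For readers unfamiliar with Tf-idf, the general idea can be sketched on a toy gene-by-term matrix. Everything here is an assumption for illustration: the helper `tfidf_sketch` is hypothetical, it uses the textbook weighting (relative term frequency times the logarithm of the inverse document frequency) rather than the augmented term frequency described above, and it sums the absolute differences over all terms. Tfidf_dist may differ in each of these choices.

```r
# Hypothetical helper: textbook tf-idf weights for a documents x terms matrix.
# Assumes every row and every column of M has at least one non-zero entry.
tfidf_sketch <- function(M) {
  M <- as.matrix(M)
  n_docs <- nrow(M)
  tf <- M / rowSums(M)                 # relative term frequency per document (row)
  idf <- log(n_docs / colSums(M > 0))  # inverse document frequency per term (column)
  sweep(tf, 2, idf, `*`)               # weight every column by its idf
}

# Toy data: 3 genes (documents) annotated to 3 GO terms (terms)
M <- rbind(gene1 = c(1, 1, 0), gene2 = c(1, 0, 1), gene3 = c(1, 1, 0))
W <- tfidf_sketch(M)

# Assumption: absolute differences of the weights, summed over all terms
D <- as.matrix(stats::dist(W, method = "manhattan"))
print(D)
```

A term shared by all genes receives weight 0, so only the rarer terms contribute to the distance; genes with identical annotations end up at distance 0.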
[Thrun, 2021] Thrun, M. C.: The Exploitation of Distance Distributions for Clustering, International Journal of Computational Intelligence and Applications, Vol. 20(3), pp. 2150016, DOI: 10.1142/S1469026821500164, 2021.
[Thrun, 2022] Thrun, M. C.: Knowledge-based Identification of Homogenous Structures in Genes, 10th World Conference on Information Systems and Technologies (WorldCist’22), in: Rocha, A., Adeli, H., Dzemyda, G., Moreira, F. (eds) Information Systems and Technologies, Lecture Notes in Networks and Systems, Vol. 468, pp. 81-90, DOI: 10.1007/978-3-031-04826-5_9, Budva, Montenegro, 12-14 April, 2022.
[Aggrawal et al., 2001] Aggarwal, C. C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Space, 2001.