Complete Self-Attention from Scratch

This example follows the same steps as Simple Self-Attention from Scratch, but instead of relying on the helper functions defined in the attention package, it implements everything in base R.

# encoder representations of four different words
word_1 = matrix(c(1,0,0), nrow=1)
word_2 = matrix(c(0,1,0), nrow=1)
word_3 = matrix(c(1,1,0), nrow=1)
word_4 = matrix(c(0,0,1), nrow=1)

Next, we stack the word embeddings into a single array (in this case a matrix) which we call words.

# stacking the word embeddings into a single array
words = rbind(word_1,
              word_2,
              word_3,
              word_4)

print(words)
#>      [,1] [,2] [,3]
#> [1,]    1    0    0
#> [2,]    0    1    0
#> [3,]    1    1    0
#> [4,]    0    0    1

# initializing the weight matrices (with random integer values 0, 1 or 2)
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)),nrow=3,ncol=3)

Next, we generate the Queries (Q), Keys (K), and Values (V) by multiplying the words matrix with each weight matrix. The %*% operator performs matrix multiplication; see help('%*%') or the online manual An Introduction to R for details.

# generating the queries, keys and values
Q = words %*% W_Q
K = words %*% W_K
V = words %*% W_V
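
Because every row of words is one word, row i of Q is just word i multiplied by W_Q. The following is a minimal sketch that checks this for the third word; it rebuilds words and W_Q exactly as above so that it runs on its own, and the two checks hold for any choice of W_Q.

```r
# same setup as above, repeated so the snippet is self-contained
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
Q = words %*% W_Q

# row 3 of Q equals the third word multiplied by W_Q
word_3 = matrix(c(1,1,0), nrow=1)
print(all.equal(Q[3, , drop=FALSE], word_3 %*% W_Q))

# word_3 is word_1 + word_2, so its query is the sum of their queries
print(all.equal(Q[3,], Q[1,] + Q[2,]))
```

Both checks should print TRUE, since matrix multiplication is linear in the rows of words.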

Following this, we score the Query (Q) vectors against the Key (K) vectors. The key matrix is transposed for the multiplication using t(); see help('t') for more information.

# scoring the query vectors against all key vectors
scores = Q %*% t(K)
print(scores)
#>      [,1] [,2] [,3] [,4]
#> [1,]    6    4   10    5
#> [2,]    4    6   10    6
#> [3,]   10   10   20   11
#> [4,]    3    1    4    2
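
To make the scoring step concrete, the sketch below checks that the entry in row i and column j of scores is the dot product of query i with key j. It repeats the setup from above so it can be run in isolation; the index names i and j are chosen here for illustration only.

```r
# same setup as above, repeated so the snippet is self-contained
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
Q = words %*% W_Q
K = words %*% W_K
scores = Q %*% t(K)

# entry [i, j] is the dot product of query i and key j
i = 1
j = 3
print(sum(Q[i,] * K[j,]) == scores[i, j])

# one row per query, one column per key
print(dim(scores))
```

With four words, scores is a 4 x 4 matrix: each word is scored against every word, including itself.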

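We now calculate the maximum value of each row and keep the result as a matrix with 4 rows and a single column that holds each row's maximum. These maxima are subtracted from the scores in the softmax step further down, which keeps the exponentials from overflowing without changing the resulting weights.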

# calculate the max for each row of the scores matrix
maxs = as.matrix(apply(scores, MARGIN=1, FUN=max))
print(maxs)
#>      [,1]
#> [1,]   10
#> [2,]   10
#> [3,]   20
#> [4,]    4

As you can see, the value for each row in maxs is the maximum value of the corresponding row in scores.
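
The row maxima matter because of how exp() behaves for large inputs. The sketch below uses a hypothetical row of very large scores (x is not taken from the example above) to show that a naive softmax breaks down, while subtracting the maximum first gives a usable result. For moderate values such as the first row of scores, both versions agree.

```r
# a hypothetical row of very large scores (not from the example above)
x = c(1000, 1001, 1002)

# naive softmax: exp() overflows to Inf, and Inf / Inf gives NaN
naive = exp(x) / sum(exp(x))
print(naive)

# subtracting the maximum first keeps every exponent at or below 0
stable = exp(x - max(x)) / sum(exp(x - max(x)))
print(stable)

# for moderate inputs (first row of scores) both versions agree
y = c(6, 4, 10, 5)
print(all.equal(exp(y) / sum(exp(y)),
                exp(y - max(y)) / sum(exp(y - max(y)))))
```

Subtracting a constant from every element of a row multiplies the numerator and the denominator of the softmax by the same factor, so the weights are unchanged.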

# initialize weights matrix
weights = matrix(0, nrow=nrow(scores), ncol=ncol(scores))

# computing the weights by a softmax operation: subtract the row maximum
# (for numerical stability), scale by sqrt(ncol(K)), then normalize each row
for (i in 1:nrow(scores)) {
  exps = exp((scores[i,] - maxs[i,]) / sqrt(ncol(K)))
  weights[i,] = exps / sum(exps)
}

print(weights)
#>             [,1]        [,2]      [,3]        [,4]
#> [1,] 0.083717538 0.026383741 0.8429010 0.046997679
#> [2,] 0.025449248 0.080752324 0.8130461 0.080752324
#> [3,] 0.003072728 0.003072728 0.9883811 0.005473487
#> [4,] 0.273384789 0.086157735 0.4869837 0.153473823
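
The same weights can also be computed without an explicit loop. The following is one possible vectorized sketch, not the approach used in this example: it hard-codes the scores matrix printed above, and the names d_k, exps_mat and weights_vec are introduced here for illustration.

```r
# scores matrix as printed above
scores = matrix(c( 6,  4, 10,  5,
                   4,  6, 10,  6,
                  10, 10, 20, 11,
                   3,  1,  4,  2), nrow=4, byrow=TRUE)
d_k = 3  # number of columns of K in the example

# subtract each row's maximum (sweep defaults to "-"), scale, exponentiate
row_max = apply(scores, MARGIN=1, FUN=max)
exps_mat = exp(sweep(scores, MARGIN=1, STATS=row_max) / sqrt(d_k))

# divide every row by its sum
weights_vec = exps_mat / rowSums(exps_mat)

print(round(weights_vec, 4))
print(rowSums(weights_vec))
```

Every row of the result should sum to 1, which is what lets us read each row as a set of attention weights over the four words.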

Finally, we compute the attention as a weighted sum of the value vectors (which are combined in the matrix V).

# computing the attention by a weighted sum of the value vectors
attention = weights %*% V

print(attention)
#>          [,1]     [,2]        [,3]
#> [1,] 2.816517 1.900235 0.046997679
#> [2,] 2.732294 1.757743 0.080752324
#> [3,] 2.985308 1.988381 0.005473487
#> [4,] 2.400826 1.674211 0.153473823
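
For reuse, the steps above can be collected into a single function. This is a minimal sketch under the same assumptions as the walkthrough; self_attention is a hypothetical helper name and is not part of the attention package.

```r
# hypothetical helper (not part of the attention package)
self_attention = function(words, W_Q, W_K, W_V) {
  # queries, keys and values
  Q = words %*% W_Q
  K = words %*% W_K
  V = words %*% W_V

  # score every query against every key
  scores = Q %*% t(K)

  # row-wise softmax: subtract row maxima, scale by sqrt(ncol(K)), normalize
  exps = exp((scores - apply(scores, 1, max)) / sqrt(ncol(K)))
  weights = exps / rowSums(exps)

  # weighted sum of the value vectors
  weights %*% V
}

# usage with the same inputs as above
words = rbind(c(1,0,0), c(0,1,0), c(1,1,0), c(0,0,1))
set.seed(0)
W_Q = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_K = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
W_V = matrix(floor(runif(9, min=0, max=3)), nrow=3, ncol=3)
print(self_attention(words, W_Q, W_K, W_V))
```

The function returns one row per word and one column per column of W_V, matching the shape of the attention matrix printed above.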