While the Transformer's self-attention (SA) is great, there are many applications where it doesn't apply. For a more comprehensive overview of attention mechanisms, I often find myself coming back to Lilian Weng's post:
In the context of attention, there is a very interesting recent paper that warns against conflating attention and token importance: "Is Attention Interpretable?" [1]. It was accepted at ACL 2019:
>When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based off this concept of directing your focus, and it pays greater attention to certain factors when processing the data.
I actually think they should rename it to 'Focus Mechanism'.
Exactly. Shannon could have called his metric a 'Logarithmic measure on probability spaces', but he chose to call it 'Information' -- even though it's not exactly what we mean when we talk about information informally; the inspiration and analogy were very important to the work (personally I'd have called S 'Average Information', but that's with hindsight).\*
As long as you clearly define and state things (definitions) and can separate the context-specific [technical] meaning from general usage, I find it's a very good strategy not only for popularization purposes but to inspire and draw some valuable intuition from our daily lives into technical matter.
Naming things in the sciences is an art :)
*: It should be obvious I don't love the usual term 'Kullback-Leibler Divergence' in place of 'Relative Information', although in this case the names are obscure and hard enough to pronounce to give it an air of rigour and nobility.
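For anyone who wants the quantities behind those names, here is a rough sketch in standard textbook notation. The symbols I, p, q and D are my choice, not anything from the comment above; S follows the comment's usage for the entropy.

```latex
% Information content of a single outcome x under distribution p
I(x) = -\log p(x)

% Shannon entropy: the average of I(x), i.e. "Average Information"
S(p) = \mathbb{E}_{x \sim p}\left[ I(x) \right] = -\sum_x p(x) \log p(x)

% Kullback-Leibler divergence of p from q, i.e. "Relative Information"
D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
```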
There is very limited vocabulary for concepts you see in Deep Learning; even anthropomorphic ones are usually badly used, but you aren't going to have many fans if you start talking about key-value weighting of intermediate layers instead of "attention".
Focus is usually directed at a single thing though, it's a narrowing of attention. Attention mechanisms can pay attention to more than one thing at a time.
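To make the "more than one thing at a time" point concrete, here is a minimal sketch of scaled dot-product attention for a single query, in plain Python. The function names and the toy vectors are made up for illustration and are not taken from any particular library or paper implementation.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product score between the query and each key.
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Softmax turns the scores into weights that are positive and sum to 1.
    weights = softmax(scores)
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Toy example: the first two keys match the query equally well.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
weights, output = attention(query, keys, values)
```

The first two weights come out equal and the third is smaller but still nonzero, so the output mixes all three values rather than selecting one. That is closer to a soft weighting than to a single point of focus.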
Would it be more accurate to use the word "importance" than "attention"? I feel like the latter is encroaching on "intention" and consciousness more than these techniques warrant.
"Importance" is a fairly overused word in DL. e.g. importance sampling and importance weighted gradients.
"Attention" works by creating inductive bias for the upstream network, which is analogous to human attention, and the word itself is much more intuitive.
Keep in mind machine learning is largely a descriptive science (modeling the behavior), whereas neuroscience is more prescriptive. So from the behavioral perspective, attention is better suited than importance.
On the other hand, the word "attention" in deep learning is often used non-intuitively, e.g. "one token attends to another token" in self-attention is hardly analogous to human cognitive processes.
Did a FAANG company patent it already? If so, can we safely assume that since such a patent is ridiculous, it should be ignored in relation to any commercial service that might use these techniques?
http://www.peterbloem.nl/blog/transformers
https://nostalgebraist.tumblr.com/post/185326092369/the-tran...
https://papers.nips.cc/paper/7181-attention-is-all-you-need....
http://jalammar.github.io/illustrated-transformer/
https://arxiv.org/pdf/1807.03819.pdf