Attention Mechanism in Deep Learning

kuu · on Sept 16, 2019

In case you find Attention (and especially transformers) interesting, I have some saved links introducing the topic:

http://www.peterbloem.nl/blog/transformers

https://nostalgebraist.tumblr.com/post/185326092369/the-tran...

https://papers.nips.cc/paper/7181-attention-is-all-you-need....

http://jalammar.github.io/illustrated-transformer/

https://arxiv.org/pdf/1807.03819.pdf

BiasRegularizer · on Sept 16, 2019

While the Transformer's Self-Attention (SA) is great, there are many applications where SA doesn't apply. For a more comprehensive overview of attention mechanisms, I often find myself coming back to Lilian Weng's post:

https://lilianweng.github.io/lil-log/2018/06/24/attention-at...

abhgh · on Sept 16, 2019

In the context of attention, there is a very interesting recent paper that warns against conflating attention and token importance - "Is Attention Interpretable?" [1]. This is an accepted paper in ACL-2019:

[1] https://www.aclweb.org/anthology/P19-1282

stochastic_monk · on Sept 16, 2019

See also Attention is Not Explanation [0].

[0] https://www.aclweb.org/anthology/N19-1357

physicsyogi · on Sept 17, 2019

There's a rebuttal to this as well: Attention is not not Explanation. https://arxiv.org/abs/1908.04626

abhgh · on Sept 16, 2019

Thanks!

thereyougo · on Sept 16, 2019

>When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based off this concept of directing your focus, and it pays greater attention to certain factors when processing the data.

I actually think they should rename it to 'Focus Mechanism'

codesushi42 · on Sept 16, 2019

I disagree.

"Attention" is all you need.

elcomet · on Sept 16, 2019

Why do you think Focus Mechanism is more appropriate than Attention?

nerdponx · on Sept 16, 2019

Less anthropomorphism in machine learning is good IMO.

bitL · on Sept 16, 2019

We are literally talking about intelligence, for which the best model in nature is humans. It's difficult not to be anthropomorphic in general.

darkmighty · on Sept 16, 2019

Exactly. Shannon could have called his metric a 'Logarithmic measure on probability spaces', but he chose to call it 'Information', even though it's not exactly what we mean when we talk about information informally; the inspiration and analogy were very important to the work (personally I'd have called S 'Average Information', but that's with hindsight).*

As long as you clearly define and state things (definitions) and can separate the context-specific [technical] meaning from general usage, I find it's a very good strategy not only for popularization purposes but to inspire and draw some valuable intuition from our daily lives into technical matter.

Naming things in the sciences is an art :)

*: It should be obvious I don't love the usual term 'Kullback-Leibler Divergence' in place of 'Relative Information', although in this case the names are obscure and difficult enough to pronounce to give it an air of rigour and nobility.
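
For readers who want the two quantities named above in concrete form, here is a minimal sketch using only Python's standard library. The function names `entropy` and `kl_divergence` and the coin distributions are illustrative choices, not taken from any source in this thread, and the code assumes discrete distributions given as lists of probabilities.

```python
import math

def entropy(p):
    """Shannon entropy in bits: the average of -log2 p(x) under p."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    Assumes q(x) > 0 wherever p(x) > 0; otherwise the value is infinite
    and this sketch would raise a ZeroDivisionError.
    """
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical example distributions: a fair coin and a biased coin.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(entropy(fair))                # 1 bit for a fair coin
print(entropy(biased))              # less than 1 bit
print(kl_divergence(biased, fair))  # extra bits paid for assuming "fair"
```

Against a uniform reference over two outcomes, the divergence equals one bit minus the entropy of `biased`, which is one way to read it as "relative" information.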

nerdponx · on Sept 16, 2019

The article is about deep learning, not AI.

bitL · on Sept 16, 2019

There is very limited vocabulary for concepts you see in Deep Learning; even anthropomorphic ones are usually badly used, but you aren't going to have many fans if you start talking about key-value weighting of intermediate layers instead of "attention".

JoeSamoa · on Sept 17, 2019

Thank you for defending this point so rigorously. I agree.

joe_the_user · on Sept 16, 2019

Well, attention implies something that can shift rapidly, that might have some meaning and that can lead to action.

Focus is broadly like attention but without those implications.

I don't know enough to say which one is more appropriate here but that's how the terms seem to "color" the concepts.

physicsyogi · on Sept 17, 2019

Focus is usually directed at a single thing though, it's a narrowing of attention. Attention mechanisms can pay attention to more than one thing at a time.
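
A minimal sketch of that point, assuming the scaled dot-product form of attention from "Attention Is All You Need" and made-up toy vectors: the softmax yields a weight for every item, so a query that resembles two items splits its weight between them rather than picking one. The names `softmax` and `attend` are illustrative, not from any library.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted average of all value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy data: the query matches the first two keys equally.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
print([round(w, 3) for w in weights])  # first two weights equal, third smaller
print(output)
```

Note that the third item still receives a nonzero weight; in this formulation nothing is ever fully ignored, only down-weighted.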

phkahler · on Sept 16, 2019

Would it be more accurate to use the word "importance" than "attention"? I feel like the latter is encroaching on "intention" and consciousness more than these techniques warrant.

BiasRegularizer · on Sept 16, 2019

"Importance" is a fairly overused word in DL, e.g. importance sampling and importance weighted gradients.

"Attention" works by creating inductive bias for the upstream network, which is analogous to human attention, and the word itself is much more intuitive.

Keep in mind machine learning is largely a descriptive science (modeling the behavior), whereas neuroscience is more prescriptive. So from the behavioral perspective, attention is better suited than importance.

msamwald · on Sept 16, 2019

On the other hand the word "attention" in deep learning is often used non-intuitively, e.g. "one token attends to another token" in self attention is hardly analogous to human cognitive processes.
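
To show what "one token attends to another token" amounts to mechanically, here is a rough sketch of self-attention in plain Python. It makes a loud simplifying assumption: there are no learned query/key/value projections, so each token vector plays all three roles directly. The embeddings and the name `self_attention` are made up for illustration.

```python
import math

def self_attention(tokens):
    """Self-attention without learned projections (simplifying assumption):
    every token vector is used as query, key and value at once."""
    scale = math.sqrt(len(tokens[0]))
    weights, outputs = [], []
    for q in tokens:
        # Score this token against every token in the sequence, itself included.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        # New representation: weighted average of all token vectors.
        outputs.append([sum(w * v[i] for w, v in zip(row, tokens))
                        for i in range(len(q))])
    return weights, outputs

# Hypothetical 2-d embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
weights, outputs = self_attention(tokens)

# Row i holds how much token i "attends to" each token j.
for row in weights:
    print([round(w, 2) for w in row])
```

Seen this way, "attends to" is just a row of a normalized similarity matrix, which is arguably part of why the cognitive analogy feels strained.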

abakus · on Sept 16, 2019

For a very easy-to-understand explanation of the transformer and attention mechanism, see:

https://blue-season.github.io/transformer-in-5-minutes/

eanzenberg · on Sept 16, 2019

Is there any work on attention for CNNs or other computer vision algorithms?

lucidrains · on Sept 16, 2019

https://arxiv.org/abs/1904.09925

ilaksh · on Sept 16, 2019

Did a FAANG company patent it already? If so, can we safely assume that since such a patent is ridiculous, it should be ignored in relation to any commercial service that might use these techniques?