
[Paper Review] ๐Ÿฆฉ Flamingo: a Visual Language Model for Few-Shot Learning

Jean-Baptiste Alayrac et al.

NeurIPS 2022

[arXiv]

A Vision-Language Model developed by Google DeepMind.

Background

Multimodal learning์€ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ๋™์‹œ์— ์ดํ•ดํ•˜์—ฌ VQA(Vision Question Anwsering), Image Captioning ๋“ฑ์˜ task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค.

๊ธฐ์กด์˜ VLM๋“ค์€ ์ฃผ๋กœ ๋Œ€๋Ÿ‰์˜ Image-Text ๋ฐ์ดํ„ฐ๋กœ pretrainํ•œ ํ›„, ๊ฐ downstream task๋ณ„๋กœ fine-tuningํ•˜๋Š” Supervised Learning์„ ์‚ฌ์šฉํ•œ๋‹ค.

โ†’ task๋งˆ๋‹ค ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๊ณ  ๋ชจ๋ธ์„ ์žฌํ•™์Šตํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์—์„œ ํ™•์žฅ์„ฑ์˜ ํ•œ๊ณ„

์˜ˆ๋ฅผ ๋“ค์–ด, CLIP๊ณผ ๊ฐ™์€ contrastive learning ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค์€ ์›น์œผ๋กœ๋ถ€ํ„ฐ ๋Œ€๊ทœ๋ชจ Image-Text๋ฅผ ํ•™์Šตํ•˜์—ฌ zero-shot Classification์„ ๋ณด์—ฌ์ฃผ์—ˆ์ง€๋งŒ, ์ถœ๋ ฅ์ด Image-Text similarity score ํ˜•ํƒœ์— ํ•œ์ •๋˜์–ด ์žˆ์–ด Generation์ด ํ•„์š”ํ•œ ๊ฐœ๋ฐฉํ˜• ์งˆ๋ฌธ(์˜ˆ: captioning, Q&A)์—๋Š” ๊ทธ๋Œ€๋กœ ์ ์šฉํ•˜๊ธฐ ์–ด๋ ต๋‹ค.

ViLT ๋“ฑ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ํ•จ๊ป˜ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ Transformer๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ์ƒ์„ฑํ˜• VLM์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค.

  • ํฌ๊ธฐ๊ฐ€ ์ œํ•œ์ 
  • ์†Œ๋Ÿ‰์˜ ์˜ˆ์‹œ๋งŒ์œผ๋กœ ์ƒˆ๋กœ์šด task์— generalization ๋ถˆ๊ฐ€

์š”์•ฝํ•˜๋ฉด, ๊ธฐ์กด VLM๋“ค์€ ๊ฑฐ๋Œ€ํ•˜์ง€๋งŒ ๋น„์ƒ์„ฑ์  ๋ชจ๋ธ(์˜ˆ: CLIP) ๋˜๋Š” ์ƒ์„ฑ์€ ๊ฐ€๋Šฅํ•ด๋„ ํƒœ์Šคํฌ ์ ์‘์„ ์œ„ํ•ด ์ถ”๊ฐ€ ํ•™์Šต์ด ํ•„์š”ํ•œ ๋ชจ๋ธ(์˜ˆ: ViLT ๋“ฑ)๋กœ ๊ตฌ๋ถ„๋˜๋ฉฐ, ์‚ฌ๋žŒ์ฒ˜๋Ÿผ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋งŒ ๋ณด๊ณ ๋„ ์ƒˆ๋กœ์šด ์‹œ๊ฐ ์–ธ์–ด task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ๋ฒ”์šฉ ๋ชจ๋ธ์€ ๋ถ€์žฌํ–ˆ๋‹ค.

Contrastive Learning

\[L_{\text{contrastive:txt2im}} = -\frac1N\sum_{i=1}^N \log\frac{\exp\bigl(L_i^\top V_i/\tau\bigr)} {\sum_{j=1}^N\exp\bigl(L_i^\top V_j/\tau\bigr)}\]

\[L_{\text{contrastive:im2txt}} = -\frac1N\sum_{i=1}^N \log\frac{\exp\bigl(V_i^\top L_i/\tau\bigr)} {\sum_{j=1}^N\exp\bigl(V_i^\top L_j/\tau\bigr)}\]
  • The numerator term $\exp(L_i^\top V_i/\tau)$ raises the similarity of the positive pair
  • The denominator term $\sum_j \exp(L_i^\top V_j/\tau)$ relatively lowers the similarity to the negative pairs

A loss that minimizes cross-entropy so that only the correct (paired) embedding is picked out, while pushing all other embeddings farther away.
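As a toy illustration of this loss (a minimal pure-Python sketch, not the CLIP/ALIGN implementation; the 2-D embeddings and the `contrastive_loss` helper are made up for the demo):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(text_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE loss over a batch of N (text, image) pairs.

    text_emb, img_emb: lists of N unit-norm vectors; pair i is (text_emb[i], img_emb[i]).
    Returns the average of the text->image and image->text losses.
    """
    n = len(text_emb)
    t2i = 0.0
    i2t = 0.0
    for i in range(n):
        # text -> image: positive is img_emb[i], negatives are the other images
        logits = [dot(text_emb[i], img_emb[j]) / tau for j in range(n)]
        t2i += -(logits[i] - math.log(sum(math.exp(l) for l in logits)))
        # image -> text: positive is text_emb[i], negatives are the other texts
        logits = [dot(img_emb[i], text_emb[j]) / tau for j in range(n)]
        i2t += -(logits[i] - math.log(sum(math.exp(l) for l in logits)))
    return (t2i + i2t) / (2 * n)

# Aligned pairs score a much lower loss than shuffled (mismatched) pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(texts, images)
shuffled = contrastive_loss(texts, images[::-1])
```

Running it, `aligned` comes out near zero while `shuffled` is large, which is exactly the pull-positives/push-negatives behavior described above.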

Introduction

Flamingo is a VLM specialized for few-shot learning, designed to solve new multimodal tasks from only a few examples (text-image pairs). It takes a sequence in which images/videos and text are interleaved and generates the text that follows, so a single model performs VQA, image captioning, video description, and more.

Flamingo's core idea is to connect strong pretrained models: a Vision encoder and an LLM.

To combine the two effectively, Flamingo introduces new architectural components:

  1. A module connecting the pretrained Vision model and the LLM
  2. A mechanism for processing sequences in which images and text are arbitrarily interleaved
  3. Support for video input, not just images

Thanks to this design, Flamingo could be pretrained multimodally on large-scale mixed image/text data collected from the internet.

As a result, it showed outstanding few-shot performance on diverse Vision-Language tasks without any fine-tuning.

Flamingo combines the general-purpose understanding of an LLM with the perception of a Vision model, yielding a general VLM that can learn new multimodal tasks from only a few examples.

fig2

  • Left: performance of Flamingo versus the previous SOTA across several datasets

  • Right: performance as a function of Flamingo model size and number of shots

Goal

Build a general-purpose VLM that can perform new Vision-Language tasks from just a few shots, without additional training.

Motivation

๊ธฐ์กด VLM๋“ค์€

  • Contrastive learning(CLIP)
    • zero-shot classification์€ ๊ฐ€๋Šฅ, text generation ๋Šฅ๋ ฅ ๋ถ€์žฌ
  • Generative ๋ชจ๋ธ(ViLT)
    • text generation์€ ๊ฐ€๋Šฅ, task๋ณ„ fine-tuning ํ•„์ˆ˜

๋‘ ๋ฐฉ์‹ ๋ชจ๋‘ ์ƒˆ๋กœ์šด task๋งˆ๋‹ค ๋Œ€๋Ÿ‰์˜ labeled ๋ฐ์ดํ„ฐ์™€ ๋ชจ๋ธ ์žฌํ•™์Šต์ด ํ•„์š”ํ•˜๋ฏ€๋กœ, โ€œ์‚ฌ๋žŒ์ฒ˜๋Ÿผ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋งŒ ๋ณด๊ณ ๋„ ์ฆ‰์‹œ ์ ์‘โ€ํ•˜๋Š” Few-Shot ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ•™์Šต์„ ๊ตฌํ˜„ํ•  ํ•„์š”๊ฐ€ ์žˆ์—ˆ๋‹ค.

Contributions

  1. The Flamingo architecture

    • Keeps a strong pretrained Vision model (NFNet-F6) and Language model (Chinchilla) frozen

    • Inserts a Perceiver Resampler and gated cross-attention-dense (XATTN-Dense) layers between the two models

      Processes sequences in which images/videos and text are arbitrarily interleaved, and generates free-form text

  2. In-context few-shot learning

    • Without any fine-tuning, providing 4~32 examples (image-text pairs) in the prompt lets the model adapt immediately to a new task
      • Achieved new zero/few-shot SOTA on 16 benchmarks including VQA, captioning, and Video QA
      • On 6 tasks it even surpassed fine-tuned SOTA

      In-context learning

      • Having a model perform a new task without additional fine-tuning, by showing it only an input prompt (task description + examples)

      Few-shot learning

      • The ability to learn or adapt to a new task from only a small number of labeled examples
      • Typically 1~32 examples (shots) are included in the prompt so the model can pick up the pattern.
  3. An efficient multimodal pretraining strategy

    • Uses unlabeled data
      • M3W, a large "image-text interleaved" corpus extracted from web pages
      • Billions of image-text and video-text pairs
    • Training on this mixture gives the model next-token prediction ability in multimodal contexts

Key technical innovations

  • Perceiver Resampler: compresses variable-length vision features into 64 tokens
  • Gated XATTN-Dense: inserts cross-attention layers stabilized by a tanh gate
  • Per-image attention masking: restricts each text token to attend only to the tokens of the immediately preceding image

Method

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2025-07-29 แ„‹แ…ฉแ„’แ…ฎ 4.42.18

Flamingo๋Š” ์ด๋ฏธ์ง€/๋น„๋””์˜ค์™€ ํ…์ŠคํŠธ๊ฐ€ ๊ต์ฐจ๋œ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ Autoregressive text generation์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ชจ๋ธ์ด๋‹ค.

  1. (frozen) Vision encoder๊ฐ€ ํ”ฝ์…€ ์ด๋ฏธ์ง€๋ฅผ ๊ณ ์ฐจ์› feature์œผ๋กœ ๋ณ€ํ™˜
  2. Perceiver Resampler๊ฐ€ ์ด ๊ฐ€๋ณ€ ๊ธธ์ด์˜ visoin feature๋“ค์„ ๊ณ ์ • ๊ธธ์ด์˜ token์œผ๋กœ ์š”์•ฝ
  3. (frozen) LLM ๋‚ด๋ถ€์— cross-attention layer๋“ค์„ ์‚ฝ์ž…
    • ์ด๋ฏธ์ง€/๋น„๋””์˜ค๋กœ๋ถ€ํ„ฐ ์–ป์€ ์ •๋ณด๋ฅผ ํ…์ŠคํŠธ ์ƒ์„ฑ์— ํ™œ์šฉ

Perceiver Resampler์™€ Cross-Attention ๋ชจ๋“ˆ๋งŒ ํ•™์Šต

Flamingo๋Š” ์‚ฝ์ž…๋œ ์ด๋ฏธ์ง€ ๋ฐ ๋น„๋””์˜ค $๐‘ฅ$์— ์กฐ๊ฑด๋ถ€๋กœ ํ…์ŠคํŠธ $๐‘ฆ$์˜ ํ™•๋ฅ ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ชจ๋ธ๋งํ•œ๋‹ค.

\[p(y|x) = \prod^{L}_{\ell = 1} p(y_\ell | y_{<\ell}, x_{ \leq \ell})\]
  • $y_\ell$: ์ž…๋ ฅ text์˜ $\ell$ ๋ฒˆ์งธ token
  • $y<\ell$: ์ด์ „ text token ์ง‘ํ•ฉ
  • $x \leq \ell$: ์ด์ „์— ์œ„์น˜ํ•œ ์ด๋ฏธ์ง€/๋น„๋””์˜ค ์ง‘ํ•ฉ

์ด๋ฅผ ํ†ตํ•ด ์›๋ž˜ ์‚ฌ์ „ํ•™์Šต ๋ชจ๋ธ๋“ค์ด ์ง€๋‹Œ ์ง€์‹๊ณผ ๋Šฅ๋ ฅ์„ ์ตœ๋Œ€ํ•œ ๋ณด์กดํ•˜๋ฉด์„œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒํ•œ๋‹ค.
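The factorization above can be made concrete with a toy stand-in for the model (the names `sequence_log_prob` and `step_log_prob` are hypothetical; in the real model, each step's distribution comes from the frozen LM conditioned on the visual tokens through cross-attention):

```python
import math

def sequence_log_prob(tokens, images, step_log_prob):
    """Log of p(y | x) = sum_l log p(y_l | y_<l, x_<=l).

    tokens: the text tokens y_1..y_L
    images: images[l] is the list of images appearing before token position l
    step_log_prob(token, prefix, imgs): toy stand-in for the model's per-step distribution
    """
    total = 0.0
    for l, tok in enumerate(tokens):
        total += step_log_prob(tok, tokens[:l], images[l])
    return total

# Toy model: uniform over a 4-token vocabulary, so each step contributes log(1/4).
toy = lambda tok, prefix, imgs: math.log(0.25)
lp = sequence_log_prob(["a", "cat", "sits"], [["img1"]] * 3, toy)
```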

Visual processing

Vision Encoder : NFNet-F6

The Vision encoder is the F6 model from the Normalizer-Free ResNet (NFNet) family.

A strong image feature extractor with high performance on ImageNet and other benchmarks.

The NFNet-F6 encoder is pretrained, before Flamingo training, with contrastive learning on large-scale image-text pair data, and is used frozen.

This is similar to CLIP; during Flamingo training it only supplies image/video features.

In that separate pretraining it is contrasted against a BERT text encoder → bidirectional context matters when distilling a whole sentence into one fixed-length vector.

This fully decouples "tuning the Vision encoder contrastively to obtain good visual embeddings" from "injecting those embeddings into the Chinchilla-based generative model later".

Flamingo takes the output of NFNet-F6's final layer as its feature map.

Specifically,

  • Images
    • NFNet produces an $H\times W$ 2D feature map, which is flattened → a single 1D token sequence
  • Videos
    • Frames are sampled at 1 frame per second and each is encoded with NFNet
    • This yields a 3D feature map including the time axis, to which learned time embeddings are added
    • The spatio-temporal features of all frames are flattened into one 1D sequence

The resulting 1D sequence is passed to the Perceiver Resampler.
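The video path above (sample frames, add a time embedding, flatten space and time) can be sketched as follows; the shapes are tiny for illustration and `video_to_token_sequence` is a hypothetical helper, not the paper's code:

```python
def video_to_token_sequence(frame_features, time_embeddings):
    """Flatten per-frame spatial features into one 1D token sequence.

    frame_features: [T][S][d] - S spatial features of dim d for each of T sampled frames
    time_embeddings: [T][d]   - learned embedding added to every feature of frame t
    Returns a list of T*S d-dimensional tokens, ready for the Perceiver Resampler.
    """
    tokens = []
    for t, frame in enumerate(frame_features):
        for feat in frame:
            tokens.append([f + e for f, e in zip(feat, time_embeddings[t])])
    return tokens

feats = [[[1.0, 2.0], [3.0, 4.0]],    # frame 0: 2 spatial positions, d=2
         [[5.0, 6.0], [7.0, 8.0]]]    # frame 1
temb = [[10.0, 10.0], [20.0, 20.0]]   # one learned embedding per frame
seq = video_to_token_sequence(feats, temb)
# seq has 2*2 = 4 tokens; frame index is recoverable only via the added embedding
```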

Perceiver Resampler

It summarizes the large, variable-length feature sequence produced by the Vision encoder into a fixed length.

  • Input: visual features extracted by NFNet
  • Output: 64 visual tokens

Whatever the size of the image or video, the Perceiver Resampler compresses its information into 64 tokens and passes them to the language model.

This greatly reduces the cost of cross-attention between the LM and the visual features.

Structurally, the Perceiver Resampler is roughly a single Transformer encoder block.

fig5

It prepares 64 learnable latent query vectors and cross-attends them to the visual features.

Put simply, its job is to pick out the most important information among hundreds of image features and pack it into 64 vectors.

The authors report that this dedicated Resampler module outperformed simply shrinking the flattened features with a plain MLP or Transformer.

As a result, the 64 tokens compressed by the Resampler serve as the visual context the language model consults thereafter.

Conditioning frozen LM on visual representations

Flamingo์˜ ์–ธ์–ด ์ดํ•ด ๋ฐ ์ƒ์„ฑ ๋Šฅ๋ ฅ์€ DeepMind๊ฐ€ ๊ฐœ๋ฐœํ•œ LLM์ธ Chinchilla๋กœ๋ถ€ํ„ฐ ๋‚˜์˜จ๋‹ค.

Flamingo์—์„œ๋Š” Chinchilla ๋ชจ๋ธ์„ ํฌ๊ธฐ๋ณ„๋กœ 3๊ฐ€์ง€ ์‚ฌ์šฉํ•œ๋‹ค:

  • Flamingo-3B
  • Flamingo-9B
  • Flamingo-80B

LM ์—ญ์‹œ ํ•™์Šต ์‹œ frozen๋˜์–ด, ์›๋ž˜์˜ ์–ธ์–ด ์ง€์‹์„ ๊ทธ๋Œ€๋กœ ๊ฐ„์งํ•œ ์ฑ„ ์‚ฌ์šฉ๋œ๋‹ค.

๋Œ€์‹  Flamingo๋Š” ์ด LM ๋‚ด๋ถ€์— ์‹œ๊ฐ ์ •๋ณด๋ฅผ ๋ผ์›Œ๋„ฃ์„ ์ˆ˜ ์žˆ๋Š” ์ƒˆ๋กœ์šด layer๋“ค์„ ์ถ”๊ฐ€ํ•œ๋‹ค.

Gated XATTN-Dense layers

Flamingo์—์„œ๋Š” learnableํ•œ cross-attention ๋ธ”๋ก๋“ค์„ pretrained LM์˜ ์ค‘๊ฐ„์— ์‚ฝ์ž…ํ•˜์—ฌ, LM์ด ์ƒ์„ฑ ๊ณผ์ •์—์„œ Visual token์— ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์ด๋„๋ก ํ•œ๋‹ค.

fig4

์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด Cross Attention layer์™€ feed-forward layer๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, tanh ๊ฒŒ์ดํŠธ๊ฐ€ ๊ณฑํ•ด์ง„๋‹ค.

  • Cross Attention layer
    • Query : LM์˜ ์ค‘๊ฐ„ hidden state
    • Key, Value : 64๊ฐœ visual token
    • ์–ธ์–ด ๋ชจ๋ธ์€ ํ˜„์žฌ๊นŒ์ง€ ์ƒ์„ฑ๋œ ํ…์ŠคํŠธ ๋งฅ๋ฝ์— ๋งž์ถ”์–ด ์‹œ๊ฐ ์ •๋ณด์— ์งˆ์˜(query)๋ฅผ ๋ณด๋‚ด ํ•„์š”ํ•œ ๋‚ด์šฉ์„ ์–ป์–ด์˜ฌ ์ˆ˜ ์žˆ๋‹ค.
  • Feed-Forward(Dense) layer :์‹œ๊ฐ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•œ ํ‘œํ˜„์„ ๊ฐ ์œ„์น˜๋ณ„๋กœ ๋ณ€ํ™˜

์ด๋ ‡๊ฒŒ ์ถ”๊ฐ€๋œ ๋ ˆ์ด์–ด๋“ค์˜ ์ถœ๋ ฅ์€ tanh ๊ฒŒ์ดํŠธ๋ฅผ ํ†ตํ•ด ์Šค์ผ€์ผ์ด ์กฐ์ •๋œ ํ›„, ์›๋ž˜ ์–ธ์–ด ๋ชจ๋ธ์˜ ๋ ˆ์ด์–ด ์ถœ๋ ฅ๊ณผ ํ•ฉ์ณ์ง„๋‹ค.

gate์—์„œ๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ์Šค์นผ๋ผ $\alpha$๋ฅผ ํ†ตํ•ด $\tanh(\alpha)$๋งŒํผ ์Šค์ผ€์ผ์„ ์กฐ์ •ํ•œ๋‹ค.

์ฒ˜์Œ์—” $\alpha=0$์œผ๋กœ ์„ค์ •ํ•˜์—ฌ $\tanh(0)=0$, ์ฆ‰ ํ•™์Šต ์ดˆ๋ฐ˜์—๋Š” ์ƒˆ๋กœ์šด ๋ ˆ์ด์–ด๊ฐ€ ์–ธ์–ด ๋ชจ๋ธ์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š”๋‹ค.

์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ $\alpha$๊ฐ€ ํ•™์Šต๋˜๋ฉด์„œ gate๊ฐ€ ์—ด๋ ค ์‹œ๊ฐ ์ •๋ณด๊ฐ€ ์ ์ง„์ ์œผ๋กœ ํ†ตํ•ฉ๋œ๋‹ค.

f6

์œ„ ์ด๋ฏธ์ง€๋Š” tahn ๊ฒŒ์ดํŠธ์˜ ๊ฐ’์ด ํ•™์Šต ๊ณผ์ •์—์„œ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ๋‚˜ํƒ€๋‚ธ ๊ทธ๋ž˜ํ”„์ด๋‹ค.

Code
def gated_xattn_dense(y, x, alpha_xattn, alpha_dense):
    # y: intermediate hidden states of the language model (queries)
    # x: visual tokens (keys/values) - output of the Perceiver Resampler
    # alpha_xattn, alpha_dense: learnable scalars (gate parameters)

    # 1) Cross-attention with tanh gating
    y = y + tanh(alpha_xattn) * Attention(q=y, kv=x)

    # 2) Feed-forward (dense) layer with tanh gating
    y = y + tanh(alpha_dense) * FeedForward(y)

    # 3) The original LM's self-attention + FFN (frozen parameters)
    y = y + FrozenSelfAttention(q=y, kv=y)
    y = y + FrozenFeedForward(y)

    return y
  • Attention(q=y, kv=x): the queries y cross-attend to the visual tokens x
  • FeedForward(y): a standard position-wise Transformer FFN (dense)
  • FrozenSelfAttention and FrozenFeedForward reuse the pretrained language-model parameters as-is; they remain frozen and are not trained.

To balance efficiency against expressiveness, these blocks are inserted only once every few layers.

This injects sufficient visual information into the language model without inflating the parameter count and compute.

Multi-visual input support

Per-image attention masking lets Flamingo handle conversations in which several images/videos are interleaved.

Even when text and images alternate repeatedly, as in {image A - question X - image B - question Y}, the model can follow the context and answer consistently.

The idea is to restrict, at the cross-attention stage, which visual tokens each text token is allowed to see.

Concretely, Flamingo applies an attention mask so that a text token at a given position attends only to the visual tokens coming from the single image that immediately precedes it.

For example, if the prompt is <Image1><Question1><Image2><Question2>, then while generating the answer to <Question2>, attention is applied only to the Image2 tokens.

This way the model focuses on the image relevant to each question, and the contexts of multiple images do not get mixed up.
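A sketch of how such a mask might be built (a hypothetical helper for illustration; a real implementation applies the mask inside the attention computation):

```python
def build_cross_attention_mask(image_positions, text_len, tokens_per_image):
    """mask[t][v] == True iff text token t may attend to visual token v.

    Each text token attends only to the tokens of the single most recent image
    that appeared before it (Flamingo's per-image masking); earlier images are
    reachable only indirectly, through the LM's self-attention.
    image_positions: text index at which each image is inserted, in order.
    """
    n_img = len(image_positions)
    mask = [[False] * (n_img * tokens_per_image) for _ in range(text_len)]
    for t in range(text_len):
        # index of the last image appearing at or before text position t
        last = -1
        for i, pos in enumerate(image_positions):
            if pos <= t:
                last = i
        if last >= 0:
            for v in range(last * tokens_per_image, (last + 1) * tokens_per_image):
                mask[t][v] = True
    return mask

# <Image1> Q1 (text tokens 0-2) <Image2> Q2 (text tokens 3-5), 2 visual tokens each
mask = build_cross_attention_mask(image_positions=[0, 3], text_len=6, tokens_per_image=2)
```

Text token 2 (part of Question1) sees only Image1's tokens; token 4 (part of Question2) sees only Image2's tokens.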

LM ๋‚ด๋ถ€์˜ self-attention์€ ๋ชจ๋“  token๋ฅผ ํ™œ์šฉํ•˜๋ฏ€๋กœ ๊ฐ„์ ‘์ ์œผ๋กœ ๋‹ค๋ฅธ token๋“ค๊ณผ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์ง€๋งŒ, ์ง์ ‘์ ์ธ cross-attention ์—ฐ๊ฒฐ์„ ํ•œ ์ด๋ฏธ์ง€์”ฉ์œผ๋กœ ์ œํ•œํ•จ์œผ๋กœ์จ, Flamingo๋Š” ํ›ˆ๋ จ ์‹œ ์‚ฌ์šฉ๋œ ์ด๋ฏธ์ง€ ๊ฐœ์ˆ˜๋ณด๋‹ค ๋” ๋งŽ์€ ์ด๋ฏธ์ง€๊ฐ€ ๋“ค์–ด์™€๋„ ์ž˜ ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค .

fig7

์ด ๋•๋ถ„์— ํ•™์Šต ๋•Œ ํ•œ ์‹œํ€€์Šค๋‹น ์ตœ๋Œ€ 5์žฅ์˜ ์ด๋ฏธ์ง€๋งŒ ์‚ฌ์šฉํ–ˆ์ง€๋งŒ, Test ์‹œ์—๋Š” ์ตœ๋Œ€ 32๊ฐœ์˜ Image-Text pair๊นŒ์ง€ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์—ˆ๋‹ค.

์ฐธ๊ณ ๋กœ, ์ €์ž๋“ค์€ ๋Œ€์•ˆ ์‹คํ—˜์œผ๋กœ โ€œํ…์ŠคํŠธ๊ฐ€ ๋ชจ๋“  ์ด์ „ ์ด๋ฏธ์ง€๋“ค์— cross-attendํ•˜๋„๋กโ€ ํ•ด๋ณธ ๊ฒฝ์šฐ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์กŒ๋‹ค๊ณ  ๋ณด๊ณ ํ•œ๋‹ค . ์ด๋Š” ํ•œ๊บผ๋ฒˆ์— ์—ฌ๋Ÿฌ ์ด๋ฏธ์ง€๋ฅผ ๋ชจ๋‘ ์ฐธ๊ณ ํ•˜๋ฉด ๋ชจ๋ธ์ด ์–ด๋А ์ •๋ณด๋ฅผ ์–ด๋””์— ์จ์•ผ ํ• ์ง€ ํ˜ผ๋ž€์Šค๋Ÿฌ์›Œํ•˜๊ธฐ ๋•Œ๋ฌธ์œผ๋กœ ์ถ”์ •๋œ๋‹ค.

๊ฒฐ๊ตญ ์ด๋ฏธ์ง€-์ธ๊ณผ์ (image-causal) ์–ดํ…์…˜ ๋งˆ์Šคํ‚น์„ ์ ์šฉํ•œ ํ˜„์žฌ์˜ ๋ฐฉ์‹์ด ๊ฐ€์žฅ ํšจ๊ณผ์ ์ด์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Training on a mixture of vision and language datasets

Flamingo์˜ ์‚ฌ์ „ํ•™์Šต(Pretraining)์—๋Š” ๋Œ€๊ทœ๋ชจ unlabeled multi-modal ์›น ๋ฐ์ดํ„ฐ๊ฐ€ ํ™œ์šฉ๋˜์—ˆ๋‹ค.

๋ฐ์ดํ„ฐ๋Š” 3๊ฐ€์ง€๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๋‹ค.

  • M3W (MultiModal MassiveWeb)
    • ์›นํŽ˜์ด์ง€๋กœ๋ถ€ํ„ฐ ์ถ”์ถœํ•œ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ํ˜ผํ•ฉ ์‹œํ€€์Šค ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค.
    • ๊ฐ ํŽ˜์ด์ง€์—์„œ ์ด๋ฏธ์ง€๊ฐ€ ๋“ฑ์žฅํ•œ ์ž๋ฆฌ์— <image> token์„ ์‚ฝ์ž…ํ•˜๊ณ , ๋ฌธ์„œ์˜ ์„น์…˜์ด ๋๋‚  ๋•Œ <EOC> token์„ ๋„ฃ๋Š” ์‹์œผ๋กœ ์‹œํ€€์Šคํ™”
  • Image-Text Pairs
    • ALIGN ๋ฐ์ดํ„ฐ์…‹ + ์ถ”๊ฐ€๋กœ ์ˆ˜์ง‘ํ•œ LTIP (Long Text Image Pairs) ๋ฐ์ดํ„ฐ์…‹
    • Flamingo ์ž…๋ ฅ ํ˜•์‹์— ๋งž์ถ”๊ธฐ ์œ„ํ•ด ์บก์…˜ ์•ž์—๋Š” <image> token์„, ๋์—๋Š” <EOC> token์„ ๋ถ™์—ฌ ๊ตฌ์„ฑ
  • Video-Text Pairs
    • ์ž์ฒด ์ˆ˜์ง‘ํ•œ ๋™์˜์ƒ ์„ค๋ช… ๋ฐ์ดํ„ฐ
    • ํ‰๊ท  22์ดˆ ๋ถ„๋Ÿ‰ ๋™์˜์ƒ - ํ•œ ๋ฌธ์žฅ์งœ๋ฆฌ ์„ค๋ช…
    • ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ <image> (๋˜๋Š” ๋™์˜์ƒ ํ”„๋ ˆ์ž„์˜ placeholder)์™€ <EOC> token์„ ํ™œ์šฉํ•ด ์ž…๋ ฅ ์‹œํ€€์Šค๋ฅผ ๊ตฌ์„ฑ

์ด ์„ธ ๊ฐ€์ง€ ๋ฐ์ดํ„ฐ๋ฅผ ์„ž์–ด ๋ชจ๋ธ์„ ํ•™์Šตํ•  ๋•Œ, ๋‹จ์ˆœํžˆ ํ•œ ๋ฐ์ดํ„ฐ์…‹์”ฉ ๋ฒˆ๊ฐˆ์•„ ํ›ˆ๋ จํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค weight๋ฅผ ์ฃผ๋ฉฐ batch์— ์„ž์–ด gradient๋ฅผ ๋ˆ„์ ํ•˜๋Š” ๋ฐฉ์‹์ด ๋” ํšจ๊ณผ์ ์ด์—ˆ๋‹ค๊ณ  ํ•œ๋‹ค.

Training strategy

Flamingo ๋ชจ๋ธ์€ ์œ„ ๊ฑฐ๋Œ€ํ•œ ์›น ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ corpora์—์„œ ๋‹ค์Œ ๋‹จ์–ด ์˜ˆ์ธก task๋ฅผ ์ˆ˜ํ–‰ํ•˜๋ฉฐ ์‚ฌ์ „ํ•™์Šต๋˜์—ˆ๋‹ค.

์ฆ‰, ์ฃผ์–ด์ง„ ์‹œํ€€์Šค์—์„œ ํ•œ token์”ฉ autoregressiveํ•˜๊ฒŒ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ, ์ด๋ฏธ์ง€/ํ…์ŠคํŠธ ํ˜ผํ•ฉ ๋ฌธ๋งฅ์—์„œ ํ…์ŠคํŠธ ์ƒ์„ฑ ํ™•๋ฅ  $P(\text{ํ…์ŠคํŠธ} \mid \text{์ด์ „ ํ…์ŠคํŠธ+์ด๋ฏธ์ง€})$ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ํ›ˆ๋ จ๋˜์—ˆ๋‹ค . ํ›ˆ๋ จ ๊ฒฐ๊ณผ, Flamingo๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๊ฐ€ ์„ž์ธ ๊ธด ์‹œํ€€์Šค๋ฅผ ๋ณด๊ณ ๋„ ๋‹ค์Œ์— ์˜ฌ ๋‹จ์–ด๋ฅผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๋Šฅ๋ ฅ์„ ํš๋“ํ–ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋Šฅ๋ ฅ์„ ๋ฐ”ํƒ•์œผ๋กœ, ๋ชจ๋ธ์ด ํ•™์Šต์— ์‚ฌ์šฉ๋˜์ง€ ์•Š์€ ์ƒˆ๋กœ์šด ๋ฒค์น˜๋งˆํฌ์— ๋Œ€ํ•ด few-shot ์„ค์ •์œผ๋กœ ๋น ๋ฅด๊ฒŒ ์ ์‘ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋…ผ๋ฌธ์—์„œ ์‹คํ—˜์œผ๋กœ ์ฆ๋ช…ํ•˜์˜€๋‹ค.

  • Loss

    • ์„ธ ๋ฐ์ดํ„ฐ์…‹ $\mathcal{D}_m $ ๊ฐ๊ฐ์— ๋Œ€ํ•œ negative log-likelihood ์†์‹ค์„ ๊ฐ€์ค‘์น˜$\lambda_m$๋กœ ํ•ฉ์‚ฐ:
    \[\mathcal{L} = \sum_{m=1}^{3} \lambda_m \, \cdot\mathbb{E}_{(x,y)\sim \mathcal{D}_m}\Bigl[-\sum_{\ell=1}^{L} \log p\bigl(y_\ell\,|\,y_{<\ell},\,x_{\le\ell}\bigr)\Bigr]\]
    • ์—ฌ๊ธฐ์„œ $x$๋Š” Vision ์ž…๋ ฅ(์ด๋ฏธ์ง€/๋น„๋””์˜ค), $y$๋Š” ํ…์ŠคํŠธ token
    • ๊ฐ€์ค‘์น˜ ํŠœ๋‹์„ ํ†ตํ•ด, ๋ชจ๋ธ์ด ์„ธ ์†Œ์Šค ๋ชจ๋‘์—์„œ ๊ณ ๋ฅธ ์„ฑ๋Šฅ์„ ๋‚ด๋„๋ก ๊ท ํ˜•์„ ๋งž์ถ˜๋‹ค.
  • Gradient ๋ˆ„์ 

    • ๊ฐ step๋งˆ๋‹ค ๋ชจ๋“  ๋ฐ์ดํ„ฐ์…‹์— ๊ฑธ์ณ gradient๋ฅผ ๋ˆ„์ (accumulate)ํ•˜๊ณ  ํ•œ ๋ฒˆ์— ์—…๋ฐ์ดํŠธ

Task adaptation with few-shot in-context learning

After pretraining, Flamingo can be applied to a new task immediately, without task-specific fine-tuning, simply by showing a few examples in the prompt.

In-context learning

Prompt construction


  • Support set: $K$ examples ${(x_1,y_1),\dots,(x_K,y_K)}$ for the task (typically $K=4,8,16,32$).
    • $x$: image or video
  • Query: the new input $x_{q}$ (image/video)
  • Final input sequence
\[<\text{image}_1>\,y_1\;<\text{image}_2>\,y_2\;\dots\;<\text{image}_K>\,y_K\;<\text{image}_{q}>\,\phantom{y}\]

์˜ˆ์ƒ ์‘๋‹ต ์•ž์— โ€œOutput:โ€์„ ์ถ”๊ฐ€

์‹œ๊ฐ์  QA task์—๋Š” โ€œQuestion: {question} Answer: {answer}โ€ ํ˜•์‹์œผ๋กœ ํ”„๋กฌํ”„ํŠธ๋ฅผ ๊ตฌ์„ฑ

  1. Decoding method

    • Open-ended tasks (captioning, free-form answers): generate text with beam search

      Beam Search

      A search algorithm for finding the highest-probability output sequence from a generative model. Compared with greedy search, which keeps only the single most likely candidate at each step, it weighs multiple possibilities while forming the final sentence, so it tends to yield higher-quality text.

      1. Initialization

        • Start with the empty sequence [] as the only hypothesis, with score (log-prob) $s=0$
        • Fix a beam width $B$ in advance (e.g., $B=5$ or $B=10$)
      2. Iterate (at each timestep t)

        1. For each hypothesis $y_{1:t-1}$, compute the possible next tokens $v$ and their log probabilities $\log p(v\mid y_{1:t-1},\,x)$

        2. For every hypothesis × token pair, form the extended sequence and its cumulative score:

          $\bigl(y_{1:t-1},\,v\bigr)\quad,\quad s_{\text{new}} = s_{\text{old}} + \log p(v \mid y_{1:t-1},\,x)$

        3. Keep only the top $B$ highest-scoring candidates as the hypotheses for the next step

        4. Hypotheses that emit the end token (</s>) can be set aside as finished

      3. Termination

        • Stop once every hypothesis has emitted the end token or the maximum length is reached
        • Among the saved finished hypotheses, output the one with the highest score
    • Closed-ended tasks (multiple choice): append each candidate answer after the query image, compute its log-likelihood, and select the highest-scoring option
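The beam-search procedure above, in a minimal runnable form (the toy 3-token LM is invented for the demo; a real decoder would score tokens with the model):

```python
def beam_search(step_log_probs, vocab, beam_width, max_len, eos="</s>"):
    """Minimal beam search.

    step_log_probs(prefix) -> {token: log_prob} for the next position.
    Keeps the beam_width highest-scoring partial sequences each step and
    returns the best finished (or max-length) sequence.
    """
    beams = [([], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            dist = step_log_probs(seq)
            for tok in vocab:
                new = (seq + [tok], score + dist[tok])
                (finished if tok == eos else candidates).append(new)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]

# Toy LM: prefers "a", then "cat", then the end-of-sequence token.
def toy(prefix):
    table = {0: {"a": -0.1, "cat": -3.0, "</s>": -4.0},
             1: {"a": -3.0, "cat": -0.1, "</s>": -4.0},
             2: {"a": -4.0, "cat": -4.0, "</s>": -0.1}}
    return table[min(len(prefix), 2)]

best = beam_search(toy, ["a", "cat", "</s>"], beam_width=2, max_len=4)
# → ["a", "cat", "</s>"]
```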

  2. Zero-shot generalization

    To measure zero-shot performance, the paper builds the prompt with no shots: only the text of two examples, with their images removed, as follows.

    
    <BOS>
    Output: This is a cat wearing sunglasses.<EOC>
    Output: Three elephants walking in the savanna.<EOC>
    <image> Output:
    

    Showing only one example biases the model too strongly, so two are used; more than two brought negligible gains, so the count was fixed at two.

  3. Retrieval-based In-Context Example Selection (RICES)

    With a large example pool, the prompt-length limit makes it impossible to include everything, and overly long sequences hurt generalization.

    In that case, following the RICES method of "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA":

    1. Compare visual feature vectors against the query image and select only the top $N$ most similar examples
    2. Order the prompt by similarity, so that the most similar example sits immediately before the query.

This bounds the length while raising prompt quality, improving performance.

Experiments

Flamingo ๋ชจ๋ธ์ด ์–ผ๋งˆ๋‚˜ ๋‹ค์–‘ํ•œ ์ƒˆ๋กœ์šด Vision-language task์— ๋น ๋ฅด๊ฒŒ ์ ์‘(fast adaptation)ํ•˜๋Š”์ง€ ํ‰๊ฐ€ํ•œ๋‹ค. ์ด๋ฅผ ์œ„ํ•ด ์ด 16๊ฐœ์˜ ๋Œ€ํ‘œ multimodal image/video - language benchmark๋ฅผ ์„ ์ •ํ–ˆ๊ณ , ์ด๋“ค ์ค‘ 5๊ฐœ๋Š” ๋ชจ๋ธ ์„ค๊ณ„ยทํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹(DEV set) ๊ณผ์ •์—์„œ, ๋‚˜๋จธ์ง€ 11๊ฐœ๋Š” ์˜ค์ง ์ตœ์ข… ํ‰๊ฐ€(held-out) ์šฉ๋„๋กœ๋งŒ ์‚ฌ์šฉํ–ˆ๋‹ค.

  1. DEV ๋ฒค์น˜๋งˆํฌ(๋ชจ๋ธ ๊ฐœ๋ฐœ์— ์‚ฌ์šฉ)

    • COCO Captions, OK-VQA, VQAv2, MSVDQA, VATEX
  2. Held-out ๋ฒค์น˜๋งˆํฌ(์ตœ์ข… ์„ฑ๋Šฅ ์ธก์ •)

    • Flickr30k, YouCook2 VideoQA, Visual Dialogue, Hateful Memes, TextVQA, STAR, NextQA, RareAct ๋“ฑ 11๊ฐœ
  3. ํ‰๊ฐ€ ๋ฐฉ์‹ ํ†ต์ผ

    • Few-shot in-context learning์œผ๋กœ๋งŒ ๋ชจ๋ธ์„ ์ ์šฉ

    • Open-ended๋Š” Beam Search(beam size=3)

    • Closed-ended๋Š” log-likelihood scoring ๋ฐฉ์‹์œผ๋กœ ์ •๋‹ต ์„ ํƒ

      ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ, promt๊ตฌ์„ฑ, beam size ๋“ฑ์€ ์ „ ๋ฒค์น˜๋งˆํฌ์— ๊ฑธ์ณ ๊ณ ์ •ํ•ด ํ‰๊ฐ€ ํŽธํ–ฅ์„ ์ตœ์†Œํ™”

  4. Ablation study

    • ๋ถ€๋ก B.2: Flamingo ๋ชจ๋ธ์˜ fine-tuning ์„ฑ๋Šฅ(VQAv2, VATEX, VizWiz ๋“ฑ 9๊ฐœ task)
    • ๋ถ€๋ก B.2: ImageNetยทKinetics700 ๋ถ„๋ฅ˜ ์„ฑ๋Šฅ, contrastive ๋น„์ „ ์ธ์ฝ”๋” ์„ฑ๋Šฅ
    • ๋ถ€๋ก C: ์งˆ์˜์‘๋‹ตยท์บก์…˜ยท๋Œ€ํ™” ๋“ฑ ๋‹ค์–‘ํ•œ ์ •์„ฑ์  ์˜ˆ์‹œ

Few-shot learning

t1

Across all 16 benchmarks, the Flamingo (80B) model:

  • beat the previous zero/few-shot SOTA
  • on 6 tasks (OK-VQA, MSVDQA, etc.), surpassed even fine-tuned SOTA using 32 shots

  • performance improves consistently as model size (3B→9B→80B) and shot count (0→4→32) increase

    fig2

Fine-tuning

Flamingo can also be treated as a pretrained VLM and adapted to a task by fine-tuning on labeled data.

  • Setup
    • Both the cross-attention layers and the Perceiver Resampler added around Chinchilla are fine-tuned
    • NFNet-F6 is unfrozen and trained to accept higher-resolution inputs

แ„‰แ…ณแ„แ…ณแ„…แ…ตแ†ซแ„‰แ…ฃแ†บ 2025-07-31 แ„‹แ…ฉแ„Œแ…ฅแ†ซ 11.30.08

After fine-tuning, Flamingo far exceeds its few-shot results and, against the previous fine-tuned SOTA, sets new SOTA on 5 tasks: VQAv2, VATEX, VizWiz, MSRVTTQA, and Hateful Memes.

This demonstrates that a single set of Flamingo weights can serve both few-shot use and fine-tuning.

Ablation Studies

t3

  1. Training data

    • Baseline: M3W (crawled) + image-text pairs (ALIGN+LTIP) + video-text pairs (VTP)
      • w/o M3W: overall score drops 17.3%
      • w/o image-text pairs: drops 9.8%
      • w/o video-text pairs: video task performance drops
      • replacing the image-text pairs with the LAION dataset: drops 4.3%
  2. Optimisation

    • Baseline: accumulate the gradients of all datasets in one step
    • Comparison: with "round-robin" updates, the overall score falls 70.7% → 62.9%
  3. Tanh gating

    • Baseline: initialize $\alpha$ of the gate $\tanh(\alpha)$ to 0 and scale the XATTN-Dense output

    • Comparison: without gating, overall score 70.7 → 66.5 (-4.2)

      Without gating, the initial output no longer matches the pretrained LM's, causing training instability.

  4. Cross-attention architecture

    • Baseline: GATED XATTN-DENSE
    • Comparison:
      • VANILLA XATTN (plain Transformer cross-attn only): 70.7 → 66.9
      • GRAFTING (cross+self-attn layers stacked on top of the frozen LM): 70.7 → 63.1
  5. Cross-attention frequency

    • Baseline: GATED XATTN-DENSE inserted at every layer (higher cost)
    • Comparison:
      • every 2nd layer: 70.7 → 68.2 (-2.5%)
      • every 4th layer: 70.7 → 68.8 (-1.9%)
      • a single (middle) layer: 70.7 → 59.8 (-15.4%)
    • Compromise: considering the trade-off, Flamingo-9B inserts one every 4 layers and Flamingo-80B every 7 layers
  6. Perceiver Resampler

    • Baseline: Perceiver Resampler (64 visual tokens out)
    • Comparison:
      • Transformer (same parameter count): 70.7 → 66.7
      • a single MLP: 70.7 → 66.6
  7. Vision encoder

    • Baseline: NFNet-F6 (contrastively pretrained)
    • Comparison:
      • CLIP ViT-L/14: 70.7 → 64.9 (-5.8)
      • NFNet-F0: 70.7 → 62.7 (-8.0)
    • Conclusion: the strong contrastively pretrained NFNet-F6 is best.
  8. Freezing the LM

    • Baseline: all Chinchilla LM layers frozen
    • Comparison:
      • LM trained from scratch (random init): 70.7 → 57.8 (-12.9)
      • LM fine-tuned from its pretrained state (unfrozen): 70.7 → 62.7 (-8.0)
CLIP vs NFNet-F6

CLIP ViT-L/14 is also a strong contrastively trained vision encoder, but within Flamingo NFNet-F6 performs considerably better.

  1. Differences in resolution and architecture

| | CLIP ViT-L/14 | NFNet-F6 |
| --- | --- | --- |
| Base architecture | Vision Transformer (ViT-L/14) | Convolutional feedforward network (ResNet family) |
| Input resolution | typically 224×224 | trained at 288×288 (up to 480×480 in Flamingo) |
| Receptive field | limited (14×14 patches) | wider and denser (CNN property) |
| Inductive bias | almost none (Transformer) | present (CNN: locality, hierarchical representations) |

  • CLIP
    • being self-attention based, it has a harder time learning local patterns and low-level visual features
  • NFNet

    • very strong at extracting local structure and hierarchical features

    → For a model like Flamingo that must handle many kinds of Vision-Language tasks, NFNet's general recognition ability is more effective

  1. ํ•™์Šต ๋ฐ์ดํ„ฐ์™€ ์Šค์ผ€์ผ ์ฐจ์ด

    • NFNet-F6

      • ALIGN + LTIP ๋ฐ์ดํ„ฐ

        • ALIGN: 1.8B image-text pair (Google ๋‚ด๋ถ€ ๋ฐ์ดํ„ฐ)

        • LTIP: 4B text image pair (web-scale large-scale textโ€“image pairs)

      • ํ•™์Šต ์Šคํ…: 1.2M update steps, batch size 16,384, TPUv4 ร— 512 ์‚ฌ์šฉ

    • CLIP

      • LAION-400M ์ˆ˜์ค€์˜ ์›น ๋ฐ์ดํ„ฐ

      • ์ƒ๋Œ€์ ์œผ๋กœ ํ•™์Šต ๊ทœ๋ชจ๊ฐ€ ์ž‘๊ณ , fine-grained noise๋„ ๋” ๋งŽ์Œ

โ€‹ โ†’ Flamingo์˜ NFNet์€ ๋” ์–‘์งˆ์˜ ๋ฐ์ดํ„ฐ๋กœ ํ›จ์”ฌ ๋” ์˜ค๋žซ๋™์•ˆ ํ•™์Šต๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์—, CLIP๋ณด๋‹ค ์‹œ๊ฐ ํ‘œํ˜„์ด ํ›จ์”ฌ ์ •๊ต

  3. Fit with the Flamingo architecture

    • In Flamingo, the vision encoder's output is compressed by the Perceiver Resampler into 64 latent tokens and fed to the language model

      • CLIP ViT-L/14 emits patch-wise tokens, which may fit this pipeline less well

      • NFNet, by contrast, emits a convolutional feature map

        → much more favorable for the Perceiver's token extraction, positional independence, and information compression

Conclusion

Flamingo๋Š” LLM์˜ few-shot learning ๋Šฅ๋ ฅ์„ Image/Video ๋„๋ฉ”์ธ์œผ๋กœ ํ™•์žฅํ•จ์œผ๋กœ์จ, ์ฃผ์–ด์ง„ ๋ช‡ ๊ฐœ์˜ ์˜ˆ์‹œ๋งŒ์œผ๋กœ ์ƒˆ๋กœ์šด Vision-language task๋“ค์„ ์‹ ์†ํžˆ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Œ์„ ์ž…์ฆํ–ˆ๋‹ค.

Flamingo์˜ ์•„ํ‚คํ…์ฒ˜๋Š” ์‚ฌ์ „ํ•™์Šต๋œ Vision backbone๊ณผ Language model์„ ํšจ๊ณผ์ ์œผ๋กœ ์—ฐ๊ฒฐํ•˜์—ฌ, ๋‘ ๋ชจ๋ธ์ด ์ถ•์ ํ•œ ์ง€์‹์„ ์ตœ๋Œ€ํ•œ ํ™œ์šฉํ•œ๋‹ค.

๊ทธ ๊ฒฐ๊ณผ, ๋Œ€๊ทœ๋ชจ ์›น ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ Flamingo๋Š” Captioning, VQA, ์˜์ƒ ์งˆ๋ฌธ์‘๋‹ต, ๋Œ€ํ™”ํ˜• ์‘๋‹ต ๋“ฑ ๋‹ค์–‘ํ•œ task์—์„œ SOTA ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ–ˆ๋‹ค.

ํŠนํžˆ Inference ์‹œ ์ถ”๊ฐ€ ํ•™์Šต ์—†์ด๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋‹ค๋Š” ์ ์—์„œ, ํ–ฅํ›„ Multimodal AI์˜ ๊ฐœ๋ฐœ ๋ฐ ์‘์šฉ ํŒจ๋Ÿฌ๋‹ค์ž„์— ๋ณ€ํ™”๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. ์ด์ „๊นŒ์ง€๋Š” ๊ฐ task๋งˆ๋‹ค ๊ฐœ๋ณ„ ๋ชจ๋ธ์„ ํ›ˆ๋ จํ•ด์•ผ ํ–ˆ๋‹ค๋ฉด, ์ด์ œ๋Š” Flamingo ๊ฐ™์€ ๋ฒ”์šฉ ๋ชจ๋ธ์— ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋ฅผ ์ฃผ๋Š” ๊ฒƒ๋งŒ์œผ๋กœ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋‚˜์•„๊ฐˆ ๊ฐ€๋Šฅ์„ฑ์„ ์‹œ์‚ฌํ•œ ๊ฒƒ์ด๋‹ค.

Flamingo๋Š” Few-Shot Multimodal Learning์˜ ๊ฐ€๋Šฅ์„ฑ์„ ์—ด์—ˆ์œผ๋ฉฐ, Multimodal AI๊ฐ€ ์–ด๋–ป๊ฒŒ ์ง„ํ™”ํ• ์ง€์— ๋Œ€ํ•œ ํ•˜๋‚˜์˜ ๋ฐฉํ–ฅ์„ฑ์„ ์ œ๊ณตํ•œ๋‹ค.

Limitations

  • LM์˜ ์•ฝ์ ์„ ๊ณ„์Šน : hallucination and ungrounded guesses

    • ์ด๋ฏธ์ง€ ๋‚ด์šฉ๊ณผ ๋ฌด๊ด€ํ•˜์ง€๋งŒ ํ…์ŠคํŠธ ๋งฅ๋ฝ์ƒ ๊ทธ๋Ÿด๋“ฏํ•ด ๋ณด์ด๋Š” ๋‹ต๋ณ€์„ ์ƒ์„ฑํ•˜๋Š” ํ˜„์ƒ
    • ์–ธ์–ด ๋ชจ๋ธ์˜ ์‚ฌ์ „์ง€์‹์— ๊ณผ๋„ํ•˜๊ฒŒ ์˜์กดํ•˜๊ธฐ ๋•Œ๋ฌธ
    • train ๋•Œ ๋ณธ ์  ์—†๋Š” ๊ธธ์ด์˜ ์ž…๋ ฅ ์‹œํ€€์Šค์— ๋Œ€ํ•ด์„œ๋Š” ์ผ๋ฐ˜ํ™”์•ˆ๋จ
      • ๋„ˆ๋ฌด ๋งŽ์€ ์˜ˆ์‹œ๋ฅผ ํ•œ ๋ฒˆ์— ๋„ฃ์œผ๋ฉด ์„ฑ๋Šฅ์ด ํ•˜๋ฝ
  • ๋†’์€ ๊ณ„์‚ฐ ๋น„์šฉ

    • Flamingo-80B์— NFNet-F6๊ณผ Resampler๊นŒ์ง€ ๊ฒฐํ•ฉ๋˜์–ด ์—ฐ์‚ฐ๋Ÿ‰๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์š”๊ตฌ๋Ÿ‰์ด ๋ง‰๋Œ€
      • TPUv4 ์นฉ 512๊ฐœ
      • ์ด 1.2 million update steps
    • ์ถ”๋ก  ์‹œ ์ˆ˜์‹ญ ์ƒท์˜ ์˜ˆ์‹œ๋ฅผ ์ž…๋ ฅํ•˜๋ฉด token ์ˆ˜๊ฐ€ ๊ธธ์–ด์ ธ ๊ณ„์‚ฐ cost๊ฐ€ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€
    • few-shot learning์€ ๋ณ„๋„ fine-tuning์ด ํ•„์š” ์—†๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋Š” ๋Œ€์‹  Inference cost ์ฆ๊ฐ€ .
  • Prompt sensitive

    • LLM์ฒ˜๋Ÿผ ํ”„๋กฌํ”„ํŠธ์— ์ œ๊ณตํ•˜๋Š” ์˜ˆ์‹œ์˜ ๊ตฌ์„ฑ๊ณผ ํ‘œํ˜„์— ๋ฏผ๊ฐ
    • ์˜ˆ๋ฅผ ๋“ค์–ด, ๋™์ผํ•œ 4๊ฐœ์˜ ์˜ˆ์‹œ๋ผ๋„ ๋ฐฐ์—ด ์ˆœ์„œ๋‚˜ ์„ค๋ช… ์–ดํˆฌ์— ๋”ฐ๋ผ ๋ชจ๋ธ ์ถœ๋ ฅ์ด ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ๋‹ค.
    • few-shot์—์„œ shot ์ˆ˜๋ฅผ ๋Š˜๋ฆด์ˆ˜๋ก ์–ด๋А ์ •๋„๊นŒ์ง€๋Š” ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์ง€๋งŒ, ์ผ์ • ์ˆ˜์ค€(์ˆ˜์‹ญ ๊ฐœ ์ด์ƒ)์„ ๋„˜์–ด์„œ๋ฉด ์˜คํžˆ๋ ค ๋ชจ๋ธ์ด ํ˜ผ๋ž€์„ ์ผ์œผํ‚ค๊ฑฐ๋‚˜ ์„ฑ๋Šฅ ๊ฐœ์„ ์ด ์ •์ฒด๋˜๋Š” ํ˜„์ƒ ๋ฐœ์ƒ
  • Classification task

    • Classification task์—์„œ๋Š” ์ตœ์ฒจ๋‹จ Contrastive model(CLIP)๋ณด๋‹ค ์„ฑ๋Šฅ ํ•˜๋ฝ

      • ImageNet few-shot์—์„œ CLIP ๋“ฑ Image-text embedding ๊ธฐ๋ฐ˜ classification ๋ชจ๋ธ๋ณด๋‹ค ํ•˜๋ฝ

      • Flamingo๋Š” text generation ํ™•๋ฅ ์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ํ•™์Šต๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ

        Flamingo๋Š” ๋‹ค์–‘ํ•œ task๋ฅผ ํญ๋„“๊ฒŒ ๋‹ค๋ฃจ๊ธฐ ์œ„ํ•ด ํŠนํ™”๋œ ์ตœ์ ํ™”๋Š” ํฌ๊ธฐ