
[Paper Review] Vision-Language Models for Vision Tasks: A Survey

์œฐ๊ฐฑ 2025. 5. 8. 17:23

0. Abstract

๊ธฐ์กด์˜ visual recognition ์—ฐ๊ตฌ๋Š” ๋”ฅ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ(DNN)๋ฅผ ๊ฐ visual recognition task ๋งˆ๋‹ค ๋ณ„๋„๋กœ ํ•™์Šต์‹œ์ผœ์•ผ ํ–ˆ๋‹ค.
--> ์ด๋Š” ๋Œ€๊ทœ๋ชจ์˜ ์ˆ˜์ž‘์—… ๋ผ๋ฒจ ๋ฐ์ดํ„ฐ์— ์˜์กดํ•˜๋ฉฐ ๋งŽ์€ ์‹œ๊ฐ„๊ณผ ์ธ๋ ฅ ์ž์›์ด ์†Œ๋ชจ๋˜๋Š” ๊ตฌ์กฐ

Vision-Language Models (VLMs) have recently been attracting attention as a way to address this problem.
VLMs (1) learn rich vision-language correlations from the large-scale image-text pairs that are almost infinitely available on the web, and
(2) have the advantage that a single model can make zero-shot predictions on a variety of visual recognition tasks.

์ด ๋…ผ๋ฌธ์€ VLM์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•œ ์‹œ๊ฐ ์ธ์‹ ๊ธฐ์ˆ ์— ๋Œ€ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‚ด์šฉ์„ ์ฒด๊ณ„์ ์œผ๋กœ ์ •๋ฆฌํ•œ๋‹ค:

  1. Visual recognition ํŒจ๋Ÿฌ๋‹ค์ž„์˜ ๋ฐœ์ „ ๊ณผ์ •
  2. Foundations of VLM: ์ฃผ์š” ๋„คํŠธ์›Œํฌ ์•„ํ‚คํ…์ฒ˜, ์‚ฌ์ „ํ•™์Šต ๋ชฉํ‘œ, ๋‹ค์šด์ŠคํŠธ๋ฆผ ๊ณผ์ œ
  3. Datasets: VLM ์‚ฌ์ „ํ•™์Šต ๋ฐ ํ‰๊ฐ€์— ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ์…‹
  4. ๊ธฐ์กด VLM์˜ pre-training, transfer learning, knowledge distillation ๋ฐฉ๋ฒ•์˜ ๋ถ„์„
  5. ๋‹ค์–‘ํ•œ VLM ๋ฐฉ๋ฒ•๋“ค์— ๋Œ€ํ•œ ๋ฒค์น˜๋งˆํ‚น, ์„ฑ๋Šฅ ๋ถ„์„ ๋ฐ ๋…ผ์˜
  6. ํ–ฅํ›„ ์—ฐ๊ตฌ ๊ณผ์ œ์™€ ๋ฐœ์ „ ๋ฐฉํ–ฅ

๋˜ํ•œ ์ด ์„œ๋ฒ ์ด ๋…ผ๋ฌธ๊ณผ ์—ฐ๊ณ„๋œ ํ”„๋กœ์ ํŠธ๊ฐ€ GitHub ๋งํฌ์— ๊ณต๊ฐœ๋˜์–ด ์žˆ๋‹ค.

(๋ณธ repo์—์„œ ๊พธ์ค€ํžˆ VLM paper๋“ค์„ ์—…๋ฐ์ดํŠธ ํ•ด์ฃผ๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.)


1. Introduction

1-1. Background: Importance and Limitations of Visual Recognition

  • Visual recognition tasks such as image classification, object detection, and semantic segmentation are core problems in computer vision and form the basis of applications such as autonomous driving, robotics, and remote sensing.
  • Deep-learning-based DNNs (Deep Neural Networks) achieved great success, but face limitations:
    • slow training (when trained from scratch)
    • the need for large-scale labeled data

1-2. Learning Paradigm: Pre-training → Fine-tuning → Prediction

  • Taking a pre-trained model and fine-tuning it on task-specific data
    • speeds up training convergence and
    • yields good performance on various downstream tasks.
  • However, it still requires additional labeled data for each task.

1-3. A New Learning Paradigm: Vision-Language Models (VLMs) and Zero-shot Prediction

  • Recently, the VLM pre-training + zero-shot prediction approach has been drawing attention.
  • Models such as CLIP are trained on large-scale image-text pairs from the web, and
  • after training can be applied directly to various tasks without separate fine-tuning.

โžก๏ธ Example: CLIP shows excellent zero-shot performance on a total of 36 tasks, from image classification to action recognition and OCR.

1-4. Two Main Research Directions

Since the success of VLMs, research has largely split into two directions:

  1. Transfer Learning:
    • Adapting VLMs effectively to downstream tasks via prompt tuning, visual adaptation, etc.
  2. Knowledge Distillation:
    • Distilling VLM knowledge into other models to improve performance on object detection, segmentation, etc.

1-5. ์ด ๋…ผ๋ฌธ์˜ Contribution

C1 ์ด๋ฏธ์ง€ ๋ถ„๋ฅ˜, ๊ฐ์ฒด ํƒ์ง€, ์˜๋ฏธ ๋ถ„ํ•  ๋“ฑ ๋‹ค์–‘ํ•œ ํƒœ์Šคํฌ๋ฅผ ํฌํ•จํ•œ VLM ๊ธฐ๋ฐ˜ ์‹œ๊ฐ ์ธ์‹ ์—ฐ๊ตฌ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ •๋ฆฌํ•œ ์ฒซ ์„œ๋ฒ ์ด ๋…ผ๋ฌธ
C2 ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๊ธฐ์กด ์—ฐ๊ตฌ์˜ ๋ฒค์น˜๋งˆํฌ ๋ฐ ๋น„๊ต ์ œ๊ณต
C3 ํ–ฅํ›„ VLM ๊ธฐ๋ฐ˜ ์‹œ๊ฐ ์ธ์‹ ์—ฐ๊ตฌ๋ฅผ ์œ„ํ•œ challenges ๋ฐ research directions ์ œ์•ˆ

 

1-6. Summary

Unlike the Pre-training–Fine-tuning–Prediction pipelines of (a) and (b),
the Vision-Language Model based approach (c) has the following characteristics:

  1. Training on image-text pair data
     - Uses image-text pairs collected from the web rather than conventionally labeled images
  2. Training objectives suited to such data
     - e.g., contrastive learning, masked cross-modal modeling
  3. Learning general-purpose representations from web-scale datasets
     - By learning representations across diverse domains, zero-shot prediction becomes possible without task-specific fine-tuning

 

์ตœ๊ทผ์˜ VLM ์—ฐ๊ตฌ๋“ค์€ ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•ด ๋‹ค์Œ์˜ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๊ด€์ ์—์„œ ์ ‘๊ทผํ•˜๊ณ  ์žˆ๋‹ค:

  1. ์ •๋ณด์„ฑ์ด ๋†’์€ ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘
     - ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ๊ณผ ํ‘œํ˜„์„ ํฌํ•จํ•œ, ํ•™์Šต์— ์œ ์˜๋ฏธํ•œ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•
  2. ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ ํ•™์Šต์„ ์œ„ํ•œ ๊ณ ์šฉ๋Ÿ‰(high-capacity) ๋ชจ๋ธ ์„ค๊ณ„
     - ๋ณต์žกํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํ‘œํ˜„์„ ์ถฉ๋ถ„ํžˆ ํฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ตฌ์กฐ ์„ค๊ณ„
  3. VLM์— ํŠนํ™”๋œ ์‚ฌ์ „ํ•™์Šต objective ์—ฐ๊ตฌ
     - contrastive learning, masked modeling ๋“ฑ ํšจ๊ณผ์ ์ธ ํ•™์Šต์„ ์œ„ํ•œ ๋ชฉ์  ํ•จ์ˆ˜ ๊ณ ์•ˆ

2. Background

2-1. Development of VLMs for Visual Recognition

1) Pre-training objective: single → hybrid objectives

  • Early VLMs (e.g., CLIP) used only a single training objective such as contrastive learning
  • Recent VLMs combine multiple objectives to improve performance
    • Contrastive (similarity-based alignment)
    • Alignment (matching at specific locations)
    • Generative (text generation, etc.)

โžก๏ธ Synergy between different objectives enables more robust representation learning

2) Pre-training framework: multiple separate networks (two-tower) → unified network (one-tower)

  • Early VLMs consisted of two separate networks, an image encoder and a text encoder
  • (e.g., CLIP: images ↔ text processed separately)
  • Recently, a single unified (one-tower) architecture is used → one model processes images and text together

โžก๏ธ Advantages: saves GPU memory & smoother information exchange between modalities

3) Downstream tasks: simple tasks → complex, fine-grained tasks

  • Early VLMs focused on image-level tasks such as image classification
  • Recent VLMs are expanding to dense prediction tasks
  • (e.g., tasks requiring localization, such as object detection and semantic segmentation)

โžก๏ธ VLMs are evolving into architectures that can handle increasingly general and complex visual tasks


3. VLM Foundations

3-1. Network Architectures

pre-training dataset: $D = \{(x_{n}^I, x_{n}^T)\}_{n=1}^N$

image sample $x_{n}^I$, text sample $x_{n}^T$

image encoder $f_{\theta}$, text encoder $f_{\phi}$

image embedding $z_{n}^I = f_{\theta}(x_{n}^I)$, text embedding $z_{n}^T = f_{\phi}(x_{n}^T)$

3-1-1. Architectures for Learning Image Features

CNN-based Architectures (ResNet) / Transformer-based Architectures (ViT)

3-1-2. Architectures for Learning Language Features

Most use a Transformer or one of its variants (GPT, BERT, etc.)

 

3-2. VLM Pre-training Objectives

3-2-1. Contrastive Objectives

Trains the embedding space so that matched image-text pairs (positive pairs) are pulled close together while other pairs (negatives) are pushed far apart.

Image Contrastive Learning: learns from similarity between images (e.g., data-augmented pairs)

$L_{I}^{\mathrm{InfoNCE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_i^I \cdot z_+^I / \tau)}{\sum_{j=1, j \neq i}^{B+1} \exp(z_i^I \cdot z_j^I / \tau)}$

Image-Text Contrastive Learning: aligns image and text embeddings (trained in both directions: image→text / text→image)

$L_{\mathrm{InfoNCE}}^{IT} = L_{I \to T} + L_{T \to I}$

$L_{I \to T} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_i^I \cdot z_i^T / \tau)}{\sum_{j=1}^{B} \exp(z_i^I \cdot z_j^T / \tau)}$

$L_{T \to I} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z_i^T \cdot z_i^I / \tau)}{\sum_{j=1}^{B} \exp(z_i^T \cdot z_j^I / \tau)}$
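
As a concrete reference, here is a minimal PyTorch sketch of this bidirectional InfoNCE loss, assuming `z_img` and `z_txt` are already L2-normalized embeddings of B matched pairs (function and variable names are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def image_text_infonce(z_img, z_txt, tau=0.07):
    # z_img, z_txt: (B, d) L2-normalized embeddings of paired samples
    logits = z_img @ z_txt.t() / tau                    # (B, B) cosine similarities
    targets = torch.arange(z_img.size(0), device=z_img.device)
    loss_i2t = F.cross_entropy(logits, targets)         # L_{I->T}: rows are images
    loss_t2i = F.cross_entropy(logits.t(), targets)     # L_{T->I}: rows are texts
    return loss_i2t + loss_t2i
```

The positives sit on the diagonal of the similarity matrix, so each direction reduces to a standard cross-entropy over the batch.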

Image-Text-Label Contrastive Learning: pulls image-text pairs with the same class (label) even closer together

$L_{\mathrm{InfoNCE}}^{ITL} = L_{I \to T}^{ITL} + L_{T \to I}^{ITL}$

$L_{I \to T}^{ITL} = -\sum_{i=1}^{B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp(z_i^I \cdot z_k^T / \tau)}{\sum_{j=1}^{B} \exp(z_i^I \cdot z_j^T / \tau)}$

$L_{T \to I}^{ITL} = -\sum_{i=1}^{B} \frac{1}{|P(i)|} \sum_{k \in P(i)} \log \frac{\exp(z_i^T \cdot z_k^I / \tau)}{\sum_{j=1}^{B} \exp(z_i^T \cdot z_j^I / \tau)}$

3-2-2. Generative Objectives

Trains the network to directly generate or reconstruct images or text in order to learn semantic information.

Rather than plain classification, framing training as a reconstruction/generation problem enables richer representation learning.

Masked Image Modelling (MIM): splits the image into patches, masks some of them, and learns to reconstruct them from the rest (MAE (Masked AutoEncoder), BEiT)

$L_{MIM} = -\frac{1}{B} \sum_{i=1}^{B} \log f_{\theta}(\bar{x}_{i}^{I} \mid \hat{x}_{i}^{I})$

masked image patches $\bar{x}_{i}^{I}$, unmasked image patches $\hat{x}_{i}^{I}$

Masked Language Modelling (MLM): masks some words in natural-language text and learns to reconstruct them from the rest (BERT)

(also frequently used for training the text encoder in vision-language models)

$L_{MLM} = -\frac{1}{B} \sum_{i=1}^{B} \log f_{\phi}(\bar{x}_{i}^{T} \mid \hat{x}_{i}^{T})$

masked tokens $\bar{x}_{i}^{T}$, unmasked tokens $\hat{x}_{i}^{T}$

Masked Cross-Modal Modelling (MCM): masks parts of both the image and the text, then reconstructs each conditioned on the other

(enables deep learning of cross-modal interactions)

$L_{MCM} = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log f_{\theta}(\bar{x}_{i}^{I} \mid \hat{x}_{i}^{I}, \hat{x}_{i}^{T}) + \log f_{\phi}(\bar{x}_{i}^{T} \mid \hat{x}_{i}^{I}, \hat{x}_{i}^{T}) \right]$

$\bar{x}_{i}^{I} / \hat{x}_{i}^{I}$: masked/unmasked patches in $x_i^I$

$\bar{x}_{i}^{T} / \hat{x}_{i}^{T}$: masked/unmasked text tokens in $x_i^T$

Image-to-Text Generation (ITG): takes an image $z^I$ as input and generates natural-language descriptions, questions, captions, etc.

(often implemented with a GPT-style autoregressive decoder)

$L_{ITG} = -\sum_{l=1}^{L} \log f_{\theta}(x_{l}^{T} \mid x^{T}_{<l}, z^{I})$
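
A hedged sketch of this next-token objective, assuming a hypothetical autoregressive `decoder` that returns per-token vocabulary logits conditioned on the image embedding (all names are illustrative):

```python
import torch.nn.functional as F

def itg_loss(decoder, z_img, caption_ids):
    # caption_ids: (B, L) token ids; predict token l from tokens < l and z^I
    inputs, targets = caption_ids[:, :-1], caption_ids[:, 1:]
    logits = decoder(inputs, image_context=z_img)       # (B, L-1, vocab) logits
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```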

 

3-2-3. Alignment Objectives

Trains the model by aligning images and text according to how related they are.

Image-Text Matching (ITM): a binary classification problem that predicts whether an image-text pair is a true match (frequently used in models such as UNITER and FLAVA)

$L_{IT} = p \log S(z^I, z^T) + (1 - p) \log(1 - S(z^I, z^T))$

$p$: 1 if the image and text are paired, 0 otherwise
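
A minimal sketch of ITM as binary cross-entropy, assuming `match_prob` comes from a small score head $S$ producing matching probabilities (names are illustrative):

```python
import torch.nn.functional as F

def itm_loss(match_prob, p):
    # match_prob: (B,) scores S(z^I, z^T) in (0, 1) from a matching head
    # p: (B,) 1 for true image-text pairs, 0 for sampled negatives
    return F.binary_cross_entropy(match_prob, p.float())
```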

Region-Word Matching (RWM): learns local alignment between regions in the image and words in the text (e.g., alignment between the word "cat" and the cat region in the image)

$L_{RW} = p \log S^{r}(r^I, w^T) + (1 - p) \log(1 - S^{r}(r^I, w^T))$

$(r^I, w^T)$: region-word pair

 

3-3. VLM Pre-training Frameworks

Three main framework structures are used when pre-training a VLM.

They differ in how images and text are processed and how the modalities are fused:

  • Two-Tower: 2 encoders, no cross-modal interaction (e.g., CLIP)
  • Two-Leg: 2 encoders + fusion layers, with cross-modal interaction (e.g., BLIP)
  • One-Tower: 1 unified encoder, strong cross-modal interaction (e.g., FLAVA)

 

3-4. Evaluation Setups and Downstream Tasks

3-4-1. Zero-shot Prediction

Evaluates a pre-trained VLM by applying it directly to each task without additional task-specific training (fine-tuning).

Downstream tasks:

  • Image Classification: classify by comparing the image embedding with text embeddings; text queries are built via prompt engineering (e.g., "a photo of a [label]")
  • Semantic Segmentation: compare per-pixel embeddings with class-description text
  • Object Detection: compare object proposal box embeddings with text
  • Image-Text Retrieval: text → image and image → text retrieval
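
A minimal sketch of zero-shot classification with a CLIP-like model, assuming `encode_image` and `encode_text` return L2-normalized embeddings (these helper names are assumptions, not a specific library API):

```python
import torch

@torch.no_grad()
def zero_shot_classify(encode_image, encode_text, image, labels):
    prompts = [f"a photo of a {label}" for label in labels]   # prompt engineering
    z_img = encode_image(image)      # (1, d), L2-normalized image embedding
    z_txt = encode_text(prompts)     # (num_labels, d), L2-normalized text embeddings
    sims = (z_img @ z_txt.t()).squeeze(0)    # cosine similarity per label
    return labels[sims.argmax().item()]      # most similar label wins
```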

3-4-2. Linear Probing

Freezes the weights of the pre-trained VLM and trains only a linear classifier on top of its embeddings to evaluate representation quality.


4. Datasets

4-1. Datasets for Pre-training VLMs

  • ๋Œ€๋ถ€๋ถ„ ์›น์—์„œ ์ˆ˜์ง‘ํ•œ ๋Œ€๊ทœ๋ชจ ์ด๋ฏธ์ง€-ํ…์ŠคํŠธ ์Œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉ
  • ๊ธฐ์กด์˜ ์ˆ˜์ž‘์—… ๋ผ๋ฒจ๋ง๋œ ๋ฐ์ดํ„ฐ์…‹(ImageNet ๋“ฑ)๋ณด๋‹ค ํ›จ์”ฌ ํฌ๊ณ  ๋น„์šฉ์ด ์ €๋ ดํ•จ
  • ์ตœ๊ทผ์—๋Š” ์ˆ˜์‹ญ์–ต ๊ฐœ ๊ทœ๋ชจ์˜ ๋ฐ์ดํ„ฐ์…‹๋„ ๋“ฑ์žฅ
  • ์˜ˆ: LAION-5B, ALIGN ๋“ฑ

 

4-2. Datasets for VLM Evaluation

More than 40 evaluation datasets are used in total, across the various downstream tasks.


5. VISION-LANGUAGE MODEL PRE-TRAINING

5-1. VLM Pre-Training with Contrastive Objectives

Image Contrastive Learning

: used as an auxiliary objective to improve representation learning within the image modality

Image-Text Contrastive Learning

: maximizes the dot-product similarity between paired image and text embeddings, applying an InfoNCE-style loss in both directions (image→text, text→image)

Image-Text-Label Contrastive Learning

: additionally uses classification (label) information to map images, text, and labels into the same embedding space

Discussion

Two limitations exist:

(1) In contrastive learning, jointly optimizing positive and negative pairs is difficult and complex.

(2) The temperature hyperparameter $\tau$, which regulates feature discriminability, requires heuristic tuning.

 

5-2. VLM Pre-training with Generative Objectives

Masked Image Modelling (MIM)

: masks some image patches, then reconstructs them from the remaining information (MAE, BEiT)

Masked Language Modelling (MLM)

: widely used in NLP; masks some tokens in a sentence (e.g., 15%) and trains the model to predict them from the rest

Masked Cross-Modal Modelling (MCM)

: masks some image patches and text tokens simultaneously, then reconstructs both using each other's information (enables learning the joint context of vision and language)

Image-to-Text Generation (ITG)

: extracts visual information with an image encoder and feeds it to a text-generation (decoding) model to produce natural-language sentences (captions) describing the image (trains the VLM to infer fine-grained semantics from images)

Discussion

By generating or reconstructing images and text, these objectives help the model learn rich visual, linguistic, and multimodal context.

They are usually used as auxiliaries alongside other training objectives and improve zero-shot performance.

 

5-3. VLM Pre-training with Alignment Objectives

The main goal is to predict whether a given image-text pair is semantically matched.

Image-Text Matching

: judges whether the whole image and the whole text match

  • e.g., FLAVA performs binary classification of whether an image matches its caption
  • FIBER improves alignment by mining harder negative samples via pair-wise similarity

Region-Word Matching

: precisely matches object regions inside the image with individual words in the text

  • e.g., GLIP, FIBER, and DetCLIP replace each region's classification logits with region-word similarity (dot-product) in object recognition.
  • This learns alignment suited to dense prediction tasks such as object detection and semantic segmentation.

Discussion

Alignment objectives cannot learn relations within a single modality (vision or language alone).

They are therefore generally used as auxiliary losses, combined with other VLM pre-training objectives.


6. VLM TRANSFER LEARNING

Vision-Language Models (VLMs) were originally evaluated in a zero-shot manner, but transfer learning is now actively studied to better adapt them to various downstream tasks.

6-1. Motivation of Transfer Learning

Even though pre-trained VLMs have strong generalization ability, two gaps arise when applying them to real downstream tasks:

  1. Distribution gap
    • The image styles or text formats of the downstream dataset differ from the pre-training data
    • e.g., natural images during training vs. medical images during evaluation
  2. Objective gap
    • VLMs are usually trained with task-agnostic objectives for learning general concepts
    • Downstream tasks, in contrast, have specific goals such as coarse/fine-grained classification or region-level/pixel-level recognition

→ Transfer learning is needed to resolve these two problems

 

6-2. Common Setup of Transfer Learning

Three transfer-learning setups exist for reducing the domain gap when applying a Vision-Language Model (VLM) to real downstream tasks.

Supervised Transfer

  • The most traditional approach: fine-tuning on the fully labeled downstream data
  • High performance, but labeling cost is large

Few-shot Supervised Transfer

  • Fine-tuning with only a small amount of labeled data
  • Adapts effectively even with little data, giving excellent annotation efficiency

Unsupervised Transfer

  • Fine-tuning with unlabeled data only; the most challenging setup, but the most promising in terms of scalability
  • Used together with techniques such as pseudo-labeling and self-training

 

6-3. Common Transfer Learning Methods

6-3-1. Transfer via Prompt Tuning

Prompt tuning is a concept that originated in NLP: the model parameters are kept frozen and only the input (prompt) is adjusted.

For VLMs there are three approaches: text prompt tuning, visual prompt tuning, and text-visual prompt tuning.

 

 

Text Prompt Tuning

  • Instead of manual prompts (e.g., "a photo of a [class]"), uses learnable text vectors (learnable text prompts). A minimal sketch of the CoOp-style idea follows this list.
  • Representative work:
    • Extending the basic idea and addressing overfitting
      • CoOp: optimizes the context words attached to the class name ([V]1, [V]2, ..., [V]m) as learnable vectors
      • CoCoOp: generates prompts conditioned on the image (context conditioned) to prevent overfitting
      • SubPT: learns a subspace of the prompt vectors to improve generalization
      • LASP: regularizes learnable prompts against hand-crafted prompts for stability
      • VPT: learns instance-specific prompt distributions to improve generalization
      • KgCoOp: introduces a strategy to prevent forgetting textual knowledge so the model generalizes to new classes
    • Applying to diverse tasks and improving scalability
      • SoftCPT: learns multiple few-shot tasks simultaneously (multi-task prompt tuning)
      • PLOT: learns multiple prompts describing different characteristics within a category (using optimal transport)
      • DualCoOp: learns positive + negative prompts simultaneously for multi-label classification
      • TaI-DP: designs dual-level prompts that capture coarse and fine-grained representations at once
      • DenseCLIP: tunes text prompts with visual features for dense prediction tasks
      • ProTeCt: designs prompts to strengthen consistency in hierarchical classification
    • Unsupervised prompt tuning without labels
      • UPL: selects pseudo-labels and optimizes prompts via self-training
      • TPT: adaptively generates prompts at test time from just a single sample
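
A minimal sketch of the CoOp-style idea referenced above, assuming the class-name tokens are already embedded and the text encoder stays frozen (module and argument names are illustrative):

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """m learnable context vectors [V]_1..[V]_m shared across classes."""
    def __init__(self, m=16, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(m, dim) * 0.02)  # the only trained part

    def forward(self, class_token_embeds):
        # class_token_embeds: (num_classes, L, dim) embedded class-name tokens
        n = class_token_embeds.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n, -1, -1)        # (num_classes, m, dim)
        return torch.cat([ctx, class_token_embeds], dim=1)   # prompts for the text encoder
```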

 

 

Visual Prompt Tuning

  • Adds a learnable perturbation to the image input instead of the text, serving as a prompt (see the sketch after this list).
  • Representative work:
    • VP: learns a small vector $v$ added to the input image $x^I$, i.e., $x^I + v$
    • RePrompt: integrates retrieval-based information into image prompts to better adapt to downstream tasks
  • Particularly effective for dense prediction tasks because it allows pixel-level adjustment
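
A minimal sketch of VP-style visual prompting under the same assumption that the VLM stays frozen (names are illustrative):

```python
import torch
import torch.nn as nn

class VisualPrompt(nn.Module):
    """Learnable input perturbation v; the VLM itself stays frozen."""
    def __init__(self, c=3, h=224, w=224):
        super().__init__()
        self.v = nn.Parameter(torch.zeros(1, c, h, w))

    def forward(self, x):
        return x + self.v   # x^I + v, then fed to the frozen image encoder
```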

 

Text–Visual Prompt Tuning

  • Learns prompts on both the text and image sides to obtain complementary effects
  • Representative work:
    • UPT: jointly optimizes image and text prompts
    • MAPLE, CAVPT: use cross-attention to induce synergy between modalities
    • MVLPT: shares knowledge across tasks via multi-task prompt tuning

 

Discussion

Advantages

  • Very efficient because the model itself is untouched (works in black-box settings)
  • Few parameters; also suitable when intellectual property (IP) protection matters
  • Quickly adaptable to diverse tasks

Limitations

  • Low representational flexibility, since it must follow the existing VLM's representation manifold
  • Possible performance degradation on complex or fine-grained tasks

 

6-3-2. Transfer via Feature Adaptation

Feature adaptation adapts a VLM to downstream tasks by lightly adjusting the image or text embeddings themselves.
Whereas prompt tuning adjusts the input, feature adaptation can be seen as adjusting the intermediate representations (features). A hedged adapter sketch follows the list below.

Representative work and approaches

  • CLIP-Adapter [33]: adds thin linear layers (adapters) after CLIP's image/text encoders; CLIP itself stays frozen and only the adapters are trained
  • Tip-Adapter [34]: a training-free approach that directly uses the embeddings of few-shot images as adapter weights
  • SVL-Adapter [153]: introduces an additional self-supervised encoder to complement the image representation
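
A hedged sketch of a CLIP-Adapter-style module, assuming a frozen encoder produces `feat` and a residual ratio `alpha` blends adapted and original features (hyperparameter values are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP on top of a frozen CLIP feature, blended residually."""
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.ReLU(),
        )

    def forward(self, feat):
        # only the adapter trains; the encoder producing `feat` is frozen
        return self.alpha * self.mlp(feat) + (1 - self.alpha) * feat
```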

 

Discussion

  • Advantages
    • Requires only small architectural changes and adapts flexibly to diverse downstream tasks
    • Easily extends to pixel-level tasks and complex multimodal work
  • Disadvantages
    • Unlike prompt tuning, it requires direct modification of the network,
      making it hard to apply when IP protection is required or the VLM is a black box

 

6-3-3. Other Transfer Methods

Besides prompt tuning and adapters, many methods modify the model architecture itself or adjust attention. A weight-interpolation sketch follows the list below.

  • Wise-FT [162]: mixes the weights of the pre-trained and fine-tuned VLM (weight interpolation) to adapt to new tasks while retaining prior knowledge
  • MaskCLIP [163]: modifies CLIP's image encoder to extract dense features
  • VT-CLIP [157]: strengthens text attention with visual information derived from the image to improve semantic alignment
  • CALIP [158]: introduces parameter-free attention for efficient information exchange between text and image
  • TaskRes [159]: reuses the pre-trained VLM's text-based classifier while adjusting it for the downstream task
  • CuPL [160], VCD [161]: use large language models such as GPT-3 to generate rich, discriminative text prompts that strengthen VLM classification
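
A minimal sketch of Wise-FT-style weight interpolation between the zero-shot and fine-tuned checkpoints, assuming both state_dicts share identical keys and shapes:

```python
def interpolate_weights(zero_shot_sd, fine_tuned_sd, alpha=0.5):
    """Mix two state_dicts: (1 - alpha) * zero-shot + alpha * fine-tuned."""
    return {k: (1 - alpha) * zero_shot_sd[k] + alpha * fine_tuned_sd[k]
            for k in zero_shot_sd}

# usage sketch: model.load_state_dict(interpolate_weights(sd_zs, sd_ft, 0.5))
```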

 

6-3-4. Summary

Prompt tuning and feature adapters are the two most widely used approaches in VLM transfer learning.

  • Prompt tuning modifies the input text or image itself,
  • Feature adapters adjust the image/text features
  • Both freeze the original VLM parameters and introduce only a very small number of learnable parameters, making transfer highly efficient.
  • Most research focuses on few-shot supervised transfer,
    but unsupervised transfer, which learns without labels, is gaining attention with competitive performance on various tasks.

 


7. VLM KNOWLEDGE DISTILLATION

Research is now actively transferring the general-purpose representations of VLMs to dense prediction tasks
such as object detection and semantic segmentation.

7-1. Motivation of Distilling Knowledge from VLMs

Most Vision-Language Models (VLMs) are trained around image-level representations.

Many real vision tasks, however, require object-level (region-level) or pixel-level representations.

Methods are therefore needed to distill the broad, generalized vision-language knowledge of VLMs down to these finer-grained tasks.

  • VLM representations can be transferred to other detection architectures such as Faster R-CNN and DETR.

 

7-2. Common Knowledge Distillation Methods

7-2-1. Knowledge Distillation for Object Detection

Open-vocabulary detection: extends the detector beyond fixed class sets to recognize arbitrary object categories describable in text.

  • ViLD: trains a two-stage detector to match CLIP's image embedding space
  • HierKD: hierarchically distills global + local knowledge
  • RKD: learns image ↔ region alignment
  • ZSD-YOLO: self-labelling augmentation based on CLIP + pseudo labels
  • OADP: preserves proposal features and transfers contextual information
  • RO-ViT: distills at the level of bags of regions rather than individual objects
  • BARON: exploits neighborhood information
  • DetPro / PromptDet: improves alignment through region-level prompt learning
  • PB-OVD / XPM / P3OVD: uses pseudo bounding boxes/masks generated from CLIP for self-training

 

7-2-2. Knowledge Distillation for Semantic Segmentation

Existing segmentation models are often limited to base classes.

Extending VLM representations to the pixel level enables open-vocabulary pixel segmentation.

  • CLIPSeg: implements a lightweight segmentation architecture with CLIP + a transformer decoder
  • LSeg: maximizes correlation between CLIP text embeddings and pixel-level image embeddings
  • ZegCLIP: generates semantic masks with CLIP + uses a relationship descriptor to prevent overfitting
  • MaskCLIP+ / SSIW: generates CLIP-based pseudo pixel labels, then distills
  • FreeSeg: applies a mask-proposal → zero-shot classification structure
  • CLIP-ES: CAM + CLIP alignment → mitigates category confusion
  • CLIMS: generates high-quality CAMs from CLIP to improve weakly supervised performance

 

7-3. Summary and Discussion

Main distinction by task

  • Object detection: strengthens image-level ↔ object-level alignment
  • Semantic segmentation: resolves image-level ↔ pixel-level alignment

Classification by approach (a distillation-loss sketch follows)

  • Feature-space distillation: aligns embeddings between the VLM encoder and the detection/segmentation encoder
  • Pseudo-labelling distillation: uses pseudo-labels generated by a VLM such as CLIP for regularization
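
A hedged sketch of the feature-space variant, in the spirit of ViLD-style distillation: pull the detector's region embeddings toward frozen CLIP embeddings of the corresponding region crops (tensor names are illustrative):

```python
import torch.nn.functional as F

def feature_distill_loss(region_feats, clip_feats):
    # region_feats: (N, d) detector region embeddings (student)
    # clip_feats:   (N, d) frozen CLIP embeddings of the cropped regions (teacher)
    region_feats = F.normalize(region_feats, dim=-1)
    clip_feats = F.normalize(clip_feats, dim=-1)
    return F.l1_loss(region_feats, clip_feats)   # pull student toward teacher space
```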

 

VLM knowledge distillation์€ ๊ธฐ์กด transfer ๋ฐฉ์‹๋ณด๋‹ค ๋” ๋†’์€ ์œ ์—ฐ์„ฑ๊ณผ ํ™•์žฅ์„ฑ์„ ์ œ๊ณตํ•˜๋ฉฐ,

detection, segmentation์ฒ˜๋Ÿผ ๋ณต์žกํ•œ dense task์—๋„ VLM์˜ ๋ฒ”์šฉ ์ง€์‹์„ ํšจ๊ณผ์ ์œผ๋กœ ์ด์ „ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐ•๋ ฅํ•œ ์ ‘๊ทผ์ด๋‹ค.

 


8. PERFORMANCE COMPARISON

8-1. Performance of VLM Pre-training

With pre-training alone, Vision-Language Models (VLMs) achieve excellent zero-shot performance on a variety of visual recognition tasks.

Evaluated tasks

  • Image classification
  • Object detection
  • Semantic segmentation

Three key factors determining VLM performance

  1. Big Data: training on huge web-scale image-text pairs (millions to billions) covering diverse concepts improves generalization
  2. Big Model: very large models such as ViT-L and ViT-G (e.g., COCA's ViT-G has 2 billion parameters) strengthen representational capacity
  3. Task-agnostic Supervision: text is not tied to a specific task and provides diverse, rich supervision, allowing flexible handling of many tasks

 

 

Performance of VLM pre-training methods in the zero-shot prediction setup on segmentation and detection tasks

This area is still under-explored and limited, but even zero-shot performance is quite competitive.

Limitations of VLMs

(1) Performance saturation: beyond a certain point, scaling data/models yields diminishing gains
(2) Heavy resource consumption: large-scale training requires hundreds of GPUs and hundreds of hours of compute
(e.g., CLIP ViT-L required 256 V100 GPUs for 288 hours)
(3) Inference burden: large models incur heavy memory and compute costs at inference as well as training

 

 

8-2. Performance of VLM Transfer Learning

VLMs can do zero-shot inference from pre-training alone, but a domain gap may exist on real tasks.
Various forms of transfer learning are used to address this.

This section compares the performance of three setups:

  • Supervised Transfer
  • Few-shot Supervised Transfer
  • Unsupervised Transfer

Key conclusions

1. All transfer setups reduce the domain gap and improve performance

  • A domain gap can exist between VLM pre-training data and downstream tasks
  • Transfer learning, with or without labels, effectively reduces this gap
  • → supervised, few-shot, and unsupervised setups all contribute to performance gains
    • Supervised: Wise-FT +10.9%
    • Few-shot: CoOp (16-shot) +1.7%
    • Unsupervised: TPT +0.8%

 

2. Few-shot transfer lags behind supervised transfer

  • e.g.:
    • Wise-FT (supervised) → 87.1%
    • CuPL (few-shot) → 76.6%
  • Reasons:
    • With few labels, overfitting occurs easily
    • The VLM's generalization ability can be constrained

3. Unsupervised transfer can match or beat few-shot transfer

  • Examples:
    • UPL (unsupervised) > 2-shot CoOp (+0.4%)
    • TPT (unsupervised) โ‰’ 16-shot CoOp
  • Reasons:
    • More data can be used without labels → better generalization
    • Lower overfitting risk
  • Drawbacks:
    • Pseudo-labels can be noisy
    • Still an early-stage, immature line of research

 

8-3. Performance of VLM Knowledge Distillation

Knowledge distillation consistently improves performance.

  • Most detection and segmentation models improve over their baselines
  • This comes from exploiting the VLM's generalized representations rather than from mere architectural improvements

 

 

8-4. Summary

1) VLM Pre-training:

Very effective for classification-centric tasks (image classification), but still insufficient for dense tasks (detection/segmentation).

Strengths/limitations
- Thanks to well-designed pre-training objectives, zero-shot image classification performance is excellent
- Region/pixel-level dense tasks (detection/segmentation) remain under-explored
- COCA, FILIP, CLIP, and others achieve strong performance across many tasks
- Fair comparison is difficult because pre-training data, models, and tasks differ across papers

2) VLM Transfer Learning:

Consistent performance gains across diverse backbones and datasets, but label dependence is still high, and unsupervised transfer is neglected.

Strengths/limitations
- High performance across backbones such as ResNet, ViT, and Transformer
- Supervised/few-shot setups require labels
- Most experiments use the same pre-trained model and downstream tasks, making reproduction and benchmarking easy
- Unsupervised transfer is promising but still under-studied

3) VLM Knowledge Distillation:

Efficiently transfers generalized knowledge to complex tasks, but benchmarking is difficult due to backbone diversity.

Strengths/limitations
- Improves performance by combining with task-specific architectures such as Faster R-CNN and DETR
- Backbones and architectures differ across methods (e.g., ViLD vs. OV-DETR) → consistent comparison is difficult
- CLIP's knowledge is effectively exploited in detection/segmentation as well
- Results vary widely with the downstream architecture and training scheme

 


9. FUTURE DIRECTIONS

Vision-Language Models (VLMs) have shown remarkable results so far,
but clear challenges remain before their full potential is realized.

9-1. Key challenges for VLM pre-training

(1) Fine-grained vision-language correlation modelling

  • Existing VLMs mostly learn alignment between whole images and text.
  • Dense prediction tasks such as object detection and semantic segmentation, however,
    require precise region- or pixel-level alignment.
  • Research here is still scarce, so fine-grained learning for zero-shot dense prediction is needed.

(2) Unification of vision and language learning

  • Existing VLMs process images and text with separate encoders.
  • The Transformer has opened the possibility of unifying images and text under the same token-based processing.
  • Multimodal learning within a single network could improve both efficiency and effectiveness.

(3) Multilingual VLM pre-training

  • Most current VLMs are trained on English only.
  • This can induce cultural/regional bias and limits applicability in multilingual settings.
  • Training with text in many languages would yield models that cover diverse linguistic expressions and culture-specific visual characteristics.

(4) Data-efficient VLMs

  • Most VLMs are trained with hundreds of millions of image-text pairs and enormous compute.
  • For sustainability, efficient VLMs that can learn from less data are needed.
  • e.g., exploiting more elaborate supervision such as inter-image relations and comparisons between pairs

(5) Pre-training enhanced by LLMs

  • Recent work uses LLMs (e.g., GPT) to augment text descriptions and thereby
    improve the VLM's language understanding.
  • More active integration of LLMs into VLM pre-training is expected.

 

9-2. Key challenges for VLM transfer learning

(1) Unsupervised transfer

  • Most research focuses on supervised or few-shot transfer,
    but the unsupervised setting is highly promising: no labeling cost and lower overfitting risk.
  • More sophisticated pseudo-label generation and self-training methods are needed.

(2) Transfer via visual prompts/adapters

  • Most existing work focuses on text prompts
  • Visual prompts or adapters, which adjust the image input itself,
    can be more effective for tasks requiring pixel-level adjustment.
  • This area has been largely neglected so far and calls for active research.

(3) Test-time transfer

  • Conventional transfer learning requires fine-tuning per task, incurring repeated training cost.
  • In contrast, adapting prompts at test time (test-time prompt tuning)
    enables much more efficient multi-task handling.

(4) Automatic prompt generation with LLMs

  • Instead of manual design, approaches are emerging that use LLMs (e.g., GPT) to automatically generate prompts suited to the downstream task.
  • Since this works almost without labels, it is very promising as a low-cost transfer method.

 

9-3. Key challenges for VLM knowledge distillation

(1) Distilling knowledge from multiple VLMs

  • Distilling the representations of several VLMs in a combined, complementary manner can yield synergy

(2) Extending to other vision tasks

  • Currently mostly limited to object detection and semantic segmentation
  • Distillation research can expand to instance segmentation, panoptic segmentation, person re-ID,
    and other vision tasks.