[24’ NeurIPS] ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models

윰갱 2025. 5. 31. 11:24

0. Abstract

ControlMLLM: making visual referring possible in MLLMs without any training

ControlMLLM is a new method for injecting visual prompts (boxes, masks, points, etc.) into multimodal large language models (MLLMs) without additional training.
The key is to exploit the attention mechanism: the visual tokens are optimized only at test time so that the text tokens attend to the visual region the user points to.

  • No additional training: applicable without fine-tuning
  • Precise region referring: controlling attention improves referring performance
  • Generality: supports a variety of input formats, and generalization is verified

1. Introduction

Recent Multimodal Large Language Models (MLLMs) process text and images together and are used across a wide range of applications. Most existing MLLMs, however, rely on coarse image-level alignment, which aligns information at the level of the whole image.

As a result, when a fine-grained region description or reasoning is needed,
the user has to rely on the text prompt alone, and text cannot fully capture the intricate visual nuances in an image.

 

 

To compensate, recent work lets the user directly provide a visual referring input such as a box, point, or mask. This enables more precise vision-language alignment, but the model has to be trained again, which reduces flexibility: every time the domain or the model architecture changes, retraining is needed once more.

To solve this problem, the paper proposes a new method that injects visual prompts into an MLLM without additional training.
The core idea builds on the observation that the attention maps in the MLLM decoder model the relationship between text tokens and visual information in fine detail.

In existing MLLMs, the features from the visual encoder are projected into the language space by an MLP, and the resulting visual tokens are then used in the decoder's attention. The output of the MLP therefore indirectly controls which visual information each text token attends to.

The authors exploit this structural property: they add a learnable latent variable to the visual tokens and optimize it only at test time with an energy-based objective, which steers the model's attention toward the user-specified visual region in the attention map.

Without any fine-tuning or architectural change, the method supports visual inputs in several forms (box, mask, scribble, point), and it is competitive with existing approaches in terms of domain generalization and interpretability.


2. Related Works

Visual Prompt

Hard Visual Prompt: directly manipulates the image to focus the model's attention on a particular part

ex) highlighting a region of the image with color (color guidance) / clicking a location with the mouse or providing a bounding box

Pros) Training-free: usable without any extra training

Cons) The image structure can be damaged (e.g. color adjustment distorts the original information) / the model must be able to understand the visual cue → depends on the comprehension ability of the existing model

Soft Visual Prompt: uses learnable vectors or tokens derived from the image as the prompt

Pros) Integrates flexibly into the model / can be fine-tuned for various downstream tasks

Cons) Requires fine-tuning (needs downstream task data) / cannot give an explicit region cue the way a hard prompt can

Item | Hard Visual Prompt | Soft Visual Prompt | This work (Latent Prompt)
Input method | Direct image manipulation | Learnable visual tokens | Latent vector optimization
Training required | X (training-free) | O (required) | X (test-time only)
Region guidance | Possible | Difficult | Possible
Structure preservation | May be damaged | Preserved | Preserved

 

4. Method

4.1 Analysis of the Attention in LVLMs

This section analyzes the question: "What factors influence the relationship between input and output?"

An MLLM is a model that predicts the most likely output given a visual prompt and a text prompt.

Conditioned on the text prompt, the model can judge which parts of the image matter for the output.

This is supported by the top row of Figure 2, which shows, for each attention layer, where the visual tokens are most strongly activated for the text token "hat".

In conclusion, the attention map (1) makes it possible to interpret which pixels of the image the model's output is related to, and (2) can further be used to guide what the model outputs.

 

The idea is that the model's output can be changed by modifying the attention map.

A weight is added directly to the model's attention map to deliberately raise the importance of a particular visual region (its visual tokens).

Equation (2) is the original MLLM formulation, and Equation (4) is the formulation from the paper.

Think of it as specifying a region of interest (r) and adding a binary mask that boosts the attention scores in that region by η and is 0 everywhere else.
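
The effect of this additive boost can be illustrated with a small sketch. It assumes the boost is applied to the pre-softmax scores of one text token over the visual tokens. The function name `edit_attention`, the toy scores, and the η values are hypothetical and are not taken from the paper's code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def edit_attention(scores, region_mask, eta):
    """Add eta to the scores of visual tokens inside the region (mask == 1).

    Assumption: the boost is applied before the softmax.
    """
    boosted = [s + eta * m for s, m in zip(scores, region_mask)]
    return softmax(boosted)

def mass_in_region(attn, region_mask):
    """Total attention that lands on the referred region."""
    return sum(a for a, m in zip(attn, region_mask) if m)

# Toy scores of one text token over five visual tokens (hypothetical values).
scores = [0.2, 1.5, 0.1, 0.3, 0.4]
region_mask = [0, 0, 1, 1, 0]  # tokens 2 and 3 form the referred region r

original = edit_attention(scores, region_mask, eta=0.0)
small = edit_attention(scores, region_mask, eta=0.5)
large = edit_attention(scores, region_mask, eta=10.0)

for name, attn in [("eta=0", original), ("eta=0.5", small), ("eta=10", large)]:
    print(name, round(mass_in_region(attn, region_mask), 4))
```

In this toy setting a small η leaves most of the attention outside the region, while a very large η pushes nearly all of it into the region and drowns out every other token.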

 

Naturally, how η is set matters here.

Figure 3 shows the attention for different values of η.

If η is too small, the result is wrong, as in Figure-3a; if it is too large, the language ability of the LLM itself can break down, as in Figure-3c.

In addition, the text tokens have the strongest influence at step 0 of inference (the first decoding step), so the paper suggests adjusting the attention map at that point.

Adjusting step by step (Figure-3d) also degrades the LLM itself, just as in Figure-3c.

 

๋Œ€๋ถ€๋ถ„์˜ MLLM์—์„œ๋Š” MLP layer๊ฐ€ image-text alignment๋ฅผ ํ•™์Šตํ•œ๋‹ค.

๋‹ค์‹œ ๋งํ•ด LLM์•ˆ์— ๋“ค์–ด๊ฐ€๋Š” visual token์ด attention map์˜ ๊ฐ’์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

(๋ฌผ๋ก  text token๋„ ์˜ํ–ฅ์„ ๋ฏธ์น˜์ง€๋งŒ, ์ด๋ฏธ์ง€์™€ ์ถœ๋ ฅ ์‚ฌ์ด์˜ ๊ด€๊ณ„์„ฑ์„ ๋ถ„์„ํ•˜๊ณ ์ž text token์˜ ์˜ํ–ฅ์€ ๊ณ ๋ ค ๋Œ€์ƒ์—์„œ ์ œ์™ธ)

 

 

4.2 Manipulating Attention via Latent Variable Learning

Building on 4.1, the authors optimize a learnable latent variable through an energy function.

The next question is which attention map to use.

The first option is to use the attention maps between every individual text token and all visual tokens.

However, the visual tokens are strongly associated with only a few tokens of the text prompt, so using every attention map is computationally inefficient.

Finding those few important text tokens is also not easy.

The authors therefore average-pool the attention maps produced for the text prompt tokens to form a global context token.

As Figure 2 shows, the context token built this way has an attention distribution similar to that of the important text token "hat".

 

The method also supports four kinds of referring shape (box, mask, scribble, point).

Two energy functions are used: a hard mask-based energy function for box and mask, and a soft mask-based energy function for scribble and point.

 

Hard Mask-based Energy Function (box, mask)

A latent vector $p_v$ with the same dimensions as $e_v$ is created, initialized to 0, and added to $e_v$.

The attention maps between the context token and the new visual tokens are computed for $N$ attention layers.

The box or mask is converted into a binary mask.

The $N$ attention maps are then average-pooled, and the mask-based energy function is computed from the pooled attention map and the mask.
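
The steps above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the energy is taken to be the squared share of attention that falls outside the region, the "model" is a single dot-product attention between one context query and the visual tokens, and gradients are computed by finite differences instead of backpropagation through an LLM. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hard_mask_energy(attn, mask):
    """Assumed form: (1 - attention inside region / total attention) ** 2."""
    inside = sum(a for a, m in zip(attn, mask) if m)
    return (1.0 - inside / sum(attn)) ** 2

def attention(query, visual_tokens, latent):
    """Toy single-layer attention of one context query over e_v + p_v."""
    scores = [
        sum(q * (e + p) for q, e, p in zip(query, e_row, p_row))
        for e_row, p_row in zip(visual_tokens, latent)
    ]
    return softmax(scores)

def energy_of(latent, query, visual_tokens, mask):
    return hard_mask_energy(attention(query, visual_tokens, latent), mask)

def optimize_latent(query, visual_tokens, mask, steps=20, lr=1.0, eps=1e-5):
    """Gradient descent on p_v with central finite-difference gradients."""
    latent = [[0.0] * len(row) for row in visual_tokens]  # p_v starts at zero
    for _ in range(steps):
        grads = [[0.0] * len(row) for row in latent]
        for i, row in enumerate(latent):
            for j in range(len(row)):
                saved = row[j]
                row[j] = saved + eps
                up = energy_of(latent, query, visual_tokens, mask)
                row[j] = saved - eps
                down = energy_of(latent, query, visual_tokens, mask)
                row[j] = saved
                grads[i][j] = (up - down) / (2 * eps)
        latent = [
            [p - lr * g for p, g in zip(p_row, g_row)]
            for p_row, g_row in zip(latent, grads)
        ]
    return latent

query = [1.0, 0.0]
visual_tokens = [[0.5, 0.1], [0.2, 0.3], [0.1, -0.2], [0.4, 0.0]]
mask = [0, 1, 0, 0]  # binary mask: only visual token 1 lies inside the region

zero_latent = [[0.0, 0.0] for _ in visual_tokens]
energy_before = energy_of(zero_latent, query, visual_tokens, mask)
learned = optimize_latent(query, visual_tokens, mask)
energy_after = energy_of(learned, query, visual_tokens, mask)
print(energy_before, energy_after)
```

Only the latent is updated and the model weights are untouched, which is what makes the procedure training-free. Because the latent starts at zero, the model's behavior before the first update is exactly the unmodified one.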

 

 

Soft Mask-based Energy Function (scribble, point)

SAM์„ ์ด์šฉํ•ด Hard Mask-based Energy function์— ์‚ฌ์šฉํ•  mask๋ฅผ ๊ตฌํ•œ๋‹ค

์ด๊ฒŒ inference cost๋ฅผ ์ฆ๊ฐ€์‹œํ‚ค๋ฏ€๋กœ, distance matrix $D$์— ์˜ํ•œ optimal soft mask-based energy function์„ ์ œ์•ˆํ•œ๋‹ค.

$D$๋Š” scribble์ด๋‚˜ point์— ๋Œ€ํ•ด OpenCV distanceTransform function์„ ์‚ฌ์šฉํ•ด์„œ ์–ป๋Š”๋‹ค.
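
To show what the distance matrix contains, the sketch below computes it by brute force in pure Python for a tiny grid. In practice `cv2.distanceTransform` would be called on an image in which the scribble or point pixels are zero, since that function returns each pixel's distance to the nearest zero pixel. The Gaussian falloff used to turn $D$ into a soft mask, the `sigma` value, and the helper names are assumptions made for illustration, not the paper's exact formula.

```python
import math

def distance_matrix(height, width, marked):
    """Euclidean distance from every cell to the nearest marked cell."""
    return [
        [min(math.hypot(y - my, x - mx) for my, mx in marked)
         for x in range(width)]
        for y in range(height)
    ]

def soft_mask(distances, sigma=1.0):
    """Map distances to weights in (0, 1]: 1 on the mark, decaying with distance.

    Assumption: Gaussian falloff, chosen only for illustration.
    """
    return [
        [math.exp(-(d * d) / (2 * sigma * sigma)) for d in row]
        for row in distances
    ]

# A single clicked point at row 1, column 2 on a 3x4 grid of visual tokens.
D = distance_matrix(3, 4, marked=[(1, 2)])
M = soft_mask(D, sigma=1.0)

for row in M:
    print([round(w, 3) for w in row])
```

Unlike a binary mask, the resulting weights decay smoothly away from the scribble or point, so no segmentation model is needed to decide where the region ends.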

 

 


5. Experiments

5.2 Applications

Referring with Different Visual Prompts. & Impact on Hallucinations.

๋‹ค์Œ ๋ฐฉ๋ฒ•์„ ์ ์šฉํ–ˆ์„ ๋•Œ ๋Œ€๋‹ต์„ ๋” ์ •ํ™•ํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

Out-of-Domain Task.

 

 

5.3 Comparisons

Comparison on Referring Object Classification Task.

Comparison on Referring Text Classification Task.

 

Why + Blur, + Color, and + Edit Att were added

LLaVA + Blur: comparison against an upper bound (ideal condition)

  • The background is blurred so that the model naturally focuses only on the region of interest
  • Performance is high → shows that LLaVA does well as long as it knows exactly "what to look at"
  • However, this is not a realistic method, and it is impossible unless the actual region is specified

--> Presents an upper bound: "LLaVA works well as long as it gets visual help"


LLaVA + Color: comparison against another hard visual prompt method

  • The region of interest is highlighted with color (e.g. a red box)
  • Visually simple, but frequently used in prior work
  • Reasonably effective when combined with LLaVA's basic representational ability, but
    → has limitations such as loss of structural information and unstable interpretation

--> Included to show that the proposed method guides attention more precisely and more stably than a color prompt


LLaVA + Edit Att: comparison against the earlier direct attention manipulation (baseline)

  • The method of Equation (4) and Figure-3b
  • Manipulates attention directly by adding η to the attention scores
  • This is the approach the paper criticizes:
    • It is overly intrusive
    • It damages the model's representational ability
    • The side effects are visible in Figure 3(b, c, d)

--> Included to show experimentally the limits of editing attention directly

 

 


6. Limitations

1) inference overhead

: tools like Ollama might be able to help with this

2) Applicable only to white-box models, and tied to the performance of the model itself

: unavoidable since the method is training-free, but it can be applied to the various foundation models still to come

3) Works only with a single visual prompt, i.e. a single region

: multiple regions can be handled as future work

4) Because the optimization strategy is simple, the choice of text prompt affects the result, and this was not properly addressed

: to be analyzed later as well

 

 


🤔

In a way, a similar idea had come up in a t2i project I took part in earlier, where the attention map was adjusted with optimal transport

It worked well in diffusion, so I wondered whether it is just harder here.. I was thinking of trying to adjust the distribution, though