📚 Study/Paper Review 29

[25’ CVPR] PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

# Introduction
As AI systems as capable as ChatGPT gain the ability to understand images, Large Vision-Language Models (LVLMs) are emerging at the center of deep learning research. They show remarkable performance across a wide range of vision-language tasks such as image- or video-based question answering, caption generation, and document understanding. But as LVLMs move closer to real-world deployment, one major obstacle appears: their computational cost is extremely high. Images and video are far more continuous, higher-resolution, and information-dense than text, yet they are also highly redundant, so processing every bit of that information is inefficient. For example, even a modest increase in resolution makes the number of visual tokens surge from thousands to tens of thousands..

[25’ ICML] SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

# Introduction
Thanks to recent advances in LLMs (large language models), vision-language models (VLMs) that understand images and text together are also growing rapidly. Reading information out of an image and then answering questions or generating descriptions now feels quite natural. These VLMs typically work by splitting an image into many visual tokens and feeding them into the LLM alongside the text. The problem is that feeding in many visual tokens makes compute and memory usage explode. For example, in the LLaVA model a 672×672 image produces as many as 2304 visual tokens, and these tokens alone take up more than half of the entire input. Unlike text, however, image information is less dense and ..
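The token counts quoted in this preview follow directly from the ViT patch grid. A minimal sketch, assuming a patch size of 14 (as in CLIP ViT-L/14, the encoder LLaVA uses; the function name is ours):

```python
# One visual token per non-overlapping patch of a square input image.
# Patch size 14 is assumed here (CLIP ViT-L/14, LLaVA's vision encoder).
def num_visual_tokens(resolution: int, patch_size: int = 14) -> int:
    side = resolution // patch_size  # patches per side of the grid
    return side * side               # total patches = total visual tokens

print(num_visual_tokens(336))  # 576 tokens, the standard LLaVA input
print(num_visual_tokens(672))  # 2304 tokens, the 672x672 case cited above
```

This is why a "modest" resolution bump is so costly: doubling the side length quadruples the token count, and attention cost grows faster still.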

[25’ AAAI] Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

# Introduction์š”์ฆ˜ ์ด๋ฏธ์ง€์™€ ํ…์ŠคํŠธ๋ฅผ ๋™์‹œ์— ์ดํ•ดํ•˜๋Š” ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ LLM(Multimodal Large Language Models, MLLMs)์ด ์—„์ฒญ๋‚˜๊ฒŒ ์ฃผ๋ชฉ๋ฐ›๊ณ  ์žˆ๋‹ค.์ด๋ฏธ์ง€ ์„ค๋ช…, ์›น ํƒ์ƒ‰, ๋ฌธ์ œ ํ’€์ด ๋“ฑ ๋‹ค์–‘ํ•œ ์ž‘์—…์„ ํ•˜๋‚˜์˜ ๋ชจ๋ธ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. MLLM ๋ฌธ์ œ: visual token์ด ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ๊ณ„์‚ฐ ๋ณต์žก๋„๋„ ์ฆ๊ฐ€์ด๋Ÿฐ MLLM์€ ๋ณดํ†ต ์ด๋ฏธ์ง€์—์„œ ์ถ”์ถœํ•œ ์ •๋ณด๋ฅผ visual token ํ˜•ํƒœ๋กœ ๋ฐ”๊ฟ” ํ…์ŠคํŠธ์™€ ํ•จ๊ป˜ LLM์— ๋„ฃ๋Š”๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด LLaVA ๋ชจ๋ธ์€ ์ด๋ฏธ์ง€๋ฅผ 576๊ฐœ patch๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ๊ฐ์„ visual token์œผ๋กœ ๋ณ€ํ™˜ํ•ด ์‚ฌ์šฉํ•œ๋‹ค. ๋ฌธ์ œ๋Š” ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ณ„์‚ฐ๋Ÿ‰์ด ๊ธ‰๊ฒฉํžˆ ๋Š˜์–ด๋‚œ๋‹ค๋Š” ์ ์ด๋‹ค.์‹ค์ œ๋กœ ํ…์ŠคํŠธ๋งŒ ์‚ฌ์šฉํ•  ๋•Œ๋ณด๋‹ค 6๋ฐฐ ์ด์ƒ์˜ ๊ณ„์‚ฐ๋น„์šฉ(FLOPs)์ด ๋“œ๋Š” ๊ฒฝ์šฐ๋„ ..

[24’ ECCV Oral] An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

# Introduction
AI models like ChatGPT and Gemini are moving beyond understanding text alone and steadily strengthening their multimodal ability to interpret images as well. LVLMs (Large Vision-Language Models) that process text and visual information together enable image captioning, web browsing, smartphone operation, and even decision-making in the real world.

The LVLM problem: computational complexity grows with the number of visual tokens
These LVLMs typically convert an image into hundreds to thousands of visual tokens and feed them into the LLM together with the text prompt. But this design has one critical drawback: its computational complexity..

[25’ ICCV] [CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster

# Introduction์š”์ฆ˜ ๋Œ€์„ธ์ธ ChatGPT, Gemini, Claude ๊ฐ™์€ AI๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๋ฅผ ํ•จ๊ป˜ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€ํ˜• ๋น„์ „-์–ธ์–ด ๋ชจ๋ธ(VLM) ๋•๋ถ„์— ์ ์  ๋” ๋˜‘๋˜‘ํ•ด์ง€๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํฅ๋ฏธ๋กœ์šด ์‚ฌ์‹ค์ด ํ•˜๋‚˜ ์žˆ๋‹ค. ๋ฌธ์ œ์ : ์ง€๋‚˜์น˜๊ฒŒ ๋งŽ์€ visual tokenVLM์˜ ์ž…๋ ฅ ์‹œํ€€์Šค์—์„œ visual token์ด ์ฐจ์ง€ํ•˜๋Š” ๋น„์ค‘์ด ๋งค์šฐ ๋†’์€๋ฐ, ๊ฑฐ์˜ 90%์— ๋‹ฌํ•œ๋‹ค. ์ด๋กœ ์ธํ•ด ๊ณ„์‚ฐ ๋ณต์žก๋„(computational complexity)์™€ ์ถ”๋ก  ๋น„์šฉ(inference cost)์ด ๊ธ‰๊ฒฉํžˆ ์ฆ๊ฐ€ํ•œ๋‹ค.์ผ๋ถ€ ์—ฐ๊ตฌ๋“ค์€ ์ž…๋ ฅ ์ด๋ฏธ์ง€ ํ•ด์ƒ๋„๋ฅผ ๋†’์—ฌ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋ ค ํ–ˆ์ง€๋งŒ, ์ด ์—ญ์‹œ visual token์˜ ๊ฐœ์ˆ˜๋ฅผ ์ฆ๊ฐ€์‹œ์ผœ ์˜คํžˆ๋ ค ๊ณ„์‚ฐ ๋น„์šฉ์„ ํ‚ค์šฐ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋‚ณ์•˜๋‹ค. ํŠนํžˆ Video-LLaVA ๊ฐ™์€ ๋น„๋””์˜ค ๊ธฐ๋ฐ˜..

[25' ICML] Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

# IntroductionVLM์€ ์™œ “์ฑ…์ด ์ด›๋ถˆ ๋’ค์— ์žˆ๋‹ค”๋Š” ๊ฒƒ๋„ ์ž˜ ๋ชจ๋ฅผ๊นŒ? – ADPATVIS์˜ ๋“ฑ์žฅ์š”์ฆ˜ ๋Œ€์„ธ์ธ ChatGPT, Gemini, Claude ๊ฐ™์€ AI๋Š” ํ…์ŠคํŠธ์™€ ์ด๋ฏธ์ง€๋ฅผ ํ•จ๊ป˜ ์ดํ•ดํ•˜๋Š” ๋Œ€ํ˜• ๋น„์ „-์–ธ์–ด ๋ชจ๋ธ(VLM) ๋•๋ถ„์— ์ ์  ๋˜‘๋˜‘ํ•ด์ง€๊ณ  ์žˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํ•œ ๊ฐ€์ง€ ํฅ๋ฏธ๋กœ์šด ์‚ฌ์‹ค์ด ์žˆ๋‹ค.์•„๋ฌด๋ฆฌ ์„ฑ๋Šฅ ์ข‹์€ VLM์ด๋ผ๋„ "์ด›๋ถˆ ๋’ค์— ์ฑ…์ด ์žˆ๋‹ค" ๊ฐ™์€ ๊ฐ„๋‹จํ•œ ๊ณต๊ฐ„ ๊ด€๊ณ„์กฐ์ฐจ ์ž์ฃผ ํ‹€๋ฆฐ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.๊ทธ ์ด์œ ๋Š” ๋ญ˜๊นŒ? ๋ฌธ์ œ: AI๋Š” ๊ณต๊ฐ„ ์ถ”๋ก (Spatial Reasoning)์— ์•ฝํ•˜๋‹ค์‚ฌ๋žŒ์—๊ฒŒ๋Š” ๋งค์šฐ ์‰ฌ์šด ์ผ์ด๋‹ค.๊ทธ๋ฆผ์„ ๋ณด๊ณ  “์ด ๋ฌผ์ฒด๊ฐ€ ์ € ๋ฌผ์ฒด์˜ ์™ผ์ชฝ์— ์žˆ๋‹ค”๊ณ  ๋งํ•˜๋Š” ๊ฒƒ ๋ง์ด๋‹ค.ํ•˜์ง€๋งŒ ๋Œ€ํ˜• VLM๋“ค์€ ์ด๋Ÿฐ **๊ธฐ๋ณธ์ ์ธ ๊ณต๊ฐ„ ๊ฐœ๋…(“left”, “right”, “behind”, “above” ๋“ฑ)*..

[25' CVPR] DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

# Introduction
DiTCtrl: natural video generation where the scene changes with each prompt, with no training required?
Video-generation AI keeps getting smarter. Text-to-Video (T2V) models that generate an entire video from text have already drawn huge attention through models like Sora. But one important problem remains unsolved: generating "a natural long video whose scenes change across multiple prompts."

Why is multi-prompt video generation hard?
Most current T2V models are trained to generate one short video from one prompt. So feeding several prompts in sequence produces choppy, discontinuous video, and even merging the prompts into one yields unnatural scene transitions..

[25’ NeurIPS] Slot-VLM: SlowFast Slots for Video-Language Modeling

# Introduction
Vision-Language >>> Video-Language
Image-level VLMs have made great strides with models like MiniGPT-4 and LLaVA, which use various structures (Q-Former, projection layers, etc.) to align image features well with text. But "video" is another story. An image can carry its information in a single frame, whereas a video consists of many consecutive frames along the time axis. So for video understanding, a common approach is to sample multiple frames and feed the features extracted from each frame into the language model; a representative version simply extracts tokens per frame and stacks them directly into the LLM. But this approach has a critical limitation. For example, details..

[25' CVPR Highlights] MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

# Introduction๋‹ค์Œ ๋…ผ๋ฌธ์€ Action-Scene Hallucination ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆ๋˜์—ˆ๋‹ค.์ด๋Š”, ๋ชจ๋ธ์ด ์žฅ๋ฉด์„ ์ž˜๋ชป ํ•ด์„ํ•˜๊ฑฐ๋‚˜, ๊ด€์ฐฐ๋œ ํ–‰๋™์„ ๊ธฐ๋ฐ˜์œผ๋กœ scene์„ ์ž˜๋ชป ์ถ”๋ก ํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋„์„œ๊ด€์—์„œ ๋ณต์‹ฑ์„ ํ•˜๋Š” ์˜์ƒ์„ ๋ณด์—ฌ์ฃผ๋ฉด, ๋ชจ๋ธ์€ ์ด๋ฅผ ์‹ค์ œ ๋ณต์‹ฑ ๊ฒฝ๊ธฐ์žฅ(boxing ring)์œผ๋กœ ์ž˜๋ชป ์ธ์‹ํ•œ๋‹ค. ์ด๋Š” ‘๋ณต์‹ฑ’์ด๋ผ๋Š” ๋™์ž‘๋งŒ ๋ณด๊ณ  ๊ทธ์— ๋งž๋Š” ์ „ํ˜•์ ์ธ ์žฅ์†Œ๋ฅผ ์ƒ์ƒํ•ด๋ฒ„๋ฆฌ๋Š” ์˜ค๋ฅ˜์ด๋‹ค.๋˜ ๋‹ค๋ฅธ ์˜ˆ๋กœ, ๋ˆˆ ๋ฎ์ธ ์‚ฐ์— ์•„๋ฌด๋„ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ์˜์ƒ์„ ์ œ์‹œํ–ˆ์„ ๋•Œ, ๋ชจ๋ธ์€ ์‹ค์ œ๋กœ ๋ณด์ด์ง€๋„ ์•Š๋Š” ์Šคํ‚ค ํƒ€๋Š” ์‚ฌ๋žŒ์ด๋‚˜ ์Šค๋…ธ๋ณด๋”๊ฐ€ ์žˆ๋‹ค๊ณ  ์ž˜๋ชป ์˜ˆ์ธกํ•˜๊ธฐ๋„ ํ•œ๋‹ค. ์ €์ž๋“ค์€ Video-LLM์—์„œ์˜ action-scene hallucination๋ฌธ์ œ๊ฐ€ ์•„๋ž˜ ๋‘ ํ•œ๊ณ„ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋ณด๊ณ ..

[25’ ICLR] VIDEOGRAIN: MODULATING SPACE-TIME ATTENTION FOR MULTI-GRAINED VIDEO EDITING

# Introduction๋…ผ๋ฌธ์—์„œ๋Š” multi-grained video editing์ด๋ผ๋Š” ๊ฐœ๋…์„ ์†Œ๊ฐœํ•œ๋‹ค. ์ด๋Š” ํŽธ์ง‘์˜ ์„ธ๋ฐ€ํ•œ ์ˆ˜์ค€์— ๋”ฐ๋ผ class-level, instance-level, part-level์˜ ์„ธ ๊ฐ€์ง€๋กœ ๊ตฌ๋ถ„๋œ๋‹ค. (Figure 2 ์™ผ์ชฝ)Class-level editing์€ ๋™์ผํ•œ ํด๋ž˜์Šค ๋‚ด์—์„œ ๊ฐ์ฒด๋ฅผ ๊ต์ฒดํ•˜๋Š” ์ž‘์—…์„ ์˜๋ฏธํ•œ๋‹ค.Instance-level editing์€ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฐ์ฒด ์ธ์Šคํ„ด์Šค๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ณ  ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.Part-level editing์€ ์ƒˆ๋กœ์šด ๊ฐ์ฒด๋ฅผ ์ถ”๊ฐ€ํ•˜๊ฑฐ๋‚˜ ๊ธฐ์กด ๊ฐ์ฒด์˜ ์†์„ฑ(attribute)์„ ๋ถ€๋ถ„์ ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๋Š” ์ž‘์—…์„ ํฌํ•จํ•œ๋‹ค. ๊ธฐ์กด์˜ ๋ฐฉ๋ฒ•๋“ค์€ instance-agnostic(๊ฐ์ฒด๋ฅผ ๊ตฌ๋ถ„ํ•˜์ง€ ๋ชปํ•จ)ํ•˜๊ธฐ์— editing์„ ํ•  ๋•Œ ์„œ๋กœ ๋‹ค๋ฅธ instance์˜..