
[25' CVPR Highlights] MASH-VLM: Mitigating Action-Scene Hallucination in Video-LLMs through Disentangled Spatial-Temporal Representations

์œฐ๊ฐฑ 2025. 7. 3. 18:00

# Introduction

๋‹ค์Œ ๋…ผ๋ฌธ์€ Action-Scene Hallucination ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ œ์•ˆ๋˜์—ˆ๋‹ค.

์ด๋Š”, ๋ชจ๋ธ์ด ์žฅ๋ฉด์„ ์ž˜๋ชป ํ•ด์„ํ•˜๊ฑฐ๋‚˜, ๊ด€์ฐฐ๋œ ํ–‰๋™์„ ๊ธฐ๋ฐ˜์œผ๋กœ scene์„ ์ž˜๋ชป ์ถ”๋ก ํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ์˜๋ฏธํ•œ๋‹ค.

For example, when shown a video of someone boxing in a library, the model incorrectly recognizes the location as an actual boxing ring. This is an error where the model sees only the action of "boxing" and imagines the stereotypical place that goes with it.
As another example, when given a video of a snow-covered mountain with no one in it, the model may wrongly predict that there are skiers or snowboarders who are not actually visible.

 

 

์ €์ž๋“ค์€ Video-LLM์—์„œ์˜ action-scene hallucination๋ฌธ์ œ๊ฐ€ ์•„๋ž˜ ๋‘ ํ•œ๊ณ„ ๋•Œ๋ฌธ์ด๋ผ๊ณ  ๋ณด๊ณ  ์žˆ๋‹ค.

์ฒซ๋ฒˆ์งธ ์›์ธ์€ sptial๊ณผ temporal feature์˜ ์–ฝํž˜์ด๋‹ค.

Existing Video-LLMs often fail to effectively separate spatial and temporal information and instead process them in a mixed manner.
This entanglement makes the model learn spurious correlations between actions and scenes in a video, which in turn leads to incorrect reasoning.

Some works try to disentangle spatial and temporal features at the input stage to address this problem. However, most of them focus only on preprocessing before the tokens enter the LLM, and inside the LLM the two kinds of information become entangled again during the attention computation.

Indeed, as shown in Figure 2(a), information that is separated at the input is re-entangled inside the model, ultimately causing action-scene hallucination.

 

The second cause is that most LLMs rely on RoPE (Rotary Position Embedding) to encode positional information.

RoPE๋Š” ๋‹จ์ผ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ(single modality)์—์„œ๋Š” ํšจ๊ณผ์ ์œผ๋กœ ์œ„์น˜ ์ •๋ณด๋ฅผ ์ธ์ฝ”๋”ฉํ•˜์ง€๋งŒ, ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ(multimodal) ํ™˜๊ฒฝ์—์„œ๋Š” ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค. RoPE๋Š” ์ž…๋ ฅ ํ† ํฐ์— 1์ฐจ์› ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ถ€์—ฌํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์ž…๋ ฅ์— ๊ทธ๋Œ€๋กœ ์ ์šฉํ•˜๋ฉด text token์ด ์‹œ๊ณต๊ฐ„์ (spatial ๋˜๋Š” temporal) ํ† ํฐ ์ค‘ ์–ด๋А ํ•œ ์ชฝ์— ๊ณผ๋„ํ•˜๊ฒŒ ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์ด๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค.

Figure 2(a)์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ๋งŒ์•ฝ spatial token์ด text token ๊ฐ€๊นŒ์ด์— ์œ„์น˜ํ•œ๋‹ค๋ฉด, text token์€ ์‹œ๊ฐ„ ์ •๋ณด๋ฅผ ๋ฌด์‹œํ•˜๊ณ  ๊ณต๊ฐ„ ์ •๋ณด์— ๊ณผ๋„ํ•˜๊ฒŒ ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๋ถˆ๊ท ํ˜•์€ ํŠน์ • ํ† ํฐ ํƒ€์ž…์— ํŽธํ–ฅ๋œ ์ฒ˜๋ฆฌ๋ฅผ ์ดˆ๋ž˜ํ•˜๋ฉฐ, ์ด๋Š” ๋™์ž‘์ด๋‚˜ ์žฅ๋ฉด์— ๋Œ€ํ•œ hallucination์„ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ๋‹ค.

 

์ด ๋‘๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ €์ž๋“ค์€ MASH-VLM์„ ์ œ์•ˆํ•œ๋‹ค.

์ด ๋ฐฉ๋ฒ•์€ spatial๊ณผ temporal representation์„ disentagleํ•จ์œผ๋กœ์จ
Video Large Language Model์—์„œ์˜ Action-Scene Hallucination ๋ฌธ์ œ๋ฅผ ์™„ํ™”
ํ•œ๋‹ค.

<1>

์ฒซ๋ฒˆ์งธ ๋ฐฉ๋ฒ•์€ ๋ฐ”๋กœ DST-attention์ด๋‹ค. ์ด๋Š” LLM ๋‚ด๋ถ€์—์„œ spatial๊ณผ temporal token์„ ๋ถ„๋ฆฌํ•œ๋‹ค.

Figure 2(b)์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, masked attention์„ ํ†ตํ•ด spatial๊ณผ temporal token ์‚ฌ์ด์—์„œ์˜ interaction์„ ๋ง‰๋Š”๋‹ค.

  • Causal attention among temporal tokens >> preserves sequential dependencies and temporal order
  • Bi-directional attention among spatial tokens >> because the spatial dimension is inherently bidirectional (unordered)
  • Text tokens can attend to both spatial and temporal tokens >> so text tokens integrate spatio-temporal information evenly

By structurally adjusting the attention flow in this way, DST-attention effectively disentangles spatio-temporal information, which in turn reduces hallucination and improves video understanding performance; a mask-construction sketch is given below.
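A minimal sketch of how such a mask could be built in PyTorch is shown below. The token layout ([temporal | spatial | text]), the helper name build_dst_mask, and the toy sizes are my own assumptions for illustration, not the authors' released implementation.

```python
import torch

def build_dst_mask(n_temporal: int, n_spatial: int, n_text: int) -> torch.Tensor:
    """Boolean attention mask (True = attention allowed) for a toy token layout
    [temporal | spatial | text]. Illustrative sketch of the DST-attention rules
    described above, not the authors' code."""
    n = n_temporal + n_spatial + n_text
    mask = torch.zeros(n, n, dtype=torch.bool)

    t = slice(0, n_temporal)                          # temporal tokens
    s = slice(n_temporal, n_temporal + n_spatial)     # spatial tokens
    x = slice(n_temporal + n_spatial, n)              # text tokens

    # 1) causal attention among temporal tokens (keeps temporal order)
    mask[t, t] = torch.tril(torch.ones(n_temporal, n_temporal, dtype=torch.bool))
    # 2) bi-directional attention among spatial tokens
    mask[s, s] = True
    # 3) no direct interaction between spatial and temporal tokens
    #    (mask[t, s] and mask[s, t] remain False)
    # 4) text tokens attend to both visual token types, and causally to text
    mask[x, t] = True
    mask[x, s] = True
    mask[x, x] = torch.tril(torch.ones(n_text, n_text, dtype=torch.bool))
    return mask

mask = build_dst_mask(n_temporal=4, n_spatial=4, n_text=3)
print(mask.int())  # 1 = allowed, 0 = blocked
```

A boolean mask like this can then be passed, for example, as the attn_mask argument of torch.nn.functional.scaled_dot_product_attention, where True marks positions that are allowed to attend.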


<2>

The second component is Harmonic-RoPE, which addresses the RoPE problem described above.

Because standard RoPE does not give spatial and temporal tokens identical or balanced positional IDs, positional information can be distorted.

To resolve this, the authors extend the dimensionality of the positional IDs so that each token type receives balanced positional IDs with respect to the text tokens (see Figure 2(b)).

์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ spatial๊ณผ temporal ์ •๋ณด์˜ ๊ท ํ˜• ์žˆ๋Š” ํ‘œํ˜„์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋ฉฐ, ๋ชจ๋ธ์ด ์‹œ๊ณต๊ฐ„ ์ •๋ณด๋ฅผ ์กฐํ™”๋กญ๊ฒŒ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋„๋ก ๋•๋Š”๋‹ค.

 

 

The contributions can be summarized in the following three points.

1. MASH-VLM architecture

DST-attention์„ ํ†ตํ•ด spatial๊ณผ temporal token์„ disentangle
Harmonic-RoPE๋ฅผ ํ†ตํ•ด spatial๊ณผ temporal token์— ๊ท ํ˜• ๋งž์ถ˜ relative position ID ๋ถ€์—ฌ
--> Video-LLM์—์„œ์˜ hallucination ๋ฌธ์ œ๋ฅผ ์™„ํ™”

2. UNSCENE benchmark
A benchmark consisting of 1,320 videos and 4,078 QA pairs
: videos containing unusual action-scene combinations
: scene-only videos that contain no human actions

3. State-of-the-art results on both the UNSCENE benchmark and existing video QA benchmarks

 

 


# Related Works

Multimodal LLMs for Video Understanding.

Research is expanding from image-based MLLMs (vision) to Video-LLMs (video).

A video-based MLLM must
(1) learn temporal dynamics (along the time axis), and
(2) handle the large number of visual tokens produced by multiple frames.

์ด์— ๋”ฐ๋ผ ์ตœ๊ทผ์˜ Video-LLM ๋ชจ๋ธ๋“ค์€
(1) visual token์„ ์ค„์ด๊ฑฐ๋‚˜

  • Mvbench: A comprehensive multimodal video understanding benchmark. In CVPR, 2024.
  • Video-llama: An instruction-tuned audio-visual language model for video understanding.


(2) merging visual tokens, or

  • Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.

(3) modeling temporal dynamics.

  • Video-llava: Learning united visual representation by alignment before projection. In EMNLP, 2024.
  • Bt-adapter: Video conversation is feasible without video instruction tuning. In CVPR, 2024.

 

Video-ChatGPT, the work most closely related to this paper, separates spatial and temporal features via pooling before they enter the LLM.
However, those features can become entangled again through the LLM's attention mechanism.
Hence the authors argue that intervention inside the LLM is necessary.

 


# Method