
[25’ ICLR] VIDEOGRAIN: MODULATING SPACE-TIME ATTENTION FOR MULTI-GRAINED VIDEO EDITING

윰갱 2025. 6. 30. 13:20

# Introduction

The paper introduces the concept of multi-grained video editing. Depending on how fine-grained the edit is, it is divided into three levels: class-level, instance-level, and part-level. (Figure 2, left)

  • Class-level editing means replacing objects within the same class.
  • Instance-level editing means distinguishing between different object instances and modifying each of them separately.
  • Part-level editing covers adding new objects or partially changing the attributes of an existing object.

 

Existing methods are instance-agnostic (they cannot tell objects apart), so the features of different instances get mixed during editing.
As the right side of Figure 2 shows, recent T2V models are weak at multi-grained editing.

 

To solve this problem, the paper takes the following two points as its core motivation.

  1. Enable text-to-region control, so that a specific region can be controlled through text, and
  2. maintain feature separation, so that features do not mix across regions.

In a typical diffusion model:

  • The cross-attention layer uses text features to control each spatial region, and
  • the self-attention layer connects tokens across frames along the time axis and builds global coherence.

However, the existing approach has the following problems.

 

Cross-Attention Layer: problem and solution

  • Problem: the same global text prompt is applied to every frame token,
    → which causes semantic misalignment, where the text does not match the individual regions.
  • Solution: modulate (amplify) cross-attention so that each local prompt focuses on its corresponding spatially-disentangled region, and suppress attention to irrelevant regions.

 

Self-Attention Layer: problem and solution

  • Problem: pixels in one region influence outside regions of the same class or similar neighboring regions,
    which causes feature coupling and texture mixing.
  • Solution: modulate self-attention so that it focuses on intra-region relations and suppresses inter-region relations → each query attends only to its own target region.

 

To address these problems in a unified way, the authors propose Spatial-Temporal Layout-Guided Attention (ST-Layout Attn).
It modulates spacetime cross-attention and self-attention together,
achieving accurate text-to-region control and feature separation at the same time.

 

Contribution

- The first attempt at multi-grained video editing

- VideoGrain, which performs text-to-region control through cross-attention and feature separation through self-attention

- Reaches SOTA performance without any parameter tuning


# METHOD

## 3.1 Motivation

The paper analyzes self-attention during DDIM Inversion.

When the self-attention features of each frame are clustered with K-Means, semantic segmentation works to some degree, but the clustering fails to distinguish the two men.
Increasing the number of clusters still does not separate individual instances, which shows why instance-level editing is difficult.

In addition, the authors used SDEdit to try to edit the left man into Iron Man and the right man into Spiderman.
However, the cross-attention maps show that the attention for both Iron Man and Spiderman concentrates on the left man, while the attention for blossom spreads onto the right man, so the edit fails.

This led the authors to the following question.
"Can attention be modulated so that the attention of each local region is distributed correctly?"

 

To solve this problem, the authors propose a method called VideoGrain.

  • (1) Cross-attention modulation: modulate attention so that each text embedding focuses on the correct spatial region.
  • (2) Self-attention modulation: designed to raise the focus within a region and reduce interference between regions.

 


## 3.3 Overall Framework

The overall pipeline is as follows.

Input: Video frames $V$, Text prompt

Step 1: DDIM Inversion → Noisy latent $x_t$

  • Invert the clean latent $x_0$ with DDIM Inversion to obtain $x_t$
  • This is the initial step needed for high-fidelity reconstruction
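As a rough illustration of Step 1, here is a minimal NumPy sketch of the deterministic DDIM update used in both directions. The noise schedule `alphas_bar` and the constant `toy_eps_model` are hypothetical stand-ins, not the paper's setup; a real pipeline would call the diffusion UNet. With a constant noise prediction the inversion is exactly reversible, whereas with a real model it is only approximate, because the noise is predicted at the current step rather than the next one.

```python
import numpy as np

def ddim_step(x, eps, abar_from, abar_to):
    """Deterministic DDIM update (eta = 0) from noise level abar_from to abar_to."""
    # Predict the clean latent implied by the current latent and noise estimate.
    x0_pred = (x - np.sqrt(1.0 - abar_from) * eps) / np.sqrt(abar_from)
    # Re-noise the prediction to the target noise level.
    return np.sqrt(abar_to) * x0_pred + np.sqrt(1.0 - abar_to) * eps

def toy_eps_model(x, t):
    # Hypothetical stand-in for the UNet noise predictor (assumption).
    return np.full_like(x, 0.1)

# Toy cumulative schedule: index 0 is (almost) clean, later indices are noisier.
alphas_bar = np.linspace(0.999, 0.5, 6)
x0 = np.random.default_rng(0).normal(size=(4, 8, 8))  # toy "clean latent"

# Inversion: walk from the clean latent toward noise.
x = x0.copy()
for t in range(len(alphas_bar) - 1):
    x = ddim_step(x, toy_eps_model(x, t), alphas_bar[t], alphas_bar[t + 1])
x_t = x

# Denoising: walk back with the same update in the opposite direction.
x = x_t.copy()
for t in range(len(alphas_bar) - 1, 0, -1):
    x = ddim_step(x, toy_eps_model(x, t), alphas_bar[t], alphas_bar[t - 1])
x_rec = x
```

The same `ddim_step` serves both loops; only the order of the two noise levels changes.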

 

Step 2: Semantic layout condition $e$

  • Cluster the self-attention features to generate a semantic layout
  • However, as seen in Section 3.1, self-attention alone cannot distinguish instances
  • Therefore SAM-Track is used to segment each instance
  • A ControlNet condition $e$ can be applied when needed (e.g., pose map or depth map for structural guidance)

Step 3: Spatial-Temporal Layout-Guided Attention

  • Apply the layout mask to both cross-attention and self-attention to
    → guide the text to focus on the correct region
    → prevent feature interference between different instances

Step 4: DDIM Denoising → Output frames $V'$

  • Generate the final video based on the modulated attention

 

## 3.4 SPATIAL-TEMPORAL LAYOUT-GUIDED ATTENTION

In cross-attention, the distribution of attention weights directly affects the editing result, while

self-attention plays an important role in maintaining consistency across frames.

 

For both types of attention, the paper modulates the scores by "emphasizing positive pairs and suppressing negative pairs".

  • The mask $M$ is added to the original attention score $QK^T$ to modulate it
  • $\lambda$: hyperparameter that controls the modulation strength
  • As a result, positive relations are emphasized and negative relations are weakened

  • $R_i$: binary mask that indicates whether each query-key pair is a positive pair or a negative pair
  • $M_i^{pos}$: score for the relations to emphasize
  • $M_i^{neg}$: score for the relations to suppress

In other words, it is a unified formulation: "use the pos score for pairs to emphasize, and the neg score for pairs to suppress."
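Read together, the bullets above correspond to a formulation along the following lines. This is a reconstruction from the descriptions in this note: the $\sqrt{d}$ scaling is the standard attention convention and is assumed here, as is reading the subscript $i$ as the frame index.

$$
A = \mathrm{softmax}\!\left(\frac{QK^T + \lambda M}{\sqrt{d}}\right), \qquad M_i = R_i \odot M_i^{pos} - (1 - R_i) \odot M_i^{neg}
$$

Where $R_i = 1$ the score is raised by $\lambda M_i^{pos}$, and where $R_i = 0$ it is lowered by $\lambda M_i^{neg}$.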

 

[Modulate Cross-Attention for Text-to-Region Control.]

In cross-attention, the text embeddings are used as key/value and the features from the video latent are used as query.

Which key each query attends to determines which text condition is reflected at which position.

 

1. Computing the attention modulation score

  • positive: computed as the difference between the maximum value and the original attention score, i.e. $\max(QK^T) - QK^T$
    • --> the larger the attention score, the smaller this value; it is applied as a positive modulation
  • negative: computed by subtracting the minimum value from the original score, i.e. $QK^T - \min(QK^T)$, so that it acts as a negative influence
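The modulation described so far can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function name `modulated_attention`, taking the max and min per query row, and the $\sqrt{d}$ scaling are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def modulated_attention(q, k, R, lam=1.0):
    """q: (Nq, d) queries, k: (Nk, d) keys, R: (Nq, Nk) binary mask (1 = positive pair)."""
    d = q.shape[-1]
    scores = q @ k.T
    # Positive score: gap to the row maximum (assumed to be taken per query).
    m_pos = scores.max(axis=-1, keepdims=True) - scores
    # Negative score: gap to the row minimum.
    m_neg = scores - scores.min(axis=-1, keepdims=True)
    M = R * m_pos - (1.0 - R) * m_neg
    return softmax((scores + lam * M) / np.sqrt(d), axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(5, 8))
R = np.zeros((6, 5))
R[:, :2] = 1.0  # toy layout: the first two keys are positive for every query

A_plain = modulated_attention(q, k, R, lam=0.0)  # lam = 0 gives ordinary attention
A_mod = modulated_attention(q, k, R, lam=1.0)
```

With `lam=1.0` every positive pair is lifted to the row maximum and every negative pair is pushed down to the row minimum, so the attention mass on the positive keys can only grow compared with `A_plain`.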

 

2. Which pairs are positive/negative?

 

  • $x$: query index (spatial position in the video latent)
  • $y$: key index (text token position)
  • $\tau_k$: the instance that the text targets
  • $m_{i,k}$: whether position $x$ belongs to the mask of instance $k$

--> In other words, when the text $\tau_k$ corresponds to a specific instance, positive modulation is applied only at that instance's positions, and everything else is set to 0 (negative)
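A small sketch of how such a cross-attention mask could be built for one frame. The function name, the `token_to_instance` mapping, and the toy token indices are hypothetical; in practice the mapping would come from locating each local prompt's tokens in the tokenized prompt.

```python
import numpy as np

def cross_attention_region_mask(instance_masks, token_to_instance, num_tokens):
    """
    instance_masks: (K, H, W) binary masks, one per instance, for a single frame.
    token_to_instance: dict mapping text-token index y -> instance index k.
    Returns R with shape (H*W, num_tokens); R[x, y] = 1 marks a positive pair.
    """
    K, H, W = instance_masks.shape
    flat = instance_masks.reshape(K, H * W).astype(np.float32)
    R = np.zeros((H * W, num_tokens), dtype=np.float32)
    for y, k in token_to_instance.items():
        R[:, y] = flat[k]  # token y is positive only where instance k is present
    return R

# Toy 2x4 frame: instance 0 covers the left half, instance 1 the right half.
masks = np.zeros((2, 2, 4), dtype=np.uint8)
masks[0, :, :2] = 1
masks[1, :, 2:] = 1

# Hypothetical token positions: tokens 1-2 describe instance 0, token 4 describes instance 1.
token_to_instance = {1: 0, 2: 0, 4: 1}
R = cross_attention_region_mask(masks, token_to_instance, num_tokens=6)
```

Tokens that belong to no local prompt (indices 0, 3, and 5 here) keep an all-zero column, matching the "everything else is 0" rule above.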

 

 

[Modulate Self-Attention to Keep Feature Separation.]

Existing Text-to-Image (T2I) models only look at a single frame, but video editing has to look at multiple frames and account for temporal consistency as well. --> This is why spatial attention is extended to spatial-temporal self-attention.

However, plain self-attention has the problem that different instances can be wrongly connected to each other.

The paper's figure also shows that, before self-attention modulation, the features of the two instances are mixed together.
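To make the extension concrete, here is a shape-level sketch comparing per-frame spatial attention scores with spatial-temporal scores over all frames at once. It uses raw latents as both queries and keys (no learned projections), and the sizes are arbitrary; both are simplifications for illustration.

```python
import numpy as np

num_frames, height, width, channels = 3, 4, 4, 8
hw = height * width
latents = np.random.default_rng(0).normal(size=(num_frames, hw, channels))

# Spatial self-attention: each frame only attends to its own H*W tokens.
spatial_scores = latents @ latents.transpose(0, 2, 1)  # (num_frames, hw, hw)

# Spatial-temporal self-attention: flatten all frames into one token sequence,
# so every token can attend to every token in every frame.
tokens = latents.reshape(num_frames * hw, channels)
st_scores = tokens @ tokens.T  # (num_frames*hw, num_frames*hw)
```

The diagonal blocks of `st_scores` are the per-frame spatial scores; the off-diagonal blocks are the new cross-frame connections.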

 

To address this, the authors design a structure that strengthens attention within the same instance and suppresses attention between different instances.

  • positive: emphasize attention among tokens of the same instance
    --> scores are leveled up toward the max so that these tokens are strongly connected
  • negative: reduce attention between different instances
    --> scores are pushed down toward the min

 

  • If $x$ and $y$ belong to different instances → 0 (attention blocked)
  • If they belong to the same instance → 1 (attention allowed)

→ In other words, this mask acts as a binary filter that makes the boundaries between instances explicit
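The rule above can be sketched as a pairwise comparison of instance labels across all frames. This assumes the instance masks are non-overlapping and have been merged into one integer label map per frame, with background treated as its own region (label 0). Both are assumptions for illustration, and `self_attention_region_mask` is a hypothetical name.

```python
import numpy as np

def self_attention_region_mask(label_maps):
    """
    label_maps: (F, H, W) integer instance ids per position (0 = background).
    Returns R with shape (F*H*W, F*H*W); R[x, y] = 1 when query x and key y
    carry the same instance id, including when they lie in different frames.
    """
    labels = label_maps.reshape(-1)
    return (labels[:, None] == labels[None, :]).astype(np.float32)

# Toy clip: 2 frames of size 1x4; the two instances shift one step to the right in frame 2.
label_maps = np.array([[[1, 1, 2, 2]],
                       [[0, 1, 1, 2]]])
R = self_attention_region_mask(label_maps)
```

Because the comparison runs over the flattened spatial-temporal sequence, a token of instance 1 in the first frame stays a positive pair with instance 1 tokens in the second frame, which is what keeps the edit consistent over time.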

 


# Evaluation Result