
[25' ICLR] Decouple and Track: Benchmarking and Improving Video Diffusion Transformers For Motion Transfer

윰갱 2025. 5. 30. 13:59

# Introduction

 

Recent research summary: the limits of text-driven motion transfer (Text-to-Video Motion Transfer) and a new approach

Recent text-to-video (T2V) generation models built on the Diffusion Transformer (DiT) have shown remarkable performance. Fine-grained control over complex motion, however, remains difficult. In particular, a text prompt alone cannot fully convey what the user intends.

Motion transfer research is drawing attention as a way to address this.

 

Why does Motion Transfer matter?

Conventional T2V generation tries to produce the entire video from text alone, which has two weaknesses:

  • Text alone struggles to capture precise motion details
  • Limited expressiveness for user requests (e.g., "a cat walking as if it were dancing")

Motion Transfer emerged to compensate for these problems.
→ It extracts the motion from an existing video and transfers it to a new subject that matches a text prompt.

 

Limitations of existing methods

Earlier methods mostly worked within a 3D U-Net architecture, where they

  • separated temporal self-attention from spatial self-attention, and
  • froze the spatial part in an attempt to separate motion from appearance.

In recent DiT-based models (3D full attention architecture), however,

  • temporal and spatial information are mixed together inside a single attention operation,
  • which makes it much harder to decouple motion from appearance.
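To make the contrast above concrete, here is a minimal sketch that counts which token pairs may interact under each scheme. The function names (`token_grid`, `can_attend`, `count_pairs`) and the toy grid size are illustrative assumptions, not taken from the paper or from any model's code.

```python
from itertools import product


def token_grid(frames, height, width):
    """Enumerate video tokens as (t, y, x) coordinates."""
    return list(product(range(frames), range(height), range(width)))


def can_attend(query, key, mode):
    """Return True if `query` may attend to `key` under the given scheme."""
    (tq, yq, xq), (tk, yk, xk) = query, key
    if mode == "spatial":   # factorized: same frame, any location
        return tq == tk
    if mode == "temporal":  # factorized: same location, any frame
        return (yq, xq) == (yk, xk)
    if mode == "full":      # 3D full attention: every pair of tokens
        return True
    raise ValueError(f"unknown mode: {mode}")


def count_pairs(frames, height, width, mode):
    """Count the (query, key) pairs allowed under `mode`."""
    tokens = token_grid(frames, height, width)
    return sum(can_attend(q, k, mode) for q in tokens for k in tokens)


if __name__ == "__main__":
    for mode in ("spatial", "temporal", "full"):
        print(mode, count_pairs(4, 2, 2, mode))
```

In the factorized case the spatial and temporal pairs live in separate layers, so one of them can be frozen on its own. Under full attention there is a single set of weights covering every pair, so there is no separate spatial block to freeze.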

In addition, most existing benchmarks contain only simple motion, on the level of straight-line movement,
so they fall short for evaluating complex, realistic motion transfer.

 

Proposed method: Shared Temporal Kernel + Dense Point Tracking

The authors propose a new motion transfer technique that runs on top of DiT models:

Key ideas

  • Shared Temporal Kernel
    • Analysis of the 3D attention maps shows that attention between adjacent frames is strong
    • Building on this, a 1D temporal kernel is introduced → temporal smoothing and consistent motion
  • Dense Point Tracking Loss
    • Applies an idea similar to optical flow in the latent feature space
    • Encourages consistent motion by aligning foreground trajectories
  • Foreground vs Background separation
    • Temporal smoothing helps distinguish background appearance from foreground motion
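The two ideas above can be sketched on toy data. This is a minimal illustration under stated assumptions, not the paper's implementation: features are scalars indexed as `features[frame][token]` (real DiT features are vectors), the kernel weights are fixed rather than learned, edges use replicate padding, and the loss is one plausible "features should stay constant along a track" form. `temporal_smooth` and `tracking_loss` are hypothetical names.

```python
def temporal_smooth(features, kernel):
    """Apply one 1D kernel along the time axis, shared by every token.

    features: list over frames; each frame is a list of per-token floats.
    kernel:   odd-length list of weights (assumed fixed here).
    Frame indices past either end are clamped (replicate padding).
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    num_frames, half = len(features), len(kernel) // 2
    smoothed = []
    for t in range(num_frames):
        frame = []
        for n in range(len(features[t])):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = min(max(t + k - half, 0), num_frames - 1)
                acc += w * features[src][n]
            frame.append(acc)
        smoothed.append(frame)
    return smoothed


def tracking_loss(features, tracks):
    """Mean squared feature change between consecutive points of each track.

    tracks: list of trajectories; each lists one token index per frame,
            i.e. where the tracked point sits in that frame.
    """
    total, count = 0.0, 0
    for track in tracks:
        for t in range(len(track) - 1):
            diff = features[t + 1][track[t + 1]] - features[t][track[t]]
            total += diff * diff
            count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    # One token whose feature spikes in the middle frame.
    print(temporal_smooth([[0.0], [4.0], [0.0]], [0.25, 0.5, 0.25]))
    # A feature that hops between two tokens from frame to frame.
    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    print(tracking_loss(feats, [[0, 1, 0]]))  # track follows the feature
    print(tracking_loss(feats, [[0, 0, 0]]))  # track ignores the motion
```

The track that follows the moving feature gives zero loss, while the static track is penalized. That is the sense in which supervision along trajectories rewards features that move consistently with the tracked points.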

 

New benchmark: MTBench

To allow more realistic and more demanding evaluation, the paper also proposes a new benchmark called MTBench.

  • Built on 100 high-quality videos
  • Text prompts and motion trajectories are generated with recent LLMs and tracking models
  • Fine-grained motion evaluation criteria that also account for difficulty levels

 

New evaluation metric: Hybrid Motion Fidelity Metric

  • In addition to the existing local velocity similarity,
  • it introduces the Fréchet distance, which measures how much the overall trajectory shapes differ
    → enabling more precise and realistic evaluation
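The following sketch shows how a local term and a global term of this kind could be combined. It uses the standard discrete Fréchet distance and a per-step cosine similarity of displacements. The weighting `alpha` and the `1 / (1 + d)` mapping from distance to score are assumptions made for this example, not the formula from the paper.

```python
import math


def discrete_frechet(p, q):
    """Discrete Frechet distance between two polylines of (x, y) points."""
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


def velocity_similarity(p, q):
    """Mean cosine similarity of per-step displacements (local motion)."""
    sims = []
    for t in range(min(len(p), len(q)) - 1):
        vp = (p[t + 1][0] - p[t][0], p[t + 1][1] - p[t][1])
        vq = (q[t + 1][0] - q[t][0], q[t + 1][1] - q[t][1])
        norm = math.hypot(*vp) * math.hypot(*vq)
        if norm > 0:  # skip steps where either point does not move
            sims.append((vp[0] * vq[0] + vp[1] * vq[1]) / norm)
    return sum(sims) / len(sims) if sims else 0.0


def hybrid_fidelity(p, q, alpha=0.5):
    """Blend local and global similarity; the weighting is an assumption."""
    local_term = velocity_similarity(p, q)
    global_term = 1.0 / (1.0 + discrete_frechet(p, q))
    return alpha * local_term + (1.0 - alpha) * global_term


if __name__ == "__main__":
    source = [(0, 0), (1, 0), (2, 0)]
    shifted = [(0, 1), (1, 1), (2, 1)]  # same motion, offset by one unit
    print(discrete_frechet(source, shifted))
    print(velocity_similarity(source, shifted))
    print(hybrid_fidelity(source, shifted))
```

The shifted trajectory has identical per-step velocities but a nonzero Fréchet distance, so only the global term registers the offset. Differences of this kind are invisible to a purely local velocity comparison, which is the motivation for pairing the two terms.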

 
