📚 Study/Paper Review 16

[24' NeurIPS] Visual Fourier Prompt Tuning

1. Introduction (Background) Prompt tuning was originally introduced in NLP as a way to adapt large foundation models efficiently, and has since been extended to the vision domain. It is now applied to a wide range of vision tasks such as image classification, object detection, and segmentation, establishing itself as a far lighter yet effective alternative to retraining all parameters. However, prompt tuning still faces an important challenge: the larger the domain gap between the pretraining data and the fine-tuning data, the more sharply performance degrades. This limits the generality of prompt tuning and raises a fundamental question for researchers: can prompt tuning generalize across diverse domains? Inspired by human visual cognition, the research team ..

[Paper Review] An Introduction to Vision-Language Modeling (Meta)

0. Abstract With the recent explosive progress of large language models (LLMs), there have been active attempts to extend their capabilities into the visual domain. Vision-Language Models (VLMs) are showing applications that could reshape our technological landscape, such as generating images from nothing more than a high-level text description or understanding visual scenes through language. Unlike language, however, visual information lives in a high-dimensional continuous space, where concepts are inherently difficult to delineate or express, so accurately mapping vision onto language still leaves many technical challenges. Against this backdrop, the paper is written as a primer that introduces what VLMs are, how they are trained, and how they are evaluated. In particular, extending beyond images to video ..

[Paper Review] Vision-Language Models for Vision Tasks: A Survey

0. Abstract Traditional visual recognition research had to train a deep neural network (DNN) separately for every visual recognition task --> a setup that relies on large-scale hand-labeled data and consumes a great deal of time and human effort. Vision-Language Models (VLMs) have recently been drawing attention as a way to resolve this. VLMs (1) learn rich vision-language correlations from the virtually unlimited image-text pairs available on the web, and (2) can make zero-shot predictions for a wide range of visual recognition tasks with a single model. For VLM-based visual recognition, this paper covers the foll..

[Paper Review] QLoRA: Efficient Finetuning of Quantized LLMs

QLoRA preserves full 16-bit finetuning performance while making it possible to finetune a 65B-parameter model on a single 48GB GPU.
# Contribution
The QLoRA methodology (a minimal usage sketch follows below):
1. 4-bit NormalFloat (NF4): a new data type that is information-theoretically optimal for normally distributed weights.
2. Double Quantization: cuts the average memory footprint by quantizing the quantization constants themselves.
3. Paged Optimizers: keep sudden memory spikes under control.
# Introduction
Finetuning an LLM is a necessary step for improving its performance in a specific domain. Previously, 16-bit finetuning of LLaMA 65B required 780GB of ..
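For orientation, here is a minimal sketch of how these three pieces are usually wired together with the Hugging Face transformers / peft / bitsandbytes stack; the model name, LoRA rank, and batch settings are illustrative assumptions, not taken from the post or the paper.

```python
# Hedged sketch: NF4 quantization + double quantization + a paged optimizer,
# QLoRA-style. Model name, LoRA rank, and batch settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store frozen base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat (NF4)
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",                  # illustrative; the paper goes up to 65B
    quantization_config=bnb_config,
    device_map="auto",
)

# Only the LoRA adapters are trainable on top of the frozen 4-bit base model.
model = get_peft_model(model, LoraConfig(r=64, lora_alpha=16,
                                         lora_dropout=0.05, task_type="CAUSAL_LM"))

# A paged optimizer spills optimizer state to CPU RAM when GPU memory spikes.
args = TrainingArguments(output_dir="qlora-out", optim="paged_adamw_32bit",
                         per_device_train_batch_size=1,
                         gradient_accumulation_steps=16)
```

The base weights stay frozen in 4-bit NF4 while only the small LoRA adapters and optimizer state live in higher precision, which is what makes the 48GB budget plausible.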

[LLM] Base Model๊ณผ Instruct Model, ๊ทธ๋ฆฌ๊ณ  Chat Template

# Base Model, Instruct Model
Base Model: a model that has only been pretrained, with the simple objective of next-token prediction.
Instruct Model: a model that has additionally been fine-tuned to carry out tasks with a specific purpose.
As in the image below, a model with nothing appended to its name is a base model, while an instruct model carries an extra suffix such as Instruct, it, or chat. Put simply, it is the difference between GPT and ChatGPT. ChatGPT itself is based on a model called InstructGPT, which was trained separately to generate responses appropriate to the user's input. Of course, both still fundamentally predict the next token, but a Base Model that has only been pretrained, quite literally, pays no attention to whether the input is a question or ..
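Since the post title also mentions chat templates, here is a minimal sketch of what a chat template does for an instruct model, using the transformers apply_chat_template API; the model name is an illustrative placeholder.

```python
# Hedged sketch: rendering a chat template for an instruct model.
# Assumes the transformers library; the model name is an illustrative placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the difference between a base model and an instruct model?"},
]

# The template wraps each turn in the special tokens the instruct model was
# fine-tuned on; a base model has no such convention and just continues raw text.
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)
```

Printing the result shows the role markers and special tokens the instruct model expects around each turn, which a base model was typically never trained on.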

[Paper Review] Compact3D: Smaller and Faster Gaussian Splatting with Vector Quantization

0. Abstract
1. The authors observe that many Gaussians share similar parameters.
2. They therefore propose a K-means-based vector quantization scheme to quantize the Gaussian parameters (a rough sketch follows below). The codebook is stored together with each Gaussian's code index, and the indices are further compressed by sorting them and applying a method similar to run-length encoding.
3. To reduce the number of Gaussians, they also propose a regularizer that encourages zero opacity (invisible Gaussians). This is effective for compressing the model and speeding up rendering.
4. In the end, compared to vanilla 3DGS, the quality is slightly ..
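A rough sketch of the codebook idea in plain NumPy; the shapes, K, and iteration counts are my own toy assumptions, not the paper's configuration.

```python
# Hedged sketch of K-means vector quantization of Gaussian parameters:
# store a small codebook plus one integer index per Gaussian instead of raw vectors.
import numpy as np

def kmeans_vq(params, K=256, iters=10, seed=0):
    """params: (N, D) array, e.g. per-Gaussian covariance/color parameters."""
    rng = np.random.default_rng(seed)
    codebook = params[rng.choice(len(params), K, replace=False)]  # init centroids
    for _ in range(iters):
        # assignment step: nearest codebook entry for every Gaussian (the costly part)
        d2 = ((params ** 2).sum(1, keepdims=True)
              - 2.0 * params @ codebook.T
              + (codebook ** 2).sum(1))
        idx = d2.argmin(axis=1)
        # update step: each centroid becomes the mean of its assigned vectors (cheap)
        for k in range(K):
            members = params[idx == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    return codebook, idx  # store these (plus RLE over sorted idx) instead of params

# toy usage: N Gaussians with D-dimensional parameter vectors
N, D = 20_000, 48
params = np.random.default_rng(1).standard_normal((N, D)).astype(np.float32)
codebook, idx = kmeans_vq(params, K=256, iters=5)
decoded = codebook[idx]  # looked up at render time via the per-Gaussian index
```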

In K-means clustering, which takes longer: updating the centroids or updating the assignments?

"K-means has two steps: updating centroids given assignments, and updating assignments given centroids. We note that the latter is more expensive while the former is a simple averaging. Hence, we update the centroids after each iteration and update the assignments once every t iterations." (Source: Compact3D: Smaller and Faster Gaussian Splatting with Vector Quantization) I came across this passage while reading the paper and could not immediately see why, so ..
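A rough per-iteration operation count makes the quoted claim concrete (the notation here is mine, not the paper's):

```latex
% Per-iteration cost of the two K-means steps
% (N points, K centroids, D dimensions).
\begin{align*}
  \text{assignment (every point vs.\ every centroid)} &:\ \mathcal{O}(NKD) \\
  \text{centroid update (one averaging pass over the points)} &:\ \mathcal{O}(ND) \\
  \frac{\text{cost}_{\text{assign}}}{\text{cost}_{\text{update}}} &\approx K
\end{align*}
% The assignment step is roughly K times more expensive, which is why Compact3D
% refreshes the assignments only once every t iterations while updating the
% centroids every iteration.
```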

[Paper Review] Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians

This paper was posted on arXiv in March 2024. It caught my interest because it points out the limitations of earlier attempts to reduce the number or dimensionality of Gaussians in 3DGS (LightGaussian, Compact3DGS, etc.) and argues for a new idea.
Mini-Splatting: Representing Scenes with a Constrained Number of Gaussians: "In this study, we explore the challenge of efficiently representing scenes with a constrained number of Gaussians. Our analysis shifts from traditional graphics and 2D computer vision to t.."

Why do we multiply by the transpose when computing the covariance matrix in 3DGS?

This question came up from an equation while reading the 3DGS paper. First, in the world coordinate frame, the covariance matrix is built from (1) a scaling matrix $S$ and (2) a rotation matrix $R$ as $$\Sigma = R S S^{T} R^{T}$$ Likewise, the covariance matrix in image coordinates is expressed using (1) the viewing transform $W$ from world to camera coordinates and (2) the Jacobian $J$ of the affine approximation of the projective transformation from camera to image coordinates as $$\Sigma' = J W \Sigma W^{T} J^{T}$$ Looking at these two equations, why the transpose matri..
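For reference, the general rule both formulas instantiate is the covariance of a linearly transformed variable (standard linear algebra, not something specific to the 3DGS paper):

```latex
% Covariance under a linear map: if y = Ax and Cov(x) = \Sigma_x, then
\begin{align*}
  \operatorname{Cov}(y)
    &= \mathbb{E}\!\left[(Ax - A\mu)(Ax - A\mu)^{T}\right]
     = A\,\mathbb{E}\!\left[(x - \mu)(x - \mu)^{T}\right] A^{T}
     = A\,\Sigma_x\,A^{T}. \\[4pt]
  % 3DGS builds the world-space covariance by applying A = RS to a unit Gaussian:
  \Sigma  &= (RS)(RS)^{T} = R S S^{T} R^{T}, \\
  % and projecting with the affine approximation A = JW gives
  \Sigma' &= (JW)\,\Sigma\,(JW)^{T} = J W \Sigma W^{T} J^{T}.
\end{align*}
% The sandwiching A(\cdot)A^{T} is what keeps the result symmetric and positive
% semi-definite, i.e. a valid covariance matrix.
```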

What does "heuristic" mean in 3DGS?

While reading the RadSplat paper, I came across the point that one limitation of 3DGS is that its heuristics make optimization difficult: 3DGS, however, suffers from a challenging optimization landscape and an unbounded model size. The number of Gaussian primitives is not known a priori, and carefully-tuned merging, splitting, and pruning heuristics are required to achieve satisfactory results. The brittleness of these heuristics becomes particularly evident in..
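To make "heuristics" concrete, here is a hypothetical, simplified sketch of the kind of hand-tuned clone/split/prune rules 3DGS uses for adaptive density control; the data layout, helper names, and threshold values are illustrative assumptions, not the reference implementation.

```python
# Hypothetical sketch of 3DGS-style adaptive density control.
# Fields, helpers, and thresholds are illustrative only.
from dataclasses import dataclass, replace
import random

@dataclass
class Gaussian:
    opacity: float             # alpha in [0, 1]
    max_scale: float           # largest axis of the covariance
    avg_viewspace_grad: float  # running average of the 2D positional gradient

def adaptive_density_control(gaussians, grad_threshold=2e-4,
                             min_opacity=0.005, large_scale=0.01):
    out = []
    for g in gaussians:
        # Prune: nearly transparent Gaussians barely contribute to any pixel.
        if g.opacity < min_opacity:
            continue
        # Densify where the view-space gradient stays large (under-reconstruction).
        if g.avg_viewspace_grad > grad_threshold:
            if g.max_scale > large_scale:
                # Split: one over-large Gaussian becomes two smaller ones.
                out += [replace(g, max_scale=g.max_scale / 1.6) for _ in range(2)]
            else:
                # Clone: duplicate a small Gaussian so the pair can specialize.
                out += [g, replace(g)]
        else:
            out.append(g)
    return out  # every threshold above is a hand-tuned knob, i.e. a heuristic

# toy usage
cloud = [Gaussian(random.random(), random.random() * 0.02, random.random() * 4e-4)
         for _ in range(1000)]
cloud = adaptive_density_control(cloud)
```

Because each rule fires only when a hand-picked threshold is crossed, the final number of Gaussians is not known in advance, which is exactly the brittleness the RadSplat quote refers to.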