
[Paper Review] QLoRA: Efficient Finetuning of Quantized LLMs

윰갱 · 2025. 4. 11. 14:34

QLoRA makes it possible to finetune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit finetuning performance.


# Contribution

The QLoRA method

1. 4-bit NormalFloat (NF4): a new data type that is information-theoretically optimal for normally distributed weights

2. Double Quantization: reduces the average memory footprint by quantizing the quantization constants themselves

3. Paged Optimizers: keeps sudden memory spikes under control


# Introduction

Finetuning an LLM is a necessary step for improving its performance in a specific domain.

Previously, regular 16-bit finetuning of LLaMA 65B required more than 780GB of GPU memory.

QLoRA, however, needs only 48GB of memory, with no degradation in runtime or predictive performance.

As listed in the contributions above, there are three core ideas.

1. 4-bit NormalFloat

Existing 4-bit quantization schemes such as 4-bit Integer and 4-bit Float split the value range into equal-width intervals, but most weights follow a normal distribution $N(0, \sigma^2)$.

Rather than dividing the representable range "evenly" across bit patterns, 4-bit NormalFloat allocates more quantization levels to the regions where values actually concentrate, following the true distribution. Because it is optimized for the normal distribution, it maintains high accuracy even with very few bits.

2. Double Quantization

Quantization normally compresses the parameter values themselves; QLoRA goes one step further and also quantizes the constants used for quantization.

This saves about 0.37 bits per parameter on average, which amounts to roughly 3GB of memory for a large model (e.g., a 65B model).

3. Paged Optimizers

๊ธฐ์กด Optimizer๋Š” mini-batch ํฌ๊ธฐ๊ฐ€ ์ปค์ง€๊ฑฐ๋‚˜ ์‹œํ€€์Šค ๊ธธ์ด๊ฐ€ ๊ธธ์–ด์ง€๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์ด ํญ๋ฐœํ•œ๋‹ค.

Memory spikes arise in particular when techniques like gradient checkpointing are in use.

QLoRA therefore uses NVIDIA's Unified Memory feature to page optimizer state out to CPU memory.
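As a concrete reference point, here is roughly how these three pieces surface in the Hugging Face ecosystem (a minimal sketch, not part of the paper itself; `BitsAndBytesConfig` and the `paged_adamw_32bit` flag come from the transformers/bitsandbytes integration, and the model id is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit NormalFloat storage with double-quantized scale factors
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # 1. 4-bit NormalFloat
    bnb_4bit_use_double_quant=True,      # 2. Double Quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb_config, device_map="auto"
)

# 3. Paged optimizer: optimizer state is paged to CPU RAM under memory pressure
args = TrainingArguments(output_dir="out", optim="paged_adamw_32bit")
```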

 

New findings

1. Data quality matters far more than quantity

A 9k-sample dataset (OASST1) yielded better chatbot performance than a 450k-sample dataset (FLAN v2, subsampled).

2. MMLU performance does not guarantee chatbot performance

A high score on the MMLU (Massive Multitask Language Understanding) benchmark did not automatically mean strong results on chatbot benchmarks such as Vicuna.

 

Additional analysis

Human raters and GPT-4 were used together to evaluate models tournament-style, pitting them against each other (which model generates the better response to a given prompt?).

Tournament results are aggregated into Elo scores, which rank the chatbots.

GPT-4 and human judgments mostly agreed, though not in every case.


# Background

Block-wise k-bit Quantization

A method for representing data with fewer bits.

The whole tensor is normalized by its largest value and rescaled into the 8-bit range, e.g. $X^{Int8} = round\left(\frac{127}{absmax(X^{FP32})} \cdot X^{FP32}\right)$

 

Drawback

If the input data contains a very large value (an outlier), then, because the whole tensor is normalized to that maximum:

  • most of the "ordinary" values get squeezed into a very narrow part of the quantization range;
  • the quantization bins are left poorly utilized because of the outlier.
ex.
$X^{FP32} = [2.0,-1.0,0.0,8.0]$
$absmax(X^{FP32}) = max(|2.0|,|-1.0|,|0.0|,|8.0|) = 8.0$
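Continuing this example numerically (a plain-NumPy sketch of 8-bit absmax quantization, not the paper's code):

```python
import numpy as np

def absmax_quantize(x, bits=8):
    """Symmetric absmax quantization: scale by the largest magnitude."""
    levels = 2 ** (bits - 1) - 1            # 127 for 8-bit
    c = levels / np.max(np.abs(x))          # quantization constant
    return np.round(c * x).astype(np.int8), c

x = np.array([2.0, -1.0, 0.0, 8.0], dtype=np.float32)
x_int, c = absmax_quantize(x)
print(x_int)        # [ 32 -16   0 127]: the outlier claims the endpoint,
print(x_int / c)    # [~2.02 ~-1.01 0. 8.]: ordinary values sit in a narrow slice
```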

 

Solution

Split the data into blocks and quantize each block independently.

Since each block is normalized by its own maximum, the influence of outliers is contained and the available bit combinations are used much more effectively.
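A minimal sketch of the block-wise variant (assuming a flat tensor whose length is divisible by the block size):

```python
import numpy as np

def blockwise_absmax_quantize(x, block_size=64, bits=8):
    """Quantize each block with its own absmax constant."""
    levels = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    c = levels / np.max(np.abs(blocks), axis=1, keepdims=True)  # one constant per block
    return np.round(blocks * c).astype(np.int8), c

x = np.random.randn(4096).astype(np.float32)
x[0] = 20.0                              # an outlier now only distorts its own block
x_int, c = blockwise_absmax_quantize(x)
x_hat = (x_int / c).reshape(-1)          # dequantize: the per-block constants must be stored
print(np.abs(x - x_hat).max())
```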

 

Low-rank Adapters

The large pretrained parameters are kept frozen, and only small, trainable "adapters" (low-rank matrices) are added and trained.

The adapter adds a factorized term to the frozen projection, $Y = XW + s \cdot XL_1L_2$, where

  • $X$: input tensor ($X \in R^{b \times h}$)
  • $W$: frozen pretrained weights ($W \in R^{h \times o}$)
  • $L_1$: projection that reduces the dimension (down-projection) ($L_1 \in R^{h \times r}$)
  • $L_2$: projection that restores the dimension (up-projection) ($L_2 \in R^{r \times o}$)
  • $s$: scalar scaling factor for the adapter output
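A minimal PyTorch sketch of this layer (dimensions and the rank r are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank adapter L1, L2."""
    def __init__(self, h, o, r=16, s=1.0):
        super().__init__()
        self.W = nn.Linear(h, o, bias=False)     # pretrained weight, frozen
        self.W.weight.requires_grad_(False)
        self.L1 = nn.Linear(h, r, bias=False)    # down-projection (h -> r)
        self.L2 = nn.Linear(r, o, bias=False)    # up-projection   (r -> o)
        nn.init.zeros_(self.L2.weight)           # adapter starts as a no-op
        self.s = s

    def forward(self, X):                        # X: (b, h)
        return self.W(X) + self.s * self.L2(self.L1(X))   # Y = XW + s·X L1 L2
```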

 

Memory Requirement of Parameter-Efficient Finetuning

LoRA is an innovative PEFT method that cuts the number of trainable parameters, but in actual finetuning the memory is dominated not by those parameters but by the activation gradients produced during training. For example, for a 7B LLaMA model trained on FLAN v2, the LoRA parameters take up only 26MB (about 0.2% of the original model), while the input gradients occupy 567MB. Applying gradient checkpointing, which recomputes intermediate activations on demand instead of storing them, shrinks the input gradients from 567MB to 18MB.

This shows that shrinking the LoRA parameters further has little effect on total memory, and that, conversely, increasing the number of adapters to raise performance adds little memory cost. This design is key to recovering full 16-bit precision performance. Gradient checkpointing itself is a one-line toggle, as sketched below.
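For reference, in the transformers API (assuming the `model` loaded in the sketch above):

```python
# Trade compute for memory: recompute activations during backward instead of storing them
model.gradient_checkpointing_enable()
```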


# QLoRA Finetuning

1. 4-bit NormalFloat Quantization

(Figure: the green levels are the NormalFloat data type; the blue levels are uniform quantization.)

NF4 performs efficient, accurate 4-bit quantization by exploiting the distributional properties of model weights.

  • Ordinary quantization divides the range uniformly, which is inefficient because model weights mostly follow a normal distribution.
  • NormalFloat distributes its levels according to that distribution: densely near frequently occurring values, sparsely near rare ones.

Quantization steps

1. Precompute $2^k + 1$ quantiles of the theoretical standard normal distribution $N(0,1)$. -> This fixes the k-bit quantization data type once and for all.

: Since the levels never have to be recomputed for each model's weights, this is fast and efficient.

2. Normalize these quantiles into the range $[-1, 1]$.

3. Rescale the input weight tensor into $[-1, 1]$ as well, via absolute-max rescaling.

 

A symmetric k-bit quantization scheme cannot represent the value 0 exactly.
But an exact 0 matters: padding and other zero-valued elements must be quantized without error.

To solve this, an asymmetric data type is used.
Concretely, $2^{k-1}$ quantiles are estimated for the negative range and $2^{k-1} + 1$ for the positive range.
The two quantile sets are then merged, and one of the two zeros that appear in both halves is removed.

The result is a data type that uses all $2^k$ values while each bin carries the same expected probability mass; this data type is called k-bit NormalFloat (NFk).
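A simplified sketch of this construction (the paper's reference implementation offsets the outermost quantiles more carefully, since $\Phi^{-1}(0)$ and $\Phi^{-1}(1)$ are infinite; here the probability range is simply clipped):

```python
import numpy as np
from scipy.stats import norm

def normalfloat_levels(k=4, clip=0.03):
    """2^k NFk levels: equal-probability-mass quantiles of N(0,1) with an exact 0."""
    p_neg = np.linspace(clip, 0.5, 2 ** (k - 1))            # 2^(k-1) negative-side quantiles
    p_pos = np.linspace(0.5, 1 - clip, 2 ** (k - 1) + 1)    # 2^(k-1)+1 positive-side quantiles
    q = np.concatenate([norm.ppf(p_neg), norm.ppf(p_pos)])
    q = np.unique(q)                  # the duplicated zero (0.5-quantile) collapses to one
    return q / np.abs(q).max()        # normalize into [-1, 1]

print(normalfloat_levels(4))          # 16 asymmetric levels, one of them exactly 0.0
```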

 

2. Double Quantization

Double Quantization (DQ) quantizes the quantization constants themselves a second time, further reducing memory usage.

Background

When the weights are quantized, each block needs its own quantization constant (scale factor).

Smaller block sizes improve precision, but they also multiply the number of quantization constants, which creates memory overhead.

(e.g., 32-bit constants with block size 64 → an average of 0.5 extra bits per parameter)

 

1. Stage 1: quantize the weights (first quantization)

Quantizing the original float weights requires a scale factor:

$w_{int} = round(w_{fp32}/c_2)$

  • $c_2$: the per-block scale factor (quantization constant)
  • the $c_2$ values must be stored, and are typically kept in FP32 (32-bit float)

With a block size of 64, the constants cost

$32 \text{ bits} / 64 \text{ parameters} = 0.5 \text{ bits/parameter}$

2. Stage 2: quantize the scale factors too (second quantization)

Grouping the $c_2$ values into blocks and quantizing them as well is double quantization, and it cuts the storage cost further.

$c_{2}^{int8} = round\left((c_2 - \mu_{c_2})/c_1\right)$

  • $c_{2}^{int8}$: the quantized $c_2$, stored in 8 bits
  • $\mu_{c_2}$: the mean of the $c_2$ values (mean centering)
    : the mean is subtracted because the $c_2$ are all positive; centering makes them symmetric around zero, which improves quantization efficiency
  • $c_1$: the scale factor of the second quantization

 

Effect

With a block size of 64:

Memory cost of the constants: $32/64 = 0.5 \text{ bits}$ -> $8/64 + 32/(64 \cdot 256) = 0.127 \text{ bits}$

The $c_2$ values are themselves grouped into blocks of 256, so each group stores
  • one FP32 scale factor $c_1$
  • 256 quantized $c_2^{int8}$ values

On average this saves 0.373 bits per parameter.

 

3. Paged Optimizers

Paged Optimizers build on NVIDIA's Unified Memory feature, which automatically performs page-to-page transfers between CPU memory (RAM) and GPU memory when the GPU runs out of memory.

Put simply:

  • just as a PC swaps to disk (HDD/SSD) when RAM runs low,
  • data is automatically moved to CPU RAM when GPU memory runs out, then brought back when needed.
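In recent bitsandbytes versions this is exposed as paged optimizer classes (a usage sketch; `model` is assumed to be defined elsewhere):

```python
import bitsandbytes as bnb

# Drop-in replacement for torch.optim.AdamW whose state lives in unified
# memory and is paged between GPU and CPU RAM under memory pressure
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)
```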

 


# QLoRA

How the QLoRA Linear Layer Works

  • Storage:
    • Weights: 4-bit NormalFloat (NF4)
    • Scale factors: Double Quantization (block size 256)
  • Computation:
    • weights are dequantized and the matmul runs in BF16 precision
    • finetuning happens through the LoRA adapters (BF16)
  • Gradients:
    • no gradients are stored for the base weights
    • gradients are stored only for the adapter weights
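Putting the pieces together, the paper writes a single QLoRA linear layer as

$Y^{BF16} = X^{BF16} \, \text{doubleDequant}(c_1^{FP32}, c_2^{k\text{-bit}}, W^{NF4}) + X^{BF16} L_1^{BF16} L_2^{BF16}$

where doubleDequant first dequantizes the second-level constants and then the NF4 weights, so the base weights only ever exist in BF16 transiently during the forward pass.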

# Evaluation

Despite the large memory savings from QLoRA, the scores are nearly indistinguishable from the 16-bit baselines.

 


Reference

Videos that helped me visualize the method:

https://www.youtube.com/watch?v=6l8GZDPbFn8

https://www.youtube.com/watch?v=aZPAqBov3tQ

 

https://www.youtube.com/watch?v=XpoKB3usmKc