Low-Bit LLMs

A sourced reference on Low-Bit LLMs.

What is a low-bit LLM?

A low-bit LLM is a large language model whose weights and/or activations are stored using fewer bits than the standard 32- or 16-bit floating point. Common formats include 8-bit, 4-bit, and 2-bit integers, dramatically reducing memory footprint and enabling inference on consumer hardware. [Source: ACM]

What is quantization in the context of LLMs?

Quantization is the process of mapping a model's continuous high-precision floating-point weights to a discrete lower-precision representation, such as INT8 or INT4. It reduces model size and speeds up inference at the cost of a small, often acceptable, loss in accuracy. [Source: IEEE]
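
The mapping described above can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantization, assuming NumPy is available; the function names (quantize_int8, dequantize) and the random weights are hypothetical and do not come from any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization of float weights to INT8."""
    max_abs = float(np.max(np.abs(weights)))
    # One scale for the whole tensor: the largest magnitude maps to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

# Hypothetical weights standing in for one layer of a model.
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_error = float(np.max(np.abs(w - w_hat)))
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

The round-trip error is bounded by half a quantization step (scale / 2), which is the "small loss in accuracy" the definition refers to. Real methods usually use per-channel or per-group scales rather than one scale per tensor.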

Sources
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper
academic · IEEE Xplore / Google Brain · 2020-07-15
·

Why would you quantize a large language model?

Quantizing an LLM reduces its memory requirements by 2–8×, lowers inference latency, and cuts energy consumption, making billion-parameter models deployable on laptops, edge devices, and low-cost cloud instances without retraining from scratch. [Source: MLSys / arXiv]

Sources
·
MLSys 2024 Conference Proceedings
academic · MLSys Conference · 2024-05-01
·

What is post-training quantization (PTQ) for LLMs?

Post-training quantization applies bit-width reduction to an already-trained model without further gradient-based optimization, using only a small calibration dataset to determine quantization parameters. PTQ is fast and requires no access to the original training data. [Source: arXiv / NeurIPS]
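
As a rough sketch of the calibration step, the snippet below derives an activation scale from a handful of calibration batches and then reuses it on unseen inputs, with no gradient updates involved. The names (calibrate_scale, quantize) and the random "calibration data" are assumptions for illustration; production PTQ methods typically use more robust statistics than a plain maximum.

```python
import numpy as np

def calibrate_scale(calibration_batches, num_bits: int = 8) -> float:
    """Derive a symmetric quantization scale from observed activation ranges."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(float(np.max(np.abs(batch))) for batch in calibration_batches)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize(x: np.ndarray, scale: float, num_bits: int = 8) -> np.ndarray:
    """Quantize with a fixed, pre-computed scale (values out of range are clipped)."""
    qmax = 2 ** (num_bits - 1) - 1
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)

# Hypothetical calibration set: a few small batches of activations.
rng = np.random.default_rng(0)
calibration_batches = [rng.normal(size=(16, 32)) for _ in range(8)]
scale = calibrate_scale(calibration_batches)

# At inference time the scale is frozen and applied to new inputs.
x_new = rng.normal(size=(4, 32))
x_q = quantize(x_new, scale)
print(f"calibrated scale={scale:.5f}, quantized range=[{x_q.min()}, {x_q.max()}]")
```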

What is quantization-aware training (QAT) for LLMs?

Quantization-aware training simulates low-precision arithmetic during the forward pass while keeping full-precision weights for gradient updates, allowing the model to adapt its weights to quantization noise before deployment. QAT generally achieves better accuracy than PTQ but requires more compute. [Source: arXiv]
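
The idea can be illustrated on a toy linear regression problem. In this sketch (NumPy only, with hypothetical names and synthetic data), the forward pass sees 4-bit "fake-quantized" weights, while the gradient is applied to full-precision latent weights using the straight-through estimator, which treats the rounding step as the identity during the backward pass. It is a simplified illustration, not a recipe for training an actual LLM.

```python
import numpy as np

def fake_quantize(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Quantize then dequantize, so the forward pass sees low-precision values."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Synthetic regression task (assumed data, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
w_true = rng.normal(size=(8, 1))
y = x @ w_true

w = np.zeros((8, 1))  # full-precision latent weights kept for the updates
lr = 0.05
for step in range(200):
    w_q = fake_quantize(w)                     # forward pass uses quantized weights
    pred = x @ w_q
    grad_wq = 2.0 * x.T @ (pred - y) / len(x)  # gradient with respect to w_q
    w -= lr * grad_wq                          # straight-through: d(w_q)/d(w) treated as 1

initial_loss = float(np.mean(y ** 2))          # loss of the all-zero starting point
final_loss = float(np.mean((x @ fake_quantize(w) - y) ** 2))
print(f"loss with quantized weights: {initial_loss:.4f} -> {final_loss:.4f}")
```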

Sources
QLoRA: Efficient Finetuning of Quantized LLMs
academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23
·
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15
·

What is the GPTQ quantization method?

GPTQ is a one-shot weight quantization method for GPT-style models that uses approximate second-order information (Hessians) to minimize quantization error layer by layer. It can compress a 175-billion-parameter model to 4-bit in roughly four GPU-hours with minimal perplexity increase. [Source: arXiv / ICLR 2023]

What is AWQ (Activation-aware Weight Quantization)?

AWQ is a PTQ method that identifies the small fraction of weight channels (around 1%) most salient to model outputs, as determined by activation magnitudes, and protects them through per-channel scaling while quantizing all weights to INT4. It achieves accuracy close to full precision on a wide range of LLMs without backpropagation or reconstruction. [Source: arXiv / MLSys 2024]

Sources
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
academic · arXiv (MLSys 2024 accepted) · 2024-01-10
·
MLSys 2024 Conference Proceedings
academic · MLSys Conference · 2024-05-01
·

What is the GGUF file format for low-bit models?

GGUF (GPT-Generated Unified Format) is a binary container format introduced by the llama.cpp project to store quantized LLM weights along with metadata such as tokenizer vocabularies and model hyperparameters in a single, self-describing file optimized for CPU and GPU inference. [Source: GitHub / llama.cpp repository]
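
As a small illustration of the "self-describing" layout, the sketch below parses only the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count); the function name and the example values are hypothetical, and the metadata and tensor sections that follow the header are not parsed.

```python
import struct

GGUF_MAGIC = b"GGUF"
# magic (4 bytes), version (uint32), tensor_count (uint64), metadata_kv_count (uint64)
HEADER_FORMAT = "<4sIQQ"

def parse_gguf_header(data: bytes) -> dict:
    """Read the fixed-size GGUF header from the start of a byte buffer."""
    size = struct.calcsize(HEADER_FORMAT)
    if len(data) < size:
        raise ValueError("buffer is too small to contain a GGUF header")
    magic, version, tensor_count, kv_count = struct.unpack(HEADER_FORMAT, data[:size])
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic bytes)")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }

# Synthetic header built in memory; the counts are made-up example values.
example_header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 24)
header = parse_gguf_header(example_header)
print(header)
```

To inspect a real file, the same function can be given the first 24 bytes read from disk in binary mode.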

Sources
llama.cpp: LLM inference in C/C++ (official repository)
official · GitHub / Georgi Gerganov · 2024-11-01
·

What is QLoRA and how does it differ from standard quantization?

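QLoRA combines 4-bit NormalFloat (NF4) quantization of frozen base model weights with Low-Rank Adaptation (LoRA) adapters kept in 16-bit precision, enabling fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while matching full 16-bit fine-tuning performance. Unlike standard quantization, which only compresses a model for inference, QLoRA backpropagates gradients through the frozen quantized weights into the trainable adapters. [Source: arXiv / NeurIPS 2023]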

Sources
QLoRA: Efficient Finetuning of Quantized LLMs
academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23
·

How much GPU memory does a 4-bit quantized LLM require?

A 4-bit quantized model requires approximately 0.5 GB of VRAM per billion parameters for its weights, so a 7B model needs roughly 3.5–4 GB, a 13B model around 7 GB, and a 70B model approximately 35–40 GB, compared to roughly 14 GB, 26 GB, and 140 GB respectively at 16-bit precision. [Source: arXiv / Hugging Face research]
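
The arithmetic behind these figures is simple enough to check directly. The helper below is a back-of-the-envelope sketch (the function name is hypothetical) that counts weight storage only, taking 1 GB as 10^9 bytes; it ignores the KV cache, activations, and the per-group scales that quantized formats store, which is why practical figures sit somewhat above the raw result.

```python
def weight_memory_gb(num_params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

for size in (7, 13, 70):
    four_bit = weight_memory_gb(size, 4)
    sixteen_bit = weight_memory_gb(size, 16)
    print(f"{size}B parameters: {four_bit:.1f} GB at 4-bit, {sixteen_bit:.1f} GB at 16-bit")
```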

Sources
QLoRA: Efficient Finetuning of Quantized LLMs
academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23
·
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15
·

What hardware can run low-bit quantized LLMs?

4-bit and 8-bit quantized models can run on consumer NVIDIA RTX GPUs (8–24 GB VRAM), Apple Silicon Macs via Metal GPU acceleration, and even CPU-only systems using llama.cpp. Edge devices like the Raspberry Pi 5 can run small quantized models at limited speeds. [Source: GitHub / llama.cpp; Apple Developer Documentation]

Sources
llama.cpp: LLM inference in C/C++ (official repository)
official · GitHub / Georgi Gerganov · 2024-11-01
·
Metal Performance Shaders — Apple Developer Documentation
official · Apple Developer · 2024-06-10
·

Which quantization bit-width gives the best accuracy-vs-size trade-off?

Research consistently shows that 4-bit quantization offers the best practical trade-off: perplexity degradation versus a 16-bit baseline is typically below 1–2% for modern methods like GPTQ and AWQ, while halving memory relative to 8-bit and enabling a much wider range of deployment hardware. [Source: arXiv GPTQ; arXiv AWQ]

Sources
·
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
academic · arXiv (MLSys 2024 accepted) · 2024-01-10
·

What is a 1-bit LLM, and does it work?

A 1-bit LLM constrains every weight to {-1, +1} (BitNet) or {-1, 0, +1} (BitNet b1.58), replacing most multiplications in matrix products with additions. Microsoft Research reported that 1.58-bit models match full-precision (FP16) LLaMA-architecture baselines of the same size on language benchmarks from about 3B parameters upward, while cutting the estimated arithmetic energy of matrix multiplication by up to 71.4×. [Source: arXiv / Microsoft Research]
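
The ternary case can be sketched with the absmean rule described in the BitNet b1.58 paper: divide the weights by their mean absolute value, round, and clip to [-1, 1]. The code below is a simplified NumPy illustration with hypothetical function names and random weights; it also shows why a ternary matrix-vector product needs only additions and subtractions.

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8) -> tuple[np.ndarray, float]:
    """Map weights to {-1, 0, +1} by scaling with the mean absolute value."""
    gamma = float(np.mean(np.abs(w)))
    w_t = np.clip(np.round(w / (gamma + eps)), -1, 1).astype(np.int8)
    return w_t, gamma

def ternary_matvec(w_t: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Multiply-free product: add inputs where the weight is +1, subtract where it is -1."""
    added = np.where(w_t == 1, x, 0.0).sum(axis=1)
    subtracted = np.where(w_t == -1, x, 0.0).sum(axis=1)
    return added - subtracted

# Hypothetical layer: 6 outputs, 16 inputs.
rng = np.random.default_rng(0)
w = rng.normal(size=(6, 16))
x = rng.normal(size=16)

w_t, gamma = absmean_ternary(w)
y = ternary_matvec(w_t, x)  # rescaling by gamma would follow in a real layer
print(sorted(set(w_t.flatten().tolist())))
```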

Sources
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
academic · arXiv / Microsoft Research · 2024-02-27
·

What is LLM.int8() quantization?

LLM.int8() is an 8-bit matrix multiplication scheme that splits each matrix multiplication into two parts: an INT8 path for the vast majority of values and a 16-bit floating-point path for the small fraction of 'emergent' outlier feature dimensions, preserving accuracy while roughly halving memory vs FP16. [Source: arXiv / NeurIPS 2022]
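
A simplified NumPy sketch of this two-path decomposition follows. Feature dimensions whose activations reach a magnitude threshold (6.0, the value used in the paper) take a 16-bit path, and everything else is quantized to INT8 with one scale per activation row and per weight column. The function name and the test data are assumptions; this is not the bitsandbytes implementation, which runs fused GPU kernels.

```python
import numpy as np

def int8_matmul_with_outliers(x: np.ndarray, w: np.ndarray, threshold: float = 6.0) -> np.ndarray:
    """Approximate x @ w with an INT8 path plus a 16-bit path for outlier features."""
    outlier = np.any(np.abs(x) >= threshold, axis=0)  # feature dims containing outliers
    regular = ~outlier

    # 16-bit path: values are rounded to float16, products accumulated in float32.
    x_hi = x[:, outlier].astype(np.float16).astype(np.float32)
    w_hi = w[outlier, :].astype(np.float16).astype(np.float32)
    y_outlier = x_hi @ w_hi

    # INT8 path: one scale per activation row and one per weight column.
    x_r, w_r = x[:, regular], w[regular, :]
    sx = np.max(np.abs(x_r), axis=1, keepdims=True, initial=0.0) / 127.0
    sw = np.max(np.abs(w_r), axis=0, keepdims=True, initial=0.0) / 127.0
    sx[sx == 0] = 1.0
    sw[sw == 0] = 1.0
    x_q = np.round(x_r / sx).astype(np.int8)
    w_q = np.round(w_r / sw).astype(np.int8)
    y_int = x_q.astype(np.int32) @ w_q.astype(np.int32)  # integer accumulation
    y_regular = y_int * (sx * sw)                        # dequantize the result

    return y_regular + y_outlier

# Synthetic activations with two artificially inflated feature dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
x[:, [3, 17]] *= 20.0
w = rng.normal(scale=0.05, size=(64, 16))

y = int8_matmul_with_outliers(x, w)
reference = x @ w
relative_error = float(np.max(np.abs(y - reference)) / np.max(np.abs(reference)))
print(f"relative error vs full precision: {relative_error:.4f}")
```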

Sources
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15
·

What is SmoothQuant and why does it matter for INT8 inference?

SmoothQuant migrates quantization difficulty from activations to weights by applying a mathematically equivalent channel-wise smoothing factor, making both tensors easy to quantize to INT8 simultaneously. It enables 8-bit weight-and-activation quantization with negligible accuracy loss, at up to 1.56× speedup and about 2× memory reduction. [Source: arXiv / ICML 2023]
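
The smoothing step can be sketched as follows, using the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5. The activations are divided by s and the matching weight rows are multiplied by s, so the product is unchanged while the activation outliers shrink. The function name and the synthetic data are assumptions, and the statistics are taken from the same tensor being smoothed rather than from a separate calibration set.

```python
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Move quantization difficulty from activations (x) to weights (w)."""
    act_max = np.maximum(np.max(np.abs(x), axis=0), 1e-5)  # per input channel
    w_max = np.maximum(np.max(np.abs(w), axis=1), 1e-5)    # per input channel
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    return x / s, w * s[:, None], s

# Synthetic activations with one outlier channel, as is typical for LLMs.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 24))
x[:, 5] *= 50.0
w = rng.normal(scale=0.05, size=(24, 16))

x_s, w_s, s = smooth(x, w)
print(f"activation max before: {np.max(np.abs(x)):.2f}, after: {np.max(np.abs(x_s)):.2f}")
```

After smoothing, both x_s and w_s have a narrower per-channel range, so each can be quantized to INT8 with less clipping or rounding error.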

How much accuracy do you lose when quantizing an LLM to 4 bits?

For well-calibrated PTQ methods like GPTQ and AWQ applied to modern 7B–70B models, perplexity increases by roughly 0.1–0.5 points on WikiText-2 compared to FP16 baselines, and downstream task accuracy drops are typically under 1–2 percentage points, often imperceptible in practice. [Source: arXiv GPTQ; arXiv AWQ]
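
For reference, perplexity is the exponential of the average negative log-likelihood per token, so a "perplexity increase" is simply the difference between two such values. The sketch below uses made-up per-token log-probabilities purely to show the calculation; they are not measurements from any model.

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural logarithm)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical log-probabilities assigned to the same four tokens.
fp16_log_probs = [-1.61, -1.75, -1.58, -1.70]
int4_log_probs = [-1.63, -1.79, -1.60, -1.73]

gap = perplexity(int4_log_probs) - perplexity(fp16_log_probs)
print(f"perplexity gap (4-bit minus FP16): {gap:.3f}")
```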

Sources
·
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
academic · arXiv (MLSys 2024 accepted) · 2024-01-10
·

What are the main frameworks for running low-bit LLMs?

The leading open-source tools for running quantized LLMs are llama.cpp (GGUF, CPU/GPU), bitsandbytes as integrated with Hugging Face Transformers (INT8/4-bit, CUDA), vLLM (GPTQ/AWQ, production serving), and ONNX Runtime (cross-platform INT8). NVIDIA TensorRT-LLM provides optimized INT4/INT8 paths for datacenter GPUs. [Source: GitHub repositories; NVIDIA Developer Docs]

Sources
llama.cpp: LLM inference in C/C++ (official repository)
official · GitHub / Georgi Gerganov · 2024-11-01
·
NVIDIA TensorRT — Deep Learning Model Optimization
official · NVIDIA Developer · 2024-09-01
·

What is the difference between model sparsity and quantization for LLM compression?

Quantization reduces the numerical precision of every weight, while sparsity sets a fraction of weights to exactly zero and skips their computation. Both shrink effective model size; quantization is currently more hardware-friendly because sparse computation requires specialized hardware support to yield real-world speed gains. [Source: IEEE; arXiv]

Sources
Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper
academic · IEEE Xplore / Google Brain · 2020-07-15
·
·

Does quantization affect LLM safety alignment and refusal behavior?

Research published at IEEE S&P and arXiv shows that heavy quantization (2-bit and below) can degrade safety fine-tuning, causing increased harmful-output rates, while 4-bit and 8-bit quantization generally preserves alignment within 1–2% of the full-precision model on safety benchmarks. [Source: arXiv safety-quantization studies]

What is FP8 quantization and how does it compare to INT8 for LLMs?

FP8 (8-bit floating point) uses the E4M3 or E5M2 encodings (E5M2 follows IEEE 754 conventions, while E4M3 gives up infinities to extend its range). Both preserve a much wider dynamic range than INT8, although per-tensor scaling factors are still commonly applied to keep values within that range. NVIDIA Hopper (H100) GPUs include native FP8 Tensor Core support, enabling near-FP16 accuracy at up to twice the throughput of FP16. [Source: NVIDIA / IEEE]
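
To make the two encodings concrete, the sketch below decodes finite FP8 bit patterns, assuming the exponent biases given in the FP8 formats paper (7 for E4M3, 15 for E5M2). The function name is hypothetical and the Inf/NaN patterns are deliberately not handled.

```python
def decode_fp8(byte: int, exp_bits: int, man_bits: int, bias: int) -> float:
    """Decode a finite 8-bit float laid out as sign | exponent | mantissa."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = byte & ((1 << man_bits) - 1)
    fraction = mantissa / (1 << man_bits)
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed minimum exponent.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exponent - bias)

# Largest finite values: E4M3 reserves only the all-ones pattern for NaN,
# while E5M2 reserves the whole top exponent for Inf/NaN as in IEEE 754.
E4M3_MAX = decode_fp8(0x7E, exp_bits=4, man_bits=3, bias=7)
E5M2_MAX = decode_fp8(0x7B, exp_bits=5, man_bits=2, bias=15)
print(E4M3_MAX, E5M2_MAX)
```

The contrast with INT8, whose largest magnitude is 127 or 128 before scaling, is what gives FP8 its wider dynamic range: E4M3 trades range for precision, and E5M2 does the opposite.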

Sources
·
FP8 Formats for Deep Learning (IEEE Xplore)
academic · IEEE Xplore · 2023-05-15
·