Low-Bit LLMs

A sourced reference on Low-Bit LLMs.

What is a low-bit LLM?

A low-bit LLM is a large language model whose weights and/or activations are stored using fewer bits than the standard 32- or 16-bit floating-point formats. Common formats include 8-bit, 4-bit, and 2-bit integers, which sharply reduce the memory footprint and enable inference on consumer hardware. [Source: ACM]
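To make the storage saving concrete, here is a minimal Python sketch of how two 4-bit codes can share one byte. The function names and the low-nibble-first order are assumptions chosen for illustration; real formats define their own layouts.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad to an even count
    return bytes((lo & 0x0F) | ((hi & 0x0F) << 4)
                 for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4; `count` drops the padding nibble, if any."""
    out = []
    for byte in packed:
        out.append(byte & 0x0F)
        out.append(byte >> 4)
    return out[:count]


codes = [3, 15, 0, 7, 9]
packed = pack_int4(codes)
print(len(codes), "codes ->", len(packed), "bytes")
```

Five codes fit in three bytes here, where one 16-bit value per weight would take ten.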

Sources

A Survey of Quantization Methods for Efficient Neural Network Inference

academic · ACM Computing Surveys · 2023-09-01

What is quantization in the context of LLMs?

Quantization is the process of mapping a model's continuous high-precision floating-point weights to a discrete lower-precision representation, such as INT8 or INT4. It reduces model size and speeds up inference at the cost of a small, often acceptable, loss in accuracy. [Source: IEEE]
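The mapping described above can be sketched in a few lines of Python. This is a minimal illustration that assumes symmetric "absmax" scaling with a single scale for the whole list; the function names are hypothetical, and production methods usually use per-channel or per-group scales.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric quantization: map floats to integers in [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the integer codes."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.89]
q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)
err8 = max(abs(a - b) for a, b in zip(weights, dequantize(q8, s8)))
err4 = max(abs(a - b) for a, b in zip(weights, dequantize(q4, s4)))
print(q8, err8)
print(q4, err4)
```

The rounding error of each value is bounded by half the scale, which is why the 4-bit codes reconstruct the weights less precisely than the 8-bit ones.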

Sources

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper

academic · IEEE Xplore / Google Brain · 2020-07-15

Why would you quantize a large language model?

Quantizing an LLM reduces its memory requirements by 2–8×, lowers inference latency, and cuts energy consumption, making billion-parameter models deployable on laptops, edge devices, and low-cost cloud instances without retraining from scratch. [Source: MLSys / arXiv]

Sources

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

MLSys 2024 Conference Proceedings

academic · MLSys Conference · 2024-05-01

What is post-training quantization (PTQ) for LLMs?

Post-training quantization applies bit-width reduction to an already-trained model without further gradient-based optimization, using only a small calibration dataset to determine quantization parameters. PTQ is fast and requires no access to the original training data. [Source: arXiv / ICLR 2023; ICML 2023]
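As a rough illustration of the calibration step, the sketch below derives one activation scale from a few calibration batches and then reuses it, unchanged, on new inputs. It assumes the simplest possible rule (the largest absolute value seen); it is not the GPTQ or SmoothQuant procedure, and the function names are hypothetical.

```python
def calibrate_scale(calibration_batches, bits=8):
    """Pick a fixed scale from the largest magnitude seen during calibration."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_with_scale(values, scale, bits=8):
    """Quantize with the pre-computed scale; out-of-range values are clipped."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


calibration_batches = [[0.1, -0.8, 0.3], [2.0, -0.5, 1.2]]
scale = calibrate_scale(calibration_batches)
print(quantize_with_scale([0.5, -3.0], scale))
```

The second input value lies outside the calibrated range and is clipped, which is why the calibration set should be representative of real inputs.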

Sources

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

academic · arXiv (ICML 2023 accepted) · 2023-07-14

What is quantization-aware training (QAT) for LLMs?

Quantization-aware training simulates low-precision arithmetic during the forward pass while keeping full-precision weights for gradient updates, allowing the model to adapt its weights to quantization noise before deployment. QAT generally achieves better accuracy than PTQ but requires more compute. [Source: arXiv]
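A toy Python sketch of this idea follows. It fits a single weight, rounds that weight to a coarse grid in the forward pass, and applies the gradient to the full-precision copy (a straight-through estimator). The grid step, learning rate, and data are assumptions picked so the example converges cleanly; real QAT runs inside a deep-learning framework.

```python
def fake_quant(w, step=0.25):
    """Snap a weight to the nearest grid point but keep it as a float."""
    return round(w / step) * step


def train_qat(xs, ys, w=0.1, lr=0.01, epochs=200, step=0.25):
    """Fit y = w * x with a quantized forward pass and a straight-through gradient."""
    for _ in range(epochs):
        w_q = fake_quant(w, step)      # forward pass sees the low-precision weight
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad                 # update is applied to the full-precision copy
    return w


xs = [1.0, 2.0, 3.0]
ys = [0.75 * x for x in xs]            # toy target that lies on the grid
w_latent = train_qat(xs, ys)
print(w_latent, fake_quant(w_latent))
```

The full-precision weight stops moving as soon as its quantized value reproduces the target, so the stored weight and the deployed weight differ.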

Sources

QLoRA: Efficient Finetuning of Quantized LLMs

academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15

What is the GPTQ quantization method?

GPTQ is a one-shot weight quantization method for GPT-style models that uses approximate second-order information (Hessians) to minimize quantization error layer by layer. It can compress a 175-billion-parameter model to 4-bit in roughly four GPU-hours with minimal perplexity increase. [Source: arXiv / ICLR 2023]

Sources

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

What is AWQ (Activation-aware Weight Quantization)?

AWQ is a PTQ method that identifies the small fraction of weights (around 1%) most salient to model outputs, judged by activation magnitudes rather than weight magnitudes, and protects them with per-channel scaling while quantizing all weights to low-bit integers such as INT4. It achieves accuracy close to full precision on a wide range of LLMs without backpropagation or reconstruction. [Source: arXiv / MLSys 2024]

Sources

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

academic · arXiv (MLSys 2024 accepted) · 2024-01-10

MLSys 2024 Conference Proceedings

academic · MLSys Conference · 2024-05-01

What is the GGUF file format for low-bit models?

GGUF (GPT-Generated Unified Format) is a binary container format introduced by the llama.cpp project to store quantized LLM weights along with metadata such as tokenizer vocabularies and model hyperparameters in a single, self-describing file optimized for CPU and GPU inference. [Source: GitHub / llama.cpp repository]
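The "self-describing" part starts with a small fixed header. The Python sketch below reads it, assuming the little-endian layout used by GGUF versions 2 and 3 (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key-value count). The function name and the sample values are hypothetical; check the llama.cpp repository for the authoritative layout.

```python
import struct

GGUF_MAGIC = b"GGUF"


def parse_gguf_header(data):
    """Read the fixed-size header fields from the first 24 bytes of a GGUF file."""
    magic, version, tensor_count, kv_count = struct.unpack_from("<4sIQQ", data, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration only (not taken from a real model file).
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
print(parse_gguf_header(header))
```

With a real file, the same function could be applied to the first 24 bytes read in binary mode.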

Sources

llama.cpp: LLM inference in C/C++ (official repository)

official · GitHub / Georgi Gerganov · 2024-11-01

What is QLoRA and how does it differ from standard quantization?

QLoRA combines 4-bit NormalFloat (NF4) quantization of frozen base model weights with Low-Rank Adaptation (LoRA) adapters kept in 16-bit precision. Unlike standard quantization, which targets inference, QLoRA targets fine-tuning: gradients flow through the frozen 4-bit weights into the adapters, enabling fine-tuning of a 65-billion-parameter model on a single 48 GB GPU while matching full 16-bit fine-tuning performance. [Source: arXiv / NeurIPS 2023]
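The forward pass of such a layer can be sketched in plain Python as below. Note the assumptions: a simple per-tensor absmax 4-bit quantizer stands in for NF4, the matrices are tiny, and all names and the `alpha` and `r` values are hypothetical.

```python
def quantize_4bit(matrix):
    """Per-tensor absmax 4-bit quantization (a stand-in for NF4)."""
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [[max(-7, min(7, round(v / scale))) for v in row] for row in matrix]
    return codes, scale


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def qlora_forward(codes, scale, A, B, x, alpha=16, r=2):
    """y = dequant(W) x + (alpha / r) * B (A x); only A and B would be trained."""
    W = [[c * scale for c in row] for row in codes]   # dequantize frozen base weights
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


W = [[0.7, -0.3], [0.14, 0.49]]           # frozen base weights (d_out x d_in)
codes, scale = quantize_4bit(W)
A = [[0.1, 0.2], [-0.3, 0.4]]             # r x d_in, trainable
B = [[0.0, 0.0], [0.0, 0.0]]              # d_out x r, zero-initialised as in LoRA
x = [1.0, 2.0]
print(qlora_forward(codes, scale, A, B, x))
```

Because `B` starts at zero, the adapted layer initially reproduces the quantized base model; training then moves only `A` and `B`.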

Sources

QLoRA: Efficient Finetuning of Quantized LLMs

academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23

How much GPU memory does a 4-bit quantized LLM require?

A 4-bit quantized model requires approximately 0.5 GB of VRAM per billion parameters for its weights, so a 7B model needs roughly 3.5–4 GB, a 13B model around 7 GB, and a 70B model approximately 35–40 GB, compared to roughly 14 GB, 26 GB, and 140 GB respectively at 16-bit precision. [Source: arXiv / Hugging Face research]
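The arithmetic behind these figures is simply parameters times bits divided by eight. A minimal Python sketch follows, assuming 1 GB = 10^9 bytes and counting weights only; the KV cache, activations, and per-group quantization scales add to the real total. The function name is hypothetical.

```python
def weight_memory_gb(params_billion, bits):
    """Approximate weight storage in GB (10^9 bytes) at a given bit-width."""
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


for size in (7, 13, 70):
    print(f"{size}B: {weight_memory_gb(size, 4):.1f} GB at 4-bit, "
          f"{weight_memory_gb(size, 16):.1f} GB at 16-bit")
```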

Sources

QLoRA: Efficient Finetuning of Quantized LLMs

academic · arXiv (NeurIPS 2023 accepted) · 2023-05-23

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15

What hardware can run low-bit quantized LLMs?

4-bit and 8-bit quantized models can run on consumer NVIDIA RTX GPUs (8–24 GB VRAM), Apple Silicon Macs via Metal GPU acceleration, and even CPU-only systems using llama.cpp. Edge devices like the Raspberry Pi 5 can run small quantized models at limited speeds. [Source: GitHub / llama.cpp; Apple Developer Documentation]

Sources

llama.cpp: LLM inference in C/C++ (official repository)

official · GitHub / Georgi Gerganov · 2024-11-01

Metal Performance Shaders — Apple Developer Documentation

official · Apple Developer · 2024-06-10

Which quantization bit-width gives the best accuracy-vs-size trade-off?

Research consistently shows that 4-bit quantization offers the best practical trade-off: perplexity degradation versus a 16-bit baseline is typically below 1–2% for modern methods like GPTQ and AWQ, while halving memory relative to 8-bit and enabling a much wider range of deployment hardware. [Source: arXiv GPTQ; arXiv AWQ]

Sources

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

academic · arXiv (MLSys 2024 accepted) · 2024-01-10

What is a 1-bit LLM, and does it work?

A 1-bit LLM constrains every weight to {-1, +1} (BitNet) or {-1, 0, +1} (BitNet b1.58), so matrix multiplication needs almost no multiplications, only additions. Microsoft Research reported that, from about 3B parameters upward, BitNet b1.58 matches a same-size FP16 LLaMA-architecture baseline in perplexity and end-task performance, and estimated up to 71.4× lower arithmetic energy for matrix multiplication. [Source: arXiv / Microsoft Research]
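A small Python sketch of the ternary case is given below. It assumes the "absmean" rule (scale by the mean absolute weight, round, then clip to -1, 0, or +1) and shows a dot product that uses only additions and subtractions. The names are hypothetical, and BitNet models are trained with this constraint rather than converted afterwards.

```python
def ternary_quantize(weights, eps=1e-8):
    """Scale by the mean absolute value, then round and clip to {-1, 0, +1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    codes = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return codes, gamma


def ternary_dot(codes, x):
    """Dot product with ternary weights: additions and subtractions only."""
    total = 0.0
    for code, xi in zip(codes, x):
        if code == 1:
            total += xi
        elif code == -1:
            total -= xi
    return total


weights = [0.9, -0.05, -1.4, 0.3]
codes, gamma = ternary_quantize(weights)
x = [2.0, 5.0, 1.0, 4.0]
print(codes, gamma * ternary_dot(codes, x))
```

The single multiplication by `gamma` happens once per output, not once per weight.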

Sources

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

academic · arXiv / Microsoft Research · 2024-02-27

What is LLM.int8() quantization?

LLM.int8() is an 8-bit matrix multiplication scheme that splits each matrix multiplication into two parts: a standard INT8 path for the vast majority of values and a 16-bit floating-point path for the small fraction of 'emergent' outlier feature dimensions. This preserves accuracy while roughly halving memory versus FP16. [Source: arXiv / NeurIPS 2022]
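The two-path idea can be sketched in plain Python for a single input vector. Assumptions in this sketch: an outlier threshold of 6.0, absmax scaling per input vector and per output column, and ordinary Python floats standing in for FP16. The function names are hypothetical and this is not the bitsandbytes implementation.

```python
THRESHOLD = 6.0  # assumed outlier magnitude threshold


def absmax_int8(values):
    """Quantize a list to INT8 codes with one absmax scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def mixed_precision_matvec(x, W):
    """Compute y = x @ W, where W[i] is the weight row for input feature i."""
    n_out = len(W[0])
    outliers = [i for i, v in enumerate(x) if abs(v) >= THRESHOLD]
    regular = [i for i in range(len(x)) if i not in outliers]

    # High-precision path: outlier features are multiplied in floating point.
    y = [sum(x[i] * W[i][j] for i in outliers) for j in range(n_out)]

    # INT8 path: integer multiply-accumulate, then rescale to floats.
    if regular:
        xq, sx = absmax_int8([x[i] for i in regular])
        for j in range(n_out):
            wq, sw = absmax_int8([W[i][j] for i in regular])
            acc = sum(a * b for a, b in zip(xq, wq))
            y[j] += acc * sx * sw
    return y


x = [0.5, -0.3, 12.0, 0.1]                 # feature 2 is an outlier
W = [[0.2, -0.1], [0.4, 0.3], [0.05, -0.02], [-0.6, 0.7]]
print(mixed_precision_matvec(x, W))
```

Keeping the outlier feature out of the INT8 path stops its large magnitude from inflating the scale and crushing the resolution of the ordinary values.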

Sources

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

academic · arXiv (NeurIPS 2022 accepted) · 2022-11-15

What is SmoothQuant and why does it matter for INT8 inference?

SmoothQuant mathematically migrates quantization difficulty from activations to weights by applying a per-channel smoothing factor: activations are divided by it and the corresponding weights are multiplied by it, so the layer output is unchanged while both tensors become easy to quantize to INT8. It enables 8-bit weight-and-activation quantization with negligible accuracy loss and up to 1.56× speedup on NVIDIA A100 GPUs. [Source: arXiv / ICML 2023]
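A minimal Python sketch of the smoothing step follows. It assumes the per-channel factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha) with alpha = 0.5, nonzero channel maxima, and tiny hand-made matrices; the function names are hypothetical.

```python
def matmul(X, W):
    return [[sum(x * W[k][j] for k, x in enumerate(row)) for j in range(len(W[0]))]
            for row in X]


def smooth_scales(X, W, alpha=0.5):
    """One factor per input channel j, from activation and weight maxima."""
    scales = []
    for j in range(len(W)):
        act_max = max(abs(row[j]) for row in X)
        w_max = max(abs(v) for v in W[j])
        scales.append(act_max ** alpha / w_max ** (1 - alpha))
    return scales


def apply_smoothing(X, W, scales):
    """Divide activations and multiply weights by the same per-channel factor."""
    X_s = [[v / s for v, s in zip(row, scales)] for row in X]
    W_s = [[v * s for v in W[j]] for j, s in enumerate(scales)]
    return X_s, W_s


X = [[0.1, 20.0], [-0.2, -16.0]]           # tokens x channels; channel 1 has outliers
W = [[0.5, -0.4], [0.02, 0.03]]            # channels x outputs
X_s, W_s = apply_smoothing(X, W, smooth_scales(X, W))
print(max(abs(v) for row in X for v in row), max(abs(v) for row in X_s for v in row))
print(matmul(X, W), matmul(X_s, W_s))
```

The product is unchanged up to floating-point rounding, while the largest activation magnitude drops sharply, so one INT8 scale now covers the activations with far less rounding error.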

Sources

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

academic · arXiv (ICML 2023 accepted) · 2023-07-14

How much accuracy do you lose when quantizing an LLM to 4 bits?

For well-calibrated PTQ methods like GPTQ and AWQ applied to modern 7B–70B models, perplexity increases by roughly 0.1–0.5 points on WikiText-2 compared to FP16 baselines, and downstream task accuracy drops are typically under 1–2 percentage points, often imperceptible in practice. [Source: arXiv GPTQ; arXiv AWQ]
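For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so a gap of a few tenths of a point is small on typical baselines. A minimal Python sketch, with a hypothetical function name and made-up log-probabilities:

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))
```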

Sources

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

academic · arXiv (MLSys 2024 accepted) · 2024-01-10

What are the main frameworks for running low-bit LLMs?

The leading open-source inference frameworks for quantized LLMs are llama.cpp (GGUF, CPU/GPU), bitsandbytes (INT8/4-bit, CUDA, integrated with Hugging Face Transformers), vLLM (GPTQ/AWQ, production serving), and ONNX Runtime (cross-platform INT8). NVIDIA TensorRT-LLM provides optimized INT4/INT8 paths for datacenter GPUs. [Source: GitHub repositories; NVIDIA Developer Docs]

Sources

llama.cpp: LLM inference in C/C++ (official repository)

official · GitHub / Georgi Gerganov · 2024-11-01

NVIDIA TensorRT — Deep Learning Model Optimization

official · NVIDIA Developer · 2024-09-01

What is the difference between model sparsity and quantization for LLM compression?

Quantization reduces the numerical precision of every weight, while sparsity sets a fraction of weights to exactly zero and skips their computation. Both shrink effective model size; quantization is currently more hardware-friendly because sparse computation requires specialized hardware support to yield real-world speed gains. [Source: IEEE; arXiv]
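The contrast can be seen on a four-element weight list. The sketch below uses magnitude pruning as the sparsity method and symmetric 4-bit rounding as the quantizer; both choices and the function names are assumptions made for illustration, since the text above does not name specific techniques.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_dequantize(weights, bits=4):
    """Round every weight to a symmetric low-bit grid and map it back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


weights = [0.42, -1.27, 0.05, 0.89]
print(magnitude_prune(weights, 0.5))   # half the entries become exactly zero
print(quantize_dequantize(weights))    # every entry changes slightly
```

Pruning leaves the surviving weights untouched but removes others entirely, whereas quantization perturbs all weights a little.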

Sources

Quantizing Deep Convolutional Networks for Efficient Inference: A Whitepaper

academic · IEEE Xplore / Google Brain · 2020-07-15

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

academic · arXiv (ICLR 2023 accepted) · 2023-02-22

Does quantization affect LLM safety alignment and refusal behavior?

Research published at IEEE S&P and arXiv shows that heavy quantization (2-bit and below) can degrade safety fine-tuning, causing increased harmful-output rates, while 4-bit and 8-bit quantization generally preserves alignment within 1–2% of the full-precision model on safety benchmarks. [Source: arXiv safety-quantization studies]

Sources

Quantization Hurts: Effects of Quantization on LLM Safety and Alignment

academic · arXiv · 2024-02-08

What is FP8 quantization and how does it compare to INT8 for LLMs?

FP8 (8-bit floating point) uses the IEEE 754-style E4M3 or E5M2 encodings, which preserve a much wider dynamic range than INT8, although per-tensor scaling factors are still typically used to keep values within the representable range. NVIDIA Hopper (H100) GPUs include native FP8 Tensor Core support, enabling near-FP16 accuracy at up to twice the peak throughput of FP16. [Source: NVIDIA / IEEE]
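To show what an FP8 byte encodes, the Python sketch below decodes the E4M3 variant. It assumes 1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and a single NaN pattern per sign. The function name is hypothetical; E5M2 would need a different bias and special-value handling.

```python
import math


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan                                     # the only NaN encoding
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** (1 - 7)       # subnormal values
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


for byte in (0x38, 0x7E, 0x01, 0xC0):
    print(hex(byte), decode_e4m3(byte))
```

Under these assumptions the largest finite value is 448, which is why a scaling factor is still needed for tensors with larger magnitudes.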

Sources

FP8 Formats for Deep Learning — NVIDIA Transformer Engine Documentation

official · NVIDIA Developer · 2024-03-15

FP8 Formats for Deep Learning (IEEE Xplore)

academic · IEEE Xplore · 2023-05-15
