NLP & Large Language Models

Language models, fine-tuning, prompt engineering, RAG

What is a large language model (LLM)?

A large language model is a neural network trained on massive text datasets to predict and generate human-like text. LLMs use transformer architectures with billions of parameters, enabling tasks like translation, summarization, and question answering without task-specific training. [Source: Google Research]

Sources
Attention Is All You Need
academic · arXiv (Google Brain / Google Research) · 2017-06-12
Large Language Models: A New Moore's Law?
primary · Google Research · 2021-10-04

How do transformer models work in NLP?

Transformers process entire input sequences simultaneously using a self-attention mechanism that weighs relationships between all tokens at once. This architecture, introduced in the 2017 paper 'Attention Is All You Need', dispenses with recurrence, so training can be parallelized across the sequence, which made scaling to billions of parameters practical. [Source: Google Brain / arXiv]
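
The self-attention step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a single attention head, skips the learned query/key/value projections, and uses a made-up 3-token sequence of 2-dimensional vectors.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence.

    Each argument is a list of equal-length vectors, one per token.
    Returns one output vector per token plus the attention weights.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token's query against every token's key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors using the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical 3-token sequence; real models use learned projections of embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` depends only on that token's query and the full set of keys, so the rows can be computed independently. That independence is what allows the parallel processing the answer refers to.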

Sources
Attention Is All You Need
academic · arXiv (Google Brain / Google Research) · 2017-06-12

What is prompt engineering and why does it matter?

Prompt engineering is the practice of designing inputs to guide LLM outputs toward desired results. Because LLMs are sensitive to phrasing, few-shot examples, and instruction framing, well-crafted prompts can dramatically improve accuracy and relevance without modifying model weights. [Source: OpenAI]


What is fine-tuning an LLM and when should you use it?

Fine-tuning updates a pretrained LLM's weights on a smaller, domain-specific dataset to improve performance on targeted tasks. It is best used when consistent style, specialized vocabulary, or domain accuracy is required and cannot be achieved reliably through prompt engineering alone. [Source: OpenAI]

Sources
Fine-tuning – OpenAI Platform Documentation
official · OpenAI · 2024-01-01

What is Retrieval-Augmented Generation (RAG)?

Retrieval-Augmented Generation combines a retrieval system, typically a vector database, with an LLM so the model can draw on up-to-date or proprietary documents at inference time. This can reduce hallucinations and extends knowledge beyond the model's training cutoff without retraining. [Source: Meta AI / arXiv]
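
A rough sketch of the retrieve-then-prompt flow follows. It makes several assumptions: `embed` is a toy word-count stand-in for a real embedding model, the documents and prompt wording are invented, and the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Inject the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical knowledge base.
docs = [
    "The refund window is 30 days from delivery.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping to Canada takes about one week.",
]
prompt = build_prompt("How long is the refund window?", docs)
```

The resulting `prompt` would then be sent to the model, so the answer is grounded in the retrieved text and not only in what the model memorized during training.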

Sources
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
academic · arXiv (Meta AI Research) · 2020-05-22

What is Reinforcement Learning from Human Feedback (RLHF)?

RLHF is a training technique where human raters rank model outputs, and those preferences are used to train a reward model that then guides further LLM fine-tuning via reinforcement learning. OpenAI used RLHF to align InstructGPT and ChatGPT with human intent. [Source: OpenAI / arXiv]
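
The reward-model step can be illustrated with a pairwise preference loss. The formula below, -log(sigmoid(r_chosen - r_rejected)), is the form commonly used for reward models and is stated here as an assumption; the text above does not specify a loss. The reward values in the example are invented.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the human-preferred output scores higher.

    Computes -log(sigmoid(margin)) in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical reward scores for two candidate outputs.
agrees_with_rater = preference_loss(2.0, 0.0)     # reward model ranks them like the rater
disagrees_with_rater = preference_loss(0.0, 2.0)  # reward model ranks them the other way
```

Minimizing this loss over many ranked pairs pushes the reward model to score preferred outputs higher. The reinforcement learning stage then optimizes the LLM against that learned reward.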


What is few-shot prompting in large language models?

Few-shot prompting provides an LLM with a small number of input-output examples directly in the prompt to demonstrate the desired task format. GPT-3's original paper showed that this in-context learning approach enables strong performance on new tasks without any gradient updates. [Source: OpenAI / arXiv]
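
As an illustration, a few-shot prompt can be assembled from input-output pairs like this. The `Input:` / `Output:` labels and the translation examples are assumptions chosen for the sketch; any consistent format serves the same purpose.

```python
def few_shot_prompt(instruction, examples, query):
    """Format (input, output) pairs followed by the unanswered query."""
    lines = [instruction, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # End with the new input and an empty output slot for the model to fill.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


# Hypothetical demonstrations for a translation task.
examples = [("cheese", "fromage"), ("bread", "pain")]
prompt = few_shot_prompt("Translate English to French.", examples, "apple")
```

The model sees the pattern only in the prompt text. No weights change, which is what "without any gradient updates" means in practice.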

Sources
Language Models are Few-Shot Learners
academic · arXiv (OpenAI) · 2020-05-28

What is chain-of-thought prompting?

Chain-of-thought (CoT) prompting encourages LLMs to produce intermediate reasoning steps before a final answer, mimicking human problem-solving. Google Research demonstrated that this technique substantially improves accuracy on arithmetic, commonsense, and symbolic reasoning benchmarks across large models. [Source: Google Research / arXiv]

Sources
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
academic · arXiv (Google Research) · 2022-01-28

What is BERT and how does it differ from GPT-style models?

BERT (Bidirectional Encoder Representations from Transformers) is a Google model pretrained to understand context from both directions simultaneously, making it strong at classification and extraction tasks. GPT-style models are unidirectional, predicting the next token, making them better suited for text generation. [Source: Google Research / arXiv]
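
One way to picture the difference is through the attention mask each model family uses. The sketch below is a simplified view (1 means "may attend", 0 means "blocked") with hypothetical function names; it leaves out padding masks and everything else about the two architectures.

```python
def bidirectional_mask(n):
    """BERT-style: every token may attend to every other token."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """GPT-style: token i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


bert_like = bidirectional_mask(3)
gpt_like = causal_mask(3)
```

In the causal mask, row `i` contains zeros for all later positions, so the prediction for the next token never sees the tokens it is supposed to predict.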


What is LoRA and how does it make LLM fine-tuning more efficient?

Low-Rank Adaptation (LoRA) fine-tunes LLMs by freezing the pretrained weights and injecting small trainable rank-decomposition matrices into model layers rather than updating all weights. For GPT-3 175B, the paper reports up to 10,000× fewer trainable parameters and roughly 3× lower GPU memory requirements than full fine-tuning, which makes fine-tuning feasible on far more modest hardware. [Source: Microsoft Research / arXiv]
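
The low-rank update can be sketched for a single weight matrix. This is a minimal illustration under assumed shapes (a 4096×4096 matrix and rank 8, both hypothetical) and plain Python lists. The per-matrix saving it computes is not the same as the whole-model figure reported in the paper.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale=1.0):
    """Compute W x + scale * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """Return (weights in the full matrix, trainable weights in A and B)."""
    return d_out * d_in, rank * (d_in + d_out)


# Tiny example: 2x2 frozen weight, rank-1 adapter with B initialized to zero.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # shape (rank, d_in)
B_zero = [[0.0], [0.0]]   # shape (d_out, rank)
y_start = lora_forward([2.0, 3.0], W, A, B_zero)

full, trainable = lora_param_counts(4096, 4096, 8)
```

With `B` at zero the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behavior. For the assumed shapes, `trainable` is 256 times smaller than `full`.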

Sources
LoRA: Low-Rank Adaptation of Large Language Models
academic · arXiv (Microsoft Research) · 2021-06-17

What is QLoRA and how does quantization help LLM fine-tuning?

QLoRA combines 4-bit quantization with LoRA adapters, allowing a 65-billion-parameter model to be fine-tuned on a single 48GB GPU without significant accuracy loss. Developed at the University of Washington, it enables high-quality instruction tuning on widely available hardware. [Source: University of Washington / arXiv]
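
A back-of-the-envelope calculation shows why 4-bit storage matters for the 65-billion-parameter case. The helper below is a hypothetical sketch that counts only the base model weights. It ignores quantization constants, LoRA adapters, activations, and optimizer state, so real memory use is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 65e9
memory_16bit = weight_memory_gb(params, 16)  # weights stored in 16-bit precision
memory_4bit = weight_memory_gb(params, 4)    # weights stored in 4-bit precision
gpu_memory = 48                              # the single-GPU budget from the text
```

Under these assumptions the 16-bit weights alone exceed the 48GB budget several times over, while the 4-bit weights leave headroom for the adapters and training overhead.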

Sources
QLoRA: Efficient Finetuning of Quantized LLMs
academic · arXiv (University of Washington) · 2023-05-23

What is a vector database and how is it used in LLM applications?

A vector database stores high-dimensional numerical embeddings of text, images, or other data, enabling fast semantic similarity search. In LLM pipelines — especially RAG — vector databases retrieve contextually relevant documents to inject into prompts, grounding model responses in specific knowledge. [Source: IEEE]
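
The core operation, storing vectors and returning the nearest ones, can be sketched with a tiny in-memory class. `TinyVectorStore` is a hypothetical name and the 2-dimensional vectors are invented. The search here is brute force, whereas production systems typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


class TinyVectorStore:
    """Brute-force cosine-similarity search over stored vectors."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        self._items.append((list(vector), payload))

    def search(self, query, k=1):
        """Return the k (score, payload) pairs most similar to the query."""
        scored = [(self._cosine(query, vec), payload) for vec, payload in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0


# Hypothetical embeddings; a real pipeline would get them from an embedding model.
store = TinyVectorStore()
store.add([1.0, 0.0], "doc about cats")
store.add([0.0, 1.0], "doc about taxes")
store.add([0.9, 0.1], "doc about kittens")
hits = store.search([1.0, 0.0], k=2)
```

In a RAG pipeline the payloads in `hits` would be the passages inserted into the prompt.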

What are text embeddings in NLP?

Text embeddings are dense numerical vector representations of words, sentences, or documents that capture semantic meaning, so that similar meanings map to nearby points in vector space. They are foundational to search, classification, clustering, and retrieval tasks in modern NLP systems. [Source: Google Research / arXiv]
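
"Nearby points in vector space" is usually measured with cosine similarity. The sketch below uses invented 3-dimensional vectors purely for illustration. Real embeddings come from a trained model and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings, hand-written so related words point the same way.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.2],
    "invoice": [0.1, 0.2, 0.95],
}

related = cosine_similarity(embeddings["cat"], embeddings["kitten"])
unrelated = cosine_similarity(embeddings["cat"], embeddings["invoice"])
```

Here `related` comes out much higher than `unrelated`. That ordering is the property that search, clustering, and retrieval systems build on.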

Sources
Efficient Estimation of Word Representations in Vector Space
academic · arXiv (Google Research) · 2013-01-16
Get text embeddings – Vertex AI Generative AI
official · Google Cloud · 2024-03-01

How can you reduce hallucinations in large language models?

Key strategies include grounding responses via Retrieval-Augmented Generation, providing explicit source documents in context, using temperature settings closer to zero, applying output validation layers, and fine-tuning on high-quality factual data. No single technique eliminates hallucinations entirely; combinations work best. [Source: Stanford HAI]
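
The effect of the temperature setting mentioned above can be seen by applying it to a softmax over next-token scores. The logits below are invented and the function name is hypothetical. The sketch shows the standard temperature scaling, not any particular vendor's sampler.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy decoding")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]
cool = softmax_with_temperature(logits, 0.2)
warm = softmax_with_temperature(logits, 1.0)
```

At the lower temperature almost all probability mass lands on the top-scoring token, so sampling becomes more deterministic. This narrows variation but does not make the top token correct, which is why it is listed alongside grounding and validation.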

Sources
The Hallucination Crisis
academic · Stanford Human-Centered AI Institute (HAI) · 2023-08-07
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
academic · arXiv (Meta AI Research) · 2020-05-22

What is Constitutional AI (CAI)?

Constitutional AI is Anthropic's alignment method where an LLM critiques and revises its own outputs according to a set of written principles (a 'constitution'), reducing the need for human labelers to evaluate harmful content directly. It improves harmlessness while maintaining helpfulness. [Source: Anthropic / arXiv]

Sources
Constitutional AI: Harmlessness from AI Feedback
academic · arXiv (Anthropic) · 2022-12-15

What are the main safety risks associated with large language models?

Primary LLM risks include generating harmful or false content (hallucinations), leaking private training data, enabling cyberattacks via code generation, producing biased outputs, and being misused for disinformation. NIST's AI Risk Management Framework provides a structured approach to identifying and mitigating these risks. [Source: NIST]

Sources
AI Risk Management Framework (AI RMF 1.0)
primary · National Institute of Standards and Technology (NIST) · 2023-01-26

What is a context window in an LLM, and why does size matter?

A context window is the maximum number of tokens an LLM can process in a single inference pass, covering both input and output. Larger windows — now reaching 1 million tokens in some models — allow processing of entire codebases or books, enabling richer reasoning and document-level tasks. [Source: Google DeepMind]
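
Because input and output share the same window, applications usually budget tokens before calling the model. The helper names and the 8,192-token window below are assumptions made for the sketch.

```python
def fits_in_window(context_window, prompt_tokens, output_tokens):
    """True if the prompt plus the requested output fit in one pass."""
    return prompt_tokens + output_tokens <= context_window


def max_output_tokens(context_window, prompt_tokens):
    """Tokens left for the response once the prompt is counted."""
    return max(context_window - prompt_tokens, 0)


# Hypothetical model with an 8,192-token window and a 7,000-token prompt.
window = 8192
remaining = max_output_tokens(window, 7000)
ok_short = fits_in_window(window, 7000, 1000)
ok_long = fits_in_window(window, 7000, 2000)
```

When a request does not fit, the usual options are to shorten the prompt, retrieve fewer documents, or move to a model with a larger window.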

Sources
Gemini: A Family of Highly Capable Multimodal Models
academic · arXiv (Google DeepMind) · 2023-12-19

What is tokenization in NLP and how does it affect LLM performance?

Tokenization splits raw text into subword units (tokens) using algorithms like Byte-Pair Encoding (BPE) or WordPiece, forming the model's vocabulary. Token count directly impacts cost, context length usage, and model behavior — for example, rare words or non-English text typically consume more tokens per character. [Source: OpenAI]
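
To make the BPE idea concrete, here is a simplified sketch of how merges are learned: count adjacent symbol pairs across a word-frequency table and repeatedly merge the most frequent pair. It omits end-of-word markers and byte-level handling, and the word counts are invented, so it should be read as an illustration, not as any production tokenizer.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Rewrite every word so that `pair` becomes one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_counts, num_merges):
    """Start from characters and apply the most frequent merge num_merges times."""
    vocab = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Hypothetical word frequencies from a tiny corpus.
word_counts = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, vocab = learn_bpe(word_counts, 3)
```

After three merges the frequent word "hug" has become a single token, while rarer words are still split into pieces. The same effect explains why uncommon words consume more tokens.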

Sources
Tokenizer – OpenAI Platform
official · OpenAI · 2024-01-01
Neural Machine Translation of Rare Words with Subword Units
academic · arXiv (University of Edinburgh) · 2015-08-31

What is the relationship between LLM parameter count and model capability?

Parameter count measures the total number of learnable weights in a model; larger counts generally yield better performance, but with diminishing returns. DeepMind's Chinchilla research showed that many large models were undertrained — optimal performance requires scaling data proportionally with parameters, not just increasing size. [Source: DeepMind / arXiv]
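
The proportional-scaling idea is often summarized with two rules of thumb, both treated here as assumptions and not as exact results: roughly 20 training tokens per parameter, and training compute of about 6 × parameters × tokens in FLOPs. The sketch applies them to a 70-billion-parameter model.

```python
def compute_optimal_tokens(num_params, tokens_per_param=20):
    """Rule-of-thumb token budget: about 20 tokens per parameter (approximate)."""
    return num_params * tokens_per_param


def training_flops(num_params, num_tokens):
    """Common approximation: about 6 FLOPs per parameter per training token."""
    return 6 * num_params * num_tokens


params = 70_000_000_000
tokens = compute_optimal_tokens(params)
flops = training_flops(params, tokens)
```

Under these approximations, doubling the parameter count also doubles the recommended token budget, so compute grows roughly fourfold. This is why simply making a model larger on a fixed dataset leaves it undertrained.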

Sources
Training Compute-Optimal Large Language Models
academic · arXiv (DeepMind) · 2022-03-29

When should you use RAG versus fine-tuning for an LLM application?

RAG is preferred when information changes frequently, requires citing sources, or is too large to train on — it injects knowledge at runtime. Fine-tuning is better for teaching consistent tone, style, or domain-specific behavior that is stable over time. Many production systems combine both approaches. [Source: Stanford HAI]

Sources
Reflections on Foundation Models and Generative AI
academic · Stanford Human-Centered AI Institute (HAI) · 2023-03-01
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
academic · arXiv (Meta AI Research) · 2020-05-22
Fine-tuning – OpenAI Platform Documentation
official · OpenAI · 2024-01-01