AI Security
AI vulnerabilities, model attacks, and defense strategies
What is an adversarial attack on an AI model?
An adversarial attack deliberately manipulates input data with subtle, often imperceptible perturbations to cause an AI model to produce incorrect outputs. For example, adding small pixel-level noise to an image can fool a classifier into misidentifying a stop sign as a speed limit sign. [Source: NIST]
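As a concrete illustration, the sketch below applies the fast gradient sign method (FGSM), one well-known way to craft such perturbations, to a tiny logistic-regression model. The weights, input, and epsilon are made-up values chosen for the example, and the source does not name a specific attack method.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """One FGSM step: move each feature by eps in the direction that raises the loss."""
    p = predict_proba(w, b, x)
    # For binary cross-entropy on logistic regression, dLoss/dx_i = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical model and input (assumptions for illustration only).
w, b = [2.0, -1.0], 0.0
x, y = [0.3, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.25)
p_clean = predict_proba(w, b, x)      # above 0.5: classified as class 1
p_adv = predict_proba(w, b, x_adv)    # below 0.5: prediction flips to class 0
```

Each feature moves by at most 0.25, yet the predicted class changes. Attacks on image classifiers follow the same idea across thousands of pixels.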
What does 'adversarial robustness' mean in AI security?
Adversarial robustness refers to an AI model's ability to maintain correct performance when inputs have been intentionally manipulated by an attacker. A robust model produces consistent, accurate predictions under adversarial perturbations up to a stated size, and robustness is typically measured as accuracy under multiple attack types and perturbation budgets. [Source: NIST]
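For a linear classifier, robust accuracy can be computed exactly, which makes it a convenient way to see what measuring robustness means. The sketch below assumes labels in {-1, +1} and an L-infinity perturbation budget eps; the model and data points are hypothetical.

```python
def robust_accuracy(w, b, points, labels, eps):
    """Fraction of points still classified correctly under ANY L-infinity
    perturbation of size eps. For a linear model, the worst-case perturbation
    lowers the signed margin by exactly eps * ||w||_1."""
    l1_norm = sum(abs(wi) for wi in w)
    survivors = 0
    for x, y in zip(points, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        if margin > eps * l1_norm:
            survivors += 1
    return survivors / len(points)

# Hypothetical model and evaluation set (assumptions for illustration).
w, b = [1.0, -1.0], 0.0
points = [[2.0, 0.0], [0.5, 0.0], [0.0, 1.0], [0.0, 0.2]]
labels = [1, 1, -1, -1]

acc_clean = robust_accuracy(w, b, points, labels, eps=0.0)
acc_small = robust_accuracy(w, b, points, labels, eps=0.2)
acc_large = robust_accuracy(w, b, points, labels, eps=0.5)
```

Accuracy falls as the budget grows, which is why robustness figures are only meaningful alongside the perturbation size they were measured at. For deep networks no such closed form exists, so evaluations rely on running actual attacks.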
What is a model evasion attack in machine learning?
A model evasion attack occurs at inference time, when an adversary crafts inputs designed to bypass a trained model's detection or classification—without altering the model itself. Common examples include bypassing malware detectors or spam filters by slightly modifying the malicious content. [Source: NIST]
What is a data or model poisoning attack?
A poisoning attack corrupts an AI model's training data or training process so the resulting model behaves maliciously or inaccurately. Attackers may inject mislabeled samples or backdoor triggers during training, causing the model to misclassify specific inputs while appearing normal otherwise. [Source: NIST]
What is a backdoor attack on an AI model?
A backdoor attack embeds a hidden trigger into an AI model during training so that it behaves normally on clean inputs but produces attacker-specified outputs whenever a secret trigger pattern appears. This vulnerability is especially dangerous in models trained on third-party or crowdsourced datasets. [Source: CISA]
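To show what auditors look for in a dataset, the sketch below plants a simple trigger: a small bright patch stamped in the corner of a fraction of the training images, which are then relabeled to the attacker's target class. The patch shape, poisoning rate, and function names are assumptions for illustration, not a description of any specific real-world attack.

```python
import random

def stamp_trigger(image, value=1.0, size=2):
    """Return a copy of a 2D image with a size x size patch in the bottom-right corner."""
    out = [row[:] for row in image]
    for r in range(len(out) - size, len(out)):
        for c in range(len(out[r]) - size, len(out[r])):
            out[r][c] = value
    return out

def poison_dataset(images, labels, target_label, rate, seed=0):
    """Stamp the trigger on a random fraction of samples and relabel them."""
    rng = random.Random(seed)
    n_poison = int(len(images) * rate)
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images, new_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(stamp_trigger(img))
            new_labels.append(target_label)
        else:
            new_images.append([row[:] for row in img])
            new_labels.append(lab)
    return new_images, new_labels, chosen

# Toy dataset: twenty blank 4x4 images, all labeled class 0.
images = [[[0.0] * 4 for _ in range(4)] for _ in range(20)]
labels = [0] * 20
p_images, p_labels, chosen = poison_dataset(images, labels, target_label=7, rate=0.1)
```

Only two of the twenty samples change, so aggregate statistics such as clean accuracy barely move, which is why such tampering is hard to spot without inspecting provenance.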
What is a prompt injection attack against large language models?
Prompt injection is an attack in which malicious instructions are embedded in input text to override or hijack an LLM's intended behavior. In direct injection, the attacker supplies the instructions in the prompt itself; in indirect injection, the commands are hidden in external content the model reads, such as web pages, emails, or documents. Either form can lead to data exfiltration or unauthorized actions. [Source: OWASP]
What is 'jailbreaking' a large language model?
Jailbreaking an LLM means using crafted prompts—often roleplay scenarios or special character sequences—to bypass the model's built-in safety guardrails and produce harmful, restricted, or policy-violating outputs. It differs from prompt injection because it targets alignment constraints rather than application-level controls. [Source: NIST]
What is a model inversion attack and what data does it expose?
A model inversion attack exploits a trained model's outputs or gradients to reconstruct sensitive information from the training dataset—such as facial images or medical records. Attackers query the model repeatedly and use its responses to reverse-engineer private data it was trained on. [Source: NIST]
What is a membership inference attack on a machine learning model?
A membership inference attack determines whether a specific data record was included in a model's training set by analyzing the model's prediction confidence scores. This exposes privacy risks when training data is sensitive—such as patient health records—even without accessing the training data directly. [Source: NIST]
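The simplest form of this attack is a confidence threshold: guess "member" whenever the model is very sure of its prediction. The sketch below uses made-up confidence scores and an arbitrary threshold of 0.9, so treat it as an illustration of the idea rather than a realistic attack.

```python
def infer_membership(confidences, threshold):
    """Guess that a record was in the training set if the model's confidence is high."""
    return [c >= threshold for c in confidences]

def attack_accuracy(member_conf, nonmember_conf, threshold):
    """Fraction of correct guesses over known members and known non-members."""
    guesses_in = infer_membership(member_conf, threshold)
    guesses_out = infer_membership(nonmember_conf, threshold)
    correct = sum(guesses_in) + sum(not g for g in guesses_out)
    return correct / (len(member_conf) + len(nonmember_conf))

# Hypothetical top-class confidences returned by a model (assumed values).
member_conf = [0.99, 0.97, 0.95, 0.80]      # records that were in the training set
nonmember_conf = [0.70, 0.92, 0.60, 0.55]   # records the model never saw

acc = attack_accuracy(member_conf, nonmember_conf, threshold=0.9)
```

On a balanced evaluation set like this one, 0.5 corresponds to random guessing, so an attack accuracy well above 0.5 suggests the model leaks membership information.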
How does differential privacy protect AI models from data leakage?
Differential privacy adds mathematically calibrated noise to the training process (for example, to gradients) or to query outputs so that the result changes very little whether or not any individual record is present. It provides a formal, quantifiable privacy guarantee, expressed by the parameters epsilon and delta in (ε, δ)-differential privacy, that limits the risk of membership inference and model inversion attacks. [Source: NIST]
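A minimal sketch of calibrated noise is the Laplace mechanism applied to a counting query. Adding or removing one record changes a count by at most 1, so noise with scale 1/epsilon gives epsilon-differential privacy (with delta = 0). The dataset and function names below are hypothetical, and this plain floating-point sampler is not hardened for production use.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-DP. A counting query has sensitivity 1."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: ages of eight people; query: how many are 65 or older?
rng = random.Random(42)
ages = [23, 35, 41, 67, 72, 19, 88, 54]
noisy = private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng)
```

Smaller epsilon means more noise and stronger privacy. Training-time methods such as DP-SGD apply the same principle to clipped gradients rather than to query answers.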
Is federated learning secure, and what are its main vulnerabilities?
Federated learning improves data privacy by training models locally without sharing raw data, but it remains vulnerable to gradient inversion attacks, Byzantine poisoning from malicious participants, and model-update backdoors. Central aggregation servers also represent single points of failure. [Source: NIST]
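One common mitigation for poisoned updates is to replace plain averaging with a robust aggregator. The sketch below compares the mean with a coordinate-wise median when one of five clients submits an extreme update; the update values are invented for the example, and the median is only one of several robust rules.

```python
from statistics import mean, median

def aggregate(updates, reducer):
    """Combine client updates coordinate by coordinate with the given reducer."""
    return [reducer(coord) for coord in zip(*updates)]

# Hypothetical model updates (two parameters each) from five clients.
honest = [[0.10, -0.20], [0.12, -0.18], [0.09, -0.22], [0.11, -0.19]]
malicious = [[50.0, 50.0]]          # a single Byzantine participant
updates = honest + malicious

mean_agg = aggregate(updates, mean)        # dragged far from the honest values
median_agg = aggregate(updates, median)    # stays within the honest range
```

The median tolerates a minority of arbitrary updates, but it does not stop subtle backdoor updates that stay within the honest range, so it complements rather than replaces other controls.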
What are the AI supply chain security risks organizations face?
AI supply chain risks include compromised pre-trained models from public repositories, malicious third-party datasets, vulnerable ML framework dependencies, and backdoored model weights. Attackers can embed threats at any stage—data collection, model training, packaging, or deployment—before an organization ever uses the model. [Source: CISA]
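A basic control against tampered artifacts is to pin a cryptographic hash for each model file and verify it before loading. The sketch below uses SHA-256 from Python's standard library; the file contents and function names are stand-ins, and in practice the pinned hash must come from a trusted channel rather than from the same place as the download.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model weights need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_hex):
    """Return True only if the file's SHA-256 matches the pinned value."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.lower())

# Demo with a throwaway file standing in for downloaded model weights.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"fake model weights")
    demo_path = f.name

# Computed here only for the demo; a real pin is published out of band.
pinned = hashlib.sha256(b"fake model weights").hexdigest()
ok = verify_artifact(demo_path, pinned)
tampered = verify_artifact(demo_path, "0" * 64)
os.remove(demo_path)
```

A hash check detects modification after publication, but it cannot detect a backdoor that was already present when the publisher computed the hash.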
What are the most effective defenses against adversarial attacks on AI models?
Well-studied defenses include adversarial training (incorporating adversarial examples during training), certified defenses such as randomized smoothing, input preprocessing such as feature squeezing, and ensemble methods. NIST recommends combining technical controls with ongoing red-teaming and monitoring in deployment. [Source: NIST]
What is a certified defense in AI security?
A certified defense provides a mathematical guarantee that an AI model's prediction will not change for any input perturbation within a specified bound—unlike empirical defenses, which can still be broken. Randomized smoothing is the most widely adopted certified defense applicable to large-scale neural networks. [Source: NIST]
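The sketch below outlines how randomized smoothing turns noisy votes into a certified L2 radius: classify many Gaussian-noised copies of the input, lower-bound the top class probability, and convert that bound into a radius of sigma times the inverse normal CDF. It uses a simple Hoeffding bound instead of the tighter Clopper-Pearson interval found in the literature, and the one-dimensional base classifier is a hypothetical stand-in.

```python
import math
import random
from statistics import NormalDist

def base_classifier(x):
    """Hypothetical base model: class 1 if the first feature is positive."""
    return 1 if x[0] > 0 else 0

def sample_counts(classify, x, sigma, n, rng):
    """Class counts of the base classifier on n Gaussian-noised copies of x."""
    counts = {}
    for _ in range(n):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        label = classify(noisy)
        counts[label] = counts.get(label, 0) + 1
    return counts

def certify(classify, x, sigma, n_select, n_estimate, alpha, rng):
    """Return (label, L2 radius) for the smoothed classifier, or (None, 0.0) to abstain."""
    # Pick the candidate class on one batch, then estimate it on a fresh batch.
    selection = sample_counts(classify, x, sigma, n_select, rng)
    candidate = max(selection, key=selection.get)
    estimate = sample_counts(classify, x, sigma, n_estimate, rng)
    p_hat = estimate.get(candidate, 0) / n_estimate
    # One-sided Hoeffding lower bound; holds with probability at least 1 - alpha.
    p_lower = p_hat - math.sqrt(math.log(1.0 / alpha) / (2.0 * n_estimate))
    if p_lower <= 0.5:
        return None, 0.0
    return candidate, sigma * NormalDist().inv_cdf(p_lower)

rng = random.Random(0)
label, radius = certify(base_classifier, [1.0], sigma=0.5,
                        n_select=100, n_estimate=1000, alpha=0.001, rng=rng)
```

The guarantee applies to the smoothed classifier, not the base model, and holds with probability at least 1 - alpha over the sampling. Here the input sits at distance 1.0 from the decision boundary, and the certified radius is a conservative value below that.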
How should developers secure applications built on large language models?
OWASP recommends input and output validation, least-privilege tool access, sandboxed execution environments, human-in-the-loop controls for high-impact actions, and regular red-teaming. Developers should also implement prompt hardening, monitor for anomalous outputs, and isolate LLM components from sensitive backend systems. [Source: OWASP]
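Least-privilege tool access and human-in-the-loop controls can be enforced with a small gate between the model and its tools. The sketch below is one possible shape for such a gate; the tool names and the policy table are hypothetical and not taken from OWASP guidance.

```python
# Policy table (assumed for illustration): tool name -> needs human approval?
ALLOWED_TOOLS = {
    "search_docs": False,   # read-only, low impact
    "send_email": True,     # high impact, requires sign-off
}

def authorize_tool_call(tool_name, approved_by_human=False):
    """Decide whether a tool call requested by the LLM may run.

    Returns "deny" for tools outside the allowlist, "needs_approval" for
    high-impact tools without sign-off, and "allow" otherwise.
    """
    if tool_name not in ALLOWED_TOOLS:
        return "deny"
    if ALLOWED_TOOLS[tool_name] and not approved_by_human:
        return "needs_approval"
    return "allow"

decisions = {
    "search": authorize_tool_call("search_docs"),
    "email": authorize_tool_call("send_email"),
    "email_approved": authorize_tool_call("send_email", approved_by_human=True),
    "shell": authorize_tool_call("run_shell"),
}
```

Because the decision is made in ordinary code outside the model, an injected prompt cannot talk its way past the allowlist; it can only request calls that the gate then evaluates.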
What is AI red teaming and how does it differ from traditional red teaming?
AI red teaming involves structured adversarial testing of AI systems to identify failures in safety, security, and alignment—including jailbreaks, harmful outputs, and bias. Unlike traditional cybersecurity red teaming, it also evaluates behavioral risks and model-specific attack surfaces like prompt injection and training data leakage. [Source: NIST]
How do you conduct a security audit of an AI model?
An AI model security audit should cover training data provenance, model card review, adversarial robustness testing, privacy attack simulation (membership inference, model inversion), supply chain verification of dependencies, and deployment configuration review. NIST's AI RMF provides a governance framework for structuring these assessments. [Source: NIST]
How can organizations protect sensitive data used to train AI models?
Organizations should apply differential privacy during training, enforce strict data access controls and audit logs, use synthetic data generation where possible, perform data minimization, and contractually bind data processors. NIST's AI RMF and NIST SP 800-188 on de-identification provide technical and governance guidance. [Source: NIST]
What is AI model alignment and why is it a security concern?
Model alignment is the degree to which an AI system's goals and behaviors conform to human intentions and organizational values. Misaligned models may pursue unintended objectives, be manipulated into producing harmful outputs, or resist human oversight, which makes alignment a foundational security property alongside robustness and privacy. [Source: NIST]
What are the key NIST standards and frameworks for AI security?
NIST's primary AI security resources are the AI Risk Management Framework (AI RMF 1.0), NIST AI 100-2 on adversarial machine learning taxonomy, and NIST AI 600-1 covering generative AI risks. Together they provide risk governance, threat taxonomies, and mitigation guidance for organizations deploying AI systems. [Source: NIST]