My Subject Matter
ai-security

AI Security

AI vulnerabilities, model attacks, and defense strategies

What is an adversarial attack on an AI model?

An adversarial attack deliberately manipulates input data with subtle, often imperceptible perturbations to cause an AI model to produce incorrect outputs. For example, adding small pixel-level noise to an image can fool a classifier into misidentifying a stop sign as a speed limit sign. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

What does 'adversarial robustness' mean in AI security?

Adversarial robustness refers to an AI model's ability to maintain correct performance when inputs have been intentionally manipulated by an attacker. A robust model produces consistent, accurate predictions even under adversarial perturbations, and is measured via standardized evaluations across multiple attack types. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What is a model evasion attack in machine learning?

A model evasion attack occurs at inference time, when an adversary crafts inputs designed to bypass a trained model's detection or classification—without altering the model itself. Common examples include bypassing malware detectors or spam filters by slightly modifying the malicious content. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

What is a data or model poisoning attack?

A poisoning attack corrupts an AI model's training data or training process so the resulting model behaves maliciously or inaccurately. Attackers may inject mislabeled samples or backdoor triggers during training, causing the model to misclassify specific inputs while appearing normal otherwise. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What is a backdoor attack on an AI model?

A backdoor attack embeds a hidden trigger into an AI model during training so that it behaves normally on clean inputs but produces attacker-specified outputs whenever a secret trigger pattern appears. This vulnerability is especially dangerous in models trained on third-party or crowdsourced datasets. [Source: CISA]

Sources
Guidelines for Secure AI System Development
official · Cybersecurity and Infrastructure Security Agency (CISA) · 2023-11-27
·
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

What is a prompt injection attack against large language models?

Prompt injection is an attack where malicious instructions are embedded in input text to override or hijack an LLM's intended behavior. Direct injection targets the model itself; indirect injection embeds commands in external content the model reads, potentially causing data exfiltration or unauthorized actions. [Source: OWASP]

Sources
OWASP Top 10 for Large Language Model Applications
official · Open Worldwide Application Security Project (OWASP) · 2025-01-01
·
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·

What is 'jailbreaking' a large language model?

Jailbreaking an LLM means using crafted prompts—often roleplay scenarios or special character sequences—to bypass the model's built-in safety guardrails and produce harmful, restricted, or policy-violating outputs. It differs from prompt injection because it targets alignment constraints rather than application-level controls. [Source: NIST]

Sources
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·
OWASP Top 10 for Large Language Model Applications
official · Open Worldwide Application Security Project (OWASP) · 2025-01-01
·

What is a model inversion attack and what data does it expose?

A model inversion attack exploits a trained model's outputs or gradients to reconstruct sensitive information from the training dataset—such as facial images or medical records. Attackers query the model repeatedly and use its responses to reverse-engineer private data it was trained on. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What is a membership inference attack on a machine learning model?

A membership inference attack determines whether a specific data record was included in a model's training set by analyzing the model's prediction confidence scores. This exposes privacy risks when training data is sensitive—such as patient health records—even without accessing the training data directly. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

How does differential privacy protect AI models from data leakage?

Differential privacy adds mathematically calibrated noise to training data or model outputs so that the presence or absence of any individual record cannot be statistically inferred. It provides a formal, quantifiable privacy guarantee (epsilon-delta DP) that limits the risk of membership inference and model inversion attacks. [Source: NIST]

Sources
De-Identifying Government Datasets (NIST SP 800-188)
official · National Institute of Standards and Technology (NIST) · 2023-12-01
·
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

Is federated learning secure, and what are its main vulnerabilities?

Federated learning improves data privacy by training models locally without sharing raw data, but it remains vulnerable to gradient inversion attacks, Byzantine poisoning from malicious participants, and model-update backdoors. Central aggregation servers also represent single points of failure. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What are the AI supply chain security risks organizations face?

AI supply chain risks include compromised pre-trained models from public repositories, malicious third-party datasets, vulnerable ML framework dependencies, and backdoored model weights. Attackers can embed threats at any stage—data collection, model training, packaging, or deployment—before an organization ever uses the model. [Source: CISA]

Sources
Guidelines for Secure AI System Development
official · Cybersecurity and Infrastructure Security Agency (CISA) · 2023-11-27
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What are the most effective defenses against adversarial attacks on AI models?

The most validated defenses include adversarial training (incorporating adversarial examples during training), certified defenses using randomized smoothing, input preprocessing such as feature squeezing, and ensemble methods. NIST recommends combining technical controls with ongoing red-teaming and monitoring for deployment environments. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What is a certified defense in AI security?

A certified defense provides a mathematical guarantee that an AI model's prediction will not change for any input perturbation within a specified bound—unlike empirical defenses, which can still be broken. Randomized smoothing is the most widely adopted certified defense applicable to large-scale neural networks. [Source: NIST]

Sources
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

How should developers secure applications built on large language models?

OWASP recommends input and output validation, least-privilege tool access, sandboxed execution environments, human-in-the-loop controls for high-impact actions, and regular red-teaming. Developers should also implement prompt hardening, monitor for anomalous outputs, and isolate LLM components from sensitive backend systems. [Source: OWASP]

Sources
OWASP Top 10 for Large Language Model Applications
official · Open Worldwide Application Security Project (OWASP) · 2025-01-01
·
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·

What is AI red teaming and how does it differ from traditional red teaming?

AI red teaming involves structured adversarial testing of AI systems to identify failures in safety, security, and alignment—including jailbreaks, harmful outputs, and bias. Unlike traditional cybersecurity red teaming, it also evaluates behavioral risks and model-specific attack surfaces like prompt injection and training data leakage. [Source: NIST]

Sources
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

How do you conduct a security audit of an AI model?

An AI model security audit should cover training data provenance, model card review, adversarial robustness testing, privacy attack simulation (membership inference, model inversion), supply chain verification of dependencies, and deployment configuration review. NIST's AI RMF provides a governance framework for structuring these assessments. [Source: NIST]

Sources
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·

How can organizations protect sensitive data used to train AI models?

Organizations should apply differential privacy during training, enforce strict data access controls and audit logs, use synthetic data generation where possible, perform data minimization, and contractually bind data processors. NIST's AI RMF and NIST SP 800-188 on de-identification provide technical and governance guidance. [Source: NIST]

Sources
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·
De-Identifying Government Datasets (NIST SP 800-188)
official · National Institute of Standards and Technology (NIST) · 2023-12-01
·

What is AI model alignment and why is it a security concern?

Model alignment ensures an AI system's goals and behaviors conform to human intentions and organizational values. Misaligned models may pursue unintended objectives, be manipulated into harmful outputs, or resist human oversight—making alignment a foundational security property alongside robustness and privacy. [Source: NIST]

Sources
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·

What are the key NIST standards and frameworks for AI security?

NIST's primary AI security resources are the AI Risk Management Framework (AI RMF 1.0), NIST AI 100-2 on adversarial machine learning taxonomy, and NIST AI 600-1 covering generative AI risks. Together they provide risk governance, threat taxonomies, and mitigation guidance for organizations deploying AI systems. [Source: NIST]

Sources
AI Risk Management Framework (AI RMF 1.0)
official · National Institute of Standards and Technology (NIST) · 2023-01-26
·
Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST AI 100-2)
official · National Institute of Standards and Technology (NIST) · 2024-01-01
·
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile (NIST AI 600-1)
official · National Institute of Standards and Technology (NIST) · 2024-07-26
·