- [2025/02] "Short-length" Adversarial Training Helps LLMs Defend "Long-length" Jailbreak Attacks: Theoretical and Empirical Evidence
- [2025/02] Understanding and Enhancing the Transferability of Jailbreaking Attacks
- [2025/02] PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
- [2025/02] Jailbreaking with Universal Multi-Prompts
- [2025/02] Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
- [2025/02] Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
- [2025/02] From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs
- [2025/02] AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds
- [2025/01] Peering Behind the Shield: Guardrail Identification in Large Language Models
- [2025/01] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
- [2025/01] Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
- [2025/01] Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
- [2025/01] Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
- [2025/01] GuardReasoner: Towards Reasoning-based LLM Safeguards
- [2025/01] Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
- [2025/01] Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
- [2025/01] Jailbreaking Large Language Models in Infinitely Many Ways
- [2025/01] You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
- [2025/01] Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
- [2025/01] Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
- [2025/01] Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
- [2025/01] Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
- [2025/01] Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- [2025/01] LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
- [2024/12] Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
- [2024/12] AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
- [2024/12] SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
- [2024/12] JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
- [2024/12] No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
- [2024/12] Defending LVLMs Against Vision Attacks through Partial-Perception Supervision
- [2024/12] AdvPrefix: An Objective for Nuanced LLM Jailbreaks
- [2024/12] Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
- [2024/12] Antelope: Potent and Concealed Jailbreak Attack Strategy
- [2024/12] AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
- [2024/12] FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
- [2024/12] PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips
- [2024/12] BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs
- [2024/12] Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
- [2024/12] Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
- [2024/11] Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
- [2024/11] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
- [2024/11] In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
- [2024/11] "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks
- [2024/11] Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
- [2024/11] GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- [2024/11] Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
- [2024/11] JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
- [2024/11] SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach
- [2024/11] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense
- [2024/11] SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
- [2024/11] MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
- [2024/11] What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
- [2024/11] SQL Injection Jailbreak: A Structural Disaster of Large Language Models
- [2024/10] Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models
- [2024/10] Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
- [2024/10] RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
- [2024/10] You Know What I'm Saying: Jailbreak Attack via Implicit Reference
- [2024/10] Adversarial Attacks on Large Language Models Using Regularized Relaxation
- [2024/10] SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models
- [2024/10] AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents
- [2024/10] Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
- [2024/10] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
- [2024/10] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
- [2024/10] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
- [2024/10] SoK: Prompt Hacking of Large Language Models
- [2024/10] Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
- [2024/10] Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
- [2024/10] BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
- [2024/10] RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
- [2024/10] AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
- [2024/10] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
- [2024/10] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step
- [2024/10] Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
- [2024/10] Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
- [2024/10] FlipAttack: Jailbreak LLMs via Flipping
- [2024/10] Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
- [2024/10] VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
- [2024/10] Adversarial Suffixes May Be Features Too!
- [2024/09] Multimodal Pragmatic Jailbreak on Text-to-image Models
- [2024/09] Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity
- [2024/09] RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
- [2024/09] MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
- [2024/09] PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
- [2024/09] Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs
- [2024/09] AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
- [2024/09] Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
- [2024/09] HSF: Defending against Jailbreak Attacks with Hidden State Filtering
- [2024/08] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
- [2024/08] Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models
- [2024/08] Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
- [2024/08] LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
- [2024/08] RT-Attack: Jailbreaking Text-to-Image Models via Random Token
- [2024/08] Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer
- [2024/08] SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
- [2024/08] Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles
- [2024/08] EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
- [2024/08] Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
- [2024/08] Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
- [2024/08] BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
- [2024/08] MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
- [2024/08] Defending LLMs against Jailbreaking Attacks via Backtranslation
- [2024/08] Jailbreak Open-Sourced Large Language Models via Enforced Decoding
- [2024/08] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment
- [2024/08] A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
- [2024/08] EnJa: Ensemble Jailbreak on Large Language Models
- [2024/08] Jailbreaking Text-to-Image Models with LLM-Based Agents
- [2024/07] Defending Jailbreak Attack in VLMs via Cross-modality Information Detector
- [2024/07] Exploring Scaling Trends in LLM Robustness
- [2024/07] The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
- [2024/07] Can Large Language Models Automatically Jailbreak GPT-4V?
- [2024/07] PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- [2024/07] Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- [2024/07] RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
- [2024/07] Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
- [2024/07] When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- [2024/07] Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
- [2024/07] Does Refusal Training in LLMs Generalize to the Past Tense?
- [2024/07] Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
- [2024/07] Jailbreak Attacks and Defenses Against Large Language Models: A Survey
- [2024/07] DART: Deep Adversarial Automated Red Teaming for LLM Safety
- [2024/07] JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
- [2024/07] SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
- [2024/07] Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- [2024/07] A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- [2024/07] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
- [2024/07] Badllama 3: removing safety finetuning from Llama 3 in minutes
- [2024/06] Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
- [2024/06] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- [2024/06] Poisoned LangChain: Jailbreak LLMs by LangChain
- [2024/06] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- [2024/06] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
- [2024/06] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
- [2024/06] Adversaries Can Misuse Combinations of Safe Models
- [2024/06] Jailbreak Paradox: The Achilles' Heel of LLMs
- [2024/06] "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
- [2024/06] Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
- [2024/06] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- [2024/06] StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
- [2024/06] When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- [2024/06] RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
- [2024/06] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
- [2024/06] MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
- [2024/06] Merging Improves Self-Critique Against Jailbreak Attacks
- [2024/06] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- [2024/06] SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
- [2024/06] Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- [2024/06] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
- [2024/06] Improving Alignment and Robustness with Short Circuiting
- [2024/06] Are PPO-ed Language Models Hackable?
- [2024/06] Cross-Modal Safety Alignment: Is textual unlearning all you need?
- [2024/06] Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
- [2024/06] Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
- [2024/06] AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- [2024/06] Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- [2024/06] BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- [2024/05] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
- [2024/05] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
- [2024/05] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
- [2024/05] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
- [2024/05] Improved Generation of Adversarial Examples Against Safety-aligned LLMs
- [2024/05] Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- [2024/05] Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- [2024/05] Voice Jailbreak Attacks Against GPT-4o
- [2024/05] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- [2024/05] Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- [2024/05] Hacc-Man: An Arcade Game for Jailbreaking LLMs
- [2024/05] Efficient Adversarial Training in LLMs with Continuous Attacks
- [2024/05] JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models
- [2024/05] Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
- [2024/05] Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
- [2024/05] GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- [2024/05] Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
- [2024/05] Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- [2024/04] Don't Say No: Jailbreaking LLM by Suppressing Refusal
- [2024/04] Universal Adversarial Triggers Are Not Universal
- [2024/04] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
- [2024/04] The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- [2024/04] Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- [2024/04] Protecting Your LLMs with Information Bottleneck
- [2024/04] JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
- [2024/04] AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
- [2024/04] Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
- [2024/04] AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- [2024/04] Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
- [2024/04] Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak
- [2024/04] Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
- [2024/04] Increased LLM Vulnerabilities from Fine-tuning and Quantization
- [2024/04] Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
- [2024/04] Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
- [2024/04] JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
- [2024/04] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- [2024/04] Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
- [2024/04] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- [2024/04] Many-shot Jailbreaking
- [2024/03] Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- [2024/03] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- [2024/03] Jailbreaking is Best Solved by Definition
- [2024/03] Detoxifying Large Language Models via Knowledge Editing
- [2024/03] RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- [2024/03] Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- [2024/03] AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
- [2024/03] Tastle: Distract Large Language Models for Automatic Jailbreak Attack
- [2024/03] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
- [2024/03] AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] On Prompt-Driven Safeguarding for Large Language Models
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] On Large Language Models’ Resilience to Coercive Interrogation
- [2023/11] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization