- [2025/02] "Short-length" Adversarial Training Helps LLMs Defend "Long-length" Jailbreak Attacks: Theoretical and Empirical Evidence
- [2025/02] Understanding and Enhancing the Transferability of Jailbreaking Attacks
- [2025/02] PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
- [2025/02] Jailbreaking with Universal Multi-Prompts
- [2025/02] Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
- [2025/02] Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
- [2025/02] From Compliance to Exploitation: Jailbreak Prompt Attacks on Multimodal LLMs
- [2025/02] AgentBreeder: Mitigating the AI Safety Impact of Multi-Agent Scaffolds
- [2025/01] Peering Behind the Shield: Guardrail Identification in Large Language Models
- [2025/01] Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming
- [2025/01] Towards Safe AI Clinicians: A Comprehensive Study on Large Language Model Jailbreaking in Healthcare
- [2025/01] Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
- [2025/01] Enhancing Model Defense Against Jailbreaks with Proactive Safety Reasoning
- [2025/01] GuardReasoner: Towards Reasoning-based LLM Safeguards
- [2025/01] Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
- [2025/01] Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
- [2025/01] Jailbreaking Large Language Models in Infinitely Many Ways
- [2025/01] You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense
- [2025/01] Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
- [2025/01] Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
- [2025/01] Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
- [2025/01] Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
- [2025/01] Auto-RT: Automatic Jailbreak Strategy Exploration for Red-Teaming Large Language Models
- [2025/01] LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
- [2024/12] Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
- [2024/12] AEIOU: A Unified Defense Framework against NSFW Prompts in Text-to-Image Models
- [2024/12] SATA: A Paradigm for LLM Jailbreak via Simple Assistive Task Linkage
- [2024/12] JailPO: A Novel Black-box Jailbreak Framework via Preference Optimization against Aligned LLMs
- [2024/12] No Free Lunch for Defending Against Prefilling Attack by In-Context Learning
- [2024/12] Defending LVLMs Against Vision Attacks through Partial-Perception Supervision
- [2024/12] AdvPrefix: An Objective for Nuanced LLM Jailbreaks
- [2024/12] Model-Editing-Based Jailbreak against Safety-aligned Large Language Models
- [2024/12] Antelope: Potent and Concealed Jailbreak Attack Strategy
- [2024/12] AdvWave: Stealthy Adversarial Jailbreak Attack against Large Audio-Language Models
- [2024/12] FlexLLM: Exploring LLM Customization for Moving Target Defense on Black-Box LLMs Against Jailbreak Attacks
- [2024/12] PrisonBreak: Jailbreaking Large Language Models with Fewer Than Twenty-Five Targeted Bit-flips
- [2024/12] BAMBA: A Bimodal Adversarial Multi-Round Black-Box Jailbreak Attacker for LVLMs
- [2024/12] Heuristic-Induced Multimodal Risk Distribution Jailbreak Attack for Multimodal Large Language Models
- [2024/12] Improved Large Language Model Jailbreak Detection via Pretrained Embeddings
- [2024/11] Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
- [2024/11] PEFT-as-an-Attack! Jailbreaking Language Models during Federated Parameter-Efficient Fine-Tuning
- [2024/11] In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models
- [2024/11] "Moralized" Multi-Step Jailbreak Prompts: Black-Box Testing of Guardrails in Large Language Models for Verbal Attacks
- [2024/11] Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
- [2024/11] GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs
- [2024/11] Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
- [2024/11] JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit
- [2024/11] SoK: Unifying Cybersecurity and Cybersafety of Multimodal Foundation Models with an Information Theory Approach
- [2024/11] The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense
- [2024/11] SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains
- [2024/11] MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue
- [2024/11] What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks
- [2024/11] SQL Injection Jailbreak: A Structural Disaster of Large Language Models
- [2024/10] Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models
- [2024/10] Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
- [2024/10] RobustKV: Defending Large Language Models against Jailbreak Attacks via KV Eviction
- [2024/10] You Know What I'm Saying: Jailbreak Attack via Implicit Reference
- [2024/10] Adversarial Attacks on Large Language Models Using Regularized Relaxation
- [2024/10] SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models
- [2024/10] AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents
- [2024/10] Feint and Attack: Attention-Based Strategies for Jailbreaking and Protecting LLMs
- [2024/10] Faster-GCG: Efficient Discrete Optimization Jailbreak Attacks against Aligned Large Language Models
- [2024/10] Jailbreaking and Mitigation of Vulnerabilities in Large Language Models
- [2024/10] Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
- [2024/10] SoK: Prompt Hacking of Large Language Models
- [2024/10] Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
- [2024/10] Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
- [2024/10] BlackDAN: A Black-Box Multi-Objective Approach for Effective and Contextual Jailbreaking of Large Language Models
- [2024/10] RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process
- [2024/10] AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
- [2024/10] Root Defence Strategies: Ensuring Safety of LLM at the Decoding Level
- [2024/10] Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step
- [2024/10] Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
- [2024/10] Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models
- [2024/10] FlipAttack: Jailbreak LLMs via Flipping
- [2024/10] Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
- [2024/10] VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
- [2024/10] Adversarial Suffixes May Be Features Too!
- [2024/09] Multimodal Pragmatic Jailbreak on Text-to-image Models
- [2024/09] Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity
- [2024/09] RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
- [2024/09] MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
- [2024/09] PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
- [2024/09] Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs
- [2024/09] AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs
- [2024/09] Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking
- [2024/09] HSF: Defending against Jailbreak Attacks with Hidden State Filtering
- [2024/08] Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks
- [2024/08] Detecting AI Flaws: Target-Driven Attacks on Internal Faults in Language Models
- [2024/08] Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
- [2024/08] LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
- [2024/08] RT-Attack: Jailbreaking Text-to-Image Models via Random Token
- [2024/08] Unlocking Adversarial Suffix Optimization Without Affirmative Phrases: Efficient Black-box Jailbreaking via LLM as Optimizer
- [2024/08] SAGE-RT: Synthetic Alignment data Generation for Safety Evaluation and Red Teaming
- [2024/08] Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles
- [2024/08] EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models
- [2024/08] Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
- [2024/08] Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
- [2024/08] BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger
- [2024/08] MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
- [2024/08] Defending LLMs against Jailbreaking Attacks via Backtranslation
- [2024/08] Jailbreak Open-Sourced Large Language Models via Enforced Decoding
- [2024/08] h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment
- [2024/08] A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares
- [2024/08] EnJa: Ensemble Jailbreak on Large Language Models
- [2024/08] Jailbreaking Text-to-Image Models with LLM-Based Agents
- [2024/07] Defending Jailbreak Attack in VLMs via Cross-modality Information Detector
- [2024/07] Exploring Scaling Trends in LLM Robustness
- [2024/07] The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
- [2024/07] Can Large Language Models Automatically Jailbreak GPT-4V?
- [2024/07] PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
- [2024/07] Figure it Out: Analyzing-based Jailbreak Attack on Large Language Models
- [2024/07] RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
- [2024/07] Arondight: Red Teaming Large Vision Language Models with Auto-generated Multi-modal Jailbreak Prompts
- [2024/07] When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
- [2024/07] Imposter.AI: Adversarial Attacks with Hidden Intentions towards Aligned Large Language Models
- [2024/07] Does Refusal Training in LLMs Generalize to the Past Tense?
- [2024/07] Continuous Embedding Attacks via Clipped Inputs in Jailbreaking Large Language Models
- [2024/07] Jailbreak Attacks and Defenses Against Large Language Models: A Survey
- [2024/07] DART: Deep Adversarial Automated Red Teaming for LLM Safety
- [2024/07] JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
- [2024/07] SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack
- [2024/07] Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything
- [2024/07] A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses
- [2024/07] Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
- [2024/07] Badllama 3: removing safety finetuning from Llama 3 in minutes
- [2024/06] Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
- [2024/06] Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
- [2024/06] Poisoned LangChain: Jailbreak LLMs by LangChain
- [2024/06] WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
- [2024/06] WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
- [2024/06] SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
- [2024/06] Adversaries Can Misuse Combinations of Safe Models
- [2024/06] Jailbreak Paradox: The Achilles' Heel of LLMs
- [2024/06] "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
- [2024/06] Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
- [2024/06] Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
- [2024/06] StructuralSleight: Automated Jailbreak Attacks on Large Language Models Utilizing Uncommon Text-Encoded Structure
- [2024/06] When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
- [2024/06] RL-JACK: Reinforcement Learning-powered Black-box Jailbreaking Attack against LLMs
- [2024/06] Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
- [2024/06] MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
- [2024/06] Merging Improves Self-Critique Against Jailbreak Attacks
- [2024/06] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
- [2024/06] SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
- [2024/06] Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks
- [2024/06] Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
- [2024/06] Improving Alignment and Robustness with Short Circuiting
- [2024/06] Are PPO-ed Language Models Hackable?
- [2024/06] Cross-Modal Safety Alignment: Is textual unlearning all you need?
- [2024/06] Defending Large Language Models Against Attacks With Residual Stream Activation Analysis
- [2024/06] Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt
- [2024/06] AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
- [2024/06] Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
- [2024/06] BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
- [2024/05] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
- [2024/05] Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
- [2024/05] Visual-RolePlay: Universal Jailbreak Attack on MultiModal Large Language Models via Role-playing Image Character
- [2024/05] Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models
- [2024/05] Improved Generation of Adversarial Examples Against Safety-aligned LLMs
- [2024/05] Robustifying Safety-Aligned Large Language Models through Clean Data Curation
- [2024/05] Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
- [2024/05] Voice Jailbreak Attacks Against GPT-4o
- [2024/05] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
- [2024/05] Automatic Jailbreaking of the Text-to-Image Generative AI Systems
- [2024/05] Hacc-Man: An Arcade Game for Jailbreaking LLMs
- [2024/05] Efficient Adversarial Training in LLMs with Continuous Attacks
- [2024/05] JailbreakEval: An Integrated Safety Evaluator Toolkit for Assessing Jailbreaks Against Large Language Models
- [2024/05] Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
- [2024/05] Lockpicking LLMs: A Logit-Based Jailbreak Using Token-level Manipulation
- [2024/05] GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
- [2024/05] Chain of Attack: a Semantic-Driven Contextual Multi-Turn attacker for LLM
- [2024/05] Can LLMs Deeply Detect Complex Malicious Queries? A Framework for Jailbreaking via Obfuscating Intent
- [2024/04] Don't Say No: Jailbreaking LLM by Suppressing Refusal
- [2024/04] Universal Adversarial Triggers Are Not Universal
- [2024/04] AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
- [2024/04] The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
- [2024/04] Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
- [2024/04] Protecting Your LLMs with Information Bottleneck
- [2024/04] JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models
- [2024/04] AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs
- [2024/04] Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
- [2024/04] AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts
- [2024/04] Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge
- [2024/04] Take a Look at it! Rethinking How to Evaluate Language Model Jailbreak
- [2024/04] Unbridled Icarus: A Survey of the Potential Perils of Image Inputs in Multimodal Large Language Model Security
- [2024/04] Increased LLM Vulnerabilities from Fine-tuning and Quantization
- [2024/04] Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
- [2024/04] Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models
- [2024/04] JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks
- [2024/04] JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
- [2024/04] Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
- [2024/04] Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
- [2024/04] Many-shot Jailbreaking
- [2024/03] Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
- [2024/03] Don't Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models
- [2024/03] Jailbreaking is Best Solved by Definition
- [2024/03] Detoxifying Large Language Models via Knowledge Editing
- [2024/03] RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
- [2024/03] Images are Achilles’ Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models
- [2024/03] AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting
- [2024/03] Tastle: Distract Large Language Models for Automatic Jailbreak Attack
- [2024/03] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
- [2024/03] AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks
- [2024/03] Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
- [2024/02] Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
- [2024/02] Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction
- [2024/02] DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers
- [2024/02] GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
- [2024/02] CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models
- [2024/02] PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
- [2024/02] Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
- [2024/02] LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
- [2024/02] From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings
- [2024/02] Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs
- [2024/02] Is the System Message Really Important to Jailbreaks in Large Language Models?
- [2024/02] Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement
- [2024/02] How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
- [2024/02] Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment
- [2024/02] LLM Jailbreak Attack versus Defense Techniques -- A Comprehensive Study
- [2024/02] Coercing LLMs to do and reveal (almost) anything
- [2024/02] GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis
- [2024/02] Query-Based Adversarial Prompt Generation
- [2024/02] ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
- [2024/02] SPML: A DSL for Defending Language Models Against Prompt Attacks
- [2024/02] A StrongREJECT for Empty Jailbreaks
- [2024/02] Jailbreaking Proprietary Large Language Models using Word Substitution Cipher
- [2024/02] ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
- [2024/02] PAL: Proxy-Guided Black-Box Attack on Large Language Models
- [2024/02] Attacking Large Language Models with Projected Gradient Descent
- [2024/02] SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
- [2024/02] Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
- [2024/02] COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability
- [2024/02] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
- [2024/02] Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning
- [2024/02] Comprehensive Assessment of Jailbreak Attacks Against LLMs
- [2024/02] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models
- [2024/02] HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- [2024/02] Jailbreaking Attack against Multimodal Large Language Model
- [2024/02] Prompt-Driven LLM Safeguarding via Directed Representation Optimization
- [2024/01] On Prompt-Driven Safeguarding for Large Language Models
- [2024/01] A Cross-Language Investigation into Jailbreak Attacks in Large Language Models
- [2024/01] Weak-to-Strong Jailbreaking on Large Language Models
- [2024/01] Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
- [2024/01] Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts
- [2024/01] PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
- [2024/01] Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
- [2024/01] Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
- [2024/01] All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks
- [2024/01] AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models
- [2024/01] Intention Analysis Prompting Makes Large Language Models A Good Jailbreak Defender
- [2024/01] How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
- [2023/12] A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models
- [2023/12] Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak
- [2023/12] Goal-Oriented Prompt Attack and Safety Evaluation for LLMs
- [2023/12] Tree of Attacks: Jailbreaking Black-Box LLMs Automatically
- [2023/12] Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- [2023/12] A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection
- [2023/12] Adversarial Attacks on GPT-4 via Simple Random Search
- [2023/12] On Large Language Models’ Resilience to Coercive Interrogation
- [2023/11] MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
- [2023/11] A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily
- [2023/11] Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
- [2023/11] MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
- [2023/11] Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- [2023/11] SneakyPrompt: Jailbreaking Text-to-image Generative Models
- [2023/11] DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- [2023/11] Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild
- [2023/11] Evil Geniuses: Delving into the Safety of LLM-based Agents
- [2023/11] FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts
- [2023/10] Attack Prompt Generation for Red Teaming and Defending Large Language Models
- [2023/10] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attack
- [2023/10] Low-Resource Languages Jailbreak GPT-4
- [2023/10] SC-Safety: A Multi-round Open-ended Question Adversarial Safety Benchmark for Large Language Models in Chinese
- [2023/10] SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
- [2023/10] Adversarial Attacks on LLMs
- [2023/10] AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
- [2023/10] Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- [2023/10] Jailbreaking Black Box Large Language Models in Twenty Queries
- [2023/09] Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- [2023/09] Certifying LLM Safety against Adversarial Prompting
- [2023/09] SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
- [2023/09] Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation
- [2023/09] AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- [2023/09] GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- [2023/09] Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
- [2023/09] Multilingual Jailbreak Challenges in Large Language Models
- [2023/09] On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs
- [2023/09] RAIN: Your Language Models Can Align Themselves without Finetuning
- [2023/09] Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
- [2023/09] Understanding Hidden Context in Preference Learning: Consequences for RLHF
- [2023/09] Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- [2023/09] FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models
- [2023/09] GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- [2023/09] Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- [2023/08] Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
- [2023/08] XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
- [2023/08] “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models
- [2023/08] Detecting Language Model Attacks with Perplexity
- [2023/07] From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy
- [2023/07] LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem?
- [2023/07] Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- [2023/07] Jailbroken: How Does LLM Safety Training Fail?
- [2023/07] MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots
- [2023/07] Universal and Transferable Adversarial Attacks on Aligned Language Models
- [2023/06] Visual Adversarial Examples Jailbreak Aligned Large Language Models
- [2023/05] Adversarial Demonstration Attacks on Large Language Models
- [2023/05] Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
- [2023/05] Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks
- [2023/04] Multi-step Jailbreaking Privacy Attacks on ChatGPT
- [2023/03] Automatically Auditing Large Language Models via Discrete Optimization