- [2024/12] Agent-SafetyBench: Evaluating the Safety of LLM Agents
- [2024/12] SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
- [2024/11] Quantized Delta Weight Is Safety Keeper
- [2024/08] Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution
- [2024/07] Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
- [2024/06] Finding Safety Neurons in Large Language Models
- [2024/06] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors
- [2024/06] GuardAgent: Safeguard LLM Agents by a Guard Agent via Knowledge-Enabled Reasoning
- [2024/06] Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
- [2024/06] Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study
- [2024/05] AI Risk Management Should Incorporate Both Safety and Security
- [2024/05] S-Eval: Automatic and Adaptive Test Generation for Benchmarking Safety Evaluation of Large Language Models
- [2024/04] Introducing v0.5 of the AI Safety Benchmark from MLCommons
- [2024/04] ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
- [2024/04] Benchmarking Llama2, Mistral, Gemma and GPT for Factuality, Toxicity, Bias and Propensity for Hallucinations
- [2024/04] Foundational Challenges in Assuring Alignment and Safety of Large Language Models
- [2024/04] Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward
- [2024/04] SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety