Mirror of https://github.com/roostorg/awesome-safety-tools

Add new datasets from PyRIT

Updated project descriptions and added new datasets related to AI safety and red teaming based on what's available in Microsoft's PyRIT tool.

Signed-off-by: Roman Lutz <romanlutz13@gmail.com>

Authored by Roman Lutz, committed by GitHub · 6e254553 29184c88

+37 -1
README.md
```diff
@@ -119,7 +119,7 @@
   * Tool for testing prompt injection vulnerabilities in AI systems
 * [Promptfoo](https://github.com/promptfoo/promptfoo)
   * Automated LLM evaluations, report generations, several ready-to-use attack strategies
-* [PyRIT](https://github.com/Azure/PyRIT)
+* [PyRIT by Microsoft](https://github.com/Azure/PyRIT)
   * Microsoft’s Python-based tool for AI red teaming and security testing
 * [Socketteer](https://github.com/socketteer?tab=repositories)
   * Allows AI models to interact, helping test conversational weaknesses
@@ -191,10 +191,14 @@
   * dataset created by NVIDIA to aid in content moderation and toxicity detection
 * [badwords by Richard Hughes](https://github.com/hughsie/badwords)
   * simple list of bad words in different locales that can be used to flag suspicious user-submitted content
+* [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
+  * prompts with RLHF markers for unsafe responses across multiple harm categories
 * [Toxic Chat by LMSYS](https://huggingface.co/datasets/lmsys/toxic-chat)
   * dataset of toxic conversations collected from interactions with Vicuna
 * [Toxicity by Jigsaw](https://huggingface.co/datasets/google/jigsaw_toxicity_pred)
   * large number of Wikipedia comments which have been labeled by human raters for toxic behavior
+* [Transphobia Awareness dataset](https://doi.org/10.5281/zenodo.15482694)
+  * user-generated queries related to transphobia with human annotations and model responses from Quora questions
 * [Uli Dataset by Tattle](https://github.com/tattle-made/uli_dataset)
   * dataset of gendered abuse, created for Uli ML redaction.
 * [VTC by Unitary AI](https://github.com/unitaryai/VTC)
@@ -205,22 +209,54 @@
 
 * [AI Alignment Dataset by Anthropic](https://atlas.nomic.ai/map/anthropic_rlhf)
   * data used for reinforcement learning with human feedback (RLHF) to align AI models.
+* [AILuminate dataset by MLCommons](https://github.com/mlcommons/ailuminate)
+  * Human-created prompts across different harm categories
+* [Aya Red-teaming dataset by Cohere](https://huggingface.co/datasets/CohereForAI/aya_redteaming)
+  * multilingual red-teaming prompts across various harm categories
+* [ALERT dataset by Babelscape](https://huggingface.co/datasets/Babelscape/ALERT)
+  * standard and adversarial red-teaming prompts
+* [CCP Sensitive Prompts by Promptfoo](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts)
+  * Prompts covering topics sensitive to the Chinese Communist Party (CCP)
+* [DarkBench by Apart](https://huggingface.co/datasets/apart/darkbench)
+  * Comprehensive benchmark to detect dark design patterns in LLMs
 * [DEFCOM Red Teaming Dataset](https://github.com/humane-intelligence/ai_village_defcon_grt_data)
   * dataset from DEF CON’s AI red teaming event.
+* [Do Not Answer dataset](https://huggingface.co/datasets/LibrAI/do-not-answer)
+  * Questions across multiple risk areas and harm types to test LLM safety and refusal behavior
+* [Forbidden Questions dataset](https://huggingface.co/datasets/TrustAIRLab/forbidden_question_set)
+  * Questions adopted from OpenAI Usage Policy
 * [HackAPrompt Jailbreak Dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/viewer/default/train?p=1&row=137)
   * dataset for testing AI vulnerability to prompt-based jailbreaking
+* [HarmBench by Center for AI Safety](https://github.com/centerforaisafety/HarmBench)
+  * Evaluation dataset for automated red teaming
 * [HiroKachi Jailbreak Dataset](https://sizu.me/love)
   * dataset focused on adversarial AI prompt attacks
 * [Jailbreak Prompt Generator AI Model](https://huggingface.co/tsq2000/Jailbreak-generator)
   * AI model that generates jailbreak-style prompts
+* [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
+  * Harmful behaviors for jailbreaking evaluation
 * [JailbreakHub by WalledAI](https://huggingface.co/datasets/walledai/JailbreakHub)
   * collection of jailbreak prompts and corresponding model responses
+* [LLM-LAT harmful dataset](https://huggingface.co/datasets/LLM-LAT/harmful-dataset)
+  * Prompts to assess harmful behaviors in LLMs
+* [MedSafetyBench](https://github.com/AI4LIFE-GROUP/med-safety-bench)
+  * Medical safety prompts to evaluate LLM safety in medical contexts
+* [Multilingual Vulnerability dataset](https://github.com/CarsonDon/Multilingual-Vuln-LLMs)
+  * Multilingual prompts demonstrating LLM vulnerabilities
 * [Red Team Resistance Leaderboard](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark)
   * rankings of AI models based on resistance to adversarial attacks
 * [Rentry Jailbreak Datasets](https://rentry.org/gpt0721)
   * collection of datasets related to jailbreak attempts on AI models
 * [SidFeel Jailbreak Dataset](https://github.com/sidfeels/PromptsDB)
   * collection of prompts used for jailbreaking AI models
+* [SorryBench](https://huggingface.co/datasets/sorry-bench/sorry-bench-202503)
+  * adversarial prompts to test LLM safety with linguistic mutations
+* [SOSBench](https://huggingface.co/datasets/SOSBench/SOSBench)
+  * regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws.
+* [TDC23-RedTeaming dataset by walledai](https://huggingface.co/datasets/walledai/TDC23-RedTeaming)
+  * collection of prompts from the red teaming track at TDC23
+* [XSTest dataset](https://github.com/paul-rottger/exaggerated-safety)
+  * Prompts designed to test exaggerated safety behaviors in LLMs
 
 
 ## Decentralized Platforms
```
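Most of the additions in this commit are Hugging Face datasets, so a quick way to vet any of them before wiring it into a red-teaming pipeline is to pull it down with the `datasets` library. Below is a minimal sketch, assuming `pip install datasets` and that the chosen dataset exposes a default `train` split; column names differ from dataset to dataset, so the sketch prints the schema rather than hard-coding one.

```python
# Minimal sketch: inspect one of the Hugging Face datasets listed in this diff.
# Assumes the `datasets` library is installed and that the dataset has a
# "train" split (true for most entries here, but check the dataset card).
from datasets import load_dataset

# Any Hugging Face dataset ID from the list above can be substituted here.
ds = load_dataset("walledai/JailbreakHub", split="train")

print(ds)           # row count and column names
print(ds.features)  # schema as declared by the dataset card
print(ds[0])        # first record, e.g. a prompt plus its metadata
```

The same pattern works for the other Hugging Face entries (PKU-SafeRLHF, Do Not Answer, JailbreakBench, SorryBench, SOSBench, etc.); the GitHub- and Zenodo-hosted datasets need to be cloned or downloaded per their own repositories instead.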