Merge pull request #47 from romanlutz/patch-1 · roost.tools/awesome-safety-tools@7c0d98a

+37 -1

1 changed file


README.md


···
   * Tool for testing prompt injection vulnerabilities in AI systems
 * [Promptfoo](https://github.com/promptfoo/promptfoo)
   * Automated LLM evaluations, report generations, several ready-to-use attack strategies
-* [PyRIT](https://github.com/Azure/PyRIT)
+* [PyRIT by Microsoft](https://github.com/Azure/PyRIT)
   * Microsoft’s Python-based tool for AI red teaming and security testing
 * [Socketteer](https://github.com/socketteer?tab=repositories)
   * Allows AI models to interact, helping test conversational weaknesses
···
   * dataset created by NVIDIA to aid in content moderation and toxicity detection
 * [badwords by Richard Hughes](https://github.com/hughsie/badwords)
   * simple list of bad words in different locales that can be used to flag suspicious user-submitted content
+* [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
+  * prompts with RLHF markers for unsafe responses across multiple harm categories
 * [Toxic Chat by LMSYS](https://huggingface.co/datasets/lmsys/toxic-chat)
   * dataset of toxic conversations collected from interactions with Vicuna
 * [Toxicity by Jigsaw](https://huggingface.co/datasets/google/jigsaw_toxicity_pred)
   * large number of Wikipedia comments which have been labeled by human raters for toxic behavior
+* [Transphobia Awareness dataset](https://doi.org/10.5281/zenodo.15482694)
+  * user-generated queries related to transphobia with human annotations and model responses from Quora questions
 * [Uli Dataset by Tattle](https://github.com/tattle-made/uli_dataset)
   * dataset of gendered abuse, created for Uli ML redaction.
 * [VTC by Unitary AI](https://github.com/unitaryai/VTC)
···

 * [AI Alignment Dataset by Anthropic](https://atlas.nomic.ai/map/anthropic_rlhf)
   * data used for reinforcement learning with human feedback (RLHF) to align AI models.
+* [AILuminate dataset by MLCommons](https://github.com/mlcommons/ailuminate)
+  * Human-created prompts across different harm categories
+* [Aya Red-teaming dataset by Cohere](https://huggingface.co/datasets/CohereForAI/aya_redteaming)
+  * multilingual red-teaming prompts across various harm categories
+* [ALERT dataset by Babelscape](https://huggingface.co/datasets/Babelscape/ALERT)
+  * standard and adversarial red-teaming prompts
+* [CCP Sensitive Prompts by Promptfoo](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts)
+  * Prompts covering topics sensitive to the Chinese Communist Party (CCP)
+* [DarkBench by Apart](https://huggingface.co/datasets/apart/darkbench)
+  * Comprehensive benchmark to detect dark design patterns in LLMs
 * [DEFCOM Red Teaming Dataset](https://github.com/humane-intelligence/ai_village_defcon_grt_data)
   * dataset from DEF CON’s AI red teaming event.
+* [Do Not Answer dataset](https://huggingface.co/datasets/LibrAI/do-not-answer)
+  * Questions across multiple risk areas and harm types to test LLM safety and refusal behavior
+* [Forbidden Questions dataset](https://huggingface.co/datasets/TrustAIRLab/forbidden_question_set)
+  * Questions adopted from OpenAI Usage Policy
 * [HackAPrompt Jailbreak Dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/viewer/default/train?p=1&row=137)
   * dataset for testing AI vulnerability to prompt-based jailbreaking
+* [HarmBench by Center for AI Safety](https://github.com/centerforaisafety/HarmBench)
+  * Evaluation dataset for automated red teaming
 * [HiroKachi Jailbreak Dataset](https://sizu.me/love)
   * dataset focused on adversarial AI prompt attacks
 * [Jailbreak Prompt Generator AI Model](https://huggingface.co/tsq2000/Jailbreak-generator)
   * AI model that generates jailbreak-style prompts
+* [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
+  * Harmful behaviors for jailbreaking evaluation
 * [JailbreakHub by WalledAI](https://huggingface.co/datasets/walledai/JailbreakHub)
   * collection of jailbreak prompts and corresponding model responses
+* [LLM-LAT harmful dataset](https://huggingface.co/datasets/LLM-LAT/harmful-dataset)
+  * Prompts to assess harmful behaviors in LLMs
+* [MedSafetyBench](https://github.com/AI4LIFE-GROUP/med-safety-bench)
+  * Medical safety prompts to evaluate LLM safety in medical contexts
+* [Multilingual Vulnerability dataset](https://github.com/CarsonDon/Multilingual-Vuln-LLMs)
+  * Multilingual prompts demonstrating LLM vulnerabilities
 * [Red Team Resistance Leaderboard](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark)
   * rankings of AI models based on resistance to adversarial attacks
 * [Rentry Jailbreak Datasets](https://rentry.org/gpt0721)
   * collection of datasets related to jailbreak attempts on AI models
 * [SidFeel Jailbreak Dataset](https://github.com/sidfeels/PromptsDB)
   * collection of prompts used for jailbreaking AI models
+* [SorryBench](https://huggingface.co/datasets/sorry-bench/sorry-bench-202503)
+  * adversarial prompts to test LLM safety with linguistic mutations
+* [SOSBench](https://huggingface.co/datasets/SOSBench/SOSBench)
+  * regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws.
+* [TDC23-RedTeaming dataset by walledai](https://huggingface.co/datasets/walledai/TDC23-RedTeaming)
+  * collection of prompts from the red teaming track at TDC23
+* [XSTest dataset](https://github.com/paul-rottger/exaggerated-safety)
+  * Prompts designed to test exaggerated safety behaviors in LLMs


 ## Decentralized Platforms
