Mirror of https://github.com/roostorg/awesome-safety-tools
Merge branch 'main' into add-google-content-safety-api

Authored by Cassidy James Blaede, committed by GitHub

0dc9ff3d 1265d17f

+59 -3
.github/pull_request_template.md (+8 −1)

···
-<!-- Thank you for opening a pull request! Please ensure your addition is in the correct section, follows existing formatting, and is in alphabetical order. If you have more information or context about your addition, please share it below: -->
+<!-- Thank you for opening a pull request!
+
+Please ensure your addition:
+- links to a source code repo (versus a marketing or documentation website, if possible)
+- is in the correct section
+- follows existing formatting
+- is in alphabetical order
 
+If you have more information or context about your addition, please share it below: -->
README.md (+51 −2)

···
   * hashing algorithm, matching function, and ability to hook into actions
 * [Hasher-Matcher-Actioner (CLIP demo)](https://github.com/juanmrad/HMA-CLIP-demo)
   * HMA extension for CLIP as reference for adding other format extensions
+* [hma-matrix by the Matrix.org Foundation](https://github.com/matrix-org/hma-matrix)
+  * Matrix-specific extensions to HMA for (primarily) the Matrix ecosystem
 * [Lattice Extract by Adobe](https://github.com/adobe/lattice_extract)
   * grid and lattice detection to guard against FP in hash matching
 * [MediaModeration (Wiki Extension)](https://github.com/wikimedia/mediawiki-extensions-MediaModeration?tab=readme-ov-file)
···
   * BERT-based model for detecting toxic content in prompts to language models
 
 
-## AI-powered Guardrails
+## AI for Safety
 
 * [Guardrails AI](https://github.com/guardrails-ai/guardrails)
   * Python framework that helps build safe AI applications, checking input/output for predefined risks
 * [Kanana Safeguard by Kakao](https://huggingface.co/kakaocorp/kanana-safeguard-8b)
   * harmful content detection model based on Kanana 8B
+* [Granite Guardian by IBM Research](https://github.com/ibm-granite/granite-guardian)
+  * input-output guardrail for detecting harms in a variety of use cases (general harm, RAG settings, agentic workflows, etc.)
 * [Llama Guard by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
   * AI-powered content moderation model to detect harm in text-based interactions
 * [Llama Prompt Guard 2 by Meta](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Prompt-Guard-2/86M/MODEL_CARD.md)
   * detects prompt injection and jailbreaking attacks in LLM inputs
+* [OpenGuardrails](https://github.com/openguardrails/openguardrails)
+  * security gateway providing a transparent reverse proxy for OpenAI APIs with integrated safety protection
 * [Purple Llama by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
   * set of tools to assess and improve LLM security; includes Llama Guard, CyberSec Eval, and Code Shield
 * [RoGuard](https://github.com/Roblox/RoGuard-1.0)
   * LLM that helps safeguard unlimited text generation on Roblox
 * [ShieldGemma by Google DeepMind](https://www.kaggle.com/code/fernandosr85/shieldgemma-web-content-safety-analyzer?scriptVersionId=198456916)
   * AI safety toolkit by Google DeepMind designed to help detect and mitigate harmful or unsafe outputs in LLM applications
+* [Risk Atlas Nexus by IBM Research](https://github.com/IBM/risk-atlas-nexus)
+  * knowledge-graph toolkit that maps AI risk taxonomies (IBM AI Risk Atlas, IBM Granite Guardian, MIT AI Risk Repository, NIST AI RMF GenAI Profile, AIR 2024, AILuminate Benchmark, Credo Unified Control Framework, OWASP Top 10 for LLM Apps) to evaluations, mitigations, and controls, supporting the generation of structured governance workflows
···
   * Tool for testing prompt injection vulnerabilities in AI systems
 * [Promptfoo](https://github.com/promptfoo/promptfoo)
   * Automated LLM evaluations, report generation, several ready-to-use attack strategies
-* [PyRIT Documentation](https://azure.github.io/PyRIT/)
+* [PyRIT by Microsoft](https://github.com/Azure/PyRIT)
   * Microsoft's Python-based tool for AI red teaming and security testing
 * [Socketteer](https://github.com/socketteer?tab=repositories)
   * Allows AI models to interact, helping test conversational weaknesses
···
 ## Clustering
 
+* [bogofilter](https://bogofilter.sourceforge.io/)
+  * spam filter that classifies text using Bayesian statistical analysis; able to learn from classifications and corrections
 * [scikit-learn](https://github.com/scikit-learn/scikit-learn)
   * Python library including clustering through various algorithms, such as K-Means, DBSCAN, and hierarchical clustering
 * [SpamAssassin by Apache](https://spamassassin.apache.org)
···
 * [Aegis Content Safety by NVIDIA](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
   * dataset created by NVIDIA to aid in content moderation and toxicity detection
+* [badwords by Richard Hughes](https://github.com/hughsie/badwords)
+  * simple list of bad words in different locales that can be used to flag suspicious user-submitted content
+* [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
+  * prompts with RLHF markers for unsafe responses across multiple harm categories
 * [Toxic Chat by LMSYS](https://huggingface.co/datasets/lmsys/toxic-chat)
   * dataset of toxic conversations collected from interactions with Vicuna
 * [Toxicity by Jigsaw](https://huggingface.co/datasets/google/jigsaw_toxicity_pred)
   * large number of Wikipedia comments which have been labeled by human raters for toxic behavior
+* [Transphobia Awareness dataset](https://doi.org/10.5281/zenodo.15482694)
+  * user-generated queries related to transphobia, with human annotations and model responses, drawn from Quora questions
 * [Uli Dataset by Tattle](https://github.com/tattle-made/uli_dataset)
   * dataset of gendered abuse, created for Uli ML redaction
 * [VTC by Unitary AI](https://github.com/unitaryai/VTC)
···
 * [AI Alignment Dataset by Anthropic](https://atlas.nomic.ai/map/anthropic_rlhf)
   * data used for reinforcement learning with human feedback (RLHF) to align AI models
+* [AILuminate dataset by MLCommons](https://github.com/mlcommons/ailuminate)
+  * human-created prompts across different harm categories
+* [Aya Red-teaming dataset by Cohere](https://huggingface.co/datasets/CohereForAI/aya_redteaming)
+  * multilingual red-teaming prompts across various harm categories
+* [ALERT dataset by Babelscape](https://huggingface.co/datasets/Babelscape/ALERT)
+  * standard and adversarial red-teaming prompts
+* [CCP Sensitive Prompts by Promptfoo](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts)
+  * prompts covering topics sensitive to the Chinese Communist Party (CCP)
+* [DarkBench by Apart](https://huggingface.co/datasets/apart/darkbench)
+  * comprehensive benchmark to detect dark design patterns in LLMs
 * [DEFCOM Red Teaming Dataset](https://github.com/humane-intelligence/ai_village_defcon_grt_data)
   * dataset from DEF CON's AI red teaming event
+* [Do Not Answer dataset](https://huggingface.co/datasets/LibrAI/do-not-answer)
+  * questions across multiple risk areas and harm types to test LLM safety and refusal behavior
+* [Forbidden Questions dataset](https://huggingface.co/datasets/TrustAIRLab/forbidden_question_set)
+  * questions adopted from the OpenAI usage policy
 * [HackAPrompt Jailbreak Dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/viewer/default/train?p=1&row=137)
   * dataset for testing AI vulnerability to prompt-based jailbreaking
+* [HarmBench by Center for AI Safety](https://github.com/centerforaisafety/HarmBench)
+  * evaluation dataset for automated red teaming
 * [HiroKachi Jailbreak Dataset](https://sizu.me/love)
   * dataset focused on adversarial AI prompt attacks
 * [Jailbreak Prompt Generator AI Model](https://huggingface.co/tsq2000/Jailbreak-generator)
   * AI model that generates jailbreak-style prompts
+* [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
+  * harmful behaviors for jailbreaking evaluation
 * [JailbreakHub by WalledAI](https://huggingface.co/datasets/walledai/JailbreakHub)
   * collection of jailbreak prompts and corresponding model responses
+* [LLM-LAT harmful dataset](https://huggingface.co/datasets/LLM-LAT/harmful-dataset)
+  * prompts to assess harmful behaviors in LLMs
+* [MedSafetyBench](https://github.com/AI4LIFE-GROUP/med-safety-bench)
+  * medical safety prompts to evaluate LLM safety in medical contexts
+* [Multilingual Vulnerability dataset](https://github.com/CarsonDon/Multilingual-Vuln-LLMs)
+  * multilingual prompts demonstrating LLM vulnerabilities
 * [Red Team Resistance Leaderboard](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark)
   * rankings of AI models based on resistance to adversarial attacks
 * [Rentry Jailbreak Datasets](https://rentry.org/gpt0721)
   * collection of datasets related to jailbreak attempts on AI models
 * [SidFeel Jailbreak Dataset](https://github.com/sidfeels/PromptsDB)
   * collection of prompts used for jailbreaking AI models
+* [SorryBench](https://huggingface.co/datasets/sorry-bench/sorry-bench-202503)
+  * adversarial prompts to test LLM safety with linguistic mutations
+* [SOSBench](https://huggingface.co/datasets/SOSBench/SOSBench)
+  * regulation-grounded, hazard-focused benchmark spanning six high-risk scientific domains (chemistry, biology, medicine, pharmacology, physics, and psychology), comprising 3,000 prompts derived from real-world regulations and laws
+* [TDC23-RedTeaming dataset by WalledAI](https://huggingface.co/datasets/walledai/TDC23-RedTeaming)
+  * collection of prompts from the red-teaming track at TDC23
+* [XSTest dataset](https://github.com/paul-rottger/exaggerated-safety)
+  * prompts designed to test exaggerated safety behaviors in LLMs
 
 
 ## Decentralized Platforms
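The bogofilter entry added under Clustering describes a spam filter that "classifies text using Bayesian statistical analysis." A minimal sketch of that general technique (a naive Bayes classifier with Laplace smoothing — this is an illustration, not bogofilter's actual implementation, and the training messages below are made up):

```python
from collections import Counter
import math

def tokenize(text):
    return text.lower().split()

class NaiveBayes:
    """Tiny two-class naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        scores = {}
        for label in ("spam", "ham"):
            # log prior: fraction of training documents with this label
            score = math.log(self.doc_counts[label] / total_docs)
            n = sum(self.word_counts[label].values())
            for word in tokenize(text):
                # Laplace (add-one) smoothing so unseen words
                # don't zero out the whole probability
                score += math.log(
                    (self.word_counts[label][word] + 1) / (n + len(vocab))
                )
            scores[label] = score
        return max(scores, key=scores.get)

# Made-up training data for illustration
nb = NaiveBayes()
nb.train("win free money now", "spam")
nb.train("free prize claim now", "spam")
nb.train("lunch meeting agenda", "ham")
nb.train("project status meeting", "ham")

print(nb.classify("claim your free money"))  # → spam
```

Tools like bogofilter refine this core idea considerably (per-token spamicity scores, Fisher's method for combining them, and retraining on user corrections), but the Bayes-rule scoring above is the underlying mechanism.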