···11-<!-- Thank you for opening a pull request! Please ensure your addition is in the correct section, follows existing formatting, and is in alphabetical order. If you have more information or context about your addition, please share it below: -->
11+<!-- Thank you for opening a pull request!
22+33+Please ensure your addition:
44+- links to a source code repo (versus a marketing or documentation website, if possible)
55+- is in the correct section
66+- follows existing formatting
77+- is in alphabetical order
2899+If you have more information or context about your addition, please share it below: -->
+51-2
README.md
···1515 * hashing algorithm, matching function, and ability to hook into actions
1616* [Hasher-Matcher-Actioner (CLIP demo)](https://github.com/juanmrad/HMA-CLIP-demo)
1717 * HMA extension for CLIP as reference for adding other format extensions
1818+* [hma-matrix by the Matrix.org Foundation](https://github.com/matrix-org/hma-matrix)
1919+ * Matrix-specific extensions to HMA for (primarily) the Matrix ecosystem
1820* [Lattice Extract by Adobe](https://github.com/adobe/lattice_extract)
1921 * grid and lattice detection to guard against FP in hash matching
2022* [MediaModeration (Wiki Extension)](https://github.com/wikimedia/mediawiki-extensions-MediaModeration?tab=readme-ov-file)
···6062 * BERT-based model for detecting toxic content in prompts to language models
616362646363-## AI-powered Guardrails
6565+## AI for Safety
64666567* [Guardrails AI](https://github.com/guardrails-ai/guardrails)
6668 * Python framework that helps build safe AI applications checking input/output for predefined risks
6769* [Kanana Safeguard By Kakao](https://huggingface.co/kakaocorp/kanana-safeguard-8b)
6870 * harmful content detection model based on Kanana 8B
7171+* [Granite Guardian by IBM Research](https://github.com/ibm-granite/granite-guardian)
7272+ * an input-output guardrail for detecting harms in a variety of use cases (general harm, RAG settings, agentic workflows, etc.)
6973* [Llama Guard by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
7074 * AI-powered content moderation model to detect harm in text-based interactions
7175* [Llama Prompt Guard 2 by Meta](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Prompt-Guard-2/86M/MODEL_CARD.md)
7276 * Detects prompt injection and jailbreaking attacks in LLM inputs
7777+* [OpenGuardrails](https://github.com/openguardrails/openguardrails)
7878+ * Security Gateway providing a transparent reverse proxy for OpenAI apis with integrated safety protection
7379* [Purple Llama by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
7480 * set of tools to assess and improve LLM security. Includes Llama Guard, CyberSec Eval, and Code Shield
7581* [RoGuard](https://github.com/Roblox/RoGuard-1.0)
7682 * LLM that helps safeguard unlimited text generation on Roblox
7783* [ShieldGemma by Google DeepMind](https://www.kaggle.com/code/fernandosr85/shieldgemma-web-content-safety-analyzer?scriptVersionId=198456916)
7884 * AI safety toolkit by Google DeepMind designed to help detect and mitigate harmful or unsafe outputs in LLM applications
8585+* [Risk Atlas Nexus by IBM Research](https://github.com/IBM/risk-atlas-nexus)
8686+ * knowledge-graph toolkit that maps AI risk taxonomies (IBM AI Risk Atlas, IBM Granite Guardian MIT AI Risk Repository, NIST AI RMF GenAI Profile, AIR 2024, AILuminate Benchmark, Credo Unified Control Framework, OWASP Top 10 for LLM Apps) to evaluations, mitigations and controls, supporting the generation of structured governance workflows
8787+798880898190## Privacy Protection
···113122 * Tool for testing prompt injection vulnerabilities in AI systems
114123* [Promptfoo](https://github.com/promptfoo/promptfoo)
115124 * Automated LLM evaluations, report generations, several ready-to-use attack strategies
116116-* [PyRIT Documentation](https://azure.github.io/PyRIT/)
125125+* [PyRIT by Microsoft](https://github.com/Azure/PyRIT)
117126 * Microsoft’s Python-based tool for AI red teaming and security testing
118127* [Socketteer](https://github.com/socketteer?tab=repositories)
119128 * Allows AI models to interact, helping test conversational weaknesses
···121130122131## Clustering
123132133133+* [bogofilter](https://bogofilter.sourceforge.io/)
134134+ * spam filter that classifies text using Bayesian statistical analysis; able to learn from classifications and corrections
124135* [scikit-learn](https://github.com/scikit-learn/scikit-learn)
125136 * python library including clustering through various algorithms, such as K-Means, DBSCAN, and hierarchical clustering
126137* [SpamAssassin by Apache](https://spamassassin.apache.org)
···181192182193* [Aegis Content Safety by NVIDIA](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
183194 * dataset created by NVIDIA to aid in content moderation and toxicity detection
195195+* [badwords by Richard Hughes](https://github.com/hughsie/badwords)
196196+ * simple list of bad words in different locales that can be used to flag suspicious user-submitted content
197197+* [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
198198+ * prompts with RLHF markers for unsafe responses across multiple harm categories
184199* [Toxic Chat by LMSYS](https://huggingface.co/datasets/lmsys/toxic-chat)
185200 * dataset of toxic conversations collected from interactions with Vicuna
186201* [Toxicity by Jigsaw](https://huggingface.co/datasets/google/jigsaw_toxicity_pred)
187202 * large number of Wikipedia comments which have been labeled by human raters for toxic behavior
203203+* [Transphobia Awareness dataset](https://doi.org/10.5281/zenodo.15482694)
204204+ * user-generated queries related to transphobia with human annotations and model responses from Quora questions
188205* [Uli Dataset by Tattle](https://github.com/tattle-made/uli_dataset)
189206 * dataset of gendered abuse, created for Uli ML redaction.
190207* [VTC by Unitary AI](https://github.com/unitaryai/VTC)
···195212196213* [AI Alignment Dataset by Anthropic](https://atlas.nomic.ai/map/anthropic_rlhf)
197214 * data used for reinforcement learning with human feedback (RLHF) to align AI models.
215215+* [AILuminate dataset by MLCommons](https://github.com/mlcommons/ailuminate)
216216+ * Human-created prompts across different harm categories
217217+* [Aya Red-teaming dataset by Cohere](https://huggingface.co/datasets/CohereForAI/aya_redteaming)
218218+ * multilingual red-teaming prompts across various harm categories
219219+* [ALERT dataset by Babelscape](https://huggingface.co/datasets/Babelscape/ALERT)
220220+ * standard and adversarial red-teaming prompts
221221+* [CCP Sensitive Prompts by Promptfoo](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts)
222222+ * Prompts covering topics sensitive to the Chinese Communist Party (CCP)
223223+* [DarkBench by Apart](https://huggingface.co/datasets/apart/darkbench)
224224+ * Comprehensive benchmark to detect dark design patterns in LLMs
198225* [DEFCOM Red Teaming Dataset](https://github.com/humane-intelligence/ai_village_defcon_grt_data)
199226 * dataset from DEF CON’s AI red teaming event.
227227+* [Do Not Answer dataset](https://huggingface.co/datasets/LibrAI/do-not-answer)
228228+ * Questions across multiple risk areas and harm types to test LLM safety and refusal behavior
229229+* [Forbidden Questions dataset](https://huggingface.co/datasets/TrustAIRLab/forbidden_question_set)
230230+ * Questions adopted from OpenAI Usage Policy
200231* [HackAPrompt Jailbreak Dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/viewer/default/train?p=1&row=137)
201232 * dataset for testing AI vulnerability to prompt-based jailbreaking
233233+* [HarmBench by Center for AI Safety](https://github.com/centerforaisafety/HarmBench)
234234+ * Evaluation dataset for automated red teaming
202235* [HiroKachi Jailbreak Dataset](https://sizu.me/love)
203236 * dataset focused on adversarial AI prompt attacks
204237* [Jailbreak Prompt Generator AI Model](https://huggingface.co/tsq2000/Jailbreak-generator)
205238 * AI model that generates jailbreak-style prompts
239239+* [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
240240+ * Harmful behaviors for jailbreaking evaluation
206241* [JailbreakHub by WalledAI](https://huggingface.co/datasets/walledai/JailbreakHub)
207242 * collection of jailbreak prompts and corresponding model responses
243243+* [LLM-LAT harmful dataset](https://huggingface.co/datasets/LLM-LAT/harmful-dataset)
244244+ * Prompts to assess harmful behaviors in LLMs
245245+* [MedSafetyBench](https://github.com/AI4LIFE-GROUP/med-safety-bench)
246246+ * Medical safety prompts to evaluate LLM safety in medical contexts
247247+* [Multilingual Vulnerability dataset](https://github.com/CarsonDon/Multilingual-Vuln-LLMs)
248248+ * Multilingual prompts demonstrating LLM vulnerabilities
208249* [Red Team Resistance Leaderboard](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark)
209250 * rankings of AI models based on resistance to adversarial attacks
210251* [Rentry Jailbreak Datasets](https://rentry.org/gpt0721)
211252 * collection of datasets related to jailbreak attempts on AI models
212253* [SidFeel Jailbreak Dataset](https://github.com/sidfeels/PromptsDB)
213254 * collection of prompts used for jailbreaking AI models
255255+* [SorryBench](https://huggingface.co/datasets/sorry-bench/sorry-bench-202503)
256256+ * adversarial prompts to test LLM safety with linguistic mutations
257257+* [SOSBench](https://huggingface.co/datasets/SOSBench/SOSBench)
258258+ * regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws.
259259+* [TDC23-RedTeaming dataset by walledai](https://huggingface.co/datasets/walledai/TDC23-RedTeaming)
260260+ * collection of prompts from the red teaming track at TDC23
261261+* [XSTest dataset](https://github.com/paul-rottger/exaggerated-safety)
262262+ * Prompts designed to test exaggerated safety behaviors in LLMs
214263215264216265## Decentralized Platforms