Mirror of https://github.com/roostorg/awesome-safety-tools
# awesome-safety-tools

A collection of open source tools for online safety

Inspired by prior work like [Awesome Redteaming](https://github.com/yeyintminthuhtut/Awesome-Red-Teaming/) and [Awesome Phishing](https://github.com/PhishyAlice/awesome-phishing). This list is not an endorsement, but rather an attempt to organize and map the available technology. ❤️

Help contribute by opening a pull request to add more resources and tools!

## Hash Matching

* [Altitude by Jigsaw](https://github.com/jigsaw-code/altitude)
  * web UI and hash matching for violent extremism and terrorism content
* [Hasher Matcher Actioner (HMA) by Meta](https://github.com/facebook/ThreatExchange/tree/main/hasher-matcher-actioner)
  * hashing algorithm, matching function, and ability to hook into actions
* [Hasher-Matcher-Actioner (CLIP demo)](https://github.com/juanmrad/HMA-CLIP-demo)
  * HMA extension for CLIP as a reference for adding other format extensions
* [hma-matrix by the Matrix.org Foundation](https://github.com/matrix-org/hma-matrix)
  * Matrix-specific extensions to HMA for (primarily) the Matrix ecosystem
* [Lattice Extract by Adobe](https://github.com/adobe/lattice_extract)
  * grid and lattice detection to guard against false positives in hash matching
* [MediaModeration (Wiki Extension)](https://github.com/wikimedia/mediawiki-extensions-MediaModeration?tab=readme-ov-file)
  * CSAM hash matching for Wikimedia
* [PDQ by Meta](https://github.com/facebook/ThreatExchange/tree/main/pdq)
  * perceptual hash algorithm for images
* [Perception by Thorn](https://github.com/thorn-oss/perception)
  * provides a common wrapper around existing, popular perceptual hashes (such as those implemented by ImageHash)
* [RocketChat CSAM](https://github.com/prostasia/rocketchatcsam)
  * CSAM hash matching for RocketChat
* [TMK by Meta](https://github.com/facebook/ThreatExchange/tree/main/tmk)
  * visual similarity match for videos
* [VPDQ by Meta](https://github.com/facebook/ThreatExchange/tree/main/vpdq)
  * visual similarity match for videos using the PDQ algorithm

## Classification

* [Content Safety API by Google](https://protectingchildren.google/tools-for-partners/#learn-about-our-tools)
  * uses machine learning to detect novel CSAM, nudity, and sexually explicit content in images and videos
  * free service, but requires registration
  * not open source itself, but can be [used via Coop](https://roostorg.github.io/coop/SIGNALS.html#content-safety-api-by-google), which is open source
* [CoPE by Zentropi](https://huggingface.co/zentropi-ai/cope-a-9b)
  * small language model trained for accurate, fast, steerable content classification based on developer-defined content policies
* [Detoxify by Unitary AI](https://github.com/unitaryai/detoxify)
  * detects and mitigates generalized toxic language (including hate speech, harassment, and bullying) in text
* [gpt-oss-safeguard by OpenAI](https://github.com/openai/gpt-oss-safeguard)
  * open-weight reasoning model to classify text content based on provided safety policies
* [NSFW Keras Model](https://github.com/GantMan/nsfw_model)
  * convolutional neural network (CNN) based explicit-image ML model
* [NSFW Filtering](https://github.com/nsfw-filter/nsfw-filter)
  * browser extension to block explicit images from online platforms; user facing
* [OSmod by Jigsaw](https://github.com/conversationai/conversationai-moderator)
  * toolkit of machine learning (ML) tools, models, and APIs that platforms can use to moderate content
* [Perspective API by Jigsaw](https://github.com/conversationai/perspectiveapi)
  * machine learning-powered tool that helps platforms detect and assess the toxicity of online conversations
* [Private Detector by Bumble](https://github.com/bumble-tech/private-detector)
  * pretrained model for detecting lewd images
* [Roblox Voice Safety Classifier](https://github.com/Roblox/voice-safety-classifier)
  * machine learning model that detects and moderates harmful content in real-time voice chat on Roblox; focuses on spoken language detection
* [Sentinel by Roblox](https://github.com/Roblox/Sentinel/tree/main)
  * Python library designed specifically for real-time detection of extremely rare classes of text using contrastive learning principles
* [Toxic Prompt RoBERTa by Intel](https://huggingface.co/Intel/toxic-prompt-roberta)
  * BERT-based model for detecting toxic content in prompts to language models

## AI for Safety

* [Guardrails AI](https://github.com/guardrails-ai/guardrails)
  * Python framework that helps build safe AI applications by checking input/output for predefined risks
* [Kanana Safeguard by Kakao](https://huggingface.co/kakaocorp/kanana-safeguard-8b)
  * harmful content detection model based on Kanana 8B
* [Granite Guardian by IBM Research](https://github.com/ibm-granite/granite-guardian)
  * an input-output guardrail for detecting harms in a variety of use cases (general harm, RAG settings, agentic workflows, etc.)
* [Llama Guard by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
  * AI-powered content moderation model to detect harm in text-based interactions
* [Llama Prompt Guard 2 by Meta](https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Prompt-Guard-2/86M/MODEL_CARD.md)
  * detects prompt injection and jailbreaking attacks in LLM inputs
* [OpenGuardrails](https://github.com/openguardrails/openguardrails)
  * security gateway providing a transparent reverse proxy for OpenAI APIs with integrated safety protection
* [Purple Llama by Meta](https://github.com/meta-llama/PurpleLlama/tree/main/Llama-Guard3)
  * set of tools to assess and improve LLM security; includes Llama Guard, CyberSec Eval, and Code Shield
* [RoGuard](https://github.com/Roblox/RoGuard-1.0)
  * LLM that helps safeguard unlimited text generation on Roblox
* [ShieldGemma by Google DeepMind](https://www.kaggle.com/code/fernandosr85/shieldgemma-web-content-safety-analyzer?scriptVersionId=198456916)
  * AI safety toolkit by Google DeepMind designed to help detect and mitigate harmful or unsafe outputs in LLM applications
* [Risk Atlas Nexus by IBM Research](https://github.com/IBM/risk-atlas-nexus)
  * knowledge-graph toolkit that maps AI risk taxonomies (IBM AI Risk Atlas, IBM Granite Guardian, MIT AI Risk Repository, NIST AI RMF GenAI Profile, AIR 2024, AILuminate Benchmark, Credo Unified Control Framework, OWASP Top 10 for LLM Apps) to evaluations, mitigations, and controls, supporting the generation of structured governance workflows

## Privacy Protection

* [Fawkes Facial De-Recognition Cloaking](https://github.com/Shawn-Shan/fawkes)
  * code and binaries to confuse AIs trying to match identity to photos, such as [Clearview](https://www.theverge.com/23919134/kashmir-hill-your-face-belongs-to-us-clearview-ai-facial-recognition-privacy-decoder)
  * many other great tools at github.com/Shawn-Shan, MIT researcher
* [Presidio by Microsoft](https://github.com/microsoft/presidio)
  * toolset for detecting personally identifiable information (PII) and other sensitive data in images and text

## Core Infrastructure

* [AbuseIO](https://github.com/AbuseIO/AbuseIO)
  * abuse management platform designed to help organizations handle and track abuse complaints related to online content, infrastructure, or services
* [Access by Discord](https://github.com/discord/access)
  * centralized portal for managing access to internal systems within any organization
* [Mjolnir by Matrix](https://github.com/matrix-org/mjolnir)
  * moderation bot for the Matrix protocol that automatically enforces content policies
* [Open Truss by GitHub](https://github.com/open-truss/open-truss)
  * framework designed to help users create internal tools without needing to write code

## Redteaming Tools

* [Aymara](https://github.com/aymara-ai/aymara-sdk-python)
  * automated eval tools for AI safety, accuracy, and jailbreak vulnerability
* [Counterfit by Microsoft](https://github.com/Azure/counterfit/)
  * automation tool for assessing AI model security and robustness
* [Garak by NVIDIA](https://github.com/NVIDIA/garak)
  * framework for adversarial testing and model evaluation
* [LLM Canary](https://github.com/LLM-Canary/LLM-Canary)
  * AI benchmarking tool that evaluates models for security vulnerabilities and adversarial robustness
* [Prompt Fuzzer](https://github.com/prompt-security/ps-fuzz)
  * tool for testing prompt injection vulnerabilities in AI systems
* [Promptfoo](https://github.com/promptfoo/promptfoo)
  * automated LLM evaluations, report generation, and several ready-to-use attack strategies
* [PyRIT by Microsoft](https://github.com/Azure/PyRIT)
  * Microsoft's Python-based tool for AI red teaming and security testing
* [Socketteer](https://github.com/socketteer?tab=repositories)
  * allows AI models to interact, helping test conversational weaknesses

## Clustering

* [bogofilter](https://bogofilter.sourceforge.io/)
  * spam filter that classifies text using Bayesian statistical analysis; able to learn from classifications and corrections
* [scikit-learn](https://github.com/scikit-learn/scikit-learn)
  * Python library including clustering through various algorithms, such as K-Means, DBSCAN, and hierarchical clustering
* [SpamAssassin by Apache](https://spamassassin.apache.org)
  * anti-spam platform that uses a variety of techniques, including text analysis, Bayesian filtering, and DNS blocklists, to classify and block unsolicited email

## Rules Engines

* [Druid by Apache](https://github.com/apache/druid)
  * high-performance real-time analytics database
* [Marble](https://github.com/checkmarble/marble)
  * real-time fraud detection and compliance engine tailored for fintech companies and financial institutions
* [Osprey by ROOST](https://github.com/roostorg/osprey)
  * high-performance rules engine for real-time event processing at scale, designed for Trust & Safety and anti-abuse work
* [RulesEngine by Microsoft](https://microsoft.github.io/RulesEngine/)
  * library for abstracting business logic, rules, and policies from a system via JSON for .NET language families
* [Wikimedia Smite Spam](https://github.com/wikimedia/mediawiki-extensions-SmiteSpam)
  * extension for MediaWiki that helps identify and manage spam content on a wiki

## Review

* [BullMQ](https://github.com/taskforcesh/bullmq)
  * message queue and batch processing for NodeJS and Python based on Redis
* [Content Review Filters by Meta](https://github.com/facebook/content-review-filters)
  * collection of React components to integrate content filters in review tools
* [NCMEC Reporting by ello](https://github.com/ello/ncmec_reporting)
  * Ruby client library for reporting incidents to the National Center for Missing & Exploited Children (NCMEC) CyberTipline
* [Owlculus](https://github.com/be0vlk/owlculus)
  * OSINT (open-source intelligence) toolkit and case management platform
* [RabbitMQ](https://github.com/rabbitmq)
  * message broker that enables applications to communicate with each other by sending messages through queues

## Investigation

* [CIB MangoTree](https://github.com/CIB-Mango-Tree/CIB-Mango-Tree-Website)
  * collection of tools to aid researchers in coordinated inauthentic behavior (CIB) analysis
* [Crossover](https://crossover.social/)
  * open-source project that builds dashboards for monitoring and analyzing the recommendation algorithms of social networks, with a focus on disinformation and election monitoring
* [DAU Dashboard by Tattle](https://github.com/tattle-made/dau-dashboard)
  * the Deepfake Analysis Unit (DAU) is a collaborative space for analyzing deepfakes
* [Feluda by Tattle](https://github.com/tattle-made/feluda)
  * configurable engine for analysing multi-lingual and multi-modal content
* [Interference by Digital Forensics Research Lab](https://github.com/DFRLab/interference2024)
  * interactive, open-source database that tracks allegations of foreign interference or foreign malign influence relevant to the 2024 U.S. presidential election
* [OpenMeasures](https://gitlab.com/openmeasures)
  * open source platform for investigating internet trends
* [ThreatExchange by Meta](https://github.com/facebook/ThreatExchange)
  * platform that enables organizations to share information about threats, such as malware, phishing attacks, and online safety harms, in a structured and privacy-compliant manner
* [ThreatExchange Client via PHP](https://github.com/certly/threatexchange)
  * PHP client for ThreatExchange
* [ThreatExchange via Python](https://github.com/facebook/ThreatExchange/tree/main/python-threatexchange)
  * Python library for ThreatExchange
* [TikTok Observatory](https://github.com/aiforensics/tkobservatory)
  * open-source project maintained by [AI Forensics](https://aiforensics.org/) that allows researchers to monitor the promotion and demotion of content by the TikTok recommendation algorithm

## Datasets

* [Aegis Content Safety by NVIDIA](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-2.0)
  * dataset created by NVIDIA to aid in content moderation and toxicity detection
* [badwords by Richard Hughes](https://github.com/hughsie/badwords)
  * simple list of bad words in different locales that can be used to flag suspicious user-submitted content
* [PKU-SafeRLHF dataset](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF)
  * prompts with RLHF markers for unsafe responses across multiple harm categories
* [Toxic Chat by LMSYS](https://huggingface.co/datasets/lmsys/toxic-chat)
  * dataset of toxic conversations collected from interactions with Vicuna
* [Toxicity by Jigsaw](https://huggingface.co/datasets/google/jigsaw_toxicity_pred)
  * large number of Wikipedia comments which have been labeled by human raters for toxic behavior
* [Transphobia Awareness dataset](https://doi.org/10.5281/zenodo.15482694)
  * user-generated queries related to transphobia, drawn from Quora questions, with human annotations and model responses
* [Uli Dataset by Tattle](https://github.com/tattle-made/uli_dataset)
  * dataset of gendered abuse, created for Uli ML redaction
* [VTC by Unitary AI](https://github.com/unitaryai/VTC)
  * implementation of video-text retrieval with comments, including a dataset, a method for identifying relevant auxiliary information that adds context to videos, and a quantification of the value the comment modality brings to video

## Red Teaming Datasets

* [AI Alignment Dataset by Anthropic](https://atlas.nomic.ai/map/anthropic_rlhf)
  * data used for reinforcement learning from human feedback (RLHF) to align AI models
* [AILuminate dataset by MLCommons](https://github.com/mlcommons/ailuminate)
  * human-created prompts across different harm categories
* [Aya Red-teaming dataset by Cohere](https://huggingface.co/datasets/CohereForAI/aya_redteaming)
  * multilingual red-teaming prompts across various harm categories
* [ALERT dataset by Babelscape](https://huggingface.co/datasets/Babelscape/ALERT)
  * standard and adversarial red-teaming prompts
* [CCP Sensitive Prompts by Promptfoo](https://huggingface.co/datasets/promptfoo/CCP-sensitive-prompts)
  * prompts covering topics sensitive to the Chinese Communist Party (CCP)
* [DarkBench by Apart](https://huggingface.co/datasets/apart/darkbench)
  * comprehensive benchmark to detect dark design patterns in LLMs
* [DEF CON Red Teaming Dataset](https://github.com/humane-intelligence/ai_village_defcon_grt_data)
  * dataset from DEF CON's AI red teaming event
* [Do Not Answer dataset](https://huggingface.co/datasets/LibrAI/do-not-answer)
  * questions across multiple risk areas and harm types to test LLM safety and refusal behavior
* [Forbidden Questions dataset](https://huggingface.co/datasets/TrustAIRLab/forbidden_question_set)
  * questions adopted from the OpenAI Usage Policy
* [HackAPrompt Jailbreak Dataset](https://huggingface.co/datasets/hackaprompt/hackaprompt-dataset/viewer/default/train?p=1&row=137)
  * dataset for testing AI vulnerability to prompt-based jailbreaking
* [HarmBench by Center for AI Safety](https://github.com/centerforaisafety/HarmBench)
  * evaluation dataset for automated red teaming
* [HiroKachi Jailbreak Dataset](https://sizu.me/love)
  * dataset focused on adversarial AI prompt attacks
* [Jailbreak Prompt Generator AI Model](https://huggingface.co/tsq2000/Jailbreak-generator)
  * AI model that generates jailbreak-style prompts
* [JailbreakBench](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
  * harmful behaviors for jailbreaking evaluation
* [JailbreakHub by WalledAI](https://huggingface.co/datasets/walledai/JailbreakHub)
  * collection of jailbreak prompts and corresponding model responses
* [LLM-LAT harmful dataset](https://huggingface.co/datasets/LLM-LAT/harmful-dataset)
  * prompts to assess harmful behaviors in LLMs
* [MedSafetyBench](https://github.com/AI4LIFE-GROUP/med-safety-bench)
  * medical safety prompts to evaluate LLM safety in medical contexts
* [Multilingual Vulnerability dataset](https://github.com/CarsonDon/Multilingual-Vuln-LLMs)
  * multilingual prompts demonstrating LLM vulnerabilities
* [Red Team Resistance Leaderboard](https://huggingface.co/spaces/HaizeLabs/red-teaming-resistance-benchmark)
  * rankings of AI models based on resistance to adversarial attacks
* [Rentry Jailbreak Datasets](https://rentry.org/gpt0721)
  * collection of datasets related to jailbreak attempts on AI models
* [SidFeel Jailbreak Dataset](https://github.com/sidfeels/PromptsDB)
  * collection of prompts used for jailbreaking AI models
* [SorryBench](https://huggingface.co/datasets/sorry-bench/sorry-bench-202503)
  * adversarial prompts to test LLM safety with linguistic mutations
* [SOSBench](https://huggingface.co/datasets/SOSBench/SOSBench)
  * regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology; comprises 3,000 prompts derived from real-world regulations and laws
* [TDC23-RedTeaming dataset by walledai](https://huggingface.co/datasets/walledai/TDC23-RedTeaming)
  * collection of prompts from the red teaming track at TDC23
* [XSTest dataset](https://github.com/paul-rottger/exaggerated-safety)
  * prompts designed to test exaggerated safety behaviors in LLMs

## Decentralized Platforms

* [Automod by Bluesky](https://github.com/bluesky-social/indigo/tree/main/automod)
  * tool for automating content moderation processes for the Bluesky social network and other apps on the AT Protocol
* [FediCheck](https://connect.iftas.org/library/iftas-documentation/fedicheck/)
  * domain moderation tool to assist ActivityPub service providers, such as Mastodon servers; now open-sourced
* [Fediverse Spam Filtering](https://github.com/MarcT0K/Fediverse-Spam-Filtering/)
  * spam filter for Fediverse social media platforms; the current version is only a proof of concept
* [FIRES](https://github.com/fedimod/fires)
  * reference server and protocol for the exchange of moderation advisories and recommendations
* [Ozone by Bluesky](https://github.com/bluesky-social/ozone)
  * labeling tool designed for Bluesky; includes moderation features to action on abuse flags, policy enforcement tools, and investigation features

## User Safety Tools

* [Frankly by Applied Social Media Lab](https://github.com/berkmancenter/frankly/)
  * online deliberations platform that allows anyone to host video-enabled conversations about any topic
* [PolicyKit by UW Social Futures Lab](https://github.com/policykit/policykit)
  * toolkit for building governance in your online community
* [SquadBox by UW Social Futures Lab](https://github.com/amyxzhang/squadbox)
  * tool to help people who are being harassed online by having their friends (or "squad") moderate their messages
* [Uli by Tattle](https://github.com/tattle-made/Uli)
  * software and resources for mitigating online gender-based violence in India
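
For orientation, the perceptual hash matching approach used by several tools in the Hash Matching section (PDQ, VPDQ, the ImageHash wrappers in Perception) can be illustrated with a toy average-hash sketch. This is a simplified stand-in written for this list, not the PDQ algorithm or any tool above: it assumes the input is already an 8x8 grayscale grid, whereas real implementations decode and resize actual images first.

```python
# Toy average-hash (aHash) sketch of perceptual hash matching.
# Assumption: input is an 8x8 grid of 0-255 grayscale values;
# real tools (PDQ, ImageHash) handle image decoding and resizing.

def average_hash(pixels):
    """Return a 64-bit hash: each bit is 1 if that pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming_distance(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def is_match(h1, h2, threshold=10):
    """Treat hashes within `threshold` bits as near-duplicates."""
    return hamming_distance(h1, h2) <= threshold

# A toy "image" and a slightly brightened copy of it.
original = [[(r * 8 + c) % 256 for c in range(8)] for r in range(8)]
perturbed = [[min(255, p + 3) for p in row] for row in original]

h_orig = average_hash(original)
h_pert = average_hash(perturbed)
assert is_match(h_orig, h_pert)  # small edits survive perceptual hashing
```

Unlike cryptographic hashes, where any change flips the digest entirely, perceptual hashes keep similar inputs close in Hamming distance, which is what lets matchers like HMA compare uploads against known-harm hash lists despite re-encoding or minor edits.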