ai-safety

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

ShengranHu / Thought-Cloning

Star

[NeurIPS '23 Spotlight] Thought Cloning: Learning to Think while Acting by Imitating Human Thinking

reinforcement-learning deep-learning pytorch artificial-intelligence imitation-learning ai-safety

Updated Mar 1, 2024
Python

hendrycks / ethics

Star

Aligning AI With Shared Human Values (ICLR 2021)

ai-safety machine-ethics ml-safety ethical-ai gpt-3

Updated Apr 21, 2023
Python

normster / llm_rules

Star

RuLES: a benchmark for evaluating rule-following in language models

ai-safety ai-security gpt-4

Updated Jun 13, 2024
Python

lets-make-safe-ai / make-safe-ai

Star

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

ai agi artificial-intelligence artificial-general-intelligence ai-safety ai-alignment

Updated Mar 29, 2023

tomekkorbak / pretraining-with-human-feedback

Star

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

Giskard-AI / awesome-ai-safety

Sponsor

Star

📚 A curated list of papers & technical articles on AI Quality & Safety

Updated Oct 13, 2023

WindVChen / DiffAttack

Star

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.

ai-safety diffusion-models unrestricted-attacks adverarial-attacks transferable-attacks diffusion-adversarial-attack imperceptible-attacks

Updated Feb 20, 2024
Python

ryoungj / ToolEmu

Star

A language model (LM)-based emulation framework for identifying the risks of LM agents with tool use

agent language-model ai-safety large-language-models prompt-engineering language-agent

Updated Mar 22, 2024
Python

microsoft / SafeNLP

Star

Safety Score for Pre-Trained Language Models

nlp ai-safety fairness-ai

Updated Oct 18, 2023
Python

PKU-YuanGroup / Hallucination-Attack

Star

Attack to induce LLMs within hallucinations

nlp machine-learning deep-learning ai-safety adversarial-attacks hallucinations llm llm-safety

Updated May 17, 2024
Python

PKU-Alignment / beavertails

Star

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

safety llama gpt datasets language-model beaver ai-safety human-feedback-data llm llms human-feedback rlhf large-language-model safe-rlhf

Updated Oct 27, 2023
Makefile

megvii-research / FSSD_OoD_Detection

Star

Feature Space Singularity for Out-of-Distribution Detection. (SafeAI 2021)

anomaly ai-safety anomaly-detection out-of-distribution-detection ood-detection

Updated Feb 15, 2021
Python

EzgiKorkmaz / adversarial-reinforcement-learning

Star

Reading list for adversarial perspective and robustness in deep reinforcement learning.

Updated Sep 18, 2023

dlmacedo / entropic-out-of-distribution-detection

Star

A project to add scalable state-of-the-art out-of-distribution detection (open set recognition) support by changing two lines of code! Perform efficient inferences (i.e., do not increase inference time) and detection without classification accuracy drop, hyperparameter tuning, or collecting additional data.

machine-learning deep-learning pytorch ood osr ai-safety open-set anomaly-detection novelty-detection robust-machine-learning open-set-recognition out-of-distribution out-of-distribution-detection ood-detection trustworthy-machine-learning trustworthy-ai

Updated Sep 22, 2022
Python

SafeAILab / RAIN

Star

[ICLR'24] RAIN: Your Language Models Can Align Themselves without Finetuning

alignment ai-safety large-language-models

Updated May 23, 2024
Python

Improve this page

Add a description, image, and links to the ai-safety topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the ai-safety topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ai-safety

Here are 95 public repositories matching this topic...

Giskard-AI / giskard

jphall663 / awesome-machine-learning-interpretability

PKU-Alignment / safe-rlhf

tigerlab-ai / tiger

agencyenterprise / PromptInject

ShengranHu / Thought-Cloning

hendrycks / ethics

normster / llm_rules

lets-make-safe-ai / make-safe-ai

tomekkorbak / pretraining-with-human-feedback

Giskard-AI / awesome-ai-safety

WindVChen / DiffAttack

ryoungj / ToolEmu

microsoft / SafeNLP

PKU-YuanGroup / Hallucination-Attack

PKU-Alignment / beavertails

megvii-research / FSSD_OoD_Detection

EzgiKorkmaz / adversarial-reinforcement-learning

dlmacedo / entropic-out-of-distribution-detection

SafeAILab / RAIN

Improve this page

Add this topic to your repo