Nicholas Schiefer | Member of Technical Staff at Anthropic.

Alignment and Safety | Auditing language models for hidden objectives

Alignment and RLHF | Collective Constitutional AI: Aligning a Language Model with Public Input

Alignment and RLHF | Constitutional AI: Harmlessness from AI Feedback

Alignment and Safety | Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Alignment and Safety | Constitutional Classifiers++: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Alignment and Safety | Many-shot Jailbreaking

Alignment and Safety | Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Interpretability | Tracing the thoughts of a large language model
