Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

Alignment and Safety report from Anthropic with 13 connected researchers in the LLMpeople atlas.

Anthropic · 2025-01-31 · 13 researchers
Field: Alignment and Safety
Organization: Anthropic
arXiv: 2501.18837

Canonical link

https://arxiv.org/abs/2501.18837

Connected researchers

Researcher 2 reports

Liane Lovitt

Anthropic

Research scientist at Anthropic whose public work includes AI alignment, reinforcement learning from human feedback, and model behavior.

Researcher 8 reports

Ethan Perez

Anthropic

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Researcher 8 reports

Nicholas Schiefer

Anthropic

Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.

Researcher 6 reports

Samuel Marks

Anthropic

Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.

Researcher 1 reports

Maxwell Tegmark

Anthropic

Researcher at Anthropic and coauthor of the Constitutional Classifiers report.

Researcher 1 reports

Tomás Riofrío

Anthropic

Tomás Riofrío is listed as an author of the Anthropic technical report Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.

Researcher 2 reports

William Saunders

Anthropic

William Saunders is a research scientist at Anthropic working on aligning and evaluating language models. According to his public homepage, he works at the intersection of game theory, optimization, and deep learning; previously interned at OpenAI, DeepMind, and Mila; studied mathematics at the University of Oxford; and is a PhD student in machine learning at Carnegie Mellon University.

Researcher 1 reports

Alexey Nazarov

Anthropic

Member of technical staff at Anthropic focused on safe and reliable AI.

Researcher 1 reports

Jordan Taylor

Anthropic

Jordan Taylor is listed as an author of the Anthropic technical report Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming.

Researcher 3 reports

Alex Tamkin

Anthropic

Member of technical staff at Anthropic whose work focuses on language models, model understanding, and alignment.

Researcher 2 reports

Yanda Chen

Anthropic

Yanda Chen is a member of technical staff at Anthropic and a PhD candidate in computer science at Georgetown University, advised by Kevin Knight. His homepage says he previously worked at the Allen Institute for AI and focuses on AI safety, natural language processing, and deep learning.

Researcher 2 reports

Beth Barnes

Anthropic

President of METR and former team member at Anthropic whose work focuses on evaluating and forecasting frontier AI capabilities.

Researcher 2 reports

Jacob Hilton

Anthropic

Jacob Hilton is a researcher and executive director at the Alignment Research Center, where he works on mechanistic approaches to outperforming random sampling. He previously worked at OpenAI on truthfulness, reinforcement learning, and interpretability for language models. Earlier, he worked at Jane Street and completed a PhD in mathematics at the University of Leeds. He later coauthored Anthropic work on constitutional classifiers.


LLMpeople is a public atlas for discovering frontier AI researchers with context, provenance, and respect.