Auditing language models for hidden objectives

Amanda Askell is a philosopher and AI alignment researcher at Anthropic. Her personal site says she previously worked as a research scientist on the policy team at OpenAI.

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Member of Technical Staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University, currently on leave. His public homepage focuses on natural language processing, machine learning, and AI alignment.

Member of Technical Staff at Anthropic and cofounder of Oulipo Labs, working on language model safety, evaluations, and scientific forecasting.

Research scientist at Anthropic working on machine learning and AI safety.

PhD student at the University of Oxford working on AI safety, including scalable oversight and interpretability.

Benjamin Lermen is listed as an author of the Anthropic technical report Auditing language models for hidden objectives.

Josh Batson is a research scientist at Anthropic. Public descriptions of his work emphasize understanding how and why AI systems work, especially interpretability.

Chenyan Zhang is listed as an author of the Anthropic technical report Auditing language models for hidden objectives.

Member of Technical Staff at Anthropic working on AI control, hidden objectives, alignment, and evaluations, with a background in language models, efficient training, and scientific machine learning.

Anthropic alignment researcher whose personal site says he leads the Alignment Science team; previously co-led OpenAI's Superalignment team and earlier worked on reinforcement learning from human feedback at DeepMind.

Assistant Professor of Computer Science at the University of Oxford whose research spans generalization, reasoning, and large language model agents.

Research scientist at Anthropic and assistant professor of computer science at Northeastern University working on interpretability and model understanding.

Amanda Askell

Ethan Perez

Samuel R. Bowman

Nicholas Schiefer

Sören Mindermann

Henry Sleight

Benjamin Lermen

Josh Batson

Chenyan Zhang

Scott Emmons

Jan Leike

Owain Evans

David Bau