Alignment faking in large language models

Jared D. Kaplan is a co-founder and Chief Science Officer at Anthropic. Anthropic's public materials also identify him as the company's Responsible Scaling Officer.

Research scientist at Anthropic focused on scalable oversight, AI safety, and language model evaluation; previously worked at New York University and Google.

Member of technical staff at Anthropic and associate professor of computer science, data science, and linguistics at New York University (on leave). His public homepage focuses on natural language processing, machine learning, and AI alignment.

Senior research engineer at Anthropic interested in agent foundations, model organisms of misalignment, and human-computer interaction.

Evan Hubinger is Head of Alignment Stress-Testing at Anthropic, where he works on AI safety and alignment. He previously worked at MIRI and OpenAI, studied mathematics and computer science at Harvey Mudd College, and is known for work on inner alignment, deceptive alignment, and alignment stress-testing.

Member of Technical Staff at Anthropic and PhD student at Carnegie Mellon University focused on AI safety, evaluations, and oversight of large language models.

Member of technical staff at Anthropic working on alignment science and the evaluation of hidden objectives in language models.

Associate Professor at the University of Toronto whose research spans deep learning, probabilistic modeling, and machine learning methods for science and AI safety.

Research scientist at Anthropic working on machine learning and AI safety.

Ryan Greenblatt is chief scientist at Redwood Research. His public Redwood and Forethought profiles identify him as part of Redwood's AI safety team and say he holds a BS in applied mathematics and computer science from Brown University.

Buck Shlegeris is the CEO of Redwood Research. His public homepage focuses on AI safety, model evaluations, and alignment.

Member of Technical Staff at Anthropic and researcher in neural circuits and mechanistic interpretability, building tools for understanding AI systems.

Researcher at Anthropic with interests in machine learning, AI alignment, and economics.

Research scientist at Anthropic focused on safety and robustness for language models and reinforcement learning.

Jared D. Kaplan

Ethan Perez

Samuel R. Bowman

Samuel Marks

Evan Hubinger

Carson Denison

Monte MacDiarmid

David Duvenaud

Sören Mindermann

Ryan Greenblatt

Buck Shlegeris

Johannes Treutlein

Jack Chen

Linda Petrini