Public Atlas: People first, reports as evidence, organizations as context.

Saurav Kadavath

Researcher at Anthropic whose public report authorships and scholarly profiles cover language model evaluation, AI safety, and robustness.

Researcher at Anthropic · 1 organization · 4 reports

Profile status: updated

Trust signals

Profile completeness: 91%

Public sources: 3

Official sources: 2

Last reviewed: Jun 8, 2026

Scholar profile · Structured work · Structured education

AI safety · language model evaluation · robustness

Current frame

Researcher at Anthropic

Education

University of California, Berkeley · MS student · 2020 → 2021

University of California, Berkeley · Undergraduate student · 2016 → 2020

Work

Anthropic · Role not listed

Public links

DBLP · Google Scholar · LinkedIn · OpenReview

Organizations

Anthropic (core)

Reports

Alignment and RLHF · Collective Constitutional AI: Aligning a Language Model with Public Input

Alignment and RLHF · Constitutional AI: Harmlessness from AI Feedback

Alignment and Safety · Many-shot Jailbreaking

Alignment and RLHF · Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Official and primary sources

Saurav Kadavath DBLP profile · Official source · DBLP

Saurav Kadavath OpenReview profile · Official source · OpenReview

Supporting sources

Many-shot Jailbreaking · Supporting source · report · arXiv

LLMpeople is a public atlas for discovering frontier AI researchers with context, provenance, and respect.
