Google DeepMind · Zurich

Research Scientist, Frontier, Zurich

1/20/2026

Description

We are seeking a Research Scientist or Engineer to lead the development of next-generation post-training recipes for Gemini. In this role, you will move beyond standard tuning; you will architect the Reward Modeling and Reinforcement Learning strategies that define how our most capable models learn. You will focus specifically on "hard" capabilities—such as improving chain-of-thought reasoning and complex instruction following—where synthetic data and distillation fall short. You will work horizontally to ensure these recipes scale across text, audio, and multimodal domains, establishing the gold standard for how Gemini evolves.

Key responsibilities:

  • Frontier Recipe Development: Design and validate novel post-training pipelines (SFT, RLHF, RLAIF) specifically for frontier-class models where no "teacher" model exists.
  • Advance Reward Modeling: Lead research into next-gen Reward Models, including investigating new architectures, reducing reward hacking, and improving signal-to-noise ratios in preference data.
  • Unlock "Thinking" Capabilities: Develop innovative methods to improve the model's internal reasoning (chain-of-thought), focusing on correctness, logic, and self-correction in multi-step tasks.
  • Revamp RL Paradigms: Critically re-evaluate and optimize RL prompts and feedback mechanisms to extract maximum performance from the underlying base models.
  • Solve the "Flywheel" Challenge: Create robust mechanisms to turn user signals and interactions into training data that continuously improves the model without introducing regression or bias.
  • Horizontal Impact: Collaborate across teams to apply these advanced recipes to various model sizes and modalities (e.g., Audio), ensuring consistent high-quality behavior.

Qualifications

  • PhD in machine learning, artificial intelligence, or computer science (or equivalent practical experience).
  • Strong background in Large Language Models (LLMs), Reinforcement Learning (RL), or preference learning.
  • Research interest in aligning AI systems with human feedback and utility.
  • Familiarity with experiment design and analyzing large-scale user data.
  • Strong coding and communication skills.
  • Experience with RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization).
  • Experience building or improving reward models and conducting human evaluation studies.
  • A proven track record of publications in top-tier conferences (e.g., NeurIPS, ICML, ICLR).
  • Experience with Chain-of-Thought (CoT) reasoning research or process-based supervision.
  • Deep understanding of, and experience with, training models from scratch or using self-play/self-improvement techniques.

Application

View the listing at its original source and apply!