Welcome!
Hi everyone! I'm a first-year PhD student in the Department of Computer Science at the University of Toronto, where I work with Prof. Ashton Anderson. My research focuses on the post-training, safety, and mechanistic interpretability of Large Language Models and AI systems. Outside of research, I am an enthusiastic player of Go, basketball, and tennis. Looking forward to connecting with you!
Publications
[ACL 2026] LLM Safety From Within: Detecting Harmful Content with Internal Representations
Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers...
[Under Review] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO)...
[Under Review] Understanding the Dynamics of Demonstration Conflict in In-Context Learning
In-context learning enables large language models to perform novel tasks through few-shot demonstrations. However, demonstrations per se can naturally contain noise and conflicting examples, making this capability vulnerable...
[TMLR 2026] Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess
As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important...
[COLM 2025] SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
The rapid advancement of large vision-language models (VLMs) has introduced challenges in evaluating their reasoning across multiple modalities...
[ACL 2024 Findings] SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification
Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. Current text classification paradigms, however, rely solely on the output of the final layer in the LLM...
[NeurIPS 2024] Maia-2: A Unified Model for Human-AI Alignment in Chess
There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior...
[Under Review] Understanding Mechanisms of Skill Adaptation in Transformers: Chess as a Model System
Generative models can adapt their outputs to different skill levels, yet the mechanism underlying this adaptation remains unexplored. We address this gap using chess as a model system, leveraging its well-defined decision space, precise skill metrics, and formally measurable strategic concepts...