Welcome!

Hi everyone! I'm a first-year PhD student in the Department of Computer Science at the University of Toronto, where I work with Prof. Ashton Anderson. My research focuses on the post-training, safety, and mechanistic interpretability of Large Language Models and AI systems. Outside of research, I am deeply enthusiastic about Go, basketball, and tennis. Looking forward to connecting with you!

Publications

[ACL 2026] LLM Safety From Within: Detecting Harmful Content with Internal Representations

Difan Jiao, Yilun Liu, Ye Yuan, Zhenwei Tang, Linfeng Du, Haolun Wu, Ashton Anderson

Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers...

[Under Review] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Difan Jiao, Qianfeng Wen*, Blair Yang, Zhenwei Tang, Ashton Anderson

We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO)...

[Under Review] Understanding the Dynamics of Demonstration Conflict in In-Context Learning

Difan Jiao, Di Wang, Lijie Hu

In-context learning enables large language models to perform novel tasks through few-shot demonstrations. However, demonstrations per se can naturally contain noise and conflicting examples, making this capability vulnerable...

[TMLR 2026] Learning to Imitate with Less: Efficient Individual Behavior Modeling in Chess

Zhenwei Tang, Difan Jiao, Eric Xue, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

As humans seek to collaborate with, learn from, and better understand artificial intelligence systems, developing AIs that can accurately emulate individual decision-making becomes increasingly important...

[COLM 2025] SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson

The rapid advancement of large vision-language models (VLMs) has introduced challenges in evaluating their reasoning across multiple modalities...

[ACL 2024 Findings] SPIN: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification

Difan Jiao, Yilun Liu*, Zhenwei Tang, Daniel Matter, Jürgen Pfeffer, Ashton Anderson

Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. Current text classification paradigms, however, rely solely on the output of the final layer in the LLM...

[NeurIPS 2024] Maia-2: A Unified Model for Human-AI Alignment in Chess

Zhenwei Tang, Difan Jiao, Reid McIlroy-Young, Jon Kleinberg, Siddhartha Sen, Ashton Anderson

There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior...

[Under Review] Understanding Mechanisms of Skill Adaptation in Transformers: Chess as a Model System

Difan Jiao, George Eilender, Zhenwei Tang, Ashton Anderson

Generative models can adapt their outputs to different skill levels, yet the mechanism underlying this adaptation remains unexplored. We address this gap using chess as a model system, leveraging its well-defined decision space, precise skill metrics, and formally measurable strategic concepts...