We research the foundations and applications of approaches that make AI explainable and controllable.
Mechanistic Interpretability
We aim to understand the internal mechanisms of DNN models and reveal how those mechanisms lead to model behavior, especially behavior related to knowledge and safety. We develop methods to probe DNNs, including but not limited to language models, at multiple levels of abstraction: layer-wise, module-wise, attention-wise, neuron-wise, and so on. Recently, we have also been working on sparse autoencoder (SAE) features. How do the signals extracted from each of these components explain the model's behavior? When we intervene on these components, can we steer that behavior? In addition to developing "testing methods" that characterize DNN models with high validity and reliability, we also construct "testing materials" that evaluate their crucial capabilities.
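As a concrete illustration of the probe-then-intervene idea above, the sketch below trains a linear probe on synthetic "hidden states" and then shifts states along the learned probe direction. All data, dimensions, and hyperparameters here are illustrative assumptions, not taken from any of our papers; real probing would use activations extracted from an actual model layer.

```python
import numpy as np

# Synthetic stand-in for one layer's hidden states (assumed setup):
# positive examples carry a "property direction", negatives its opposite.
rng = np.random.default_rng(0)
d = 16    # hidden dimension (illustrative)
n = 200   # examples per class

direction = rng.normal(size=d)
pos = rng.normal(size=(n, d)) + direction
neg = rng.normal(size=(n, d)) - direction
X = np.vstack([pos, neg])
y = np.array([1] * n + [0] * n)

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = np.clip(X @ w + b, -30, 30)   # avoid overflow in exp
    p = 1 / (1 + np.exp(-z))
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

acc = np.mean(((X @ w + b) > 0) == y)
print(f"probe accuracy: {acc:.2f}")

# Intervention sketch: pushing negative states along the probe direction
# flips the probe's prediction, mimicking activation steering.
steered = neg + 8 * w / np.linalg.norm(w)
flipped = np.mean((steered @ w + b) > 0)
print(f"fraction of steered negatives classified positive: {flipped:.2f}")
```

The same two-step pattern (fit a lightweight readout, then edit activations along the readout's direction) is the skeleton behind much of layer- and neuron-level interpretability work; only the source of the activations and the property being probed change.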
Related works include:
- What does the Knowledge Neuron Thesis Have to do with Knowledge? (2024)
- Understanding Language Model Circuits through Model Editing (2024)
- A State-Vector Framework for Dataset Effects (2023)
- Predicting Fine-Tuning Performance with Probing (2023)
- On the Data Requirements of Probing (2022)
- An Information-Theoretic View on Selecting Linguistic Probes (2020)
AI for Research and Knowledge Discovery
We develop AI tools for scientific research. We are interested in how AI can help researchers at multiple steps in the process of producing novel scientific knowledge: understanding scientific papers, formulating hypotheses, planning experiments, analyzing experimental results, generating scientific explanations, and writing academic papers. Together, these steps lead to the discovery of new knowledge.
Related works include:
- What Would You Ask When You First Saw a²+b²=c²? Evaluating LLM on Curiosity-Driven Questioning (2025)
- LLM-Generated Black-box Explanations can be Adversarially Helpful (2024)
- Scenarios and Approaches for Situated Natural Language Explanations (2024)
Applications of AI Agents
We explore novel use cases of AI agents driven by strong, general-purpose foundation models, including but not limited to language models and vision-language models. We focus on problems with profound real-world impact, such as finance, sports, education, and document processing. We explore innovative architectures for these agents and develop benchmarks that rigorously evaluate their performance.
Related works include: