Research projects

We study the foundations and applications of approaches that make AI explainable and controllable.

Model interpretability

We aim to understand the mechanisms of DNN models and reveal how their structures are associated with their behavior. We develop methods to probe DNNs, both language models and beyond, at multiple abstraction levels: layer-wise, module-wise, attention-wise, neuron-wise, and so on. How do the signals extracted at each of these levels correlate with the model's behavior? Do certain parts of the training data lead to the behavior observed in these signals? Probing seeks to answer questions of this type. In addition to developing “testing methods” that probe DNN models with high validity and reliability, we also build “testing materials” that evaluate the crucial capabilities of DNN models when they are deployed in high-stakes decision-making scenarios.
Related works include: A State-Vector Framework for Dataset Effects (2023), Predicting Fine-Tuning Performance with Probing (2023), On the Data Requirements of Probing (2022), How is BERT Surprised? (2021), An Information-Theoretic View on Selecting Linguistic Probes (2020)
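As a concrete illustration of layer-wise probing, the sketch below fits a logistic-regression probe on the hidden states of one BERT layer and reports held-out accuracy. The model, the layer index, and the toy sentences are placeholder choices for illustration, not our experimental setup.

    import torch
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
    model.eval()

    # Toy sentences and labels standing in for a real probing dataset.
    texts = ["the cat sat on the mat", "a dog chased the ball",
             "stocks fell sharply on Monday", "the market rallied after the report"]
    labels = [0, 0, 1, 1]
    layer = 8  # which layer's representations to probe

    with torch.no_grad():
        enc = tokenizer(texts, padding=True, return_tensors="pt")
        hidden = model(**enc).hidden_states[layer]          # (batch, seq, dim)
        mask = enc["attention_mask"].unsqueeze(-1)          # exclude padding when pooling
        feats = ((hidden * mask).sum(1) / mask.sum(1)).numpy()

    X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, test_size=0.5,
                                              stratify=labels, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"probe accuracy at layer {layer}: {probe.score(X_te, y_te):.2f}")

Comparing such probe accuracies across layers, datasets, or training checkpoints is one way to ask which abstraction level carries the signal behind a given behavior.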

Model intervention and control

We apply the findings from model interpretability to make models safer. These interventions operate at many levels: for example, prompt-based control steers the model's behavior, and model editing changes part of the knowledge stored in the parameters. Compared to fine-tuning, these intervention approaches use a fraction of the computational resources and aim at targeted modifications of the models. We rigorously test the limitations of existing model intervention approaches and improve their efficacy.
Related works include: What do the Circuits Mean? A Knowledge Edit View (2024), What does the Knowledge Neuron Thesis Have to do with Knowledge? (2024), Plug and Play with Prompts (2024)
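One lightweight form of intervention is activation steering: adding a direction, computed from the hidden states of two contrasting prompts, into a single GPT-2 layer at generation time. The sketch below is illustrative rather than the specific editing or prompting methods studied in the papers above; the prompts, layer index, and scale are placeholder choices.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()
    layer, scale = 6, 4.0  # which block to intervene on, and how strongly

    def last_hidden(prompt):
        """Hidden state of the final token at the chosen layer."""
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            hs = model(ids, output_hidden_states=True).hidden_states
        return hs[layer][0, -1]

    # Direction pointing from a negative framing toward a positive one.
    steer = last_hidden("I love this movie.") - last_hidden("I hate this movie.")

    def hook(module, inputs, output):
        # GPT-2 blocks return a tuple; output[0] is the residual-stream tensor.
        return (output[0] + scale * steer,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(hook)
    ids = tokenizer("The film was", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    handle.remove()
    print(tokenizer.decode(out[0], skip_special_tokens=True))

Because no parameters are updated and only one layer's activations are touched, such interventions are far cheaper than fine-tuning, which is exactly why their limitations deserve rigorous testing.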

Natural language explanation

Language is among the most flexible media humans use to communicate with each other, conveying information such as the rationales behind decisions and the trustworthiness of individuals. Language models demonstrate strong reasoning capabilities, but explanation is more challenging because it also involves the users. Currently, it is hard to generate explanations that are sufficiently faithful while remaining plausible. We explore approaches to steer data-driven systems, including LLMs, LM agents, and beyond, to generate explanations, both for the systems themselves and for other systems. We take inspiration from the literature in psychology, philosophy, and linguistics, and set up rigorous benchmarks that evaluate explanations in terms of informativeness, safety, and situatedness.
Related works include: LLM-Generated Black-box Explanations Can Be Adversarially Helpful (2024), ACCORD: Closing the Commonsense Measurability Gap (2024), Measuring Information in Text Explanations (2023), Situated Natural Language Explanations (2023)
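As a rough illustration of evaluating explanations, the sketch below scores an explanation by how much it reduces a language model's surprisal of the answer. This is a simplified, information-style proxy with placeholder prompts, not the metrics or benchmarks used in the papers above.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def surprisal(context, answer):
        """Total negative log-probability (in nats) of `answer` following `context`."""
        ctx = tokenizer(context, return_tensors="pt").input_ids
        ans = tokenizer(answer, return_tensors="pt").input_ids
        ids = torch.cat([ctx, ans], dim=1)
        with torch.no_grad():
            logprobs = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
        # Score only the answer tokens, each predicted from everything before it.
        positions = range(ctx.shape[1] - 1, ids.shape[1] - 1)
        return -sum(logprobs[p, ids[0, p + 1]].item() for p in positions)

    question = "Q: Can penguins fly? A:"
    explained = ("Q: Can penguins fly? Explanation: penguin wings evolved into "
                 "flippers for swimming, not flight. A:")
    answer = " No"

    gain = surprisal(question, answer) - surprisal(explained, answer)
    print(f"surprisal reduction from the explanation: {gain:.2f} nats")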