Research projects

We study the foundations and applications of approaches that make AI explainable and controllable.

Model interpretation

We try to understand the mechanisms of DNN models and reveal how their structures relate to their behavior. To this end, we develop methods to probe DNNs, both language models and beyond, at multiple levels of abstraction: layer-wise, module-wise, attention-wise, neuron-wise, and so on. How do the signals extracted at each of these levels correlate with the model's behavior? Do particular parts of the training data give rise to the behavior observed in these signals? Probing tries to answer questions of this type. We aim to probe DNN models with high validity and reliability, and to integrate the findings into the development of DNN models.
Related works include: C10, C9, C8, C5, C4
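As a minimal illustration of layer-wise probing (a toy sketch, not our actual pipeline): fit a linear probe on the hidden states of each layer of a pretrained Transformer and compare how well each layer's representations predict a property of interest. The model name, inputs, and labels below are placeholders.

    # Toy layer-wise probing sketch: one linear probe per layer.
    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    model_name = "bert-base-uncased"  # placeholder model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

    texts = ["the cat sat on the mat", "colorless green ideas sleep furiously"]  # toy inputs
    labels = [0, 1]                                                              # toy labels

    def layer_representation(text, layer):
        """Mean-pool the hidden states of one layer for a single input."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]  # (1, seq_len, dim)
        return hidden.mean(dim=1).squeeze(0).numpy()

    # Train a probe on each layer's representations and compare accuracies.
    for layer in range(model.config.num_hidden_layers + 1):
        feats = [layer_representation(t, layer) for t in texts]
        probe = LogisticRegression(max_iter=1000).fit(feats, labels)
        print(layer, probe.score(feats, labels))

How the probe accuracy changes across layers gives one coarse signal about where the property of interest is encoded in the model.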

Model intervention

One cannot claim a sufficient understanding of a DNN model without successful intervention. Intervention takes many forms. For example, prompt-based control steers the model's behavior, and model editing modifies part of the knowledge stored in its parameters. Compared with fine-tuning, these intervention approaches use a fraction of the computational resources and aim at targeted modifications of the model. We rigorously test the limits of existing intervention approaches and explore ways to improve them.
Related works include: C11, W7
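As a toy illustration of parameter-level editing (a sketch of the general idea, not any specific published method): apply a rank-one update to a single linear layer so that a chosen "key" activation maps to a new "value", leaving other directions approximately unchanged. All tensors below are synthetic.

    # Toy rank-one edit of one linear layer: W <- W + (v_new - W k) k^T / (k^T k).
    import torch

    torch.manual_seed(0)
    d = 16
    layer = torch.nn.Linear(d, d, bias=False)

    key = torch.randn(d)        # activation pattern we want to re-route
    new_value = torch.randn(d)  # output the layer should now produce for `key`

    with torch.no_grad():
        old_value = layer(key)
        delta = torch.outer(new_value - old_value, key) / key.dot(key)
        layer.weight += delta

    print(torch.allclose(layer(key), new_value, atol=1e-5))  # True: the edit took effect

In practice, editing methods differ mainly in how the key and value are chosen and in which layers the update is applied.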

Natural language explanation

Language is among the most flexible media humans use to communicate with one another, conveying information such as the rationales behind decisions and the trustworthiness of individuals. When presenting such information to users, language should be the go-to choice, but generating explanations that are sufficiently correct remains an open problem. We explore approaches to steer data-driven systems (including LLMs and beyond) to generate natural language explanations. We take inspiration from the psychology, philosophy, and linguistics literature, and study the factors that make explanations informative, reasonable, and applicable to a wide variety of audiences. We also study approaches that prevent the potential misuse of LLM-based explanation generators.
Related works include: T1, W6, I6

Safety in real-world scenarios

AI systems are deployed more widely than ever before, and the deployment environments are rarely identical to the training environments. When some environmental factors change, how will the models respond, and will they still work as expected? When anomalous events occur, what can we do to mitigate the risks? Research on the safety of AI models is closely integrated with our work on model interpretation, intervention, and natural language explanation. Along this route, we also explore topics with potential societal impact: Which jobs will be delegated to AIs? How can we prepare for a future in which humans collaborate closely with highly intelligent AIs?
Related works include: W4