"Motivations for methods in explainable artificial intelligence (XAI) often include detecting, quantifying and mitigating bias, and contributing to making machine learning models fairer. However, exactly how an XAI method can help in combating biases is often left unspecified. In this paper, we briefly review trends in explainability and fairness in NLP research, identify the current
practices in which explainability methods are applied to detect and mitigate bias, and investigate the barriers preventing XAI methods from being used more widely in tackling fairness issues."
A very relevant paper, as large language models are raising growing AI-safety concerns.
As the paper describes, explainability methods have so far been applied to ML fairness only in narrow settings: feature-based bias understanding and hate speech detection. Although mitigating bias is one of the stated motivations for model transparency, NLP fairness and interpretability research still struggle to find common ground.
NLP fairness work mostly targets outcome fairness, i.e., predictions that are invariant across demographic groups, whereas most XAI methods produce local, instance-level explanations. XAI is instead well suited to probing procedural fairness: whether the model's reasoning is biased across groups. We still struggle to generalize local explanations into global claims, to identify biases without human supervision, and to quantify how biases change.
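To make the outcome-fairness notion concrete, here is a minimal sketch (not from the paper) of a counterfactual invariance check: swap the identity term in a template and see whether a classifier's score moves. The identity list and `toy_score` are illustrative stand-ins for a real hate-speech model.

```python
# Hedged sketch of a counterfactual outcome-fairness check.
# `toy_score` is a hypothetical classifier, deliberately biased so the check fires.

IDENTITY_TERMS = ["women", "men", "muslims", "christians"]  # illustrative only

def toy_score(text: str) -> float:
    """Stand-in classifier: flags texts containing 'hate'."""
    t = text.lower()
    base = 0.6 if "hate" in t else 0.1
    # A deliberately biased bump for one group, so the gap below is nonzero.
    return min(1.0, base + (0.3 if "muslims" in t else 0.0))

def counterfactual_gap(template: str) -> float:
    """Max score difference when the identity slot is swapped across groups."""
    scores = [toy_score(template.format(group=g)) for g in IDENTITY_TERMS]
    return max(scores) - min(scores)

if __name__ == "__main__":
    gap = counterfactual_gap("some people just hate {group}")
    print(f"counterfactual score gap: {gap:.2f}")  # a large gap signals outcome unfairness
```

Note this only tests outcomes; it says nothing about whether the model's internal reasoning differs across groups, which is exactly the gap XAI is supposed to fill.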
The issue of "fairwashing" is also becoming increasingly concerning, since we have no guarantee that current explanation methods actually reflect the inner workings of the model.
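One partial safeguard is a faithfulness sanity check in the spirit of the model-randomization test of Adebayo et al. (2018): if an explanation barely changes when the model's parameters are randomized, it cannot be tracking the model's reasoning. Below is a hedged sketch on a toy linear scorer; all names and the setup are illustrative assumptions, not the paper's method.

```python
# Sanity-check sketch: compare gradient-x-input saliency under trained vs.
# randomized weights for a linear scorer f(x) = w @ x. A high rank correlation
# would mean the explanation is insensitive to the model -- a red flag.

import numpy as np

rng = np.random.default_rng(0)

def saliency(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    # Gradient-x-input attribution; for a linear model the gradient is just w.
    return w * x

def rank_corr(a: np.ndarray, b: np.ndarray) -> float:
    # Spearman correlation computed via ranks.
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return float(np.corrcoef(ra, rb)[0, 1])

x = rng.normal(size=50)          # one input
w_trained = rng.normal(size=50)  # stand-in for learned weights
w_random = rng.normal(size=50)   # same model with randomized parameters

corr = rank_corr(np.abs(saliency(w_trained, x)), np.abs(saliency(w_random, x)))
print(f"saliency rank correlation after randomization: {corr:.2f}")
# Near-zero correlation is the expected, healthy outcome here.
```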
At the end of the day, more representative and less biased datasets remain the key to AI fairness.