Dial-M: A Masking-based Framework for Dialogue Evaluation

Suvodip Dey and Maunendra Sankar Desarkar


24th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL 2023)

Nominated for Best Paper Award

Abstract

In dialogue systems, automatically evaluating machine-generated responses is critical and challenging. Despite the tremendous progress in dialogue generation research, evaluation still depends heavily on human judgment. Standard word-overlap-based metrics are ineffective for dialogues. As a result, most recently proposed metrics are model-based and reference-free, learning to score different aspects of a conversation. However, each aspect requires a separate model, which makes these metrics computationally expensive. To address this, we propose Dial-M, a Masking-based reference-free framework for Dialogue evaluation. The main idea is to mask the keywords of the current utterance and predict them given the dialogue history and various conditions (such as knowledge or persona), making the framework simple and easily extensible to multiple datasets. Despite its simplicity, Dial-M achieves performance comparable to state-of-the-art metrics on several dialogue evaluation datasets. We also discuss the interpretability of the proposed metric along with an error analysis.
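To make the core idea concrete, below is a minimal, hypothetical Python sketch of masking-based scoring. It assumes an off-the-shelf roberta-base masked language model from Hugging Face Transformers, single-token keywords, and average cross-entropy as the score; the paper's actual keyword extraction, fine-tuning setup, and score aggregation may differ, and the function name dial_m_style_score and its parameters are purely illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Illustrative sketch only: the real Dial-M training and keyword
# extraction are described in the paper and may differ from this.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

def dial_m_style_score(history, response, keywords):
    """Average cross-entropy of recovering masked keywords from context.

    Lower is better: a coherent response's keywords should be
    predictable from the dialogue history. Assumes each keyword
    maps to a single RoBERTa token.
    """
    losses = []
    for kw in keywords:
        # Mask one occurrence of the keyword in the response.
        masked = response.replace(kw, tokenizer.mask_token, 1)
        text = history + tokenizer.sep_token + masked
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()
        if mask_pos.numel() == 0:  # keyword absent or truncated away
            continue
        # Supervise only the masked position; -100 is ignored by the loss.
        labels = torch.full_like(inputs.input_ids, -100)
        kw_id = tokenizer(" " + kw, add_special_tokens=False).input_ids[0]
        labels[mask_pos[0, 0], mask_pos[0, 1]] = kw_id
        with torch.no_grad():
            losses.append(model(**inputs, labels=labels).loss.item())
    return sum(losses) / max(len(losses), 1)

Under these assumptions, a response whose keywords the model recovers easily from the history receives a low loss and is scored as more coherent, which also hints at why such a metric is interpretable: the per-keyword losses indicate exactly where a response deviates from its context.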

BibTeX

@inproceedings{dey-desarkar-2023-dial,
  title     = {Dial-{M}: A Masking-based Framework for Dialogue Evaluation},
  author    = {Dey, Suvodip  and
              Desarkar, Maunendra Sankar},
  editor    = {Stoyanchev, Svetlana  and
              Joty, Shafiq  and
              Schlangen, David  and
              Dusek, Ondrej  and
              Kennington, Casey  and
              Alikhani, Malihe},
  booktitle = {Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue},
  month     = sep,
  year      = {2023},
  address   = {Prague, Czechia},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2023.sigdial-1.7},
  doi       = {10.18653/v1/2023.sigdial-1.7},
  pages     = {77--84},
}