arXiv:2301.12004 [pdf, other]

Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation

Authors: Jessica Huynh, Cathy Jiao, Prakhar Gupta, Shikib Mehri, Payal Bajaj, Vishrav Chaudhary, Maxine Eskenazi

Abstract: Language models have steadily increased in size over the past few years. They achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. Due to this, there are other downstream tasks in the realm of dialog that can now harness the LLMs' language understanding capabilities. Dialog evaluation is one task that this paper will explore. It concentrates on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. The paper shows that the choice of datasets used for training a model contributes to how well it performs on a task as well as on how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets that a model is trained on, the better dialog evaluation performs. This paper also investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.

Submitted 27 January, 2023; originally announced January 2023.

Comments: Accepted for publication at IWSDS 2023

arXiv:2208.10918 [pdf, other]

The DialPort tools

Authors: Jessica Huynh, Shikib Mehri, Cathy Jiao, Maxine Eskenazi

Abstract: The DialPort project http://dialport.org/, funded by the National Science Foundation (NSF), covers a group of tools and services that aim at fulfilling the needs of the dialog research community. Over the course of six years, several offerings have been created, including the DialPort Portal and DialCrowd. This paper describes these contributions, which will be demoed at SIGDIAL, including implementation, prior studies, corresponding discoveries, and the locations at which the tools will remain freely available to the community going forward.

Submitted 18 August, 2022; originally announced August 2022.

Comments: Accepted to SIGDIAL 2022

arXiv:2207.14403 [pdf, other]

Interactive Evaluation of Dialog Track at DSTC9

Authors: Shikib Mehri, Yulan Feng, Carla Gordon, Seyed Hossein Alavi, David Traum, Maxine Eskenazi

Abstract: The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenges participants to develop strong response generation models and explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how to best evaluate open-domain dialog models.

Submitted 28 July, 2022; originally announced July 2022.

Comments: Presented at LREC 2022 and DSTC9 Workshop at AAAI 2021

arXiv:2207.14393 [pdf, other]

LAD: Language Models as Data for Zero-Shot Dialog

Authors: Shikib Mehri, Yasemin Altun, Maxine Eskenazi

Abstract: To facilitate zero-shot generalization in task-oriented dialog, this paper proposes Language Models as Data (LAD). LAD is a paradigm for creating diverse and accurate synthetic data which conveys the necessary structural constraints and can be used to train a downstream neural dialog model. LAD leverages GPT-3 to induce linguistic diversity. LAD achieves significant performance gains in zero-shot settings on intent prediction (+15%), slot filling (+31.4 F1) and next action prediction (+11 F1). Furthermore, an interactive human evaluation shows that training with LAD is competitive with training on human dialogs. LAD is open-sourced, with the code and data available at https://github.com/Shikib/lad.

Submitted 28 July, 2022; originally announced July 2022.

Comments: Accepted as a long paper to SIGDial 2022

arXiv:2207.12551 [pdf, other]

DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit

Authors: Jessica Huynh, Ting-Rui Chiang, Jeffrey Bigham, Maxine Eskenazi

Abstract: Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.

Submitted 25 July, 2022; originally announced July 2022.

Comments: Published at LREC 2022

arXiv:2205.12673 [pdf, other]

InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning

Authors: Prakhar Gupta, Cathy Jiao, Yi-Ting Yeh, Shikib Mehri, Maxine Eskenazi, Jeffrey P. Bigham

Abstract: Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perform multiple kinds of tasks related to language (e.g., natural language understanding and generation, domain-specific interaction), yet instruction tuning has not been systematically explored for dialogue-related tasks. We introduce InstructDial, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets. Next, we explore cross-task generalization ability on models tuned on InstructDial across diverse dialogue tasks. Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting. To ensure that models adhere to instructions, we introduce novel meta-tasks. We establish benchmark zero-shot and few-shot performance of models trained using the proposed framework on multiple dialogue tasks.

Submitted 26 October, 2022; v1 submitted 25 May, 2022; originally announced May 2022.

Comments: EMNLP 2022

arXiv:2203.10012 [pdf, ps, other]

Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges

Authors: Shikib Mehri, Jinho Choi, Luis Fernando D'Haro, Jan Deriu, Maxine Eskenazi, Milica Gasic, Kallirroi Georgila, Dilek Hakkani-Tur, Zekang Li, Verena Rieser, Samira Shaikh, David Traum, Yi-Ting Yeh, Zhou Yu, Yizhe Zhang, Chen Zhang

Abstract: This is a report on the NSF Future Directions Workshop on Automatic Evaluation of Dialog. The workshop explored the current state of the art along with its limitations and suggested promising directions for future work in this important and very rapidly changing area of research.

Submitted 18 March, 2022; originally announced March 2022.

Comments: Report from the NSF AED Workshop (http://dialrc.org/AED/)

arXiv:2111.05241 [pdf, other]

A Survey of NLP-Related Crowdsourcing HITs: what works and what does not

Authors: Jessica Huynh, Jeffrey Bigham, Maxine Eskenazi

Abstract: Crowdsourcing requesters on Amazon Mechanical Turk (AMT) have raised questions about the reliability of the workers. The AMT workforce is very diverse and it is not possible to make blanket assumptions about them as a group. Some requesters now reject work en masse when they do not get the results they expect. This has the effect of giving each worker (good or bad) a lower Human Intelligence Task (HIT) approval score, which is unfair to the good workers. It also has the effect of giving the requester a bad reputation on the workers' forums. Some of the issues causing the mass rejections stem from the requesters not taking the time to create a well-formed task with complete instructions and/or not paying a fair wage. To explore this assumption, this paper describes a study that looks at the crowdsourcing HITs on AMT that were available over a given span of time and records information about those HITs. This study also records information from a crowdsourcing forum on the worker perspective on both those HITs and on their corresponding requesters. Results reveal issues in worker payment and presentation issues such as missing instructions or HITs that are not doable.

Submitted 9 November, 2021; originally announced November 2021.

arXiv:2106.07056 [pdf, other]

Schema-Guided Paradigm for Zero-Shot Dialog

Authors: Shikib Mehri, Maxine Eskenazi

Abstract: Developing mechanisms that flexibly adapt dialog systems to unseen tasks and domains is a major challenge in dialog research. Neural models implicitly memorize task-specific dialog policies from the training data. We posit that this implicit memorization has precluded zero-shot transfer learning. To this end, we leverage the schema-guided paradigm, wherein the task-specific dialog policy is explicitly provided to the model. We introduce the Schema Attention Model (SAM) and improved schema representations for the STAR corpus. SAM obtains significant improvement in zero-shot settings, with a +22 F1 score improvement over prior work. These results validate the feasibility of zero-shot generalizability in dialog. Ablation experiments are also presented to demonstrate the efficacy of SAM.

Submitted 13 June, 2021; originally announced June 2021.

Comments: Accepted at SIGDial 2021

arXiv:2106.07055 [pdf, other]

GenSF: Simultaneous Adaptation of Generative Pre-trained Models and Slot Filling

Authors: Shikib Mehri, Maxine Eskenazi

Abstract: In transfer learning, it is imperative to achieve strong alignment between a pre-trained model and a downstream task. Prior work has done this by proposing task-specific pre-training objectives, which sacrifices the inherent scalability of the transfer learning paradigm. We instead achieve strong alignment by simultaneously modifying both the pre-trained model and the formulation of the downstream task, which is more efficient and preserves the scalability of transfer learning. We present GenSF (Generative Slot Filling), which leverages a generative pre-trained open-domain dialog model for slot filling. GenSF (1) adapts the pre-trained model by incorporating inductive biases about the task and (2) adapts the downstream task by reformulating slot filling to better leverage the pre-trained model's capabilities. GenSF achieves state-of-the-art results on two slot filling datasets with strong gains in few-shot and zero-shot settings. We achieve a 9 F1 score improvement in zero-shot slot filling. This highlights the value of strong alignment between the pre-trained model and the downstream task.

Submitted 13 June, 2021; originally announced June 2021.

Comments: Accepted at SIGDial 2021

arXiv:2106.03706 [pdf, other]

A Comprehensive Assessment of Dialog Evaluation Metrics

Authors: Yi-Ting Yeh, Maxine Eskenazi, Shikib Mehri

Abstract: Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engaging), (4) for different types of response generation models (i.e., generative, retrieval, simple models and state-of-the-art models), (5) taking into account the similarity of different metrics and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.

Submitted 7 July, 2021; v1 submitted 7 June, 2021; originally announced June 2021.

arXiv:2011.06486 [pdf, ps, other]

Overview of the Ninth Dialog System Technology Challenge: DSTC9

Authors: Chulaka Gunasekara, Seokhwan Kim, Luis Fernando D'Haro, Abhinav Rastogi, Yun-Nung Chen, Mihail Eric, Behnam Hedayatnia, Karthik Gopalakrishnan, Yang Liu, Chao-Wei Huang, Dilek Hakkani-Tür, Jinchao Li, Qi Zhu, Lingxiao Luo, Lars Liden, Kaili Huang, Shahin Shayandeh, Runze Liang, Baolin Peng, Zheng Zhang, Swadheen Shukla, Minlie Huang, Jianfeng Gao, Shikib Mehri, Yulan Feng, et al. (14 additional authors not shown)

Abstract: This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with unstructured knowledge access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog, and 4. Situated interactive multi-modal dialog. This paper describes the task definition, provided datasets, baselines and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.

Submitted 12 November, 2020; originally announced November 2020.

arXiv:2006.12719 [pdf, ps, other]

Unsupervised Evaluation of Interactive Dialog with DialoGPT

Authors: Shikib Mehri, Maxine Eskenazi

Abstract: It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.

Submitted 22 June, 2020; originally announced June 2020.

Comments: Published at SIGdial 2020

arXiv:2006.06026 [pdf, other]

Report from the NSF Future Directions Workshop, Toward User-Oriented Agents: Research Directions and Challenges

Authors: Maxine Eskenazi, Tiancheng Zhao

Abstract: This USER Workshop was convened with the goal of defining future research directions for the burgeoning intelligent agent research community and to communicate them to the National Science Foundation. It took place in Pittsburgh, Pennsylvania on October 24 and 25, 2019 and was sponsored by National Science Foundation Grant Number IIS-1934222. Any opinions, findings and conclusions or future directions expressed in this document are those of the authors and do not necessarily reflect the views of the National Science Foundation. The 27 participants presented their individual research interests and their personal research goals. In the breakout sessions that followed, the participants defined the main research areas within the domain of intelligent agents and they discussed the major future directions that the research in each area of this domain should take.

Submitted 10 June, 2020; originally announced June 2020.

Comments: Final report of the NSF Future Directions Workshop, Toward User-Oriented Agents: Research Directions and Challenges

arXiv:2005.00456 [pdf, other]

USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation

Authors: Shikib Mehri, Maxine Eskenazi

Abstract: The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.

Submitted 1 May, 2020; originally announced May 2020.

Comments: Accepted to ACL 2020 as long paper

arXiv:2004.01926 [pdf, other]

"None of the Above": Measure Uncertainty in Dialog Response Retrieval

Authors: Yulan Feng, Shikib Mehri, Maxine Eskenazi, Tiancheng Zhao

Abstract: This paper discusses the importance of uncovering uncertainty in end-to-end dialog tasks, and presents our experimental results on uncertainty classification on the Ubuntu Dialog Corpus. We show that, instead of retraining models for this specific purpose, the original retrieval model's underlying confidence concerning the best prediction can be captured with trivial additional computation.

Submitted 14 May, 2020; v1 submitted 4 April, 2020; originally announced April 2020.

Comments: Accepted to ACL 2020 as short paper

arXiv:1909.01322 [pdf, other]

CMU GetGoing: An Understandable and Memorable Dialog System for Seniors

Authors: Shikib Mehri, Alan W Black, Maxine Eskenazi

Abstract: Voice-based technologies are typically developed for the average user, and thus generally not tailored to the specific needs of any subgroup of the population, like seniors. This paper presents CMU GetGoing, an accessible trip planning dialog system designed for senior users. The GetGoing system design is described in detail, with particular attention to the senior-tailored features. A user study is presented, demonstrating that the senior-tailored features significantly improve comprehension and retention of information.

Submitted 3 September, 2019; originally announced September 2019.

Comments: Accepted to the Dialog for Good (DiGo) workshop (http://dialogforgood.org) at SIGDial 2019

arXiv:1908.09890 [pdf, ps, other]

Multi-Granularity Representations of Dialog

Authors: Shikib Mehri, Maxine Eskenazi

Abstract: Neural models of dialog rely on generalized latent representations of language. This paper introduces a novel training procedure which explicitly learns multiple representations of language at several levels of granularity. The multi-granularity training algorithm modifies the mechanism by which negative candidate responses are sampled in order to control the granularity of learned latent representations. Strong performance gains are observed on the next utterance retrieval task using both the MultiWOZ dataset and the Ubuntu dialog corpus. Analysis significantly demonstrates that multiple granularities of representation are being learned, and that multi-granularity training facilitates better transfer to downstream tasks.

Submitted 26 August, 2019; originally announced August 2019.

Comments: Accepted as a long paper at EMNLP 2019

arXiv:1907.10568 [pdf, other]

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

Authors: Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, Jeffrey P. Bigham

Abstract: The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments show that the use of multiple references results in improved correlation between several automatic metrics and human judgement for both the quality and the diversity of system output.

Submitted 8 September, 2019; v1 submitted 24 July, 2019; originally announced July 2019.

Comments: SIGDIAL 2019

arXiv:1907.10016 [pdf, other]

Structured Fusion Networks for Dialog

Authors: Shikib Mehri, Tejas Srinivasan, Maxine Eskenazi

Abstract: Neural dialog models have exhibited strong performance, however their end-to-end nature lacks a representation of the explicit structure of dialog. This results in a loss of generalizability, controllability and a data-hungry nature. Conversely, more traditional dialog systems do have strong models of explicit structure. This paper introduces several approaches for explicitly incorporating structure into neural models of dialog. Structured Fusion Networks first learn neural dialog modules corresponding to the structured components of traditional dialog systems and then incorporate these modules in a higher-level generative model. Structured Fusion Networks obtain strong results on the MultiWOZ dataset, both with and without reinforcement learning. Structured Fusion Networks are shown to have several valuable properties, including better domain generalizability, improved performance in reduced data scenarios and robustness to divergence during reinforcement learning.

Submitted 23 July, 2019; originally announced July 2019.

Comments: Accepted to SIGDial 2019

arXiv:1906.00414 [pdf, other]

Pretraining Methods for Dialog Context Representation Learning

Authors: Shikib Mehri, Evgeniia Razumovskaia, Tiancheng Zhao, Maxine Eskenazi

Abstract: This paper examines various unsupervised pretraining objectives for learning dialog context representations. Two novel methods of pretraining dialog context encoders are proposed, and a total of four methods are examined. Each pretraining objective is fine-tuned and evaluated on a set of downstream dialog tasks using the MultiWoz dataset and strong performance improvement is observed. Further evaluation shows that our pretraining objectives result in not only better performance, but also better convergence, models that are less data hungry and have better domain generalizability.

Submitted 3 June, 2019; v1 submitted 2 June, 2019; originally announced June 2019.

Comments: Accepted to ACL 2019

arXiv:1902.08858 [pdf, other]

Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models

Authors: Tiancheng Zhao, Kaige Xie, Maxine Eskenazi

Abstract: Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge. Common practice has been to use handcrafted dialog acts, or the output vocabulary, e.g. in neural encoder decoders, as the action spaces. Both have their own limitations. This paper proposes a novel latent action framework that treats the action spaces of an end-to-end dialog agent as latent variables and develops unsupervised methods in order to induce its own action space from the data. Comprehensive experiments are conducted examining both continuous and discrete action types and two different optimization methods based on stochastic variational inference. Results show that the proposed latent actions achieve superior empirical performance improvement over previous word-level policy gradient methods on both DealOrNoDeal and MultiWoz dialogs. Our detailed analysis also provides insights about various latent variable approaches for policy learning and can serve as a foundation for developing better latent actions in future research.

Submitted 15 April, 2019; v1 submitted 23 February, 2019; originally announced February 2019.

Comments: Camera ready version for NAACL 2019 long paper

arXiv:1901.06613 [pdf, other]

Beyond Turing: Intelligent Agents Centered on the User

Authors: Maxine Eskenazi, Shikib Mehri, Evgeniia Razumovskaia, Tiancheng Zhao

Abstract: Most research on intelligent agents centers on the agent and not on the user. We look at the origins of agent-centric research for slot-filling, gaming and chatbot agents. We then argue that it is important to concentrate more on the user. After reviewing relevant literature, some approaches for creating and assessing user-centric systems are proposed.

Submitted 18 March, 2019; v1 submitted 19 January, 2019; originally announced January 2019.

Comments: 13 pages

arXiv:1811.11430 [pdf, other]

Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems

Authors: Junki Ohmura, Maxine Eskenazi

Abstract: Dialog response ranking is used to rank response candidates by considering their relation to the dialog history. Although researchers have addressed this concept for open-domain dialogs, little attention has been focused on task-oriented dialogs. Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human-computer dialogs with speech recognition errors. In this paper, we propose a context-aware dialog response re-ranking system. Our system reranks responses in two steps: (1) it calculates matching scores for each candidate response and the current dialog context; (2) it combines the matching scores and a probability distribution of the candidates from an existing dialog system for response re-ranking. By using neural word embedding-based models and handcrafted or logistic regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system on real dialogs with speech recognition errors.

Submitted 28 November, 2018; originally announced November 2018.

Comments: Accepted in IEEE SLT 2018. 8 pages, 3 figures

arXiv:1810.10565 [pdf, other]

doi 10.21437/Interspeech.2018-2011

Multimodal Polynomial Fusion for Detecting Driver Distraction

Authors: Yulun Du, Chirag Raman, Alan W Black, Louis-Philippe Morency, Maxine Eskenazi

Abstract: Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone. Although there has been a considerable amount of research on modeling the distracted behavior of drivers under various conditions, accurate automatic detection using multiple modalities and especially the contribution of using the speech modality to improve accuracy has received little attention. This paper introduces a new multimodal dataset for distracted driving behavior and discusses automatic distraction detection using features from three modalities: facial expression, speech and car signals. Detailed multimodal feature analysis shows that adding more modalities monotonically increases the predictive accuracy of the model. Finally, a simple and effective multimodal fusion technique using a polynomial fusion layer shows superior distraction detection results compared to the baseline SVM and neural network models.

Submitted 24 October, 2018; originally announced October 2018.

Comments: INTERSPEECH 2018

arXiv:1805.04803 [pdf, other]

Zero-Shot Dialog Generation with Cross-Domain Latent Actions

Authors: Tiancheng Zhao, Maxine Eskenazi

Abstract: This paper introduces zero-shot dialog generation (ZSDG), as a step towards neural dialog systems that can instantly generalize to new situations with minimal data. ZSDG enables an end-to-end generative dialog system to generalize to a new domain for which only a domain description is provided and no training dialogs are available. Then a novel learning framework, Action Matching, is proposed. This algorithm can learn a cross-domain embedding space that models the semantics of dialog responses which, in turn, lets a neural dialog generation model generalize to new domains. We evaluate our methods on a new synthetic dialog dataset, and an existing human-human dialog dataset. Results show that our method has superior performance in learning dialog models that rapidly adapt their behavior to new domains and suggests promising future research.

Submitted 12 May, 2018; originally announced May 2018.

Comments: Accepted as a long paper in SIGDIAL 2018

arXiv:1804.08069 [pdf, other]

Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation

Authors: Tiancheng Zhao, Kyusong Lee, Maxine Eskenazi

Abstract: The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as in traditional systems, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog models for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST that improve VAEs and can discover interpretable semantics via either auto encoding or context predicting. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation.

Submitted 22 April, 2018; originally announced April 2018.

Comments: Accepted as a long paper in ACL 2018

arXiv:1707.05254 [pdf, other]

Explainable Entity-based Recommendations with Knowledge Graphs

Authors: Rose Catherine, Kathryn Mazaitis, Maxine Eskenazi, William Cohen

Abstract: Explainable recommendation is an important task. Many methods have been proposed which generate explanations from the content and reviews written for items. When review text is unavailable, generating explanations is still a hard problem. In this paper, we illustrate how explanations can be generated in such a scenario by leveraging external knowledge in the form of knowledge graphs. Our method jointly ranks items and knowledge graph entities using a Personalized PageRank procedure to produce recommendations together with their explanations.

Submitted 12 July, 2017; originally announced July 2017.

Comments: Accepted for publication in the 11th ACM Conference on Recommender Systems (RecSys 2017) - Posters

arXiv:1706.08476 [pdf, other]

Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability

Authors: Tiancheng Zhao, Allen Lu, Kyusong Lee, Maxine Eskenazi

Abstract: Generative encoder-decoder models offer great promise in developing domain-general dialog systems. However, they have mainly been applied to open-domain conversations. This paper presents a practical and novel framework for building task-oriented dialog systems based on encoder-decoder models. This framework enables encoder-decoder models to accomplish slot-value independent decision-making and interact with external databases. Moreover, this paper shows the flexibility of the proposed method by interleaving chatting capability with a slot-filling system for better out-of-domain recovery. The models were trained on both real-user data from a bus information system and human-human chat data. Results show that the proposed framework achieves good performance in both offline evaluation metrics and in task success rate with human users.

Submitted 26 June, 2017; originally announced June 2017.

Comments: Accepted as a long paper in SIGDIAL 2017

arXiv:1703.10960 [pdf, other]

Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders

Authors: Tiancheng Zhao, Ran Zhao, Maxine Eskenazi

Abstract: While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at word-level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures the discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.

Submitted 21 October, 2017; v1 submitted 31 March, 2017; originally announced March 2017.

Comments: Appeared in the ACL 2017 proceedings as a long paper. Corrects a calculation mistake in Table 1 (E-bow and A-bow), which results in higher scores

arXiv:1606.08425 [pdf, other]

Predicting the Relative Difficulty of Single Sentences With and Without Surrounding Context

Authors: Elliot Schumacher, Maxine Eskenazi, Gwen Frishkoff, Kevyn Collins-Thompson

Abstract: The problem of accurately predicting relative reading difficulty across a set of sentences arises in a number of important natural language applications, such as finding and curating effective usage examples for intelligent language tutoring systems. Yet while significant research has explored document- and passage-level reading difficulty, the special challenges involved in assessing aspects of readability for single sentences have received much less attention, particularly when considering the role of surrounding passages. We introduce and evaluate a novel approach for estimating the relative reading difficulty of a set of sentences, with and without surrounding context. Using different sets of lexical and grammatical features, we explore models for predicting pairwise relative difficulty using logistic regression, and examine rankings generated by aggregating pairwise difficulty labels using a Bayesian rating system to form a final ranking. We also compare rankings derived for sentences assessed with and without context, and find that contextual features can help predict differences in relative difficulty judgments across these two conditions.

Submitted 24 October, 2016; v1 submitted 27 June, 2016; originally announced June 2016.

Comments: EMNLP 2016 Long Paper

arXiv:1606.04081 [pdf, other]

Graph-Community Detection for Cross-Document Topic Segment Relationship Identification

Authors: Pedro Mota, Maxine Eskenazi, Luisa Coheur

Abstract: In this paper we propose a graph-community detection approach to identify cross-document relationships at the topic segment level. Given a set of related documents, we automatically find these relationships by clustering segments with similar content (topics). In this context, we study how different weighting mechanisms influence the discovery of word communities that relate to the different topics found in the documents. Finally, we test different mapping functions to assign topic segments to word communities, determining which topic segments are considered equivalent. By performing this task it is possible to enable efficient multi-document browsing, since when a user finds relevant content in one document we can provide access to similar topics in other documents. We deploy our approach in two different scenarios. One is an educational scenario where equivalence relationships between learning materials need to be found. The other consists of a series of dialogs in a social context where students discuss commonplace topics. Results show that our proposed approach better discovered equivalence relationships in learning material documents and obtained close results in the social speech domain, where the best performing approach was a clustering technique.

Submitted 13 June, 2016; originally announced June 2016.

arXiv:1606.02562 [pdf, other]

DialPort: Connecting the Spoken Dialog Research Community to Real User Data

Authors: Tiancheng Zhao, Kyusong Lee, Maxine Eskenazi

Abstract: This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog academic research community and gives them access to real users. We introduce a distributed, multi-modal, multi-agent prototype dialog framework that affords easy integration with various remote resources, ranging from end-to-end dialog systems to external knowledge APIs. To date, the DialPort portal has successfully connected to the multi-domain spoken dialog system at Cambridge University, the NOAA (National Oceanic and Atmospheric Administration) weather API and the Yelp API.

Submitted 8 June, 2016; originally announced June 2016.

Comments: Under Peer Review of SigDial 2016

arXiv:1606.02560 [pdf, other]

Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning

Authors: Tiancheng Zhao, Maxine Eskenazi

Abstract: This paper presents an end-to-end framework for task-oriented dialog systems using a variant of Deep Recurrent Q-Networks (DRQN). The model is able to interface with a relational database and jointly learn policies for both language understanding and dialog strategy. Moreover, we propose a hybrid algorithm that combines the strength of reinforcement learning and supervised learning to achieve faster learning speed. We evaluated the proposed model on a 20 Question Game conversational game simulator. Results show that the proposed method outperforms the modular-based baseline and learns a distributed representation of the latent dialog state.

Submitted 15 September, 2016; v1 submitted 8 June, 2016; originally announced June 2016.

Comments: In proceedings of SIGDIAL 2016. Added changes based on peer review, including: 1. added references, 2. fixed typos in text and figures, 3. added minor change to introduction

arXiv:1604.03511 [pdf]

doi 10.1016/j.asr.2016.05.021

Directed Energy Missions for Planetary Defense

Authors: Philip Lubin, Gary B. Hughes, Mike Eskenazi, Kelly Kosmo, Isabella E. Johansson, Janelle Griswold, Mark Pryor, Hugh O'Neill, Peter Meinhold, Jonathon Suen, Jordan Riley, Qicheng Zhang, Kevin Walsh, Carl Melis, Miikka Kangas, Caio Motta, Travis Brashears

Abstract: Directed energy for planetary defense is now a viable option and is superior in many ways to other proposed technologies, being able to defend the Earth against all known threats. This paper presents basic ideas behind a directed energy planetary defense system that utilizes laser ablation of an asteroid to impart a deflecting force on the target. A conceptual philosophy called DE-STAR, which stands for Directed Energy System for Targeting of Asteroids and exploRation, is an orbiting stand-off system, which has been described in other papers. This paper describes a smaller, stand-on system known as DE-STARLITE as a reduced-scale version of DE-STAR. Both share the same basic heritage of a directed energy array that heats the surface of the target to the point of high surface vapor pressure that causes significant mass ejection thus forming an ejection plume of material from the target that acts as a rocket to deflect the object. This is generally classified as laser ablation. DE-STARLITE uses conventional propellant for launch to LEO and then ion engines to propel the spacecraft from LEO to the near-Earth asteroid (NEA). During laser ablation, the asteroid itself provides the propellant source material; thus a very modest spacecraft can deflect an asteroid much larger than would be possible with a system of similar mission mass using ion beam deflection (IBD) or a gravity tractor. DE-STARLITE is capable of deflecting an Apophis-class (325 m diameter) asteroid with a 1- to 15-year targeting time (laser on time) depending on the system design. The mission fits within the rough mission parameters of the Asteroid Redirect Mission (ARM) program in terms of mass and size. DE-STARLITE also has much greater capability for planetary defense than current proposals and is readily scalable to match the threat. It can deflect all known threats with sufficient warning.

Submitted 20 April, 2016; v1 submitted 12 April, 2016; originally announced April 2016.

Comments: 33 pages, 17 figures. Submitted to ASR

arXiv:1603.05739 [pdf]

A Readability Analysis of Campaign Speeches from the 2016 US Presidential Campaign

Authors: Elliot Schumacher, Maxine Eskenazi

Abstract: Readability is defined as the reading level of the speech from grade 1 to grade 12. It results from the use of the REAP readability analysis (vocabulary - Collins-Thompson and Callan, 2004; syntax - Heilman et al., 2006, 2007), which uses the lexical contents and grammatical structure of the sentences in a document to predict the reading level. After analysis, results were grouped into the average readability of each candidate, the evolution of the candidate's speeches' readability over time and the standard deviation, or how much each candidate varied their speech from one venue to another. For comparison, one speech from four past presidents and the Gettysburg Address were also analyzed.

Submitted 17 March, 2016; originally announced March 2016.

Report number: CMU-LTI-16-001

Showing 1–36 of 36 results for author: Eskenazi, M