-
Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation
Authors:
Jessica Huynh,
Cathy Jiao,
Prakhar Gupta,
Shikib Mehri,
Payal Bajaj,
Vishrav Chaudhary,
Maxine Eskenazi
Abstract:
Language models have steadily increased in size over the past few years. They achieve a high level of performance on various natural language processing (NLP) tasks such as question answering and summarization. Large language models (LLMs) have been used for generation and can now output human-like text. Due to this, there are other downstream tasks in the realm of dialog that can now harness the LLMs' language understanding capabilities. Dialog evaluation is one task that this paper will explore. It concentrates on prompting with LLMs: BLOOM, OPT, GPT-3, Flan-T5, InstructDial and TNLGv2. The paper shows that the choice of datasets used for training a model contributes to how well it performs on a task as well as on how the prompt should be structured. Specifically, the more diverse and relevant the group of datasets that a model is trained on, the better dialog evaluation performs. This paper also investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
Submitted 27 January, 2023;
originally announced January 2023.
-
The DialPort tools
Authors:
Jessica Huynh,
Shikib Mehri,
Cathy Jiao,
Maxine Eskenazi
Abstract:
The DialPort project (http://dialport.org/), funded by the National Science Foundation (NSF), covers a group of tools and services that aim at fulfilling the needs of the dialog research community. Over the course of six years, several offerings have been created, including the DialPort Portal and DialCrowd. This paper describes these contributions, which will be demoed at SIGDIAL, including implementation, prior studies, corresponding discoveries, and the locations at which the tools will remain freely available to the community going forward.
Submitted 18 August, 2022;
originally announced August 2022.
-
Interactive Evaluation of Dialog Track at DSTC9
Authors:
Shikib Mehri,
Yulan Feng,
Carla Gordon,
Seyed Hossein Alavi,
David Traum,
Maxine Eskenazi
Abstract:
The ultimate goal of dialog research is to develop systems that can be effectively used in interactive settings by real users. To this end, we introduced the Interactive Evaluation of Dialog Track at the 9th Dialog System Technology Challenge. This track consisted of two sub-tasks. The first sub-task involved building knowledge-grounded response generation models. The second sub-task aimed to extend dialog models beyond static datasets by assessing them in an interactive setting with real users. Our track challenges participants to develop strong response generation models and explore strategies that extend them to back-and-forth interactions with real users. The progression from static corpora to interactive evaluation introduces unique challenges and facilitates a more thorough assessment of open-domain dialog systems. This paper provides an overview of the track, including the methodology and results. Furthermore, it provides insights into how to best evaluate open-domain dialog models.
Submitted 28 July, 2022;
originally announced July 2022.
-
LAD: Language Models as Data for Zero-Shot Dialog
Authors:
Shikib Mehri,
Yasemin Altun,
Maxine Eskenazi
Abstract:
To facilitate zero-shot generalization in task-oriented dialog, this paper proposes Language Models as Data (LAD). LAD is a paradigm for creating diverse and accurate synthetic data which conveys the necessary structural constraints and can be used to train a downstream neural dialog model. LAD leverages GPT-3 to induce linguistic diversity. LAD achieves significant performance gains in zero-shot settings on intent prediction (+15%), slot filling (+31.4 F1) and next action prediction (+11 F1). Furthermore, an interactive human evaluation shows that training with LAD is competitive with training on human dialogs. LAD is open-sourced, with the code and data available at https://github.com/Shikib/lad.
Submitted 28 July, 2022;
originally announced July 2022.
-
DialCrowd 2.0: A Quality-Focused Dialog System Crowdsourcing Toolkit
Authors:
Jessica Huynh,
Ting-Rui Chiang,
Jeffrey Bigham,
Maxine Eskenazi
Abstract:
Dialog system developers need high-quality data to train, fine-tune and assess their systems. They often use crowdsourcing for this since it provides large quantities of data from many workers. However, the data may not be of sufficiently good quality. This can be due to the way that the requester presents a task and how they interact with the workers. This paper introduces DialCrowd 2.0 to help requesters obtain higher quality data by, for example, presenting tasks more clearly and facilitating effective communication with workers. DialCrowd 2.0 guides developers in creating improved Human Intelligence Tasks (HITs) and is directly applicable to the workflows used currently by developers and researchers.
Submitted 25 July, 2022;
originally announced July 2022.
-
InstructDial: Improving Zero and Few-shot Generalization in Dialogue through Instruction Tuning
Authors:
Prakhar Gupta,
Cathy Jiao,
Yi-Ting Yeh,
Shikib Mehri,
Maxine Eskenazi,
Jeffrey P. Bigham
Abstract:
Instruction tuning is an emergent paradigm in NLP wherein natural language instructions are leveraged with language models to induce zero-shot performance on unseen tasks. Instructions have been shown to enable good performance on unseen tasks and datasets in both large and small language models. Dialogue is an especially interesting area to explore instruction tuning because dialogue systems perform multiple kinds of tasks related to language (e.g., natural language understanding and generation, domain-specific interaction), yet instruction tuning has not been systematically explored for dialogue-related tasks. We introduce InstructDial, an instruction tuning framework for dialogue, which consists of a repository of 48 diverse dialogue tasks in a unified text-to-text format created from 59 openly available dialogue datasets. Next, we explore cross-task generalization ability on models tuned on InstructDial across diverse dialogue tasks. Our analysis reveals that InstructDial enables good zero-shot performance on unseen datasets and tasks such as dialogue evaluation and intent detection, and even better performance in a few-shot setting. To ensure that models adhere to instructions, we introduce novel meta-tasks. We establish benchmark zero-shot and few-shot performance of models trained using the proposed framework on multiple dialogue tasks.
Submitted 26 October, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Report from the NSF Future Directions Workshop on Automatic Evaluation of Dialog: Research Directions and Challenges
Authors:
Shikib Mehri,
Jinho Choi,
Luis Fernando D'Haro,
Jan Deriu,
Maxine Eskenazi,
Milica Gasic,
Kallirroi Georgila,
Dilek Hakkani-Tur,
Zekang Li,
Verena Rieser,
Samira Shaikh,
David Traum,
Yi-Ting Yeh,
Zhou Yu,
Yizhe Zhang,
Chen Zhang
Abstract:
This is a report on the NSF Future Directions Workshop on Automatic Evaluation of Dialog. The workshop explored the current state of the art along with its limitations and suggested promising directions for future work in this important and very rapidly changing area of research.
Submitted 18 March, 2022;
originally announced March 2022.
-
A Survey of NLP-Related Crowdsourcing HITs: what works and what does not
Authors:
Jessica Huynh,
Jeffrey Bigham,
Maxine Eskenazi
Abstract:
Crowdsourcing requesters on Amazon Mechanical Turk (AMT) have raised questions about the reliability of the workers. The AMT workforce is very diverse and it is not possible to make blanket assumptions about them as a group. Some requesters now reject work en masse when they do not get the results they expect. This has the effect of giving each worker (good or bad) a lower Human Intelligence Task (HIT) approval score, which is unfair to the good workers. It also has the effect of giving the requester a bad reputation on the workers' forums. Some of the issues causing the mass rejections stem from the requesters not taking the time to create a well-formed task with complete instructions and/or not paying a fair wage. To explore this assumption, this paper describes a study that looks at the crowdsourcing HITs on AMT that were available over a given span of time and records information about those HITs. This study also records information from a crowdsourcing forum on the worker perspective on both those HITs and on their corresponding requesters. Results reveal issues in worker payment and presentation issues such as missing instructions or HITs that are not doable.
Submitted 9 November, 2021;
originally announced November 2021.
-
Schema-Guided Paradigm for Zero-Shot Dialog
Authors:
Shikib Mehri,
Maxine Eskenazi
Abstract:
Developing mechanisms that flexibly adapt dialog systems to unseen tasks and domains is a major challenge in dialog research. Neural models implicitly memorize task-specific dialog policies from the training data. We posit that this implicit memorization has precluded zero-shot transfer learning. To this end, we leverage the schema-guided paradigm, wherein the task-specific dialog policy is explicitly provided to the model. We introduce the Schema Attention Model (SAM) and improved schema representations for the STAR corpus. SAM obtains significant improvement in zero-shot settings, with a +22 F1 score improvement over prior work. These results validate the feasibility of zero-shot generalizability in dialog. Ablation experiments are also presented to demonstrate the efficacy of SAM.
Submitted 13 June, 2021;
originally announced June 2021.
-
GenSF: Simultaneous Adaptation of Generative Pre-trained Models and Slot Filling
Authors:
Shikib Mehri,
Maxine Eskenazi
Abstract:
In transfer learning, it is imperative to achieve strong alignment between a pre-trained model and a downstream task. Prior work has done this by proposing task-specific pre-training objectives, which sacrifices the inherent scalability of the transfer learning paradigm. We instead achieve strong alignment by simultaneously modifying both the pre-trained model and the formulation of the downstream task, which is more efficient and preserves the scalability of transfer learning. We present GenSF (Generative Slot Filling), which leverages a generative pre-trained open-domain dialog model for slot filling. GenSF (1) adapts the pre-trained model by incorporating inductive biases about the task and (2) adapts the downstream task by reformulating slot filling to better leverage the pre-trained model's capabilities. GenSF achieves state-of-the-art results on two slot filling datasets with strong gains in few-shot and zero-shot settings. We achieve a 9 F1 score improvement in zero-shot slot filling. This highlights the value of strong alignment between the pre-trained model and the downstream task.
Submitted 13 June, 2021;
originally announced June 2021.
-
A Comprehensive Assessment of Dialog Evaluation Metrics
Authors:
Yi-Ting Yeh,
Maxine Eskenazi,
Shikib Mehri
Abstract:
Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 23 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engaging), (4) for different types of response generation models (i.e., generative, retrieval, simple models and state-of-the-art models), (5) taking into account the similarity of different metrics and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
Submitted 7 July, 2021; v1 submitted 7 June, 2021;
originally announced June 2021.
-
Overview of the Ninth Dialog System Technology Challenge: DSTC9
Authors:
Chulaka Gunasekara,
Seokhwan Kim,
Luis Fernando D'Haro,
Abhinav Rastogi,
Yun-Nung Chen,
Mihail Eric,
Behnam Hedayatnia,
Karthik Gopalakrishnan,
Yang Liu,
Chao-Wei Huang,
Dilek Hakkani-Tür,
Jinchao Li,
Qi Zhu,
Lingxiao Luo,
Lars Liden,
Kaili Huang,
Shahin Shayandeh,
Runze Liang,
Baolin Peng,
Zheng Zhang,
Swadheen Shukla,
Minlie Huang,
Jianfeng Gao,
Shikib Mehri,
Yulan Feng,
et al. (14 additional authors not shown)
Abstract:
This paper introduces the Ninth Dialog System Technology Challenge (DSTC-9). This edition of the DSTC focuses on applying end-to-end dialog technologies for four distinct tasks in dialog systems, namely, 1. Task-oriented dialog Modeling with unstructured knowledge access, 2. Multi-domain task-oriented dialog, 3. Interactive evaluation of dialog, and 4. Situated interactive multi-modal dialog. This paper describes the task definition, provided datasets, baselines and evaluation set-up for each track. We also summarize the results of the submitted systems to highlight the overall trends of the state-of-the-art technologies for the tasks.
Submitted 12 November, 2020;
originally announced November 2020.
-
Unsupervised Evaluation of Interactive Dialog with DialoGPT
Authors:
Shikib Mehri,
Maxine Eskenazi
Abstract:
It is important to define meaningful and interpretable automatic evaluation metrics for open-domain dialog research. Standard language generation metrics have been shown to be ineffective for dialog. This paper introduces the FED metric (fine-grained evaluation of dialog), an automatic evaluation metric which uses DialoGPT, without any fine-tuning or supervision. It also introduces the FED dataset which is constructed by annotating a set of human-system and human-human conversations with eighteen fine-grained dialog qualities. The FED metric (1) does not rely on a ground-truth response, (2) does not require training data and (3) measures fine-grained dialog qualities at both the turn and whole dialog levels. FED attains moderate to strong correlation with human judgement at both levels.
Submitted 22 June, 2020;
originally announced June 2020.
-
Report from the NSF Future Directions Workshop, Toward User-Oriented Agents: Research Directions and Challenges
Authors:
Maxine Eskenazi,
Tiancheng Zhao
Abstract:
This USER Workshop was convened with the goal of defining future research directions for the burgeoning intelligent agent research community and to communicate them to the National Science Foundation. It took place in Pittsburgh, Pennsylvania, on October 24 and 25, 2019 and was sponsored by National Science Foundation Grant Number IIS-1934222. Any opinions, findings and conclusions or future directions expressed in this document are those of the authors and do not necessarily reflect the views of the National Science Foundation. The 27 participants presented their individual research interests and their personal research goals. In the breakout sessions that followed, the participants defined the main research areas within the domain of intelligent agents and they discussed the major future directions that the research in each area of this domain should take.
Submitted 10 June, 2020;
originally announced June 2020.
-
USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Authors:
Shikib Mehri,
Maxine Eskenazi
Abstract:
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48 and system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
Submitted 1 May, 2020;
originally announced May 2020.
-
"None of the Above": Measure Uncertainty in Dialog Response Retrieval
Authors:
Yulan Feng,
Shikib Mehri,
Maxine Eskenazi,
Tiancheng Zhao
Abstract:
This paper discusses the importance of uncovering uncertainty in end-to-end dialog tasks, and presents our experimental results on uncertainty classification on the Ubuntu Dialog Corpus. We show that, instead of retraining models for this specific purpose, the original retrieval model's underlying confidence concerning the best prediction can be captured with trivial additional computation.
Submitted 14 May, 2020; v1 submitted 4 April, 2020;
originally announced April 2020.
-
CMU GetGoing: An Understandable and Memorable Dialog System for Seniors
Authors:
Shikib Mehri,
Alan W Black,
Maxine Eskenazi
Abstract:
Voice-based technologies are typically developed for the average user, and thus generally not tailored to the specific needs of any subgroup of the population, like seniors. This paper presents CMU GetGoing, an accessible trip planning dialog system designed for senior users. The GetGoing system design is described in detail, with particular attention to the senior-tailored features. A user study is presented, demonstrating that the senior-tailored features significantly improve comprehension and retention of information.
Submitted 3 September, 2019;
originally announced September 2019.
-
Multi-Granularity Representations of Dialog
Authors:
Shikib Mehri,
Maxine Eskenazi
Abstract:
Neural models of dialog rely on generalized latent representations of language. This paper introduces a novel training procedure which explicitly learns multiple representations of language at several levels of granularity. The multi-granularity training algorithm modifies the mechanism by which negative candidate responses are sampled in order to control the granularity of learned latent representations. Strong performance gains are observed on the next utterance retrieval task using both the MultiWOZ dataset and the Ubuntu dialog corpus. Analysis significantly demonstrates that multiple granularities of representation are being learned, and that multi-granularity training facilitates better transfer to downstream tasks.
Submitted 26 August, 2019;
originally announced August 2019.
-
Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References
Authors:
Prakhar Gupta,
Shikib Mehri,
Tiancheng Zhao,
Amy Pavel,
Maxine Eskenazi,
Jeffrey P. Bigham
Abstract:
The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in open-domain dialog. One alternative is to collect human annotations for evaluation, which can be expensive and time-consuming. To demonstrate the effectiveness of multi-reference evaluation, we augment the test set of DailyDialog with multiple references. A series of experiments shows that the use of multiple references results in improved correlation between several automatic metrics and human judgement for both the quality and the diversity of system output.
Submitted 8 September, 2019; v1 submitted 24 July, 2019;
originally announced July 2019.
-
Structured Fusion Networks for Dialog
Authors:
Shikib Mehri,
Tejas Srinivasan,
Maxine Eskenazi
Abstract:
Neural dialog models have exhibited strong performance; however, their end-to-end nature lacks a representation of the explicit structure of dialog. This results in a loss of generalizability and controllability, and in a data-hungry nature. Conversely, more traditional dialog systems do have strong models of explicit structure. This paper introduces several approaches for explicitly incorporating structure into neural models of dialog. Structured Fusion Networks first learn neural dialog modules corresponding to the structured components of traditional dialog systems and then incorporate these modules in a higher-level generative model. Structured Fusion Networks obtain strong results on the MultiWOZ dataset, both with and without reinforcement learning. Structured Fusion Networks are shown to have several valuable properties, including better domain generalizability, improved performance in reduced data scenarios and robustness to divergence during reinforcement learning.
Submitted 23 July, 2019;
originally announced July 2019.
-
Pretraining Methods for Dialog Context Representation Learning
Authors:
Shikib Mehri,
Evgeniia Razumovskaia,
Tiancheng Zhao,
Maxine Eskenazi
Abstract:
This paper examines various unsupervised pretraining objectives for learning dialog context representations. Two novel methods of pretraining dialog context encoders are proposed, and a total of four methods are examined. Each pretraining objective is fine-tuned and evaluated on a set of downstream dialog tasks using the MultiWOZ dataset and strong performance improvement is observed. Further evaluation shows that our pretraining objectives result in not only better performance, but also better convergence, models that are less data hungry and have better domain generalizability.
Submitted 3 June, 2019; v1 submitted 2 June, 2019;
originally announced June 2019.
-
Rethinking Action Spaces for Reinforcement Learning in End-to-end Dialog Agents with Latent Variable Models
Authors:
Tiancheng Zhao,
Kaige Xie,
Maxine Eskenazi
Abstract:
Defining action spaces for conversational agents and optimizing their decision-making process with reinforcement learning is an enduring challenge. Common practice has been to use handcrafted dialog acts, or the output vocabulary, e.g. in neural encoder-decoders, as the action spaces. Both have their own limitations. This paper proposes a novel latent action framework that treats the action spaces of an end-to-end dialog agent as latent variables and develops unsupervised methods in order to induce its own action space from the data. Comprehensive experiments are conducted examining both continuous and discrete action types and two different optimization methods based on stochastic variational inference. Results show that the proposed latent actions achieve superior empirical performance improvement over previous word-level policy gradient methods on both DealOrNoDeal and MultiWOZ dialogs. Our detailed analysis also provides insights about various latent variable approaches for policy learning and can serve as a foundation for developing better latent actions in future research.
Submitted 15 April, 2019; v1 submitted 23 February, 2019;
originally announced February 2019.
-
Beyond Turing: Intelligent Agents Centered on the User
Authors:
Maxine Eskenazi,
Shikib Mehri,
Evgeniia Razumovskaia,
Tiancheng Zhao
Abstract:
Most research on intelligent agents centers on the agent and not on the user. We look at the origins of agent-centric research for slot-filling, gaming and chatbot agents. We then argue that it is important to concentrate more on the user. After reviewing relevant literature, some approaches for creating and assessing user-centric systems are proposed.
Submitted 18 March, 2019; v1 submitted 19 January, 2019;
originally announced January 2019.
-
Context-Aware Dialog Re-Ranking for Task-Oriented Dialog Systems
Authors:
Junki Ohmura,
Maxine Eskenazi
Abstract:
Dialog response ranking is used to rank response candidates by considering their relation to the dialog history. Although researchers have addressed this concept for open-domain dialogs, little attention has been focused on task-oriented dialogs. Furthermore, no previous studies have analyzed whether response ranking can improve the performance of existing dialog systems in real human-computer dialogs with speech recognition errors. In this paper, we propose a context-aware dialog response re-ranking system. Our system reranks responses in two steps: (1) it calculates matching scores for each candidate response and the current dialog context; (2) it combines the matching scores and a probability distribution of the candidates from an existing dialog system for response re-ranking. By using neural word embedding-based models and handcrafted or logistic regression-based ensemble models, we have improved the performance of a recently proposed end-to-end task-oriented dialog system on real dialogs with speech recognition errors.
Submitted 28 November, 2018;
originally announced November 2018.
-
Multimodal Polynomial Fusion for Detecting Driver Distraction
Authors:
Yulun Du,
Chirag Raman,
Alan W Black,
Louis-Philippe Morency,
Maxine Eskenazi
Abstract:
Distracted driving is deadly, claiming 3,477 lives in the U.S. in 2015 alone. Although there has been a considerable amount of research on modeling the distracted behavior of drivers under various conditions, accurate automatic detection using multiple modalities and especially the contribution of using the speech modality to improve accuracy has received little attention. This paper introduces a new multimodal dataset for distracted driving behavior and discusses automatic distraction detection using features from three modalities: facial expression, speech and car signals. Detailed multimodal feature analysis shows that adding more modalities monotonically increases the predictive accuracy of the model. Finally, a simple and effective multimodal fusion technique using a polynomial fusion layer shows superior distraction detection results compared to the baseline SVM and neural network models.
Submitted 24 October, 2018;
originally announced October 2018.
-
Zero-Shot Dialog Generation with Cross-Domain Latent Actions
Authors:
Tiancheng Zhao,
Maxine Eskenazi
Abstract:
This paper introduces zero-shot dialog generation (ZSDG), as a step towards neural dialog systems that can instantly generalize to new situations with minimal data. ZSDG enables an end-to-end generative dialog system to generalize to a new domain for which only a domain description is provided and no training dialogs are available. Then a novel learning framework, Action Matching, is proposed. This algorithm can learn a cross-domain embedding space that models the semantics of dialog responses which, in turn, lets a neural dialog generation model generalize to new domains. We evaluate our methods on a new synthetic dialog dataset, and an existing human-human dialog dataset. Results show that our method has superior performance in learning dialog models that rapidly adapt their behavior to new domains and suggests promising future research.
Submitted 12 May, 2018;
originally announced May 2018.
-
Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation
Authors:
Tiancheng Zhao,
Kyusong Lee,
Maxine Eskenazi
Abstract:
The encoder-decoder dialog model is one of the most prominent methods used to build dialog systems in complex domains. Yet it is limited because it cannot output interpretable actions as in traditional systems, which hinders humans from understanding its generation process. We present an unsupervised discrete sentence representation learning method that can integrate with any existing encoder-decoder dialog models for interpretable response generation. Building upon variational autoencoders (VAEs), we present two novel models, DI-VAE and DI-VST that improve VAEs and can discover interpretable semantics via either auto encoding or context predicting. Our methods have been validated on real-world dialog datasets to discover semantic representations and enhance encoder-decoder models with interpretable generation.
Submitted 22 April, 2018;
originally announced April 2018.
-
Explainable Entity-based Recommendations with Knowledge Graphs
Authors:
Rose Catherine,
Kathryn Mazaitis,
Maxine Eskenazi,
William Cohen
Abstract:
Explainable recommendation is an important task. Many methods have been proposed which generate explanations from the content and reviews written for items. When review text is unavailable, generating explanations is still a hard problem. In this paper, we illustrate how explanations can be generated in such a scenario by leveraging external knowledge in the form of knowledge graphs. Our method jointly ranks items and knowledge graph entities using a Personalized PageRank procedure to produce recommendations together with their explanations.
Submitted 12 July, 2017;
originally announced July 2017.
-
Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability
Authors:
Tiancheng Zhao,
Allen Lu,
Kyusong Lee,
Maxine Eskenazi
Abstract:
Generative encoder-decoder models offer great promise in developing domain-general dialog systems. However, they have mainly been applied to open-domain conversations. This paper presents a practical and novel framework for building task-oriented dialog systems based on encoder-decoder models. This framework enables encoder-decoder models to accomplish slot-value independent decision-making and interact with external databases. Moreover, this paper shows the flexibility of the proposed method by interleaving chatting capability with a slot-filling system for better out-of-domain recovery. The models were trained on both real-user data from a bus information system and human-human chat data. Results show that the proposed framework achieves good performance in both offline evaluation metrics and in task success rate with human users.
Submitted 26 June, 2017;
originally announced June 2017.
-
Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders
Authors:
Tiancheng Zhao,
Ran Zhao,
Maxine Eskenazi
Abstract:
While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at word-level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures the discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.
Submitted 21 October, 2017; v1 submitted 31 March, 2017;
originally announced March 2017.
-
Predicting the Relative Difficulty of Single Sentences With and Without Surrounding Context
Authors:
Elliot Schumacher,
Maxine Eskenazi,
Gwen Frishkoff,
Kevyn Collins-Thompson
Abstract:
The problem of accurately predicting relative reading difficulty across a set of sentences arises in a number of important natural language applications, such as finding and curating effective usage examples for intelligent language tutoring systems. Yet while significant research has explored document- and passage-level reading difficulty, the special challenges involved in assessing aspects of readability for single sentences have received much less attention, particularly when considering the role of surrounding passages. We introduce and evaluate a novel approach for estimating the relative reading difficulty of a set of sentences, with and without surrounding context. Using different sets of lexical and grammatical features, we explore models for predicting pairwise relative difficulty using logistic regression, and examine rankings generated by aggregating pairwise difficulty labels using a Bayesian rating system to form a final ranking. We also compare rankings derived for sentences assessed with and without context, and find that contextual features can help predict differences in relative difficulty judgments across these two conditions.
Submitted 24 October, 2016; v1 submitted 27 June, 2016;
originally announced June 2016.
-
Graph-Community Detection for Cross-Document Topic Segment Relationship Identification
Authors:
Pedro Mota,
Maxine Eskenazi,
Luisa Coheur
Abstract:
In this paper we propose a graph-community detection approach to identify cross-document relationships at the topic segment level. Given a set of related documents, we automatically find these relationships by clustering segments with similar content (topics). In this context, we study how different weighting mechanisms influence the discovery of word communities that relate to the different topics found in the documents. Finally, we test different mapping functions to assign topic segments to word communities, determining which topic segments are considered equivalent.
By performing this task it is possible to enable efficient multi-document browsing, since when a user finds relevant content in one document we can provide access to similar topics in other documents. We deploy our approach in two different scenarios. One is an educational scenario where equivalence relationships between learning materials need to be found. The other consists of a series of dialogs in a social context where students discuss commonplace topics. Results show that our proposed approach better discovered equivalence relationships in learning material documents and obtained close results in the social speech domain, where the best performing approach was a clustering technique.
Submitted 13 June, 2016;
originally announced June 2016.
-
DialPort: Connecting the Spoken Dialog Research Community to Real User Data
Authors:
Tiancheng Zhao,
Kyusong Lee,
Maxine Eskenazi
Abstract:
This paper describes a new spoken dialog portal that connects systems produced by the spoken dialog academic research community and gives them access to real users. We introduce a distributed, multi-modal, multi-agent prototype dialog framework that affords easy integration with various remote resources, ranging from end-to-end dialog systems to external knowledge APIs. To date, the DialPort portal has successfully connected to the multi-domain spoken dialog system at Cambridge University, the NOAA (National Oceanic and Atmospheric Administration) weather API and the Yelp API.
Submitted 8 June, 2016;
originally announced June 2016.
-
Towards End-to-End Learning for Dialog State Tracking and Management using Deep Reinforcement Learning
Authors:
Tiancheng Zhao,
Maxine Eskenazi
Abstract:
This paper presents an end-to-end framework for task-oriented dialog systems using a variant of Deep Recurrent Q-Networks (DRQN). The model is able to interface with a relational database and jointly learn policies for both language understanding and dialog strategy. Moreover, we propose a hybrid algorithm that combines the strength of reinforcement learning and supervised learning to achieve faster learning speed. We evaluated the proposed model on a 20 Question Game conversational game simulator. Results show that the proposed method outperforms the modular-based baseline and learns a distributed representation of the latent dialog state.
Submitted 15 September, 2016; v1 submitted 8 June, 2016;
originally announced June 2016.
-
Directed Energy Missions for Planetary Defense
Authors:
Philip Lubin,
Gary B. Hughes,
Mike Eskenazi,
Kelly Kosmo,
Isabella E. Johansson,
Janelle Griswold,
Mark Pryor,
Hugh O'Neill,
Peter Meinhold,
Jonathon Suen,
Jordan Riley,
Qicheng Zhang,
Kevin Walsh,
Carl Melis,
Miikka Kangas,
Caio Motta,
Travis Brashears
Abstract:
Directed energy for planetary defense is now a viable option and is superior in many ways to other proposed technologies, being able to defend the Earth against all known threats. This paper presents basic ideas behind a directed energy planetary defense system that utilizes laser ablation of an asteroid to impart a deflecting force on the target. A conceptual philosophy called DE-STAR, which stands for Directed Energy System for Targeting of Asteroids and exploRation, is an orbiting stand-off system, which has been described in other papers. This paper describes a smaller, stand-on system known as DE-STARLITE as a reduced-scale version of DE-STAR. Both share the same basic heritage of a directed energy array that heats the surface of the target to the point of high surface vapor pressure that causes significant mass ejection thus forming an ejection plume of material from the target that acts as a rocket to deflect the object. This is generally classified as laser ablation. DE-STARLITE uses conventional propellant for launch to LEO and then ion engines to propel the spacecraft from LEO to the near-Earth asteroid (NEA). During laser ablation, the asteroid itself provides the propellant source material; thus a very modest spacecraft can deflect an asteroid much larger than would be possible with a system of similar mission mass using ion beam deflection (IBD) or a gravity tractor. DE-STARLITE is capable of deflecting an Apophis-class (325 m diameter) asteroid with a 1- to 15-year targeting time (laser on time) depending on the system design. The mission fits within the rough mission parameters of the Asteroid Redirect Mission (ARM) program in terms of mass and size. DE-STARLITE also has much greater capability for planetary defense than current proposals and is readily scalable to match the threat. It can deflect all known threats with sufficient warning.
Submitted 20 April, 2016; v1 submitted 12 April, 2016;
originally announced April 2016.
-
A Readability Analysis of Campaign Speeches from the 2016 US Presidential Campaign
Authors:
Elliot Schumacher,
Maxine Eskenazi
Abstract:
Readability is defined as the reading level of the speech from grade 1 to grade 12. It results from the use of the REAP readability analysis (vocabulary - Collins-Thompson and Callan, 2004; syntax - Heilman et al., 2006, 2007), which uses the lexical contents and grammatical structure of the sentences in a document to predict the reading level. After analysis, results were grouped into the average readability of each candidate, the evolution of the candidate's speeches' readability over time and the standard deviation, or how much each candidate varied their speech from one venue to another. For comparison, one speech from four past presidents and the Gettysburg Address were also analyzed.
Submitted 17 March, 2016;
originally announced March 2016.