Computer Science > Computation and Language

arXiv:2303.13809v2 (cs)

[Submitted on 24 Mar 2023 (v1), revised 8 Oct 2023 (this version, v2), latest version 5 Jun 2024 (v4)]

Title: Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT

Authors: Qingyu Lu, Baopu Qiu, Liang Ding, Kanjian Zhang, Tom Kocmi, Dacheng Tao

Abstract: Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that using ChatGPT to assess the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting methods and propose a new one, Error Analysis Prompting (EAPrompt), which combines Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with error analysis can generate human-like MT evaluations at both the system and segment levels. Additionally, we identify, for the first time, some limitations of ChatGPT as an MT evaluator; for example, when multiple translations are provided in a single query, changing their input order may significantly influence its judgment. This work offers preliminary experience in prompting LLMs as evaluators to improve the reliability of translation evaluation metrics under the error analysis paradigm.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2303.13809 [cs.CL]
	(or arXiv:2303.13809v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2303.13809

Submission history

From: Liang Ding
[v1] Fri, 24 Mar 2023 05:05:03 UTC (992 KB)
[v2] Sun, 8 Oct 2023 12:50:10 UTC (3,283 KB)
[v3] Wed, 21 Feb 2024 04:18:32 UTC (4,167 KB)
[v4] Wed, 5 Jun 2024 07:40:54 UTC (4,161 KB)
