Skip to main content

Showing 1–11 of 11 results for author: Abrego, G H

  1. arXiv:2404.01616  [pdf, other

    cs.CL cs.IR cs.SD eess.AS

    Transforming LLMs into Cross-modal and Cross-lingual Retrieval Systems

    Authors: Frank Palma Gomez, Ramon Sanabria, Yun-hsuan Sung, Daniel Cer, Siddharth Dalmia, Gustavo Hernandez Abrego

    Abstract: Large language models (LLMs) are trained on text-only data that go far beyond the languages with paired speech and text data. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi… ▽ More

    Submitted 10 July, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

  2. arXiv:2403.20327  [pdf, other

    cs.CL cs.AI

    Gecko: Versatile Text Embeddings Distilled from Large Language Models

    Authors: Jinhyuk Lee, Zhuyun Dai, Xiaoqi Ren, Blair Chen, Daniel Cer, Jeremy R. Cole, Kai Hui, Michael Boratko, Rajvi Kapadia, Wen Ding, Yi Luan, Sai Meher Karthik Duddu, Gustavo Hernandez Abrego, Weiqiang Shi, Nithi Gupta, Aditya Kusupati, Prateek Jain, Siddhartha Reddy Jonnalagadda, Ming-Wei Chang, Iftekhar Naim

    Abstract: We present Gecko, a compact and versatile text embedding model. Gecko achieves strong retrieval performance by leveraging a key idea: distilling knowledge from large language models (LLMs) into a retriever. Our two-step distillation process begins with generating diverse, synthetic paired data using an LLM. Next, we further refine the data quality by retrieving a set of candidate passages for each… ▽ More

    Submitted 29 March, 2024; originally announced March 2024.

    Comments: 18 pages

  3. arXiv:2311.05800  [pdf, other

    cs.IR cs.AI cs.CL

    Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

    Authors: Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer

    Abstract: There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages. Synthetic training data generation is promising (e.g., InPars or Promptagator), but has been investigated only for English. Therefore, to study model capabilities across both cross-lingual and monolingual retrieval tasks, we develop SWIM-I… ▽ More

    Submitted 15 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

    Comments: Accepted at NAACL 2024. Data released at https://github.com/google-research-datasets/swim-ir

  4. arXiv:2306.02516  [pdf, other

    cs.LG cs.IR

    SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives

    Authors: Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, Zhe Dong

    Abstract: Dual encoders have been used for retrieval tasks and representation learning with good results. A standard way to train dual encoders is using a contrastive loss with in-batch negatives. In this work, we propose an improved contrastive learning objective by adding queries or documents from the same encoder towers to the negatives, for which we name it as "contrastive loss with SAMe TOwer NEgatives… ▽ More

    Submitted 4 June, 2023; originally announced June 2023.

    Comments: ACL 2023 Findings

  5. arXiv:2305.10403  [pdf, other

    cs.CL cs.AI

    PaLM 2 Technical Report

    Authors: Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego , et al. (103 additional authors not shown)

    Abstract: We introduce PaLM 2, a new state-of-the-art language model that has better multilingual and reasoning capabilities and is more compute-efficient than its predecessor PaLM. PaLM 2 is a Transformer-based model trained using a mixture of objectives. Through extensive evaluations on English and multilingual language, and reasoning tasks, we demonstrate that PaLM 2 has significantly improved quality on… ▽ More

    Submitted 13 September, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

  6. arXiv:2112.07899  [pdf, other

    cs.IR cs.CL

    Large Dual Encoders Are Generalizable Retrievers

    Authors: Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, Yinfei Yang

    Abstract: It has been shown that dual encoders trained on one domain often fail to generalize to other domains for retrieval tasks. One widespread belief is that the bottleneck layer of a dual encoder, where the final score is simply a dot-product between a query vector and a passage vector, is too limited to make dual encoders an effective retrieval model for out-of-domain generalization. In this paper, we… ▽ More

    Submitted 15 December, 2021; originally announced December 2021.

  7. arXiv:2108.08877  [pdf, other

    cs.CL

    Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models

    Authors: Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, Yinfei Yang

    Abstract: We provide the first exploration of sentence embeddings from text-to-text transformers (T5). Sentence embeddings are broadly useful for language processing tasks. While T5 achieves impressive performance on language tasks cast as sequence-to-sequence mapping problems, it is unclear how to produce sentence embeddings from encoder-decoder models. We investigate three methods for extracting T5 senten… ▽ More

    Submitted 14 December, 2021; v1 submitted 19 August, 2021; originally announced August 2021.

  8. arXiv:2010.12523  [pdf, other

    cs.CL

    Neural Passage Retrieval with Improved Negative Contrast

    Authors: Jing Lu, Gustavo Hernandez Abrego, Ji Ma, Jianmo Ni, Yinfei Yang

    Abstract: In this paper we explore the effects of negative sampling in dual encoder models used to retrieve passages for automatic question answering. We explore four negative sampling strategies that complement the straightforward random sampling of negatives, typically used to train dual encoder models. Out of the four strategies, three are based on retrieval and one on heuristics. Our retrieval-based str… ▽ More

    Submitted 23 October, 2020; originally announced October 2020.

  9. arXiv:1907.04307  [pdf, other

    cs.CL

    Multilingual Universal Sentence Encoder for Semantic Retrieval

    Authors: Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

    Abstract: We introduce two pre-trained retrieval focused multilingual sentence encoding models, respectively based on the Transformer and CNN model architectures. The models embed text from 16 languages into a single semantic space using a multi-task trained dual-encoder that learns tied representations using translation based bridge tasks (Chidambaram al., 2018). The models provide performance that is comp… ▽ More

    Submitted 9 July, 2019; originally announced July 2019.

    Comments: 6 pages, 6 tables, 2 listings, and 1 figure

  10. arXiv:1902.08564  [pdf, other

    cs.CL

    Improving Multilingual Sentence Embedding using Bi-directional Dual Encoder with Additive Margin Softmax

    Authors: Yinfei Yang, Gustavo Hernandez Abrego, Steve Yuan, Mandy Guo, Qinlan Shen, Daniel Cer, Yun-hsuan Sung, Brian Strope, Ray Kurzweil

    Abstract: In this paper, we present an approach to learn multilingual sentence embeddings using a bi-directional dual-encoder with additive margin softmax. The embeddings are able to achieve state-of-the-art results on the United Nations (UN) parallel corpus retrieval task. In all the languages tested, the system achieves P@1 of 86% or higher. We use pairs retrieved by our approach to train NMT models that… ▽ More

    Submitted 14 June, 2019; v1 submitted 22 February, 2019; originally announced February 2019.

    Comments: Accepted by IJCAI'19(International Joint Conference on Artificial Intelligence)

  11. arXiv:1807.11906  [pdf, other

    cs.CL

    Effective Parallel Corpus Mining using Bilingual Sentence Embeddings

    Authors: Mandy Guo, Qinlan Shen, Yinfei Yang, Heming Ge, Daniel Cer, Gustavo Hernandez Abrego, Keith Stevens, Noah Constant, Yun-Hsuan Sung, Brian Strope, Ray Kurzweil

    Abstract: This paper presents an effective approach for parallel corpus mining using bilingual sentence embeddings. Our embedding models are trained to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. This is achieved using a novel training method that introduces hard negatives consisting of sentences that are not translations but that have some d… ▽ More

    Submitted 2 August, 2018; v1 submitted 31 July, 2018; originally announced July 2018.