Showing 1–50 of 90 results for author: Catanzaro, B

  1. arXiv:2407.07263  [pdf, other]

    cs.CL

    Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

    Authors: Jupinder Parmar, Sanjev Satheesh, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost for pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining; allowing for a model's abilities to further improve without needing to train from scratc…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review

  2. arXiv:2407.06380  [pdf, other]

    cs.CL

    Data, Data Everywhere: A Guide for Pretraining Dataset Construction

    Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin Wang, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: The impressive capabilities of recent language models can be largely attributed to the multi-trillion token pretraining datasets that they are trained on. However, model developers fail to disclose their construction methodology, which has led to a lack of open information on how to develop effective pretraining sets. To address this issue, we perform the first systematic study across the entire p…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: Preprint. Under review

  3. arXiv:2407.02485  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

    Authors: Yue Yu, Wei Ping, Zihan Liu, Boxin Wang, Jiaxuan You, Chao Zhang, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Large language models (LLMs) typically utilize the top-k contexts from a retriever in retrieval-augmented generation (RAG). In this work, we propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG. In particular, the instruction-tuned LLMs work surprisingly well by adding a small fraction o…

    Submitted 2 July, 2024; originally announced July 2024.

  4. arXiv:2406.15487  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Improving Text-To-Audio Models with Synthetic Captions

    Authors: Zhifeng Kong, Sang-gil Lee, Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, Rafael Valle, Soujanya Poria, Bryan Catanzaro

    Abstract: It is an open challenge to obtain high quality training data, especially captions, for text-to-audio models. Although prior methods have leveraged text-only language models to augment and improve captions, such methods have limitations related to scale and coherence between audio and captions. In this work, we propose an audio captioning pipeline that uses an audio language model…

    Submitted 8 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  5. arXiv:2406.11704  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 340B Technical Report

    Authors: NVIDIA: Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, et al. (58 additional authors not shown)

    Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation be…

    Submitted 17 June, 2024; originally announced June 2024.

  6. CircuitVAE: Efficient and Scalable Latent Circuit Optimization

    Authors: Jialin Song, Aidan Swope, Robert Kirby, Rajarshi Roy, Saad Godil, Jonathan Raiman, Bryan Catanzaro

    Abstract: Automatically designing fast and space-efficient digital circuits is challenging because circuits are discrete, must exactly implement the desired logic, and are costly to simulate. We address these challenges with CircuitVAE, a search algorithm that embeds computation graphs in a continuous space and optimizes a learned surrogate of physical simulation by gradient descent. By carefully controllin…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Design Automation Conference (DAC) 2024; the first two authors contributed equally

  7. arXiv:2406.07887  [pdf, other]

    cs.LG cs.CL

    An Empirical Study of Mamba-based Language Models

    Authors: Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a contr…

    Submitted 12 June, 2024; originally announced June 2024.

  8. arXiv:2405.17428  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Authors: Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

    Abstract: Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model with a variety of architectural designs and training procedures to significantly enhance the performance of LLM as a versatile embedding model, whil…

    Submitted 27 May, 2024; originally announced May 2024.

  9. arXiv:2404.07616  [pdf, other]

    cs.CL cs.SD eess.AS

    Audio Dialogues: Dialogues dataset for audio and music understanding

    Authors: Arushi Goel, Zhifeng Kong, Rafael Valle, Bryan Catanzaro

    Abstract: Existing datasets for audio understanding primarily focus on single-turn interactions (i.e. audio captioning, audio question answering) for describing audio in natural language, thus limiting understanding audio via interactive dialogue. To address this gap, we introduce Audio Dialogues: a multi-turn dialogue dataset containing 163.8k samples for general audio sounds and music. In addition to dial…

    Submitted 11 April, 2024; originally announced April 2024.

    Comments: Demo website: https://audiodialogues.github.io/

  10. arXiv:2402.16819  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 15B Technical Report

    Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, et al. (2 additional authors not shown)

    Abstract: We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remai…

    Submitted 27 February, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

  11. arXiv:2402.07319  [pdf, other]

    cs.LG cs.AI cs.CL

    ODIN: Disentangled Reward Mitigates Hacking in RLHF

    Authors: Lichang Chen, Chen Zhu, Davit Soselia, Jiuhai Chen, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: In this work, we study the issue of reward hacking on the response length, a challenge emerging in Reinforcement Learning from Human Feedback (RLHF) on LLMs. A well-formatted, verbose but less helpful response from the LLMs can often deceive LLMs or even human evaluators to achieve high scores. The same issue also holds for some reward models in RL. To address the challenges in both training and e…

    Submitted 11 February, 2024; originally announced February 2024.

  12. arXiv:2402.01831  [pdf, other]

    cs.SD cs.LG eess.AS

    Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

    Authors: Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, Bryan Catanzaro

    Abstract: Augmenting large language models (LLMs) to understand audio, including non-speech sounds and non-verbal speech, is critically important for diverse real-world applications of LLMs. In this paper, we propose Audio Flamingo, a novel audio language model with 1) strong audio understanding abilities, 2) the ability to quickly adapt to unseen tasks via in-context learning and retrieval, and 3) stro…

    Submitted 28 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: ICML 2024

  13. arXiv:2401.13851  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    Scaling NVIDIA's Multi-speaker Multi-lingual TTS Systems with Zero-Shot TTS to Indic Languages

    Authors: Akshit Arora, Rohan Badlani, Sungwon Kim, Rafael Valle, Bryan Catanzaro

    Abstract: In this paper, we describe the TTS models developed by NVIDIA for the MMITS-VC (Multi-speaker, Multi-lingual Indic TTS with Voice Cloning) 2024 Challenge. In Tracks 1 and 2, we utilize RAD-MMM to perform few-shot TTS by training additionally on 5 minutes of target speaker data. In Track 3, we utilize P-Flow to perform zero-shot TTS by training on the challenge dataset as well as external datasets.…

    Submitted 29 January, 2024; v1 submitted 24 January, 2024; originally announced January 2024.

    Comments: Presentation accepted at ICASSP 2024

  14. arXiv:2401.10225  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    ChatQA: Surpassing GPT-4 on Conversational QA and RAG

    Authors: Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Chankyu Lee, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: In this work, we introduce ChatQA, a suite of models that outperform GPT-4 on retrieval-augmented generation (RAG) and conversational question answering (QA). To enhance generation, we propose a two-stage instruction tuning method that significantly boosts the performance of RAG. For effective retrieval, we introduce a dense retriever optimized for conversational QA, which yields results comparabl…

    Submitted 22 May, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: We include the results of the Llama3-ChatQA-1.5-8B, Llama3-ChatQA-1.5-70B, and GPT-4-Turbo-2024-04-09 models on ChatRAG Bench. Additionally, we provide results on single-turn QA datasets: Natural Questions, TriviaQA, and HotpotQA

  15. arXiv:2311.00176  [pdf, other]

    cs.CL

    ChipNeMo: Domain-Adapted LLMs for Chip Design

    Authors: Mingjie Liu, Teodor-Dumitru Ene, Robert Kirby, Chris Cheng, Nathaniel Pinckney, Rongjian Liang, Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran, Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande, Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Ankit Jindal, Brucek Khailany, George Kokai, et al. (17 additional authors not shown)

    Abstract: ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we adopt the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. We e…

    Submitted 4 April, 2024; v1 submitted 31 October, 2023; originally announced November 2023.

    Comments: Updated results for ChipNeMo-70B model

  16. arXiv:2310.07713  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining

    Authors: Boxin Wang, Wei Ping, Lawrence McAfee, Peng Xu, Bo Li, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Pretraining auto-regressive large language models (LLMs) with retrieval demonstrates better perplexity and factual accuracy by leveraging external databases. However, the size of existing pretrained retrieval-augmented LLM is still limited (e.g., Retro has 7.5B parameters), which limits the effectiveness of instruction tuning and zero-shot generalization. In this work, we introduce Retro 48B, the…

    Submitted 29 May, 2024; v1 submitted 11 October, 2023; originally announced October 2023.

    Comments: ICML 2024

  17. arXiv:2310.03025  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Retrieval meets Long Context Large Language Models

    Authors: Peng Xu, Wei Ping, Xianchao Wu, Lawrence McAfee, Chen Zhu, Zihan Liu, Sandeep Subramanian, Evelina Bakhturina, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Extending the context window of large language models (LLMs) is getting popular recently, while the solution of augmenting LLMs with retrieval has existed for years. The natural questions are: i) Retrieval-augmentation versus long context window, which one is better for downstream tasks? ii) Can both methods be combined to get the best of both worlds? In this work, we answer these questions by stu…

    Submitted 23 January, 2024; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Published at ICLR 2024

  18. CleanUNet 2: A Hybrid Speech Denoising Model on Waveform and Spectrogram

    Authors: Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

    Abstract: In this work, we present CleanUNet 2, a speech denoising model that combines the advantages of waveform denoiser and spectrogram denoiser and achieves the best of both worlds. CleanUNet 2 uses a two-stage framework inspired by popular speech synthesis methods that consist of a waveform model and a spectrogram model. Specifically, CleanUNet 2 builds upon CleanUNet, the state-of-the-art waveform den…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: INTERSPEECH 2023

    Journal ref: Proc. INTERSPEECH 2023, pages 790-794

  19. arXiv:2308.07922  [pdf, other]

    cs.CL cs.AI cs.LG

    RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models

    Authors: Jie Huang, Wei Ping, Peng Xu, Mohammad Shoeybi, Kevin Chen-Chuan Chang, Bryan Catanzaro

    Abstract: In this paper, we investigate the in-context learning ability of retrieval-augmented encoder-decoder language models. We first conduct a comprehensive analysis of existing models and identify their limitations in in-context learning, primarily due to a mismatch between pretraining and inference, as well as a restricted context length. To address these issues, we propose RAVEN, a model that combine…

    Submitted 1 April, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

  20. arXiv:2308.03944  [pdf, other]

    cs.LG cs.AR

    GraPhSyM: Graph Physical Synthesis Model

    Authors: Ahmed Agiza, Rajarshi Roy, Teodor Dumitru Ene, Saad Godil, Sherief Reda, Bryan Catanzaro

    Abstract: In this work, we introduce GraPhSyM, a Graph Attention Network (GATv2) model for fast and accurate estimation of post-physical synthesis circuit delay and area metrics from pre-physical synthesis circuit netlists. Once trained, GraPhSyM provides accurate visibility of final design metrics to early EDA stages, such as logic synthesis, without running the slow physical synthesis flow, enabling globa…

    Submitted 7 September, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

    Comments: Accepted at Proceedings of the 42nd International Conference on Computer-Aided Design (ICCAD), 2023

  21. arXiv:2305.11102  [pdf, other]

    cs.CV

    Progressive Learning of 3D Reconstruction Network from 2D GAN Data

    Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

    Abstract: This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method relies on GAN generated multi-view image datasets which have a negligible annotation cost. However, they are not strictly multi-view consistent and sometimes GANs output distorted image…

    Submitted 18 May, 2023; originally announced May 2023.

    Comments: Web-page: https://research.nvidia.com/labs/adlr/progressive-3d-learning. arXiv admin note: text overlap with arXiv:2203.09362

  22. arXiv:2305.10474  [pdf, other]

    cs.CV cs.GR cs.LG

    Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

    Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

    Abstract: Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is co…

    Submitted 25 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

    Comments: ICCV 2023. Project webpage: https://research.nvidia.com/labs/dir/pyoco

  23. arXiv:2304.06762  [pdf, other]

    cs.CL cs.AI cs.IR cs.LG

    Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study

    Authors: Boxin Wang, Wei Ping, Peng Xu, Lawrence McAfee, Zihan Liu, Mohammad Shoeybi, Yi Dong, Oleksii Kuchaiev, Bo Li, Chaowei Xiao, Anima Anandkumar, Bryan Catanzaro

    Abstract: Large decoder-only language models (LMs) can be largely improved in terms of perplexity by retrieval (e.g., RETRO), but its impact on text generation quality and downstream task accuracy is unclear. Thus, it is still an open question: shall we pretrain large autoregressive LMs with retrieval? To answer it, we perform a comprehensive study on a scalable pre-trained retrieval-augmented LM (i.e., RET…

    Submitted 20 December, 2023; v1 submitted 13 April, 2023; originally announced April 2023.

    Comments: EMNLP 2023

  24. arXiv:2303.07578  [pdf, ps, other]

    cs.SD cs.LG eess.AS

    VANI: Very-lightweight Accent-controllable TTS for Native and Non-native speakers with Identity Preservation

    Authors: Rohan Badlani, Akshit Arora, Subhankar Ghosh, Rafael Valle, Kevin J. Shih, João Felipe Santos, Boris Ginsburg, Bryan Catanzaro

    Abstract: We introduce VANI, a very lightweight multi-lingual accent controllable speech synthesis system. Our model builds upon disentanglement strategies proposed in RADMMM and supports explicit control of accent, language, speaker and fine-grained $F_0$ and energy features for speech synthesis. We utilize the Indic languages dataset, released for LIMMITS 2023 as part of ICASSP Signal Processing Grand Cha…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: Presentation accepted at ICASSP 2023

  25. arXiv:2302.07388  [pdf, other]

    cs.CL cs.AI

    Adding Instructions during Pretraining: Effective Way of Controlling Toxicity in Language Models

    Authors: Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Pretrained large language models have become indispensable for solving various natural language processing (NLP) tasks. However, safely deploying them in real world applications is challenging because they generate toxic content. To address this challenge, we propose two novel pretraining data augmentation strategies that significantly reduce model toxicity without compromising its utility. Our tw…

    Submitted 14 February, 2023; originally announced February 2023.

    Comments: This paper will be presented at EACL 2023

  26. arXiv:2302.04858  [pdf, other]

    cs.CV cs.AI cs.CL cs.IR cs.LG

    Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning

    Authors: Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu, Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar

    Abstract: Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained the state-of-the-art results in image-to-text generation. However, these models store all the knowledge within their parameters, thus often requiring enormous model parameters to model the abundant visual concepts and very rich textual descriptions. Additionally, they are inefficient in incorporating ne…

    Submitted 22 October, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

    Comments: Findings of EMNLP 2023

  27. arXiv:2301.10335  [pdf, other]

    cs.SD cs.LG eess.AS

    Multilingual Multiaccented Multispeaker TTS with RADTTS

    Authors: Rohan Badlani, Rafael Valle, Kevin J. Shih, João Felipe Santos, Siddharth Gururani, Bryan Catanzaro

    Abstract: We work to create a multilingual speech synthesis system which can generate speech with the proper accent while retaining the characteristics of an individual voice. This is challenging to do because it is expensive to obtain bilingual training data in multiple languages, and the lack of such data results in strong correlations that entangle speaker, language, and accent, resulting in poor transfe…

    Submitted 24 January, 2023; originally announced January 2023.

    Comments: 5 pages, submitted to ICASSP 2023

  28. arXiv:2211.01324  [pdf, other]

    cs.CV cs.LG

    eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

    Authors: Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu

    Abstract: Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly…

    Submitted 13 March, 2023; v1 submitted 2 November, 2022; originally announced November 2022.

  29. arXiv:2210.13673  [pdf, other]

    cs.CL

    Evaluating Parameter Efficient Learning for Generation

    Authors: Peng Xu, Mostofa Patwary, Shrimai Prabhumoye, Virginia Adams, Ryan J. Prenger, Wei Ping, Nayeon Lee, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Parameter efficient learning methods (PERMs) have recently gained significant attention as they provide an efficient way for pre-trained language models (PLMs) to adapt to a downstream task. However, these conclusions are mostly drawn from in-domain evaluations over the full training set. In this paper, we present comparisons between PERMs and finetuning from three new perspectives: (1) the effect…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Accepted to EMNLP 2022 main conference

  30. arXiv:2210.06349  [pdf, other]

    cs.CL cs.AI

    Context Generation Improves Open Domain Question Answering

    Authors: Dan Su, Mostofa Patwary, Shrimai Prabhumoye, Peng Xu, Ryan Prenger, Mohammad Shoeybi, Pascale Fung, Anima Anandkumar, Bryan Catanzaro

    Abstract: Closed-book question answering (QA) requires a model to directly answer an open-domain question without access to any external knowledge. Prior work on closed-book QA either directly finetunes or prompts a pretrained language model (LM) to leverage the stored knowledge. However, they do not fully exploit the parameterized knowledge. To address this issue, we propose a two-stage, closed-book QA fra…

    Submitted 27 April, 2023; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: 8 pages; Accepted at EACL2023

  31. arXiv:2206.04658  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    BigVGAN: A Universal Neural Vocoder with Large-Scale Training

    Authors: Sang-gil Lee, Wei Ping, Boris Ginsburg, Bryan Catanzaro, Sungroh Yoon

    Abstract: Despite recent progress in generative adversarial network (GAN)-based vocoders, where the model generates raw waveform conditioned on acoustic features, it is challenging to synthesize high-fidelity audio for numerous speakers across various recording environments. In this work, we present BigVGAN, a universal vocoder that generalizes well for various out-of-distribution scenarios without fine-tun…

    Submitted 16 February, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: To appear at ICLR 2023. Listen to audio samples from BigVGAN at: https://bigvgan-demo.github.io/

  32. arXiv:2206.04624  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Factuality Enhanced Language Models for Open-Ended Text Generation

    Authors: Nayeon Lee, Wei Ping, Peng Xu, Mostofa Patwary, Pascale Fung, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Pretrained language models (LMs) are susceptible to generate text with nonfactual information. In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FactualityPrompts test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B…

    Submitted 2 March, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: NeurIPS 2022

  33. PrefixRL: Optimization of Parallel Prefix Circuits using Deep Reinforcement Learning

    Authors: Rajarshi Roy, Jonathan Raiman, Neel Kant, Ilyas Elkin, Robert Kirby, Michael Siu, Stuart Oberman, Saad Godil, Bryan Catanzaro

    Abstract: In this work, we present a reinforcement learning (RL) based approach to designing parallel prefix circuits such as adders or priority encoders that are fundamental to high-performance digital design. Unlike prior methods, our approach designs solutions tabula rasa purely through learning with synthesis in the loop. We design a grid-based state-action representation and an RL environment for const…

    Submitted 14 May, 2022; originally announced May 2022.

    Comments: Copyright 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

    Journal ref: ACM/IEEE Design Automation Conference (DAC), 2021, pp. 853-858

  34. arXiv:2205.05198  [pdf, other]

    cs.LG cs.CL

    Reducing Activation Recomputation in Large Transformer Models

    Authors: Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capacity constraints. Rather than storing activations for backpropagation, they are traditionally recomp…

    Submitted 10 May, 2022; originally announced May 2022.

  35. arXiv:2203.09362  [pdf, other]

    cs.CV

    Fine Detailed Texture Learning for 3D Meshes with Generative Models

    Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

    Abstract: This paper presents a method to reconstruct high-quality textured 3D models from both multi-view and single-view images. The reconstruction is posed as an adaptation problem and is done progressively where in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network. In the generative learning pipeli…

    Submitted 17 March, 2022; originally announced March 2022.

  36. arXiv:2203.08745  [pdf, other]

    cs.CL cs.AI

    Multi-Stage Prompting for Knowledgeable Dialogue Generation

    Authors: Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai Prabhumoye, Wei Ping, Mohammad Shoeybi, Bryan Catanzaro

    Abstract: Existing knowledge-grounded dialogue systems typically use finetuned versions of a pretrained language model (LM) and large-scale knowledge bases. These models typically fail to generalize on topics outside of the knowledge base, and require maintaining separate potentially large checkpoints each time finetuning is needed. In this paper, we aim to address these limitations by leveraging the inhere…

    Submitted 16 March, 2022; originally announced March 2022.

  37. arXiv:2203.01786  [pdf, other]

    cs.SD cs.LG eess.AS

    Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows

    Authors: Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro

    Abstract: Despite recent advances in generative modeling for text-to-speech synthesis, these models do not yet have the same fine-grained adjustability of pitch-conditioned deterministic models such as FastPitch and FastSpeech2. Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting. Our work explores several techniques for ha…

    Submitted 27 June, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: 22 pages, 11 figures, 3 tables

  38. arXiv:2202.07790  [pdf, other]

    cs.SD cs.LG eess.AS

    Speech Denoising in the Waveform Domain with Self-Attention

    Authors: Zhifeng Kong, Wei Ping, Ambrish Dantrey, Bryan Catanzaro

    Abstract: In this work, we present CleanUNet, a causal speech denoising model on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks to refine its bottleneck representations, which is crucial to obtain good results. The model is optimized through a set of losses defined over both waveform and multi-resolution spectrograms. The proposed…

    Submitted 6 July, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Published in ICASSP 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Listen to audio samples from CleanUNet at: https://cleanunet.github.io/

  39. arXiv:2202.04173  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models

    Authors: Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, Bryan Catanzaro

    Abstract: Pre-trained language models (LMs) are shown to easily generate toxic language. In this work, we systematically explore domain-adaptive training to reduce the toxicity of language models. We conduct this study on three dimensions: training corpus, model size, and parameter efficiency. For the training corpus, we propose to leverage the generative power of LMs and generate nontoxic datasets for doma…

    Submitted 21 October, 2022; v1 submitted 8 February, 2022; originally announced February 2022.

    Comments: NeurIPS 2022

  40. arXiv:2202.00011  [pdf, other]

    eess.IV cs.CV cs.LG

    Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement

    Authors: Max Ehrlich, Jon Barker, Namitha Padmanabhan, Larry Davis, Andrew Tao, Bryan Catanzaro, Abhinav Shrivastava

    Abstract: Video compression is a central feature of the modern internet powering technologies from social media to video conferencing. While video compression continues to mature, for many compression settings, quality loss is still noticeable. These settings nevertheless have important applications to the efficient transmission of videos over bandwidth constrained or otherwise unstable connections. In this…

    Submitted 30 October, 2023; v1 submitted 31 January, 2022; originally announced February 2022.

    Comments: WACV 2024

  41. arXiv:2201.11990  [pdf, other]

    cs.CL

    Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model

    Authors: Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zhang, Rewon Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He, Michael Houston, Saurabh Tiwary, Bryan Catanzaro

    Abstract: Pretrained general-purpose language models can achieve state-of-the-art accuracies in various natural language processing domains by adapting to downstream tasks via zero-shot, few-shot and fine-tuning techniques. Because of their success, the size of these models has increased rapidly, requiring high-performance hardware, software, and algorithmic techniques to enable training such large models.…

    Submitted 4 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: Shaden Smith and Mostofa Patwary contributed equally

  42. arXiv:2112.07868  [pdf, other]

    cs.CL cs.AI

    Few-shot Instruction Prompts for Pretrained Language Models to Detect Social Biases

    Authors: Shrimai Prabhumoye, Rafal Kocielnik, Mohammad Shoeybi, Anima Anandkumar, Bryan Catanzaro

    Abstract: Detecting social bias in text is challenging due to nuance, subjectivity, and difficulty in obtaining good quality labeled datasets at scale, especially given the evolving nature of social biases and society. To address these challenges, we propose a few-shot instruction-based method for prompting pre-trained language models (LMs). We select a few class-balanced exemplars from a small support repo…

    Submitted 15 April, 2022; v1 submitted 14 December, 2021; originally announced December 2021.

    Comments: Submission revised with new results

  43. arXiv:2111.13587  [pdf, other]

    cs.CV cs.LG

    Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

    Authors: John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

    Abstract: Vision transformers have delivered tremendous success in representation learning. This is primarily due to effective token mixing through self attention. However, this scales quadratically with the number of pixels, which becomes infeasible for high-resolution inputs. To cope with this challenge, we propose Adaptive Fourier Neural Operator (AFNO) as an efficient token mixer that learns to mix in t…

    Submitted 27 March, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

  44. arXiv:2109.02631  [pdf, other]

    cs.LG

    Guiding Global Placement With Reinforcement Learning

    Authors: Robert Kirby, Kolby Nottingham, Rajarshi Roy, Saad Godil, Bryan Catanzaro

    Abstract: Recent advances in GPU accelerated global and detail placement have reduced the time to solution by an order of magnitude. This advancement allows us to leverage data driven optimization (such as Reinforcement Learning) in an effort to improve the final quality of placement results. In this work we augment state-of-the-art, force-based global placement solvers with a reinforcement learning agent t…

    Submitted 6 September, 2021; originally announced September 2021.

    ACM Class: B.7.2

  45. arXiv:2108.10447  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    One TTS Alignment To Rule Them All

    Authors: Rohan Badlani, Adrian Łancucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro

    Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durati…

    Submitted 23 August, 2021; originally announced August 2021.

  46. arXiv:2107.02192  [pdf, other]

    cs.CV cs.CL cs.LG cs.MM

    Long-Short Transformer: Efficient Transformers for Language and Vision

    Authors: Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

    Abstract: Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because the self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-att…

    Submitted 7 December, 2021; v1 submitted 5 July, 2021; originally announced July 2021.

    Comments: Published at NeurIPS 2021

  47. arXiv:2106.06533  [pdf, other]

    cs.CV cs.AI cs.GR cs.LG

    View Generalization for Single Image Textured 3D Models

    Authors: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro

    Abstract: Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training err…

    Submitted 10 June, 2021; originally announced June 2021.

    Comments: CVPR 2021. Project website: https://nv-adlr.github.io/view-generalization

  48. arXiv:2104.04473  [pdf, other]

    cs.CL cs.DC

    Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    Authors: Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia

    Abstract: Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently…

    Submitted 23 August, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Accepted to SC 2021

  49. arXiv:2103.16748  [pdf, other]

    cs.CV cs.GR

    Dual Contrastive Loss and Attention for GANs

    Authors: Ning Yu, Guilin Liu, Aysegul Dundar, Andrew Tao, Bryan Catanzaro, Larry Davis, Mario Fritz

    Abstract: Generative Adversarial Networks (GANs) produce impressive results on unconditional image generation when powered with large-scale image datasets. Yet generated images are still easy to spot especially on datasets with high variance (e.g. bedroom, church). In this paper, we propose various improvements to further push the boundaries in image generation. Specifically, we propose a novel dual contras…

    Submitted 17 March, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Accepted to ICCV'21

  50. arXiv:2101.00408  [pdf, other]

    cs.CL cs.AI

    End-to-End Training of Neural Retrievers for Open-Domain Question Answering

    Authors: Devendra Singh Sachan, Mostofa Patwary, Mohammad Shoeybi, Neel Kant, Wei Ping, William L Hamilton, Bryan Catanzaro

    Abstract: Recent work on training neural retrievers for open-domain question answering (OpenQA) has employed both supervised and unsupervised approaches. However, it remains unclear how unsupervised and supervised methods can be used most effectively for neural retrievers. In this work, we systematically study retriever pre-training. We first propose an approach of unsupervised pre-training with the Inverse…

    Submitted 1 June, 2021; v1 submitted 2 January, 2021; originally announced January 2021.

    Comments: ACL 2021