Showing 1–50 of 303 results for author: Manocha, D

  1. arXiv:2407.01851  [pdf, other]

    cs.CV cs.AI cs.LG eess.AS

    Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

    Authors: Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha

    Abstract: Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained un…

    Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: Accepted at ECCV 2024

  2. arXiv:2406.18068  [pdf, other]

    cs.CV

    Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs

    Authors: Uttaran Bhattacharya, Aniket Bera, Dinesh Manocha

    Abstract: We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a tok…

    Submitted 26 June, 2024; originally announced June 2024.

    Comments: 14 pages, 7 figures, 2 tables

    Journal ref: In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 1st Workshop on Human Motion Generation, 2024, Seattle, Washington, USA

  3. arXiv:2406.13683  [pdf, other]

    cs.CV cs.AI

    IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

    Authors: Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha

    Abstract: Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream tasks. However, to obtain strong downstream performances, prompts need to be carefully curated, which can be a tedious engineering task. To address the issue of manual prompt engineering, prompt-tuning is used where a set of contextual vectors are learned by le…

    Submitted 19 June, 2024; originally announced June 2024.

  4. arXiv:2406.11768  [pdf, other]

    cs.SD cs.AI cs.CL eess.AS

    GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

    Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chandra Kiran Reddy Evuru, Utkarsh Tyagi, S Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

    Abstract: Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including feat…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Project Website: https://sreyan88.github.io/gamaaudio/

  5. arXiv:2406.10918  [pdf, other]

    cs.LG cs.AI cs.CL

    Embodied Question Answering via Multi-LLM Systems

    Authors: Bhrij Patel, Vishnu Sashank Dorbala, Dinesh Manocha, Amrit Singh Bedi

    Abstract: Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language models (LLM) based agents independen…

    Submitted 25 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: 17 pages, 13 Figures, 4 Tables

  6. arXiv:2406.10900  [pdf, other]

    cs.CV cs.CL

    AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models

    Authors: Xiyang Wu, Tianrui Guan, Dianqi Li, Shuaiyi Huang, Xiaoyu Liu, Xijun Wang, Ruiqi Xian, Abhinav Shrivastava, Furong Huang, Jordan Lee Boyd-Graber, Tianyi Zhou, Dinesh Manocha

    Abstract: Large vision-language models (LVLMs) hallucinate: certain context cues in an image may trigger the language module's overconfident and incorrect reasoning on abnormal or hypothetical objects. Though a few benchmarks have been developed to investigate LVLM hallucinations, they mainly rely on hand-crafted corner cases whose fail patterns may hardly generalize, and finetuning on them could undermine…

    Submitted 16 June, 2024; originally announced June 2024.

  7. arXiv:2406.04673  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models

    Authors: Sanjoy Chowdhury, Sayan Nag, K J Joseph, Balaji Vasan Srinivasan, Dinesh Manocha

    Abstract: Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script, but also through visualizations, w…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: Accepted at CVPR 2024 as Highlight paper. Webpage: https://schowdhury671.github.io/melfusion_cvpr2024/

  8. arXiv:2406.04432  [pdf, other]

    eess.AS cs.AI cs.CL

    LipGER: Visually-Conditioned Generative Error Correction for Robust Automatic Speech Recognition

    Authors: Sreyan Ghosh, Sonal Kumar, Ashish Seth, Purva Chiniya, Utkarsh Tyagi, Ramani Duraiswami, Dinesh Manocha

    Abstract: Visual cues, like lip motion, have been shown to improve the performance of Automatic Speech Recognition (ASR) systems in noisy environments. We propose LipGER (Lip Motion aided Generative Error Correction), a novel framework for leveraging visual cues for noise-robust ASR. Instead of learning the cross-modal correlation between the audio and visual modalities, we make an LLM learn the task of vis…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: InterSpeech 2024. Code and Data: https://github.com/Sreyan88/LipGER

  9. arXiv:2406.04286  [pdf, other]

    cs.CL cs.AI

    ABEX: Data Augmentation for Low-Resource NLU via Expanding Abstract Descriptions

    Authors: Sreyan Ghosh, Utkarsh Tyagi, Sonal Kumar, C. K. Evuru, S Ramaneswaran, S Sakshi, Dinesh Manocha

    Abstract: We present ABEX, a novel and effective generative data augmentation methodology for low-resource Natural Language Understanding (NLU) tasks. ABEX is based on ABstract-and-EXpand, a novel paradigm for generating diverse forms of an input document -- we first convert a document into its concise, abstract description and then generate new documents based on expanding the resultant abstraction. To lea…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: ACL 2024 Main Conference. Code and data: https://github.com/Sreyan88/ABEX

  10. arXiv:2405.20495  [pdf, other]

    cs.CL cs.LG

    Transfer Q Star: Principled Decoding for LLM Alignment

    Authors: Souradip Chakraborty, Soumya Suvra Ghosal, Ming Yin, Dinesh Manocha, Mengdi Wang, Amrit Singh Bedi, Furong Huang

    Abstract: Aligning foundation models is essential for their safe and trustworthy deployment. However, traditional fine-tuning methods are computationally intensive and require updating billions of model parameters. A promising alternative, alignment via decoding, adjusts the response distribution directly without model updates to maximize a target reward $r$, thus providing a lightweight and adaptable frame…

    Submitted 30 May, 2024; originally announced May 2024.

  11. arXiv:2405.17366  [pdf, other]

    cs.LG eess.SP

    EM-GANSim: Real-time and Accurate EM Simulation Using Conditional GANs for 3D Indoor Scenes

    Authors: Ruichen Wang, Dinesh Manocha

    Abstract: We present a novel machine-learning (ML) approach (EM-GANSim) for real-time electromagnetic (EM) propagation that is used for wireless communication simulation in 3D indoor environments. Our approach uses a modified conditional Generative Adversarial Network (GAN) that incorporates encoded geometry and transmitter location while adhering to the electromagnetic propagation theory. The overall physi…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 10 pages, 8 figures, 5 tables

  12. arXiv:2405.16430  [pdf, other]

    cs.RO cs.MA

    GAMEOPT+: Improving Fuel Efficiency in Unregulated Heterogeneous Traffic Intersections via Optimal Multi-agent Cooperative Control

    Authors: Nilesh Suriyarachchi, Rohan Chandra, Arya Anantula, John S. Baras, Dinesh Manocha

    Abstract: Better fuel efficiency leads to better financial security as well as a cleaner environment. We propose a novel approach for improving fuel efficiency in unstructured and unregulated traffic environments. Existing intelligent transportation solutions for improving fuel efficiency, however, apply only to traffic intersections with sparse traffic or traffic where drivers obey the regulations, or both…

    Submitted 26 May, 2024; originally announced May 2024.

    Comments: Journal Version

  13. arXiv:2405.15683  [pdf, other]

    cs.CV cs.AI cs.CL

    VDGD: Mitigating LVLM Hallucinations in Cognitive Prompts by Bridging the Visual Perception Gap

    Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Utkarsh Tyagi, Oriol Nieto, Zeyu Jin, Dinesh Manocha

    Abstract: Recent interest in Large Vision-Language Models (LVLMs) for practical applications is moderated by the significant challenge of hallucination or the inconsistency between the factual information and the generated text. In this paper, we first perform an in-depth analysis of hallucinations and discover several novel insights about how and when LVLMs hallucinate. From our analysis, we show that: (1)…

    Submitted 24 May, 2024; originally announced May 2024.

    Comments: Preprint. Under review. Code will be released on paper acceptance

  14. arXiv:2405.13951  [pdf, other]

    cs.CV

    Text Prompting for Multi-Concept Video Customization by Autoregressive Generation

    Authors: Divya Kothandaraman, Kihyuk Sohn, Ruben Villegas, Paul Voigtlaender, Dinesh Manocha, Mohammad Babaeizadeh

    Abstract: We present a method for multi-concept customization of pretrained text-to-video (T2V) models. Intuitively, the multi-concept customized video can be derived from the (non-linear) intersection of the video manifolds of the individual concepts, which is not straightforward to find. We hypothesize that sequential and controlled walking towards the intersection of the video manifolds, directed by text…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Paper accepted to AI4CC Workshop at CVPR 2024

  15. arXiv:2405.13685  [pdf, other]

    cs.CV

    Prompt Mixing in Diffusion Models using the Black Scholes Algorithm

    Authors: Divya Kothandaraman, Ming Lin, Dinesh Manocha

    Abstract: We introduce a novel approach for prompt mixing, aiming to generate images at the intersection of multiple text prompts using pre-trained text-to-image diffusion models. At each time step during diffusion denoising, our algorithm forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions. To do so, we leverage the connection between diffusion models (rooted in…

    Submitted 22 May, 2024; originally announced May 2024.

  16. arXiv:2405.05363  [pdf, other]

    cs.CV cs.RO

    LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

    Authors: Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

    Abstract: In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stabilit…

    Submitted 8 May, 2024; originally announced May 2024.

    Comments: Accepted to ICRA 2024

  17. arXiv:2405.04732  [pdf, other]

    cs.RO cs.AI

    S-EQA: Tackling Situational Queries in Embodied Question Answering

    Authors: Vishnu Sashank Dorbala, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, Reza Ghanadan

    Abstract: We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just wha…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 8 Pages

  18. arXiv:2405.02762  [pdf, other]

    cs.CV cs.LG cs.RO

    TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes

    Authors: Christopher Maxey, Jaehoon Choi, Yonghan Lee, Hyungtae Lee, Dinesh Manocha, Heesung Kwon

    Abstract: In this paper, we present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception. Our formulation is designed for dynamic scenes, consisting of moving objects or human actions, where the goal is to recognize the pose or actions. We propose an extension of K-Planes Neural Radiance Field (NeRF), wherein our algorithm store…

    Submitted 4 May, 2024; originally announced May 2024.

    Comments: 8 pages, submitted to IROS2024

  19. arXiv:2404.08827  [pdf, other]

    cs.RO cs.CV

    "Don't forget to put the milk back!" Dataset for Enabling Embodied Agents to Detect Anomalous Situations

    Authors: James F. Mullen Jr, Prasoon Goyal, Robinson Piramuthu, Michael Johnston, Dinesh Manocha, Reza Ghanadan

    Abstract: Home robots intend to make their users' lives easier. Our work assists in this goal by enabling robots to inform their users of dangerous or unsanitary anomalies in their home. Some examples of these anomalies include the user leaving their milk out, forgetting to turn off the stove, or leaving poison accessible to children. To move towards enabling home robots with these abilities, we have created…

    Submitted 12 April, 2024; originally announced April 2024.

  20. arXiv:2404.03187  [pdf, other]

    cs.CV

    AGL-NET: Aerial-Ground Cross-Modal Global Localization with Varying Scales

    Authors: Tianrui Guan, Ruiqi Xian, Xijun Wang, Xiyang Wu, Mohamed Elnoor, Daeun Song, Dinesh Manocha

    Abstract: We present AGL-NET, a novel learning-based method for global localization using LiDAR point clouds and satellite maps. AGL-NET tackles two critical challenges: bridging the representation gap between image and points modalities for robust feature matching, and handling inherent scale discrepancies between global view and local view. To address these challenges, AGL-NET leverages a unified network…

    Submitted 4 April, 2024; originally announced April 2024.

  21. arXiv:2404.02885  [pdf, other]

    cs.CV

    PoCo: Point Context Cluster for RGBD Indoor Place Recognition

    Authors: Jing Liang, Zhuo Deng, Zheming Zhou, Omid Ghasemalizadeh, Dinesh Manocha, Min Sun, Cheng-Hao Kuo, Arnie Sen

    Abstract: We present a novel end-to-end algorithm (PoCo) for the indoor RGB-D place recognition task, aimed at identifying the most likely match for a given query frame within a reference database. The task presents inherent challenges attributed to the constrained field of view and limited range of perception sensors. We propose a new network architecture, which generalizes the recent Context of Clusters (…

    Submitted 3 April, 2024; originally announced April 2024.

  22. arXiv:2404.00419  [pdf, other]

    cs.CV cs.CL

    Do Vision-Language Models Understand Compound Nouns?

    Authors: Sonal Kumar, Sreyan Ghosh, S Sakshi, Utkarsh Tyagi, Dinesh Manocha

    Abstract: Open-vocabulary vision-language models (VLMs) like CLIP, trained using contrastive loss, have emerged as a promising new paradigm for text-to-image retrieval. However, do VLMs understand compound nouns (CNs) (e.g., lab coat) as well as they understand nouns (e.g., lab)? We curate Compun, a novel benchmark with 400 unique and commonly used CNs, to evaluate the effectiveness of VLMs in interpreting…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: Accepted to NAACL 2024 Main Conference

  23. arXiv:2404.00415  [pdf, other]

    cs.CL

    CoDa: Constrained Generation based Data Augmentation for Low-Resource NLP

    Authors: Chandra Kiran Reddy Evuru, Sreyan Ghosh, Sonal Kumar, Ramaneswaran S, Utkarsh Tyagi, Dinesh Manocha

    Abstract: We present CoDa (Constrained Generation based Data Augmentation), a controllable, effective, and training-free data augmentation technique for low-resource (data-scarce) NLP. Our approach is based on prompting off-the-shelf instruction-following Large Language Models (LLMs) for generating text that satisfies a set of constraints. Precisely, we extract a set of simple constraints from every instanc…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: Accepted to NAACL 2024 Findings

  24. arXiv:2404.00210  [pdf, other]

    cs.RO

    VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models

    Authors: Daeun Song, Jing Liang, Amirreza Payandeh, Xuesu Xiao, Dinesh Manocha

    Abstract: We propose VLM-Social-Nav, a novel Vision-Language Model (VLM) based navigation approach to compute a robot's motion in human-centered environments. Our goal is to make real-time decisions on robot actions that are socially compliant with human expectations. We utilize a perception model to detect important social entities and prompt a VLM to generate guidance for socially compliant robot behavior…

    Submitted 7 July, 2024; v1 submitted 29 March, 2024; originally announced April 2024.

  25. arXiv:2403.15637  [pdf, other]

    cs.RO

    CoNVOI: Context-aware Navigation using Vision Language Models in Outdoor and Indoor Environments

    Authors: Adarsh Jagan Sathyamoorthy, Kasun Weerakoon, Mohamed Elnoor, Anuj Zore, Brian Ichter, Fei Xia, Jie Tan, Wenhao Yu, Dinesh Manocha

    Abstract: We present CoNVOI, a novel method for autonomous robot navigation in real-world indoor and outdoor environments using Vision Language Models (VLMs). We employ VLMs in two ways: first, we leverage their zero-shot image classification capability to identify the context or scenario (e.g., indoor corridor, outdoor terrain, crosswalk, etc.) of the robot's surroundings, and formulate context-based naviga…

    Submitted 22 March, 2024; originally announced March 2024.

    Comments: 9 pages, 4 figures

  26. arXiv:2403.13235  [pdf, other]

    cs.RO

    AMCO: Adaptive Multimodal Coupling of Vision and Proprioception for Quadruped Robot Navigation in Outdoor Environments

    Authors: Mohamed Elnoor, Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Tianrui Guan, Vignesh Rajagopal, Dinesh Manocha

    Abstract: We present AMCO, a novel navigation method for quadruped robots that adaptively combines vision-based and proprioception-based perception capabilities. Our approach uses three cost maps (a general knowledge map, a traversability history map, and a current proprioception map), which are derived from a robot's vision and proprioception data, and couples them to obtain a coupled traversability cost map for…

    Submitted 19 March, 2024; originally announced March 2024.

    Comments: 8 pages

  27. arXiv:2403.13198  [pdf, other]

    cs.RO

    Towards Robots That Know When They Need Help: Affordance-Based Uncertainty for Large Language Model Planners

    Authors: James F. Mullen Jr., Dinesh Manocha

    Abstract: Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in consumer robotics where LLM hallucinations may result in robots confidently executing plans that are contrary to user goals, relying more frequently on human assistance, or preventing the robot from asking for help at…

    Submitted 19 March, 2024; originally announced March 2024.

  28. arXiv:2403.11925  [pdf, other]

    cs.LG

    Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles

    Authors: Bhrij Patel, Wesley A. Suttle, Alec Koppel, Vaneet Aggarwal, Brian M. Sadler, Amrit Singh Bedi, Dinesh Manocha

    Abstract: In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating…

    Submitted 20 June, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: 26 Pages, 2 Figures

  29. arXiv:2403.11487  [pdf, other]

    cs.RO cs.AI

    Can LLMs Generate Human-Like Wayfinding Instructions? Towards Platform-Agnostic Embodied Instruction Synthesis

    Authors: Vishnu Sashank Dorbala, Sanjoy Chowdhury, Dinesh Manocha

    Abstract: We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches that are heavily reliant on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions using just a few references. Using an LLM-based Visual Question…

    Submitted 2 April, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: 14 Pages

  30. arXiv:2403.09905  [pdf, other]

    cs.RO cs.CV

    Right Place, Right Time! Towards ObjectNav for Non-Stationary Goals

    Authors: Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, Dinesh Manocha

    Abstract: We present a novel approach to tackle the ObjectNav task for non-stationary and potentially occluded targets in an indoor environment. We refer to this task as Portable ObjectNav (or P-ObjectNav), and in this work, present its formulation, feasibility, and a navigation benchmark using a novel memory-enhanced LLM-based policy. In contrast to ObjNav where target object locations are fixed for each epis…

    Submitted 14 March, 2024; originally announced March 2024.

    Comments: 32

  31. arXiv:2403.09900  [pdf, other]

    cs.RO

    DTG: Diffusion-based Trajectory Generation for Mapless Global Navigation

    Authors: Jing Liang, Amirreza Payandeh, Daeun Song, Xuesu Xiao, Dinesh Manocha

    Abstract: We present a novel end-to-end diffusion-based trajectory generation method, DTG, for mapless global navigation in challenging outdoor scenarios with occlusions and unstructured off-road features like grass, buildings, bushes, etc. Given a distant goal, our approach computes a trajectory that satisfies the following goals: (1) minimize the travel distance to the goal; (2) maximize the traversabilit…

    Submitted 24 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: 10 pages

  32. arXiv:2403.08936  [pdf, other]

    cs.MA cs.AI cs.RO

    Beyond Joint Demonstrations: Personalized Expert Guidance for Efficient Multi-Agent Reinforcement Learning

    Authors: Peihong Yu, Manav Mishra, Alec Koppel, Carl Busart, Priya Narayan, Dinesh Manocha, Amrit Bedi, Pratap Tokekar

    Abstract: Multi-Agent Reinforcement Learning (MARL) algorithms face the challenge of efficient exploration due to the exponential increase in the size of the joint state-action space. While demonstration-guided learning has proven beneficial in single-agent settings, its direct applicability to MARL is hindered by the practical difficulty of obtaining joint expert demonstrations. In this work, we introduce…

    Submitted 13 March, 2024; originally announced March 2024.

  33. arXiv:2402.10340  [pdf, other]

    cs.RO cs.AI

    Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics

    Authors: Xiyang Wu, Souradip Chakraborty, Ruiqi Xian, Jing Liang, Tianrui Guan, Fuxiao Liu, Brian M. Sadler, Dinesh Manocha, Amrit Singh Bedi

    Abstract: In this paper, we highlight the critical issues of robustness and safety associated with integrating large language models (LLMs) and vision-language models (VLMs) into robotics applications. Recent works focus on using LLMs and VLMs to improve the performance of robotics tasks, such as manipulation and navigation. Despite these improvements, analyzing the safety of such systems remains underexplo…

    Submitted 16 June, 2024; v1 submitted 15 February, 2024; originally announced February 2024.

  34. arXiv:2402.08925  [pdf, other]

    cs.CL cs.AI cs.LG cs.RO

    MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with Diverse Human Preferences

    Authors: Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

    Abstract: Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. However, such an approach overlooks the rich diversity of human preferences inherent in data collected from multiple users. In this work, we first derive an impossibility result of alignment with single reward RLHF, thereby highlighting it…

    Submitted 13 February, 2024; originally announced February 2024.

  35. arXiv:2402.07916  [pdf, other]

    cs.HC cs.GR

    Perceptual Thresholds for Radial Optic Flow Distortion in Near-Eye Stereoscopic Displays

    Authors: Mohammad R. Saeedpour-Parizi, Niall L. Williams, Tim Wong, Phillip Guan, Dinesh Manocha, Ian M. Erkelens

    Abstract: We provide the first perceptual quantification of users' sensitivity to radial optic flow artifacts and demonstrate a promising approach for masking this optic flow artifact via blink suppression. Near-eye HMDs allow users to feel immersed in virtual environments by providing visual cues, like motion parallax and stereoscopy, that mimic how we view the physical world. However, these systems exhibi…

    Submitted 1 February, 2024; originally announced February 2024.

  36. arXiv:2402.05119  [pdf, other]

    cs.CL cs.AI

    A Closer Look at the Limitations of Instruction Tuning

    Authors: Sreyan Ghosh, Chandra Kiran Reddy Evuru, Sonal Kumar, Ramaneswaran S, Deepali Aneja, Zeyu Jin, Ramani Duraiswami, Dinesh Manocha

    Abstract: Instruction Tuning (IT), the process of training large language models (LLMs) using instruction-response pairs, has emerged as the predominant method for transforming base pre-trained LLMs into open-domain conversational agents. While IT has achieved notable success and widespread adoption, its limitations and shortcomings remain underexplored. In this paper, through rigorous experiments and an in…

    Submitted 27 May, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted at ICML 2024

  37. DocuBits: VR Document Decomposition for Procedural Task Completion

    Authors: Geonsun Lee, Jennifer Healey, Dinesh Manocha

    Abstract: Reading monolithic instructional documents in VR is often challenging, especially when tasks are collaborative. Here we present DocuBits, a novel method for transforming monolithic documents into small, interactive instructional elements. Our approach allows users to: (i) create instructional elements, (ii) position them within VR, and (iii) use them to monitor and share progress in a multi-user VR l…

    Submitted 27 January, 2024; originally announced January 2024.

  38. "May I Speak?": Multi-modal Attention Guidance in Social VR Group Conversations

    Authors: Geonsun Lee, Dae Yeol Lee, Guan-Ming Su, Dinesh Manocha

    Abstract: In this paper, we present a novel multi-modal attention guidance method designed to address the challenges of turn-taking dynamics in meetings and enhance group conversations within virtual reality (VR) environments. Recognizing the difficulties posed by a confined field of view and the absence of detailed gesture tracking in VR, our proposed method aims to mitigate the challenges of noticing new…

    Submitted 27 January, 2024; originally announced January 2024.

  39. arXiv:2312.14436  [pdf, other]

    cs.RO cs.LG

    REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback

    Authors: Souradip Chakraborty, Anukriti Singh, Amisha Bhaskar, Pratap Tokekar, Dinesh Manocha, Amrit Singh Bedi

    Abstract: The effectiveness of reinforcement learning (RL) agents in continuous control robotics tasks is heavily dependent on the design of the underlying reward function. However, a misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world. Current methods to mitigate this misalignment work by learning reward functions from human preference…

    Submitted 14 April, 2024; v1 submitted 21 December, 2023; originally announced December 2023.

  40. arXiv:2312.13026  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    FusDom: Combining In-Domain and Out-of-Domain Knowledge for Continuous Self-Supervised Learning

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: Continued pre-training (CP) offers multiple advantages, like target domain adaptation and the potential to exploit the continuous stream of unlabeled data available online. However, continued pre-training on out-of-domain distributions often leads to catastrophic forgetting of previously acquired knowledge, leading to sub-optimal ASR performance. This paper presents FusDom, a simple and novel meth…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted at ICASSP 2024. Code: https://github.com/cs20s030/fusdom

  41. arXiv:2312.12783  [pdf, other]

    eess.AS cs.AI cs.CL cs.SD

    Stable Distillation: Regularizing Continued Pre-training for Low-Resource Automatic Speech Recognition

    Authors: Ashish Seth, Sreyan Ghosh, S. Umesh, Dinesh Manocha

    Abstract: Continued self-supervised (SSL) pre-training for adapting existing SSL models to the target domain has been shown to be extremely effective for low-resource Automatic Speech Recognition (ASR). This paper proposes Stable Distillation, a simple and novel approach for SSL-based continued pre-training that boosts ASR performance in the target domain where both labeled and unlabeled data are limited. Stable…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted to ICASSP 2024. Code: https://github.com/cs20s030/stable_distillation

  42. arXiv:2312.01564  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV

    APoLLo: Unified Adapter and Prompt Learning for Vision Language Models

    Authors: Sanjoy Chowdhury, Sayan Nag, Dinesh Manocha

    Abstract: The choice of input text prompt plays a critical role in the performance of Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. Our method is designed to substantially improve the generalization capabilities of VLP models when they are fine-tuned in a few-shot setting. We intro…

    Submitted 3 December, 2023; originally announced December 2023.

    Comments: Accepted at EMNLP 2023 (Main track)

  43. arXiv:2312.00834  [pdf, other]

    cs.SD cs.CV

    AV-RIR: Audio-Visual Room Impulse Response Estimation

    Authors: Anton Ratnarajah, Sreyan Ghosh, Sonal Kumar, Purva Chiniya, Dinesh Manocha

    Abstract: Accurate estimation of Room Impulse Response (RIR), which captures an environment's acoustic properties, is important for speech processing and AR/VR applications. We propose AV-RIR, a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based…

    Submitted 23 April, 2024; v1 submitted 30 November, 2023; originally announced December 2023.

    Comments: Accepted to CVPR 2024

  44. arXiv:2311.15478  [pdf, other]

    cs.CV

    HawkI: Homography & Mutual Information Guidance for 3D-free Single Image to Aerial View

    Authors: Divya Kothandaraman, Tianyi Zhou, Ming Lin, Dinesh Manocha

    Abstract: We present HawkI, a method for synthesizing aerial-view images from text and an exemplar image, without any additional multi-view or 3D information for finetuning or at inference. HawkI uses techniques from classical computer vision and information theory. It seamlessly blends the visual features from the input image within a pretrained text-to-2D image stable diffusion model with a test-time optimization p…

    Submitted 13 March, 2024; v1 submitted 26 November, 2023; originally announced November 2023.

  45. arXiv:2311.08740  [pdf, other]

    cs.RO

    AdVENTR: Autonomous Robot Navigation in Complex Outdoor Environments

    Authors: Kasun Weerakoon, Adarsh Jagan Sathyamoorthy, Mohamed Elnoor, Dinesh Manocha

    Abstract: We present a novel system, AdVENTR, for autonomous robot navigation in unstructured outdoor environments that consist of uneven and vegetated terrains. Our approach is general and can enable both wheeled and legged robots to handle outdoor terrain complexity including unevenness, surface properties like poor traction, granularity, obstacle stiffness, etc. We use data from sensors including RGB came…

    Submitted 15 November, 2023; originally announced November 2023.

  46. arXiv:2310.16255  [pdf, other]

    cs.CV

    UAV-Sim: NeRF-based Synthetic Data Generation for UAV-based Perception

    Authors: Christopher Maxey, Jaehoon Choi, Hyungtae Lee, Dinesh Manocha, Heesung Kwon

    Abstract: Tremendous variations coupled with large degrees of freedom in UAV-based imaging conditions lead to a significant lack of data in adequately learning UAV-based perception models. Using various synthetic renderers in conjunction with perception models is prevalent to create synthetic data to augment the learning in the ground-based imaging domain. However, severe challenges in the austere UAV-based…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Video Link: https://www.youtube.com/watch?v=ucPzbPLqqpI

  47. arXiv:2310.15799  [pdf, other]

    cs.CL cs.AI

    DALE: Generative Data Augmentation for Low-Resource Legal NLP

    Authors: Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar, S Ramaneswaran, S Sakshi, Utkarsh Tyagi, Dinesh Manocha

    Abstract: We present DALE, a novel and effective generative Data Augmentation framework for low-resource LEgal NLP. DALE addresses the challenges existing frameworks pose in generating effective data augmentations of legal documents - legal language, with its specialized vocabulary and complex semantics, morphology, and syntax, does not benefit from data augmentations that merely rephrase the source sentenc…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted to EMNLP 2023 Main Conference. Code: https://github.com/Sreyan88/DALE

  48. arXiv:2310.15264  [pdf, other]

    cs.CL cs.AI

    Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey

    Authors: Soumya Suvra Ghosal, Souradip Chakraborty, Jonas Geiping, Furong Huang, Dinesh Manocha, Amrit Singh Bedi

    Abstract: Large Language Models (LLMs) have revolutionized the domain of natural language processing (NLP) with remarkable capabilities of generating human-like text responses. However, despite these advancements, several works in the existing literature have raised serious concerns about the potential misuse of LLMs such as spreading misinformation, generating fake news, plagiarism in academia, and contami…

    Submitted 23 October, 2023; originally announced October 2023.

  49. arXiv:2310.14566  [pdf, other]

    cs.CV cs.CL

    HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

    Authors: Tianrui Guan, Fuxiao Liu, Xiyang Wu, Ruiqi Xian, Zongxia Li, Xiaoyu Liu, Xijun Wang, Lichang Chen, Furong Huang, Yaser Yacoob, Dinesh Manocha, Tianyi Zhou

    Abstract: We introduce HallusionBench, a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs), such as GPT-4V(Vision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129…

    Submitted 25 March, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

    Comments: Accepted to CVPR 2024

  50. arXiv:2310.10578  [pdf, other]

    eess.SP

    Indoor Wireless Signal Modeling with Smooth Surface Diffraction Effects

    Authors: Ruichen Wang, Samuel Audia, Dinesh Manocha

    Abstract: We present a novel algorithm that enhances the accuracy of electromagnetic field simulations in indoor environments by incorporating the Uniform Geometrical Theory of Diffraction (UTD) for surface diffraction. This additional diffraction phenomenology is important for the design of modern wireless systems and allows us to capture the effects of more complex scene geometries. Central to our methodo…

    Submitted 16 October, 2023; originally announced October 2023.

    Comments: 5 pages, 9 figures, conference