VQA-Diff: Exploiting VQA and Diffusion for Zero-Shot Image-to-3D Vehicle Asset Generation in Autonomous Driving

Authors: Yibo Liu, Zheyuan Yang, Guile Wu, Yuan Ren, Kejian Lin, Bingbing Liu, Yang Liu, Jinjun Shan

Abstract: Generating 3D vehicle assets from in-the-wild observations is crucial to autonomous driving. Existing image-to-3D methods cannot well address this problem because they learn generation merely from image RGB information without a deeper understanding of in-the-wild vehicles (such as car models, manufacturers, etc.). This leads to their poor zero-shot prediction capability to handle real-world observations with occlusion or tricky viewing angles. To solve this problem, in this work, we propose VQA-Diff, a novel framework that leverages in-the-wild vehicle images to create photorealistic 3D vehicle assets for autonomous driving. VQA-Diff exploits the real-world knowledge inherited from the Large Language Model in the Visual Question Answering (VQA) model for robust zero-shot prediction and the rich image prior knowledge in the Diffusion model for structure and appearance generation. In particular, we utilize a multi-expert Diffusion Models strategy to generate the structure information and employ a subject-driven structure-controlled generation mechanism to model appearance information. As a result, without the necessity to learn from a large-scale image-to-3D vehicle dataset collected from the real world, VQA-Diff still has a robust zero-shot image-to-novel-view generation ability. We conduct experiments on various datasets, including Pascal 3D+, Waymo, and Objaverse, to demonstrate that VQA-Diff outperforms existing state-of-the-art methods both qualitatively and quantitatively.

Submitted 10 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

arXiv:2407.05285 [pdf, other]

Gradient Diffusion: A Perturbation-Resilient Gradient Leakage Attack

Authors: Xuan Liu, Siqi Cai, Qihua Zhou, Song Guo, Ruibin Li, Kaiwei Lin

Abstract: Recent years have witnessed the vulnerability of Federated Learning (FL) against gradient leakage attacks, where the private training data can be recovered from the exchanged gradients, making gradient protection a critical issue for the FL training process. Existing solutions often resort to perturbation-based mechanisms, such as differential privacy, where each participating client injects a specific amount of noise into local gradients before aggregating to the server, and the global distribution variation finally conceals the gradient privacy. However, perturbation is not always the panacea for gradient protection since the robustness heavily relies on the injected noise. This intuition raises an interesting question: \textit{is it possible to deactivate existing protection mechanisms by removing the perturbation inside the gradients?} In this paper, we present the answer: \textit{yes} and propose the Perturbation-resilient Gradient Leakage Attack (PGLA), the first attempt to recover the perturbed gradients, without additional access to the original model structure or third-party data. Specifically, we leverage the inherent diffusion property of gradient perturbation protection and construct a novel diffusion-based denoising model to implement PGLA. Our insight is that capturing the disturbance level of perturbation during the diffusion reverse process can release the gradient denoising capability, which promotes the diffusion model to generate approximate gradients as the original clean version through adaptive sampling steps. Extensive experiments demonstrate that PGLA effectively recovers the protected gradients and exposes the FL training process to the threat of gradient leakage, achieving the best quality in gradient denoising and data recovery compared to existing models. We hope to arouse public attention on PGLA and its defense.

Submitted 7 July, 2024; originally announced July 2024.

arXiv:2406.16544 [pdf, other]

Hierarchical B-frame Video Coding for Long Group of Pictures

Authors: Ivan Kirillov, Denis Parkhomenko, Kirill Chernyshev, Alexander Pletnev, Yibo Shi, Kai Lin, Dmitry Babin

Abstract: Learned video compression methods already outperform VVC in the low-delay (LD) case, but the random-access (RA) scenario remains challenging. Most works on learned RA video compression either use HEVC as an anchor or compare it to VVC in specific test conditions, using RGB-PSNR metric instead of Y-PSNR and avoiding comprehensive evaluation. Here, we present an end-to-end learned video codec for random access that combines training on long sequences of frames, rate allocation designed for hierarchical coding and content adaptation on inference. We show that under common test conditions (JVET-CTC), it achieves results comparable to VTM (VVC reference software) in terms of YUV-PSNR BD-Rate on some classes of videos, and outperforms it on almost all test sets in terms of VMAF BD-Rate. On average it surpasses open LD and RA end-to-end solutions in terms of VMAF and YUV BD-Rates.

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.14235 [pdf, other]

Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation

Authors: Jiaming Zhou, Teli Ma, Kun-Yu Lin, Ronghe Qiu, Zifan Wang, Junwei Liang

Abstract: Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, where the single-task, language-conditioned multi-task settings are covered, and two different pre-trained models are evaluated. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.13719 [pdf, other]

GUI Action Narrator: Where and When Did That Action Take Place?

Authors: Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Xiangwu Guo, Peiran Li, Weichen Zhang, Hengxu Wang, Mike Zheng Shou

Abstract: The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset \textbf{Act2Cap} as well as a simple yet effective framework, \textbf{GUI Narrator}, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM model with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.

Submitted 19 June, 2024; originally announced June 2024.

arXiv:2406.12466 [pdf, other]

Rastall gravity: accretion disk image in radiation fields context and visual transformations compared to Reissner-Nordstrom black holes

Authors: Yu-Xiang Huang, Sen Guo, Yu Liang, Yu-Hao Cui, Qing-Quan Jiang, Kai Lin

Abstract: Our study investigates the astronomical implications of Rastall gravity, particularly its behavior amidst a radiation field compared to Reissner-Nordstrom (RN) black holes. Our research delineates a crucial correlation between the dynamics of the accretion disk and the parameters Q and N_{\rm r}, which aptly reflect the influence of spacetime metrics on the disk's appearance. Elevated electric charge Q prompts contraction in the disk's orbit due to enhanced gravitational effects, while higher N_{\rm r} values lead to outward expansion, influenced by the radiation field's attributes. Interestingly, the charged black holes surrounded by radiation fields display distinct visual disparities from RN black holes. Brightness decreases and expansion occurs within the accretion disk's innermost stable circular orbit with rising N_{\rm r} values. Our study also reveals the process by which the accretion disk transitions from a conventional disk-like structure to a hat-like form at different observation angles, with the redshift effect gradually intensifying. Moreover, the results of the Rastall gravity radiation field we consider are consistent with the constraints of the host galaxy's gravitational lensing on the Rastall gravity parameters, enhancing the consistency between theoretical predictions and actual observations.

Submitted 18 June, 2024; originally announced June 2024.

arXiv:2406.11937 [pdf, other]

Using graph neural networks to reconstruct charged pion showers in the CMS High Granularity Calorimeter

Authors: M. Aamir, B. Acar, G. Adamov, T. Adams, C. Adloff, S. Afanasiev, C. Agrawal, C. Agrawal, A. Ahmad, H. A. Ahmed, S. Akbar, N. Akchurin, B. Akgul, B. Akgun, R. O. Akpinar, E. Aktas, A. AlKadhim, V. Alexakhin, J. Alimena, J. Alison, A. Alpana, W. Alshehri, P. Alvarez Dominguez, M. Alyari, C. Amendola, et al. (550 additional authors not shown)

Abstract: A novel method to reconstruct the energy of hadronic showers in the CMS High Granularity Calorimeter (HGCAL) is presented. The HGCAL is a sampling calorimeter with very fine transverse and longitudinal granularity. The active media are silicon sensors and scintillator tiles read out by SiPMs and the absorbers are a combination of lead and Cu/CuW in the electromagnetic section, and steel in the hadronic section. The shower reconstruction method is based on graph neural networks and it makes use of a dynamic reduction network architecture. It is shown that the algorithm is able to capture and mitigate the main effects that normally hinder the reconstruction of hadronic showers using classical reconstruction methods, by compensating for fluctuations in the multiplicity, energy, and spatial distributions of the shower's constituents. The performance of the algorithm is evaluated using test beam data collected in 2018 prototype of the CMS HGCAL accompanied by a section of the CALICE AHCAL prototype. The capability of the method to mitigate the impact of energy leakage from the calorimeter is also demonstrated.

Submitted 30 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: Prepared for submission to JINST

arXiv:2406.11816 [pdf, other]

VideoLLM-online: Online Video Large Language Model for Streaming Video

Authors: Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou

Abstract: Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.

Submitted 17 June, 2024; originally announced June 2024.

Comments: CVPR 2024. This arxiv version is upgraded with Llama-3

arXiv:2406.11781 [pdf, other]

DiffMM: Multi-Modal Diffusion Model for Recommendation

Authors: Yangqin Jiang, Lianghao Xia, Wei Wei, Da Luo, Kangyi Lin, Chao Huang

Abstract: The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniques to enhance recommender systems. However, these methods often rely on simplistic random augmentation or intuitive cross-view information, which can introduce irrelevant noise and fail to accurately align the multi-modal context with user-item interaction modeling. To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning. This integration facilitates better alignment between multi-modal feature information and collaborative relation modeling. Our approach leverages diffusion models' generative capabilities to automatically generate a user-item graph that is aware of different modalities, facilitating the incorporation of useful multi-modal knowledge in modeling user-item interactions. We conduct extensive experiments on three public datasets, consistently demonstrating the superiority of our DiffMM over various competitive baselines. For open-sourced model implementation details, you can access the source codes of our proposed framework at: https://github.com/HKUDS/DiffMM.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.10583 [pdf, other]

Demonstration of neutron identification in neutrino interactions in the MicroBooNE liquid argon time projection chamber

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, et al. (165 additional authors not shown)

Abstract: A significant challenge in measurements of neutrino oscillations is reconstructing the incoming neutrino energies. While modern fully-active tracking calorimeters such as liquid argon time projection chambers in principle allow the measurement of all final state particles above some detection threshold, undetected neutrons remain a considerable source of missing energy with little to no data constraining their production rates and kinematics. We present the first demonstration of tagging neutrino-induced neutrons in liquid argon time projection chambers using secondary protons emitted from neutron-argon interactions in the MicroBooNE detector. We describe the method developed to identify neutrino-induced neutrons and demonstrate its performance using neutrons produced in muon-neutrino charged current interactions. The method is validated using a small subset of MicroBooNE's total dataset. The selection yields a sample with $60\%$ of selected tracks corresponding to neutron-induced secondary protons.

Submitted 15 June, 2024; originally announced June 2024.

Report number: FERMILAB-PUB-24-0301

arXiv:2406.10227 [pdf, other]

VideoGUI: A Benchmark for GUI Automation from Instructional Videos

Authors: Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

Abstract: Graphical User Interface (GUI) automation holds significant promise for enhancing human productivity by assisting with computer tasks. Existing task formulations primarily focus on simple tasks that can be specified by a single, language-only instruction, such as "Insert a new slide." In this work, we introduce VideoGUI, a novel multi-modal benchmark designed to evaluate GUI assistants on visual-centric GUI tasks. Sourced from high-quality web instructional videos, our benchmark focuses on tasks involving professional and novel software (e.g., Adobe Photoshop or Stable Diffusion WebUI) and complex activities (e.g., video editing). VideoGUI evaluates GUI assistants through a hierarchical process, allowing for identification of the specific levels at which they may fail: (i) high-level planning: reconstruct procedural subtasks from visual conditions without language descriptions; (ii) middle-level planning: generate sequences of precise action narrations based on visual state (i.e., screenshot) and goals; (iii) atomic action execution: perform specific actions such as accurately clicking designated elements. For each level, we design evaluation metrics across individual dimensions to provide clear signals, such as individual performance in clicking, dragging, typing, and scrolling for atomic action execution. Our evaluation on VideoGUI reveals that even the SoTA large multimodal model GPT4o performs poorly on visual-centric GUI tasks, especially for high-level planning.

Submitted 14 June, 2024; originally announced June 2024.

Comments: 24 pages, 16 tables, 17 figures

arXiv:2406.10123 [pdf, other]

Improving neutrino energy estimation of charged-current interaction events with recurrent neural networks in MicroBooNE

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, et al. (164 additional authors not shown)

Abstract: We present a deep learning-based method for estimating the neutrino energy of charged-current neutrino-argon interactions. We employ a recurrent neural network (RNN) architecture for neutrino energy estimation in the MicroBooNE experiment, utilizing liquid argon time projection chamber (LArTPC) detector technology. Traditional energy estimation approaches in LArTPCs, which largely rely on reconstructing and summing visible energies, often experience sizable biases and resolution smearing because of the complex nature of neutrino interactions and the detector response. The estimation of neutrino energy can be improved after considering the kinematics information of reconstructed final-state particles. Utilizing kinematic information of reconstructed particles, the deep learning-based approach shows improved resolution and reduced bias for the muon neutrino Monte Carlo simulation sample compared to the traditional approach. In order to address the common concern about the effectiveness of this method on experimental data, the RNN-based energy estimator is further examined and validated with dedicated data-simulation consistency tests using MicroBooNE data. We also assess its potential impact on a neutrino oscillation study after accounting for all statistical and systematic uncertainties and show that it enhances physics sensitivity. This method has good potential to improve the performance of other physics analyses.

Submitted 14 June, 2024; originally announced June 2024.

Report number: FERMILAB-PUB-24-0287

arXiv:2406.09767 [pdf, other]

Language-Guided Manipulation with Diffusion Policies and Constrained Inpainting

Authors: Ce Hao, Kelvin Lin, Siyuan Luo, Harold Soh

Abstract: Diffusion policies have demonstrated robust performance in generative modeling, prompting their application in robotic manipulation controlled via language descriptions. In this paper, we introduce a zero-shot, open-vocabulary diffusion policy method for robot manipulation. Using Vision-Language Models (VLMs), our method transforms linguistic task descriptions into actionable keyframes in 3D space. These keyframes serve to guide the diffusion process via inpainting. However, naively enforcing the diffusion process to adhere to the generated keyframes is problematic: the keyframes from the VLMs may be incorrect and lead to out-of-distribution (OOD) action sequences where the diffusion model performs poorly. To address these challenges, we develop an inpainting optimization strategy that balances adherence to the keyframes vs. the training data distribution. Experimental evaluations demonstrate that our approach surpasses the performance of traditional fine-tuned language-conditioned methods in both simulated and real-world settings.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.08407 [pdf, other]

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Authors: Xuehai He, Weixi Feng, Kaizhi Zheng, Yujie Lu, Wanrong Zhu, Jiachen Li, Yue Fan, Jianfeng Wang, Linjie Li, Zhengyuan Yang, Kevin Lin, William Yang Wang, Lijuan Wang, Xin Eric Wang

Abstract: Multimodal Large Language Models (MLLMs) demonstrate the emerging abilities of "world models" -- interpreting and reasoning about complex real-world dynamics. To assess these abilities, we posit videos are the ideal medium, as they encapsulate rich representations of real-world dynamics and causalities. To this end, we introduce MMWorld, a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld distinguishes itself from previous video understanding benchmarks with two unique advantages: (1) multi-discipline, covering various disciplines that often require domain expertise for comprehensive understanding; (2) multi-faceted reasoning, including explanation, counterfactual thinking, future prediction, etc. MMWorld consists of a human-annotated dataset to evaluate MLLMs with questions about the whole videos and a synthetic dataset to analyze MLLMs within a single modality of perception. Together, MMWorld encompasses 1,910 videos across seven broad disciplines and 69 subdisciplines, complete with 6,627 question-answer pairs and associated captions. The evaluation includes 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld (e.g., GPT-4V performs the best with only 52.3\% accuracy), showing large room for improvement. Further ablation studies reveal other interesting findings such as models' different skill sets from humans. We hope MMWorld can serve as an essential step towards world model evaluation in videos.

Submitted 13 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.07540 [pdf, other]

Ctrl-X: Controlling Structure and Appearance for Text-To-Image Generation Without Guidance

Authors: Kuan Heng Lin, Sicheng Mo, Ben Klingher, Fangzhou Mu, Bolei Zhou

Abstract: Recent controllable generation approaches such as FreeControl and Diffusion Self-guidance bring fine-grained spatial and appearance control to text-to-image (T2I) diffusion models without training auxiliary modules. However, these methods optimize the latent embedding for each type of score function with longer diffusion steps, making the generation process time-consuming and limiting their flexibility and use. This work presents Ctrl-X, a simple framework for T2I diffusion controlling structure and appearance without additional training or guidance. Ctrl-X designs feed-forward structure control to enable the structure alignment with a structure image and semantic-aware appearance transfer to facilitate the appearance transfer from a user-input image. Extensive qualitative and quantitative experiments illustrate the superior performance of Ctrl-X on various condition inputs and model checkpoints. In particular, Ctrl-X supports novel structure and appearance control with arbitrary condition images of any modality, exhibits superior image quality and appearance transfer compared to existing works, and provides instant plug-and-play functionality to any T2I and text-to-video (T2V) diffusion model. See our project page for an overview of the results: https://genforce.github.io/ctrl-x

Submitted 11 June, 2024; originally announced June 2024.

Comments: 18 pages, 11 figures, see project page at https://genforce.github.io/ctrl-x

arXiv:2406.07514 [pdf, other]

Scintillation Light in SBND: Simulation, Reconstruction, and Expected Performance of the Photon Detection System

Authors: SBND Collaboration, P. Abratenko, R. Acciarri, C. Adams, L. Aliaga-Soplin, O. Alterkait, R. Alvarez-Garrote, C. Andreopoulos, A. Antonakis, L. Arellano, J. Asaadi, W. Badgett, S. Balasubramanian, V. Basque, A. Beever, B. Behera, E. Belchior, M. Betancourt, A. Bhat, M. Bishai, A. Blake, B. Bogart, J. Bogenschuetz, D. Brailsford, A. Brandt , et al. (158 additional authors not shown)

Abstract: SBND is the near detector of the Short-Baseline Neutrino program at Fermilab. Its location near to the Booster Neutrino Beam source and relatively large mass will allow the study of neutrino interactions on argon with unprecedented statistics. This paper describes the expected performance of the SBND photon detection system, using a simulated sample of beam neutrinos and cosmogenic particles. Its design is a dual readout concept combining a system of 120 photomultiplier tubes, used for triggering, with a system of 192 X-ARAPUCA devices, located behind the anode wire planes. Furthermore, covering the cathode plane with highly-reflective panels coated with a wavelength-shifting compound recovers part of the light emitted towards the cathode, where no optical detectors exist. We show how this new design provides a high light yield and a more uniform detection efficiency, an excellent timing resolution and an independent 3D-position reconstruction using only the scintillation light. Finally, the whole reconstruction chain is applied to recover the temporal structure of the beam spill, which is resolved with a resolution on the order of nanoseconds.

Submitted 11 June, 2024; originally announced June 2024.

Comments: 21 pages, 17 figures

Report number: FERMILAB-PUB-24-0303-PPD

arXiv:2406.07122 [pdf, other]

Compact Polarization-Entangled Photon Source Based on Coexisting Noncritically Birefringent and Quasi Phase Matching in a Nonlinear Crystal

Authors: C. -Y. Yang, C. -Y. Wang, K. -H. Lin, T. -Y. Tsai, C. -C. Lin, C. Canalias, L. -B. Wang, A. Yabushita, C. -S. Chuu

Abstract: Polarization-entangled photons are indispensable to numerous quantum technologies and fundamental studies. In this paper, we propose and demonstrate a novel source that generates collinear polarization-entangled photons by simultaneously achieving two distinct types of phase-matching conditions (noncritically birefringent and quasi phase matching) in a periodically poled nonlinear crystal with a large poling period of 2 mm. The photon pairs are generated in a polarization-entangled state with a fidelity and concurrence of 0.998 and 0.935, respectively, and violate the Clauser-Horne-Shimony-Holt inequality by 84 standard deviations. The compact source does not require an interferometer, delicate domain structures, or post selection, and is advantageous for scalable quantum computing and communication, where many replicas or chip-scale devices are needed.

Submitted 11 June, 2024; originally announced June 2024.

arXiv:2406.06890 [pdf, other]

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

Authors: Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang

Abstract: Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains.
Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data.

Submitted 10 June, 2024; originally announced June 2024.

Comments: Project page: https://yhzhai.github.io/mcm/

arXiv:2406.06472 [pdf, other]

Multi-Amplifier Sensing Charge-coupled Devices for Next Generation Spectroscopy

Authors: Kenneth Lin, Armin Karcher, Julien Guy, Stephen E. Holland, William F. Kolbe, Peter Nugent, Alex Drlica-Wagner

Abstract: We present characterization results and performance of a prototype Multiple-Amplifier Sensing (MAS) silicon charge-coupled device (CCD) sensor with 16 channels potentially suitable for faint object astronomical spectroscopy and low-signal, photon-limited imaging. The MAS CCD is designed to reach sub-electron readout noise by repeatedly measuring charge through a line of amplifiers during the serial transfer shifts. Using synchronized readout electronics based on the DESI CCD controller, we report a read noise of 1.03 e- rms/pix at a speed of 26 $μ$s/pix with a single-sample readout scheme where charge in a pixel is measured only once for each output stage. At these operating parameters, we find the amplifier-to-amplifier charge transfer efficiency (ACTE) to be $>0.9995$ at low counts for all amplifiers but one for which the ACTE is 0.997. This charge transfer efficiency falls above 50,000 electrons for the read-noise optimized voltage configuration we chose for the serial clocks and gates. The amplifier linearity across a broad dynamic range from $\sim$300--35,000 e- was also measured to be $\pm 2.5\%$. We describe key operating parameters to optimize on these characteristics and describe the specific applications for which the MAS CCD may be a suitable detector candidate.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 20 pages, 18 figures, submitted to PASP

arXiv:2406.03298 [pdf, other]

L-PR: Exploiting LiDAR Fiducial Marker for Unordered Low Overlap Multiview Point Cloud Registration

Authors: Yibo Liu, Jinjun Shan, Amaldev Haridevan, Shuo Zhang, Kejian Lin

Abstract: Point cloud registration is a prerequisite for many applications in computer vision and robotics. Most existing methods focus on pairwise registration of two point clouds with high overlap. Although there have been some methods for low overlap cases, they struggle in degraded scenarios. This paper introduces a novel framework named L-PR, designed to register unordered low overlap multiview point clouds leveraging LiDAR fiducial markers. We refer to them as LiDAR fiducial markers, but they are the same as the popular AprilTag and ArUco markers, thin sheets of paper that do not affect the 3D geometry of the environment. We first propose an improved adaptive threshold marker detection method to provide robust detection results when the viewpoints among point clouds change dramatically. Then, we formulate the unordered multiview point cloud registration problem as a maximum a-posteriori (MAP) problem and develop a framework consisting of two levels of graphs to address it. The first-level graph, constructed as a weighted graph, is designed to efficiently and optimally infer initial values of scan poses from the unordered set. The second-level graph is constructed as a factor graph. By globally optimizing the variables on the graph, including scan poses, marker poses, and marker corner positions, we tackle the MAP problem.
We conduct qualitative and quantitative experiments to demonstrate that the proposed method exhibits superiority over competitors in four aspects: registration accuracy, instance reconstruction quality, localization accuracy, and robustness to the degraded scene. To benefit the community, we open-source our method and dataset at https://github.com/yorklyb/LiDAR-SFM.

Submitted 5 June, 2024; originally announced June 2024.

Comments: 8 pages

arXiv:2406.03270 [pdf, other]

A Successive Gap Constraint Linearization Method for Optimal Control Problems with Equilibrium Constraints

Authors: Kangyu Lin, Toshiyuki Ohtsuka

Abstract: In this study, we propose a novel gap-constraint-based reformulation for optimal control problems with equilibrium constraints (OCPECs). We show that the proposed reformulation generates a new constraint system equivalent to the original one but more concise and with favorable differentiability. Moreover, constraint regularity can be recovered by a relaxation strategy. We show that the gap constraint and its gradient can be evaluated efficiently. We then propose a successive gap constraint linearization method to solve the discretized OCPEC. We also provide an intuitive geometric interpretation of the gap constraint. Numerical experiments validate the effectiveness of the proposed reformulation and solution method.

Submitted 5 June, 2024; originally announced June 2024.

Comments: Forthcoming, (Accepted to the 2024 IFAC Conference on Nonlinear Model Predictive Control (NMPC))

arXiv:2406.00033 [pdf, other]

doi 10.1145/3626772.3657670

Retrieval-Augmented Conversational Recommendation with Prompt-based Semi-Structured Natural Language State Tracking

Authors: Sara Kemper, Justin Cui, Kai Dicarlantonio, Kathy Lin, Danjie Tang, Anton Korikov, Scott Sanner

Abstract: Conversational recommendation (ConvRec) systems must understand rich and diverse natural language (NL) expressions of user preferences and intents, often communicated in an indirect manner (e.g., "I'm watching my weight"). Such complex utterances make retrieving relevant items challenging, especially if only using often incomplete or out-of-date metadata. Fortunately, many domains feature rich item reviews that cover standard metadata categories and offer complex opinions that might match a user's interests (e.g., "classy joint for a date"). However, only recently have large language models (LLMs) let us unlock the commonsense connections between user preference utterances and complex language in user-generated reviews. Further, LLMs enable novel paradigms for semi-structured dialogue state tracking, complex intent and preference understanding, and generating recommendations, explanations, and question answers. We thus introduce a novel technology RA-Rec, a Retrieval-Augmented, LLM-driven dialogue state tracking system for ConvRec, showcased with a video, open source GitHub repository, and interactive Google Colab notebook.

Submitted 25 May, 2024; originally announced June 2024.

arXiv:2405.15784 [pdf, other]

CLARINET: Augmenting Language Models to Ask Clarification Questions for Retrieval

Authors: Yizhou Chi, Jessy Lin, Kevin Lin, Dan Klein

Abstract: Users often make ambiguous requests that require clarification. We study the problem of asking clarification questions in an information retrieval setting, where systems often face ambiguous search queries and it is challenging to turn the uncertainty in the retrieval model into a natural language question. We present CLARINET, a system that asks informative clarification questions by choosing questions whose answers would maximize certainty in the correct candidate. Our approach works by augmenting a large language model (LLM) to condition on a retrieval distribution, finetuning end-to-end to generate the question that would have maximized the rank of the true candidate at each turn. When evaluated on a real-world retrieval dataset of users searching for books, our system outperforms traditional heuristics such as information gain on retrieval success by 17% and vanilla-prompted LLMs by 39% relative.

Submitted 28 April, 2024; originally announced May 2024.

arXiv:2405.13860 [pdf, other]

MAGIC: Map-Guided Few-Shot Audio-Visual Acoustics Modeling

Authors: Diwei Huang, Kunyang Lin, Peihao Chen, Qing Du, Mingkui Tan

Abstract: Few-shot audio-visual acoustics modeling seeks to synthesize the room impulse response in arbitrary locations with few-shot observations. To sufficiently exploit the provided few-shot data for accurate acoustic modeling, we present a *map-guided* framework by constructing acoustic-related visual semantic feature maps of the scenes. Visual features preserve semantic details related to sound and maps provide explicit structural regularities of sound propagation, which are valuable for modeling environment acoustics. We thus extract pixel-wise semantic features derived from observations and project them into a top-down map, namely the **observation semantic map**. This map contains the relative positional information among points and the semantic feature information associated with each point. Yet, limited information extracted by few-shot observations on the map is not sufficient for understanding and modeling the whole scene. We address the challenge by generating a **scene semantic map** via diffusing features and anticipating the observation semantic map. The scene semantic map then interacts with echo encoding by a transformer-based encoder-decoder to predict RIR for arbitrary speaker-listener query pairs. Extensive experiments on Matterport3D and Replica dataset verify the efficacy of our framework.

Submitted 22 May, 2024; originally announced May 2024.

Comments: 17 pages, 12 pages for main paper, 5 pages for supplementary

arXiv:2405.12808 [pdf, other]

Influence of quantum correction on the Schwarzschild black hole polarized image

Authors: Sen Guo, Yu-Xiang Huang, Kuan Liu, En-Wei Liang, Kai Lin

Abstract: Using a model of an accretion disk around a Schwarzschild black hole, the analytic estimates for image polarization were derived by Narayan $et~al.$ [Astrophys. J, 102, 912 (2021)]. Recently, the EHT team also obtained polarization images of the Sgr A$^{*}$ and measured both linear and circular polarization [Astrophys. J. Lett, 964, L25 (2024)]. We find that quantum correction effects can also influence polarization information. Considering the quantum corrected Schwarzschild black hole (Kazakov-Solodukhin black hole), we derive the polarization intensity of the target black hole and investigate polarization images under different parameters. It is found that a larger quantum deformation leads to an expansion of the polarization region, while the polarization intensity value decreases. Under different observation angles, magnetic fields, fluid direction angles, and fluid velocity conditions, we also derive polarization images of corrected black holes. These key indicators not only affect the intensity of polarization but also the direction of polarization. We establish the relationship between polarization intensity and quantum correction deformation parameters, revealing a gradual decline in polarization intensity with reduced radius and an anti-polarization behavior induced by the progressive increase in deformation parameters at a constant radius. Our analysis may provide observational evidence for quantum effect of general relativity.

Submitted 21 May, 2024; originally announced May 2024.

Comments: 20 pages, 8 figures

Report number: Accepted European Physical Journal C (EPJC) 2024

arXiv:2405.10925 [pdf]

High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariates

Authors: Janick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. Desai

Abstract: Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%).
NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.

Submitted 17 May, 2024; originally announced May 2024.

arXiv:2405.10232 [pdf, other]

doi 10.1145/3631700.3664869

Beyond Static Calibration: The Impact of User Preference Dynamics on Calibrated Recommendation

Authors: Kun Lin, Masoud Mansoury, Farzad Eskandanian, Milad Sabouri, Bamshad Mobasher

Abstract: Calibration in recommender systems is an important performance criterion that ensures consistency between the distribution of user preference categories and that of recommendations generated by the system. Standard methods for mitigating miscalibration typically assume that user preference profiles are static, and they measure calibration relative to the full history of user's interactions, including possibly outdated and stale preference categories. We conjecture that this approach can lead to recommendations that, while appearing calibrated, in fact, distort users' true preferences. In this paper, we conduct a preliminary investigation of recommendation calibration at a more granular level, taking into account evolving user preferences. By analyzing differently sized training time windows from the most recent interactions to the oldest, we identify the most relevant segment of user's preferences that optimizes the calibration metric. We perform an exploratory analysis with datasets from different domains with distinctive user-interaction characteristics. We demonstrate how the evolving nature of user preferences affects recommendation calibration, and how this effect is manifested differently depending on the characteristics of the data in a given domain. Datasets, codes, and more detailed experimental results are available at: https://github.com/nicolelin13/DynamicCalibrationUMAP.

Submitted 16 May, 2024; originally announced May 2024.

Comments: 8 pages, 4 figures, accepted as LBR paper at UMAP '24 -- ACM Conference on User Modeling, Adaptation and Personalization 2024

MSC Class: 68-06 ACM Class: H.3.4

arXiv:2405.07503 [pdf, other]

Consistency Policy: Accelerated Visuomotor Policies via Consistency Distillation

Authors: Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, Jeannette Bohg

Abstract: Many robotic systems, such as mobile manipulators or quadrotors, cannot be equipped with high-end GPUs due to space, weight, and power constraints. These constraints prevent these systems from leveraging recent developments in visuomotor policy architectures that require high-end GPUs to achieve fast policy inference. In this paper, we propose Consistency Policy, a faster and similarly powerful alternative to Diffusion Policy for learning visuomotor robot control. By virtue of its fast inference speed, Consistency Policy can enable low latency decision making in resource-constrained robotic setups. A Consistency Policy is distilled from a pretrained Diffusion Policy by enforcing self-consistency along the Diffusion Policy's learned trajectories. We compare Consistency Policy with Diffusion Policy and other related speed-up methods across 6 simulation tasks as well as three real-world tasks where we demonstrate inference on a laptop GPU. For all these tasks, Consistency Policy speeds up inference by an order of magnitude compared to the fastest alternative method and maintains competitive success rates. We also show that the Consistency Policy training procedure is robust to the pretrained Diffusion Policy's quality, a useful result that helps practitioners avoid extensive testing of the pretrained model. Key design decisions that enabled this performance are the choice of consistency objective, reduced initial sample variance, and the choice of preset chaining steps.

Submitted 28 June, 2024; v1 submitted 13 May, 2024; originally announced May 2024.

Comments: https://consistency-policy.github.io/

arXiv:2405.07303 [pdf, other]

Search for solar axions by Primakoff effect with the full dataset of the CDEX-1B Experiment

Authors: L. T. Yang, S. K. Liu, Q. Yue, K. J. Kang, Y. J. Li, H. P. An, Greeshma C., J. P. Chang, Y. H. Chen, J. P. Cheng, W. H. Dai, Z. Deng, C. H. Fang, X. P. Geng, H. Gong, Q. J. Guo, T. Guo, X. Y. Guo, L. He, J. R. He, J. W. Hu, H. X. Huang, T. C. Huang, L. Jiang, S. Karmakar , et al. (61 additional authors not shown)

Abstract: We present the first limit on $g_{Aγ}$ coupling constant using the Bragg-Primakoff conversion based on an exposure of 1107.5 kg days of data from the CDEX-1B experiment at the China Jinping Underground Laboratory. The data are consistent with the null signal hypothesis, and no excess signals are observed. Limits of the coupling $g_{Aγ}<2.08\times10^{-9}$ GeV$^{-1}$ (95\% C.L.) are derived for axions with mass up to 100 eV/$c^2$. Within the hadronic model of KSVZ, our results exclude axion mass $>5.3~\rm{eV}/c^2$ at 95\% C.L.

Submitted 12 May, 2024; originally announced May 2024.

Comments: 7 pages, 5 figures

arXiv:2405.05962 [pdf, other]

Age Aware Scheduling for Differentially-Private Federated Learning

Authors: Kuan-Yu Lin, Hsuan-Yin Lin, Yu-Pin Hsu, Yu-Chih Huang

Abstract: This paper explores differentially-private federated learning (FL) across time-varying databases, delving into a nuanced three-way tradeoff involving age, accuracy, and differential privacy (DP). Emphasizing the potential advantages of scheduling, we propose an optimization problem aimed at meeting DP requirements while minimizing the loss difference between the aggregated model and the model obtained without DP constraints. To harness the benefits of scheduling, we introduce an age-dependent upper bound on the loss, leading to the development of an age-aware scheduling design. Simulation results underscore the superior performance of our proposed scheme compared to FL with classic DP, which does not consider scheduling as a design factor. This research contributes insights into the interplay of age, accuracy, and DP in federated learning, with practical implications for scheduling strategies.

Submitted 5 July, 2024; v1 submitted 9 May, 2024; originally announced May 2024.

Comments: Simulation parameters updated. Paper accepted for presentation at the 2024 IEEE International Symposium on Information Theory (ISIT 2024)

arXiv:2405.03648 [pdf, ps, other]

Proof of the geometric Langlands conjecture II: Kac-Moody localization and the FLE

Authors: D. Arinkin, D. Beraldo, J. Campbell, L. Chen, J. Faergeman, D. Gaitsgory, K. Lin, S. Raskin, N. Rozenblyum

Abstract: This paper is the second in a series of five that together prove the geometric Langlands conjecture. Our goals are two-fold: (1) Formulate and prove the Fundamental Local Equivalence (FLE) at the critical level; (2) Study the interaction between Kac-Moody localization and the global geometric Langlands functor of ref. [GLC1]. This paper contains an extensive Appendix, whose primary goals are: (a) Development of the theory of ind-coherent sheaves in infinite type; (b) Development of the formalism of factorization categories.

Submitted 23 May, 2024; v1 submitted 6 May, 2024; originally announced May 2024.

arXiv:2405.02794 [pdf, other]

Octopi: Object Property Reasoning with Large Tactile-Language Models

Authors: Samson Yu, Kelvin Lin, Anxing Xiao, Jiafei Duan, Harold Soh

Abstract: Physical reasoning is important for effective robot manipulation. Recent work has investigated both vision and language modalities for physical reasoning; vision can reveal information about objects in the environment and language serves as an abstraction and communication medium for additional context. Although these works have demonstrated success on a variety of physical reasoning tasks, they are limited to physical properties that can be inferred from visual or language inputs. In this work, we investigate combining tactile perception with language, which enables embodied systems to obtain physical properties through interaction and apply commonsense reasoning. We contribute a new dataset PhysiCLeAR, which comprises both physical/property reasoning tasks and annotated tactile videos obtained using a GelSight tactile sensor. We then introduce Octopi, a system that leverages both tactile representation learning and large vision-language models to predict and reason about tactile inputs with minimal language fine-tuning. Our evaluations on PhysiCLeAR show that Octopi is able to effectively use intermediate physical property predictions to improve its performance on various tactile-related tasks. PhysiCLeAR and Octopi are available at https://github.com/clear-nus/octopi.

Submitted 4 June, 2024; v1 submitted 4 May, 2024; originally announced May 2024.

Comments: Accepted at Robotics: Science and Systems (R:SS 2024)

arXiv:2404.19074 [pdf, other]

Chaos-Assisted Dynamical Tunneling in Flat Band Superwires

Authors: Anton Marius Graf, Ke Lin, MyeongSeo Kim, Joonas Keski-Rahkonen, Alvar Daza, Eric Heller

Abstract: Recent theoretical investigations have revealed unconventional transport mechanisms within high Brillouin zones of two-dimensional superlattices. Electrons can navigate along channels we call superwires, gently guided without brute force confinement. Such dynamical confinement is caused by weak superlattice deflections, markedly different from the static or energetic confinement observed in traditional wave guides or one-dimensional electron wires. The quantum properties of superwires give rise to elastic dynamical tunneling, linking disjoint regions of the corresponding classical phase space, and enabling the emergence of several parallel channels. This paper provides the underlying theory and mechanisms that facilitate dynamical tunneling assisted by chaos in periodic lattices. Moreover, we show that the mechanism of dynamical tunneling can be effectively conceptualized through the lens of a paraxial approximation. Our results further reveal that superwires predominantly exist within flat bands, emerging from eigenstates that represent linear combinations of conventional degenerate Bloch states. Finally, we quantify tunneling rates across various lattice configurations, and demonstrate that the tunneling can be suppressed in a controlled fashion, illustrating potential implications in future nanodevices.

Submitted 29 April, 2024; originally announced April 2024.

Comments: 12 pages, 6 Figures

arXiv:2404.18716 [pdf, other]

Electrically tunable layer-hybridized trions in doped WSe$_2$ bilayers

Authors: Raul Perea-Causin, Samuel Brem, Fabian Buchner, Kenji Watanabe, Takashi Taniguchi, John M. Lupton, Kai-Qiang Lin, Ermin Malic

Abstract: Doped van der Waals heterostructures host layer-hybridized trions, i.e. charged excitons with layer-delocalized constituents holding promise for highly controllable optoelectronics. Combining a microscopic theory with photoluminescence (PL) experiments, we demonstrate the electrical tunability of the trion energy landscape in naturally stacked WSe$_2$ bilayers. We show that an out-of-plane electric field modifies the energetic ordering of the lowest lying trion states, which consist of layer-hybridized $Λ$-point electrons and layer-localized K-point holes. At small fields, intralayer-like trions yield distinct PL signatures in opposite doping regimes characterized by weak Stark shifts in both cases. Above a doping-asymmetric critical field, interlayer-like species are energetically favored and produce PL peaks with a pronounced Stark red-shift and a counter-intuitively large intensity arising from efficient phonon-assisted recombination. Our work presents an important step forward in the microscopic understanding of layer-hybridized trions in van der Waals heterostructures and paves the way towards optoelectronic applications based on electrically controllable atomically-thin semiconductors.

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.17343 [pdf, other]

A Bionic Natural Language Parser Equivalent to a Pushdown Automaton

Authors: Zhenghao Wei, Kehua Lin, Jianlin Feng

Abstract: Assembly Calculus (AC), proposed by Papadimitriou et al., aims to reproduce advanced cognitive functions through simulating neural activities, with several applications based on AC having been developed, including a natural language parser proposed by Mitropolsky et al. However, this parser lacks the ability to handle Kleene closures, preventing it from parsing all regular languages and rendering it weaker than Finite Automata (FA). In this paper, we propose a new bionic natural language parser (BNLP) that is based on AC and integrates two new biologically rational structures, Recurrent Circuit and Stack Circuit, which are inspired by RNN and short-term memory mechanism. In contrast to the original parser, the BNLP can fully handle all regular languages and Dyck languages. Therefore, leveraging the Chomsky-Schützenberger theorem, the BNLP which can parse all Context-Free Languages can be constructed. We also formally prove that for any PDA, a Parser Automaton corresponding to BNLP can always be formed, ensuring that BNLP has a description ability equal to that of PDA and addressing the deficiencies of the original parser.

Submitted 26 April, 2024; originally announced April 2024.

Comments: to be published in IJCNN 2024

arXiv:2404.16375 [pdf, other]

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Authors: An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang

Abstract: Set-of-Mark (SoM) Prompting unleashes the visual grounding capability of GPT-4V, by enabling the model to associate visual objects with tags inserted on the image. These tags, marked with alphanumerics, can be indexed via text tokens for easy reference. Despite the extraordinary performance from GPT-4V, we observe that other Multimodal Large Language Models (MLLMs) struggle to understand these visual tags. To promote the learning of SoM prompting for open-source models, we propose a new learning paradigm: "list items one by one," which asks the model to enumerate and describe all visual tags placed on the image following the alphanumeric orders of tags. By integrating our curated dataset with other visual instruction tuning datasets, we are able to equip existing MLLMs with the SoM prompting ability. Furthermore, we evaluate our finetuned SoM models on five MLLM benchmarks. We find that this new dataset, even in a relatively small size (10k-30k images with tags), significantly enhances visual reasoning capabilities and reduces hallucinations for MLLMs. Perhaps surprisingly, these improvements persist even when the visual tags are omitted from input images during inference. This suggests the potential of "list items one by one" as a new paradigm for training MLLMs, which strengthens the object-text alignment through the use of visual tags in the training stage. Finally, we conduct analyses by probing trained models to understand the working mechanism of SoM. Our code and data are available at https://github.com/zzxslp/SoM-LLaVA.

Submitted 25 April, 2024; originally announced April 2024.

Comments: Preprint

arXiv:2404.16254 [pdf, ps, other]

Standing wave in perturbed anti-de Sitter spacetimes with a naked singularity

Authors: Kai Lin, Wei-Liang Qian

Abstract: In the framework of black hole perturbation theory, this work investigates the standing wave solutions in Reissner-Nordström (RN) anti-de Sitter (AdS) spacetimes with a naked singularity. These solutions can be viewed as a specific class of quasinormal modes exhibiting distinct characteristics. The imaginary parts of their frequencies are numerically vanishing, allowing them to persist over an extended period. Besides, these modes are predominantly stationary in terms of the evolution of spacetime waveforms. The numerical calculations are carried out employing the finite difference method, and the quasinormal frequencies extracted by the Prony method are shown to be consistent with those obtained using the matrix method. The obtained waveforms and quasinormal frequencies are shown to be drastically different from those of an extreme RN-AdS black hole. As the quasinormal modes are primarily dissipative, the non-dissipative standing waves are attributed to the nature that the singularity can neither be a sink nor a source of the gravitational system.

Submitted 10 May, 2024; v1 submitted 24 April, 2024; originally announced April 2024.

Comments: 15 pages and 8 figures

arXiv:2404.15909 [pdf, other]

Learning Long-form Video Prior via Generative Pre-Training

Authors: Jinheng Xie, Jiajun Feng, Zhaoxu Tian, Kevin Qinghong Lin, Yawen Huang, Xi Xia, Nanxu Gong, Xu Zuo, Jiaqi Yang, Yefeng Zheng, Mike Zheng Shou

Abstract: Concepts involved in long-form videos such as people, objects, and their interactions, can be viewed as following an implicit prior. They are notably complex and continue to pose challenges to be comprehensively learned. In recent years, generative pre-training (GPT) has exhibited versatile capacities in modeling any kind of text content even visual locations. Can this manner work for learning long-form video prior? Instead of operating on pixel space, it is efficient to employ visual locations like bounding boxes and keypoints to represent key information in videos, which can be simply discretized and then tokenized for consumption by GPT. Due to the scarcity of suitable data, we create a new dataset called Storyboard20K from movies to serve as a representative. It includes synopses, shot-by-shot keyframes, and fine-grained annotations of film sets and characters with consistent IDs, bounding boxes, and whole body keypoints. In this way, long-form videos can be represented by a set of tokens and be learned via generative pre-training. Experimental results validate that our approach has great potential for learning long-form video prior. Code and data will be released at https://github.com/showlab/Long-form-Video-Prior.

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2404.15450 [pdf, other]

A possible origin of the $α$-vacuum as the initial state of the Universe

Authors: Pisin Chen, Kuan-Nan Lin, Wei-Chen Lin, Dong-han Yeom

Abstract: We investigate the cosmological observables using the Euclidean path integral approach. Specifically, we study both the no-boundary compact instantons scenario and the Euclidean wormholes scenario that can induce the creation of two universes from nothing. It is known that perturbations associated with the no-boundary scenario can only be consistent with the Bunch-Davies vacuum. Here we demonstrate that the Euclidean wormholes can allow for a de Sitter invariant vacuum, the so-called $α$-vacuum state, where the Bunch-Davies vacuum is a special case. This therefore provides the $α$-vacuum a geometrical origin. As an aside, we discuss a subtle phase issue when considering the power spectrum related to $α$-vacuum in the closed universe framework.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 25 pages, 8 figures

arXiv:2404.15402 [pdf, other]

KiDS-SBI: Simulation-Based Inference Analysis of KiDS-1000 Cosmic Shear

Authors: Maximilian von Wietersheim-Kramsta, Kiyam Lin, Nicolas Tessore, Benjamin Joachimi, Arthur Loureiro, Robert Reischke, Angus H. Wright

Abstract: We present a simulation-based inference (SBI) cosmological analysis of cosmic shear two-point statistics from the fourth weak gravitational lensing data release of the ESO Kilo-Degree Survey (KiDS-1000). KiDS-SBI efficiently performs non-Limber projection of the matter power spectrum via Levin's method, and constructs log-normal random matter fields on the curved sky for arbitrary cosmologies, including effective prescriptions for intrinsic alignments and baryonic feedback. The forward model samples realistic galaxy positions and shapes based on the observational characteristics, incorporating shear measurement and redshift calibration uncertainties, as well as angular anisotropies due to variations in depth and point-spread function. To enable direct comparison with standard inference, we limit our analysis to pseudo-angular power spectra. The SBI is based on sequential neural likelihood estimation to infer the posterior distribution of spatially-flat $Λ$CDM cosmological parameters from 18,000 realisations. We infer a mean marginal of the growth of structure parameter $S_{8} \equiv σ_8 (Ω_\mathrm{m} / 0.3)^{0.5} = 0.731\pm 0.033$ ($68 \%$). We present a measure of goodness-of-fit for SBI and determine that the forward model fits the data well with a probability-to-exceed of $0.42$. For fixed cosmology, the learnt likelihood is approximately Gaussian, while constraints widen compared to a Gaussian likelihood analysis due to cosmology dependence in the covariance.
Neglecting variable depth and anisotropies in the point spread function in the model can cause $S_{8}$ to be overestimated by ${\sim}5\%$. Our results are in agreement with previous analysis of KiDS-1000 and reinforce a $2.9 σ$ tension with constraints from cosmic microwave background measurements. This work highlights the importance of forward-modelling systematic effects in upcoming galaxy surveys.

Submitted 23 April, 2024; originally announced April 2024.

Comments: 44 pages, 30 figures. Submitted to Astronomy & Astrophysics

arXiv:2404.14705 [pdf, other]

Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Authors: Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin

Abstract: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.10948 [pdf, other]

First double-differential cross section measurement of neutral-current $π^0$ production in neutrino-argon scattering in the MicroBooNE detector

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, A. Barnard, G. Barr, D. Barrow, J. Barrow, V. Basque, J. Bateman, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book , et al. (166 additional authors not shown)

Abstract: We report the first double-differential cross section measurement of neutral-current neutral pion (NC$π^0$) production in neutrino-argon scattering, as well as single-differential measurements of the same channel in terms of final states with and without protons. The kinematic variables of interest for these measurements are the $π^0$ momentum and the $π^0$ scattering angle with respect to the neutrino beam. A total of 4971 candidate NC$π^0$ events fully-contained within the MicroBooNE detector are selected using data collected at a mean neutrino energy of $\sim 0.8$ GeV from $6.4\times10^{20}$ protons on target from the Booster Neutrino Beam at the Fermi National Accelerator Laboratory. After extensive data-driven model validation to ensure unbiased unfolding, the Wiener-SVD method is used to extract nominal flux-averaged cross sections. The results are compared to predictions from commonly used neutrino event generators, which tend to overpredict the measured NC$π^0$ cross section, especially in the 0.2-0.5 GeV/c $π^0$ momentum range, at forward scattering angles, and when at least one proton is present in the final state. These measurements show sensitivity to a variety of features that complicate the description of NC$π^0$ production including the form factors describing the elementary neutrino interaction and the final state interactions of the outgoing particles in the residual argon nucleus.
This data will help improve the modeling of NC$π^0$ production, which represents a major background in measurements of charge-parity violation in the neutrino sector and in searches for new physics beyond the Standard Model.

Submitted 16 April, 2024; originally announced April 2024.

Report number: FERMILAB-PUB-24-0125

arXiv:2404.09949 [pdf, other]

Measurement of the differential cross section for neutral pion production in charged-current muon neutrino interactions on argon with the MicroBooNE detector

Authors: MicroBooNE collaboration, P. Abratenko, O. Alterkait, D. Andrade Aldana, L. Arellano, J. Asaadi, A. Ashkenazi, S. Balasubramanian, B. Baller, G. Barr, D. Barrow, J. Barrow, V. Basque, O. Benevides Rodrigues, S. Berkman, A. Bhanderi, A. Bhat, M. Bhattacharya, M. Bishai, A. Blake, B. Bogart, T. Bolton, J. Y. Book, M. B. Brunetti, L. Camilleri , et al. (163 additional authors not shown)

Abstract: We present a measurement of neutral pion production in charged-current interactions using data recorded with the MicroBooNE detector exposed to Fermilab's booster neutrino beam. The signal comprises one muon, one neutral pion, any number of nucleons, and no charged pions. Studying neutral pion production in the MicroBooNE detector provides an opportunity to better understand neutrino-argon interactions, and is crucial for future accelerator-based neutrino oscillation experiments. Using a dataset corresponding to $6.86 \times 10^{20}$ protons on target, we present single-differential cross sections in muon and neutral pion momenta, scattering angles with respect to the beam for the outgoing muon and neutral pion, as well as the opening angle between the muon and neutral pion. Data extracted cross sections are compared to generator predictions. We report good agreement between the data and the models for scattering angles, except for an over-prediction by generators at muon forward angles. Similarly, the agreement between data and the models as a function of momentum is good, except for an underprediction by generators in the medium momentum ranges, $200-400$ MeV for muons and $100-200$ MeV for pions.

Submitted 6 May, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

Report number: FERMILAB-PUB-24-0142-CSAID-PPD

arXiv:2404.09793 [pdf, other]

First Search for Light Fermionic Dark Matter Absorption on Electrons Using Germanium Detector in CDEX-10 Experiment

Authors: J. X. Liu, L. T. Yang, Q. Yue, K. J. Kang, Y. J. Li, H. P. An, Greeshma C., J. P. Chang, Y. H. Chen, J. P. Cheng, W. H. Dai, Z. Deng, C. H. Fang, X. P. Geng, H. Gong, Q. J. Guo, T. Guo, X. Y. Guo, L. He, J. R. He, J. W. Hu, H. X. Huang, T. C. Huang, L. Jiang, S. Karmakar , et al. (61 additional authors not shown)

Abstract: We present the first results of the search for sub-MeV fermionic dark matter absorbed by electron targets of Germanium using the 205.4~kg$\cdot$day data collected by the CDEX-10 experiment, with the analysis threshold of 160~eVee. No significant dark matter (DM) signals over the background are observed. Results are presented as limits on the cross section of DM--electron interaction. We present new constraints of cross section in the DM range of 0.1--10 keV/$c^2$ for vector and axial-vector interaction. The upper limit on the cross section is set to be $\rm 5.5\times10^{-46}~cm^2$ for vector interaction, and $\rm 1.8\times10^{-46}~cm^2$ for axial-vector interaction at DM mass of 5 keV/$c^2$.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 6 pages, 4 figures

arXiv:2404.09446 [pdf, other]

The final burst of the moving mirror is unrelated to the partner mode of analog Hawking radiation

Authors: Yuki Osawa, Kuan-Nan Lin, Yasusada Nambu, Masahiro Hotta, Pisin Chen

Abstract: Flying mirrors with appropriate trajectories have been recognized as an analog system that mimics black hole Hawking evaporation and have been widely investigated. It has recently been suggested that the partner mode of the analog Hawking radiation emitted from a moving mirror would manifest itself through a final burst when the mirror executes a sudden stop. Here we argue the opposite via the partner formula for the moving mirror model. By expanding the theoretical foundation of the partner formula and augmenting it with numerical analysis, we demonstrate that the supposed final burst is induced by a shock that requires the input of external energy, whereas the Hawking radiation partner mode, which is associated with the zero-point vacuum fluctuations, is not responsible for the burst.

Submitted 15 April, 2024; originally announced April 2024.

Comments: 18 pages, 6 figures

arXiv:2404.06780 [pdf, other]

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

Authors: Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

Abstract: Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, there is no paradigm for scaling up the methodology to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, present a formidable barrier to the interpretability of ambiguous textual descriptions for effective model optimization. In this work, we surmount the limitations by introducing a compositional 3D layout representation into the text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Upon this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process with geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent the 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation to large-scale urban scenes that cover over 1000m driving distance for the first time. We also present various scene editing demonstrations, showing the powers of steerable urban scene generation. Website: https://urbanarchitect.github.io.

Submitted 10 April, 2024; originally announced April 2024.

Comments: Project page: https://urbanarchitect.github.io/

arXiv:2404.02927 [pdf]

Probing the band splitting near the $Γ$ point in the van der Waals magnetic semiconductor CrSBr

Authors: Kaiman Lin, Yi Li, Mahdi Ghorbani-Asl, Zdenek Sofer, Stephan Winnerl, Artur Erbe, Arkady V. Krasheninnikov, Manfred Helm, Shengqiang Zhou, Yaping Dan, Slawomir Prucnal

Abstract: This study investigates the electronic band structure of Chromium Sulfur Bromide (CrSBr) through comprehensive photoluminescence (PL) characterization. We clearly identify low-temperature optical transitions between two closely adjacent conduction-band states and two different valence-band states. The analysis of the PL data robustly unveils energy splittings, bandgaps and excitonic transitions across different thicknesses of CrSBr, ranging from monolayer to bulk. Temperature-dependent PL measurements elucidate the stability of the band splitting below the Néel temperature, suggesting that magnons coupled with excitons are responsible for the symmetry breaking and brightening of the transitions from the secondary conduction band minimum (CBM2) to the global valence band maximum (VBM1). Collectively, these results not only reveal band splitting in both the conduction and valence bands, but also point to an intricate interplay between the optical, electronic and magnetic properties of antiferromagnetic two-dimensional van der Waals crystals.

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.01294 [pdf, other]

CosmicMan: A Text-to-Image Foundation Model for Humans

Authors: Shikai Li, Jianglin Fu, Kaiyuan Liu, Wentao Wang, Kwan-Yee Lin, Wayne Wu

Abstract: We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans, CosmicMan enables generating photo-realistic human images with meticulous appearance, reasonable structure, and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence, we propose a new data production paradigm, Annotate Anyone, which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time. Based on this, we constructed a large-scale dataset, CosmicMan-HQ 1.0, with 6 Million high-quality real-world human images in a mean resolution of 1488x1255, and attached with precise text annotations deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic -- easy to integrate into down-streaming tasks while effective in producing high-quality human images. Hence, we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner, and present Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion model, and enforces attention refocusing without adding extra modules. Through Daring, we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze.

Submitted 1 April, 2024; originally announced April 2024.

Comments: Accepted by CVPR 2024. The supplementary material is included. Project Page: https://cosmicman-cvpr2024.github.io

arXiv:2404.00304 [pdf]

doi 10.1126/science.adn1555

Ultrafast Kapitza-Dirac effect

Authors: Kang Lin, Sebastian Eckart, Hao Liang, Alexander Hartung, Sina Jacob, Qinying Ji, Lothar Ph. H. Schmidt, Markus S. Schöffler, Till Jahnke, Maksim Kunitski, Reinhard Dörner

Abstract: Similar to the optical diffraction of light passing through a material grating, the Kapitza-Dirac effect occurs when an electron is diffracted by a standing light wave. In its original description the effect is time-independent. In the present work, we extend the Kapitza-Dirac concept to the time domain. By tracking the spatiotemporal evolution of a pulsed electron wave packet diffracted by a femtosecond ($10^{-15}$ second) standing wave pulse in a pump-probe scheme, we observe so far unseen time-dependent diffraction patterns. The fringe spacing in the observed pattern differs from that generated by the conventional Kapitza-Dirac effect, moreover it decreases as the pump-probe delay time increases. By exploiting this time-resolved diffraction scheme, we gather access to the time evolution of the previously inaccessible phase properties of a free electron.

Submitted 30 March, 2024; originally announced April 2024.

Journal ref: Science 2024

arXiv:2403.20276 [pdf, other]

Constraints on the Blazar-Boosted Dark Matter from the CDEX-10 Experiment

Authors: R. Xu, L. T. Yang, Q. Yue, K. J. Kang, Y. J. Li, H. P. An, Greeshma C., J. P. Chang, Y. H. Chen, J. P. Cheng, W. H. Dai, Z. Deng, C. H. Fang, X. P. Geng, H. Gong, Q. J. Guo, T. Guo, X. Y. Guo, L. He, S. M. He, J. W. Hu, H. X. Huang, T. C. Huang, L. Jiang, S. Karmakar, et al. (59 additional authors not shown)

Abstract: We report new constraints on light dark matter (DM) boosted by blazars using the 205.4 kg day data from the CDEX-10 experiment located at the China Jinping Underground Laboratory. Two representative blazars, TXS 0506+56 and BL Lacertae are studied. The results derived from TXS 0506+56 exclude DM-nucleon elastic scattering cross sections from $4.6\times 10^{-33}\ \rm cm^2$ to $1\times10^{-26}\ \rm cm^2$ for DM masses between 10 keV and 1 GeV, and the results derived from BL Lacertae exclude DM-nucleon elastic scattering cross sections from $2.4\times 10^{-34}\ \rm cm^2$ to $1\times10^{-26}\ \rm cm^2$ for the same range of DM masses. The constraints correspond to the best sensitivities among solid-state detector experiments in the sub-MeV mass range.

Submitted 29 March, 2024; originally announced March 2024.

Comments: 7 pages, 4 figures

Showing 1–50 of 704 results for author: Lin, K