arXiv:2407.05713 [pdf, other]

Short-term Object Interaction Anticipation with Disentangled Object Detection @ Ego4D Short Term Object Interaction Anticipation Challenge

Authors: Hyunjin Cho, Dong Un Kang, Se Young Chun

Abstract: Short-term object interaction anticipation is an important task in egocentric video analysis, including precise predictions of future interactions and their timings as well as the categories and positions of the involved active objects. To alleviate the complexity of this task, our proposed method, SOIA-DOD, effectively decomposes it into 1) detecting active objects and 2) classifying interactions and predicting their timing. Our method first detects all potential active objects in the last frame of egocentric video by fine-tuning a pre-trained YOLOv9. Then, we combine these potential active objects as queries with a transformer encoder, thereby identifying the most promising next active object and predicting its future interaction and time-to-contact. Experimental results demonstrate that our method outperforms state-of-the-art models on the challenge test set, achieving the best performance in predicting next active objects and their interactions. Finally, our proposed method ranked third in overall top-5 mAP when including time-to-contact predictions. The source code is available at https://github.com/KeenyJin/SOIA-DOD.

Submitted 8 July, 2024; originally announced July 2024.

Comments: 4 pages

arXiv:2407.05551 [pdf, other]

Read, Watch and Scream! Sound Generation from Text and Video

Authors: Yujin Jeong, Yunji Kim, Sanghyuk Chun, Jiyoung Lee

Abstract: Multimodal generative models have shown impressive advances with the help of powerful diffusion models. Despite the progress, generating sound solely from text poses challenges in ensuring comprehensive scene depiction and temporal alignment. Meanwhile, video-to-sound generation limits the flexibility to prioritize sound synthesis for specific objects within the scene. To tackle these challenges, we propose a novel video-and-text-to-sound generation method, called ReWaS, where video serves as a conditional control for a text-to-audio generation model. Our method estimates the structural information of audio (namely, energy) from the video while receiving key content cues from a user prompt. We employ a well-performing text-to-sound model to consolidate the video control, which is much more efficient for training multimodal diffusion models with massive triplet-paired (audio-video-text) data. In addition, by separating the generative components of audio, it becomes a more flexible system that allows users to freely adjust the energy, surrounding environment, and primary sound source according to their preferences. Experimental results demonstrate that our method shows superiority in terms of quality, controllability, and training efficiency. Our demo is available at https://naver-ai.github.io/rewas

Submitted 7 July, 2024; originally announced July 2024.

Comments: Project page: https://naver-ai.github.io/rewas

arXiv:2406.12014 [pdf, other]

An IXPE-Led X-ray Spectro-Polarimetric Campaign on the Soft State of Cygnus X-1: X-ray Polarimetric Evidence for Strong Gravitational Lensing

Authors: James F. Steiner, Edward Nathan, Kun Hu, Henric Krawczynski, Michal Dovciak, Alexandra Veledina, Fabio Muleri, Jiri Svoboda, Kevin Alabarta, Maxime Parra, Yash Bhargava, Giorgio Matt, Juri Poutanen, Pierre-Olivier Petrucci, Allyn F. Tennant, M. Cristina Baglio, Luca Baldini, Samuel Barnier, Sudip Bhattacharyya, Stefano Bianchi, Maimouna Brigitte, Mauricio Cabezas, Floriane Cangemi, Fiamma Capitanio, Jacob Casey, et al. (112 additional authors not shown)

Abstract: We present the first X-ray spectropolarimetric results for Cygnus X-1 in its soft state from a campaign of five IXPE observations conducted during 2023 May-June. Companion multiwavelength data during the campaign are likewise shown. The 2-8 keV X-rays exhibit a net polarization degree PD=1.99%+/-0.13% (68% confidence). The polarization signal is found to increase with energy across IXPE's 2-8 keV bandpass. The polarized X-rays exhibit an energy-independent polarization angle of PA=-25.7+/-1.8 deg. East of North (68% confidence). This is consistent with being aligned to Cyg X-1's AU-scale compact radio jet and its pc-scale radio lobes. In comparison to earlier hard-state observations, the soft state exhibits a factor of 2 lower polarization degree, but a similar trend with energy and a similar (also energy-independent) position angle. When scaling by the natural unit of the disk temperature, we find the appearance of a consistent trendline in the polarization degree between soft and hard states. Our favored polarimetric model indicates Cyg X-1's spin is likely high (a* above ~0.96). The substantial X-ray polarization in Cyg X-1's soft state is most readily explained as resulting from a large portion of X-rays emitted from the disk returning and reflecting off the disk surface, generating a high polarization degree and a polarization direction parallel to the black hole spin axis and radio jet. In IXPE's bandpass, the polarization signal is dominated by the returning reflection emission. This constitutes polarimetric evidence for strong gravitational lensing of X-rays close to the black hole.

Submitted 17 June, 2024; originally announced June 2024.

Comments: 20 pages, accepted for publication in ApJL

arXiv:2406.09188 [pdf, ps, other]

Reducing Task Discrepancy of Text Encoders for Zero-Shot Composed Image Retrieval

Authors: Jaeseok Byun, Seokhyeon Jeong, Wonjae Kim, Sanghyuk Chun, Taesup Moon

Abstract: Composed Image Retrieval (CIR) aims to retrieve a target image based on a reference image and conditioning text, enabling controllable searches. Due to the expensive dataset construction cost for CIR triplets, a zero-shot (ZS) CIR setting has been actively studied to eliminate the need for human-collected triplet datasets. The mainstream of ZS-CIR employs an efficient projection module that projects a CLIP image embedding to the CLIP text token embedding space, while fixing the CLIP encoders. Using the projected image embedding, these methods generate image-text composed features by using the pre-trained text encoder. However, their CLIP image and text encoders suffer from the task discrepancy between the pre-training task (text $\leftrightarrow$ image) and the target CIR task (image + text $\leftrightarrow$ image). Conceptually, we need expensive triplet samples to reduce the discrepancy, but we use cheap text triplets instead and update the text encoder. To that end, we introduce the Reducing Task Discrepancy of text encoders for Composed Image Retrieval (RTD), a plug-and-play training scheme for the text encoder that enhances its capability using a novel target-anchored text contrastive learning. We also propose two additional techniques to improve the proposed learning scheme: a hard negatives-based refined batch sampling strategy and a sophisticated concatenation scheme. Integrating RTD into the state-of-the-art projection-based ZS-CIR methods significantly improves performance across various datasets and backbones, demonstrating its efficiency and generalizability.

Submitted 13 June, 2024; originally announced June 2024.

Comments: 17 pages

arXiv:2404.17507 [pdf, other]

HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts

Authors: Wonjae Kim, Sanghyuk Chun, Taekyung Kim, Dongyoon Han, Sangdoo Yun

Abstract: In an era where the volume of data drives the effectiveness of self-supervised learning, the specificity and clarity of data semantics play a crucial role in model training. Addressing this, we introduce HYPerbolic Entailment filtering (HYPE), a novel methodology designed to meticulously extract modality-wise meaningful and well-aligned data from extensive, noisy image-text pair datasets. Our approach leverages hyperbolic embeddings and the concept of entailment cones to evaluate and filter out samples with meaningless or underspecified semantics, focusing on enhancing the specificity of each data sample. HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark when combined with existing filtering techniques. This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models. Additionally, the image specificity $\epsilon_{i}$ can be independently applied to induce an image-only dataset from an image-text or image-only data pool for training image-only self-supervised models and showed superior performance when compared to the dataset induced by CLIP score.

Submitted 26 April, 2024; originally announced April 2024.

Comments: 28 pages, 4.5MB

arXiv:2404.04544 [pdf, other]

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Authors: Gwanghyun Kim, Hayeon Kim, Hoigi Seo, Dong Un Kang, Se Young Chun

Abstract: Generating higher-resolution human-centric scenes with details and controls remains a challenge for existing text-to-image diffusion models. This challenge stems from limited training image size, text encoder capacity (limited tokens), and the inherent difficulty of generating complex scenes involving multiple humans. While current methods attempted to address training size limit only, they often yielded human-centric scenes with severe artifacts. We propose BeyondScene, a novel framework that overcomes prior limitations, generating exquisite higher-resolution (over 8K) human-centric scenes with exceptional text-image correspondence and naturalness using existing pretrained diffusion models. BeyondScene employs a staged and hierarchical approach to initially generate a detailed base image focusing on crucial elements in instance creation for multiple humans and detailed descriptions beyond token limit of diffusion model, and then to seamlessly convert the base image to a higher-resolution output, exceeding training image size and incorporating details aware of text and instances via our novel instance-aware hierarchical enlargement process that consists of our proposed high-frequency injected forward diffusion and adaptive joint diffusion. BeyondScene surpasses existing methods in terms of correspondence with detailed text descriptions and naturalness, paving the way for advanced applications in higher-resolution human-centric scene creation beyond the capacity of pretrained diffusion models without costly retraining. Project page: https://janeyeon.github.io/beyond-scene.

Submitted 6 April, 2024; originally announced April 2024.

Comments: Project page: https://janeyeon.github.io/beyond-scene

arXiv:2404.01954 [pdf, other]

HyperCLOVA X Technical Report

Authors: Kang Min Yoo, Jaegeun Han, Sookyo In, Heewon Jeon, Jisu Jeong, Jaewook Kang, Hyunwook Kim, Kyung-Min Kim, Munhyong Kim, Sungju Kim, Donghyun Kwak, Hanock Kwak, Se Jung Kwon, Bado Lee, Dongsoo Lee, Gichang Lee, Jooho Lee, Baeseong Park, Seongjin Shin, Joonsang Yu, Seolki Baek, Sumin Byeon, Eungsup Cho, Dooseok Choe, Jeesung Han, et al. (371 additional authors not shown)

Abstract: We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in developing their sovereign LLMs.

Submitted 13 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 44 pages; updated authors list and fixed author names

arXiv:2403.18260 [pdf, other]

Toward Interactive Regional Understanding in Vision-Large Language Models

Authors: Jungbeom Lee, Sanghyuk Chun, Sangdoo Yun

Abstract: Recent Vision-Language Pre-training (VLP) models have demonstrated significant advancements. Nevertheless, these models heavily rely on image-text pairs that capture only coarse and global information of an image, leading to a limitation in their regional understanding ability. In this work, we introduce RegionVLM, equipped with explicit regional modeling capabilities, allowing them to understand user-indicated image regions. To achieve this, we design a simple yet innovative architecture, requiring no modifications to the model architecture or objective function. Additionally, we leverage a dataset that contains a novel source of information, namely Localized Narratives, which has been overlooked in previous VLP research. Our experiments demonstrate that our single generalist model not only achieves an interactive dialogue system but also exhibits superior performance on various zero-shot region understanding tasks, without compromising its ability for global image understanding.

Submitted 27 March, 2024; originally announced March 2024.

Comments: NAACL 2024 Main Conference

arXiv:2403.13993 [pdf, ps, other]

Is the RSGC4 (Alicante 8) cluster a real star cluster?: Peculiar radial velocities of red supergiant stars

Authors: Sang-Hyun Chun, GyuChul Myeong, Jae-Joon Lee, Heeyoung Oh

Abstract: Young massive star clusters, like the six red supergiant clusters in the Scutum complex, provide valuable insights into star-formation and galaxy structures. We investigated the high-resolution near-infrared spectra of 60 RSG candidates in these clusters using the Immersion Grating Infrared Spectrograph. Among the candidates in RSGC4, we found significant scattering in radial velocity ($-64$ km/s to $115$ km/s), unlike other clusters with velocities of $\sim$100 km/s. Most candidates in RSGC4 have $Q_{GK_s}$ values larger than 1.7, suggesting that they could be early AGB stars. Four candidates in RSGC4 exhibit infrared excess and distinct absorption features absent in other candidates. Two of these stars exhibit absorption lines resembling those of D-type symbiotic stars, showing radial velocity changes in multi-epoch observations. Analysis of relative proper motions revealed no runaway/walkaway stars in RSGC4. The dynamic properties of RSGC4 and RSGC1 differ from the disk-like motions of other clusters: RSGC4 has low normalized horizontal action $J_\mathrm{hor}=J_\phi/J_\mathrm{tot}$ and vertical action $J_\mathrm{ver}=(J_\mathrm{z}-J_\mathrm{R})/J_\mathrm{tot}$ values and high eccentricities, while RSGC1 has vertical motions with high $J_\mathrm{ver}$ values and inclinations. We propose that RSGC4 may not be a genuine star cluster but rather a composite of RSGs and AGBs distributed along the line of sight at similar distances, possibly originating from various environments. Our results suggest a complex and hierarchical secular evolution of star clusters in the Scutum complex, emphasizing the importance of considering factors beyond density crowding when identifying star clusters in the bulge regions.

Submitted 20 March, 2024; originally announced March 2024.

Comments: 22 pages, 9 figures, 2 tables, accepted for publication in AJ

arXiv:2403.04460 [pdf, other]

Pearl: A Review-driven Persona-Knowledge Grounded Conversational Recommendation Dataset

Authors: Minjin Kim, Minju Kim, Hana Kim, Beong-woo Kwak, Soyeon Chun, Hyunseo Kim, SeongKu Kang, Youngjae Yu, Jinyoung Yeo, Dongha Lee

Abstract: Conversational recommender systems are an emerging area that has garnered increasing interest in the community, especially with the advancements in large language models (LLMs) that enable diverse reasoning over conversational input. Despite the progress, the field has many aspects left to explore. The currently available public datasets for conversational recommendation lack specific user preferences and explanations for recommendations, hindering high-quality recommendations. To address such challenges, we present a novel conversational recommendation dataset named PEARL, synthesized with persona- and knowledge-augmented LLM simulators. We obtain detailed persona and knowledge from real-world reviews and construct a large-scale dataset with over 57k dialogues. Our experimental results demonstrate that utterances in PEARL include more specific user preferences, show expertise in the target domain, and provide recommendations more relevant to the dialogue context than those in prior datasets.

Submitted 8 June, 2024; v1 submitted 7 March, 2024; originally announced March 2024.

Comments: Published at ACL 2024 Findings

arXiv:2402.15229 [pdf, other]

doi 10.1016/j.astropartphys.2024.102944

Systematic effects on a Compton polarimeter at the focus of an X-ray mirror

Authors: M. Aoyagi, R. G. Bose, S. Chun, E. Gau, K. Hu, K. Ishiwata, N. K. Iyer, F. Kislat, M. Kiss, K. Klepper, H. Krawczynski, L. Lisalda, Y. Maeda, F. af Malmborg, H. Matsumoto, A. Miyamoto, T. Miyazawa, M. Pearce, B. F. Rauch, N. Rodriguez Cavero, S. Spooner, H. Takahashi, Y. Uchida, A. T. West, K. Wimalasena, et al. (1 additional author not shown)

Abstract: XL-Calibur is a balloon-borne Compton polarimeter for X-rays in the $\sim$15-80 keV range. Using an X-ray mirror with a 12 m focal length for collecting photons onto a beryllium scattering rod surrounded by CZT detectors, a minimum-detectable polarization as low as $\sim$3% is expected during a 24-hour on-target observation of a 1 Crab source at 45$^{\circ}$ elevation. Systematic effects alter the reconstructed polarization as the mirror focal spot moves across the beryllium scatterer, due to pointing offsets, mechanical misalignment or deformation of the carbon-fiber truss supporting the mirror and the polarimeter. Unaddressed, this can give rise to a spurious polarization signal for an unpolarized flux, or a change in reconstructed polarization fraction and angle for a polarized flux. Using bench-marked Monte-Carlo simulations and an accurate mirror point-spread function characterized at synchrotron beam-lines, systematic effects are quantified, and mitigation strategies discussed. By recalculating the scattering site for a shifted beam, systematic errors can be reduced from several tens of percent to the few-percent level for any shift within the scattering element. The treatment of these systematic effects will be important for any polarimetric instrument where a focused X-ray beam is impinging on a scattering element surrounded by counting detectors.

Submitted 23 February, 2024; originally announced February 2024.

Comments: Submitted to Astroparticle Physics

Journal ref: Astropart. Phys. 158 (2024) 102944

arXiv:2401.05513 [pdf, other]

doi 10.1103/PhysRevFluids.9.043302

Flow rate-pressure drop relations for shear-thinning fluids in deformable configurations: theory and experiments

Authors: SungGyu Chun, Evgeniy Boyko, Ivan C. Christov, Jie Feng

Abstract: We provide an experimental framework to measure the flow rate-pressure drop relation for Newtonian and shear-thinning fluids in two common deformable configurations: (i) a rectangular channel and (ii) an axisymmetric tube. Using the Carreau model to describe the shear-dependent viscosity, we identify the key dimensionless rheological number, $Cu$, which characterizes shear thinning, and we show that our experiments lie within the power-law regime of shear rates. To rationalize the experimental data, we derive the flow rate-pressure drop relation taking into account the two-way-coupled fluid-structure interaction between the flow and its compliant confining boundaries. We thus identify the second key dimensionless number, $\alpha$, which characterizes the compliance of the conduit. We then compare the theoretical flow rate-pressure drop relation to our experimental measurements, finding excellent agreement between the two. We further contrast our results for shear-thinning and Newtonian fluids to highlight the influence of $Cu$ on the flow rate-pressure drop relation. Finally, we delineate four distinct physical regimes of flow and deformation by mapping our experimental flow rate-pressure drop data for Newtonian and shear-thinning fluids into a $Cu$-$\alpha$ plane.

Submitted 25 April, 2024; v1 submitted 10 January, 2024; originally announced January 2024.

Comments: 10 pages, 4 figures; v2 accepted for publication in Physical Review Fluids

Journal ref: Phys. Rev. Fluids 9 (2024) 043302

arXiv:2312.13027 [pdf, other]

Doubly Perturbed Task Free Continual Learning

Authors: Byung Hyun Lee, Min-hwan Oh, Se Young Chun

Abstract: Task Free online continual learning (TF-CL) is a challenging problem where the model incrementally learns tasks without explicit task information. Although training with entire data from the past, present as well as future is considered as the gold standard, naive approaches in TF-CL with the current samples may conflict with learning with samples in the future, leading to catastrophic forgetting and poor plasticity. Thus, a proactive consideration of an unseen future sample in TF-CL becomes imperative. Motivated by this intuition, we propose a novel TF-CL framework considering future samples and show that injecting adversarial perturbations on both input data and decision-making is effective. Then, we propose a novel method named Doubly Perturbed Continual Learning (DPCL) to efficiently implement these input and decision-making perturbations. Specifically, for input perturbation, we propose an approximate perturbation method that injects noise into the input data as well as the feature vector and then interpolates the two perturbed samples. For decision-making process perturbation, we devise multiple stochastic classifiers. We also investigate a memory management scheme and learning rate scheduling reflecting our proposed double perturbations. We demonstrate that our proposed method outperforms the state-of-the-art baseline methods by large margins on various TF-CL benchmarks.

Submitted 18 February, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI 2024 (Oral)

arXiv:2312.07425 [pdf, other]

Deep Internal Learning: Deep Learning from a Single Input

Authors: Tom Tirer, Raja Giryes, Se Young Chun, Yonina C. Eldar

Abstract: Deep learning, in general, focuses on training a neural network from large labeled datasets. Yet, in many cases there is value in training a network just from the input at hand. This is particularly relevant in many signal and image processing problems where training data is scarce and diversity is large on the one hand, and on the other, there is a lot of structure in the data that can be exploited. Using this information is the key to deep internal-learning strategies, which may involve training a network from scratch using a single input or adapting an already trained network to a provided input example at inference time. This survey paper aims at covering deep internal-learning techniques that have been proposed in the past few years for these two important directions. While our main focus will be on image processing problems, most of the approaches that we survey are derived for general signals (vectors with recurring patterns that can be distinguished from noise) and are therefore applicable to other modalities.

Submitted 8 April, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: Accepted to IEEE Signal Processing Magazine

arXiv:2312.01998 [pdf, other]

Language-only Efficient Training of Zero-shot Composed Image Retrieval

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, Yoohoon Kang, Sangdoo Yun

Abstract: The composed image retrieval (CIR) task takes a composed query of image and text, aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image, query text, and target image, which is very expensive to collect. Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However, the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework, only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then, we let the new and original texts have the same latent embedding vector. With this simple strategy, LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks, CIRCO, GeneCIS, FashionIQ, and CIRR, even outperforming a supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir

Submitted 31 March, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

Comments: CVPR 2024 camera-ready; First two authors contributed equally; 17 pages, 3.1MB

arXiv:2312.01689 [pdf, other]

Fast and accurate sparse-view CBCT reconstruction using meta-learned neural attenuation field and hash-encoding regularization

Authors: Heejun Shin, Taehee Kim, Jongho Lee, Se Young Chun, Seungryung Cho, Dongmyung Shin

Abstract: Cone beam computed tomography (CBCT) is an emerging medical imaging technique to visualize the internal anatomical structures of patients. During a CBCT scan, several projection images of different angles or views are collectively utilized to reconstruct a tomographic image. However, reducing the number of projections in a CBCT scan while preserving the quality of a reconstructed image is challenging due to the nature of an ill-posed inverse problem. Recently, a neural attenuation field (NAF) method was proposed by adopting a neural radiance field algorithm as a new way for CBCT reconstruction, demonstrating fast and promising results using only 50 views. However, decreasing the number of projections is still preferable to reduce potential radiation exposure, and a faster reconstruction time is required considering a typical scan time. In this work, we propose a fast and accurate sparse-view CBCT reconstruction (FACT) method to provide better reconstruction quality and faster optimization speed in the minimal number of view acquisitions ($<$ 50 views). In the FACT method, we meta-trained a neural network and a hash-encoder using a few scans (= 15), and a new regularization technique is utilized to reconstruct the details of an anatomical structure. In conclusion, we have shown that the FACT method produced better, and faster reconstruction results over the other conventional algorithms based on CBCT scans of different body parts (chest, head, and abdomen) and CT vendors (Siemens, Phillips, and GE).

Submitted 16 January, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.18654 [pdf, other]

Detailed Human-Centric Text Description-Driven Large Scene Synthesis

Authors: Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun

Abstract: Text-driven large scene image synthesis has made significant progress with diffusion models, but controlling it is challenging. While using additional spatial controls with corresponding texts has improved the controllability of large scene synthesis, it is still challenging to faithfully reflect detailed text descriptions without user-provided controls. Here, we propose DetText2Scene, a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness in a global context for the detailed human-centric text description. Our DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description by leveraging large language model (LLM), 2) view-wise conditioned joint diffusion process to synthesize a large scene from the given detailed text with LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene for global coherence. Our DetText2Scene significantly outperforms prior arts in text-to-large scene synthesis qualitatively and quantitatively, demonstrating strong faithfulness with detailed descriptions, superior controllability, and excellent naturalness in a global context.

Submitted 30 November, 2023; originally announced November 2023.

arXiv:2311.18387 [pdf, other]

On Exact Inversion of DPM-Solvers

Authors: Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun

Abstract: Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly, but have posed challenges to find the exact inverse (i.e., finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers, we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions, greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. Project page: https://smhongok.github.io/inv-dpm.html.

Submitted 30 November, 2023; originally announced November 2023.

Comments: 16 pages

arXiv:2311.08461 [pdf, other]

Chemical homogeneity of wide binary system: An approach from Near-Infrared spectroscopy

Authors: Dongwook Lim, Andreas J. Koch-Hansen, Seungsoo Hong, Sang-Hyun Chun, Young-Wook Lee

Abstract: Wide binaries, with separations between two stars from a few AU to more than several thousand AU, are valuable objects for various research topics in Galactic astronomy. As the number of newly reported wide binaries continues to increase, studying the chemical abundances of their component stars becomes more important. We conducted high-resolution near-infrared (NIR) spectroscopy for six pairs of wide binary candidates using the Immersion Grating Infrared Spectrometer (IGRINS) at the Gemini-South telescope. One pair was excluded from the wide binary samples due to a significant difference in radial velocity between its component stars, while the remaining five pairs exhibited homogeneous properties in 3D motion and chemical composition among the pair stars. The differences in [Fe/H] ranged from 0.00 to 0.07 dex for these wide binary pairs. The abundance differences between components are comparable to the previous results from optical spectroscopy for other samples. In addition, when combining our data with literature data, it appears that the variation of abundance differences increases in wide binaries with larger separations. However, the SVO2324 and SVO3206 showed minimal differences in most elements despite their large separation, supporting the concept of multiple formation mechanisms depending on each wide binary. This study is the first approach to the chemical properties of wide binaries based on NIR spectroscopy. Our results further highlight that NIR spectroscopy is an effective tool for stellar chemical studies based on equivalent measurements of chemical abundances from the two stars in each wide binary system.

Submitted 14 November, 2023; originally announced November 2023.

Comments: 16 pages, 9 figures, accepted for publication in AJ

arXiv:2311.01001 [pdf, other]

Fully Quantized Always-on Face Detector Considering Mobile Image Sensors

Authors: Haechang Lee, Wongi Jeong, Dongil Ryu, Hyunwoo Je, Albert No, Kijeong Kim, Se Young Chun

Abstract: Despite significant research on lightweight deep neural networks (DNNs) designed for edge devices, the current face detectors do not fully meet the requirements for "intelligent" CMOS image sensors (iCISs) integrated with embedded DNNs. These sensors are essential in various practical applications, such as energy-efficient mobile phones and surveillance systems with always-on capabilities. One noteworthy limitation is the absence of suitable face detectors for the always-on scenario, a crucial aspect of image sensor-level applications. These detectors must operate directly with sensor RAW data before the image signal processor (ISP) takes over. This gap poses a significant challenge in achieving optimal performance in such scenarios. Further research and development are necessary to bridge this gap and fully leverage the potential of iCIS applications. In this study, we aim to bridge the gap by exploring extremely low-bit lightweight face detectors, focusing on the always-on face detection scenario for mobile image sensor applications. To achieve this, our proposed model utilizes sensor-aware synthetic RAW inputs, simulating always-on face detection processed "before" the ISP chain. Our approach employs ternary (-1, 0, 1) weights for potential implementations in image sensors, resulting in a relatively simple network architecture with shallow layers and extremely low-bitwidth. Our method demonstrates reasonable face detection performance and excellent efficiency in simulation studies, offering promising possibilities for practical always-on face detectors in real-world applications.

Submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted to ICCV 2023 Workshop on Low-Bit Quantized Neural Networks (LBQNN), Oral

arXiv:2310.13593 [pdf, other]

Learning with Unmasked Tokens Drives Stronger Vision Learners

Authors: Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

Abstract: Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder reconstructing the masked tokens to the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens only, which may impede the encoder's broader context learning. To tackle the limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method enables the encoder to learn from broader context supervision, allowing unmasked tokens to experience broader contexts while the decoder reconstructs masked tokens. Thus, the encoded unmasked tokens are equipped with extensive contextual information, empowering masked tokens to leverage the enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations revealed by achieving 84.2% top-1 accuracy with ViT-B on ImageNet-1K with 0.6%p gain. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains at the downstream semantic segmentation and fine-grained visual classification tasks; and on diverse robust evaluation metrics. Code is available at https://github.com/naver-ai/lut

Submitted 23 April, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.11125 [pdf, other]

IXPE observation confirms a high spin in the accreting black hole 4U 1957+115

Authors: L. Marra, M. Brigitte, N. Rodriguez Cavero, S. Chun, J. F. Steiner, M. Dovčiak, M. Nowak, S. Bianchi, F. Capitanio, A. Ingram, G. Matt, F. Muleri, J. Podgorný, J. Poutanen, J. Svoboda, R. Taverna, F. Ursini, A. Veledina, A. De Rosa, J. A. Garcia, A. A. Lutovinov, I. A. Mereminskiy, R. Farinelli, S. Gunji, P. Kaaret, et al. (91 additional authors not shown)

Abstract: We present the results of the first X-ray polarimetric observation of the low-mass X-ray binary 4U 1957+115, performed with the Imaging X-ray Polarimetry Explorer in May 2023. The binary system has been in a high-soft spectral state since its discovery and is thought to host a black hole. The $\sim$571 ks observation reveals a linear polarisation degree of $1.9\% \pm 0.6\%$ and a polarisation angle of $-41^\circ.8 \pm 7^\circ.9$ in the 2-8 keV energy range. Spectral modelling is consistent with the dominant contribution coming from the standard accretion disc, while polarimetric data suggest a significant role of returning radiation: photons that are bent by strong gravity effects and forced to return to the disc surface, where they can be reflected before eventually reaching the observer. In this setting, we find that models with a black hole spin lower than 0.96 and an inclination lower than $50^\circ$ are disfavoured.

Submitted 8 February, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 12 pages, 10 figures, 2 tables, accepted for publication in A&A

arXiv:2310.10847 [pdf, other]

doi 10.1103/PhysRevB.108.155122

Non-local features of the spin-orbit exciton in Kitaev materials

Authors: Blair W. Lebert, Subin Kim, Beom Hyun Kim, Sae Hwan Chun, Diego Casa, Jaewon Choi, Stefano Agrestini, Kejin Zhou, Mirian Garcia-Fernandez, Young-June Kim

Abstract: A comparative resonant inelastic x-ray scattering (RIXS) study of three well-known Kitaev materials is presented: $α$-Li$_2$IrO$_3$, Na$_2$IrO$_3$, and $α$-RuCl$_3$. Despite similar low-energy physics, these materials show distinct electronic properties, such as the large difference in the size of the charge gap. The RIXS spectra of the spin-orbit exciton for these materials show remarkably similar three-peak features, including sharp low energy peak (peak A) as well as transitions between $j_{\text{eff}}=1/2$ and $j_{\text{eff}}=3/2$ states. Comparison of experimental spectra with cluster calculations reveals that the observed three-peak structure reflects the significant role that non-local physics plays in the electronic structure of these materials. In particular, the low-energy peak A arises from a holon-doublon pair rather than a conventional particle-hole exciton as proposed earlier. Our study suggests that while spin-orbit assisted Mott insulator is still the best description for these materials, electron itinerancy cannot be ignored when formulating low-energy Hamiltonian of these materials.

Submitted 16 October, 2023; originally announced October 2023.

Journal ref: Phys. Rev. B 108, 155122, 2023

arXiv:2309.10813 [pdf, other]

doi 10.3847/1538-4357/ad0842

First X-ray polarization measurement confirms the low black-hole spin in LMC X-3

Authors: Jiří Svoboda, Michal Dovčiak, James F. Steiner, Fabio Muleri, Adam Ingram, Anastasiya Yilmaz, Nicole Rodriguez Cavero, Lorenzo Marra, Juri Poutanen, Alexandra Veledina, Mehrnoosh Rahbardar Mojaver, Stefano Bianchi, Javier Garcia, Philip Kaaret, Henric Krawczynski, Giorgio Matt, Jakub Podgorný, Martin C. Weisskopf, Fabian Kislat, Pierre-Olivier Petrucci, Maimouna Brigitte, Michal Bursa, Sergio Fabiani, Kun Hu, Sohee Chun, et al. (87 additional authors not shown)

Abstract: X-ray polarization is a powerful tool to investigate the geometry of accreting material around black holes, allowing independent measurements of the black hole spin and orientation of the innermost parts of the accretion disk. We perform the X-ray spectro-polarimetric analysis of an X-ray binary system in the Large Magellanic Cloud, LMC X-3, that hosts a stellar-mass black hole, known to be persistently accreting since its discovery. We report the first detection of the X-ray polarization in LMC X-3 with the Imaging X-ray Polarimetry Explorer, and find the average polarization degree of 3.2% +- 0.6% and a constant polarization angle -42 deg +- 6 deg over the 2-8 keV range. Using accompanying spectroscopic observations by NICER, NuSTAR, and the Neil Gehrels Swift observatories, we confirm previous measurements of the black hole spin via the X-ray continuum method, a ~ 0.2. From polarization analysis only, we found consistent results with low black-hole spin, with an upper limit of a < 0.7 at a 90% confidence level. A slight increase of the polarization degree with energy, similar to other black-hole X-ray binaries in the soft state, is suggested from the data but with a low statistical significance.

Submitted 19 September, 2023; originally announced September 2023.

Comments: 14 pages, 8 figures, submitted to ApJ

Journal ref: The Astrophysical Journal, 2024, 960, 3

arXiv:2308.14844 [pdf, other]

The SN 2023ixf Progenitor in M101: II. Properties

Authors: Schuyler D. Van Dyk, Sundar Srinivasan, Jennifer E. Andrews, Monika Soraisam, Tamas Szalai, Steve B. Howell, Howard Isaacson, Thomas Matheson, Erik Petigura, Peter Scicluna, Andrew W. Stephens, Judah Van Zandt, WeiKang Zheng, Sang-Hyun Chun, Alexei V. Filippenko

Abstract: We follow our first paper with an analysis of the ensemble of the extensive pre-explosion ground- and space-based infrared observations of the red supergiant (RSG) progenitor candidate for the nearby core-collapse supernova SN 2023ixf in Messier 101, together with optical data prior to explosion obtained with the Hubble Space Telescope (HST). We have confirmed the association of the progenitor candidate with the SN, as well as constrained the metallicity at the SN site, based on SN observations with instruments at Gemini-North. The internal host extinction to the SN has also been confirmed from a high-resolution Keck spectrum. We fit the observed spectral energy distribution (SED) for the star, accounting for its intrinsic variability, with dust radiative-transfer modeling, which assume a silicate-rich dust shell ahead of the underlying stellar photosphere. The star is heavily dust-obscured, likely the dustiest progenitor candidate yet encountered. We found median estimates of the star's effective temperature and luminosity of 2770 K and 9.0e4 L_Sun, with 68% credible intervals of 2340--3150 K and (7.5--10.9)e4 L_sun. The candidate may have a Galactic RSG analog, IRC -10414, with a strikingly similar SED and luminosity. Via comparison with single-star evolutionary models we have constrained the initial mass of the progenitor candidate from 12 M_sun to as high as 14 M_sun. We have had available to us an extraordinary view of the SN 2023ixf progenitor candidate, which should be further followed up in future years with HST and the James Webb Space Telescope.

Submitted 23 April, 2024; v1 submitted 28 August, 2023; originally announced August 2023.

Comments: 40 pages, substantive modifications relative to the previous, although the overall conclusions remain the same; to appear in AAS Journals

arXiv:2308.14374 [pdf, other]

Online Continual Learning on Hierarchical Label Expansion

Authors: Byung Hyun Lee, Okchul Jung, Jonghyun Choi, Se Young Chun

Abstract: Continual learning (CL) enables models to adapt to new tasks and environments without forgetting previously learned knowledge. While current CL setups have ignored the relationship between labels in the past task and the new task with or without small task overlaps, real-world scenarios often involve hierarchical relationships between old and new tasks, posing another challenge for traditional CL approaches. To address this challenge, we propose a novel multi-level hierarchical class incremental task configuration with an online learning constraint, called hierarchical label expansion (HLE). Our configuration allows a network to first learn coarse-grained classes, with data labels continually expanding to more fine-grained classes in various hierarchy depths. To tackle this new setup, we propose a rehearsal-based method that utilizes hierarchy-aware pseudo-labeling to incorporate hierarchical class information. Additionally, we propose a simple yet effective memory management and sampling strategy that selectively adopts samples of newly encountered classes. Our experiments demonstrate that our proposed method can effectively use hierarchy on our HLE setup to improve classification accuracy across all levels of hierarchies, regardless of depth and class imbalance ratio, outperforming prior state-of-the-art works by significant margins while also outperforming them on the conventional disjoint, blurry and i-Blurry CL setups.

Submitted 28 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023

arXiv:2308.13449 [pdf, other]

The Poison of Alignment

Authors: Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki

Abstract: From the perspective of content safety issues, alignment has shown to limit large language models' (LLMs) harmful content generation. This intentional method of reinforcing models to not respond to certain user inputs seem to be present in many modern open-source instruction tuning datasets such as OpenAssistant or Guanaco. We introduce a novel insight to an instruction-tuned model's performance affected by the presence of alignment in supervised fine-tuning dataset. To be specific, we noticed that alignment acts as if it is poisoning the instruction dataset. Experimentally, we demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model's on various reasoning benchmarks such as Big Bench (BBH), Massive Multitask Language Understanding (MMLU), Human Eval, and Discrete Reasoning Over Paragraphs (DROP), performing worse than the counterpart tuned without alignment by 4-33%.

Submitted 25 August, 2023; originally announced August 2023.

arXiv:2307.10667 [pdf, other]

Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors

Authors: Haechang Lee, Dongwon Park, Wongi Jeong, Kijeong Kim, Hyunwoo Je, Dongil Ryu, Se Young Chun

Abstract: As the physical size of recent CMOS image sensors (CIS) gets smaller, the latest mobile cameras are adopting unique non-Bayer color filter array (CFA) patterns (e.g., Quad, Nona, QxQ), which consist of homogeneous color units with adjacent pixels. These non-Bayer sensors are superior to conventional Bayer CFA thanks to their changeable pixel-bin sizes for different light conditions but may introduce visual artifacts during demosaicing due to their inherent pixel pattern structures and sensor hardware characteristics. Previous demosaicing methods have primarily focused on Bayer CFA, necessitating distinct reconstruction methods for non-Bayer patterned CIS with various CFA modes under different lighting conditions. In this work, we propose an efficient unified demosaicing method that can be applied to both conventional Bayer RAW and various non-Bayer CFAs' RAW data in different operation modes. Our Knowledge Learning-based demosaicing model for Adaptive Patterns, namely KLAP, utilizes CFA-adaptive filters for only 1% key filters in the network for each CFA, but still manages to effectively demosaic all the CFAs, yielding comparable performance to the large-scale models. Furthermore, by employing meta-learning during inference (KLAP-M), our model is able to eliminate unknown sensor-generic artifacts in real RAW data, effectively bridging the gap between synthetic images and real sensor RAW. Our KLAP and KLAP-M methods achieved state-of-the-art demosaicing performance in both synthetic and real RAW data of Bayer and non-Bayer CFAs.

Submitted 20 July, 2023; originally announced July 2023.

arXiv:2306.16615 [pdf, other]

Representation learning of vertex heatmaps for 3D human mesh reconstruction from multi-view images

Authors: Sungho Chun, Sungbum Park, Ju Yong Chang

Abstract: This study addresses the problem of 3D human mesh reconstruction from multi-view images. Recently, approaches that directly estimate the skinned multi-person linear model (SMPL)-based human mesh vertices based on volumetric heatmap representation from input images have shown good performance. We show that representation learning of vertex heatmaps using an autoencoder helps improve the performance of such approaches. Vertex heatmap autoencoder (VHA) learns the manifold of plausible human meshes in the form of latent codes using AMASS, which is a large-scale motion capture dataset. Body code predictor (BCP) utilizes the learned body prior from VHA for human mesh reconstruction from multi-view images through latent code-based supervision and transfer of pretrained weights. According to experiments on Human3.6M and LightStage datasets, the proposed method outperforms previous methods and achieves state-of-the-art human mesh reconstruction performance.

Submitted 28 June, 2023; originally announced June 2023.

Comments: ICIP 2023

arXiv:2306.10783 [pdf, other]

The SN 2023ixf Progenitor in M101: I. Infrared Variability

Authors: Monika D. Soraisam, Tamás Szalai, Schuyler D. Van Dyk, Jennifer E. Andrews, Sundar Srinivasan, Sang-Hyun Chun, Thomas Matheson, Peter Scicluna, Diego A. Vasquez-Torres

Abstract: Observational evidence points to a red supergiant (RSG) progenitor for SN 2023ixf. The progenitor candidate has been detected in archival images at wavelengths (>0.6 micron) where RSGs typically emit profusely. This object is distinctly variable in the infrared (IR). We characterize the variability using pre-explosion mid-IR (3.6 and 4.5 micron) Spitzer and ground-based near-IR (JHKs) archival data jointly covering 19 yr. The IR light curves exhibit significant variability with RMS amplitudes in the range of 0.2-0.4 mag, increasing with decreasing wavelength. From a robust period analysis of the more densely sampled Spitzer data, we measure a period of 1091+/-71 days. We demonstrate using Gaussian Process modeling that this periodicity is also present in the near-IR light curves, thus indicating a common physical origin, which is likely pulsational instability. We use a period-luminosity relation for RSGs to derive a value of M_K=-11.58+/-0.31 mag. Assuming a late M spectral type, this corresponds to log(L/L_sun)=5.27+/-0.12 at T_eff=3200 K and to log(L/L_sun)=5.37+/-0.12 at T_eff=3500 K. This gives an independent estimate of the progenitor's luminosity, unaffected by uncertainties in extinction and distance. Assuming the progenitor candidate underwent enhanced dust-driven mass-loss during the time of these archival observations, and using an empirical period-luminosity-based mass-loss prescription, we obtain a mass-loss rate of around (2-4)x10^-4 M_sun/yr. Comparing the above luminosity with stellar evolution models, we infer an initial mass for the progenitor candidate of 20+/-4 M_sun, making this one of the most massive progenitors for a Type II SN detected to-date.

Submitted 22 August, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

Comments: 20 pages, 2 tables, accepted to ApJ

arXiv:2306.09766 [pdf, other]

Ultrafast switching of topological invariants by light-driven strain

Authors: Tae Gwan Park, Seungil Baek, Junho Park, Eui-Cheol Shin, Hong Ryeol Na, Eon-Taek Oh, Seung-Hyun Chun, Yong-Hyun Kim, Sunghun Lee, Fabian Rotermund

Abstract: Reversible control of the topological invariants from nontrivial to trivial states has fundamental implications for quantum information processors and spintronics, by realizing an on/off switch for robust and dissipationless spin-current. Although mechanical strain has typically been advantageous for such control of topological invariants, it is often accompanied by in-plane fractures and is not suited for high-speed, time-dependent operations. Here, we use ultrafast optical and THz spectroscopy to investigate topological phase transitions by light-driven strain in Bi$_2$Se$_3$, a material that requires substantial strain for $\mathrm{Z}_2$ switching. We show that Bi$_2$Se$_3$ experiences ultrafast switching from being a topological insulator with spin-momentum-locked surfaces, to hybridized states and normal insulating phases at ambient conditions. Light-induced strong out-of-plane strain can suppress the surface-bulk coupling, enabling differentiation of surface and bulk conductance at room temperature, far above the Debye temperature. We illustrate various time-dependent sequences of transient hybridization, as well as the switching operation of topological invariants by adjusting the photoexcitation intensity. The abrupt alterations in both surface and bulk transport near the transition point allow for coherent conductance modulation at hyper-sound frequencies. Our findings regarding light-triggered ultrafast switching of topological invariants pave the way for high-speed topological switching and its associated applications.

Submitted 16 June, 2023; originally announced June 2023.

Comments: 14 pages, 4 figures, and Supplementary Material

arXiv:2305.18171 [pdf, other]

Improved Probabilistic Image-Text Representations

Authors: Sanghyuk Chun

Abstract: The Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the inherent ambiguity arising from multiplicity and imperfect annotations. Deterministic functions are not sufficiently powerful to capture ambiguity, prompting the exploration of probabilistic embeddings to tackle the challenge. However, the existing probabilistic ITM approach encounters two key shortcomings: the burden of heavy computations due to the Monte Carlo approximation, and the loss saturation issue in the face of abundant false negatives. To overcome the issues, this paper presents an improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution. In addition, two optimization techniques are proposed to enhance PCME++ further: first, the incorporation of pseudo-positives to prevent the negative effect under massive false negatives; second, mixed sample data augmentation for probabilistic matching. Experimental results on MS-COCO Caption and two extended benchmarks, CxC and ECCV Caption, demonstrate the effectiveness of PCME++ compared to state-of-the-art ITM methods. The robustness of PCME++ is also evaluated under noisy image-text correspondences. In addition, the potential applicability of PCME++ in automatic prompt-filtering for zero-shot classification is shown. The code is available at https://github.com/naver-ai/pcmepp

Submitted 9 April, 2024; v1 submitted 29 May, 2023; originally announced May 2023.

Comments: ICLR 2024 camera-ready; Code: https://github.com/naver-ai/pcmepp. Project page: https://naver-ai.github.io/pcmepp/. 30 pages, 2.2 MB

arXiv:2305.16057 [pdf]

Fake News Detection and Behavioral Analysis: Case of COVID-19

Authors: Chih-Yuan Li, Navya Martin Kollapally, Soon Ae Chun, James Geller

Abstract: While the world has been combating COVID-19 for over three years, an ongoing "Infodemic" due to the spread of fake news regarding the pandemic has also been a global issue. The existence of fake news impacts different aspects of our daily lives, including politics, public health, economic activities, etc. Readers could mistake fake news for real news, and consequently have less access to authentic information. This phenomenon will likely cause confusion among citizens and conflicts in society. Currently, there are major challenges in fake news research. It is challenging to accurately identify fake news data in social media posts. In-time human identification is infeasible as the amount of the fake news data is overwhelming. Besides, topics discussed in fake news are hard to identify due to their similarity to real news. The goal of this paper is to identify fake news on social media to help stop the spread. We present Deep Learning approaches and an ensemble approach for fake news detection. Our detection models achieved higher accuracy than previous studies. The ensemble approach further improved the detection performance. We discovered feature differences between fake news and real news items. When we added them into the sentence embeddings, we found that they affected the model performance. We applied a hybrid method and built models for recognizing topics from posts. We found half of the identified topics were overlapping in fake news and real news, which could increase confusion in the population.

Submitted 25 May, 2023; originally announced May 2023.

Comments: 27 pages, 11 figures, 13 tables

MSC Class: 68

arXiv:2305.10630 [pdf, other]

The First X-ray Polarization Observation of the Black Hole X-ray Binary 4U 1630-47 in the Steep Power Law State

Authors: Nicole Rodriguez Cavero, Lorenzo Marra, Henric Krawczynski, Michal Dovčiak, Stefano Bianchi, James F. Steiner, Jiri Svoboda, Fiamma Capitanio, Giorgio Matt, Michela Negro, Adam Ingram, Alexandra Veledina, Roberto Taverna, Vladimir Karas, Francesco Ursini, Jakub Podgorný, Ajay Ratheesh, Valery Suleimanov, Romana Mikušincová, Silvia Zane, Philip Kaaret, Fabio Muleri, Juri Poutanen, Christian Malacaria, Pierre-Olivier Petrucci, et al. (85 additional authors not shown)

Abstract: The Imaging X-ray Polarimetry Explorer (IXPE) observed the black hole X-ray binary 4U 1630-47 in the steep power law (or very high) state. The observations reveal a linear polarization degree of the 2-8 keV X-rays of 6.8 +/- 0.2 % at a position angle of 21°.3 +/- 0°.9 East of North (all errors at 1σ confidence level). Whereas the polarization degree increases with energy, the polarization angle stays constant within the accuracy of our measurements. We compare the polarization of the source in the steep power-law state with the previous IXPE measurement of the source in the high soft state. We find that even though the source flux and spectral shape are significantly different between the high soft state and the steep power-law state, their polarization signatures are similar. Assuming that the polarization of both the thermal and power-law emission components are constant over time, we estimate the power-law component polarization to be 6.8-7.0% and note that the polarization angle of the thermal and power-law components must be approximately aligned. We discuss the implications for the origin of the power-law component and the properties of the emitting plasma.

Submitted 17 May, 2023; originally announced May 2023.

Comments: 14 pages, 2 tables, 6 figures

arXiv:2304.12752 [pdf, other]

doi 10.3847/1538-4357/ad226e

X-ray Polarization of the Black Hole X-ray Binary 4U 1630-47 Challenges Standard Thin Accretion Disk Scenario

Authors: Ajay Ratheesh, Michal Dovčiak, Henric Krawczynski, Jakub Podgorný, Lorenzo Marra, Alexandra Veledina, Valery Suleimanov, Nicole Rodriguez Cavero, James Steiner, Jiri Svoboda, Andrea Marinucci, Stefano Bianchi, Michela Negro, Giorgio Matt, Francesco Tombesi, Juri Poutanen, Adam Ingram, Roberto Taverna, Andrew West, Vladimir Karas, Francesco Ursini, Paolo Soffitta, Fiamma Capitanio, Domenico Viscolo, Alberto Manfreda, et al. (90 additional authors not shown)

Abstract: Large energy-dependent X-ray polarization degree is detected by the Imaging X-ray Polarimetry Explorer (IXPE) in the high-soft emission state of the black hole X-ray binary 4U 1630-47. The highly significant detection (at $\approx50σ$ confidence level) of an unexpectedly high polarization, rising from $\sim6\%$ at $2$ keV to $\sim10\%$ at $8$ keV, cannot be easily reconciled with standard models of thin accretion discs. In this work we compare the predictions of different theoretical models with the IXPE data and conclude that the observed polarization properties are compatible with a scenario in which matter accretes onto the black hole through a thin disc, covered by a partially-ionized atmosphere flowing away at mildly relativistic velocities.

Submitted 19 March, 2024; v1 submitted 25 April, 2023; originally announced April 2023.

Comments: Published in ApJ (https://doi.org/10.3847/1538-4357/ad226e)

Journal ref: 2024 ApJ 964 77

arXiv:2304.10727 [pdf, other]

RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models

Authors: Seulki Park, Daeho Um, Hajung Yoon, Sanghyuk Chun, Sangdoo Yun, Jin Young Choi

Abstract: In this paper, we propose a robustness benchmark for image-text matching models to assess their vulnerabilities. To this end, we insert adversarial texts and images into the search pool (i.e., gallery set) and evaluate models with the adversarial data. Specifically, we replace a word in the text to change the meaning of the text and mix images with different images to create perceptible changes in pixels. We assume that such explicit alterations would not deceive a robust model, as they should understand the holistic meaning of texts and images simultaneously. However, in our evaluations on the proposed benchmark, many state-of-the-art models show significant performance degradation, e.g., Recall@1: 81.9% $\rightarrow$ 64.5% in BLIP, 66.1% $\rightarrow$ 37.5% in VSE$\infty$, where the models favor adversarial texts/images over the original ones. This reveals that current vision-language models may not account for subtle changes or understand the overall context of texts and images. Our findings can provide insights for improving the robustness of vision-language models and devising more diverse stress-test methods in cross-modal retrieval tasks. Source code and dataset will be available at https://github.com/pseulki/rococo.

Submitted 14 July, 2023; v1 submitted 20 April, 2023; originally announced April 2023.

arXiv:2304.04875 [pdf, other]

Three Recipes for Better 3D Pseudo-GTs of 3D Human Mesh Estimation in the Wild

Authors: Gyeongsik Moon, Hongsuk Choi, Sanghyuk Chun, Jiyoung Lee, Sangdoo Yun

Abstract: Recovering 3D human mesh in the wild is greatly challenging as in-the-wild (ITW) datasets provide only 2D pose ground truths (GTs). Recently, 3D pseudo-GTs have been widely used to train 3D human mesh estimation networks as the 3D pseudo-GTs enable 3D mesh supervision when training the networks on ITW datasets. However, despite the great potential of the 3D pseudo-GTs, there has been no extensive analysis that investigates which factors are important to make more beneficial 3D pseudo-GTs. In this paper, we provide three recipes to obtain highly beneficial 3D pseudo-GTs of ITW datasets. The main challenge is that only 2D-based weak supervision is allowed when obtaining the 3D pseudo-GTs. Each of our three recipes addresses the challenge in each aspect: depth ambiguity, sub-optimality of weak supervision, and implausible articulation. Experimental results show that simply re-training state-of-the-art networks with our new 3D pseudo-GTs elevates their performance to the next level without bells and whistles. The 3D pseudo-GT is publicly available at https://github.com/mks0601/NeuralAnnot_RELEASE.

Submitted 10 April, 2023; originally announced April 2023.

Comments: Published at CVPRW 2023

arXiv:2304.04555 [pdf, other]

Neural Diffeomorphic Non-uniform B-spline Flows

Authors: Seongmin Hong, Se Young Chun

Abstract: Normalizing flows have been successfully modeling a complex probability distribution as an invertible transformation of a simple base distribution. However, there are often applications that require more than invertibility. For instance, the computation of energies and forces in physics requires the second derivatives of the transformation to be well-defined and continuous. Smooth normalizing flows employ infinitely differentiable transformation, but at the price of slow non-analytic inverse transforms. In this work, we propose diffeomorphic non-uniform B-spline flows that are at least twice continuously differentiable while bi-Lipschitz continuous, enabling efficient parametrization while retaining analytic inverse transforms based on a sufficient condition for diffeomorphism. Firstly, we investigate the sufficient condition for $C^{k-2}$-diffeomorphic non-uniform $k$th-order B-spline transformations. Then, we derive an analytic inverse transformation of the non-uniform cubic B-spline transformation for neural diffeomorphic non-uniform B-spline flows. Lastly, we performed experiments on solving the force matching problem in Boltzmann generators, demonstrating that our $C^2$-diffeomorphic non-uniform B-spline flows yielded solutions better than previous spline flows and faster than smooth normalizing flows. Our source code is publicly available at https://github.com/smhongok/Non-uniform-B-spline-Flow.

Submitted 11 April, 2023; v1 submitted 7 April, 2023; originally announced April 2023.

Comments: Accepted to AAAI 2023

arXiv:2304.02827 [pdf, other]

DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model

Authors: Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun

Abstract: The increasing demand for high-quality 3D content creation has motivated the development of automated methods for creating 3D object models from a single image and/or from a text prompt. However, the reconstructed 3D objects using state-of-the-art image-to-3D methods still exhibit low correspondence to the given image and low multi-view consistency. Recent state-of-the-art text-to-3D methods are also limited, yielding 3D samples with low diversity per prompt with long synthesis time. To address these challenges, we propose DITTO-NeRF, a novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a single image. Our DITTO-NeRF consists of constructing a high-quality partial 3D object for limited in-boundary (IB) angles using the given or text-generated 2D image from the frontal view and then iteratively reconstructing the remaining 3D NeRF using an inpainting latent diffusion model. We propose progressive 3D object reconstruction schemes in terms of scales (low to high resolution), angles (IB angles initially to outer-boundary (OB) later), and masks (object to background boundary) in our DITTO-NeRF so that high-quality information on IB can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods in terms of fidelity and diversity qualitatively and quantitatively with much faster training times than prior art on image/text-to-3D such as DreamFusion and NeuralLift-360.

Submitted 5 April, 2023; originally announced April 2023.

Comments: Project page: https://janeyeon.github.io/ditto-nerf/

arXiv:2304.01900 [pdf, other]

PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion

Authors: Gwanghyun Kim, Ji Ha Jang, Se Young Chun

Abstract: Recently, significant advancements have been made in 3D generative models; however, training these models across diverse domains is challenging and requires a huge amount of training data and knowledge of pose distribution. Text-guided domain adaptation methods have allowed the generator to be adapted to the target domains using text prompts, thereby obviating the need for assembling large amounts of data. Recently, DATID-3D presents impressive quality of samples in text-guided domain, preserving diversity in text by leveraging text-to-image diffusion. However, adapting 3D generators to domains with significant domain gaps from the source domain still remains challenging due to issues in current text-to-image diffusion models as follows: 1) shape-pose trade-off in diffusion-based translation, 2) pose bias, and 3) instance bias in the target domain, resulting in inferior 3D shapes, low text-image correspondence, and low intra-domain diversity in the generated samples. To address these issues, we propose a novel pipeline called PODIA-3D, which uses pose-preserved text-to-image diffusion-based domain adaptation for 3D generative models. We construct a pose-preserved text-to-image diffusion model that allows the use of extremely high-level noise for significant domain changes. We also propose specialized-to-general sampling strategies to improve the details of the generated samples. Moreover, to overcome the instance bias, we introduce a text-guided debiasing method that improves intra-domain diversity. Consequently, our method successfully adapts 3D generators across significant domain gaps. Our qualitative results and user study demonstrate that our approach outperforms existing 3D text-guided domain adaptation methods in terms of text-image correspondence, realism, diversity of rendered images, and sense of depth of 3D shapes in the generated samples.

Submitted 4 April, 2023; originally announced April 2023.

Comments: Project page: https://gwang-kim.github.io/podia_3d/

arXiv:2303.17595 [pdf, other]

Neglected Free Lunch -- Learning Image Classifiers Using Annotation Byproducts

Authors: Dongyoon Han, Junsuk Choe, Seonghyeok Chun, John Joon Young Chung, Minsuk Chang, Sangdoo Yun, Jean Y. Song, Seong Joon Oh

Abstract: Supervised learning of image classifiers distills human knowledge into a parametric model through pairs of images and corresponding labels (X,Y). We argue that this simple and widely used representation of human knowledge neglects rich auxiliary information from the annotation procedure, such as the time-series of mouse traces and clicks left after image selection. Our insight is that such annotation byproducts Z provide approximate human attention that weakly guides the model to focus on the foreground cues, reducing spurious correlations and discouraging shortcut learning. To verify this, we create ImageNet-AB and COCO-AB. They are ImageNet and COCO training sets enriched with sample-wise annotation byproducts, collected by replicating the respective original annotation tasks. We refer to the new paradigm of training models with annotation byproducts as learning using annotation byproducts (LUAB). We show that a simple multitask loss for regressing Z together with Y already improves the generalisability and robustness of the learned models. Compared to the original supervised learning, LUAB does not require extra annotation costs. ImageNet-AB and COCO-AB are at https://github.com/naver-ai/NeglectedFreeLunch.

Submitted 26 July, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

Comments: Code & data at https://github.com/naver-ai/NeglectedFreeLunch. To be presented at ICCV'23

arXiv:2303.12034 [pdf, other]

The first X-ray polarimetric observation of the black hole binary LMC X-1

Authors: Jakub Podgorny, Lorenzo Marra, Fabio Muleri, Nicole Rodriguez Cavero, Ajay Ratheesh, Michal Dovciak, Romana Mikusincova, Maimouna Brigitte, James F. Steiner, Alexandra Veledina, Stefano Bianchi, Henric Krawczynski, Jiri Svoboda, Philip Kaaret, Giorgio Matt, Javier A. Garcia, Pierre-Olivier Petrucci, Alexander A. Lutovinov, Andrey N. Semena, Alessandro Di Marco, Michela Negro, Martin C. Weisskopf, Adam Ingram, Juri Poutanen, Banfsheh Beheshtipour, et al. (86 additional authors not shown)

Abstract: We report on an X-ray polarimetric observation of the high-mass X-ray binary LMC X-1 in the high/soft state, obtained by the Imaging X-ray Polarimetry Explorer (IXPE) in October 2022. The measured polarization is below the minimum detectable polarization of 1.1 per cent (at the 99 per cent confidence level). Simultaneously, the source was observed with the NICER, NuSTAR and SRG/ART-XC instruments, which enabled spectral decomposition into a dominant thermal component and a Comptonized one. The low 2-8 keV polarization of the source did not allow for strong constraints on the black-hole spin and inclination of the accretion disc. However, if the orbital inclination of about 36 degrees is assumed, then the upper limit is consistent with predictions for pure thermal emission from geometrically thin and optically thick discs. Assuming the polarization degree of the Comptonization component to be 0, 4, or 10 per cent, and oriented perpendicular to the polarization of the disc emission (in turn assumed to be perpendicular to the large scale ionization cone orientation detected in the optical band), an upper limit to the polarization of the disc emission of 1.0, 0.9 or 0.9 per cent, respectively, is found (at the 99 per cent confidence level).

Submitted 9 October, 2023; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: 12 pages, 9 figures, 4 tables. Accepted for publication in MNRAS

arXiv:2303.11916 [pdf, other]

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

Authors: Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun

Abstract: This paper proposes a novel diffusion-based model, CompoDiff, for solving zero-shot Composed Image Retrieval (ZS-CIR) with latent diffusion. This paper also introduces a new synthetic dataset, named SynthTriplets18M, with 18.8 million reference images, conditions, and corresponding target image triplets to train CIR models. CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR approaches, such as poor generalizability due to the small dataset scale and the limited types of conditions. CompoDiff not only achieves a new state-of-the-art on four ZS-CIR benchmarks, including FashionIQ, CIRR, CIRCO, and GeneCIS, but also enables a more versatile and controllable CIR by accepting various conditions, such as negative text, and image mask conditions. CompoDiff also shows the controllability of the condition strength between text and image queries and the trade-off between inference speed and performance, which are unavailable with existing CIR methods. The code and dataset are available at https://github.com/navervision/CompoDiff

Submitted 25 February, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: First two authors contributed equally; 28 pages, 6.2MB

arXiv:2303.11114 [pdf, other]

SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage

Authors: Song Park, Sanghyuk Chun, Byeongho Heo, Wonjae Kim, Sangdoo Yun

Abstract: We need billion-scale images to achieve more generalizable and ground-breaking vision models, as well as massive dataset storage to ship the images (e.g., the LAION-4B dataset needs 240TB storage space). However, it has become challenging to deal with unlimited dataset storage with limited storage infrastructure. A number of storage-efficient training methods have been proposed to tackle the problem, but they are rarely scalable or suffer from severe damage to performance. In this paper, we propose a storage-efficient training strategy for vision classifiers for large-scale datasets (e.g., ImageNet) that only uses 1024 tokens per instance without using the raw level pixels; our token storage only needs <1% of the original JPEG-compressed raw pixels. We also propose token augmentations and a Stem-adaptor module to make our approach able to use the same architecture as pixel-based approaches with only minimal modifications on the stem layer and the carefully tuned optimization settings. Our experimental results on ImageNet-1k show that our method significantly outperforms other storage-efficient training methods with a large gap. We further show the effectiveness of our method in other practical scenarios, storage-efficient pre-training, and continual learning. Code is available at https://github.com/naver-ai/seit

Submitted 11 September, 2023; v1 submitted 20 March, 2023; originally announced March 2023.

Comments: ICCV 2023; First two authors contributed equally; code url: https://github.com/naver-ai/seit; 17 pages, 1.2MB

arXiv:2303.00442 [pdf, other]

Re-weighting Based Group Fairness Regularization via Classwise Robust Optimization

Authors: Sangwon Jung, Taeeon Park, Sanghyuk Chun, Taesup Moon

Abstract: Many existing group fairness-aware training methods aim to achieve the group fairness by either re-weighting underrepresented groups based on certain rules or using weakly approximated surrogates for the fairness metrics in the objective as regularization terms. Although each of the learning schemes has its own strength in terms of applicability or performance, respectively, it is difficult for any method in either category to be considered as a gold standard since their successful performances are typically limited to specific cases. To that end, we propose a principled method, dubbed FairDRO, which unifies the two learning schemes by incorporating a well-justified group fairness metric into the training objective using a classwise distributionally robust optimization (DRO) framework. We then develop an iterative optimization algorithm that minimizes the resulting objective by automatically producing the correct re-weights for each group. Our experiments show that FairDRO is scalable and easily adaptable to diverse applications, and consistently achieves the state-of-the-art performance on several benchmark datasets in terms of the accuracy-fairness trade-off, compared to recent strong baselines.

Submitted 1 March, 2023; originally announced March 2023.

arXiv:2212.04319 [pdf, other]

On the Robustness of Normalizing Flows for Inverse Problems in Imaging

Authors: Seongmin Hong, Inbum Park, Se Young Chun

Abstract: Conditional normalizing flows can generate diverse image samples for solving inverse problems. Most normalizing flows for inverse problems in imaging employ the conditional affine coupling layer that can generate diverse images quickly. However, unintended severe artifacts are occasionally observed in the output of them. In this work, we address this critical issue by investigating the origins of these artifacts and proposing the conditions to avoid them. First of all, we empirically and theoretically reveal that these problems are caused by "exploding inverse" in the conditional affine coupling layer for certain out-of-distribution (OOD) conditional inputs. Then, we further validated that the probability of causing erroneous artifacts in pixels is highly correlated with a Mahalanobis distance-based OOD score for inverse problems in imaging. Lastly, based on our investigations, we propose a remark to avoid exploding inverse and then based on it, we suggest a simple remedy that substitutes the affine coupling layers with the modified rational quadratic spline coupling layers in normalizing flows, to encourage the robustness of generated image samples. Our experimental results demonstrated that our suggested methods effectively suppressed critical artifacts occurring in normalizing flows for super-resolution space generation and low-light image enhancement.

Submitted 16 March, 2023; v1 submitted 8 December, 2022; originally announced December 2022.

Comments: 16 pages

arXiv:2212.04114 [pdf, other]

Group Generalized Mean Pooling for Vision Transformer

Authors: Byungsoo Ko, Han-Gyu Kim, Byeongho Heo, Sangdoo Yun, Sanghyuk Chun, Geonmo Gu, Wonjae Kim

Abstract: Vision Transformer (ViT) extracts the final representation from either class token or an average of all patch tokens, following the architecture of Transformer in Natural Language Processing (NLP) or Convolutional Neural Networks (CNNs) in computer vision. However, studies for the best way of aggregating the patch tokens are still limited to average pooling, while widely-used pooling strategies, such as max and GeM pooling, can be considered. Despite their effectiveness, the existing pooling strategies do not consider the architecture of ViT and the channel-wise difference in the activation maps, aggregating the crucial and trivial channels with the same importance. In this paper, we present Group Generalized Mean (GGeM) pooling as a simple yet powerful pooling strategy for ViT. GGeM divides the channels into groups and computes GeM pooling with a shared pooling parameter per group. As ViT groups the channels via a multi-head attention mechanism, grouping the channels by GGeM leads to lower head-wise dependence while amplifying important channels on the activation maps. Exploiting GGeM shows 0.1%p to 0.7%p performance boosts compared to the baselines and achieves state-of-the-art performance for ViT-Base and ViT-Large models in ImageNet-1K classification task. Moreover, GGeM outperforms the existing pooling strategies on image retrieval and multi-modal representation learning tasks, demonstrating the superiority of GGeM for a variety of tasks. GGeM is a simple algorithm in that only a few lines of code are necessary for implementation.

Submitted 8 December, 2022; originally announced December 2022.

arXiv:2211.16374 [pdf, other]

DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Authors: Gwanghyun Kim, Se Young Chun

Abstract: Recent 3D generative models have achieved remarkable performance in synthesizing high resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information. Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of them is that the sample diversity in the original generative model is not well-preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain.
Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text.

Submitted 30 March, 2023; v1 submitted 29 November, 2022; originally announced November 2022.

Comments: Accepted to CVPR 2023, Project page: https://gwang-kim.github.io/datid_3d/

arXiv:2211.08670 [pdf, ps, other]

doi 10.1017/jfm.2022.926

Experimental observation of a confined bubble moving in shear-thinning fluids

Authors: SungGyu Chun, Bingqiang Ji, Zhengyu Yang, Vinit Kumar Malik, Jie Feng

Abstract: The motion of a long gas bubble in a confined capillary tube is ubiquitous in a wide range of engineering and biological applications. While the understanding of the deposited thin viscous film near the tube wall in Newtonian fluids is well developed, the deposition dynamics in commonly encountered non-Newtonian fluids remains much less studied. Here, we investigate the dynamics of a confined bubble moving in shear-thinning fluids with systematic experiments, varying the zero-shear-rate capillary number $Ca_0$ in the range of $O(10^{-3}-10^2)$ considering the zero-shear-rate viscosity. The thickness of the deposited liquid film, the bubble speed and the bubble front/rear menisci are measured, which are further rationalized with the recent theoretical studies based on appropriate rheological models. Compared with Newtonian fluids, the film thickness decreases for both the carboxymethyl cellulose and Carbopol solutions when the shear-thinning effect dominates. We show that the film thickness follows the scaling law from \citet{aussillous2000quick} with an effective capillary number $Ca_e$, considering the characteristic shear rate in the film as proposed by \citet{picchi2021motion}. $Ca_e$ is calculated by the Carreau number and the power-law index from the Carreau-Yasuda rheological model. The shear-thinning effect also influences the bubble speed and delays the transition to the parabolic region in the bubble front and rear menisci.
In particular, a high degree of undulations on the bubble surface results in intricate rear viscosity distribution for the rear meniscus and the deviation between the experiments and theory may require a further investigation to resolve the axial velocity field. Our study may advance the fundamental understandings and engineering guidelines for coating processes involving thin-film flows and non-Newtonian fluids.

Submitted 15 November, 2022; originally announced November 2022.

arXiv:2211.05910 [pdf, other]

Efficient and Accurate Quantized Image Super-Resolution on Mobile NPUs, Mobile AI & AIM 2022 challenge: Report

Authors: Andrey Ignatov, Radu Timofte, Maurizio Denna, Abdel Younes, Ganzorig Gankhuyag, Jingang Huh, Myeong Kyun Kim, Kihwan Yoon, Hyeon-Cheol Moon, Seungho Lee, Yoonsik Choe, Jinwoo Jeong, Sungjei Kim, Maciej Smyl, Tomasz Latkowski, Pawel Kubik, Michal Sokolski, Yujie Ma, Jiahao Chao, Zhou Zhou, Hongfan Gao, Zhengfeng Yang, Zhenbing Zeng, Zhengyang Zhuge, Chenghua Li, et al. (71 additional authors not shown)

Abstract: Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and propose the participants to design an efficient quantized image super-resolution solution that can demonstrate a real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to do a high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 60 FPS rate when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.

Submitted 7 November, 2022; originally announced November 2022.

Comments: arXiv admin note: text overlap with arXiv:2105.07825, arXiv:2105.08826, arXiv:2211.04470, arXiv:2211.03885, arXiv:2211.05256

Showing 1–50 of 191 results for author: Chun, S