AutoBencher: Creating Salient, Novel, Difficult Datasets for Language Models

Authors: Xiang Lisa Li, Evan Zheran Liu, Percy Liang, Tatsunori Hashimoto

Abstract: Evaluation is critical for assessing capabilities, tracking scientific progress, and informing model selection. In this paper, we present three desiderata for a good benchmark for language models: (i) salience (e.g., knowledge about World War II is more salient than a random day in history), (ii) novelty (i.e., the benchmark reveals new trends in model rankings not shown by previous benchmarks), and (iii) difficulty (i.e., the benchmark should be difficult for existing models, leaving headroom for future improvement). We operationalize these three desiderata and cast benchmark creation as a search problem, that of finding benchmarks that satisfy all three desiderata. To tackle this search problem, we present AutoBencher, which uses a language model to automatically search for datasets that meet the three desiderata. AutoBencher uses privileged information (e.g. relevant documents) to construct reliable datasets, and adaptivity with reranking to optimize for the search objective. We use AutoBencher to create datasets for math, multilingual, and knowledge-intensive question answering. The scalability of AutoBencher allows it to test fine-grained categories and tail knowledge, creating datasets that are on average 27% more novel and 22% more difficult than existing benchmarks. A closer investigation of our constructed datasets shows that we can identify specific gaps in language model knowledge that are not captured by existing benchmarks, such as Gemini Pro performing much worse on question answering about the Permian Extinction and Fordism, while OpenAGI-7B performs surprisingly well on QA about COVID-19.

Submitted 11 July, 2024; originally announced July 2024.

Comments: preprint

arXiv:2407.07977 [pdf, other]

Few-electron highly charged muonic Ar atoms verified by electronic $K$ x rays

Authors: T. Okumura, T. Azuma, D. A. Bennett, W. B. Doriese, M. S. Durkin, J. W. Fowler, J. D. Gard, T. Hashimoto, R. Hayakawa, Y. Ichinohe, P. Indelicato, T. Isobe, S. Kanda, D. Kato, M. Katsuragawa, N. Kawamura, Y. Kino, N. Kominato, Y. Miyake, K. M. Morgan, H. Noda, G. C. O'Neil, S. Okada, K. Okutsu, N. Paul, et al. (18 additional authors not shown)

Abstract: Electronic $K$ x rays emitted by muonic Ar atoms in the gas phase were observed using a superconducting transition-edge-sensor microcalorimeter. The high-precision energy spectra provided a clear signature of the presence of muonic atoms accompanied by a few electrons, which have never been observed before. One-, two-, and three-electron bound, i.e., H-like, He-like, and Li-like, muonic Ar atoms were identified from electronic $K$ x rays and hyper-satellite $K$ x rays. These $K$ x rays are emitted after the charge transfer process by the collisions with surrounding Ar atoms. With the aid of theoretical calculations, we confirmed that the peak positions are consistent with the x-ray energies from highly charged Cl ions, and the intensities reflecting deexcitation dynamics were successfully understood by taking into account the interaction between the muon and bound electrons.

Submitted 10 July, 2024; originally announced July 2024.

arXiv:2407.04620 [pdf, other]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Authors: Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, Carlos Guestrin

Abstract: Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2407.01889 [pdf, other]

ALMA reveals spatially-resolved properties of molecular gas in the host galaxy of FRB 20191001A at z = 0.2340

Authors: Itsuki Yamanaka, Bunyo Hatsukade, Fumi Egusa, Tetsuya Hashimoto, Yuu Niino, Tzu-Yin Hsu, Hiroyuki Kaneko, Kotaro Kohno

Abstract: We report the detection of the CO(2-1) emission line with a spatial resolution of 0.9 arcsec ($3.5 \mathrm{kpc}$) from the host galaxy of the fast radio burst (FRB), FRB 20191001A at $z=0.2340$, using the Atacama Large Millimeter/submillimeter Array. This is the first detection of spatially resolved CO emission from the host galaxy of an FRB at a cosmological distance. The inferred molecular gas mass of the host galaxy is $(2.3\pm0.4)\times10^{10} \mathrm{M_\odot}$, indicating that it is gas-rich, as evidenced by the measured molecular gas fraction $μ_\mathrm{gas}=0.50\pm0.22$. This molecular-gas mass and the star formation rate of the host, $\mathrm{SFR}=8.06\pm2.42 \mathrm{M_\odot yr^{-1}}$, differ from those observed in the other FRB host galaxies with the average $M_\mathrm{gas}=9.6\times10^8 \mathrm{M_\odot}$ and $\mathrm{SFR}=0.90 \mathrm{M_\odot yr^{-1}}$. This lends further credibility to the hypothesis that FRBs may originate from single or multiple progenitors across a diverse range of galaxy environments. Based on the observed velocity field modeling, we find that the molecular gas disk is dominated by an ordered circular rotation, despite the fact that the host galaxy has a gas-rich companion galaxy with a projected separation of $\sim 25 \mathrm{kpc}$. The formation of the FRB's progenitor might not have been triggered by this interaction. We derive a 3$σ$ upper limit on the molecular gas column density at the FRB detection site of $< 2.1\times 10^{21} \mathrm{cm^{-2}}$.

Submitted 1 July, 2024; originally announced July 2024.

Comments: 10 pages, 7 figures, 3 tables

arXiv:2407.01023 [pdf, other]

DistML.js: Installation-free Distributed Deep Learning Framework for Web Browsers

Authors: Masatoshi Hidaka, Tomohiro Hashimoto, Yuto Nishizawa, Tatsuya Harada

Abstract: We present "DistML.js", a library designed for training and inference of machine learning models within web browsers. Not only does DistML.js facilitate model training on local devices, but it also supports distributed learning through communication with servers. Its design and define-by-run API for deep learning model construction resemble PyTorch, thereby reducing the learning curve for prototyping. Matrix computations involved in model training and inference are executed on the backend utilizing WebGL, enabling high-speed calculations. We provide a comprehensive explanation of DistML.js's design, API, and implementation, alongside practical applications including data parallelism in learning. The source code is publicly available at https://github.com/mil-tokyo/distmljs.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.19439 [pdf, other]

Gas conditions of a star-formation selected sample in the first billion years

Authors: Tom J. L. C. Bakx, Hiddo S. B. Algera, Bram Venemans, Laura Sommovigo, Seiji Fujimoto, Stefano Carniani, Masato Hagimoto, Takuya Hashimoto, Akio K. Inoue, Dragan Salak, Stephen Serjeant, Livia Vallini, Stephen Eales, Andrea Ferrara, Yoshinobu Fudamoto, Chihiro Imamura, Shigeki Inoue, Kirsten K. Knudsen, Hiroshi Matsuo, Yuma Sugahara, Yoichi Tamura, Akio Taniguchi, Satoshi Yamanaka

Abstract: We present Atacama Large Millimetre/submillimetre Array (ALMA) observations of the [O$_{\rm III}$] 88 $μ$m emission of a sample of thirteen galaxies at $z$ = 6 to 7.6 selected as [C$_{\rm II}$]-emitting companion sources of quasars. To disentangle the origins of the luminous Oxygen line in the $z$ > 6 Universe, we looked at emission-line galaxies that are selected through an excellent star-formation tracer [C$_{\rm II}$] with star-formation rates between 9 and 162 M$_{\odot}$/yr. Direct observations reveal [O$_{\rm III}$] emission in just a single galaxy (L$_{\rm [O_{\rm III}]}$/L$_{\rm [C_{\rm II}]}$ = 2.3), and a stacked image shows no [O$_{\rm III}$] detection, providing deep upper limits on the L$_{\rm [O_{\rm III}]}$/L$_{\rm [C_{\rm II}]}$ ratios in the $z > 6$ Universe (L$_{\rm [O_{\rm III}]}$/L$_{\rm [C_{\rm II}]}$ < 1.2 at 3$σ$). While the fidelity of this sample is high, no obvious optical/near-infrared counterpart is seen in the JWST imaging available for four galaxies. Additionally accounting for low-redshift CO emitters, line stacking shows that our sample-wide result remains robust: The enhanced L$_{\rm [O_{\rm III}]}$/L$_{\rm [C_{\rm II}]}$ reported in the first billion years of the Universe is likely due to the selection towards bright, blue Lyman-break galaxies with high surface star-formation rates or young stellar populations. The deep upper limit on the rest-frame 90 $μ$m continuum emission (< 141 $μ$Jy at 3$σ$) implies a low average dust temperature (T$_{\rm dust}$ < 30K) and high dust mass (M$_{\rm dust}$ ~ 10$^8$ M$_{\odot}$). As more normal galaxies are explored in the early Universe, synergy between JWST and ALMA is fundamental to further investigate the ISM properties of a broad range of samples of high-$z$ galaxies.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 20 pages; 13 figures; accepted for publication in MNRAS

arXiv:2406.14888 [pdf, other]

Finding dusty AGNs from the JWST CEERS survey with mid-infrared photometry

Authors: Tom C. -C. Chien, Chih-Teng Ling, Tomotsugu Goto, Cossas K. -W. Wu, Seong Jin Kim, Tetsuya Hashimoto, Yu-Wei Lin, Ece Kilerci, Simon C. -C. Ho, Po-Ya Wang, Bjorn Jasper R. Raquel

Abstract: The nature of the interaction between active galactic nuclei (AGNs) and their host galaxies remains an unsolved question. Therefore, conducting an AGN census is valuable to AGN research. Nevertheless, a significant fraction of AGNs are obscured by their environment, which blocks UV and optical emissions due to the dusty torus surrounding the central supermassive black hole (SMBH). To overcome this challenge, mid-infrared (IR) surveys have emerged as a valuable tool for identifying obscured AGNs, as the obscured light is re-emitted in this range. With its high sensitivity, the James Webb Space Telescope (JWST) uncovered fainter objects than previous telescopes. By applying SED fitting, this work investigates AGN candidates in JWST Cosmic Evolution Early Release Science (CEERS) fields. We identified 42 candidates, of which 30 are classified as composites ($0.2\leq f_{\rm AGN, IR}< 0.5$) and 12 as AGNs ($f_{\rm AGN, IR}\geq 0.5$). We report the AGN luminosity contributions and AGN number fractions as a function of redshift and total infrared luminosity, showing that previously reported increasing relations are not apparent in our sample due to the sample size. We also extend the previous results on ultra-luminous infrared galaxies (ULIRGs, $L_{\rm TIR}\geq 10^{12} L_{\odot}$) to less luminous AGNs, highlighting the power of JWST.

Submitted 21 June, 2024; originally announced June 2024.

Comments: 15 pages, 20 figures, 4 tables. Accepted for publication in MNRAS. The 3 min summary: https://www.youtube.com/watch?v=mWUebbgUOh8

arXiv:2406.14785 [pdf, other]

Understanding Finetuning for Factual Knowledge Extraction

Authors: Gaurav Ghosal, Tatsunori Hashimoto, Aditi Raghunathan

Abstract: In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.

Submitted 20 June, 2024; originally announced June 2024.

Comments: To appear in ICML 2024

arXiv:2406.07975 [pdf, other]

FINER: Far-Infrared Nebular Emission Receiver for the Large Millimeter Telescope

Authors: Yoichi Tamura, Takeshi Sakai, Ryohei Kawabe, Takafumi Kojima, Akio Taniguchi, Tatsuya Takekoshi, Haoran Kang, Wenlei Shan, Masato Hagimoto, Norika Okauchi, Airi Tetsuka, Akio K. Inoue, Kotaro Kohno, Kunihiko Tanaka, Tom J. L. C. Bakx, Yoshinobu Fudamoto, Kazuyuki Fujita, Yuichi Harikane, Takuya Hashimoto, Bunyo Hatsukade, David H. Hughes, Takahiro Iino, Yuki Kimura, Hiroyuki Maezawa, Yuichi Matsuda, et al. (12 additional authors not shown)

Abstract: Unveiling the emergence and prevalence of massive/bright galaxies during the epoch of reionization and beyond, within the first 600 million years of the Universe, stands as a pivotal pursuit in astronomy. Remarkable progress has been made by JWST in identifying an immense population of bright galaxies, which hints at exceptionally efficient galaxy assembly processes. However, the underlying physical mechanisms propelling their rapid growth remain unclear. With this in mind, millimeter and submillimeter-wave spectroscopic observations of redshifted far-infrared spectral lines, particularly the [O III] 88 micron and [C II] 158 micron lines, offer a crucial pathway to address this fundamental query. To this end, we develop a dual-polarization sideband-separating superconductor-insulator-superconductor (SIS) mixer receiver, FINER, for the Large Millimeter Telescope (LMT) situated in Mexico. Harnessing advancements from ALMA's wideband sensitivity upgrade (WSU) technology, FINER covers radio frequencies spanning 120-360 GHz, delivering an instantaneous intermediate frequency (IF) of 3-21 GHz per sideband per polarization, which is followed by a set of 10.24 GHz-wide digital spectrometers. At 40% of ALMA's light-collecting area, the LMT's similar atmospheric transmittance and FINER's 5 times wider bandwidth compared to ALMA culminate in an unparalleled spectral scanning capability in the northern hemisphere, paving the way for finer spectral-resolution detection of distant galaxies.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 12 pages, 8 figures, and 3 tables. Proceedings paper presented in SPIE Astronomical Telescope and Instrumentation 2024

arXiv:2405.20456 [pdf, other]

Scaling Laws for the Value of Individual Data Points in Machine Learning

Authors: Ian Covert, Wenlong Ji, Tatsunori Hashimoto, James Zou

Abstract: Recent works have shown that machine learning models improve at a predictable rate with the total amount of training data, leading to scaling laws that describe the relationship between error and dataset size. These scaling laws can help design a model's training dataset, but they typically take an aggregate view of the data by only considering the dataset's size. We introduce a new perspective by investigating scaling behavior for the value of individual data points: we find that a data point's contribution to a model's performance shrinks predictably with the size of the dataset in a log-linear manner. Interestingly, there is significant variability in the scaling exponent among different data points, indicating that certain points are more valuable in small datasets while others are relatively more useful as a part of large datasets. We provide learning theory to support our scaling law, and we observe empirically that it holds across diverse model classes. We further propose a maximum likelihood estimator and an amortized estimator to efficiently learn the individualized scaling behaviors from a small number of noisy observations per data point. Using our estimators, we provide insights into factors that influence the scaling behavior of different data points. Finally, we demonstrate applications of the individualized scaling laws to data valuation and data subset selection. Overall, our work represents a first step towards understanding and utilizing scaling properties for the value of individual data points.

Submitted 30 May, 2024; originally announced May 2024.

Comments: ICML 2024 camera-ready

arXiv:2405.10938 [pdf, other]

Observational Scaling Laws and the Predictability of Language Model Performance

Authors: Yangjun Ruan, Chris J. Maddison, Tatsunori Hashimoto

Abstract: Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publicly available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.

Submitted 2 July, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

arXiv:2404.10770 [pdf, other]

Unveiling the Cosmic Gems Arc at $z\sim10.2$ with JWST

Authors: Larry D. Bradley, Angela Adamo, Eros Vanzella, Keren Sharon, Gabriel Brammer, Dan Coe, Jose M. Diego, Vasily Kokorev, Guillaume Mahler, Masamune Oguri, Abdurro'uf, Rachana Bhatawdekar, Lise Christensen, Seiji Fujimoto, Takuya Hashimoto, Tiger Y. -Y Hsiao, Akio K. Inoue, Yolanda Jiménez-Teja, Matteo Messa, Colin Norman, Massimo Ricotti, Yoichi Tamura, Rogier A. Windhorst, Xinfeng Xu, Adi Zitrin

Abstract: We present recent JWST NIRCam imaging observations of SPT0615-JD (also known as the Cosmic Gems Arc), lensed by the galaxy cluster SPT-CL J0615-5746. The 5-arcsec-long arc is the most highly magnified $z>10$ galaxy known, straddling the lensing critical curve and revealing five star clusters with radii $\sim 1$ pc or less. We measure the full arc to have F200W 24.5 AB mag, consisting of two mirror images, each 25.3 AB mag with a magnification $μ\sim 60$ (delensed 29.7 AB mag, $M_{UV} = -17.8$). The galaxy has an extremely strong Lyman break F115W$-$F200W $>3.2$ mag ($2σ$ lower limit), is undetected in all bluer filters ($< 2σ$), and has a very blue continuum slope redward of the break ($β= -2.7 \pm 0.1$), resulting in a photometric redshift $z_{phot} = 10.2 \pm 0.2$ (95% confidence) with no significant likelihood below $z < 9.8$. Based on SED fitting to the total photometry, we estimate an intrinsic stellar mass of $M_{*} \sim 2.4 - 5.6 \times 10^{7} M_{\odot}$, young mass-weighted age of $\sim 21 - 79$ Myr, low dust content ($A_V < 0.15$), and a low metallicity of $\lesssim 1\%~Z_{\odot}$. We identify a fainter third counterimage candidate within 2.2 arcsec of the predicted position, lensed to AB mag 28.4 and magnified by $μ\sim 2$, suggesting the fold arc may only show $\sim60$% of the galaxy. SPT0615-JD is a unique laboratory to study star clusters observed within a galaxy just 460 Myr after the Big Bang.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 22 pages, 8 figures, 4 tables, submitted to ApJ

arXiv:2404.04500 [pdf, other]

Trustless Audits without Revealing Data or Models

Authors: Suppakit Waiwitlikhit, Ion Stoica, Yi Sun, Tatsunori Hashimoto, Daniel Kang

Abstract: There is an increasing conflict between business incentives to hide models and data as trade secrets, and the societal need for algorithmic transparency. For example, a rightsholder wishing to know whether their copyrighted works have been used during training must convince the model provider to allow a third party to audit the model and data. Finding a mutually agreeable third party is difficult, and the associated costs often make this approach impractical. In this work, we show that it is possible to simultaneously allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties. We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights, alongside a zero-knowledge proof (ZKP) certifying that published commitments are derived from training the model. Model providers can then respond to audit requests by privately computing any function F of the dataset (or model) and releasing the output of F alongside another ZKP certifying the correct execution of F. To enable ZkAudit, we develop new methods of computing ZKPs for SGD on modern neural nets for simple recommender systems and image classification models capable of high accuracies on ImageNet. Empirically, we show it is possible to provide trustless audits of DNNs, including copyright, censorship, and counterfactual audits with little to no loss in accuracy.

Submitted 6 April, 2024; originally announced April 2024.

arXiv:2404.04475 [pdf, other]

Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

Authors: Yann Dubois, Balázs Galambosi, Percy Liang, Tatsunori B. Hashimoto

Abstract: LLM-based auto-annotators have become a key component of the LLM development process due to their cost-effectiveness and scalability compared to human-based evaluation. However, these auto-annotators can introduce complex biases that are hard to remove. Even simple, known confounders such as preference for longer outputs remain in existing automated evaluation metrics. We propose a simple regression analysis approach for controlling biases in auto-evaluations. As a real case study, we focus on reducing the length bias of AlpacaEval, a fast and affordable benchmark for chat LLMs that uses LLMs to estimate response quality. Despite being highly correlated with human preferences, AlpacaEval is known to favor models that generate longer outputs. We introduce a length-controlled AlpacaEval that aims to answer the counterfactual question: "What would the preference be if the model's and baseline's output had the same length?". To achieve this, we first fit a generalized linear model to predict the biased output of interest (auto-annotator preferences) based on the mediators we want to control for (length difference) and other relevant features. We then obtain length-controlled preferences by predicting preferences while conditioning the GLM with a zero difference in lengths. Length-controlling not only improves the robustness of the metric to manipulations in model verbosity, but we also find that it increases the Spearman correlation with LMSYS' Chatbot Arena from 0.94 to 0.98. We release the code and leaderboard at https://tatsu-lab.github.io/alpaca_eval/.

Submitted 5 April, 2024; originally announced April 2024.

arXiv:2404.01773 [pdf, other]

Measurement of the mesonic decay branch of the $\bar{K}\!N\!N$ quasi-bound state

Authors: T. Yamaga, S. Ajimura, H. Asano, G. Beer, H. Bhang, M. Bragadireanu, P. Buehler, L. Busso, M. Cargnelli, S. Choi, C. Curceanu, S. Enomoto, H. Fujioka, Y. Fujiwara, T. Fukuda, C. Guaraldo, T. Hashimoto, R. S. Hayano, T. Hiraiwa, M. Iio, M. Iliescu, K. Inoue, Y. Ishiguro, T. Ishikawa, S. Ishimoto, et al. (45 additional authors not shown)

Abstract: We conducted measurements of $K^- + {^3{\rm He}} \to π\!Y \!N + N'$ reactions using a $1~{\rm GeV}/c$ $K^-$-beam, with the objective of understanding the broad decay width of $\bar{K} \!N \!N$ (approximately twice as broad as that of $Λ(1405)$ considered to be the $\bar{K} \!N$ quasi-bound state). We successfully reproduced distributions of the $π\! Y \! N$ invariant mass and momentum transfer for $π\! Y \! N$ using model fitting functions for $\bar{K} \!N \!N$ formation and quasi-free $\bar{K}$ absorption (${\rm QF}_{\bar{K}-{\rm abs}}$) processes. The model can describe the experimental data quite well, and four $\bar{K} \! N \! N \to π\! Y \! N $ cross-sections were obtained. The results indicate that mesonic decay is the dominant decay branch of $\bar{K} \! N \! N$. The results also suggest that $Γ_{πΛN} \sim Γ_{πΣN}$, which indicates that the $I_{\bar{K} \! N}=1$ absorption channel, in addition to the $I_{\bar{K} \! N}=0$ absorption channel, substantially contributes to the $\bar{K} \! N \! N$ decay, making the $\bar{K} \! N \! N$ state approximately twice as unstable as $Λ$(1405).

Submitted 2 June, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

arXiv:2404.00474 [pdf, other]

Linguistic Calibration of Long-Form Generations

Authors: Neil Band, Xuechen Li, Tengyu Ma, Tatsunori Hashimoto

Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.

Submitted 4 June, 2024; v1 submitted 30 March, 2024; originally announced April 2024.

Comments: ICML 2024. Code available at https://github.com/tatsu-lab/linguistic_calibration

arXiv:2403.17133 [pdf, other]

RIOJA. Complex Dusty Starbursts in a Major Merger B14-65666 at z=7.15

Authors: Yuma Sugahara, Javier Álvarez-Márquez, Takuya Hashimoto, Luis Colina, Akio K. Inoue, Luca Costantin, Yoshinobu Fudamoto, Ken Mawatari, Yi W. Ren, Santiago Arribas, Tom J. L. C. Bakx, Carmen Blanco-Prieto, Daniel Ceverino, Alejandro Crespo Gómez, Masato Hagimoto, Takeshi Hashigaya, Rui Marques-Chaves, Hiroshi Matsuo, Yurina Nakazato, Miguel Pereira-Santaella, Yoichi Tamura, Mitsutaka Usui, Naoki Yoshida

Abstract: We present JWST NIRCam imaging of B14-65666 ("Big Three Dragons"), a bright Lyman-break galaxy system ($M_\text{UV}=-22.5$ mag) at $z=7.15$. The high angular resolution of NIRCam reveals the complex morphology of two galaxy components: galaxy E has a compact core (E-core), surrounded by diffuse, extended, rest-frame optical emission, which is likely to be tidal tails; and galaxy W has a clumpy and elongated morphology with a blue UV slope ($β_\text{UV}=-2.2\pm0.1$). The flux excess, F356W$-$F444W, peaks at the E-core ($1.05^{+0.08}_{-0.09}$ mag), tracing the presence of strong [OIII] 4960,5008 Å emission. ALMA archival data show that the bluer galaxy W is brighter in dust continua than the redder galaxy E, while the tails are bright in [OIII] 88 $\mathrm{μm}$. The UV/optical and sub-mm SED fitting confirms that B14-65666 is a major merger in a starburst phase as derived from the stellar mass ratio (3:1 to 2:1) and the star-formation rate, $\simeq1$ dex higher than the star-formation main sequence at the same redshift. The galaxy E is a dusty ($A_\text{V}=1.2\pm0.1$ mag) starburst with a possible high dust temperature ($\ge63$-$68$ K). The galaxy W would have a low dust temperature ($\le27$-$33$ K) or patchy stellar-and-dust geometry, as suggested from the infrared excess (IRX) and $β_\text{UV}$ diagram. The high optical-to-FIR [OIII] line ratio of the E-core shows its lower gas-phase metallicity ($\simeq0.2$ Z$_{\odot}$) than the galaxy W. These results agree with a scenario where major mergers disturb morphology and induce nuclear dusty starbursts triggered by less-enriched inflows. B14-65666 shows a picture of complex stellar buildup processes during major mergers in the epoch of reionization.

Submitted 25 March, 2024; originally announced March 2024.

Comments: 18 pages, 6 figures, 4 tables. Submitted to ApJ

arXiv:2402.16827 [pdf, other]

A Survey on Data Selection for Language Models

Authors: Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, Colin Raffel, Shiyu Chang, Tatsunori Hashimoto, William Yang Wang

Abstract: A major factor in the recent success of large language models is the use of enormous and ever-growing text datasets for unsupervised pre-training. However, naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary. Filtering out data can also decrease the carbon footprint and financial costs of training models by reducing the amount of training required. Data selection methods aim to determine which candidate data points to include in the training dataset and how to appropriately sample from the selected data points. The promise of improved data selection methods has caused the volume of research in the area to rapidly expand. However, because deep learning is mostly driven by empirical evidence and experimentation on large-scale data is expensive, few organizations have the resources for extensive data selection research. Consequently, knowledge of effective data selection practices has become concentrated within a few organizations, many of which do not openly share their findings and methodologies. To narrow this gap in knowledge, we present a comprehensive review of existing literature on data selection methods and related research areas, providing a taxonomy of existing approaches. By describing the current landscape of research, this work aims to accelerate progress in data selection by establishing an entry point for new and established researchers. Additionally, throughout this review we draw attention to noticeable holes in the literature and conclude the paper by proposing promising avenues for future research.

Submitted 8 March, 2024; v1 submitted 26 February, 2024; originally announced February 2024.

Comments: Paper list available at https://github.com/alon-albalak/data-selection-survey

arXiv:2402.10978 [pdf, other]

Language Models with Conformal Factuality Guarantees

Authors: Christopher Mohri, Tatsunori Hashimoto

Abstract: Guaranteeing the correctness and factuality of language model (LM) outputs is a major open problem. In this work, we propose conformal factuality, a framework that can ensure high probability correctness guarantees for LMs by connecting language modeling and conformal prediction. We observe that the correctness of an LM output is equivalent to an uncertainty quantification problem, where the uncertainty sets are defined as the entailment set of an LM's output. Using this connection, we show that conformal prediction in language models corresponds to a back-off algorithm that provides high probability correctness guarantees by progressively making LM outputs less specific (and expanding the associated uncertainty sets). This approach applies to any black-box LM and requires very few human-annotated samples. Evaluations of our approach on closed book QA (FActScore, NaturalQuestions) and reasoning tasks (MATH) show that our approach can provide 80-90% correctness guarantees while retaining the majority of the LM's original output.

Submitted 15 February, 2024; originally announced February 2024.

arXiv:2402.05386 [pdf, other]

Exploring the faintest end of mid-infrared luminosity functions up to $z\simeq 5$ with the JWST CEERS survey

Authors: Chih-Teng Ling, Tomotsugu Goto, Seong Jin Kim, Cossas K. -W. Wu, Tetsuya Hashimoto, Tom C. -C. Chien, Yu-Wei Lin, Simon C. -C. Ho, Ece Kilerci

Abstract: Mid-infrared (MIR) light from galaxies is sensitive to dust-obscured star-formation activities because it traces the characteristic emission of dust heated by young, massive stars. By constructing the MIR luminosity functions (LFs), we are able to quantify the overall dusty star formation history and the evolution of galaxies over cosmic time. In this work, we report the first rest-frame MIR LFs at 7.7, 10, 12.8, 15, 18, and 21 $μ$m as well as the total IR LF from the James Webb Space Telescope (JWST) Cosmic Evolution Early Release Science (CEERS) survey. We identify 506 galaxies at $z=0-5.1$ in the CEERS survey that also have optical photometry from the Hubble Space Telescope. With the unprecedented sensitivity of the JWST, we probe the faintest end of the LFs at $z=0-1$ down to $L^* \sim 10^7 L_\odot$, $\sim 2$ orders of magnitude fainter than those from the previous generation of IR space telescopes. Our findings connect well with and continue the faint end of the MIR LFs from the deepest observations in past works. As a proxy of star formation history, we present the MIR-based luminosity density up to $z\simeq4.0$, marking the first probe of the early Universe by JWST MIRI.

Submitted 7 February, 2024; originally announced February 2024.

Comments: 22 pages, 22 figures, 7 tables. Accepted for publication in MNRAS. A summary video can be found at https://youtu.be/TRb6bjmGfOU

arXiv:2401.15866 [pdf, other]

Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution

Authors: Ian Covert, Chanwoo Kim, Su-In Lee, James Zou, Tatsunori Hashimoto

Abstract: Many tasks in explainable machine learning, such as data valuation and feature attribution, perform expensive computation for each data point and can be intractable for large datasets. These methods require efficient approximations, and learning a network that directly predicts the desired output, which is commonly known as amortization, is a promising solution. However, training such models with exact labels is often intractable; we therefore explore training with noisy labels and find that this is inexpensive and surprisingly effective. Through theoretical analysis of the label noise and experiments with various models and datasets, we show that this approach significantly accelerates several feature attribution and data valuation methods, often yielding an order of magnitude speedup over existing approaches.

Submitted 28 January, 2024; originally announced January 2024.

arXiv:2401.10005 [pdf, other]

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Authors: Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada

Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of Large Multi-Modal Models (LMMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to imbue an LMM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. Our method comprises the development of a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. We designed an LMM, which has high capabilities on region awareness to address the intricate requirements of image-text alignment. The model undergoes a three-stage training phase, starting with large-scale image-text alignment using large-scale datasets, followed by instruction tuning, and fine-tuning with a focus on chain-of-thought reasoning. The results demonstrate a stride toward a more robust, accurate, and interpretable LMM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.

Submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.03224 [pdf, other]

Bound star clusters observed in a lensed galaxy 460 Myr after the Big Bang

Authors: Angela Adamo, Larry D. Bradley, Eros Vanzella, Adélaïde Claeyssens, Brian Welch, Jose M Diego, Guillaume Mahler, Masamune Oguri, Keren Sharon, Abdurro'uf, Tiger Yu-Yang Hsiao, Xinfeng Xu, Matteo Messa, Augusto E. Lassen, Erik Zackrisson, Gabriel Brammer, Dan Coe, Vasily Kokorev, Massimo Ricotti, Adi Zitrin, Seiji Fujimoto, Akio K. Inoue, Tom Resseguier, Jane R. Rigby, Yolanda Jiménez-Teja , et al. (3 additional authors not shown)

Abstract: The Cosmic Gems arc is among the brightest and highly magnified galaxies observed at redshift $z\sim10.2$. However, it is an intrinsically UV faint galaxy, in the range of those now thought to drive the reionization of the Universe. Hitherto the smallest features resolved in a galaxy at a comparable redshift are between a few hundreds and a few tens of parsecs. Here we report JWST observations of the Cosmic Gems. The light of the galaxy is resolved into five star clusters located in a region smaller than 70 parsec. They exhibit minimal dust attenuation and low metallicity, ages younger than 50 Myr and intrinsic masses of $\sim10^6$ M$_{\odot}$. Their lensing-corrected sizes are approximately 1 pc, resulting in stellar surface densities near $10^5$~M$_{\odot}$/pc$^2$, three orders of magnitude higher than typical young star clusters in the local universe. Despite the uncertainties inherent to the lensing model, they are consistent with being gravitationally bound stellar systems, i.e., proto-globular clusters. We conclude that star cluster formation and feedback likely contributed to shape the properties of galaxies during the epoch of reionization. [Abridged]

Submitted 12 June, 2024; v1 submitted 6 January, 2024; originally announced January 2024.

Comments: Accepted for publication

arXiv:2401.01087 [pdf]

doi 10.7566/JPSJ.92.124705

Electron transfer channel in the sugar recognition system assembled on nano gold particle

Authors: Takayuki Goto, Takeshi Hashimoto, Kai Sato, Yukihiro Kitamoto, Takashi Hayashita, Satoshi Iguchi, Takahiko Sasaki, Dita Puspita Sari, Isao Watanabe

Abstract: Existence of 1D spin diffusion in the electrochemical sugar recognition system consisting of a nano-sized gold particle (GNP), a ruthenium complex and a phenylboronic acid was investigated by NMR and muSR. When sugar molecules are recognized by the phenylboronic site, the response of electrochemical voltammetry of the Ru site changes, enabling the system to work as a sensitive sugar-sensor. In this recognition process, the change in the electronic state at the boron site caused by sugar must be transferred to the Ru site via alkyl chains. We have utilized the muon-labelled electrons method and the proton NMR to find out a channel of the electron transfer from the phenylboronic acid site to the gold nano particle via the one dimensional alkyl chain. If this transfer is driven by diffusive spin channel, characteristic field dependence is expected in the longitudinal spin relaxation rate of muSR and 1H-NMR. We have observed significant decrease in the spin relaxation rates with increasing applied field. The result is discussed in terms of low dimensional spin diffusion.

Submitted 2 January, 2024; originally announced January 2024.

arXiv:2401.01043 [pdf, other]

Polycyclic aromatic hydrocarbon (PAH) luminous galaxies in JWST CEERS data

Authors: Yu-Wei Lin, Cossas K. -W. Wu, Chih-Teng Ling, Tomotsugu Goto, Seong Jin Kim, Ece Kilerci, Tetsuya Hashimoto, Po-Ya Wang, Simon C. -C. Ho, Tiger Yu-Yang Hsiao, Bjorn Jasper R. Raquel, Yuri Uno

Abstract: It has been an unanswered question how many dusty galaxies have been undetected from the state-of-the-art observational surveys. JWST enables us to detect faint IR galaxies that have prominent polycyclic aromatic hydrocarbon (PAH) features in the mid-IR wavelengths. PAH is a valuable tracer of star formation and dust properties in the mid-infrared wavelength. The JWST Cosmic Evolution Early Release Science (CEERS) fields provide us with wavelength coverage from 7.7 to 21 $μ$m using six photometric bands of the mid-infrared instrument (MIRI). We have identified galaxies dominated by mid-IR emission from PAHs, termed PAH galaxies. From our multi-band photometry catalogue, we selected ten PAH galaxies displaying high flux ratios of $\log(S_{15}/S_{10}) > 0.8$. The SED fitting analysis indicates that these galaxies are star-forming galaxies with total IR luminosities of $10^{10}$ $\sim$ $10^{11.5}$ $L_{\odot}$ at z $\sim 1$. The morphology of PAH galaxies does not show any clear signatures of major merging or interaction within the MIRI resolution. The majority of them are on the star-formation main sequence at $z \sim 1$. Our result demonstrates that JWST can detect PAH emissions from normal star-forming galaxies at $z \sim 1$, in addition to ultra-luminous infrared galaxies (ULIRGs) or luminous infrared galaxies (LIRGs).

Submitted 2 January, 2024; originally announced January 2024.

Comments: 12 pages, 20 figures, 4 tables. Accepted by MNRAS. A summary video is at https://www.youtube.com/watch?v=UtPaVTFM4f8&ab_channel=NTHUCosmology

arXiv:2312.04469 [pdf, other]

On the Learnability of Watermarks for Language Models

Authors: Chenchen Gu, Xiang Lisa Li, Percy Liang, Tatsunori Hashimoto

Abstract: Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.

Submitted 2 May, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: Accepted at ICLR 2024

arXiv:2312.02090 [pdf, other]

Cosmic star-formation history and black hole accretion history inferred from the JWST mid-infrared source counts

Authors: Seong Jin Kim, Tomotsugu Goto, Chih-Teng Ling, Cossas K. -W. Wu, Tetsuya Hashimoto, Ece Kilerci, Simon C. -C. Ho, Yuri Uno, Po-Ya Wang, Yu-Wei Lin

Abstract: With the advent of the James Webb Space Telescope (JWST), extra-galactic source count studies were conducted down to sub-microJy in the mid-infrared (MIR), which is several tens of times fainter than what the previous-generation infrared (IR) telescopes achieved in the MIR. In this work, we aim to interpret the JWST source counts and constrain cosmic star-formation history (CSFH) and black hole accretion history (BHAH). We employ the backward evolution of local luminosity functions (LLFs) of galaxies to reproduce the observed source counts from sub-microJy to a few tens of mJy in the MIR bands of the JWST. The shapes of the LLFs at the MIR bands are determined using the model templates of the spectral energy distributions (SEDs) for five representative galaxy types (star-forming galaxies, starbursts, composite, AGN type 2 and 1). By simultaneously fitting our model to all the source counts in the six MIR bands, along with the previous results, we determine the best-fit evolutions of MIR LFs for each of the five galaxy types, and subsequently estimate the CSFH and BHAH. Thanks to the JWST, our estimates are based on several tens of times fainter MIR sources, the existence of which was merely an extrapolation in previous studies.

Submitted 4 December, 2023; originally announced December 2023.

Comments: 15 pages, 12 figures, published in MNRAS, https://doi.org/10.1093/mnras/stad3499. A summary video is https://youtu.be/Md6wragrYyM

arXiv:2312.01707 [pdf, other]

Perceptual Dimensions of Physical Properties of Handheld Objects Induced by Impedance Changes

Authors: Takeru Hashimoto, Shigeo Yoshida, Takuji Narumi

Abstract: Haptics in virtual reality is the emerging dimension after audiovisual experiences. Researchers designed several handheld VR controllers to simulate haptic experiences in virtual reality environments. Some of these devices, equipped to deliver active force, can dynamically alter the timing and intensity of force feedback, potentially offering a wide array of haptic sensations. Past research primarily used a single index to evaluate how users perceive physical property parameters, potentially limiting the assessment to the designer's intended scope and neglecting other potential perceptual experiences. Therefore, this study evaluates not how much but how humans feel a physical property when stimuli are changed. We conducted interviews to investigate how people feel when a haptic device changes motion impedance. We used thematic analysis to abstract the results of the interviews and gain an understanding of how humans attribute force feedback to a phenomenon. We also generated a vocabulary from the themes obtained from the interviews and asked users to evaluate force feedback using the semantic difference method. A factor analysis was used to investigate how changing the basic elements of motion, such as inertia, viscosity, and stiffness of the motion system, affects haptic perception. As a result, we obtained four critical factors: size, viscosity, weight, and flexibility factor, and clarified the correspondence between these factors and the change of impedance.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2312.00782 [pdf, other]

Quantifying chaos and randomness in magnetar bursts

Authors: Shotaro Yamasaki, Ersin Gogus, Tetsuya Hashimoto

Abstract: In this study, we explore the dynamical stability of magnetar bursts within the context of the chaos-randomness phase space for the first time, aiming to uncover unique behaviors compared to various astrophysical transients, including fast radio bursts (FRBs). We analyze burst energy time series data from active magnetar sources SGR J1550-5418 and SGR J1935+2154, focusing on burst arrival time and energy differences between consecutive events. We find a distinct separation in the time domain, where magnetar bursts exhibit significantly lower randomness compared to FRBs, solar flares, and earthquakes, with a slightly higher degree of chaos. In the energy domain, magnetar bursts exhibit a broad consistency with other phenomena, primarily due to the wide distribution of chaos-randomness observed across different bursts and sources. Intriguingly, contrary to expectations from the FRB-magnetar connection, the arrival time patterns of magnetar bursts in our analysis do not exhibit significant proximity to repeating FRBs in the chaos-randomness plane. This finding may challenge the hypothesis that FRBs are associated with typical magnetar bursts but indirectly supports the evidence that FRBs may primarily be linked to special magnetar bursts like peculiar X-ray bursts from SGR J1935+2154 observed simultaneously with Galactic FRB 200428.

Submitted 1 December, 2023; originally announced December 2023.

Comments: 6 pages, 3 figures, accepted for publication in MNRAS Letters

arXiv:2312.00364 [pdf, other]

Benchmarking Multi-Domain Active Learning on Image Classification

Authors: Jiayi Li, Rohan Taori, Tatsunori B. Hashimoto

Abstract: Active learning aims to enhance model performance by strategically labeling informative data points. While extensively studied, its effectiveness on large-scale, real-world datasets remains underexplored. Existing research primarily focuses on single-source data, ignoring the multi-domain nature of real-world data. We introduce a multi-domain active learning benchmark to bridge this gap. Our benchmark demonstrates that traditional single-domain active learning strategies are often less effective than random selection in multi-domain scenarios. We also introduce CLIP-GeoYFCC, a novel large-scale image dataset built around geographical domains, in contrast to existing genre-based domain datasets. Analysis on our benchmark shows that all multi-domain strategies exhibit significant tradeoffs, with no strategy outperforming across all datasets or all metrics, emphasizing the need for future research.

Submitted 1 December, 2023; originally announced December 2023.

arXiv:2311.16857 [pdf, other]

SERENADE II: An ALMA Multi-Band Dust-Continuum Analysis of 28 Galaxies at $5<z<8$ and the Physical Origin of the Dust Temperature Evolution

Authors: Ikki Mitsuhashi, Yuichi Harikane, Franz E. Bauer, Tom Bakx, Andrea Ferrara, Seiji Fujimoto, Takuya Hashimoto, Akio K. Inoue, Kazushi Iwasawa, Yuri Nishimura, Masatoshi Imanishi, Yoshiaki Ono, Toshiki Saito, Yuma Sugahara, Hideki Umehata, Livia Vallini, Tao Wang

Abstract: We present an analysis of ALMA multi-band dust-continuum observations for 28 spectroscopically-confirmed bright Lyman-break galaxies at $5<z<8$. Our sample consists of 11 galaxies at $z\sim6$ newly observed in our ALMA program, which substantially increases the number of $5<z<8$ galaxies with both rest-frame 88 and 158 $μ{\rm m}$ continuum observations, allowing us to simultaneously measure the IR luminosity and dust temperature for a statistical sample of $z\gtrsim5$ galaxies for the first time. We derive the relationship between the UV slope ($β_{\rm UV}$) and infrared excess (IRX) for the $z\sim6$ galaxies, and find a shallower IRX-$β_{\rm UV}$ relation compared to the previous results at $z\sim2$--4. Based on the IRX-$β_{\rm UV}$ relation consistent with our results and the $β_{\rm UV}$-$M_{\rm UV}$ relation including fainter galaxies in the literature, we find a limited contribution of the dust-obscured star formation to the total SFR density, $\sim30\%$ at $z\sim6$. Our measurements of the dust temperature at $z\sim6-7$, $T_{\rm dust}=40.9_{-9.1}^{+10.0}\,{\rm K}$ on average, support a gentle increase of $T_{\rm dust}$ from $z=0$ to $z\sim6$--7. Using an analytic model with parameters consistent with recent {\it{JWST}} results, we discuss that the observed redshift evolution of the dust temperature can be reproduced by an $\sim0.6\,{\rm dex}$ increase in the gas depletion timescale and $\sim0.4\,{\rm dex}$ decrease of the metallicity. The variety of $T_{\rm dust}$ observed at high redshifts can also be naturally explained by scatters around the star-formation main sequence and average mass-metallicity relation, including an extremely high dust temperature of $T_{\rm dust}>80\,{\rm K}$ observed in a galaxy at $z=8.3$.

Submitted 28 November, 2023; originally announced November 2023.

Comments: Submitted to ApJ

arXiv:2311.05553 [pdf, other]

Removing RLHF Protections in GPT-4 via Fine-Tuning

Authors: Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang

Abstract: As large language models (LLMs) have increased in their capabilities, so has their potential for dual use. To reduce harmful outputs, producers and vendors of LLMs have used reinforcement learning with human feedback (RLHF). In tandem, LLM vendors have been increasingly enabling fine-tuning of their most powerful models. However, concurrent work has shown that fine-tuning can remove RLHF protections. We may expect that the most powerful models currently available (GPT-4) are less susceptible to fine-tuning attacks. In this work, we show the contrary: fine-tuning allows attackers to remove RLHF protections with as few as 340 examples and a 95% success rate. These training examples can be automatically generated with weaker models. We further show that removing RLHF protections does not decrease usefulness on non-censored outputs, providing evidence that our fine-tuning strategy does not decrease usefulness despite using weaker models to generate training data. Our results show the need for further research on protections on LLMs.

Submitted 5 April, 2024; v1 submitted 9 November, 2023; originally announced November 2023.

Comments: Accepted to NAACL 2024. (7 pages)

arXiv:2310.19677 [pdf, other]

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks

Authors: Allen Nie, Yuhui Zhang, Atharva Amdekar, Chris Piech, Tatsunori Hashimoto, Tobias Gerstenberg

Abstract: Human commonsense understanding of the physical and social world is organized around intuitive theories. These theories support making causal and moral judgments. When something bad happens, we naturally ask: who did what, and why? A rich literature in cognitive science has studied people's causal and moral intuitions. This work has revealed a number of factors that systematically influence people's judgments, such as the violation of norms and whether the harm is avoidable or inevitable. We collected a dataset of stories from 24 cognitive science papers and developed a system to annotate each story with the factors they investigated. Using this dataset, we test whether large language models (LLMs) make causal and moral judgments about text-based scenarios that align with those of human participants. On the aggregate level, alignment has improved with more recent LLMs. However, using statistical analyses, we find that LLMs weigh the different factors quite differently from human participants. These results show how curated, challenge datasets combined with insights from cognitive science can help us go beyond comparisons based merely on aggregate metrics: we uncover LLMs' implicit tendencies and show to what extent these align with human intuitions.

Submitted 31 October, 2023; v1 submitted 30 October, 2023; originally announced October 2023.

Comments: 34 pages, 7 figures. NeurIPS 2023

arXiv:2310.18413 [pdf, other]

On the Fairness ROAD: Robust Optimization for Adversarial Debiasing

Authors: Vincent Grari, Thibault Laugel, Tatsunori Hashimoto, Sylvain Lamprier, Marcin Detyniecki

Abstract: In the field of algorithmic fairness, significant attention has been put on group fairness criteria, such as Demographic Parity and Equalized Odds. Nevertheless, these objectives, measured as global averages, have raised concerns about persistent local disparities between sensitive groups. In this work, we address the problem of local fairness, which ensures that the predictor is unbiased not only in terms of expectations over the whole population, but also within any subregion of the feature space, unknown at training time. To enforce this objective, we introduce ROAD, a novel approach that leverages the Distributionally Robust Optimization (DRO) framework within a fair adversarial learning objective, where an adversary tries to infer the sensitive attribute from the predictions. Using an instance-level re-weighting strategy, ROAD is designed to prioritize inputs that are likely to be locally unfair, i.e. where the adversary faces the least difficulty in reconstructing the sensitive attribute. Numerical experiments demonstrate the effectiveness of our method: it achieves Pareto dominance with respect to local fairness and accuracy for a given global fairness level across three standard datasets, and also enhances fairness generalization under distribution shift.

Submitted 27 October, 2023; originally announced October 2023.

Comments: 23 pages, 10 figures

arXiv:2310.17623 [pdf, other]

Proving Test Set Contamination in Black Box Language Models

Authors: Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto

Abstract: Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language models without access to pretraining data or model weights. Our approach leverages the fact that when there is no data contamination, all orderings of an exchangeable benchmark should be equally likely. In contrast, the tendency for language models to memorize example order means that a contaminated language model will find certain canonical orderings to be much more likely than others. Our test flags potential contamination whenever the likelihood of a canonically ordered benchmark dataset is significantly higher than the likelihood after shuffling the examples. We demonstrate that our procedure is sensitive enough to reliably prove test set contamination in challenging situations, including models as small as 1.4 billion parameters, on small test sets of only 1000 examples, and datasets that appear only a few times in the pretraining corpus. Using our test, we audit five popular publicly accessible language models for test set contamination and find little evidence for pervasive contamination.

Submitted 23 November, 2023; v1 submitted 26 October, 2023; originally announced October 2023.
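
A minimal sketch of the shuffling comparison described in the abstract above, written as a plain permutation test; it is not the authors' implementation. The `log_likelihood` callable and both toy scorers are hypothetical stand-ins for querying a real language model, and the number of shuffles is an arbitrary choice.

```python
import random
from typing import Callable, List, Sequence


def permutation_p_value(
    examples: Sequence[int],
    log_likelihood: Callable[[List[int]], float],
    num_shuffles: int = 999,
    seed: int = 0,
) -> float:
    """Estimate how unusual the canonical ordering is among random orderings."""
    rng = random.Random(seed)
    canonical_score = log_likelihood(list(examples))
    at_least_as_high = 0
    for _ in range(num_shuffles):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if log_likelihood(shuffled) >= canonical_score:
            at_least_as_high += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (1 + at_least_as_high) / (1 + num_shuffles)


def toy_memorizing_scorer(ordering: List[int]) -> float:
    """Hypothetical 'contaminated' model: rewards adjacent pairs in canonical order."""
    return float(sum(1 for a, b in zip(ordering, ordering[1:]) if b == a + 1))


def toy_clean_scorer(ordering: List[int]) -> float:
    """Hypothetical 'clean' model: every ordering is equally likely."""
    return 0.0


if __name__ == "__main__":
    benchmark = list(range(20))  # example IDs in their canonical order
    print("memorizing:", permutation_p_value(benchmark, toy_memorizing_scorer))
    print("clean:", permutation_p_value(benchmark, toy_clean_scorer))
```

A small p-value for the memorizing scorer reflects that the canonical ordering scores higher than nearly every shuffle, while the order-insensitive scorer gives no such signal.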

arXiv:2310.13807 [pdf, other]

Learning to (Learn at Test Time)

Authors: Yu Sun, Xinhao Li, Karan Dalal, Chloe Hsu, Sanmi Koyejo, Carlos Guestrin, Xiaolong Wang, Tatsunori Hashimoto, Xinlei Chen

Abstract: We reformulate the problem of supervised learning as learning to learn with two nested loops (i.e. learning problems). The inner loop learns on each individual instance with self-supervision before final prediction. The outer loop learns the self-supervised task used by the inner loop, such that its final prediction improves. Our inner loop turns out to be equivalent to linear attention when the inner-loop learner is only a linear model, and to self-attention when it is a kernel estimator. For practical comparison with linear or self-attention layers, we replace each of them in a transformer with an inner loop, so our outer loop is equivalent to training the architecture. When each inner-loop learner is a neural network, our approach vastly outperforms transformers with linear attention on ImageNet from 224 x 224 raw pixels in both accuracy and FLOPs, while (regular) transformers cannot run.

Submitted 7 January, 2024; v1 submitted 20 October, 2023; originally announced October 2023.

Comments: Fixed a few small typos

arXiv:2310.01846 [pdf, other]

Benchmarking and Improving Generator-Validator Consistency of Language Models

Authors: Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang

Abstract: As of September 2023, ChatGPT correctly answers "what is 7+8" with 15, but when asked "7+8=15, True or False" it responds with "False". This inconsistency between generating and validating an answer is prevalent in language models (LMs) and erodes trust. In this paper, we propose a framework for measuring the consistency between generation and validation (which we call generator-validator consistency, or GV-consistency), finding that even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time. To improve the consistency of LMs, we propose to finetune on the filtered generator and validator responses that are GV-consistent, and call this approach consistency fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%, and the improvement extrapolates to unseen tasks and domains (e.g., GV-consistency for positive style transfers extrapolates to unseen styles like humor). In addition to improving consistency, consistency fine-tuning improves both generator quality and validator accuracy without using any labeled data. Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves the generator quality by 16% and the validator accuracy by 6.3% across all tasks.

Submitted 3 October, 2023; originally announced October 2023.

Comments: preprint

arXiv:2309.15817 [pdf, other]

Identifying the Risks of LM Agents with an LM-Emulated Sandbox

Authors: Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J. Maddison, Tatsunori Hashimoto

Abstract: Recent advances in Language Model (LM) agents and tool use, exemplified by applications like ChatGPT Plugins, enable a rich set of capabilities but also amplify potential risks - such as leaking private data or causing financial losses. Identifying these risks is labor-intensive, necessitating implementing the tools, setting up the environment for each test scenario manually, and finding risky cases. As tools and agents become more complex, the high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks. To address these challenges, we introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios, without manual instantiation. Alongside the emulator, we develop an LM-based automatic safety evaluator that examines agent failures and quantifies associated risks. We test both the tool emulator and evaluator through human evaluation and find that 68.8% of failures identified with ToolEmu would be valid real-world agent failures. Using our curated initial benchmark consisting of 36 high-stakes tools and 144 test cases, we provide a quantitative risk analysis of current LM agents and identify numerous failures with potentially severe outcomes. Notably, even the safest LM agent exhibits such failures 23.9% of the time according to our evaluator, underscoring the need to develop safer LM agents for real-world deployment.

Submitted 17 May, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

arXiv:2309.14337 [pdf, other]

The true fraction of repeating fast radio bursts revealed through CHIME source count evolution

Authors: Shotaro Yamasaki, Tomotsugu Goto, Chih-Teng Ling, Tetsuya Hashimoto

Abstract: Fast Radio Bursts (FRBs) are classified into repeaters and non-repeaters, with only a few percent of the observed FRB population from the Canadian Hydrogen Intensity Mapping Experiment (CHIME) confirmed as repeaters. However, this figure represents only a lower limit due to the observational biases, and the true fraction of repeaters remains unknown. Correcting for these biases uncovers a notable decline in apparently non-repeating FRB detection rate as the CHIME operational time increases. This finding suggests that a significant portion of apparently non-repeating FRBs could in fact exhibit repetition when observed over more extended periods. A simple population model infers that the true repeater fraction likely exceeds 50% with 99% confidence, a figure substantially larger than the observed face value, even consistent with 100%. This greater prevalence of repeaters had previously gone unnoticed due to their very low repetition rates ($\sim$10$^{-3.5}$ hr$^{-1}$ on average). Hence, theoretical FRB models must incorporate these low-rate repeaters. Furthermore, our results indicate a significantly higher repeater volume number density, potentially exceeding observed values by up to 10$^4$ times, which in turn impacts comparisons with potential FRB progenitors.

Submitted 12 December, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

Comments: 10 pages, 10 figures, MNRAS in press, updated to match the accepted version

arXiv:2309.07875 [pdf, other]

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

Authors: Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou

Abstract: Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.

Submitted 19 March, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

arXiv:2308.14840 [pdf, other]

doi 10.1561/3300000041

Identifying and Mitigating the Security Risks of Generative AI

Authors: Clark Barrett, Brad Boyd, Elie Burzstein, Nicholas Carlini, Brad Chen, Jihye Choi, Amrita Roy Chowdhury, Mihai Christodorescu, Anupam Datta, Soheil Feizi, Kathleen Fisher, Tatsunori Hashimoto, Dan Hendrycks, Somesh Jha, Daniel Kang, Florian Kerschbaum, Eric Mitchell, John Mitchell, Zulfikar Ramzan, Khawaja Shams, Dawn Song, Ankur Taly, Diyi Yang

Abstract: Every major technical invention resurfaces the dual-use dilemma -- the new technology has the potential to be used for good as well as for harm. Generative AI (GenAI) techniques, such as large language models (LLMs) and diffusion models, have shown remarkable capabilities (e.g., in-context learning, code-completion, and text-to-image generation and editing). However, GenAI can be used just as well by attackers to generate new attacks and increase the velocity and efficacy of existing attacks. This paper reports the findings of a workshop held at Google (co-organized by Stanford University and the University of Wisconsin-Madison) on the dual-use dilemma posed by GenAI. This paper is not meant to be comprehensive, but is rather an attempt to synthesize some of the interesting findings from the workshop. We discuss short-term and long-term goals for the community on this topic. We hope this paper provides both a launching point for a discussion on this important topic as well as interesting problems that the research community can work to address.

Submitted 28 December, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

Journal ref: Foundations and Trends in Privacy and Security 6 (2023) 1-52

arXiv:2308.09157 [pdf, other]

doi 10.14778/3611479.3611496

Accelerating Aggregation Queries on Unstructured Streams of Data

Authors: Matthew Russo, Tatsunori Hashimoto, Daniel Kang, Yi Sun, Matei Zaharia

Abstract: Analysts and scientists are interested in querying streams of video, audio, and text to extract quantitative insights. For example, an urban planner may wish to measure congestion by querying the live feed from a traffic camera. Prior work has used deep neural networks (DNNs) to answer such queries in the batch setting. However, much of this work is not suited for the streaming setting because it requires access to the entire dataset before a query can be submitted or is specific to video. Thus, to the best of our knowledge, no prior work addresses the problem of efficiently answering queries over multiple modalities of streams. In this work we propose InQuest, a system for accelerating aggregation queries on unstructured streams of data with statistical guarantees on query accuracy. InQuest leverages inexpensive approximation models ("proxies") and sampling techniques to limit the execution of an expensive high-precision model (an "oracle") to a subset of the stream. It then uses the oracle predictions to compute an approximate query answer in real-time. We theoretically analyze InQuest and show that the expected error of its query estimates converges on stationary streams at a rate inversely proportional to the oracle budget. We evaluate our algorithm on six real-world video and text datasets and show that InQuest achieves the same root mean squared error (RMSE) as two streaming baselines with up to 5.0x fewer oracle invocations. We further show that InQuest can achieve up to 1.9x lower RMSE at a fixed number of oracle invocations than a state-of-the-art batch setting algorithm.

Submitted 17 August, 2023; originally announced August 2023.

Comments: 14 pages, 11 figures, to be published in Proceedings of the VLDB Endowment, Vol. 16, No. 11

Journal ref: PVLDB, 16(11): 2897 - 2910, 2023

arXiv:2308.04635 [pdf]

Where's the Liability in Harmful AI Speech?

Authors: Peter Henderson, Tatsunori Hashimoto, Mark Lemley

Abstract: Generative AI, in particular text-based "foundation models" (large models trained on a huge variety of information including the internet), can generate speech that could be problematic under a wide range of liability regimes. Machine learning practitioners regularly "red team" models to identify and mitigate such problematic speech: from "hallucinations" falsely accusing people of serious misconduct to recipes for constructing an atomic bomb. A key question is whether these red-teamed behaviors actually present any liability risk for model creators and deployers under U.S. law, incentivizing investments in safety mechanisms. We examine three liability regimes, tying them to common examples of red-teamed model behaviors: defamation, speech integral to criminal conduct, and wrongful death. We find that any Section 230 immunity analysis or downstream liability analysis is intimately wrapped up in the technical details of algorithm design. And there are many roadblocks to truly finding models (and their associated parties) liable for generated speech. We argue that AI should not be categorically immune from liability in these scenarios and that as courts grapple with the already fine-grained complexities of platform algorithms, the technical details of generative AI loom above with thornier questions. Courts and policymakers should think carefully about what technical design incentives they create as they evaluate these issues.

Submitted 16 August, 2023; v1 submitted 8 August, 2023; originally announced August 2023.

Comments: Published in the Journal of Free Speech Law (2023)

arXiv:2307.15593 [pdf, other]

Robust Distortion-free Watermarks for Language Models

Authors: Rohith Kuditipudi, John Thickstun, Tatsunori Hashimoto, Percy Liang

Abstract: We propose a methodology for planting watermarks in text from an autoregressive language model that are robust to perturbations without changing the distribution over text up to a certain maximum generation budget. We generate watermarked text by mapping a sequence of random numbers -- which we compute using a randomized watermark key -- to a sample from the language model. To detect watermarked text, any party who knows the key can align the text to the random number sequence. We instantiate our watermark methodology with two sampling schemes: inverse transform sampling and exponential minimum sampling. We apply these watermarks to three language models -- OPT-1.3B, LLaMA-7B and Alpaca-7B -- to experimentally validate their statistical power and robustness to various paraphrasing attacks. Notably, for both the OPT-1.3B and LLaMA-7B models, we find we can reliably detect watermarked text ($p \leq 0.01$) from $35$ tokens even after corrupting between $40$-$50\%$ of the tokens via random edits (i.e., substitutions, insertions or deletions). For the Alpaca-7B model, we conduct a case study on the feasibility of watermarking responses to typical user instructions. Due to the lower entropy of the responses, detection is more difficult: around $25\%$ of the responses -- whose median length is around $100$ tokens -- are detectable with $p \leq 0.01$, and the watermark is also less robust to certain automated paraphrasing attacks we implement.

Submitted 6 June, 2024; v1 submitted 28 July, 2023; originally announced July 2023.

Comments: reformatting of camera-ready version accepted to TMLR, with minor edits to introduction

arXiv:2307.03576 [pdf, ps, other]

One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention

Authors: Arvind Mahankali, Tatsunori B. Hashimoto, Tengyu Ma

Abstract: Recent works have empirically analyzed in-context learning and shown that transformers trained on synthetic linear regression tasks can learn to implement ridge regression, which is the Bayes-optimal predictor, given sufficient capacity [Akyürek et al., 2023], while one-layer transformers with linear self-attention and no MLP layer will learn to implement one step of gradient descent (GD) on a least-squares linear regression objective [von Oswald et al., 2022]. However, the theory behind these observations remains poorly understood. We theoretically study transformers with a single layer of linear self-attention, trained on synthetic noisy linear regression data. First, we mathematically show that when the covariates are drawn from a standard Gaussian distribution, the one-layer transformer which minimizes the pre-training loss will implement a single step of GD on the least-squares linear regression objective. Then, we find that changing the distribution of the covariates and weight vector to a non-isotropic Gaussian distribution has a strong impact on the learned algorithm: the global minimizer of the pre-training loss now implements a single step of $\textit{pre-conditioned}$ GD. However, if only the distribution of the responses is changed, then this does not have a large effect on the learned algorithm: even when the response comes from a more general family of $\textit{nonlinear}$ functions, the global minimizer of the pre-training loss still implements a single step of GD on a least-squares linear regression objective.

Submitted 7 July, 2023; originally announced July 2023.
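
As a rough illustration of what "one step of GD on a least-squares linear regression objective" computes, the sketch below takes a single gradient step from a zero weight vector and checks that the resulting prediction equals a learning-rate-weighted sum of inner products between the query and the in-context examples. The zero initialization, the learning rate, and all function names are assumptions made for illustration; this is a textbook identity, not the paper's construction or proof.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def one_step_gd_prediction(
    xs: List[List[float]], ys: List[float], x_query: List[float], learning_rate: float
) -> float:
    """Predict with the weights obtained after one GD step from w = 0.

    For the loss 0.5 * sum_i (w . x_i - y_i)^2, the gradient at w = 0 is
    -sum_i y_i * x_i, so one step gives w = learning_rate * sum_i y_i * x_i.
    """
    dim = len(x_query)
    w = [learning_rate * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(dim)]
    return dot(w, x_query)


def inner_product_form(
    xs: List[List[float]], ys: List[float], x_query: List[float], learning_rate: float
) -> float:
    """Same prediction written as a weighted sum of inner products with the query."""
    return learning_rate * sum(y * dot(x, x_query) for x, y in zip(xs, ys))


if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    ys = [2.0, 3.0, 5.0]
    x_query = [1.0, 2.0]
    print(one_step_gd_prediction(xs, ys, x_query, 0.1))
    print(inner_product_form(xs, ys, x_query, 0.1))
```

The second form has the shape of an unnormalized, softmax-free attention readout over the context examples, which is why a linear self-attention layer can represent this predictor.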

arXiv:2307.02811 [pdf, other]

doi 10.1093/mnras/stad1942

Machine Learning Classification of Repeating FRBs from FRB121102

Authors: Bjorn Jasper R. Raquel, Tetsuya Hashimoto, Tomotsugu Goto, Bo Han Chen, Yuri Uno, Tiger Yu-Yang Hsiao, Seong Jin Kim, Simon C. -C. Ho

Abstract: Fast Radio Bursts (FRBs) are mysterious bursts in the millisecond timescale at radio wavelengths. Currently, there is little understanding about the classification of repeating FRBs, based on difference in physics, which is of great importance in understanding their origin. Recent works from the literature focus on using specific parameters to classify FRBs to draw inferences on the possible physical mechanisms or properties of these FRB subtypes. In this study, we use 1652 publicly available repeating FRBs from FRB121102 detected with the Five-hundred-meter Aperture Spherical Telescope (FAST), and study them with an unsupervised machine learning model. By fine-tuning the hyperparameters of the model, we found that there is an indication for four clusters from the bursts of FRB121102 instead of the two clusters ("Classical" and "Atypical") suggested in the literature. Specifically, the "Atypical" cluster can be further classified into three sub-clusters with distinct characteristics. Our findings show that the clustering result we obtained is more comprehensive not only because our study produced results which are consistent with those in the literature but also because our work uses more physical parameters to create these clusters. Overall, our methods and analyses produced a more holistic approach in clustering the repeating FRBs of FRB121102.

Submitted 6 July, 2023; v1 submitted 6 July, 2023; originally announced July 2023.

Comments: 24 pages, 14 figures, accepted for publication in MNRAS. For summary video, please see https://www.youtube.com/watch?v=wYx6t2G__84&list=PLOpYDs2PkYlYIiKDjDz6r6aKXcXdJZXYb&index=13&ab_channel=NCHUAstronomy

arXiv:2307.02104 [pdf, other]

Molecular outflow in the reionization-epoch quasar J2054-0005 revealed by OH 119 $μ$m observations

Authors: Dragan Salak, Takuya Hashimoto, Akio K. Inoue, Tom J. L. C. Bakx, Darko Donevski, Yoichi Tamura, Yuma Sugahara, Nario Kuno, Yusuke Miyamoto, Seiji Fujimoto, Suphakorn Suphapolthaworn

Abstract: Molecular outflows are expected to play a key role in galaxy evolution at high redshift. To study the impact of outflows on star formation at the epoch of reionization, we performed sensitive ALMA observations of OH 119 $μ$m toward J2054-0005, a luminous quasar at $z=6.04$. The OH line is detected and exhibits a P-Cygni profile that can be fitted with a broad blue-shifted absorption component, providing unambiguous evidence of an outflow, and an emission component at near-systemic velocity. The mean and terminal outflow velocities are estimated to be $v_\mathrm{out}\approx670~\mathrm{km~s}^{-1}$ and $1500~\mathrm{km~s}^{-1}$, respectively, making the molecular outflow in this quasar one of the fastest at the epoch of reionization. The OH line is marginally spatially resolved for the first time in a quasar at $z>6$, revealing that the outflow extends over the central 2 kpc region. The mass outflow rate is comparable to the star formation rate ($\dot{M}_\mathrm{out}/\mathrm{SFR}\sim2$), indicating rapid ($\sim10^7~\mathrm{yr}$) quenching of star formation. The mass outflow rate in a sample of star-forming galaxies and quasars at $4<z<6.4$ exhibits a positive correlation with the total infrared luminosity, although the scatter is large. Owing to the high outflow velocity, a large fraction (up to $\sim50\%$) of the outflowing molecular gas may be able to escape from the host galaxy into the intergalactic medium.

Submitted 17 November, 2023; v1 submitted 5 July, 2023; originally announced July 2023.

Comments: Accepted to ApJ

arXiv:2307.00874 [pdf, ps, other]

Center Preserving Automorphisms of Finite Heisenberg Group over $\mathbb Z_N$

Authors: T. Hashimoto, M. Horibe, A. Hayashi

Abstract: We investigate the group structure of center-preserving automorphisms of the finite Heisenberg group over $\mathbb Z_N$ with $U(1)$ extension, which arises in finite-dimensional quantum mechanics on a discrete phase space. Constructing an explicit splitting, it is shown that, for $N=2(2k+1)$, the group is isomorphic to the semidirect product of $Sp_N$ and $\mathbb Z_N^2$. Moreover, when $N$ is divisible by $2^l$ ($l \ge 2$), the group has a non-trivial 2-cocycle, and its explicit form is provided. By utilizing the splitting, it is demonstrated that the corresponding projective Weil representation can be lifted to a linear representation.

Submitted 2 October, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

Comments: 23 pages, 1 figure

arXiv:2306.02663 [pdf, other]

doi 10.1093/mnras/stad1679

A T-Dwarf Candidate from JWST Early Release NIRCam data

Authors: Po-Ya Wang, Tomotsugu Goto, Simon C. -C. Ho, Yu-Wei Lin, Cossas K. -W. Wu, Chih-Teng Ling, Tetsuya Hashimoto, Seong Jin Kim, Tiger Y. -Y. Hsiao

Abstract: We present a distant T-type brown dwarf candidate at $\approx2.55$ kpc discovered in the Cosmic Evolution Early Release Science (CEERS) fields by James Webb Space Telescope (JWST) NIRCam. In addition to the superb sensitivity, we utilised 7 JWST filters in the near-IR, which is advantageous in finding faint, previously unseen brown dwarfs. From the model spectra in new JWST/NIRCam filter wavelengths, the selection criteria of F115W-F277W$<$-0.8 and F277W-F444W$>$1.1 were chosen to target the spectral features of brown dwarfs having temperatures from 500K to 1300K. Searching through the data from Early Release Observations (ERO) and Early Release Science (ERS), we find 1 promising candidate in the CEERS field. The result of SED fitting suggests an early T spectral type with a low effective temperature of T$_\text{eff}\approx$1300K, a surface gravity of $\log{g}\approx5.25\text{cm s}^{-2}$, and an eddy diffusion parameter of logK$_{zz}\approx7\text{cm}^2 \text{s}^{-1}$, which indicates an age of $\approx$1.8Gyr and a mass of $\approx0.05$M$_{\odot}$. In contrast to T dwarfs typically found within several hundred parsecs, the estimated distance of the source is $\approx2.55$kpc, showing JWST's power to extend the search to a much larger distance.

Submitted 5 June, 2023; originally announced June 2023.

Comments: 5 pages, 6 figures and 1 table; accepted for publication in MNRAS; A summary video is available at https://youtu.be/PQW79tuS0mI

arXiv:2305.18619 [pdf, other]

Likelihood-Based Diffusion Language Models

Authors: Ishaan Gulrajani, Tatsunori B. Hashimoto

Abstract: Despite a growing interest in diffusion-based language models, existing work has not shown that these models can attain nontrivial likelihoods on standard language modeling benchmarks. In this work, we take the first steps towards closing the likelihood gap between autoregressive and diffusion-based language models, with the goal of building and releasing a diffusion model which outperforms a small but widely-known autoregressive model. We pursue this goal through algorithmic improvements, scaling laws, and increased compute. On the algorithmic front, we introduce several methodological improvements for the maximum-likelihood training of diffusion language models. We then study scaling laws for our diffusion models and find compute-optimal training regimes which differ substantially from autoregressive models. Using our methods and scaling analysis, we train and release Plaid 1B, a large diffusion language model which outperforms GPT-2 124M in likelihood on benchmark datasets and generates fluent samples in unconditional and zero-shot control settings.

Submitted 30 May, 2023; originally announced May 2023.

Showing 1–50 of 397 results for author: Hashimoto, T