Volume 666 (October 2022)

A&A, 666 (2022) A176

Open Access

Issue		A&A Volume 666, October 2022


Article Number		A176
Number of page(s)		24
Section		Interstellar and circumstellar matter
DOI		https://doi.org/10.1051/0004-6361/202243078
Published online		25 October 2022

Inferring properties of dust in supernovae with neural networks

Zoe Ansari¹, Christa Gall¹, Roger Wesson² and Oswin Krause³

¹ DARK, Niels Bohr Institute, University of Copenhagen, Jagtvej 128, 2200 Copenhagen, Denmark
e-mail: zakieh.ansari@nbi.ku.dk
² Department of Physics and Astronomy, University College London, Gower Street, London WC1E 6BT, UK
³ Department of Computer Science, University of Copenhagen, Universitetsparken 1, 2100 Copenhagen, Denmark

Received: 10 January 2022
Accepted: 9 July 2022

Abstract

Context. Determining properties of dust that formed in and around supernovae from observations remains challenging. This may be due either to incomplete coverage of data in wavelength or time, or to the often inconspicuous signatures of dust in the observed data.

Aims. Here we address this challenge using modern machine learning methods to determine the amount and temperature of dust as well as its composition from a large set of simulated data. We aim to quantify if such methods are suitable to infer quantities and properties of dust from future observations of supernovae.

Methods. We developed a neural network consisting of eight fully connected layers and an output layer with specified activation functions that allowed us to predict the dust mass, temperature, and composition as well as their respective uncertainties for each single supernova of a large set of simulated supernova spectral energy distributions (SEDs). We produced the large set of supernova SEDs for a wide range of different supernovae and dust properties using the advanced, fully three-dimensional radiative transfer code MOCASSIN. We then convolved each SED with the entire suite of James Webb Space Telescope (JWST) bandpass filters to synthesise a photometric data set. We split this data set into three subsets which were used to train, validate, and test the neural network. To find out how accurately the neural network can predict the dust mass, temperature, and composition from the simulated data, we considered three different scenarios. First, we adopted a uniform distance of ~0.43 Mpc for all simulated SEDs. Next we uniformly distributed all simulated SEDs within a volume of 0.43–65 Mpc and, finally, we artificially added random noise corresponding to a photometric uncertainty of 0.1 mag. Lastly, we conducted a feature importance analysis via SHapley Additive explanations (SHAP) to find the minimum set of JWST bandpass filters required to predict the selected dust quantities with an accuracy that is comparable to standard methods in the literature.

Results. We find that our neural network performs best for the scenario in which all SEDs are at the same distance and for a minimum subset of seven JWST bandpass filters within a wavelength range 3−25 µm. This results in rather small root-mean-square errors (RMSEs) of ~0.08 dex and ~42 K for the most reliable predicted dust masses and temperatures, respectively. For the scenario in which SEDs are distributed out to 65 Mpc and contain synthetic noise, the most reliable predicted dust masses and temperatures achieve an RMSE of ~0.12 dex and ~38 K, respectively. Thus, in all scenarios, both predicted dust quantities have smaller predicted uncertainties compared to those in the literature achieved with common SED fitting methods of actual observations of supernovae. Moreover, our neural network can well distinguish between the different dust species included in our work, reaching a classification accuracy of up to 95% for carbon and 99% for silicate dust.

Conclusions. Although we trained, validated, and tested our neural network entirely on simulated SEDs, our analysis shows that a suite of JWST bandpass filters containing NIRCam F070W, F140M, F356W and F480M as well as MIRI F560W, F770W, F1000W, F1130W, F1500W, and F1800W filters are likely the most important filters needed to derive the quantities and determine the properties of dust that formed in and around supernovae from future observations. We tested this on selected optical to infrared data of SN 1987A at 615 days past explosion and find good agreement with dust masses and temperatures inferred with standard fitting methods in the literature.

Key words: galaxies: star formation / methods: statistical / supernovae: general

© Z. Ansari et al. 2022

Open Access article, published by EDP Sciences, under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

This article is published in open access under the Subscribe-to-Open model. Subscribe to A&A to support open access publication.

1 Introduction

The origin of dust in galaxies in the Universe remains debated. Large amounts of dust are observed in galaxies and quasars in the early and local Universe (e.g. Bertoldi et al. 2003; Priddey et al. 2003; Michalowski et al. 2010a,b; Watson et al. 2015; Wang et al. 2008; Marrone et al. 2018), some of which require a rapid and efficient dust formation process (e.g. Dwek et al. 2007; Gall et al. 2011a,b; Finkelstein et al. 2012). There is growing evidence that core collapse supernovae (CCSNe), which mark the death of short-lived massive stars, are efficient dust producers likely responsible for the observed large amounts of dust in galaxies (Gall et al. 2011b, 2014; Ferrara et al. 2016; Gall & Hjorth 2018; De Looze et al. 2020). An alternative to the rapid in situ dust production in CCSNe is grain growth in cold molecular clouds in the interstellar medium (ISM, e.g. Draine 2009) from rapidly produced dust grain seeds and heavy elements by CCSNe.

Dust masses inferred from observations of supernovae (SNe) range from less than about 10⁻⁴ M_⊙ in young CCSNe of a few hundred days old to about 0.1−1.0 M_⊙ in old CCSN remnants of a few 100−1000 yr of age. From a handful of CCSNe that have observationally been monitored over several years, it is evident that the amount of dust gradually increases over about 25−30 years (Gall et al. 2011b, 2014; Wesson et al. 2015; Bevan & Barlow 2016; Gall & Hjorth 2018). Observations of older supernova remnants (SNRs) such as Cas A (Niculescu-Duvaz et al. 2021), N49 (Otsuka et al. 2010), Sgr A East (~0.02 M_⊙ and ~10000 years old, Lau et al. 2015), G11.2−0.3 (~0.34 M_⊙), G21.5−0.9 (~0.29 M_⊙), and G29.7−0.3 (~0.51 M_⊙; Chawner et al. 2019) confirm that on average about 0.3 M_⊙ of CCSN-produced dust is sustained over a period of about 3000 yr. While this is sufficient to account for the total dust mass observed in local as well as high redshift galaxies (Gall & Hjorth 2018), the final amount of dust released into the ISM may still depend on the efficiency of dust destruction and re-formation behind diverse short and long time-scale reverse shocks launched by the forward shock interaction with either the circumstellar medium (CSM, e.g. Mauerhan & Smith 2012; Matsuura et al. 2019) or the ISM (e.g. Silvia et al. 2012; Micelotta et al. 2016).

Inferring dust quantities as well as properties from observations is challenging. Typically, the amount of dust and its temperature are determined by fitting the thermal dust emission in the near- to far-infrared (far-IR) wavelength range with dust models at different levels of complexity (e.g. Rho et al. 2009; Gall et al. 2011b; Wesson et al. 2015; Matsuura et al. 2015, 2019; Chen et al. 2021). However, the most common dust species are rather featureless in this wavelength range. Silicates have the most prominent emission feature, at around 10−12 µm (Draine & Lee 1984; Henning 2010), but even this can appear featureless for cold dust and/or dust with large grains. Due to limited computational power or insufficient data, the manifold of dust model parameters can often neither be fully explored nor constrained. This leads to dust mass estimates that may vary over an order of magnitude (Gall & Hjorth 2018).

Warm and cold dust (<500 K) in nearby SNe and SNRs has been detected in the mid- to far-IR wavelength range with telescopes such as WISE, SOFIA, ALMA or the Herschel mission (2009−2013, e.g. Gomez et al. 2012; Indebetouw et al. 2014; De Looze et al. 2019; Gall et al. 2011b; Gall & Hjorth 2018, and references therein) and notably the Spitzer Space Telescope, which during its cold (2003−2009) and warm (2009−2020) phases observed about 380 CCSNe out of about 1100 SNe in total (see Szalai et al. 2019 for a summary). The next telescope in line with the right sensitivity to observe dust that is either newly formed or heated, and possibly to constrain some dust species, will be the James Webb Space Telescope (JWST, Gardner et al. 2006). With onboard instruments such as the Near-Infrared Camera and Spectrograph (NIRCam, NIRSpec), the Near-Infrared Imager and Slitless Spectrograph (NIRISS), and the Mid-Infrared Instrument (MIRI), imaging as well as spectroscopic observations of CCSNe in the wavelength range 0.6−28 µm will be possible. However, the wavelength range of JWST is shorter than that of the Spitzer Infrared Spectrograph, which extended out to ~38 µm. Thus, JWST will preferentially probe the hot and warm dust regime but will not be suitable for probing the cold dust regime in which the majority of the large dust masses in SNRs are detected.

In this paper, we investigated whether modern machine learning algorithms can be used to determine the dust mass, temperature, and possible grain species from the signatures that dust imprints on the spectral energy distributions (SEDs) of SNe. We trained a neural network to predict such dust quantities from a simulated set of SEDs of CCSNe with different dust quantities and properties. The SEDs were produced using the fully three-dimensional photoionisation and dust radiative transfer code MOCASSIN¹ (Ercolano et al. 2003a, 2005), exploring a large parameter space of dust and SN properties. Assuming that the SNe are distributed within a maximum distance of 65 Mpc, we then convolved the SEDs with the suite of available JWST NIRCam (0.6−5.0 µm) and MIRI (5.0−28 µm) bandpass filters to synthesise a photometric data set. The use of simulated data was essential for this work since the presently existing body of observational data of dust in and around SNe is unfortunately insufficient.

The neural network was optimised to predict the total dust mass, dust temperature, and dust species. The data input to the neural network included the entire photometric data set, which consists of 293 236 SEDs and the redshift for each SED. To obtain a practical method, we performed a feature selection method to find the minimum number of JWST filters to estimate the dust properties. Furthermore, we trained the neural network to obtain an estimate on the uncertainties of the predicted quantities (i.e. dust mass, temperature and species). We then identified the most reliable predictions using self-defined and common performance evaluation metrics, which also provide information about the overall performance of the neural network.

In Sect. 2 we describe the simulated data set which forms the basis of our analysis and which we used to train our machine learning algorithm, which is described in Sect. 3. In Sect. 4 we describe the metrics that we employed to evaluate the performance of the neural network, and discuss possible caveats in Sect. 5. We present our results in Sect. 6 and discuss the implications of our results for future observations and the SN dust community in Sect. 7. We conclude in Sect. 8. Throughout the paper we assume a ΛCDM model with H₀ = 70 km s⁻¹ Mpc⁻¹ and Ω₀ = 0.3 (Abbott et al. 2017). We applied these assumptions to our simulated data set whenever needed, via a built-in library from astropy².

2 Simulated data

Here, we describe the simulated data set, which consists of simulated SN SEDs from which we synthesised a photometric data set using the entire suite of JWST NIRCam and MIRI bandpass filters. We describe how we dealt with either exceptionally faint or bright sources with respect to the JWST detection/sensitivity limits. Furthermore, we define three different scenarios, in each of which we derived a different data set from the simulated data set to train the neural network and test its performance for predicting the SN dust quantities and properties.

2.1 MOCASSIN

MOCASSIN (Monte Carlo Simulations of Ionised Nebulae) is a fully three-dimensional radiative transfer code that propagates radiation packets using a Monte Carlo technique (Ercolano et al. 2003a, 2005). Arbitrary distributions of material can be represented within a Cartesian grid. The material can consist of gas, dust, or both. In each grid cell, the thermal equilibrium and ionisation balance equations are solved to determine the physical conditions. For dusty models, MOCASSIN uses standard Mie scattering theory to calculate the effective absorption and scattering efficiencies for a grain of radius a at wavelength λ, from the optical constants of the material. Any type of grain size distribution and mixture of materials may be specified.

The material is illuminated by a radiation source or sources, which can be discrete point sources, or a diffuse source present within each grid cell. The spectral energy distribution of the illuminating source can be a simple blackbody (BB) or an arbitrary spectral shape such as a stellar atmosphere model. The radiation field is described by a composition of a discrete number of monochromatic packets of energy (Abbott & Lucy 1985) for all sources. At each location, the Monte Carlo estimator (Lucy 1999) derives the mean intensity of the radiation field. The contribution of each energy packet to the radiation field at each location is defined by its path through the grid.

To synthesise different SEDs of SNe with dusty shells (hereafter SN model SEDs) using MOCASSIN, we defined a set of parameters for the underlying radiation source (the SN), the dust itself, and its location. Specifically, in our simulation, our chosen radiation source is a central blackbody, which is defined by a temperature (T_BB) and a luminosity (L_BB). The range of the two parameters follows typical measurements of SN photospheres up to a few hundred days past explosion. The range of radii used in our models covers both the expected radii of SNe ejecta up to ~1000 days after explosion, as well as larger radii at which pre-existing dust flash-heated by a SN explosion could give rise to infrared emission. For the dust, we considered two prominent grain species, amorphous carbon and astronomical silicates, with optical constants taken from Zubko et al. (1996) and Draine & Lee (1984), respectively. For our simulations, we considered that all the dust consists of either 100% carbon, 100% silicates, or a 50:50 mixture of the two dust species. The range of initial dust masses is limited to 10⁻⁵−10⁻¹ M_⊙. The upper dust mass limit is partly motivated by the long run-time of SN model SED simulations when a lot of dust is present. Another reason is that, although the mean dust mass for SNe and SNRs is 0.4 ± 0.07 M_⊙ (Gall & Hjorth 2018), the dust temperatures for large dust masses (> 10⁻² M_⊙) in some SNe are <50 K (Gall et al. 2014). Even with JWST such cold dust will not be easily detected. Furthermore, we considered only single grain sizes ranging between 0.005−5 µm. Typically, such grain sizes are present in, for example, the Milky Way (e.g. Mathis et al. 1977) and are observed in some SNe (e.g. Gall et al. 2014; Wesson et al. 2015; Bevan et al. 2020). In total, our SN model SEDs are defined by seven parameters, for which we specified either a set of distinct choices or a range of values (some are described above). A summary of the entire parameter space is presented in Table 1. To finally create our data set, each SN model SED was synthesised from a set of parameters that was stochastically generated from this parameter space. This method ensures that the entire parameter space is uniformly explored.

In this work, we used MOCASSIN version 2.02.73 to synthesise 293 236 model SEDs. We constructed a cubical Cartesian grid with 11 cells on each side of the 3D grid to model the dusty shells, which are defined by an inner and outer radius of the shell, R_in and R_out, respectively. We modelled one-eighth of the grid cube (shell) with the illuminating source in one corner. Assuming spherical symmetry, this cube segment was then scaled to the full cube for an effective resolution of 21³ cells (Ercolano et al. 2003b). We used 10⁶ energy packets in most of our simulations. This relatively low number (~750 energy packets per grid cell) ensures that the MOCASSIN models run very quickly. However, at wavelengths where only a few photons are emitted, the SEDs are affected by small-number statistics and hence dominated by noise. Therefore, for MOCASSIN models with dust masses lower than 10⁻⁴ M_⊙, in which few photons are reprocessed to longer wavelengths, we used ten times as many energy packets to reduce the statistical noise in the SEDs at longer wavelengths (e.g. 5−30 µm).
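The packet budget quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text; reading the 21³ effective resolution as a mirrored octant that shares one plane of cells per axis is an assumption made here for illustration.

```python
# Packet budget for the grid described in the text.
cells_per_side = 11
n_cells = cells_per_side ** 3      # cells in the modelled one-eighth cube
n_packets = 10 ** 6                # default number of energy packets

packets_per_cell = n_packets / n_cells
print(f"{n_cells} cells, about {packets_per_cell:.0f} packets per cell")

# Assumption: mirroring the octant across three axes shares one plane of
# cells per axis, so the full cube has 2 * 11 - 1 cells per side.
effective_side = 2 * cells_per_side - 1
print(f"effective resolution: {effective_side}^3 cells")

# Low dust mass models use ten times as many packets.
packets_low_mass = 10 * n_packets
```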

For efficiency reasons, we set a maximum run-time of two minutes for each model. For most regions of the investigated parameter space, the MOCASSIN models have a run-time of a few seconds, but models with both a small shell radius (≲4 × 10¹⁶ cm) and a high dust mass (≳10⁻² M_⊙) have very high optical depths and thus time out. This results in a slightly nonuniform filling of the entire parameter space. Furthermore, any dust grains in a simulation that reach the sublimation temperature of their species (1400 K for silicate dust, 2200 K for carbon dust) are considered to have evaporated and are not included when calculating the SED. The final dust mass is then either lower than the input dust mass, or no dust remains at all. Consequently, for mixed-chemistry models, the composition is altered from 50:50 to a higher carbon fraction due to the higher sublimation temperature of carbon dust. MOCASSIN does not directly provide the final dust mass and composition if dust evaporation occurs, but they are easily extracted from the output grid files by summing the dust masses in cells where the temperature is below the dust sublimation temperature. Some dust evaporation occurs in about 5% of our models. Figure 1 shows the final distribution of the SN model SEDs in the dust mass, temperature, and species MOCASSIN output-parameter space.
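The bookkeeping for evaporated dust can be sketched as follows. This is a minimal illustration, not the authors' extraction script: the `cells` list of `(species, mass, temperature)` tuples is a hypothetical stand-in and does not mirror the actual MOCASSIN grid-file layout. Only the two sublimation temperatures come from the text.

```python
# Sublimation temperatures quoted in the text (K).
T_SUBLIMATION = {"silicate": 1400.0, "carbon": 2200.0}

def surviving_dust(cells):
    """Sum dust mass per species over cells below the sublimation temperature.

    `cells` is a hypothetical list of (species, mass_msun, temperature_K).
    """
    totals = {species: 0.0 for species in T_SUBLIMATION}
    for species, mass, temperature in cells:
        if temperature < T_SUBLIMATION[species]:
            totals[species] += mass
    return totals

def carbon_fraction(totals):
    """Carbon mass fraction of the surviving dust (NaN if nothing survives)."""
    total = sum(totals.values())
    return totals["carbon"] / total if total > 0 else float("nan")

# Toy 50:50 model: the hot cell loses its silicate but keeps its carbon.
cells = [
    ("silicate", 1e-4, 1500.0),  # above 1400 K, counted as evaporated
    ("carbon", 1e-4, 1500.0),    # below 2200 K, survives
    ("silicate", 2e-4, 900.0),
    ("carbon", 2e-4, 900.0),
]
totals = surviving_dust(cells)
print(totals, carbon_fraction(totals))
```

In this toy case the carbon fraction rises above one half, which is the shift in composition described for mixed-chemistry models.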

Table 1

Input parameters for the MOCASSIN models.

Fig. 1

Coverage of SN model SEDs in M_dust, R_out, and dust species parameter space. The colour bar represents T_dust of the SN model SEDs, with blue denoting the coldest (200 K) and red the hottest (2200 K) temperatures.

2.1.1 Synthetic photometry of optical and mid-IR JWST bandpass filters

The JWST is equipped with two imaging cameras, NIRCam and MIRI. The two cameras have in total six narrow and 31 broad bandpass filters available that cover the wavelength ranges 0.6−5 µm (NIRCam) and 5−30 µm (MIRI). As a next step in preparing the data set for our neural network, we convolved the SN model SEDs with both NIRCam and MIRI bandpass filters (hereafter filters) in order to synthesise a photometric data set. To do so, we used the Python package Pyphot³. This package has a built-in library of transmission curves of different filters. Since Pyphot also allows customised transmission curves, we imported transmission curves for NIRCam and MIRI filters from the Spanish virtual observatory⁴. For each NIRCam and MIRI filter, we first calculated the integrated flux in units of Jansky via Pyphot, which was then converted to AB magnitudes as

$$m_{\rm AB} = -2.5\,\log_{10}\!\left(\frac{F_\nu}{3631\ \mathrm{Jy}}\right), \tag{1}$$

following the definition of Hogg et al. (2002).
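As a rough illustration of this step, the sketch below averages a flux density over a filter transmission curve and converts the result to an AB magnitude using the standard 3631 Jy zero point. It does not use Pyphot. The function names, the simple energy-weighted average, and the top-hat filter are assumptions made for illustration, and Pyphot's own integration convention may differ.

```python
import math

AB_ZERO_POINT_JY = 3631.0  # flux density of a 0 mag source in the AB system

def band_averaged_flux(freq_hz, flux_jy, transmission):
    """Transmission-weighted mean flux density (Jy), trapezoidal rule."""
    num = 0.0
    den = 0.0
    for i in range(len(freq_hz) - 1):
        dnu = freq_hz[i + 1] - freq_hz[i]
        num += 0.5 * dnu * (flux_jy[i] * transmission[i]
                            + flux_jy[i + 1] * transmission[i + 1])
        den += 0.5 * dnu * (transmission[i] + transmission[i + 1])
    return num / den

def jansky_to_ab(flux_jy):
    """Convert a flux density in Jy to an AB magnitude."""
    return -2.5 * math.log10(flux_jy / AB_ZERO_POINT_JY)

# Toy example: flat 3.631 mJy spectrum through a hypothetical top-hat filter.
freq = [6.0e13, 7.0e13, 8.0e13, 9.0e13]
flux = [3.631e-3] * 4
trans = [0.0, 0.9, 0.9, 0.0]
mean_flux = band_averaged_flux(freq, flux, trans)
print(mean_flux, jansky_to_ab(mean_flux))
```

A flat spectrum keeps its flux density regardless of the filter shape, which makes it a convenient sanity check for the averaging step.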

2.1.2 JWST detection limits

For the final step, we considered that our synthetic photometric data set contains magnitudes in some filters that would either be too bright or too faint to be detected with JWST. In order to filter out data with magnitudes that practically cannot be observed (hereafter missing values), we adopted the pre-calculated point-source continuum detection limits (Glasse et al. 2015; Greene et al. 2017) that have been derived using the JWST exposure time calculator (ETC, Pontoppidan et al. 2016) for a signal-to-noise ratio (S/N) of 10 and exposure times of 21.4 s and 10 000 s for the saturation and sensitivity limits, respectively. A visualisation of these limits for all NIRCam and MIRI filters is shown in the appendix in Figs. A.1 and A.2, respectively.

2.2 Three scenarios

Typically, CCSNe occur in different types of galaxies at different distances. Consequently, distant CCSNe appear fainter than otherwise identical nearby CCSNe because their brightness decreases with distance as

$$F_\nu(\lambda_{\rm obs}) = \frac{(1+z)\,L_\nu(\lambda_{\rm emit})}{4\pi D_L^{2}}, \tag{2}$$

with F_ν(λ_obs) the observed flux as a function of the observed wavelength, λ_obs, in units of Jy; L_ν(λ_emit) the emitted luminosity at the emitted wavelength, λ_emit; D_L the luminosity distance and z the redshift. The observed wavelength is given by λ_obs = λ_emit(1 + z).

This implies that the SEDs of CCSNe are redshifted, so that a given bandpass filter samples light from a bluer region of the intrinsic CCSN spectrum than its nominal wavelength range. In extreme cases (e.g. at high redshift) such an effect may cause a non-negligible degeneracy between dust properties and redshift. In what follows, we define three individual scenarios that are used to test whether some quantities and properties of dust formed in and around CCSNe, such as the dust mass, M_dust, the dust temperature, T_dust, and the dust species, can be determined with neural networks.

For the first scenario, S1, we simply assumed that all CCSNe are at the same, low redshift of z = 0.0001, which corresponds to a distance of ≈0.43 Mpc. For comparison, the distance of SN 1987A, the closest observed extragalactic CCSN, is ≈0.0496 ± 0.00009 (statistical) ± 0.00054 (systematic) Mpc (Pietrzynski et al. 2019) and the next closest CCSN, SN 1885A (Fesen et al. 1989), is ~0.765 Mpc away.

Placing all SN model SEDs at the same short distance has the advantage that the observed model magnitudes are nearly identical to the intrinsic magnitudes of SN model SEDs and thus free of any possible degeneracy between dust properties and distance. Hence, we expect this scenario to be an ideal test case for the neural network. Moreover, from this scenario we can identify the smallest amount of dust detectable with JWST (see Sect. 2.1). For simplicity, in S1 we only considered the upper sensitivity limit of the JWST filters but did not apply the lower saturation limits.

For the second scenario, S2, we assumed that all our simulated CCSNe are uniformly distributed within the redshift range 0.0001−0.015, which corresponds to a distance range of ~0.43−65 Mpc. The decrease in brightness and shift in wavelength with increasing distance, together with the sensitivity and saturation limits of the JWST filters (see Figs. A.1 and A.2), place a limit on the distance out to which dust in CCSNe may be observed. Therefore, we chose z = 0.015 (i.e. ~65 Mpc) as an upper limit. This limit is based on the SN model SEDs, for which the thermal dust emission of 10⁻⁵ M_⊙ carbon dust at a temperature of ~2000 K remains detectable (see Sect. 2.1.2) in at least 10 out of 28 NIRCam filters.
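The redshift-to-distance conversions quoted for S1 and S2 can be reproduced with a short numerical integration. The sketch below is a stand-in for the astropy routine mentioned in the introduction. It assumes a flat ΛCDM universe and reads the quoted density parameter of 0.3 as the matter density; the function names are chosen here for illustration.

```python
import math

C_KM_S = 299792.458   # speed of light in km/s
H0 = 70.0             # km/s/Mpc, as adopted in the text
OMEGA_M = 0.3         # assumed to be the matter density; flatness is assumed

def luminosity_distance_mpc(z, steps=1000):
    """D_L in Mpc for flat LambdaCDM, integrating 1/E(z) with the midpoint rule."""
    dz = z / steps
    integral = 0.0
    for i in range(steps):
        zi = (i + 0.5) * dz
        integral += dz / math.sqrt(OMEGA_M * (1.0 + zi) ** 3 + 1.0 - OMEGA_M)
    return (1.0 + z) * (C_KM_S / H0) * integral

def distance_dimming_mag(d_near_mpc, d_far_mpc):
    """Magnitude difference from the inverse-square law alone."""
    return 5.0 * math.log10(d_far_mpc / d_near_mpc)

d_s1 = luminosity_distance_mpc(0.0001)
d_max = luminosity_distance_mpc(0.015)
print(f"z = 0.0001 -> {d_s1:.3f} Mpc, z = 0.015 -> {d_max:.1f} Mpc")
print(f"dimming across S2: {distance_dimming_mag(d_s1, d_max):.1f} mag")
```

The dimming function ignores the (1 + z) bandpass term of the flux relation, which is small at these redshifts.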

The data sets of scenarios S1 and S2 solely consist of synthesised magnitudes of all available JWST filters without uncertainties. Therefore, as our third test scenario, S3, we used the data set of S2 and added synthetic photometric noise. We assumed that each synthesised magnitude is ‘observed’ at S/N = 10, which translates into an uncertainty of 0.1 mag. This assumption is in line with what has been used to derive the detection limits (see Sect. 2.1.2). Hence, to create S3, we added randomly synthesised noise to the data of S2 as m_i,S3 = m_i,S2 + N(0,0.1), with m_i the magnitude in each JWST filter, and N(0,0.1) a randomly generated number drawn from a Gaussian distribution with zero mean and σ = 0.1.
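The noise model for S3 is simple enough to sketch directly. The first function uses the standard first-order relation σ_m ≈ (2.5 / ln 10) / (S/N), which gives roughly 0.11 mag at S/N = 10, consistent with the 0.1 mag adopted in the text. The function names, the example magnitudes, and the seeded generator are illustrative assumptions.

```python
import math
import random

def magnitude_uncertainty(snr):
    """Approximate 1-sigma magnitude error for a given signal-to-noise ratio."""
    return 2.5 / math.log(10) / snr

def add_photometric_noise(mags, sigma=0.1, rng=None):
    """Return m_i + N(0, sigma) for every magnitude in `mags`."""
    if rng is None:
        rng = random.Random()
    return [m + rng.gauss(0.0, sigma) for m in mags]

rng = random.Random(42)            # seeded only to make the example repeatable
s2_mags = [18.2, 19.5, 21.0]       # hypothetical S2 magnitudes
s3_mags = add_photometric_noise(s2_mags, sigma=0.1, rng=rng)
print(magnitude_uncertainty(10))
print(s3_mags)
```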

3 Neural networks

Our analysis is based on training a deep neural network using simulated data (see Sect. 2). The goal is to predict three dust quantities and properties, T_dust, M_dust and dust species, together with a prediction of their respective uncertainties. To conform with machine learning nomenclature, we refer to the set of photometric data that is synthesised from each SN model SED using JWST filters, along with the redshift of the SN model SED, as the input features. We also refer to each SN model SED that corresponds to each set of synthesised magnitudes, as a data point, since it is defined as a point in the input features’ space.

In the following subsections, we describe the artificial neural networks and the corresponding hyperparameters. We also describe the specific type of neural network that we used and its corresponding optimal set of hyperparameters as well as a pre-processing method to treat the missing values in our data set. Furthermore, we describe the training process of our neural network, in which we defined target values for three dust quantities and properties. Thereafter, we explain an iterative feature selection procedure which we used to find the minimum set of the most important JWST filters, with which the dust quantities and properties can still be predicted with an acceptable accuracy.

3.1 Artificial neural networks

An artificial neural network or in short, neural network, is a set of algorithms that is used to recognise relationships in a data set, and to find patterns. The structure of a neural network is inspired by biological neurons, and thus it mimics the methodology that biological neurons use to send signals to one another. Neural networks consist of one or more layers, known as hidden layers, between an input and an output layer. Each layer contains a set of neurons. The process of training a neural network consists of transferring information from the input layer to the output layer via a set of connections. Each connection is defined between each neuron in one layer to each neuron in the next layer. There are different methods to connect neurons and to transfer information between them. In the classic framework, each neuron of a given layer is connected to all neurons in the next layer. Layers that follow this pattern are called fully connected layers. Another method for connecting neurons consists of convolutional layers, in which each neuron from a layer is only connected to a well defined set of neurons from the next layer. A neural network can be built using either one or a combination of different layers and different patterns. To transfer the information, each layer applies an activation function to a set of weights associated with a set of neurons in the layer.

The output vector of each layer is defined as follows:

$$a^{l}_{i} = \mathcal{H}^{l}\!\left(\sum_{j=1}^{p} w^{l}_{ij}\,a^{l-1}_{j} + b^{l-1}_{i}\right), \tag{3}$$

where a^(l−1) is the input vector to layer l, w^l_ij is the element of the weight matrix that connects neuron j in layer l − 1 to neuron i in layer l, p is the number of neurons in layer l − 1, b^(l−1) is a vector of constant values assigned to the neurons of layer l, known as thresholds, and ℋ^l is the activation function for layer l. For the input layer (i.e. l = 0), a⁰ = x, where x is the input feature vector for the neural network.
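The layer output described here, an activation applied to a weighted sum of the previous layer's outputs plus a threshold, can be written in a few lines. This is a minimal pure-Python sketch with made-up weights, not the network used in this work.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]

def dense_layer(a_prev, weights, thresholds, activation):
    """Output of one fully connected layer.

    weights[i][j] connects neuron j of the previous layer to neuron i of
    this layer; thresholds[i] is the constant added for neuron i.
    """
    pre_activation = [
        sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
        for row, b_i in zip(weights, thresholds)
    ]
    return activation(pre_activation)

# Toy layer with two inputs and two neurons (all numbers are illustrative).
x = [1.0, -2.0]
weights = [[0.5, 0.25], [-1.0, 1.0]]
thresholds = [0.1, 0.0]
a1 = dense_layer(x, weights, thresholds, relu)
print(a1)
```

Stacking several such calls, each taking the previous output as its input, gives the forward pass of a fully connected network.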

The weights and the thresholds of neural networks are the model parameters that a neural network aims to optimise by improving its performance in estimating the target values. In the forward-propagation step of neural network training, the prediction error is first calculated using random weights. The prediction errors are quantified by a ‘loss function’. In a subsequent back-propagation step (e.g. Rumelhart et al. 1986), the weights are adjusted with the aim of minimising the loss. As the names suggest, forward-propagation proceeds from the input via the hidden to the output layer, while back-propagation proceeds in the reverse direction. This combination of forward- and back-propagation takes place within one epoch of training (hereafter epoch). Typically, several epochs are required to minimise the loss function and to improve the performance of the neural network.

Since the loss function can be non-convex, and finding a global minimum of a general non-convex function is NP-hard (Murty & Kabadi 1987), a neural network can be considered optimised when the loss function has converged to a ‘good’ local minimum. To reach such a minimum, minimisation algorithms such as the classical gradient descent are employed. The basic principle of such algorithms is to calculate the gradient of the loss function and to move step by step in the direction opposite to the gradient, with the step size termed the learning rate.

Choosing the right learning rate is important. With a high learning rate, the updated model parameters can overshoot the local minimum, so that the loss cannot converge to it. With a low learning rate, on the other hand, the algorithm takes a long time to reach the local minimum of the loss function.
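The effect of the learning rate can be seen on a one-dimensional toy loss, f(w) = w², whose gradient is 2w. The sketch below is purely illustrative and unrelated to the loss used in this work. Each update multiplies w by (1 − 2η), so the iteration converges only when that factor has magnitude below one.

```python
def gradient_descent(learning_rate, w0=1.0, steps=50):
    """Minimise f(w) = w**2 by stepping against the gradient 2*w."""
    w = w0
    for _ in range(steps):
        gradient = 2.0 * w
        w -= learning_rate * gradient
    return w

w_good = gradient_descent(0.1)     # converges towards the minimum at w = 0
w_slow = gradient_descent(0.001)   # moves in the right direction, but slowly
w_high = gradient_descent(1.1)     # overshoots on every step and diverges
print(w_good, w_slow, w_high)
```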

The batch gradient descent is a gradient descent optimisation method in which the neural network updates the weights only once per epoch, using the entire training data set. Although this is a fast approach for finding the local minimum of the loss function, the memory requirement for such a computational task is large. A remedy is to employ mini-batch gradient descent, in which the neural network updates the weights several times per epoch, once for each sub-sample of the data set. Such a sub-sample is called a mini-batch, and the number of data points it contains is the mini-batch size.
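A mini-batch update loop can be sketched on a one-parameter toy problem, fitting y = w x to noise-free data generated with w = 2. The data, learning rate, and batch size are illustrative assumptions and not the settings used for the network in this work.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled sub-samples of `data` with at most `batch_size` items."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

def train_epoch(w, data, batch_size, learning_rate, rng):
    """One epoch: one weight update per mini-batch on a squared-error loss."""
    for batch in minibatches(data, batch_size, rng):
        gradient = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= learning_rate * gradient
    return w

rng = random.Random(0)
data = [(0.1 * i, 0.2 * i) for i in range(1, 11)]   # y = 2 x
n_batches = len(list(minibatches(data, 4, rng)))    # batches of 4, 4 and 2

w = 0.0
for _ in range(200):
    w = train_epoch(w, data, batch_size=4, learning_rate=0.1, rng=rng)
print(n_batches, w)
```

Setting the batch size equal to the size of the data set recovers batch gradient descent, with a single update per epoch.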

The classical gradient descent uses a fixed learning rate for the entire process. Since this is not optimal, other types of optimisation algorithms that can adjust the learning rate, such as Adaptive Moment Estimation (ADAM, Kingma & Ba 2014) may be used instead.

Neural network settings such as the number of hidden layers, neurons, or epochs, the learning rate, the optimiser, the activation function for each layer, and the size of the mini-batch are referred to as hyperparameters. The hyperparameters affect the efficiency and performance of the neural network and, like the model parameters, need to be optimised to reach the best possible network performance. While the process of training a neural network adjusts the model parameters, the hyperparameters must usually be manually fine-tuned for each science case and data set in question (LeCun et al. 1998; Bengio 2012; You et al. 2017; van Rijn & Hutter 2018; Weerts et al. 2020).

3.2 Our neural network

We designed a neural network to estimate a set of target values, along with their uncertainties. For the input features, x, of each data point, our neural network aims to approximate a distribution for each of three target values, y^sim, which correspond to the three dust properties M_dust, T_dust, and dust species. The neural network implements this approximation by maximising the log-likelihood of the target values under the assumption that the deviations follow a normal distribution, by approximating the mean (m_k) and the standard deviation (σ_k), whose square is the expected squared difference between y^pred and y^sim, as follows:

$$\log \mathcal{L} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\left[\frac{\left(y^{\rm sim}_{n,k} - m_k(x_n)\right)^{2}}{2\,\sigma_k(x_n)^{2}} + \log \sigma_k(x_n)\right] + \mathrm{const}, \tag{4}$$

where N is the number of data points in the data set, while K represents the number of target values. Therefore, each target value is estimated by a mean m_k (hereafter y^pred_k) and a standard deviation σ_k (hereafter σ^pred_k) that represents the estimated uncertainty of y^pred_k.
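Maximising a Gaussian log-likelihood is equivalent to minimising its negative, which is the form usually implemented as a loss. The sketch below writes that loss in plain Python under the stated normality assumption. It averages over the N data points, which only rescales the objective, and the function name and toy numbers are illustrative rather than taken from the authors' code.

```python
import math

def gaussian_nll(y_sim, means, sigmas):
    """Mean Gaussian negative log-likelihood over N data points and K targets.

    Each argument is a list of N rows with K entries: simulated targets,
    predicted means, and predicted standard deviations.
    """
    total = 0.0
    for y_row, m_row, s_row in zip(y_sim, means, sigmas):
        for y, m, s in zip(y_row, m_row, s_row):
            total += 0.5 * math.log(2.0 * math.pi * s * s)
            total += (y - m) ** 2 / (2.0 * s * s)
    return total / len(y_sim)

# A wrong prediction is penalised more heavily when its claimed
# uncertainty is too small.
honest = gaussian_nll([[1.0]], [[0.0]], [[1.0]])
overconfident = gaussian_nll([[1.0]], [[0.0]], [[0.1]])
print(honest, overconfident)
```

The log σ term stops the network from inflating every uncertainty, while the squared-residual term penalises uncertainties that are too small for the actual error.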

3.3 Hyperparameter tuning

To find the optimal set of hyperparameters for our neural network, we first explored combinations of 3−12 convolutional and fully connected layers. Each layer can have either four, 16, 32, 64, 128, 256, or 512 neurons. We used the standard Rectified Linear Units (ReLU; Maas et al. 2013) and Parametric Rectified Linear Units (PReLU; He et al. 2015) as non-linear activation functions between the input and the hidden layers. For the output layer, we used a linear activation function to predict the mean of the target values and an exponential linear unit (ELU, Clevert et al. 2015) as activation function to predict the standard deviations of the target values. Using ELU as the activation function ensures that the estimated standard deviations are positive.
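For reference, the ELU can be written out as below. With its usual parameter α = 1, a plain ELU is bounded below by −1 rather than by 0, so strict positivity of a predicted standard deviation needs a shift. The text does not say how this was handled. The `positive_sigma` helper shows one common convention, ELU(x) + 1 plus a small floor, and is an assumption made here, not the authors' stated choice.

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha*(exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def positive_sigma(raw_output, floor=1e-6):
    """Hypothetical mapping of a raw network output to a positive sigma."""
    return elu(raw_output) + 1.0 + floor

for raw in (-100.0, -1.0, 0.0, 2.0):
    print(raw, elu(raw), positive_sigma(raw))
```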

We used six different learning rates of 10⁻⁶, 5 × 10⁻⁶, 10⁻⁵, 5 × 10⁻⁵, 10⁻⁴, and 10⁻³ for the ADAM optimiser (Kingma & Ba 2014) to search for a local minimum of the loss function with mini-batch sizes of 32 and 64 data points. By comparing the validation and training loss of the neural network with different sets of hyperparameters, we found that the optimal set of hyperparameters consists of eight fully connected layers (one input and seven hidden layers) with 512, 256, 128, 64, 32, 16, eight, and four neurons in the first to the eighth layer, respectively. Furthermore, ReLU activation functions are best used between the layers together with a learning rate of 10⁻⁵ for the ADAM optimiser with a mini-batch size of 64 data points. The number of epochs was set to 2000 for S1 and 1500 for S2 and S3, by which point the training and validation losses have converged.
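The search space quoted above can be written down compactly. The sketch below enumerates the learning rates, mini-batch sizes, and hidden-layer activations listed in the text and records the configuration reported as optimal. Whether the full Cartesian product was actually explored is not stated, so the grid construction and the dictionary keys are assumptions.

```python
from itertools import product

# Values quoted in the text.
learning_rates = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 1e-3]
batch_sizes = [32, 64]
activations = ["relu", "prelu"]

# Assumed: a plain Cartesian product over the three lists above.
grid = [
    {"learning_rate": lr, "batch_size": bs, "activation": act}
    for lr, bs, act in product(learning_rates, batch_sizes, activations)
]

# Configuration reported as optimal (layer widths from first to eighth layer).
selected = {
    "layer_widths": [512, 256, 128, 64, 32, 16, 8, 4],
    "learning_rate": 1e-5,
    "batch_size": 64,
    "activation": "relu",
}
```

Each entry of `grid` would be combined with a choice of layer count and widths, trained, and ranked by its validation loss.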

3.4 Missing data

Considering the sensitivity and saturation limits (see Sect. 2.2) for both MIRI and NIRCam filters, some SN model SEDs are not detectable in all filters over the entire wavelength range. For instance, particularly faint or bright SEDs (or parts of the SEDs) result in magnitudes that either exceed the sensitivity limit or remain below the saturation limit of some filters. In reality, such cases would not lead to detections (magnitude measurements) and hence may be considered as ‘missing values’. Here, for each filter, we replaced the synthesised magnitudes that fall outside the sensitivity and saturation limits with the magnitude of the sensitivity and saturation limits, respectively. This approach was inspired by the forced photometry measurement that is commonly used to study transients, for example for Pan-STARRS1⁵. In this method, when a source is detected in a filter at a specific location in the sky, photometric values are forced to be extracted in other filters. These forced photometric values are either the actual magnitudes of the source or the magnitude limits.
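The replacement of undetectable magnitudes can be sketched as a simple clamp. Since smaller magnitudes are brighter, the saturation limit is the lower bound and the sensitivity limit the upper bound. The function name and the limit values below are hypothetical and are not the JWST limits adopted in Sect. 2.2.

```python
def apply_detection_limits(mag, saturation_mag, sensitivity_mag):
    """Clamp a synthesised magnitude to the detectable range of one filter.

    Magnitudes brighter (smaller) than the saturation limit are set to the
    saturation limit; magnitudes fainter (larger) than the sensitivity limit
    are set to the sensitivity limit.
    """
    if saturation_mag > sensitivity_mag:
        raise ValueError("saturation limit must be brighter than sensitivity limit")
    return min(max(mag, saturation_mag), sensitivity_mag)


# Hypothetical limits for a single filter (not the values used in the paper).
SATURATION_MAG = 14.0
SENSITIVITY_MAG = 28.0

too_bright = apply_detection_limits(12.3, SATURATION_MAG, SENSITIVITY_MAG)
detected = apply_detection_limits(20.0, SATURATION_MAG, SENSITIVITY_MAG)
too_faint = apply_detection_limits(30.1, SATURATION_MAG, SENSITIVITY_MAG)
```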

3.5 Neural network training preparation

To train our neural network with the set of hyperparameters that are defined in Sect. 3.3, we created a ‘training − validation − test’ split from each of the data sets that are described in Sect. 2.2. Specifically, out of a total of 293 236 data points, we used 70% (193 536) for training, 15% (49 850) for validation and the remaining 15% as the test data set.
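A reproducible split of this kind can be sketched as below. The quoted counts (193 536 + 2 × 49 850) add up to the 293 236 total, so the sketch splits by count rather than by the nominal percentages. The shuffling scheme, the seed, and the function name are assumptions.

```python
import random


def split_indices(n_total, n_train, n_val, seed=0):
    """Shuffle data-point indices once and cut them into three disjoint subsets."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(293_236, 193_536, 49_850)
```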

We normalise M_dust and T_dust of all SN model SEDs as

with M_⊙, and K. Moreover, we define a conditional function in which we arbitrarily assign each dust species (e.g. carbon, and silicate) a target value:

We find that inferring the dust properties from SN model SEDs that contain no dust or only very small amounts of dust at cooler temperatures (M_dust < 5 × 10⁻⁵M_⊙ and T_dust < 800 K) using neural networks is challenging (see Sect. 5 for further explanation). Therefore, to let the neural network differentiate between these SN model SEDs and SN model SEDs that contain recognisable dust, we defined a dedicated target value for this group of ‘no-dust’ data points as .

3.6 Feature selection

The SHapley Additive exPlanations (SHAP; Lundberg & Lee 2017) framework uses an additive feature attribution method to evaluate the importance of a given input feature for the prediction of a neural network. In this framework, the Shapley values (Shapley 2016) are calculated for each input feature based on cooperative game theory (Nash 1953). In this theory, to calculate the contribution of each input feature to a model’s output, the marginal effect of feature i is averaged over all possible coalitions, which represents the effect of feature i on the model’s output. In an additive feature attribution method, for an input feature vector x and a model f, a simplified local input feature vector x′ is defined for an explanatory model ℱ. The simplified local input feature vector is a discrete binary vector, x′ ∈ {0,1}^d (where d is the number of input features), which means that each feature is either included or excluded. The explanatory model ℱ is defined as

$$\mathcal{F}(x') = \phi_0 + \sum_{i=1}^{d} \phi_i\, x'_i ,$$

where ϕ₀ is the base value of the model in the absence of any information, that is defined by the average of the model’s output, and ϕ_i is the explained effect of feature i, known as the attribution of feature i. The ϕ_i shows how much feature i changes the output of the model. The second term of the model ℱ is the average over marginal contributions of each feature, over all possible coalitions. The absolute value of ϕ_i indicates the importance of the feature i, where x′_i = 0 represents that feature i is not included in the input feature vector x. Therefore, the Shapley values are defined as:

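$$\phi_i = \sum_{S \subseteq d \setminus \{i\}} \frac{|S|!\,\left(|d| - |S| - 1\right)!}{|d|!} \left[ f_x\!\left(S \cup \{i\}\right) - f_x(S) \right] ,$$

(5)

where the summation is over all feature subsets S ⊆ d that do not contain feature i, with d here denoting the set of all input features, ∣d∣ their number, and f_x(S) the model output when only the features in S are included.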
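For a handful of features, the Shapley values can be evaluated exactly by enumerating every coalition, which makes the weighting of the marginal contributions explicit. In the sketch below, an excluded feature is replaced by a reference (baseline) value. This treatment, the toy model, and all names are assumptions chosen for illustration; this is not the DeepLIFT-based approximation used by the SHAP framework.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at input x by brute-force enumeration.

    Features outside a coalition are set to their baseline value (assumed
    convention). The cost grows as 2^d, so this only works for small d.
    """
    d = len(x)

    def evaluate(coalition):
        z = [x[j] if j in coalition else baseline[j] for j in range(d)]
        return f(z)

    phi = []
    for i in range(d):
        others = [j for j in range(d) if j != i]
        total = 0.0
        for size in range(d):
            weight = factorial(size) * factorial(d - size - 1) / factorial(d)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (evaluate(s | {i}) - evaluate(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Hypothetical model with one interaction term between features 0 and 2."""
    return 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[0] * z[2]


x = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is shared equally between the two features involved.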

To calculate the Shapley values, all coalition values for all possible feature permutations must be sampled. Since the relation between the number of features and the number of possible feature permutations is exponential, for a large set of features the number of calculations in ℱ is immense, and practically not feasible to implement. Therefore, the SHAP framework uses a fast approximation, Deep Learning Important FeaTures (DeepLIFT; Shrikumar et al. 2016, 2017), in which a linear approximation of Taylor series is used to approximate

in which the expectation values, E[x′], are calculated for all features and are used as reference values in the input feature vector when a feature is omitted during the calculations.

Since the variance of the expectation values for N data points is roughly 1/√N, using approximately 1000 data points gives an acceptable estimation for expectation values⁶. Therefore, in this work, for each of our three test scenarios S1, S2 and S3, we used a sample of 5000 data points that we randomly chose from the training data set to approximate the expectation values for all features (i.e. for ∀i; i ∈ x′). Thereafter, we computed the Shapley values for 1000 randomly selected data points from the validation data set (see Appendix B.1 for the details of the computational cost). We selected the subsamples from the validation and training data sets with a random seed that we changed for each step in the feature selection process. Therefore, we calculated the importance of each feature (i.e. filter) with index i via

(6)

where N = 1000. In each step, we removed the three filters that achieved the three lowest absolute Shapley values. Subsequently, in the next step, we trained the neural network using the reduced set of filters as the input feature’s vector of the entire training data set and repeated the procedure. Considering that in each step, we removed the three least important filters, we performed the process for a total of 11 steps. Therefore, we are left with four filters out of 37 filters at the end of the process.
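The bookkeeping of this backward elimination can be sketched as follows. The importance of a filter is taken here as the mean absolute Shapley value over the N sampled data points, which is an assumed reading of Eq. (6). The filter names and the placeholder Shapley values are hypothetical; in practice each step would retrain the network and recompute the Shapley values.

```python
def mean_abs_importance(shap_rows):
    """Mean absolute Shapley value per feature over N data points (assumed form)."""
    n_points = len(shap_rows)
    n_features = len(shap_rows[0])
    return [sum(abs(row[i]) for row in shap_rows) / n_points for i in range(n_features)]


def drop_least_important(filters, importance, n_drop=3):
    """Remove the n_drop filters with the lowest importance, keeping the order."""
    ranked = sorted(range(len(filters)), key=lambda i: importance[i])
    dropped = set(ranked[:n_drop])
    return [name for i, name in enumerate(filters) if i not in dropped]


filters = [f"filter_{i:02d}" for i in range(37)]  # hypothetical filter names
n_steps = 0
while len(filters) > 4:
    # Placeholder Shapley values for two data points; importance grows with index.
    shap_rows = [
        [(i + 1) * 0.1 * sign for i in range(len(filters))] for sign in (1.0, -1.0)
    ]
    filters = drop_least_important(filters, mean_abs_importance(shap_rows))
    n_steps += 1
```

Removing three filters per step takes 37 filters down to four in 11 steps, matching the count given above.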

4 Evaluation

In this section we describe the chosen evaluation metrics to evaluate the performance of our trained neural network. We address how we interpreted the resulting predictions for the dust species and how we treated the no-dust models in the performance evaluation. Furthermore, we define criteria to estimate the reliability of the predictions via the predicted standard deviations as the outputs of the neural network. Finally we describe the metrics for comparing the performance of the neural network in different steps of the feature selection process.

The performance evaluation of the predicted target values (M_dust, T_dust, and dust species) is applied to the test data sets and consists of three individual metrics: root-mean-square error (RMSE), bias, and 3σ outliers. For the dust temperature, the residual of data point n is defined as Δy_n = T_dust,n^pred − T_dust,n^sim. For the dust mass, due to the logarithmic distribution of M_dust in the simulated data set, we define the residual as Δy_n = log₁₀(M_dust,n^pred) − log₁₀(M_dust,n^sim). For both M_dust and T_dust the bias is defined as the mean of the residuals as

$$\mathrm{bias} = \frac{1}{N} \sum_{n=1}^{N} \Delta y_n ,$$

and the RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \Delta y_n^{2}} ,$$

where n represents each data point, and N is the number of data points in the test data set. Furthermore, for M_dust and T_dust we define the 3σ-outliers as the predictions with ∣Δy_n∣ > 3 × RMSE. Moreover, because all dust species are fed to the neural network in a numeric representation (see Sect. 3.5), numeric target values are predicted. In order to interpret these numeric target values, we define each dust species as a ‘class’. This way, we have the following classes: silicate, mixed, carbon and no-dust that we define by a conditional function as

Furthermore, to evaluate how well the neural network predicts the dust species, we used the definition of true and false positives, and true and false negatives (e.g. Fawcett 2006) to build a confusion matrix. The classification accuracy for each dust species class is defined as the fraction of correct predictions out of the total number of predictions of each class from the neural network.
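The evaluation metrics of this section can be sketched in a few lines. The residual is taken as prediction minus simulation (in log₁₀ for the dust mass), which is consistent with the sign convention used when the bias is discussed in Sect. 6 but is otherwise an assumption. The per-class accuracy follows the definition above, that is, correct predictions divided by all predictions of that class. Function names and toy values are hypothetical.

```python
import math


def residual_metrics(pred, sim, log_space=False):
    """Return (bias, RMSE, fraction of 3-sigma outliers) for one target value."""
    if log_space:
        residuals = [math.log10(p) - math.log10(s) for p, s in zip(pred, sim)]
    else:
        residuals = [p - s for p, s in zip(pred, sim)]
    n = len(residuals)
    bias = sum(residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    outlier_fraction = sum(abs(r) > 3.0 * rmse for r in residuals) / n
    return bias, rmse, outlier_fraction


def class_accuracy(pred_labels, sim_labels, label):
    """Fraction of the predictions of `label` whose simulated class is `label`."""
    matched = [s for p, s in zip(pred_labels, sim_labels) if p == label]
    if not matched:
        return float("nan")
    return sum(s == label for s in matched) / len(matched)


# Toy temperatures in K and toy masses in solar masses (hypothetical values).
t_bias, t_rmse, t_out = residual_metrics([510.0, 290.0, 805.0], [500.0, 300.0, 800.0])
m_bias, m_rmse, m_out = residual_metrics([1e-3, 1e-2], [1e-3, 1e-3], log_space=True)

pred_labels = ["carbon", "carbon", "silicate", "mixed"]
sim_labels = ["carbon", "mixed", "silicate", "mixed"]
carbon_accuracy = class_accuracy(pred_labels, sim_labels, "carbon")
```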

Moreover, we investigated whether the predicted uncertainties can be used to filter out uncertain predictions reliably. For this, we assumed that errors in the predicted quantities are approximately normally distributed with mean and variance as predicted by our model. Then, given a chosen confidence level, the central confidence interval of the predicted quantity is

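$$y^{\mathrm{pred}}_{k,n} - a_1\,\sigma^{\mathrm{pred}}_{k,n} \;\le\; y_{k,n} \;\le\; y^{\mathrm{pred}}_{k,n} + a_1\,\sigma^{\mathrm{pred}}_{k,n} ,$$

where y^pred_k,n and σ^pred_k,n are the predicted mean and standard deviation of the kth target value for the nth datapoint and y_k,n is the unknown true value. The factor a1 is a parameter that depends on the chosen confidence level, where values a1 = 1, 2, 3 give rise to the 68, 95, and 99.7% confidence levels, respectively.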

With this, we define a threshold for the acceptable relative error, a2, and accept a predicted mean value as (likely) accurate if the width of the confidence interval is small compared to the predicted mean. For M_dust and T_dust this yields the criterion

(7)

and for dust species, we use

(8)

In the following, we use a2 = 0.2 and a1 = 1. If a prediction satisfies Eqs. (7) and (8), we say that it has a reliable standard deviation. To compare the performance of the neural network in each step of the feature selection process, we used two values from the neural network output: (i) the values that are reached by the loss function (i.e. Eq. (4)) for the training and validation data sets at the end of the training process, and (ii) the ratio of the number of predictions that have a reliable standard deviation to the total number of predictions of the test data set (hereafter the fraction of reliable predictions).
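The relation between a1 and the confidence level follows directly from the normal distribution and can be checked numerically. The reliability test in the sketch below is only an assumed stand-in, since the exact forms of Eqs. (7) and (8) are not reproduced here: it accepts a prediction when the half-width of the confidence interval, a1 σ, is smaller than a2 times the absolute predicted mean. The function names are hypothetical.

```python
import math


def coverage(a1):
    """Probability that a normal variable lies within a1 standard deviations."""
    return math.erf(a1 / math.sqrt(2.0))


def is_reliable(mean, sigma, a1=1.0, a2=0.2):
    """Assumed relative-width criterion: a1 * sigma < a2 * |mean|."""
    return a1 * sigma < a2 * abs(mean)


levels = {a1: round(100.0 * coverage(a1), 1) for a1 in (1, 2, 3)}

# Hypothetical temperature predictions in K.
narrow = is_reliable(mean=500.0, sigma=50.0)   # 50 < 0.2 * 500
broad = is_reliable(mean=500.0, sigma=150.0)   # 150 > 0.2 * 500
```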

Since we chose a fixed set of hyperparameters (see Sect. 3.3), for instance, a fixed number of epochs, the minimum loss achieved by the neural network in the training process in each step of the feature selection process can differ from the ‘absolute or true’ minimum that could be achieved, if the hyperparameters were to be re-adjusted for each step. This is independent of the chosen subset of JWST filters and happens in all scenarios. Ideally, in order to reach the absolute minimum loss possible one should re-adjust the hyperparameters for each step. However, this is a very time consuming process. Additionally, this would make the entire feature selection process dependent on the training data set as well as on the subset of the JWST filters, while not providing further relevant information for all the steps necessary to obtain the final preferred subset of JWST filters.

Fig. 2

Example set of SN model SEDs with unrecognisable dust signatures. Each symbol represents a synthesised magnitude by a JWST filter and is shown at its central wavelength. The filled circles and triangles correspond to the SN model SED 1 and 2, that contain silicate dust, while filled dash and downward triangles correspond to SN model SED 3 and 4, respectively, both contain a mix of silicate and carbon dust. The amount and the temperature of dust in the SN model SEDs one to four, are about 2 × 10⁻⁵ M_⊙, 314 K, 5 × 10⁻⁵ M_⊙, 329 K, 10⁻⁵ M_⊙, 620 K, 10⁻⁵ M_⊙, 617 K respectively.

5 Caveats

Typically, very low amounts of dust (less than about 10⁻⁵ M_⊙) are not easily observable in SNe, since the thermal dust emission is rather weak at the expected wavelengths. This means that in some of our SN model SEDs that contain such low amounts of dust, the thermal dust emission in the simulated SEDs may either not be clearly discernible from the emission of the SN or generally remains below the detection capabilities of JWST. Such SN model SEDs that exhibit barely noticeable or no dust signatures may therefore also remain largely unrecognised by our neural network.

In what follows, we trained the neural network on the synthesised photometric data set for S1 to identify the SN model SEDs in this data set that have the lowest M_dust and T_dust that still can be recognised by the neural network. We find that for SN model SEDs with M_dust < 5 × 10⁻⁵ M_⊙ and T_dust < 800 K the predicted dust properties have very large uncertainties (i.e. ), causing so-called catastrophic outliers. Consequently, we trained the neural network again, but this time to label such SN model SEDs as ‘no-dust’ data points (see Sect. 3.5), similar to the SN model SEDs that indeed contain no dust. Figure 2 presents an example set of such no-dust SN model SEDs with M_dust and T_dust below the aforementioned thresholds. Because the dust properties predicted for the no-dust SN model SEDs, including their uncertainties, are highly unreliable, we did not include these models in subsequent performance evaluations of the dust properties.

6 Results

We investigated whether a neural network can be used as an effective tool to determine different properties of dust that formed in and around CCSNe from their spectral energy distributions. Since the number of observed SNe is too small to be used for such an endeavour, we simulated a total of 293 236 SN SEDs (referred to as SN model SEDs), each with different dust properties. Then, we convolved each SN model SED with the entire suite of JWST NIRCam + MIRI bandpass filters (see details in Sect. 2) to synthesise a photometric data set that is suitable for machine learning purposes.

For a step by step analysis we considered three different scenarios, which are described in more detail in Sect. 2.2. In short, for the first scenario, S1, all SN model SEDs are placed at the same, low redshift, z = 0.0001. In the second scenario, S2, we uniformly distributed the SN model SEDs within the redshift range 0.0001−0.015. In the third scenario, S3, we used the data set of S2 and added random noise that corresponds to a photometric uncertainty of 0.1 mag (see the details in Sect. 2.1.2). Comparing the outcome of these scenarios allowed us to examine how strongly the performances of the neural network and the feature importance change for our simulated data that are equipped with properties of real observations.

In our approach, we trained our neural network to predict the distribution of dust quantities given the SN model SEDs, p(y^sim∣x) ≈ N(y^sim; y^pred,σ^pred). To evaluate how well our estimated uncertainties align with the prediction errors, we analysed the distribution of the normalised prediction errors (y^pred − y^sim)/σ^pred. Under perfect neural network modelling circumstances, the distribution of these normalised values must follow a standard normal distribution. Figure 3 shows histograms of the normalised prediction errors for M_dust and T_dust of a test data set predicted by the trained neural network with the entire set of JWST filters, in S3, excluding the predictions that the neural network classifies as no-dust. By fitting a normal probability distribution function to the normalised prediction errors, we infer a mean of 0.04 and a standard deviation of 0.94 for the M_dust distribution. For T_dust, the inferred mean and standard deviation are −0.001 and 0.76, respectively. Therefore, the inferred standard deviations corresponding to M_dust and T_dust are 6 and 24% lower than for a standard normal distribution. This might indicate that the predicted uncertainties, σ^pred, are overestimating the prediction errors.
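The calibration check described above amounts to computing the normalised prediction errors and comparing their mean and standard deviation with 0 and 1. The sketch below uses the sample moments rather than a fitted normal distribution, which is a simplifying assumption, and the toy values are hypothetical.

```python
import statistics


def normalised_errors(pred, sim, sigma):
    """Prediction errors in units of the predicted standard deviation."""
    return [(p - s) / e for p, s, e in zip(pred, sim, sigma)]


# Toy predictions (hypothetical values): errors of 0.5 with predicted sigma of 1.
pred = [1.0, 2.0, 3.0, 4.0]
sim = [1.5, 1.5, 3.5, 3.5]
sigma = [1.0, 1.0, 1.0, 1.0]

z = normalised_errors(pred, sim, sigma)
z_mean = statistics.fmean(z)
z_std = statistics.pstdev(z)

# A spread below 1 means the predicted uncertainties are, on average,
# larger than the actual errors (over-estimated uncertainties).
overestimated = z_std < 1.0
```

With the values quoted above, a spread of 0.94 and 0.76 corresponds to standard deviations that are 6 and 24% below unity, respectively.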

For each scenario, S1, S2, and S3, we discuss four cases of a performance evaluation. For case-1 and case-2 we evaluated the performances of our neural network that is trained on data sets that consist of preferred subsets of JWST filters (see Sect. 7 for further discussions on the selection of preferred subsets). For case-3 and case-4 we evaluated the performances of our neural network that is trained with data sets that are constructed with a minimum subset of JWST filters (see the definition in Sect. 4), with which the different dust quantities are predicted with an acceptable level of accuracy. The latter means that the fraction of reliable predictions, out of the entire test data set, is ≳5%. Furthermore, for case-1 and case-3 we apply the evaluation metrics to the entire test data set. For case-2 and case-4, we apply the metrics only to the subsample of the test data set that satisfies the criteria for being reliable predictions as defined in Sect. 4.

Tables 2 and 3 summarise the outcome of the case by case performance evaluations of our neural network to predict M_dust, T_dust and to classify the dust species for all three scenarios S1, S2 and S3. Out of all scenarios and all cases we find that in S1 and for case-2, the RMSE of both M_dust and T_dust is the smallest and the fraction of reliable predictions is maximal. For case-2, the RMSE of M_dust increases from ~0.05 dex in S1, to ~0.1 dex in S2 and to ~0.11 dex in S3.

However, for T_dust the RMSE increases from about 14 K in S1 to only ~18 K in S2, and then to ~30 K in S3. From S1 to S3, in case-2, the fraction of 3σ outliers for the M_dust target values also increases.

The bias of the T_dust predictions for most of the scenarios for case-3 and case-4 is negative. This indicates that the neural network underestimates the T_dust target values (i.e. T_dust^pred < T_dust^sim). For case-1 and case-2 the bias is positive for T_dust in all scenarios. This indicates that the neural network overestimates the T_dust target values. For instance, in case-3 for S1, the bias of 0.013 (dex) for M_dust means that the average of the M_dust estimations over the whole test data set is about 10^0.013 ≈ 1.03 times the simulated M_dust. For T_dust, the average of the T_dust estimations over the whole test data set is about 1 K more than the simulated T_dust values.
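Because the dust-mass residuals are expressed in dex, it can help to convert them to multiplicative factors. The short sketch below does this for the bias quoted above and for the largest RMSE of ~0.55 dex reported in Sect. 7.2. The function name is hypothetical.

```python
def dex_to_factor(dex):
    """Convert a logarithmic offset in dex to a multiplicative factor."""
    return 10.0 ** dex


bias_factor = dex_to_factor(0.013)  # bias of case-3 in S1
rmse_factor = dex_to_factor(0.55)   # largest M_dust RMSE quoted in Sect. 7.2
```

A bias of 0.013 dex therefore corresponds to an average overestimate of about 3%, while an RMSE of 0.55 dex corresponds to a typical scatter of about a factor of 3.5.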

As shown in Table 3, the highest classification accuracy for dust species is achieved for S2 for case-2. For this, we find a classification accuracy of 97, 98, and 100% for carbon, mixed, and silicate dust, respectively. Comparing the classification accuracy for each dust species, we find that for all scenarios and all cases, silicate dust is predicted with the highest accuracy. Carbon dust is predicted least accurately in all scenarios and cases, except in S3 for case-4. There, the SN model SEDs that are labelled as mixed dust are predicted with the lowest accuracy (57%). In case-4 and S3, 42% of the mixed dust species are predicted as carbon dust.

In Figs. 4–6, the performance of the neural network is shown for case-1 and case-2 for all scenarios. Overall, the performance of the neural network for case-2 is better than for case-1. As illustrated in the top panels of Figs. 4 and 5, the dispersion of the predictions around the diagonal line that represents predicted values equal to simulated values, increases from S1 to S2 for both target values M_dust and T_dust. Moreover, as summarised in Table 3 the classification accuracy decreases for all dust species from S1 to S3 in case-1.

As shown in Fig. 4, for S1 the reliable predictions for M_dust and T_dust range between about 6 × 10⁻⁵−10⁻¹ M_⊙ and 100 − 1400 K, respectively. However, Fig. 5 shows that in S2, the reliable predictions only range between about 10⁻⁴−5 × 10⁻² M_⊙, and 250−1200 K for M_dust and T_dust, respectively. Figure 6 shows that in S3, the reliable predictions for M_dust are within 5 × 10⁻⁴−10⁻¹ M_⊙ and 250−1000 K for T_dust. This means that the dust mass and temperature range of the reliable predictions for all cases shrinks from S1 to S2 to S3, and thus with the increased complexity of the scenarios.

Figure 7 presents the performance of the neural network with the subsets of the JWST filters that are selected in each step of the feature selection process. The bottom panel compares the training losses obtained for the last epoch at all feature selection steps for all three scenarios. The validation losses for S3 are also included in Fig. 7. It is evident that for S3, the validation loss closely follows that of the training loss. We find the same for the other two scenarios, although the loss values vary more drastically from step to step. The absolute local minimum of both the validation and training loss appears to be reached in step zero of the feature selection process for S3, while for both S1 and S2 the absolute local minimum is reached in step five. However, we find for S3 that both the training and validation loss slowly increase from step zero to eight by about 5%.

The first three panels of Fig. 7 show the performance evaluation of the neural network for the test data sets. It is evident that the RMSE for T_dust varies only minimally around a mean value of about 58 ± 9 K, after which it increases to about 240 K in the last step. The RMSE of M_dust remains similarly constant over the first eight steps, except for step two, and increases from step eight to eleven by about 0.45 dex. The classification accuracy for carbon dust and the mixed composition also changes only minimally over the first eight steps, but appears to decrease from step eight to eleven from about 70 to 50%. For silicate dust the classification accuracy remains nearly 100% over all steps.

Fig. 3

Comparison of the distribution of normalised prediction errors to a standard normal distribution. The histograms represent the distributions of (y^pred − y^sim)/σ^pred for M_dust (top panel), and T_dust (bottom panel), for a test data set predicted by the trained neural network with the entire suite of JWST filters, in S3. The dotted curves represent the standard normal distributions (i.e. N(0,1)). The solid curves are the normal distributions fitted to each of the histograms with µ = 0.04 and σ = 0.94 for M_dust, and µ = −0.001 and σ = 0.76 for T_dust.

7 Discussion

The performance evaluation of our trained neural network, which is designed to predict dust properties such as M_dust, T_dust and different dust species, demonstrates that neural networks can be a powerful tool, if a sufficiently large data set is at hand. One advantage of using such a method is that it is possible to obtain a good estimate of the prediction uncertainties for each dust property under consideration. For other common methods, such as fitting a simple modified black body function or a combination thereof, uncertainties of the fitted dust mass or dust temperature are often not obtained (e.g. Gall et al. 2011b, and references therein). Furthermore, because such fitting methods require assumptions about the dust composition to be made a priori, the parameter range can be large and is often not explored in full detail. The reasons for this may include insufficient data quality, but also time and computational limitations. These issues also apply to more sophisticated dust models such as MOCASSIN, when used to fit observational data to obtain the amount and temperature of dust in and around SNe (see e.g. Wesson et al. 2015).

Table 2

Comparison of neural network performance for estimating M_dust and T_dust in different scenarios for four different cases.

Table 3

Comparison of neural network performance for classifying dust species, and the fraction of predictions of the test data set that have a reliable standard deviation relative to all the predictions from the test data set, in different scenarios for four different cases.

7.1 Limitations of the model dataset

For the purpose of running a large number of models in a reasonable amount of time, we made some simplifying restrictions to the parameter space that our models cover. Some of these simplifications may have a significant effect on the predicted dust quantities from the SEDs. Our models used a single grain size only, selected from a uniform distribution in log-space. In the interstellar medium, the grain size distribution may be approximated by a Mathis et al. (1977, hereafter MRN) distribution, in which the number density of grains of radius a is proportional to a^−3.5. As this power-law distribution arises from collision and fragmentation processes over a long timescale, it is unlikely to be applicable to the dust grains found in and around CCSNe. A single grain size may be a more reasonable approximation. Observational studies tend to find evidence for large grains (e.g. Gall et al. 2014; Wesson et al. 2015; Owen & Barlow 2015; Bak Nielsen et al. 2018). If a population of grains grows by accretion, then according to the standard grain growth equation, the increase in radius with time does not depend on the initial radius of the grain. A size distribution will therefore become narrower as accretion proceeds, unless fragmentation is also taking place.

In Fig. 8, to illustrate the effect of using a single grain size as opposed to a distribution, we show the example SEDs for 20 models characterised by a single grain size, evenly spaced logarithmically between 0.005 and 0.5 µm, together with the example SED for an MRN dust distribution. It can be seen that the SED for the full grain size distribution is almost identical to the SED for a single grain size of 0.15 µm.

The calculation of a spectral energy distribution from thermal dust emission fundamentally depends on the choice of optical constants. Different literature sets of optical constants may differ significantly from each other, and the dust actually present in and around a SN may not be well represented by the materials from which optical constants have been determined. The choice of optical constants thus introduces a systematic uncertainty into the dust mass and temperature estimates.

In future work, we plan to investigate this more thoroughly, by using the neural network to classify SEDs calculated using different optical constants to those on which the network was trained. However, in this work, we used only two species of dust, and only one set of optical constants for each species. Dust in SNe is widely assumed to be either carbonaceous, silicaceous, or a mixture, and our SEDs are calculated using widely-used optical constants for these species. However, different choices of optical constants can yield significantly different SEDs. To illustrate this, we show in Fig. 9 the variations in example SEDs for one example model. The model has a 50:50 silicate:carbon grain mixture, and Fig. 9 shows the example SEDs for a single grain size of 0.1 µm, using all possible combinations of optical constants from four sets of carbon data (Hanner 1988, hereafter H88, and the ACAR, ACH2 and BE⁷ samples from Zubko et al. 1996, hereafter Z96) and four sets of silicate constants (Draine & Lee 1984; Laor & Draine 1993, hereafter DL84 and LD93, respectively, and oxygen-deficient and oxygen-rich constants from Ossenkopf et al. 1992, hereafter O92).

It is clear from Fig. 9 that different choices of optical constants can result in significant differences in some wavelength regions of some SEDs. The 1−10 µm region appears to be particularly affected. However, the differences are largest for relatively small grains and are negligible for grains as large as 5 µm. As mentioned, many observational studies of dust in young and old SNRs have found evidence for generally large grains, thus tending to reduce the uncertainty due to the choice of optical constants. Additional comparisons for more grain sizes and for pure carbon and pure silicate compositions are given in Appendix D.

Fig. 4

Performance of the neural network with the preferred subset of JWST filters, for S1. The M_dust (left column), and T_dust estimates (middle column), and dust species classification (right column), are shown for all the predictions of the test data set (top panel), and the reliable predictions of the test data set (bottom panel). The dust species classifications are shown in the format of confusion matrices that represent the simulated dust species against the predicted dust species. The colour bars in the left and middle diagrams indicate the number of predictions, ranging from 1 (black) to 50 (yellow). The dashed lines mark where the predicted and simulated values of M_dust and T_dust are equal.

Fig. 5

Performance of the neural network with the preferred subset of JWST filters, for S2. The definition of the panels, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

Fig. 6

Performance of the neural network with the preferred subset of JWST filters, for S3. The estimations are shown for the reliable predictions of the test data set with S/N = 3 (top panel), and the test data set with S/N = 20 (bottom panel). The definition of the columns, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

7.2 Performance evaluation

Our performance evaluation demonstrates that for all scenarios and cases (see Table 2), the obtained prediction error, RMSE, for M_dust is smaller than ~ 0.55 (dex), and is smaller than ~ 78 K for T_dust. These RMSE are obtained for case-3 and S3 and are the maximum RMSE values out of all scenarios and cases. This is because in case-3, the evaluation metrics are applied to the full test data sets with the minimum subsets of JWST filters of each scenario S1, S2 and S3. Moreover, for the evaluation of case-3 the test data sets are used without prior cut and hence contain predictions with larger uncertainties. Additionally, S3 is the most complex of all scenarios. However, compared to other works in the literature with inferred amounts and temperatures of dust from observed SNe, we find that even the worst performance in this work constitutes a very good performance. For example, we can compare to other works that estimate the amount of dust with Spitzer Space Telescope observations up to about 25 µm for SNe such as SN 2004et (Kotak et al. 2009) and SN 1987A (Ercolano et al. 2007). For SN 2004et, the estimated ranges for dust mass and dust temperature at 300, 464 and 795 days after the explosion are about 0.37 dex and 500 K, 0.26 dex and 250 K, and 0.38 dex and 80 K. For SN 1987A, the amount of carbon dust at day 615 has been estimated with an uncertainty of 0.81 dex.

We now turn to the performance evaluations of the most reliable predictions, which are drawn from the case-1 and case-3 data sets by keeping only the predictions that have a reliable standard deviation, and which are evaluated as case-2 and case-4. Comparing the RMSE for M_dust and T_dust between case-1 and case-2 (the cases with the data sets that contain the preferred subsets of JWST filters) across the three scenarios, S1, S2 and S3, shows that the prediction errors are reduced by up to a factor of about 2–3 in case-2, where the predictions that do not have a reliable standard deviation are excluded. Since this is an expected, but not guaranteed, consequence of removing the ‘bad’ predictions that do not fulfil the reliability criterion, the same is expected for the cases with the data sets that contain the minimum subsets of filters (case-3 and case-4). Our evaluations show that the effect of excluding the unreliable predictions is even stronger for the M_dust estimations than for T_dust: the RMSE (in dex) of the dust mass is smaller by about a factor of 4–5 in case-2 and case-4 compared to case-1 and case-3, while for T_dust the decrease is only about a factor of 2. The classification accuracy for the different dust species shows the same behaviour: it is higher for nearly all species and scenarios for case-2 and case-4 than for case-1 and case-3. Particularly for silicate dust, the classification accuracy is close to or at 100%. However, in case-4 and S3, there is a bias in predicting the mixed dust species towards the carbon dust species. The evaluation based on the reliability definition demonstrates that the dust mass and temperature predictions that pass the criterion can indeed be considered reliable predictions.

On the other hand, as shown in Table 3, the number of predictions that satisfy the criterion in case-2 and case-4 is smaller than the number of predictions from the entire data set in case-1 and case-3. Since there are 3σ outliers (as defined in Sect. 4) also for case-2 and case-4, the fractions of the best reliable predictions for S1, S2 and S3 in case-2 and case-4 are smaller than the corresponding fractions of reliable predictions. For instance, in case-2 for S1 the fraction of the best reliable predictions is still about 59%, while in case-4 and S3 it shrinks to only about 5.8%.

Comparing the number of 3σ outliers between M_dust and T_dust, we find that for nearly all combinations of cases and scenarios, the M_dust evaluations result in a larger number of 3σ outliers than the T_dust evaluations. This is because the dispersion of the M_dust residuals is larger than that of the T_dust residuals.
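The two metrics used in this comparison can be sketched as follows. This is a minimal illustration, not the evaluation code of this work: the outlier rule (a residual more than three standard deviations away from the mean residual) is an assumption standing in for the definition in Sect. 4, and the dust-mass values are hypothetical.

```python
import math
import statistics


def rmse(predicted, true):
    """Root-mean-square error between predictions and ground truth."""
    residuals = [p - t for p, t in zip(predicted, true)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def count_3sigma_outliers(predicted, true):
    """Count residuals more than 3 standard deviations from the mean residual.

    Assumed rule for illustration; the paper's own definition is in Sect. 4.
    """
    residuals = [p - t for p, t in zip(predicted, true)]
    mean = statistics.fmean(residuals)
    sigma = statistics.pstdev(residuals)
    return sum(1 for r in residuals if abs(r - mean) > 3.0 * sigma)


# Hypothetical log10(M_dust / M_sun) values, so the RMSE is in dex.
log_m_true = [-4.0, -3.5, -3.0, -2.5, -2.0]
log_m_pred = [-3.9, -3.7, -3.0, -2.4, -2.2]
rmse_dex = rmse(log_m_pred, log_m_true)
n_outliers = count_3sigma_outliers(log_m_pred, log_m_true)
```

Because the dust mass is compared in logarithmic space, an RMSE of 0.3 dex corresponds to a typical error of about a factor of two in mass.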

Fig. 7

Performance of our neural network with each filter set obtained at each step of the feature selection process. Bottom panel: loss values that are achieved by the training and validation data sets at the end of each training process of the neural network in S1 (downward triangles), S2 (circles), and S3 (X symbols). The empty symbols mark the training loss for each step. The filled symbols mark the validation loss for each step. The grey shaded region represents the area between the training and validation loss in S3. The single panels show the RMSE of T_dust (K), and M_dust (M_⊙), and the classification accuracy (%) for predicting the dust species for the test data sets in S3, from bottom to top. The classification accuracies for predicting carbon dust, silicate dust, and a mixture of the two are shown with circles, triangles, and dashes, respectively.

Fig. 8

Comparison of an example model SED with single grain sizes varying between 0.005 and 0.5 µm. The colour coding is described in the legend of this figure. The same SED, but for a grain size distribution with 0.005 µm < a < 0.5 µm, n(a) ∝ a^−3.5, is shown as a black solid line. The fixed parameters of the model are M_d = 1.4 × 10⁻⁴ M_⊙, R_out = 6.24 × 10¹⁶ cm, T_* = 11 905 K, and L_* = 1.56 × 10⁸ L_⊙.

Fig. 9

Effect of choice of optical constants demonstrated on an example model SED with mixed dust composition. Sixteen SEDs are plotted, using all combinations of four sets of carbon and four sets of silicate optical constants as defined in Sect. 7.1. The fixed parameters of the model are M_d = 2.7 × 10⁻³M_⊙, R_out = 1.99 × 10¹⁷ cm, T_* = 12 298 K, and L_* = 2.55 × 10⁸L_⊙.

7.3 Filter selection

Since observing SNe with all the JWST filters at the same time is practically not feasible, we are interested in finding the smallest set of filters with which an acceptable performance can be achieved. To do so, we utilised the feature selection process described in Sect. 3.6. From this we obtain two sets of filters for each scenario, one preferred set and one minimum set. The preferred filter set is chosen based on the absolute minimum reached by both the training and the validation loss, while the minimum filter set is chosen based on the criterion that the fraction of reliable predictions among all predictions is larger than 5%. It turns out that for S1 and S2 the preferred filter set is reached early in the filter selection process, at step five, and thus still contains a large number of filters (22 filters). The minimum filter set is obtained in step eight or ten and thus contains fewer filters, between seven and thirteen. Comparing the performance of the two filter sets, for example case-1 with case-3 or case-2 with case-4 in Table 2, shows that the performance of the neural network is, as expected, overall better with the preferred set of filters, but only minimally decreased with the minimum filter set. Hence, as demonstrated in Table 2, accurate predictions of T_dust and M_dust can be achieved with the minimum set of filters.
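The two selection rules described above can be illustrated with a short sketch. It assumes that each step of the feature selection process is summarised by its number of filters, its final training and validation loss, and its fraction of reliable predictions. The `SelectionStep` record, the use of the summed loss as a proxy for "the minimum reached by both losses", and all numbers are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class SelectionStep:
    step: int
    n_filters: int
    train_loss: float
    val_loss: float
    reliable_fraction: float  # reliable predictions / all predictions


def preferred_step(steps):
    """Step with the lowest combined training and validation loss (assumed proxy)."""
    return min(steps, key=lambda s: s.train_loss + s.val_loss)


def minimum_step(steps, threshold=0.05):
    """Smallest filter set whose reliable fraction still exceeds the threshold."""
    eligible = [s for s in steps if s.reliable_fraction > threshold]
    return min(eligible, key=lambda s: s.n_filters)


# Hypothetical summary of a feature selection run (not values from this work).
steps = [
    SelectionStep(0, 37, 0.80, 0.85, 0.60),
    SelectionStep(5, 22, 0.70, 0.74, 0.59),
    SelectionStep(8, 13, 0.90, 0.95, 0.20),
    SelectionStep(10, 7, 1.10, 1.20, 0.06),
    SelectionStep(12, 4, 1.60, 1.75, 0.02),
]
preferred = preferred_step(steps)
minimum = minimum_step(steps)
```

A purely automatic rule of this kind would not reproduce the choice made for S3, where the step with the absolute loss minimum is deliberately set aside; such cases still require a judgement call.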

For scenario S3, the preferred and the minimum subsets of filters are chosen from steps eight and nine and are thus very close to each other. It is important to note that in this case the preferred set is chosen at step eight instead of step zero, where the loss reached its absolute minimum, because we do not consider step zero as ‘preferred’. Since both the training and validation losses remain rather stable until step eight, as pointed out in Sect. 6, step eight can be considered as preferred.

As illustrated in Fig. 7, in S3 compared to S2 there are insignificant changes of loss values in each step of the feature selection process up to step nine. This stability of the performance of the neural network in S3, regardless of the number of filters that are used as the input features can be due to the training of the neural network with additional noise. This is because the training of a neural network with additional noise can be equivalent to a regularisation (Bishop 1995), which helps the neural network to react less to the variation of input features. Therefore, in S3 compared to S2, the training and validation losses that are achieved by the neural network with smaller sets of filters than the entire filter set, do not significantly change in each step of the feature selection process up to step nine. Table 4 summarises the minimum and preferred subsets of JWST filters obtained from the feature selection process in all three scenarios.

Figure 10 visualises the resulting Shapley values obtained for each step in scenario S3. Figures C.1 and C.2 show the same for S1 and S2, respectively. It is interesting to note that for all three scenarios, none of the narrow-band JWST filters are amongst the minimum subsets of the JWST filters. However, for S1 and S2 two such narrow-band filters are included in the preferred filter set, albeit with small Shapley values and hence marginal importance.

This implies that real observations of SNe with such JWST narrow-band filters would have the least impact on estimating dust properties with our neural network. As shown in Fig. 10, and expected, the MIRI filters that cover the longer wavelength region are crucially important to estimate the dust properties while the shorter wavelength NIRCam filters seem not to play a significant role.

One of the most pressing questions is, of course, whether it is technically feasible to construct an observing run with the minimum subset of filters. The NIRCam instrument uses a dichroic to split the incoming radiation into two wavelength ranges, λ < 2.5 µm and λ > 2.5 µm, known as the short and long wavelength channels (Horner & Rieke 2004). This setup allows two images to be obtained simultaneously with two different filters, one from each channel. Since, in the minimum subset for S3, two of the selected NIRCam filters, F070W and F140M, are in the short wavelength channel and two, F356W and F480M, are in the long wavelength channel, two separate runs are required to observe a SN with all four NIRCam filters. For MIRI, observations can only be conducted with one filter at a time. The total observing time needed for all selected MIRI filters of the preferred subset may in the end depend on the brightness of the SN, the desired or best possible signal-to-noise ratio, or the phase of the SN.
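The counting argument for the NIRCam runs can be written down as a small sketch. It assumes that the three digits in a NIRCam filter name give the approximate wavelength in units of 0.01 µm (so that F356W lies near 3.56 µm) and uses the 2.5 µm split quoted above. The helper names are hypothetical.

```python
import re

DICHROIC_SPLIT_UM = 2.5  # split between the two channels as quoted in the text


def approx_wavelength_um(filter_name):
    """Approximate wavelength from a NIRCam filter name, e.g. F356W -> 3.56 µm."""
    match = re.fullmatch(r"F(\d{3})[WMN]2?", filter_name)
    if match is None:
        raise ValueError(f"unrecognised filter name: {filter_name}")
    return int(match.group(1)) / 100.0


def nircam_runs(filters):
    """Split filters by channel and count the runs needed when one
    short-channel and one long-channel filter can be observed together."""
    short = [f for f in filters if approx_wavelength_um(f) < DICHROIC_SPLIT_UM]
    long_ = [f for f in filters if approx_wavelength_um(f) >= DICHROIC_SPLIT_UM]
    return max(len(short), len(long_)), short, long_


# NIRCam filters of the minimum subset for S3.
n_runs, short_channel, long_channel = nircam_runs(["F070W", "F140M", "F356W", "F480M"])
```

With two filters in each channel the number of runs is two, as stated above. An unbalanced subset, for example three short-channel filters and one long-channel filter, would need three runs under the same assumption.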

Table 4

Preferred and minimum subsets of JWST filters obtained from the feature selection process and used to estimate M_dust, T_dust, and dust species.

Table 5

Comparison of neural network performance for estimating M_dust and T_dust in scenario 3 for case-2 with the same definition in Table 2.

7.4 Additional testing of the performance of the neural network

Our simulated data set is simplified by various assumptions such as a uniform S/N = 10. In reality, the achieved S/N depends on different aspects, such as the brightness of the object in a given filter band, the distance to the object, or the exposure time and integration setup. The JWST Exposure Time Calculator is an ideal tool to adjust all these aspects. For bright sources, a high S/N can be achieved even with a short exposure, while for faint sources, long exposure times may be necessary to reach just a minimum significance of S/N ≈ 3. While simulating more realistic S/N values for each filter band assuming different exposure times is possible, it is computationally expensive. Hence, we decided to first test the neural network performance for a simple case, a uniform S/N = 10, which represents neither particularly good nor bad data.

However, to better understand the effect of better or worse data, we tested the performance of our neural network for scenario S3 and case-2 on two test cases with test data sets assuming S/N = 20 and S/N = 3, respectively. The results are summarised in Tables 5 and 6 and presented in Fig. 11. For the test case representing higher quality data with S/N = 20, the RMSE of M_dust and T_dust are ~0.12 dex and ~32 K. The RMSE of the M_dust and T_dust predictions for the other test case, with S/N = 3, are ~0.42 dex and ~88 K. As expected, the performance of the neural network is worse for the test case with S/N = 3 than for S/N = 10, while for the test case with S/N = 20 the performance remains similar. We note that since the neural network has been trained for S/N = 10, our predictions are somewhat over-confident for the test case with S/N = 3, while they are under-confident for S/N = 20.
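To give a feeling for what the three S/N values mean in magnitudes, the following sketch uses the standard first-order relation σ_m ≈ (2.5 / ln 10) / (S/N) and perturbs a set of magnitudes with Gaussian noise of that width. This is an assumption about how such noise could be injected, not a description of the procedure used in this work, and the magnitudes are hypothetical.

```python
import math
import random


def magnitude_uncertainty(snr):
    """First-order magnitude uncertainty for a given signal-to-noise ratio."""
    return 2.5 / math.log(10.0) / snr


def add_photometric_noise(magnitudes, snr, rng):
    """Perturb each magnitude with Gaussian noise matching the given S/N."""
    sigma = magnitude_uncertainty(snr)
    return [m + rng.gauss(0.0, sigma) for m in magnitudes]


rng = random.Random(1)
clean_mags = [18.2, 17.6, 16.9, 16.1]  # hypothetical AB magnitudes in four filters
noisy_sets = {snr: add_photometric_noise(clean_mags, snr, rng) for snr in (3, 10, 20)}
sigmas = {snr: magnitude_uncertainty(snr) for snr in (3, 10, 20)}
```

Under this relation, S/N = 10 corresponds to an uncertainty of about 0.11 mag, S/N = 20 to about 0.05 mag, and S/N = 3 to about 0.36 mag, which makes the asymmetric change in performance between the two test cases plausible.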

The final test of the usability of our neural network, which has been trained on a simulated data set exploring a wide, but not exhaustive, range of parameters, is to use true observational data. Hence, we used the spectrophotometric observations of SN 1987A taken with the Kuiper Airborne Observatory at 615, 632 and 638 days past explosion (referred to as the 615 day epoch; Wooden et al. 1993; Moseley et al. 1989), as this epoch shows a clear signature of dust formation in the ejecta. The data cover a wavelength range of 0.33−29.5 µm. Furthermore, Wesson et al. (2015) have also fitted MOCASSIN models to the same data, and their best fit results in about 1 × 10⁻³ M_⊙ of dust for a clumpy model with an 85:15 carbon:silicate ratio and temperatures of 252 ± 29 K for carbon and 316 ± 31 K for silicate dust. This is a larger dust mass than that of the best fitting models by Ercolano et al. (2007), who obtained ~2 × 10⁻⁴ M_⊙ at similar temperature, while Wooden et al. (1993) obtained about 3.1 × 10⁻⁴ M_⊙ of graphite dust at about 400 K, assuming a smooth dust distribution.

Here, we created a small test data set which consists of the SN 1987A data at 615 days replicated 500 times, each replica being assigned a different redshift chosen randomly from a limited redshift range (0.0006−0.004). This ensures that the data are within the saturation and detection limits. We applied a Gaussian smoothing operator, which enabled interpolation across the data gaps at 1.02−1.48 µm and 12.67−17.32 µm, and convolved the data with the JWST bandpass filters. We used the trained neural network of scenario S3, first including all JWST bandpass filters and second using only the preferred set of filters (S3, case-2), to predict the dust mass and temperature as well as the dust grain composition (carbon, silicates or 50:50 mix).
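The construction of such a replicated test set can be sketched as below. The sketch is deliberately simplified and rests on several assumptions: the spectrum is a toy power law rather than the SN 1987A data, the bandpass is a hypothetical top-hat between 5.0 and 6.2 µm rather than a real JWST transmission curve, only the wavelength shift with redshift is applied (the change in flux level with distance is omitted), and the Gaussian smoothing step is left out.

```python
import random


def trapezoid(y, x):
    """Trapezoidal integral of samples y over grid x."""
    return sum(0.5 * (y[i] + y[i + 1]) * (x[i + 1] - x[i]) for i in range(len(x) - 1))


def synthetic_flux(wave, flux, transmission):
    """Transmission-weighted mean flux density through one bandpass."""
    weighted = [f * t for f, t in zip(flux, transmission)]
    return trapezoid(weighted, wave) / trapezoid(transmission, wave)


def top_hat(wave, lo, hi):
    """Hypothetical top-hat transmission: 1 inside [lo, hi], 0 outside."""
    return [1.0 if lo <= w <= hi else 0.0 for w in wave]


rng = random.Random(42)
redshifts = [rng.uniform(0.0006, 0.004) for _ in range(500)]

# Toy rest-frame spectrum on a 0.33-29.5 micron grid (illustration only).
rest_wave = [0.33 + 0.01 * i for i in range(2918)]
rest_flux = [1.0 / w for w in rest_wave]

photometry = []
for z in redshifts:
    obs_wave = [w * (1.0 + z) for w in rest_wave]  # wavelength shift only
    band = top_hat(obs_wave, 5.0, 6.2)
    photometry.append(synthetic_flux(obs_wave, rest_flux, band))
```

Each entry of `photometry` plays the role of one filter measurement for one replica; repeating the band integration for every filter would give the input vector for the neural network.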

The results are shown in Figs. 12 and 13. A trend with redshift appears for all predictions in both cases. We find that with increasing redshift, the dust mass and temperature predictions increase and the predicted dust species leans more towards silicates. Using all JWST bandpass filters, we obtain dust masses that are predicted with 99.7% confidence to range between a few times 10⁻⁴−10⁻³ M_⊙ and temperatures that range between ~280−340 K, in agreement with previous estimates in the literature. The estimated dust species is carbon or a mix of carbon and silicates. When using only the preferred filter set (see Sect. 7.3), the results show that at very nearby distances the predicted dust mass is, with 99.7% confidence, not larger than 2−4 times 10⁻³ M_⊙, while at z > 0.003 the predicted dust mass ranges from 10⁻³−10⁻² M_⊙ for a predicted dust species that can be either mixed or silicates. For all predictions, the temperature range overlaps with that from the first case using all JWST filter bands.

This shows that our dust mass and temperature predictions for SN 1987A at 615 days are comparable to those in the literature and hence are reasonable for SN 1987A-like SNe. However, we note that while the dust temperature predictions fulfil the criterion, the dust mass predictions do not. Moreover, using the preferred JWST filter set results in silicates as the dominant dust species, which disagrees with what is found in the literature. A possible reason for this may be our simplified training data set and limited parameter range. Despite this, our method can be a promising tool to analyse signatures of dust in and around SNe in their SEDs. In forthcoming work, we aim to use more detailed and realistic simulations to achieve more reliable predictions of the dust mass, temperature and possibly other dust properties.

Fig. 10

Importance of JWST filters for estimating the amount, temperature and the dust species, in S3. The normalised feature importance (ϕ_i) of each NIRCam (blue) and MIRI (red) filter in each step of the feature selection process is shown by the size of the filled circles, which are scaled to three values in the legend. The preferred and minimum subsets of JWST filters are highlighted with boxes using a dash-dotted and a solid line, respectively.

Fig. 11

Performance of the neural network with the preferred subset of JWST filters, for S3. The definition of the panels, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

Table 6

Fig. 12

Estimated amount, temperature, and composition of the dust in SN 1987A at 615 days after explosion for the entire set of JWST filters. The purple dots along with black lines represent the predicted values and the predicted uncertainties by the trained neural network, respectively. The values estimated by Wesson et al. (2015, W15) and Ercolano et al. (2007, E07) are shown as red and green solid lines (left panel) and shaded areas (middle and right panels). The blue and yellow regions in the right panel highlight the x-axis labels No dust and Silicate.

Fig. 13

Estimated amount, temperature, and composition of the dust in SN 1987A at 615 days after explosion for the preferred set of JWST filters. The symbols, lines and shaded regions are defined as in Fig. 12.

Fig. 14

Distribution of SN model SEDs in T_dust in 50 K bins. The dash-dotted line represents the distribution of the entire test data set and the solid line represents the distribution of the reliable predictions from the test data set. Top panel: SN model SEDs with R_in ≲ 5 × 10¹⁶ cm. Bottom panel: SN model SEDs with R_in ≳ 5 × 10¹⁶ cm.

7.5 Implications for future observations

Figure 14 shows the histogram of the number of test data over T_dust, in S3, for the entire test data set (dash-dotted line) and for the sub-sample of test data with reliable estimated standard deviations (solid line). In the top panel, the distribution is shown for the SN model SEDs with R_in ≲ 5 × 10¹⁶ cm, while the bottom panel represents the SN model SEDs with R_in ≳ 5 × 10¹⁶ cm. This cutoff represents an approximate division of the models into those that are closest to dust signatures from newly formed dust in young SN ejecta and those that can be interpreted as signatures arising from pre-existing circumstellar dust, flash-heated by a SN explosion. Pre-existing dust grains at radii less than about 5 × 10¹⁶ cm are likely to be evaporated by the SN explosion (Gall et al. 2014), although some dust may survive. Meanwhile, SN ejecta expanding with a mean velocity of ~6000 km s⁻¹ would reach this radius after ~1150 days. A SN following the bolometric evolution of SN 1987A (Seitenzahl et al. 2014) would have a luminosity of ~10 000 L_⊙, the lowest considered in our models, at a similar epoch. Therefore, any dust estimated from model SEDs with dust located at distances ≳5 × 10¹⁶ cm (i.e. the bottom panel in Fig. 14) may be interpreted as being pre-existing dust. Dust estimates from model SEDs with dust located at distances ≲5 × 10¹⁶ cm (i.e. the top panel in Fig. 14) could be associated with newly formed dust at early epochs in young SNe. Such newly formed dust can be located either in the SN ejecta or, in the case of Type IIn SNe such as SN 2006jc (e.g. Smith et al. 2008) or SN 2010jl (e.g. Gall et al. 2014; Bevan et al. 2020), in the cool dense shell located at a distance of about 10¹⁶ cm, behind the forward shock that propagates through the dense circumstellar material shed by the progenitor prior to the terminal explosion.
By comparing the covered areas in both panels, we find that our neural network may be better at estimating the dust mass and temperature of model SEDs which are closer to a pre-existing dust scenario than to an ejecta dust scenario. Whether or not this is due to our chosen simplifications and parameter space coverage of our simulations can be tested in forthcoming, more realistic simulations.
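The time for ejecta to reach the 5 × 10¹⁶ cm boundary can be checked with a one-line estimate. The sketch assumes free expansion at a constant mean velocity, which is a simplification; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86400.0
CM_PER_KM = 1.0e5


def expansion_time_days(radius_cm, velocity_km_s):
    """Days needed to reach radius_cm at a constant velocity (free expansion)."""
    return radius_cm / (velocity_km_s * CM_PER_KM) / SECONDS_PER_DAY


t_6000 = expansion_time_days(5.0e16, 6000.0)
t_5000 = expansion_time_days(5.0e16, 5000.0)
```

Under this assumption a mean velocity of 6000 km s⁻¹ gives roughly 965 days, while the ~1150 days quoted above corresponds to a mean velocity nearer 5000 km s⁻¹. Either value places the boundary at an epoch of about three years, so the division into ejecta dust and pre-existing dust is not sensitive to this difference.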

In this work we used point source continuum sensitivity limits (Glasse et al. 2015; Greene et al. 2017, and described in Sect. 2.1.2), which are calculated assuming average zodiacal background levels. While this is a reasonable approach for young SNe that are located, for example, in resolved nearby galaxies, at the outskirts of a galaxy or in the intergalactic medium, it may be problematic for older SNRs that are more extended, diffuse sources, as well as for SNe located in crowded regions. ISM back- and foreground contamination from unresolved stars in distant galaxies can give rise to a brighter background than assumed here, shifting the sensitivity limits to brighter (lower) magnitudes. Greene et al. (2017) estimated that the sensitivity levels can worsen by up to a factor of ~2 for NIRCam broad band filters in the case of bright backgrounds. Contamination due to cold ISM dust with temperatures ≲30–50 K could also affect the sensitivity limits. However, this is most prominent at longer wavelengths, ≳100 µm, and thus may not significantly affect the sensitivity limits in JWST’s wavelength range. Finally, the chosen observing strategy and the possibilities for proper background subtraction may also shift the limits at which faint sources can still be detected. In forthcoming work we will test in more detail the impact of varying sensitivity limits on optimising the JWST filter selection to determine dust properties.
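The factor of ~2 quoted from Greene et al. (2017) can be translated into magnitudes with the usual relation Δm = −2.5 log₁₀(f), where f is the factor by which the limiting flux increases. The sketch below only evaluates this relation; the function name is hypothetical.

```python
import math


def limiting_magnitude_shift(flux_factor):
    """Change in limiting magnitude when the limiting flux grows by flux_factor."""
    return -2.5 * math.log10(flux_factor)


shift_factor_two = limiting_magnitude_shift(2.0)
```

A sensitivity that is worse by a factor of two therefore moves the limiting magnitude by about 0.75 mag towards brighter sources.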

To use modern machine learning algorithms effectively, large data sets are essential. Ongoing wide-field surveys such as the Zwicky Transient Facility (Bellm 2014), the Young Supernova Experiment (Jones et al. 2021) or the SkyMapper Southern Sky Survey (Scalzo et al. 2017) are discovering hundreds to thousands of SNe and other transients per year and are building up a wealth of optical photometric as well as spectroscopic data of various types of CCSNe, which will be further advanced in future surveys such as the Vera C. Rubin Observatory Legacy Survey of Space and Time (Ivezić et al. 2019). While near- to mid-IR observations of CCSNe will likely boom with the launch of JWST and possibly other future instruments on ground- and space-based telescopes, they are rare at present and will most likely not reach the level required to train machine learning algorithms on observational near- to mid- to far-IR data.

The Open Supernova Catalog⁸ reported the discovery of 450 Type II, 102 Type Ib, and 60 Type Ic SNe in the year 2021, of which only a few have mid-IR data. Although this is a large number of observed CCSNe in just one year, collecting a data set of the size, wavelength range and degree of variation used in our study will not be easily feasible in the future either. This ‘data-size’ limitation is especially important for estimating the dust properties of the types of SNe used in this work, for which we had to simulate 293 236 SN SEDs covering a wavelength range from 0.7 to 30 µm. Finally, as the dust properties and quantities cannot be measured directly from the observational data and are thus unknown, well advanced simulations with known dust properties are highly valuable. Therefore, applying a neural network that is well trained on a rich set of highly advanced simulated data exploring a large parameter space may be a promising way to determine dust quantities and properties from future observations. This work also allows testing in what detail quantities and properties of dust can be inferred from observational data. Furthermore, future observational data, if included in the training of the neural network, can be used to validate the neural network and will thus improve its performance and outcome.

8 Conclusion

In this work, we present a first test of using neural networks to estimate different quantities and properties of dust located in and around SNe, including their predicted uncertainties. We aimed to predict the temperature and amount of dust and to differentiate between three dust compositions. To do so, we simulated an extensive data set of 293 236 SN model SEDs using the 3D photoionisation and dust radiative transfer code MOCASSIN (Ercolano et al. 2003a, 2005). We convolved the simulated data set with the JWST MIRI and NIRCam bandpass filters. We considered the instruments’ detection limits as well as estimated magnitude uncertainties to make the trained neural network suitable for predicting some of the properties and quantities of dust in SNe from future observations with these instruments. We defined three different scenarios to examine the feasibility and accuracy of inferring the dust properties with our neural network. In the first scenario, we assumed that all SN model SEDs have the same low redshift. In the second and third scenarios, we distributed all SN model SEDs within the redshift range of 0.0001−0.015, in which at least seven JWST bandpass filters of all SN model SEDs are within the sensitivity and saturation limits that are calculated for a S/N of 10. Additionally, in the third scenario, we added random noise to the distributed SN model SEDs. Thereafter, we selected the preferred and minimum subsets of JWST filters from the feature selection process, which is based on the SHAP framework. We used these filter subsets to estimate the amount, temperature and dust species with our neural network.

From the outcome of our trained neural network in S3, which is the scenario closest to real observations, we find that the minimum subset of JWST filters needed to estimate dust quantities and properties consists of the NIRCam filters F070W, F140M, F356W and F480M and the MIRI filters F560W, F770W, F1000W, F1130W, F1500W and F1800W. As presented in Table 2, our neural network predicts the dust quantities and properties well for approximately 7% of the SN model SEDs from the entire test data set. For this fraction, the RMSE is ~0.12 dex for M_dust and ~38 K for T_dust. The classification accuracy is 95, 99 and 57% for carbon, silicate and a mix of carbon and silicate dust, respectively. We find that the dust quantities and properties are best predicted by our neural network for SN model SEDs that approximately range in T_dust between 250−1000 K and in M_dust between 5 × 10⁻⁴−10⁻¹ M_⊙, and are dominated by astronomical silicates.

Acknowledgements

We thank Dr. Doogesh Kodi Ramanah, Dr. Adriano Agnello and Dr. Mikako Matsuura for helpful discussions. We would also like to thank the anonymous referee for insightful comments. This work is supported by a VILLUM FONDEN Investigator grant (project number 16599) and a VILLUM FONDEN Young Investigator Grant (project number 25501). R.W. acknowledges support from European Research Council (ERC) Advanced Grant 694520 SNDUST. This work has made use of the Horizon Cluster hosted by Institut d’Astrophysique de Paris, and an HPC facility funded by a grant from VILLUM FONDEN (project number 16599).

Appendix A Sensitivity and saturation limits for NIRCam and MIRI

Figures A.1 and A.2 represent the sensitivity and saturation limits for observing with NIRCam and MIRI filters with a minimum signal-to-noise ratio of 10.

Fig. A.1

NIRCam saturation magnitudes in 10 000 seconds exposure time, and point source sensitivity for 21.4 seconds exposure time. The sizes roughly represent the wavelength range of each filter. There is a break in y-axis (15–25 AB magnitude) to save a large blank space between the sensitivity and saturation limits.

Fig. A.2

MIRI saturation magnitudes in 10 000 seconds exposure time, and point source sensitivity for 21.4 seconds exposure time. The sizes roughly represent the wavelength range of each filter. There is a break in y-axis (16–20 AB magnitude) to save a large blank space between the sensitivity and saturation limits.

Appendix B Computational caveats

Appendix B.1 Computational cost

Here we address the computational costs of implementing the DeepLIFT from SHAP framework on a photometric data set with a full set of JWST filters and redshift (i.e. 38 features). We compute the time consumption of calculating Shapley values for several sub-samples with different sizes from the training and the validation data sets. We fit an exponential function to the computed time consumption and the corresponding size of the sub-samples as the percentage of the training and validation data sets. We found the following function as the computational cost function for our data set:

where a ≈ −59.3, b ≈ 51, and c ≈ −2.06. Table B.1 summarises the computed and estimated computational costs as the time consumption for sub-samples with different sizes. The estimated times are calculated by Q(x) for a given x as the percentage of the training and validation data sets used to calculate the expectation and Shapley values, respectively. The computed computational costs are derived from an implementation of the algorithm on a MacBook Pro with a 2.9 GHz 6-Core Intel Core i9 processor and 32 GB of 2400 MHz DDR4 memory.

Table B.1

Computational costs of calculating Shapley values using DeepLIFT for different sub-samples of validation and training data sets. The instance highlighted in blue is the selected size in this work.

Appendix B.2 Reproducibility

For the sake of reproducibility, we use the Keras library (Chollet et al. 2015) from the TensorFlow package (Abadi et al. 2015) to build our neural network. We made our code publicly accessible on GitHub⁹ for any further evaluation and/or optimisation purposes. However, the specific distribution of computations over the processors has an effect on the training. Therefore, depending on the specific machine in use, the reproduced outcome of the neural network can be slightly different (e.g. Bhojanapalli et al. 2021).
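One ingredient of this machine dependence is that floating-point addition is not associative, so the order in which partial sums are combined across processors can change the last digits of a result. The following minimal sketch illustrates this with three numbers; it is a generic illustration and does not reproduce any computation of this work.

```python
a, b, c = 0.1, 0.2, 0.3

# The same three numbers, summed with two different groupings.
left_to_right = (a + b) + c
right_to_left = a + (b + c)

difference = abs(left_to_right - right_to_left)
```

The two results differ by about 10⁻¹⁶. In a neural network, many such differences accumulate over millions of operations and training steps, which is why retraining on a different machine may give slightly different weights even with identical code and random seeds.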

Appendix C S1, S2 feature importance

Figures C.1 and C.2 represent the resulting Shapley values obtained for each step in scenarios S1 and S2.

Fig. C.1

Importance of JWST filters for estimating the amount, temperature and the dust species, in S1. The symbols, relative sizes and colour codes are the same as defined in Fig. 10.

Fig. C.2

Importance of JWST filters for estimating the amount, temperature and the dust species, in S2. The symbols, relative sizes and colour codes are the same as defined in Fig. 10.

Appendix D Effect of optical constants on example SEDs

As discussed in the text, the choice of optical constants can have a significant effect on the model SED, and thus affect the predicted dust quantities. Here, we provide a further illustration of this, using a representative model from our dataset. The model has a mixed composition with 50% carbon and 50% silicate grains. Figure 9 in the main text shows the example SEDs from all combinations of four sets of carbon and four sets of silicate optical constants, for a grain size of 0.1 µm. In Figure D.1, we show the sets of example SEDs also for grain sizes of 0.01, 1.0 and 5.0 µm, and for models with the same geometry but pure carbon and pure silicate composition. One can see that the variation between SEDs is largest for smaller grain sizes.

Fig. D.1

Effect of choice of optical constants demonstrated on example model SEDs for pure and mixed compositions, and grain sizes of 0.01, 0.1, 1.0 and 5.0 µm. SEDs are plotted for all combinations of four sets of carbon and four sets of silicate optical constants, resulting in four SEDs in each panel for pure compositions and 16 for the mixed compositions. The fixed parameters of the model are M_d = 3.4 × 10⁻⁴ M_⊙, R_out = 1.21 × 10¹⁷ cm, T_* = 7097 K, and L_* = 1.53 × 10⁸ L_⊙.

References

Abadi, M., Agarwal, A., Barham, P., et al. 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org
Abbott, D. C., & Lucy, L. B. 1985, ApJ, 288, 679
Abbott, B. P., Abbott, R., Abbott, T. D., et al. 2017, Nature, 551, 85
Bak Nielsen, A.-S., Hjorth, J., & Gall, C. 2018, A&A, 611, A67
Bellm, E. 2014, in The Third Hot-wiring the Transient Universe Workshop, ed. P. R. Wozniak, M. J. Graham, A. A. Mahabal, & R. Seaman, 27
Bengio, Y. 2012, Practical Recommendations for Gradient-Based Training of Deep Architectures, eds. G. Montavon, G. B. Orr, & K.-R. Müller (Berlin, Heidelberg: Springer Berlin Heidelberg) 437
Bertoldi, F., Carilli, C. L., Cox, P., et al. 2003, A&A, 406, L55
Bevan, A., & Barlow, M. J. 2016, MNRAS, 456, 1269
Bevan, A. M., Krafton, K., Wesson, R., et al. 2020, ApJ, 894, 111
Bhojanapalli, S., Wilber, K., Veit, A., et al. 2021, ArXiv e-prints [arXiv:2102.03349]
Bishop, C. M. 1995, Neural Comput., 7, 108
Chawner, H., Marsh, K., Matsuura, M., et al. 2019, MNRAS, 483, 70
Chen, T. W., Brennan, S. J., Wesson, R., et al. 2021, ArXiv e-prints [arXiv:2109.07942]
Chollet, F., et al. 2015, Keras, https://github.com/fchollet/keras
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. 2015, ICLR 2016, [arXiv:1511.07289]
De Looze, I., Barlow, M. J., Bandiera, R., et al. 2019, MNRAS, 488, 164
De Looze, I., Lamperti, I., Saintonge, A., et al. 2020, MNRAS, 496, 3668
Draine, B. T. 2009, ASP Conf. Ser., 414, 453
Draine, B. T., & Lee, H. M. 1984, ApJ, 285, 89
Dwek, E., Galliano, F., & Jones, A. P. 2007, ApJ, 662, 927
Ercolano, B., Barlow, M. J., Storey, P. J., & Liu, X. W. 2003a, MNRAS, 340, 1136
Ercolano, B., Morisset, C., Barlow, M. J., Storey, P. J., & Liu, X. W. 2003b, MNRAS, 340, 1153
Ercolano, B., Barlow, M. J., & Storey, P. J. 2005, MNRAS, 362, 1038
Ercolano, B., Barlow, M. J., & Sugerman, B. E. K. 2007, MNRAS, 375, 753
Fawcett, T. 2006, Pattern Recognit. Lett., 27, 861
Ferrara, A., Viti, S., & Ceccarelli, C. 2016, MNRAS, 463, L112
Fesen, R. A., Hamilton, A. J. S., & Saken, J. M. 1989, ApJ, 341, L55
Finkelstein, S. L., Papovich, C., Salmon, B., et al. 2012, ApJ, 756, 164
Gall, C., & Hjorth, J. 2018, ApJ, 868, 62
Gall, C., Andersen, A. C., & Hjorth, J. 2011a, A&A, 528, A14
Gall, C., Hjorth, J., & Andersen, A. C. 2011b, A&ARv, 19, 43
Gall, C., Hjorth, J., Watson, D., et al. 2014, Nature, 511, 326
Gardner, J. P., Mather, J. C., Clampin, M., et al. 2006, Space Sci. Rev., 123, 485
Glasse, A., Rieke, G. H., Bauwens, E., et al. 2015, PASP, 127, 686
Gomez, H. L., Krause, O., Barlow, M. J., et al. 2012, ApJ, 760, 96
Greene, T. P., Kelly, D. M., Stansberry, J., et al. 2017, J. Astron. Teles. Instrum. Syst., 3, 1
Hanner, M. S. 1988, in Infrared Observations of Comets Halley and Wilson and Properties of the Grains, 22
He, K., Zhang, X., Ren, S., & Sun, J. 2015, ArXiv e-prints [arXiv:1502.01852]
Henning, T. 2010, ARA&A, 48, 21
Hogg, D. W., Baldry, I. K., Blanton, M. R., & Eisenstein, D. J. 2002, ArXiv e-prints [arXiv:astro-ph/0210394]
Horner, S. D., & Rieke, M. J. 2004, SPIE, 5487, 628
Indebetouw, R., Matsuura, M., Dwek, E., et al. 2014, ApJ, 782, L2
Ivezić, Ž., Kahn, S. M., Tyson, J. A., et al. 2019, ApJ, 873, 111
Jones, D. O., Foley, R. J., Narayan, G., et al. 2021, ApJ, 908, 143
Kingma, D. P., & Ba, J. 2014, ArXiv e-prints [arXiv:1412.6980]
Kotak, R., Meikle, W. P. S., Farrah, D., et al. 2009, ApJ, 704, 306
Laor, A., & Draine, B. T. 1993, ApJ, 402, 441
Lau, R. M., Herter, T. L., Morris, M. R., Li, Z., & Adams, J. D. 2015, Science, 348, 413
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. 1998, Efficient BackProp, eds. G. B. Orr, & K.-R. Müller (Berlin, Heidelberg: Springer Berlin Heidelberg) 9
Lucy, L. B. 1999, A&A, 345, 211
Lundberg, S., & Lee, S.-I. 2017, NIPS 2017, ArXiv e-prints [arXiv:1705.07874] [Google Scholar]
Maas, A. L., Hannun, A. Y., & Ng, A. Y. 2013, in ICML Workshop on Deep Learning for Audio, Speech and Language Processing [Google Scholar]
Marrone, D. P., Spilker, J. S., Hayward, C. C., et al. 2018, Nature, 553, 51 [NASA ADS] [CrossRef] [Google Scholar]
Mathis, J. S., Rumpl, W., & Nordsieck, K. H. 1977, ApJ, 217, 425 [Google Scholar]
Matsuura, M., Dwek, E., Barlow, M. J., et al. 2015, ApJ, 800, 50 [NASA ADS] [CrossRef] [Google Scholar]
Matsuura, M., De Buizer, J. M., Arendt, R. G., et al. 2019, MNRAS, 482, 1715 [NASA ADS] [CrossRef] [Google Scholar]
Mauerhan, J., & Smith, N. 2012, MNRAS, 424, 2659 [NASA ADS] [CrossRef] [Google Scholar]
Micelotta, E. R., Dwek, E., & Slavin, J. D. 2016, A&A, 590, A65 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Michałowski, M. J., Murphy, E. J., Hjorth, J., et al. 2010a, A&A, 522, A15 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Michałowski, M. J., Watson, D., & Hjorth, J. 2010b, ApJ, 712, 942 [CrossRef] [Google Scholar]
Moseley, S. H., Dwek, E., Glaccum, W., Graham, J. R., & Loewenstein, R. F. 1989, Nature, 340, 697 [NASA ADS] [CrossRef] [Google Scholar]
Murty, K. G., & Kabadi, S. N. 1987, Math. Prog., 39, 117 [CrossRef] [Google Scholar]
Nash, J. 1953, Econometrica, 21, 128 [CrossRef] [Google Scholar]
Niculescu-Duvaz, M., Barlow, M. J., Bevan, A., Milisavljevic, D., & De Looze, I. 2021, MNRAS, 504, 2133 [NASA ADS] [CrossRef] [Google Scholar]
Ossenkopf, V., Henning, T., & Mathis, J. S. 1992, A&A, 261, 567 [NASA ADS] [Google Scholar]
Otsuka, M., van Loon, J. T., Long, K. S., et al. 2010, A&A, 518, L139 [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
Owen, P. J., & Barlow, M. J. 2015, ApJ, 801, 141 [NASA ADS] [CrossRef] [Google Scholar]
Pietrzyński, G., Graczyk, D., Gallenne, A., et al. 2019, Nature, 567, 200 [Google Scholar]
Pontoppidan, K. M., Pickering, T. E., Laidler, V. G., et al. 2016, SPIE Conf. Ser., 9910, 991016 [NASA ADS] [Google Scholar]
Priddey, R. S., Isaak, K. G., McMahon, R. G., Robson, E. I., & Pearson, C. P. 2003, MNRAS, 344, L74 [NASA ADS] [CrossRef] [Google Scholar]
Rho, J., Reach, W. T., Tappe, A., et al. 2009, ApJ, 700, 579 [NASA ADS] [CrossRef] [Google Scholar]
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. 1986, Nature, 323, 533 [Google Scholar]
Scalzo, R. A., Yuan, F., Childress, M. J., et al. 2017, PASA, 34, e030 [NASA ADS] [CrossRef] [Google Scholar]
Seitenzahl, I. R., Timmes, F. X., & Magkotsios, G. 2014, ApJ, 792, 10 [NASA ADS] [CrossRef] [Google Scholar]
Shapley, L. S. 2016, 17. A Value for n-Person Games, eds. H. W. Kuhn, & A. W. Tucker, Princeton: Princeton University Press, 307 [Google Scholar]
Shrikumar, A., Greenside, P., Shcherbina, A., & Kundaje, A. 2016, ArXiv eprints [arXiv:1605.01713] [Google Scholar]
Shrikumar, A., Greenside, P., & Kundaje, A. 2017, PMLR, 70, 3145 [Google Scholar]
Silvia, D. W., Smith, B. D., & Shull, J. M. 2012, ApJ, 748, 12 [NASA ADS] [CrossRef] [Google Scholar]
Smith, N., Chornock, R., Li, W., et al. 2008, ApJ, 686, 467 [NASA ADS] [CrossRef] [Google Scholar]
Szalai, T., Zsíros, S., Fox, O. D., Pejcha, O., & Müller, T. 2019, ApJS, 241, 38 [NASA ADS] [CrossRef] [Google Scholar]
van Rijn, J. N., & Hutter, F. 2018, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18 (New York, NY, USA: Association for Computing Machinery), 2367 [CrossRef] [Google Scholar]
Wang, R., Carilli, C. L., Wagg, J., et al. 2008, ApJ, 687, 848 [NASA ADS] [CrossRef] [Google Scholar]
Watson, D., Christensen, L., Knudsen, K. K., et al. 2015, Nature, 519, 327 [Google Scholar]
Weerts, H. J. P., Mueller, A. C., & Vanschoren, J. 2020, ArXiv e-prints [arXiv:2007.07588] [Google Scholar]
Wesson, R., Barlow, M. J., Matsuura, M., & Ercolano, B. 2015, MNRAS, 446, 2089 [NASA ADS] [CrossRef] [Google Scholar]
Wooden, D. H., Rank, D. M., Bregman, J. D., et al. 1993, ApJS, 88, 477 [NASA ADS] [CrossRef] [Google Scholar]
You, Y., Gitman, I., & Ginsburg, B. 2017, ArXiv e-prints [arXiv:1708.03888] [Google Scholar]
Zubko, V. G., Mennella, V., Colangeli, L., & Bussoletti, E. 1996, MNRAS, 282, 1321 [Google Scholar]

¹ https://mocassin.nebulousresearch.org

² https://docs.astropy.org/en/stable/api/astropy.cosmology.FlatLambdaCDM.html#astropy.cosmology.FlatLambdaCDM

³ https://mfouesneau.github.io/pyphot/

⁴ http://svo2.cab.inta-csic.es/svo/theory/fps3/

⁵ https://outerspace.stsci.edu/display/PANSTARRS/PSl+Forced+photometry+of+sources

⁶ https://shap-lrjball.readthedocs.io/en/latest/generated/shap.DeepExplainer.html

⁷ These designations refer to amorphous carbon grains produced by arc discharge between amorphous carbon electrodes in an argon atmosphere (ACAR), arc discharge in a hydrogen atmosphere (ACH2), and burning of benzene in air (BE).

⁸ https://sne.space

⁹ https://github.com/ZoeAnsari/InferringSNdustwithNN

All Tables

Table 1

Input parameters for the MOCASSIN models.

Table 2

Comparison of neural network performance for estimating M_dust and T_dust in different scenarios for four different cases.

Preferred and minimum subsets of JWST filters obtained from the feature selection process and used to estimate M_dust, T_dust, and dust species.

Table 5

Comparison of neural network performance for estimating M_dust and T_dust in scenario 3 for case-2, with the same definitions as in Table 2.

Computational costs of calculating Shapley values using DeepLIFT for different sub-samples of validation and training data sets. The instance highlighted in blue is the selected size in this work.

All Figures

	Fig. 1 Coverage of SN model SEDs in M_dust, R_out, and dust species parameter space. The colour bar represents T_dust of the SN model SEDs, with blue denoting the coldest (200 K) and red the hottest (2200 K) temperatures.

	Fig. 5 Performance of the neural network with the preferred subset of JWST filters, for S2. The definition of the panels, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

	Fig. 6 Performance of the neural network with the preferred subset of JWST filters, for S3. The estimations are shown for the reliable predictions of the test data set with S/N = 3 (top panel), and the test data set with S/N = 20 (bottom panel). The definition of the columns, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

	Fig. 9 Effect of choice of optical constants demonstrated on an example model SED with mixed dust composition. Sixteen SEDs are plotted, using all combinations of four sets of carbon and four sets of silicate optical constants as defined in Sect. 7.1. The fixed parameters of the model are M_d = 2.7 × 10⁻³M_⊙, R_out = 1.99 × 10¹⁷ cm, T_* = 12 298 K, and L_* = 2.55 × 10⁸L_⊙.

Fig. 10

	Fig. 11 Performance of the neural network with the preferred subset of JWST filters, for S3. The definition of the panels, the variables, the dashed lines and the colour bars are the same as in Fig. 4.

Fig. 12

	Fig. 13 Estimated amount, temperature, and composition of the dust in SN 1987A at 615 days after explosion for the preferred set of JWST filters. The symbols, lines and shaded regions are defined as in Fig. 12.

	Fig. 14 Distribution of SN model SEDs in T_dust in 50 K bins. The dash-dotted line represents the distribution of the entire test data set and the solid line represents the distribution of the reliable predictions from the test data set. Top panel: SN model SEDs with R_in ≲ 5 × 10¹⁶ cm. Bottom panel: SN model SEDs with R_in ≳ 5 × 10¹⁶ cm.

	Fig. A.1 NIRCam saturation magnitudes for an exposure time of 10 000 seconds, and point source sensitivity for an exposure time of 21.4 seconds. The sizes roughly represent the wavelength range of each filter. The y-axis is broken between 15 and 25 AB magnitude to omit the large blank space between the sensitivity and saturation limits.

	Fig. A.2 MIRI saturation magnitudes for an exposure time of 10 000 seconds, and point source sensitivity for an exposure time of 21.4 seconds. The sizes roughly represent the wavelength range of each filter. The y-axis is broken between 16 and 20 AB magnitude to omit the large blank space between the sensitivity and saturation limits.

	Fig. C.1 Importance of JWST filters for estimating the amount, temperature and the dust species, in S1. The symbols, relative sizes and colour codes are the same as defined in Figure C.1.

	Fig. C.2 Importance of JWST filters for estimating the amount, temperature and the dust species, in S2. The symbols, relative sizes and colour codes are the same as defined in Figure C.2.

Fig. D.1
