Issue		A&A Volume 622, February 2019


Article Number		A137
Number of page(s)		12
Section		Extragalactic astronomy
DOI		https://doi.org/10.1051/0004-6361/201833972
Published online		08 February 2019

A&A 622, A137 (2019)

Star formation rates and stellar masses from machine learning

V. Bonjean¹,², N. Aghanim¹, P. Salomé², A. Beelen¹, M. Douspis¹ and E. Soubrié¹

¹ Institut d’Astrophysique Spatiale (IAS), CNRS, Université Paris-Sud, UMR 8617, Bâtiment 121, 91405 Orsay, France
e-mail: victor.bonjean@ias.u-psud.fr; victor.bonjean@obspm.fr
² LERMA, Observatoire de Paris, PSL Research University, CNRS, Sorbonne Universités, UPMC Univ. Paris 06, 75014 Paris, France

Received: 27 July 2018
Accepted: 18 December 2018

Abstract

Star-formation activity is a key property to probe the structure formation and hence characterise the large-scale structures of the universe. This information can be deduced from the star formation rate (SFR) and the stellar mass (M_⋆), both of which, but especially the SFR, are very complex to estimate. Determining these quantities from UV, optical, or IR luminosities relies on complex modeling and on priors on galaxy types. We propose a method based on the machine-learning algorithm Random Forest to estimate the SFR and the M_⋆ of galaxies at redshifts in the range 0.01 < z < 0.3, independent of their type. The machine-learning algorithm takes as inputs the redshift, WISE luminosities, and WISE colours in near-IR, and is trained on spectra-extracted SFR and M_⋆ from the SDSS MPA-JHU DR8 catalogue as outputs. We show that our algorithm can accurately estimate SFR and M_⋆ with scatters of σ_SFR = 0.38 dex and σ_{M_⋆} = 0.16 dex for SFR and stellar mass, respectively, and that it is unbiased with respect to redshift or galaxy type. The full-sky coverage of the WISE satellite allows us to characterise the star-formation activity of all galaxies outside the Galactic mask with spectroscopic redshifts in the range 0.01 < z < 0.3. The method can also be applied to photometric-redshift catalogues, with best scatters of σ_SFR = 0.42 dex and σ_{M_⋆} = 0.24 dex obtained in the redshift range 0.1 < z < 0.3.

Key words: methods: data analysis / galaxies: star formation / galaxies: evolution / large-scale structure of Universe

© ESO 2019

1. Introduction

The galaxy types and their relations to the environment are key features of the study and characterisation of large-scale structures (LSS) in the context of future large surveys of galaxies such as the Large Synoptic Survey Telescope (LSST)¹, the Dark Energy Survey (DES)² or Euclid³.

In the standard understanding of galaxy evolution, star-forming galaxies (usually blue and spiral ones) align along a main sequence in diagrams showing their star formation rates (SFR) versus their stellar mass (M_⋆) (blue dots in Fig. 1). This sequence has been fitted for low-redshift galaxies (up to z ∼ 0.3) by Brinchmann et al. (2004) using the Sloan Digital Sky Survey (SDSS, York et al. 2000) galaxies (Elbaz et al. 2007): SFR_SDSS[M_⊙ yr⁻¹] = 8.7 × [M_⋆/10¹¹ M_⊙]^0.77.
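As a minimal sketch, the main-sequence relation quoted above can be evaluated directly. The function names are hypothetical, and the offset from the main sequence is computed here as a vertical distance in dex, which is an assumption for illustration and not necessarily the definition of d2ms used for Fig. 1.

```python
import math

def sfr_main_sequence(mstar_msun):
    """Main-sequence SFR [Msun/yr] from the relation quoted above (Elbaz et al. 2007)."""
    return 8.7 * (mstar_msun / 1e11) ** 0.77

def offset_from_main_sequence(sfr, mstar_msun):
    """Vertical offset in dex from the main sequence (assumed definition, for illustration)."""
    return math.log10(sfr) - math.log10(sfr_main_sequence(mstar_msun))

# A galaxy of 10^11 Msun lying exactly on the sequence forms 8.7 Msun/yr.
ms_sfr = sfr_main_sequence(1e11)
# A galaxy of the same mass forming 100 times fewer stars lies 2 dex below it.
offset = offset_from_main_sequence(0.087, 1e11)
```

With this convention, strongly negative offsets would correspond to passive galaxies and offsets near zero to star-forming ones.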

Fig. 1.

SFR vs. M_⋆ diagram. The contours represent the 1σ to 5σ isodensities of all the SDSS MPA-JHU DR8 values from the training sample. The dots are 100 random galaxies taken in the catalogue. The purple solid line is the main sequence of star forming galaxies given by Elbaz et al. (2007). The colours of the galaxies are a function of the distance to the main sequence, d2ms, and are directly representative of the passivity of the galaxies.

Galaxies leave the main sequence when they stop forming stars (quenching), that is, when they lose their cold gas. This can be due to different processes that are not yet fully understood: harassment (e.g. Moore et al. 1996), strangulation (e.g. Peng et al. 2015), that is, when they enter a region with denser and hotter gas (e.g. galaxy clusters or inner parts of cosmic filaments), or ejection of the gas through AGN jets (e.g. Dubois et al. 2013). In all cases, galaxies stop forming stars and undergo a transitioning stage, the so-called green valley (Alatalo et al. 2014), and finally become passive (or red and dead) galaxies (red dots in Fig. 1). In this general picture, the activity of a galaxy is usually defined in terms of its SFR or of its stellar mass.

Estimating the quantities SFR and M_⋆ is complex (see Kennicutt & Evans 2012 for a review); they are directly or indirectly related to the observations of stars. We briefly review here the dependence of the star properties on wavelength across the electromagnetic spectrum. The young and massive O- and B-type stars are the hottest and thus the most energetic stars. Their blackbody spectra peak at blue wavelengths and they emit strongly in the UV. The UV luminosity of distant galaxies traces these types of stars, which in turn directly relate to the SFR, as they represent the youngest stellar populations. However, at these wavelengths the dust absorption is very important and correcting the UV luminosities for dust attenuation is not trivial (Lagache et al. 2005; Kennicutt & Evans 2012). Multi-wavelength tracers or dust attenuation estimates in the UV/optical (Calzetti et al. 1994; Kennicutt 1998; Salim et al. 2007; Kennicutt & Evans 2012; Janowiecki et al. 2017) are therefore needed to correct UV luminosities and use them as a direct tracer to derive estimates of the SFR.

The non-ionizing low-mass old stars represent most of the contribution to the galaxy luminosities in the optical. As they are the most numerous in a galaxy, the optical luminosity is also directly related to the number of stars, and thus to the stellar mass, provided that there exists a theoretical model of star population and an initial mass function (IMF; e.g. Bruzual & Charlot 2003). The estimation of stellar masses strongly depends on the assumed IMF. For example, a typical correcting/calibration factor of ∼1.6 is needed to change from a stellar mass with a Salpeter IMF (Salpeter 1955) to a stellar mass with a Chabrier (Chabrier 2003) IMF (Haas & Anders 2010).

In the near-IR (NIR; ∼0.8 μm < λ < ∼3 μm), the old and non-massive stars also represent most of the contribution to the total luminosity. These wavelengths can therefore also trace the stellar mass through the old population, in the same way as optical measurements do (Wen et al. 2013). In the mid-IR (MIR; ∼3 μm < λ < ∼70 μm), the contribution of dust becomes predominant. Particularly in the 8–12 μm band, the contribution of heated small grains and polycyclic aromatic hydrocarbons (PAH, Leger & Puget 1984) offers a useful tool to study the composition and the abundance of dust. From ∼20 μm to ∼70 μm, the luminosity is mostly due to thermalised dust and large grains heated by the UV emission of the energetic young O- and B-type stars. The luminosity in the IR is thus indirectly related to the SFR, and this relation has been calibrated using, for example, the 8 μm and the 24 μm bands from the Spitzer satellite (Werner et al. 2004) or the 12 μm and the 22 μm bands from the Wide-Field Infrared Survey Explorer (WISE, Wright et al. 2010) satellite (Calzetti et al. 2007; Kennicutt et al. 2009; Jarrett et al. 2013; Cluver et al. 2014, 2017).

All of these relations are well calibrated, but applying them to galaxies without having any prior on their types can lead to potential biases, as passive galaxies do not have the same properties in the IR (see Sect. 4). Ideally, optical spectroscopic data are needed to estimate the SFR and M_⋆ properties, but they are not always available and are costly in terms of observing time. This is even more prohibitive when the goal is to characterise the galaxy properties in large surveys.

In this study, we propose an alternative approach to estimate SFR and M_⋆ for all galaxies over 70% of the sky (i.e. outside the Galactic plane) with measured redshifts in the range 0 < z < 0.3 (the redshift limit of the training catalogue), without any priors on galaxy types. To do so, we use a machine-learning algorithm. Machine-learning algorithms have been developed for several years and have now become reliable tools to classify or estimate physical properties of astrophysical objects (Aghanim et al. 2015; Huertas-Company et al. 2015; Bilicki et al. 2014, 2016; Krakowski et al. 2016; Lucie-Smith et al. 2018; Pashchenko et al. 2018; Domínguez Sánchez et al. 2018; Tuccillo et al. 2018; Ucci et al. 2018; Viquar et al. 2018; Delli Veneri et al. 2018; Siudek et al. 2018a,b). The development of the scikit-learn library (Pedregosa et al. 2011) in Python has made them relatively easy to use. It has allowed “pythoneer” astrophysicists to develop and test their own machine-learning algorithms on their data.

In the machine-learning domain, algorithms are designed either to estimate or classify features based on reference samples, or to identify commonalities in the input features without resorting to any models. These two families of machine-learning algorithms are called supervised and unsupervised algorithms, respectively. The first includes algorithms such as Multi-Layer Perceptron, Random Forests, Support Vector Machine, deep learning, and so on. The second includes clustering methods such as k-means algorithms (see the scikit-learn website⁴ for more details about the algorithms). In our analysis, we use a supervised machine-learning algorithm. Such a method is able to estimate very non-linear laws based on models trained on reliable given inputs and outputs. In the present case, it allows us to estimate SFR and M_⋆ independently of any complex model or any priors on galaxy types. The quality and pertinence of the results of a supervised machine-learning algorithm depend greatly on the training set and how it captures the features that will be classified.

In Sect. 2, we therefore present the data used to generate the trained model and train the machine learning algorithm. In Sect. 3, we present an analysis of the Random Forest algorithm, and the results. We discuss the results and the limitations of the method in Sect. 4, and also present an example and an illustration of the method. Finally, we briefly summarise the method in Sect. 5.

We use the Planck 2015 cosmological parameters throughout this paper (Planck Collaboration XIII 2016) with H₀ = 67.74 km Mpc⁻¹ s⁻¹, Ω_M₀ = 0.3075 and Ω_b₀ = 0.0486. Also, all luminosities noted as Lν in different wavelengths refer to the luminosity density in this wavelength νL_ν.

2. Data

In this section, we present the data used to construct the training catalogue of the machine-learning algorithm.

2.1. WISE

The WISE satellite surveyed the whole sky in four near- and mid-infrared wavelengths (3.4, 4.6, 12 and 22 μm). From the coadded WISE Atlas Images (at angular resolutions of 6.1, 6.4, 6.5 and 12 arcsec, respectively), the AllWISE Source Catalogue⁵ was generated with accurate positions, photometry, and ancillary information for 747 634 026 detected sources (Cutri et al. 2013). For our study, we use the profile-fitted photometry measurements of the W1 (3.4 μm), W2 (4.6 μm), and W3 (12 μm) bands, noted w1mpro, w2mpro, and w3mpro in the AllWISE Source Catalogue. The associated errors and signal-to-noise ratios (S/Ns) of magnitude measurements are noted w1sigmpro, w2sigmpro, w3sigmpro, and w1snr, w2snr, w3snr, respectively. We reject the sources with known detection or measurement artifacts by taking cc_flags = 0 for each of the three bands. We also select the sources with high-quality photometry measurements. Since the WISE magnitudes are upper limits for w*snr < 2 (Krakowski et al. 2016), where * can be 1, 2, or 3, we proceed in the same way as Krakowski et al. (2016) and select only sources with reliable magnitudes in W1 and W2: w1snr ≥ 2, w2snr ≥ 2. Around one third of the selected sources have w3snr < 2. In this case, the w3mpro upper-limit magnitudes can be considered as affected by a typical bias, and a correction of +0.75 can be applied to the values to correct for it (Fig. A.1 in Krakowski et al. 2016). We also apply a 0th-order k-correction (dependence on redshift) by adding the quantity −2.5 × log(1+z) to the measured magnitudes in each of the three bands, where we take the spectroscopic redshift z from the SDSS catalogue (see the cross-match procedure between the two catalogues in Sect. 3.3).
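The photometric cuts and corrections described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: "reliable" is taken to mean an S/N of at least 2 in W1 and W2 (magnitudes below that threshold being upper limits), the logarithm in the k-correction is taken to be base 10, and the contamination flag is passed as a precomputed boolean (`cc_ok`, a hypothetical name) instead of being parsed from the cc_flags column.

```python
import numpy as np

def select_and_kcorrect(w1, w2, w3, w1snr, w2snr, w3snr, cc_ok, z):
    """Return a selection mask and corrected W1, W2, W3 magnitudes.

    cc_ok is a hypothetical boolean array: True where the source has no
    contamination flag in the three bands.
    """
    w1, w2, w3 = (np.asarray(a, dtype=float) for a in (w1, w2, w3))
    z = np.asarray(z, dtype=float)

    # Keep only sources with reliable W1 and W2 magnitudes (assumed S/N >= 2).
    keep = np.asarray(cc_ok, dtype=bool) & (np.asarray(w1snr) >= 2) & (np.asarray(w2snr) >= 2)

    # W3 upper limits (S/N < 2) receive the +0.75 mag correction.
    w3_corr = np.where(np.asarray(w3snr) < 2, w3 + 0.75, w3)

    # 0th-order k-correction added to each band (log assumed to be base 10).
    kcorr = -2.5 * np.log10(1.0 + z)
    return keep, w1 + kcorr, w2 + kcorr, w3_corr + kcorr

keep, k1, k2, k3 = select_and_kcorrect(
    w1=[15.0, 14.0], w2=[14.5, 13.5], w3=[12.0, 11.0],
    w1snr=[10.0, 1.5], w2snr=[8.0, 5.0], w3snr=[1.0, 3.0],
    cc_ok=[True, True], z=[0.0, 0.1],
)
```

In this toy input the second source is rejected because of its low W1 S/N, and the first source has its W3 upper limit shifted by +0.75 mag.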

2.2. SDSS

The SDSS is one of the largest optical surveys available. It has produced deep images of one third of the sky in five optical bands: u, g, r, i, and z, and has performed spectroscopic measurements for more than three million astronomical objects. From these data, several value-added catalogues were generated, which provide a wealth of information about the objects thanks to the study of a large panel of spectral emission lines. We use here the MPA–JHU DR8 catalogue, from the Max Planck Institute for Astrophysics and the Johns Hopkins University (Kauffmann et al. 2003; Brinchmann et al. 2004). It provides SFR and stellar masses for 1 843 200 galaxies with redshifts up to z ∼ 0.3 (Fig. 2). These data based on the SDSS DR8 release are publicly available⁶ together with all details about the catalogue and the computations and fits of the galaxy physical properties.

Fig. 2.

SFR and M_⋆ provided in the SDSS catalogue. Each contour represents a galaxy population as defined by their position on the BPT diagram. Red contours represent passive galaxies, blue contours star-forming galaxies, green contours the galaxies from the green valley, and purple contours the AGNs.

The SFRs (flagged as SFR_TOT_P50) are estimated using the H_α emission line (when available) corrected for dust extinction with the Balmer decrement H_α/H_β (Brinchmann et al. 2004). For galaxies without emission lines, SFRs are estimated using a relation between the SFR and the spectral index D₄₀₀₀ (Bruzual 1983; Balogh et al. 1999; Brinchmann et al. 2004). The stellar masses M_⋆ (flagged as LGM_TOT_P50) are computed based on theoretical models of stellar populations (Kauffmann et al. 2003), assuming a Kroupa IMF (Kroupa 2001).

The SDSS MPA-JHU DR8 catalogue provides BPT classes (flagged as BPTCLASS), which depend on the position of the galaxies in the Baldwin, Phillips, & Terlevich (BPT) diagram (Baldwin et al. 1981). This diagram can segregate a population of galaxies by comparing the emission-line ratios [OIII]/H_β and [NII]/H_α (see Fig. 2). In the classification provided by the MPA-JHU catalogue, BPTCLASS = 1 corresponds to star-forming galaxies, BPTCLASS = 2 to composite galaxies (transitioning), BPTCLASS = 3 to AGNs, and BPTCLASS = 4 and BPTCLASS = 5 to low-S/N emission line galaxies (Brinchmann et al. 2004). The class BPTCLASS = −1 corresponds to galaxies unclassifiable in the BPT diagram: passive galaxies without emission lines (Brinchmann et al. 2004).

From the SDSS catalogue, we generate a purer catalogue by selecting objects with only reliable properties. To do so we set the following flags: RELIABLE ≠ 0, Z_WARNING = 0, SFR_TOT_P50 ≠ −9999, LGM_TOT_P50 ≠ −9999, and Z > 0. This pure catalogue contains 794 633 galaxies.
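The quality cuts listed above translate into a simple boolean mask. The sketch below assumes the catalogue columns have already been loaded as arrays; the function name and the toy values are hypothetical.

```python
import numpy as np

def reliable_mask(reliable, z_warning, sfr_tot_p50, lgm_tot_p50, z):
    """Boolean mask implementing the quality cuts listed above."""
    reliable = np.asarray(reliable)
    z_warning = np.asarray(z_warning)
    sfr = np.asarray(sfr_tot_p50, dtype=float)
    lgm = np.asarray(lgm_tot_p50, dtype=float)
    z = np.asarray(z, dtype=float)
    return (reliable != 0) & (z_warning == 0) & (sfr != -9999) & (lgm != -9999) & (z > 0)

# Toy rows: only the first one passes every cut.
mask = reliable_mask(
    reliable=[1, 0, 1, 1],
    z_warning=[0, 0, 4, 0],
    sfr_tot_p50=[0.3, 0.1, -0.5, -9999],
    lgm_tot_p50=[10.5, 9.8, 11.0, 10.2],
    z=[0.05, 0.10, 0.20, 0.15],
)
```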

3. The machine-learning algorithm

3.1. Principle and advantages

Increasingly used in astrophysics and cosmology, machine-learning algorithms have become very powerful tools to detect, classify, or characterise astrophysical sources (e.g. as a very non-exhaustive list: Aghanim et al. 2015; Bilicki et al. 2014, 2016; Krakowski et al. 2016 and references therein). One of the main advantages of this technique is that no model (usually complex or empirical) is needed to perform regression on a set of data. Only a set of input data and output data is needed, and the machine learns the relation (which can be very non-linear and complex) between the input and output data. Different kinds of machine-learning algorithms have been developed and are easily usable (e.g. see the scikit-learn website⁷). For this study, we choose to use the Random Forest (RF) algorithm (Ho 1995), which is among the simplest, fastest to run, and easiest to understand of the many machine-learning methods (see Sect. 3.4), as implemented in scikit-learn v0.19.1.

The usual way to estimate the efficiency of the algorithm (its ability to perform a regression) when applied to a training sample (i.e. the inputs and outputs data set) is to split this training sample into several subsamples, and train and test the algorithm on these different samples. One can then train the machine-learning algorithm on a subsample (50% of the whole sample) and apply it to the other subsample (the other 50%). Knowing the reference values of the last subsample and having their estimation by the machine-learning algorithm, the errors and the bias can be estimated by comparing the two.

3.2. The choice of inputs and outputs

To ensure good, that is, unbiased, training of the algorithm, the choice of the reference inputs and outputs is essential. This step can be bypassed by other algorithms, in particular deep-learning algorithms. In this study, the choice of using the WISE data is motivated by its full-sky coverage and its very large number of sources.

First of all, we have to define our inputs, that is, the data that will be used to estimate the SFRs and the stellar masses at the end. As the SFR can evolve with redshift, we choose to use z as an input. As a proxy for the stellar mass, we use the luminosity in the W1 band (3.4 μm) of WISE that traces the old non-ionizing stars (Wen et al. 2013; Jarrett et al. 2013). As a proxy for the SFR, we use the luminosity in the W3 band (12 μm) of WISE, which traces the emission from small grains and is directly related to the total quantity of dust (Jarrett et al. 2013; Cluver et al. 2014, 2017). Although the W4 band of WISE is also a good tracer of the SFR (Jarrett et al. 2013; Cluver et al. 2014, 2017), its larger beam size of 12″ and its poorer sensitivity could lead to an important incompleteness and a significant bias of source selection with respect to redshift. We therefore decided not to use it. As we want to estimate the SFR and the stellar mass for both galaxy types (active and passive) without any prior, we also chose as input the two colours of WISE that can segregate the galaxy types: W1–W2 (3.4 − 4.6 μm) and W2–W3 (4.6 − 12 μm) (Wright et al. 2010).

We then needed to choose which outputs would be used as reliable references for SFR and M_⋆. We chose to use the SFRs and stellar masses from the SDSS MPA-JHU DR8 catalogue, since their values are based on calibrated spectra (Brinchmann et al. 2004). These latter authors estimated the SFR for different types of galaxies using emission lines as tracers (the H_α recombination line for most galaxies, corrected for dust attenuation with the Balmer decrement H_α/H_β), and the relation with the spectral index D₄₀₀₀ (Bruzual 1983; Balogh et al. 1999) for galaxies without emission lines. To estimate the stellar mass, Brinchmann et al. (2004) used theoretical models of stellar populations fitted with Monte Carlo (Kauffmann et al. 2003) based on models from Bruzual & Charlot (2003), and a Kroupa IMF (Kroupa 2001).

3.3. Constructing the training catalogue

We construct the training set by performing a positional cross-match of the SDSS subsample of 794 633 galaxies described in Sect. 2 with the AllWISE Source Catalogue within a radius of 6″ (the beam of the W1 band of WISE from which the source positions are extracted). In order to ensure a pure catalogue, we remove all cross-match cases with multiple associations and end up with 603 293 galaxies. After removing sources with bad WISE magnitude measurements (only ∼5% of the total sources) as explained in Sect. 2, we finally end up with a reliable catalogue of 573 582 galaxies. For all these galaxies, we have access to measured or estimated redshifts, WISE w*mpro magnitudes, SFRs, and M_⋆. This is the basis of our input and output data (see previous section). We show in Fig. 6 the range of LW1, LW3, W1–W2 and W2–W3 (i.e. the input data) in the training catalogue.
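A positional cross-match that discards multiple associations can be sketched as below. This is a brute-force illustration with hypothetical function names: it treats a "multiple association" as a source of the first catalogue with more than one counterpart within the radius (an assumption about the direction of the match), and a catalogue of realistic size would need a spatial index instead of a full pairwise comparison.

```python
import numpy as np

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees, output in arcsec."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    a = np.sin((dec2 - dec1) / 2) ** 2 + np.cos(dec1) * np.cos(dec2) * np.sin((ra2 - ra1) / 2) ** 2
    return np.degrees(2 * np.arcsin(np.sqrt(a))) * 3600.0

def unique_matches(ra_a, dec_a, ra_b, dec_b, radius_arcsec=6.0):
    """Map each index of catalogue A to its B counterpart, keeping only unambiguous matches."""
    ra_b = np.asarray(ra_b, dtype=float)
    dec_b = np.asarray(dec_b, dtype=float)
    matches = {}
    for i, (ra, dec) in enumerate(zip(ra_a, dec_a)):
        sep = angular_sep_arcsec(ra, dec, ra_b, dec_b)
        hits = np.flatnonzero(sep <= radius_arcsec)
        if hits.size == 1:  # drop sources with zero or several counterparts
            matches[i] = int(hits[0])
    return matches

arcsec = 1.0 / 3600.0
matches = unique_matches(
    ra_a=[10.0, 20.0, 30.0], dec_a=[0.0, 0.0, 0.0],
    ra_b=[10.0 + 3 * arcsec, 20.0 + 2 * arcsec, 20.0 - 2 * arcsec, 40.0],
    dec_b=[0.0, 0.0, 0.0, 0.0],
)
```

In the toy input, the first source has a single counterpart 3″ away and is kept, the second has two counterparts within 6″ and is discarded, and the third has none.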

3.4. The Random Forest algorithm

The RF algorithm used in the present study is based on decision tree learning. It uses a decision tree that splits the training set optimally by reducing the Gini impurity⁸. The principle is to define if-else rules on the input features, in order to finally obtain the best and purest representation of the sets at each splitting, according to the outputs (see the documentation in the scikit-learn website⁹).

The RF algorithm uses the mean estimator of a sample of decision trees learned by bootstrapping the training set. For a training set of n samples, with X = x₁, …, x_n and Y = y₁, …, y_n, the inputs and the outputs of the machine learning, respectively, the estimator for an untrained value x′ is computed as

f̂(x′) = (1/M) Σ_{m=1}^{M} f_m(x′), (1)

where M is the number of decision trees, and f_m(x′) is the estimator for x′ of the decision tree m trained on a random sample with replacement of n elements in the sample of couples (X,Y).
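The averaging over bootstrapped trees described above can be checked numerically with scikit-learn, whose `RandomForestRegressor` exposes its fitted trees through the `estimators_` attribute. The data below are synthetic and purely illustrative; only the averaging step is the point of the sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training set: two input features, one non-linear output.
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = np.sin(4.0 * X[:, 0]) + 0.5 * X[:, 1]

# M = 40 trees, each fitted on a bootstrap resampling of (X, y).
forest = RandomForestRegressor(n_estimators=40, max_depth=12, random_state=0).fit(X, y)

# Predictions of every individual tree for five new points x'.
x_new = rng.uniform(0.0, 1.0, size=(5, 2))
per_tree = np.array([tree.predict(x_new) for tree in forest.estimators_])

# The forest estimate is the mean of the individual tree estimates.
mean_of_trees = per_tree.mean(axis=0)
forest_prediction = forest.predict(x_new)
```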

To optimise the results, some of the parameters have to be fixed, such as the number of trees M or the maximum depth of the tree (i.e. the maximum number of splitting) d_max. These parameters can be set by splitting the training set into several subsample sets, and by training and computing the score of the algorithm on these independent subsample sets. The best optimised parameters can be set when the score on independent subsample tests is the highest.

3.5. Optimisation

To set the optimal number of trees M and the maximum depth d_max, and further estimate the errors on the RF, we proceed by splitting the training set into subsamples (see Sect. 3.1). Following standard procedures, we split our sample into three randomised subsamples: 60% as the training set, 20% as the validation set, and 20% as the test set (we further check that changing the sizes of the subsamples does not affect the results of the algorithm). We train the RF on the training set, varying M and d_max. We compute the score of the RF on the validation set using the coefficient of determination 100 × R², where R² = 1 − u/v, u = Σ_i (y_i − ŷ_i)², which is the residual sum of squares, and v = Σ_i (y_i − ȳ)², the variance of the output distribution (up to a normalisation). We then set the two optimised parameters to the ones that give the highest scores on the validation set. Figure 3 shows the score of the RF on the validation set, depending on the two parameters M and d_max. For the parameter d_max, the performance of the RF increases until d_max = 12 and starts decreasing beyond. For the parameter M, a simple lower limit is enough to optimise the parameter, as increasing the value of M to higher values does not give better results. Here, setting M = 40 and d_max = 12 is sufficient to optimise the algorithm, for an optimal score of 84.5% on the validation set.
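The optimisation procedure can be sketched with scikit-learn as follows. The data are synthetic, the grid of values is arbitrary, and `r2_percent` is a hypothetical helper implementing the usual coefficient of determination; the sketch only illustrates the 60/20/20 split and the selection of M and d_max on the validation set, not the actual scores of Fig. 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def r2_percent(y_true, y_pred):
    """Coefficient of determination 100 * (1 - u / v)."""
    u = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
    v = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    return 100.0 * (1.0 - u / v)

# Synthetic data standing in for the training catalogue.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(1000, 3))
y = X[:, 0] ** 2 + np.sin(3.0 * X[:, 1]) + 0.1 * rng.normal(size=1000)

# 60% training, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Score every (M, d_max) pair on the validation set.
scores = {}
for n_trees in (5, 20, 40):
    for depth in (2, 6, 12):
        rf = RandomForestRegressor(n_estimators=n_trees, max_depth=depth, random_state=0)
        rf.fit(X_train, y_train)
        scores[(n_trees, depth)] = r2_percent(y_val, rf.predict(X_val))

best_params = max(scores, key=scores.get)
```

The test set is deliberately left untouched during this search, so that it can later provide an independent estimate of the errors.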

Fig. 3.

Percentage score of the RF results on the validation sample as a function of the RF parameters M and d_max (M being the number of trees and d_max the maximum depth). Setting M = 40 and d_max = 12 is enough in our case to optimise the RF.

3.6. Results and errors

We train the RF on the training set, with the two parameters fixed at M = 40 and d_max = 12, and we estimate SFR and M_⋆ with the RF on the test set (SFR_ML and M_⋆_ML). We then compare these results with their reference values, defined as those of the SDSS catalogue. We can hence estimate the performance of the machine-learning algorithm in terms of errors and biases. Figure 4 shows the comparison on the test set: we directly see an overall good agreement between the SDSS reference values and the values estimated with the RF algorithm, both for SFR and for M_⋆. This agreement shows that the RF algorithm is reasonably well trained.

Fig. 4.

Results of the RF on the test sample (20% of the entire sample), with optimisation parameters set to M = 40 and d_max = 12. Left panel: M_⋆ estimate with the RF compared to M_⋆ from the SDSS MPA-JHU DR8 catalogue. Right panel: SFR estimate with the RF compared to SFR from the SDSS MPA-JHU DR8 catalogue.

For the stellar mass values estimated from the RF algorithm (Fig. 4 left panel), the scatter between the estimated and reference SDSS values is quantified through the variance σ²_{M_⋆} of their difference. The associated standard deviation is σ_{M_⋆} = 0.16 dex, which translates into an error of a factor 10^σ_{M_⋆} = 1.45 with respect to the reference value. For the SFR (Fig. 4 right panel), the scatter is larger: the variance σ²_SFR gives a standard deviation of σ_SFR = 0.38 dex, and an error of a factor 10^σ_SFR = 2.40.
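The conversion between a scatter in dex and a multiplicative error factor is simple arithmetic, sketched below. The helper names are hypothetical, and the scatter is computed here as the standard deviation of the residuals about their mean, which is an assumption about the exact estimator.

```python
import numpy as np

def scatter_dex(log_estimated, log_reference):
    """Standard deviation (in dex) of the residuals between two sets of log10 values."""
    resid = np.asarray(log_estimated, dtype=float) - np.asarray(log_reference, dtype=float)
    return float(np.std(resid))

def dex_to_factor(sigma_dex):
    """Multiplicative error factor corresponding to a scatter expressed in dex."""
    return 10.0 ** sigma_dex

factor_mass = dex_to_factor(0.16)  # about 1.45
factor_sfr = dex_to_factor(0.38)   # about 2.40
toy_scatter = scatter_dex([1.0, 2.0], [1.1, 1.9])
```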

3.7. Chasing the biases

It is important to have precise results, with error bars estimated from the RF for both SFR and the M_⋆ values and it is of equal importance to have accurate, that is, unbiased, results.

We first investigate potential biases induced by the redshift dependence of the SFR and stellar mass in the redshift range, 0 < z < 0.3, of the training catalogue used in our study. We display in Fig. 5 the errors (defined as the difference between machine-learning estimated values and SDSS values), for M_⋆ (left panel) and SFR (right panel) for the galaxies of the test set. No obvious bias on redshift is observed. In the left panel of Fig. 5 we notice a slight increase of the scatter for M_⋆ at very low redshifts. This is discussed in Sect. 4.2.

Fig. 5.

Errors of the RF results obtained for the test sample (same errors as those presented in Fig. 4) as a function of redshift for M_⋆ and SFR.

Fig. 6.

Histograms showing the range of the input data (luminosities and colours) for the sources of the training catalogue.

Another type of bias can be induced by the galaxy types. As we want a scatter of the same order for both passive and active galaxies, we compare the results of the RF algorithm as a function of the BPT classes provided in the SDSS MPA-JHU catalogue. To check that the BPT class is a reliable indicator of the galaxy type, we show in Fig. 2 the main sequence diagram of galaxies (SFR vs. M_⋆ as provided in the SDSS catalogue) with their BPT classes from the SDSS catalogue. We see that the red contours of passive galaxies, BPT = −1, are well in the cloud of red and dead galaxies, the blue contours of star-forming galaxies, BPT = 1, are well along the main sequence, and the transitioning galaxies, BPT = 2, are well in the green valley. The positions in the BPT diagram can therefore be taken as a reliable indicator of galaxy type. In Fig. 7, we show the results of the RF (same as those displayed in Fig. 4), with the contour colours displaying the different BPT classes. The RF performs equally well for any type of galaxy and we do not observe any strong bias induced by galaxy type. Moreover, the scatter of the results depends only very slightly on galaxy type. For passive galaxies, BPT = −1, the scatter on M_⋆ tends to be reduced: we find σ_SFR = 0.38 dex and σ_{M_⋆} = 0.11 dex. For active galaxies, BPT = 1, the inverse trend is seen and the scatter on the SFR tends to be reduced with a small increase of the scatter on M_⋆: we find σ_SFR = 0.30 dex and σ_{M_⋆} = 0.23 dex. For transitioning galaxies, BPT = 2, we find roughly the same scatters as the overall ones on the global set given in Sect. 3.6, with σ_SFR = 0.39 dex and σ_{M_⋆} = 0.13 dex. A summary of the different scatters is shown in Table 1.

Fig. 7.

SFR and M_⋆ obtained with the RF algorithm on the test set compared to the SDSS classification based on the BPT diagram. Colour code of the contours is the same as in Fig. 2.

Table 1.

Summary of the different scatters obtained on the same test set with different methods.

3.8. Learning from the learning

One advantage of the RF algorithm is that we can learn about the importance of the inputs during the training. An example of learning from the learning is to train the RF algorithm to estimate either M_⋆ alone, SFR alone, or both SFR and M_⋆, and to compare the importance of the features in each training (the left, middle, and right panels in Fig. 8, respectively). From the five input features, the first obvious tendency is the low impact of the redshift (which here only carries the remaining dependence on distance, as the redshift is also hidden in the luminosities LW1 and LW3) and of the colour W1–W2 on the training, whatever the outputs. For the estimation of M_⋆ alone, it is clear that the luminosity LW1, as expected, is the main feature used to train the RF, with a very slight contribution from the colour W2–W3. For the SFR estimation alone, two main features are used to train the RF, as expected: the luminosity LW3 and the colour W2–W3 used to segregate the two main populations of galaxies. The case where we train the RF to estimate both SFR and M_⋆ shows that the two main features are the luminosity LW1 and the colour W2–W3, with a slight contribution (of approx. 5%) of the luminosity LW3. This indicates that the two quantities LW1 and W2–W3 are the most efficient to classify and segregate galaxy populations (see also Fig. 9, where the two populations of galaxies, i.e. red and blue, are very well separated).
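In scikit-learn, the importance of each input is available after training through the `feature_importances_` attribute of the fitted forest. The sketch below uses synthetic data: the feature names are borrowed from the text only to label the columns, and the toy target is constructed by hand, so the resulting ranking illustrates the mechanism and does not reproduce Fig. 8.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 2000
names = ["z", "LW1", "LW3", "W1-W2", "W2-W3"]

# Synthetic inputs; the toy target depends strongly on the "LW1" column,
# moderately on the "W2-W3" column, and not at all on the others.
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 1] + 1.0 * X[:, 4] + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=40, max_depth=12, random_state=0).fit(X, y)

# Importances are normalised so that they sum to one.
importance = dict(zip(names, rf.feature_importances_))
ranked = sorted(importance, key=importance.get, reverse=True)
```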

Fig. 8.

Feature importance during the RF training. Left panel: RF trained to perform M_⋆ estimates only. Middle panel: RF trained to perform SFR estimates only. Right panel: RF trained to estimate both SFR and M_⋆.

Fig. 9.

1σ and 3σ iso-densities of the test set in the WISE colour-luminosity W2–W3/LW1 diagram. The line styles are the same as in Fig. 2.

The SFR and stellar mass estimated with the RF method rely on the redshift information. Redshifts are used to compute the luminosities and they impact the evolution of the mean global SFR over time. The need for redshifts to compute M_⋆ and the SFR is very restrictive, as spectroscopic redshifts are hard to obtain and photometric ones are not always precise. We have tested the performance of the RF method without any redshift information. This implies that we do not perform any k-correction on the magnitudes and we do not compute the luminosities. The inputs are thus only the two magnitudes W1 and W3 and the two colours W1–W2 and W2–W3. In Fig. 10, we show the results on the test set. We find a scatter of σ_{M_⋆} = 0.32 dex for M_⋆ and a scatter of σ_SFR = 0.43 dex for the SFR estimation (compared to σ_{M_⋆} = 0.16 dex and σ_SFR = 0.38 dex with the information about the redshift; see also Table 1). The accuracy of the method is highly degraded, and we therefore keep the redshift among our inputs.

Fig. 10.

Results of the RF on the test sample (20% of the entire sample), with only W1, W3, W1–W2 and W2–W3 in input and without information about the redshift. Left panel: M_⋆ estimate with the RF compared with M_⋆ from the SDSS MPA-JHU DR8 catalogue. Right panel: SFR estimate with the RF compared with SFR from the SDSS MPA-JHU DR8 catalogue.

3.9. Applying to photometric redshift catalogues

The machine-learning algorithm is trained on a spectroscopic-redshift catalogue and can be applied to high-accuracy photometric-redshift catalogues. In order to test this, we add an error σ_z(1 + z), illustrative of errors from photometric redshifts, to the redshifts of the test sample. We then estimate the two properties, SFR and M_⋆. In Fig. 11, we show in blue the evolution of the scatter of the SFR and M_⋆ estimates as a function of σ_z(1 + z). We see a large increase of the scatter for both properties with decreasing redshift accuracy. This trend has two origins. On the one hand, the increase in redshift errors obviously impacts the SFR and M_⋆ estimates. On the other hand, an additional bias increases the errors on the SFR and M_⋆ estimates. This is illustrated in the left panel of Fig. 12, which shows the error on the SFR for σ_z(1 + z) = 0.015. This bias can be corrected for. We have modelled it with a simple exponential a × exp(−z/z₀) and show its evolution in Fig. 13. In Fig. 11, we show in orange the scatter on the bias-corrected properties. The scatters are significantly reduced, but are still σ_{M_⋆} = 0.35 dex and σ_SFR = 0.44 dex at σ_z(1 + z) = 0.03. If we focus on the SFR and M_⋆ estimates in the range 0.1 < z < 0.3, the scatters (shown in green) are reduced to more reasonable values, namely σ_{M_⋆} = 0.24 dex and σ_SFR = 0.42 dex at σ_z(1 + z) = 0.03 (the accuracy expected for the photometric-redshift catalogue of Euclid). In this way, one could apply the present approach to the whole sky based on present or future photometric-redshift catalogues like WISExSCOS, DES, LSST, Pan-Starrs, or Euclid.
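The degradation test described above can be sketched as follows. Two assumptions are made for illustration: the added error is drawn from a Gaussian whose standard deviation is the quoted value multiplied by (1 + z), and the bias is described by the exponential form given in the text with free parameters a and z₀. The function names, the uniform redshift distribution, and the numerical values are hypothetical.

```python
import numpy as np

def perturb_redshift(z, sigma, rng):
    """Add Gaussian noise of standard deviation sigma * (1 + z) to each redshift."""
    z = np.asarray(z, dtype=float)
    return z + rng.normal(size=z.shape) * sigma * (1.0 + z)

def bias_model(z, a, z0):
    """Exponential bias model a * exp(-z / z0)."""
    return a * np.exp(-np.asarray(z, dtype=float) / z0)

rng = np.random.default_rng(3)

# Toy spectroscopic redshifts in the range of the training catalogue.
z_spec = rng.uniform(0.01, 0.3, size=10000)

# Mock photometric redshifts with a fractional error of 0.03.
z_phot = perturb_redshift(z_spec, sigma=0.03, rng=rng)
frac_err = (z_phot - z_spec) / (1.0 + z_spec)
```

The perturbed redshifts would then replace the spectroscopic ones when computing the input luminosities, and the fitted `bias_model` would be subtracted from the resulting estimates.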

Fig. 11.

Evolution of the scatters of the properties estimated with the RF as a function of the redshift error. Left panel: scatter for M_⋆ estimation. Right panel: scatter for SFR estimation. The blue lines show the scatters for the whole sample, regardless of the induced bias. The orange lines show the scatters of the bias-corrected properties following the laws in Fig. 13. The green lines show the scatters in the redshift range 0.1 < z < 0.3.

Fig. 12.

Example of bias induced by the redshift error. Left panel: errors on the estimated SFR values as a function of redshift, for σ_z(1 + z) = 0.015. The blue line corresponds to the modelled bias. Right panel: same errors as in the left panel, but corrected for the bias.

Fig. 13.

Evolution of the bias seen as a function of redshift, for different redshift errors (indicated by the colours). Left panel: bias to correct for M_⋆ estimations. Right panel: bias to correct for SFR estimations.

4. Discussion

4.1. Comparison

Determination of SFR and M_⋆ is an active topic, and other studies have provided analytical formulae, some of them also based on the WISE luminosities (e.g. Wen et al. 2013; Jarrett et al. 2013; Cluver et al. 2014, 2017). We compare the estimates of SFR and M_⋆ derived from the RF algorithm with those computed using different approaches but based on the same observables (WISE luminosities: LW1 and LW3). We focus on the M_⋆ estimates with the relation from Wen et al. (2013), using LW1:

(2)

and on the SFR estimates from Cluver et al. (2014) using LW3, for star-forming galaxies:

(3)

We compute M_⋆ with Eq. (2) for all galaxies and compare with the masses reported in the SDSS MPA-JHU catalogue (left panel of Fig. 14). We also show the 1, 3, and 5σ contours of the RF estimates for the same galaxies. This comparison shows the smaller scatter of the masses estimated with the RF algorithm, which is not surprising considering that we have five inputs compared to only one. For these specific sources, we find σ_{M_⋆_Wen} = 0.23 dex and σ_{M_⋆_ML} = 0.16 dex (see Table 1).

Fig. 14.

Comparisons on the test set. Left panel: M_⋆ computed with the method of Wen et al. (2013), using the luminosity in W1, compared with the SDSS masses. The contours show the 1, 3, and 5σ isodensities of the RF results (Fig. 4). Right panel: SFR computed only for star-forming galaxies with the method of Cluver et al. (2014), using the luminosity in W3, compared to SFR from the SDSS catalogue. The blue contours represent the results of the RF for the same population (blue contours in Fig. 7), and the red contours represent the SFR estimation for passive galaxies computed with Cluver’s formula. In both panels, the dashed line represents the one-to-one correlation.

The SFR values are computed only for star-forming galaxies (BPT = 1, to satisfy the conditions of Cluver et al. 2014) following Eq. (3) and are compared with the SFR in the SDSS MPA-JHU catalogue (right panel of Fig. 14). The blue contours represent the results of the RF for the same population (blue contours in Fig. 7), and the red contours represent the SFR estimates for passive galaxies (BPT = −1) computed with the Cluver et al. (2014) formula (Eq. (3)). We find a smaller scatter for the SFR estimations from RF; this is again expected since we use five inputs compared to one. We also show the limitation of the application domain of a linear relation between a luminosity and an SFR, in terms of its dependence on galaxy type (huge bias of the red contours). We find σ_{SFR_Cluver} = 0.47 dex and σ_{SFR_ML} = 0.30 dex for active galaxies (see Table 1), while for passive galaxies we have σ_{SFR_Cluver} = 0.49 dex and σ_{SFR_ML} = 0.38 dex and a bias (defined as the absolute difference of the means) of b_Cluver = 0.93 dex compared with b_ML = 0.04 dex.
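For reference, the two statistics quoted in this comparison can be computed as in the sketch below. The bias follows the definition given in the text (absolute difference of the means). The scatter is assumed here to be the standard deviation of the residuals in dex, which is an assumption of this sketch rather than a definition stated in this section. Function names are hypothetical.

```python
import math


def scatter_dex(log_estimates, log_references):
    """Standard deviation of the residuals (estimate - reference), in dex.

    Assumption: scatter is taken as the population standard deviation.
    """
    residuals = [e - r for e, r in zip(log_estimates, log_references)]
    mean = sum(residuals) / len(residuals)
    variance = sum((x - mean) ** 2 for x in residuals) / len(residuals)
    return math.sqrt(variance)


def bias_dex(log_estimates, log_references):
    """Absolute difference between the mean estimate and the mean reference, in dex."""
    mean_est = sum(log_estimates) / len(log_estimates)
    mean_ref = sum(log_references) / len(log_references)
    return abs(mean_est - mean_ref)
```

Both functions expect logarithmic quantities (for example log SFR or log M_⋆), so that the results are directly expressed in dex.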

As a second comparison, we use a catalogue of galaxies whose SFRs were estimated with an alternative method: the extended version of the COLD GASS (CO Legacy Database for GASS) catalogue of nearby galaxies, xCOLD GASS¹⁰ (Saintonge et al. 2017). The sample contains 532 SDSS galaxies with IRAM-30m CO(1–0) observations, selected in mass (M_⋆ > 10⁹ M_⊙), that span a large range of SFR values and galaxy types. In the words of the authors, “because the COLD GASS sample is large and unbiased, it serves as the perfect reference for studies of particular galaxy populations”. Saintonge et al. (2017) provide ancillary information such as M_⋆ and the SFR, the latter computed with the method of Janowiecki et al. (2017) using a combination of UV from the Galaxy Evolution Explorer¹¹ (GALEX) and IR from WISE. We show in Fig. 15 the results of the RF compared with the values of the xCOLD GASS catalogue (shown as S17), for the galaxies with only one association (no multiple source blending) with the AllWISE catalogue within a radius of 6 arcsec. The overall agreement is good, although a small bias is seen in the SFR estimation, especially for passive galaxies. This bias is partially due to the way the SFR is estimated: we take the SDSS MPA-JHU estimates as the reference to train the model, whereas the xCOLD GASS SFRs are computed with both IR and UV data. The reliability of the values estimated with the RF is directly tied to the reliability of the values chosen as outputs, that is, the SDSS MPA-JHU values. The bias between the SDSS MPA-JHU values and the UV+IR estimates is also visible in the right panel of Fig. 15, which suggests that some unobscured UV SFR may not be seen in WISE or in SDSS. The method could be applied to any other value-added catalogue of galaxies chosen as outputs of the machine-learning training, should more robust SFR or M_⋆ estimators (via multi-wavelength proxies, i.e. IR+UV) allow the model to learn more accurate predictions (see Fig. 15).
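The selection of sources with a single AllWISE association within 6 arcsec can be illustrated with the sketch below. It is a brute-force version written for clarity, not the matching procedure actually used; the function names and the input layout (lists of (RA, Dec) pairs in degrees) are hypothetical.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Haversine separation between two sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def unique_matches(targets, catalogue, radius_arcsec=6.0):
    """Map each target index to its counterpart index when exactly one exists.

    Targets with zero or several counterparts inside the radius are dropped,
    which mimics the "only one association" criterion.
    """
    matches = {}
    for i, (ra, dec) in enumerate(targets):
        hits = [j for j, (c_ra, c_dec) in enumerate(catalogue)
                if angular_separation_arcsec(ra, dec, c_ra, c_dec) <= radius_arcsec]
        if len(hits) == 1:
            matches[i] = hits[0]
    return matches
```

The loop scales as the product of the two catalogue sizes, which is acceptable for a few hundred targets; a spatial index would be preferable for larger samples.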

Fig. 15.

Comparison with the xCOLD GASS catalogue from Saintonge et al. (2017) for the sources that match only once with the AllWISE catalogue in a 6 arcsec radius. Left panel: M_⋆ computed with the RF compared with the M_⋆ provided by the catalogue (SDSS MPA-JHU values). Middle panel: SFR computed by the RF compared with the SFR provided by xCOLD GASS, computed with combined UV and IR data. Right panel: SFR given by the SDSS MPA-JHU DR8 catalogue compared with the SFR from xCOLD GASS, computed with combined UV and IR data.

4.2. Limitation

The results of the RF algorithm agree well with previous works, in particular those of Wen et al. (2013) and Cluver et al. (2014). The domain of application of all three studies, in terms of galaxies and more precisely in terms of redshifts, is similar (mean redshift around z = 0.15). Comparison with other available catalogues providing SFR and M_⋆ is not possible when the sources under consideration lie too far from the domain of application of the RF algorithm, that is, of its learned model. We focus on two extreme cases where the machine learning cannot provide reasonable estimates: on the one hand, nearby galaxies with redshifts z < 0.01, and on the other hand, high-redshift galaxies, up to z = 8.

A sample of nearby (z < 0.01) star-forming galaxies was constructed combining a Spitzer catalogue SINGS (Kennicutt et al. 2003) and a Herschel Space Observatory catalogue KINGFISH (Kennicutt et al. 2011). This SINGS/KINGFISH catalogue of 79 sources was used by Cluver et al. (2017) to successfully fit the relation between SFR and LW3. In this catalogue, the redshift domain (z < 0.01) implies that the galaxies are resolved and therefore the use of the WISE Atlas Images is needed to accurately measure the fluxes of the objects (Saintonge et al. 2017; Cluver et al. 2017). The effect of a miscomputation of the IR fluxes for very-low-redshift resolved galaxies can be seen in Fig. 5 as a higher scatter in M_⋆ estimations for very-low-redshift galaxies (z < 0.01).

An example of distant galaxies is the COSMOS2015 catalogue¹² (Laigle et al. 2016), which provides apparent magnitudes in 30 bands for approximately half a million objects up to redshifts z = 8. It also provides photometric redshifts, SFRs, and stellar masses all computed using the code LEPHARE¹³. Here again we cannot directly apply our RF algorithm, that is, its learned model, and compare our estimates with those of the COSMOS catalogue. The main issue here is the resolution of WISE of 6 arcsec; too many COSMOS sources are associated with a WISE galaxy inside the WISE beam and therefore correct association with the AllWISE catalogue is impossible.

Finally, as the RF algorithm has been trained on galaxies up to z = 0.3, using it beyond this redshift limit will lead to strong biases as the training did not include such sources. We conclude that the application domain of the method is 0.01 < z < 0.3.

4.3. Application

As shown in statistical studies of galaxy populations in large galaxy surveys (e.g. SDSS, York et al. 2000; the VIMOS Public Extragalactic Redshift Survey, VIPERS, Scodeggio et al. 2018), most galaxies residing in dense environments of the cosmic web (galaxy clusters or inner parts of cosmic filaments) are passive, that is, red and dead (Malavasi et al. 2017a,b; Kraljic et al. 2018). The information about the activity of galaxies is indeed used to detect galaxy clusters through their red sequence, as originally shown by Gladders & Yee (2000; see e.g. Rykoff et al. 2014, who developed the RedMapper algorithm).

As with the specific SFR (which illustrates the efficiency of a galaxy in forming stars), the distance to the main sequence on an SFR vs. M_⋆ plot (translated into the colours in Fig. 1) informs us about the activity of a galaxy. Since this distance, called d2ms, is directly related to how far galaxies are from being star-forming, we prefer to introduce the term passivity rather than activity. The redder the point, the more distant it is from the main sequence of star-forming galaxies and the more passive the galaxy is. This quantity (quite efficiently estimated by our method) can be a very useful property to segregate populations of star-forming, green, or passive galaxies. We already successfully used this estimate in Bonjean et al. (2018) to characterise the properties of galaxies in a cosmic bridge between two galaxy clusters.

We show in Fig. 16 another simple illustration with the case study of the galaxy population in the field of the Coma cluster at z = 0.0231. As in Bonjean et al. (2018), we used the union of the 2MPZ and the WISExSCOS catalogues of photometric redshifts as the base galaxy catalogue. Extracting a region of 15° around the Coma cluster, we select galaxies in the range 0.01 < z < 0.05, and separate their populations by performing cuts in the distance to the main sequence d2ms (see Fig. 1), computed from the SFR and M_⋆. The cuts correspond to d2ms < 0.4, 0.4 < d2ms < 1.25, and d2ms > 1.25 for blue, green, and red galaxies, respectively (defined using Fig. 1). We clearly see the overdensity of red galaxies, as expected, clustered in the centre of the Coma cluster, and a more uniform distribution of blue galaxies around the Coma cluster and in the field. The galaxies are overlaid on the thermal Sunyaev-Zel’dovich MILCA map from Planck (Planck Collaboration XXII 2016). This illustrates well the complexity of the galaxy dynamics inside the Coma cluster.
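The population cuts quoted above can be written compactly as in the following sketch. The thresholds (0.4 and 1.25) and the redshift window (0.01 < z < 0.05) are those given in the text; the function names, the tuple layout of the input, and the choice to assign values lying exactly on a threshold to the green population are assumptions made for illustration.

```python
def classify_by_d2ms(d2ms, blue_max=0.4, red_min=1.25):
    """Label a galaxy from its distance to the main sequence (d2ms)."""
    if d2ms < blue_max:
        return "blue"
    if d2ms > red_min:
        return "red"
    return "green"  # boundary values fall here by convention (assumption)


def split_populations(galaxies, z_min=0.01, z_max=0.05):
    """Group galaxy names by population, keeping only z_min < z < z_max.

    `galaxies` is an iterable of (name, redshift, d2ms) tuples.
    """
    populations = {"blue": [], "green": [], "red": []}
    for name, z, d2ms in galaxies:
        if z_min < z < z_max:
            populations[classify_by_d2ms(d2ms)].append(name)
    return populations
```

The d2ms values themselves would come from the SFR and M_⋆ estimated by the RF, following the main-sequence definition of Fig. 1.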

Fig. 16.

Example of application of our method for galaxy populations in a 10° × 10° field centred on the Coma cluster at z = 0.0231. We have estimated SFR and M_⋆ for the galaxies of the union of the 2MPZ and WISExSCOS catalogues with 0.01 < z < 0.05. Based on the “passivity” cut (d2ms parameter) used to segregate the populations, we select blue, green, and red galaxies corresponding to d2ms < 0.4, 0.4 < d2ms < 1.25, and d2ms > 1.25, respectively. The galaxies are overlaid on the thermal Sunyaev-Zel’dovich MILCA map from Planck (Planck Collaboration XXII 2016).

5. Summary

Determining star-formation activity proxies such as the SFR and M_⋆ from UV, optical, or IR luminosities relies on complex modelling and on priors on galaxy properties, and does not accurately describe the passive galaxies, which are of particular interest in studying large-scale structures. We have developed a method based on machine learning to estimate the SFR and M_⋆ of galaxies, in the redshift range 0.01 < z < 0.3, over the whole usable sky when their redshifts are known. The algorithm is trained on the redshift z, the luminosities LW1 and LW3, and the colours W1–W2 and W2–W3. These input properties permit a very efficient segregation between the different galaxy types and computation of the stellar masses and SFR. As outputs to train the algorithm, we have chosen the SDSS MPA-JHU DR8 values of SFR and M_⋆ as reference values. However, the method can be adapted to any catalogue of value-added SFR and M_⋆ cross-matched with the AllWISE catalogue. With our method, we obtain typical errors of σ_SFR = 0.38 dex and σ_{M_⋆} = 0.16 dex, independently of galaxy type, and unbiased with respect to redshift in the range z < 0.3.

In future works, we will extend the redshift range of the application up to z ∼ 0.5, and possibly beyond. In doing so, our method will be useful to characterise cosmic structures, and can therefore be applied, for example, to studies of the dependence of galaxy populations on environments or galaxy cluster/cosmic filament detection in the context of future large galaxy surveys.

¹

https://www.lsst.org

²

https://www.darkenergysurvey.org

³

https://www.euclid-ec.org

⁴

http://scikit-learn.org/

⁵

Available at http://wise2.ipac.caltech.edu/docs/release/allwise/expsup/sec1_3.html#src_cat

⁶

http://sdss3.org/dr8/

⁷

http://scikit-learn.org/

⁸

Detailed here: https://scikit-learn.org/stable/modules/tree.html#classification-criteria

⁹

http://scikit-learn.org/

¹⁰

Publicly available at http://www.star.ucl.ac.uk/xCOLDGASS/index.html

¹¹

http://www.galex.caltech.edu

¹²

http://cosmos.astro.caltech.edu

¹³

http://www.cfht.hawaii.edu/~arnouts/LEPHARE/lephare.html

Acknowledgments

The authors thank the anonymous referee for her/his useful comments. The authors acknowledge fruitful discussions with M. Bilicki, A. Decelle, M. Huertas-Company, and H. J. McCracken. We especially thank M. Cluver for providing her data on the SINGS/KINGFISH sample. This publication used data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, funded by NASA. This research made use of Astropy, the community-developed core Python package (Astropy Collaboration 2013). This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement ERC-2015-AdG 695561.

References

Aghanim, N., Hurier, G., Diego, J.-M., et al. 2015, A&A, 580, A138
Alatalo, K., Cales, S. L., Appleton, P. N., et al. 2014, ApJ, 794, L13
Astropy Collaboration (Robitaille, T. P., et al.) 2013, A&A, 558, A33
Baldwin, J. A., Phillips, M. M., & Terlevich, R. 1981, PASP, 93, 5
Balogh, M. L., Morris, S. L., Yee, H. K. C., Carlberg, R. G., & Ellingson, E. 1999, ApJ, 527, 54
Bilicki, M., Jarrett, T. H., Peacock, J. A., Cluver, M. E., & Steward, L. 2014, ApJS, 210, 9
Bilicki, M., Peacock, J. A., Jarrett, T. H., et al. 2016, ApJS, 225, 5
Brinchmann, J., Charlot, S., White, S. D. M., et al. 2004, MNRAS, 351, 1151
Bonjean, V., Aghanim, N., Salomé, P., Douspis, M., & Beelen, A. 2018, A&A, 609, A49
Bruzual, A. G. 1983, ApJ, 273, 105
Bruzual, G., & Charlot, S. 2003, MNRAS, 344, 1000
Calzetti, D., Kinney, A. L., & Storchi-Bergmann, T. 1994, ApJ, 429, 582
Calzetti, D., Kennicutt, R. C., Engelbracht, C. W., et al. 2007, ApJ, 666, 870
Chabrier, G. 2003, PASP, 115, 763
Cluver, M. E., Jarrett, T. H., Hopkins, A. M., et al. 2014, ApJ, 782, 90
Cluver, M. E., Jarrett, T. H., Dale, D. A., et al. 2017, ApJ, 850, 68
Cutri, R. M., Wright, E. L., Conrow, T., et al. 2013, Explanatory Supplement to the AllWISE Data Release Products
Delli Veneri, M., Cavuoti, S., Brescia, M., Riccio, G., & Longo, G. 2018, ArXiv e-prints [arXiv: 1805.06338]
Domínguez Sánchez, H., Huertas-Company, M., Bernardi, M., Tuccillo, D., & Fischer, J. L. 2018, MNRAS, 476, 3661
Dubois, Y., Gavazzi, R., Peirani, S., & Silk, J. 2013, MNRAS, 433, 3297
Elbaz, D., Daddi, E., Le Borgne, D., et al. 2007, A&A, 468, 33
Gladders, M. D., & Yee, H. K. C. 2000, AJ, 120, 2148
Haas, M. R., & Anders, P. 2010, A&A, 512, A79
Ho, T. K. 1995, Random Decision Forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, 14–16 August 1995, 278
Huertas-Company, M., Gravet, R., Cabrera-Vives, G., et al. 2015, ApJS, 221, 8
Janowiecki, S., Catinella, B., Cortese, L., et al. 2017, MNRAS, 466, 4795
Jarrett, T. H., Masci, F., Tsai, C. W., et al. 2013, AJ, 145, 6
Kauffmann, G., Heckman, T. M., White, S. D. M., et al. 2003, MNRAS, 341, 33
Kennicutt, Jr., R. C. 1998, ARA&A, 36, 189
Kennicutt, R. C., & Evans, N. J. 2012, ARA&A, 50, 531
Kennicutt, Jr., R. C., Armus, L., Bendo, G., et al. 2003, PASP, 115, 928
Kennicutt, Jr., R. C., Hao, C.-N., Calzetti, D., et al. 2009, ApJ, 703, 1672
Kennicutt, R. C., Calzetti, D., Aniano, G., et al. 2011, PASP, 123, 1347
Krakowski, T., Małek, K., Bilicki, M., et al. 2016, A&A, 596, A39
Kraljic, K., Arnouts, S., Pichon, C., et al. 2018, MNRAS, 474, 547
Kroupa, P. 2001, MNRAS, 322, 231
Lagache, G., Puget, J.-L., & Dole, H. 2005, ARA&A, 43, 727
Laigle, C., McCracken, H. J., Ilbert, O., et al. 2016, ApJS, 224, 24
Leger, A., & Puget, J. L. 1984, A&A, 137, L5
Lucie-Smith, L., Peiris, H. V., Pontzen, A., & Lochner, M. 2018, MNRAS, 479, 3405
Malavasi, N., Arnouts, S., Vibert, D., et al. 2017a, MNRAS, 465, 3817
Malavasi, N., Pozzetti, L., Cucciati, O., et al. 2017b, MNRAS, 470, 1274
Moore, B., Katz, N., Lake, G., Dressler, A., & Oemler, A. 1996, Nature, 379, 613
Pashchenko, I. N., Sokolovsky, K. V., & Gavras, P. 2018, MNRAS, 475, 2326
Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, J. Mach. Learn. Res., 12, 2825
Peng, Y., Maiolino, R., & Cochrane, R. 2015, Nature, 521, 192
Planck Collaboration XIII. 2016, A&A, 594, A13
Planck Collaboration XXII. 2016, A&A, 594, A22
Rykoff, E. S., Rozo, E., Busha, M. T., et al. 2014, ApJ, 785, 104
Saintonge, A., Kauffmann, G., Kramer, C., et al. 2011, MNRAS, 415, 32
Saintonge, A., Catinella, B., Tacconi, L. J., et al. 2017, ApJS, 233, 22
Salim, S., Rich, R. M., Charlot, S., et al. 2007, ApJS, 173, 267
Salpeter, E. E. 1955, ApJ, 121, 161
Scodeggio, M., Guzzo, L., Garilli, B., et al. 2018, A&A, 609, A84
Siudek, M., Małek, K., Pollo, A., et al. 2018a, A&A, 617, A70
Siudek, M., Małek, K., Pollo, A., et al. 2018b, MNRAS, submitted [arXiv: 1805.09905]
Tuccillo, D., Huertas-Company, M., Decencière, E., et al. 2018, MNRAS, 475, 894
Ucci, G., Ferrara, A., Pallottini, A., & Gallerani, S. 2018, MNRAS, 477, 1484
Viquar, M., Basak, S., Dasgupta, A., Agrawal, S., & Saha, S. 2018, ArXiv e-prints [arXiv: 1804.05051]
Wen, X.-Q., Wu, H., Zhu, Y.-N., et al. 2013, MNRAS, 433, 2946
Werner, M. W., Roellig, T. L., Low, F. J., et al. 2004, ApJS, 154, 1
Wright, E. L., Eisenhardt, P. R. M., Mainzer, A. K., et al. 2010, AJ, 140, 1868
York, D. G., Adelman, J., Anderson, Jr., J. E., et al. 2000, AJ, 120, 1579

All Tables

Table 1.

Summary of the different scatters obtained on the same test set with different methods.

