Nat Commun. 2023 Feb 23;14(1):1040. doi: 10.1038/s41467-023-36583-0.

Abstract representations emerge naturally in neural networks trained to perform multiple tasks


W Jeffrey Johnston et al. Nat Commun.

Abstract

Humans and other animals demonstrate a remarkable ability to generalize knowledge across distinct contexts and objects during natural behavior. We posit that this ability to generalize arises from a specific representational geometry, which we call abstract and which is referred to as disentangled in machine learning. These abstract representations have been observed in recent neurophysiological studies. However, it is unknown how they emerge. Here, using feedforward neural networks, we demonstrate that learning multiple tasks, with either supervised or reinforcement learning, causes abstract representations to emerge. We show that these abstract representations enable few-sample learning and reliable generalization on novel tasks. We conclude that abstract representations of sensory and cognitive variables may emerge from the multiple behaviors that animals exhibit in the natural world and, as a consequence, could be pervasive in high-level brain regions. We also make several specific predictions about which variables will be represented abstractly.


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. The abstraction metrics and input representations.
a Two example classification tasks. (left) A classification learned between red and blue berries of one shape should generalize to other shapes. (right) A classification between red berries of two different shapes should generalize to blue berries of different shapes. b Examples of linear, abstract (left), and nonlinear, non-abstract (right) representations of the four example berries. c Schematic of the input model. d Schematic of the multi-tasking model. e Schematic of our two abstraction metrics, the classifier generalization metric (left) and the regression generalization metric (right).
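The multi-tasking model in panel d can be summarized as a feedforward network trained to produce P binary classification outputs at once. The sketch below is a numpy-only illustration of that setup, not the paper's implementation: it feeds the latent variables in directly rather than through a high-dimensional input layer, and the layer sizes, learning rate, and random linear tasks are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, P, H, N = 5, 10, 100, 2000                    # latents, tasks, hidden units, samples
Z = rng.uniform(-1, 1, size=(N, D))              # latent variables used as input
task_w = rng.standard_normal((D, P))             # P random linear classification tasks
Y = (Z @ task_w > 0).astype(float)               # binary target for each task

W1 = 0.3 * rng.standard_normal((D, H)); b1 = np.zeros(H)
W2 = 0.3 * rng.standard_normal((H, P)); b2 = np.zeros(P)

losses = []
for _ in range(300):
    R = np.maximum(Z @ W1 + b1, 0.0)             # representation layer (ReLU)
    Yhat = 1.0 / (1.0 + np.exp(-(R @ W2 + b2)))  # one sigmoid output per task
    losses.append(-np.mean(Y * np.log(Yhat + 1e-9)
                           + (1 - Y) * np.log(1 - Yhat + 1e-9)))
    G = (Yhat - Y) / N                           # BCE gradient, averaged over samples
    gW2, gb2 = R.T @ G, G.sum(axis=0)
    GR = (G @ W2.T) * (R > 0)                    # backpropagate through the ReLU
    gW1, gb1 = Z.T @ GR, GR.sum(axis=0)
    lr = 0.5
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```

After training, R is the representation layer whose geometry the figures analyze; the paper's claim is that its abstraction increases with the number of tasks P.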
Fig. 2. The input model.
a Schematic of the input model. Here, schematized for D = 2 latent variables. The quantitative results are for D = 5. b The 2D response fields of 25 random units from the high-d input layer of the standard input. c The same concentric square structure shown to represent the latent variables in a after being transformed by the standard input. d (left) The per-unit sparseness (averaged across units) for the latent variables (S = 0, by definition) and standard input (S = 0.97). (right) The embedding dimensionality, as measured by participation ratio, of the latent variables (5, by definition) and the standard input (~190). e Visualization of the level of abstraction present in the high-d input layer of the standard input, as measured by the classifier generalization metric (left) and regression generalization metric (right). In both cases, the y-axis shows the distance from the learned classification hyperplane (right: regression output) for a classifier (right: regression model) trained to decode the sign of the latent variable on the y-axis (right: the value of the latent variable on the y-axis) only on representations from the left part of the x-axis (trained). The position of each point on the x-axis is the output of a linear regression for a second latent variable (trained on the whole latent variable space). The points are colored according to their true category (value). f The performance of a classifier (left) and regression (right) when it is trained and tested on the same region of latent variable space (trained) or trained in one region and tested in a non-overlapping region (tested, similar to e). Both models are trained directly on the latent variables and on the input representations produced by the standard input. The gray line is chance. The standard input produces representations with significantly decreased generalization performance.
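The participation ratio reported in panel d has a simple closed form: PR = (Σᵢ λᵢ)² / Σᵢ λᵢ², where λᵢ are the eigenvalues of the response covariance matrix. A minimal sketch, with a sparse population of RBF-like units standing in for the high-d standard input; the unit count, tuning width, and centers are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def participation_ratio(X):
    """PR = (sum_i l_i)**2 / sum_i l_i**2, where l_i are the eigenvalues
    of the covariance of X (rows are samples, columns are units)."""
    ev = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    ev = np.clip(ev, 0.0, None)                 # guard tiny negative eigenvalues
    return ev.sum() ** 2 / (ev ** 2).sum()

rng = np.random.default_rng(0)
Z = rng.uniform(-1, 1, size=(2000, 5))          # D = 5 latent variables

# Sparse RBF-like units: each responds only near a random center in latent
# space, loosely mimicking the response fields shown in panel b.
centers = rng.uniform(-1, 1, size=(500, 5))
dists = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=-1)
X = np.exp(-dists ** 2 / (2 * 0.4 ** 2))

pr_latent = participation_ratio(Z)              # close to D = 5
pr_input = participation_ratio(X)               # substantially higher
```

The exact value for the expanded representation depends on the tuning width and unit count; the point is only that a sparse nonlinear expansion raises the embedding dimensionality well above D.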
Fig. 3. The emergence of abstraction from classification task learning.
a Schematic of the multi-tasking model. It receives a high-dimensional input representation of D latent variables (here, from the standard input, as shown in Fig. 1e, left) and learns to perform P binary classifications on the latent variables. We study the representations that this induces in the layer prior to the output: the representation layer. b Visualization of the concentric square structure as transformed in the representation layer of a multi-tasking model trained to perform one (top), two (middle), and ten (bottom) tasks. The visualization procedure is the same as Fig. 2c. c The same as b, but for visualizations based on classifier (left) and regression (right) generalization. The classifier (regression) model is learned on the left side of the plot, and generalized to the right side of the plot. The output of the model is given on the y axis and each point is colored according to the true latent variable category (i.e., sign) or value. The visualization procedure is the same as Fig. 2e. The visualization shows that generalization performance increases with the number of tasks P (increasing from top to bottom). d The activation along the output dimension for a single task learned by the multi-tasking model for the two different output categories (purple and red). The distribution of activity is bimodal for multi-tasking models trained with one or two tasks, but becomes less so for more tasks. e The classifier (left) and regression (right) metrics applied to model representations with different numbers of tasks. f The standard (left) and generalization (right) performance of a classifier trained to perform a novel task with limited samples using the representations from a multi-tasking model trained with P = 10 tasks as input. The lower (dark gray) and upper (light gray) bounds are the standard or generalization performance of a classifier trained on the input representations (lower) and directly on the latent variables (upper). Note that the multi-tasking model performance is close to that of training directly on the latent variables in all cases.
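The classifier generalization metric in panel e can be stated compactly: fit a linear decoder for one variable using only points from one half of latent-variable space, then test it on the held-out half. A numpy sketch of the idea, using plain logistic regression as a stand-in for whatever linear decoder the paper uses, contrasting a linear (abstract) code with a hand-built conjunctive (non-abstract) one:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.uniform(-1, 1, size=(4000, 5))          # D = 5 latent variables
y = (Z[:, 0] > 0).astype(float)                 # decode the sign of variable 0
train = Z[:, 1] < 0                             # fit on one half of variable 1...
test = ~train                                   # ...generalize to the other half

def fit_logreg(X, y, lr=0.5, steps=400):
    """Plain full-batch logistic regression (numpy-only stand-in)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * (p - y).mean()
    return w, b

def generalization_accuracy(R):
    w, b = fit_logreg(R[train], y[train])
    return (((R[test] @ w + b) > 0) == (y[test] > 0.5)).mean()

acc_linear = generalization_accuracy(Z)         # abstract code: generalizes
R_conj = Z.copy()
R_conj[:, 0] = Z[:, 0] * Z[:, 1]                # conjunctive, non-abstract code
acc_conj = generalization_accuracy(R_conj)      # fails to generalize
```

On the linear code the decoder carries over to the held-out half; on the conjunctive code the decision boundary it learned flips sign across halves, so generalization collapses even though within-half performance is high.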
Fig. 4. Abstract representations emerge for heterogeneous tasks, and in spite of high-dimensional grid tasks.
a Schematic of the multi-tasking model with grid tasks. They are defined by the grid size, grid, the number of regions along each dimension (top: grid = 2; bottom: grid = 3), and the number of latent variables, D. There are grid^D total grid chambers, which are randomly assigned to category 1 (red) or category 2 (blue). Some grid tasks are aligned with the latent variables by chance (as in top left), but this fraction is small for even moderate D. b Visualization of the representation layer of a multi-tasking model trained only on grid tasks, with P = 15. c Quantification of the abstraction developed by a grid task multi-tasking model. (left) Classifier generalization performance. (right) Regression generalization performance. d The alignment (cosine similarity) between randomly chosen tasks for latent variable-aligned classification tasks, n = 2 and D = 5 grid tasks, and n = 3 and D = 5 grid tasks. e Schematic of the multi-tasking model with a mixture of grid and linear tasks. f Same as b, but for a multi-tasking model trained with a mixture of P = 15 latent variable-aligned classification tasks and a variable number of grid tasks (x axis). g Same as c, but for a multi-tasking model trained with P = 15 latent variable-aligned classification tasks and a variable number of grid tasks. While the multi-tasking model trained only with grid tasks does not develop abstract representations, the multi-tasking model trained with a combination of grid and linear tasks does, even when the grid tasks outnumber the linear tasks.
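A grid task as described in panel a can be generated in a few lines: cut each latent dimension into `grid` equal bins and assign each of the grid^D chambers a random binary category. A sketch under those definitions (sample counts and seeds are arbitrary):

```python
import numpy as np

def grid_task_labels(Z, grid, rng):
    """Label each latent point (in [-1, 1]**D) by a random binary colouring
    of the grid**D chambers formed by equal cuts along every dimension."""
    D = Z.shape[1]
    colours = rng.integers(0, 2, size=(grid,) * D)   # one category per chamber
    idx = np.clip(((Z + 1) / 2 * grid).astype(int), 0, grid - 1)
    return colours[tuple(idx.T)]

rng = np.random.default_rng(2)
Z = rng.uniform(-1, 1, size=(1000, 5))               # D = 5 latent variables
labels = grid_task_labels(Z, grid=3, rng=rng)        # one grid = 3 task
```

With grid = 2 and D = 5 there are 2^32 possible colourings, and only a handful coincide with the sign of a single latent variable, which is why chance alignment with the latent variables is rare.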
Fig. 5. The multi-tasking model learns abstract structure from both random Gaussian process inputs and output tasks.
a (left) Schematic of the creation of a random Gaussian process input for one latent variable dimension (D = 1). Random Gaussian processes with length scale = 1 are drawn for the single latent variable shown on the left. Then, the responses produced by these random Gaussian processes are used as input to the multi-tasking model. The full random Gaussian process input has 500 random Gaussian process dimensions and D = 5 latent variables. (right) Schematic of the creation of two random Gaussian process tasks for the D = 1-dimensional latent variable shown on the left, showing both two example binary classification tasks and the random Gaussian process that is thresholded at zero to create each task. b Visualization of the input structure for random Gaussian process inputs of different length scales. c The embedding dimensionality (participation ratio) of the random Gaussian process for different length scales. Note that it is always less than the dimensionality of 200 achieved by the standard input. d Examples of random Gaussian process tasks for a variety of length scales. The multi-tasking model is trained to perform these tasks, as schematized in a. e The embedding dimensionality (participation ratio) of the binary output patterns required by task collections of different length scales. f Classifier generalization performance of a multi-tasking model trained to perform P = 15 classification tasks with D = 5-dimensional latent variables, shown for different combinations of task length scale (changing along the y axis) and input length scale (changing along the x axis). g Regression generalization performance shown as in f.
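The random Gaussian process tasks in panel a amount to drawing a function from a GP with an RBF kernel over the latent variable and thresholding it at zero. A minimal D = 1 sketch (the grid resolution, range, and jitter term are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 200)[:, None]              # D = 1 latent variable grid
length_scale = 1.0
K = np.exp(-0.5 * (x - x.T) ** 2 / length_scale ** 2)  # RBF covariance
K += 1e-8 * np.eye(len(x))                        # jitter for numerical stability

f = rng.multivariate_normal(np.zeros(len(x)), K)  # one random GP draw
task = (f > 0).astype(int)                        # threshold at zero: binary task
```

Shorter length scales produce draws that cross zero more often, hence more nonlinear tasks; the same thresholding construction extends to D > 1 by evaluating the kernel on D-dimensional latent points.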
Fig. 6. The multi-tasking model produces abstract representations from image inputs.
a Examples from the 2D shape dataset (top) and chair image dataset (bottom). The 2D shapes dataset is from: Matthey, L., Higgins, I., Hassabis, D. & Lerchner, A. dsprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/ (2017). b Schematic of the modified model. The multi-tasking model now begins with a network pre-trained on the ImageNet challenge, followed by a few additional layers of processing before learning binary tasks as before (see “Pre-processing using a pre-trained network” in Methods). c The classifier (left) and regression (right) generalization performance when applied to the shape image pixels (top) or ImageNet representations (bottom). d The classifier (left) and regression (right) generalization performance of the multi-tasking model on the shape images. e The same as c but for the chair images. f The same as d but for the chair images.
Fig. 7. The multi-tasking model produces abstract representations when trained with reinforcement learning.
a Schematic of the reinforcement learning multi-tasking model, using the deep deterministic policy gradient approach. b The performance of the network on different tasks over the course of training. Note the sharp transitions between near-chance performance (avg reward = 0) and near-perfect performance (avg reward = 1). c The fraction of tasks learned (avg reward > 0.8) for different numbers of trained tasks; the shading shows the standard error of the mean, and n = 10 models are included in the analysis. d The classifier (left) and regression (right) generalization performance of the reinforcement learning multi-tasking model.
