Big Self-Supervised Models Advance Medical Image Classification
Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton,
Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, Mohammad Norouzi
Google Research and Health
Abstract
Self-supervised pretraining followed by supervised fine-
tuning has seen success in image recognition, especially
when labeled examples are scarce, but has received lim-
ited attention in medical image analysis. This paper stud-
ies the effectiveness of self-supervised learning as a pre-
training strategy for medical image classification. We con-
duct experiments on two distinct tasks: dermatology con-
dition classification from digital camera images and multi-
label chest X-ray classification, and demonstrate that self-
supervised learning on ImageNet, followed by additional
self-supervised learning on unlabeled domain-specific medical images, significantly improves the accuracy of medical
image classifiers. We introduce a novel Multi-Instance Con-
trastive Learning (MICLe) method that uses multiple im-
ages of the underlying pathology per patient case, when
available, to construct more informative positive pairs for
self-supervised learning. Combining our contributions, we
achieve an improvement of 6.7% in top-1 accuracy and
an improvement of 1.1% in mean AUC on dermatology
and chest X-ray classification respectively, outperforming
strong supervised baselines pretrained on ImageNet. In ad-
dition, we show that big self-supervised models are robust
to distribution shift and can learn efficiently with a small
number of labeled medical images.
1. Introduction
Learning from limited labeled data is a fundamental
problem in machine learning, which is crucial for medi-
cal image analysis because annotating medical images is
time-consuming and expensive. Two common pretraining
approaches to learning from limited labeled data include:
(1) supervised pretraining on a large labeled dataset such as
ImageNet, (2) self-supervised pretraining using contrastive
learning (e.g., [16, 8, 9]) on unlabeled data. After pretrain-
ing, supervised fine-tuning on a target labeled dataset of in-
terest is used. While ImageNet pretraining is ubiquitous in
medical image analysis [46, 32, 31, 29, 15, 20], the use of
self-supervised approaches has received limited attention.
Former intern at Google. Currently at Georgia Institute of Technology.
{shekazizi, skornblith, iamtingchen, natviv, mnorouzi}@google.com
Figure 1: Our approach comprises three steps: (1) Self-
supervised pretraining on unlabeled ImageNet using SimCLR [8].
(2) Additional self-supervised pretraining using unlabeled medical
images. If multiple images of each medical condition are avail-
able, a novel Multi-Instance Contrastive Learning (MICLe) is used
to construct more informative positive pairs based on different im-
ages. (3) Supervised fine-tuning on labeled medical images. Note
that unlike step (1), steps (2) and (3) are task and dataset specific.
Self-supervised approaches are attractive because they enable the use of unlabeled domain-specific images during pretraining to learn more relevant representations.
This paper studies self-supervised learning for medi-
cal image analysis and conducts a fair comparison be-
tween self-supervised and supervised pretraining on two
distinct medical image classification tasks: (1) dermatol-
ogy skin condition classification from digital camera im-
ages, (2) multi-label chest X-ray classification among five
pathologies based on the CheXpert dataset [23]. We ob-
serve that self-supervised pretraining outperforms super-
vised pretraining, even when the full ImageNet dataset
(14M images and 21.8K classes) is used for supervised pre-
training. We attribute this finding to the domain shift and
discrepancy between the nature of recognition tasks in Im-
Figure 2: Comparison of supervised and self-supervised pretrain-
ing, followed by supervised fine-tuning using two architectures on
dermatology and chest X-ray classification. Self-supervised learn-
ing utilizes unlabeled domain-specific medical images and signif-
icantly outperforms supervised ImageNet pretraining.
ageNet and medical image classification. Self-supervised
approaches bridge this domain gap by leveraging in-domain
medical data for pretraining and they also scale gracefully
as they do not require any form of class label annotation.
An important component of our self-supervised learn-
ing framework is an effective Multi-Instance Contrastive
Learning (MICLe) strategy that helps adapt contrastive
learning to multiple images of the underlying pathology per
patient case. Such multi-instance data is often available in
medical imaging datasets – e.g., frontal and lateral views
of mammograms, retinal fundus images from each eye, etc.
Given multiple images of a given patient case, we propose
to construct a positive pair for self-supervised contrastive
learning by drawing two crops from two distinct images of
the same patient case. Such images may be taken from dif-
ferent viewing angles and show different body parts with the
same underlying pathology. This presents a great opportu-
nity for self-supervised learning algorithms to learn repre-
sentations that are robust to changes of viewpoint, imaging
conditions, and other confounding factors in a direct way.
MICLe does not require class label information and only
relies on different images of an underlying pathology, the
type of which may be unknown.
Fig. 1 depicts the proposed self-supervised learning ap-
proach, and Fig. 2 shows the summary of results. Our key
findings and contributions include:
• We investigate the use of self-supervised pretraining
on medical image classification. We find that self-
supervised pretraining on unlabeled medical images sig-
nificantly outperforms standard ImageNet pretraining
and random initialization.
• We propose Multi-Instance Contrastive Learning (MI-
CLe) as a generalization of existing contrastive learning
approaches to leverage multiple images per medical con-
dition. We find that MICLe improves the performance of
self-supervised models, yielding state-of-the-art results.
• On dermatology condition classification, our self-
supervised approach provides a sizable gain of 6.7% in
top-1 accuracy, even in a highly competitive production
setting. On chest X-ray classification, self-supervised
learning outperforms strong supervised baselines pre-
trained on ImageNet by 1.1% in mean AUC.
• We demonstrate that self-supervised models are robust
and generalize better than baselines when subjected to
shifted test sets, without fine-tuning. Such behavior is
desirable for deployment in a real-world clinical setting.
2. Related Work
Transfer Learning for Medical Image Analysis. Despite the differences in image statistics, scale, and task-
relevant features, transfer learning from natural images is
commonly used in medical image analysis [29, 31, 32, 46],
and multiple empirical studies show that this improves
performance [1, 15, 20]. However, detailed investiga-
tions from Raghu et al. [37] of this strategy indicate this
does not always improve performance in medical imaging
contexts. They, however, do show that transfer learning
from ImageNet can speed up convergence, and is particu-
larly helpful when the medical image training data is lim-
ited. Importantly, the study used relatively small archi-
tectures, and found pronounced improvements with small
amounts of data especially when using their largest archi-
tecture of ResNet-50 (1×) [18]. Transfer learning from in-
domain data can help alleviate the domain mismatch issue.
For example, [7, 20, 26, 13] report performance improve-
ments when pretraining on labeled data in the same do-
main. However, this approach is often infeasible for many
medical tasks in which labeled data is expensive and time-
consuming to obtain. Recent advances in self-supervised
learning provide a promising alternative enabling the use of
unlabeled medical data that is often easier to procure.
Self-supervised Learning. Initial works in self-supervised
representation learning focused on the problem of learn-
ing embeddings without labels such that a low-capacity
(commonly linear) classifier operating on these embeddings
could achieve high classification accuracy [12, 14, 35, 49].
Contrastive self-supervised methods such as instance dis-
crimination [45], CPC [21, 36], Deep InfoMax [22], Ye
et al. [47], AMDIM [2], CMC [41], MoCo [10, 17],
PIRL [33], and SimCLR [8, 9] were the first to achieve
linear classification accuracy approaching that of end-to-
end supervised training. Recently, these methods have been
harnessed to achieve dramatic improvements in label effi-
ciency for semi-supervised learning. Specifically, one can
first pretrain in a task-agnostic, self-supervised fashion us-
ing all data, and then fine-tune on the labeled subset in
a task-specific fashion with a standard supervised objec-
tive [8, 9, 21]. Chen et al. [9] show that this approach ben-
efits from large (high-capacity) models for pretraining and
fine-tuning, but after a large model is trained, it can be dis-
tilled to a much smaller model with little loss in accuracy.
Figure 3: An illustration of our self-supervised pretraining for medical image analysis. When a single image of a medical condition is available, we use standard data augmentation to generate two augmented views of the same image. When multiple images are available, we use two distinct images to directly create a positive pair of examples. We call the latter approach Multi-Instance Contrastive Learning (MICLe).
Our Multi-Instance Contrastive Learning approach is related to previous work in video processing, where multiple views arise naturally due to temporal variation [38, 42]. These works propose to learn visual representations from video by maximizing agreement between representations of adjacent frames [42] or of two views of the same action [38]. We generalize this idea to representation learning from image datasets in which sets of images containing the same desired class information are available, and we show that the benefits of MICLe can be combined with state-of-the-art self-supervised learning methods such as SimCLR.
Self-supervision for Medical Image Analysis. Although
self-supervised learning has only recently become viable
on standard image classification datasets, it has already
seen some application within the medical domain. While
some works have attempted to design domain-specific pre-
text tasks [3, 40, 53, 52], other works concentrate on tailor-
ing contrastive learning to medical data [5, 19, 25, 27, 51].
Most closely related to our work, Sowrirajan et al. [39] explore the use of MoCo pretraining for classification on the CheXpert dataset through linear evaluation.
Several recent publications investigate semi-supervised
learning for medical imaging tasks (e.g., [11, 28, 43, 50]).
These methods are complementary to ours, and we believe
combining self-training and self-supervised pretraining is
an interesting avenue for future research (e.g., [9]).
3. Self-Supervised Pretraining
Our approach comprises the following steps. First, we
perform self-supervised pretraining on unlabeled images
using contrastive learning to learn visual representations.
For contrastive learning, we use a combination of the unlabeled ImageNet dataset and task-specific medical images. Then, if multiple images of each medical condition are available, Multi-Instance Contrastive Learning (MICLe) is used for additional self-supervised pretraining. Finally, we perform
supervised fine-tuning on labeled medical images. Figure 1
shows the summary of our proposed method.
3.1. A Simple Framework for Contrastive Learning
To learn visual representations effectively with unlabeled
images, we adopt SimCLR [8, 9], a recently proposed ap-
proach based on contrastive learning. SimCLR learns rep-
resentations by maximizing agreement [4] between differ-
ently augmented views of the same data example via a contrastive loss applied to hidden representations of a neural network.
Given a randomly sampled mini-batch of images, each image x_i is augmented twice using random crop, color distortion, and Gaussian blur, creating two views of the same example, x_{2k-1} and x_{2k}. The two views are encoded via an encoder network f(·) (a ResNet [18]) to generate representations h_{2k-1} and h_{2k}. The representations are then transformed with a non-linear transformation network g(·) (an MLP projection head), yielding z_{2k-1} and z_{2k}, which are used for the contrastive loss.
With a mini-batch of encoded examples, the contrastive loss between a pair of positive examples i, j (augmented from the same image) is given as follows:

$$\ell^{\text{NT-Xent}}_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}, \qquad (1)$$

where sim(·, ·) is the cosine similarity between two vectors, and τ is a temperature scalar.
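As a concrete illustration, the following is a minimal NumPy sketch of the NT-Xent objective in Eq. (1) for a mini-batch of 2N projected embeddings; it illustrates the formula only and is not the exact SimCLR implementation.

import numpy as np

def nt_xent_loss(z, tau=0.1):
    # z: array of shape (2N, d); rows 2k and 2k+1 (0-indexed) are the two
    # augmented views of example k. Returns the loss averaged over all 2N anchors.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # unit-normalize so z_i . z_j = sim(z_i, z_j)
    sim = z @ z.T / tau                                  # pairwise similarities scaled by temperature
    np.fill_diagonal(sim, -np.inf)                       # enforce k != i in the denominator
    pos = np.arange(z.shape[0]) ^ 1                      # positive partner: 0<->1, 2<->3, ...
    log_prob = sim[np.arange(z.shape[0]), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

# toy usage: N = 4 examples, 128-dimensional projections
z = np.random.default_rng(0).normal(size=(8, 128))
print(nt_xent_loss(z, tau=0.1))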
3.2. Multi-Instance Contrastive Learning (MICLe)
In medical image analysis, it is common to utilize mul-
tiple images per patient to improve classification accuracy
and robustness. Such images may be taken from differ-
ent viewpoints or under different lighting conditions, pro-
viding complementary information for medical diagnosis.
When multiple images of a medical condition are available
as part of the training dataset, we propose to learn represen-
tations that are invariant not only to different augmentations
of the same image, but also to different images of the same
medical pathology. Accordingly, we can conduct a multi-
instance contrastive learning (MICLe) stage where positive
pairs are constructed by drawing two crops from the images
of the same patient as demonstrated in Fig. 3.
In MICLe, in contrast to standard SimCLR, to construct a mini-batch of 2N representations we randomly sample a mini-batch of N bags of instances and define the contrastive prediction task on positive pairs retrieved from the bag of images instead of augmented views of the same image. Each bag X = {x_1, x_2, ..., x_M} contains images from the same patient (i.e., the same pathology) captured from different views, and we assume that M can vary across bags. When there are two or more instances in a bag (M = |X| ≥ 2), we construct a positive pair by drawing two crops from two randomly selected images in the bag. In this case, the objective still takes the form of Eq. (1), but the images contributing to each positive pair are distinct. Algorithm 1 summarizes the proposed method.
Leveraging multiple images of the same condition in the contrastive loss helps the model learn representations that are more robust to changes in viewpoint, lighting conditions, and other confounding factors. We find that
multi-instance contrastive learning significantly improves
the accuracy and helps us achieve the state-of-the-art result
on the dermatology condition classification task.
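To make the pairing rule concrete, below is a minimal Python sketch of the per-bag positive-pair construction described above and in Algorithm 1; the bag list and the augment function are placeholders standing in for a patient case and the SimCLR augmentation pipeline.

import random

def micle_views(bags, augment):
    # bags: list of N bags, each a non-empty list of images from one patient case.
    # augment: stochastic augmentation t ~ T (random crop, color jitter, ...).
    # Returns 2N augmented views; views 2k and 2k+1 form a positive pair.
    views = []
    for bag in bags:
        if len(bag) >= 2:
            x, x_prime = random.sample(bag, 2)   # two distinct images of the same pathology
        else:
            x = x_prime = bag[0]                 # single image: fall back to a standard SimCLR pair
        views.append(augment(x))
        views.append(augment(x_prime))
    return views

The resulting 2N views are then fed to the same NT-Xent loss of Eq. (1).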
4. Experiment Setup
4.1. Tasks and datasets
We consider two popular medical imaging tasks. The
first task is in the dermatology domain and involves iden-
tifying skin conditions from digital camera images. The
second task involves multi-label classification of chest X-
rays among five pathologies. We chose these tasks as they
embody many common characteristics of medical imaging
tasks like imbalanced data and pathologies of interest re-
stricted to small local patches. At the same time, they are
also quite diverse in terms of the type of images, label space
and task setup. For example, dermatology images are visu-
ally similar to natural images whereas the chest X-rays are
gray-scale and have standardized views. This, in turn, helps
us probe the generality of our proposed methods.
Dermatology. For the dermatology task, we follow the experiment setup and dataset of [29]. The dataset was collected and de-identified by a US-based tele-dermatology service, with images of skin conditions taken using consumer-grade digital cameras.
Algorithm 1: Multi-Instance Contrastive Learning.

Input: batch size N, temperature τ, encoder f(·), projection head g(·), augmentation distribution T
while stopping criterion not met do
    Sample a mini-batch of N bags {X_k}, k = 1, ..., N
    for k = 1 to N do
        Draw augmentation functions t, t' ~ T
        if |X_k| ≥ 2 then
            Randomly select distinct x_k, x'_k ∈ X_k
        else
            x_k = x'_k ← the only element of X_k
        end
        x̃_{2k-1} = t(x_k);   x̃_{2k} = t'(x'_k)
        z_{2k-1} = g(f(x̃_{2k-1}));   z_{2k} = g(f(x̃_{2k}))
    end
    for i, j ∈ {1, ..., 2N} do
        s_{i,j} = z_i^T z_j / (‖z_i‖ ‖z_j‖)
    end
    ℓ(i, j) ← ℓ^{NT-Xent}_{i,j} as in Eq. (1)
    L = (1/2N) Σ_{k=1}^{N} [ℓ(2k-1, 2k) + ℓ(2k, 2k-1)]
end
return trained encoder network f(·)
The images are heterogeneous in nature and exhibit significant variations in pose, lighting, blur, and body parts. The background also contains noise artifacts such as clothing and walls, which adds to the challenge. The ground truth labels were aggregated from a panel of several US board-certified dermatologists who provided differential diagnoses of skin conditions for each case.
In all, the dataset has cases from a total of 12,306 unique patients. Each case includes between one and six images. The data is further split into development and test sets, ensuring no patient overlap between the two. Then, cases with multiple skin conditions or poor-quality images were filtered out. The final D_Derm^Train, D_Derm^Validation, and D_Derm^Test sets include a total of 15,340, 1,190, and 4,146 cases, respectively. There are 419 unique condition labels in the dataset. For the purpose of model development, we identify and use the 26 most common skin conditions and group the rest into an additional 'Other' class, leading to a final label space of 27 classes for the model. We refer to this as D_Derm in the subsequent sections. We also use an additional de-identified D_Derm^External set to evaluate the generalization performance of our proposed method under distribution shift. Unlike D_Derm, this dataset is primarily focused on skin cancers, and the ground truth labels are obtained from biopsies. The distribution shift in the labels makes this a particularly challenging dataset for evaluating the zero-shot (i.e., without any additional fine-tuning) transfer performance of models.
For SimCLR pretraining, we combine the images from D_Derm^Train with additional unlabeled images from the same source, leading to a total of 454,295 images for self-supervised pretraining. We refer to this as D_Derm^Unlabeled.
For MICLe pretraining, we only use images coming from the 15,340 cases of D_Derm^Train. Additional details are provided in Appendix A.1.
Chest X-rays. CheXpert [23] is a large open-source dataset of de-identified chest radiograph (X-ray) images. The dataset consists of 224,316 chest radiographs from 65,240 unique patients. The ground truth labels were automatically extracted from radiology reports and correspond to a label space of 14 radiological observations. The validation set consists of 234 manually annotated chest X-rays. Given the small size of the validation set and following the suggestion of [34, 37], for the downstream task evaluations we randomly re-split the training set into 67,429 training images, 22,240 validation images, and 33,745 test images. We train the model to predict the five pathologies used by Irvin and Rajpurkar et al. [23] in a multi-label classification setting. For SimCLR pretraining in the chest X-ray domain, we only consider images from the training set of the CheXpert dataset, discarding the labels. We refer to this as D_CheXpert^Unlabeled. In addition, we also use the NIH chest X-ray dataset, D_NIH, which consists of 112,120 de-identified X-rays from 30,805 unique patients, to evaluate zero-shot transfer performance. Additional details on the datasets can be found in [44] and in Appendix A.2.
4.2. Pretraining protocol
To assess the effectiveness of self-supervised pretraining using big neural nets, as suggested in [8], we investigate ResNet-50 (1×), ResNet-50 (4×), and ResNet-152 (2×) architectures as our base encoder networks. Following SimCLR [8], two fully connected layers are used to map the output of the ResNets to a 128-dimensional embedding, which is used for contrastive learning. We also use the LARS optimizer [48] to stabilize training during pretraining. We perform SimCLR pretraining on D_Derm^Unlabeled and D_CheXpert^Unlabeled, both with and without initialization from ImageNet self-supervised pretrained weights. In the following sections, we indicate pretraining initialized using self-supervised ImageNet weights as ImageNet→Derm and ImageNet→CheXpert.
Unless otherwise specified, for the dermatology pretraining task, due to the similarity of dermatology images to natural images, we use the same data augmentations used to generate positive pairs in SimCLR. These include random color augmentation (strength = 1.0), crops with resize, Gaussian blur, and random flips. We find that a batch size of 512 and a learning rate of 0.3 work well in this setting. Using this protocol, all models were pretrained for up to 150,000 steps using D_Derm^Unlabeled.
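For illustration, an equivalent torchvision-style pipeline for these dermatology pretraining augmentations might look as follows; the crop size, jitter probabilities, and blur kernel are assumptions, and the paper itself uses the SimCLR augmentation code.

from torchvision import transforms

s = 1.0  # color augmentation strength used for the dermatology task
derm_pretrain_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random crop with resize
    transforms.RandomHorizontalFlip(),                       # random flips
    transforms.RandomApply(
        [transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)], p=0.8),  # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),                 # Gaussian blur
    transforms.ToTensor(),
])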
For the CheXpert dataset, we pretrain with learning rates in {0.5, 1.0, 1.5}, temperatures in {0.1, 0.5, 1.0}, and batch sizes in {512, 1024}, and we select the model with the best performance on the downstream validation set. We also tested a range of possible augmentations and observe that the augmentations that lead to the best validation performance for this task are random cropping, random color jittering (strength = 0.5), rotation (up to 45 degrees), and horizontal flipping. Unlike the original set of augmentations proposed in SimCLR, we do not use Gaussian blur, because we think it can make it impossible to distinguish local texture variations and other areas of interest, thereby changing the underlying disease interpretation of the X-ray image. We leave a comprehensive investigation of the optimal augmentations to future work. Our best model on CheXpert was pretrained with a batch size of 1024 and a learning rate of 0.5, and we pretrain the models for up to 100,000 steps.
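A corresponding sketch of the CheXpert augmentation pipeline selected above, again with torchvision-style transforms and assumed parameter values, drops the blur and adds rotation:

from torchvision import transforms

s = 0.5  # weaker color jittering strength for chest X-rays
chexpert_pretrain_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                       # random cropping
    transforms.RandomApply(
        [transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)], p=0.8),  # color jittering
    transforms.RandomRotation(degrees=45),                   # rotation up to 45 degrees
    transforms.RandomHorizontalFlip(),                       # horizontal flipping
    transforms.ToTensor(),                                   # note: no Gaussian blur for X-rays
])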
We perform MICLe pretraining only on the dermatology dataset, as we did not have enough cases with multiple views in the CheXpert dataset to allow comprehensive training and evaluation of this approach. For MICLe pretraining, we initialize our model using SimCLR pretrained weights, and then incorporate the multi-instance procedure explained in Section 3.2 to further learn a more comprehensive representation using the multi-instance data of D_Derm^Train. Due to memory limits caused by stacking up to six images per patient case, we train with a smaller batch size of 128 and a learning rate of 0.1 for 100,000 steps to stabilize the training. Decreasing the learning rate for smaller batch sizes has been suggested in [8]. The rest of the settings, including the optimizer, weight decay, and warmup steps, are the same as in our previous pretraining protocol.
In all of our pretraining experiments, images are resized to 224 × 224. We use 16 to 64 Cloud TPU cores, depending on the batch size, for pretraining. With 64 TPU cores, it takes ∼12 hours to pretrain a ResNet-50 (1×) with a batch size of 512 for 100 epochs. Additional details about the selection of batch size, learning rate, and augmentations are provided in Appendix B.
4.3. Fine-tuning protocol
We train the model end-to-end during fine-tuning using
the weights of the pretrained network as initialization for
the downstream supervised task dataset following the ap-
proach described by Chen et al. [8, 9] for all our experi-
ments. We trained for 30,000 steps with a batch size of
256 using SGD with a momentum parameter of 0.9. For
data augmentation during fine-tuning, we performed ran-
dom color augmentation, crops with resize, blurring, rota-
tion, and flips for the images in both tasks. We observe
that this set of augmentations is critical for achieving the
best performance during fine-tuning. We resize the Derm dataset images to 448 × 448 pixels and the CheXpert images to 224 × 224 during this fine-tuning stage.
For every combination of pretraining strategy and down-
stream fine-tuning task, we perform an extensive hyper-
parameter search. We selected the learning rate and weight
decay after a grid search of seven logarithmically spaced
Table 1: Performance of dermatology skin condition and chest X-ray classification models measured by top-1 accuracy (%) and area under the curve (AUC) across different architectures. Each model is fine-tuned using transfer learning from a model pretrained on ImageNet, on unlabeled medical data only, or on unlabeled medical data initialized from an ImageNet-pretrained model (e.g., ImageNet→Derm). Bigger models yield better performance. Pretraining on ImageNet is complementary to pretraining on unlabeled medical images.
                   Dermatology Classification                                  | Chest X-ray Classification
Architecture       Pretraining Dataset  Top-1 Accuracy (%)  AUC                | Pretraining Dataset   Mean AUC
ResNet-50 (1×)     ImageNet             62.58 ± 0.84        0.9480 ± 0.0014    | ImageNet              0.7630 ± 0.0013
                   Derm                 63.66 ± 0.24        0.9490 ± 0.0011    | CheXpert              0.7647 ± 0.0007
                   ImageNet→Derm        63.44 ± 0.13        0.9511 ± 0.0037    | ImageNet→CheXpert     0.7670 ± 0.0007
ResNet-50 (4×)     ImageNet             64.62 ± 0.76        0.9545 ± 0.0007    | ImageNet              0.7681 ± 0.0008
                   Derm                 66.93 ± 0.92        0.9576 ± 0.0015    | CheXpert              0.7668 ± 0.0011
                   ImageNet→Derm        67.63 ± 0.32        0.9592 ± 0.0004    | ImageNet→CheXpert     0.7687 ± 0.0016
ResNet-152 (2×)    ImageNet             66.38 ± 0.03        0.9573 ± 0.0023    | ImageNet              0.7671 ± 0.0008
                   Derm                 66.43 ± 0.62        0.9558 ± 0.0007    | CheXpert              0.7683 ± 0.0009
                   ImageNet→Derm        68.30 ± 0.19        0.9620 ± 0.0007    | ImageNet→CheXpert     0.7689 ± 0.0010
learning rates between 10^-3.5 and 10^-0.5 and three logarithmically spaced values of weight decay between 10^-5 and 10^-3, as well as no weight decay.
supervised pretraining baseline we follow the same proto-
col and observe that for all fine-tuning setups, 30,000 steps
is sufficient to achieve optimal performance. For supervised
baselines, we compare against the identical publicly available ResNet models (https://github.com/google-research/simclr) pretrained on ImageNet with a standard cross-entropy loss. These models are trained with the same data augmentation as the self-supervised models (crops, strong color augmentation, and blur).
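As a sketch, the fine-tuning hyperparameter grid described in the previous paragraph can be generated as follows (assuming seven and three grid points, respectively):

import numpy as np

# Seven log-spaced learning rates in [10^-3.5, 10^-0.5] and three log-spaced
# weight decays in [10^-5, 10^-3], plus the no-weight-decay option.
learning_rates = np.logspace(-3.5, -0.5, num=7)
weight_decays = list(np.logspace(-5, -3, num=3)) + [0.0]
grid = [(lr, wd) for lr in learning_rates for wd in weight_decays]
print(len(grid))  # 7 * 4 = 28 fine-tuning configurations per pretraining strategy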
4.4. Evaluation methodology
After identifying the best hyperparameters for fine-tuning on a given dataset, we proceed to select the model based on validation set performance and evaluate the chosen model multiple times (10 times for the chest X-ray task and 5 times for the dermatology task) on the test set to report task performance. Our primary metrics for the dermatology task are top-1 accuracy and area under the curve (AUC), following [29]. For the chest X-ray task, given the multi-label setup, we report the mean AUC averaged over the predictions for the five target pathologies, following [23]. We also use the non-parametric bootstrap to estimate the variability around the model performance and to investigate any statistically significant improvements. Additional details are provided in Appendix B.1.1.
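As an illustration of the evaluation procedure, a minimal non-parametric bootstrap over test cases might look as follows; the metric function, number of resamples, and confidence level are assumptions rather than the exact protocol.

import numpy as np

def bootstrap_ci(metric_fn, y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    # Resample test cases with replacement and recompute the metric to
    # estimate variability around the model performance.
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats.append(metric_fn(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(np.mean(stats)), (float(lo), float(hi))

# example: top-1 accuracy with a 95% confidence interval on dummy predictions
accuracy = lambda t, p: float(np.mean(t == p))
y_true = np.array([0, 1, 2, 1, 0, 2, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 2, 0, 0])
print(bootstrap_ci(accuracy, y_true, y_pred))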
5. Experiments & Results
In this section we investigate whether self-supervised
pretraining with contrastive learning translates to a better
performance in models fine-tuned end-to-end across the se-
lected medical image classification tasks. To this end, first,
we explore the choice of the pretraining dataset for med-
ical imaging tasks. Then, we evaluate the benefits of our
proposed multi-instance contrastive learning (MICLe) for the dermatology condition classification task, and compare and contrast the proposed method against baselines and state-of-the-art supervised pretraining methods. Finally, we
explore label efficiency and transferability (under distribu-
tion shift) of self-supervised trained models in the medical
image classification setting.
5.1. Dataset for pretraining
One important aspect of transfer learning via self-supervised pretraining is the choice of a proper unlabeled dataset. For this study, we use architectures of varying capacities (i.e., ResNet-50 (1×), ResNet-50 (4×), and ResNet-152 (2×)) as our base networks, and carefully investigate three possible scenarios for self-supervised pretraining in the medical context: (1) using the ImageNet dataset only, (2) using the task-specific unlabeled medical dataset (i.e., Derm or CheXpert), and (3) initializing the pretraining from the ImageNet self-supervised model but using the task-specific unlabeled dataset for pretraining, indicated here as ImageNet→Derm and ImageNet→CheXpert. Table 1 shows the performance of the dermatology skin condition and chest X-ray classification models measured by top-1 accuracy (%) and area under the curve (AUC) across different architectures and pretraining scenarios. Our results suggest that the best performance is achieved when both ImageNet and task-specific unlabeled data are used. Combining ImageNet and Derm unlabeled data for pretraining translates to a (1.92 ± 0.16)% increase in top-1 accuracy for dermatology classification over using only the ImageNet dataset for self-supervised transfer learning. This result suggests that pretraining on ImageNet is likely complementary to pretraining on unlabeled medical images. Moreover, we observe that larger models benefit much more from self-supervised pretraining, underscoring the importance of model capacity in this setting.
As shown in Table 1, on CheXpert, we once again ob-
serve that self-supervised pretraining with both ImageNet
Table 2: Evaluation of multi-instance contrastive learning (MICLe) on dermatology condition classification. Our results suggest that MICLe consistently improves the accuracy of skin condition classification over SimCLR across different pretraining datasets and architectures.
Model             Pretraining Dataset   MICLe   Top-1 Accuracy (%)
ResNet-50 (4×)    Derm                  No      66.93 ± 0.92
ResNet-50 (4×)    Derm                  Yes     67.55 ± 0.52
ResNet-50 (4×)    ImageNet→Derm         No      67.63 ± 0.32
ResNet-50 (4×)    ImageNet→Derm         Yes     68.81 ± 0.41
ResNet-152 (2×)   Derm                  No      66.43 ± 0.62
ResNet-152 (2×)   Derm                  Yes     67.16 ± 0.35
ResNet-152 (2×)   ImageNet→Derm         No      68.30 ± 0.19
ResNet-152 (2×)   ImageNet→Derm         Yes     68.43 ± 0.32
and in-domain CheXpert data is beneficial, outperforming
self-supervised pretraining on ImageNet or CheXpert alone.
5.2. Performance of MICLe
Next, we evaluate whether utilizing multi-instance contrastive learning (MICLe), and thereby leveraging the potential availability of multiple images per patient for a given pathology, is beneficial for self-supervised pretraining. Table 2 com-
pares the performance of dermatology condition classifica-
tion models fine-tuned on representations learned with and
without MICLe pretraining. We observe that MICLe con-
sistently improves the performance of dermatology classi-
fication over the original SimCLR method under different
pretraining dataset and base network architecture choices.
Using MICLe for pretraining translates to a (1.18 ± 0.09)% increase in top-1 accuracy for dermatology classification over using only the original SimCLR.
5.3. Comparison with supervised transfer learning
We further improve the performance by providing more negative examples, training longer (1000 epochs) with a larger batch size of 1024. We achieve the best-performing top-1 accuracy of (70.02 ± 0.22)% on dermatology condition classification using the ResNet-152 (2×) architecture and MICLe pretraining, incorporating both the ImageNet and Derm datasets. Tables 3 and 4 compare the transfer learning performance of the SimCLR and MICLe models with supervised baselines for the dermatology and chest X-ray classification tasks. This result shows that after fine-
tuning, our self-supervised model significantly outperforms
the supervised baseline when ImageNet pretraining is used
(p < 0.05). We specifically observe an improvement of
over 6.7% in top-1 accuracy in the dermatology task when
using MICLe. On the chest X-ray task, the improvement is
1.1% in mean AUC without using MICLe.
Though using ImageNet pretrained models is still the
norm, recent advances have been made by supervised pre-
training on large scale (often noisy) natural datasets [24,
30] improving transfer performance on downstream tasks.
Table 3: Comparison of best self-supervised models vs. super-
vised pretraining baselines on dermatology classification.
Architecture      Method       Pretraining Dataset   Top-1 Accuracy (%)
ResNet-152 (2×)   Supervised   ImageNet              63.36 ± 0.12
ResNet-101 (3×)   BiT [24]     ImageNet-21k          68.45 ± 0.29
ResNet-152 (2×)   SimCLR       ImageNet              66.38 ± 0.03
ResNet-152 (2×)   SimCLR       ImageNet→Derm         69.43 ± 0.43
ResNet-152 (2×)   MICLe        ImageNet→Derm         70.02 ± 0.22
Table 4: Comparison of best self-supervised models vs. super-
vised pretraining baselines on chest X-ray classification.
Architecture      Method       Pretraining Dataset   Mean AUC
ResNet-152 (2×)   Supervised   ImageNet              0.7625 ± 0.001
ResNet-101 (3×)   BiT [24]     ImageNet-21k          0.7720 ± 0.002
ResNet-152 (2×)   SimCLR       ImageNet              0.7671 ± 0.008
ResNet-152 (2×)   SimCLR       CheXpert              0.7702 ± 0.001
ResNet-152 (2×)   SimCLR       ImageNet→CheXpert     0.7729 ± 0.001
We therefore also evaluate a supervised baseline from Kolesnikov et al. [24], a ResNet-101 (3×) pretrained on ImageNet-21k called Big Transfer (BiT). This model contains additional architectural tweaks included to boost transfer performance and was trained on a significantly larger dataset (14M images labeled with one or more of 21k classes, vs. the 1M images in ImageNet), which provides us with a strong supervised baseline (the model is publicly available at https://github.com/google-research/big_transfer). ResNet-101 (3×) has 382M trainable parameters and is thus comparable to ResNet-152 (2×) with 233M trainable parameters. We observe that the MICLe model is better than this BiT model for the dermatology classification task, improving top-1 accuracy by 1.6%. For the chest X-ray task, the self-supervised model is better by about 0.1% in mean AUC. We surmise that with additional in-domain unlabeled data (we only use the CheXpert dataset for pretraining), self-supervised pretraining could surpass the BiT baseline by a larger margin. At the same time, these two approaches are complementary, but we leave further explorations in this direction to future work.
5.4. Self-supervised models generalize better
We conduct further experiments to evaluate the robustness of self-supervised pretrained models to distribution shifts. For this purpose, we use the models after pretraining and end-to-end fine-tuning (i.e., on CheXpert and Derm) to make predictions on an additional shifted dataset without any further fine-tuning (zero-shot transfer). We use D_Derm^External and D_NIH as our target shifted datasets. Our results generally suggest that self-supervised pretrained models can generalize better under distribution shift.
Figure 4: Evaluation of models on distribution-shifted datasets (left: D_Derm^Unlabeled → D_Derm^External; right: D_CheXpert^Unlabeled → D_NIH) shows that self-supervised pretraining using both ImageNet and the target domain data significantly improves robustness to distribution shift.
For the chest X-ray task, we note that self-supervised pretraining with either ImageNet or CheXpert data improves generalization, but stacking them both yields further gains. We also note that when using only ImageNet for self-supervised pretraining, the model performs worse in this setting compared to using in-domain data for pretraining. Further, we find that the performance improvement on the distribution-shifted dataset due to self-supervised pretraining (using both ImageNet and CheXpert data) is more pronounced than the original improvement on the CheXpert dataset. This is a very valuable finding, as generalization under distribution shift is of paramount importance to clinical applications. On the dermatology task, we observe similar trends, suggesting that the robustness of the self-supervised representations is consistent across tasks.
5.5. Self-supervised models are more label-efficient
To investigate the label efficiency of the selected self-supervised models, following the previously explained fine-tuning protocol, we fine-tune our models on different fractions of the labeled training data. We also conduct baseline fine-tuning experiments with supervised ImageNet pretrained models. We use label fractions ranging from 10% to 90% for both the Derm and CheXpert training datasets. Fine-tuning experiments on label fractions are repeated multiple times using the best hyperparameters and averaged. Figure 5 shows how the performance varies across the available label fractions for the dermatology task. First, we observe that pretraining using self-supervised models can significantly help with label efficiency for medical image classification: at all label fractions, self-supervised models outperform the supervised baseline. Moreover, these results suggest that MICLe yields proportionally larger gains when fine-tuning with fewer labeled examples. In fact, MICLe is able to match the baselines using only 20% of the training data for ResNet-50 (4×) and 30% of the training data for ResNet-152 (2×). Results on the CheXpert dataset are included in Appendix B.2, where we observe a similar trend.
Figure 5: Top-1 accuracy for dermatology condition classification for MICLe, SimCLR, and supervised models under different unlabeled pretraining datasets and varied label fractions.
6. Conclusion
Supervised pretraining on natural image datasets such
as ImageNet is commonly used to improve medical image
classification. This paper investigates an alternative strategy
based on self-supervised pretraining on unlabeled natural
and medical images and finds that self-supervised pretrain-
ing significantly outperforms supervised pretraining. The
paper proposes the use of multiple images per medical case
to enhance data augmentation for self-supervised learning,
which boosts the performance of image classifiers even fur-
ther. Self-supervised pretraining is much more scalable than
supervised pretraining since class label annotation is not re-
quired. A natural next step for this line of research is to in-
vestigate the limit of self-supervised pretraining by consid-
ering massive unlabeled medical image datasets. Another
research direction concerns the transfer of self-supervised
learning from one imaging modality and task to another.
We hope this paper will help popularize the use of self-
supervised approaches in medical image analysis yielding
improvements in label efficiency across the medical field.
Acknowledgement
We would like to thank Yuan Liu for valuable feedback
on the manuscript. We are also grateful to Jim Winkens,
Megan Wilson, Umesh Telang, Patricia Macwilliams, Greg
Corrado, Dale Webster, and our collaborators at DermPath
AI for their support of this work.
References
[1] Laith Alzubaidi, Mohammed A Fadhel, Omran Al-Shamma,
Jinglan Zhang, J. Santamaría, Ye Duan, and Sameer R
Oleiwi. Towards a better understanding of transfer learn-
ing for medical imaging: a case study. Applied Sciences,
10(13):4523, 2020. 2
[2] Philip Bachman, R Devon Hjelm, and William Buchwalter.
Learning representations by maximizing mutual information
across views. In Advances in Neural Information Processing
Systems, pages 15535–15545, 2019. 2
[3] Wenjia Bai, Chen Chen, Giacomo Tarroni, Jinming Duan,
Florian Guitton, Steffen E Petersen, Yike Guo, Paul M
Matthews, and Daniel Rueckert. Self-supervised learning for
cardiac MR image segmentation by anatomical position pre-
diction. In International Conference on Medical Image Com-
puting and Computer-Assisted Intervention, pages 541–549.
Springer, 2019. 3
[4] Suzanna Becker and Geoffrey E Hinton. Self-organizing
neural network that discovers surfaces in random-dot stere-
ograms. Nature, 355(6356):161–163, 1992. 3
[5] Krishna Chaitanya, Ertunc Erdil, Neerav Karani, and Ender
Konukoglu. Contrastive learning of global and local fea-
tures for medical image segmentation with limited annota-
tions. arXiv preprint arXiv:2006.10511, 2020. 3
[6] Liang Chen, Paul Bentley, Kensaku Mori, Kazunari Misawa,
Michitaka Fujiwara, and Daniel Rueckert. Self-supervised
learning for medical image analysis using image context
restoration. Medical image analysis, 58:101539, 2019. 14,
16
[7] Sihong Chen, Kai Ma, and Yefeng Zheng. Med3d: Transfer
learning for 3D medical image analysis, 2019. 2
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge-
offrey Hinton. A simple framework for contrastive learning
of visual representations. arXiv preprint arXiv:2002.05709,
2020. 1, 2, 3, 5, 14
[9] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad
Norouzi, and Geoffrey Hinton. Big self-supervised mod-
els are strong semi-supervised learners.
arXiv preprint
arXiv:2006.10029, 2020. 1, 2, 3, 5, 14, 16
[10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He.
Improved baselines with momentum contrastive learning.
arXiv preprint arXiv:2003.04297, 2020. 2
[11] Veronika Cheplygina, Marleen de Bruijne, and Josien PW
Pluim. Not-so-supervised: a survey of semi-supervised,
multi-instance, and transfer learning in medical image anal-
ysis. Medical image analysis, 54:280–296, 2019. 3
[12] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsuper-
vised visual representation learning by context prediction. In
Proceedings of the IEEE international conference on com-
puter vision, pages 1422–1430, 2015. 2
[13] Robin Geyer, Luca Corinzia, and Viktor Wegmayr. Transfer
learning by adaptive merging of multiple models. In M. Jorge
Cardoso, Aasa Feragen, Ben Glocker, Ender Konukoglu,
Ipek Oguz, Gozde Unal, and Tom Vercauteren, editors, Pro-
ceedings of Machine Learning Research. PMLR, 2019. 2
[14] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un-
supervised representation learning by predicting image rota-
tions. arXiv preprint arXiv:1803.07728, 2018. 2
[15] Mara Graziani, Vincent Andrearczyk, and Henning Müller.
Visualizing and interpreting feature reuse of pretrained cnns
for histopathology. In MVIP 2019: Irish Machine Vision
and Image Processing Conference Proceedings. Irish Pattern
Recognition and Classification Society, 2019. 1, 2
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Girshick. Momentum contrast for unsupervised visual repre-
sentation learning. arXiv preprint arXiv:1911.05722, 2019.
1
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Girshick. Momentum contrast for unsupervised visual rep-
resentation learning. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pages
9729–9738, 2020. 2
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In Proceed-
ings of the IEEE conference on computer vision and pattern
recognition, pages 770–778, 2016. 2, 3
[19] Xuehai He, Xingyi Yang, Shanghang Zhang, Jinyu Zhao,
Yichen Zhang, Eric Xing, and Pengtao Xie. Sample-efficient
deep learning for COVID-19 diagnosis based on CT scans.
medRxiv, 2020. 3
[20] Michal Heker and Hayit Greenspan. Joint liver lesion seg-
mentation and classification via transfer learning. arXiv
preprint arXiv:2004.12352, 2020. 1, 2
[21] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali
Razavi, Carl Doersch, SM Eslami, and Aaron van den Oord.
Data-efficient image recognition with contrastive predictive
coding. arXiv preprint arXiv:1905.09272, 2019. 2
[22] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon,
Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua
Bengio. Learning deep representations by mutual informa-
tion estimation and maximization. 2019. 2
[23] Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Sil-
viana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad
Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert:
A large chest radiograph dataset with uncertainty labels and
expert comparison. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 590–597, 2019.
1, 5, 6, 12, 13
[24] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan
Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby.
Big transfer (BiT): General visual representation learning.
arXiv preprint arXiv:1912.11370, 2019. 7, 14
[25] Hongwei Li, Fei-Fei Xue, Krishna Chaitanya, Shengda Liu,
Ivan Ezhov, Benedikt Wiestler, Jianguo Zhang, and Bjoern
Menze. Imbalance-aware self-supervised learning for 3d ra-
diomic representations. arXiv preprint arXiv:2103.04167,
2021. 3
[26] Gaobo Liang and Lixin Zheng. A transfer learning method
with deep residual network for pediatric pneumonia diag-
nosis. Computer methods and programs in biomedicine,
187:104964, 2020. 2
[27] Jingyu Liu, Gangming Zhao, Yu Fei, Ming Zhang, Yizhou
Wang, and Yizhou Yu. Align, attend and locate: Chest x-ray
diagnosis via contrast induced attention network with lim-
ited supervision. In Proceedings of the IEEE International
Conference on Computer Vision, pages 10632–10641, 2019.
3
[28] Quande Liu, Lequan Yu, Luyang Luo, Qi Dou, and
Pheng Ann Heng. Semi-supervised medical image classi-
fication with relation-driven self-ensembling model. IEEE
Transactions on Medical Imaging, 2020. 3
[29] Yuan Liu, Ayush Jain, Clara Eng, David H Way, Kang Lee,
Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Mar-
inho, Jessica Gallegos, Sara Gabriele, et al. A deep learning
system for differential diagnosis of skin diseases. Nature
Medicine, pages 1–9, 2020. 1, 2, 4, 6, 12
[30] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan,
Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe,
and Laurens van der Maaten. Exploring the limits of weakly
supervised pretraining. In Proceedings of the European Con-
ference on Computer Vision (ECCV), pages 181–196, 2018.
7
[31] Scott Mayer McKinney, Marcin Sieniek, Varun Godbole,
Jonathan Godwin, Natasha Antropova, Hutan Ashrafian,
Trevor Back, Mary Chesus, Greg C Corrado, Ara Darzi, et al.
International evaluation of an AI system for breast cancer
screening. Nature, 577(7788):89–94, 2020. 1, 2
[32] Afonso Menegola, Michel Fornaciali, Ramon Pires,
Flávia Vasques Bittencourt, Sandra Avila, and Eduardo
Valle. Knowledge transfer for melanoma screening with
deep learning. In 2017 IEEE 14th International Symposium
on Biomedical Imaging (ISBI 2017), pages 297–300. IEEE,
2017. 1, 2
[33] Ishan Misra and Laurens van der Maaten. Self-supervised
learning of pretext-invariant representations. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 6707–6717, 2020. 2
[34] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang.
What is being transferred in transfer learning? Advances
in Neural Information Processing Systems, 33, 2020. 5, 12,
13
[35] Mehdi Noroozi and Paolo Favaro. Unsupervised learning
of visual representations by solving jigsaw puzzles. In
European Conference on Computer Vision, pages 69–84.
Springer, 2016. 2
[36] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre-
sentation learning with contrastive predictive coding. arXiv
preprint arXiv:1807.03748, 2018. 2
[37] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy
Bengio. Transfusion: Understanding transfer learning for
medical imaging. In Advances in neural information pro-
cessing systems, pages 3347–3357, 2019. 2, 5, 12, 13, 18
[38] Pierre Sermanet, Corey Lynch, Yevgen Chebotar, Jasmine
Hsu, Eric Jang, Stefan Schaal, and Sergey Levine. Time-
contrastive networks: Self-supervised learning from video.
In IEEE International Conf. on Robotics and Automation
(ICRA), pages 1134–1141. IEEE, 2018. 3
[39] Hari Sowrirajan, Jingbo Yang, Andrew Y Ng, and Pranav
Rajpurkar. Moco pretraining improves representation and
transferability of chest X-ray models. arXiv:2010.05352,
2020. 3
[40] Hannah Spitzer, Kai Kiwitz, Katrin Amunts, Stefan Harmel-
ing, and Timo Dickscheid. Improving cytoarchitectonic seg-
mentation of human brain areas with self-supervised siamese
networks. In International Conference on Medical Image
Computing and Computer-Assisted Intervention, pages 663–
671. Springer, 2018. 3
[41] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Con-
trastive multiview coding. arXiv preprint arXiv:1906.05849,
2019. 2
[42] Michael Tschannen, Josip Djolonga, Marvin Ritter, Ar-
avindh Mahendran, Neil Houlsby, Sylvain Gelly, and Mario
Lucic. Self-supervised learning of video-induced visual in-
variances. In 2020 IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition (CVPR). IEEE Computer So-
ciety, 2020. 3
[43] Dong Wang, Yuan Zhang, Kexin Zhang, and Liwei Wang.
Focalmix: Semi-supervised learning for 3d medical image
detection. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages 3951–
3960, 2020. 3
[44] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mo-
hammadhadi Bagheri, and Ronald M Summers. Chestx-
ray8: Hospital-scale chest x-ray database and benchmarks on
weakly-supervised classification and localization of common
thorax diseases. In Proceedings of the IEEE conference on
computer vision and pattern recognition, pages 2097–2106,
2017. 5
[45] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
Unsupervised feature learning via non-parametric instance
discrimination. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 3733–
3742, 2018. 2
[46] Huidong Xie, Hongming Shan, Wenxiang Cong, Xiaohua
Zhang, Shaohua Liu, Ruola Ning, and Ge Wang. Dual net-
work architecture for few-view CT-trained on imagenet data
and transferred for medical imaging. In Developments in
X-Ray Tomography XII, volume 11113, page 111130V. In-
ternational Society for Optics and Photonics, 2019. 1, 2
[47] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu Chang. Un-
supervised embedding learning via invariant and spreading
instance feature. In Proceedings of the IEEE Conference on
computer vision and pattern recognition, pages 6210–6219,
2019. 2
[48] Yang You, Igor Gitman, and Boris Ginsburg. Large
batch training of convolutional networks. arXiv preprint
arXiv:1708.03888, 2017. 5
[49] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
image colorization. In European conference on computer
vision, pages 649–666. Springer, 2016. 2
[50] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D
Manning, and Curtis P Langlotz. Contrastive learning of
medical visual representations from paired images and text.
arXiv preprint arXiv:2010.00747, 2020. 3
[51] Hong-Yu Zhou, Shuang Yu, Cheng Bian, Yifan Hu, Kai Ma,
and Yefeng Zheng. Comparing to learn: Surpassing ima-
genet pretraining on radiographs by comparing image repre-
sentations. In MICCAI, pages 398–407. Springer, 2020. 3
[52] Jiuwen Zhu, Yuexiang Li, Yifan Hu, Kai Ma, S Kevin Zhou,
and Yefeng Zheng. Rubik’s cube+: A self-supervised feature
learning framework for 3D medical image analysis. Medical
Image Analysis, page 101746, 2020. 3
[53] Xinrui Zhuang, Yuexiang Li, Yifan Hu, Kai Ma, Yujiu Yang,
and Yefeng Zheng. Self-supervised feature learning for 3D
medical images by playing a rubik’s cube. In International
Conference on Medical Image Computing and Computer-
Assisted Intervention, pages 420–428. Springer, 2019. 3