3D Semi-Supervised Learning with Uncertainty-Aware Multi-View Co-Training
Yingda Xia1*, Fengze Liu1, Dong Yang2, Jinzheng Cai3, Lequan Yu4,
Zhuotun Zhu1, Daguang Xu2, Alan Yuille1, Holger Roth2
1Johns Hopkins University  2NVIDIA  3University of Florida  4The Chinese University of Hong Kong
*Work done during an internship at NVIDIA.
Abstract
While making a tremendous impact in various fields,
deep neural networks usually require large amounts of la-
beled data for training which are expensive to collect in
many applications, especially in the medical domain. Un-
labeled data, on the other hand, is much more abundant.
Semi-supervised learning techniques, such as co-training,
could provide a powerful tool to leverage unlabeled data.
In this paper, we propose a novel framework, uncertainty-
aware multi-view co-training (UMCT), to address semi-
supervised learning on 3D data, such as volumetric data
from medical imaging. In our work, co-training is achieved
by exploiting multi-viewpoint consistency of 3D data. We
generate different views by rotating or permuting the 3D
data and utilize asymmetrical 3D kernels to encourage di-
versified features in different sub-networks. In addition, we
propose an uncertainty-weighted label fusion mechanism
to estimate the reliability of each view’s prediction with
Bayesian deep learning. As one view requires supervision
from other views in co-training, our self-adaptive ap-
proach computes a confidence score for the prediction of
each unlabeled sample in order to assign a reliable pseudo
label. Thus, our approach can take advantage of unlabeled
data during training. We show the effectiveness of our pro-
posed semi-supervised method on several public datasets
from medical image segmentation tasks (NIH pancreas &
LiTS liver tumor dataset). Meanwhile, a fully-supervised
method based on our approach achieved state-of-the-art
performances on both the LiTS liver tumor segmentation
and the Medical Segmentation Decathlon (MSD) challenge,
demonstrating the robustness and value of our framework,
even when fully supervised training is feasible.
1. Introduction
Deep learning has achieved great successes in various
computer vision tasks, such as 2D image recognition [20,
35, 36, 15, 17] and semantic segmentation [26, 8, 39, 9].
However, deep networks usually rely on large-scale labeled
datasets for training. When it comes to 3D data, such as
medical volumetric data and point clouds, human labeling
can be extremely costly, and often requires expert knowl-
edge. Take medical imaging as an example. With the rapid
growth in demand for finer and larger-scale computer-aided
diagnosis (CAD), 3D segmentation of medical images (such
as CT and MRI) is a critical step in biomedical image
analysis and surgical planning. How-
ever, well-annotated segmentation labels in medical images
require both high-level expertise of radiologists and care-
ful manual labeling of object masks or surface boundaries.
Therefore, semi-supervised approaches with unlabeled data
occupying a large portion of the training data are worth ex-
ploring.
In this paper, we aim to design a semi-supervised ap-
proach for 3D data, which can be applied to diverse data
sources, e.g. CT/MRI volumes and 3D point clouds. In-
spired by the success of co-training [5] and its extension
into single 2D images [30], we further extend this idea
into 3D. Typical co-training requires at least two views (i.e.
sources) of data, either of which should be sufficient to train
a classifier on. Co-training minimizes the disagreement
between views by having each view assign pseudo labels
to the others on unlabeled data. Blum and Mitchell [5] further proved that co-
training has PAC-like guarantees on semi-supervised learn-
ing with an additional assumption that the two views are
conditionally independent given the category. Since most
computer vision tasks have only one source of data, encour-
aging view differences is a crucial point for successful co-
training. For example, deep co-training [30] trains multiple
deep networks to act as different views by utilizing adver-
sarial examples [14] to address this issue. Another aspect of
co-training to emphasize is view confidence estimation. In
multi-view settings, given sufficient variance of each view,
the quality of each prediction is not guaranteed and bad
pseudo labels can be harmful if used in the training process.
Co-training could benefit from trusting reliable predictions
and degrading the unreliable ones. However, distinguishing
Figure 1: Overall framework of uncertainty-aware multi-view co-training (UMCT), best viewed in color. The multi-view
inputs of X are first generated through different transforms T, like rotations and permutations, before being fed into n deep
networks with asymmetrical 3D kernels. A confidence score c is computed for each view by uncertainty estimation and acts
as the weight to compute the pseudo labels Ŷ of other views (Eq. 6) after the inverse transform T^{-1} of the predictions. The
pseudo labels Ŷ for unlabeled data and the ground truth Y for labeled data are used as supervision during training.
reliable and unreliable predictions is challenging for unla-
beled data because of the lack of ground truth.
To address the above two important aspects, we pro-
pose an uncertainty-aware multi-view co-training (UMCT)
framework, shown in Fig. 1. First of all, we define the con-
cept of "view" in our work as a data-model combination
which combines the concepts of data source (classical co-
training) and deep network model (deep co-training). Al-
though only one source of data is available, we can still
introduce data-level view differences by exploring multi-
ple viewpoints of 3D data through spatial transformations,
such as rotation and permutation. Hence, our multi-view
approach naturally adapts to analyze 3D data and can be
integrated with the proposed co-training framework.
We further introduce the model-level view differences
by adapting 2D pre-trained models to asymmetric kernels
in 3D networks, such as 3 × 3 × 1 kernels. In this way,
we can not only utilize the 2D pre-trained weights but also
train the whole framework in a full 3D fashion [25]. Im-
portantly, such a design introduces 2D biases in each view
during training, leading to complementary feature repre-
sentations in different views. During the training process,
these disagreements between views are minimized through
3D co-training, which further boosts the performance of our
model.
Another key component is the view confidence estima-
tion. We propose to estimate the uncertainty of each view’s
prediction with Bayesian deep networks by adding dropout
into the architectures [13]. A confidence score is computed
based on epistemic uncertainty [19], which can act as a
weight for each prediction. After propagation through this
uncertainty-weighted label fusion module (ULF), a set of
more accurate pseudo labels can be obtained for each view,
which are used as the supervision signal for unlabeled data.
Our proposed approach is evaluated on the NIH pancreas
segmentation dataset and the training/validation set of LiTS
liver tumor segmentation challenge. It outperforms other
semi-supervised methods by a large margin. We further in-
vestigate the influence of our approach when applied in a
fully supervised setting, to see whether it can also assist
training for each branch with sufficient labeled data. A
fully-supervised method based on our approach achieved
state-of-the-art results on LiTS liver tumor segmentation
challenge and scored the second place in the Medical Seg-
mentation Decathlon challenge, without using complicated
data augmentation or model ensembles.
2. Related Work
Semi-supervised learning. Semi-supervised learning ap-
proaches aim at learning models with limited labeled data
and a large proportion of unlabeled data [5, 43, 3, 42].
Emerging semi-supervised approaches have been success-
fully applied to image recognition using deep neural net-
works [21, 31, 28, 1, 34, 7]. These algorithms mostly
rely on additional regularization terms to train the networks
to be resistant to some specific noise. A recent approach
[30] extended the co-training strategy to 2D deep networks
and multiple views, using adversarial examples to encour-
age view differences to boost performance.
Semi-supervised medical image analysis.
Cheplygina
et al. [10] mentioned that current semi-supervised medical
analysis methods fall into three types: self-training (teacher-
student models), co-training (with hand-crafted features)
and graph-based approaches (mostly applications of graph-
cut optimization). Bai et al. [2] introduced a deep net-
work based self-training framework with conditional ran-
dom field (CRF) based iterative refinements for medical
image segmentation. Zhou et al. [40] trained three 2D
networks from three planar slices of the 3D data and fused
them in each self-training iteration to get a stronger student
model. Li et al. [23, 24] extended the self-ensembling
approach, the π model [21], with 90-degree rotations, making
the network rotation-invariant. Generative adversarial network
(GAN) based approaches are also popular recently for med-
ical imaging [11, 18, 29].
Uncertainty estimation. Traditional approaches include
particle filtering and CRFs [4, 16]. For deep learning, un-
certainty is more often measured with Bayesian deep net-
works [13, 12, 19]. In our work, we emphasize the impor-
tance of uncertainty estimation in semi-supervised learning,
since most of the training data here is not annotated. We
propose to estimate the confidence of each view in our co-
training framework via Bayesian uncertainty estimation.
2D/3D hybrid networks. 2D networks and 3D networks
both have their advantages and limitations. The former
benefit from 2D pre-trained weights and well-studied archi-
tectures from natural image processing, while the latter bet-
ter explore 3D information by utilizing 3D convolutional ker-
nels. [37, 22] use either 2D probability maps or 2D feature
maps for building 3D models. [25] proposed a 3D archi-
tecture which can be initialized by 2D pre-trained models.
Moreover, [33, 41] illustrate the effectiveness of multi-
view training on 2D slices, even by simply averaging multi-
planar results, indicating that complementary latent information
exists in the biases of 2D networks. This inspired us to
train 3D multi-view networks with 2D initializations jointly
using an additional loss function for multi-view networks
which encourages each network to learn from one another.
3. Uncertainty-aware Multi-view Co-training
In this section, we introduce our framework of
uncertainty-aware multi-view co-training (UMCT). There
are two important properties for a successful deep net-
work based co-training: view difference and view reliabil-
ity. In the following sections, we will explain how they
are achieved in our 3D framework: a general mathematical
formulation of the approach is shown in Sec 3.1; then we
demonstrate how to encourage view differences in Sec 3.2,
and how to compute the confidence of each view by uncer-
tainty estimation in Sec 3.3.
3.1. Overall Framework
We consider the task of semi-supervised learning for 3D
data. Let S and U be the labeled and unlabeled dataset, re-
spectively. Let D = S ∪ U be the whole provided dataset.
We denote each labeled data pair as (X, Y) ∈ S and unla-
beled data as X ∈ U. The ground truth Y can either be a
class label (classification tasks) or a dense prediction map
(segmentation tasks).
Suppose for each input X, we can naturally generate N
different views of 3D data by applying a transformation Ti
(rotation or permutation), which will result in multi-view
inputs Ti(X), i = 1, ..., N. Such operations will introduce
a data-level view difference. N models f_i(·), i = 1, ..., N
are then trained over each view of data respectively. For
(X, Y) ∈ S, a supervised loss function L_sup is optimized
to measure the similarity between the prediction of each
view p_i(X) = T_i^{-1} ∘ f_i ∘ T_i(X) and Y:

L_{sup}(X, Y) = \sum_{i=1}^{N} L(p_i(X), Y),    (1)

where L is a standard loss function for a supervised learning
task (e.g., classification or segmentation). For the 3D segmentation
task, {p_i(X)}_{i=1}^{N} are the corresponding voxel-wise
prediction score maps after inverse rotation or permutation.
For unlabeled data, we construct a co-training assump-
tion under a semi-supervised setting. The co-training strat-
egy assumes the prediction on each view should reach a
consensus. So the prediction of each model can act as a
pseudo label to supervise other views in order to learn from
unlabeled data. However, since the prediction of each view
is expected to be diverse after boosting the view differences,
the quality of each view’s prediction needs to be measured
before generating trustworthy pseudo labels. This is ac-
complished by the uncertainty-weighted label fusion mod-
ule (ULF), which is introduced in Sec 3.3. With ULF, the
co-training loss for unlabeled data can be formulated as:
L_{cot}(X) = \sum_{i=1}^{N} L(p_i(X), \hat{Y}_i),    (2)

where

\hat{Y}_i = U_{f_1,...,f_N}(p_1(X), ..., p_{i-1}(X), p_{i+1}(X), ..., p_N(X))    (3)

is the pseudo label for the i-th view, and U_{f_1,...,f_N} is the ULF com-
putational function.
Overall, the combined loss function is:

\sum_{(X,Y) \in S} L_{sup}(X, Y) + \lambda_{cot} \sum_{X \in U} L_{cot}(X),    (4)

where λ_cot is a weight coefficient.
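This objective maps directly onto a single training step. Below is a minimal PyTorch-style sketch of Eqs. (1), (2) and (4); the names `views`, `transforms`, `ulf` and `criterion` are illustrative assumptions (the paper uses the Dice loss as L), not the authors' released code.

```python
def umct_loss(views, transforms, x_lab, y_lab, x_unlab, ulf, criterion, lam_cot=0.2):
    """Sketch of Eqs. (1), (2) and (4): supervised loss on labeled data plus a
    weighted co-training loss on unlabeled data.
    views      : list of per-view 3D networks f_i
    transforms : list of (T_i, T_i_inverse) spatial transform pairs
    ulf        : callable fusing the other views' predictions into a pseudo label
    criterion  : the task loss L (the paper uses the Dice loss)"""
    # p_i(X) = T_i^{-1} o f_i o T_i(X) for every view
    p_lab = [t_inv(f(t(x_lab))) for f, (t, t_inv) in zip(views, transforms)]
    p_unlab = [t_inv(f(t(x_unlab))) for f, (t, t_inv) in zip(views, transforms)]

    # Eq. (1): supervised loss on labeled data
    l_sup = sum(criterion(p, y_lab) for p in p_lab)

    # Eqs. (2)-(3): each view is supervised by a pseudo label fused from the others
    l_cot = 0.0
    for i, p_i in enumerate(p_unlab):
        others = [p for j, p in enumerate(p_unlab) if j != i]
        pseudo = ulf(others).detach()          # \hat{Y}_i, no gradient through it
        l_cot = l_cot + criterion(p_i, pseudo)

    # Eq. (4): total objective
    return l_sup + lam_cot * l_cot
```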
Algorithm 1 Uncertainty-aware Multi-view Co-training
Input:
    Labeled dataset S and unlabeled dataset U
    Uncertainty-weighted label fusion module (ULF) U_{f_1,...,f_N}(·)
Output:
    Model of each view f_1, ..., f_N
 1: while stopping criterion not met:
 2:     Sample batch b_l = (x_l, y_l) ∈ S and batch b_u = (x_u) ∈ U
 3:     Generate multi-view inputs T_i(x_l) and T_i(x_u), i ∈ {1, ..., N}
 4:     for i in all views:
 5:         Compute predictions for each view and apply the inverse rotation or permutation:
                p_i(x_l) ← T_i^{-1} ∘ f_i ∘ T_i(x_l)
                p_i(x_u) ← T_i^{-1} ∘ f_i ∘ T_i(x_u)
 6:     for i in all views:
 7:         Compute pseudo labels for x_u with ULF:
                ŷ_i ← U_{f_1,...,f_N}(p_1(x_u), ..., p_{i-1}(x_u), p_{i+1}(x_u), ..., p_N(x_u))
 8:     L_sup = (1/|b_l|) Σ_{(x_l, y_l) ∈ b_l} [ Σ_{i=1}^{N} L(p_i(x_l), y_l) ]
 9:     L_cot = (1/|b_u|) Σ_{x_u ∈ b_u} [ Σ_{i=1}^{N} L(p_i(x_u), ŷ_i) ]
10:     L = L_sup + λ_cot L_cot
11:     Compute the gradient of the loss function L and update network parameters {θ_i} by back-propagation
12: return f_1, ..., f_N
3.2. Encouraging View Differences
A successful co-training requires the "views" to be dif-
ferent and learn complementary information in the training
procedure. In our framework, several techniques are pro-
posed to encourage view differences, including both data-
level and feature-level.
3D multi-view generation. As stated above, in order to
generate multi-view data, we transpose X into multiple
views by rotations or permutations T. (A permutation rear-
ranges the dimensions of an array in a specific order.) For
three-view co-training, these can correspond to the coronal,
sagittal and axial views in medical imaging, which matches
the multi-planar reformatted views that radiologists typi-
cally use to analyze the image. Such operation is a natural
way to introduce data-level view difference.
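For the three-view case, the transforms can be implemented as simple axis permutations of the input tensor. The sketch below assumes an (N, C, D, H, W) layout; the mapping of each permutation to an anatomical plane is illustrative, not the authors' exact code.

```python
def make_views(x):
    """Generate three views of a volume x of shape (N, C, D, H, W) by permuting
    the spatial axes, and return the matching inverse transforms T^{-1}."""
    perms = [
        (0, 1, 2, 3, 4),  # identity (e.g. axial)
        (0, 1, 3, 2, 4),  # swap D and H (e.g. coronal)
        (0, 1, 4, 3, 2),  # swap D and W (e.g. sagittal)
    ]
    views, inverses = [], []
    for p in perms:
        views.append(x.permute(*p).contiguous())
        inv = tuple(p.index(d) for d in range(5))  # inverse permutation
        inverses.append(lambda y, inv=inv: y.permute(*inv).contiguous())
    return views, inverses

# Example: map each view's prediction back to the common frame before fusion
# views, invs = make_views(volume)
# preds = [invs[i](f_i(views[i])) for i, f_i in enumerate(networks)]
```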
Asymmetric 3D kernels and 2D initialization. The co-
training assumption encourages models to make similar pre-
dictions on both S and U, which can potentially lead to the col-
lapsed neural networks mentioned in [30]. To address this
problem, we further encourage view difference at feature
level by designing a task-specific model. We propose to
use asymmetric 3D models initialized with 2D pre-trained
weights as the backbone network of each view to encourage
diverse feature learning in each view. The simplest
version of an asymmetric 3D model uses n × n × 1
convolutional kernels instead of the n × n × n 3D kernels used in
common 3D networks. This structure also makes the model
convenient to be initialized with 2D pre-trained weights but
fine-tuned in a 3D fashion [25].
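A minimal sketch of how a pre-trained 2D convolution could be carried over into such an asymmetric 3D layer is shown below; the helper name and the convention that the in-plane kernel acts on the last two axes of an (N, C, D, H, W) tensor are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

def conv3d_from_2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Build an asymmetric 3D convolution (e.g. 3 x 3 x 1) initialized from a
    pre-trained 2D convolution; the depth-wise kernel size is kept at 1."""
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(1, kh, kw),
        stride=(1,) + tuple(conv2d.stride),
        padding=(0,) + tuple(conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kh, kw) -> (out, in, 1, kh, kw): reuse the 2D weights directly
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2))
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```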
3.3. Computing Reliable Pseudo Labels for Unlabeled
Data with Uncertainty Estimation
Encouraging view difference means enlarging the vari-
ance of each view's prediction, var(p_i(X)). This raises
the question of which view we should trust for unlabeled
data during co-training. Bad predictions from one view
may hurt the training procedure of other views through
pseudo-label assignments. Meanwhile, trusting a good
prediction as a "strong" label during co-training will boost
the overall semi-supervised learning performance. Instead of assigning
a pseudo-label for each view directly from the predictions
of other views, we propose an adaptive approach, namely
uncertainty-weighted label fusion module (ULF), to fuse the
outputs of different views. ULF is built up of all the views,
takes the predictions of each view as input, and then outputs
a set of pseudo labels for each view.
Motivated by the uncertainty measurements in Bayesian
deep networks, we measure the uncertainty of each view
branch for each training sample after turning our model into
a Bayesian deep network by adding dropout layers. Be-
tween the two types of uncertainty candidates – aleatoric
and epistemic uncertainties, we choose to compute the epis-
temic uncertainty, which arises from not having enough train-
ing data [19]. Such a measurement fits the semi-supervised
learning goal: to improve model generalizability by explor-
ing unlabeled data. Suppose y is the output of a Bayesian
deep network, then the epistemic uncertainty can be esti-
mated by the following equation:
U_e(y) \approx \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k^2 - \left( \frac{1}{K} \sum_{k=1}^{K} \hat{y}_k \right)^2,    (5)

where \{\hat{y}_k\}_{k=1}^{K} are a set of sampled outputs.
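In practice this corresponds to Monte Carlo dropout: keep dropout active at test time, draw K stochastic forward passes, and take the variance of Eq. (5). A minimal PyTorch-style sketch under these assumptions (softmax outputs, dropout toggled via train mode) is shown below.

```python
import torch

@torch.no_grad()
def epistemic_uncertainty(model, x, k=10):
    """Estimate voxel-wise epistemic uncertainty (Eq. 5) by Monte Carlo dropout:
    run K stochastic forward passes with dropout active and take the variance."""
    model.train()  # keep dropout stochastic (in practice, only the dropout layers)
    probs = torch.stack([model(x).softmax(dim=1) for _ in range(k)], dim=0)
    mean = probs.mean(dim=0)
    var = (probs ** 2).mean(dim=0) - mean ** 2   # E[y^2] - (E[y])^2
    return mean, var
```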
With a transformation function h(·), we can transform
the uncertainty score into a confidence score c(y) =
h(U_e(y)). After normalization over all views, the confi-
dence score acts as the weight for each prediction when it is
assigned as a pseudo label for other views. The pseudo label
Ŷ_i assigned to a single view i can be written as

\hat{Y}_i = \frac{\sum_{j \neq i} c(p_j(X)) \, p_j(X)}{\sum_{j \neq i} c(p_j(X))}.    (6)
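A minimal sketch of this fusion for one target view is given below; the scalar confidence as the reciprocal of the volume-summed uncertainty follows the implementation choices described in Sec. 3.4, and the function names are illustrative.

```python
import torch

def fuse_pseudo_label(other_preds, other_uncerts, eps=1e-8):
    """Uncertainty-weighted label fusion (Eq. 6) for one target view: average the
    other views' predictions, weighted by normalized confidence scores."""
    conf = torch.stack([1.0 / (u.sum() + eps) for u in other_uncerts])  # c(p_j)
    conf = conf / conf.sum()                        # normalize over the other views
    preds = torch.stack(other_preds, dim=0)         # stacked score maps
    weights = conf.view(-1, *([1] * (preds.dim() - 1)))
    return (weights * preds).sum(dim=0)             # pseudo label for view i
```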
3.4. Implementation Details
Network structure. In practice, we build an encoder-
decoder network based on ResNet-18 [15] and modify it
into a 3D version. For the encoder part, the first 7 × 7 convo-
lutional layer is inflated into 7 × 7 × 3 kernels for low-level
3D feature extraction, similar to [25]. All other 3 × 3 con-
volutional layers are simply changed into 3 × 3 × 1 kernels that can
be trained as 3D convolutional layers. In the decoder part,
we adopt 3 skip connections from the encoder followed by
3D convolutions to give low-level cues for more accurate
boundary prediction needed in segmentation tasks.
Uncertainty-weighted label fusion. In terms of view
confidence estimation, we modify the network into a
Bayesian deep network by adding dropout operations. We
sample K = 10 outputs for each view and compute voxel-
wise epistemic uncertainty. Since the voxel-wise uncer-
tainty can be inaccurate, we sum over the whole volume
to finalize the uncertainty for each view. We simply use the
reciprocal for the confidence transformation function h(·)
to compute the confidence score. The pseudo label assigned
for one view is a weighted average of all predictions of mul-
tiple views based on the normalized confidence score.
Data pre-processing. All the training and testing data are
firstly re-sampled to an isotropic volume resolution of 1.0
mm for each axis. Data intensities are normalized to have
zero mean and unit variance. We adopt patch-based train-
ing, and sample training patches of size 96³ with a 1:1 ratio
between foreground and background. Unlike other 3D seg-
mentation approaches, our approach does not rely on any
kind of 3D data augmentation due to the effectiveness of
initialization with 2D pre-trained weights.
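A minimal sketch of this pre-processing step (isotropic resampling followed by intensity normalization) is given below; the use of scipy.ndimage.zoom and the spacing input format are assumptions, and foreground/background patch sampling is omitted.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess(volume, spacing, target=(1.0, 1.0, 1.0)):
    """Resample a CT volume to 1 mm isotropic resolution and normalize intensities
    to zero mean and unit variance. `spacing` is the voxel spacing in mm per axis."""
    factors = [s / t for s, t in zip(spacing, target)]
    volume = zoom(volume.astype(np.float32), factors, order=1)  # linear interpolation
    return (volume - volume.mean()) / (volume.std() + 1e-8)
```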
Training. The used training algorithm is shown in Algo-
rithm 1. Note that under the semi-supervised setting, the
co-training loss is only minimized on the unlabeled data.
It is not applied to labeled data as the segmentation loss is
already optimized to force the network’s prediction to be
close to the ground truth. However, we will later show that
the co-training loss can also help each sub-network to learn
better features on labeled data. The Dice loss [27] is used
as the segmentation loss function. It performs robustly with
imbalanced training data and mitigates the gap between the
training objective and commonly used evaluation metrics,
such as Dice score.
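For reference, a minimal binary soft Dice loss in the spirit of [27] could look as follows; the smoothing constant and the treatment of the multi-class case are assumptions of this sketch.

```python
import torch

def dice_loss(logits, target, eps=1e-5):
    """Soft Dice loss for binary 3D segmentation.
    logits: (N, 1, D, H, W) raw outputs; target: same shape with values in {0, 1}."""
    prob = torch.sigmoid(logits)
    dims = (1, 2, 3, 4)
    inter = (prob * target).sum(dims)
    union = prob.sum(dims) + target.sum(dims)
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()
```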
We firstly train the views separately on the labeled data
and then conduct our co-training by fine-tuning the weights.
The stochastic gradient descent (SGD) optimizer is used in
both stages. In the view-wise training stage, a constant
learning rate policy at 7 × 10⁻³, momentum of 0.9, and
weight decay of 4 × 10⁻⁵ for 20k iterations is used. In
the co-training stage, we adopt a constant learning rate pol-
icy at 1 × 10⁻³, with the parameter λ_cot = 0.2, and train
for 5k iterations. The batch size is 4 in both stages. Our
framework is implemented in PyTorch. The whole training
procedure takes ∼12 hours on 4 NVIDIA Titan V GPUs.
Testing. In the testing phase, there are two choices to fi-
nalize the output results: either to choose one single view
prediction or to ensemble the predictions of the multi-view
outputs. We will report both results in the following sec-
tions for a fair comparison with the baselines, since the
multi-view networks can be thought of as similar to an
ensemble of several single-view models. The experimen-
tal results show that our model improves the performance
in both settings (single view and multi-view ensemble). We
use sliding-window testing and re-sample our testing results
back to the original image resolution to obtain the final re-
sults. Testing time for each case ranges from 1 minute to 5
minutes depending on the size of the input volume.
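A minimal sketch of such patch-based sliding-window inference is shown below; the 50% window overlap and uniform averaging of overlapping predictions are assumptions, and resampling back to the original resolution is omitted.

```python
import torch

def _starts(length, patch, stride):
    # window start offsets along one axis, always including a final border-aligned window
    pts = list(range(0, max(length - patch, 0) + 1, stride))
    if pts[-1] != max(length - patch, 0):
        pts.append(max(length - patch, 0))
    return pts

@torch.no_grad()
def sliding_window_predict(model, volume, patch=(96, 96, 96), stride=(48, 48, 48)):
    """Patch-wise sliding-window inference over one volume of shape (1, C, D, H, W),
    averaging softmax predictions in overlapping regions."""
    _, _, D, H, W = volume.shape
    out, count = None, torch.zeros(1, 1, D, H, W, device=volume.device)
    for z in _starts(D, patch[0], stride[0]):
        for y in _starts(H, patch[1], stride[1]):
            for x in _starts(W, patch[2], stride[2]):
                crop = volume[..., z:z + patch[0], y:y + patch[1], x:x + patch[2]]
                pred = model(crop).softmax(dim=1)
                if out is None:
                    out = torch.zeros(1, pred.shape[1], D, H, W, device=volume.device)
                out[..., z:z + patch[0], y:y + patch[1], x:x + patch[2]] += pred
                count[..., z:z + patch[0], y:y + patch[1], x:x + patch[2]] += 1
    return out / count.clamp(min=1)
```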
4. Experiments
In this section, our framework is tested on two popular
organ segmentation datasets: the NIH pancreas segmentation
dataset [32] and the LiTS liver tumor segmentation dataset,
under semi-supervised settings. Moreover, noticing that our
approach is also applicable to fully-supervised settings, we
apply it to supervised training and show the benefits even
when all the training data is labeled.
4.1. Semi-supervised Segmentation
4.1.1 NIH Pancreas Segmentation Dataset
The NIH pancreas segmentation dataset contains 82 abdom-
inal CT volumes. The width and height of each volume are
512, while the axial view slice number can vary from 181
to 466. Under semi-supervised settings, the dataset is ran-
domly split into 20 testing cases and 62 training cases. We
report the results of 10% labeled training cases (6 labeled
and 56 unlabeled), 20% labeled training cases (12 labeled
and 50 unlabeled) and 100% labeled training cases. In the
results, the performance of one single view (the average of
all single views’ DSC scores) is reported for a fair compar-
ison, not a multi-view ensemble (see Table 1).
The segmentation accuracy is evaluated by the Dice-
Sørensen coefficient (DSC). A large margin improvement
over the fully supervised baselines in terms of single view
performance can be observed, proving that our approach ef-
fectively leverages the unlabeled data. A Wilcoxon signed-
rank test comparing to the supervised baseline’s results
(20% labeling) shows significant improvements of our ap-
proach with a p-value of 0.0022. Fig. 2 shows 3 cases in
2D and 3D with ITK-SNAP [38]. In addition, our model
is compared with the state-of-the-art semi-supervised ap-
proach of deep co-training [30] and recent semi-supervised
medical segmentation approaches. In particular, we com-
pare to Li et al. [23], who extended the π model [21]
with transformation consistent constraints; and Zhou et
al. [40] who extended the self-training procedure by itera-
tively updating pseudo labels on unlabeled data using a fu-
Method            Backbone        10% lab   20% lab
Supervised        3D ResNet-18    66.75     75.79
DMPCT [40]        2D ResNet-101   63.45     66.75
DCT [30] (2v)     3D ResNet-18    71.43     77.54
TCSE [23]         3D ResNet-18    73.87     76.46
Ours (2 views)    3D ResNet-18    75.63     79.77
Ours (3 views)    3D ResNet-18    77.55     80.14
Ours (6 views)    3D ResNet-18    77.87     80.35

Table 1: Comparison to other semi-supervised approaches
on the NIH dataset (DSC, %). Note that we use the same back-
bone network as [23] [30]. Here, "2v" means two views.
For our approach, the average of all single views' DSC
scores is reported for a fair comparison, not a multi-view
ensemble. "10% lab" and "20% lab" mean the percentage
of labeled data used for training.
sion of three 2D networks trained on cross-sectional views.
The results reported in Tab. 1 are based on our careful re-
implementations in order to allow a fair comparison.
Our implementations of [30] and [23] are operated on
the axial view of our single view branch with the same back-
bone structure (our customized 3D ResNet-18 model). Our
co-training approach achieves a gain of about 4% in the
10%-labeled (90% unlabeled) setting. We also find that the
improvements of other approaches are small in the 20%
setting (only 1% over the baseline), while ours is still
capable of achieving a reasonable performance gain as
the number of labeled cases grows. For [40] with a
2D approach, their experiment is conducted on 50 labeled
cases. We modify their backbone network (FCN [26]) into
DeepLab v2 [8], in order to fit our stricter settings (6 and
12 labeled cases). This modification leads to an improve-
ment of 3% in 100% fully supervised training (from 73%
to 76%). Their approach outputs the result after using an
ensemble of three models.
Since the main difference in two-view learning between
our approach and [30] is the way of encouraging view dif-
ferences, the results illustrate the effectiveness of our multi-
view analysis combined with asymmetric feature learning
on 3D co-training. With more views, our uncertainty-
weighted label fusion can further improve co-training per-
formance. We will report ablation studies on it in sec-
tion 4.3.
Furthermore, we performed a study on data utilization
efficiency of our approach compared to the baseline fully-
supervised network (3D ResNet-18). Fig. 3 shows the per-
formance change according to labeled data proportion on
NIH pancreas segmentation. From the plot, it can be seen
that when labeled data is over 80%, simple supervised train-
ing (with 3D ResNet-18) suffices. Note that our approach
with 20% labeled data (DSC 80.35%) performs better than
60% supervised training (DSC 78.95%). At such a percent-
age, our approach can save ∼ 70% of the labeling efforts.
Figure 2: 2D and 3D visualizations for 3 cases in the test set
under 10% labeled data setting. DSC score is largely im-
proved by our co-training approach. Best viewed in color.
Figure 3: Performance plot of our semi-supervised ap-
proach over the fully-supervised baseline on different la-
beled data ratio.
4.1.2 LiTS Liver Tumor Segmentation Challenge
We also report our results on the training set of LiTS Liver
Tumor Segmentation Challenge. The 131 cases are ran-
domly split into 100 training and 31 testing cases. The input
volumes are all abdominal CT scans. The segmentation tar-
get contains 2 classes: liver (large and less challenging) and
lesion (tumors with large variance in size, more challeng-
ing). Our semi-supervised settings are the same as those
                      Liver              Lesion
Method             Single     MV      Single     MV
100% Supervised    95.07     95.50    64.00     65.65
10% Supervised     92.23     93.17    43.98     48.90
20% Supervised     93.06     94.52    50.39     53.15
our UMCT 10%       92.98     93.53    49.79     52.14
our UMCT 20%       94.40     94.81    57.76     59.60

Table 2: Our 3-view co-training on the LiTS dataset (DSC, %).
"Single" means the DSC score of one single view, while
"MV" means multi-view ensemble. The first three rows are
our fully supervised baselines. The last two rows are the
results of our approach, with 10% labeled data and 20%
labeled data. We report both liver and lesion (tumor) re-
sults. The improvements using UMCT over the correspond-
ing baselines are significant, especially for the performance
on liver lesions.
Method                Liver   Lesion
3D AH-Net [25]        96.3    63.4
H-DenseUNet [22]      96.1    72.2
3 views UMCT (ours)   95.9    72.6
Table 3: Results of fully supervised training with UMCT on
LiTS test set (DSC, %).
used in NIH pancreas dataset experiments. We report re-
sults on 10% labeled data (10 labeled cases and 90 unla-
beled cases) and 20% labeled data (20 labeled cases and 80
unlabeled cases) with 3-view co-training. The performance
of both the single view and the multi-view ensemble improves, as
shown in Table 2. The improvement on liver segmenta-
tion is limited (less than 1%) because the liver segmenta-
tion is already very good with a single view only. If we
only use 10% data for supervised training, we can already
reach 93.17% after fusing the three views’ results by major-
ity voting. However, we see a large margin improvement in
the more challenging lesion segmentation, especially under
"our UMCT 20%" settings (even more than under "our UMCT
10%"). We hypothesize that the case variance of lesions is
larger than that of normal organs (pancreas, liver, etc.). With only
10 cases in our labeled set, L_sup can misguide the training
procedure and bias the model toward the labeled set (i.e.,
overfitting). However, using L_cot to explore the unlabeled
part of the dataset, we can train a more robust model com-
pared to fully supervised training using the same number
of labeled cases. Overall, the improvements are significant
even on the challenging liver lesions with large case vari-
ance.
4.2. Application to Fully Supervised Settings
Our approach can also be applied to fully supervised
training. On semi-supervised tasks, we do not see a clear
improvement when enforcing L_cot on labeled data because
Task             DSC           NSD
Hepatic Vessel   0.63 / 0.64   0.83 / 0.72
Spleen           0.96          1.00
Colon            0.56          0.66

Table 4: DSC and NSD (normalized surface distance)
scores on the final validation phase of the Medical Segmenta-
tion Decathlon challenge (some tasks have multi-class labels).
of the limited quantity of labeled data. However, when labeled data is
sufficient, we want to see if our multi-view co-training can
guide each 2D-initialized branch to help each other by en-
forcing 3D consistency. The final framework for fully su-
pervised training is: we firstly train the sub-networks of dif-
ferent views separately, and then fine-tune with the follow-
ing loss function:
L = \sum_{(X,Y) \in S} [L_{sup}(X, Y) + \lambda L_{cot}(X)]    (7)
On the LiTS challenge, a fully-supervised method
based on our 3-view co-training method achieved state-
of-the-art results in terms of tumor segmentation DSC score
and comparable liver segmentation results (see Table 3).
On the Medical Segmentation Decathlon challenge, a fully-
supervised method based on our 3-view co-training method
achieved second place in the final testing phase (see
Table 4). One goal of the challenge was that, without any
hyperparameter changes allowed, a favored model has to be
generalizable and robust across various segmentation tasks. Our
model can satisfy such requirements because we have the
following features. First, our model, although trained on 3D
patches, is initialized from 2D pre-trained models. We will
further discuss the influence of 2D pre-trained models in
the next section. Second, we have three views of networks
that use L_cot to help each other gain more 3D informa-
tion through the multi-view co-training process. These two
characteristics boost the robustness of our model on super-
vised volumetric segmentation tasks.
4.3. Ablation Studies
In this subsection, we will provide several ablation stud-
ies for each component of the proposed UMCT framework.
On the backbone network structure. Our backbone se-
lection (2D-initialized, heavily asymmetric 3D architecture)
will introduce 2D biases in the training phase while benefit-
ing from such 2D pre-trained models. We have claimed that
we can utilize the complementary information from 3-view
networks while exploring the unlabeled data with UMCT.
We give an ablation study on the network structure, which
includes a V-Net [27], a common 3D segmentation network
with symmetric kernels in all dimensions. This net-
work also has a similar number of parameters to our
customized 3D ResNet-18 (see Table 5a). The results of V-
Net show that our multi-view co-training can be generally
Backbone        Params   Sup 10%   Semi 10%
VNet            9.44M    66.97     76.89
3D ResNet-18    11.79M   66.76     77.55
3D ResNet-50    27.09M   67.96     78.74

(a) Ablation studies on backbone structures (3 views UMCT).

Method            Coronal   Sagittal   Axial   MV
100% Supervised   82.13     81.41      82.53   84.18
UMCT Supervised   82.61     82.35      83.44   84.61

(b) Our UMCT on 100% labeled data from the NIH data. The first row is
pure single view training, while the second is UMCT. "Coronal", "Sagittal"
and "Axial" correspond to the three views of a CT scan in radiology.

Views           DSC (%)
2 views         75.63
3 views         76.49
3 views + ULF   77.55
6 views         76.94
6 views + ULF   77.87

(c) On uncertainty-weighted label fusion (ULF) with different numbers of
views in training (10% labeled data, 3D ResNet-18).

λ_cot   DSC (%)
0.1     77.28
0.2     77.55
0.5     77.38

(d) λ_cot (10% labeled data, 3 views, 3D ResNet-18).

Model              w/o Init   w/ Init
Deeplab-3D         76.09      80.11
Our 3D ResNet-50   78.70      82.53

(e) On the influence of initialization for 3D models. Experiments are done
on the axial view of the NIH dataset.

Table 5: Ablation studies for our UMCT on the NIH dataset.
and successfully applied to 3D networks. Although the
fully supervised results are similar, our ResNet-18
outperforms V-Net by more than 1%, illustrating that our
asymmetric design, encouraging view differences, brings
advantages over traditional 3D deep networks.
On uncertainty-weighted label fusion (ULF), number of
views and parameter λ_cot. ULF plays an important role
in pruning out bad predictions and keeping good ones as su-
pervision to train other views. Table 5c gives the single-view
results of the multi-view experiments. The performance be-
comes better with more views. For two views, ULF is not
applicable since we can only obtain one view prediction as
a pseudo label for the other view. For three views and six
views, ULF helps boost the performance, illustrating the ef-
fectiveness of our proposed approach for view confidence
estimation. We also tried different values of λ_cot in Ta-
ble 5d, where performance variance is not large. We choose
λ_cot = 0.2 in our experiments.
On fully supervised training. Table 5b shows how our
multi-view co-training helps with the fully supervised train-
ing on the NIH dataset. The model used is our 3D ResNet-
50 with 3 views co-training. Our approach improves the
results on each single model, as well as the multi-view en-
semble results.
On network initialization. We address the importance of
initialization for training a robust 3D model. This subsec-
tion provides an ablation study on the influence of initializa-
tion of 3D networks in the field of 3D segmentation, which
is often neglected by previous works. We trained two 3D
ResNet-50 models (in our settings) on the axial view of the NIH dataset
with 100% labeled data. Here, one model uses 2D ini-
tialization, while the other is trained from scratch. We also
conduct similar comparisons with a DeepLab-3D model,
where we directly change each 2D kernel of DeepLab(v2)-
ResNet101 model into a 3D kernel. We initialize DeepLab-
3D in the same way as [6]. Table 5e shows the comparison.
Those models with initialization perform remarkably bet-
ter. Thus, we believe that initialization is helpful to train 3D
models for volumetric segmentation. Using weights from
the pre-trained models of natural image tasks is beneficial
for the learning process. It would be a promising research direc-
tion to investigate approaches on 3D network initialization
or providing 3D models pre-trained on large-scale datasets.
5. Conclusion
In this paper, we presented uncertainty-aware multi-view
co-training (UMCT), aimed at 3D semi-supervised learn-
ing. We extended dual view co-training and deep co-
training on 2D images into multi-view 3D training, natu-
rally introducing data-level view differences. We also pro-
posed asymmetrical 3D kernels initialized from 2D pre-
trained models to introduce feature-level view differences.
In multi-view settings, an uncertainty-weighted label fu-
sion module (ULF) is built to estimate the accuracy of
each view prediction by Bayesian uncertainty measurement.
Epistemic uncertainty was estimated after transforming our
model into a Bayesian deep network by adding dropout.
This module gives a larger weight to more confident predic-
tions and further boosts the performance of multi-view pre-
dictions. Experiments under the semi-supervised setting were
performed on the NIH pancreas dataset and the LiTS liver
tumor dataset. Other approaches were outperformed by
a large margin on the NIH dataset. We also applied co-
training objectives on labeled data under fully supervised
settings. The results were also promising, illustrating the ef-
fectiveness of multi-view co-training on 2D-initialized net-
works.
Acknowledgements We thank Dr. Lingxi Xie, Siyuan
Qiao and Yuyin Zhou for instructive discussions.
References
[1] P. Bachman, O. Alsharif, and D. Precup. Learning with
pseudo-ensembles. In Advances in Neural Information Pro-
cessing Systems, pages 3365–3373, 2014.
[2] W. Bai, O. Oktay, M. Sinclair, H. Suzuki, M. Rajchl, G. Tar-
roni, B. Glocker, A. King, P. M. Matthews, and D. Rueck-
ert. Semi-supervised learning for network-based cardiac mr
image segmentation. In International Conference on Med-
ical Image Computing and Computer-Assisted Intervention,
pages 253–260. Springer, 2017.
[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regular-
ization: A geometric framework for learning from labeled
and unlabeled examples. Journal of machine learning re-
search, 7(Nov):2399–2434, 2006.
[4] A. Blake, R. Curwen, and A. Zisserman. A framework for
spatiotemporal control in the tracking of visual contours.
International Journal of Computer Vision, 11(2):127–145,
1993.
[5] A. Blum and T. Mitchell. Combining labeled and unlabeled
data with co-training. In Proceedings of the eleventh an-
nual conference on Computational learning theory, pages
92–100. ACM, 1998.
[6] J. Carreira and A. Zisserman. Quo vadis, action recognition?
a new model and the kinetics dataset. In Computer Vision
and Pattern Recognition (CVPR), 2017 IEEE Conference on,
pages 4724–4733. IEEE, 2017.
[7] D.-D. Chen, W. Wang, W. Gao, and Z.-H. Zhou. Tri-
net for semi-supervised deep learning. In Proceedings of
the 27th International Joint Conference on Artificial Intel-
ligence, pages 2014–2020. AAAI Press, 2018.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and
A. L. Yuille. Deeplab: Semantic image segmentation with
deep convolutional nets, atrous convolution, and fully con-
nected crfs. IEEE transactions on pattern analysis and ma-
chine intelligence, 40(4):834–848, 2018.
[9] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Re-
thinking atrous convolution for semantic image segmenta-
tion. arXiv preprint arXiv:1706.05587, 2017.
[10] V. Cheplygina, M. de Bruijne, and J. P. Pluim. Not-so-
supervised: a survey of semi-supervised, multi-instance, and
transfer learning in medical image analysis. arXiv preprint
arXiv:1804.06353, 2018.
[11] N. Dong, M. Kampffmeyer, X. Liang, Z. Wang, W. Dai, and
E. Xing. Unsupervised domain adaptation for automatic es-
timation of cardiothoracic ratio. In International Conference
on Medical Image Computing and Computer-Assisted Inter-
vention, pages 544–552. Springer, 2018.
[12] Y. Gal. Uncertainty in deep learning. University of Cam-
bridge, 2016.
[13] Y. Gal and Z. Ghahramani. Dropout as a bayesian approxi-
mation: Representing model uncertainty in deep learning. In
international conference on machine learning, pages 1050–
1059, 2016.
[14] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and
harnessing adversarial examples. In International Confer-
ence on Learning Representations, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learn-
ing for image recognition. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages
770–778, 2016.
[16] X. He, R. S. Zemel, and M. Á. Carreira-Perpiñán. Multiscale
conditional random fields for image labeling. In Computer
vision and pattern recognition, 2004. CVPR 2004. Proceed-
ings of the 2004 IEEE computer society conference on, vol-
ume 2, pages II–II. IEEE, 2004.
[17] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Wein-
berger. Densely connected convolutional networks. In Pro-
ceedings of the IEEE conference on computer vision and pat-
tern recognition, pages 4700–4708, 2017.
[18] J. Jiang, Y.-C. Hu, N. Tyagi, P. Zhang, A. Rimner, G. S.
Mageras, J. O. Deasy, and H. Veeraraghavan. Tumor-aware,
adversarial domain adaptation from ct to mri for lung cancer
segmentation. In International Conference on Medical Im-
age Computing and Computer-Assisted Intervention, pages
777–785. Springer, 2018.
[19] A. Kendall and Y. Gal. What uncertainties do we need in
bayesian deep learning for computer vision? In Advances
in neural information processing systems, pages 5574–5584,
2017.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet
classification with deep convolutional neural networks. In
Advances in neural information processing systems, pages
1097–1105, 2012.
[21] S. Laine and T. Aila. Temporal ensembling for semi-
supervised learning. ICLR, 2016.
[22] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, and P. A. Heng. H-
denseunet: Hybrid densely connected unet for liver and liver
tumor segmentation from ct volumes. IEEE Transactions on
Medical Imaging, 2017.
[23] X. Li, L. Yu, H. Chen, C.-W. Fu, and P.-A. Heng. Semi-
supervised skin lesion segmentation via transformation con-
sistent self-ensembling model. BMVC, 2018.
[24] X. Li, L. Yu, H. Chen, C.-W. Fu, and P.-A. Heng.
Transformation consistent self-ensembling model for semi-
supervised medical image segmentation. arXiv preprint
arXiv:1903.00348, 2019.
[25] S. Liu, D. Xu, S. K. Zhou, O. Pauly, S. Grbic, T. Mertelmeier,
J. Wicklein, A. Jerebko, W. Cai, and D. Comaniciu. 3d
anisotropic hybrid network: Transferring convolutional fea-
tures from 2d images to 3d anisotropic volumes. In In-
ternational Conference on Medical Image Computing and
Computer-Assisted Intervention, pages 851–858. Springer,
2018.
[26] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional
networks for semantic segmentation. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 3431–3440, 2015.
[27] F. Milletari, N. Navab, and S.-A. Ahmadi. V-net: Fully
convolutional neural networks for volumetric medical image
segmentation. In 2016 Fourth International Conference on
3D Vision (3DV), pages 565–571. IEEE, 2016.
[28] T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama. Virtual
adversarial training: a regularization method for supervised
and semi-supervised learning. IEEE transactions on pattern
analysis and machine intelligence, 2018.
[29] D. Nie, Y. Gao, L. Wang, and D. Shen. Asdnet: Attention
based semi-supervised deep networks for medical image seg-
mentation. In International Conference on Medical Image
Computing and Computer-Assisted Intervention, pages 370–
378. Springer, 2018.
[30] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep
co-training for semi-supervised image recognition. In Pro-
ceedings of the European Conference on Computer Vision
(ECCV), pages 135–152, 2018.
[31] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and
T. Raiko. Semi-supervised learning with ladder networks. In
Advances in Neural Information Processing Systems, pages
3546–3554, 2015.
[32] H. R. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turk-
bey, and R. M. Summers. Deeporgan: Multi-level deep con-
volutional networks for automated pancreas segmentation. In
MICCAI, 2015.
[33] H. R. Roth, L. Lu, J. Liu, J. Yao, A. Seff, K. Cherry, L. Kim,
and R. M. Summers. Improving computer-aided detection
using convolutional neural networks and random view aggre-
gation. IEEE transactions on medical imaging, 35(5):1170–
1181, 2016.
[34] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization
with stochastic transformations and perturbations for deep
semi-supervised learning. In Advances in Neural Informa-
tion Processing Systems, pages 1163–1171, 2016.
[35] K. Simonyan and A. Zisserman. Very deep convolutional
networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich.
Going deeper with convolutions. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pages 1–9, 2015.
[37] Y. Xia, L. Xie, F. Liu, Z. Zhu, E. K. Fishman, and A. L.
Yuille. Bridging the gap between 2d and 3d organ segmenta-
tion with volumetric fusion net. In International Conference
on Medical Image Computing and Computer-Assisted Inter-
vention, pages 445–453. Springer, 2018.
[38] P. A. Yushkevich, J. Piven, H. Cody Hazlett, R. Gim-
pel Smith, S. Ho, J. C. Gee, and G. Gerig. User-guided 3D
active contour segmentation of anatomical structures: Sig-
nificantly improved efficiency and reliability. Neuroimage,
31(3):1116–1128, 2006.
[39] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene
parsing network. In IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR), pages 2881–2890, 2017.
[40] Y. Zhou, Y. Wang, P. Tang, W. Shen, E. K. Fishman, and
A. L. Yuille. Semi-supervised multi-organ segmentation via
multi-planar co-training. WACV, 2019.
[41] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L.
Yuille. A fixed-point model for pancreas segmentation in
abdominal ct scans. In International Conference on Medi-
cal Image Computing and Computer-Assisted Intervention,
pages 693–701. Springer, 2017.
[42] Z.-H. Zhou and M. Li. Semi-supervised regression with co-
training. In IJCAI, volume 5, pages 908–913, 2005.
[43] Z.-H. Zhou and M. Li. Tri-training: Exploiting unlabeled
data using three classifiers. IEEE Transactions on knowledge
and Data Engineering, 17(11):1529–1541, 2005.