Searching for MobileNetV3
Andrew Howard1, Mark Sandler1, Grace Chu1, Liang-Chieh Chen1, Bo Chen1, Mingxing Tan2,
Weijun Wang1, Yukun Zhu1, Ruoming Pang2, Vijay Vasudevan2, Quoc V. Le2, Hartwig Adam1
1Google AI, 2Google Brain
{howarda, sandler, cxy, lcchen, bochen, tanmingxing, weijunw, yukun, rpang, vrv, qvl, hadam}@google.com
Abstract
We present the next generation of MobileNets based on
a combination of complementary search techniques as well
as a novel architecture design. MobileNetV3 is tuned to
mobile phone CPUs through a combination of hardware-
aware network architecture search (NAS) complemented by
the NetAdapt algorithm and then subsequently improved
through novel architecture advances. This paper starts the
exploration of how automated search algorithms and net-
work design can work together to harness complementary
approaches improving the overall state of the art. Through
this process we create two new MobileNet models for re-
lease: MobileNetV3-Large and MobileNetV3-Small which
are targeted for high and low resource use cases. These
models are then adapted and applied to the tasks of ob-
ject detection and semantic segmentation. For the task of
semantic segmentation (or any dense pixel prediction), we
propose a new efficient segmentation decoder Lite Reduced
Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve
new state of the art results for mobile classification, detec-
tion and segmentation. MobileNetV3-Large is 3.2% more
accurate on ImageNet classification while reducing latency
by 20% compared to MobileNetV2. MobileNetV3-Small is
6.6% more accurate compared to a MobileNetV2 model
with comparable latency. MobileNetV3-Large detection
is over 25% faster at roughly the same accuracy as Mo-
bileNetV2 on COCO detection. MobileNetV3-Large LR-
ASPP is 34% faster than MobileNetV2 R-ASPP at similar
accuracy for Cityscapes segmentation.
1. Introduction
Efficient neural networks are becoming ubiquitous in
mobile applications enabling entirely new on-device expe-
riences. They are also a key enabler of personal privacy al-
lowing a user to gain the benefits of neural networks without
needing to send their data to the server to be evaluated. Ad-
vances in neural network efficiency not only improve user
experience via higher accuracy and lower latency, but also
[Figure 1: two plots of top-1 ImageNet accuracy (%) versus Pixel 1 latency (ms). Left: small mobile models (20-40ms CPU latency): MobileNetV3-Small, MnasNet-small, MobileNetV2. Right: large mobile models (40-100ms CPU latency): MobileNetV3-Large, ProxylessNAS, MnasNet-A, MobileNetV2.]
Figure 1. The trade-off between Pixel 1 latency and top-1 ImageNet accuracy. All models use the input resolution 224. V3 large and V3 small use multipliers 0.75, 1 and 1.25 to show the optimal frontier. All latencies were measured on a single large core of the same device using TFLite [1]. MobileNetV3-Small and MobileNetV3-Large are our proposed next-generation mobile models.
[Figure 2: plot of top-1 ImageNet accuracy versus multiply-adds (millions, log scale) and model size for InceptionV2, MnasNet-A, MnasNet-small, MobileNetV1, MobileNetV2, MobileNetV3, NasNet, ProxylessNAS, ResNet-50 and ShuffleNetV2.]
Figure 2. The trade-off between MAdds and top-1 accuracy. This allows comparing models that were targeted at different hardware or software frameworks. All MobileNetV3 models are for input resolution 224 and use multipliers 0.35, 0.5, 0.75, 1 and 1.25. See section 6 for other resolutions. Best viewed in color.
help preserve battery life through reduced power consump-
tion.
This paper describes the approach we took to develop
MobileNetV3 Large and Small models in order to deliver
the next generation of high accuracy efficient neural net-
work models to power on-device computer vision. The new
networks push the state of the art forward and demonstrate
how to blend automated search with novel architecture ad-
vances to build effective models.
The goal of this paper is to develop the best possible mo-
bile computer vision architectures optimizing the accuracy-
latency trade off on mobile devices. To accomplish this we
introduce (1) complementary search techniques, (2) new ef-
ficient versions of nonlinearities practical for the mobile set-
ting, (3) new efficient network design, (4) a new efficient
segmentation decoder. We present thorough experiments
demonstrating the efficacy and value of each technique eval-
uated on a wide range of use cases and mobile phones.
The paper is organized as follows. We start with a dis-
cussion of related work in Section 2. Section 3 reviews the
efficient building blocks used for mobile models. Section 4
reviews architecture search and the complementary nature
of MnasNet and NetAdapt algorithms. Section 5 describes
novel architecture design improving on the efficiency of the
models found through the joint search. Section 6 presents
extensive experiments for classification, detection and seg-
mentation in order to demonstrate efficacy and understand
the contributions of different elements. Section 7 contains
conclusions and future work.
2. Related Work
Designing deep neural network architecture for the op-
timal trade-off between accuracy and efficiency has been
an active research area in recent years. Both novel hand-
crafted structures and algorithmic neural architecture search
have played important roles in advancing this field.
SqueezeNet[22] extensively uses 1x1 convolutions with
squeeze and expand modules primarily focusing on re-
ducing the number of parameters. More recent works shift the focus from reducing parameters to reducing
the number of operations (MAdds) and the actual mea-
sured latency. MobileNetV1[19] employs depthwise sepa-
rable convolution to substantially improve computation ef-
ficiency. MobileNetV2[39] expands on this by introducing
a resource-efficient block with inverted residuals and lin-
ear bottlenecks. ShuffleNet[49] utilizes group convolution
and channel shuffle operations to further reduce the MAdds.
CondenseNet[21] learns group convolutions at the training
stage to keep useful dense connections between layers for
feature re-use. ShiftNet[46] proposes the shift operation in-
terleaved with point-wise convolutions to replace expensive
spatial convolutions.
To automate the architecture design process, reinforce-
ment learning (RL) was first introduced to search efficient
architectures with competitive accuracy [53, 54, 3, 27, 35].
A fully configurable search space can grow exponentially large and become intractable, so early works on architecture search focus on cell-level structure search, and the same cell is reused in all layers. Recently, [43] explored a block-level
hierarchical search space allowing different layer structures
at different resolution blocks of a network. To reduce
the computational cost of search, a differentiable architecture search framework is used in [28, 5, 45] with gradient-based
optimization. Focusing on adapting existing networks to
constrained mobile platforms, [48, 15, 12] proposed more
efficient automated network simplification algorithms.
Quantization [23, 25, 47, 41, 51, 52, 37] is another
important complementary effort to improve the network
efficiency through reduced precision arithmetic. Finally,
knowledge distillation [4, 17] offers an additional comple-
mentary method to generate small accurate “student” networks with the guidance of a large “teacher” network.
3. Efficient Mobile Building Blocks
Mobile models have been built on increasingly more effi-
cient building blocks. MobileNetV1 [19] introduced depth-
wise separable convolutions as an efficient replacement for
traditional convolution layers. Depthwise separable convo-
lutions effectively factorize traditional convolution by sep-
arating spatial filtering from the feature generation mech-
anism. Depthwise separable convolutions are defined by
two separate layers: light weight depthwise convolution for
spatial filtering and heavier 1x1 pointwise convolutions for
feature generation.
MobileNetV2 [39] introduced the linear bottleneck and
inverted residual structure in order to make even more effi-
cient layer structures by leveraging the low rank nature of
the problem. This structure is shown on Figure 3 and is
defined by a 1x1 expansion convolution followed by depth-
wise convolutions and a 1x1 projection layer. The input and
output are connected with a residual connection if and only
if they have the same number of channels. This structure
maintains a compact representation at the input and the out-
put while expanding to a higher-dimensional feature space
internally to increase the expressiveness of nonlinear per-
channel transformations.
MnasNet [43] built upon the MobileNetV2 structure by
introducing lightweight attention modules based on squeeze
and excitation into the bottleneck structure. Note that the squeeze and excitation module is integrated in a different location than the ResNet-based modules proposed in [20]. The
module is placed after the depthwise filters in the expansion
in order for attention to be applied on the largest represen-
tation as shown on Figure 4.
For MobileNetV3, we use a combination of these layers
as building blocks in order to build the most effective mod-
els. Layers are also upgraded with modified swish nonlin-
earities [36, 13, 16]. Both squeeze and excitation as well as the swish nonlinearity use the sigmoid, which can be inefficient to compute and makes it challenging to maintain accuracy in fixed point arithmetic, so we replace it with the hard sigmoid [2, 11] as discussed in section 5.2.
[Figure 3 diagram: a 1x1 convolution with ReLU6, a 3x3 depthwise convolution with ReLU6, and a 1x1 projection, with a residual connection around the block.]
Figure 3. MobileNetV2 [39] layer (Inverted Residual and Linear Bottleneck). Each block consists of narrow input and output (bottleneck), which don’t have nonlinearity, followed by expansion to a much higher-dimensional space and projection to the output. The residual connects bottleneck (rather than expansion).
[Figure 4 diagram: the MobileNetV3 block: a 1x1 convolution with nonlinearity, a 3x3 depthwise convolution with nonlinearity, a squeeze-and-excite branch (pooling, FC with ReLU, FC with hard-σ), and a 1x1 projection, with a residual connection around the block.]
Figure 4. MobileNetV2 + Squeeze-and-Excite [20]. In contrast with [20] we apply the squeeze and excite in the residual layer. We use different nonlinearities depending on the layer; see section 5.2 for details.
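To make the block structure concrete, the following is a minimal sketch of the combined layer in TensorFlow/Keras, written from the description above rather than from any released implementation. The bneck name echoes Tables 1 and 2, activation stands for ReLU or h-swish depending on the block, and se_fn is a pluggable squeeze-and-excite module (the variants we use are discussed in sections 5.2 and 5.3).

```python
import tensorflow as tf

def bneck(x, exp_size, out_channels, kernel, stride, activation, se_fn=None):
    """One MobileNetV3 bottleneck: 1x1 expand -> depthwise -> (SE) -> 1x1 project."""
    shortcut = x
    # 1x1 expansion into the higher-dimensional feature space (Figures 3 and 4).
    y = tf.keras.layers.Conv2D(exp_size, 1, use_bias=False)(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = activation(y)
    # Depthwise convolution performs the spatial filtering.
    y = tf.keras.layers.DepthwiseConv2D(kernel, strides=stride,
                                        padding="same", use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = activation(y)
    # Optional squeeze-and-excite, applied to the largest (expanded) representation.
    if se_fn is not None:
        y = se_fn(y)
    # Linear 1x1 projection back to the narrow bottleneck (no nonlinearity).
    y = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(y)
    y = tf.keras.layers.BatchNormalization()(y)
    # Residual connection only when the input and output shapes match.
    if stride == 1 and shortcut.shape[-1] == out_channels:
        y = tf.keras.layers.Add()([shortcut, y])
    return y
```

For example, the 56² × 24 row of Table 1 would correspond to bneck(x, exp_size=72, out_channels=24, kernel=3, stride=1, activation=tf.nn.relu).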
4. Network Search
Network search has shown itself to be a very powerful
tool for discovering and optimizing network architectures
[53, 43, 5, 48]. For MobileNetV3 we use platform-aware
NAS to search for the global network structures by optimiz-
ing each network block. We then use the NetAdapt algo-
rithm to search per layer for the number of filters. These
techniques are complementary and can be combined to ef-
fectively find optimized models for a given hardware plat-
form.
4.1. Platform-Aware NAS for Block-wise Search
Similar to [43], we employ a platform-aware neural ar-
chitecture approach to find the global network structures.
Since we use the same RNN-based controller and the same
factorized hierarchical search space, we find similar results
as [43] for Large mobile models with target latency around
80ms. Therefore, we simply reuse the same MnasNet-A1
[43] as our initial Large mobile model, and then apply Ne-
tAdapt [48] and other optimizations on top of it.
However, we observe the original reward design is not optimized for small mobile models. Specifically, it uses a multi-objective reward ACC(m) × [LAT(m)/TAR]^w to approximate Pareto-optimal solutions, by balancing model accuracy ACC(m) and latency LAT(m) for each model m based on the target latency TAR. We observe that accuracy changes much more dramatically with latency for small models; therefore, we need a smaller weight factor w = −0.15 (vs the original w = −0.07 in [43]) to compensate for the larger accuracy change for different latencies.
Enhanced with this new weight factor w, we start a new ar-
chitecture search from scratch to find the initial seed model
and then apply NetAdapt and other optimizations to obtain
the final MobileNetV3-Small model.
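For concreteness, the multi-objective reward can be written as a small function. This is only a sketch of the reward from [43] as restated above, not the actual search code, and the accuracy and latency values in the usage lines are illustrative placeholders.

```python
def nas_reward(acc, latency_ms, target_ms, w=-0.07):
    """Multi-objective reward ACC(m) x [LAT(m)/TAR]^w used by the controller.

    acc:        top-1 accuracy of candidate model m, in [0, 1].
    latency_ms: measured latency LAT(m) on the target phone.
    target_ms:  target latency TAR.
    w:          latency weight; -0.07 in [43], -0.15 for MobileNetV3-Small.
    """
    return acc * (latency_ms / target_ms) ** w

# With the stronger penalty w = -0.15, a given latency increase costs more
# reward, so the search will not trade much accuracy for speed on small
# models (the accuracy/latency numbers below are placeholders only).
reward_large = nas_reward(0.752, latency_ms=80.0, target_ms=80.0, w=-0.07)
reward_small = nas_reward(0.674, latency_ms=15.0, target_ms=15.0, w=-0.15)
```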
4.2. NetAdapt for Layer-wise Search
The second technique that we employ in our architecture
search is NetAdapt [48]. This approach is complementary to platform-aware NAS: it allows fine-tuning of individual layers in a sequential manner, rather than trying to infer a coarse but global architecture. We refer to the original paper for
the full details. In short the technique proceeds as follows:
1. Starts with a seed network architecture found by
platform-aware NAS.
2. For each step:
(a) Generate a set of new proposals. Each proposal
represents a modification of an architecture that
generates at least δ reduction in latency com-
pared to the previous step.
(b) For each proposal we use the pre-trained model
from the previous step and populate the new pro-
posed architecture, truncating and randomly ini-
tializing missing weights as appropriate. Fine-
tune each proposal for T steps to get a coarse es-
timate of the accuracy.
(c) Select the best proposal according to some metric.
3. Iterate previous step until target latency is reached.
In [48] the metric was to minimize the accuracy change.
We modify this algorithm and minimize the ratio between
latency change and accuracy change. That is for all pro-
posals generated during each NetAdapt step, we pick one
that maximizes ∆Acc/|∆latency|, with ∆latency satisfying the constraint in 2(a). The intuition is that because our proposals are discrete, we prefer proposals that maximize the slope of the trade-off curve.
This process is repeated until the latency reaches its tar-
get, and then we re-train the new architecture from scratch.
We use the same proposal generator as was used in [48]
for MobilenetV2. Specifically, we allow the following two
types of proposals:
1. Reduce the size of any expansion layer;
2. Reduce bottleneck in all blocks that share the same
bottleneck size - to maintain residual connections.
For our experiments we used T = 10000 and find that while it increases the accuracy of the initial fine-tuning of the proposals, it does not, however, change the final accuracy when trained from scratch. We set δ = 0.01|L|, where L is the latency of the seed model.
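The proposal-selection rule is simple enough to sketch directly. The snippet below outlines one NetAdapt step under our modified criterion (maximizing ∆Acc/|∆latency| subject to the δ latency-reduction constraint); generate_proposals, finetune_and_eval and measure_latency are hypothetical stand-ins for the machinery described in [48], not real APIs.

```python
def netadapt_step(model, acc, latency, delta,
                  generate_proposals, finetune_and_eval, measure_latency):
    """One NetAdapt step: pick the proposal with the best delta-acc / |delta-latency| slope."""
    best_score, best = float("-inf"), None
    for proposal in generate_proposals(model):
        new_latency = measure_latency(proposal)
        d_lat = new_latency - latency
        # Constraint 2(a): the proposal must cut latency by at least delta.
        if d_lat > -delta:
            continue
        # Short fine-tuning (T = 10000 steps) gives a coarse accuracy estimate.
        new_acc = finetune_and_eval(proposal, steps=10000)
        score = (new_acc - acc) / abs(d_lat)
        if score > best_score:
            best_score, best = score, (proposal, new_acc, new_latency)
    return best  # iterate until the target latency is reached, then retrain from scratch
```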
5. Network Improvements
In addition to network search, we also introduce several
new components to the model to further improve the final
model. We redesign the computationally expensive layers at
the beginning and the end of the network. We also intro-
duce a new nonlinearity, h-swish, a modified version of the
recent swish nonlinearity, which is faster to compute and
more quantization-friendly.
5.1. Redesigning Expensive Layers
Once models are found through architecture search, we
observe that some of the last layers as well as some of the
earlier layers are more expensive than others. We propose
some modifications to the architecture to reduce the latency
of these slow layers while maintaining the accuracy. These
modifications are outside of the scope of the current search
space.
The first modification reworks how the last few lay-
ers of the network interact in order to produce the final
features more efficiently. Current models based on Mo-
bileNetV2’s inverted bottleneck structure and variants use
1x1 convolution as a final layer in order to expand to a
higher-dimensional feature space. This layer is critically
important in order to have rich features for prediction. How-
ever, this comes at a cost of extra latency.
To reduce latency and preserve the high dimensional fea-
tures, we move this layer past the final average pooling.
This final set of features is now computed at 1x1 spatial
resolution instead of 7x7 spatial resolution. The outcome
of this design choice is that the computation of the features
becomes nearly free in terms of computation and latency.
Once the cost of this feature generation layer has been
mitigated, the previous bottleneck projection layer is no
longer needed to reduce computation. This observation al-
lows us to remove the projection and filtering layers in the
previous bottleneck layer, further reducing computational
complexity. The original and optimized last stages can be
seen in figure 5. The efficient last stage reduces the latency by 7 milliseconds, which is 11% of the running time, and reduces the number of operations by 30 million MAdds with almost no loss of accuracy. Section 6 contains detailed results.
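Below is a sketch of the reordered last stage, assuming the MobileNetV3-Large channel counts shown in Figure 5 (160-channel bottleneck output, 960-dimensional expansion, 1280-dimensional head). In the original last stage the 1280-dimensional layer, together with the depthwise and 320-channel projection layers before it, runs at 7x7 resolution before pooling; the function and helper names here are ours.

```python
import tensorflow as tf

def h_swish(x):
    # h-swish nonlinearity; see section 5.2.
    return x * tf.nn.relu6(x + 3.0) / 6.0

def efficient_last_stage(features, num_classes=1000):
    """Efficient last stage (Figure 5); `features` is the 7x7x160 bottleneck output."""
    x = tf.keras.layers.Conv2D(960, 1, use_bias=False)(features)
    x = tf.keras.layers.BatchNormalization()(x)
    x = h_swish(x)
    # Pool first, so the 1280-dimensional feature layer below runs on a 1x1
    # map instead of 7x7 and becomes nearly free; the depthwise filtering and
    # 320-channel projection of the original last stage are dropped entirely.
    x = tf.keras.layers.GlobalAveragePooling2D(keepdims=True)(x)
    x = tf.keras.layers.Conv2D(1280, 1)(x)  # no batch normalization at 1x1 resolution
    x = h_swish(x)
    return tf.keras.layers.Conv2D(num_classes, 1)(x)
```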
Another expensive layer is the initial set of filters. Cur-
rent mobile models tend to use 32 filters in a full 3x3 con-
volution to build initial filter banks for edge detection. Of-
ten these filters are mirror images of each other. We ex-
perimented with reducing the number of filters and using
[Figure 5 diagram: Original Last Stage: 1x1 Conv + BN + H-Swish (960), 3x3 DConv + BN + H-Swish (960), 1x1 Conv + BN (320), 1x1 Conv + BN + H-Swish (1280), pooling, 1x1 Conv (1000). Efficient Last Stage: 1x1 Conv + BN + H-Swish (960), pooling, 1x1 Conv + H-Swish (1280), 1x1 Conv (1000).]
Figure 5. Comparison of original last stage and efficient last stage. This more efficient last stage is able to drop three expensive layers at the end of the network at no loss of accuracy.
different nonlinearities to try and reduce redundancy. We
settled on using the hard swish nonlinearity for this layer
as it performed as well as other nonlinearities tested. We
were able to reduce the number of filters to 16 while main-
taining the same accuracy as 32 filters using either ReLU
or swish. This saves an additional 2 milliseconds and 10
million MAdds.
5.2. Nonlinearities
In [36, 13, 16] a nonlinearity called swish was introduced that, when used as a drop-in replacement for ReLU, significantly improves the accuracy of neural networks. The nonlinearity is defined as
swish(x) = x · σ(x)
While this nonlinearity improves accuracy, it comes with
non-zero cost in embedded environments as the sigmoid
function is much more expensive to compute on mobile de-
vices. We deal with this problem in two ways.
1. We replace the sigmoid function with its piece-wise linear hard analog ReLU6(x + 3)/6, similar to [11, 44]. The minor difference is we use ReLU6 rather than a custom clipping constant. Similarly, the hard version of swish becomes
h-swish[x] = x · ReLU6(x + 3)/6
A similar version of hard-swish was also recently proposed
in [2]. The comparison of the soft and hard version of sig-
moid and swish nonlinearities is shown in figure 6. Our
choice of constants was motivated by simplicity and be-
ing a good match to the original smooth version. In our
experiments, we found the hard versions of all these functions
to have no discernible difference in accuracy, but multi-
ple advantages from a deployment perspective. First, opti-
mized implementations of ReLU6 are available on virtually
all software and hardware frameworks. Second, in quan-
tized mode, it eliminates potential numerical precision loss
caused by different implementations of the approximate sig-
moid. Finally, in practice, h-swish can be implemented as
[Figure 6: two plots comparing sigmoid vs. h-sigmoid and swish vs. h-swish over the input range.]
Figure 6. Sigmoid and swish nonlinearities and their “hard” counterparts.
a piece-wise function to reduce the number of memory accesses, driving the latency cost down substantially; a short code sketch is given at the end of this subsection.
2. The cost of applying nonlinearity decreases as we go
deeper into the network, since each layer activation mem-
ory typically halves every time the resolution drops. Inci-
dentally, we find that most of the benefits of swish are realized by using it only in the deeper layers. Thus in our architectures we only use h-swish in the second half of the
model. We refer to the tables 1 and 2 for the precise layout.
Even with these optimizations, h-swish still introduces some latency cost. However, as we demonstrate in section 6, the net effect on accuracy and latency is positive with no optimizations, and substantial when using an optimized implementation based on a piece-wise function.
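Expressed in code, the hard nonlinearities reduce to a handful of ReLU6 operations; the following is one straightforward (non-fused) way to write them in TensorFlow, with function names of our choosing. The optimized piece-wise implementation referred to above is a separate, fused TFLite operator.

```python
import tensorflow as tf

def h_sigmoid(x):
    # Piece-wise linear approximation of sigmoid; output clipped to [0, 1].
    return tf.nn.relu6(x + 3.0) / 6.0

def h_swish(x):
    # h-swish[x] = x * ReLU6(x + 3) / 6
    return x * h_sigmoid(x)

def swish(x):
    # The original smooth version (Figure 6), shown for comparison.
    return x * tf.nn.sigmoid(x)
```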
5.3. Large squeeze-and-excite
In [43], the size of the squeeze-and-excite bottleneck was relative to the size of the convolutional bottleneck. Instead, we fix them all to be 1/4 of the number of channels in the expansion layer. We find that doing so increases the accuracy, at a modest increase in the number of parameters, and with no discernible latency cost.
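For reference, a sketch of a squeeze-and-excite module with this fixed sizing and the hard sigmoid gate from section 5.2; it can serve as the se_fn argument of the block sketch in section 3, and the names are ours.

```python
import tensorflow as tf

def squeeze_excite(x, expansion_channels):
    """Squeeze-and-excite with the squeeze width fixed to 1/4 of the expansion layer."""
    se_channels = max(1, expansion_channels // 4)
    s = tf.keras.layers.GlobalAveragePooling2D(keepdims=True)(x)
    s = tf.keras.layers.Conv2D(se_channels, 1, activation="relu")(s)
    s = tf.keras.layers.Conv2D(expansion_channels, 1)(s)
    s = tf.nn.relu6(s + 3.0) / 6.0  # hard sigmoid gate (section 5.2)
    return x * s
```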
5.4. MobileNetV3 Definitions
MobileNetV3 is defined as two models: MobileNetV3-
Large and MobileNetV3-Small. These models are targeted
at high and low resource use cases respectively. The models
are created through applying platform-aware NAS and Ne-
tAdapt for network search and incorporating the network
improvements defined in this section. See tables 1 and 2 for the full specification of our networks.
6. Experiments
We present experimental results to demonstrate the ef-
fectiveness of the new MobileNetV3 models. We report re-
sults on classification, detection and segmentation. We also
report various ablation studies to shed light on the effects of
various design decisions.
6.1. Classification
As has become standard, we use ImageNet[38] for all
our classification experiments and compare accuracy ver-
Input      | Operator        | exp size | #out | SE | NL | s
224² × 3   | conv2d          | -        | 16   | -  | HS | 2
112² × 16  | bneck, 3x3      | 16       | 16   | -  | RE | 1
112² × 16  | bneck, 3x3      | 64       | 24   | -  | RE | 2
56² × 24   | bneck, 3x3      | 72       | 24   | -  | RE | 1
56² × 24   | bneck, 5x5      | 72       | 40   | ✓  | RE | 2
28² × 40   | bneck, 5x5      | 120      | 40   | ✓  | RE | 1
28² × 40   | bneck, 5x5      | 120      | 40   | ✓  | RE | 1
28² × 40   | bneck, 3x3      | 240      | 80   | -  | HS | 2
14² × 80   | bneck, 3x3      | 200      | 80   | -  | HS | 1
14² × 80   | bneck, 3x3      | 184      | 80   | -  | HS | 1
14² × 80   | bneck, 3x3      | 184      | 80   | -  | HS | 1
14² × 80   | bneck, 3x3      | 480      | 112  | ✓  | HS | 1
14² × 112  | bneck, 3x3      | 672      | 112  | ✓  | HS | 1
14² × 112  | bneck, 5x5      | 672      | 160  | ✓  | HS | 2
7² × 160   | bneck, 5x5      | 960      | 160  | ✓  | HS | 1
7² × 160   | bneck, 5x5      | 960      | 160  | ✓  | HS | 1
7² × 160   | conv2d, 1x1     | -        | 960  | -  | HS | 1
7² × 960   | pool, 7x7       | -        | -    | -  | -  | 1
1² × 960   | conv2d 1x1, NBN | -        | 1280 | -  | HS | 1
1² × 1280  | conv2d 1x1, NBN | -        | k    | -  | -  | 1
Table 1. Specification for MobileNetV3-Large. SE denotes
whether there is a Squeeze-And-Excite in that block. NL denotes
the type of nonlinearity used. Here, HS denotes h-swish and RE
denotes ReLU. NBN denotes no batch normalization. s denotes
stride.
Input      | Operator        | exp size | #out | SE | NL | s
224² × 3   | conv2d, 3x3     | -        | 16   | -  | HS | 2
112² × 16  | bneck, 3x3      | 16       | 16   | ✓  | RE | 2
56² × 16   | bneck, 3x3      | 72       | 24   | -  | RE | 2
28² × 24   | bneck, 3x3      | 88       | 24   | -  | RE | 1
28² × 24   | bneck, 5x5      | 96       | 40   | ✓  | HS | 2
14² × 40   | bneck, 5x5      | 240      | 40   | ✓  | HS | 1
14² × 40   | bneck, 5x5      | 240      | 40   | ✓  | HS | 1
14² × 40   | bneck, 5x5      | 120      | 48   | ✓  | HS | 1
14² × 48   | bneck, 5x5      | 144      | 48   | ✓  | HS | 1
14² × 48   | bneck, 5x5      | 288      | 96   | ✓  | HS | 2
7² × 96    | bneck, 5x5      | 576      | 96   | ✓  | HS | 1
7² × 96    | bneck, 5x5      | 576      | 96   | ✓  | HS | 1
7² × 96    | conv2d, 1x1     | -        | 576  | ✓  | HS | 1
7² × 576   | pool, 7x7       | -        | -    | -  | -  | 1
1² × 576   | conv2d 1x1, NBN | -        | 1280 | -  | HS | 1
1² × 1280  | conv2d 1x1, NBN | -        | k    | -  | -  | 1
Table 2. Specification for MobileNetV3-Small. See table 1 for
notation.
sus various measures of resource usage such as latency and
multiply adds (MAdds).
6.1.1 Training setup
We train our models using a synchronous training setup on a 4x4 TPU Pod [24] using the standard TensorFlow RMSPropOptimizer with 0.9 momentum. We use an initial learning rate of 0.1, with batch size 4096 (128 images per chip), and a learning rate decay rate of 0.01 every 3 epochs. We use dropout of 0.8, l2 weight decay of 1e-5 and the same image preprocessing as Inception [42]. Finally we use an exponential moving average with decay 0.9999.
Network         | Top-1 | MAdds | Params | P-1  | P-2  | P-3
V3-Large 1.0    | 75.2  | 219   | 5.4M   | 51   | 61   | 44
V3-Large 0.75   | 73.3  | 155   | 4.0M   | 39   | 46   | 40
MnasNet-A1      | 75.2  | 315   | 3.9M   | 71   | 86   | 61
Proxyless [5]   | 74.6  | 320   | 4.0M   | 72   | 84   | 60
V2 1.0          | 72.0  | 300   | 3.4M   | 64   | 76   | 56
V3-Small 1.0    | 67.4  | 66    | 2.9M   | 15.8 | 19.4 | 14.4
V3-Small 0.75   | 65.4  | 44    | 2.4M   | 12.8 | 15.6 | 11.7
Mnas-small [43] | 64.9  | 65.1  | 1.9M   | 20.3 | 24.2 | 17.2
V2 0.35         | 60.8  | 59.2  | 1.6M   | 16.6 | 19.6 | 13.9
Table 3. Floating point performance on the Pixel family of phones
(P-n denotes a Pixel-n phone). All latencies are in ms and are
measured using a single large core with a batch size of one. Top-1
accuracy is on ImageNet.
All our convolutional layers use batch-normalization layers with an average decay of 0.99.
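The optimizer settings above can be sketched in tf.keras as follows; we read "decay rate of 0.01 every 3 epochs" as a 1% multiplicative decay (a factor of 0.99) every 3 epochs, and the steps-per-epoch value is a placeholder, so this is an illustration of the hyperparameters rather than our exact training script.

```python
import tensorflow as tf

STEPS_PER_EPOCH = 1281167 // 4096  # ImageNet images / batch size (placeholder)

# Learning rate 0.1, decayed every 3 epochs; the 0.99 factor is our reading
# of "decay rate of 0.01 every 3 epochs" as a 1% multiplicative decay.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=3 * STEPS_PER_EPOCH,
    decay_rate=0.99,
    staircase=True)

# RMSProp with 0.9 momentum, as stated in the text.
optimizer = tf.keras.optimizers.RMSprop(learning_rate=lr_schedule, momentum=0.9)

# Evaluation weights are tracked with an exponential moving average (decay
# 0.9999); l2 weight decay of 1e-5 and batch-norm momentum of 0.99 are set
# on the model layers themselves and are not shown here.
ema = tf.train.ExponentialMovingAverage(decay=0.9999)
```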
6.1.2 Measurement setup
To measure latencies we use standard Google Pixel phones
and run all networks through the standard TFLite Bench-
mark Tool. We use a single-threaded large core in all our measurements. We don’t report multi-core inference time, since we find this setup not very practical for mobile applications. We contributed an atomic h-swish operator to TensorFlow Lite, and it is now the default in the latest version.
We show the impact of optimized h-swish on figure 9.
6.2. Results
As can be seen in figure 1, our models outperform the current state of the art such as MnasNet [43], ProxylessNAS [5] and MobileNetV2 [39]. We report the floating point performance on different Pixel phones in table 3. We include quantization results in table 4.
In figure 7 we show the MobileNetV3 performance trade-offs as a function of multiplier and resolution. Note how MobileNetV3-Small outperforms MobileNetV3-Large with the multiplier scaled to match the performance by nearly 3%. On the other hand, resolution provides an even better trade-off than multiplier. However, it should be noted that resolution is often determined by the problem (e.g. segmentation and detection problems generally require higher resolution), and thus can't always be used as a tunable parameter.
6.2.1 Ablation study
Impact of non-linearities In table 5 we study the choice
of where to insert h-swish nonlinearities as well as the im-
provements of using an optimized implementation over a
naive implementation. It can be seen that using an op-
timized implementation of h-swish saves 6ms (more than
Network      | Top-1 | P-1  | P-2  | P-3
V3-Large 1.0 | 73.8  | 44   | 42.5 | 31.7
V2 1.0       | 70.9  | 52   | 48.3 | 37.0
V3-Small     | 64.9  | 15.5 | 14.9 | 10.7
V2 0.35      | 57.2  | 16.7 | 15.6 | 11.9
Table 4. Quantized performance. All latencies are in ms. The
inference latency is measured using a single large core on the re-
spective Pixel 1/2/3 device.
[Figure 7: plot of top-1 accuracy versus Pixel 1 latency comparing MobileNetV3-Large, MobileNetV3-Small and MobileNetV2, each swept over depth multipliers and input resolutions.]
Figure 7. Performance of MobileNetV3 as a function of different multipliers and resolutions. In our experiments we have used multipliers 0.35, 0.5, 0.75, 1.0 and 1.25, with a fixed resolution of 224, and resolutions 96, 128, 160, 192, 224 and 256 with a fixed depth multiplier of 1.0. Best viewed in color. Top-1 accuracy is on ImageNet and latency is in ms.
             | Top-1       | P-1         | P-1 (no-opt)
V3-Large 1.0 | 75.2        | 51.4        | 57.5
ReLU         | 74.5 (-.7%) | 50.5 (-1%)  | 50.5
h-swish @16  | 75.4 (+.2%) | 53.5 (+4%)  | 68.9
h-swish @112 | 75.0 (-.3%) | 51 (-0.5%)  | 54.4
Table 5. Effect of non-linearities on MobileNetV3-Large. In h-swish @N, N denotes the number of channels in the first layer that has h-swish enabled. The third column shows the runtime without optimized h-swish. Top-1 accuracy is on ImageNet and latency is in ms.
10% of the runtime). Optimized h-swish only adds an ad-
ditional 1ms compared to traditional ReLU.
Figure 8 shows the efficient frontier based on nonlinear-
ity choices and network width. MobileNetV3 uses h-swish
in the middle of the network and clearly dominates ReLU.
It is interesting to note that adding h-swish to the entire
network is slightly better than the interpolated frontier of
widening the network.
Impact of other components In figure 9 we show how the introduction of different components moved results along the latency/accuracy curve.
[Figure 8: two plots of top-1 accuracy versus Pixel 1 latency, one with optimized h-swish and one without, comparing ReLU (74.5), h-swish@16 (75.4) and h-swish@112 (75.0) against the V3 depth-multiplier frontier (0.85: 74.3, 1.0: 75.2, 1.15: 76.3).]
Figure 8. Impact of h-swish vs ReLU on latency for optimized and non-optimized h-swish. The curve shows a frontier of using depth multiplier. Note that placing h-swish at all layers with 80 channels or more (V3) provides the best trade-offs for both optimized h-swish and non-optimized h-swish. Top-1 accuracy is on ImageNet and latency is in ms.
[Figure 9: "Progression of V3 development": ImageNet accuracy versus latency as components are added: MnasNet (75.2), +SE/4 (75.5), +h-swish (76.1, with and without the optimized implementation), +NetAdapt (75.2), +compact (75.2).]
Figure 9. Impact of individual components in the development of MobileNetV3. Progress is measured by moving up and to the left.
6.3. Detection
We use MobileNetV3 as a drop-in replacement for the
backbone feature extractor in SSDLite [39] and compare
with other backbone networks on COCO dataset [26].
Following MobileNetV2 [39], we attach the first layer of
SSDLite to the last feature extractor layer that has an out-
put stride of 16, and attach the second layer of SSDLite to
the last feature extractor layer that has an output stride of
32. Following the detection literature, we refer to these two
feature extractor layers as C4 and C5, respectively. For
MobileNetV3-Large, C4 is the expansion layer of the 13-th
bottleneck block. For MobileNetV3-Small, C4 is the ex-
pansion layer of the 9-th bottleneck block. For both net-
works, C5 is the layer immediately before pooling.
We additionally reduce the channel counts of all feature layers between C4 and C5 by a factor of 2. This is because the last few
layers of MobileNetV3 are tuned to output 1000 classes,
which may be redundant when transferred to COCO with
90 classes.
The results on COCO test set are given in Tab. 6. With
the channel reduction, MobileNetV3-Large is 27% faster
than MobileNetV2 with near identical mAP. MobileNetV3-
Small with channel reduction is also 2.4 and 0.5 mAP
Backbone     | mAP  | Latency (ms) | Params (M) | MAdds (B)
V1           | 22.2 | 228          | 5.1        | 1.3
V2           | 22.1 | 162          | 4.3        | 0.80
MnasNet      | 23.0 | 174          | 4.88       | 0.84
V3           | 22.0 | 137          | 4.97       | 0.62
V3†          | 22.0 | 119          | 3.22       | 0.51
V2 0.35      | 13.7 | 66           | 0.93       | 0.16
V2 0.5       | 16.6 | 79           | 1.54       | 0.27
MnasNet 0.35 | 15.6 | 68           | 1.02       | 0.18
MnasNet 0.5  | 18.5 | 85           | 1.68       | 0.29
V3-Small     | 16.0 | 52           | 2.49       | 0.21
V3-Small†    | 16.1 | 43           | 1.77       | 0.16
Table 6. Object detection results of SSDLite with different backbones on COCO test set. †: Channels in the blocks between C4 and C5 are reduced by a factor of 2.
[Figure 10 diagram: Lite R-ASPP segmentation head on MobileNetV3. The 1/16-resolution backbone features feed two branches: a 1x1 Conv + BN + ReLU branch (128 filters) and a 49x49 Avg-Pool (stride [16, 20]) + 1x1 Conv (128) + Sigmoid + bilinear-upsample branch; the two are multiplied, bilinearly upsampled, and projected by a 1x1 Conv to 19 classes, then added to a 1x1 Conv (19) applied to the 1/8-resolution low-level features to produce the semantic segmentation.]
Figure 10. Building on MobileNetV3, the proposed segmentation head, Lite R-ASPP, delivers fast semantic segmentation results while mixing features from multiple resolutions.
higher than MobileNetV2 and MnasNet while being 35%
faster. For both MobileNetV3 models the channel reduction trick contributes approximately a 15% latency reduction with no mAP loss, suggesting that ImageNet classification and COCO object detection may prefer different feature extractor shapes.
6.4. Semantic Segmentation
In this subsection, we employ MobileNetV2 [39] and the
proposed MobileNetV3 as network backbones for the task
of mobile semantic segmentation. Additionally, we com-
pare two segmentation heads. The first one, referred to as
R-ASPP, was proposed in [39]. R-ASPP is a reduced de-
sign of the Atrous Spatial Pyramid Pooling module [7, 8, 9],
which adopts only two branches consisting of a 1 × 1 con-
volution and a global-average pooling operation [29, 50].
In this work, we propose another light-weight segmenta-
tion head, referred to as Lite R-ASPP (or LR-ASPP), as
shown in Fig. 10. Lite R-ASPP, improving over R-ASPP,
deploys the global-average pooling in a fashion similar to
the Squeeze-and-Excitation module [20], in which we em-
ploy a large pooling kernel with a large stride (to save some
computation) and only one 1�1 convolution in the module.
We apply atrous convolution [18, 40, 33, 6] to the last block
of MobileNetV3 to extract denser features, and further add
a skip connection [30] from low-level features to capture
more detailed information.
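As we read Figure 10, the LR-ASPP head can be sketched as follows: a 1x1 convolution branch on the 1/16-resolution backbone features is gated by a large-kernel average-pooling branch passed through a 1x1 convolution and sigmoid, then projected to class logits and merged with a skip from the 1/8-resolution features. The layer widths (128 filters, 19 Cityscapes classes) follow the figure; the function and argument names are ours.

```python
import tensorflow as tf

def lite_r_aspp(high_level, low_level, num_classes=19, filters=128):
    """high_level: 1/16-resolution backbone features; low_level: 1/8-resolution features."""
    # Main branch: 1x1 convolution + BN + ReLU.
    x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(high_level)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)

    # Gating branch: large-kernel, large-stride average pooling (cheaper than
    # true global pooling), 1x1 convolution and sigmoid, then upsample and
    # multiply, in the spirit of squeeze-and-excitation.
    g = tf.keras.layers.AveragePooling2D(pool_size=49, strides=(16, 20),
                                         padding="valid")(high_level)
    g = tf.keras.layers.Conv2D(filters, 1, activation="sigmoid")(g)
    g = tf.image.resize(g, tf.shape(x)[1:3], method="bilinear")
    x = x * g

    # Project to class logits and add a skip connection from low-level features.
    x = tf.image.resize(x, tf.shape(low_level)[1:3], method="bilinear")
    x = tf.keras.layers.Conv2D(num_classes, 1)(x)
    skip = tf.keras.layers.Conv2D(num_classes, 1)(low_level)
    return x + skip
```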
We conduct the experiments on the Cityscapes dataset
[10] with metric mIOU [14], and only exploit the ‘fine’ an-
notations. We employ the same training protocol as [8, 39].
All our models are trained from scratch without pretraining
on ImageNet [38], and are evaluated with a single-scale in-
put. Similar to object detection, we observe that we could
reduce the channels in the last block of network backbone
by a factor of 2 without degrading the performance signif-
icantly. We think it is because the backbone is designed for 1000-class ImageNet image classification [38] while there are only 19 classes in Cityscapes, implying there is some channel redundancy in the backbone.
We report our Cityscapes validation set results in Tab. 7.
As shown in the table, we observe that (1) reducing the
channels in the last block of network backbone by a fac-
tor of 2 significantly improves the speed while maintaining
similar performances (row 1 vs. row 2, and row 5 vs. row
6), (2) the proposed segmentation head LR-ASPP is slightly
faster than R-ASPP [39] while performance is improved
(row 2 vs. row 3, and row 6 vs. row 7), (3) reducing the
filters in the segmentation head from 256 to 128 improves
the speed at the cost of slightly worse performance (row 3
vs. row 4, and row 7 vs. row 8), (4) when employing the
same setting, MobileNetV3 model variants attain similar
performance while being slightly faster than MobileNetV2
counterparts (row 1 vs. row 5, row 2 vs. row 6, row 3
vs. row 7, and row 4 vs. row 8), (5) MobileNetV3-Small
attains similar performance as MobileNetV2-0.5 while be-
ing faster, and (6) MobileNetV3-Small is significantly bet-
ter than MobileNetV2-0.35 while yielding similar speed.
Tab. 8 shows our Cityscapes test set results. Our segmentation models with MobileNetV3 as network backbone outperform ESPNetv2 [32], CCC2 [34], and ESPNetv1 [31] by 6.4%, 10.6%, and 12.3%, respectively, while being faster in terms of MAdds. The performance drops slightly by 0.6% when not employing the atrous convolution to extract dense feature maps in the last block of MobileNetV3, but the speed is improved to 1.98B MAdds (for half-resolution inputs), which is 1.36, 1.59, and 2.27 times faster than ESPNetv2, CCC2, and ESPNetv1, respectively. Furthermore, our models with MobileNetV3-Small as network backbone still outperform all of them by at least a healthy margin of 2.1%.
7. Conclusions and future work
In this paper we introduced MobileNetV3 Large and
Small models demonstrating new state of the art in mo-
bile classification, detection and segmentation. We have
described our efforts to harness multiple network architec-
ture search algorithms as well as advances in network de-
sign to deliver the next generation of mobile models. We
have also shown how to adapt nonlinearities like swish and
apply squeeze and excite in a quantization friendly and ef-
N  | Backbone | RF2 | SH      | F   | mIOU  | Params | MAdds  | CPU (f) | CPU (h)
1  | V2       | -   | R-ASPP  | 256 | 72.84 | 2.11M  | 21.29B | 3.90s   | 1.02s
2  | V2       | ✓   | R-ASPP  | 256 | 72.56 | 1.15M  | 13.68B | 3.03s   | 793ms
3  | V2       | ✓   | LR-ASPP | 256 | 72.97 | 1.02M  | 12.83B | 2.98s   | 786ms
4  | V2       | ✓   | LR-ASPP | 128 | 72.74 | 0.98M  | 12.57B | 2.89s   | 766ms
5  | V3       | -   | R-ASPP  | 256 | 72.64 | 3.60M  | 18.43B | 3.55s   | 906ms
6  | V3       | ✓   | R-ASPP  | 256 | 71.91 | 1.76M  | 11.24B | 2.60s   | 668ms
7  | V3       | ✓   | LR-ASPP | 256 | 72.37 | 1.63M  | 10.33B | 2.55s   | 659ms
8  | V3       | ✓   | LR-ASPP | 128 | 72.36 | 1.51M  | 9.74B  | 2.47s   | 657ms
9  | V2 0.5   | ✓   | LR-ASPP | 128 | 68.57 | 0.28M  | 4.00B  | 1.59s   | 415ms
10 | V2 0.35  | ✓   | LR-ASPP | 128 | 66.83 | 0.16M  | 2.54B  | 1.27s   | 354ms
11 | V3-Small | ✓   | LR-ASPP | 128 | 68.38 | 0.47M  | 2.90B  | 1.21s   | 327ms
Table 7. Semantic segmentation results on Cityscapes val set. RF2: Reduce the Filters in the last block by a factor of 2. V2 0.5 and V2 0.35 are MobileNetV2 with depth multiplier = 0.5 and 0.35, respectively. SH: Segmentation Head, either R-ASPP [39] or the proposed LR-ASPP. F: Number of Filters used in the Segmentation Head. CPU (f): CPU time measured on a single large core of Pixel 3 (floating point) w.r.t. a full-resolution input (i.e., 1024 × 2048). CPU (h): CPU time measured w.r.t. a half-resolution input (i.e., 512 × 1024). Rows 8 and 11 are our MobileNetV3 segmentation candidates.
Backbone      | OS | mIOU | MAdds (f) | MAdds (h) | CPU (f) | CPU (h)
V3            | 16 | 72.6 | 9.74B     | 2.48B     | 2.47s   | 657ms
V3            | 32 | 72.0 | 7.74B     | 1.98B     | 2.06s   | 534ms
V3-Small      | 16 | 69.4 | 2.90B     | 0.74B     | 1.21s   | 327ms
V3-Small      | 32 | 68.3 | 2.06B     | 0.53B     | 1.03s   | 275ms
ESPNetv2 [32] | -  | 66.2 | -         | 2.7B      | -       | -
CCC2 [34]     | -  | 62.0 | -         | 3.15B     | -       | -
ESPNetv1 [31] | -  | 60.3 | -         | 4.5B      | -       | -
Table 8. Semantic segmentation results on Cityscapes test set. OS: Output Stride, the ratio of input image spatial resolution to backbone output resolution. When OS = 16, atrous convolution is applied in the last block of backbone. When OS = 32, no atrous convolution is used. MAdds (f): Multiply-Adds measured w.r.t. a full-resolution input (i.e., 1024 × 2048). MAdds (h): Multiply-Adds measured w.r.t. a half-resolution input (i.e., 512 × 1024). CPU (f): CPU time measured on a single large core of Pixel 3 (floating point) w.r.t. a full-resolution input (i.e., 1024 × 2048). CPU (h): CPU time measured w.r.t. a half-resolution input (i.e., 512 × 1024). ESPNet [31, 32] and CCC2 [34] take half resolution inputs, while our models directly take full-resolution inputs.
ficient manner, introducing them into the mobile model domain as effective tools. We also introduced a new form of lightweight segmentation decoder called LR-ASPP. While it remains an open question how best to blend automatic search techniques with human intuition, we are pleased to present these first positive results and will continue to refine methods as future work.
Acknowledgements: We would like to thank Andrey
Zhmoginov, Dmitry Kalenichenko, Menglong Zhu, Jon
Shlens, Xiao Zhang, Benoit Jacob, Alex Stark, Achille
Brighton and Sergey Ioffe for helpful feedback and discus-
sion.
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene
Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat,
Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Is-
ard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Man-
junath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga,
Sherry Moore, Derek Murray, Chris Olah, Mike Schuster,
Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Tal-
war, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fer-
nanda Viégas, Oriol Vinyals, Pete Warden, Martin Watten-
berg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensor-
Flow: Large-scale machine learning on heterogeneous sys-
tems, 2015. Software available from tensorflow.org. 1
[2] R. Avenash and P. Vishawanth. Semantic segmentation of
satellite images using a modified cnn with hard-swish acti-
vation function. In VISIGRAPP, 2019. 2, 4
[3] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh
Raskar. Designing neural network architectures using rein-
forcement learning. CoRR, abs/1611.02167, 2016. 2
[4] Cristian Bucilua, Rich Caruana, and Alexandru Niculescu-
Mizil. Model compression. In Proceedings of the 12th ACM
SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD ’06, pages 535–541, New York, NY,
USA, 2006. ACM. 2
[5] Han Cai, Ligeng Zhu, and Song Han. Proxylessnas: Di-
rect neural architecture search on target task and hardware.
CoRR, abs/1812.00332, 2018. 2, 3, 6
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Semantic image segmen-
tation with deep convolutional nets and fully connected crfs.
In ICLR, 2015. 7
[7] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image
segmentation with deep convolutional nets, atrous convolu-
tion, and fully connected crfs. TPAMI, 2017. 7
[8] Liang-Chieh Chen, George Papandreou, Florian Schroff, and
Hartwig Adam. Rethinking atrous convolution for semantic
image segmentation. CoRR, abs/1706.05587, 2017. 7, 8
[9] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian
Schroff, and Hartwig Adam. Encoder-decoder with atrous
separable convolution for semantic image segmentation. In
ECCV, 2018. 7
[10] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
Franke, Stefan Roth, and Bernt Schiele. The cityscapes
dataset for semantic urban scene understanding. In CVPR,
2016. 8
[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre
David. Binaryconnect: Training deep neural networks with
binary weights during propagations. CoRR, abs/1511.00363,
2015. 2, 4
[12] Xiaoliang Dai, Peizhao Zhang, Bichen Wu, Hongxu Yin,
Fei Sun, Yanghan Wang, Marat Dukhan, Yunqing Hu, Yim-
ing Wu, Yangqing Jia, Peter Vajda, Matt Uyttendaele, and
Niraj K. Jha. Chamnet: Towards efficient network de-
sign through platform-aware model adaptation.
CoRR,
abs/1812.08934, 2018. 2
[13] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-
weighted linear units for neural network function approxi-
mation in reinforcement learning. CoRR, abs/1702.03118,
2017. 2, 4
[14] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christo-
pher K. I. Williams, John Winn, and Andrew Zisserma. The
pascal visual object classes challenge a retrospective. IJCV,
2014. 8
[15] Yihui He and Song Han. AMC: automated deep compres-
sion and acceleration with reinforcement learning. In ECCV,
2018. 2
[16] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities
and stochastic regularizers with gaussian error linear units.
CoRR, abs/1606.08415, 2016. 2, 4
[17] Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling
the knowledge in a neural network. In NIPS Deep Learning
and Representation Learning Workshop, 2015. 2
[18] Matthias Holschneider, Richard Kronland-Martinet, Jean
Morlet, and Ph Tchamitchian. A real-time algorithm for
signal analysis with the help of the wavelet transform. In
Wavelets: Time-Frequency Methods and Phase Space, pages
289–297. Springer Berlin Heidelberg, 1989. 7
[19] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry
Kalenichenko, Weijun Wang, Tobias Weyand, Marco An-
dreetto, and Hartwig Adam. Mobilenets: Efficient convolu-
tional neural networks for mobile vision applications. CoRR,
abs/1704.04861, 2017. 2
[20] J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation Net-
works. ArXiv e-prints, Sept. 2017. 2, 3, 7
[21] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kil-
ian Q. Weinberger. Condensenet: An efficient densenet using
learned group convolutions. CoRR, abs/1711.09224, 2017. 2
[22] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf,
Song Han, William J. Dally, and Kurt Keutzer. Squeezenet:
Alexnet-level accuracy with 50x fewer parameters and
<1mb model size. CoRR, abs/1602.07360, 2016. 2
[23] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu,
Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry
Kalenichenko. Quantization and training of neural networks
for efficient integer-arithmetic-only inference. In The IEEE
Conference on Computer Vision and Pattern Recognition
(CVPR), June 2018. 2
[24] Norman P. Jouppi, Cliff Young, Nishant Patil, David A.
Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates,
Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-
luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike
Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaem-
maghami, Rajendra Gottipati, William Gulland, Robert Hag-
mann, Richard C. Ho, Doug Hogberg, John Hu, Robert
Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski,
Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen
Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le,
Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gor-
don MacKean, Adriana Maggiore, Maire Mahony, Kieran
Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni,
Kathy Nix, Thomas Norrie, Mark Omernick, Narayana
Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad
Samadiani, Chris Severn, Gregory Sizikov, Matthew Snel-
ham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan,
Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay
Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and
Doe Hyun Yoon. In-datacenter performance analysis of a
tensor processing unit. CoRR, abs/1704.04760, 2017. 5
[25] Raghuraman Krishnamoorthi. Quantizing deep convolu-
tional networks for efficient inference: A whitepaper. CoRR,
abs/1806.08342, 2018. 2
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays,
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence
Zitnick. Microsoft COCO: Common objects in context. In
ECCV, 2014. 7
[27] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia
Li, Li Fei-Fei, Alan L. Yuille, Jonathan Huang, and Kevin
Murphy. Progressive neural architecture search. CoRR,
abs/1712.00559, 2017. 2
[28] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS:
differentiable architecture search. CoRR, abs/1806.09055,
2018. 2
[29] Wei Liu, Andrew Rabinovich, and Alexander C. Berg. Parsenet: Looking wider to see better. CoRR, abs/1506.04579, 2015. 7
[30] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In
CVPR, 2015. 7
[31] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda G.
Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spa-
tial pyramid of dilated convolutions for semantic segmen-
tation. In Computer Vision - ECCV 2018 - 15th European
Conference, Munich, Germany, September 8-14, 2018, Pro-
ceedings, Part X, pages 561–580, 2018. 8
[32] Sachin Mehta, Mohammad Rastegari, Linda G. Shapiro, and
Hannaneh Hajishirzi. Espnetv2: A light-weight, power ef-
ficient, and general purpose convolutional neural network.
CoRR, abs/1811.11431, 2018. 8
[33] George Papandreou, Iasonas Kokkinos, and Pierre-Andre
Savalle. Modeling local and global deformations in deep
learning: Epitomic convolution, multiple instance learning,
and sliding window detection. In CVPR, 2015. 7
[34] Hyojin Park, Youngjoon Yoo, Geonseok Seo, Dongyoon
Han, Sangdoo Yun, and Nojun Kwak. Concentrated-
comprehensive convolutions for lightweight semantic seg-
mentation. CoRR, abs/1812.04920, 2018. 8
[35] Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and
Jeff Dean. Efficient neural architecture search via parameter
sharing. CoRR, abs/1802.03268, 2018. 2
[36] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Search-
ing for activation functions. CoRR, abs/1710.05941, 2017.
2, 4
[37] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon,
and Ali Farhadi. Xnor-net: Imagenet classification using bi-
nary convolutional neural networks. CoRR, abs/1603.05279,
2016. 2
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy,
Aditya Khosla, Michael Bernstein, Alexander C. Berg, and
Li Fei-Fei. Imagenet large scale visual recognition challenge.
Int. J. Comput. Vision, 115(3):211–252, Dec. 2015. 5, 8
[39] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey
Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted
residuals and linear bottlenecks. mobile networks for classifi-
cation, detection and segmentation. CoRR, abs/1801.04381,
2018. 2, 3, 6, 7, 8
[40] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Math-
ieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional
networks. arXiv:1312.6229, 2013. 7
[41] Daniel Soudry, Itay Hubara, and Ron Meir. Expecta-
tion backpropagation: Parameter-free training of multilayer
neural networks with continuous or discrete weights. In
Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D.
Lawrence, and Kilian Q. Weinberger, editors, NIPS, pages
963–971, 2014. 2
[42] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke.
Inception-v4, inception-resnet and the impact of residual
connections on learning. CoRR, abs/1602.07261, 2016. 5
[43] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan,
and Quoc V. Le. Mnasnet: Platform-aware neural architec-
ture search for mobile. CoRR, abs/1807.11626, 2018. 2, 3,
5, 6
[44] SPSE the Society for Imaging Science, Technology, Soci-
ety of Photo-optical Instrumentation Engineers, and Techni-
cal Association of the Graphic Arts. Curves and Surfaces in
Computer Vision and Graphics. Number v. 1610 in Proceed-
ings of SPIE–the International Society for Optical Engineer-
ing. SPIE, 1992. 4
[45] Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Yanghan Wang,
Fei Sun, Yiming Wu, Yuandong Tian, Peter Vajda, Yangqing
Jia, and Kurt Keutzer. Fbnet: Hardware-aware efficient
convnet design via differentiable neural architecture search.
CoRR, abs/1812.03443, 2018. 2
[46] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter H. Jin, Sicheng
Zhao, Noah Golmant, Amir Gholaminejad, Joseph Gonza-
lez, and Kurt Keutzer. Shift: A zero flop, zero parameter
alternative to spatial convolutions. CoRR, abs/1711.08141,
2017. 2
[47] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and
Jian Cheng. Quantized convolutional neural networks for
mobile devices. CoRR, abs/1512.06473, 2015. 2
[48] Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang,
Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam.
Netadapt: Platform-aware neural network adaptation for mo-
bile applications. In ECCV, 2018. 2, 3
[49] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun.
Shufflenet: An extremely efficient convolutional neural net-
work for mobile devices. CoRR, abs/1707.01083, 2017. 2
[50] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
CVPR, 2017. 7
[51] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong
Chen. Incremental network quantization: Towards lossless
cnns with low-precision weights. CoRR, abs/1702.03044,
2017. 2
[52] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu,
and Yuheng Zou. Dorefa-net: Training low bitwidth convo-
lutional neural networks with low bitwidth gradients. CoRR,
abs/1606.06160, 2016. 2
[53] Barret Zoph and Quoc V. Le. Neural architecture search with
reinforcement learning. CoRR, abs/1611.01578, 2016. 2, 3
[54] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V.
Le. Learning transferable architectures for scalable image
recognition. CoRR, abs/1707.07012, 2017. 2