Article

Visual State Space Model for Image Deraining with Symmetrical Scanning

Yaoqing Zhang, Xin He, Chunxia Zhan and Junjie Li
1 School of Basic Sciences for Aviation, Naval Aviation University, Yantai 264001, China
2 School of Electromechanical and Automotive Engineering, Yantai University, Yantai 264005, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Symmetry 2024, 16(7), 871; https://doi.org/10.3390/sym16070871
Submission received: 13 June 2024 / Revised: 5 July 2024 / Accepted: 6 July 2024 / Published: 9 July 2024

Abstract

Image deraining aims to mitigate the adverse effects of rain streaks on image quality. Recently, the advent of convolutional neural networks (CNNs) and Vision Transformers (ViTs) has catalyzed substantial advancements in this field. However, these methods fail to effectively balance model efficiency and image deraining performance. In this paper, we propose an effective, locally enhanced visual state space model for image deraining, called DerainMamba. Specifically, we introduce a global-aware state space model to better capture long-range dependencies with linear complexity. In contrast to existing methods that utilize fixed unidirectional scan mechanisms, we propose a direction-aware symmetrical scanning module to enhance the feature capture of rain streak direction. Furthermore, we integrate a local-aware mixture of experts into our framework to mitigate local pixel forgetting, thereby enhancing the overall quality of high-resolution image reconstruction. Experimental results validate that the proposed method surpasses state-of-the-art approaches on six benchmark datasets.

1. Introduction

Rainy images often suffer from reduced visibility and contrast, which can significantly hinder the performance of computer vision algorithms, such as object detection, tracking, and recognition. Therefore, image deraining aims to restore clear images from their rainy counterparts. This task is of paramount importance in various applications, including autonomous driving, surveillance systems, and consumer photography [1,2].
To tackle this challenge, researchers have put forward various approaches, spanning from traditional filter-based methods to deep learning-based solutions. Traditional methods [3,4,5,6] typically depend on manually engineered features and prior information to distinguish rain streaks from the background. Nevertheless, these techniques often struggle to generalize effectively across complex and diverse rainy scenes. In contrast, deep learning-based approaches [7,8,9,10] have shown considerable promise by harnessing the ability of convolutional neural networks (CNNs) to acquire hierarchical representations of both rain streaks and background scenes. However, the convolution operation is both spatially invariant and locally confined. This characteristic fails to capture the spatially diverse attributes of image contents and hinders the exploration of non-local information.
Inspired by the success of natural language processing (NLP), Transformers have demonstrated superior performance compared to CNNs in various vision tasks. Unlike convolution operations, the self-attention mechanism in Transformers captures non-local information by computing correlations between all tokens [10,11]. However, the self-attention based on the scaled dot-product operation leads to quadratic time complexity, resulting in significant computational overhead. To reduce computational costs, some approaches [10,12] opt to apply self-attention across channels rather than spatial dimensions. However, these methods fail to fully exploit spatial representations, which could potentially impact image deraining performance.
State space models, particularly the advanced vision Mamba [13], have recently emerged as efficient frameworks owing to their ability to capture long-range dependencies with linear complexity. Inspired by this trend, we propose an effective visual state space model for image deraining, called DerainMamba. Unlike previous deraining works, we redesign the visual state space model for the image deraining task in search of a more effective feature representation. Its main components are the local-aware mixture of experts (LaMoE) for local feature perception and the global-aware state space model (GaSSM) for global feature perception.
Specifically, LaMoE extracts and represents local rain information by processing features in parallel through multiple CNN experts. GaSSM, in turn, adopts the Mamba processing scheme as its global feature unit: the input image is scanned from four different directions, forming four distinct feature sequences that comprehensively decompose the global features and are then processed separately. This enables comprehensive perception of global rain information within the image. Finally, the two sets of features are merged and refined through channel attention, effectively integrating positional information and enhancing the model's representation ability and performance.
The main contributions of this work are summarized below:
  • We propose an effective deraining model, called DerainMamba, which leverages the latest Mamba architecture to explore more efficient representation methods for the deraining task.
  • We design key components for the model, including the local-aware mixture of experts (LaMoE) for local feature perception and the global-aware state space model (GaSSM) for global feature perception. These components effectively integrate features from different locations and enhance the model’s performance.
  • We extract sequence data from four different directions, effectively decomposing global features and enabling comprehensive global rain information perception within images.
  • Extensive experiments conducted on several benchmarks demonstrate that our model achieves superior performance compared to state-of-the-art methods.

2. Related Work

In this section, we review recent work on single image deraining and on vision Mamba, the two areas most relevant to the present research.

2.1. Single Image Deraining

The single image deraining task aims to eliminate rain information from images with rainy backgrounds, where critical parameters such as rain location and size are highly uncertain. Traditional algorithms [3,4,5,6] typically impose a priori models on the clear image and the rain component, manually formulating model priors to reach a unified solution to this problem. However, these methods still have significant limitations when removing rain from real scenes.
Deep learning-based methods have surpassed earlier conventional algorithms and demonstrate strong performance in removing rain patterns. Yang et al. [7] proposed a recursive rain detection and removal network to iteratively and progressively remove rain streaks and clear rain streak buildup. Li et al. [8] proposed a novel deep network architecture based on deep convolutional and recurrent neural networks for single image deraining. It decomposed the rain removal process into multiple stages in order to deal with overlapping rain streaks. Ren et al. [9] proposed a Progressive ResNet (PRN) to exploit recursive computation by iteratively unfolding a shallow ResNet. A recurrent layer was further introduced to exploit deep feature dependencies across stages, resulting in the Progressive Recurrent Network (PReNet). Jiang et al. [14] removed rain streaks from a single image by introducing recursive computation to capture global texture and characterize the target rain streaks using complementary and redundant information in the spatial dimension.
Although these methods provide better performance than a priori-based methods, it is difficult for them to capture global contextual relationships due to the inherent limitations of convolution. In contrast, Xiao et al. [15] proposed a Transformer-based image deraining architecture that can capture long and complex rain streaks. Liu et al. [16] proposed Swin Transformer, which improves efficiency by limiting self-attention computation to non-overlapping local windows through a shifted-window computation scheme, while also allowing for inter-window connections. Chen et al. [10] proposed an effective Deraining network Sparse Transformer (DRSformer) that adaptively keeps the most useful self-attention values for feature aggregation to better facilitate high-quality image reconstruction.
However, the self-attention mechanism in Transformer-based methods incurs quadratic computational cost, demanding more computational resources and imposing a significant computational burden. Meanwhile, state space models (SSMs), in particular the recently introduced Mamba [17], outperform both CNN-based and Transformer-based models while remaining computationally efficient.

2.2. Vision Mamba

Mamba is a recently proposed selective structured state space model that excels in long-sequence global modeling tasks. It alleviates the modeling limitations of convolutional neural networks through global receptive fields and dynamic weighting, providing advanced modeling capabilities similar to those of Transformers while avoiding the high cost associated with their quadratic computational complexity. With the rapid development of this line of work, Mamba has also been applied to the field of computer vision and shows great potential as a foundational vision model.
UVMNet [18] is designed with a Bi-SSM block that integrates the local feature extraction ability of the convolutional layer with the ability of SSM to capture long-range dependencies, thereby achieving efficient single image dehazing. FDVMNet [19] constructs a two-path network to process the phase and magnitude information of images, respectively. A convolutional layer is integrated with SSM using C-SSM as the basic functional unit to achieve efficient local–global modeling. MambaIR [13] utilizes a local patch repetitiveness prior as well as channel-interacting residual state space blocks to generate recovery-specific feature representations for image super-resolution and image denoising tasks.
Our work is inspired by the rapidly developing research mentioned above and further demonstrates the use of the Mamba-based approach to better facilitate rain removal.

3. Proposed Method

Our goal is to develop an efficient deraining network that focuses on the relevance of both global and local information in image restoration. In this section, we first provide an overview of the network architecture of DerainMamba. Then, we delve into the details of the key components of DerainMamba, the vision DerainMamba block (VDB), including the local-aware mixture of experts (LaMoE) and the global-aware state space model (GaSSM). Finally, we introduce the operation layer and attention layer in the LaMoE and the direction-aware symmetrical scanning module (DaSM) in the GaSSM.

3.1. Network Architecture

We propose a vision DerainMamba block (VDB) based on a state space model to serve as the backbone of a four-layer inverted pyramid network. The goal is to effectively separate rain streak information from background information in rain images, enabling the reconstruction of clean and clear images. Specifically, the overall structure of the model is shown in Figure 1.
Given a degraded rain image $I_{rain} \in \mathbb{R}^{H \times W \times 3}$ as input, a $3 \times 3$ convolution is first applied to map the features into a four-level symmetrical encoder–decoder structure. The first encoder layer processes the shallow features $X_1 \in \mathbb{R}^{H \times W \times C}$. At each encoder connection, down-sampling operations hierarchically reduce the spatial dimensions and expand the number of channels, mapping the feature space to deeper levels $X_l \in \mathbb{R}^{H/2^{l-1} \times W/2^{l-1} \times 2^{l-1}C}$, $l \in [1, 2, 3, 4]$. By the fourth encoder layer, the deep features $X_4 \in \mathbb{R}^{H/8 \times W/8 \times 8C}$ are obtained.
During feature decoding and reconstruction, the deep features with low spatial resolution $X_4 \in \mathbb{R}^{H/8 \times W/8 \times 8C}$ are fed into the decoder, which mirrors the encoder's structure to achieve symmetrical feature reconstruction. As upsampling operations are performed, the spatial dimensions of the feature maps are doubled while the number of channels is halved, gradually restoring the feature space to match that of the input image. To facilitate feature extraction and fusion, we follow the approach in [10,12] and use skip connections to concatenate encoder features with the corresponding decoder features. Subsequently, the final decoded features $X_1 \in \mathbb{R}^{H \times W \times C}$ are passed through an output projection block and reshaped into $I_{res} \in \mathbb{R}^{H \times W \times 3}$. Finally, the degraded image $I_{rain}$ is combined with the residual image $I_{res}$ through a residual connection to generate the final reconstructed result $I_{derain} = I_{rain} + I_{res}$. The encoder–decoder modules in the network consist of $N_L$ VDBs, where $L \in [1, 2, 3, 4]$. The detailed structure of the VDB is elaborated in the next section.
The model is trained by minimizing the following loss function:
$$\mathcal{L} = \left\| I_{derain} - I_{gt} \right\|_1,$$
where $I_{gt}$ represents the ground truth image, and $\|\cdot\|_1$ denotes the L1-norm. By integrating this model architecture design, DerainMamba can more effectively capture rich rain streak information and fuse features at different scales.
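For readers who prefer code, the global residual formulation and the L1 objective above can be summarized in a few lines of PyTorch. This is a minimal illustrative sketch rather than the authors' released implementation; `DerainPipeline`, `l1_loss`, and the stand-in backbone are hypothetical names.

```python
import torch
import torch.nn as nn

class DerainPipeline(nn.Module):
    """Global residual scheme: I_derain = I_rain + I_res, with I_res predicted by a backbone."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # stand-in for the four-level encoder-decoder of VDBs

    def forward(self, i_rain: torch.Tensor) -> torch.Tensor:
        i_res = self.backbone(i_rain)  # predicted residual image I_res
        return i_rain + i_res          # I_derain = I_rain + I_res

def l1_loss(i_derain: torch.Tensor, i_gt: torch.Tensor) -> torch.Tensor:
    """L = || I_derain - I_gt ||_1, mean-reduced over all pixels."""
    return torch.mean(torch.abs(i_derain - i_gt))

# Toy usage with a single 3x3 convolution standing in for the real backbone.
model = DerainPipeline(nn.Conv2d(3, 3, kernel_size=3, padding=1))
rain = torch.randn(1, 3, 64, 64)
print(l1_loss(model(rain), torch.randn(1, 3, 64, 64)).item())
```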

3.2. Vision DerainMamba Block

Because rain in degraded images typically manifests in various irregular physical forms such as raindrops and streaks, and since it usually appears in an uneven distribution across the image, developing a model with simultaneous perception of both local and global information is crucial for image deraining tasks.
To enhance the model's perception of rain information, we design the vision DerainMamba block (VDB), as shown in Figure 1. Within it, we devise two crucial components for local and global information perception: the local-aware mixture of experts and the global-aware state space model. In the first stage of VDB processing, the input feature information undergoes layer normalization to provide stable and consistent normalization across the feature dimensions within each layer, thereby enhancing the training efficiency and performance of the model. Subsequently, the information is processed in parallel through LaMoE and GaSSM. The resulting local features and global features are then fused to obtain more detailed feature modeling information. Finally, the fused features are combined with the input features $V_{in}$ to obtain the first-stage features $V$. This process is defined as follows:
$$Local = LaMoE(\mathrm{Norm}_1(V_{in})),$$
$$Global = GaSSM(\mathrm{Norm}_1(V_{in})),$$
$$V = V_{in} + \mathcal{A}(Local(V_{in}), Global(V_{in})),$$
where $\mathrm{Norm}$ represents the layer normalization, $\mathcal{A}(\cdot)$ represents element-wise addition, and $LaMoE(\cdot)$ and $GaSSM(\cdot)$ represent the operations of processing features through the local-aware mixture of experts and the global-aware state space model, respectively.
To help the model better understand the semantic structure of the first-stage feature representation in the VDB, we introduce a second-stage processing module. This stage processes the encoded feature representation from the first stage. By adjusting and transforming the feature dimensions, we can effectively integrate positional information, thereby enhancing the model's representation ability and performance. First, a corresponding layer normalization and a $3 \times 3$ convolution are applied to improve feature extraction efficiency and reduce channel redundancy. Then, a channel attention mechanism [20] is used for deep feature extraction. Finally, the output is combined with the residual to obtain the output features $V_{out}$. This process is defined as follows:
$$V_{out} = V + \mathrm{CA}(\mathrm{Conv}_{3\times3}(\mathrm{Norm}_2(V))),$$
where $\mathrm{Conv}_{3\times3}$ and $\mathrm{CA}(\cdot)$ represent the $3 \times 3$ convolution and the channel attention operation, respectively.
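To make the two-stage computation concrete, the following PyTorch sketch mirrors the structure described above. The `LaMoE`, `GaSSM`, and channel-attention modules are passed in as generic `nn.Module`s, and applying `LayerNorm` over the channel dimension of a (B, C, H, W) tensor is one reasonable choice; this is an illustrative approximation, not the authors' code.

```python
import torch
import torch.nn as nn

class VDB(nn.Module):
    """Vision DerainMamba block: parallel local/global branches, then a channel-attention stage."""
    def __init__(self, dim: int, lamoe: nn.Module, gassm: nn.Module, ca: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.lamoe, self.gassm, self.ca = lamoe, gassm, ca
        self.conv3x3 = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

    @staticmethod
    def _norm(norm: nn.LayerNorm, x: torch.Tensor) -> torch.Tensor:
        # Apply LayerNorm over the channel dimension of a (B, C, H, W) tensor.
        return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

    def forward(self, v_in: torch.Tensor) -> torch.Tensor:
        n1 = self._norm(self.norm1, v_in)
        v = v_in + self.lamoe(n1) + self.gassm(n1)   # stage 1: fuse local and global branches
        n2 = self._norm(self.norm2, v)
        return v + self.ca(self.conv3x3(n2))         # stage 2: 3x3 conv + channel attention
```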

3.3. Local-Aware Mixture of Experts

Based on the ideas of CNN and dynamic weight allocation, we propose a key component called LaMoE for extracting local rain information, which consists of two main layers: the operation layer and the attention layer.
Referencing recently designed efficient CNN models [11,21], we chose multiple parallel local CNN operations in the operation layer to form independently distributed experts. These include standard convolution operations, dilated convolution operations, and average pooling operations.
Inspired by the dynamic weight allocation concept from [22], we use a self-attention mechanism in the attention layer to adaptively determine the importance of different representations based on the input. This calculates the attention weights corresponding to the parallel outputs of the operation layer. The generated set of weights is then multiplied by the respective feature information produced by the operation layer via matrix multiplication, dynamically outputting the most significant information from each expert layer.
Then, a concatenation operation is performed to combine all the information. Finally, a $1 \times 1$ convolution is used to adjust the output feature dimensions to match those of the input features. The output is then combined with the input $X_{in} \in \mathbb{R}^{C \times H \times W}$ through a residual connection to generate the final output features. The specific process of LaMoE is defined as follows:
$$LaMoE = X_{in} + \mathrm{Conv}_{1\times1}(\mathrm{Cat}(p_l^1 \cdot a_l^1, \ldots, p_l^z \cdot a_l^z)),$$
where $\mathrm{Conv}_{1\times1}$ and $\mathrm{Cat}$ represent the $1 \times 1$ convolution and channel concatenation; $p$ and $a$ represent the values of the operation layer and the attention layer, respectively; $l$ denotes the $l$-th local-aware mixture of experts; and $z$ is the number of operations contained in the operation layer and attention layer. The structure of LaMoE is shown in Figure 2.

3.3.1. Operation Layer

In the operation layer, the experts are executed in parallel. The components consist of standard convolutional layers (with kernel sizes of $1 \times 1$, $3 \times 3$, $5 \times 5$, and $7 \times 7$), dilated convolutional layers (with kernel sizes of $3 \times 3$, $5 \times 5$, and $7 \times 7$ and a dilation rate of 2), and an average pooling layer (with a receptive field of $3 \times 3$). After the convolution and pooling computations, a ReLU activation function is added to enhance the network's non-linear abilities. To ensure that the input and output sizes match, we apply zero padding to the feature maps computed by each expert and then concatenate them along the channel dimension to obtain the final output of the operation layer, which can be expressed as $p_l = [p_l^1, \ldots, p_l^z]$.

3.3.2. Attention Layer

In this study, we analyze the attention layer in the $l$-th LaMoE. The attention layer receives the same input as the operation layer, namely the feature map $X_{in} \in \mathbb{R}^{C \times H \times W}$. We then average over the spatial dimensions of each channel to generate the feature distribution $X_c \in \mathbb{R}^{C}$ along the channel dimension as follows:
$$X_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_{in}(i, j),$$
where $(i, j)$ indexes the spatial position of the feature $X_{in}$ over $H \times W$. Then, we use the attention mechanism to generate dynamic weight distribution matrices $W_1 \in \mathbb{R}^{T \times C}$ and $W_2 \in \mathbb{R}^{z \times T}$, where $T$ is the dimension of the weight matrices.
Next, we fuse the features of these two matrices to obtain the $k$-th ($k \in [1, 2, \ldots, z]$) output of the attention layer as follows:
$$a_l^k = W_2\,\sigma(W_1 X_c),$$
where $\sigma$ represents the ReLU function. The output of the full attention layer can be expressed as $a_l = [a_l^1, \ldots, a_l^z]$.
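Putting the operation layer and attention layer together, a LaMoE forward pass might look like the sketch below. The expert set follows the kernel sizes listed in Section 3.3.1, while the hidden width `t` of the attention layer and the per-expert scalar weighting are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LaMoE(nn.Module):
    """Local-aware mixture of experts: parallel CNN experts weighted by a lightweight attention layer."""
    def __init__(self, dim: int, t: int = 16):
        super().__init__()
        # Operation layer: standard convs, dilated convs, and average pooling (8 experts in total).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 5, padding=2), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 7, padding=3), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=2, dilation=2), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 5, padding=4, dilation=2), nn.ReLU(inplace=True)),
            nn.Sequential(nn.Conv2d(dim, dim, 7, padding=6, dilation=2), nn.ReLU(inplace=True)),
            nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1), nn.ReLU(inplace=True)),
        ])
        z = len(self.experts)
        # Attention layer: channel-averaged descriptor -> W1 -> ReLU -> W2 -> z expert weights.
        self.w1 = nn.Linear(dim, t)
        self.w2 = nn.Linear(t, z)
        self.relu = nn.ReLU(inplace=True)
        self.fuse = nn.Conv2d(z * dim, dim, kernel_size=1)

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        feats = [e(x_in) for e in self.experts]            # operation layer outputs p^1..p^z
        x_c = x_in.mean(dim=(2, 3))                        # spatial average per channel, (B, C)
        a = self.w2(self.relu(self.w1(x_c)))               # attention weights a^1..a^z, (B, z)
        weighted = [f * a[:, k].view(-1, 1, 1, 1) for k, f in enumerate(feats)]
        out = self.fuse(torch.cat(weighted, dim=1))        # concatenate and project back to C channels
        return x_in + out                                  # residual connection
```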

3.4. Global-Aware State Space Model

Motivated by the excellent global modeling ability of state space models [17,23] in the field of image restoration, and inspired by the Mamba model built on them, we design the global-aware state space model (GaSSM) to extract global rain information. The key component of GaSSM is the direction-aware symmetrical scanning module (DaSM).
Specifically, in GaSSM, the input features $X_{in}$ are sequentially passed through a linear layer, a depth-wise convolution, and the SiLU activation function to achieve effective feature transformation and non-linear representation of complex patterns. This facilitates advanced global rain information modeling by the DaSM. To handle different batch sizes and sequence data, a layer normalization operation is added after the DaSM. In a parallel branch, the input features are processed through a linear layer and the SiLU activation function; the resulting non-linear features are then fused with the first branch via the Hadamard product, strengthening the model's ability to model global rain information and thereby achieving the goal of image deraining. Finally, the output of the GaSSM is obtained through another linear layer. The specific processing steps of the GaSSM are as follows:
$$X_D = \mathrm{Norm}(\mathrm{DaSM}(\phi(\mathrm{DConv}(\mathrm{Linear}(X_{in}))))),$$
$$X_S = \phi(\mathrm{Linear}(X_{in})),$$
$$GaSSM = \mathrm{Linear}(X_D \otimes X_S),$$
where $X_D$ and $X_S$ represent the outputs of the two branches of GaSSM, and $\mathrm{DConv}(\cdot)$, $\phi(\cdot)$, and $\otimes$ represent the depth-wise convolution, the SiLU activation function, and the Hadamard product, respectively.
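The two-branch structure of GaSSM in the equations above can be sketched as follows, assuming a `DaSM` module (sketched in the next subsection) that maps a (B, C, H, W) tensor to a tensor of the same shape. Module names, projection widths, and the depth-wise kernel size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GaSSM(nn.Module):
    """Global-aware state space model: an SSM branch gated by a SiLU-activated linear branch."""
    def __init__(self, dim: int, dasm: nn.Module):
        super().__init__()
        self.in_proj_d = nn.Linear(dim, dim)
        self.in_proj_s = nn.Linear(dim, dim)
        self.dconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise conv
        self.act = nn.SiLU()
        self.dasm = dasm                    # direction-aware symmetrical scanning module
        self.norm = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x_in: torch.Tensor) -> torch.Tensor:
        x = x_in.permute(0, 2, 3, 1)                          # (B, H, W, C) for the linear layers
        x_d = self.in_proj_d(x).permute(0, 3, 1, 2)           # back to (B, C, H, W)
        x_d = self.act(self.dconv(x_d))
        x_d = self.norm(self.dasm(x_d).permute(0, 2, 3, 1))   # X_D branch
        x_s = self.act(self.in_proj_s(x))                     # X_S branch
        return self.out_proj(x_d * x_s).permute(0, 3, 1, 2)   # Hadamard product, then output projection
```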

Direction-Aware Symmetrical Scanning Module

Most previous image deraining methods are built on the Transformer model. The Mamba model, in contrast, uses recursive computation that depends on the state of the previous sequence element, making the process more efficient and resolving the quadratic complexity issue of Transformers.
To enable Mamba to better handle the spatial variation of rain streaks in the sequence data derived from image features, we design the direction-aware symmetrical scanning module. This module generates sequences from the 2D image array $P_{in} \in \mathbb{R}^{B \times C \times H \times W}$ by combining two scanning starting points (top left and bottom right) and two scanning directions (horizontal and vertical). This results in four sequences $Q_L^K$ (where $K = 1, 2, 3, 4$), each corresponding to a different scanning direction. The handling methods for the different directional scanning mechanisms of the DaSM are illustrated in Figure 3.
For example, the array $P_{in}$ can be represented as follows:
$$P_{in} = \begin{bmatrix} x_1 & \cdots & x_n & \cdots & x_L \\ \vdots & & \vdots & & \vdots \\ x_{nL+1} & \cdots & x_{nL+n} & \cdots & x_{(n+1)L} \\ \vdots & & \vdots & & \vdots \\ x_{L(L-1)+1} & \cdots & x_{L(L-1)+n} & \cdots & x_{LL} \end{bmatrix}.$$
The model performs a channel-by-channel horizontal scan from the top left to the bottom right of the array, generating the sequence $Q_L^1$ as follows:
$$Q_L^1 = \left[ x_1, \ldots, x_L, \ldots, x_{nL+1}, \ldots, x_{(n+1)L}, \ldots, x_{L(L-1)+1}, \ldots, x_{LL} \right],$$
where the sequences for the other three directions are generated similarly.
These four sequences are then used as the input $x_t$ for the recursive computation, with $y_t$ as the output result and $h_t$ representing the intermediate latent state between $x_t$ and $y_t$. The computation process is defined as follows:
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t,$$
$$y_t = C h_t + D x_t,$$
where $A$, $B$, $C$, and $D$ represent four different coefficients in the state equation. Their values are iteratively updated during training, allowing the output sequence $y = [y_1, \ldots, y_{LL}]$ to more closely approximate the true deraining expression. Finally, the four output sequences are combined from the different directions to serve as the global fused feature output of the DaSM, denoted $P_{out} \in \mathbb{R}^{B \times C \times H \times W}$.
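The four symmetrical scan orders and the state-space recurrence can be illustrated with the sketch below. The selective, input-dependent parameterization of the real Mamba kernel is simplified here to a plain linear recurrence with fixed learned A, B, C, and D matrices, so this is a didactic approximation of the DaSM rather than the actual implementation.

```python
import torch
import torch.nn as nn

def four_direction_scan(p_in: torch.Tensor):
    """Flatten a (B, C, H, W) map into four sequences: horizontal and vertical scans
    starting from the top-left and from the bottom-right."""
    q1 = p_in.flatten(2)                          # horizontal, top-left start
    q2 = p_in.transpose(2, 3).flatten(2)          # vertical, top-left start
    q3 = torch.flip(q1, dims=[-1])                # horizontal, bottom-right start
    q4 = torch.flip(q2, dims=[-1])                # vertical, bottom-right start
    return [q1, q2, q3, q4]

def merge_four_directions(ys, h: int, w: int) -> torch.Tensor:
    """Undo each scan order and average the four outputs back into a (B, C, H, W) map."""
    y1, y2, y3, y4 = ys
    b, c, _ = y1.shape
    out = y1.view(b, c, h, w)
    out = out + y2.view(b, c, w, h).transpose(2, 3)
    out = out + torch.flip(y3, dims=[-1]).view(b, c, h, w)
    out = out + torch.flip(y4, dims=[-1]).view(b, c, w, h).transpose(2, 3)
    return out / 4.0

class SimpleSSM(nn.Module):
    """Toy recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t + D x_t along a sequence of channel vectors."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.eye(state) * 0.9)
        self.B = nn.Parameter(torch.randn(state, dim) * 0.01)
        self.C = nn.Parameter(torch.randn(dim, state) * 0.01)
        self.D = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, L); the recurrence runs over the sequence dimension L.
        b, c, L = x.shape
        h = x.new_zeros(b, self.A.shape[0])
        ys = []
        for t in range(L):
            xt = x[:, :, t]
            h = h @ self.A.T + xt @ self.B.T
            ys.append(h @ self.C.T + xt * self.D)
        return torch.stack(ys, dim=-1)            # (B, C, L)

class DaSM(nn.Module):
    """Direction-aware symmetrical scanning: one SSM per scan direction, outputs merged."""
    def __init__(self, dim: int):
        super().__init__()
        self.ssms = nn.ModuleList(SimpleSSM(dim) for _ in range(4))

    def forward(self, p_in: torch.Tensor) -> torch.Tensor:
        h, w = p_in.shape[-2:]
        seqs = four_direction_scan(p_in)
        return merge_four_directions([m(q) for m, q in zip(self.ssms, seqs)], h, w)
```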

4. Experimental Results

In this section, we outline the experimental setup and provide implementation details. To thoroughly evaluate the proposed DerainMamba component, we conduct comprehensive image deraining experiments on widely used benchmark datasets. Additionally, we perform ablation studies to confirm the effectiveness of DerainMamba.

4.1. Datasets and Metrics

In this section, we conduct extensive experiments using the Rain13K training dataset, which consists of 13,700 pairs of clean and rain-affected images. For testing, we evaluate our method on five synthetic benchmarks: Test100 [24], Rain100H [7], Rain100L [7], Test2800 [25], and Test1200 [26]. Furthermore, we evaluate our approach using a comprehensive real-world dataset, namely SPA-Data [27], which comprises 638,492 image pairs for training and 1000 for testing. We calculate the PSNR and SSIM [28] scores using the Y channel in the YCbCr color space to provide quantitative comparisons.
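Because the metrics are computed on the Y channel of the YCbCr color space, evaluation can be sketched as follows. The conversion uses the standard ITU-R BT.601 luma coefficients, and SSIM would in practice be delegated to an image-processing library; this snippet is an illustrative sketch, not the authors' evaluation script.

```python
import numpy as np

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Convert an RGB image with values in [0, 255] to the Y (luma) channel, BT.601 convention."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(restored: np.ndarray, gt: np.ndarray) -> float:
    """PSNR between the Y channels of two 8-bit-range RGB images."""
    y1 = rgb_to_y(restored.astype(np.float64))
    y2 = rgb_to_y(gt.astype(np.float64))
    mse = np.mean((y1 - y2) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```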

4.2. Implementation Details

The block numbers $\{N_1, N_2, N_3, N_4\}$ in our model are $\{4, 6, 6, 8\}$. In the VDB, the number of experts in LaMoE is set to 8, and the number of directional scan paths in the DaSM of GaSSM is set to 4. For training, we employed the Adam optimizer with a patch size of $256 \times 256$. The initial learning rate was set to $2 \times 10^{-4}$ and adaptively adjusted using the cosine annealing strategy [29], gradually decreasing to $1 \times 10^{-6}$. The main experiment ran for 300,000 iterations. The entire model was implemented on the PyTorch platform using an NVIDIA RTX 4090 GPU.
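The optimization schedule described above corresponds roughly to the following PyTorch setup. The model and the data are placeholders here, and `CosineAnnealingLR` is the standard scheduler assumed to approximate the cosine annealing strategy; this is a hedged sketch of the training configuration, not the authors' training script.

```python
import torch

total_iters = 300_000
model = torch.nn.Conv2d(3, 3, 3, padding=1)      # placeholder standing in for DerainMamba
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_iters, eta_min=1e-6)  # decay from 2e-4 to 1e-6 over 300k iterations

for step in range(total_iters):
    # Placeholder batch of 256x256 rainy/clean patches; a real loader would supply Rain13K pairs.
    rain = torch.randn(1, 3, 256, 256)
    gt = torch.randn(1, 3, 256, 256)
    derain = rain + model(rain)                  # residual prediction
    loss = torch.mean(torch.abs(derain - gt))    # L1 loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```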

4.3. Comparison with State-of-the-Art

We compare our method against 12 state-of-the-art image deraining techniques on the synthetic Rain13K benchmarks and against 14 state-of-the-art techniques on the real-world dataset SPA-Data [27], including DSC [6], GMM [5], DDN [25], DerainNet [30], SEMI [31], DIDMDN [26], UMRL [32], RESCAN [8], PReNet [9], MSPFN [14], RCDNet [33], MPRNet [34], DualGCN [35], SPDNet [36], DGUNet [37], KiT [38], Uformer [39], Restormer [12], and IDT [15]. For Uformer and IDT, since their papers do not include experiments on the Rain13K dataset, we retrained their models using their publicly released code to ensure a fair comparison. For the other methods, we refer to the results reported in their respective papers. In the following, we examine the qualitative and quantitative results of the experiments.

4.3.1. Quantitative Comparison

Following previous works, we adopt the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) as evaluation metrics to measure the difference between the restored image and the ground-truth image. PSNR evaluates the error between corresponding pixels in an image, while SSIM evaluates the similarity between two images from multiple perspectives, with values typically ranging from 0 to 1. Generally, higher PSNR and SSIM values indicate better model performance. Table 1 and Table 2 present the quantitative comparison results of the different algorithms on the synthetic Rain13K benchmarks and the real-world SPA-Data dataset. Compared to existing CNN and Transformer models, our proposed model achieves the highest PSNR and SSIM values across all six test sets. Specifically, on the Test100 [24], Rain100H [7], Rain100L [7], and Test1200 [26] datasets, DerainMamba outperforms DGUNet with PSNR improvements of 0.91 dB, 0.13 dB, 0.59 dB, and 0.01 dB, respectively. On the Test2800 [25] dataset and in the average PSNR, DerainMamba surpasses KiT and DGUNet by 0.02 dB and 0.37 dB, respectively. These quantitative results demonstrate that our method handles various types of rain distribution more effectively and accurately.

4.3.2. Qualitative Comparison

For the qualitative comparison, as shown in Figure 4, our method not only restores the most realistic and detailed information in the foreground of the image despite light rain interference, but also removes rain from the background to the greatest extent possible. As depicted in Figure 5, under heavy rain interference, our method continues to demonstrate superior performance compared to the other methods, accurately restoring facial details. Figure 6 illustrates the effectiveness of rain streak removal in large white areas of images. It is evident that our model excels in the visual recovery of background details and textures and in suppressing boundary artifacts, outperforming other methods in handling these aspects.
In real-world scenarios, as shown in Figure 7, DerainMamba demonstrates superior performance in removing real rain compared to other algorithms. It effectively eliminates rain in complex backgrounds, which not only showcases its success in theoretical research but also highlights its outstanding performance in practical applications.

4.3.3. Model Complexity

We extend our evaluation to include an analysis of model complexity, comparing the proposed approach with state-of-the-art methods in terms of FLOPs and model parameters. As illustrated in Table 3, our model demonstrates lower FLOP values and fewer network parameters, all while achieving competitive performance, as indicated in Table 2.
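For reference, the parameter counts in Table 3 can be reproduced for any PyTorch model with a small helper; FLOPs are usually measured with an external profiler, which is omitted here. The helper name is illustrative.

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```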

4.4. Ablation Studies

In this section, we conduct ablation experiments on different variants under the same experimental conditions as the main experiment to ensure the most convincing results. We perform ablations on the main components. Additionally, we ablate the number of experts in LaMoE and the different scan paths in the DaSM of GaSSM. Detailed descriptions are provided below.

4.4.1. Analysis of Main Components

To demonstrate the superiority of our framework, we conduct several ablation experiments on the main components of VDB and their different connection methods: the ablation of LaMoE and GaSSM (a) with LaMoE and (b) with GaSSM; the ablation of LaMoE and GaSSM connection methods (c) in series and (d) in parallel; and the ablation of channel attention (e) without channel attention. Table 4 presents the quantitative results on the Test100 [24] dataset, where S represents series and P represents parallel. We observe that our model (d) outperforms other possible configurations, indicating that each design strategy we consider contributes to the final performance of DerainMamba.

4.4.2. Analysis of the Number of Experts in LaMoE

To analyze the impact of different numbers of experts in the LaMoE module on model performance, we configure various numbers of experts as shown in Figure 8, which presents the quantitative results on the Test100 [24] dataset. Models using multiple experts perform better than those using a single expert. Unlike models where all experts have the same structure [40], our LaMoE module comprises diverse expert structures. Due to differences in receptive fields and individual characteristics, different types of experts capture features from various scales within the image. This diversity enhances the model’s ability to perceive local information, resulting in improved performance.

4.4.3. Analysis of the DaSM in GaSSM

We analyze the impact of the number of directional scan paths in the DaSM on GaSSM performance using the Test100 [24] dataset. As shown in Table 5, we evaluate the deraining performance of the model using single-path, dual-path, and triple-path scanning mechanisms. Since rain typically appears at various uncertain positions in the image and exhibits uncertain shapes and sizes, scanning from four different directions allows for a more comprehensive perception of rain features. The results indicate that our four-path scanning design generally leads to better performance.

5. Conclusions

In this paper, we propose an effective and efficient visual state space model called DerainMamba for image deraining. Specifically, we integrate the global-aware state space model and the local-aware mixture of experts into the proposed framework to jointly capture rich rain representations. We demonstrate the effectiveness of the direction-aware symmetrical scanning mechanism in image deraining, showcasing its ability to better model global information with linear complexity. Extensive evaluations and comparisons on both synthetic and real-world datasets indicate that our approach achieves a balance between model efficiency and deraining performance. However, the use of multiple scanning directions may introduce information redundancy and additional computation, which can limit model performance. In future work, we will further improve the performance and reduce the complexity of Mamba-based deraining.

Author Contributions

Writing—original draft, writing—review & editing: Y.Z.; conceptualization, writing—review & editing: X.H.; formal analysis, methodology: C.Z.; supervision, data curation, and resources: J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly supported by the National Natural Science Foundation of China under Grants 51779028, 51309043, and the Key Project of Art Science of the Shandong Provincial Association for the Science of Arts & Culture under Grant L2024Z05100707.

Data Availability Statement

The experimental datasets used in this paper are available online at https://www.deraining.tech/benchmark.html (accessed on 10 September 2023).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, X.; Pan, J.; Dong, J.; Tang, J. Towards unified deep image deraining: A survey and a new benchmark. arXiv 2023, arXiv:2310.03535. [Google Scholar]
  2. Chen, X.; Pan, J.; Jiang, K.; Li, Y.; Huang, Y.; Kong, C.; Dai, L.; Fan, Z. Unpaired deep image deraining using dual contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 2017–2026. [Google Scholar]
  3. Gu, S.; Meng, D.; Zuo, W.; Zhang, L. Joint convolutional analysis and synthesis sparse representation for single image layer separation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1708–1716. [Google Scholar]
  4. Kang, L.W.; Lin, C.W.; Fu, Y.H. Automatic single-image-based rain streaks removal via image decomposition. IEEE Trans. Image Process. 2011, 21, 1742–1755. [Google Scholar] [CrossRef] [PubMed]
  5. Li, Y.; Tan, R.T.; Guo, X.; Lu, J.; Brown, M.S. Rain streak removal using layer priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2736–2744. [Google Scholar]
  6. Luo, Y.; Xu, Y.; Ji, H. Removing rain from a single image via discriminative sparse coding. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 11–18 December 2015; pp. 3397–3405. [Google Scholar]
  7. Yang, W.; Tan, R.T.; Feng, J.; Liu, J.; Guo, Z.; Yan, S. Deep joint rain detection and removal from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1357–1366. [Google Scholar]
  8. Li, X.; Wu, J.; Lin, Z.; Liu, H.; Zha, H. Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 254–269. [Google Scholar]
  9. Ren, D.; Zuo, W.; Hu, Q.; Zhu, P.; Meng, D. Progressive image deraining networks: A better and simpler baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3937–3946. [Google Scholar]
  10. Chen, X.; Li, H.; Li, M.; Pan, J. Learning a sparse transformer network for effective image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 5896–5905. [Google Scholar]
  11. Chen, X.; Pan, J.; Lu, J.; Fan, Z.; Li, H. Hybrid cnn-transformer feature fusion for single image deraining. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 378–386. [Google Scholar]
  12. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 5728–5739. [Google Scholar]
  13. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S.T. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. arXiv 2024, arXiv:2402.15648. [Google Scholar]
  14. Jiang, K.; Wang, Z.; Yi, P.; Chen, C.; Huang, B.; Luo, Y.; Ma, J.; Jiang, J. Multi-scale progressive fusion network for single image deraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8346–8355. [Google Scholar]
  15. Xiao, J.; Fu, X.; Liu, A.; Wu, F.; Zha, Z.J. Image de-raining transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 12978–12995. [Google Scholar] [CrossRef] [PubMed]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  17. Xu, R.; Yang, S.; Wang, Y.; Du, B.; Chen, H. A Survey on Vision Mamba: Models, Applications and Challenges. arXiv 2024, arXiv:2404.18861. [Google Scholar]
  18. Zheng, Z.; Wu, C. U-shaped Vision Mamba for Single Image Dehazing. arXiv 2024, arXiv:2402.04139. [Google Scholar]
  19. Zheng, Z.; Zhang, J. FD-Vision Mamba for Endoscopic Exposure Correction. arXiv 2024, arXiv:2402.06378. [Google Scholar]
  20. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  21. Suganuma, M.; Liu, X.; Okatani, T. Attention-based adaptive selection of operations for image restoration in the presence of unknown combined distortions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 9039–9048. [Google Scholar]
  22. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  23. Zhou, H.; Wu, X.; Chen, H.; Chen, X.; He, X. RSDehamba: Lightweight Vision Mamba for Remote Sensing Satellite Image Dehazing. arXiv 2024, arXiv:2405.10030. [Google Scholar]
  24. Zhang, H.; Sindagi, V.; Patel, V.M. Image de-raining using a conditional generative adversarial network. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 3943–3956. [Google Scholar] [CrossRef]
  25. Fu, X.; Huang, J.; Zeng, D.; Huang, Y.; Ding, X.; Paisley, J. Removing rain from single images via a deep detail network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3855–3863. [Google Scholar]
  26. Zhang, H.; Patel, V.M. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 695–704. [Google Scholar]
  27. Wang, T.; Yang, X.; Xu, K.; Chen, S.; Zhang, Q.; Lau, R.W. Spatial attentive single-image deraining with a high quality real rain dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 12270–12279. [Google Scholar]
  28. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  29. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  30. Fu, X.; Huang, J.; Ding, X.; Liao, Y.; Paisley, J. Clearing the skies: A deep network architecture for single-image rain removal. IEEE Trans. Image Process. 2017, 26, 2944–2956. [Google Scholar] [CrossRef] [PubMed]
  31. Wei, W.; Meng, D.; Zhao, Q.; Xu, Z.; Wu, Y. Semi-supervised transfer learning for image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3877–3886. [Google Scholar]
  32. Yasarla, R.; Patel, V.M. Uncertainty guided multi-scale residual learning-using a cycle spinning cnn for single image de-raining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8405–8414. [Google Scholar]
  33. Wang, H.; Xie, Q.; Zhao, Q.; Meng, D. A model-driven deep neural network for single image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3103–3112. [Google Scholar]
  34. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M.H.; Shao, L. Multi-stage progressive image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14821–14831. [Google Scholar]
  35. Fu, X.; Qi, Q.; Zha, Z.J.; Zhu, Y.; Ding, X. Rain streak removal via dual graph convolutional network. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 1352–1360. [Google Scholar]
  36. Yi, Q.; Li, J.; Dai, Q.; Fang, F.; Zhang, G.; Zeng, T. Structure-preserving deraining with residue channel prior guidance. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 4238–4247. [Google Scholar]
  37. Mou, C.; Wang, Q.; Zhang, J. Deep generalized unfolding networks for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 17399–17410. [Google Scholar]
  38. Lee, H.; Choi, H.; Sohn, K.; Min, D. KNN Local Attention for Image Restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 2139–2149. [Google Scholar]
  39. Wang, Z.; Cun, X.; Bao, J.; Zhou, W.; Liu, J.; Li, H. Uformer: A general u-shaped transformer for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 July 2022; pp. 17683–17693. [Google Scholar]
  40. Kim, S.; Ahn, N.; Sohn, K.A. Restoring spatially-heterogeneous distortions using mixture of experts network. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
Figure 1. The overall pipeline of DerainMamba. Each vision DerainMamba block consists of a local-aware mixture of experts and a global-aware state space model. The local-aware mixture of experts is explained in Figure 2.
Figure 2. The architecture of the local-aware mixture of experts (LaMoE).
Figure 3. The symmetrical scanning mechanism of the DaSM integrates rain features from four different directions to better perceive the spatial variation and distribution of rain streaks.
Figure 4. Visual quality comparison of deraining images obtained by different methods on the Test100 benchmark dataset.
Figure 5. Visual quality comparison of deraining images obtained by different methods on the Rain100H benchmark dataset.
Figure 6. Visual quality comparison of deraining images obtained by different methods on the Test1200 benchmark dataset.
Figure 7. Visual quality comparison of deraining images obtained by different methods on the SPA-Data benchmark dataset.
Figure 8. Ablation study for the number of experts in LaMoE.
Table 1. Comparison of quantitative results on the Rain13K benchmark dataset. Bold and underline indicate the best and second-best results, respectively.
Method | Test100 [24] PSNR/SSIM | Rain100H [7] PSNR/SSIM | Rain100L [7] PSNR/SSIM | Test2800 [25] PSNR/SSIM | Test1200 [26] PSNR/SSIM | Average PSNR/SSIM
DerainNet [30] | 22.77/0.810 | 14.92/0.592 | 27.03/0.884 | 24.31/0.861 | 23.38/0.835 | 22.48/0.796
SEMI [31] | 22.35/0.788 | 16.56/0.486 | 25.03/0.842 | 24.43/0.782 | 26.05/0.822 | 22.88/0.744
DIDMDN [26] | 22.56/0.818 | 17.35/0.524 | 25.23/0.741 | 28.13/0.867 | 29.95/0.901 | 24.64/0.770
UMRL [32] | 24.41/0.829 | 26.01/0.832 | 29.18/0.923 | 29.97/0.905 | 30.55/0.910 | 28.02/0.880
RESCAN [8] | 25.00/0.835 | 26.36/0.786 | 29.80/0.881 | 31.29/0.904 | 30.51/0.882 | 28.59/0.858
PReNet [9] | 24.81/0.851 | 26.77/0.858 | 32.44/0.950 | 31.75/0.916 | 31.36/0.911 | 29.43/0.897
MSPFN [14] | 27.50/0.876 | 28.66/0.860 | 32.40/0.933 | 32.82/0.930 | 32.39/0.916 | 30.75/0.903
MPRNet [34] | 30.27/0.897 | 30.41/0.890 | 36.40/0.965 | 33.64/0.938 | 32.91/0.916 | 32.73/0.921
DGUNet [37] | 30.32/0.899 | 30.66/0.891 | 37.42/0.969 | 33.68/0.938 | 33.23/0.920 | 33.06/0.923
KiT [38] | 30.26/0.904 | 30.47/0.897 | 36.65/0.969 | 33.85/0.941 | 32.81/0.918 | 32.81/0.926
Uformer [39] | 29.90/0.906 | 30.31/0.900 | 36.86/0.972 | 33.53/0.939 | 29.45/0.903 | 32.01/0.924
IDT [15] | 29.69/0.905 | 29.95/0.898 | 37.01/0.971 | 33.38/0.937 | 31.38/0.908 | 32.28/0.924
Ours | 31.23/0.923 | 30.79/0.902 | 38.01/0.975 | 33.87/0.942 | 33.24/0.925 | 33.43/0.933
Table 2. Comparison of quantitative results on SPA-Data benchmark dataset. Bold and underline indicate the best and second-best results, respectively.
Method | PSNR | SSIM
DSC [6] | 34.95 | 0.9416
GMM [5] | 34.30 | 0.9428
DDN [25] | 36.16 | 0.9457
RESCAN [8] | 38.11 | 0.9707
PReNet [9] | 40.16 | 0.9816
MSPFN [14] | 43.43 | 0.9843
RCDNet [33] | 43.36 | 0.9831
MPRNet [34] | 43.64 | 0.9844
DualGCN [35] | 44.18 | 0.9902
SPDNet [36] | 43.20 | 0.9871
Uformer [39] | 46.13 | 0.9913
Restormer [12] | 47.98 | 0.9921
IDT [15] | 47.35 | 0.9930
DRSformer [10] | 48.53 | 0.9924
Ours | 48.82 | 0.9954
Table 3. Model complexity comparisons with state-of-the-art methods are presented. “#FLOPs” and “#Params” denote FLOPs (in G) and the number of trainable parameters (in M), respectively.
Method | MSPFN [14] | Uformer [39] | Restormer [12] | IDT [15] | DRSformer [10] | Ours
#FLOPs (G) | 595.5 | 45.9 | 174.7 | 61.9 | 242.9 | 68.1
#Params (M) | 13.35 | 50.88 | 26.12 | 16.41 | 33.65 | 12.86
Table 4. Ablation study of the main components. Bold indicates the best results.
Model | LaMoE | GaSSM | S/P | Channel Attention | PSNR | SSIM
(a) | ✓ | – | – | ✓ | 30.69 | 0.905
(b) | – | ✓ | – | ✓ | 30.81 | 0.912
(c) | ✓ | ✓ | S | ✓ | 31.14 | 0.918
(d) | ✓ | ✓ | P | ✓ | 31.23 | 0.923
(e) | ✓ | ✓ | P | – | 31.17 | 0.920
Table 5. Ablation study for the number of directional scan paths in the DaSM. Bold indicates the best results.
Path | PSNR | SSIM
One path | 31.08 | 0.913
Two paths | 31.13 | 0.917
Three paths | 31.19 | 0.921
Four paths | 31.23 | 0.923
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
