Technical Note

DLSW-YOLOv8n: A Novel Small Maritime Search and Rescue Object Detection Framework for UAV Images with Deformable Large Kernel Net

by Zhumu Fu 1,2, Yuehao Xiao 1, Fazhan Tao 1,3,*, Pengju Si 1,2 and Longlong Zhu 1,2

1 College of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China
2 Henan Key Laboratory of Robot and Intelligent Systems, Henan University of Science and Technology, Luoyang 471023, China
3 Longmen Laboratory, Luoyang 471000, China
* Author to whom correspondence should be addressed.
Drones 2024, 8(7), 310; https://doi.org/10.3390/drones8070310
Submission received: 22 May 2024 / Revised: 28 June 2024 / Accepted: 5 July 2024 / Published: 9 July 2024
(This article belongs to the Special Issue Advances in Detection, Security, and Communication for UAV)

Abstract

Unmanned aerial vehicle (UAV) maritime search and rescue target detection is susceptible to external factors, which can seriously reduce detection accuracy. To address these challenges, the DLSW-YOLOv8n algorithm is proposed, combining Deformable Large Kernel Net (DL-Net), SPD-Conv, and WIoU. Firstly, to refine the contextual understanding ability of the model, DL-Net is integrated into the C2f module of the backbone network. Secondly, to enhance the representation of small-target features, a space-to-depth layer is used instead of strided convolution and pooling in the convolution module, and an additional detection head is attached to the low-level feature map. The loss function is also improved to strengthen small-target localization. Finally, a UAV maritime target detection dataset is employed to demonstrate the effectiveness of the proposed algorithm; the results show that DLSW-YOLOv8n achieves a detection accuracy of 79.5%, an improvement of 13.1% compared to YOLOv8n.

1. Introduction

The oceans cover over 70% of the Earth’s surface, and maritime transportation is essential to global economic trade [1]. Incidents of maritime disasters are, however, unavoidable. Maritime target detection is crucial in search and rescue (SAR) missions, including shipwrecks and maritime emergencies. It ensures the rapid identification and localization of targets, thereby safeguarding personnel and preventing property damage [2]. In the realm of water search and rescue (SAR), the China Maritime SAR Center (CMSRC), under China’s Ministry of Transportation and Communications (MOTC), conducted 1824 SAR operations in 2021. These efforts rescued 11,761 out of 12,258 people in distress, underscoring the importance of rapid target identification and localization at sea. Significant progress has been made in the field of computer vision for unmanned water vehicles [3,4], which enables obstacle avoidance on water [5,6]. However, autonomy is still compromised due to environmental limitations. UAVs offer flexibility, portability, and aerial accessibility, making them invaluable for maritime target detection [7]. The rapid evolution and widespread adoption of UAV technology have significantly enhanced its role in maritime SAR missions.
Previously, UAVs often relied on manual operation to accomplish specific tasks, which constrained the development of their applications. However, with the rapid development of AI and 5G technologies, UAVs can now autonomously detect targets using advanced detection technology, leading to smarter flight and mission execution. Conventional target detection approaches often depend on handcrafted features and classical machine learning algorithms such as Support Vector Machine (SVM) or Random Forest [8]. These methods may lack the flexibility required to address complex backgrounds and target variations, and their adaptability to diverse scenes and targets may be limited. Additionally, traditional methods often encounter performance bottlenecks, particularly when trained and tested on large-scale datasets. Consequently, their detection accuracy and speed may not meet practical requirements. With advancements in computer hardware technology, deep learning technology has gradually matured. This progress has led to its increasing integration into various UAV applications, including hazardous rescue [9], agricultural inspections [10], and land mapping [11]. Target detection based on deep learning plays a significant role in computer vision and serves as the key technical driver for UAV development.
Deep learning-based target detectors are categorized into two-stage and single-stage. Common two-stage target detectors include RCNN [12], SPP-Net [13], Fast R-CNN [14], and Faster R-CNN [15]. The two-stage target detector, represented mainly by RCNN, first generates a set of potential candidate regions containing targets and then builds a classification network within these regions to determine the target category. The single-stage target detector, represented by YOLO [16,17] and SSD [18], directly classifies and regresses the detected targets. The former exhibits high accuracy but is not capable of real-time detection, while the latter is slightly less accurate but offers faster detection. For maritime rescue, YOLO is advantageous because it combines good accuracy with real-time detection across different object classes.
Maritime target detection, a crucial aspect of target detection, has made significant research advancements. Several studies have introduced maritime search and rescue (SAR) target detection models that show significant potential. However, due to the low resolution of maritime targets in UAV images, extracting their characteristics and location information remains challenging. Therefore, UAV maritime target detection still needs more research. Ma et al. [19] employed an enhanced visual attention model (RVAM) and integrated a nonlocal dissimilarity measure for clutter suppression and target enhancement, thus bolstering the robustness of infrared target detection at sea. Liu et al. [20] implemented an inverse depth-separable convolutional algorithm to enhance the backbone network of YOLOv4, effectively balancing speed and accuracy for real-time environment sensing tasks in unmanned surface vehicles. Li et al. [21] integrated a Transformer into the backbone of YOLOv5 to augment the feature extraction capability by employing a simple linear transformation instead of convolution operation to reduce computational costs.
However, maritime search and rescue remains relatively underexplored and presents numerous challenges. Sambolek et al. [22] conducted a comparative analysis of detection outcomes by assessing the reliability of various algorithms on a self-compiled SARD dataset. Their findings suggest that YOLOv4 exhibits potential applicability to SAR missions. Zhang et al. [23] successfully improved UAV-based localization of individuals in distress at sea during search and rescue missions through the integration of the AFPN and the BiFormer module. Bai et al. [24] incorporated Ghost and C3Ghost into YOLOv5, leveraging a lightweighting algorithm to create a model more suitable for UAV deployment. Zhu et al. [25] introduced advanced modules and algorithms into YOLOv7 to enhance its ability to perceive small-target features, addressing the issue of insufficient high-level semantic features in small target detection in YOLOv7.
Despite the advancements achieved by current methods, several persistent issues remain unresolved, including target rotation angles, variations in target scale, and detection of small or occluded targets. In maritime rescue target detection, the ocean environment introduces numerous factors that significantly affect detection performance, including glare, water reflections, waves, and refraction. Additionally, humans and lifesaving equipment often appear as small or occluded objects in the UAV’s view, severely compromising the accuracy of detection algorithms and leading to frequent missed detections and false positives. Moreover, UAV onboard systems require algorithms to have high computational efficiency to ensure real-time performance. To address these challenges, we develop DLSW-YOLOv8n based on the YOLOv8n framework. This algorithm offers high detection accuracy and requires fewer computational resources, making it ideal for use in resource-constrained environments. The main contributions are as follows:
  • Refining the contextual understanding ability of the model. The Deformable Large Kernel Net (DL-Net) is integrated into the C2f module of the backbone network, which enlarges the receptive field through a large-kernel convolutional selection mechanism to effectively extract the spatial information of small targets. To prevent the loss of target boundary information caused by a fixed receptive field, deformable convolution is used to enhance the target's boundary representation.
  • Enhancing small-target feature representation. Small-target detection heads are introduced into the shallow layers of the backbone network to compensate for insufficient small-target information in the high-level feature maps. At the same time, the space-to-depth layer is used instead of the strided convolution and pooling layers to prevent the loss of fine-grained information.
  • Improving small-target localization performance. The Wise IoU v3 loss function is adopted to improve the accuracy of small-target detection; by dynamically adjusting the weight coefficients of the loss function, different pixel regions are balanced to improve small-target localization performance.

2. Model and Proposed Method

2.1. Improvement Strategies

In this paper, a UAV maritime search and rescue (SAR) target detection model is developed that prioritizes detection accuracy and speed, focusing on three key aspects. Firstly, to effectively enlarge the model's receptive field, the proposed DL-Net is integrated into the C2f module of YOLOv8 for UAV aerial images to overcome the obstacle of insufficient features due to the small size of the target. Large kernel convolution is adopted to enhance the context information extraction capability of the model, and deformable convolution is utilized to improve the representation of target boundary information. Secondly, to enhance the small-target feature representation, the otherwise unused shallow C2f feature map is exploited by introducing a detection head specialized for the low-level feature map. Additionally, in the backbone part, the SPD layer replaces the traditional pooling layer or strided convolution in the CNN to reduce the loss of information about small targets during the downsampling process. Finally, the Wise IoU loss function is adopted instead of the CIoU loss function, and a dynamic non-monotonic focusing mechanism is implemented so that the detector can fully consider anchor boxes of different quality.

2.1.1. DL-Net

Small targets often occupy fewer pixels in an image, and using large convolution kernels can help reduce information loss during convolution. However, introducing excessive contextual information may blur target details and increase computational complexity. Large kernel convolution, as introduced in [26], aims to extract spatial information of targets by weighting and spatially merging features processed through spatial convolution with large kernels. The weights of different features are dynamically adjusted based on inputs, enabling the adaptive use of different large kernels.
To balance model complexity and computational efficiency, the large kernel convolution is decomposed into a sequence of depth-wise convolutions with growing kernel sizes and dilation rates. An upper limit on the dilation rate is set to prevent excessive expansion of the receptive field, which could result in missing information between feature maps. However, the fixed shape of the receptive field in large kernel convolution restricts its ability to adapt to significant changes in target size, leading to the loss of boundary information in the feature map. As a result, it may struggle to capture multi-scale features of targets with significant variations in shape and size. We therefore introduce deformable convolution into the large kernel convolution.
The deformable convolution [27] introduces a deformable sampling grid, which is derived from the original regular grid by incorporating learned offsets. This allows the sampling points of the convolution kernel to adjust dynamically based on the features of the input data at different locations and achieves an adaptive receptive field for convolution. The detailed structure of the deformable convolution is depicted in Figure 1. For instance, we define a convolution with a kernel size of 3 × 3 and a dilation rate of 1. Under these conditions, the following applies:
R = \{(-1,-1), (-1,0), \ldots, (0,1), (1,1)\}
For each sample point location, the output feature map of the conventional convolution is:
y(p^{*}) = \sum_{p_a \in R} w(p_a) \cdot x(p^{*} + p_a)
In deformable convolution, the positions in the regular grid R are augmented with an offset set \{\Delta p_a \mid a = 1, \ldots, N\}, where N = |R|. For each position p^{*}:
y(p^{*}) = \sum_{p_a \in R} w(p_a) \cdot x(p^{*} + p_a + \Delta p_a)
\Delta p_a denotes the position offset, which is usually fractional and thus does not coincide with an integer grid point of the feature map; deformable convolution therefore uses bilinear interpolation to realize the learnable position offsets of the sampled points.
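As a concrete illustration, the following PyTorch sketch builds a 3 × 3 deformable convolution in which the offsets \Delta p_a are predicted from the input itself; torchvision's deform_conv2d then performs the bilinear interpolation described above. The channel sizes, initialization, and input resolution are illustrative assumptions rather than the exact DL-Net configuration.

# Minimal sketch of a 3x3 deformable convolution (assumed sizes, not the DL-Net config).
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformConv3x3(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # two offsets (dx, dy) for each of the 3 x 3 = 9 sampling points p_a
        self.offset_pred = nn.Conv2d(in_ch, 2 * 3 * 3, kernel_size=3, padding=1)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch))

    def forward(self, x):
        offset = self.offset_pred(x)  # learned, usually fractional offsets
        # deform_conv2d samples x at p* + p_a + offset via bilinear interpolation
        return deform_conv2d(x, offset, self.weight, self.bias, padding=1)

feat = torch.randn(1, 64, 80, 80)         # e.g. a shallow backbone feature map
print(DeformConv3x3(64, 64)(feat).shape)  # torch.Size([1, 64, 80, 80])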
The DL-Block initially samples the input features using deformable convolution to extract the size information of the target and obtains the output feature map X. Simultaneously, the input features undergo convolution sampling with varying kernel sizes, and the features with different receptive-field ranges, extracted by these different kernels, are concatenated:
\tilde{U} = [\tilde{U}_1; \cdots; \tilde{U}_N]
To comprehensively consider the different features within each channel, channel-wise average pooling and max pooling are applied to \tilde{U} to efficiently extract spatial relationships:
SA_{avg} = P_{avg}(\tilde{U}), \quad SA_{max} = P_{max}(\tilde{U})
Combine the pooled features and utilize convolutional layers to convert the features (originally 2 channels) into N spatial attention maps:
\widehat{SA} = \mathcal{F}^{2 \rightarrow N}([SA_{avg}; SA_{max}])
Generate spatial selection masks for each decomposed large kernel using a sigmoid activation function, guiding the model’s focus to regions of interest in the input:
\widetilde{SA}_i = \sigma(\widehat{SA}_i)
Weight the features of the decomposed large kernel sequence using each spatial selection mask from (4), and fuse them using convolutional layers to obtain the attention feature map S:
S = \mathcal{F}\left( \sum_{i=1}^{N} \widetilde{SA}_i \cdot \tilde{U}_i \right)
The redundant information in the attention feature map S is filtered using a 3 × 3 convolution to obtain S'. The final output is the element-wise product of the feature map X and the filtered attention feature map S':
Y = X \cdot S'
The fixed shape of the receptive field in large kernel convolution limits its adaptability to different data patterns. To address this limitation, the DL-Block integrates large kernel convolution and deformable convolution, enhancing the representation of the target and improving target boundary definition, as illustrated in Figure 2.
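For clarity, the sketch below outlines the spatial-selection path of the DL-Block in simplified PyTorch: two decomposed large-kernel depthwise branches stand in for the kernel sequence, channel-wise average and max pooling produce the selection masks, and the deformable-convolution branch is omitted for brevity. Kernel sizes, dilation, and channel widths are illustrative assumptions rather than the authors' exact settings.

# Simplified spatial selection over N = 2 decomposed large-kernel branches (assumed sizes).
import torch
import torch.nn as nn

class LargeKernelSelect(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.dw5 = nn.Conv2d(ch, ch, 5, padding=2, groups=ch)               # smaller receptive field
        self.dw7d = nn.Conv2d(ch, ch, 7, padding=9, dilation=3, groups=ch)  # enlarged receptive field
        self.to_masks = nn.Conv2d(2, 2, 7, padding=3)   # 2 pooled maps -> N spatial attention maps
        self.fuse = nn.Conv2d(ch, ch, 1)                # fuses the weighted branches into S
        self.filt = nn.Conv2d(ch, ch, 3, padding=1)     # 3x3 filtering of S to obtain S'

    def forward(self, x):
        u1 = self.dw5(x)                                # decomposed large-kernel features
        u2 = self.dw7d(u1)
        cat = torch.cat([u1, u2], dim=1)
        sa = torch.cat([cat.mean(1, keepdim=True), cat.amax(1, keepdim=True)], dim=1)  # avg/max pool
        masks = torch.sigmoid(self.to_masks(sa))        # spatial selection masks
        s = self.fuse(masks[:, 0:1] * u1 + masks[:, 1:2] * u2)
        return x * self.filt(s)                         # Y = X * S'

print(LargeKernelSelect(64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])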

2.1.2. SPD-Conv

In UAV maritime target detection, targets appear as small objects when observed from the UAV's perspective, typically occupying a low proportion of pixels. The limited information provided by these small targets can allow larger objects to dominate the learning process, resulting in missed or misclassified small targets. Although conventional convolutional neural networks (CNNs) employ strided convolutions or pooling layers to filter redundant information in standard object detection tasks, they may inadvertently discard fine-grained details crucial for detecting small targets, because small-target features contain little redundancy to begin with.
To tackle these challenges, we incorporate SPD-Conv [28] into the feature extraction network of YOLOv8. SPD-Conv introduces an SPD structure that replaces the traditional strided convolution and pooling layers in CNNs, consisting of a Space-to-Depth (SPD) layer followed by a non-strided convolution layer. Given an intermediate feature map X of size W × W × C_1 and a scale factor Z, the SPD layer partitions X into Z^2 sub-feature maps, where each sub-feature map f_{x,y} is obtained by sampling the spatial dimensions of the original feature map with stride Z and x, y are the indices of the partition:
f_{0,0} = X[0:W:Z, 0:W:Z], \quad f_{1,0} = X[1:W:Z, 0:W:Z], \quad \ldots, \quad f_{Z-1,0} = X[Z-1:W:Z, 0:W:Z],
f_{0,1} = X[0:W:Z, 1:W:Z], \quad f_{1,1} = X[1:W:Z, 1:W:Z], \quad \ldots, \quad f_{Z-1,1} = X[Z-1:W:Z, 1:W:Z],
\ldots
f_{0,Z-1} = X[0:W:Z, Z-1:W:Z], \quad f_{1,Z-1} = X[1:W:Z, Z-1:W:Z], \quad \ldots, \quad f_{Z-1,Z-1} = X[Z-1:W:Z, Z-1:W:Z].
Each sub-feature map has size (W/Z, W/Z, C_1). Following decomposition, the sub-feature maps are concatenated along the channel dimension to form an intermediate feature map X' of size (W/Z, W/Z, Z^2 C_1). When Z = 2, the SPD layer splits the feature map into four sub-feature maps f_{0,0}, f_{0,1}, f_{1,0}, and f_{1,1}; Figure 3 illustrates the processing of the feature map by the SPD-Conv module in this case. After the SPD transformation, we apply a non-strided convolution layer with C_2 filters, where C_2 < Z^2 C_1, which transforms the feature map X' (W/Z, W/Z, Z^2 C_1) with a larger number of channels into a feature map X'' (W/Z, W/Z, C_2) with a smaller number of channels. Introducing SPD-Conv into the YOLOv8n backbone network allows us to maximize the retention of mission-critical features, and we opt for non-strided convolution to prevent excessive compression of the feature maps, thereby preserving as much information and as many discriminative feature details as possible.
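A minimal PyTorch sketch of this space-to-depth step followed by a non-strided convolution is given below, assuming Z = 2 and illustrative channel sizes; it mirrors the partition, concatenation, and convolution steps above but is not the authors' exact implementation.

# Hedged sketch of SPD-Conv: space-to-depth with scale Z, then a stride-1 convolution.
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    def __init__(self, c1, c2, z=2):
        super().__init__()
        self.z = z
        # non-strided convolution reducing Z^2 * C1 channels to C2 < Z^2 * C1
        self.conv = nn.Conv2d(c1 * z * z, c2, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        z = self.z
        # f_{x,y} = X[x::Z, y::Z]: every pixel is kept, none are discarded
        subs = [x[..., i::z, j::z] for i in range(z) for j in range(z)]
        x = torch.cat(subs, dim=1)   # (B, Z^2*C1, W/Z, W/Z)
        return self.conv(x)          # (B, C2, W/Z, W/Z)

print(SPDConv(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 128, 40, 40])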

2.1.3. Loss Function

The conventional method for evaluating object detection models is through Intersection over Union (IoU) computation. This entails computing the ratio of the intersection area between predicted bounding boxes and ground truth bounding boxes to the total area encompassed by both. However, this method has limitations in dealing with smaller objects, as they are given less weight in the computation and are more prone to be overlooked by the model. In particular, UAV images, such as those in our dataset, often contain low-quality examples. To mitigate this issue, Wise IoU [29] proposed a dynamically adjusted IoU loss function that considers factors such as object size, occlusion, and background complexity. By introducing dynamic adjustment factors, Wise IoU can more accurately evaluate the consistency between object detection results and ground truth, particularly excelling in handling small objects or complex environments. The Wise IoU family comprises WIoU v1, which integrates an attention-based bounding box loss, and WIoU v2 and WIoU v3, which further refine performance with monotonic and non-monotonic focusing mechanisms based on gradient gain.
WIoU v1 introduces distance as a metric for attention. This is achieved by reducing geometric metric penalties when there is an overlap within a certain range between the target and predicted bounding boxes. Consequently, the model exhibits improved generalization ability, thereby adapting more effectively to various shapes and positions of objects. The calculation formula for WIoU v1 is as follows:
\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU} \, \mathcal{L}_{IoU}, \quad \mathcal{R}_{WIoU} = \exp\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left( W_g^2 + H_g^2 \right)^{*}} \right)
In addition to WIoU v1, WIoU v2 introduces a monotonic focusing coefficient based on L_{IoU}^{*} to reduce the weight of easy examples in the loss function. However, because L_{IoU}^{*} decreases as L_{IoU} decreases, the gradient gain shrinks and convergence slows in the later stages of training. To overcome this challenge, L_{IoU}^{*} is normalized by its running mean \overline{L_{IoU}}, which maintains a better balance between the weights of easy and hard examples during training, thereby enhancing the training efficiency and performance of the model. The computation formula for WIoU v2 is as follows:
\mathcal{L}_{WIoUv2} = \left( \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \right)^{\gamma} \mathcal{L}_{WIoUv1}, \quad \gamma > 0
WIoU v3 introduces outlier-awareness to measure anchor box quality, applying it to WIoU v1 in the form of a non-monotonic focusing factor r. The selection of this focusing factor is based on the discrete distribution of quality, where low-quality anchor boxes receive larger focusing factors to mitigate their impact on the loss function, while high-quality anchor boxes receive smaller focusing factors to exert a stronger influence on the loss function. In WIoU v3, a sophisticated gradient gain allocation strategy is employed, dynamically adjusting weights of high and low-quality anchor boxes in the loss function to prioritize average-quality samples, thereby boosting overall performance. The computation formula for WIoU v3 is as follows:
\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)
\mathcal{L}_{WIoUv3} = r \, \mathcal{L}_{WIoUv1}, \quad r = \frac{\beta}{\delta \, \alpha^{\beta - \delta}}
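The sketch below assembles the WIoU v1 and v3 formulas above for axis-aligned boxes given as (cx, cy, w, h); the running mean of L_IoU, the box format, and the alpha and delta values are assumptions for illustration, following the Wise-IoU formulation rather than the authors' training code.

# Hedged sketch of WIoU v3 for boxes in (cx, cy, w, h) format; hyper-parameters are assumed.
import torch

def wiou_v3(pred, gt, iou_mean, alpha=1.9, delta=3.0):
    px, py, pw, ph = pred.unbind(-1)
    gx, gy, gw, gh = gt.unbind(-1)
    lt = torch.max(torch.stack([px - pw / 2, py - ph / 2], -1),
                   torch.stack([gx - gw / 2, gy - gh / 2], -1))
    rb = torch.min(torch.stack([px + pw / 2, py + ph / 2], -1),
                   torch.stack([gx + gw / 2, gy + gh / 2], -1))
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    union = pw * ph + gw * gh - inter
    l_iou = 1.0 - inter / union.clamp(min=1e-7)

    # smallest enclosing box (W_g, H_g); the '*' superscript means detached from gradients
    wg = torch.max(px + pw / 2, gx + gw / 2) - torch.min(px - pw / 2, gx - gw / 2)
    hg = torch.max(py + ph / 2, gy + gh / 2) - torch.min(py - ph / 2, gy - gh / 2)
    r_wiou = torch.exp(((px - gx) ** 2 + (py - gy) ** 2) / (wg ** 2 + hg ** 2).detach())
    l_v1 = r_wiou * l_iou                         # WIoU v1

    beta = l_iou.detach() / iou_mean              # outlier degree of each anchor box
    r = beta / (delta * alpha ** (beta - delta))  # non-monotonic focusing factor
    return (r * l_v1).mean()

pred = torch.tensor([[50.0, 50.0, 20.0, 20.0]])
gt = torch.tensor([[52.0, 49.0, 22.0, 18.0]])
print(wiou_v3(pred, gt, iou_mean=torch.tensor(0.35)))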

2.2. Introduction to the YOLOv8 Algorithm

In this paper, YOLOv8n is employed as the baseline model, and its structure is shown in Figure 4. YOLOv8 is an enhanced target detection model. It combines the advantages of its previous versions and introduces a number of new improvements and optimizations, thereby ensuring detection accuracy while simultaneously reducing the computational cost. The structure of YOLOv8 remains consistent with other YOLO algorithms and comprises three parts: Backbone, Neck, and Head. The input image undergoes enhancement through a series of advanced data augmentation techniques, including random cropping, random scaling, random flipping, random rotation, and mosaic enhancement. These techniques enable the model to adapt to different scenarios and conditions by applying various transformations to the input images during the training process. This improves the generalization ability and robustness of the model.
The backbone is responsible for extracting the features of the image. In the backbone, the feature map is downsampled five times using 3 × 3 convolutions, and the five resulting feature maps are denoted P1–P5. After each downsampling step, the spatial size of the feature map is halved. With this change in feature map size, the resolution of small targets in UAV images decreases, which seriously affects the effectiveness of the UAV target detection algorithm. An SPPF module is utilized at the end of the backbone to perform further multiscale processing on the base features, and the processed features are subsequently passed to the neck section for fusion. In the neck section, YOLOv8 fuses the feature maps P3 to P5 through top-down paths and lateral connections, enhancing the detection of objects of different sizes in the UAV image.
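The quick arithmetic below (with an assumed 640 × 640 input and a 16-pixel-wide target, roughly the footprint of a swimmer in these images) illustrates why this repeated halving is problematic for small maritime targets and why an extra shallow detection head helps.

# Illustrative arithmetic only: assumed input and target sizes, strides follow the P1-P5 pyramid.
img, target = 640, 16
for level in range(1, 6):                  # P1 ... P5 with strides 2, 4, 8, 16, 32
    stride = 2 ** level
    print(f"P{level}: {img // stride}x{img // stride} map, "
          f"target spans {target / stride:.2f} cells")
# At P5 (stride 32) the 16-pixel target covers only half a cell, so most of its
# evidence must come from the shallower P2/P3 maps.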

2.3. Improved Network Structure DLSW-YOLOv8n

Figure 5 illustrates our DLSW-YOLOv8n model. We enhance the feature extraction capability of the model by integrating the DL-Block into the C2f module within the YOLOv8 backbone network. After each downsampling operation, the spatial size of the feature map decreases, leading to insufficient information about small targets relative to the feature map size due to their small spatial footprint in the original image. Consequently, for UAV water rescue target detection, DLSW-YOLOv8n primarily enhances the capabilities of layers P2 and P3. Initially, a separate detection head is added for the P2 layer, primarily intended for detecting small targets. The downsampling convolution module in layers P2, P3, P4, and P5 is upgraded to SPD-Conv, effectively mitigating the issue of information loss during downsampling, thereby enhancing detection accuracy while reducing parameter count. Additionally, to balance computational cost and detection accuracy for small targets, the pivotal C2f module is substituted with the proposed DL-Bottleneck-C2f (DL-C2f) module. The DL-C2f module is exclusively introduced at layers P2 and P3 to enhance the backbone's feature extraction capability. Furthermore, the WIoU loss can dynamically adjust the IoU value during loss calculation at the detection end. This allows for flexible adaptation to targets of varying sizes and positions. As a result, it effectively addresses issues of large localization errors and missed detections.
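As a compact summary, the mapping below sketches where each modification sits in DLSW-YOLOv8n; the per-level assignments and strides are inferred from the description above, not copied from the authors' configuration files.

# Hedged summary of the modifications described above (inferred, not the official config).
dlsw_changes = {
    "P2": {"downsample": "SPD-Conv", "block": "DL-C2f", "detect_head": True, "stride": 4},
    "P3": {"downsample": "SPD-Conv", "block": "DL-C2f", "detect_head": True, "stride": 8},
    "P4": {"downsample": "SPD-Conv", "block": "C2f",    "detect_head": True, "stride": 16},
    "P5": {"downsample": "SPD-Conv", "block": "C2f",    "detect_head": True, "stride": 32},
    "loss": "WIoU v3 in place of CIoU",
}
for level, cfg in dlsw_changes.items():
    print(level, cfg)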

3. Experiment

3.1. Dataset and Experimental Environment

The SDS ODv2 dataset [30], dedicated to maritime search and rescue targets, encompasses diverse objects adrift on the water's surface. It comprises 14,227 images (8930 for training, 1547 for validation, and 3750 for testing) and 39,991 target objects. Each image in the dataset is meticulously annotated with category labels, including Swimmer, Boat, Jetski, Buoy, and Life-saving appliance (life vest/belt), totaling five distinct classes. The size of the detected targets in the images poses a significant challenge to the maritime SAR detection mission, as exemplified in Figure 6a. Additionally, adverse weather conditions affect image quality, with differences in illumination and brightness leading to variations, as illustrated in Figure 6b. Furthermore, variations in seawater ripples can interfere with the detector's judgment of the target's characteristics, as depicted in Figure 6c.
The experimental setup utilized an environment consisting of the Ubuntu 20.04 operating system, an NVIDIA RTX 2080ti GPU with 11 GB of memory, an Intel Xeon E5-2689 CPU running at 3.1 GHz, CUDA version 11.3, and the PyTorch 1.12.0 software framework.

3.2. Experimental Result

3.2.1. Detection Accuracy of DLSW-YOLOv8n

Because small objects dominate the SDS ODv2 maritime target detection dataset, the performance of YOLOv8n in detecting small targets is unsatisfactory, with a mAP50 of only 0.664. We therefore label the model with the added detection head as YOLOv8n-small, which achieves a detection accuracy of 0.744. Leveraging this as the baseline model, the enhanced DLSW-YOLOv8n achieves a detection accuracy of 0.795, as shown in Table 1. Concurrently, we test the real-time performance of DLSW-YOLOv8n. The experiments show that DLSW-YOLOv8n achieves an average single-image detection time of 23 ms, corresponding to a frame rate of 42 FPS, which fully satisfies real-time requirements.

3.2.2. Comparative Analysis of Various Loss Functions in Experiments

To assess the efficacy of various loss functions in our experiment, we compared EIoU [31], GIoU [32], DIoU [33], and WIoU. As illustrated in Figure 7 and Table 2, only when Gamma = 0.5 does the WIoU loss function demonstrate significantly superior performance compared to EIoU, SIoU, and DIoU for this dataset, while slightly outperforming GIoU. Furthermore, compared to the CIoU loss function used by default in YOLOv8n, the detection accuracy improved by 0.02 after training for 300 epochs. We also compare the localization and classification losses of YOLOv8n-small with each of the four loss functions added. The experimental results show that employing the WIoU loss function in the YOLOv8 model can mitigate the localization challenge for small targets to a certain extent, confirming the effectiveness of WIoU in easing the difficulty of localizing small targets.

3.2.3. Comparing Different Gamma Values

We observed that the value of Gamma affects the WIoU loss function differently due to variations in the sizes of detected objects in each image of the dataset. In order to adapt to the size changes of detected objects, we conducted experiments with different Gamma values. As shown in Table 3 and Figure 8, when Gamma = 0.3, WIoU yields the greatest improvement for the model. Through experimental analysis, we found that the value of Gamma has a noticeable impact on the detection accuracy of the model.

3.2.4. Ablation Experiments

Four ablation experiments are conducted to evaluate the effectiveness of the proposed enhancements to YOLOv8n. The results are summarized in Table 4 and visually presented in Figure 9. "FLOPs" denotes the computational burden of the model, while "Params" refers to the model's parameter count. In the initial experiment, we improved the YOLOv8n model by incorporating a detection head specifically designed for small targets, designated as YOLOv8n-small. Moreover, we modified the quantity of C2f modules in layers P2 and P3 to 9, resulting in an elevation of mAP50 from 0.664 to 0.744. In the subsequent experiment, a DL-Block is introduced following the C2f modules in layers P2 and P3 of YOLOv8n-small, resulting in a 0.025 enhancement in detection accuracy. In the third experiment, the downsampling convolutions in layers P2-P5 are substituted with SPD-Conv, leading to a decrease in FLOPs and a 0.014 enhancement in detection accuracy. Lastly, the WIoU loss function is integrated into the preceding experiments, yielding a further 0.012 improvement in detection accuracy and bringing the total gain over the original YOLOv8n algorithm to 0.131.

3.2.5. Experimental Comparisons with Mainstream Algorithms

To demonstrate the effectiveness of our proposed method, we evaluated it against four classical object detection algorithms, YOLOv5, YOLOv7 [34], Faster R-CNN, and YOLOv8s, on the ocean dataset. The experimental results are presented in Table 5. As observed from the table, the detection accuracy of our DLSW-YOLOv8n outperforms the two-stage target detection algorithm Faster R-CNN by 0.013 and also surpasses the common single-stage target detection algorithms. It is worth mentioning that DLSW-YOLOv8n has the lowest computational cost among the algorithms in the table and is easier to deploy on UAVs, while its detection time remains as fast as 23 ms per image.

3.2.6. Algorithm Analysis

To validate the small target feature extraction capability of DL-Net and the performance of SPD-Conv in reducing fine-grained information loss, we visualize the heatmaps of YOLOv8n and DLSW-YOLOv8n using the Gradient-weighted Class Activation Mapping (Grad-CAM) method. This algorithm provides valuable insights into whether the network is learning the correct features by using the class gradient to examine the network’s region of interest for a particular class.
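A minimal sketch of this procedure in plain PyTorch is shown below: activations and gradients of a chosen layer are captured with hooks, the gradients are global-average-pooled into channel weights, and the weighted activations give the class activation map. The model, hooked layer, and scalar score function are placeholders; the exact layers and targets behind Figures 10 and 11 are those described in the text.

# Hedged Grad-CAM sketch with forward/backward hooks; model, layer and score_fn are placeholders.
import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, score_fn):
    feats, grads = {}, {}
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    score = score_fn(model(image))      # e.g. summed confidence for one class
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients -> channel weights
    cam = F.relu((weights * feats["a"]).sum(dim=1))       # weighted sum of activations
    return cam / cam.amax().clamp(min=1e-7)               # normalised map, upsample for overlay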
Since our DL-Net focuses on the shallow layers of YOLOv8, we test YOLOv8n and DLSW-YOLOv8n on the SDS ODv2 dataset using the feature maps of layers 1 to 4. The Grad-CAM images are shown in Figure 10. The experimental results show that YOLOv8 only provides good attentional coverage for large targets like boats and lacks sufficient coverage for the people category. However, in maritime search and rescue target detection, detecting the people category is the most critical. Our DLSW-YOLOv8n achieves full attention coverage of the people category from the second layer onward. Although our model is also disturbed by seawater ripples, it covers a much larger portion of the target. Additionally, the detection target and its surrounding areas in the heat map are brighter and more focused.
To demonstrate the feature extraction capability of DLSW-YOLOv8n for detecting small targets in a UAV maritime SAR scenario, Figure 11 shows the Grad-CAM images of the last layer of the YOLOv8n and DLSW-YOLOv8n backbone networks, where brighter regions indicate the areas that receive more attention from the networks. In order to perform a comprehensive comparative analysis, we chose images of detected objects of different categories and sizes. The experimental results show that, compared with the baseline model, our DLSW-YOLOv8n improves the visualization results on the three classes of boats, swimmers, and jetskis by 0.24, 0.13, and 0.2, respectively. It is worth mentioning that DLSW-YOLOv8n is relatively less affected by the environment.
Figure 12 demonstrates the impressive detection capability of DLSW-YOLOv8n in marine environments with varying seawater ripples, light intensities, and different viewing angles. Our algorithm does not miss or misdetect due to the changing environment. The performance of the detector for identifying people and life-saving equipment is significantly improved.

4. Conclusions

This paper introduces a lightweight target detector, DLSW-YOLOv8n, designed to address the challenge of balancing detection accuracy and computational cost in UAV maritime SAR target detection. To address the missed and false detections of small targets in complex marine environments, our DLSW-YOLOv8n integrates DL-Net into the feature extraction backbone network to efficiently extract target spatial information. Additionally, DL-Net facilitates rapid adaptation to significant changes in target size and enhances the definition of target boundary features. Moreover, replacing the traditional strided convolution and pooling layers with the SPD layer effectively alleviates the loss of small-target feature information during feature map downsampling. Notably, SPD-Conv enhances the algorithm's detection accuracy for small targets while reducing computational overhead. Finally, the Wise IoU loss is employed as the bounding box regression loss. Paired with its dynamic non-monotonic focusing mechanism, this enables the detector to consider anchor boxes of varying quality, thereby enhancing the detector's robustness. Experimental results, including loss function analysis, ablation experiments, and comparisons with mainstream algorithms on marine datasets, demonstrate that the proposed DLSW-YOLOv8n model offers superior detection accuracy, reduced computational cost, and enhanced generalization ability and robustness.
Our approach has shown promising results in UAV maritime search and rescue detection missions. However, it still encounters challenges with variations in lighting and sea surface ripples. The limited on-board computational resources of UAVs pose another urgent challenge. Therefore, in the future, we will focus on optimizing the proposed detection model structure and algorithms using model compression techniques such as pruning and quantization to reduce computation and achieve efficient operation on mobile and embedded devices, facilitating practical applications on UAV platforms. In addition, the model will be extended to other types of UAV imagery, such as land search and rescue and agricultural surveillance, to enhance its broad applicability through adaptive training for different scenarios. To improve detection accuracy and reliability, multi-modal data sources such as thermal imaging, radar, and sonar will be integrated, especially for complex environments and low-visibility conditions.

Author Contributions

Conceptualization, Z.F., Y.X. and F.T.; methodology, Z.F.; software, Y.X. and P.S.; validation, Z.F., Y.X. and F.T.; formal analysis, L.Z.; writing—original draft preparation, L.Z.; writing—review and editing, Y.X. and P.S.; supervision, Z.F. and P.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62301212, 62371182, 62201200), the Program for Science and Technology Innovation Talents in the University of Henan Province (Grant No. 23HASTIT021), the Major Science and Technology Projects of Longmen Laboratory (Grant No. 231100220300), the Scientific and Technological Project of Henan Province (Grant Nos. 242102241063, 242102221025), and the Science and Technology Development Plan of Joint Research Program of Henan (Grant No. 225200810007).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to ownership reasons.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Taylor, I.; Smith, K. United Nations Conference on Trade and Development (UNCTAD); Routledge: London, UK, 2007. [Google Scholar]
  2. Cho, S.W.; Park, H.J.; Lee, H.; Shim, D.H.; Kim, S.Y. Coverage path planning for multiple unmanned aerial vehicles in maritime search and rescue operations. Comput. Ind. Eng. 2021, 161, 107612. [Google Scholar] [CrossRef]
  3. Nunes, D.; Fortuna, J.; Damas, B.; Ventura, R. Real-time vision based obstacle detection in maritime environments. In Proceedings of the 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), Santa Maria da Feira, Portugal, 29–30 April 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 243–248. [Google Scholar]
  4. Nirgudkar, S.; Robinette, P. Beyond visible light: Usage of long wave infrared for object detection in maritime environment. In Proceedings of the 2021 20th International Conference on Advanced Robotics (ICAR), Ljubljana, Slovenia, 6–10 December 2021; pp. 1093–1100. [Google Scholar]
  5. Muhovič, J.; Mandeljc, R.; Bovcon, B.; Kristan, M.; Perš, J. Obstacle tracking for unmanned surface vessels using 3-D point cloud. IEEE J. Ocean. Eng. 2020, 45, 786–798. [Google Scholar] [CrossRef]
  6. Bovcon, B.; Kristan, M. WaSR–A Water Segmentation and Refinement Maritime Obstacle Detection Network. IEEE Trans. Cybern. 2021, 52, 12661–12674. [Google Scholar] [CrossRef] [PubMed]
  7. Lin, L.; Goodrich, M.A. UAV intelligent path planning for wilderness search and rescue. In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, St. Louis, MO, USA, 10–15 October 2009; pp. 709–714. [Google Scholar]
  8. Patel, J.; Bhusnoor, M.; Patel, D.; Mehta, A.; Sainkar, S.; Mehendale, N. Unmanned Aerial Vehicle-Based Forest Fire Detection Systems: A Comprehensive Review. SSRN 2023. [Google Scholar] [CrossRef]
  9. Cao, Z.; Kooistra, L.; Wang, W.; Guo, L.; Valente, J. Real-time object detection based on uav remote sensing: A systematic literature review. Drones 2023, 7, 620. [Google Scholar] [CrossRef]
  10. Zhang, J.; Liu, S.; Chen, Y.; Huang, W. Application of UAV and computer vision in precision agriculture. Comput. Electron. Agric. 2020, 178, 105782. [Google Scholar]
  11. Ke, Y.; Im, J.; Son, Y.; Chun, J. Applications of unmanned aerial vehicle-based remote sensing for environmental monitoring. J. Environ. Manag. 2020, 255, 109878. [Google Scholar]
  12. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23 June 2014; pp. 580–587. [Google Scholar]
  13. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  14. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  15. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26–30 June 2016; pp. 779–788. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  19. Ma, D.; Dong, L.; Xu, W. Detecting infrared maritime dark targets overwhelmed in sunlight interference by dissimilarity and saliency measure. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  20. Liu, T.; Pang, B.; Zhang, L.; Yang, W.; Sun, X. Sea surface object detection algorithm based on YOLO v4 fused with reverse depthwise separable convolution (RDSC) for USV. J. Mar. Sci. Eng. 2021, 9, 753. [Google Scholar] [CrossRef]
  21. Li, Y.; Yuan, H.; Wang, Y. GGT-YOLO: A novel object detection algorithm for drone-based maritime cruising. Drones 2022, 6, 335. [Google Scholar] [CrossRef]
  22. Sambolek, S.; Ivasic-Kos, M. Automatic person detection in search and rescue operations using deep CNN detectors. IEEE Access 2021, 9, 37905–37922. [Google Scholar] [CrossRef]
  23. Zhang, Y.; Yin, Y.; Shao, Z. An Enhanced Target Detection Algorithm for Maritime Search and Rescue Based on Aerial Images. Remote Sens. 2023, 15, 4818. [Google Scholar] [CrossRef]
  24. Bai, J.; Dai, J.; Wang, Z.; Yang, S. A detection method of the rescue targets in the marine casualty based on improved YOLOv5s. Front. Neurorobot. 2022, 16, 1053124. [Google Scholar] [CrossRef] [PubMed]
  25. Zhu, Q.; Ma, K.; Wang, Z.; Shi, P. Yolov7-csaw for maritime target detection. Front. Neurorobot. 2023, 17, 1210470. [Google Scholar] [CrossRef] [PubMed]
  26. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  27. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  28. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. arXiv 2022, arXiv:2208.03641. [Google Scholar]
  29. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  30. Kiefer, B.; Kristan, M.; Perš, J.; Žust, L.; Poiesi, F.; Andrade, F.; Yang, M.T. 1st workshop on maritime computer vision (macvi) 2023: Challenge results. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 265–302. [Google Scholar]
  31. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  32. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 658–666. [Google Scholar]
  33. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  34. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 7464–7475. [Google Scholar]
Figure 1. Deformable Conv Schematic.
Figure 2. DL-Block structure diagram.
Figure 3. The processing of the feature map by the SPD-Conv module when Z = 2.
Figure 4. YOLOv8 structure diagram.
Figure 5. DLSW-YOLOv8 structure diagram.
Figure 6. UAV-captured images in the SDS ODv2 dataset present several challenges: (a) Tiny objects and occlusions, (b) Adverse weather conditions, and (c) Complex environmental factors.
Figure 7. Detection Accuracy Curves for Different Loss Functions.
Figure 8. Detection Accuracy Curves of WIoU Loss Function with Different Gamma values.
Figure 9. Detection Accuracy Curves of Ablation experiments.
Figure 10. Grad-CAM diagram of DL-Net and SPD-Conv.
Figure 11. Grad-CAM diagram of YOLOv8 and DLSW-YOLOv8.
Figure 12. Predicted results of YOLOv8n and DLSW-YOLOv8n.
Table 1. Detection accuracy and speed of DLSW-YOLOv8n.

Model | Parameters (M) | GFLOPs (B) | AP50 | FPS
YOLOv8n | 3.01 | 8.2 | 0.664 | 47
YOLOv8n-small | 2.93 | 12.4 | 0.744 | 44
DLSW-YOLOv8n | 2.76 | 14.6 | 0.795 | 42
Table 2. Comparing different loss functions.

Model (Epoch = 300) | AP50 | Box Loss | Cls Loss
YOLOv8n-small | 0.732 | 1.12 | 0.53
YOLOv8n-small + EIoU | 0.752 | 1.15 | 0.55
YOLOv8n-small + DIoU | 0.733 | 1.17 | 0.57
YOLOv8n-small + GIoU | 0.751 | 1.11 | 0.54
YOLOv8n-small + WIoU | 0.752 | 1.0 | 0.59
Table 3. Detection Accuracy of WIoU Loss Function with Different Gamma Values.

Model (Epoch = 300) | Alpha | Gamma | AP50
YOLOv8n-small + WIoU v3 | 1 | 0.2 | 0.751
YOLOv8n-small + WIoU v3 | 1 | 0.3 | 0.759
YOLOv8n-small + WIoU v3 | 1 | 0.4 | 0.743
YOLOv8n-small + WIoU v3 | 1 | 0.5 | 0.752
Table 4. Ablation experiments.

Model | Parameters (M) | GFLOPs (B) | AP50
YOLOv8n | 3.01 | 8.2 | 0.664
YOLOv8n-small | 2.93 | 12.4 | 0.744
DL-YOLOv8n | 2.94 | 15.8 | 0.769
DLS-YOLOv8n | 2.76 | 14.6 | 0.783
DLSW-YOLOv8n | - | - | 0.795
Table 5. Comparing different algorithms.

Model | GFLOPs (B) | AP50
SODV2 [30] | - | 0.72
YOLOv7-BL [30] | - | 0.72
YOLOv7-TILE [30] | - | 0.71
YOLOv5s | 16.5 | 0.768
YOLOv7-tiny | 13.2 | 0.743
FasterRCNN-ResNet50 | 134.38 | 0.782
YOLOv8s | 38.0 | 0.685
YOLOv8s-small | 28.6 | 0.77
DLSW-YOLOv8n | 14.6 | 0.795
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
