1. Introduction
Steel is an essential raw material in industrial production. Owing to variations in process parameters and production conditions, defects such as scabs, scratches, bubbles, and cracks are generated on the surface of steel. These defects severely affect the strength and corrosion resistance of steel and significantly reduce product quality and economic benefits [1]. Therefore, it is important to conduct research on steel surface defect detection technology. Common methods such as manual inspection, ultrasonic detection, and infrared detection suffer from low detection efficiency, poor accuracy, and high latency, and cannot meet the online, real-time detection requirements of steel production processes [2].
Deep learning defect detection methods are mainly divided into two categories based on their working principles and structures: single-stage and two-stage methods. Two-stage methods first generate candidate regions and then perform target classification and bounding box regression; representative examples are the R-CNN (Region-based Convolutional Neural Networks) series, such as Faster R-CNN [3] and Mask R-CNN [4]. Their disadvantages are the large amount of computation and slow detection speed, which cannot meet the real-time requirements of industrial production. The single-stage method directly predicts the bounding box, position, and category of the target from the input image, with representative algorithms such as SSD (Single Shot MultiBox Detector) [5] and the YOLO (You Only Look Once) series [6] (such as YOLOv5, YOLOv7, YOLOv8, YOLOv10, and YOLOv11), which offer smaller model size, higher detection accuracy, and faster detection speed.
In recent years, with the development of computer vision and deep learning technology, inspection technology in the industrial field has greatly improved. Versaci et al. [7] proposed state-of-the-art fuzzy techniques that aggregate images in a fuzzy sense, producing clusters of images from which a single representative image is extracted per cluster; this approach reduces computational complexity and serves as a measure of the distance between images. Daigo et al. [8] used a deep learning technique based on a pyramid scene parsing network for semantic segmentation, aiming to classify steel scrap by thickness or diameter; the class with thickness or diameter below 3 mm achieved an F-score above 0.9. Zheng et al. [9] created the RSBL dataset for intelligent scrap bundle recognition; an improved MobileNet_V3_Large model trained with transfer learning attained an average test accuracy of 99.8%. Cui et al. [10] built strip steel defect image datasets for salient object detection and proposed a novel autocorrelation-aware aggregation network for steel defect detection, which performs better on their self-made dataset. Yu et al. [11] offered channel attention and bidirectional feature fusion on a fully convolutional one-stage (CABF-FCOS) network, which achieved faster and more effective defect detection in steel strips, with an average accuracy of 76.68% at 18 frames per second. Han et al. [12] introduced a two-stage edge reuse network (TSERNet), consisting of a prediction stage and a refinement stage; the network not only extracts multi-scale features but also generates edge maps, and it outperformed 22 related methods. Tang et al. [13] proposed a CNN-based segmentation model called EfficientU-Net-b3 that combines 3D micro-CT data with 2D element mappings for multimineral segmentation on both an intact complex iron ore sample and the corresponding crushed fragments; experimental results indicated that the fusion model segments better for ore characterization.
Many scholars have applied improved YOLO algorithms in the field of steel surface defect detection. Wu et al. [14] presented an SDD-YOLO model based on YOLOv5s for strip defect detection; the Convolution-GhostNet Hybrid module and Multi-Convolution Feature Fusion block designed for the model decrease computational complexity and raise feature extraction efficiency, and SDD-YOLO achieved a 6.3% increase in mAP50, reaching 76.1% on the NEU-DET dataset. Meng et al. [15] designed the SC-YOLOv5 model, which incorporates coordinate attention into the YOLOv5 network to improve the detection accuracy of metallurgical saw blade defects; the mAP@0.5 of the improved YOLOv5 model was 88.5%. Tao et al. [16] introduced the BottleneckCSP structure and depthwise separable convolution through structural reparameterization in YOLOv5, reducing the amount of computation and the number of parameters while maintaining accuracy. Wang et al. [17] applied the BiFPN structure, which effectively reduces the loss of feature information, and used the ECA attention module to improve the feature learning ability of the YOLOv7 backbone network, enhancing detection speed and accuracy. Gao et al. [18] introduced the CBAM attention mechanism and the SPPFCSPC module to enhance multi-scale feature fusion, increasing the detection accuracy of the original model by 3.3%. Zhang et al. [19] introduced the EP and SPPF-LSKA modules, improving the model's detection accuracy to 78%. Cheng et al. [20] integrated the MobileViTv3 block into the YOLOv10 model, named YOLOv10-vit, for corrosion target detection, enabling corrosion assessment of hydraulic metal structures. Banduka et al. [21] used the YOLOv11 model to detect leather defects; compared with manual inspection, detection accuracy improved significantly. Huang et al. [22] suggested a Neural Swin Transformer-YOLO11 (NST-YOLO11) model that integrates Neural Swin-T and a cross-stage-connected Spatial Pyramid Pooling-Fast (CS-SPPF) module; the integrated network effectively improves the retention of feature information.
However, the above methods still leave room for improvement in detection accuracy and speed for compound defects, small-target defects, and the scale variability of steel plates.
To address the above problems, the following improvements are proposed on the basis of YOLOv8n:
The MLCA mechanism is used in the C2f module of the backbone network to solve the problem of feature information extraction loss due to insufficient receptive field.
GSConv is applied in the neck network to reduce information loss during channel conversion and to reduce the amount of computation; the VoVGscsp module is introduced to achieve cross-layer network aggregation, integrating feature information between different network levels and improving the network's feature perception and fusion ability.
The SA mechanism is applied in the detection network structure to increase attention to small defect targets and enhance the detection ability for small targets.
2. Improved Model Introduction
2.1. YOLOv8 Model
YOLOv8 is an advanced target detection algorithm that has undergone multiple optimizations and improvements based on the frameworks of YOLOv5 and YOLOv7. The algorithm structure is shown in Figure 1 and is mainly divided into four modules: Input, Backbone, Neck, and Head. The input stage uses Mosaic data augmentation, adaptive image scaling, and grayscale filling strategies to pre-process images and enhance the model's generalization ability. The backbone network replaces the C3 module with the C2f module based on the ELAN idea of YOLOv7, which carries richer gradient flow information and reduces the number of convolutional layers before upsampling, making the network model more lightweight. The neck network continues to use the feature pyramid network (FPN) and path aggregation network (PAN) structures. The head network uses a decoupled head structure that separates the classification and detection tasks [23]. The CIoU + VFL combination is used as the loss function in regression.
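As a concrete illustration of the regression term, the sketch below evaluates a CIoU-style loss between a predicted and a ground-truth box using torchvision's `complete_box_iou_loss`; the box coordinates are made-up values for demonstration, and this approximates the loss wiring rather than reproducing YOLOv8's exact implementation.

```python
import torch
from torchvision.ops import complete_box_iou_loss

# Boxes in (x1, y1, x2, y2) format; values are illustrative only.
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
target = torch.tensor([[12.0, 14.0, 48.0, 58.0]])

# CIoU penalizes low overlap, center-point distance, and
# aspect-ratio mismatch in a single differentiable term.
loss = complete_box_iou_loss(pred, target, reduction="mean")
print(loss)  # scalar in [0, 2); 0 means a perfect match
```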
2.2. Improved YOLOv8 Model
To enhance the network model's feature extraction ability and meet detection accuracy and real-time requirements, three improvements are made. First, the MLCA (Mixed Local Channel Attention) [24] structure is applied in the C2f module of the backbone network, which better integrates local, global, spatial, and channel information while reducing the number of model parameters and the amount of computation. Second, the GSConv and VoVGscsp modules [25] are used in the neck network, providing lightweight, cross-layer feature information fusion. Finally, in the detection head, the SA (self-attention) mechanism [26] is applied to improve feature information separation in the output stage and enhance the detection of small targets. The YOLOv8-MGVS model structure is shown in Figure 2.
2.2.1. C2f_MLCA Module
In the backbone network of YOLOv8, the C2f module works in concert with other modules such as the Conv module and Bottleneck to complete the feature extraction task. However, due to its limited local receptive field, the C2f module fails to capture sufficient features in cases of image occlusion, which leads to the loss and aliasing of feature information. Additionally, the complexity of the steel surface background increases the computational complexity of the C2f module in the deep network structure, affecting the operational efficiency and speed of the model. To improve the network's expressive ability and reduce the number and complexity of model parameters, a lightweight C2f_MLCA module is designed. Its structure is shown in Figure 3.
The Mixed Local Channel Attention (MLCA) module is an innovative attention mechanism that significantly increases the network’s ability to identify and capture key features by integrating local and global features, as well as channel and spatial information. First, the input feature map (C, W, H) undergoes local average pooling (LAP) and global average pooling (GAP) in two stages. Local average pooling focuses on extracting detailed features of local areas, while global average pooling is responsible for capturing the global feature information of the entire feature map, both of which provide rich contextual information for subsequent processing.
Second, after the pooling operation, these features undergo a 1D convolution transformation, which compresses the feature channels while maintaining the integrity of the spatial dimension. Subsequently, the features are rearranged. For the local pooled features, the 1D-convolved and rearranged features are multiplied with the original input features; this step acts like feature screening, reinforcing the network's focus on valuable features. The global pooled features, after 1D convolution and rearrangement, are combined with the local pooled features through an addition operation, integrating global context information into the feature map.
Finally, the feature map that integrates local and global attention is restored to the original spatial dimension through the inverse pooling operation. This process not only retains the integrity of the features but also reinforces their expressiveness. Therefore, the design of the MLCA module significantly improves the network’s detection accuracy by combining channel and spatial attention on local and global levels, while maintaining computational efficiency.
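To make the data flow concrete, the following PyTorch sketch implements the steps just described. The 5 × 5 local pooling grid, the shared 1D convolution between the two branches, nearest-neighbor up-sampling as the inverse pooling, and the simplified add-then-multiply fusion order are all assumptions of this sketch rather than details taken from [24].

```python
import torch
import torch.nn as nn

class MLCA(nn.Module):
    """Sketch of Mixed Local Channel Attention: local + global average
    pooling, a shared channel-wise 1D convolution, fusion, and inverse
    pooling back to the input resolution (sizes are assumptions)."""
    def __init__(self, channels, local_size=5, k=3):
        super().__init__()
        self.local_size = local_size
        self.lap = nn.AdaptiveAvgPool2d(local_size)  # local average pooling
        self.gap = nn.AdaptiveAvgPool2d(1)           # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, h, w = x.shape
        g = self.local_size
        # Local branch: pool to (g, g), run the 1D conv across channels.
        local = self.lap(x).flatten(2).transpose(1, 2)          # (B, g*g, C)
        local = self.conv(local.reshape(-1, 1, c)).reshape(b, g * g, c)
        local = local.transpose(1, 2).reshape(b, c, g, g)
        # Global branch: pool to a single vector, same 1D conv.
        glob = self.gap(x).flatten(2).transpose(1, 2)           # (B, 1, C)
        glob = self.conv(glob).transpose(1, 2).unsqueeze(-1)    # (B, C, 1, 1)
        # Fuse (addition), then "inverse pooling" via up-sampling.
        attn = torch.sigmoid(local + glob)
        attn = nn.functional.interpolate(attn, size=(h, w), mode="nearest")
        return x * attn                                         # re-weight input

print(MLCA(64)(torch.randn(2, 64, 40, 40)).shape)  # torch.Size([2, 64, 40, 40])
```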
2.2.2. GSConv Module
In the process of complex model calculation and image processing, the contradiction between a huge number of parameters and limited computing resources can hinder the actual deployment and operation efficiency of the model. Therefore, to tackle the challenge of the number and complexity of model parameters, the GSConv lightweight convolutional module is used to optimize computation volume.
Standard convolution (SConv) is a channel-dense computation. Multi-channel convolutional kernels are used to process multi-channel images, and the output feature map has both channel features and spatial characteristics. Depthwise separable convolution (DWConv) is a channel-sparse computation, which is divided into Depthwise convolution and Pointwise convolution [27]. Depthwise convolution applies convolution kernels to each input channel independently, which can significantly reduce the number of parameters and computational volume. Pointwise convolution combines features from different channels to enhance the model’s expressive ability, as shown in Figure 4. Their complexities are as follows:
SConv: O(W · H · K₁ · K₂ · C₁ · C₂)

DWConv: O(W · H · K₁ · K₂ · C₁)

where W and H are the width and height of the output feature map, K₁ and K₂ are the sizes of the convolution kernels, C₁ is the number of channels in the input feature map, and C₂ is the number of channels in the output feature map.
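A quick back-of-the-envelope check of these two expressions shows where the savings come from; with illustrative sizes, the depthwise convolution is exactly C₂ times cheaper:

```python
# Worked example of the two complexity formulas above
# (illustrative sizes: 3x3 kernels, 40x40 output, 64 channels in and out).
W = H = 40
K1 = K2 = 3
C1 = C2 = 64

sconv_macs = W * H * K1 * K2 * C1 * C2   # 58,982,400 multiply-accumulates
dwconv_macs = W * H * K1 * K2 * C1       #    921,600 multiply-accumulates
print(sconv_macs / dwconv_macs)          # 64.0 == C2
```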
The GSConv module is a lightweight convolution module that uses DWConv. In traditional convolutional neural networks (CNNs), the input image undergoes a series of transformations that gradually transfer spatial information to channels. However, each time the spatial size (width and height) of the feature map decreases and the number of channels increases, some semantic information may be lost. First, GSConv applies a standard convolution to the input feature maps, generating feature maps with half the number of output channels. Then, these feature maps are processed through depthwise separable convolution (DWConv) and concatenated with the first-stage feature maps. Finally, a channel shuffle operation permutes the concatenated maps to form new feature channels, improving the flow of information between features. Computational complexity is reduced while accuracy is maintained, as shown in Figure 5.
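The sketch below mirrors that three-step recipe (dense convolution to half the channels, depthwise convolution, concatenation plus channel shuffle). The kernel sizes and the BatchNorm/SiLU pairing are assumptions chosen to match common YOLO practice, not the exact layers of [25].

```python
import torch
import torch.nn as nn

class GSConv(nn.Module):
    """Sketch of GSConv: dense convolution to half the output channels,
    a depthwise convolution on that half, concatenation, channel shuffle."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        c_half = c_out // 2
        self.dense = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.dw = nn.Sequential(
            nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())

    def forward(self, x):
        y1 = self.dense(x)              # channel-dense half
        y2 = self.dw(y1)                # channel-sparse (depthwise) half
        y = torch.cat((y1, y2), dim=1)  # (B, c_out, H, W)
        b, c, h, w = y.shape
        # Channel shuffle: interleave the two halves so information mixes.
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)

print(GSConv(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # (1, 128, 40, 40)
```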
2.2.3. VoVGscsp Module
The VoVGscsp module (Figure 6) uses one-shot aggregation to build a cross-stage partial network structure that achieves effective fusion of feature maps between different network levels. The module takes the GSbottleneck as its core, integrating one or more GSConv modules to reinforce the network's ability to process image features; stacking GSConv modules significantly deepens the model's learning capacity. The VoVGscsp module reduces computational complexity and inference time without sacrificing accuracy and effectively improves the efficiency of feature utilization.
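A minimal sketch of this structure is given below, reusing the GSConv class from the previous sketch; the bottleneck depth, the residual shortcut form, and the 1 × 1 fusion convolutions are assumptions of the sketch.

```python
import torch
import torch.nn as nn
# Assumes the GSConv class from the previous sketch is in scope.

class GSBottleneck(nn.Module):
    """Two stacked GSConv modules with a residual shortcut (assumed form)."""
    def __init__(self, c):
        super().__init__()
        self.gs1 = GSConv(c, c)
        self.gs2 = GSConv(c, c)

    def forward(self, x):
        return x + self.gs2(self.gs1(x))

class VoVGSCSP(nn.Module):
    """Cross-stage partial block: one branch stacks GSbottlenecks, the
    other is a 1x1 shortcut; both are concatenated (one-shot aggregation)."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.cv2 = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.m = nn.Sequential(*(GSBottleneck(c_half) for _ in range(n)))
        self.cv3 = nn.Conv2d(c_out, c_out, 1, bias=False)

    def forward(self, x):
        y1 = self.m(self.cv1(x))   # feature-transform branch
        y2 = self.cv2(x)           # cross-stage shortcut branch
        return self.cv3(torch.cat((y1, y2), dim=1))

print(VoVGSCSP(128, 128)(torch.randn(1, 128, 20, 20)).shape)  # (1, 128, 20, 20)
```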
2.2.4. SA-Detect Module
SA (self-attention) operates through three 1 × 1 convolutions for the query, key, and value operations (Figure 7). The dot product of the query and key vectors is calculated to obtain the attention matrix, which is then normalized through the softmax function to obtain attention weights for each position. Finally, the value vectors are weighted and summed by the attention weights to obtain the final attention output. For the input feature map X, the query Q, key K, and value V are calculated through three different weight matrices:

Q = X·W_Q (1)

K = X·W_K (2)

V = X·W_V (3)

A = softmax(Q·Kᵀ/√d_k) (4)

Y = A·V (5)

A is the attention weight matrix. d_k is the vector dimension of Q and K. Y is the output feature map. W_Q, W_K, and W_V are the learnable parameters.
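The block below is a minimal PyTorch rendering of Eqs. (1)–(5), using 1 × 1 convolutions for Q, K, and V as described above; the reduced Q/K dimension and the zero-initialized residual scale γ are assumptions of the sketch rather than details from the original design.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Minimal rendering of Eqs. (1)-(5): 1x1 convs produce Q, K, V;
    scaled dot-product attention re-weights every spatial position."""
    def __init__(self, c, c_qk=None):
        super().__init__()
        c_qk = c_qk or max(c // 8, 1)        # reduced Q/K dim (assumption)
        self.q = nn.Conv2d(c, c_qk, 1)       # W_Q
        self.k = nn.Conv2d(c, c_qk, 1)       # W_K
        self.v = nn.Conv2d(c, c, 1)          # W_V
        self.gamma = nn.Parameter(torch.zeros(1))  # residual scale (assumption)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, c_qk)
        k = self.k(x).flatten(2)                   # (B, c_qk, HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # Eq. (4)
        y = (attn @ v).transpose(1, 2).reshape(b, c, h, w)          # Eq. (5)
        return x + self.gamma * y            # fuse with the conv-branch input

print(SelfAttention2d(64)(torch.randn(1, 64, 20, 20)).shape)  # (1, 64, 20, 20)
```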
The SA (self-attention) mechanism is applied in the decoupled head detection network structure (Figure 8). The convolutional feature information and the SA feature information from the two branches are fused, which improves the model's perception of details and its recognition of fine-grained targets in images, ensuring that detailed features are fully captured.
3. Experimental Results and Analysis
3.1. Experimental Environment and Dataset
The datasets used in this paper are the Northeastern University steel surface defect dataset (NEU-DET) and the GC10-DET metallic surface defect dataset (GC10-DET). NEU-DET (Figure 9) includes six types of defects: Crazing (Cr), Inclusion (In), Patches (Pa), Pitted surface (Ps), Rolled-in scale (Rs), and Scratches (Sc). Each category has 300 images, 1800 images in total, which are divided into training, test, and validation sets in an 8:1:1 ratio.
3.2. Experimental Equipment and Evaluation Indicators
The experimental environment configuration included Python 3.10, PyTorch 2.4.1, CUDA 12.1, an Intel i5-12600KF CPU with 32 GB RAM (Intel Corporation, Santa Clara, CA, USA), and an NVIDIA GeForce RTX 3060 12 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA). The training configuration included 250 epochs, a batch size of 16, the SGD optimizer, a momentum of 0.937, an initial learning rate of 0.01, a weight decay coefficient of 0.0005, and a 640 × 640 input resolution.
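For reproducibility, this configuration maps onto the Ultralytics training API roughly as follows; the model and dataset YAML file names are hypothetical placeholders ("yolov8-mgvs.yaml" would define the modified architecture and "neu-det.yaml" the 8:1:1 dataset split).

```python
from ultralytics import YOLO

# Minimal sketch of the training run described above.
model = YOLO("yolov8-mgvs.yaml")
model.train(
    data="neu-det.yaml",
    epochs=250,
    batch=16,
    imgsz=640,
    optimizer="SGD",
    momentum=0.937,
    lr0=0.01,             # initial learning rate
    weight_decay=0.0005,  # weight attenuation coefficient
)
```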
In this experiment, the mean average precision (mAP), recall, number of model parameters (Params), computational volume (GFLOPs), and frames per second (FPS) are used as evaluation indicators.

Precision = TP/(TP + FP) (6)

Recall = TP/(TP + FN) (7)

TP (True Positive) is the number of positive samples correctly predicted as positive. FP (False Positive) is the number of negative samples incorrectly predicted as positive. FN (False Negative) is the number of positive samples incorrectly predicted as negative.

AP = ∫₀¹ P(R) dR (8)

mAP = (1/N)·Σᵢ APᵢ (9)

AP is the average precision of a single defect class, and N is the number of defect classes. mAP@0.5, the detection accuracy at an intersection over union (IoU) threshold of 0.5, is used as the evaluation metric for model accuracy.
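The snippet below sketches how Eqs. (8) and (9) are typically computed: per-class AP as the area under the precision-recall curve, then averaged over classes. The VOC-style monotone-envelope interpolation is an assumption here, and the per-class values fed to Eq. (9) are the improved model's AP scores from Table 3.

```python
import numpy as np

def average_precision(recall, precision):
    """Eq. (8): area under the precision-recall curve, using the
    VOC-style monotone envelope (an assumed interpolation scheme)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # enforce non-increasing precision
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Eq. (9): mAP@0.5 is the mean of per-class AP at IoU threshold 0.5.
# Per-class AP/% of the improved model on NEU-DET, from Table 3.
ap_per_class = [54.6, 84.2, 92.3, 82.6, 72.7, 87.3]
print(sum(ap_per_class) / len(ap_per_class))   # 78.95 -> reported as 79.0
```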
3.3. Attention Mechanism Experiment
To verify the effectiveness of the MLCA mechanism in the C2f module of the backbone network, three other attention mechanisms, SE [28], CBAM [29], and CA [30], were embedded at the same position for the attention mechanism experiments. The visual heat maps generated are shown in Figure 10.
In Figure 10, taking scratches as an example, attention to the defect area increases significantly after an attention mechanism is added. Compared with the other attention mechanisms, MLCA shows the strongest focus on the defect and the least interference from non-defect areas, giving the best detection effect and visually demonstrating its superiority.
Table 1 reveals that applying each attention mechanism to the YOLOv8n baseline improves the mAP@0.5 and recall values to some extent. With MLCA, the mAP@0.5 value increased by 2.3% and the recall value by 6.6%. Its recall gain is 0.9% lower than that of CBAM, but its computational volume and parameter volume are 0.3 GFLOPs and 0.07 M lower than CBAM's, respectively. Overall, MLCA delivers higher detection accuracy while remaining more lightweight.
3.4. Ablation Experiment
The effects of the various improvements on the YOLOv8n network model are verified through ablation experiments. The results are shown in Table 2, where "√" denotes that the corresponding module is added. The recall and average accuracy curves are shown in Figure 11.
As shown in Table 2, compared with the YOLOv8n baseline, the mAP@0.5 and recall values improve to some extent after each module is applied.
Specifically, Model 1 applies the MLCA mechanism; its mAP@0.5 and recall increase by 2.2% and 6.6%, respectively, while the computational and parameter volumes remain essentially unchanged, verifying that C2f_MLCA improves the feature extraction of the backbone network. After GSConv is applied in Model 2, the mAP@0.5 and recall values increase by 0.6% and 7.2%, respectively, while the parameter and computational volumes decrease, validating that GSConv can reduce the model's computational volume through depthwise separable convolution while maintaining accuracy. VoVGscsp is used in Model 3; it achieves network aggregation through cross-layer GSConv connections, fusing features between different network levels. Its mAP@0.5 and recall values increase by 3% and 7%, respectively, while the computational and parameter costs are reduced by 0.7 GFLOPs and 1.17 × 10⁵ parameters, verifying the cross-layer aggregation and feature fusion ability of VoVGscsp and the lightweight effect of GSConv within its structure. SA-Detect is applied in Model 4; its mAP@0.5 and recall values increase by 1.8% and 3.8%, respectively, through the effect of the self-attention mechanism, verifying that self-attention improves the model's perception of details and its identification of small targets in the image.
Finally, the improved overall network, Model 7 (MGVS), increases the mAP@0.5 and recall values by 5.2% and 10.5% compared with YOLOv8n (Table 1), while the computational and parameter volumes are significantly reduced. This demonstrates that the overall improved model greatly outperforms the baseline.
3.5. Comparison Experiment
To further verify the performance of the improved YOLOv8n network model, comparison experiments are conducted under the same experimental conditions with mainstream algorithms such as Faster-RCNN, SSD, and the YOLO series. We evaluated different models and the results are shown in Table 3, with detection effect comparison diagrams shown in Figure 12.
Table 3 indicates that Faster-RCNN and SSD have lower detection accuracy and larger computational and parameter volumes, making them unsuitable for fast detection. The improved YOLOv8n network model achieves a detection accuracy of 79.0%, a recall rate of 74.9%, and 189.2 FPS. Its accuracy and recall rate are 2.8% and 4.9% higher than those of the YOLOv11n model, while its detection speed is slightly lower. The improved YOLOv8n offers the best balance of precision and speed for real-time online inspection of steel surface defects. Experimental verification shows that the improved YOLOv8n model has better feature perception and extraction ability for defects at all scales.
3.6. Generalization Experiment
3.6.1. Generalization Experiment 1
To validate the generalization performance of the improved YOLOv8n network model, the industrial-grade GC10-DET dataset (Figure 13) is selected for verification. It includes ten types of surface defects, namely Punching (Pu), Welding Line (Wl), Crescent Gap (Cg), Water Spot (Ws), Oil Spot (Os), Inclusion (In), Silk Spot (Ss), Rolled Pit (Rp), Crease (Cr), and Waist Folding (Wf), totaling 2293 images. The training, test, and validation sets are divided in a ratio of 8:1:1.
Table 4 and Table 5 reveal that the detection accuracy of the improved YOLOv8n network model is 70.2% (mAP@0.5), which is 3.4% higher than that of YOLOv8n. Its computational volume is 6.7 GFLOPs, 1.4 lower than YOLOv8n's, and its parameter volume is 2.8 M, only 0.204 M higher than YOLOv11n's. The improved model attains 192.0 FPS, 13.7 FPS higher than YOLOv10n. Compared with other YOLO series models, our model has higher average accuracy, faster image processing, and smaller computational and parameter volumes, demonstrating better generalization ability. The detection results of the improved YOLOv8n model, compared with traditional YOLO series algorithms and the latest YOLOv11 algorithm, are shown in Figure 14.
3.6.2. Generalization Experiment 2
To verify the applicability of the YOLOv8-MGVS model in other industrial scenarios, the solar panel defect detection (SDD-DET) dataset was selected for validation. It contains four types of surface defect categories, namely Bird Drop (Bd), Clean (Cl), Cracked (Cr), and Dust (Du), with a total of 5153 images, as shown in Figure 15. The dataset is divided into training, test, and validation sets in a ratio of 7:2:1.
Compared with the YOLOv8n model in Table 6, the improved YOLOv8 model achieves increases of 0.8% in mAP@0.5, 3.7% in recall, and 19.2 in FPS, while its GFLOPs and Params are 1.2 and 0.176 M lower. Compared with YOLOv11n, the improved model's GFLOPs and Params grow by only 0.5 and 0.239 M, while its mAP@0.5, recall, and FPS rise by 0.8%, 3.3%, and 9.4, respectively. YOLOv8-MGVS therefore has clear advantages in accuracy and speed, which verifies that the structure can also be applied in other industrial scenarios. The detection results of the improved YOLOv8n model, compared with traditional YOLO series algorithms and the latest YOLOv11 algorithm, are shown in Figure 16.
3.7. YOLOv8-MGVS Interface System
To promote adoption of the YOLOv8-MGVS model, an interface system was designed (Figure 17). The interface system mainly includes file import, test results, an operation area, detection results, and location information. Defect detection is carried out by uploading images or videos; after feature extraction, feature fusion, classification, and regression, the coordinates and defect types on the steel surface are output on the display interface. In future work, the system will be further optimized toward a cleaner separation of responsibilities and better inter-module communication, facilitating maintenance and upgrades.
4. Conclusions
To address the challenges of steel surface defect detection, the YOLOv8-MGVS model is proposed. The YOLOv8-MGVS interface system is designed for potential applications. This paper introduces the MGVS structure, a lightweight deep learning model designed for accurate and efficient defect detection.
- The C2f_MLCA module improves the feature extraction capability of the backbone network by integrating global, local, spatial, and channel information.
- The GSConv module reduces the computational and parameter volumes while maintaining accuracy, and the VoVGscsp module leverages the network's cross-layer aggregation capability to improve feature fusion.
- The SA mechanism improves the detection ability for small target defects.
- Compared with the advanced YOLOv11n model, our model's accuracy and recall rate are 2.8% and 4.9% higher, while its detection speed is 5.8 FPS lower. The model also performs well on the GC10-DET and SDD-DET datasets, demonstrating better generalization ability.
Conceptualization, K.Z.; methodology, K.Z. and Z.X.; software, K.Z. and Z.X.; validation, J.Q. and X.D.; formal analysis, J.Q. and X.D.; investigation, P.X. and L.Z.; resources, J.Q. and L.Z.; data curation, K.Z. and Z.X.; writing—original draft preparation, K.Z. and Z.X.; writing—review and editing, K.Z. and Z.X.; visualization, K.Z.; supervision, X.D. and L.Z.; project administration, P.X.; funding acquisition, P.X. All authors have read and agreed to the published version of the manuscript.
The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.
Authors Kai Zeng, Junlei Qian and Xueqiang Du were employed by Tangshan Iron and Steel Enterprise Process Control and Optimization Technology Innovation Center, Tangshan ANODE Automation Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Figure 11. Comparison of ablation experiments. (a) Recall curve and (b) average accuracy curve.
Figure 14. Comparison of generalization experiments. (a) Recall curve and (b) average accuracy curve.
Table 1. Attention mechanism experiment.

Models | mAP@0.5/% | Recall/% | GFLOPs | Params/M
---|---|---|---|---
YOLOv8n | 73.8 | 64.4 | 8.1 | 3.006
+C2f_SE | 74.7 | 68.7 | 8.4 | 3.057
+C2f_CBAM | 75.2 | 71.9 | 8.5 | 3.082
+C2f_CA | 75.8 | 66.8 | 8.4 | 3.063
+C2f_MLCA | 76.1 | 71.0 | 8.2 | 3.012
Table 2. Ablation experiment.

Model | C2f_MLCA | GSConv | VoVGscsp | SA-Detect | mAP@0.5/% | Recall/% | GFLOPs | Params/M
---|---|---|---|---|---|---|---|---
1 | √ |  |  |  | 76.1 | 71.0 | 8.2 | 3.012
2 |  | √ |  |  | 74.4 | 71.6 | 8.0 | 2.912
3 |  |  | √ |  | 76.8 | 71.4 | 7.4 | 2.889
4 |  |  |  | √ | 75.6 | 68.2 | 7.6 | 3.037
5 | √ | √ |  |  | 77.4 | 71.7 | 8.1 | 2.919
6 | √ | √ | √ |  | 77.4 | 67.6 | 7.4 | 2.802
7 | √ | √ | √ | √ | 79.0 | 74.9 | 6.9 | 2.831
Table 3. Comparison experiment.

Models | mAP@0.5/% | Recall/% | GFLOPs | Params/M | FPS | AP(Cr)/% | AP(In)/% | AP(Pa)/% | AP(Ps)/% | AP(Rs)/% | AP(Sc)/%
---|---|---|---|---|---|---|---|---|---|---|---
Faster-RCNN | 76.1 | 89.9 | 402.2 | 137.100 | 16.8 | 45.1 | 83.6 | 91.3 | 87.9 | 60.5 | 87.9
SSD | 63.8 | 38.7 | 281.9 | 26.285 | 33.6 | 47.3 | 68.5 | 88.6 | 68.4 | 54.7 | 55.0
YOLOv5s | 75.0 | 69.9 | 15.8 | 7.026 | 128.0 | 49.8 | 80.6 | 92.3 | 83.9 | 62.5 | 80.6
YOLOv7-tiny | 73.5 | 74.5 | 13.2 | 6.029 | 146.8 | 54.4 | 84.4 | 92.3 | 76.6 | 54.6 | 78.6
YOLOv8n | 73.8 | 64.4 | 8.1 | 3.006 | 177.8 | 40.5 | 81.6 | 91.0 | 81.0 | 60.4 | 88.4
YOLOv10n | 77.0 | 70.5 | 8.2 | 2.697 | 173.6 | 45.4 | 80.9 | 91.1 | 82.3 | 74.2 | 88.1
YOLOv11n | 76.2 | 70.0 | 6.4 | 2.591 | 195.0 | 47.8 | 82.3 | 95.9 | 78.7 | 65.0 | 87.2
Improved YOLOv8n | 79.0 | 74.9 | 6.9 | 2.831 | 189.2 | 54.6 | 84.2 | 92.3 | 82.6 | 72.7 | 87.3
Table 4. Per-class detection accuracy (AP/%) on GC10-DET.

Models | Pu | Cg | Os | In | Wl | Ws | Ss | Rp | Cr | Wf
---|---|---|---|---|---|---|---|---|---|---
YOLOv7-tiny | 89.4 | 95.3 | 63.9 | 58.9 | 82.7 | 66.9 | 55.8 | 23.1 | 40.2 | 79.0
YOLOv8n | 88.1 | 97.7 | 56.5 | 36.0 | 90.4 | 72.4 | 53.4 | 40.1 | 39.2 | 93.8
YOLOv10n | 87.4 | 91.3 | 58.0 | 39.9 | 87.0 | 74.4 | 42.3 | 33.1 | 38.4 | 80.8
YOLOv11n | 91.2 | 96.7 | 56.3 | 37.8 | 95.3 | 67.3 | 50.5 | 41.1 | 40.2 | 93.8
Improved YOLOv8n | 92.7 | 97.3 | 60.7 | 39.9 | 89.4 | 71.8 | 52.3 | 60.1 | 49.9 | 88.0
Table 5. Comparison of generalization experiments on GC10-DET.

Models | mAP@0.5/% | Recall/% | GFLOPs | Params/M | FPS
---|---|---|---|---|---
YOLOv7-tiny | 65.5 | 66.6 | 13.3 | 6.039 | 145.5
YOLOv8n | 66.8 | 62.4 | 8.1 | 3.008 | 169.7
YOLOv10n | 63.3 | 58.6 | 8.2 | 2.698 | 178.3
YOLOv11n | 67.0 | 64.6 | 6.5 | 2.596 | 204.3
Improved YOLOv8n | 70.2 | 68.3 | 6.7 | 2.800 | 192.0
Table 6. Comparison of generalization experiments on SDD-DET.

Models | mAP@0.5/% | Recall/% | GFLOPs | Params/M | FPS | AP(Bd)/% | AP(Cl)/% | AP(Cr)/% | AP(Du)/%
---|---|---|---|---|---|---|---|---|---
YOLOv8n | 56.2 | 53.2 | 8.1 | 3.006 | 144.7 | 6.0 | 56.8 | 81.7 | 68.4
YOLOv10n | 54.4 | 50.7 | 8.4 | 2.708 | 151.2 | 7.2 | 62.6 | 88.0 | 59.8
YOLOv11n | 56.2 | 53.6 | 6.4 | 2.591 | 154.5 | 8.3 | 65.3 | 88.7 | 62.4
Improved YOLOv8n | 57.0 | 56.9 | 6.9 | 2.830 | 163.9 | 9.0 | 66.2 | 90.7 | 62.2
References
1. Ma, Z.; Zeng, K.; Chen, B.; Xiao, P.; Zhu, L. Surface defect detection algorithm for continuous casting billets based on improved YOLOv7. China Metall.; 2024; 34, pp. 101-112. [DOI: https://dx.doi.org/10.13228/j.boyuan.issn1006-9356.20240008]
2. Wang, Y.; Zheng, Z.; Zhu, M.; Zhang, K.; Gao, X. An integrated production batch planning approach for steelmaking-continuous casting with cast batching plan as the core. Comput. Ind. Eng.; 2022; 173, 108636. [DOI: https://dx.doi.org/10.1016/j.cie.2022.108636]
3. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2016; 39, pp. 1137-1149. [DOI: https://dx.doi.org/10.1109/TPAMI.2016.2577031] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27295650]
4. Anantharaman, R.; Velazquez, M.; Lee, Y. Utilizing mask R-CNN for detection and segmentation of oral diseases. Proceedings of the 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Madrid, Spain, 3–6 December 2018; pp. 2197-2204. [DOI: https://dx.doi.org/10.1109/BIBM.2018.8621112]
5. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A. Ssd: Single shot multibox detector. Proceedings of the ECCV 2016; Amsterdam, The Netherlands, 11–14 October 2016; pp. 21-37. [DOI: https://dx.doi.org/10.1007/978-3-319-46448-0_2]
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016; [DOI: https://dx.doi.org/10.48550/arXiv.1506.02640]
7. Versaci, M.; Angiulli, G.; Foresta, F.; Laganà, F.; Palumbo, A. Intuitionistic fuzzy divergence for evaluating the mechanical stress state of steel plates subject to bi-axial loads. Integr. Comput.-Aided Eng.; 2024; 31, pp. 363-379. [DOI: https://dx.doi.org/10.3233/ICA-230730]
8. Ichiro, D.; Ken, M.; Keijiro, T.; Rei, K. Thickness Classifier on Steel in Heavy Melting Scrap by Deep-learning-based Image Analysis. ISIJ Int.; 2023; 63, pp. 197-203. [DOI: https://dx.doi.org/10.2355/isijinternational.ISIJINT-2022-331]
9. Zheng, X.; Zhu, Z.; Xiao, Z.; Huang, D.; Yang, C.; He, F.; Zhou, X.; Zhao, T. CNN-based Transfer Learning in Intelligent Recognition of Scrap Bundles. ISIJ Int.; 2023; 63, pp. 1383-1393. [DOI: https://dx.doi.org/10.2355/isijinternational.ISIJINT-2023-064]
10. Cui, W.; Song, K.; Feng, H.; Jia, X.; Liu, S.; Yan, Y. Autocorrelation-Aware Aggregation Network for Salient Object Detection of Strip Steel Surface Defects. IEEE Trans. Instrum. Meas.; 2023; 72, pp. 1-12. [DOI: https://dx.doi.org/10.1109/TIM.2023.3290965]
11. Yu, J.; Cheng, X.; Li, Q. Surface Defect Detection of Steel Strips Based on Anchor-Free Network With Channel Attention and Bidirectional Feature Fusion. IEEE Trans. Instrum. Meas.; 2022; 71, pp. 1-10. [DOI: https://dx.doi.org/10.1109/TIM.2021.3136183]
12. Han, C.; Li, G.; Liu, Z. Two-Stage Edge Reuse Network for Salient Object Detection of Strip Steel Surface Defects. IEEE Trans. Instrum. Meas.; 2022; 71, pp. 1-12. [DOI: https://dx.doi.org/10.1109/TIM.2022.3200114]
13. Tang, K.; Da Wang, Y.; Mostaghimi, P.; Knackstedt, M.; Hargrave, C.; Armstrong, R.T. Deep convolutional neural network for 3D mineral identification and liberation analysis. Miner. Eng.; 2022; 183, 107592. [DOI: https://dx.doi.org/10.1016/j.mineng.2022.107592]
14. Wu, Y.; Chen, R.; Li, Z.; Ye, M.; Dai, M. SDD-YOLO: A Lightweight, High-Generalization Methodology for Real-Time Detection of Strip Surface Defects. Metals; 2024; 14, 650. [DOI: https://dx.doi.org/10.3390/met14060650]
15. Meng, L.; Cui, X.; Liu, R.; Zheng, Z.; Shao, H.; Liu, J.; Peng, Y.; Zheng, L. Research on Metallurgical Saw Blade Surface Defect Detection Algorithm Based on SC-YOLOv5. Processes; 2023; 11, 2564. [DOI: https://dx.doi.org/10.3390/pr11092564]
16. Tao, Y.; Xu, L.; Qiang, L.; Li, L. CRGF-YOLO: An Optimized Multi-Scale Feature Fusion Model Based on YOLOv5 for Detection of Steel Surface Defects. Int. J. Comput. Intell. Syst.; 2024; 17, 154. [DOI: https://dx.doi.org/10.1007/s44196-024-00559-9]
17. Wang, Y.; Wang, H.; Xin, Z. Efficient detection model of steel strip surface defects based on YOLO-V7. IEEE Access; 2022; 10, pp. 133936-133944. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3230894]
18. Gao, S.; Tian, Y. Research on Steel Surface Defects Detection Algorithms by YOLOv8 Based on Attention Mechanism. IAENG Int. J. Comput. Sci.; 2024; 51, pp. 1309-1315.
19. Zhang, X.; Wang, Y.; Fang, H. Steel surface defect detection algorithm based on ESI-YOLOv8. Mater. Res. Express; 2024; 11, 056509. [DOI: https://dx.doi.org/10.1088/2053-1591/ad46ec]
20. Cheng, H.; Kang, F. Corrosion Detection and Grading Method for Hydraulic Metal Structures Based on an Improved YOLOv10 Sequential Architecture. Appl. Sci.; 2024; 14, 12009. [DOI: https://dx.doi.org/10.3390/app142412009]
21. Banduka, N.; Tomić, K.; Živadinović, J.; Mladineo, M. Automated Dual-Side Leather Defect Detection and Classification Using YOLOv11: A Case Study in the Finished Leather Industry. Processes; 2024; 12, 2892. [DOI: https://dx.doi.org/10.3390/pr12122892]
22. Huang, Y.; Wang, D.; Wu, B.; An, D. NST-YOLO11: ViT Merged Model with Neuron Attention for Arbitrary-Oriented Ship Detection in SAR Images. Remote Sens.; 2024; 16, 4760. [DOI: https://dx.doi.org/10.3390/rs16244760]
23. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.; Huang, W. Tood: Task-aligned one-stage object detection. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); Montreal, QC, Canada, 10–17 October 2021; pp. 3490-3499. [DOI: https://dx.doi.org/10.1109/ICCV48922.2021.00349]
24. Wan, D.; Lu, R.; Shen, S.; Xu, T.; Lang, X.; Ren, Z. Mixed local channel attention for object detection. Eng. Appl. Artif. Intell.; 2023; 123, 106442. [DOI: https://dx.doi.org/10.1016/j.engappai.2023.106442]
25. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process.; 2024; 21, 62. [DOI: https://dx.doi.org/10.1007/s11554-024-01436-6]
26. Zhang, Q.; Yang, Y. Sa-net: Shuffle attention for deep convolutional neural networks. Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); Toronto, ON, Canada, 6–11 June 2021; pp. 2235-2239. [DOI: https://dx.doi.org/10.48550/arXiv.2102.00240]
27. Zhang, T.; Xu, W.; Luo, B.; Wang, G. Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets. Neurocomputing; 2025; 617, 128998. [DOI: https://dx.doi.org/10.1016/j.neucom.2024.128998]
28. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132-7141. [DOI: https://dx.doi.org/10.1109/CVPR.2018.00745]
29. Woo, S.; Park, J.; Lee, J.; Kweon, I. CBAM: Convolutional block attention module. Proceedings of the ECCV 2018; Munich, Germany, 8–14 September 2018; pp. 3-19. [DOI: https://dx.doi.org/10.1007/978-3-030-01234-2_1]
30. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); Nashville, TN, USA, 20–25 June 2021; pp. 13708-13717. [DOI: https://dx.doi.org/10.1109/CVPR46437.2021.01350]
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Surface defects have a serious detrimental effect on the quality of steel. To address the problems of low efficiency and poor accuracy in manual inspection, intelligent detection technology based on machine learning has gradually been applied to the detection of steel surface defects. An improved YOLOv8 steel surface defect detection model called YOLOv8-MGVS is designed to address these challenges. The MLCA mechanism in the C2f module is applied to increase the feature extraction ability of the backbone network. The lightweight GSConv and VoVGscsp cross-stage fusion modules are added to the neck network to reduce the loss of semantic information and achieve effective information fusion. The self-attention mechanism is incorporated into the detection network to improve the detection ability for small targets. Defect detection experiments were carried out on the NEU-DET dataset. Compared with YOLOv8n, the average accuracy, recall rate, and frames per second of the improved model improved by 5.2%, 10.5%, and 6.4%, respectively, while the number of parameters and the computational cost were reduced by 5.8% and 14.8%, respectively. Furthermore, generalization experiments on the GC10-DET and SDD-DET datasets confirmed that the YOLOv8-MGVS model achieves higher detection accuracy, a lighter model, and faster speed.
1 College of Electrical Engineering, North China University of Science and Technology, Tangshan 063210, China;
2 College of Electrical Engineering, North China University of Science and Technology, Tangshan 063210, China;
3 College of Electrical Engineering, North China University of Science and Technology, Tangshan 063210, China;
4 Tangshan Iron and Steel Enterprise Process Control and Optimization Technology Innovation Center, Tangshan ANODE Automation Co., Ltd., Tangshan 063108, China;
5 College of Metallurgy and Energy, North China University of Science and Technology, Tangshan 063210, China;
6 Hebei Collaborative Innovation Center of High-Quality Steel Continuous Casting Engineering Technology, Tangshan 063000, China;