1. Introduction
Iris-recognition technology necessitates a thorough analysis of eye images, which includes not only identifying the iris texture with individual characteristics but also dealing with non-iris components such as eyelids, eyelashes, and reflections. Accurate iris segmentation is critical for further image processing and feature extraction [1]. Inaccurate segmentation can result in image-pixel misalignment, which reduces the accuracy of iris recognition. As a result, researchers are constantly looking for more efficient algorithms to accurately segment the iris area from complex eye images, eliminate other interfering factors, and improve the overall performance of the iris-recognition system. The principle of iris segmentation is illustrated in Figure 1.
Conventional iris-segmentation techniques, such as edge detection and Hough-transform methods, struggle to maintain performance and stability when confronted with undesirable conditions such as eye occlusion, image blurring, insufficient resolution, or reflections. As a result, improving the accuracy and reliability of iris segmentation under these non-ideal conditions has emerged as a major focus of current research. Deep learning algorithms, including advanced image-segmentation architectures such as FCN [2], U-Net [3], and the Transformer [4], provide a more accurate and stable solution for iris segmentation thanks to their powerful feature learning and boundary recognition abilities.
The applicability of U-Net to iris-segmentation tasks was first studied in depth by Lozej et al. [5], who also examined the computational efficiency of the algorithm in detail. Wu et al. [6] proposed Dense U-Net, a model combining U-Net and DenseNet, to improve iris-segmentation accuracy under non-ideal conditions. Zhang et al. [7] further developed the field by proposing the FD-U-Net algorithm, which replaces the traditional convolution operation with dilated convolution, making significant progress in extracting global features; it performs particularly well on the fine details of iris images, which allows it to maintain strong performance in heterogeneous iris-segmentation tasks as well. Wang et al. proposed IrisParseNet [8], an efficient multitask iris-segmentation method based on U-Net. The method models the iris mask and its parameterized inner and outer boundaries through a unified multitask network framework, which not only enhances the robustness and generalization ability of the algorithm but also provides a solid technical foundation for iris segmentation under non-ideal conditions. The Transformer architecture, with its unique self-attention mechanism and its ability to efficiently capture long-distance dependencies, has revolutionized natural language processing, and its gradual adoption in image segmentation is beginning to show its potential in visual tasks. Sun et al. [9] took an exploratory step in iris segmentation by proposing HTU-Net, a hybrid architecture that incorporates the Transformer. The architecture employs convolutional layers in the encoding stage to capture strong local features while utilizing the Transformer to capture long-range correlations. In the decoding stage, by introducing a gating mechanism, HTU-Net is able to capture rich multi-scale contextual information. In addition, Sun et al. designed a pyramid center-perception module to further enhance the ability to capture global iris features. Gu et al. [10] advanced the field by deeply integrating the Swin Transformer [11] with the U-Net architecture, which markedly improves the accuracy of separating the iris region from background noise pixels by accurately modeling contextual interactions between image pixels. Meng et al. [12] proposed a bilateral segmentation backbone network that combines the advantages of the Swin Transformer and CNNs for more efficient feature extraction. They also introduced a Multi-scale Information Feature Extraction Module, capable of extracting finer-grained multi-scale spatial information, as well as a Channel Attention Mechanism Module to enhance the discriminability of iris regions.
The Segment Anything Model (SAM), a groundbreaking development by Meta, has revolutionized the realm of image segmentation, boasting superior capabilities for segmenting intricate scenes and a multitude of objects. In addition, unlike models such as U-Net, FRED-Net [13], and OR-Skip-Net [14], SAM offers GPT-like prompt-based operation: it can use simple textual commands or click prompts to guide segmentation, a capability uncommon in traditional models. However, applying it to specific tasks, such as iris segmentation, presents a unique set of challenges that require fine-tuning the model to fit more specialized application scenarios. Therefore, we make a series of improvements to SAM to obtain better iris-segmentation results. The contributions of this paper are summarized in the following aspects:
(1) The Segment Anything Model (SAM) was applied to the field of iris segmentation, confirming the great potential and efficacy of large pretrained models in handling this intricate visual task. This also opens up new avenues for future research on iris-segmentation algorithms based on large models.
(2) The adapter technique has proven to be an efficient strategy for fine-tuning large models to fit specific tasks. In this paper, we present an innovative plug-and-play adapter, the IrisAdapter, which is specifically designed to capture iris domain-specific information. The adapter allows us to perform effective feature learning on iris images without updating the entire set of model parameters, while preserving the original knowledge of the model and thus avoiding the problem of knowledge forgetting. More importantly, the IrisAdapter significantly reduces the computational and economic costs associated with large-scale model training.
(3) To cope with the inadequacy of the pretrained ViT encoder in extracting localized detail information from iris images, this paper introduces a CNN branch that works in parallel with the ViT. This design enables the model to capture the fine local features of iris images through the CNN branch. Furthermore, we employ a Cross-Branch Attention module, which not only facilitates information exchange between the ViT and CNN branches but also enables the ViT branch to integrate and utilize the local information of iris images more effectively. Through this fusion strategy, our model significantly enhances the ability to recognize iris details while maintaining the sensitivity of ViT to global contextual information, thus improving overall segmentation performance.
2. Related Work
2.1. Segment Anything Model
Over the past few years, inspired by large-scale language models such as ChatGPT and GPT, many researchers have devoted themselves to developing models with similar capabilities. These models not only have strong generalization capabilities but can also be quickly adapted and scaled to a target task domain with very few samples, or even none at all. Meta’s FAIR lab recently released the Segment Anything Model (SAM) [15], a model at the forefront of image segmentation technology that promises to revolutionize the field of computer vision. The architecture and pipeline of SAM are shown in Figure 2. The pipeline of SAM is as follows: receive the input image and prompt information, extract features with the image encoder, process the prompts with the prompt encoder, generate segmentation masks with the mask decoder, handle disambiguation with the data engine, and finally output an accurate segmentation result. Intensively trained on millions of images and over a billion masks, SAM is capable of accurate image segmentation based on a wide range of prompts such as foreground/background points, bounding boxes, masks, and text. Impressively, SAM is able to provide effective segmentation results even when the prompt information is not sufficiently clear. SAM’s core strength lies in the rich knowledge accumulated through training on large-scale data, which has allowed it to learn and master the basic concepts of objects. Based on this deep understanding, SAM is able to segment any object, even never-before-seen ones, without additional training or fine-tuning, demonstrating excellent zero-shot generalization capability. This not only reflects SAM’s advances in the domain of image segmentation but also heralds the great potential of large pretrained models in solving complex visual problems.
2.2. Task-Specific SAM Fine-Tuning
SAM offers a superior framework for interactive segmentation, making it a benchmark for prompt-driven image segmentation. However, owing to the domain gap between natural images and iris images, SAM’s performance degrades significantly when applied to iris images. The reason can be attributed to the method of data acquisition: iris images are captured using specific protocols and specialized sensors and are acquired in different modes (near-infrared, visible light). These images are, therefore, based on a set of physical properties and energy sources that are very different from those of natural images. Accordingly, the research in this paper fine-tunes SAM on specific iris-segmentation datasets.
There are many studies on fine-tuning SAM for specific tasks. Chen et al. proposed SAM adapters [16], which inject domain-specific information or visual cues into the segmentation network through simple but effective adapters. Ma et al. [17] collected 11 medical image datasets with different modalities and fine-tuned the SAM mask decoder on more than 1 million masks while preserving the original bounding-box prompts. Deng et al. [18] proposed a multi-bounding-box-triggered uncertainty estimation method for SAM, which achieved a significant improvement in retinal image segmentation. Wu et al. proposed MSA [19], which uses an adapter technique to integrate medical-specific knowledge into SAM. Zhang et al. [20] proposed SAMed, which integrates low-rank adaptation (LoRA) [21] into SAM. These studies show that fine-tuning strategies or adapters can improve the performance of SAM on specific tasks. To combine the advantages of a foundation model and a domain-specific model, Farmanifard et al. [22] developed a pixel-level iris-segmentation model, IrisSAM, whose primary innovation lies in the integration of different loss functions when fine-tuning SAM on eye images. Li et al. [23] proposed nnSAM, which integrates the SAM model with nnUNet, enhancing the precision and robustness of medical image segmentation.
2.3. Interactive Segmentation
Interactive segmentation separates the foreground that the user wishes to extract from the background by exploiting interaction information provided by the user, such as clicks, bounding boxes, closed curves, non-closed curves, and other interactive inputs. Its defining characteristic is that it obtains information from the user’s guidance, and the algorithm then iteratively improves the segmentation based on user feedback. Interactive segmentation is useful in many applications that require precise extraction of objects, such as medical image segmentation [24].
3. Methodology
3.1. Overview
Inheriting the decoder and prompt encoder of the original SAM, SAM-Iris improves the image encoder to better adapt to the iris-segmentation task. The architecture and pipeline of SAM-Iris are shown in Figure 3. First, we reduce the input resolution of the image encoder from 1024 × 1024 to 256 × 256, which significantly improves the computational efficiency of the model. Second, to compensate for the lack of local feature extraction in the original ViT encoder, we introduce a CNN branch dedicated to capturing fine local information in the image, thus enhancing the model’s representation of iris details. Third, by introducing the CBA (Cross-Branch Attention) module, we enable effective information exchange between the CNN branch and the ViT branch; this cross-branch synergy allows the model to exploit the advantages of both branches and generate richer and more accurate feature representations. Finally, the outputs of the CNN branch and the ViT branch are merged to form an image embedding, from which the prompt encoder and mask decoder predict the iris mask. In addition, SAM requires post-processing to generate high-quality segmentation masks: the model output is first up-sampled to the dimensions of the original image and then converted to a binary mask through binarization.
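To make the post-processing step concrete, the following PyTorch sketch up-samples low-resolution mask logits to the original image size and binarizes them. The function name and the zero-logit threshold are illustrative assumptions, not part of the released implementation.

```python
import torch
import torch.nn.functional as F

def postprocess_mask(low_res_logits: torch.Tensor,
                     original_size: tuple,
                     threshold: float = 0.0) -> torch.Tensor:
    """Up-sample low-resolution mask logits to the original image size
    and convert them to a binary iris mask.

    low_res_logits: (B, 1, h, w) raw logits from the mask decoder.
    original_size:  (H, W) of the input eye image.
    """
    # Bilinear up-sampling back to the original resolution.
    upsampled = F.interpolate(low_res_logits, size=original_size,
                              mode="bilinear", align_corners=False)
    # Binarization: positive logits are treated as iris pixels (assumed threshold).
    return (upsampled > threshold).to(torch.uint8)
```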
3.2. Adapter
As computer hardware performance improves and the number of parameters in pretrained large models increases, full fine-tuning for downstream tasks becomes expensive and time-consuming. The adapter technique alleviates this problem: an adapter inserts task-specific parameters into each layer of the pretrained model, freezes the body of the model during fine-tuning, and trains only the task-specific parameters, thereby reducing the computational overhead of training. In the SAM model, the image encoder is the component with the largest number of parameters and its most important part. We therefore keep the original encoder parameters frozen during fine-tuning while equipping each Transformer block with an adapter. The implementation is as follows: first, the input feature map is reduced to C × 1 × 1 using global average pooling to obtain a compact channel representation. These compressed channel embeddings are then passed through a linear layer, followed by another linear layer that restores them to the original dimension. This compression-and-restoration process preserves the key information while enhancing the representation of the features. Finally, the restored channel embeddings are multiplied element-wise with the original feature maps, and the result is used as the input to the next layer, providing a richer and finer feature representation for the model. To further enhance performance, we introduce a skip connection after each adapter; this design not only preserves more low-level features but also promotes the effective fusion of features at different levels, strengthening the model’s ability to capture details. With this adapter technique, we significantly improve the performance and adaptability of the model in iris-segmentation tasks while maintaining computational efficiency. The structure of IrisAdapter is shown in Figure 4.
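The adapter described above can be sketched as a small PyTorch module. This is a minimal illustration assuming a squeeze-and-excitation-style layout; the reduction ratio, activation, and sigmoid gating are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn

class IrisAdapterSketch(nn.Module):
    """Channel adapter following the description above: global average
    pooling -> linear down-projection -> linear up-projection ->
    element-wise re-weighting of the input features, plus a skip
    connection. Reduction ratio and gating are assumed hyperparameters."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # C x H x W -> C x 1 x 1
        self.down = nn.Linear(channels, channels // reduction)
        self.up = nn.Linear(channels // reduction, channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Compact channel embedding via global average pooling.
        w = self.pool(x).view(b, c)
        # Compress and then restore the channel embedding.
        w = self.up(self.act(self.down(w))).view(b, c, 1, 1)
        # Re-weight the original feature map (sigmoid gating is an assumption)
        # and keep a skip connection.
        return x + x * torch.sigmoid(w)
```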
3.3. Prompt Encoder
The SAM prompt encoder is highly capable, supporting four prompt modes: points, bounding boxes, masks, and text. Given the lack of pretrained models for matching iris images to text, this study focuses on fine-tuning the other three prompt modes. Compared with previous approaches that use only a single prompt for fine-tuning, the work in this paper retains all three prompt modes: points, bounding boxes, and masks. Specifically, the proposed model employs an integrated strategy that utilizes both sparse prompts (points and bounding boxes) and dense prompts (masks). For point prompts, we use a positional-encoding vector embedding combined with two learnable vector embeddings that represent the foreground and background, respectively, and their sum enriches the expressive power of the point prompt. For bounding-box prompts, we use the positional encodings of the upper-left and lower-right corner points, together with learnable embedding vectors representing these two corners, to accurately capture the features of the bounding box. For dense prompts, we employ the low-resolution feature maps produced by the model’s previous iteration as a mask prompt. Using two convolutional embeddings, we reduce the input mask’s dimensions by a factor of four while setting the number of output channels to one quarter and one sixth of the initial input channels, respectively. Finally, a 1 × 1 convolutional kernel maps the channel dimension to 256 to ensure that the feature maps remain sufficiently expressive and informative. This combined use of sparse and dense prompts not only improves the model’s capacity to capture iris image features but also enhances its adaptability and flexibility to different prompt modes, providing powerful technical support for accurate segmentation and analysis of iris images. The structure of the prompt encoder is shown in Figure 5.
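As a rough illustration of the sparse point-prompt encoding described above, the sketch below sums a positional encoding of the click coordinates with a learnable foreground/background embedding. The random-Fourier positional encoding and all dimensions are assumptions borrowed from SAM's general design, not the exact implementation used here.

```python
import math
import torch
import torch.nn as nn

class SparsePointEncoderSketch(nn.Module):
    """Sketch of the point-prompt branch: positional encoding of the click
    coordinates plus a learnable embedding for the foreground/background label."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # Learnable embeddings for background (label 0) and foreground (label 1).
        self.label_embed = nn.Embedding(2, embed_dim)
        # Fixed random Fourier frequencies used as a positional encoding (assumption).
        self.register_buffer("freqs", torch.randn(2, embed_dim // 2))

    def positional_encoding(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (B, N, 2) click coordinates normalized to [0, 1].
        proj = 2 * math.pi * coords @ self.freqs              # (B, N, D/2)
        return torch.cat([proj.sin(), proj.cos()], dim=-1)    # (B, N, D)

    def forward(self, coords: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Sum of the positional encoding and the foreground/background embedding.
        return self.positional_encoding(coords) + self.label_embed(labels)
```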
3.4. Mask Decoder
In this study, we use the original mask decoder of the SAM model, preserving its structure without any changes. During training, we focus on continuously optimizing and updating the parameters of the mask decoder. The decoder consists of two Transformer layers, a dynamic mask-prediction head responsible for generating the initial mask prediction, and an intersection-over-union (IoU) score regression head that estimates how well the predicted mask matches the ground-truth mask. This design is lightweight and efficient yet very powerful. In its default mode of operation, the model generates three independent mask predictions for each prompt simultaneously. By comparing these predictions, the model selects the mask with the highest IoU score as the optimal solution, which then guides the parameter updates. This approach ensures that the model continuously improves during training, gradually increasing the accuracy and reliability of its mask predictions.
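The selection rule described above (keep the candidate with the highest predicted IoU) can be illustrated with a short PyTorch sketch; the tensor shapes are assumptions for illustration.

```python
import torch

def select_best_mask(masks: torch.Tensor, iou_predictions: torch.Tensor):
    """Pick, for each prompt, the candidate mask with the highest predicted
    IoU score, mirroring the selection rule described above.

    masks:           (B, 3, H, W) three candidate masks per prompt.
    iou_predictions: (B, 3) IoU scores from the regression head.
    """
    best = torch.argmax(iou_predictions, dim=1)                           # (B,)
    idx = best.view(-1, 1, 1, 1).expand(-1, 1, masks.size(-2), masks.size(-1))
    return torch.gather(masks, 1, idx).squeeze(1), best
```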
3.5. CNN Branch
The CNN branch is made up of a succession of consecutively linked convolution-pooling blocks. Specifically, the input data first passes through a single convolutional block, which is then followed by three sequential convolution-pooling blocks. In this process, the spatial dimensions of the feature maps output by the CNN branch are matched to those of the ViT branch. In the remainder of the CNN branch, these convolutional layers are iterated four times in sequence. Each convolutional layer uses a 3 × 3 kernel, and each convolution-pooling stage contains a max-pooling layer with a stride of 2 and a pooling kernel of 2 following the convolutional layer. The structure of a single convolutional block is shown in Figure 6.
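A minimal sketch of such a branch is given below, assuming an initial 3 × 3 convolutional block followed by three convolution-pooling stages; the channel widths and the BatchNorm/ReLU choices are assumptions, since the paper only specifies the kernel and pooling sizes.

```python
import torch
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Single convolutional block: 3x3 convolution + BatchNorm + ReLU
    (normalization/activation choices are assumptions)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class CNNBranchSketch(nn.Module):
    """Sketch of the CNN branch: an initial conv block followed by
    convolution-pooling stages (2x2 max pooling, stride 2)."""

    def __init__(self, in_ch: int = 3, widths=(64, 128, 256, 256)):
        super().__init__()
        stages = [conv_block(in_ch, widths[0])]
        for cin, cout in zip(widths[:-1], widths[1:]):
            stages.append(nn.Sequential(conv_block(cin, cout),
                                        nn.MaxPool2d(kernel_size=2, stride=2)))
        self.stages = nn.Sequential(*stages)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # e.g. a 256x256 input is reduced by three pooling stages to 32x32.
        return self.stages(x)
```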
3.6. Cross-Branch Attention
The Cross-Branch Attention (CBA) module establishes an information-exchange pathway between the CNN and ViT branches, enhancing the model’s ability to incorporate the missing local information. The structure of the Cross-Branch Attention module is shown in Figure 7. For the feature map \(F_{v}\) from the ViT branch and the feature map \(F_{c}\) from the CNN branch, the module is formulated as follows:

\[
\mathrm{CBA}(F_{v},F_{c}) = \mathrm{SoftMax}\!\left(\frac{Q_{v}K_{c}^{\top}}{\sqrt{d}} + B\right)V_{c} \tag{1}
\]

where \(\mathrm{SoftMax}(\cdot)\) denotes the SoftMax function; \(Q_{v}\), \(K_{c}\), and \(V_{c}\) denote the Q, K, and V of the attention mechanism, computed from \(F_{v}\) and \(F_{c}\); \(B\) denotes the relative position encoding; and \(d\) denotes the dimensionality of the CBA module.
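The cross-attention in Equation (1) can be sketched with a standard multi-head attention layer, with ViT-branch tokens as queries and flattened CNN-branch features as keys and values. The head count and the omission of the relative position encoding B are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CrossBranchAttentionSketch(nn.Module):
    """Sketch of the CBA module: ViT-branch tokens (queries) attend to
    CNN-branch features (keys/values), loosely following Equation (1)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vit_tokens: torch.Tensor, cnn_tokens: torch.Tensor):
        # vit_tokens: (B, N_v, C) ViT-branch tokens;
        # cnn_tokens: (B, N_c, C) flattened CNN-branch feature map.
        fused, _ = self.attn(vit_tokens, cnn_tokens, cnn_tokens)
        # Residual connection so the ViT branch keeps its global context.
        return vit_tokens + fused
```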
3.7. Loss Function
In the iris-segmentation task, the number of non-iris pixels is much larger than the number of iris pixels, and this class-imbalance problem can seriously degrade segmentation performance. To address it, we employ a combined loss function consisting of Dice Loss, Focal Loss, and IoU Loss to supervise the model’s training.
Dice Loss, originating from a seminal paper [25], was specifically designed to address the strong imbalance that exists between positive and negative samples in segmentation tasks. This loss function, with its unique advantages, optimizes the model’s performance when dealing with unbalanced datasets and improves the model’s accuracy in recognizing a small number of categories. Dice Loss, by assessing the resemblance between the model’s predictions and the actual labels, prompts the model to focus more on positive samples that constitute a minor part of the dataset. This approach effectively equalizes the influence of various categories and ensures the model’s capacity for generalization and robustness, as detailed further below:
\[
L_{\mathrm{Dice}} = 1 - \frac{2\,|P \cap G|}{|P| + |G|} \tag{2}
\]

where \(P\) represents the mask predicted by the model and \(G\) represents the true mask.

IoU Loss (Intersection-over-Union loss) [26] is a loss function that measures the degree of overlap between the predicted and true results, evaluating the performance of a model by calculating the ratio of the intersection to the union of the predicted and true masks. The core principle of this loss function is that it provides intuitive feedback on the model’s performance by quantifying how well the predicted mask matches the true mask. IoU Loss is particularly suited to segmentation tasks that require high accuracy and is defined below:
\[
L_{\mathrm{IoU}} = 1 - \frac{|P \cap G|}{|P \cup G|} \tag{3}
\]
Focal Loss [27] was introduced to address a common challenge in iris-segmentation tasks: the significant imbalance between positive and negative samples. This loss function is an extension of the traditional cross-entropy loss that adjusts the sensitivity of the loss to different samples by introducing a modulating factor \(\gamma\). The core idea of Focal Loss is to down-weight the contribution of samples that the model already classifies correctly (easy samples) and to up-weight samples that the model has difficulty recognizing (hard samples), prompting the model to focus on these challenging samples. In this way, Focal Loss optimizes the learning process, enabling the model to learn the elusive details in the iris-segmentation task more efficiently and significantly improving its ability to recognize minority-class samples. Its formula is as follows:
\[
L_{\mathrm{Focal}}(p_t) = -(1 - p_t)^{\gamma}\log(p_t) \tag{4}
\]

\[
p_t = \begin{cases} p, & y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{5}
\]

where \(p\) denotes the probability that the model predicts a pixel to be foreground, with \(p\) ranging from 0 to 1, and \(\gamma\) is a modulating factor that adjusts the emphasis placed on easy versus hard samples and thereby balances the two classes. Finally, the joint loss function is formulated as follows, where \(w\) and \(s\) are tunable hyperparameters:

\[
L = L_{\mathrm{Dice}} + w\,L_{\mathrm{Focal}} + s\,L_{\mathrm{IoU}} \tag{6}
\]
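A reference sketch of the combined loss is given below, built from standard PyTorch operations. The default weights w and s, as well as the optional class-balancing factor alpha in the focal term, are placeholder assumptions rather than the values used in the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Dice loss on predicted probabilities, Equation (2)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def iou_loss(pred, target, eps=1e-6):
    """IoU loss, Equation (3)."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) - inter
    return (1 - (inter + eps) / (union + eps)).mean()

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Focal loss, Equations (4)-(5); the class weight alpha is an optional
    assumption, Equation (4) itself uses only the focusing factor gamma."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

def joint_loss(logits, target, w=1.0, s=1.0):
    """Combined loss of Equation (6); default weights are placeholders."""
    pred = torch.sigmoid(logits)
    return dice_loss(pred, target) + w * focal_loss(logits, target) + s * iou_loss(pred, target)
```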
3.8. Fine-Tuning Strategy
In this paper, we draw on the essence of the SAM model and other interactive segmentation methods and train the model by simulating the process of interactive segmentation. Specifically, for each batch of data we employ a training strategy of nine iterations. In the crucial first iteration, we initiate segmentation by selecting, with equal probability, either a random foreground point or a bounding box as the sparse prompt. The foreground points are sampled from the ground-truth mask, while the bounding box is the maximal enclosing rectangle computed from the ground-truth mask, with its four corner points allowed an offset of up to 5 pixels to increase the robustness of the model. In the first iteration, we adopt a comprehensive update strategy, updating the parameters of the adapter, prompt encoder, and mask decoder simultaneously; this provides a solid starting point for the model to capture key image features more accurately in subsequent iterations. From the second iteration onward, we adopt a more flexible sparse-prompt strategy by randomly selecting 1, 3, 5, or 9 points as prompts, which not only increases the diversity of training but also encourages the model to learn to segment efficiently with different numbers of prompts. At the same time, the model uses the low-resolution feature maps generated in the previous iteration as the dense prompt for the current iteration, a strategy that lets the model gradually refine its understanding of the mask over successive iterations. In the last iteration, as well as in a randomly selected intermediate iteration, we provide only a dense prompt, which guides the model to focus on extracting information from the existing feature prompt to further improve the accuracy and reliability of its predictions. A simplified sketch of this prompt-sampling procedure is given below.
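Under the assumptions noted in the comments, the iterative prompt sampling might look as follows; exact jitter handling, point-count guarantees, and edge cases are simplifications, not the paper's released code.

```python
import random
import numpy as np

def sample_prompts(gt_mask, iteration, n_iters=9,
                   dense_only_iter=None, prev_low_res_mask=None):
    """Sketch of the iterative prompt-sampling strategy described above.
    gt_mask: binary HxW ground-truth iris mask with at least 9 foreground pixels.
    Returns (points, box, dense_mask); unused prompt types are None."""
    ys, xs = np.nonzero(gt_mask)
    if iteration == 1:
        if random.random() < 0.5:                        # random foreground point
            i = random.randrange(len(xs))
            return [(int(xs[i]), int(ys[i]))], None, None
        # Tight bounding box with up to 5 px of corner jitter.
        x0, x1 = int(xs.min()), int(xs.max())
        y0, y1 = int(ys.min()), int(ys.max())
        jitter = lambda v: v + random.randint(-5, 5)
        return None, (jitter(x0), jitter(y0), jitter(x1), jitter(y1)), None
    if iteration == n_iters or iteration == dense_only_iter:
        return None, None, prev_low_res_mask             # dense prompt only
    k = random.choice([1, 3, 5, 9])                      # sparse + dense prompts
    idx = np.random.choice(len(xs), size=k, replace=False)
    points = [(int(xs[i]), int(ys[i])) for i in idx]
    return points, None, prev_low_res_mask
```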
4. Experiments and Analysis of Results
4.1. Introduction to the Dataset
In this paper, we assess the performance of our proposed model across three iris-segmentation datasets: CASIA.v4-distance, UBIRIS.v2, and MICHE. These three datasets cover a variety of challenging factors such as different spectra (visible and near-infrared), devices, and distances, and their dataset-related information is shown in Table 1.
CASIA.v4-distance [28]: This dataset was captured using a CASIA long-range iris camera in near-infrared light. We use the same protocol as [29] for experiments, which contained 400 iris images at 640 × 480 resolution. The first 300 images from the first 30 subjects were used for training, and the last 100 images from the last 10 subjects were used for testing.
UBIRIS.v2 [30]: This dataset was captured under visible-light conditions using a Canon EOS 5D camera. A subset of 1000 UBIRIS.v2 images at 400 × 300 resolution was used in the NICE.I competition [31]. We follow the same protocol as the NICE.I competition [31] for our experiments, in which 945 of these images were manually labeled, with 500 images used for training and 445 for testing.
MICHE [32]: This dataset was created to evaluate and develop algorithms for visible-light iris images captured on mobile devices. It includes visible-light iris images captured by three mobile devices (iPhone 5, Samsung Galaxy S4, Samsung Galaxy Tab 2) under unconstrained conditions. We follow the same protocol as [8] for our experiments, which selected 871 visible-light images from MICHE, comprising 680 training images and 191 test images.
4.2. Metrics
FP (False Positive) denotes the number of non-iris pixels predicted as iris; TN (True Negative) denotes the number of correctly predicted non-iris pixels; FN (False Negative) denotes the number of iris pixels predicted as non-iris; and TP (True Positive) denotes the number of correctly predicted iris pixels.
4.2.1. mIoU
mIoU denotes the mean intersection-over-union averaged over the pixel classes in the iris image, where k represents the number of classes; a larger mIoU indicates a better segmentation result. The formula for mIoU is as follows.

\[
\mathrm{mIoU} = \frac{1}{k}\sum_{i=1}^{k}\frac{TP_i}{TP_i + FP_i + FN_i} \tag{7}
\]
4.2.2. F1
The F1 Score is calculated from Precision and Recall; a larger F1 Score indicates a better segmentation result. The formulas are as follows.

\[
\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}
\]

\[
\mathrm{Recall} = \frac{TP}{TP + FN} \tag{9}
\]

\[
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{10}
\]
4.2.3. E1
E1 denotes the ratio of inconsistent pixels to the total pixels obtained by computing the dissimilarity of the predicted segmented image to each pixel in the ground truth. The smaller the value of E1, the better the segmentation result is, and the calculation formula is as follows.
\[
E1 = \frac{1}{N}\sum_{j=1}^{N}\frac{1}{m \times n}\sum_{x=1}^{m}\sum_{y=1}^{n} P_{j}(x,y) \oplus G_{j}(x,y) \tag{11}
\]

where \(N\) denotes the number of iris images, \(m\) and \(n\) denote the width and height of the iris images, \(P_{j}(x,y)\) and \(G_{j}(x,y)\) denote the pixels of the predicted segmentation result and the labeled ground-truth image, respectively, and \(\oplus\) denotes the logical exclusive-or (XOR) operation.

4.2.4. Accuracy
Accuracy represents the ratio of the number of correctly predicted pixel points to the total number of pixels in the iris image. A larger value of Acc represents a better segmentation result and is calculated as follows.
\[
\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}
\]
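For clarity, the four metrics can be computed for a single predicted/ground-truth mask pair as sketched below (mIoU here averages the iris and non-iris classes, i.e., k = 2); dataset-level averaging is omitted and is an assumption of this illustration.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compute E1, F1, mIoU and Acc for one binary iris mask pair,
    following Equations (7)-(12)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()

    precision = tp / (tp + fp + 1e-9)
    recall = tp / (tp + fn + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    iou_iris = tp / (tp + fp + fn + 1e-9)
    iou_bg = tn / (tn + fp + fn + 1e-9)
    miou = (iou_iris + iou_bg) / 2                 # two classes: iris / non-iris
    e1 = np.logical_xor(pred, gt).mean()           # fraction of disagreeing pixels
    acc = (tp + tn) / pred.size
    return {"E1": e1, "F1": f1, "mIoU": miou, "Acc": acc}
```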
4.3. Experimental Setup
The algorithm in this paper is implemented entirely in the PyTorch framework and trained on a single Nvidia RTX 4090 GPU. Considering the limitations of computational resources, and especially the concentration of parameters in the encoder part of the SAM model, we chose ViT-B (Base) as the encoder for fine-tuning. Given the relatively limited training data, we set the batch size to 2 and employed the Adam optimizer with a learning rate of 0.0001 and a weight decay of 0.01 throughout training. To ensure that the model could fully learn and generalize on the limited data, training was conducted over 15 epochs. Before training, image resolution was uniformly normalized to 256 × 256 pixels. Images with a width or height smaller than 256 pixels were zero-padded at the edges to preserve their content and proportions; all other images were resized with bilinear interpolation to ensure that quality and details were preserved during rescaling.
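The resizing rule described above might be implemented as in the following sketch; the placement of the zero padding (bottom/right) and the handling of images that exceed the target size in only one dimension are assumptions.

```python
import torch
import torch.nn.functional as F

def preprocess_image(img: torch.Tensor, target: int = 256) -> torch.Tensor:
    """Normalize an eye image to target x target: zero-pad images that fit
    within the target size, otherwise resize with bilinear interpolation.
    img: (C, H, W) float tensor."""
    _, h, w = img.shape
    if h <= target and w <= target:
        pad_h, pad_w = target - h, target - w
        # F.pad takes (left, right, top, bottom) for the last two dimensions.
        return F.pad(img, (0, pad_w, 0, pad_h), mode="constant", value=0)
    return F.interpolate(img.unsqueeze(0), size=(target, target),
                         mode="bilinear", align_corners=False).squeeze(0)
```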
4.4. Analysis of Experimental Results
4.4.1. Comparison Experiment
As shown in Table 2, our model demonstrates exceptional performance on multiple iris-segmentation datasets, including CASIA.v4-distance, UBIRIS.v2, and MICHE-I. In the comparative experiments, a variety of algorithms were selected for analysis, including the traditional algorithm RTV-L1 [33], complemented by several advanced CNN-based algorithms, namely U-Net [3], Deeplab V3+ [34], MFCNs [29], CNNHT [35], and IrisParseNet [8], as well as the Transformer-based Swin Transformer [11] and TransUNet [36]. The results indicate that the model introduced in this study surpasses the CNN-based models across all four performance metrics. This demonstrates the superiority of the Transformer’s self-attention mechanism in capturing long-distance dependencies in images, which markedly enhances segmentation performance. Moreover, the Transformer-based Swin Transformer and TransUNet outperform their CNN-based counterparts, further substantiating the efficacy of the self-attention mechanism. The encoder and decoder of the SAM-Iris model are both built from Transformer layers, with the encoder incorporating the pretrained large-scale ViT-B model and the IrisAdapter module, a novel addition that endows the model with enhanced iris-image feature extraction capabilities. Notably, the incorporation of bounding-box and point prompts improves the model’s precision in delineating the annular iris area, which in turn effectively steers the segmentation toward higher accuracy. In comparison with the current state-of-the-art Transformer algorithms, Swin Transformer and TransUNet, SAM-Iris demonstrates superior iris-segmentation capability.
Visual comparisons of the iris-segmentation results from SAM-Iris are depicted in Figure 8 and Figure 9. In these comparison subplots, blue areas represent true positives (correctly predicted iris pixels), red areas represent false positives (non-iris pixels incorrectly predicted as iris), and green areas represent false negatives (iris pixels incorrectly predicted as non-iris). The results show that the proposed method not only accurately segments the outer and inner contours of the iris region but also enables the image encoder to learn iris features more efficiently, effectively avoiding incorrect segmentation of the pupil region. In addition, the method shows higher accuracy when dealing with eyelash occlusion and highlight regions, successfully excluding factors unrelated to the iris. This improvement is mainly attributed to the CNN branch, which significantly improves the image encoder’s ability to extract information from localized regions and thus the prediction quality of the iris mask. With this comprehensive strategy, SAM-Iris not only achieves a significant improvement in overall segmentation accuracy but also demonstrates excellent performance in detail processing.
4.4.2. Ablation Experiment
As detailed in Table 3, this section presents a comprehensive set of ablation studies designed to assess the individual contribution of each component to performance. For consistency, the iris images in all four sets of experiments were resized to 256 × 256 pixels. The first set of experiments used the original SAM model without fine-tuning. The results show that this model performs poorly in segmenting iris regions and lacks the ability to generalize to the iris-segmentation task, emphasizing the need for targeted fine-tuning of SAM. The next three sets of experiments further validate that the fine-tuned SAM model can be effectively adapted to the target domain. Comparing the second and third sets of experiments shows that the IrisAdapter significantly improves the image encoder’s ability to extract iris features, confirming the effectiveness of the adapter technique in optimizing the SAM model. In addition, the second and fourth sets of experiments demonstrate the importance of local information in improving the quality of iris segmentation; the results show that the CNN branch successfully introduces key local features into the ViT branch, further enhancing the model’s capacity to capture iris details.
4.4.3. Prompt Experiment
The SAM model is pretrained with an interactive, promptable segmentation method in which a series of prompts (points, boxes, masks, etc.) is simulated for each training image, and the loss is defined as the deviation between the model’s predicted mask and the true mask. This interactive capability allows SAM to obtain reasonable segmentation results in a single interaction, which is usually not the case with traditional models. The data in Table 4 detail the performance of the iris-segmentation task when using a bounding box (Bbox) and different numbers of point prompts (1, 3, 5, and 9 points) on different datasets. Notably, as the number of point prompts increases, the performance metrics remain largely constant. This indicates that a small, appropriate number of point prompts is sufficient for accurate iris segmentation, while adding prompts beyond that threshold brings no significant improvement; more point prompts are not necessarily better, and there is an optimal number of prompts. In addition, point prompts show an advantage over the bounding-box prompt in improving iris-segmentation performance: they localize key regions of the iris more accurately, allowing the model to understand and delineate the iris contour in greater detail. Taken together, these findings provide important guidance: when designing SAM-based iris-segmentation algorithms, an appropriate number of point prompts should be used to optimize performance, and point prompts are particularly effective at accurately capturing iris features.
4.4.4. Interactive Segmentation Visualization
Figure 10 visualizes the segmentation of an NIR iris image under different prompts, showing the differences in the segmentation results. The original image in Figure 10a shows the unsegmented iris image, which provides a reference for the subsequent visualizations. In Figure 10b–f, it can be seen that the model is able to segment the iris region accurately and interactively under the different prompt modes considered.
5. Conclusions
In this paper, we have successfully applied the Segment Anything Model (SAM) to the iris-segmentation domain through a series of innovations, confirming the great potential and effectiveness of large pretrained models in handling this complex visual task. We propose the IrisAdapter, a plug-and-play adapter that effectively captures iris domain-specific information while avoiding a full update of the model parameters, reducing the computational and economic cost of training. In addition, by introducing a CNN branch working in parallel with the ViT and a Cross-Branch Attention module, our model makes significant progress in extracting local detail information from iris images, enhances the recognition of iris details, and improves overall segmentation performance. Future work can further explore and optimize the adapter technique for a wider range of application scenarios and requirements. We will also focus on further improving the SAM model’s ability to extract local features, as well as on fine-tuning techniques for large pretrained models, to achieve higher-accuracy iris segmentation.
Author Contributions: Conceptualization, J.J. and Q.Z.; methodology, J.J.; software, J.J.; validation, J.J.; formal analysis, J.J. and Q.Z.; investigation, Q.Z. and C.W.; resources, Q.Z. and C.W.; data curation, J.J.; writing—original draft preparation, J.J.; writing—review and editing, Q.Z. and C.W.; visualization, J.J.; supervision, Q.Z.; project administration, Q.Z.; funding acquisition, Q.Z. and C.W. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: The datasets presented in this study are available; all data used in the study can be obtained from publicly available databases or by email request to the authors.
Conflicts of Interest: The authors declare no conflicts of interest.
Figure 1. Iris segmentation. The figure on the left represents the components of the human eye, where the part between the orange inner circle and the blue outer circle is the part to be segmented, and the figure on the right represents the segmentation mask, where the white part is the part to be segmented correctly.
Figure 3. The architecture and pipeline of SAM-Iris, where “Freeze” means not updating the parameters and “Update” means updating the parameters.
Figure 8. Visualization of visible light iris-segmentation results. Blue areas represent true positives, red areas represent false positives, and green areas represent false negative pixels.
Figure 9. Visualization of NIR iris image segmentation results. Blue areas represent true positives, red areas represent false positives, and green areas represent false negative pixels.
Table 1. Experimental datasets. NIR denotes near-infrared light and VIS denotes visible light.
Dataset | Train | Test | Resolution | Device | Spectrum |
---|---|---|---|---|---|
CASIA.v4-distance | 300 | 100 | 640 × 480 | CASIA long-range iris camera | NIR |
UBIRIS.v2 | 500 | 445 | 400 × 300 | Canon EOS 5D | VIS |
MICHE | 680 | 191 | 400 × 400 | iPhone 5, Samsung Galaxy S4, Samsung Galaxy Tab 2 | VIS |
Table 2. Comparison of experimental results. An upward arrow indicates that a larger value is better; a downward arrow indicates that a smaller value is better.
Methods | CASIA.v4-Distance | UBIRIS.v2 | MICHE-I | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | |
RTV-L1 [33] | 0.68 | 87.55 | 78.25 | 81.04 | 1.21 | 85.97 | 77.63 | 88.83 | 2.42 | 79.24 | 71.47 | 88.97 |
U-Net [3] | 0.42 | 93.96 | 88.84 | 91.28 | 0.91 | 91.59 | 84.67 | 92.50 | 0.76 | 92.63 | 86.67 | 93.36 |
Deeplab V3+ [34] | 0.53 | 92.43 | 86.83 | - | 0.88 | 91.16 | 85.90 | - | 0.77 | 91.93 | 85.79 | - |
MFCNs [29] | 0.59 | 93.09 | - | - | 0.90 | 91.04 | - | - | 0.74 | 92.01 | - | - |
CNNHT [35] | 0.56 | 92.27 | 86.58 | 89.01 | 0.97 | 90.34 | 82.98 | 91.14 | 0.80 | 91.41 | 85.27 | 91.66 |
IrisParseNet [8] | 0.41 | 94.25 | 89.52 | 93.29 | 0.84 | 91.78 | 84.88 | 92.31 | 0.66 | 93.05 | 87.27 | 92.53 |
SwinTransformer [11] | 0.40 | 94.52 | 89.68 | 93.91 | 0.99 | 91.46 | 83.96 | 91.52 | 0.91 | 91.34 | 84.67 | 92.39 |
SwinUNet [37] | 0.37 | 94.67 | 90.03 | 94.22 | 0.92 | 92.37 | 84.64 | 92.72 | 0.71 | 91.34 | 86.93 | 93.79 |
TransUNet [36] | 0.39 | 94.51 | 89.72 | 93.27 | 0.91 | 91.55 | 84.57 | 91.67 | 0.73 | 92.71 | 86.75 | 93.10 |
MedSAM [17] | 0.47 | 92.94 | 86.81 | 92.67 | 0.93 | 90.58 | 83.27 | 90.75 | 0.81 | 92.02 | 84.95 | 93.04 |
IrisSAM [22] | 0.45 | 93.79 | 87.82 | 93.58 | 0.93 | 91.12 | 84.05 | 92.43 | 0.80 | 92.06 | 84.25 | 92.15 |
Ours | 0.34 | 95.15 | 90.88 | 96.49 | 0.79 | 94.08 | 88.94 | 94.97 | 0.67 | 93.62 | 88.66 | 95.03 |
Table 3. Results of the ablation study. An upward arrow indicates that a larger value is better; a downward arrow indicates that a smaller value is better.
Methods | CASIA.v4-Distance | UBIRIS.v2 | MICHE-I | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | |
SAM (no fine-tuning) | 3.78 | 37.33 | 24.97 | 64.41 | 6.47 | 32.17 | 21.42 | 77.03 | 5.19 | 43.84 | 31.05 | 67.26 |
SAM+IrisAdapter | 0.44 | 91.43 | 88.07 | 96.15 | 0.91 | 93.19 | 87.41 | 94.22 | 0.82 | 91.47 | 84.66 | 92.45 |
SAM+CNNBranch | 3.77 | 56.86 | 40.48 | 52.26 | 4.97 | 57.41 | 41.89 | 66.68 | 4.31 | 48.96 | 33.77 | 57.02 |
Ours | 0.34 | 95.15 | 90.88 | 96.49 | 0.79 | 94.08 | 88.94 | 94.97 | 0.67 | 93.62 | 88.66 | 95.03 |
Table 4. Prompt comparison experiment results.
Prompt Mode | CASIA.v4-Distance | UBIRIS.v2 | MICHE-I | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | E1↓ | F1↑ | mIoU↑ | Acc↑ | |
Bbox | 0.36 | 94.89 | 90.45 | 96.11 | 0.72 | 93.24 | 87.83 | 94.02 | 0.68 | 93.35 | 88.27 | 94.95 |
1 Point | 0.34 | 95.15 | 90.88 | 96.49 | 0.79 | 94.08 | 88.94 | 94.97 | 0.67 | 93.62 | 88.66 | 95.03 |
3 Points | 0.37 | 94.68 | 90.22 | 96.03 | 0.73 | 93.20 | 87.79 | 93.92 | 0.68 | 93.33 | 88.27 | 94.87 |
5 Points | 0.37 | 94.66 | 90.21 | 96.01 | 0.73 | 93.20 | 87.79 | 93.92 | 0.68 | 93.31 | 88.25 | 94.84 |
9 Points | 0.37 | 94.67 | 90.22 | 96.01 | 0.73 | 93.19 | 87.77 | 93.91 | 0.68 | 93.33 | 88.27 | 94.87 |
References
1. Nguyen, K.; Proença, H.; Alonso-Fernandez, F. Deep Learning for Iris Recognition: A Survey. ACM Comput. Surv.; 2024; 56, pp. 1-35. [DOI: https://dx.doi.org/10.1145/3651306]
2. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Boston, MA, USA, 7–12 June 2015; pp. 3431-3440.
3. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer International Publishing: New York, NY, USA, 2015; pp. 234-241.
4. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. Proceedings of the International Conference on Learning Representations; Virtual Event, Austria, 3–7 May 2021; Available online: https://openreview.net/forum?id=YicbFdNTTy (accessed on 22 October 2020).
5. Lozej, J.; Meden, B.; Struc, V.; Peer, P. End-to-End Iris Segmentation Using U-Net. Proceedings of the 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI); San Carlos, Costa Rica, 18–20 July 2018; pp. 1-6.
6. Wu, X.; Zhao, L. Study on Iris Segmentation Algorithm Based on Dense U-Net. IEEE Access; 2019; 7, pp. 123959-123968. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2938809]
7. Zhang, W.; Lu, X.; Gu, Y.; Liu, Y.; Meng, X.; Li, J. A Robust Iris Segmentation Scheme Based on Improved U-Net. IEEE Access; 2019; 7, pp. 85082-85089. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2924464]
8. Wang, C.; Muhammad, J.; Wang, Y.; He, Z.; Sun, Z. Towards Complete and Accurate Iris Segmentation Using Deep Multi-Task Attention Network for Non-Cooperative Iris Recognition. IEEE Trans. Inf. Forensics Secur.; 2020; 15, pp. 2944-2959. [DOI: https://dx.doi.org/10.1109/TIFS.2020.2980791]
9. Sun, Y.; Lu, Y.; Liu, Y.; Zhu, X. Towards More Accurate and Complete Iris Segmentation Using Hybrid Transformer U-Net. Proceedings of the 2022 IEEE International Joint Conference on Biometrics (IJCB); Abu Dhabi, United Arab Emirates, 10–13 October 2022; pp. 1-10.
10. Gu, Z.; Wang, C.; Tian, Q.; Zhang, Q. A Symmetrical Encoder-Decoder Network with Transformer for Noise-Robust Iris Segmentation. J. Comput. Aided Des. Comput. Graph.; 2022; 34, pp. 1887-1898. [DOI: https://dx.doi.org/10.3724/SP.J.1089.2022.19235]
11. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); Montreal, QC, Canada, 10–17 October 2021; pp. 9992-10002.
12. Meng, Y.; Bao, T. Towards More Accurate and Complete Heterogeneous Iris Segmentation Using a Hybrid Deep Learning Approach. J. Imaging; 2022; 8, 246. [DOI: https://dx.doi.org/10.3390/jimaging8090246] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/36135411]
13. Arsalan, M.; Kim, D.; Lee, M.; Owais, M.; Park, K. FRED-Net: Fully residual encoder–decoder network for accurate iris segmentation. Expert Syst. Appl.; 2019; 122, pp. 217-241. [DOI: https://dx.doi.org/10.1016/j.eswa.2019.01.010]
14. Arsalan, M.; Kim, D.; Owais, M.; Park, K. OR-Skip-Net: Outer residual skip network for skin segmentation in non-ideal situations. Expert Syst. Appl.; 2020; 141, 112922. [DOI: https://dx.doi.org/10.1016/j.eswa.2019.112922]
15. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.; Lo, W. et al. Segment Anything. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2304.02643]
16. Chen, T.; Zhu, L.; Deng, C.; Cao, R.; Wang, Y.; Zhang, S.; Li, Z.; Sun, L.; Zang, Y.; Mao, P. SAM-Adapter: Adapting Segment Anything in Underperformed Scenes. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops; Paris, France, 2–3 October 2023; pp. 3367-3375.
17. Ma, J.; Wang, B. Segment Anything in Medical Images. arXiv; 2023; arXiv: 2304.12306[DOI: https://dx.doi.org/10.1038/s41467-024-44824-z] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/38253604]
18. Deng, G.; Zou, K.; Ren, K.; Wang, M.; Yuan, X.; Ying, S.; Fu, H. SAM-U: Multi-box Prompts Triggered Uncertainty Estimation for Reliable SAM in Medical Image. Medical Image Computing and Computer Assisted Intervention—MICCAI 2023 Workshops; Springer: Cham, Switzerland, 2023; pp. 368-377.
19. Wu, J.; Ji, W.; Liu, Y.; Fu, H.; Xu, M.; Xu, Y.; Jin, Y. Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2304.12620]
20. Zhang, K.; Liu, D. Customized Segment Anything Model for Medical Image Segmentation. arXiv; 2023; [DOI: https://dx.doi.org/10.48550/arXiv.2304.13785]
21. Hu, E.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv; 2021; [DOI: https://dx.doi.org/10.48550/arXiv.2106.09685]
22. Farmanifard, P.; Ross, A. Iris-SAM: Iris Segmentation Using a Foundation Model. 2024; Available online: https://api.semanticscholar.org/CorpusID:267616903 (accessed on 9 February 2024).
23. Li, Y.; Jing, B.; Li, Z.; Wang, J.; Zhang, Y. nnSAM: Plug-and-play Segment Anything Model Improves nnUNet Performance. arXiv; 2024; [DOI: https://dx.doi.org/10.48550/arXiv.2309.16967]
24. Wang, G.; Zuluaga, M.; Li, W.; Pratt, R.; Patel, P.; Aertsen, M.; Doel, T.; David, A.; Deprest, J.; Ourselin, S. et al. DeepIGeoS: A Deep Interactive Geodesic Framework for Medical Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell.; 2019; 41, pp. 1559-1572. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2840695] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29993532]
25. Milletari, F.; Navab, N.; Ahmadi, S. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV); Stanford, CA, USA, 25–28 October 2016; pp. 565-571.
26. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An Advanced Object Detection Network. Proceedings of the 24th ACM International Conference on Multimedia; Amsterdam, The Netherlands, 15–19 October 2016; pp. 516-520. [DOI: https://dx.doi.org/10.1145/2964284.2967274]
27. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV); Venice, Italy, 22–29 October 2017; pp. 2999-3007.
28. Biometrics Ideal Test (BIT). CASIA.v4 Database. 2020; Available online: http://www.idealtest.org/dbDetailForUser.do?id=4 (accessed on 12 July 2021).
29. Liu, N.; Li, H.; Zhang, M.; Liu, J.; Sun, Z.; Tan, T. Accurate iris segmentation in non-cooperative environments using fully convolutional networks. Proceedings of the 2016 International Conference on Biometrics (ICB); Halmstad, Sweden, 13–16 June 2016; pp. 1-8.
30. Proenca, H.; Filipe, S.; Santos, R.; Oliveira, J.; Alexandre, L. The UBIRIS.v2: A Database of Visible Wavelength Iris Images Captured On-the-Move and At-a-Distance. IEEE Trans. Pattern Anal. Mach. Intell.; 2010; 32, pp. 1529-1535. [DOI: https://dx.doi.org/10.1109/TPAMI.2009.66] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/20558882]
31. Proenca, H.; Alexandre, L. The NICE.I: Noisy Iris Challenge Evaluation—Part I. Proceedings of the 2007 First IEEE International Conference on Biometrics: Theory, Applications, and Systems; Crystal City, VA, USA, 27–29 September 2007; pp. 1-4.
32. De Marsico, M.; Nappi, M.; Riccio, D.; Wechsler, H. Mobile Iris Challenge Evaluation (MICHE)-I, biometric iris dataset and protocols. Pattern Recognit. Lett.; 2015; 57, pp. 17-23. [DOI: https://dx.doi.org/10.1016/j.patrec.2015.02.009]
33. Zhao, Z.; Kumar, A. An Accurate Iris Segmentation Framework Under Relaxed Imaging Constraints Using Total Variation Model. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV); Santiago, Chile, 7–13 December 2015; pp. 3828-3836.
34. Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv; 2018; [DOI: https://dx.doi.org/10.48550/arXiv.1802.02611]
35. Hofbauer, H.; Jalilian, E.; Uhl, A. Exploiting superior CNN-based iris segmentation for better recognition accuracy. Pattern Recognit. Lett.; 2019; 120, pp. 17-23. [DOI: https://dx.doi.org/10.1016/j.patrec.2018.12.021]
36. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.; Zhou, Y. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. arXiv; 2021; [DOI: https://dx.doi.org/10.48550/arXiv.2102.04306]
37. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation. Proceedings of the ECCV Workshops; Montreal, BC, Canada, 11–17 October 2021; Available online: https://api.semanticscholar.org/CorpusID:234469981 (accessed on 12 May 2021).
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
The Segment Anything Model (SAM) has made breakthroughs in the domain of image segmentation, attaining high-quality segmentation results using input prompts like points and bounding boxes. However, utilizing a pretrained SAM model for iris segmentation has not achieved the desired results. This is mainly due to the substantial disparity between natural images and iris images. To address this issue, we have developed SAM-Iris. First, we designed an innovative plug-and-play adapter called IrisAdapter. This adapter allows us to effectively learn features from iris images without the need to comprehensively update the model parameters while avoiding the problem of knowledge forgetting. Subsequently, to overcome the shortcomings of the pretrained Vision Transformer (ViT) encoder in capturing local detail information, we introduced a Convolutional Neural Network (CNN) branch that works in parallel with it. This design enables the model to capture fine local features of iris images. Furthermore, we adopted a Cross-Branch Attention mechanism module, which not only promotes information exchange between the ViT and CNN branches but also enables the ViT branch to integrate and utilize local information more effectively. Subsequently, we adapted SAM for iris image segmentation by incorporating a broader set of input instructions, which included bounding boxes, points, and masks. In the CASIA.v4-distance dataset, the E1, F1, mIoU, and Acc of our model are 0.34, 95.15%, 90.88%, and 96.49%; in the UBIRIS.v2 dataset, the E1, F1, mIoU, and Acc are 0.79, 94.08%, 88.94%, and 94.97%; in the MICHE dataset, E1, F1, mIoU, and Acc were 0.67, 93.62%, 88.66%, and 95.03%. In summary, this study has improved the accuracy of iris segmentation through a series of innovative methods and strategies, opening up new horizons and directions for large-model-based iris-segmentation algorithms.
1 School of Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China;
2 School of Intelligence Science and Technology, Beijing University of Civil Engineering and Architecture, Beijing 100044, China;