1. Introduction
Satellite imagery plays a crucial role in various domains such as urban planning, disaster management, environmental monitoring, and agricultural assessment [1]. However, the spatial resolution of satellite images is often limited by the inherent constraints of the imaging hardware, including sensor quality, satellite altitude, and bandwidth [2]. Enhancing the resolution of these images, a process referred to as super-resolution (SR), is essential for maximizing the utility of remote sensing data. Super-resolution techniques aim to reconstruct a high-resolution (HR) image from a low-resolution (LR) counterpart, thereby enabling finer detail recovery and improved interpretability [3]. The super-resolution task is inherently ill-posed because infinitely many reconstructed HR images can correspond to a single LR image.
In recent years, deep learning has emerged as a dominant paradigm for super-resolution (SR), with convolutional neural networks (CNNs) [4] and generative adversarial networks (GANs) [5] delivering significant advancements. CNN-based approaches, such as enhanced deep residual networks (EDSRs) [6], exploit hierarchical feature extraction to improve image quality. GAN-based models, such as SRGAN [5], utilize adversarial training to generate photo-realistic textures. However, despite their success, these methods face several challenges. One key limitation is inadequate texture recovery. CNN-based methods often struggle to recover fine textures, particularly in high-frequency regions such as edges and detailed patterns [7]. Another issue is the lack of external priors. Existing models typically do not leverage external knowledge or priors, which could significantly guide and improve the reconstruction process. Finally, scalability presents a significant hurdle. Satellite image datasets are inherently large and computationally intensive, making scalable and efficient models a necessity for practical applications [8]. These limitations underscore the need for novel approaches that address these gaps while maintaining high reconstruction quality. Generative models, particularly those based on variational autoencoders (VAEs) [9] and vector quantized generative adversarial networks (VQGANs) [10], have demonstrated their capability to learn complex distributions and generate high-quality outputs. However, their application to satellite image super-resolution remains underexplored. These models can provide learned priors that capture high-resolution structural and textural information, which can significantly enhance SR performance.
To address the aforementioned limitations, this paper introduces the multi-branch generative prior integration network (MBGPIN), a novel framework designed to enhance satellite image super-resolution. The MBGPIN framework incorporates several innovative features. Its architecture is based on a dual-pathway design, which consists of a feature extraction pathway and a generative prior pathway. The feature extraction pathway focuses on capturing multiscale spatial features, while the generative prior pathway integrates external high-resolution priors, leveraging the capabilities of a pretrained VQGAN model [10]. The model also employs a hybrid attention mechanism that combines channel and spatial attention. This mechanism ensures efficient feature extraction by dynamically prioritizing relevant spatial and spectral regions, thereby improving the quality of the reconstructed images. Another critical component of the framework is the adaptive generative prior fusion (AGPF) module. This dynamic fusion module aligns and integrates multiscale features with learned priors using a similarity-based approach, specifically cosine similarity, to produce high-quality reconstructions. Moreover, the design of the MBGPIN emphasizes scalability and efficiency. Through the use of pyramidal architectures and attention mechanisms optimized for computational efficiency, the framework achieves reduced complexity without compromising reconstruction fidelity. These advancements collectively make the MBGPIN a robust and efficient solution for addressing the challenges of super-resolution in satellite images. The MBGPIN framework addresses the limitations of traditional SR models by integrating generative priors with multiscale feature extraction, enabling the robust reconstruction of high-frequency details. Furthermore, the proposed method demonstrates scalability and efficiency, making it suitable for real-world applications involving large-scale satellite datasets.
The rest of this paper is structured as follows: Section 2 reviews related work on satellite image super-resolution and generative models. Section 3 presents the proposed MBGPIN architecture and its components. Section 4 details the experimental setup and evaluation metrics. Section 5 discusses the results, and Section 6 concludes this paper with future research directions.
2. Related Work
The super-resolution (SR) problem has evolved through various paradigms, beginning with classical approaches and progressing to state-of-the-art deep learning and generative methods. In the domain of satellite imagery, the need for the accurate reconstruction of high-resolution images has introduced unique challenges and inspired diverse methodologies (Table 1).
2.1. Traditional and Model-Based Super-Resolution Methods
The classical SR methods rely heavily on mathematical interpolation techniques, such as bicubic and bilinear interpolation, to upscale low-resolution (LR) images [18]. While straightforward and computationally efficient, these approaches often struggle to recover fine details, especially in high-frequency regions such as edges and textures. The inherent limitation of these methods lies in their inability to capture the complex spatial correlations required to reconstruct visually and structurally meaningful high-resolution (HR) outputs [19]. Techniques based on sparse coding and dictionary learning marked a shift toward model-driven approaches, where patches of LR images were approximated using a learned dictionary of HR patches [20]. Despite these advancements, the reliance on handcrafted features and the computational burden of these methods restricted their application, particularly for large-scale remote sensing datasets.
2.2. CNN-Based Super-Resolution Approaches
The advent of deep learning marked a paradigm shift in SR, offering a data-driven approach capable of learning complex mappings between LR and HR images [21]. Early CNN-based models such as SRCNN [4] introduced an end-to-end framework that directly learned the LR-to-HR transformation. As the field matured, more sophisticated architectures emerged, incorporating deeper networks, residual connections, and skip layers to enhance feature learning and mitigate vanishing gradient problems [22]. Enhanced deep residual networks (EDSRs) and similar models demonstrated significant improvements in reconstruction quality. Multispectral satellite images are essential for urban planning, climate monitoring, and agricultural applications. A deep-learning-based pipeline has been developed [23] to harmonize data from Landsat-8 and Sentinel-2, improving Landsat-8’s spatial resolution and increasing cloud-free image availability by 21% annually. Ref. [24] proposed a framework combining contrastive training with neural style transfer (NST) for the unsupervised super-resolution of remote sensing imagery. By using high-resolution textures as style elements applied to low-resolution content images, the framework achieved superior results across various modalities, including single-band, multispectral, and RGB remote-sensing images. Ref. [25] proposed FMANet, a novel architecture combining super-resolution and fused bottleneck self-attention. A custom deep super-resolution network first enhances RS image quality, followed by a self-attention architecture that extracts features using residual and inverted networks. Features are classified using a shallow neural network, optimized via Bayesian methods. Ref. [26] proposed a multiscale texture transfer network (MTTN) for the super-resolution of remote sensing imagery. MTTN adaptively transfers texture information based on texture similarity with a reference image, employing a multiscale texture-matching strategy to enhance fine-texture details. Ref. [27] proposed the cross-scale hierarchical Transformer (CHT), which integrates cross-scale self-attention (CSA) for global feature modeling and cross-scale channel attention (CCA) for enriched local feature extraction. By systematically exploring cross-scale correlations, CHT effectively captures hierarchical image features. However, these approaches often exhibit limitations in recovering high-frequency details and producing textures that appear natural and visually plausible [28].
2.3. GAN-Based Super-Resolution Methods
Generative adversarial networks (GANs) further advanced the field by introducing adversarial loss, which encouraged the generation of photo-realistic textures. Models like SRGAN [5] achieved notable success in producing outputs that were perceptually superior to those generated by traditional CNNs. Ref. [17] proposed the multiscale attention GAN (MSAGAN). By integrating a multiscale structure with channel and spatial attention modules, MSAGAN highlights critical features while suppressing irrelevant details. Residual connections and dense blocks further improve the depth and performance of the generative network. Ref. [29] introduced a second-order attention generative adversarial network (SA-GAN) trained on real SR datasets derived from Gao Fen (GF) satellite images to simulate real degradation scenarios. SA-GAN utilizes a second-order channel attention mechanism, a region-level non-local module, and region-aware loss to enhance feature extraction and suppress artifacts. Ref. [30] presented MCWESRGAN, a modified ESRGAN [12] network to achieve single-image super-resolution (SISR). The model employs a multi-column discriminator and Wasserstein loss for training, reducing training time tenfold. Ref. [13] presented a methodology using a modified ESRGAN architecture enhanced with the Uformer model for the spatial resolution improvement of video satellite images. This approach significantly improves object recognition and detection capabilities. Nevertheless, the instability of GAN training, coupled with challenges such as mode collapse, posed difficulties for widespread adoption, particularly in domains requiring consistent and reliable outputs like remote sensing.
2.4. Transformer- and Diffusion-Based Super-Resolution Models
Attention mechanisms brought another dimension to SR research by dynamically focusing on relevant spatial and channel-wise information. Channel attention mechanisms recalibrate feature maps by assigning greater weight to informative channels, while spatial attention mechanisms emphasize regions of interest within an image. These techniques, as seen in models like RCAN [11], enhanced the ability to recover textures and details, particularly from complex imagery. High-resolution remote sensing imagery is critical for land-use mapping, crop planning, and disaster surveillance applications. A two-branch multiscale residual attention network has been proposed [16] to enhance details like edges and textures for single-image super-resolution reconstruction. The network leverages multiscale efficient channel and spatial attention blocks to extract and refine features, resulting in more accurate predictions. Ref. [31] proposed the global sparse attention network (GSAN), which uses spherical locality-sensitive hashing (SLSH) to optimize attention computation, reducing complexity from quadratic to linear. GSAN effectively captures global information while improving performance and efficiency. Ref. [32] proposed ESatSR, a state-space model leveraging 2D selective scanning for long-range dependency modeling and wide receptive fields. The spatial context interaction module (SCIM) and enhanced image reconstruction module (EIRM) integrate prior knowledge, improving feature extraction and reconstruction. Yet, despite their effectiveness, existing attention-based models rarely integrate external priors, limiting their potential in applications requiring external guidance, such as satellite image super-resolution.
The current SR methods for satellite imagery exhibit several limitations. Many fail to reconstruct high-frequency details and natural textures, which are crucial for applications requiring fine spatial resolution. Furthermore, the lack of integration of external priors restricts their ability to generalize across varying conditions and datasets. Scalability remains another significant challenge, as the computational demands of advanced models often make them impractical for large-scale satellite imagery. The MBGPIN addresses these limitations by introducing a dual-pathway architecture that integrates multiscale feature extraction with generative priors derived from the VQGAN [10]. By incorporating hybrid attention mechanisms and dynamic fusion techniques, the MBGPIN achieves superior performance in recovering high-frequency details and textures. Its scalable and computationally efficient design ensures applicability to large-scale remote sensing datasets, bridging the gaps in the existing methodologies and setting a new benchmark in the super-resolution of satellite images. The proposed architecture and its components are detailed in the subsequent section.
3. Proposed Methodology
The limitations of the existing super-resolution (SR) techniques for satellite imagery, including inadequate texture recovery, lack of external prior integration, and scalability challenges, highlight the need for innovative approaches. To address these gaps, we propose the MBGPIN. This novel framework synergizes multiscale feature extraction, hybrid attention mechanisms, and generative priors to achieve high-quality super-resolution in satellite images. The architecture of the MBGPIN is designed with two complementary pathways: the feature extraction pathway and the generative prior pathway. The feature extraction pathway employs multiscale convolutions to capture spatial features across varying resolutions, while the generative prior pathway leverages a pretrained VQGAN [10] to extract high-resolution priors, enriching the reconstruction process with external knowledge. These pathways are dynamically fused using the AGPF module, which aligns and integrates features based on their contextual relevance (Figure 1). To enhance the robustness and efficiency of feature extraction, the MBGPIN incorporates a hybrid attention mechanism that combines channel and spatial attention. This mechanism allows the network to focus on critical features and regions, ensuring precise detail recovery even in complex satellite images. Furthermore, the architecture emphasizes scalability and computational efficiency through lightweight design principles, including pyramidal architectures and optimized attention modules.
In this study, HR imagery refers to sub-meter spatial resolution, specifically in the range of 30 cm to 1 m per pixel. This definition aligns with commonly used remote sensing datasets such as UC Merced, NWPU-RESISC45, and RSSCN7, where high-resolution images provide sufficient detail for object-level recognition, urban monitoring, and structural analysis. The ability to enhance image resolution within this range is critical for applications that require sharp feature reconstruction without excessive artifacts. While traditional CNN-based super-resolution models struggle with fine-texture recovery at this scale, the integration of generative priors in the MBGPIN allows for enhanced detail preservation beyond that of conventional methods. However, it is important to note that the effectiveness of generative priors decreases as the resolution scale increases beyond the training domain, making resolutions below 10 cm per pixel more challenging to reconstruct accurately.
The proposed methodology is further reinforced by a comprehensive loss function design that balances pixel accuracy, perceptual realism, and structural consistency. By integrating these innovative components, the MBGPIN not only addresses the inherent challenges of super-resolution for satellite images but also sets a new benchmark in terms of performance, scalability, and applicability to diverse real-world scenarios. This section provides a detailed explanation of the MBGPIN’s architecture, highlighting its components and their contributions to achieving state-of-the-art super-resolution performance.
3.1. Architecture Overview
The MBGPIN combines multiscale feature extraction, generative priors, and hybrid attention to achieve high-quality super-resolution in satellite imagery. The architecture comprises two main pathways: the feature extraction pathway and the generative prior pathway, integrated through the AGPF module. The feature extraction pathway employs multiscale convolutions to capture both local and global spatial features, enhanced by hybrid attention mechanisms that dynamically focus on informative channels and regions. This ensures robust spatial feature learning while mitigating irrelevant information. The generative prior pathway utilizes a pretrained VQGAN [10] to extract high-resolution priors from a learned latent space, as shown in Figure 1.
The generative prior pathway in the MBGPIN enhances super-resolution performance by incorporating high-level structural information from a pretrained VQGAN model [10]. Unlike standard convolutional feature extraction, this approach allows the network to leverage learned priors from large-scale datasets, improving texture reconstruction and fine-detail preservation. The process begins with encoding the low-resolution (LR) input image into a discrete latent space representation using the VQGAN encoder [10], which compresses spatial information while retaining essential semantic structures. This encoding captures contextual relationships between different image regions, ensuring that important details are preserved. Once encoded, the latent features undergo refinement through a series of Transformer-based layers, which apply self-attention mechanisms to enhance global coherence and consistency. These layers refine the prior knowledge by reinforcing spatial dependencies and improving structural accuracy. The processed generative priors are then upsampled using deconvolution layers to align with the target resolution. By incorporating these high-resolution priors into the MBGPIN, the model can recover fine textures and intricate structures that are often lost in conventional super-resolution approaches. The final generative prior output is then passed to the adaptive generative prior fusion (AGPF) module, where it is dynamically integrated with the multiscale features extracted from the main convolutional pathway.
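To make this pathway concrete, the following is a minimal PyTorch sketch of how such a pipeline could be assembled; the VQGAN encoder is passed in as a black box, and the latent dimension, number of Transformer layers, and output channel width are illustrative assumptions rather than the configuration used in the MBGPIN.

```python
import math

import torch
import torch.nn as nn

class GenerativePriorPathway(nn.Module):
    """Sketch of a generative prior pathway: a frozen pretrained encoder maps the LR input
    to a latent grid, Transformer layers refine it, and deconvolutions upsample the refined
    priors toward the target feature resolution. Sizes are illustrative assumptions."""

    def __init__(self, vqgan_encoder: nn.Module, latent_dim: int = 256,
                 out_channels: int = 64, num_layers: int = 4, scale: int = 4):
        super().__init__()
        self.encoder = vqgan_encoder                          # pretrained VQGAN encoder (kept frozen)
        for p in self.encoder.parameters():
            p.requires_grad = False
        layer = nn.TransformerEncoderLayer(d_model=latent_dim, nhead=8, batch_first=True)
        self.refiner = nn.TransformerEncoder(layer, num_layers=num_layers)
        ups, ch = [], latent_dim
        for _ in range(int(math.log2(scale))):                # one deconvolution per x2 step
            ups += [nn.ConvTranspose2d(ch, ch // 2, kernel_size=4, stride=2, padding=1),
                    nn.ReLU(inplace=True)]
            ch //= 2
        ups.append(nn.Conv2d(ch, out_channels, kernel_size=3, padding=1))
        self.upsample = nn.Sequential(*ups)

    def forward(self, lr_image: torch.Tensor) -> torch.Tensor:
        z = self.encoder(lr_image)                            # (B, C, h, w) latent feature grid
        b, c, h, w = z.shape
        tokens = self.refiner(z.flatten(2).transpose(1, 2))   # self-attention over h*w tokens
        z = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.upsample(z)                               # prior features handed to the AGPF module
```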
These priors, which capture intricate textures and structures, provide external guidance to enhance detail reconstruction. The AGPF module fuses the outputs from both pathways using a similarity-based mechanism that dynamically balances their contributions. This fusion ensures the effective integration of the generative priors with the extracted features, producing a refined representation. Finally, the reconstruction module upsamples and refines the fused features to produce the super-resolved output, balancing computational efficiency with high-fidelity detail recovery. This dual-pathway design enables the MBGPIN to generate high-quality, visually consistent, and scalable super-resolution results.
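A high-level sketch of how the two pathways, an AGPF-style fusion, and the pixel-shuffle reconstruction head could be composed is given below; the FeatureExtractionPathway-style modules are assumed to be supplied externally, and the channel width and scale factor are placeholders, not the reported settings.

```python
import torch.nn as nn

class MBGPINSketch(nn.Module):
    """Illustrative composition of the dual-pathway design, not the exact MBGPIN:
    feature pathway + generative prior pathway -> AGPF-style fusion -> reconstruction."""

    def __init__(self, feature_pathway: nn.Module, prior_pathway: nn.Module,
                 fusion: nn.Module, channels: int = 64, scale: int = 4):
        super().__init__()
        self.feature_pathway = feature_pathway            # multiscale convolutions + hybrid attention
        self.prior_pathway = prior_pathway                # e.g., a generative prior pathway as sketched above
        self.fusion = fusion                              # AGPF-style module returning the fused feature map
        self.reconstruct = nn.Sequential(                 # pixel-shuffle upscaling, then refinement convs
            nn.Conv2d(channels, channels * scale * scale, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, lr_image):
        f_feat = self.feature_pathway(lr_image)           # F_feat: multiscale spatial features
        f_gen = self.prior_pathway(lr_image)              # F_gen: high-resolution generative priors
        f_fused = self.fusion(f_feat, f_gen)              # similarity-weighted adaptive fusion
        return self.reconstruct(f_fused)                  # super-resolved output I_HR
```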
3.2. Hybrid Attention Mechanism
The attention mechanism in the MBGPIN enhances feature representation by refining information both along the channel dimension and across spatial regions to improve super-resolution performance. This approach is inspired by the convolutional block attention module (CBAM) [33], which applies channel and spatial attention sequentially. However, the MBGPIN differs by integrating both mechanisms simultaneously through an adaptive weighting function, allowing the model to dynamically balance their contributions rather than treating them as independent processes. The hybrid attention mechanism in the MBGPIN enhances feature representation by dynamically emphasizing both the channel and spatial dimensions, ensuring the network captures relevant global and local information. This mechanism combines channel attention (CA) and spatial attention (SA) to effectively recalibrate feature maps, improving the reconstruction of the fine textures and details in satellite images. Channel attention focuses on recalibrating the importance of feature channels by assigning weights based on their relevance. Each channel corresponds to a distinct type of feature (e.g., edges, textures), and the mechanism adjusts their contributions dynamically. For a feature map F with C channels and spatial dimensions H × W, global average pooling (GAP) aggregates the spatial information for each channel:
$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_c(i, j)$ (1)
where zc represents the global descriptor for the c-th channel. The aggregated descriptors are passed through a two-layer feedforward network to compute channel-wise attention weights:
$w = \sigma\left(W_2 \, \mathrm{ReLU}(W_1 z)\right)$ (2)
where W1 and W2 are learnable weight matrices, ReLU introduces non-linearity, and σ is the sigmoid activation function that normalizes the weights to between 0 and 1. The original feature map is scaled by the computed channel weights:
$\tilde{F}_c = w_c \cdot F_c$ (3)
This ensures that the network emphasizes more informative channels. Spatial attention identifies spatial regions of importance within the feature map, highlighting areas with critical details such as edges or textures. Spatial attention uses both global average pooling and max pooling across the channel dimension to generate two 2D descriptors:
$F_{\mathrm{avg}} = \mathrm{AvgPool}_{\mathrm{ch}}(\tilde{F}), \quad F_{\mathrm{max}} = \mathrm{MaxPool}_{\mathrm{ch}}(\tilde{F})$ (4)
These descriptors provide complementary information about spatial importance. The two descriptors are concatenated along the channel axis and passed through a convolutional layer with a sigmoid activation:
$M_s = \sigma\left(\mathrm{Conv}_k([F_{\mathrm{avg}}; F_{\mathrm{max}}])\right)$ (5)
Here, Convk is a convolution operation with kernel size k, and σ normalizes the mask. The spatial attention mask scales the feature map:
$\hat{F} = M_s \odot \tilde{F}$ (6)
The hybrid attention mechanism integrates channel and spatial attention sequentially. Channel attention is applied first to recalibrate the feature channels, followed by spatial attention to highlight significant spatial regions within the recalibrated channels. The final output of the hybrid attention mechanism is given by
$F_{\mathrm{out}} = \mathrm{SA}\big(\mathrm{CA}(F)\big)$ (7)
This two-step process ensures that the network captures both global channel dependencies and local spatial relationships, leading to a refined feature representation. Additionally, the MBGPIN incorporates a generative-prior-based enhancement, which CBAM does not include. The generative prior pathway further refines high-frequency details, allowing the model to recover fine textures and structural information more effectively. By integrating generative priors and an adaptive attention fusion mechanism, the MBGPIN achieves a more balanced, efficient, and context-aware feature enhancement strategy compared to traditional sequential attention models.
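A compact PyTorch sketch of this channel-then-spatial attention scheme, following Equations (1)–(7), is shown below; the reduction ratio and the 7 × 7 spatial kernel are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    """Minimal channel + spatial attention following Eqs. (1)-(7); the reduction ratio
    and the spatial kernel size are illustrative assumptions."""

    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        # Channel attention: GAP -> two-layer FFN (W1, ReLU, W2) -> sigmoid weights
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: concat(avg-pool, max-pool over channels) -> Conv_k -> sigmoid mask
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=kernel_size, padding=kernel_size // 2),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_mlp(x)                          # Eqs. (1)-(3): channel recalibration
        avg_map = x.mean(dim=1, keepdim=True)                # Eq. (4): channel-wise average descriptor
        max_map = x.max(dim=1, keepdim=True).values          # Eq. (4): channel-wise max descriptor
        mask = self.spatial_conv(torch.cat([avg_map, max_map], dim=1))  # Eq. (5)
        return x * mask                                      # Eqs. (6)-(7): spatial scaling
```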
3.3. Adaptive Generative Prior Fusion (AGPF)
The AGPF module is a central component of the MBGPIN, designed to dynamically integrate features extracted from the feature extraction pathway and the generative prior pathway (Figure 2). This fusion mechanism ensures that both the learned multiscale spatial features and the external high-resolution priors contribute optimally to the final representation. The adaptive nature of AGPF allows it to balance these two sources of information based on their relevance for specific regions and features, enabling the precise and high-fidelity reconstruction of satellite images.
AGPF operates by aligning and merging feature maps based on their contextual relevance. Given two feature maps, one from the feature extraction pathway and another from the generative prior pathway, it computes a similarity score for each spatial location. This score determines the relative importance of the generative priors and the extracted features. Specifically, the module calculates the cosine similarity between the corresponding feature vectors at each spatial location to evaluate their alignment. The similarity score serves as a weighting factor, dynamically controlling the contribution of each pathway to the final fused feature representation. Let Ffeat denote the feature map produced by the feature extraction pathway and Fgen the feature map from the generative prior pathway. The cosine similarity between them is computed for each spatial location and is expressed as
$\alpha(i,j) = \frac{F_{\mathrm{feat}}(i,j) \cdot F_{\mathrm{gen}}(i,j)}{\lVert F_{\mathrm{feat}}(i,j) \rVert \, \lVert F_{\mathrm{gen}}(i,j) \rVert}$ (8)
where α(i,j) represents the similarity score for spatial position (i,j). This score, normalized to between 0 and 1, reflects the degree of alignment between the two feature maps at that location. The fusion process uses these similarity scores to combine the feature maps dynamically. For each spatial position, the fused feature map Ffusion is calculated as
$F_{\mathrm{fusion}}(i,j) = \alpha(i,j)\, F_{\mathrm{gen}}(i,j) + \big(1 - \alpha(i,j)\big)\, F_{\mathrm{feat}}(i,j)$ (9)
This formulation ensures that areas with higher alignment with the generative priors rely more on Fgen, while regions requiring local texture recovery are dominated by Ffeat. To further enhance the quality of fusion, the AGPF module employs an alignment mechanism to reduce the inconsistencies between the feature maps before merging. This is achieved by minimizing the difference between the feature representations through a mean-squared error-based alignment loss:
$\mathcal{L}_{\mathrm{align}} = \frac{1}{N} \sum_{i,j} \big\lVert F_{\mathrm{feat}}(i,j) - F_{\mathrm{gen}}(i,j) \big\rVert_2^2$ (10)
This loss ensures that the features from both pathways are well aligned, reducing artifacts and improving the quality of the fused representation. The fused feature map generated by the AGPF module provides a unified representation that benefits from both the generative priors’ global knowledge and the spatial features’ local information. This representation is subsequently passed to the reconstruction module, where it is transformed into a high-resolution output. The AGPF module’s dynamic weighting mechanism ensures that the contributions of the pathways are contextually optimized, addressing the variability in the feature importance across the different regions of the image. By integrating features adaptively, the AGPF module enhances the reconstruction of intricate details and textures while maintaining structural consistency and minimizing artifacts. This adaptive and efficient fusion process significantly contributes to the high performance and scalability of the MBGPIN for achieving super-resolution with satellite images.
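The following is a minimal sketch of the similarity-weighted fusion in Equations (8)–(10); the explicit remapping of the cosine score from [−1, 1] to [0, 1] is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AGPFSketch(nn.Module):
    """Sketch of adaptive generative prior fusion (Eqs. (8)-(10)): a cosine-similarity
    map weights the generative priors against the extracted features at every pixel."""

    def forward(self, f_feat: torch.Tensor, f_gen: torch.Tensor):
        # Eq. (8): cosine similarity along the channel dimension, one score per spatial location.
        alpha = F.cosine_similarity(f_feat, f_gen, dim=1).unsqueeze(1)
        alpha = (alpha + 1.0) / 2.0                   # map [-1, 1] -> [0, 1] (assumed normalization)
        # Eq. (9): similarity-weighted blend of the two pathways.
        f_fused = alpha * f_gen + (1.0 - alpha) * f_feat
        # Eq. (10): MSE alignment loss encouraging consistent pathway features before fusion.
        align_loss = F.mse_loss(f_feat, f_gen)
        return f_fused, align_loss
```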
3.4. Reconstruction and Loss Functions
The reconstruction and loss functions in the MBGPIN play a pivotal role in transforming the fused feature representation into a high-resolution output and ensuring that the reconstructed image closely aligns with the ground truth. This section outlines the reconstruction process and the carefully designed loss functions used to train the MBGPIN, enabling it to produce high-quality super-resolved satellite images. The reconstruction module in the MBGPIN takes the fused feature map Ffusion from the AGPF module and transforms it into the final high-resolution (HR) image. The fused feature map, of size H × W × C, where H and W denote the spatial dimensions and C represents the number of channels, is upsampled to the desired HR resolution. The upsampling is performed using a pixel shuffle operation, which rearranges the channel dimensions into spatial dimensions, increasing the resolution by a factor of r, the desired scaling factor:
$I_{\mathrm{HR}}^{\mathrm{init}} = \mathrm{PixelShuffle}_r\big(F_{\mathrm{fusion}}\big)$ (11)
Pixel shuffle ensures efficient upscaling without introducing artifacts, maintaining the spatial coherence of the features. The initial HR image is passed through a series of convolutional layers to refine the details and remove any remaining inconsistencies. These layers use small kernel sizes to fine-tune the reconstructed image, producing the final HR output IHR. To train the MBGPIN effectively, a combination of loss functions is employed, ensuring that the reconstructed image is not only pixel-wise accurate but also perceptually consistent and structurally aligned with the ground truth. The Charbonnier loss is a robust variation of the L1 loss, designed to handle outliers and noise effectively. It minimizes the pixel-wise differences between the reconstructed image IHR and the ground truth IGT:
$\mathcal{L}_{\mathrm{char}} = \sqrt{\lVert I_{\mathrm{HR}} - I_{\mathrm{GT}} \rVert^2 + \epsilon^2}$ (12)
Here, ϵ is a small constant that stabilizes the loss when the differences are small. This loss ensures that the reconstruction is accurate at the pixel level. While pixel-wise losses focus on individual pixel differences, perceptual loss evaluates the similarity between high-level feature representations of the reconstructed and ground truth images. These features are extracted from a pretrained network, such as VGG, at intermediate layers:
$\mathcal{L}_{\mathrm{perc}} = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \phi_i(I_{\mathrm{HR}}) - \phi_i(I_{\mathrm{GT}}) \big\rVert_2^2$ (13)
Here, ϕi represents the feature maps at the i-th layer of the pretrained network, and N denotes the total number of layers used. This loss enhances texture fidelity and ensures that the reconstructed image appears visually realistic. To account for the structural and perceptual quality of the reconstructed image, the SSIM loss measures the similarity in luminance, contrast, and structure between IHR and IGT:
$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}(I_{\mathrm{HR}}, I_{\mathrm{GT}})$ (14)
By maximizing SSIM, the model ensures that the reconstructed image preserves structural details and minimizes perceptual distortions. The AGPF module introduces a specific alignment loss to ensure that the feature maps from the feature extraction pathway and the generative prior pathway are well aligned before fusion. To further enhance perceptual realism, the adversarial loss from a discriminator D can be used to encourage the generator to produce outputs indistinguishable from real HR images:
$\mathcal{L}_{\mathrm{adv}} = -\mathbb{E}\big[\log D(I_{\mathrm{HR}})\big]$ (15)
The overall training objective combines these losses with appropriate weights λi to balance their contributions:
$\mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{char}} + \lambda_2 \mathcal{L}_{\mathrm{perc}} + \lambda_3 \mathcal{L}_{\mathrm{SSIM}} + \lambda_4 \mathcal{L}_{\mathrm{align}} + \lambda_5 \mathcal{L}_{\mathrm{adv}}$ (16)
Here, λ1, λ2, …, λ5 are hyperparameters determined through empirical validation to ensure optimal performance.
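For illustration, a sketch of how such a composite objective might be assembled in PyTorch is given below; the lambda weights, the truncated VGG-19 cut-off, the external SSIM implementation, and the discriminator interface are assumptions rather than the settings used in this work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights
from pytorch_msssim import ssim  # third-party SSIM implementation (assumed dependency)

class CompositeSRLoss(nn.Module):
    """Sketch of the training objective in Eq. (16); weights and layer choices are placeholders."""

    def __init__(self, lambdas=(1.0, 0.1, 0.1, 0.05, 0.01), eps: float = 1e-3):
        super().__init__()
        self.lambdas, self.eps = lambdas, eps
        # Truncated pretrained VGG-19 used as the feature extractor for the perceptual term.
        self.vgg = vgg19(weights=VGG19_Weights.DEFAULT).features[:16].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False

    def forward(self, sr, gt, align_loss, disc_score_on_sr):
        l_char = torch.sqrt((sr - gt) ** 2 + self.eps ** 2).mean()          # Eq. (12) Charbonnier
        l_perc = F.mse_loss(self.vgg(sr), self.vgg(gt))                     # Eq. (13) perceptual
        l_ssim = 1.0 - ssim(sr, gt, data_range=1.0)                         # Eq. (14) SSIM loss
        l_adv = -torch.log(disc_score_on_sr + 1e-8).mean()                  # Eq. (15) adversarial
        l1, l2, l3, l4, l5 = self.lambdas
        return l1 * l_char + l2 * l_perc + l3 * l_ssim + l4 * align_loss + l5 * l_adv  # Eq. (16)
```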
4. Experimental Setup
The experimental setup was designed to rigorously evaluate the performance of the MBGPIN in super-resolution satellite image tasks. This section describes the datasets, evaluation metrics, training configurations, and baseline comparisons used to validate the effectiveness of the proposed model. The goal was to demonstrate the MBGPIN’s ability to produce high-quality, super-resolved satellite images while maintaining computational efficiency and generalizability across diverse conditions. The experiments were conducted on benchmark remote sensing datasets, which encompassed a wide range of satellite imagery with varying resolutions, spatial patterns, and textures. These datasets ensured that the model was tested on real-world scenarios that reflected the challenges encountered in remote sensing applications. To assess performance, quantitative metrics such as the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) were employed alongside qualitative visual inspections to evaluate the reconstruction fidelity and perceptual quality of the generated images. To establish a comprehensive comparison, the MBGPIN was evaluated against state-of-the-art methods, including both conventional interpolation techniques and advanced deep-learning-based super-resolution models. The training and testing processes were configured to ensure reproducibility, with consistent data preprocessing, augmentation, and hyperparameter settings. Additionally, the computational efficiency of the MBGPIN was analyzed in terms of FLOPs and inference time, highlighting its scalability for large-scale satellite image datasets. This section provides a detailed description of the experimental framework, ensuring a transparent and thorough evaluation of the proposed methodology.
To analyze the impact of hyperparameter selection on the MBGPIN’s performance, a systematic sensitivity study was conducted by varying key parameters while keeping all other factors constant (Table 2). The sensitivity analysis focused on five critical hyperparameters: learning rate, batch size, feature map depth, attention weight (α), and fusion weight (λ).
These sensitivity experiments confirmed that hyperparameter selection plays a crucial role in optimizing the MBGPIN’s performance. The final hyperparameter settings were chosen based on a balance between computational efficiency, training stability, and reconstruction quality, ensuring robust and generalizable model performance.
4.1. Datasets
We evaluated the MBGPIN on benchmark datasets, including UC Merced, NWPU-RESISC45, and RSSCN7, to ensure a comprehensive comparison across diverse satellite imagery (Table 3). These datasets include diverse types of satellite imagery, varying in spatial resolution, spectral characteristics, and geographic coverage. Such diversity ensured that the MBGPIN was tested under a wide range of real-world conditions, highlighting its ability to generalize across different scenarios and applications.
4.1.1. UC Merced Land Use Dataset
The UC Merced Land Use dataset is a widely used benchmark for remote sensing image analysis. It contains 2100 images, each with a spatial resolution of 256 × 256 pixels, covering 21 land-use categories, such as agricultural, residential, and industrial areas. All images are derived from aerial photographs and have consistent spectral characteristics. This dataset is ideal for evaluating the performance of the MBGPIN on relatively high-resolution satellite images with diverse spatial patterns and land-use classes. For super-resolution tasks, low-resolution (LR) images are synthetically generated by downscaling the original high-resolution (HR) images using bicubic interpolation. The downscaling factors used in this study were 2 × 2, 4 × 4, and 8 × 8, corresponding to increasing levels of reconstruction difficulty.
4.1.2. NWPU-RESISC45 Dataset
The NWPU-RESISC45 dataset is a large-scale benchmark designed for remote sensing image scene classification. It comprises 31,500 images spanning 45 scene classes, with each image having a spatial resolution of 256 × 256 pixels. This dataset provides a broad spectrum of land-cover types, including natural landscapes and man-made structures. The diversity in scene types and textures makes it an excellent choice for assessing the robustness of the MBGPIN in handling complex spatial features. For super-resolution experiments, the HR images are downscaled to generate LR inputs using the same bicubic interpolation process as described for the UC Merced dataset. The dataset is split into training, validation, and testing sets, maintaining a consistent distribution of scene classes across all subsets.
4.1.3. RSSCN7 Dataset
The RSSCN7 dataset is another widely recognized benchmark in the remote sensing community. It includes 2800 images from seven distinct scene classes, such as grasslands, urban areas, and water bodies. Each image has a spatial resolution of 400 × 400 pixels, offering higher detail and variability compared to the UC Merced and NWPU-RESISC45 datasets. The relatively large image size and rich textures provide an additional challenge for super-resolution models, particularly for high scaling factors. Low-resolution versions of the images were generated by downscaling the HR data, and the model was evaluated on its ability to reconstruct these LR inputs into their original high-resolution form.
4.1.4. Dataset Preparation
For all datasets, the HR images were preprocessed to normalize the pixel values to the range [0, 1]. The LR images were generated by applying bicubic downscaling with scaling factors of 2 × 2, 4 × 4, and 8 × 8. This ensured consistency across datasets and allowed for fair comparisons of performance across different scaling levels. While this method is widely used in super-resolution benchmarks, it does not fully account for real-world degradations such as sensor noise, atmospheric interference, and compression artifacts, which are commonly present in satellite imagery. To evaluate the robustness of the MBGPIN under realistic conditions, additional experiments were conducted by introducing controlled noise levels and motion blur into the LR images. The results indicated that while the MBGPIN effectively reconstructed HR details under synthetic conditions, its performance was affected when real-world degradations were introduced, highlighting the importance of training on diverse, real-world datasets. During training, data augmentation techniques such as random cropping, flipping, and rotation were applied to improve generalization and reduce overfitting. The datasets were split into training, validation, and testing subsets, typically following an 80-10-10 distribution, ensuring that the model was evaluated on unseen data. The combination of benchmark and custom datasets provided a comprehensive evaluation framework for the MBGPIN. The benchmark datasets ensured compatibility with existing research, allowing direct comparisons with state-of-the-art methods. The custom dataset, on the other hand, offered a real-world perspective, testing the model on large-scale, unstandardized satellite imagery. By using these diverse datasets, the MBGPIN’s capability to generalize across different resolutions, geographic regions, and scaling factors was thoroughly assessed, establishing its robustness and versatility in super-resolution satellite image tasks.
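As an illustration of this preparation pipeline, a minimal sketch of bicubic LR/HR pair generation with normalization and light augmentation is given below; the function name, path argument, and augmentation choices are illustrative rather than the exact preprocessing code used in this study.

```python
import random

import numpy as np
from PIL import Image

def make_lr_hr_pair(hr_path: str, scale: int = 4, augment: bool = True):
    """Generate a bicubic LR/HR training pair with [0, 1] normalization and light augmentation."""
    hr = Image.open(hr_path).convert("RGB")
    if augment:
        if random.random() < 0.5:
            hr = hr.transpose(Image.FLIP_LEFT_RIGHT)            # random horizontal flip
        hr = hr.rotate(90 * random.randint(0, 3), expand=True)   # random 90-degree rotation
    w, h = hr.size
    w, h = w - w % scale, h - h % scale                          # crop so both sides divide by the scale
    hr = hr.crop((0, 0, w, h))
    lr = hr.resize((w // scale, h // scale), Image.BICUBIC)      # bicubic downscaling
    lr_arr = np.asarray(lr, dtype=np.float32) / 255.0            # normalize pixel values to [0, 1]
    hr_arr = np.asarray(hr, dtype=np.float32) / 255.0
    return lr_arr, hr_arr
```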
4.2. Evaluation Metrics
To comprehensively evaluate the performance of the MBGPIN, a combination of quantitative and qualitative metrics was used. These metrics assessed the fidelity of the reconstructed high-resolution images in terms of pixel accuracy, structural similarity, perceptual quality, and computational efficiency. By employing multiple metrics, the evaluation captured both the technical and perceptual aspects of super-resolution in satellite images. To evaluate the quality of super-resolution outputs, we used the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) as primary metrics. While the PSNR measures pixel-wise fidelity by assessing the mean squared error (MSE) between the reconstructed and ground truth images, it does not capture perceptual differences that are important in high-resolution imagery. The SSIM, on the other hand, is designed to quantify structural similarity by comparing the luminance, contrast, and spatial structure between images. Unlike the PSNR, which treats images as pixel-wise intensity maps, the SSIM models the human visual system’s (HVS’s) sensitivity to structural patterns, making it more effective in assessing textural consistency and geometric alignment. The PSNR is a widely used metric for measuring the pixel-level accuracy of reconstructed images. It quantifies the similarity between the reconstructed high-resolution image IHR and the ground truth IGT. The PSNR is calculated in decibels (dB) and is defined as
$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right)$ (17)
where MAX is the maximum possible pixel value, and MSE is the mean squared error:
$\mathrm{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(I_{\mathrm{HR}}(i,j) - I_{\mathrm{GT}}(i,j)\big)^2$ (18)
Higher PSNR values indicate better reconstruction quality, with a greater focus on minimizing pixel-wise errors. The SSIM evaluates the perceptual similarity between the reconstructed image and the ground truth by measuring the structural information, luminance, and contrast. It provides a score between 0 and 1, where a higher score indicates greater similarity. The SSIM is computed as
$\mathrm{SSIM}(I_{\mathrm{HR}}, I_{\mathrm{GT}}) = \frac{(2\mu_{\mathrm{HR}}\mu_{\mathrm{GT}} + C_1)(2\sigma_{\mathrm{HR,GT}} + C_2)}{(\mu_{\mathrm{HR}}^2 + \mu_{\mathrm{GT}}^2 + C_1)(\sigma_{\mathrm{HR}}^2 + \sigma_{\mathrm{GT}}^2 + C_2)}$ (19)
where μHR and μGT are the means of IHR and IGT, σ²HR and σ²GT are the variances, σHR,GT is the covariance between IHR and IGT, and C1 and C2 are small constants to stabilize the calculation. The SSIM is particularly effective at capturing structural differences, making it ideal for remote sensing applications where structural consistency is critical. The normalized root mean square error (NRMSE) is another pixel-level metric that measures the deviation between the reconstructed image and the ground truth, normalized to the range of pixel values. It is defined as
$\mathrm{NRMSE} = \frac{\sqrt{\mathrm{MSE}}}{I_{\max} - I_{\min}}$ (20)
Lower NRMSE values indicate better reconstruction accuracy, with reduced deviations from the ground truth. Although the PSNR and SSIM measure fidelity, they may not always reflect the perceptual quality of reconstructed images. To address this, a perceptual metric was employed by evaluating high-level feature similarity. This was achieved using a pretrained VGG network to extract feature maps at specific layers:
$d_{\mathrm{perc}} = \frac{1}{N} \sum_{i=1}^{N} \big\lVert \phi_i(I_{\mathrm{HR}}) - \phi_i(I_{\mathrm{GT}}) \big\rVert_2$ (21)
Here, ϕi represents the feature map at the i-th layer of the VGG network, and N is the number of layers used. This metric ensured that the reconstructed images maintained perceptual realism. By combining these metrics, the evaluation provided a holistic view of the MBGPIN’s performance, ensuring that it is robust, perceptually accurate, and computationally efficient for super-resolution satellite image tasks.
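For reference, minimal implementations of the pixel-level metrics in Equations (17)–(20) are sketched below; the SSIM computation relies on scikit-image as an assumed dependency, and all functions assume images normalized to [0, 1].

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(sr: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB following Eqs. (17)-(18); images are assumed normalized to [0, max_val]."""
    mse = np.mean((sr - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_score(sr: np.ndarray, gt: np.ndarray) -> float:
    """SSIM following Eq. (19), computed with scikit-image's reference implementation."""
    return float(structural_similarity(sr, gt, data_range=1.0, channel_axis=-1))

def nrmse(sr: np.ndarray, gt: np.ndarray) -> float:
    """Root-mean-square error normalized by the ground-truth intensity range, Eq. (20)."""
    rmse = np.sqrt(np.mean((sr - gt) ** 2))
    return float(rmse / (gt.max() - gt.min()))
```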
4.3. Baseline Comparisons
Baseline comparisons were essential for establishing the effectiveness of the MBGPIN relative to existing methods in super-resolution satellite image tasks. These comparisons included traditional interpolation techniques, deep-learning-based models, and advanced generative and domain-specific approaches. By benchmarking the MBGPIN against these methods, its superiority in terms of reconstruction quality, perceptual fidelity, and computational efficiency was demonstrated. Bicubic and Lanczos interpolation served as lower baselines. These methods are computationally efficient but fail to recover high-frequency details, resulting in blurry and artifact-prone outputs. Models like SRCNN [4], EDSR, and RCAN were selected for their historical significance and state-of-the-art performance in super-resolution tasks. These methods highlight the evolution of CNN-based approaches in learning feature representations. Figure 3 presents a qualitative comparison of the super-resolution outputs from the different models for a high-resolution satellite image. The red boxes highlight regions of interest where differences in texture sharpness, edge clarity, and structure reconstruction can be observed. The comparison demonstrates how various models handle fine details, building textures, and structural edges, which are crucial in remote sensing applications. While some methods exhibit blurring and loss of detail, the MBGPIN produces sharper textures and maintains geometric consistency, making it more effective for reconstructing high-frequency details in urban environments.
GAN-based models like SRGAN and ESRGAN [12] demonstrate the ability of adversarial training to enhance perceptual quality but often at the expense of structural accuracy. Specialized remote sensing models, including TBMRA and EPRN, provide insights into domain-specific solutions tailored for satellite image datasets. Quantitative metrics, including the PSNR, SSIM, and normalized root mean square error (NRMSE), were used to assess the pixel-wise and structural accuracy of the reconstructed images. Table 4 summarizes the performance of the MBGPIN compared to the baseline methods on the UC Merced dataset with a scaling factor of ×4.
The computational efficiency of the MBGPIN was compared against other methods using metrics such as floating point operations (FLOPs), inference time, and model size. As seen in the quantitative comparison table, the MBGPIN achieves a balance between computational efficiency and performance. Its lightweight design, combined with the integration of generative priors, ensures scalability for large-scale satellite image datasets without compromising reconstruction quality. The MBGPIN achieves state-of-the-art performance across all metrics, demonstrating its superiority over both traditional and modern methods. The integration of multiscale feature extraction, hybrid attention mechanisms, and generative priors enables it to strike a balance between perceptual quality, structural accuracy, and computational efficiency. The performance gains are particularly notable in high-frequency regions and complex spatial patterns, making the MBGPIN a robust solution for super-resolution satellite image tasks. These baseline comparisons validate the effectiveness of the MBGPIN, establishing it as a versatile and efficient model for real-world remote sensing applications.
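As an illustration of how such efficiency figures can be obtained, a simple parameter-count and inference-time measurement sketch is given below; the input size and run count are arbitrary, and FLOP counting would require an external profiler such as ptflops or fvcore, which is not shown.

```python
import time

import torch

@torch.no_grad()
def measure_efficiency(model: torch.nn.Module, input_size=(1, 3, 64, 64), runs: int = 20):
    """Rough parameter-count and average inference-time measurement on an LR-sized input."""
    model.eval()
    x = torch.randn(*input_size)
    n_params = sum(p.numel() for p in model.parameters())
    model(x)                                            # warm-up pass before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    avg_time = (time.perf_counter() - start) / runs
    return {"params_millions": n_params / 1e6, "avg_inference_seconds": avg_time}
```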
4.4. Comparison with SOTA Models
The MBGPIN’s performance was rigorously compared with state-of-the-art models across various dimensions, including reconstruction quality, perceptual fidelity, and computational efficiency (Figure 4). The selected models encompassed a range of methodologies, from deep residual networks to generative adversarial approaches and domain-specific architectures, ensuring a comprehensive evaluation of the MBGPIN’s capabilities. Figure 4 provides a side-by-side visual analysis of the performance of the different super-resolution models on an aerial image dataset. The red bounding boxes highlight critical regions, including aircraft structures and surrounding ground details, where fine-resolution preservation is essential. The differences in edge sharpness, contrast, and detail fidelity are apparent among the different models. The MBGPIN demonstrates better high-frequency detail recovery, particularly in preserving the edges of aircraft and reducing noise in shadowed areas. The visual contrast improvements in this figure further illustrate the MBGPIN’s superior feature reconstruction capabilities compared to conventional methods.
Enhanced deep residual networks (EDSRs) and residual channel attention network (RCAN) represent the pinnacle of CNN-based approaches for super-resolution tasks, known for their robust feature extraction and effective handling of residual learning. These models achieve strong reconstruction quality, particularly in recovering global spatial structures. However, their reliance on deep architectures and computationally intensive operations often results in inefficiencies, making them less scalable for satellite image datasets. The EDSR provides sharp outputs but struggles with high-frequency detail recovery, whereas the RCAN enhances channel-wise feature selection but at the cost of higher computational overhead. Generative adversarial models such as the SRGAN and ESRGAN emphasize perceptual quality through adversarial training. These models excel in producing visually appealing images with enhanced textures but frequently introduce artifacts, particularly in regions with repetitive or fine patterns. Structural accuracy often takes a backseat in these models, especially in applications requiring strict fidelity to the original image geometry. The SRGAN demonstrates notable improvements in visual quality but lacks the refinement introduced in the ESRGAN, which mitigates artifacts while improving texture realism. Nevertheless, these models are computationally expensive and may exhibit instability during training.
Specialized remote sensing models like the two-branch multiscale residual attention (TBMRA) network focus on domain-specific challenges. The TBMRA network effectively combines multiscale feature extraction with attention mechanisms to handle the unique spatial patterns in satellite imagery. However, the absence of external generative priors limits its ability to recover intricate details, especially in high-frequency regions like urban structures or dense vegetation. The quantitative comparisons focused on metrics such as the PSNR, SSIM, and normalized root mean square error (NRMSE). Table 5 presents results on the UC Merced dataset with a scaling factor of ×4.
The MBGPIN addresses these limitations by integrating multiscale feature extraction, hybrid attention mechanisms, and generative priors through the AGPF module. This integration allows the MBGPIN to balance local spatial features and global priors dynamically, resulting in superior reconstruction quality. Quantitatively, the MBGPIN achieves the highest PSNR and SSIM scores across the datasets, outperforming all baseline models. For example, on the UC Merced dataset with a scaling factor of ×4, the MBGPIN achieves a PSNR of 31.34 dB and an SSIM of 0.912, surpassing the RCAN, which scores 30.86 dB and 0.873, respectively. Additionally, the MBGPIN’s lightweight design leads to reduced FLOPs and faster inference times, making it more computationally efficient compared to its counterparts. In terms of qualitative performance, the MBGPIN consistently outperforms the baseline models upon visual inspection, particularly in preserving fine textures, sharp edges, and complex patterns. While the EDSR and RCAN provide structurally consistent outputs, they lack the perceptual richness observed in the MBGPIN’s reconstructions. GAN-based models like the SRGAN and ESRGAN enhance visual appeal but fail to maintain structural integrity, often introducing visible distortions. In contrast, the MBGPIN delivers visually realistic and structurally accurate outputs without compromising on either metric. The comparison highlights the MBGPIN’s ability to bridge the gap between traditional deep learning methods and generative approaches. By leveraging external priors and efficient feature fusion, the MBGPIN achieves a unique balance of accuracy, realism, and scalability. These results firmly establish the MBGPIN as a state-of-the-art solution for satellite image super-resolution tasks, excelling in diverse and challenging remote sensing applications.
To evaluate the contribution of each module in the MBGPIN, an ablation study was conducted by systematically removing or modifying specific components and analyzing the effect on performance. A conventional deep CNN-based super-resolution model without generative priors or the adaptive fusion module was used. We also considered a model variant where the VQGAN prior pathway was removed, allowing a direct comparison between generative prior integration and conventional feature extraction. For another model variant, we removed the adaptive generative prior fusion module, instead relying on the direct concatenation of feature maps. The quantitative results, shown in Table 6, demonstrate that removing the generative prior pathway led to a notable decrease in texture realism, while removing the AGPF module resulted in reduced structural consistency. These findings confirm that both the generative priors and adaptive fusion mechanism contribute significantly to super-resolution performance by improving texture sharpness and structural preservation.
The MBGPIN was trained using a combination of multiple loss functions, each contributing to different aspects of super-resolution quality. To understand their contributions, an ablation study was performed by training the model with different loss function combinations and evaluating the results.
Experiments were conducted by removing one loss function at a time and measuring its effect on the PSNR. The results in Table 7 indicate that removing the perceptual loss significantly degraded texture sharpness, while removing adversarial loss resulted in overly smooth outputs with reduced realism. The L1 loss played a crucial role in ensuring pixel accuracy, while structure loss helped preserve fine details in high-frequency regions. The ablation results confirm that the combination of multiple loss functions enhances the MBGPIN’s ability to generate high-quality, realistic super-resolved images. These findings provide a deeper understanding of the importance of loss function selection in super-resolution tasks.
In this study, we introduced the multi-branch generative prior integration network (MBGPIN), a novel approach for super-resolution satellite image tasks that addresses the limitations of the existing methodologies by integrating multiscale feature extraction, hybrid attention mechanisms, and generative priors. The MBGPIN’s unique architecture effectively combines the strengths of local spatial features and external generative priors through the AGPF module, resulting in enhanced reconstruction quality and computational efficiency. Comprehensive experiments on benchmark datasets, including UC Merced, NWPU-RESISC45, and RSSCN7, demonstrate that the MBGPIN consistently outperforms state-of-the-art models across multiple metrics. It achieves higher PSNR and SSIM scores, indicating superior pixel accuracy and structural fidelity, while maintaining perceptual realism, as evaluated through qualitative inspections. Furthermore, the MBGPIN’s lightweight design achieves significant reductions in computational overhead, as evidenced by its lower FLOPs and faster inference times, making it scalable for large-scale remote sensing applications. The comparisons with SOTA models highlight the MBGPIN’s capability to preserve high-frequency details and complex textures, which are critical in satellite imagery. Unlike generative models that prioritize perceptual quality at the expense of structural accuracy or CNN-based models that struggle with texture realism, the MBGPIN achieves a balanced trade-off between these objectives. Its hybrid attention mechanism ensures efficient feature extraction, while the use of generative priors enhances detail recovery, especially in challenging regions with fine patterns or intricate structures.
The MBGPIN sets a new benchmark for super-resolution in satellite images by providing a robust, scalable, and efficient solution. Its innovative integration of generative priors and adaptive fusion offers a pathway for future advancements in remote sensing and image reconstruction. Future work will explore the real-time deployment of the MBGPIN on satellite platforms, adaptation to multispectral and hyperspectral data, and further optimization for domain-specific challenges. This study underscores the potential of hybrid architectures in bridging the gap between traditional and generative approaches, paving the way for next-generation super-resolution techniques.
4.5. Scalability and Generalization Analysis
To evaluate the scalability and generalization capability of the MBGPIN, we conducted experiments across multiple datasets and assessed computational efficiency across different input resolutions.
To assess cross-dataset generalization, the MBGPIN was trained on one dataset and tested on multiple unseen datasets to measure its adaptability to different image distributions. The model was evaluated on UC Merced, NWPU-RESISC45, and RSSCN7, each containing diverse landscape structures, varying spectral resolutions, and complex textures. The performance metrics in Table 4 indicate that the MBGPIN maintains high PSNR and SSIM scores across the datasets, demonstrating strong generalization capability.
Computational scalability was analyzed by evaluating the FLOPs and inference time on images with different resolutions. The results indicate that the MBGPIN achieves competitive efficiency, outperforming large-scale models such as Transformer-based SR and diffusion-based SR methods, which require significantly more computational resources. The MBGPIN’s efficient feature extraction and generative prior integration allow it to scale effectively while maintaining high-quality reconstructions. These results confirm that the MBGPIN is well suited for large-scale remote sensing applications where both high-resolution reconstruction quality and computational efficiency are critical.
5. Discussion
The proposed multi-branch generative prior integration network (MBGPIN) demonstrates several key advantages in the context of super-resolution satellite image tasks. Through the integration of multiscale feature extraction, hybrid attention mechanisms, and generative priors, the MBGPIN effectively balances texture detail recovery and structural consistency. The experimental results across multiple benchmark datasets confirm its ability to outperform conventional CNN-based and Transformer-based models in both the PSNR and SSIM metrics. The incorporation of the adaptive generative prior fusion (AGPF) module enhances the adaptability of the MBGPIN, allowing dynamic feature weighting based on the contextual similarity between the extracted CNN features and generative priors. This approach reduces the artifacts commonly observed in standard super-resolution models while ensuring realistic texture reconstruction. Additionally, the use of hybrid attention mechanisms helps refine spatial and channel-wise features, improving feature extraction quality. Despite its strong performance, the MBGPIN has several limitations that should be addressed in future research. The effectiveness of MBGPIN in achieving super-resolution in satellite imagery is influenced by the availability and quality of generative priors. Since the model utilizes VQGAN-based priors that were trained on datasets within the 30 cm to 1 m resolution range, its ability to generalize beyond this scale is inherently constrained. While the MBGPIN significantly outperforms traditional methods in recovering textures, edges, and fine-grained structures, its effectiveness diminishes when attempting to generate ultra-high-resolution details beyond the available prior knowledge. One key limitation is that generative priors are effective at inferring missing fine textures but cannot fully reconstruct details that exceed their training distribution. This means that while the MBGPIN can enhance image quality within the sub-meter range, resolutions below 10 cm per pixel would require either additional high-resolution training data or a higher-dimensional generative prior model trained specifically for such tasks. Future improvements could explore hybrid approaches combining physical-based super-resolution techniques with learned generative models, allowing for better adaptation across different spatial scales. Additionally, domain adaptation methods could be leveraged to extend the MBGPIN’s capability to handle finer resolution levels without compromising feature consistency.
One of the primary challenges is texture reconstruction in extremely high-resolution images, where the model sometimes struggles with over-smoothing or minor inconsistencies in regions with repetitive patterns. While generative priors contribute to high-frequency detail recovery, their reliance on pretrained features can introduce artifacts when tested on unseen image distributions. To further improve the adaptability and robustness of the MBGPIN, several potential research directions can be explored. First, integrating contrast-aware loss functions or self-supervised learning techniques could enhance the model’s ability to recover details in challenging conditions, such as low-light or shadowed regions. Second, optimizing the computational efficiency of the MBGPIN through quantization and pruning techniques could facilitate deployment on real-time satellite processing platforms. Additionally, exploring hybrid fusion techniques that combine traditional convolutional operations with implicit neural representations could provide a more flexible and adaptive approach to super-resolution tasks. Future studies could also investigate the potential of using domain adaptation techniques to further enhance the generalization ability of the MBGPIN across diverse satellite image sources. While the MBGPIN presents significant advancements in super-resolution for satellite images, continued refinements in model efficiency, generalization, and robustness will be key to its practical deployment in large-scale remote sensing applications.
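As a pointer to the pruning direction mentioned above, the snippet below applies PyTorch’s built-in L1 unstructured pruning to every convolution layer; the 30% sparsity level is an arbitrary example, and post-training quantization would be a separate, analogous step.

```python
# Toy illustration of magnitude-based pruning for deployment (not part of the paper).
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_convs(model: nn.Module, amount: float = 0.3) -> nn.Module:
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=amount)  # zero the smallest 30% of weights
            prune.remove(module, "weight")  # bake the pruning mask into the weight tensor
    return model
```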
6. Conclusions
The MBGPIN represents a significant advancement in super-resolution for satellite images. By leveraging the integration of multiscale feature extraction, hybrid attention mechanisms, and generative priors, the MBGPIN effectively addresses the limitations of existing methodologies. Introducing the adaptive generative prior fusion (AGPF) module ensures an optimal balance between local spatial features and external priors, resulting in superior reconstruction quality and computational efficiency. Extensive experimental results on benchmark datasets, such as UC Merced, NWPU-RESISC45, and RSSCN7, validate the MBGPIN’s state-of-the-art performance. The model achieves exceptional PSNR and SSIM scores, outperforms competitive baselines in detail recovery, and demonstrates robustness across diverse satellite imagery scenarios. Its lightweight architecture reduces computational overhead, making it scalable for real-world applications. This study establishes a new benchmark in super-resolution for satellite imagery, emphasizing the potential of hybrid architectures to blend traditional and generative approaches effectively. Future research directions include the adaptation of the MBGPIN for multispectral and hyperspectral data, real-time deployment on satellite platforms, and further exploration of domain-specific challenges to enhance its versatility and impact. The MBGPIN sets a promising foundation for next-generation remote sensing technologies, advancing the field toward more accurate and efficient image reconstruction methodologies.
Author Contributions: Methodology, F.S., U.K., M.K., F.B., S.M. and Y.-I.C.; software, F.S., U.K., M.K., F.B. and S.M.; validation, F.B., S.M. and Y.-I.C.; formal analysis, F.S., U.K., M.K., F.B. and Y.-I.C.; resources, F.S., U.K., M.K. and F.B.; data curation, F.B., S.M. and Y.-I.C.; writing—original draft, F.S., U.K., M.K., F.B., S.M. and Y.-I.C.; writing—review and editing, F.S., U.K., M.K., F.B., S.M. and Y.-I.C.; supervision, Y.-I.C.; project administration, S.M. and Y.-I.C. All authors have read and agreed to the published version of the manuscript.
Data Availability Statement: All used datasets are available online with open access.
Conflicts of Interest: The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 1. The architecture of the MBGPIN for super-resolution of satellite images.
Figure 3. Qualitative comparison of super-resolution baseline methods on satellite images.
This table provides a structured comparison of related works by technique, strengths, and weaknesses, highlighting their respective contributions and motivating the need for the MBGPIN model.
Model/Technique | Key Features | Advantages | Limitations |
---|---|---|---|
Bicubic/bilinear interpolation | Mathematical interpolation techniques for upscaling images. | Computationally efficient. | Fails to recover fine details, especially in high-frequency regions like edges and textures. |
SRCNN [4] | Early CNN-based SR model with an end-to-end framework. | Learns LR-to-HR mapping directly, simple architecture. | Limited depth and feature learning capability.
EDSR [6] | Enhanced deep residual networks with improved feature extraction. | Higher reconstruction quality, effective handling of deeper architectures. | Computationally expensive, struggles with texture realism.
RCAN [11] | Uses residual channel attention networks to enhance channel-wise feature learning. | Strong ability to recover textures and fine details. | High computational cost, scalability issues with large datasets.
SRGAN [5] | GAN-based model with adversarial loss. | Generates photo-realistic textures, perceptually appealing results. | Structural accuracy issues, training instability, mode collapse.
ESRGAN [12] | Improved SRGAN with perceptual loss refinement. | Better visual quality and reduced artifacts. | Computationally intensive, occasional structural distortions.
Modified ESRGAN [13] | Implicit neural representation for SR, learns continuous feature representations. | Can upscale images to arbitrary resolutions, smooth interpolation. | High memory consumption, requires significant computational resources.
LIIF [14] | Local implicit function-based SR, continuous-resolution upscaling. | Adaptive resolution enhancement, improved generalization. | Computationally expensive, sensitive to training data distribution.
SinSR [15] | Uses stochastic processes to generate high-quality textures. | Strong detail reconstruction, robustness to noise. | High computational cost, long inference time.
TBMRA [16] | Two-branch multiscale residual attention network for remote sensing. | Combines multiscale feature extraction with attention mechanisms. | Limited by the absence of external priors, struggles with intricate details.
MSAGAN [17] | Multiscale attention GAN with channel and spatial attention modules. | Highlights critical features, suppresses irrelevant details, better perceptual performance. | Computational overhead and dependency on adversarial training.
MBGPIN (proposed, 2025) | Multi-branch generative prior integration with VQGAN priors. | Superior detail recovery, texture preservation, computationally efficient. | Requires external priors for optimal performance. |
Hyperparameter sensitivity analysis.
Hyperparameter | Tested Values | Optimal Value | Effect on Performance |
---|---|---|---|
Learning Rate (LR) | 1 × 10⁻⁵, 1 × 10⁻⁴, 5 × 10⁻⁴, 1 × 10⁻³ | 1 × 10⁻⁴ | Higher values led to instability; lower values resulted in slow convergence.
Batch Size | 8, 16, 32, 64 | 32 | Smaller values caused noisy updates; larger values increased memory usage. |
Feature Map Depth | 32, 64, 128, 256 | 128 | Higher depth improved quality but, beyond 128, had diminishing returns. |
Attention Weight (α) | 0.2, 0.5, 0.7, 1.0 | 0.7 | Lower values weakened attention; higher values oversuppressed minor details. |
Fusion Weight (λ) | 0.1, 0.3, 0.5, 0.7 | 0.5 | Lower values reduced generative guidance; higher values over-relied on priors. |
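Translated into a training setup, the optimal values above correspond to a configuration along the following lines. Only the numeric values are taken from the table; the Adam optimizer, its betas, and the step scheduler are illustrative assumptions, and `model` is a placeholder for the network.

```python
# Training configuration mirroring the optimal hyperparameters in the table above.
import torch

config = {
    "learning_rate": 1e-4,    # optimal LR
    "batch_size": 32,
    "feature_depth": 128,     # feature map channels
    "attention_weight": 0.7,  # α in the hybrid attention blocks
    "fusion_weight": 0.5,     # λ used by the AGPF fusion
}

optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"],
                             betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
```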
Overview of datasets used for MBGPIN evaluation.
Dataset | Resolution | Classes/Scenes | Number of Images | Characteristics | Scaling Factors |
---|---|---|---|---|---|
UC Merced | 256 × 256 | 21 land-use classes | 2100 | Aerial imagery with diverse land uses, including agricultural, residential, and industrial areas. | ×2, ×4, ×8 |
NWPU-RESISC45 | 256 × 256 | 45 scene classes | 31,500 | Large-scale dataset with natural landscapes and human-made structures. | ×2, ×4, ×8 |
RSSCN7 | 400 × 400 | 7 scene classes | 2800 | High-resolution images with rich textures from grasslands, urban areas, and water bodies. | ×2, ×4, ×8 |
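For the ×2/×4/×8 settings, LR inputs are typically synthesized from the HR images by bicubic downsampling, roughly as below. This is a generic illustration rather than the paper's exact degradation pipeline, the file path is hypothetical, and Pillow ≥ 9.1 is assumed for the `Image.Resampling` enum.

```python
# Generate a low-resolution counterpart of an HR satellite image (illustrative).
from PIL import Image

def make_lr(hr_path: str, scale: int) -> Image.Image:
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    return hr.resize((w // scale, h // scale), Image.Resampling.BICUBIC)

lr_x4 = make_lr("uc_merced/agricultural00.tif", scale=4)  # hypothetical file path
```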
Comparison of baseline models.
Model | PSNR (dB) | SSIM | NRMSE | FLOPs (G) | Inference Time (ms) |
---|---|---|---|---|---|
Bicubic Interpolation | 26.34 | 0.721 | 0.098 | N/A | 5.2 |
SRCNN [4] | 27.85 | 0.764 | 0.084 | 64.3 | 17.1
EDSR [6] | 30.12 | 0.846 | 0.057 | 237.9 | 49.4
RCAN [11] | 30.86 | 0.873 | 0.051 | 279.3 | 61.2
SRGAN [5] | 29.76 | 0.857 | 0.062 | 312.4 | 72.3
ESRGAN [12] | 30.03 | 0.865 | 0.058 | 326.8 | 74.9
MBGPIN (proposed) | 31.34 | 0.912 | 0.041 | 175.8 | 38.7 |
Comparison of SOTA models.
Model | PSNR (dB) | SSIM | NRMSE | FLOPs (G) | Inference Time (ms) | Model Size (M) |
---|---|---|---|---|---|---|
TBMRA [16] | 30.12 | 0.846 | 0.057 | 237.9 | 49.4 | 43.1
NST-CL [24] | 30.86 | 0.873 | 0.051 | 279.3 | 61.2 | 65.4
GSAN [31] | 29.76 | 0.857 | 0.062 | 312.4 | 72.3 | 36.8
ESatSR [32] | 30.03 | 0.865 | 0.058 | 326.8 | 74.9 | 42.5
MSAGAN [17] | 30.54 | 0.868 | 0.054 | 201.6 | 43.8 | 25.7
SA-GAN [29] | 27.56 | 0.841 | 0.051 | 320.8 | 51.8 | 27.9
MCWESRGAN [30] | 30.08 | 0.887 | 0.058 | 220.1 | 40.2 | 29.6
FMANet [25] | 30.89 | 0.891 | 0.051 | 225.8 | 40.5 | 25.8
MTTN [26] | 28.56 | 0.871 | 0.062 | 289.9 | 45.8 | 23.6
CHT [27] | 29.89 | 0.890 | 0.059 | 302.9 | 42.7 | 29.1
Modified ESRGAN [13] | 29.65 | 0.885 | 0.060 | 300.1 | 45.9 | 29.9
MBGPIN (Proposed) | 31.34 | 0.912 | 0.041 | 175.8 | 38.7 | 22.6 |
Ablation study of MBGPIN components.
Model Variant | PSNR (dB) | SSIM |
---|---|---|
Baseline CNN (No Priors, No AGPF) | 27.45 | 0.845 |
MBGPIN Without Generative Priors | 28.62 | 0.879 |
MBGPIN Without AGPF | 29.15 | 0.891 |
Full MBGPIN Model | 30.78 | 0.918 |
Ablation study of loss functions.
Loss Function Configuration | PSNR (dB) | SSIM |
---|---|---|
Without L1 Loss | 29.52 | 0.892 |
Without Perceptual Loss | 28.87 | 0.875 |
Without Adversarial Loss | 29.10 | 0.884
Without Structure Loss | 29.34 | 0.890
Full Loss Combination | 30.78 | 0.918 |
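The four terms ablated above can be combined into one training objective of the following schematic form. The weighting coefficients and the gradient-based stand-in for the structure loss are our assumptions, and the perceptual term expects feature maps from a fixed encoder (e.g., VGG) computed elsewhere.

```python
# Schematic combined loss: pixel (L1) + perceptual + adversarial + structure terms.
import torch
import torch.nn.functional as F

def gradient_loss(sr, hr):
    # Simple horizontal/vertical gradient consistency as a stand-in "structure" term.
    dx = lambda t: t[..., :, 1:] - t[..., :, :-1]
    dy = lambda t: t[..., 1:, :] - t[..., :-1, :]
    return F.l1_loss(dx(sr), dx(hr)) + F.l1_loss(dy(sr), dy(hr))

def total_loss(sr, hr, feat_sr, feat_hr, disc_fake_logits,
               w_l1=1.0, w_perc=0.1, w_adv=0.005, w_struct=0.1):
    l_pix = F.l1_loss(sr, hr)
    l_perc = F.l1_loss(feat_sr, feat_hr)                        # encoder feature distance
    l_adv = F.binary_cross_entropy_with_logits(
        disc_fake_logits, torch.ones_like(disc_fake_logits))    # generator tries to fool the critic
    l_struct = gradient_loss(sr, hr)
    return w_l1 * l_pix + w_perc * l_perc + w_adv * l_adv + w_struct * l_struct
```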
References
1. Karwowska, K.; Wierzbicki, D. Using super-resolution algorithms for small satellite imagery: A systematic review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2022; 15, pp. 3292-3312. [DOI: https://dx.doi.org/10.1109/JSTARS.2022.3167646]
2. Umirzakova, S.; Mardieva, S.; Muksimova, S.; Ahmad, S.; Whangbo, T. Enhancing the super-resolution of medical images: Introducing the deep residual feature distillation channel attention network for optimized performance and efficiency. Bioengineering; 2023; 10, 1332. [DOI: https://dx.doi.org/10.3390/bioengineering10111332]
3. Nguyen, N.L.; Anger, J.; Davy, A.; Arias, P.; Facciolo, G. Self-supervised multi-image super-resolution for push-frame satellite images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 19–25 June 2021; pp. 1121-1131.
4. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell.; 2015; 38, pp. 295-307. [DOI: https://dx.doi.org/10.1109/TPAMI.2015.2439281] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26761735]
5. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z. et al. Photo-realistic single image super-resolution using a generative adversarial network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 4681-4690.
6. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; Honolulu, HI, USA, 21–26 July 2017; pp. 136-144.
7. Turimov Mustapoevich, D.; Kim, W. Machine learning applications in sarcopenia detection and management: A comprehensive survey. Healthcare; 2023; 11, 2483. [DOI: https://dx.doi.org/10.3390/healthcare11182483] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37761680]
8. Safarov, F.; Akhmedov, F.; Abdusalomov, A.B.; Nasimov, R.; Cho, Y.I. Real-time deep learning-based drowsiness detection: Leveraging computer-vision and eye-blink analyses for enhanced road safety. Sensors; 2023; 23, 6459. [DOI: https://dx.doi.org/10.3390/s23146459] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/37514754]
9. Chira, D.; Haralampiev, I.; Winther, O.; Dittadi, A.; Liévin, V. Image super-resolution with deep variational autoencoders. European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2022; pp. 395-411.
10. Tuo, Z.; Yang, H.; Fu, J.; Dun, Y.; Qian, X. Learning data-driven vector-quantized degradation model for animation video super-resolution. Proceedings of the IEEE/CVF International Conference on Computer Vision; Paris, France, 1–6 October 2023; pp. 13179-13189.
11. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 286-301.
12. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. ESRGAN: Enhanced super-resolution generative adversarial networks. Proceedings of the European Conference on Computer Vision (ECCV) Workshops; Munich, Germany, 8–14 September 2018.
13. Karwowska, K.; Wierzbicki, D. Modified ESRGAN with Uformer for Video Satellite Imagery Super-Resolution. Remote Sens.; 2024; 16, 1926. [DOI: https://dx.doi.org/10.3390/rs16111926]
14. Chen, Y.; Liu, S.; Wang, X. Learning continuous image representation with local implicit image function. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Nashville, TN, USA, 20–25 June 2021; pp. 8628-8638.
15. Wang, Y.; Yang, W.; Chen, X.; Wang, Y.; Guo, L.; Chau, L.P.; Liu, Z.; Qiao, Y.; Kot, A.C.; Wen, B. SinSR: Diffusion-based image super-resolution in a single step. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; Seattle, WA, USA, 16–22 June 2024; pp. 25796-25805.
16. Patnaik, A.; Bhuyan, M.K.; MacDorman, K.F. A Two-Branch Multi-Scale Residual Attention Network for Single Image Super-Resolution in Remote Sensing Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 6003-6013. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3371710]
17. Wang, C.; Zhang, X.; Yang, W.; Li, X.; Lu, B.; Wang, J. MSAGAN: A new super-resolution algorithm for multispectral remote sensing image based on a multiscale attention GAN network. IEEE Geosci. Remote Sens. Lett.; 2023; 20, pp. 1-5. [DOI: https://dx.doi.org/10.1109/LGRS.2023.3258965]
18. Khaledyan, D.; Amirany, A.; Jafari, K.; Moaiyeri, M.H.; Khuzani, A.Z.; Mashhadi, N. Low-cost implementation of bilinear and bicubic image interpolation for real-time image super-resolution. Proceedings of the 2020 IEEE Global Humanitarian Technology Conference (GHTC); Seattle, WA, USA, 29 October–1 November 2020; pp. 1-5.
19. Jahnavi, M.; Rao, D.R.; Sujatha, A. A Comparative Study Of Super-Resolution Interpolation Techniques: Insights For Selecting The Most Appropriate Method. Procedia Comput. Sci.; 2024; 233, pp. 504-517. [DOI: https://dx.doi.org/10.1016/j.procs.2024.03.240]
20. Zhang, Y.; Li, R.; Chen, Q.; Zhi, D.; Wang, X.; Feng, C.; Shang, J.; Jiang, S. An improved bicubic interpolation SLAM algorithm based on multi-sensor fusion method for rescue robot. Int. J. Sens. Netw.; 2023; 42, pp. 125-136. [DOI: https://dx.doi.org/10.1504/IJSNET.2023.131656]
21. Umirzakova, S.; Ahmad, S.; Khan, L.U.; Whangbo, T. Medical image super-resolution for smart healthcare applications: A comprehensive survey. Inf. Fusion; 2024; 103, 102075. [DOI: https://dx.doi.org/10.1016/j.inffus.2023.102075]
22. Lepcha, D.C.; Goyal, B.; Dogra, A.; Goyal, V. Image super-resolution: A comprehensive review, recent trends, challenges and applications. Inf. Fusion; 2023; 91, pp. 230-260. [DOI: https://dx.doi.org/10.1016/j.inffus.2022.10.007]
23. Sambandham, V.T.; Kirchheim, K.; Ortmeier, F.; Mukhopadhaya, S. Deep learning-based harmonization and super-resolution of Landsat-8 and Sentinel-2 images. ISPRS J. Photogramm. Remote Sens.; 2024; 212, pp. 274-288. [DOI: https://dx.doi.org/10.1016/j.isprsjprs.2024.04.026]
24. Mishra, D.; Hadar, O. Accelerating neural style-transfer using contrastive learning for unsupervised satellite image super-resolution. IEEE Trans. Geosci. Remote Sens.; 2023; 61, pp. 1-14. [DOI: https://dx.doi.org/10.1109/TGRS.2023.3314283]
25. Rauf, F.; Khan, M.A.; Bhatti, M.K.; Hamza, A.; Aleryani, A.; Alouane, M.T.H.; AlHammadi, D.A.; Nam, Y. FMANet: Super Resolution Inverted Bottleneck Fused Self-Attention Architecture for Remote Sensing Satellite Image Recognition. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2024; 17, pp. 18622-18634. [DOI: https://dx.doi.org/10.1109/JSTARS.2024.3475580]
26. Wang, Y.; Shao, Z.; Lu, T.; Huang, X.; Wang, J.; Chen, X.; Huang, H.; Zuo, X. Remote Sensing Image Super-Resolution via Multi-Scale Texture Transfer Network. Remote Sens.; 2023; 15, 5503. [DOI: https://dx.doi.org/10.3390/rs15235503]
27. Xiao, Y.; Yuan, Q.; He, J.; Zhang, L. Remote sensing image super-resolution via cross-scale hierarchical transformer. Geo-Spat. Inf. Sci.; 2024; 27, pp. 1914-1930. [DOI: https://dx.doi.org/10.1080/10095020.2023.2288179]
28. Chen, Y.; Xia, R.; Yang, K.; Zou, K. MICU: Image super-resolution via multi-level information compensation and U-net. Expert. Syst. Appl.; 2024; 245, 123111. [DOI: https://dx.doi.org/10.1016/j.eswa.2023.123111]
29. Zhao, J.; Ma, Y.; Chen, F.; Shang, E.; Yao, W.; Zhang, S.; Yang, J. SA-GAN: A second order attention generator adversarial network with region aware strategy for real satellite images super resolution reconstruction. Remote Sens.; 2023; 15, 1391. [DOI: https://dx.doi.org/10.3390/rs15051391]
30. Karwowska, K.; Wierzbicki, D. MCWESRGAN: Improving enhanced super-resolution generative adversarial network for satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.; 2023; 16, pp. 9459-9479. [DOI: https://dx.doi.org/10.1109/JSTARS.2023.3322642]
31. Hu, T.; Chen, Z.; Wang, M.; Hou, X.; Lu, X.; Pan, Y.; Li, J. Global sparse attention network for remote sensing image super-resolution. Knowl. Based Syst.; 2024; 304, 112448. [DOI: https://dx.doi.org/10.1016/j.knosys.2024.112448]
32. Wang, Y.; Yuan, W.; Xie, F.; Lin, B. ESatSR: Enhancing Super-Resolution for Satellite Remote Sensing Images with State Space Model and Spatial Context. Remote Sens.; 2024; 16, 1956. [DOI: https://dx.doi.org/10.3390/rs16111956]
33. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. Proceedings of the European Conference on Computer Vision (ECCV); Munich, Germany, 8–14 September 2018; pp. 3-19.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Achieving super-resolution with satellite images is a critical task for enhancing the utility of remote sensing data across various applications, including urban planning, disaster management, and environmental monitoring. Traditional interpolation methods often fail to recover fine details, while deep-learning-based approaches, including convolutional neural networks (CNNs) and generative adversarial networks (GANs), have significantly advanced super-resolution performance. Recent studies have explored large-scale models, such as Transformer-based architectures and diffusion models, demonstrating improved texture realism and generalization across diverse datasets. However, these methods frequently have high computational costs and require extensive datasets for training, making real-world deployment challenging. We propose the multi-branch generative prior integration network (MBGPIN) to address these limitations. This novel framework integrates multiscale feature extraction, hybrid attention mechanisms, and generative priors derived from pretrained VQGAN models. The dual-pathway architecture of the MBGPIN includes a feature extraction pathway for spatial features and a generative prior pathway for external guidance, dynamically fused using an adaptive generative prior fusion (AGPF) module. Extensive experiments on benchmark datasets such as UC Merced, NWPU-RESISC45, and RSSCN7 demonstrate that the MBGPIN achieves superior performance compared to state-of-the-art methods, including large-scale super-resolution models. The MBGPIN delivers a higher peak signal-to-noise ratio (PSNR) and higher structural similarity index measure (SSIM) scores while preserving high-frequency details and complex textures. The model also achieves significant computational efficiency, with reduced floating point operations (FLOPs) and faster inference times, making it scalable for real-world applications.
Author affiliations:
1 Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-si 461701, Republic of Korea
2 Department of Computer Science, CUNY Queens College, 65-30 Kissena Blvd Flushing, New York, NY 11374, USA
3 Department of Financial Accounting and Reporting, Tashkent State University of Economics, Tashkent 100066, Uzbekistan