1. Introduction
With the rapid development of computer vision techniques, urban image data analysis has become a powerful tool for researchers, enabling the extraction of spatial features such as buildings, roads, and green spaces from satellite and street view images [1]. This advancement introduces new data dimensions and analysis methods to urban economic research, which make up for the shortcomings of traditional methods in refined and dynamic analysis. With the increasing complexity of data generated by various applications, there is an increasing need for models that can effectively analyze and interpret multimodal data such as images, text, and sensor data. Multimodal data analysis not only enhances the richness of information but also enables a more comprehensive understanding of complex scenarios, which can support decisions in several industries such as traffic management, resource allocation, and investment decisions. However, multimodal data analysis faces many challenges in fusing diverse modality information, especially when dealing with information at different scales and levels of detail.
Despite the progress made in current cross-modal feature alignment techniques, it still falls short in preserving modality-specific details, which directly leads to significant information loss. Furthermore, traditional methods often struggle with challenges such as handling distribution discrepancies, scale variations, and the complexity of details across different modalities, particularly when dealing with partial data loss (e.g., incomplete modality data in image–text models). As a result, the robustness is often insufficient. Therefore, achieving high-precision alignment of multimodal data while ensuring the complete retention of information has become one of the core challenges that needs to be addressed in the field of multimodal analysis.
To address the challenges outlined above, this study adopts the cutting-edge Swin Transformer V2 (denoted as SwinV2-B) [2], which demonstrates exceptional performance in hierarchical feature extraction and window-based attention mechanisms. These mechanisms enable SwinV2-B to capture both local and global patterns, making it particularly well-suited for constructing multi-scale feature representations. However, SwinV2-B shows certain limitations when handling cross-modal interactions and fine-grained multi-scale features, particularly in the fusion of cross-modal features and the complementary information exchange between modalities [3]. Recent studies have shown that incorporating multi-scale feature extraction techniques and adaptive alignment mechanisms can significantly enhance the model’s robustness and expressiveness in multimodal data processing. Furthermore, existing research provides valuable insights into multimodal feature fusion, such as using scene graph analysis and cross-attention mechanisms to achieve deeper and finer interactions between modalities. However, these approaches still exhibit noticeable limitations in addressing distributional discrepancies and data loss in multimodal datasets, highlighting areas for future improvement and research potential.
This paper proposes an enhanced model based on SwinV2-B, which integrates a multi-scale feature extraction module and a cross-modal attention-based feature fusion mechanism. The model is designed to dynamically integrate both local and global features while leveraging the spatial structural information from scene graphs to enhance feature representation. Specifically, the proposed method has the following notable features:
An innovative cross-modal alignment and fusion mechanism is introduced, addressing distributional differences in multimodal data while reducing information loss.
An advanced multi-scale feature extraction module enhances the representation of fine-grained features.
A combination of scene graph analysis and image classification enables dynamic monitoring and in-depth analysis of urban economic activities.
The structure of the subsequent sections of this paper is organized as follows: Section 2 provides a comprehensive review of the latest research developments in the relevant field, with a particular focus on the advancements in multimodal analysis methods based on Swin Transformer V2. Section 3 offers a detailed introduction to the proposed model architecture, including the specific design of the multi-scale feature extraction module and the cross-modal fusion mechanism. Section 4 validates the practical application of the proposed model in analyzing urban economic activities through a series of experiments, accompanied by a comparative performance analysis against state-of-the-art methods. Section 5 summarizes the main findings of this study and outlines potential directions for future research and improvements.
2. Related Work
In recent years, computer vision techniques have been widely applied to urban environment analysis. In [4], Li et al. employed convolutional neural networks (CNNs) to analyze satellite and street view images, enabling the automated identification of building density and land use patterns in urban environments. Compared with traditional statistical methods, image data provide finer-grained descriptions of urban characteristics. This advantage is particularly significant in rapidly developing urban areas, where image analysis can capture dynamic changes more effectively. However, their approach relies solely on CNN-based image recognition, which may fail to capture the semantic relationships between different building structures. This limitation highlights the need for integrating scene graph generation techniques to improve the understanding of urban spatial layouts, a key innovation addressed in this study.
In the field of urban economic forecasting, multimodal data fusion techniques have seen remarkable progress in recent years. Xu et al. proposed a model combining economic indicators with image data to predict changes in urban economic activities by leveraging spatial features from satellite images [5]. By fusing visual and traditional economic data, their method achieved higher accuracy in economic activity prediction. However, structural information within images, which is critical for understanding spatial relationships, has not been fully utilized in existing studies. By incorporating Swin Transformer and scene graph generation technologies, this study innovatively extracts spatial structural features from images, further improving the accuracy of economic activity predictions.
The application of deep learning in urban image analysis has delivered significant breakthroughs. Models such as the Swin Transformer excel in processing high-resolution images due to their multi-scale feature extraction capabilities. In [6], Zhou et al. utilized the Swin Transformer to classify traffic congestion in urban monitoring systems, significantly improving the accuracy and real-time performance of congestion detection. Despite these advancements, the potential of combining Swin Transformer with scene graph generation for urban economic behavior recognition remains largely unexplored. This study addresses this gap by integrating these techniques to enhance analytical capabilities.
Scene graph generation has emerged as a powerful tool for urban environment modeling and analysis. In [7], Zhou et al. proposed a method for representing urban elements such as buildings and roads in a scene graph structure, which facilitated a better understanding of spatial layouts. Graph embedding techniques further enabled these scene graphs to be used in machine learning models, advancing intelligent analysis and predictions of urban economic activities. However, existing studies have primarily focused on static analysis of structural features, without combining them with dynamic image classification for monitoring economic activities. This study bridges this gap by innovatively combining scene graph generation with the Swin Transformer to achieve bimodal fusion of urban image data and economic analysis. This approach enables real-time monitoring and prediction of economic activities while providing valuable insights for urban planning, resource allocation, and investment decisions.
Scene graph generation techniques have recently been integrated with multimodal deep learning approaches for real-time urban environment monitoring. Wang et al. demonstrated the efficacy of combining social media, image, and geo-information data for analyzing urban market behavior and resource allocation [8]. Building on this, this study introduces a cross-modal alignment module that leverages cross-attention mechanisms to enhance multimodal data fusion [9]. This mechanism ensures proper alignment of features from distinct modalities, improving the integration of visual and structural data.
By retaining the cosine similarity mechanism from Swin Transformer V2, the proposed framework facilitates robust multimodal integration with computational efficiency. The inclusion of log-spaced continuous position bias (Log-CPB) further enhances the model’s ability to transfer knowledge across varying window resolutions [10]. By incorporating distinct position encodings for each modality within the cross-modal alignment module, the framework achieves more precise spatial information alignment, optimizing performance in handling multimodal data.
Figure 1 illustrates the architecture of the proposed multimodal data fusion framework, designed to analyze urban spatial layouts and economic activities. The framework combines image-based analysis and graph-based reasoning for decision-making. The process begins with satellite and street view images, which are processed by Mask R-CNN to extract object detection and segmentation results. These segmented objects are represented graphically, and GraphSAGE generates embeddings that capture spatial and structural relationships in urban environments.
Simultaneously, the cross-modal attention fusion (CMAF) module, based on Swin Transformer V2, processes input images to extract local and global visual features. A cross-modal attention mechanism then fuses these visual features with graph embeddings, dynamically adjusting the contribution of each modality based on attention scores. Finally, the fused multimodal representation is fed into the Decision Support System (DSS) to predict economic behaviors, providing actionable insights for urban planning, resource allocation, and investment decisions.
The proposed framework combines scene graph generation and advanced transformer-based feature extraction to address limitations in current urban economic analysis methods. By enabling robust multimodal integration and leveraging both spatial and structural information, the study offers a novel solution for real-time monitoring and decision-making in urban environments.
3. Method
This study introduces a bimodal urban economic activity assessment model that integrates scene graph generation with graph embedding representation and a Swin Transformer based image classification approach. By leveraging feature fusion, the model offers a comprehensive assessment and prediction of urban economic activities. The dual-modality design effectively captures dynamic information such as economic activity intensity, resource allocation, and traffic flow in urban spaces, providing robust decision-making tools for urban planners and policymakers.
3.1. Scene Graph Generation and Graph Embedding Representation
To analyze economic activities in urban environments, the study first applies semantic segmentation and object detection models to generate scene graphs from satellite or street view images, extracting critical spatial and structural information such as buildings, roads, and green spaces. Figure 2 presents an example of an urban area analysis image. Specifically, the Mask R-CNN algorithm is employed to detect and segment relevant elements in the images [11]. The resulting scene graph encodes the structure of economic activities, with nodes representing semantic objects (e.g., buildings, parks) and edges reflecting spatial relationships between these objects. This representation provides a detailed visualization of the spatial layout of various urban zones, such as commercial, residential, or industrial areas, serving as foundational data for further economic activity analysis.
(1)
where I is the input satellite image or street view image, H and W are the height and width of the image, respectively, and C is the number of color channels.(2)
where S is the generated scene graph, consisting of the set of nodes V and the set of edges E.(3)
Each node represents a semantic object, such as a building, road, or green space.
(4)
where each edge represents the spatial relationship between nodes and .3.2. Graph Embedding Representations
After generating the scene graphs, we use the GraphSAGE graph embedding method to transform these scene graphs into low-dimensional vector representations that can be processed by machine learning models to preserve the topological information and spatial relationships in the graph structure [12]. The process of graph embedding is illustrated in Figure 3. Through this graph embedding method, the nodes in the scene graph can be embedded into the low-dimensional space to ensure that the relationship between adjacent nodes is effectively preserved.
To manage high-dimensional scene graph embeddings, Principal Component Analysis (PCA) is used to reduce dimensionality, as illustrated in Figure 4. This approach preserves the most critical features while minimizing redundancy. Specifically, the eigenvalue decomposition method is applied to the covariance matrix, selecting the topK eigenvectors with the highest eigenvalues. The data are then projected onto the subspace formed by these eigenvectors, achieving effective dimensionality reduction. By compressing the high-dimensional embeddings into a lower-dimensional space, the model retains representative features, providing streamlined and efficient input for subsequent machine learning tasks. This embedded graph representation not only reflects the spatial distribution within a region but also captures developmental trends across various economic zones, offering structured information for feature fusion and decision support.
(5)
where X is the feature matrix of the scene graph embeddings, C is the covariance matrix, and n is the number of samples.(6)
where represents the eigenvalues of the covariance matrix C, and v is the corresponding eigenvector.(7)
where is the embedding of node v at layer k, AGGREGATE is the aggregation function (such as mean, LSTM, max pooling, etc.), is the activation function, and and are learnable parameters.3.3. Swin Transformer Image Classification
To identify economic behavior in specific regions, Swin Transformer V2 is employed for image classification. Utilizing a multi-level sliding window mechanism, the model performs self-attention calculations in local windows to extract features at varying scales, making it particularly suitable for high-resolution urban images. The input image is divided into multiple windows, where local attention computations capture fine-grained features. The scope of these windows expands layer by layer to progressively capture broader contextual features, enabling the recognition of complex economic behaviors such as traffic congestion and commercial activity intensity.
This study further enhances the model’s capability by introducing a multi-scale feature extraction module into the window attention mechanism of Swin Transformer V2. A cross-window multi-scale fusion module is added at each layer to integrate local and global attention features. Adaptive pooling is used to fuse information across windows, followed by a deconvolution layer to reconstruct a richer multi-scale feature representation. This mechanism effectively captures information across diverse scales, addressing the differences in detail and scale between bimodal data and significantly improving the model’s performance in analyzing complex urban scenes.
(8)
where I is the input image, l denotes the layer number, and is the extracted feature representation.Each layer’s feature extraction can be represented using the sliding window self-attention mechanism:
(9)
where is the dimension of the keys.3.4. Multi-Scale and Cross-Attention Fusion for Enhanced Bimodal Feature Alignment
To combine spatial structural information from the scene graph with image classification results from the Swin Transformer, a cross-modal feature fusion mechanism based on a cross-attention structure is proposed, as shown in Figure 5 [13]. This mechanism computes a cross-attention matrix by evaluating the similarity and correlation between scene graph embeddings and image classification features, enabling collaborative information fusion across modalities [14]. The cross-attention mechanism dynamically adjusts the influence of scene graph and image features in the fusion process by generating attention weights for each modality [15]. A cross-modal feature alignment module further enhances the integration by correlating representations from both modalities. This module ensures that complementary information between modalities is effectively captured. Features from the two modalities are concatenated and passed through a dedicated cross-modal attention layer, improving the model’s ability to understand and align relationships between modalities.
Finally, the fused features are fed into a decision support system (DSS), which provides actionable insights for urban planners and policymakers. By accurately assessing and forecasting economic activities, the system supports informed decisions regarding urban planning, resource allocation, and investment strategies.
The cross-modal alignment module (CMAM), which is shown in Figure 6, serves to dynamically adjust contributions from different modalities by learning their contextual relationships and interdependencies, enhancing the model’s ability to integrate diverse sources of information. This module utilizes a weighted attention mechanism that assigns adaptive importance scores to features from each modality, ensuring that salient information is emphasized while redundant or less relevant information is attenuated.
The alignment begins by processing features from each modality through independent feature extractors, yielding representations , where . These representations are then fused using a cross-modal attention mechanism.
The attention weights for each modality are computed as follows:
(10)
where is a shared query vector, is the key vector for modality i, and is a scaling factor to stabilize training. The attention weights reflect the contribution of each modality to the final fused representation.The weighted fusion is performed as
(11)
where is the value vector for modality i.To ensure dynamic adjustment, the module incorporates contextual signals by conditioning on a shared global context , derived from a pooling operation across all modalities:
(12)
This global context allows the model to refine its focus based on overall task requirements, dynamically modulating contributions from each modality.
4. Experiments
4.1. Dataset
To validate the effectiveness of the proposed bimodal urban economic activity evaluation model, this experiment utilizes a multi-source urban dataset for training and evaluation. The dataset comprises data from two major sources:
Satellite images: High-resolution satellite images were sourced from the Sentinel-2 satellite dataset, which provides multi-spectral images covering various economic zones (e.g., commercial, residential, and industrial areas) in multiple cities. All images were resized to resolution, and color space normalization was applied to standardize inputs.
Street view images: Public street view datasets, the OpenStreetMap (OSM) data, were used to capture detailed information on urban infrastructures, including buildings, roads, traffic, and green spaces. A pre-trained Mask R-CNN model was employed for semantic segmentation and object detection, labeling buildings, roads, and green areas in the images, which resulted in the generation of scene graphs. By employing these publicly available datasets, the model can leverage diverse urban contexts and enhance the generalizability of the evaluation results across different economic zones.
The dataset used in this study comprises high-resolution satellite images and street view images, which are classified into different categories based on urban regions, such as commercial, residential, and industrial areas. The data are labeled with semantic information that allows us to generate scene graphs for each image, capturing the spatial relationships between various objects. The dataset includes a total of 45,000 images, divided as follows: commercial areas: 10,000 images with 8000 scene graphs; residential areas: 15,000 images with 12,000 scene graphs; and industrial areas: 8000 images with 6000 scene graphs.
Table 1 summarizes the dataset distribution across different categories, showing the number of images and corresponding scene graphs used for analysis. The dataset’s diversity and large number of labeled scene graphs allow the model to be trained effectively on a variety of urban environments, ensuring the generalizability of the results.
4.2. Experimental Design
The primary objective of the experiments conducted in this study is to evaluate the effectiveness of multimodal feature fusion for urban infrastructure analysis using scene graphs and image data. By integrating scene graph embeddings with image features, we aim to improve the classification accuracy of economic activities in urban areas.
In this study, we applied five-fold cross-validation to evaluate the performance of the CNN model. This method divides the entire dataset into five subsets, with each subset being used once as a validation set while the remaining four subsets are used for training. This process is repeated five times, and the average performance across all folds is reported. This approach ensures that the model is tested on all parts of the dataset and provides a more reliable estimate of the model’s generalization ability.
To compare the performance of the CNN model with other methods, we used several performance metrics, including accuracy, F1 score, precision, and recall. These metrics were compared across all models to assess the effectiveness of each in handling multimodal data. In particular, the CNN model was compared with other baseline methods such as traditional machine learning models and pre-existing deep learning architectures, with the goal of demonstrating its superior capability in capturing spatial and contextual relationships within the dataset.
The dataset was split into three subsets: training, validation, and testing. The training set, which accounts for 70% of the total dataset, consists of 31,500 images, providing the model with a substantial amount of data for learning. The validation set, representing 15% of the dataset, includes 6750 images and is utilized for hyperparameter tuning and model selection. Finally, the testing set, also comprising 15% of the dataset with 6750 images, is reserved for evaluating the model’s generalization performance on unseen data.
4.2.1. Scene Graph Generation and Graph Embedding
Mask R-CNN was used to perform object detection and semantic segmentation on the input satellite and street view images, generating scene graphs containing urban infrastructure [16]. These scene graphs were saved as structured data with nodes and edges. GraphSAGE and Node2Vec were applied to convert scene graph nodes into low-dimensional vectors, preserving graph topology and spatial relationships.
4.2.2. Data Preprocessing
Prior to model training, the input images underwent a series of preprocessing steps, including normalization to adjust pixel values, data augmentation techniques such as rotation and scaling to increase dataset variability, and resizing to ensure uniform input dimensions for the models.
4.2.3. Image Classification and Multimodal Feature Fusion
Swin Transformer was used to extract multi-scale features from preprocessed images, identifying key economic activities (e.g., traffic congestion, commercial activity). The model’s ability to extract local and global features from urban images was evaluated across various economic zones.
A cross-attention mechanism was employed to fuse scene graph embeddings and image features extracted by Swin Transformer. The effectiveness of the fused features was compared with single-modality models to verify the efficacy of the cross-attention mechanism.
4.2.4. Model Settings
The settings in Table 2 highlight the variability in hyperparameters employed to ensure robust training and accurate segmentation results. The use of RTX 4090 for all experiments ensures that the hardware does not introduce variability into the performance metrics.
4.2.5. Model Comparison Experiments
In this section, we compare various graph embedding techniques and fusion mechanisms to evaluate their performance on scene graph representation and urban economic activity prediction. The results are systematically analyzed using statistical metrics such as accuracy, F1 score, and cross-entropy loss. Graph Embedding ComparisonTo assess the effectiveness of different graph embedding methods in representing scene graphs, we evaluated the performance of GraphSAGE and Node2Vec [17]. The comparison was based on several metrics, including node count, graph density, accuracy, and graph distance. As shown in Table 3, GraphSAGE outperforms Node2Vec, achieving a higher accuracy (85.2%) and a lower graph distance (0.15), indicating better retention of spatial relationships and structural integrity within the scene graphs.
GraphSAGE operates in a message-passing paradigm, preserving spatial and topological properties of the graph. This contrasts with Node2Vec, which relies on random walks to capture node proximity, often losing fine-grained structural details. GraphSAGE can generalize to unseen nodes, as it does not rely on precomputed embeddings but instead computes them dynamically using the graph structure. This inductive property is particularly useful in dynamic or incomplete scene graphs, where new nodes or edges may be introduced during inference. Node2Vec generates embeddings based on node proximity using biased random walks. While this captures high-level graph topology, it struggles to encode the detailed relationships and interactions that are essential for scene graph tasks.
The results demonstrate that while both methods are comparable in terms of node count and graph density, GraphSAGE significantly excels in accuracy and preserving graph structure. This suggests that GraphSAGE is better suited for tasks requiring high spatial resolution in graph-based representations.
The comparison of experimental results between five models, including SwinV2-B, SwinUnet, DS-TransUNet-B, CSwin-B, and our proposed method, demonstrates the effectiveness of our approach across various evaluation metrics, namely accuracy, recall, F1 score, throughput, FLOPS, and the number of parameters.
According to the analysis of the results in Table 4, our method outperforms the baseline model in both accuracy and F1 score. Specifically, our method achieves 91.5% accuracy, which is a 2.2% improvement over CSwin-B with the second highest accuracy and a 2.3% and 3% improvement over SwinUnet and SwinV2-B, respectively. Furthermore, our method achieves an F1 score of 88.9%, compared with CSwin-B’s F1 score of 88.8%. However, the F1 scores of SwinUnet and SwinV2-B are 87.2% and 86.7%, respectively. Compared with DS-TransUNet-B, our method improves the accuracy by 4.2% and improves the F1 score by 2.1%.
Our model also exhibits a notable advantage in throughput, achieving 74/s, significantly outperforming the competing models. CSwin-B and DS-TransUNet-B reach throughput levels of 60/s and 61/s, respectively, while SwinUnet and SwinV2-B record 56/s and 67/s. The increased throughput demonstrates the efficiency of our method in processing a large number of samples in a short amount of time, making it suitable for practical deployment scenarios.
Although our model’s FLOPs reach 52.3 G, slightly higher than CSwin-B (46.7 G) and SwinUnet (48.5 G), the significant performance improvements justify this computational cost. Furthermore, compared with models with lower FLOPS, such as DS-TransUNet-B and SwinV2-B, our method achieves higher efficiency and predictive performance, demonstrating a well-optimized balance between computational complexity and effectiveness.
In terms of the number of parameters, our model requires 70 M, which is higher than other methods but leads to comprehensive performance advantages. CSwin-B and SwinUnet require 60 M and 65 M parameters, respectively, while SwinV2-B and DS-TransUNet-B have relatively lower parameter counts of 50 M and 55 M. Despite the increased parameter size, our model demonstrates that a moderate increase in complexity can yield substantial performance gains, achieving a superior trade-off between model size and capability.
Overall, the experimental results demonstrate that our proposed method excels in accuracy, recall, F1 score, and throughput, outperforming all competing models. Although it incurs slightly higher computational and parameter costs, these are well compensated by the remarkable improvements in performance. These results highlight the practical value and potential of our method in addressing challenging tasks with high efficiency and reliability.
The results of the ablation study in Table 5 indicate that the baseline SwinV2-B model, without the addition of the multi-scale and cross-modal alignment modules, achieved an accuracy of 82.3%, a recall of 81.5%, and an F1 score of 82.9%. Removing the multi-scale attention module led to a decrease in accuracy to 85.3%, demonstrating the module’s significant role in enhancing the model’s ability to process dual-modal data through improved feature extraction. When the cross-modal alignment module was excluded, the accuracy slightly decreased to 86.3%, suggesting that this module aids in the fusion of dual-modal information, enhancing the model’s understanding of inter-modal relationships. In the joint ablation of both modules, the model’s performance further declined, highlighting the complementary contributions of the two modules in improving dual-modal data processing. In contrast, the single-modal model configuration yielded an accuracy of 81.6%, emphasizing the advantage of a dual-modal approach. The enhanced dual-modal model, incorporating both the multi-scale attention mechanism and cross-modal alignment module, achieved superior performance with an accuracy of 91.5%, a recall of 88.9%, and an F1 score of 86.7%, demonstrating that the proposed improvements significantly bolster the model’s capability to handle dual-modal information effectively. The model achieves high segmentation accuracy across all subsets, as shown by the focal loss and pixel accuracy values in Table 6. Figure 7 provides examples of segmentation results compared with ground truth, demonstrating the performance of the model after fine-tuning. Figure 8 illustrates the classification results for the full 426 validation dataset, the 2016–2017 subset, the 2018–2019 subset, and the 2020–2021 subset. These findings confirm the effectiveness of the multi-scale and cross-modal alignment modules in improving the SwinV2-B model’s performance on dual-modal tasks, underscoring the contributions of both feature extraction and modal alignment to the overall performance enhancements.
In this study, we evaluated the performance of the classification model on different datasets by confusion matrix. Table 6 illustrates the classification results for the full validation dataset, the 2016–2017 subset, the 2018–2019 subset, and the 2020–2021 subset.
In the full validation dataset, the model performs well on the forest category, with an accuracy of 0.960, and the accuracy of the water category is also quite high, with an accuracy of 0.968. However, the house category has 22.6% of instances misclassified as forest, which may be due to the similarity between the two categories in some features. The results for the 2016–2017 subset show that the forest category has the highest accuracy of 0.960. The house class has 19.7% misclassified as forest, which is consistent with the results on the full validation dataset, indicating that the model has some challenges in distinguishing between the two classes. The classification performance of the 2018–2019 subset was similar to that of the 2016–2017 subset, with an accuracy of 0.969 for the forest class, while 19.9% of the house class was misclassified as forest. This further confirms the difficulty of the model in distinguishing between the house and forest categories. The results for the 2020–2021 subset show that the accuracy of the forest category remains at 0.960, which is the same as the 2016–2017 subset. The house class has 19.6% misclassified as forest, indicating that the model still faces challenges in distinguishing between the two classes even in the most recent subset of the data.
These metrics give us an overview of the performance of the model on the different categories, as well as the overall classification performance. By comparing the confusion matrices of different subsets, we were able to analyze the performance of the model in different time periods and identify possible classification difficulties of the model on some categories.
We present a comparison of different fusion methods in Table 7. The cross-attention mechanism proves to be the most effective for multimodal feature fusion [21], yielding an accuracy of 81.5%, a macro F1 score of 78.4%, and a cross-entropy loss of 0.22. In contrast, simple concatenation results in a lower accuracy of 78.2% and a higher cross-entropy loss of 0.26. This difference highlights the limitations of static fusion methods, which treat all features equally, leading to suboptimal integration of multimodal data. By contrast, the cross-attention mechanism dynamically adjusts the weights of features from different modalities, allowing the model to emphasize the most relevant features from each modality. This adaptive feature weighting significantly improves model robustness, enabling better generalization to unseen data and yielding more accurate predictions. The results also indicate that sum and average fusion methods, which rely on simplistic mathematical operations, are the least effective, with accuracy scores below 75%. Figure 8 further supports these findings, showing the normalized confusion matrix for the full validation set and its temporal subsets. The matrix demonstrates that the cross-attention-based model consistently performs well across different periods, with minimal variation in segmentation performance. This reinforces the conclusion that cross-attention fusion not only improves accuracy but also enhances the stability and reliability of predictions across varying data distributions.
The results of this model can be locally analyzed and visualized. Using standard mapping software, Figure 9 demonstrates that at the finest level, the results of the model can be locally analyzed and visualized using QGIS software; the model is validated. The results in Figure 10 show that the identified building, water, and forest domains.
5. Conclusions
This paper presents a cross-modal attention fusion framework based on Swin Transformer V2, integrating scene graph generation with graph embedding and Swin Transformer V2 for image classification. The proposed model efficiently extracts and represents semantic features of urban environments, offering valuable insights into the spatial layout and economic activities of cities. The multi-scale feature extraction module enhances the model’s capability to capture information across different scales, while the cross-modal feature alignment module facilitates the fusion of complementary information from diverse modalities. Experimental results demonstrate the effectiveness of the proposed enhancements. The baseline model achieved an accuracy of 82.3%, a recall of 81.5%, and an F1 score of 82.9%. With the inclusion of the multi-scale attention module, accuracy improved to 85.3%, highlighting the importance of multi-scale feature extraction in dual-modal data processing. Incorporating the cross-modal alignment module further increased accuracy to 86.3%, underscoring its critical role in aligning and fusing dual-modal information and enhancing the model’s understanding of inter-modal relationships. When both modules were combined, the model achieved state-of-the-art performance, with an accuracy of 91.5%, a recall of 88.9%, and an F1 score of 86.7%, showcasing their complementary contributions to handling dual-modal data effectively. While the proposed model demonstrates significant performance improvements, challenges remain, such as increased computational complexity and the potential risk of overfitting due to the integration of advanced mechanisms. Future research should focus on optimizing these aspects to ensure scalability and efficiency for larger-scale applications. Additionally, exploring the applicability of this framework to other tasks, such as dialogue systems and image captioning, will further extend its utility.
Conceptualization, C.Z. and H.Z.; methodology, C.Z.; software, C.Z.; validation, C.Z., S.Z. and H.Z.; formal analysis, C.Z.; investigation, C.Z.; resources, C.Z.; data curation, C.Z.; writing—original draft preparation, C.Z.; writing—review and editing, C.Z., S.Z. and H.Z.; visualization, C.Z.; supervision, H.Z.; project administration, H.Z.; funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
The authors declare no conflicts of interest.
Footnotes
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Figure 3. Graph embedding process. (a) The blue nodes represent the original scene graph embedding features. (b) The red nodes represent the low-dimensional embedding features.
Figure 6. Cross-modal alignment and multimodal integration module (CMAF) based on Swin Transformer V2: Introduced before the attention module, it uses cross-modal attention to align image and text features. Multimodal position encoding: add position encoding for different modalities.
Figure 7. Examples of the original images (a,b) and results of their inference results compared with the ground truth (c,d). White represents the match between ground truth and inference, green represents the ground truth that is not covered by the inference results, and red represents the inference results that do not cover ground truth.
Figure 8. Normalized confusion matrix of segmentation results for full fine-tuning validation set and its subsets of different periods.
Figure 9. The segmentation results on the finest level: (a) OSM map view of central Kaunas city; (b) processed data of the selected time period (2019–2020) with segmented buildings (magenta), water (blue), forest (brown), and other (white) categories.
Figure 10. The example of change identification: (a,b) Original images of periods 2016–2017 and 2018–2019; (c,d) images with a hatch layer that represents mismatch of the building class in segmentation results.
Dataset distribution across urban categories.
Category | Number of Images | Number of Scene Graphs |
---|---|---|
Commercial Areas | 10,000 | 8000 |
Residential Areas | 15,000 | 12,000 |
Industrial Areas | 8000 | 6000 |
Segmentation model training settings.
Optimizer | Learning Rate | Weight Decay | Batch Size | Hardware |
---|---|---|---|---|
SGD | 0.0001 | 0.0005 | 12 | RTX 4090 |
Adam | 0.006 | 0.0001 | 12 | RTX 4090 |
RMSprop | 0.005 | 0 | 12 | RTX 4090 |
Comparison of graph embedding methods.
Embedding Method | Node Count | Graph Density | Accuracy | Graph Distance |
---|---|---|---|---|
GraphSAGE | 2000 | 0.35 | 85.2% | 0.15 |
Node2Vec | 2000 | 0.35 | 84.5% | 0.12 |
Comparison of model performance across different state-of-the-art methods.
Model | Accuracy | Recall | F1 Score | Throughput | FLOPs | Parameters |
---|---|---|---|---|---|---|
SwinV2-B | 88.5% | 85.8% | 86.7% | 67/s | 45.2 | 50 M |
SwinUnet [ | 89.2% | 86.3% | 87.2% | 56/s | 48.5 | 65 M |
DS-TransUNet-B [ | 87.3% | 85.9% | 86.8% | 61/s | 43.8 | 55 M |
CSwin-B [ | 89.3% | 87.8% | 88.8% | 60/s | 46.7 | 60 M |
Ours | 91.5% | 88.1% | 88.9% | 74/s | 52.3 | 70 M |
Ablation study on the various components of our model.
Experimental Setting | Accuracy | Recall | F1 Score |
---|---|---|---|
SwinV2-B | 88.5% | 86.7% | 87.9% |
Without Multi-Scale Module | 85.3% | 83.6% | 84.1% |
Without Cross-Modal Alignment Module | 86.3% | 84.8% | 85.7% |
Without Both Modules | 82.3% | 81.5% | 82.9% |
Single-Modal Model | 81.6% | 80.1% | 81.2% |
Ours | 91.5% | 88.9% | 86.7% |
Classification Performance on Validation Subsets (2016–2021).
Dataset | Accuracy | Other | House | Forest | Water |
---|---|---|---|---|---|
2016–2017 Subset | 92.7% | 92.4% | 76.8% | 96.0% | 96.5% |
2018–2019 Subset | 92.1% | 92.1% | 76.8% | 96.9% | 96.0% |
2020–2021 Subset | 91.2% | 91.8% | 76.4% | 96.0% | 96.0% |
Full Validation Dataset | 91.6% | 93.0% | 76.1% | 96.0% | 96.8% |
Comparison of different fusion methods.
Method | Accuracy | Macro F1 Score | Cross-Entropy Loss |
---|---|---|---|
Cross-Attention Fusion | 81.5% | 78.4% | 0.22 |
Concatenation Fusion | 78.2% | 75.9% | 0.26 |
Sum Fusion | 74.6% | 73.2% | 0.30 |
Average Fusion | 75.1% | 74.8% | 0.28 |
References
1. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. Simclr: A simple framework for contrastive learning of visual representations. Proceedings of the International Conference on Learning Representations; Virtual, 26 April–1 May 2020; Volume 2.
2. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L. et al. Swin Transformer V2: Scaling Up Capacity and Resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); New Orleans, LA, USA, 18–24 June 2022; pp. 12009-12019.
3. Girshick, R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV); Santiago, Chile, 7–13 December 2015.
4. Law, S.; Paige, B.; Russell, C. Take a look around: Using street view and satellite images to estimate house prices. ACM Trans. Intell. Syst. Technol. TIST; 2019; 10, pp. 1-19. [DOI: https://dx.doi.org/10.1145/3342240]
5. Xu, D.; Zhu, Y.; Choy, C.B.; Fei-Fei, L. Scene Graph Generation by Iterative Message Passing. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Honolulu, HI, USA, 21–26 July 2017.
6. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA, 21–26 July 2017; pp. 633-641.
7. Zhou, X.; Wang, W.; Chen, Y. Traffic congestion detection using Swin Transformer in urban environments. IEEE Trans. Intell. Transp. Syst.; 2021; 22, pp. 123-132.
8. Wang, J.; Zhang, L.; Liu, Y. Multimodal deep learning for urban market behavior analysis. Comput. Environ. Urban Syst.; 2021; 87, 101646. [DOI: https://dx.doi.org/10.1016/j.compenvurbsys.2021.101646]
9. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision; Montreal, QC, Canada, 10–17 October 2021; pp. 10012-10022.
10. Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Shen, C. Conditional positional encodings for vision transformers. arXiv; 2021; arXiv: 2102.10882
11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV); Venice, Italy, 22–29 October 2017; pp. 2961-2969. [DOI: https://dx.doi.org/10.1109/ICCV.2017.322]
12. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. Adv. Neural Inf. Process. Syst.; 2017; 30.
13. Liu, H.; Zhang, J.; Yang, K.; Hu, X.; Stiefelhagen, R. Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. arXiv; 2022; arXiv: 2203.04838
14. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell.; 2018; 41, pp. 423-443. [DOI: https://dx.doi.org/10.1109/TPAMI.2018.2798607] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/29994351]
15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst.; 2017; 30.
16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified Real-Time Object Detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016; pp. 779-788. [DOI: https://dx.doi.org/10.1109/CVPR.2016.91]
17. Grover, A.; Leskovec, J. Node2vec: Scalable Feature Learning for Networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; San Francisco, CA, USA, 13–17 August 2016; pp. 855-864. [DOI: https://dx.doi.org/10.1145/2939672.2939754]
18. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-unet: Unet-like pure transformer for medical image segmentation. Proceedings of the European Conference on Computer Vision; Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 205-218.
19. Lin, A.; Chen, B.; Xu, J.; Zhang, Z.; Lu, G.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation. IEEE Trans. Instrum. Meas.; 2022; 71, pp. 1-15. [DOI: https://dx.doi.org/10.1109/TIM.2022.3178991]
20. Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; New Orleans, LA, USA, 18–24 June 2022; pp. 12124-12134.
21. Natarajan, P.; Wu, S.; Vitaladevuni, S.; Zhuang, X.; Tsakalidis, S.; Park, U.; Prasad, R.; Natarajan, P. Multimodal feature fusion for robust event detection in web videos. Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition; Providence, RI, USA, 16–21 June 2012; pp. 1298-1305.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Abstract
With the increasing demand for accurate multimodal data analysis in complex scenarios, existing models often struggle to effectively capture and fuse information across diverse modalities, especially when data include varying scales and levels of detail. To address these challenges, this study presents an enhanced Swin Transformer V2-based model designed for robust multimodal data processing. The method analyzes urban economic activities and spatial layout using satellite and street view images, with applications in traffic flow and business activity intensity, highlighting its practical significance. The model incorporates a multi-scale feature extraction module into the window attention mechanism, combining local and global window attention with adaptive pooling to achieve comprehensive multi-scale feature fusion and representation. This approach enables the model to effectively capture information at different scales, enhancing its expressiveness in complex scenes. Additionally, a cross-attention-based multimodal feature fusion mechanism integrates spatial structure information from scene graphs with Swin Transformer’s image classification outputs. By calculating similarities and correlations between scene graph embeddings and image classifications, this mechanism dynamically adjusts each modality’s contribution to the fused representation, leveraging complementary information for a more coherent multimodal understanding. Compared with the baseline method, the proposed bimodal model performs superiorly and the accuracy is improved by 3%, reaching 91.5%, which proves its effectiveness in processing and fusing multimodal information. These results highlight the advantages of combining multi-scale feature extraction and cross-modal alignment to improve performance on complex multimodal tasks.
You have requested "on-the-fly" machine translation of selected content from our databases. This functionality is provided solely for your convenience and is in no way intended to replace human translation. Show full disclaimer
Neither ProQuest nor its licensors make any representations or warranties with respect to the translations. The translations are automatically generated "AS IS" and "AS AVAILABLE" and are not retained in our systems. PROQUEST AND ITS LICENSORS SPECIFICALLY DISCLAIM ANY AND ALL EXPRESS OR IMPLIED WARRANTIES, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES FOR AVAILABILITY, ACCURACY, TIMELINESS, COMPLETENESS, NON-INFRINGMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Your use of the translations is subject to all use restrictions contained in your Electronic Products License Agreement and by using the translation functionality you agree to forgo any and all claims against ProQuest or its licensors for your use of the translation functionality and any output derived there from. Hide full disclaimer
Details

1 Hunan University of Science and Technology, Xiangtan 411199, China;
2 School of Automation, Central South University, Changsha 410017, China