Abstract

With the increasing demand for accurate multimodal data analysis in complex scenarios, existing models often struggle to effectively capture and fuse information across diverse modalities, especially when the data span varying scales and levels of detail. To address these challenges, this study presents an enhanced Swin Transformer V2-based model designed for robust multimodal data processing. The method analyzes urban economic activity and spatial layout using satellite and street view images, with applications to traffic flow and business activity intensity, highlighting its practical significance. The model incorporates a multi-scale feature extraction module into the window attention mechanism, combining local and global window attention with adaptive pooling to achieve comprehensive multi-scale feature fusion and representation. This design enables the model to capture information at different scales effectively, enhancing its expressiveness in complex scenes. Additionally, a cross-attention-based multimodal feature fusion mechanism integrates spatial structure information from scene graphs with the Swin Transformer's image classification outputs. By computing similarities and correlations between scene graph embeddings and image classification outputs, this mechanism dynamically adjusts each modality's contribution to the fused representation, leveraging complementary information for a more coherent multimodal understanding. Compared with the baseline method, the proposed bimodal model performs better, improving accuracy by 3% to 91.5%, which demonstrates its effectiveness in processing and fusing multimodal information. These results highlight the advantages of combining multi-scale feature extraction with cross-modal alignment to improve performance on complex multimodal tasks.
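The cross-attention fusion described in the abstract can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical example of such a mechanism, assuming the image branch produces Swin patch tokens and the scene-graph branch produces node embeddings; the class name, feature dimensions, and the sigmoid gate used to weight each modality's contribution are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Sketch of cross-attention fusion between image tokens and scene-graph nodes."""

    def __init__(self, img_dim=768, graph_dim=256, fused_dim=512, num_heads=8):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.img_proj = nn.Linear(img_dim, fused_dim)
        self.graph_proj = nn.Linear(graph_dim, fused_dim)
        # Image tokens act as queries over scene-graph node embeddings.
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        # Scalar gate that adaptively weights each modality's contribution.
        self.gate = nn.Sequential(nn.Linear(2 * fused_dim, 1), nn.Sigmoid())

    def forward(self, img_feat, graph_nodes):
        # img_feat:    (B, N_img, img_dim)    e.g. Swin patch tokens
        # graph_nodes: (B, N_nodes, graph_dim) scene-graph node embeddings
        q = self.img_proj(img_feat)
        kv = self.graph_proj(graph_nodes)
        attended, _ = self.cross_attn(q, kv, kv)       # image queries graph context
        img_vec = q.mean(dim=1)                        # pooled image representation
        ctx_vec = attended.mean(dim=1)                 # pooled cross-modal context
        alpha = self.gate(torch.cat([img_vec, ctx_vec], dim=-1))
        return alpha * img_vec + (1 - alpha) * ctx_vec  # gated adaptive fusion


# Example: fuse 49 image tokens with 12 scene-graph nodes for a batch of 4 samples.
fusion = CrossAttentionFusion()
fused = fusion(torch.randn(4, 49, 768), torch.randn(4, 12, 256))
print(fused.shape)  # torch.Size([4, 512])
```

The learned gate plays the role the abstract assigns to the dynamic weighting step: rather than concatenating modalities with fixed weights, the fused representation leans on whichever modality carries more relevant information for a given input.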

Details

Title
Adaptive Multimodal Fusion with Cross-Attention for Robust Scene Segmentation and Urban Economic Analysis
Author
Zhong, Chun 1; Zeng, Shihong 1; Zhu, Hongqiu 2
1 Hunan University of Science and Technology, Xiangtan 411199, China; [email protected] (C.Z.); [email protected] (S.Z.)
2 School of Automation, Central South University, Changsha 410017, China
First page
438
Publication year
2025
Publication date
2025
Publisher
MDPI AG
e-ISSN
2076-3417
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3153579311
Copyright
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.