Abstract

Remote-sensing visual question answering (RSVQA) aims to answer natural-language questions about remote-sensing images by leveraging both visual and textual information during inference. However, most existing methods underestimate the importance of the interaction between visual and language features: they adopt simple feature-fusion strategies, fail to adequately model cross-modal attention, and consequently struggle to capture the complex semantic relationships between questions and images. In this study, we introduce a unified transformer with cross-modal mixture experts (TCMME) model to address the RSVQA problem. Specifically, we use the vision transformer (ViT) and BERT to extract visual and language features, respectively, and we incorporate cross-modal mixture experts (CMMEs) to facilitate cross-modal representation learning. Through the shared self-attention and cross-modal attention within CMMEs, together with the modality experts, the model effectively captures the intricate interactions between visual and language features and better attends to their complex semantic relationships. Finally, we conduct qualitative and quantitative experiments on two benchmark datasets, RSVQA-LR and RSVQA-HR. The results demonstrate that our proposed method surpasses the current state-of-the-art (SOTA) techniques. Additionally, we perform an extensive analysis to validate the effectiveness of the different components of our framework.
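To make the CMME idea described in the abstract concrete, below is a minimal PyTorch sketch of one such block. It assumes a mixture-of-modality-experts design in the style of VLMo: a single shared attention layer runs over the concatenated ViT and BERT token sequences (so self-attention and cross-modal attention happen in one pass), after which each modality's tokens are routed to its own feed-forward expert. All class, parameter, and dimension names here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class CrossModalMixtureExpertLayer(nn.Module):
    """Hypothetical sketch of one CMME block: shared attention over the
    concatenated vision/text sequence, then per-modality FFN experts."""

    def __init__(self, dim=768, num_heads=12, ffn_dim=3072):
        super().__init__()
        self.norm_attn = nn.LayerNorm(dim)
        # Shared attention: because vision and text tokens are concatenated,
        # every head attends both within and across modalities.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_ffn = nn.LayerNorm(dim)
        # One feed-forward "expert" per modality.
        self.vision_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
        self.text_expert = nn.Sequential(
            nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, dim) from a ViT; txt_tokens: (B, Nt, dim) from BERT.
        x = torch.cat([vis_tokens, txt_tokens], dim=1)
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Route each modality's tokens to its own expert, with residuals.
        n_vis = vis_tokens.size(1)
        h = self.norm_ffn(x)
        vis = x[:, :n_vis] + self.vision_expert(h[:, :n_vis])
        txt = x[:, n_vis:] + self.text_expert(h[:, n_vis:])
        return vis, txt

# Toy usage: batch of 2, 197 ViT patch tokens, 32 BERT word tokens.
layer = CrossModalMixtureExpertLayer()
vis, txt = layer(torch.randn(2, 197, 768), torch.randn(2, 32, 768))
print(vis.shape, txt.shape)  # torch.Size([2, 197, 768]) torch.Size([2, 32, 768])

Stacking several such blocks and attaching a classification head over the pooled output would yield an answer predictor; those details are beyond what the abstract specifies.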

Details

Title
Unified Transformer with Cross-Modal Mixture Experts for Remote-Sensing Visual Question Answering
Author
Liu, Gang 1; He, Jinlong 1; Li, Pengfei 1; Zhong, Shenjun 2; Li, Hongyang 1; He, Genrong 1

1 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China; National Engineering Laboratory of E-Government Modeling Simulation, Harbin Engineering University, Harbin 150001, China; [email protected] (G.L.); [email protected] (J.H.); [email protected] (H.L.); [email protected] (G.H.)
2 Monash Biomedical Imaging, Monash University, Victoria 3800, Australia, and National Imaging Facility, Australia; [email protected]
First page
4682
Publication year
2023
Publication date
2023
Publisher
MDPI AG
e-ISSN
2072-4292
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2876612806
Copyright
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.