Deep Cross of Intra and Inter Modalities for Visual Question Answering
- DOI: 10.2991/ahis.k.210913.007
- Keywords: Deep Learning, Inter-Modality Fusion, Intra-Modality Fusion, Visual Question Answering
Visual Question Answering (VQA) has recently attracted interest in the deep learning community. The main challenges in VQA are understanding the meaning of each modality and deciding how to fuse their features. In this paper, DXMN (Deep Cross Modality Network) is introduced, which takes into consideration not only inter-modality fusion but also intra-modality fusion. The main idea behind this architecture is to take the position of each feature into account and then capture the relationships between multi-modal features, as well as the relationships within each modality, so that both can be learned more effectively. The architecture is pretrained on question answering datasets such as VQA v2.0, GQA, and Visual Genome, and is later fine-tuned to achieve state-of-the-art performance. DXMN achieves an accuracy of 68.65 on the test-standard split and 68.43 on the test-dev split of the VQA v2.0 dataset.
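The abstract does not spell out the fusion mechanism, but intra-modality fusion (relating features within one modality) and inter-modality fusion (relating features across modalities) are commonly realized with self-attention and cross-attention, respectively. The following is a minimal NumPy sketch of that general pattern; the function names, feature dimensions, and fusion order are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query row attends to all key rows.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores) @ v

def intra_inter_fusion(img_feats, txt_feats):
    # Intra-modality fusion: each modality attends to itself
    # (self-attention), modeling relationships among its own features.
    img_intra = attention(img_feats, img_feats, img_feats)
    txt_intra = attention(txt_feats, txt_feats, txt_feats)
    # Inter-modality fusion: each modality attends to the other
    # (cross-attention), modeling relationships across modalities.
    img_inter = attention(img_intra, txt_intra, txt_intra)
    txt_inter = attention(txt_intra, img_intra, img_intra)
    return img_inter, txt_inter

# Hypothetical shapes: 36 image regions and 14 question tokens,
# both projected to a 64-dimensional feature space.
rng = np.random.default_rng(0)
img = rng.normal(size=(36, 64))
txt = rng.normal(size=(14, 64))
img_fused, txt_fused = intra_inter_fusion(img, txt)
```

In practice the fused features from both streams would be pooled and passed to an answer classifier; stacking several such intra/inter blocks with learned projections is the usual deep variant of this idea.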
- © 2021, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - Rishav Bhardwaj
PY  - 2021
DA  - 2021/09/13
TI  - Deep Cross of Intra and Inter Modalities for Visual Question Answering
BT  - Proceedings of the 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021)
PB  - Atlantis Press
SP  - 47
EP  - 53
SN  - 2589-4900
UR  - https://doi.org/10.2991/ahis.k.210913.007
DO  - 10.2991/ahis.k.210913.007
ID  - Bhardwaj2021
ER  -