Integrating features and harnessing pre-trained visual-language models for enhancing VQA reading comprehension
DOI: https://doi.org/10.15625/1813-9663/20525

Keywords: Visual reading comprehension, transformer, vision-language pre-trained model.

Abstract
The Visual Question Answering (VQA) problem lies at the intersection of natural language understanding (NLU) and computer vision, aiming to comprehend both textual queries and visual content. Recently, researchers have focused on the reading comprehension abilities of VQA models, specifically their capacity to exploit scene text in images as additional context for answering the posed questions. In this study, we present a fundamental approach for integrating the diverse information contained in both images and questions. By leveraging a Transformer model, the proposed solution effectively addresses the VQA problem. Our approach achieved second place in the VLSP 2023 challenge on Visual Reading Comprehension for Vietnamese, demonstrating the effectiveness of the proposed method. This study contributes to the ongoing discourse on refining VQA models and highlights the potential for further advancements in this domain. The code is available in the GitHub repository at https://github.com/truong-xuan-linh/FSO-implement.
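To illustrate the general idea of fusing question text with scene text before feeding a pre-trained Transformer, the following is a minimal, hypothetical sketch, not the paper's actual pipeline. It covers only the textual side of the fusion (the full system also integrates visual features), and the checkpoint name VietAI/vit5-base, the prompt format, and the `answer` helper are assumptions made purely for illustration; scene-text tokens are taken as given, as if produced by an OCR engine such as EasyOCR or VietOCR.

```python
# Hypothetical sketch: concatenate the question with OCR-extracted scene text
# and let a pre-trained seq2seq Transformer generate the answer.
# This is NOT the authors' exact method; names below are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "VietAI/vit5-base"  # assumed Vietnamese T5 checkpoint; substitute your own model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def answer(question: str, ocr_tokens: list[str]) -> str:
    """Fuse the question with scene-text tokens and decode an answer."""
    # Scene text would come from an OCR engine run on the image beforehand.
    prompt = f"question: {question} context: {' '.join(ocr_tokens)}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Example: a Vietnamese question paired with scene text read from a shop sign.
    print(answer("Quán này bán món gì?", ["PHỞ", "BÒ", "HÀ", "NỘI"]))
```

In this sketch the integration is simply string concatenation into a single prompt; richer fusion strategies (e.g., combining visual, positional, and textual embeddings inside the Transformer) are what the paper investigates.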

