Mamba-MHAR: An efficient multimodal framework for human action recognition

Authors

  • Trung-Hieu Le School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Bach Mai Ward, Ha Noi, Viet Nam https://orcid.org/0000-0001-6323-6959
  • Khanh-Nguyen Thai School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Bach Mai Ward, Ha Noi, Viet Nam https://orcid.org/0009-0009-2028-9554
  • Tuan-Anh Le Dai Nam University, 01 Pho Xom, Phu Luong Ward, Ha Noi, Viet Nam
  • Mathieu Delalandre Polytechnic University of Tours, France
  • Trung-Kien Tran Institute of Information Technology, AMST, 17 Hoang Sam, Nghia Do Ward, Ha Noi, Vietnam https://orcid.org/0000-0001-5466-0539
  • Thanh-Hai Tran School of Electrical and Electronic Engineering, Hanoi University of Science and Technology, 01 Dai Co Viet Street, Bach Mai Ward, Ha Noi, Viet Nam https://orcid.org/0000-0003-3133-3361
  • Cuong-Pham Posts and Telecommunications Institute of Technology, Nguyen Trai Street, Mo Lao Ward, Ha Noi, Viet Nam https://orcid.org/0000-0003-0973-0889

DOI:

https://doi.org/10.15625/1813-9663/22770

Keywords:

Mamba, selective state space model, selection mechanism, HAR, multimodal fusion, visual sensor, inertial sensor.

Abstract

Human Action Recognition (HAR) has emerged as an active research domain in recent years, with wide-ranging applications in healthcare monitoring, smart home systems, and human–robot interaction. This paper introduces Mamba-MHAR (Mamba-based Multimodal Human Action Recognition), a lightweight multimodal architecture aimed at improving HAR performance by effectively integrating data from inertial sensors and egocentric videos. Mamba-MHAR consists of two Mamba-based branches, one for visual feature extraction (VideoMamba) and the other for motion feature extraction (MAMC). Both branches are built upon recently introduced Selective State Space Models (SSMs) to optimize computational cost, and their outputs are fused at a late stage for final human activity classification. Mamba-MHAR achieves significant efficiency gains in terms of GPU usage, making it highly suitable for real-time deployment on edge and mobile devices. Extensive experiments were conducted on two challenging multimodal datasets, UESTC-MMEA-CL and MuWiGes, which contain synchronized IMU and video data recorded in natural settings. The proposed Mamba-MHAR achieves 98.00% accuracy on UESTC-MMEA-CL and 98.58% on MuWiGes, surpassing state-of-the-art baselines. These results demonstrate that a simple yet efficient fusion of lightweight multimodal Mamba-based models provides a promising solution for scalable and low-power applications in pervasive computing environments.
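To illustrate the dual-branch, late-fusion design described in the abstract, the following is a minimal sketch. The actual branches are VideoMamba (visual) and MAMC (inertial); here simple placeholder encoders stand in for them, and the feature dimensions, number of action classes, and concatenation-based fusion are assumptions for illustration only, not the authors' implementation.

```python
# Minimal sketch of a dual-branch, late-fusion HAR model (hedged illustration).
# Placeholder encoders stand in for VideoMamba and MAMC; all dimensions are assumed.
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """Placeholder for the visual branch: maps a clip (B, T, 3, H, W) to a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, clip):                   # clip: (B, T, 3, H, W)
        x = clip.permute(0, 2, 1, 3, 4)        # -> (B, 3, T, H, W) for Conv3d
        x = self.backbone(x).flatten(1)        # -> (B, 32)
        return self.proj(x)                    # -> (B, feat_dim)

class InertialBranch(nn.Module):
    """Placeholder for the inertial branch: maps an IMU window (B, T, C) to a feature vector."""
    def __init__(self, in_channels=6, feat_dim=128):
        super().__init__()
        self.gru = nn.GRU(in_channels, feat_dim, batch_first=True)

    def forward(self, imu):                    # imu: (B, T, C)
        _, h = self.gru(imu)                   # h: (1, B, feat_dim)
        return h.squeeze(0)                    # -> (B, feat_dim)

class DualBranchHAR(nn.Module):
    """Late fusion: concatenate branch features and classify the action."""
    def __init__(self, num_classes=32, vis_dim=256, imu_dim=128):
        super().__init__()
        self.visual = VisualBranch(vis_dim)
        self.inertial = InertialBranch(feat_dim=imu_dim)
        self.classifier = nn.Linear(vis_dim + imu_dim, num_classes)

    def forward(self, clip, imu):
        fused = torch.cat([self.visual(clip), self.inertial(imu)], dim=-1)
        return self.classifier(fused)          # class logits

# Usage with dummy inputs: an 8-frame 112x112 clip and a 100-step 6-axis IMU window.
model = DualBranchHAR()
logits = model(torch.randn(2, 8, 3, 112, 112), torch.randn(2, 100, 6))
print(logits.shape)  # torch.Size([2, 32])
```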

Published

27-09-2025

How to Cite

[1] T.-H. Le et al., “Mamba-MHAR: An efficient multimodal framework for human action recognition”, J. Comput. Sci. Cybern., vol. 41, no. 3, pp. 245–264, Sep. 2025.

Issue

Vol. 41 No. 3 (2025)

Section

Articles
