ViTC-UReID: Enhancing unsupervised person ReID with vision transformer image encoder and camera-aware proxy learning

Hai Dang Pham, Ngoc Tu Nguyen, Ngoc Hoa Nguyen

Authors

  • Hai Dang Pham, VNU University of Engineering and Technology, 144 Xuan Thuy Street, Cau Giay Ward, Ha Noi, Viet Nam
  • Ngoc Tu Nguyen, Kennesaw State University, 1000 Chastain Road, Kennesaw, GA 30144, USA
  • Ngoc Hoa Nguyen, VNU University of Engineering and Technology, 144 Xuan Thuy Street, Cau Giay Ward, Ha Noi, Viet Nam

DOI:

https://doi.org/10.15625/1813-9663/23018

Keywords:

Unsupervised person re-identification, enhanced image representation, camera-aware learning, vision transformer.

Abstract

Person re-identification (ReID) plays a crucial role in computer vision-based surveillance systems, enabling the accurate identification of individuals across multiple camera views. Traditional convolutional neural network (CNN)-based approaches, such as those utilizing ResNet-50, struggle to capture long-range dependencies and contextual relationships, limiting their effectiveness in diverse real-world scenarios. To overcome these challenges, recent advancements have explored Vision Transformer (ViT)-based architectures, leveraging self-attention mechanisms for enhanced feature representation. In this research, we introduce a ViT-based framework, namely ViTC-UReID, for unsupervised person ReID, incorporating a camera-aware proxy learning mechanism to improve feature consistency across different camera viewpoints. ViTC-UReID also employs clustering to generate pseudo labels for the unlabeled training samples. Our approach significantly enhances cross-camera adaptation, reducing domain shift effects while maintaining strong feature discrimination. We evaluate our method on three widely used benchmarks: Market-1501, MSMT17, and CUHK03, demonstrating its superior performance compared to existing state-of-the-art unsupervised methods, particularly those utilizing camera identity cues. Furthermore, our model achieves accuracy competitive with fully supervised methods, highlighting the effectiveness of transformer-based representations in complex person ReID scenarios. Our findings reinforce the growing potential of unsupervised person ReID methods and demonstrate that ViT architectures combined with camera-aware learning can drive substantial improvements in person ReID.
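The abstract does not specify which clustering algorithm generates the pseudo labels. As an illustration only, the sketch below shows the general pseudo-labeling step common in unsupervised ReID pipelines, using DBSCAN from scikit-learn over synthetic feature vectors; the feature dimensions, cluster counts, and DBSCAN parameters here are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Synthetic stand-in for ViT encoder output: 10 hypothetical "identities",
# 20 images each, one 768-dim feature vector per image.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 768))
features = np.repeat(centers, 20, axis=0) + 0.1 * rng.normal(size=(200, 768))
features = normalize(features)  # L2-normalise so Euclidean distance tracks cosine similarity

# Cluster the features; each cluster id serves as a pseudo identity label.
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(features)

# DBSCAN marks unclustered samples as -1; such outliers are typically
# excluded from the subsequent training epoch.
keep = labels != -1
pseudo_labels = labels[keep]
print(f"{keep.sum()} samples kept, {len(set(pseudo_labels.tolist()))} pseudo identities")
```

In a real pipeline, the clustering (and hence the pseudo labels) would be recomputed periodically as the encoder's features improve over training.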

Published

23-09-2025

How to Cite

[1] H. D. Pham, N. T. Nguyen, and N. H. Nguyen, “ViTC-UReID: Enhancing unsupervised person ReID with vision transformer image encoder and camera-aware proxy learning”, J. Comput. Sci. Cybern., vol. 41, no. 3, pp. 265–284, Sep. 2025.

Section

Articles
