ViTC-UReID: Enhancing unsupervised person ReID with vision transformer image encoder and camera-aware proxy learning

Hai Dang Pham, Ngoc Tu Nguyen, Ngoc Hoa Nguyen

Authors

  • Hai Dang Pham, VNU University of Engineering and Technology, 144 Xuan Thuy Street, Cau Giay Ward, Ha Noi, Viet Nam
  • Ngoc Tu Nguyen, Kennesaw State University, 1000 Chastain Road, Kennesaw, GA 30144, USA
  • Ngoc Hoa Nguyen, VNU University of Engineering and Technology, 144 Xuan Thuy Street, Cau Giay Ward, Ha Noi, Viet Nam

DOI:

https://doi.org/10.15625/1813-9663/23018

Keywords:

Unsupervised person re-identification, enhanced image representation, camera-aware learning, vision transformer.

Abstract

Person re-identification (ReID) plays a crucial role in computer vision-based surveillance systems, enabling the accurate identification of individuals across multiple camera views. Traditional convolutional neural network (CNN)-based approaches, such as those utilizing ResNet-50, struggle to capture long-range dependencies and contextual relationships, limiting their effectiveness in diverse real-world scenarios. To overcome these challenges, recent advancements have explored Vision Transformer (ViT)-based architectures, leveraging self-attention mechanisms for enhanced feature representation. In this research, we introduce a ViT-based framework, namely ViTC-UReID, for unsupervised person ReID, incorporating a camera-aware proxy learning mechanism to improve feature consistency across different camera viewpoints. ViTC-UReID also employs clustering to generate pseudo labels for the unlabeled training samples. Our approach significantly enhances cross-camera adaptation, reducing domain shift effects while maintaining strong feature discrimination. We evaluate our method on three widely used benchmarks: Market-1501, MSMT17, and CUHK03, demonstrating its superior performance compared to existing state-of-the-art unsupervised methods, particularly those utilizing camera identity cues. Furthermore, our model achieves accuracy competitive with fully supervised methods, highlighting the effectiveness of transformer-based representations in complex person ReID scenarios. Our findings reinforce the growing potential of unsupervised person ReID methods and demonstrate that ViT architectures combined with camera-aware learning can drive substantial improvements in person ReID.
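The abstract does not specify which clustering algorithm generates the pseudo labels. As an illustration only, the sketch below shows the general pseudo-labeling step common in unsupervised ReID pipelines, using DBSCAN from scikit-learn over synthetic feature vectors; the feature dimensions, cluster counts, and DBSCAN parameters here are assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Synthetic stand-in for ViT encoder output: 10 hypothetical "identities",
# 20 images each, one 768-dim feature vector per image.
rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 768))
features = np.repeat(centers, 20, axis=0) + 0.1 * rng.normal(size=(200, 768))
features = normalize(features)  # L2-normalise so Euclidean distance tracks cosine similarity

# Cluster the features; each cluster id serves as a pseudo identity label.
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(features)

# DBSCAN marks unclustered samples as -1; such outliers are typically
# excluded from the subsequent training epoch.
keep = labels != -1
pseudo_labels = labels[keep]
print(f"{keep.sum()} samples kept, {len(set(pseudo_labels.tolist()))} pseudo identities")
```

In a real pipeline, the clustering (and hence the pseudo labels) would be recomputed periodically as the encoder's features improve over training.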

Published

23-09-2025

How to Cite

[1] H. D. Pham, N. T. Nguyen, and N. H. Nguyen, “ViTC-UReID: Enhancing unsupervised person ReID with vision transformer image encoder and camera-aware proxy learning”, J. Comput. Sci. Cybern., vol. 41, no. 3, pp. 265–284, Sep. 2025.

Section

Articles
