Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Pulkit Kumar*,1    Shuaiyi Huang*,1    Matthew Walmer1    Saketh Rambhatla1,2    Abhinav Shrivastava1   
1University of Maryland          2GenAI, Meta
* Equal contribution.

TL;DR: Semantic sampling of query points for tracking, combined with explicit motion modeling, improves few-shot action recognition.

Motivations


While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist. (a) How can we develop an effective sampling strategy for query points that balances coverage and efficiency? Our semantic-aware points adapt to object scale and semantic relevance, whereas existing methods with grid sampling miss small objects with important motion (e.g., a knife). (b) How can we explicitly model and utilize the motion patterns captured in point trajectories? We explicitly model relational motion both within each trajectory and across trajectories.
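To make the sampling contrast concrete, here is a minimal sketch of semantic-aware point selection: cluster per-patch appearance features, then give every cluster an equal point budget, so a small object that forms its own cluster still receives query points that a uniform grid would skip. The function name, the plain k-means clustering, and the equal-budget allocation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def semantic_aware_sample(features, num_points, num_clusters=8, iters=10, seed=0):
    """Hypothetical sketch of semantic-aware query-point sampling.

    features: (H, W, D) per-patch appearance features of the first frame.
    Returns an (M, 2) array of (row, col) patch coordinates, M <= num_points.
    """
    H, W, D = features.shape
    X = features.reshape(-1, D)
    rng = np.random.default_rng(seed)

    # Plain k-means (Lloyd's algorithm); the paper's clustering may differ.
    centers = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(num_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)

    # Equal point budget per cluster, regardless of cluster area, so small
    # semantic regions are not drowned out by large background clusters.
    per_cluster = max(1, num_points // num_clusters)
    picks = []
    for k in range(num_clusters):
        idx = np.flatnonzero(labels == k)
        if len(idx):
            chosen = rng.choice(idx, min(per_cluster, len(idx)), replace=False)
            picks.extend(chosen.tolist())
    picks = np.array(picks[:num_points])
    return np.stack([picks // W, picks % W], axis=1)
```

A uniform grid, by contrast, would allocate points proportionally to image area, which is exactly what lets small-but-relevant objects slip through.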

Abstract

Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym.

Method


Trokens transforms trajectory points into semantic-aware relational tokens for action recognition.
(A) Given an input video, we extract appearance tokens using DINOv2. (B) We then cluster these tokens and sample semantic-aware points in the initial frame, which are tracked using CoTracker [25] to obtain point trajectories. (C) We compute intra- and inter-motion features, reorder appearance tokens via token alignment [28], and fuse them with motion features via element-wise addition to form semantic-aware relational trajectory tokens. (D) Finally, we feed these tokens into a Decoupled Space-Time Transformer for few-shot action classification.
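The motion features in step (C) can be sketched as follows: an intra-trajectory descriptor that histograms the orientation of frame-to-frame displacements (a HoD in spirit), and inter-trajectory relations as pairwise offsets between trajectories at each frame. The bin count, magnitude weighting, and pairwise-offset form are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def histogram_of_displacements(traj, num_bins=8):
    """Sketch of a Histogram of Oriented Displacements (HoD) descriptor.

    traj: (T, 2) point trajectory in (x, y). Bins the orientation of each
    frame-to-frame displacement into num_bins directions, weighted by
    displacement magnitude, then L1-normalizes.
    """
    disp = np.diff(traj, axis=0)                    # (T-1, 2) displacements
    mag = np.linalg.norm(disp, axis=1)              # step lengths
    ang = np.arctan2(disp[:, 1], disp[:, 0])        # orientations in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins, mag)                      # magnitude-weighted voting
    total = hist.sum()
    return hist / total if total > 0 else hist

def inter_trajectory_relations(trajs):
    """Assumed form of inter-trajectory motion: per-frame pairwise offsets.

    trajs: (N, T, 2) -> (N, N, T, 2), the offset of each trajectory
    from every other trajectory at every frame.
    """
    return trajs[:, None, :, :] - trajs[None, :, :, :]
```

For example, a point moving steadily to the right concentrates all histogram mass in a single orientation bin, which is what makes the descriptor discriminative across motion directions.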

Qualitative Results


Visualization of action trajectory similarities across four classes, where our semantic-based sampling enables object-focused trajectories. Each quadrant demonstrates intra-class motion consistency while maintaining inter-class discriminative features.





Unfolding Something: Examples 1-3
Twisting Something: Examples 1-3
Putting Something Next to Something: Examples 1-3
Poking a Hole into Something: Examples 1-3

More qualitative results of our semantic-aware point trajectories. For each action class, we randomly selected videos and overlaid them with our extracted semantic-aware point trajectories. Our method successfully focuses on action-relevant objects, even when they are small. We observe that trajectories from the same action class follow very similar motion patterns.

Quantitative Results


Trokens achieves state-of-the-art few-shot action recognition performance across 1, 3, and 5-shot settings on SSV2, Kinetics, UCF-101, HMDB-51, and FineGym datasets, outperforming all contemporary methods.


Trokens offsets the computational cost of clustering by selecting points more efficiently, achieving higher performance with fewer tracking points. On both SSV2 Small and SSV2 Full, Trokens with just 32 points surpasses TATs (uniform sampling) with 256 points, while using 82% fewer inference-time FLOPs overall.

BibTeX

@inproceedings{kumar2025trokens,
  title={Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition},
  author={Kumar, Pulkit and Huang, Shuaiyi and Walmer, Matthew and Rambhatla, Sai Saketh and Shrivastava, Abhinav},
  booktitle={International Conference on Computer Vision},
  year={2025}
}

@inproceedings{kumar2024trajectory,
  title={Trajectory-aligned Space-time Tokens for Few-shot Action Recognition},
  author={Kumar, Pulkit and Padmanabhan, Namitha and Luo, Luke and Rambhatla, Sai Saketh and Shrivastava, Abhinav},
  booktitle={European Conference on Computer Vision},
  pages={474--493},
  year={2024},
  organization={Springer}
}