Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition

Pulkit Kumar*,1    Shuaiyi Huang*,1    Matthew Walmer1    Saketh Rambhatla1,2    Abhinav Shrivastava1   
1University of Maryland          2GenAI, Meta
* Equal contribution.

TL;DR: Semantic sampling of query points for tracking, combined with explicit motion modeling, improves few-shot action recognition.

Motivations


While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist. (a) How can we develop an effective sampling strategy for query points that balances coverage and efficiency? Our semantic-aware points adapt to object scale and semantic relevance, whereas existing methods with grid sampling miss small objects with important motion (e.g., a knife). (b) How can we explicitly model and utilize the motion patterns captured in point trajectories? We explicitly model relational motion both within each trajectory and across trajectories.
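To make the sampling contrast concrete, here is a minimal sketch of semantic-aware point selection: cluster per-patch appearance features, then give every cluster an equal point budget, so a small object that forms its own cluster still receives query points that a uniform grid would skip. The function name, the plain k-means clustering, and the equal-budget allocation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def semantic_aware_sample(features, num_points, num_clusters=8, iters=10, seed=0):
    """Hypothetical sketch of semantic-aware query-point sampling.

    features: (H, W, D) per-patch appearance features of the first frame.
    Returns an (M, 2) array of (row, col) patch coordinates, M <= num_points.
    """
    H, W, D = features.shape
    X = features.reshape(-1, D)
    rng = np.random.default_rng(seed)

    # Plain k-means (Lloyd's algorithm); the paper's clustering may differ.
    centers = X[rng.choice(len(X), num_clusters, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(num_clusters):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)

    # Equal point budget per cluster, regardless of cluster area, so small
    # semantic regions are not drowned out by large background clusters.
    per_cluster = max(1, num_points // num_clusters)
    picks = []
    for k in range(num_clusters):
        idx = np.flatnonzero(labels == k)
        if len(idx):
            chosen = rng.choice(idx, min(per_cluster, len(idx)), replace=False)
            picks.extend(chosen.tolist())
    picks = np.array(picks[:num_points])
    return np.stack([picks // W, picks % W], axis=1)
```

A uniform grid, by contrast, would allocate points proportionally to image area, which is exactly what lets small-but-relevant objects slip through.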

Abstract

Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym.

Method


Trokens transforms trajectory points into semantic-aware relational tokens for action recognition.
(A) Given an input video, we extract appearance tokens using DINOv2. (B) We then cluster these tokens and sample semantic-aware points in the initial frame, which are tracked using CoTracker [25] to obtain point trajectories. (C) We compute intra- and inter-motion features, reorder appearance tokens via token alignment [28], and fuse them with motion features via element-wise addition to form semantic-aware relational trajectory tokens. (D) Finally, we feed these tokens into a Decoupled Space-Time Transformer for few-shot action classification.
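The motion features in step (C) can be sketched as follows: an intra-trajectory descriptor that histograms the orientation of frame-to-frame displacements (a HoD in spirit), and inter-trajectory relations as pairwise offsets between trajectories at each frame. The bin count, magnitude weighting, and pairwise-offset form are assumptions for illustration; the paper's exact formulation may differ.

```python
import numpy as np

def histogram_of_displacements(traj, num_bins=8):
    """Sketch of a Histogram of Oriented Displacements (HoD) descriptor.

    traj: (T, 2) point trajectory in (x, y). Bins the orientation of each
    frame-to-frame displacement into num_bins directions, weighted by
    displacement magnitude, then L1-normalizes.
    """
    disp = np.diff(traj, axis=0)                    # (T-1, 2) displacements
    mag = np.linalg.norm(disp, axis=1)              # step lengths
    ang = np.arctan2(disp[:, 1], disp[:, 0])        # orientations in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * num_bins).astype(int) % num_bins
    hist = np.zeros(num_bins)
    np.add.at(hist, bins, mag)                      # magnitude-weighted voting
    total = hist.sum()
    return hist / total if total > 0 else hist

def inter_trajectory_relations(trajs):
    """Assumed form of inter-trajectory motion: per-frame pairwise offsets.

    trajs: (N, T, 2) -> (N, N, T, 2), the offset of each trajectory
    from every other trajectory at every frame.
    """
    return trajs[:, None, :, :] - trajs[None, :, :, :]
```

For example, a point moving steadily to the right concentrates all histogram mass in a single orientation bin, which is what makes the descriptor discriminative across motion directions.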

Qualitative Results


Visualization of action trajectory similarities across four classes, where our semantic-based sampling enables object-focused trajectories. Each quadrant demonstrates intra-class motion consistency while maintaining inter-class discriminative features.





Unfolding Something: Examples 1-3
Twisting Something: Examples 1-3
Putting Something Next to Something: Examples 1-3
Poking a Hole into Something: Examples 1-3

More qualitative results of our semantic-aware point trajectories. For each action class, we randomly selected videos and overlaid them with our extracted semantic-aware point trajectories. Our method successfully focuses on action-relevant objects, even when they are small. We observe that trajectories from the same action class follow very similar motion patterns.

Quantitative Results


Trokens achieves state-of-the-art few-shot action recognition performance across 1, 3, and 5-shot settings on SSV2, Kinetics, UCF-101, HMDB-51, and FineGym datasets, outperforming all contemporary methods.


Trokens offsets the computational cost of clustering by selecting points more efficiently, achieving higher performance with fewer tracking points. On both SSV2 Small and SSV2 Full, Trokens with just 32 points surpasses TATs (uniform sampling) with 256 points, while using 82% fewer inference-time FLOPs overall.

BibTeX

@inproceedings{kumar2025trokens,
  title={Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition},
  author={Kumar, Pulkit and Huang, Shuaiyi and Walmer, Matthew and Rambhatla, Sai Saketh and Shrivastava, Abhinav},
  booktitle={International Conference on Computer Vision},
  year={2025}
}

@inproceedings{kumar2024trajectory,
  title={Trajectory-aligned Space-time Tokens for Few-shot Action Recognition},
  author={Kumar, Pulkit and Padmanabhan, Namitha and Luo, Luke and Rambhatla, Sai Saketh and Shrivastava, Abhinav},
  booktitle={European Conference on Computer Vision},
  pages={474--493},
  year={2024},
  organization={Springer}
}