Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data.
Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions.
To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption.
FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking.
Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation.
These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations.
Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head.
Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption.
Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability.
Method
FineTec consists of three core modules:
① Context-aware Sequence Completion restores missing or corrupted skeleton frames using in-context learning, producing $S_{base}$;
② Skeleton-based Spatial Decomposition partitions $S_{base}$ into anatomical regions by motion intensity, generating dynamic ($S_{dyna}$) and static ($S_{stat}$) variants, which are fused into $S_{pred}$ (see the sketch after this list);
③ Physics-driven Acceleration Modeling infers joint accelerations via Lagrangian dynamics and data-driven finite differences, producing fused temporal dynamics features $\mathbf{a}_{pred}$ (sketched below).
The resulting positional ($S_{pred}$) and dynamic ($\mathbf{a}_{pred}$) features are used for downstream fine-grained action recognition.
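To make module ② concrete, here is a minimal sketch of variance-based dynamic/static splitting, assuming 2D COCO-17 input of shape (T, 17, 2); the region grouping, the variance statistic, and the median cutoff are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Illustrative grouping of the 17 COCO joints into five semantic regions
# (torso+head, arms, legs). FineTec's exact partition may differ.
REGIONS = {
    "torso":     [0, 1, 2, 3, 4, 5, 6, 11, 12],
    "left_arm":  [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg":  [11, 13, 15],
    "right_leg": [12, 14, 16],
}

def split_dynamic_static(s_base: np.ndarray):
    """Split regions into dynamic/static subgroups by motion variance.

    s_base: (T, 17, 2) completed skeleton sequence.
    Returns two lists of region names.
    """
    velocity = np.diff(s_base, axis=0)             # frame-to-frame motion, (T-1, 17, 2)
    joint_var = velocity.var(axis=0).sum(axis=-1)  # per-joint motion variance, (17,)

    region_var = {name: joint_var[idx].mean() for name, idx in REGIONS.items()}
    threshold = np.median(list(region_var.values()))  # illustrative cutoff

    dynamic = [r for r, v in region_var.items() if v > threshold]
    static = [r for r, v in region_var.items() if v <= threshold]
    return dynamic, static
```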
For more details, please refer to our paper.
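Module ③ combines Lagrangian dynamics with data-driven finite differences. The sketch below shows only the finite-difference half, using the standard second-order central stencil; the Lagrangian term depends on model details given in the paper.

```python
import numpy as np

def finite_difference_acceleration(positions: np.ndarray, dt: float = 1.0):
    """Estimate per-joint accelerations with a central difference.

    positions: (T, J, 2) joint coordinates; dt: frame interval.
    a[t] ~= (x[t+1] - 2*x[t] + x[t-1]) / dt**2
    Returns a (T, J, 2) array; boundary frames are padded by replication.
    """
    acc = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    return np.concatenate([acc[:1], acc, acc[-1:]], axis=0)
```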
Results
The main quantitative results on the two fine-grained skeleton datasets, Gym99 and Gym288-skeleton, are presented in the table above.
These results are reported across three difficulty levels: minor (25% frame missing), moderate (50% frame missing), and severe (75% frame missing).
It can be observed that the proposed FineTec framework consistently achieves the best performance under all conditions.
Notably, in the most challenging scenario—Gym288-skeleton with severe frame missing—FineTec attains a Top-1 accuracy of 78.1%, surpassing all previous skeleton-based methods.
In terms of mean class accuracy, FineTec improves over the best baseline by 13% and over the most recent prior method by 50%.
Overall, these results demonstrate that FineTec achieves outstanding effectiveness across fine-grained datasets and under all levels of difficulty.
For additional results and ablation studies, please refer to our paper.
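For reproducibility, here is a minimal sketch of how frame-level corruption at the three ratios above can be simulated. Whether FineTec's protocol zeroes frames or removes them entirely is specified in the paper, so treat the zeroing here as an assumption.

```python
import numpy as np

SEVERITY = {"minor": 0.25, "moderate": 0.50, "severe": 0.75}

def corrupt_sequence(keypoints: np.ndarray, level: str, seed: int = 0):
    """Zero out a random subset of frames to simulate temporal corruption.

    keypoints: (T, 17, 2) pose sequence; level: 'minor' | 'moderate' | 'severe'.
    Returns the corrupted copy and a boolean mask marking missing frames.
    """
    rng = np.random.default_rng(seed)
    T = keypoints.shape[0]
    n_missing = int(round(SEVERITY[level] * T))
    missing = rng.choice(T, size=n_missing, replace=False)

    corrupted = keypoints.copy()
    corrupted[missing] = 0.0  # missing frames carry no pose information
    mask = np.zeros(T, dtype=bool)
    mask[missing] = True
    return corrupted, mask
```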
Demo Video
Dataset
Overview
The Gym288-skeleton dataset is a human skeleton-based action recognition benchmark derived from the Gym288 subset of the FineGym dataset.
It provides temporally precise, fine-grained annotations of gymnastic actions along with 2D human pose sequences extracted from original video frames.
This dataset is designed to support research in:
Fine-grained action recognition
Temporally corrupted or incomplete action modeling
Skeleton-based representation learning
Physics-aware motion understanding
Key Statistics
Total instances: 38,223 action sequences
Action classes: 288 fine-grained gymnastic elements
Training samples: 28,739
Test samples: 9,484
Keypoint format: 17 COCO-style 2D joints per frame (ordering listed below)
Data layout: an `annotations` list of 38,223 dictionaries, each representing one action instance
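For reference, the conventional COCO 17-keypoint ordering, which the dataset's keypoint arrays are assumed to follow:

```python
# Standard COCO 17-keypoint ordering (index -> joint name).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
```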
Annotation Fields:

| Key | Type | Shape / Example | Description |
|---|---|---|---|
| frame_dir | str | "A0xAXXysHUo_002184_002237_0035_0036" | Unique identifier for the action clip |
| label | int | 268 | Class label (0–287, corresponding to the 288 gymnastic elements) |
| img_shape | tuple | (720, 1280) | Height and width of the original video frames |
| original_shape | tuple | (720, 1280) | Same as img_shape (kept for compatibility) |
| total_frames | int | 48 | Number of frames in the action sequence |
| keypoint | np.ndarray (float16) | (1, T, 17, 2) | 2D joint coordinates (x, y) for 17 COCO keypoints over T frames |
| keypoint_score | np.ndarray (float16) | (1, T, 17) | Confidence scores for each keypoint |
| kp_wo_gt | np.ndarray (float32) | (T, 17, 3) | Placeholder array (all zeros) for corrupted/noisy poses |
| kp_w_gt | np.ndarray (float32) | (T, 17, 3) | Ground-truth 2D poses with confidence as the third channel (x, y, score) |
Note: The first dimension (1) in keypoint and keypoint_score corresponds to the number of persons (always 1 in this dataset).
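A minimal loading sketch, assuming the dataset ships as a single pickle whose top level is a dict holding the `annotations` list described above (the filename is a placeholder):

```python
import pickle

# Placeholder filename; substitute the annotation file you downloaded.
with open("gym288_skeleton.pkl", "rb") as f:
    data = pickle.load(f)

annotations = data["annotations"]            # list of 38,223 dicts
sample = annotations[0]

print(sample["frame_dir"], sample["label"])  # clip id, class index in 0-287
print(sample["keypoint"].shape)              # (1, T, 17, 2), float16
print(sample["keypoint_score"].shape)        # (1, T, 17), float16
print(sample["kp_w_gt"].shape)               # (T, 17, 3): x, y, score
```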
Action Classes
The dataset contains 288 distinct gymnastic elements across four apparatuses: Floor Exercise (FX), Balance Beam (BB), Uneven Bars (UB), and Vault – Women (VT).
Each class represents a highly specific movement (e.g., "Switch leap with 0.5 turn", "Clear hip circle backward with 1 turn to handstand"), reflecting the fine-grained nature of competitive gymnastics scoring.
We thank the authors of FineGym for their foundational work in fine-grained action recognition. If you use this dataset, please cite both FineTec and the original FineGym paper.
Citation
@misc{shao2025finetec,
  title={FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion},
  author={Dian Shao and Mingfei Shi and Like Liu},
  year={2025},
  eprint={2512.25067},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.25067}
}