Recognizing fine-grained actions from temporally corrupted skeleton sequences remains a significant challenge, particularly in real-world scenarios where online pose estimation often yields substantial missing data.
Existing methods often struggle to accurately recover temporal dynamics and fine-grained spatial structures, resulting in the loss of subtle motion cues crucial for distinguishing similar actions.
To address this, we propose FineTec, a unified framework for Fine-grained action recognition under Temporal Corruption.
FineTec first restores a base skeleton sequence from corrupted input using context-aware completion with diverse temporal masking.
Next, a skeleton-based spatial decomposition module partitions the skeleton into five semantic regions, further divides them into dynamic and static subgroups based on motion variance, and generates two augmented skeleton sequences via targeted perturbation.
These, along with the base sequence, are then processed by a physics-driven estimation module, which utilizes Lagrangian dynamics to estimate joint accelerations.
Finally, both the fused skeleton position sequence and the fused acceleration sequence are jointly fed into a GCN-based action recognition head.
Extensive experiments on both coarse-grained (NTU-60, NTU-120) and fine-grained (Gym99, Gym288) benchmarks show that FineTec significantly outperforms previous methods under various levels of temporal corruption.
Specifically, FineTec achieves top-1 accuracies of 89.1% and 78.1% on the challenging Gym99-severe and Gym288-severe settings, respectively, demonstrating its robustness and generalizability.
Method
FineTec consists of three core modules:
① Context-aware Sequence Completion restores missing or corrupted skeleton frames using in-context learning, producing $S_{base}$;
② Skeleton-based Spatial Decomposition partitions $S_{base}$ into anatomical regions by motion intensity, generating dynamic ($S_{dyna}$) and static ($S_{stat}$) variants, which are fused into $S_{pred}$ (see the sketch after this list);
③ Physics-driven Acceleration Modeling infers joint accelerations via Lagrangian dynamics and data-driven finite differences, producing fused temporal dynamics features $\mathbf{a}_{pred}$ (sketched below).
The resulting positional ($S_{pred}$) and dynamic ($\mathbf{a}_{pred}$) features are used for downstream fine-grained action recognition.
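To make module ② concrete, here is a minimal sketch of variance-based dynamic/static splitting, assuming 2D COCO-17 input of shape (T, 17, 2); the region grouping, the variance statistic, and the median cutoff are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

# Illustrative grouping of the 17 COCO joints into five semantic regions
# (torso+head, arms, legs). FineTec's exact partition may differ.
REGIONS = {
    "torso":     [0, 1, 2, 3, 4, 5, 6, 11, 12],
    "left_arm":  [5, 7, 9],
    "right_arm": [6, 8, 10],
    "left_leg":  [11, 13, 15],
    "right_leg": [12, 14, 16],
}

def split_dynamic_static(s_base: np.ndarray):
    """Split regions into dynamic/static subgroups by motion variance.

    s_base: (T, 17, 2) completed skeleton sequence.
    Returns two lists of region names.
    """
    velocity = np.diff(s_base, axis=0)             # frame-to-frame motion, (T-1, 17, 2)
    joint_var = velocity.var(axis=0).sum(axis=-1)  # per-joint motion variance, (17,)

    region_var = {name: joint_var[idx].mean() for name, idx in REGIONS.items()}
    threshold = np.median(list(region_var.values()))  # illustrative cutoff

    dynamic = [r for r, v in region_var.items() if v > threshold]
    static = [r for r, v in region_var.items() if v <= threshold]
    return dynamic, static
```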
For more details, please refer to our paper.
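Module ③ combines Lagrangian dynamics with data-driven finite differences. The sketch below shows only the finite-difference half, using the standard second-order central stencil; the Lagrangian term depends on model details given in the paper.

```python
import numpy as np

def finite_difference_acceleration(positions: np.ndarray, dt: float = 1.0):
    """Estimate per-joint accelerations with a central difference.

    positions: (T, J, 2) joint coordinates; dt: frame interval.
    a[t] ~= (x[t+1] - 2*x[t] + x[t-1]) / dt**2
    Returns a (T, J, 2) array; boundary frames are padded by replication.
    """
    acc = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    return np.concatenate([acc[:1], acc, acc[-1:]], axis=0)
```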
Results
The main quantitative results on the two fine-grained skeleton datasets, Gym99 and Gym288-skeleton, are presented in the table above.
These results are reported across three difficulty levels: minor (25% frame missing), moderate (50% frame missing), and severe (75% frame missing).
It can be observed that the proposed FineTec framework consistently achieves the best performance under all conditions.
Notably, in the most challenging scenario—Gym288-skeleton with severe frame missing—FineTec attains a Top-1 accuracy of 78.1%, surpassing all previous skeleton-based methods.
In terms of mean class accuracy, FineTec improves over the best baseline by 13% and over the most recent prior method by 50%.
Overall, these results demonstrate that FineTec achieves outstanding effectiveness across fine-grained datasets and under all levels of difficulty.
For additional results and ablation studies, please refer to our paper.
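For reproducibility, here is a minimal sketch of how frame-level corruption at the three ratios above can be simulated. Whether FineTec's protocol zeroes frames or removes them entirely is specified in the paper, so treat the zeroing here as an assumption.

```python
import numpy as np

SEVERITY = {"minor": 0.25, "moderate": 0.50, "severe": 0.75}

def corrupt_sequence(keypoints: np.ndarray, level: str, seed: int = 0):
    """Zero out a random subset of frames to simulate temporal corruption.

    keypoints: (T, 17, 2) pose sequence; level: 'minor' | 'moderate' | 'severe'.
    Returns the corrupted copy and a boolean mask marking missing frames.
    """
    rng = np.random.default_rng(seed)
    T = keypoints.shape[0]
    n_missing = int(round(SEVERITY[level] * T))
    missing = rng.choice(T, size=n_missing, replace=False)

    corrupted = keypoints.copy()
    corrupted[missing] = 0.0  # missing frames carry no pose information
    mask = np.zeros(T, dtype=bool)
    mask[missing] = True
    return corrupted, mask
```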
Demo Video
Dataset
Overview
The Gym288-skeleton dataset is a human skeleton-based action recognition benchmark derived from the Gym288 subset of the FineGym dataset.
It provides temporally precise, fine-grained annotations of gymnastic actions along with 2D human pose sequences extracted from original video frames.
This dataset is designed to support research in:
Fine-grained action recognition
Temporally corrupted or incomplete action modeling
Skeleton-based representation learning
Physics-aware motion understanding
Key Statistics
Total instances: 38,223 action sequences
Action classes: 288 fine-grained gymnastic elements
Training samples: 28,739
Test samples: 9,484
Keypoint format: 17 COCO-style 2D joints per frame (ordering listed below)
Data layout: an `annotations` list of 38,223 dictionaries, each representing one action instance
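For reference, the conventional COCO 17-keypoint ordering, which the dataset's keypoint arrays are assumed to follow:

```python
# Standard COCO 17-keypoint ordering (index -> joint name).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
```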
Annotation Fields:

| Key | Type | Shape / Example | Description |
|---|---|---|---|
| frame_dir | str | "A0xAXXysHUo_002184_002237_0035_0036" | Unique identifier for the action clip |
| label | int | 268 | Class label (0–287, corresponding to the 288 gymnastic elements) |
| img_shape | tuple | (720, 1280) | Height and width of the original video frames |
| original_shape | tuple | (720, 1280) | Same as img_shape (kept for compatibility) |
| total_frames | int | 48 | Number of frames in the action sequence |
| keypoint | np.ndarray (float16) | (1, T, 17, 2) | 2D joint coordinates (x, y) for 17 COCO keypoints over T frames |
| keypoint_score | np.ndarray (float16) | (1, T, 17) | Confidence scores for each keypoint |
| kp_wo_gt | np.ndarray (float32) | (T, 17, 3) | Placeholder array (all zeros) for corrupted/noisy poses |
| kp_w_gt | np.ndarray (float32) | (T, 17, 3) | Ground-truth 2D poses with confidence as the third channel (x, y, score) |
Note: The first dimension (1) in keypoint and keypoint_score corresponds to the number of persons (always 1 in this dataset).
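A minimal loading sketch, assuming the dataset ships as a single pickle whose top level is a dict holding the `annotations` list described above (the filename is a placeholder):

```python
import pickle

# Placeholder filename; substitute the annotation file you downloaded.
with open("gym288_skeleton.pkl", "rb") as f:
    data = pickle.load(f)

annotations = data["annotations"]            # list of 38,223 dicts
sample = annotations[0]

print(sample["frame_dir"], sample["label"])  # clip id, class index in 0-287
print(sample["keypoint"].shape)              # (1, T, 17, 2), float16
print(sample["keypoint_score"].shape)        # (1, T, 17), float16
print(sample["kp_w_gt"].shape)               # (T, 17, 3): x, y, score
```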
Action Classes
The dataset contains 288 distinct gymnastic elements across four apparatuses: Floor Exercise (FX), Balance Beam (BB), Uneven Bars (UB), and Vault – Women (VT).
Each class represents a highly specific movement (e.g., "Switch leap with 0.5 turn", "Clear hip circle backward with 1 turn to handstand"), reflecting the fine-grained nature of competitive gymnastics scoring.
We thank the authors of FineGym for their foundational work in fine-grained action recognition. If you use this dataset, please cite both FineTec and the original FineGym paper.
Citation
@misc{shao2025finetec,
  title={FineTec: Fine-Grained Action Recognition Under Temporal Corruption via Skeleton Decomposition and Sequence Completion},
  author={Dian Shao and Mingfei Shi and Like Liu},
  year={2025},
  eprint={2512.25067},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.25067}
}