FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding

jsaifc 21 2021-08-24 Speech recognition






Dataset hierarchy

[Figure: FineGym dataset hierarchy]

FineGym organizes both the semantic and temporal annotations hierarchically. The upper part shows three levels of categorical labels, namely events (e.g. balance beam), sets (e.g. dismounts) and elements (e.g. salto forward tucked). The lower part depicts the two-level temporal annotations, i.e. the temporal boundaries of actions (in the top bar) and sub-action instances (in the bottom bar).
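The hierarchy described above can be pictured as a small nested data structure. The following is only an illustrative sketch: the class names, field names, and timestamps are hypothetical and are not taken from the FineGym annotation files. Only the three label levels (event, set, element) and the two temporal levels (action, sub-action) come from the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubAction:
    """One sub-action instance (bottom bar of the temporal annotation)."""
    start: float         # seconds; the unit is an assumption for this sketch
    end: float
    set_label: str       # e.g. "dismounts"
    element_label: str   # e.g. "salto forward tucked"


@dataclass
class Action:
    """One action with its temporal boundaries (top bar)."""
    start: float
    end: float
    event_label: str     # e.g. "balance beam"
    sub_actions: List[SubAction] = field(default_factory=list)

    def add(self, sub: SubAction) -> None:
        # A sub-action must lie inside the boundaries of its parent action.
        if not (self.start <= sub.start < sub.end <= self.end):
            raise ValueError("sub-action falls outside the action boundaries")
        self.sub_actions.append(sub)


# Made-up timestamps, purely for illustration.
routine = Action(start=12.0, end=95.0, event_label="balance beam")
routine.add(SubAction(88.0, 93.5, "dismounts", "salto forward tucked"))

# Each sub-action carries the full event / set / element label path.
labels = [(routine.event_label, s.set_label, s.element_label)
          for s in routine.sub_actions]
```

Nesting sub-actions inside actions keeps the temporal containment explicit, so an inconsistent annotation is rejected when it is added rather than discovered later.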
Sub-action examples

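We present several examples of fine-grained sub-action instances. Each group shows three element categories within the same event (BB, FX, UB and VT, i.e. balance beam, floor exercise, uneven bars and vault). Such fine-grained instances differ in subtle and challenging ways.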

[Figure: sub-action examples]

Empirical Studies and Analysis



Element-level action recognition results of representative methods.


Performance of TSN when varying the number of sampled frames during training.
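For readers unfamiliar with what "number of sampled frames" means for TSN, the sketch below shows segment-based sparse sampling in the TSN style: the clip is split into equal segments and one frame is taken from each (a random one during training, the centre one at test time). The function name and parameters are hypothetical, and this is not the code used to produce the results above.

```python
import random
from typing import List, Optional


def sample_segment_indices(num_frames: int,
                           num_segments: int,
                           train: bool = True,
                           rng: Optional[random.Random] = None) -> List[int]:
    """Return one frame index per segment (hypothetical helper)."""
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        # Segment k covers the half-open frame range [lo, hi).
        lo = k * num_frames // num_segments
        hi = (k + 1) * num_frames // num_segments
        hi = max(hi, lo + 1)  # very short clips: keep one candidate frame
        if train:
            indices.append(rng.randrange(lo, hi))   # random frame in segment
        else:
            indices.append((lo + hi - 1) // 2)      # centre frame of segment
    return indices


# 30 frames, 3 segments: test-time sampling picks the centre of each segment.
test_indices = sample_segment_indices(30, 3, train=False)
train_indices = sample_segment_indices(30, 3, train=True, rng=random.Random(0))
```

Increasing `num_segments` corresponds to sampling more frames per clip, which is the quantity varied in the figure.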





(a) Per-class performance of TSN with motion and appearance features in 6 element categories.
(b) Performance of TRN on the set UB-circles using ordered or shuffled testing frames.
(c) Mean-class accuracy of TSM and TSN on Gym99 when trained with 3 frames and tested with more frames.
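The captions above report per-class and mean-class figures. As a reminder of how these are usually computed, here is a minimal sketch assuming the standard definitions: per-class accuracy is the fraction of a class's instances that are predicted correctly, and mean-class accuracy is the unweighted average over classes. The function names and toy labels are hypothetical, and the exact evaluation protocol behind the plots may differ.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def per_class_accuracy(y_true: Sequence[Hashable],
                       y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Fraction of correctly predicted instances for each ground-truth class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


def mean_class_accuracy(y_true: Sequence[Hashable],
                        y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of the per-class accuracies."""
    acc = per_class_accuracy(y_true, y_pred)
    if not acc:
        raise ValueError("no samples given")
    return sum(acc.values()) / len(acc)


# Toy labels: class 0 is frequent and easy, class 1 is rare and harder.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
per_class = per_class_accuracy(y_true, y_pred)
mean_acc = mean_class_accuracy(y_true, y_pred)
```

In this toy case the overall accuracy is 5/6 while the mean-class accuracy is 0.75, which shows why a class-averaged metric is more informative when element categories have very different numbers of instances.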



Per-class performance of I3D pre-trained on Kinetics and ImageNet in various element categories.



Person detection and pose estimation results from AlphaPose on a Vault routine. Detections and pose estimates of the gymnast are missed in multiple frames, especially those with intense motion, and these frames are important for fine-grained recognition.


@inproceedings{shao2020finegym,
  title={FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding},
  author={Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
  booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2020}
}


We sincerely thank the outstanding annotation team for their excellent work. This work is partially supported by SenseTime Collaborative Grant on Large-scale Multi-modality Analysis and the General Research Funds (GRF) of Hong Kong (No. 14203518 and No. 14205719). The template of this webpage is borrowed from Richard Zhang.



For further questions and suggestions, please contact Dian Shao.

