1 Introduction
Human activities commonly involve repetitive actions. Temporal repetition counting aims to count the number of repetitive actions in a video [7, 21, 14, 26]. Repetition analysis has also been explored as an auxiliary cue for other video analysis applications, such as cardiac and respiratory signal recovery [16], pedestrian detection [22], 3D reconstruction [15, 24], and camera calibration [11].
This is a challenging problem because repetitive actions exhibit inherently different action patterns. We summarize four representative cases in the left part of Figure LABEL:fig:teaser. Figure LABEL:fig:teaser(a) and (b) show the most common repetitions, in which actions are performed in fixed cycles. The difficulty in detecting these two repetitions is that their cycle lengths differ greatly, so it is invalid to make restrictive assumptions about the timescale of the cycle length across actions. In Figure LABEL:fig:teaser(c), the case of playing the violin shows that the cycle length is not always a fixed value. This case contradicts (a) and (b), and hence the assumption that actions are performed in a periodic manner does not hold. In Figure LABEL:fig:teaser(d), a front crawl action can be decomposed into two sub-actions with similar motion fields: crawling with the left hand and with the right hand. As the two sub-actions are similar in motion space, contextual information in semantic space should be considered to avoid double-counting errors.
Most existing methods [21, 3, 7, 14, 16] rely heavily on the periodicity assumption. As a consequence, although the representative work [14] achieves near-perfect performance on the periodic dataset YTsegments, it cannot detect varied cycle lengths in the non-stationary video dataset QUVA Repetition [25]. While the latest work [26] addresses this problem, it detects repetitions solely based on the motion field. It therefore fails in scenarios like Figure LABEL:fig:teaser(d), in which repetitions cannot be distinguished by the motion field alone, and contextual and semantic information is required to understand the action. Based on the above observations, we argue that repetition detection should 1) exhaustively search a large range of cycle lengths to cover most unknown actions; and 2) incorporate contextual understanding, estimating cycle lengths by taking multiple periods into consideration.
In this paper, we tailor a context-aware and scale-insensitive framework based on the above principles. The data flow is shown in the right part of Figure LABEL:fig:teaser. Following rule #1, exhaustively searching all timescales addresses the cycle-length variation problem, but it leads to expensive computation. We combat this problem by proposing a coarse-to-fine cycle-length estimation strategy integrated with a regression network. In particular, we exhaustively search only the initial cycle lengths for a local video clip. The initial estimation is then propagated to the entire video, and each estimated repetition in the video is refined by our regression model. In this way, we greatly reduce the computational cost of searching for accurate cycle lengths, while still adapting to large variations of cycle lengths within the same video. The proposed regression model handles rule #2, in which we inject contextual information for estimating accurate cycle lengths. Specifically, instead of taking only one action cycle as input, we sample the video to contain two consecutive repetitions, named a double-cycle. Given such broad context, our regression model aims to relocate the previous and future repetitive cycles in a bidirectional manner. Furthermore, existing research on repetition counting lacks sufficient data; we therefore propose a new repetitive action counting benchmark, named UCFRep. It is constructed by annotating repetitive actions from the widely used dataset UCF101 [28], and it is the largest such dataset, containing 526 videos. Extensive experiments demonstrate that the proposed method is able to cope with various repetitive actions, and that it outperforms state-of-the-art methods on three benchmarks.
Our contributions are fourfold:

We propose a coarse-to-fine double-cycle estimation strategy integrated with regression, which allows fast estimation of cycle lengths for the entire video and dynamic relocation of varied cycles.

We present a bidirectional context-aware regression model. It exploits contextual information to simultaneously estimate the previous and future cycles in a bidirectional manner.

We construct a new benchmark, UCFRep, which is the largest of its kind. It contains 526 repetitive action videos annotated for training and evaluation.

The proposed network outperforms state-of-the-art methods on three benchmarks. In particular, we achieve superior performance on two unseen benchmarks (without fine-tuning), which shows that the proposed framework generalizes to complex and unknown scenes.
2 Related Work
A typical solution for temporal repetition counting is to transfer the motion field into one-dimensional signals, and then recover the repetition structure from the signal period [13, 20, 30, 19, 1]. The mainstream of these methods obtains the repetition frequency with Fourier analysis [2, 7, 21, 3]. Other methods detect the cycle by filtering [4], peak detection [29], classification [8], and singular value decomposition [6]. The above methods assume that the repetition to be estimated is periodic, so they cannot handle non-stationary repetitions. A recent work [26] addresses this limitation and proposes a novel inference scheme to detect non-stationary actions. However, it only adopts the motion field to extract features for analysis, while ignoring context dependency in the semantic domain.
Like us, some methods also use deep features for repetition analysis. Li et al. [16] propose to learn temporal dependency by applying an LSTM network to the sequence of images. They aim to recover cardiac and respiratory signals from medical image sequences, and as such their method cannot handle complex repetitions in the real world. Levy and Wolf [14] propose a classification network for live repetition estimation. Their network is designed to extract features of 20 frames from the video with a predefined sampling rate. As discussed above, a predefined cycle length cannot adapt to complex repetitive actions with large variations of cycle lengths.
Action localization [17, 27, 18] shares a similar spirit in localizing actions in the temporal domain. These methods aim to locate the temporal start and end points of each action in the entire video, hence they can be easily adapted to repetition counting. However, these methods find each action segment separately, which means that they ignore the repetition prior and do not effectively utilize the context information. In our method, we borrow the idea of anchor-based temporal regression from this literature, and further explore context dependency.
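As a point of reference for the Fourier-based family discussed above, the following minimal sketch counts repetitions in a one-dimensional signal from its dominant frequency. It is only an illustration of the general idea and does not reproduce any of the cited methods; the function name `count_by_fft`, the toy sine signal, and the frame rate are assumptions made for this example.

```python
import numpy as np

def count_by_fft(signal, fps=1.0):
    """Estimate the repetition count of a 1-D signal from its dominant frequency."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()                          # remove the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    k = 1 + np.argmax(spectrum[1:])           # skip the zero-frequency bin
    duration = len(x) / fps                   # video length in seconds
    return freqs[k] * duration                # cycles per second * seconds

# Hypothetical motion signal: 8 full cycles over 240 frames.
t = np.arange(240)
signal = np.sin(2 * np.pi * 8 * t / 240)
count = count_by_fft(signal, fps=30.0)
```

Because a single dominant frequency is assumed for the whole signal, this kind of estimate breaks down when the cycle length changes within the video, which is the non-stationary case the paper targets.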
3 Approach
In this section, we first introduce the problem formulation and give an overview of the proposed context-aware and scale-insensitive framework. We then describe the two core modules of our framework: the context-aware double-cycle regression network and the coarse-to-fine double-cycle refinement. Finally, we present the details of our newly constructed temporal repetition benchmark.
3.1 Problem Formulation
Repetition definition. Our problem setting differs from prior works, as we aim to locate both the previous and future cycles in a bidirectional manner. Given a video with frames , the repetition can be defined as follows: for a frame , if we can find a previous frame and a future frame , such that the two frame sequences and contain identical actions, then there are two repetitions in these two sequences. We refer to these two consecutive cycles as a double-cycle, and to as the previous repetitive frame of and as the next repetitive frame of .
Target formulation. In this paper, we aim to count the number of temporal repetitions in a given video. If the action is strongly periodic, we can assume that the cycle length is constant across the entire video. We can then easily estimate the number of repetitions by finding the previous and next repetitive frame locations of an arbitrary frame and calculating the number of repetitions as:
(1) 
However, the variation between repetitions cannot be neglected in the real world. To tackle this problem, we propose to calculate the repetition count by estimating of each frame in the video. Therefore we formulate the problem as
(2) 
Two cycle lengths can be computed as and . For clarity, we define as the double-cycle that describes two consecutive repetitions with frame .
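The counting idea in this formulation can be sketched as follows. Since the symbols of Equations 1 and 2 are not recoverable here, the sketch rests on an assumed reading: every frame contributes the reciprocal of its local cycle length, and the local cycle length is taken as the mean of the two cycles in its double-cycle. The names `prev_idx`, `next_idx`, and `count_repetitions` are hypothetical.

```python
def count_repetitions(prev_idx, next_idx):
    """Sum per-frame contributions 1 / (local cycle length).

    prev_idx[i] and next_idx[i] are the (hypothetical) previous and next
    repetitive frame locations of frame i; the local cycle length is taken
    as the mean of the two cycles, (next - prev) / 2. This is an assumed
    reading of the formulation, not the paper's exact equation.
    """
    total = 0.0
    for p, n in zip(prev_idx, next_idx):
        cycle_len = (n - p) / 2.0
        total += 1.0 / cycle_len
    return total

# Strictly periodic toy video: 120 frames, every cycle lasts 12 frames.
N, L = 120, 12
prev_idx = [i - L for i in range(N)]
next_idx = [i + L for i in range(N)]
count = count_repetitions(prev_idx, next_idx)   # 120 frames / 12 frames per cycle
```

In the strictly periodic case this reduces to the number of frames divided by the constant cycle length, which matches the intuition behind the periodic special case described above.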
3.2 Framework Overview
Following the target formulation, our framework is designed to predict the double-cycle for every position . We first propose a context-aware double-cycle regression network, which is illustrated in the left part of Figure 1 and described in Section 3.3. The network is designed to refine the given double-cycle for a specific position. Given an initial double-cycle, our network extracts 3D features from a set of sampled video frames and outputs a new double-cycle . With the context information extracted from a large range of video frames, the network is able to identify the repetition and regress the cycle lengths. Furthermore, this process is performed multiple times to obtain a progressively refined double-cycle.
As discussed above, an exhaustive search should be performed to cope with large cycle-length variations. It can also provide a reasonable initial double-cycle for the regression network. Instead of searching the entire video, we first search locally in the video and propagate the prediction to the other frames. The right part of Figure 1 shows our method, which is described in Section 3.4. We perform the exhaustive search once, at the middle frame of the video, so that the initial double-cycle is likely to be at the same scale as the others. It is then propagated to the other frames, and each new frame is refined locally with the regression network. In each stage we sample the positions uniformly across the video, so that the sampled positions can serve as propagation roots for the next stages. The final repetition count of the video is obtained by summing the repetition counts over all frames.
3.3 Double-cycle Regression Network
The objective of the network is to refine the input double-cycle of an assigned position . To extract features of fixed size for regression, we sample specific frames within the double-cycle. As illustrated in the left part of Figure 1, the network input is a sequence with frames, which consists of two halves. We sample the first half of the input uniformly from the range , and the second half from the range . Note that we double the sampling range to capture large context such as the double motion in Figure LABEL:fig:teaser(d). The sampled sequence is then fed into a 3D backbone model. We use 3D-ResNext101 [10, 31] pretrained on ActivityNet [5]. Other network architectures are also evaluated; please refer to the experiments for details. We remove the last classification layer and use the outputs after pooling as the context-aware 1D features (4096 dimensions for ResNext101). The features are then fed into the newly added prediction branch for classification and regression. The prediction branch consists of two fully-connected layers with multiple anchors, where we use 7 anchors with default size to detect repetitions of different sizes. Note that 14 anchors are used in total since we have two cycles .
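The frame sampling described above can be sketched as follows. The exact sampling ranges are not recoverable from the text, so this sketch assumes that the first half covers twice the previous cycle length before the position and the second half covers twice the next cycle length after it. The function name, its parameters, and the `scale=2.0` default are assumptions for illustration.

```python
import numpy as np

def sample_double_cycle(t, len_prev, len_next, num_frames, video_len, scale=2.0):
    """Return frame indices for one network input.

    Assumption: the first half is sampled uniformly from
    [t - scale*len_prev, t) and the second half from
    [t, t + scale*len_next]; indices are clipped to the video bounds.
    """
    half = num_frames // 2
    before = np.linspace(t - scale * len_prev, t, half, endpoint=False)
    after = np.linspace(t, t + scale * len_next, num_frames - half)
    idx = np.concatenate([before, after])
    return np.clip(np.round(idx), 0, video_len - 1).astype(int)

# Hypothetical double-cycle at frame 100 with cycle lengths 10 and 15.
idx = sample_double_cycle(t=100, len_prev=10, len_next=15,
                          num_frames=32, video_len=300)
```

A fixed number of sampled frames keeps the input size constant regardless of the cycle lengths, which is what lets one network serve very different timescales.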
During training, the 3D backbone and the added branch are trained end-to-end with a classification loss and a regression loss. With the network outputs for classification and for regression, we formulate the overall loss function:
(3) 
where is the cross-entropy loss after softmax and is the smooth L1 regression loss [23]. is the repetition ground truth, parameterized by scale-invariant center translation and log-space cycle-length shifting [9]. is the classification label that equals if the intersection-over-union (IoU) of the double-cycle prediction and the ground truth is greater than , and otherwise. is the weighting factor, empirically set to . During inference, the objective is equal to the regression output of the anchor with the highest classification score.
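The target parameterization and the IoU test mentioned above can be illustrated with a one-dimensional analogue of the box encoding in [9]. This is a sketch under assumptions: segments are represented as (center, length) pairs, and the helper names `encode`, `decode`, and `temporal_iou` are hypothetical rather than taken from the paper.

```python
import math

def encode(anchor, gt):
    """Regression target of a ground-truth segment relative to an anchor.

    Segments are (center, length); the offsets mirror the R-CNN box
    parameterization reduced to one temporal dimension.
    """
    (ca, la), (cg, lg) = anchor, gt
    return ((cg - ca) / la, math.log(lg / la))

def decode(anchor, offsets):
    """Invert encode(): recover (center, length) from predicted offsets."""
    (ca, la), (dc, dl) = anchor, offsets
    return (ca + dc * la, la * math.exp(dl))

def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

anchor, gt = (50.0, 16.0), (54.0, 20.0)
offsets = encode(anchor, gt)            # (0.25, log(1.25))
recovered = decode(anchor, offsets)     # back to (54.0, 20.0)
iou = temporal_iou((42.0, 58.0), (44.0, 64.0))
```

Dividing the center shift by the anchor length and taking the logarithm of the length ratio makes the targets independent of the absolute timescale, so the same regressor can serve short and long cycles.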
3.4 Coarse-to-Fine Double-cycle Refinement
Since the network extracts features from the context determined by the original double-cycle , a good initialization helps to improve localization. To this end, we propose a hierarchical pipeline that provides the initialization by determining the double-cycle in a coarse-to-fine manner. The key idea of the proposed pipeline is that the cycle-length variation between different frames can be overcome by regression, especially for neighboring frames. Therefore, in each stage we refine the results at uniformly sampled positions across the video, so that the initialization of the next stage can benefit from the neighboring predictions of the previous stage. As illustrated in the right part of Figure 1, in the th stage, we predict on the uniformly sampled positions . The prediction for each position consists of two processes: initialization and refinement. Algorithm 1 illustrates the initialization and refinement pipeline.
Initialization. For the first stage, we let the double-cycle of the middle position, , equal the values sampled from the large scale , and then determine the initial scale by the network classification confidence. In the other stages, we propagate the predictions from the previous stage as initialization, following the arrow directions in the right part of Figure 1. In particular, each position uses its refined neighbors from the previous stage for initialization. If only one neighbor is available (the first or last position of the current stage), we use it as the initialization directly. Otherwise, we average the two observations from the previous neighbor and the next neighbor. Under this scheme, we perform the computationally heavy search only once, in the first stage, and effectively reuse the refined results to initialize all other frames.
Refinement. After initialization, we refine the double-cycle estimation for the given position . With the refined results from the regression network, we update the observation at position with an exponential moving average. In other words, we update the estimation with the equation , where is the decay factor, set to empirically. Note that the refinement can be performed iteratively to achieve more precise results.
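The initialization and refinement rules above can be sketched in a few lines. The update equation and the decay value are not recoverable from the text, so the form of `ema_update` (decay applied to the current estimate) and the value `decay=0.5` are assumptions; a double-cycle is represented here as a pair of cycle lengths.

```python
def merge_neighbors(prev_neighbor, next_neighbor):
    """Initialize a double-cycle (two cycle lengths) from refined neighbors."""
    if prev_neighbor is None:          # first position of the stage
        return next_neighbor
    if next_neighbor is None:          # last position of the stage
        return prev_neighbor
    return tuple((a + b) / 2.0 for a, b in zip(prev_neighbor, next_neighbor))

def ema_update(current, predicted, decay):
    """Blend the current estimate with the network prediction (assumed form)."""
    return tuple(decay * c + (1.0 - decay) * p for c, p in zip(current, predicted))

init = merge_neighbors((10.0, 12.0), (14.0, 12.0))      # (12.0, 12.0)
only = merge_neighbors(None, (14.0, 12.0))              # boundary position
refined = ema_update(init, (16.0, 10.0), decay=0.5)     # (14.0, 11.0)
```

Repeating `ema_update` with fresh network predictions corresponds to the iterative refinement mentioned above; the moving average damps any single unreliable prediction.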
After the coarse-to-fine refinement, we obtain cycle-length predictions at the uniformly sampled positions. To count the actions from the sampled points rather than from all N frames, we use the predictions of the final stage to represent the predictions of all frames by modifying Equation 2:
(4) 
where the th stage is the final stage.
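Counting from sampled positions can be sketched as follows. Two assumptions are made because the corresponding symbols are missing: stage k holds 2^(k-1) positions placed at the centers of equal video segments (which yields the middle frame for stage 1), and each sampled position stands in for an equal share of the N frames. Both function names are hypothetical.

```python
def stage_positions(num_frames, stage):
    """Centers of 2**(stage-1) equal segments; stage 1 gives the middle frame."""
    k = 2 ** (stage - 1)
    seg = num_frames / k
    return [int((i + 0.5) * seg) for i in range(k)]

def count_from_samples(num_frames, cycle_lengths):
    """Each sampled position represents num_frames / K frames of the video."""
    frames_per_sample = num_frames / len(cycle_lengths)
    return sum(frames_per_sample / c for c in cycle_lengths)

positions = stage_positions(160, 3)                       # [20, 60, 100, 140]
count = count_from_samples(160, [8.0, 8.0, 10.0, 10.0])   # 5 + 5 + 4 + 4
```

In this toy case the first half of the video repeats every 8 frames and the second half every 10 frames, so the sampled estimate captures a change in cycle length that a single global period would miss.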
3.5 UCFRep Benchmark
The previous repetition datasets YTsegments [14] and QUVA Repetition [25] each contain only 100 videos for evaluation. Due to the lack of labeled data, the previous deep learning work [14] trains its model on synthetic data. Despite the tailored design of the simulation, the domain gap between synthetic data and real data is non-negligible. Motivated by this, we present an action repetition dataset, called the UCFRep benchmark, which aims to provide an environment for training and evaluating data-driven models. All the data in the proposed benchmark are collected from the widely used action recognition dataset UCF101 [28]. Therefore, the proposed benchmark focuses on evaluating the repetition counting performance on human actions. Although all the data are labeled with action categories, we find in our experiments that the proposed network trained on the benchmark is general enough to perform well on the previously unseen datasets YTsegments and QUVA Repetition. We introduce the benchmark from three aspects: data collection, repetition labeling, and dataset statistics.
Data collection. The original UCF101 [28] is an action recognition dataset of action videos. Its 13320 videos are collected from YouTube and classified into 101 action categories. Videos in each category are grouped into 25 groups according to whether they share common features, such as similar backgrounds and viewpoints. We check all 101 categories of the dataset and select 23 categories in which the action is performed cyclically. Examples of 10 categories are shown in Figure 2.
Repetition labeling. We annotate the temporal bounds of repetitions following a principle similar to that of QUVA Repetition [25]. Two human annotators are invited to mark the interval that contains repetitions and the repetitive frames in each video. First, from each group in the original UCF101, we ask the annotators to choose the video with the clearest repetitions. If no repetitions can be found, all the videos in this group are discarded. As a result, no repetition is found in 49 groups, and 526 videos are collected in our benchmark. For these videos, we let the annotators determine the repetition interval. We take the first frame of the interval as the reference, and ask the annotators to mark all the repetitive frames of the reference within the interval. Finally, we use the average of their annotations as the final label, and the number of repetitive frames determines the repetition count.
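Table 1: Dataset statistics of YTsegments (YTSeg), QUVA Repetition (QUVA), and the proposed UCFRep (Ours).
| | YTSeg | QUVA | Ours |
|---|---|---|---|
| Num. of Videos | 100 | 100 | 526 |
| Duration (s) | 1487 | 1754 | 3500 |
| Num. of Counts | 1080 | 1246 | 3506 |
| Count Min/Max | 4/51 | 4/63 | 3/54 |
| Min of Cycle (s) | - | 0.20 | 0.12 |
| Max of Cycle (s) | - | 7.69 | 6.76 |
| Max/Min of Cycle | - | 38.76 | 56.33 |
| Cycle Variation | 0.22 | 0.36 | 0.42 |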
Dataset statistics. We summarize the dataset statistics in Table 1. The proposed benchmark provides 526 videos in total, with a combined duration of 3500 seconds. 3506 cycle bounds are annotated to provide abundant data for training and evaluation. The benchmark also has larger variation than the previous datasets. The Max/Min of Cycle indicates the difficulty arising from the diverse timescales of different types of repetitions, and the Cycle Variation shows the cycle-length variation within a video.
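Table 2: Comparison with existing methods on QUVA Repetition [25], YTsegments [14], and UCFRep (Ours). MAE is reported as mean ± standard deviation.
| Method | QUVA MAE | QUVA OBOA | YTsegments MAE | YTsegments OBOA | UCFRep MAE | UCFRep OBOA |
|---|---|---|---|---|---|---|
| Pogalin et al. [21] | 0.385 ± 0.376 | 0.49 | 0.219 ± 0.301 | 0.68 | - | - |
| Levy and Wolf [14] | 0.482 ± 0.615 | 0.45 | 0.065 ± 0.092 | 0.90 | - | - |
| Levy and Wolf [14] (retrained on UCFRep) | 0.237 ± 0.339 | 0.52 | 0.142 ± 0.231 | 0.73 | 0.286 ± 0.574 | 0.68 |
| Runia et al. [25] | 0.232 ± 0.344 | 0.62 | 0.103 ± 0.198 | 0.89 | - | - |
| Runia et al. [26] | 0.261 ± 0.396 | 0.62 | 0.094 ± 0.174 | 0.89 | - | - |
| Ours-Resnet18 | 0.190 ± 0.327 | 0.70 | 0.062 ± 0.125 | 0.91 | 0.213 ± 0.343 | 0.69 |
| Ours-Resnet50 | 0.167 ± 0.293 | 0.75 | 0.081 ± 0.261 | 0.94 | 0.190 ± 0.288 | 0.74 |
| Ours-Resnet101 | 0.148 ± 0.290 | 0.75 | 0.066 ± 0.170 | 0.94 | 0.187 ± 0.303 | 0.77 |
| Ours-Resnext101 | 0.163 ± 0.311 | 0.76 | 0.053 ± 0.115 | 0.95 | 0.147 ± 0.243 | 0.79 |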
4 Experiments
Implementation Details.
We implement the proposed network in PyTorch and test it on an NVIDIA GeForce GTX 1080 Ti GPU. All input video frames of the network are resized to 112×112, and we construct the input frame sequences from them. For training, we use the Adam optimizer [12] with a fixed learning rate of 0.00005 and a batch size of 24. We train our network on UCFRep for 100 epochs.
We train our network with the same pipeline as the proposed coarse-to-fine refinement. Data augmentation is used to extend the annotations: if the variation of two consecutive repetitions is less than 0.3, we assume they are periodic, and we then add annotations within the interval automatically by linear interpolation.
During testing, we perform the coarse-to-fine refinement with 5 stages. Our initial exhaustive search is performed with 30 scales (ranging from 4 to ), and we conduct refinement 4 times in the 1st and 2nd stages, 2 times in the 3rd stage, and 1 time in the 4th and 5th stages, leading to 74 forward passes of the estimation network. The running time of our method depends on the number of network forward passes, and it takes 1.8 seconds on average to process a video.
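The number of network forward passes can be checked with a short calculation. The sketch assumes that stage k refines 2^(k-1) positions, which is an inference rather than a stated fact, although it is consistent with the 32 final-stage iterations reported for 6 stages and it reproduces the Iterations column of Table 3. The function name and parameter names are hypothetical.

```python
def forward_passes(final_stage, num_scales=30, refinements=(4, 4, 2)):
    """Total network forwards: exhaustive search plus per-stage refinement.

    Stage k is assumed to hold 2**(k-1) positions; stages beyond the listed
    refinement counts use a single refinement per position.
    """
    total = num_scales                      # exhaustive search at the middle frame
    for stage in range(1, final_stage + 1):
        reps = refinements[stage - 1] if stage <= len(refinements) else 1
        total += (2 ** (stage - 1)) * reps
    return total

totals = {s: forward_passes(s) for s in (3, 4, 5, 6)}
# totals == {3: 50, 4: 58, 5: 74, 6: 106}
```

The cost of each additional stage doubles, which is why the gain from a sixth stage has to be weighed against 32 extra forward passes.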
Evaluation Datasets.
We evaluate our method on three video datasets: the existing datasets YTsegments [14] and QUVA Repetition [25], as well as the proposed benchmark UCFRep. Both YTsegments and QUVA Repetition contain 100 videos with a wide range of repetitions, such as human sports and animal behaviors. We use all the videos from the YTsegments and QUVA Repetition datasets as testing sets, and all training and validation is done on the proposed UCFRep benchmark. Accordingly, we split the videos in the UCFRep benchmark into a training set and a validation set according to the group number from UCF101. The 421 videos with group numbers 1-20 form the training set, and the 105 videos with group numbers 21-25 form the validation set.
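The group-based split can be expressed compactly by reading the group number from the file name. The sketch relies on the UCF101 naming pattern `v_<Class>_g<group>_c<clip>.avi`; the helper names and the two example file names are illustrative assumptions and are not taken from the benchmark's file list.

```python
import re

def ucf101_group(filename):
    """Extract the group number from a UCF101-style name such as v_Biking_g07_c02.avi."""
    match = re.search(r"_g(\d+)_c\d+", filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1))

def split_of(filename, last_train_group=20):
    """Groups 1-20 go to training, groups 21-25 to validation."""
    return "train" if ucf101_group(filename) <= last_train_group else "val"

a = split_of("v_Biking_g07_c02.avi")      # "train"
b = split_of("v_HulaHoop_g23_c01.avi")    # "val"
```

Splitting by group rather than by video keeps clips that share backgrounds or viewpoints on the same side of the split.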
Evaluation Metric.
Following previous works [26, 14], we evaluate the proposed method by counting accuracy. For each dataset, we report the mean absolute error (MAE) and the off-by-one accuracy (OBOA) given videos
(5) 
(6) 
where is the ground-truth repetition count. The mean absolute error is a widely used metric that directly evaluates counting errors. The off-by-one accuracy tolerates rounding errors and possible cycle cut-offs at both ends of the video, as introduced in [26].
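The two metrics can be sketched as follows. Since Equations 5 and 6 are not recoverable here, the sketch assumes the forms commonly used in this line of work: MAE is the absolute count error normalized by the ground-truth count and averaged over videos, and OBOA is the fraction of videos whose predicted count is within one of the ground truth. The function names and the toy counts are hypothetical.

```python
def mae(gt_counts, pred_counts):
    """Mean absolute error, normalized by the ground-truth count per video."""
    errs = [abs(g - p) / g for g, p in zip(gt_counts, pred_counts)]
    return sum(errs) / len(errs)

def oboa(gt_counts, pred_counts):
    """Fraction of videos whose predicted count is within one of the truth."""
    hits = [abs(g - p) <= 1 for g, p in zip(gt_counts, pred_counts)]
    return sum(hits) / len(hits)

gt = [10, 4, 20]
pred = [9, 4, 25]
m = mae(gt, pred)      # (0.1 + 0.0 + 0.25) / 3
o = oboa(gt, pred)     # 2 of 3 videos are within one count
```

Under this reading, a video with few repetitions is penalized more heavily by MAE for the same absolute miss, while OBOA treats every video equally.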
4.1 Comparison with Other Methods
The comparison with existing methods for temporal repetition counting is shown in Table 2. We compare our method with two hand-crafted feature methods [21, 26] and one deep learning-based method [14]. As the complete source code of [21, 26] is unavailable, we compare with them on the two previous testing datasets, QUVA Repetition and YTsegments. We observe that our method outperforms all the previous methods. This demonstrates that our method trained on UCFRep generalizes to the common repetitions in other datasets. In particular, on the non-stationary dataset QUVA Repetition, our method improves MAE by 6.9% and OBOA by 14%, indicating that our scale-insensitive framework can better handle videos with varied cycle lengths.
To demonstrate that these improvements are brought mainly by the proposed framework rather than by the new dataset, we fine-tune the learning-based method [14] on the new benchmark using our train/validation protocol. Note that the other two competitors [21, 26] are training-free methods. The original implementation [14] uses a simple 3D network to learn on synthetic data with 20 50×50 images as input. We replace their network with ResNext101 to extract information from 32 112×112 frames, adapting it to the higher-dimensional data. We remove their ROI detection to keep the inference sequence similar to the training data, and the other implementation details follow the published official code. Not surprisingly, because of the increased amount of training data, the model retrained on the UCFRep benchmark performs better than the original implementation on the QUVA Repetition dataset. However, it does not perform well on the periodic dataset YTsegments. This is because their synthetic data is created following a strict periodic assumption, while our dataset contains various types of repetitions. Our method outperforms both the fine-tuned and the original versions on all the datasets, as their network is designed to consider only a fixed scale of action. These results also demonstrate the success of our tailored context-aware and scale-insensitive framework.
We further evaluate the robustness of our method to timescale. We follow [25] to manually speed up the video to obtain different timescales. As shown in Figure 3, processing the video at different speeds poses a challenge to the fixed-timescale method [14] ( on 1x and on 2x). Compared with the results ( on 1x and on 2x) of the existing scale-insensitive method [25], our method is more robust to speed variations ( on 1x, on 2x and on 4x), which implies that our method can detect repetitions at different timescales.
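Table 3: Performance on the UCFRep validation set when using different stages as the final stage of the coarse-to-fine refinement. MAE is reported as mean ± standard deviation.
| Final stage | MAE | OBOA | Iterations |
|---|---|---|---|
| Stage 3 | 0.157 ± 0.284 | 0.78 | 50 |
| Stage 4 | 0.156 ± 0.254 | 0.78 | 58 |
| Stage 5 | 0.147 ± 0.243 | 0.79 | 74 |
| Stage 6 | 0.151 ± 0.254 | 0.79 | 106 |

Table 4: Comparison of network designs on the UCFRep validation set. MAE is reported as mean ± standard deviation.
| Design | MAE | OBOA |
|---|---|---|
| Fixed | 0.177 ± 0.280 | 0.70 |
| Fixed+mAnchor | 0.171 ± 0.249 | 0.71 |
| Free | 0.157 ± 0.243 | 0.76 |
| Free+mAnchor | 0.147 ± 0.243 | 0.79 |

Table 5: MAE (average and standard deviation) with respect to different action classes.
| Metric | All | HulaHoop | Biking | Hammering | Soccer |
|---|---|---|---|---|---|
| MAE-avg | 0.147 | 0.120 | 0.123 | 0.154 | 0.168 |
| MAE-std | 0.243 | 0.240 | 0.062 | 0.170 | 0.111 |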
4.2 Ablation Study
We conduct the ablation study on the UCFRep validation set. In Table 3, we compare the performance of our system when using different stages as the final stage of the coarse-to-fine refinement. The process with 6 stages involves 32 iterations in the final stage, and thus needs 106 iterations overall. The results in this table indicate that involving more stages and computation in the refinement process can improve the results. We balance the trade-off between accuracy and speed, and choose stage 5 as the final stage.
We also compare the performance of our context-aware network with other network designs in Table 4. We first compare using two timescales for the two consecutive repetitions (Free) against using a single timescale shared by the consecutive repetitions (Fixed). The results with two timescales are better than those with a single timescale, which demonstrates that free timescales help to handle diverse cycle lengths. In addition, the multi-anchor design (mAnchor) achieves the best performance when combined with the two timescales. This implies that the regression can refine the cycle length over a large range, and thus benefits from multi-anchor predictions that focus on diverse timescales.
In Table 5, we further show the performance variations with respect to different action classes. We can see that the variations within the same action class are relatively small, indicating that our model is insensitive to instance and class.
4.3 Refinement Results Visualization
To show the process of coarse-to-fine refinement, we visualize the predictions of the 1st, 3rd, and 5th stages over a video from the QUVA Repetition dataset in Figure 4. We set each repetition prediction equal to the rounded mean value of the cycle length from the closest sampled position. From the results, we find that all positions receive an identical estimation in stage 1, since it involves only one local prediction. In the 3rd and 5th stages, the predictions after propagation and refinement achieve high overlap with the ground truth, showing that the proposed coarse-to-fine refinement can overcome the variation between consecutive repetitions.
5 Conclusion
In this paper, we present a novel context-aware and scale-insensitive framework for temporal repetition counting. To tackle the challenges posed by the diverse cycle lengths between videos and within repetitions, we propose a coarse-to-fine cycle refinement scheme. Instead of detecting the repetition with fixed timescales, we search the timescale over a wide range locally at the beginning and refine the scales for each temporal location in a coarse-to-fine manner. We further propose a context-aware regression network to learn contextual features for recognizing previous and future repetitions. The proposed network is designed to extract context-aware features from two consecutive repetitions, and an anchor-based back-end is tailored for detecting double-error or half-error. The proposed temporal repetition counting framework is evaluated and compared with state-of-the-art methods, and it achieves better results on the existing benchmarks as well as on our newly proposed dataset.
Acknowledgement
The work is supported by NSFC (Grant No. 61772206, U1611461, 61472145, 61702194, 61972162), Guangdong R&D key project of China (Grant No. 2018B010107003, 2020B010165004, 2020B010166003), Guangdong High-level personnel program (Grant No. 2016TQ03X319), Guangdong NSF (Grant No. 2017A030311027), Guangzhou Key Project in Industrial Technology (Grant No. 201802010027, 201802010036), and the CCF-Tencent Open Research fund (CCF-Tencent RAGR20190112).
References
 [1] A Branzan Albu, Robert Bergevin, and Sébastien Quirion. Generic temporal segmentation of cyclic human motion. Pattern Recognition, 41(1):6–21, 2008.
 [2] Ousman Azy and Narendra Ahuja. Segmentation of periodically moving objects. In ICPR, pages 1–4, 2008.
 [3] Alexia Briassouli and Narendra Ahuja. Extraction and analysis of multiple periodic motions in video sequences. IEEE TPAMI, 29(7):1244–1261, 2007.
 [4] Gertjan J Burghouts and JM Geusebroek. Quasi-periodic spatiotemporal filtering. IEEE TIP, 15(6):1572–1582, 2006.
 [5] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.
 [6] Dmitry Chetverikov and Sándor Fazekas. On motion periodicity of dynamic textures. In BMVC, pages 167–176, 2006.
 [7] Ross Cutler and Larry S. Davis. Robust real-time periodic motion detection, analysis, and applications. IEEE TPAMI, 22(8):781–796, 2000.
 [8] James Davis, Aaron Bobick, and Whitman Richards. Categorical representation and recognition of oscillatory motion patterns. In CVPR, pages 628–635, 2000.
 [9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
 [10] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, pages 6546–6555, 2018.
 [11] Shiyao Huang, Xianghua Ying, Jiangpeng Rong, Zeyu Shang, and Hongbin Zha. Camera calibration from periodic motion of a pedestrian. In CVPR, pages 3025–3033, 2016.
 [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [13] Ivan Laptev, Serge J Belongie, Patrick Perez, and Josh Wills. Periodic motion detection and segmentation via approximate sequence alignment. In ICCV, pages 816–823, 2005.
 [14] Ofir Levy and Lior Wolf. Live repetition counting. In ICCV, pages 3020–3028, 2015.
 [15] Xiu Li, Hongdong Li, Hanbyul Joo, Yebin Liu, and Yaser Sheikh. Structure from recurrent motion: From rigidity to recurrency. In CVPR, pages 3032–3040, 2018.
 [16] Xiaoxiao Li, Vivek Singh, Yifan Wu, Klaus Kirchberg, James Duncan, and Ankur Kapoor. Repetitive motion estimation network: Recover cardiac and respiratory signal from thoracic imaging. arXiv preprint arXiv:1811.03343, 2018.
 [17] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In ECCV, pages 3–19, 2018.
 [18] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In CVPR, June 2019.
 [19] ChunMei Lu and Nicola J Ferrier. Repetitive motion analysis: Segmentation and event classification. IEEE TPAMI, 26(2):258–263, 2004.
 [20] Costas Panagiotakis, Giorgos Karvounas, and Antonis Argyros. Unsupervised detection of periodic segments in videos. In ICIP, pages 923–927, 2018.
 [21] Erik Pogalin, Arnold WM Smeulders, and Andrew HC Thean. Visual quasi-periodicity. In CVPR, pages 1–8, 2008.
 [22] Yang Ran, Isaac Weiss, Qinfen Zheng, and Larry S Davis. Pedestrian detection via periodic motion analysis. IJCV, 71(2):143–160, 2007.
 [23] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
 [24] Evan Ribnick and Nikolaos Papanikolopoulos. 3d reconstruction of periodic motion from a single view. IJCV, 90(1):28–44, 2010.
 [25] Tom FH Runia, Cees GM Snoek, and Arnold WM Smeulders. Real-world repetition estimation by div, grad and curl. In CVPR, pages 9009–9017, 2018.
 [26] Tom FH Runia, Cees GM Snoek, and Arnold WM Smeulders. Repetition estimation. IJCV, 127(9):1361–1383, 2019.
 [27] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, pages 1049–1058, 2016.
 [28] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
 [29] Ashwin Thangali and Stan Sclaroff. Periodic motion detection and estimation via space-time sampling. In WACV, pages 176–182, 2005.
 [30] Christopher J Tralie and Jose A Perea. (quasi) periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2):1049–1077, 2018.
 [31] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.