Deep Learning on Video (Part Three): Diving Deeper into 3D CNNs
How 3D convolutional neural networks went from zero to hero for video-based deep learning.
This post is the third in my series on video deep learning methodologies, which I am writing as part of my work as a Research Scientist at Alegion. The goal of this series of blog posts is to both overview the history of deep learning on video and provide relevant context for researchers or practitioners looking to become involved in the field. Given the massive rise of video data in practical scenarios (e.g., IoT video, autonomous vehicles, and social media applications), analyzing visual data with temporal structure (i.e., video) is becoming increasingly important — simply extracting useful information from individual images or frames is often not enough [1].
In the first two posts of this series, I overviewed the initial deep learning-based efforts for processing video data, as well as the widely-used two-stream network architecture that revolutionized the use of deep learning for video. Within this post, I will further dive into 3D convolutional neural networks (CNNs) — the direct extension of 2D, image-based CNNs into the video domain. These networks were initially unsuccessful in garnering much interest from the research community, as their performance was poor in comparison to the previously-overviewed two-stream network. However, subsequent research greatly improved their data efficiency (i.e., how much data the network requires to perform well) and overall performance, making them more worthy of consideration.
The post will begin with some preliminary information that motivates the various ideas for improving 3D CNNs that are overviewed in this post. After necessary background has been established, I will overview the main research developments relevant to improving the performance of 3D CNNs, including factorized 3D convolutions, inflated 3D convolutions, and increased temporal extent. Each of these ideas plays a key role in mitigating some major issue that stands in the way of achieving state-of-the-art performance with 3D CNNs, as will be explained in more detail throughout the post.
Preliminaries
As overviewed within the previous post, the two-stream network architecture [2] was one of the first CNN-based video architectures to yield consistent improvements in performance over hand-crafted techniques for video-based learning tasks. The two-stream approach was widely used within the community and was leveraged in the development of several architectural variants in the following years [3, 4], leading alternative architectural approaches to be less explored. As such, the two-stream network architecture was dominant for several years, but there were nonetheless problems that needed to be addressed, which eventually led to the exploration of alternative deep learning methodologies like 3D CNNs.
What are 3D Convolutions?
The 3D CNN is a deep learning architecture comprised of several consecutive layers of 3D convolutions. As described in the initial post of this series, 3D convolutions operate by convolving, in both space and time, a four-dimensional kernel over a four-dimensional data input. These four dimensions for the data input and kernel arise from the two spatial dimensions, the channel dimension (e.g., an RGB image has three channels), and the temporal dimension (i.e., the number of video frames). See the figure below for a basic depiction.
The convolution operation above takes a 2x3x3x3 convolutional kernel (i.e., it spans two consecutive frames, three channels, and has a spatial dimension of 3x3) and convolves this kernel over three consecutive RGB frames to produce an output representation. Here, it should be noted that the temporal dimension of the kernel (two in this case) can be increased to span any number of consecutive frames given sufficient memory capacity. As such, the output representation of a 3D convolution is, by nature, spatiotemporal (i.e., it captures both spatial information within each frame and temporal information between adjacent frames).
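As a concrete illustration, the snippet below is a minimal PyTorch sketch of this operation: a 3D convolution applied to a short RGB clip, with a kernel spanning two frames and a 3x3 spatial window. The clip length, resolution, and output channel count are arbitrary choices, and note that PyTorch treats the channel dimension separately from the kernel size.

```python
import torch
import torch.nn as nn

# A clip of 3 RGB frames at 32x32 resolution:
# (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 3, 32, 32)

# A 3D convolution whose kernel spans 2 consecutive frames and a 3x3 spatial
# window. PyTorch handles the channel dimension separately (in_channels=3),
# so kernel_size only lists (time, height, width).
conv3d = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=(2, 3, 3))

out = conv3d(clip)
print(out.shape)  # torch.Size([1, 8, 2, 30, 30]) -- a spatiotemporal output
```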
No AlexNet Moment…
As new approaches were developed for deep learning on video, researchers were constantly drawing parallels to the massive impact that deep learning had had on image processing years earlier. Namely, the development of the AlexNet architecture [5] for image classification in 2012 yielded a drastic improvement over approaches based upon hand-crafted features (i.e., 5–10% absolute improvement). Thus, despite the strong performance of common video architectures (e.g., the two-stream architecture), their impact was not comparable to the paradigm shift caused by deep learning within image processing [1, 6], leading researchers to wonder what could be done to produce such an “AlexNet moment” within the video domain.
What was preventing better performance?
The reason for the more limited success of deep learning within the video domain was commonly attributed to two main issues:
Lack of large-scale, annotated video data
High complexity of video-based deep learning architectures
In comparison to image data, densely-annotated video data is much harder to find and/or produce. Thus, many datasets used for human action recognition (HAR) (i.e., one of the most common video-based learning tasks) were small for quite some time (e.g., HMDB51 and UCF101).
The performance of models trained on these small datasets was limited, and many researchers hypothesized that the creation of larger-scale datasets could help catalyze an AlexNet-level performance improvement for video-based learning methodologies [5, 6]. As such, many larger-scale, video-based datasets were subsequently developed (e.g., Kinetics, ActivityNet, Dynamic Scenes, etc.), thus mitigating the limitations of video datasets in the years to come.
In addition to the lack of large-scale datasets, naively expanding common deep learning models into the video domain yields staggering increases in computational and parameter complexity. For example, 3D CNNs have many more parameters in comparison to their image-based counterparts, as an extra time dimension is added to each convolutional kernel. Namely, each convolution operation simultaneously considers multiple video frames in computing its output (as opposed to 2D convolutions that consider a single frame/image), and the number of convolutional parameters scales linearly with the number of frames considered; see below for a depiction.
This added temporal dimension causes 3D CNNs to have higher computational costs and heavier data requirements for achieving acceptable training and generalization performance. As such, 3D CNNs were initially outperformed by simpler, low-parameter architectures that could learn meaningful representations with limited data.
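To make this scaling concrete, the snippet below compares the parameter counts of a single 2D convolution and a 3D counterpart spanning 16 frames; the 16-frame extent and 64 output channels are arbitrary choices used only for illustration.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# A single layer with 3 input channels and 64 output channels (bias omitted).
conv2d = nn.Conv2d(3, 64, kernel_size=3, bias=False)           # 3x3 spatial kernel
conv3d = nn.Conv3d(3, 64, kernel_size=(16, 3, 3), bias=False)  # spans 16 frames

print(n_params(conv2d))  # 1728  = 64 * 3 * 3 * 3
print(n_params(conv3d))  # 27648 = 64 * 3 * 16 * 3 * 3, i.e., 16x more parameters
```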
How can we move forward?
The lack of an “AlexNet moment” indicated that simpler architectures (e.g., two-stream networks) were no longer enough for video deep learning — something more powerful was needed. In comparison to these simpler architectures, 3D CNNs have much higher representational capacity — they contain many parameters and have the ability to incorporate temporal reasoning into their convolution operations [1]. As such, 3D CNNs have the potential to perform well if their issues with limited data and high parameter complexity can be alleviated.
Larger video-based datasets were collected, thus mitigating performance issues associated with the lack of sufficient data. In comparison to image recognition datasets, however, high-quality video data was still limited. Furthermore, the video deep learning community generally accepted that data scarcity relative to image recognition would be a lingering problem, as video data was fundamentally harder to annotate in comparison to individual frames or images. Thus, to complement the development of larger-scale training datasets, researchers began to investigate more efficient 3D CNN variants that enable better performance despite data limitations.
Alleviating issues with 3D Convolutions
Researchers investigated several avenues for utilizing 3D convolutions in a more computationally efficient and sensible manner. The two main approaches included:
Only using 3D convolutions in a smaller number of network layers, and allowing the remaining layers to perform 2D convolution operations [9, 10].
Factorizing 3D convolutions into separate 2D spatial and 1D temporal convolution operations applied in sequence [8, 9, 10].
Though using 3D convolutions in only a portion of network layers is a straightforward concept, factorizing 3D convolutions may require more explanation. Consider a 3D convolution with a kernel of size Fx3x3x3, which represents a standard-sized kernel over F consecutive frames. The main argument behind factorized 3D convolutions is that this single operation can be separated into two convolution operations performed in sequence: a spatial convolution applied separately over each frame and a temporal convolution that aggregates features across the outputs for each frame. In practice, such a factorized convolution operation is implemented as two 3D convolutions of size 1x3x3x3 and Fx1x1x1, thus reducing the number of trainable parameters from Fx3x3x3 to F + 3x3x3; see the figure below for a schematic depiction.
Here, we use a spatial resolution of 2x2 to save space, but 3x3 convolutions are standard in practice. Such a factorized approach has several benefits in comparison to full 3D convolutions. Primarily, it greatly reduces the number of parameters within the convolution operation — it is a kind of low-rank approximation to its standard counterpart. Although this implies that the resulting network has less representational capacity (i.e., the number of transformations it can learn is more limited), less data is required for training (i.e., due to the decrease in parameters) and computational overhead is reduced.
Beyond reducing the number of trainable parameters, such an approach increases the number of non-linearities applied within the network (i.e., one can apply an element-wise non-linearity after each convolution component instead of after the 3D convolution as a whole), which benefits the overall network’s performance [10]. Additionally, the 2D component of the factorized convolution can now be initialized with pre-trained image classification weights (e.g., from ImageNet), thus enabling larger-scale image recognition datasets to be leveraged in obtaining improved performance for video deep learning.
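Below is a minimal PyTorch sketch of such a factorized block in the spirit of the factorization described above — a 1x3x3 spatial convolution followed by an Fx1x1 temporal convolution, with a non-linearity in between. The channel counts and padding are arbitrary choices, PyTorch handles the channel dimension separately from the kernel size, and published variants (e.g., R(2+1)D) additionally tune the intermediate channel width, which is omitted here.

```python
import torch
import torch.nn as nn

class FactorizedConv3d(nn.Module):
    """A 3D convolution approximated by a spatial then a temporal convolution.

    The full kernel would span (frames, 3, 3); here it is split into a 1x3x3
    spatial kernel and a (frames)x1x1 temporal kernel, with a non-linearity
    applied in between the two components.
    """
    def __init__(self, in_channels, out_channels, frames):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, out_channels,
                                 kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(out_channels, out_channels,
                                  kernel_size=(frames, 1, 1),
                                  padding=(frames // 2, 0, 0))

    def forward(self, x):  # x: (batch, channels, frames, height, width)
        return self.temporal(self.relu(self.spatial(x)))

block = FactorizedConv3d(in_channels=3, out_channels=8, frames=3)
out = block(torch.randn(1, 3, 16, 32, 32))
print(out.shape)  # torch.Size([1, 8, 16, 32, 32])
```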
Improving Video Understanding with 3D CNNs
By exploring more efficient architectural variants, researchers were able to drastically improve the performance of 3D CNNs and surpass other popular architectures (e.g., two-stream networks). Such improved performance was generally achieved by leveraging factorized 3D convolutions, carefully selecting the layers in which 3D convolutions (as opposed to 2D convolutions) were used, and developing methods of leveraging large image recognition datasets for improved video understanding.
Within this section, I will first summarize some early work on factorized 3D convolutions for video-based learning tasks, which was performed using older, less capable video deep learning architectures. Then, I will overview more recent 3D CNN architectures that re-purpose pre-trained, 2D CNNs to achieve remarkable increases in performance. Finally, I will explain how such high-performing, recent architectures were combined with work on efficient 3D CNN architectures, allowing 3D CNNs (when combined with some general tricks for improved training) to surpass previously-observed performance with simpler architectures.
The Factorized Approach
Factorized Spatiotemporal CNNs. The idea of factorizing 3D convolutions was originally explored in [5]. Within this work, the authors claim that the success of 3D CNNs was limited because i) large-scale, supervised video data was not available and ii) 3D convolutions require lots of training data to perform well due to the many parameters they contain. Thus, some approach had to be developed to reduce the number of trainable parameters used by 3D convolutions and enable the learning of high-quality, spatiotemporal features in the limited data regime.
The proposed approach, called FstCN, divides the overall network architecture into separate spatial and temporal components. In particular, the initial layers of the network contain only 2D spatial convolutions that learn single-frame representations. Then, later network layers contain 1D temporal convolutions that capture relationships between adjacent frames. Such an approach resembles the previously-discussed factorization of 3D convolutions, but the spatial and temporal components of the factorized operation, instead of being applied in an alternating fashion, are separated into distinct network regions (i.e., all spatial convolution layers are applied first, then all temporal convolution layers are applied after).
Such an approach departs from the basic idea — as described in the previous section — that a 3D convolution can be approximated with separate spatial and temporal convolutions applied in sequence. The difference between these two approaches is depicted below, where the approach followed in [5] is denoted as a “factorized architecture”.
Interestingly, the approach in [5] was motivated by the idea of factorizing 3D convolutions into a sequence of spatial and temporal convolutions. However, the resulting architecture did not follow this approach, choosing instead to apply spatial and temporal convolutions in separate regions of the network. Because the authors did not provide specific reasoning for this choice, later work studied architectures that more closely resembled the original factorization discussed previously.
Pseudo-3D ResNets. The idea of factorizing each 3D convolutional layer into a separate spatial and temporal convolution applied in sequence was explored in [8]. Similar to previous work, the authors argue that 3D CNN architectures (e.g., C3D [11]) perform poorly due to the significant number of parameters that arise when convolution operations are extended across multiple frames. Despite the high representational capacity (i.e., ability to learn a lot of different features from data) of such 3D convolutions, sufficient training data was not available for useful representations to be learned and the computational costs of performing full 3D convolutions were significant.
Luckily, a factorized approach to 3D convolutions can lead to a reduction in both computational complexity and the number of trainable parameters. In [8], the authors begin with the 3D ResNet architecture [13] and replace all 3D convolution operations with a pair of 2D spatial (i.e., convolution with a 1x3x3x3 kernel) and 1D temporal (i.e., convolution with an Fx1x1x1 kernel, where F is the total number of adjacent frames considered) convolutions. In addition to applying these operations sequentially, the authors also attempt to apply them in parallel and in a hybrid parallel/sequential manner; see the figure below for a depiction of the options considered.
After studying each of these variations, the authors discover that the highest-performing architecture uses a mixture of these different factorized convolutions throughout the network, claiming that the variety aids in network performance. By adopting this improved, factorized architecture and leveraging several recent advancements for improved neural network training (e.g., batch normalization, residual connections, etc.), this network was able to outperform previous 3D CNN variants (e.g., C3D [11]) and other state-of-the-art methodologies (e.g., two-stream networks [2]). Such an improvement was even observed on larger, more advanced datasets such as ActivityNet and Dynamic Scenes. However, the network was criticized for its complexity, as several different convolution types were used, yielding a somewhat peculiar, non-homogeneous architecture.
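To make the sequential, parallel, and hybrid options concrete, here is a minimal PyTorch sketch of the three connectivity patterns, loosely following the variants studied in [8]; the channel counts, padding, and the omission of residual connections, non-linearities, and batch normalization are simplifications rather than the exact blocks from the paper.

```python
import torch
import torch.nn as nn

class STBlock(nn.Module):
    """Combines a spatial (S) and a temporal (T) convolution in one of three ways."""
    def __init__(self, channels, frames, mode="serial"):
        super().__init__()
        self.mode = mode
        self.S = nn.Conv3d(channels, channels, (1, 3, 3), padding=(0, 1, 1))
        self.T = nn.Conv3d(channels, channels, (frames, 1, 1),
                           padding=(frames // 2, 0, 0))

    def forward(self, x):
        if self.mode == "serial":    # spatial then temporal, in sequence
            return self.T(self.S(x))
        if self.mode == "parallel":  # spatial and temporal applied side by side
            return self.S(x) + self.T(x)
        # hybrid: the temporal conv refines the spatial output and is added back
        s = self.S(x)
        return s + self.T(s)

x = torch.randn(1, 16, 8, 32, 32)
for mode in ("serial", "parallel", "hybrid"):
    print(mode, STBlock(16, 3, mode)(x).shape)  # shape is preserved in all cases
```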
Re-purposing 2D CNNs
Inflating 2D Networks. As previously mentioned, one of the largest problems with effectively utilizing 3D CNNs was the lack of sufficient training data. In the image recognition domain, such large-scale datasets (e.g., ImageNet) were widely available, allowing 2D CNNs to far surpass the performance of alternative methodologies (e.g., hand-crafted or machine learning-based approaches). As such, one may begin to wonder whether representations learned over such image-based datasets could be transferred to the video domain. This question was answered in [6], where the two-stream inflated 3D CNN architecture (I3D) was developed.
The main idea behind the I3D architecture was to begin with a pre-trained image recognition architecture and “inflate” its parameters through time. Practically, this was implemented by taking a pre-trained, 2D convolutional kernel of size 3x3x3 and copying it F times temporally to create a 3D convolutional kernel of size Fx3x3x3 that considers F adjacent frames. Then, the weights within this 3D kernel are divided by F to ensure the expected magnitude of the convolution’s output is preserved (i.e., the output values of the new kernel should not be F times larger than before). The idea of inflating a 2D convolutional kernel is depicted below.
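A minimal PyTorch sketch of this inflation step is shown below; the layer sizes and frame count are placeholder assumptions, and in [6] the procedure is applied to every convolution of a pre-trained Inception-v1 rather than a single layer.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, frames: int) -> nn.Conv3d:
    """Inflate a pre-trained 2D convolution into a 3D one by copying its kernel
    `frames` times along a new temporal axis and dividing by `frames` so the
    expected magnitude of the output is preserved."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(frames, *conv2d.kernel_size),
                       bias=conv2d.bias is not None)
    # 2D weight: (out, in, kH, kW) -> 3D weight: (out, in, frames, kH, kW)
    w = conv2d.weight.data.unsqueeze(2).repeat(1, 1, frames, 1, 1) / frames
    conv3d.weight.data.copy_(w)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

# A (hypothetically pre-trained) 3x3 kernel inflated across 7 frames.
conv2d = nn.Conv2d(64, 128, kernel_size=3)
conv3d = inflate_conv2d(conv2d, frames=7)
print(conv3d.weight.shape)  # torch.Size([128, 64, 7, 3, 3])
```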
Using this idea, the authors of [6] could take a high-performing image recognition architecture (i.e., in this case the Inception-v1 architecture [12] was used), inflate the convolutional kernels, and apply the resulting architecture to video-based learning tasks. Such an approach was extremely effective because it allowed large image datasets to be leveraged in learning useful video representations (i.e., network parameters are initialized with pre-trained weights from image recognition tasks). Thus, the lack of large-scale video datasets became somewhat less detrimental, as one could supplement video data with existing, large-scale image datasets.
After proposing the I3D architecture, the authors further explored only inflating certain layers within the network, finding that not all network layers necessarily need to be 3D — computational savings can be realized by only utilizing 3D convolutions where needed. Additionally, the authors found that utilizing a two-stream approach (i.e., training two separate models on RGB and optical flow inputs, then merging their predictions at test time) yields improved human action recognition performance. Such a finding was solidified in later work, leading the two-stream approach to remain heavily-utilized even in later 3D CNN architectures. The final proposal within [6] (i.e., the I3D architecture with two separate optical flow and RGB streams) was found to perform extremely well, far surpassing the performance of common architectures before it (e.g., 3D CNNs, factorized 3D CNNs, vanilla two-stream architectures, etc.).
Factorizing the inflated networks. The I3D architecture was heavily utilized in later work due to its impressive performance. Notably, however, this architecture used full 3D convolutions, which — as outlined previously — are computationally expensive and contain many parameters. Thus, one may reasonably wonder whether factorized 3D convolutions could improve the performance and efficiency of the I3D architecture. Such an idea was explored concurrently in two separate works that were published nearly in tandem [9, 10].
In [9], the authors begin with the I3D network architecture and explore different possibilities for factorizing each of its 3D convolutions. In particular, the authors study i) which network layers should have 3D vs. 2D convolutions and ii) how these 3D convolutions should be implemented. In exploring different options, the authors find that utilizing factorized 3D convolutions in the earlier layers of the network and 2D convolutions in the later layers (i.e., a “bottom-heavy” architecture) yields the best performance, resulting in a network that outperforms both 2D and full 3D counterparts. Such a result implied that motion modeling (i.e., learning the relationships between neighboring frames) is a low/mid-level operation that should be handled within earlier layers of the network.
Similarly, the authors of [10] study possible changes to the I3D architecture, but they arrive at a drastically different result. Namely, the authors discover that 3D convolutions are only needed in later network layers. Thus, their proposed architecture uses 2D convolutions in early network layers, followed by factorized 3D convolutions in later layers (i.e., a “top-heavy” architecture). Such an approach was found to yield a much better speed-accuracy tradeoff, as 3D convolutions are only leveraged in the later layers where feature maps have been downsampled. Similar to previous work, however, the authors of [10] find that, given this modified I3D architecture, the best performance is still achieved with a two-stream approach with separate network streams for RGB and optical flow input.
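As a rough illustration of what a “top-heavy” layout looks like, the sketch below stacks spatial-only (1xkxk) convolutions in the early, high-resolution stages and factorized 3D convolutions in the later, downsampled stages; the layer counts, channel widths, and strides here are illustrative assumptions rather than the actual architecture from [10].

```python
import torch
import torch.nn as nn

def spatial(cin, cout, stride=1):
    # A "2D" convolution on video: no temporal extent in the kernel.
    return nn.Conv3d(cin, cout, (1, 3, 3),
                     stride=(1, stride, stride), padding=(0, 1, 1))

def temporal(c, frames=3):
    # The temporal half of a factorized 3D convolution.
    return nn.Conv3d(c, c, (frames, 1, 1), padding=(frames // 2, 0, 0))

top_heavy = nn.Sequential(
    spatial(3, 32, stride=2), nn.ReLU(),            # early layers: 2D only
    spatial(32, 64, stride=2), nn.ReLU(),
    spatial(64, 128, stride=2), nn.ReLU(),
    spatial(128, 128), temporal(128), nn.ReLU(),    # late layers: factorized 3D
    spatial(128, 256, stride=2), temporal(256), nn.ReLU(),
)

clip = torch.randn(1, 3, 16, 112, 112)
print(top_heavy(clip).shape)  # torch.Size([1, 256, 16, 7, 7])
# A "bottom-heavy" variant, as favored in [9], would reverse this ordering.
```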
At a high level, the proposals in [9, 10] show that, despite the impressive performance of the I3D architecture, better performance can be achieved by i) carefully choosing the layers in which 3D convolutions are utilized and ii) replacing full 3D convolutions with factorized variants that have fewer parameters. Such changes further improve the performance of I3D, yielding a 3D CNN architecture that surpassed the performance of numerous state-of-the-art approaches for human action recognition (and more complicated localization tasks like human action detection) at the time.
Long-Term 3D Convolutions
In addition to the proposal of factorized and inflated 3D convolutions, concurrent work studied one final property of 3D convolutions that can be used to increase their performance: their temporal extent. Put simply, the temporal extent of a 3D convolution is the number of frames considered within the convolution operation. As outlined previously, a kernel for a 3D convolution is of size Fx3x3x3, where F is the temporal extent, or the number of frames considered in computing the output of the convolution.
Although the temporal extent is typically set to a fixed value (usually around 16 frames [2, 5]), the authors of [1] extensively studied different settings of F for 3D CNNs, finding that considering more frames within the 3D convolution (e.g., 100 frames instead of 16) can drastically improve network performance. Intuitively, such a concept makes sense, as more complex, video-based tasks may depend upon long-term temporal relationships that arise within certain videos. Although such an approach yields higher computational costs (i.e., due to a much larger 3D convolutional kernel), the authors of [1] mitigated this concern by simply reducing the spatial resolution of the input video and found that the resulting architecture outperforms comparable architectures with shorter temporal extents.
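The sketch below illustrates this tradeoff: a 100-frame clip at a reduced spatial resolution carries only modestly more input values than a 16-frame clip at full resolution. The 58x58 resolution and the 64 output channels are illustrative assumptions, not the exact settings from [1].

```python
import torch
import torch.nn as nn

# The same 3D convolution applied to a short, full-resolution clip and to a
# long, spatially-downsampled clip.
conv = nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1)

short_clip = torch.randn(1, 3, 16, 112, 112)  # 16 frames, full resolution
long_clip = torch.randn(1, 3, 100, 58, 58)    # 100 frames, reduced resolution

print(short_clip.numel(), conv(short_clip).shape)  # 602112  torch.Size([1, 64, 16, 112, 112])
print(long_clip.numel(), conv(long_clip).shape)    # 1009200 torch.Size([1, 64, 100, 58, 58])
# The long clip contains ~1.7x the input values of the short one, so the extra
# temporal context comes at a modest (rather than ~6x) increase in cost.
```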
Due to its use of an outdated base architecture, the approach in [1] is outperformed by simpler video architectures (e.g., two-stream networks). However, in [9], the authors again study the use of longer temporal extents in the context of factorized I3D networks, showing that such increased temporal extent again improves the network’s performance. Such a finding revealed that using longer input clips is conducive to better performance for learning tasks that contain long-term temporal relationships. The authors claim that developing a proper understanding of such long-term temporal relationships was previously avoided because early human action recognition datasets could be solved with features extracted from one or a few adjacent frames.
Conclusion
Within this post, we overviewed developments for 3D CNNs that allowed them to become a viable architecture for deep learning on video. Although early 3D CNN variants performed poorly due to a lack of sufficient data to train their many parameters, their performance was improved by i) developing factorized variants of 3D convolutions with fewer parameters, ii) only using 3D convolutions in necessary layers, and iii) leveraging parameters trained over 2D image recognition datasets whenever possible. Models that leveraged all of these tricks [9, 10] significantly outperformed 2D, 3D, and two-stream approaches that came before them, especially when longer clips were used as input.
Despite their effectiveness, these efficient 3D CNN architectures still implicitly treated time and space symmetrically when learning over volumes of video data. Unfortunately, time oftentimes should not be treated identically to space — modeling motion is highly dependent on the speed of objects within the frame, and objects are much more likely to be moving slowly or not at all. Such a realization led to the development of SlowFast networks [14] for video deep learning, which will be covered within the next post.
Thank you so much for reading this post! I hope you found it helpful. If you have any feedback or concerns, feel free to comment on the post or reach out to me via twitter. If you’d like to follow my future work, you can follow me on Medium or check out the content on my personal website. This series of posts was completed as part of my background research as a research scientist at Alegion. If you enjoy this post, feel free to check out the company and any relevant, open positions — we are always looking to discuss with or hire motivated individuals that have an interest in deep learning-related topics!
Bibliography
[1] https://arxiv.org/abs/1604.04494
[2] https://arxiv.org/abs/1406.2199
[2] https://arxiv.org/abs/1604.06573
[3] https://arxiv.org/abs/1611.02155
[4] https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html
[5] https://arxiv.org/abs/1510.00562
[6] https://arxiv.org/abs/1705.07750
[7] https://ieeexplore.ieee.org/document/6165309
[8] https://arxiv.org/abs/1711.10305
[9] https://arxiv.org/abs/1711.11248
[10] https://arxiv.org/abs/1712.04851
[11] https://arxiv.org/abs/1412.0767