[Review] The Grand Introduction of Residual Learning

October 25, 2018 · 5 minute read

Paper Review 25-10-2018

Deep Residual Learning for Image Recognition

K. He, X. Zhang, S. Ren, J. Sun, Microsoft Research, 2016 IEEE Conference on Computer Vision and Pattern Recognition.

Aims of the paper

In this paper, the authors present residual networks as their solution to the problem of accuracy degradation in ultra-deep neural networks. The paper seeks to provide the reader with:

  • A brief overview of the accuracy degradation problem and of the previous work that influenced their approach.
  • An explanation of the residual learning concept, along with a described implementation.
  • Evidence for their claims, through experiments and comparisons with previous techniques on commonly used datasets such as ImageNet, CIFAR-10, PASCAL and MS COCO.
  • Ablation and variation experiments with the network.

The paper is aimed at readers who are familiar with the field; it avoids oversimplification and, rightly so, remains fully technical.

Paper Summary

Within image processing, deep neural networks have made significant progress in tasks such as classification and segmentation, with the network structure naturally learning low-, mid- and high-level features. As a result, network depth has been found to be crucial to accuracy on many image processing tasks. This paper sets out to solve a problem known as degradation: beyond a certain depth, accuracy saturates and then degrades rapidly as more layers are added, so that deeper models end up with a higher training error than shallower ones.


The authors motivate their approach by considering two networks, one deeper than the other. The deeper network can always reproduce the shallower one by construction: copy the shallower network's layers and set the remaining layers to identity mappings. This implies that the training error of the deeper network should be no higher than that of the shallower one, but the solvers in use at the time were unable to find such a solution. The residual learning framework therefore avoids fitting the desired underlying mapping $H(x)$ directly and instead fits the residual $F(x) = H(x) - x$: the desired mapping minus the input. The authors believe this formulation is easier to optimise, because the degradation problem itself suggests that solvers struggle to approximate identity mappings with multiple non-linear layers.
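Written out, using the same symbols as above (with $H(x)$ as the mapping a stack of layers is meant to fit), the reformulation and its consequence for the identity case are:

$$
H(x) = F(x) + x, \qquad F(x) := H(x) - x
$$

$$
H(x) = x \;\Longrightarrow\; F(x) = 0
$$

In other words, if an identity mapping is what the extra layers should learn, the residual branch only has to drive its output towards zero, rather than reproduce $x$ through a stack of non-linear layers.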


The authors designed residual learning blocks from which a deep residual network is constructed. Each block consists of 2 or 3 feed-forward layers together with a “shortcut connection” that adds the input $x$ to the output $F(x)$. They investigate their hypothesis by comparing “plain networks” (completely feed-forward, with no shortcut connections) of 18 and 34 layers with residual networks of the same depths built from these blocks. For each depth and network, the training, validation and test errors are then recorded on both the ImageNet and CIFAR-10 datasets. Furthermore, the authors experimented with different block designs and with 50-, 101- and 152-layer ResNets to demonstrate their increased accuracy at larger depths.
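The block described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: it uses small dense layers on lists instead of the convolutional layers used in the paper, and the names `relu`, `linear`, `residual_block` and `plain_block` are my own, not the authors'.

```python
# Minimal sketch of a two-layer residual block on plain Python lists.
# Assumption: dense layers stand in for the paper's convolutional layers.

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, a) for a in v]

def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual_block(x, W1, W2):
    """Compute relu(F(x) + x), where F(x) = W2 * relu(W1 * x)."""
    f = linear(W2, relu(linear(W1, x)))              # residual branch F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])   # shortcut adds x back

def plain_block(x, W1, W2):
    """Same two layers without the shortcut connection."""
    return relu(linear(W2, relu(linear(W1, x))))

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

print(residual_block(x, zeros, zeros))  # [1.0, 2.0, 3.0]
print(plain_block(x, zeros, zeros))     # [0.0, 0.0, 0.0]
```

With all weights at zero the residual block passes its (non-negative) input straight through, while the plain block collapses to zero. That is the intuition behind the claim that an identity mapping is easy for a residual block to represent.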


  1. The comparison of plain networks and ResNets at 18 and 34 layers clearly shows how ResNets avoid the degradation issue (Figure 1). In addition, the 34-layer ResNet reduces the top-1 ImageNet validation error by 3.5% relative to its plain counterpart.
  2. The deeper the ResNet stack, the more accurate the network. The 152-layer model far outperforms both the previous best networks and its shallower counterparts by large margins, again showing that deep ResNets appear easier to optimize. (An “aggressively” deep model of >1000 layers was tested and was no harder to train than the others!)
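As a small aside, the depths quoted in this summary can be reproduced by counting weighted layers. The sketch below assumes the per-stage block counts as I recall them from the paper's architecture table (they are not stated in this review), with 2 layers per block for the 18- and 34-layer models and 3 layers per block for the deeper ones; `CONFIGS` and `count_layers` are illustrative names.

```python
# Depth arithmetic for the ImageNet ResNets.
# Assumption: blocks per stage are recalled from the paper's architecture
# table; check them against the paper before relying on them.
CONFIGS = {
    18: (2, [2, 2, 2, 2]),
    34: (2, [3, 4, 6, 3]),
    50: (3, [3, 4, 6, 3]),
    101: (3, [3, 4, 23, 3]),
    152: (3, [3, 8, 36, 3]),
}

def count_layers(layers_per_block, blocks_per_stage):
    """One stem convolution + all block layers + one fully connected layer."""
    return 1 + layers_per_block * sum(blocks_per_stage) + 1

for depth, (layers_per_block, stages) in CONFIGS.items():
    print(depth, count_layers(layers_per_block, stages))
```

Each printed pair should match, for example 1 + 2 × 16 + 1 = 34 for the 34-layer network.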

Paper Review

This paper very clearly presents the concept of residual networks as a solution to the problem of degradation in very deep neural networks. As a result, there is a heavy focus on explaining the authors’ thought processes and justifications through detailed explanation and extensive experimentation. As with many deep learning papers, I had the feeling that they tried this new technique and “it just worked”. However, even with my skepticism, the level and quality of the justifications convinced me of the effectiveness of the technique. This is especially apparent because, before reading the paper, I had not heard of the problem of degradation, even after using ResNets in previous work, which suggests that ResNets effectively solved the issue.

Analysing their experimentation, the comparisons that are made are all valid, with no forced comparisons for the sake of results. The same dataset is also used for network comparison within each section. A large question hangs over their choice of 18 and 34 layers, as these depths are not justified and look like semi-arbitrary choices. Although the tables and graphs are very clear, it is not apparent how many trials were run or which aggregation methods were used to generate the results, especially in the graphs. It is good, however, that for many of the comparisons with existing techniques they chose to use standard, well-respected accuracy scores. I also found the CIFAR-10 analysis very informative, with its focus on network behaviour, where they attempted to analyse the network further. On the other hand, I was sometimes unconvinced by their claims and musings whenever an observation was brought up. In many cases these claims or explanations appeared to be borne out of the authors’ intuition or opinion and sometimes lacked evidence. This did not detract too much from the paper itself, however, as these claims were left as closing remarks. I did also find it unfortunate that no conclusion or discussion was included, as a concluding summary would have helped to sum up the dense experimentation.

Overall, I think it was a very good paper. It provided great insights for me and was very clearly written and structured. The structure was logical and there was very little waffle, as would be expected in a conference paper. The level of the paper is pitched just right for somebody who has some experience in machine learning, and it was easy to both read and understand even though it was fairly dense. With hindsight, it is easy to see how the results of this paper became so influential and widely used. I would highly recommend this paper as a must-read to anyone who is preparing to go deeper into machine learning and image processing, as it underpins many of today’s impressive image processing networks such as R-CNN.