# Understanding the difficulty of training deep feedforward neural networks

AISTATS 2010, pp. 249–256


Abstract

Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures. All these experimental results were obtained with new initialization or training mechanisms.

Introduction

- Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures.
- The authors study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1.
- Based on these considerations, the authors propose a new initialization scheme that brings substantially faster convergence.
- Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions, one may need deep architectures.
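The normalized initialization the paper proposes draws weights from a uniform distribution scaled by the layer's fan-in and fan-out, so that activation and gradient variances stay roughly constant across layers at the start of training. A minimal NumPy sketch (layer sizes here are illustrative, not the paper's):

```python
import numpy as np

def normalized_init(n_in, n_out, seed=None):
    """Normalized ("Xavier") initialization:
    W ~ U[-sqrt(6)/sqrt(n_in + n_out), sqrt(6)/sqrt(n_in + n_out)],
    which gives Var(W) = 2 / (n_in + n_out)."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = normalized_init(256, 128, seed=0)
print(W.shape)  # (256, 128)
```

Since Var(U[-a, a]) = a²/3, the variance of each weight works out to 2/(n_in + n_out), which is the compromise between keeping forward activation variance (1/n_in) and backward gradient variance (1/n_out) constant.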

Highlights

- Whereas before 2006 it appears that deep multilayer neural networks were not successfully trained, since then several algorithms have been shown to successfully train them, with experimental results showing the superiority of deeper vs less deep architectures
- Theoretical results reviewed and discussed by Bengio (2009) suggest that in order to learn the kind of complicated functions that can represent high-level abstractions, one may need deep architectures.
- Here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old multilayer neural networks
- We have found that the logistic regression or conditional log-likelihood cost function (−log P(y|x) coupled with softmax outputs) worked much better than the quadratic cost which was traditionally used to train feedforward neural networks (Rumelhart et al., 1986).
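The advantage of the log-likelihood cost over the quadratic cost can be seen directly in the gradients. The sketch below (a toy 3-class output unit, not the paper's experimental code) shows that when a softmax output is confidently wrong, the gradient of −log P(y|x) with respect to the logits stays large, while the quadratic-cost gradient is damped by the softmax Jacobian and nearly vanishes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A confidently wrong output: large logit on class 0, target is class 1.
z = np.array([6.0, -6.0, -6.0])
y = np.array([0.0, 1.0, 0.0])
p = softmax(z)

# Gradient of -log P(y|x) w.r.t. the logits is simply (p - y).
grad_nll = p - y

# Gradient of the quadratic cost 0.5*||p - y||^2 w.r.t. the logits
# is multiplied by the softmax Jacobian, which shrinks at saturation.
J = np.diag(p) - np.outer(p, p)
grad_mse = J @ (p - y)

print(np.abs(grad_nll).max())  # stays O(1): learning proceeds
print(np.abs(grad_mse).max())  # collapses toward 0: learning stalls
```

This plateau effect of the quadratic cost is exactly the kind of flat region that makes gradient descent in deep networks slow.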

Methods

**Experiments with the Sigmoid**

The sigmoid non-linearity has already been shown to slow down learning because of its non-zero mean, which induces important singular values in the Hessian (LeCun et al., 1998b).

- The graph shows the means and standard deviations of the hidden-layer activations.
- These statistics along with histograms are computed at different times during learning, by looking at activation values for a fixed set of 300 test examples.
- The authors can see at the end of training that the histogram of activation values is very different from that seen with the hyperbolic tangent (Figure 4)
- Whereas the latter yields modes of the activation distribution mostly at the extremes or around 0, the softsign network has modes of activations around its knees.
- These are the areas where there is substantial non-linearity but where the gradients would flow well.
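The kind of monitoring described above can be reproduced in a few lines: forward-propagate a fixed batch through a randomly initialized deep net and record each layer's activation mean and standard deviation. The sketch below uses illustrative sizes and the standard U[-1/√n, 1/√n] initialization (not the paper's exact setup), and shows the activation variance shrinking with depth for tanh:

```python
import numpy as np

def softsign(x):
    # softsign non-linearity: x / (1 + |x|)
    return x / (1.0 + np.abs(x))

def activation_stats(act, depth=5, width=300, n_examples=300, seed=0):
    """Push a fixed batch through a randomly initialized deep net and
    record the (mean, std) of each layer's activations."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((n_examples, width))
    stats = []
    for _ in range(depth):
        # standard initialization: U[-1/sqrt(n), 1/sqrt(n)]
        W = rng.uniform(-1.0, 1.0, (width, width)) / np.sqrt(width)
        h = act(h @ W)
        stats.append((float(h.mean()), float(h.std())))
    return stats

for name, act in [("tanh", np.tanh), ("softsign", softsign)]:
    print(name, [(round(m, 3), round(s, 3)) for m, s in activation_stats(act)])
```

With this standard initialization the per-layer standard deviation decays from the first to the last hidden layer, which is the signal-attenuation effect the normalized initialization is designed to remove.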

Conclusion

- The final consideration that the authors care for is the success of training with different strategies, and this is best illustrated with error curves showing the evolution of test error as training progresses and asymptotes.
- The authors optimized RBF SVM models on one hundred thousand Shapeset examples and obtained 59.47% test error, while on the same set the authors obtained 50.47% with a depth five hyperbolic tangent network with normalized initialization.
- These results illustrate the effect of the choice of activation and initialization.
- The authors can remark that on Shapeset-3×2, because of the task difficulty, important saturations are observed during learning; this might explain why the effects of the normalized initialization or the softsign are more visible.

- Table 1: Test error with different activation functions and initialization schemes for deep networks with 5 hidden layers. N after the activation function name indicates the use of normalized initialization. Results in bold are statistically different from non-bold ones under the null hypothesis test with p = 0.005.

Findings

- Finds that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation
- Finds that a new non-linearity that saturates less can often be beneficial
- Proposes a new initialization scheme that brings substantially faster convergence
- Focuses on analyzing what may be going wrong with good old multilayer neural networks
- Evaluates the effects on these of choices of activation function and initialization procedure
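The intuition that training behaves better when the singular values of each layer's Jacobian stay near 1 can be checked numerically. The sketch below uses an illustrative square layer and the simplification that for a near-linear layer the back-propagation Jacobian is essentially W (ignoring the activation derivative), and compares the two initialization schemes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500  # illustrative layer width

def standard_init(n_in, n_out):
    # classic heuristic: U[-1/sqrt(n_in), 1/sqrt(n_in)]
    return rng.uniform(-1.0, 1.0, (n_in, n_out)) / np.sqrt(n_in)

def normalized_init(n_in, n_out):
    # the paper's normalized initialization
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, (n_in, n_out))

# Mean singular value of the layer's (linearized) Jacobian under each scheme:
sv_standard = np.linalg.svd(standard_init(n, n), compute_uv=False)
sv_normalized = np.linalg.svd(normalized_init(n, n), compute_uv=False)
print("standard  :", round(float(sv_standard.mean()), 3))
print("normalized:", round(float(sv_normalized.mean()), 3))
```

The normalized scheme produces singular values noticeably closer to 1 than the standard heuristic, consistent with the idea that it neither attenuates nor amplifies signals as they propagate through the layers.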

References

- Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2, 1–127. Also published as a book. Now Publishers, 2009.
- Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. NIPS 19 (pp. 153–160). MIT Press.
- Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5, 157–166.
- Bergstra, J., Desjardins, G., Lamblin, P., & Bengio, Y. (2009). Quadratic polynomials learn better image features (Technical Report 1337). Département d'Informatique et de Recherche Opérationnelle, Université de Montréal.
- Bradley, D. (2009). Learning in modular systems. Doctoral dissertation, The Robotics Institute, Carnegie Mellon University.
- Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. ICML 2008.
- Erhan, D., Manzagol, P.-A., Bengio, Y., Bengio, S., & Vincent, P. (2009). The difficulty of training deep architectures and the effect of unsupervised pre-training. AISTATS’2009 (pp. 153– 160).
- Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527– 1554.
- Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). University of Toronto.
- Larochelle, H., Bengio, Y., Louradour, J., & Lamblin, P. (2009). Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, 10, 1–40.
- Larochelle, H., Erhan, D., Courville, A., Bergstra, J., & Bengio, Y. (2007). An empirical evaluation of deep architectures on problems with many factors of variation. ICML 2007.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998a). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
- LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998b). Efficient backprop. In Neural networks, tricks of the trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag.
- Mnih, A., & Hinton, G. E. (2009). A scalable hierarchical distributed language model. NIPS 21 (pp. 1081–1088).
- Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. NIPS 19.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
- Solla, S. A., Levin, E., & Fleisher, M. (1988). Accelerated learning in layered neural networks. Complex Systems, 2, 625–639.
- Vincent, P., Larochelle, H., Bengio, Y., & Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. ICML 2008.
- Weston, J., Ratle, F., & Collobert, R. (2008). Deep learning via semi-supervised embedding. ICML 2008 (pp. 1168–1175). New York, NY, USA: ACM.
- Zhu, L., Chen, Y., & Yuille, A. (2009). Unsupervised learning of probabilistic grammar-markov models for object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31, 114–128.
