Shun-ichi Amari, Jimmy Ba, Denny Wu, Atsushi Nitanda, Roger Grosse, Taiji Suzuki, Xuechen Li, Ji Xu
[1] Kenji Fukumizu,et al. Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons , 2000, Neural Computation.
[2] Ji Xu,et al. On the Optimal Weighted $\ell_2$ Regularization in Overparameterized Linear Regression , 2020, NeurIPS.
[3] Geoffrey E. Hinton,et al. Distilling the Knowledge in a Neural Network , 2015, ArXiv.
[4] Yann Ollivier,et al. Practical Riemannian Neural Networks , 2016, ArXiv.
[5] J. Zico Kolter,et al. A Continuous-Time View of Early Stopping for Least Squares Regression , 2018, AISTATS.
[6] Ruosong Wang,et al. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks , 2019, ICML.
[7] Bernhard Schölkopf,et al. Measuring Statistical Dependence with Hilbert-Schmidt Norms , 2005, ALT.
[8] Shun-ichi Amari. Understand It in 5 Minutes!? Skimming a Famous Paper: Jacot, Arthur, Gabriel, Franck and Hongler, Clément: Neural Tangent Kernel: Convergence and Generalization in Neural Networks , 2020 .
[9] Guodong Zhang,et al. Which Algorithmic Choices Matter at Which Batch Sizes? Insights From a Noisy Quadratic Model , 2019, NeurIPS.
[10] Alessandro Rudi,et al. Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes , 2018, NeurIPS.
[11] Roger B. Grosse,et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature , 2015, ICML.
[12] Lili Su,et al. On Learning Over-parameterized Neural Networks: A Functional Approximation Prospective , 2019, NeurIPS.
[13] Razvan Pascanu,et al. Revisiting Natural Gradient for Deep Networks , 2013, ICLR.
[14] Frederik Kunstner,et al. Limitations of the Empirical Fisher Approximation , 2019, NeurIPS.
[15] James Martens,et al. Deep learning via Hessian-free optimization , 2010, ICML.
[16] Lorenzo Rosasco,et al. Generalization Properties of Learning with Random Features , 2016, NIPS.
[17] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[18] Andrea Montanari,et al. Limitations of Lazy Training of Two-layers Neural Networks , 2019, NeurIPS.
[19] Nathan Srebro,et al. The Implicit Bias of Gradient Descent on Separable Data , 2017, J. Mach. Learn. Res..
[20] Surya Ganguli,et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks , 2013, ICLR.
[21] Razvan Pascanu,et al. Sharp Minima Can Generalize For Deep Nets , 2017, ICML.
[22] Gang Niu,et al. Do We Need Zero Training Loss After Achieving Zero Training Error? , 2020, ICML.
[23] Qian Qian,et al. The Implicit Bias of AdaGrad on Separable Data , 2019, NeurIPS.
[24] Shun-ichi Amari,et al. Universal statistics of Fisher information in deep neural networks: mean field approach , 2018, AISTATS.
[25] James Martens,et al. New perspectives on the natural gradient method , 2014, ArXiv.
[26] Le Song,et al. Diverse Neural Network Learns True Target Functions , 2016, AISTATS.
[27] Nathan Srebro,et al. Characterizing Implicit Bias in Terms of Optimization Geometry , 2018, ICML.
[28] Noureddine El Karoui,et al. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators : rigorous results , 2013, 1311.2445.
[29] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[30] Shun-ichi Amari,et al. Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient , 1996, NIPS.
[31] Lorenzo Rosasco,et al. Implicit Regularization of Accelerated Methods in Hilbert Spaces , 2019, NeurIPS.
[32] Razvan Pascanu,et al. Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines , 2013, ICLR.
[33] S. Péché,et al. Eigenvectors of some large sample covariance matrix ensembles , 2009 .
[34] John C. Duchi,et al. Necessary and Sufficient Geometries for Gradient Methods , 2019, NeurIPS.
[35] D K Smith,et al. Numerical Optimization , 2001, J. Oper. Res. Soc..
[36] Andrea Montanari,et al. Surprises in High-Dimensional Ridgeless Least Squares Interpolation , 2019, Annals of statistics.
[37] Hongyang Zhang,et al. Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations , 2017, COLT.
[38] Ji Xu,et al. On the number of variables to use in principal component regression , 2019, NeurIPS.
[39] James Martens,et al. New Insights and Perspectives on the Natural Gradient Method , 2014, J. Mach. Learn. Res..
[40] Felipe Cucker,et al. On the mathematical foundations of learning , 2001 .
[41] Peng Xu,et al. Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study , 2017, SDM.
[42] Ilya Sutskever,et al. Training Deep and Recurrent Networks with Hessian-Free Optimization , 2012, Neural Networks: Tricks of the Trade.
[43] Tengyuan Liang,et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize , 2018, The Annals of Statistics.
[44] Daniel Hsu,et al. How many variables should be entered in a principal component regression equation? , 2019, NeurIPS.
[45] Francis Bach,et al. Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks , 2019, NeurIPS.
[46] Taiji Suzuki,et al. Gradient Descent in RKHS with Importance Labeling , 2020, AISTATS.
[47] Kaifeng Lyu,et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks , 2019, ICLR.
[48] Neha S. Wadia,et al. Whitening and second order optimization both destroy information about the dataset, and can make generalization impossible , 2020, ArXiv.
[49] P. Wedin. Perturbation theory for pseudo-inverses , 1973 .
[50] Yuanzhi Li,et al. What Can ResNet Learn Efficiently, Going Beyond Kernels? , 2019, NeurIPS.
[51] Sanjeev Arora,et al. Implicit Regularization in Deep Matrix Factorization , 2019, NeurIPS.
[52] Stefan Wager,et al. High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification , 2015, 1507.03003.
[53] Zhenyu Liao,et al. A Random Matrix Approach to Neural Networks , 2017, ArXiv.
[54] Yann Ollivier,et al. Riemannian metrics for neural networks I: feedforward networks , 2013, 1303.0818.
[55] Yi Zhang,et al. The Case for Full-Matrix Adaptive Regularization , 2018, ArXiv.
[56] Nathan Srebro,et al. Implicit Bias of Gradient Descent on Linear Convolutional Networks , 2018, NeurIPS.
[57] Francis Bach,et al. Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss , 2020, COLT.
[58] Joan Bruna,et al. Gradient Dynamics of Shallow Univariate ReLU Networks , 2019, NeurIPS.
[60] Geoffrey E. Hinton,et al. Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.
[61] Frederik Kunstner,et al. Limitations of the empirical Fisher approximation for natural gradient descent , 2019, NeurIPS.
[62] Mikhail Belkin,et al. Reconciling modern machine learning and the bias-variance trade-off , 2018, ArXiv.
[63] F. Rubio,et al. Spectral convergence for a general class of random matrices , 2011 .
[64] Nathan Srebro,et al. Kernel and Rich Regimes in Overparametrized Models , 2019, COLT.
[65] Mikhail Belkin,et al. Classification vs regression in overparameterized regimes: Does the loss function matter? , 2020, J. Mach. Learn. Res..
[66] Yuan Cao,et al. Towards Understanding the Spectral Bias of Deep Learning , 2021, IJCAI.
[67] Andrew M. Saxe,et al. High-dimensional dynamics of generalization error in neural networks , 2017, Neural Networks.
[68] Julien Mairal,et al. On the Inductive Bias of Neural Tangent Kernels , 2019, NeurIPS.
[69] Lorenzo Rosasco,et al. Asymptotics of Ridge(less) Regression under General Source Condition , 2020, AISTATS.
[70] Guodong Zhang,et al. Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks , 2019, NeurIPS.
[71] Babak Hassibi,et al. Stochastic Gradient/Mirror Descent: Minimax Optimality and Implicit Regularization , 2018, ICLR.
[72] Mikhail Belkin,et al. Reconciling modern machine-learning practice and the classical bias–variance trade-off , 2018, Proceedings of the National Academy of Sciences.
[73] Nicolas Le Roux,et al. On the interplay between noise and curvature and its effect on optimization and generalization , 2019, AISTATS.
[74] Colin Wei,et al. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel , 2018, NeurIPS.
[75] Nadav Cohen,et al. Implicit Regularization in Deep Learning May Not Be Explainable by Norms , 2020, NeurIPS.
[76] Philip M. Long,et al. Benign overfitting in linear regression , 2019, Proceedings of the National Academy of Sciences.
[77] Mikhail Belkin,et al. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate , 2018, NeurIPS.
[78] Boaz Barak,et al. Deep double descent: where bigger models and more data hurt , 2019, ICLR.
[79] Richard Socher,et al. Improving Generalization Performance by Switching from Adam to SGD , 2017, ArXiv.
[80] Aaron Mishkin,et al. To Each Optimizer a Norm, To Each Norm its Generalization , 2020, ArXiv.
[81] Christos Thrampoulidis,et al. Sharp Asymptotics and Optimal Performance for Inference in Binary Models , 2020, AISTATS.
[82] Roger B. Grosse,et al. A Kronecker-factored approximate Fisher matrix for convolution layers , 2016, ICML.
[83] Nathan Srebro,et al. Implicit Regularization in Matrix Factorization , 2017, 2018 Information Theory and Applications Workshop (ITA).
[84] Zhenyu Liao,et al. The Dynamics of Learning: A Random Matrix Approach , 2018, ICML.
[85] Andrea Montanari,et al. The generalization error of max-margin linear classifiers: High-dimensional asymptotics in the overparametrized regime , 2019 .
[86] Arthur Jacot,et al. Neural tangent kernel: convergence and generalization in neural networks (invited paper) , 2018, NeurIPS.
[87] Lei Wu. How SGD Selects the Global Minima in Over-parameterized Learning : A Dynamical Stability Perspective , 2018 .
[88] Don R. Hush,et al. Optimal Rates for Regularized Least Squares Regression , 2009, COLT.
[89] Jorge Nocedal,et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.
[90] Nathan Srebro,et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning , 2017, NIPS.
[91] Qiang Liu,et al. On the Margin Theory of Feedforward Neural Networks , 2018, ArXiv.
[92] Babak Hassibi,et al. Stochastic Mirror Descent on Overparameterized Nonlinear Models , 2019, IEEE Transactions on Neural Networks and Learning Systems.
[93] Christos Thrampoulidis,et al. A Model of Double Descent for High-dimensional Binary Linear Classification , 2019, ArXiv.
[94] Andrea Montanari,et al. The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve , 2019, Communications on Pure and Applied Mathematics.
[95] Yue M. Lu,et al. Universality Laws for High-Dimensional Learning with Random Features , 2020, ArXiv.
[96] Florent Krzakala,et al. Double Trouble in Double Descent : Bias and Variance(s) in the Lazy Regime , 2020, ICML.
[97] Taiji Suzuki,et al. Generalization of Two-layer Neural Networks: An Asymptotic Viewpoint , 2020, ICLR.
[98] Samy Bengio,et al. Understanding deep learning requires rethinking generalization , 2016, ICLR.
[99] Shun-ichi Amari,et al. The Normalization Method for Alleviating Pathological Sharpness in Wide Neural Networks , 2019, NeurIPS.
[100] Shun-ichi Amari,et al. Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.
[101] A. Caponnetto,et al. Optimal Rates for the Regularized Least-Squares Algorithm , 2007, Found. Comput. Math..
[102] Florent Krzakala,et al. Generalisation error in learning with random features and the hidden manifold model , 2020, ICML.
[103] Pradeep Ravikumar,et al. Connecting Optimization and Regularization Paths , 2018, NeurIPS.
[104] Matus Telgarsky,et al. The implicit bias of gradient descent on nonseparable data , 2019, COLT.
[105] Guodong Zhang,et al. Three Mechanisms of Weight Decay Regularization , 2018, ICLR.
[106] Taiji Suzuki,et al. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality , 2018, ICLR.
[107] Bin Dong,et al. Distillation ≈ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network , 2019, ArXiv.
[108] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[109] Di He,et al. A Gram-Gauss-Newton Method Learning Overparameterized Deep Neural Networks for Regression Problems , 2019, ArXiv.
[110] Matus Telgarsky,et al. Gradient descent aligns the layers of deep linear networks , 2018, ICLR.
[111] Greg Yang,et al. Feature Learning in Infinite-Width Neural Networks , 2020, ArXiv.
[112] Lorenzo Rosasco,et al. Optimal Rates for Multi-pass Stochastic Gradient Methods , 2016, J. Mach. Learn. Res..
[113] Rich Caruana,et al. Do Deep Nets Really Need to be Deep? , 2013, NIPS.
[114] Stanislav Minsker. On Some Extensions of Bernstein's Inequality for Self-adjoint Operators , 2011, 1112.5448.
[115] Samet Oymak,et al. Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks , 2019, AISTATS.
[116] Mikhail Belkin,et al. Does data interpolation contradict statistical optimality? , 2018, AISTATS.
[117] Yoram Singer,et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..
[118] Stephen P. Boyd,et al. Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.