# Bulletin of the American Physical Society

# APS March Meeting 2023

## Volume 68, Number 3

## Las Vegas, Nevada (March 5-10)

Virtual (March 20-22); Time Zone: Pacific Time

### Session F02: Statistical Physics Meets Machine Learning II

8:00 AM–10:12 AM, Tuesday, March 7, 2023

Room: Room 125

Sponsoring Units: GSNP DSOFT DBIO GDS

Chair: David Schwab, The Graduate Center, CUNY

### Abstract: F02.00003 : How SGD noise affects performance in distinct regimes of deep learning*

8:48 AM–9:00 AM

#### Presenter:

Antonio Sclocchi (Ecole Polytechnique Federale de Lausanne)

#### Authors:

Antonio Sclocchi (Ecole Polytechnique Federale de Lausanne)

Mario Geiger (Massachusetts Institute of Technology)

Matthieu Wyart (Ecole Polytechnique Federale de Lausanne)

For classification of MNIST and CIFAR10 images by deep nets, we empirically observe that: (i) if $\alpha \ll 1$, the optimal test error is achieved for a temperature value $T_{\mathrm{opt}} \sim \alpha^{k}$. In the kernel regime, (ii) the relative weights variation at the end of training with respect to initialization increases as $T^{\delta} P^{\gamma}$, where $P$ is the number of training points; (iii) the training time $t^{*}$, defined as the learning rate times the number of training steps required to bring a hinge loss to zero, increases as $t^{*} \sim T P^{b}$; (iv) at the cross-over temperature $T_{c} \sim P^{-a}$ the model escapes the kernel regime and its test error changes. We rationalize (i) with a scaling argument yielding $k = (D-1)/(D+1)$, where $D$ is the number of hidden layers of the network. We explain (ii, iii) using a perceptron architecture, for which we can compute the weights-dependent covariance of SGD noise and we obtain the exponents $b$, $\gamma$ and $\delta$. $b$ and $\gamma$ are found to depend on the density of data near the boundary separating labels. This model demonstrates that increasing the noise magnitude $T$ increases the training time, leading to a larger change of the weights, allowing the model to escape the kernel regime. Therefore we rationalize (iv) with a scaling argument that relates the exponents $a$, $\gamma$, $\delta$ as $a = \gamma/\delta$.
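
The relation between the exponents in (iv) can be made explicit with a short sketch. It assumes, as an illustration not stated in the abstract, that the kernel regime breaks down once the relative weight variation of (ii) reaches a threshold $C$ that does not depend on $T$ or $P$. The symbols $\Delta w$, $w_0$ and $C$ are notation introduced here for that purpose.

```latex
% Assumption: the kernel regime is left once the relative weight
% variation reaches a threshold C independent of T and P.
\begin{align}
  \frac{\lVert \Delta w \rVert}{\lVert w_0 \rVert} &\sim T^{\delta} P^{\gamma}
    && \text{(observation (ii))} \\
  T_c^{\delta} P^{\gamma} &\sim C
    && \text{(cross-over condition, assumed)} \\
  T_c &\sim P^{-\gamma/\delta}
    && \Rightarrow\; a = \gamma/\delta
\end{align}
```

Under this assumption, comparing the last line with $T_c \sim P^{-a}$ from (iv) gives the stated relation between the exponents.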

*This work was supported by a grant from the Simons Foundation (# 454953 Matthieu Wyart).

