Bulletin of the American Physical Society
APS March Meeting 2023
Volume 68, Number 3
Las Vegas, Nevada (March 5-10)
Virtual (March 20-22); Time Zone: Pacific Time
Session D02: Statistical Physics Meets Machine Learning I (Focus Session)
Sponsoring Units: GSNP DSOFT DBIO GDS | Chair: Yuhai Tu, IBM T. J. Watson Research Center | Room: Room 125
Monday, March 6, 2023 3:00PM - 3:36PM |
D02.00001: Scaling Laws in Deep Neural Networks: Insights from Statistical Mechanics and Exactly Solvable Models Invited Speaker: Yasaman Bahri Artificial deep neural networks are complex, nonlinear statistical models whose learning and function often depend strongly on the model structure and choice of data and algorithm. Empirically, it has been observed that the generalization ability of such networks in learning tasks is frequently governed by power-law trends with respect to simple scaling variables, such as the amount of data available to learn from and the number of learnable parameters. A full understanding -- and in particular, a prescriptive theoretical framework -- for what governs this scaling is lacking. Towards this end, I will discuss our work introducing a classification of different regimes of behavior -- notions of "resolution-limited" and "variance-limited" regimes -- based on the mechanistic origins behind the scaling. Along the way, I will review and then leverage insights from recently discovered exactly solvable models for deep neural networks, a setting in which we can derive the different regimes exactly. I'll close by discussing implications and remaining challenges. |
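[Illustrative sketch, not from the talk: power-law scaling of this kind is typically quantified by fitting log test error against log dataset size; the exponent, prefactor, and noise level below are made-up assumptions, shown only to make the fitting step concrete.]

import numpy as np

rng = np.random.default_rng(0)
n = np.logspace(2, 6, 9)                     # dataset sizes
alpha_true = 0.5                             # assumed scaling exponent
err = 3.0 * n ** (-alpha_true) * np.exp(0.05 * rng.standard_normal(n.size))

# fit log(err) = log(C) - alpha * log(n) by least squares
slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
print(f"estimated exponent alpha = {-slope:.3f}, prefactor C = {np.exp(intercept):.3f}")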
Monday, March 6, 2023 3:36PM - 3:48PM |
D02.00002: Results from a Mapping Between Reinforcement Learning and Non-Equilibrium Statistical Mechanics Jacob Adamczyk, Argenis Arriojas Maldonado, Stas Tiomkin, Rahul V Kulkarni Reinforcement learning (RL), a field of machine learning that can be used to solve sequential decision-making problems, has recently become a popular tool for obtaining solutions to a variety of complex problems in physics. Despite this success as a tool, there has been limited work focusing on the relationship between the theoretical frameworks of RL and statistical mechanics. Our recent work has established a mapping between average-reward entropy-regularized RL and non-equilibrium statistical mechanics (NESM) using large deviation theory. We highlight how this mapping allows one to approach problems in NESM from an RL perspective and vice versa. As an example, we discuss how results from RL research on "reward shaping" can be extended using the framework of statistical mechanics of trajectories. In this setting, we derive results in RL that are analogous to the Gibbs-Bogoliubov inequality in equilibrium statistical mechanics. We propose methods to iteratively improve this bound based on results from RL. The mapping established in our work can thus lead to new results and algorithms in both RL and NESM. |
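[Illustrative sketch, not the authors' construction: entropy-regularized RL replaces the hard max of the Bellman backup with a log-sum-exp ("soft max"), the free-energy-like structure that underlies mappings to the statistical mechanics of trajectories. The toy MDP below is discounted rather than average-reward, and all sizes and parameters (nS, nA, beta, gamma) are assumptions.]

import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, beta = 5, 3, 0.9, 2.0
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # P[s, a, s'] transition probabilities
R = rng.uniform(size=(nS, nA))                   # rewards r(s, a)

V = np.zeros(nS)
for _ in range(500):
    Q = R + gamma * P @ V                        # Q[s, a]
    V_new = np.log(np.exp(beta * Q).sum(axis=1)) / beta   # soft max over actions
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new
print("soft state values:", np.round(V, 3))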
Monday, March 6, 2023 3:48PM - 4:00PM Author not Attending |
D02.00003: The Onset of Variance-Limited Behavior for Neural Networks at Finite Width and Sample Size Alexander B Atanasov, Cengiz Pehlevan, Blake Bordelon, Sabarish Sainathan For small training set sizes, the generalization error of wide neural networks is well-approximated by the error of an infinite-width neural network. However, beyond a certain training set size, the finite-width network's generalization begins to worsen compared to the infinite-width performance. We empirically study the transition from the infinite-width behavior to this variance-limited regime as a function of training set size, network width, and network initialization scale. We find that finite-size effects can become relevant for dataset sizes as small as the square root of the width for polynomial regression with ReLU networks. We discuss the source of this finite-size behavior based on the variance of the network's final neural tangent kernel (NTK). Using this, we provide a toy model that also exhibits the same scaling and has sample-size-dependent benefits from feature learning. |
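[Illustrative sketch, not the authors' experiment: one way to see finite-width fluctuations of the empirical NTK is to evaluate a kernel entry K(x, x') over many random initializations of a two-layer ReLU network and watch its variance shrink as the width grows. The architecture and all sizes below are assumptions.]

import numpy as np

def empirical_ntk_entry(x1, x2, width, rng):
    d = x1.size
    W = rng.standard_normal((width, d))          # first-layer weights
    v = rng.standard_normal(width)               # readout weights

    def grads(x):                                # gradient of f(x) = v . relu(W x) / sqrt(width)
        pre = W @ x
        act = np.maximum(pre, 0.0)
        dv = act / np.sqrt(width)
        dW = (v * (pre > 0))[:, None] * x[None, :] / np.sqrt(width)
        return np.concatenate([dv, dW.ravel()])

    return grads(x1) @ grads(x2)

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(10), rng.standard_normal(10)
for width in (64, 256, 1024):
    samples = [empirical_ntk_entry(x1, x2, width, rng) for _ in range(200)]
    print(f"width {width:5d}: mean {np.mean(samples):.3f}, variance {np.var(samples):.4f}")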
Monday, March 6, 2023 4:00PM - 4:12PM Author not Attending |
D02.00004: Feature learning and overfitting in neural networks Francesco Cagnetta It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, for a fixed task such as classifying images, feature learning is beneficial in modern architectures but detrimental in standard fully-connected feed-forward networks. Here we propose an explanation for this puzzle, by showing that feature learning can result in poor generalization performance as it leads to a `sparse' neural representation, where only a fraction of the connections in the original network are active. Although sparsity is known to be essential for learning anisotropies in the data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark image datasets. For (i), we compute the scaling of the generalization error with the number of training points analytically, thus showing quantitatively how methods that do not learn features generalize better if the target function is sufficiently smooth. For (ii), we show empirically that learning features can indeed lead to sparse and thus less smooth representations. Since an image classifier must be highly smooth with respect to small deformations of the image, this is likely the cause of poor performance. |
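[Illustrative sketch of the fixed-kernel baseline contrasted in the abstract, not the paper's setup: kernel ridge regression on inputs drawn from the unit sphere, tracking how the test error decays with the number of training points. The Laplace kernel, the smooth target, and the ridge strength are assumptions.]

import numpy as np

rng = np.random.default_rng(3)
d, n_test, ridge = 5, 500, 1e-6

def sphere(n):                                   # uniform points on the unit sphere
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def kernel(A, B):                                # Laplace kernel
    return np.exp(-np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1))

target = lambda X: np.cos(3.0 * X[:, 0])         # smooth target function
X_te = sphere(n_test)
y_te = target(X_te)

for n in (50, 200, 800):
    X_tr = sphere(n)
    alpha = np.linalg.solve(kernel(X_tr, X_tr) + ridge * np.eye(n), target(X_tr))
    y_pred = kernel(X_te, X_tr) @ alpha
    print(f"n = {n:4d}: test MSE = {np.mean((y_pred - y_te) ** 2):.2e}")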
Monday, March 6, 2023 4:12PM - 4:24PM |
D02.00005: Flatter, Faster: Scaling Momentum for Optimal Speedup of SGD Aditya Cowsik, Tankut U Can, Paolo Glorioso
|
Monday, March 6, 2023 4:24PM - 4:36PM |
D02.00006: Statistical Mechanics of Infinitely-Wide Convolutional Networks Alessandro Favero, Francesco Cagnetta, Matthieu Wyart Understanding how convolutional neural networks (CNNs) can efficiently learn high-dimensional tasks remains a challenge. A popular belief is that these models harness the translationally invariant, local, and hierarchical structure of natural data such as images. Yet, we lack a quantitative understanding of how this structure affects performance. To study this problem, we consider wide CNNs in the kernel limit, where generalisation can be characterised using statistical mechanics methods. We introduce a stylised teacher-student framework where a CNN is trained on the output of another CNN with random weights. In this framework, we control the structure of the target function by adding weight sharing and by tuning the size of the neuron receptive fields and the depth of the teacher network. First, we find that translational invariance does not change the scaling of learning curves, which measure the decay of the generalisation error with the number of training examples, and therefore is not enough to beat the curse of dimensionality. Then, we show that if the target function has a local structure, i.e., it depends only on low-dimensional subsets of adjacent input variables, CNNs beat the curse of dimensionality. In fact, the learning curve scaling is controlled by the dimension of these subsets and not by the full input dimension. Finally, we show that the hierarchical structure of CNNs is too rich to be efficiently learnable in high dimensions and discuss further classes of hierarchical target functions. |
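[Illustrative sketch, not the paper's teacher: in a stylised teacher-student setting, labels can be generated by a random one-hidden-layer convolutional teacher that only sees local patches of size s, so the target has the local structure discussed above. The patch size, number of filters, and nonlinearity are assumptions.]

import numpy as np

rng = np.random.default_rng(4)
d, s, n_filters = 16, 3, 8                       # input dim, receptive field size, channels

W = rng.standard_normal((n_filters, s))          # shared teacher filters
a = rng.standard_normal(n_filters)               # readout weights

def teacher(X):
    # slide the filters over all contiguous patches and average the responses
    patches = np.stack([X[:, i:i + s] for i in range(d - s + 1)], axis=1)  # (n, P, s)
    h = np.maximum(patches @ W.T, 0.0)           # ReLU responses, (n, P, n_filters)
    return (h @ a).mean(axis=1)                  # pool over patch positions

X = rng.standard_normal((1000, d))
y = teacher(X)
print("teacher-labelled examples:", X.shape[0], "- label std:", round(float(y.std()), 3))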
Monday, March 6, 2023 4:36PM - 4:48PM |
D02.00007: Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width Dayal Singh Kalra, Maissam Barkeshli We systematically analyze optimization dynamics in deep feed-forward neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales and study carefully the effect of learning rate, depth, and width of the neural network. By analyzing the top eigenvalue λ_t of the Hessian of the loss, which is a proxy for sharpness, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on learning rate η = c/λ_0, depth d, and width w. We identify four critical values of c: c_critical, c_loss, c_sharp, and c_max, which separate qualitatively distinct phenomena. In particular, we discover a regime c_critical < c < c_loss, which opens up with increasing d/w, in which the sharpness decreases significantly but without an initial increase in the loss, violating the simple picture of catapulting out of a local basin and into a wider one by traversing up a barrier. Our results have important implications for the question of how to scale learning rate as the DNN depth and width are increased in order to remain in the same phase of learning. |
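[Illustrative sketch, not the authors' code: the top Hessian eigenvalue λ_0 at initialization can be estimated by power iteration on Hessian-vector products (here obtained by finite differences of an analytic gradient), after which the learning rate is set as η = c/λ_0. The logistic-regression loss and all sizes are assumptions.]

import numpy as np

rng = np.random.default_rng(5)
n, d = 200, 20
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)

def grad(w):                                     # gradient of the mean logistic loss
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / n

def hvp(w, v, eps=1e-5):                         # Hessian-vector product by finite differences
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

w0 = rng.standard_normal(d)                      # parameters at "initialization"
v = rng.standard_normal(d)
for _ in range(100):                             # power iteration for the top eigenvector
    v = hvp(w0, v)
    v /= np.linalg.norm(v)
lam0 = v @ hvp(w0, v)                            # top Hessian eigenvalue lambda_0

c = 2.0                                          # dimensionless learning-rate scale
print(f"lambda_0 = {lam0:.4f}, learning rate eta = c / lambda_0 = {c / lam0:.4f}")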
Monday, March 6, 2023 4:48PM - 5:00PM |
D02.00008: Generative probabilistic matrix model of data with different low-dimensional latent structures Philipp Fleig, Ilya M Nemenman Complex biological and social features are often modelled by effective models with latent features, which serve the role of emergent, collective degrees of freedom. In many contexts, identification of such features needs to proceed directly from data. Unfortunately, we know very little about how different types of latent feature models manifest themselves in data, which makes inference hard. In this work, we investigate properties of data produced by different types of latent feature models, all described as special cases of a general model involving mixing of latent features. A key ingredient of our model is that we allow for statistical dependence between the mixing coefficients, as well as latent features with a statistically dependent structure. Latent dimensionality and correlation patterns of the data are controlled by three model parameters. The model's special cases include hierarchical clusters, sparse mixing, and non-negative mixing. We describe the correlation and eigenvalue distributions of these patterns within the general model and discuss how our model can be used to generate structured training data for supervised learning. |
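[Illustrative sketch, with structure and parameter names that are assumptions rather than the authors' model: data generated by mixing latent features, X = A F + noise, where different choices of the mixing coefficients A give sparse or non-negative mixing as mentioned above.]

import numpy as np

rng = np.random.default_rng(6)
n_samples, n_dims, n_latent = 500, 50, 4

F = rng.standard_normal((n_latent, n_dims))           # latent features
A = rng.standard_normal((n_samples, n_latent))        # mixing coefficients

A_sparse = A * (rng.uniform(size=A.shape) < 0.2)      # sparse mixing variant
A_nonneg = np.abs(A)                                  # non-negative mixing variant

X = A_nonneg @ F + 0.1 * rng.standard_normal((n_samples, n_dims))
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
print("leading covariance eigenvalues:", np.round(eigvals[:6], 2))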
Monday, March 6, 2023 5:00PM - 5:12PM |
D02.00009: The Evolution of the Fisher Information Matrix During Deep Neural Network Training Chase W Goddard, David J Schwab
|
Monday, March 6, 2023 5:12PM - 5:24PM |
D02.00010: When does Dual Dimensionality Reduction perform better than Single Dimensionality Reduction? Eslam Abdelaleem, K. Michael Martini, Ahmed H Roman, Ilya M Nemenman Current experiments in many fields often generate large-dimensional datasets with multiple modalities (e.g., neural activity and animal behavior). Often the first step in understanding these experiments is finding correlations between different modalities, which requires dimensionality reduction (DR). We previously introduced the concept of Dual Dimensionality Reduction (DDR) approaches, which simultaneously compress both data modalities to maximize the covariation between their reduced descriptions. We argued that DDR requires significantly fewer data points to detect correlations than performing DR on each modality independently and then identifying relations between the reduced descriptions. Here we use a generative model of multivariate correlated data and linear dimensionality reduction approaches to carefully explore under which conditions DDR methods outperform independent approaches. We extend the argument to nonlinear reduction methods as well, using Deep Canonical Correlation Analysis as a nonlinear DDR and autoencoders for independent reduction of individual modalities. We believe that our analysis points to a general principle that DDR methods are often more data efficient in detecting weak correlations than their independent DR equivalents. |
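[Illustrative sketch on toy data, not the paper's generative model: linear canonical correlation analysis (CCA) is a dual reduction in the above sense, compressing the two modalities X and Y jointly; the independent alternative would reduce each modality separately and correlate afterwards.]

import numpy as np

rng = np.random.default_rng(7)
n, dx, dy = 1000, 30, 40
z = rng.standard_normal(n)                            # shared latent signal
X = np.outer(z, rng.standard_normal(dx)) + rng.standard_normal((n, dx))
Y = np.outer(z, rng.standard_normal(dy)) + rng.standard_normal((n, dy))

def cca_top_correlation(X, Y, reg=1e-6):
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / n + reg * np.eye(dx)
    Cyy = Yc.T @ Yc / n + reg * np.eye(dy)
    Cxy = Xc.T @ Yc / n
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    return np.sqrt(np.max(np.real(np.linalg.eigvals(M))))

print(f"top canonical correlation: {cca_top_correlation(X, Y):.3f}")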
Monday, March 6, 2023 5:24PM - 5:36PM |
D02.00011: Physics-Informed featurization of spectral functions Shubhang Goswami, Kipton M Barros, Matthew R Carbone Spectral functions are key observables of interacting many-body systems, and characterizing them is of great interest. We investigate two methods for approximating spectral functions via rational approximations, i.e., approximations as a ratio of two polynomials. The first approach, VFIT, approximates individual spectral functions of lattice polaron models in a reliable, simple, and accurate way. We also introduce a second fitting procedure, which we call the Smooth Rational Approximation (SRA), that simultaneously fits an entire batch of spectral functions. This fitting procedure can be regularized such that the predicted spectral functions vary smoothly with the governing physical parameters. |
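[Illustrative sketch, not VFIT or SRA themselves: a rational approximation fits a spectral function A(w) by a ratio of two polynomials p(w)/q(w). With q normalized so its constant term is 1, the condition A(w) q(w) = p(w) becomes a linear least-squares problem in the coefficients. The polynomial degrees and the Lorentzian test function are assumptions.]

import numpy as np

w = np.linspace(-5, 5, 400)
A = (0.5 / np.pi) / ((w - 1.0) ** 2 + 0.25)            # Lorentzian "spectral function"

deg_p, deg_q = 2, 2
Vp = np.vander(w, deg_p + 1, increasing=True)          # basis for p(w)
Vq = np.vander(w, deg_q + 1, increasing=True)[:, 1:]   # basis for q(w), constant term fixed to 1

# linearized fit: A(w) * (1 + sum_k b_k w^k) = sum_j a_j w^j, solved by least squares
M = np.hstack([Vp, -A[:, None] * Vq])
coeffs, *_ = np.linalg.lstsq(M, A, rcond=None)
a, b = coeffs[:deg_p + 1], coeffs[deg_p + 1:]

A_fit = (Vp @ a) / (1.0 + Vq @ b)
print(f"max abs error of rational fit: {np.max(np.abs(A_fit - A)):.2e}")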
Monday, March 6, 2023 5:36PM - 5:48PM |
D02.00012: Efficient Modelling of Ge15Te85 using Active Learning Methods Thomas Arbaugh, Francis W Starr Germanium-Telluride is a phase-change material (PCM) that shows promise for potential applications in advanced memory materials. In the 15:85 composition, several anomalous features, including a sharp density maximum and a likely fragile-to-strong transition in the dynamics, occur upon cooling. Unfortunately, accurate simulations of PCM materials typically rely on Density Functional Theory (DFT) and are very limited in the accessible size and time scales, making it difficult to model the properties of these materials. To overcome this challenge, we utilize recently developed machine-learning interatomic potentials (MLIPs) that enable the creation of lightweight and efficient potentials. These potentials are trained on and reproduce DFT-accurate descriptions of materials over a broad range of the phase diagram. We discuss active learning, compare training methods, and evaluate the ability of trained MLIPs to match experimentally known quantities of Ge15Te85. In particular, we find that these potentials reproduce the experimentally known structure to a high degree of accuracy. |
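[Illustrative sketch, not the authors' workflow: a common active-learning step for interatomic potentials is query by committee, where an ensemble of surrogate models is evaluated on candidate configurations and those with the largest ensemble disagreement are sent for new DFT calculations. The random linear models and descriptors below are placeholders purely to show the selection step.]

import numpy as np

rng = np.random.default_rng(8)
n_candidates, n_desc, n_models = 200, 10, 5

descriptors = rng.standard_normal((n_candidates, n_desc))   # candidate configurations
ensemble = rng.standard_normal((n_models, n_desc))           # committee of surrogate models

predictions = descriptors @ ensemble.T                       # (candidates, models)
disagreement = predictions.std(axis=1)                       # ensemble spread per configuration

n_select = 10
selected = np.argsort(disagreement)[-n_select:]              # most uncertain candidates
print("configurations selected for new DFT labels:", np.sort(selected))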
Monday, March 6, 2023 5:48PM - 6:00PM |
D02.00013: A simple model for Grokking modular arithmetic Andrey Gromov
|