One intuitive way of looking at it: say you have a Gaussian-looking plot and you want to fit a Gaussian to it. You have a dead-simple model where all you can do is slide your Gaussian left and right.
If your starting point happens to overlap the data, great: your optimizer will take care of it and slide it into the correct place. If you start too far away, too bad: the model and the data barely overlap, so there is no meaningful gradient.
Neural nets instead give you the option to spawn a Gaussian anywhere you please. In that case no sliding is necessary, but it comes at a heavy parameterization cost.
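To make the "no meaningful gradient" point concrete, here is a minimal NumPy sketch. The setup is an assumption for illustration only: a unit-width Gaussian target centered at 0, a one-parameter model whose only knob is its center `mu`, a mean squared error loss, and plain gradient descent with a hand-picked step size.

```python
import numpy as np

# Illustrative setup (not from the comment above): unit-width Gaussian target at 0.
x = np.linspace(-30.0, 30.0, 6001)
target = np.exp(-0.5 * x**2)

def model(mu):
    # The "slide left and right" model: a unit-width Gaussian centered at mu.
    return np.exp(-0.5 * (x - mu) ** 2)

def grad(mu):
    # d/dmu of mean((model - target)^2); note d(model)/dmu = model * (x - mu).
    g = model(mu)
    return np.mean(2.0 * (g - target) * g * (x - mu))

def descend(mu, lr=10.0, steps=200):
    # Plain gradient descent on the single parameter mu.
    for _ in range(steps):
        mu -= lr * grad(mu)
    return mu

grad_near, grad_far = grad(1.0), grad(15.0)
mu_near, mu_far = descend(1.0), descend(15.0)

print("gradient at mu=1: ", grad_near)
print("gradient at mu=15:", grad_far)
print("final mu from 1:  ", mu_near)
print("final mu from 15: ", mu_far)
```

Starting at `mu=1` the gradient is clearly nonzero and descent slides the Gaussian onto the target. Starting at `mu=15` the model and target have essentially no overlap, the gradient is numerically zero, and the optimizer stays where it started.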
1. Avoid overparameterization by design. Manually create or choose a space of functions that has limited degrees of freedom by construction.
2. Accept overparameterization and regularize.
The latter tends to be more robust, because of the "bitter lesson". It's not practical to hand-design an ideal, just-right, limited-parameter model on demand for every dataset we are presented with. The best way to approach that ideal, it turns out, is to let the computer figure it out via regularized optimization over an overparameterized space.
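As a rough sketch of option 2, assume the overparameterized space is a dense grid of fixed Gaussian bumps, far more than are needed to describe one bump, with weights fit by ridge regression. The grid spacing, the penalty `lam`, and the helper name `ridge_fit_rms` are all made up for illustration; this is not meant as anyone's actual method.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 401)
centers = np.linspace(-10.0, 10.0, 81)  # 81 candidate bumps, spacing 0.25

# Design matrix: column j is a unit-width Gaussian bump centered at centers[j].
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

def ridge_fit_rms(true_center, lam=1e-3):
    # Target: one Gaussian bump at a location that is not on the center grid.
    y = np.exp(-0.5 * (x - true_center) ** 2)
    # Closed-form ridge solution: (Phi^T Phi + lam I) w = Phi^T y.
    A = Phi.T @ Phi + lam * np.eye(len(centers))
    w = np.linalg.solve(A, Phi.T @ y)
    # Root-mean-square residual of the fit on the sample grid.
    return np.sqrt(np.mean((Phi @ w - y) ** 2))

rms_a = ridge_fit_rms(6.3)
rms_b = ridge_fit_rms(-7.1)
print("RMS residual, bump at 6.3: ", rms_a)
print("RMS residual, bump at -7.1:", rms_b)
```

No initialization or sliding is involved: wherever the bump sits, some nearby basis functions can represent it, and the penalty keeps the many redundant weights from blowing up. The price is 81 parameters to describe what a hand-designed model would capture with one.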
Statisticians started moving in favor of overparameterization long before deep learning got off the ground. This trend dates back at least to the machine learning bible, The Elements of Statistical Learning (2001).
Could you elaborate on this?
Lotteries and tickets are often used as a didactical analogy to explain the success of overparameterized neural networks: “larger networks succeed because they more likely contain a well-initialized subnetwork that can learn the task in isolation, much like buying more tickets increases the chances of winning a lottery.” This explanation is intuitive but misleading: it suggests that subnetworks can be treated in isolation from the rest of the network. Following this reasoning leads to interpreting learning in wide networks as a multi-start optimization process, where gradient descent simply conducts a parallel search over subnetworks. We argue that this view is flawed since, among other reasons, winning tickets can be made to fail by perturbing the rest of the network. We put forward a more accurate intuitive picture for the success of overparameterization based on the geometry of loss landscapes: increasing width expands the set of available dimensions for optimization, making it easier to escape bad local minima. Moreover, as width grows, bad minima become increasingly rare relative to good minima. As the field grows mature, it is important to refine the analogies we use to explain foundational phenomena, such as the apparent redundancy of large networks, reconciling practitioners' intuitions with modern theoretical insights.