Unexpected {Simplicity, Complexity}
Unexpected Simplicity
The timeline of working through Andrej Karpathy’s Neural Networks: Zero to Hero course presents some striking milestones early on.
In the first hour of the first class, The spelled-out intro to neural networks and backpropagation: building micrograd, you’re guided through building and manually verifying a single step of backpropagation through a simple mathematical expression. The engine is mostly composed of a cute Value class that knows how to add and multiply other Value objects, as well as keeping track of its child Values and its local gradient.
You build up an expression of values, visualize it using graphviz, then step through each Value node, recursively computing the derivative of the expression’s sentinel (output) node with respect to each “backwardly” successive child node, leveraging the chain rule.
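For a sense of how little machinery that takes, here is a compressed sketch in the spirit of micrograd’s engine.py; this is not the actual code, the Value name mirrors the lecture and the rest is illustrative:

```python
class Value:
    """A scalar that remembers how it was produced, so gradients can flow backward."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0                  # d(output)/d(self), filled in by backward()
        self._prev = set(_children)      # the child Values this one was built from
        self._backward = lambda: None    # how to push this node's grad onto its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(x + y)/dx = d(x + y)/dy = 1, scaled by out.grad (chain rule)
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(x * y)/dx = y and d(x * y)/dy = x, each scaled by out.grad
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so every node's grad is final before it is pushed to its children.
        topo, visited = [], set()

        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)

        build(self)
        self.grad = 1.0                  # d(output)/d(output) = 1
        for node in reversed(topo):
            node._backward()


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
L = (a * b + c) * a                      # L = a^2 * b + a * c
L.backward()
print(a.grad)                            # dL/da = 2ab + c = -2.0
```

The += in each _backward is what lets a node that feeds more than one consumer (like a above) accumulate gradient from all of them.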
To resurface an intuition for the chain rule, Karpathy pulls up the example on Wikipedia:
As put by George F. Simmons: “If a car travels twice as fast as a bicycle and the bicycle is four times as fast as a walking man, then the car travels 2 × 4 = 8 times as fast as the man.”
Let \(z\), \(y\) and \(x\) be the (variable) positions of the car, the bicycle, and the walking man, respectively. The rate of change of relative positions of the car and the bicycle is \(dz/dy=2\). Similarly, \(dy/dx=4\).
So, the rate of change of the relative positions of the car and the walking man is
\[ dz/dx = dz/dy \cdot dy/dx = 2 \times 4 = 8 \]
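A quick numerical spot check, in the spirit of the finite-difference checks Karpathy uses in the lecture; the two lambdas below just encode the speed ratios from the quote:

```python
# The bicycle moves 4x as fast as the walker; the car moves 2x as fast as the bicycle.
y = lambda x: 4 * x        # bicycle position as a function of the walker's position
z = lambda y_: 2 * y_      # car position as a function of the bicycle's position

h = 1e-6                   # small step for a finite-difference estimate of dz/dx
x0 = 3.0                   # the rates are constant, so any point works
dz_dx = (z(y(x0 + h)) - z(y(x0))) / h
print(round(dz_dx))        # 8, matching dz/dy * dy/dx = 2 * 4
```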
At this point you’ve reconstructed a healthy chunk of the core of micrograd, an engine for small neural networks. The full codebase (at the time of writing) is small: engine.py is 94 LOC and nn.py is 60 LOC.
Above is a (simplified) description of backpropagation and a Python implementation, all in 270-odd words. Glibly, one could describe the kernel of neural networks as the chain rule, applied recursively through a graph.
Unexpected Complexity
Note that micrograd describes itself in its README as “Potentially useful for educational purposes.”
Karpathy also notes in the lecture video that a lot of the complexity in making neural network training work lies in making it run efficiently. He also notes that micrograd eschews tensors as a data structure in favor of pedagogy.
Because the backpropagation algorithm itself has a recursive structure, it is amenable to efficiency improvements through dynamic programming.
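Here is a toy way to see that, under the assumption that every local derivative is 1.0 so gradients reduce to path counts; this illustrates the idea rather than how any real framework implements it:

```python
from functools import lru_cache

# Toy DAG of N + 1 nodes in which node i feeds nodes i + 1 and i + 2, and every
# local derivative is 1.0, so "the gradient of node N w.r.t. node i" is a path count.
N = 25

def naive_grad(i):
    """Chain rule by plain recursion: sums the contribution of every path to the output."""
    if i == N:
        return 1.0
    return sum(1.0 * naive_grad(parent) for parent in (i + 1, i + 2) if parent <= N)

@lru_cache(maxsize=None)
def cached_grad(i):
    """Same recursion, memoized: each node's gradient is computed exactly once."""
    if i == N:
        return 1.0
    return sum(1.0 * cached_grad(parent) for parent in (i + 1, i + 2) if parent <= N)

print(naive_grad(0) == cached_grad(0))   # True, but the naive call count grows
                                         # exponentially with N
```

micrograd’s backward() achieves the same effect without an explicit cache by walking the graph once in reverse topological order and accumulating into each node’s grad.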
Hardware parallelization techniques such as SIMD, which PyTorch et al. leverage, make training a neural network tractable in time relative to running the equivalent operations serially.
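A rough way to feel that gap from Python, with the caveat that the measured speedup mixes interpreter overhead with the vectorized (SIMD/BLAS) kernels NumPy dispatches to, and the exact numbers depend on the machine:

```python
import time
import numpy as np

n = 2_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Serial: one multiply-accumulate at a time through the interpreter.
t0 = time.perf_counter()
acc = 0.0
for i in range(n):
    acc += a[i] * b[i]
t_loop = time.perf_counter() - t0

# Vectorized: a single call into an optimized (SIMD / BLAS) kernel.
t0 = time.perf_counter()
acc_vec = a @ b
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s  vectorized: {t_vec:.5f}s  close: {np.isclose(acc, acc_vec)}")
```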
The LLM inference library llama.cpp generates predictions based on context and inputs. While this functionality is tangential to training the neural network, it too sits atop a massive iceberg of performance optimizations, as noted by jart in a recent outline of enhancements including:
- mmap support for loading weights (a generic sketch of the idea follows this list)
- identifying constants for parameters that are never used
- unrolling the right loops to share registers, without stepping on CPU speculation
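As a generic illustration of the first item above, and not llama.cpp’s actual C++ implementation: memory-mapping a weights file lets the OS page tensors in lazily instead of reading the whole file up front. numpy.memmap is a convenient Python stand-in; the file name and shape below are made up:

```python
import numpy as np

# Hypothetical weights file and shape, purely for illustration.
PATH, SHAPE = "weights.bin", (4096, 4096)

# Write a dummy float32 matrix so the example is self-contained (64 MiB).
np.random.rand(*SHAPE).astype(np.float32).tofile(PATH)

# mmap-style load: no bytes are read up front; pages fault in as they are touched,
# and the OS page cache can share them across processes.
weights = np.memmap(PATH, dtype=np.float32, mode="r", shape=SHAPE)

print(weights[0, :4])   # only the touched pages get read from disk
```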
{Simple, Complex}
The complexity iceberg meme reigns supreme in the context of neural networks. Much like exposing a simple interface and hiding the complexity behind it, or “deep modules” à la John Ousterhout, the unexpected simplicity of the initial concepts gives way to the unexpected complexity of the implementation.