I recently learned about Tom 7’s “GradIEEEnt half decent” project (presented in a paper and an excellent YouTube video).

The intuition behind the project is the following:

  1. Neural nets depend on non-linear activation functions; otherwise, you get no benefit from multiple layers and you’re just learning one big linear function! (The first snippet after this list demonstrates this collapse.)
  2. However, mathematically linear functions can be non-linear in practice.
    • Standard IEEE 754 floating point arithmetic necessarily rounds in some cases, because most real numbers can’t be represented exactly in the format’s finite precision.
    • Addition and multiplication with IEEE 754 floating point numbers are not associative, e.g. (a + b) + c ≠ a + (b + c) in general (the second snippet below shows a concrete case).
    • Similarly, multiplication doesn’t distribute over addition, e.g. a * (b + c) ≠ a * b + a * c in general. (Individual additions and multiplications are commutative; it’s regrouping a chain of operations that changes how intermediate results get rounded.)
  3. Can we train a neural network with a mathematically linear activation function if we exploit the rounding that occurs during IEEE 754 arithmetic to introduce non-linearity? (The third snippet below sketches the kind of rounding trick involved.)
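Here’s a minimal numpy sketch of point 1 (the shapes and random weights are just illustrative): two stacked linear layers with no activation function collapse into a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Two stacked "layers" that are purely linear (no activation function).
W1 = rng.standard_normal((4, 4))
W2 = rng.standard_normal((4, 4))
two_layers = W2 @ (W1 @ x)

# They collapse into one linear layer whose weight matrix is W2 @ W1.
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True (up to rounding)
```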
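And the classic demonstration that floating point addition is not associative, using plain Python floats (which are IEEE 754 doubles):

```python
a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: the grouping changes what gets rounded
```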
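Finally, a rough sketch of the kind of trick point 3 relies on. This is not Tom 7’s actual construction, just an illustration of the principle using numpy float16: the expression (x + C) - C is mathematically the identity, but with C = 1024 in half precision it rounds x to the nearest integer, behaving like a step function.

```python
import numpy as np

# In float16, values in [1024, 2048) are spaced exactly 1 apart, so
# adding 1024 forces x to be rounded to an integer before 1024 is
# subtracted back off. A mathematically-identity expression becomes
# a step function purely through rounding.
C = np.float16(1024.0)

def f(x):
    x = np.float16(x)
    return (x + C) - C  # algebraically just x

for x in [0.2, 0.4, 0.6, 1.3, 1.7]:
    print(x, "->", float(f(x)))
# 0.2 -> 0.0
# 0.4 -> 0.0
# 0.6 -> 1.0
# 1.3 -> 1.0
# 1.7 -> 2.0
```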

Tom finds that the answer is yes, and that you can do a lot more with the incidental non-linearity floating point arithmetic introduces besides. I highly recommend checking out his work.

These properties of floating point arithmetic are responsible for some of the non-determinism that occurs during machine learning model training and inference, and more generally during distributed map-reduce style workflows, where the order in which partial results get combined isn’t fixed.
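For instance (a toy sketch, with a made-up chunk count standing in for workers or GPU threads), summing the same numbers with a different grouping usually produces a slightly different total:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(1_000_000).astype(np.float32)

# One "worker" sums everything in a single pass.
total_single = np.sum(values, dtype=np.float32)

# A map-reduce style run: 8 "workers" each sum a chunk, then the
# partial sums are combined. Same numbers, different rounding order.
partials = [np.sum(chunk, dtype=np.float32) for chunk in np.array_split(values, 8)]
total_chunked = np.sum(np.array(partials, dtype=np.float32), dtype=np.float32)

print(total_single, total_chunked)
print(total_single == total_chunked)  # typically False
```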

Here’s some interesting further reading:

I learned about this phenomenon from this Reddit thread.