The “math/bits” standard library documentation states:
Functions in this package may be implemented directly by
the compiler, for better performance. For those functions
the code in this package will not be used. Which
functions are implemented by the compiler depends on the
architecture and the Go release.
Does anybody know where I can find which functions the compiler implements itself on each CPU architecture? I was quite confused when I benchmarked certain functions against supposedly faster versions for a particular fixed-point type implementation, only to learn that the compiler already makes efficient use of the underlying hardware. I noticed this after copying the standard library code and compiling it myself, which then gave the expected (slower) benchmark results.
Noteworthy examples are bits.Mul64 and bits.Div64, which on x86-64 seem to compile directly to instructions that multiply two 64-bit numbers into a 128-bit product, or divide a 128-bit dividend by a 64-bit divisor.
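To make the use case concrete, here is a small sketch of how the two functions combine for fixed-point style arithmetic. The helper name mulDiv is hypothetical, and the values in main are arbitrary.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulDiv returns the quotient and remainder of (a * b) / c, keeping the
// full 128-bit intermediate product. The caller must make sure that c is
// non-zero and that the quotient fits in 64 bits (that is, the high word
// of the product is smaller than c); otherwise bits.Div64 panics.
func mulDiv(a, b, c uint64) (quo, rem uint64) {
	hi, lo := bits.Mul64(a, b)   // 128-bit product as (hi, lo)
	return bits.Div64(hi, lo, c) // (hi:lo) / c
}

func main() {
	// (2^40 * 2^40) / 2^32 overflows 64 bits in the middle,
	// but the final quotient 2^48 fits.
	q, r := mulDiv(1<<40, 1<<40, 1<<32)
	fmt.Println(q, r)
}
```

To see what this actually compiles to on a given machine, building with `go build -gcflags=-S` prints the generated assembly. On amd64 I would expect a single multiply and a single divide instruction for these two calls, but that is exactly the kind of thing that can change between releases.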
I like that there is essentially hardware acceleration for this, but I’d also like to be able to rely on this optimization by pointing to documentation that confirms the support.
This file can be hard to read. Look mainly for calls to the addF closure: its first string argument is the package, its second is the function name, and the last is a callback that translates the function call into an SSA operation (a kind of pseudo-assembly that can later be lowered to any architecture and is useful when writing compilers).
Lastly, intrinsics are not the only way this is done: some instructions are selected by matching against a particular code pattern. For example, with GOAMD64=v3 there is an instruction that implements func IsPowerOfTwo(uint) bool natively, yet you won’t find that function in the math/bits package.
Instead, the optimizer looks very precisely for the pattern x&(x-1) == 0 (or != 0 for the negation). There are plenty of other implementations of this function that would in theory perform just as fast, but in practice this is the only one that is optimized that way.
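As an illustration, here is the pattern written out next to one alternative that returns the same results. Both function names are made up for this example, and the popcount variant is only my pick of an "other implementation", not something taken from the compiler sources.

```go
package main

import (
	"fmt"
	"math/bits"
)

// isPowerOfTwo is written in exactly the x&(x-1) == 0 shape described
// above. Note that this form also reports true for x == 0.
func isPowerOfTwo(x uint) bool {
	return x&(x-1) == 0
}

// isPowerOfTwoPopcount gives the same answers (including for 0) by
// counting set bits, but it does not have the x&(x-1) shape, so it
// would not be matched by that particular rewrite.
func isPowerOfTwoPopcount(x uint) bool {
	return bits.OnesCount(x) <= 1
}

func main() {
	for _, x := range []uint{0, 1, 6, 8, 1000, 1024} {
		fmt.Println(x, isPowerOfTwo(x), isPowerOfTwoPopcount(x))
	}
}
```

If 0 should not count as a power of two, an extra x != 0 check is needed on top of the pattern.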