(1): https://gist.github.com/ChrisRackauckas/cc6ac746e2dfd285c28e... (2): https://discuss.pytorch.org/t/why-torch-jit-is-so-slow/36616...
It would seem btw that 1.5 years later, people are working on implementing generated kernels for reductions, even if it is still somewhat experimental.
The speed comparison, which is about 9 months old, is only marginally related. The issue here is the optimization of pointwise operations, which the JIT fuser would handle, except that it is disabled by default on the CPU. The latest version of the benchmark code does not seem to run on recent PyTorch versions as given. It is still a fair comparison in the sense that it shows what a user will get by default. I won't be the first PyTorch developer to say that Julia and its libs do a great job at JITed optimizations. Nonetheless, I'm relatively certain that a determined PyTorch user would find ways to get a better-optimized version of that ODE step using some of the disabled-by-default features.
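To make "optimization of pointwise operations" a bit more concrete, here is a rough sketch of what fusing buys you. It is plain Python with lists standing in for tensors, not PyTorch's fuser, and the right-hand side `sin(u) - u^2` and the function names are made up for illustration (they are not the ODE from the benchmark).

```python
import math

# Conceptual sketch only: plain lists stand in for tensors, and the
# right-hand side sin(u) - u^2 is a hypothetical example, not the
# ODE used in the benchmark discussed above.

def unfused_step(u, dt):
    # Each pointwise op makes its own pass over the data and
    # allocates a temporary, like one kernel launch per op.
    squared = [x * x for x in u]                      # pass 1
    sines = [math.sin(x) for x in u]                  # pass 2
    rhs = [s - q for s, q in zip(sines, squared)]     # pass 3
    return [x + dt * r for x, r in zip(u, rhs)]       # pass 4

def fused_step(u, dt):
    # A fuser would generate a single loop doing all the pointwise
    # work per element: one pass, no intermediate buffers.
    return [x + dt * (math.sin(x) - x * x) for x in u]

u = [0.1 * i for i in range(5)]
# Both variants compute the same explicit Euler step.
assert all(
    math.isclose(a, b)
    for a, b in zip(unfused_step(u, 0.01), fused_step(u, 0.01))
)
```

The results are identical; the difference is only in how many times the data is traversed and how many temporaries are created, which is where the time goes for small, memory-bound steps.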
The other truth, of course, is that for PyTorch there is still more emphasis on the GPU when it comes to implementing optimizations.
(Disclaimer: I'm one of the people on that thread.)