Making Julia as Fast as C++ (2019) (flow.byu.edu)
End result: code that is uglier and still much slower than C++. Kind of a shame.
I wrote a blog post at the time with exactly that punchline (not explicitly stated, but just look at the code!): https://spmd.org/posts/multithreadedallocations/ The example was similar to a real production-critical hot path from work.
Maybe things have changed since I left Julia, but that was December 2023, four years after this blog post.
As a quick anecdote, in our take-home interview exercise, we usually receive answers in C++ or Julia, and the two fastest answers have been in Julia.
Of course it also depends on what additional libraries you are using, especially when it comes to parallel/GPU programming in C++, but it is easy to believe that Julia out of the box makes it easy to write high-performance parallel software.
Yeah, I actually totally forgot to check the date...
So I would say that the culprit for interoperability is C and its descendants, not Fortran or Julia. The designers of C, and of the languages that imitated it, gave no thought to which order is better for multi-dimensional arrays, so users of those languages have no right to blame interoperability problems on languages that did the right thing. Even if the Fortran order had not been better, it had already been in use for 20 years before C, so there was no reason to choose a different order.
C has chosen to store arrays in the order in which they are typically read by humans when written on paper, but this is a choice like the choice between big-endian and little-endian, where big-endian was how Europeans wrote numbers, but little-endian is more efficient on computers.
An example of why column-major order is preferable is the matrix-vector product, i.e. the evaluation of a function that maps between linear spaces.
The matrix-vector product should not be computed the way it is typically taught in schools, as scalar products of the rows of the matrix with the vector, because that is less efficient: it makes more memory accesses. The right way to compute a matrix-vector product is with AXPY operations between the columns of the matrix and the elements of the vector operand (segments of the AXPY output are held in registers until all partial AXPY operations have been accumulated, avoiding memory accesses). In this case, you need to read a column of the input matrix for each AXPY operation, which is much more efficient when the elements of a column are stored contiguously in memory, avoiding the need for strided accesses.
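As a rough illustration of the access pattern described above, here is a minimal Python sketch. The function names and the flat-list column-major layout are assumptions made for this example, not any library's API, and pure Python will not reproduce the speed difference; the sketch only shows which index walks memory with stride 1.

```python
def matvec_axpy_colmajor(a, rows, cols, x):
    """y = A x using one AXPY per column; A is column-major in the flat list a."""
    y = [0.0] * rows
    for j in range(cols):
        xj = x[j]
        base = j * rows  # column j occupies a[base : base + rows]
        for i in range(rows):
            y[i] += a[base + i] * xj  # inner loop reads a with stride 1
    return y


def matvec_dot_colmajor(a, rows, cols, x):
    """y = A x using one dot product per row, on the same column-major storage."""
    y = [0.0] * rows
    for i in range(rows):
        acc = 0.0
        for j in range(cols):
            acc += a[j * rows + i] * x[j]  # inner loop reads a with stride `rows`
        y[i] = acc
    return y


# A = [[1, 2, 3],
#      [4, 5, 6]] stored column by column.
a = [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
x = [1.0, 0.0, -1.0]
y_axpy = matvec_axpy_colmajor(a, 2, 3, x)
y_dot = matvec_dot_colmajor(a, 2, 3, x)
```

Both functions compute the same vector; only the order in which the elements of `a` are visited differs.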
The same thing happens for matrix-matrix products, which must not be computed in the naive way taught in schools, as scalar products of rows of the first matrix with columns of the second matrix, but as a sum of tensor (outer) products of columns of the first matrix with rows of the second matrix.
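The outer-product formulation can be sketched in a few lines of Python. This is only an illustration under simple assumptions (nested lists, a hypothetical `matmul_outer` helper, no blocking or register tiling), not how a tuned BLAS is written.

```python
def matmul_outer(a, b):
    """C = A B as a sum over k of (column k of A) times (row k of B)."""
    n, inner, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for k in range(inner):
        col_k = [a[i][k] for i in range(n)]  # column k of the first matrix
        row_k = b[k]                         # row k of the second matrix
        for i in range(n):
            for j in range(m):
                c[i][j] += col_k[i] * row_k[j]  # rank-1 update of C
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = matmul_outer(a, b)
```

Each pass over `k` adds one rank-1 update to the whole result, so every element of `col_k` and `row_k` is loaded once per pass and reused across a full row or column of `c`.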
Oh such a shame indeed! They didn’t even manage to produce better looking code at least?? Julia was looking great in 2019 but it was very buggy still so I stopped looking. Had hopes that by now it would be a good choice over C++ and Rust with similar performance.
I have always seen it as a potential alternative to Java, and definitely better than Python.
My experience working in it professionally was that it was... fine. But the GC in it was not good under load and not competitive with Java's.
Also, I'm of course using nefarious in jest here in both cases. While we don't directly try to monetize our open source work, I respect that sometimes people need to do that. As long as people are transparent about it, I don't have a problem. Doing the thing we're doing seems to work, but it's a lot harder, because you have to build a successful piece of software and one (or multiple) successful something elses that have a critical dependency on it. It's like hitting the lottery twice.
Also, contributing to open source is a choice, not a mandate. I greatly benefit from Julia and its ecosystem, so I chose to contribute back some of my work; no one forced me. I chose the MIT license because I want other people to be able to make money with it, just like I make money with other people's MIT-licensed stuff.
It’s nothing like Google-the-ad-company influencing Chrome. The company consumes Julia for products to sell, rather. Maybe this affects the ordering of features landing, but… meh.
My work is more combinatorial. Julia does excel at numerical computation. There's a tribal divide in math between people who can't go 30 seconds away from the real or complex numbers, and those whose tolerance is about that long. I try to keep an open mind, but I'm closer to the second camp. Julia is good enough to consider either way.
A development in recent months: AI can now assist in general-purpose Lean 4 programming, no longer getting confused by the dominant proof-oriented training corpus. If one is a functional programmer who believes that Haskell was on the right track, then Lean is the most interesting language choice for shaping one's thoughts. Benchmarks are inherently misleading if a better language makes it possible to express algorithms out of reach of more primitive languages.
https://github.com/Syzygies/Compare
Language       Score  Time
C++            100    13.08s ±0.08s
Rust            99    13.16s ±0.02s
Julia           90    14.54s ±0.01s
F#              90    14.54s ±0.04s
Kotlin-native   88    14.79s ±0.01s
Kotlin          86    15.18s ±0.01s
Scala           79    16.50s ±0.08s
Scala-native    76    17.14s ±0.02s
Nim             65    20.17s ±0.01s
Swift           64    20.54s ±0.04s
Ocaml           52    25.38s ±0.04s
Chez            49    26.64s ±0.02s
Haskell         37    34.96s ±0.06s
Lean            29    45.39s ±0.15s
To those who regularly write Julia code, what is your workflow? The whole thing with Revise.jl did not suit me, honestly. I have enjoyed programming in Rust orders of magnitude more because there's no runtime and you can do AOT compilation. My intention is not to write scripts, but high-performance numerical/scientific code, and with Julia's JIT-based design, rapid iteration (to me at least) feels slower than in Rust (!).
Prelude of what's to come in the self-reinforcing cycle of machines talking to machines and drowning everything else.
And that's a good thing, because Python+NumPy syntax is far more cumbersome than either Julia or MATLAB's.
You can see this at a glance from this nice trilingual cheat sheet:
One could say that we can almost replicate the semantics of a C++ program while writing in Julia. For example, we can remove bounds checks on arrays or remove hidden memory allocations.
But the goal of a language for numerical computing is to capture the mathematical formulas using high-level constructs close to the original representation, while still compiling to efficient code.
Domain scientists want to play with the math and the formulas, not do common subexpression elimination in their programs. Just curious to see how it evolves.
Why?
Because of CPU architecture: given a CPU, one just needs to structure the code in a way the CPU can execute efficiently! Is it so surprising that all the sugar and multi-functional smartness carry a cost in all those ifs and loops, like maps? The CPU is just rock stupid and can't do anything else!
That's where all those specialized instructions come from, and programs just need to be structured or compiled the CPU architecture's way to run as fast as the CPU and the rest of the hardware allow...
And there are some "Java machines", and that is exactly the same story: use the CPU's native language :) As much as possible.
So: give us better CPUs, pls :)
- not a single post has anything inside here https://flow.byu.edu/posts/
In my experience you really gotta work with the tools the language gives you. Julia gives you Revise, so it’s a bit of a handicap not using it. Maybe analogous to writing Rust without an LSP.
I get that leaning on the LSP can become a habit, and also that the Julia LSP is quite poor, but I find it wild that rapid iteration for you is faster in Rust. I write Rust as well and can’t imagine how that would be the case.
rust-analyzer is a great LSP, and paired with clippy it can teach you the language itself. Also, writing numerical code is extremely easy in Rust. I can write code and just run cargo run to see the output. Julia, on the other hand, forced a REPL-based workflow, which has never made sense to me. A REPL-based workflow makes sense when you just want to do some scripting. But when writing code that will run for a long time on an HPC system? I don't get it. Part of the problem is that I'm not "holding it correctly", but again, the out-of-the-box experience isn't good. You define a struct and later add or remove a field from it. Often you'll get an error because Revise.jl didn't recompile things. It was a subpar experience, and I was hoping people would share their dev workflow in more detail.
Nowadays I often use Claude Code, working with a Julia REPL in a tmux or zellij session via send-keys. I'll have it prototype and try to optimize an algorithm there, then create a notebook to "present its results", then I'll take the bits I like and add them to the production codebase.
A REPL-based workflow doesn't make sense to me for anything other than scripting work.
I hope Julia developer tools will one day match the best of what other programming languages have to offer.
If you want a better Julia LSP, you might just be able to get Claude or Codex to build one for you. I've been impressed with the TLA+ bindings it generated.
Good LSPs do the autocompletion, subpar ones don't.
Is it really such a good idea to have every single automated aid turned on when picking up a new language?
How will you learn if you cannot get feedback on what you did wrong?
I mean, until you learn multiplication, maybe don't use the calculator.
Once you learn it then you get a small speed increase, but if you are new to something, LSP autocompletion is going to slow down your learning.
This is a plausible assumption to make, but unfortunately it is not true in general. Especially once the traditional sizes are exceeded, say n >= 2000, certain operations such as LU can perform better with C-major arrays. The correct statement is that you lose in some places and win in others. There are certainly linalg operations where F-major gives you more performance, but the same is true of the C-major layout.
In your matrix-vector product example, or in any BLAS2 or BLAS3 level operation, you can also swap the for loop order to convert things around (row*col buffer multiplication vs. the weighted-sum-of-columns interpretation). Matrix norm operations are the only real exceptions (absolute column sum, absolute row sum, etc.), where certain norms prefer certain orders. In fact, if you go deep enough into the Goto method you'll see that the internal order is a bit like Morton ordering, to fit things into L1 cache.
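The loop-order swap can be sketched as follows, here on row-major (C-major) storage. The function names and the flat-list layout are assumptions for illustration only; the point is that both loop orders give the same result, and on row-major data it is the row-times-vector form whose inner loop has stride 1.

```python
def matvec_rows_rowmajor(a, rows, cols, x):
    """y = A x as row-by-vector dot products; A is row-major in the flat list a."""
    y = [0.0] * rows
    for i in range(rows):
        base = i * cols  # row i occupies a[base : base + cols]
        acc = 0.0
        for j in range(cols):
            acc += a[base + j] * x[j]  # inner loop reads a with stride 1
        y[i] = acc
    return y


def matvec_cols_rowmajor(a, rows, cols, x):
    """y = A x as a weighted sum of columns, on the same row-major storage."""
    y = [0.0] * rows
    for j in range(cols):
        xj = x[j]
        for i in range(rows):
            y[i] += a[i * cols + j] * xj  # inner loop reads a with stride `cols`
    return y


# A = [[1, 2, 3],
#      [4, 5, 6]] stored row by row.
a_rm = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_rm = [1.0, 0.0, -1.0]
y_rows = matvec_rows_rowmajor(a_rm, 2, 3, x_rm)
y_cols = matvec_cols_rowmajor(a_rm, 2, 3, x_rm)
```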
Column-major is preferred for historical reasons, and getting things running with C-major ordering requires more surgery. Trust me, I tried, but it's too much work for too little gain. Maybe someday when I retire I can attempt it. Hence I kept it column-major in my retranslation of LAPACK https://github.com/ilayn/semicolon-lapack
Instead I implemented a "high"-performance AVX2 matrix transpose operation so that swapping the memory layout is trivial compared to the linalg cost.
This only ends up being true (for any language, but it's too often cited for C++) in a pretty useless Turing Tarpit sort of sense.
So it's not "no reason"; it's just sometimes impractical to solve some problems as well in C++ as in a language that is better suited.
Now people do do impractical things sometimes. It's not very practical to swim across the English Channel, but people do it. It's not very practical to climb Mt Everest, but loads of people do that for some reason. Going to the moon wasn't practical but the Americans decided to do it anyway. But the reason even the Americans stopped going for a long time is that actually "that was too hard and I don't want to" is in fact a reason.
Recent versions of Revise let you redefine structs in the REPL.
You are not forced to use the REPL, ever. It’s a fantastic convenience, however.
My dev workflow is to write my code in Neovim, sometimes with a REPL attached to the editor to try out code snippets. I don’t need or use LSPs. I do enjoy the Aerial plugin, which pops up an outline of my code for easy navigation.
For long-running jobs, I basically follow the same process as in any other language: make the functions I want to run, test them locally on a small dataset that runs relatively quickly, then launch them on the remote machines with the full data.
Revise.jl has struct redefinition now, but before that I would just use NamedTuples while iterating, then make a struct when I was ready to move something to production.
`using` is for importing modules, `include` is for specific files. At work, we currently have a monorepo, with one top-level OurProject.jl file that uses `using` to import external packages, and `include` for all the internal files.
The main strategy is to have a way to parameterize the program so that the runtime comes down to seconds or minutes on a laptop. E.g. for PDEs, you may be running the HPC version on a giant mesh, but you can run the same algorithm on your local computer on a much coarser mesh.
> How do you quickly modify struct definitions
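As a toy illustration of that strategy (not the commenter's actual code), the sketch below exposes the mesh size as a single parameter `n`. The model problem, the hypothetical `solve_poisson_1d` helper, and the tridiagonal solver are all assumptions chosen to keep the example short: a 1D Poisson problem whose exact solution is sin(pi x).

```python
import math


def solve_poisson_1d(n):
    """Solve -u'' = pi^2 sin(pi x) on [0, 1] with u(0) = u(1) = 0 on n intervals.

    Returns (grid points, finite-difference solution). Requires n >= 2.
    """
    h = 1.0 / n
    xs = [i * h for i in range(n + 1)]
    m = n - 1  # number of interior unknowns
    # Row k of the tridiagonal system: -u[k] + 2 u[k+1] - u[k+2] = h^2 f(x[k+1])
    diag = [2.0] * m
    rhs = [h * h * math.pi ** 2 * math.sin(math.pi * xs[k + 1]) for k in range(m)]
    # Thomas algorithm, forward elimination (all off-diagonal entries are -1).
    for k in range(1, m):
        w = -1.0 / diag[k - 1]
        diag[k] += w
        rhs[k] -= w * rhs[k - 1]
    # Back substitution; boundary values stay at zero.
    u = [0.0] * (n + 1)
    u[m] = rhs[m - 1] / diag[m - 1]
    for k in range(m - 2, -1, -1):
        u[k + 1] = (rhs[k] + u[k + 2]) / diag[k]
    return xs, u


def max_error(n):
    """Largest pointwise deviation from the exact solution sin(pi x)."""
    xs, u = solve_poisson_1d(n)
    return max(abs(ui - math.sin(math.pi * x)) for x, ui in zip(xs, u))


coarse_error = max_error(16)   # quick local check on a coarse mesh
fine_error = max_error(256)    # same code path, finer mesh
```

The same function runs at both resolutions, so a cheap coarse run exercises exactly the code that the expensive run will use.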
Thankfully on 1.12 this has been solved. You can redefine structs while keeping the REPL up.
> how do you define imports (using vs include syntax is so confusing!)
Yeah, Julia messed this up. The basic rule is that include and using are basically the same.
The key to performance with the GC in Julia is not allocating, but it has gotten substantially better since 2019.
But interfaces are informal. Not using a monorepo, say, makes it harder to be sure whether you broke downstream or not (via downstream’s unit tests).
But freedom from Rust’s orphan rule etc. means you can decompose large code into fragments easily, while getting almost Zig-style specialisation with the ease of use of Python (for consumers). I would say this takes a fair bit of skill to wield safely and in a maintainable fashion though, and many packages (including my own) are not extremely mature.
I was never an expert in the language, but I worked alongside people who were, and they generally wrote nice code.
But there were a few places where I saw intensely confusing patterns from overloading with multimethods. Code that became hard to follow, and had poor encapsulation.
It's interesting. I like the more opaque approach Rust takes. Rust has its own issues but it seems less corporately motivated. Maybe that's why it has more corporations using it? You aren't going to end up with the core maintainers of the language rug-pulling packages or language features to slow down competitors who are also using the tool. I say competition because it looks like they are making money through consultancies and very broad applications of the niche language.
Weird stuff to have to think about. I just want to write code.
this is not true; the other comment is wrong. there is no central body at all that "decides" what features are prioritized. features are simply worked on by whoever has the capacity, ability, and desire to do so.
many engineers at JuliaHub have all three of the capacity, ability, and desire to work on certain features because JuliaHub, in its capacity as a private business, pays them to do so. but with respect to Julia the programming language these are "just" third party contributions like any other.
From a quick Google search it looked kind of like a bunch of MIT staff/professors(?) are getting students to churn out code for a variety of business interests. It just doesn't seem right on the surface, and it does make me wonder what other things happen, knowing what I know about human behavior.
I am personally not interested, that's for sure. Thanks for sharing your experiences though.
I don’t know if these are contradictory exactly, but it seems to come from a very cluttered space.
This is the last step before I move to code generation and then generating a ton of test cases/debugging.
My goal is some form of release by the end of the year.
I've thought a lot more about the engineering than about any sort of marketing or business plan, so I just want to defer those.