High-Performance GPU Computing in the Julia Programming Language (devblogs.nvidia.com)
I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to answer questions about it.
As background, Nvidia's CUDA ("CUDA C++?") compiler, nvcc, uses a fork of LLVM as its backend. Clang can also compile CUDA code, using regular upstream LLVM as its backend. The relevant backend in LLVM was originally contributed by Nvidia, but these days the team I'm on at Google is the main contributor.
I don't know much (okay, anything) about Julia except what I read in this blog post, but the dynamic specialization looks a lot like XLA, a JIT backend for TensorFlow that I work on. So that's cool; I'm happy to see this work.
Full debug information is not supported by the LLVM NVPTX back-end yet, so cuda-gdb will not work yet.
We'd love help with this. :)
Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA PTX compiler. [0]
We ran into what appears to be the same issue [2] about a year and a half ago. Nvidia is well aware of the issue, but I don't expect a fix except by upgrading to Volta hardware.
[0] https://julialang.org/blog/2017/03/cudanative
[1] https://github.com/JuliaGPU/CUDAnative.jl/issues/4
[2] https://bugs.llvm.org/show_bug.cgi?id=27738
I've always thought it weird that I'm writing all my code in this language that compiles to C++, with semantics for any type declaration etc...And then I write chunks of code in strings, like an animal.
[1] https://github.com/ldc-developers/ldc [2] dlang.org [3] http://github.com/libmir/dcompute
The NVPTX backend would benefit, imo, from moving towards the more general LLVM infrastructure so that emitting the DWARF info is not another special case.
To be clear, there are two ways to compile CUDA (C++) code. You can either use nvcc (which itself may use clang), or you can use regular, vanilla clang, without ever involving nvcc.
Nvidia's closed-source compiler, nvcc, uses your host (i.e. CPU) compiler (gcc or clang) because it transforms your input .cu file into two files, one of which it compiles for the GPU (using a program called cicc), and the other of which it compiles for the CPU using the host compiler.
The other way to do it is to use regular open-source clang without ever involving nvcc. The version of clang that comes with your Xcode may not be new enough (I dunno), but the LLVM 5.0 release should be plenty new, unless you want to target CUDA 9, in which case you'll need to build from head.
I don't know the technical reasons why nvcc is so closely tied to the host compiler version -- it annoys me sometimes, too.
The hard part is optimization, because the GPU architecture (SIMD / SIMT) is so alien compared to normal CPUs.
Here's a step-by-step example of one guy optimizing a Matrix Multiplication scheme in OpenCL (specifically for NVidia GPUs): https://cnugteren.github.io/tutorial/pages/page1.html
Just like how high-performance CPU computing requires a deep understanding of cache and stuff... high-performance GPU computing requires a deep understanding of the various memory-spaces on the GPU.
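For readers who have not seen this kind of cache-aware restructuring, here is a minimal sketch of blocking (tiling) a matrix multiply, written in plain Python purely for illustration. The function names and the `tile` parameter are hypothetical and are not taken from the linked tutorial; a real kernel would pick the tile size to fit L1 cache or GPU shared memory and would not be written in Python at all.

```python
# Illustrative sketch only: a blocked (tiled) matrix multiply.
# `tile` is a hypothetical block size; real code would tune it so that a
# block of each operand fits in L1 cache (CPU) or shared memory (GPU).

def matmul_naive(a, b):
    """Textbook triple loop: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same arithmetic, but the loops walk tile-by-tile so each small block
    of A and B is reused while it is (conceptually) still in fast memory."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # block row of C
        for j0 in range(0, m, tile):      # block column of C
            for p0 in range(0, k, tile):  # block along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b, tile=2))
```

Both functions compute the same result; the only difference is the order in which memory is touched, which is exactly the part that has to be re-thought for each memory space on a GPU.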
------------
Now granted: deep optimization of routines on CPUs is similarly challenging, and actually goes through a very similar process of partitioning your work problem into L1-sized blocks. But high-performance GPU code has to consider not only the L1 cache... but also "Shared" (or OpenCL __local) memory and "Register" (or OpenCL __private) memory as well. Furthermore, GPUs in my experience have way less memory than CPUs per thread/shader. For example: an Intel "Sandy Bridge" CPU has 64 KB of L1 cache per core, which is split between 2 threads if hyperthreading is enabled. A "Pascal" GPU has 64 KB of "Shared" memory, which is extremely fast like L1 cache. But this 64 KB is shared between 64 FP32 cores!
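To put numbers on that comparison, here is the back-of-the-envelope arithmetic as a short script. It only uses the figures quoted in the comment above (64 KB per CPU core with 2 hardware threads, 64 KB of shared memory for 64 FP32 cores); the variable names are made up for this sketch, and the figures are taken at face value rather than checked against vendor datasheets.

```python
# Back-of-the-envelope arithmetic using only the figures quoted in the comment.
KIB = 1024

cpu_l1_bytes = 64 * KIB        # L1 per CPU core, as quoted
cpu_threads_per_core = 2       # with hyperthreading enabled
gpu_shared_bytes = 64 * KIB    # "Shared" memory, as quoted
gpu_fp32_cores = 64            # FP32 cores sharing that memory, as quoted

cpu_bytes_per_thread = cpu_l1_bytes // cpu_threads_per_core
gpu_bytes_per_core = gpu_shared_bytes // gpu_fp32_cores

print(cpu_bytes_per_thread // KIB, "KiB of fast memory per CPU hardware thread")
print(gpu_bytes_per_core // KIB, "KiB of fast memory per GPU FP32 core")
print("ratio:", cpu_bytes_per_thread // gpu_bytes_per_core)
```

On those numbers, each CPU hardware thread gets 32 times as much fast memory as each GPU FP32 core, which is why block sizes that work on a CPU usually have to shrink on a GPU.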
Furthermore, not all algorithms run faster on GPGPUs either. For example:
https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf
This paper claims that their accelerator implementation (on a Xeon Phi, which is a many-core coprocessor rather than a GPU) was slower than the CPU implementation! Apparently, the game of "Hex" is hard to parallelize / vectorize.
---------------
Now don't get me wrong, this is all very cool and stuff. Making various programming tasks easier is always welcome. Just be aware that GPUs are no silver bullet for performance. It takes a lot of work to get "high-performance code", regardless of your platform.
And sometimes, CPUs are faster.
Wow. That's very impressive.
I hope one day we get this sort of tooling with AMD GPUs.
https://docs.julialang.org/en/latest/stdlib/linalg/
It looks like Julia uses a combination of LAPACK and SuiteSparse. These are good choices, but it's not Julia code and these routines are callable from all sorts of other languages like Python, MATLAB, and Octave. As such, it still appears as though Julia is operating more like a glue language rather than a write-all-of-your-numerical-libraries-in-Julia language, which is fine, but I don't feel like that's what it's being sold as.
The benefit comes from user code, which in many dynamic languages is interpreted and is much slower than built-in C libraries. For example, look at the Julia `sum`. It is written in Julia. We are also in the process of replacing openlibm (based on the FreeBSD libm) with a pure Julia implementation. The same goes for any of the fused array kernels (arithmetic, indexing, etc.). Our entire sparse matrix implementation (except for the solvers) is in pure Julia.
Alright, so I write numerical codes professionally. Though it's not quite fair, I tend to bulk things into glue languages and computation languages. In a glue language, we combine all of our numerical drivers and produce an application. For example, optimization solvers don't really need to be written in a low-level language since their parallelism and computation are primarily governed by the function evaluations, derivatives, and linear system solvers. As long as these are fast, we can use something like Python to code it and it runs at about the same speed, and in parallel, as a C or C++ code. On the other hand, we have the computation languages where we code the low-level and parallel routines like linear algebra solvers. Typically, this is done in C/C++/Fortran, but I'm curious to see how Rust can fit in with these languages. For me, the primary focus of a computation language is, one, that it's fast and, two, that it's really, really easy to hook into glue languages. Since just about every language has a C API, that's our pathway forward.
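As a tiny illustration of the "hook into a glue language through a C API" pathway described above, here is a sketch that calls the C math library's `cos` from Python via `ctypes`. It assumes a Unix-like system where the C math library can be located; the wrapper name `c_cos` is made up for this example, and a real numerical library would expose far richer entry points than a single scalar function.

```python
# Sketch: Python as the glue language, C as the computation language.
# Assumption: a Unix-like system where the C math library can be loaded.
import ctypes
import ctypes.util
import math

# Locate libm; fall back to the symbols already loaded into this process.
_libm_path = ctypes.util.find_library("m")
_libm = ctypes.CDLL(_libm_path) if _libm_path else ctypes.CDLL(None)

# Declare the C signature: double cos(double);
_libm.cos.argtypes = [ctypes.c_double]
_libm.cos.restype = ctypes.c_double


def c_cos(x):
    """Hypothetical wrapper: forwards to the C library's cos()."""
    return _libm.cos(x)


if __name__ == "__main__":
    # Compare the C routine against Python's own math.cos.
    print(c_cos(1.0), math.cos(1.0))
```

The glue side only declares the signature and forwards the call; all the actual computation stays in the compiled library, which is the division of labor the comment is describing.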
Alright, so now we have Julia. Is it a glue language? Is it a computation language? Maybe it's designed to be both. However, at the end of the day, most of the examples I see of Julia on HN are using Julia as a glue language. To me, we have lots of glue languages that already hook into whatever other stuff we care about be it plotting tools or database readers or whatever. If Julia is designed to be a computation language, great. However, that means we should be seeing people writing the next generation of things like parallel factorizations and then hooking them into a more popular glue language like Python or MATLAB or whatever. Maybe these examples exist and I haven't seen them. However, until this is more clear, I personally stay away from Julia and I advise my clients to as well.
And, to be clear, Julia may be wonderfully suited for these things. Mostly, I wanted to express my frustration with what I see as an ambiguity in the marketing.
I haven't been following very closely recently but there has been some active native implementation work such as: https://github.com/JuliaDiffEq/DifferentialEquations.jl
Which reminds me a bit of Java, where the speed is either there or getting there for tight loops, but it just doesn't play well with others at all when they are wanting to do the driving.
There's one other domain where, depending on the details, Julia may fit well. At the moment, I prototype everything in MATLAB/Octave because the debugger drops us into a REPL where we can perform arbitrary computations on terms easily. Technically, this is possible in something like Python, but it's moderately hateful compared to MATLAB/Octave because factorizing, spectral analysis, and plotting can be done extremely easily in MATLAB/Octave. That said, I tend not to keep my codes there since MATLAB/Octave are not good, in my opinion, for developing large, deliverable applications. As such, in my business, where I quickly develop one-off prototype codes on a tight deadline, maybe Julia would be a reasonable choice.
Though, thinking about it, there may be licensing problems. The value in MATLAB is that they provide the appropriate commercial license for codes like FFTW and the good routines out of SuiteSparse rather than the default GPL. I'm looking now and it's not clear to me Julia provides the same kind of cover. This complicates the prototyping angle.