Statistics with Julia [pdf](people.smp.uq.edu.au) |
Statistics with Julia [pdf](people.smp.uq.edu.au) |
I think the language is really solidly designed, and gives you ridiculously more power AND productivity than python for a whole range of workloads. There are of course issues, but even in the short time I've been following & using the language these are being rapidly addressed. In particular: generally less rich system of libraries (but some Julia libraries are state of the art across all languages, mainly due to easy metaprogramming and multiple dispatch) + generally slow compile times (but this is improving rapidly with caching etc). I would also note that you often don't really need as many "libraries" as you do in python or R, since you can typically just write down the code you want to write, rather than being forced to find a library that wraps a C/C++ implementation like in python/r.
I don't think this is really a feature. It's nice that you can write more performant code in Julia directly and don't need to wrap lower level languages, without question, but the lack of libraries or library features is not a good thing. It's always better to use a general purpose library that's been battle tested than to write your own numerical mathematics code (because bugs in numerical code can take a long time to get noticed)
For specialized scientific computing applications, which would normally be written in C/C++, I would absolutely look into using Julia instead (though not sure what the openmp/mpi support is like). But I would also recommend against rolling your own numerical software unless you need to
You are much less likely to reinvent the wheel if you can add your one critical niche feature / bugfix to an existing library. In python, learning C and C build systems and python's C API are gigantic barriers to doing that.
More importantly, if every fast data manipulation needs to be written in C, a few of them can be profitably shared, but you need more than a few of them. Pretty soon you wind up with a giant dumping ground of undiscoverable API bloat. See: pandas.
The format for the code samples goes like (code chunk —> output/plots —> bullet points explaining the code line-by-line). This creates a bit of a readability issue. The reader will likely follow a pattern like: (Skim past the code chunk to the explanation —> Read first bullet, referencing line X —> Go back to code to find line X, keeping the explanation in mental memory —> Read second bullet point —> ...). In other words, too much switching/scrolling between sections that can be pages apart. Look at the example on pages 185-187 to see what I mean.
I’m not sure what the optimal solution is. Adding comments in the code chunks themselves adds clutter and is probably worse (not to mention creates formatting nightmares). I think my favorite format is two columns, with the code on the left side and the explanations on the right.
Here’s what I have in mind (doesn’t work on mobile): https://allennlp.org/tutorials. Does anyone know of a solution for formatting something like this?
Happy for more feedback (Yoni Nazarathy).
[1] https://github.com/JuliaLang/julia/milestone/30
[2] https://discourse.julialang.org/t/julia-v1-2-0-rc2-is-now-av...
using PyCall
np = pyimport("numpy")
np.fft.fft(rand(ComplexF64, 10))
Thats it. You call it with a julia native array, the result is in a julia native array as well.
Same with cpp
https://github.com/JuliaInterop/Cxx.jl
Or matlab
https://github.com/JuliaInterop/MATLAB.JL
It's legit magic
The “Ju” in Jupyter is for Julia, so it’s designed to be used as an interactive notebook language also. The Juno IDE is modeled after RStudio.
I'd like to offer a counter point or add on to this.
It's quirky enough to have many packages backed by some expert statistician.
I hope Julia get to be successful in this regard too.
Also the macro system allows one to define powerful DSLs (see Gen.jl for AI).
Although JuliaBox has been provided for free by Julia Computing, there has been discussion that this may not be possible in the future. However, Julia Computing does provide a distribution of Julia, the Juno IDE, and supported packages known as JuliaPro for free.
For new users, would the free JuliaPro distribution be a good alternative to JuliaBox and/or downloading the REPL and kernal from julialang.org?
JuliaBox (and https://nextjournal.com/) are cloud services, but if you have a real computer and want to do this for more than a few minutes, just install it. (There's also no need for virtualenv etc.)
[1] https://github.com/JuliaPlots/Makie.jl
I remain skeptical that this solves a lot of real-world problems (I know a lot of users of R/Python who never need to resort to writing their own C/C++ code), but that's the sales pitch.
The moment Julia shines is when your workloads can't be phrased by stringing together the limited set of vectorised verbs that python / r libraries give you: this is anything stateful and loopy like reinforcement learning, systematic trading, monte carlo simulations etc. It's also useful if you really care about performance and are doing "vanilla" computations at a truly large scale. If you want to avoid copying memory (i.e. doing vectorised operations), or want to tightly optimise / fused some numerical operations, it's great.
The other issue with python / r wrapping c++ libraries is that different libraries will generally not play well together (without coming out into python / r space, and doing a lot of copying / allocation). This tends to encourage large monolithic c/++ codebases like numpy and pandas, that are pretty impenetrable and difficult to extend / modify.
Yoni Nazarathy.
1. Indices by default start with 1. This honestly makes a ton of sense and off by one errors are less likely to happen. You have nice symmetry between the length of a collection and the last element, and in general just have to do less "+ 1" or "- 1" things in your code.
2. Native syntax for creation of matrices. Nicer and easier to use than ndarray in Python.
3. Easy one-line mathematical function definitions: f(x) = 2*x. Also being able to omit the multiplication sign (f(x) = 2x) is super nice and makes things more readable.
4. Real and powerful macros ala lisp.
5. Optional static typing. Sometimes when doing data science work static typing can get in your way (more so than for other kinds of programs), but it's useful to use most of the time.
6. A simple and easy to understand polymorphism system. Might not be structured enough for big programs, but more than suitable for Julia's niche.
Really the only thing I don't like about the language is the begin/end block syntax, but I've mentioned that before on HN and don't need to get into it again.
Take, for example, a simple program that creates a line plot (https://docs.juliaplots.org/latest/tutorial/):
using Plots
x = 1:10
y = rand(10)
plot(x, y)
After installing the package, the first run has to precompile(?), and subsequent runs use the package cache. But ~25 s to create a simple plot is incredibly slow and frustrating to work with. $ julia --version
julia version 1.1.1
$ time julia plot.jl
julia plot.jl 73.71s user 4.45s system 110% cpu 1:11.04 total
$ time julia plot.jl
julia plot.jl 24.41s user 0.39s system 100% cpu 24.633 total
$ time julia plot.jl
julia plot.jl 23.38s user 0.36s system 100% cpu 23.519 total $ julia --compile=min -e '@time (using GR; plot(rand(20)))'
0.375836 seconds (368.83 k allocations: 20.190 MiB, 1.65% gc time)
$ julia --compile=min -e '@time (using Plots; plot(rand(20)))'
4.302867 seconds (6.41 M allocations: 371.485 MiB, 5.07% gc time)Of course, we continue to work on improving compile times. About half of the time is spent in LLVM compilation, which has actually become slower over time.
$ time julia -e "using PyPlot;x=1:10;y=rand(10);plot(x,y);"
real 0m5.676sThe next day I just ended up using C++/Eigen with a simple matplotlib binding [1]. The code is nearly indistinguishable from Python/Julia (except for having more verbose types where it makes sense, using "auto" otherwise), and the entire compile+run cycle takes less time for some short runs than it takes Julia to print "Hello World".
That being said, I'm not advocating for people to use C++. I would love to use Julia, and applaud the developers for their hard work and contribution to scientific computing, but as it stands right now, it doesn't seem to be the right tool for me, since I'm relying on fast editing/execution cycles.
I’d say for most people, there’s so much great progress and improvements happening that the breakages are well worth it.
I tried to recreate something like AlphaGo in Python using Keras, I never got the learning to work (probably because I was impatient and training on a laptop CPU), but a lot of the CPU time was simply being spent on manipulating the board state.
So I ported my "Board" object to Rust, and it was a lot faster. Things like counting liberties or removing dead stones were a lot faster, which was important.
Then I rewrote the whole thing in Julia and it was just as fast as my Python / Rust combo.
So I saw for myself that Julia does solve the two language problem. It is as pleasant to write as Python (and I like it better actually), and performed as well as Rust, based on my informal benchmarks.
Besides Dijkstra's classic paper[1] showing why 0-based indexing is superior, in practice I find myself grateful for 0-based indexing in Python because of how slices and things just work out without needing +1/-1.
I'd like to understand. Could you give an example of when 1-based indexing works out better than 0-based?
Also I find it elegant that for 1-indexing that the start and end value for slices are both inclusive, instead of the first one being inclusive and the last being exclusive.
Also, isn’t it just weird that the index of an element is one less than it’s “standard” index? Like if I take the first nth elements of a list, it would stand to reason that the nth element should be the last element, right?
The reason for zero indexing is historical, related to pointer offsets. I don’t think anyone chose them to be easier for people. They just made them that way because it maps closer to how contiguous values in arrays are accessed.
Also, with 1-indexing I can multiply numbers by arrays and get reasonable offsets. 3 x 1 is three, so I would get the third element of the list. But with 0-indexing, I have 0 x 3 which gives me the same element, clearly inconsistent.
There are some good reasons for 0-indexing and I have been using it in every language for my entire career. The amount of code I’ve written in Julia is marginal compared to my 0-indexing experience, so I might be missing something.
One nice this about 0-indexing is that I can slice a list in half with the same midpoint. For example a Python array with 10 elements:
fst, snd = arr[0:5], arr[5:10]
A little nicer than:
fst, snd = arr[1:5], arr[6:10]
Though you could have inclusive slices with 0-indexing, but it would be inconvenient and suffer from the same problem as 1-indexing.
For python teaching this is almost a whole chapter, with people sharing cheat sheets and building graphics to show how slicing works what not. You don't see these things in R teaching materials.
I'm sure that for the implementation of algorithms, things might be easier with zero indexing, but for a user asking for element 4,5 and 6, 1-indexing is much, much easier on the user.
for (size_t i = v.size() - 1; i >= 0; --i) {
std::cout << i << ": " << v[i] << std::endl;
To fix the infinite loop you could write: for (size_t i = v.size(); i > 0; --i) {
std::cout << i - 1 << ": " << v[i - 1] << std::endl;
Neither is great. Switching to signed integers might make your compiler throw warnings at you.However, 1-based indexing does not work out well with modular arithmetic:
# 1 based
v[1 + (i - 1) % v.size()]
# 0 based
v[i % v.size()]
There's pros and cons with both schemes.The goals of Python were quite different from the goals of Julia.
Python programmers seem content implementing the same things over and over again. Like, for example, flattening a list/monad.
List of things python doesn’t have but should: pattern matching, multi-line lambdas, more data structures (look at Scala for an example of what kind of data structures a standard library should provide), real threading, options, monads, futures, better performance, and more.
I wouldn’t go that far and say Python is suitable for large programs. It’s clearly not. Working on a large python code base is hell.
By contrast, Python feels a bit too rigid/standardized. Everyone’s code looks like it was copy+pasted from a book of truth somewhere. This is good for sharing and engineering, not as good for expressing mathematical ideas.
So whereas R has evolved organically over decades and Python is for everyone (and alternatives like MATLAB or SAS are first and foremost software for industry rather than languages), Julia seems to be thoughtfully purpose-built to be a modern language for numerical/scientific computing. It polishes off the rough edges and blends some of the best features of each language. Again, this is just an impression from someone who already thinks in R but is learning both Python/Julia.
More to your point, maybe Julia is at a stage of development where it’s good for both students (for developing computational and mathematical thinking) and experts (for slinging concise but performant code), but not yet the rank-and-file users looking to just get things done.
https://pandas.pydata.org/pandas-docs/stable/reference/serie...
Even though it's long, it undersells the problem, because many of these methods have nontrivial overload semantics that open up like a fractal when you look in turn at their docs. The link also undersells the problem because this junkheap is evidently so incomplete that people are frequently forced to rely on numpy to extend it.
APIs should make hard things easy, but API gloveboxes like this make easy things hard. Minimal API + Performant Glue >> We do everything for you + You can't ever touch your own data or your perf dies + Good luck reverse engineering these semantics if you've forgotten the context and need to port it.
I wish the pypy team had finished numpypy. Then fast numerical programs could be written in python instead of relying on all kinds of C extensions and stuff. Python would be great for numerics then.
[1] https://docs.julialang.org/en/v1/base/math/#Base.mod1
[2] https://docs.julialang.org/en/v1/base/arrays/index.html#Base...
While you can't do it from the shell very well right now (rerunning the program at each step like you would with an interpreted language), that kind of fast cycle is something very common in Julia development but with a particular REPL based workflow [1] in which you use a tool like Revise.jl [2] to automatically update the definition whenever you save a file in your project (the only restriction is that it doesn't automatically updates new type definitions) and directly interacting with the program in the REPL. This way it will only recompile what you just altered, and it's very fast to actually run the code. Other interesting tools are Rebugger.jl (debugger for the REPL) [3] and OhMyREPL (coloring for the REPL) [4], which you can add to your startup.jl to always automatically load them.
[1] https://docs.julialang.org/en/v1/manual/workflow-tips/index....
[2] https://github.com/timholy/Revise.jl
Plotting is indeed slower than ideal, have not used Gadfly but Plots is more like 15s after restarting, then 10ms each time after. GR is faster, 5s or so the first.
http://juliadiffeq.org/DiffEqTutorials.jl/html/introduction/...
Maybe they should stop accepting them then.
Not that I have any position on Julia
It should be an easy and intuitive language, just as powerful as major competitors.
It should be open source, so anyone can contribute to its development.
Its code should be understandable as plain English.
It should be suitable for everyday tasks, allowing for short development times."
https://www.computerhistory.org/fellowawards/hall/guido-van-...
“The first sound bite I had for Python was, "Bridge the gap between the shell and C."
So I never intended Python to be the primary language for programmers, although it has become the primary language for many Python users. It was intended to be a second language for people who were already experienced programmers, as some of the early design choices reflect.‘
- I agree about pattern matching, that'd be nice.
- Multi-line lambdas haven't been important, but maybe I'm missing something.
- The list of data structures in the stdlib doesn't matter to me, since the 200k libraries on PyPI make up for it, and since packaging is easy nowadays with Poetry, they are as good as built-in but they get more frequent fixes and improvements than would be possible for the stdlib. Maybe there are some good side-effects of having extra types built in, based on a community of people using these types?
- Threading, I suppose, though Python isn't really the right language overall for that stuff anyway.
- Options, and monads, yeah that'd be nice.
- Futures are an idea whose time has come and gone IMO, but they're in asyncio anyway :\.
- For performance, PyPy is quite fast for many use cases.
Is there a language that has all these things built in and has a repl?
As a brief demonstration I can write:
foo(a, b, c) = (a + b) * c
And when I call it on integers, it emits only the necessary integer assembly, and when I call it on floats only the necessary float assembly, and when I broadcast it across vectors it emits SSE assembly. It's only when it can't prove the incoming types that it emits any sort of dynamic type code. It's also possible for the calling function to be ignorant of the types too, and so on, until a user decides to pass in an integer or a float, and all of the code is specialized to be as fast as possible.
As I’ve learned the language it’s become pretty easy to avoid those pitfalls even on initial implementations. That said, providing types in function signatures is still very useful for multiple dispatch and providing a more usable API in libraries.
However, as soon as you try to do something a bit more complicated then you’ll notice the speed and flexibility differences.
I prototyped a quick julia implementation of a simple glm (almost identical code in Julia and R), and the julia code was approximately 10-20 times faster depending on the model.
This is definitely worth looking at (mind you, the costs of redevelopment of our code in Julia is probably prohibitive). That being said, this would encourage me to call out to julia from R for some of my more computationally heavy workloads.
For example, a straightforward Python-to-LLVM compiler would generate code with every variable being a PyObject (https://docs.python.org/3/c-api/structures.html) instance, and “switch(obj.ob_type)” equivalents that would require a “sufficiently advanced compiler” to get to equivalent speed as, say, C.
Good point, in Python I don't notice that the last element is arr[len(arr)-1] because Python provides arr[-1]. I think in general your point is that it's natural for the nth element to be arr[n].
> The reason for zero indexing is historical, related to pointer offsets.
There is that, but Dijkstra's paper makes the case from first-principles that the closed, open interval of [0,n) for sequences is the most appropriate.
> with 1-indexing I can multiply numbers by arrays... 3 x 1 is three...
Sorry, I don't understand this. It makes sense that the point I don't understand is probably most-related to why Julia chose its indexing scheme and why Matlab et al. do the same.
> One nice this about 0-indexing is that I can slice a list in half with the same midpoint.
Yeah, arr[0:index] + arr[index:len(arr)] is the full list. And to your point earlier ("if I take the first nth elements of a list"), len(arr[:n]) == n seems natural.
Edit: I've been trying to formalize why Python's indexing scheme, along with its negative-indexing, is optimal (slight pseudocode):
l = ['a','b','c']
n = len(l)
i = -n
while i < n:
print(l[i++])
prints "a b c a b c". That code makes no reference to any bound but 'n', nor any constants (1,0) or offsets, yet it iterates over the list twice through its range (first negative indices then positive).He argues that it is the most appropriate when indexing into an array of the natural numbers starting with 0. If the array in question started with 1, one-based indexing would be most appropriate following exactly the same logic!
> When dealing with a sequence of length N, the elements of which we wish to distinguish by subscript, the next vexing question is what subscript value to assign to its starting element. Adhering to convention a) yields, when starting with subscript 1, the subscript range 1 ≤ i < N+1; starting with 0, however, gives the nicer range 0 ≤ i < N.
He never specifies the contents of the array, he's talking about subscripts (i.e. indexes).
This is interesting. Suppose the task is to use this approach (index * stride) to pick every third item from a list of 9 items: [1, 2, 3, 4, 5, 6, 7, 8, 9].
With 1-indexing: Multiply the sequence of valid indices (1, 2, 3, ...) by the stride (3) and use the result to 1-index into the given list. Returns [3, 6, 9].
With 0-indexing: Multiply the sequence of valid indices (0, 1, 2, ...) by the stride (3) and use the result to 0-index into the given list. Returns [1, 4, 7].
0-indexing has the start point of the return values fixed to the origin. 1-indexing has its start point float around depending on the stride. Both work, but have different emergent properties in the given example.
Julia has its own type system, which doesn't conform to the traditional static/dynamic divide. But AFAIK it doesn't have a compile-time type checker like mypy to help me catch type errors early.
However, you can simulate static typing with the @inferred macro from Test.jl. It will throw an error if all types are not infer able ahead of time which is essentially static typing.
I will say this though: in all the noise in this sub-thread, I never got any example in answer to my original question asking for any algorithm or formula that works out better with 1-based indexing (my original response was to its parent's claim that 1-based indexing results in fewer ±1s in practice). Except for maybe the "stride" examples[1] for which I still don't understand why the starting index is important. I say this not in victory (ha ha! zero-based indexes are clearly superior!) but in disappointment because I was hoping to gain understanding of why Julia/Matlab and others (which are more geared towards math/stats which is outside of my experience) made their indexing choice. Particularly because it's against the norm, they must have good reasons.
I'm not disputing the choice of closed-open rage, but I'm arguing the expressions are equally simple with either 1 and 0 based indexing.
As for an example where 1-based indexing is better, I had such an example in a another subthread: Translating between the numbers and names of months. Or indeed anywhere where you have a list of things and you want to pick them by ordinal numbers. E.g if you want the N'th president, it is simper to do presidents[N] than presidents[N-1].
The value of convenience over inconvenience isn't objectively better, that's all.
Objectively speaking, if I find the least positive residue of some integer modulo M, I get a value from 0 to M-1. If my M-sized array is from 0 to M-1, that is objectively convenient:
hash_table[hash(string) % table_size]
Objectively speaking, if I have some files in a directory and I give them zero based names like file000, file001, ... then I can objectively refer to the first volume of ten using the single pattern file00?, and the next volume as file01?. If they are numbered from 001, I need file00? file010 to match the first ten. For the next ten, the file01? pattern unfortunately matches file010 so I objectively need some way to exclude it.Objectively speaking, if I have zero based indices to a 3D array, I can find the address of an element <i, j, k> using the homogeneous Ai + Bj + Ck rather than Ai + Bj + Ck + D, which objectively adds an extra term.
Objectively speaking, the zero-based byte index B can be converted to a four byte word index using B / 4 (truncating division), and within that word, the zero-based byte local offset is B % 4. Objectively speaking, the same conversion from 1-based bytes to 1-based words requires (B-1)/4+1 and (B-1)%4+1, which is objectively more syntax and more nodes in the abstract syntax tree.
There is no reason I should like shorter, simpler, faster; that's a purely subjective aesthetic. After all, a short poem isn't better than a long one; a bacterium isn't better than a rhinoceros; and so on.
Hey, how about those one-based music intervals? C to E is a major third and all that? We have a diatonic scale with seven notes, right? As we ascend, whenever we cycle through seven notes, we have passed ... one more octave. And to invert an interval, we subtract from ... why, nine of course! And a fifth stacked on top of a fifth is a major ninth. There is objectively more cruft in 1 based music theory than 0 based. But there is no accounting for people liking it that way, right?
But I’m not trying it make the point that 0 based or 1 based is better or worse than the other. I’m just saying that it’s a borderline immaterial difference for most use-cases and Julia gives many many tools for getting around any problems that may arise when a certain indexing scheme is awkward.
The 0 or 1 based debate is one of the most boring and pedantic arguments one can have and I do my best to ridicule people when they try to start it.
But it is also an objective fact that using 1-based indexing means that the index corresponds to the ordinal numbers, e.g. index 1 is the first element, index 2 is the second element and so on. This also have a number of convenient properties.
For example February is the 2. month, so if you have a list of the names of the months, you would expect month_names[2] to be February. With zero-based you would have to do month_names[month_number - 1]. And if you want to get the month number from the name, you would have to do month_names.index_of(month_name) + 1. Be careful not to switch up the +1 and -1!
As for music theory, a third is called so because it spans three half-notes. It describe the size of a range which is independent of the offset of the indices. By the same token decades are 10 years (not 9) and centuries are 100 years (not 99).
> For example February is the 2. month, so if you have a list of the names of the months, you would expect month_names[2] to be February.
That's not 1-based indexing being good; that's conforming to (or reflecting) an externally imposed 1-based system that is itself questionable.
Should the seconds of a minute go from 1 to 60 instead of 0 to 59? Dates and times are full of poorly chosen conventions, including ones that don't match people's intuitions. For instance, many people celebrated the new millennium in January 2000. People also want decades to go from 0 to 9; the "eighties" are years matching the pattern 198?, not 1981 to 1990. Yet the 20th century goes from 1901 to 2000.
In many situations when 1 based numbering is used, it's just symbols in a sequence. It could be replaced by Gray code, greek letters, or Japanese kana in i-ro-ha order.
When the arithmetic properties of the index matter to the point that it's involved in multiplication (not merely successor/predecessor relationship), it is advantageous to make the origin zero.
If month_name[1] must be "January", I'm okay with wasting month_name[0]; that's better than supporting a one-based array (let alone making that default).
> As for music theory, a third is called so because it spans three half-notes.
No it doesn't; in that very same music theory, a "major second" interval is also known as "one step" or a "whole step"! A third is "two steps"; that's what it spans. (I don't know what you mean by "half-notes"; I sense confusion.) This nonsense was devised centuries ago by innumerates, just like various aspects of the calendar.
Then he goes on weird diatribes like the one about "corporate religion". I'm not sure, but I think he is trying to say that people who question his dubious logic are "sheeple".
Neil Armstrong was the 1st man on the moon - not the zero'th. Everywhere you have a sequence of discrete units, they are numbered starting from 1.
The thing with array indices in C is they are not ordinal numbers. They are offsets. Which means you can (at least in theory) do x[-1] to get the element before x. So a C array is not actually an array in the mathematical sense, it is just syntactic sugar for relative offsets in a larger array.
So what makes most sense? It really depends on what you want to achieve.