Haskell improves log processing 4x over Python (devblog.bu.mp)
It's important to note that this particular job is largely bound by (a) I/O and (b) format serialization. Both Python's BSON and JSON libraries are mature and have their critical sections written in C, so a speedup of 4x is still noteworthy. The Haskell version, on the other hand, is pure Haskell.
/still a Python fan
/also still a python fan :-)
I'd love to point people to this when trying to convey some advantages of Haskell. To make it more compelling, can you expand some on the downsides and maybe obstacles you encountered?
The thing I'm unsure about is how difficult it would be for (very) talented developers to just jump in. We have really talented developers, and everyone is super time-constrained, so many are wary of diving into a language as different as Haskell. Was it hard for your developers to figure Haskell out? Did your previous use of Scala help? How long did it take them to dive into Scala?
It's all much easier to digest, though, even for "really talented developers", if they have some experience with another functional language first. OCaml is a nice stepping stone before digging into the abstractions involved in understanding Haskell's powerful type system. Scala is good too, but having the object stuff mixed in there can lead you to rely on some patterns that aren't going to be available in a non-OOP language. I think the Scheme/Clojure path isn't bad either, but it's probably ideal to spend some time in the "statically typed" wing of the functional universe before going to Haskell.
I came to Haskell with no understanding of monads, started writing code, and eventually used my knowledge of Haskell to learn about monads. Not understanding monads just meant I was lacking a useful design pattern, and found certain API docs confusing, but it didn't stop me from writing reasonable code in most circumstances.
On the other hand what you describe in your (awesome) blog post is a more significant Haskell project than any I've worked on, so I'd be interested to hear your experience.
I've not really written my own monad, or properly looked into monad transformer stacks, and I'm aware that I could probably clean up a lot of code using them - is that the sort of thing you mean?
To learn to program in a purely functional style, it's best to jump into Haskell cold turkey, since you will have to learn to think in FP.
When I was learning Haskell, optimization in a lazy world was the most difficult task. Often, I still have problems predicting how efficient particular code will be. The complexity of monads is somewhat overstated, though it doesn't help that some tutorials make something big and esoteric out of them. A monad is nothing more than a type class that specifies how to combine computations that result in some 'boxed value'.
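For anyone coming from the Python side of this thread, here is a rough analogue of the "boxed value" idea. It is only a sketch: Python has no type classes, so this is a single hand-written box rather than Haskell's actual Monad class, and the names `Maybe`, `bind`, `parse_int` and `reciprocal` are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Maybe:
    """A 'boxed value': either holds a value (is_just) or is empty."""
    value: Any = None
    is_just: bool = False

    @staticmethod
    def just(value):
        return Maybe(value, True)

    @staticmethod
    def nothing():
        return Maybe()

    def bind(self, f):
        """Pass the boxed value to f (which must return a Maybe); skip f if empty."""
        return f(self.value) if self.is_just else self


def parse_int(text):
    # Hypothetical step that can fail: not every string is an integer.
    try:
        return Maybe.just(int(text))
    except ValueError:
        return Maybe.nothing()


def reciprocal(n):
    # Hypothetical step that can fail: division by zero.
    return Maybe.nothing() if n == 0 else Maybe.just(1 / n)


def parse_then_invert(text):
    # bind chains the two steps; a failure in either one short-circuits the rest.
    return Maybe.just(text).bind(parse_int).bind(reciprocal)
```

The only thing `bind` decides is how one boxed computation feeds the next, which is roughly the "how to combine computations" part of the description above.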
Haskell as an EDSL for generating hard real time, however, is very viable: http://corp.galois.com/blog/2010/9/22/copilot-a-dsl-for-moni...
Now, if I had stated that all conceivable systems programming domains are addressable with Haskell, that would have indeed been foolish.
What you're probably observing is Python's slow code generation being masked by the inherent slowness of I/O.
Except, when Python's pants are on, it makes gold records.
I haven't looked to see if there are any explicit optimizations, but your statement is ridiculous; an effective IO strategy can have an enormous effect on performance.
Any reason why you didn't use Hadoop for this, then run batch jobs to extract summaries?
http://tartarus.org/james/diary/2008/06/17/widefinder-final-...
Reading data from a file handle into a buffer is a trivial operation. It's what you do with that data afterwards that is important. In C (or Go) you have complete control over what happens next. As for Python, I don't know what happens, but I don't see how it could possibly be more efficient than any other sane language.
If that is all you're doing, then yes; there isn't a much more efficient way of doing that.
In C (or Go) you have complete control over what happens next.
It is up to the programmer to know what to do next. Does Haskell strike you as a language of micromanagement? Python can sometimes be multiples faster than command line grep. I haven't looked into why, but I have some ideas.
You are being incredibly vague and unhelpful.
If you don't understand what I'm talking about, it doesn't make /me/ wrong. And just saying so is also not an appropriate response. I had a specific question that was answered by the OP, and was useful to me.
Hacker News comment threads are rarely a place of education, but I will reinterpret what you said as a question.
An IO strategy is how and when you make those system calls. Reading from disk takes a vast amount of time, during which you can be doing computation. To be fast, you should be asking for the appropriate amount of lookahead, at the correct offset. Is that 4k? 1Meg? 100Megs? 1GB? Do you use threads for this? Can you skip any of the input stream? Do you let the operating system, programming language, library, or program code decide how big the read is? Where that data is stored after being read from disk is also important. Especially fast strategies use mmap to avoid copying from kernel space into user space. And of course everything is always chunked at specific intervals, so knowing where those are can sometimes reduce the number of calls. The ability to optimize for these is one thing that makes dedicated database software so successful.
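To make two of those knobs concrete, here is a minimal Python sketch (my own illustration, not anything from the blog post): the same newline count done with explicit reads of a tunable size and with mmap. The function names and the 1 MiB default are assumptions, and which variant wins depends on the OS, the disk, and the file.

```python
import mmap
import os


def count_newlines_chunked(path, chunk_size=1 << 20):
    """Count newline bytes using explicit reads of at most chunk_size bytes."""
    total = 0
    # buffering=0 gives a raw file object, so each read() is roughly one system call
    # and chunk_size directly controls how many calls are made.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += chunk.count(b"\n")
    return total


def count_newlines_mmap(path):
    """Count newline bytes by scanning a read-only memory map of the file."""
    if os.path.getsize(path) == 0:
        return 0  # an empty file cannot be memory-mapped
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            total = 0
            pos = m.find(b"\n")
            while pos != -1:
                total += 1
                pos = m.find(b"\n", pos + 1)
            return total
```

Timing `count_newlines_chunked` with a few different `chunk_size` values (4 KiB, 1 MiB, 100 MiB) against `count_newlines_mmap` on a large log file is a quick way to see how much the read strategy alone can matter.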
It is a dark art, and it is not expected that the average person know these things. If you happen to be the kind of person working on a programming language, it could be useful to be aware of them. Here are some quick links, but there is a vast amount written on the subject.
[http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...] [http://tangentsoft.net/wskfaq/articles/io-strategies.html]
Besides that, the grandparent is right: possibly every situation where Python was used as a systems programming language, Haskell could fit in (and more).
To give you an example of what systems programming is, I have helped develop operating system kernels, virtual machine monitors, and distributed networked systems. All of these things would be considered systems programming.
See http://en.wikipedia.org/wiki/System_programming for more information.
Basically, you can't swing a dead cat without hitting monads in the Haskell library ecosystem; therefore, you'll need to know what they are.
In general, Java tends to be better for a long-running process while Haskell might be better for a one-off job (not to say that one couldn't do the other without problem though).
I'm not that familiar with JIT, but wouldn't compiling everything to native code ahead of time be at least as good as compiling parts that turn out to be slow, just in time? Is it about what sorts of optimizations to use, which you don't know until runtime?
Also, would this situation change given the fact that they're switching to LLVM?
JIT does, however, use cycles at runtime to do the actual compile. While I have seen benchmarks where JIT wins over AoT, I'd still bet that the balance favors traditional compilers.
I'm not familiar with how common these techniques are in production JVMs, but as they're becoming common in JavaScript implementations I'd assume they're equally common there.
LLVM can both use a JIT and compile to native code. From what I've seen, GHC doesn't seem to be doing much different, so I'd assume they're still compiling to native code.