Haskell improves log processing 4x over Python

Haskell improves log processing 4x over Python (devblog.bu.mp)

114 points by jmintz 15 years ago | 41 comments

andrewcooke 15 years ago |

The work sounds very cool (and they are hiring), but (only) a factor of 4 speedup over Python is (to repeat a phrase from elsewhere today) like boasting that you're the tallest midget ;o)

jamwt 15 years ago | |

Hi, article author here.

It's important to note that this particular job is largely bound on a.) I/O and b.) format serialization tasks. Both Python's BSON and JSON libraries are mature and have their critical sections written in C, so a speedup of 4x is still noteworthy. The Haskell version, on the other hand, is pure Haskell.

andrewcooke 15 years ago | | |

Neat - thanks.

jbellis 15 years ago | |

Agreed. Even where you can optimize the hot code in C, Python is no speed demon. Cassandra's Java stress test can push out about 10x as many ops/s as the Python one, even though the Thrift C extension for Python is quite good.

/still a Python fan

jamwt 15 years ago | | |

Yeah, Haskell is roughly as fast as Java. I imagine, with the tuning I allude to in the blog post, we could restore about a 10x improvement, even on a single core. After all, 10x was about what we had with raw Redis ops/s before the serialization libraries got involved.

/also still a python fan :-)

Peaker 15 years ago |

Sounds great. I'm a very big Haskell fan.

I'd love to point people to this when trying to convey some advantages of Haskell. To make it more compelling, can you expand some on the downsides and maybe obstacles you encountered?

The thing I'm unsure about is how difficult it would be for (very) talented developers to just jump in. We have really talented developers, and everyone is super time-constrained, so many are wary of diving into a language as different as Haskell. Was it hard for your developers to figure Haskell out? Did your previous use of Scala help? How long did it take them to dive into Scala?

jamwt 15 years ago | |

I would say the two real barriers to writing effective Haskell projects are a.) "getting" monads, and b.) understanding the implications of laziness, especially with regard to space leaks and unconsumed thunks. Everything else isn't that big of a deal.

It's all much easier to digest, though, even for "really talented developers", if they have some experience with another functional language first. OCaml is a nice stepping stone before digging into the abstractions involved in understanding Haskell's powerful type system. Scala is good too, but having the object stuff mixed in there can lead you to rely on some patterns that aren't going to be available in a non-OOP language. I think the Scheme/Clojure path isn't bad either, but it's probably ideal to spend some time in the "statically typed" wing of the functional universe before going to Haskell.
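On the laziness point above ("space leaks and unconsumed thunks"), here is a rough sketch of the idea. Python is strict, so this only models thunks explicitly; it is an analogue, not Haskell and not anything from the blog post. The names `Thunk`, `lazy_sum`, and `strict_sum` are made up for illustration. In Haskell, the usual textbook case is a lazy left fold (`foldl`) over a large list versus the strict `foldl'`.

```python
class Thunk:
    """A deferred computation 'prev + x' that is not evaluated until forced."""

    def __init__(self, prev, x):
        self.prev = prev  # the earlier Thunk, or None for the initial 0
        self.x = x

    def depth(self):
        # Number of suspended additions still being held in memory.
        n, t = 0, self
        while t is not None:
            n, t = n + 1, t.prev
        return n

    def force(self):
        # Walk the chain and do all of the postponed work at once.
        total, t = 0, self
        while t is not None:
            total, t = total + t.x, t.prev
        return total


def lazy_sum(xs):
    # Models a lazy accumulator: each step wraps the previous result
    # instead of adding to it, so the chain grows with the input.
    acc = None
    for x in xs:
        acc = Thunk(acc, x)
    return acc


def strict_sum(xs):
    # Models a strict accumulator: each step is evaluated immediately,
    # so only one number is kept alive.
    acc = 0
    for x in xs:
        acc += x
    return acc


pending = lazy_sum(range(1, 1001))
print(pending.depth())                # suspended additions held in memory
print(pending.force())                # the work happens only here
print(strict_sum(range(1, 1001)))
```

Both versions compute the same sum; the difference is that the lazy one keeps one suspended step per input element alive until something finally demands the result, which is the shape of a space leak.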

samstokes 15 years ago | | |

Could you say more about why "getting" monads was needed?

I came to Haskell with no understanding of monads, started writing code, and eventually used my knowledge of Haskell to learn about monads. Not understanding monads just meant I was lacking a useful design pattern, and found certain API docs confusing, but it didn't stop me from writing reasonable code in most circumstances.

On the other hand what you describe in your (awesome) blog post is a more significant Haskell project than any I've worked on, so I'd be interested to hear your experience.

I've not really written my own monad, or properly looked into monad transformer stacks, and I'm aware that I could probably clean up a lot of code using them - is that the sort of thing you mean?

grav1tas 15 years ago | | |

I agree. I dove into Haskell without doing any of the prior, and it was like running into a brick wall. Persistence paid off in my case, but I do wonder how I would have handled it if I had spent time with OCaml beforehand.

microtonal 15 years ago | |

From personal experience: I didn't make much progress in Haskell until I stopped using Scala. The problem is that Scala allows you to mix and match different paradigms and if you come from a mostly-imperative/OO background, you tend to use Scala as an OO language with some functional constructs.

To learn to program in a purely functional style, it's best to jump into Haskell cold turkey, since you will have to learn to think in FP.

When learning Haskell, optimization in a lazy world was the most difficult task. Often, I still have problems predicting how efficient particular code will be. The complexity of monads is somewhat overstated, though it doesn't help that some tutorials make something big and esoteric out of it. It is nothing more than a type class that specifies how to combine computations that result in some 'boxed value'.
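To make the "combine computations that result in some boxed value" description concrete, here is a minimal sketch of the pattern. It is a Python analogue of an optional-value box, not Haskell's actual `Monad` type class, and `Maybe`, `just`, `NOTHING`, `parse_int`, and `reciprocal` are all names invented for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Maybe:
    """A 'boxed value': either holds a value or is empty."""
    value: Any = None
    is_nothing: bool = False

    def bind(self, f: Callable[[Any], "Maybe"]) -> "Maybe":
        # Combine this computation with the next one.
        # If the box is already empty, the next step is skipped.
        if self.is_nothing:
            return self
        return f(self.value)


def just(x: Any) -> Maybe:
    return Maybe(value=x)


NOTHING = Maybe(is_nothing=True)


def parse_int(s: str) -> Maybe:
    # A computation that may fail: returns an empty box on bad input.
    try:
        return just(int(s))
    except ValueError:
        return NOTHING


def reciprocal(n: int) -> Maybe:
    # Another computation that may fail: no reciprocal for zero.
    return NOTHING if n == 0 else just(1 / n)


# Chaining: each step only runs if the previous one produced a value.
result_ok = parse_int("4").bind(reciprocal)
result_bad = parse_int("oops").bind(reciprocal)
result_zero = parse_int("0").bind(reciprocal)
print(result_ok, result_bad, result_zero)
```

The only "monadic" part is `bind`: it is the rule for feeding the contents of one box into the next computation, and the failure handling lives in that one place rather than in every caller.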

Locke1689 15 years ago |

The author is mostly right about the use cases of Haskell, but simply "systems" is a bit misleading because there are certain performance characteristics of lazy programs which make them bad choices for some systems programs. Any type of real-time system, for example, can suffer unpredictable performance in critical sections, which is pretty undesirable.

dons 15 years ago | |

Hard real time systems are probably the primary thing for which Haskell-as-is is directly unsuitable.

Haskell as an EDSL for generating hard real time, however, is very viable: http://corp.galois.com/blog/2010/9/22/copilot-a-dsl-for-moni...

awj 15 years ago | |

Not to argue the example, but Python's garbage collection disqualifies it for real-time systems as well. In fact, I'm having a hard time finding a "system" task for which Python (as a language) is qualified but Haskell is not.

Locke1689 15 years ago | | |

Python is not a systems programming language.

jamwt 15 years ago | |

While I agree with you that Haskell (or, really, any GC'd language) is unsuitable for real-time systems, I disagree that my statement about its excellent suitability for systems programming in general is misleading. There are many, many domains (read: most) that, in my experience, are called "systems programming" that have nothing to do with hard or soft real-time requirements.

Now, if I had stated that all conceivable systems programming domains are addressable with Haskell, that would have indeed been foolish.

Locke1689 15 years ago | | |

Hm, good point -- I agree.

ynniv 15 years ago |

Are the logs being read from disk? In my experience, Python is highly optimized for reading (possibly compressed) files from disk. If your infrastructure keeps logs in memory, Python will lose this advantage and compete on computational performance, where Haskell has the advantage. This is important for those of us who grind logs on disk and might be considering a language switch.

enneff 15 years ago | |

What do you mean by optimized? Python makes the same read and write syscalls everyone else does.

What you're probably observing is Python's slow code generation being masked by the inherent slowness of I/O.

ynniv 15 years ago | | |

> Python makes the same read and write syscalls everyone else does

Except, when python's pants are on, it makes gold records.

I haven't looked to see if there are any explicit optimizations, but your statement is ridiculous; an effective IO strategy can have an enormous effect on performance.

jamwt 15 years ago | |

Nope, this is a process that BLPOPs logs from a Redis queue, does some processing on them, then writes them to disk.
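The shape of that loop (blocking pop, process, write out) can be sketched as follows. This is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for the Redis list, `q.get(timeout=...)` stands in for BLPOP with a timeout, an `io.StringIO` stands in for the file on disk, and the "processing" step is a placeholder. The function name `drain` and the `processed` field are hypothetical and do not come from the article.

```python
import io
import json
import queue


def drain(q, out, timeout=0.1):
    """Pop entries until the queue stays empty for `timeout` seconds.

    Returns the number of entries written to `out`.
    """
    written = 0
    while True:
        try:
            # Blocking pop: waits for an entry, gives up after `timeout`.
            raw = q.get(timeout=timeout)
        except queue.Empty:
            return written
        event = json.loads(raw)        # deserialize the log entry
        event["processed"] = True      # placeholder for the real processing
        out.write(json.dumps(event, sort_keys=True) + "\n")
        written += 1


# Stand-in producer: three JSON-encoded log entries on the queue.
q = queue.Queue()
for i in range(3):
    q.put(json.dumps({"id": i}))

sink = io.StringIO()                   # stand-in for the on-disk log file
count = drain(q, sink)
print(count)
print(sink.getvalue(), end="")
```

Even in this toy version, most of the per-entry work is decoding and re-encoding, which matches the earlier point that the job is largely bound on I/O and serialization.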

kordless 15 years ago |

I'd be interested in hearing more about how the author is using the resulting data set. Doing extractions at event generation time can be very useful if you know what you are after in advance, but not so good for ad hoc analysis.

Any reason why you didn't use Hadoop for this, then run batch jobs to extract summaries?

jamwt 15 years ago | |

Yeah, the whole pipeline actually has quite a few more facets than can be deduced from this summary. This stage just persists the events into a consolidated transaction log. Then, there are secondary processes that scan these transaction logs (in batch) and distribute data into various databases for system, business, and user analytics. I can't go into too much detail there, but the actual digesting and reporting side is more involved.

kordless 15 years ago | | |

I'd like to hear more about the use case if you have time, and can talk about it. I'm kordless at loggly dot com.

aristus 15 years ago |

Awesome work. If you haven't heard about Tim Bray's Wide Finder challenge, it's worth a look; it was really interesting.

http://tartarus.org/james/diary/2008/06/17/widefinder-final-...