Key/value is dead. Long live tuples: Pangool for Hadoop (datasalt.com)
http://en.wikipedia.org/wiki/Tuple_space
http://en.wikipedia.org/wiki/Linda_(coordination_language)
http://www.amazon.com/Mirror-Worlds-Software-Universe-Shoebo...
I've long contended that a tuple space was basically a generalised key-value store, so it's nice to see projects like this one crop up.
[1] http://java.net/projects/jini/
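To make the "generalised key-value store" point concrete, here is a minimal sketch of a toy tuple space. It is an assumption-laden illustration, not code from Linda, Jini, or Pangool: the class name `ToyTupleSpace` is made up, the method names only loosely echo the write/read/take operations of tuple spaces, and unlike a real tuple space it is single-threaded and never blocks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/** Toy, single-threaded tuple space. Hypothetical names; illustration only. */
final class ToyTupleSpace {
    private final List<Object[]> tuples = new ArrayList<>();

    /** Adds a tuple to the space (Linda's "out"). */
    void write(Object... tuple) {
        tuples.add(tuple.clone());
    }

    /**
     * Returns a copy of the first tuple matching the template (Linda's "rd").
     * A null field in the template acts as a wildcard. Non-blocking by assumption.
     */
    Optional<Object[]> read(Object... template) {
        for (Object[] t : tuples) {
            if (matches(template, t)) {
                return Optional.of(t.clone());
            }
        }
        return Optional.empty();
    }

    /** Like read, but also removes the matched tuple (Linda's "in"). */
    Optional<Object[]> take(Object... template) {
        Iterator<Object[]> it = tuples.iterator();
        while (it.hasNext()) {
            Object[] t = it.next();
            if (matches(template, t)) {
                it.remove();
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    /** A template matches when lengths agree and every non-null field is equal. */
    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) {
                return false;
            }
        }
        return true;
    }
}
```

With 2-field tuples, `write("color", "red")` followed by `read("color", null)` behaves like a key-value put and get. The generalisation is that any field can serve as the "key": `read(null, "red")` looks the same tuple up by its second field, and tuples may have more than two fields.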
These goals are greatly divergent. In Pig, Java code is written to create new functions that can be used for analysis--i.e. Java is written in support of Pig Latin. Pangool focuses instead on extending Hadoop by making the Java code easier to write. This means Pig could potentially be implemented in Pangool, if Pangool were to satisfy the requirements for the task. (Not that I am suggesting that Pig actually be rewritten--it might just be possible, depending on the technical requirements.)
Having used Hadoop in the past, I would be more inclined to use Pangool. Parts of Hadoop are poorly written--especially the reliance on singletons--and anything that makes it easier to write code that runs on a Hadoop cluster is a desirable goal in my eyes. I look forward to seeing how this project shapes up.
Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there were no use case for which your API imposes a trade-off vs. the low-level Hadoop API.
I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.
I could be wrong, but this is my bias/experience.
HIVE -> Pig -> Pangool -> Cascading -> MapReduce
Nice addition!
Pangool is based on an extension of the MapReduce model we suggest and call "Tuple MapReduce". This is explained in detail in this post: http://www.datasalt.com/2012/02/tuple-mapreduce-beyond-the-c...
What this means is that in Pangool, if you worked with 2-sized Tuples, you would be able to do exactly what you do now with Java MapReduce. That includes custom RawComparators and arbitrary business logic anywhere in the MapReduce chain (Mapper, Combiner, Reducer). Using n-sized Tuples together with Pangool's group-by, sort-by and reduce-side join API will only mean less code and simpler code, at no loss of performance or flexibility.
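As a rough illustration of what "group by" plus "sort by" over n-sized tuples buys you, here is an in-memory sketch in plain Java. It deliberately does not use Pangool's API (nothing here is taken from it); the class `TupleGroupSortSketch`, the 3-field `Visit` tuple, and both methods are hypothetical. It only mimics the semantics: tuples are sorted by (url, date) but grouped by url alone, so each group reaches the "reducer" already ordered by date.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory illustration of grouping by a prefix of the sort fields. Not Pangool's API. */
final class TupleGroupSortSketch {

    /** A hypothetical 3-field tuple: (url, date, visits). */
    static final class Visit {
        private final String url;
        private final String date;
        private final int visits;

        Visit(String url, String date, int visits) {
            this.url = url;
            this.date = date;
            this.visits = visits;
        }

        String url() { return url; }
        String date() { return date; }
        int visits() { return visits; }
    }

    /**
     * Sorts by (url, date), then groups by url only. Each group's tuples are
     * therefore in date order, which is the effect of a MapReduce secondary sort.
     */
    static Map<String, List<Visit>> groupByUrlSortByDate(List<Visit> input) {
        List<Visit> sorted = new ArrayList<>(input);
        sorted.sort(Comparator.comparing(Visit::url).thenComparing(Visit::date));

        // LinkedHashMap keeps groups in the order their first tuple appears,
        // i.e. in url order after the sort above.
        Map<String, List<Visit>> groups = new LinkedHashMap<>();
        for (Visit v : sorted) {
            groups.computeIfAbsent(v.url(), k -> new ArrayList<>()).add(v);
        }
        return groups;
    }

    /** Example "reducer": running total of visits for one url, in date order. */
    static List<Integer> runningTotals(List<Visit> group) {
        List<Integer> totals = new ArrayList<>();
        int sum = 0;
        for (Visit v : group) {
            sum += v.visits();
            totals.add(sum);
        }
        return totals;
    }
}
```

In plain key/value Hadoop MapReduce, getting this behaviour typically means writing a composite key plus a custom partitioner and grouping comparator by hand. The sketch only shows the intended result, not how any framework achieves it on a cluster.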
Keep in mind that Pangool is still a MapReduce API, so it doesn't add any level of abstraction.
We designed Pangool with the aim of offering it as a replacement for the current MapReduce API. Therefore we are not labelling it as a "higher-level API" but as a comparable low-level API.
On the other hand we are also benchmarking Pangool to show it doesn't impose a performance overhead: http://pangool.net/benchmark.html
Also, since the data model is more complicated and provides more features, it takes more code and a more complex implementation. This could be significant if you were trying to port the model to another language or implementation, or were trying to formally reason about the code or mathematical model, etc.
I'm not saying it's not cool; I actually think it's a good and powerful abstraction -- I just object to the characterization of "all features and no tradeoffs".
I'm not arguing this is a terrible thing. In fact, I think this is an acceptable level of additional complexity for the power it buys you. But if we're going to make an honest evaluation of the trade-offs, I think we must mention this.
It may be relevant to the discussion to point out that I work on a tuple-based streaming system. Product: http://www-01.ibm.com/software/data/infosphere/streams/ Academic: http://dl.acm.org/citation.cfm?id=1890754.1890761, http://dl.acm.org/citation.cfm?id=1645953.1646061
I'm asking because in my experience the extra level of abstraction provided by Cascading, Crunch etc is a huge advantage, and if you're making a conscious choice to operate at a lower level, you better be getting something significant in return; it's not clear to me yet what that is.
But if you are thinking about learning Hadoop using the standard Hadoop API, or if you need for some particular reason to use it for your project, we recommend that you use Pangool instead.
Or if you are considering implementing another abstraction on top of Hadoop, using Pangool for it would probably also be a good idea.
In fact, we believe that the default Hadoop API should look like Pangool.
return Pattern.compile(regex).split(this, limit);
The benchmark seems fair to me.