Key/value is dead. Long live tuples: Pangool for Hadoop

Key/value is dead. Long live tuples: Pangool for Hadoop (datasalt.com)

79 points by ivanprado 14 years ago | 28 comments

jasonkolb 14 years ago |

I just popped in to say that I'm tired of the "X is dead" linkbait headlines. They demonstrate a myopic view of the world. Visual Basic and COBOL are still around.

philwelch 14 years ago | |

Dead is relative. Dead usually means "dead to me".

kellenfujimoto 14 years ago | | |

Or in the case of VB, "better off dead".

dredmorbius 14 years ago | |

So, what you're saying is "'X is dead' is considered harmful"?

protomyth 14 years ago |

A couple of links that came to mind with this:

http://en.wikipedia.org/wiki/Tuple_space

http://en.wikipedia.org/wiki/Linda_(coordination_language)

http://www.amazon.com/Mirror-Worlds-Software-Universe-Shoebo...

alatkins 14 years ago | |

And for a slightly more modern take on the tuple space, check out Java Spaces [1] or Gigaspaces [2]. There's still plenty of active research on the topic too [3] (disclaimer: I did my PhD thesis on distributed tuple spaces).

I've long contended that a tuple space was basically a generalised key-value store, so it's nice to see projects like this one crop up.

[1] http://java.net/projects/jini/

[2] http://www.gigaspaces.com/

[3] http://eprints.utas.edu.au/9996/
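
To make the "generalised key-value store" point concrete, here is a minimal sketch of a tuple space in plain Python. Everything in it (the `TupleSpace` class, the `ANY` wildcard, the method names) is hypothetical and written for illustration; it is not the API of JavaSpaces, GigaSpaces, or Pangool, and unlike Linda's `rd` and `in` these operations return `None` instead of blocking.

```python
# Hypothetical illustration; not the API of JavaSpaces, GigaSpaces, or Pangool.
ANY = object()  # wildcard marker used in templates


class TupleSpace:
    """A toy in-memory tuple space with Linda-style out/rd/in operations."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Write a tuple into the space."""
        self._tuples.append(tuple(fields))

    def _matches(self, template, candidate):
        # A template matches a tuple of the same length whose fields equal
        # the template's fields, except where the template holds ANY.
        return len(template) == len(candidate) and all(
            t is ANY or t == c for t, c in zip(template, candidate)
        )

    def rd(self, *template):
        """Return the first matching tuple without removing it, or None."""
        for candidate in self._tuples:
            if self._matches(template, candidate):
                return candidate
        return None

    def take(self, *template):
        """Return and remove the first matching tuple (Linda's `in`), or None."""
        found = self.rd(*template)
        if found is not None:
            self._tuples.remove(found)
        return found


space = TupleSpace()
space.out("user:42", "alice")              # looks like a key/value pair
space.out("visit", "user:42", "/home", 3)  # a wider tuple

# Key/value lookup is the special case "first field fixed, second field free".
kv_hit = space.rd("user:42", ANY)

# A tuple space can also match on any other field, which a plain
# key/value store cannot do without a secondary index.
by_url = space.rd("visit", ANY, "/home", ANY)
```

In this sketch a key/value store is just the two-field case where the template always fixes the first field; allowing wildcards in any position is what makes the tuple space the more general structure.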

silssilsssil 14 years ago |

I'm wondering what's the need for this when we already have Apache Pig, etc?

fruchtose 14 years ago | |

Apache Pig is a much different beast than this project, from what I can tell reading the documentation for Pangool. While they both operate on tuples and work at a higher level than pure Hadoop, they accomplish their goals much differently. Pig uses its own language called Pig Latin (http://pig.apache.org/docs/r0.9.2/basic.html), which is then compiled down into code that interfaces with the Hadoop library. Pangool is much closer to Hadoop, in that you are writing Java. If you look at one of their examples (http://pangool.net/introduction.html), I get the sense that the developers aim to make Hadoop easier to use, while Pig aims to make data analysis easier to use.

These goals are greatly divergent. In Pig, Java code is written to create new functions that can be used for analysis--i.e. Java is written in support of Pig Latin. Pangool focuses instead on extending Hadoop by making the Java code easier to write. This means Pig could potentially be implemented in Pangool, if Pangool were to satisfy the requirements for the task. (Not that I am suggesting that Pig actually be rewritten in Pangool--it might just be possible, depending on the technical requirements.)

Having used Hadoop in the past, I would be more inclined to use Pangool. Parts of Hadoop are poorly written--especially the reliance on singletons--and anything that makes it easier to write code that runs on a Hadoop cluster is a desirable goal in my eyes. I look forward to seeing how this project shapes up.

ivanprado 14 years ago | |

Hi, I'm one of the developers of Pangool. The idea of Pangool is not to be yet another higher-level API on top of Hadoop but rather to serve as a replacement for the low-level Hadoop Java MapReduce API. Pangool has the same performance and flexibility as the Java MapReduce API, although it makes several things a lot easier and more convenient. There is no tradeoff, just advantages. There will be cases where you'd want to use Pig or Cascading. There will be some other cases where you'd want the flexibility and efficiency of MapReduce. For those cases we conceived Pangool. Nowadays only very advanced Hadoop users can write efficient MapReduce jobs. Pangool hides all the advanced boilerplate code needed for writing highly efficient MapReduce jobs, making things like secondary sorting or reduce-side joins extremely easy.
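
For readers unfamiliar with the two patterns named above, here is a rough sketch of what secondary sorting and a reduce-side join amount to, simulated in plain Python with no Hadoop or Pangool involved. The record layout, field names, and sample data are all made up for illustration, and this is not Pangool's API. The idea is to sort on a composite key but group on only its first field.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (user_id, source_tag, timestamp, payload).
# source_tag 0 = user profile, 1 = page visit, so profiles sort first.
records = [
    (2, 1, 1300, "/cart"),
    (1, 1, 1200, "/home"),
    (1, 0, 0, "alice"),
    (2, 0, 0, "bob"),
    (1, 1, 1100, "/search"),
]

# "Shuffle": sort by the full composite key (user_id, source_tag, timestamp).
shuffled = sorted(records, key=itemgetter(0, 1, 2))

# "Reduce": group by user_id only. Because of the sort order, each group
# starts with the profile record, followed by visits in timestamp order.
joined = []
for user_id, group in groupby(shuffled, key=itemgetter(0)):
    group = list(group)
    name = group[0][3]  # assumes every user_id has exactly one profile record
    for _, _, timestamp, url in group[1:]:
        joined.append((user_id, name, timestamp, url))
```

In raw MapReduce, the same effect typically requires a custom composite key type plus separate partitioner, sort comparator, and grouping comparator classes, which is the kind of boilerplate the comment above is referring to.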

haberman 14 years ago | | |

> There is no tradeoff, just advantages.

Though I don't have deep expertise in Hadoop, I find this claim highly suspect. High-level APIs achieve user-friendliness by making decisions/assumptions about the way a lower-level API will be used. I would be very surprised if there was no use case for which your API does impose a trade-off vs. the low-level Hadoop API.

I feel much more confident using a high-level API if its author is up-front about what assumptions it's making. If the claim is that there is no trade-off vs. the low-level API, I generally conclude that the author doesn't understand the problem space well enough to know what those trade-offs are.

I could be wrong, but this is my bias/experience.

avibryant 14 years ago | | |

Can you give an example of a job that would be difficult or impossible to perform efficiently with Cascading, but Pangool gives an advantage over raw MapReduce?

rjurney 14 years ago |

So it sounds like this slots in like so, in order of abstraction:

HIVE -> Pig -> Pangool -> Cascading -> MapReduce

Nice addition!

ferrerabertran 14 years ago | |

Hi rjurney. I would say "Hive, Pig, Cascading" are on the higher level API side and "Pangool, MapReduce" on the low-level side. Pangool is a MapReduce API that aims to make MapReduce simpler. We explain this better in our FAQ: http://pangool.net/faq.html

rjurney 14 years ago | | |

HIVE -> Pig -> Cascading -> Pangool -> MapReduce ?

lightblade 14 years ago |

Tuples remind me of RDBMS

kbob 14 years ago | |

Exactly. A tuple is exactly the same as a relation.

sixbrx 14 years ago | | |

Set of tuples (of like kind) is a relation.