Pinterest open sources Pinball – a flexible data workflow manager

Pinterest open sources Pinball – a flexible data workflow manager (engineering.pinterest.com)

76 points by llaxsll 11 years ago | 18 comments

chatmasta 11 years ago |

There's something amazing about how Pinterest -- a seemingly simple social media app, roughly replicable with a few hours of CRUD framework programming -- takes on a life of its own when infused with venture capital funding and a strong engineering team. The core of the web product is so simple as to be almost trivial: show a grid of images and links to each user. Indeed, when Pinterest first started I'm sure the logic entailed little more than that. Now, billions of page views and dozens of engineering hires later, a once-simple app becomes a behemoth force, crunching data on the order of petabytes per day.

How do the operations powering an app like Pinterest evolve from simple to so complex? Do the complexities emerge from necessity, or simply from idle time on the part of the engineers, who naturally crave hard problems to solve?

It's a fascinating meta-commentary on our industry that simple web apps grow to become such complex operations. A business can survive on the kernel of its core competency -- in this case, photo grids -- but to thrive, it requires careful attention to petabytes of peripheral decisions. Indeed, it seems such an evolutionary process is advantageous for a web startup. Friendster and MySpace may well have failed because they mistook their problems for simple ones. They were able to solve the core problem of a social network, but not the many peripheral ones of operating that network at scale. It's the ability to do the latter that sets apart the major successful startups from the also-rans.

jsmeaton 11 years ago |

I have a workflow that I'd really like to automate/rewrite. A wav file is generated on a remote server. That server will rsync/scp it to a processing node. The processing node will query a database, and write out a text file with parts of that file to remove. It'll then convert it to mp3 (using sox and lame) with those parts removed. Another job will then pick up the mp3 file, query another database, and if it gets a hit it will sync that file to s3.

Is this the kind of workflow that would run on Pinball? Can you move files around with it, or do you use the file system and pass file paths around? Ideally, the workflow job would hold onto the wav/mp3 and the associated database fields that are returned so I don't have to juggle weird directories around (and have to sync access to them).

I'm not familiar with any other workflow engines, so I'm unsure if this is the kind of thing that would traditionally run on one. I looked at the user guide but it's currently barren.

maoyesf 11 years ago | |

Pinball is good for this use case. You can build a workflow that includes a few jobs:

job1: generate a wav file and put it somewhere, say s3://wav.file

job2 (runs after job1): pick up the wav file from the location s3://wav.file

You need to define the contract between the parent and child jobs from the business logic. In this example, when you implement job1 and job2, you need a protocol for how they produce, store, and consume the wav file.
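As a rough sketch of what such a contract could look like (plain Python, not Pinball's API; the function names, the `run_id` parameter, and the file name are all made up for illustration, and a local temp directory stands in for a shared store like s3://wav.file), both jobs derive the location from one shared helper instead of passing it between them:

```python
import tempfile
from pathlib import Path

# Hypothetical contract: both jobs agree on this file name under a shared root.
# A local directory stands in for a shared store such as s3://wav.file.
WAV_NAME = "recording.wav"


def wav_location(shared_root: Path, run_id: str) -> Path:
    """Single source of truth for where the wav file of a given run lives."""
    return shared_root / run_id / WAV_NAME


def job1_generate(shared_root: Path, run_id: str) -> None:
    """Parent job: produce the file at the agreed location."""
    target = wav_location(shared_root, run_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(b"RIFF-placeholder-audio")  # placeholder bytes, not real audio


def job2_consume(shared_root: Path, run_id: str) -> int:
    """Child job: read the file from the same agreed location."""
    source = wav_location(shared_root, run_id)
    return len(source.read_bytes())


def run_workflow(run_id: str) -> int:
    """Run job1, then job2, against a temporary shared directory."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        job1_generate(root, run_id)
        return job2_consume(root, run_id)
```

The point is only that neither job hands a path to the other; each computes it from the same convention, so the ordering of the jobs is the only thing the workflow manager has to enforce.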

jsmeaton 11 years ago | | |

Thanks for the reply. I'm wondering how you would share the location of the file between jobs though. Can job 1 output a file location that job 2 accepts as an input?

I see there are plans to write up some documentation, but is there a timeline for when you're aiming to have it written?

Also, the README calls out MySQL as being required. I assume that this, being a Django project, will work with other backends too. Is there anything, to your knowledge, that would prevent a different backend being used (like Postgres or Oracle)?

solve 11 years ago |

I've badly been wanting one of these since I used a great one at my last job in 2010. Are there more of these now that I don't know about?

unode 11 years ago | |

Yep, check out Spotify's Luigi project, probably the most widely adopted OSS one: https://github.com/spotify/luigi

andy_wrote 11 years ago | | |

Are there people who have more experience with comparative workflow managers who can quickly see the pros and cons of Pinball vs. Luigi? Perhaps someone at Pinterest who tried out other systems, as was mentioned in the post? (Though maybe Luigi wasn't available to the public when this comparison happened.)

estefan 11 years ago | | |

I've ported several reasonably complex jobs (files delivered to FTP at arbitrary times to be run through several Hadoop jobs) to Luigi and it's been very good. Much more resilient than trying to use something that can only schedule jobs at specific times of the day.

It also has few dependencies and is lightweight (i.e. it's all Python, so no JVM tying up resources).

Blackthorn 11 years ago | |

Likewise (though I'm still at that company). It's truly amazing how many tasks can be broken down into this paradigm.

I think my favorite non-obvious aspect is how it allows you to write each component in a different language.

asmosoinio 11 years ago |

Does anyone know how this compares to full business process modeling platforms, such as Activiti, jBPM, or Bonita?

ecesena 11 years ago |

Does anyone know how this compares to Celery?

maoyesf 11 years ago | |

Celery (http://www.celeryproject.org/) is a distributed task queue. Pinball has the concept of a workflow, and a workflow contains many jobs. Pinball helps translate a lot of application logic, such as workflows, schedules, and jobs, into its system, and provides a lot of functions for end users to manage their workflow jobs.

We did compare Pinball with Apache Oozie and Azkaban when we started this project.
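To make the "workflow of many jobs" idea concrete, here is a minimal sketch in plain Python (not Pinball's or Celery's API). It assumes a workflow can be described as a mapping from each job to the jobs it depends on; the job names are hypothetical, loosely borrowed from the wav-to-mp3 pipeline described elsewhere in this thread. The standard library's `graphlib` then yields an order in which every job runs after its parents:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each key maps to the set of jobs it depends on.
workflow = {
    "fetch_wav": set(),
    "find_cuts": {"fetch_wav"},
    "encode_mp3": {"fetch_wav", "find_cuts"},
    "upload_s3": {"encode_mp3"},
}


def execution_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for job in execution_order(workflow):
        print("run", job)
```

A bare task queue would accept these four jobs independently; the dependency mapping is the extra piece that a workflow manager tracks on top.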

ecesena 11 years ago | | |

Thanks for the details! I will look into these resources.

mohap 11 years ago |

Are there any open source toolkits that handle user workflows well?

jellyroll 11 years ago |

Wow this is awesome