Pinterest open sources Pinball – a flexible data workflow manager (engineering.pinterest.com)
How do the operations powering an app like Pinterest evolve from so simple to so complex? Do the complexities emerge from necessity, or simply from idle time on the part of the engineers, who naturally crave hard problems to solve?
It's a fascinating meta-commentary on our industry that simple web apps grow to become such complex operations. A business can survive on the kernel of its core competency -- in this case, photo grids -- but to thrive, it requires careful attention to petabytes of peripheral decisions. Indeed, it seems such an evolutionary process is advantageous for a web startup. Friendster and MySpace may well have failed because they mistook their problems for simple ones. They were able to solve the core problem of a social network, but not the many peripheral ones of operating that network at scale. It's the ability to do the latter that sets apart the major successful startups from the also-rans.
Is this a kind of workflow that would run with pinball? Can you move files around with it, or do you use the file system and pass filepaths around? Ideally, the workflow job would hold onto the wav/mp3 and the associated database fields that are returned so I don't have to juggle weird directories around (and have to sync access to them).
I'm not familiar with any other workflow engines, so I'm unsure if this is the kind of thing that would traditionally run on one. I looked at the user guide but it's currently barren.
job1: generate a wav file and put it somewhere, say, s3://wav.file
job2 (runs after job1): pick up the wav file from the location s3://wav.file
You need to define the contract between the parent and child jobs in your business logic. In this example, when you implement job1 and job2, you need a protocol for how they produce, store, and consume wav.file.
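A minimal sketch of that kind of contract, in plain Python rather than any Pinball API. The function names and the `wav.file` name in a shared directory are hypothetical, and a local temporary directory stands in for the S3 location so the example runs anywhere:

```python
import math
import struct
import tempfile
import wave
from pathlib import Path

SAMPLE_RATE = 8000          # samples per second (assumed for illustration)
SHARED_NAME = "wav.file"    # the agreed-upon name both jobs rely on


def job1_generate_wav(shared_dir: Path) -> Path:
    """Parent job: write a one-second 440 Hz tone to the agreed location."""
    out = shared_dir / SHARED_NAME
    frames = b"".join(
        struct.pack("<h", int(16000 * math.sin(2 * math.pi * 440 * i / SAMPLE_RATE)))
        for i in range(SAMPLE_RATE)
    )
    with wave.open(str(out), "wb") as w:
        w.setnchannels(1)      # mono
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)
    return out


def job2_consume_wav(shared_dir: Path) -> dict:
    """Child job: read the file from the same agreed location."""
    path = shared_dir / SHARED_NAME
    with wave.open(str(path), "rb") as w:
        return {"frames": w.getnframes(), "rate": w.getframerate()}


def run_pipeline() -> dict:
    """Run job1 then job2; the temp dir plays the role of the shared store."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        job1_generate_wav(shared)
        return job2_consume_wav(shared)
```

The only thing the two jobs share is the location and file format. The workflow manager just needs to guarantee that job2 starts after job1 has finished.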
I see there are plans to write up some documentation, but is there a timeline for when you're aiming to have it written?
Also, the README calls out mysql as being required. I assume that this, being a django project, will work with other backends too. Is there anything, to your knowledge, that would prevent a different backend being used (like postgres or oracle)?
It also has few dependencies and is lightweight (i.e. it's all python, so no JVM tying up resources).
I think my favorite non-obvious aspect is how it allows you to write each component in a different language.
We did compare Pinball with Apache Oozie and Azkaban when we started this project.
Also, Pinball is all Python too, but it currently has a dependency on mysql, so it is definitely not as lightweight a standalone tool as luigi. It does, however, offer much more in terms of available features.
When we built Pinball, we aimed to build a scalable and flexible workflow manager that satisfies the following requirements (I'll just name a few here).
1. Easy system upgrade: when we fix bugs or add new features, there should be no interruption to currently running workflows and jobs.
2. Easy to add and test workflows: end users can easily add new jobs and workflows to the Pinball system without affecting other running jobs and workflows.
3. Extensibility: a workflow manager should be easy to extend. As the company and business grow, there will be a lot of new requirements and features needed. We would love your contributions as well.
4. Flexible workflow scheduling policy and easy failure handling.
5. A rich UI for easily managing your workflows:
- failed jobs are retried automatically
- you can retry failed jobs, skip jobs, or select a subset of a workflow's jobs to run (all from the UI)
- you can easily access the full run history of your jobs, along with their stderr and stdout logs
- you can explore the topology of your workflow, with easy search support
6. Pinball is very generic and can support different kinds of platforms: you can use different Hadoop clusters, e.g., a Qubole cluster or an EMR cluster. You can write different kinds of jobs, e.g., Hadoop streaming, Cascading, Hive, Pig, Spark, Python ...
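To make the retry and skip behaviour in the list above concrete, here is a rough, generic sketch in plain Python. It is not Pinball's implementation or API: `run_workflow`, its parameters, and the status strings are all made up for illustration, and it only uses the standard-library `graphlib` module (Python 3.9+) to order jobs by their dependencies.

```python
from graphlib import TopologicalSorter


def run_workflow(jobs, deps, max_attempts=3, skip=()):
    """Run jobs in dependency order with automatic retries.

    jobs: mapping of job name -> zero-argument callable (hypothetical shape)
    deps: mapping of job name -> set of parent job names
    skip: job names to skip; a skipped job does not block its children
    Returns a mapping of job name -> final status string.
    """
    status = {}
    for name in TopologicalSorter(deps).static_order():
        if name in skip:
            status[name] = "skipped"
            continue
        # A job cannot run if any parent failed or was itself blocked.
        if any(status.get(p) in ("failed", "blocked") for p in deps.get(name, ())):
            status[name] = "blocked"
            continue
        for _attempt in range(max_attempts):
            try:
                jobs[name]()
                status[name] = "succeeded"
                break
            except Exception:
                # Remember the failure; the loop retries until attempts run out.
                status[name] = "failed"
    return status


# Example: "b" fails twice before succeeding, "c" always fails.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")


def always_fails():
    raise RuntimeError("permanent failure")


jobs = {"a": lambda: None, "b": flaky, "c": always_fails, "d": lambda: None}
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"c"}}
first_run = run_workflow(jobs, deps)
second_run = run_workflow(jobs, deps, skip={"c"})
```

In the first run, "d" ends up blocked because its parent "c" never succeeds. Skipping "c" in the second run lets "d" proceed, which is roughly the kind of manual intervention the UI features describe.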
There are a lot of interesting things built into Pinball, and you probably want to give it a try!
Luigi, though, has a lot of pipeline building blocks: it provides an API to access HDFS and S3, read from and write to them, etc. They are very useful, but they are executed in the same Python process as the rest of the job, which heavily loads the machine where the job is executed (in our case, the same server where the luigid scheduler runs).
I'm excited about the Pinball architecture. I'd like to try using Pinball as a scheduler to execute instances of existing Luigi task classes on multiple servers.