Show HN: Interactive map for architecting big data pipelines (xyz.insightdataengineering.com)
I wish it had some information about supported languages. Most of the processing systems are JVM-based and require that you write your program in a JVM language. Some have Python support. But I have yet to encounter one that allows you to write your pipelines in Go, Rust or JavaScript, for example. One notable exception is Storm, which supports pluggable runners, including one that talks to an external program over standard I/O. My impression is that, aside from Python, today's pipelines require a large amount of JVM buy-in, something I'm personally not interested in.
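For anyone wondering what "talks to an external program over standard I/O" can look like, here is a rough sketch, assuming the framing I remember from Storm's multi-lang protocol: each message is a JSON document followed by a line containing only `end`. The function names (`read_message`, `write_message`, `run_bolt`) and the upper-casing logic are made up for illustration, and the handshake, heartbeats and error handling are left out, so treat it as a simplified model rather than a working Storm component.

```python
import io
import json


def read_message(stream):
    """Read lines until a line containing only "end", then parse them as JSON."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            return json.loads("\n".join(lines))
        lines.append(line)
    return None  # stream closed before a complete message arrived


def write_message(msg, stream):
    """Write one JSON message followed by the "end" delimiter line."""
    stream.write(json.dumps(msg) + "\nend\n")
    stream.flush()


def run_bolt(in_stream, out_stream):
    """Hypothetical bolt: upper-case the first field of each tuple, then ack it."""
    while True:
        msg = read_message(in_stream)
        if msg is None:
            break
        if "tuple" not in msg:
            continue  # setup or other non-tuple message; ignored in this sketch
        word = msg["tuple"][0]
        # Message shapes below are assumptions based on the multi-lang docs.
        write_message(
            {"command": "emit", "anchors": [msg["id"]], "tuple": [word.upper()]},
            out_stream,
        )
        write_message({"command": "ack", "id": msg["id"]}, out_stream)
```

Run as a child process, this would be called as `run_bolt(sys.stdin, sys.stdout)`. Since the contract is only newline-delimited JSON on two pipes, the same loop could in principle be written in Go, Rust or JavaScript.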
I'd also love some kind of metric for "aliveness". For example, my impression is that Storm was hot for about a week, and then Spark and Flink happened, and now nobody is talking about it, and Twitter itself has apparently replaced it with Heron.
Also note that unlike Spark, Storm is a pure open source project that does not have a major commercial entity marketing its use cases. Hortonworks has put a little marketing effort behind it, but otherwise, it's just a mature & active Apache infrastructure project. Storm 2.0 is coming out soon and features a slew of performance- and reliability-improving enhancements.
But as for marketing buzz, Google has commercial reasons for you to use Beam and Dataflow, for example. And likewise Databricks for Spark.
It's probably a good idea to pick production large-scale data infrastructure on a metric other than recency of marketing buzz.
-$0.02 from one of the original authors of streamparse, the Python API for Storm
This format lends itself to data processing, but I think it would be really nice to apply it to a variety of workflows. For example, you could model the software deployment process across different languages and frameworks. It could be a good complement to StackShare.
A bit of constructive feedback: I'm not a stickler for UX or design, but maybe spruce up the gray boxes a bit. I've never been a designer though, so take that for what you will.
Keep it simple and hierarchical. I suggest additional filters for each component of the data engineering flow that can discern unique features or commonalities.
Stream processing: Azure Stream Analytics (https://azure.microsoft.com/en-us/services/stream-analytics/)
SQL Server is mentioned, but Azure Cosmos DB should also be mentioned (https://azure.microsoft.com/en-us/services/cosmos-db/)
To that point, just added CosmosDB, and plan to add others soon.
Analysis Services
Event Hub
IoT Hub
And hosting for a lot of the open source items in the original post.
It seems more a survey of tools the author knows or likes.
We might add them explicitly to Streaming as well though.
Kind of useless for us on Azure.
Details are in my profile.