GitLab is working on a tool just for data teams (about.gitlab.com)

233 points by TheMissingPiece 7 years ago | 94 comments

slap_shot 7 years ago |

This looks like an amalgamation of 8+ open source projects or industries with products put forth by companies that have dozens of employees and worked on their products for years.

It also doesn't even categorize the products they compete with correctly[0].

Why not contribute some of your resources to one of the many active open source libraries already trying to solve some of these problems, and focus your engineering efforts on your core product?

[0] Fivetran is only considered "Orchestrate" but actually competes directly with Alooma in Extract and Load. Also, there are DOZENS of companies in that space. https://gitlab.com/meltano/meltano/blob/master/README.md#dat...

sytse 7 years ago | |

What we're doing differently is making one product that covers the whole lifecycle instead of having to string tools together. It took us many months to string our toolset together and we felt there had to be a better way. Just like GitLab, we try to leverage existing open source projects wherever possible.

I agree Fivetran also belongs in extract and load and updated it https://gitlab.com/meltano/meltano/commit/1df9813f5ab42c4479... Do you think it should be removed from Orchestrate? Any other suggestions for proprietary products in that category?

slap_shot 7 years ago | | |

As someone who works very, very closely in this industry, I would just be very careful how much of this you think you want to bite off.

Consider how you trust using dbt more than rolling your own transformation tool. Why wouldn't this apply to the rest of your stack? The 10+ companies that offer data extraction and loading are likely a better choice. Again with Analytics - the dozens of companies that offer BI tools are probably going to be the better choice.

Maybe you can build all these tools better than the hundreds of companies with thousands of employees and millions of dollars. It just seems so unlikely that you would build the best of each.

I would have been more impressed if your team had designed some API that other tools/platforms could plug in to coordinate a lot of the above jobs with your CI system. There is a SERIOUS need for that and I've had a lot of conversations with companies about what that would look like.

To answer your question: no, Fivetran does not currently belong in the orchestration area, IMO. I've heard they are soon to release some sort of orchestration tooling to compete with dbt, but it isn't the type of orchestration you get with Airflow.

numbsafari 7 years ago | | |

Yo. Just keep doing what you're doing. I dig it.

I'm not 100% with all the tools you are using, but stringing together random SaaS tools and having to survey a random number of open source tools in order to assemble a sensible platform makes way less sense.

At the very least, what we end up with is a group of folks working together in the open to surface some of the limitations and challenges and attempt to work out some of the alternative solutions to the problems that arise in this space.

So, I applaud your effort. Ignore the salesmen and the haters.

jakecodes 7 years ago | |

Our goal is to meet our data team's need by answering our company's data questions.

A lot of the solutions out there are fantastic but aren't up to the tasks we are looking for. Why shouldn't the whole life cycle be in one tool, be open source, and be version controllable? That's what we are looking for in a tool.

dantiberian 7 years ago | | |

There's no inherent reason that the whole life cycle can't be handled in a single tool. However, there have been tens of thousands of person-years spent on these tools, so people here are pointing out that it is a tall ask for any company to create one tool that integrates everything. This goes doubly so if it is only going to be a side project to GitLab itself.

fipple 7 years ago | |

You could say the same about Github/Gitlab themselves... that they mash together git and JIRA and .plan etc.

Ajedi32 7 years ago | | |

_Especially_ GitLab. Basically their entire product seems to be about building a whole bunch of separate tools and integrating them seamlessly into each other. GitLab has a built-in CI system, a deployment pipeline with Kubernetes integration, a built-in Docker container registry, performance monitoring tools for deployed applications, automated static analysis tools, etc. Describing it as "an amalgamation of 8+ open source projects or industries" seems pretty accurate.

That's by no means a bad thing though. While yes, there are downsides to tightly coupled tools, there are also advantages. If GitLab is trying to do the same thing for data analytics that they've already done for source control, they may very well succeed.

cheghook 7 years ago |

I can't understand why GitLab thinks they have to embark on a new project every so often instead of focusing on their current product and features. There is just a lot to work on; so many of the current features/products are half-assed. At my place we moved to GitLab 2.5 years ago, and updates were smoother back then. In the past few months, though, we had to hire a new sysadmin for our build machines and GitLab server to follow new issues created on GitLab.com and decide whether a release is safe, and even then he still reports 4-5 issues to GitLab support after every update. We were expecting it to be an easy `yum update` like a normal package, but it's just getting worse update after update. It's so bad that my manager asked me to look into GitHub + another CI/CD solution.

georgewfraser 7 years ago |

Data pipelines are not a great subject for an open-source project. We've been building these for the last 3+ years at Fivetran, and I can tell you that the challenge is:

  - Studying each source to figure out the right data model
  - Chasing down a million weird corner cases
  - Working around dumb bugs in the data sources

This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.

tbrock 7 years ago |

I wish they would focus on making a fast, stable, GitHub alternative.

parasubvert 7 years ago | |

This is Gitlab taking stuff they were doing already internally and making it available to a broader audience.

Once you take VC funding, you gotta go where the money is. Everyone wants/expects "fast, stable, like Github" for free unless you have special needs. So, you do analytics on what people are doing with your free site, you offer enterprisey features, you get into the "platform" business etc.

I think Gitlab distracts itself, spreads itself thin, and isn't great at partnering. Its ambition to do it all knows no bounds, which is both commendable and a smh moment. It's not likely sustainable or scalable. They're definitely trying to "go big or go home" as a company, which is not how most originally felt about Gitlab (a fast, stable OSS alternative to Github).

At the same time, I can't blame them. I think it comes down to: Don't hate the player, hate the game.

sytse 7 years ago | | |

We are building a fast, stable, GitHub alternative.

We have hired 3 times as many people in our security team for GitLab.com (not our product team for security) as are working on Meltano.

We have hired 3 times as many people in our SRE teams as are working on Meltano.

And we still have a lot of vacancies for both https://about.gitlab.com/jobs/

n42 7 years ago |

Is there any example of an open source software company that has taken on so many products at once, so early in its life, and succeeded?

sytse 7 years ago | |

We did https://about.gitlab.com/2017/10/11/from-dev-to-devops/ when we were at 50% of our current number of engineers. So far so good.

n42 7 years ago | | |

Not trying to be negative. I genuinely would like for GitLab to succeed. My experience (in a totally different industry and scenario, but with product building all the same) was that our decision to pare down to our core competency and focus was the best decision we ever made. We were attempting a full productivity suite, similar in concept but again a different industry. I’m interested in finding an example of a similarly modeled company to compare.

gregoriol 7 years ago | | |

Really no, look at all the comments here (and this is only from techies): you have lost us, we don't know anymore what you are doing, or even trying to do.

veritas3241 7 years ago |

Taylor from GitLab here! Happy to answer any questions about what we're doing.

thebiglebrewski 7 years ago | |

Kudos to you for trying something new!

veritas3241 7 years ago | | |

Thanks so much!

_pmf_ 7 years ago |

GitLab's usage of team members in marketing material is creeping me out (as does the whole team page[0]).

[0] https://about.gitlab.com/team/

sytse 7 years ago | |

We say team members instead of employees because some are contractors. Why does it freak you out?

BTW We don't call it a family https://about.gitlab.com/handbook/leadership/#management-tea...

_pmf_ 7 years ago | | |

I wouldn't want to have that level of public affiliation with my employer (no matter who that employer might be).

ageofwant 7 years ago |

https://quiltdata.com/ ticks a lot of boxes in this space for me.

veritas3241 7 years ago | |

This could potentially become part of the Meltano stack. At GitLab, we're not at the phase yet where we're in need of data versioning. But I could imagine a data registry that's integrated with the workflow of data analysts/scientists to easily link versions of code and data.

Thanks for the link - we'll definitely keep an eye on it.
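As a rough illustration of what "linking versions of code and data" could mean, here is a minimal sketch. It is not Meltano, Quilt, or GitLab code: the `DataRegistry` class, the commit IDs, and the dataset name are all hypothetical, and the registry lives only in memory. The idea is simply to fingerprint the data by content and store that fingerprint next to the code version that produced or consumed it.

```python
import hashlib


def data_fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest identifying this exact data content."""
    return hashlib.sha256(payload).hexdigest()


class DataRegistry:
    """Hypothetical in-memory registry linking code versions to data versions."""

    def __init__(self):
        self._entries = []

    def record(self, code_version: str, dataset: str, payload: bytes) -> dict:
        """Store one (code version, dataset, data version) link and return it."""
        entry = {
            "code_version": code_version,
            "dataset": dataset,
            "data_version": data_fingerprint(payload),
        }
        self._entries.append(entry)
        return entry

    def lookup(self, code_version: str) -> list:
        """Return every data version recorded against a given code version."""
        return [e for e in self._entries if e["code_version"] == code_version]


# Made-up commit IDs and CSV contents, purely for illustration.
registry = DataRegistry()
v1 = registry.record("abc123", "signups.csv", b"id,plan\n1,free\n")
v2 = registry.record("def456", "signups.csv", b"id,plan\n1,free\n2,paid\n")
```

Because the data version is derived from the content, two analyses run against the same commit but different data would show up as different entries, which is roughly the traceability the comment describes.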

sytse 7 years ago | |

Is that project more for versioning data like https://docs.dotmesh.com/tutorials/subdots/ or http://www.pachyderm.io/ ?

danpalmer 7 years ago |

Reading this I was concerned that it would be written in Ruby. While Ruby is a reasonable language for server development, it has almost no data science community when compared with some other ecosystems.

I was very glad to see this is Python! Python has some of the best data tools out there, and a mature ecosystem for solving all the engineering problems that go along with a great data stack.

ksec 7 years ago | |

I am on the opposite side. Given Gitlab is a Ruby house, I was secretly hoping for some innovation coming from Ruby data science.

tamersalama 7 years ago |

Is there some resemblance with Floydhub http://floydhub.com/ ?

veritas3241 7 years ago | |

Personally, I quite like the approach FloydHub has for deep learning projects. At GitLab, we currently don't have any deep learning projects happening - we're still further down the AI hierarchy of needs - i.e. focusing on solid data infrastructure and descriptive analytics.

I fully expect we'll have a use case for the "cool" machine learning stuff, but there's a lot of groundwork to cover with the basics first. Meltano is focusing on those basics for right now.

NegatioN 7 years ago | |

Does anyone have a comprehensive list of offerings similar to FloydHub, or of OSS alternatives?

I think this market is not being served properly, most of them seem to still require most of the heavy lifting to be done by the ML practitioner.

I suppose I would even be okay with a service that just saves all my graphs from tensorboard for later reviewing.

houqp 7 years ago | | |

I am interested in knowing more about how you think FloydHub can better serve the market. FloydHub does have metrics support for later reviewing: https://docs.floydhub.com/guides/jobs/metrics. Are you only interested in using tensorboard for graph viewing?

Luuseens 7 years ago |

The page mentions MVC, and the issue page[0] keeps mentioning MVC as well. Was this supposed to be MVP, or something else? Model-view-controller doesn't make sense in the context.

[0] https://gitlab.com/meltano/meltano/issues/10

jakecodes 7 years ago | |

We use the term MVC here to mean "minimal valuable change", in recognition that it may not be a product yet.

ajbosco 7 years ago |

Do you see this as a (future) competitor of Airflow/Luigi type workflow tools?

sytse 7 years ago | |

Yes, the orchestrate part (working on GitLab CI) is an alternative for Airflow. Also see https://gitlab.com/meltano/meltano/blob/master/README.md#dat...
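For readers unfamiliar with what an orchestrator does at its core, the sketch below shows the basic idea of running pipeline stages in dependency order. It uses only Python's standard-library `graphlib` (Python 3.9+); it is not the GitLab CI or Airflow API, and the stage names and the `run_pipeline` helper are assumptions made up for this illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical stages; each one maps to the set of stages it depends on.
PIPELINE = {
    "extract": set(),
    "load": {"extract"},
    "transform": {"load"},
    "analyze": {"transform"},
}


def run_pipeline(stages, tasks):
    """Run each stage after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(stages).static_order())
    for name in order:
        tasks[name]()
    return order


# Stand-in tasks that only record their own name when called.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in PIPELINE}
order = run_pipeline(PIPELINE, tasks)
```

Real orchestrators add scheduling, retries, and parallel execution on top of this ordering step, which is where most of the engineering effort goes.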

hn_throwaway_99 7 years ago |

Be interested to know all the competitors in this space. https://data.world/ is one I am most familiar with.

slap_shot 7 years ago | |

This project competes with too many industries to really give a succinct answer, but here's just Extraction/Loading and Analyze:

Extraction/Loading: Dell Boomi, SAP, SAS, Pentaho, Domo, Oracle, IBM, Microsoft, Informatica, Talend, JitterBit, SnapLogic, Mulesoft, SyncSort, Information Builders, Actian, Attunity, Datameer, Alteryx, Striim, Treasure Data, Cask, StreamSets, Snowplow, DataTorrent, Astronomer, Panoply, Apache Nifi, Stitch Data, FlyData, Bedrock Data, Alooma, ETLeap, Fivetran, Xplenty, MethodMill, Celigo, TerraSky, DBSync, Youredi, Scribe, Civis Analytics, DataScience, Dataloader.io, datorama, Astera

Analyze: MicroStrategy, GoodData, Sisense, Looker, Power BI, Wagon, Birst, Tableau, Qlik, Domo, Hue, Mode, Chartio, Periscope, Pentaho

The amount of hype and BS in the Notebook space would require me to spend some time combing through that again.

chasewright 7 years ago | | |

slap_shot, I agree, and as a disclaimer I also work at GitLab. There is no shortage of data tools in the space today. A majority of my career has been spent in the data & analytics space and I've talked / worked with at least 60% of the companies you mentioned. At the end of the day, these are the questions I've asked over and over again.

  1. Do we have enough money / budget for a tool like this?
  2. Can we derive enough insights from this product fast enough to make a good ROI?
  3. Does this tool use a proprietary language that no one wants to learn or can I code in a language that is relevant?
  4. In all honesty, can I get insights faster in a spreadsheet than these tools?
  5. What is the learning curve?
  6. Can I answer the business question that was originally asked?

Open to more discussions around the topic as it is a lot harder to answer than a few philosophical questions, but it certainly resonates with many data & analytics professionals. A nice goal would be to have a project where you can stand up a business, turn your data pipelines on, ingest the data, and view the insights needed to make a business decision all within a short timeframe of when a business goes live.

sytse 7 years ago | |

Some of the alternatives are listed in this table in the readme: https://gitlab.com/meltano/meltano/blob/master/README.md#dat...

jakecodes 7 years ago | |

One major difference will be the complete data life cycle vs providing just one part of it. Just like we do in GitLab except for data teams instead of software development teams.

gandutraveler 7 years ago |

Looks like gitlab just wants to be in the news since Microsoft's acquisition of GitHub.

sbr464 7 years ago |

Are you releasing/sharing any of the extractors you built for various services?

jakecodes 7 years ago | |

All of our extractors are available in our source code, which is open source. http://gitlab.com/meltano/meltano/. Right now we are working towards an MVP, so things might be in flux, but we value any feedback you have.

sbr464 7 years ago | | |

Thanks. I had looked but only saw one for Fastly. Am I missing others somewhere?