GitLab is working on a tool just for data teams (about.gitlab.com)
It also doesn't even categorize the products they compete with correctly[0].
Why not contribute some of your resources to one of the many active open source libraries already trying to solve some of these problems, and focus your engineering efforts on your core product?
[0] Fivetran is only listed under "Orchestrate" but actually competes directly with Alooma in Extract and Load. Also, there are DOZENS of companies in that space. https://gitlab.com/meltano/meltano/blob/master/README.md#dat...
I agree Fivetran also belongs in extract and load and updated it https://gitlab.com/meltano/meltano/commit/1df9813f5ab42c4479... Do you think it should be removed from Orchestrate? Any other suggestions for proprietary products in that category?
Consider how you trust using dbt more than rolling your own transformation tool. Why wouldn't this apply to the rest of your stack? The 10+ companies that offer data extraction and loading are likely a better choice. Again with Analytics - the dozens of companies that offer BI tools are probably going to be the better choice.
Maybe you can build all these tools better than the hundreds of companies with thousands of employees and millions of dollars. It just seems so unlikely that you will build the best of each.
I would have been more impressed if your team had designed some API that other tools/platforms could plug in to coordinate a lot of the above jobs with your CI system. There is a SERIOUS need for that and I've had a lot of conversations with companies about what that would look like.
To answer your question: no, Fivetran does not currently belong in the orchestration area, IMO. I've heard they are soon to release some sort of orchestration tooling to compete with dbt, but it isn't the type of orchestration you get with Airflow.
I'm not 100% familiar with all the tools you are using, but stringing together random SaaS tools and having to survey a random number of open source tools in order to assemble a sensible platform makes way less sense.
At the very least, what we end up with is a group of folks working together in the open to surface some of the limitations and challenges and attempt to work out some of the alternative solutions to the problems that arise in this space.
So, I applaud your effort. Ignore the salesmen and the haters.
A lot of the solutions out there are fantastic but aren't up to the tasks we are looking for. Why shouldn't the whole life cycle be in one tool, be open source, and be version controllable? That's what we are looking for in a tool.
That's by no means a bad thing though. While yes, there are downsides to tightly coupled tools, there are also advantages. If GitLab is trying to do the same thing for data analytics that they've already done for source control, they may very well succeed.
- Studying each source to figure out the right data model
- Chasing down a million weird corner cases
- Working around dumb bugs in the data sources
This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.
Once you take VC funding, you gotta go where the money is. Everyone wants/expects "fast, stable, like Github" for free unless you have special needs. So, you do analytics on what people are doing with your free site, you offer enterprisey features, you get into the "platform" business etc.
I think Gitlab distracts itself, spreads itself thin, and isn't great at partnering. Its ambition to do it all knows no bounds, which is both commendable and a smh moment. It's not likely sustainable or scalable. They're definitely trying to "go big or go home" as a company, which is not how most originally felt about Gitlab (a fast, stable OSS alternative to Github).
At the same time, I can't blame them. I think it comes down to: Don't hate the player, hate the game.
We have hired 3 times as many people in our security team for GitLab.com (not our product team for security) as are working on Meltano.
We have hired 3 times as many people in our SRE teams as are working on Meltano.
And we still have a lot of vacancies for both https://about.gitlab.com/jobs/
BTW We don't call it a family https://about.gitlab.com/handbook/leadership/#management-tea...
Thanks for the link - we'll definitely keep an eye on it.
I was very glad to see this is Python! Python has some of the best data tools out there, and a mature ecosystem for solving all the engineering problems that go along with a great data stack.
I fully expect we'll have a use case for the "cool" machine learning stuff, but there's a lot of groundwork to cover with the basics first. Meltano is focusing on those basics for right now.
I think this market is not being served properly; most of the tools still seem to require most of the heavy lifting to be done by the ML practitioner.
I suppose I would even be okay with a service that just saves all my graphs from tensorboard for later reviewing.
Extraction/Loading Dell Boomi SAP SAS Pentaho Domo Oracle IBM Microsoft Informatica Talend JitterBit SnapLogic Mulesoft SyncSort Information Builders Actian Attunity Datameer Alteryx Striim Treasure Data Cask StreamSets Snowplow DataTorrent Astronomer Panoply Apache Nifi Stitch Data FlyData Bedrock Data Alooma ETLeap Fivetran Xplenty MethodMill Celigo TerraSky DBSync Youredi Scribe Civis Analytics DataScience Dataloader.io datorama Astera
Analyze MicroStrategy GoodData Sisense Looker Power BI Wagon Birst Tableau Qlik Domo Hue Mode Chartio Periscope Pentaho
The amount of hype and BS in the Notebook space would require me to spend some time combing through that again.
1. Do we have enough money / budget for a tool like this? 2. Can we derive enough insights from this product fast enough to make a good ROI? 3. Does this tool use a proprietary language that no one wants to learn or can I code in a language that is relevant? 4. In all honesty, can I get insights faster in a spreadsheet than these tools? 5. What is the learning curve? 6. Can I answer the business question that was originally asked?
Open to more discussions around the topic as it is a lot harder to answer than a few philosophical questions, but it certainly resonates with many data & analytics professionals. A nice goal would be to have a project where you can stand up a business, turn your data pipelines on, ingest the data, and view the insights needed to make a business decision all within a short timeframe of when a business goes live.
If the CEO is following this, please improve basic user stories like:
* As a user, I want to easily know who has approved my merge request. Note the word "easily". The UI lists the people who did not approve next to the label "Approved" and the people who did approve next to the label "Approved by". Makes absolutely no sense.
* As a user, I want to see all the merge requests that I need to review because I am listed as an approver (it boggles my mind that this doesn't exist)
* As a user, I want to be notified by todos that only have any pending actions on them
* As a user, I want to disapprove a merge request
There are so many basic areas of the core product that are almost unusable. All of our engineers who have to regularly switch between github and gitlab prefer the github ui.
And while some integration is good... A lot of recent stuff is just "we try to grab the easy money"
Can you explain further what you mean by "pending actions on them"? We are working to simplify and streamline our notifications and todos in GitLab. In particular, the current thinking is that they are very similar. A "notification" is an email, and a "todo" is something that GitLab calls your attention to in the Web UI to take action on. So mechanically, they are very similar and we would like to harmonize them.
Our latest discussion is in https://gitlab.com/gitlab-org/gitlab-ce/issues/48787.
We've improved a number of confusing approval widget states in GitLab 11.2 (https://gitlab.com/gitlab-org/gitlab-ee/issues/5439), which will ship later this month, and the ability to filter merge requests by approver is in development by a community member (https://gitlab.com/gitlab-org/gitlab-ee/issues/1951).
This is just the beginning though – code reviews and approvals are at the heart of the daily workflows of writing software and we'll be continuing to make them even better. I'm particularly excited about more structured code reviews with batch comments in 11.3 (https://gitlab.com/gitlab-org/gitlab-ee/issues/1984), better navigation between files in merge request diffs with a file tree in 11.4 (https://gitlab.com/gitlab-org/gitlab-ce/issues/14249), and our first iteration of code owners (https://gitlab.com/gitlab-org/gitlab-ee/issues/5382) also in 11.4.
Thanks for the disapprove merge request idea. We're considering this idea in https://gitlab.com/gitlab-org/gitlab-ee/issues/761 where further feedback would be much appreciated, or on any other issue.
This doesn't mean anything, maybe the customers are simply tired of reporting issues. For example last year we didn't do any updates for 6 months because we were afraid it'd break something and we were too busy to be willing to spend the time reporting problems.
We also don't report issues that are already open on gitlab.com, reporting the issue means your customer is willing to spend time reporting, following up and testing your bug. This is your job, not the customer's. At the moment we are only reporting issues that are either blocking us from work or slowing down our development. The majority of issues we are facing are performance problems.
I just wrote a script to plot the number of issues on gitlab-ce over time and percentage of open/close issues, and the overall period they have been open for, you are accumulating issues with: `backend`, `UX`, `technical debt`, `performance`, `CI/CD`, ... labels, a lot of them don't have a Milestone and have been open for a long time.
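For anyone who wants to reproduce that kind of summary, here is a rough sketch of the counting part only (no plotting, no API calls). The `ISSUES` records, their field names, and the `open_issue_summary` helper are made up for illustration; they are not the GitLab API's response format, and real data would have to be fetched and mapped into this shape first.

```python
from collections import Counter
from datetime import date

# Hypothetical issue records; field names are illustrative, not GitLab's API schema.
ISSUES = [
    {"labels": ["performance", "backend"], "opened": date(2017, 1, 10), "closed": None, "milestone": None},
    {"labels": ["UX"], "opened": date(2018, 3, 1), "closed": date(2018, 4, 1), "milestone": "11.0"},
    {"labels": ["performance"], "opened": date(2018, 6, 1), "closed": None, "milestone": "11.3"},
    {"labels": ["technical debt"], "opened": date(2016, 8, 1), "closed": None, "milestone": None},
]


def open_issue_summary(issues, today):
    """Summarize open issues: share still open, count per label, missing milestones, oldest age."""
    open_issues = [issue for issue in issues if issue["closed"] is None]
    ages = [(today - issue["opened"]).days for issue in open_issues]
    return {
        "open_ratio": len(open_issues) / len(issues) if issues else 0.0,
        "open_by_label": Counter(label for issue in open_issues for label in issue["labels"]),
        "open_without_milestone": sum(1 for issue in open_issues if issue["milestone"] is None),
        "oldest_open_days": max(ages) if ages else 0,
    }


summary = open_issue_summary(ISSUES, today=date(2018, 8, 1))
```

Running the same summary on snapshots taken at different dates would give the "over time" view.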
I am not sure how emailing you would help us, it's not like the problems are not reported or you don't already know about them. It just appears that the priority of GitLab, as a company, is not shipping a quality product anymore.
EDIT: I work in the aerospace industry and one of the stages of our pipelines is to run stress test on our product. I would suggest you to run a stress test on an instance of GitLab, this would be an amazing place to start looking for performance problems.
Couldn't that just be because you have more silent customers now? Probably from the people who moved their projects from GitHub.
The size of GitLab is constantly growing and Meltano is adding to GitLab's capabilities, not subtracting. We've hired 2 very awesome Python developers for Meltano specifically. They each have tons of experience in the ELT space.
All this to say that no one at GitLab has turned their eyes away from GitLab; it's the opposite. This business is here to help GitLab as our first customer. Rather than having GitLab struggle to get its data tools together, and make business decisions based on that data, we've devoted a whole team to provide a solution while helping the community at the same time.
Personally I work as a "lone wolf" (something I complain about myself) because I'm in a small company that can't afford a huge team. Most of my (ETL) Transforms are done in SQL, which happens to be pretty standardized, as opposed to many ETL products I've seen so far.
This solution is probably far from being ready, but I find this approach quite interesting, because it looks like a code-based ETL that uses SQL for transforms (so I might be biased). Overall this might result in a more maintainable/versionable data pipeline model than GUI-first ETL, which usually generates spaghetti code. Because you are usually forced to regularly adapt data pipelines to unstable external inputs, being able to easily diff the ETL process would be a blessing.
One thing that gets me really excited about it is the way we want to build version control in from the start. To give you an example of where that's really powerful - we have a bunch of dashboards in Looker. Right now, figuring out what Looks/Dashboards rely on a given field is very challenging. If I change a column in my extraction, right now I can fairly easily propagate it to my final transformed table (thanks to dbt!) and even to the LookML. But knowing what in Looker is going to change / break if I change the LookML is way harder.
But if everything was defined in code from extraction, loading, transformation, modeling, _and_ visualization, that'd be really powerful from my perspective.
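To make the "what breaks if I change this field" question concrete, here is a minimal sketch of what becomes possible once every layer is declared in code. The edge list and artifact names (`extract.orders.amount`, `dashboard.revenue_overview`, and so on) are invented for illustration; this is not how Meltano, dbt, or Looker actually store lineage.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream artifact, artifact built from it).
EDGES = [
    ("extract.orders.amount", "dbt.fct_orders.amount"),
    ("dbt.fct_orders.amount", "lookml.orders.total_amount"),
    ("lookml.orders.total_amount", "dashboard.revenue_overview"),
    ("lookml.orders.total_amount", "look.top_customers"),
    ("extract.orders.status", "dbt.fct_orders.status"),
]


def build_graph(edges):
    """Map each upstream artifact to the list of artifacts that depend on it."""
    graph = defaultdict(list)
    for upstream, downstream in edges:
        graph[upstream].append(downstream)
    return graph


def downstream_of(graph, start):
    """Return every artifact reachable from `start`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


graph = build_graph(EDGES)
impacted = downstream_of(graph, "extract.orders.amount")
```

With the whole stack in one repo, a check like this could run in CI on every merge request and list the dashboards a column change would touch.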
The Meltano team has several user personas that they're looking at focusing on, data engineers are definitely one of them, but data analyst/BI users are as well, and we want the product to work well for the whole data team.
IMHO, if you want to make a dent in the space, figure out better debugging tools!
In particular, tools that explain how a certain (specific) value was calculated in the system, tools that let you bisect the source data in some way and let you focus on the source data that is likely to have a problem, tools that help you figure out that a certain intermediate value in calculations is an outlier, tools that let you test certain assumptions about data over the whole pipeline.
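As a rough illustration of the "bisect the source data" idea, here is a minimal sketch. It assumes the failure is caused by a single bad row and that you can re-run a check on any slice of the input; `pipeline_ok` is a hypothetical stand-in for running the real pipeline on that slice.

```python
def find_bad_row(rows, pipeline_ok):
    """Bisect `rows` to locate the one row that makes `pipeline_ok` fail.

    Assumes pipeline_ok(rows) is False and that exactly one row is responsible.
    """
    lo, hi = 0, len(rows)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if pipeline_ok(rows[lo:mid]):
            lo = mid  # first half is clean, so the bad row is in the second half
        else:
            hi = mid  # first half already fails, keep narrowing it
    return rows[lo]


def pipeline_ok(batch):
    # Hypothetical check standing in for a full pipeline run: amounts must be non-negative.
    return all(row["amount"] >= 0 for row in batch)


rows = [{"id": i, "amount": 10 * i} for i in range(8)]
rows[5]["amount"] = -1  # plant one bad record
bad = find_bad_row(rows, pipeline_ok)
```

Each step halves the input, so a source with a million rows needs about twenty pipeline runs rather than a manual hunt. Multiple interacting bad rows would need a more careful search than this.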
I'd love for a more robust way to test data pipelines and the data within them generally. I was at DataEngConf earlier this year and many people were talking about this problem exactly. One way we're trying to address it a bit is by using the Review Apps feature on Merge Requests within GitLab. Right now, when you open an MR on our repo it will create a clone of the data warehouse that's completely isolated from production. This, obviously, can't scale once the DW is beyond a certain size, but I think there are ways to keep this sort of practice going.
The idea is to give users a set of default extractors (which are the ones we use internally, so they are battle tested), along with loaders, transformers, etc., with documentation on how to build their own. For our MVP, and possibly into the future, it will work similarly to WordPress plugins: you have an extractor directory where you place your extractor, written following our protocol, and the UI will recognize it and give you a choice of extractors to run. The same goes for loaders, and so on.
We do not want to be chasing down every last corner case for extractors (except for our own), because that's just not a good long-term solution; it needs constant maintenance (as we've seen already). With user contributions, I believe it can work.
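A minimal sketch of the directory-based plugin idea, purely as an illustration: the protocol here (a module-level `extract()` function) and the file names are assumptions made up for this example, not Meltano's actual extractor protocol.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_extractors(directory):
    """Load each *.py file in `directory` and keep those exposing a callable `extract`."""
    found = {}
    for path in sorted(Path(directory).glob("*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # Hypothetical protocol: a plugin is any module with an `extract()` function.
        if callable(getattr(module, "extract", None)):
            found[path.stem] = module.extract
    return found


# Demo: write two files into a temporary "extractor directory" and discover them.
with tempfile.TemporaryDirectory() as plugin_dir:
    Path(plugin_dir, "demo_source.py").write_text(
        "def extract():\n    return [{'id': 1}, {'id': 2}]\n"
    )
    Path(plugin_dir, "not_a_plugin.py").write_text("VALUE = 1\n")
    extractors = discover_extractors(plugin_dir)
    available = sorted(extractors)
    rows = extractors["demo_source"]()
```

A UI could then render `available` as the list of extractors to choose from, and the same discovery function would work for a loaders directory.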
My point is that you’re aiming a lot broader than Github ever did - you are competing more as a suite than as a focused product.
And I’ve seen personally this impact the support side with customers, partnership side, etc. I help maintain a medium-large Gitlab for one of your bigger customers. Anyway this isn’t the place for me to get specific, I am just saying that you are taking a risky path in terms of sustainability IMO as a rando on the internet.
I'd point out that if you look at issue 5439 the team itself was originally unaware of the high number of edge case states of the merge request and closed the issue prematurely. Having many code paths is a code smell so I'd suggest simplifying your UX and edge cases here.
Since you own the merge request flow, I would suggest looking at the page and all its edge cases and seeing where you can simplify for the user. There is a dizzyingly large amount of info and CTAs presented to the user; it's pure information overload. Don't just measure yourself by how many features you ship but rather on how you communicate those features to your users. Simplicity is a powerful feature in itself.
Looking forward to batch comments.
The disapprove merge request is a feature available in Phabricator and other competitors so I would look to see how they've implemented it.
In terms of combining them with notifications, I agree. I just need a web place to see all the "pending" action items I need to work on. A web notification feature should be the place to see all the pending notifications left for me (similar to how it would work on email except you can't expire emails).
We're doing it because we believe there are emergent benefits to having the lifecycle in a single application https://about.gitlab.com/handbook/product/single-application...
I'm using Gitlab, btw, but only for the self-hosted git and its user interface (i.e. your core). All the other parts (bug tracking, CI, chat, ...) are in different and more appropriate tools for each of our use-cases... because most of yours are not complete enough, or sometimes it's not even clear how they actually could work for us (mattermost for example).
That got the time down for the worst case we measure from 15 seconds to 3 seconds, see https://news.ycombinator.com/item?id=17671300
(P.S.: I also have the same source in gitea of a way less powerful instance which basically renders the whole thing in way less than 1 second.)
We just stopped upgrading GitLab over 2 years ago, we're on 8.9
I'm sorry to hear you experienced too much breakage. Can you maybe point to a regression or two that stayed open too long or that caused you a lot of trouble so we can learn from it?
As GitLab gets more popular I'm not surprised the number of issues grows.
We are measuring a lot of metrics on GitLab.com. And we are shipping a lot of performance improvements to improve those metrics. https://about.gitlab.com/handbook/engineering/performance/#p...
For example, a really big MR had a time to first byte of 15 seconds. It is now 3 seconds. https://www.dropbox.com/s/ymo28t2v4i4jl4x/Screenshot%202018-...
It shouldn't take weeks of effort, a data engineer, multiple proprietary solutions, and tens of thousands of dollars to answer key questions like CAC or the efficiency of a given marketing campaign.
We're hoping to lower the barrier to entry in both cost and effort, by providing an open source pre-packaged solution.
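For context on how small the final calculation is once the data is actually in one place: CAC is acquisition spend divided by the number of new customers acquired. A toy sketch, with campaign names and figures invented for illustration:

```python
def customer_acquisition_cost(spend, new_customers):
    """CAC = acquisition spend / new customers; None when nobody was acquired."""
    if new_customers == 0:
        return None
    return spend / new_customers


# Hypothetical per-campaign totals, as they might come out of the warehouse.
campaigns = {
    "search_ads": {"spend": 12000.0, "new_customers": 60},
    "newsletter": {"spend": 1500.0, "new_customers": 25},
    "conference": {"spend": 8000.0, "new_customers": 0},
}

cac_by_campaign = {
    name: customer_acquisition_cost(**totals) for name, totals in campaigns.items()
}
```

The hard part is not this division but getting trustworthy spend and customer counts into the same table, which is the extract, load, and transform work.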
Yeah, I get that. The analytics space is very complex and companies, even ones with good engineering teams, don't have the internal knowledge or resources to typically put all this together.
In addition to working in this space, my company helps companies set up their analytics stack.
We typically set them up with one cloud-based data integration tool (the one with the most # of integrations they need at the best price), dbt, and one BI tool (usually Looker or Periscope, in that order). All in, that takes us a few weeks to get them set up and going.
I applaud your effort. I just struggle to understand why you accept punting on transformations (and using dbt (amazing library, by the way - great choice)), but then try to tackle something like integrations or BI tools. The complexity of both of those is massive and there are great open source efforts already out there.
I'm eager to see where this goes.
I would love to hear your suggestion for a great open source BI tool. We tried Superset and Metabase, but neither came close to what we could do with Looker. That is why we're giving Meltano Analyze a shot.
BTW, do you want to do a livestreamed video call to discuss further in the next 30 minutes? You have a lot of knowledge. If so, please email me and comment here.
Update: He did email and livestream will happen on https://www.youtube.com/watch?v=F8tEDq3K_pE
That's GitHub's strategy. Don't choose solutions for their customers. Be a platform other tools can plug into.
Gitlab's strategy is to cobble together a bunch of open source software (including their own) to provide a solution out of the box. It's not necessarily the best one for you, but it's certainly less effort for you.
I'd love to learn more about what you'd like to see CI be able to do from a dataops perspective.
Also, I had a coffee at 5PM with someone, which is way too late to be drinking coffee, and it is evident in how quickly I'm talking >.<
@slap_shot and anyone else — I'm curious if you have thoughts on, or even have heard of the Ballerina language? It's a programming language for doing data integration work, built by the ESB/integration consultancy WSO2. It seems to have a lot of eng resources sunk into it but surprisingly little fanfare.
The CEO's interview with the Software Engineering Daily podcast was great: https://softwareengineeringdaily.com/2018/07/12/ballerina-la...
The language site tends toward buzzword-salad, but clearly has had a lot of love and thought put into it: https://ballerina.io/philosophy/