How to Become a Data Engineer in 2021(khashtamov.com) |
How to Become a Data Engineer in 2021(khashtamov.com) |
If you want to build anything mildly interesting, you need to have a solid background on software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL), you need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data), you need a bit of knowledge about infrastructure, because some of the most specialized and powerful tools do not have yet an established "way of doing things", and the statefulness nature of them make them different from your typical web app deployment.
Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.
spark has dataframe API which is similar to pandas api and can be learned in one day, especially if you know python.
same for Airflow and other frameworks, it just a fancy scheduler that anyone can pick up in a couple days.
What if you build you data pipelines in sql? curious if you have an example of a data pipeline that needs spark?
The experience required differs dramatically between [semi]structured transactional data moving into data warehouses versus highly unstructured data that the data engineer has to do a lot of munging on.
If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.
If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but at the entry level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.
I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position and it will give you an opportunity you know a few tools at a deeper level.
Old stack: Hadoop, spark, hive, hdfs.
New stack: kafka/kinesis, fivetran/stitch/singer, airflow/dagster, dbt/dataform, snowflake/redshift
For my money, its the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.
I guess I'm the odd-man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.
I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)
However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.
I should imagine at CERN etc knowing which end of soldering iron gets hot might still be required in some cases.
I recall back in the mumble extracting data from b&w film shot with a high speed camera, by projecting it on to graph paper taped to the wall and manualy marking the position of the "object"
I bet it is still mostly the same, just using Web GUIs nowadays.
As many advantages as SQL has, in many cases it gets into the way. The closer you move to moving data (instead of doing analysis), the more it becomes annoying.
On the other hand, current languages (such as python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.
Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.
[1] https://www.holistics.io/books/setup-analytics/ [2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m... [3] https://awesomedataengineering.com/
For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me. This gives me some trouble as I'm not really show how to model data properly.
I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.
Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.
No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.
Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.
I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.
I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.
As DE has evolved, the role has transitioned away from traditional low code ETL tools towards code heavy tools. Airflow, Dagster, DBT, to name a few.
I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allows other people to get the data themselves. Think tables-as-a-service.
There is also some trends:
https://trends.google.com/trends/explore?date=today%205-y&ge...
If JetBrains gets lucky, they might manage to create a cross-platform Kotlin eco-system as they are trying hard to push, as means to sell InteliJ licenses.
Lets see if it doesn't end like like Typesafe.
My favorite so far is S3 + PrestoDB with either ORC or Parquet files. It is a solid DWH solution for most enterprises on the cloud. (Cloud or not is a different discussion). It works for small scale (50TB) to really high scale (50PB). There are some (very few) gotchas and moving parts as opposed to Hadoop + co. You can combine it with Kafka for streaming data and you got yourself a pretty solid data solution.
If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.
We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC saved us over 10x the cost and it was substantially faster.
https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...
While maybe not strictly necessary per se, it's a great way to get a foot in the door, and provides a great way to foster advanced type systems and functional programming (I personally find it to be a really fun language to write in to boot).
I would learn python. Its the number one language outside sql.
edit: Guess this was pretty much in the post.
1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/PostGres etc. Maybe even capable of FE development.
2. Pipeline expert, capable of building out data pipelines for transforming data, Flink, Spark, Beam on top of Kafka/Kinesis/Pubsub run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools wiring together things with a bit of python to move data from A to B, to the other exteme of full fledge Scala engineer writing complex applications that run on these pipelines.
3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out of the box solutions don't work or scale. Some overlap with 2, but really distinguished by it being a full fledged software engineer specializing in the big data ecosystem.
So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into the #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.
What? The Apache stack that's written in Scala recompiles all your code into JVM bytecode, regardless of what language you've written it in. Yes, that includes Scala. Spark isn't actually firing up a python interpreter and running your python code on the data.
I think these two sentences are sort of orthogonal to one another. The first, I interpret as saying that it's useful to understand Scala if you're using Spark, essentially because of the law of leaky abstractions [1]. I think you're responding to the second sentence and in that case I agree.
[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...
What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/Big Query et.al) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.
Currently I'm reading Designing Data-Intensive Applications by Kleppman, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.
There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.
Data Engineering really comes down to being a set of hacks and workarounds, because there is no data processing system which you could use in a standardized systematic way that data analysts, engineers, scientists and anyone else could use. It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.
There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.
Snowflake or Databricks Spark bring you closest to the ideal unified system despite all their shortcomings. But still, you sometimes need to process unstructured jsons, extract stuff from html and xml files, unzip a bunch of zip archives and put them into something that these systems recognize and only then you can run sql on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something spark recognizable for you.
It is not that difficult to imagine a world where such a system exists and data engineering is not evem needed. But you can be damn sure, that before this happens, that this position will be here to stay, and will be paying high, when 90% of ML and data science is data engineering and cleaning and all these companies hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' job.
But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.
https://www.linkedin.com/pulse/mapping-data-science-professi...
Several times over my career I've been brought in on a project where the team was considering replacing their RDBMS entirely with a no-SQL data store (a huge undertaking!) because they were having "performance problems". In many cases the solution is as simple as adding an index or modifying a query to use an index, but the devs regard it as some kind of wizardry to read a query plan.
Unless you are in a position where you can entirely rely on managed tools that do the work for you and all effort is centered around managing the data, rather than the holistic view of your data pipelines (Talend ETL, Informatica - the "pre-Hadoop" world, if you will, and maybe some modern tools like Snowflake), then a good Data Engineer needs a deep understanding of programming languages, networking, some sysadmin stuff, distributed systems, containerization, statics, and of course a good "architect" view on the ever-growing zoo of tools and languages with different pros and cons.
Given that at the end of the day, most "Data Pipelines" run on distributed Linux machines, I've seen and solved endless issues with Kernel and OS configurations (noexec flags, ulimits, permissions, keyring limits ...), network bottlenecks, hotspotting (both in networks and databases), overflowing partitions, odd issues on odd file systems, bad partition schemes, a myriad of network issues, JVM flags, needs for auditing and other compliance topics, heavily multi-threaded custom implementations that don't use "standard" tools and rely on language features (goroutines, multiprocessing in Python, Threadpools in Java ...), encoding problems, various TLS and other security challenges, and of course, endless use of GNU tools and other CLI-fun and I would not necessarily expect for a pure SQL use case (not discounting the fact that SQL is, in fact, very important).
Not to mention that a lot of jobs / workflows Data Engineers design and write tend to be very, very expensive, especially on managed Clouds - generally a good idea to make sure everything works and your engineers understand what they are doing.
I've led DE teams for the last decade. I have lived through shifts in toolsets, languages, etc. Regardless of platform, languages, model types, etc, etc, etc, etc, the one constant has been SQL with some sort of of scripting around it.
Right now, it seems Python is the big wrapper language, whether it's via dag or some other means but that's just the preferred method TODAY. Considering SQL has been around for decades and has outlasted just about every other language and system, many of which have opted for a SQL-like interface on top of their system, I would highly recommend DEs be very strong there.
I'm not bitter; you're bitter. /s
* Microsoft SSIS is still there, kind of a granddaddy tool but perfectly capable of single-machine ETL
* Trifacta's Wrangler has a free version with limits
* Talend's Open Studio is free, a little clunky but works fine
* Some new players that I've played around with are Airbyte (immature but evolving quick) and Fivetran (consumption-based pricing model, fairly extensible, but kind of biased about the sources/sinks they're interested in supporting)
* I haven't tried Streamsets or Stitch yet, but I've watched a few videos, again, a little more focused on cloud and streaming data sources than traditional batch ETL, but seem fair enough for those use cases as well
* If you want to roll your own SQL/Python/etc ETL, Airflow and Luigi are good and simple orchestrators/schedulers
The cloud services have pretty cheap consumption-based ETL PaaS offerings, too: Azure Data Factory, Amazon Glue, GCP Cloud Data Fusion
Unless what you're doing is highly bespoke ETL, I'd recommend trying out the new kids on the block and seeing if you can build pipelines that suit your needs from those, because they're at the forefront of a lot of evolving data architecture patterns that are about to dominate the 2020s.
Excel Power Query is also quite lightweight. But is pretty klunky in my (biased) opinion.
Also would highly highly recommend looking into kedro (which has airflow integration, or you could just run your pipelines with crontab)
Love this quote. It hits the nail on the head. Not sure why it's paid so well though...
I'd definitely do it even if it's the same pay as BA.
In an ELT/ETL pipeline:
Airflow is similar to the "extract" portion of the pipeline and is great for scheduling tasks and provides the high-level view for understanding state changes and status of a given system. I'll typically use airflow to schedule a job that will get raw data from xyz source(s), do something else with it, then drop it into S3. This can then trigger other tasks/workflows/slack notifications as necessary.
You can think of dbt as the "transform" part. It really shines with how it enables data teams to write modular, testable, and version controlled SQL - similar to how a more traditional type developer writes code. For example, when modeling a schema in a data warehouse all of the various source tables, transformation and aggregation logic, as well as materialization methods are able to to live in the their own files and be referenced elsewhere through templating. All of the table/view dependencies are handled under the hood by dbt. For my organization, it helped untangle the web of views building views building views and made it simpler to grok exactly what and where might be changing and how something may affect something else downstream. Airflow could do this too in theory, but given you write SQL to interface with dbt, it makes it far more accessible for a wider audience to contribute.
Fivetran/Stitch/Singer can serve as both the "extract" and "load" parts of the equation. Fivetran "does it for you" more or less with their range of connectors for various sources and destinations. Singer simply defines a spec for sources (taps) and destinations (targets) to be used as a standard when writing a pipeline. I think the way Singer drew a line in the sand and approached defining a way of doing things is pretty cool - however active development on it really took a hit when the company was acquired. Stitch came up with the singer spec and their offered service is through managing the and scheduling various taps and targets for you.
We also trained our analysts to write the more analytical DBT transformations which was nice, shifted that work onto them.
Don't get me wrong though, you can get really far with just DBT + Fivetran, in fact, it removes like 80% of the really tedious, but trivial ETL work. Airflow is just there for the last 20%
(Plus you can then utilize airflow as a general job scheduler)
Where it lacks is abstraction. To make that more concrete, let me ask: can you write LINQ that takes an arbitrary structure and selects every numeric field and calculate the sum over it? And if that is not possible, it should not compile.
I.e. can you define a function "numericsSum(...)" and call it with "List(User(salary: Number, deductions: Number, name: String))" and have it calcuate the sum (ignoring the name-field) but having it fail to compile when calling with "List(User(name: String, date_of_birth: Date))"?
Another example: is it possible to create your own special "join" function, that joins to data structures if they have exactly one common field (which you don't have to specify)?
In both examples, the LINQ compiler must be able to inspect the structure of a type (such as User) at compile time (not runtime) and do pretty arbitrary calculations. Most languages don't support that and I think even LINQ only works with concrete types in that sense. Which, by the way, is already better than what most languages offer - don't get me wrong here. But it is not as powerful as what SQL "compilers" offer - however those are then limited to SQL only, lacking the ability to do general purpose programming.
Not all jobs are new pasture and I think that's forgotten very frequently.
SQL is also an amazing tool and you should definitely learn if but there are a lot of DE jobs out there with Scala in the "Requirements" section of the job listing. Parts of the industry might be moving away from it, but if you're looking to make a jump into DE I think you're hamstringing yourself by avoiding Scala.
But I've never worked for FB.
It seems to be the norm for everything that takes flight around 2010. Of course many are open sourced so those are OK I guess.
The wide table thing is also kind of certified by someone inside of Vertica so I guess it does make sense. I'm not actually working in data team so I don't have the CS knowledge to prove or disprove it. Fortunately our new data lead is going to convert everything into Kimball so we are going to have solid proof eventually.
Having a wide table can have some performance benefits, but I think the primary benefit of dimensional modelling isn't related to performance, but having standardized definitions, or "conformed" dimensions. For example, having a "calendar" dimension with attributes you can pull into your analysis such as fiscal quarter or reporting week.
In a data model with wide tables, if you need to change the definition of a dimension, it's going to be painful to track down all the different places where that old dimension is used.
I do think there’s a potential for performance benefits too because you can sort the fact and dimensions in ways that make sense for each one. Calendar is sorted by week number, product is sorted by a category. Fact is sorted by a date and locale, etc. since you don’t materialize the full projection more expensive but flexible sorting like interleaving becomes possible.
It’s probably academic but it’s not as hard as people make it sound and I think there are real benefits over these wide tables with simple sorting.
Plus if you use a tool like Looker it will write the reporting SQL for you! Generating star schema queries is easy once the tables are in place.
Sorting (or partitioning) is one of the most powerful optimizations in your toolbox. But only when optimized for some kind of access pattern. When you combine domains to get more width you have to make a compromise on the sorting. Then the wheels come off.
You still need different tables for clicks and orders and payments, even if they are very wide. You may or may not physically conform your dimensions in pure Kimball style but logically you (should) still start there.
It took me a few minutes but I found indexes are called projections in Vertica and are more like materialized views than true indexes, here are the docs with a breakdown, https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/Ad...
And here is a general walk through of the architecture including key concepts such as projections. https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/C...
In a few cases I have had to go to published white papers on the technologies as well.
But honestly it's all searching for the right words and then crawling through docs and papers.
To move on to somewhere that does, until the laggards get the message.
Distributed ML is tough to train because of very little control over train loop. I personally prefer using single server trainkng even on large datasets, or switch to online learning algos that do train/inference/retrain at the same time.
as for snowflake, I havent heard of people using snowflake to train ML, but sbnowflake is a killer in managed distribited DWH that you dont have to tinker and tune
How do Snowflake (and Redshift, mentioned above) compare with CitusDB? I really like the PostgreSQL experience offered by Citus. I've been bit by too many commercial databases where the sales brochure promises the product does X, Y, and Z, only to discover later that you can't do any of them together because reasons.
But Spark is super cool and actually has algorithms which complete in a reasonable time frame on hardware I can get access to.
Like, I understand that the SQL portion is pretty commoditised (though even there, SparkSQL python and R API's are super nice), but I'm not aware of any other frameworks for doing distributed training of ML models.
Have all the hipsters moved to GPUs or something? \s
> sbnowflake is a killer in managed distribited DWH that you dont have to tinker and tune
It's so very expensive though, and their pricing model is frustratingly annoying (why the hell do I need tickets?).
That being said, tuning Spark/Presto or any of the non-managed alternatives is no fun either, so I wonder if it's the right tradeoff.
One thing I really, really like about Spark is the ability to write Python/R/Scala code to solve the problems that cannot be usefully expressed in SQL.
All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.
Tensorflow, PyTorch (not sure if Ray is needed) and Mxnet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning. You can then run them with KubeFlow or on whatever platform your SaaS provider has (GCP AI Platform, AWS Sagemaker, etc.).
edit:
>All the replies to my original comment seem to forget that, or maybe Snowflake has such functionality and I'm unaware of it.
Snowflake has support for custom Javascript UDFs and a lot of built in features (you can do absurd things with window functions). I also found it much faster than Spark.
Edit: I may be wrong[1], would be curious to know what users who've used Spark AND Snowflake would add to the conversation.
[1] https://www.snowflake.com/blog/snowflake-and-spark-part-1-wh...
We load our data into SF in json and do plenty of list/struct manipulation using their inbuilt functions[1]. I guess you might have write a UDF if you are doing something super weird but inbuilt functions should get you pretty far 90% of the time.
https://docs.snowflake.com/en/sql-reference/functions-semist...
I was comparing it for "traditional" data engineering stack that used spark for data munging, transformations ect.
I don't have much insight into ML systems or how spark fits there. Not all data teams are building 'ml systems' though. Parent comment wasn't referring to any 'ml systems', not sure why that would be automatically inferred when someone mentions data stack .
For context, I'm a DS who's spent far too much time not being able to run useful models because of hardware limitations, and a Spark cluster is incredibly good for that.
Additionally, I'd argue in favour of Spark even for ETL, as the ability to write (and test!) complicated SQL queries in R, Python and Scala was super, super transformative.
We don't really use Spark at my current place, and every time I write Snowflake (which is great, to be fair), I'm reminded of the inherent limitations of SQL and how wonderful Spark SQL was.
I'm weird though, to be fair.
Along with ML it is also a very high performance extract and transformation engine.
Would love to hear what other tech that are being used to replace Spark.
UDF support isn't really the same, to be honest. You're still prisoner of the select from pattern. Don't get me wrong, SQL is wonderful where it works, but it doesn't work for everything that I need.
I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.
Tensorflow, PyTorch (not sure if Ray is needed) and Mxnet all support distributed training across CPUs/GPUs in a single machine or multiple machines. So does XGBoost if you don't want deep learning.
I forgot about Xgboost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these. I haven't ever tried to do it, but based on my experience of running deep learning frameworks and distributed ML, I suspect the combination of both to be exponentially more annoying ;) (And i deal mostly with structured data, so it doesn't buy me as much).
> You can then run them with KubeFlow or on whatever platform your SaaS provider has (GCP AI Platform, AWS Sagemaker, etc.).
Do people really find these tools useful? Again, I'm not really sure what SageMaker (for example) buys me on AWS, and their pricing structure is so opaque that I'm hesitant to even invest time in it.
Not sure how it's different from what you can do in Spark in terms of data transformations. Taking a list of objects as an argument basically allows your UDF to do arbitrary computations on tabular data.
> I forgot about Xgboost, but I'm a big fan of unsupervised methods (as input to supervised methods, mostly) and Spark has a bunch of these.
That's true, distributed unsupervised methods aren't done in most other places I know of. I'm guessing there's ways to do that with neural network although I haven't looked into it. The datasets I deal with have structure in them between events even if they're unlabeled.
>I completely agree that it's faster than Spark, but it's also super-expensive and more limited. I suspect it would probably be cheaper to run a managed Spark cluster vs Snowflake and just eat the performance hit by scaling up.
I used to do that on AWS. For our use case, Athena ate its lunch in terms of performance, latency and cost by an order of magnitude. Snowflake is priced based on demand so I suspect it'd do likewise.