How to Become a Data Engineer in 2021

How to Become a Data Engineer in 2021 (khashtamov.com)

264 points by adilkhash 5 years ago | 137 comments

(source for everything following: I recently hired entry-level data engineers)

The experience required differs dramatically between [semi]structured transactional data moving into data warehouses versus highly unstructured data that the data engineer has to do a lot of munging on.

If you're working in an environment where the data is mostly structured, you will be primarily working in SQL. A LOT of SQL. You'll also need to know a lot about a particular database stack and how to squeeze it. In this scenario, you're probably going to be thinking a lot about job-scheduling workflows, query optimization, data quality. It is a very operations-heavy workflow. There are a lot of tools available to help make this process easier.

If you're working in a highly unstructured data environment, you're going to be munging a lot of this data yourself. The "operations" focus is still useful, but as an entry-level data engineer, you're going to be spending a lot more time thinking about writing parsers and basic jobs. If you're focusing your practice time on writing scripts that move data in Structure A in Place X to Structure B in Place Y, you're setting yourself up for success.
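As an illustration of the "Structure A to Structure B" practice scripts mentioned above, here is a minimal sketch that turns CSV text into JSON Lines. The function name `csv_to_jsonl` and the cleanup rule (trim whitespace, map empty fields to null) are assumptions made for the example, not something from the original comment.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str) -> str:
    """Convert CSV text (Structure A) into JSON Lines (Structure B)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        # Hypothetical cleanup rule: trim whitespace, map empty fields to None.
        cleaned = {
            key.strip(): ((value or "").strip() or None)
            for key, value in row.items()
        }
        lines.append(json.dumps(cleaned, sort_keys=True))
    return "\n".join(lines)
```

Calling `csv_to_jsonl("id,name\n1, Ada \n2,\n")` yields one JSON object per input row. Reading from Place X and writing to Place Y (files, buckets, queues) would wrap around this function.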

I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments - but - there are a lot of useful concepts in Hadoop/Spark that are helpful for data engineers to be familiar with. While you might not be using those tools on a day-to-day basis, chances are your hiring manager used them when she was in your position, and it will give you an opportunity to know a few tools at a deeper level.

dominotw 5 years ago | |

Agree 100% with this comment,

Old stack: Hadoop, spark, hive, hdfs.

New stack: kafka/kinesis, fivetran/stitch/singer, airflow/dagster, dbt/dataform, snowflake/redshift

disgruntledphd2 5 years ago | | |

Huh, what replaces Spark in those lists?

For my money, it's the best distributed ML system out there, so I'd be interested to know what new hotness I'm missing.

sseppola 5 years ago | | |

Can you elaborate more on the "roles" of the "new stack"? To me dbt/dataform and airflow/dagster are quite similar, so why do you need one of each? fivetran/stitch/singer are all new

fetanchaud 5 years ago | | |

Just a thought: what about Dremio?

llbeansandrice 5 years ago | |

> I agree with a few other commentators here that Hadoop/Spark isn't being used a lot in their production environments

I guess I'm the odd man out because that's all I've used for this kind of work. Spark, Hive, Hadoop, Scala, Kafka, etc.

josephmosby 5 years ago | | |

I should have specified more thoroughly.

I am not seeing Spark being chosen for new data eng roll-outs. It is still very prevalent in existing environments because it still works well. (used at $lastjob myself)

However - I am still seeing a lot of Spark for machine-learning work by data scientists. Distributed ML feels like it is getting split into a different toolkit than distributed DE.

teddyuk 5 years ago | | |

I'm also the odd one out; so many enterprises are moving to Spark on Databricks.

Mauricebranagh 5 years ago | |

It does rather depend on what sort of data. I bet a data engineer at CERN or JPL has quite a different set of required skills to, say, Google or a company playing at data science because it's the next big thing.

I should imagine at CERN etc. knowing which end of a soldering iron gets hot might still be required in some cases.

I recall back in the mumble extracting data from b&w film shot with a high speed camera, by projecting it on to graph paper taped to the wall and manually marking the position of the "object".

pjmlp 5 years ago | | |

When I was there, almost 20 years ago, it was all about C++, Python and Fortran, with GUIs being done in Java Swing.

I bet it is still mostly the same, just using Web GUIs nowadays.

valenterry 5 years ago | |

We still have to wait for the proper sweet spot: a language that allows SQL-like handling without the restrictions of SQL.

As many advantages as SQL has, in many cases it gets in the way. The closer you move to moving data (instead of doing analysis), the more annoying it becomes.

On the other hand, current languages (such as python) lack support when it comes to data transformations. Even Scala, which is one of the better languages for this, has severe drawbacks compared to SQL.

Hopefully better type-systems will help us out in the long term, in particular those with dependent types or similar power to describe data relations.

andkenneth 5 years ago | | |

What's your opinion of LINQ in C#? It's been a while since I've used it but to me it seems like one of the most powerful ways to manipulate data inside of a language.

alexpetralia 5 years ago | |

Great points. It depends on where the business is at, the scale of their data, how processed their data is, and the timeliness/accuracy requirements of that data.

laichzeit0 5 years ago |

I think it's missing the resources to one of the hardest sections: Data modelling, like Kimball and Data Vault. That, and maybe a section to modern data infrastructure. I'd put a link to [1] and [2] for a quick overview and probably [3] for more detail.

[1] https://www.holistics.io/books/setup-analytics/

[2] https://a16z.com/2020/10/15/the-emerging-architectures-for-m...

[3] https://awesomedataengineering.com/

markus_zhang 5 years ago | |

This. I also think modern columnar databases and other techniques somehow make Kimball obsolete, or at least relaxed, but I could be very wrong.

For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me. This gives me some trouble as I'm not really sure how to model data properly.

mulmen 5 years ago | | |

> For example we use Vertica and DBA told us that Vertica loves wide tables with many columns, which doesn't look very Kimball to me.

I have heard advice like this from colleagues and frankly I don't buy it. It certainly isn't gospel. I think it's an oversimplification.

Columnar stores love star schemas. You can get away with a single table model too but you still need some kind of dimensional or at least domain-based thinking. Your single table is going to basically be a Kimball model but already joined together.

No database is going to be happy with joining orders and billing. The single table is still just going to be a single fact table, you just degenerate all the dimensions.
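To make the "single table is a Kimball model already joined together" point concrete, here is a small sketch using Python's built-in `sqlite3` module. The table and column names (`fact_orders`, `dim_customer`, `wide_orders`) and the sample rows are invented for illustration; a columnar warehouse such as Redshift or Vertica would behave differently in performance terms, but the modelling idea is the same.

```python
import sqlite3


def build_wide_table():
    """Build a tiny star schema, then flatten it into one wide table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension: descriptive attributes of a customer.
        CREATE TABLE dim_customer (
            customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        -- Fact: one row per order, keyed to the dimension.
        CREATE TABLE fact_orders (
            order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
        INSERT INTO dim_customer VALUES (1, 'Ann', 'EU'), (2, 'Bob', 'US');
        INSERT INTO fact_orders VALUES
            (100, 1, 10.0), (101, 2, 25.0), (102, 1, 20.0);
        -- The wide table is the same fact with its dimension degenerated in.
        CREATE TABLE wide_orders AS
        SELECT f.order_id, f.amount,
               d.name AS customer_name, d.region AS customer_region
        FROM fact_orders AS f
        JOIN dim_customer AS d USING (customer_id);
        """
    )
    rows = conn.execute(
        "SELECT order_id, amount, customer_name, customer_region "
        "FROM wide_orders ORDER BY order_id"
    ).fetchall()
    conn.close()
    return rows
```

The wide table still has the grain of the fact table (one row per order); the dimensional thinking is what decides which attributes get copied onto each row.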

Personally I think you can gain a lot of benefit from doing proper stars because you get more sorting options but I'm a Redshift guy so maybe I'm stuck in that headspace.

I'm still waiting for someone to come along and propose something different but honestly Kimball's dimensional mental model still resonates with me. Are there compromises, can you relax the model more? Of course, but you're still going to realize huge benefits from starting with that approach. I don't think there is some "new" way of thinking that really changed the data space. All the innovation is on the compute side.

I have precisely zero Vertica experience so maybe I'm totally missing something. I'd be happy for someone to tell me I'm wrong.

mulmen 5 years ago | |

SQL is easy, data is hard.

prions 5 years ago |

SQL proficiency is important but I wouldn't say it supersedes programming experience. To me, Data Engineering is a specialization of software engineering, and not something like an analyst who writes SQL all day.

As DE has evolved, the role has transitioned away from traditional low code ETL tools towards code heavy tools. Airflow, Dagster, DBT, to name a few.

I work on a small DE team. We don't have the human power to grind out SQL queries for analysts and other teams. Our solutions are platforms and tools we build on top of more fundamental tools that allow other people to get the data themselves. Think tables-as-a-service.

StreamBright 5 years ago |

2021? More like 2010. Hadoop is getting deprecated rapidly and more companies split their write and read workloads. Separated storage and compute is also popular. Scala is not used that much; I think it is not worth the time investment. More and more companies go for Kotlin instead of Java when they want to tap into the Java ecosystem.

adilkhash 5 years ago | |

Hadoop is still widely used in enterprises (especially in banks); if you have experience working with Hadoop ecosystems, it is a big plus anyway.

StreamBright 5 years ago | | |

Yes, that is the status quo.

There are also some trends:

https://trends.google.com/trends/explore?date=today%205-y&ge...

pjmlp 5 years ago | |

Kotlin's future is tied to Android; on the JVM it will be another Scala in 5 years' time.

If JetBrains gets lucky, they might manage to create a cross-platform Kotlin ecosystem, as they are trying hard to push, as a means to sell IntelliJ licenses.

Let's see if it doesn't end up like Typesafe.

smattiso 5 years ago | |

Are you working in this field? I am looking for a consultant to set up a modern data processing pipeline for a data-driven hardware product I am building.

StreamBright 5 years ago | | |

Yes, I am moving companies to their next data pipeline, that is my specialty. I have added my email to my profile, you can reach out to me.

kentm 5 years ago | |

What does your typical stack look like?

StreamBright 5 years ago | | |

Depends. Just some random mixture of stacks: PrestoDB, S3, Airflow, Luigi, Dremio, Athena, Hive LLAP, EMC Isilon, Kafka.

My favorite so far is S3 + PrestoDB with either ORC or Parquet files. It is a solid DWH solution for most enterprises on the cloud. (Cloud or not is a different discussion). It works for small scale (50TB) to really high scale (50PB). There are some (very few) gotchas and moving parts as opposed to Hadoop + co. You can combine it with Kafka for streaming data and you got yourself a pretty solid data solution.

ABeeSea 5 years ago |

I think learning Scala is a bit of a waste of time, but I don’t know everyone’s stack. Maybe it’s a west coast bubble, but serverless seems to be the most popular choice for new ETL stacks even if the rest of the cloud tech stack isn’t serverless. AWS tools like Kinesis, Glue (PySpark), Step Functions, pipelines, Lambdas, etc.

If you are working in that domain, being able to use the CDK in TypeScript becomes way more important than being able to build a Hadoop cluster from scratch using Scala.

ianbutler 5 years ago | |

Glue is both more of a pain in the butt than regular old Spark with PySpark and way more expensive. From my experience, I would seriously question anyone suggesting it.

We could have been using it wrong, but porting our Glue scripts to standard EMR after our initial POC saved us over 10x the cost and it was substantially faster.

ABeeSea 5 years ago | | |

Both pricing and start-up times are significantly better in Glue 2.0 (assuming one can migrate). But even on Glue 1.0, orchestrating an ETL process with several dozen jobs is a non-trivial amount of configuration and labor (job failures, job restarts, paging, job run history, CloudWatch logs, reusable infrastructure as code when creating new jobs, permissions and security, etc.), so the increased cost is more than worth it for us.

https://aws.amazon.com/blogs/aws/aws-glue-version-2-0-featur...

wheaties 5 years ago |

...and nothing of basic statistics? Data Science people want to know about your data pipeline and have some quantification of the quality of that data. Also, monitoring data pipelines for data integrity often relies upon a statistical test. You don't need to go as far as Bayesian, but you do need to understand when a median goes way off or if a distribution is bimodal, etc.
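A rough sketch of the kind of "has the median gone way off?" check described above, using only Python's standard library. The function name `median_drifted`, the choice of median absolute deviation (MAD) as the yardstick, and the default threshold of 3 are all assumptions for illustration; a real pipeline would pick a test and threshold suited to its own data.

```python
import statistics


def median_drifted(baseline, current, threshold=3.0):
    """Return True when the median of `current` is far from the baseline median.

    Distance is measured in units of the baseline's median absolute
    deviation (MAD), a spread estimate that is robust to outliers.
    """
    base_median = statistics.median(baseline)
    mad = statistics.median(abs(x - base_median) for x in baseline)
    shift = abs(statistics.median(current) - base_median)
    if mad == 0:
        # Baseline has no spread, so any change in the median counts as drift.
        return shift != 0
    return shift / mad > threshold
```

For example, with yesterday's row sizes as `baseline` and today's as `current`, a `True` result would be a reason to alert or pause downstream jobs rather than proof that the data is wrong.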

diehunde 5 years ago | |

That should be assumed in the "engineer" part of the role.

ZephyrBlu 5 years ago | | |

Yeah, I would definitely expect an engineer to have a grasp of basic statistics such as mean, median, and mode, and to be able to interpret statistical graphs on a basic level (modality, skewness, shape, etc.).

adilkhash 5 years ago | |

a good catch, thanks!

dominotw 5 years ago |

I've been in this space for the last 6 yrs or so and my Scala usage has gone down to zero. Not worth learning Scala.

pgoggijr 5 years ago | |

This is an anecdote - plenty of firms are using Scala in their data engineering stacks and it's a great tool for the job.

While maybe not strictly necessary per se, it's a great way to get a foot in the door, and provides a great way to foster advanced type systems and functional programming (I personally find it to be a really fun language to write in to boot).

dominotw 5 years ago | | |

> it's a great tool for the job.

What job can this do that can't be done via SQL? Dealing with unstructured data?

st1x7 5 years ago | | |

> plenty of firms are using Scala in their data engineering stacks

Isn't that just a result of everyone being into Spark a few years ago?

sidlls 5 years ago | |

Scala, when it's not used because it's just what someone learned the ropes with, is the Haskell of data science and machine learning: it's what people use when they want to inflate their credentials and/or egos.

switch007 5 years ago | |

What languages are worth learning?

dominotw 5 years ago | | |

SQL has taken over the space completely. 90% of data munging and transforms happen via SQL.

I would learn Python. It's the number one language outside SQL.

johanneskanybal 5 years ago | | |

sql, python, terraform, maybe some basic java. Airflow is pretty common. Whatever the company is migrating away from. As long as you're good at one of those and can pick up the rest on the fly you should be fine to start out.

edit: Guess this was pretty much in the post.

runT1ME 5 years ago |

I've been approached about various data engineering jobs over the last couple years and the job descriptions have varied wildly. It has been everything from:

1. SQL/analytics wizard, capable of building out dashboards and quickly finding insights in structured data. Oracle/MSSQL/Postgres etc. Maybe even capable of FE development.

2. Pipeline expert, capable of building out data pipelines for transforming data: Flink, Spark, Beam on top of Kafka/Kinesis/Pubsub, run from an orchestration engine like Airflow. Even this could span from using mostly pre-built tools and wiring things together with a bit of Python to move data from A to B, to the other extreme of a full-fledged Scala engineer writing complex applications that run on these pipelines.

3. Writing infrastructure software for big data pipelines, customizing Spark/Beam/Flink/Kafka and/or writing custom big data tools when out of the box solutions don't work or scale. Some overlap with 2, but really distinguished by it being a full fledged software engineer specializing in the big data ecosystem.

So, are all three of these appropriate to call Data Engineer? Is it mainly #1 and people are getting confused? I would certainly fall into the #3, so I'm always surprised when people approach me about 'SQL transform' type jobs.

ABeeSea 5 years ago | |

I’d call 2 and 3 data engineers and 1 either a data analyst or BI developer/engineer depending on technical proficiency.

dibujante 5 years ago |

"In order to undestand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset."

What? Spark, which is written in Scala, runs your DataFrame operations on the JVM regardless of what language you've written them in. Yes, that includes Scala. Unless you use Python UDFs or RDD lambdas, Spark isn't actually firing up a Python interpreter and running your Python code on the data.

fantod 5 years ago | |

> In order to undestand how these systems work I would recommend to know the language in which they are written. The biggest concern with Python is its poor performance hence the knowledge of a more efficient language will be a big plus to your skillset.

I think these two sentences are sort of orthogonal to one another. The first, I interpret as saying that it's useful to understand Scala if you're using Spark, essentially because of the law of leaky abstractions [1]. I think you're responding to the second sentence and in that case I agree.

[1] https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-a...

diehunde 5 years ago |

Nice article. From experience I would say the SQL knowledge should be advanced though. Not intermediate.

zaptheimpaler 5 years ago |

Somewhat outdated view. This may be the current stack, but it's outdated now and is slowly being replaced. The new view is not big data pipelines and ETL jobs; it's lambda architecture, live aggregations/materialized views, and simple SQL queries on large data warehouses that hide the underlying details. The batch model may still apply to ML I guess, but I'm no expert there.

tomnipotent 5 years ago | |

This is true for only a very limited subset of data producers that need real-time or near real-time data included in ML models. For 99% of the rest, batch processing is just fine and considerably more economical.

molsongolden 5 years ago | |

Any resources/guides you'd recommend?

sseppola 5 years ago |

Great resource, thanks for sharing it! I will dig deeper into the resources linked here as there's a lot I have never seen before. The main topics are more or less exactly what I've found to be key in this space in the last 2 months trying to wrap my head around data engineering in my new job.

What I'm still trying to grasp is first how to assess the big data tools (Spark/Flink/Synapse/BigQuery et al.) for my use cases (mostly ETL). It just seems like Spark wins because it's most used, but I have no idea how to differentiate these tools beyond the general streaming/batch/real-time taglines. Secondly, assessing the "pipeline orchestrator" for our use cases, where, like Spark, Airflow usually comes out on top because of usage. Would love to read more about this.

Currently I'm reading Designing Data-Intensive Applications by Kleppmann, which is great. I hope this will teach me the fundamentals of this space so it becomes easier to reason about different tools.

freebee16 5 years ago |

In my experience, teams operating under the "The AI Hierarchy of Needs" principles are optimized for generating white papers.

darth_avocado 5 years ago |

We want all these skills, yet, we'll give you a separate title and pay you less than a software engineer. Meanwhile front end software engineers are still software engineers and get high pay.

black_mage 5 years ago | |

I just got the 2020 stats from one of the biggest tech recruiters in my country. On every level DEs outperform SWEs on salary.

alexpetralia 5 years ago | |

I don't think data engineers are paid less than software engineers.

darth_avocado 5 years ago | | |

They are. I should know. I've worked as one for years, including at big tech companies. E.g., at FB the role has lower pay than SWE, lower RSUs, etc., and you can only get SWE pay if you transition into one, and that requires you to go through an interview process internally.

tharne 5 years ago | | |

In my experience, they get paid more.

mywittyname 5 years ago |

For GCP, our stacks tend to be Composer (Airflow), BigQuery, Cloud Functions, and Tensorflow.

There's the occasional Hadoop/Spark platform out there, but clients using those tend to have older platforms.

smattiso 5 years ago | |

What is your product? I am looking for a consultant to help me set up a good process for a data-driven hardware product.

mywittyname 5 years ago | | |

The work I do is almost entirely Google Analytics/Ads related. So probably not what you're looking for, but if so, leave your email and I'll reach out!

u678u 5 years ago |

Incidentally, does anyone have resources for SMALL data? E.g. a few MB at a time, but requiring the same ETL, scheduling, traceability. I'd love some lite versions of big-data tools, but they need to be simple, small and cheap.

master_yoda_1 5 years ago |

IMHO first you need to become a programmer, then you can become a data engineer. So if you need to start by learning data structures then you are doing something wrong. Also the topics suggested in "Algorithms & Data Structures" could easily be skipped; the information is drastically misleading. We should seriously have some fact checker, otherwise this kind of bullshit article keeps trending on HN and people keep wasting their time on learning LSM trees (what the fuck is that in the first place).

markus_zhang 5 years ago | |

Can you recommend something else? I'm preparing to go through a full CS education.

master_yoda_1 5 years ago | | |

Coding and building teach you more than taking a course or watching a video. If you don't have any programming background, you can enroll in some Coursera or Udacity courses to start with. Then go through this course http://web.stanford.edu/class/cs106x/, the course reader is really good. After that, for data engineering, read this book https://www.amazon.com/Designing-Data-Intensive-Applications.... Also learn some SQL. Take some data, feed it into a SQLite DB, ask questions, and convert each question into a query. Becoming good at this takes some time. Be patient. The learning curve is like a hockey stick: the initial phase might have a dip, but it accelerates in the later phase. BY ANY CHANCE DO NOT JOIN A BOOTCAMP.
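The SQLite exercise suggested above can be tried with nothing but Python's standard library. The sketch below is one possible shape for it: the table name `orders`, its columns, and the question ("which customer spent the most?") are made up for the example.

```python
import sqlite3


def top_customer(orders):
    """Load (customer, amount) rows into SQLite and answer one question:
    which customer has the highest total spend?"""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
    # The question, converted into a query.
    row = conn.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer "
        "ORDER BY total DESC LIMIT 1"
    ).fetchone()
    conn.close()
    return row
```

Passing `[("ann", 10.0), ("bob", 25.0), ("ann", 20.0)]` returns the customer with the largest summed amount as a `(customer, total)` tuple. Swapping in a different question means changing only the `SELECT` statement, which is the practice loop the comment describes.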

snidane 5 years ago |

In data engineering your goal is "standardization". You can't afford every team using their unique tech stack, their own databases, coding styles, etc. People leave the company all the time and you as a data engineer always end up with their mess, which now becomes your responsibility to maintain. You'd at least be grateful if those people had used the same methods to code stuff as you and your team so that you wouldn't have to become a Bletchley Park decoding expert any time someone leaves. Or you'd hope the tech stack was powerful and flexible enough that people other than engineer types could pick it up and maintain it themselves. They mostly cannot do that, because there is no such powerful system out there. Even when some modern ELT systems get you 80% there, you, data engineer, are still needed to bridge the gap for the 20% of the cases.

Data Engineering really comes down to being a set of hacks and workarounds, because there is no data processing system which you could use in a standardized systematic way that data analysts, engineers, scientists and anyone else could use. It's kind of a blue-collar "dirty job" of the software world, which nobody really wants to do, but which pays the highest.

There are of course other parts to it, such as managing multiple data products in a systematic way, which engineering minds seem to be best suited for. But the core of data engineering in 2020, I believe, is still implementing hacks and gluing several systems together so as to have a standardized processing system.

Snowflake or Databricks Spark bring you closest to the ideal unified system despite all their shortcomings. But still, you sometimes need to process unstructured jsons, extract stuff from html and xml files, unzip a bunch of zip archives and put them into something that these systems recognize and only then you can run sql on it. It is much better than the ETL of the past, where you really had to hack and glue 50% of the system yourself, but it is still nowhere near the ideal system in which you'd simply tell your data analysts: you can do it all yourself, I'm going to show you how. And I won't have to run and maintain a preprocessing job to munge some data into something spark recognizable for you.

It is not that difficult to imagine a world where such a system exists and data engineering is not even needed. But you can be damn sure that until this happens, this position is here to stay and will pay well, since 90% of ML and data science is data engineering and cleaning, and all these companies hired a shitton of data science and ML people who are now trying to justify their salaries by desperately trying to do data engineers' jobs.

justinzollars 5 years ago |

Amazon introduced Step Functions, which are very nice to dig into and a helpful skill for Data Engineering.

jimsparkman 5 years ago | |

With direct integrations to EMR, Lambda, and Athena, it's a great tool for building pipelines; it effectively costs nothing on its own and is completely headache free.

justinzollars 5 years ago | | |

This is exactly how I use it. I left out that important point.

airbreather 5 years ago |

Your data is only as good as your instrumentation, and you usually only get one chance to grab that data but can have many goes at processing it, so I would argue the bit not covered is the most important.

querulous 5 years ago |

i see a lot of "spark is dead" talk here. what replaces it for transforms in between something like kafka and redshift/bigquery?

llbeansandrice 5 years ago | |

I agree with both sides here. Spark was the in thing for a while so a lot of places are using it but probably don't need it and could have been better off running various SQL scripts to do some transformations. I worked on a project exactly like this where we should have used SQL scripts instead of Spark.

But I also think that a lot of enterprise pipelines went all in on spark and so now moving to something else (SQL scripts, Snowflake, etc.) just isn't worth it. So Spark is dead, long live Spark.

Nydhal 5 years ago |

Shameless plug to my much simpler (simplistic?) view of things. In this case, I think Data Engineers are the people building systems that solely focus on the data, all the data and nothing but the data.

https://www.linkedin.com/pulse/mapping-data-science-professi...

somurzakov 5 years ago |

advanced proficiency in SQL and in any scripting language of your choice (C#/powershell, python) is enough to be a data engineer on any technical stack: windows/linux, on-prem/cloud, vendor specific/opensource, literally anything.

knur 5 years ago | |

I disagree. That's not enough these days.

If you want to build anything mildly interesting, you need to have a solid background in software engineering (building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL), you need to really understand your runtime (e.g. the JVM, and how to tune it when working with massive amounts of data), and you need a bit of knowledge about infrastructure, because some of the most specialized and powerful tools do not yet have an established "way of doing things", and their stateful nature makes them different from your typical web app deployment.

Maybe if you want to become a data analyst you only need SQL, and I would still doubt it. But data engineering is a bit different.

somurzakov 5 years ago | | |

I believe what you described is a job of Platform Engineer/Systems Engineer/Data lake Architect, especially JVM aspect of it. The interesting job is in the beginning when you build the cluster initially, or do major extension, after that the ops/maintenance is usually outsourced to cheap labor offshore - so this kinda job is personally not for me.

spark has dataframe API which is similar to pandas api and can be learned in one day, especially if you know python.

same for Airflow and other frameworks, it's just a fancy scheduler that anyone can pick up in a couple days.

dominotw 5 years ago | | |

> building data pipelines in Spark, Flink, etc. goes way beyond knowing SQL

What if you build your data pipelines in SQL? Curious if you have an example of a data pipeline that needs Spark?

ectoplasmaboiii 5 years ago |

Is anyone here using kdb+/q for data engineering, specifically outside of finance?

rmelhem 5 years ago |

Where I work, our stack is all about GCP/Airflow/Python/BigQuery ML, for recommender systems. I'm now playing around with Turicreate (Apple) to compare with BQML.

cargoshipit 5 years ago |

I don't recommend it