How to learn data science

248 points by spYa 10 years ago | 81 comments

minimaxir 10 years ago |

The actual problem with learning "data science" is making inferences and conclusions which do not violate the laws of statistics.

I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful subreddit where the author goes "look, the analysis supports my conclusion and the R^2 is high, therefore this is a good analysis!" without addressing the assumptions required for those results.
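A minimal sketch of how a high R^2 can coexist with a violated assumption. The data are made up for illustration: y is exactly quadratic in x, yet a straight-line fit still gets an R^2 of about 0.93. The residuals are positive at both ends and negative in the middle, which is the kind of pattern the R^2 alone does not reveal.

```python
# Illustrative, made-up data: y is exactly quadratic in x,
# but we (wrongly) fit a straight line to it.
xs = list(range(11))
ys = [x * x for x in xs]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for y = intercept + slope * x
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
# A U-shaped residual pattern signals that the linear model is misspecified.
print("residuals:", [round(r, 1) for r in residuals])
```

Plotting or simply printing the residuals is a cheap first check before trusting a fit.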

Of course, not everyone has a strong statistical background. But I've seen YC-funded big data startups and venture capitalists, who should really, really know better, commit the same mistakes.

"Data science" is a buzzword that is successful only due to obscurity and no one actually caring if the statistics are valid. That's why I've been attempting to open source all my statistical analyses/visualizations, with detailed steps on how to reproduce them. (see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/ )

nabla9 10 years ago | |

Roughly 80% of the data scientists I know have a PhD in something very math heavy. The rest have master's degrees. There are programmers who can assist them with the grunt work, but it's just basic programming to help analysts crunch data.

If you want to do data science for real:

1. Get a Master's or PhD in statistics, computer science, economics, physics or some other math-heavy field and specialize in data analysis in that field. You must learn lots of statistics when doing so.

2. Learn programming, statistical machine learning and tools of the trade.

Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.

dworin 10 years ago | | |

I've seen a huge range in the people calling themselves data scientists. Some have very analytically intensive academic degrees, others just finished a data science boot camp, and there are a lot of people that used to be called 'business analysts' who are basically doing the same job with a fancier title. In every group, I've had people tell me that what they're doing is really data science, because data science needs the (academic|integrated|business) perspective that they have, and what the other people are doing isn't really data science.

whistlerbrk 10 years ago | | |

> Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask right questions and design data collection and modeling process based on those questions.

This resonates. That is, picking and designing features. Also understanding dependent variables and knowing how to test for them; getting this wrong is the biggest mistake leading to flawed conclusions that I see from the 'general public'.

qq66 10 years ago | | |

Academic credentials aren't enough; good data-driven decision-making is as much an art as an academic discipline. A p-value of .01 is a Nobel Prize in medicine and unpublishable in physics -- domain knowledge is important to have a feel for the difference.
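A rough sketch of how far apart those thresholds are. The comment does not name a physics threshold; this assumes the "5 sigma" convention used for discoveries in particle physics and converts sigma levels to one-sided p-values under a standard normal distribution. The function name is made up for illustration.

```python
import math

def one_sided_p(sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond `sigma` standard deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))

# 5 sigma is the assumed physics reference point; 2 and 3 sigma are shown for scale.
for sigma in (2.0, 3.0, 5.0):
    print(f"{sigma:.0f} sigma -> one-sided p = {one_sided_p(sigma):.2e}")
```

Under these assumptions, 5 sigma corresponds to a p-value of roughly 3 in 10 million, several orders of magnitude stricter than .01.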

decisiveness 10 years ago | | |

Assuming that smart autodidacts can't obtain sound statistics knowledge is selling many people short.

DataWorker 10 years ago | | |

No. A phd in statistics or economics means almost nothing at this point. Even if it did, truly, signal mastery of the content, which it doesn't anymore, it would signal to most people who do this kind of work that you're way overqualified while simultaneously being totally ignorant of the day-to-day work of actual data scientists.

If you want to be a useful data scientist, do a lot of work with data. If you have strong programming skills and are flexible and a quick learner then you will do well.

Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.

washedup 10 years ago | |

It's dangerous to make big generalizations like "no one actually caring if the statistics are valid." This simply is not true. Sure, a lot of what you see on /r/dataisbeautiful is garbage, but that's because it's an open forum where anyone can show what they think they have found. Usually, whenever someone makes an egregious statistical error, they are called out for it. Of course, the same happens on larger scales and even in published research.

"Data science" at its core is just statistical analysis, but it has been slowly morphing over the past few decades thanks to the budding field of machine learning and the commoditization of computing power. This has drastically changed the field of statistical research, and although the underlying math is the same, the tools and the amount of data are constantly in flux. Someone along the way must have felt that this evolution of statistical analysis needed a new name. In all honesty, it's just a name, and it doesn't matter. What matters is if you understand how to use it.

probdist 10 years ago | |

The classic venn diagram of data science is still helpful: http://drewconway.com/zia/2013/3/26/the-data-science-venn-di...

This article reads like a way to find yourself in the danger zone.

jwuphysics 10 years ago | | |

I've never seen this venn diagram before--thanks for bringing it up. I find, as an academic (pursuing a Ph.D. in astrophysics), that plenty of traditional researchers are able to hack together code (many haven't ever taken a formal programming course; http://arxiv.org/abs/1507.03989) but many also misuse or can't interpret statistics (from personal experience). That puts us in the danger zone!

I think the key is to find the mathematics and statistics interesting because you want the [data] science to be meaningful. If that's a driving force, then you can learn math and statistics on your own (like the author did). Otherwise, yes--you will find yourself in the danger zone.

ForHackernews 10 years ago | | |

What if you just want to get paid well to play with interesting tools?

facepalm 10 years ago | |

I assume data science is often used in place of astrology. People just want to have something to cling to, to get over their fears and insecurity. So if you can generate some reassuring graphs, who cares if they are based on solid statistics or not?

forgetsusername 10 years ago | | |

>People just want to have something to cling to, to get over their fears and insecurity.

The other side of this is that some businesses (especially SMBs) are so horrible at utilizing their data that very basic analyses can reap big gains (80/20 rule!). For the vast majority of businesses there is no need for elaborate models or machine learning techniques.

liviu- 10 years ago | |

>see my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/

They all seem very well-presented[0], but I can't help but ask - what do you do with this new information?

[0] https://i.imgur.com/PvWYB2n.png

minimaxir 10 years ago | | |

I am planning a blog post on the relationship between YouTube video duration and other statistics. So I did a little exploratory analysis to validate the data.

As shown, the distribution of durations in Music videos is much, much different than all other categories. As a result, it skews nearly every other analysis and I may have to exclude videos from the Music category entirely.

curiousjorge 10 years ago | |

What you described is prevalent in the social sciences: tons of biases, causation/correlation errors, and blind trust in some arcane statistical analysis without knowing what it really means. Conclusion: "statistically significant" is the magic phrase peppered throughout academic literature.

achompas 10 years ago | | |

Agreed. Fortunately, excellent social scientists really care about this -- see Andy Gelman's blog for many rants on this topic.

Balgair 10 years ago | | |

I'll echo that for biology and sports science (think Moneyball).

http://www.wired.com/2009/09/fmrisalmon/

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1182327/

pvnick 10 years ago |

Good article for beginners. A couple thoughts, just to build on what the author said:

First off, data science == fancy name for data mining/analysis. Wanted to clear that up due to the buzzwordy nature of "data science."

Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL.
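For a feel of the kind of SQL this refers to, here is a small sketch run through Python's built-in sqlite3 module so it needs no database server. The `videos` table, its columns, and the rows are made up for illustration; the same GROUP BY query would look much the same on Postgres, MySQL, or Hive.

```python
import sqlite3

# In-memory database with a hypothetical table of video durations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE videos (category TEXT, duration_sec INTEGER)")
conn.executemany(
    "INSERT INTO videos VALUES (?, ?)",
    [
        ("Music", 240),
        ("Music", 210),
        ("Gaming", 900),
        ("Gaming", 1500),
        ("News", 120),
    ],
)

# Count and average duration per category, longest average first.
rows = conn.execute(
    """
    SELECT category, COUNT(*) AS n, AVG(duration_sec) AS avg_duration
    FROM videos
    GROUP BY category
    ORDER BY avg_duration DESC
    """
).fetchall()
conn.close()

for category, n, avg_duration in rows:
    print(category, n, avg_duration)
```

Aggregations like this cover a large share of day-to-day questions before any modeling is needed.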

Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL).

I would also go back to basics with regard to statistical techniques. Start with the simple z-score; it is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail, move on to sexier algorithms.
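As a minimal sketch of z-normalization: subtract the mean and divide by the standard deviation, so each value is expressed as "how many standard deviations from average". The numbers are made up, and the population standard deviation is used here as an assumption; the sample version (`statistics.stdev`) is an equally reasonable choice.

```python
import statistics

values = [12.0, 15.0, 11.0, 14.0, 13.0, 40.0]  # made-up raw numbers

mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation (assumed choice)

# z-score: distance from the mean in units of standard deviation
z_scores = [(v - mean) / stdev for v in values]

for v, z in zip(values, z_scores):
    print(f"{v:6.1f} -> z = {z:+.2f}")
```

After normalization the outlier (40.0) stands out with a z-score above 2, while the other values sit within about one standard deviation of the mean.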

Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...)

stdbrouw 10 years ago |

Definitely agree that those long lists that tell you to first become awesome at combinatorics, linear algebra, then learn all about statistical inference (that is, not actual statistical procedures but the mathematical underpinnings of statistics that would enable you to construct and evaluate methods you invent yourself), then move on to stochastic optimization... those are really more about machismo than about actually helping people to learn data science. Sure, linear algebra is helpful, but whether it's fundamental really depends on the kind of data science you're keen to do.

I also generally dislike /r/machinelearning and /r/statistics because they seem to have been taken over by people who will tell you to either get a PhD or get out. But, for me, just learning whatever I thought I needed to help me solve the problem at hand got me stuck really fast. There's so much statistics where you really just have to learn it first before you can start to see when and why you'd like to use it.

It never occurred to me to use hierarchical modeling and partial pooling for a certain set of problems until after I'd read Gelman & Hill. I never thought that inference on a changing process might require different techniques from the techniques for stationary processes until I had to study Hidden Markov Models for an exam. Heck, when I got started with data analysis I didn't even realize that the accuracy of most statistics improves proportional to sqrt(n) and so the next logical step in my mind was always "get more data!" instead of "learn more about statistics!" (If you look at the industry's obsession with unsampled data, data warehouses that store absolutely everything ever and map/reduce, my hunch is I'm not the only one who lacks or at some point lacked elementary statistical knowledge because it just never came up on their self-motivated, self-directed learning path.)
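The sqrt(n) point can be made concrete with the standard error of a sample mean, which for independent observations is sigma / sqrt(n). A small sketch, with a made-up sigma and a hypothetical helper name:

```python
import math

def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean of n independent observations with std dev sigma."""
    return sigma / math.sqrt(n)

sigma = 10.0  # assumed population standard deviation, for illustration only
for n in (100, 400, 10_000):
    print(f"n = {n:>6}: standard error = {standard_error(sigma, n):.3f}")
```

Quadrupling the data from 100 to 400 rows only halves the error, and going from 100 to 10,000 rows buys a factor of ten, which is why "get more data" runs into diminishing returns.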

So I think the ideal learning path incorporates a bit of both: learn more about what excites you and about what's immediately useful right now, but also put aside some time to fill out gaps in your knowledge – even things that don't immediately look useful – and make some time for fundamental/theoretical study.

(x-posted from DataTau)

mikedmiked 10 years ago | |

+1 for DataTau, I didn't know about that.

http://www.datatau.com/

Check out this too: http://www.pyquantnews.com/

lessthunk 10 years ago |

Data science is a stupid buzzword. The ideal candidate knows enough about IT to massage data, and some statistics for sure; the more they know about the domain under investigation, the better. Most of all, always do sanity checks: does it make sense? Can it be? Is the data correct?

It is an art, like writing awesome code. Practice, practice, and working with experienced people are key.

SG- 10 years ago | |

At least it's not "something engineer" like every other job in the tech field.

neovive 10 years ago |

"You need something that will motivate you to keep learning." This is so true and often forgotten. I am always learning new things, but the concepts that stick, beyond just the basics, are tied to specific projects or solutions to real problems. I'm typically ok with being a "jack-of-all-trades" for most technologies, just to stay aware of new things. However, when it comes to applying new concepts, skills, or tech to solve problems, a deeper understanding is required; usually obtained through motivation.

liviu- 10 years ago | |

>"You need something that will motivate you to keep learning." This is so true and often forgotten.

I'm surprised that both you and OP seem to think this advice is rare, as I've personally seen it mentioned more than a few times. Professors always brainstorm ways to motivate students, employers always seek methods to motivate employees, etc. The answer always seems to be something along the lines of 'do what you love', which is so overused it loses its impact.

Anyway, as someone new to data science, I did not feel like I gained any new information after reading the article, and all the advice seems either intuitive or rehashed. Looking forward to reading the HN discussion though.

studentrob 10 years ago | | |

> Professors always brainstorm ways to motivate students, employers always seek methods to motivate employees

Uhh.. You must have quite a fortunate streak of having great teachers and great employers. Speaking for myself, I had some great teachers, but finding a boss who seeks to motivate employees with the right challenges is rare. Bosses and companies generally assume your salary is the primary motivator.

washedup 10 years ago |

If you want to learn about data science, read this book: http://www-bcf.usc.edu/~gareth/ISL/

I have been going through it and I cannot think of a better resource.

acbart 10 years ago |

As someone who uses "Data Science" to teach "Computational Thinking", I think this blog post hits on a lot of really valuable pedagogical notes: getting motivated, learning things through doing, and having a strong context for your learning.

For those wondering why I put my buzzwords in quotes, it's because I don't want to sound like I'm a huge proponent of either of them. CT is the term I use to describe how I teach my students about abstractions, algorithms, and some programming. DS is the term I use to describe how students learn all of that in the context of working with data related to their own majors. I'm not trying to claim some crazy paradigm shift, just that it's a great way to convince students that CS is useful to them.

davemel37 10 years ago |

Anyone interested in data science should first study cognitive psychology. The CIA has a manual on the psychology of intelligence analysis that is a must read for anyone pursuing any analytical job.

If you don't understand how your mind sees, processes, retains and recalls data... how can you possibly analyze it accurately?

cwyers 10 years ago | |

You have a link to where to obtain said manual?

davemel37 10 years ago | | |

https://www.cia.gov/library/center-for-the-study-of-intellig...

anderspitman 10 years ago |

These principles are useful when learning anything really: human language (immersion), programming (build something), sports (practice), etc.

That said, as someone who worked in software engineering for 5 years without a degree, and recently returned to school, I would say be careful not to discount studying theory at the same time you're practicing your craft. I really think a combined approach of structured university courses and MOOCs, including reading textbooks, along with applying the knowledge has been the best approach for me.

I was arrogant about "not needing" a degree for years, feeling justified by the fact that I was making very valuable contributions as an engineer, until I finally went back to school and realized how valuable theoretical knowledge can be.

tyfon 10 years ago |

I've been working as an analyst for 7 years, and it's only in the last couple of years that I've heard statistical analysis referred to as data science.

Am I missing something or is it just a new word?

noelsusman 10 years ago | |

In many ways it's just a new word for the same thing, but there are a few key differences. The main difference between traditional statistics and data science is strength in programming. Data scientists are also expected to be more well versed in statistical modeling than your average programmer or data analyst.

That said, it's not really a new thing; people have been doing data science for decades. The demand for people who can program and also do more complex statistical modeling has skyrocketed, so I think that's why there's a new name for it now.

Part of the problem is that even with this definition there's a wide range of abilities present in data scientists. A long time computer programmer who has dabbled in statistics and a long time statistician who has dabbled in computer programming would both be data scientists even though they bring very different strengths to the table.

s73v3r 10 years ago | |

No, no, they're totally disrupting the field of statistical analysis. That's why they need a new name.

gbersac 10 years ago |

I am doing this course and find it really good: https://www.edx.org/course/scalable-machine-learning-uc-berk...

It is about building linear and logistic regression plus PCA using Spark (Python API).

graycat 10 years ago |

Here are some topics. Are they considered relevant to data science?

Matrix row rank and column rank are equal.

In matrix theory, the polar decomposition.

Each Hermitian matrix has an orthogonal basis of eigenvectors.

Weak law of large numbers.

Strong law of large numbers.

The Radon-Nikodym theorem and conditional expectation.

Sample mean and variance are sufficient statistics for independent, identically distributed samples from a univariate Gaussian distribution.

The Neyman-Pearson lemma.

The Cramer-Rao lower bound.

The martingale convergence theorem.

Convergence results of Markov chains.

Markov processes in continuous time.

The law of the iterated logarithm.

The Lindeberg-Feller version of the central limit theorem.

The normal equations of linear regression analysis.

Non-parametric statistical hypothesis tests.

Power spectral estimation of second order, stationary stochastic processes.

Resampling plans.

Unbiased estimation.

Minimum variance estimation.

Maximum likelihood estimation.

Uniform minimum variance unbiased estimation.

Wiener filtering.

Kalman filtering.

Autoregressive moving average (ARMA) processes.

Rank statistics are always sufficient.

Farkas lemma.

Minimum spanning trees on directed graphs.

The simplex algorithm of linear programming.

Column generation in linear programming (Gilmore-Gomory).

The simplex algorithm for min cost capacitated network flows.

Conjugate gradients.

The Kuhn-Tucker conditions.

Constraint qualifications for the Kuhn-Tucker conditions.

Fourier series.

The Fourier transform.

Hilbert space.

Banach space.

Quasi-Newton iteration and updates, e.g., Broyden-Fletcher-Goldfarb-Shanno.

Orthogonal polynomials for numerically stable polynomial curve fitting.

Lagrange multipliers.

The Pontryagin maximum principle.

Quadratic programming.

Convex programming.

Multi-objective programming.

Integer linear programming.

Deterministic dynamic programming.

Stochastic dynamic programming.

The linear-quadratic-Gaussian case of dynamic programming.

vermontdevil 10 years ago |

Also, learning the R language and using RStudio is a great way to get into it. RStudio has so many packages to help you do any data analysis. The learning curve is quite steep though.

gtrubetskoy 10 years ago |

Read this (free) book: http://mmds.org/

searine 10 years ago |

Moving data around is just grunt work.

Real science requires a creative and critical mind, which takes years to mold.

stdbrouw 10 years ago | |

Sounds like you've also spent years molding professional disdain for everyone who's not a Real Scientist.

searine 10 years ago | | |

No I've just seen too many people spin their wheels on "analysis" that is not hypothesis driven.

You've got to start with questions to get answers, and the hard part of science isn't crunching data, it is asking the right question!

pvaldes 10 years ago |

Learn biology, chemistry and physics, question yourself often and use your instinct.

curiousjorge 10 years ago |

If I had some type of practical application that I knew could benefit from data science, the way learning RoR to make a marketplace app gives you one, it would help a lot, as I would have a clear goal and a route to achieve it. However, data science and machine learning are so broad and seemingly complicated (I fear complicated math formulas and statistics), and worse, I don't know what I want to achieve with them or what I want to make, which really hinders the learning process for me. I need some incentive or reward at the end of the goal.

sanderjd 10 years ago | |

Yes, the step 1 that nobody seems to mention is that you need to have a question that you're curious about, which data analysis may be able to help you answer. The reward is having some answer to that question, with an argument for its validity. Instead of links to a bunch of datasets, I'd love to see a site that collects questions with the potential for data-driven answers. This perhaps exists somewhere.

stickperson 10 years ago | |

Absolutely. I remember seeing an "Epic NHL goal celebration" post on here a little while ago. That was a fun read and seemed like a good project to get some exposure to ML.

http://blog.francoismaillet.com/epic-celebration/