How to learn data science (dataquest.io)
I've seen many submissions to Hacker News and Reddit's /r/dataisbeautiful subreddit where the author goes "look, the analysis supports my conclusion and the R^2 is high, therefore this is a good analysis!" without addressing the assumptions required for those results.
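To make the point about a high R^2 concrete, here is a minimal sketch in plain Python. The data are made up for illustration (a hypothetical y = x^2 relationship), and the straight-line fit is deliberately the wrong model; none of this comes from any of the submissions mentioned above.

```python
# Illustrative only: a straight line fitted to data that is actually quadratic.
xs = list(range(11))            # 0, 1, ..., 10
ys = [x * x for x in xs]        # the true relationship is y = x^2, not linear

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares for y = intercept + slope * x
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residuals and the coefficient of determination
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
ss_res = sum(r * r for r in residuals)
ss_tot = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.3f}")
print("residuals:", [round(r, 1) for r in residuals])
```

The fit reports an R^2 of roughly 0.93, yet the residuals are positive at both ends and negative in the middle. That systematic pattern is what a check of the linearity assumption would catch, and what the R^2 alone hides.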
Of course, not everyone has a strong statistical background. Except I've seen YC-funded big data startups and venture capitalists, who should really, really know better, commit the same mistakes.
"Data science" is a buzzword that is successful only due to obscurity and no one actually caring whether the statistics are valid. That's why I've been attempting to open source all my statistical analyses/visualizations, with detailed steps on how to reproduce them. (See my recent /r/dataisbeautiful submissions on reddit: https://www.reddit.com/user/minimaxir/submitted/ )
If you want to do data science for real:
1. Get a Master's or PhD in statistics, computer science, economics, physics or some other heavy field and specialize in data analysis in that field. You must learn lots of statistics when doing so.
2. Learn programming, statistical machine learning and tools of the trade.
Good data science is not based on collecting large amounts of data passively and then mining it mindlessly. You need to ask the right questions and design the data collection and modeling process based on those questions.
This resonates. That is, picking and designing features. Also, understanding dependent variables and knowing how to test for that; getting it wrong is the biggest mistake leading to flawed conclusions that I see from the 'general public'.
If you want to be a useful data scientist, do a lot of work with data. If you have strong programming skills and are flexible and a quick learner then you will do well.
Spending the better part of your young adulthood getting a phd in statistics, unless you want to go into academia, just makes you look like a fool.
"Data science" at its core is just statistical analysis, but it has been slowly morphing over the past few decades thanks to the budding field of machine learning and the commoditization of computing power. This has drastically changed the field of statistical research, and although the underlying math is the same, the tools and the amount of data are constantly in flux. Someone along the way must have felt that this evolution of statistical analysis needed a new name. In all honesty, it's just a name, and it doesn't matter. What matters is whether you understand how to use it.
This article reads like a way to find yourself in the danger zone.
I think the key is to find the mathematics and statistics interesting because you want the [data] science to be meaningful. If that's a driving force, then you can learn math and statistics on your own (like the author did). Otherwise, yes--you will find yourself in the danger zone.
The other side of this is that some businesses (especially SMBs) are so horrible at utilizing their data that very basic analyses can reap big gains (80/20 rule!). For the vast majority of businesses there is no need for elaborate models or machine learning techniques.
They all seem very well-presented[0], but I can't help but ask - what do you do with this new information?
As shown, the distribution of durations in Music videos is much, much different than all other categories. As a result, it skews nearly every other analysis and I may have to exclude videos from the Music category entirely.
First off, data science == fancy name for data mining/analysis. Wanted to clear that up due to the buzzwordy nature of "data science."
Learn SQL - this is the big one. You must be proficient with SQL to be effective at data science. Whether it's running on an RDBMS or translating to map/reduce (Hive) or DAG (Spark), SQL is invaluable. If you don't know what those acronyms mean yet, don't worry. Just learn SQL.
Learn to communicate insights - I would add here to try some UI techniques. Highcharts, d3.js, these are good libraries for telling your data story. You can also do a ton just with Excel and not need to write any code beyond what you wrote for the mining portion (usually SQL).
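For anyone wondering what "just learn SQL" looks like in practice, here is a small sketch using Python's built-in sqlite3 module so it runs without a database server. The `orders` table, its columns, and the values are hypothetical, invented purely for illustration; the same GROUP BY query shape should carry over to Postgres, MySQL or Hive with little change.

```python
import sqlite3

# Hypothetical table and values, made up purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("alice", 35.0), ("bob", 5.0), ("carol", 50.0), ("carol", 10.0)],
)

# Aggregate per customer and rank by total spend.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()

for customer, n_orders, total in rows:
    print(customer, n_orders, total)

conn.close()
```

A grouped, sorted summary like this is often all the "mining" a chart or an Excel sheet needs as input.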
I would also go back to basics with regards to statistical techniques. Start with your simple Z Score, this is such an important tool in your data science toolbox. If you're just looking at raw numbers, try to Z-normalize the data and see what happens. You'd be surprised what you can achieve with a high school statistics textbook, Postgres/MySQL (or even Excel!), and a moderate-sized data set. These are powerful enough to answer the majority of your questions, and when they fail then move on to more sexy algorithms.
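As a rough sketch of the Z-normalization idea above, using only the Python standard library: the visit counts are made up, and the cutoff of 2 standard deviations is an arbitrary choice for illustration, not a rule.

```python
import statistics

# Made-up daily visit counts; the last value is deliberately unusual.
visits = [102, 98, 110, 95, 105, 99, 101, 180]

mean = statistics.mean(visits)
stdev = statistics.stdev(visits)   # sample standard deviation (n - 1 denominator)

# Z-score: how many standard deviations each value sits from the mean.
z_scores = [(v - mean) / stdev for v in visits]

for v, z in zip(visits, z_scores):
    flag = "  <-- unusual" if abs(z) > 2 else ""
    print(f"{v:>4}  z = {z:+.2f}{flag}")
```

With these numbers only the last value lands more than two standard deviations from the mean. Once values are expressed as z-scores, columns measured in different units can also be compared on the same scale.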
Edit: one more thing I forgot to mention. After SQL, learn Python. There are a ton of libraries in the Python ecosystem that are perfect for data science (numpy, scipy, scikit-learn, etc). It's also one of the top languages used in academic settings. My preferred data science workspace involves Python, IPython Notebook, and Pandas (This book is quite good: http://www.amazon.com/Python-Data-Analysis-Wrangling-IPython...)
I also generally dislike /r/machinelearning and /r/statistics because they seem to have been taken over by people who will tell you to either get a PhD or get out. But, for me, just learning whatever I thought I needed to help me solve the problem at hand got me stuck really fast. There's so much statistics where you really just have to learn it first before you can start to see when and why you'd like to use it.
It never occurred to me to use hierarchical modeling and partial pooling for a certain set of problems until after I'd read Gelman & Hill. I never thought that inference on a changing process might require different techniques from the techniques for stationary processes until I had to study Hidden Markov Models for an exam. Heck, when I got started with data analysis I didn't even realize that the accuracy of most statistics improves proportional to sqrt(n) and so the next logical step in my mind was always "get more data!" instead of "learn more about statistics!" (If you look at the industry's obsession with unsampled data, data warehouses that store absolutely everything ever and map/reduce, my hunch is I'm not the only one who lacks or at some point lacked elementary statistical knowledge because it just never came up on their self-motivated, self-directed learning path.)
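The sqrt(n) point is easy to check with a small simulation. This is only a sketch under stated assumptions: the data are draws from a standard normal, the statistic is the sample mean, and the function name, seed and trial count are arbitrary choices for illustration.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def spread_of_sample_mean(n, trials=500):
    """Empirical standard deviation of the mean of n standard-normal draws."""
    means = [
        statistics.fmean(random.gauss(0, 1) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (10, 100, 1000):
    observed = spread_of_sample_mean(n)
    theory = 1 / math.sqrt(n)      # sigma / sqrt(n) with sigma = 1
    print(f"n = {n:>5}  observed = {observed:.4f}  theory = {theory:.4f}")
```

The observed spread should track 1/sqrt(n) closely, so going from 10 to 1000 observations (100 times the data) narrows the error by only about a factor of 10. That is the arithmetic behind "get more data" running out of steam quickly.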
So I think the ideal learning path incorporates a bit of both: learn more about what excites you and about what's immediately useful right now, but also put aside some time to fill out gaps in your knowledge – even things that don't immediately look useful – and make some time for fundamental/theoretical study.
(x-posted from DataTau)
Check out this too: http://www.pyquantnews.com/
It is an art, like writing awesome code. Practice, practice, and working with experienced people are key.
I'm surprised that both you and OP seem to think this advice is rare, as I've personally seen it mentioned more than a few times. Professors always brainstorm ways to motivate students, employers always seek methods to motivate employees, etc. The answer always seems to be something along the lines of 'do what you love', which is so overused it loses its impact.
Anyway, as someone new to data science, I did not feel like I gained any new information after reading the article, and all the advice seems either intuitive or rehashed. Looking forward to reading the HN discussion though.
Uhh.. You must have quite a fortunate streak of having great teachers and great employers. Speaking for myself, I had some great teachers, but finding a boss who seeks to motivate employees with the right challenges is rare. Bosses and companies generally assume your salary is the primary motivator.
I have been going through it and I cannot think of a better resource.
For those wondering why I put my buzzwords in quotes, it's because I don't want to sound like I'm a huge proponent of either of them. CT is the term I use to describe how I teach my students about abstractions, algorithms, and some programming. DS is the term I use to describe how students learn all of that in the context of working with data related to their own majors. I'm not trying to claim some crazy paradigm shift, just that it's a great way to convince students that CS is useful to them.
If you don't understand how your mind sees, processes, retains and recalls data... how can you possibly analyze it accurately?
That said, as someone who worked in software engineering for 5 years without a degree, and recently returned to school, I would say be careful not to discount studying theory at the same time you're practicing your craft. I really think a combined approach of structured university courses and MOOCs, including reading textbooks, along with applying the knowledge has been the best approach for me.
I was arrogant about "not needing" a degree for years, feeling justified by the fact that I was making very valuable contributions as an engineer, until I finally went back to school and realized how valuable theoretical knowledge can be.
Am I missing something or is it just a new word?
With that said, it's not really a new thing; people have been doing data science for decades. The demand for people who can program and also do more complex statistical modeling has skyrocketed, so I think that's why there's a new name for it now.
Part of the problem is that even with this definition there's a wide range of abilities present in data scientists. A long time computer programmer who has dabbled in statistics and a long time statistician who has dabbled in computer programming would both be data scientists even though they bring very different strengths to the table.
It is about creating a linear and logistic regression + PCA using Spark (Python API).
Matrix row rank and column rank are equal.
In matrix theory, the polar decomposition.
Each Hermitian matrix has an orthogonal basis of eigenvectors.
Weak law of large numbers.
Strong law of large numbers.
The Radon-Nikodym theorem and conditional expectation.
Sample mean and variance are sufficient statistics for independent, identically distributed samples from a univariate Gaussian distribution.
The Neyman-Pearson lemma.
The Cramer-Rao lower bound.
The martingale convergence theorem.
Convergence results of Markov chains.
Markov processes in continuous time.
The law of the iterated logarithm.
The Lindeberg-Feller version of the central limit theorem.
The normal equations of linear regression analysis.
Non-parametric statistical hypothesis tests.
Power spectral estimation of second order, stationary stochastic processes.
Resampling plans.
Unbiased estimation.
Minimum variance estimation.
Maximum likelihood estimation.
Uniform minimum variance unbiased estimation.
Wiener filtering.
Kalman filtering.
Autoregressive moving average (ARMA) processes.
Rank statistics are always sufficient.
Farkas lemma.
Minimum spanning trees on directed graphs.
The simplex algorithm of linear programming.
Column generation in linear programming (Gilmore-Gomory).
The simplex algorithm for min cost capacitated network flows.
Conjugate gradients.
The Kuhn-Tucker conditions.
Constraint qualifications for the Kuhn-Tucker conditions.
Fourier series.
The Fourier transform.
Hilbert space.
Banach space.
Quasi-Newton iteration and updates, e.g., Broyden-Fletcher-Goldfarb-Shanno.
Orthogonal polynomials for numerically stable polynomial curve fitting.
Lagrange multipliers.
The Pontryagin maximum principle.
Quadratic programming.
Convex programming.
Multi-objective programming.
Integer linear programming.
Deterministic dynamic programming.
Stochastic dynamic programming.
The linear-quadratic-Gaussian case of dynamic programming.
Real science requires a creative and critical mind, which takes years to mold.
You've got to start with questions to get answers, and the hard part of science isn't crunching data; it is asking the right question!
BTW, you can make interactive visualizations in pure python with bokeh: http://bokeh.pydata.org/en/latest/
Also with Blaze, you can use Pandas (or even Dplyr) syntax in python to query Hive, Spark and other large stores. http://blaze.pydata.org/en/latest/
Someone else mentioned Gelman's blog. That's a great place to find evidence that PhDs do not lead to an increased ability to ferret out "truth" or insight from data. In many cases they just hide the mistakes so that others without that background don't know they're being misled.
What I am trying to ask is how do you become good at setting your starting point (formulating your hypotheses), communicating your insights and selecting which tools apply where. If you are good at coding and have experience in things related to computer science, you have the abilities to handle a dataset (SQL knowledge) and the data tools (Python, Pandas, etc.), but that doesn't earn you the title of data scientist.
So, do I consider myself a data scientist? Absolutely not. But I do understand basic statistical concepts and know how to apply them to several categories of real-life data analysis problems.
I'm a terrible coder, btw.
Unless one has the "data scientist" title just to make "database engineer" look fancier, data comes in various shapes and forms. And most questions cannot be answered with a simple aggregation.
For example, the data I work on (I am a freelance data scientist) is flat CSV files, XLS files, JSON files, some text files I need to parse, various SQL, MongoDB, things I am getting from various APIs, etc...
While understanding joins is crucial (and normal forms, etc.), SQL itself takes a negligible amount of my time (and effort).
This was my experience when I worked as a data scientist about a year ago. Now, YMMV, especially if you're a freelancer; I guess your clients are more comfortable with giving you raw dumps of data as files instead of giving you access to their database servers.
But back then I couldn't get any of my managers to understand or appreciate what I was doing. Fickle finger of fate.
Not really. The SVD is much more important. No. Yes. Yes. No (R-N) yes (CE). Yes. Yes. Yes. Personally, no. Only in the usage of MCMC. Yes. Yes. No. Of course. All the time. Yes. Yes. The most I'll do is remember to use the sample standard deviation. No. Yes. No. Yes. Yes. Yes. No. No. Yes. I just use a solver. See above. See above. Of course. Yes. Yes. Not privileged w/r/t/ other bases. Of course. I've never needed it. Ditto. As another tool in the toolbox. They would not be my first or second choice. Yes. No. No. Yes. No. Yes. Yes. Yes. No.
Neveu is elegant beyond belief, but Breiman, Probability, the SIAM book, available in paperback, is darned good, usually easier than Neveu, less elegant, closer to applications, and without some of the special Tulcea material in the back of Neveu. K. L. Chung also has a good, comparable book. Even if you want Neveu to be your main probability book, which is fine, likely you should have alternative treatments.
Of course, there is Loeve, Probability -- written in English but somehow sounding like French. It has a lot, a little too much, but I liked the topics I studied in it. It turns out, Neveu and Breiman were both Loeve students.
Halmos, Measure Theory, is darned fun to read: It has the three series theorem and a famous exercise on regular conditional probabilities.
I learned the stuff from a course by A. Karr, a star student of E. Cinlar. Karr's course was the best course of any kind I ever took in school. Powerful material, beautifully presented, each day it was a shame to erase the board.
The exercises in Neveu are usually harder than the ones in Halmos, Breiman, and Chung.
Neveu makes probability a crown jewel of civilization.
The summer after Karr's course, I sat in the library for six weeks and walked out with a 50 page manuscript that was all the research and the first draft of my dissertation. Net, probability at the level of Neveu is darned powerful stuff, makes a lot in research, and research for applications, really easy -- that is, you really know just what the heck you are doing and can knock off new results having fun sitting in bed next to your wife while she watches TV (warning -- not gender neutral!).
What I've outlined is sometimes just called graduate probability. The biggest difference is that the whole subject makes daily use of measure theory.
I don't know how much you need in probability before starting on graduate probability. In my case, graduate probability was my first serious study of probability, and I never felt that I was not prepared.
But in my career I'd done a lot of practical work in both probability and statistics -- e.g., multivariate statistics, hypothesis testing, stochastic processes, digital filtering, the fast Fourier transform, beam forming (a case of antenna theory), power spectral estimation (US Navy sonar type stuff), how to get the central limit theorem out of digital filtering, and more, random number generation, etc. That work was plenty of intuitive background for graduate probability.
But in much of that work I was struggling due to what, really, at that level, is commonly weak basic knowledge of probability. So, after those struggles, seeing graduate probability be all clean and powerful was great.
I can't advise on just how much elementary probability you might need to have enough intuition to be comfortable with graduate probability. I will say, you do need both the intuitive experience and also the solid math.
I feel sorry for people who work in prob/stat without a background in grad prob: The elementary stuff is too often just confused from poor understanding from a poor background.
The sources I mentioned above were really the first sources from which I did any real study. Net, the elementary material of prob/stat is really too simple to be taken very seriously. So, for your first serious effort, just go for graduate probability from the sources above.
The Neveu, etc., material is much of the foundation for the secret sauce of my startup.
Data science is not like security. There it is more accepted that good engineers/researchers do not necessarily have the best accreditation. It seems that data science/engineering is turning around to this though.
It's not that autodidacts cannot build bridges; it is that the people with the data and money do not want their bridges built by autodidacts.
Anyway... back to studying http://statweb.stanford.edu/~tibs/ElemStatLearn/ for me :).
Of course, sometimes I am given SQL access to a server; but I never learnt SQL except in action (i.e. the things I needed right then).
And most of the time I work with flat files. Even if they come from SQL, they typically need serious preprocessing before I can do a more advanced analysis.
BTW: I have no problems with composing rather advanced queries. It's just that if SQL is a problem for someone (and, in case of doubt, can't be Googled in no time), then I am curious how they can get machine learning.
Just having a PhD will open doors for you that would otherwise be shut. But before pursuing that degree, you should be confident that you enjoy working in the field and want to devote your career to it. Also, you have to be prepared to work hard, not just to get the degree, but then to land a job where you'll put that experience to use. Otherwise, you'll be sharing a cubicle with DataWorker and feeling like a fool.
That said, if you don't know whether you need a PhD, that means you probably don't know what kinds of problem you want to work on. And in that case, there's a good chance you'll end up working on a problem that only interests your advisor and nobody else (most PhD advisors have more students than they have good problems to work on). In that case, I wouldn't recommend it.
Don't you want a colleague who is able to mention seminal papers for specific problems? Who is able to read and understand these papers and can distill useful features and optimizations from them?
People with PhDs who go into business usually end up in the better positions. They hire other PhDs for the good positions to keep the signal (mastery of the content) stronger.
As someone who did a lot of work with data I have little problem with my usefulness, but a lot of problems opening doors to the really interesting data companies (lacking a proper academic network). I wish I had gotten that PhD, because right now applying to Google, Microsoft, Facebook, Yahoo or eBay for data science positions makes me look like a fool.
This is why you are a DataWorker, and not a dataScientist.
Anyone can push bits around. It takes a trained mind to corral them using careful experimentation and observation.
Regarding the post above, it's right. A data scientist is someone better at statistics (classical stats, Bayesian, machine learning) than a computer scientist, and better at programming (SQL, R/Python for building models) than an academic statistician. Plus a teaspoon of visualization (ggplot or d3).
I'm trying to work on being less jaded about it, and not letting my annoyance with the-new-trendy-thing-that-i-remember-doing-years-ago-under-a-different-name get in the way of learning new technology and new lessons.
But it's a struggle.
Theory can obviously be very useful, but much of this stress on advanced statistics and PhDs is just a smokescreen for academics who suck at programming.
If you can't program and manipulate data, statistics won't save you because you won't have the ability to dig deep enough to find valuable insights. On the other side, if you know how to slice and dice data quickly and reliably, you can learn a huge amount by applying only the simplest statistical techniques. Generally the simple techniques are better anyway because they make mistakes less likely and your findings are easier to communicate.
Questions don't magically come out of a data set. Expecting them to is called a fishing expedition and usually results in boring, descriptive results which have no impact.
To answer impactful questions, you must go into your data collection with the questions in mind. To understand what questions to ask, you need a trained, critical, and creative mind. That is something you don't get from pushing bits.
>If you can't program and manipulate data
Programming and manipulating data is easy. Almost every new statistician these days can, and does, do this routinely.
What's hard is the years of intuition about what is meaningful and what is noise.
I know. It's hard to hear, and career programmers most of all hate to hear it, but it's the truth.
I'm not really sure how to respond to the idea that exploring a dataset isn't a useful way to help develop questions about it. It's only a "fishing expedition" if you have no idea what you're doing.
Care to share other references you like? Real & complex analysis and algebra, in particular, are most welcome.
I've mentioned books I've spent at least some significant time with.
There are lots more books on my shelves that look good, have good recommendations, etc. but I haven't paid much attention to.
My interest in algebra is a bit meager -- I'm not seriously interested in number theory, algebraic geometry, algebraic topology, etc.
For real analysis, the books I mentioned seem to me to provide really good sources. Of course there is much more to analysis, e.g., functional analysis. And there's a lot to stochastic processes. And much more to math.
Development of a world-class application is difficult because of the complexity built into a program of large scope.
Knowing enough programming to competently move a data set around is easy. Hell, you could do most of it with just bash.
>I'm not really sure how to respond to the idea that exploring a dataset isn't a useful way to help develop questions about it. It's only a "fishing expedition" if you have no idea what you're doing.
Well, I've seen a lot of it, in both science and business: people who spend a lot of time and money to generate a large data set simply because they lack a question to ask. They expect meaningful answers to just tumble out of it like manna from heaven, and end up confused and dismayed when the answers aren't impactful.
Fishing expeditions are looked down upon because they can only describe the data you generated. That is minimally useful, and can be done without grabbing a huge sample.
Good science starts with a question, then puts data to work to create new insight by removing confounding factors through careful design.