Hidden Tips For Transitioning From Academia To Data Science

24 Apr 2017 · by bathompso · in  Data Science 

I'm hardly alone in my path from academia into the world of data science, which means there exists a multitude of guides and blog posts explaining the steps you need to take to make the same transition. Most focus on how to translate skills gained via a Ph.D. into those "necessary" for work in data science, or on how to improve your coding. These are definitely useful tips (and I'll list out some of the more conventional ones as well), but there are some questions that I almost never see in these guides, and that I think are absolutely necessary to answer before attempting any transition.

First, I'll get my list of "known" transition tips out of the way:

  • Learn a coding language. This one is obvious, and most Ph.D. students are hopefully familiar with a language like Python or R via their research. If not, there are many good textbooks and MOOCs to teach you how to code. Spend a few weeks to get to a basic level of proficiency, and then continue to improve your skills by writing programs to help with your research, or just for fun. I built a mobile game "cheat" app in Python to practice when I was still improving my Python abilities, and learned a ton by Googling syntax the entire time. Don't worry about going deep into optimizations of the language, just work on the basic problem of coding: translating a complex word problem into small steps that can be handled by code.
  • Know your basic statistics. Brush up on z- and t-tests, and know what ANOVA is. Honestly, stats knowledge is not as necessary as most transition guides make it out to be, as the most stats that a data scientist will have to know will be to validate experimental A/B test results, or spot when a stats faux pas has occurred. Beyond that, you really don't need to be a wizard at stats, and I wouldn't spend a ton of time on the subject.
  • Know basic machine learning techniques. ML is another area that guides seem to believe you need deep knowledge of, but this really depends on the roles you're interested in. If you're going into a role that will be developing ML solutions (i.e. coding these solutions up and slowly improving them), then by all means start digging through dense books and learning these algorithms from first principles. For most other data scientists, however, you'll just be applying pre-defined techniques in scikit-learn, and don't need to know the nitty-gritty details of loss functions, gradient descent, and other buzzwords. You should, however, know the high-level concepts of how each ML technique works (linear regression, logistic regression, decision trees, SVMs, random forests, k-means clustering, to name a few), and more importantly, know the assumptions of each model. Know what heteroscedasticity means.
  • Learn SQL. While not a "real" programming language, SQL is an absolute necessity in any data scientist's toolkit. There are a number of free tutorials on SQL, with practice questions, to get you up to speed. SQL is a pretty easy language, so if you know any others (as you should per the first tip in this list), SQL should come to you without any real problems.
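To give a sense of how low the bar in the stats and ML tips above actually is, most day-to-day work boils down to a couple of library calls. Here's a minimal sketch on synthetic data (all numbers, column meanings, and the A/B scenario are illustrative, not from any real experiment):

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Validating an A/B test: two-sample t-test on synthetic per-user metrics.
control = rng.normal(loc=0.10, scale=0.02, size=500)
variant = rng.normal(loc=0.11, scale=0.02, size=500)
t_stat, p_value = stats.ttest_ind(control, variant)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")

# Applying a pre-defined ML technique: logistic regression in scikit-learn.
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

The knowledge that matters here isn't deriving the t-statistic by hand; it's knowing what the test assumes (independent samples, roughly normal metrics) and what the p-value does and doesn't tell you.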
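The SQL you'll write day to day mostly follows one pattern: filter, aggregate, group, sort. You can practice it without installing anything via Python's built-in sqlite3 module; the table and column names below are made up purely for illustration:

```python
import sqlite3

# An in-memory toy "events" table standing in for a real warehouse table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (user_id INTEGER, event TEXT, revenue_cents INTEGER);
    INSERT INTO events VALUES
        (1, 'purchase', 999), (1, 'view', 0), (2, 'purchase', 499),
        (2, 'purchase', 1499), (3, 'view', 0);
""")

# The staple data-science query: filter to the rows you care about,
# aggregate per user, then sort by the metric of interest.
rows = conn.execute("""
    SELECT user_id, COUNT(*) AS purchases, SUM(revenue_cents) AS total_cents
    FROM events
    WHERE event = 'purchase'
    GROUP BY user_id
    ORDER BY total_cents DESC
""").fetchall()
print(rows)  # [(2, 2, 1998), (1, 1, 999)]
```

If you can read and write queries like this comfortably, you're most of the way to the SQL a typical DS interview expects.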

"But Ben, why don't you suggest I become an expert in any of the above topics?" The answer is that no data science job, especially one considering you fresh out of academia, would expect you to be an expert in anything, and it's unrealistic to expect that even after a few months of studying. Coding, and data science in general, is something you learn by doing, and so you can only get better when you get into a job and actually use these skills on a day-to-day basis.

In fact, what you should sell in a cover letter or interview for a data science position shouldn't be your domain knowledge in these subject areas, but the fact that you are a Ph.D. One of the most valuable skills of a data scientist is being able to work independently on a complex problem, and be self-sufficient enough to work around problems that arise. These are basic principles of a Ph.D. research project, so anyone reading this guide should have no problem with that. When I evaluate candidates, I don't really care about years of experience, but instead I look at what projects they accomplished. If they have a track record of delivering solutions to complex problems, I know they can learn the necessary coding, stats, ML, or SQL if they currently don't know it.

What other guides won't tell you.

Now to the lesser-talked-about necessities of a transition. While the above tips (and most you'll see other places) focus on things you can learn or do during the transition, there are a number of things required in a data scientist that aren't as easily learned from a book.


In academia, you're surrounded by others very similar to you. When you go to a conference or talk, everyone there presumably "speaks the same language" and you can go deep into technical jargon without worrying too much about putting off the audience. In industry this is often not the case. In my job, I interface directly with non-technical people all the time, and it's up to me to distill my analysis and models into understandable terms to communicate my findings. Delivering quality reports is one of the most necessary skills for a data scientist, and candidates without those skills will have a tough time passing interviews for most DS positions.

The problem here is that better communication and speaking habits are difficult to learn from books; they must be practiced. So if you are still in your Ph.D., take every opportunity you have to speak at conferences. Also try to speak within your department, taking the time to distill your complex research project and findings to an undergraduate level. When you're done with the talk, ask the undergrads whether they understood everything, and slowly improve your presentation via their feedback. If you can communicate your findings to them, you'll be pretty close to having the skills necessary to translate your ML and analysis results to project managers and executives in an industry role.


Another huge difference (which can often be a selling point to some people) is the vastly different timelines between academia and industry. In academia, you pour multiple years of work into a single project, often with months of largely no progress while you work on gathering data or improving the performance of various tests or algorithms. In industry, you're at almost the other end of the spectrum, having to juggle multiple projects on very short turnarounds.

Because of this, you're often having to iterate quickly to a 70% solution, and unable to wrangle the time to improve it to the 100% solution. Early in my data science career, I was juggling two to three projects at a time, often on a days-to-weeks timescale. This was an abrupt change, and required quick and dirty solutions I was not accustomed to during my Ph.D. It was difficult at first to accept that this was the "quality" of work I was putting out, but I slowly came to terms with it. I have adjusted accordingly and now function fine within these confines, but forsaking the "true academic" tendency to put out a perfect solution is something you have to make sure you're prepared for.

While this type of work environment isn't something you can easily understand beforehand, I've attempted to come up with a way to "simulate" what this would be like:

  • Find two datasets on Kaggle that interest you, and define a business problem you want to solve with each. Some datasets already have apparent business questions they can answer, like this dataset that asks "Can you predict product backorders?" For those datasets which don't come with an immediate business question, try to think of a business that might produce this data, and why it might want a data scientist to analyze it.
  • Take a weekend and fully answer both questions. Give yourself two full days to analyze both problems. Download and clean the data, do some exploratory analysis, and build a model (if the business case calls for one). Focus your efforts on solving the business case and delivering a solution to the question you asked in Step 1.
  • Write blog posts explaining your methodology and results. Why did you choose the model type that you did? What should the business learn from the data? What would the next steps be for the business on this project? Utilize some of the plots you generated during your data exploration phase. These blog posts should also be completely written in the weekend you're doing the data work (which will make the time go by a bit faster).
  • At the end of the weekend, post both blog posts online. Everything is on the honor system, but no matter how far you got on both questions, post your analysis code and blogs online at the end of the weekend. Send it to some friends and get their feedback on how well you communicated everything, and whether you justified your analysis choices clearly enough. Think about how you feel about the quality of work you did in the amount of time allotted. Did you spend too much time on one question, and not enough on the other? Did you feel like both problems were rushed? What would you improve if you had another two days?
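The analysis phase of the weekend exercise above might look something like the following sketch. A real Kaggle project would start from pd.read_csv on the downloaded file; here a small synthetic stand-in (invented column names, random labels) keeps the example self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a Kaggle CSV; in practice: df = pd.read_csv(...)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "inventory": rng.integers(0, 100, 600).astype(float),
    "lead_time": rng.normal(8, 2, 600),
    "went_on_backorder": rng.integers(0, 2, 600),
})
df.loc[rng.choice(600, 30, replace=False), "lead_time"] = np.nan  # messy data

# 1. Clean: fill simple gaps with medians rather than agonizing over them.
df = df.fillna(df.median(numeric_only=True))

# 2. Explore: a couple of quick summaries, not an exhaustive EDA.
print(df["went_on_backorder"].value_counts(normalize=True))

# 3. Model: a sensible default, cross-validated, no heavy tuning.
X = df[["inventory", "lead_time"]]
y = df["went_on_backorder"]
scores = cross_val_score(RandomForestClassifier(n_estimators=50), X, y, cv=3)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

The point of the time limit is exactly this shape of work: a defensible default at each step, with your reasoning saved for the blog post rather than spent chasing the last few points of accuracy.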

Two full Kaggle datasets in two days is hard, and I doubt you'll get to a perfect solution you can be happy with for both, but that's fine. While most DS jobs will have longer timescales than one day, this simulates some of the stress and time management you'll encounter in an industry role, and figuring out how you handle it is important in determining whether you'll be happy in data science. If you found yourself unhappy that you couldn't spend the entire weekend (and the following weekend) on one of these projects because you found it super interesting and wanted to improve your model performance, then you should think hard about whether you'll be happy when industry timelines come to crash your party. There are certainly DS positions with more "academic" timescales, and knowing what type of work culture you're able to thrive in is important as you start looking for opportunities.


Everyone coming to data science transition guides seems to assume that they'll be happy in data science, but that's not necessarily the case. While data science is the hot industry right now, and many people are lured in by the promise of working on cool machine learning or AI problems, the vast majority of DS roles don't spend much of their time on those things. Lots of DS projects are more product-analytics focused, which requires mining various data sources to determine user behavior, or detecting trends via simple heuristics. These types of investigations can sometimes be intellectually rewarding and impactful, but aren't necessarily earth-shattering in their complexity.

Even in an ML project, only a small amount of time is actually spent training the model; most of it is instead invested in data cleaning, exploration, and iterative feature engineering. This work is arduous and tedious, but is absolutely necessary for moving on to the "fun" parts. The problem with learning data science via Kaggle datasets and competitions is that the data is usually already cleaned for you, and the number of features is fairly small. In practice, the data is much messier and the feature space near limitless.
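As a toy illustration of what that cleaning-and-feature-engineering time is actually spent on (the column names and the particular "mess" below are invented, and real pipelines are far larger):

```python
import numpy as np
import pandas as pd

# Toy "messy" raw events; real data is worse, but the moves are the same.
raw = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "ts": ["2017-04-01 10:00", "2017-04-01 10:05", "bad-timestamp",
           "2017-04-02 09:00", "2017-04-02 09:30", "2017-04-03 12:00"],
    "amount": ["9.99", "0", "4.99", "n/a", "14.99", "0"],
})

# Cleaning: coerce types so garbage becomes NaN/NaT instead of crashing.
raw["ts"] = pd.to_datetime(raw["ts"], errors="coerce")
raw["amount"] = pd.to_numeric(raw["amount"], errors="coerce")
clean = raw.dropna()

# Feature engineering: per-user aggregates a model could actually consume.
features = clean.groupby("user_id").agg(
    n_events=("ts", "count"),
    total_amount=("amount", "sum"),
)
print(features)
```

None of this is glamorous, but every bad timestamp or stray "n/a" you don't catch here silently poisons the model downstream.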

At Uber, I work with billions of mobile app events coming in every day, and have to distill this waterfall of data into a smaller bucket of features that relate to the project at hand. This requires days of SQL writing to clean and transform the data into a large number of possibly-related features, a few hours of sanity-checking these pipelines to ensure the data makes sense, and then a few more hours of slowly narrowing a huge feature list down to only a few impactful ones. After I've built my model, I spend another hour or two writing everything up and distilling it into a presentation or document outlining the entire process, and what learnings we can take from it (even if it's just a black-box model we deploy to production). While this may be a week or two of work, only a very small slice of it is spent dealing with different models or advanced tech explorations.


I don't want to be a downer, because data science can be a very interesting field, and I'm extremely happy I transitioned from academia. However, leaving academia because you're unhappy to go into an industry where you're equally unhappy isn't a good solution either. Most guides focus on what you can learn, but I also think they should focus on what you can expect. My best advice beyond the Kaggle time-crunch challenge above is to seek out friends or others who have interesting business data but no analyst, and tell them you want to give them free DS work. Do an actual project, with an actual client, and see whether you enjoy the intellectual challenge. If you do, then most likely you can find a role in DS that you will enjoy and be happy in. If you don't like it, then try to see whether there are roles that only have the aspects of the project you did enjoy, or perhaps re-think whether data science is the right career path for you.

Any other "hidden" tips you'd want to share with other academics looking to transition? Let me know in the comments.
