I'm hardly alone in my path from academia into the world of data science, which means there exist a multitude of guides and blog posts explaining the steps you need to take to also make the transition. Most focus on how to transition skills gained via a Ph.D. into those "necessary" for work in data science, or how to improve your coding. These are definitely useful comments (and I'll list out some more conventional ones as well), but there are some questions that I almost never see in these guides that I think are absolutely necessary to answer before attempting any transition.
First, I'll get my list of "known" transition tips out of the way:
"But Ben, why don't you suggest I become an expert in any of the above topics?" The answer is that no data science job, especially one considering you fresh out of academia, would expect you to be an expert in anything, and it's unrealistic to expect that even after a few months of studying. Coding, and data science in general, is something you learn by doing, and so you can only get better when you get into a job and actually use these skills on a day-to-day basis.
In fact, what you should sell in a cover letter or interview for a data science position shouldn't be your domain knowledge in these subject areas, but the fact that you are a Ph.D. One of the most valuable skills of a data scientist is being able to work independently on a complex problem, and be self-sufficient enough to work around problems that arise. These are basic principles of a Ph.D. research project, so anyone reading this guide should have no problem with that. When I evaluate candidates, I don't really care about years of experience, but instead I look at what projects they accomplished. If they have a track record of delivering solutions to complex problems, I know they can learn the necessary coding, stats, ML, or SQL if they currently don't know it.
Now to the lesser-talked-about necessities of a transition. While the above tips (and most you'll see other places) focus on things you can learn or do during the transition, there are a number of things required in a data scientist that aren't as easily learned from a book.
In academia, you're surrounded by others very similar to you. When you go to a conference or talk, everyone there presumably "speaks the same language" and you can go deep into technical jargon without being too worried about putting off the audience. In industry this is often not the case. In my job, I interface directly with non-technical people all the time, and it's up to me to distill my analysis and models into understandable terms to communicate my findings. Delivering quality reports is one of the most necessary skills for a data scientist, and candidates without that skill will have a tough time passing interviews for most DS positions.
The problem here is that better communication and speaking habits are difficult to learn from books; they have to be practiced. So if you are still in your Ph.D., take every opportunity you have to speak at conferences. Also try to speak within your department, taking time to distill your complex research project and findings to an undergraduate level. When you're done with the talk, ask the undergrads whether they understood everything, and slowly improve your presentation via their feedback. If you can communicate your findings to them, you'll be pretty close to having the skills necessary to translate your ML and analysis results to project managers and executives in an industry role.
Another huge difference (which can often be a selling point to some people) is the vastly different timelines between academia and industry. In academia, you focus multiple years of work on a single project, often with months of little to no progress while you work on gathering data or increasing the performance of various tests or algorithms. In industry, you're at almost the other end of the spectrum, having to juggle multiple projects on very short turnarounds.
Because of this, you often have to iterate quickly to a 70% solution, without the time to improve it to the 100% solution. Early in my data science career, I was juggling two to three projects at a time, often on a days-to-weeks timescale. This was an abrupt change, and required quick and dirty solutions I was not accustomed to during my Ph.D. It was difficult at first to accept that this was the "quality" of work I was putting out, but I slowly came to terms with it. I have adjusted accordingly and now function fine within these confines, but forsaking the "true academic" tendency to put out a perfect solution is something you have to make sure you're prepared for.
While this type of work environment isn't something you can easily understand beforehand, I've attempted to come up with a way to "simulate" what this would be like:
Two full Kaggle datasets in two days is hard, and I doubt you'll get to a perfect solution you can be happy with for both, but that's fine. While most DS jobs will have longer timescales than one day, this simulates some of the stress and time-management that you'll encounter in an industry role, and figuring out how you handle it is important in determining whether you'll be happy in data science. If you found yourself unhappy that you couldn't spend the entire weekend (and the following weekend) on one of these projects because you found it super interesting and wanted to improve your model performance, then you might need to come to terms with whether you'll be happy when industry timelines come to crash your party. There are certainly more "academic" timescale DS positions, and knowing what type of work culture you're able to thrive in is important as you start looking for opportunities.
Everyone coming to data science transition guides seems to assume that they'll be happy in data science, but that's not necessarily the case. While data science is the hot industry right now, and many people are lured in by the promise of working on cool machine learning or AI problems, a vast majority of DS roles don't spend much of their time on those things. Lots of DS projects are more product-analytics focused, which requires mining various data sources to determine user behavior or detect trends via simple heuristics. These types of investigations can sometimes be intellectually rewarding and impactful, but aren't necessarily earth-shattering in their complexity.
Even in an ML project, only a small amount of time is actually spent training the model; most of it is instead invested in data cleaning, exploration, and iterative feature engineering. This work is arduous and tedious, but is absolutely necessary for moving on to the "fun" parts. The problem with learning data science via Kaggle datasets and competitions is that the data is usually already cleaned for you, and the number of features is fairly small. In practice, the data is much messier and the feature space is nearly limitless.
At Uber, I work with billions of mobile app events coming in every day, and have to distill this waterfall of data into a smaller bucket of features that relate to the project at hand. This requires days of SQL writing to clean and transform the data into a large number of possibly-related features, a few hours of sanity-checking these pipelines to ensure the data makes sense, and then a few more hours of slowly iterating from a huge feature list down to only a few impactful ones. After I've built my model, I spend another hour or two writing everything up and distilling it into a presentation or document outlining the entire process and what learnings we can take from it (even if it's just a black-box model we deploy to production). While this may be a week or two of work, only a very small slice of it is spent dealing with differing models or advanced tech explorations.
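To give a feel for the shape of that workflow, here is a deliberately tiny Python sketch of the same three steps: clean raw events into per-user features, sanity-check them, and prune the feature list. It is only an illustration under loud assumptions, not the pipeline described above: the event names, the `TARGET` labels, and the helpers `build_features`, `sanity_check`, `pearson`, and `top_features` are all made up, and ranking by absolute Pearson correlation is just one simple stand-in for "iterating down to a few impactful features".

```python
from collections import defaultdict
from math import sqrt

# Hypothetical raw app events as (user_id, event_name). All names are invented.
EVENTS = [
    ("u1", "app_open"), ("u1", "search"), ("u1", "checkout"),
    ("u2", "app_open"), ("u2", "app_open"),
    ("u3", "app_open"), ("u3", "search"), ("u3", "search"), ("u3", "checkout"),
    ("u4", "app_open"), ("u4", None),  # a malformed row that must be cleaned out
]
# Hypothetical per-user outcome the project cares about.
TARGET = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}


def build_features(events):
    """Drop malformed rows, then count each event type per user."""
    counts = defaultdict(lambda: defaultdict(int))
    for user, name in events:
        if user is None or name is None:
            continue
        counts[user][name] += 1
    names = sorted({n for per_user in counts.values() for n in per_user})
    table = {u: {n: per_user.get(n, 0) for n in names}
             for u, per_user in counts.items()}
    return table, names


def sanity_check(table, target):
    """Fail loudly if a labelled user has no features or a count is negative."""
    missing = set(target) - set(table)
    if missing:
        raise ValueError(f"users without features: {sorted(missing)}")
    if any(v < 0 for row in table.values() for v in row.values()):
        raise ValueError("negative event count")


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def top_features(table, names, target, k):
    """Keep the k features most correlated (in absolute value) with the target."""
    users = sorted(target)
    ys = [target[u] for u in users]
    score = {n: abs(pearson([table[u][n] for u in users], ys)) for n in names}
    return sorted(names, key=lambda n: score[n], reverse=True)[:k]


table, names = build_features(EVENTS)
sanity_check(table, TARGET)
selected = top_features(table, names, TARGET, k=2)
print(selected)
```

Even in this toy, the modeling-adjacent part is a single function, and everything else is cleaning, checking, and bookkeeping.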
I don't want to be a downer, because data science can be a very interesting field, and I'm extremely happy I transitioned from academia. However, leaving academia because you're unhappy, only to go into an industry where you're equally unhappy, isn't a good solution either. Most guides focus on what you can learn, but I also think they should focus on what you can expect. My best advice beyond the Kaggle time crunch challenge above is to seek out friends or others who have interesting business data, but no analyst, and tell them you want to give them free DS work. Do an actual project, with an actual client, and see whether you enjoy the intellectual challenge. If you do, then most likely you can find a role in DS that you will enjoy and be happy in. If you don't, then try to see whether there are roles that involve only the aspects of the project you did enjoy, or perhaps rethink whether data science is the right career path for you.
Any other "hidden" tips you'd want to share with other academics looking to transition? Let me know in the comments.