From Astronomy to Data Science: One Size Fits All?

29 Jul 2014 · by bathompso · in  Data Science 

Academia has been undergoing a crisis as of late. Many disciplines (even "hot ones," such as biomedical research) have been finding that they are producing too many short-term jobs (post-docs), and too few permanent positions to sustain the outflow of Ph.D.s.

Astronomy (my chosen field) is no exception. Heaped on top of this is the fact that as the U.S. government tightens its belt, "pure science" budgets, such as those for NASA and NSF's astronomy grants, are the first to be cut. As grant money earned is often a metric used to determine tenure, this makes it even harder to keep a tenure-track job, if you somehow manage to get one in the first place.

As I've come to terms with this, I've begun to look into alternative fields for a career, and data science is one I have seen a lot. But looking into the several "astronomer to data scientist" transition guides on the web, all seem to tell me the same thing: "learn lots of programming languages, like Python, R or SAS." "Learn NoSQL, Hadoop, and MapReduce." In short, these guides seem to think every data science job is the same: a "big data" scientist at a Silicon Valley startup or tech titan.

If somebody were preparing for a career in physics and asked you what they should learn, you wouldn't necessarily be able to answer. Interested in computational astrophysics? You'll need to know C or FORTRAN to write highly parallelized codes that run on supercomputers. Interested in engineering? You probably don't need to spend your time learning general relativity. "Physics" isn't one thing, and therefore you can't just prepare in a single way. Data science is exactly the same.

Rather than prescribe preparation steps for one very specific job, I've listed below some general steps, drawn from my time exploring the data science market, that everyone should take.

  1. Do some reading on general data science principles. Regardless of what subsection of data science you want to go into, everyone should be well-versed in the "fundamental principles." Data Science for Business is an excellent book on this topic, and I suggest everyone read through it. Not only will it give you an outline of many of the basic data science algorithms, but it also really hammers in the fact that data science is part of business, not a hard science. Everything you do should be to further the business goals.

  2. Decide whether data science is really what you want. In the last bullet of her "Astronomy vs Data Science" post, Jessica Kirkpatrick states that you cannot dictate the subject matter in industry like you can in academia. As stated above, everything you do in data science should be to further the business goals. This can be an adjustment, as in academia you pick a subject largely unconstrained by others. Make sure this is a trade-off you're okay with.
        For me, I enjoy taking (often incomplete) data and telling a story with it. It doesn't matter what I'm looking at: star brightnesses to determine masses or customer location history to detect fraud, I enjoy any time I am "learning" something. For others, this may not be the case. No matter the benefits of a data science career (salary, locations, job security), doing something you don't enjoy isn't the way to go.

  3. Decide what section of data science you want to inhabit. Does working for Google or Facebook tracking millions (or billions) of users' behaviors online sound exciting? How about data mining thousands of healthy people's DNA to improve healthcare? Maybe you should look into the "Silicon Valley Big Data" jobs. Maybe you'd like to track how people respond to coupon offers, in order to fine-tune a company's marketing strategy? There are many of these "data analyst" positions available as well, and they are usually in high demand all across the country. There are also all kinds of jobs in between.
        Do some research on job sites, such as Monster or Indeed, to see what data science jobs are out there. Read the job descriptions and requirements to see what types of positions you're interested in and/or qualified for.

After doing some background research on what jobs you might want, there are many diverging paths to take. Everyone's individual preparation method will be different in several areas:

  • What programs you'll need to learn. Big Data jobs: definitely brush up on Hadoop, MapReduce, Pig, Hive, etc. Learn a legitimate programming language, like Python. Depending on the company, they may also want an even newer language, like Julia. Going for an analyst position? Make sure you know your SQL backwards and forwards, as many business databases are MySQL. Depending on the company, you may not even need to learn a programming language: all calculations are expected to be in Excel! This doesn't mean it's easier, however; learning how to make Excel do advanced things takes just as much time as learning Python. Maybe even more!
        Also be aware of how "stats-forward" your potential jobs may be. If you're going into Big Data, or a data-science-heavy company, you may need to brush up on your statistics in order to understand and implement Bayesian methods. If you're going into an analyst position, you might not need to know as much, because Bayesian models may hurt your cause, rather than help (see below).

  • What methods and algorithms you'll need to employ. Lots of data science is crunching numbers, but that's not the only thing available. Some data science mines text, for example to detect plagiarism in student papers. If you find many job listings where this might be a component, you should learn more about Natural Language Processing. Similarly, some data science focuses on images; perhaps automatically comparing signatures on checks or handwriting in documents to detect fraud. Look into image processing algorithms, which may be useful here.

  • How you'll need to communicate. All data science should further the business goals (pound that notion in). Regardless of where you're working, or what you're doing, you will always have to be justifying your thoughts, approach, and implementation to others, higher up. If you cannot sell them on what you're doing, do not expect to be able to move forward (see how this is different than hard science?). How you have to defend yourself, and what tools you'll be "allowed" to use, however, will vary greatly from job to job.
        At a company with a large focus on data science, you may be working with an entire team of data scientists. You may have managers (who themselves are data scientists), who will act as the intermediary between you and the business leaders. You need to sell the managers, who will then take it upon themselves to sell the executives. This is a somewhat easier setup: the people you are directly interfacing with understand data science. You can talk about Bayesian statistics, ensemble methods and specific Python modules to bolster your case, and they will be able to speak the same language.
        Going to an analyst position, or a company with a small data science team? You may be speaking directly to executives to sell them on your plan. While more advanced classification schemes, such as Bayes, may provide better results, they may hamper your efforts here, because nobody you're talking to will understand them! You may be forced to fall back to simple Decision Tree Classifiers, which you can plot, and everyone can understand. Be prepared to "dumb down" your models, or be able to explain them creatively, so that everyone can get on board.
        This is another area where data science (and business in general) is different than science. In science, you are judged by the quality of your work, and only slightly by how you present it. If you have an amazing find, but you stammer a bit in your talk, or your paper is a little disorganized, you still may be lauded in the scientific community. In business, it's the opposite: it doesn't matter whether you found an amazing classification model if you can't explain it. If you stumble through a meeting with executives, your entire project may be killed, even if it would have saved the company millions of dollars. Practice your public speaking!
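
To make the "models everyone can understand" point concrete, here is a minimal sketch in Python of the simplest possible decision tree: a single split (sometimes called a decision stump) that can be read back as one plain-English sentence. The transaction data, the function names, and the fraud scenario are all made up for illustration; a real project would more likely use a library such as scikit-learn and far more data.

```python
# Toy, made-up data: (purchase amount in dollars, was the transaction fraud?)
transactions = [
    (12.0, False), (25.5, False), (40.0, False), (58.0, False),
    (310.0, True), (450.0, True), (75.0, False), (520.0, True),
]


def fit_stump(rows):
    """Find the single threshold that misclassifies the fewest rows.

    The rule being fitted is: predict fraud when amount > threshold.
    Returns (best threshold, number of mistakes on the given rows).
    """
    amounts = sorted({amount for amount, _ in rows})
    # Candidate thresholds sit halfway between neighbouring amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]
    best_threshold, best_errors = None, len(rows) + 1
    for threshold in candidates:
        errors = sum((amount > threshold) != is_fraud
                     for amount, is_fraud in rows)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold, best_errors


def describe(threshold):
    """Turn the fitted rule into a sentence a non-specialist can read."""
    return (f"Flag a transaction as possible fraud "
            f"if it is over ${threshold:.2f}.")


threshold, errors = fit_stump(transactions)
print(describe(threshold))
print(f"Mistakes on the sample data: {errors} of {len(transactions)}")
```

The whole "model" here is one sentence that an executive can read, question, and sign off on. A more sophisticated classifier might score better on real data, but it would not come with an explanation this short.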

So now that you've hammered out your list of things you need to learn, how will you do that? The best option is to look to free online courses, from places like Coursera or Udacity. Another good option is just to search the web for resources on the specific area you want to learn. Hadoop has an excellent walkthrough online, and many of these other advanced tools do as well.

Not only will you be learning relevant skills for your future job, but you'll also be proving to a future employer that you can learn them. I've heard from many people in the industry that most of what companies care about is inherent traits, not skills. Sure, being able to check off the list of requirements is good, but they care more about whether you are a deep quantitative thinker, whether you're able to carry out independent projects in a timely manner, and whether you're able to efficiently learn new skills that may be necessary for the job. Show them that you can.

Lastly, put those skills to use and practice! has many excellent challenges that give you chances to work on real datasets and test your algorithms' accuracy. They range in difficulty and approach, providing challenges for any realm of data science you wish to enter. Go check them out!
