5 Things I Wish I Knew Before Taking My First Data Science Job

Evan Baker Case
6 min read · Jan 4, 2018

Two months ago, I started my first full-time position as a data scientist, working to bring machine learning to a commercial-building engineering firm. I landed the job after three months of grinding through a bootcamp and a grueling five-month job search. Throughout this process, I immersed myself in the world of data science, taking on side projects, geeking out on the newest deep learning models, and scanning message boards like /r/datascience. In so doing, it was easy to get the impression that I knew exactly what being a data scientist would be like. It’s just like getting paid to do Kaggle competitions, right? Not exactly.

So far, the experience has been hugely fulfilling, but also challenging in ways I did not expect. I had to figure out how to apply my skill set to a new domain with only my own mistakes to learn from, and I hope these insights help you prepare for the challenges that lie ahead. It’s an incredibly rewarding road to walk down, and hopefully this knowledge will help you move past the first set of obstacles on it a little more easily.

1. Never Be Afraid to Ask Questions

When data scientists describe their work, they often mention the importance of being a “domain expert.” Not only are you expected to have technical mastery over software development and machine learning, you’re also supposed to understand the subtleties of your business and where you fit in the market. However, when starting a new data science role, chances are it will relate to a domain you’re not yet an expert in, and that’s okay (I’m not saying this applies to everybody, but I think 9 times out of 10 this will be the case for your first data science job).

One of the best ways you can become the expert you are expected to be as quickly as possible is by soaking up information from the people around you. Whenever somebody refers to an area of the company you’ve never heard of, or uses industry jargon you don’t know, ask them what that is. Not only will your curiosity increase your pace of learning, but you also won’t be stuck in the uncomfortable situation of having to fake an understanding the next time that subject comes up. Many people in the workplace aren’t comfortable exposing when they don’t understand something. If you are, you’ll quickly pass those people by. Go forth confidently and learn.

Asking questions goes beyond domain knowledge. I cannot tell you how many times I have seen a classmate or student hitting their head against the wall, struggling to figure out a coding problem, and not knowing where to turn. Don’t be afraid to ask the internet! Stack Overflow is your friend. There is certainly growth to be had from working through problems on your own, but asking other people how they’ve handled the same problem is an amazing chance to quickly get an answer, and hopefully find a few new techniques you may have never thought of.

2. Understand The Problem Before Dreaming Up The Solution

Given all the excitement in the data science world about new models and technologies, it can be easy to get sucked into immediately thinking about how you can apply those methods to the problems you’re working on. The truth is, these methods are often either too complicated to be practical, or they don’t solve the problem you’re working on. Data science’s obsession with competitions to eke out 0.001% better accuracy can make it easy to think that achieving the highest possible model performance is the only important consideration.

In practice, your work as a data scientist is subject to many limitations. Your time, the data you have access to, and computational power are all considerations that may lead to a simple answer being the best one. Maybe you’re working on a problem for which getting that last sliver of performance is highly valuable to your company, and if that is you, by all means spend the time you need to get there. But rather than simply being drawn towards performance for its own sake, try to understand the business impact of that performance, and then weigh whether the added time and complexity make that solution worthwhile. Maybe you don’t even need to use machine learning in the first place! If you first focus on fully understanding the parameters and business implications of the problem, you’ll be much less likely to graft on an unnecessarily complicated solution.

3. Learn To Make Your Code Portable

When you’re first learning data science, you may get the impression that it’s all Jupyter notebooks and Kaggle kernels, and that as long as you can churn out a solid model in an IPython environment, you’re doing great. But what about when you actually want to put that model into production? Sure, you can pickle your model and hand it off to someone if you’re working in a Python shop, but what if that person is supposed to be you? And what about all the data munging functions you wrote earlier in your notebook? Those are going to need to be implemented as well.

More likely than not, those functions will be useful for other projects you work on, too. This is when you need to start thinking about making your code modular, and creating a set of building blocks you can easily implement in production. If you’re pulling data from a set of APIs, build wrappers that you can import into other scripts. If you’re constantly cleaning up the same data, turn your cleaning functions into a class that can be pulled in wherever that data shows up. The classic “data science project” can lead to the idea that every project is totally siloed, and there isn’t much focus on the value of the pieces you build along the way. Maybe in your bootcamp you were working on the Titanic data set one week, and doing image recognition the next, never to open that Titanic notebook again. Once you get into the workplace, you’ll be working with the same data over and over again. Focus on coding in such a way as to make your work quickly reproducible on each project, and as easy as possible to take from your notebooks to production.
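As a rough illustration of turning notebook cleaning steps into a reusable class, here is a minimal sketch. Everything in it is hypothetical: the class name `SensorReadingCleaner`, the `temperature` field, and the valid range are assumptions made up for the example, not anything from a real project. The point is only that the logic lives in one importable place instead of being copied between notebooks.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class SensorReadingCleaner:
    """Hypothetical reusable cleaner for rows of sensor data."""

    value_field: str = "temperature"  # assumed column name
    min_value: float = -50.0          # assumed plausible range
    max_value: float = 150.0

    def _parse(self, raw: Any) -> Optional[float]:
        """Return the reading as a float, or None if it is unusable."""
        try:
            value = float(raw)
        except (TypeError, ValueError):
            return None  # missing or non-numeric input
        if value != value:  # NaN is the only float not equal to itself
            return None
        if not (self.min_value <= value <= self.max_value):
            return None  # outside the assumed plausible range
        return value

    def clean(self, rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Keep rows with a usable reading and normalize it to float."""
        cleaned = []
        for row in rows:
            value = self._parse(row.get(self.value_field))
            if value is None:
                continue
            cleaned.append({**row, self.value_field: value})
        return cleaned
```

Saved in a module, a class like this can be imported by a notebook, a training script, and a production job alike, so a fix to the cleaning rules only has to be made once.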

4. Test and Prototype Everything

Once you start thinking modular and making your code portable, testing and prototyping in as many situations as possible is incredibly important to producing work efficiently. Say you’ve started building a machine learning pipeline. You have a script that pulls data out of your database, cleaning and transforming it. You have a separate script that pulls data from an API, another that merges all this data together, another that trains your models, and finally one that spits out a trained model that you can use in production. There are a lot of steps where this process can go wrong.

If you’re used to working in a static environment like a Jupyter notebook, with all your data coming from a CSV, it’s easy to overlook how important it is to think through the edge cases that can come up in real life. Don’t be lulled into complacency just because your scripts work for the data set you’re currently working with. Real data is messy, and it will tear your script apart. If you build your entire pipeline only to realize that the whole thing breaks when there are null values somewhere you didn’t think of, you’re going to kick yourself for not having thought of that earlier. The further back in your pipeline the problem is, the harder it becomes to diagnose, compounding the time it takes to fix. Do yourself a favor and fix those bugs before you’re ten steps deep.
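To make the null-value example concrete, here is a small sketch of what testing one pipeline step against edge cases might look like. The function `average_reading` is a made-up stand-in for a single pipeline step, and the test functions follow the naming convention that a test runner such as pytest collects. This is one possible approach under those assumptions, not a prescription.

```python
from typing import List, Optional


def average_reading(values: List[Optional[float]]) -> Optional[float]:
    """Hypothetical pipeline step: mean of the non-null readings.

    Returns None when there is nothing to average, rather than
    raising an error in the middle of the pipeline.
    """
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


def test_typical_input():
    assert average_reading([1.0, 3.0]) == 2.0


def test_nulls_are_ignored():
    assert average_reading([1.0, None, 3.0]) == 2.0


def test_all_nulls_returns_none():
    assert average_reading([None, None]) is None


def test_empty_input_returns_none():
    assert average_reading([]) is None
```

Writing checks like these for each step as you build it means a surprise null shows up as one failing test, not as a confusing error several steps downstream.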

5. Stay Positive

This might sound like the most trivial piece of advice, but it can also be the hardest to keep in mind when you’re spending hours working through bugs or trying to wrap your head around how an algorithm works. Chances are that if you’re reading this, you’re also relatively new to data science, and this is an incredibly challenging field. It’s no coincidence that it’s a lucrative occupation filled with PhDs and researchers. If you find yourself hitting walls, getting confused, or feeling overwhelmed by the breadth of knowledge you’re expected to have, take a breather. It’s okay to feel challenged, and if you are, that is a great thing. It means you’re learning. You have a leg up on everyone else going through their days not feeling challenged. Think about what made you excited about data science in the first place, and use that as motivation to tackle the challenges in front of you. If you can continue to enjoy your work even through the struggles, you’ll set yourself up for an amazingly rewarding and successful experience.

Evan Baker Case

Staff Data Scientist @ Calm, house music enthusiast, basketball nerd.