Opinions expressed by Entrepreneur contributors are their own.
Machine-learning applications are an integral part of our lives. Whether we realize it or not, we likely come into contact with machine-learning models every day online through recommendations and advertisements, fraud detection, search, image recognition and more. As a result of machine learning's growing prevalence in our day-to-day lives, the demand for data scientists has exploded in recent years, with projected job growth of 31% through 2029. Yet data scientists are still in short supply: in 2020, there was a shortage of 250,000 data scientists.
If you’re looking to pursue a career as a data scientist, know it encompasses much more than just number crunching and programming — data scientists are also expected to have strong business acumen, communication and public speaking skills. As the machine-learning practice lead at Databricks, I oversee a growing team of data scientists and have learned firsthand what it takes to excel and stand out from the crowd.
Excited to dive into professional development and learn new tools to advance your career, but not sure where to start? Here are five skills to keep top of mind to boost your data-science career and professional profile.
1. Blending technical and non-technical communication
Communicating technical concepts to non-technical and technical audiences alike is critical for thriving as a data scientist. All the hard work you put into building the most accurate model won’t matter if you can’t explain it to others and convince them to adopt and trust it.
To help concepts stick, one tip I recommend is to use analogies to items that people see in their day-to-day life. For example, when I explain distributed computing with Apache Spark, I illustrate the process by counting easily recognizable household items, like candy. In this scenario, if I have a large bag of M&M's, I could single-handedly count them one by one to arrive at the exact count. An easy way to parallelize this task is to invite many of my friends, each of whom counts a portion of the M&M's, and then add up their partial counts to arrive at the same exact count faster. Now, when people go to the store and see M&M's, they can't help but think of Spark! Often, people use rocket-ship analogies, but unless you work at SpaceX or NASA, you likely don't come across rocket ships in your daily life, which makes it harder for your analogy to stick.
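The candy-counting analogy can be sketched in a few lines of code. This is a minimal illustration of the split, count and combine pattern using only the Python standard library, not Apache Spark itself; the function names, the bag contents and the number of friends are all hypothetical. Threads are used here purely to show the shape of the idea, so don't expect a real speedup on a task this small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_piles(candies, num_friends):
    """Deal the bag into roughly equal piles, one per friend."""
    return [candies[i::num_friends] for i in range(num_friends)]


def count_pile(pile):
    """One friend counts their own pile by color."""
    return Counter(pile)


def parallel_count(candies, num_friends=4):
    """Combine each friend's partial tally into one overall count."""
    piles = split_into_piles(candies, num_friends)
    with ThreadPoolExecutor(max_workers=num_friends) as pool:
        partial_counts = list(pool.map(count_pile, piles))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# A hypothetical bag of candy (assumed contents, for illustration only).
bag = ["red"] * 300 + ["blue"] * 200 + ["green"] * 100
counts = parallel_count(bag, num_friends=4)
print(sum(counts.values()), dict(counts))
```

Each friend only ever sees their own pile, and the final answer comes from adding the partial tallies together, which is roughly the intuition the analogy is meant to convey.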
By communicating effectively and explaining terminology in ways everyone can understand, you will boost data transparency across the organization and ensure everyone understands the value you provide.
2. Always be learning
While there is a clear need for more talent, many traditional education programs do not teach all the skills needed to be a data scientist. For example, most of the university and Coursera courses I took focused on learning and applying techniques to improve model performance against benchmarks (for example, maximizing accuracy on ImageNet). However, when I entered the industry, I learned that those processes are only a small piece of the puzzle. You also need to be concerned with how the data was collected (and labeled), deployment constraints and the infrastructure to serve the model, monitoring and model retraining pipelines, etc. The Google paper “Hidden Technical Debt in Machine Learning Systems” outlines this phenomenon. In this paper, the authors report that only about 5% of the code in a real-world ML system is “ML code,” while the rest is “glue code” that supports it.
So how do you learn all the skills needed to be a data scientist and keep up with the latest innovations? Always be learning. I live my life by the philosophy that you learn something new from everyone you meet. I highly recommend building a network through colleagues and peers, attending meetups and gaining exposure to various aspects of the ML field. I have continued to take classes and participate in regular reading study groups even years after I finished grad school! I also recommend subscribing to The Batch — a free weekly digest of what’s new in ML research and innovative applications of ML in the industry (and, most importantly, areas where ML and policy need to improve).
The data field is evolving quickly. In computer science, the typical half-life of your knowledge is seven years, and it is even shorter than that in data science. Technological innovation will continue to climb at a rapid pace, but don’t feel overwhelmed or intimidated. Just keep learning at a steady pace, and you’ll always have new skills to apply.
3. Starting simple and establishing a baseline
With rapid advancements in ML, data scientists are hungry to use the latest and greatest tools. However, I always tell data scientists to start simple and establish a baseline with associated metrics. This baseline should be very naive, such as predicting the average value for regression problems (e.g., predict the average house price) or the most frequent class for classification problems (e.g., always predict “no”). I can’t tell you the number of times I’ve seen someone boast, “My machine-learning model is 90% accurate at predicting XYZ problem,” only for someone else to point out, “If you always predict ‘no’, you’ll be accurate 99% of the time.” Establishing a benchmark and clear product-relevant evaluation metrics is crucial for gaining trust in your ML systems. If your evaluation metric is accuracy, always predicting “no” might maximize accuracy, but it is a meaningless model. In this case, the F1 score might be a more appropriate metric because it balances precision and recall rather than just counting correct predictions. Once you have established a baseline, treat it as a lower bound for the predictive performance of your machine-learning system.
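To make the “always predict no” trap concrete, here is a minimal sketch in plain Python. The dataset is made up for illustration (990 “no” labels and 10 “yes” labels, chosen to match the 99% figure in the anecdote), and the helper names are hypothetical rather than taken from any particular library.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive="yes"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, heavily imbalanced labels (assumed for illustration).
y_true = ["no"] * 990 + ["yes"] * 10

# Naive classification baseline: always predict the most frequent class.
most_frequent = Counter(y_true).most_common(1)[0][0]
baseline_pred = [most_frequent] * len(y_true)

baseline_accuracy = accuracy(y_true, baseline_pred)
baseline_f1 = f1_score(y_true, baseline_pred)
print(f"accuracy={baseline_accuracy:.2f}, F1={baseline_f1:.2f}")

# Naive regression baseline: always predict the average (made-up prices).
house_prices = [200_000, 300_000, 400_000]
mean_price = sum(house_prices) / len(house_prices)
```

On this made-up data the baseline is correct 99% of the time yet has an F1 score of zero, because it never identifies a single “yes.” Any real model should have to beat both numbers before anyone boasts about it.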
4. Asking the right questions
I know data scientists are eager to build models, but understanding the data, talking to stakeholders and subject-matter experts, and continually asking questions about the data through exploratory data analysis is critical to delivering the right solution for the business.
Instead of jumping straight to solving the technical problem at hand, take a step back and understand the business problem you are trying to solve. For example, instead of discussing whether you should use PyTorch or TensorFlow, ask, “How will this model be used? How do we quantify ‘success’ for this project?” Thinking through the answers up front will pay dividends later on in the project.
You should also ask questions about your data, such as how it is collected, how it should (and should not) be used, etc. I highly recommend the “Datasheets for Datasets” paper by Gebru et al. for inspiration on the right questions to ask about the data.
5. Identifying your specialization
When I interview candidates for my team, I look for people who can add to the team’s existing skillset — no matter how amazing clones of existing team members are, I want people who can bring new talents and ideas to the table. In essence, I’m seeking to build a human ensemble.
What really makes candidates stand out is when they have a passion or expertise in a given area. It can be within a particular aspect of ML, such as NLP or computer vision, or within a given industry, such as retail, but the critical differentiator is to establish yourself as a subject-matter expert and stay up to date in that area. This way, you become the go-to person for a particular topic and make yourself indispensable.
As data-science tools advance, particularly with low-code and no-code solutions, polishing your business skills in addition to mastering technical skills will enable you to stand out from the crowd and continually deliver the best value for your time.
Now, when you approach a new project, put it all together: Ensure you’re asking the right business and data questions, establish a baseline and associated metrics, learn something new while on the job, leverage your specialization and effectively communicate the results to stakeholders. If you can accomplish all of this, you will be a rockstar.