Starting your First Data Science Project? Here are 10 Things You Must Absolutely Know
6. A Data Scientist isn’t a Magic Bullet – Learn About Other Data-Based Fields
Data science was called the sexiest job of the 21st century, and we have been chasing it ever since. But here’s the caveat: becoming a data scientist isn’t the be-all and end-all of your data science journey. It is essential to explore the other data-based roles as well.
A data science project covers a whole host of data-related roles, such as data engineer, machine learning engineer, deep learning engineer, business analyst, data analyst, and so on. A data scientist doesn’t build the architecture for a big data system; a data engineer does. A data scientist doesn’t typically answer business-related questions; a business analyst does.
Note that these roles interchange and intertwine a lot depending on your project and your organization.
So before starting a data-based project, decide which of these roles you want to grow into. If you want to know more about the differences between the roles, you should definitely check out this article.
7. Believe me, you Need a Benchmark Model
During my first regression project as a data scientist, I built a model using everything I had learned. But I felt that the error was too high and the R-squared too low. Frustrated, I took the problem to my manager. He said – “How do you know the error is high? What is your benchmark score?”
A benchmark model is a basic, run-of-the-mill model that gives you a reference score to beat. You don’t even need to know machine learning to build one. A benchmark for regression can be made by predicting the simple mean of the target, and a benchmark for classification can be made by always predicting the mode, that is, the most frequent class (though I encourage you not to stop there in the industry!).
Let me give you an example from my previous data science project. We were working on a marketing analytics problem and while the data science team was busy trying to decipher which model to try out, my project manager fired up KNIME, built a simple regression model, and came up with a benchmark score. It took him 45 minutes to do this.
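The mean and mode benchmarks described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the workflow from the project above: the function names and the tiny datasets are made up for the example, and RMSE and accuracy are just two common choices of score.

```python
from math import sqrt
from statistics import mean, mode


def mean_baseline_rmse(y_train, y_test):
    """RMSE obtained by always predicting the mean of the training target."""
    prediction = mean(y_train)
    squared_errors = [(y - prediction) ** 2 for y in y_test]
    return sqrt(mean(squared_errors))


def mode_baseline_accuracy(labels_train, labels_test):
    """Accuracy obtained by always predicting the most frequent training class."""
    prediction = mode(labels_train)
    hits = sum(1 for label in labels_test if label == prediction)
    return hits / len(labels_test)


# Hypothetical toy data, only to show how the functions are called.
regression_score = mean_baseline_rmse([10.0, 12.0, 14.0], [11.0, 13.0])
classification_score = mode_baseline_accuracy(
    ["no", "no", "yes"], ["no", "yes", "no", "no"]
)

print(f"Benchmark RMSE: {regression_score:.2f}")
print(f"Benchmark accuracy: {classification_score:.2f}")
```

Any model you build afterwards should beat these scores on the same test data; if it does not, the extra complexity is not earning its keep.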
8. Always stay in touch with your roots (Linear regression may help you better than advanced neural networks)
Have you seen anyone use an axe to slice butter? Metaphorically, that’s what a lot of beginners do when starting out on their machine learning journey. You may be surprised, but a simple linear regression can sometimes give you a model that is more accurate and requires less computational power.
It is always important to understand the problem statement and the type of data you are dealing with, and to ask yourself – What do I want to accomplish with this project? Do you want your model to deliver higher accuracy, or do you want a simple model that will help you with variable attribution?
Remember, most organizations that have a data science division likely won’t have the computational power to support complex models. The likes of Google and Facebook have skewed our perception of data science by pumping in money to build complex multi-layered deep neural networks – don’t fall for that trap.
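To make the variable-attribution point concrete, here is a minimal sketch of a one-variable linear regression fitted with ordinary least squares in plain Python. The variable names (ad_spend, sales) and the numbers are hypothetical, chosen only for illustration; a real project would usually rely on a statistics or machine learning library.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    y_bar = mean(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data: advertising spend versus sales, in arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_simple_linear_regression(ad_spend, sales)
print(f"Each extra unit of ad spend is associated with {slope:.2f} more units of sales")
print(f"R-squared: {r_squared(ad_spend, sales, slope, intercept):.2f}")
```

The slope reads directly as "how much the target moves per unit of the input", which is the kind of attribution a deep neural network does not hand you for free.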
9. No Data Science project can succeed without the Proper Infrastructure in Place
Like most industry projects, a data science project depends on a lot of external factors. In an organization, you must make sure that these factors support your needs for a successful project.
For example, a traditional logistics company plans to build a route optimization application for its transporters, but it doesn’t even have any architecture for tracking its fleet. This is one of the primary reasons why ~85% of data science projects end up failing. That’s a HUGE number, and it’s because decision-makers don’t really understand how important the core infrastructure is before they splurge money on building a team.
Before starting out, executives and leaders can save a lot of time and effort by making sure that everything is in place when the team requires it.
10. Get buy-in from stakeholders before you launch a new Data Science project
A project must have a clearly defined problem statement. It should list the expected results, and those should be the same for all the stakeholders. Without proper communication, the stakeholders and the data science team may end up with different expectations, which can send your project haywire.
Let me hearken back to my previous project as an example. Our data science team was told to “use data science to increase revenue by 25% without increasing costs more than 10%”. That is an incredibly vague problem statement! We had to sit with the project manager and the leadership team to understand the scope of the project, what we could use, and what we couldn’t, etc.
If we had gone in blindly and started working on the problem, we would inevitably have run into a blind alley.
It is always better to keep the stakeholders updated with proper communication in place. Otherwise, the project may take a different direction and ultimately lead to starting over again.
End Notes
To conclude, in this article I have listed 10 things or challenges that I faced when I started out as a data scientist. This is not an exhaustive list, and I am sure there are other challenges that you have faced personally. Let me know in the comments so that it can help the community members who have just started out.