Starting Your First Data Science Project? Here are 10 Things You Must Absolutely Know
Introduction
Can you imagine navigating through a city without Google Maps? It feels like an alien concept! We have no sense of direction and all paths seem to lead away from where we want to go.
That’s often what the first data science project feels like. I can personally attest to this, and I know most data science enthusiasts are caught like a deer in the headlights when they’ve based their learning entirely on online courses.
Building a machine learning model in Python is great – but doing that in the industry is a different kettle of fish altogether. If you think that learning Python and the basics of machine learning will land you your first data science project or make you a data science rockstar, you’re in for a shock.

For me, this reality hit home when I joined an organization as a data scientist. Building a machine learning model was not enough anymore (not even close). There were tons of other things, such as data collection, cleaning, exploration, and a lot more tough work, which I had earlier ignored.
A few things I realized quickly – problem-solving skills, creativity, a structured thinking approach, and good storytelling skills will be more helpful than just applying a novel algorithm. Trust me, don’t take this lightly!
In this article, I will share 10 key points that I wish I had known when I started my data science career. I hope they help you on your own data science journey.
There is a big difference between the data science we learn in courses and self-practice and the data science we practice in the industry. I’d recommend going through free courses on analytics, machine learning, and artificial intelligence to get the basics crystal clear.
1. Hypothesis Generation is More Important Than You Think
Oh boy – if I could shout this from the rooftops, I would scream at the top of my lungs. Hypothesis generation is such a crucial step in a data science project. And yet almost all data science newcomers are ill-prepared for it.
The almighty question at the beginning of any data science project should be – what is the hypothesis behind your analysis?
Simply put, a hypothesis is an analyst’s possible view or assertion about the problem he or she is working on. It may or may not be true.
If you go with a non-hypothesis-driven approach, you’ll be bound to analyze hundreds or even thousands of variables without any prior knowledge. This is an extremely hard task for an analyst, right?
A hypothesis-driven approach is much more productive. You’ll first form a hypothesis or an assumption and then note down the potential variables you’ll need for the analysis. These variables may or may not be available. After this activity, you’ll finally go through the data and select the required variables. If a variable is not available, you can opt for feature engineering or find new ways to collect the data.
This hypothesis is the base of your whole project, so don’t hesitate to put in the time and effort, and to ask your team members for help. In the industry, you’ll be working with several teams to come up with these hypotheses.
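To make the hypothesis-driven workflow a bit more concrete, here is a minimal sketch in plain Python. The hypotheses, variable names, and dataset columns are all made up for illustration (a hypothetical churn problem); the point is only to show the habit of writing down what each hypothesis needs and checking it against what you actually have.

```python
# Hypothetical example: the hypotheses, variable names, and columns below
# are invented for illustration and do not come from a real project.
hypotheses = {
    "Customers who contact support often are more likely to churn": [
        "support_tickets", "churned",
    ],
    "Customers on long contracts churn less": [
        "contract_months", "churned",
    ],
}

# Columns that actually exist in the (hypothetical) dataset.
available_columns = {"customer_id", "support_tickets", "churned", "signup_date"}


def check_variables(hypotheses, available_columns):
    """Split the variables each hypothesis needs into available and missing."""
    report = {}
    for hypothesis, needed in hypotheses.items():
        report[hypothesis] = {
            "available": [v for v in needed if v in available_columns],
            "missing": [v for v in needed if v not in available_columns],
        }
    return report


report = check_variables(hypotheses, available_columns)
for hypothesis, status in report.items():
    print(hypothesis)
    print("  available:", status["available"])
    # Missing variables are candidates for feature engineering
    # or for finding new ways to collect the data.
    print("  missing:  ", status["missing"])
```

Anything that lands in the "missing" list is where you would think about feature engineering or new data collection before touching a model.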

2. Knowledge of Data Science Tools is Good; the Ability to Break Down Business Problems is Priceless
Data science tools will come and go, but the basics will stick forever.
There is an endless number of tools out there to build your data science project. Tools like SPSS and SAS had their golden time, and now R and Python have taken over the limelight. Julia is now being talked about as a challenger to both of them. The competition never ends.
Learning a tool takes the least time, but learning about the domain and business problems can take years of experience. Domain knowledge will help you in hypothesis generation, data analysis, feature engineering, and finally in conveying the results as a great story to the stakeholders.
Let’s say you joined an e-commerce company as a data scientist. You are part of the team tasked with building a recommendation engine for their retail products. If you have no idea how the business works or what the different variables at play are, how in the world will you proceed?
You need to work on understanding the business, what the different aspects of the business are, what exactly the problem is, and then break that down into a DATA PROBLEM. Your structured thinking skills will help you out massively here.
3. Be Prepared To Do a LOT of Data Cleaning
Data Cleaning is the task that can “make or break” your whole analysis.
Data is the crux of all problem solving and analysis. If you feed dirty data into your model, it will spit out useless results. Therefore, you should not shy away from spending time making your data rich in value.
While starting out, we usually practice on simple datasets that are publicly available, but these are as far away from real-world data as you can imagine. The industry isn’t a hackathon setting where you’ll get mostly clean data with well-defined outcomes. You will need to do all of this as a team (or by yourself), including spending A LOT of time on data cleaning.
The most common data cleaning activities include missing value imputation, outlier treatment, encoding categorical features, etc. These may sound rudimentary, but they can literally make or break your data science project.
Real-world data may contain errors that are unique to the dataset, which you may have to fish out using manual rules. An efficient data scientist never skips eyeballing the data. 🙂
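Here is a rough sketch of the three cleaning activities mentioned above: median imputation, outlier treatment, and encoding a categorical feature. It uses only the Python standard library so it runs anywhere; in practice you would most likely reach for a library such as pandas. The records, column names, the IQR clipping rule with k=1.5, and the "unknown" label for missing categories are all assumptions made for this example, not the one right way to clean data.

```python
from statistics import median, quantiles

# Hypothetical raw records: None marks a missing value, 950 is a likely typo.
ages = [25, None, 31, 29, 950, 35, 28, 33, 30, 27]
cities = ["Pune", "Delhi", "Pune", None, "Mumbai",
          "Delhi", "Pune", "Mumbai", "Delhi", "Pune"]
rows = [{"age": a, "city": c} for a, c in zip(ages, cities)]


def impute_median(rows, column):
    """Replace missing numeric values with the median of the observed ones."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def clip_outliers_iqr(rows, column, k=1.5):
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest bound."""
    q1, _, q3 = quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    for r in rows:
        r[column] = min(max(r[column], low), high)


def one_hot_encode(rows, column, missing_label="unknown"):
    """Replace a categorical column with one 0/1 indicator per category."""
    for r in rows:
        if r[column] is None:
            r[column] = missing_label
    categories = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for category in categories:
            r[f"{column}_{category}"] = int(value == category)


impute_median(rows, "age")
clip_outliers_iqr(rows, "age")
one_hot_encode(rows, "city")

for r in rows:
    print(r)
```

Note that the order matters here: the missing age is filled in before the outlier bounds are computed. Whether to clip, drop, or manually correct a value like 950 is a judgment call that depends on the dataset.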
4. Fail to Explore; Prepare to Fail
Data Exploration is the most underrated step in data science.
The most crucial step that beginners miss out on is simply data exploration. It is fundamental to the process of data analysis and it can help you gain crucial insights at the beginning of your data science project.
Data exploration is usually the first step in any kind of data analysis. This activity helps you understand the dataset at a broader level. It helps uncover patterns and characteristics that are usually hidden in plain sight.
A good data exploration exercise will bring out information about the variables as well as their relationships and their effect on our results. I personally find this step to be very enjoyable as you get to be the detective here and it includes a lot of visualization too!
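As a small illustration of what a first pass at exploration might look like, the sketch below summarizes each variable and then checks the relationship between two of them using the Pearson correlation. The dataset (ad spend versus sales) is invented, and the helper names `summarize` and `pearson` are just names I picked for this example. A real exploration would also include plots, which are left out here to keep the code dependency-free.

```python
from statistics import mean, median, stdev

# Hypothetical dataset: ad spend and resulting sales for six periods.
data = {
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "sales": [25, 41, 62, 79, 105, 118],
}


def summarize(values):
    """Return basic descriptive statistics for one numeric variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),
    }


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Step 1: look at each variable on its own.
for name, values in data.items():
    print(name, summarize(values))

# Step 2: look at how the variables relate to each other.
r = pearson(data["ad_spend"], data["sales"])
print("correlation(ad_spend, sales) =", round(r, 3))
```

A correlation close to 1 hints at a strong linear relationship worth investigating further, but it says nothing about cause and effect on its own.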
5. Model Deployment is the Key – Learn Software Engineering
If you don’t like coding, I have some bad news for you. And yes, there is no getting away from learning programming if you want to be successful in data science.
You have made a data model successfully. Now what?
Let’s take a moment to ponder the above question. After tons of hard work, you have finally created a model with high accuracy in your Jupyter notebook. What’s the next step? Will you just send the Jupyter notebook to your clients? What are the additional things you need to take care of?
This is a crucial roadblock that every data scientist hits in his or her new project, because as a beginner you rarely need to deploy your model. So what should you do?
It is important that you learn some basic software engineering and computer science skills. Learn everything you can about version control, how to write neat and tidy code, how to use GitHub, etc. All of this ties into your data science skillset.
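One small step in that direction is moving code out of the notebook and into plain, reusable functions. The sketch below is only an illustration under my own assumptions: the "model" is a hypothetical linear model whose parameters are stored as JSON, and the function names `save_model`, `load_model`, and `predict` are made up for this example. Real deployments involve much more (APIs, monitoring, packaging), but the idea of separating training from a clean, validated prediction function carries over.

```python
import json
import tempfile
from pathlib import Path


def save_model(params, path):
    """Persist model parameters so they can be loaded outside the notebook."""
    Path(path).write_text(json.dumps(params))


def load_model(path):
    """Load previously saved model parameters."""
    return json.loads(Path(path).read_text())


def predict(params, features):
    """Score one record with a linear model: intercept + sum(weight * value)."""
    missing = [name for name in params["weights"] if name not in features]
    if missing:
        # Fail loudly on bad input instead of returning a silent wrong answer.
        raise ValueError(f"missing features: {missing}")
    return params["intercept"] + sum(
        weight * features[name] for name, weight in params["weights"].items()
    )


# Hypothetical parameters standing in for a model trained elsewhere.
trained = {"intercept": 1.0, "weights": {"ad_spend": 2.0, "discount": -0.5}}

# Save and reload through a temporary file to mimic "train here, serve there".
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.json"
    save_model(trained, model_path)
    loaded = load_model(model_path)

prediction = predict(loaded, {"ad_spend": 10.0, "discount": 4.0})
print(prediction)
```

Functions like these can live in a version-controlled module, be covered by tests, and be called from whatever application your clients actually use.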
Reviewed by square daily updates on October 11, 2020