How to do a Project by Mayank Kaushal
Someone rightly said, “It is always good to have hands-on experience, or practical knowledge, alongside theoretical knowledge to understand the concepts of a domain better.” This is where hands-on projects come in. They help you understand a concept in practice. Projects also give you an idea of the kind of work you will be doing when you join a firm. For these reasons, you should work on projects and mention some of them in your resume. The interviewer will then know that you have practical knowledge of the field and that your concepts are clear.
So, below are the points to consider while doing a project.
- Areas of interest
First of all, decide on the field in which you wish to do your project.
It may be finance related, such as fraud risk detection or the prediction of insurance cost. You can also look at client-based problems. For example, let’s say you have to predict customer satisfaction for an airline company.
One more important thing: keep in mind the statistical concept you want to use. Let’s say you wish to use linear regression; then you will have to look for data whose labels (the target variable) take continuous values.
- Data for the chosen problem
The next important thing is to select the data for your problem statement. There are several ways to do this. First, you can collect the data through surveys, for example by sharing a Google Form on multiple platforms. In this case you will have to clean the raw data after collecting it. Make sure the data has all the information required to solve your problem. Second, you can pick a dataset from a platform like Kaggle: either download the data and work on it locally, or import it and work on the platform itself. Third, you can write to a firm and ask for a dataset related to your problem statement. There may be privacy issues, but it is always worth a try.
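Whichever route you take, it helps to confirm early that the data really has the columns your problem needs. Here is a minimal sketch in Python with pandas. The column names and the two rows of data are made up for illustration (loosely modelled on an insurance cost problem), and `missing_required_columns` is a hypothetical helper, not part of pandas.

```python
import io

import pandas as pd

# Stand-in for a downloaded CSV file; in practice you would pass a file path
# to pd.read_csv. The columns and values here are invented for illustration.
raw_csv = io.StringIO(
    "age,bmi,smoker,charges\n"
    "19,27.9,yes,16884.92\n"
    "18,33.8,no,1725.55\n"
)
df = pd.read_csv(raw_csv)


def missing_required_columns(data: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from the data."""
    return sorted(set(required) - set(data.columns))


# An empty list means every column needed for the problem is present.
print(missing_required_columns(df, ["age", "bmi", "smoker", "charges"]))
# A non-empty list names what still has to be collected or found elsewhere.
print(missing_required_columns(df, ["age", "region"]))
```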
- Programming Language
You can use either Python or R. There is always a huge discussion about which is best. In my opinion, you should be a pro in one of them and have basic knowledge of the other. As Python is popular nowadays, you can start with Python and move to R after that.
- Exploratory Data Analysis
Once you have your data, the next thing to do is check for missing observations. There is always a discussion about whether to remove missing observations or replace them, so I will try to clear it up to some extent. Let’s say we have data with 1.3 lakh (130,000) rows, of which 1,000 rows have missing observations. Here we can remove them, because so few rows are affected (under 1% of the total) that it is unlikely to change your analysis. On the other hand, let’s say you have data with 6,000 rows, of which 2,000 have missing observations. Then you can’t delete them, as you would lose a third of the data and this may lead to wrong analysis. So what to do in this case?
You can replace the missing observations with an average of the known observations in that column: the mean or median for a continuous variable, and the mode for a categorical one. For time series data, you can also replace them with the last known value before the gap or the first known value after it.
The next important thing is outliers. If outliers are present in the data and you don’t deal with them, they can lead to wrong analysis. You can detect them with the help of a boxplot, scatter plot, etc. After detecting them, you can remove or retain them depending on the requirements. Let’s say removing the outliers would itself lead to wrong analysis; then you can retain them after standardizing, normalising or taking a log transformation of the variable. So one has to deal with outliers very carefully.
Next, your data can have many feature variables (columns), and not all of them may be important for the analysis. You can either remove some of them on the basis of their p-values, or derive a new variable from two or three existing feature variables. For example, from weight and height you can compute BMI.
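The drop-or-replace decision described above can be sketched in Python with pandas as follows. The 5% cut-off, the helper name `handle_missing` and the toy data are my assumptions for illustration; the right threshold depends on your data. The sketch also assumes every column has at least one known value.

```python
import numpy as np
import pandas as pd


def handle_missing(df: pd.DataFrame, max_drop_share: float = 0.05) -> pd.DataFrame:
    """Drop incomplete rows when they are rare; otherwise impute column by column.

    max_drop_share is an illustrative threshold, not a fixed rule.
    """
    share = df.isna().any(axis=1).mean()  # fraction of rows with any gap
    if share <= max_drop_share:
        return df.dropna()
    filled = df.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            # Continuous column: use the median of the known values.
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # Categorical column: use the most frequent known value (mode).
            filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
    return filled


# Toy survey data: half of the rows have a gap, so they are imputed, not dropped.
survey = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, 40.0],
    "city": ["Delhi", "Pune", None, "Delhi"],
})
cleaned = handle_missing(survey)

# Time series: carry the previous known value forward, or the next one backward.
ts = pd.Series(
    [10.0, np.nan, np.nan, 13.0],
    index=pd.date_range("2024-01-01", periods=4, freq="D"),
)
forward = ts.ffill()
backward = ts.bfill()
```

Using the median rather than the mean for numeric columns is a design choice here: the median is less affected by extreme values.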
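A boxplot flags points that lie more than 1.5 times the interquartile range beyond the quartiles, and the same rule can be applied in code. Below is a minimal sketch in Python with pandas and NumPy. The claim amounts are invented, and `iqr_outlier_mask` is a hypothetical helper name. It shows both options: removing the flagged values, or keeping every row and taking a log transformation.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot rule."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented claim amounts with one extreme value.
claims = pd.Series([120.0, 135.0, 128.0, 140.0, 131.0, 2500.0])
mask = iqr_outlier_mask(claims)

# Option 1: remove the flagged observations.
trimmed = claims[~mask]

# Option 2: keep every row but compress large values with log(1 + x).
logged = np.log1p(claims)
```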
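The BMI example can be sketched in Python with pandas as below. The column names and units (kilograms and centimetres) are assumptions for illustration. BMI is weight in kilograms divided by the square of height in metres.

```python
import pandas as pd

# Hypothetical data; the column names and units are assumed.
people = pd.DataFrame({
    "weight_kg": [70.0, 85.0],
    "height_cm": [175.0, 160.0],
})

# BMI = weight (kg) / height (m) squared.
height_m = people["height_cm"] / 100
people["bmi"] = people["weight_kg"] / height_m ** 2

# The derived column can stand in for the two original ones.
features = people.drop(columns=["weight_kg", "height_cm"])
```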
- Fitting of the model
After EDA, you can try one of the ML algorithms suited to the type of problem. If you are not satisfied with the one you used, you can always try others to see whether your prediction scores improve.
- The statistical concepts you use in the project matter a lot for interviews. You should select topics that are much talked about, such as regression, time series, etc.
There are some benefits of doing this: in a half-hour interview, up to 10 minutes can revolve around these topics alone, and if you have done your project on them, it will be the icing on the cake. I am saying this from my own experience.
- The next important thing is that you should know, very well and in advance, the concepts you have used in your project and mentioned in your resume, as most of the time the questions will revolve around these concepts only.
- These are the points that helped me, and I think they will help you a lot. That said, it is not necessary to go this way only; you can always do it your own way.
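One way to compare algorithms fairly is to score each of them with the same cross-validation splits. The sketch below uses Python with scikit-learn. The synthetic data, the two candidate models and the choice of R² as the score are my assumptions for illustration; for your own project you would plug in your data and the models suited to your problem type.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption; use your own features and target).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Candidate models to compare; any regressor with fit/predict would work here.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Mean R^2 over 5 cross-validation folds for each candidate.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("best model:", best)
```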
For any suggestions, please reach out to us on LinkedIn. You can also schedule a meeting by visiting the Contact page.
Cheers and Best!