**#10 Calories Burnt Prediction Model**

In this article I will use an exercise and fitness dataset available on Kaggle to predict the total calories burnt by a given person for a given workout, based on variables like the ones listed in the terminology section.

### Premise

To predict the calories burnt based on gender, age, heart rate, and body temperature, among other factors.


### Data Sources

Exercise.csv – Published in 2018 by Fernando Fernandez on Kaggle to find the correlation between calories and exercise. [15000 rows and 8 columns]

Calories.csv – Published in 2018 by Fernando Fernandez on Kaggle to find the correlation between calories and exercise. [15000 rows and 8 columns]

### Terminology

**User_ID** – Unique user identity for each entry in either dataset.

**Calories** – Calories burnt by a user during the workout. (in cals)

**Gender** – Gender of the user. (Male or Female)

**Age** – Age of the user. (in years)

**Height** – Height of the user. (in cm)

**Weight** – Weight of the user. (in kg)

**Duration** – Duration of a workout. (in seconds)

**Heart_Rate** – Average heart rate during the workout. (in beats per minute)

**Body_Temp** – Average body temperature during a workout. (in Celsius)

**Calories_Data** – A combined dataframe of the calories and exercise dataframes.

## Methodology

**Part one: Libraries Used**
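A typical set of imports for this kind of analysis might look like the following (the exact library choices are my assumption of what a project like this would use):

```python
# Core data-handling and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# scikit-learn pieces used later for modelling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import r2_score, mean_absolute_error
```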

**Part Two: Setting Up Dataframe**

Let’s combine both dataframes so that we can match each workout with the total calories burnt more easily.
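A minimal sketch of that merge, using small stand-in frames (the real files are Exercise.csv and Calories.csv; the column names follow the terminology section):

```python
import pandas as pd

# Stand-in rows mimicking the two Kaggle files
exercise = pd.DataFrame({
    "User_ID": [1001, 1002, 1003],
    "Gender": ["male", "female", "male"],
    "Duration": [1740, 840, 2040],
})
calories = pd.DataFrame({
    "User_ID": [1001, 1002, 1003],
    "Calories": [231.0, 66.0, 280.0],
})

# Join on the shared User_ID so each workout lines up with its calories
calories_data = exercise.merge(calories, on="User_ID")
print(calories_data.shape)  # (3, 4)
```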

**Outliers Check**

It’s important to comb through the data to make sure that the attributes are not severely skewed by extreme values.

As it turns out, there aren’t many outliers in the data to begin with, but it’s always worth taking a look.
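One common way to flag outliers is the 1.5 × IQR rule; here is a sketch on a toy heart-rate column (the rule itself is my choice, the post doesn’t say which check was used):

```python
import pandas as pd

def iqr_outliers(series: pd.Series) -> pd.Series:
    """Return the values falling outside 1.5 * IQR of the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]

heart_rate = pd.Series([92, 95, 97, 99, 101, 103, 180])  # 180 looks suspicious
print(iqr_outliers(heart_rate).tolist())  # [180]
```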

**Null Value Check**

Likewise, let’s also check the data for null values.
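The standard pandas idiom for this is `isnull().sum()`, which counts missing values per column; a sketch on stand-in data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 31, np.nan],      # one missing age
    "Calories": [231.0, 66.0, 280.0],
})

# Count missing values in each column
print(df.isnull().sum())
```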

**Data Analysis**

**Data Visualization**

Let’s visualize the data to see if we can spot some trends. Specifically in the Gender, Age and Weight category.

We have roughly a 50/50 male-to-female distribution, which is an ideal split: the values are not skewed towards one gender in particular.

The age data, however, is skewed towards the younger population, which makes sense, as exercising becomes less of a priority as you age.

Height surprised me with a nice bell curve!

Okay, so from this tangent we learned that the data is:

- Evenly split between both genders.
- Skewed towards younger population.
- Height distribution follows a bell curve.
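The plots behind these observations could be produced along these lines (a matplotlib sketch on stand-in rows; the original may well have used seaborn instead):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "female", "male", "female"],
    "Age": [23, 27, 35, 61],
    "Height": [178.0, 165.0, 181.0, 160.0],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
df["Gender"].value_counts().plot.bar(ax=axes[0], title="Gender split")
df["Age"].plot.hist(ax=axes[1], title="Age distribution")
df["Height"].plot.hist(ax=axes[2], title="Height distribution")
fig.savefig("distributions.png")
```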

**Finding Correlation between data**

Since the whole goal of the analysis is to predict the calories burnt, we have to find the factors (other column values, really) that correlate with it the most. In data science, whenever we want to find the correlation between multiple attributes, we create a correlation heatmap.

Here we can see that Calories and Duration have the highest correlation (close to 1). This makes the duration of a workout one of the best metrics for predicting the total calories burnt.
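The heatmap boils down to a correlation matrix, which pandas computes directly; a sketch on roughly linear stand-in values:

```python
import pandas as pd

df = pd.DataFrame({
    "Duration": [780, 840, 1500, 1740, 2040],
    "Heart_Rate": [88, 90, 99, 103, 107],
    "Calories": [55.0, 66.0, 190.0, 231.0, 280.0],
})

# Pairwise Pearson correlations between all numeric columns
corr = df.corr()
print(corr["Calories"].sort_values(ascending=False))
# Duration should sit at the top, just under Calories itself
```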

**Separating features and target**
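Separating the predictors from the target might look like this (the column names come from the terminology section; the split itself is my sketch):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "User_ID": [1, 2, 3, 4],
    "Age": [23, 27, 35, 61],
    "Duration": [780, 840, 1500, 1740],
    "Calories": [55.0, 66.0, 190.0, 231.0],
})

# User_ID carries no physical signal, and Calories is the target itself
X = df.drop(columns=["User_ID", "Calories"])
y = df["Calories"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
```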

**Part Three:** **Training Algorithms**

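A sketch of training a plain linear model next to a polynomial one on synthetic data (degree 2 and these feature ranges are my assumptions, not values from the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(5, 30, size=(200, 1))       # stand-in: workout duration
y = 7.0 * X[:, 0] + 0.1 * X[:, 0] ** 2      # calories with a mild curve

# Plain linear fit vs. a degree-2 polynomial fit
linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(round(linear.score(X, y), 3), round(poly.score(X, y), 3))
```

On curved data like this, the polynomial pipeline should score at least as well as the plain linear fit.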

**Part Four: Accuracy Results**

Clearly, Polynomial Linear Regression comes out as the most accurate model by a huge margin.
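The accuracy comparison presumably rests on standard regression metrics such as R² and mean absolute error; a sketch on made-up predictions:

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true vs. predicted calories for five workouts
y_true = [55.0, 66.0, 190.0, 231.0, 280.0]
y_pred = [58.0, 64.0, 185.0, 235.0, 275.0]

print(round(r2_score(y_true, y_pred), 4))
print(round(mean_absolute_error(y_true, y_pred), 2))  # 3.8
```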

### What I’ve Learned

And that marks the end of this post. In the future, it would be interesting to turn this into an interactive model to see how turning one or two variables up or down affects the total calories burnt in real time. Heck, it might even help me plan my routine workouts better!

That aside, this was an interesting one to go through: starting from choosing a topic, finding the right dataset, and combing through the data to look for outliers. I’m rather puzzled that I was not able to apply logarithmic scaling to the values without running into problems.

Something similar happened when I tried to scale the data pipeline using StandardScaler() and MinMaxScaler() before feeding it to the models in order to increase accuracy (as I have done previously).

It’s hard to know what went wrong (or if anything went wrong at all), considering there is no output in the terminal at all. In any case, after spending plenty of time trying to troubleshoot these issues, I decided to continue without them, as I wanted to push this article out, and truth be told, I’m more than satisfied with the accuracy of the models even without these enhancements.

Well, these are problems for future me to work through, I suppose.

As always, the whole code for this post is available on my GitHub.