#10 Calories Burnt Prediction Model

In this article I will use an exercise and fitness dataset available on Kaggle to find a way to predict the total calories burnt by a given person during a workout, given variables like the ones listed under the Terminology section.

Premise

To predict the calories burnt based on gender, age, heart rate, and body temperature, among other factors.

Data Sources

Exercise.csv – Published on Kaggle in 2018 by Fernando Fernandez to explore the correlation between exercise and calories burnt. [15,000 rows and 8 columns]

Calories.csv – Published on Kaggle in 2018 by Fernando Fernandez as part of the same dataset. [15,000 rows and 2 columns]

Terminology

User_ID – Unique user identity for each entry in either dataset.

Calories – Calories burnt by a user during the workout. (in cals)

Gender – Gender of the user. (male or female)

Age – Age of the user. (in years)

Height – Height of the user. (in cm)

Weight – Weight of the user. (in kg)

Duration – Duration of a workout. (in seconds)

Heart_Rate – Average heart rate during the workout. (in beats per minute)

Body_Temp – Average body temperature during a workout. (in Celsius)

Calories_Data – A combined dataframe of the calories and exercise dataframes.

Methodology

Part one: Libraries Used
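The original import block isn't reproduced here, but for a project like this the library set would plausibly look something like the following sketch (matplotlib and seaborn would additionally be needed for the plots later on):

```python
# Core data-handling and modelling libraries (assumed set; the post's
# exact imports may differ).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# For visualization: import matplotlib.pyplot as plt; import seaborn as sns
```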

Part Two: Setting Up Dataframe

Let’s combine both dataframes so that we can match each workout with the total calories burnt more easily.
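A minimal sketch of that combining step, using tiny synthetic stand-ins for the two CSVs (the column names follow the Terminology section; the values are made up):

```python
import pandas as pd

# Synthetic stand-ins for exercise.csv and calories.csv.
exercise = pd.DataFrame({
    "User_ID": [1001, 1002],
    "Gender": ["male", "female"],
    "Age": [24, 31],
    "Height": [180.0, 165.0],
    "Weight": [78.0, 60.0],
    "Duration": [25.0, 14.0],
    "Heart_Rate": [102.0, 94.0],
    "Body_Temp": [40.1, 39.7],
})
calories = pd.DataFrame({"User_ID": [1001, 1002],
                         "Calories": [150.0, 66.0]})

# Match each workout with its calories via the shared User_ID key.
calories_data = exercise.merge(calories, on="User_ID")
print(calories_data.shape)  # (2, 9)
```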

Outliers Check

It’s important to comb through the data to make sure that the attributes are not severely skewed or riddled with outliers.
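One common way to flag outliers (an assumed approach; the post may have eyeballed boxplots instead) is the 1.5×IQR rule, shown here on a small synthetic column:

```python
import pandas as pd

# Synthetic sample with one obvious outlier.
df = pd.DataFrame({"Weight": [60.0, 65.0, 70.0, 72.0, 68.0, 200.0]})

# Anything beyond 1.5 * IQR from the quartiles counts as an outlier.
q1, q3 = df["Weight"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["Weight"] < q1 - 1.5 * iqr) | (df["Weight"] > q3 + 1.5 * iqr)
outliers = df[mask]
print(len(outliers))  # 1 (the 200 kg entry)
```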

As it turns out, there aren’t many outliers in the data to begin with, but it’s always worth taking a look.

Null Value Check

Likewise, let’s also check the data for null values.
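The standard pandas idiom for this check, sketched on a synthetic frame with one deliberately missing value:

```python
import numpy as np
import pandas as pd

# Synthetic sample; the real check would run on the combined dataframe.
df = pd.DataFrame({"Age": [24.0, np.nan, 31.0],
                   "Calories": [150.0, 66.0, 98.0]})

# Count missing entries per column.
nulls = df.isnull().sum()
print(nulls)
```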

Data Analysis

Data Visualization

Let’s visualize the data to see if we can spot some trends, specifically in the Gender, Age and Height categories.
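The counts behind those plots can be computed directly; here is a sketch on synthetic data (in the post, seaborn's countplot and a histogram would render these on the real combined dataframe):

```python
import numpy as np
import pandas as pd

# Synthetic sample standing in for the real combined dataframe.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Gender": rng.choice(["male", "female"], size=1000),
    "Age": rng.integers(20, 80, size=1000),
})

# Gender split behind the countplot.
split = df["Gender"].value_counts(normalize=True)
print(split.round(2))

# Age distribution behind the histogram, bucketed by decade.
by_decade = pd.cut(df["Age"], bins=range(20, 90, 10)).value_counts().sort_index()
print(by_decade)
```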

We have roughly a 50–50 male-to-female distribution, which is an ideal split: the values are not skewed towards one gender in particular.

The age data, however, is skewed towards the younger population, which makes sense as exercising becomes less of a priority with age.

Height surprised me with a nice bell curve!

Okay, so from this tangent we learned that the data is:

  • Evenly split between both genders.
  • Skewed towards younger population.
  • Height distribution follows a bell curve.

Finding Correlation between data

Since the whole goal of the analysis is to predict calories burnt, we have to find the factors (really, the other column values) that correlate with it the most. In data science, whenever we want to find the correlation between multiple attributes, we create a correlation heatmap.
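The matrix behind such a heatmap comes straight from pandas; a sketch on synthetic data where Calories is constructed to track Duration (seaborn's `sns.heatmap(corr, annot=True)` would then render it):

```python
import numpy as np
import pandas as pd

# Synthetic data: Calories is built to depend strongly on Duration.
rng = np.random.default_rng(1)
duration = rng.uniform(1, 30, size=500)
df = pd.DataFrame({
    "Duration": duration,
    "Heart_Rate": 80 + 2 * duration + rng.normal(0, 5, size=500),
    "Calories": 7 * duration + rng.normal(0, 3, size=500),
})

# Pairwise Pearson correlations; this is the heatmap's underlying matrix.
corr = df.corr()
print(corr["Calories"].round(2))
```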

Here we can see that Calories and Duration have the highest correlation (nearly 1). This makes the duration of a workout one of the best metrics for predicting the total calories burnt by a workout.

Separating features and target
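A minimal sketch of that split, on a synthetic frame with the same columns (the identifier and the target are dropped from the feature matrix):

```python
import pandas as pd

# Synthetic combined dataframe (Gender assumed already encoded as 0/1).
df = pd.DataFrame({
    "User_ID": [1, 2], "Gender": [0, 1], "Age": [24, 31],
    "Height": [180.0, 165.0], "Weight": [78.0, 60.0],
    "Duration": [25.0, 14.0], "Heart_Rate": [102.0, 94.0],
    "Body_Temp": [40.1, 39.7], "Calories": [150.0, 66.0],
})

# Features: everything except the ID and the target.
X = df.drop(columns=["User_ID", "Calories"])
y = df["Calories"]
print(X.shape, y.shape)  # (2, 7) (2,)
```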

Part Three: Training Algorithms

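The post's full model roster isn't shown here; as a sketch, here is how plain linear regression and polynomial regression (the eventual winner) could be trained side by side on synthetic data standing in for X and y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic regression data with an interaction term that a plain
# linear model cannot capture but a degree-2 polynomial can.
rng = np.random.default_rng(2)
X = rng.uniform(1, 30, size=(400, 2))  # e.g. Duration, Heart_Rate
y = 7 * X[:, 0] + 0.1 * X[:, 0] * X[:, 1] + rng.normal(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "polynomial": make_pipeline(PolynomialFeatures(degree=2),
                                LinearRegression()),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))
```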
Part Four: Accuracy Results

Clearly, Polynomial Linear Regression comes out as the most accurate model by a huge margin.
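For reference, the metrics that typically back such an accuracy comparison for regression models are MAE and R²; a sketch with hypothetical predictions (the numbers are illustrative, not the post's actual results):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true vs. predicted calorie values.
y_true = np.array([150.0, 66.0, 98.0, 210.0])
y_pred = np.array([148.0, 70.0, 95.0, 205.0])

print("MAE:", mean_absolute_error(y_true, y_pred))   # MAE: 3.5
print("R^2:", round(r2_score(y_true, y_pred), 4))
```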

What I’ve Learned

And that marks the end of this post. In the future, it would be interesting to turn this into an interactive model to see how turning one or two variables up and down affects the total calories burnt in real time! Heck, it might even help me plan my routine workouts better!

That aside, this was an interesting one to go through: starting from choosing a topic, to finding the right dataset, to combing through the data for outliers. I’m rather puzzled that I was not able to apply logarithmic scaling to the values without running into problems.

Nothing changed before and after applying it.

Something similar happened when I tried to scale the data using StandardScaler() and MinMaxScaler() before feeding it to the models in order to increase accuracy (as I have done previously).

The terminal was stuck at this…

It’s hard to know what went wrong (or if anything went wrong at all) considering there is no output in the terminal at all. In any case, after spending plenty of time troubleshooting, I decided to continue without these steps, as I wanted to push this article out and, truth be told, I’m more than satisfied with the accuracy of the models even without these enhancements.

Well, these are problems for future me to work through, I suppose.

As always, the whole code for this post is available on my GitHub.
