New DG Food Agro are a multinational exporter of various grains from India since nearly 130 years. But their main product of exporting since early 1980s has been Wheat. They export wheat to countries like America, Afghanistan, Australia etc.
They started seeing varying exports of sales year on year for various countries. The reason that was theorized by them
had a lot of natural causes like floods, country growth, population explosion etc. Now they need to decide which countries fall in the same range of export and which don’t. They also need to know which countries export is low and can be improved and which countries are
performing very well across the years.
The data provided right now is across 18 years. What they need is a repeatable solution
which won’t get affected no matter how much data is added across time and that they should
be able to explain the data across years in less number of variables.
Objective: Our objective is to cluster the countries based on various sales data provided to us
across years. We have to apply an unsupervised learning technique like K means or
Hierarchical clustering so as to get the final solution. But before that we ha
ve to bring the
exports (in tons) of all countries down to same scale across years. Plus, as this solution needs
to be repeatable we will have to do PCA so as to get the principal components which explain
Read the data file
and check for any missing values
Change the headers to country and year accordingly.
Cleanse the data if required and remove null or blank values
After the EDA part is done, try to think which algorithm should be applied here.
As we need to make this acro
ss years we need to apply PCA first.
Apply PCA on the dataset and find the number of principal components which explain
nearly all the variance.
Plot elbow chart or scree plot to find out optimal number of clusters.
Then try to apply K means, Hierarchical
clustering and showcase the results.
You can either choose to group the countries based on years of data or using the
Then see which countries are consistent and which are largest importers of the good
based on scale and position o