Choose one category from the Amazon product review data at http://jmcauley.ucsd.edu/data/amazon/.
Use at least 25,000 reviews (if the category contains more than 25,000 reviews).
Labeling rule for the dataset:
[overall > 3.0] - positive
[overall <= 3.0] - negative
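The labeling rule above can be sketched as a small function. This is only an illustration: the function name label_review and the sample ratings are hypothetical, and the rating is assumed to be available as a number taken from the overall field.

```python
def label_review(overall: float) -> str:
    """Map a star rating to a sentiment label using the rule above."""
    # Ratings strictly above 3.0 are positive; 3.0 and below are negative.
    return "positive" if overall > 3.0 else "negative"


# Hypothetical ratings, for illustration only.
ratings = [5.0, 4.0, 3.0, 1.0]
labels = [label_review(r) for r in ratings]
print(labels)
```

Note that a rating of exactly 3.0 falls in the negative class under this rule.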
Module - 1 (Statistics):
Explain the text processing pipeline you adopted.
Generate term statistics:
Vocabulary size with word frequencies
Verify Zipf’s law – what is the best fit for your corpus?
Which set of terms best describes your corpus? How did you arrive at it?
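One possible way to approach the term statistics and the Zipf check is sketched below, using only the standard library. The tokenizer (lowercasing plus a simple regular expression), the helper names tokenize, term_frequencies and fit_zipf, and the toy documents are all assumptions made for illustration, not a prescribed pipeline. The fit is an ordinary least-squares line through log(frequency) against log(rank); other fitting methods may suit your corpus better.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed, deliberately simple tokenizer: lowercase, keep letter/apostrophe runs.
    return re.findall(r"[a-z']+", text.lower())


def term_frequencies(docs):
    # Count every token across all documents.
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def fit_zipf(counts):
    """Least-squares fit of log(freq) = log(C) - s * log(rank); returns (s, C)."""
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Toy documents, for illustration only.
docs = ["great phone great battery", "battery died fast", "great value"]
counts = term_frequencies(docs)
print("vocabulary size:", len(counts))
print("most common terms:", counts.most_common(3))
s, C = fit_zipf(counts)
print("fitted Zipf exponent:", s)
```

An exponent s close to 1 corresponds to the classical form of Zipf's law; the value observed on a real review corpus should be reported together with a log-log plot of frequency against rank.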
Module - 2 (Sentiment Analysis using statistical NLP):
Use the following vector space models
Any external vectorizer (cite the original paper).
Perform sentiment analysis with each of (a, b, c) using classical ML techniques:
Naive Bayes Model.
Report metrics [accuracy, F1 score, confusion matrix] for all the combinations in (1 and 2).
Analyse the results. [Report clearly which vector space model gives better results with each model used.]
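To make the Naive Bayes step and the requested metrics concrete, here is a minimal standard-library sketch. It assumes a plain bag-of-words representation with whitespace tokenization and add-one smoothing; the function names train_nb, predict_nb and evaluate, and the toy reviews, are hypothetical. In an actual submission a library implementation and the vector space models required above would normally be used instead.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Collect the counts needed for multinomial Naive Bayes."""
    class_docs = Counter(labels)            # documents per class
    word_counts = defaultdict(Counter)      # word counts per class
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.lower().split())
    vocab = {w for c in word_counts.values() for w in c}
    return class_docs, word_counts, vocab


def predict_nb(model, doc):
    """Return the class with the highest log-posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        total_words = sum(word_counts[label].values())
        score = math.log(class_docs[label] / total_docs)
        for w in doc.lower().split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def evaluate(y_true, y_pred, positive="positive"):
    """Accuracy, F1 for the positive class, and [[tn, fp], [fn, tp]]."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "f1": f1,
            "confusion": [[tn, fp], [fn, tp]]}


# Toy reviews, for illustration only.
train_docs = ["great product love it", "works great",
              "terrible waste of money", "broke after a week terrible"]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = ["love it great", "terrible product broke"]
test_labels = ["positive", "negative"]

model = train_nb(train_docs, train_labels)
predictions = [predict_nb(model, d) for d in test_docs]
metrics = evaluate(test_labels, predictions)
print(predictions, metrics)
```

The same evaluate helper can be reused for every vectorizer and classifier combination, so that the reported numbers are computed in one consistent way.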
Module - 3 (Topic analysis and topic (attribute) wise sentiment analysis):
Extract the topics from the reviews using any topic extraction technique of your choice.
Report sentences under each topic.
Analyse whether the topics extracted make sense. Justify your claim with some examples.
Report the topic-wise sentiment distribution for the whole repository. Explain the method that you used. Give the complete reference for any paper that you use for this purpose.
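Once each sentence (or review) has been assigned a topic and a sentiment label by whatever methods you choose, the topic-wise distribution reduces to a grouped count. The sketch below shows only that aggregation step; the function name topic_sentiment_distribution, the topic names and the sample assignments are hypothetical, and the topic extraction itself is assumed to have been done elsewhere.

```python
from collections import Counter, defaultdict


def topic_sentiment_distribution(assignments):
    """assignments: iterable of (topic, sentiment) pairs.

    Returns, for each topic, the share of each sentiment label.
    """
    counts = defaultdict(Counter)
    for topic, sentiment in assignments:
        counts[topic][sentiment] += 1
    return {
        topic: {label: n / sum(c.values()) for label, n in c.items()}
        for topic, c in counts.items()
    }


# Hypothetical (topic, sentiment) assignments, for illustration only.
assignments = [
    ("battery", "positive"),
    ("battery", "negative"),
    ("battery", "negative"),
    ("screen", "positive"),
]
dist = topic_sentiment_distribution(assignments)
for topic, shares in dist.items():
    print(topic, shares)
```

Reporting raw counts alongside the shares helps the reader judge how much evidence supports each topic.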
Submit a .zip file containing all the working code (.py files). The zip file should be named in the format <RollNo1_RollNo2_RollNo3>.zip.
Submit a report which should contain:
A detailed description of everything you have done,
Links to the Google Colab files,
A clear statement of the contribution of each group member.
Copying from the Internet and/or your classmates is strictly prohibited. Any team found guilty will be awarded a suitable penalty as per IIT rules.