Design Python code for text pre-processing (a) Parsing and tokenizing - read files from RCV1v2, find the documentID and record it to a collection of BowDocument Objects.

data mining

Description

Required to be submitted: 

1. Save your output to a text or Word file for each question (named with your full name and the question number, e.g., Yuefeng_Li_Q2a.txt) and put all code into a folder (e.g., Yuefeng_Li_Q2a). Then zip all text files and folders into a single file named “student ID_Surname_Asm1.zip”.

2. Submit your zip file for this assignment on Blackboard (BB) before 11:59 pm on 24 April 2020.

3. Answer all four questions (10 sub-questions).

4. All sub-questions are worth 2 marks each.


Data (RCV1v2 document collection) 

• You will be working with a sample dataset: a small subset of just 10 documents from the RCV1v2 document collection, provided in pre-tokenized form (for convenience and for copyright reasons). The dataset can be downloaded from Blackboard.


Question 1. Design Python code for text pre-processing (a) Parsing and tokenizing - read files from RCV1v2, find the documentID and record it to a collection of BowDocument Objects. 

• The documentID is simply the value of the ‘itemid’ in 

• In this task, each BowDocument can be initialized with the documentID that was found and an empty dictionary of key-value pairs (String term: int frequency).

• Build a collection of BowDocument objects for the given dataset. This collection can be a dictionary (or a linked list or other data structure; note that the remaining descriptions assume the dictionary structure) with documentID as key and the BowDocument object as value.

• Create a method (or function) that prints out all documentIDs by iterating over the above collection and calling BowDocument’s method getDocId().

• Tokenizing – fill term:freq dictionary for each document.
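
The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a model answer: it assumes each file carries the ID as an attribute written like itemid="1234", that markup looks like <tag ...> and can be stripped before counting, and that a token is a lower-cased run of ASCII letters. The names add_term, parse_document, build_collection and print_doc_ids are hypothetical; only BowDocument and getDocId() come from the task description.

```python
import re
from pathlib import Path


class BowDocument:
    """Bag-of-words document: an ID plus a term -> frequency dictionary."""

    def __init__(self, doc_id):
        self.doc_id = doc_id
        self.terms = {}  # str term -> int frequency, initially empty

    def getDocId(self):
        return self.doc_id

    def add_term(self, term):
        # Hypothetical helper: increment the count for one term.
        self.terms[term] = self.terms.get(term, 0) + 1


# Assumption: the ID appears as an attribute of the form itemid="1234".
ITEMID_RE = re.compile(r'itemid="(\d+)"')
# Assumption: markup looks like <tag ...> and is removed before tokenizing.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a token is a run of ASCII letters in lower-cased text.
TOKEN_RE = re.compile(r"[a-z]+")


def parse_document(raw):
    """Build one BowDocument from a file's raw text; None if no ID is found."""
    match = ITEMID_RE.search(raw)
    if match is None:
        return None
    doc = BowDocument(match.group(1))
    text = TAG_RE.sub(" ", raw).lower()
    for token in TOKEN_RE.findall(text):
        doc.add_term(token)
    return doc


def build_collection(folder):
    """Parse every file in `folder` into {documentID: BowDocument}."""
    collection = {}
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            raw = path.read_text(encoding="utf-8", errors="ignore")
            doc = parse_document(raw)
            if doc is not None:
                collection[doc.getDocId()] = doc
    return collection


def print_doc_ids(collection):
    """Print every documentID by calling getDocId() on each BowDocument."""
    for doc in collection.values():
        print(doc.getDocId())
```

With the sample files unpacked into a folder (the folder name is up to you), calling print_doc_ids(build_collection(folder)) would list the IDs. The document ID is kept as a string here; converting it to int is an equally reasonable choice.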

