Overall statistics for the articles in the dataset (including the total number of articles)

data mining

Description

 

1.     Compute overall statistics for the articles in the dataset (including the total number of articles)

2.     Build a custom corpus

3.     Remove stopwords and web links from the dataset articles. In general, apply standard text preprocessing to the corpus

4.     After preprocessing, build term-frequency vectors

5.     Apply TF-IDF to find the highest-weighted (most characteristic) words

6.     Build a distributed representation of each article (because the articles differ in length, each one needs a fixed-length vector)

7.     Cluster the articles by text similarity

8.     Topic modelling (to summarize the general themes): use either latent Dirichlet allocation (LDA) or latent semantic analysis (LSA)
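
As a rough illustration of steps 1 to 5 and the similarity measure behind step 7, here is a minimal sketch using only the Python standard library. The three sample articles, the tiny stopword list, and all function names are assumptions made for this example, not part of the assignment. Steps 6 and 8 (distributed representations and LDA/LSA) normally rely on a dedicated library and are not covered here.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real project would use a fuller one).
STOPWORDS = {"a", "an", "and", "are", "as", "for", "in", "is", "of", "on", "the", "to"}
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def preprocess(text):
    """Step 3: lowercase, strip web links, tokenize, drop stopwords."""
    text = URL_RE.sub(" ", text.lower())
    return [tok for tok in TOKEN_RE.findall(text) if tok not in STOPWORDS]


def corpus_stats(token_lists):
    """Step 1: overall counts for the corpus."""
    total = sum(len(toks) for toks in token_lists)
    vocab = {tok for toks in token_lists for tok in toks}
    return {"articles": len(token_lists), "tokens": total, "vocabulary": len(vocab)}


def tfidf_vectors(token_lists):
    """Steps 4-5: term counts per article, then TF-IDF weights (sparse dicts)."""
    n_docs = len(token_lists)
    # Document frequency: in how many articles each term appears.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)  # raw frequency vector for one article
        vectors.append({
            tok: (cnt / len(toks)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return vectors


def cosine(u, v):
    """Building block for step 7: cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Step 2: a made-up three-article corpus, for illustration only.
articles = [
    "The cat sat on the mat. See https://example.com/cats for more.",
    "A cat and a dog are in the garden.",
    "Stock markets fell as investors sold shares.",
]
tokens = [preprocess(a) for a in articles]
stats = corpus_stats(tokens)
vectors = tfidf_vectors(tokens)

print(stats)
print("similarity(0, 1):", cosine(vectors[0], vectors[1]))
print("similarity(0, 2):", cosine(vectors[0], vectors[2]))
```

Articles 0 and 1 share the word "cat", so their cosine similarity is positive, while articles 0 and 2 share no terms and score zero. A clustering algorithm for step 7 would group articles using pairwise similarities of this kind.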

