Filtering forum posts with NLTK machine learning

Posted by millikan at 2020-03-25

This is a classification problem over unstructured data. We implement it with NLTK's naive Bayes classifier.

First, collect posts that have been reliably judged to be spam and convert them to raw text.

Next, we need to design a feature function. For simplicity, the only feature used here is whether each word appears in the post:

    def spam_feature_func(post):
        """For simplicity, only the presence of each word is considered here."""
        # post is raw text, so split it into words before building the features
        return dict((word, True) for word in post.split())
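To make the output of the feature function concrete, here is a small sketch. It assumes that words are separated by whitespace, and the sample post is made up for illustration; real posts may need a proper tokenizer (and Chinese text would need word segmentation first).

```python
def spam_feature_func(post):
    """Map a raw-text post to {word: True} for every distinct word in it."""
    # Assumption: words are separated by whitespace.
    return {word: True for word in post.split()}


sample_post = "buy cheap pills buy now"  # made-up example post
features = spam_feature_func(sample_post)
print(features)
```

Repeated words collapse into a single key, so the feature set records only presence, not frequency.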

After that, prepare the labeled data, that is, the posts we have already identified as spam. Suppose there are 10000 spam posts, stored as: spams = list(spam_posts)

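Now prepare the training and test datasets:

    featuresets = [(spam_feature_func(post), 'spam') for post in spams]
    trainset, testset = featuresets[1000:], featuresets[:1000]

The first 1000 items are held out for testing and the rest are used for training, so the two sets do not overlap. Posts known not to be spam must be added to featuresets in the same way under a different label (for example 'ham'), and the combined list shuffled before splitting. With only one label, the classifier can only ever answer 'spam'.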
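The dataset preparation can be sketched as follows. The posts, the 'ham' label for normal posts, the fixed random seed, and the 20% test share are all assumptions chosen for the example, not values from the original setup.

```python
import random


def spam_feature_func(post):
    """Mark every whitespace-separated word in the post as present."""
    return {word: True for word in post.split()}


# Made-up stand-ins for the real labeled posts.
spams = ["win free money now", "cheap pills online", "claim your free prize",
         "earn money fast", "free gift click here"]
hams = ["meeting notes for monday", "how to install python",
        "question about the tokenizer", "thanks for the answer",
        "release schedule for next week"]

# Label both classes so the classifier has something to tell apart.
featuresets = [(spam_feature_func(p), "spam") for p in spams]
featuresets += [(spam_feature_func(p), "ham") for p in hams]

# Shuffle so that both labels appear in the training and test portions.
random.seed(0)  # fixed seed only to make the example repeatable
random.shuffle(featuresets)

# Hold out the first 20% for testing; train on the rest (no overlap).
n_test = len(featuresets) // 5
testset, trainset = featuresets[:n_test], featuresets[n_test:]
print(len(trainset), len(testset))
```

Keeping the two slices disjoint matters: accuracy measured on posts the classifier was trained on says little about how it handles new posts.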

Now start the machine learning by training the classifier:

    import nltk
    classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

Then, for a given post, predict whether it is spam:

    classifier.classify(spam_feature_func(post))

Finally, measure the prediction accuracy on the test set:

    nltk.classify.accuracy(classifier, testset)
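Putting the training, prediction, and accuracy steps together, a minimal end-to-end sketch might look like this. The tiny hand-written posts and the 'ham' label are assumptions for illustration only; a dataset this small says nothing about real-world accuracy.

```python
import nltk


def spam_feature_func(post):
    """Mark every whitespace-separated word in the post as present."""
    return {word: True for word in post.split()}


# Made-up training posts; real data would be far larger.
train_posts = [
    ("win free money now", "spam"),
    ("free pills cheap now", "spam"),
    ("cheap money win big", "spam"),
    ("claim free prize now", "spam"),
    ("meeting notes for monday", "ham"),
    ("how to install python", "ham"),
    ("question about nltk tokenizer", "ham"),
    ("thanks for the helpful answer", "ham"),
]
# Made-up held-out posts that do not appear in the training data.
test_posts = [
    ("win cheap prize", "spam"),
    ("python tokenizer question", "ham"),
]

trainset = [(spam_feature_func(p), label) for p, label in train_posts]
testset = [(spam_feature_func(p), label) for p, label in test_posts]

# Train the naive Bayes classifier on the labeled feature sets.
classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

# Predict a label for a new post, and also look at the estimated probability.
new_post = "free money now"
prediction = classifier.classify(spam_feature_func(new_post))
spam_probability = classifier.prob_classify(spam_feature_func(new_post)).prob("spam")

# Fraction of held-out posts whose predicted label matches the true label.
accuracy = nltk.classify.accuracy(classifier, testset)
print(prediction, round(spam_probability, 3), accuracy)
```

Using prob_classify instead of classify gives a probability per label, which is handy if borderline posts should go to a human moderator rather than being filtered automatically.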

As more labeled spam posts are collected and the classifier is retrained on them, the accuracy should generally keep improving.

The most important part is the implementation of the feature function. The better the features capture what distinguishes spam from normal posts, the more accurate the results will be.
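One way to check whether the features are doing their job is to ask the trained classifier which of them it relies on most. The sketch below uses NLTK's most_informative_features and show_most_informative_features; the posts and the 'ham' label are again made up for the example.

```python
import nltk


def spam_feature_func(post):
    """Mark every whitespace-separated word in the post as present."""
    return {word: True for word in post.split()}


# Made-up labeled posts, just enough to train a toy classifier.
labeled_posts = [
    ("win free money now", "spam"),
    ("free pills cheap now", "spam"),
    ("claim free prize now", "spam"),
    ("meeting notes for monday", "ham"),
    ("how to install python", "ham"),
    ("thanks for the helpful answer", "ham"),
]
trainset = [(spam_feature_func(p), label) for p, label in labeled_posts]
classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

# List of (feature_name, feature_value) pairs, most informative first.
top_features = classifier.most_informative_features(5)

# Prints a table with the likelihood ratio between labels for each feature.
classifier.show_most_informative_features(5)
```

If the top features turn out to be uninformative words, that is a hint that the feature function needs more work, for example normalizing case or dropping very common words.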

Detailed work:

Extensions: the same approach can also be used to judge whether an article is reactionary, or to sort articles into categories (for example, whether a news item belongs to sports or current affairs), and so on.
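As a rough illustration of the topic-classification extension, the same word-presence features can be trained with topic labels instead of spam labels. The headlines, the label names, and the function name topic_feature_func are hypothetical and chosen only for this sketch.

```python
import nltk


def topic_feature_func(text):
    """Mark every whitespace-separated word in the text as present."""
    return {word: True for word in text.split()}


# Made-up headlines with hypothetical topic labels.
labeled_news = [
    ("team wins the final match", "sports"),
    ("striker scores late goal", "sports"),
    ("coach praises team after match", "sports"),
    ("parliament passes new budget", "current_affairs"),
    ("minister announces election date", "current_affairs"),
    ("government debates tax budget", "current_affairs"),
]
trainset = [(topic_feature_func(t), label) for t, label in labeled_news]
classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

# Classify two unseen headlines built from words seen in training.
sports_guess = classifier.classify(topic_feature_func("team scores goal"))
affairs_guess = classifier.classify(topic_feature_func("election budget"))
print(sports_guess, affairs_guess)
```

Nothing in the training step is specific to two labels, so more topics can be added simply by including examples labeled with them.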