This is a classification problem over unstructured data. We implement it with NLTK's naive Bayes classifier.
First, prepare data that has been reliably identified as spam posts and convert it to plain text.
Then design a feature function. For simplicity, the only feature used here is whether each word appears in the post:
    def spam_feature_func(post):
        """For simplicity, only record whether each word appears."""
        return dict((word, True) for word in post)
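Because the feature function iterates over `post`, it expects a sequence of words rather than a raw string (iterating over a string yields single characters). The sketch below shows one way to get there, assuming English-like text; `tokenize` is a hypothetical helper that is not part of the original post, and Chinese text would need a proper word segmenter instead of this regex split.

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase the text and split on non-word characters."""
    return [w for w in re.split(r"\W+", text.lower()) if w]

def spam_feature_func(post):
    """Map each word of a tokenized post to True (presence-only features)."""
    return {word: True for word in post}

# Made-up example post, for illustration only.
features = spam_feature_func(tokenize("Buy cheap pills, buy now!"))
print(features)
```

Repeated words collapse into a single key, which matches the "does the word appear" feature described above.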
After that, prepare the labeled data, that is, the posts we have already identified as spam. Suppose there are 10,000 spam posts, stored in a list named spams.
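Next, prepare the training and test datasets:
    featuresets = [(spam_feature_func(post), 'spam') for post in spams]
    trainset, testset = featuresets[1000:], featuresets[:1000]
The first 1,000 entries are held out for testing and the rest are used for training, so the two sets do not overlap. A classifier trained on a single label can only ever predict that label, so posts known not to be spam should be added to featuresets with their own label in the same way.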
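As a minimal sketch of building a two-class feature set and splitting it, assuming pre-tokenized posts: the sample posts, the `hams` list, the "ham" label, and the 20% hold-out ratio are all illustrative assumptions, not part of the original post.

```python
import random

def spam_feature_func(post):
    """Presence-only word features for a tokenized post."""
    return {word: True for word in post}

# Made-up, pre-tokenized posts for illustration only.
spams = [["buy", "cheap", "pills"], ["win", "money", "now"]] * 5
hams = [["meeting", "at", "noon"], ["see", "you", "tomorrow"]] * 5

# Label both classes so the classifier has something to tell apart.
featuresets = [(spam_feature_func(p), "spam") for p in spams]
featuresets += [(spam_feature_func(p), "ham") for p in hams]

# Shuffle with a fixed seed so the split is reproducible and both
# classes end up in each subset rather than being grouped by label.
random.Random(0).shuffle(featuresets)

n_test = len(featuresets) // 5  # hold out 20% for testing (assumed ratio)
testset, trainset = featuresets[:n_test], featuresets[n_test:]
print(len(trainset), len(testset))
```

Slicing at the same index on both sides guarantees that no entry lands in both the training and the test set.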
Now train the classifier:
    classifier = nltk.classify.NaiveBayesClassifier.train(trainset)
Then, for a given post, predict whether it is spam:
    classifier.classify(spam_feature_func(post))
Finally, measure the prediction accuracy on the test set:
    nltk.classify.accuracy(classifier, testset)
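The training, prediction, and evaluation steps can be put together as follows. This is a toy sketch rather than a realistic setup: the posts are invented, the "ham" label for non-spam is an assumption, and the data is far too small to say anything about real accuracy.

```python
import nltk

def spam_feature_func(post):
    """Presence-only word features for a tokenized post."""
    return {word: True for word in post}

# Made-up, pre-tokenized posts for illustration only.
spams = [
    ["buy", "cheap", "pills", "now"],
    ["win", "free", "money", "now"],
    ["cheap", "loans", "click", "here"],
]
hams = [
    ["meeting", "moved", "to", "noon"],
    ["see", "you", "at", "lunch"],
    ["notes", "from", "the", "meeting"],
]

trainset = [(spam_feature_func(p), "spam") for p in spams]
trainset += [(spam_feature_func(p), "ham") for p in hams]

# Train the naive Bayes classifier on the labeled feature sets.
classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

# Predict the label of a new, unseen post.
label = classifier.classify(spam_feature_func(["cheap", "money"]))

# Evaluate on a small held-out set that was not used for training.
testset = [
    (spam_feature_func(["free", "cheap", "pills"]), "spam"),
    (spam_feature_func(["lunch", "meeting"]), "ham"),
]
acc = nltk.classify.accuracy(classifier, testset)
print(label, acc)
```

Words that never occurred in training are ignored at prediction time, so a post made up entirely of unseen words is decided by the label frequencies alone.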
In this way, as more spam posts are collected and the classifier is retrained on them, its accuracy should keep improving.
The most important part is the design of the feature function. The better the features capture what distinguishes spam, the more accurate the results will be.
Further refinements:
- Remove noise: filter out stop words such as "I", "we", and "here". Chinese has dozens of common stop words
- Add a thesaurus to the feature function, so that synonyms are treated as the same feature
- Add word co-occurrence to the feature function: words such as "China" and "country" usually appear together
- Combine with user characteristics, weighted accordingly: for example, if a user often posts spam, the post they are currently submitting is more likely to be spam as well
- Add other weighted features, such as time (if the proportion of spam among all posts varies with the time of posting)
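Some of the refinements above can be sketched as a richer feature function. Everything specific here is an assumption made for illustration: the tiny stop-word list, the `author_spam_ratio` and `hour` parameters, and the thresholds used to bucket them. Numeric signals are turned into yes/no features because NLTK's naive Bayes classifier treats each distinct feature value as a separate discrete outcome.

```python
# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"i", "we", "here", "the", "a"}

def refined_feature_func(words, author_spam_ratio, hour):
    """Build features from a tokenized post plus author and time signals.

    words: list of word strings
    author_spam_ratio: fraction (0.0 to 1.0) of the author's past posts
        that were flagged as spam (hypothetical input)
    hour: hour of posting, 0 to 23 (hypothetical input)
    """
    # Word presence features, skipping stop words.
    features = {w.lower(): True for w in words if w.lower() not in STOP_WORDS}
    # Bucketed metadata features; the "__" prefix keeps them from
    # colliding with ordinary word keys. Thresholds are assumptions.
    features["__author_often_spams"] = author_spam_ratio > 0.5
    features["__posted_at_night"] = hour < 6 or hour >= 22
    return features

example = refined_feature_func(["We", "sell", "pills", "here"], 0.8, 23)
print(example)
```

The returned dictionary can be passed to the classifier in place of the output of the simpler word-only feature function.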
Extensions: the same approach can also be used to judge whether an article is reactionary, or to classify articles by topic (for example, whether a news item belongs to sports or current affairs), and so on.
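As a rough sketch of the topic-classification extension, the same classifier can be trained with topic names as labels instead of a spam flag. The articles, the `word_features` helper name, and the label strings below are invented for illustration.

```python
import nltk

def word_features(words):
    """Presence-only word features, as in the spam example."""
    return {w: True for w in words}

# Made-up, pre-tokenized headlines for illustration only.
labeled_articles = [
    (["team", "wins", "final", "match"], "sports"),
    (["striker", "scores", "in", "match"], "sports"),
    (["parliament", "passes", "budget", "bill"], "current affairs"),
    (["minister", "announces", "new", "bill"], "current affairs"),
]

trainset = [(word_features(words), label) for words, label in labeled_articles]
classifier = nltk.classify.NaiveBayesClassifier.train(trainset)

# Predict the topic of a new headline.
topic = classifier.classify(word_features(["match", "final"]))
print(topic)
```

Only the labels change; the feature function and the training call stay the same as in the spam case.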