This article takes spam recognition as an example to introduce common text-processing methods and the machine learning algorithms commonly used with them. The first half introduces the data set used for spam recognition and the feature extraction methods applied to it, including the bag-of-words model, the TF-IDF model and the vocabulary model. The second half introduces the models used and the corresponding validation results, including naive Bayes, support vector machines and deep learning.
Preface
Spam, one of the most controversial by-products of the Internet, affects enterprise mailbox users. First, it adds an extra burden to daily office work and to mailbox administrators. According to incomplete statistics, even in an efficient anti-spam environment 80% of users still need to spend about 10 minutes a week dealing with spam, while for most Chinese enterprise email deployments, which still run in an inefficient anti-spam environment, this proportion increases by dozens of times. As shown in Figure 1-1, the total amount of spam originating from China has reached third place in the world. For enterprise mail service providers, the malicious delivery of spam also occupies a large amount of network resources, so that up to 85% of a mail server's system resources are spent on identifying spam. Not only is this an extremely serious waste of resources, it may also lead to network congestion and paralysis, affecting the normal business mail communication of enterprises.
Figure 1-1: The world's most spammed countries
Data set
The data set used for spam recognition is the Enron-Spam data set, the most widely used public data set in e-mail-related research. Its e-mail data comes from 150 senior managers of Enron Corporation (originally one of the world's largest integrated natural gas and power companies, and the number one wholesaler of natural gas and power in North America). The emails were posted online while Enron was under investigation by the Federal Energy Regulatory Commission.
In the field of machine learning, the Enron-Spam data set is used to study document classification, part-of-speech tagging, spam recognition and so on. Because it consists of real mails from a real environment, it is of great practical significance.
Enron spam dataset home page
The Enron-Spam data set uses different folders to distinguish normal mail from spam, as shown in the figure.
Enron spam dataset folder structure
Examples of normal mail content are as follows:
Subject: christmas baskets
the christmas baskets have been ordered .
we have ordered several baskets .
individual earth – sat freeze – notis
smith barney group baskets
rodney keys matt rodgers charlie
notis jon davis move
Team
phillip randle chris hyde
Harvey
Freese
Faclities
Examples of spam content are as follows:
Subject: fw : this is the solution i mentioned lsc
OO
Thank you,
your email address was obtained from a purchased list ,
reference # 2020 mid = 3300 . if you wish to unsubscribe
from this list , please click here and enter
your name into the remove box . if you have previously unsubscribed
and are still receiving this message , you may email our abuse
control center , or call 1 – 888 – 763 – 2497 , or write us at : nospam ,
6484 coral way , miami , fl , 33155 ” . 2002
web credit inc . all rights reserved .
The home page of the Enron-Spam data set is: http://www2.aueb.gr/users/ion/data/enron-spam/
Feature extraction
Method 1: bag-of-words model
There are two very important models for text feature extraction:
Word-set model: a set of words; each word appears at most once in the set.
Bag-of-words model: if a word appears more than once in a document, its number of occurrences (frequency) is also counted.
The essential difference between the two is that the bag of words adds the dimension of frequency on top of the word set: the word set only records whether a word is present, while the bag of words also records how many times it appears, as the toy example below shows.
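As a toy illustration (plain Python, independent of the mail data and of scikit-learn), the word set keeps only presence while the bag of words also keeps counts:

words = "the cat sat on the mat".split()

word_set = set(words)                               # word-set model: presence only
word_bag = {w: words.count(w) for w in word_set}    # bag-of-words model: presence plus count

print(word_set)    # e.g. {'the', 'cat', 'sat', 'on', 'mat'} (set order may vary)
print(word_bag)    # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}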
Suppose we want to characterize an article; the most common approach is the bag of words.
Import the related function library (CountVectorizer from scikit-learn):
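>>> from sklearn.feature_extraction.text import CountVectorizer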
Instantiate the vectorizer (word segmentation) object:
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Apply bag-of-words processing to the text:
>>> corpus = [
... 'This is the first document.',
... 'This is the second second document.',
... 'And the third one.',
... 'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
Get the corresponding feature names:
>>> vectorizer.get_feature_names() == (
... ['and', 'document', 'first', 'is', 'one',
... 'second', 'the', 'third', 'this'])
True
Get the bag-of-words data. At this point the bag of words is complete, but when other text appears in the program, how can it be vectorized using the features of the existing bag of words?
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
[0, 1, 0, 1, 0, 2, 1, 0, 1],
[1, 0, 0, 0, 1, 0, 1, 1, 0],
[0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
The feature space of the bag of words is called the vocabulary. When processing other text, the existing vocabulary can be reused directly so that the new text is mapped into the same feature space, as the following sketch shows.
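As a minimal sketch (the new document below is made up for illustration), the fitted vectorizer exposes the learned vocabulary as vectorizer.vocabulary_; either its transform() method or a second CountVectorizer built on that vocabulary maps new text into the same 9-dimensional feature space:

>>> new_doc = ['This is a new document.']
>>> vectorizer.transform(new_doc).toarray()          # reuse the fitted vectorizer
array([[0, 1, 0, 1, 0, 0, 0, 0, 1]])
>>> vectorizer_2 = CountVectorizer(vocabulary=vectorizer.vocabulary_)
>>> vectorizer_2.fit_transform(new_doc).toarray()    # same feature space, shared vocabulary
array([[0, 1, 0, 1, 0, 0, 0, 0, 1]])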
In this example, the whole message, including the body, is treated as a single string, and carriage returns and line feeds are filtered out.
def load_one_file(filename):
    x = ""
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.strip('\r')
            x += line
    return x
Traverse all the files under the specified folder and load the data.
def load_files_from_dir(rootdir):
    x = []
    file_list = os.listdir(rootdir)
    for i in range(0, len(file_list)):
        path = os.path.join(rootdir, file_list[i])
        if os.path.isfile(path):
            v = load_one_file(path)
            x.append(v)
    return x
The data of the Enron-Spam data set is scattered across 6 folders, enron1 to enron6; normal mails are in the ham folders and spam mails are in the spam folders. Load all of the data in turn.
def load_all_files():
    ham = []
    spam = []
    for i in range(1, 7):
        path = "../data/mail/enron%d/ham/" % i
        print("Load %s" % path)
        ham += load_files_from_dir(path)
        path = "../data/mail/enron%d/spam/" % i
        print("Load %s" % path)
        spam += load_files_from_dir(path)
    return ham, spam
The bag-of-words model is used to vectorize the normal mail and spam samples. Samples from the ham folders are labeled 0 (normal mail) and samples from the spam folders are labeled 1 (spam).
def get_features_by_wordbag():
    ham, spam = load_all_files()
    x = ham + spam
    y = [0] * len(ham) + [1] * len(spam)
    vectorizer = CountVectorizer(
        decode_error='ignore',
        strip_accents='ascii',
        max_features=max_features,   # max_features is defined globally elsewhere
        stop_words='english',
        max_df=1.0,
        min_df=1)
    print(vectorizer)
    x = vectorizer.fit_transform(x)
    x = x.toarray()
    return x, y
Several important parameters of CountVectorizer are:
decode_error: how to handle decoding failures; one of 'strict', 'ignore' or 'replace'
strip_accents: how to remove accents in the preprocessing step
max_features: the maximum number of bag-of-words features
stop_words: the stop-word list to filter out
max_df: the maximum document frequency for a term to be kept
min_df: the minimum document frequency for a term to be kept
binary: defaults to False; it needs to be set to True when used in combination with TF-IDF
The data set processed in this example is entirely in English, so decode_error is set to 'ignore', stop_words to 'english' and strip_accents to 'ascii', as the small standalone example below illustrates.
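A small standalone demonstration (a toy corpus invented here, not the Enron data) of how stop_words and max_features shrink the feature space; note that newer scikit-learn versions rename get_feature_names() to get_feature_names_out():

>>> toy = ['the spam filter removes the spam',
...        'the ham stays in the inbox']
>>> CountVectorizer().fit(toy).get_feature_names()
['filter', 'ham', 'in', 'inbox', 'removes', 'spam', 'stays', 'the']
>>> CountVectorizer(stop_words='english').fit(toy).get_feature_names()
['filter', 'ham', 'inbox', 'removes', 'spam', 'stays']
>>> # max_features=N would further keep only the N most frequent of these terms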
Method 2: TF-IDF model
There is another feature extraction method in the field of text processing, called the TF-IDF model (term frequency-inverse document frequency). TF-IDF is a statistical method used to evaluate the importance of a word to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency across the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query.
The main idea of TF-IDF is: if a word or phrase appears frequently in one article (high term frequency, TF) but rarely appears in other articles, it is considered to have good discriminating power and to be suitable for classification. TF-IDF is simply TF * IDF. TF is the frequency with which a term appears in document d. The main idea of IDF (inverse document frequency) is: the fewer documents contain term t, i.e. the smaller n is, the larger IDF is, which means term t has good category-discriminating ability. If the number of documents of a certain class C that contain term t is m, and the number of documents of all other classes that contain t is k, then clearly the number of documents containing t is n = m + k. When m is large, n is also large, and the IDF value computed from the IDF formula is small, suggesting that term t is not very discriminative. In reality, however, if a term appears frequently in the documents of one class, it represents the characteristics of that class well; such terms should be given a higher weight and selected as feature words of that class to distinguish it from documents of other classes.
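To make the formulas concrete, here is a minimal sketch of the plain TF-IDF computation (note that scikit-learn's TfidfTransformer uses a smoothed and normalized variant, so its numbers differ slightly):

import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    # TF: how often the term appears in the document
    tf = term_count / float(doc_length)
    # IDF: the fewer documents contain the term, the larger the IDF
    idf = math.log(float(num_docs) / docs_with_term)
    return tf * idf

# A term appearing 3 times in a 100-word document and in 10 of 1000 documents:
print(tf_idf(3, 100, 1000, 10))    # 0.03 * ln(100), about 0.138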
The TF-IDF algorithm is implemented in scikit-learn; simply instantiate TfidfTransformer.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False, use_idf=True)
The TF-IDF model is usually used together with the bag-of-words model to further process the array generated by the bag of words.
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.81940995, 0. , 0.57320793],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 1. , 0. , 0. ],
[ 0.47330339, 0.88089948, 0. ],
[ 0.58149261, 0. , 0.81355169]])
In this example, after obtaining the ham and spam data, use the bag-of-words model CountVectorizer (with the binary parameter set to True), and then use TfidfTransformer to compute TF-IDF.
def get_features_by_wordbag_tfidf():
    ham, spam = load_all_files()
    x = ham + spam
    y = [0] * len(ham) + [1] * len(spam)
    vectorizer = CountVectorizer(binary=True,
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    x = vectorizer.fit_transform(x)
    x = x.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    tfidf = transformer.fit_transform(x)
    x = tfidf.toarray()
    return x, y
Method 3: vocabulary model
The bag-of-words model can show which words a text consists of, but it cannot express the relationships between the words. So people borrow the idea of the bag-of-words model and use the generated vocabulary to encode the original sentences word by word. TensorFlow supports this model out of the box.
tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length,
                                                   min_frequency=0,
                                                   vocabulary=None,
                                                   tokenizer_fn=None)
The meaning of each parameter is:
max_document_length: the maximum length of a document. If a text is longer than the maximum length, it is truncated; otherwise it is padded with 0
min_frequency: the minimum word frequency; words whose frequency is below the minimum are not included in the vocabulary
vocabulary: a CategoricalVocabulary object
tokenizer_fn: the tokenizer (word segmentation) function
Suppose you have the following sentences to deal with:
x_text = [
    'i love you',
    'me too'
]
Generate a vocabulary from the above sentences and encode the sentence 'i me too'.
max_document_length = 4    # maximum document length; value inferred from the output below
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
vocab_processor.fit(x_text)
print(next(vocab_processor.transform(['i me too'])).tolist())
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x)
The operation result is:
[1, 4, 5, 0]
[[1 2 3 0]
 [4 5 0 0]]
The whole process is shown in the figure.
Encoding with the vocabulary model
In this example, after getting the ham and spam data, the data set is processed with VocabularyProcessor to obtain the vocabulary; texts are truncated to the defined maximum document length and padded with 0 if they are shorter than it.
def get_features_by_vocabulary():    # function name assumed; the original text does not name it
    global max_document_length
    x = []
    y = []
    ham, spam = load_all_files()
    x = ham + spam
    y = [0] * len(ham) + [1] * len(spam)
    vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                                min_frequency=0,
                                                vocabulary=None,
                                                tokenizer_fn=None)
    x = vp.fit_transform(x, unused_y=None)
    x = np.array(list(x))
    return x, y
Summary
This article has taken spam recognition as an example to introduce common text-processing methods and the machine learning algorithms commonly used with them. This first half introduced the data set used for spam recognition and the feature extraction methods applied to it, including the bag-of-words model, the TF-IDF model and the vocabulary model. To learn how to use naive Bayes, support vector machines and the DNN, RNN and CNN models of deep learning to identify spam, please see the second half.