
Learning algorithms for secure spam recognition (i)

Posted by deaguero at 2020-03-11

This article takes spam recognition as an example to introduce common text-processing methods and the machine learning algorithms typically used with them. The first half covers the data set used for spam recognition and the feature extraction methods applied to it, including the bag-of-words model, the TF-IDF model and the vocabulary model. The second half covers the models used and their validation results, including naive Bayes, support vector machines and deep learning.

Preface

Spam is one of the most troublesome by-products of the Internet and directly affects enterprise mailbox users. First, it adds an extra burden to daily office work and to mailbox administrators. According to incomplete statistics, even in an efficient anti-spam environment 80% of users still need to spend about 10 minutes per week dealing with spam; for the many Chinese enterprise mail deployments that are still in an inefficient anti-spam environment, this figure is dozens of times higher. As shown in Figure 1-1, the total volume of spam originating from China ranks third in the world. For enterprise mail service providers, the malicious delivery of spam also consumes a large amount of network resources, to the point that 85% of a mail server's system resources can be spent identifying spam. This is not only an extremely serious waste of resources, it can also lead to network congestion or outages that disrupt normal business mail.

Figure 1-1: The world's most spammed countries


Data set

The data set used for spam recognition is the Enron-Spam data set, the most widely used public data set in e-mail related research. Its messages come from about 150 senior managers of the Enron Corporation (once one of the world's largest integrated natural gas and electricity companies, and the number one wholesaler of natural gas and electricity in North America). The e-mails were made public while Enron was under investigation by the Federal Energy Regulatory Commission.

In the field of machine learning, the Enron-Spam data set is used to study document classification, part-of-speech tagging, spam recognition and so on. Because it consists of real mail from a real environment, it is of great practical significance.

Enron spam dataset home page

The Enron-Spam data set uses separate folders to distinguish normal mail from spam, as shown in the figure.

Enron spam dataset folder structure

Examples of normal mail content are as follows:

Subject: christmas baskets

the christmas baskets have been ordered .

we have ordered several baskets .

individual earth – sat freeze – notis

smith barney group baskets

rodney keys matt rodgers charlie

notis jon davis move

Team

phillip randle chris hyde

Harvey

Freese

Faclities

Examples of spam content are as follows:

Subject: fw : this is the solution i mentioned lsc

OO

Thank you,

your email address was obtained from a purchased list ,

reference # 2020 mid = 3300 . if you wish to unsubscribe

from this list , please click here and enter

your name into the remove box . if you have previously unsubscribed

and are still receiving this message , you may email our abuse

control center , or call 1 – 888 – 763 – 2497 , or write us at : nospam ,

6484 coral way , miami , fl , 33155 ” . 2002

web credit inc . all rights reserved .

The home page of the Enron-Spam data set is: http://www2.aueb.gr/users/ion/data/enron-spam/

Feature extraction

Method 1: bag-of-words model

There are two very important models for text feature extraction:

Word-set model: the text is represented as a set of words, so each word is an element that appears at most once, no matter how often it occurs.

Bag-of-words model: if a word appears more than once in a document, its number of occurrences (its frequency) is also counted.

The essential difference is that the bag-of-words model adds a frequency dimension on top of the word-set model: the word set only records whether a word is present, while the word bag also records how many times it appears, as the short sketch below illustrates.
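As a quick illustration (a minimal sketch, not from the original text), scikit-learn's CountVectorizer can produce either representation: the default settings count occurrences (bag of words), while binary=True only records presence (word set).

from sklearn.feature_extraction.text import CountVectorizer

docs = ["free viagra viagra deal"]

# Bag-of-words: keeps how many times each word occurs
bag = CountVectorizer()
print(bag.fit_transform(docs).toarray())        # [[1 1 2]] -> 'deal' once, 'free' once, 'viagra' twice

# Word set: binary=True only records whether a word is present
word_set = CountVectorizer(binary=True)
print(word_set.fit_transform(docs).toarray())   # [[1 1 1]]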

Suppose we want to characterize an article; the most common approach is the bag of words.

Import the relevant class:

>>> from sklearn.feature_extraction.text import CountVectorizer

Instantiate a vectorizer object:

>>> vectorizer = CountVectorizer(min_df=1)

>>> vectorizer

CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Perform bag-of-words processing on a small corpus:

>>> corpus = [

... 'This is the first document.',

... 'This is the second second document.',

... 'And the third one.',

... 'Is this the first document?',

... ]

>>> X = vectorizer.fit_transform(corpus)

>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

Get the corresponding feature names:

>>> vectorizer.get_feature_names() == (

... ['and', 'document', 'first', 'is', 'one',

... 'second', 'the', 'third', 'this'])

True

Finally, obtain the bag-of-words vectors themselves; this completes the processing of the corpus. But for other text that appears later in the program, how can it be vectorized with the features of this existing bag of words?

>>> X.toarray()

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],

[0, 1, 0, 1, 0, 2, 1, 0, 1],

[1, 0, 0, 0, 1, 0, 1, 1, 0],

[0, 1, 1, 1, 0, 0, 1, 0, 1]])

The feature space of the bag of words is called its vocabulary.

This existing vocabulary can be used directly when dealing with other texts, as the short sketch below shows.
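For example (a small sketch, assuming the vectorizer and corpus above), the learned vocabulary can be handed to a new CountVectorizer so that other text is mapped into the same 9-dimensional feature space; words outside the vocabulary are simply ignored.

new_vectorizer = CountVectorizer(vocabulary=vectorizer.vocabulary_)
print(new_vectorizer.transform(['This is a brand new document.']).toarray())
# -> [[0 1 0 1 0 0 0 0 1]]: only 'document', 'is' and 'this' are in the existing vocabulary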

In this example, the whole message, including the body, is treated as a single string, with carriage returns and line feeds filtered out.

def load_one_file(filename):
    # Read the whole mail, including headers and body, as one string,
    # stripping carriage returns and line feeds.
    x = ""
    with open(filename) as f:
        for line in f:
            line = line.strip('\n')
            line = line.strip('\r')
            x += line
    return x

Traverse all the files under the specified folder and load the data.

def load_files_from_dir(rootdir):
    # Traverse every file under rootdir and load each one as a string.
    x = []
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            v = load_one_file(path)
            x.append(v)
    return x

The Enron-Spam data is spread over six folders, enron1 to enron6. In each one, normal mail is in the ham folder and spam is in the spam folder; all of the data is loaded in turn.

def load_all_files():
    ham = []
    spam = []
    for i in range(1, 7):
        path = "../data/mail/enron%d/ham/" % i
        print "Load %s" % path
        ham += load_files_from_dir(path)
        path = "../data/mail/enron%d/spam/" % i
        print "Load %s" % path
        spam += load_files_from_dir(path)
    return ham, spam

Here the bag-of-words model is used to vectorize the normal and spam samples. Samples from the ham folders are labelled 0 (normal mail) and samples from the spam folders are labelled 1 (spam).

def get_features_by_wordbag():
    ham, spam = load_all_files()
    x = ham + spam
    y = [0]*len(ham) + [1]*len(spam)
    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,  # max_features is a global setting
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    print vectorizer
    x = vectorizer.fit_transform(x)
    x = x.toarray()
    return x, y

The important parameters of CountVectorizer are:

decode_error: how to handle decoding failures; one of 'strict', 'ignore' or 'replace'

strip_accents: how to remove accents in the preprocessing step

max_features: the maximum number of bag-of-words features

stop_words: the list of stop words to remove

max_df: the maximum document frequency a term may have

min_df: the minimum document frequency a term must have

binary: False by default; it needs to be set to True when the output is combined with TF-IDF

The data set in this example is entirely in English, so decode_error='ignore', stop_words='english' and strip_accents='ascii' are used. A small demonstration of these parameters follows.
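Here is a small illustration (a sketch with made-up sentences, not from the original) of how stop_words and max_features shrink the learned vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cheap viagra offer",
        "cheap viagra deal",
        "the meeting is moved to friday"]

cv = CountVectorizer(stop_words='english', max_features=2)
cv.fit(docs)
print(cv.get_feature_names())   # ['cheap', 'viagra']: stop words are dropped, only the 2 most frequent terms are kept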

Method 2: TF-IDF model

Another feature extraction method in text processing is the TF-IDF model (term frequency - inverse document frequency). TF-IDF is a statistical method for evaluating how important a word is to a document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in that document, but decreases in inverse proportion to how often it appears across the whole corpus. Various forms of TF-IDF weighting are often used by search engines as a measure of the relevance between a document and a user query.

The main idea of TF-IDF is this: if a word or phrase appears frequently in one article (a high term frequency, TF) but rarely in other articles, it is considered to have good discriminating power and to be suitable for classification. TF-IDF is simply TF * IDF. TF measures how often a term appears in document d. The idea behind IDF (inverse document frequency) is that the fewer documents contain term t, i.e. the smaller n is, the larger IDF becomes, which means t distinguishes well between categories.

There is a known weakness: if the number of documents of class c that contain term t is m, and the number of documents of other classes that contain t is k, then the total number of documents containing t is n = m + k. When m is large, n is also large, so the IDF value computed from the formula is small, suggesting that t has little power to distinguish class c. In reality, if a term appears frequently in the documents of one class, it represents that class well; such terms should be given a higher weight and selected as feature words of the class to distinguish it from other documents.

The TF-IDF algorithm is implemented in scikit-learn; simply instantiate TfidfTransformer.

>>> from sklearn.feature_extraction.text import TfidfTransformer

>>> transformer = TfidfTransformer(smooth_idf=False)

>>> transformer

TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False, use_idf=True)

The TF-IDF model is usually used together with the bag-of-words model, further processing the array that the bag-of-words model produces.

>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]

>>> tfidf = transformer.fit_transform(counts)

>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>

>>> tfidf.toarray()

array([[ 0.81940995, 0. , 0.57320793],

[ 1. , 0. , 0. ],

[ 1. , 0. , 0. ],

[ 1. , 0. , 0. ],

[ 0.47330339, 0.88089948, 0. ],

[ 0.58149261, 0. , 0.81355169]])
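To make the formula concrete, the first row of the output above can be reproduced by hand. This is a sketch based on scikit-learn's documented behaviour: with smooth_idf=False, idf(t) = ln(n / df(t)) + 1, the raw term frequency is multiplied by idf, and each row is then L2-normalized.

import numpy as np

n = 6                                 # number of documents in counts
tf = np.array([3.0, 0.0, 1.0])        # term frequencies of the first document
df = np.array([6.0, 1.0, 2.0])        # number of documents containing each term
idf = np.log(n / df) + 1.0
tfidf = tf * idf
print(tfidf / np.linalg.norm(tfidf))  # [ 0.81940995  0.          0.57320793], matching the first row above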

In this example, after the ham and spam data are obtained, the bag-of-words model (CountVectorizer) is applied with the binary parameter set to True, and TfidfTransformer is then used to calculate TF-IDF.

def get_features_by_wordbag_tfidf():
    ham, spam = load_all_files()
    x = ham + spam
    y = [0]*len(ham) + [1]*len(spam)
    vectorizer = CountVectorizer(binary=True,
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1)
    x = vectorizer.fit_transform(x)
    x = x.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    tfidf = transformer.fit_transform(x)
    x = tfidf.toarray()
    return x, y

Method 3: vocabulary model

The bag-of-words model can show which words a text consists of, but it cannot express the relationship between the words. The vocabulary model therefore borrows the idea of the bag-of-words model and uses the generated vocabulary to encode the original sentence word by word, preserving word order. TensorFlow supports this model out of the box.

tf.contrib.learn.preprocessing.VocabularyProcessor(max_document_length,
                                                   min_frequency=0,
                                                   vocabulary=None,
                                                   tokenizer_fn=None)

The meaning of each parameter is:

max_document_length: the maximum length of a document; longer texts are cut off, shorter ones are padded with 0

min_frequency: words whose frequency is below this minimum are not included in the vocabulary

vocabulary: a CategoricalVocabulary object

tokenizer_fn: the word segmentation (tokenizer) function

Suppose you have the following sentences to deal with:

x_text = [

'i love you',

'me too'

]

Generate a vocabulary from the sentences above and encode the sentence 'i me too':

vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

vocab_processor.fit(x_text)

print next(vocab_processor.transform(['i me too'])).tolist()

x = np.array(list(vocab_processor.fit_transform(x_text)))

print x

The operation result is:

[1, 4, 5, 0]

[[1 2 3 0]
 [4 5 0 0]]

The whole process is shown in the figure.

Encoding with the vocabulary model

In this example, after the ham and spam data are obtained, the data set is processed with VocabularyProcessor to obtain the vocabulary; each mail is truncated to the defined maximum document length and padded with 0 if it is shorter.

global max_document_length
x = []
y = []
ham, spam = load_all_files()
x = ham + spam
y = [0]*len(ham) + [1]*len(spam)
vp = tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                            min_frequency=0,
                                            vocabulary=None,
                                            tokenizer_fn=None)
x = vp.fit_transform(x, unused_y=None)
x = np.array(list(x))
return x, y
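As a small usage sketch (assuming the block above is wrapped in a function, here hypothetically named get_features_by_tf, and that max_document_length is set to 100), each mail becomes a fixed-length sequence of word ids that can later be fed to a deep learning model:

max_document_length = 100
x, y = get_features_by_tf()   # hypothetical wrapper around the code above
print(x.shape)                # (number_of_mails, 100): one row of word ids per mail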

Summary

This article has taken spam recognition as an example to introduce common text-processing methods and the machine learning algorithms typically used with them. This first half covered the data set used for spam recognition and the feature extraction methods applied to it: the bag-of-words model, the TF-IDF model and the vocabulary model. To see how naive Bayes, support vector machines and the deep learning models DNN, RNN and CNN are used to identify spam, please read the second part.