Document classification using the naive Bayes algorithm

Posted by punzalan at 2020-03-17

This article uses the naive Bayes implementation in scikit-learn to classify documents, as a way to understand the algorithm more deeply. Natural language processing is not the focus here, so the corpus is in English, which avoids having to introduce Chinese word segmentation. To make the principle easier to follow, the article also introduces TF-IDF, a model for expressing word weight information.

In scikit-learn, the naive Bayes algorithm is implemented in the sklearn.naive_bayes package, which covers the typical probability distributions introduced in this chapter. GaussianNB implements naive Bayes for the Gaussian distribution, MultinomialNB for the multinomial distribution, and BernoulliNB for the Bernoulli distribution. In this article, we use MultinomialNB to classify documents automatically. If you are not familiar with the naive Bayes algorithm, please refer to another of the author's blog posts.

1 get data set

The dataset used in this section is 20news-18828 from mlcomp.org, which can be downloaded after free registration. After downloading the dataset, decompress it into the directory ~/code/datasets/mlcomp/. This creates a directory named 379 under ~/code/datasets/mlcomp, which contains three subdirectories and a description file named metadata:

    $ cd ~/code/datasets/mlcomp
    $ ls 379
    metadata  raw  test  train

We will use the documents in the train subdirectory for model training, and the documents in the test subdirectory for model testing. The train subdirectory contains 20 subdirectories, each representing a document category; every document in a subdirectory belongs to the category named by that directory. Readers can browse the dataset freely to get an intuitive feel for it. For example, datasets/mlcomp/379/train/rec.autos/6652-103421 is a plain text file that can be opened with any text editor. It is a post about cars:

Hahahahahaha. gasp pant Hm, I'm not sure whether the above was just a silly remark or a serious remark. But in case there are some misconceptions, I think Henry Robertson hasn't updated his data file on Korea since...mid 1970s. Owning a car in Korea is no longer a luxury. Most middle class people in Korea can afford a car and do have at least one car. The problem in Korea, especially in Seoul, is that there are just so many privately-owned cars, as well as taxis and buses, the rush-hour has become a 24 hour phenomenon and that there is no place to park. Last time I heard, back in January, the Kim Administration wanted to legislate a law requireing a potential car owner to provide his or her own parking area, just like they do in Japan.

Also, Henry would be glad to know that Hyundai isn't the only car manufacturer in Korea. Daewoo has always manufactured cars and I believe Kia is back in business as well. Imported cars, such as Mercury Sable are becoming quite popular as well, though they are still quite expensive.

Finally, please ignore Henry's posting about Korean politics and bureaucracy. He's quite uninformed.

2 mathematical expression of documents

How can a document be expressed as information that a computer can understand and process? This is an important topic in natural language processing, and a complete treatment could fill a large book. This section briefly introduces the principles of TF-IDF, so that readers can better understand the examples in this article.

TF-IDF is a statistical method for evaluating how important a word is to a document. TF stands for term frequency: the number of times a specific word appears in a document divided by the total number of words in that document. For example, if a document has 1000 words in total, among which "naive Bayes" appears 5 times, "of" appears 25 times, and "application" appears 12 times, their term frequencies are 0.005, 0.025, and 0.012 respectively.

IDF stands for inverse document frequency. It is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient (base 10 in the examples here). It acts as a weight index for the word. For example, suppose our dataset has 10000 documents in total. If "naive Bayes" appears in only 10 of them, its weight index is IDF = log(10000 / 10) = 3. If "of" appears in all documents, its weight index is IDF = log(1) = 0. If "application" appears in 1000 documents, its weight index is IDF = log(10000 / 1000) = 1.

Multiplying the term frequency of a word by its weight index gives the importance of the word in the document. This importance grows in proportion to the number of times the word appears in the document, and shrinks as the number of documents in the corpus that contain the word grows. For the application of TF-IDF in search engines, see the chapter "How to determine the relevance between web pages and queries" in Wu Jun's book "The Beauty of Mathematics".
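The arithmetic above can be checked with a few lines of Python. This is only a sketch of the textbook definition used in this section, with a base-10 logarithm; the helper names are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed, natural-log variant and normalizes each row by default, so its values will differ from these.

```python
import math

def term_frequency(count, total_words):
    # Occurrences of the word divided by the document length
    return count / total_words

def inverse_document_frequency(total_docs, docs_with_word):
    # Base-10 logarithm, matching the worked numbers in the text
    return math.log10(total_docs / docs_with_word)

TOTAL_WORDS = 1000   # words in the example document
TOTAL_DOCS = 10000   # documents in the example corpus

# word -> (occurrences in the document, documents containing the word)
examples = {
    "naive Bayes": (5, 10),
    "of": (25, 10000),
    "application": (12, 1000),
}

tfidf = {}
for word, (count, doc_count) in examples.items():
    tf = term_frequency(count, TOTAL_WORDS)
    idf = inverse_document_frequency(TOTAL_DOCS, doc_count)
    tfidf[word] = tf * idf
    print("{0}: tf={1:.3f} idf={2:.1f} tf-idf={3:.3f}".format(word, tf, idf, tf * idf))
```

Although "of" is the most frequent of the three words, its TF-IDF is zero because it appears in every document, while the rare term "naive Bayes" ends up with the highest weight.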

With TF-IDF, we can convert a document into a vector. First, we extract all the words from our dataset (called a corpus in natural language processing) to form what we call a dictionary. If the dictionary contains 10000 words, each document can be transformed into a 10000-dimensional vector. Second, for each word in the document we want to convert, we calculate its TF-IDF value and write it into the element of the document vector that corresponds to that word. This completes the conversion of a document into a vector. A document usually contains only a small fraction of the words in the dictionary, which means that most elements of its vector are zero.
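The two steps can be sketched by hand on a toy corpus. The three sentences and the helper names below are invented for illustration, whitespace splitting stands in for real tokenization, and the base-10 TF-IDF definition from this section is assumed.

```python
import math

# Toy corpus, invented for illustration
corpus = [
    "the car is fast",
    "the car needs a new engine",
    "the team won the hockey game",
]
tokenized = [doc.split() for doc in corpus]

# Step 1: the dictionary is every distinct word in the corpus
dictionary = sorted({word for doc in tokenized for word in doc})
index = {word: i for i, word in enumerate(dictionary)}

def idf(word):
    # Number of documents that contain the word at least once
    df = sum(1 for doc in tokenized if word in doc)
    return math.log10(len(tokenized) / df)

def to_vector(words):
    # Step 2: one slot per dictionary word, filled with TF-IDF
    vector = [0.0] * len(dictionary)
    for word in set(words):
        tf = words.count(word) / len(words)
        vector[index[word]] = tf * idf(word)
    return vector

vectors = [to_vector(doc) for doc in tokenized]
print(dictionary)
print(vectors[0])
```

Every vector has one element per dictionary word, and most elements stay at zero, either because the word is absent from the document or, as with "the", because it appears in every document.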

Fortunately, we don't need to write our own code for this process: the scikit-learn package already implements the conversion of documents into vectors. First, we read the training corpus into memory:

    from time import time
    from sklearn.datasets import load_files

    print("loading train dataset ...")
    t = time()
    news_train = load_files('datasets/mlcomp/379/train')
    print("summary: {0} documents in {1} categories.".format(
        len(news_train.data), len(news_train.target_names)))
    print("done in {0} seconds".format(time() - t))

Our code is stored in the ~/code/ directory, so the relative path datasets/mlcomp/379/train points to our corpus, which contains 20 subdirectories. The name of each subdirectory is a document category, and the subdirectory contains all documents of that category. The load_files() function reads all documents under this directory into memory and labels them automatically according to the subdirectory names. news_train.data is an array containing the text of every document, news_train.target is an array containing the category index of every document, and news_train.target_names lists the category names. So to get the category name of the first document, we only need news_train.target_names[news_train.target[0]].

The output of the above code on my computer is:

    loading train dataset ...
    summary: 13180 documents in 20 categories.
    done in 0.212177991867 seconds
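To see the labelling behaviour of load_files() in isolation, here is a small sketch that builds a throwaway directory tree and loads it. The two category names are borrowed from the dataset, but the file names and texts are invented for illustration.

```python
import os
import tempfile
from sklearn.datasets import load_files

# Hypothetical miniature corpus: one subdirectory per category
samples = {
    "rec.autos": ["my car needs new tires", "the engine stalls"],
    "sci.space": ["the rocket reached orbit"],
}

with tempfile.TemporaryDirectory() as root:
    for category, texts in samples.items():
        os.makedirs(os.path.join(root, category))
        for i, text in enumerate(texts):
            path = os.path.join(root, category, "doc{0}.txt".format(i))
            with open(path, "w") as f:
                f.write(text)
    # File contents are read into memory here, before the directory is removed
    toy = load_files(root)

labels_match = True
for i in range(len(toy.data)):
    label = toy.target_names[toy.target[i]]
    folder = os.path.basename(os.path.dirname(toy.filenames[i]))
    labels_match = labels_match and (label == folder)
    print(folder, label, toy.data[i])
```

Each document's label is the name of the subdirectory it was read from. The documents come back as bytes because no encoding was passed to load_files().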

The summary line shows that our corpus contains 13180 documents in 20 categories. Next, we need to convert all these documents into vectors of TF-IDF weights:

    from sklearn.feature_extraction.text import TfidfVectorizer

    print("vectorizing train dataset ...")
    t = time()
    vectorizer = TfidfVectorizer(encoding='latin-1')
    X_train = vectorizer.fit_transform((d for d in news_train.data))
    print("n_samples: %d, n_features: %d" % X_train.shape)
    print("number of non-zero features in sample [{0}]: {1}".format(
        news_train.filenames[0], X_train[0].getnnz()))
    print("done in {0} seconds".format(time() - t))

Here, the TfidfVectorizer class transforms all documents into a matrix. Each row of the matrix represents a document, and each element in a row is the importance of the corresponding word, expressed as its TF-IDF value. Readers familiar with the scikit-learn API will know that fit_transform() combines fit() and transform(): fit() first analyzes the corpus and extracts the dictionary, and transform() then converts each document into a vector. The resulting matrix is saved in the X_train variable. The output of this code on my computer is:

    vectorizing train dataset ...
    n_samples: 13180, n_features: 130274
    number of non-zero features in sample [datasets/mlcomp/379/train/talk.politics.misc/17860-178992]: 108
    done in 4.15024495125 seconds

The output shows that our dictionary contains 130274 words, so each document is converted into a 130274-dimensional vector. The first document has only 108 non-zero elements, which means it is made up of 108 distinct words; its total word count is at least 108, because some words appear more than once. The TF-IDF value of each of these 108 words is computed and stored at the corresponding position in the vector. X_train is therefore a huge sparse matrix of dimension 13180 x 130274.
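The same behaviour can be observed on a corpus small enough to inspect by eye. The sketch below uses two invented sentences and default TfidfVectorizer settings; it checks that fit_transform() matches fit() followed by transform(), that the non-zero count of a row equals the number of distinct words in that document, and that text transformed later keeps the same number of columns.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two toy documents, invented for illustration
docs = [
    "the car is fast and the car is red",
    "the rocket reached orbit",
]

# One step: learn the dictionary and build the matrix together
one_step = TfidfVectorizer().fit_transform(docs)

# Two steps: fit() learns the dictionary, transform() builds the matrix
two_step_vectorizer = TfidfVectorizer()
two_step_vectorizer.fit(docs)
two_step = two_step_vectorizer.transform(docs)

same = np.allclose(one_step.toarray(), two_step.toarray())

# Repeated words share one column, so non-zeros count distinct words
distinct_words_doc0 = len(set(docs[0].split()))
nnz_doc0 = one_step[0].nnz

# A new document is mapped onto the existing dictionary;
# "spaceship" was never seen during fit() and is ignored
unseen = two_step_vectorizer.transform(["the spaceship is fast"])

print(same, distinct_words_doc0, nnz_doc0, one_step.shape, unseen.shape)
```

The first document has 9 words but only 6 distinct ones, which is why its row has 6 non-zero entries.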

3 model training

It took some effort to turn the documents into the kind of training matrix that scikit-learn expects: each row of the matrix is a data sample, and each column is a feature. Now we can use MultinomialNB directly to train on the data set:

    from sklearn.naive_bayes import MultinomialNB

    print("training models ...")
    t = time()
    y_train = news_train.target
    clf = MultinomialNB(alpha=0.0001)
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    print("train score: {0}".format(train_score))
    print("done in {0} seconds".format(time() - t))

Here, alpha is the smoothing parameter: the smaller its value, the more likely the model is to overfit; the larger its value, the more likely it is to underfit. The output of this code on my computer is:

    training models ...
    train score: 0.997875569044
    done in 0.274363040924 seconds
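To get a feel for what alpha does, the sketch below evaluates the additive smoothing formula used by multinomial naive Bayes, P(word | class) = (N_wc + alpha) / (N_c + alpha * V), where N_wc is the weight of the word in the class, N_c the total weight of all words in the class, and V the dictionary size. The counts are hypothetical and chosen only for illustration; with TF-IDF features the real values are fractional weights rather than integer counts.

```python
def smoothed_probability(word_count, class_total, vocabulary_size, alpha):
    # Additive (Lidstone) smoothing of a per-class word probability
    return (word_count + alpha) / (class_total + alpha * vocabulary_size)

# Hypothetical numbers, for illustration only
CLASS_TOTAL = 1000       # total word count observed in one class
VOCABULARY_SIZE = 5000   # number of words in the dictionary

def seen_to_unseen_ratio(alpha):
    # How much more likely a word seen 20 times is than an unseen word
    seen = smoothed_probability(20, CLASS_TOTAL, VOCABULARY_SIZE, alpha)
    unseen = smoothed_probability(0, CLASS_TOTAL, VOCABULARY_SIZE, alpha)
    return seen / unseen

for alpha in (0.0001, 1.0, 100.0):
    unseen = smoothed_probability(0, CLASS_TOTAL, VOCABULARY_SIZE, alpha)
    print("alpha={0}: P(unseen word)={1:.3e}, seen/unseen ratio={2:.1f}".format(
        alpha, unseen, seen_to_unseen_ratio(alpha)))
```

With a tiny alpha the model trusts the training counts almost completely, which is what makes overfitting more likely. With a large alpha the estimates for seen and unseen words are pulled toward each other, so the model becomes less sensitive to the data.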

Next, we load the test data set and pick one document to check whether the prediction is correct. The test data set is in the ~/code/datasets/mlcomp/379/test directory. We first load it in the same way as before:

    print("loading test dataset ...")
    t = time()
    news_test = load_files('datasets/mlcomp/379/test')
    print("summary: {0} documents in {1} categories.".format(
        len(news_test.data), len(news_test.target_names)))
    print("done in {0} seconds".format(time() - t))

The output on my computer is:

    loading test dataset ...
    summary: 5648 documents in 20 categories.
    done in 0.117918014526 seconds

Then we convert the test documents into vectors:

    print("vectorizing test dataset ...")
    t = time()
    X_test = vectorizer.transform((d for d in news_test.data))
    y_test = news_test.target
    print("n_samples: %d, n_features: %d" % X_test.shape)
    print("number of non-zero features in sample [{0}]: {1}".format(
        news_test.filenames[0], X_test[0].getnnz()))
    print("done in %fs" % (time() - t))

Note that the vectorizer variable is the same TfidfVectorizer instance that we used to process the training data set. Here we only need to call transform() to compute the TF-IDF values; there is no need to call fit() again to analyze the corpus. The output of this code on my computer is:

    vectorizing test dataset ...
    n_samples: 5648, n_features: 130274
    number of non-zero features in sample [datasets/mlcomp/379/test/rec.autos/7429-103268]: 61
    done in 2.915759s

In this way, our test data set is also transformed into a sparse matrix with a dimension of 5648 x 130274. We can take the first document in the test data set for preliminary verification to see whether our trained model can correctly predict the category of this document:

    pred = clf.predict(X_test[0])
    print("predict: {0} is in category {1}".format(
        news_test.filenames[0], news_test.target_names[pred[0]]))
    print("actually: {0} is in category {1}".format(
        news_test.filenames[0], news_test.target_names[news_test.target[0]]))

The output on my computer is:

    predict: datasets/mlcomp/379/test/rec.autos/7429-103268 is in category rec.autos
    actually: datasets/mlcomp/379/test/rec.autos/7429-103268 is in category rec.autos

It seems that the prediction is in line with the reality.

4 model evaluation

Obviously, we cannot judge the accuracy of the model from the prediction for a single sample; we need a comprehensive evaluation. Fortunately, scikit-learn provides a full set of model evaluation tools.

First, we need to predict the test data set:

    print("predicting test dataset ...")
    t0 = time()
    pred = clf.predict(X_test)
    print("done in %fs" % (time() - t0))

Output on my computer:

    predicting test dataset ...
    done in 0.090978s

Next, we use the classification_report() function to summarize the prediction results for each category:

    from sklearn.metrics import classification_report

    print("classification report on test set for classifier:")
    print(clf)
    print(classification_report(y_test, pred,
                                target_names=news_test.target_names))

The output on my computer is:

    classification report on test set for classifier:
    MultinomialNB(alpha=0.0001, class_prior=None, fit_prior=True)
                              precision    recall  f1-score   support

    alt.atheism                    0.90      0.91      0.91       245
    comp.graphics                  0.80      0.90      0.85       298
    comp.os.ms-windows.misc        0.82      0.79      0.80       292
    comp.sys.ibm.pc.hardware       0.81      0.80      0.81       301
    comp.sys.mac.hardware          0.90      0.91      0.91       256
    comp.windows.x                 0.88      0.88      0.88       297
    misc.forsale                   0.87      0.81      0.84       290
    rec.autos                      0.92      0.93      0.92       324
    rec.motorcycles                0.96      0.96      0.96       294
    rec.sport.baseball             0.97      0.94      0.96       315
    rec.sport.hockey               0.96      0.99      0.98       302
    sci.crypt                      0.95      0.96      0.95       297
    sci.electronics                0.91      0.85      0.88       313
    sci.med                        0.96      0.96      0.96       277
    sci.space                      0.94      0.97      0.96       305
    soc.religion.christian         0.93      0.96      0.94       293
    talk.politics.guns             0.91      0.96      0.93       246
    talk.politics.mideast          0.96      0.98      0.97       296
    talk.politics.misc             0.90      0.90      0.90       236
    talk.religion.misc             0.89      0.78      0.83       171

    avg / total                    0.91      0.91      0.91      5648

As the output shows, precision, recall, and F1 score are calculated for each category. Readers who are not familiar with these concepts can refer to the author's other blog post on the design and optimization of machine learning systems. In addition, we can use the confusion_matrix() function to generate a confusion matrix and see how each category is misclassified, for example, which categories the misclassified documents were assigned to:

    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y_test, pred)
    print("confusion matrix:")
    print(cm)

The output on my computer is:

    confusion matrix:
    [[224   0   0   0   0   0   0   0   0   0   0   0   0   0   2   5   0   0   1  13]
     [  1 267   5   5   2   8   1   1   0   0   0   2   3   2   1   0   0   0   0   0]
     [  1  13 230  24   4  10   5   0   0   0   0   1   2   1   0   0   0   0   1   0]
     [  0   9  21 242   7   2  10   1   0   0   1   1   7   0   0   0   0   0   0   0]
     [  0   1   5   5 233   2   2   2   1   0   0   3   1   0   1   0   0   0   0   0]
     [  0  20   6   3   1 260   0   0   0   2   0   1   0   0   2   0   2   0   0   0]
     [  0   2   5  12   3   1 235  10   2   3   1   0   7   0   2   0   2   1   4   0]
     [  0   1   0   0   1   0   8 300   4   1   0   0   1   2   3   0   2   0   1   0]
     [  0   1   0   0   0   2   2   3 283   0   0   0   1   0   0   0   0   0   1   1]
     [  0   1   1   0   1   2   1   2   0 297   8   1   0   1   0   0   0   0   0   0]
     [  0   0   0   0   0   0   0   0   2   2 298   0   0   0   0   0   0   0   0   0]
     [  0   1   2   0   0   1   1   0   0   0   0 284   2   1   0   0   2   1   2   0]
     [  0  11   3   5   4   2   4   5   1   1   0   4 266   1   4   0   1   0   1   0]
     [  1   1   0   1   0   2   1   0   0   0   0   0   1 266   2   1   0   0   1   0]
     [  0   3   0   0   1   1   0   0   0   0   0   1   0   1 296   0   1   0   1   0]
     [  3   1   0   1   0   0   0   0   0   0   1   0   0   2   1 280   0   1   1   2]
     [  1   0   2   0   0   0   0   0   1   0   0   0   0   0   0   0 236   1   4   1]
     [  1   0   0   0   0   1   0   0   0   0   0   0   0   0   0   3   0 290   1   0]
     [  2   1   0   0   1   1   0   1   0   0   0   0   0   0   0   1  10   7 212   0]
     [ 16   0   0   0   0   0   0   0   0   0   0   0   0   0   0  12   4   1   4 134]]

From the first row of data, we can see that 13 documents of category 0 (alt.atheism) are wrongly classified into category 19 (talk.religion.misc). Of course, we can also visualize the confusion matrix:

    # Show confusion matrix
    import matplotlib.pyplot as plt

    plt.figure(figsize=(8, 8), dpi=144)
    plt.title('Confusion matrix of the classifier')
    ax = plt.gca()
    ax.spines['right'].set_color('none')
    ax.spines['top'].set_color('none')
    ax.spines['bottom'].set_color('none')
    ax.spines['left'].set_color('none')
    ax.xaxis.set_ticks_position('none')
    ax.yaxis.set_ticks_position('none')
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    plt.matshow(cm, fignum=1, cmap='gray')
    plt.colorbar()
    plt.show()

The output on my computer is as follows:

[Figure: confusion matrix of the classifier]

Away from the diagonal, the lighter a cell is, the more errors it represents. With this information, we can examine the sample data in detail to find out why documents of one category are wrongly classified into another, and use that to further optimize the model.
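The confusion matrix and the classification report describe the same predictions, and this can be cross-checked by hand. The sketch below copies the first row and the first column of the confusion matrix printed above (both belong to alt.atheism) and recomputes the three scores; the variable names are chosen here for illustration.

```python
# First row of the matrix: true alt.atheism documents, by predicted category
row_alt_atheism = [224, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                   0, 0, 0, 0, 2, 5, 0, 0, 1, 13]
# First column of the matrix: documents predicted as alt.atheism, by true category
col_alt_atheism = [224, 1, 1, 0, 0, 0, 0, 0, 0, 0,
                   0, 0, 0, 1, 0, 3, 1, 1, 2, 16]

true_positives = row_alt_atheism[0]

# Recall: share of true alt.atheism documents that were found
recall = true_positives / sum(row_alt_atheism)
# Precision: share of alt.atheism predictions that were correct
precision = true_positives / sum(col_alt_atheism)
# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print("precision={0:.2f} recall={1:.2f} f1={2:.2f} support={3}".format(
    precision, recall, f1, sum(row_alt_atheism)))
```

The results agree with the alt.atheism line of the classification report: the row sum is the support of 245, and most of the lost precision comes from the 16 talk.religion.misc documents predicted as alt.atheism.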

5 parameter selection

An important parameter of MultinomialNB is alpha, which controls how strongly the model is smoothed. We chose the value 0.0001. A more systematic approach is to use scikit-learn's sklearn.model_selection.GridSearchCV to choose it automatically: we specify a set of candidate alpha values and let the code pick the best one among them. Interested readers can consult the GridSearchCV documentation and give it a try.
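As a starting point, here is a minimal sketch of such a search. The eight toy sentences, their labels, and the candidate alpha values are all invented for illustration, and two-fold cross-validation is used only because the toy corpus is so small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Toy corpus, invented for illustration: label 0 = cars, label 1 = space
docs = [
    "the car engine is loud",
    "new tires for my car",
    "the car needs oil",
    "brakes and engine repair",
    "the rocket reached orbit",
    "a satellite in orbit",
    "the shuttle launch was delayed",
    "rocket engine test for launch",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X = TfidfVectorizer().fit_transform(docs)

# Candidate values for the smoothing parameter (an arbitrary choice)
param_grid = {"alpha": [0.0001, 0.01, 0.1, 1.0]}

# Each candidate is scored by cross-validation; the best one is kept
search = GridSearchCV(MultinomialNB(), param_grid, cv=2)
search.fit(X, labels)

print("best alpha:", search.best_params_["alpha"])
print("best cross-validation score:", search.best_score_)
```

For the data in this article, the same pattern would presumably be applied by calling search.fit(X_train, y_train) with a wider range of candidates and more folds. The value it selects may well differ from the 0.0001 used above.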

(end)