Recognizing simple verification codes with machine learning

Posted by millikan at 2020-03-03

Published on October 15, 2016

There are many excellent expert tutorials on captcha recognition on the Internet, but most of them are too specialized and read like an indecipherable text to non-professionals. With the popularization of machine learning, however, a large number of open-source machine learning tools have emerged, which is good news for weak students like me. A recent project involves some machine learning, so I have been studying the subject lately, and this verification code recognition is a practice exercise. The article is a set of study notes, so it inevitably contains mistakes; corrections are welcome.

As I am not a professional, I will not discuss the algorithms in this article; the focus is on hands-on practice.


The work mainly relies on open-source machine learning frameworks and a few auxiliary tools.

Problem analysis

Before starting the actual work, we need to see what problem has to be solved. Verification code recognition can be treated as a classification problem; for a numeric picture verification code, it is a classification over the digits 0-9. The hardest part is cutting the verification code into single-character pictures. There are many methods for this cutting, which prepares the image for feature extraction, such as the vertical projection method and the equidistant cutting method. Equidistant cutting is the simpler one, but its accuracy is very low on slightly more complex verification codes: every piece is cut to the same width even when the characters differ in width, so the pieces do not represent the characters well. It is therefore sometimes necessary to preprocess the image first, for example by correcting the characters, and then use the vertical projection method to cut along the x-axis and y-axis according to the size of the projection.

With the vertical projection method, we finally have to make sure that all samples in the training set have the same dimension. Because the cuts follow the projection, the pieces contain different numbers of pixels, so they have to be padded to a uniform dimension. In short, it is a little troublesome.
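To make the idea concrete, here is a minimal sketch of vertical projection cutting and padding on a binarized image stored as a list of rows (1 = black, 0 = white). The function names and the toy image are assumptions made for illustration, and the sketch only cuts along the x-axis.

```python
def vertical_projection(image):
    """Count the black (1) pixels in each column of a binarized image."""
    return [sum(column) for column in zip(*image)]


def split_by_projection(image):
    """Cut the image at empty columns; return (start, end) column ranges."""
    ranges, start = [], None
    for x, count in enumerate(vertical_projection(image)):
        if count and start is None:
            start = x                      # a character begins here
        elif not count and start is not None:
            ranges.append((start, x))      # the character ended at column x - 1
            start = None
    if start is not None:                  # character touching the right edge
        ranges.append((start, len(image[0])))
    return ranges


def pad_to_width(piece, width):
    """Right-pad every row with white (0) pixels up to a common width."""
    return [row + [0] * (width - len(row)) for row in piece]


# Toy 3x7 image holding three "characters" of different widths.
image = [
    [1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1],
]
ranges = split_by_projection(image)
pieces = [[row[a:b] for row in image] for a, b in ranges]
width = max(b - a for a, b in ranges)
padded = [pad_to_width(piece, width) for piece in pieces]
```

Characters that touch or overlap leave no empty column between them, so a real verification code may need a projection threshold above zero or extra preprocessing.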

This article mainly uses equidistant cutting as the example because it is relatively simple to implement. The target is a purely numeric verification code made of the digits 0-9. The verification code is as follows

In this kind of picture the characters are spaced almost evenly, so it is easy to segment. After cutting the picture into four pieces, each piece can be used for training and recognition.
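Equidistant cutting can be sketched in a few lines. The example below works on a pixel grid held as a list of rows; the function name and the sample grid are hypothetical, and leftover columns on the right are simply dropped, which is one possible choice.

```python
def cut_equal(image, parts=4):
    """Split a 2D pixel grid into `parts` vertical strips of equal width."""
    width = len(image[0])
    step = width // parts          # leftover columns on the right are dropped
    return [[row[i * step:(i + 1) * step] for row in image]
            for i in range(parts)]


# Toy 2x8 grid cut into four 2-pixel-wide pieces.
image = [
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 0, 1],
]
pieces = cut_equal(image, parts=4)
```

With an image library such as Pillow, the same cut would typically be done with `Image.crop((left, upper, right, lower))` on each strip.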

If machine learning is used for training and recognition, feature selection has to be considered. Verification code recognition generally follows a standard process. The picture comes from

In verification code recognition we care about the meaning the characters carry, not the color of the verification code. Image processing therefore applies grayscale conversion, binarization and denoising. For denoising, such as removing interference lines, there are dedicated algorithms that are not discussed here. Binarization renders the image in just two colors, black and white. This is convenient for feature processing: 0 and 1 can represent black and white, and which color each value stands for is a matter of personal preference.

The binarized image, after the other processing steps, is then used for feature extraction. Black pixels are marked as 1 and white pixels as 0, which gives a numerical representation of the image. The feature dimension equals the number of pixels in the image. Finally, the image is laid out along the x-axis or the y-axis, that is, the marked pixel values are joined into one row, for example
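A minimal sketch of grayscale conversion and binarization on RGB pixel tuples follows. The luminance weights are the common ITU-R BT.601 ones; the threshold of 128 and the convention that dark pixels become 1 are assumptions for illustration, and denoising is left out.

```python
def to_gray(pixel):
    """Convert an (R, G, B) pixel to a 0-255 gray level (BT.601 weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def binarize(rgb_image, threshold=128):
    """Mark pixels darker than the threshold as 1 (black), the rest as 0."""
    return [[1 if to_gray(pixel) < threshold else 0 for pixel in row]
            for row in rgb_image]


# One row: black, white, and a dark red pixel standing in for colored text.
rgb_image = [[(0, 0, 0), (255, 255, 255), (200, 30, 30)]]
binary = binarize(rgb_image)
```

A fixed threshold is the simplest option; a verification code with a busy background may need the threshold tuned per image.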

111110000000000001

111000000000000000

Joined end to end, rows like these form a single string such as 1111100000000011110000000000000000, so that each picture can be represented by one row of 0 and 1 values.
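Flattening a binarized picture into one row can be sketched as below. The tiny 2x3 picture and the function names are made up for illustration; both the row-by-row and the column-by-column orders are shown.

```python
def flatten_rows(binary_image):
    """Join the pixel values row by row (along the x-axis)."""
    return [pixel for row in binary_image for pixel in row]


def flatten_columns(binary_image):
    """Join the pixel values column by column (along the y-axis)."""
    return [pixel for column in zip(*binary_image) for pixel in column]


picture = [
    [1, 1, 0],
    [0, 1, 0],
]
by_rows = flatten_rows(picture)
by_columns = flatten_columns(picture)
as_text = "".join(str(value) for value in by_rows)
```

Either order works as long as every picture in the training set and at prediction time is flattened the same way.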

After feature extraction we have a mathematical representation of the pictures, so the next step is model training. As mentioned above, picture recognition is a classification problem. I mainly train two models: an SVM (support vector machine) and a BP neural network. The SVM uses the implementation in the scikit-learn machine learning package, and the neural network is implemented with PyBrain.

For the algorithms behind SVM and BP neural networks, it is best to look up the relevant papers online, so that you know which algorithm solves which problems and roughly how it works. Readers who are able to can derive both algorithms themselves.


The problem analysis gave a general approach to verification code recognition; this part implements it.

First, note that SVM and neural network models are both supervised learning algorithms, which means the samples must be labeled: each picture has to be marked with the digit it shows. With a small amount of data, labeling by hand is feasible, but with a large amount it may not be practical. Here the pictures can only be labeled after cutting, so the cut pictures need a preprocessing step, namely labeling. I am lazy and do not want to label them one by one, so OCR is used to pre-classify the cut pictures. OCR accuracy on single characters is acceptable, so after the OCR pre-classification only the wrongly classified pictures need correcting, which greatly reduces the workload.

The implementation here mainly includes the following steps:

Image collection

Image collection is relatively simple; the details are shown in the code below

The downloaded images are saved to the specified location, named by timestamp
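A minimal sketch of this step, using only the standard library, might look as follows. The URL passed to `download_captchas` is whatever endpoint serves the verification code; the function names, the `.jpg` suffix and the millisecond timestamp are assumptions, not details from the original code.

```python
import time
import urllib.request
from pathlib import Path


def timestamp_name(now=None, suffix=".jpg"):
    """Build a file name from a timestamp in milliseconds."""
    now = time.time() if now is None else now
    return f"{int(now * 1000)}{suffix}"


def save_image(data, out_dir, now=None):
    """Write raw image bytes to out_dir under a timestamp-based name."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / timestamp_name(now)
    path.write_bytes(data)
    return path


def download_captchas(url, out_dir, count=10, delay=0.1):
    """Fetch `count` images from `url` and save each one by timestamp."""
    paths = []
    for _ in range(count):
        with urllib.request.urlopen(url, timeout=10) as response:
            paths.append(save_image(response.read(), out_dir))
        time.sleep(delay)   # be polite, and keep the timestamps distinct
    return paths
```

Calling `download_captchas(url, "captchas", count=500)` with a real endpoint would then collect a raw sample set for the cutting step.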

Image preprocessing and image clipping

After preprocessing, the picture is cut with the equidistant cutting method

The cropped picture is as follows

Picture pre-classification

pytesseract is used for image pre-classification to reduce the workload. The specific code is as follows

The OCR classification accuracy is probably above 50%. What remains is to manually correct the misclassified images.
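The sorting part of this step can be sketched independently of the OCR engine. In the example below, `recognize` is any function that takes an image path and returns the recognized text; the function name `presort` and the `unknown` directory for unusable results are assumptions made for illustration.

```python
import shutil
from pathlib import Path


def presort(image_paths, out_dir, recognize):
    """Copy each image into out_dir/<label>/, where label = recognize(path).

    Results that are not a single digit go to out_dir/unknown/ for manual review.
    """
    out_dir = Path(out_dir)
    placed = {}
    for path in map(Path, image_paths):
        label = recognize(path).strip()
        if not (len(label) == 1 and label in "0123456789"):
            label = "unknown"
        target = out_dir / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, target / path.name)
        placed[path.name] = label
    return placed
```

With pytesseract and Pillow installed, `recognize` could be something like `lambda p: pytesseract.image_to_string(Image.open(p), config="--psm 10")`, where page segmentation mode 10 tells Tesseract to treat the image as a single character. After the run, manual correction amounts to moving misplaced files between the digit directories.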

OCR classification results

Results after manual error correction and marking

Each directory represents a category label.

Feature extraction

Please refer to the problem analysis for details of feature extraction. The key code is as follows

The mathematical representation of the final images is written to /users/iswin/downloads/YZM/train data/train_data.txt, in the format shown in the figure below

The red box marks the numerical representation of one picture, and the final number 0 is the picture's category. For convenience, I append the label at the end of each row.
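A small sketch of writing and reading one such row, with the label stored as the last value, is shown below. The comma separator and the function names are assumptions; the actual file may use a different delimiter.

```python
def to_line(features, label, sep=","):
    """Serialize the pixel values followed by the label as one text row."""
    return sep.join(str(value) for value in list(features) + [label])


def parse_line(line, sep=","):
    """Split one text row back into (features, label)."""
    values = [int(value) for value in line.strip().split(sep)]
    return values[:-1], values[-1]


line = to_line([1, 0, 1], 7)
features, label = parse_line(line + "\n")
```

Keeping the label in the last column means a whole file can be loaded as one numeric matrix and split into features and labels by slicing off the final column.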

SVM model classification

The SVM here is implemented with scikit-learn; for how to use scikit-learn, see the tutorials on the official website. What needs discussing is SVM parameter selection. SVM supports several kernel functions, such as the Gaussian (RBF), linear, polynomial and sigmoid kernels, but choosing a kernel at the start is really difficult for those unfamiliar with them. So here scikit-learn's GridSearchCV is used to optimize the parameters. After parameter optimization the Gaussian kernel performed well, so the Gaussian kernel is used directly for training.

To make prediction convenient, the joblib module is used to persist the training results. For a simple evaluation of the model, 5-fold cross validation is used to test the results.

The accuracy of the final result is: accuracy: 0.96 (+/- 0.09)
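The workflow described above (grid search over kernels, cross validation, persistence with joblib) can be sketched as follows. The data here is synthetic stand-in data generated on the spot, the parameter grid is an arbitrary small example, and the file name is hypothetical; none of these come from the original experiment.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 10 digit classes, 20 noisy 64-pixel samples each.
rng = np.random.RandomState(0)
prototypes = rng.randint(0, 2, size=(10, 64))
X = np.repeat(prototypes, 20, axis=0)
y = np.repeat(np.arange(10), 20)
noise = rng.rand(*X.shape) < 0.05          # flip about 5% of the pixels
X = np.where(noise, 1 - X, X)

# Let GridSearchCV compare kernels and their parameters with 5-fold CV.
param_grid = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.01, 0.1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

# Simple evaluation of the chosen model, again with 5 folds.
scores = cross_val_score(search.best_estimator_, X, y, cv=5)
print("accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# Persist the trained model and load it back for prediction.
model_path = os.path.join(tempfile.mkdtemp(), "svm_captcha.joblib")
joblib.dump(search.best_estimator_, model_path)
restored = joblib.load(model_path)
prediction = restored.predict(X[:1])
```

On real data, `X` and `y` would come from the extracted feature file, and `search.best_params_` shows which kernel and parameters the search selected.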

The specific code is as follows:

Here is a prediction example showing the effect

BP neural network model classification

The BP (back-propagation) neural network is a multilayer feedforward network trained with the error back-propagation algorithm, and is one of the most widely used neural network models. After it came the CNN (convolutional neural network), now the most widely used in deep learning, which I am also studying these days.

This article trains on the training set with a three-layer BP neural network, i.e. input layer + two hidden layers + output layer. With a BP neural network, pay attention to the choice of activation function and, for multi-class problems, the choice of output layer function. The main activation functions are sigmoid, tanh and ReLU. I do not yet understand well how to choose among them; generally I run each one once and compare the final results.

The neural network classification here is mainly a way to learn the usage of PyBrain and the basics of BP neural networks. The input layer uses LinearLayer, a linear input layer; the hidden layers use SigmoidLayer, that is, the sigmoid activation function. Because this is a multi-class problem, the output layer uses SoftmaxLayer. Finally, the index of the maximum value in the network's output is the predicted verification code category, a value between 0 and 9.

There is not much official documentation for PyBrain. A network can be built in two ways: with the buildNetwork function, or with FeedForwardNetwork. Note that when building with FeedForwardNetwork, a bias has to be added manually for each layer, otherwise the results may be very poor. I left it out in my first experiment and spent half a day getting wrong results; only after reading the source of buildNetwork did I realize that my network was missing the bias. Also pay attention to the number of iterations until convergence, i.e. maxEpochs=500 in the function, which should be adjusted to the situation. PyBrain has its own dataset format, so the data must be initialized in that format.

Apart from the input layer dimension (i.e. the dimension of the verification code training set) and the output layer, which are fixed, the number of neurons in the hidden layers can be adjusted; interested readers can change it and see how the results differ.

A 10-fold cross validation is used for a simple evaluation of the model. The error rate is about 0.062, a little worse than the SVM. Parameter tuning should improve the accuracy, but the emphasis here is on learning.
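To illustrate what the layers compute at prediction time, here is a plain-Python forward pass with a sigmoid hidden layer, a softmax output and an argmax. It is not PyBrain code: the weights are made-up toy values, only one hidden layer is used to keep it short, and all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Weighted sum plus bias per neuron; weights[j] belongs to neuron j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    peak = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, hidden_w, hidden_b, out_w, out_b):
    """Return (predicted class index, class probabilities)."""
    hidden = [sigmoid(z) for z in dense(features, hidden_w, hidden_b)]
    probs = softmax(dense(hidden, out_w, out_b))
    return probs.index(max(probs)), probs


# Toy network: 2 inputs, 2 hidden neurons, 3 output classes.
hidden_w, hidden_b = [[1, 0], [0, 1]], [0, 0]
out_w = [[1, 0], [0, 1], [0, 0]]
label_without_bias, probs = predict([1, 0], hidden_w, hidden_b, out_w, [0, 0, 0])
label_with_bias, _ = predict([1, 0], hidden_w, hidden_b, out_w, [0, 0, 2])
```

The two calls differ only in the output bias, and the predicted class changes with it, which gives a feel for why leaving the bias out of a hand-built network can hurt the results.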
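For a model that is not a scikit-learn estimator, k-fold cross validation can be done by hand. The sketch below is a generic version under assumed names: `train_fn` takes training samples and labels and returns a function that predicts the label of one sample.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_val_error(samples, labels, train_fn, k=10):
    """Average misclassification rate of train_fn over k folds."""
    errors = []
    for train, test in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        wrong = sum(model(samples[i]) != labels[i] for i in test)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

The folds here are contiguous, so a data file sorted by label should be shuffled before calling `cross_val_error`; otherwise a fold may contain digits the model never saw during training.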

Training set sample: /users/iswin/downloads/YZM/train data/train_data_uniq.txt

The main code is as follows:

For example, let's see the prediction effect


Through this small experiment I have at least gained a general understanding of machine learning and the related algorithms. As a security practitioner, I also now know how to use open-source machine learning frameworks to build my own models. These notes inevitably contain mistakes; comments are welcome.