The implementation of a lexical analyzer

Posted by punzalan at 2020-03-28

Introduction

Compilation, in short, converts a source program into an executable program. The running mechanism of a Hello World program gives a simple picture of how a program executes and how it becomes an executable file step by step. In this process the compiler does a great deal of important work. Being interested in low-level details, I naturally wanted to understand the internals of compilation, that is, the principles of compilers.

This article mainly discusses the compiler front end and the principles of lexical analysis, and finally gives a simple implementation of a lexical analyzer.


Compilation, simply put, transforms the source program into a program in another form. The key is to understand the meaning of the source program, so that it can be transformed into that other form.

A metaphor may help: persons A and B want to talk but do not know each other's language, so they need a translator C who understands both. With C as the middle layer, A and B can communicate normally. C is a bit like a compiler: it must understand the meaning of the source in order to convey it to the other side.

The compiler works the same way. Its input is a source file of the language (generally a text file). For the input file, the individual elements (keywords, variables, symbols, and so on) must first be separated out.

Then, according to the grammar of the language, the compiler analyzes whether the combination of these elements is legal and what the combination means.

Unlike natural languages, programming languages are described entirely by symbols, each with a specific meaning, and their syntax is context-free. Context-free means that how a particular construct is analyzed does not depend on its surrounding context, only on the construct itself.

This post is mainly about lexical analysis, that is, grouping the input stream of symbols into individual morphemes (tokens).

Lexical analysis


The function of a lexical analyzer is to decompose the input source program into a sequence of word symbols according to the word-formation rules of the language. Words are the smallest units with independent meaning in a language, and include keywords, identifiers, operators, delimiters, constants, and so on.

(1) Keywords are fixed identifiers defined by the programming language. For example, begin, end, if, and while are reserved words in Pascal. These words usually cannot be used as ordinary identifiers.

(2) Identifiers are used to represent various names, such as variable names, array names, procedure names, etc.

(3) Constants come in several types: integer, real, Boolean, literal (string), and so on.

(4) Operators such as +, -, *, / and so on.

(5) Delimiters such as comma, semicolon, bracket, etc.


The word symbols output by the lexical analyzer are usually represented as the following pair:

(word type, attribute value of word symbol)

Word types are usually encoded as integers. Identifiers are generally lumped into a single category. Constants are classified by type (integer, real, Boolean, etc.). Keywords may be treated as one category as a whole, or given one code each; operators and delimiters usually get one code per symbol. Besides the category code, each word symbol also carries attribute information, which describes the characteristics of that particular symbol.


For example, the following code segment:

while(i>=j) i--

After lexical analysis, it is converted into the following sequence of word symbols:

   <while, _>
   <(, _>
   <id, pointer to the symbol-table entry for i>
   <>=, _>
   <id, pointer to the symbol-table entry for j>
   <), _>
   <id, pointer to the symbol-table entry for i>
   <--, _>


Lexical analyzer as an independent subroutine

Lexical analysis is a stage of the compilation process that precedes syntax analysis. Making it a separate pass simplifies the design, improves compilation efficiency, and increases the portability of the compiler. Alternatively, it can be combined with syntax analysis into a single pass, in which the parser calls the lexical analyzer to obtain the current word whenever it needs one.

Design of lexical analyzer

Input, preprocessing

The first step of the lexical analyzer is to read in the source text. In many cases the input string is preprocessed to make word symbols easier to recognize. Preprocessing mainly filters out spaces and line breaks and skips comments.

Lookahead

During lexical analysis, it is sometimes necessary to scan several characters ahead before the type of a word can be determined.

In FORTRAN, keywords are not reserved words and may also be used as identifiers, and spaces have no significance. As a result, determining the type of a word can require scanning several characters ahead. Consider the following statements:


         1   DO99K=1,10

         2   IF(5.EQ.M) I=10

         3   DO99K=1.10

         4   IF(5)=55

All four statements are legal FORTRAN. Statements 1 and 2 are a DO statement and an IF statement respectively, while statements 3 and 4 are assignment statements. To distinguish 1 from 3 and 2 from 4 correctly, several characters must be scanned in advance.


The difference between statements 1 and 3 lies in the first delimiter after the equal sign: in one it is a comma, in the other the end of the statement. The difference between statements 2 and 4 lies in the first character after the closing parenthesis: in one it is a letter, in the other an equal sign. To recognize the keywords in 1 and 2, the scanner must look ahead until the word type can be confirmed: to distinguish 1 from 3 it must scan ahead to the first delimiter after the equal sign, and to distinguish 2 from 4 it must scan ahead to the first character after the right parenthesis that matches the left parenthesis following IF.

State transition diagram

A lexical analyzer uses state transition diagrams to recognize word symbols. A state transition diagram is a finite directed graph with one initial state and at least one final state.

Consider a diagram in which 0 is the initial state and 2 is the final state. The process of recognizing (accepting) an identifier with this diagram is as follows. Start from the initial state 0. If the input character in state 0 is a letter, read it and move to state 1. In state 1, if the next input character is a letter or a digit, read it and stay in state 1. Repeat this until state 1 finds that the input character is no longer a letter or a digit (that character has also been read), then move to state 2. State 2 is the final state: an identifier has been recognized and the recognition process terminates. An asterisk on the final node means that a character not belonging to the identifier has been read and should be returned to the input. If the input character in state 0 is not a letter, the identifier cannot be recognized, that is, the transition diagram fails.

Regular expressions and regular sets

A regular expression is an important notation for words (tokens) and a tool for defining regular sets. In lexical analysis, regular expressions are used to describe the possible forms of word symbols such as identifiers.

Definition (regular expressions and the regular sets they denote):

Let the alphabet be Σ.

1. ε and ∅ are regular expressions over Σ, and their regular sets are {ε} and ∅ respectively;

2. Any a ∈ Σ is a regular expression over Σ, and its regular set is {a};

3. Suppose U and V are regular expressions over Σ, denoting the regular sets L(U) and L(V) respectively. Then (U), U|V, U·V, and U* are also regular expressions, and the regular sets they denote are L(U), L(U) ∪ L(V), L(U)L(V), and (L(U))* respectively;

4. Only the expressions built by the three steps above are regular expressions over Σ, and only the word sets they denote are regular sets over Σ.

Among the regular-expression operators, "|" reads "or", "·" reads "concatenation", and "*" reads "closure" (any finite number of repeated concatenations with itself).

When no confusion arises, parentheses may be omitted; operator precedence, from highest to lowest, is "(" ")", then "*", then "·", then "|". The concatenation operator "·" may itself be omitted.

"*", "·", and "|" are all left-associative.

For example, let Σ = {a, b}. Some regular expressions over Σ and their corresponding regular sets are:

Regular expression       Regular set

a                        {a}
a|b                      {a, b}
ab                       {ab}
(a|b)(a|b)               {aa, ab, ba, bb}
a*                       {ε, a, aa, aaa, …} (strings of any number of a's)
ba*                      {b, ba, baa, baaa, …}
(a|b)*                   {ε, a, b, aa, ab, …} (all strings composed of a and b)
(a|b)*(aa|bb)(a|b)*      {all strings over a and b containing two successive a's or two successive b's}

Theorem: if two regular expressions U and V denote the same regular set, then U and V are equivalent, written U = V.

Example: prove that b(ab)* = (ba)*b.

Proof: L(b(ab)*) = {b}{ε, ab, abab, ababab, …}

                 = {b, bab, babab, bababab, …}

       L((ba)*b) = {ε, ba, baba, bababa, …}{b}

                 = {b, bab, babab, bababab, …}

                 = L(b(ab)*)

Therefore b(ab)* = (ba)*b.

Let U, V, and W be regular expressions. The algebraic laws of regular expressions are:

(1) U|V = V|U (commutativity)

(2) U|(V|W) = (U|V)|W (associativity)

(3) U(VW) = (UV)W (associativity)

(4) U(V|W) = UV|UW and (V|W)U = VU|WU (distributivity)

(5) εU = Uε = U

Simple implementation of analyzer

The sections above introduced background knowledge of lexical analysis, but have not yet touched the concrete implementation of a lexical analyzer. To understand lexical analysis better, I wrote a simple one.

Although it is called an analyzer, its function is very simple: it merely removes the comments from the input program, using the knowledge about state transition diagrams above.


In most programming languages, the comment part takes the form

/*Comment section*/

Our program reads the input file one character at a time, and its purpose is to remove the comment part. In the input character stream, once we recognize "/*" we know that what follows is a comment, until "*/" appears in the stream.

The character stream is processed one character at a time. Each character read is examined: if it is "/", the following part may be a comment, so look at the next input character. If that is "*", we are in the situation above: after "/*" the following part is a comment, and the same method is then used to find the closing "*/".

This recognition process can be represented clearly by a state transition diagram:

Each character read is examined. If it is "/", what follows may be a comment, so move to the next state. If the following input is "*", the subsequent content is definitely a comment. If it is not "*", the content that follows is not a comment, and the earlier "/" may have been a division sign, as in "5 / 3".

The diagram corresponds directly to the logic of the program, which can be implemented with a switch statement: for each input character, jump to the corresponding state after the test, and then continue.

Here is the program pseudocode:



    case 1: if (ch == '/') state = 2; break;
    case 2: if (ch == '*') state = 3;
            else state = 1;
            break;
    case 3: ..........
    case 4: ..........

Lexical analyzer

The comment-stripping program is relatively simple, so its source code is not given. Next is the code of a simple lexical analyzer, which recognizes keywords (such as while, end, if) and numbers, and removes space characters.

Here are the functions of this analyzer:

1. The lexical rules of the small language to be analyzed

(1) Keywords:

begin  if   then   while   do   end

All keywords are lowercase.

(2) Operators and bounds:

:=   +   -   *   /   <   <=   <>   >   >=   =   ;   (   )   #

(3) The other words are identifier (ID) and integer constant (Num), defined by the following normal formula:

ID = letter (letter | digit)*

NUM = digit digit*

(4) Spaces consist of space, tab, and newline characters. Spaces are used to separate IDs, NUMs, operators, delimiters, and keywords, and are ignored during lexical analysis.

2. Category codes corresponding to various word symbols

Functions of lexical analysis program

Input: the source string of the given grammar.

Output: a sequence of pairs (syn, token or sum).

Among them: syn is the word category code;

token is the string of the word itself;

sum is the value of an integer constant.

The following is the program source code. Based on the above discussion, it should be better understood.

The program was debugged in C-Free 5.



This article discussed lexical analysis in compilers, introduced the related background knowledge, and finally gave a very simple implementation of a lexical analyzer.

Reference: Principles of Compilation

If reprinted, please indicate the source:

A fish, Yin Yanling @ blog Park 2012-17

 E-mail:[email protected]