[其它技巧]New Compiler PPT.ppt
Principles of Compiler Theory and Design,Central South UniversityCollege of Information Science and Engineering,Contents,Preface,Chapter 4 Syntax Analysis,Chapter 1 Introduction to compiling,Chapter 2 Context-free Grammers,Chapter 3 Lexical Analysis,Contents,Chapter 5 Syntax Analysis,Chapter 6 Syntax-Diricted Translation,Chapter 7 Type Checking,Chapter 8 Intermediate Code Generation,Chapter 9 Run-Time Environments,课程说明,课程名称:编译原理 先修课程:C语言、汇编语言、离散数学、数据结构总学时:64(授课学时:64)课程任务:编译原理是计算机专业的重要专业课之一,是每个优秀的计算机专业人员必修的一门课程。以研究程序设计语言编译构造的基本原理和基本实现方法为主要目标,其研究对象是程序设计语言的编译器。通过本课程学习,使学生掌握编译方法的基本理论和设计思想,加深对程序设计语言的理解,为其学习后继专业课奠定坚实的理论基础,并将本课程讨论的概念和技术应用于其他软件设计中,可以比较迅速地掌握新的语言工具。通过本课程的教学培养学生的抽象思维、逻辑推导和概括能力。,Chapter1 Introduction to Compiling1.Compliers,A compiler is a program that reads program written in one language the source language and translates it into a equivalent program in another language the target language.As an important par of this translation process,the compile reports to its user the presence of errors in the source program.,2.The Context of a Compiler,Lexical AnalysisIn a compiler,linear analysis is called lexical analysis or scanning.For example,in lexical analysis the characters in the assignment statement position:=initial+rate*60,would be grouped into the following tokens:,1 The identifier position.2 The assignment symbol:=3 The identifier initial 4 The plus sign.5 The identifier rate.6 The multiplication sign.7 The number 60.The blanks separating tee characters of these tokens would normally be eliminated during lexical analysis.,Syntax Analysis,Hierarchical analysis is called parsing or syntax analysis.It involves grouping the tokens of the source program into grammatical phrases that are used by the synthesize output.,Semantic Analysis and Intermediate Code Generation,The semantic analysis phase checks the source program for semantic errors and gathers type information for the subsequent code-generation phase.After syntax and semantic analysis,some compilers generate an explicit intermediate representation of the source program.,Code Optimization,The code optimization phase attempts to improve the intermediate code.,Code Generation,The final phase of the compiler is generation of target code,consisting normally of relocatable machine code or assemble code.,3.The Phases of a Compiler,source program,target program,syntax analyzer,semantic analyzer,intermediate code generator,code optimizer,code generator,symbol-table manger,error handler,lexical analyzer,Chapter 2 Context-Free Grammars,Every programming language has rules that prescribe the syntactic structure of well-formed programs.Many programming language constructs have an inherently recursive structure that can be defined by context-free grammars.A context free grammar consists of terminals,nonterminals,a start symbol,and productions.,G=(VT,VN,S,P)VT,terminals are the basic symbols from which strings are formed.For example,if,then,and else is a terminal.VN,nonterminals are syntactic variables that denote sets of strings.,S,start symbol,It is a nonterminal and denotes is the language defined by the grammar.P,the productions of a grammar specify the manner in which the terminals and nonterminals can be combined to form strings.Each production consists of a nonterminal,followed by an arrow(sometimes the symbole:=is used in place of the arrow),followed by a string of nonterminals and terminals.,Notational Conventions,To avoid always having to state that“these are the terminals”,“these are the nonterminals”,and so on,we shall employ the following notational conventions with regard to grammars throughout the remainder of this book.,1.These symbols are terminals:i)Lower-case letters early in the alphabet such as a,b,c.ii)Operator symbols such as+,-,etc.iii)Punctuation symbols such as parentheses,comma,etc.iv)The digits 0,1,9.v)Boldface string such as id or if.,2.These symbols are nonterminals:i)Upper-case letter early in the alphabet such A,B,C.ii)The letter S,which,when it appears,is usually the start symbol.iii)Lower-case italic names such as expr or stmt.,3.Lower-case Greek letters,for example,represent strings of grammar symbols Thus,a generic production could be written as A,indicating that there is a single nonterminal A on the left of the arrow(the left side of the production)and a string of grammar symbols to the right of the arrow(the right side of the production).,4.If A1,A2,Ak,are all productions with A on the left(we calla productions),we may write A1|2|k.5.Unless otherwise stated,the left side of the first production is the start symbol.,Example 2.1,The grammar of simple arithmeticexpressions.EE+E|E*E|(E)|-E|i,Derivations,Consider the following grammar forarithmetic expressions G(2.1):EE+E|E*E|(E)|-E|i,The production E-E signifiers that an expression preceded by a minus sign is also an expression.The production can used to generate more complex expressions from simple expressions by allowing use replace any instance of an E by E.,In the simplest case,we can replace asingle E by E.We can describe this bywriting.E=-E Which is read“E derivesE”.The symbol=means“derives inone step”.,Example 2.2,The string(i+i)can be derived from E.E=-(E)=-(E+E)=-(i+E)=-(i+i)(2.2)Often we wish to say“derives in zero ormore steps”.For this purpose we can use thesymbol=*.,Thus 1.=*for any string,and 2.if=*and=*,then=*.Likewise,we use=+to mean“derivesin one or more steps”.,Given a grammar G with start symbol S,we candefine L(G),the language generated by GL(G)=|VT and S=+We say a string of terminals w is in L(G)if andonly if S=+w.The string w is called asentence of G.If two grammars generate the same language,the grammar are said to be equivalent.,Example 2.3.,The string(i+i*i)is a sentence ofgrammar(2.1)because there is thederivationE=(E)=(E*E)=(E+E*E)=(i+E*E)=(i+i*E)=(i+i*i)(2.3),Leftmost derivation,To understand how certain parsers work we need to consider derivations in which only the leftmost nonterminal in any sentential form is replaced at each step.Such derivations are termed leftmost.,If=by a step in which the leftmostnonterminal in is replaced,we write=lm.Since derivation(2.2)is leftmost,we canrewrite it as:E=lm(E)=lm(E+E)=lm(E+E*E)=lm(i+E*E)=lm(i+i*E)=lm(i+i*i),Rightmost derivation,Analogous definitions hold for Rightmost derivations in which rightmostis replaced at each step.Rightmost derivations are sometimescalled cononical derivations.,Parse Trees,A parse tree may be viewed as a graphical representation for a derivation that filters out the choice regarding replacement order.For example,the parse tree for(i+i)implied by derivation derivation(2.2)is shown in Fig.2.1.,Fig.2.1.Parse tree for(i+i),E-E(E)E+E i i,Ambiguity,A grammar that produces more than one parse tree for some sentence is said to be ambiguous.Put another way,an ambiguous gram is one that produces more than one leftmost or more than one rightmost derivation for the same sentence.,Example 2.3.,The grammar(2.2)is ambiguous.Because there is the sentence i+i*i thathas the two distinct leftmost derivations:E=E+E E=E*E=i+E=E+E*E=i+E*E=i+E*E=i+i*E=i+i*E=i+i*i=i+i*i,The Language Generated by a Grammar,Example 2.4 Consider the grammar(2.2)SAB AaA|a BbB|b The language generated by grammar(2.4):L(G)=ambn|m,n1,Writing A grammar,Grammars are capable of describing most,butnot all,of the syntax of programming languages.Example 2.5.Design a grammar for the following language.L1=anbn|n1 The grammar G1 SaSb|ab generates the language L1,Example 2.6.,Design a grammar for the followinglanguage.L2=anbn|n1The grammar G2 SaSb|ab generates the language L2.,Example 2.7.,Design a grammar for the following language.L3=anbncmdm|m,n1The grammar G3 SAB AaAb|ab BcBd|cd generates the language L3.,Example 2.8.,Design a grammar for the following language.L4=anbmcmdn|m,n1 The grammar G4 SaSd|aAd AbAc|bc generates the language L4.,Example 2.9.,Design a grammar for the following language.L5=wcwR|w is in(a|b)*and wR stands for wreversed The grammar G5 SaSa|bSb|c generates the language L5.,Eliminating Ambiguity,Sometimes an ambiguous grammar canbe rewritten to eliminate the ambiguity.As an example,we shall eliminate theambiguity from the following grammar G EE+E|E*E|(E)|i,The grammar G is ambiguous because it doesnot specify the associativity and precedence ofthe operators+and*.The Unambiguousgrammar EE+T|T TT*F|F F(E)|i,EXERCISES,2.1Consider the grammar S(L)|a LL,S|S a)Find parse trees for the following sentences:i)(a,(a,a)ii)(a,(a,a),(a,a)b)Construct a leftmost derivation for each of sentences in(a).c)Construct a rightmost derivation for each of the sentences in(a).,2.2 Consider the grammar SaSbS|bSaS|(1)Show that this grammar is ambiguous by constructing two different leftmost derivations for the sentence abab.(2)Construct the corresponding rightmost derivations for abab.(3)Construct the corresponding parse trees for abab.,2.3 Consider the grammar bexprbexpr or bterm|bterm btermbterm and bfactor|bfactor bfactornot bfactor|(bexpr)|true|false(a)Consider a parse tree for the sentence not(true or false).(b)Show that this grammar generates all Boolean expressions.Is this grammar ambiguous?Why?,2.4Try to design a grammar for each of the following languages.a)The set of all string of 0s and 1s such that every 0 is immediately followed by at least one 1.b)String of 0s and 1s with an equal number of 0s and 1s.,c)The set of all even numbers,but 0 is not first number.d)The set of all odd numbers,but 0 is not first number.,2.5 Show the following grammar is ambiguous SiSeS|iS|a 2.6 Consider the grammar SSS|(S)|()a)Show that this grammar is ambiguous b)Eliminate the ambiguity from the grammar,Chapter 3.Lexical Analysis,This lexical analyzer is the first phase of acompiler.Its main task is to read the inputcharacters and produce as output a sequenceof tokens that the parser uses for syntaxanalysis.Sometimes,lexical analyzers aredivided into a cascade of two phases,the firstcalled“scanning”,and the second“lexicalanalysis”.,The scanner is responsible for simple tasks,while the lexical analyzer proper does the morecomplex operations.For example,a FORTRANcompiler might use a scanner to eliminateblanks from the input.,Attributes for Tokens,The lexical analyzer collects information about tokens into their associated attributes.As a practical matter,a token has usually only a single attributea pointer to the symbol-table entry in which the information about the token is kept;the pointer becomes the attribute for the token.,Example 3.1.,The tokens and associated attribute-values forThe FORTRAN statement E=M*C*2 are written below as a sequence of pairs:,Note that in certain pairs there is no need foran attribute value;the first component issufficient to identify the lexeme.,Input Buffering,For many source languages,the lexical analyzer may need to look ahead server characters beyond the lexeme for a pattern before a match can be announced.We use a buffer divided into two N-character halves,as shown in Fig.3.1.Typically,N is the number of characters on one disk block,e.g.,1024 or 4096.,E=M*eof C*2eof eof,forward,lexeme_beginning,lexeme_beginning,lexeme_beginning,Fig.3.1 Buffer Pairs,forward:=forward+1;if forward=eof then begin if forward at end of first half then begin reload second half;forward:=forward+1 end,else if forward at end of second half then begin reload first half;move forward to beginning of first half end else terminate lexical analysis end,The Lookahead Operator,Lexical analyzers for certainprogramming language constructs needto look ahead beyond the end of alexeme before they can determine atoken with certainty.,In FORTRAN blanks are not significantoutside of comments and strings,sosuppose that all removable blanks arestripped before lexical analysis begins.,DO5I=1.25 DO5I=1,25,In the first statement,we cannot tell until we see the decimal point that the string DO is part of the identifier DO5I.In the second statement,DO is a keyword by itself.,The lookahead operator can also be used to cope with another difficult lexical analysis problem if FORTRAN:distinguishing keywords from identifiers.For example,the input IF(I,J)=3 is a perfectly good FORTRAN assignment statement,not a logical if-statement.,Transition Diagrams,Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token.Positions in a transition diagram are drawn as circles and are called states.The states are connected by arrows,called edges.Edges leaving states s have labels indicating the input characters that can next appear after the transition diagram has reached state s.,Figure 3.2 shows a transition diagram for patters=and.The transition diagram works as follows.Its start is state 0.In state 0,we read the next input character.The edge labels from state 0 is to be followed to state 1 if this input character is.Otherwise,we have failed to recognize either or=.,0,1,2,3,*,Fig.3.2 Transition diagram for=,Since keywords are sequences of letters,they are exceptions to the rule that a sequence of letters and digits starting with a letter is an identifier.,start,=,other,0,1,2,start,letter,other,*,Fig.3.3 Transition diagram for identifiers and keywords,Implementing a Transition Diagram,int code,value;StrToken:=”;GetChar();GetBC();If(IsLetter()begin while(IsLetter()or IsDigit()begin Concat();GetChar();endRetract();code:=Reserve();,If(code=0)begin value:=InsertId(StrToken);return($ID,value);endelse return(code,-);endelse if(IsDigit(),begin while(IsDigit()begin Concat();GetChar();end Retract();value:=InsertConst(strToken);return($INT,Value);end,else if(ch=)return($ASSIGN,-);else if(ch=+)return($PLUS,-);else if(ch=*)begin GetChar();if(ch=*)return($POWER,-);Retract();return($STAR,-);end,else if(ch=;)return($SEMICOLON,-);else if(ch=()return($LPAR,-);else if(ch=)return($RPAR,-);else if(ch=return($LBRACE,-);else if(ch=)return($RBRACE,-);else ProcError();,Here are the rules that define the regular expressions over alphabet.Associated with each is a specification of the language denoted by the regular expression being defined.A sequence of transition diagrams can be converted into a program to look for the tokens specified by the diagrams.,1.is regular expression that denotes 2.If a is a symbol in,then a is a regular expression that denotes a.3.Suppose r and s are regular expressions denoting the languages L(r)and L(s).Then,(1)(r)|(s)is a regular expression denoting L(r)L(s).(2)(r)(s)is a regular expression denoting L(r)L(s).(3)(r)*is a regular expression denoting(L(r)*.(4)(r)is a regular expression denoting L(r).A language denoted by a regular expression is said to be a regular set.,Unnecessary parentheses can be avoided in regular expressions if we adopt the conventions that:1.the unary operator*has the highest precedence and is left associative.2.concatenation has the second highest precedence and is left associative.3.|has the lowest precedence and is left associative,Example 3.2,Let=a,b 1.The regular expression a|b denotes the set