"System Programming @ D.I.E.T": Compiler

Compiler

Posted at 9:48 AM

What is a Compiler?

A compiler is a computer program (or set of programs) that transforms source code written in a computer language (the source language) into another computer language (the target language, often having a binary form known as object code). The most common reason for wanting to transform source code is to create an executable program.

The Basic Structure of a Compiler

The five stages of a compiler combine to translate a high level language to a low level language, generally closer to that of the target computer. Each stage, or sub-process, fulfills a single task and has one or more classic techniques for implementation.

Component	Purpose	Techniques
Lexical Analyzer	Analyzes the source code; removes white space and comments; formats it for easy access (creates tokens); tags language elements with type information; begins to fill in information in the SYMBOL TABLE **	Regular expressions; finite state machines; LEX
Syntactic Analyzer	Analyzes the tokenized code for structure; amalgamates symbols into syntactic groups; tags groups with type information	Backus-Naur Form; top-down analyzers; bottom-up analyzers; expression analyzers; YACC
Semantic Analyzer	Analyzes the parsed code for meaning; fills in assumed or missing information; tags groups with meaning information	Attribute grammars; ad hoc analyzers
Code Generator	Linearizes the qualified code and produces the equivalent object code	Generally completed by hand-written code
Optimizer	Examines the object code to determine whether there are more efficient means of execution	Common-subexpression elimination; loop unrolling; operator (strength) reduction; etc.

** The Symbol Table is the data structure that all elements of the compiler use to collect and share information about symbols and groups of symbols in the program being translated.
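As a rough illustration of how such a shared structure might look, here is a minimal sketch in C. The names `SymbolTable`, `symtab_intern`, `symtab_lookup`, and `symtab_set_type` are made up for this example, and the fixed-size array with linear search is an assumption chosen for brevity; production compilers typically use hash tables and scoping.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 64
#define MAX_NAME    32

/* Hypothetical entry layout: one name plus type text filled in later. */
typedef struct {
    char name[MAX_NAME];
    char type[MAX_NAME];   /* empty until a later stage fills it in */
} SymbolEntry;

typedef struct {
    SymbolEntry entries[MAX_SYMBOLS];
    size_t count;
} SymbolTable;

/* Find an existing symbol by name; returns NULL if it is not in the table. */
SymbolEntry *symtab_lookup(SymbolTable *table, const char *name)
{
    assert(table != NULL && name != NULL);
    for (size_t i = 0; i < table->count; i++)
        if (strcmp(table->entries[i].name, name) == 0)
            return &table->entries[i];
    return NULL;
}

/* Return the entry for name, creating it if needed (as a lexer might).
 * Returns NULL if the table is full or the name is too long. */
SymbolEntry *symtab_intern(SymbolTable *table, const char *name)
{
    SymbolEntry *entry = symtab_lookup(table, name);
    if (entry != NULL)
        return entry;
    if (table->count == MAX_SYMBOLS || strlen(name) >= MAX_NAME)
        return NULL;
    entry = &table->entries[table->count++];
    strcpy(entry->name, name);
    entry->type[0] = '\0';
    return entry;
}

/* Record type information (as a semantic analyzer might). Returns 1 on success. */
int symtab_set_type(SymbolTable *table, const char *name, const char *type)
{
    assert(type != NULL);
    SymbolEntry *entry = symtab_lookup(table, name);
    if (entry == NULL || strlen(type) >= MAX_NAME)
        return 0;
    strcpy(entry->type, type);
    return 1;
}
```

In this sketch the lexical analyzer would call `symtab_intern` when it first sees a name, and a later stage would call `symtab_set_type` once the declaration has been understood.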

*Alternative* Answer:-

Question - How does a language compiler work? For example, what is the mechanism behind the compiling process of a program in a specific language?

The question was about compilers, so I will explain how a compiler works, rather than the whole process of converting a source program into an executable program. The first involves only a compiler, while in the second a compiler is only one of the programs involved.

A compiler for a language generally has several different stages as it processes the input. These are:

1. Preprocessing
2. Lexical analysis
3. Syntactical analysis
4. Semantic analysis
5. Intermediate code generation
6. Code optimization
7. Code generation

Most of these stages occur during a single pass, or reading, of the source files. For example, the preprocessing stage usually reads only slightly ahead of the lexical analysis stage, which is usually one word ahead of the syntactical analysis stage.

1. Preprocessing

During the preprocessing stage, comments, macros, and directives are processed.

Comments are removed from the source file. This greatly simplifies the later stages.

If the language supports macros, the macros are replaced with the equivalent text.

For example, C and C++ support macros using the #define directive. So if a macro were defined for pi as:

#define PI 3.1415927

any time the preprocessor encountered the word PI, it would replace PI with 3.1415927 and process the resulting text.
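The substitution described above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: `expand_macro` is a made-up helper, it handles one object-like macro, and it works on raw text rather than on preprocessing tokens, so it would wrongly expand a name inside a string literal or a comment.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy src into dst, replacing every whole-word
 * occurrence of the identifier `name` with `body`.
 * Returns 0 on success, -1 if dst (dst_size bytes) is too small. */
int expand_macro(const char *src, const char *name, const char *body,
                 char *dst, size_t dst_size)
{
    assert(src != NULL && name != NULL && body != NULL && dst != NULL);
    if (dst_size == 0)
        return -1;

    size_t name_len = strlen(name);
    size_t body_len = strlen(body);
    size_t out = 0;

    while (*src != '\0') {
        if (isalpha((unsigned char)*src) || *src == '_') {
            /* Scan one whole identifier so "PI" never matches inside "PIE". */
            const char *start = src;
            while (isalnum((unsigned char)*src) || *src == '_')
                src++;
            size_t len = (size_t)(src - start);
            int is_macro = (len == name_len && memcmp(start, name, len) == 0);
            const char *text = is_macro ? body : start;
            size_t text_len = is_macro ? body_len : len;
            if (out + text_len >= dst_size)
                return -1;
            memcpy(dst + out, text, text_len);
            out += text_len;
        } else {
            /* Anything that cannot start an identifier is copied unchanged. */
            if (out + 1 >= dst_size)
                return -1;
            dst[out++] = *src++;
        }
    }
    dst[out] = '\0';
    return 0;
}
```

Calling `expand_macro("area = PI * r * r;", "PI", "3.1415927", buf, sizeof buf)` leaves `area = 3.1415927 * r * r;` in `buf`, while an identifier such as `PIE` is left alone because the whole word is compared.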

The preprocessor also handles preprocessor directives. These are most often include statements. In C and C++, an include statement looks like either:

#include <file>

#include "file"

These lines are replaced by the contents of the named file, and the resulting text is processed.

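Special strings may also be replaced with other characters at about this point. In C and C++, the \ character inside a character or string constant introduces an escape sequence, which is replaced with a special character. For example, \t is the escape code for a tab, so \t would be replaced with a tab character. Strictly speaking, the C standard places this conversion in a translation phase just after the preprocessing directives and macros have been handled, but it still happens before syntactical analysis.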

2. Lexical analysis

Lexical analysis is the process of breaking down the source files into key words, constants, identifiers, operators, and other simple tokens. A token is the smallest piece of text that the language defines.

A. Key words are words the language defines, and which always have specific meaning in the language. In C and C++ some of these key words are:

else
int
char
while
for
struct
return

B. Constants are the literal valued items that the language can recognize. Often these are numbers, strings, and characters:

i. Numbers are the types of numbers that may be used in expressions: 3.14, 5, 12, 0. Negative numbers such as -17, however, are usually processed as an operator (-) and a number (17).

ii. Strings are text items the language can recognize. In C or C++ a string is enclosed by double quotes: "This is a string"

iii. Characters are single letters. In C or C++, a character is enclosed by single quotes: 'c'

C. Identifiers are names the programmer has given to something. These include variables, functions, classes, enumerations, etc. Each language has rules for specifying how these names can be written.

D. Operators are the mathematical, logical, and other operators that the language can recognize. Each language generally has the standard operators +, -, *, /, and often defines many other operators as well. For example, some of the additional operators C and C++ define are:

% modulo
-- decrement
++ increment

E. Other tokens are things not covered by any of the above items. Depending on the compiler, punctuation such as { ( ) } is valid in the language but is not treated as a key word or operator. Anything else that fits none of these categories will usually produce an error.
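The categories A through E can be turned into a small hand-written scanner. The following is a minimal sketch in C, not how any particular compiler does it: the `Token` layout, the `TokenKind` names, and the `next_token` function are all assumptions made for this example, and only a handful of key words, operators, and punctuation marks are recognized.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum { TOK_KEYWORD, TOK_NUMBER, TOK_STRING, TOK_CHAR, TOK_IDENT,
               TOK_OPERATOR, TOK_PUNCT, TOK_ERROR, TOK_END } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start;   /* points into the source text, not copied */
    size_t length;
} Token;

static int is_keyword(const char *s, size_t n)
{
    static const char *const words[] = { "else", "int", "char", "while",
                                         "for", "struct", "return" };
    for (size_t i = 0; i < sizeof words / sizeof words[0]; i++)
        if (strlen(words[i]) == n && memcmp(words[i], s, n) == 0)
            return 1;
    return 0;
}

/* Read one token starting at *cursor and advance *cursor past it. */
Token next_token(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);
    const char *p = *cursor;
    while (isspace((unsigned char)*p))
        p++;                                   /* white space is dropped */

    Token t = { TOK_END, p, 0 };
    if (*p == '\0') {
        *cursor = p;
        return t;
    }
    if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        t.kind = is_keyword(t.start, (size_t)(p - t.start)) ? TOK_KEYWORD
                                                            : TOK_IDENT;
    } else if (isdigit((unsigned char)*p)) {
        while (isdigit((unsigned char)*p))
            p++;
        if (*p == '.') {                       /* optional fraction: 3.14 */
            p++;
            while (isdigit((unsigned char)*p))
                p++;
        }
        t.kind = TOK_NUMBER;
    } else if (*p == '"' || *p == '\'') {
        char quote = *p++;
        while (*p != '\0' && *p != quote) {
            if (*p == '\\' && p[1] != '\0')
                p++;                           /* skip the escaped character */
            p++;
        }
        if (*p == quote) {
            p++;
            t.kind = (quote == '"') ? TOK_STRING : TOK_CHAR;
        } else {
            t.kind = TOK_ERROR;                /* unterminated literal */
        }
    } else if ((*p == '+' || *p == '-') && p[1] == *p) {
        p += 2;                                /* ++ or -- */
        t.kind = TOK_OPERATOR;
    } else if (strchr("+-*/%=", *p) != NULL) {
        p++;
        t.kind = TOK_OPERATOR;
    } else if (strchr("{}();,", *p) != NULL) {
        p++;
        t.kind = TOK_PUNCT;
    } else {
        p++;
        t.kind = TOK_ERROR;                    /* character fits no category */
    }
    t.length = (size_t)(p - t.start);
    *cursor = p;
    return t;
}
```

Calling `next_token` repeatedly on `x = -17 + y++;` yields an identifier, the operator `=`, the operator `-`, the number `17`, the operator `+`, an identifier, the operator `++`, and the punctuation `;`. Note that `-17` comes out as two tokens, as described in item B.i.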

3. Syntactical analysis

Syntactical analysis is the process of combining the tokens into well-formed expressions, statements, and programs. Each language has specific rules about the structure of a program, called the grammar or syntax. Just like English grammar, it specifies how things may be put together. In English, a simple sentence is: subject, verb, predicate.

In C or C++ an if statement is:

if ( expression ) statement

The syntactical analysis checks that the syntax is correct, but doesn't enforce that it makes sense. In English, the subject could be "Pants", the verb "are", and the predicate "a kind of car". This would yield "Pants are a kind of car", which is a sentence but doesn't make much sense.

In C or C++, a constant can be used in an expression, so the declaration:

float x = "This is red"++;

is syntactically valid, but doesn't make sense, because a float cannot have a string assigned to it and a string cannot be incremented.
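One classic way to check a rule such as "if ( expression ) statement" is a top-down, recursive-descent parser with one function per grammar rule. The sketch below is in C and uses a toy grammar invented for this example (it is far smaller than the real C grammar); the function names are likewise made up. It only answers whether the input is well formed and, as described above, knows nothing about whether the input makes sense.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy grammar (an assumption, not the real C grammar):
 *   statement := "if" "(" expr ")" statement | expr ";"
 *   expr      := term { ("+" | "-") term }
 *   term      := NAME_OR_NUMBER | "(" expr ")"
 * Each parse_* function returns the position just past what it matched,
 * or NULL when the text does not fit the rule. */

static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

static const char *parse_expr(const char *p);

static const char *parse_term(const char *p)
{
    p = skip_ws(p);
    if (*p == '(') {
        p = parse_expr(p + 1);
        if (p == NULL)
            return NULL;
        p = skip_ws(p);
        return (*p == ')') ? p + 1 : NULL;
    }
    if (isalnum((unsigned char)*p) || *p == '_') {
        /* Names and numbers are lumped together to keep the sketch short. */
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        return p;
    }
    return NULL;
}

static const char *parse_expr(const char *p)
{
    p = parse_term(p);
    while (p != NULL) {
        const char *q = skip_ws(p);
        if (*q != '+' && *q != '-')
            break;
        p = parse_term(q + 1);      /* an operator must be followed by a term */
    }
    return p;
}

static const char *parse_statement(const char *p)
{
    p = skip_ws(p);
    if (strncmp(p, "if", 2) == 0 &&
        !isalnum((unsigned char)p[2]) && p[2] != '_') {
        p = skip_ws(p + 2);
        if (*p != '(')
            return NULL;
        p = parse_expr(p + 1);
        if (p == NULL)
            return NULL;
        p = skip_ws(p);
        if (*p != ')')
            return NULL;
        return parse_statement(p + 1);
    }
    p = parse_expr(p);
    if (p == NULL)
        return NULL;
    p = skip_ws(p);
    return (*p == ';') ? p + 1 : NULL;
}

/* Returns 1 if src is exactly one well-formed statement, 0 otherwise. */
int is_valid_statement(const char *src)
{
    assert(src != NULL);
    const char *end = parse_statement(src);
    return end != NULL && *skip_ws(end) == '\0';
}
```

With this sketch, `is_valid_statement("if (x + 1) y - 2;")` returns 1, while `is_valid_statement("if x + 1) y;")` returns 0 because the opening parenthesis required by the rule is missing.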

4. Semantic analysis

Semantic analysis is the process of examining the types and values of the statements used to make sure they make sense. During the semantic analysis, the types, values, and other required information about statements are recorded, checked, and transformed as appropriate to make sure the program makes sense.

For C/C++ in the line:

float x = "This is red"++;

the semantic analysis would reveal that the types do not match and cannot be made to match, so the statement would be rejected and an error reported.

While in the statement:

float y = 5 + 3.0;

the semantic analysis would reveal that 5 is an integer and 3.0 is a double, and that the rules of the language allow 5 to be converted to a double. The 5 would therefore be converted to a double and the addition performed. Then the compiler would recognize y as a float, perform another conversion from the double 8.0 to a float, and process the assignment.
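The two examples above can be mimicked with a very small type checker. This is a sketch in C under stated assumptions: the `Type` enum and the three functions are invented for illustration, and the rules are deliberately simplified (for instance, real C treats a string constant as an array that decays to a pointer, which this sketch ignores).

```c
#include <assert.h>

/* Numeric types are ordered from narrowest to widest on purpose. */
typedef enum { TY_INT, TY_FLOAT, TY_DOUBLE, TY_STRING, TY_ERROR } Type;

static int is_numeric(Type t)
{
    return t == TY_INT || t == TY_FLOAT || t == TY_DOUBLE;
}

/* Type of `a + b`: both operands are converted to the wider numeric type. */
Type type_of_add(Type a, Type b)
{
    assert(a <= TY_ERROR && b <= TY_ERROR);
    if (!is_numeric(a) || !is_numeric(b))
        return TY_ERROR;
    return (a > b) ? a : b;
}

/* Type of `operand++`: only a numeric variable can be incremented here. */
Type type_of_postfix_increment(Type operand, int is_variable)
{
    if (!is_numeric(operand) || !is_variable)
        return TY_ERROR;
    return operand;
}

/* Can a value of type `value` initialize a variable of type `target`?
 * Numeric types convert to each other implicitly; nothing else does. */
int can_initialize(Type target, Type value)
{
    if (target == TY_ERROR || value == TY_ERROR)
        return 0;
    if (is_numeric(target) && is_numeric(value))
        return 1;
    return target == value;
}
```

For `float y = 5 + 3.0;` the sketch computes `type_of_add(TY_INT, TY_DOUBLE)`, which is `TY_DOUBLE`, and `can_initialize(TY_FLOAT, TY_DOUBLE)` accepts the narrowing conversion. For `float x = "This is red"++;` the increment of a string constant already yields `TY_ERROR`, so the initialization is rejected.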

5. Intermediate code generation

Depending on the compiler, this step may be skipped, and the program may instead be translated directly into the target language (usually machine object code). If this step is implemented, the compiler designers also design a machine-independent language of their own that is close to machine language and easily translated into machine language for any number of different computers.

The purpose of this step is to allow the compiler writers to support different target computers and different languages with a minimum of effort.

The part of the compiler which deals with processing the source files, analyzing the language, and generating the intermediate code is called the front end, while the process of optimizing and converting the intermediate code into the target language is called the back end.
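One common shape for such an intermediate language is three-address code, where each instruction has at most one operator and results go into numbered temporaries. The C sketch below shows the idea; the text above does not name a specific intermediate form, so three-address code is an assumption here, as are the `Node` and `IrBuffer` types and the `gen_ir` function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expression tree as a front end might hand it over. */
typedef struct Node {
    char op;                          /* '+', '-', '*', '/', or 0 for a leaf */
    const char *name;                 /* leaf only: variable or constant text */
    const struct Node *left, *right;  /* operator nodes only */
} Node;

typedef struct {
    char text[256];   /* generated instructions, one per line */
    size_t used;
    int next_temp;    /* number of the next temporary (t0, t1, ...) */
} IrBuffer;

/* Append the instructions that compute `node` to `ir` and store the name
 * that holds the result (a leaf name or a temporary) in `result`. */
void gen_ir(const Node *node, IrBuffer *ir, char *result, size_t result_size)
{
    assert(node != NULL && ir != NULL && result != NULL);
    if (node->op == 0) {
        assert(strlen(node->name) < result_size);
        snprintf(result, result_size, "%s", node->name);
        return;
    }

    char lhs[16], rhs[16];
    gen_ir(node->left, ir, lhs, sizeof lhs);    /* operands first ... */
    gen_ir(node->right, ir, rhs, sizeof rhs);
    snprintf(result, result_size, "t%d", ir->next_temp++);

    /* ... then one instruction for this operator. */
    size_t room = sizeof ir->text - ir->used;
    int n = snprintf(ir->text + ir->used, room, "%s = %s %c %s\n",
                     result, lhs, node->op, rhs);
    if (n > 0)
        ir->used += ((size_t)n < room) ? (size_t)n : room - 1;
}
```

For the tree of `a + b * c`, this sketch emits `t0 = b * c` followed by `t1 = a + t0`. Nothing in those two lines depends on a particular processor, which is what lets a back end for each target machine start from the same intermediate code.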

6. Code optimization

During this process the code generated is analyzed and improved for efficiency. The compiler analyzes the code to see if improvements can be made to the intermediate code that couldn't be made earlier. For example, some languages like Pascal do not allow pointer arithmetic, while all machine languages do. When stepping through arrays, it is often more efficient to advance a pointer than to recompute the address from the index each time, so the code optimizer may detect this case and internally use pointers.
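The array case can be shown by writing both forms by hand in C. This is only an illustration of the equivalence: the function names are made up, an optimizer would perform the rewrite on intermediate code rather than on source text, and modern compilers may well generate the same machine code for both versions.

```c
#include <assert.h>
#include <stddef.h>

/* As the programmer writes it: each access conceptually computes the
 * address values + i * sizeof(int) from the index. */
long sum_indexed(const int *values, size_t count)
{
    assert(values != NULL);
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

/* What an optimizer might use internally: a pointer that is simply
 * advanced by one element per iteration, with no index arithmetic. */
long sum_pointer(const int *values, size_t count)
{
    assert(values != NULL);
    long total = 0;
    const int *end = values + count;
    for (const int *p = values; p != end; p++)
        total += *p;
    return total;
}
```

Both functions return the same result for any array, which is the property an optimizer must preserve: it may change how the work is done, but not what the program computes.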

7. Code generation

Finally, after the intermediate code has been generated and optimized, the compiler generates code for the specific target language. Almost always this is machine code for a particular target machine.

Usually this is not the final machine code but object code, which contains all the instructions but in which not all of the final memory addresses have been determined.

A subsequent program, called a linker, is used to combine several different object code files into the final executable program.
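The idea of object code with addresses still to be determined can be sketched as follows. The layout here is invented for illustration and is far simpler than real object file formats: code is an array of integers, each `Relocation` marks a slot whose address is not yet known, and the hypothetical `link_object` function fills the slots from a table of final addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A symbol whose final address the linker has decided. */
typedef struct {
    const char *name;
    int address;
} Symbol;

/* A hole left by the compiler: code[offset] must receive symbol's address. */
typedef struct {
    size_t offset;
    const char *symbol;
} Relocation;

/* Fill in every hole that can be resolved.
 * Returns the number of relocations whose symbol was not found. */
int link_object(int *code, size_t code_len,
                const Relocation *relocs, size_t reloc_count,
                const Symbol *symbols, size_t symbol_count)
{
    assert(code != NULL && relocs != NULL && symbols != NULL);
    int unresolved = 0;
    for (size_t r = 0; r < reloc_count; r++) {
        assert(relocs[r].offset < code_len);
        int found = 0;
        for (size_t s = 0; s < symbol_count && !found; s++) {
            if (strcmp(symbols[s].name, relocs[r].symbol) == 0) {
                code[relocs[r].offset] = symbols[s].address;
                found = 1;
            }
        }
        if (!found)
            unresolved++;   /* a real linker would report an undefined symbol */
    }
    return unresolved;
}
```

In this sketch a code array such as `{1, 0, 2, 0}` with relocations at offsets 1 and 3 stands for two instructions whose operand addresses are still unknown. Linking against a symbol table that defines only the first symbol patches offset 1 and returns 1 for the symbol that remains undefined.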
