A compiler is a special program that processes statements written in a
particular programming language and turns them into machine language
or "code" that a computer's processor uses. Typically, a programmer
writes language statements in a language such as Pascal or C one line at a
time using an editor. The file that is created contains what are called the
source statements. The programmer then runs the appropriate language
compiler, specifying the name of the file that contains the source
statements.

When executing (running), the compiler first parses (or analyzes) all of
the language statements syntactically one after the other and then, in one
or more successive stages or "passes", builds the output code, making
sure that statements that refer to other statements are referred to correctly
in the final code. Traditionally, the output of the compilation has been
called object code or sometimes an object module. (Note that the term
"object" here is not related to object-oriented programming.) The object
code is machine code that the processor can process or "execute" one
instruction at a time.

More recently, the Java programming language, a language used in
object-oriented programming, has introduced the possibility of compiling
output (called bytecode) that can run on any computer system platform
for which a Java virtual machine or bytecode interpreter is provided to
convert the bytecode into instructions that can be executed by the actual
hardware processor. Using this virtual machine, the bytecode can
optionally be recompiled at the execution platform by a just-in-time
compiler.
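The same two-step idea (compile to bytecode, then let a virtual machine
run it) can be observed in other languages. As an illustration only,
the sketch below uses Python rather than Java, because Python's standard
library exposes its own bytecode compiler; the variable names are
assumptions made for the example.

```python
import dis

source = "total = price * quantity"

# Step 1: compile the source text into a code object that holds bytecode.
code = compile(source, "<example>", "exec")

# List the names of the bytecode instructions (they vary by Python version).
opnames = [instruction.opname for instruction in dis.Bytecode(code)]
print(opnames)

# Step 2: the interpreter's virtual machine executes the bytecode.
namespace = {"price": 3, "quantity": 4}
exec(code, namespace)
print(namespace["total"])  # 12
```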

Traditionally in some operating systems, an additional step was required
after compilation: resolving the relative location of instructions
and data when more than one object module was to be run at the same
time and they cross-referred to each other's instruction sequences or data.
This process was sometimes called linkage editing and the output known
as a load module.

A compiler works with what are sometimes called 3GL and higher-level
languages. An assembler works on programs written using a processor's
assembler language.

A processor is the logic circuitry that responds to and processes the basic
instructions that drive a computer.
The term processor has generally replaced the term central processing
unit (CPU). The processor in a personal computer or embedded in small
devices is often called a microprocessor.

Machine code is the elemental language of computers, consisting
of a stream of 0's and 1's. Ultimately, the output of any programming
language analysis and processing is machine code. After you write a
program, your source language statements are compiled or (in the case of
assembler language) assembled into output that is machine code. This
machine code is stored as an executable file until someone tells the
computer's operating system to run it. (In personal computer operating
systems, these files often have the suffix of ".exe".)

The computer's microprocessor reads in and handles a certain number of
0's and 1's at a time. For example, it may be designed to read 32 binary
digits at a time. Because it is designed to know how many bits (and which
bits) tell it what operation to do, it can look at the right sequence of bits
and perform the next operation. Then it reads the next instruction, and so
on.
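The fetch-and-decode cycle described above can be sketched in a few lines.
The 32-bit layout and the opcode numbers below are assumptions invented
for illustration; every real processor defines its own instruction format.

```python
# Hypothetical 32-bit instruction layout (an assumption for illustration):
#   bits 31-24: opcode, bits 23-12: operand a, bits 11-0: operand b
OPCODES = {0x01: "ADD", 0x02: "SUB"}  # made-up opcode numbers

def decode(word):
    """Split one 32-bit word into (operation, operand a, operand b)."""
    opcode = (word >> 24) & 0xFF
    a = (word >> 12) & 0xFFF
    b = word & 0xFFF
    return OPCODES[opcode], a, b

def run(words):
    """Read each word in turn, decode it, and perform the operation."""
    results = []
    for word in words:
        op, a, b = decode(word)
        results.append(a + b if op == "ADD" else a - b)
    return results

program = [0x0100C008, 0x0200C008]  # ADD 12,8 then SUB 12,8
print(run(program))  # [20, 4]
```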

In analyzing problems or debugging programs, one useful tool is a dump of
the program. A dump is a printout that shows the program in its machine
code form, but since putting it in 0's and 1's would be hard to read, each
four bits (of 0's and 1's) are represented by a single hexadecimal numeral.
(Dumps also contain other information about the computer's operation,
such as the address of the instruction that was being executed at the time
the dump was initiated.)
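To make the four-bits-per-numeral relationship concrete, here is a small
sketch that prints the same bytes both ways. The byte values are arbitrary
example data, not taken from any real program.

```python
def to_bits(data):
    """Render bytes as a string of 0's and 1's."""
    return "".join(f"{byte:08b}" for byte in data)

def to_hex(data):
    """Render the same bytes with one hexadecimal numeral per four bits."""
    bits = to_bits(data)
    return "".join(f"{int(bits[i:i + 4], 2):X}" for i in range(0, len(bits), 4))

machine_code = bytes([0x01, 0x00, 0xC0, 0x08])  # arbitrary example bytes
print(to_bits(machine_code))  # 00000001000000001100000000001000
print(to_hex(machine_code))   # 0100C008
```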

In the computer industry, the abbreviations 1GL through 5GL are widely
used to represent major steps or "generations" in the evolution of
programming languages.

1GL or first-generation language was (and still is) machine language or
the level of instructions and data that the processor is actually given to
work on (which in conventional computers is a string of 0s and 1s).

2GL or second-generation language is assembler (sometimes called
"assembly") language. A typical 2GL instruction looks like this:

ADD 12,8

An assembler converts the assembler language statements into machine
language.

3GL or third-generation language is a "high-level" programming
language, such as PL/I, C, or Java. Java language statements look like
this:

public boolean handleEvent (Event evt) {
    switch (evt.id) {
        case Event.ACTION_EVENT: {
            if ("Try me".equals(evt.arg)) {

A compiler converts the statements of a specific high-level programming
language into machine language. (In the case of Java, the output is called
bytecode, which is converted into appropriate machine language by a
Java virtual machine that runs as part of an operating system platform.) A
3GL requires a considerable amount of programming knowledge.

4GL or fourth-generation language is designed to be closer to natural
language than a 3GL language. Languages for accessing databases are
often described as 4GLs. A 4GL language statement might look like this:



5GL or fifth-generation language is programming that uses a visual or
graphical development interface to create source language that is usually
compiled with a 3GL or 4GL language compiler. Microsoft, Borland,
IBM, and other companies make 5GL visual programming products for
developing applications in Java, for example. Visual programming allows
you to easily envision object-oriented programming class hierarchies and
drag icons to assemble program components.

A compiler is a computer program (or set of programs) that translates
text written in a computer language (the source language) into another
computer language (the target language). The original sequence is
usually called the source code and the output called object code.
Commonly the output has a form suitable for processing by other
programs (e.g., a linker), but it may be a human-readable text file.

The most common reason for wanting to translate source code is to create
an executable program. The name "compiler" is primarily used for
programs that translate source code from a high-level programming
language to a lower-level language (e.g., assembly language or machine
language). A program that translates from a low-level language to a
higher-level one is a decompiler. A program that translates between
high-level languages is usually called a language translator,
source-to-source translator, or language converter. A language rewriter is
usually a program that translates the form of expressions without a change
of language.
A compiler is likely to perform many or all of the following operations:
lexical analysis, preprocessing, parsing, semantic analysis, code
generation, and code optimization.
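A toy pipeline can show how three of these operations fit together. The
sketch below covers only lexical analysis, parsing, and code generation
for integer arithmetic; the stack-machine instruction names (PUSH, ADD,
SUB, MUL) are assumptions, and preprocessing, semantic analysis, and
optimization are left out entirely.

```python
import re

def lex(text):
    """Lexical analysis: characters -> list of number/operator lexemes."""
    return re.findall(r"\d+|[-+*()]", text)

def parse(tokens):
    """Parsing: lexemes -> nested tuples forming a small syntax tree."""
    pos = 0

    def factor():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = expr()
            pos += 1  # skip the closing ")"
            return node
        return ("num", int(tok))

    def term():
        nonlocal pos
        node = factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ("*", node, factor())
        return node

    def expr():
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            node = (op, node, term())
        return node

    return expr()

def generate(node):
    """Code generation: syntax tree -> instructions for a stack machine."""
    if node[0] == "num":
        return [("PUSH", node[1])]
    op, left, right = node
    name = {"+": "ADD", "-": "SUB", "*": "MUL"}[op]
    return generate(left) + generate(right) + [(name, None)]

def run(code):
    """Execute the generated instructions to check the translation."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append({"ADD": a + b, "SUB": a - b, "MUL": a * b}[op])
    return stack[0]

code = generate(parse(lex("2 * (3 + 4) - 5")))
print(code)
print(run(code))  # 9
```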

In computer science, source code (commonly just source or code) is any
sequence of statements or declarations written in some human-readable
computer programming language. Such a language typically uses a small,
restricted vocabulary, often drawn from English, to reduce ambiguity.
Source code allows the programmer to communicate with the computer using
a limited set of instructions.

The source code which constitutes a program is usually held in one or
more text files. It is sometimes stored in databases as stored procedures
and may also appear as code snippets printed in books or other media. A large
collection of source code files may be organized into a directory tree, in
which case it may also be known as a source tree.

A computer program's source code is the collection of files needed to
convert from human-readable form to some kind of computer-executable
form. The source code may be converted into an executable file by a
compiler, or executed on the fly from the human-readable form with the
aid of an interpreter.

The code base of a programming project is the larger collection of all the
source code of all the computer programs which make up the project.

In computer science, object code, or an object file, is the representation
of code that a compiler or assembler generates by processing a source
code file. Object files contain compact code, often called "binaries". A
linker is typically used to generate an executable or library by linking
object files together. The only essential element in an object file is
machine code (code directly executed by a computer's CPU). Object files
for embedded systems might contain nothing but machine code.
However, object files often also contain data for use by the code at
runtime, relocation information, program symbols (names of variables
and functions) for linking and/or debugging purposes, and other
debugging information.

In computer science, a linker or link editor is a program that takes one
or more objects generated by compilers and assembles them into a single
executable program.

In IBM mainframe environments such as OS/360 this program is known
as a linkage editor.

(On Unix variants the term loader is often used as a synonym for linker.
Because this usage blurs the distinction between the compile-time process
and the run-time process, this article will use linking for the former and
loading for the latter. However, in some operating systems the same
program handles both the jobs of linking and loading a program; see
dynamic linking.)

Computer programs typically comprise several parts or modules; these
parts, if not all contained within a single object-code file, refer to each
other by means of symbols (Grover 2002). Typically, an object file can
contain three kinds of symbols:

• defined symbols, which allow it to be called by other modules,
• undefined symbols, which call the other modules where these
symbols are defined, and
• local symbols, used internally within the object file to facilitate
relocation (Grover 2002).

When a program comprises multiple object-code files, the linker
combines these files into a unified executable program, resolving the
symbols as it goes along.
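Symbol resolution can be sketched with plain dictionaries. The object-file
model below (a dict with 'name', 'defined', and 'undefined' keys) and the
file names are assumptions made for illustration; real object formats are
far richer, and local symbols are not modeled here.

```python
def link(object_files):
    """Resolve every undefined symbol against symbols defined elsewhere.

    Each object file is modeled as a dict with 'name', 'defined' (a set of
    symbols it exports) and 'undefined' (a set of symbols it references).
    """
    definitions = {}
    for obj in object_files:
        for symbol in obj["defined"]:
            if symbol in definitions:
                raise ValueError(f"duplicate symbol: {symbol}")
            definitions[symbol] = obj["name"]

    resolved = {}
    for obj in object_files:
        for symbol in obj["undefined"]:
            if symbol not in definitions:
                raise ValueError(f"unresolved symbol: {symbol}")
            # Record which object file supplies the referenced symbol.
            resolved[(obj["name"], symbol)] = definitions[symbol]
    return resolved

main_o = {"name": "main.o", "defined": {"main"}, "undefined": {"area"}}
geom_o = {"name": "geom.o", "defined": {"area"}, "undefined": set()}
print(link([main_o, geom_o]))  # {('main.o', 'area'): 'geom.o'}
```

A reference to a symbol that no object file defines raises an error, which
mirrors the "unresolved symbol" messages linkers commonly report.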

Linkers can take objects from a collection called a library. Some linkers
do not include the whole library in the output; they only include its
symbols that are referenced from other object files or libraries. Libraries
exist for diverse purposes, and one or more system libraries are usually
linked in by default.

The linker also takes care of arranging the objects in a program's address
space. This may involve relocating code that assumes a specific base
address to another base. Since a compiler seldom knows where an object
will reside, it often assumes a fixed base location (for example, zero).
Relocating machine code may involve re-targeting of absolute jumps,
loads and stores.
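Relocation amounts to adding the real base address to every slot that
holds an absolute address. In the sketch below, the list-of-integers code
model and the list of relocation indexes are simplifying assumptions, not
a real object-file format.

```python
def relocate(code, relocations, base):
    """Return a copy of code with each listed address slot shifted by base.

    code        -- list of integers; some entries are absolute addresses
    relocations -- indexes of the entries that hold absolute addresses
    base        -- address at which the object is actually placed
    """
    patched = list(code)
    for index in relocations:
        patched[index] += base
    return patched

# Object produced as if loaded at address 0: slot 1 is the target of an
# absolute jump, slot 3 is the address used by a load instruction.
code = [0x10, 4, 0x20, 6, 0x00, 0x00, 42]
print(relocate(code, relocations=[1, 3], base=1000))
# [16, 1004, 32, 1006, 0, 0, 42]
```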

The executable output by the linker may need another relocation pass
when it is finally loaded into memory (just before execution). This pass is
usually omitted on hardware offering virtual memory — every program is
put into its own address space, so there is no conflict even if all programs
load at the same base address. This pass may also be omitted if the
executable is a position-independent executable.

An assembly language is a low-level language for programming
computers. It implements a symbolic representation of the numeric
machine codes and other constants needed to program a particular CPU
architecture. This representation is usually defined by the hardware
manufacturer, and is based on abbreviations (called mnemonics) that help
the programmer remember individual instructions, registers, etc. An
assembly language is thus specific to a certain physical or virtual
computer architecture (as opposed to most high-level languages, which
are portable).

Assembly languages were first developed in the 1950s, when they were
referred to as second generation programming languages. They
eliminated much of the error-prone and time-consuming first-generation
programming needed with the earliest computers, freeing the programmer
from tedium such as remembering numeric codes and calculating
addresses. They were once widely used for all sorts of programming.
However, by the 1980s (1990s on small computers), their use had largely
been supplanted by high-level languages, in the search for improved
programming productivity. Today, assembly language is used primarily
for direct hardware manipulation, access to specialized processor
instructions, or to address critical performance issues. Typical uses are
device drivers, low-level embedded systems, and real-time systems.

A utility program called an assembler is used to translate assembly
language statements into the target computer's machine code. The
assembler performs a more or less isomorphic translation (a one-to-one
mapping) from mnemonic statements into machine instructions and data.
(This is in contrast with high-level languages, in which a single statement
generally results in many machine instructions. A compiler, analogous to
an assembler, is used to translate high-level language statements into
machine code; or an interpreter executes statements directly.)
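The one-to-one character of assembly can be sketched with a lookup table.
The opcode numbers and the byte encoding below are made up for
illustration; a real instruction set defines its own.

```python
# Made-up opcode numbers; a real instruction set defines its own.
OPCODES = {"ADD": 0x01, "SUB": 0x02, "JMP": 0x03}

def assemble_line(line):
    """Translate one 'MNEMONIC operand,operand' statement into bytes."""
    mnemonic, _, operand_text = line.strip().partition(" ")
    operands = [int(part) for part in operand_text.split(",") if part.strip()]
    return bytes([OPCODES[mnemonic.upper()], *operands])

def assemble(source):
    """One statement in, one machine instruction out, in the same order."""
    return [assemble_line(line) for line in source.splitlines() if line.strip()]

program = assemble("ADD 12,8\nJMP 0")
print([instruction.hex() for instruction in program])  # ['010c08', '0300']
```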

Many sophisticated assemblers offer additional mechanisms to facilitate
program development, control the assembly process, and aid debugging.
In particular, most modern assemblers include a macro facility and are
called macro assemblers.

In computer science, lexical analysis is the process of converting a
sequence of characters into a sequence of tokens. Programs performing
lexical analysis are called lexical analyzers or lexers. A lexer is often
organized as separate scanner and tokenizer functions, though the
boundaries may not be clearly defined.

Lexical grammar

The specification of a programming language will include a set of rules,
often expressed syntactically, specifying the set of possible character
sequences that can form a token or lexeme. The whitespace characters are
often ignored during lexical analysis.

Token

A token is a categorized block of text. The block of text corresponding to
the token is known as a lexeme. A lexical analyzer processes lexemes to
categorize them according to function, giving them meaning. This
assignment of meaning is known as tokenization. A token can look like
anything; it just needs to be a useful part of the structured text.

Consider this expression in the C programming language:


Tokenized in the following table:

token type






Tokens are frequently defined by regular expressions, which are
understood by a lexical analyzer generator such as lex. The lexical
analyzer (either generated automatically by a tool like lex, or hand-
crafted) reads in a stream of characters, identifies the lexemes in the
stream, and categorizes them into tokens. This is called "tokenizing." If
the lexer finds an invalid token, it will report an error.

Following tokenizing is parsing. From there, the interpreted data may be
loaded into data structures, for general use, interpretation, or compiling.

Consider a text describing a calculation:

46 - number_of(cows);

The lexemes here might be: "46", "-", "number_of", "(", "cows", ")"
and ";". The lexical analyzer will classify the lexeme "46" as 'number',
"-" as 'character' and "number_of" as a separate token. Even the lexeme
";" has a special meaning in some languages (such as C).

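A regular-expression-based lexer for the calculation above might look
like the following sketch. The 'number' and 'character' categories follow
the text; the 'identifier' category name and the exact patterns are
assumptions.

```python
import re

# Category names follow the text where possible; patterns are assumptions.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("character", r"[-+*/();]"),
    ("skip", r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC))

def tokenize(text):
    """Return (category, lexeme) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = PATTERN.match(text, pos)
        if match is None:
            # An invalid token is reported as an error.
            raise SyntaxError(f"invalid character {text[pos]!r} at {pos}")
        if match.lastgroup != "skip":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token in tokenize("46 - number_of(cows);"):
    print(token)
```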
Scanner

The first stage, the scanner, is usually based on a finite state machine. It
has encoded within it information on the possible sequences of characters
that can be contained within any of the tokens it handles (individual
instances of these character sequences are known as lexemes). For
instance, an integer token may contain any sequence of numerical digit
characters. In many cases, the first non-whitespace character can be used
to deduce the kind of token that follows and subsequent input characters
are then processed one at a time until reaching a character that is not in
the set of characters acceptable for that token (this is known as the
maximal munch rule). In some languages the lexeme creation rules are
more complicated and may involve backtracking over previously read
characters.
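The maximal munch rule can be sketched with a hand-written scanner that
picks a state from the first character and then consumes characters while
they remain acceptable. The two states shown (integer and identifier) and
the function name are assumptions chosen to keep the example short.

```python
def scan(text):
    """Split text into lexemes by looking at the first character of each
    lexeme and then consuming characters while they remain acceptable."""
    lexemes = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        if ch.isdigit():
            accept = str.isdigit                          # integer state
        elif ch.isalpha() or ch == "_":
            accept = lambda c: c.isalnum() or c == "_"    # identifier state
        else:
            lexemes.append(ch)                            # one-character lexeme
            pos += 1
            continue
        start = pos
        while pos < len(text) and accept(text[pos]):
            pos += 1                                      # maximal munch
        lexemes.append(text[start:pos])
    return lexemes

print(scan("count1 = count1 + 10;"))
# ['count1', '=', 'count1', '+', '10', ';']
```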

Tokenizer

Tokenization is the process of demarcating and possibly classifying
sections of a string of input characters. The resulting tokens are then
passed on to some other form of processing. The process can be
considered a sub-task of parsing input.

Take, for example, the following string. Unlike humans, a computer
cannot intuitively 'see' that there are 9 words. To a computer this is only a
series of 43 characters.

The quick brown fox jumps over the lazy dog

A process of tokenization could be used to split the sentence into word
tokens. Although the following example is given as XML there are many
ways to represent tokenized input:
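One way to produce such a representation is sketched below. The element
names `sentence` and `word` are assumptions chosen for this example, and
splitting on whitespace is only the simplest possible word tokenizer.

```python
from xml.sax.saxutils import escape

def tokenize_words(sentence):
    """Split on whitespace so that each word becomes one token."""
    return sentence.split()

def to_xml(tokens):
    """Wrap each token in a <word> element inside a <sentence> element."""
    lines = ["<sentence>"]
    lines += [f"  <word>{escape(token)}</word>" for token in tokens]
    lines.append("</sentence>")
    return "\n".join(lines)

text = "The quick brown fox jumps over the lazy dog"
tokens = tokenize_words(text)
print(len(text), len(tokens))  # 43 9
print(to_xml(tokens))
```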


A lexeme, however, is only a string of characters known to be of a certain
kind (e.g., a string literal, a sequence of letters). In order to construct a
token, the lexical analyzer needs a second stage, the evaluator, which
goes over the characters of the lexeme to produce a value. The lexeme's
type combined with its value is what properly constitutes a token, which
can be given to a parser. (Some tokens such as parentheses do not really
have values, and so the evaluator function for these can return nothing.
The evaluators for integers, identifiers, and strings can be considerably
more complex. Sometimes evaluators can suppress a lexeme entirely,
concealing it from the parser, which is useful for whitespace and
comments.)
For example, in the source code of a computer program the string

net_worth_future = (assets - liabilities);

might be converted (with whitespace suppressed) into the lexical token
stream:

NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON
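The evaluator stage can be sketched as a function that turns each lexeme
into a (type, value) pair or suppresses it. The token type names used for
punctuation, the NUMBER type, and the scanner pattern are illustrative
assumptions; characters the pattern does not match are silently skipped
in this sketch.

```python
import re

SCANNER = re.compile(r"\s+|\d+|[A-Za-z_][A-Za-z_0-9]*|[=()\-;]")

# Illustrative token type names for the single-character lexemes.
PUNCTUATION = {"=": "EQUALS", "(": "OPEN_PARENTHESIS", "-": "MINUS",
               ")": "CLOSE_PARENTHESIS", ";": "SEMICOLON"}

def evaluate(lexeme):
    """Turn one lexeme into a (type, value) token, or None to suppress it."""
    if lexeme.isspace():
        return None                         # whitespace never reaches the parser
    if lexeme.isdigit():
        return ("NUMBER", int(lexeme))      # the value is the integer itself
    if lexeme in PUNCTUATION:
        return (PUNCTUATION[lexeme], None)  # a type only, with no value
    return ("NAME", lexeme)

def tokens(text):
    """Scan the text into lexemes, then evaluate each one."""
    found = (evaluate(lexeme) for lexeme in SCANNER.findall(text))
    return [token for token in found if token is not None]

for token in tokens("net_worth_future = (assets - liabilities);"):
    print(token)
```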

Though it is possible and sometimes necessary to write a lexer by hand,
lexers are often generated by automated tools. These tools generally
accept regular expressions that describe the tokens allowed in the input
stream. Each regular expression is associated with a production in the
lexical grammar of the programming language that evaluates the lexemes
matching the regular expression. These tools may generate source code
that can be compiled and executed or construct a state table for a finite
state machine (which is plugged into template code for compilation and
execution).
Regular expressions compactly represent patterns that the characters in
lexemes might follow. For example, for an English-based language, a
NAME token might be any English alphabetical character or an
underscore, followed by any number of instances of any ASCII
alphanumeric character or an underscore. This could be represented
compactly by the string [a-zA-Z_][a-zA-Z_0-9]*. This means
"any character a-z, A-Z or _, followed by 0 or more of a-z, A-Z, _ or
0-9".

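The NAME pattern can be tried directly with Python's `re` module. The
helper name `is_name` and the sample lexemes are assumptions made for the
example.

```python
import re

NAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

def is_name(lexeme):
    """True when the whole lexeme matches the NAME pattern."""
    return NAME.fullmatch(lexeme) is not None

for candidate in ["net_worth_future", "_tmp1", "x", "2fast", "net-worth", ""]:
    print(repr(candidate), is_name(candidate))
```

The first three candidates match; "2fast" fails because it starts with a
digit, "net-worth" fails because of the hyphen, and the empty string
fails because at least one leading character is required.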
Regular expressions and the finite state machines they generate are not
powerful enough to handle recursive patterns, such as "n opening
parentheses, followed by a statement, followed by n closing parentheses."
They are not capable of keeping count, and verifying that n is the same on
both sides — unless you have a finite set of permissible values for n. It
takes a full-fledged parser to recognize such patterns in their full
generality. A parser can push parentheses on a stack and then try to pop
them off and see if the stack is empty at the end.
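The stack technique mentioned above can be sketched in a few lines. This
checks only that parentheses balance; it is not a full parser, and the
function name is an assumption.

```python
def balanced(text):
    """Check that every '(' has a matching ')' using an explicit stack."""
    stack = []
    for ch in text:
        if ch == "(":
            stack.append(ch)
        elif ch == ")":
            if not stack:
                return False   # a ')' with nothing left to match
            stack.pop()
    return not stack           # a leftover '(' means unbalanced

print(balanced("((x))"))   # True
print(balanced("((x)"))    # False
print(balanced("(x))"))    # False
```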

The Lex programming tool and its compiler are designed to generate code
for fast lexical analysers based on a formal description of the lexical
syntax. It is not generally considered sufficient for applications with a
complicated set of lexical rules and severe performance requirements; for
instance, the GNU Compiler Collection uses hand-written lexers.

Lexer generator

Lexical analysis can often be performed in a single pass if reading is done
a character at a time. Single-pass lexers can be generated by tools such as
the classic flex.

The lex/flex family of generators uses a table-driven approach which is
generally less efficient than the directly coded approach. With the latter
approach the generator produces an engine that jumps directly to follow-up
states via goto statements. Tools like re2c and queχ have been reported
(e.g. in an article about re2c) to produce engines that are two to three
times faster than flex-produced engines. It is in general difficult to
hand-write analyzers that perform better than engines generated by these
latter tools.

The simple utility of using a scanner generator should not be discounted,
especially in the developmental phase, when a language specification
might change daily. The ability to express lexical constructs as regular
expressions facilitates the description of a lexical analyzer. Some tools
offer the specification of pre- and post-conditions which are hard to
program by hand. In that case, using a scanner generator may save a lot of
development time.