When executing (running), the compiler first parses (or analyzes) all of
the language statements syntactically one after the other and then, in one
or more successive stages or "passes", builds the output code, making
sure that statements that refer to other statements are referred to correctly
in the final code. Traditionally, the output of the compilation has been
called object code or sometimes an object module. (Note that the term
"object" here is not related to object-oriented programming.) The object
code is machine code that the processor can process or "execute" one
instruction at a time.
A compiler works with what are sometimes called 3GL and higher-level
languages. An assembler works on programs written using a processor's
assembler language.
A processor is the logic circuitry that responds to and processes the basic
instructions that drive a computer.
The term processor has generally replaced the term central processing
unit (CPU). The processor in a personal computer or embedded in small
devices is often called a microprocessor. A single instruction in a
processor's assembler language might look like this:

ADD 12,8
The most common reason for wanting to translate source code is to create
an executable program. The name "compiler" is primarily used for
programs that translate source code from a high-level programming
language to a lower-level language (e.g., assembly language or machine
language). A program that translates from a low-level language to a
higher-level one is a decompiler. A program that translates between
high-level languages is usually called a language translator,
source-to-source translator, or language converter. A language rewriter
is usually a program that translates the form of expressions without a
change of language.
The code base of a programming project is the collection of all the
source code of all the computer programs that make up the project.
In computer science, object code, or an object file, is the representation
of code that a compiler or assembler generates by processing a source
code file. Object files contain compact code, often called "binaries". A
linker is typically used to generate an executable or library by linking
object files together. The only essential element in an object file is
machine code (code directly executed by a computer's CPU). Object files
for embedded systems might contain nothing but machine code.
However, object files often also contain data for use by the code at
runtime, relocation information, program symbols (names of variables
and functions) for linking and/or debugging purposes, and other
debugging information.
(On Unix variants the term loader is often used as a synonym for linker.
Because this usage blurs the distinction between the compile-time process
and the run-time process, this article will use linking for the former and
loading for the latter. However, in some operating systems the same
program handles both linking and loading a program; see dynamic
linking.)
Linkers can take objects from a collection called a library. Some linkers
do not include the whole library in the output; they only include its
symbols that are referenced from other object files or libraries. Libraries
exist for diverse purposes, and one or more system libraries are usually
linked in by default.
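The symbol-driven selection described above can be sketched in Python. The object-file model used here, with "defines" and "uses" sets, is an invented simplification for illustration, not a real object-file format:

```python
# A minimal sketch of selective linking: library members are pulled in
# only when they define a symbol that is still unresolved.

def link(objects, library):
    """Return the list of object files in the final link, starting from
    `objects` and adding library members on demand."""
    included = list(objects)

    def unresolved():
        defined = {s for o in included for s in o["defines"]}
        used = {s for o in included for s in o["uses"]}
        return used - defined          # symbols referenced but not defined

    changed = True
    while changed:                     # repeat until no member gets added
        changed = False
        for member in library:
            if member not in included and member["defines"] & unresolved():
                included.append(member)
                changed = True
    return included

main_o = {"name": "main.o", "defines": {"main"}, "uses": {"printf"}}
libc = [
    {"name": "printf.o", "defines": {"printf"}, "uses": set()},
    {"name": "qsort.o",  "defines": {"qsort"},  "uses": set()},
]

print([m["name"] for m in link([main_o], libc)])
# ['main.o', 'printf.o'] — qsort.o defines no needed symbol and is left out
```

The loop re-scans the library because a pulled-in member may itself reference further symbols, which is exactly why real linkers resolve symbols iteratively or in dependency order.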
The linker also takes care of arranging the objects in a program's address
space. This may involve relocating code that assumes a specific base
address to another base. Since a compiler seldom knows where an object
will reside, it often assumes a fixed base location (for example, zero).
Relocating machine code may involve re-targeting of absolute jumps,
loads and stores.
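The relocation step can be sketched in Python. The toy "machine code" below (a flat list of integers, with separate offsets marking which entries hold absolute addresses) is an assumption for illustration; real formats such as ELF encode relocation entries with types and symbol references:

```python
# A minimal sketch of relocation: patch every operand that holds an
# absolute address so the code works at its new base address.

def relocate(code, relocation_entries, old_base, new_base):
    """Return a copy of `code` with each address field shifted by the
    difference between the new and old base addresses."""
    delta = new_base - old_base
    patched = list(code)
    for offset in relocation_entries:   # each entry marks an address field
        patched[offset] += delta        # re-target the absolute address
    return patched

# Object assembled assuming base 0: "jump to 4" then "load from 8".
code = [0x10, 4, 0x20, 8]     # opcodes at offsets 0 and 2, addresses at 1 and 3
relocs = [1, 3]               # offsets of the operands that hold addresses

print(relocate(code, relocs, old_base=0, new_base=0x1000))
# [16, 4100, 32, 4104] — the address operands moved by 0x1000, opcodes untouched
```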
The executable output by the linker may need another relocation pass
when it is finally loaded into memory (just before execution). This pass is
usually omitted on hardware offering virtual memory — every program is
put into its own address space, so there is no conflict even if all programs
load at the same base address. This pass may also be omitted if the
executable is a position independent executable.
Assembly languages were first developed in the 1950s, when they were
referred to as second generation programming languages. They
eliminated much of the error-prone and time-consuming first-generation
programming needed with the earliest computers, freeing the programmer
from tedium such as remembering numeric codes and calculating
addresses. They were once widely used for all sorts of programming.
However, by the 1980s (1990s on small computers), their use had largely
been supplanted by high-level languages, in the search for improved
programming productivity. Today, assembly language is used primarily
for direct hardware manipulation, access to specialized processor
instructions, or to address critical performance issues. Typical uses are
device drivers, low-level embedded systems, and real-time systems.
Token

Consider the expression:

sum=3+2;

A lexical analyzer splits it into the following lexemes and token types:

lexeme    token type
sum       IDENT
=         ASSIGN_OP
3         NUMBER
+         ADD_OP
2         NUMBER
;         SEMICOLON
46 - number_of(cows);

The lexemes here might be: "46", "-", "number_of", "(", "cows", ")"
and ";". The lexical analyzer will classify the lexeme "46" as a
number, "-" as an operator character and "number_of" as a separate
identifier token. Even the lexeme ";" has a special meaning in some
languages (such as C).
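A hand-written lexer for this statement can be sketched in Python. The token-type names and the regular expressions below are assumptions chosen for illustration, not a fixed standard:

```python
import re

# One named group per token type; listed longest-match-friendly,
# with whitespace matched and then discarded.
TOKEN_SPEC = [
    ("NUMBER",    r"\d+"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("MINUS",     r"-"),
    ("LPAREN",    r"\("),
    ("RPAREN",    r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_type, lexeme) pairs; whitespace is dropped.
    Characters matching no pattern are silently skipped in this sketch."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

print(list(tokenize("46 - number_of(cows);")))
# [('NUMBER', '46'), ('MINUS', '-'), ('IDENT', 'number_of'),
#  ('LPAREN', '('), ('IDENT', 'cows'), ('RPAREN', ')'), ('SEMICOLON', ';')]
```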
Scanner
The first stage, the scanner, is usually based on a finite state machine. It
has encoded within it information on the possible sequences of characters
that can be contained within any of the tokens it handles (individual
instances of these character sequences are known as lexemes). For
instance, an integer token may contain any sequence of numerical digit
characters. In many cases, the first non-whitespace character can be used
to deduce the kind of token that follows and subsequent input characters
are then processed one at a time until reaching a character that is not in
the set of characters acceptable for that token (this is known as the
maximal munch rule). In some languages the lexeme creation rules are
more complicated and may involve backtracking over previously read
characters.
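The maximal munch rule for an integer token can be sketched in Python: starting at a digit, the scanner keeps consuming characters until it reaches one outside the token's character set. The function name is illustrative:

```python
def scan_integer(source, pos):
    """Return (lexeme, next_pos) for the longest run of digits at `pos`,
    following the maximal munch rule."""
    start = pos
    while pos < len(source) and source[pos].isdigit():
        pos += 1                 # accept one more digit
    return source[start:pos], pos

lexeme, next_pos = scan_integer("1234+56", 0)
print(lexeme, next_pos)          # 1234 4 — scanning stops at the '+'
```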
Tokenizer

For example, a tokenizer might split the sentence "The quick brown fox
jumps over the lazy dog" into words and represent the result in XML:

<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>
Similarly, a tokenizer for a C-like language might convert the statement
net_worth_future = (assets - liabilities); into the following token
stream:

NAME "net_worth_future"
EQUALS
OPEN_PARENTHESIS
NAME "assets"
MINUS
NAME "liabilities"
CLOSE_PARENTHESIS
SEMICOLON
Regular expressions and the finite state machines they generate are not
powerful enough to handle recursive patterns, such as "n opening
parentheses, followed by a statement, followed by n closing parentheses."
They are not capable of keeping count, and verifying that n is the same on
both sides — unless you have a finite set of permissible values for n. It
takes a full-fledged parser to recognize such patterns in their full
generality. A parser can push parentheses on a stack and then try to pop
them off and see if the stack is empty at the end.
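The stack idea described above can be sketched in Python: push each opening parenthesis, pop on each closing one, and accept only if the stack ends empty. This counts an unbounded n, which a finite state machine cannot do:

```python
def balanced(statement):
    """Return True if parentheses in `statement` are properly balanced."""
    stack = []
    for ch in statement:
        if ch == "(":
            stack.append(ch)     # push each opening parenthesis
        elif ch == ")":
            if not stack:
                return False     # a close with no matching open
            stack.pop()
    return not stack             # balanced iff the stack ends empty

print(balanced("((x))"))   # True
print(balanced("((x)"))    # False — one open parenthesis is left unmatched
```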
The Lex programming tool and its compiler are designed to generate code
for fast lexical analysers based on a formal description of the lexical
syntax. Lex is not generally considered sufficient for applications with
a complicated set of lexical rules and severe performance requirements;
for instance, the GNU Compiler Collection uses hand-written lexers.