
Understand the PDF format [Luigi Micco homepage]

http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_...

Luigi Micco homepage


vbPDF, vbPDFParser, vbQRCode, vbDataMatrix,
vbPDF417, clsPDFCreator, vbGCalendar

Understand the PDF format


The purpose of these notes is to offer a brief analysis of a PDF document,
highlighting its structure and constituent elements, and then to address the
issues related to handling fonts, vector graphics and raster images.
For brevity, we will not cover data compression or the revision or update of a
previously created file, nor will we see how to create thumbnails or the
bookmark structure of the document.
The information contained in this section IS NOT NECESSARY to use the
library, but can be used for building new functions.

The basic elements


A PDF file is essentially a structured text file, that is, a sequence of characters
and line separators (ASCII 13 or ASCII 10) in which information takes on a
particular meaning according to the structure, each with its own syntax, in
which it is placed.
The basic elements are the objects (obj), which can contain, as needed, data
sequences (stream), dictionaries (dictionary) or other items. An object can
represent a page, an image, a graphic sequence, etc.
Every object, enclosed between the keywords obj and endobj, is identified by a
number and by a revision (the PDF specification calls it the generation number).
Since we will not consider changes to previously created files, all of our objects
will have revision number 0 (zero).


03/06/2015 02:31 AM

2 0 obj
.. .. .. .. ..
endobj

Objects do not necessarily have to appear in numerical order, and it is
possible to refer to a future object, that is, one not yet defined; this is
particularly useful, and perhaps essential, in some cases (for instance when
the length of a text must be stated in the file before the text itself has been
written). To refer to an object, it is enough to give its number and revision
followed by the letter R.
.. ..
/Parent 3 0 R
.. ..
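
As an illustration of the reference syntax, here is a minimal Python sketch that picks out every "number revision R" triple from a fragment of text. The names REF_PATTERN and find_references are invented for this example and belong to no library; a real parser would also have to skip strings and streams, which this sketch does not attempt.

```python
import re

# An indirect reference: object number, revision, then the letter R.
REF_PATTERN = re.compile(r"(\d+)\s+(\d+)\s+R\b")


def find_references(text):
    """Return (object_number, revision) pairs for every reference in text."""
    return [(int(num), int(rev)) for num, rev in REF_PATTERN.findall(text)]


print(find_references("/Parent 3 0 R /Kids [4 0 R 8 0 R 10 0 R]"))
```

Plain numbers that are not followed by R, such as the values of a /MediaBox array, are not reported.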

In general, whenever the same object, be it an image or a generic resource, has
to be used several times at different points of the document, it is worthwhile,
to optimize memory use and display speed, to create one object that contains
the resource and to use references to it at every point where it is needed.
Data sequences are enclosed between the keywords stream and endstream.
They can contain any sequence of characters (including non-printable ones)
and serve to describe a text, an image or other content.
stream
.. sequences of characters ..
endstream

Dictionaries are key/value pairs enclosed between the delimiters << and >>.
They are used to characterize particular objects by defining their attributes.
A value can be expressed as a numeric constant (the decimal separator is the
point, not the comma), an alphanumeric string (delimited by a pair of round
parentheses), a further dictionary, a reference to an object, or an array
(delimited by a pair of square brackets).
<< /Type /Page
/Parent 3 0 R
/Resources << /ProcSet 6 0 R >>
/MediaBox [0 0 612 792]
.. .. ..
>>
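
The value types listed above map naturally onto a small serializer. The following Python sketch is only an illustration: the classes Name and Ref and the function to_pdf are hypothetical helpers, not part of the library, and the rounding to three decimals is an assumption made for this example.

```python
class Name(str):
    """Marks a value that must be written as a name, e.g. /Page."""


class Ref:
    """A reference to object `number` with revision `revision`."""

    def __init__(self, number, revision=0):
        self.number = number
        self.revision = revision


def to_pdf(value):
    """Serialize a Python value with the dictionary syntax described above."""
    if isinstance(value, Name):          # must be tested before plain str
        return "/" + value
    if isinstance(value, Ref):
        return f"{value.number} {value.revision} R"
    if isinstance(value, dict):
        items = " ".join(f"/{key} {to_pdf(val)}" for key, val in value.items())
        return f"<< {items} >>"
    if isinstance(value, (list, tuple)):
        return "[" + " ".join(to_pdf(item) for item in value) + "]"
    if isinstance(value, str):
        # Strings sit between round parentheses; escape the special characters.
        escaped = value.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return f"({escaped})"
    if isinstance(value, float):
        # Decimal point, never a comma; trailing zeros are trimmed.
        return f"{value:.3f}".rstrip("0").rstrip(".")
    return str(value)


page = {"Type": Name("Page"), "Parent": Ref(3), "MediaBox": [0, 0, 595.2, 842]}
print(to_pdf(page))
```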

A PDF file is nothing more than a suitable sequence of objects, built according
to a fairly simple (case-sensitive) syntax, tied together by specific references,
and equipped with whatever the application that reads the file (for instance
Acrobat Reader) needs in order to know how to retrieve the information and in
what order.

Structure of a PDF file

The structure of a PDF file can be summarized in the following scheme


HEADER
BODY
CROSS-REFERENCE TABLE
TRAILER
The HEADER section contains information that the reader software uses to
identify the file type and the version of the PDF standard in use.
It consists of the first line of the file and has the form
%PDF-1.3

where the symbol % generally marks a comment line and 1.3 indicates that
Acrobat Reader 4.0 is needed to read the information in the file correctly
(1.4 would require Acrobat Reader 5.0, 1.5 Acrobat Reader 6.0, and so on).
From here on, we will assume that we always work with a format compatible
with Acrobat Reader 4.0, that is, with the %PDF-1.3 specification.
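
Reading the version back from the first line is a one-line pattern match. The function name pdf_version below is made up for this sketch.

```python
import re


def pdf_version(first_line):
    """Return (major, minor) from a header such as '%PDF-1.3', or None."""
    match = re.match(r"%PDF-(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(pdf_version("%PDF-1.3"))
```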
The BODY section contains the objects that will be rendered on the pages; we
will dwell on them later.
The CROSS-REFERENCE TABLE section is a table that holds a reference to
every object present in the BODY section and its revision, if any; in particular,
it gives the position of the first character of each object's definition relative to
the beginning of the file, and the revision number it refers to.
xref
0 23
0000000019 65535 f
0000000009 00000 n
.. .. ..
0000000300 00000 n
0000000384 00000 n
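
Each entry of the table is a fixed-width record: a 10-digit position, a 5-digit revision, and the letter n (object in use) or f (free). A hedged Python sketch of the formatting follows; xref_entry is a name chosen for this example.

```python
def xref_entry(offset, revision=0, in_use=True):
    """Format one entry: 10-digit offset, 5-digit revision, then n or f."""
    flag = "n" if in_use else "f"
    return f"{offset:010d} {revision:05d} {flag}"


print(xref_entry(300))
print(xref_entry(0, revision=65535, in_use=False))
```

When the entries are written to a file, the PDF specification requires each one to occupy exactly 20 bytes including its end-of-line, so the 18 characters produced here are usually followed by a space and a line feed.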

The TRAILER section tells the reader how many objects are present in the BODY
section (/Size), which is the initial object (/Root), which object contains the
general document information such as author, title and creation date (/Info),
and where to find the CROSS-REFERENCE TABLE; it also marks the end of the
file (%%EOF).
trailer
<< /Size 7
/Root 1 0 R
/Info 2 0 R
>>

startxref
408
%%EOF
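
Because the position of the CROSS-REFERENCE TABLE is written at the very end, a reader can locate it by searching backwards for the startxref keyword. The sketch below assumes the whole file is already in memory as bytes; find_startxref is a hypothetical helper name.

```python
def find_startxref(data):
    """Return the byte offset written after the last 'startxref' keyword."""
    keyword = b"startxref"
    position = data.rfind(keyword)
    if position < 0:
        raise ValueError("startxref keyword not found")
    # The offset is the first token after the keyword.
    tokens = data[position + len(keyword):].split()
    return int(tokens[0])


sample = b"trailer\n<< /Size 7 /Root 1 0 R /Info 2 0 R >>\nstartxref\n408\n%%EOF\n"
print(find_startxref(sample))
```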

Structure of a PDF document


The structure of a PDF document can be summarized in the following scheme

The CATALOG object represents the root of the whole document and must be
the one pointed to by the /Root reference in the TRAILER section.
1 0 obj
<< /Type /Catalog
/Pages 3 0 R
/Outlines 20 0 R
>>
endobj

In turn it contains a reference to the root of the pages (/Pages) and a reference
to the root of the tree that serves as an index (/Outlines), the one that, when a
document is opened with Acrobat Reader, usually appears to the left of the page
and lets you move quickly around the document; for simplicity we will not
analyze it.
The PAGES object represents the root of the pages; it gives the total number of
pages in the document (/Count) and holds a reference to the object that
contains each page (/Kids).
3 0 obj
<< /Type /Pages
/Count 3
/Kids [4 0 R 8 0 R 10 0 R]
>>
endobj

The PAGE object holds a reference to its own parent (/Parent), a list of the
resources used in the page (/Resources; we will see later what they are), an
array with the dimensions of the intended print format (/MediaBox) and finally
a reference to the object that contains the elements to be drawn on the page
(/Contents).
4 0 obj
<< /Type /Page
/Parent 3 0 R
/Resources << /ProcSet [/PDF /Text] >>
/MediaBox [0 0 595.2 842]
/Contents 5 0 R
>>
endobj

If information about the author, the application that produced the file or the
creation date is to be included in the document, the following object can be
used (mind the parentheses, which are part of the syntax). This information
appears when the document properties are displayed in the reader.
2 0 obj
<< /Title (title)
/Author (author)
/Creator (application_creator)
/Producer (copyright)
/CreationDate (D:yyyymmddhhmmss+0100)
>>
endobj

The system of coordinates


The default coordinate system of a PDF document uses the point as its unit of
measure, defined as 1/72 of an inch. The origin of the axes is at the bottom-left
corner. In this system, a normal A4 sheet (21 x 29.7 cm) measures 595.2 x 842
points. By default all images, unless an explicit translation/scaling/rotation is
applied, have dimension 1 x 1 and are placed with their bottom-left corner at
the point (0,0).
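
The A4 figures follow directly from the definition of the point. As a quick check, here is a small Python sketch of the conversion; the function names are chosen for this example only.

```python
POINTS_PER_INCH = 72.0
CM_PER_INCH = 2.54


def cm_to_points(cm):
    """Convert centimeters to PDF points (1 point = 1/72 inch)."""
    return cm / CM_PER_INCH * POINTS_PER_INCH


def points_to_cm(points):
    """Convert PDF points back to centimeters."""
    return points / POINTS_PER_INCH * CM_PER_INCH


# An A4 sheet is 21 x 29.7 cm; the exact values round to 595.28 x 841.89.
print(round(cm_to_points(21.0), 2), round(cm_to_points(29.7), 2))
```

The values 595.2 x 842 used in these notes are rounded versions of the same numbers.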

The operators
To define the elements contained in each page, the syntax of the PDF standard
provides a number of operators. Some are listed below; refer to the texts in the
bibliography for a more detailed description. In particular, keep in mind that
the graphic operators only describe a path, which is actually drawn only when
a dedicated operator is used, and that color components range from 0 to 1,
not from 0 to 255.

Operator              Description
X Y m                 Set the current cursor point at (X,Y)
X Y l                 Add a line ending at (X,Y) to the current path
X w                   Set the line width to X points
h                     Close a path
n                     End a path
f                     Fill a path
r g b RG              Set the color for stroking operations
r g b rg              Set the color for non-stroking operations
gray G                Set the gray level for stroking operations
gray g                Set the gray level for non-stroking operations
x1 y1 x2 y2 x3 y3 c   Add a Bezier curve to the path
x y width height re   Add a rectangle to the path
W 0 0 H X Y cm        Graphic transformation matrix
S                     Stroke the path
s                     Close and stroke the path
b                     Close, fill and stroke the path
BT                    Start a text sequence
ET                    End a text sequence
space Tc              Set the character spacing
space Tw              Set the word spacing
scale Tz              Set the horizontal text scaling, as a percentage
space TL              Set the line spacing
X Y Td                Set the insertion point for text
fontname size Tf      Set the font and the font size
string Tj             Draw a text
/name Do              Draw (play) the object called name

The paths
Paths are invisible outlines that become visible only after a suitable command.
Let us explain this concept with an example: in a normal graphics context, if I
want to draw a polyline, that is, a shape made up of several line segments, I
physically draw the first line, then the second, and so on. In a PDF document, I
instead prepare the first line (that is, I build an invisible path that describes it),
then the second and so on, and after the last line I issue an operator that
physically draws the whole polyline. This way of operating is due to the fact
that the path described may be used not only to draw a shape on the page, but
also to delimit (clip) a portion of the graphics that follow, to contain a text, etc.
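
The idea of "describe first, draw at the end" can be sketched in a few lines of Python. The function polyline is a hypothetical helper written for this example; it uses the m and l operators to build the path and a single final S (stroke) or s (close and stroke) to make it visible.

```python
def polyline(points, close=False):
    """Build the operators that describe, and finally stroke, a polyline."""
    if len(points) < 2:
        raise ValueError("a polyline needs at least two points")
    x0, y0 = points[0]
    lines = [f"{x0} {y0} m"]              # move to the starting point
    for x, y in points[1:]:
        lines.append(f"{x} {y} l")        # extend the invisible path
    lines.append("s" if close else "S")   # only this operator draws anything
    return "\n".join(lines)


print(polyline([(150, 250), (150, 350), (200, 350)]))
```

Until the last line is emitted, nothing appears on the page: the same path could just as well be ended with a fill or clipping operator instead.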

Some common elements - (Font and text)


One of the first objects we analyze is the object needed to describe a font. At
this stage we will see how to use one of the 14 default TYPE1 fonts of the PDF
standard, that is, fonts that the reader already knows and that need no
particular information (conversely, for other font types such as TrueType it is
necessary to supply detailed information, for instance the width in points of
every single character).
The 14 standard TYPE1 fonts are listed below under the Windows-style names
used in these notes; the PDF specification itself names the three families
Courier, Helvetica and Times-Roman:
CourierNew (.Italic, .Bold, .BoldItalic)
Arial (.Italic, .Bold, .BoldItalic)
TimesNewRoman (.Italic, .Bold, .BoldItalic)
ZapfDingbats
Symbol
To use one of them, it is necessary to create an object that contains it. In the
example that follows, object 7 contains the Arial font with the Bold and Italic
attributes, gives the font the name Fn1, and uses an ANSI charset.
7 0 obj
<< /Type /Font
/Subtype /Type1
/Name /Fn1
/BaseFont /Arial.BoldItalic
/Encoding /WinAnsiEncoding
>>
endobj

To write the classic Hello World at coordinates (100,400) with the previously
defined font named Fn1, it is enough to insert the following sequence in an
object
% Write Hello World with Arial Bold Italic 24 pts
BT
/Fn1 24 Tf
100 400 Td
(Hello World) Tj
ET
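
The same sequence can be generated for any text. In the Python sketch below, text_block and escape_pdf_string are names invented for this example; the escaping of backslashes and round parentheses is needed because the parentheses delimit the string.

```python
def escape_pdf_string(text):
    """Escape the characters that are special inside a (...) string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_block(font_name, size, x, y, text):
    """Return a BT .. ET sequence that writes text at (x, y)."""
    return "\n".join([
        "BT",
        f"/{font_name} {size} Tf",              # select font and size
        f"{x} {y} Td",                          # set the insertion point
        f"({escape_pdf_string(text)}) Tj",      # draw the string
        "ET",
    ])


print(text_block("Fn1", 24, 100, 400, "Hello World"))
```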

Some common elements - (Vector graphic)


If we want to insert some graphic elements, it will be enough to insert a
sequence like
% Draw a light blue filled rect, with red border
.5 .75 1 rg
1 0 0 RG
200 300 50 75 re
B
% Draw a line with width 2 points
2 w
150 250 m
150 350 l
S
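
Since color components go from 0 to 1, colors expressed in the usual 0 to 255 range must be scaled before being written. A minimal sketch follows; rgb_operator is a hypothetical helper, and rounding to three decimals is an assumption of this example.

```python
def rgb_operator(red, green, blue, stroking=False):
    """Scale 0-255 components to 0-1 and emit RG (stroking) or rg (non-stroking)."""
    components = (red, green, blue)
    if any(not 0 <= c <= 255 for c in components):
        raise ValueError("components must be in the range 0..255")
    values = " ".join(f"{c / 255:.3f}".rstrip("0").rstrip(".") for c in components)
    return f"{values} {'RG' if stroking else 'rg'}"


print(rgb_operator(255, 0, 0, stroking=True))   # border color
print(rgb_operator(128, 191, 255))              # fill color
```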

Some common elements - (Bitmap images)


Another common object is the one needed to represent BITMAP (BMP) images.
For images there are two options: create an object that can be recalled every
time the same image must be displayed, even at different sizes (this is the case
we will analyze), or define the image inside the page, with no possibility of
reusing it. For instance, if an image Img1 is defined, 10 x 5 pixels, 24-bit color
(8 bits x 3 RGB color components), using a hex representation, 150 elements
are needed, that is 10 x 5 x 3.
12 0 obj
<< /Type /XObject
/Subtype /Image
/Name /Img1
/Width 10
/Height 5
/BitsPerComponent 8
/ColorSpace /DeviceRGB
/Filter /ASCIIHexDecode
/Length 13 0 R
>>
stream
80 A1 2F .. .. %150 hex elements
endstream
endobj
13 0 obj
150
endobj

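Note that, in place of the stream length, a reference to an object (13 0) has
been used, which contains (only) the value 150. Strictly speaking, /Length
counts the bytes actually written between stream and endstream, so with a hex
representation, where each byte takes two characters, the real value would be
larger than 150.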
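
To make the relationship between pixel data and stream length concrete, here is a hedged Python sketch that hex-encodes raw RGB samples and builds an image object. The helpers ascii_hex_stream and image_object are invented for this example; unlike the object shown in these notes, the sketch writes /Length directly rather than through a separate object, and counts the encoded bytes (two hex characters per sample plus the closing > marker).

```python
def ascii_hex_stream(samples):
    """Encode raw bytes for /ASCIIHexDecode; '>' marks the end of the data."""
    return samples.hex().upper().encode("ascii") + b">"


def image_object(number, width, height, rgb_samples):
    """Build an RGB image object with 8 bits per component."""
    if len(rgb_samples) != width * height * 3:
        raise ValueError("expected width x height x 3 bytes")
    data = ascii_hex_stream(rgb_samples)
    header = (
        f"{number} 0 obj\n"
        f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height}\n"
        f"/BitsPerComponent 8 /ColorSpace /DeviceRGB\n"
        f"/Filter /ASCIIHexDecode /Length {len(data)} >>\n"
        "stream\n"
    ).encode("ascii")
    return header + data + b"\nendstream\nendobj\n"


black_image = image_object(12, 10, 5, bytes(150))   # 150 zero samples
print(black_image.decode("ascii"))
```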
If we want to display the image with its bottom-left corner at the point (100,
80), with a horizontal size of 200 and a vertical size of 300, it is enough to
insert the sequence
q                        % save the graphic state
200 0 0 300 100 80 cm    % graphic transformation matrix
                         % to translate/scale the image
/Img1 Do                 % draw the image Img1
Q                        % restore the graphic state
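
Since an image occupies a 1 x 1 square by default, the matrix only has to scale it to the wanted size and move it to the wanted corner. The Python sketch below generates the sequence for any position and size; place_image is a name made up for this example, and it handles only translation and scaling, not rotation.

```python
def place_image(name, x, y, width, height):
    """Draw XObject `name` scaled to width x height with its corner at (x, y)."""
    return "\n".join([
        "q",                                  # save the graphic state
        f"{width} 0 0 {height} {x} {y} cm",   # scale, then translate
        f"/{name} Do",                        # draw the object
        "Q",                                  # restore the graphic state
    ])


print(place_image("Img1", 100, 80, 200, 300))
```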

Likewise the sequence


.. ..
/BitsPerComponent 8
/ColorSpace /DeviceGray
/Length 50
>>
stream
$-#etT .. .. %50 bytes
endstream
endobj

defines a grayscale image (only 1 color component), represented by a
sequence of 50 bytes (10 x 5 x 1).

Some common elements - (The Form)


It often happens that portions of a document have to be reproduced several
times, for instance the page header or footer, or watermarks (those slanted
writings beneath the text, such as "draft"). The PDF standard provides a
particular kind of object, known as a form, which, let us make clear at once,
has nothing to do with Visual Basic forms. For instance, everything contained
in the stream of object 13, named Frm1,
13 0 obj
<< /Type /XObject
/Subtype /Form
/FormType 1
/Name /Frm1
/BBox [0 0 595.2 842]
/Matrix [1 0 0 1 0 0]
/Length 14 0 R >>
stream
.. .. .. .. ..
endstream
endobj

can be reproduced, anywhere, with the sequence


/Frm1 Do

The idea: the structure of the document made by the library

As we have seen, the main elements of a PDF document are the objects. We
therefore need a mechanism that lets us manage the writing of the objects and,
at the same time, the CROSS-REFERENCE TABLE section, in which, as already
said, a reference to each created object must be written. It must not be
forgotten that some objects would logically come before others, for example
the root of the pages /Pages, even though they contain references (to the
pages) that are known only after all the objects have been built. To solve these
problems, we chose a structure and an object numbering such that the
references are known in advance. In this way, the document is built so that the
/Pages object, and similarly the others, always has the same reference number,
even though it is physically written after the others. In other words, the typical
structure of the produced document is the following
1 0 obj Info
2 0 obj Catalog
3 0 obj Encoding
6 0 obj (available)
7 0 obj (available)
.. .. .. ..
.. .. .. .. ..
n-1 0 obj (available)
n 0 obj (available)
4 0 obj Pages
5 0 obj Resource

In this way the reference to the intermediate objects, for instance the /Parent
required in every /Page object, is always known, while the others can be
collected as the objects are created and inserted at the end in the definition of
objects 4 and 5.
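
The mechanism can be sketched in Python as a small writer that records the byte position of every object as it is written, whatever the order, and produces the CROSS-REFERENCE TABLE at the end. This is an illustration under stated assumptions, not the library's actual code: the class PdfWriter and its methods write_object and finish are hypothetical names, and the object contents are reduced to the minimum.

```python
class PdfWriter:
    """Collects objects in any order and remembers where each one starts."""

    def __init__(self):
        self.buffer = bytearray(b"%PDF-1.3\n")
        self.offsets = {}                      # object number -> byte offset

    def write_object(self, number, body):
        self.offsets[number] = len(self.buffer)
        self.buffer += f"{number} 0 obj\n{body}\nendobj\n".encode("latin-1")

    def finish(self, root, info):
        size = max(self.offsets) + 1           # entries 0 .. highest number
        missing = set(range(1, size)) - set(self.offsets)
        if missing:
            raise ValueError(f"objects never written: {sorted(missing)}")
        xref_position = len(self.buffer)
        lines = [f"xref\n0 {size}\n", "0000000000 65535 f \n"]
        lines += [f"{self.offsets[n]:010d} 00000 n \n" for n in range(1, size)]
        lines.append(f"trailer\n<< /Size {size} /Root {root} 0 R /Info {info} 0 R >>\n")
        lines.append(f"startxref\n{xref_position}\n%%EOF\n")
        self.buffer += "".join(lines).encode("latin-1")
        return bytes(self.buffer)


writer = PdfWriter()
writer.write_object(1, "<< /Title (demo) >>")                    # Info
writer.write_object(2, "<< /Type /Catalog /Pages 4 0 R >>")      # Catalog
writer.write_object(3, "<< /Type /Encoding >>")                  # Encoding
# The page already knows that its parent will be object 4.
writer.write_object(6, "<< /Type /Page /Parent 4 0 R /MediaBox [0 0 595.2 842] >>")
# Objects 4 and 5 are written last, once the list of pages is complete.
writer.write_object(4, "<< /Type /Pages /Count 1 /Kids [6 0 R] /Resources 5 0 R >>")
writer.write_object(5, "<< /ProcSet [/PDF /Text] >>")
pdf_bytes = writer.finish(root=2, info=1)
print(pdf_bytes.decode("latin-1"))
```

Fixing the numbers of /Pages and /Resources in advance is what lets every page name its parent before the parent exists, while the table of offsets takes care of the physical order.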

The object Resources


We have mentioned that every /Page object contains a reference to a
/Resources object, a dictionary that essentially describes the resources used in
the document, that is, the list of the fonts and Form objects used. For instance
4 0 obj
<< /Type /Pages
/Resources 5 0 R
.. ..
>>
endobj
5 0 obj
<< /Font <</Fnt1 6 0 R /Fnt2 7 0 R >>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
/XObject <</Img1 8 0 R /Frm1 9 0 R >>
>>
endobj

Object 5 0 specifies that the document uses two fonts (Fnt1 and Fnt2),
represented respectively by objects 6 0 and 7 0, an image object Img1
(object 8 0) and a form object (object 9 0). It also indicates that the document
contains standard operators (/PDF), text objects (/Text), grayscale images
(/ImageB), RGB color images (/ImageC) and indexed-palette images (/ImageI).

The units of measure


The unit of measure of the PDF standard is the point, 72 points per inch.
Obviously, with suitable conversions it is possible to use other units
(centimeters, millimeters and inches). As for decimal digits, the standard
provides for 3 digits on the absolute value in the standard unit of measure,
and it should not be necessary to go beyond that.

Vector graphics
Before using operators such as line, arc or others, always remember to set the
starting point of the path with the MoveTo operator. For paths made up of
several elementary lines, it is recommended to use the stroke, close or fill
options only on the last line.
en/doc/pdf_structure.txt Last modified: 19/03/2013 09:58 (external edit)
