
DB2 UDB for z/OS & OS/390 Character Conversion & Unicode Fundamentals
Chris Crone, Senior Software Engineer, IBM (cjc@us.ibm.com)

Agenda

Character Conversion Fundamentals
Unicode Fundamentals
Unicode Support in DB2 UDB for z/OS & OS/390 V7
Things to Look Out For
Example Scenarios
Summary

This is a basic survival lesson in character conversion and Unicode. What is a CCSID? What does DB2 do with it? Why do I care? What does Unicode do for me? If you care about data integrity and run on more than one operating system in a global business, this is the basic information you need to survive.

This is the agenda for a basic survival lesson in character conversion and Unicode: what is a CCSID, what does DB2 do with it, why do I care, and what does Unicode do for me? In a global economy, this is the basic information you need to survive. While these are the fundamentals, they are not simple, and there are many misconceptions. Just because you get some of the information back correctly does not guarantee that you are always getting the data without loss.


Terminology
For the purposes of this presentation:

UNICODE
UNICODE is a generic term that refers to the UTF-8 and UTF-16 encodings

ASCII
ASCII is a generic term that refers to all ASCII CCSIDs that DB2 currently supports

EBCDIC
EBCDIC is a generic term that refers to all EBCDIC CCSIDs that DB2 currently supports

CCSID
Coded Character Set Identifier - used by DB2 to tag string data

Character Conversion Fundamentals

When I use the term UNICODE, I mean the UTF-16 and UTF-8 encodings. See www.unicode.org for more. When I use the term ASCII, I mean any generic ASCII CCSID (like 850, 819, 437), whether Single Byte Character Set (SBCS), Mixed, or Double Byte Character Set (DBCS). When I use the term EBCDIC, I mean any generic EBCDIC CCSID (like 500, 37), whether SBCS, Mixed, or DBCS. CCSIDs are used by DB2 to tag string data. A CCSID precisely defines the encoding of the data.

When we store data in some EBCDIC CCSID and display it on a PC in some ASCII CCSID, it must be translated. There are many translations, and we're just getting started.


What is Character Conversion?

[Code page charts: CCSID 437 and CCSID 1252]

Conversion Methods
Native DB2
SYSIBM.SYSSTRINGS (V2.3)

ICONV
Uses LE base services (V6) - Non-Strategic
Requires OS/390 V2R9

OS/390 V2 R8/R9/R10 & z/OS support for Unicode (V7)


Conversion Services
Requires OS/390 V2R8 and above + APAR OW44581

Code and program directory: http://www6.software.ibm.com/dl/os390/unicodespt-p
Documentation: http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunpde00.pdf and http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunuge00.pdf
Information APARs II13048 and II03049

This slide depicts two common PC codepages. Note the differences between them. Codepage 1252 defines things like the Euro and the full Latin-1 character set with all the accented characters. Codepage 437 defines only a partial set of Latin-1 accented characters. Because the codepages do not contain the exact same set of characters, data cannot necessarily be converted from one to the other without the potential loss of data. Note also that characters are represented in different areas: look at the ae ligature characters at '91'x and '92'x in CCSID 437 versus 'E6'x and 'C6'x in CCSID 1252.

DB2 uses three methods for conversion. SYSSTRINGS - the conversion services introduced in DB2 V2R3 and the ones most people are familiar with. ICONV - introduced in DB2 V6 and requiring OS/390 V2R9 and above; this was our first attempt at leveraging OS/390 infrastructure to perform conversion. It is non-strategic because the OS/390 V2 R8/R9/R10 support for Unicode provides more functionality with better performance. OS/390 V2 R8/R9/R10 support for Unicode - starting in V7, DB2 leverages this service for most future character conversion support.


Native DB2

Based on SYSIBM.SYSSTRINGS
High performance
Added with DB2 V2R3
Support for Single byte, Mixed, and Double byte ASCII and EBCDIC
Uses a combination of 256 byte conversion tables and special two stage look up tables

OS/390 V2 R8/R9/R10 & z/OS support for Unicode

Central repository for the OS/390 system
High performance
Uses HW instructions available in z900 GA2 (see appendix for complete list of HW instructions)
Uses page fixed tables in a data space
Conversion image built by off-line utility
CUNMIUTL - see sample in hlq.SCUNJCL (CUNJIUTL)
Administered via OS/390 Console
SET UNI, DISPLAY UNI

The native conversion services that DB2 ships are based on support that was added in DB2 V2R3 They are high performance and rely on a cached copy of SYSIBM.SYSSTRINGS rows to perform conversion. A large number of ASCII and EBCDIC, Single byte, Mixed, and Double byte conversions are supported. These conversions use a combination of conversion tables contained in the TRANSTAB field of SYSSTRINGS and two stage conversion tables (for mixed and Double byte conversions) specified by the TRANSPROC field of SYSSTRINGS.

With the introduction of the OS/390 support for Unicode system service, OS/390 now has a central repository for conversion that can be used by applications, middleware, and subsystems. This service is designed to be high performance and utilizes new HW instructions and page fixed conversion tables to perform the conversions. This service uses a conversion image that is built by the off-line utility CUNMIUTL. A customer specifies which conversions will be supported by the conversion image. The conversion image is managed via the OS/390 console, not DB2. The SET UNI command specifies the image to be loaded, and the DISPLAY UNI command displays information about the currently loaded conversion image.


Conversion Services Example


//CUNMIUTL EXEC PGM=CUNMIUTL
//SYSPRINT DD SYSOUT=*
//TABIN    DD DISP=SHR,DSN=hlq.SCUNTBL
//SYSIMG   DD DSN=hlq.IMAGES(CUNIMG00),DISP=SHR
//SYSIN    DD *
/********************************************
 * INPUT STATEMENTS FOR THE IMAGE GENERATOR *
 ********************************************/
CONVERSION 00850,01047,ER;  /*ASCII        -> EBCDIC       */
CONVERSION 01047,00850,ER;  /*EBCDIC       -> ASCII        */
CONVERSION 00037,1200,ER;   /*EBCDIC 037   -> UCS-2        */
CONVERSION 1200,00037,ER;   /*UCS-2        -> EBCDIC 037   */
CONVERSION 00500,1200,ER;   /*Latin-1 EBC  -> UCS-2        */
CONVERSION 1200,00500,ER;   /*UCS-2        -> Latin-1 EBC  */
CONVERSION 01047,1200,ER;   /*EBCDIC 1047  -> UCS-2        */
CONVERSION 1200,01047,ER;   /*UCS-2        -> EBCDIC 1047  */
CONVERSION 01208,1200,ER;   /*Unicode CCSID-> UCS-2        */
CONVERSION 1200,01208,ER;   /*UCS-2        -> Unicode CCSID*/
CONVERSION 01383,1200,ER;   /*Simp Chinese -> UCS-2        */
CONVERSION 1200,01383,ER;   /*UCS-2        -> Simp Chinese */
CONVERSION 00932,1200,ER;   /*Jpn MCCSID   -> UCS-2        */
CONVERSION 1200,00932,ER;   /*UCS-2        -> Jpn MCCSID   */
CONVERSION 00939,1200,ER;   /*Jpn-ExtEng   -> UCS-2        */
CONVERSION 1200,00939,ER;   /*UCS-2        -> Jpn-ExtEng   */
CONVERSION 00300,1200,ER;   /*Jpn GCCSID   -> UCS-2        */
CONVERSION 1200,00300,ER;   /*UCS-2        -> Jpn GCCSID   */
CONVERSION 00500,00850,ER;  /*Latin-1 EBC  -> ASCII        */
CONVERSION 00850,00500,ER;  /*ASCII        -> Latin-1 EBC  */
/*

Conversion Services Configuration


Which conversions should be configured?
CCSID 367 (7-bit ASCII) <-> ASCII & EBCDIC system CCSID(s)
CCSID 1208 (UTF-8) <-> ASCII & EBCDIC system CCSID(s)
CCSID 1200 (UTF-16) <-> ASCII & EBCDIC system CCSID(s)
Client CCSID(s) <-> Unicode CCSIDs (367, 1208, 1200)
Additional ASCII or EBCDIC conversions - starting with V7, most new code conversion support will be via conversion services; native DB2 conversions will continue to be supported and used, but in most cases not enhanced
Other conversions needed to LOAD/UNLOAD data
Conversions needed to support the application encoding bind option, DECLARE VARIABLE, or CCSID overrides

Here's an example of the CUNMIUTL image-generator input (sample job CUNJIUTL). Note the specification of ER (enforced subset, round trip) after each CCSID pair; the ER specification is required by DB2.

Which conversions should be configured? All DB2 conversions involving Unicode are performed via the OS/390 conversion services, so any conversion involving Unicode must be configured. ASCII and EBCDIC conversions not supported by the native DB2 conversion methods (that is, conversions not defined in SYSIBM.SYSSTRINGS) must also be configured, because they too are performed via the conversion services.


Conversion Services Example


14.34.14 d uni,all
14.34.15 CUN3000I 14.34.14 UNI DISPLAY 097
ENVIRONMENT: CREATED  12/11/2000 AT 09.13.53
             MODIFIED 12/11/2000 AT 09.13.53
             IMAGE CREATED 12/06/2000 AT 17.10.01
SERVICE:     CUNMCNV CUNMCASE
STORAGE:     ACTIVE 50 PAGES
             LIMIT 524287 PAGES
CASECONV:    NONE
CONVERSION:  00500-00367-ER 00500-01208-ER 00500-01200(13488)-ER
             00367-00500-ER 00367-01208-ER 00367-01200(13488)-ER
             01208-00500-ER 01208-00367-ER 01208-01200-ER
             01200(13488)-00500-ER 01200(13488)-00367-ER 01200-01208-ER

Round Trip - VS - Enforced Subset


Round Trip (RT) Conversions
Designed to preserve codepoints that are not representable in both codepages

Enforced Subset (ES) Conversions


Codepoints that are not representable are converted to SUB character

DB2 Uses a combination of RT and ES conversions


Trend is toward ES conversions; RT conversions continue to be used in some cases for compatibility reasons

Here's an example of output from a DISPLAY UNI command. Note that the time the image was created is displayed. Also displayed is the number of pages used by the image; this is important because these pages are page fixed, so they take up dedicated memory on the machine. Finally, the list of conversions supported by this image is displayed. Note that for CCSID 1200 conversions, the base CCSID that the conversion was created from is also displayed in parentheses.

DB2 conversions are either Round Trip or Enforced Subset. Round Trip conversions attempt to avoid loss of data by mapping unrepresented codepoints to unused (or unlikely to be used) codepoints; data loss can be avoided or delayed, but strange conversions can result. Enforced Subset conversions map any unrepresented codepoints to the substitution character, so data loss occurs immediately.


Expanding and Contracting Conversions

Conversions can cause the length of a string to change

Expanding Conversions
When data converted from one CCSID to another expands
For example - 'C5'x in CCSID 819 -> 'C385'x in CCSID 1208

Contracting Conversions
When data converted from one CCSID to another contracts
For example - '00C5'x in CCSID 1200 -> 'C5'x in CCSID 819

What are CCSIDs used for?

DB2 uses CCSIDs to describe data stored in the DB2 subsystem
DB2 supports specification of CCSIDs at a subsystem level
With V7, DB2 supports 3 encoding schemes: ASCII, EBCDIC, UNICODE
Data is comparable only within a single encoding scheme

There are some cases where a conversion causes the length of the data to change: expanding conversions cause the length of the data to grow, and contracting conversions cause the length of the data to shrink.

So now that we know all about CCSIDs, what are they used for? DB2 uses CCSIDs just like it uses data type and length; they are part of the metadata that describes the data being stored in DB2. In V7, DB2 supports specification of three sets of CCSIDs. These three sets of CCSIDs represent the three encoding schemes (ASCII, EBCDIC, and Unicode) that DB2 supports. DB2 supports the specification of these CCSIDs at the subsystem level. Once these values have been specified, they should not be changed.


Installation
Specification of CCSIDs is performed at installation via install Panel DSNTIPF
DSNTIPF                 INSTALL DB2 - APPLICATION PROGRAMMING DEFAULTS PANEL 1
===>_
Enter data below:
 1 LANGUAGE DEFAULT     ===> IBMCOB   ASM,C,CPP,COBOL,COB2,IBMCOB,FORTRAN,PLI
 2 DECIMAL POINT IS     ===> .        . or ,
 3 MINIMUM DIVIDE SCALE ===> NO       NO or YES for a minimum of 3 digits
                                      to right of decimal after division
 4 STRING DELIMITER     ===> DEFAULT  DEFAULT, " or ' (COBOL or COB2 only)
 5 SQL STRING DELIMITER ===> DEFAULT  DEFAULT, " or '
 6 DIST SQL STR DELIMTR ===> '        ' or "
 7 MIXED DATA           ===> NO       NO or YES for mixed DBCS data
 8 EBCDIC CCSID         ===> 0        CCSID of your SBCS or MIXED DATA
 9 ASCII CCSID          ===> 0        CCSID of SBCS or mixed data
10 UNICODE CCSID        ===> 1208     CCSID of Unicode UTF-8 data
11 DEF ENCODING SCHEME  ===> EBCDIC   EBCDIC, ASCII, or UNICODE
12 LOCALE LC_CTYPE      ===>
13 APPLICATION ENCODING ===> EBCDIC   EBCDIC, ASCII, UNICODE, ccsid (1-65533)
14 DECIMAL ARITHMETIC   ===> DEC15    DEC15, DEC31, 15, 31
15 USE FOR DYNAMICRULES ===> YES      YES or NO
16 DESCRIBE FOR STATIC  ===> NO       Allow DESCRIBE for STATIC SQL. NO or YES.

Installation (continued)
Information from DSNTIPF ends up in Job DSNTIJUZ

Mixed System
DSNHDECM ASCCSID=1088, AMCCSID=949, AGCCSID=951, SCCSID=833, MCCSID=933, GCCSID=834, USCCSID=367, UMCCSID=1208, UGCCSID=1200, ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=YES

Non-Mixed System
DSNHDECM ASCCSID=819, AMCCSID=65534, AGCCSID=65534, SCCSID=37, MCCSID=65534, GCCSID=65534, USCCSID=367, UMCCSID=1208, UGCCSID=1200, ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=NO

END

END

Install panel DSNTIPF is used to specify CCSID information. Options 8, 9, and 10 are where the CCSIDs for the three encoding schemes are specified. Notice that the ASCII and EBCDIC CCSIDs are initialized to 0 and the Unicode CCSID is initialized to 1208. The ASCII and EBCDIC CCSIDs are not pre-filled; these values need to be set by the customer. The EBCDIC CCSID should be set to the CCSID that the customer's 3270 emulators, CICS, and IMS transactions use. The ASCII value should be set to the CCSID that is most commonly used by workstations in the customer shop (1252 for example). The Unicode value is pre-filled with 1208 and cannot be changed; this value specifies the mixed CCSID for Unicode tables. Other things to note on this panel: Option 11 specifies the default encoding scheme for objects created in the DB2 subsystem. Option 13 specifies the default application encoding; changing this value should be done with great care.

The information from panel DSNTIPF flows to Job DSNTIJUZ. In the case on the Left, we have a Mixed = Yes system that is set up to support Korea. The ASCII and EBCDIC system CCSIDs that actually would have been specified on panel DSNTIPF, to result in this specification, would have been 949 and 833. For mixed systems, and for the Unicode CCSID, the Mixed CCSID is specified on install panel DSNTIPF and DB2 will pick the corresponding Single byte and Graphic (Double Byte) CCSIDs. In the case on the Right, we have a Mixed = No system that is set up to support US English. Note that the user specified 819 and 37 for the ASCII and EBCDIC Single byte CCSIDs and that DB2 used the value 65534 for the ASCII and EBCDIC Mixed and Graphic (Double byte) CCSIDs. 65534 is a reserved value that means no CCSID. Also note that the Default Encoding and Default Application Encoding also flow to this job. Note there is a bug in DSNTIJUZ and DSNHDECM - These ship with CCSID 500 as default.


Where Is Encoding Information Stored?

CCSIDs are stored in the following places:
SYSIBM.SYSDATABASE
SYSIBM.SYSTABLESPACE
SYSIBM.SYSVTREE
SYSIBM.SYSPLAN (V7)
SYSIBM.SYSPACKAGE (V7)
Plans and Packages (SCT02 and SPT01) in the Directory (DSNDB01) (V5)
DECP

In the ENCODING_SCHEME column (stored as 'A', 'E', 'U', or blank, the default) of:
SYSIBM.SYSDATATYPES
SYSIBM.SYSDATABASE
SYSIBM.SYSPARMS
SYSIBM.SYSTABLESPACE
SYSIBM.SYSTABLES

DSNTIPF Mixed Data Option

Mixed = No systems have support for:
SBCS Data - pure single byte data
Mixed Data - Unicode UTF-8 MBCS (1-4 bytes/char) data; no support for ASCII/EBCDIC mixed data
Graphic Data - Unicode UTF-16 (2 or 4 bytes/char) data; no support for ASCII/EBCDIC DBCS data

Mixed = Yes systems have support for:
SBCS Data - pure single byte data
Mixed Data - single & double byte data in a single string
Graphic Data - pure double byte data

Once we've specified CCSIDs for our system, what does DB2 do with them? DB2 stores CCSIDs in the catalog, the directory, in bound statements, and of course the DECP. The value stored in these places depends on what release of DB2 was used to create the object and the value in the DECP at the time the object was created. If a value is 0, the object is assumed to be EBCDIC. In general, DB2 does not support changing a CCSID once it is specified in a DECP. The exceptions are changing from 0 to a valid value, and changing from a CCSID that does not support the Euro symbol to a CCSID that does (37 -> 1140 for instance). Note that this sort of change requires special, disruptive steps and should be undertaken only after the documentation has been read and the process is thoroughly understood. Encoding information is also stored in some catalog tables.

Mixed = Yes systems are used in the Far East, primarily China, Japan, and Korea. They offer support for SBCS, Mixed, and DBCS ASCII and EBCDIC data. Mixed = No systems are used elsewhere in the world and only have support for SBCS ASCII and EBCDIC data. Unicode data is always considered mixed regardless of the Mixed = Yes/No setting of the system. Creation of FOR MIXED DATA and GRAPHIC columns in EBCDIC tables was allowed on Mixed = No systems prior to DB2 V7; as of DB2 V7, Mixed = No systems only allow these types of columns in Unicode tables.


Mixed Data

SBCS Data
SBCS can be compared to mixed without conversion to mixed because it is a subset of the mixed repertoire. This is true for ASCII, EBCDIC, and Unicode

Mixed Data
Capable of representing SBCS and MBCS data
EBCDIC - SO and SI ('0E'x, '0F'x) delineate DBCS data, e.g. 'AB' plus a double-byte 'A' -> 'C1C20E42C10F'x
ASCII - uses first byte code points; if the first byte is within a certain range, say 'A0'x - 'AF'x, then it is the first byte of a DBCS character. For example, 'A055'x would be a DBCS character. Some CCSIDs have several first byte code point ranges
UTF-8 - uses the high order bit to indicate MBCS data. For example, 'EFBC91'x is a three byte UTF-8 character

Graphic Data
ASCII or EBCDIC - DBCS characters; no shift characters or first byte code points needed
Unicode - DBCS characters; surrogates take two DBCS characters

When Does Conversion Occur?

Local
Generally, conversion does not occur for local applications, except:
When dealing with ASCII/Unicode tables
When specified by the application via a CCSID override in the SQLDA (V2.3 & above), DECLARE VARIABLE (V7), the application encoding bind option (V7), or the current application encoding special register (V7)

Remote
Automatically when needed
DRDA Receiver Makes Right

Mixed = Yes systems use a CCSID triplet; that is to say, there is an SBCS CCSID, a Mixed CCSID, and a DBCS CCSID. On these systems, the SBCS CCSID is a subset of the Mixed CCSID, so SBCS and Mixed data can be compared without converting. SBCS columns are created by specifying the FOR SBCS DATA clause on CREATE, for example: CREATE TABLE T1 (C1 CHAR(10) FOR SBCS DATA); Mixed columns are the default on Mixed = Yes systems and can be explicitly specified by using the FOR MIXED DATA clause on CREATE. DBCS data is stored in Graphic columns.

There are no hard and fast rules as to when a conversion occurs. The short answer is that conversion occurs when necessary. Some of the cases when we do conversion are listed here.


Why Unicode?
Unicode is a single character set that encodes all of the world's scripts (sort of). The Unicode standard provides a cross-platform, cross-vendor method of encoding data that enables lossless representation and manipulation

Before Unicode

Many standards: ANSI, JIS, TISI
Provided by various vendors:
IBM - ASCII (pSeries, xSeries) and EBCDIC (zSeries and iSeries)
HP
Microsoft

Unicode Fundamentals

Sort of, because new characters are being added all the time, so at any given time an implementation of the standard is somewhat behind. Sort of, also, because there is one standard that contains several encoding forms. Prior to Unicode, many different standards and vendor implementations existed. Unicode attempts to standardize the representation and manipulation of data across vendors and platforms.


Unicode Fundamentals
Four forms of Unicode
UTF-8 - Unicode Transformation Format in 8 bits
UCS-2 - Universal Character Set coded in 2 octets
UTF-16 - Unicode Transformation Format in 16 bits
UTF-32 - Unicode Transformation Format in 32 bits, introduced with Unicode Technical Report #19 to replace UCS-4

UTF-8 (CCSID 1208)


ASCII Safe UNICODE (maps to 7-Bit ASCII)
Bytes '00'x - '7F'x = 7-Bit ASCII

Bytes '00'x - '7F'x are represented by single byte chars; chars '80'x and above are encoded as multi-byte (2-6 byte) sequences
Most characters take 2-3 bytes
Most Japanese, Chinese, and Korean characters take 3 bytes Most Extended Latin characters take 2 bytes

Surrogates take 4 bytes

There are currently four forms of Unicode promoted by the Unicode standards organization: UTF-8, UCS-2, UTF-16, and UTF-32. UTF-16 is the preferred format (according to UTR #19); UCS-2 is the precursor to UTF-16.

UTF-8 is represented by CCSID 1208. This is a growing CCSID, meaning that as characters are added to the Unicode standard, they will be added to this CCSID. UTF-8 is also commonly called ASCII-safe Unicode. The first 128 code points ('00'x - '7F'x) are the same as CCSID 367, which is a 7-bit ASCII CCSID. Other characters are represented as multi-byte (2-4 byte) sequences. One nice feature of UTF-8 is that since it is a byte-oriented encoding, it does not have any big endian/little endian issues.
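As a worked example of the encoding rule above, consider Å (U+00C5), which also appears later on the Character Examples slide. U+00C5 is 11 bits of code point: 000 1100 0101. A two-byte UTF-8 sequence has the form 110xxxxx 10xxxxxx, so the bits split as 00011 and 000101, giving '11000011 10000101'b = 'C385'x. A 7-bit code point such as 'A' (U+0041) keeps the single-byte form 0xxxxxxx and stays '41'x, which is why UTF-8 is called ASCII-safe.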


UCS-2 (CCSID 13488, 17584)

Basic Multilingual Plane - BMP (Plane 0)
Pure double byte characters
64K characters in repertoire
'0000'x - '00FF'x represent 8-bit ASCII (Latin-1)
'00'x is prefixed to the 8-bit ASCII characters
'0100'x - 'FFFF'x represent additional characters, allocated in blocks
Greek -> '0370'x - '03FF'x
Cyrillic -> '0400'x - '04FF'x
...

UTF-16 (CCSID 1200)

UCS-2 with surrogate support
Uses two two-byte characters to represent additional characters
~1 Million characters in repertoire
Planes 1-16 (16 additional planes):
Supplementary Multilingual Plane (SMP) - Plane 1, U+10000..U+1FFFF
Supplementary Ideographic Plane (SIP) - Plane 2, U+20000..U+2FFFF
Supplementary Special Purpose Plane (SSP) - Plane 14, U+E0000..U+EFFFF
Planes 15 and 16 are reserved for private use

UCS-2 is represented by CCSIDs 13488 and 17584. 13488 corresponds to Unicode Version 2, and 17584 corresponds to Unicode Version 3. When people say Unicode, without qualifying the encoding format, this is usually what they mean. Other characters are allocated in blocks (there's a block for Greek chars, a block for Cyrillic chars....)

UTF-16 is represented by CCSID 1200, which is also a growing CCSID. UTF-16 is a superset of UCS-2 and uses reserved sections of the BMP (the surrogate range) to map an additional 16 planes. Version 3.1 of the Unicode standard defines the first characters in the surrogate area.
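As a worked example of a surrogate pair, take U+200D0, which also appears later on the Character Examples slide. Subtract '10000'x to get '100D0'x (20 bits). The high 10 bits ('040'x) are added to 'D800'x, giving the high surrogate 'D840'x; the low 10 bits ('0D0'x) are added to 'DC00'x, giving the low surrogate 'DCD0'x. So U+200D0 is encoded in UTF-16 as 'D840DCD0'x - two two-byte characters.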


UTF-32

Each character is 4 bytes
Range is restricted to values '00000000'x - '0010FFFF'x
Represents the same repertoire as UTF-16
UCS-4 is implemented by Sun Solaris and HP/UX as the base Unicode data type; the XPG/4 standard requires a fixed width character format
zSeries and pSeries are looking at UTF-32 implementations to support surrogate characters in C/C++ applications

Endianness

Big Endian
pSeries, zSeries, iSeries, Sun, HP
Most significant byte is leftmost
For a 4 byte word - byte order 0,1,2,3

Little Endian
Intel based machines including xSeries
Least significant byte is leftmost
For a 4 byte word - byte order 3,2,1,0

UTF-8 is not affected by endianness issues; UTF-16 and UTF-32 are affected by endianness issues
Big Endian 'A' = x'0041' for UTF-16 or x'00000041' for UTF-32
Little Endian 'A' = x'4100' for UTF-16 or x'41000000' for UTF-32

Note: A byte is always ordered from leftmost most significant bit to rightmost least significant bit. Bit order within a byte is always 7,6,5,4,3,2,1,0

For completeness, I'm mentioning UTF-32 and UCS-4. DB2 is not implementing any support for these implementations of the Unicode standard, although other vendors have.

DB2 and DRDA manipulate and store data in Big Endian format. Little Endian clients convert data to Big Endian before putting on the wire.


Character Examples
A, a, 9, Å (the character A with ring accent), and two ideographic characters (U+9860 and U+200D0)
ASCII
'41'x, '61'x, '39'x, 'C5'x, 'CDDB'x (CCSID 939) for U+9860, N/A for U+200D0

UTF-8
'41'x, '61'x, '39'x, 'C385'x, 'E9A1A0'x, 'F0A08390'x
Note: 'C5'x (Å) becomes the two-byte sequence 'C385'x in UTF-8

UTF-16 (Big Endian format)


'0041'x, '0061'x, '0039'x, '00C5'x, '9860'x, 'D840DCD0'x

UTF-32 (Big Endian format)

'00000041'x, '00000061'x, '00000039'x, '000000C5'x, '00009860'x, '000200D0'x

Note: UCS-2/UTF-16 and UCS-4/UTF-32 use a technique called zero extension for ASCII characters

Now that we know all about Unicode, here are some examples of what this stuff looks like. Note that A-ring takes two bytes to represent in UTF-8, and that other characters can take three or four bytes to represent.

DB2 for z/OS & OS/390 V7 Enhancements for Unicode


Requirement
Enable Unicode on DB2 UDB for OS/390 and z/OS
Support Vendors implementing Unicode applications Support needs of Multinational Companies Support data from more than one country/language in one DB2 subsystem

Solution
Allow UNICODE to be specified as the Encoding Scheme (ES) at the:
System level - UNICODE CCSIDs (Install), similar to ASCII/EBCDIC system CCSIDs
Database level - create database mydb ccsid unicode
Table space level - create tablespace myts in mydb ccsid unicode
Table level - create table t1 (c1 char(10)) ccsid unicode
Other - create procedure mysp (in in_parm char(10) ccsid unicode) ...

For V7, our challenge was to enable Unicode data storage in DB2 without regressing function or performance for our ASCII and EBCDIC customers. We wanted to meet the needs of ERP and CRM vendors, as well as customer-written applications that need to store multinational data.

The Unicode support we have added with DB2 V7 is similar to the support we have for ASCII, which was added in V5. This support allows ASCII, EBCDIC, and Unicode objects to coexist in a single DB2 subsystem. Specification of the encoding scheme is made at the system level as well as the object level.


Storage
Storage of Unicode data:
Char/VarChar/CLOB FOR SBCS DATA - (7-bit) ASCII, a subset of UTF-8 - CCSID 367
Char/VarChar/CLOB FOR MIXED DATA - UTF-8 - CCSID 1208
Graphic/VarGraphic/DBCLOB - UTF-16 - CCSID 1200

Parsing

Parsing will be in EBCDIC


Statement text is converted to the EBCDIC system CCSID from ASCII, EBCDIC, or UNICODE
Need to ensure that literal values are convertible to the system EBCDIC CCSID
If substitution occurs in statement text being converted to EBCDIC, SQLCODE +335 is issued
Use host variables or parameter markers where conversion to the system EBCDIC CCSID is an issue

Data stored in Unicode tables in DB2 will be stored in one of the following CCSIDs: 367, 1208, 1200.
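A minimal DDL sketch (the table and column names are hypothetical, not from the presentation) showing how the column subtypes of a Unicode table map to those three CCSIDs:

CREATE TABLE DEMO.T_UNI                 -- hypothetical Unicode table
  (C_SBCS  CHAR(10)     FOR SBCS DATA,  -- 7-bit ASCII subset of UTF-8, CCSID 367
   C_UTF8  VARCHAR(100) FOR MIXED DATA, -- UTF-8, CCSID 1208
   C_UTF16 VARGRAPHIC(50))              -- UTF-16, CCSID 1200
  CCSID UNICODE;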

Parsing for DB2 V7 is in EBCDIC. This means that all statements sent to DB2 will be converted to the system EBCDIC CCSID and then parsed. Since statements are converted to EBCDIC, there is a possibility of data loss during that conversion. Since all DB2 keywords, such as SELECT, are representable on all EBCDIC code pages, there shouldn't be a problem with the statement text itself, but literals contained in the statement are subject to data loss. SQLCODE +335 has been added to alert the user to this sort of data loss.
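A hedged sketch of the workaround mentioned above (statement and host variable names are illustrative; DEMO.T_UNI is the hypothetical table from the earlier sketch): keep the statement text free of problem characters and supply the value at execution time.

-- :HV_STMT contains 'INSERT INTO DEMO.T_UNI (C_UTF8) VALUES(?)'
-- The text holds only SQL keywords and a parameter marker, so its conversion
-- to the system EBCDIC CCSID cannot substitute characters in the data value.
EXEC SQL PREPARE S1 FROM :HV_STMT;
EXEC SQL EXECUTE S1 USING :HV_DATA;   -- value passed in the host variable's own CCSID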


Catalog
The catalog will be encoded in the default EBCDIC CCSID
Object names will need to be convertible to the default EBCDIC CCSID:
Database name
Table space/Index space name
External names (UDF, SP, Exits, Fieldproc...)
Identifiers may not be enterable from all clients, so they should really be limited to the common subset that is representable on all clients.

Unicode Literals
Literals
UTF-8 literals (char/varchar/clob) conform to the normal rules for character strings
INSERT INTO T1 (C1) VALUES('123');
UTF-16 literals (graphic/vargraphic/dbclob) may be specified as UTF-8 (character) literals or as graphic literals
INSERT INTO T1 (C1) VALUES('123');
INSERT INTO T1 (C1) VALUES(G'...');   --Unicode graphic literal (the original example contained U+BBC0 U+BBC1)

The DB2 V7 catalog will remain EBCDIC, so names stored in the catalog must also be convertible to EBCDIC without loss Since the DB2 catalog contains data stored in the system EBCDIC default CCSID, it is possible to have things like Japanese, or Cyrillic names for things like columns (depending on your system CCSID). However, since all clients may not be capable of producing these characters, users should really be limited to the subset of Latin-1 characters. Note: To store Japanese, Chinese, or Korean in the DB2 catalog, a MIXED=YES system is needed and the data being stored in the catalog must conform to the rules for well formed mixed data.

Specification of literals for Unicode tables is essentially the same as for ASCII or EBCDIC tables. The one thing to note here is that character literals may be specified for UTF-16 columns. These character literals may be used any place a graphic literal would normally be used, for instance in the VALUES clause of an INSERT statement.
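A small illustrative sketch of that point, assuming T1.C1 is a VARGRAPHIC (UTF-16) column in a Unicode table: the same character literal can be used both in the VALUES clause and in a predicate, and DB2 converts it to UTF-16 as needed.

INSERT INTO T1 (C1) VALUES('123');      -- character literal stored into a UTF-16 column
SELECT C1 FROM T1 WHERE C1 = '123';     -- character literal compared with a UTF-16 column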


Host Variables and Parameter Markers

ASCII/EBCDIC/UNICODE -> UNICODE
Char or Graphic -> UTF-8 or UTF-16

UNICODE -> ASCII/EBCDIC/UNICODE
UTF-8 or UTF-16 -> Char or Graphic

UTF-8 <-> UTF-16
Applications don't need to change just because the back end data store changes

Declare Variable

DECLARE VARIABLE statement - new way to allow a CCSID to be specified for host variables
Example:
EXEC SQL DECLARE :HV1 CCSID UNICODE;
EXEC SQL DECLARE :HV2 CCSID 37;
Precompiler directive to treat the host variable as a specific CCSID
Useful for PREPARE/EXECUTE IMMEDIATE statement text
EXEC SQL PREPARE S1 FROM :HV2;
May be used with any character host variable on input or output

When dealing with Unicode tables, we have torn down the barrier between CHAR and GRAPHIC. This means your back end data store can be either UTF-8 or UTF-16, and you can use ASCII, EBCDIC, or Unicode character or graphic host variables; DB2 will perform the necessary conversions to/from the CCSID of the host variable even if the host variable doesn't match the column type. (For ASCII and EBCDIC back end data stores, char and graphic are in most cases incompatible.)

The new DECLARE VARIABLE statement can be used to specify the CCSID of a particular host variable. This is a precompiler directive that causes the precompiler to specify the CCSID of the host variable in any SQLDA that the precompiler generates to reference the host variable. This directive works for both input and output host variables.
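Putting those pieces together, a hedged sketch of the PREPARE/EXECUTE use case (host variable names are illustrative, and the DECLARE forms follow the slide's own examples):

EXEC SQL DECLARE :HV_STMT CCSID UNICODE;   -- statement text held in a Unicode host variable
EXEC SQL DECLARE :HV_NAME CCSID 37;        -- data value held in an EBCDIC CCSID 37 variable

EXEC SQL PREPARE S1 FROM :HV_STMT;         -- precompiler tags :HV_STMT as Unicode in the SQLDA
EXEC SQL EXECUTE S1 USING :HV_NAME;        -- DB2 converts :HV_NAME to the target CCSID as needed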


Application Encoding

New Application Encoding Scheme

System Default
Determines the encoding scheme when none is explicitly specified

Bind Option
Allows explicit specification of ES at an application level
Affects static SQL - provides the default for dynamic SQL
System default is used if the bind option is not specified

Special Register
Allows explicit specification of ES at the application level
Affects dynamic SQL
Initialized with the bind option
The option is ignored when packages are executed remotely - DRDA specifies the input CCSID, and data flows as-is to the client

Application Encoding (continued)

Example: assume package MY_PACK is bound with APPLICATION ENCODING(UNICODE)
All char input/output host variables for static statements are assumed to be in CCSID 1208
All graphic input/output host variables for static statements are assumed to be in CCSID 1200
Initial value for the application encoding special register will be 1208
The DECLARE VARIABLE statement or CCSID overrides can be used to override the bind option or special register

Also new in DB2 V7 is the specification of the Application Encoding Scheme. A default application encoding can be specified; it is preset to EBCDIC. The application encoding scheme can also be specified on BIND PLAN or BIND PACKAGE; if not specified, the system default value is used for the bind option. Plans and packages bound prior to V7 are assumed to be EBCDIC. The bind option applies to static SQL. The Application Encoding Scheme special register can be used to affect dynamic SQL; its initial value is the value of the bind option.

In this example, package MY_PACK is bound with APPLICATION ENCODING(UNICODE). Character host variables will be treated as CCSID 1208, and graphic host variables as CCSID 1200. The initial value of the APPLICATION ENCODING special register will be 1208. The DECLARE VARIABLE statement can be used to override the bind option/special register for host variables. For statements that use a descriptor, as in FETCH USING DESCRIPTOR, CCSID overrides can be coded by hand in the SQLDA.
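For dynamic SQL, the special register can be set directly. A minimal hedged sketch (I am assuming the register's full name, CURRENT APPLICATION ENCODING SCHEME, from the DB2 V7 documentation):

SET CURRENT APPLICATION ENCODING SCHEME = 'UNICODE';
-- Dynamic statements prepared after this point treat character host variables
-- as CCSID 1208 and graphic host variables as CCSID 1200, per the example above.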


ODBC/SQLJ/JDBC

ODBC Support
Support for Wide Character APIs (UCS-2/UTF-16) - see ODBC Guide and Reference (SC26-9941-01)
Example:
SQLRETURN SQLPrepare  (SQLHSTMT hstmt, SQLCHAR  *szSqlStr, SQLINTEGER cbSqlStr);
SQLRETURN SQLPrepareW (SQLHSTMT hstmt, SQLWCHAR *szSqlStr, SQLINTEGER cbSqlStr);

SQLJ/JDBC Support
Remove current support for converting to EBCDIC before calling the engine
Let the DB2 engine determine where conversion is necessary

COBOL

Enterprise COBOL for z/OS and OS/390 V3R1 supports Unicode
NATIONAL is used to declare UTF-16 variables: MY-UNISTR pic N(10). declares a UTF-16 variable
N and NX literals: N'123', NX'003100320033'
Conversions: NATIONAL-OF converts to UTF-16, DISPLAY-OF converts to a specific CCSID

Example:
Greek-EBCDIC pic X(10) value "...".   (the value contains Greek characters)
UTF16STR     pic N(10).
UTF8STR      pic X(20).
Move Function National-of(Greek-EBCDIC, 00875) to UTF16STR.
Move Function Display-of(UTF16STR, 01208) to UTF8STR.

ODBC support for Unicode is included as part of the effort to support ODBC 3.0. SQLJ and JDBC already support Unicode, but changes have been made to exploit the Unicode support in DB2 V7.

COBOL has recently added support for Unicode characters. Included in this support are the new NATIONAL data type, N and NX literals, conversion operations, and more.


Joins, Sub-queries, Unions...

Support will be consistent with ASCII support
No mixing of encoding schemes in queries, joins, sub-queries, unions...
For example: SELECT T1C1, T2C1 FROM T1, T2 WHERE ...
fails if T1 and T2 are not the same encoding scheme
You cannot reference the DB2 catalog in a query against an ASCII or Unicode table

Predicates

Predicates limited to 255 bytes (except LIKE)

Basic predicate
SELECT ... WHERE C1 = :HG1
(where C1 is UTF-8 and :HG1 is UTF-16)

LIKE predicate
SELECT ... WHERE C1 LIKE :HG1 ESCAPE :HG2;
(where C1 is UTF-8 and :HG1 and :HG2 are UTF-16)

IN predicate
SELECT ... WHERE C1 IN (:HG1, :HV1);
(where C1 is UTF-8, :HG1 is UTF-16, and :HV1 is character)

As in prior releases, you cannot reference tables from more than one encoding scheme in a single statement. The first example fails because T1 and T2 are not of the same encoding scheme.

For queries against Unicode tables, host variables used in a predicate may be specified as UTF-16 or UTF-8 regardless of the data type of the column. This allows the back end data store to change without changing the application. When data is primarily Latin-1, it may be more efficient to store UTF-8 data; when data is not primarily Latin-1, it may be more efficient to store UTF-16.
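To make the restriction and the flexibility concrete, a hedged sketch (T_UNI and T_EBC are hypothetical Unicode and EBCDIC tables):

-- Fails: the two tables are in different encoding schemes
SELECT U.C1, E.C1
  FROM T_UNI U, T_EBC E
 WHERE U.C1 = E.C1;

-- Works: a single encoding scheme; the host variable is converted as needed,
-- so :HG1 may be a UTF-16 graphic variable even if C1 is a UTF-8 column
SELECT C1
  FROM T_UNI
 WHERE C1 = :HG1;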


Scalar Functions
Functions
LENGTH, SUBSTR, POSSTR, LOCATE - byte oriented for SBCS and Mixed (UTF-8), double-byte character oriented for DBCS (UTF-16)
Cast functions - UTF-16/UTF-8 are accepted anywhere char is accepted (char, date, time, integer...)
SELECT DATE(graphic column) FROM T1;
SELECT INTEGER(graphic column) FROM T1;
UTF-8 is the result data type/CCSID 1208 for character functions (char(float_col)...)

Routines

UDFs, UDTFs, and SPs will all be enabled to allow Unicode parameters
Parameters will be converted as necessary between char (UTF-8) and graphic (UTF-16)
Date/Time/Timestamp values are passed as UTF-8 (ISO format)

All Built-In Functions (BIFs) have been extended to support Unicode. Some BIFs, such as LENGTH, SUBSTR, POSSTR, and LOCATE, are byte oriented for UTF-8 and double-byte character oriented for UTF-16. Many new functions were added in V7; the CCSID_ENCODING function has been added to help users determine the encoding (ASCII, EBCDIC, or UNICODE) of a particular CCSID. UTF-16 data is accepted by casting-type functions such as DATE or INTEGER. Functions that return character strings return UTF-8/CCSID 1208 results.
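A small hedged example using the hypothetical DEMO.T_UNI table from the earlier sketch (the 'UNICODE' result spelling for CCSID_ENCODING is my assumption):

SELECT LENGTH(C_UTF8),        -- byte count for the UTF-8 (FOR MIXED DATA) column
       LENGTH(C_UTF16),       -- double-byte character count for the UTF-16 column
       CCSID_ENCODING(1208)   -- reports the encoding scheme of CCSID 1208, i.e. UNICODE
  FROM DEMO.T_UNI;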

User written routines (User Defined Functions, User Defined Table Functions, and Stored Procedures) have been extended to support Unicode. Parameters will be converted as necessary. Date, Time, and Timestamp values are passed to the routine as UTF-8 character strings, in the ISO format as specified in the DB2 SQL Reference.
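A hedged sketch of a routine definition with a Unicode parameter, following the CREATE PROCEDURE fragment shown earlier on the Solution slide (the name is the slide's illustrative one, and the trailing ellipsis is the slide's own elision of the remaining options):

CREATE PROCEDURE MYSP
  (IN IN_PARM CHAR(10) CCSID UNICODE)  -- parameter delivered to the routine in UTF-8
  ...                                  -- remaining routine options as in the slide's fragment

A DATE, TIME, or TIMESTAMP parameter would likewise arrive as a UTF-8 character string in ISO format, as described in the notes above.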


Utilities

LOAD Utility
ASCII/EBCDIC <-> UNICODE
UTF-16 <-> UTF-8
SBCS/MIXED -> DBCS
DBCS -> SBCS/MIXED

UNLOAD Utility
ASCII/EBCDIC <-> UNICODE
No support for:
SBCS/MIXED -> DBCS
DBCS -> SBCS/MIXED

Limits

Index key size - remains 255
Char limit - still 255 bytes
Varying length string limit - still 32K bytes
Strings > 32K bytes - use LOBs

The LOAD utility has been extended to support conversion to and from Unicode. Additionally, the LOAD utility supports conversion between character and graphic as long as a conversion exists: character in the load dataset -> graphic column, and graphic in the load dataset -> character column. The UNLOAD utility, new for V7, supports conversion to/from Unicode, but does not support conversion between character and graphic.

We haven't changed any of these limits; they are the same limits as in V6. The limit on index key sizes is something to watch out for, since Unicode data can take from 1 to 3 times the space needed to store ASCII or EBCDIC data. For character strings longer than 255 bytes, use VARCHAR. Varying length strings are still limited to 32704 bytes; for longer strings, use LOBs.


Things to look out for


UTF-8 and UTF-16 are compatible just about everywhere, but you will pay a conversion cost. It is best to match the DB2 data definition to the UNICODE model the application is using

Things to look out for

If the application uses UTF-8, DB2 tables should be UTF-8; if the application uses UTF-16, DB2 tables should be UTF-16
Collation
Unicode collation is more like ASCII collation than EBCDIC: numbers come before letters, and upper case characters come before lower case
UTF-8 and UTF-16 collations are not the same if surrogates are involved

UTF-8 and UTF-16 are very compatible, but the cost of conversion, even with HW support, can be high. In addition to the conversion costs, there can be other effects, such as predicate indexability, that may be affected by a mismatch in data types. Matching the application and the back end data store will optimize applications and provide the best performance. Collation of Unicode data is more like that of ASCII data. When surrogates are involved, UTF-8 and UTF-16 do not collate the same.
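A tiny hedged illustration of the collation difference (table names are hypothetical), sorting the three values 'a', 'B', and '1':

SELECT C1 FROM T_EBC ORDER BY C1;   -- EBCDIC order:  'a', 'B', '1'  (letters before digits, lower before upper)
SELECT C1 FROM T_UNI ORDER BY C1;   -- Unicode order: '1', 'B', 'a'  (digits before letters, upper before lower)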


Things to look out for


Storage size does not equal rendered size
Japanese characters take 3 bytes to store 1 character in UTF-8
Latin-1 accented characters take two bytes in UTF-8
UNICODE has things called combining characters that allow something like A-Ring to be represented as A plus a combining ring character. Combining characters can add to the size needed for both UTF-8 and UTF-16 columns: Å can be represented as '00C5'x (or 'C385'x for UTF-8) or as '0041030A'x (or '41CC8A'x for UTF-8)
Client Connections
Clients need to be compatible
Two stage conversions can be used

Things to look out for


UTF-16 and SPUFI or DSNTEP2: SPUFI and DSNTEP2 really aren't UTF-16 aware. In most cases, you should use CHAR(graphic column) when selecting data. For example, use SELECT CHAR(g1) FROM T1, not SELECT g1 FROM T1.
Hex constants are character based. INSERT INTO T1 (g1) VALUES(X'0041'); will result in '00000041'x, not '0041'x as you might expect. Because hex constants are character based, DB2 will convert from UTF-8 to UTF-16 for you: '00'x -> '0000'x and '41'x -> '0041'x.

There are many issues that need to be dealt with in a Unicode environment. Some are storage related and affect things like database definitions; some are application related and affect things like rendering characters for printing or display. Expanding and contracting conversions are very common in Unicode environments, and the sizes of these expansions and contractions are not easy to calculate because of things like combining characters. When a connection is made between a client and server using DRDA, CCSIDs are exchanged. If a conversion is not available to convert between the CCSIDs, the connection will fail. For conversions that would not normally be available, two stage converters can be used. For example, converters from Chinese CCSIDs to Japanese CCSIDs aren't normally available; however, we can convert from Chinese to Unicode and from Unicode to Japanese. It is possible to create two stage converters using the OS/390 support for Unicode.

SPUFI and DSNTEP2 are not designed to work with GRAPHIC data on a MIXED = NO subsystem. HEX constants are character based, not graphic based. HEX constants should be used with UTF-16 with care


Example Scenarios

Pre-V7

[Diagram: clients connecting to two pre-V7 DB2 subsystems]
DRDA clients: CCSID 850, CCSID 819, and CCSID 290/930/300 (Japanese)
3270 terminals: CCSID 500 and CCSID 37
Servers: DB2 V6, Mixed = No, EBCDIC CCSID 500 and DB2 V6, Mixed = No, EBCDIC CCSID 37

Even though 37 and 500 are both Latin-1 code pages and are compatible, you should not connect to a CCSID 500 system with a CCSID 37 emulator. Characters such as [, ], |, and ! are not represented the same in CCSID 37 and CCSID 500, and thus these characters will be corrupted when returned to another user via a DRDA or 3270 data stream.


V7 and Beyond

[Diagram: clients connecting to two DB2 V7 subsystems]
DRDA clients: CCSID 850, CCSID 912 (Czech), and CCSID 290/930/300 (Japanese)
3270 terminals: CCSID 500 and CCSID 37
Servers: DB2 V7 with Unicode, Mixed = No, EBCDIC CCSID 500 and DB2 V7 with Unicode, Mixed = No, EBCDIC CCSID 37

Summary

Character Conversion Fundamentals
Unicode Fundamentals
Unicode Support in DB2 UDB for OS/390 V7
Things to look out for
Example Scenarios
With V7, connections from clients that were not possible in the past are now possible. This doesn't mean that there aren't challenges: conversions need to be defined, and correct behavior may depend on the application encoding bind option specification.


Appendix - CCSID Information and Documentation

Installation Guide
Appendix A. Character conversion

SQL Reference
Character sets and code pages
Character conversion
Conversion rules for string assignment
Conversion rules for string comparison
Character conversion in unions and concatenations
Selecting the result CCSID
SQL descriptor area (SQLDA)

Administration Guide
Choosing string or numeric data types

Appendix - Catalog

Catalog changes for Unicode in DB2 for z/OS & OS/390 V7

Added Columns:
SYSPLAN - RELBOUND (indicates release when plan was last bound or rebound), ENCODING_CCSID (bind option value)
SYSPACKAGE - RELBOUND (indicates release when package was last bound or rebound), ENCODING_CCSID (bind option value)
SYSVIEWS - RELCREATED (indicates release when view was created)
SYSTABLES - RELCREATED (indicates release when table was created)

Updated Columns (updated for ENCODING UNICODE):
SYSDATABASE
SYSTABLESPACE
SYSTABLES
SYSPARMS
SYSDATATYPES

There are many sections in the DB2 documentation that deal with character conversion; some of the more important ones are shown here. I have listed the section titles for the books. I haven't listed page numbers or section numbers because these may vary depending on the form of book (paper, PDF, or BookManager) you use. You should become familiar with these sections of the documentation if character conversion is occurring on your system.

There were many changes to the catalog for DB2 V7; the changes related to Unicode support are listed here.


Appendix - zSeries Unicode Support

The UTF-8 <-> UTF-16 instructions are used when DB2 converts from char to graphic or graphic to char. These are used in DB2 V7 if running on G5, G6, zSeries 900, or zSeries 800 with OS/390 V2R8 or later:
CUUTF - Convert UTF-16 to UTF-8
CUTFU - Convert UTF-8 to UTF-16

The following two instructions are similar to CLCLE and MVCLE. DB2 uses them to perform comparison and padding on UTF-16 data because a two byte padding character can be specified. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on zSeries 900 with OS/390 V2R8 or later:
CLCLU - Compare Logical Long Unicode
MVCLU - Move Logical Long Unicode

These instructions pack/unpack ASCII (also UNICODE UTF-8) and UNICODE (UTF-16) data. They are used when DB2 converts a character string to a decimal or internal date/time/timestamp value, or a decimal value to a character string. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on zSeries 900 with OS/390 V2R8 or later:
PKU - Pack Unicode
PKA - Pack ASCII
UNPKU - Unpack Unicode
UNPKA - Unpack ASCII

These instructions are all used when DB2 performs conversion. For instance, to convert from ASCII SBCS to UNICODE UTF-16, DB2 will use TROT (one byte to two byte characters). DB2 uses these instructions indirectly via the Conversion System Services (available in OS/390 V2R8 and above) in DB2 UDB for z/OS & OS/390 V7:
TRTT - Translate Two to Two
TRTO - Translate Two to One
TROT - Translate One to Two
TROO - Translate One to One

References

DB2 UDB Server for OS/390 Version 7 and z/OS Presentation Guide - Redbook SG24-6121

DB2 Universal Database Administration Guide - SC09-2946
Appendix E - National Language Support

The Unicode Standard Version 3.0 - The Unicode Consortium, Addison-Wesley
www.unicode.org
http://www.ibm.com/developerworks/unicode/

Character Data Representation Architecture: Reference & Registry - SC09-2190

National Language Design Guide Volume 2 - SE09-8002

The z/Architecture zSeries 900 and zSeries 800 processors provide instructions that support Unicode data processing. That support is detailed here.

In Appendix E of the DB2 UDB Administration Guide, there is a discussion of NLS issues and of how to set or override the codepage at the client, using the codepage keyword on NT and the LANG variable on AIX.

