DB2 UDB for z/OS & OS/390 Character Conversion & Unicode Fundamentals
Chris Crone Senior Software Engineer / IBM cjc@us.ibm.com
Character Conversion Fundamentals Unicode Fundamentals Unicode Support in DB2 UDB for z/OS & OS/390 V7 Things to Look Out For Example Scenarios Summary
This is a basic survival lesson in character conversion and Unicode. What is a CCSID? What does DB2 do with it? Why do I care? What does Unicode do for me? If you care about data integrity and run on more than one operating system in a global business, this is the basic information you need to survive.
This is the agenda for a basic survival lesson in character conversion and Unicode: What is a CCSID? What does DB2 do with it? Why do I care? And what does Unicode do for me? In a global economy, this is the basic information you need to survive. While these are the fundamentals, they are not simple, and there are many misconceptions. Just because some of your data comes back correctly does not guarantee that you are always getting all of your data back without loss.
Chris Crone
04:07 PM
04/16/02
1-2
Terminology
For the purposes of this presentation:
UNICODE - the UTF-8 and UTF-16 encodings
ASCII - a generic term that refers to all ASCII CCSIDs that DB2 currently supports
EBCDIC - a generic term that refers to all EBCDIC CCSIDs that DB2 currently supports
CCSID - Coded Character Set Identifier; used by DB2 to tag string data
When I use the term UNICODE, I mean the UTF-16 and UTF-8 encodings. See www.unicode.org for more. When I use the term ASCII, I mean any generic ASCII CCSID (like 850, 819, 437): Single Byte Character Set (SBCS), Mixed, or Double Byte Character Set (DBCS). When I use the term EBCDIC, I mean any generic EBCDIC CCSID (like 500, 37): SBCS, Mixed, or DBCS. CCSIDs are used by DB2 to tag string data; a CCSID precisely defines the encoding of the data.
When we store data in some EBCDIC CCSID and display it on a PC in some ASCII CCSID, it must be translated. There are many translations, and we're just getting started.
Conversion Methods
Native DB2
SYSIBM.SYSSTRINGS (V2.3)
ICONV
Uses LE base services (V6) - Non-Strategic
Requires OS/390 V2R9
code and program directory: http://www6.software.ibm.com/dl/os390/unicodespt-p
documentation: http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunpde00.pdf and http://publibfp.boulder.ibm.com/pubs/pdfs/os390/cunuge00.pdf
Information APARs II13048 and II03049
This slide depicts two common PC codepages; note the differences between them. Codepage 1252 defines things like the Euro and the full Latin-1 character set with all the accented characters. Codepage 437 defines only a partial set of the Latin-1 accented characters. Because the codepages do not contain exactly the same set of characters, data cannot necessarily be converted from one to the other without the potential for loss. Note also that the same characters appear at different code points: look at the ae ligatures, '91'x and '92'x in CCSID 437 versus 'E6'x and 'C6'x in CCSID 1252.
DB2 uses three methods for conversion. SYSSTRINGS - these are the conversion services that were introduced in DB2 V2R3 and the ones most people are familiar with. ICONV - introduced in DB2 V6 and requiring OS/390 V2R9 and above; this was our first attempt at leveraging OS/390 infrastructure to perform conversion. It is non-strategic because the OS/390 V2 R8/R9/R10 support for Unicode provides more functionality with better performance. OS/390 V2 R8/R9/R10 support for Unicode - starting in V7, DB2 will be leveraging this service for most future character conversion support.
Native DB2
Based on SYSIBM.SYSSTRINGS
High performance
Added with DB2 V2R3
Support for Single byte, Mixed, and Double byte ASCII and EBCDIC
Uses a combination of 256-byte conversion tables and special two-stage lookup tables
The native conversion services that DB2 ships are based on support that was added in DB2 V2R3 They are high performance and rely on a cached copy of SYSIBM.SYSSTRINGS rows to perform conversion. A large number of ASCII and EBCDIC, Single byte, Mixed, and Double byte conversions are supported. These conversions use a combination of conversion tables contained in the TRANSTAB field of SYSSTRINGS and two stage conversion tables (for mixed and Double byte conversions) specified by the TRANSPROC field of SYSSTRINGS.
With the introduction of the OS/390 support for Unicode system service, OS/390 now has a central repository for conversion that can be used by applications, middleware, and subsystems. This service is designed to be high performance and utilizes new HW instructions and page fixed conversion tables to perform the conversions. This service uses a conversion image that is built by the off-line utility CUNMIUTL. A customer specifies which conversions will be supported by the conversion image. The conversion image is managed via the OS/390 console, not DB2. The SET UNI command specifies the image to be loaded The DISPLAY UNI command displays information about the currently loaded conversion image
Here's an example of the CUNJIUTL utility. Note the specification of ER (enforced subset, round trip) after each CCSID pair; the ER specification is required by DB2.
Which conversions should be configured? All DB2 conversions involving Unicode are performed via the OS/390 conversion services, so any conversion involving Unicode must be configured. In addition, ASCII and EBCDIC conversions not supported by the native DB2 conversion methods - that is, conversions not available via SYSIBM.SYSSTRINGS - are performed via the conversion services and must also be configured.
Here's an example of output from a DISPLAY UNI command. Note that when the image was created is displayed. The number of pages used by the image is also displayed; this is important because these pages are page fixed, so they take up dedicated memory on the machine. Finally, a list of the conversions supported in this image is displayed. Note that for CCSID 1200 conversions, the base CCSID that the conversion was created from is also displayed in parentheses.
DB2 conversions are either Round Trip or Enforced Subset. Round Trip conversions attempt to avoid loss of data by mapping unrepresented code points to unused (or unlikely to be used) code points: data loss can be avoided or delayed, but strange conversions can result. Enforced Subset conversions map any unrepresented code points to the substitution character: data loss occurs immediately.
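Python's codec machinery offers a loose analogue of an Enforced Subset conversion; this is purely illustrative (DB2 uses its own substitution characters, not Python's '?'):

```python
# Characters with no representation in the target encoding are replaced
# immediately with a substitution character - here Python's '?'.
s = "Gr\u00fc\u00dfe \u20ac"   # "Grüße €"; the Euro sign is not in Latin-1
out = s.encode("latin-1", errors="replace")
print(out)  # b'Gr\xfc\xdfe ?'
```

Once the substitution happens, the original code point is gone; that is the "data loss occurs immediately" behavior described above.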
Contracting Conversions
Contracting conversions occur when data converted from one CCSID to another shrinks. For example, '00C5'x in CCSID 1200 -> 'C5'x in CCSID 819.
There are some cases where a conversion causes the length of the data to change. Expanding conversions cause the length of the data to grow Contracting conversions cause the length of the data to shrink
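A quick sketch of one character contracting or expanding across encodings (Python used for illustration only; the CCSID labels in the comments come from the slide, not from Python):

```python
# The same character occupies different numbers of bytes in different
# encodings: 2 bytes in UTF-16, 1 byte in Latin-1, 2 bytes in UTF-8.
ch = "\u00C5"  # 'Å', LATIN CAPITAL LETTER A WITH RING ABOVE

utf16 = ch.encode("utf-16-be")   # '00C5'x - as in CCSID 1200
latin1 = ch.encode("latin-1")    # 'C5'x   - as in CCSID 819
utf8 = ch.encode("utf-8")        # 'C385'x - as in CCSID 1208

print(utf16.hex(), len(utf16))   # 00c5 2
print(latin1.hex(), len(latin1)) # c5 1
print(utf8.hex(), len(utf8))     # c385 2
```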
So now that we know all about CCSIDs, what are they used for? DB2 uses CCSIDs just like data type and length: they are part of the metadata that describes the data being stored in DB2. In V7, DB2 supports specification of three sets of CCSIDs, representing the three encoding schemes (ASCII, EBCDIC, and Unicode) that DB2 supports. These CCSIDs are specified at the subsystem level. Once these values have been specified, they should not be changed.
Installation
Specification of CCSIDs is performed at installation via install Panel DSNTIPF
DSNTIPF          INSTALL DB2 - APPLICATION PROGRAMMING DEFAULTS PANEL 1
===>_
Enter data below:
 1 LANGUAGE DEFAULT     ===> IBMCOB   ASM,C,CPP,COBOL,COB2,IBMCOB,FORTRAN,PLI
 2 DECIMAL POINT IS     ===> .        . or ,
 3 MINIMUM DIVIDE SCALE ===> NO       NO or YES for a minimum of 3 digits
                                      to right of decimal after division
 4 STRING DELIMITER     ===> DEFAULT  DEFAULT, " or ' (COBOL or COB2 only)
 5 SQL STRING DELIMITER ===> DEFAULT  DEFAULT, " or '
 6 DIST SQL STR DELIMTR ===> '        ' or "
 7 MIXED DATA           ===> NO       NO or YES for mixed DBCS data
 8 EBCDIC CCSID         ===> 0        CCSID of your SBCS or MIXED DATA
 9 ASCII CCSID          ===> 0        CCSID of SBCS or mixed data
10 Unicode CCSID        ===> 1208     CCSID of Unicode UTF-8 data
11 DEF ENCODING SCHEME  ===> EBCDIC   EBCDIC, ASCII, or UNICODE
12 LOCALE LC_CTYPE      ===>
13 APPLICATION ENCODING ===> EBCDIC   EBCDIC, ASCII, UNICODE, ccsid (1-65533)
14 DECIMAL ARITHMETIC   ===> DEC15    DEC15, DEC31, 15, 31
15 USE FOR DYNAMICRULES ===> YES      YES or NO
16 DESCRIBE FOR STATIC  ===> NO       Allow DESCRIBE for STATIC SQL. NO or YES.
Installation (continued)
Information from DSNTIPF ends up in Job DSNTIJUZ
Mixed System:
DSNHDECM ASCCSID=1088, AMCCSID=949, AGCCSID=951,
         SCCSID=833, MCCSID=933, GCCSID=834,
         USCCSID=367, UMCCSID=1208, UGCCSID=1200,
         ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=YES
END

Non-Mixed System:
DSNHDECM ASCCSID=819, AMCCSID=65534, AGCCSID=65534,
         SCCSID=37, MCCSID=65534, GCCSID=65534,
         USCCSID=367, UMCCSID=1208, UGCCSID=1200,
         ENSCHEME=EBCDIC, APPENSCH=EBCDIC, MIXED=NO
END
Install panel DSNTIPF is used to specify CCSID information. Options 8, 9, and 10 are where the CCSIDs for the three encoding schemes are specified. Notice that the ASCII and EBCDIC CCSIDs are initialized to 0 and the Unicode CCSID is initialized to 1208. The ASCII and EBCDIC CCSIDs are not pre-filled; these values need to be set by the customer. The EBCDIC CCSID should be set to the CCSID that the customer's 3270 emulators, CICS, and IMS transactions use. The ASCII value should be set to the CCSID most commonly used by workstations in the customer shop (1252, for example). The Unicode value is pre-filled with 1208 and cannot be changed; this value specifies the mixed CCSID for Unicode tables. Other things to note on this panel: Option 11 specifies the default encoding scheme for objects created in the DB2 subsystem. Option 13 specifies the default application encoding; changing this value should be done with great care.
The information from panel DSNTIPF flows to job DSNTIJUZ. In the case on the left, we have a MIXED=YES system that is set up to support Korea. The ASCII and EBCDIC system CCSIDs that would actually have been specified on panel DSNTIPF, to produce this specification, are 949 and 833. For mixed systems, and for the Unicode CCSID, the Mixed CCSID is specified on install panel DSNTIPF and DB2 picks the corresponding Single byte and Graphic (Double byte) CCSIDs. In the case on the right, we have a MIXED=NO system that is set up to support US English. Note that the user specified 819 and 37 for the ASCII and EBCDIC Single byte CCSIDs, and that DB2 used the value 65534 for the ASCII and EBCDIC Mixed and Graphic (Double byte) CCSIDs; 65534 is a reserved value that means "no CCSID". Also note that the Default Encoding Scheme and Default Application Encoding flow to this job as well. Note there is a bug in DSNTIJUZ and DSNHDECM: these ship with CCSID 500 as the default.
Unicode: UTF-8 MBCS (1-4 bytes/char) data. No support for ASCII/EBCDIC mixed data
Graphic data
Mixed = Yes systems have support for SBCS Data - Pure single byte data Mixed Data - Single & double byte data in a single string Graphic Data - Pure Double byte data
Once we've specified CCSIDs for our system, what does DB2 do with them? DB2 stores CCSIDs in the catalog, the directory, in bound statements, and of course the DECP. The value stored in these areas depends on what release of DB2 was used to create the object and the value in the DECP at the time the object was created. If a value is 0, the object is assumed to be EBCDIC. In general, DB2 does not support changing a CCSID once it is specified in a DECP. The exceptions are: changing from 0 to a valid value, and changing from a CCSID that does not support the Euro symbol to one that does (37 -> 1140, for instance). Note that this sort of change requires special, disruptive steps and should be undertaken only after the documentation has been read and the process is thoroughly understood. Encoding information is also stored in some catalog tables.
Mixed = Yes systems are used in the Far East, primarily China, Japan, and Korea. They offer support for SBCS, Mixed, and DBCS ASCII and EBCDIC data. Mixed = No systems are used elsewhere in the world and only have support for SBCS data. Unicode data is always considered mixed, regardless of the Mixed = Yes/No setting of the system. Creation of columns with the FOR MIXED DATA and GRAPHIC data types was allowed, for EBCDIC tables, on Mixed = No systems prior to DB2 V7. As of DB2 V7, Mixed = No systems only allow specification of these types of columns in Unicode tables.
Mixed Data
SBCS Data
SBCS can be compared to mixed without conversion to mixed because it is a subset of the mixed repertoire. This is true for ASCII, EBCDIC, and Unicode
Mixed Data
Capable of representing SBCS and MBCS data
EBCDIC: SO and SI ('0E'x, '0F'x) delineate DBCS data; AB<A> -> 'C1C20E42C10F'x. ASCII uses first-byte code points: if the first byte is within a certain range, say 'A0'x - 'AF'x, then it is the first byte of a DBCS character; for example, 'A055'x would be a DBCS character. Some CCSIDs have several first-byte code point ranges. UTF-8 data uses the high-order bit to indicate MBCS data; for example, 'EFBC91'x is a three-byte UTF-8 character.
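The UTF-8 lead-byte rule can be sketched as follows. This is a Python illustration, and `utf8_char_len` is a hypothetical helper, not a DB2 routine:

```python
def utf8_char_len(first_byte: int) -> int:
    """Length of a UTF-8 character, determined from its first byte."""
    if first_byte < 0x80:
        return 1                      # 0xxxxxxx - single byte (ASCII range)
    if first_byte >> 5 == 0b110:
        return 2                      # 110xxxxx - two bytes
    if first_byte >> 4 == 0b1110:
        return 3                      # 1110xxxx - three bytes
    if first_byte >> 3 == 0b11110:
        return 4                      # 11110xxx - four bytes
    raise ValueError("continuation byte or invalid lead byte")

data = "\uFF11".encode("utf-8")       # the 'EFBC91'x example from the slide
print(data.hex())                     # efbc91
print(utf8_char_len(data[0]))         # 3
```

No shift-out/shift-in bytes are needed: each byte carries enough information to tell whether it starts a single-byte or multi-byte character.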
CCSID Override in SQLDA (V2.3 & above) Declare Variable (V7) Application Encoding Bind Option (V7) Current Application Encoding Special Register (V7) Remote
Automatically when needed
Graphic Data
ASCII or EBCDIC - DBCS characters no shift or first byte code points needed Unicode - DBCS characters. Surrogates take two DBCS characters
Mixed = yes systems use a CCSID triplet. That is to say, there is an SBCS CCSID, a Mixed CCSID, and a DBCS CCSID. On these systems, the SBCS CCSID is a subset of the Mixed CCSID. Because of this, SBCS and Mixed data can be compared without converting. SBCS columns are created by specifying the "FOR SBCS DATA" clause on create like CREATE TABLE T1 (C1 CHAR(10) FOR SBCS DATA); Mixed columns are the default, on MIXED = Yes systems, and can be explicitly specified by using the "FOR MIXED DATA" clause on CREATE. DBCS data is stored in Graphic columns
There are no hard and fast rules as to when a conversion occurs. The short answer is that conversion occurs when necessary. Some of the cases when we do conversion are listed here.
Why Unicode?
Unicode is a single character set that encodes all of the world's scripts (sort of). The Unicode standard provides a cross-platform, cross-vendor method of encoding data that enables lossless representation and manipulation
Before Unicode
Unicode Fundamentals
Many Standards
ANSI JIS TISI
Sort of, because new characters are being added all the time, so at any given time an implementation of the standard is somewhat behind. And sort of, because there is one standard that contains several encoding forms. Prior to Unicode, many different standards and vendor implementations existed. Unicode attempts to standardize the representation and manipulation of data across vendors and platforms
Unicode Fundamentals
Four forms of Unicode
UTF-8 Unicode Transformation Format in 8 bits UCS-2 Universal Character Set coded in 2 octets UTF-16 Unicode Transformation Format in 16 bits UTF-32 Unicode Transformation Format in 32 bits Introduced with Unicode Technical Report # 19 to replace UCS-4
Bytes '00'x - '7F'x are represented as single-byte characters. Characters from '80'x up are encoded as multi-byte (2-4 byte) sequences; the original UTF-8 definition allowed up to 6 bytes
Most characters take 2-3 bytes
Most Japanese, Chinese, and Korean characters take 3 bytes Most Extended Latin characters take 2 bytes
There are currently 4 forms of Unicode being promoted by the Unicode standards organization: UTF-8, UCS-2, UTF-16, and UTF-32. UTF-16 is the preferred format (according to UTR#19). UCS-2 is the precursor to UTF-16.
UTF-8 is represented by CCSID 1208. This is a growing CCSID, which means that as characters are added to the Unicode standard, they will be added to this CCSID. UTF-8 is also commonly called "ASCII safe" Unicode: the first 128 characters are the same as CCSID 367, which is a 7-bit ASCII CCSID. Other characters are represented as multi-byte (2-4 byte) characters. One nice feature of UTF-8 is that, since it is a byte-oriented encoding, it does not have any big endian/little endian issues.
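The "ASCII safe" property can be demonstrated in one line (Python illustration, not DB2): because the first 128 code points encode as the same single bytes as 7-bit ASCII, pure ASCII text is already valid UTF-8.

```python
# Pure ASCII text has identical byte sequences in ASCII and UTF-8,
# so no conversion is needed for such data.
stmt = "SELECT C1 FROM T1"
assert stmt.encode("ascii") == stmt.encode("utf-8")
print(stmt.encode("utf-8").hex())
```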
UCS-2 is represented by CCSIDs 13488 and 17584. 13488 corresponds to Unicode Version 2, and 17584 corresponds to Unicode Version 3. When people say Unicode, without qualifying the encoding format, this is usually what they mean. Other characters are allocated in blocks (there's a block for Greek chars, a block for Cyrillic chars....)
UTF-16 is represented by CCSID 1200, which is a growing CCSID also UTF-16 is a superset of UCS-2 and uses reserved sections of BMP0 to map an additional 16 planes Version 3.1 of the Unicode standard defines the first characters in the surrogate area
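The surrogate mechanism that maps the additional planes can be sketched numerically. The arithmetic below is the standard UTF-16 surrogate algorithm; Python is used purely for illustration, and U+200D0 is one of the supplementary characters used in the examples later in this presentation:

```python
# Split a supplementary code point into a UTF-16 surrogate pair.
cp = 0x200D0
v = cp - 0x10000                    # leaves a 20-bit value
high = 0xD800 + (v >> 10)           # high (leading) surrogate
low = 0xDC00 + (v & 0x3FF)          # low (trailing) surrogate
print(hex(high), hex(low))          # 0xd840 0xdcd0

# Python's UTF-16 codec produces the same pair.
assert "\U000200D0".encode("utf-16-be").hex() == "d840dcd0"
```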
UTF-32
Each Character is 4 bytes
Range is restricted to values '00000000'x - '0010FFFF'x Represents the same repertoire as UTF-16
Endianess
Big Endian
pSeries, zSeries, iSeries, Sun, HP. Most significant byte is leftmost. For a 4-byte word - byte order 0,1,2,3
UCS-4 Implemented by SUN Solaris and HP/UX as base Unicode data type XPG/4 standard requires fixed width character format
z/Series, p/Series looking at UTF-32 implementations to support surrogate characters in C/C++ applications
Little Endian
Intel based machines including xSeries. Least significant byte is leftmost. For a 4-byte word - byte order 3,2,1,0
UTF-8 - not affected by endianness issues. UTF-16 and UTF-32 are affected by endianness issues
Big Endian 'A' = x'0041' for UTF-16 or x'00000041' for UTF-32 Little Endian 'A' = x'4100' for UTF-16 or x'41000000' for UTF-32
Note: A BYTE is always ordered as leftmost most significant bit to rightmost least significant bit. Bit order within a byte is always 7,6,5,4,3,2,1,0
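The byte orders above can be checked with any Unicode-aware language; a quick Python illustration (not DB2):

```python
# Byte order of the multi-byte Unicode forms for the letter 'A' (U+0041).
ch = "A"
print(ch.encode("utf-16-be").hex())  # 0041     - big endian UTF-16
print(ch.encode("utf-16-le").hex())  # 4100     - little endian UTF-16
print(ch.encode("utf-32-be").hex())  # 00000041 - big endian UTF-32
print(ch.encode("utf-32-le").hex())  # 41000000 - little endian UTF-32
print(ch.encode("utf-8").hex())      # 41       - UTF-8: no byte-order issue
```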
For completeness, I'm mentioning UTF-32 and UCS-4. DB2 is not implementing any support for these implementations of the Unicode standard, although other vendors have.
DB2 and DRDA manipulate and store data in Big Endian format. Little Endian clients convert data to Big Endian before putting on the wire.
Character Examples
Characters: A, a, 9, Å (the character A with ring accent), and two ideographic characters
ASCII: '41'x, '61'x, '39'x, 'C5'x, 'CDDB'x (ccsid 939), N/A
UTF-16: U+0041, U+0061, U+0039, U+00C5, U+9860, U+200D0
UTF-8: '41'x, '61'x, '39'x, 'C385'x, 'E9A1A0'x, 'F0A08390'x
Note: 'C5'x becomes double byte in UTF-8
Note: UCS-2/UTF-16 and UCS-4/UTF-32 are using a technique called Zero Extension
Now that we know all about Unicode, here are some examples of what this data looks like. Note that A-ring takes two bytes to represent in UTF-8, and that other characters can take three or four bytes to represent
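The UTF-8 byte sequences in the character examples can be verified directly (Python illustration; the hex strings correspond to the 'xx'x values on the slide):

```python
# Each entry maps a character to its expected UTF-8 byte sequence in hex.
examples = {
    "\u0041": "41",            # A
    "\u0061": "61",            # a
    "\u0039": "39",            # 9
    "\u00C5": "c385",          # A with ring: one byte in Latin-1, two in UTF-8
    "\u9860": "e9a1a0",        # an ideographic character: three bytes
    "\U000200D0": "f0a08390",  # a supplementary character: four bytes
}
for ch, expected in examples.items():
    assert ch.encode("utf-8").hex() == expected
print("all UTF-8 encodings match the examples")
```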
Requirement
Enable Unicode on DB2 UDB for OS/390 and z/OS
Support Vendors implementing Unicode applications Support needs of Multinational Companies Support data from more than one country/language in one DB2 subsystem
Solution
Allow UNICODE to be specified as the Encoding Scheme (ES) at the:
System level - UNICODE CCSIDs (Install); similar to ASCII/EBCDIC system CCSIDs
Database level - create database mydb ccsid unicode
Table Space level - create tablespace myts in mydb ccsid unicode
Table level - create table t1 (c1 char(10)) ccsid unicode
Other - create procedure mysp (in in_parm char(10) ccsid unicode) ...
For V7 our challenge was to enable Unicode data storage on DB2, without regressing function or performance for our ASCII and EBCDIC customers. We wanted to meet the needs of ERP and CRM vendors, as well as address the needs of customer written applications that have a need to store multinational data
The Unicode support we have added with DB2 V7 is similar to the ASCII support that was added in V5. This support allows ASCII, EBCDIC, and Unicode objects to coexist in a single DB2 subsystem. Specification of the encoding scheme is made at the system level as well as the object level
Storage
Storage of Unicode data:
Char/VarChar/CLOB FOR SBCS DATA - (7-bit) ASCII; this is a subset of UTF-8 - CCSID 367
Char/VarChar/CLOB FOR MIXED DATA - UTF-8 - CCSID 1208
Graphic/VarGraphic/DBCLOB - UTF-16 - CCSID 1200
Parsing
Data stored in Unicode tables in DB2 will be stored in one of the following CCSIDs: 367, 1208, 1200.
Parsing for DB2 V7 is done in EBCDIC. This means that all statements sent to DB2 will be converted to the system EBCDIC CCSID and then parsed. Since statements are converted to EBCDIC, there is a possibility of data loss during the conversion. Since all DB2 keywords, such as SELECT, are representable on all EBCDIC code pages, there shouldn't be a problem with the statement text itself; however, literals contained in the statement are subject to data loss. SQLCODE +355 has been added to alert the user to this sort of data loss
Catalog
The catalog will be encoded in the default EBCDIC CCSID
Object names will need to be convertible to the default EBCDIC CCSID: database names, table space/index space names, and external names (UDF, SP, exits, fieldprocs...). Identifiers may not be enterable from all clients, so they should really be limited to the common subset that is representable on all clients.
Unicode Literals
Literals
UTF-8 literals (char/varchar/clob) conform to the normal rules for character strings:
INSERT INTO T1 (C1) VALUES('123');
UTF-16 literals (graphic/vargraphic/dbclob) may be specified as UTF-8 literals or as graphic literals:
INSERT INTO T1 (C1) VALUES('123');
INSERT INTO T1 (C1) VALUES(G'...'); --Unicode graphic literal (the slide showed the characters U+BBC0 U+BBC1)
The DB2 V7 catalog will remain EBCDIC, so names stored in the catalog must also be convertible to EBCDIC without loss Since the DB2 catalog contains data stored in the system EBCDIC default CCSID, it is possible to have things like Japanese, or Cyrillic names for things like columns (depending on your system CCSID). However, since all clients may not be capable of producing these characters, users should really be limited to the subset of Latin-1 characters. Note: To store Japanese, Chinese, or Korean in the DB2 catalog, a MIXED=YES system is needed and the data being stored in the catalog must conform to the rules for well formed mixed data.
Specification of literals for Unicode tables is essentially the same as it is for ASCII or EBCDIC tables. The one thing to note here is that character literals may be specified for UTF-16 columns. These character literals may be used any place a graphic literal would normally be used. For instance in the values clause of an insert statement
Declare Variable
DECLARE VARIABLE statement New way to allow CCSID to be specified for host variables
Example EXEC SQL DECLARE :HV1 CCSID UNICODE; EXEC SQL DECLARE :HV2 CCSID 37; Precompiler directive to treat hostvar as a specific CCSID Useful for PREPARE/EXECUTE IMMEDIATE statement text EXEC SQL PREPARE S1 FROM :HV2; May be used with any character host variable on input or output
UTF-8 <-> UTF-16 Applications don't need to change just because the back end data store changes
When dealing with Unicode tables, we have torn down the barrier between CHAR and GRAPHIC. This means your back-end data store can be either UTF-8 or UTF-16, and you can use ASCII, EBCDIC, or Unicode character or graphic host variables; DB2 will perform the necessary conversions to/from the CCSID of the host variable even if the host variable doesn't match the column type. (For ASCII and EBCDIC back-end data stores, in most cases char and graphic are incompatible.)
The new DECLARE VARIABLE statement can be used to specify the CCSID of a particular host variable. This is a precompiler directive that causes the precompiler to specify the CCSID of the host variable in any SQLDA that the precompiler generates to reference the host variable. This directive works for both input and output host variables.
Application Encoding
Bind Option
Allows explicit specification of ES at an application level. Affects Static SQL - Provides default for dynamic System Default used if bind option not specified
Special Register: allows explicit specification of the ES at the application level. Affects dynamic SQL; initialized with the bind option. The bind option is ignored when packages are executed remotely: DRDA specifies the input CCSID, and data flows as-is to the client
Also new to DB2 V7 is the specification of Application Encoding Scheme. This allows a default Application Encoding to be specified Preset to EBCDIC The Application Encoding Scheme can also be specified on BIND PLAN or PACKAGE If not specified, the system default value is used for the bind option Plans/Packages bound prior to V7 are assumed to be EBCDIC The option applies to Static SQL The Application Encoding Scheme special register can be used to affect dynamic SQL Initial value is the value of the Bind Option.
In this example, package MY_PACK is bound with APPLICATION ENCODING(UNICODE) Character host variables will be treated as CCSID 1208 Graphic host variables will be treated as CCSID 1200 Initial value of APPLICATION ENCODING special register will be 1208 DECLARE VARIABLE statement can be used to override the bind option/special register for host variables For statements that use a DESCRIPTOR, as in FETCH USING DESCRIPTOR, CCSID overrides can be coded by hand in the SQLDA.
ODBC/SQLJ/JDBC
ODBC Support
Support for Wide Character API's (UCS2/UTF-16) See ODBC Guide and Reference (SC26-9941-01) Example
SQLRETURN SQLPrepare(
    SQLHSTMT    hstmt,
    SQLCHAR    *szSqlStr,
    SQLINTEGER  cbSqlStr);

SQLRETURN SQLPrepareW(
    SQLHSTMT    hstmt,
    SQLWCHAR   *szSqlStr,
    SQLINTEGER  cbSqlStr);
COBOL
Enterprise COBOL for z/OS and OS/390 V3R1 Supports Unicode
NATIONAL is used to declare UTF-16 variables: MY-UNISTR pic N(10). declares a UTF-16 variable. N and NX literals: N'123', NX'003100320033'. Conversions: NATIONAL-OF converts to UTF-16; DISPLAY-OF converts to a specific CCSID
SQLJ/JDBC Support
Remove current support for converting to EBCDIC before calling engine. Let DB2 engine determine where conversion is necessary
Greek-EBCDIC pic X(10) value " ".
UTF16STR pic N(10).
UTF8STR pic X(20).
Move Function National-of(Greek-EBCDIC, 00875) to UTF16STR.
Move Function Display-of(UTF16STR, 01208) to UTF8STR.
ODBC support for Unicode is included as part of the effort to support ODBC 3.0 SQLJ and JDBC already support Unicode, but changes have been made to exploit Unicode support in DB2 V7
Cobol has recently added support for Unicode characters Included in this support New NATIONAL data type N and NX literals Conversion operations More
Predicates
You cannot reference the DB2 catalog in a query against an ASCII or Unicode table
Like predicate
In Predicate
As in prior releases, you cannot reference tables from more than one encoding scheme in a single statement In the first example we fail because T1 and T2 are not of the same encoding scheme.
For queries against Unicode tables, Host variables used in a predicate may be specified as UTF-16 or UTF-8 regardless of the data type of the column. This allows the back end data store to change, without changing the application. When data is primarily Latin-1, it may be more efficient to store UTF-8 data When data is not primarily Latin-1, it may be more efficient to store data in UTF-16
Scalar Functions
Functions
LENGTH, SUBSTR, POSSTR, LOCATE - byte oriented for SBCS and Mixed (UTF-8); double-byte character oriented for DBCS (UTF-16)
Cast functions - UTF-16/UTF-8 are accepted anywhere char is accepted (char, date, time, integer...):
SELECT DATE(graphic_column) FROM T1;
SELECT INTEGER(graphic_column) FROM T1;
UTF-8 is the result data type/CCSID 1208 for character functions (char(float_col)...)
Routines
Routines UDFs, UDTFs, and SPs will all be enabled to allow Unicode parameters Parameters will be converted as necessary between char (UTF-8) and graphic (UTF-16) Date/Time/Timestamp passed as UTF-8 (ISO Format)
All Built In Functions (BIFs) have been extended to support Unicode Some BIFs, such as LENGTH, SUBSTR, POSSTR, and LOCATE are byte oriented for UTF-8 and Double-Byte character oriented for UTF-16 Many new functions were added in V7, the CCSID_ENCODING function has been added to help users determine the encoding, ASCII, EBCDIC, or UNICODE of a particular CCSID UTF-16 data is accepted in casting type functions such as DATE or INTEGER Result CCSIDs for functions that return character strings will return UTF-8/CCSID 1208
User written routines (User Defined Functions, User Defined Table Functions, and Stored Procedures) have been extended to support Unicode Parameters will be converted as necessary Date, Time, and Timestamp values are passed to the routine as a UTF-8 character string. These values will be in the ISO format as specified in the DB2 SQL Reference.
Utilities
Limits
Index key size - remains 255
Char limit - still 255 bytes
Varying-length string limit - still 32K bytes
Strings > 32K bytes - use LOBs
LOAD Utility: ASCII/EBCDIC <-> UNICODE
UNLOAD Utility: ASCII/EBCDIC <-> UNICODE; no support for SBCS/MIXED -> DBCS or DBCS -> SBCS/MIXED
The load utility has been extended to support conversion to and from Unicode. Additionally, the load utility will support conversion between character and graphic as long as conversion exists. Character in load dataset -> Graphic column Graphic in load dataset -> character column The unload utility, new for V7, supports conversion to/from Unicode, but does not support conversion between character and graphic
We haven't changed any of these limits; they are the same limits as in V6. The limit on index key sizes is something to watch out for: Unicode data can take from one to three times the space needed to store ASCII or EBCDIC data. For character strings longer than 255 bytes, use VARCHAR. Varying-length strings are still limited to 32704 bytes; for longer strings, use LOBs
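Because of this one-to-three-times expansion, it is worth checking the byte length of candidate key values before choosing UTF-8 for indexed columns. A hypothetical pre-check, sketched in Python (the 255-byte limit is the index key size from the slide; `utf8_key_fits` is not a DB2 function):

```python
# Does a candidate key value still fit a 255-byte index key limit
# once stored as UTF-8? (Illustrative sketch only.)
def utf8_key_fits(value: str, limit: int = 255) -> bool:
    return len(value.encode("utf-8")) <= limit

ascii_key = "A" * 200
accented_key = "\u00C5" * 200   # 200 characters, but 2 bytes each in UTF-8

print(utf8_key_fits(ascii_key))     # True  (200 bytes)
print(utf8_key_fits(accented_key))  # False (400 bytes)
```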
If the application uses UTF-8, DB2 tables should be UTF-8; if the application uses UTF-16, DB2 tables should be UTF-16.
Collation: Unicode collation is more like ASCII collation than EBCDIC - numbers come before letters, and upper case characters come before lower case. UTF-8 and UTF-16 collations are not the same if surrogates are involved.
UTF-8 and UTF-16 are very compatible, but the cost of conversion, even with HW support can be high. In addition to the conversion costs, there can be other effects such as predicate indexability that may be affected by a mismatch in data types. Matching applications and back end data store will optimize applications and provide the best performance. Collation of Unicode data will be more like ASCII data. When surrogates are involved, UTF-8 and UTF-16 do not collate the same.
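The surrogate caveat can be demonstrated by comparing raw encodings (Python illustration, not DB2 collation):

```python
# UTF-8 byte comparison follows code point order; UTF-16 code-unit
# comparison does not once surrogates are involved, because surrogate
# code units ('D800'x-'DFFF'x) sort below 'E000'x-'FFFF'x.
a = "\uFFFD"      # a BMP character near the top of the BMP
b = "\U00010000"  # the first supplementary character

assert a.encode("utf-8") < b.encode("utf-8")          # code point order
assert a.encode("utf-16-be") > b.encode("utf-16-be")  # order flips
print("UTF-8 and UTF-16 binary ordering differ when surrogates are involved")
```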
There are many issues that need to be dealt with in a Unicode environment Some of these are storage related and affect things like database definitions Some of these are application related and affect things like rendering characters for printing or display Expanding and contracting conversions are very common in Unicode environments. The sizes of these expansions and contractions are not easy to calculate because of things like combining characters When a connection is made between a client and server using DRDA, CCSIDs are exchanged. If a conversion is not available to convert between CCSIDs, the connection will fail. For conversions that would not normally be available, two stage converters can be used. For example - converters from Chinese CCSIDs to Japanese CCSIDs aren't normally available, however, we could convert from Chinese to Unicode and from Unicode to Japanese. It is possible to create two stage converters using the OS/390 support for Unicode.
SPUFI and DSNTEP2 are not designed to work with GRAPHIC data on a MIXED=NO subsystem. HEX constants are character based, not graphic based, so HEX constants should be used with care against UTF-16 data.
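The reason for the caution is that the same hex byte string names different characters depending on the encoding it is interpreted under. A hypothetical illustration (Python, not DB2 X'...' syntax):

```python
# The byte pair C3 84 is one character in both encodings below,
# but a *different* character in each.
raw = bytes.fromhex("c384")

print(raw.decode("utf-8"))      # 'Ä' (U+00C4): C3 84 is a 2-byte UTF-8 sequence
print(raw.decode("utf-16-be"))  # U+C384, a Hangul syllable: one UTF-16 code unit
```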
Example Scenarios

Pre-V7
[Slide diagram: DRDA clients at CCSID 850 and CCSID 819, and a 3270 terminal at CCSID 37, connecting to the host]
Even though 37 and 500 are both Latin-1 code pages and are compatible, you should not connect to a CCSID 500 system with a CCSID 37 emulator. Characters such as [, ], |, and ! are not represented the same in CCSID 37 and CCSID 500, so these characters will be corrupted when returned to another user via a DRDA or 3270 data stream.
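Python ships codecs for both code pages, so the mismatch is easy to see for yourself (cp037 and cp500 stand in for CCSID 37 and CCSID 500):

```python
# The variant characters land on different EBCDIC code points in
# CCSID 37 vs CCSID 500, so bytes written under one code page are
# read back as different characters under the other.
for ch in "[]|!":
    b37 = ch.encode("cp037")
    b500 = ch.encode("cp500")
    print(f"{ch!r}: CCSID 37 -> X'{b37.hex().upper()}', "
          f"CCSID 500 -> X'{b500.hex().upper()}'")
```

Invariant characters (letters, digits) encode identically in both, which is why the corruption only shows up on these few special characters.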
V7 and Beyond
[Slide diagram: DRDA clients at CCSID 850 and CCSID 912 (Czech), and a 3270 terminal at CCSID 37, connecting to the host]

Summary
Unicode Support in DB2 UDB for OS/390 V7; Things to look out for; Example Scenarios
With V7, connections from clients that were not possible in the past are now possible. This doesn't mean there are no challenges: conversions need to be defined, and correct behavior may depend on the application encoding bind option specification.
Appendix - Catalog
SQL Reference
Character sets and code pages
Character conversion
Conversion rules for string assignment
Conversion rules for string comparison
Character conversion in unions and concatenations
Selecting the result CCSID
SQL descriptor area (SQLDA)
Administration Guide
Choosing string or numeric data types
SYSPACKAGE
  RELBOUND (indicates the release when the plan was last bound or rebound)
  ENCODING_CCSID (bind option value)
SYSVIEWS
  RELCREATED (indicates the release when the view was created)
SYSTABLES
  RELCREATED (indicates the release when the table was created)
Updated columns (updated for ENCODING UNICODE):
  SYSDATABASE, SYSTABLESPACE, SYSTABLES, SYSPARMS, SYSDATATYPES
There are many sections in the DB2 documentation that deal with character conversion; some of the more important ones are shown here. I have listed the section titles for the books, but not page or section numbers, because these vary depending on the form of the book you use (paper, PDF, or BookManager). You should become familiar with these sections of the documentation if character conversion is occurring on your system.
There were many changes to the catalog for DB2 V7. The changes related to Unicode support are listed here.
References
DB2 UDB Server for OS/390 Version 7 and z/OS Presentation Guide
Redbook on DB2 V7, SG24-6121
The following two instructions are similar to CLCLE and MVCLE. DB2 uses them to perform comparison and padding on UTF-16 data because they allow a two-byte padding character to be specified. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on a zSeries 900 and OS/390 V2R8 or later:
CLCLU - Compare Logical Long Unicode
MVCLU - Move Logical Long Unicode
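The semantics these instructions provide can be sketched in software. This is an illustrative model of padded comparison (the two-byte pad being the UTF-16 space, U+0020), not the hardware implementation:

```python
# Model of CLCLU-style comparison: the shorter UTF-16 operand is
# logically extended with a two-byte pad character before a
# byte-wise compare, so trailing blanks are not significant.
def padded_compare_utf16(a: str, b: str, pad: str = "\u0020") -> int:
    """Return -1, 0, or 1 comparing a and b as blank-padded UTF-16BE."""
    ab = a.encode("utf-16-be")
    bb = b.encode("utf-16-be")
    width = max(len(ab), len(bb))
    pad_bytes = pad.encode("utf-16-be")
    ab += pad_bytes * ((width - len(ab)) // 2)
    bb += pad_bytes * ((width - len(bb)) // 2)
    return (ab > bb) - (ab < bb)

print(padded_compare_utf16("AB", "AB  "))  # 0: values compare equal
```

A one-byte pad (as CLCLE provides) would split UTF-16 code units, which is precisely why the Unicode variants of these instructions exist.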
These instructions pack and unpack ASCII (and Unicode UTF-8) and Unicode (UTF-16) data. They are used when DB2 converts a character string to a decimal or internal date/time/timestamp value, or a decimal value to a character string. DB2 UDB for OS/390 and z/OS V7 uses these instructions if running on a zSeries 900 and OS/390 V2R8 or later:
PKU - Pack Unicode
PKA - Pack ASCII
UNPKU - Unpack Unicode
UNPKA - Unpack ASCII
These instructions are all used when DB2 performs conversion. For instance, to convert from ASCII SBCS to Unicode UTF-16, DB2 uses TROT (translating one-byte characters to two-byte characters). DB2 UDB for z/OS & OS/390 V7 uses these instructions indirectly via the Conversion System Services (available in OS/390 V2R8 and above):
TRTT - Translate Two to Two
TRTO - Translate Two to One
TROT - Translate One to Two
TROO - Translate One to One
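A rough software analog of TROT shows the idea: a 256-entry table maps each single-byte code point to a two-byte UTF-16 code unit. Here Python's latin-1 codec stands in for an ASCII SBCS CCSID; the hardware instruction does the same table lookup per byte:

```python
# TROT analog: translate one-byte data to two-byte (UTF-16BE) output
# via a 256-entry translate table.
trot_table = [ord(bytes([i]).decode("latin-1")) for i in range(256)]

def trot(data: bytes) -> bytes:
    """Translate single-byte data to UTF-16BE using the table."""
    out = bytearray()
    for b in data:
        out += trot_table[b].to_bytes(2, "big")
    return bytes(out)

print(trot(b"DB2").hex())  # 004400420032
```

TRTO, TRTT, and TROO are the same pattern with different input/output widths (two-to-one, two-to-two, one-to-one).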
The z/Architecture zSeries 900 and zSeries 800 processors provide instructions that support Unicode data processing. The support is detailed here.
Appendix E of the DB2 UDB Administration Guide discusses NLS issues and how to set or override the code page at the client, using the codepage keyword on NT and the LANG variable on AIX.