Migrating a DB2 database from a Big Endian environment to a Little Endian environment


By Roger Sanders, DB2 for LUW Offering Manager, IBM

What Is Big-Endian and Little-Endian?

Big-endian and little-endian are terms that describe the order in which a sequence of bytes is stored in computer memory and, if desired, written to disk. (Interestingly, the terms come from Jonathan Swift’s Gulliver’s Travels, where the Big Endians were a political faction who broke their boiled eggs on the larger end, defying the Emperor’s edict that all eggs be broken on the smaller end; the Little Endians were the Lilliputians who complied with the Emperor’s law.)

Specifically, big-endian refers to the order where the most significant byte (MSB) in a sequence (i.e., the “big end”) is stored at the lowest memory address and the remaining bytes follow in decreasing order of significance. Figure 1 illustrates how a 32-bit integer would be stored if the big-endian byte order is used.

Figure 1. Big-endian byte order

For people who are accustomed to reading from left to right, big-endian seems like a natural way to store a string of characters or numbers; since data is stored in the order in which it would normally be presented, programmers can easily read and translate octal or hexadecimal data dumps. Another advantage of using big-endian storage is that the size of a number can be more easily estimated because the most significant digit comes first. It is also easy to tell whether a number is positive or negative: this information can be obtained by examining the bit at offset 0 (the most significant bit) of the byte stored at the lowest address.
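As a minimal sketch of the layout described above, the following Python snippet packs a 32-bit integer in big-endian order and checks the sign by looking only at the first byte. The value 0x0A0B0C0D is an arbitrary example (not necessarily the integer shown in Figure 1), and the helper name `is_negative_big_endian` is hypothetical.

```python
import struct

value = 0x0A0B0C0D  # arbitrary example value, chosen for illustration

# ">I" packs an unsigned 32-bit integer in big-endian order:
# the most significant byte (0x0A) lands at offset 0, the lowest address.
big = struct.pack(">I", value)
print(big.hex(" "))  # 0a 0b 0c 0d


def is_negative_big_endian(raw: bytes) -> bool:
    """Return True if a big-endian two's-complement integer is negative.

    Only the first byte (the one at the lowest address) is inspected.
    """
    return bool(raw[0] & 0x80)


print(is_negative_big_endian(struct.pack(">i", -5)))  # True
print(is_negative_big_endian(struct.pack(">i", 5)))   # False
```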

Little-endian, on the other hand, refers to the order where the least significant byte (LSB) in a sequence (i.e., the “little end”) is stored at the lowest memory address and the remaining bytes follow in increasing order of significance. Figure 2 illustrates how the same 32-bit integer presented earlier would be stored if the little-endian byte order were used.

Figure 2. Little-endian byte order

One argument for using the little-endian byte order is that the same value can be read from memory, at different lengths, without having to change addresses—in other words, the address of a value in memory remains the same, regardless of whether a 32-bit, 16-bit, or 8-bit value is read. For instance, the number 12 could be read as a 32-bit integer or an 8-bit character, simply by changing the fetch instruction used. Consequently, mathematical functions involving multiple precisions are much easier to write.
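The point about reading the same address at different widths can be sketched in Python with the standard `struct` module. The buffer below stands in for memory, and offset 0 plays the role of the address; this is an illustration, not a description of how any particular CPU fetches data.

```python
import struct

# The number 12 stored as a 32-bit little-endian integer: 0c 00 00 00
buf = struct.pack("<I", 12)

# Reading 32, 16, or 8 bits from the same offset (0) yields the same value.
as_32 = struct.unpack_from("<I", buf, 0)[0]
as_16 = struct.unpack_from("<H", buf, 0)[0]
as_8 = struct.unpack_from("<B", buf, 0)[0]
print(as_32, as_16, as_8)  # 12 12 12

# With big-endian storage (00 00 00 0c) the low-order byte sits at offset 3,
# so an 8-bit read from offset 0 does not see the value.
big = struct.pack(">I", 12)
big_first = struct.unpack_from(">B", big, 0)[0]
big_last = struct.unpack_from(">B", big, 3)[0]
print(big_first, big_last)  # 0 12
```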

Little-endian byte ordering also aids in the addition and subtraction of multi-byte numbers. When performing such operations, the computer must start with the least significant byte to see if there is a carry to a more significant byte—much like an individual will start with the rightmost digit when doing longhand addition to allow for any carryovers that may take place. By fetching bytes sequentially from memory, starting with the least significant byte, the computer can start doing the necessary arithmetic while the remaining bytes are read. This parallelism results in better performance; if the system had to wait until all bytes were fetched from memory, or fetch them in reverse order (which would be the case with big-endian), the operation would take longer.
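The carry-propagation argument can be illustrated with a small Python sketch that adds two little-endian integers one byte at a time, in the order the bytes appear in memory. The function name `add_little_endian` is hypothetical, and the sketch models only the ordering of the work, not the hardware parallelism.

```python
def add_little_endian(a: bytes, b: bytes) -> bytes:
    """Add two equal-length unsigned little-endian integers byte by byte.

    Overflow beyond the last byte is discarded (wrap-around).
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    result = bytearray(len(a))
    carry = 0
    for i in range(len(a)):  # offset 0 holds the least significant byte
        total = a[i] + b[i] + carry
        result[i] = total & 0xFF  # keep the low 8 bits in this position
        carry = total >> 8        # pass any overflow to the next byte
    return bytes(result)


x = (0x01FF).to_bytes(4, "little")
y = (0x0001).to_bytes(4, "little")
total = add_little_endian(x, y)
print(int.from_bytes(total, "little") == 0x0200)  # True
```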

IBM mainframes and most RISC-based computers (such as IBM Power Systems, Hewlett-Packard PA-RISC (HP 9000) servers, and Oracle SPARC servers) utilize big-endian byte ordering. Computers with Intel and AMD processors (CPUs) use little-endian byte ordering instead.
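When planning a migration it can help to confirm the byte order of the source and target hosts. A minimal sketch, assuming Python is available on both machines:

```python
import struct
import sys

# sys.byteorder reports the native byte order of the host: "little" or "big".
print(sys.byteorder)

# The same check done by hand: pack the number 1 in native byte order ("=")
# and see whether the low-order byte ended up at offset 0.
native = struct.pack("=H", 1)
host_is_little_endian = native[0] == 1
print(host_is_little_endian)
```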

It is important to note that regardless of whether big-endian or little-endian byte ordering is used, the bits within each byte are usually stored as big-endian. That is, there is no attempt to reverse the order of the bit stream that is represented by a single byte. So, whether the hexadecimal value ‘CD’, for example, is stored at the lowest memory address or the highest memory address, the bit order for the byte will always be 1100 1101.
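The following sketch shows that changing the byte order moves the byte 0xCD to a different offset but leaves the value of the byte itself untouched. It works at the level of byte values as Python exposes them, so it illustrates the idea rather than the physical bit layout of any device.

```python
import struct

value = 0x00CD  # 16-bit example containing the byte 0xCD

for fmt in (">H", "<H"):
    raw = struct.pack(fmt, value)
    # Locate the 0xCD byte and print its bits, most significant bit first.
    offset = raw.index(0xCD)
    print(fmt, "offset", offset, format(raw[offset], "08b"))
# >H offset 1 11001101
# <H offset 0 11001101
```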

Moving a DB2 Database To a System With a Different Endian Format

One of the easiest ways to move a DB2 database from one platform to another is by creating a full, offline backup image of the database to be moved and restoring that image onto the new platform. However, this process can only be used if the endianness of the source and target platforms is the same. A change in endian format requires a complete unload and reload of the database, which can be done using the DB2 data movement utilities. Replication-based technologies like SQL Replication, Q Replication, and Change Data Capture (CDC), which transform log records into SQL statements that can be applied to a target database, can be used for these types of migrations as well. On the other hand, DB2 High Availability Disaster Recovery (HADR) cannot be used because HADR replicates the internal format of the data, thereby maintaining the underlying endian format.

The DB2 Data Movement Utilities (and the File Formats They Support)

DB2 comes equipped with several utilities that can be used to transfer data between databases and external files. This set of utilities consists of:

  • The Export utility: Extracts data from a database using an SQL query or an XQuery statement, and copies that information to an external file.
  • The Import utility: Copies data from an external file to a table, hierarchy, view, or nickname using INSERT SQL statements. If the object receiving the data is already populated, the input data can either replace or be appended to the existing data.
  • The Load utility: Efficiently moves large quantities of data from an external file, named pipe, device, or cursor into a target table. The Load utility is faster than the Import utility because it writes formatted pages directly into the database, instead of performing multiple INSERT operations.
  • The Ingest utility: A high-speed, client-side utility that streams data from files and named pipes into target tables.
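As a rough sketch of how the Export, Import, and Load utilities are typically invoked, the Python helpers below assemble the corresponding DB2 command line processor statements as strings. The file name `dept.del`, the table name `department`, and the helper names are hypothetical, and only the basic form of each command is shown; the full syntax offers many more options.

```python
def export_command(path: str, file_type: str, query: str) -> str:
    """Build a basic EXPORT statement that writes a query result to a file."""
    return f"EXPORT TO {path} OF {file_type} {query}"


def import_command(path: str, file_type: str, table: str, mode: str = "INSERT") -> str:
    """Build a basic IMPORT statement (mode could be INSERT or REPLACE, for example)."""
    return f"IMPORT FROM {path} OF {file_type} {mode} INTO {table}"


def load_command(path: str, file_type: str, table: str, mode: str = "INSERT") -> str:
    """Build a basic LOAD statement (mode could be INSERT or REPLACE, for example)."""
    return f"LOAD FROM {path} OF {file_type} {mode} INTO {table}"


# Hypothetical file and table names, for illustration only.
print(export_command("dept.del", "DEL", "SELECT * FROM department"))
print(import_command("dept.del", "DEL", "department"))
print(load_command("dept.del", "DEL", "department", mode="REPLACE"))
```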

Along with these built-in utilities, IBM InfoSphere Optim High Performance Unload for DB2 for Linux, UNIX and Windows, an add-on tool that must be purchased separately, can be used to rapidly unload, extract, and repartition data in a DB2 database. Designed to improve data availability, mitigate risk, and accelerate database migrations, this tool helps DBAs work with very large quantities of data with less effort and faster results.

Regardless of which utility is used, data can only be written to or read from files that utilize one of the following formats:

  • Delimited ASCII
  • Non-delimited or fixed-length ASCII
  • PC Integrated Exchange Format
  • Extensible Markup Language (IBM InfoSphere Optim High Performance Unload for DB2 for Linux, UNIX and Windows only.)

Delimited ASCII (DEL)

The delimited ASCII file format is used by a wide variety of software applications to exchange data. With this format, data values typically vary in length, and a delimiter, which is a unique character not found in the data values themselves, is used to separate individual values and rows. Actually, delimited ASCII format files typically use three distinct delimiters:

  • Column delimiters. Characters that are used to mark the beginning or end of a data value. Commas (,) are typically used as column delimiter characters.
  • Row delimiters. Characters that are used to mark the end of a single record or row. On UNIX systems, the new line character (0x0A) is typically used as the row delimiter; on Windows systems, the carriage return/linefeed characters (0x0D–0x0A) are normally used instead.
  • Character delimiters. Characters that are used to mark the beginning and end of character data values. Single quotes (') and double quotes (") are typically used as character delimiter characters.

Typically, when data is written to a delimited ASCII file, rows are streamed into the file, one after another. The appropriate column delimiter is used to separate each column’s data values, the appropriate row delimiter is used to separate each individual record (row), and all character and character string values are enclosed with the appropriate character delimiters. Numeric values are represented by their ASCII equivalent—the period character (.) is used to denote the decimal point (if appropriate); real values are represented with scientific notation (E); negative values are preceded by the minus character (-); and positive values may or may not be preceded by the plus character (+).

For instance, if the comma character is used as the column delimiter, the carriage return/line feed character is used as the row delimiter, and the double quote character is used as the character delimiter, the contents of a delimited ASCII file might look something like this:

    10,"Headquarters",860,"Corporate","New York"
    15,"Research",150,"Eastern","Boston"
    20,"Legal",40,"Eastern","Washington"
    38,"Support Center 1",80,"Eastern","Atlanta"
    42,"Manufacturing",100,"Midwest","Chicago"
    51,"Training Center",34,"Midwest","Dallas"
    66,"Support Center 2",112,"Western","San Francisco"
    84,"Distribution",290,"Western","Denver"
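Because the delimited ASCII format is plain text, a file like the one above can be inspected with ordinary tools before it is imported or loaded. A minimal sketch using Python's standard `csv` module, with three of the sample rows embedded as a string; treating the first and third columns as integers is an assumption made for this example.

```python
import csv
import io

# Three sample rows, using CR/LF as the row delimiter.
del_data = (
    '10,"Headquarters",860,"Corporate","New York"\r\n'
    '15,"Research",150,"Eastern","Boston"\r\n'
    '20,"Legal",40,"Eastern","Washington"\r\n'
)

# Comma is the column delimiter and the double quote is the character delimiter.
reader = csv.reader(io.StringIO(del_data, newline=""), delimiter=",", quotechar='"')

rows = []
for fields in reader:
    # Every value arrives as text; convert the numeric columns explicitly.
    rows.append((int(fields[0]), fields[1], int(fields[2]), fields[3], fields[4]))

for row in rows:
    print(row)
```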

Non-Delimited ASCII (ASC)

With the non-delimited ASCII file format, data values have a fixed length, and the position of each value in the file determines which column and row a particular value belongs to.

When data is written to a non-delimited ASCII file, rows are streamed into the file, one after another, and each column’s data value is written using a fixed number of bytes. (If a data value is smaller than the fixed length allotted for a particular column, it is padded with blanks.) As with delimited ASCII files, a row delimiter is used to separate each individual record (row): on UNIX systems, the new line character (0x0A) is typically used; on Windows systems, the carriage return/linefeed characters (0x0D–0x0A) are used instead. Numeric values are treated the same as when they are stored in delimited ASCII format files.

Thus, a simple non-delimited ASCII file might look something like this:

    10Headquarters     860Corporate New York
    15Research         150Eastern   Boston
    20Legal            40 Eastern   Washington
    38Support Center 1 80 Eastern   Atlanta
    42Manufacturing    100Midwest   Chicago
    51Training Center  34 Midwest   Dallas
    66Support Center 2 112Western   San Francisco
    84Distribution     290Western   Denver
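The role of column positions in a non-delimited ASCII file can be sketched in Python by writing records with fixed widths and then slicing them back apart. The column widths (2, 17, 3, and 10 bytes, with the last column running to the end of the line) and the helper names are assumptions made for this example; a real file uses whatever positions were specified when it was created.

```python
# (start, end) positions of each column; assumed for this sketch.
FIELDS = [(0, 2), (2, 19), (19, 22), (22, 32), (32, None)]

records = [
    ("10", "Headquarters", "860", "Corporate", "New York"),
    ("20", "Legal", "40", "Eastern", "Washington"),
    ("66", "Support Center 2", "112", "Western", "San Francisco"),
]


def format_fixed(record: tuple) -> str:
    """Write one row, padding every column except the last with blanks."""
    parts = []
    for value, (start, end) in zip(record, FIELDS):
        parts.append(value if end is None else value.ljust(end - start))
    return "".join(parts)


def parse_fixed(line: str) -> tuple:
    """Recover the column values by position and strip the blank padding."""
    return tuple(line[start:end].rstrip() for start, end in FIELDS)


lines = [format_fixed(record) for record in records]
for line in lines:
    print(line)

parsed = [parse_fixed(line) for line in lines]
print(parsed == records)  # True
```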

 

PC Integrated Exchange Format (IXF)

The PC Integrated Exchange Format is a special file format that is used almost exclusively to move data between different DB2 databases. Typically, when data is written to a PC Integrated Exchange Format file, rows are streamed into the file, one after another, as an unbroken sequence of variable-length records. Character data values are stored in their original ASCII representation (without additional padding), and numeric values are stored as either packed decimal values or as binary values, depending upon the data type used to store them in the database. Along with data, table definitions and associated index definitions are also stored in PC Integrated Exchange Format files. Thus, tables and any corresponding indexes can be both defined and populated when this file format is used.
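For readers unfamiliar with packed decimal, the sketch below shows the conventional encoding: two decimal digits per byte, followed by a sign nibble (0xC for positive, 0xD for negative). The article does not spell out the exact record layout that IXF files use, so this is only an illustration of the general technique; the function names are hypothetical.

```python
def pack_decimal(number: int) -> bytes:
    """Encode an integer as packed decimal (BCD digits plus a sign nibble)."""
    digits = str(abs(number))
    sign = 0xD if number < 0 else 0xC
    # Digits plus the sign nibble must fill whole bytes; pad with a leading zero.
    if len(digits) % 2 == 0:
        digits = "0" + digits
    nibbles = [int(d) for d in digits] + [sign]
    return bytes((nibbles[i] << 4) | nibbles[i + 1] for i in range(0, len(nibbles), 2))


def unpack_decimal(raw: bytes) -> int:
    """Decode a packed decimal value produced by pack_decimal."""
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign = nibbles.pop()  # the last nibble carries the sign
    magnitude = int("".join(str(n) for n in nibbles))
    return -magnitude if sign == 0xD else magnitude


print(pack_decimal(12345).hex(" "))  # 12 34 5c
print(pack_decimal(-123).hex(" "))   # 12 3d
print(unpack_decimal(pack_decimal(-123)))  # -123
```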

Extensible Markup Language (XML)

Extensible Markup Language (XML) is a simple, yet flexible text format that provides a neutral way to exchange data between different devices, systems, and applications. Originally designed to meet the challenges of large-scale electronic publishing, XML is playing an increasingly important role in the exchange of data on the web and throughout companies. XML data is maintained in a self-describing format that is hierarchical in nature. Thus, a very simple XML file might look something like this:

    <?xml version="1.0" encoding="UTF-8" ?>
    <customerinfo>
        <name>John Doe</name>
        <addr country="United States">
            <street>25 East Creek Drive</street>
            <city>Raleigh</city>
            <state-prov>North Carolina</state-prov>
            <zip-pcode>27603</zip-pcode>
        </addr>
        <phone type="work">919-555-1212</phone>
        <email>john.doe@xyz.com</email>
    </customerinfo>
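The self-describing, hierarchical nature of this sample can be seen by reading it with a general-purpose XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` module, with the sample document embedded as a string:

```python
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" encoding="UTF-8" ?>
<customerinfo>
  <name>John Doe</name>
  <addr country="United States">
    <street>25 East Creek Drive</street>
    <city>Raleigh</city>
    <state-prov>North Carolina</state-prov>
    <zip-pcode>27603</zip-pcode>
  </addr>
  <phone type="work">919-555-1212</phone>
  <email>john.doe@xyz.com</email>
</customerinfo>"""

# Parse from bytes so the encoding named in the XML declaration is honored.
root = ET.fromstring(xml_text.encode("utf-8"))

# Element names and attributes describe the data; no external schema is needed.
name = root.findtext("name")
country = root.find("addr").get("country")
city = root.findtext("addr/city")
phone_type = root.find("phone").get("type")

print(name, "|", city, "|", country, "|", phone_type)
# John Doe | Raleigh | United States | work
```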

As noted earlier, only IBM InfoSphere Optim High Performance Unload for DB2 for Linux, UNIX and Windows can work with XML files.

db2move and db2look

As you might imagine, the Export utility, together with the Import utility or the Load utility, can be used to copy a table from one database to another. These same tools can also be used to move an entire database from one platform to another, one table at a time. But a more efficient way to move an entire DB2 database is by using the db2move utility. This utility queries the system catalog of a specified database and compiles a list of all user tables found. Then it exports the contents and definition of each table found to individual PC Integrated Exchange Format (IXF) formatted files. The set of files produced can then be imported or loaded into another DB2 database on the same system, or they can be transferred to another server and be imported or loaded to a DB2 database residing there.

The db2move utility can be run in one of four different modes: EXPORT, IMPORT, LOAD, or COPY. When run in EXPORT mode, db2move utilizes the Export utility to extract data from a database’s tables and externalize it to a set of files. It also generates a file named db2move.lst that contains the names of all of the tables that were processed, along with the names of the files that each table’s data was written to. The db2move utility may also produce one or more message files containing warning or error messages that were generated as a result of the Export operation.

When run in IMPORT mode, db2move uses the file db2move.lst to establish a link between the PC Integrated Exchange Format (IXF) formatted files needed and the tables into which data is to be populated. It then invokes the Import utility to recreate each table and its associated indexes using information stored in the external files.

And, when run in LOAD mode, db2move invokes the Load utility to populate tables that already exist with data stored in PC Integrated Exchange Format (IXF) formatted files. (LOAD mode should never be used to populate a database that does not already contain table definitions.) Again, the file db2move.lst is used to establish a link between the external files used and the tables into which their data is to be loaded.
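The three modes described above can be sketched as a small Python wrapper that assembles db2move command lines. The database names `SRCDB` and `TGTDB` and the helper names are hypothetical; `-lo REPLACE` is shown as one possible load option. The `run` helper needs a working DB2 environment, so it is defined but not called here.

```python
import subprocess
from typing import List

VALID_ACTIONS = {"export", "import", "load", "copy"}


def db2move_command(database: str, action: str, *options: str) -> List[str]:
    """Build the argument list for a db2move invocation."""
    if action.lower() not in VALID_ACTIONS:
        raise ValueError(f"unsupported db2move action: {action}")
    return ["db2move", database, action.lower(), *options]


def run(command: List[str]) -> None:
    """Execute a command; requires a DB2 environment, so it is not called here."""
    subprocess.run(command, check=True)


# On the source server: writes one IXF file per table plus db2move.lst.
export_cmd = db2move_command("SRCDB", "export")

# On the target server, from the directory holding the transferred files:
import_cmd = db2move_command("TGTDB", "import")  # recreates tables and indexes
load_cmd = db2move_command("TGTDB", "load", "-lo", "REPLACE")  # tables must already exist

for cmd in (export_cmd, import_cmd, load_cmd):
    print(" ".join(cmd))
```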

Unfortunately, the db2move utility can only be used to move table and index objects. And if the database to be migrated contains other objects such as aliases, views, triggers, user-defined data types (UDTs), user-defined functions (UDFs), and stored procedures, you must duplicate those objects in the target database as well. That’s where the db2look utility comes in handy. When invoked, db2look can reverse-engineer an existing database and produce a set of Data Definition Language (DDL) SQL statements that can then be used to recreate all of the data objects found in the database that was analyzed. The db2look utility can also collect environment registry variable settings, configuration parameter settings, and statistical (RUNSTATS) information, which can be used to duplicate a DB2 environment on another system.
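A db2look invocation for this purpose might be assembled as in the sketch below. The database name `SRCDB`, the output file names, and the helper name are hypothetical; the options shown (`-e` to extract DDL, `-f` for configuration parameters and registry variables, `-m` for statistics, `-o` for the output file) are a small subset of what the utility accepts.

```python
from typing import List


def db2look_command(database: str, output_file: str, ddl: bool = True,
                    config: bool = False, statistics: bool = False) -> List[str]:
    """Build the argument list for a db2look invocation."""
    command = ["db2look", "-d", database]
    if ddl:
        command.append("-e")  # extract DDL for the database objects
    if config:
        command.append("-f")  # configuration parameters and registry variables
    if statistics:
        command.append("-m")  # statements that reproduce catalog statistics
    command += ["-o", output_file]
    return command


ddl_cmd = db2look_command("SRCDB", "srcdb_ddl.sql")
env_cmd = db2look_command("SRCDB", "srcdb_env.sql", ddl=False, config=True, statistics=True)

print(" ".join(ddl_cmd))  # db2look -d SRCDB -e -o srcdb_ddl.sql
print(" ".join(env_cmd))  # db2look -d SRCDB -f -m -o srcdb_env.sql
```

The generated script would then typically be run against the target database (for example with the DB2 command line processor) before db2move is used in LOAD mode, so that the tables and other objects already exist.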