
SYBASE REPLICATION SERVER PERFORMANCE AND TUNING

Understanding and Achieving Optimal Performance with Sybase Replication Server

ver 2.0.1


Table of Contents

Table of Contents ..... i
Author's Note ..... iii
Introduction ..... 1
Document Scope ..... 1
Major Changes in this Document ..... 2
Overview and Review ..... 5
Replication System Components ..... 5
RSSD or Embedded RSSD (eRSSD) ..... 6
Replication Server Internal Processing ..... 7
Analyzing Replication System Performance ..... 10
Primary Dataserver/Database ..... 13
Dataserver Configuration Parameters ..... 13
Primary Database Transaction Log ..... 14
Application/Database Design ..... 15
Replication Agent Processing ..... 29
Secondary Truncation Point Management ..... 29
Rep Agent LTL Generation ..... 31
Replication Agent Communications ..... 34
Replication Agent Tuning ..... 34
Replication Agent Troubleshooting ..... 41
Replication Server General Tuning ..... 53
Replication Server/RSSD Hosting ..... 53
RS Generic Tuning ..... 55
RSSD Generic Tuning ..... 63
STS Tuning ..... 63
RSM/SMS Monitoring ..... 66
RS Monitor Counters ..... 67
Impact on Replication ..... 75
RS M&C Analysis Repository ..... 76
RS_Ticket ..... 77
Inbound Processing ..... 87
RepAgent User (Executor) ..... 87
SQM Processing ..... 97
SQT Processing ..... 113
Distributor (DIST) Processing ..... 127
Minimal Column Replication ..... 141
Outbound Queue Processing ..... 145
DSI SQM Processing ..... 147
DSI SQT Processing ..... 148
DSI Transaction Grouping ..... 155
DSIEXEC Function String Generation ..... 165
DSIEXEC Command Batching ..... 172
DSIEXEC Execution ..... 179
DSIEXEC Execution Monitor Counters ..... 180
DSI Post-Execution Processing ..... 183
End-to-End Summary ..... 184
Replicate Dataserver/Database ..... 187
Maintenance User Performance Monitoring ..... 187
Warm Standby, MSA and the Need for RepDefs ..... 192
Query Related Causes ..... 194
Triggers & Stored Procedures ..... 196
Concurrency Issues ..... 199
Procedure Replication ..... 201
Procedure vs. Table Replication ..... 201
Procedure Replication & Performance ..... 202
Procedure Transaction Control ..... 207
Procedures & Grouped Transactions ..... 210
Procedures with Select/Into ..... 210
Replication Routes ..... 217
Routing Architectures ..... 217
Routing Internals ..... 225
Routing Performance Advantages ..... 229
Routing Performance Tuning ..... 229
Parallel DSI Performance ..... 233
Need for Parallel DSI ..... 233
Parallel DSI Internals ..... 234
Serialization Methods ..... 244
Transaction Execution Sequence ..... 249
Large Transaction Processing ..... 253
Maximizing Performance with Parallel DSIs ..... 259
Tuning Parallel DSIs with Monitor Counters ..... 265
Text/Image Replication ..... 273
Text/Image Datatype Support ..... 273
RS Implementation & Internals ..... 275
Performance Implications ..... 282
Asynchronous Request Functions ..... 283
Purpose ..... 283
Implementation & Internals ..... 285
Performance Implications ..... 287
Multiple DSIs ..... 289
Concepts & Terminology ..... 289
Performance Benefits ..... 289
Implementation ..... 290
Business Cases ..... 305
Integration with EAI ..... 309
Replication vs. Messaging ..... 309
Integrating Replication & Messaging ..... 312
Performance Benefits of Integration ..... 312
Messaging Conclusion ..... 313


Author's Note
Thinking is hard work... Silver Bullets are much easier.

Several years ago, when Replication Server 11.0 was fairly new, Replication Server Engineering (RSE) collaborated on a paper that was a help to us all. Since that time, Replication Server has gone through several releases and Replication Server Engineering has been too busy keeping up with the advances in Adaptive Server Enterprise and the future of Replication Server to maintain the document. However, requests for a paper such as this have been frequent, both internally as well as from customers. Hopefully, this paper will satisfy those requests. But as the above comment suggests, reading this paper will require extensive thinking (and considerable time). Anyone hoping for a silver bullet does not belong in the IT industry.

This paper was written for and addresses the functionality in Replication Server 12.6 and 15.0 with Adaptive Server Enterprise 12.5.2 through 15.0.1 (Rep Agent and MDA tables). As the Replication Server product continues to be developed and improved, it is likely that later improvements to the product may supersede the recommendations contained in this paper.

It is assumed that the reader is familiar with Replication Server terminology, internal processing and, in general, the contents of the Replication Server System Administration Guide. In addition, basic Adaptive Server Enterprise performance and tuning knowledge is considered critical to the success of any replication system performance analysis.

This document could not have been achieved without the considerable contributions of the Replication Server engineering team, Technical Support Engineers, and the collective Replication Server community of consultants, educators, etc. who are always willing to share their knowledge. Thank you.

Document Version: 2.0.1
January 7, 2007


Introduction
Just How Fast Is It?
This question gets asked constantly. Unfortunately, there are no standard benchmarks such as TPC-C for replication technologies, and RSE does not have the bandwidth or resources to do benchmarking. Consequently, the stock RSE reply used to be 5MB/min (or 300MB/hr) based on their limited testing on development machines (small ones at that). However, Replication Server has been clocked at 2.4GB/hr sustained in a 1.2TB database, and more than 40GB has been replicated in a single day into the same 1.2TB database (RS 12.0 and ASE 11.9.3 on Compaq Alpha GS140s, for the curious). Additionally, some customers have claimed that by using multiple DSIs they have achieved 10,000,000 transactions an hour. Although this sounds unrealistic, a monitored benchmark in 1995 using Replication Server 10.5 achieved 4,000,000 transactions (each with 10 write operations) a day from the source replicating to three destinations (each with only 5 DSIs) for a total delivery of 12,000,000 transactions per day (containing 120,000,000 write operations). Lately, RS 12.6 has been able to sustain ~3,000 rows/sec on a dual 3.0 GHz P4 XEON with internal SCSI disks.

As usual, your results may vary. Significantly. It all depends. And every other trite caveat muttered by a tuning guru/educator/consultant. Of course, your expectations also need to be realistic. Product management recently got a call from a customer asking if Replication Server could replicate 20GB of data in 15 minutes. The reality is that this is likely not achievable even using raw file I/O streaming commands such as the Unix dd command, let alone via a process that needs to inspect the data values and decide on subscription rules.

Replication Server is a highly configurable and highly tunable product. However, that places considerable responsibility on the system designers and implementers to design and implement an efficient data movement strategy, as well as on operations staff to monitor, tune and adjust the implementation as necessary. The goal of this paper is to educate, so that the reader understands why they may be seeing the performance they are, and to suggest possible avenues to explore with the goal of improved performance without resorting to the old tried-and-true trial-and-error stumble-fumble. Because performance and tuning is so situationally dependent, it is doubtful that attempting to read this paper in a single sitting will be beneficial. Those familiar with Replication Server may want to skip to the specific detail sections that are applicable to their situation.

Document Scope

Before we begin, however, it is best to lay some ground rules about what to expect or not to expect from this paper. Focusing on the latter:

- This paper will not discuss database server performance and tuning (although it frequently is the cause of poor replication performance) except as required for replication processing.
- This paper will not discuss non-ASE RepAgent performance (perhaps it will in a future version) except where such statements can be made generically about RepAgents.
- This paper will not discuss Replication Server Manager.
- This paper will not discuss how to benchmark a replicated system.
- This paper will not discuss Replication Server system administration.

Now that we know what we are going to skip, here is what we will cover:

- This paper will discuss all of the components in a replication system and how each impacts performance.
- This paper will discuss the internal processing of the Replication Server, ASE's Replication Agent and the corresponding tuning parameters that are specific to performance.

It is expected that the reader is already familiar with Replication Server internal processing and basic replication terminology as described in the product manuals. This paper focuses heavily on Replication Server in an Adaptive Server Enterprise environment. In the future, it is expected that this paper will be expanded to cover several topics only lightly addressed in this version or not addressed at all. In the past, this list mostly focused on broader topics such as routing and heterogeneous replication. Routing has since been added, while heterogeneous replication has since been documented in the Replication Server documentation. As a result, future topics will likely be new features added to existing functionality, much like the discussions of DSI partitioning (new in 12.5) and DSI commit control (new in 12.6) that have been added to the Parallel DSI sections.

Major Changes in this Document

Because many people have read earlier versions of this document, the following sections list the topics added to the respective sections. This will aid readers by allowing them to skip to the applicable sections to read the updated information. An attempt was made to red-line the changed sections, including minor changes not noted above. However, this document is produced using MS Word, which provides extremely rudimentary, inconsistent (and sometimes not persistent) and unreliable red-lining capabilities (it also crashes frequently during spell checking and hates numerical list formats... one wonders how Microsoft produces their own documentation with such unreliable tools). As a result, red-lining will not be used to denote changes.

Updates 1.6 to 1.9

The following additions were made to this document in v1.9 as compared to v1.6:

- Batch processing: Added NT informal benchmark with 750-1,000 rows/second.
- Batch processing: Added trick to show how to replicate the SQL statement itself instead of the rows.
- Batch processing: Added discussion about ignore_dupe_key and CLR records, with impact on RS.
- Rep Agent processing: Added description of sp_help_rep_agent dbname, scan with an example to clarify output of start/end/current markers and log recs scanned.
- Monitors & Counters: Added information about the join to rs_databases and recommendation to increase RSSD size, add a view to span counter tables, etc.
- Rep Agent User Thread: Expanded section to include processing & diagram.
- SQM Thread: Added diagram to illustrate message queues.
- DIST Thread: Expanded discussion on SRE, TD & MD.
- Parallel DSI: Expanded discussion on transaction execution sequence to cover disappearing updates more thoroughly.
- Routing: Added section.
- RS & EAI: Added section.

Updates 1.9 to 2.0

The following additions were made to this document in v2.0 as compared to v1.9:

- RS Overview: Added description of the embedded RSSD.
- RS Internals: Discussion on the SMP feature and internal threading.
- Application Design: Impact of "Chained Mode" on RepAgent throughput and RS processing.
- Application Design: Further emphasized the impact of high-impact SQL statements and the fact that the latency is driven by the replicate DBMS vs. RS itself, including a benchmark from a financial trading system.
- Rep Agent Tuning: Added discussion on sp_sysmon repagent output as well as using MDA tables.
- RS General Tuning: Discussion on the SMP feature and its impact on configuration parameters such as num_mutexes, etc.
- RS General Tuning: Added discussion about rs_ticket.
- RS General Tuning: Added 12.6 and 15.0 counters to each section with samples.
- RS General Tuning: Discussion about the embedded RSSD & tuning.
- Routes: Added 12.6 and 15.0 counters and discussion about load balancing using multiple routes in multi-database configurations.
- Parallel DSI: Updated for commit control.
- Parallel DSI: Added discussion about MDA-based monitor tables to detect contention, SQL tracking, and RS performance metrics.
- Replicate Dataserver/Database: Removed the somewhat outdated section on Historical Server and added new material on monitoring with MDA tables, and in particular a lot of detail on using the WaitEvents and the monOpenObjectActivity/monSysStatement tables. Because of the depth of detail, this not only replaces the section on the legacy Historical Server, but also replaces the section on the replicate DBMS resources.
- Procedure Replication: Added discussion on using procedures to emulate dynamic SQL (fully prepared statements) and the resulting performance gains at the replicate database.
- Text Replication: Added discussion about changes in ASE 15.0.1 that allow the use of a global unique nonclustered index on the text pointer instead of the mass TIPSA update when marking tables with text for replication.


Overview and Review


Where Do We Start?
Unfortunately, this is the same question that is asked by someone faced with the task of finding and resolving throughput performance problems in a distributed system. The last words of that sentence hold the key: it's a distributed system. That means that there are lots of pieces and parts that contribute to Replication Server performance, most of which are outside of the Replication Server. After the system has been in operation, there are several RS commands that will help isolate where to begin. However, if you are just designing the system and wish to take performance into consideration during the design phase (always a must for scalable systems), then the easiest place to begin is the beginning. Accordingly, this paper will attempt to trace a data bit being replicated through the system. Along the way, the various threads, processes, etc. will be described to help the reader understand what is happening (or should be happening) at each stage of data movement. After getting the data to the replicate site, a number of topics will be discussed in greater detail. These topics include text/image replication, parallel DSIs, etc. A quick review of the components in a replication system and the internal processing within Replication Server is provided in the next sections.

Replication System Components

The components in a basic replication system are illustrated below. For clarity, the same abbreviations used in the product manuals as well as in educational materials are used. The only addition over the pictures in the product manuals is the inclusion of SMS (in particular, Replication Server Manager (RSM)) and the inclusion of the host for the RS/RSSD.

Figure 1 Components of a Simple Replication System


Of course, the above is extremely simple: the basic single-direction primary-to-replicate distributed system, one example of which is the typical Warm Standby configuration. Whether for performance reasons or due to architectural requirements, the system design often involves more than one RS. A quick illustration is included below:

Figure 2 Components of a Replication System Involving More Than One RS


The above is still fairly basic. Today, some customers have progressed to multi-level tree-like structures or virtual networks exploiting high-speed bandwidth backbones to form information buses.

RSSD or Embedded RSSD (eRSSD)

Those familiar with RS from the past have always been aware that the RS required an ASE engine for managing the RSSD. Starting with version 12.6, DBAs now have a choice of using the older ASE-based RSSD implementation or the new embedded RSSD. The eRSSD is an ASA-based implementation that offers the following benefits:

- Easier to manage: much of the DBA work associated with managing the DBMS for the RSSD has been built in to the RS. This includes:
  o RS will automatically start and stop the eRSSD DBMS.
  o The eRSSD will automatically grow as space is required - a useful feature when doing extensive monitoring using monitor counters.
  o The eRSSD transaction log is automatically managed - eliminating RS crashes due to log suspend, or the dangerous practice of truncate log on checkpoint.
- Reduced impact on smaller single- or dual-CPU implementations: ASE as a DBMS is tuned to consume every resource it can, and even when not busy, ASE will "spin" looking for work. Consequently, ASE as an RSSD platform can lead to a "heavy" CPU and memory footprint in smaller implementations, robbing memory or CPU resources from the RS itself.
- With RS 15.0, the added capability of routing with an embedded RSSD removes any architectural advantage over using ASE.
- Since an ASA database is bi-endian, migrating RS between different platforms is much simpler than the cross-platform dump/load (XPDL) procedure for ASE (although manual steps may be required in either situation).
- Benchmarks using an eRSSD vs. an RSSD have shown no difference in performance impact. While theoretical design and architectures would allow an ASE system to outscale an ASA-based system, the RS's primary RSSD user activity does not reach the levels that would distinguish the two.

As a result, the only reason that might tip a DBA toward using ASE for the RSSD in a new installation using RS 15 is simply familiarity. One other difference is that tools and components shipped with ASE - such as the ASE Sybase Central plug-in - allow DBAs to connect to the ASE RSSD to view objects and data. This is especially useful when wanting to reverse engineer RSSD procedures or quickly view data in one of the tables. The similar Sybase Central ASA plug-in is not shipped with Replication Server. One way of obtaining the same tools is to simply download the SQL Anywhere Developer's Edition, which, as of this writing, is free.

Replication Server Internal Processing

When hearing the term internal processing, most Replication Server administrators immediately picture the internal threads. While understanding the internal threads is an important fundamental concept, it is strictly the starting point for understanding how Sybase Replication Server processes transactions. Unfortunately, many Replication Server administrators stop there, and as a result never really understand how Replication Server is processing their workload. Consequently, this leaves the administrator ill equipped to resolve issues and, in particular, to analyze performance bottlenecks within the distributed system. Details about what is happening within each thread as data is replicated will be discussed in later chapters.

Replication Server Threads

There are several different diagrams that depict the Replication Server internal processing threads. Most of these are extremely similar and only differ in the relationships between the SQM, SQT and dAIO threads. For the sake of this paper, we will be using the following diagram, which is slightly more accurate than those documented in the Replication Server Administration Guide:

Figure 3 Replication Server Internal Processing Flow Diagram


Replicated transactions flow through the system as follows:

1. The Replication Agent forwards logged changes scanned from the transaction log to the Replication Server.
2. The Replication Agent User thread functions as a connection manager for the Replication Agent and passes the changes to the SQM. Additionally, it filters and normalizes the replicated transactions according to the replication definitions.
3. The Stable Queue Manager (SQM) writes the logged changes to disk via the operating system's asynchronous I/O routines. The SQM notifies the Asynchronous I/O daemon (dAIO) that it has scheduled an I/O. The dAIO polls the O/S for completion and notifies the SQM that the I/O completed. Once the changes are written to disk, the Replication Agent can safely move the secondary truncation point forward (based on the scan_batch_size setting).
4. Transactions from source systems are stored in the inbound queue until a copy has been distributed to all subscribers (outbound queue).
5. The Stable Queue Transaction (SQT) thread requests the next disk block using SQM logic (SQMR) and sorts the transactions into commit order using the 4 lists: Open, Closed, Read, and Truncate. Again, the read request is done via async I/O by the SQT's SQM read logic, and the SQT is notified by the dAIO when the read has completed.
6. Once the commit record for a transaction has been seen, the transaction is put in the closed list and the SQT alerts the Distributor thread that a transaction is available. The Distributor reads the transaction and determines who is subscribing to it, whether subscription migration is necessary, etc.
7. Once all of the subscribers have been identified, the Distributor thread forwards the transaction to the SQM for the outbound queue for the destination connections. This point in the process serves as the boundary between the inbound connection processing and the outbound connection processing.
8. Similar to the inbound queue, the SQM writes to the queue using the async I/O interface and continues working. The dAIO will notify the SQM when the write has completed.
9. Transactions are stored in the outbound queue until delivered to the destination.
10. The DSI Scheduler uses the SQM library functions (SQMR) to retrieve transactions from the outbound queue, then uses SQT library functions to sort them into commit order (in case of multiple source systems) and determines the delivery strategy (batching, grouping, parallelism, etc.).
11. Once the delivery strategy is determined, the DSI Scheduler passes the transaction to a DSI Executor.
12. The DSI Executor translates the replicated transaction functions into the destination command language (i.e., Transact-SQL) and applies the transaction to the replicated database.

Again, the only difference here vs. the diagrams in the product manuals is the inclusion of the System Table Services (STS), the Asynchronous I/O daemon (dAIO), the SQT/SQM and queue data flow, and the lack of a SQT thread reading from the outbound queue (instead, the DSI-S is illustrated making SQMR/SQT library calls). While the difference is slight, it is illustrated here for future discussion. Keeping these differences in mind, the reader is referred to the Replication Server System Administration Guide for details of internal processing for replication systems involving routing or Warm Standby.
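Each of these threads can be observed at runtime from an isql session connected to the Replication Server. The following is a minimal sketch of the standard admin commands commonly used for this purpose (the exact output columns vary by RS version):

    -- All threads with their current state and informational message
    admin who
    -- Only threads that are down or in an abnormal state
    admin who_is_down
    -- Per-module detail: stable queue writes, SQT sorting and DSI delivery
    admin who, sqm
    admin who, sqt
    admin who, dsi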
Replication Server SMP & Internal Threading

In the past, Replication Server was a single process using internal threads for task execution along with kernel threads for asynchronous I/O. Beginning with version 12.5, an SMP version of RS exploiting native OS threads was available via an EBF. Each of the main threads discussed above was implemented as a full native thread, which could run on multiple processors. The SMP capabilities could be enabled or disabled by configuring the Replication Server. By itself, even without enabling SMP, the native threading improved RS throughput. Version 12.6 improved this further by reducing the internal contention of the initial 12.5 implementation; consequently, DBAs should consider upgrading to version 12.6 prior to attempting SMP. Further discussion about RS SMP capabilities and the impact on performance appears later.

However, one new aspect of this from an internals perspective is that shared resources now require locks or mutexes. Typically in most multi-threaded applications, there are resources - usually memory structures - that are shared among the different threads. For example, in RS, the SQT cache is shared between the SQT thread and an SQT client such as a Distributor thread (this shared cache will be important to understanding the hand-off between the DSI-S and DSIEXEC threads later).

To coordinate access to such shared resources (so that one thread does not delete a structure while another is using it, or one does not read while another has not finished writing and pick up corrupted values), threads are required to lock the resource for their exclusive use - typically by grabbing the mutex that controls access to the resource. In RS 12.5 and earlier non-SMP environments, since the threads were internal to RS and execution could be controlled by the OpenServer scheduler, conflicting access to shared resources could often be avoided simply because only one thread would be executing at a time. In RS 12.6 - with or without SMP enabled - the native threading implementation allows thread execution to be controlled by the OS; consequently, mutexes had to be added to several shared resources.

In RS 12.6 and higher, you may sometimes see a state of "Locking Resource" when issuing an admin who command. Grabbing a mutex really does not take but a few milliseconds - unless someone else already has it, at which point the requesting thread is blocked and has to wait. The state of "Locking Resource" corresponds more to this condition: the thread in question is attempting to grab exclusive access to a shared resource and is waiting on another thread to release the mutex. Because mutex allocation is so quick, it is likely that when you see this, RS is undergoing a significant state change - for example, switching the active in a Warm Standby.

Inter-Thread Messaging

Additionally, inter-thread communication is not accomplished via a strict synchronous API call. Instead, each thread simply writes a message into one of the target thread's OpenServer message queues (standard OpenServer in-memory message structures for communicating between OpenServer threads) specific to the message type. Once the target thread has processed each message, it can use standard callback routines or put a response message back into a message queue for the sending thread. This resembles the following:

Figure 4 Replication Server Inter-Thread Communications


Those familiar with multi-threaded programming or OpenServer programming will recognize this as a common technique for communication between threads, especially when multiple threads are trying to communicate with the same destination thread. Accordingly, callbacks are used primarily between threads in which one thread spawned the other and the child thread needs to communicate with the parent thread. An example of this in Replication Server is the DIST and SQT threads. The SQT thread for any primary database is started by the DIST thread. Consequently, in addition to using message queues, the SQT and DIST threads can communicate using callback routines.

Note that the message queues are not really tied to a specific thread, but rather to a specific message. As a result, a single thread may be putting/retrieving messages from multiple message queues. Consequently, it is possible to have more message queues than threads, although the current design for Replication Server doesn't require such. By now, those familiar with many of the Replication Server configuration parameters will have realized the relationship between several fairly crucial configuration parameters: num_threads, num_msgqueues and num_msgs (and especially why the last could be a large multiple of num_msgqueues). Since this section is strictly intended to give you a background in Replication Server internals, the specifics of this relationship will be discussed later in the section on Replication Server tuning.

OQID Processing

One of the more central concepts behind Replication Server recovery is the OQID - the Origin Queue Identifier. The OQID is used for duplicate and loss detection as well as for determining where to restart applying transactions during recovery. The OQID is generated by the Replication Agent when scanning the transaction log from the source system. Because the OQID contains log-specific information, each OQID format is dependent upon the source system. For Sybase ASE, the OQID is a 36-byte binary value composed of the following elements:

Bytes 1-2:   Database generation id (from dbcc gettrunc())
Bytes 3-8:   Log page timestamp
Bytes 9-14:  Log page row id (rid)
Bytes 15-20: Log page rid for the oldest transaction
Bytes 21-28: Datetime for the oldest transaction
Bytes 29-30: Used by RepAgent to delete orphaned transactions
Bytes 31-32: Unused
Bytes 33-34: Appended by TD for uniqueness
Bytes 35-36: Appended by MD for uniqueness
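As a quick illustration, the byte layout above can be picked apart with substring() against the last OQID recorded at a replicate. The query below is a minimal sketch run in a replicate database; it assumes the standard rs_lastcommit table created by Replication Server, whose origin_qid column holds the binary(36) OQID:

    -- Decode the major components of the last OQID applied for each origin
    select  origin,
            substring(origin_qid, 1, 2)   as db_generation,
            substring(origin_qid, 3, 6)   as log_page_timestamp,
            substring(origin_qid, 9, 6)   as log_page_rid,
            substring(origin_qid, 15, 6)  as oldest_tran_rid
    from    dbo.rs_lastcommit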

Through the use of the database generation id, the log page timestamp and the log record row id (rid), ASE guarantees that the OQID is always increasing sequentially. As a result, any time the RS detects an OQID lower than the last one, it can somewhat safely assume that it is a duplicate. Similarly, at the replicate, when the DSI compares the OQID in the rs_lastcommit table with the one current in the active segment, it can detect whether the transaction has already been applied.

Why would there be duplicates? Simply because the Replication Server isn't updating the RSSD or the rs_lastcommit table with every replicated row. Instead, it updates every so often - after a batch of transactions has been applied. Should the system be halted mid-batch and then restarted, it is possible that the first several transactions have already been applied. At the replicate, a similar situation occurs in that the Replication Server begins by looking at the oldest active segment in the queue, which may contain transactions already applied.

Note that the oldest open transaction position is also part of the ASE OQID. This is deliberate. Since the Replication Agent could be scanning past the primary truncation point and up to the end of the log, the oldest open transaction position is necessary for recovery. As discussed later, the ASE Rep Agent does not actually ever read the secondary truncation point. Consequently, if the replication system is shut down, the Replication Agent may have to restart at the point of the oldest open transaction and rescan to ensure that nothing is missed.

For heterogeneous systems, the database generation (bytes 1-2) and the RS-managed bytes (33-36) are the same; however, the other components depend on what is available to the replication agent to construct the OQID. This may include system transaction ids or other system-generated information that uniquely identifies each transaction to the Replication Agent.

An important aspect of the OQID is the fact that each replicated row from a source system is associated with only one OQID and vice versa. This is key not only to identifying duplicates for recovery after a failure (i.e. a network outage), but also in replication routing. From this aspect, the OQID ensures that only a single copy of a message is delivered in the event that the routing topology changes. Those familiar with creating intermediate replication routes and the concept of the logical network topology provided by the intermediate routing capability will recognize the benefit of this behavior.

The danger is that some people have attempted to use the OQID or the origin commit time in the rs_lastcommit table for timing. This is extremely inaccurate. First, the origin commit time comes from the timestamp in the commit record (a specific record in the transaction log) on the primary. This time is derived from the dataserver's clock, which is synched with the system clock about once per minute. There can be drift, obviously, but not more than a minute as it is re-synched each minute. The dest_commit_time in the rs_lastcommit table, on the other hand, comes from the getdate() function call in rs_update_lastcommit. The getdate() function is a direct poll of the system clock on the replicate. The resulting difference between the two could be quite large - or even negative if the replicate's clock is slow. In any case, since transactions are grouped when delivered via RS (a topic for later), the rs_lastcommit commit time is for the last command in the batch and not necessarily the command you are testing with. Additionally, as we will see later, if the last command was a long-running procedure, it may appear to be worse than it is.
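For reference, the naive check people typically run is something like the following sketch against the standard rs_lastcommit table at the replicate; given the clock drift and transaction grouping caveats above, treat the result as a rough indicator only:

    -- Apparent latency per origin, subject to clock drift and grouping effects
    select  origin,
            origin_time,
            dest_commit_time,
            datediff(ss, origin_time, dest_commit_time) as apparent_latency_sec
    from    dbo.rs_lastcommit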
On the other hand, much like network packeting, the Replication Agent and Replication Server both have deliberate delays built in when only a small number of records are received. This pause is built in so that subsequent transactions can be batched into the buffer for similar processing. Those familiar with TCP programming will recognize this buffering as similar to the delay that is disabled by enabling TCP_NO_DELAY, as well as other O/S parameters such as tcp_deferred_ack_interval on Sun Solaris.

The best mechanism for determining latency is to simply run a batch of 1,000 normal business transactions (which can be simulated with atomic inserts spread across the hot tables) into the primary and monitor the end time at the primary and the replicate. For large sets of transactions, obviously a stopwatch is not even necessary. If the Replication Server is keeping the system current to the point that a stopwatch would be necessary, then you don't have a latency problem. If, however, it finishes at the primary in 1 minute and at the replicate in 5 minutes, then you have a problem - maybe.

Analyzing Replication System Performance

Having set the stage, the rest of this document will be divided into sections detailing how these components work in relation to possible performance issues. The major sections will be:

- Primary Dataserver/Database
- Replication Agent Processing
- Replication Server and RSSD General Tuning
- Inbound Processing
- Outbound Queue Processing
- Replicate Dataserver/Database

After these sections have been covered in some detail, this document will then cover several special topics related to DSI processing in more detail. These include:

- Procedure Replication
- Replication Routes
- Parallel DSI Performance
- Text/Image Replication
- Asynchronous Request Functions
- Multiple DSIs
- Integration with EAI

Primary Dataserver/Database
It is Not Possible to Tune a Bad Design
The above comment is the ninth principle of the Principles of OLTP Processing as stated by Nancy Mullen of Andersen Consulting (now Accenture) in her paper "OLTP Program Design" in the OLTP Processing Handbook (McGraw-Hill). A truer statement has never been written. Not only can you not fix a bad design with replication, but in most cases a bad design will also cause replication performance to suffer. In many cases when replication performance is bad, we tend to focus quickly on the replicate. While it is true that many replication performance problems can be resolved there, the primary database often also plays a significant role. In fact, implementing database replication or other forms of distributing database information (messaging, synchronization, etc.) will quickly point to significant flaws in the primary database design or implementation, including:

- Poor transaction management, particularly with stored procedures and batch processes.
- Single-threaded batch processes. While they may work, they are not scalable.
- High-impact SQL statements - such as a single update or delete statement that affects a large number of rows (>10,000).
- Inappropriate design for a distributed environment (heavy reliance on sequential or pseudo keys).
- Improper implementation of relational concepts (i.e. lack of primary keys, duplicate rows, etc.).

Note that all of these are problems in a distributed environment whether using Replication Server or MQSeries messaging. However, the proper design of a database system for distributed environments is beyond the scope of this paper. In this section, we will begin with basic configuration issues and then move into some of the more problematic design issues that affect replication performance.

Dataserver Configuration Parameters

While Sybase has striven (with some success) to make replication transparent to the application, it is not transparent to the database server. In addition to the Replication Agent thread (even though it is significantly better than the older LTMs as far as impact on the dataserver), replication can impact system administration in many ways. One of those ways is proper tuning of the database engine's system configuration settings. Several settings that would not normally be associated with replication nonetheless have a direct impact on the performance of the Replication Agent or on processing transactions within the Replication Server.

Procedure Cache Sizing

A common misconception is that procedure cache is strictly used for caching procedure query plans. However, in recent years this has changed. The reason is that in most large production systems, the procedure cache was grossly oversized, consequently under-utilized, and contributed to the lack of resources for data cache. For example, in a system with 2GB of memory dedicated to the database engine, the default of 20% often meant that ~400MB of memory was being reserved for procedure cache. Often, the real procedure cache used by stored procedure plans is less than 10MB. ASE engineers began tapping into this resource by caching subquery results, sort buffers, etc. in procedure cache. When the Replication Agent thread was internalized within the ASE engine (ASE 11.5), it was no different: it also used procedure cache. Later releases of ASE (from ASE 12.0) have moved this requirement from procedure cache to additional memory grabbed at startup, similar to additional network memory. Consequently, if using ASE 12.5, this may not be as great a problem as with ASE 11.9.2 or earlier.

The Replication Agent uses memory for several critical functions:

- Schema cache - caching of database object structures, such as table and column names and text/image replication states, used in the construction of LTL.
- Transaction cache - caching of LTL statements pending transfer to the Replication Server.

As a result, system administrators who have tuned the procedure cache to minimal levels prior to implementing replication may need to increase it slightly to accommodate Replication Agent usage if using an earlier release of ASE. You can see how much memory a Replication Agent is using via the 9204 trace flag (additional information on enabling/disabling Replication Agent trace flags is located in the Replication Agent section).
sp_config_rep_agent <db_name>, trace_log_file, <filepathname>
sp_config_rep_agent <db_name>, traceon, 9204
-- monitor for a few minutes
sp_config_rep_agent <db_name>, traceoff, 9204

Generally speaking, the Replication Agent's memory requirements will be less than a normal server's metadata cache requirements for system objects (sysobjects, syscolumns, etc.). A rule of thumb when sizing a new system for replication might be to use the metadata cache requirements as a starting point.

Metadata Cache

The metadata cache itself is important to replication performance. As will be discussed later, as the Replication Agent reads a row from the transaction log, it needs access to the object's metadata structures. If forced to read these from disk, Replication Agent processing will be slowed while waiting for the disk I/O to complete. Careful monitoring of the metadata cache via sp_sysmon during periods of peak performance will allow system administrators to size the metadata cache configurations appropriately.
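A minimal sketch of that monitoring loop is shown below; the descriptor limits named are the standard sp_configure parameters, but the values are purely illustrative:

    -- Sample a peak 10-minute window and review the "Metadata Cache Management" section
    sp_sysmon "00:10:00"

    -- If active descriptors run at or near the configured maximums, raise the limits, e.g.:
    sp_configure "number of open objects", 5000
    sp_configure "number of open indexes", 5000
    sp_configure "number of open databases", 20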
User Log Cache (ULC)

The User (or Private) Log Cache was implemented in Sybase SQL Server 11.0 as a means of reducing transaction log semaphore contention and the number of times that the same log page was written to disk. In theory, a properly sized ULC would mean that only when a transaction was committed would the records be written to the physical transaction log. One aspect of this that could have had a large impact on the performance of Replication Server was that this would mean that a single transaction's log records would be contiguous on disk vs. interspersed with other users' transactions. This would significantly reduce the amount of sorting that the SQT thread would have to do within the Replication Server. However, in order to ensure low latency, and due to an operating system I/O flushing problem, a decision was made in the design of SQL Server 11.x that if the OSTAT_REPLICATED flag was on, the ULC would be flushed much more frequently than normal. In fact, in some cases, the system behaves as if it did not have any ULC. As one would suspect, this can lead to higher transaction log contention as well as negating the potential benefit to the SQT thread. Over the years, operating systems have matured considerably, eliminating the primary cause and hence the need for this. In ASE 12.5, this ULC flush was removed, but as of this writing not enough statistics are available to tell how much of a positive impact this has on throughput by reducing the SQT workload. One reason is that it is extremely rare for the SQT workload to be the performance bottleneck.

Primary Database Transaction Log

As you would assume, the primary transaction log plays an integral role in replication performance, particularly in the speed at which the Replication Agent can read and forward transactions to the Replication Server.

Physical Location

The physical location of the transaction log plays a part in both database performance and replication performance. The faster the device, the quicker the Replication Agent will be able to scan the transaction log on startup, during recovery and during processing when physical I/O is required. Some installations have opted to use Solid State Disks (SSDs) as transaction log devices to reduce user transaction times, etc. While such devices would help the Replication Agent, if resources are limited, a good RAID-based log device will be sufficient, enabling the SSD to be used as a stable device or for other general server performance requirements (tempdb).

Named Cache Usage

Along with log I/O sizing, binding the transaction log to a named cache can have significant performance benefits. The reason stems from the fact that the Replication Agent cannot read a log page until it has been flushed to disk. While this does happen immediately after the page is full for recovery reasons, if a named cache is available, the probability is much higher that the Replication Agent can read the log from memory vs. disk. If forced to read from disk, Replication Agent performance may drop to as low as 1GB/hr.

A word of caution: while it may be tempting to simply allocate a small 4K pool in an existing cache, the best configuration is a separate dedicated log cache with all but 1MB allocated to 4K buffer pools. For example, a 50MB dedicated log cache would have 49MB of 4K buffers and 1MB of 2K buffers. The reason is that if the named cache is for mixed use (log and data), more than likely other buffer pools larger than 4K have been established. In the Adaptive Server Enterprise Monitor Historical Server User's Guide, a little known fact is stated: "Regardless of how many buffer pools are configured in a named data cache, Adaptive Server only uses two of them. It uses the 2K buffer pool and the pool configured with the largest-sized buffers." While the intention may have been that the largest-size buffers are used, experience monitoring production systems suggests that in some cases it is the buffer pool with the largest buffer space instead, while in others the server appears to use different pools almost exclusively during different periods of time. Unfortunately, some DBAs simply assume that any 4KB I/Os must be the transaction log, when it could be query activity - the counters available through sp_sysmon do not differentiate log I/O from data pages. Rather than trying to second-guess this, it is much simpler to restrict any named cache to only 2 sizes of buffer pools and use a dedicated log cache for this purpose. In most cases where the RepAgent was lagging, every time that a separate log cache has been enabled, customers have witnessed an immediate 100% improvement in Replication Agent throughput - as long as the RepAgent stayed within the log cache region.
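A minimal sketch of such a dedicated log cache follows; the cache name, sizes and database name are illustrative only, and binding syslogs generally requires the database to be in single-user mode:

    -- Create a log-only named cache and carve most of it into a 4K pool
    sp_cacheconfig "log_cache", "50M", "logonly"
    sp_poolconfig  "log_cache", "49M", "4K"
    go
    -- Bind the primary database transaction log to the new cache
    sp_dboption pubs2, "single user", true
    go
    sp_bindcache "log_cache", pubs2, syslogs
    go
    sp_dboption pubs2, "single user", false
    go
    -- Match the log I/O size to the 4K pool (run from within the database)
    use pubs2
    go
    sp_logiosize "4"
    go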
Application/Database Design

While the above configuration settings can help reduce performance degradation, undoubtedly the best way to improve replication performance from the primary database perspective is the application or primary database design itself.

Chained Mode Transactions

In chained mode, all data retrieval and modification commands (delete, insert, open, fetch, select, and update) implicitly begin a transaction. The biggest impact on RS is from the implicit transactions that result from select statements, which in most applications account for 75-80% of all activity in a DBMS. Simple transactions that only involve queries vs. DML operations result in empty transactions, which are committed as usual. Some might think that the User Log Cache would filter these empty transactions from ever reaching the transaction log; however, since the transactions are committed vs. rolled back, these empty transactions are instead flushed to the transaction log. Besides the obvious negative impact on application performance, they have a negative impact on replication as well, as these empty transactions are forwarded to the Replication Server. Earlier versions of Replication Server would filter these empty transactions at the DSI thread due to the way transaction grouping works. Newer versions of Replication Server have reduced the impact by removing empty transactions earlier - those from chained transactions as well as system transactions such as reorgs. In ASE 12.5.2, the Replication Agent has been improved to eliminate the empty transactions from system transactions; however, user actions that result in empty transactions will still result in empty begin/commit pairs sent to the RS. As a result, an application that uses chained mode will degrade Replication Agent throughput as well as increase the processing requirements for Replication Server.
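A minimal sketch of checking for and turning off chained mode from an application session is shown below (chained mode is also often switched on by client drivers when autocommit is disabled, so check there as well):

    -- 1 = session is in chained (implicit transaction) mode, 0 = unchained
    select @@tranchained

    -- Switch the session to unchained mode so that read-only batches do not
    -- generate empty begin/commit pairs in the transaction log
    set chained off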
Multiple Physical Databases

One of the most frequent complaints is that the Replication Agent is not reading the transaction log fast enough, prompting calls for the ability to have more than one Replication Agent per log or for a multi-threaded Replication Agent vs. the current threading model. Although sometimes this can be alleviated by properly tuning the Replication Agent thread, adjusting the above configuration settings, etc., there is a point where the Replication Agent is simply not able to keep up with the logged activity. A classic case of this can be witnessed during large bcp operations (100,000 or more rows), in which the overhead of constructing LTL for each row is significant enough to cause the Replication Agent to begin to lag behind.

With the exception of bulk operations, whenever normal OLTP processing causes the Replication Agent to lag behind, the most frequent cause is the failure on the part of the database designers to consider splitting the logical database into two or more physical databases based on logical data groups. Consider, for example, the mythical pubs2 application. Purportedly, it is a database meant to track the sales of books to stores from a warehouse. Let's assume that 80% of the transactions are store orders. That means the other 20% of the transactions are administering the lists of authors, books, book prices, etc. If maintained in the same database, this extra 20% of the transactions could be just enough to cause a single Replication Agent to lag behind the transaction logging. And yet, what would be lost by separating the database into two physical databases - one containing the authors, books, stores and other fairly static information, while the other functions strictly as the sales order processing database? The answer is: not much. While some would say that it would involve cross-database write operations, the real answer is not really. Appropriately designed, new authors, books and even stores would be entered into the system outside the scope of the transaction recording book sales. Cross-database referential integrity would be required (for which a trigger vs. declarative integrity may be more appropriate), but even this does not pose a recovery issue except to academics. The real crux of the matter is: is it more important to have a record of a sale to a store in the dependent database even if the parent store record is lost due to recovery, or is it more important to enforce referential integrity at all points and force recovery of both systems? Obviously, the former is better. As a result, it makes sense to separate a logical database into several physical databases for the following types of data groupings (a sketch of the last grouping follows the list):

- Application object metadata such as menu lists, control states, etc.
- Application-driven security implementations (screen navigation permissions, etc.)
- Static information such as tangible business objects, including part lists, suppliers, etc.
- Business event data such as sales records, shipment tracking events, etc.
- One-up/sequential key tables used to generate sequential numbers

Not only does this naturally lend itself to the beginnings of shareable data segments reusable by many applications, by doing so, you also will increase the degree of parallelism on the inbound side of Replication Server processing. The last item might catch many people by surprise and immediate generate cautions about cross database transactions. First of all, under any recovery scenario either the correct next value could be determined by scanning the real data or, the gap of missing rows can be determined from the key table. This last is important from a different perspective. Now, consider replication. By placing the one-up key tables in a separate database, they effectively have a dedicated Replication Agent and simple path through the Replication Server. As a result, one-up/sequential key tables will have considerably less latency than the main data tables. Consequently, during a Warm Standby failure, it is less likely that any transactions were stranded, but the number of real transactions stranded may be able to be determined with more accuracy and the associated key sequences preserved. In addition, in some cases splitting a database can be highly recommended for other reasons. Consider the common problem of databases containing large text or image objects. As will be illustrated later, text/image or other types of BLOBs can significantly slow Rep Agent performance due to having to also scan the text chains a slow process in any event. It is probably advisable to put such tables in a separate database with a view in the original for application transparency purposes. The reasons for this are: Enable multiple Replication Agents to work in parallel in effect, dedicating one to reading text data Enable separate physical connection at the replicate to write the data improving overall throughput as nontextual data is not delayed while text or image data is processed by the DSI thread. Improve overall application/database recoverability.

The first two are obvious solutions to replication performance degradation as result of text processing. The latter comment is not so obvious. However, consider the following: Text/Image data is typically static. Once inserted, it is rarely updated and the most common write activity post-insert will be a delete operation performed during archival. To avoid transaction log issues with text/image, most applications will use minimally logged functions such as writetext (or the CT-Library equivalent ct_send_data() function) to insert the text.

As an example, consider the types of data that you may be storing in a text or image column. Some financial institutions store loan applicant credit reports as text datatypes (although not recommended). Other organizations will frequently store customer emails, digitized applications containing signatures, or other infrequently access reference data. So how does a separate database improve recoverability? First, anytime a minimally logged function is executed in a database, the ability to perform transaction log dumps is voided. Consequently, databases containing text/image data often must be backed up using full database dumps. For any large database, this will require significant time to perform depending on the quantity and speed of backup devices. By separating the text/image data, the primary data related to business processing can support transaction log dumps allowing up to the minute recovery as well as be brought online faster after a system shutdown. Avoid Unnecessary BLOBs The handling of BLOB (text/image) data is becoming more of a problem today as application developers faced with storing XML messages in the database are often choosing to store the entire message as a BLOB datatype (image for Sybase if using native XML indexing). In most cases, storing structured data in a BLOB datatype is actually orders of magnitude less efficient for the application. For instance, consider the credit report instance alluded to earlier. If a persons credit report is stored as a single text datatype, the application must then perform the parsing to determine such items as the credit score, the number of open charge accounts, number of delinquent payments, etc. In addition, annotations about a specific charge are difficult to record. For example, if applying for a mortgage, an applicant may be required to explain late payments to a specific credit account. Stored as text datatype, it would be difficult to link the applicants rebuttal (which would be a good use of text) with the specific account. Additionally, it can detract from the businesss ability to perform business analysis functions critical to profitability. For example, a common requirement may be to determine the number of credit accounts and balances with any reported late payments for customers who are late in paying their current bill. This might allow a bank to reduce its risk of exposure either dynamically or avoid it altogether by refusing credit to someone whos profile would suggest a greater chance of defaulting on the loan. The point of this discussion is not to discourage storing XML documents when necessary in fact storing the credit report as an entire entity might be needful particularly if exchanging it with other entities. However, the tendency of

16

Final v2.0.1
some is to think of the RDBMS as a big bit bucket to store all of their data as objects in XML format without recognizing the futility of doing so. Similarly, XML is mainly an application layer communications protocol. While serving an extremely useful purpose in providing the means to communicate with other systems, it can seriously degrade overall application performance if XML messages are stored as a single text datatype. For example, if a cargo airplanes schedule and load manifest were stored in XML format as a text datatype, the businesss routing/scheduling and in transit visibility functions would be extremely hampered. Questions such as whether ground facility capacity had been exceeded, re-routing of shipments due to delays, or even the location of specific shipments would require the XML document to be parsed. While doable, and text indexing/XML indexing may assist in some efforts (i.e. finding shipments), often such operations require the retrieval of a large number of data values and subsequent parsing to find the desired information. Consider the query What scheduled flights or delayed flights are scheduled to arrive in the next 1 hour? Transaction Processing After the physical database design itself, the next largest contributor is how the application processes transactions. An inefficient application not only increases the I/O requirements of the primary database, it also can significantly degrade replication performance. Several of the more common inefficiencies are discussed below. Avoid Repeated Row Re-Writes One of the more common problems brought about by forms-based computing is that the same row of data may be inserted and then repeatedly updated by the same user during the same session. A classic scenario is the scenario of filling out an application for loans or other multi-part application process. A second common scenario is one in which fields in the record are filled out by database triggers, including user auditing information (last_update_user), order totals, etc. While some of this is unavoidable to insure business requirements are met, it may add extra work to the replication process. Consider the following mortgage application scenario: 1. 2. 3. 4. 5. 6. User inserts basic loan applicant name, address information As user transitions to next screen for property info, the info is saved to the database. User adds the property information (stored in same database table). As user transitions to the next screen, the property information is saved to the database User adds dependent information (store in same table in denormalized form) User hits save before asking credit info (not stored in same table)

Just considering the above scenario, the following database write operations would be initiated by the application:
insert loan_application (name, address) update loan_application (property info) update loan_application (dependent info)

Now, consider the actual I/O costs if the database table had a trigger that recorded the last user and datetime that the record was last updated.
insert update update update update update loan_application loan_application loan_application loan_application loan_application loan_application (name, address) (lastuser, lastdate) (property info) (lastuser, lastdate) (dependent info) (lastuser, lastdate)

As a result, instead of a single record, the Replication Agent must process 6 records each of which will incur the same LTL translation, Replication Server normalization/distribution/subscription processing, etc. On top of which, consider what happens at the replicate (if triggers are not turned off for the connection) local trigger firings at the replicate are bolded.
insert update update update update update update update update update update update loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application loan_application (name, address) (lastuser, lastdate) (lastuser, lastdate) (lastuser, lastdate) (property info) (lastuser, lastdate) (lastuser, lastdate) (lastuser, lastdate) (dependent info) (lastuser, lastdate) (lastuser, lastdate) (lastuser, lastdate)

Some may question the reality of such an example. It is real. While remaining unnamed, one of Sybases mortgage banking customers had a table containing 65 columns requiring 8-10 application screens before completely filled out.

17

Final v2.0.1
After each screen, rather than filling out a structure/object in memory, each screen saved the data to the database. During normal database processing, this led to an extremely high amount of contention within the table made worse by the continual page splitting to accommodate the increasing row size. Replication was enabled in a Warm-Standby configuration for availability purposes. Although successful, you can guess the performance implications within Replication Server from such a design. Understanding Batch Processing Most typical batch processes involve one of the following types of scenarios: Bulkcopy (bcp) of data from a flat file into a production table. This is more common than it should be as bcp-ing data is inherently problem-prone. Bulk SQL statement via insert/select or massive update or delete statement. A single or multiple stream of individual atomic SQL statements affecting one row each. This is extremely rare and usually only is present in extremely high OLTP systems where contention avoidance is paramount.

The last one typically is not a problem for replicated systems, however, the first two are and it has nothing to do with Replication Server. The simple fact of the matter is that any batch SQL statement logs each row individually in the transaction log. Consequently, any distributed system is left with the unenviable task of moving the individual statements enmass (and frequently as one large transaction). So, whats the problem with this? The problem is the dismal performance of executing atomic SQL statements vs. bulk SQL statements. Consider what happens for each SQL statement as it hits ASE: SQL statement is parsed by the language processor SQL statement is normalized and optimized SQL is executed Task is put to sleep pending lock acquisition and logical or physical I/O Task is put back on runnable queue when I/O returns Task commits (writes commit record to transaction log) Task is put to sleep pending log write Task sends return status to client

When this much overhead is executed for every row affected in a batch process, the process slows to a crawl. This can be seen in the following graph which compares a straight bcp in, a bcp in using a batch size of 100, an insert/select statement, and atomic inserts grouped in batches of 100 - in an unreplicated system .

Batch Insert Speeds


800 700 600

Seconds

500 400 300 200 100 0 0 25,000 50,000 100,000 150,000 200,000 250,000

bcp in bcp -b100 insert/select 100 grouped inserts

Rows
Figure 5 Non-replicated Batch Insert Speeds on single CPU/NT

18

Final v2.0.1
The above test was run on a small NT system, however, the relative difference holds. Notice that the results are fairly linear and show a marked difference between the grouped atomic inserts and any of the bulk statements (a factor of 700%). So why is this important? One of the biggest causes in latency within a replicated environment is bulk SQL operations during batch processing - in particular high-impact update and delete statements. In these cases, a single update or delete operation could easily affect 100s of thousands of rows. If you think about what was mentioned earlier, the primary ASE can execute the batch SQL along the performance lines as indicated above easily completing 250,000 rows in less than 2 minutes. Note that in the cases of the bcp or the single large insert/select, the parse, compile, optimize steps are either eliminated or only executed once. The problem is that all that is in the transaction log is the 250,000 row images - not the SQL statement that caused the problem. As a result, the replicate system unfortunately has to follow the atomic SQL statement route and suffers mightily as it attempts to execute 250,000 individual inserts. Using the above as an indication, since RS is sending individual inserts, the best it could hope for would be 12 minutes of execution instead of 1.5 - however this is even not attainable as it is unlikely that RS could group 100 inserts into a single batch (as we will see later, it is limited to 50 statements per batch). The problem is that a typical batch process may contain dozens to hundreds of such bulk SQL statements - each one compounding the problem. To see the impact of this in real life, a recent test with a common financial trading package that had a single delete of ~800,000 rows showed the following statistics (over several executions): Component Primary ASE (single delete stmt) Rep Agent RS (Inbound Queue) Outbound Queue Rows/Min 800,000 120,000 180,000 15,000 Latency N/A 7-12 min 5-7 min 53 min

Inbound Queue DSI

Replicate ASE

It is extremely important to realize, it is not the Replication Server that cant achieve the throughput - but rather the inability of the target dataserver to process each statement quickly enough that causes the latency. This leads to the first key concept that is indisputable, but for some reason is unbelievable as so many are quick to blame RS for the latency: Key Concept #1: Replication Server with a single DSI/single transaction will be limited in its ability to achieve any real throughput by the replicate data servers (DBMS) performance. Beyond that point, Parallel DSIs and smaller transactions must be used to avoid latency. It was interesting to note that while the financial package used a single delete statement to remove the rows, it then repopulated the table using inserts of 1,000 rows at a time as atomic transactions. At this point, with parallel DSIs, RS was able to execute the same volume of inserts and achieve the same throughput. Had the delete (above) note been clogging the system, there would have been near-zero latency for the inserts. To further illustrate that this is not just a Replication Server issue, consider the typical messaging implementation: a message table is populated within ASE (similar to the transaction log), the message agent (such as TIBCOs ADB) polls the messages from this table (similar to the RepAgent), the message bus stores the messages to disk (if durable messaging is used), and finally the message system applies the data as SQL statements to the destination system. If the messaging system treats each transaction as a singular message to maintain transactional consistency, it would have the same problem as RS - slow execution by the target server. Only if transactional consistency is ignored and the messages applied in parallel could the problem be overcome. Batch Process/Bulkcopy Concurrency In some cases, the lack of concurrency at the primary translates directly into replication performance problems at the replicate. Consider for example, the ever-common bulkcopy problem. Net gouge for years has stated that during slow bcp, the bcp utility translates the rows of data into individual insert statements. Consequently, people find it surprising that Replication Server has difficulty keeping up. In the first place, the premise is false. While slow bcp is an order of magnitude slower than fast bcp, it is still a bulk operation and consequently does not validate user-defined datatypes, declarative referential integrity, check constraints nor fire triggers. In fact, the only difference between slow bcp and fast bcp is that the individual inserted rows are logged for slow bcp whereas in fast bcp only the space allocations are logged. As a result, of course, it is still several orders of magnitude faster than individual insert statements that Replication Server will use at the replicate. This is clearly illustrated above in the insert batch

19

Final v2.0.1
test (figure 5) as the bcp in this case was a slow bcp - hence the comparable performance of the insert/select (which would log each row as well). Typical Batch Scenario Now, consider the scenario of a nightly batch load of three tables. If bcpd sequentially using slow bcp, it may take 1-2 hours to load the data. Unfortunately, when replication is implemented, the batch process at the replicate requires 8-10 hours to complete, exceeding the time requirements and possibly encroaching on the business day. Checking the replicated database during this time shows extremely little CPU or I/O utilization and the maintenance user process busy only a fraction of the time. All the normal things are tried and even parallel DSIs are implemented all to no avail. Customer decides that Replication Server just cant keep up. The reality of the above scenario is that several problems contributed to the poor performance: The bcp probably did not use batching (-b option) and as a result was loaded in a single transaction. As a result, the Replication Server could only ever use a single DSI, no matter how many were configured, as it had to apply it as a single transaction. Further, it would be held in the inbound queue until the commit record was seen by the SQT thread as a large transaction, this may incur multiple scans of the inbound queue to recreate the transaction records due to filling the SQT cache. Lack of batch size in the bcp (-b option) more than likely drove Replication Server to use large transaction threads while this may have reduced the overall latency in one area due to not having to wait for the DSI to see the commit record, it also meant that Replication Server only considered a small number of threads preserved for large transactions. Replication Agent probably was not tuned (batching and ltl_batch_size) as will be discussed in the next section. Even if bcp batching were enabled, by sequentially loading the tables, concurrent DSI threads would suffer a high probability of contention, especially on heap tables or indexes due to working on a single table. If attempting to use parallel DSIs, this will force the use of the less efficient default serialization method of wait_for_commit.

Some of the above will be addressed in the section specific to Parallel DSI tuning, however, it should be easy to see how the Replication Server lagged behind. It also illustrates a very key concept: Key Concept #2: The key to understanding Replication Server performance is understanding how the entire Replication System is processing your transaction. Batch Scenario with Parallelism Now, consider what would likely happen if the following scenario was followed for the three tables: All three tables were bcpd concurrently using a batch size of 100. Replication Server was tuned to recognize 1,000 statements as a large transaction vs. 100. Replication Agent was tuned appropriately. DOL/RLL locking at the replicate database. DSI serialization was set to wait_for_start (see Parallel DSI tuning section). Optionally, tables partitioned (although not necessary for performance gains if partitioned, DOL/RLL is a must).

Would the SQT cache size fill? Probably not. Would the Parallel DSIs be used/effective? Most assuredly. Would Replication Server keep up? It probably would still lag, but not as much. At the primary, it now may take only 2 hours to load the data (arguably less if not batching) and 3 hours at the replicate. In fact, as noted earlier in the financial trading system example, an insert of ~800,000 rows in 1,000 row transactions executed using 10 parallel DSIs completed at the replicate in the same amount of time as it took to execute at the primary - any latency would be simply due to the RS processing overhead. The same scenario is evident in purge operations. Typically, a single purge script begins by deleting masses of records using SQL joins to determine which rows can be removed. The problem is of course that this is identical from a replication perspective as a bcp operation a large transaction with no concurrency. An alternative approach in which a delete list is generated and then used to cursor through the main tables using concurrent processes may be more recoverable, cause less concurrency problems at the primary and improve replication throughput. Consider the

20

Final v2.0.1
following benchmark results from a 50,000 row insert into one table from a different table (mimicking a typical insert from a staging table to production table):

50,000 Row Bulk Insert Between Two Tables


Method Single SQL statement (insert/select) 10 threads processing 1 row at a time 10 threads processing 100 ranged rows at a time* 10 threads processing 250 ranged rows at a time* Time (sec) 1 57 5 1

By ranged rows (*), the system predefined 10 ranges of rows (i.e. 1-5000, 5001-10000, 10001-15000, etc.). As each thread initialized, it was assigned a specific range. It then performed the same insert/select, but specified a rowcount of 100 or 250 as noted above. Ignoring the replication aspects, the above benchmark easily demonstrates a couple of key batch processing hallmarks: 1. 2. It is possible to achieve the same performance as large bulk statements by running parallel processes using smaller bulk statements on predefined ranges Atomic statement processing is slow

This leads to a second key concept: Key Concept #3: The optimal primary transaction profile for replication is concurrent users updating/inserting/deleting small numbers of rows per transaction spread throughout different tables. That does not mean low volume! It can be extremely high volume. It just means it is better from a replication standpoint for 10 processes to delete 1,000 rows each in batches of 100 than for a single process to delete 100,000 rows in a single transaction. Accordingly, the best way to improve replication performance of large batch operations is to alter the batch operation to use concurrent smaller transactions vs. a single large transaction. An interesting test (some results were described above) was done on a dual processor (850MHz P3 standard (not XEON)) NT workstation with ASE 12.5 and RS 12.5 running on the same host machine. Several batch inserts of 25,000-100,000 rows were conduction from one database on the ASE engine to another using a Warm Standby implementation. By using 10 processes to perform the inserts in 250 row transactions in pre-defined ranges, RS was still able to reliably achieve 750-1,000 rows per second total throughput (and since ASE was configured for 2 engines, this machine was sorely over utilized). This was all accomplished with 10 parallel threads in RS with dsi_serialization_method set to isolation_level_3. Replicating SQL for Batch Processing The fundamental problem in batch processing is that a single SQL statement at the primary is translated into thousands of rows at the replicate each row requiring RS resources for processing and then the typical parse, optimize and sleep pending I/O at the replicate dataserver delays. For updates and deletes, users of ASE 12.5 and RS 12.5 can take advantage of a feature introduced with ASE 12.0 that allows the actual replication of a SQL statement. Consider the following code fragment:
if exists (select 1 from sysobjects where name="replicated_sql" and type="U" and uid=user_id()) drop table replicated_sql go create table replicated_sql ( sql_statement_id sql_string begin_time commit_time ) go

numeric(20,0) varchar(1800) datetime datetime

identity, null, default getdate() not null, default getdate() not null

create unique clustered index rep_sql_idx on replicated_sql (sql_statement_id) go create trigger replicated_sql_ins_trig on replicated_sql for insert as begin

21

Final v2.0.1

declare @sqlstring varchar(1800) select @sqlstring=sql_string from inserted set replication off execute(@sqlstring) set replication on end go exec sp_setreptable replicated_sql, true go if exists (select 1 from sysobjects where name="sp_replicate_sql" and type="P" and uid=user_id()) drop proc sp_replicate_sql go create proc sp_replicate_sql @sql_string varchar(1800) as begin declare @began_tran tinyint, @triggers_state tinyint, @proc_name varchar(60) select @proc_name=object_name(@@procid) -- check for tran state. If already in tran, set a save point so we are well-behaved if @@trancount=0 begin select @began_tran=1 begin transaction rep_sql end else begin select @began_tran=0 save transaction rep_sql end -- check for trigger state. For NT, byte 6 of @@options & 0x02 = 2 is on -- in unix, the bytes may be swapped if (convert(int,substring(@@options,6,1)) & 0x02 = 0) begin select @triggers_state=0 -- since triggers are off, we'd better check if we can turn them on if proc_role('replication_role')=0 begin raiserror 30000 "%1!: You must have replication role to execute this procedure at the replicate", @proc_name if @began_tran=1 rollback tran return(-1) end set triggers on end else begin select @triggers_state=1 end -- okay, now we can do the insert insert into replicated_sql (sql_string) values (@sql_string) if @@error!=0 or @@rowcount=0 begin rollback tran rep_sql raiserror 30001 "%1!: Insert failed. Transaction rolled back", @proc_name if @triggers_state=0 set triggers off return(-1) end else if @began_tran=1 commit tran if @triggers_state=0 set triggers off return (0) end go exec sp_setrepproc 'sp_replicate_sql', 'function' go

Then use the following replication definitions (this example is for a Warm Standby between two copies of pubs2 with a logical connection of WSTBY.pubs2)
Create replication definition replicated_sql_repdef With primary at WSTBY.pubs2 With all tables named replicated_sql ( sql_statement_id identity,

22

Final v2.0.1

sql_string varchar(1800) ) primary key (sql_statement_id) send standby replication definition columns go create function replication definition sp_replicate_sql with primary at WSTBY.pubs2 deliver as sp_replicate_sql ( @sql_string varchar(1800) ) send standby all parameters go

Now, if you really want to amaze your friends, simply execute something like the following:
Exec sp_replicate_sql insert into publishers values (9990,Sybase, Inc.,Dublin,CA)

The trick is in the highlighted portions of the trigger and the stored procedure. Starting in ASE 12.0, Sybase provided a capability to execute dynamically constructed SQL statements using the execute() function. However, if placed directly in a replicated procedure, the Rep Agent stack traces and fails (a nasty recovery issue for a production database). However, if the execute() function is in a trigger, Rep Agent behaves fine. Accordingly, we simply insert the desired SQL statement in a table. Of course, this also provides us a way to audit the execution of batch SQL and compare commit times for latency purposes (even replicated SQL statements could run for a long time). Now then, the only problem is that with Warm Standby, triggers are turned off by default via the dsi_keep_triggers setting (and it probably is off for most other normal replication implementations as well). Rather than enabling triggers for the entire session and cause performance problems during the day, we simply borrow a trick that dsi_keep_triggers simply calls the set triggers off command. Rather than simply indiscriminately turning the triggers off and then on and the beginning and end of the procedure, we employ trick #2 - @@options. @@options is an undocumented global variable that stores session settings such as set arith_abort on, etc. Since it is a binary number, you need to consider the byte order on your host, however, it now becomes a simple matter to replicate a proc that turns on triggers, inserts a SQL string into a table, which in turn triggers the execution of the string, and then the proc returns triggers to the original setting and exits. By the way, why replicate both the table and the proc? Well, the answer is it allows you to replicate truncate table or SQL deletes against the table when it begins getting unwieldy. As stated, this is a neat trick for handling updates and deletes. Inserts, particularly bcps are not able to use this for the simple fact that the source data needs to exist at the replicate already. However, if batch feeds are bcpd into staging databases on both systems (which should be done in WS situations), the bulk insert into the production database using insert into select can be replicated in this fashion as well. Additionally, while it has been stated that this is limited to the 12.5 versions of the products, it will in fact work with any 12.x version, but the SQL statement would be limited to 255 characters due to the varchar(255) limitation prior to ASE 12.5 and RS 12.5. Batch Processing & Ignore_dupe_key Some of the more interesting problems arise when programmers make logical assumptions - and without fully understanding the internal workings of ASE implement an easy work around. Consider the following code snippet that might be used when moving rows from a staging database to the production system:
create proc load_prod_table @batch_size int=250 as begin declare @done_loading tinyint select @done_loading=0 set rowcount @batch_size while @done_loading=0 begin insert into prod_table select from staging_table if @@rowcount=0 select @done_loading=1 delete staging_table end end

This appears to be fairly harmless, and assuming that the proc is NOT replicated, it would appear to be a normal implementation. However, two things are wrong with it: The assumption is that the same rows selected for insert will be the same rows deleted. Remember, if worker threads are involved, this may not be the case, particularly with partitioned tables. As a result, the delete could affect other rows than those inserted. The assumption is that the insert only READ rowcount rows from the source data. This is perhaps the biggest failure that affects performance.

23

Final v2.0.1
Why is the last bullet so important? Remember, that setting rowcount affects the final result set and does not limit any subqueries, etc. Hence select sum(x) from y group by a will return rowcount rows despite the fact it may have to scan millions to generate the sums. Accordingly, it may require ASE to scan hundreds or thousands of rows to generate rowcount unique rows for a table in which ignore_dupe_key is set for the primary key index. So, why is this a problem? Lets assume that we have a batch of 100,00 records in which 50% of them are duplicates (every other row) or already exist in the target table. Assuming rowcount is set to 250, it would mean that the insert would have to scan 500 rows in order to generate 250 unique ones to be inserted. However, the delete would only remove 250 of them. As a result, on the second pass through the loop, the insert would scan 250 rows it had already scanned and then an addition 500 rows to get 250 unique ones that it could insert. And the delete would remove 250. On the third pass, the insert would scan 500 rows already processed plus 500 new rows. And so forth. Essentially, even though 100,000 rows with 50% unique and a batch size of 250 would suggest a fairly smooth 200 iterations through the loop, by the last iteration, the insert would be scanning 49,750 rows already scanned plus the final 500 (with 250 unique). A reproduction of this problem (for the confused or interested) is as the below:
use tempdb go if exists (select 1 from sysobjects where name="test_table" and type="U" and uid=user_id()) drop table test_table go create table test_table ( col_1 int not null, col_2 varchar(40) null ) go create unique nonclustered index test_table_idx on test_table (col_1) with ignore_dup_key go if exists (select 1 from sysobjects where name="test_table_staging" and type="U" and uid=user_id()) drop table test_table_staging go create table test_table_staging ( col_1 int not null, col_2 varchar(40) null ) go insert into test_table_staging values (1,"expected batch=1") insert into test_table_staging values (2,"expected batch=1") insert into test_table_staging values (3,"expected batch=1") insert into test_table_staging values (3,"expected batch=1") insert into test_table_staging values (4,"expected batch=1") insert into test_table_staging values (5,"expected batch=2") insert into test_table_staging values (6,"expected batch=2") insert into test_table_staging values (7,"expected batch=2") insert into test_table_staging values (7,"expected batch=2") insert into test_table_staging values (8,"expected batch=2") insert into test_table_staging values (9,"expected batch=3") insert into test_table_staging values (10,"expected batch=3") insert into test_table_staging values (11,"expected batch=3") insert into test_table_staging values (11,"expected batch=3") insert into test_table_staging values (12,"expected batch=3") go if exists (select 1 from sysobjects where name="lsp_insert_test_table" and type="P" and uid=user_id()) drop proc lsp_insert_test_table go CREATE PROC insert_test_table @batchsize INT = 5 AS BEGIN DECLARE @cnt @myloop @err @del SELECT INT, int, INT, int

-- added to track deletes.

@cnt = -1, @err = 0, @myloop = 1

SET ROWCOUNT @batchsize WHILE BEGIN @cnt != 0

select "Loop ----------- ", @myloop INSERT test_table (col_1, col_2)

24

Final v2.0.1

SELECT FROM SELECT

col_1, col_2+" ==> actual batch="+convert(varchar(3),@myloop) test_table_staging @cnt = @@ROWCOUNT, @err = @@ERROR

set rowcount 0 select "test_table:" select * from test_table -- added to show what is inserted to this point.... select "Rowcount = " , @cnt set rowcount @batchsize DELETE test_table_staging

set rowcount 0 select "test_table_staging:" select * from test_table_staging -- added to show what is left select "Delete Rowcount = ",@del set rowcount @batchsize select @myloop = @myloop + 1 END RETURN 0 END go

Consider the following sample execution since the default is set to 5, executing the procedure without any parameter value should result in a ROWCOUNT limit of 5 rows:
use tempdb go select * from test_table_staging go exec insert_test_table go select * from test_table go The output from this as executed is: ---------- isql CHINOOK ---------col_1 col_2 ----------- ---------------------------------------1 expected batch=1 2 expected batch=1 3 expected batch=1 3 expected batch=1 4 expected batch=1 5 expected batch=2 6 expected batch=2 7 expected batch=2 7 expected batch=2 8 expected batch=2 9 expected batch=3 10 expected batch=3 11 expected batch=3 11 expected batch=3 12 expected batch=3

(15 rows affected)

The above is the output from the first select statement, showing the original 15 rows containing 3 duplicates (3,7, and 11). Note the highlighted rows (5,9, and 10) and their expected batch. Now, consider the procedure execution loop iteration 1 is contained below:

----------------- ----------Loop ----------1 (1 row affected) Duplicate key was ignored. --------test_table: (1 row affected) col_1 col_2 ----------- ---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1

25

Final v2.0.1

3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 (5 rows affected) ----------- ----------Rowcount = 5 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------- ---------------------------------------5 expected batch=2 6 expected batch=2 7 expected batch=2 7 expected batch=2 8 expected batch=2 9 expected batch=3 10 expected batch=3 11 expected batch=3 11 expected batch=3 12 expected batch=3 (10 rows affected) ------------------ ----------Delete Rowcount = 5 (1 row affected)

Note what occurred. Because of the duplicate row for row_id 3, the subquery select in the insert statement had to read 6 rows consequently row_id 5 was actually inserted as part of the first batch. However, because the delete is an independent statement, it simply deletes the first 5 rows, which contains the duplicate, leaving row_id 5 in the list. Now, consider what happens with loop iteration #2:
----------------- ----------Loop ----------2 (1 row affected) Duplicate key was ignored. --------test_table: (1 row affected) col_1 col_2 ----------- ---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 (10 rows affected) ----------- ----------Rowcount = 5 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------- ---------------------------------------9 expected batch=3 10 expected batch=3 11 expected batch=3

26

Final v2.0.1

11 expected batch=3 12 expected batch=3 (5 rows affected) ------------------ ----------Delete Rowcount = 5 (1 row affected)

Again, notice what occurred. Because of the row_id 5 is repeated and the duplicate for row_id 7, the insert scans 7 rows to achieve the rowcount of 5. Of course, the delete only removes the next five, leaving rows 9 &10 still in the staging table. Finally, we come to the last loop iteration:
----------------- ----------Loop ----------3 (1 row affected) Duplicate key was ignored. --------test_table: (1 row affected) col_1 col_2 ----------- ---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) ----------- ----------Rowcount = 2 (1 row affected) ----------------test_table_staging: (1 row affected) col_1 col_2 ----------- ---------------------------------------(0 rows affected) ------------------ ----------Delete Rowcount = 5 (1 row affected) ----------------- ----------Loop ----------4 (1 row affected) --------test_table: (1 row affected) col_1 col_2 ----------- ---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2

27

Final v2.0.1

10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) ----------- ----------Rowcount = 0 (1 row affected) (return status = 0) col_1 col_2 ----------- ---------------------------------------1 expected batch=1 ==> actual batch=1 2 expected batch=1 ==> actual batch=1 3 expected batch=1 ==> actual batch=1 4 expected batch=1 ==> actual batch=1 5 expected batch=2 ==> actual batch=1 6 expected batch=2 ==> actual batch=2 7 expected batch=2 ==> actual batch=2 8 expected batch=2 ==> actual batch=2 9 expected batch=3 ==> actual batch=2 10 expected batch=3 ==> actual batch=2 11 expected batch=3 ==> actual batch=3 12 expected batch=3 ==> actual batch=3 (12 rows affected) Normal Termination Output completed (1 sec consumed).

Because of the implementation, each duplicate compounds the problem, causing subsequent batches to begin with duplicates. So whats the problem?? A couple of points are key to understanding what is happening: When a duplicate is encountered, the server uses a Compensation Log Record (CLR) to undo a previous log record in this case, the duplicate insert. SET ROWCOUNT affects the number of rows affected by the statement vs. the rows processed by subquery or other individual parts of the statement. Consequently an insert limited by SET ROWCOUNT to 5 rows may have to read 6 or more rows if a duplicate is present. The implementation does not check to ensure that the rows inserted are the rows being deleted. Consequently, some rows could be dropped without even being inserted.

Now then, since the Rep Agent can be fully caught up, it replicates records for uncommitted transactions as well as committed. In this case, as soon as each log page is flushed, the Rep Agent can read it. Since the log page contains the duplicate rows for those being inserted (remember, bulk SQL first logs the affected rows and THEN applies them), it also reads the CLR records which is needful. By this point you can determine that the following is occurring (assuming the 50,000 row delete using 250 row iterations, again): Each loop iteration causes and additional 250 duplicate insert rows to be replicated along with 250 CLR records over the previous iteration By the last iteration, RS receives ~49,750 duplicate insert records, 49,750 CLR records plus 250 duplicate inserts from the last batch along with the 250 CLR records and then (last but not least) the 250 actually inserted rows.

This is all in one transaction. With all 200 iterations, RS must then remove the duplicate inserts that the CLR records point to. Consequently, this seemingly innocent 100,000 row insert of 50,00 new rows results in an astounding 4,925,250 total CLR records (250+500+750+49,500+49,750) and a duplicate number of inserts for a whopping total of 9,850,500 unnecessary records on top of the 50,000 rows really wanted. Can you guess the impact on: Your transaction log at the primary system (remember, all those CLR and inserts are logged)!!! The Replication Server performance as it also removes all the duplicates!!!

Oh, yes, this actually did happen at a major bank, and may have happened at least one more that we are aware of. The point of this discussion is that even though the SQL to remove the duplicates from the staging table appeared to be a slower design than the quick band-aid of ignore_dupe_key, in reality, given the data quality, it turns out to be tremendous performance boost. Sometimes, band-aids dont stick.

28

Final v2.0.1

Replication Agent Processing


Why is the Replication Agent so slow???
Frequently, comments will be made that the ASE Rep Agent is not able to keep up with logging in the ASE. For most normal user processing, a properly tuned Rep Agent on a properly tuned transaction log/system will have no trouble keeping up. This is especially true if the bulk of the transactions originate from GUI-base user screens since such applications naturally tend to have an order of magnitude more reads than writes. However, for systems with large direct electronic feeds or sustained bulk loading, Replication Agent performance is crucial. At this writing, a complete replication system based on Replication Server 12.0 is capable of maintaining over 2GB/Hr from a single database in ASE 11.9.3 using normal RAID devices (vs. SSDs). In a different type of test, the ASE 12.5.2 RepAgent thread on a single cpu NT machine is capable of sending >3,000 updates/second to Replication Server 12.6. Note that there are many factors that contribute to RepAgent performance cpu load from other users, network capabilities, etc. Readers should expect to achieve the same results if their system is notoriously cpu or network bound (for example). In this section we will be examining how the Replication Agent works and in particular, two bottlenecks quite easily overcome by adjusting configuration parameters. As mentioned earlier, since this paper does not yet address many of the aspects of heterogeneous replication, this section should be read in the context of the ASE Replication Agent thread. However, the discussions on Log Transfer Language and the general Rep Agent communications are common to all replication agents as all are based on the replication agent protocol supported by Sybase. Secondary Truncation Point Management Every one knows that the ASE Replication Agent maintains the ASE secondary truncation point, however, there are a lot of misconceptions about the secondary truncation point and the Replication Agent, including: The Replication Agent looks for the secondary truncation point at startup and begins re-reading the transaction log from that point. The Replication Agent cannot read past the primary truncation point. Zero-ing the LTM resets the secondary truncation point back to the beginning of the transaction log.

As you would guess from the previous sentence, these are not necessarily accurate. In reality, there is a lot more communication and control from the Replication Server in this process than realized. Replication Agent Communication Sequence The sequence of events during communication between the Replication Agent and the Replication Server is more along the lines of: 1. The Replication Agent logs in to the Replication Server and requests to connect the source database (via the connect source command) and provides a requested LTL version. Replication Server responds with the negotiated LTL version and upgrade information. The Rep Agent asks the Replication Server who the maintenance user is for that database. The Replication Server looks the maintenance user up in the rs_maintusers table in the RSSD database and replies to the Rep Agent. The Rep Agent asks the Replication Server where the secondary truncation point should be. The Replication Server looks up the locater in the rs_locaters table in the RSSD database and replies to the Rep Agent. The Rep Agent starts scanning from the location provided by the Replication Server The Replication Agent scans for a configurable number (scan_batch_size) log records. After reaching scan_batch_size log records, the Replication Agent requests a new secondary truncation point for the transaction log. When this request is received, the Replication Server responds with the cached locater which contains the log page containing the oldest open transaction received from the Replication Agent. In addition, the Replication Server writes this cached locater to the rs_locaters table in the RSSD. The Rep Agent moves the secondary truncation point to the log page containing the oldest open transaction received by Replication Server. Repeat step 5.

2.

3.

4. 5. 6.

7. 8.

29

Final v2.0.1
An interaction diagram for this might look like the following:
RepAgent
ct_connect(ra_user,ra_pwd) cs_ret_succeed connect source lti ds.db 300 [mode] lti 300 get maintenance user for ds.db db_name_maint get truncation site.db 0x0000aaaa0000bbbbbbb select from rs_users where...

Rep Server

RSSD

select from rs_sites...

select from rs_maintusers...

select from rs_locaters...

log_scan() LTL SQL Replicate DS.DB

get truncation site.db 0x0000aaaa0000bbbbbbb

insert into rs_locaters values (0x000aaaa0000)

Figure 6 Replication Interaction Diagram for Rep Agent to RSSD


The key elements to get out of this are fairly simple: Keep the RSSD as close as possible to the RS Every scan_batch_size rows, the Rep Agent stops forwarding rows to move secondary truncation point. The secondary truncation point is set to the oldest open transaction received by Replication Server which may be the same as the oldest transaction in ASE (syslogshold) or it may be an earlier transaction as the Rep Agent has not yet read the commit record from the transaction log.

Regarding the first, if you notice, most of the time that the Rep Agent asks the RS for something, the RS has to check with the RSSD or update the RSSD (i.e. the locater). So, dont put the RSSD to far (network wise) from the RS. The best place is on the same box and have the primary network listener for the RSSD ASE be the TCP loopback port (127.0.0.1) Replication Agent Scanning The second can be overcome with a willingness to absorb more log utilization. The default scan_batch_size is 1,000 records. As anyone who has read the transaction log will tell you,1,000 log records happen pretty quickly. The result is that the Rep Agent is frequently moving the secondary truncation point. Benchmarks have show that raising scan_batch_size can increase replication throughput significantly. For example, at an early Replication Server customer, setting it to 20,000 improved overall RS throughput by 30%. Of course, the tradeoff to this is that the secondary truncation point stays at a single location in the log translates to a higher degree of space used in the transaction log. In addition, database recovery time as well as replication agent recovery time will be lengthened as the portion of the transaction log that will be rescanned at database server and replication agent startup will be longer. In contrast to the last paragraph, some have reported better performance with lower scan batch size particularly in Warm Standby situations. While not definite, there is considerable thought within Sybase that this has the same impact of exec_cmds_per_timeslice in that it "throttles" the RepAgent back and allows other threads to have more access time. As the other threads are able to keep up more now, there is less contention for the inbound queue (SQM reads are not delaying SQM writes). While decreasing the RepAgent workload is one way to solve the problem, a better solution would have been to improve the DSI or other throughput to allow it to keep up without throttling back the RepAgent.

30

Final v2.0.1
Rep Agent LTL Generation The protocol used by sources to replication server is called Log Transfer Language (LTL). Any agent that wishes to replicate data via Replication Server must use this protocol, much the same way that RS must use SQL to send transactions to ASE. Fortunately, this is a very simple protocol with very few commands. The basic commands are listed in the table below. LTL Command connect source Subcommand Function request to connect a source database to the replication system in order to start forwarding transactions. request to retrieve maintenance user name to filter transactions applied by the replication system. request to retrieve a log pointer to the last transaction received by the Replication Server.

get maintenance user get truncation distribute begin transaction commit/rollback transaction applied execute sqlddl append dump purge

Used to distribute begin transaction statements Used to distribute commit/rollback statements Used to distribute insert/update/delete SQL statements Used to distribute both replicated procedures as well as request functions Used to distribute DDL to WS systems Used to distribute the dump database/ transaction log SQL commands Used during recovery to notify Replication Server that previously uncommitted transactions have been rolled back.

A sample of what LTL looks like is as follows:


distribute @origin_time='Apr 15 1988 10:23:23.001PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001, @tran_id=0x000000000000000000000001 begin transaction 'Full LTL Test' -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.002PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_insert yielding after @intcol=1,@smallintcol=1,@tinyintcol=1,@rsaddresscol=1,@decimalcol=.12, @numericcol=2.1,@identitycol=1,@floatcol=3.2,@realcol=2.3,@charcol='first insert',@varcharcol='first insert',@text_col=hastext always_rep,@moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='4-15-1988 10:23:23.001PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=hastext rep_if_changed,@bitcol=1 -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.003PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first last changed with log textlen=30 @text_col=~.!!?This is the text column value. -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.004PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first changed with log textlen=119 @imagecol=~/!"!gx"3DUfw@4@O@@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>,D *@4 -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.005PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append @imagecol=~/!!7Ufw@4@O@@y@f -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.006PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006,

31

Final v2.0.1

@tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append last @imagecol=~/!!B@O@@y@f9($&8~'ui)*7^Cv18*bh -- added for clarity distribute @origin_time='Apr 15 1988 10:23:23.007PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_update yielding before @intcol=1,@smallintcol=1,@tinyintcol=1,@rsaddresscol=1,@decimalcol=.12,@numericcol=2.1,@identitycol=1, @floatcol=3.2,@realcol=2.3,@charcol='first insert', @varcharcol='first insert',@text_col=notrep always_rep, @moneycol=$1.56,@smallmoneycol=$0.56,@datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=1 after @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12, @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='updated first insert', @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=0

Although it looks complicated, the above is fairly simple all of the above are distribute commands for a part of a transaction comprised of multiple SQL statements. The basic syntax for a distribute command for a DML operation is as follows:
distribute <commit time> <OQID> <tran id> applied <table>.<function> yielding [before <col name>=<value> [, <col name>=<value>, ]] [after <col name>=<value> [, <col name>=<value>, ]]

As you could guess, the distribute command will make up most of the communication between the Rep Agent and the Rep Server. Looking closely at what is being sent, you will notice several things: The appropriate replicated function (rs_update, rs_insert, etc.) is part of the LTL (highlighted above) The column names are part of the LTL

The latter is not always the case as some heterogeneous Replication Agents can cheat and not send the column names (assuming Replication Definition was defined with columns in same order or through a technique called structured tokens. Although currently beyond the scope of this paper, this is achieved by the Replication Agent directly accessing the RSSD to determine replication definition column ordering. This improves Replication Agent performance by reducing the size of the LTL to be transmitted and allowing the Replication Agent to drop columns not included in the replication definition. This information, once retrieved, can be cached for subsequent records. Currently, the ASE Replication Agent does not support this interface. However, in general, the LTL distribute command illustrated above does leave us with another key concept: Key Concept #4: Ignoring subscription migration, the appropriate replication function rs_insert, rs_update, etc., for a DML operation is determined by the replication agent from the transaction log. The DIST/SRE determines which functions are sent according to migration rules, while the DSI determines the SQL language commands for that function. Having determined what the Replication Agent is going to send to the Replication Server, the obvious question is how does it get to that point? The answer is based on two separate processes the normal ASE Transaction Log Service (XLS) and the Rep Agent. The process is similar to the following: 1. 2. 3. 4. 5. 6. 7. 8. 9. (XLS) The XLS receives a log record to be written from the ASE engine (XLS) The XLS checks object catalog to see if logged objects OSTAT_REPLICATED bit is set. (XLS) If not, the XLS simply skips to writing the log record. If it is set, then the XLS checks to see if the DML logged event is nested inside a stored procedure that is also replicated. (XLS) If so, the XLS simply skips to writing the log record. If not, then the XLS sets the log records LSTAT_REPLICATE flag bit (XLS) The XLS writes the record to the transaction log (RA) Some arbitrary time later, the Rep Agent reads the log record (RA) The Rep Agent checks to see if the log records LSTAT_REPLICATE bit is set. (RA) If so, Rep Agent proceeds to LTL generation. If not, the Rep Agent determines if the log record is a special log record such as begin/commit pairs, dump records, etc. (RA) If not, the Rep Agent can simply skip to the next record. If it was, the Rep Agent proceeds with constructing LTL.

10. (RA) The Rep Agent checks to see if the operation was an update. If so, it also reads the next record to construct the before/after images.
11. (RA) The Rep Agent checks to see if the logged row was a text chain allocation. If so, it reads the text chain to find the TIPSA. The TIPSA is then used to find the data row for the text modification. The data row for the writetext is constructed in LTL, and then the text chain is read and constructed into LTL chunks of text/image append functions.
12. (RA) LTL generation begins. The Rep Agent checks its own schema cache (part of proc cache) to see if the logged object's metadata is in cache. If not, it reads the object's metadata from the system tables (syscolumns).
13. (RA) The Rep Agent constructs the LTL statement for the logged operation.
14. (RA) If the batch_ltl parameter is false (the documented default), the Rep Agent passes the LTL row to the Rep Server using the distribute command as soon as it is built. If batch_ltl is true, the Rep Agent waits until the LTL buffer is full before sending the records to the Rep Server.

This process is illustrated below. The two services are shown side-by-side because they are independent threads within the ASE engine and execute in parallel on different log regions. This is due to the fact that the Rep Agent can only read flushed log pages (pages flushed to disk); consequently, it will always be working on a different log page than the XLS service.

[Figure 7 - ASE XLS and Replication Agent Execution Flow: a flowchart showing the ASE XLS Service (left) and Rep Agent Processing (right) decision paths described in the steps above]
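The OSTAT_REPLICATED bit checked in step 2 above is the flag set when an object is marked for replication. The following is only a minimal, hedged sketch of marking a table and a stored procedure (the object names are illustrative, and the exact option values accepted by sp_setrepproc should be verified against your ASE version):

    -- mark a table for replication (sets the replication status the XLS checks)
    sp_setreptable titles, true
    go
    -- mark a stored procedure for replicated function delivery
    sp_setrepproc upd_title_totals, 'function'
    go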


The following list summarizes the key elements of how this affects replication performance and tuning:

- The Replication Agent has a schema cache to maintain object metadata for constructing LTL, as well as a cache for tracking transactions (transaction cache). As a result, more procedure cache may be necessary on systems with a lot of activity on large numbers of tables. In addition, careful monitoring of the system metadata cache is needed to ensure that physical reads to system tables are not necessary.
- LTL batching can significantly improve Rep Agent processing as it can scan more records prior to sending the rows to the Rep Server (effectively a sync point in Rep Agent processing).
- Replicating text/image columns can slow down Rep Agent processing of the log due to reading the text/image chain.
- Marking objects for replication that are not distributed (i.e. for which no subscriptions or Warm Standby exists) has a negative impact on Rep Agent performance as it must perform LTL generation needlessly. In addition, these extra rows will consume space in the inbound stable queue and valuable CPU time for the distributor thread.
- Procedure replication can improve Rep Agent throughput by reducing the number of rows for which LTL generation is required. For example, if a procedure modifies 1,000 rows, replicating the table will require 1,000 LTL statements to be generated (and compared in the distributor thread). By replicating the procedure, only a single LTL statement needs to be generated and processed by the Replication Server.

Key Concept #5: In addition to Rep Agent tuning, the best way to improve Rep Agent performance is to minimize its workload. This can be achieved by not replicating text/image columns where not necessary and by ensuring that only objects for which subscriptions exist are marked for replication. In addition, replicating procedures for large impact transactions can improve performance significantly.

The last sentence may not make sense yet. However, a replicated procedure only requires a single row for the Replication Agent to process, no matter how many rows are affected by it. How this is achieved, as well as the benefits and drawbacks, is discussed in the Procedure Replication section.

Note that in the above list, nowhere does it say that enabling replication slows down the primary by resorting to all deferred updates vs. in-place updates. The reason is that this was always a myth. While an update will generate two log records, for the before and after images respectively, the actual modification can be a normal update vs. a deferred one. Unfortunately, the existence of the two log records has led many to mistakenly assume that replication reverts to deferred updates.

Replication Agent Communications

The Rep Agent connects to the Replication Server in PASSTHRU mode. A common question is "What does it mean by passthru mode?" The answer lies in how the server responds to packets. In passthru mode, a client can send multiple packets to the server without having to wait for the receiver to process them fully. However, they do have to synchronize periodically for the client to receive error messages and statuses. A way to think of it is that the client can simply start sending packets to the server, and as soon as it receives packet acknowledgement from the TDS network listener, it can send the next packet. Asynchronously, the server can begin parsing the message. When the client is done, it sends an End-Of-Message (EOM) packet that tells the server to process the message and respond with status information. By contrast, typical client connections to Adaptive Server Enterprise are not passthru connections; consequently, the ASE server processes the commands immediately on receipt and passes the status information back to the client. This technique provides the Rep Agent/Rep Server communication with a couple of benefits:

- The Rep Agent doesn't have to worry if the LTL command spans multiple packets.
- The destination server can begin parsing the messages (but not executing them) as they are received, achieving greater parallelism between the two processes.

If the Rep Agent configuration batch_ltl is true, the Rep Agent will batch LTL to optimize network bandwidth (although the TDS packet size is not configurable prior to ASE 12.5). If not, as each LTL row is created, it is sent to the Rep Server. In either case, the messages are sent via passthru mode to the Rep Server. Every 2K, the Rep Agent synchs with the Rep Server by sending an EOM (at an even command boundary - an EOM cannot be placed in the middle of an LTL command).

Replication Agent Tuning

Prior to ASE 12.5, the Replication Agent thread embedded inside ASE could not be tuned much. As this was a frequent cause of criticism, ASE engineering added several new configuration parameters to the replication agent. Some of these new parameters, as well as other pre-existing parameters, are listed below. Each entry shows the parameter, its default and suggested settings, the ASE version in which it was introduced, and a description.

batch ltl  (ASE 11.5*)
    Default: True; Suggest: True (verify)
    Specifies whether RepAgent sends LTL commands to Replication Server in batches or one
    command at a time. When set to "true", the commands are sent in batches. The default is
    "false" according to the manuals; however, in practice, most current ASEs default this to true.


connect database  (ASE 11.5)
    Default: [dbname]; Suggest: [dbname]
    Specifies the name of the temporary database RepAgent uses when connecting to Replication
    Server in recovery mode. This is the database name RepAgent uses for the connect source
    command; it is normally the primary database.

connect dataserver  (ASE 11.5)
    Default: [dsname]; Suggest: [dsname]
    Specifies the name of the data server RepAgent uses when connecting to Replication Server in
    recovery mode. This is the data server name RepAgent uses for the connect source command; it
    is normally the data server for the primary database.

data limits filter mode  (ASE 12.5)
    Default: stop or off; Suggest: truncate
    Specifies how RepAgent handles log records containing new, wider columns and parameters, or
    larger column and parameter counts, before attempting to send them to Replication Server.
        off - RepAgent allows all log records to pass through.
        stop - RepAgent shuts down if it encounters log records containing wide data.
        skip - RepAgent skips log records containing wide data and posts a message to the error log.
        truncate - RepAgent truncates wide data to the maximum the Replication Server can handle.
    Warning! Sybase recommends that you do not use the "data_limits_filter_mode, off" setting with
    Replication Server version 12.1 or earlier as this may cause RepAgent to skip or truncate wide
    data, or to stop. The default value of data limits filter mode depends on the Replication
    Server version number. For Replication Server versions 12.1 and earlier, the default value is
    "stop." For Replication Server versions 12.5 and later, the default value is "off."

fade_timeout  (ASE 11.5*)
    Default: 30
    Specifies the amount of time after the Rep Agent has reached the end of the transaction log
    and no activity has occurred before the Rep Agent will fade out its connection to the
    Replication Server. This command is still supported as of ASE 12.5.2, although it is not
    reported when executing sp_config_rep_agent to get a list of configuration parameters and
    their values.

ha failover  (ASE 12.0)
    Default: true; Suggest: true
    Specifies whether, when Sybase Failover has been installed, RepAgent automatically starts
    after server failover. The default is "true."

msg confidentiality  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether to encrypt all messages sent to Replication Server. This option requires the
    Replication Server Advanced Security option as well as the Security option for ASE to enable
    SSL-based data encryption.

msg integrity  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether all messages exchanged with Replication Server should be checked for
    tampering. This option requires the Replication Server Advanced Security option as well as the
    Security option for ASE to enable SSL-based data integrity.

msg origin check  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether to check the source of each message received from Replication Server.

msg out-of-sequence check  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether to check the sequence of messages received from Replication Server.


msg replay detection  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether messages received from Replication Server should be checked to make sure
    they have not been intercepted and replayed.

mutual authentication  (ASE 12.0)
    Default: false; Suggest: false
    Specifies whether RepAgent should require mutual authentication checks when connecting to
    Replication Server. This option is not implemented.

priority  (ASE 12.5)
    Default: 5; Suggest: 4
    The thread execution priority for the Replication Agent thread within the ASE engine.
    Accepted values are 4-6 with the default being 5.

retry_time_out  (ASE 11.5*)
    Default: 60
    Specifies the number of seconds RepAgent sleeps before attempting to reconnect to Replication
    Server after a retryable error or when Replication Server is down. The default is 60 seconds.

rs servername  (ASE 11.5*)
    The name of the Replication Server to which RepAgent connects and transfers log transactions.
    This is stored in the sysattributes table.

rs username  (ASE 11.5*)
    The new or existing user name that the RepAgent thread uses to connect to Replication Server.
    This is stored in the sysattributes table.

rs password  (ASE 11.5*)
    The new or existing password that RepAgent uses to connect to Replication Server. This is
    stored in encrypted form in the sysattributes table. If network-based security is enabled and
    you want to establish unified login, you must specify NULL for repserver_password when
    enabling RepAgent at the database.

scan_batch_size  (ASE 11.5*)
    Default: 1000; Suggest: 10,000+ for high volume systems only
    Specifies the maximum number of log records to send to Replication Server in each batch. When
    the maximum number of records is met, RepAgent asks Replication Server for a new secondary
    truncation point. The default is 1000 records. This should not be adjusted for low volume
    systems.

scan_time_out  (ASE 11.5*)
    Default: 15; Suggest: 5
    Specifies the number of seconds that RepAgent sleeps once it has scanned and processed all
    records in the transaction log and Replication Server has not yet acknowledged previously sent
    records by sending a new secondary truncation point. RepAgent again queries Replication Server
    for a secondary truncation point after scan_timeout seconds. The default is 15 seconds.
    RepAgent continues to query Replication Server until Replication Server acknowledges
    previously sent records either by sending a new secondary truncation point or extending the
    transaction log. If Replication Server has acknowledged all records and no new transaction
    records have arrived at the log, RepAgent sleeps until the transaction log is extended.

schema_cache_growth_factor  (ASE 12.5)
    Default: 1; Suggest: 1-3
    Controls the duration of time table or stored procedure schema can reside in the RepAgent
    schema cache before expiring. Larger values mean a longer duration and require more memory.
    Range is 1 to 10. This is a factor, so setting it to 2 doubles the size of the schema cache.

security mechanism  (ASE 12.0)
    Specifies the network-based security mechanism RepAgent uses to connect to Replication Server.


send_buffer_size  (ASE 12.5)
    Default: 2K; Suggest: 8-16K
    Determines both the size of the internal buffer used to buffer LTL as well as the packet size
    used to send the data to the Replication Server. Accepted values are 2K, 4K, 8K, or 16K (case
    insensitive), with a default of 2K. Larger send buffer sizes will reduce network traffic, as
    the Rep Agent has to do fewer sends. Note that this is not tied to the ASE server page size.

send maint xacts to replicate  (ASE 11.5*)
    Default: false; Suggest: false (don't change)
    Specifies whether RepAgent should send records from the maintenance user to the Replication
    Server for distribution to subscribing sites. The default is "false."

send structured oqids  (ASE 12.5)
    Default: false; Suggest: true
    Specifies whether the Replication Agent will send queue IDs (OQIDs) to the Replication Server
    as structured tokens or as binary strings (the default). Since every LTL command contains the
    OQID, this has the ability to significantly reduce network traffic. Valid values are
    true/false; the default is false.

send_warm_standby_xacts  (ASE 11.5*)
    Default: false for most, true for Warm Standby
    Specifies whether RepAgent sends information about maintenance users, schema, and system
    transactions to the warm standby database. This option should be used only with the RepAgent
    for the currently active database in a warm standby application. The default is "false."

short ltl keywords  (ASE 12.5)
    Default: false**; Suggest: false** (effectively true)**
    Similar to "send structured oqids", this specifies whether the Replication Agent will use
    abbreviated LTL keywords to reduce network traffic. LTL keywords are commands, subcommands,
    etc. The default value is "false."

skip ltl errors  (ASE 11.5)
    Default: false; Suggest: false
    Specifies whether RepAgent ignores errors in LTL commands. This option is normally used in
    recovery mode. When set to "true," RepAgent logs and then skips errors returned by the
    Replication Server for distribute commands. When set to "false," RepAgent shuts down when
    these errors occur. The default is "false."

skip unsupported features  (ASE 11.5)
    Default: false; Suggest: false
    Instructs RepAgent to skip log records for Adaptive Server features unsupported by the
    Replication Server. This option is normally used if Replication Server is a lower version
    than Adaptive Server. The default is "false."

trace flags  (ASE 11.5*)
    Default: 0
    This is a bitmask of the RepAgent traceflags that are enabled. The valid traceflags are in the
    range 9201-9220 (not all values are valid).

trace log file  (ASE 11.5*)
    Default: null; Suggest: [filename as needed]
    Specifies the full path to the file used for output of the Replication Agent trace activity.

Traceoff  (ASE 11.5*)
    Disables Replication Agent tracing activity.

Traceon  (ASE 11.5*)
    Enables Replication Agent tracing activity. Could severely degrade Rep Agent performance due
    to file I/O.

unified login  (ASE 12.0)
    Default: false; Suggest: false
    When a network-based security system is enabled, specifies whether RepAgent seeks to connect
    to other servers with a security credential or password. The default is "false."

* Some parameters above are noted as having been first implemented in ASE 11.5. This is due to the fact that ASE 11.5 was the first ASE with the Rep Agent thread internalized. Prior to ASE 11.5, an external Log Transfer Manager (LTM) was used - it had similar parameters to those above, but sometimes used different names.

** In ASE 12.5.0.1, the short_ltl_keywords parameter seemed to operate in reverse - setting ltl_short_keywords to true resulted in the opposite of what was expected. See the example later. However, this may be fixed in a later EBF - if so, whether using this parameter or not, corrective action may be required.
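Taken together, the highlighted settings translate into a fairly small set of sp_config_rep_agent calls. The following is only a sketch of applying several of the suggested values to a hypothetical primary database named pubs2 - verify the exact option-name spellings accepted by your ASE version, and note that some changes (such as the send buffer size) generally require the RepAgent to be restarted:

    -- hedged example: apply several of the suggested RepAgent settings
    sp_stop_rep_agent pubs2
    go
    sp_config_rep_agent pubs2, 'batch ltl', 'true'
    sp_config_rep_agent pubs2, 'scan batch size', '10000'      -- high volume systems only
    sp_config_rep_agent pubs2, 'send buffer size', '8K'
    sp_config_rep_agent pubs2, 'send structured oqids', 'true'
    go
    sp_start_rep_agent pubs2
    go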

In the above tables, several of the configuration parameters that will have the most impact on performance have been highlighted. A discussion about these is not included here as, in each case, a suggested configuration setting is mentioned. While your optimal configuration may differ, these are a good starting point. In addition, a couple of the new parameters take a bit more explanation and are detailed in the following paragraphs.

Scan_Batch_Size

As mentioned in the description, in high volume environments, setting scan_batch_size higher can have a noticeable improvement on Replication Agent throughput. The reason should be clear from the description - the RepAgent stops scanning to request a secondary truncation point less often. However, in very low volume environments, this setting should be left at the default or possibly decreased. The reason is that when the RepAgent reaches the end of the log portion it was scanning, it checks to see if the log has been extended. If so, it simply starts scanning again - while not starting over, it does so without requesting a secondary truncation point if the scan_batch_size has not been reached. Consequently, if the system is experiencing trickle transactions which always extend the log, but at a low enough volume that it would take hours or days to reach the scan_batch_size, the secondary truncation point may not move during that time period - significantly impacting log space.

For example, one customer had a number of larger OLTP systems and the usual collection of lesser volume systems. In an attempt to adopt standard configurations (always a hazardous task), they had adopted a scan_batch_size setting of 20,000 as it did benefit the larger systems. However, in one of the lesser systems, the transaction log started filling and could not be truncated. It turned out that the system only had about 140 transactions per hour, which would take about 48 days to reach the 20,000 batch size - at which point the secondary truncation point would finally be moved. Ouch!! Consequently, while adjusting scan_batch_size (and other settings) to drastically higher values may help in high-volume situations, take care in assuming that these settings can be adopted as standard configurations and applied unilaterally.

Rep Agent Priority

Beyond a doubt, the most frequently asked for feature for the ASE Replication Agent thread was the ability to increase its priority. As of ASE 12.5, this is possible. Within ASE, there are 8 priority levels, with the lower levels having the higher execution priority (similar to operating system priorities). These levels are:

    Level   Priority    Priority Class
    -----   --------    -----------------------------------------------
    0       Kernel      Kernel Processes
    1       Reserved
    2       Reserved
    3       Highest     Rep Agent highest in 12.5
    4       High        EC1 Execution Class
    5       Medium      EC2 Execution Class - default for all users/processes
    6       Low         EC3 Execution Class
    7       Idle CPU    Maintenance Tasks, Housekeeper

As illustrated above, priorities 3-6 are the only ones associated with user tasks, with 4-6 corresponding to the Logical Process Manager's EC1-EC3 Execution Classes. Although attempted by many, the LPM EC Execution Classes did not apply to the Replication Agent threads (nor any other system threads). As a result, until ASE 12.5, there was no way to control a Replication Agent's priority.

What if more than one database is being replicated? How are the cpus distributed to avoid cpu contention, with one engine attempting to service multiple Rep Agents running at the highest priority level of 3? At start-up, the RepAgent is affinity bound to a specific ASE engine; if multiple engines are available, each RepAgent being started will be bound to the next available engine. For example, if max online engines = 4, the first RepAgent will be bound to engine 0 and the second RepAgent will be bound to engine 1. Subsequent Replication Agents are then bound in order to the engines. The RepAgent is then placed at the specified priority on the runnable queue of the affinitied engine. If ASE is unable to affinity bind the RepAgent process to any available engine, ASE error 9206 is raised.

Although a setting of 3 allows a Replication Agent thread to be scheduled more often than user threads, care should be taken to avoid monopolizing a cpu. The best approach for an OLTP system is to set the priority initially to 4 and see how far the Rep Agent lags (after getting caught up in the first place). Then, only if necessary, bump the priority up to 3. If user processes begin to suffer, then additional cpus and engines may have to be added to the primary to avoid Rep Agent lag while maintaining performance. There is a word of caution about this - you may not see any improvement in performance by raising the execution priority in current ASE releases, as the main bottleneck isn't the ASE cpu time, but rather the ASE internal scheduling for network access and the RS ability to process the inbound data to the queue fast enough. Consequently, changing the priority will only have a positive effect when the ASE engine cpu time is being monopolized by user queries. This can be determined by monitoring monProcessWaits for the RepAgent spid/kpid (a query sketch follows the list below). If a significant amount of time is spent waiting on the cpu (WaitEventIDs 214 & 215), increasing the priority of the RepAgent may help. If not, increasing the priority will do little as the actual cause is elsewhere.

Send_buffer_size

As noted above, the send_buffer_size parameter really affects three things:

1. The size of the internal buffer used to hold LTL until sent to the Replication Server
2. The amount of LTL sent each time
3. The packet size used to communicate with the Replication Server
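As referenced in the priority discussion above, the cpu-wait check can be made directly against monProcessWaits. The following is only a sketch - the filters used to locate the RepAgent row in sysprocesses (the database id and the program_name pattern) are assumptions that should be verified on your system, and the WaitEventID values are simply those cited above:

    -- hedged sketch: how much time is this database's RepAgent spending waiting on the cpu?
    select w.SPID, w.WaitEventID, w.Waits, w.WaitTime
      from master..monProcessWaits w, master..sysprocesses p
     where p.spid = w.SPID
       and p.kpid = w.KPID
       and p.dbid = db_id('pubs2')                      -- hypothetical primary database
       and upper(p.program_name) like '%REP AGENT%'     -- assumption: how the RepAgent appears
       and w.WaitEventID in (214, 215)                  -- cpu-related waits per the text above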

The last has been an extremely frequent request - to be able to control the size of the packets the Replication Agent uses, similar to the db_packet_size DSI tuning parameter. It should be noted that the earlier LTMs already had an internal buffer of 16K; however, when the Replication Agent was internalized in ASE 11.5, this buffer was reduced to 2K - more than likely to reduce the latency during low to mid volume situations. Consequently, before the packet size could be adjusted, the internal buffer also had to be adjusted. By allowing the user to specify the size of the internal buffer/packet size, optimal network utilization can be achieved.

While the 2K setting at first glance may seem the logical choice, for high volume systems it may not be the optimal setting. The transport layer limits the TCP packet size to the maximum network interface frame size to avoid fragmentation. In terms of effort, significant work is involved in preparing data for transfer. The process of dividing data into multiple packets for transfer, managing the TCP/IP layers and handling network interrupts requires significant CPU involvement. The more data is segmented into packets, the more CPU resources are needed. As a result, the maximum frame size supported by the networking link layer has an impact on CPU utilization. TCP/IP typically penalizes systems that transmit a large number of small packets.

Additionally, within the Replication Server, the processing of the Replication Agent user thread and SQM is nearly synchronous for recovery reasons. The Replication Server does not acknowledge that the data from the Replication Agent has been received until it has been written to disk. As a result, even without the scan_batch_size, there is an implicit sync point every 2K of data from servers prior to ASE 12.5. If a new segment needs to be allocated, this could involve an update to the RSSD to record the new space allocation. As a result, by increasing the send_buffer_size, the number of sync points is decreased and overall network efficiency is improved. To aid in this, ASE 12.0.0.7+ and 12.5.0.3+ added several new sysmon counters. These counters are described in much more detail in the "Replication Agent Troubleshooting: Using sp_sysmon" section below.

Structured Tokens

Heterogeneous Replication Agents have had the capability for a while to send the Replication Server structured tokens and shortened key words. Structured tokens are a mechanism for dramatically reducing the network traffic caused by replication, specifically by reducing the amount of overhead in the LTL protocol and compressing the data values. In the full structured token implementation, this is achieved in a number of ways, including using shortened LTL key words, structured tokens for data values, etc. As of ASE 12.5, some of these capabilities have been introduced in the Replication Agent thread internal to ASE. These two new parameters, send_structured_oqids and short_ltl_keywords, focus strictly on reducing the overhead of the LTL protocol and do not attempt to reduce the actual column values themselves. For example, using short LTL keywords, the distribute command is represented by the token _ds. While a savings of 7 bytes for one command may not appear that great, the average LTL distribute command would be shortened by a total of 20 bytes.

For example, let's say we want to add this white paper to the list of titles in pubs2 (ignoring the author referential integrity to keep things simple). We would use the following SQL statements:
Begin tran add_book
Insert into publishers values ('9990', 'Sybase, Inc.', 'Dublin', 'CA')
Insert into titles (title_id, title, type, pub_id, price, advance, total_sales,
                    notes, pubdate, contract)
    values ('PC9900', 'Replication Server Performance & Tuning', 'popular_comp', '9990',
        0.00,   -- free to all good Sybase customers
        0.00,   -- contrary to belief, we didn't get paid extra
        100,    -- make up a number for number of times downloaded
        'This what happens on sabbaticals taken by geeks - and why Sybase still offers them',
        'November 1, 2000',
        0)      -- we wish - make us an offer
commit tran

Tracing the LTL under normal replication (see below), we get the following LTL stream:
REPAGENT(4): [2002/09/08 17:55:12.23] The LTL packet sent is of length 1097. REPAGENT(4): [2002/09/08 17:55:12.23] _ds 1 ~*620020908 17:55:32:543,4 0x000000000000445800000c40000300000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _bg tran ~")add_book for ~"#sa _ds 4 0x000000000000445800000c40000400000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"+publishers.~!*rs_insert _yd _af ~$'pub_id=~"%%9990,~$)pub_name=~"-Sybase, Inc.,~$%%city=~"'Dublin,~$&state=~"#CA _ds 4 0x000000000000445800000c40000500000c40000 REPAGENT(4): [2002/09/08 17:55:12.23] 3000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _ap owner =~"$dbo ~"'titles.~!*rs_insert _yd _af ~$)title_id=~"'PC9900,~$&title=~"HReplication Server Performance & Tuning,~$%%type=~"popular_comp,~$'pub_id=~"%%9990,~$&price=~(($0.0000,~$(advance=~(($0.0000,~$,total_sales=100 ,~$&notes=~#"3This what happens on sabbaticals taken by geeks - and why Sybase still offers them,~$(pubdate=~*620001101 00:00:00:000,~$)contract=0 _ds 1 ~*620020908 17:55:32:543,4 REPAGENT(4): [2002/09/08 17:55:12.23] 0x000000000000445800000c40000700000c400003000092810127681300000000,6 0x000000000000445800034348494e4f4f4b7075627332 _cm tran

Turning on both short_ltl_keywords and structured oqids, we get the following:


REPAGENT(4): [2002/09/08 17:55:46.24] The LTL packet sent is of length 958. REPAGENT(4): [2002/09/08 17:55:46.24] distribute 1 ~*620020908 17:55:45:543,4 ~,A[000000000000]DX[00000c]@[00]'[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 begin transaction ~")add_book for ~"#sa distribute 4 ~,A[000000000000]DX[00000c]@[00]([00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"+publishers.~!*rs_insert yielding after ~$'pub_id=~"%%9990,~$)pub_name=~"-Sybase, Inc.,~$%%city=~"'Dublin,~$&state=~"#CA distribute 4 ~,A[0000] REPAGENT(4): [2002/09/08 17:55:46.24] [00000000]DX[00000c]@[00])[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 applied owner =~"$dbo ~"'titles.~!*rs_insert yielding after ~$)title_id=~"'PC9900,~$&title=~"HReplication Server Performance & Tuning,~$%%type=~"popular_comp,~$'pub_id=~"%%9990,~$&price=~(($0.0000,~$(advance=~(($0.0000,~$,total_sales=100 ,~$&notes=~#"3This what happens on sabbaticals taken by geeks - and why Sybase still offers them,~$(pubdate=~*620001101 00:00:00:000,~$)con REPAGENT(4): [2002/09/08 17:55:46.24] tract=0 distribute 1 ~*620020908 17:55:45:543,4 ~,A[000000000000]DX[00000c]@[00]+[00000c]@[00]'[0000928101]'wO[00000000],6 ~,7[000000000000]DX[00]'CHINOOKpubs2 commit transaction

** A couple of comments - this is ASE 12.5 LTL (version 300); some examples in this document use older LTL versions and were traced from the EXEC module - consequently, they may look slightly different.

As you can see from the first example, with short_ltl_keywords set to false, the LTL command verbs are replaced with what look almost like abbreviations. As mentioned in the table, the false setting appears to be backwards for short_ltl_keywords, as setting it to true along with structured oqids results in the second sequence. Note that the column names, datatype tokens, length tokens and data values remain untouched in both streams. The LAN replication agent used for heterogeneous replication is capable of stripping out the column names, as it reads the column order from the replication definition and formats the columns in the stream accordingly.

Schema Cache Growth Factor

As mentioned earlier, the Rep Agent contains two caches - a schema cache and a transaction cache. The transaction cache is used to store open transactions. The other cache (the topic of this section) basically caches components from sysobjects and syscolumns. It used to be (11.x) made up from proc cache; however, as of 12.0, it uses its own memory outside of the main ASE pool. Each cache item essentially is a row from sysobjects and the associated child rows from syscolumns in a hash tree. Accordingly, it follows an LRU/MRU chain much like any cache in ASE - consequently, more frequently hit tables will be in cache while those hit infrequently will get aged out. When the Rep Agent reads a DML before/after image from the log, it first checks this cache. If the object is not found, then it has to do a lookup in sysobjects and syscolumns (hopefully in metadata cache and not physical i/o - a hash table lookup in the schema cache is quicker than a logical i/o in metadata cache).

The schema cache can "grow" in one of two ways: (A) a large number of objects are replicated and the transaction distribution is fairly even across all objects (rare - most transactions only impact <10 tables), and/or (B) the structure of tables/columns is being modified. You can watch the growth with RA trace 9208 - if it stays consistent then you are fine. Customers with the most issues similar to (A) are those replicating a lot of procs, as you can have a lot of procs modifying a small number of tables. Customers with the most issues similar to (B) are those that tend to change the DDL of tables/procs frequently. The reason is that the RA needs to send the correct version of the schema at the point that the DML happened. As a result, if you insert a row, add another column, and then insert another row, the RA needs to send the appropriate info for each - i.e. not send the new column for the old row, nor ignore it for the new one. As a result, the schema cache may grow (somewhat).

The RepAgent config "schema cache growth factor" is a factor - not a percentage - consequently it is extremely sensitive. In other words, setting it to 2 doubles the size of the cache, while 3 triples the size. Depending on the hardware platform, other processing on the box, etc., anything above 3 may not be recommended. Hence, unless you have over 100 objects being replicated per database, setting this above 1 is probably useless.

Replication Agent Troubleshooting

There are several commands for troubleshooting the Rep Agent. At a basic level, sp_help_rep_agent can help track where in the log and how much of the log the Rep Agent is processing. However, for performance related issues, the sp_sysmon repagent report or the MDA-based monProcessWaits table are the best bets.

RepAgent Trace Flags

For tougher problems, several trace flags exist:

    Trace Flag    Trace Output
    ----------    ----------------------------------------------
    9201          Traces LTL generated and sent to RS
    9202          Traces the secondary truncation point position
    9203          Traces the log scan
    9204          Traces memory usage
    9208          Traces schema cache growth

Output from the trace flags is written to the specified output file. The trace flags and output file are specified using the normal sp_config_rep_agent procedure, as in the following:

sp_config_rep_agent <db_name>, trace_log_file, <filepathname>
sp_config_rep_agent <db_name>, traceon, 9204
-- monitor for a few minutes
sp_config_rep_agent <db_name>, traceoff, 9204

However, tracing the Rep Agent has a considerable performance impact as the Rep Agent must also write to the file. In the case of LTL tracing (9201), this can be considerable. As a result, Rep Agent trace flags should only be used when absolutely necessary. For NT, note that you will need to escape the file path with a double backslash, as in:
exec sp_config_rep_agent pubs2, 'trace_log_file', 'c:\\ltl_verify.log'

Determining RepAgent Latency

Another useful command when troubleshooting the Replication Agent is the sp_help_rep_agent procedure call. Of particular interest are the columns that report the transaction log endpoints and the Rep Agent position - the start marker, end marker, and current marker. The problem is that these are reported as logical pages on the virtual device(s). This can lead to frequent accusations that the Replication Agent is always "n" GB behind. Remember, logical page ids are assigned in device fragment order. Consider the following example database creation script (assume a 2K page server):
create database sample_db
    on data_dev_01 = 4000
    log on log_dev_01 = 250
go
alter database sample_db
    on data_dev_02 = 4000
    log on log_dev_01 = 2000
go

This would more than likely result in a sysusages similar to (dbid for sample_db=6):
Dbid    Segmap    Lstart     Size       Vstart
----    ------    -------    -------    ------
6       3         0          2048000    (...)
6       4         2048000    128800     (...)
6       3         2176800    2048000    (...)
6       4         4224800    1024000    (...)

Executing sp_help_rep_agent sample_db could yield the following marker positions:


Start Marker    End Marker    Current Marker
------------    ----------    --------------
2148111         4229842       2166042

Those quicker with the calculator than familiar with the structure of the log would erroneously conclude that the Rep Agent is running ~4GB behind (4229842-2166042=2063800; 2063800/512=4031MB) - a good trick when the transaction log is only slightly bigger than 2GB. In reality, the Replication Agent is only 31MB behind ( (4229842-4224800)+(2176800-2166042) = 5042+10758 = 15800; 15800/512=31 ) - assuming that the end marker points to the final page of the log.

One of the most misunderstood aspects of sp_help_rep_agent is the scan output - the markers (as listed above) as well as the log records scanned. For the first part, once the XLS wakes the Rep Agent from a sleeping state, the start and end markers are set to the current log positions. The Rep Agent commences scanning from that point. As it nears the end marker, it requests an update - and may get a new end marker position. The log records scanned works similarly but on a more predictable basis. If you remember, one of the Rep Agent configuration settings is scan batch size, which has a default value of 1,000. At the default value, monitoring this value can be extremely confusing; however, setting scan batch size to a more reasonable value of 10,000 or 25,000 clears it up. What the scan section of sp_help_rep_agent is reporting in the log records scanned is the number of records scanned towards the scan batch size. Once the scan batch size number of records is reached, the counter is reset. This is what causes the confusion - particularly when just using the default, as the Rep Agent is capable of scanning 1,000 records a second from the transaction log. Some administrators have attempted to run sp_help_rep_agent every second and were extremely surprised to see little or no change in the log records scanned (or even a drop). The reason is that the Rep Agent was working on subsequent scan batches. Consider the following output from a sample scan:
start marker    end marker      current marker    log rec    recs/    scan    tot
                                                  scanned    sec      cnt     recs
------------    ------------    --------------    -------    -----    ----    ------
(133278,22)     (134923,20)     (133493,3)           3587        0       1      3587
(133278,22)     (134923,20)     (133594,15)          4807     1220       1      4807
(133278,22)     (134923,20)     (133681,14)          5841     1034       1      5841
(133278,22)     (134923,20)     (133765,11)          6849     1008       1      6849
(133278,22)     (134923,20)     (133849,5)           7856     1007       1      7856
(133278,22)     (134923,20)     (133931,9)           8810      954       1      8810
(133278,22)     (134923,20)     (134037,6)          10083     1273       1     10083
(133278,22)     (134923,20)     (134116,15)         11038      955       1     11038
(133278,22)     (134923,20)     (134201,15)         12048     1010       1     12048
(133278,22)     (134923,20)     (134294,7)          13163     1115       1     13163
(133278,22)     (134923,20)     (134378,3)          14171     1008       1     14171
(133278,22)     (134923,20)     (134471,19)         15286     1115       1     15286
(133278,22)     (134923,20)     (134562,8)          16375     1089       1     16375
(133278,22)     (134923,20)     (134658,5)          17516     1141       1     17516
(133278,22)     (134923,20)     (134726,20)         18341      825       1     18341
(133278,22)     (134923,20)     (134824,0)          19509     1168       1     19509
(133278,22)     (134923,20)     (134902,5)          20437      928       1     20437
(134923,20)     (137410,2)      (135000,9)          21605     1168       1     21605
(134923,20)     (137410,2)      (135084,5)          22613     1008       1     22613
(134923,20)     (137410,2)      (135169,0)          23621     1008       1     23621
(134923,20)     (137410,2)      (135266,5)          24790     1169       1     24790
(134923,20)     (137410,2)      (135371,23)          1061     1271       2     26061
(134923,20)     (137410,2)      (135447,23)          1963      902       2     26963
(134923,20)     (137410,2)      (135549,13)          3184     1221       2     28184
(134923,20)     (137410,2)      (135642,7)           4300     1116       2     29300
(134923,20)     (137410,2)      (135725,0)           5283      983       2     30283
(134923,20)     (137410,2)      (135815,12)          6371     1088       2     31371
(134923,20)     (137410,2)      (135904,23)          7433     1062       2     32433
(134923,20)     (137410,2)      (135985,11)          8389      956       2     33389
(134923,20)     (137410,2)      (136091,11)          9663     1274       2     34663
(134923,20)     (137410,2)      (136188,15)         10832     1169       2     35832
(134923,20)     (137410,2)      (136277,23)         11894     1062       2     36894
(134923,20)     (137410,2)      (136370,17)         13009     1115       2     38009
(134923,20)     (137410,2)      (136451,5)          13965      956       2     38965
(134923,20)     (137410,2)      (136566,0)          15346     1381       2     40346
(134923,20)     (137410,2)      (136636,16)         16195      849       2     41195
(134923,20)     (137410,2)      (136712,18)         17098      903       2     42098
(134923,20)     (137410,2)      (136803,8)          18187     1089       2     43187
(134923,20)     (137410,2)      (136895,11)         19275     1088       2     44275
(134923,20)     (137410,2)      (136980,9)          20284     1009       2     45284
(134923,20)     (137410,2)      (137068,17)         21346     1062       2     46346
(134923,20)     (137410,2)      (137161,11)         22463     1117       2     47463
(134923,20)     (137410,2)      (137244,9)          23447      984       2     48447
(134923,20)     (137410,2)      (137339,7)          24588     1141       2     49588
(137410,2)      (137416,9)      (137416,9)            513      925       3     50513

Before asking where the 3 right-most columns are in your sp_help_rep_agent output - the above output is from a modified version of sp_help_rep_agent. The output was taken from an NT stress test of a 50,000 row update done by 10 parallel tasks, with the Rep Agent configured for a scan batch size of 25,000. As you can see from the earlier rows, as the current marker approached the end marker, the end marker was updated. The later rows illustrate the scan batch size rollover effect on log recs scanned.

Adding more fun to the problem of determining the latency is the fact that the transaction log is a circular log; consequently, it is possible for the markers to have wrapped around. The logic for calculating the latency is (a query sketch follows this list):

1. Determine the distance from the current marker to the end of its segment in sysusages.
2. Add the space for all other log segments between the current segment and the segment containing the end marker.
3. Add the distance of the end marker from the start of the end marker's segment.
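The following is only a rough T-SQL sketch of that calculation. It assumes a 2K page server, that no log wraparound has occurred, and that the marker values (taken here from the sp_help_rep_agent example above) are plugged in by hand:

    -- hedged sketch: estimate RepAgent lag in pages/MB from sysusages and the scan markers
    declare @dbid int, @current_page int, @end_page int
    declare @cur_frag_end int, @end_frag_start int, @between int

    select @dbid = db_id('sample_db'), @current_page = 2166042, @end_page = 4229842

    -- end of the log fragment containing the current marker
    select @cur_frag_end = lstart + size
      from master..sysusages
     where dbid = @dbid and @current_page >= lstart and @current_page < lstart + size

    -- start of the log fragment containing the end marker
    select @end_frag_start = lstart
      from master..sysusages
     where dbid = @dbid and @end_page >= lstart and @end_page < lstart + size

    -- whole log fragments (segmap 4) lying between the two markers
    select @between = isnull(sum(size), 0)
      from master..sysusages
     where dbid = @dbid and segmap & 4 = 4
       and lstart >= @cur_frag_end and lstart < @end_frag_start

    select lag_pages = (@cur_frag_end - @current_page) + @between + (@end_page - @end_frag_start),
           lag_MB    = ((@cur_frag_end - @current_page) + @between + (@end_page - @end_frag_start)) / 512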

Unfortunately, there isn't a built-in function that returns the last log page. If there are not any other open transactions, perhaps the easiest way is to begin a transaction, update some row and then check in master..syslogshold. Otherwise, one way to find the last log page is to use dbcc log as in:
use pubs2
go
begin tran mytran
rollback tran
go
dbcc traceon(3604)
go
-- dbid=5, obj=0, page=0, row=0, recs = last one, all recs, header only
dbcc log(5, 0, 0, 0, -1, -1, 1)
go

DBCC execution completed. If DBCC printed error messages, contact a user with
System Administrator (SA) role.

LOG SCAN DEFINITION:
    Database id : 5
    Backward scan: starting at end of log maximum of 1 log records.

LOG RECORDS:
    ENDXACT (13582,14) sessionid=13582,13
        attcnt=1 rno=14 op=30 padlen=0 sessionid=13582,13 len=28 odc_stat=0x0000 (0x0000)
        loh_status: 0x0 (0x00000000)
        endstat=ABORT
        time=Oct 22 2004 11:26:39:166AM
        xstat=0x0 []

Total number of log records 1
DBCC execution completed. If DBCC printed error messages, contact a user with
System Administrator (SA) role.
Normal Termination
Output completed (0 sec consumed).

Here the first number in parentheses is the current log page (and row) - note that the sessionid points to the log page and row where the transaction begin record is (since this was an empty transaction, it is immediately preceding). Another alternative, which instead measures the difference in time since the secondary truncation point was last updated, is the following query:
-- executed from the current database
select db_name(dbid), stp_lag = datediff(mi, starttime, getdate())
  from master..syslogshold
 where name = '$replication_truncation_point'
   and dbid = db_id()

This tells how far behind in minutes the Replication Server is (kind of). The problem is that it can be highly inaccurate. Remember, the STP points to the page containing the oldest open transaction that the Replication Server has not yet fully processed. If a user began a transaction and went to lunch, the STP won't move until the transaction is committed. Unfortunately, this may give the impression that the Replication Agent is lagging, when in reality the current marker value may be very near the end of the transaction log. The second reason that this can be inaccurate is a matter of interpretation. Simply because the STP and the current oldest open transaction are 30 minutes apart does not mean that the Rep Agent will take 30 minutes to scan that much of the log - consider what happens if the Rep Agent is down, or on a low volume system. Consequently, the suggestion made earlier is to either invoke your own transaction (add a where clause of 'spid = @@spid' to the above) or just grab the latest one and hope that it isn't a user gone to lunch.

Using sp_sysmon

Most DBAs are familiar with sp_sysmon - until the advent of the MDA monitoring tables in 12.5.0.3, this procedure was the staple for most database monitoring efforts (unfortunately so, as Historical Server provided more useful information and yet was rarely implemented). A little known fact is that while the default output for sp_sysmon does not include RepAgent performance statistics, executing the procedure and specifically asking for the repagent report provides more detailed information than what is available via sp_help_rep_agent. The syntax is:

-- sample the server for a 1 minute period and then output the repagent report
exec sp_sysmon "00:01:00", repagent

While the output is described in chapter 5 of the Replication Server Administration Guide, some of the main points of interest are repeated below (header lines repeated for clarity):
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Log Scan Summary
    Log Records Scanned                   n/a           n/a      206739         n/a
    Log Records Processed                 n/a           n/a      105369         n/a

The log summary section is a good indicator of how much work the RepAgent is doing and how much information is being sent to the Replication Server. The difference between Log Records Scanned and Log Records Processed is fairly obvious Processed records were converted into LTL and sent to the RS. In the example above, ~50% of the log records scanned were sent to the RS.
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Log Scan Activity
    Updates                               n/a           n/a      101317         n/a
    Inserts                               n/a           n/a          19         n/a
    Deletes                               n/a           n/a           0         n/a
    Store Procedures                      n/a           n/a           0         n/a
    DDL Log Records                       n/a           n/a           0         n/a
    Writetext Log Records                 n/a           n/a           0         n/a
    Text/Image Log Records                n/a           n/a           0         n/a
    CLRs                                  n/a           n/a           0         n/a

The Log Scan Activity section contains some useful information if you think something is occurring out of the norm. While the first four are fairly obvious (updates, inserts, deletes and proc execs replicated), the last four bear some attention. DDL Log Records refers to DDL statements that were replicated - generally this should be zero, with only minor lifts in a Warm Standby when DDL changes are made, hence we exclude this from concern. Writetext Log Records shows how many writetext operations are being replicated. Text/Image Log Records is similar but a bit different in that it displays how many row images are processed (we need to confirm whether this is rs_datarow_for_writetext or the actual number of text rows). If you see a large number of text rows being replicated, you may want to investigate whether a text/image column was inappropriately marked or left at always_replicate vs. replicate_if_changed. CLRs refer to Compensation Log Records and clearly point to a design problem, as earlier discussed with indexes using ignore_dup_row or ignore_dup_key (see the Primary Database section on Batch Processing & ignore_dup_key). More detail about which tables were updated/inserted/deleted can be obtained from the MDA monitoring tables in 12.5.0.3+ - specifically the monOpenObjectActivity table, which has the following definition:
-- ASE 15.0.1 definition
create table monOpenObjectActivity (
    DBID              int,
    ObjectID          int,
    IndexID           int,
    DBName            varchar(30)  NULL,
    ObjectName        varchar(30)  NULL,
    LogicalReads      int          NULL,
    PhysicalReads     int          NULL,
    APFReads          int          NULL,
    PagesRead         int          NULL,
    PhysicalWrites    int          NULL,
    PagesWritten      int          NULL,
    RowsInserted      int          NULL,
    RowsDeleted       int          NULL,
    RowsUpdated       int          NULL,
    Operations        int          NULL,
    LockRequests      int          NULL,
    LockWaits         int          NULL,
    OptSelectCount    int          NULL,
    LastOptSelectDate datetime     NULL,
    UsedCount         int          NULL,
    LastUsedDate      datetime     NULL
)
materialized at "$monOpenObjectActivity"
go

Since the above is at the index level, though, you will need to isolate IndexID = 0 or 1 to avoid picking up rows inserted into the index nodes (which of course are not replicated); a sample query follows below. Some may have noticed in the Scan Activity that ~4,000 log records sent to the RS were not DML statements (105,369 processed - 101,336 DML = 4,033) - most of these are transaction records, as seen in the next section.
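The following is only a sketch of such a query; the db_name() filter is an assumption - adjust it to the database(s) of interest:

    -- hedged example: per-table DML activity, data layer only (IndexID 0 or 1)
    select DBName, ObjectName, RowsInserted, RowsUpdated, RowsDeleted
      from master..monOpenObjectActivity
     where DBName = db_name()
       and IndexID in (0, 1)
     order by RowsInserted + RowsUpdated + RowsDeleted desc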
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Transaction Activity
    Opened                                n/a           n/a        2015         n/a
    Commited                              n/a           n/a        2016         n/a
    Aborted                               n/a           n/a           0         n/a
    Prepared                              n/a           n/a           0         n/a
    Maintenance User                      n/a           n/a           0         n/a

Here are the missing 4,000 records - since each transaction is a begin/commit pair, 2015+2016=4031 records were sent to the Replication Server. Most of the above statistics should be fairly obvious, except Prepared, which refers to two-phased commit (2PC) prepare transaction records that are part of the commit coordination phase. Maintenance User refers of course to maintenance user applied transactions that are in turn re-replicated. Normally, this should be zero, but if a logical Warm Standby is also the target of a different replication source, then the primary database in the logical pair is responsible for re-replicating the data to the standby database. The transaction flow is SourceDB -> RS -> PrimaryDB -> RS/WS -> StandbyDB, as illustrated below:


[Figure 8 - Path for External Replicated Transactions in Warm Standby: transactions from the Remote Site are replicated directly to the HQ primary database, which then re-replicates them to its Warm Standby]
The next section of the Rep Agent sp_sysmon output is the Log Extension section:
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Log Extension Wait
    Count                                 n/a           n/a           2         n/a
    Amount of time (ms)                   n/a           n/a       14750         n/a
    Longest Wait (ms)                     n/a           n/a       14750         n/a
    Average Time (ms)                     n/a           n/a      7375.0         n/a

Here, waiting is not bad as it refers to the time that the Rep Agent was fully caught up and was waiting for more log records to be added to the transaction log. Obviously a count of zero is not desired. In the above example from a 1 minute sysmon taken during heavy update activity, the RepAgent caught up twice and waited ~7 seconds each time for more information to be added to the log. Or it seems so. In reality, the RepAgent was waiting when the sp_sysmon started, then the 100,000 updates occurred in ~2,000 transactions taking 45 seconds to process then the RepAgent was caught up again. So if doing benchmarking, remember that the RepAgent wait count may reflect the before and end state of the benchmark run. The next section of the RepAgent sp_sysmon report is only handy from the unique perspective of DDL replication.
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Schema Cache Lookups
    Forward Schema
      Count                               n/a           n/a           0         n/a
      Total Wait (ms)                     n/a           n/a           0         n/a
      Longest Wait (ms)                   n/a           n/a           0         n/a
      Average Time (ms)                   n/a           n/a         0.0         n/a
    Backward Schema
      Count                               n/a           n/a           0         n/a
      Total Wait (ms)                     n/a           n/a           0         n/a
      Longest Wait (ms)                   n/a           n/a           0         n/a
      Average Time (ms)                   n/a           n/a         0.0         n/a

When a table is altered via alter table, the RepAgent may have to scan forward/backward to determine the correct column names, datatypes, etc. to send to the Replication Server. One way to think of this is from the perspective of someone doing an alter table and then shutting down the system. On startup, the RepAgent can't just use the schema from sysobjects/syscolumns, because some of the log records may contain rows that had extra or fewer columns. Consequently, it may have to scan backwards to find the alter table record in order to determine the appropriate columns to send. Incidentally, this is done using an auxiliary scan separate from the main log scan, which is why the Rep Agent will often be seen with two scan descriptors active in the transaction log. The next section is one of the more useful:
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Truncation Point Movement
    Moved                                 n/a           n/a         107         n/a
    Gotten from RS                        n/a           n/a         107         n/a

As expected, this is reporting the number of times the RepAgent has asked the RS for a new secondary truncation point and then moved the secondary truncation point in the log. If Moved is more than one less than Gotten, the likely cause is that a large or open transaction exists from the Replication Server's perspective (either it is indeed still open in ASE or the RepAgent just hasn't forwarded the commit record yet). The number above is not necessarily high - you can gauge it by dividing the Log Records Processed by the RepAgent scan_batch_size configuration, which was the default of 1,000 in this case. With 105,000 records processed, you would expect at least 105 truncation point movements plus one when it reached the end of the log, so 107 is not abnormal. However, that is about 2/sec, so increasing scan_batch_size in this case should not have too much of a detrimental impact on recovery. Note that in this discussion we are talking about Log Records Processed and not Scanned - while the scanned count can climb quickly, the RepAgent isn't foolish enough to ask for a new truncation point every 1,000 records scanned; it is actually based off of the number of records sent to the RS.

The connection section is really only useful when you are having network problems and should be accompanied by the normal errors in the ASE errorlog:
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Connections to Replication Server
    Success                               n/a           n/a           0         n/a
    Failed                                n/a           n/a           0         n/a

The next section is also where to pay attention to network activity:
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  Network Packet Information
    Packets Sent                          n/a           n/a       16860         n/a
    Full Packets Sent                     n/a           n/a       14962         n/a
    Largest Packet                        n/a           n/a        2048         n/a
    Amount of Bytes Sent                  n/a           n/a    30955391         n/a
    Average Packet                        n/a           n/a      1836.0         n/a

In the above case, it shows that possibly bumping up the send_buffer_size may help. We were using the default 2K packets between the RepAgent and the Replication Server and nearly 90% of them were full. The bottom statistic, Average Packet, is simply Amount of Bytes Sent divided by Packets Sent and can be misleading. Remember, 107 times the RepAgent requested a new truncation point - requests sent in separate packets from the LTL buffers - requests that can skew the average. The more important statistic to watch is Full vs. Sent. For those who have been following this, this 45 second update sent ~30MB to the RS - a rate of 40MB/min, or >2GB/hr. Of course, having the RS on a box with fast CPUs greatly helped in this case, as will be discussed later. Note too that we are sending 2K buffers which include column names - hence the number of packets in this case is probably much bigger than the number of log pages scanned - perhaps 20% more. The final section of the report offers perhaps the biggest clues into why the RepAgent may be lagging:
                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
  I/O Wait from RS
    Count                                 n/a           n/a       16966         n/a
    Amount of Time (ms)                   n/a           n/a       11002         n/a
    Longest Wait (ms)                     n/a           n/a          63         n/a
    Average Wait (ms)                     n/a           n/a         0.6         n/a

In this sample, the RepAgent waited on the RS to conduct I/O nearly 17,000 times. Now then, check the above statistic on the number of packets and you will see the problem with RepAgent performance - a lot of hurry-up-and-wait. It can scan the log at fairly tremendous speed, but then has to wait for the RS to parse the LTL, normalize the LTL according to the replication definitions, pack the LTL into a binary format and send it to the SQM - an average wait of roughly 0.6ms every time a packet is sent (and yes, we did ask 2x a second for a truncation point as well). Let's take a look at an actual snapshot from a customer's system. The following statistics are from a 10 minute sp_sysmon - only the transaction and RepAgent sections are reported here:
Engine Busy Utilization         CPU Busy    I/O Busy        Idle
------------------------       --------    --------    --------
  Engine 0                         3.8 %       2.1 %      94.2 %

                                      per sec      per xact       count  % of total
                                 ------------  ------------  ----------  ----------
Transaction Summary
  Committed Xacts                         1.2           n/a         726         n/a

Transaction Detail
  Inserts
    APL Heap Table                        0.7           0.6         419      13.1 %
    APL Clustered Table                   0.8           0.6         468      14.7 %
    Data Only Lock Table                  3.8           3.2        2301      72.2 %
  -------------------------     -------------  ------------  ----------  ----------
  Total Rows Inserted                     5.3           4.4        3188      99.1 %

  Updates
    Total Rows Updated                    0.0           0.0           0         n/a
  -------------------------     -------------  ------------  ----------  ----------
  Total Rows Updated                      0.0           0.0           0       0.0 %

  Data Only Locked Updates
    Total Rows Updated                    0.0           0.0           0         n/a
  -------------------------     -------------  ------------  ----------  ----------
  Total DOL Rows Updated                  0.0           0.0           0       0.0 %

  Deletes
    APL Deferred                          0.0           0.0          20      71.4 %
    APL Direct                            0.0           0.0           4      14.3 %
    DOL                                   0.0           0.0           4      14.3 %
  -------------------------     -------------  ------------  ----------  ----------
  Total Rows Deleted                      0.0           0.0          28       0.9 %
=========================      =============  ============  ==========  ==========
  Total Rows Affected                     5.4           4.4        3216

Replication Agent
-----------------
                                             count
                                        ----------
  Log Scan Summary
    Log Records Scanned                      81061
    Log Records Processed                    19015

  Log Scan Activity
    Updates                                      0
    Inserts                                  15845
    Deletes                                      0
    Store Procedures                             0
    DDL Log Records                              0
    Writetext Log Records                        0
    Text/Image Log Records                       0
    CLRs                                         0

  Transaction Activity
    Opened                                    1585
    Commited                                  1585
    Aborted                                      0
    Prepared                                     0
    Maintenance User                             0

  Log Extension Wait
    Count                                        0
    Amount of time (ms)                          0
    Longest Wait (ms)                            0
    Average Time (ms)                          0.0

  Schema Cache Lookups
    Forward Schema
      Count                                      0
      Total Wait (ms)                            0
      Longest Wait (ms)                          0
      Average Time (ms)                        0.0
    Backward Schema
      Count                                      0
      Total Wait (ms)                            0
      Longest Wait (ms)                          0
      Average Time (ms)                        0.0

  Truncation Point Movement
    Moved                                       19
    Gotten from RS                              19

  Connections to Replication Server
    Success                                      0
    Failed                                       0

  Network Packet Information
    Packets Sent                              9794
    Full Packets Sent                         8698
    Largest Packet                            2048
    Amount of Bytes Sent                  18436223
    Average Packet                          1882.4

  I/O Wait from RS
    Count                                     9813
    Amount of Time (ms)                     107316
    Longest Wait (ms)                          400
    Average Wait (ms)                         10.9

Now the interesting thing about the above: the RepAgent was lagging, in fact it was way behind. Consider the usual suspects:

- RepAgent not scanning the transaction log fast enough: a common myth, closely followed by "a multithreaded RepAgent is needed". As you can see from above, however, the application (bcp in this case) only inserted ~3,000 rows in the 10 minutes, at a rate of about 5 rows/sec. The RepAgent processed ~15,000 inserts during the same period, about 5x that rate, so the RepAgent scan isn't the issue.
- RepAgent is contending for cpu, need to raise the priority: another commonly blamed problem (with sp_sysmon, this can now be refuted easily). Looking at the system, we see that ASE is only using ~4% of the cpu; it is idle the other 96% of the time.

The problem is in the waits on sending to the Replication Server: from the above, it waited ~2 minutes (107 seconds) of the 10 just to send. The key is the long wait and the high average wait. In fact, a 1 minute sp_sysmon showed the following:
                             per sec      per xact       count   % of total
                          ----------    ----------   ---------   ----------
  I/O Wait from RS
    Count                       n/a           n/a         4869          n/a
    Amount of Time (ms)         n/a           n/a        54363          n/a
    Longest Wait (ms)           n/a           n/a          323          n/a
    Average Wait (ms)           n/a           n/a         11.2          n/a

Ugly. The RepAgent is literally waiting 54 seconds of the 1 minute, so it is only scanning for about 6 seconds. Interestingly enough, the issue was not Replication Server: trace flags were enabled that turned RS into a data sink, with no appreciable impact on performance. The problem is believed to have been caused by the ASE scheduler not processing the RepAgent's network traffic fast enough. Upgrading from ASE 12.5.0.3 to 12.5.2 did solve the problem, and a 15 minute stress test dropped to 3 minutes. The point to be made here is that RepAgent speed is directly proportional to the speed of ASE processing the network send requests coupled with the speed of the Replication Server processing. Any contention at the inbound queue (readers delay writers), delay in getting repdefs into cache (too small an sts_cache_size), or cpu time spent on other threads (DSI, etc.) directly slows down the RepAgent. Key Concept #6: The biggest determining factor in RepAgent performance is the speed of ASE sending network data and the speed of the Replication Server; hence, from an RS perspective, fewer, faster CPUs are much better than many slow CPUs on a monster SMP machine. Additionally, enabling SMP, even for a Warm Standby, may boost RepAgent performance 10-15% by eliminating CPU contention. So don't put RS on an old Sun v880 or v440 with ancient 1.2GHz CPUs when a quad-cpu Opteron-based Linux box or a small SMP with fast CPUs can be had for less than $20,000. Even worse, don't put RS on the Sun 6900 that is apparently under-utilized because it is the DR host machine. DBAs have often fallen into the trap of buying a bigger SMP machine for the DR site and hosting both RS and the standby ASE server on it. It would not only be cheaper but also yield better performance to buy a smaller SMP machine for the standby ASE server and a 2-4 way "screamer" entry-level server for RS to run on.
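Tying this back to the packet statistics earlier (nearly 90% of the 2K packets were full), the RepAgent-side knobs involved are set with sp_config_rep_agent at the primary. The sketch below is only an assumption about the relevant parameter names and value formats (for example '16K' for send buffer size), which vary by ASE version; list the current settings first and confirm against your ASE documentation before applying anything:

-- Hedged example: review and adjust RepAgent network settings (ASE 12.5.x assumed)
use <primary_db>
go
sp_config_rep_agent <primary_db>                              -- list current settings
go
sp_stop_rep_agent <primary_db>                                -- most changes require a RepAgent restart
go
sp_config_rep_agent <primary_db>, 'send buffer size', '16K'   -- assumed parameter: larger LTL packets to RS
go
sp_config_rep_agent <primary_db>, 'scan batch size', '2000'   -- fewer truncation point requests
go
sp_start_rep_agent <primary_db>
go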

Utilizing monProcessWaits

One of the key MDA monitoring tables added in 12.5.0.3 is the monProcessWaits table. This can be especially useful for RepAgent performance analysis when determining whether the holdup is the time spent doing the log scan or the time spent waiting on the network send. To understand how to use monProcessWaits, the key is to realize that it requires at least two samples to be effective. The reason for this is that the output values are counters that are incremented continuously from server boot until shutdown. If a counter hits 2 billion, a rollover occurs and it re-increments from the rollover. Consequently, the time spent waiting is the difference between samples. The other key is that the monProcessWaits table has two parameters, KPID and SPID. Consequently, when focusing on a specific Replication Agent, it will be much faster to first retrieve the RepAgent's KPID and SPID from master..sysprocesses and supply them as SARG values, such as:
declare @ra_kpid int, @ra_spid int
select @ra_kpid = kpid, @ra_spid = spid
    from master..sysprocesses
    where program_name = "repagent"
      and dbid = db_id("<tgt_db_name>")
select * from monProcessWaits where KPID = @ra_kpid and SPID = @ra_spid
waitfor delay "00:05:00"
select * from monProcessWaits where KPID = @ra_kpid and SPID = @ra_spid

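Because the values are ever-increasing counters, it is usually more useful to compute the delta between the two samples than to eyeball the raw numbers. The following is a minimal sketch of that approach, run as separate isql batches so no variable needs to survive a go; the temporary table, the five minute interval and the <tgt_db_name> placeholder are illustrative only, and the WaitTime units are whatever your ASE version reports for monProcessWaits:

-- first sample, tagged by wait event
select w.WaitEventID, w.Waits, w.WaitTime
  into #ra_waits_t1
  from master..monProcessWaits w, master..sysprocesses p
 where p.program_name = "repagent"
   and p.dbid = db_id("<tgt_db_name>")
   and w.KPID = p.kpid
   and w.SPID = p.spid
go
waitfor delay "00:05:00"
go
-- second sample, reporting only the events that moved during the interval
select w.WaitEventID,
       w.Waits    - t1.Waits    as WaitsDelta,
       w.WaitTime - t1.WaitTime as WaitTimeDelta
  from master..monProcessWaits w, master..sysprocesses p, #ra_waits_t1 t1
 where p.program_name = "repagent"
   and p.dbid = db_id("<tgt_db_name>")
   and w.KPID = p.kpid
   and w.SPID = p.spid
   and t1.WaitEventID = w.WaitEventID
   and w.Waits > t1.Waits
 order by 3 desc
go
drop table #ra_waits_t1
go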
There are very few RepAgent-specific wait events, as most of the causes of RepAgent waits are generic ASE processing rather than anything RepAgent-specific. The RepAgent wait events are:

Wait Event   Event Class   Description
----------   -----------   ------------------------------------------
       221             9   replication agent sleeping in retry sleep
       222             9   replication agent sleeping during flush
       223             9   replication agent sleeping during rewrite

To illustrate the point about the most frequent causes of RepAgent waits, consider the following event descriptions and counter values. The before and after samples are illustrated side by side, with the t1 columns being the first sample and the t2 columns being the end sample. In each case, only WaitEventIDs that showed a difference between the samples are reported.
WaitEventID WaitClassID Description
----------- ----------- --------------------------------------------------
         29           2 wait for buffer read to complete
         31           3 wait for buffer write to complete
        171           8 waiting for CTLIB event to complete
        214           1 waiting on run queue after yield
        222           9 replication agent sleeping during flush

Wait Time from Mon Tables on ASE 12.5.0.3
WaitEventID    t1.Waits    t2.Waits    totWaits t1.WaitTime t2.WaitTime totWaitTime
----------- ----------- ----------- ----------- ----------- ----------- -----------
         31           3         120         117           0         100         100
        171        2178       75597       73419       21800      747900      726100
        222           2           4           2       17000       54300       37300

Wait Time from Mon Tables on ASE 12.5.2
WaitEventID    t1.Waits    t2.Waits   Tot.Waits t1.WaitTime t2.WaitTime Tot.WaitTime
----------- ----------- ----------- ----------- ----------- ----------- ------------
         29           2           2           0           0           0            0
         31         283         403         120        1900        3200         1300
        171      223623      306426       82803      410700      593700       183000
        214           3           3           0           0           0            0
        222       13988       13990           2  1032636100  1032659600        23500

In both cases illustrated above, the RepAgent spid is spending much more time waiting on CT-Lib events than anything else. However, as is evident, the CT-Lib wait time dropped roughly 4x moving from 12.5.0.3 to 12.5.2, finally showing the RepAgent waiting on buffer reads (logical I/O from the log cache) and cpu access. Not surprisingly, the application performance stress test also improved from 13-14 minutes to 3 minutes, a matching ~4x. Some of the wait events and how they could be interpreted are summarized below:

ID 29 - wait for buffer read to complete
    Waiting on the log scan. Check to see if the log page for the scan point is within the log cache; possibly use more cache partitions.

ID 31 - wait for buffer write to complete
    Typically, the only writing the RepAgent does is to update the dbinfo structure with the new secondary truncation point, so any large values here could be an indication of more serious problems.

ID 171 - waiting for CTLIB event to complete
    This corresponds directly to the RepAgent transferring data to the RS. It will be the most common wait and can be the result of several things: slow network access from ASE, slow network access at the RS, or a slow inbound queue SQM (exec cache full).

ID 214 - waiting on run queue after yield
    CPU contention with other processes. Unless you see a fair number of these waits, adjusting the RepAgent priority is likely not going to help throughput. In this case, the RepAgent was scanning and got bumped off the cpu at the end of its timeslice (i.e. it didn't reach the scan_batch_size before the timeslice expired) and had to wait to regain access to the cpu.

ID 215 - waiting on run queue after sleep
    Same as above (cpu contention). In this case, the RepAgent was sleeping (due to sending data to RS; any network or physical disk I/O results in the spid being put on the sleep queue) and when the network operation completed, it had to wait on other users before it could reclaim the cpu and continue scanning.

ID 222 - replication agent sleeping during flush
    This typically is an indication of the RepAgent reaching the end of the transaction log and sleeping on the log flush.

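When sampling, it can also help to pull the event descriptions along with the counters rather than memorizing the IDs above. A small self-contained sketch (assumes the MDA proxy tables are installed and that the RepAgent appears in sysprocesses with program_name "repagent", as in the earlier examples; <tgt_db_name> is a placeholder):

-- current waits for the RepAgent spid, restricted to the events discussed above
select w.WaitEventID, i.Description, w.Waits, w.WaitTime
  from master..monProcessWaits w,
       master..monWaitEventInfo i,
       master..sysprocesses p
 where p.program_name = "repagent"
   and p.dbid = db_id("<tgt_db_name>")
   and w.KPID = p.kpid
   and w.SPID = p.spid
   and i.WaitEventID = w.WaitEventID
   and w.WaitEventID in (29, 31, 171, 214, 215, 222)
 order by w.WaitTime desc
go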
An important point about WaitEventID=222: if you are doing benchmarking and you sample the counters just before the stress test starts and again at the end, you will minimally see 2 waits. The reason is likely that the RepAgent was at the end of the log when the first sample was taken and that the last sample was taken after the RepAgent had finished reading out the test transactions. As with any monitoring activity, strictly a before and after snapshot is not that informative. It is much better to take samples at timed intervals, such as every minute or so. This will help eliminate false highs/lows at the boundaries of the tests, such as the 222 waits. If you see a significant number of 171s (as illustrated above), which will be the most common case, the next step is to determine the cause of the slow LTL transfer. One method is to use Replication Server's Monitors & Counters feature, focusing on the EXEC, SQM, and STS thread counters. While the first two may be obvious, the STS counters are useful to determine whether the RS is hitting the RSSD server to read in repdefs, which are used by the EXEC thread during normalization.

Fault Isolation Questions (FIQs)

Most of us are familiar with FAQs, which strive to serve as a loosely defined database of previously asked questions and answers. We'll morph that a bit to make FIQs: questions you should ask when troubleshooting. A common problem is that most programmers and database administrators today have poor fault isolation skills, largely due to the lack of organized fault isolation trees used by DBAs and vendors alike. Unfortunately, this often leads to phone calls to Technical Support that amount to "it's slow" with no information about what may be going on to help identify why it may be slow. The following questions are useful in helping to isolate the potential causes of RepAgent performance problems (a sample script follows the list):

- How far behind is the Replication Agent (MB)? (current marker vs. end of log)
- What is the rate at which the Replication Agent appears to be processing log pages (MB/min)?
- What is the rate at which pages are being appended to the transaction log (MB/min)? (monDeviceIO)
- How much cpu time is the RepAgent getting? (monSysWaits/monProcessWaits)
- Is there a named cache specifically for the transaction log, with most of the cache defined for a 4K (or other) pool and sp_logiosize set to the pool size?
- What are the configuration values for the RepAgent?
- Do any columns in the schema contain text? If so, what is the replication status for the text columns (always, if changed, etc.)? Is there a named cache pool for the text columns (later discussion)?
- Is the latency the result of executing large transactions (how many commands show up in admin who, sqt)?
- Where is the RSSD in relation to the RS?
- Is the RepAgent waiting a long time for the secondary truncation point from the RS at the end of each ltl_batch_size?
- What do the contents of the last several blocks of the inbound queue look like?
- Were lengthy DBA tasks such as reorg reclaim_space <tablename> issued without turning replication off for the session (via set replication off)?
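A few of these questions can be answered directly from isql. The sketch below uses commands that should already exist on a 12.5-era system, although exact output formats vary by version; the database name is a placeholder, and the Replication Server commands are shown as comments because they are run against the RS rather than ASE:

-- How far behind is the RepAgent? Compare the current scan marker to the end of the log.
use <primary_db>
go
sp_help_rep_agent <primary_db>, 'scan'
go
-- How fast is the log growing? Snapshot log space now and again in a few minutes, then diff.
sp_spaceused syslogs
go
-- Against the Replication Server (separate isql session):
--   admin who, sqm      -- queue sizes and write rates per queue
--   admin who, sqt      -- open/closed transactions and any 'removed' large transactions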

The last question is a bit strange, but it turns out that some DBA tasks issue a huge number of empty begin/commit transaction pairs. As was described earlier, not knowing whether a transaction will contain replicated commands, the RepAgent prior to ASE 12.5.2 forwards these begin/commit (BT/CT) pairs to the RS, where they are eventually filtered out. As a result, it is a good practice to put set replication off at the beginning of most DBA scripts, such as dbcc or reorg commands. As of ASE 12.5.2, the RepAgent is smart enough to filter out empty BT/CT pairs caused by system transactions.
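A minimal template for the practice just described (object names are placeholders; set replication is a session-level setting, so it only needs to be issued once per maintenance session):

-- DBA maintenance template: keep empty begin/commit pairs out of the inbound queue
use <primary_db>
go
set replication off
go
reorg reclaim_space <tablename>
go
dbcc checktable(<tablename>)
go
set replication on    -- optional; the setting expires with the session anyway
go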

Replication Server General Tuning


How many resources will Replication Server require?
The above is a favorite question, and a valid one, of nearly every system administrator tasked with installing a Replication Server. The answer, of course, is "it all depends": it depends on the transaction volume of the primary sites, how many replicate databases are involved and how much latency the business is willing to tolerate. The object of this section is to cover basic Replication Server tuning issues. It should be noted that these are general recommendations that apply to many situations; however, your specific business or technology requirements may prevent you from implementing the suggestions completely. Additionally, due to environment-specific requirements, you may achieve better performance with different configurations than those mentioned here. The recommendations in this section are based on the assumption of an enterprise production system environment and consequently are significantly higher than the software defaults.

Replication Server/RSSD Hosting

A common mistake is placing the RSSD database in one of the production systems being replicated to/from. While this in itself has other issues, one of the main problems stemming from it is that this frequently places the RSSD across the network from the Replication Server host. As you saw earlier in the Rep Agent discussion on secondary truncation point management, the volume of interaction between the Replication Server and the RSSD can be substantial just in processing the LTL. Add the queue processing, catalog lookups and other RSSD accesses and this load increases considerably. This leads to a critical performance concept for RSSD hosting: Key Concept #7: Always place the RSSD database in an ASE database engine on the same physical machine host as the Replication Server, or use the Embedded RSSD (eRSSD). In addition, make sure that the first network addresses in the interfaces file for that ASE database engine are localhost (127.0.0.1) entries. The latter part of the concept may take a bit of explaining. If you look in the hosts file on any platform (/etc/hosts for Unix; %systemroot%\system32\drivers\etc\hosts for WindowsNT), you should see an entry similar to:
127.0.0.1 localhost #loopback on IBM RS6000/AIX

In addition to the Network Interface Card (NIC) IP addresses, the localhost IP address refers to the host machine itself. The difference is in how communication is handled when addressing the machine via the NIC IP address or via the localhost IP address. If using the NIC IP address, packets destined for the machine name may not only have to hit the NIC card, but may also require NIS lookup access or other network activity (routing) that results in, at a minimum, the NIC hardware being involved. On the other hand, when using the localhost entry, the TCP/IP protocol stack knows that no network access is really required. As a result, the protocol stack implements a TCP loopback in which the packets are essentially routed between the two applications using only the TCP stack. An illustration of this is shown below:

Figure 9: hostname/NIC IP vs. localhost protocol routing (diagram contrasting an RS-to-RSSD connection addressed via hostname:port, which traverses CT-Lib/NetLib and the TCP/IP stack out to the network, with one addressed via localhost:port, which loops back within the TCP stack without touching the network)


As you can guess, this offers substantial performance and network reliability improvements over using the network interface. Typically, this can be implemented by modifying the Sybase interfaces file to include listeners at the localhost address. However, these must be the first addresses listed in the interfaces file in order for this to work. For example:
NYPROD
    master tcp /dev/tcp localhost 5000
    master tcp /dev/tcp nymachine 5010
    query  tcp /dev/tcp localhost 5000
    query  tcp /dev/tcp nymachine 5010

NYPROD_RS
    master tcp /dev/tcp localhost 5500
    master tcp /dev/tcp nymachine 5510
    query  tcp /dev/tcp localhost 5500
    query  tcp /dev/tcp nymachine 5510

Note that many of today's vendors have added the ability for the TCP stack to automatically recognize the machine's IP address(es) and provide similar functionality without specifically having to use the localhost address. Even so, there may be a benefit to using the localhost address on machines in which the RSSD is co-hosted with application databases and the RS would have to contend with application users for the ASE network listener. By using the localhost address, RS queries to the RSSD may bypass the traffic jam on the network listener used by all the other clients.

A word of warning: on some systems, implementing multiple network listeners, especially one on localhost, could result in severe performance degradation (especially when attempting large packet sizes). One such case was AIX 4.3 with ASE 11.9.2 (neither of which is currently supported, having been end-of-lifed by both IBM and Sybase years ago).

Additionally, the machine should have the following minimal specifications (NOTE: the following specifications are not the bare minimums, but probably are the minimum a decent production system should consider to avoid resource contention or swapping):

# of CPUs
    1 for each RS and ASE installed on the box plus 1 for the OS and monitoring (RSM) (minimum 2, 3 preferred). If planning on high volume with multiple connections and using RS 12.6/SMP, the suggestion is 2-3 for the RS, 1-2 for ASE/RSSD and 1 for the OS & RSM (4-6 cpus). Using the eRSSD reduces the cpu load significantly, such that it would be rare to need more than 4 cpus unless 3 or more active DB connections are in the RS.

Memory
    128-256MB for ASE (32-64MB for the eRSSD instead) plus memory for each RS (64-128MB minimum) and the operating system (32-64MB). Minimum of 256MB, with 1GB recommended.

Disk Space
    ASE requirements plus a RAID 0+1 device for stable queues; separate controllers/disks for ASE and RS. Although the default creation for the RSSD is only 20MB (2KB pages), recommend 256-512MB data and 128-256MB log due to the monitoring tables. Although the eRSSD uses significantly less system space (i.e. no 20MB master, 120MB sybsystemprocs, 100+MB tempdb), because of the autoexpansion it can grow rapidly if logging exceptions.

Network
    Switched Gigabit Ethernet or better (10Gb Ethernet or InfiniBand).

The rationale behind these recommendations will be addressed in the discussions in the following sections.

Author's Note: As of this writing, there should be no licensing concern in restricting the use of an ASE to the RSSD. Each Replication Server license includes the ability to implement a limited-use ASE solely for the purpose of hosting the RSSD (limited use meaning that the ASE server could only be used for the RSSD; no application data, etc. permitted). Consequently, each RS implemented at a site could have its own ASE specifically for the RSSD. However, it is assumed you already have the ASE software; consequently it is not shipped as part of the RS product set. For ASE 15.0 and higher, the SySAM 2 license manager will require a restricted use license key; customers may have to coordinate with Sybase Customer Service to ensure that the correct number of keys are available.

RS Generic Tuning

Generally speaking, the faster the disk I/O subsystem and the more memory, the faster Replication Server will be. In the following sections Replication Server resource usage and tuning will be discussed in detail.

Replication Server Memory Utilization

A common question is how much memory is necessary for Replication Server to achieve the desired performance levels. The answer really depends on several factors:

1. Transaction volume from primary systems
2. Number of primary and replicate systems
3. Number of parallel DSI threads
4. Number of replicated objects (repdefs, subscriptions, etc.)

Of course, life isn't that simple. Based on the above considerations, you have to adjust several configuration settings within the Replication Server, with certain minimums required based on your configuration. Some of these are documented below:

Replication Server Settings

num_threads (Ver. 10.x; Default: 50; Suggest: 100+)
    The number of internal processing threads, client connection threads, daemons, etc. The old formula for calculating this was:
        (#PDB * 7) + (#RDB * 3) + 4 + (num_client_connections) + (parallel DSIs) + (subscriptions)
    The new formula is:
        30 + 4*IBQ + 2*RSIQ + DSIQ*(3 + max(DSIE))
    where IBQ = inbound queues, RSIQ = route queues, DSIQ = outbound queues, and max(DSIE) = maximum parallel DSI threads (max(dsi_num_threads)). The recommendation is minimally 100, particularly for RS 12.5+.

num_msgqueues (Ver. 10.x; Default: 178; Suggest: 250+)
    Specifies the number of OpenServer message queues that will be available for the internal RS threads to use. The old formula for calculating this was:
        2 + (#PDB * 4) + (#RDB * 2) + (#Direct Routes)
    However, given that this number must always be larger than num_threads, a simpler formula would be:
        num_threads * 2
    The recommendation is 250 (2.5 * num_threads).

num_msgs (Ver. 10.x; Default: 45,586; Suggest: 128,000)
    The number of messages that can be enqueued at any given time between RS threads. The default settings suggest a 1:256 ratio, although a 1:512 ratio may be more advisable. Based on the above settings, num_msgs may need to be set to 128,000.

num_stable_queues (Ver. 10.x; Default: 32; Suggest: at least 2 * database connections + num_concurrent_subs)
    Minimum number of stable queues. This should be at least twice the number of database connections plus num_concurrent_subs.

num_client_connections (Ver. 10.x; Default: 30; Suggest: 20)
    Number of isql, RSM and other client connections (non-RepAgent or DSI connections). The default of 30 is probably a little high for most systems; 20 may be a more reasonable starting point.

num_mutexes (Ver. 10.x; Default: 128 pre-12.6, 1024 in 12.6; Suggest: see formula)
    Used to control access to connection and other internal resources. The old formula for calculating this was:
        12 + (#PDB * 2) + (#RDB)
    As of RS 12.5 and the native-threaded OpenServer, the formula was changed to:
        200 + 15*RA_USER + 2*RSI_USER + 20*DSI + 5*RSI_SENDER + RS_SUB_ROWS + CM_MAX_CONNECTIONS + ORIGIN_SITES
    where RA_USER = RepAgents connecting; RSI_USER = inbound routes; RSI_SENDER = outbound routes; RS_SUB_ROWS and CM_MAX_CONNECTIONS are from rs_config; ORIGIN_SITES = number of inbound queues.

sqt_max_cache_size (Ver. 11.x; Default: 131,072; Suggest: 4,194,304 (4MB))
    Maximum SQT (Stable Queue Transaction) interface cache memory (in bytes) for each connection (primary and replicate). Serious consideration should be given to setting this to 4-8MB or higher, depending on transaction volume. Settings above 16MB are likely counterproductive.

Connection (DSI) Settings

dsi_sqt_max_cache_size (Ver. 11.x; Default: 0; Suggest: 2,097,152)
    Maximum SQT interface cache memory for a specific database connection, in bytes. The default, 0, means the current setting of the sqt_max_cache_size parameter is used as the maximum cache size for the connection. If sqt_max_cache_size is fairly high, you may want to set this in the 2-4MB range to reserve memory. To calculate a starting point, consider num_dsi_threads * 64KB. (Note that this is a per-connection setting. It is mentioned here to emphasize a connection setting that should be changed as the result of changing sqt_max_cache_size.)

exec_sqm_write_request_limit (Ver. 12.1; Default: 16384; Suggest: 983,040)
    Amount of memory available to a Rep Agent User/Executor thread for messages waiting in the inbound queue before the SQM writes them out. Must be set in even multiples of 16K (block size). Max is 60 blocks or 983,040.

md_sqm_write_request_limit (Ver. 11.x; Default: 16384; Suggest: 983,040)
    Amount of memory available to a DIST thread's MD module for messages waiting in the outbound queue before the SQM writes them out. Must be set in even multiples of 16K (block size). Max is 60 blocks or 983,040. Note that the name was changed in RS 12.1 from md_memory_pool to the current name to correspond with the exec_sqm_write_request_limit parameter.
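As a sketch of how the server-level values above are actually applied, the RCL is a series of configure replication server commands issued from isql against the RS. The values shown are simply the suggestions from the table; several of these parameters are static, so plan on a Replication Server restart for them to take effect:

configure replication server set num_threads to '100'
go
configure replication server set num_msgqueues to '250'
go
configure replication server set num_msgs to '128000'
go
configure replication server set sqt_max_cache_size to '4194304'
go
-- restart the Replication Server afterwards for any static parameters to take effect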

Each of these resources consumes some memory. However, once the number of databases and routes is known for each Replication Server, the memory requirements can be quickly determined. For the sake of discussion, let's assume we are trying to scope a Replication Server that will manage the following:

- 20 databases (10 primary, 10 replicate) along with 5 routes
- 2 of the 10 replicate databases have Warm Standby configurations as well
- 4 of the replicate databases have had dsi_sqt_max_cache_size set to 3MB
- The RSSD contains about 5MB of raw data due to the large number of tables involved
- md_sqm_write_request_limit and exec_sqm_write_request_limit are maxed at 983,040; sqt_max_cache_size is set to 1MB
- num_threads is set to 250 for good measure (the system requires nearly 200)

The memory requirement would be:

Configuration value/formula                           Memory                        Example (KB)
----------------------------------------------------  ----------------------------  ------------
num_msgqueues * 205 bytes each                         36KB default                           100
num_msgs * 57 bytes each                               2.5MB default                        7,125
num_mutexes * 205 bytes each                           205KB default                          205
num_threads * 2800 bytes each                          140KB default                          684
# databases * 64K + (16K * # Warm Standby)             1MB min                              1,312
# databases * 2 * sqt_max_cache_size                                                       40,960
dsi_sqt_max_cache_size - sqt_max_cache_size
  (per connection where set)                                                                 8,192
exec_sqm_write_request_limit * # databases             960K each if maxed (983,040)        19,200
md_sqm_write_request_limit * # databases               960K each if maxed (983,040)        19,200
size of raw data in RSSD (STS cache)                                                         5,120

Minimum Memory Requirement (MB): ~128MB
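A sketch of applying the scoped values above: memory_limit is a server-level setting (in MB), while the per-connection caches are set with alter connection. Server and database names are placeholders, and depending on the parameter and RS version the DSI may need to be suspended/resumed, or the RepAgent reconnected, before the change takes effect:

-- server-wide memory ceiling, in MB
configure replication server set memory_limit to '128'
go
-- replicate-side DSI/SQT cache for one connection
suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_sqt_max_cache_size to '2097152'
go
resume connection to RDS.rdb
go
-- primary-side executor write-request cache for the inbound queue
alter connection to PDS.pdb set exec_sqm_write_request_limit to '983040'
go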

Of course, the easy way is to just use the values below as starting points (this assumes a normal number of databases, ~10 or less; if more or less, adjust memory by the same ratio):

                           Normal     Mid Range     OLTP      High OLTP
sqt_max_cache_size         1-2MB      1-2MB         2-4MB     8-16MB
dsi_sqt_max_cache_size     512KB      512KB         1MB       2MB
memory_limit               32MB       64MB          128MB     256MB

The definitions of each of these are as follows:

- Normal: thousands to tens of thousands of transactions per day
- Mid Range: tens to hundreds of thousands of transactions per day
- OLTP: hundreds of thousands to millions of transactions per day
- High OLTP: millions to tens of millions of transactions per day

Now, then, it is easy to run out of memory if you are not careful. One of the best ways to improve RS speed if there is latency in the inbound queue (other than Warm Standby) is to increase sqt_max_cache_size. However, bumping it up when admin who, sqt doesn't show any removed transactions doesn't help, and it can also cause you to run out of memory fairly quickly. For example, a system with 7 connections (3 Warm Standby pairs + RSSD) and a sqt_max_cache_size of 32MB may run great with 500MB of memory as long as only 2 of the connections are active (i.e. one WS pair). As soon as activity starts on another, the following happens:
T. 2004/09/30 00:11:55. (111): Additional allocation of 496 bytes to the currently allocated memory of 524287924 bytes would exceed the memory_limit of 524288000 specified in the configuration.
F. 2004/09/30 00:11:55. FATAL ERROR #7035 REP AGENT(SERVER.DB) s/prstok.c(493)
   Additional allocation would exceed the memory_limit of '524288000' specified in the configuration.
T. 2004/09/30 00:11:55. (111): Exiting due to a fatal error

If you get the above error, it is probably a clue that you have sqt_max_cache_size set too high (perhaps checked during a large transaction) or that you need to raise memory_limit due to the number of connections.

General RS Tuning

In addition to the memory configuration settings, there are several other server-level Replication Server configuration parameters that should be adjusted. These configuration settings include (note that this list does not include the previously mentioned memory configuration settings):

init_sqm_write_delay (Ver. 11.5?; Default: 1000; Suggest: 50)
    Write delay for the Stable Queue Manager if the queue is being read. The impact of this is that the RepAgent (inbound) and DIST (outbound) threads are slowed down to provide time for the SQM to read data from disk for the SQT or DSI threads. Typically, if exec_sqm_write_request_limit is set appropriately, the SQT is likely rescanning a large transaction; consequently, increasing this value to favor queue reading will likely result in larger overall latency.

init_sqm_write_max_delay (Ver. 11.5?; Default: 10000; Suggest: 100)
    The maximum write delay for the Stable Queue Manager if the queue is not being read. See above for the discussion about why this should be set lower.

sqm_recover_segs (Ver. 12.5+; Default: 1; Suggest: 10)
    Specifies the number of stable queue segments Replication Server scans during initialization. This also impacts how frequently RS updates the rs_oqid table as segments are allocated/deallocated. During periods of high volume activity, the default setting can result in near-OLTP loads of 2+ updates/sec in the RSSD. Setting it higher allows the RS to spend more time writing to the queue vs. waiting for the RSSD.

sqm_write_flush (Ver. 12.5; Default: on; Suggest: off)
    Similar to the dsync option in ASE, this parameter controls whether RS waits for I/Os to be flushed to disk for stable queues (effectively, RS uses the O_SYNC flag). This can be ignored for raw partitions; however, if using UFS devices, this could impact performance by a factor of 30% or greater (the price of insuring recoverability). While a theoretical data loss can occur with this off (and UFS devices), the built-in recoverability within RS mitigates most of the risk, to the point that in testing no data has ever been lost.

sqt_init_read_delay (Ver. 12.6; Default: 2000; Suggest: 1000 for 12.6, 100 for 15.0)
    The length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. With each expiration, if the command queue is empty, SQT doubles its sleep time up to the value set for sqt_max_read_delay. In high-volume systems, this may not have much of an impact, but in low-volume systems the rate at which space is released from the queue may be impacted negatively.

sqt_max_read_delay (Ver. 12.6; Default: 10000; Suggest: 1000 for 12.6, 100 for 15.0)
    The maximum length of time an SQT thread sleeps while waiting for a Stable Queue read to complete before checking to see if it has been given new instructions in its command queue. In high-volume systems, this may not have much of an impact, but in low-volume systems the rate at which space is released from the queue may be impacted negatively.
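The same configure replication server syntax applies to the parameters in this table; a sketch using the suggested values (verify on your release which of these are dynamic, since some only take effect after a restart, and leave sqm_write_flush alone unless you are on file system devices):

configure replication server set init_sqm_write_delay to '50'
go
configure replication server set init_sqm_write_max_delay to '100'
go
configure replication server set sqm_recover_segs to '10'
go
configure replication server set sqt_init_read_delay to '1000'
go
configure replication server set sqt_max_read_delay to '1000'
go
-- configure replication server set sqm_write_flush to 'off'   -- only relevant for UFS stable devices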

SMP Tuning

As mentioned earlier in the discussion on RepAgent performance, a few really fast cpus in a small entry-level server (2-4 cpus) is ideal. With RS 12.6 or RS 15.0, enabling SMP capabilities with
-- not supported for Mac OS X nor Tru-64 (DEC)
configure replication server set smp_enable to on

is probably beneficial even on uniprocessor boxes (although probably only slightly with 1 cpu). If more than one cpu is available, you should definitely configure this option. Now, RS/SMP as of RS 12.5+ is built on native threading via OpenServer 12.5 support for native threads. Different from ASE (which uses kernel threading for I/O and its own context switching for user processes, and consequently employs a multi-engine SMP environment), OpenServer native threading can only take advantage of multiple processors when the O/S schedules a thread on another CPU. For most machines, this is most efficient when employing POSIX threading, as kernel threads typically operate on the same cpu as the parent task. As a result, for example, in engineering tests on Solaris, adding /usr/lib/lwp to $LD_LIBRARY_PATH ahead of /usr/lib resulted in a 30% performance gain. Note that because RS is POSIX-thread based vs. engine based, you will need to constrain how many CPUs it can run on by creating processor sets (or the similar term for your hardware vendor) and then binding RS to that processor set. There is a separate white paper on RS SMP that describes this in detail.

Stable Device Tuning

Stable Queue I/O

As you are well aware, Replication Server uses the stable device(s) for storing the stable queues. The space on each stable device is allocated in 1MB chunks divided into 64 16K blocks. Individual replication messages are stored as rows in the block; from an I/O perspective, all I/O is done at the block (16K) level, while space allocation is done strictly at the 1MB level. As each block is read and its messages processed, the block is marked for deletion as a unit. Only when all of the blocks within a 1MB allocation unit have been marked for deletion and their respective save intervals have expired will the 1MB be deallocated from the queue. Often, this is a source of frustration for novice Replication Server administrators who vainly try to drop a partition and expect it to go away immediately. The reason for this discussion is that administrators need to understand that the RS will essentially be performing sequential 16K I/Os unless the queue space gets fragmented with several different queues on the same device (see below). Ordinarily, this would lend itself extremely well to UFS (file system) storage, as UFS drivers and buffer cache are tuned to sequential I/O, especially using read-ahead logic to reduce time spent waiting for physical I/O. The problem with using UFS is twofold:

- Replication Server uses asynchronous I/O (the dAIO daemon) to ensure I/O concurrency between the different SQM threads for the different queues. Still today, some UFS implementations such as HP-UX 11 do not allow asynchronous I/O to file systems. The net result is that writing to a UFS effectively single-threads the process as the operating system performs a synchronous write and blocks the process.
- While most vendors (HP included) have enabled the ability to specify raw I/O (unbuffered) to UFS devices, the I/O routines within RS have not been updated to take advantage of this fact. As a result, using UFS devices could cause a loss of replicated data should there be a file system error.

With the exception of the SQM tuning parameters discussed later, there is not much manual tuning you can do to improve I/O performance.

Async, Raw, UFS, and Direct I/O

In an earlier version of this document, the last bullet caused a bit of misunderstanding: that UFS devices would be faster than raw partitions for stable queue devices, but that RS was not engineered to take advantage of it. Actually, this is not quite correct; rather, the purpose was to illustrate how O/S vendors are changing their respective UFS device implementations to mirror the concurrent I/O capabilities of raw devices. The misunderstanding is due to a common misconception (one unfortunately further spread in early ASE 12.0 training materials) that UFS devices are faster than raw devices. As a result, many were tempted to switch to UFS devices using the dsync flag to ensure recoverability, in the hope of getting greater performance. The answer, simply, is: false.

Raw devices historically have been used to provide unbuffered I/O as well as multi-threaded/concurrent I/O against the same device; two distinctly different features. Unbuffered I/O, of course, guaranteed recoverability. Asynchronous I/O provided concurrent I/O and consequently scalability for parallel processing. UFS devices, as mentioned above, typically do not allow concurrent I/O operations against the same device when using buffered I/O. The buffer cache can reduce I/O wait times for highly serialized access, and consequently has in the past provided performance improvements for single-threaded processes or areas where a spinlock or other access restriction single-threads the access. This can easily be illustrated using the transaction log, select/into (which single-threads I/Os due to system table locks), bcp or other environments in which I/O concurrency is not involved. As stated, however, this is largely due to the buffer cache and not due to the UFS device implementation. When the dsync flag is enabled, the buffer cache is forced to flush each write and consequently the performance advantage immediately disappears. In fact, even with the buffer cache, the more concurrent the I/O activity, the better the performance of raw devices vs. UFS devices. On earlier versions of HP-UX (9.x & 10.x), when 5 concurrent users were attempting database operations, raw partitions were able to outpace UFS devices. The same was true on SGI's IRIX at 75% write activity for a single user. The advantage comes from the fact that raw partitions allow concurrent I/O from multiple threads or processes by using the asynchronous I/O libraries. Consequently, server processes such as ASE or RS can submit large I/O requests in parallel for the same internal user task. Overall, even in low-concurrency environments, buffered UFS devices using dsync suffer such performance degradation that a boldface warning was even placed in the ASE 12.0 manuals.

For years, one way to get around this and get similar performance on UFS as with raw partitions was to use a tool such as Veritas's Quick I/O product, which enables asynchronous I/O for UFS devices. Quick I/O has been certified with ASE; however, it has not been tested with RS (RS engineering typically does not certify hardware such as solid state disks or third-party O/S drivers, as these features should be transparent to the RS process and managed by the O/S). In fact, an interesting benchmark clearly illustrating the problem with UFS devices was published by Veritas a while ago, titled "Veritas Database Edition 1.4 for Sybase: Performance Brief OLTP Comparison on Solaris 8".
For Solaris customers, a really good book you can use to justify the use of raw partitions over file systems to paranoid Unix admins is Database Performance and Tuning by Alan Packer, published by Sun (http://www.sun.com/books/catalog/Packer/index.html). While it does not give a lot of detail about database tuning from a DBA's perspective, it is a very good book for describing the O/S features that enable top performance as well as for understanding the architectures of the major DBMS vendors and their implementations on the Solaris platform.

What the last bullet above referred to is that over the past two years Sun, HP and others have implemented a version of asynchronous I/O for UFS devices, called Direct I/O. Unfortunately, the degree of concurrency in some operating systems is limited to only 64 concurrent I/O operations, far below the capacity of ASE or RS. Additionally, it may require changes to the O/S kernel to enable Direct I/O for UFS I/O activity. As RS 12.x was engineered prior to Direct I/O availability, UFS devices still use synchronous I/O operations. With an EBF released in late 2001 (the EBF number is platform dependent), RS 12.1 did implement a dsync-like flag similar to ASE 12.0 via the sqm_write_flush configuration (configure replication server set sqm_write_flush to on; it may already be on as it is the default). This, of course, does provide the capability for RS to use UFS devices in a recoverable fashion. However, one customer who had been using UFS devices immediately noticed a 30% drop in performance.

One thing should be noted after all this discussion: RS typically is not I/O bound (unless a large number of highly active connections are being supported, or when processing large transactions). As a result, unless you have reason to believe that you will be I/O bound, using UFS devices with the sqm_write_flush option off may be a usable implementation if raw partitions are not available, particularly in Warm Standby implementations where only a single SQM thread may be performing I/O. The preference, for now (12.6 and 15.0 ESD #1), is still raw partitions.

FSync I/O - RS Future Enhancement

This will change with a feature being considered for a future RS release. Currently, RS flushes each block as it fills, but only updates the RSSD every sqm_recover_seg (which defaults to 1MB and typically should be set to 10MB). As a result, as part of recovery, RS reads the RSSD to find its location within each queue and then begins checking each block after that point to see if it is still active or already processed, comparing to the OQID last received by the next stage. Note that the OQID that is provided to the previous stage of the pipeline (RepAgent, inbound SQM, etc.) is
based on the even segment point defined by sqm_recover_seg. As a result, each previous stage of the pipeline will often start reprocessing from that point and the next stage will simply treat any repeats as duplicates. Effectively, this means that any data processed after the OQID was written to the RSSD and before the RS crashed or was shut down is redundant, since if it were not there it would simply get resent. The future enhancement to RS is to use the fsync I/O call on UFS devices in synchronization with sqm_recover_seg. This would allow RS to leverage the file system buffering to speed I/O processing by caching most of the writes in the file system buffer cache. Since the O/S destages these to the devices as necessary, when the fsync is invoked, hopefully few, if any, actual writes will have to occur. It is anticipated that this technique will achieve the following benefits:

- RepAgent throughput will increase as the write to the inbound queue will be faster.
- DIST throughput will increase for a similar reason, as the writes to the outbound queue will be faster.
- SQT and DSI/SQT cache will effectively be extended by the file system buffer cache, eliminating expensive physical reads when sqt_max_cache_size is exceeded by large transactions.

As a result, some future release of Replication Server will likely recommend file system devices over raw partitions.

Stable Queue Placement

One often requested feature, new in the 12.1 release, is the concept of partition affinity. Partition affinity refers to the ability to physically control the placement of individual database stable queues on separate stable device partitions. This helps alleviate the I/O contention between:

- two different high volume sources or destinations
- a Warm Standby inbound queue and other replication sources and targets

Some people would quickly point out that separation between the inbound and outbound queues for the same connection is not possible with this scheme. True. However, this is not necessarily a problem. Remember, for any source system the inbound queue is used for data modifications from the source, while the outbound queue is used for data modifications from other systems destined for the connection. Consequently, unless two high volume source systems are replicating to each other, this should not pose a problem. One place where it could occur (and consequently bears some monitoring) is if a corporate roll-up system also supports a Warm Standby.

Default Partition Behavior

Prior to 12.1, you can get similar results through an undocumented behavior: if all of the stable device partitions are added prior to creating the database connections, the Replication Server will round-robin the placement of the database connections on the individual partitions. The difference between this and adding the database connections prior to adding the extra partitions is illustrated below.

Figure 10: Stable Device Partition Assignment & Database Connection Creation (two panels, each showing partitions part1, part2 and part3: the left panel depicts connections created prior to stable devices part2 and part3; the right panel depicts connections created after stable devices part2 and part3)


Obviously, the situation on the right is preferable. However, even though a queue may start out this way, due to much higher transaction volume to/from one connection vs. another, or a longer save interval, one queue may end up migrating onto another connection's partitions.

Partition Affinity

In Replication Server 12.1, you can specifically assign stable queues to disk partitions through the following command syntax:
Alter connection to dataserver.database set disk_affinity to partition_name [ to off]

or
Alter route to replication_server set disk_affinity to partition_name [to off]

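For example, to keep a busy Warm Standby inbound queue and a high-volume roll-up connection on separate partitions (the connection and partition names here are hypothetical), and then confirm where segments are being allocated:

alter connection to NYPROD.pdb set disk_affinity to 'part2'
go
alter connection to LONDON.rdb set disk_affinity to 'part3'
go
admin disk_space          -- space used per stable partition
go
admin who, sqm            -- which queue is writing, and how fast
go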
Any disk partition can have multiple connections' queues assigned to it; however, currently, each connection can only be affinitied to a single partition. This latter restriction can be a bit of a nuisance where multiple high volume connections need more than 2040MB of queue space (particularly where the save_interval creates such situations). Assigning disk affinity is actually more of a hint than a restriction. If space is available on the partition and the partition exists (i.e. it is not dropped or pending drop), then the space will be allocated for that stable queue on that partition. If the space is not available, then space will be allocated according to the default behavior.

Stable Partition Devices

Another common mistake that system administrators make is placing the Replication Server on a workstation with only a single device (or on a server, but only allowing the Rep Server to address a single large disk). First, this causes a problem in that while a Rep Server can manage large numbers of stable partition devices, each one is limited to 2040MB (less than 2GB). This has nothing to do with 32-bit addressing, or the Rep Server could address a full 2GB (2048MB). The real reason is a limit in the RSSD system table rs_diskpartitions, which tracks space allocations.
create table rs_diskpartitions (
    name            varchar(255),
    logical_name    varchar(30),
    id              int,
    num_segs        int,
    allocated_segs  int,
    status          int,
    allocation_map  binary(255),
    vstart          int
)
go

In the above DDL for rs_diskpartitions, note the allocation_map column. As each 1MB allocation is allocated/deallocated within the device, a bit is set/cleared within this column. Those quick with math will realize that 255 bytes * 8 bits/byte = 2040 bits, hence the partition sizing limit. Consequently, try as one might, without volume management software Rep Server will never be able to use all of the space in a 40GB drive: the 7-partition limit in Unix would restrict it to ~14GB of space. Those who are familiar with vstart would be quick to claim this could be overcome simply by specifying a large vstart and allowing 2-3 stable devices per disk partition. Well, it doesn't quite work that way with Replication Server. For example, consider the following sample of code:
add partition part2 on /dev/rdsk/c0t0d1s1 with size=2040 starting at 2041

The above command will fail. The reason is that the vstart is subtracted from the size parameter to designate how far into the device the partition will start. Consequently, as documented in the Replication Server Reference Manual, the following command only creates a 19MB device (instead of a 20MB device) starting 1MB inside the specified partition, and the command above would have attempted a partition of -1MB!
add partition part2 on /dev/rdsk/c0t0d1s1 with size=20 starting at 1

Now that we understand the good, bad, and ugly of Replication Server physical storage, you will understand the reason for the next concept: Key Concept #8: Replication Server Stable Partitions should be placed on RAID subsystems with significant amounts of NVRAM. While RAID 0+1 is preferable, RAID 5 can be used if there is sufficient cache. Logical volumes should be created in such a way that I/O contention can be controlled through queue placement.

RSSD Generic Tuning

You knew this was coming. Or at least you should have, after all the discussions on the number and frequency of calls between the Replication Server and the RSSD. If you are using the embedded RSSD, you can skip this section (go to STS Tuning below), as it really only applies to ASE-based RSSDs. Key Concept #9: Normal good database and server tuning should also be performed on the RSSD database and host ASE database server. What does this mean? Consider the following points:

- Place the RSSD database in a separate server from the production systems. This provides the best situation for maintaining flexibility should a reboot of the production database server or RSSD database server be required. However, the main reason is that it reduces or eliminates the CPU contention that the RSSD primary user might have with long-running production-system queries (don't let parallel table scans hold your replication system hostage).
- Raise the priority for the RSSD primary user.
- Place tempdb in a named cache.
- Place the RSSD catalog tables in a named cache separate from the exceptions log (although rs_systext presents a problem; put it in with the system catalog tables) and also use a different cache for the queue/space management tables. This is as much to decrease spinlock contention as it is to ensure that repeated hits on one RS system table don't flush another from cache.
- Dedicate a CPU to the RSSD database server. If more than one RSSD is contained in the same ASE server, monitor CPU utilization.
- Set the log I/O size (i.e. bind the log to a log cache with a 4K pool) for the RSSD (see the sketch below).

There are a few triggers in the RSSD, including one on the rs_lastcommit table (fortunately not there in primary or replicate systems) that is used to ensure users don't accidentally delete the wrong rows from the RSSD.
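A sketch of the log-cache bullet above, using standard ASE cache procedures. The cache name and sizes are arbitrary examples and <rssd_db_name> is a placeholder; cache creation and log binding have the usual ASE restart and exclusive-use considerations, so check your ASE version's documentation before running this against a production RSSD:

-- create a log-only named cache with a 4K buffer pool and bind the RSSD log to it
sp_cacheconfig 'rssd_log_cache', '50M', 'logonly'
go
sp_poolconfig 'rssd_log_cache', '40M', '4K'
go
sp_bindcache 'rssd_log_cache', <rssd_db_name>, 'syslogs'
go
-- then set sp_logiosize in the RSSD database to match the 4K pool
-- (the sp_logiosize argument format varies by ASE version)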

Depending on requirements, if multiple Replication Servers are in the environment, it might make sense to consolidate them on a single host (providing enough CPUs exist) and share a common ASE. In such a case, the common ASE may only need 2 engines vs. the minimum of 1 for individual installations, and the ASE can be tuned specifically for RSSD operations (i.e. turn on TCP no delay).

STS Tuning

In the illustration of RS internals, the System Table Services (STS) module is illustrated as the interface between the Replication Server and the RSSD database. The STS is responsible for submitting all SQL to the RSSD: object definition lookups, segment allocations, oqid tracking, subscription materialization progress, recovery progress, configuration parameters, etc. As you can imagine, this interaction can be considerable. While it is not exactly possible to improve the speed of writes to the RSSD from the STS perspective, obviously any improvement that caches RSSD data locally will help speed RS processing of replication definitions, subscriptions and function strings.

STS Cache Configuration

Prior to RS 12.1, only a single tuning parameter was available (sts_cache_size), while in version 12.1 another set of parameters was added to enforce a much desired behavior (sts_full_cache_XXX), as described below:

sts_cache_size (Ver. 11.x+; Default: 1000; Suggest: 5000)
    Controls the number of rows from each table in the RSSD that can be cached in RS memory. The recommended setting is the number of rows in rs_objects plus some padding.

sts_full_cache_{table_name} (Ver. 12.1; Default: see notes; Suggest: see notes)
    Controls whether a specific RSSD table is fully cached. See the discussion below. If a table is fully cached, the sts_cache_size limit does not apply. Note that the default is on for rs_repobjs and rs_users, but off for all other tables. Suggest enabling for rs_objects, rs_columns, and rs_functions as well as the defaults.

Unfortunately, prior to RS 12.1, only the rs_repobjs table (which stores the autocorrection status of replication definitions at replicate RSs for routes) and the rs_users table could be fully cached. That does not imply that other RSSD table rows were not in cache, but rather that the RS only ensured that the rs_repobjs and rs_users tables were fully cached.

RS 12.1 STS Caching

As of RS 12.1, most RSSD tables can be specified to be fully cached in the STS memory pool. The complete list of tables includes:
rs_classes rs_columns rs_config rs_databases rs_datatype rs_functions rs_locater rs_objects rs_publications rs_queues rs_repdbs rs_repobjs rs_translations rs_routes rs_sites rs_systext rs_users rs_versions

At a minimum, it is recommended that you cache rs_objects, rs_columns and rs_functions in addition to the rs_users and rs_repobjs that are cached by default. Additionally, if memory permits, you may also want to cache rs_functions, rs_publications (if using publications), rs_translations (if using the HDS feature for heterogeneous support). If your system has sufficient memory, you may even want to cache rs_systext, particularly in non-Warm Standby implementations where function strings are implemented. However, care must be taken as large function string definitions could consume a lot of RS memory. The syntax to cache an RSSD table is:
configure replication server set sts_full_cache_rs_columns to on

It is notable that rs_subscriptions, rs_funcstrings, rs_rules and rs_whereclauses are excluded from the above list. The table rs_funcstrings uses its own cache outside of the STS cache pool. Additionally, if creating subscriptions, etc., you may want to disable sts_full_cache, as the cache refresh mechanism effectively rescans the RSSD after each object creation, noticeably slowing object creation. For some tables it doesn't make any sense to fully cache them, as they are considerably smaller than the sts_cache_size parameter. For example, specifying sts_full_cache_rs_routes is probably not effective, as the table likely would be fully cached anyhow (most likely with <10 rows). Ditto for rs_databases, rs_dbreps, etc. While rs_locater is small and likely would be fully cached anyhow, it also is updated frequently, which involves updates to the RSSD anyhow (similar to rs_diskpartitions).

STS RSSD Table Access

The STS module is literally just that: a module. It is not a thread, nor is it a separate daemon process. Consequently, multiple threads within the Replication Server could be simultaneously using the STS module and creating concurrent connections/queries within the RSSD itself. Unfortunately, the RS is not tied to any specific version of ASE; consequently, no assumptions were made regarding ASE features that could reduce contention between queries or enhance performance. That is not to say that a lot of contention exists within the RSSD; in fact, rather the opposite. Typically, each query will retrieve an atomic row by specifying discrete primary key values. On the other hand, frequently updated tables could see contention when multiple sources or destinations are involved, as the tables modified often have a rowsize far less than 1000 bytes (anything with a rowsize of at least half of 1962 bytes would result in 1 row per page anyhow). You may wish to monitor the following tables for blocking (not deadlocks, but blocking) using Historical Server or sp_object_stats. The syntax for the latter is similar to:
sp_object_stats "00:20:00", 10, MY_RS_RSSD

If the following tables show any contention, you may wish to alter them to the datarows locking scheme (as sketched below): rs_diskpartitions, rs_queues, rs_locater, rs_segments and rs_oqid.
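If blocking does appear, the change is a one-line alter per table, run in the RSSD with the Replication Server shut down (the database name is a placeholder):

use <rssd_db_name>
go
alter table rs_diskpartitions lock datarows
go
alter table rs_queues lock datarows
go
alter table rs_locater lock datarows
go
alter table rs_segments lock datarows
go
alter table rs_oqid lock datarows
go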

You also may want to closely monitor the amount of I/O required to fulfill an STS request. Since most queries will use the primary key, only 2-3 I/Os should be required to fulfill the request, including index tree traversal, although small tables may be scanned. In addition to adding the monitoring tables, RS 12.1 also modified some indexes within the RSSD for faster access. If you are using a version prior to 12.1 and notice excessive I/Os, you may want to consider adding the following indexes, or other indexes as applicable.

Table            Added indexes in 12.1
rs_columns       Non-unique index on (objid); unique clustered index on (objid, colnum)
rs_databases     Unique clustered index on (ltype, dsname, dbname); unique index on (ltype, dbid); non-unique index on (ltype)
rs_functions     Clustered index on (funcname)
rs_objects       Non-unique index on (dbid, objtype, phys_tablename, phys_objowner)
rs_systext       Non-unique index on (parentid)
rs_translation   Non-unique index on (classid)

Deleted indexes in 12.1: clustered index on (objid)

It should be noted that the above indexes were added/deleted due to observed behavior and changes in the SQL submitted by the STS. Consequently, simply modifying a 12.0 installation with the above changes may degrade performance. You should always verify index changes through proper tuning techniques before and after the modification. Keep in mind that any RSSD changes you make will be lost during an RS upgrade or re-installation. In addition, it is highly recommended that you contact Sybase Technical Support before making such changes and that you clearly think through all the impacts of the changes to ensure that correct RS operation is not compromised. Adding indexes or changing the locking scheme are fairly benign operations (assuming the RS is shut down during the modification and taking into consideration the extra I/O required to maintain the new indexes), while others, particularly any direct row modifications, could result in loss of replicated data.

On the subject of indexes, it is also advisable to run update statistics after any large RCL changes, such as adding or deleting large batches of replication definitions, subscriptions, etc., along with including the RSSD in any normal maintenance activities such as running update statistics on a periodic basis or using optdiag to monitor tables with data-only locking schemes (you should never allow rows to be forwarded in the RSSD). After making any indexing changes, you may need to issue sp_recompile against the table to ensure that stored procedures will pick up the new index, although few stored procedures are issued by the STS (most are admin procedures issued by users directly in the RSSD, such as rs_helpsub).

STS Monitor Counters

In RS 12.1, the following monitor counters were added to track RSSD requests from the RS via the STS. As of 12.6, these counters have been expanded from the initial 8 to the following 9 with the addition of STSCacheExceed:

Counter           Explanation
QueriesTotal      Total physical queries sent to RSSD.
SelectsTotal      Total Select statements sent to RSSD.
SelectDistincts   Total Select Distinct statements sent to RSSD.
InsertsTotal      Total Insert statements sent to RSSD.
UpdatesTotal      Total Update statements sent to RSSD.
DeletesTotal      Total Delete statements sent to RSSD.
BeginsTotal       Total Begin Tran statements sent to RSSD.
CommitsTotal      Total Commit Tran statements sent to RSSD.
STSCacheExceed    Total number of times the STS cache was exceeded.

Obviously the goal is to reduce the number of select statements issued; updates possibly can be reduced by sqm_recover_segs, but the other write activity is necessary for recovery and can't be reduced much (except inserts due to counters). In addition to the usual error in the errorlog, you can watch STSCacheExceed to determine if you need to bump up the sts_cache_size configuration parameter. Note that some specific types of STS activity can be monitored with counters for other modules. For example, the SQM module includes a counter for tracking the number of updates to the rs_oqid table. Later we will discuss how to set up these counters and how to sample them, but for now, it is somewhat useful to know they exist.
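Once the counters are being flushed to the RSSD (as described later in this section), a quick way to watch for STS cache pressure is to query the statistics tables directly. A hedged sketch against the RS 12.x schema, where the value is kept in counter_val:

select r.run_date, d.counter_val
  from rs_statcounters c, rs_statdetail d, rs_statrun r
 where c.counter_id = d.counter_id
   and d.run_id = r.run_id
   and c.display_name = 'STSCacheExceed'
 order by r.run_date

Non-zero values that climb from sample to sample suggest the cache setting is too small for the repdef/subscription mix.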

A key point about the STS counters, however, is that they will reflect the STS activity required to record M&C activity itself. For example, let's say you activate M&C for all the modules and notice a huge number of inserts via the STS. Rather than think the RSSD is getting hammered by inserts into general RSSD tables, you need to subtract the number of counter values inserted during that time period from InsertsTotal to derive the non-statistics related insert activity.

RSM/SMS Monitoring

Installing Replication Server Manager (RSM) is an often neglected part of the installation. Any site that is using Replication Server in a production system without using RSM or an equivalent third-party tool (such as BMC Patrol) has made a grave error that they will pay for within the first 3 months of operation. Why is this true? Simply because most sites don't test their applications today, and as a consequence the transaction testing which is crucial to any distributed database implementation is missed. This virtually guarantees that a transaction, such as a nightly batch job, will fail at the replicate due to a replicate database/ASE issue - for example, the classic "ran out of locks" error from the replicate ASE during batch processing.

RSM Implementation

Having established the need for it, the next question is "How is it best implemented?" The answer, of course, is "it depends". However, consider the following guidelines:

1. Configure one RSM Server on each host where a Replication Server or ASE resides. These RSM Servers will function as the monitoring agents.
2. Configure one RSM Server on the primary SMS monitoring workstation per replication domain. This RSM will function as the RSM domain manager. All interaction with the RSM monitoring agents will be done through the SMS RSM domain manager.
3. Configure RSM Client (Sybase Central Plug-In) or other monitoring scripts to connect to the SMS RSM domain managers.
4. If Backup Server, OpenSwitch or another OpenServer process is critical to operation, consider having one of the RSM monitoring agents on that host also monitor the process if no other monitoring capability has been implemented.
5. RSM load ratios: 1 RS = 3 ASE = 20 OpenServers. If more than one RS is on a host, consider adding multiple RSM monitoring agents every 3-5 RSs (depending on RS load).
6. Do NOT allow changes to the replication system to be implemented through RSM. The main reason for this is that it is a GUI. You will have no record of these changes and it is too easy to make mistakes. Have developers create scripts that can be thoroughly tested and run with high assurance that fat fingers won't crash the system.

The last bullet is important. Similar to not keeping database scripts, if you don't keep a record of your replication system, you will after the first time you have to recreate it. Following the above, a sample environment might look like the following:

Figure 11 Example Replication System Monitoring


RSM vs. Performance

RSM or other SMS software monitoring can impact Replication performance in several ways:

- Unlike ASE's shared memory access to monitor counters, the Replication Server and RSSD must be polled to determine system status information. If the polling cycle is set too small or too many individual administrators are independently attempting to monitor the system, this polling could degrade RS and RSSD performance.
- Excessive use of the heartbeat feature can interfere with normal replication.

On one production system with a straight Warm Standby implementation, between the RS accesses and the RSM accesses to the RSSD, replication increased tempdb utilization by 10% (100,000 inserts out of 1,000,000) during a single day of monitoring. Because of the way RSM re-uses many of the same users, it was impossible to differentiate between RS and RSM activity. However, it is clearly enough of a load to consider a separate RSSD server vs. using an existing ASE in high volume environments. All of this is leading up to one point:

Key Concept #10: Monitoring is critical - but make the heart beat, not race!

RS Monitor Counters

One of the major enhancements to Replication Server 12.1 was performance monitoring counters. Similar to the partition affinity feature, the monitors & counters (M&C) were originally slated for the 12.0 release, but did not quite make it in time. As a result, special EBFs have been created to backfit RS 12.0 and 11.5 with the M&C for testing and debugging purposes only. While in discussion it has often been compared to sp_sysmon, in reality the counters are closer to Historical Server or the MDA monitoring tables in ASE 12.5.0.3+. The rationale is that sp_sysmon in ASE simply reports the total of any counter during the entire monitoring period. Historical Server and Replication Server have both implemented a sample interval type mechanism in that counter values are flushed to disk on a periodic basis during the sample run. This allows peaks to be identified as well as the actual cost of individual activities. The statistics are implemented via a system of counters that can either be viewed through an RS session or can be flushed to the RSSD for later viewing/analysis. Currently, in RS 12.6, nearly 300 counters exist, with the possibility of more being added in future releases. Obviously, with 300 counters, it is difficult to document them all in the product documentation. However, you can view descriptive information about the current counters by using the rs_helpcounter stored procedure. Since it is extremely applicable to performance and tuning, this document will discuss the counters in detail as well as provide a list of counters that apply to each of the applicable threads in later sections. This section will provide an overview of the counters as well as those counters specific to RSSD activity, etc.

RS Counters Overview

The monitoring counters implementation and their use can be divided into five basic areas:

1. Monitor counter system tables in the RSSD
2. RCL commands to enable and sample the counters
3. SQL commands to sample counters flushed to the RSSD
4. RCL commands to reset the counters
5. The dSTATS daemon which performs the statistics sampling

RSSD M&C System Tables. In addition to the logic and RCL commands added to implement the counters, three additional tables were added to the RSSD to track the counter values and store counter specifics. These tables for RS 12.6 are illustrated below (along with rs_databases due to the relationship with rs_statdetail):
Figure 12 RSSD Monitor & Counter Tables


In Replication Server 15.0, the rs_statdetail table changed slightly due to a different method of recording average, max, and last counter values. A comparison of the RS 12.x rs_statdetail table and the RS 15.0 rs_statdetail table is illustrated below:


Figure 13 RS 12.x and 15.0 rs_statdetail table comparison


The main difference is that while in RS 12.6 there is a single counter_val column, RS 15.0 records the number of observations (counter_obs), the total for the counter (counter_total), the last value for the counter (counter_last) and the maximum for the counter (counter_max). As a result, where RS 12.6 had counters for the last, max and total for some counters such as DSIEResultTimeLast, DSIEResultTimeMax and DSIEResultTimeAve, RS 15.0 has a single counter DSIEResultTime. If using RS 15.0 and you want the last value for DSIEResultTime, you simply select counter_last, and similarly for counter_max. The average is the only change - to get the average DSIEResultTime, you derive it by selecting counter_total/counter_obs. This difference mainly affects counters tracking rates, time, and memory utilization for the various modules. Information about the individual counters is stored in rs_statcounters, while counter values from each run are stored in the rs_statdetail table, with the run itself stored in the rs_statrun table. The rs_statcounters table is highly structured:

counter_id (example value: 4000) - The id for the counter; counter ids are arranged by module as detailed below.
counter_name (example value: RSI: Bytes sent) - Descriptive external name for the counter.
module_name (example value: RSI) - Module that the counter applies to.
display_name (example value: BytesSent) - Used to identify the counter through RCL.
counter_type (example value: 1) - The type of counter as detailed below.
counter_status (example value: 140) - The relative impact of the counter on RS performance and other status information, as detailed below.
description (example value: Total bytes delivered by an RSI sender thread.) - The counter explanation.
instance_id (example value: 2) - The particular instance of the module or thread. For example, with a minimum of 2 connections, you will have 2 instances of DSI-S threads (or with parallel DSI, multiple instances of DSI-E).
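As a quick illustration of the RS 15.0 derivation just described, the average, last and maximum values for a counter such as DSIEResultTime can be pulled from the statistics tables with a query along these lines (a sketch, assuming the counters have been saved to the RSSD or a repository):

select r.run_date,
       avg_value  = d.counter_total / d.counter_obs,
       last_value = d.counter_last,
       max_value  = d.counter_max
  from rs_statcounters c, rs_statdetail d, rs_statrun r
 where c.counter_id = d.counter_id
   and d.run_id = r.run_id
   and c.display_name = 'DSIEResultTime'
   and d.counter_obs > 0
 order by r.run_date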

As mentioned earlier, the counter ids are arranged by the internal RS module that the counter is used for. The following table lists the counter id ranges and modules used in the rs_statcounters table:

Counter Id Range    Module
4000-4999           RSI
5000-5999           DSI
6000-6999           SQM
11000-11999         STS
13000-13999         CM
24000-24999         SQT
30000-30999         DIST
57000-57999         DSIEXEC
58000-58999         RepAgent (EXEC)
60000-60999         Sync (SMP sync points)
61000-61999         Sync Elements (mutexes)
62000-62999         SQMR (SQM Reader)
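The ranges can be used directly when browsing the RSSD; for example, the following sketch lists the DSIEXEC counters (filtering on module_name = 'DSIEXEC' is equivalent):

select counter_id, display_name, description
  from rs_statcounters
 where counter_id between 57000 and 57999
 order by counter_id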

The counter type and status designate whether the counter is a total sampling, average, etc., as well as the impact of the counter on performance and other status information. These are described in the following tables:

Counter Types (Enumerated)

Value   Variable         Explanation
1       @CNT_TOTAL       Keeps the total of values sampled
2       @CNT_LAST        Keeps the last value of sampled data
3       @CNT_MAX         Keeps only the largest value sampled
4       @CNT_AVERAGE     Keeps the average of all values sampled

Counter Status (Bitmask)

Value   Variable           Explanation
1       @CNT_INTRUSIVE     Counters that may impact Replication Server performance.
2       @CNT_INTERNAL      Counters used by Replication Server and other counters.
4       @CNT_SYSMON        Counters used by the admin statistics, sysmon command.
8       @CNT_MUST_SAMPLE   Counters that sample even if sampling is not enabled.
16      @CNT_NO_RESET      Counters that are not reset after initialization.
32      @CNT_DURATION      Counters that measure duration.
64      @CNT_RATE          Counters that measure rate.
128     @CNT_KEEP_OLD      Counters that keep both the current and previous value.
256     @CNT_CONFIGURE     Counters that keep the run value of a Replication Server configuration parameter.
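Because counter_status is a bitmask, individual flags can be tested directly with the & operator. A small sketch that lists the counters flagged as intrusive or as part of the admin statistics, sysmon set:

select display_name, module_name, counter_type, counter_status
  from rs_statcounters
 where counter_status & 1 = 1      -- CNT_INTRUSIVE
    or counter_status & 4 = 4      -- CNT_SYSMON
 order by module_name, display_name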

From this, you can determine that the sample counter listed above (RSI: Bytes Sent), in addition to being an RSI counter, keeps a running total of bytes sent (counter_type=1), retains both the current and previous value, is sampled even when sampling is not enabled, and is also used by the admin statistics, sysmon command (counter_status = 140 = 128 + 8 + 4 = CNT_KEEP_OLD + CNT_MUST_SAMPLE + CNT_SYSMON). When looking at rs_statrun and rs_statdetail, many of the values are encoded. For example, run_id itself is composed of two components: the monitored RS's site id (from rs_sites) in hex form, and the run sequence, also in hex. For example, consider the following example run_id and its decomposition:

Figure 14 Example rs_statrun value and composition

The site id is especially needed if trying to analyze across a route when you have combined statistics from more than one RSSD to perform the analysis. If you need to do this, to isolate one RS's statistics from the other's, you need to filter on the RS site id by using a where clause similar to:
strtobin(inttohex(@prs_id))=substring(run_id,1,4)

In which @prs_id is the site id of the RS in question from rs_sites. One slight gotcha with this formula is that the strtobin() function is as of yet undocumented in ASE, but unfortunately it is also the only way of performing this comparison (attempts to use convert(binary(4),@prs_id) failed, as it appears to suffer from byte swapping issues). Probably the two most confusing values to decode are the instance_id and instance_val values. The instance_id typically maps to the connection's dbid or rsid (for routes). With warm standby systems, the queue related values will be reported for the logical connection due to the single set of inbound and outbound queues used for the logical connection vs. individual connection queues. The instance_val column values depend on the thread and more specifically the counter module. Consider the following table that illustrates the various thread instance_id and instance_val values:

Counter Module   Instance_id                           Instance_val column value
REPAGENT         ldbid for RS 12.6; dbid for RS 15.0   -1 (not applicable)
SQM              ldbid for inbound; dbid for outbound  0 outbound queue; 1 inbound queue
SQMR             ldbid                                 10 outbound queue; 11 inbound queue; 21 Warm Standby DSI reader
SQT              ldbid                                 0 outbound queue (DSI SQT); 1 inbound queue
DIST             dbid                                  -1 (not applicable)
DSI              dbid                                  0 normal DSI; 1 Warm Standby DSI (corresponds to the 0/1 outbound/inbound SQM queue identifiers)
DSIEXEC          dbid                                  1 - #dsi_num_threads; this number is the specific DSIEXEC thread number
RSI              rsid                                  -1 (not applicable)
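Putting the decoding together, connection-level counters are usually easier to read when joined back to rs_databases so that connection names appear instead of ids. A sketch for the DSI module against the RS 12.x schema (counter_val column):

select db.dsname, db.dbname, c.display_name, d.counter_val, r.run_date
  from rs_statdetail d, rs_statcounters c, rs_statrun r, rs_databases db
 where d.counter_id = c.counter_id
   and d.run_id = r.run_id
   and c.module_name = 'DSI'
   and d.instance_id = db.dbid
 order by db.dsname, db.dbname, c.display_name, r.run_date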

You can view descriptive information about the counters stored in the rs_statcounters table using the sp_helpcounter system procedure. To view a list of modules that have counters and the syntax of the sp_helpcounter procedure, enter:
sp_helpcounter

To view descriptive information about all counters for a specified module, enter:
sp_helpcounter module_name [, {type | short | long} ]

If you enter type, sp_helpcounter prints the display name, module name, counter type, and counter status for each of the module's counters. If you enter short, sp_helpcounter prints the display name, module name, and counter description for each counter. If you enter long, sp_helpcounter prints every column in rs_statcounters for each counter. If you do not enter a second parameter, sp_helpcounter prints the display name, the module name, and the external name of each counter. To list all counters that match a keyword, enter:
rs_helpcounter keyword [, {type | short | long} ]

To list counters with a specified status, the syntax is:


rs_helpcounter { intrusive | sysmon | rate | duration | internal | must_sample | no_reset | keep_old | configure }

Note the difference between the two procedures: sp_helpcounter is used to list the counters for a module (or all modules), while rs_helpcounter is used to find a counter by keyword in the name or by a particular status.
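For example, run against the RSSD (the keyword below is only illustrative):

-- list the DSI counters with a short description of each
sp_helpcounter dsi, short
go
-- find any counter whose name contains the keyword 'Bytes'
rs_helpcounter Bytes, short
go
-- list only the counters flagged as intrusive
rs_helpcounter intrusive
go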


Enabling M&C Sampling (RS 12.6)

The very first thing that must be done prior to enabling M&C is to increase the size of the RSSD - hopefully you did this when you installed the RS, or you will need to now. As for the rest of this section, much of the information was pulled straight from the RS 12.1 release bulletin and is simply repeated here for continuity (one of the benefits of working for the company is that plagiarism is allowed). Generically, enabling the monitors and counters for sampling is accomplished through a series of steps outlined below:

1. Enable sampling of non-intrusive counters
2. Enable sampling of intrusive counters
3. Enable flushing of counters to the RSSD (if desired)
4. Enable resetting of counters after flush to the RSSD
5. Set the period between flushes to the RSSD (in seconds)
6. Configure the flush interval for specific modules or connections

Each of these steps will be discussed in more detail in the following paragraphs. Enabling sampling of non-intrusive counters You enable or disable all sampling at the Replication Server level using the configure replication server command with the stats_sampling option. The default is on. The syntax is:
configure replication server set stats_sampling to { on | off }

If sampling is disabled, the counters do not record data and no metrics can be flushed to the RSSD.

Enabling sampling of intrusive counters

Most counters sample data with minimal effect on Replication Server performance. Counters that may affect performance (intrusive counters) are enabled separately so that you can enable or disable them without affecting the settings for non-intrusive counters. You can enable or disable intrusive counters using the admin stats_intrusive_counter command. The default is off. The syntax is:
admin stats_intrusive_counter, { on | off }

It is highly recommended that you enable intrusive counters. Initially, it was assumed that these counters would impact performance as they primarily tracked execution times of various processing steps. It turned out in reality that these counters had much much less impact than anticipated - and in RS 15.0, the notion of intrusive counters was eliminated. Additionally, these were some of the more useful counters - especially in determining the holdup of the DSI/DSIEXEC processing. Enabling flushing Use the configure replication server command with the stats_flush_rssd option to enable or disable flushing. The default is off. The syntax is:
configure replication server set stats_flush_rssd to { on | off }

You must enable flushing before you can configure individual modules, connections, and routes to flush. This step is optional in the sense that you can view the statistics without flushing them; however, the most beneficial use of the monitors will only be achieved by flushing them to the RSSD for later analysis and for baselining configuration settings.

Enabling reset after flushing

Use the configure replication server command with the stats_reset_afterflush option to specify that counters are to be reset after flushing. The default is on. The syntax is:
configure replication server set stats_reset_afterflush to { on | off }

Certain counters, such as rate counters with CNT_NO_RESET status, are never reset. Setting seconds between flushes You set the number of seconds between flushes at the Replication Server level using the configure replication server command with the stats_daemon_sleep_time option. The default is 600 seconds. The syntax is:
configure replication server set stats_daemon_sleep_time to sleeptime

The minimum value for sleeptime is 10 seconds; the maximum value is 3153600 seconds (365 days). For general monitoring, the default may be fine. However, for narrower performance tuning related issues, this may have to be decreased to 60-120 seconds (1-2 minutes) to ensure accurate latency and volume related statistics.

Configuring modules, connections, and routes

A hierarchy of configuration options limits the flushing of counters to the RSSD. The command admin stats_config_module lets you configure flushing for a particular module or for all modules. For multithreaded modules, you can choose to flush metrics from a matrix of available counters. For example, you can configure flushing for a module, for a particular connection, or for all connections. Configuration parameters that configure counters for flushing are not persistent; they do not retain their values when Replication Server shuts down. Consequently, it would be a good idea to place frequently used configurations for counter flushing in a script file. Before you can configure a counter for flushing, make sure that you first enable the sampling and flushing of counters.

Note: Replication Server 12.x does not flush counters that have a value of zero.

You can set flushing on for all counters of individual modules or all modules using the command admin stats_config_module. The default is off. The syntax is:

admin stats_config_module, { module_name | all_modules }, {on | off }

where module_name is dist, dsi, rsi, sqm, sqt, sts, repagent, or cm. This command is most useful for single or non-threaded modules, which have only one thread instance. For multithreaded modules, you have greater control over which threads are set on if you use the admin stats_config_connection and admin stats_config_route commands.

Note: If a module's flushing status is set on, counters for all new threads for that module will be set on also.

The number of threads for a multithreaded module depends on the number of connections and routes Replication Server manages. You can configure flushing for individual threads or groups of threads. Connections - use the admin stats_config_connection command to enable flushing for threads related to connections. The syntax is:
admin stats_config_connection, { data_server, database | all_connections }, { module_name | all_modules }, [ 'inbound' | 'outbound' ], { on | off }

where:

- data_server is the name of a data server.
- database is the name of a database.
- all_connections specifies all database connections. Hint: this will produce a lot of output.
- module_name is dist, dsi, repagent, sqm, or sqt.
- all_modules specifies the DIST, DSI, REPAGENT, SQM, and SQT modules. Hint: this too will produce a lot of output.
- inbound | outbound identifies SQM or SQT for an inbound or outbound queue.

Routes - you can use the admin stats_config_route command to save statistics gathered on routes for the SQM or RSI modules. The syntax is:
admin stats_config_route,{ rep_server | all_routes }, { module_name | all_modules }, {on|off}

where rep_server is the name of the remote Replication Server, all_routes specifies all routes from the current Replication Server, and module_name is sqm or rsi. Note: If you configure flushing for a thread, Replication Server also turns on flushing for the module. This does not turn on flushing for existing threads of that module, but all new threads will have flushing turned on. Example RS 12.x Script The typical performance analysis session might use the following series of commands to fully enable the counters:
admin statistics, reset
go
configure replication server set 'stats_sampling' to 'on'
go
admin stats_intrusive_counter, 'on'
go
configure replication server set 'stats_flush_rssd' to 'on'
go
configure replication server set 'stats_reset_afterflush' to 'on'
go
configure replication server set 'stats_daemon_sleep_time' to '60'
go
admin stats_config_module, 'all_modules', 'on'
go

The first line ensures that the first sample is reset vs. holding over the cumulative counts, which can distort the very first sample in the run. Along with this, you should reset the counters after each flush - this helps prevent counter rollover during sampling, particularly for the byte-related counters.

Enabling M&C Sampling (RS 15.0)

In Replication Server 15.0, there was a conscious effort to simplify the commands needed to implement counter sampling. As a result, a sample script to enable monitoring for RS 15.0 would look like the following:
admin statistics, reset
go
-- collect stats for all modules
-- save them to the RSSD
-- collect for 3 hours at 15 sec interval
admin statistics, "all", save, 720, 10800
go
admin statistics, status
go

One word of warning - the admin statistics, reset command will truncate rs_statrun and rs_statdetail - so be sure to preserve the rows if you wish to keep them. The other difference is that currently there is no explicit start/stop command. Instead the admin statistics command uses the syntax (note this is an abbreviated syntax - see manual for all options/parameters):
admin statistics, <module>, <save>, <num observations>, <sample period>

The first two parameters are fairly self-explanatory. However, the last two take a bit of getting used to. <num observations> is the number of observations to make. What makes this tricky is the last parameter - the sample period. The first issue is that it is measured in seconds. As noted in the sample script above, 3 hours translates into 10,800 seconds. Using this duration, and given a number of observations, you can derive the sample interval. For example, 10,800 seconds with 720 observations yields a 15 second sample interval. It should be noted that in RS 15.0 the smallest sample interval supported is 15 seconds. The biggest problem with this syntax is that typically you know the interval you want (15 seconds or 1 minute) but either don't know how long you wish to collect data for - or you know it in terms of hours and minutes. Consequently, you often find that before you execute the command, you are deriving the parameter values with formulas such as:
sample_period = time in hrs * 60 * 60
num_observations = sample_period / sample_interval (in seconds)
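For example, to sample at a 1-minute interval for 2 hours:

-- 2 * 60 * 60 = 7200 second sample period; 7200 / 60 = 120 observations
admin statistics, "all", save, 120, 7200
go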

Because of the usability issues with this, a subsequent release may support an enhancement to change the syntax to accept the sample interval directly and to accept the sample period using a notation such as 3h or 120m for entering a number of hours/minutes. One other difference between RS 12.x and RS 15.0 is that in RS 12.6, counters with a value of zero were not flushed to the RSSD. In RS 15.0, counters with a value of zero will be flushed if the number of observations is greater than zero for that counter.

Viewing current counter values

RS monitor counter values can either be viewed interactively via RCL submitted to the Replication Server (via isql or another program) or by directly querying the RSSD rs_statrun and rs_statdetail tables if the statistics were flushed to the RSSD. Each of these methods will be discussed.

Viewing counter values via RCL

Replication Server provides a set of admin statistics commands that you can use to display current metrics from enabled counters directly to a client application (instead of saving to the RSSD). Replication Server can display information about these modules: DSI, DSIEXEC, SQT, dCM, DIST, RSI, SQM, REPAGENT, MEM, MD, and MEM_IN_USE. To display information, use the admin statistics command as specified below. To view the current values for one or all the counters for a specified module, enter:
admin statistics, module_name [, display_name]

where module_name is the name of the module and display_name is the display name of the counter. To determine the display name of a counter, use sp_helpcounter. To view current values for all enabled counters, enter:


admin statistics, all_modules

To view a summary of values for the DSI, DSIEXEC, REPAGENT, and RSI modules, enter:
admin statistics, sysmon [, sample_period]

where sample_period is the number of seconds for the run. Admin statistics, sysmon [, sample_period] zeros the counters, samples for the specified sample period, and prints the results. If sample_period is 0 (zero) or not present, admin statistics, sysmon [, sample_period] prints the current values of the counters. To display counter flush status, enter:
admin statistics, flush_status

Viewing values flushed to the RSSD You can view information flushed to the rs_statdetail and rs_statrun tables using select and other Transact-SQL commands. If, for example, you want to display flushed information from the dCM module counters, you might enter:
select counter_name, module_name, instance_id, counter_val, run_date
  from rs_statcounters c, rs_statdetail d, rs_statrun r
 where c.counter_id = d.counter_id
   and d.run_id = r.run_id
   and c.module_name = 'CM'
 order by counter_name, instance_id, run_date

In this instance, the counters have been configured to save to the RSSD, either by the configure replication server set 'stats_flush_rssd' to 'on' command for RS 12.6 or by the admin statistics, "all", save, <num observations>, <sample period> command for RS 15.0. While you can view the counter data directly in the RSSD, it may not be the best option. The biggest reason is that the counters in the RSSD only represent that Replication Server's values - if a route is involved, you cannot do a full analysis. Another good reason is that extensive querying of the RSSD will put a load on the server that may impact its ability to respond as quickly to the normal RSSD processing of RS requests. Additionally, historical trend data could take up considerable space within the RSSD. Consequently, the best option is to use an external repository to collect the RSSD statistics and the necessary information to perform analysis.

Resetting counters

Counters are reset when a thread starts. In addition, some counters are reset automatically at the beginning of a sampling period. You can reset counters by:

- Configuring Replication Server to ensure that counters are reset after sampling data is flushed to the RSSD. Use the configure replication server set stats_reset_afterflush to on command.
- Issuing the admin statistics, reset command to reset all counters.

You can reset all counters except counters with CNT_NO_RESET status, such as rate counters, which are never reset. Counters that can be reset, reset to zero.

dSTATS daemon thread

The dSTATS daemon thread supports Replication Server's new counters feature by:

- Managing the interface for flushing counters to the RSSD.
- Calculating derived values when the daemon thread wakes up.

dSTATS manages the interface when Replication Server has been configured to flush statistics to the RSSD using the configure replication server command and the stats_flush_rssd parameter. You can configure a sleep time for dSTATS using the configure replication server command and the stats_daemon_sleep_time parameter. When the daemon wakes up, it attempts to calculate derived statistics such as the number of DSI-thread transactions per second or the number of RepAgent bytes delivered per second.

Impact on Replication

Obviously, intrusive counters impact RS performance within the RS binary codeline itself. However, the impact is not as great as the name would imply - probably less than 15%. The difference is that the normal counter execution code is executed regardless, while enabling these counters requires executing special routines including system clock function calls. One word of caution: the counters can impact RS performance indirectly. For example, by setting the flush interval to 1 second and collecting a wide range of counters, you will notice that the RSSD has a sharp increase of 100+ inserts per second as measured by the STS InsertsTotal counter. On a healthy ASE, this may not be that much of a problem, but on most RSSDs, this could slow down queue/oqid updates, etc. Additionally, that many inserts per second could fill the transaction log much more quickly, which could result in a log suspend (definitely impeding RS performance) - but do not turn on 'truncate log on checkpoint' or the next words you will hear from Tech Support will be "I hope you had a backup and you know how to rebuild queues".

RS M&C Analysis Repository

The second thing you will want to do (after increasing the size of the RSSD above) is to create a repository database to upload the counter data to after collection. As mentioned earlier, the reasons for this are to:

- Enable you to perform analysis of replication performance when using routes
- Avoid consuming excessive space in the RSSD retaining historical data
- Prevent the cpu load of analysis queries from impacting RSSD performance for RS activities
- Prevent loss of statistics in RS 15.0 when the reset command truncates the tables

Creating the Repository

It is recommended that you create the repository in ASE due to functions that are not currently available in ASA or Sybase IQ - the undocumented strtobin() and bintostr() functions that are used primarily in systems involving routes. The repository should contain the following tables:

- rs_statrun - contains a list of the statistics sample periods
- rs_statdetail - contains the counter values
- rs_statcounters - contains the counter descriptions/names
- rs_sites - contains the list of all the Replication Servers
- rs_databases - contains the list of all the connections
- rs_config - contains all the configuration values

In addition to these tables, other tables such as rs_routes, rs_objects, rs_subscriptions, etc. could be included in the extraction if doing the analysis blind (without knowing the topology if routing is involved, or whether the proper subscriptions have been created). The structure and indexes for these tables can be obtained from the rs_install_systables_ase script in the RS directory. In addition, you may wish to create a copy of your rs_ticket table in the repository as well if you use the rs_ticket feature. About the repository in general, there are a couple of notes:

- If using a mixed RS 12.x and RS 15.0 environment, use separate databases in the same server due to differences in rs_statcounters (counter values and names) and rs_statdetail (counter columns).
- The reason for rs_sites and rs_databases is that the counter instance_ids use the RS id instead of the connection name. By having these tables, analysis is much easier as the connection names can be used instead of continually looking up the corresponding dbid or prsid.
- While rs_config can change due to configuration changes, having a current copy from the RS allows you to quickly look at configuration values during analysis without having to log back in to the RS in question.

Additionally, you can add indexes to facilitate query performance. A sample repository is available in Sybase's CodeXchange online along with stored procedures that can help with the analysis.

Populating the Repository

The easiest way to populate the repository is to use bcp to extract all of the above tables from the RSSD. This can be done even with an embedded RSSD that uses ASA, as a bcp out is nothing more than a select of all columns/rows from the desired table. You will need to bcp out all the data from the RSs involved in the replication if routes are involved. A sample bcp script to do this might resemble:
set RSSD=CHINOOK_RS_RSSD
set DSQUERY=CHINOOK
mkdir .\%RSSD%
bcp %RSSD%..rs_statrun out .\%RSSD%\rs_statrun.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_config out .\%RSSD%\rs_config.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_statcounters out .\%RSSD%\rs_statcounters.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_databases out .\%RSSD%\rs_databases.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_sites out .\%RSSD%\rs_sites.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192
bcp %RSSD%..rs_statdetail out .\%RSSD%\rs_statdetail.bcp -Usa -P -S %DSQUERY% -b1000 -c -t"|" -A8192

Once you have extracted all the counter data, loading it can be a bit tricky. The counter data tables (rs_statdetail and rs_statrun) should have no issues - even when multiple RSs are involved. However, even if starting from truncated tables, rs_sites and rs_databases may have duplicates when loading data from multiple RSs simply due to the fact that when routes are created, these tables are replicated between the RSs. If you use bi-directional routes, then each server will have a full complement and you only will need to load one copy. If you use uni-directional routes, you may need to bcp in each one, but use the -m and -b switches to effectively ignore the errors. For instance, consider the following bcp command for rs_databases:
bcp rep_analysis..rs_databases in .\%RSSD%\rs_databases.bcp -Usa -P -S %DSQUERY% -b1 -m200 -c -t"|"

The difference is that now bcp will commit every row and ignore up to 200 errors before bcp aborts. This is important - without the -b1 setting, an error in a batch will cause the batch to roll back. By constraining the batch size to 1, only each individual duplicate row is rolled back. Additionally, by setting -m to an arbitrarily high value, bcp will not abort all processing due to the number of duplicates. Similarly, rs_config is a bit strange. RS-specific tuning parameters and default values have an objid value of 0x0000000000000000, while connection-specific parameters will have the connection id in hex in the first four bytes, such as 0x0000007200000000 (0x72 hex is 114 decimal, so this configuration value corresponds to dbid 114). The important thing to remember is that dbids are unique within a replication domain - across all the RSs within that domain. So the sequence for bcp-ing in rs_config values is to do the following:

1. bcp in the rs_config table from the RS of interest - likely the RRS for a connection, although the PRS may be of interest if analyzing a route.
2. bcp in the other rs_config tables using the trick above, but this time use -b1 and -m1000. The reason for the -m1000 is due to the number of default configuration values and server settings.

After populating the tables, remember to run update statistics for all the tables - this can be done using a script such as:
update index statistics rs_statdetail using 1000 values
update index statistics rs_statrun using 1000 values
update index statistics rs_config using 1000 values
update index statistics rs_statcounters using 1000 values
update index statistics rs_databases using 1000 values
update index statistics rs_sites using 1000 values
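The same maintenance script is also a convenient place to add any extra repository indexes mentioned above. The index names and column choices below are only illustrative and should be adjusted to the analysis queries you actually run:

create index rs_statdetail_cid_idx on rs_statdetail (counter_id, instance_id, run_id)
go
create index rs_statrun_date_idx on rs_statrun (run_date)
go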

Using a higher step count and using update index statistics vs. update statistics in the commands above is important considering that a few hours of statistics gathered every minute could mean nearly 1 million rows in the rs_statdetail table.

Analyzing the Counters

It should be noted that both Bradmark and Quest have GUI products that provide an off-the-shelf solution for monitoring RS performance. However, if you don't have these utilities, you will need to get very familiar with the counter schema, the individual counters and their relationships. These are described in more detail in the appropriate sections later in this document - for example, the RepAgent User counters are described in the section on RepAgent User processing. If not using one of the vendor tools to facilitate your analysis, a collection of stored procedures along with the repository schema has been uploaded to CodeXchange.

RS_Ticket

In Replication Server 12.6, a new feature called rs_ticket was added - unfortunately too late for the documentation (it is documented in the RS 15.0 documentation). Perhaps it will become one of the most useful timing mechanisms, replacing heartbeats, etc., as it records a timestamp for every thread that touches the rs_ticket record - from the execution time in the primary database, through the RepAgent processing time and the various threads within Replication Server, and finally to the destination database.

Pre-RS_Ticket

Prior to rs_ticket, the only timing mechanisms were the RSM heartbeat mechanism or the use of a manually created ping/latency table. An example of the latter is illustrated below:
-- latency check table definition
create table rep_ping (
    source_server    varchar(32)   default "SLEDDOG"      not null,
    source_db        varchar(32)   default db_name()      not null,
    test_string      varchar(255)  default "hello world"  not null,
    source_datetime  datetime      default getdate()      not null,
    dest_datetime    datetime      default getdate()      not null
)
go
create unique clustered index rep_ping_idx on rep_ping (source_server, source_db, source_datetime)
go

--latency check table repdef
create replication definition rep_ping_rd
    with primary at SLEDDOG_WS.smp_db
    with all tables named dbo.rep_ping (
        source_server varchar(32),
        source_db varchar(32),
        test_string varchar(255),
        source_datetime datetime
    )
    primary key (source_server, source_db, source_datetime)
    searchable columns (source_server, source_db, source_datetime)
    send standby replication definition columns
    replicate minimal columns
go
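To complete the sketch, each replicate would subscribe to the definition; the server and database names below are illustrative:

create subscription rep_ping_sub
    for rep_ping_rd
    with replicate at SLEDDOG_RDS.rdb1
    without materialization
go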

By creating this set and subscribing to it for normal replication, any insert into the primary server would be propagated to the replicate(s). Since the destination datetime column was excluded from the definition, the replicate field would be populated using the default value, which would be the current time at execution. This could be significantly more accurate than using the rs_lastcommit table's values, which may reflect a long running transaction. While useful, such implementations were sorely lacking as they really didn't help identify where the latency was occurring. Hence RS engineering decided to add a more explicit timing mechanism that would help identify exactly where the latency is.

Rs_ticket setup

Rs_ticket is implemented as a stored procedure at the primary as well as a corresponding procedure (and usually a table) at the replicate. The full setup procedure is as follows:

1. Verify that rs_ticket is in the primary database - if not, extract it from the RS 12.6 rsinspri (rs_install_primary) script in $SYBASE/$SYBASE_REP/scripts. It should not be marked for replication as it uses the rs_marker routine that is marked for replication.
2. Customize the rs_ticket_report procedure at the replicate database(s). A sample is below. You will also need to develop a parsing stored procedure (also below).
3. Enable rs_ticket at the replicates by 'alter connection to srv.rdb1 set dsi_rs_ticket_report to on'.
A sample rs_ticket_report procedure is as follows:


if exists (select 1 from sysobjects where id = object_id('rs_ticket_history') and type = 'U') drop table rs_ticket_history go /*==============================================================*/ /* Table: rs_ticket_history */ /*==============================================================*/ create table rs_ticket_history ( ticket_num numeric(10,0) identity, ticket_date datetime not null, ticket_payload varchar(1024) null, constraint PK_RS_TICKET_HISTORY primary key (ticket_num) ) lock datarows go /*==============================================================*/ /* Index: ticket_date_idx */ /*==============================================================*/ create index ticket_date_idx on rs_ticket_history (ticket_date ASC) go

if exists (select 1 from sysobjects
            where id = object_id('rs_ticket_report') and type = 'P')
    drop procedure rs_ticket_report
go

create procedure rs_ticket_report
    @rs_ticket_param varchar(255)
as
begin
/*
** Name: rs_ticket_report
**   Append PDB timestamp to rs_ticket_param.
**   DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on.
**
** Parameter
**   rs_ticket_param: rs_ticket parameter in canonical form.
**
** rs_ticket_param Canonical Form
**   rs_ticket_param ::= <section> | <rs_ticket_param>;<section>
**   section ::= <tagxxx>=<value>
**   tag ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | ...
**   Version value ::= integer
**   Header value ::= string of varchar(10)
**   DB value ::= database name
**   Byte value ::= integer
**   Time value ::= hh:mm:ss.ddd
**
** Note:
**   1. Don't mark rs_ticket_report for replication.
**   2. DSI calls rs_ticket_report if DSI_RS_TICKET_REPORT in on.
**   3. This is an example stored procedure that demonstrates how to
**      add RDB timestamp to rs_ticket_param.
**   4. One should customize this function for parsing and inserting
**      timestamp to tables.
*/
set nocount on

declare @n_param varchar(2000),
        @c_time  datetime

-- @n_param = "@rs_ticket_param;RDB(name)=hh:mm:ss.ddd"
select @c_time = getdate()
select @n_param = @rs_ticket_param + ";RDB(" + db_name() + ")="
                + convert(varchar(8), @c_time, 8) + "."
                + right("00" + convert(varchar(3),datepart(ms,@c_time)), 3)

-- for rollovers, add date and see if greater than getdate()
-- print @n_param

insert into rs_ticket_history (ticket_date, ticket_payload)
values (@c_time, @n_param)
end
go

if exists (select 1 from sysobjects where id = object_id('parse_rs_tickets') and type = 'P') drop procedure parse_rs_tickets go if exists (select 1 from sysobjects where id = object_id('sp_time_diff') and type = 'P') drop procedure sp_time_diff go

create proc sp_time_diff
    @time_begin time,
    @time_end   time,
    @time_diff  time output
as
begin
    declare @time_char  varchar(20),
            @begin_dt   datetime,
            @end_dt     datetime

    -- first get the hours...we need to check first for a rollover situation
    -- to do this, we are going to cheat and add a date since time datatype is a physical clock
    -- time vs. a duration (in otherwords 35 hours can not be stored in a time datatype)
    if (datepart(hh,@time_begin)>datepart(hh,@time_end))
    begin
        select @begin_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_begin,108))
        select @end_dt=convert(datetime,"Jan 2 1900 " + convert(varchar(20),@time_end,108))
    end
    else
    begin
        select @begin_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_begin,108))
        select @end_dt=convert(datetime,"Jan 1 1900 " + convert(varchar(20),@time_end,108))
    end

    select @time_char=right("00"+convert(varchar(2),abs(datediff(hh,@begin_dt,@end_dt))),2)+":"
    select @time_char=@time_char + right("00"+convert(varchar(2),abs(datediff(mi,@begin_dt,@end_dt))%60),2)+":"
    select @time_char=@time_char + right("00"+convert(varchar(2),abs(datediff(ss,@begin_dt,@end_dt))%60),2)
    select @time_diff=convert(time,@time_char)

    return 0
end
go

create proc parse_rs_tickets
    @last_two_only bit = 1
as
begin
    declare @pos          int,
            @ticket_num   numeric(10,0),
            @ticket_date  datetime,
            @rs_ticket    varchar(4096),
            @head_1       varchar(10),
            @head_2       varchar(10),
            @head_3       varchar(10),
            @head_4       varchar(50),
            @pdb          varchar(30),
            @pdb_ts       time,
            @exec_spid    int,
            @exec_ts      time,
            @exec_bytes   int,
            @dist_spid    int,
            @dist_ts      time,
            @dsi_spid     int,
            @dsi_ts       time,
            @rdb          varchar(30),
            @rdb_ts       time,
            @last_row     numeric(10,0),
            @next_last    numeric(10,0),
            @ra_latency   time,
            @rs_latency   time,
            @tot_latency  time

    create table #tickets (
        ticket_num  numeric(10,0)  not null,
        head_1      varchar(10)    not null,
        head_2      varchar(10)    null,
        head_3      varchar(10)    null,
        head_4      varchar(50)    null,
        pdb         varchar(30)    null,
        pdb_ts      time           null,
        exec_spid   int            null,
        exec_ts     time           null,
        exec_bytes  int            null,
        exec_delay  time           null,
        dist_spid   int            null,
        dist_ts     time           null,
        dsi_spid    int            null,
        dsi_ts      time           null,
        rs_delay    time           null,
        rdb         varchar(30)    null,
        rdb_ts      time           null,
        tot_delay   time           null
    )

select @last_row=isnull(max(ticket_num),0) from rs_ticket_history select @next_last=isnull(max(ticket_num),-1) from rs_ticket_history where ticket_num < @last_row


declare rs_tkt_cursor cursor for select ticket_num, ticket_date, ticket_payload from rs_ticket_history where ((@last_two_only = 0) or ((@last_two_only=1) and ((ticket_num=@last_row) or (ticket_num=@next_last))) ) for read only open rs_tkt_cursor fetch rs_tkt_cursor into @ticket_num, @ticket_date, @rs_ticket while (@@sqlstatus=0) begin -- parse the first heading and then strip preceeding characters select @rs_ticket=substring(@rs_ticket,charindex("H1",@rs_ticket)+3,4096) select @pos=charindex(";",@rs_ticket) select @head_1=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) -- parse out Heading 2 if it exists, else use null select @head_2=null, @pos=charindex("H2",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_2=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse out Heading 3 if it exists, else use null select @head_3=null, @pos=charindex("H3",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_3=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse out Heading 4 if it exists, else use null select @head_4=null, @pos=charindex("H4",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+3,4096) select @pos=charindex(";",@rs_ticket) select @head_4=substring(@rs_ticket,1,@pos-1), @rs_ticket=substring(@rs_ticket,@pos+1,4096) end -- parse the PDB select @rs_ticket=substring(@rs_ticket,charindex("PDB",@rs_ticket)+4,4096) select @pdb=convert(varchar(30),substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @pdb_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,12)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096)

-- parse the EXEC select @rs_ticket=substring(@rs_ticket,charindex("EXEC",@rs_ticket)+5,4096) select @exec_spid=convert(int,substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @exec_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse the EXEC bytes select @rs_ticket=substring(@rs_ticket,charindex("B(",@rs_ticket)+7,4096) select @exec_bytes=convert(int,substring(@rs_ticket,1,charindex(';',@rs_ticket)-1)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse out DIST if it exists, else use null select @dist_spid=null, @dist_ts=null, @pos=charindex("DIST",@rs_ticket) if @pos > 0 begin select @rs_ticket=substring(@rs_ticket,@pos+5,4096) select @dist_spid=convert(int,substring(@rs_ticket,1,charindex(')', @rs_ticket)-1)), @dist_ts=convert(time,substring(@rs_ticket, charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) end


-- parse the DSI select @rs_ticket=substring(@rs_ticket,charindex("DSI",@rs_ticket)+4,4096) select @dsi_spid=convert(int,substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @dsi_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,10)), @rs_ticket=substring(@rs_ticket,charindex(';',@rs_ticket)+1,4096) -- parse the RDB select @rs_ticket=substring(@rs_ticket,charindex("RDB",@rs_ticket)+4,4096) select @rdb=convert(varchar(30),substring(@rs_ticket,1,charindex(')',@rs_ticket)-1)), @rdb_ts=convert(time,substring(@rs_ticket,charindex('=',@rs_ticket)+1,12))

-- calculate horizontal latency exec sp_time_diff @pdb_ts, @exec_ts, @ra_latency output exec sp_time_diff @exec_ts, @dsi_ts, @rs_latency output exec sp_time_diff @pdb_ts, @rdb_ts, @tot_latency output insert into #tickets (ticket_num,head_1,head_2,head_3,head_4,pdb, pdb_ts,exec_spid,exec_ts,exec_bytes,exec_delay, dist_spid,dist_ts,dsi_spid,dsi_ts,rs_delay, rdb,rdb_ts,tot_delay) values (@ticket_num,@head_1,@head_2,@head_3,@head_4,@pdb, @pdb_ts,@exec_spid,@exec_ts,@exec_bytes,@ra_latency, @dist_spid,@dist_ts,@dsi_spid,@dsi_ts,@rs_latency, @rdb,@rdb_ts,@tot_latency) -- parse the DIST if present fetch rs_tkt_cursor into @ticket_num, @ticket_date, @rs_ticket end close rs_tkt_cursor deallocate cursor rs_tkt_cursor select ticket_num, head_1, head_2, head_3, head_4, pdb_time=convert(varchar(15),pdb_ts,9), exec_time=convert(varchar(15),exec_ts,9), exec_delay=convert(varchar(15),exec_delay,8), exec_bytes, dist_time=convert(varchar(15),dist_ts,9), dsi_time=convert(varchar(15),dsi_ts,9), rs_delay=convert(varchar(15),rs_delay,8), rdb,rdb_time=convert(varchar(15),rdb_ts,9), tot_delay=convert(varchar(15),tot_delay,8) from #tickets order by ticket_num drop table #tickets return 0 end go

Executing rs_ticket

Executing the rs_ticket proc is easy - it takes four optional parameters that become the headers for the ticket records:
create procedure rs_ticket
    @head1 varchar(10) = "ticket",
    @head2 varchar(10) = null,
    @head3 varchar(10) = null,
    @head4 varchar(50) = null
as
begin

The full "ticket" when built and inserted into the replicate database may look like the following:
** rs_ticket parameter Canonical Form
** rs_ticket_param ::= <section> | <rs_ticket_param>;<section>
** section ::= <tagxxx>=<value>
** tag ::= V | H | PDB | EXEC | B | DIST | DSI | RDB | ...
** Version value ::= integer
** Header value ::= string of varchar(10)
** DB value ::= database name
** Byte value ::= integer
** Time value ::= hh:mm:ss.ddd

V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846

The description is as follows:

Tag    Description           (parenthesis)    Value
V      Rs_ticket version     n/a              1 (current version of format)
H1     Header #1             n/a              First header value
H2     Header #2             n/a              Second header value
H3     Header #3             n/a              Third header value
H4     Header #4             n/a              Fourth header value
PDB    Primary Database      DB name          Timestamp of PDB rs_ticket execution
EXEC   RepAgent User Thread  EXEC RS spid     Timestamp processed by EXEC
B      Bytes                 EXEC RS spid     Bytes processed by EXEC
DIST   Distributor Thread    DIST RS spid     Timestamp processed by DIST
DSI    DSI Thread            DSI RS spid      Timestamp processed by DSI-S
RDB    Replicate Database    RDB name         Timestamp of insert at RDB

The Header values are optional values supplied by the user to help distinguish which rows bracket the timing interval. A sample execution might look like:
exec rs_ticket 'start'
-- (run replication benchmarks, DML, whatever)
exec rs_ticket 'stop'
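At the replicate, the parsing procedure shown earlier can then be used to compare the bracketing tickets:

-- run in the replicate database once the 'stop' ticket has arrived
exec parse_rs_tickets @last_two_only = 1
go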

rs_ticket tips

There are a couple of pointers about rs_ticket that should be discussed:

- Synchronize the clocks on the ASE & RS hosts! The PDS, RS & RDS hosts should be within 1 sec of each other. This may have to be repeated often - while some systems automatically sync the clocks during boot, due to uptime or high clock drift they can be off by seconds by the end of the day.
- DIST will not send rs_ticket to the DSI unless there is at least one subscription from the replicate site.
- Do not use an apostrophe or single/double quotation marks within the headers. For example, trying to use a header such as "Bob's Test" will fail whereas "Bobs Test" is fine. Considering that the parsing routines look for semi-colons, you should also avoid using semi-colons within the headers to avoid parsing problems.
- The DSI timestamp is the time that the DSI read the rs_ticket, which could be a few seconds before execution if there is a large DSI SQT cache.
- If using parallel DSIs, the RDB timestamp is the time of the parallel DSI execution, which may be in advance of other statements that will need to be committed ahead of it. This means that the RDB time may be a few seconds off.
- If using routes, the DSI time includes the RSI & RRS DIST. Currently, only the PRS DIST timestamps the ticket. The reason for this is that within the RRS DIST thread, only the MD module is executed; rs_ticket processing occurs prior to that point in the DIST processing sequence.

rs_ticket Trace Flags The rs_ticket can be printed into the Replication Server error log when tracing is enabled. Tracing can be enabled in the three modules that update the rs_ticket: EXEC (Rep Agent User), DIST (Distributor), and DSI (Data Server Interface). The syntax for the trace command is:
trace [ on | off ], [EXEC | DIST | DSI], print_rs_ticket -- examples:


trace on, EXEC, print_rs_ticket
trace on, DIST, print_rs_ticket
trace on, DSI, print_rs_ticket

Note that what is printed into the errorlog is the contents of the ticket at that point - for example, the EXEC trace will only include the PDB and EXEC timestamp information. This technique can be extremely useful when running benchmarks or trying to see when a table is quiesced - simply invoke rs_ticket and wait for the DSI trace record to appear in the errorlog.

Analyzing RS_Ticket

When comparing RS tickets, there are three types of calculations that can be performed: horizontal, vertical and diagonal. Each of these is described in the following sections.

Horizontal

Horizontal calculations refer to the difference in time between two threads in the same rs_ticket row. This is termed pipeline delay as it shows the latency between threads within the pipeline. For example, consider the following rs_ticket output (from two executions):
-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

Note the PDB and EXEC timestamps in each row. If we subtract the two in the beginning row, we notice that the time between when the command was executed and when the RS received it from the RepAgent was nearly immediate in the top example. In the bottom example, however, there is a difference of ~6.5 minutes - thus showing that by the end of the sample period, the RepAgent was running approximately 6.5 minutes behind transaction execution. This could be due to a bulk operation (i.e. a single update that impacted 100,000 rows) that resulted in the RepAgent being behind temporarily, a slow inbound queue write speed, or just poor configuration. Reviewing RS monitor counter data will help to determine the actual cause. Overall end-to-end latency can be observed by comparing the PDB & RDB values in the end row, which shows a 9 minute latency overall. With 6.5 minutes of latency within the RepAgent processing, attempting to tune the RS components will not achieve a significant improvement.

Vertical

Vertical calculations show the time it takes for a single thread to process all of the activity between the two timestamps. This is termed module time as it shows how long a particular module was active. Note, this is a latency figure and does not imply that the module was completely consuming all cpu during that time - the delay may have been caused by a pipeline delay. Using the same output as above, consider the various threads.
-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

By comparing the PDB timestamps between the two rows, we notice that the total test time was approximately 11 seconds of execution time at the primary. Now it gets a bit tricky. If we further look at the EXEC vertical calculation, we see a delay of ~6.5 minutes, as we noted earlier from the horizontal calculation. Taking it one step further, we notice that the DIST vertical calculation is ~8 minutes. If we subtract the two, we see that the DIST thread adds about 1.5 minutes of processing to the overall problem. This may be an indication of one of three possibilities (in order of likelihood):

1. The commands between the two rs_tickets included a large transaction, which could delay the DIST receiving the commands as the SQT has to wait to see the commit record before it even starts passing the commands to the DIST (likelihood: 60%)
2. The outbound queue SQM is overburdened for the associated device speed, thus slowing the delivery rate of the DIST to the outbound queue (likelihood: 35%)
3. Due to insufficient STS cache, the DIST had to resort to fetching repdef and subscription metadata from the RSSD (likelihood: 5%)

By analyzing RS monitor counters, we can determine which of these is applicable.

Diagonal

In the last example, we came close to performing a diagonal calculation. A diagonal calculation is termed cross module time and refers to the latency that can result from waiting for access to the thread (messages cached in thread queues).
-- beginning
V=1;H1=start;PDB(pdb1)=21:25:28.310;EXEC(41)=21:25:28.327;B(41)=324;DIST(24)=21:25:29.211;DSI(39)=21:25:29.486;RDB(rdb1)=21:25:30.846
-- end
V=1;H1=stop;PDB(pdb1)=21:25:39.406;EXEC(41)=21:32:03.200;B(41)=20534;DIST(24)=21:33:43.323;DSI(39)=21:34:08.466;RDB(rdb1)=21:34:20.103

For example, in the above, the DIST starts sending data to the DSI ~8.5 minutes prior to the DSI receiving the last of the rows from the DIST. In this case, this is important. As we noted earlier, the RepAgent latency was about 6.5 minutes while the DIST processing added 1.5 minutes, for a total of 8 minutes. This means that the DIST saving the data to the outbound queue and the DSI reading the commands from the outbound queue only added about 30 seconds to the overall processing. As you can see, the most useful aspect of diagonal calculations is determining the impact of the modules for which we don't have timestamps - namely the SQM module(s).
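These comparisons are easy to script once the tickets land at the replicate. The sketch below assumes the ticket timestamps have been parsed into a table with one datetime column per component - the table and column names (ticket_latency, pdb_time, exec_time, dist_time, dsi_time, rdb_time) are illustrative stand-ins, not an RS-supplied schema:

-- horizontal (pipeline) deltas and end-to-end latency per ticket, in seconds
select h1,
       datediff(ss, pdb_time, exec_time)  as pdb_to_exec,   -- RepAgent latency
       datediff(ss, exec_time, dist_time) as exec_to_dist,
       datediff(ss, dist_time, dsi_time)  as dist_to_dsi,
       datediff(ss, dsi_time, rdb_time)   as dsi_to_rdb,
       datediff(ss, pdb_time, rdb_time)   as end_to_end
from ticket_latency
order by pdb_time
go

Vertical (module) times are simply the same datediff applied to one column across the start and stop rows.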


Inbound Processing
What comes in
Earlier we took a look at the internal Replication Server threads in a drawing similar to the following:

Figure 15 Replication Server Internals: Inbound and Outbound Processing


In the above copy of the diagram, note that the threads have been divided into inbound and outbound processing along the dashed line from the upper-left to lower-right. An important distinction - and one that many do not understand - is that the inbound threads used to replicate from a source database belong to a different connection than the outbound group of threads. Consequently, as multiple destinations are added, the same set of inbound threads is used to deliver the data to all of the various sets of outbound threads for each connection. In the sections below, we will be addressing the three main threads in the inbound processing within Replication Server. In previous versions of this document, the RepAgent User thread was not discussed; however, with RS 12.1, some additional tuning parameters were added specifically for it, so it is now included.

RepAgent User (Executor)

The RepAgent User thread has been named various things during the Replication Server's lifetime. It originally started as the Executor thread, followed by the LTM User thread, and lastly, the RepAgent User thread. The reason for this is that there actually are two different types of Executor threads: LTM-User for Replication Agents and RSI-User for Replication Server connections. Replication Server determines which type of thread each Executor is simply by the connect source command that is sent. However, many of the trace flags and configuration commands are specified at the Executor thread generically and affect both types. Such commands will often refer to this RS thread module as EXEC. For this module, we will simply be discussing the LTM-User or RepAgent User type of Executor thread.

RepAgent User Thread Processing

The Executor thread's processing is extremely simple. It receives LTL from the Replication Agent, parses and normalizes the LTL, packs it into binary format, and then passes it to the SQM to be written to disk. The full explanation of these steps is as follows:

1. Parse the LTL received from the Rep Agent.
2. Normalize the LTL - this involves comparing columns and datatypes in the LTL to those in the replication definition. An extremely important part and fairly cpu intensive, normalization includes:
   a. Columns in the LTL stream need to be matched with those in the repdef, and those excluded from the repdef need to be excluded from the queue.
   b. Column mapping needs to be performed for any renamed columns.
   c. Multiple repdefs - if more than one repdef exists for the object, the EXEC thread needs to put multiple rows in the inbound queue.
   d. Primary key columns need to be located, as they are stored separately in the row to speed SQL generation at the DSI.
   e. Minimal column comparisons need to be performed and unchanged, non-key columns eliminated from the stream.
   f. If autocorrection is enabled for the particular repdef, updates need to be translated into a separate delete followed by an insert.
   g. Duplicate detection (OQID comparison) needs to be done to ensure that duplicate records are not written to the queue.
3. Pack the commands in binary form and place them on the SQM's queue. If more than one replication definition is available, one command for each will be written to the queue. If the SQM's pending writes are greater than exec_sqm_write_request_limit (RS 12.1+), the Rep Agent User thread is put to sleep.
4. Periodically, update the rs_oqids and rs_locater tables in the RSSD with the latest OQID to ensure recovery.

This is illustrated in the following diagram:

Figure 16 Rep Agent User Thread Processing


A key feature added in RS 12.1 was that writers to the SQM could cache pending writes in the respective writer's cache - either the RepAgent User thread or the Distributor's MD module. By default, this was set to a single block of 16K, with a maximum of 60 blocks or 983,040 bytes (raised to 2GB in RS 12.6 ESD 7 and RS 15.0). For Rep Agent User threads, this cache limit is controlled by exec_sqm_write_request_limit. Once this limit has been reached, further attempts to insert write requests on the SQM Write Request queue will be rejected and the Rep Agent User thread put to sleep.

The parsing and normalization process can be fairly cpu intensive and is essentially synchronous in processing transactions from the Replication Agent all the way to the SQM. Accordingly, you can control this by adjusting the parameter exec_cmds_per_timeslice (RS 12.1+), which controls how often the Rep Agent User thread will yield the cpu. While lowering it may have some impact, raising it frequently has little impact. The reason for this behavior is that the RepAgent User thread often has very little work to do, as will be illustrated in the section on the monitor counters later. While it is true that Open Server messages are used to prevent it from being completely synchronous, the simple fact is that each transfer from the Replication Agent must be written to disk; the small default buffer size in the Rep Agent User thread (exec_sqm_write_request_limit defaults to 16K) essentially requires a flush to disk. Consequently, at the end of each transfer, the Replication Agent waits for an acknowledgement not only that the LTL was received, but also (in effect) that it was written to disk, as the RepAgent User thread does not acknowledge to the Replication Agent that the transfer was complete until then. This may seem duplicative given the scan_batch_size and secondary truncation point movement, but in a sense it is not quite. The secondary truncation point and OQID synchronization take more work, as the RSSD update is involved and a specific log page correlation is made. Given that LTL could exceed the log page, or due to text/image replication, ensuring that the LTL is written to disk for each transfer means a faster recovery.

RepAgent User Tuning

Unlike the SQM and SQT threads, in RS 12.0 and prior there were no specific commands to analyze the performance of the Executor thread, nor tuning configurations. With RS 12.1, several tuning configuration parameters were added:

exec_cmds_per_timeslice (Default: 5; Min: 1; Max: 2147483648; Recommendation: 20) - RS 12.1
Specifies the number of LTL commands an LTI or RepAgent Executor thread can process before it must yield the CPU to other threads. You can set exec_cmds_per_timeslice for all Replication Server Executor threads using configure replication server or for a particular connection using configure connection.

exec_sqm_write_request_limit (Default/Min: 16384 (1 SQM block); Max: 983040 (60 SQM blocks); Recommendation: 983040) - RS 12.1
Controls the amount of memory available to an LTI or RepAgent Executor thread for messages waiting in the inbound queue before the SQM writes them out. If the amount of memory allocated by the LTI or RepAgent Executor thread exceeds the configured pool value, the thread sleeps until the SQM writes some of its messages and frees memory in the pool. You can set exec_sqm_write_request_limit for the Replication Server using configure replication server. The larger the value you assign to exec_sqm_write_request_limit, the more work the Executor thread can perform before it must sleep until memory is released. Note that in 12.6 ESD #7 and 15.0 ESD #1 the maximum has been increased to 2GB; the recommendation for these versions is 2-4MB.
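Both parameters are set with standard RCL; a hedged sketch (the connection name PDS.pdb1 is illustrative, and some settings may only take effect after the connection or Rep Agent is restarted):

-- server-wide setting for all RepAgent Executor threads
configure replication server
set exec_sqm_write_request_limit to '983040'
go
-- per-connection override for one primary database
configure connection to PDS.pdb1
set exec_cmds_per_timeslice to '20'
go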

Setting exec_sqm_write_request_limit is easy - set it to the maximum that memory will allow, ensuring that the setting is an even number of SQM blocks (i.e. a multiple of 16384) so that memory is effectively utilized. The only downside to increasing exec_sqm_write_request_limit is that if the RepAgent connection fails and the RepAgent tries to reconnect, it will not be able to until the full cache of write requests has been saved to the inbound queue. Given that the average production system table is likely 1KB per row or more as formatted by the RS, in all likelihood a full 983,040 bytes of exec_sqm_write_request_limit is less than 1,000 replicated commands - which should take less than a second to save to the inbound queue.

On the other hand, exec_cmds_per_timeslice is a bit more difficult. As mentioned earlier, the parsing and normalization process can be CPU intensive. As a result, since it may always have work to do in a high volume situation, it may be robbing CPU time from the DIST or DSI threads. Consequently, if it appears that data is backing up in the inbound queue and all applicable SQT tuning (below) has been performed, or if the DSI connections show a lot of "awaiting command" at the replicate (taking into account the dsi_serialization_method as discussed in the section on Parallel DSI), you may want to lower this number. On the other hand, if the Replication Agent is getting behind (a much more normal problem), you may want to raise exec_cmds_per_timeslice.

However, there are a few implementation considerations that can also improve performance. Consider the following (a repdef sketch follows this list):

- Create repdefs in the same column order as the table definition (speeds normalization).
- Don't use multiple repdefs for high volume tables unless absolutely necessary (doubles I/O).
- Do not leave autocorrection on any longer than necessary (doubles I/O for insert and update statements).
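As a minimal, hedged illustration of the first point, the repdef below lists its columns in the same ordinal order as a hypothetical orders table (server, database, table and column names are made up):

create replication definition orders_repdef
with primary at PDS.pdb1
with all tables named 'orders'
(order_id int, cust_id int, order_date datetime, status char(1))
primary key (order_id)
go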

RepAgent User Thread Counters

In RS 12.1, several counters specifically for the RepAgent User thread were added. In RS 12.6, 8 additional counters were added and some of the original counters were renamed for clarity.

RepAgent User Thread Monitor Counters

The full list of RS 12.6 counters is:

CmdsTotal - Total commands received by a Rep Agent thread.
CmdsApplied - Total applied commands written into an inbound queue by a Rep Agent thread. Applied commands are applied as the maintenance user.
CmdsRequest - Total request commands written into an inbound queue by a Rep Agent thread. Request commands are applied as the executing request user.
CmdsSystem - Total Repserver system commands written into an inbound queue by a Rep Agent thread.
CmdsMiniAbort - Total 'mini-abort' commands (in ASE, SAVEXACT records) processed by a Rep Agent thread. Mini-abort instructs Repserver to rollback commands to a specific OQID value.
CmdsDumpLoadDB - Total 'dump database log' (in ASE, SYNCDPDB records) and 'load database log' (in ASE, SYNCLDDB records) commands processed by a Rep Agent thread.
CmdsPurgeOpen - Total CHECKPOINT records processed by a Rep Agent thread. CHECKPOINT instructs Repserver to purge to a specific OQID value.
CmdsRouteRCL - Total create, drop, and alter route requests written into an inbound queue by a Rep Agent thread. Route requests are issued by the RS user.
CmdsEnRepMarker - Total enable replication markers written into an inbound queue by a Rep Agent thread. The enable marker is sent by executing the rs_marker stored procedure at the active DB.
UpdsRslocater - Total updates to RSSD..rs_locater where type = 'e' executed by a Rep Agent thread.
PacketsReceived - Total number of protocol packets received by a Rep Agent thread when in passthru mode. When not in passthru mode, RepServer receives chunks of lang data at a time. For packet size, see counter 'PacketSize'. Lang 'chunk' size is fixed at 255 bytes.
BytesReceived - Total bytes received by a Rep Agent thread. This size includes the TDS header size when in 'passthru' mode.
PacketSize - Incoming connection packet size. RepAgent/ASE 12.0 or earlier versions used a hard-coded 2K packet size. Later releases allow you to change the packet size.
BuffersReceived - Total number of command buffers received by a RepAgent thread. Buffers are broken into packets when in 'passthru' mode, or language 'chunks' when not in 'passthru' mode. See counter 'PacketsReceived' for these numbers.
EmptyPackets - Total number of empty packets received in 'passthru' mode by a Rep Agent thread. These are 'forced' EOMs. See counter 'PacketsReceived' for these numbers.
RAYields - Total number of times a RepAgent Executor thread yielded its time on the processor while handling LTL commands.
RAYieldTimeAve (intrusive) - The average amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded.
RAWriteWaits - Total number of times a RepAgent Executor thread had to wait for the SQM Writer to drain the outstanding write requests below the threshold.
RAWriteWaitsTimeAve (intrusive) - The average amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold.
CmdsSQLDDL - Total Repserver SQLDDL commands written into an inbound queue by a Rep Agent thread.
RSTicket - Total rs_ticket markers processed by a Rep Agent's executor thread.

For a typical source database, a handful of these counters - CmdsTotal, PacketsReceived, BuffersReceived, UpdsRslocater, RAYields and RAWriteWaits - are the ones to watch. Replication Server 15.0 had a few differences and added a few counters:

CmdsRecv - Commands received by a Rep Agent thread.
CmdsApplied - Applied commands written into an inbound queue by a Rep Agent thread. Applied commands are applied as the maintenance user.
CmdsRequest - Request commands written into an inbound queue by a Rep Agent thread. Request commands are applied as the executing request user.
CmdsSystem - Repserver system commands written into an inbound queue by a Rep Agent thread.
CmdsMiniAbort - 'Mini-abort' commands (in ASE, SAVEXACT records) processed by a Rep Agent thread. Mini-abort instructs Repserver to rollback commands to a specific OQID value.
CmdsDumpLoadDB - 'Dump database log' (in ASE, SYNCDPDB records) and 'load database log' (in ASE, SYNCLDDB records) commands processed by a Rep Agent thread.
CmdsPurgeOpen - CHECKPOINT records processed by a Rep Agent thread. CHECKPOINT instructs Repserver to purge to a specific OQID value.
CmdsRouteRCL - Create, drop, and alter route requests written into an inbound queue by a Rep Agent thread. Route requests are issued by the RS user.
CmdsEnRepMarker - Enable replication markers written into an inbound queue by a Rep Agent thread. The enable marker is sent by executing the rs_marker stored procedure at the active DB.
UpdsRslocater - Updates to RSSD..rs_locater where type = 'e' executed by a Rep Agent thread.
PacketsReceived - Number of protocol packets received by a Rep Agent thread when in passthru mode. When not in passthru mode, RepServer receives chunks of lang data at a time. For packet size, see counter 'PacketSize'. Lang 'chunk' size is fixed at 255 bytes.
BytesReceived - Bytes received by a Rep Agent thread. This size includes the TDS header size when in 'passthru' mode.
PacketSize - Incoming connection packet size. RepAgent/ASE 12.0 or earlier versions used a hard-coded 2K packet size. Later releases allow you to change the packet size.
BuffersReceived - Number of command buffers received by a RepAgent thread. Buffers are broken into packets when in 'passthru' mode, or language 'chunks' when not in 'passthru' mode. See counter 'PacketsReceived' for these numbers.
EmptyPackets - Number of empty packets received in 'passthru' mode by a Rep Agent thread. These are 'forced' EOMs. See counter 'PacketsReceived' for these numbers.
RAYieldTime - The amount of time the RepAgent spent yielding the processor while handling LTL commands each time the processor was yielded.
RAWriteWaitsTime - The amount of time the RepAgent spent waiting for the SQM Writer thread to drain the number of outstanding write requests to get the number of outstanding bytes to be written under the threshold.
CmdsSQLDDL - RepServer SQLDDL commands written into an inbound queue by a Rep Agent thread.
RSTicket - rs_ticket markers processed by a Rep Agent's executor thread.
RepAgentRecvPcktTime - The amount of time, in 100ths of a second, spent receiving network packets.

Note that the Total, Avg and other aggregate suffixes (and counters) have been removed, as these are available from the counter_total, counter_max, counter_last and counter_avg (= counter_total/counter_obs) columns in the rs_statdetail table for RS 15.0. There is one new counter added - the last one in the list: RepAgentRecvPcktTime. This can be interesting for determining how busy the RepAgent is on network processing time vs. waiting on writes, etc. Note also that the counters RAYields and RAWriteWaits appear to have been removed - which may be surprising considering their relative importance. However, both can be obtained as the number of observations (counter_obs) for RAYieldTime and RAWriteWaitsTime.

Obviously, the goal would be to increase the number of commands processed during a given period, assuming the commands are equal and the transaction rate the same. The RA thread has a number of counters that are of special interest to us and can help us try to improve this rate. Consider the following list (note that most are derived by combining more than one counter):

- CmdsPerSec = CmdsTotal/seconds
- CmdsPerPacket = CmdsTotal/PacketsReceived
- CmdsPerBuffer = CmdsTotal/BuffersReceived (Mirror Rep Agent & heterogeneous Rep Agents)
- PacketsPerBuffer = PacketsReceived/BuffersReceived (Mirror Rep Agent & heterogeneous Rep Agents)
- UpdsRslocaterPerMin = UpdsRslocater/minutes
- ScanBatchSize = CmdsTotal/UpdsRslocater
- RAYieldsPerSec = RAYields/seconds
- RA_ECTS = CmdsTotal/RAYields
- RAWriteWaits

The first one (CmdsPerSec) should be fairly obvious - we are getting a normalized rate that we can use to track the throughput into RS. CmdsPerPacket is an interesting statistic. One would suspect this to be fairly high, but most often with the default 2K packet size and fairly large table sizes (when column names are included), most production sites find themselves only processing 2-3 commands per packet - and since this includes begin/commit commands, it really identifies the first bottleneck. Increasing the Rep Agent packet size by changing the ASE rep agent send buffer size configuration parameter helps this tremendously. Note that heterogeneous replication agents and the Mirror Replication Agent (MRA) all use the concept of an LTL buffer that is different in size than the packet size. For example, the MRA has a default ltl_batch_size of 40,000 bytes and a default rs_packet_size of 2048. For the ASE Rep Agent thread, since the packet and buffer size are the same, you would expect the PacketsPerBuffer ratio to be 1 (and it is). For the MRA and heterogeneous replication agents, you may look at these two counters and determine if tuning them is appropriate. Minimally, raising the MRA rs_packet_size to 8192 or 16384 is suggested. Note that as of MRA 12.6, the MRA appears to be a bit chatty - using tens of packets per buffer - which artificially lowers the CmdsPerPacket ratio to considerably less than 1.

UpdsRslocaterPerMin and ScanBatchSize work together to identify when the Rep Agent scan batch size configuration should be adjusted. Yes, this does relate to the recovery speed of ASE - but think about it: is a difference of 1 minute really a big problem? If not, then increasing the scan batch size to drive UpdsRslocaterPerMin towards 1 (likely impossible to get there) is the goal. However, on really busy systems, you will find that even if you set scan batch size to 20,000, you will still see 10 or more updates per minute - which means recovery is only affected by a few seconds.

However, setting scan_batch_size to really high values can be detrimental on low volume systems. If during peak processing you don't see any updates to rs_locater within 2-3 minutes, you likely have scan_batch_size set too high. RAYields is the number of times the RS RA User thread yielded the cpu to another module and is very interesting. First, the number of yields per second gives a good indication of how much or how little cpu time the RA User thread is getting. Secondly, when compared with the number of commands received (via RA_ECTS), we can see how the configuration parameter exec_cmds_per_timeslice (aka ECTS) is helping or hurting us. A good goal to have is to get 8-10 commands per packet - but what good is that goal if the default exec_cmds_per_timeslice is still at 5, which means that part way through processing the packet the RA thread yields the cpu? However, the one that is most interesting is RAWriteWaits - it signals how often the RA thread had to wait when writing to the inbound queue. This is a factor of how much cache is available (exec_sqm_write_request_limit) as well as the values for init_sqm_write_delay/init_sqm_write_max_delay.
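On the ASE side, both the packet size and the scan batch size are Rep Agent parameters set with sp_config_rep_agent. A hedged sketch of adjusting them in the primary database follows - parameter names and allowed values vary by ASE version, so verify with sp_help_rep_agent before applying:

use pdb1
go
-- larger LTL packets raise CmdsPerPacket (allowed values differ by version)
exec sp_config_rep_agent pdb1, 'send buffer size', '8K'
go
-- fewer truncation point requests lowers UpdsRslocaterPerMin
exec sp_config_rep_agent pdb1, 'scan batch size', '10000'
go
-- most Rep Agent parameters take effect only after a Rep Agent restart
exec sp_stop_rep_agent pdb1
go
exec sp_start_rep_agent pdb1
go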

RepAgent User Thread Counter Usage

Perhaps the best way to use the counters is to look at them in terms of the progression of the data from the source DB to the next thread (SQM). Consider the following sequence for RS 12.6:

1. The RepAgent User Thread receives a batch of LTL from the RepAgent. Each LTL batch is a single LTL buffer that is sent using one or more packets to the RS. This causes the network counters BuffersReceived, PacketsReceived, BytesReceived and EmptyPackets to be incremented.
2. The RepAgent User Thread then parses the commands out of the buffer and the commands are evaluated for type (i.e. is it a DML command the RepAgent has to pass to the SQM, or is it a locater request). This causes the various Cmd counters such as CmdsTotal, CmdsApplied, CmdsRequest, CmdsSystem, CmdsMiniAbort, CmdsDumpLoadDB, CmdsPurgeOpen, CmdsRouteRCL, CmdsEnRepMarker and CmdsSQLDDL to be incremented accordingly.
3. Depending on the command, what happens next:
   a. In normal operations, it is likely that the command was a DML, DDL or system statement (mini-abort, dump/load, purge open, route RCL, enable replication marker (rs_marker)). If so, a write request is issued to the SQM (assuming num_messages or exec_sqm_write_request_limit hasn't been reached) and processing continues.
   b. If the command was a request for a new locater, the RepAgent User Thread determines which record was the last written to disk and updates the RSSD locater appropriately. This also increments the UpdsRslocater counter.
   c. The command could be one of several different commands that the RepAgent User Thread needs to pass to other threads. For example, if a checkpoint record was received, in addition to incrementing CmdsPurgeOpen, the RA User Thread coordinates with the inbound SQM to purge all the open transactions to that point (this happens during ASE database recovery). Similar behaviors apply for mini-aborts, dump/loads, etc.
   d. If the command was an enable replication marker (rs_marker), then the Rep Agent coordinates setting the replication definition to the marker state (i.e. valid).
   e. If the command was an rs_ticket (a form of rs_marker), the RepAgent User Thread appends its timestamp info along with byte counts and process id onto the rs_ticket record and sends it through to the SQM. This also updates the RSTicket counter.
4. Periodically, of course, the RepAgent User Thread will need to yield the CPU. This can happen for several reasons, but in each case, if intrusive counters are enabled, the counters RAYields and RAYieldTimeAve are incremented. The types of yields include:
   a. The number of cmds processed has exceeded exec_cmds_per_timeslice.
   b. As mentioned in 3(a), the exec_sqm_write_request_limit has been reached, at which point the SQM won't accept any more write requests; the counters RAWriteWaits and RAWriteWaitsTimeAve are incremented.
   c. An RS scheduler driven yield - which is why setting exec_cmds_per_timeslice high may have no effect, as the RS may still slice out the RA User Thread to provide time for the other threads to run.

From this point, processing is handed off to the SQM. Let's take a look at some sample data. Note: in each section, the first set of data will be from real customer data and the second set will be from a wide-row (30+ columns) insert speed test. For the first consideration, let's look at the efficiency of the network processing between the RepAgent and the RepAgent User Thread for the customer data set:


Sample Time  CmdsTotal  Packets Received  Cmds/Pckt  Cmds/Sec  Upds Rslocater  Scan_batch_size  UpdsRslocater/Min
0:29:33      267,882    79,356            3.3        889       268             999.5            53
0:34:34      364,632    93,852            3.8        1,207     365             998.9            72
0:39:37      253,283    71,669            3.5        841       254             997.1            50
0:44:38      266,288    63,173            4.2        881       266             1,001.0          52
0:49:40      253,531    63,086            4.0        839       253             1,002.0          50
0:54:43      164,249    56,570            2.9        545       164             1,001.5          32
0:59:45      375,512    108,667           3.4        1,243     375             1,001.3          74
1:04:47      450,749    101,507           4.4        1,492     451             999.4            89
1:09:50      326,619    92,022            3.5        1,085     327             998.8            65
1:14:52      325,148    81,852            3.9        1,076     326             997.3            64
1:19:54      317,559    78,507            4.0        1,055     317             1,001.7          63

(Cmds/Pckt, Cmds/Sec, Scan_batch_size and UpdsRslocater/Min are derived values.)

As you can see from the derived columns above, sometimes the most useful information from the monitor counters comes from comparing two of them. Let's explore some of these:

- Cmds/Pckt - derived by dividing CmdsTotal by PacketsReceived. In this case we are seeing about 3 commands per packet. You have to admit, processing 3 commands per packet is not a lot of work, nor very efficient. This system would likely benefit from raising the RepAgent configuration ltl_buffer_size, which controls the packet size sent to Replication Server.
- Cmds/Sec - derived by dividing CmdsTotal by the number of seconds between samples (rs_statrun). Note that this is an average - in other words, during the ~5 minute intervals there may have been higher spikes and lulls in activity. However, it does show that the Replication Agent is feeding roughly 1,000 commands per second to the Replication Server. To sustain this without latency, we will need to ensure that each part of Replication Server can also sustain this rate.
- Scan_batch_size - derived by dividing CmdsTotal by UpdsRslocater to get a representative number of commands sent to RS before the Replication Agent asks for a new truncation point. While this is an average, it does provide insight into the probable setting for the Replication Agent scan_batch_size - which in this case is likely set to 1,000. To see the effect of this, consider the next metric.
- UpdsRslocater/Min - derived by dividing UpdsRslocater by the number of minutes between samples. This metric represents the SQL activity RS inflicts on the RSSD just to keep up with the truncation point. As you can see, it is updating the RSSD practically once per second. Again, this corresponds to the Replication Agent scan_batch_size configuration parameter. Some DBAs are reluctant to raise this for fear of the extra log space that may impact recovery times, etc. But if you think about it, in its current state I am moving the secondary truncation point every second - a bit of overkill. Increasing this to 10,000 would reduce the RSSD overhead considerably while reducing the secondary truncation point movement to every 10 seconds or so - certainly not a huge impact on the transaction log.

Now, let's look at a test system in which a small desktop system was stressed by doing a high rate of inserts on wide rows (32 columns). Ideally, we would like to compare to the same system after the Replication Agent configuration values have been changed; however, this was not possible to obtain from the customer. So while not a true apples-to-apples comparison, it will be useful to compare the counter behavior. The Replication Agent configuration differences are: ltl_buffer_size=8192; scan_batch_size=20,000. Using the same metrics from above, we see:

Sample Time  CmdsTotal  Packets Received  Cmds/Pckt  Cmds/Sec  Upds Rslocater  Scan_batch_size  UpdsRslocater/Min
11:37:57     1,027      149               6.8        93        0               0                0
11:38:08     7,781      1,096             7          778       0               0                0
11:38:19     4,512      637               7          410       0               0                0
11:38:30     20,322     2,865             7          2,032     1               20,322           6
11:38:41     553        78                7          50        1               553              5

(Cmds/Pckt, Cmds/Sec, Scan_batch_size and UpdsRslocater/Min are derived values.)

To see how these differences impact the system, let's take a look at the CPU and write wait metrics from the RepAgent User Thread perspective - again looking at the customer system first:

Sample Time  CmdsTotal  Packets Received  RAYields  RA ECTS  WriteRequests (SQM)  RAWriteWaits  WriteWait%
0:29:33      267,882    79,356            42,984    6        268,187              32,040        11.9
0:34:34      364,632    93,852            58,811    6        364,705              35,479        9.7
0:39:37      253,283    71,669            36,820    6        253,283              20,243        8.0
0:44:38      266,288    63,173            39,084    6        266,334              14,859        5.6
0:49:40      253,531    63,086            39,804    6        253,684              20,673        8.1
0:54:43      164,249    56,570            25,347    6        164,566              22,528        13.7
0:59:45      375,512    108,667           59,447    6        376,184              38,279        10.2
1:04:47      450,749    101,507           72,149    6        450,809              32,790        7.3
1:09:50      326,619    92,022            45,778    7        326,750              28,127        8.6
1:14:52      325,148    81,852            47,273    6        325,340              22,201        6.8
1:19:54      317,559    78,507            39,971    7        317,674              14,817        4.7

(RA ECTS and WriteWait% are derived values.)

Note that some of the columns are repeated for clarity - and again we have some derived statistics:

- RA ECTS - derived by dividing CmdsTotal by RAYields. This compares to the exec_cmds_per_timeslice configuration parameter, which has a default of 5. Note that in this case, using the default exec_cmds_per_timeslice, we are getting about 6 commands processed before the RA User thread slices. It may be that exec_cmds_per_timeslice is affecting the system, since we are so close to the default, or it may just be the thread scheduling.
- WriteWait% - derived by dividing RAWriteWaits by the SQM counter WriteRequests. This is partially due to the fact that we have a default exec_sqm_write_request_limit of 16384 (1 block). Some of these waits are undoubtedly influencing the RA User Thread time slices.

Now, let's look at the insert stress test. For this system, exec_cmds_per_timeslice is set to 20 and exec_sqm_write_request_limit is set to 983040 (the max); other than the Rep Agent configurations mentioned earlier, no other tuning was done to the Rep Agent User configurations:

Sample Time  CmdsTotal  Packets Received  RAYields  RA ECTS  WriteRequests (SQM)  RAWriteWaits  WriteWait%
11:37:57     1,027      149               34        30       1,027                0             0.00
11:38:08     7,781      1,096             264       29       7,788                0             0.00
11:38:19     4,512      637               156       28       4,512                0             0.00
11:38:30     20,322     2,865             748       27       20,336               0             0.00
11:38:41     553        78                22        25       553                  0             0.00

(RA ECTS and WriteWait% are derived values.)

As you can see, the SQM WriteRequests are much lower, so that may be why there are no RAWriteWaits - however, maxing out exec_sqm_write_request_limit may have helped as well. The interesting thing is that the average RA ECTS (derived by dividing CmdsTotal by RAYields again) is considerably higher than the configuration value - suggesting that exec_cmds_per_timeslice may act as a limit when set lower than the default, but when cpu time is available, the Rep Agent User thread can exceed the configured cap. This suggests that, for the customer viewpoint above, raising exec_cmds_per_timeslice - while a valid suggestion - may not help. However, some customers have reported benefits when exec_cmds_per_timeslice is set as high as 100; it is unknown if these were non-SMP systems, which could influence the behavior. Either the write waits or other cpu demands are causing the RA User thread to timeslice.

RepAgent User/EXEC Traces

There are a number of trace flags that can be used to diagnose RepAgent and/or inbound SQM related performance issues:

Module  Trace Flag           Description
EXEC    EXEC_CONNECTIONS     Traces LTM/Rep Agent connections
EXEC    EXEC_TRACE_COMMANDS  Traces LTL commands received by EXEC
EXEC    EXEC_IGNORE_PAK_LTL  RS behaves as a data sink
EXEC    EXEC_IGNORE_NRM_LTL  Ignores normalization in the LTL
EXEC    EXEC_IGNORE_PRS_LTL  Ignores parsing of LTL commands
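These are toggled with the same trace command shown earlier for print_rs_ticket, but only when running a diag binary, as noted below. A hedged example:

-- requires the diagnostic (diag) Replication Server binary
trace on, EXEC, EXEC_TRACE_COMMANDS
go
-- remember to turn it off once the LTL of interest has been captured
trace off, EXEC, EXEC_TRACE_COMMANDS
go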

Note that each of the above requires use of the diag binary for Replication Server. As a result, they should only be used in a debugging environment, as the extra diagnostic code will have an impact on performance and log output (which can slow down the system). Some of the more useful traces are described below. For best understanding, refer back to the earlier illustration (pg 76) of the modules the EXEC thread performs.

EXEC_CONNECTIONS - If the RepAgent is having problems connecting to the RS, this trace can be useful to determine if the correct password is being used, etc. The output in the errorlog is the RepAgent user login followed by the password, which can be compared to the RSSD values. Care should be taken, as the password will be output into the errorlog in clear text. You will probably want to change the errorlog location for any diagnostic binary boot just due to the volume of output; if so, you will want to delete it after using this trace to avoid having passwords exposed.

EXEC_IGNORE_PAK_LTL (WARNING: Results in data loss) - At first glance this seems misnamed; however, the step immediately prior to the RepAgent user thread passing the LTL to the SQM is packing it into packed binary format. Consequently, by enabling this trace flag, the LTL output will not be written to the inbound queue - however, the RepAgent user thread will still parse and normalize the LTL stream. This can be useful for eliminating SQM performance issues when debugging RepAgent performance problems (especially when the waits on CT-Lib are high).

EXEC_IGNORE_NRM_LTL (WARNING: Results in data loss) - This trace flag disables the normalization step within the RepAgent user thread. If you are positive that the replication definitions precisely match the tables' ordinal column definitions, disabling this can be done without exec_ignore_pak_ltl. However, it is most useful in continuing to step backward to isolate RepAgent performance problems. By first disabling writes to the queue via exec_ignore_pak_ltl and then disabling normalization, you have eliminated the SQM and any normalization overhead (such as checking replication definitions from the RSSD) from the RepAgent LTL transmit sequence.

EXEC_IGNORE_PRS_LTL (WARNING: Results in data loss) - This trace flag disables parsing of the LTL commands received by the RepAgent user thread. When used with exec_ignore_pak_ltl and exec_ignore_nrm_ltl, the RepAgent user thread is effectively throwing the data away without even looking at it. Any network-oriented RepAgent performance issues that remain at this point are likely caused by network contention within ASE, the host machine(s), or the OCS protocol stack within the RS binary.

SQM Processing

The Stable Queue Manager (SQM) is the only module that interacts with the stable queue. As a result, it performs all logical I/O to the stable queue and, as one would suspect, is then one of the focus points for performance discussions. However, SQM code is present in both the SQM and SQT on the inbound side of the connection, and in the SQM and DSI for the outbound (and Warm Standby) side of a connection. It is best to get a better understanding of the SQM module to see that, in itself, the SQM thread may not be contributing to slowdowns in inbound queue processing. The SQM is responsible for the following:

- Queue I/O - All reads, writes, deletes and queue dumps from the stable queue. Reads are typically done by an SQM Reader (SQT or DSI) using SQM module code, while the SQM is responsible for all write activity.
- Duplicate Detection - Compares OQIDs from LTL to determine if an LTL log row is a duplicate of one already received.

Features of the SQM thread include support for:

- Multiple Writers - While not as apparent in inbound processing, if the SQM is handling outbound processing, multiple sources could be replicating to the same destination (i.e. a corporate rollup).
- Multiple Readers - More a function of inbound processing, an SQM can support multiple threads reading from the inbound queue. This includes user connections and Warm Standby DSI threads along with normal data distribution.

For the purpose of this discussion, we will be focusing strictly on the SQM thread, which does the writing to the queue. The SQM write processing logic is similar to the following:

1. Waits for a message to be placed on the write queue.
2. Flushes the current block to disk if:
   a. The message on the queue is a flush request
   b. The message on the queue is a timer pop AND there is a queue reader present
   c. The message on the queue is a timer pop AND the current wait time exceeds init_sqm_write_max_delay
   d. The current block is full
3. Adds the message to the current block.

The flushing logic (where the physical I/O actually occurs) is performed in the following steps:

1. Attempts a platform-specific async write.
2. If a retry is indicated, yields and then tries again.
3. Once the write request is successfully posted, places the write result control block on the AIO Result daemon message queue and sleeps.
4. Expects to be awakened by the AIO Result daemon when that thread processes this one's async write result.
5. Awakens any SQM Read client threads waiting for a block to be written.

It is important to note the distinction: the SQM actually writes the block to disk and then simply tells the dAIO thread to monitor for that I/O completion. The dAIO detects the completion by using standard asynchronous I/O polling techniques and, when the I/O has completed, wakes up the SQM, which can then update the RSSD with the last OQID in the block that was written. This ensures system recoverability, as it is this OQID that is returned to the RepAgent when a new truncation point is requested (as described earlier). This is illustrated as follows:

Figure 17 SQM Thread Processing


SQM Performance Analysis

One of the best and most frequent commands for SQM analysis is the admin who, sqm command (sample output below extracted from the Replication Server Reference Guide).
admin who, sqm Spid State -------14 Awaiting 15 Awaiting 52 Awaiting 68 Awaiting Duplicates ---------0 0 0 0 Info ---101:0 TOKYO_DS.TOKO_RSSD 101:1 TOKYO_DS.TOKYO_RSSD 16777318:0 SYDNEY_RS 103:0 LDS.pubs2 Reads ----0 8867 2037 0 Bytes ----0 9058 2037 0

Message Message Message Message

Writes ------

0.1 0.1.0

B Writes -------0 0 0 0

B Filled ------0 34 3 0

B Reads ------44 54 23

B Cache ------0 2132 268 0

Save_Int:Seg -----------0:0 0:33 0:4 strict:O Next Read --------0.1.0 33.11.0 4.13.0 0.1.0

First Seg.Block --------------0.1 33.10 4.12 0.1 Readers ------1 1 1 1 Truncs -----1 1 1 1

Last Seg.Block -------------0.0 33.10 4.12 0.0

Now that we understand how Replication Server allocates space (1MB allocations) and performs I/O (16K blocks, 64 blocks per 1MB), the above starts to make a bit more sense. Although a more detailed discussion is in the Reference Guide, a quick summary of the output is listed here for easy reference:

Spid - RS internal thread process id, equivalent to ASE's spid.
State - Current state of the SQM. Awaiting Message means it is caught up and not necessarily part of the problem. However, if the state shows Active or Awaiting I/O, the SQM is busy writing data to/from disk.
Info - Queue id and database connection for the queue.
Duplicates - Number of LTL records judged as already received. Can increase at Rep Agent startup, but if it continues to increase, it is a sign of someone recovering the primary database without adjusting the generation id.
Writes - Number of messages (LTL rows) written to the queue. If consistently higher than Reads, you will most likely be seeing a backlog develop. If this is the inbound queue and not a warm standby, tuning exec_cmds_per_timeslice may help.
Reads - Number of messages read from the queue. May surge high at startup due to finding the next row. However, after startup, if this number starts outpacing writes by any significant amount, messages are being reread from the queue due to large transactions or an SQT cache that is too small.
Bytes - Number of actual bytes written to the queue. The efficiency of the block usage can be calculated by dividing Bytes by B Writes. Obviously, if the blocks were always full, the result would be close to 16K. However, in normal processing this is often not the case, as transactions tend to be more sporadic in nature. The most useful uses of this column are to track bytes/min throughput and to explain why the queue usage may be different than estimated (i.e. low block density).
B Writes - Number of 16K blocks written to the queue.
B Filled - Number of 16K blocks written to the queue that were full.
B Reads - Number of 16K blocks read from the queue.
B Cache - Number of 16K blocks read from the queue that were cached.
Save Int:Seg - Save interval in minutes (left of colon) and oldest segment (1MB allocation) for which the save interval has not yet expired.
First Seg.Block - First undeleted segment and block in the queue.
Last Seg.Block - Last segment and block written to the queue. As a result, the size of the queue can be quickly calculated via Last Seg - First Seg (answer in MB).
Next Read - The next segment, block and row to be read. If it points to the next block after Last Seg.Block, then the queue is quiesced (caught up). If continually behind, then reading is not keeping up with writes. If Replication Server is behind, a rough idea of the latency can be determined from the amount of queue still to be applied: ~ Last Seg - Next Read (answer in MB).
Readers - Number of readers.
Truncs - Number of truncation points.
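A similar rough backlog figure can be pulled from the RSSD: each row in rs_segments with used_flag = 1 represents one active 1MB segment of a queue. A sketch (verify the column names against your RSSD version):

-- approximate backlog in MB per queue (q_type 1 = inbound, 0 = outbound)
select q_number, q_type, count(*) as segs_mb
from rs_segments
where used_flag = 1
group by q_number, q_type
go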

In the above table, several of the columns are performance indicators. As such, they are indications that further commands will be necessary to determine exactly what the problem is. A frequent command for inbound queue determination is admin who, sqt, while for outbound queues it most likely will be a look at the replicate database.

Note the word rough in the Next Read entry above regarding calculating latency by subtracting Next Read from Last Seg. The reason for the emphasis is that this method is not exactly accurate. This metric is from the viewpoint of the SQM thread and not the endpoint (DIST or DSI) that we think it is. Prior to the true endpoint, there is a substantial amount of cache, likely in the SQT or DSI (dsi_sqt_max_cache_size), that can be masking the latency. However, if after successive queries the Next Read/Last Seg comparison shows no latency, then it is likely true that no latency exists (the exception is Warm Standby). As we discuss the SQT thread and the DSI SQT module, we will explain in more detail the times and conditions when this could be inaccurate.

SQM Tuning

To control the behavior of the SQM, there are a couple of configuration parameters available:

init_sqm_write_delay (Default: 1000; Recommendation: 50) - RS 11.x
Write delay for the Stable Queue Manager if the queue is being read. init_sqm_write_delay should be less than init_sqm_write_max_delay. Given that I/O operations today are in the low millisecond range, this default value probably should be lowered - see the next configuration for the rationale.

init_sqm_write_max_delay (Default: 10000; Recommendation: 100) - RS 11.x
The maximum write delay for the Stable Queue Manager if the queue is not being read. Given that I/O operations today are in the low millisecond range, this should be lowered. The likely cause of waiting for the queue to be read would be rescanning for large transactions. If we allow up to a 10 second delay due to rescanning a large transaction, we will excessively delay Replication Agent processing and have a bigger impact on the system overall.

sqm_recover_segs (Default: 1; Recommendation: 10) - RS 12.1
Controls how often the SQM updates rs_oqids. By increasing this, the SQM will write less frequently, improving throughput, but lengthening the recovery time due to more segments needing to be analyzed during recovery.

sqm_warning_thr1 (Default: 75; Min: 1; Max: 100) - RS 11.x
Percent of partition segments (stable queue space) used to generate a first warning. The range is 1 to 100.

sqm_warning_thr2 (Default: 90; Min: 1; Max: 100) - RS 11.x
Percent of partition segments used to generate a second warning. The range is 1 to 100.

sqm_warning_thr_ind (Default: 70; Min: 51; Max: 100) - RS 11.x
Percent of total partition space that a single stable queue uses to generate a warning. The range is 51 to 100.

sqm_write_flush (Default: on; Recommendation: off) - RS 12.1
Specifies whether or not writes to memory buffers are flushed to the disk before the write operation completes. Values are "on" and "off". Essentially allows file system devices to be used safely (a la ASE's dsync option).
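These are changed with configure replication server; a hedged sketch matching the recommendations above (verify which of these parameters your RS version exposes at the server level):

configure replication server set init_sqm_write_delay to '50'
go
configure replication server set init_sqm_write_max_delay to '100'
go
configure replication server set sqm_recover_segs to '10'
go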

The first two take a bit of explaining. The stable queue manager waits at least init_sqm_write_delay milliseconds for a block to fill before it writes the block to the correct queue on the stable device - or, if the queue is being read, it will delay writing by this initial delay. Of course, this is only the initial wait time. When the delay time has expired, the SQM writer checks whether there are actually readers waiting for this block. If there are no readers waiting for the block, and the block is not full, then the SQM will adjust this time and make it longer for the next wait. The other option is that the queue is still being read - which again causes the SQM to double the time and wait before it tries to write again. To realize what this means, you have to remember that the reader for the block typically will be the SQT, DSI or RSI thread. If the reader is caught up, then it is in fact waiting for the disk block, and the SQM needs to close the block so that the reader can access it immediately. However, if the reader is behind and is still processing previous blocks, then it will not be waiting for this block, and consequently the SQM can wait a bit longer to see if the block can be filled before flushing it to disk. The downside is that if the SQT is completely caught up, it will be frequently attempting to read from the write block, delaying rows from being appended to it.

You may want to change this parameter if you have special latency requirements and the updates to the primary database are done in bursts. To get the smallest possible latency you'll have to set init_sqm_write_delay to 100 or 200 milliseconds and batch_ltl to false (sp_config_rep_agent). Decreasing init_sqm_write_delay will cause more I/O to occur, as a small init_sqm_write_delay will write blocks that are not filled completely. This will fill up the stable queue faster with less dense blocks. However, for increased throughput, you may wish to increase this parameter in bursty environments with low transaction rates to ensure more full blocks are written and consequently less I/O is required to read/write the queue. A better solution than increasing this parameter is to simply ensure that batch_ltl is on at the Rep Agent (if on, the Rep Agent sends an ltl_buffer_size block of LTL; due to normalization, this may be less space in the queue, but under normal circumstances it will be sufficient). Increasing this value in situations in which the transactions do not quite fill up a full block, but are rather bursty, may degrade performance as the Rep Agent effectively has a synch point with the SQM - basically, another block can not be forwarded until the first one is on disk.

The key here is that this is how long the SQM will wait before writing to the queue if the DSI, RSI or SQT threads are active - to ensure full blocks. This is important - it means that the SQM will delay writing partially full blocks when the SQT is busy reading. Consequently:

- A large transaction that is removed from the SQT cache and is being re-read (and keeping the SQT busy reading) may reduce throughput, as it is likely that once the block is full it will have to be flushed, forcing the SQT to read it from disk vs. from cache.
- If the SQT is completely caught up, the rapid polling read cycle against the SQM write block will cause the SQM to delay appending new rows to the block - delaying RepAgent User throughput.

The other important aspect is that the configuration value is the initial wait time. Each time the RS hits init_sqm_write_delay, it will double the time up to init_sqm_write_max_delay. As a result, after RS has been in operation for any length of time, it is likely that the real delay in writing to the queue when the queue is being read is init_sqm_write_max_delay and not init_sqm_write_delay. As a consequence, in many systems it is a good idea to reduce init_sqm_write_max_delay. The question some may ask is what happens if other replicated rows arrive from the Replication Agent. Note that this delay does not mean the SQM is sleeping - if the block is not full, the SQM at the end of the wait cycle will check to see if there are more write requests. If so, it will append them to the block. Once the block is full and the wait has expired, the SQM will flush it to disk.

On the other hand, init_sqm_write_max_delay is how long a block will be held when the DSI, RSI or SQT threads are suspended and not reading from the queue, or when the reader was not waiting for the block and the SQM delayed past init_sqm_write_delay. A flush to the queue is guaranteed to happen after waiting for init_sqm_write_max_delay. This is the final condition if a block wasn't written yet because of a full condition or the init_sqm_write_delay. This parameter has more to do with when the block will be flushed from memory. If the RS is fully caught up, the SQM readers (when up) may be requesting to read the same disk block as was just written. The SQM cheats and simply reads the block from cache. However, if the SQM reader is not up or is lagging, this parameter controls how long the SQM will keep the block in cache waiting for the reader to resume or catch up. These seem confusing, but consider the following scenario:

1. The SQM begins receiving LTL rows and begins to build a 16K block. Assuming the DSI, RSI or SQT are up and the SQT is actively reading the queue, it waits init_sqm_write_delay before writing the current block to disk.
2. init_sqm_write_delay expires, so the block is written to disk. However, the block is still cached in the SQM's memory. If the block was not full and the readers were not waiting for it, the next block will wait longer (to a maximum of init_sqm_write_max_delay).
3. The DSI, RSI, or SQT reads the next block. If RS is fully caught up, the block it is requesting is the one just written. To avoid unnecessary disk I/O, the block is simply read from cache vs. the copy flushed to disk.

Now, a little bit different. Let's kill the SQM reader (i.e. suspend the DSI, or suspend distribution - the DIST thread starts/stops the SQT thread):

1. The SQM begins receiving LTL rows and begins to build a 16K block.
2. init_sqm_write_delay expires; however, the readers are not up, so the block is not flushed to disk unless it is full.
3. If the reader comes back up within init_sqm_write_max_delay, it is able to retrieve the block from the SQM cache as discussed above - if the next block to read is the current block.
4. If the reader does not come back up within init_sqm_write_max_delay, the block is flushed to disk regardless of full status. The reader will have to do a physical I/O to retrieve the disk block.

Finally, let's consider what likely happens in real life. Let's assume we have a system that is being updated 10 times per second during normal working hours, but is quiescent on weekends and evenings. Assume the default settings and that the rows are 1KB each, so it will take 16 rows to fill a block.

1. RS is booted/re-booted on a weekend. Since there is no activity, after a short time the write delay is doubled from its initial 1 second until it reaches init_sqm_write_max_delay (10 seconds).
2. As activity starts, the first rows arrive - since the block is not full, the SQM delays writing the block (the timer will expire in 10 seconds).
3. At slightly more than 1.5 seconds, enough rows have arrived that the block is full. Even though the timer has not expired, the block will be flushed to disk.
4. A new block is allocated and the timer reset to 0.
5. The process repeats, with the SQM block being written at a rate of 1 every ~1.5 seconds.

What happens if the transaction rate slows to 1 per second? At 1KB rows and 16KB blocks, if we waited for a full block we'd wait ~16 seconds before the block flushed. But since we have a timer, the block will be flushed at init_sqm_write_max_delay regardless of whether or not it is full. So every 10 seconds we would be flushing a block containing 10 rows of data. Someone looking at the replicate database might notice the 10 second delay, make some wrong assumptions about the cause, and try tuning different areas of RS - especially if they have a desire to see RS latency in the 1-2 second range. And that is why it probably is useful to reduce init_sqm_write_max_delay for low throughput systems - while the blocks will be flushed nearly empty, the latency will be reduced. For example, if we use the suggested value of 1 second (from the table above), each block would only contain 1 row of data at 1 transaction per second activity rates. Increasing init_sqm_write_max_delay beyond 10 seconds is probably not useful. If the SQM reader (DSI, RSI or SQT) is down for any length of time, the Rep Agent or DIST will still be supplying data to the SQM. As a result, the block will in all likelihood fill and get flushed to disk. Consequently, it is more probable that the queue will begin to back up if the SQM reader is down, necessitating a physical I/O. The only time increasing this may make sense is when increasing init_sqm_write_delay to greater than 10,000ms - a very rare situation in which queue space may be at a premium and write activity is very low in the source system.

Generally speaking, reducing both init_sqm_write_delay and init_sqm_write_max_delay can help. However, keep this in mind: if the SQM waits too long, the cache of write requests (exec_sqm_write_request_limit) will be filled and the RepAgent User will be forced to wait. This will show up as a RAWriteWait event (in RS 12.6 - in RS 15.0, the counter_obs for RAWriteWaitTime will be incremented). Consequently, reducing this value if there are no RAWriteWaits is likely not going to help. However, if there are RAWriteWaits and you have already maximized exec_sqm_write_request_limit, you could try decreasing these values, as well as looking at the cumulative writes (in MB) for all the queues on the same disk partition, or look at sqm_recover_segs to see if you can speed up the SQM processing.

Normal SQM processing is fairly fast - however, at some point the end of the current 1MB segment will be reached. At that point, the SQM will need to allocate a new segment. While this sounds easy, the SQM actually has to do a bit of checking. Whenever a segment is full and a new one is allocated, the SQM does the following:

1. Update rs_oqid with the last OQID processed for the segment
2. Check if there is space on the current partition being used
3. Check to see if the current partition has been marked to be dropped
4. Check if a new disk_affinity setting specifies a different location
5. Update the disk partition map and allocate the new segment

If a large number of connections exist, or in a high volume system, you may wish to adjust sqm_recover_segs. By increasing this value, the SQM updates rs_oqid less frequently. Note that the SQM does not currently update the RSSD with every block anyhow, so adjusting it from 1 to 2 may not show any appreciable impact. Also, be aware that increasing this parameter may also increase recovery time after a shutdown, hibernation, or any other event that suspends SQM processing. However, setting this value to 10 can help, as SQM flushes to the RSSD are reduced, yet for recovery the most that will have to be scanned is 10 blocks (~160KB). Much like changing the Replication Agent scan_batch_size to reduce the updates to rs_locater, the intent here is to reduce the impact of updating the RSSD; not that the RSSD can't handle the load, but since this is done inline with RS processing, updates to the RSSD degrade RS throughput at exactly that point in time. Additionally, remember that this only reduces the updates to rs_oqid during a segment allocation; the other steps will still have to be performed (but the time to do so will likely be nearly cut in half).

From a performance perspective, the most common way the SQM contributes to performance issues is simply that the SQM can't write to disk fast enough. Other than the lucky instances where you might see the state column in admin who, sqm stating "Awaiting I/O", this may be difficult to detect, as the bytes written to the queue may be more than what was written to the transaction log. However, if you see that the transaction log's rate exceeds the SQM rate, it may be an indication that the Rep Agent is not able to keep up.
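As a minimal sketch of the sqm_recover_segs adjustment just discussed, together with the disk_affinity recommendation that follows, the RCL below uses illustrative values; the connection name NYDS.nydb and the logical partition name part2 are placeholders, not real objects.

    configure replication server set sqm_recover_segs to '10'
    go
    -- NYDS.nydb and 'part2' are hypothetical; disk_affinity only influences future segment allocations
    alter connection to NYDS.nydb set disk_affinity to 'part2'
    go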

From an input standpoint, the SQM write is likely the largest cause of Replication Agent latency; however, the biggest probable cause of overall latency is at the DSI, so concentrating on the SQM alone is likely not going to reduce overall latency much. From a write speed aspect, remember that a stable device may be used by more than one connection. Consequently, if experiencing a high write rate on one or more connections, it is likely advisable to use disk_affinity to spread the writes across different devices for different connections. This includes separating inbound and outbound queues as well.

SQM Monitor Counters

SQM Thread Monitor Counters

In RS 12.1 and 12.5 there was only a single group of counters that applied to the SQM thread. In 12.6, this was supplemented by adding counters for the SQM Reader, and some of the SQM module counters were shifted to the SQM Reader module (listed as deprecated/obsolete in the counter descriptions, as you will see below). While the former still use the module name SQM, the latter use the SQMR module. The SQM module thread counters for RS 12.6 are:

AffinityHintUsed: Total segments allocated by an SQM thread using user-supplied partition allocation hints.
BlocksFullWrite: Total number of full blocks written by an SQM thread. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block).
BlocksRead: Obsolete. See CNT_SQMR_BLOCKS_READ.
BlocksReadCached: Obsolete. See CNT_SQMR_BLOCKS_READ_CACHED.
BlocksWritten: Total number of 16K blocks written to a stable queue by an SQM thread.
BPSaverage: Average byte delivery rate to a stable queue.
BPScurrent: Current byte delivery rate to a stable queue.
BPSmax: Maximum byte delivery rate to a stable queue.
BytesWritten: Total bytes written to a stable queue by an SQM thread.
CmdSizeAverage: Average command size written to a stable queue.
CmdsRead: Obsolete. See CNT_SQMR_COMMANDS_READ.
CmdsWritten: Total commands written into a stable queue by an SQM thread.
Duplicates: Total messages that have been rejected and ignored as duplicates by an SQM thread.
SegsActive: Total active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1.
SegsAllocated: Total segments allocated to a queue during the current statistical period.
SegsDeallocated: Total segments deallocated from a queue during the current statistical period.
SleepsStartQW: Total srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to start.
SleepsWaitSeg: Total srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment.
SleepsWriteDRmarker: Total srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into the inbound queue.
SleepsWriteEnMarker: Total srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue.
SleepsWriteQ: Obsolete. See CNT_SQMR_SLEEP_Q_WRITE.
SleepsWriteRScmd: Total srv_sleep() calls by an SQM Writer client while waiting to write a special message, such as a synthetic rs_marker.
TimeAveNewSeg (intrusive): Average elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.
TimeAveSeg (intrusive): Average elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted.
TimeLastNewSeg (intrusive): The elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.
TimeLastSeg (intrusive): Elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted. Includes time spent due to save interval, so care should be taken when attempting to time RS speed using this counter.
TimeMaxNewSeg (intrusive): The maximum elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.
TimeMaxSeg (intrusive): The maximum elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted. Includes time spent due to save interval, so care should be taken when attempting to time RS speed using this counter.
UpdsRsoqid: Total updates to the RSSD..rs_oqid table by an SQM thread. Each new segment allocation may result in an update of the oqid value stored in rs_oqid for recovery purposes.
WriteRequests: Total message writes requested by an SQM client.
WritesFailedLoss: Total writes failed by an SQM thread due to loss detection, SQM_WRITE_LOSS_I, which is typically associated with a rebuild queues operation.
WritesForceFlush: SQM writer thread has forced the current block to disk when no real write request was present. However, there is data to write and we were asked to do a flush, typically by quiesce force RSI or an explicit shutdown request.
WritesTimerPop: SQM writer thread initiated a write request due to timer expiration.
XNLAverage: Average size of large messages written to a stable queue.
XNLInterrupted: Obsolete. See CNT_SQMR_XNL_INTR.
XNLMaxSize: The maximum size of large messages written so far.
XNLPartials: Obsolete. See CNT_SQMR_XNL_PARTIAL.
XNLReads: Obsolete. See CNT_SQMR_XNL_READ.
XNLSkips: Total large messages skipped so far. This only happens when the site version is lower than 12.5.
XNLWrites: Total large messages written successfully so far. This does not count skipped large messages in a mixed version situation.

Replication Server 15.0 has slightly different SQM counters:

CmdsWritten: Commands written into a stable queue by an SQM thread.
BlocksWritten: Number of 16K blocks written to a stable queue by an SQM thread.
BytesWritten: Bytes written to a stable queue by an SQM thread.
Duplicates: Messages that have been rejected and ignored as duplicates by an SQM thread.
SleepsStartQW: srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to start.
SleepsWaitSeg: srv_sleep() calls by an SQM Writer client due to waiting for the SQM thread to get a free segment.
SleepsWriteRScmd: srv_sleep() calls by an SQM Writer client while waiting to write a special message, such as a synthetic rs_marker.
SleepsWriteDRmarker: srv_sleep() calls by an SQM Writer client while waiting to write a drop repdef rs_marker into the inbound queue.
SleepsWriteEnMarker: srv_sleep() calls by an SQM Writer client while waiting to write an enable rs_marker into the inbound queue.
SegsActive: Active segments of an SQM queue: the number of rows in rs_segments for the given queue where used_flag = 1.
SegsAllocated: Segments allocated to a queue during the current statistical period.
SegsDeallocated: Segments deallocated from a queue during the current statistical period.
TimeNewSeg: The elapsed time, in 100ths of a second, to allocate a new segment. Timer starts when a segment is allocated. Timer stops when the next segment is allocated.
TimeSeg: Elapsed time, in 100ths of a second, to process a segment. Timer starts when a segment is allocated or RepServer starts. Timer stops when the segment is deleted.
AffinityHintUsed: Segments allocated by an SQM thread using user-supplied partition allocation hints.
UpdsRsoqid: Updates to the RSSD..rs_oqid table by an SQM thread. Each new segment allocation may result in an update of the oqid value stored in rs_oqid for recovery purposes.
WritesFailedLoss: Writes failed by an SQM thread due to loss detection, SQM_WRITE_LOSS_I, which is typically associated with a rebuild queues operation.
WritesTimerPop: SQM writer thread initiated a write request due to timer expiration.
WritesForceFlush: SQM writer thread has forced the current block to disk when no real write request was present. However, there is data to write and we were asked to do a flush, typically by quiesce force RSI or an explicit shutdown request.
WriteRequests: Message writes requested by an SQM client.
BlocksFullWrite: Number of full blocks written by an SQM thread. Individual blocks can be written due either to block full state or to sysadmin command 'show_queue' (only one message per block).
CmdSize: Command size written to a stable queue.
XNLWrites: Large messages written successfully so far. This does not count skipped large messages in a mixed version situation.
XNLSkips: Large messages skipped so far. This only happens when the site version is lower than 12.5.
XNLSize: The size of large messages written so far.
SQMWriteTime: The amount of time taken for the SQM to write a block.

Note again that many of the averages, etc. have been removed. However, one new counter of interest is SQMWriteTime. While a byte rate is possibly useful, this counter may help as it shows how long each 16K I/O takes for a full block.

Regardless, the SQM counter values can be viewed in at least two different comparisons. First, the normal approach is to compare the current sampling's values with the previous interval's. This establishes an idea of the rate of a single activity. For example, CmdsWritten compared with itself could demonstrate a rate (when normalized) of 100 commands/second. If the primary activity was a bcp of 200 rows/second, the obvious implication is that the RepAgent can only read that particular table's rows out at half the speed of bcp; consequently, the replication to other destinations will take at least twice as long as the original bcp. The second way of comparing the counters is to compare multiple counters within the same sample interval. In the above list, there are a number of counters that, when compared with their counterparts, can provide insight into the possible causes of performance issues. For instance, consider the following:

RAWriteWaitPct = RAWriteWaits / WriteRequests
CmdsWritten, CmdSizeAverage
BlocksFullPct = BlocksFullWrite / BlocksWritten
SegsActive, SegsAllocated, SegsDeallocated
UpdsRsoqidSec = UpdsRsoqid / Sec
RecoverSeg = SegsAllocated / UpdsRsoqid

The first value (RAWriteWaitPct) is derived by taking the RAWriteWaits from earlier and dividing it by the number of SQM WriteRequests. This tells us a rough percentage of the time that the RA had to wait in order to write. Even a low value such as 5-10% could be indicative of a problem once you realize that the default init_sqm_write_delay is 1 second, which causes the ASE RepAgent to have to wait. The key to all this is realizing that the SQM writes/reads 16K blocks (not configurable). So, by default, the RepAgent User thread will be forced to go to sleep once its outstanding write requests exceed what the SQM Writer can pack into one block. Given that the inbound queue often has a 2-4x space explosion, this can literally mean that for every 4-8KB of log data the RepAgent User is forced to wait, which in turn forces the RepAgent to stop scanning. Fortunately for most people, since they have not adjusted exec_sqm_write_request_limit from the default of 16384, increasing it to the maximum of 983,040 (60 16K blocks) provides a lot more cushion for the RepAgent User to keep processing write requests before it is forced to sleep by the SQM.

The next pair of counters (CmdsWritten, CmdSizeAverage) tells us how many commands actually were written into the queue; this should be close to CmdsTotal from the RA, although it may not be exactly equal as purge commands during a recovery, etc. are not written to the queue. CmdSizeAverage is the first place we get a look at how big each command is from the source when packed into SQM format. However, for an outbound queue this could be different, as the same outbound queue may be receiving transactions from more than one source (a corporate rollup implementation); consequently you may not be able to directly compare CmdsWritten to the DIST counter values. Where a single connection is involved, however, it can be useful.

The next two sets (BlocksFullPct and SegsActive, SegsAllocated, SegsDeallocated) are ones to watch, but you really can't do much about.
In most busy systems, BlocksFullPct will likely be 100%, as every block is written when full vs. on a timer pop. Numbers less than 100% indicate that not a lot of commands are coming into RS on a throughput basis. The SegsActive, SegsAllocated and SegsDeallocated counters are more for tracking space utilization, although ideally the goal is to see SegsAllocated and SegsDeallocated matching. However, while this is a way of tracking disk space utilization, it shouldn't be used as an indication of latency (it could be, but it also could be due to something else).

The last two (UpdsRsoqidSec and RecoverSeg) are related and likely a big factor in SQM performance. As you will notice, once again we are updating the OQID in the RSSD as we track our progress. However, in this case we are concerned about the speed of recovery for RS. When RS is restarted or a connection resumed, RS uses the OQID from the RSSD to locate the current segment and block. The more frequently this is updated, the shorter the distance RS has to scan from the point the RSSD was last updated to the current working location. Again, just like with the Rep Agent scan batch size, you need to look at this realistically. A sub-second recovery interval is likely overkill, and yet most DBAs are surprised to find out that during busy periods they are updating the OQID in the RSSD 2-3 times per second, and this is just the inbound queue. When you add in the outbound queue and multiply across the number of connections, you can see how the updates to the RSSD are a lot higher than we would like. Adjusting sqm_recover_seg from its default of 1 to 10 or another value, and watching both UpdsRsoqidSec and RecoverSeg to fine tune it, is likely a good course of action.
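Circling back to the first derived value in the list above, the arithmetic is trivial but worth making explicit. The sketch below uses two hypothetical sampled counter values (substitute the numbers from your own counter sample):

    -- hypothetical sample values; RAWriteWaits is sampled from the RepAgent User thread, WriteRequests from the SQM
    declare @RAWriteWaits int, @WriteRequests int
    select @RAWriteWaits = 2500, @WriteRequests = 33000
    select RAWriteWaitPct = 100.0 * @RAWriteWaits / @WriteRequests   -- ~7.6%; per the discussion above, even 5-10% can matter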

SQMR Counters

After describing where these counters are located, you might think they are in the wrong location. The SQMR actually refers to the SQM code executed by the reader. For the inbound queue, the readers are the SQT and/or the WS-DSI threads. For the outbound queue, it will be either a DSI or an RSI thread. These can be distinguished via the counter structures. For instance, a Warm Standby that doesn't have distribution disabled, or that is replicating to a third site, will have both a DSI set of SQMR counters (for the Warm Standby DSI, which reads from the inbound queue) and an SQT set of SQMR counters. From the earlier table, we saw that in rs_statdetail these would have the instance_val column value of 11 for the SQT SQMR and 21 for the WS-DSI SQMR. As a result, the counters below actually belong to the respective reader thread in RS 12.6 and 15.0 and are not part of the SQM thread itself. However, in queue processing we are often comparing the read rate to the write rate, and given the name, we will discuss them here. First let's look at the counters from RS 12.6:

CmdsRead: Total commands read from a stable queue by an SQM Reader thread.
BlocksRead: Total number of 16K blocks read from a stable queue by an SQM Reader thread.
BlocksReadCached: Total number of 16K blocks read from cache by an SQM Reader thread.
SleepsWriteQ: Total srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write.
XNLReads: Total large messages read successfully so far. This does not count partial messages or timeout interruptions.
XNLPartials: Total partial large messages read so far.
XNLInterrupted: Number of interruptions so far when reading large messages with partial read. Such interruptions happen due to time out, unexpected wakeup, or a nonblocking read request which is marked as READ_POSTED.
SleepsStartQR: Total srv_sleep() calls by an SQM Reader client due to waiting for the SQM thread to start.

Similar to the SQM counters, RS 15.0 has a few modifications for SQM Readers as well:

CmdsRead: Commands read from a stable queue by an SQM Reader thread.
BlocksRead: Number of 16K blocks read from a stable queue by an SQM Reader thread.
BlocksReadCached: Number of 16K blocks read from cache by an SQM Reader thread.
SleepsWriteQ: srv_sleep() calls by an SQM read client due to waiting for the SQM thread to write.
XNLReads: Large messages read successfully so far. This does not count partial messages or timeout interruptions.
XNLPartials: Partial large messages read so far.
XNLInterrupted: Number of interruptions so far when reading large messages with partial read. Such interruptions happen due to time out, unexpected wakeup, or a nonblocking read request which is marked as READ_POSTED.
SleepsStartQR: srv_sleep() calls by an SQM Reader client due to waiting for the SQM thread to start.
SQMRReadTime: The amount of time taken for the SQMR to read a block.
SQMRBacklogSeg: The number of segments yet to be read.
SQMRBacklogBlock: The number of blocks within a partially read segment that are yet to be read.

The last three (which are new in RS 15.0) are interesting. The problem with the SQMR counters in 12.6 is that they could not be used to derive a relative latency. While the SQM counters SegsAllocated, SegsDeallocated, and SegsActive would appear to give that information, the issue is that a segment remains active until it is deallocated. Since deallocation has a lower priority, a segment could have been read a long time before it is deallocated. These new counters, particularly the Backlog counters, can be used much like the admin who, sqm next.read and last.seg columns to determine a latency. Even better, once the number of segments in the backlog is obtained, SQMRReadTime can be used to estimate how long it will take to read that backlog at the current rate (although this is likely an idealistic number). One aspect to remember is that if a transaction is removed from SQT cache due to size, the SQMR may have to re-read significant numbers of blocks to re-create it later. Keeping this in mind, the best counters to consider for the SQMR include:

CmdsRead
BlocksReadCachedPct = BlocksReadCached / BlocksRead
SleepPct = SleepsWriteQ / BlocksRead

Ideally, of course, we would like to see CmdsRead equal to the SQM counter CmdsWritten. However, because of rescanning, you may frequently see a much higher value, especially when rescanning large transactions that were removed from the SQT cache. The next counter (BlocksReadCachedPct) is the most important for inbound queue reading. Ideally we would like to see this higher than 75%, although anything higher than 30% is fine. The cache referred to for queue reads is an unconfigurable 16K of memory that the writer uses to build the next block to be written. If, between the time that the writer requests the block to be written and the time it starts to re-use the memory to build the next block, a reader requests a message from that block, then it is able to read from cache rather than from disk. While you would like to see high BlocksReadCachedPct numbers and no RepAgent latency, if RepAgent latency does exist (in ASE) you should be concerned that the writer is not flushing blocks fast enough, so that the reader constantly has to wait for the next write (see counter SleepsWriteQ). Alternatively, a possible cause is that the writer is constantly waiting on read activity, and when it waits, it sleeps init_sqm_write_delay up to init_sqm_write_max_delay. So, while reading from cache is good for the reader, it could delay the writer. So if BlocksReadCached is high (i.e. 100%) and there is RepAgent latency, you may want to reduce init_sqm_write_delay (and the max) to reduce the sleep time. For the outbound queue, it is most likely that BlocksReadCachedPct will start high and rapidly drop to zero as the backlog in the DSIEXEC causes the DSI to lag far behind in reading the queue vs. the SQM writing. The final SQMR counter takes a bit of explanation. SleepsWriteQ itself refers to the number of times the reader was put to sleep while waiting for the SQM to write. This wait is most likely caused by the SQMR (SQT or DSI) being caught up and therefore waiting on more data to be written. Consequently, this is best looked at in conjunction with the (SQM) BlocksWritten counter (earlier), but expressed as a ratio of how often it had to sleep for each block read. For the inbound queue, this number (SleepPct) should be in the 300%-700% range as long as BlocksRead is nearly identical to BlocksWritten (or there is a decent BlocksReadCachedPct). This indicates that the SQMR is caught up.
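As a rough sketch of turning the backlog counters into a drain-time estimate, the arithmetic below uses hypothetical sampled values; the 64 blocks per segment figure comes from the counter usage walkthrough later in this section, and the per-block read time is assumed here to already be normalized to seconds (check the units reported by your RS version).

    declare @BacklogSegs int, @BacklogBlocks int, @SecPerBlock numeric(10,4)
    select @BacklogSegs = 25, @BacklogBlocks = 12, @SecPerBlock = 0.02   -- from SQMRBacklogSeg, SQMRBacklogBlock, SQMRReadTime
    select estimated_drain_seconds = (@BacklogSegs * 64 + @BacklogBlocks) * @SecPerBlock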
If the SQT starts to lag and reading then gets behind, the SleepPct ratio might drop. Again, though, one aspect to watch is if the writing seems to be going fine but reading does not look fast enough (usually indicated by the fact that the SQT cache is not full and BlocksReadCachedPct < 30%); a cause may be the configuration values sqt_init_read_delay and sqt_max_read_delay. In RS 12.6, these defaulted to 2000ms and 10000ms respectively, which meant that if the reader went to read and it was caught up, it would most likely sleep for 2 or more seconds, now causing it to be behind. This caused so many problems with upgrades to RS 12.6 that in RS 15.0 the defaults for these values were set at 1ms each, which is likely overkill in the other direction and could be causing DIST servicing problems from the SQT. On the other hand, if SleepPct is too high (i.e. constantly >700%), then it is likely that init_sqm_write_delay is too high. What could be happening is that the SQM writes a block and the SQT reads it, forcing the SQM to sleep init_sqm_write_delay before it can write the next one; but the SQT tries to read the next block during that time and is put to sleep sqt_init_read_delay. You can quickly see how large settings (i.e. the defaults) could cause both the writer and reader to spend a lot of time sleeping instead of doing work, resulting in RepAgent latency (high RAWriteWaits as exec_sqm_write_request_limit eventually fills).

SQM Thread Counter Usage

Again, it helps to look at the counters in terms of the progression of data through the Replication Server. To see how this works, once again we will take a look at the customer data used earlier in the RepAgent User Thread discussion.

1. The first thing that happens is that the SQM Writer client puts a write request message on the internal message queue (as discussed in the earlier section detailing the OpenServer structures). This increments the WriteRequests counter. The counters BPSaverage, BPScurrent, and BPSmax effectively measure the bytes-per-second rate of delivery of the write requests to the SQM, while CmdSizeAverage records the average size of the commands in the write requests to the SQM.

2. The SQM checks each incoming message to see if it is a duplicate or if a loss was detected.
   a. If it is a duplicate, it is discarded, the Duplicates counter is incremented and the SQM starts processing the next write request.
   b. If loss was detected, typically the processing suspends. This can be overridden through a rebuild queues command. Writes issued by such maintenance activities will cause the WritesFailedLoss counter to be incremented.

3. The SQM is continuously performing space management activities. As new requests come in, it may have to allocate additional segments, incrementing the SegsAllocated counter.
   a. If the new segment is allocated according to the disk affinity setting, the counter AffinityHintUsed is incremented.
   b. If intrusive counters are enabled, the time is measured from the last new segment allocated and the counters TimeAveNewSeg, TimeLastNewSeg, and TimeMaxNewSeg are updated accordingly. These counters are interesting in that they show the time it takes for each 1MB segment to be allocated, populated, and written to disk; in other words, in a steady-state high volume system, this demonstrates the disk throughput in MB per unit time. In low volume systems, these counters are likely not as effective, as the write request rate may not be driving new segments to be allocated fast enough.
   c. Depending on the configuration value for sqm_recover_segs, the new segment allocation may have to update the OQID in the RSSD. If this happens, the counter UpdsRsoqid is incremented. If this value is fairly high and SQM write speed is blocking the EXEC or DIST rate, you may want to adjust the sqm_recover_segs configuration to reduce it.
   d. If the SQM has to wait for the segment allocation, the counter SleepsWaitSeg is incremented. While there is no counter that tracks how long it waits, the time is built in to the above counters (TimeAveNewSeg, etc.).
   e. Since a segment is allocated only when needed, the counter SegsActive is incremented, indicating the number of segments that contain undelivered commands.

4. Now that the SQM has space it can write to, it receives the command records and begins filling out a 16K block in memory. This causes several counters to be affected, including CmdsWritten and, in some situations, others as discussed below.
   a. If the command was a replication definition or subscription marker (rs_marker), or a synthetic rs_marker, the SQM has to process these records, so it sleeps while the enablement or disablement occurs. This increments SleepsWriteDRmarker, SleepsWriteEnMarker, and SleepsWriteRScmd accordingly. High values here may indicate that maintenance activity is affecting throughput.
   b. If the message is considered to be large (i.e. corresponds to XNL datatypes), the XNL related counters are affected.
      i. If the RS site version configuration value is less than 12.5, the message is skipped and the XNLSkips counter is incremented. This is useful to detect a bad configuration when the replicate is getting out of sync on tables using XNL datatypes.
      ii. If the RS site version is 12.5 or greater, the XNLWrites, XNLMaxSize, and XNLAverage counters are incremented.

5. Eventually, the block will get flushed to disk (reasons and counters below). Regardless of the reason, this will cause the counters BlocksWritten and BytesWritten to be incremented.
   a. If the block was written to disk because it was full (essentially, the next message would not fit in the space that was left), the counter BlocksFullWrite is incremented.
   b. If the block was written to disk because the init_sqm_write_delay or init_sqm_write_max_delay write timer expired, the counter WritesTimerPop is incremented. This is an indication that either the SQM is not getting data from the RepAgent User thread fast enough (i.e. the RA User is starved for CPU time), or the inbound stream of data is not that high of a volume.
   c. If the block was written to disk due to an RS shutdown, hibernation or other maintenance activity that suspends or shuts down the SQM thread, the counter WritesForceFlush is incremented.

6. When an SQM Reader finishes processing its previous command(s), it will attempt to read the next block from the queue or from SQM cache. While a block is being filled, it cannot be read by an SQM Reader client (SQT or WS-DSI). If this happens, the SleepsWriteQ counter is incremented. This is an indication that the SQM Reader is reading the blocks at the same rate that they are being written, i.e. it is not lagging behind. However, remember that you may have multiple readers for an inbound queue. One of them (typically the SQT) may be reading fast enough to read the blocks from cache and may be tripping this counter, while the other may be lagging (see the next point below).

7. When the block is read, the counters BlocksRead and BlocksReadCached are incremented accordingly. Obviously, the ratio of BlocksReadCached:BlocksRead is similar to the cache hit ratio in ASE and can indicate when exec_sqm_write_request_limit/md_sqm_write_request_limit are too small, or that an SQM reader is lagging behind. In cases where there are multiple readers, one may be caught up (and incrementing BlocksReadCached) while the other is lagging. In strict Warm Standbys with no other replicate involved, the SleepsWriteQ and BlocksReadCached may be the effect of the SQT processing the messages if distribution has not been disabled for the connection. In such cases, disabling the DIST will provide more accurate values for these counters. Otherwise, the admin sqm_readers command or SegsActive can be an indication of how far the WS-DSI may be lagging behind.

8. Once the block is read successfully, the reader parses out the commands. This causes the counter CmdsRead to be incremented. If the message contains XNL data, additional command records may need to be read as follows:
   a. For each partial XNL data record read, the XNLPartials counter is incremented.
   b. If the XNL data record spans more than one 16K block, the next block will be fetched and processed. However, since the SQM is a single thread, the write timers may have popped, necessitating a write operation. When this happens, the reading of large messages is interrupted and the XNLInterrupted counter is incremented. If you see large values for XNLInterrupted, it may be an indication that the large message reading is blocking the SQM writes, which in turn may be slowing down RepAgent processing. If this occurs frequently, you may need to check the replicate_if_changed state of text/image columns, or whether their replication is necessary at all. The same could be true for large comment columns; while these may be necessary for WS systems, in non-WS environments replicating 16,000 character comment fields to a reporting system may not be necessary.
   c. Once the last row is read for the large message, the counter XNLReads is incremented.

9. Once all the commands have been read from a block and successfully processed, the SQM reader tells the SQM that it is finished with that block. This continues for all 64 blocks in the segment. When all SQM readers signal that they are finished with all the blocks on a particular segment, the segment is marked inactive and the SegsActive counter is decremented.
   a. If intrusive counters have been enabled, the timers started when the segment was allocated (3(b) above) are sampled and the TimeAveSeg, TimeLastSeg, and TimeMaxSeg counters are adjusted.

10. Once the segment has been marked inactive and any save interval has expired, the segment is deallocated from the particular queue. This increments the SegsDeallocated counter.

Let's take a look at some sample data. Again, we will use the customer data as well as the insert stress test, starting with the customer data below. First, we will look at the writing side by looking at the SQM counters (vs. the reading side, which are the SQMR counters). Once again, derived statistics are marked as (derived).

Sample Time | CmdsWritten | CmdSizeAverage | BlocksWritten | WritesTimerPop | BlocksFullWrite | BlocksFull% (derived) | SegsAllocated | SleepsWaitSeg | UpdsRsoqid | UpdsRsoqid/sec (derived) | Sqm_recover_seg (derived)
0:29:33 | 268,187 | 1,655 | 32,693 | 2 | 32,691 | 99.99 | 511 | 0 | 511 | 1.6 | 1
0:34:34 | 364,705 | 1,380 | 36,395 | 3 | 36,392 | 99.99 | 569 | 0 | 569 | 1.8 | 1
0:39:37 | 253,283 | 1,190 | 23,664 | 1 | 23,663 | 99.99 | 370 | 0 | 370 | 1.2 | 1
0:44:38 | 266,334 | 893 | 18,322 | 2 | 18,320 | 99.98 | 287 | 0 | 287 | 0.9 | 1
0:49:40 | 253,684 | 1,097 | 22,907 | 2 | 22,903 | 99.98 | 358 | 0 | 358 | 1.1 | 1
0:54:43 | 164,566 | 1,723 | 24,759 | 0 | 24,759 | 100 | 387 | 0 | 387 | 1.2 | 1
0:59:45 | 376,184 | 1,355 | 39,865 | 1 | 39,862 | 99.99 | 623 | 0 | 622 | 2 | 1
1:04:47 | 450,809 | 1,032 | 34,248 | 1 | 34,246 | 99.99 | 536 | 0 | 535 | 1.7 | 1
1:09:50 | 326,750 | 1,200 | 31,783 | 0 | 31,783 | 100 | 497 | 0 | 497 | 1.6 | 1
1:14:52 | 325,340 | 1,011 | 25,153 | 0 | 25,153 | 100 | 393 | 0 | 393 | 1.3 | 1
1:19:54 | 317,674 | 825 | 19,975 | 1 | 19,974 | 99.99 | 312 | 0 | 312 | 1 | 1

Let's take a look at some of these:

CmdsWritten - This corresponds to the number of commands actually written to the queue. This metric should be fairly close to the RepAgent counter CmdsTotal, although it may not be exact as some RepAgent User thread commands are system commands not written to the queue (such as truncation point fetches). While this may not appear to be as useful given that CmdsTotal is broken down by CmdsApplied, CmdsRequest, CmdsSystem, etc., this value is actually fairly important when looking at read activity and the SQMR counters.

CmdSizeAverage - This metric records the number of bytes necessary to store each command. For inserts, this is the after row image, while for updates it is both the after row image and the before row image, less identical values when minimal columns is enabled. This metric is useful when trying to determine how wide the replicated rows are (for space projections), and especially when compared to the RepAgent counter PacketsReceived. If the CmdSizeAverage is large, i.e. 2,000 bytes, this could result in a single command per packet being sent using the default packet size. Earlier, we noted that we were getting about 3 RepAgent commands per packet (which includes begin/commit transaction commands), and this metric demonstrates why: at ~1,000 bytes per command, that is all that will fit in the default packet size.

WritesTimerPop & BlocksFull% - The second metric is derived by dividing BlocksFullWrite by BlocksWritten. Both of these are a good indication of how busy the input stream to Replication Server is. Any write caused by a timer pop indicates that the SQM block wasn't full, indicating a lull in activity from the Replication Agent User thread. This system is consistently busy with very marginal timer-driven flushes. A non-busy system would likely have a lot more, and a correspondingly lower full percentage.

SegsAllocated & SleepsWaitSeg - Taken together, these two can illustrate when the segment allocation process is hindering replication performance. The actual cause of the delay could be I/O related; however, it is just as likely to be caused by RSSD performance issues.

UpdsRsoqid/sec - This metric is derived by dividing UpdsRsoqid by the number of seconds between sample intervals. Again, it shows the impact on the RSSD. If we couple this metric with the RepAgent counter UpdsRslocater from above, we are averaging about 2 updates/second. While not a high volume, this again shows the interruption in RS processing to record recovery information.

Sqm_recover_seg - This metric is derived by dividing SegsAllocated by UpdsRsoqid. Much like the RA ECTS value, this is a good indication of the actual RS configuration parameter sqm_recover_seg. Adjusting this slightly could improve RS throughput.
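As a quick worked check of two of the derived columns above, using the literal values from the 0:29:33 sample row (this is simply the arithmetic restated as a query; the values are copied from the table, not fetched live):

    select
        BlocksFullPct = 100.0 * 32691 / 32693,   -- BlocksFullWrite / BlocksWritten => ~99.99
        RecoverSeg    = 511.0 / 511              -- SegsAllocated / UpdsRsoqid => effective sqm_recover_seg of 1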

Before we look at the SQM read (SQMR) counters, let's compare this to the insert stress test:

Sample Time | CmdsWritten | CmdSizeAverage | BlocksWritten | WritesTimerPop | BlocksFullWrite | BlocksFull% (derived) | SegsAllocated | SleepsWaitSeg | UpdsRsoqid | UpdsRsoqid/sec (derived) | Sqm_recover_seg (derived)
11:37:57 | 1,027 | 1,465 | 105 | 1 | 104 | 99.04 | 2 | 0 | 0 | 0 | 0
11:38:08 | 7,788 | 1,491 | 817 | 0 | 817 | 100.00 | 12 | 0 | 1 | 0.1 | 12
11:38:19 | 4,512 | 1,491 | 471 | 0 | 471 | 100.00 | 7 | 0 | 1 | 0 | 7
11:38:30 | 20,336 | 1,491 | 2,120 | 0 | 2,120 | 100.00 | 33 | 0 | 3 | 0.3 | 11
11:38:41 | 553 | 1,458 | 57 | 1 | 56 | 98.24 | 1 | 0 | 0 | 0 | 0

Again, we see mostly full blocks, with the exception of the beginning and end of the test run, which illustrates how WritesTimerPop can be used to indicate a lull in Replication Agent User thread activity. Also note that sqm_recover_seg is 10, and the derived value shows the fluctuation induced by averaging across time periods; for example, the 11:38:08 sample likely had an update to rs_oqid at 8 segments (2 from the previous sample period + 8 = 10), and then the next four were combined with six of the seven in sample 11:38:19, and so forth.

Now let's take a look at some read statistics by looking at the SQMR counters. First, let's view the customer data metrics (note that the segment allocation metrics are SQM and not SQMR counters):

Sample Time | CmdsWritten (SQM) | CmdsRead (SQMR) | BlocksRead | BlocksReadCached | Cached Read % | SleepsWriteQ | Write Wait % | SegsActive | SegsAllocated | SegsDeallocated
0:29:33 | 268,187 | 587,860 | 73,887 | 17,996 | 24.35 | 40,153 | 54.34 | 303 | 511 | 621
0:34:34 | 364,705 | 947,808 | 99,781 | 19,657 | 19.70 | 36,035 | 36.11 | 38 | 569 | 835
0:39:37 | 253,283 | 318,611 | 33,309 | 11,165 | 33.51 | 84,369 | 253.29 | 2 | 370 | 403
0:44:38 | 266,334 | 282,958 | 19,998 | 7,615 | 38.07 | 79,786 | 398.96 | 2 | 287 | 287
0:49:40 | 253,684 | 277,054 | 28,017 | 5,199 | 18.55 | 40,364 | 144.06 | 25 | 358 | 335
0:54:43 | 164,566 | 194,386 | 39,231 | 8,344 | 21.26 | 19,273 | 49.12 | 2 | 387 | 412
0:59:45 | 376,184 | 365,435 | 43,396 | 2,462 | 5.67 | 19,398 | 44.69 | 41 | 623 | 583
1:04:47 | 450,809 | 522,844 | 42,165 | 8,728 | 20.69 | 40,419 | 95.85 | 57 | 536 | 522
1:09:50 | 326,750 | 400,065 | 44,025 | 7,210 | 16.37 | 73,404 | 166.73 | 29 | 497 | 523
1:14:52 | 325,340 | 352,656 | 32,134 | 6,438 | 20.03 | 73,932 | 230.07 | 3 | 393 | 422
1:19:54 | 317,674 | 317,683 | 19,975 | 10,909 | 54.61 | 144,828 | 725.04 | 2 | 312 | 312

Let's take a look at some of these:

CmdsWritten (SQM) vs. CmdsRead (SQMR) - It looks like the SQMR is reading a lot more than the SQM is writing. This is partially true. What has happened is that the SQT cache filled, causing large transactions to get removed from cache. Consequently, when the commit was finally seen, the SQT had to re-read the entire transaction from disk and therefore had to re-request the commands from the SQM. Anytime the SQMR CmdsRead counter is appreciably higher than SQM CmdsWritten, you should look at the SQT metrics, as the SQM is re-scanning the disks. As you will see in some of the later metrics, this has an impact on system performance.

Cached Read % - This metric is derived by dividing BlocksReadCached by BlocksRead. Ideally, we would like this to be in the high percentages, with 100% being perfect and anything in the 90s acceptable. In this case we see rather dismal numbers, largely the fault of all the rescanning. Even when it appears to catch up (around samples 3, 4 & 5), the cache hit rate is low. The reason is simple: when the SQMR had to re-read, the SQM had to flush the blocks it was holding to disk, resulting in physical reads most of the time.

Write Wait % - This metric is derived by dividing SleepsWriteQ by BlocksRead. This is actually an interesting metric. It is desirable that SleepsWriteQ is high; by definition, it is incremented when the SQM read client sleeps while waiting for the SQM write client to write. While normally 100% would be considered complete, in this case an SQM read client may have to wait more than once for the current block to be written. Consequently, the higher above 100% this value is, the stronger the indication that the SQM read client is caught up to the SQM writer. This will be more evident when looking at the insert stress test metrics. However, numbers below 300% seem to indicate a latency.

SegsActive - This metric shows how much space is being consumed in the stable queue. Similar to (and in fact the same metric as) admin who, sqm, the number of active segments indicates latency. However, the latency may not be as large as the actual number of active segments. For instance, between the first two sample periods, the number of active segments drops from 303 to 38. Likely, the large transaction began 300+ segments back, and once it had been successfully read out and distributed, the SQM could then drop those segments (a better description of the process is contained in the SQT processing section regarding the Truncate list). Ideally, low numbers are desirable here.
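As a quick worked check using the first customer SQMR sample row above (literal values copied from the table):

    select
        CachedReadPct = 100.0 * 17996 / 73887,   -- BlocksReadCached / BlocksRead => ~24.35
        WriteWaitPct  = 100.0 * 40153 / 73887    -- SleepsWriteQ / BlocksRead => ~54.34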

Now, let's take a look at the same counters from the insert stress test. The only caveat is that the insert stress test was a Warm Standby implementation, so these counters are from the SQM read client for the WS-DSI that was reading from the inbound queue.

Sample Time | CmdsWritten (SQM) | CmdsRead (SQMR) | BlocksRead | BlocksReadCached | Cached Read % | SleepsWriteQ | Write Wait % | SegsActive | SegsAllocated | SegsDeallocated
11:37:57 | 1,027 | 1,018 | 105 | 104 | 99.04 | 112 | 106.7 | 3 | 2 | 0
11:38:08 | 7,788 | 7,795 | 817 | 759 | 92.90 | 7,798 | 954.5 | 12 | 12 | 3
11:38:19 | 4,512 | 4,514 | 471 | 466 | 98.93 | 3,816 | 810.2 | 16 | 7 | 4
11:38:30 | 20,336 | 20,084 | 2,093 | 631 | 30.14 | 6,742 | 322.1 | 45 | 33 | 4
11:38:41 | 553 | 575 | 59 | 13 | 22.03 | 3,509 | 5,947.5 | 43 | 1 | 4

Comparing this to the above, we notice that the Cached Read % is in the high 90s initially (it drops off later because the DSI is not keeping up; the Cached Read % is artificially high at the beginning while the DSI SQT cache is being filled). However, note that the Write Wait % is very high, which is desirable. SegsActive is climbing as the DSI falls behind, due to the replicate ASE not being able to receive the commands fast enough (most often this is the biggest source of latency). This last point is interesting: nearly all customers who call into Sybase Tech Support complaining about latency in RS, and who assume RS is the problem because the backlog is in the inbound queue, forget that in a Warm Standby they only have an inbound queue, which also functions as the outbound queue.

SQT Processing

The Stable Queue Transaction (SQT) thread is responsible for sorting the transactions into commit order and then sending them to the Distributor to determine the distribution rules. The following diagram depicts the flow of data through the SQT, starting with the inbound queue SQM, through to the Distributor and the outbound queue.


Figure 18 Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue
It is good to think of the SQT as just one step in the process between the two queues, and the performance of this pipeline of data flowing between the queues depends on the performance of each thread along the path. For this section, we will focus strictly on the SQT thread on the left side of the above diagram. In early releases of Replication Server, the SQT thread was a common cause of problems because the default SQT cache was only 128KB and DBAs would forget to tune it. Even today's default (1MB) may not be sufficient. In any case, thankfully, this problem is very easy to address by adding cache. Unfortunately, this became almost a silver bullet that DBAs relied on, simply raising the SQT cache any time there was latency and then complaining when it no longer helped. Today, if the SQT cache is already above 4-8MB, DBAs should resist raising it further without first seeing if the cache is actually being exceeded (a quick way to check is sketched after the list below). Likely the problem isn't here, and adding more cache will likely just contribute to the problem at the DSI.

Key Concept #11: SQT cache is dynamically allocated. For small transactions, large amounts of SQT cache will not even be utilized, and will result in over-allocating DSI SQT cache if dsi_sqt_max_cache_size is still at the default.

As mentioned earlier, the SQT thread is responsible for sorting the transactions into commit order. In order to better understand the performance implications of this (and the output of admin who, sqt), it is best to understand how the SQT works.

SQT Sorting Process

The SQT sorts transactions by using 4 linked lists, often referred to (confusingly enough) as queues. These lists are:

Open - The first linked list that transactions are placed on; this queue is a list of transactions for which the commit record has not been processed or seen by the SQT thread yet.

Closed - Once the commit record has been seen, the transaction is moved from the Open list to the Closed list and a standard OpenServer callback is issued to the Distributor thread (or DSI, although this is internal to the DSI, as will be discussed later in the outbound section).

Read - Once the DIST or DSI threads have read the transaction from the SQT, the transaction is moved to the Read queue.

Truncate - Along with the Open queue, when a transaction is first read into the system, the transaction structure record is placed on the Truncate queue. Only after all of the transactions on a block have had their commit statements read, been processed by the DIST and been placed on the Read queue can the SQT request the SQM to delete the block.
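As the check mentioned above, one minimal way to confirm that SQT cache is genuinely being exhausted before raising it is sketched below. The configure value is in bytes (4194304 = 4MB, matching the recommendation discussed later) and is illustrative only.

    admin who, sqt
    go
    -- only consider raising the cache if the Removed column is non-zero (or Full is frequently 1)
    configure replication server set sqt_max_cache_size to '4194304'
    go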

To get a better idea how this works, consider the following example of three transactions committed in the following order at the primary database:

Figure 19 Example Transaction Execution Timeline (three interleaved transactions, TX1, TX2 and TX3, executing statements across timeline positions T00-T17; legend: BTn/CTn = begin/commit transaction pair with transaction id, Inn/Unn/Dnn = insert/update/delete statement ids)


In this example, the transactions were committed in the order 2-3-1. Due to the commit order, however, the transactions might as well have been applied similar to:

Figure 20 Example Transactions Ordered by Commit Time (the same statements rearranged as if applied serially in commit order: TX2, then TX3, then TX1)


However, the transaction log is not that neat. In fact, it would probably look more like the following:
Figure 21 Transaction Log Sequence for Example Transaction Execution (the statements of TX1, TX2 and TX3 interleaved in the log in execution order)


After the Rep Agent has read the log into the RS, the transactions may be stored in the inbound queue in blocks similar to the following (assuming blocks were written due to timing and not due to being full):


Figure 22 Inbound Queue from Example Transactions with Sample Row Ids (blocks 0.0 through 0.6, each holding the log records that arrived in that interval; individual rows are addressed as segment.block.row, e.g. rows 0.3.0 through 0.3.4)

The following diagrams illustrate the transactions being read from the SQM by the SQT and sorted via the Open, Closed, Read and Truncate queues within the SQT. After reading the first block (0.0), these four queues will look like the below:
The following diagrams illustrate the transactions being read from the SQM by the SQT, sorted via the Open, Closed, Read and Truncate queues within the SQT. After reading the first block (0.0), these four queues will look like the below:

Open: TX1 (BT1, I11, I12)    Closed: (empty)    Read: (empty)    Truncate: TX1

Figure 23 SQT Queues After Reading Inbound Queue Block 0.0


Note that the transaction is given a transaction structure record (TX1 above) and the statements read thus far, along with the begin transaction record, have been linked in a linked list on the Open queue. Note also that immediately after reading the transaction from the SQM, the transaction id is recorded in the linked list for the Truncate queue. Continuing on and reading the next block from the SQM yields:

Open: TX1 (BT1, I11, I12, I13, U14), TX2 (BT2, I21, I22)    Closed: (empty)    Read: (empty)    Truncate: TX1, TX2

Figure 24 SQT Queues After Reading Inbound Queue Block 0.1


Having read the second block from the SQM, we encounter the second transaction. So, we begin a second linked list for its statements, as well as continuing to build the first transaction's list with statements belonging to it read from the second block. Additionally, we add that transaction to the Truncate queue. Continuing on and reading the next block from the SQM yields:

Open: TX1 (BT1, I11, I12, I13, U14, D15, I16), TX2 (BT2, I21, I22, I23, I24)    Closed: (empty)    Read: (empty)    Truncate: TX1, TX2

Figure 25 SQT Queues After Reading Inbound Queue Block 0.2


No new transactions were formed, so we are simply adding statements to the existing transaction linked lists. Continuing on yields the following SQT organization:

Open: TX1 (BT1, I11, I12, I13, U14, D15, I16), TX2 (BT2, I21, I22, I23, I24, I25, I26), TX3 (BT3, U31, U32)    Closed: (empty)    Read: (empty)    Truncate: TX1, TX2, TX3

Figure 26 SQT Queues After Reading Inbound Queue Block 0.3


At this point, we have all three transactions in progress. Continuing with the next block read from the SQM yields the first commit (for TX2). Since we now have a commit, the transaction's linked list of statements is simply moved to the Closed queue and the DIST thread is notified of the completed transaction. This yields an SQT organization similar to:

Open: TX1, TX3    Closed: TX2 (BT2 through CT2)    Read: (empty)    Truncate: TX1, TX2, TX3

Figure 27 SQT Queues After Reading Inbound Queue Block 0.4


Continuing with the next read from the SQM, the DIST is able to read TX2 which causes it to get moved to the Read queue and the commit record for TX3 is read, which moves it to the Closed queue. This yields an SQT organization similar to:

Open: TX1    Closed: TX3 (BT3 through CT3)    Read: TX2 (BT2 through CT2)    Truncate: TX1, TX2, TX3

Figure 28 SQT Queues After Reading Inbound Queue Block 0.5


At this juncture, you might think that we could remove TX2 from the inbound queue. However, if you remember, all I/O is done at the block level. In addition, in order to free the space, the space must be freed contiguously from the front of the queue (block 0.0 in this case). Since the statements that make up TX2 are scattered among blocks containing statements for transactions for which the commit has not been seen yet, the deletion of TX2 must wait. Continuing on with the last block to be read yields the following:

Open: (empty)    Closed: TX1 (BT1 through CT1), TX3 (BT3 through CT3)    Read: TX2 (BT2 through CT2)    Truncate: TX1, TX2, TX3

Figure 29 SQT Queues After Reading Inbound Queue Block 0.6


At this stage, all transactions have been closed; however, we still cannot remove them from the inbound queue. Remember, this is strictly memory sorting (SQT cache); consequently, if we removed them from the inbound queue now and a system failure occurred, we would lose TX1. Consequently, we have to wait until it has been read by the DIST. Once that is done, all three transactions would be in the Read queue, and consequently a contiguous block of transactions could be removed, since all of the transactions on those blocks have been read. If, however, block 0.6 also contained a begin statement for TX4, the deletes could still be done for blocks 0.0 through 0.5. How? The answer is that the SQM flags each row in the queue with a status flag that denotes whether it has been processed. Consequently, on restart after recovery, the SQT doesn't attempt to re-sort and resend transactions already processed. Instead, it simply starts with the first active segment/row and begins sorting from that point.

SQT Performance Analysis

Now that we have seen how the SQT works, this should help explain the output of the admin who, sqt command (example copied from the Replication Server Reference Manual).


admin who, sqt

 Spid  State            Info
 ----  ---------------  --------------------------
 17    Awaiting Wakeup  101:1 TOKYO_DS.TOKYO_RSSD
 98    Awaiting Wakeup  103:1 DIST LDS.pubs2
 10    Awaiting Wakeup  101 TOKYO_DS.TOKYO_RSSD
 0     Awaiting Wakeup  106 SYDNEY_DS.pubs2sb

 Closed  Read  Open  Trunc  Removed  Full  SQM Blocked  First Trans  Parsed  SQM Reader  Change Oqids  Detect Orphans
 ------  ----  ----  -----  -------  ----  -----------  -----------  ------  ----------  ------------  --------------
 0       0     0     0      0        0     1            0            0       0           0             0
 0       0     0     0      0        0     1            0            0       0           0             0
 0       0     0     0      0        0     0            0            0       0           0             1
 0       0     0     0      0        0     0            0            0       0           0             1

The observant will notice that not all the SQT threads are listed: the ones for the inbound queues (designated with qid:1) are present, but the ones for outbound queues (designated qid:0) appear to be missing. The reality is that there is not an SQT thread for outbound queues. Instead, the DSI (Scheduler) calls SQT routines. Consequently, spids 10 & 0 above represent DSI threads performing SQT library calls. For this section, we are going to concentrate on the SQT thread aspect; however, remember that it applies to the DSI SQT module as well. Differences will be discussed in the section on the DSI later. The output columns are described below:

Spid - Process id for each SQT thread.

State - State of the processing for each SQT thread.

Info - Queue being processed.

Closed - Number of transactions in the Closed queue waiting to be read by the DIST or DSI. If a large number of transactions are Closed, then the next thread (DIST or DSI-Exec) is the bottleneck, as the SQT is simply waiting for the reader to read the transactions.

Read - Number of transactions in the Read queue. This essentially shows the number of transactions processed but not yet deleted from the queue. A high number here may point to a long transaction that is still Open at the very front of the queue (i.e. a user went to lunch), as deleting queue space is fairly quick.

Open - Number of transactions in the Open queue for which the commit has not been seen by the SQT yet (although the SQM may have written it to disk already).

Trunc - Number of transactions in the Truncate queue; essentially an ordered list of transactions to delete once processed, in disk-contiguous order. Trunc is the sum of the Closed, Read, and Open columns (due to reasons discussed above).

Removed - Number of transactions removed from cache. Transactions are removed if the cache becomes full or the transaction is a large transaction (discussion later).

Full - Denotes if the SQT cache is currently full. Since this is a transient value, you may wish to monitor the Removed counter to detect if transactions are getting removed due to cache being full.

SQM Blocked - 1 if the SQT is waiting on the SQM to read a message. This state should be transitory unless there are no closed transactions.

First Trans - Contains information about the first transaction in the queue and can be used to determine if it is an unterminated transaction. The column has three pieces of information: ST:, followed by O (open), C (closed), R (read), or D (deleted); Cmds:, followed by the number of commands in the first transaction; and qid:, followed by the segment, block, and row of the first transaction. An example would be "ST:O Cmds: 3245 qid: 103.5.23", which tells you that at this stage the first transaction in the queue is still Open (no commit read by the SQT), so far has 3,245 commands (probably a large one), and begins in the queue at segment 103, block 5, row 23. As we will see later, this is a very useful piece of information.

Parsed - The number of transactions that have been parsed. This is the total number of transactions, including those already deleted from the queue. Along with statistics, this field can give you an idea of the transaction volume over time.

SQM Reader - The index of the SQM reader handle. If there are multiple readers of an SQM, this designates which reader it is.

Change Oqids - Indicates that the origin queue id has changed. Typically this only happens in a Warm Standby after a switch active.

Detect Orphans - Indicates that it is doing orphan detection. This is largely only noticed on RSI queues. For normal database queues, if someone does not close their transaction when the system crashes, on recovery the Rep Agent will see the recovery checkpoint and instruct the SQM to purge all the open transactions to that point.

Admin who, sqt is one of the key commands for determining problems with inbound queue performance. In addition to helping you identify the progress of transactions through the Open, Closed, Read and Truncate queues, it is extremely useful for determining when you have encountered a large transaction, or one that is being held open for a very long time. The column that assists in this is the First Trans column. Above we gave an example of one view of a large transaction (ST:O Cmds: 3245 qid: 103.5.23). Consider the following tips for this column:

ST: | Cmds       | Qid        | Possible Cause
O   | increasing | same       | Large transaction
O   | same       | same       | Rep Agent down or uncommitted transaction at primary
O   | changes    | increasing | SQT processing normally
C   | changes    | slow       | SQT reader not keeping up (DIST or DSI)
C   | same       | same       | DIST down, outbound queue full
R   | same       | same       | Transaction on same block/queue still active

It is important to recognize that this is the first transaction in the queue which especially for the outbound queue could have been delivered already. The inbound queue is even more confusing it may have already been processed, but the space has not been truncated from the queue yet by the SQM. This is especially true if the sqt_init_read_delay and sqt_max_read_delay to are not set to 1000 milliseconds (1 second). Common Performance Problems The most common problems with the SQT are associated with 1) large transactions; and 2) slow SQT readers (i.e. DIST or DSI). The first deals with the classic 10,000 row delete. If the SQT attempted to cache all of the statements for such a delete in its memory, it would quickly be exhausted. Consequently, when a large transaction is encountered,

120

Final v2.0.1
the SQT simply discards all of its statements and merely keeps the transaction structure in the appropriate list. However, this means that in order for the transaction to be passed to the SQT reader, the SQT must go back to the beginning of the transaction and physically rescan the disk. In addition to the slowdown of simply doing the physical i/o, it effectively pauses the scanning where the SQT had gotten to until that transaction is fully read back off disk and sent to the DIST, etc. It also impacts Replication Agent performance, as this likely involves a large number of read requests to refetch all of the same blocks, adding to the workload of an SQM that is busy trying to write.

The second problem is common as well. In cases where the DIST or DSI threads cannot keep up, the Closed queue continues to grow until all available SQT (or DSI SQT) cache is used. Once this begins to happen, the SQT has a decision to make. If there are transactions in the Closed or Read queue, the SQT simply halts reading from the SQM until a transaction is complete and the queue can be truncated. If there are no transactions in the Closed or Read queue, the SQT finds the largest transaction in the Open queue, discards its statements (keeping the transaction structure, similar to a large transaction) and then keeps processing. Should this continue for very long, a large number of transactions in the SQT cache may have to be rescanned, further slowing down the overall process.

SQT Performance Tuning

To control the behavior of the SQT, there are a couple of configuration parameters available:

sqt_max_cache_size (Default: 1MB; Recommendation: 4MB) - RS 11.x
  Memory allocated per connection for SQT cache. Note that this is a maximum - RS dynamically allocates this as needed by the connection and then deallocates it when no longer needed. Consequently, this is frequently oversized, and customers often don't understand why continuing to increase it has no effect. Values above 4MB need to be considered very cautiously and only when transactions are being removed and the cache has been exceeded. The reason is that oversizing this can drive the DSI to spend more time filling cache than issuing SQL, due to the default value of dsi_sqt_max_cache_size. See discussion below.

dist_sqt_max_cache_size (Default: 0??; Recommendation: 4MB) - RS 12.6+/15.0+
  This parameter was added in RS 12.6 ESD #7 as well as RS 15.0 ESD #1. In the past, all connections used sqt_max_cache_size for the inbound queue processing by the SQT regardless of requirement. With this parameter, individual inbound queue SQT cache sizes can be tuned, similar to DSI SQT cache sizes.

dsi_sqt_max_cache_size (Default: 0; Recommendation: 1MB) - RS 11.x
  If other than zero, the amount of memory used by the DSI thread for SQT cache. If zero, the memory used by the DSI is the same as the sqt_max_cache_size setting. The default of 0 is clearly inappropriate if you start adjusting sqt_max_cache_size. See discussion below.

sqt_init_read_delay (Default: 2000; Min: 1000; Max: 86,400,000 (24 hrs); Recommendation: 10) - RS 12.5+
  The length of time in milliseconds that an SQT thread sleeps while waiting for an SQM read before checking to see if it has been given new instructions in its command queue. With each expiration, if the command queue is empty, the SQT doubles its sleep time up to the value set for sqt_max_read_delay.

sqt_max_read_delay (Default: 10000; Min: 1000; Max: 86,400,000 (24 hrs); Recommendation: 50) - RS 12.5+
  The maximum length of time in milliseconds that an SQT thread sleeps while waiting for an SQM read before checking to see if it has been given new instructions in its command queue.
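As an illustration only, the parameters above are set from isql connected to the Replication Server roughly as follows. The connection name PRIMARY_DS.pubs2 and the values are placeholders, not recommendations for any particular site, and depending on the RS version a server-level change may require a Replication Server restart to take effect.

-- server-wide SQT cache used for inbound queue sorting (4MB)
configure replication server set sqt_max_cache_size to '4194304'
go
-- per-connection DSI SQT cache so the DSI does not inherit the larger value (1MB)
alter connection to PRIMARY_DS.pubs2 set dsi_sqt_max_cache_size to '1048576'
go
-- connection-scoped changes generally take effect after the connection is bounced
suspend connection to PRIMARY_DS.pubs2
go
resume connection to PRIMARY_DS.pubs2
go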

There are two main ways of improving SQT performance. The first is rather obvious: increase the amount of memory that the SQT has by changing the value of sqt_max_cache_size. By default, the SQT has 1MB for each inbound and outbound queue. So, for a total of 2 source and 5 destination databases, we would have 14 (2 source inbound/outbound and 5 destination inbound/outbound) 1MB memory segments for SQT cache. However, 1MB is typically too little. Most medium production systems need 2MB SQT caches, with high volume OLTP systems using anywhere from
4MB+ of cache. Obviously, the more connections you have, the more this impacts overall Replication Server memory settings. With a 4MB sqt_max_cache_size setting, the earlier example of 2 sources/5 destinations would require 56MB strictly for SQT cache, provided that all SQT caches are completely full. Earlier we had the following table:

Configuration             Normal   Mid Range   OLTP    High OLTP
sqt_max_cache_size        1-2MB    1-2MB       2-4MB   8-16MB
dsi_sqt_max_cache_size    512KB    512KB       1MB     2MB
memory_limit              32MB     64MB        128MB   256MB

In which these were defined by:

- Normal: thousands to tens of thousands of transactions per day
- Mid Range: tens to hundreds of thousands of transactions per day
- OLTP: hundreds of thousands to millions of transactions per day
- High OLTP: millions to tens of millions of transactions per day

By transactions, we are referring to DML-based transactions (unfortunately, sp_sysmon reports all of them). Notice that for most OLTP systems, only a 2-4MB sqt_max_cache_size is truly necessary. Higher than this is really only necessary in very high volume systems that have periodic/regular large transactions. The rationale is that normal OLTP transactions cycle through the SQT cache so quickly that the SQT cache will likely not use very much memory. However, to avoid problems caused by rescanning, sizing the SQT cache to contain the periodic large transactions will allow the SQT to avoid the hit.

Even 2-4MB of SQT cache may be a bit excessive. If you think about it, if each source database is replicating to individual destination systems (one to 2 and the other to 3), the outbound queue will contain sorted transactions, provided that no other DIST thread is replicating into the destination. As a result, the SQT cache may not be fully needed by the DSI for transaction sorting, and it can be adjusted down on a connection-by-connection basis via dsi_sqt_max_cache_size. However, if using Parallel DSI, the DSI may need SQT cache to keep up with the parsing requirements of the multiple DSIs. The latter (Parallel DSI) case is best dealt with by adjusting dsi_sqt_max_cache_size separately from sqt_max_cache_size. The tendency to oversize SQT cache has led to some concern within Sybase Replication Server engineering, prompting the following statement:
Prior to RepServer 12.6, typical tuning advice was to increase sqt_max_cache_size so that there are plenty of closed transactions ready to be distributed or sent to the replicate database when RepServer resources handling those efforts became available. Starting with 12.6 that advice no longer applies. Due to SQT behavior modifications associated with the SMP feature, the best advice for correctly sizing SQT (for either the sqt_max_cache_size or the dsi_sqt_max_cache_size configuration) is to set it large enough so that transactions removed from SQT cache never occur or only infrequently, but not much larger than that. Transactions are removed from SQT cache forcing them to be re-read from the queue when needed, whenever SQT cache contains no closed or read transactions (that is, no transactions to be distributed or to be deleted after having been distributed) and cache is full. In these cases, SQT will remove the statements of undistributed transactions from cache in order to make room for more transactions until it is able to cache one that can be distributed or until some distributed transactions can be deleted. You can monitor the removed transaction count by watching counter 24009 - "TransRemoved". Typically, if this counter does not report more than 1 removed transaction in any 5-minute period, transaction removal rate may be considered acceptable. To help determine the proper setting of sqt_max_cache_size and dsi_sqt_max_cache_size, refer to counter 24019 - "SQTCacheLowBnd". This counter captures the minimum SQT cache size at any given moment, below which transactions would have been removed. Monitor this value frequently during a period of typical transaction flow, and configure SQT cache to be no more than about 20% greater than the largest value observed.

Arguably, this was true even prior to RS 12.6, as SQT cache sizing was frequently oversized on many systems. The rationale for the above statement was that in implementing the SMP logic, the SQT processing logic was altered to favor filling the cache over providing cached transactions to clients such as the DIST and DSI threads. As a result, latency was sometimes introduced simply by the SQT thread waiting to fill huge caches allocated by the DBA instead of passing the transactions on.
It became crucial, then, in RS 12.6 and RS 15.0, to right-size the SQT cache rather than oversize it. One way to detect either of these two situations is to watch the system during periods of peak activity via the admin who, sqt command. As taught/mentioned in the manuals, if the Full column is set to 1, it is a possible indication that the SQT cache is undersized, particularly on the inbound processing side. However, the best indication from the admin who, sqt command is the Removed column. If the Removed column is growing and the transactions are not large, then it is probable that the cache was filled to capacity several times and multiple transactions normally not considered large were removed to make room. However, the absolute best (and most accurate) way to determine cache sizing is to use the monitor counters.

SQT Thread Monitor Counters

The following counters are available in RS 12.6 to monitor the SQT thread:

CacheExceeded (a useless counter): Total number of times that the sqt_max_cache_size configuration parameter has been exceeded.
CacheMemUsed: SQT thread memory use. Each command structure allocated by an SQT thread is freed when its transaction context is removed. For this reason, if no transactions are active in SQT, SQT cache usage is zero.
ClosedTransRmTotal: Total transactions removed from the Closed queue.
ClosedTransTotal: Total transactions added to the Closed queue.
CmdsAveTran: Average number of commands in a transaction scanned by an SQT thread.
CmdsLastTran: Total commands in the last transaction completely scanned by an SQT thread.
CmdsMaxTran: Maximum number of commands in a transaction scanned by an SQT thread.
CmdsTotal: Total commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT, XREC_CHECKPT.
EmptyTransRmTotal: Total empty transactions removed from queues.
MemUsedAveTran: Average memory consumed by one transaction.
MemUsedLastTran: Total memory consumed by the last transaction completely scanned by an SQT thread.
MemUsedMaxTran: Maximum memory consumed by one transaction.
OpenTransRmTotal: Total transactions removed from the Open queue.
OpenTransTotal: Total transactions added to the Open queue.
ReadTransRmTotal: Total transactions removed from the Read queue.
ReadTransTotal: Total transactions added to the Read queue.
TransRemoved: Total transactions whose constituent messages have been removed from memory. Removal of transactions is most commonly caused by a single transaction exceeding the available cache.
TruncTransRmTotal: Total transactions removed from the Truncation queue.
TruncTransTotal: Total transactions added to the Truncation queue.

These changed in RS 15.0 to the following set:


CmdsRead: Commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT, XREC_CHECKPT.
OpenTransAdd: Transactions added to the Open queue.
CmdsTran: Commands in transactions completely scanned by an SQT thread.
CacheMemUsed: SQT thread memory use. Each command structure allocated by an SQT thread is freed when its transaction context is removed. For this reason, if no transactions are active in SQT, SQT cache usage is zero.
MemUsedTran: Memory consumed by transactions completely scanned by an SQT thread.
TransRemoved: Transactions whose constituent messages have been removed from memory. Removal of transactions is most commonly caused by a single transaction exceeding the available cache.
TruncTransAdd: Transactions added to the Truncation queue.
ClosedTransAdd: Transactions added to the Closed queue.
ReadTransAdd: Transactions added to the Read queue.
OpenTransRm: Transactions removed from the Open queue.
TruncTransRm: Transactions removed from the Truncation queue.
ClosedTransRm: Transactions removed from the Closed queue.
ReadTransRm: Transactions removed from the Read queue.
EmptyTransRm: Empty transactions removed from queues.
SQTCacheLowBnd: The smallest size to which SQT cache could be configured before transactions start being removed from cache.
SQTWakeupRead: An SQT client awakens the SQT thread that is waiting for a queue read to complete.
SQTReadSQMTime: The time taken by an SQT thread (or the thread running the SQT library functions) to read messages from SQM.
SQTAddCacheTime: The time taken by an SQT thread (or the thread running the SQT library functions) to add messages to SQT cache.
SQTDelCacheTime: The time taken by an SQT thread (or the thread running the SQT library functions) to delete messages from SQT cache.
SQTOpenTrans: Current open transaction count.
SQTClosedTrans: Current closed transaction count.
SQTReadTrans: Current read transaction count.

As mentioned earlier, the average, total and max counters are replaced in RS 15.0 with individual columns in rs_statdetail. However, the three new time-tracking counters above (SQTReadSQMTime, SQTAddCacheTime, and SQTDelCacheTime) can be interesting if there is latency within the SQT. The most important SQT counters are:

- CmdsPerSec = CmdsTotal/seconds
- OpenTransTotal, ClosedTransTotal, ReadTransTotal
- CmdsAveTran, CmdsMaxTran
- CacheMemUsed, MemUsedAveTran
- CachedTrans = CacheMemUsed/MemUsedAveTran
- TransRemoved (vs. CacheExceeded)
- EmptyTransRmTotal

Again, the first one (CmdsPerSec) establishes a rate; hopefully it compares well to the rate from the RA thread. The second set (OpenTransTotal, ClosedTransTotal, ReadTransTotal) refers to the Open, Closed, Read and Truncate transaction lists used by the SQT for sorting. The real goal is to see that all three are nearly identical. If ClosedTransTotal starts to lag behind OpenTransTotal, the most likely culprit is a series of large transactions. However, this is not as common as ReadTransTotal lagging behind Closed. In the latter case, either the DIST is not able to keep up (due to bad STS cache settings or a slow outbound queue), or a large number of large transactions were committed and, in order to pass them to the DIST (which is when a transaction moves from Closed to Read), each whole transaction has to be rescanned from disk. A third alternative is that the SQT cache is too big: since the SQT prioritizes reading over servicing the DIST (and frees space from the SQM dead last), too much SQT cache can be a problem as well. If this happens, increasing sqt_init_read_delay slightly may help (as the SQT will be forced to find something else to do). The way to find the cause is to look at the next set of counters.

The third set (CmdsAveTran, CmdsMaxTran) reports the average number of commands per transaction as well as the maximum. This can be really useful to spot those bcps where someone is not using "-b", as well as to get a picture of the transaction profile at the origin from a sizing perspective (which will be useful for DSI tuning). If CmdsMaxTran is high, it is likely a transaction was removed from cache, and that may be the cause of ReadTransTotal lagging (more on this later).

The next set (CacheMemUsed, MemUsedAveTran) is also very interesting, especially when combined with CachedTrans. From these, we can see how much SQT cache was actually used by this SQT and the average memory per transaction. From the inbound queue's perspective, we likely only care about monitoring CacheMemUsed to see how much memory we are actually using and whether we need to increase it (if TransRemoved > 0). If we do need to increase it, MemUsedAveTran gives us a good multiple to increase by (i.e., to add cache for another 100 transactions, simply multiply MemUsedAveTran by 100). However, these counters are most useful for DSI tuning. For example, we cannot group transactions if they are not in cache; so if we are using 5 parallel DSIs and have dsi_max_xacts_in_group at 20, we will need enough cache for at least 100 transactions and likely double that number (so if CachedTrans is < 100, we will need to increase dsi_sqt_max_cache_size). Most often, you will find that sites have left dsi_sqt_max_cache_size at 0, which means it inherits the value of sqt_max_cache_size, which is likely oversized; the DSI/SQT module then spends its time stuffing the cache instead of giving the DSI-EXECs transactions to work on.

Of all of these, the TransRemoved counter is likely the most critical (and why it is highlighted). If TransRemoved is 0, adding SQT cache is a useless exercise and may actually contribute to the problem. Additionally, if TransRemoved is occasionally > 0 but CmdsMaxTran is 1,000,000, you likely don't have enough memory to cache it anyhow. However, if you frequently see TransRemoved > 0, you may want to add more SQT cache by increasing sqt_max_cache_size.
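Assuming counter flushing to the RSSD has been enabled, a quick way to check these values is to query the RSSD directly. This is only a sketch: the exact column layout of rs_statdetail varies by RS version, and counter ids 24009 (TransRemoved) and 24019 (SQTCacheLowBnd) are taken from the engineering note quoted earlier.

-- look up the counter ids by name
select counter_id, counter_name
from rs_statcounters
where counter_name in ('TransRemoved', 'SQTCacheLowBnd')
go
-- then pull the flushed samples for those counters
select * from rs_statdetail
where counter_id in (24009, 24019)
go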
The key here is that just about any non-zero value occurring often is a problem, so thinking that a low value (like a steady value < 10) means it is not a problem is just plain wrong. Additionally, sqt_max_cache_size is a server setting that applies to all connections, so before decreasing it you may want to check all your connections, and do not decrease it if any show TransRemoved > 0 that is not attributable to the once-nightly batch job or some other obscenely large transaction.

Notice that we focused on TransRemoved. CacheExceeded is a bit like the admin who, sqt Full column: it merely indicates that the cache was full at some point (which the SQT is busy trying to achieve). However, as transactions are read and then truncated from the SQT cache, this value changes rapidly as the newly available space is quickly refilled by the SQT. If using admin who, sqt, the Full column likely blinks between 0 and 1 so fast that, like a light bulb flickering at 60Hz, it looks constantly on. This metric is so useless that the counter was removed in RS 15.0.

The last counter (EmptyTransRmTotal) is a good indicator of bad application design. If you see a lot of empty transactions, it is either because everything is being done in isolation level 3 or in chained mode. The latter can be especially unplanned with Java applications, as the default behavior is to execute all procedures in chained mode. Even if no rows were modified in the proc (selects only), since a commit was registered, the empty transaction is flushed to the transaction log (think of the performance implications there and the log semaphore) and then replicated. Another common source of this prior to ASE 12.5.2 was system commands such as reorgs, which use a plethora of small empty transactions to track progress. So if the RA and/or SQMR is lagging and you have a high number of EmptyTransRmTotal, it is time to either upgrade to ASE 12.5.3+ or hunt your developers down to see if they are running everything in chained mode or isolation level 3 for some reason.

SQT Thread Counter Usage

After the fairly lengthy discussion of how the SQT works, we don't need a lot of detail here, as the Open, Closed, Read and Trunc prefixes make the counters fairly intuitive. Instead, let's skip to looking at the customer data:


Sample   CmdsWritten  SQT        Closed      Open        Read        CmdsMax  CacheMem   Cache     Trans
Time     (SQM)        CmdsTotal  TransTotal  TransTotal  TransTotal  Tran     Used       Exceeded  Removed
-------  -----------  ---------  ----------  ----------  ----------  -------  ---------  --------  -------
0:29:33  268,187      268,502    21,031      21,031      21,131      215,196  324,608    733       6
0:34:34  364,705      336,528    29,787      29,790      29,661      215,196  1,430,016  4,438     2
0:39:37  253,283      280,586    65,767      65,766      65,941      9,462    632,320    10,892    7
0:44:38  266,334      266,528    65,192      65,193      65,257      9,462    857,600    10,382    3
0:49:40  253,684      253,246    59,014      59,014      59,035      3,448    1,379,840  13,297    7
0:54:43  164,566      165,535    38,943      38,944      38,933      3,442    1,498,880  10,222    3
0:59:45  376,184      347,213    81,818      81,817      81,678      1,723    2,091,776  21,159    10
1:04:47  450,809      432,871    83,471      83,469      83,465      72,313   1,944,832  27,029    5
1:09:50  326,750      374,994    84,597      84,597      84,806      3,442    2,103,040  24,705    15
1:14:52  325,340      327,038    73,442      73,443      73,213      1,723    1,967,104  15,644    17
1:19:54  317,674      318,111    76,525      76,525      76,441      93       1,750,528  5,240     0

Now then, this customer had sqt_max_cache_size set at 2,097,152 bytes (2MB) and dsi_sqt_max_cache_size at 0. Also, monitoring had been ongoing for more than 10 hours when this slice of the sampling was taken, and the system was busy the entire time; as a result, this represents a steady state of the server. With this in mind, let's take a look at these metrics.

SQT CmdsTotal vs. SQM CmdsWritten - This represents the lag of the SQT in reading from the inbound queue as commands occur. We said earlier that often the best starting point is to compare the Cmds counters in each counter module through the RS. In this case, the SQT is keeping up, reading the commands almost as soon as they arrive (when the SQM writes them). However, it got behind in the 1:00am time frame when the cache filled, but then caught back up quickly. Any latency in the system is not due to the SQT; however, that does not mean that it is tuned properly.

CmdsMaxTran - This is a very interesting statistic, as it indicates the largest transaction processed during that sample period. While it might be tempting to use CmdsAveTran, the problem is that a lot of small transactions could mask when a large transaction hit. The most useful aspect of this metric is using it in conjunction with TransRemoved to determine whether raising sqt_max_cache_size would be of benefit. Note especially the extremely large transaction at the beginning, the fairly consistently large transactions throughout, and the small transaction at the end.

OpenTransTotal, ClosedTransTotal, ReadTransTotal - It should be fairly obvious what these refer to: the Open, Closed and Read transaction lists. The goal is that these should be nearly identical during the sample period, meaning that transactions are added to the SQT cache, the commit is found, and each is passed to the DIST thread quickly enough that no discernible lag is evident. The problem is that the SQT gives priority to filling the cache over servicing the DIST; as a result, it is not atypical for ReadTransTotal to lag behind ClosedTransTotal until sqt_max_cache_size is reached. At that point, ReadTransTotal will start mimicking ClosedTransTotal. The reason is that the SQT can't put any more transactions into the cache until it removes one; once the cache is full, a new transaction can't be read from the inbound queue until one is read by the DIST. This isn't obvious in the above statistics, as the stats were from RS 12.1 rather than 12.6, when the change in SQT processing was influenced by the SMP implementation.

CacheMemUsed - This is a very interesting counter. Not only does it help in sizing sqt_max_cache_size by showing the high-water mark during each sample interval, it also shows the dynamic allocation and deallocation of memory within each SQT cache. In this case, we have 2MB configured, but at the beginning we are only using about 300KB. This grows to 1.4MB and then drops back down to 600KB before growing successively until the maximum is reached.


TransRemoved - This is one of the more important counters. Looking at the above, we note that nearly every sample interval has transactions removed, clearly indicating the SQT cache is undersized. However, if transactions were only removed during the first several sample intervals, this might not be true. If you think about it, for a 200,000 row transaction averaging a 1KB command size (SQM counter CmdSizeAverage), you would need 200MB of SQT cache to contain it. This is impractical, as the next large transaction (likely a bcp, as it was in this case) may have 500,000 rows. Consequently, you don't tune sqt_max_cache_size to fully cache extraordinarily large transactions that occur periodically. However, in the above case we see fairly constant transaction sizes in the 3,000-9,000 row range (suggesting a 4-10MB cache). Additionally, the cache is completely full twice around 1:00am, when the number of transactions peaks at ~80,000. Consequently, this system would benefit from increasing sqt_max_cache_size to 16MB (16,777,216). This value is actually high, but it is based on providing padding over the largest transaction that we really want to cache (the 9,000 command transactions, assuming a 1,500 byte command size). While an 8MB SQT cache may be usable, increasing it to 32MB is likely to have no benefit over 16MB. However, if we do raise this, we should make sure that dsi_sqt_max_cache_size is explicitly set to 1-2MB. Without doing this, we allocate 16MB of cache for the DSI thread, which really doesn't need it. As a result, the DSI Scheduler will spend its time filling the DSI SQT cache instead of yielding to the DSI EXEC threads to process the SQL statements. It has been shown that oversizing the SQT cache can lead to performance degradation as a result.

Distributor (DIST) Processing

Earlier we showed the inbound process flow from the inbound queue to the outbound queue using the following diagram:

Figure 30 Data Flow Through Inbound Queue and SQT to DIST and Outbound Queue
This time, we will be focusing on the Distributor (DIST) thread. Of all the processes in the Replication Server, the DIST thread is probably the most CPU intensive. The reason for this is that the DIST thread is the brains behind the Rep Server, determining where the replicated data needs to go. In order to determine the routing of the messages, the DIST thread calls three library routines - the SRE, TD and MD, as depicted above. These library routines are discussed below.

Subscription Resolution Engine (SRE)

The Subscription Resolution Engine (SRE) is responsible for determining whether there are any subscribers to each operation. Overall, the SRE performs the following functions:

- Unpacks each operation in the transaction
- Checks for subscriptions to each operation
- Checks for subscriptions to publications containing articles based on the repdef for each operation
- Performs subscription migration where necessary

For the most part, the SRE simply has to do a row-by-row comparison for each row in the transaction. A point to consider is that the begin/commit pairs in the transaction were effectively removed by the SQT thread, and the transaction information (transaction name, user, commit time, etc.) is all part of the transaction control block in the SQT cache. This is important as the TD module will make use of this information; but for now, the SRE simply has to check for subscriptions on the individual operations. The reason the SRE looks at the individual operations is that not all tables may be subscribed to by all the sites; consequently, a transaction that affects multiple tables would still need to have the respective operations forwarded accordingly.

Subscription Conditions

To maintain performance, the SRE is a very lean/efficient set of library calls that only supports the following types of conditionals:

- Equality - for example, col_name = constant. A special type of equality is permitted using rs_address columns: bit-wise comparison with the logical AND (&) operator.
- Range (unbounded and bounded) - for example, col_name < constant, or col_name > low_value and col_name < high_value
- Boolean AND conditionals

Note that several (sometimes disturbing to those new to Replication Server) forms of conditionals are not supported:

- Functions, formulas or operators (other than & with rs_address columns).
- Boolean OR, NOT, and XOR conditionals. Boolean OR conditionals are easily accomplished by simply creating two subscriptions, one for each side of the OR clause.
- Not equals (!=, <>) comparators. However, this is easily bypassed by treating the situation like a non-inclusive range. For example, (col_name != 'New York') becomes (col_name < 'New York' OR col_name > 'New York'), which is handled simply by using two subscriptions, as sketched below. For "not null" comparisons, a single subscription based on col_name > '' (note the empty string and use of single quotation marks) is sufficient. Incidentally, this trick is borrowed from the SQL optimization trick of switching column != null to column > char(0), the ANSI C equivalent of NUL.
- Columns contained in the primary key cannot have rs_address datatypes.
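As a sketch of the two-subscription workaround for the not-equals case (titles_rep, SYDNEY_DS.pubs2 and the city column are illustrative names only), a row is excluded only when city = 'New York':

-- rows below 'New York'
create subscription titles_sub_low for titles_rep
with replicate at SYDNEY_DS.pubs2
where city < 'New York'
go
-- rows above 'New York'
create subscription titles_sub_high for titles_rep
with replicate at SYDNEY_DS.pubs2
where city > 'New York'
go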

It should also be pointed out that the SRE does not check whether a site subscribes more than once. For example, a given replication definition could specify that last name, city, and state are subscribable columns. If a destination wants to subscribe to all authors in Dublin, CA or with a last name of Smith, care needs to be taken to avoid a duplicate row situation. Simply creating two subscriptions - one specifying last_name='Smith' and the other specifying city='Dublin' and state='CA' - will result in an overlapping subscription and cause the destination to receive duplicate rows.

It should be noted that the next discussion, while focusing on rs_address columns, has a secondary purpose in illustrating how subscription rules can impact implementation choices. The biggest restriction is that for any subscription, each searchable column can only participate in a single conditional (a range condition constructed by two where clauses is considered a single conditional). A good example of how this impacts replication can be seen in the treatment of rs_address columns. Many Replication System Administrators complain that the rs_address column isn't as useful as it could be, for several reasons:

- It only supports 32 bits, restricting them to 32 sites in the organization.
- If it is the only column changed, the update is not replicated - problematic for standby databases using repdefs and subscriptions rather than the Warm Standby feature.
- The bit-wise AND operation for the subscription behaves as col_name & value > 0 rather than col_name & value = value. This causes a problem described later in this section.

As a result, as their business grows, they have to add more rs_address columns, causing considerable logic to be programmed into the application or database triggers to support replication. While one rs_address column is easily managed, they are reluctant to add more. This is a valid complaint if you think of the bits one-dimensionally as sites. Of course, using the rs_address column as an integer and subscribing with a normal equality (for example, subscribing where site_id = 123 vs. subscribing where site_id & 64) extends this nearly infinitely; however, if the data modification is projected for multiple sites, this could require multiple updates to the same rows and raise subscription migration issues. An alternative solution (but one that doesn't work, as we will see) might be to think of the bits in the rs_address columns as components, similar to class B and class C Internet addresses. High order bits could be associated with
countries or regions, while the low order bits map to specific sites within those regions. Consider the following examples of bit-wise addressing:

Bit Addressing     Total Sites  Comments
4 + 28             112          Could be 4 World Regions, each with 28 locations
8 + 24             192          World Region - Location
16 + 16            256          Country/Region - Location
4 + 4 + 24         384          World Region - Country - Location
4 + 8 + 16         512          Hemisphere - Country/Region - Location
4 + 4 + 4 + 20     1,280        Hemisphere - Country - Region - Location
4 + 4 + 8 + 16     2,048        Hemisphere - Country - Region - Location
4 + 4 + 8 + 8 + 8  8,192        Hemisphere - Country - Region - District - Office

While this does expand the number of conditions that must be checked, it logically fits with the distribution rules the application may be trying to implement and is therefore mentally easier to implement. Additionally, in the above we treated each as a separate individual location; if the last bit group represented a region or cell, the number of sites addressable with each scheme extends another order of magnitude. However, it should be noted that this scheme (if it worked) would only work in cases where data is intended solely to be distributed to a single Region or District (next to last division) or a single location. Otherwise, the same subscription migration issue occurs that plagues a single integer-based scheme: updates setting the value first to one value and then to another, in an attempt to send to more than one location, migrate the data from one location to the other instead of sending it to both.

As mentioned earlier, using multiple rs_address columns or dimensioning the rs_address column will result in more conditionals for the SRE to process. For multiple columns, the reason should be obvious: a separate condition would be necessary for each column. However, the same is true for rs_address columns that have been dimensioned: a separate condition would be necessary for each dimension at a minimum. This is simply due to the fact that the original intent of the rs_address column was a single dimension of bits. Consequently, when a condition such as (column & 64) returns a non-zero number, the row is replicated. Combining several bits, as in (column & 71), could have some unexpected results. Since 71 is 64+4+2+1 (bits 6, 2, 1, and 0), you might think that this would achieve the goal. However, the way rs_address columns are treated, any row in which bits 6, 2, 1 or 0 are on would get replicated to that site - effectively a bitwise OR. This includes rs_address values of 3, 129, etc. Since we are allowed to AND conditions together, you might think the way to ensure that exactly the desired value is met is to use multiple conditions, as in:
-- my_rsaddr_col is an rs_address column
create subscription titles_sub for titles_rep
with replicate at SYDNEY_DS.pubs2
where my_rsaddr_col & 64
  and my_rsaddr_col & 4
  and my_rsaddr_col & 2
  and my_rsaddr_col & 1

BUT, we can't do that! Unlike other columns (in a sense), rs_address columns may only appear once in the where clause of a subscription. It results in:
Msg 32027, Level 12, State 0: Server 'SYBASE_RS': Duplicate column named 'my_rsaddr_col' was detected.

The reason is that for any single subscription, a single column can only participate in a single rule (rs_rules table has a unique index on subscription and column number). Consequently, although other columns can appear more than once in a where clause, the union of the conditions must produce a single valid range (single pair of low & high values). For example:
-- Good subscription
create subscription titles_sub for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col > 32
  and int_col < 64

-- Good subscription (effectively != 32)
create subscription titles_sub for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col < 32
  and int_col > 32

-- Bad range subscription - should be an OR (two subscriptions)
create subscription titles_sub for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col < 32
  and int_col > 63

-- Bad range subscription - should be an OR (two subscriptions)
create subscription titles_sub for titles_rep
with replicate at SYDNEY_DS.pubs2
where int_col = 30
  and int_col = 31

Among other things, you can see that this restriction prevents Replication Server from supporting Boolean OR conditionals and forces designers to implement multiple rs_address columns. Even if attempting to use a second rs_address column as the Region/District dimension, as depicted above in the two-dimensional break-out, you could incur problems. There is a work-around for the OR problem, of course: use articles/publications overlaying replication definitions/subscriptions. Introduced in RS 11.5, articles allow Boolean ORs as well as referring to the same column multiple times in the same where clause. However, the references to the same column must use an OR clause, as within the RSSD an AND clause behaves the same as a normal subscription, while an OR clause constructs multiple where-clause conditions. Consider the following:
create publication rollup_pub
with primary at HQ.db
go

-- illegal article definition
create article titles_art for rollup_pub
with primary at HQ.db
with replication definition titles_rep
where my_rsaddr_col & 64
  and my_rsaddr_col & 8
go

-- legal article definition
create article titles_art for rollup_pub
with primary at HQ.db
with replication definition titles_rep
where my_rsaddr_col & 64
or where my_rsaddr_col & 8
go

It seems frustrating that there appears to be no way to bypass the 32 site limit with a single rs_address column. While a theoretical 1,024 sites could be addressed if each dimension supported an even 32 locations, remember that only a single Region/District or location could be the intended target. Additionally, if you think about it for a second, the most common method for updating rs_address columns to set the desired bits is in a trigger. Consequently, the original row modification plus the modification in which the bits are set are both processed by Replication Server. As a result, a single replication would require two updates to the same row: the first being the regular update and the second setting the appropriate bits for distribution. Additional destinations would require additional updates. This leads to n+1 DML operations at the primary for every intended operation - not a good choice, then, if performance is a consideration. Additionally, if a WS system is involved, it ignores updates in which the only changes were to rs_address columns; consequently, after a failover you may not have an accurate reflection of the last site updates were distributed to.

SRE Performance

Performance of the SRE depends on a number of issues that should be fairly obvious:

- Number of replication definitions per table
- Number of subscriptions per replication definition
- Number of conditions per subscription

In order to reduce the number of physical RSSD lookups to retrieve replication definitions, subscriptions and where clauses, the SRE makes use of the System Table Services (STS) cache. Configurable through the Replication Server configuration parameter sts_cachesize, the STS caches rows from each RSSD table/key combination in a separate hash table. The sts_cachesize parameter refers to the number of rows cached for each RSSD table. For most systems, the default sts_cachesize configuration of 100 is far too low; it would restrict the system to retaining only the most current 100 rows of subscription where clauses, etc. A better starting point is to set sts_cachesize to the greater of the number of columns in repdefs managed by the current Rep Server or the number of subscriptions on those repdefs. One way to determine how effective the STS cache is involves turning on the cache statistics trace flag:
trace on, STS, STS_CACHESTATS - Collects STS Statistics
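As a rough illustration of the sizing rule above (a sketch only; these counts are crude proxies, and the value shown is a placeholder rather than a recommendation), the repdef column and subscription counts can be pulled from the RSSD and sts_cachesize then set at the server level:

-- run against the RSSD to get rough row counts for the sizing rule
select count(*) from rs_columns
select count(*) from rs_subscriptions
go
-- then, from isql connected to the Replication Server
configure replication server set sts_cachesize to '2000'
go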

The STS_CACHESTATS trace flag works prior to RS 12.1; with RS 12.1 and later, you can simply use the provided monitor counters. As you can imagine, the largest impact you can have is by increasing sts_cachesize to reduce the physical lookups.

Key Concept #12: The single largest tuning parameter to improve Distributor thread performance is increasing the sts_cachesize parameter in order to reduce physical RSSD lookups.

The biggest bottleneck of the SRE will actually be getting the transactions from the SQT fast enough. Consequently, the sqt_max_cache_size setting is crucial to overall inbound processing. For example, at one customer, a sqt_max_cache_size of 4MB was resulting in considerable latency in processing large transactions being distributed to two different reporting system destinations. Setting sqt_max_cache_size to 16MB resulted in the inbound queue draining at over 100MB/min. This speed is even more notable when considering that the DIST thread had to write each transaction from the inbound queue to two different outbound queues.

Transaction Delivery

The Transaction Delivery (TD) library is used to determine how the transactions will be delivered to the destinations. The best way to think of this is that while the SRE decides who gets which individual modifications, the TD is responsible for packaging these modifications into a transaction and requesting the writes to the outbound queue. For example, consider the following transaction:
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, '123 Main St', 'Anytown', 'NY', 21100)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '31245Q', 'Chamois Shirt', $25.00, 2, 0, $50.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '987652W', 'Leather Jacket', $250.00, 1, 0, $250.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '54783L', 'Welcome Mat', $12.00, 1, 0, $12.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '732189H', 'Bed Spread Set', $129.00, 1, 0, $129.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '30345S', 'Volley Ball Set', $79.00, 1, 0, $79.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '889213T', '6 Man Tent', $494.00, 1, $49.40, $444.60)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

Now, picture what happens in a normal replication environment if the source system was replicating to three destinations each concerned with its own set of rules. For example, Replicate Database 1 (RDB1) might be concerned with clothing transactions (shipping warehouse for clothing), while RDB2 with transactions for household goods, and RDB3 focusing on sporting items. This would result in the following replicate database transactions:
-- replicate database 1 (clothing items)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, '123 Main St', 'Anytown', 'NY', 21100)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '31245Q', 'Chamois Shirt', $25.00, 2, 0, $50.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '987652W', 'Leather Jacket', $250.00, 1, 0, $250.00)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

-- replicate database 2 (household goods)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, '123 Main St', 'Anytown', 'NY', 21100)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '54783L', 'Welcome Mat', $12.00, 1, 0, $12.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '732189H', 'Bed Spread Set', $129.00, 1, 0, $129.00)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

-- replicate database 3 (sporting goods)
begin transaction web_order
insert into orders (customer, order_num, ship_addr, ship_city, ship_state, ship_zip)
    values (1122334, 123456789, '123 Main St', 'Anytown', 'NY', 21100)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '30345S', 'Volley Ball Set', $79.00, 1, 0, $79.00)
insert into order_items (order_num, item_num, desc, price, qty, discount, total)
    values (123456789, '889213T', '6 Man Tent', $494.00, 1, $49.40, $444.60)
update orders set order_subtotal=$964.60, order_shipcost=$20, order_total=$984.60
commit transaction

The SRE physically determines which DML rows go to which of the replicates; however, it is the TD that remembers that each is within the scope of the outer transaction web_order and requests the rows to be written to each of the outbound queues. It accomplishes this through the following steps:

- Looks up the correct queue for each of the destination databases - it is passed a bitmap of the destination databases from the DIST thread (based on the SRE).
- Writes a begin record for each transaction to the destination queue (using the commit OQID).
- For each operation received, adds two bytes to the commit OQID and replaces the operation's OQID with the new OQID based off of the commit record.
- Packs the command into packed ASCII format and writes the command to each of the destination queues (via the MD module).
- Writes a commit record to each of the queues once the entire list of operations has been processed.

Earlier, in the discussion of the makeup of the OQID, we noted that the TD module adds two bytes for uniqueness. A frequent question is "Why?" The answer lies in the simple fact that transactions can overlap begin/commit times, and since the original OQIDs are generated in order, sending them through as-is would de-sort all the work done by the SQT thread. Consider the following points:

- When the Rep Agent forwards commands to the Replication Server, it generates unique 32 byte monotonically increasing OQIDs.
- The job of the SQT thread is to pass transactions to the DIST thread in COMMIT order; therefore the commands the DIST forwards to the TD module may not have increasing OQIDs.
- The SQM thread relies on the increasing OQIDs to perform its duplicate detection.
- In order to prevent the outbound SQM from rejecting the commands, the TD library appends a 2 byte counter to the COMMIT record's OQID for all the commands distributed by TD. Only the DIST thread calls TD.
    - Why the commit record? Because if your transaction began before someone else's that committed before yours, your begin tran (and other rows) would have lower OQIDs and would really mess things up.
    - So we use the commit record's OQID and add 0001-ffff to each row in the transaction.
- The counter is reset when a NEW begin record is passed to TD.

Consequently, as each transaction is processed, the TD uses the commit record's OQID and simply adds a sequential number in the last two bytes. Consider the following scenario, in which transaction T1 begins prior to transaction T2, yet commits after it:
OQID          Operation
0x04010000    begin t1
0x04020000    insert t1
0x04030000    begin t2
0x04040000    delete t2
0x04050000    insert t1
0x04060000    update t2
0x04070000    insert t2
0x04080000    insert t1
0x04090000    commit t2
0x040A0000    insert t1
0x040B0000    commit t1

The TD would receive T2 first and then T1 and would renumber the OQIDs as follows:


OQID          Operation
0x04090001    begin t2
0x04090002    delete t2
0x04090003    update t2
0x04090004    insert t2
0x04090005    commit t2
0x040B0001    begin t1
0x040B0002    insert t1
0x040B0003    insert t1
0x040B0004    insert t1
0x040B0005    insert t1
0x040B0006    commit t1

As a result, the destination queues now have transactions in commit order with increasing OQIDs to facilitate recovery. This should also explain why some people have a difficult time matching a transaction in the outbound queue with one in the inbound queue when attempting to ensure that it is indeed there. You need to first find the commit record for that transaction in the inbound queue - a feat that is not made simple, in that it is not always identified which transaction the commit record belongs to. As a result, it is almost always easier to search by values in each record (i.e., the primary key values).

Message Delivery

The Message Delivery (MD) module is called by the DIST thread to optimize routing of transactions to data servers or other Replication Servers. The DIST thread passes the transaction row and the destination ID to the MD module. Using this information and routing information in the RSSD, the module determines where to send the transaction:

- If the current Replication Server manages the destination connection, the message is written to the outbound queue via the SQM for the outbound connection.
- If the destination is managed by another Replication Server (via an entry in rs_repdbs), the MD module checks to see if it is already sending the exact same message to another database via the same route. If so, the new destination is simply appended to the existing message. If not, the message is written to the outbound queue via the SQM for the RSI connection to the replicate Replication Server.

MD & Routing

This last point is crucial to understanding a major performance benefit of routing data. Consider the following architecture:

Figure 31 Example World Wide Topology


In the above diagram, if a transaction needs to be replicated to all of the European sites, the NY system only needs to send a single message, with all of the European destinations in the header, to the London system. Further, due to the multi-tiered aspects of the Pacific arena above, NY would only have to send a single message to cover Chicago, Dallas, Mexico City, San Francisco, Tokyo, Taiwan, Hong Kong, Peking, Sydney (Australia), and New Delhi. In the past, this has often been touted as a means to save expensive trans-oceanic bandwidth. While this may be true, from a technical perspective the biggest savings is in the workload required of any one node, allowing unparalleled scalability. In addition, this performance advantage gained by distributing the outbound workload may make it feasible to implement replication routing even to Replication Servers that reside on the same host. Take, for example, the following scheme.


[Figure 32 diagram - databases shown: POS, Pay Roll, Accounting, Marketing, CRM, Billing, Supply, Purchasing, DW, Shipping]

Figure 32 Example Retail Sales Data Distribution


In this scenario, if the Replication System begins to lag, the POS system may be impacted due to the effect the Replication Server could have on the primary transaction log if the Replication System's stable devices are full. While none of the systems are very remote from the POS system, in this case it may make sense to implement an MP Rep Server configuration by using multiple Replication Servers to balance the load.


Figure 33 Retail Sales Data Distribution using Multiple Replication Servers


Note that in the above example solution, the RS that manages the POS connection does not manage any other database connections. Consequently, that RS can concentrate strictly on inbound queue processing and subscription resolution. The other three can concentrate strictly on applying the transactions at the replicates. With a 6-way SMP box, all four Replication Servers, along with a single ASE implementation for the RSSD databases, could start making more effective use of the larger server systems they may be installed on.

Key Concept #13: While replication routes offer network bandwidth efficiency, they offer a tremendous performance benefit to Replication Server by reducing the workload on the primary Replication Server. This can be used to effectively create an MP Replication scenario for load balancing in local topologies. An additional performance advantage in environments with inconsistent network connectivity is that network problems occurring while the Replication Server is applying transactions at the replicate can degrade performance due to frequent rollbacks/retries caused by lost connections.
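For reference, a route such as those in the topologies above is created with RCL along the following lines. This is a minimal sketch: RS_REPORTS and the RSI credentials are placeholders, and the available options vary by RS version.

create route to RS_REPORTS
set username RS_REPORTS_rsi
set password RS_REPORTS_rsi_ps
go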

MD Tuning

Other than sts_cachesize and replication routing, the other performance tuning parameter that directly affects the Distributor thread is md_sqm_write_request_limit (known as md_source_memory_pool prior to RS 12.1). This is a memory pool specifically for the MD to cache its writes to the SQM for the outbound queues. With previous versions of RS (i.e., 11.x and 12.0), this parameter was frequently missed, as the only way to set it was through the rs_configure stored procedure in the RSSD database. Fortunately, with RS 12.1+, md_sqm_write_request_limit can be set through the standard alter connection command.

While md_sqm_write_request_limit is a connection-scope tuning parameter, it is often misunderstood, as it is applied not to destination connections but rather to the source connection. The reason for this is that we are still discussing the Distributor thread, which is part of the inbound side of Replication Server internal processing. By adjusting md_sqm_write_request_limit/md_source_memory_pool, you allow the source connection's Distributor thread to cache its writes when the outbound SQM is busy and enable more efficient outbound queue space utilization. This is especially useful when a source system is replicating to multiple destinations without routing, when a replicate database has more than one source database (i.e., a corporate rollup), or for the remote Replication Server when multiple destinations exist for the same source system. The problem is that it is a single pool, and each block (if you will) is destined for a single connection. Consequently, even with 60 blocks available for caching, if replicating to 5 different destinations, only 12 blocks of cache will be available for each destination's SQM (assuming each is experiencing the same performance traits).

Note that, similar to exec_sqm_write_request_limit, in RS 12.6 ESD #7 and RS 15.0 ESD #1 the limit for md_sqm_write_request_limit was raised from 983040 (60 blocks) to 2GB (the recommendation is 24MB). Prior to RS 12.1, the only visibility into this memory was via the admin statistics, md command, as illustrated below:
admin statistics, md

Source      Pending_Messages  Memory_Currently_Used  Messages_Delivered  SQM_Writes  Destinations_Delivered_To  Max_Memory_Hit  Is_RSI_Source?
----------  ----------------  ---------------------  ------------------  ----------  -------------------------  --------------  --------------
SYDNEY_DS   0                 0                      34                  34          34                         0               0
TOKYO_DS    0                 0                      551                 551         551                        0               0
TOKYO_DS    0                 0                      1452                1452        1452                       0               0

Each of these values is described below:

Source: The Replication Server or data server where the message originated.
Pending_Messages: The number of messages sent to the SQM without acknowledgment. Usually, this occurs because Replication Server is processing the messages before writing them to disk.
Memory_Currently_Used: Memory used by pending messages.
Messages_Delivered: Number of messages delivered.
SQM_Writes: Number of messages received and processed.
Destinations_Delivered_To: Total number of destinations.
Max_Memory_Hit: Not yet implemented.
Is_RSI_Source?: Indicates whether the current Replication Server can send messages: 0 - this Replication Server cannot send messages; 1 - this Replication Server can send messages.

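Tying the MD tuning discussion together, a sketch of how the setting is applied (PRIMARY_DS.pubs2 is a placeholder for the source connection; the value shown is simply the pre-12.6 ESD #7 maximum of 60 blocks, not a recommendation):

alter connection to PRIMARY_DS.pubs2 set md_sqm_write_request_limit to '983040'
go
suspend connection to PRIMARY_DS.pubs2
go
resume connection to PRIMARY_DS.pubs2
go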
Beyond tuning md_sqm_write_request_limit and sts_cachesize, not much tuning is needed. Frequently, customers have noted that when the inbound queue experiences a backlog, once the SQT cache is resized the inbound queue
drains quite dramatically, at a rate exceeding 8GB/hr. This is a testament to the performance and efficiency of the DIST thread.

DIST Performance and Tuning

Within each of the Distributor module discussions above, we covered tuning issues specific to that module. Overall, to monitor the performance or throughput of the Distributor thread, you can use the admin who, dist command:
admin who, dist

Spid  State   Info                       PrimarySite  Type  Status  PendingCmds  SqtBlocked
----  ------  -------------------------  -----------  ----  ------  -----------  ----------
21    Active  102 SYDNEY_DS.SYDNEY_RSSD  102          P     Normal  0            1
22    Active  106 SYDNEY_DS.pubs2        106          P     Normal  0            1

Duplicates  TransProcessed  CmdsProcessed  MaintUserCmds  NoRepdefCmds  CmdsIgnored  CmdMarkers
----------  --------------  -------------  -------------  ------------  -----------  ----------
0           715             1430           0              0             0            0
290         1               293            0              0             0            1

The meaning of each of the columns is described below:

PrimarySite: The ID of the primary database for the SQT thread.
Type: Whether the thread is a physical or logical connection.
Status: The thread has a status of "normal" or "ignoring." You should only see "ignoring" during initial startup of the Replication Server.
PendingCmds: The number of commands that are pending for the thread. If the number of pending commands is high, the DIST could be a bottleneck, as it is not reading commands from the SQT in a timely manner. The likely culprit is either that the STS cache is not large enough and repeated accesses to the RSSD are slowing processing, or that the outbound queue is slow, delaying message writes.
SqtBlocked: Whether or not the thread is waiting for the SQT. This is the opposite of the above (PendingCmds) and essentially certifies that the DIST is not a cause of performance problems.
Duplicates: The number of duplicate commands the thread has seen and dropped. This should stop climbing once the Replication Server has fully recovered and the Status (above) has changed from "ignoring" to "normal".
TransProcessed: The number of transactions that have been processed by the thread.
CmdsProcessed: The number of commands that have been processed by the thread.
MaintUserCmds: The number of commands belonging to the maintenance user. This should be 0 unless the Rep Agent was started with the send_maint_xacts_to_replicate option.

136

Final v2.0.1

Column NoRepdefCmds

Meaning The number of commands dropped because no corresponding replication definitions were defined or in RS 12.6 and higher, it could include commands replicated using database repdefs (MSA) for which no table level repdef exists. In either case, this is an indication that a table/procedure is marked for replication but lacks a replication definition (as table level repdefs should be created even for MSA implementations). If a procedure, this can be a key insight into why there may be database inconsistencies between a primary and replicate system. The number of commands dropped before the status became "normal." The number of special markers (rs_marker) that have been processed. Normally only noticed during replication system implementation such as adding a subscription or a new database.

CmdsIgnored CmdMarkers

As noted from the above command output, the DIST thread is responsible for matching LTL log rows against existing replication definitions to determine which columns should be ignored, etc. If the replication definition does not exist, it discards the log row at this stage. This is also where request functions are identified. How this is detected is described in more detail later; however, if you remember from classes you have taken (or from reading the manual), request functions have a replication definition specifying the real primary database, which would not be the current connection processing the logged procedure execution. In any case, a large number of occurrences of NoRepdefCmds can mean one of several things:

- A database replication definition was created (possibly for an MSA implementation) for a specific source system, but individual table-level replication definitions were not created (a performance issue).

- A replication definition was mistakenly dropped or never created. In either case, this means that the databases are probably suspect as they are definitely out of synch.

- Tables or procedures were needlessly marked for replication. If this is the case, then a good, cheap performance improvement is to simply unmark the tables or procedures for replication. This will reduce Rep Agent processing, SQM disk i/o, and SQT and DIST CPU time.
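If the third case applies, the cleanup happens at the primary ASE rather than in the Replication Server. The following is a minimal sketch, assuming a needlessly marked table named titles and a batch run by a login holding replication_role; substitute your own object names:

    -- stop the Rep Agent from forwarding this table at all
    sp_setreptable 'titles', false
    go

    -- or, leave the table marked but suppress replication for one batch session only
    set replication off
    go
    -- ... run the batch DML that should not be replicated ...
    set replication on
    go

Either approach removes the rows from the Rep Agent/SQM/SQT/DIST workstream entirely, which is far cheaper than letting the DIST discard them as NoRepdefCmds.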

DIST Thread Monitor Counters

The Distributor thread counters added in RS 12.1 are listed below:

Counter          Explanation
CmdsDump         Total dump database commands read from an inbound queue by a DIST thread.
CmdsIgnored      Total commands ignored by a DIST thread.
CmdsMaintUser    Total commands executed by the maintenance user encountered by a DIST thread.
CmdsMarker       Total rs_markers placed in an inbound queue. rs_markers are enable replication,
                 activate, validate, and dump markers.
CmdsNoRepdef     Total commands encountered by a DIST thread for which no replication definition
                 exists.
CmdsTotal        Total commands read from an inbound queue by a DIST thread.
Duplicates       Total commands rejected as duplicates by a DIST thread.
RSTicket         Total rs_ticket markers processed by a DIST thread.
SREcreate        Total SRE creation requests performed by a DIST thread. This counter is
                 incremented for each new SUB.
SREdestroy       Total SRE destroy requests performed by a DIST thread. This counter is
                 incremented each time a new SUB is dropped.
SREget           Total SRE requests performed by a DIST thread to fetch an SRE row. This counter
                 is incremented each time a DIST thread fetches an rs_subscriptions row from RSSD.
SRErebuild       Total SRE rebuild requests performed by a DIST thread.
SREstmtsDelete   Total delete commands encountered by a DIST thread and resolved by SRE.
SREstmtsDiscard  Total DIST commands with no subscription resolution that are discarded by a DIST
                 thread. This implies either there is no subscription or the 'where' clause
                 associated with the subscription does not result in row qualification.
SREstmtsInsert   Total insert commands encountered by a DIST thread and resolved by SRE.
SREstmtsUpdate   Total update commands encountered by a DIST thread and resolved by SRE.
TDbegin          Total begin transaction commands propagated by a DIST thread.
TDclose          Total commit or rollback commands processed by a DIST thread.
TransProcessed   Total transactions read from an inbound queue by a DIST thread.
UpdsRslocater    Total updates to the RSSD..rs_locater table by a DIST thread. A DIST thread
                 performs an explicit synchronization each time a SUB RCL command is executed.

The counters in RS 15.0 are:

Counter                    Explanation
CmdsRead                   Commands read from an inbound queue by a DIST thread.
TransProcessed             Transactions read from an inbound queue by a DIST thread.
Duplicates                 Commands rejected as duplicates by a DIST thread.
CmdsIgnored                Commands ignored by a DIST thread while it awaits an enable marker.
CmdsMaintUser              Commands executed by the maintenance user encountered by a DIST thread.
CmdsDump                   Dump database commands read from an inbound queue by a DIST thread.
CmdsMarker                 rs_markers placed in an inbound queue. rs_markers are enable replication,
                           activate, validate, and dump markers.
CmdsNoRepdef               Commands encountered by a DIST thread for which no replication definition
                           exists.
UpdsRslocater              Updates to the RSSD..rs_locater table by a DIST thread. A DIST thread
                           performs an explicit synchronization each time a SUB RCL command is
                           executed.
SREcreate                  SRE creation requests performed by a DIST thread. This counter is
                           incremented for each new SUB.
SREdestroy                 SRE destroy requests performed by a DIST thread. This counter is
                           incremented each time a new SUB is dropped.
SREget                     SRE requests performed by a DIST thread to fetch an SRE object. This
                           counter is incremented each time a DIST thread fetches an SRE object
                           from SRE cache.
SRErebuild                 SRE rebuild requests performed by a DIST thread.
SREstmtsInsert             Insert commands encountered by a DIST thread and resolved by SRE.
SREstmtsUpdate             Update commands encountered by a DIST thread and resolved by SRE.
SREstmtsDelete             Delete commands encountered by a DIST thread and resolved by SRE.
SREstmtsDiscard            DIST commands with no subscription resolution that are discarded by a
                           DIST thread. This implies either there is no subscription or the 'where'
                           clause associated with the subscription does not result in row
                           qualification.
TDbegin                    Begin transaction commands propagated by a DIST thread.
TDclose                    Commit or rollback commands processed by a DIST thread.
RSTicket                   rs_ticket markers processed by a DIST thread.
dist_stop_unsupported_cmd  Related to the dist_stop_unsupported_cmd config parameter.
DISTReadTime               The amount of time taken by a Distributor to read a command from SQT
                           cache.
DISTParseTime              The amount of time taken by a Distributor to parse commands read from
                           SQT.

As with the other modules, the average, total and max counters have been combined into a single counter with the different columns in rs_statdetail. However, the last two counters are new and can be helpful in determining why a latency might occur between the DIST and the SQT - other than the obvious problem of the SQM outbound slowing things down. The DIST thread will generally have two sources of problems. First, either not enough STS cache was provided or sts_full_cache is not enabled for rs_objects and rs_columns. The second source (and most common) is that the outbound queue is not keeping up (or we are writing to too many outbound queues in a fan-out - time to add routes and spread the load a smidgen). Either way, the DIST counters are also fairly handy for finding application problems. Key counters include:

- CmdsTotal, CmdsPerSec = CmdsTotal/seconds
- TransProcessed, TranPerSec = TransProcessed/seconds
- CmdsNoRepdef
- UpdsRslocater (again!!!)
- SREstmtsInsert, SREstmtsUpdate, SREstmtsDelete
- DISTReadTime, DISTParseTime (RS 15.0 only)

Again, the first one helps us identify the rate and compare this back to the SQT and RA modules to see if we are running up to speed. The second set is useful as now we can get a glimpse of how many transactions vs. just commands are flowing through, which can then be compared to the DSI transaction rate later. CmdsNoRepdef is a bit interesting. If using RS 12.6 and a database replication definition (MSA) with no table level repdefs, a high value here is to be expected. However, this in itself should also point out that it is ALWAYS a good idea to use repdefs from a performance perspective even when not necessary (MSA or WS). In all other cases, it points to a table marked for replication for which there is no repdef. This time, there is no real way to control UpdsRslocater, but by reducing everything else, this shouldn't inflict much damage - besides, this is lower than the updates to the OQID, typically less than 1 per second in any case. The next three are useful if trying to learn how many inserts/updates/deletes are flowing through the system. However, these counters are only incremented if using standard table repdefs - a database repdef without table repdefs will cause these to be ignored. This also is a good place to again find application-driven problems. For instance, if you see that the number of inserts and deletes are nearly identical, it is possible that either autocorrection is turned on or the application developers used a delete followed by an insert instead of an update. The last two are new counters added in RS 15.0 to help track how much time the DIST spends on these activities. Typically, this should be minimal, but if DISTReadTime is high, it may point to a problem with the SQT. After the DIST thread, of course, we have the SQM for the outbound queue(s), which has the same counters as the inbound queue - the only difference is that the DIST does not have a WriteWaits style counter like the RA thread. However, it does have a similar cache configuration called md_sqm_write_request_limit (replaces the deprecated md_memory_source_pool), which should be increased to the current maximum of 983,040 (for pre-12.6 ESD #7 and pre-15.0 ESD #1 servers) as well.
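Turning the raw totals into rates is simple arithmetic. The sketch below is a hypothetical isql snippet (the literal values are made up for illustration) showing the CmdsPerSec and TranPerSec derivations from two consecutive samples:

    -- hypothetical deltas between two consecutive admin statistics samples
    declare @cmds int, @trans int, @secs int
    select @cmds  = 286280,                              -- DIST CmdsTotal for the interval
           @trans = 14300,                               -- DIST TransProcessed for the interval
           @secs  = datediff(ss, '12:29:33', '12:34:34') -- elapsed seconds between samples
    select CmdsPerSec = @cmds / @secs,
           TranPerSec = @trans / @secs
    go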

DIST Thread Counter Usage

Again, let's take a look at some of these counters in action using the customer data we've been discussing:

Sample    CmdsWritten  SQMR      DIST       Cmds/  CmdsNo   SREstmts  SREstmts  SREstmts
Time      (SQM)        CmdsRead  CmdsTotal  Sec    RepDef   Insert    Update    Delete
0:29:33   268,187      587,860   286,280      951  243,481       0        299        0
0:34:34   364,705      947,808   459,313    1,520  393,577   3,757          2    3,753
0:39:37   253,283      318,611   280,677      932   95,050  26,698          9   26,662
0:44:38   266,334      282,958   266,409      882   84,076  25,847         87   25,687
0:49:40   253,684      277,054   250,152      828   83,607  24,250          4   24,238
0:54:43   164,566      194,386   165,375      549   57,013  15,432          3   15,432
0:59:45   376,184      365,435   344,168    1,139  110,949  35,926         14   33,965
1:04:47   450,809      522,844   430,077    1,424  203,934  29,710          4   29,707
1:09:50   326,750      400,065   373,714    1,241  157,915  22,554        469   22,540
1:14:52   325,340      352,656   325,586    1,078  136,768  21,726          7   20,247
1:19:54   317,674      317,683   317,470    1,054  125,408  19,261         44   19,261

This one sample period actually was useful as it illustrated two different problems at this customer site. This will become apparent as we look at these counters.

SQM CmdsWritten vs. DIST CmdsTotal - The best way to identify latency in the SQT/DIST pipeline is to compare the DIST.CmdsTotal counter to the SQM.CmdsWritten counter. Note that not exactly all commands will be distributed, so a precise match is likely not possible. However, if instead you tried to compare with SQMR CmdsRead, you would have a negative influence based on the re-scanning of removed transactions (as illustrated above); plus, if there was any latency, you could not compare it to the previous stage. Note that in this case, despite all the rescanning for large transactions, the DIST thread is keeping pace with the SQM writer. This does not mean that the SQT cache does not need to be resized - it suggests that if any latency is observed, increasing the SQT cache size is not likely to have a significant impact on throughput or reduce the latency, as not much exists at this stage.

Cmds/Sec - Much like other derived rate fields, this value is derived by dividing the CmdsTotal by the number of seconds between sample intervals. This value is useful in observing the impact of tuning on the overall processing by the DIST, particularly if adjustments are made to the STS cache (in addition to observing the STS counters as well).

CmdsNoRepDef - Here is where we begin to see the first problem - we have significantly large values for this counter where logically we should expect none. There are two possible causes for this. First, a database replication definition being used for a standby database implementation via the Multiple Standby Architecture (MSA) method is similar to a Warm Standby implementation in that table level replication definitions are not required. While not required, table level replication definitions ought to be used if database consistency (think float datatype problems) and DSI performance is of any consideration. The second possible cause is that the table is marked for replication or the database is marked for standby replication, but the table(s) involved at this point don't have corresponding subscriptions. Without subscriptions - and lacking a database repdef/subscription - the DIST has no choice but to discard these statements. However, it does indicate that overall system performance could be improved by not replicating this data in the first place, either by unmarking the tables for replication, using the set replication off command prior to the batch submission, or another technique of ensuring that the Replication Agent doesn't process the rows. In this case, it would significantly reduce the workload of the SQM (inbound) and the SQT.

SREstmtsInsert/Update/Delete - This is the first location within the monitor counters where you begin to get a picture of what the source transaction profile looked like, especially if combined with DIST.TransProcessed. However, in this case, a very curious phenomenon was observed that led to the identification of the second problem.

If you notice, from the second sample interval on, the inserts and deletes are nearly identical while the number of updates is at noise level. This could be legitimate - for example, when working off of a job queue, new jobs could be added as old jobs are removed. However, this is unlikely. This leaves two other possible choices. The most likely choice is that the autocorrection setting has been accidentally left enabled for a replication definition. In that mode, a replicated update would be submitted as a delete followed by an insert. The second choice is that the application itself is doing delete/insert pairs vs. performing an update. While this sounds illogical, earlier versions of some GUI application development tools such as PowerBuilder used to do this by default. The issue is that this not only doubles the workload in Replication Server in having to process twice the number of commands, but it also causes slower performance at the DSI as rows are removed not only from the table but also from the indices and then re-added. At the primary, this workload is not as apparent thanks to user concurrency. With Replication Server by default using a single DSI, this workload delays replication as a whole. It turned out that this indeed was the application logic, and while not a simple fix, rewriting the application to use updates instead would immediately reduce the replication latency.

In addition to the DIST counters, the STS counters and SQM (outbound) counters may also need to be looked at to determine what may be driving DIST thread performance.

Minimal Column Replication

Unfortunately, appending the clause replicate minimal columns to replication definitions is often forgotten. A common misconception is that minimal column replication chiefly benefits RS throughput by reducing the amount of space consumed in the inbound (and outbound) queues. While it does reduce the space - and tighter row densities allow more rows to be processed by the SQM/SQT per I/O, which can improve performance - the biggest benefit of minimal column replication is the performance gain through reducing the workload involved at the replicate DBMS, aiding in DSI performance (where typically the problem is). While not reducing the workload of the DIST thread so much, it can dramatically reduce the workload of the DSI thread as it can tremendously reduce the work at the replicate dataserver. This workload reduction specifically is the probable reduction in unnecessary index maintenance at the replicate as well as a reduction in contention caused by index maintenance when parallel DSIs are used and the dsi_serialization_method is set to isolation_level_3. To understand the impact of this, you first have to understand what happens normally.

Normal Replication Behavior

Under normal (non-minimal column) replication, the DIST thread does not perform any checking of which columns have been changed for an update statement. As a result, if an update of only 2 columns of a 10 column table occurs, Replication Server constructs a default function string containing an update for all 10 columns of the table, setting the column values equal to the new values with a where clause of the primary key old values. For example, consider the following table (from the pubs2 sample database shipped with Sybase ASE) and associated indexes.
create table titles (
    title_id     tid           not null,
    title        varchar(80)   not null,
    type         char(12)      not null,
    pub_id       char(4)       null,
    price        money         null,
    advance      money         null,
    total_sales  int           null,
    notes        varchar(200)  null,
    pubdate      datetime      not null,
    contract     bit           not null
)
go

create unique clustered index titleidind on titles (title_id)
go
create nonclustered index titleind on titles (title)
go

For further fun, note that the salesdetail table has a trigger that updates the titles.total_sales column:
create trigger totalsales_trig
on salesdetail
for insert, update, delete
as
    /* Save processing: return if there are no rows affected */
    if @@rowcount = 0
    begin
        return
    end
    /* add all the new values */
    /* use isnull: a null value in the titles table means
    ** "no sales yet" not "sales unknown" */
    update titles
       set total_sales = isnull(total_sales, 0) +
           (select sum(qty) from inserted
             where titles.title_id = inserted.title_id)
     where title_id in (select title_id from inserted)
    /* remove all values being deleted or updated */
    update titles
       set total_sales = isnull(total_sales, 0) -
           (select sum(qty) from deleted
             where titles.title_id = deleted.title_id)
     where title_id in (select title_id from deleted)
go

By now, some of you may already be seeing the problem. As mentioned previously, for an update statement, RS will generate a full update of every column. Consider a mythical replication definition like:
create replication definition CHINOOK_titles_rd
with primary at CHINOOK.pubs2
with all tables named 'titles'
(
    "title_id"     varchar(6),
    "title"        varchar(80),
    "type"         char(12),
    "pub_id"       char(4),
    "price"        money,
    "advance"      money,
    "total_sales"  int,
    "notes"        varchar(200),
    "pubdate"      datetime,
    "contract"     bit
)
-- Primary key determination based on: Primary Key Definition
primary key ("title_id")
searchable columns ("title_id")

This means the function string (if you were to mimic it by altering the function string) would resemble:
alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
   set title_id    = ?title_id!new?,
       title       = ?title!new?,
       type        = ?type!new?,
       pub_id      = ?pub_id!new?,
       price       = ?price!new?,
       advance     = ?advance!new?,
       total_sales = ?total_sales!new?,
       notes       = ?notes!new?,
       pubdate     = ?pubdate!new?,
       contract    = ?contract!new?
 where title_id = ?title_id!old?
'

The result is rather drastic. The first problem, of course, is that the outbound queue will contain significantly more data than actually was updated - assuming the notes column was filled out. But this is minor compared to what really impacts DSI delivery speed. For those of you familiar with database server performance issues, any time a row is updated, any index whose key columns appear in the update also has to be maintained. In this example, every time a new order is inserted into the salesdetail table, the corresponding update at the replicate not only updates the entire row - it also performs index maintenance. Worse yet, if ANSI constraints were used, the related foreign key tables would have holdlocks placed on the related rows, increasing the probability of contention. Clearly, this is not desirable behavior. Unfortunately, it occurs much more often than you would think. Consider:

- Aggregate columns - such as the titles example.

- Auditing columns - this includes such columns as last_update_user, last_updated_date, etc., similar to the trigger issue mentioned previously.

- Status columns - shipping/order status information for order entry or any workflow system.

- Dynamic values - product prices (sale prices, etc.). Consider a regional chain store that wants to replicate price changes to 60+ stores for 100s of products. Now add in the overhead of changing every column and index maintenance and the associated impact that could have on store operations.

Undoubtedly, there are others you could think of as well.

Minimal Column Replication

When the replication definition includes the replicate minimal columns phrase, the behavior is much different. With minimal column replication, only the columns with different before and after images, as well as the primary key values, are written to the inbound (and consequently the outbound) queue. Consequently, most of the updates to the titles table would be executing a function string similar to:
alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
   set total_sales = ?total_sales!new?
 where title_id = ?title_id!old?
'

This will more than likely execute much quicker in high volume environments. An interesting aspect of minimal column replication is what happens if the only columns updated were columns not included in the replication definition. Under normal replication rules, if a column is updated, the rs_update function is processed and sent to the RS. The RepAgent User thread simply strips out any columns not being replicated as part of the normalization process and the resulting functions are generated as appropriate. For example, in the above titles table, let's assume that the contract column was excluded from the replication definition as in:
create replication definition CHINOOK_titles_rd
with primary at CHINOOK.pubs2
with all tables named 'titles'
(
    "title_id"     varchar(6),
    "title"        varchar(80),
    "type"         char(12),
    "pub_id"       char(4),
    "price"        money,
    "advance"      money,
    "total_sales"  int,
    "notes"        varchar(200),
    "pubdate"      datetime
)
-- Primary key determination based on: Primary Key Definition
primary key ("title_id")
searchable columns ("title_id")

Of course, the full update function string would now be:


alter function string CHINOOK_titles_rd.rs_update
for rs_sqlserver_function_class
output language
'
update titles
   set title_id    = ?title_id!new?,
       title       = ?title!new?,
       type        = ?type!new?,
       pub_id      = ?pub_id!new?,
       price       = ?price!new?,
       advance     = ?advance!new?,
       total_sales = ?total_sales!new?,
       notes       = ?notes!new?,
       pubdate     = ?pubdate!new?
 where title_id = ?title_id!old?
'

Now, consider the following update statement:


update titles set contract = 1 where title_id = 'BU1234'

If this statement was executed at the primary, the replicate would receive a full update statement of all columns in the replication definition (excluding the contract column, of course), setting them to the same values they already are. As you can guess, under minimal columns this behaves differently. Obviously, if the only column(s) updated were columns excluded from the replication definition, the RS would otherwise attempt to generate an empty set clause. One option would be for RS to ignore any update for which only columns not being replicated were updated. However, what happens is that RS submits an update setting the primary key values to their after image values - essentially a no-op, along the lines of the sketch below.
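For the example update above, the SQL delivered to the replicate would be roughly the following (a sketch only - the exact text depends on the function string in use):

    update titles
       set title_id = 'BU1234'
     where title_id = 'BU1234'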

This can be confusing and lead to a quick call to TS demanding an explanation. Before you pick up the phone, one little consideration: what if a custom function string was simply counting the number of updates to a table? By excluding the update from replication whenever only non-replicated columns were updated, the functions would never get invoked. While this is more easily handled today in a cleaner approach using multiple replication definitions, this implementation no doubt dates back to the earliest implementations of RS, in which guaranteed assurance of replicated transactions held sway over performance (and rightfully so). Keep in mind that this does impose a number of restrictions:

- Autocorrection cannot be used while minimal column replication is enabled.

- Custom function strings containing columns other than the primary keys may not work properly or may generate errors.

Regarding the first restriction, autocorrection should not normally be on. If left on, performance could be seriously degraded as each update translates into a delete/insert pair. Even if the values haven't changed, this can have a greater penalty than not using minimal columns, as the index maintenance load could be greater due to first removing the index keys (and any corresponding page shrinkage) and then re-adding them (which could cause splits). Consequently, minimal column replication should be enabled by default, and when autocorrection is necessary due to inconsistencies, the replication definition can be altered to remove minimal column replication (temporarily).

Note that minimal column replication really only applies to updates. In the case of insert statements, all of the values are new and therefore need replication. While the minimal column replication documentation does include comments about both update and delete operations, for most users only the rs_update function will be impacted. For delete statements, this translates to only the primary key values being placed into the outbound queue (vs. the full before image without minimal column replication), which means any custom function string (such as auditing) that records the values being deleted in a history table will incur problems. Again, if not using custom function strings on the table, minimal column replication will not have a negative impact on RS functionality. If using custom function strings, using multiple repdefs may alleviate the pain of not being able to use minimal column replication. For example, if you have a Warm Standby and a Reporting system and the reporting system uses custom function strings (to perform aggregates), then you may want to use two repdefs for the table(s) in question - one for the Warm Standby supporting minimal column replication, and one for the reporting server. Note that for Warm Standby, minimal column replication is enabled by default, as is also true of MSA implementations.

Key Concept #14: Unless custom function strings exist for update and delete functions for a specific table, minimal column replication should be considered. By using minimal columns, update operations at the replicate will proceed much quicker by avoiding unnecessary index maintenance and possibly avoiding updates altogether if the only columns updated at the primary are excluded from the replication definition.
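Enabling minimal columns on an existing replication definition, and temporarily trading it for autocorrection while repairing an inconsistency, is a one-line change each way. The following is a sketch only, reusing the CHINOOK_titles_rd repdef from the earlier examples and a hypothetical replicate connection SYDNEY_DS.pubs2; substitute your own names:

    -- normal running state: minimal columns on, autocorrection off
    alter replication definition CHINOOK_titles_rd
        replicate minimal columns
    go

    -- while resynchronizing: autocorrection requires minimal columns to be off
    alter replication definition CHINOOK_titles_rd
        replicate all columns
    go
    set autocorrection on
        for CHINOOK_titles_rd
        with replicate at SYDNEY_DS.pubs2
    go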


Outbound Queue Processing

"What goes in must come out."
The single biggest bottleneck in the Replication System is the outbound queue processing. As hard as this may be to believe, the main reason for this is that the rate of applying transactions at the replicate will often be considerably slower than the rate at which they were originally applied at the primary. While some of this is due to replicate database tuning issues, a considerable part of it is also due to the processing of the outbound queue. A key point to remember is that when discussing the outbound processing of Replication Server internals, you are discussing threads and queues that belong to the replicate database connection and not the primary. If you remember from the earlier internals diagram, the outbound processing basically includes the SQM for the outbound queue, the DSI thread group and the RSI thread for replication routes. These are illustrated below, with the exception of the RSI thread.

Figure 34 - Replication Server Internals: Inbound and Outbound Processing


As you can imagine, the outbound queue SQM processing is extremely similar to the SQM processing for an inbound queue - it basically manages stable device space allocation and performs all outbound queue write activity via the dAIO daemon. Consequently, we will begin by looking at the Data Server Interface (DSI) thread group in detail. A closer-in diagram would look like the following:


Figure 35 - Close up of DSI Processing Internals


Many of the concepts illustrated above - DSI SQT processing, transaction grouping, command batching, etc. - will be discussed in this section, while the Parallel DSI features will be discussed later. In any case, you can think of the flow through the DSI as having the following stages:

1. Read from Queue (DSI SQM Processing)
2. Sort Transactions (due to multiple sources) (DSI SQT Processing)
3. Group Transactions (DSI Transaction Grouping)
4. Convert to SQL (DSIEXEC Function String Generation)
5. Generate Command Batches for Execution (DSIEXEC Command Batching)
6. Submit SQL to RDB (DSIEXEC Batch Execution)

We will use this list as a starting point to discuss DSI processing. We will look at the most appropriate counters during each section. Because of the number of DSI & DSIEXEC module counters, we will not necessarily look at each one. First, however, it might be a good idea to take a closer walk-through of the DSI/DSIEXEC processing.

1.  The DSI thread reads from the outbound queue SQM.
2.  As the DSI reads each command, it uses SQT logic to sort the commands into their original transactions and also into commit order (when multiple sources are replicating to a single destination).
3.  When the DSI/SQT sees a closed transaction, it determines if it can group it with already closed transactions it has in cache according to the transaction grouping rules and the various connection configurations.
4.  Once it can't add it to an existing group, it checks to see which of the DSIEXECs are available and submits the existing transaction group to the DSIEXEC via message queues.
5.  The DSIEXEC takes the transaction group commands and converts the structures to SQL statements.
6.  As the DSIEXEC converts the transaction group to SQL statements, it attempts to batch the commands into command batches for execution efficiency (similar to multiple statements in an isql script before the go).
7.  When the batch limit is hit (50 commands) or when the batching is terminated due to batching rules/configuration parameters, the DSIEXEC notifies the DSI that it is ready to submit the first batch.
8.  The DSI checks the dsi_serialization_method and, if the serialization method is wait_for_commit, the batch is held until the previous thread is ready to commit. Otherwise, the DSI notifies the DSIEXEC to send the batch to the replicate DBMS for execution.
9.  When the first batch is sent to the replicate database, the DSIEXEC notifies the DSI so that the DSI can allow parallel DSIs to work if the dsi_serialization_method is not wait_for_commit (i.e. wait_for_start).
10. The DSIEXEC then processes the results from each of the commands within the command batch. When all the results have been processed, it submits the next command batch until the entire transaction has been submitted (but not yet committed).

11. When all the SQL commands have been submitted, the DSIEXEC notifies the DSI via message queue that it is ready to commit.
12. The DSI checks the commit order and notifies the DSIEXECs when they can commit. In addition, if the DSI serialization method is wait_for_commit, it notifies other DSIEXECs that they can send their batch.
13. As each DSIEXEC receives commit notification, it sends the commit to the replicate DBMS and notifies the DSI that it has committed and is available for another transaction group.

This is illustrated in the diagram below (showing only the communications between the DSI and one DSIEXEC - the others are implied).

Figure 36 - Logical View of DSI & DSIEXEC Intercommunications


As you can tell, there is quite a bit of back-and-forth communication between the various DSIEXECs and the DSI thread to ensure proper commit sequencing and also to ensure that the command execution sequencing is maintained. A few items of interest relating to the monitor counters from the above diagram:

Batch Sequencing Time (Steps 4-7) - This is the time between when the first command batch is ready (#4 Batch Ready) and when the DSIEXEC receives the Begin Batch message (#5). This gap is used to control when parallel DSIs can start sending their respective SQL batches according to the dsi_serialization_method. For example, if the dsi_serialization_method was wait_for_commit and the bottom thread sent a Batch Ready message, the DSI would not respond with a Begin Batch until it got the Commit Ready (#10) from the top thread. If instead the dsi_serialization_method was wait_for_start, the bottom thread would get a Begin Batch response when the top thread sent the Batch Began message (#7).

Commit Sequencing Time (Steps 9-13) - This is the time between the Commit Ready (#10) and the Commit (#11) response. Any time lag is likely due to the DSI waiting for a previous thread to respond back Committed (#13), which means that it has committed successfully. The reason we say it begins at rs_get_threadseq (#9) is that in parallel DSIs, when not using commit control, the rs_threads table is used for serialization - and it is in this step that it occurs (as will be discussed later).

Note that only the first command batch is coordinated with the DSI. Subsequent command batches are simply applied, except in the case of large transactions, in which every dsi_large_xact_size commands a rs_get_thread_seq is sent. Note that in the above diagram, when the thread is ready to commit (rs_get_threadseq returns), the seq number from the rs_get_threadseq is passed to the DSI for comparison. If the seq number is less than expected, the implication is that the previous thread rolled back (due to error or contention) and that this thread needs to roll back as well - in which case step #11 becomes a Rollback command (currently implemented as a disconnect, which causes an implicit rollback).

DSI SQM Processing

Much like the SQT interaction with the inbound queue SQM, the DSI reads from the outbound queue SQM. As far as the SQM itself, it is identical to the inbound queue SQM. While many of the SQM/SQM-R related counters are the same, there is at least one major difference. If you remember from the inbound discussion, the primary goal is to be reading the blocks from cache, using BlocksReadCached as the indicator. While this is a desirable goal for the outbound queue as well, the likelihood is that the latency in executing the SQL at the replicate will result in the cache hit quickly dropping to zero once the DSI SQT cache fills. Consider the following:


Sample    SQM.Cmds  SQMR      Blocks  BlocksRead  Cache   Segs       Segs         Segs    Cache
Time      Written   CmdsRead  Read    Cached      Hit %   Allocated  Deallocated  Active  MemUsed
19:02:07  6         6         2       2           100     0          0            1       0
19:07:08  6,312     6,293     189     189         100     3          3            1       1,792
19:12:10  7,711     7,689     308     307         99.67   4          4            1       3,328
19:17:12  4,075     4,046     185     185         100     3          3            1       0
19:22:13  6,963     6,987     270     269         99.62   5          5            1       0
19:27:14  7,499     7,496     291     291         100     4          4            1       143,104
19:32:16  25,533    18,058    530     401         75.66   10         8            3       2,098,432
19:37:18  48,468    41,405    715     0           0       13         11           5       2,097,920
19:42:19  29,238    42,331    744     0           0       9          12           2       2,098,432
19:47:21  40,042    21,570    405     240         59.25   11         6            7       2,097,920
19:52:22  19,140    22,807    403     0           0       8          6            9       2,098,432
19:57:45  31,727    9,876     266     0           0       10         4            15      2,098,432
20:02:48  93,539    12,270    418     0           0       23         7            31      2,098,432
20:07:49  67,564    18,803    298     0           0       17         5            44      2,098,432
20:12:51  52,751    29,352    470     0           0       13         7            50      2,098,432

As you can see from the above, once the DSI SQT cache fills, the BlocksReadCached quickly hits bottom. Now, this also points out a bit of a fallacy. Earlier we stated that one way to determine the amount of latency was to subtract the Next.Read value from the Last Seg.Block in the admin who, sqm command. For the outbound queue, this does represent a rough estimate - what it is lacking is the amount in the DSI SQT cache. Consequently, the most accurate measurement would be Last Seg.Block - Next.Read + CacheMemUsed. The number of active segments above is a good estimate as well - however, these are not reported in any easily obtained admin who statistics. The First Seg.Block includes segments still allocated simply because they have not been deallocated yet, as well as segments preserved by the save interval - so subtracting First Seg.Block from Last Seg.Block is even more inaccurate than using Next.Read. One aspect to consider is that if there is any latency, then you can be sure that the DSI SQT cache is probably full, which means that the most accurate estimate for latency in the outbound queue is:
Latency = Last Seg.Block - Next.Read + (DSI SQT cache in use)
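A quick worked example, using hypothetical admin who, sqm values and the standard stable queue geometry of 1MB segments made up of 64 x 16K blocks: if Last Seg.Block is 1250.45, Next Read is 1248.12, and the connection's dsi_sqt_max_cache_size of 2MB is full, then:

    blocks behind = (1250 - 1248) * 64 + (45 - 12)  = 161 blocks
    queue backlog = 161 * 16KB                      = ~2.5MB
    latency       = ~2.5MB + 2MB DSI SQT cache      = ~4.5MB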

If Next.Read is higher than Last Seg.Block, it is very likely that the DSI is caught up or nearly so. But this may explain to some why, when the connection appears to be all caught up and you suspend the connection, suddenly there is 1MB of backlog in the outbound queue despite the source being quiescent.

DSI SQT Processing

If you notice in the internals diagram above, unlike with the inbound processing, the outbound processing does not have a separate SQT thread. This is largely due to a very simple reason - transactions in the outbound queue are more than likely already in commit order. For example, if a source database is replicating to a single destination, the inbound SQT effectively sorts the transactions into commit sequence. Since this ordering is not overridden anywhere within the rest of the inbound processing, the outbound queue is automatically in sorted order. This does not change if the primary has multiple replicates, since each replicate will have its own independent outbound queue that the single DIST thread is writing commit-ordered transactions into. The only time this is not true is when multiple primary databases are replicating into the same replicate database - such as corporate rollup topologies. However, even in this latter case, due to MD caching of writes, providing that the transactions are small enough, the SQT will still encounter complete and contiguous transactions from each source system. If the transactions are not contiguous (replicated rows from the various sources interspersed in the stable queue), the SQT will still only have a single transaction per origin in the Open/Closed/Read linked lists, as the transactions are still in commit order with respect to the source database.

As a result, the main DSI thread queue manager (DSI - normally called the DSI scheduler or DSI-S) simply calls the SQT functions when reading from the outbound queue via the SQM. This lack of workload was the primary driver to simply including the SQT module logic in the DSI vs. having a separate SQT thread for the outbound queue. One notable exception to this is Warm Standby DSIs. In a Warm Standby, the WS-DSI threads read straight off the inbound queue - effectively duplicating the sorting process carried out by the SQT thread. If your only connection within the replication server is a Warm Standby, you should consider the alter logical connection logical_DS.logical_DB set distribution off command (sketched below). This command shuts down the DIST thread for the logical connection. The DIST is more than just a client of the SQT thread - it actually controls it. During startup, the RS first starts the SQM threads, then the DSI and DIST threads. The DIST in turn starts the appropriate SQT thread. Consequently, by disabling distribution for a logical connection, you not only shut down the DIST thread, but you also shut down the SQT thread. This can save CPU time - especially in pre-12.6 non-SMP RS implementations - by:

- Eliminating CPU consumed by the DIST thread unnecessarily checking for subscriptions, etc.

- Eliminating CPU and memory consumed by the SQT thread in sorting the transactions
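A minimal sketch of the change for a Warm Standby-only Replication Server, assuming a logical connection named LDS.pubs2 (the exact alter logical connection syntax can vary slightly by RS version, so verify it against the reference manual for yours):

    -- WS-only RS: the DIST (and therefore the inbound SQT) is doing no useful work
    alter logical connection to LDS.pubs2
        set distribution off
    go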

So, with the exception of the SQT cache in a WS DSI thread, if the SQT module is so little used, what is the SQT cache used for by the DSI thread? Remember, the SQT cache contains the actual commands that comprise the transaction - consequently, the SQT cache is where the DSI EXEC threads read the list of commands for which to generate SQL and apply to the replicate database. This is illustrated in the above drawing, in which the DSI EXEC threads read from the SQT cache Closed queue and, after applying the SQL, notify the DSI of the success, causing the transaction to be moved to the Read queue.

DSI SQT Performance Monitoring

This does not mean that you cannot monitor the SQT processing within the outbound queue processing. If you remember from earlier, the admin who, sqt command reports both the inbound and outbound SQT processing statistics:
admin who, sqt

Spid  State            Info
----  ---------------  -------------------------
17    Awaiting Wakeup  101:1 TOKYO_DS.TOKYO_RSSD
98    Awaiting Wakeup  103:1 DIST LDS.pubs2
10    Awaiting Wakeup  101 TOKYO_DS.TOKYO_RSSD
0     Awaiting Wakeup  106 SYDNEY_DS.pubs2

Closed  Read  Open  Trunc  Removed  Full  SQM Blocked  First Trans  Parsed  SQM Reader  Change Oqids  Detect Orphans
------  ----  ----  -----  -------  ----  -----------  -----------  ------  ----------  ------------  --------------
0       0     0     0      0        0     1            0            0       0           0             0
0       0     0     0      0        0     1            0            0       0           0             0
0       0     0     0      0        0     0            0            0       0           0             1
0       0     0     0      0        0     0            0            0       0           0             1
In the above example output, the DSI SQT processing is reported in the last two lines - those lacking the queue designator (:1 or :0). The way this can easily be verified is by issuing a normal admin who command and comparing the spids (10 and 0 above) with the type of thread reported for those processes in the process list returned by admin who. From a performance perspective, if you (hopefully) have tuned the Replication Server's sqt_max_cache_size parameter (i.e. to 2-4MB), you may want to adjust the SQT cache for the outbound queue downward or upward, depending on the status of the Removed and Full columns in the admin who, sqt output and careful monitoring of the monitor counters. This can (and must) be done on a connection basis by setting dsi_sqt_max_cache_size to a number differing from sqt_max_cache_size. In the following sections we will take a look at why you might want to do either.
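Changing the connection-level cache is a suspend/alter/resume cycle. A minimal sketch, assuming the SYDNEY_DS.pubs2 connection from the earlier examples and a 1MB target (the value is in bytes):

    suspend connection to SYDNEY_DS.pubs2
    go
    alter connection to SYDNEY_DS.pubs2
        set dsi_sqt_max_cache_size to '1048576'
    go
    resume connection to SYDNEY_DS.pubs2
    go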

dsi_sqt_max_cache_size < sqt_max_cache_size

In most systems, the default dsi_sqt_max_cache_size setting is 0, which means the DSI inherits the same cache size as the SQT cache limit (sqt_max_cache_size). This is extremely unfortunate, as DBAs tend to over-allocate sqt_max_cache_size - setting it well above the 4-8MB that is likely all that is necessary even in high volume systems. As a result, the DSI-S thread will continuously be trying to fill the available DSI SQT cache from the outbound queue - often at the expense of yielding the CPU to the DSI EXEC. Consequently, in most common systems, the default dsi_sqt_max_cache_size causes performance degradation. The proper sizing for dsi_sqt_max_cache_size is likely 1-2MB at most and can be more accurately determined for parallel DSI configurations by reviewing the monitor counter information (discussed below).

dsi_sqt_max_cache_size >= sqt_max_cache_size

A notable exception to this is the Warm Standby implementation. As mentioned earlier, in a WS topology, it is the DSI SQT thread that is actually sorting the transactions into commit order. In this case, you will probably want to set the DSI SQT cache equal to the SQT cache or possibly even higher. A second exception concerns the use of parallel DSIs. When parallel DSIs are used, the DSI thread can effectively process large amounts of row modifications, as the load can be distributed among the several available DSIs. This could result in a situation where the DSI transaction rate is higher than the rate of rows read from the outbound queue. In such situations, raising the DSI SQT cache allows the DSI to read ahead into the queue and begin preparing transactions before they are needed. This is especially true in high volume replication environments in which the rate of changes requires more than the default number of parallel DSI threads. In fact, consider the default dsi_max_xacts_in_group setting of 20. If the number of parallel DSIs was set to 5, then you would need a dsi_sqt_max_cache_size large enough to hold 100 closed transactions at a minimum, and probably some number of open transactions that the DSI executer could be working on. However, even in these cases, unless the system only experienced short transactions allowing the primary sqt_max_cache_size setting to remain low at 1-2MB, the dsi_sqt_max_cache_size setting for parallel DSIs will still likely be less than sqt_max_cache_size. How to size this will be illustrated in the next section.

DSI SQT Monitor Counters

Although the DSI SQT is not a separate threaded module, the standard SQT monitor counters apply. These are repeated here; the counters most relevant to the DSI are discussed afterwards.

Counter             Explanation
CacheExceeded       Total number of times that the sqt_max_cache_size configuration parameter has
                    been exceeded.
CacheMemUsed        SQT thread memory use. Each command structure allocated by an SQT thread is
                    freed when its transaction context is removed. For this reason, if no
                    transactions are active in SQT, SQT cache usage is zero.
ClosedTransRmTotal  Total transactions removed from the Closed queue.
ClosedTransTotal    Total transactions added to the Closed queue.
CmdsAveTran         Average number of commands in a transaction scanned by an SQT thread.
CmdsLastTran        Total commands in the last transaction completely scanned by an SQT thread.
CmdsMaxTran         Maximum number of commands in a transaction scanned by an SQT thread.
CmdsTotal           Total commands read from SQM. Commands include XREC_BEGIN, XREC_COMMIT,
                    XREC_CHECKPT.
EmptyTransRmTotal   Total empty transactions removed from queues.
MemUsedAveTran      Average memory consumed by one transaction.
MemUsedLastTran     Total memory consumed by the last transaction completely scanned by an SQT
                    thread.
MemUsedMaxTran      Maximum memory consumed by one transaction.
OpenTransRmTotal    Total transactions removed from the Open queue.
OpenTransTotal      Total transactions added to the Open queue.
ReadTransRmTotal    Total transactions removed from the Read queue.
ReadTransTotal      Total transactions added to the Read queue.
TransRemoved        Total transactions whose constituent messages have been removed from memory.
                    Removal of transactions is most commonly caused by a single transaction
                    exceeding the available cache.
TruncTransRmTotal   Total transactions removed from the Truncation queue.
TruncTransTotal     Total transactions added to the Truncation queue.

Let's take a look at some of these counters and how they can be used from the outbound queue/DSI perspective.

Counters          Performance Indicator
CacheExceeded     Normally, we would associate these values with needing to raise the SQT cache
TransRemoved      setting (i.e. dsi_sqt_max_cache_size). However, what we are likely to see is
                  that CacheMemUsed grows until dsi_sqt_max_cache_size is reached, at which point
                  CacheExceeded will jump to substantially large values. The only transactions
                  likely to be removed will be large transactions too large to fit into the DSI
                  SQT max cache size. Unless this happens frequently due to larger transactions,
                  DBAs should avoid raising the DSI SQT cache, as the latency in processing
                  transactions ahead of them will likely result in their being removed in any
                  case.
OpenTransTotal    These counters take on a different perspective. Since the transactions are
ClosedTransTotal  nearly all presorted, these counters may differ until the cache fills. Once the
ReadTransTotal    cache fills, these values will be identical, as each group of transactions
                  committed by the DSI makes room for the same number of transactions to be read
                  into the DSI SQT cache.
CacheMemUsed      These counters are the most appropriate ones to use to size
MemUsedAveTran    dsi_sqt_max_cache_size. Ideally, you want the DSI SQT cache to contain double
                  the dsi_max_xacts_in_group transactions for each DSI EXEC thread. Consequently,
                  for 5 DSIEXECs and the default of 20 dsi_max_xacts_in_group, you would like to
                  see 2 * 5 DSIs * 20 xacts/group, or 200 transactions. The number of cached
                  transactions can be derived by dividing CacheMemUsed by MemUsedAveTran. If
                  divided by dsi_max_xacts_in_group, this will show how many possible transaction
                  groups are in cache at a maximum (excluding partitioning rules, different
                  origins, etc.). If we have 200 or more transactions in cache, raising
                  dsi_sqt_max_cache_size is likely of no benefit.
CmdsAveTran       This is useful for helping to size dsi_max_xacts_in_group when using parallel
                  DSIs. If the number of commands per transaction is fairly high, large
                  transaction groups will only compound any contention between the parallel DSIs.

Let's take a look at how these might work by looking at the earlier insert stress test.


Sample    Cache      MemUsed  Cache     Trans    ClosedTrans  ReadTrans  DSI.Trans  DSI.NgTrans  Cached  DSIXact  MaxCached
Time      MemUsed    AveTran  Exceeded  Removed  Total        Total      Total      Total        Trans   InGrp    Groups
11:37:47  0          0        0         0        0            0          0          0            0       0.0      0.0
11:37:57  2,097,408  10,729   1         0        75           54         21         58           195     2.7      72.2
11:38:08  2,099,712  12,223   47        0        289          296        62         287          171     4.6      37.1
11:38:19  2,099,200  12,223   54        0        327          331        68         322          171     4.7      36.3
11:38:30  2,097,920  12,223   42        0        347          339        75         334          171     4.4      38.8
11:38:41  2,098,432  12,223   56        0        319          311        67         315          171     4.7      36.3
11:38:52  2,101,504  12,223   64        0        345          336        64         310          171     4.8      35.6
11:39:03  2,100,224  12,223   61        0        319          333        68         319          171     4.6      37.1
11:39:14  2,099,968  12,223   61        0        345          326        68         316          171     4.6      37.1
11:39:25  2,100,224  12,223   45        0        295          307        67         291          171     4.3      39.7
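The last three columns are not raw counters - they are derived. A quick sketch of the arithmetic, using the second sample row above:

    CachedTrans     = CacheMemUsed / MemUsedAveTran      = 2,097,408 / 10,729  = ~195
    DSIXactInGrp    = DSI.NgTransTotal / DSI.TransTotal  = 58 / 21             = ~2.7
    MaxCachedGroups = CachedTrans / DSIXactInGrp         = 195 / 2.7           = ~72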

To evaluate this, it helps to know that there were 10 parallel DSIs; dsi_xact_group_size was set to 262,144; dsi_max_xacts_in_group was set to 20; and dsi_sqt_max_cache_size was set to 2,097,152. Again, the last three columns in the above table are derived statistics rather than raw counters. Let's take a look at what these counters are telling us.

CacheMemUsed, CacheExceeded & TransRemoved - As you can see from the above, as soon as transactions arrived, the DSI SQT cache was quickly filled by the DSI-S - filled in about 10 seconds. From that point, as long as there were transactions in the queue to be delivered, the cache remained full and the cache was exceeded frequently. However, notice that there were 0 transactions removed - implying that this 2MB DSI SQT cache is likely oversized, or is correctly sized.

ClosedTransTotal & ReadTransTotal - During the first period of activity, when the cache was filled (CacheExceeded=1), we see that the DSI SQT cache had 75 Closed transactions and only 54 Read transactions - demonstrating that the DSIEXECs were lagging right from the start. However, as the cache became full, new transactions could only be read from the queue into the SQT cache at the same rate that the DSIEXECs could deliver them - resulting in the situation we described before in which Closed = Read. When looking at these numbers, you also need to realize that the number of Closed & Read transactions is over the full sample period, so these values do not reflect the number of transactions in cache, but the number of transactions that are in cache plus the number of transactions that have been moved to the next stage of the cache (Open -> Closed -> Read -> Truncate). For example, let's say we were delivering transactions at a rate of one per second - if the cache quickly filled with 50 transactions, then each second one would be moved from Closed to Read, making room for one more, and at the end of the 10 second sample interval we would show a total of 60 transactions having been Closed - the original 50 plus 10 due to processing.

CachedTrans - The actual number of transactions in the cache can be roughly derived by dividing CacheMemUsed by MemUsedAveTran. This is the first indication that the DSI SQT cache is possibly oversized from the system performance perspective, as we see about 170 transactions in the cache on a regular basis but the DSIEXECs are only processing ~30 transactions per second (loosely extrapolating from the NgTransTotal over the time period - NgTransTotal will be discussed later, but it represents the number of original transactions prior to the DSI-S grouping them together). However, the cache may be undersized according to our desired target! With 10 DSIEXECs active and a dsi_max_xacts_in_group of 20, we would need 200 cached transactions to meet the full need.

DSIXactInGrp - This is the effective dsi_max_xacts_in_group, derived by dividing the number of ungrouped transactions as submitted by the source system by the number of transaction groups that the DSI-S created. As you can see, we are not getting anything close to our desired setting of 20 - likely some other DSI configuration value is affecting this.

MaxCachedGroups - This metric is derived by dividing CachedTrans by the number of transactions being grouped (DSIXactInGrp), which yields the number of transaction groups, at the current grouping, that are in the DSI SQT cache. If we were getting our maximum dsi_max_xacts_in_group, this would be a good indication that our SQT cache is oversized, as we have nearly twice the number of transaction groups in memory as our effective dsi_max_xacts_in_group.

However, since we are only averaging about 4 transactions per group, if we succeed in raising this effective value to even 10 (half of the target dsi_max_xacts_in_group), the number of cached groups drops to 17 (still higher than dsi_num_threads=10, though), and if we reach our target of 20, the number of cached groups would be between 8 & 9. So, the DSI SQT cache is slightly undersized for the target performance, but is oversized for the way the system is currently performing - consequently, it is some other setting that is restricting processing.

Now, let's take a look at the customer example we were looking at earlier:

Sample    Cache      MemUsed  Cache     Trans    ClosedTrans  ReadTrans  DSI.Trans  DSI.NgTrans  Cached  DSIXact  MaxCached
Time      MemUsed    AveTran  Exceeded  Removed  Total        Total      Total      Total        Trans   InGrp    Groups
19:02:07  0          1,142    0         0        2            2          2          2            0       1.0      0.0
19:07:08  1,792      2,109    0         0        1,574        1,574      1,574      1,574        0       1.0      0.0
19:12:10  3,328      2,477    0         0        1,922        1,926      1,920      1,920        1       1.0      1.0
19:17:12  0          2,483    0         0        1,012        1,030      1,030      1,030        0       1.0      0.0
19:22:13  0          2,493    0         0        1,747        1,747      1,746      1,746        0       1.0      0.0
19:27:14  143,104    2,490    0         0        1,906        1,881      1,873      1,873        57      1.0      57.0
19:32:16  2,098,432  2,273    2,413     0        4,530        3,922      3,899      3,899        923     1.0      923.0
19:37:18  2,097,920  1,579    17,820    0        10,379       10,385     10,348     10,348       1,328   1.0      1,328.0
19:42:19  2,098,432  1,561    19,378    0        10,605       10,599     10,578     10,578       1,344   1.0      1,344.0
19:47:21  2,097,920  1,573    3,069     0        5,400        5,442      5,430      5,430        1,333   1.0      1,333.0

Then the next day, it looks like the following:

Sample    Cache      MemUsed  Cache     Trans    ClosedTrans  ReadTrans  DSI.Trans  DSI.NgTrans  Cached  DSIXact  MaxCached
Time      MemUsed    AveTran  Exceeded  Removed  Total        Total      Total      Total        Trans   InGrp    Groups
19:18:31  0          1,123    0         0        3            3          3          3            0       1.0      0.0
19:23:32  1,725,696  2,179    65        0        2,023        1,708      148        1,702        791     11.5     68.7
19:28:34  1,023,232  2,468    84        0        1,738        1,860      115        1,849        414     16.0     25.8
19:33:36  1,166,592  2,478    2         0        1,081        1,060      69         1,034        470     14.9     31.5
19:38:38  2,098,432  2,482    102       0        1,598        1,417      101        1,405        845     13.9     60.7
19:43:40  2,098,432  2,481    357       0        3,760        3,748      187        3,740        845     20.0     42.2
19:48:42  2,098,944  1,760    480       0        5,800        5,574      276        5,520        1,192   20.0     59.6
19:53:44  2,098,432  1,567    1,120     0        13,120       13,100     652        13,040       1,339   20.0     66.9
19:58:46  2,097,408  1,573    996       0        11,547       11,580     579        11,580       1,333   20.0     66.6
20:03:48  2,097,664  1,844    456       0        6,593        6,772      339        6,780        1,137   20.0     56.8

Ouch!!! In the first sample (day 1), we can see we aren't doing any transaction grouping whatsoever - DSI.NgTransTotal equals DSI.TransTotal - despite the fact that dsi_max_xacts_in_group=20 and dsi_xact_group_size=65,536 (default), which should allow grouping. As a result, any DSI SQT cache above the bare minimum is excessive. But in the second sample (day 2), we can see we are grouping transactions - so perhaps the configuration was changed, or the transaction profile differs enough to change how transactions are grouped.
But rather than reducing the DSI SQT cache, we probably should start by figuring out why transaction grouping is not happening, as well as see if we can't increase the transaction rate to something above 33 transactions per second (~10,000 xacts/5 min). The last may seem like a strange comment (how could we know this is attainable?), but consider that the insert stress test target system above was a laptop and it was processing 30 transactions per second (and then barely working), while the customer system is likely a server of considerably more capacity.

Now, let's take a look at what is probably a more normal sample that illustrates the point we were making earlier about the SQT cache & DSI cache being oversized. This sample comes to us courtesy of an RS 12.1 customer who unfortunately was only collecting a few modules of their RS 12.1 system - and RS 12.1 lacks some of the more granular details around the SQT Open, Closed, Read and Truncate lists.

Sample    Source SQM   Source SQT  Source Cmds  Source SQT    Source SQT     Source SQT    DSI SQT     Dest SQM     Dest SQM  DSI
Time      CmdsWritten  CmdsTotal   MaxTran      CacheMemUsed  CacheExceeded  TransRemoved  Cache       CmdsWritten  CmdsRead  CmdsRead
21:40:46  5,524        5,524       19           57,088        0              0             13,632,512  5,510        7,776     7,866
21:42:47  7,868        7,867       19           59,648        0              0             13,632,000  7,866        8,225     8,180
21:44:48  5,797        5,795       19           59,648        0              0             13,632,256  5,795        14,008    13,999
21:46:49  324          324         19           0             0              0             13,632,000  342          18,962    18,794
21:48:50  1            0           0            0             0              0             13,632,256  0            18,615    18,205
21:50:50  2            0           0            0             0              0             13,632,512  0            27,125    26,564
21:52:51  2            0           0            0             0              0             0           0            8,684     18,078
21:54:52  3            0           0            0             0              0             0           0            0         0
21:56:53  0            0           0            0             0              0             0           0            0         0
22:02:21  6            3           3            0             0              0             0           3            3         3
22:04:22  0            0           0            0             0              0             0           0            0         0
22:06:22  844          842         132          531,200       0              0             638,720     747          747       741
22:08:23  3,192        3,191       104          481,024       0              0             8,424,960   3,187        3,187     2,873
22:10:24  8,688        8,683       105          172,288       0              0             13,632,256  8,744        8,744     5,359
22:12:25  9,411        9,407       105          406,784       0              0             12,682,240  9,357        6,873     4,298
22:14:26  1,366        1,364       106          40,192        0              0             13,632,768  1,442        3,837     4,326
22:16:26  3,075        2,869       105          442,112       0              0             13,632,768  2,999        2,999     3,516
22:18:27  6,845        0           0            442,112       0              0             13,632,768  6,871        6,322     3,664

In the above system, the sqt_max_cache_size was raised from 10MB to 13MB to attempt to get better throughput. The problem was the SQT was never using more than about 500KB of cache! Now, that doesnt mean only 500KB is necessary it means that setting it higher actually wouldnt help. In fact, as you can see all it did was allow the DSI-S to fill up 13MB of cache waiting for the DSIEXEC to catch up. The real problem is the latency at the DSIEXEC in delivering and executing the SQL at the replicate DBMS as can be seen by the lag between the destination SQM.CmdsWritten or SQM.CmdsRead and DSI.CmdsRead. Likely, the same throughput could be achieved by setting sqt_max_cache_size to 4MB and dsi_sqt_max_cache_size to 2MB.

154

Dest SQM CmdsRead

Final v2.0.1
DSI Transaction Grouping Why Group Transactions One function of the main DSI thread is to group multiple independent transactions from the primary into a single transaction group at the replicate. Consider the following illustration of the difference between the primary database transaction and the DSI transaction grouping:

Primary Database Transactions


begin tran order_tran
insert into orders values () insert into order_items values () insert into order_items values () update orders set total=

DSI Transaction Grouping


begin tran
insert into orders values () insert into order_items values () insert into order_items values () update orders set total= insert into ship_history values () update orders set status= insert into orders values () insert into order_items values () insert into order_items values () update orders set total= insert into orders values () insert into order_items values () insert into order_items values () update orders set total=

commit tran order_tran begin tran ship_tran commit tran ship_tran begin tran order_tran

Insert into ship_history values () Update orders set status= insert into orders values () insert into order_items values () insert into order_items values () update orders set total= insert into orders values () insert into order_items values () insert into order_items values () update orders set total=

commit tran order_tran begin tran order_tran

commit tran

commit tran order_tran

Figure 37 Primary vs. Replicate Transaction Nesting Impact of DSI Transaction Grouping
In the example on the right, Replication Servers DSI thread has consolidated the individual transactions into another transaction (begin/commit pair underlined) grouping the transactions together. The obvious question is Why bother doing this? The answer simply is to decrease the amount of logging on the replicate system imposed by replication and to improve the transaction delivery rate. Consider the worst-case scenario of several atomic transactions such as:
insert insert insert insert insert insert insert insert into into into into into into into into checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct checking_acct values values values values values values values values (123456789,000001,Sep (123456789,000002,Sep (123456789,000003,Sep (123456789,000004,Sep (123456789,000005,Sep (123456789,000006,Sep (123456789,000007,Sep (123456789,000008,Sep 1 1 1 1 1 1 1 1 2000 2000 2000 2000 2000 2000 2000 2000 14:20:36.321,$125.00,Chk,101) 14:20:36.322,$250.00,Chk,102) 14:20:36.323,$395.00,Chk,103) 14:20:36.324,$12.00,Chk,104) 14:20:36.325,$99.00,Chk,105) 14:20:36.326,$5.32,Chk,106) 14:20:36.327,$119.00,Chk,107) 14:20:36.328,$1132.00,Chk,108)

As you notice, these fictitious transactions all were applied during an extremely small window of time. Now the question is, without transaction grouping, what would Replication Server do? The answer is, each of the above would get turned into separate individual transactions and submitted as follows (RS functions listed vs. SQL):
rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert rs_commit rs_begin rs_insert insert for check 101

insert for check 102

insert for check 103

insert for check 104

insert for check 105

insert for check 106

155

Final v2.0.1

rs_commit rs_begin rs_insert insert for check 107 rs_commit rs_begin rs_insert insert for check 108 rs_commit

Which does not look that bad until you realize two very interesting facts: 1) the contents of the rs_commit function; and 2) how rs_commit is sent as compared to other functions. In regards to the former, rs_commit calls a stored procedure rs_update_lastcommit, which updates the corresponding row in the replication system table rs_lastcommit. As far as the second point, while this will be discussed in more detail in the next section, Replication Server does not batch the outer commit statements with the transaction batch if batching is enabled. Consequently, the replicate database would actually be executing something similar to:
begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success begin tran insert into checking_acct -- wait for success update rs_lastcommit commit transaction -- wait for success (,101)

(,102)

(,103)

(,104)

(,105)

(,106)

(,107)

(,108)

Why is this a problem? First, the amount of I/O has clearly doubled. Consequently, if the replicate system was already experiencing I/O problems, this would add to the problem. Secondly, the delivered transaction rate would not match that at the primary system. Consider each of the following primary database transaction scenarios: Concurrent User Concurrent users applied each transaction at the primary. At the replicate, only a single user is applying the transactions. So while the primary system can take full advantage of multiple CPUs, group commits for the transaction log and every other feature of ASE to improve concurrency, the replicate simply has no concurrency. Single User/Batch In this scenario, a single user applies all the transactions at the primary in a large SQL batch. At the replicate, the batching is essentially undone as each of the atomic commits results in 2 network operations per transaction. This could be significant as anyone familiar with the performance penalties of not batching SQL can attest.

156

Final v2.0.1
Single User/Atomic A single user performs each of the original inserts using a single atomic transaction per network call. While the replicate might appear to be similar, consider the following. As ASE performs each I/O the user process is put to sleep. As a result, the replicate system with twice the i/os will spend twice as much time sleeping, consequently halving its ability to process transactions. Simply, transaction batching is critical to replication performance although it can be an issue with parallel or multiple DSIs as discussed later. Key Concept #15: Transaction grouping reduces I/O caused by updating replication system tables and the corresponding logging overhead at the replicate system. This also improves throughput as the replication process within the replicate database server spends less time waiting for I/O completion. While we can see the benefits of this, some may have been quick to notice that the individual transactions seem to have gotten lost. Actually, they are still there and tracked. One reason for this is that if any individual statement in the above group of transactions fail, the entire group is rolled back and the individual transactions submitted until the point of failure (again). So why didnt RS engineering simply submit it as nested transactions? Several reasons: The nested commits would have prevented parallel DSIs from working at all as it would have guaranteed contention on rs_lastcommit Not all DBMSs support nested transactions (i.e. ODBC interfaces to flat files) Rolling back a nested transaction is not possible (read the ASE docs carefully you can rollback to a savepoint, but not a nested transaction described later in procedure replication).

DSI Transaction Grouping Rules Unfortunately, not every transaction can be grouped together. A transaction group will end any time one of the following conditions is met: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. There are no more transactions in the DSI queue. The predefined maximum number of transactions allowed in a group has been reached. The current or the next transaction will make the total size of the transactions (in bytes) exceed the configured group size. The next transaction is from a different origin. The current or the next transaction is on disk. The current or the next transaction is an orphan transaction. The current or the next transaction is a rollback. The current or the next transaction is a subscription (de)materialization transaction marker. The current or the next transaction is a subscription (de)materialization transaction queue end marker. The current or the next transaction is a dump/load transaction. The current or the next transaction is a routing transaction. The current or the next transaction has no begin command (i.e., it is a special RS-to-RS transaction). The next transaction has a different user/password. The first transaction has IGNORE_DUP_M mask on. A transaction partitioning rule determines that the next transaction cannot be grouped with the existing group. A timeout expires

While this appears to be quite a long list, the rules for grouping transactions can simply be paraphrased into the rule that in order for transactions to be batched together, all of the following six conditions must be met. 1. 2. 3. 4. 5. Transactions cached in the DSI/SQT closed queue. Transactions from the same origin. Transactions will be applied at the replicate with the same username and password. The transaction group size is limited by the lesser of dsi_xact_group_size and dsi_max_xacts_in_group. Aborted, database/log dump, orphan, routing, and subscription transactions cannot be grouped.

157

Final v2.0.1
6. A transaction partitioning rule determines that the next transaction cannot be grouped with the existing group.

The fourth condition will be discussed in the next section on tuning transaction grouping. The fifth condition is due to system level reprocessing or ensuring integrity of the replicate system during materialization of subscriptions or routes and is rare consequently not discussed. The last condition will be discussed in the section on parallel DSIs later in this document. This leaves only the first three conditions that apply to most transactions. While the first condition makes sense simply from a performance aspect, the second condition requires some thought, while the third is fairly easy. Earlier, one of the conditions which causes transactions not to be grouped was stated as The next transaction has a different user/password, which was summarized above that transactions grouped together must use the same user/password combination. Some find this confusing, assuming that it refers to the user who committed the transaction at the primary system. It does not. It refers instead to the user that will apply the transaction at the replicate. At this juncture, many might say Wait a minute, I thought the maintenance user applies all the transactions? This is mostly true. During normal operations, the maintenance user will be the login used to apply transactions at the replicate thereby allowing full transaction grouping capabilities. However, some transactions are not applied by the maintenance user. For example, in Warm Standby systems, DDL transactions that are replicated are executed at the standby system by the same user who executed the DDL at the primary. This assures that the object ownership is identical. Additionally, Asynchronous Request Functions (discussed later) are also applied by the same user as executed at the originating system. In this latter case, it has less to do with the specific user and more to do with ensuring that the transaction is recorded using a different user login than the maintenance user thereby allowing the changes to be re-replicated back to the originating or other systems without requiring the RepAgent to be configured for send_maint_xacts_to_replicate. In short, it should be extremely rare and possibly not at all that a transaction group is closed early due to a different user/password. Now that we understand this, the next question might be Why cant we group transactions from different source databases? The reason that the transactions have to be from the same origin is due to the management of the rs_lastcommit table and how the DSI controls assigning the OQID for the grouped transaction. When the DSI groups transactions together, it uses the last grouped transactions begin record to determine the OQID for the OQID for the grouped transaction. The reason is that on recovery, not using the last transactions OQID could result in duplicate row errors or an inconsistent database. Consider a default grouping of 20 transactions into a single group that are applied to the replicate database server and then immediately the replicate database shuts down. On recovery, as most people are aware, the Replication Server will issue a call to rs_get_lastcommit to determine the last transaction that was applied. Remember, the transactions are grouped in memory not in the stable queue. Consequently, if the OQID of the first transaction was used, then the first 19 transactions would all be duplicates and not detected as such by the Replication Server as that was the whole reason for the comparison of the OQID in the first place!! 
As a result, the first 19 transactions would either cause duplicate key errors (if you are lucky) or database inconsistencies if using function strings. For that reason, when transactions are grouped together, the OQID of the last transactions begin record is used for the entire group. Now then, following that logically along, since the rs_commit function updates only a single row in the rs_lastcommit table for the source database of the transaction, then all of the transactions grouped together must be from the same source. Note that currently, the DSI does not simply collect all of the closed transactions from the same source. If the third transaction in a series is from a different source database, then the group will end at two even if the next four transactions are from the same source database as the first two. As you can imagine, a fragmented queue with considerable inter-dispersed transactions from different databases, the DSI will be applying transactions in very small groups. As mentioned earlier, the smaller the group size, the less efficient the replication mechanism due to rs_lastcommit and processing overhead, which leads us to the following concept: Key Concept #16: Outbound queues that are heavily fragmented with inter-dispersed transactions from different source databases will not be able to effectively use transaction grouping This may or may not be an issue. As you will see later, if using parallel DSIs and a low dsi_max_xacts_in_group to control concurrency, this mix of transactions may not be an issue - especially if dsi_serialization_method is set to single_transaction_per_origin. For non-parallel DSI implementations, it does suggest that increasing dsi_max_xacts_in_group and similar parameters in such cases may prove fruitless.

158

Final v2.0.1

Tuning DSI Transaction Grouping Prior to Replication Server 12.0, however, there really wasnt a good way to control the number of transactions in a batch. The reason was that the only tuning parameter available attempted to control the transaction batching by controlling the transaction batch size in bytes a difficult task with tables containing variable width columns and considering the varying row sizes of different tables. With version 12.0 came the ability to explicitly specify the number of original transactions that could be grouped into a larger transaction. These connection level configuration parameters are listed below. Parameter (Default) dsi_xact_group_size Default: 65,536; Recommended: 2,147,843,647 (max) dsi_max_xacts_in_group Default: 20; Max: 100; Recommended: see text Explanation The maximum number of bytes, including stable queue overhead, to place into one grouped transaction. A grouped transaction is multiple transactions that the DSI applies as a single transaction. A value of "-1" means no grouping. Specifies the maximum number of transactions in a group, allowing a larger transaction group size, which may improve data latency at the replicate database. The default value is a good starting point lower generally should be considered if primarily updates are replicated and using parallel DSIs and contention is an issue. The number of bytes available for managing the SQT open, closed, read and truncate queues. This impacts DSI SQT processes by also being a limiter on the transaction batches that are cached in memory waiting for the DSIEXECs. For example, if the DSI SQT cache is too small, the DSIEXECs may not be able to group transactions to the number specified in dsi_xact_group_size. Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. Valid values are: origin, origin_sessid (if source is ASE 12.5.2+), time, user, name and none. This setting will be described in detail in the section on parallel DSIs.

dsi_sqt_max_cache_size Default : 0 ; Recommended : see text

dsi_partitioning_rule Default: none; Valid Values: origin, origin_sessid, time, user, name, and none

At first, the dsi_xact_group_size may appear to be fairly large. Remember, however, this includes stable queue overhead which can be significant as the queue may require 4 times the storage space as the transaction log space. Additionally, it can be a bit difficult controlling the number of transactions with this parameter due to the varying row widths of different database tables, etc. As a result, Sybase added the dsi_max_xacts_in_group parameter and suggests that you set dsi_xact_group_size to the maximum and control transaction grouping using dsi_max_xacts_in_group. If you dont adjust dsi_xact_group_size, the lesser of the two limits will cause the transaction grouping to terminate. On the other hand, dsi_max_xacts_in_group can be raised from the default of 20 if using a single DSI and perhaps should be if system is performing a lot of small transactions. However, in parallel or multiple DSI situations, this parameter may need to be lowered to reduce inter-thread contention. While this will be discussed later in the section on parallel DSIs, contention is likely to occur in update heavy environments, or inserts with isolation level three due to next key (range) or infinity locks. A good starting point for dsi_sqt_max_cache_size is to figure on 500-750KB per DSIEXEC thread in use with a minimum of 1MB. This may seem like an awfully small amount, but remember from the earlier example that 2MB was enough to cache ~30 transaction groups for one customer. As mentioned though, from this starting point, you will need to monitor the approximate transactions and transaction groups in cache and increase dsi_sqt_max_cache_size only when it can no longer hold 2 * dsi_max_xacts_in_group * num_dsi_threads transactions. DSI Grouping Monitor Counters To help determine the efficiency of DSI transaction grouping, the following monitor counters are available. Counter CmdGroups Explanation Total transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed.

159

Final v2.0.1

Counter CmdGroupsCommit CommitsInCmdGroup GroupsClosedBytes GroupsClosedLarge GroupsClosedMixedMode

Explanation Total command groups committed successfully by a DSI thread. Total transactions in groups sent by a DSI thread that committed successfully. Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size. Total transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large. Total transaction groups closed by a DSI thread because the current group contains asynchronous stored procedures and the next tran does not or the current group does *not* contain asynchronous stored procedures and the next transaction does. Total asynchronous stored procedure transaction groups closed by a DSI thread due to the next tran user ID or password being different from the ones for the current group. Total trxn groups closed by a DSI due to no open group from the origin of the next transaction (i.e. We have a new origin (source db) in the next trxn), or the RS scheduler forced a flush of the current group from the origin leaving no open group from that origin. Note that the highlighted condition could cause transaction groups to be flushed prior to reaching dsi_max_xacts_in_group and likely will be the most common cause for transactions closed identified by this metric. Total transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command - whether 'skip', 'display' or execute option chosen. Total transaction groups closed by a DSI thread due to the next transaction being qualified as special orphan, rollback, marker, duplicate, ddl, etc.

GroupsClosedMixedUser

GroupsClosedNoneOrig

GroupsClosedResume

GroupsClosedSpecial

GroupsClosedTranPartRule Total transaction groups closed by a DSI thread because of a Transaction Partitioning rule. GroupsClosedTrans GroupsClosedWSBSpec Total transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group. Total transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special - empty, or a enable replication marker or subscription materialization marker or ignored due to duplication detection, etc. Total non-grouped transactions read by a DSI Scheduler thread from an outbound queue. Total transaction groups forced to wait for another group to complete (processed serially based on Transaction Partitioning rule). Total transactions contained in transaction groups sent by a DSI thread. The number of trxns in a group is added to this counter each time a 'begin' for a grouped transaction is executed. Total transactions applied successfully to a target database by a DSI thread. This includes transactions that were committed or rolled back successfully. Total transaction groups generated by a DSI Scheduler while reading the outbound queue. This counter is incremented each time a new transaction group is started. If grouping is disabled, this is total transactions in queue. This counter is incremented each time the main DSI Scheduler body yields following the dispatch of closed transaction groups to DSI Executor threads.

NgTransTotal PartitioningWaits TransInCmdGroups

TransSucceeded TransTotal

YieldsScheduler

160

Final v2.0.1
In RS 15, the counters change slightly, mainly with the addition of more timing counters: Counter DSIReadTranGroups DSIReadTransUngrouped DSITranGroupsSucceeded Explanation Transaction groups read by the DSI. If grouping is disabled, grouped and ungrouped transaction counts are the same. Ungrouped transactions read by the DSI. If grouping is disabled, grouped and ungrouped transaction counts are the same. Transaction groups applied successfully to a target database by a DSI thread. This includes transactions that were successfully committed or rolled back according to their final disposition. Grouped transactions failed by a DSI thread. Depending on error mapping, some transactions may be written into the exceptions log. Grouped transactions retried to a target server by a DSI thread. When a command fails due to data server errors, the DSI thread performs postprocessing for the failed command. This counter records the number of retry attempts. Transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed. Transactions contained in transaction groups sent by a DSI thread. Transactions committed successfully by a DSI thread.

DSITransFailed DSITransRetried DSIAttemptsTranRetry

DSITranGroupsSent

DSITransUngroupedSent DSITranGroupsCommit

DSITransUngroupedCommit Transactions in groups sent by a DSI thread that committed successfully. DSICmdsSucceed DSICmdsRead GroupsClosedBytes GroupsClosedNoneOrig Commands successfully applied to the target database by a DSI. Commands read from an outbound queue by a DSI. Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_xact_group_size. Trxn groups closed by a DSI due to no open group from the origin of the next trxn. I.e. We have a new origin in the next trxn, or the Sched forced a flush of the current group from the origin leaving no open group from that origin. Asynchronous stored procedure transaction groups closed by a DSI thread due to the next tran user ID or password being different from the ones for the current group. Transaction groups closed by a DSI thread because the current group contains asynchronous stored procedures and the next tran does not or the current group does *not* contain asynchronous stored procedures and the next transaction does.

GroupsClosedMixedUser

GroupsClosedMixedMode

GroupsClosedTranPartRule Transaction groups closed by a DSI thread because of a Transaction Partitioning rule. GroupsClosedTrans CmdGroupsRollback RollbacksInCmdGroup GroupsClosedLarge Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group. Command groups rolled back successfully by a DSI thread. Transactions in groups sent by a DSI thread that rolled back successfully. Transaction groups closed by a DSI thread due to the next transaction satisfying the criteria of being large.

161

Final v2.0.1

Counter GroupsClosedWSBSpec

Explanation Transaction groups closed by a DSI thread for a Warm Standby due to the next transaction being special - empty, or a enable replication marker or subscription materialization marker or ignored due to duplication detection, etc. Transaction groups closed by a DSI thread due to the next transaction following the execution of the 'resume' command - whether 'skip', 'display' or execute option chosen. Transaction groups closed by a DSI thread due to the next transaction being qualified as special - orphan, rollback, marker, duplicate, ddl, etc. Time spent by the DSI/S finding a group to dispatch. Time spent by the DSI/S dispatching a regular transaction group to a DSI/E. Time spent by the DSI/S dispatching a large transaction group to a DSI/E. This includes time spent finding a large group to dispatch. Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. These DSI/E threads have just completed their transaction. Time spent by the DSI/S putting free DSI/E threads to sleep. Time spent by the DSI/S loading SQT cache.

GroupsClosedResume

GroupsClosedSpecial DSIFindRGrpTime DSIDisptchRegTime DSIDisptchLrgTime DSIPutToSleep DSIPutToSleepTime DSILoadCacheTime

Lets take a look at some of these counters and how the can be used from the outbound queue/DSI perspective as well as clarifying some of these that appear to be confusing. Other than the SQT aspects, the most common counters in the DSI include (15.0 formulas/names in parenthesis): CmdsRead, TransSucceeded (DSICmdsRead, DSITranGroupsSucceeded) XactsInGrp = NgTransTotal / TransTotal (DSIReadTransUngrouped/DSIReadTranGroups) GroupsClosedBytes, GroupsClosedLarge GroupsClosedNoneOrig, GroupsClosedTrans GroupsClosedMixedUser, GroupsClosedMixedMode While there are others, these are the most common. The first set is mostly (again) monitoring type counters CmdsRead should match SQM CmdsWritten (for the outbound queue) but likely wont as the most frequent source of latency is the DSIEXEC due the replicate database. XactsInGrp, on the other had is clearly tied to configuration settings specifically dsi_max_xacts_in_group. By comparing the number of ungrouped transactions (NgTransTotal) to the number of grouped transactions (TransTotal) we can observe much transaction grouping is going on. One of the keys to parallel transaction use is to increase this parameter as much as possible (until contention starts) at lower settings, it is not likely that too many threads will actually be used. Even without parallel DSI, considering the overhead during the commit phase (updating rs_lastcommit, etc.), the more the merrier. The next sets of counters will explain why a group of transactions were closed. The first set point to likely configuration issues. If you see very many GroupsClosedBytes, it is likely because you have not adjusted dsi_xact_group_size from its default of 64K to something more realistic such as 256K. As a result, no matter what you have dsi_max_xacts_in_group set to, a low value here will prevent the grouping. Similarly, the default value for dsi_large_xact_size of 100 is simply too small and in fact, arguably large transactions are not effective in any case so you should set this to the upper limit of 2 billion and forget about it. GroupsClosedNoneOrig and GroupsClosedTrans will be the most common causes, so they can be ignored if tuned properly. The first while it may refer to the fact that the next transaction is from a different origin (corporate rollup), the most often it is referring to the fact the scheduler forced a flush. The second is incremented whenever a group is closed due to reaching dsi_max_xacts_in_group. A lot of these may indicate that dsi_max_xacts_in_group is too low (the default of 20 is typically plenty, but someone may have decreased it). However, if the next set appears, it may provide a reason why even though you have a well defined dsi_max_xacts_in_group, it isnt being used. The first (GroupsClosedMixedUser) happens whenever the DSI has to connect as another user vs. the maintenance user typically DDL commands. The second (GroupsClosedMixedMode) refers to asynchronous request functions. There are other GroupClosed counters, but the point is to avoid GroupsClosedBytes and if GroupsClosedNoneOrig or GroupsClosedTran are not where expected, you may have to look to the others for the explanation.

162

Final v2.0.1
Lets take a look at how these might work by looking at the earlier insert stress test. CacheMem Used DSI CmdGroups Yields Scheduler 0 103 389 418 433 414 436 416 421 396 DSIXact InGrp

11:37:47 11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 11:38:52 11:39:03 11:39:14 11:39:25

0 2,097,408 2,099,712 2,099,200 2,097,920 2,098,432 2,101,504 2,100,224 2,099,968 2,100,224

0 195 171 171 171 171 171 171 171 171

0.0 2.7 4.6 4.7 4.4 4.7 4.8 4.6 4.6 4.3

0 17 63 68 75 67 64 68 68 67

0 51 289 322 334 315 310 319 316 291

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 4.8 1.5 6.7 3.0 1.6 2.9 2.9 6.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 104.8 93.5 98.5 93.3 98.5 100.0 97.1 97.1 95.5

The only derived columns above are the same in the previous example from the SQT in fact the first four columns are repeated partially to put in context some of the others. As you may remember, the dsi_max_xacts_in_group was 20 and we are hoping to determine (if we can) why the actual value is more in the 4-5 range than close to 20. While there are additional DSI metrics for GroupsClosed______ not listed above, some of the more common reasons are listed in the above table. Note especially, that the GroupsClosed______ metrics are presented as a percentage (of 100%) and not the actual values (rationale is that it is easier to recognize the primary reasons this way) hence the blue color highlighting the metrics above. CachedTrans & DSIXactInGrp Repeated from the DSI SQT cache metrics, these derived values are calculations of the number of transactions in the DSI SQT cache (based on average memory used per transaction) and the average number of transactions grouped together by the DSI thread respectively. DSI.CmdGroups & TransInCmdGroups These metrics report the actual number of transaction groups sent by the DSI to the DSI EXEC and operate very similarly to the metrics DSI.TransTotal and NgTransTotal. Slight differences may occur, however, as variable substitution may cause the original grouping to exceed the byte limit on the transaction group. One way to think of the differences between TransTotal/NgTransTotal and CmdGroups/TransInCmdGroups is that TransTotal/NgTransTotal represents the planned transaction grouping where as CmdGroups/TransInCmdGroups represent the actual. To that extent, DSIXactInGrp (a derived statistic based on dividing NgTransTotal by TransTotal) represents a planned transaction grouping ratio vs. actual while the actual may be slightly deviated, it is well within a margin of error. GroupsClosedBytes This counter is incremented any time the transaction group is closed because the number of bytes in the transaction group exceed dsi_xact_group_size. In the case above, the dsi_xact_group_size was 262,144 (256KB) which although much smaller than the suggested maximum setting, did not contribute to the reason the transaction grouping was less than desired. GroupsClosedTrans Similar to above, this counter is incremented anytime a group is closed due to the number of transactions exceeding dsi_max_xacts_in_group. Interestingly, we see that 1.5-7% of the groups reached the maximum of 20 so despite the computed average of 4 transactions per group, there are some (few) that do reach the maximum and likely many in between. GroupsClosedLarge This counter is incremented any time a group of transactions is closed due to the fact that the next transaction is considered large either because it exceeds dsi_large_xact_size or because it involves text/image data (which automatically qualifies it as a large transaction). GroupsClosedOrig This counter is incremented any time a group is closed because the next transaction to be delivered to the destination comes from a different source database (think corp rollup). In addition and a more common cause in WS systems - this counter is incremented when the DSI-S cant find an open transaction group from the same origin a situation usually caused when the scheduler forces the DSI to close pending transaction groups and send them to the DSIEXECs. That is the case here as the system in

GroupsClosed Resume 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

CachedTrans

Sample Time

TransInCmd Groups

Groups ClosedLarge

Groups ClosedTrans

Groups ClosedBytes

Groups ClosedOrig

163

Final v2.0.1
question was a WS implementation in isolation so no other connection existed to cause this counter to be incremented. We just need to determine what is driving the scheduler GroupsClosedResume This counter is incremented any time a group is closed due to the next transaction following a resume command. The reason for this is that often times a transaction group needs to be rolled back and applied as individual transactions up to the point of error and then the DSI is suspended. As a result, when the DSI is resumed, the DSI rebuilds transaction groups from that point. YieldsScheduler This metric is illustrated here to show how often the DSI is yielding after a group has been submitted to a DSI EXEC. However, we see that the number of yields is 4-6x the number of transaction groups which suggests that the DSI was repeated checking to see if the DSI EXEC was finished with the current group and ready for the next. From the above, it looks like the scheduler is closing transaction groups prior to reaching dsi_max_xacts_in_group but otherwise no real indication of what the cause may be. Perhaps other DSI or DSI EXEC counters will help us learn why the scheduler is doing this but we will look at them late. For now, lets take a look at the customer examples from the 2 different days. The first days counter values for DSI grouping are illustrated below: GroupsClosed Resume 100.0 100.0 100.1 98.3 100.1 100.1 115.6 99.9 100.1 99.3 CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans

DSI CmdGroups

Groups ClosedBytes

CacheMem Used

Groups ClosedOrig

19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21

0 1,792 3,328 0 0 143,104 2,098,432 2,097,920 2,098,432 2,097,920

0 0 1 0 0 57 923 1,328 1,344 1,333

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0

2 1,574 1,920 1,030 1,746 1,873 3,899 10,348 10,578 5,430

2 1,574 1,920 1,030 1,746 1,873 3,899 10,348 10,578 5,430

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

1,951 2,528 1,397 2,281 2,452 6,055 21,396 21,794 8,281

Almost instantly we see that most of the transactions were closed because the next transaction followed a resume command rather odd and suggestive of a significant number of errors. Some that are observant might have noted that some of these percentages are above 100% - remember, as mentioned earlier transaction groups are automatically tried individually until the individual transaction with the problem re-occurs. It also simply could be due to calculating the percentage based on DSI.TransTotal vs. DSI.CmdGroups. Note as well that the ratio of YieldsScheduler to transactions ranges from slightly more than 1 to 2. Now, lets look at the next day: GroupsClosed Resume CachedTrans Sample Time TransInCmd Groups Groups ClosedLarge Groups ClosedTrans

DSI CmdGroups

Groups ClosedBytes

CacheMem Used

Groups ClosedOrig

19:18:31 19:23:32 19:28:34

0 1,725,696 1,023,232

0 791 414

1.0 11.5 16.0

3 148 115

3 1,702 1,849

0.0 0.0 0.0

0.0 58.8 69.6

0.0 0.0 0.0

100.0 52.7 25.2

0.0 0.0 0.0

164

Yields Scheduler 9 702 638

DSIXact InGrp

Yields Scheduler 6

DSIXact InGrp

Final v2.0.1

GroupsClosed Resume 0.0 0.0 0.0 0.0 0.0 0.0 0.0

CachedTrans

Sample Time

TransInCmd Groups

Groups ClosedLarge

Groups ClosedTrans

DSI CmdGroups

Groups ClosedBytes

CacheMem Used

Groups ClosedOrig

19:33:36 19:38:38 19:43:40 19:48:42 19:53:44 19:58:46 20:03:48

1,166,592 2,098,432 2,098,432 2,098,944 2,098,432 2,097,408 2,097,664

470 845 845 1,192 1,339 1,333 1,137

14.9 13.9 20.0 20.0 20.0 20.0 20.0

69 101 187 276 652 579 339

1,034 1,405 3,740 5,520 13,040 11,580 6,780

0.0 0.0 0.0 0.0 0.0 0.0 0.0

69.6 72.3 100.0 104.3 100.2 99.7 97.3

0.0 0.0 0.0 0.0 0.0 0.0 0.0

33.3 36.6 0.0 0.0 0.0 0.0 0.0

1,271 1,418 2,952 2,702 1,790

Note in this case, the transactions at the beginning are largely closed due to GroupsClosedOrig likely due to the same scheduler driven reasons as the insert test. However, very quickly the reasons shift to GroupsClosedTrans as the DSIXactInGrp climbs and eventually reaches the dsi_max_xacts_in_group of 20. DSIEXEC Function String Generation DSI Executer Processing While the DSI is responsible for SQT functions and transaction grouping, it is the responsibility of the DSI Executer (DSI-E) threads to actually perform the SQL string generation, command batching and exception handling. The key to the DSI-E is that the DSI-S simply passes the list of transaction ids in the group to it. The DSI-E then reads the actual transaction commands from the DSI SQT cache region. If you remember from the earlier discussion on LTL, the replicated functions (rs_insert, rs_update, rs_delete, etc.) actually are identified by the Replication Agent. This helps the rest of the Replication Server as it does not have to perform SQL language parsing (which is not in the transaction log anyhow something many people have a hard time understanding the transaction log NEVER logs the SQL). However, we need to send ASCII language commands to the replicate system (or RPCs). As a result, the DSI-E thread execution looks like the following flow diagram.

Yields Scheduler 372 588

DSIXact InGrp

165

Final v2.0.1

Transaction group from DSI

Translate replicated functions into SQL via fstring definitions

Break transaction into dsi_cmd_batch_size Batches of SQL

Send SQL batch to Replicate database

No Yes
Stop Errors?

Rollback transaction

No

Done?

Yes
Suspend connection Commit Transaction

Figure 38 DSI Executer SQL Generation and Execution Logic


Note that in the above diagram, only stop errors cause the DSI to suspend. If you remember, some error actions such as ignore (commonly set to handle database change, print and other information messages), retry, etc. allow the DSI to continue uninterrupted. DSI Executer Performance Beyond DSI command batching (next section), the tuning parameters available for the DSI Executer are listed in the following table (other parameters are available, however, do not specifically address performance throughput). Note that parameters specific to parallel DSI performance are not listed here. Parameter (Default) Replication Server scope fstr_cachesize (obsolete/deprecated) Obsolete and deprecated. In RS 12.0, it was decided that this was not necessary (possibly viewed as duplicative as function string RSSD rows would be in STS cache as well) and the parameter was made obsolete (although still in the documentation). Mentioned here as often questions are asked whether changing this would help short answer No. Long answer is this was deprecated by sts_full_cache_xxxxx. (essentially). The total number of rows cached for each cached RSSD system table. Increasing this number to the number of active replication definitions prevents Replication Server from executing expensive table lookups. From a DSI Executer performance perspective, the STS cache could be used to hold RSSD tables such as rs_systext that hold the function string definitions. Of all the parameters below, this one is probably the most critical as insufficient STS cache would result in network and potentially disk i/o in accessing the RSSD. Explanation

sts_cachesize Default: 100; Suggested: 1000

166

Final v2.0.1

Parameter (Default) sts_full_cache_xxxxx Connection scope batch Default: on; Recommended: on

Explanation For DSI performance the list of tables that should be fully cached include rs_objects, rs_columns, and rs_functions

Specifies how Replication Server sends commands to data servers. When batch is "on," Replication Server may send multiple commands to the data server as a single command batch. When batch is "off," Replication Server sends commands to the data server one at a time. This is on for ASE and should be on for any system that supports command batching due to performance improvements of batching. Some heterogeneous replicate systems such as Oracle do not support command batching, and consequently this parameter needs to be set to off. Note that for Oracle, we are referring to the actual DBMS engine as of 9i and 10g, batch SQL is handled outside the DBMS engine by the PL/SQL engine. Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert, delete, and so on). For single DSI systems, this value should be on (the default). If using parallel DSIs and wait_for_commit, the value should be on as well. For most other parallel DSI serialization methods (i.e. wait_for_start) this value should be off. The rationale for off is that the DSIEXEC will post the Batch Began message quicker to the DSI allowing the other parallel threads to begin quicker than waiting for the begin and the first command batch (and possibly only command batch) to execute before the message is sent. The maximum size of a network packet. During database communication, the network packet value must be within the range accepted by the database. You may change this value if you have an Adaptive Server that has been reconfigured for max network packet size minimally at the desired size or greater. A recommended packet size of 16,384 on high speed networks or tuned to network MTU on lower speed networks is appropriate. Values less than 2,048 are suspect and should only be used if the target system does not support larger packet sizes. On ASE 15 systems, the connection will automatically be bumped to 2048 as the minimum packet size. The maximum number of bytes that Replication Server places into a command batch. You need to be careful with this setting as too high of a setting may exceed the stack space in the replicate database engine. However, it should be at least the same as the db_packet_size if not doubled. Specifies whether triggers should fire for replicated transactions in the database. Set to "off" to cause Replication Server to set triggers off in the Adaptive Server database, so that triggers do not fire when transactions are executed on the connection. By default, this is set to "on" for all databases except standby databases. Arguably should be off for all databases, although caution should be exercised when replicating procedures. On is the default as it is the typical safe approach that Replication Server defaults assume, however, there should be compelling reasons not to have this turned off including security as the replication maintenance user could be viewed as a trusted agent fully supportable in Bell-Lapadula and other NCSC endorsed security policies. Additionally, having it on is no guarantee of database consistency as will be illustrated later in the discussion on triggers. Simply put if you leave this on you WILL have RS latency & performance problems.

batch_begin Default: on; Recommended: see text

db_packet_size Default: 512; Recommended: 8192 or 16384

dsi_cmd_batch_size Default: 8192; Recommended: 32768 dsi_keep_triggers Default: on for most off for WS; Recommended: off

167

Final v2.0.1

Parameter (Default) dsi_replication Default: off for most on for WS

Explanation Specifies whether or not transactions applied by the DSI are marked in the transaction log as being replicated. When dsi_replication is set to "off," the DSI executes set replication off in the Adaptive Server database, preventing Adaptive Server from adding replication information to log records for transactions that the DSI executes. Since these transactions are executed by the maintenance user and, therefore, not usually replicated further (except if there is a standby database), setting this parameter to "off" avoids writing unnecessary information into the transaction log. dsi_replication must be set to "on" for the active database in a warm standby application for a replicate database, and for applications that use the replicated consolidated replicate application model. The reason this is mentioned as a possible performance enhancement is its applicability in multiple DSI situations discussed later.

Some of these, such as the STS and other server level configurations, have been discussed before and have been included here simply for completeness. Additionally, several have to do with command batching which is discussed in the next section. Those that are highlighted are specifically applicable to DSI Executer performance. DSI EXEC DML Monitor Counters Several monitor counters in the DSIEXEC module help analyze throughput, transaction characteristics and general function string generation issues. Counter Explanation

Command (DML or DDL Related) CmdsApplied CmdsSQLDDLRead DeletesRead ExecsGetTextPtr ExecsWritetext InsertsRead UpdatesRead Function String Generation DSIEFSMapTimeAve DSIEFSMapTimeLast DSIEFSMapTimeMax Average time taken, in 100ths of a second, to perform function string mapping on a command. Time, in 100ths of a second, to perform function string mapping on the last command. The maximum time taken, in 100ths of a second, to perform function string mapping on a command. Total commands applied by a DSIEXEC thread. Total SQLDDL commands processed by a DSI DSIEXEC thread. Total rs_delete commands processed by a DSIEXEC thread. Total invocations of function rs_get_textptr by a DSIEXEC thread. This function is executed each time the thread processes a writetext command. Total rs_writetext commands processed by a DSIEXEC thread. Total rs_insert commands processed by a DSIEXEC thread. Total rs_update commands processed by a DSIEXEC thread.

The RS 15.0 equivalent counters are: Counter Read From SQT Cache DSIEReadTime DSIEWaitSQT DSIEGetTranTime The amount of time taken by a DSI/E to read a command from SQT cache. The number of times DSI/E must wait for the command it needs next to be loaded into SQT cache. The amount of time taken by a DSI/E to obtain control of the next logical transaction. Explanation

168

Final v2.0.1

Counter DSIERelTranTime DSIEParseTime

Explanation The amount of time taken by a DSI/E to release control of the current logical transaction. The amount of time taken by a DSI/E to parse commands read from SQT.

Command (DML or DDL Related) TransSched UnGroupedTransSched DSIECmdsRead DSIECmdsSucceed BeginsRead CommitsRead SysTransRead CmdsSQLDDLRead InsertsRead UpdatesRead DeletesRead ExecsWritetext ExecsGetTextPtr Function String Generation DSIEFSMapTime Time, in 100ths of a second, to perform function string mapping on commands. Transactions groups scheduled to a DSIEXEC thread. Transactions in transaction groups scheduled to a DSIEXEC thread. Commands read from an outbound queue by a DSIEXEC thread. Commands successfully applied to the target database by a DSI/E. 'begin' transaction records processed by a DSIEXEC thread. 'commit' transaction records processed by a DSIEXEC thread. Internal system transactions processed by a DSI DSIEXEC thread. SQLDDL commands processed by a DSI DSIEXEC thread. rs_insert commands processed by a DSIEXEC thread. rs_update commands processed by a DSIEXEC thread. rs_delete commands processed by a DSIEXEC thread. rs_writetext commands processed by a DSIEXEC thread. Invocations of function rs_get_textptr by a DSIEXEC thread. This function is executed each time the thread processes a writetext command.

As you can see, the largest change is that the DSIEXEC has more counters tracking the time spent retrieving the commands/command groups from the SQT cache in the DSI thread. An important aspect to these counters is to remember that they are per DSI EXEC thread so with parallel DSI enabled, more than one value will be recorded. As mentioned earlier in the general discussion about the RS M&C feature, the rs_statdetail.instance_id column corresponds to the thread number for each value allowing us to also track how efficiently each thread is utilized. For now, we will focus on just the function generation and DML aspects later we will take a look at the parallel DSI aspect of the problem. However, it does mean that if looking across all the DSIEXECs, we will need to aggregate the counter values per sample period. Some of the more useful general counters include: CmdsApplied (DSICmdsSucceeded), CmdsPerSec=CmdsApplied/seconds InsertsRead, UpdatesRead, DeletesRead ExecsWritetext, ExecsGetTextPtr These are fairly obvious as they help us establish rate information for throughput as well as which commands were being executed. The last set refer more to text/image processing and can be used to develop profiles (i.e. a relative indication of the size of the text/image is WritesPerBlob=ExecsWritetext/ExecsGetTextPtr). While these are interesting to monitor (and the number of updates may give a clue to how effective minimal column replication might be), the real effort at this stage is command batching. Lets take a look at how these counters can be used. First, lets consider the insert stress test:

169

Final v2.0.1

CmdsApplied

UpdatesRead

Sample Time

CmdsPerSec

DeletesRead

InsertsRead

MsgChecks

Execs Writetext

11:37:57 11:38:08 11:38:19 11:38:30 11:38:41 11:38:52 11:39:03 11:39:14 11:39:25 11:39:36

305 2,030 2,234 2,267 2,150 2,235 2,253 2,212 2,107 2,414

27 203 203 226 195 203 204 201 191 219

200 1,450 1,595 1,620 1,536 1,595 1,609 1,580 1,504 1,725

0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0

94 541 567 640 571 556 580 587 584 654

As you can see the cumulative throughput was ~200 commands/sec across all the DSIs and it was all inserts (no surprise). The disparity between CmdsApplied and InsertsRead is simple the begin tran/commit tran commands are counted as well. And interesting statistic is the message checks per command which is averaging close to 25%. Note that the test machine can easily hit 900 inserts/sec using RPC calls and 200 inserts/sec using language commands consequently the 200 inserts/sec rate may be the max we can get out of the replicate ASE using a Warm Standby configuration. Later when we look at the timing information, we will see statistics that help support that it is the replicate ASE that is the bottleneck. Because the insert stress test is rather simplistic, lets next take a quick look at the first day of the customers data that we have been looking at before we discuss the counters: CmdsApplied UpdatesRead Sample Time CmdsPerSec

DeletesRead

InsertsRead

MsgChecks

Execs Writetext

19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16 19:37:18 19:42:19 19:47:21

6 6,292 7,679 4,119 6,983 7,491 15,595 41,343 42,255 21,711

0 20 25 13 23 24 51 137 140 72

0 615 0 0 0 0 526 10,347 10,469 5,431

2 1,914 3,839 2,059 3,491 3,745 6,841 1 270 9

0 615 0 0 0 0 430 10,299 10,360 5,411

0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0

8 4,802 5,909 3,210 5,357 5,738 11,711 31,044 31,734 16,299

Now, lets take a look at some of these metrics for the most part the description will concentrate on the customer numbers and only refer back to the insert test when necessary.

170

MsgChks PerCmd 1.3 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7 0.7

ExecsGet TextPtr

MsgChks PerCmd 0.3 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2

ExecsGet TextPtr

Final v2.0.1
CmdsApplied CmdsApplied reports the number of SQL statements issued to the replicated database. As you can see in the above, the system is nearly idle at the beginning and then builds to executing tens of thousands of SQL commands per sample period. CmdsPerSec This metric is derived by dividing the CmdsApplied by the number of seconds in the sample interval. This can be used to gauge the real performance of the DSI threads vs. CmdsApplied as it gives an execution rate. Note that it peaks at ~140/sec which really is not all that good (compared to the insert test steadily achieving 200 inserts/sec on a laptop and even that is not ideal) but then we are dealing with a single DSI thread as well. Inserts/Updates/DeletesRead Much like the DIST counters, these counters track the number of inserts, updates and deletes read out of the outbound queue and sent to the replicate database. Again, we see the curious pattern of inserts/deletes mimicking each other. However, the number of updates also suggests that minimal column replication should be considered as well. One thing that is interesting is that the sum of the DML commands is only of the CmdsApplied value. The reason is that the counters for the begin transaction & commit transaction are not shown above. For example, if the delete/insert were a pair in a single transaction, then at time 19:37, we would have ~10,000 deletes + 10,000 inserts + 10,000 begin tran + 10,000 commit trans which does work out to 40,000 commands. ExecsGetTextPtr/ExecsWritetext These counters are related to text/image processing. The first metric refers to the number of text/image columns that are involved. The reason this can be deduced is that each text/image column per replicated row will require an execution of rs_get_textptr (see section on text replication later). The second counter is incremented for each writetext operation. While there are counters available, these two also give you a fairly good indication of the amount of text/image data flowing. For example, if they were equal, then you would know that the amount of text/image data is fairly small (<16KB) and can be issued with a single writetext call. If you see a ratio of 100 or more writetext commands per rs_get_textptr, you can be fairly confident that the text/image data is fairly substantial which may be contributing to the slow delivery rate at the replicate database server. MsgChecks This metric tracks how often the DSI EXEC threads check for pending commands via the OpenServer message structures referenced in the internals section at the beginning of this document specifically the batch sequencing and commit sequencing messages (but also the actual transaction commands are posted here as well). MsgChksPerCmd This metric is derived by dividing the number of MsgChecks by the CmdsApplied to get a ratio of how autonomous the DSIEXE is. For example, with large groups allowed, once it can start, the DSIEXEC can do a lot of processing without having to continuously check for transaction group/batch sequencing with the parent DSI thread. In this case, we see that we are checking with the parent DSI thread nearly every command but then remember, the calculated dsi_max_xacts_in_group was 1 which means with only 1 transaction per group, we are going to have more coordination between the DSI and the DSIEXEC with each new transaction. 
A key element here is that we are doing 2 message checks for every 2 commands which makes sense if these are atomic transactions as each transaction would be 3-4 commands (begin, insert/delete, commit) and we would need to check group sequencing and batch sequencing (two message checks). Now, lets look at the second days metrics: CmdsApplied UpdatesRead Sample Time CmdsPerSec

DeletesRead

InsertsRead

MsgChecks

Execs Writetext

19:02:07 19:07:08 19:12:10 19:17:12 19:22:13 19:27:14 19:32:16

9 6,758 7,371 4,171 5,603 14,959 22,128

0 22 24 13 18 49 73

0 627 0 6 0 0 4,012

3 2,133 3,685 2,079 2,802 7,479 3,071

0 615 0 0 0 0 3,978

0 0 0 0 0 0 0

0 0 0 0 0 0 0

12 2,058 2,095 1,198 1,638 4,114 6,085

MsgChks PerCmd 1.3 0.3 0.2 0.2 0.2 0.2 0.2

ExecsGet TextPtr

171

Final v2.0.1

CmdsApplied

UpdatesRead

Sample Time

CmdsPerSec

DeletesRead

InsertsRead

MsgChecks

Execs Writetext

19:37:18 19:42:19 19:47:21

52,085 46,301 26,980

172 153 89

13,047 11,584 5,382

73 37 2,797

12,885 11,520 5,268

0 0 0

0 0 0

14,344 12,738 7,445

Notice that although the CmdsPerSec are in the same order of magnitude, the MsgChksPerCmd is considerably less. If you remember, it was during this time frame that the transaction grouping was much more effective hitting the goal of 20 constantly for the last. So, if transaction grouping is much more efficient (and notice that we are now only coordinating with the DSI approximately every fifth command but with 2 checks (batch & transaction sequence) we really are checking every 10 commands), but we have not really improved the delivery rate, then something else is the bottleneck now and may have been primary limiting factor for the day before. This points out an interesting perspective. So many P&T sessions follow the same flawed logic: 1. 2. Change one setting and run the test (this is already flawed as sometimes settings work together cooperatively). If it didnt improve anything, reset it to the original setting and try something else

The question is, how often after finding something that helps do DBAs go back and retry something that didnt help previously?? Answer: Not very often. In this case, we know that in the first day, transaction grouping was not occurring. Whatever the customer did to change the picture for day 2 helped the transaction grouping, but did not help the overall throughput. The tendency might be to reset whatever was changed and look at something else. However, a better way of looking at P&T issues is to think of the system as a pipeline with at least one or more bottlenecks. Removing the second or third one will not necessarily improve the throughput as the first one is still constricting the flow. Putting it back and then removing the first one doesnt help either as the re-introduced second bottleneck restricts the flow making the removal of the first bottleneck appear to be without benefit as well. As a result, when using the M&C, it is best sometimes to look at each and if possible, remove each bottleneck as noticed and leave it removed. In this case, they should leave the changes they made that affected transaction grouping intact as larger groupings of transactions are much more efficient in non-Parallel DSI systems. DSIEXEC Command Batching In addition to transaction grouping, another DSI feature that is critical for replication throughput is DSI command batching. While some database systems, such as older Oracle (pre-9i), do not allow this feature or have limitations those that do gain a tremendous advantage in reducing network I/O time by batching all available SQL and sending it to the database server in a single structure. This is analogous to executing the following from isql:
-- isql script insert into orders values () insert into order_items values insert into order_items values insert into order_items values insert into order_items values insert into order_items values insert into order_items values insert into order_items values go

() () () () () () ()

vs. without command batching, the same isql script would look like:
-- isql script insert into orders values () go insert into order_items values go insert into order_items values go insert into order_items values go insert into order_items values go insert into order_items values

() () () () ()

172

MsgChks PerCmd 0.2 0.2 0.2

ExecsGet TextPtr

Final v2.0.1

go insert into order_items values () go insert into order_items values () go

Anyone with basic performance and tuning knowledge will be able to tell that the first example will execute an order of magnitude faster from the client application perspective. How does this apply to Replication Server? Believe it or not, it does NOT mean that multiple transaction groups can be lumped into a large SQL structure and sent in a single batch. It does mean, however, that all of the members of a single transaction group may be sent as a single command batch with the exception of the final commit (due to recovery reasons. The commit is withheld until all transaction statements have executed without error. If no errors, then the commit is sent separately. If errors occurred, either a rollback is issued or the DSI connection is suspended (most common) which implicitly rolls back the transaction). The way this works is as follows: 1. 2. 3. The DSI groups a series of transactions until one of the group termination messages is hit (for example, the maximum of 65,536 bytes). The DSI passes the entire transaction group to the next DSIEXEC that is ready. The DSIEXEC executes the grouped transaction by sending dsi_cmd_batch_size (8192 bytes by default) sized command batches until completed.

In the example above, if the transaction group was terminated due to 65,536 byte limitation and we still had the default of 8,192 bytes per command batch, the entire transaction would be sent to the replicate database in ~8 batches of 8,192 bytes (depending on command boundaries as ASE requires that commands must be complete and not split across batches). Consequently, the effect of command batching which is on by default for ASE replicates is that performance of each of the transaction groups is maximized by reducing the network overhead of sending a large number of statements within a single transaction. In this way, lock time on the replicate is minimized, reducing contention between parallel DSIs as well as contention with normal replicate database users. Command batching is critical for large transaction performance. As you could imagine, a large transaction especially one that gets removed from SQT due to cache limitations will force the previous transaction group to end. As the large transaction begins, each dsi_cmd_batch_size bytes will be sent to the replicate database. The common wisdom has been to set this to the same size as db_packet_size or some small multiple (i.e. 2x) of db_packet_size. However, as you will see when we look at the counters, this is actually not optimal optimal is to set it as large enough that a single transaction group is sent as on command batch. It might be tempting then to assume that it would be best to set dsi_cmd_batch_size to the same as dsi_xact_group_size or 65,536 by default. One problem that many people who have coded large stored procedures might remember about this each user connection has a limited stack size in ASE for their connection. Issuing too large a batch of SQL results in stack overflow while later releases of ASE can easily handle the large batches, early releases from quite a few years ago could not. Consequently, RS sends a maximum of 50 commands even if the dsi_cmd_batch_size will support more. The optimal setting would be to set it to the largest command buffer that your DBMS can handle and let the network layer break it up into smaller chunks. The dsi_cmd_batch_size should rarely (hesitating to say never only to avoid setting a precedence) be set to less than 8,192 no matter what db_packet_size is set to- and never less than 2,048 as a large data row size of >1,000 bytes might easily violate this by the time the column names, etc. are added to the command. Remember, even with the default of 512 bytes (how many of us typically set the -A to higher??), isql is faster executing batches of SQL than individual statements. So lowering dsi_cmd_batch_size to db_packet_size is typically will degrade throughput. Key Concept #17: Along with transaction grouping, DSI command batching is critical to throughput to replicate systems that support it. The optimal size for DSI command batching would allow the entire transaction group to be sent as a single command batch. However, just like transaction grouping the command batching limits are upper bounds/goals. Command batches could be flushed from the DSI EXEC for any number of reasons some of which are tracked by the monitor counters. Command Batch Monitor Counters Several DSIEXEC module counters exist to help optimize command batching:

173

Final v2.0.1

Counter Preparation DSIEBatch DSIEBatchSizeAve DSIEBatchSizeLast DSIEBatchSizeMax DSIEBatchTimeAve DSIEBatchTimeLast DSIEBatchTimeMax DSIEICmdCountAve DSIEICmdCountLast DSIEICmdCountMax DSIEOCmdCountAve DSIEOCmdCountLast DSIEOCmdCountMax MemUsedAvgGroup MemUsedLastGroup MemUsedMaxGroup TransAvgGroup

Explanation

The number of command batches started. Average size, in bytes, of a command batch submitted by a DSI. Size, in bytes, of the last command batch submitted by a DSI. The maximum size, in bytes, of a command batch submitted by a DSI. Average time taken, in 100ths of a second, to process a command batch submitted by a DSI. Time, in 100ths of a second, to process the last command batch submitted by a DSI. The maximum time taken, in 100ths of a second, to process a command batch submitted by a DSI. Average number of input commands in a batch submitted by a DSI. Number of input commands in the last command batch submitted by a DSI. The maximum number of input commands in a batch submitted by a DSI. Average number of output commands in a batch submitted by a DSI. Number of output commands in the last command batch submitted by a DSI. The maximum number of output commands in a batch submitted by a DSI. Average memory consumed by a DSI/S thread for a single transaction group. Memory consumed by a DSI/S thread for the most recent transaction group. Maximum memory consumed by a DSI/S thread for a single transaction group. The average number of transactions dispatched as a single atomic transaction. If the value of this counter is close to the value of TransMaxGroup, you may want to consider bumping dsi_xact_group_size and/or dsi_max_xacts_in_group. If a DSIEXEC thread is capable of utilizing any degree of transaction grouping logic, this counter reports the number of transactions executed in the last grouped transaction. The maximum number of transactions dispatched as a single atomic transaction.

TransLastGroup

TransMaxGroup Execution DSIEBFBatchOff DSIEBFBegin DSIEBFCommitNext DSIEBFForced

Number of batch flushes executed because command batching has been turned off. Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a seperate batch. Number of batch flushes executed because the next command in the transaction will be a commit. Number of batch flushes executed because the situation forced a flush. For example, an 'install java' command needs to be executed, or the next command is the first chuck of BLOB DDL. Number of batch flushes executed because the next command is a get text descriptor command. Number of batch flushes executed because the next command would exceed the batch byte limit.

DSIEBFGetTextDesc DSIEBFMaxBytes

174

Final v2.0.1

Counter DSIEBFMaxCmds

Explanation Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached. This limit currently is 50 commands as measured from the input command buffer. Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch. Number of batch flushes executed because we expect to have row results to process. Number of batch flushes executed because the next command is an RPC. Number of batch flushes executed because the next command is part of a system transaction.

DSIEBFResultsProc DSIEBFRowRslts DSIEBFRPCNext DSIEBFSysTran Sequencing DSIESCBTimeAve DSIESCBTimeMax

Average time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'. The maximum time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

In RS 15.0, these counters are similar but lack the total, average, max as per: Counter Preparation DSIEBatchTime DSIEBatchSize DSIEOCmdCount DSIEICmdCount Execution DSIEBFResultsProc DSIEBFCommitNext DSIEBFMaxCmds DSIEBFRowRslts DSIEBFRPCNext DSIEBFGetTextDesc DSIEBFBatchOff DSIEBFMaxBytes DSIEBFBegin DSIEBFSysTran Number of batch flushes executed because the next command is to have its results processed in a context different from the current batch. Number of batch flushes executed because the next command in the transaction will be a commit. Number of batch flushes executed because we have a new command and the maximum number of commands per batch has been reached. Number of batch flushes executed because we expect to have row results to process. Number of batch flushes executed because the next command is an RPC. Number of batch flushes executed because the next command is a get text descriptor command. Number of batch flushes executed because command batching has been turned off. Number of batch flushes executed because the next command would exceed the batch byte limit. Number of batch flushes executed because the next command is a 'transaction begin' command and by configuration such commands must go in a seperate batch. Number of batch flushes executed because the next command is part of a system transaction. Time, in 100ths of a second, to process command batches submitted by a DSI. Size, in bytes, of command batches submitted by a DSI. Number of output commands in command batches submitted by a DSI. Number of input commands in command batches submitted by a DSI. Explanation

175

Final v2.0.1

Counter DSIEBFForced

Explanation Number of batch flushes executed because the situation forced a flush. For example, an 'install java' command needs to be executed, or the next command is the first chuck of BLOB DDL.

Sequencing DSIESCBTime Time, in 100ths of a second, to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'.

Note that the equivalent of DSIEBatch in RS 15.0 is to get the counter_obs column value for the DSIEBatchSize counter. Command batching can kind of be compared to how many SQL statements before each go you put in a file to be executed by isql. If youve ever done this test, you will quickly find that with smaller numbers (i.e. 2 or 3 inserts, then a go), it is much slower than with 100 or so. Consequently, you will want to want the following counters (RS 12.6 listed - equivalent RS 15.0 counters can be easily determined): DSIEBatch DSIEBatchSizeMax, DSIEBatchSizeAve DSIEOCmdCountMax, DSIEOCmdCountAve DSIEBFCommitNext, DSIEBFBegin DSIEBFMaxCmds, DSIEBFMaxBytes DSIEBFRPCNext, DSIEBFGetTextDesc, DSIEBFSysTran The first one is fairly simple the number of command batches used. The next set report the size in bytes of the command batches. The default dsi_cmd_batch_size of 8192 typically is too small and most often results in 4-6 SQL commands per batch. Increasing this to 256K is likely advisable as well. The set after that, DSIEOCmdCountMax/Ave, report the number of commands actually sent per batch vs. the bytes. Along with the above, these are some of the more important counters. The O vs. the I for the similar batch of counters (e.g. DSIEICmdCountAve) refers to Output vs. Input. In other words, the DSI submits a transaction grouping of commands but they are commands in which the SQL generation has not yet happened. After SQL generation and variable substitution, the number of bytes per batch or other factor may reduce the actual number of commands sent in the batch to the replicate DBMS. The Output commands have the most interest to us. The best way to think of this is after every DSIEOCmdCountAve commands (on average) a go is sent ala isql. Obviously, the smaller the batches, the slower the throughput. The real goal, then, is to try to submit the entire transaction group in one command batch Much like with DSI transaction grouping, with command batching, there can be many reasons why a command batch is terminated. All the counters beginning with DSIEBF (DSIEXEC Batch Flush). Some of the more common ones will be described in the following bullets. DSIEBFCommitNext - This counter signals that the end of the transaction group has been reached. As mentioned above, if the goal is to submit the entire transaction group as a single batch, you want this counter to be the primary reason for command batch flushes to the replicated database. DSIEBFBegin - This counter is typically incremented when batch_begin is off. If this is deliberate, it can be ignored. DSIEBFMaxBytes - this clearly suggests that dsi_cmd_batch_size is too small as described in the above paragraph. As a result, the batch is sent because it exceeded dsi_cmd_batch_size. DSIEBFMaxCmds - this counter tells when the batch size hits the internal limit of 50 commands before function string mapping. One reason for limiting the number of commands per batch is that some servers would have stack overflow if the number of command batch bytes exceed 64KB (including earlier copies of Sybase SQL Server). DSIEBFRPCNext - This counter signals how often a batch was flushed because the next command had an output style of RPC instead of language. RPCs can not be batch, consequently, the language commands before it being accumulated in a batch have to be flushed, then the RPC sent. DSIEBFGetTextDesc - This counter tells how often a batch was flushed because the next command would be a writetext command. 
Since the writetext requires a text pointer, we first have to get the textpointer value from the replicate server.

176

Final v2.0.1
DSIEBFSysTran - This counter tells us how often a batch was flushed due to the next command being a DDL command. In order to replicate DDL statements, they are submitted outside the scope of a transaction so in this case, not only is the batch flushed, but the transaction grouping stopped as well. Lets take a look at our insert stress test. There are three sample periods below. The first two are from when the dsi_max_batch_size was set at twice the packet size of 8192, and the latter when this was increased to 65,536. The difference between the first two has to do with the average transactions per group the DSI was submitting. Sample Time DSIEBF CommitNext

DSIEOCmd CountMax

DSIEOCmd CountAve

DSIEBatch

DSIEBatch TimeMax

DSIEBatch TimeAve

DSIEBatch SizeAve

DSIEBatch SizeMax

DSIEBF MaxCmds 0 0

dsi_max_batch_size at 16384, 5 inserts/transaction, 1 tran per group avg 16:13:20 16:13:30 136 119 100 100 1 1 3,550 3,770 12,004 12,004 16 16 5 5 271 244 0 0

dsi_max_batch_size at 16384, 5 inserts/transaction, ~4 tran per group avg 11:17:39 11:17:50 66 59 100 100 4 5 7,299 7,864 15,999 15,999 21 21 9 10 131 120 0 0 50 43

dsi_max_batch_size at 65536, 5 inserts/transaction, ~4 tran per group avg 11:38:08 11:38:19 63 68 100 100 4 4 9,141 8,952 39,170 39,170 50 50 12 12 126 137 9 8 0 0

It helps, of course to have the intrusive counters for timing purposes turned on. From the above, we can see that RS is taking about 10ms (counter is in 1/100ths of a second or 0.01 vs. milliseconds) per ungrouped transaction to process the batch. We will take a look next at the timing aspect, but for now, lets look at the commands. Notice that the average number of commands per batch increased from 5 to ~10 to 12. It is interesting to note that the number of CmdsPerSec jumped from ~150 to ~200 (earlier execution statistics for first set not shown here) simply by increasing the number of commands per command batch. However, between #2 and #3, increasing the dsi_max_batch_size shifted the ~25% batch flushes due to hitting the configuration limit to a <10% due to hitting the maximum number of commands. On interesting statistic to keep in mind is that each command batch will have 2 batch flushes at a minimum. This is because the actual commit is sent in a separate batch. Consequently, the ideal situation would be to have DSIEBFCommitNext 2 x DSIEBatch which is what we have. However, this puts the DSIEBFMaxBytes in more perspective as it suggests that nearly every batch in the middle sample exceeded dsi_max_batch_size requiring three command batches instead of two. Since each batch that begins will have at least one separate batch flush for the commit record, you can subtract DSIEBatch from DSICommitNext to reach a true DSICommitNext. Since we are looking at the execution stage, it would help to take a look at the times for the various stages RS 1) preparing the batch; 2) sending the batch to the replicate database and then 3) processing the results. Lets use the same time samples from the insert stress test above and see if it can explain why we have latency (or at least which component is to blame): SendTimeMax SendTimeAvg DSIEResult TimePerCmd 2.6 3.4 Sample Time

DSIEOCmd CountMax

DSIEOCmd CountAve

DSIEResult TimeMax

dsi_max_batch_size at 16384, 5 inserts/transaction, 1 tran per group avg 16:13:20 16:13:30 136 119 100 100 1 1 16 16 5 5 0 0 0 0 100 100 13 17

DSIEResult TimeAve

DSIEBatch

DSIEBatch TimeMax

DSIEBatch TimeAve

DSIEBF MaxBytes

177

Final v2.0.1

SendTimeMax

SendTimeAvg

dsi_max_batch_size at 16384, 5 inserts/transaction, ~4 tran per group avg 11:17:39 11:17:50 66 59 100 100 4 5 21 21 9 10 0 0 0 0 100 100 14 22 1.5 2.2

dsi_max_batch_size at 65536, 5 inserts/transaction, ~4 tran per group avg 11:38:08 11:38:19 63 68 100 100 4 4 50 50 12 12 0 0 0 0 100 100 29 26 2.4 2.1

And we notice a very key clue it is taking ~20ms (counter is in 1/100ths of a second or 0.01 vs. milliseconds) per command to process the results from each one. In fact, it is taking RS about 4 times longer to process the results than it does to process the batch internally and when no grouping, this is ~15x longer. Given this lag, no matter how much tuning we do to RS, it will be extremely difficult to achieve much faster we need to speed up the replicate SQL execution first. At 20ms per command, we could only hit ~50 commands/sec of course using parallel DSIs help some, but in this case, they barely let us 4x the throughput of this system. Ideally, we need to figure out a way to speed up the individual command processing increasing the number of DSIs may not help if the system is already CPU bound. Now, lets take a look at the customer system. Unfortunately, the customer system did not have the timing counters enabled, so we will only be able to look at the batching efficiency over the two days: Sample Time DSIEBF CommitNext

DSIEOCmd CountMax

DSIEOCmd CountAve

DSIEBatch

DSIEBatch TimeMax

DSIEBatch TimeAve

DSIEBatch SizeAve

DSIEBatch SizeMax

DSIEBF MaxCmds 0 0 0 0

~1 tran per group avg 19:07:08 19:12:10 19:17:12 19:22:13 1,574 1,920 1,030 1,746 0 0 0 0 1 1 1 1 884 1,271 1,274 1,279 2,342 2,325 2,323 2,340 3 3 3 3 2 2 2 2 3,148 3,840 2,060 3,491 0 0 0 0

~16-20 tran per group avg 19:23:32 19:28:34 19:33:36 19:38:38 148 115 69 101 0 0 0 0 11 16 14 13 4,852 5,849 5,768 5,708 8,185 8,189 8,189 8,189 61 16 17 16 11 10 9 9 294 230 138 201 0 0 0 0 287 538 304 402

You can see the changes in some of the counters as described below: DSIEBatch Normally, a 10-20x drop in the number of batches sent would suggest a drop in throughput either because of fewer transactions being replicated or due to slower throughput. In this case, however, if we had the space to show DSIEXEC.TransAveGroup jump the same amount. DSIEBatchTimeAve Again, this shows the same jump, but it works out to slightly less than 1/100th of a second (10ms) per transaction group so we are not too concerned here although we wish we would have had some of the other time based counters such as DSIEResultTimeAve for comparison.

178

DSIEBF MaxBytes

DSIEResult TimePerCmd

Sample Time

DSIEOCmd CountMax

DSIEOCmd CountAve

DSIEResult TimeMax

DSIEResult TimeAve

DSIEBatch

DSIEBatch TimeMax

DSIEBatch TimeAve

Final v2.0.1
DSIEBatchSizeAve/Max Obviously, this system is still using the default dsi_max_batch_size = 8192 which, while it may not look to be a problem as the average is only 70% of the max with the average being that close to the max, it tells us that the max is being hit pretty frequently (as DSIEBFMaxBytes does show). Remember also that only complete commands can be sent therefore if the average command size is 1,500 bytes, the most we will be able to send is 5 commands or 7,500 bytes. As a result, it may be difficult for the max to be hit that often. DSIEOCmdCountAve/Max In the first days metrics, not only is transaction grouping an issue, but command batching is all but ineffective as well. Day two is a lot better, running ~10 commands per batch and peaks of ~60 commands per batch. DSIEBFCommitNext/DSIEBFMaxBytes Unlike the insert stress test system, this one hits >>50 without tripping DSIEBFMaxCmds. However, it does show that the most common reason for batch flushes is due to hitting the dsi_max_batch_size limit. If we sum the four values above, we get a total of DSIEBFCommitNext=863 and DSIEBFMaxBytes=1531 or nearly a 2:1 ratio for DSIBFMaxBytes. Curiously, DSIEBatch which reports the number of batches began - is only at total of 333. While this may seem odd, remember that DSIEBatch is measured at the beginning and likely some of the command batches exceeded dsi_max_batch_size several times within the same batch resulting in multiple batch flushes per command batch in addition to the separate commit flush. If we subtract 333 from DSIEBFCommitNext, we end up with 530 instead of 863 which is a 3:1 ratio for DSIBFMaxBytes and a truer picture of the problem. So, part of the issue with this system is that the dsi_max_batch_size is undertuned. While this may be a big bottleneck, it is not the largest and tuning it will help some but not likely as much as some may be looking for. Much like the multiple bottlenecks in a pipe, removing other bottlenecks may have greater impact for example, 50% of the latency can be eliminated for this system simply by eliminating the delete/insert pairs and replacing with an update statement. Increasing dsi_max_batch_size is still a good idea. Some of you may have noticed that during the 19:23:32 period (first sample in the second group in the table), that the value for DSIOCmdCountMax was 61 definitely higher than the limit we stated as 50. The command limit is based on replicated commands from the input, whereas during SQL generation, additional commands may be necessary. For example, if we replicate a table containing identity columns, the actual replicated command is the rs_insert a single command. However, the output command language would require:
set identity_insert tablename on insert into tablename set identity_insert tablename off

Consequently a single command becomes three. Consequently, while you may see DSIEOCmdCountAve/Max/Last higher than 50, the input counters DSIEICmdCountAve/Max/Last should never exceed 50. In the case above, when the DSIEOCmdCountMax was equal to 61, during the same period, DSIEICmdCountMax was equal to 41. DSIEXEC Execution Replication Server is simply another client to ASE or any other DBMS it has no special prioritization nor special command processing. Consequently, RS execution of SQL statements is effectively very similar to the basic ct_results() looping in sample CT-Lib programs. The basic template might look similar to:
ct_command() called to create command batch ct_send() send commands to the server while ct_results returns CS_SUCCEED (optional) ct_res_info to get current command number switch on result_type /* ** Values of result_type that indicate fetchable results: */ case CS_COMPUTE_RESULT... case CS_CURSOR_RESULT... case CS_PARAM_RESULT... case CS_ROW_RESULT... case CS_STATUS_RESULT... /* ** Values of result_type that indicate non-fetchable results: */ case CS_COMPUTEFMT_RESULT... case CS_MSG_RESULT... case CS_ROWFMT_RESULT... case CS_DESCRIBE_RESULT... /* ** Other values of result_type: */ case CS_CMD_DONE...

179

Final v2.0.1

(optional) ct_res_info to get the number of rows affected by the current command case CS_CMD_FAIL... case CS_CMD_SUCCEED... end switch end while switch on ct_results final return code case CS_END_RESULTS... case CS_CANCELED... case CS_FAIL... end switch

The only real difference would be if an RPC call was made or text/image processing. To some, the many variations of result type processing may seem to be a bit overkill as RS really doesnt need or care about the results let alone compute-by clause results. However, remember that with stored procedure replication, just about any SQL statement could be contained within the replicated procedure, consequently RS needs to know how to handle the results type. Those familiar with CT-Lib programming also know that within this ct_results() loop often is a ct_fetch() loop which RS has to implement as well. Ideally, there will only be a single result for each DML command, but again, in the case of stored procedure replication, there might be any number of rows to be fetched and/or messages from print statements. So why are we discussing all of this? For two main reasons. First, to help you understand how RS works. Secondly and most appropriate to this section is the counters that are mostly associated with execution statistics. DSIEXEC Execution Monitor Counters The following monitor counters deal specifically with sending the commands to the replicate DBMS, processing the results (and error handling) during processing. Normally, only a few of these are applicable as most replication environments are fairly basic (consequently values for other counters may be an indication of unexpected behavior that may be contributing to the issue at hand). Some of the counters are repeated from earlier sections, but since they are applicable here particularly in light of some of the derived values they are repeated here for ease of reference. Counter Explanation

Batch sequencing (repeated from earlier) DSIESCBTimeAve DSIESCBTimeMax Average time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'. The maximum time taken, in 100ths of a second, to check the sequencing on a command batch which required some kind of synchronization such as 'wait_for_commit'.

ct_send() phase SendTimeAvg SendTimeMax SendRPCTimeAvg SendRPCTimeMax SendDTTimeAvg SendDTTimeMax ct_results() processing DSIEResultTimeAve DSIEResultTimeMax Exception Processing Average time taken, in 100ths of a second, to process the results of a command batch submitted by a DSI. The maximum time taken, in 100ths of a second, to process the results of a command batch submitted by a DSI. Average time, in 100ths of a second, spent in sending command buffers to the RDS. Maximum time, in 100ths of a second, spent in sending command buffers to the RDS. Average time, in 100ths of a second, spent in sending RPCs to the RDS. Maximum time, in 100ths of a second, spent in sending RPCs to the RDS. Average time, in 100ths of a second, spent in sending chunks of text or image data to the RDS. Maximum time, in 100ths of a second, spent in sending chunks of text or image data to the RDS.

180

Final v2.0.1

Counter ErrsDeadlock

Explanation Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205). Note that this does not track the times when deadlocks occur with parallel DSIs, but only when RS deadlocks with another nonRS process. Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105). Total times that a DSI thread failed to apply a transaction due to target the database in log suspend mode (ASE Error 7415). Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601). Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204).

ErrsLogFull ErrsLogSuspend ErrsNoConn ErrsOutofLock Commit Sequencing DSIESCCTimeAve DSIESCCTimeMax MsgChecks

Average time taken, in 100ths of a second, to check the sequencing on a commit. The maximum time taken, in 100ths of a second, to check the sequencing on a commit. Total checks for Open Server messages by a DSIEXEC thread. Message checks are for group and batch sequencing operations as discussed earlier in association with the dsi_serialization_method Number of MsgChecks_Fail returned when a DSIEXEC thread calls dsie__CheckForMsg(). If a timer is specified, MsgChecks_Fail returns if timer expired before an event is returned. Average time taken, in 100ths of a second, to process a transaction by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches. The maximum time taken, in 100ths of a second, to process a transaction by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches.

MsgChecksFailed

DSIETranTimeAve

DSIETranTimeMax

In RS 15.0, the counters are similar: Counter Explanation

Preparation & Batch Sequencing DSIESCBTime DSIEPrepareTime Ct_send() phase SendTime SendRPCTime SendDTTime DSIEExecCmdTime DSIEExecWrtxtCmdTime Time, in 100ths of a second, spent in sending command buffers to the RDS. Time, in 100ths of a second, spent in sending RPCs to the RDS. Time, in 100ths of a second, spent in sending chunks of text or image data to the RDS. The amount of time taken by a DSI/E to execute commands. This process includes creating command batches, flushing them, handling errors, etc. The amount of time taken by a DSI/E to execute commands related to text/image data. This process includes initializing and retreiving text pointers, flushing commands, handling errors, etc. Time, in 100ths of a second, to check the sequencing on command batches which required some kind of synchronization such as 'wait_for_commit'. The amount of time taken by a DSI/E to prepare commands for execution.

181

Final v2.0.1

Counter ct_results() processing DSIEResSucceed DSIEResFail DSIEResDone DSIEResStatus DSIEResParm DSIEResRow DSIEResMsg DSIEResultTime Exception Processing ErrsDeadlock

Explanation

The number of times a data server reported successful executions of a command batch. The number of times a data server reported failed executions of a command batch. The number of times a data server reported the results processing of a command batch execution as complete. The number of times a data server reported a status in the results of a command batch execution. The number of times a data server reported a parameter, cursor or compute value in the results of a command batch execution. The number of times a data server reported a row as being returned in the results of a command batch execution. The number of times a data server reported a message or format information as being returned in the results of a command batch execution. Time, in 100ths of a second, to process the results of command batches submitted by a DSI.

Total times that a DSI thread failed to apply a transaction due to deadlocks in the target database (ASE Error 1205). Note that this does not track the times when deadlocks occur with parallel DSIs, but only when RS deadlocks with another nonRS process. Total times that a DSI thread failed to apply a transaction due to no available log space in the target database (ASE Error 1105). Total times that a DSI thread failed to apply a transaction due to target the database in log suspend mode (ASE Error 7415). Total times that a DSI thread failed to apply a transaction due to no connections to the target database (ASE Error 1601). Total times that a DSI thread failed to apply a transaction due to no locks available in the target database (ASE Error 1204).

ErrsLogFull ErrsLogSuspend ErrsNoConn ErrsOutofLock Commit Sequencing DSIESCCTime DSIETranTime

Time, in 100ths of a second, to check the sequencing on commits. Time, in 100ths of a second, to process transactions by a DSI/E thread. This includes function string mapping, sending and processing results. A transaction may span command batches. The amount of time taken by a DSI/E to finish cleaning up from committing the latest tran. These clean up activities include awaking the next DSI/E (if using parallel DSI) and notifying the DSI/S.

DSIEFinishTranTime

However, the most useful DSIEXEC counters are the time counters. In RS 12.6, the only counters were averages which meant that the most useful way of looking at them was from a total perspective, requiring re-calculating the original total that was used in the average: FSMapTime=(DSIEFSMapTimeAve * CmdsApplied)/100.0 BatchTime =(DSIEBatchTimeAve * DSIEBatch)/100.0 SendTime=(SendTimeAvg * DSIEBatch)/100.0 ResultTime=(DSIEResultTimeAve * DSIEBatch)/100.00 CommitSeqTime=(DSIESCCTimeAve * TransApplied)/100.0

182

Final v2.0.1
BatchSeqTime=(DSIESCBTimeAve * TransApplied)/100.00 TotalTranTime=(DSIETranTimeAve * TransApplied)/100.00 RS 15.0 simplifies this thanks to the counter_total column in the rs_statdetail table. The key to all of these is to remember that we are executing command batches with the transaction group currently being dispatched by the DSIEXEC and that multiple groups may be executed by the DSIEXEC within the sample interval. Consequently, to get the time spent for each sample interval, we have to multiply the individual timing counters by the number of commands, batches or transactions processed by that DSIEXEC during that interval to get the total time spent on that aspect (note that this changes substantially in RS 15 as it tracks totals already). All the times reported by these counters are in 100ths of a second, consequently we need to normalize to seconds to make them more readable. From these we can most often find quite clearly where RS is spending the time. Lets take the above times in order of the execution and describe the likely causes: FSMapTime - As noted earlier, this is the amount of time translating the replicated row functions into SQL commands. If there is a lot of time spent in this area, it could point to fairly big customized function strings - which you may not be able to do much about. However, you may wish to ensure that STS cache is sized appropriately. BatchTime - As noted earlier as well, this is the amount of time creating the batches. Although it seems odd, generally when this value is high, it almost always goes hand in hand with dsi_cmd_batch_size being too small. One possibility is that the overhead of batch creation - beyond the mechanics of append the SQL clauses is high enough that when the number of batches is high due to a low batch size setting, it adds up considerably. BatchSeqTime - This, as described earlier, is the time spent trying to coordinate sending of the first batch in parallel DSIs. A lengthy time could indicate that the dsi_serialization_method is wait_for_commit and a previous transaction is running a long time or that the DSI thread is simply too busy to respond to the Batch Sequencing message. SendTime - This represents the amount of time spent sending the command batch to the replicate data server. A high time here may indicate inefficient batching or slow response to client applications from the replicate server. ResultTime - This calculated value can be used to determine the amount of time spent processing results from the replicate server. In actuality, this includes the execution time as RS does very little result processing. Frequently, these metrics will among the highest and points to a need to speed up the replicate DBMS as the key to improving RS throughput. CommitSeqTime - This is the amount of time spent waiting to commit. Again, a high value may indicate a near-serial dsi_serialization_method such was wait_for_commit - or it also could point to contention within the replicate server - possibly within the rs_threads group. TotalTranTime - Most of the time for 12.6 systems will be reported as TotalTranTime which when you subtract the other components (FSMapTime, SendTime), leaves execution time by the replicate database as the result. 
And if this is the largest chunk of time, tuning RS isnt going to help you have to either tune the replicate database, use parallel DSIs (and the key here is to achieve the greatest degree of parallelism without introducing prohibitive contention) or use minimal columns/repdefs to reduce the SQL execution time. Above, we have also highlighted the two message check counters (MsgChecks, MsgChecksFailed). To understand how these counters can be useful, think back to the earlier diagram of the DSI to DSIEXEC intercommunications concerning batch and commit sequencing. As discussed at the beginning of this paper, inter-thread communications are conducted using OpenServer message structures internally allowing asynchronous processing between the threads. Consequently, when a DSIEXEC puts a message such as Batch Ready on the DSI message queue, it then checks its own message queue for the response. If the response is there, only the MsgChecks counter is incremented. If the expected message is not there, the MsgChecksFailed is incremented along with the MsgChecks. While the number of failures could be an obvious indication of a lengthy batch/commit sequencing issue, we dont really need to look at the value too closely as RS monitor counters will explicitly tell us how long the batch sequencing and commit sequencing times were. However, the number of message checks is kind of handy from a different perspective. A very high number in comparison to the number of transaction groups or command batches processed gives us an indication of whether transaction grouping is effective (along with other explicit counters for this). Unfortunately, these counters were removed in RS 15.0 DSI Post-Execution Processing After the DSIEXEC finishes executing the SQL, it checks to see if it can commit. For parallel DSIs this is done by first sending an rs_get_threadseq or using DSI Commit Control. If it can commit, it notifies the DSI which in turn

183

Final v2.0.1
coordinates the commits among the DSIEXEC threads. If the thread is next to commit, the DSI sends a message to the DSIEXEC telling it to commit. Once the DSIEXEC has committed, it notifies the DSI that it successfully committed and the DSI in turn notifies the SQM to truncate the queue of the delivered transaction groups. Additionally, the DSI handles regrouping transactions after a failure.

End-to-End Summary
The two most common questions that are asked are Where do you begin? followed closely by How do you find where the latency is? The answer actually is the second question. When you think about it, with 3 near-synchronous pipelines for normal replication (2 for WS), any latency will manifest itself in one of three locations: 1. 2. 3. Primary Transaction Log Inbound Queue Outbound Queue

So, the first place to begin is to identify which of those three are lagging. The fastest way to isolate the problem is to do the following: Sp_help_rep_agent: Check the RepAgent state. If sleeping, then the RepAgent is caught up. If not sleeping, get sp_sysmon output to aid in further diagnostics. Admin who, sqm: Compare Next.Read with Last Seg.Block although this is not totally accurate, if the dsi_sqt_max_cache size is <4MB, it is likely that if Next.Read is greater than Last Seg.Block, any latency in minor. Admin sqm_readers, queue#, 1: For WS applications, admin who,sqm is particularly ineffective. Similar to admin who,sqm though, this will show the Next.Read and Last Seg.Block relative positions. The outcome of this will identify which of the three disk locations mentioned above contains the latency. Problem determination begins from that point forward according to the main near-synchronous pipelines: TranLog RepAgent RS RepAgent User SQM (W) Inbound Queue Inbound Queue SQT DIST Outbound SQM Outbound Queue Outbound Queue DSI DSIEXEC RDB Inbound Queue WS DSI WS DSIEXEC WS RDB (Warm Standby only) Outbound Queue RSI RRS DIST Outbound SQM Outbound Queue (Route only) For example, if the latency is in the inbound queue, for normal replication, you start by analyzing the SQT, DIST and outbound queue SQM threads, while for WS implementations, you focus on the WS DSI, DSIEXEC and RDB. Once you know where you are beginning, the next step is to verify the latency by using the M&C and comparing the commands. Focusing on the RS, this will typically mean beginning with the SQM commands written. Alternatively, you can skip the admin who,sqm at the beginning and start simply by looking at the various command metrics across the full path through the RS. For example: SQT CmdsTotal DSI CmdsRead SQT TransRemoved

21:40:46 21:42:47 21:44:48 21:46:49 21:48:50 21:50:50 21:52:51 21:54:52

5,524 7,868 5,797 324 1 2 2 2

5,524 7,868 5,797 324 1 2 2 3

5,524 7,868 5,797 324 1 2 2 3

5,524 7,867 5,795 324 0 0 0 0

0 0 0 0 0 0 0 0

5,510 7,866 5,795 342 0 0 0 0

5,510 7,866 5,795 342 0 0 0 0

7,776 8,225 14,008 18,962 18,615 27,125 8,684 0

7,866 8,180 13,999 18,794 18,205 26,564 18,078 0

184

DSIEXEC CmdsApplied n/a n/a n/a n/a n/a n/a n/a n/a

Src SQM CmdsWritten

Dest SQM CmdsWritten

Sample Time

Dest SQMR CmdsRead

RepAgent CmdsTotal

DIST CmdsTotal

Src SQMR CmdsRead

Final v2.0.1

SQT CmdsTotal

DSI CmdsRead

SQT TransRemoved

21:56:53 22:02:21 22:04:22 22:06:22 22:08:23 22:10:24 22:12:25 22:14:26 22:16:26

0 6 0 844 3,192 8,688 9,411 1,366 2,869

0 6 0 844 3,192 8,688 9,411 1,366 3,075

0 6 0 844 3,192 8,688 9,411 1,366 3,075

0 3 0 842 3,191 8,683 9,407 1,364 2,869

0 0 0 0 0 0 0 0 0

0 3 0 747 3,187 8,744 9,357 1,442 2,999

0 3 0 747 3,187 8,744 9,357 1,442 2,999

0 3 0 747 3,187 8,744 6,873 3,837 2,999

0 3 0 741 2,873 5,359 4,298 4,326 3,516

The DSIEXEC Cmds were not available as the customer who gathered the above did not collect all the statistics. However, enough is there to quickly determine the following: There definitely is latency in the DSI/DSIEXEC pipeline There may be latency at the source RepAgent, but we can not tell from the RS statistics.

Regardless of the example above, remember that latency in one thread may be the result of build up in threads further in the pipeline classically SQT type problems. After identifying where the problem is, the second step is to look for the obvious/common bottlenecks for each thread: Thread/Module RepAgent User Common Issues RSSD interaction (rs_locater, etc.) STS Cache RepAgent Low packet size/scan batch size SQM Write Waits RSSD interaction Slow Disks Read Activity Large Transactions Write Activity Physical Reads vs. Cached Cache Size (too large or too small) Large Transactions DIST/Outbound Queue slow No RepDefs Large Transactions RSSD Interaction STS Cache SQM Write Waits Cache Size (too large or too small) Large Transactions Transaction Grouping configuration

SQM (Write)

SQM (Read)

SQT

DIST

DSI

DSIEXEC CmdsApplied n/a n/a n/a n/a n/a n/a n/a n/a n/a

Src SQM CmdsWritten

Dest SQM CmdsWritten

Sample Time

Dest SQMR CmdsRead

RepAgent CmdsTotal

DIST CmdsTotal

Src SQMR CmdsRead

185

Final v2.0.1

Thread/Module DSIEXEC

Common Issues Replicate DBMS response time Command Batching configuration Lack of Parallel DSIs Text/Image replication RRS DIST/SQM slow Network issues

RSI

These can readily be spotted by looking at the monitor counters detail in the previous sections. One aspect to consider is that each of the pipelines mentioned above begin and end with disk space. Even the replicate DBMS is disk space in effect as the DML statement execution depends on changing disk rows and logging those changes. The most frequent source of bottlenecks will be the components that talk to these disks the RS SQM threads and the Replicate DBMS. In any case, RS M&C includes timers for these actions that allow you to isolate that it is these endpoints that are the problem. From the previous sections we have tried to illustrate problems and provide general configuration guidance. A summary of the this guidance is repeated here: Thread/Module RepAgent Thread Common Issues DSIEXEC Use large packet sizes Use larger scan batch sizes Watch RS response time Sts_full_cache rs_objects, rs_columns Max sqm_write_request_limit Tune RepAgent Increase sqm_recover_seg Max sqm_write_request_limit Right-size SQT & DSI SQT Cache Right size cache Break up large transactions (app change) Use table RepDefs Sts_full_cache rs_objects, rs_columns, rs_functions. Set sts_cache_size to 1,000 or higher Max md_sqm_write_request_limit Right-size DSI SQT Cache cache should be able to hold 1.5-2 times (max) the number of grouped transactions that you execute on average Target dsi_max_xacts_in_group Max dsi_xact_group_size, dsi_large_xact_size to eliminate their effects Target dsi_cmd_batch_size to full tran group (40KB+ as starting point) or 50 commands Watch RDB DBMS response times Use Parallel DSIs

RepAgent User

SQM (Write) SQM (Read) SQT DIST

DSI

At this point, we are done looking in detail at the RS aspects to the problem and can focus on the replicate database & replicate DBMS. This is appropriate as probably 90% of all latency problems stem from the SQL execution speed at the replicate database.

186

Final v2.0.1

Replicate Dataserver/Database
You gotta tune this too !!
Often when people are quick to blame the Replication Server for performance issues, it turns out the real cause of the problem is the replicate database. As with any client application, the lack of a tuned replicate database system really impedes transaction delivery rates. Two things contribute to the Replication Servers quick blame for this: 1. 2. As a strictly write intensive process, poor design is quickly evident administrators will monitor replication delivery rates quicker than DBMS performance.

In fact, it is an extremely rare database shop these days that regularly monitors their system performance beyond the basic CPU loading and disk I/O metrics. Key Concept #18: Not only is a well tuned replicate dataserver crucial to Replication Server performance, but a well instrumented primary and replicate dataserver is critical to determining the root cause of performance problems when the do occur. The purpose of this section is not to discuss how to tune a replicate dataserver as that can be extremely situational dependent. However, several points to consider and common problems associated with replication will be discussed. Maintenance User Performance Monitoring For ASE based systems, it is critical to have the Monitoring Diagnostic API (MDA) Tables set up for performance monitoring of the primary and replicate dataservers (possibly the RSSD as well if located on an ASE with production users). Because the MDA tables can be accessed directly via SQL and provide process level metrics, you can get a clear picture of replication maintenance user specific activity. However, there are a couple of nuances when using MDA based monitoring of the replicate database: The maintenance user may disconnect/reconnect during the following circumstances: o Errors mapped to stop replication o Parallel transaction failure due to deadlock or commit control intervention o DSI fadeout due to inactivity As with any MDA based monitoring, a series of samples using a short sample interval will be necessary to determine Most MDA tables are not stateful - but only show the cumulative values for the current sample period When querying the MDA tables, using the known parameters can reduce the query time significantly.

While it might be tempting to simply look for the maintenance user by SPID, the first point should illustrate that the SPID is likely not too reliable as any disconnect/reconnect can change the SPID. Even it reconnects with the same SPID, the KPID will differ meaning that counter values for the previous SPID will be lost for all but the stateful tables. Additionally, when monitoring the maintenance user, we primarily are interested in determining the following conditions: How quickly statements are executed - since all we are executing are DML operations based on primary keys or atomic inserts (ignoring procedure replication), most statements should execute extremely quickly What the maintenance user process within ASE is waiting on Possibly, how even the distribution of the workload for parallel DSI configurations - although this can be skewed by large transactions and other conditions.

While the goal of this table is not to teach how to monitor ASE using MDA tables - existing white papers already cover this topic. As a result, the tables and queries contained in this section will focus primarily on the tables most applicable to monitoring maintenance user performance. Consider the following:

187

Final v2.0.1

Figure 39 - Useful ASE 15.0.1 MDA Tables for Monitoring Maintenance User Performance
The first trick is to identify which of the SPID/KPID combinations we are interested in. Logically, it might be tempting to retrieve the SPID/KPID pairs either from master..sysprocesses or from the monProcessLookup table (at the top in the diagram). However, from the above diagram, you can see ServerUserID is also in the monProcessActivity table which will need to be queried anyhow. As a result, prior to each sample the query:
declare @SampleTime datetime select @SampleTime=getdate() select SampleTime=@SampleTime, * into #monProcessActivity from master..monProcessActivity where ServerUserID=suser_id(<maint_user_name>)

The other tables then can be queried using a join with this table to narrow the results to only the SPID/KPID pairs used by the maintenance user in question. In the next paragraphs, we will use this diagram to special points of interest for monitoring the performance of the maintenance user. Maintenance User Wait Events The best starting point for detecting maintenance user performance issues is to begin by looking at the Wait Events from monProcessWaits (bottom center of diagram). This table is key to determining how long the maintenance user task spent waiting for disk I/O, network I/O, CPU access, etc. Assuming we had used the above query to determine which SPID/KPIDs we are interested in, the query to retrieve the wait events would be:

188

Final v2.0.1

select SampleTime=@SampleTime, w.* into #monProcessWaits from master..monProcessWaits w, #monProcessActivity a where a.SPID = w.SPID and a.KPID = w.KPID select w.*, e.EventDescription into #WaitEvents from #monProcessWaits w, master..monWaitEventInfo e where w.WaitEventID = e.WaitEventID

Once we have the wait events, we need to find the ones of key interest. Looking at the schema for the monProcessWaits table, we see that there are two columns for the metrics - Waits and WaitTime. A logical assumption might be to focus on the WaitTime, however, there is a slight consideration that may make this not as important. ASE measures time based on a timeslice or ticks, which by default is 100 milliseconds. In measuring wait events, the server simply subtracts the timeslice a process was put to sleep in from the timeslice value when it was woken up. If it is the same timeslice, a wait event is recorded with a WaitTime of 0. Consequently a handy query for weighting the Waits and the WaitTime equitably might be a query similar to:
select SampleTime, WaitEventID, Waits, WaitTime, MaxWaitTime=(case when Waits * 100> WaitTime then Waits * 100 else WaitTime end), EventDescription from #WaitEvents where Waits * 100 > 0 order by 5 -- order by MaxWaitTime

The following table lists some common wait events that you might see for a maintenance user:

WaitEventID    Event Description

CPU Related
214            waiting on run queue after yield
215            waiting on run queue after sleep

Disk Read Related
29             waiting for regular buffer read to complete

Memory/Cache Related
33             waiting for buffer read to complete
34             waiting for buffer write to complete
36             waiting for MASS to finish writing before changing
37             wait for MASS to finish changing before changing

Disk Write Related
51             waiting for last i/o on MASS to complete
52             waiting for i/o on MASS initiated by another task

Transaction Log/Write Related
54             waiting for write of the last log page to complete
55             wait for i/o to finish after writing last log page

Network Receive
250            waiting for incoming network data

Network Send
171            waiting for CTLIB event to complete
251            waiting for network send to complete


Contention/Blocking Related
150            waiting for a lock
41             wait to acquire latch

Internals/Spinlocks
272            waiting for lock on ULC

Some of the more common issues are discussed below.

CPU Contention

If there is a high degree of CPU contention (wait events 214 & 215), you will need to consider the priority of the maintenance user as well as the number of parallel DSI threads being used. In the case of the former, if the replicate database is also being used by production users for reporting purposes or in a peer-to-peer fashion, the maintenance users are competing for CPU time with the production users. If the replication latency is greater than desired, you have a couple of options available (see the sketch below):

- Increase the maintenance user priority to EC1
- Use engine grouping to restrict reporting users to a subset of engines as well as focusing the maintenance user on the remaining engines
- Increase the number of engines
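As a hedged sketch of the first two options (the engine numbers, engine group, execution class and maintenance login names below are illustrative - verify the sp_addengine/sp_addexeclass/sp_bindexeclass arguments against your ASE version's documentation):

-- Create an engine group from a subset of the engines (engines 2 and 3 are illustrative)
sp_addengine 2, 'MAINT_GROUP'
go
sp_addengine 3, 'MAINT_GROUP'
go
-- Create a high priority execution class tied to that engine group
sp_addexeclass 'MAINT_EC', 'HIGH', 0, 'MAINT_GROUP'
go
-- Bind the maintenance user login to the execution class
sp_bindexeclass 'pdb_maint', 'LG', NULL, 'MAINT_EC'
go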

If CPU contention is high and parallel DSI threads are being used, consider reducing the number of threads to see if any improvement in throughput occurs. A good starting rule of thumb is 5-10 threads per engine as a maximum.

Disk Read Delays

While delays due to disk reads certainly could be due to slow disk drives or disk contention, a much more likely cause for the maintenance user is excessive I/O due to a bad query plan. This can happen particularly for updates and deletes when the table is missing indexes on the primary key columns, and during inserts when the clustered index is not unique and is non-selective (based on low-cardinality columns). This can be confirmed by looking at the statement and object statistics as described in the Query Related Causes section later.

Memory/Cache Contention

Normally, individual logical I/Os as represented by wait events 33 & 34 will not be a problem. If they are, one possible cause - particularly when the machine is used by production users - is too few cache partitions. The most common memory contention issue for maintenance users, however, will be focused on the Memory Address Space Segment (MASS) spinlocks. A MASS is a way of controlling concurrent access to a group of contiguous pages in memory - typically 8 pages. For example, if a query results in an APF pre-fetch of an entire extent, all 8 pages are read from disk and placed into cache. While those pages are being placed into cache, other users are prevented from trying to use those same pages by the MASS bit. Once in memory, user DML statements may cause several pages to be updated (marked dirty). When the housekeeper, checkpoint process or other write operation forces the pages to be flushed, ASE will do a multi-page write of the pages within the MASS for I/O efficiency - and again, to safely record the pages as having been flushed, concurrent user access during the write operation is blocked. In the case of Replication Server maintenance users, the most common form of MASS contention occurs in a high insert environment, where the parallel DSI threads will all be attempting to append rows to heap tables or tables whose clustered index is ordered by a monotonically increasing sequential key (including datetime values). As a result, if one parallel DSI has just filled one page, the next insert from a different parallel DSI may have to allocate a new page for the object and may try to append it to the same MASS area. Using cache partitions may alleviate this problem.

Disk Write Delays

As mentioned in the previous paragraph, ASE performs all write requests using as large an I/O as possible. For example, if 2 or 3 contiguous pages in a cache MASS area are dirty, ASE will attempt a 2 or 3 page sized write (4-6K on a 2K page size server). Note that writes of data pages normally only happen when the housekeeper flushes a page, when the wash marker is reached, or when a checkpoint process flushes the pages based on the recovery interval. As a result, if you see a lot of write-based delays, you may first want to look at the monDeviceIO/monIOQueue tables (not in the above diagram) along with OS utilities such as sar to see if slow disk response times or ASE configuration values are causing the I/O times to be longer than normal.
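As a quick check of device response times, a hedged query against monIOQueue (assuming the LogicalName/IOType/IOs/IOTime columns as documented for the ASE 15.x MDA tables - verify for your version) might resemble:

select LogicalName, IOType, IOs, IOTime,
       AvgIOTimeMs = case when IOs > 0
                          then convert(numeric(12,2), IOTime) / IOs
                          else 0 end
    from master..monIOQueue
    order by 5    -- order by AvgIOTimeMs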

However, if the majority of the write delays are due to waiting for a MASS write initiated by a different user, this suggests that in a high insert environment you need more cache partitions, or that the clustered index is forcing parallel DSIs to insert into the same page - and the housekeeper/checkpoint is forcing a disk flush before the page is completely full.

Transaction Log Delays

In the MDA tables, transaction log based delays are collectively grouped with disk write activity - but due to the differences in causes, we separate them into different sections for this discussion. In the above list, there were two transaction log delay wait events - 54 & 55. The first one (54) actually refers to waiting to get access to the transaction log to flush the maintenance user's ULC to the primary log cache. Commonly we might associate this with log semaphore contention. This can be verified by looking at the monOpenDatabases table, which has columns that track the AppendLogRequests and the AppendLogWaits. If the maintenance users appear to be waiting on the log semaphore and the replicate system is not being used by production users, it could point to a need to increase the ULC size at the replicate or to speed up the physical log I/O of the process that currently holds the log semaphore. The second condition (55) suggests that either the log device is slow in responding or that the number of writes per log page is causing the last log page to be busy. As of ASE 15.0, one possible solution for this is to enable delayed commit - either for the entire database or just for the maintenance users. If modifying just for the maintenance users, you will need to modify one of the class-scope function strings executed at the beginning of the DSI connection sequence - such as rs_usedb. The danger in this is that non-ASE 15.0 servers may not understand this command, so you will likely need to create a user defined function class that inherits from rs_sqlserver_function_class to minimize the impact and the work involved to implement this capability.

Network Receive Delays

This is likely the largest single cause of latency and, as a result, any real attempt at improving the throughput of a maintenance user will likely need to begin with this. As a whole, the problem can be caused by:

- RS being slow in sending commands to the ASE due to spending time on other processes
- ASE being slow in parsing, compiling and optimizing language commands, as typical DML statements are sent by RS as language commands

The first one can be double-checked by looking at the DSIEXEC time related counters. If no appreciable time is being spent in batching or function string conversion and nearly all the time is spent in the send/execute and results processing windows, then it is most likely the second cause. The second cause is a bit nasty. While Replication Server could be viewed as sending very simplistic SQL statements (atomic inserts, updates and deletes based on primary keys), the issue is that every statement sent to the replicate DBMS needs to be parsed, compiled, optimized and then executed. In reality, execution (less any contention or other causes) is by far the smallest of these times. This has been proven in test scenarios involving high insert environments in which fully prepared SQL statements were 3-10 times faster than the equivalent language commands. The reason is that fully prepared SQL statements create a dynamic procedure that is executed repeatedly by simply sending the parameter values with each call vs. a language command. It was further proven that the most expensive part of the delay was due to compilation and optimization, as it was determined that language procedure calls did not exhibit the same delays as language DML statements. Beginning with ASE 12.5.2, Sybase introduced statement caching. When enabled, as each SQL command is received, it is hashed with an MD5 hash for that login and environment settings (such as isolation level). If the hash matches an already executed query, that query's optimization plan is used instead. However, the ASE 12.5.2 statement cache did not benefit Replication Server environments for the following reasons:

- The literal values were included in the hash key - consequently updates or deletes - especially those caused by a single statement at the source - could not use the statement cache as the literal values for the primary keys differed.
- Statement caching was not used for atomic insert/values statements.

In ASE 15.0.1, the first restriction was removed by adding a configuration setting to control literal parameterization as well as a session setting. RS environments are strongly encouraged to enable this if the environment sustains a lot of update or delete activity. In the future, ASE 15.0.2 is looking at providing (note this is a future release - normal caveats about future functionality apply) the same capability for atomic insert/values statements which should benefit RS environments greatly. In addition, on a parallel effort, Replication Server engineering is looking at an enhancement to RS 15.0 (again, caveats regarding future release functionality apply) that would enable RS to send dynamic SQL vs. language statements. Early tests with this have reported substantial improvements.
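For reference, a hedged sketch of enabling the statement cache and literal parameterization on an ASE 15.0.1 replicate (the cache size shown is purely illustrative; verify both settings against your environment):

-- Allocate memory (in 2K pages) to the statement cache
sp_configure 'statement cache size', 10000
go
-- Replace literal values with parameters before hashing (ASE 15.0.1 and later)
sp_configure 'enable literal autoparam', 1
go
-- The equivalent session-level setting, e.g. exportable from a login trigger
set literal_autoparam on
go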

Until either ASE 15.0.2 or RS 15.0 is enhanced to resolve the ASE optimization issue, significant improvements in RS throughput can be achieved by using stored procedures and changing the function strings to call the stored procedures instead of the default language commands.

Network Send Delays

Network send delays can be caused by several factors within a replicate database:

- The maintenance user task was running on one engine, but needs to perform network I/O on the different network engine that it is connected to.
- ASE CPU contention is preventing the task from being scheduled quickly enough to tell if the network send was acknowledged.
- The replicated procedure or trigger contains a number of print statements - particularly if set flushmessage on is enabled.
- RS is slow at processing the results.

The first is the most likely cause on larger systems. Unfortunately, while engine-to-CPU affinity can be set via dbcc tune(), task-to-engine affinity is not explicitly supported within Sybase ASE. If the replicate DBMS has a large number of SMP engines, the only real alternative is to use engine groups to try to constrain the maintenance users to a subset of CPUs - thereby reducing the task migration. However, this should be done with extreme caution and only after verifying that task migration is occurring. One way it can be verified is by reducing the sample interval significantly and then monitoring the monProcess.EngineNumber column for the same SPID/KPID pairs. If task migration is occurring a lot, an engine group may be desired. On smaller systems or non-parallel DSI environments, the most likely culprit will be the second cause. Again, this may point to the need to either increase the process priority for the maintenance user or use engine grouping to deconflict with other production users. The third cause can be alleviated by bracketing the print statements as well as the set flushmessage setting in the proc/trigger code with a check for either replication_role or the maintenance user by name - or by ensuring that triggers are disabled at the replicate if the print statements are within triggers. However, it is unlikely that this will be a significant cause.

Contention/Blocking Related Delays

With parallel DSIs or other production users on the replicate system, you will need to monitor this closely. Of the two listed, the logical lock event (150) corresponds directly to a lock contention issue at either a page or row level. The specific table involved can be diagnosed via monOpenObjectActivity. While monLocks may seem the most apparent, because the lock hash table changes so rapidly, it would be difficult to spot transient blocks. Latch contention is likely caused by inserts into the same index pages by parallel threads and typically is not a major concern as latch duration is extremely short.

Internal/Spinlock Delays

Another common wait event for maintenance users is waiting for a lock on their own ULC cache. This can be caused by two primary issues:

- A low/default setting for the server configuration 'user log cache spinlock ratio'
- ULC flushing to the transaction log

The first one is a setting that is often not changed by DBAs. By default, a single spinlock is used for every 20 ASE processes. For most replicate/standby databases attempting to use parallel DSI threads, the result is that likely only a single spinlock is used for all the parallel threads. Since this is a dynamic parameter, you may wish to reduce it to a low single digit (1-3) to see if it alleviates any delays. A second cause is that when a user's ULC is flushed to the transaction log, the ULC is locked from the user to prevent overwriting of the log pages in the ULC. If the above doesn't help, then this is the likely cause. Unless the ULC is full for the maintenance user, there is likely not a lot that can be done to alleviate this problem.

Warm Standby, MSA and the Need for RepDefs

When Sybase implemented Warm Standby replication - and later Multi-Site Availability (MSA) - the need for individual replication definitions for each table was made optional. The goal was to greatly simplify replication installation and setup for simple systems. However, replication definitions are strongly recommended in high volume systems in most cases, for the following reasons:


- As mentioned earlier, minimal column replication is allowed with replication definitions. Although this is enabled for the standby database in a WS or MSA setup by default without a repdef, a common implementation today includes reporting/historical database feeds from the standby system. When minimal column replication is enabled, replicate database performance can be improved for updates as the number of unsafe indexes is reduced and a direct in-place update may be doable instead of a more expensive implementation.

- Primary keys are identified (see the repdef sketch following this list). Without a primary key, RS has to assume all non-text/image/rawobject columns are part of the primary key. The result is not only that the generated where clause is a lot longer, but also that during execution each part of the where clause has to be compared vs. strictly the primary key values. By having a repdef and defining the primary key, the time it takes to generate the SQL statement within RS is shorter and the execution at the replicate is also shorter.

- In some cases, not having a repdef can lead to database inconsistencies - especially when the table contains a float, real, or double datatype, ansinull is enforced, or other similar conditions exist (such as data modifications due to a trigger if dsi_keep_triggers is on). Even with repdefs, if different character sets/sort orders are used, database inconsistencies could result.
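As an illustration, a minimal sketch of a replication definition that declares the primary key, marks the repdef for use by the standby and enables minimal column replication might resemble the following (the dataserver, database, table and column names are hypothetical; verify the clause syntax against the Replication Server Reference Manual for your version):

create replication definition orders_repdef
with primary at PDS.pdb
with all tables named 'orders'
(order_num int, cust_id int, order_date datetime, order_status char(1))
primary key (order_num)
send standby replication definition columns
replicate minimal columns
go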

While the first two have either been explained before or are self-evident, the last bullet may catch some by surprise. Let's take a look at each of these, with the exception of the discussion on triggers, which is covered in a later section. Before we do this, however, it is extremely important to note that unless the replication definition contains the send standby clause, it will not be used by Warm Standby or MSA for primary key or other determination.

Approximate Numerics & RepDefs

Without a replication definition, all non-BLOB columns are included in the primary key/where clause generation for updates and deletes. Most data movement systems encode data values as ASCII text values for transport between the systems. When applied to the destination system, the destination database language handler translates the ASCII string literal to the binary numerical representation, typically by calling the C library routine atof(). If a different host platform, a different operating system version, or different cpu hardware within the same family is involved, the translation on the destination machine may be slightly different than at the origin. For example, inserting a value of 12.0 on the primary may result in a translated value of 11.999999999999998 at the destination. Even worse, an insert of 12.0 at the primary may get stored as 12.000000000001 at the primary, replicated as 12.00000001 and stored at the replicate as 12.000000002. If basic scientific principles such as rounding to a specified number of significant digits were implemented in the application, this slight difference in the stored value might not be an issue for the application. However, Replication Server does not support significant digit rounding. The problem becomes especially acute when the float column is a member of the primary key, or if the primary key is not specified and all columns are used to define the where clause for update or delete DML operations. Because of the approximate nature of the float datatype, the new value may not match the stored value, resulting in the row not being found. Again, for example, assume that the original system stored a 12.0 perfectly; however, when the row was sent to the destination, it ended up as 11.999999999998. Consider the impact of the following type of query for a subsequent update:
update data_table
    set column = new_value
    where obj_id = 12345
      and float_column = 12.0

Note that the result is not an error. What happens is that the update simply affects 0 rows. Similarly a delete hits zero rows. This can result in either database inconsistencies or errors that stop replication. Consider what happens if an application deletes a row and then later the same row is reinserted. While this does not appear to be common, it can happen in work tables as well as older GUIs that translated primary key updates into delete and insert statements. The result is that at the primary, possibly everything is fine. However, at the replicate, it is likely a duplicate key error will result on the insert. The reason is that the delete will likely miss the desired row due to the float datatype. The subsequent insert will then fail as any unique index or constraint will flag the duplicate and raise the error (unless ignore_dupe_key is set). When database inconsistencies are reported to Sybase with a Warm Standby system, the presence of approximate numeric datatypes/lack of repdefs leads the causes by a wide margin when materialization errors are excluded. As a result, float or any approximate numeric should not be used as a primary key or a searchable column - and if a table contains a float datatype, a replication definition must be used.


ANSINULL enforcement

If ANSINULL is enabled, comparisons using a syntax such as column = null always evaluate to false. By definition then, if a warm standby is created and ansinull is enforced, then without a primary key it is likely that nearly every update and delete will fail to work correctly, as any column containing a null value will result in 0 rows affected. Those who are alert may point out that this requires the connection to issue the set ansinull on statement, whereas the default is set ansinull off (or fipsflagger). However, in 12.5.4 both of these settings can now be exported from a login trigger - consequently care must be taken to ensure that the login trigger doesn't set these automatically for the maintenance user.

Different Character Sets/Sort Orders

If replicating between different character sets and sort orders, a primary key may help reduce database inconsistencies caused by character conversion/sort comparison. The most common example of this is when the original system uses a binary sort order and the standby uses a case-insensitive sort order. Whether or not the table has a replication definition, if any part of the actual key includes character data, database inconsistencies can happen. Consider the case in which last name is part of the primary key and two records are inserted with the only distinction in the key values being that in one case the name is 'McDonald' and in the other 'Mcdonald' - while other non-key attributes may differ. Now, if the table has a repdef, the generated update or delete could resemble:
delete data_table
    where first_name = 'Fred'
      and last_name = 'McDonald'

With a repdef and primary key, the replicated delete may affect more than one row at the replicate. Without a replication definition, the other attributes may differ and prevent the problem. Consequently, if the primary uses a case-sensitive sort order and the replicate uses a case-insensitive sort order, replication definitions may not be recommended - but even then, database consistency is not guaranteed. In other cases, when using different character sets, not specifying a primary key - especially if a localized system only uses numeric keys vs. character data - could result in database inconsistencies. As a result, it is safe to say that any warm standby or MSA implementation between different character sets or sort orders is risky and could result in data inconsistencies.

Query Related Causes

While the language command optimization issue (see Network Receive Delays above) is likely the biggest cause of throughput issues for high-insert intensive environments, a close second - especially for update/delete intensive transactions - is standard query related problems. As an example, as of this writing, a common financial trading application includes a delete statement without a where clause. While it is likely that this was done prior to truncate table being a grantable option (ASE 12.5.2), forcing non-table-owners to truncate the table in this fashion, the biggest problem was that the table did not have any defined primary key constraint nor any unique indices (although an identity column existed and had a nonunique index defined solely on that column). Equally problematic was that this table easily contained ~1 million rows or more. In a typical lazy standby implementation that does not have a repdef defined, the result is instantaneously disastrous as the RS latency stretches into hours. The problem is that while the delete is a single statement at the primary, as you can guess by now, each row becomes a single delete at the replicate - and lacking any index information based on the where clause, it promptly becomes a table scan for each delete. One million table scans, to be precise. While this may be an extreme example, when triggers are enabled, procedure replication is being used, or repdefs are not being used, you will need to carefully monitor the query performance at the replicate. The main tables that will help with this are illustrated here:


Figure 40 - MDA Tables Useful for Query Analysis


Note that the monSysPlanText table was excluded from the above. While the query plan could confirm what is happening, the need to configure an appreciable pipe size and the impact that configuration value has on execution speed have led us to avoid it. However, for particularly perplexing issues, it still may be required. To begin with, you will want to make sure that the monProcessActivity.TableAccesses, IndexAccesses and LogicalReads/PagesWritten values have the correct relative ratios for the maintenance users. For example, if the number of TableAccesses is high, it could be an indication of a table scan - which should also be evident as the number of LogicalReads may be orders of magnitude higher than expected. The obvious question is: what are the expected orders of magnitude? The answer is that it depends on the operation, the minimal column replication setting and the volatility of the indexed columns. Consider the following table:

Operation: Insert
I/O pattern: 1 index traversal to locate the insert point (reads), a write for the data row; index traversals to locate the index key insert points and writes for each index key
Typical Cost: 50-75

Operation: Update
I/O pattern: PK index traversal to locate the row, a write for the data row, index traversals for each unsafe index plus index key overwrites
Typical Cost: 10-50

Operation: Delete
I/O pattern: PK index traversal to locate the row, a write to delete the row, index traversals for all indexes plus index key deletion
Typical Cost: 50-75
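Using the #monProcessActivity snapshot captured earlier, a hedged sketch of screening these ratios for the maintenance user connections might look like the following (the counters are cumulative since login, so deltas between successive snapshots are more meaningful than the absolute values shown here):

select SPID, KPID, TableAccesses, IndexAccesses, LogicalReads, PagesWritten,
       ReadsPerWrite = case when PagesWritten > 0
                            then convert(numeric(12,2), LogicalReads) / PagesWritten
                            else convert(numeric(12,2), LogicalReads) end
    from #monProcessActivity
    order by 7    -- order by ReadsPerWrite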

As a result, if the delta between two samples shows that the maintenance user did 100,000 logical I/Os but only 60 page writes, this points to a likely indexing issue. To find the issue, the next step is to try to isolate which object it is occurring on. There are several possibilities for this. The first is monProcessObject, but it is unlikely to help as it only records the object statistics for the currently executing statement in the batch. Consequently, unless the server just happened to still be executing the bad statement, it is unlikely that this will provide any useful information. monProcessStatement has the same issue. The second likely answer is to use monOpenObjectActivity. If no other production users are on the system, the task is a simple comparison of the LogicalReads/PagesWritten ratio - and in addition, you can look for a table in which IndexID = 0 and LastUsedDate is non-null (indicative of a table scan). Failing that, you can use monSysStatement and again compare the LogicalReads/PagesModified (and in ASE 15.0.1 the new RowsAffected column) for the maintenance user SPID/KPID pairs. While this can prove beyond a shadow of a doubt that an ineffective index was being used (or, if proc replication or triggers are enabled, bad logic within them), the actual table involved cannot be identified without monSysSQLText. Regardless, if triggers are still enabled or procedure replication is occurring, you will need to watch monSysStatement closely for the maintenance user and attempt to keep the total I/O cost of any triggers/procedures to the absolute minimum - which may mean that triggers have to be rewritten to avoid joins with the inserted/deleted tables and be optimized for single row DML statements.

Triggers & Stored Procedures

In this discussion, we are not focusing on stored procedure replication - but rather on what can happen when triggers are enabled and, in particular, when a trigger calls stored procedures at the replicate database.

Triggers & Database Inconsistencies

Other than the float/approximate datatype issues, the second (and a distant second) most common cause of inconsistencies as a result of not having replication definitions is when triggers are enabled. For a standard warm standby, triggers are disabled by default via dsi_keep_triggers. However, if replicating stored procedures, DBAs may have changed this setting, as they have been instructed to do so to ensure the integrity of actions within replicated procedures. Or, some DBAs have simply enabled triggers out of fear that without them database inconsistencies could result. Additionally, for MSA implementations, the default setting is that triggers are enabled. Some of the most common fields modified by triggers include auditing data (such as last update time), aggregate values, derived values, etc. Typically, these columns are not part of the primary key. As a result, if no replication definition is found, the updates or deletes may fail as the actual values for these columns may differ. There is a common fallacy that triggers should be enabled for all replication except Warm Standby and that this is the only way to guarantee database consistency. Actually, this is only true for the following situations:

1. Not all the tables in the database are being replicated, and one of the replicated tables has a trigger that maintains another table (i.e. a history table) that is not replicated, but similar table maintenance is desired at the replicate.
2. A stored procedure that is replicated has DML statements that affect tables with triggers that update other tables (replicated or not) in the same database.

The latter reason is likely the most common; however, leaving dsi_keep_triggers set to on just for this case is grossly inefficient, as a more optimal solution would be to have the proc check @@options and manually issue set triggers on/off as necessary. To balance the above, there are cases where leaving the triggers enabled would result in database inconsistencies as well. Consider the following:

1. All tables in the database are replicated.
2. The trigger calls a stored procedure that does a rollback transaction or returns a negative return code between -1 and -99.

The first case is fairly obvious. Any trigger that causes an insert (i.e. maintains a history table) or does an update to an aggregate value will cause problems at the replicate: either duplicate key errors will be thrown, or the triggered DML statements replicated from the primary will clobber the triggered changes made at the replicate and the values may differ.

The second case is really interesting and requires a bit of knowledge of ASE internals. Returning a negative number as a stored procedure return code is fairly common among SQL developers. Now, we all know that just because something is documented as something developers shouldn't do doesn't mean that we all obey it. Case in point: the ASE Reference Manual clearly states that:
One aspect for the customer to consider is that return values 0 through -99 are reserved by Sybase. For example:

 0    Procedure executed without error
-1    Missing object
-2    Datatype error
-3    Process was chosen as deadlock victim
-4    Permission error
-5    Syntax error
-6    Miscellaneous user error
-7    Resource error, such as out of space
-8    Non-fatal internal problem
-9    System limit was reached
-10   Fatal internal inconsistency
-11   Fatal internal inconsistency
-12   Table or index is corrupt
-13   Database is corrupt
-14   Hardware error

Now then, consider the following schema:


use pubs2
go
create table trigger_test (
    rownum      int          identity  not null,
    some_chars  varchar(40)            not null,
    primary key (rownum)
) lock datarows
go
create table hist_table_1 (
    rownum      int       not null,
    ins_date    datetime  not null,
    primary key (rownum, ins_date)
) lock datarows
go
create table hist_table_2 (
    rownum      int       not null,
    ins_date    datetime  not null,
    primary key (rownum, ins_date)
) lock datarows
go
create procedure bad_example
    @rownum int
as
begin
    declare @curdate datetime
    select @curdate = getdate()
    insert into hist_table_2 values (@rownum, @curdate)
    return -4
end
go
create trigger trigger_test_trg on trigger_test for insert
as
begin
    declare @currow int
    select @currow = rownum from inserted
    insert into hist_table_1 values (@currow, getdate())
    exec bad_example @currow
end
go

Note the highlighted line - the proc returns -4 with no error raised, just a negative return code. We would expect that by inserting a row into trigger_test, the trigger would fire, inserting a row into hist_table_1, then calling the proc, which would insert a row into hist_table_2. Let's try it:


---------- isql ----------
1> use pubs2
1> truncate table trigger_test
1> begin tran
1> insert into trigger_test (some_chars) values ("Testing 1 2 3...")
2> select @@error
(1 row affected)
1> commit tran
1> select * from trigger_test
2> select * from hist_table_1
3> select * from hist_table_2
 rownum      some_chars
 ----------- ----------------------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
Output completed (0 sec consumed) - Normal Termination

What happened? It looks like the insert happened - we did get back the standard (1 row affected) message, after all, and no error was raised. But curiously, neither did we get the results of the select @@error - and all the tables are empty. Let's change the trigger slightly to:
create trigger trigger_test_trg on trigger_test for insert
as
begin
    declare @currow int
    select @currow = rownum from inserted
    insert into hist_table_1 values (@currow, getdate())
    exec bad_example @currow
    select @@error
    select * from hist_table_1
end
go

And add an extra insert to the execution:


---------- isql ----------
1> use pubs2
1> begin tran
1> insert into hist_table_1 values (0, getdate())
2> insert into trigger_test (some_chars) values ("Testing 1 2 3.....")
3> select @@error
(1 row affected)

 -----------
           0

(1 row affected)
 rownum      ins_date
 ----------- --------------------------
           0       Jan  4 2006  1:21AM
         401       Jan  4 2006  1:21AM

(2 rows affected)
1> commit tran
1> select * from trigger_test
2> select * from hist_table_1
3> select * from hist_table_2
 rownum      some_chars
 ----------- ----------------------------------------
(0 rows affected)


 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
 rownum      ins_date
 ----------- --------------------------
(0 rows affected)
Output completed (0 sec consumed) - Normal Termination

Whoa! Still no error inside the trigger immediately after the proc call with -4 returned, and the rows appeared to be inserted... but no data. The reason is that if a nested procedure inside a trigger (or another procedure) returns a negative return code, ASE assumes that the system actually did raise the corresponding error (i.e. -4 is a permission problem) and that it is supposed to roll back the transaction. All of course without errors - which means that if this happened at the replicate database, the replicate would get out of synch with the primary and no errors would get thrown. Ouch!

Trigger/Procedure Execution Time

Besides the data inconsistency problems when triggers exist, the biggest problem with triggers is that the typical coding style for triggers is not optimized for single row executions. It is not uncommon to see throughout a trigger multiple joins to the inserted/deleted tables, or joins that could be eliminated using variables if a single row was all that was affected. This results in a lot of unnecessary extra I/O that lengthens the trigger execution time needlessly. Trigger and procedure execution times are extremely critical. One metric of interest: trigger-based referential integrity is 20 times slower than declarative integrity (via constraints). Remember, in order to maintain commit order, the Replication Server basically applies the transactions in sequence - even in parallel DSI scenarios, the threads block and wait for the commit order. As a result, while procedure execution is great for Replication Server performance from a thread processing perspective, the net effect is that as soon as a long procedure begins execution, the following transactions in the queue are effectively delayed. Note that this is not unique to stored procedures - long running transactions will have the same effect (i.e. replicating 50,000 row modifications in a single transaction vs. a procedure that modifies them has the same effect at the replicate system - however, the procedure is much less work for the Replication Server itself). As a result, particular attention should be paid to stored procedure and trigger execution times (if you for some odd reason opt not to turn triggers off for that connection). Any stored procedure or trigger that employs cursors, logged I/O in tempdb, joins with the inserted/deleted tables, etc. should be a candidate for rewriting for performance. Ideally, triggers should be disabled for replication at the replicate via the DSI configuration dsi_keep_triggers.

Key Concept #19: Besides possibly causing database consistency issues, trigger execution overhead is so high and the probable coding style so inefficient that triggers may be the primary cause of replication throughput problems - as a consequence, triggers should be disabled via dsi_keep_triggers until proven necessary, and then enabled individually if possible.

To see how to individually enable triggers, refer back to the trick of replicating SQL statements via a procedure call and using @@options to detect the trigger status.

Concurrency Issues

In replicate-only databases, concurrency is mainly an issue between the parallel DSI threads or when long running procedures execute and lock entire tables. However, in shared primary configurations, workflow systems or other systems in which the data in the replicate is updated frequently, concurrency could become a major issue. In this case, user transactions and Replication Server maintenance user transactions could block/deadlock each other.
This may require decreasing the dsi_max_xacts_in_group parameter to reduce the lock holding times at the replicate as well as ensuring that long running procedures replicated to that replicate database are designed for concurrent environments.
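For reference, a hedged sketch of adjusting these DSI settings from the Replication Server (the connection name RDS.rdb and the group size of 5 are illustrative; some connection parameters only take effect after the connection is suspended and resumed):

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set dsi_keep_triggers to 'off'
go
alter connection to RDS.rdb set dsi_max_xacts_in_group to '5'
go
resume connection to RDS.rdb
go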

Key Concept #20: In addition to concurrency issues between maintenance user transactions when using Parallel DSIs, if the replicate database is also updated by normal users, considerable contention between maintenance user and application users may exist. Reducing transaction group sizes as well as designing long running procedures to not cause contention are crucial tasks to ensuring the contention does not degrade business performance at the replicate or Replication Server throughput.

Similar to any concurrency issue, depending on which resources are the source of contention, it may be necessary to use different locking schemes, etc. at the replicate than at the primary (or the same if Warm Standby). Consider the following activities:

Additional Indexes - Additional indexes, particularly if replicating to a denormalized schema or data warehouse, could increase contention. While not necessarily avoidable, it may require a careful pruning of OLTP-specific indexes.

DOL Locking - Eliminate index contention and data row contention by implementing DOL locking at the replicate system.

Table Partitioning - Provide parallel DSIs multiple last pages to avoid contention without implementing DOL locking.

Triggers Off - Have the RS DSI disable triggers - especially data validation triggers.

Obviously, the above list is not complete, but it may provide ideas for resolving contention issues when the contention is not due to locks being held longer because of transaction grouping.
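As an illustrative sketch of the DOL locking and table partitioning strategies at the replicate (the table name is hypothetical; the partition syntax shown is the pre-ASE 15 round-robin form - ASE 15 uses partition by roundrobin instead):

-- Switch the replicate copy of the table to datarows (DOL) locking
alter table orders lock datarows
go
-- Give the parallel DSIs multiple insertion points (pre-ASE 15 round-robin partitioning)
alter table orders partition 8
go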


Procedure Replication
Is it true that I can't replicate both procedures and affected tables?
Procedure vs. Table Replication

The above question reflects a common misconception that you cannot replicate both procedures and the tables modified by replicated procedures. This is partially based on the following paragraph:
If you use function replication definitions, do not attempt to replicate affected data using table replication definitions and subscriptions. If the stored procedures are identical, they will make identical changes to each database. If the affected tables are also replicated, duplicate updates would result. - page 9-3 in Replication Administration/11.5

However, consider the following paragraphs:


In replicating stored procedures via applied functions, it may be advisable to create table replication definitions and subscriptions for the same tables that the replicated stored procedures will affect. By doing this you can ensure that any normal transactions that affect the tables will be replicated as well as the stored procedure executions. However, DML inside stored procedures marked as replicated is not replicated. Thus, in this case, you must subscribe to the stored procedure even if you also subscribe to the table. - page 3-145 in Replication Reference/11.5

Confused? A lot of people are. What it really refers to is that if you replicate a procedure, the DML changes within the procedure will not be replicated, no matter what. The way this is achieved is that normally, as a DML statement is logged, if the object's OSTAT_REPLICATE flag is set, then the ASE logger sets the transaction log record's LSTAT_REPLICATED flag. For a stored procedure, this means that the stored procedure execution record receives the LSTAT_REPLICATED flag, and the ASE logger does not mark any DML records for replication until after that procedure execution has completed. This is illustrated with the following sample fragment of a transaction log:
XREC_BEGINXACT              (implicit transaction)
XREC_EXECBEGIN proc1        (proc execution begins)
XREC_INSERT Table1          (insert DML inside proc)
XREC_INSERT Table2          (insert DML inside proc)
XREC_DELETE Table3          (delete DML inside proc)
XREC_EXECEND                (end proc execution)
XREC_ENDXACT                (end implicit tran)

Only the procedure execution records - not the DML records inside the procedure - will have the LSTAT_REPLICATED flag set, and consequently be forwarded by the Replication Agent to the Replication Server. Attempting to force both to be replicated (i.e. executing a replicated procedure in one database with replicated DML modifications in another) could lead to database inconsistencies. The only way to force this replication is to a) replicate a procedure call in one database and b) have that procedure modify data in a table that is also replicated in another database. This would allow both to be replicated, as two independent log threads would be involved. The one evaluating the DML for replication would not be aware that the DML was even inside a procedure that was also replicated. Which brings us to the point the second reference was making. The second reference stated that it may be advisable to create table replication definitions and subscriptions for the same tables. The reason for this is exactly the fact that DML within a replicated procedure is NOT replicated, and it takes a bit of reverse logic to understand the impact. Consider the scenario of New York, London, Tokyo, San Francisco and Chicago all sharing trade data. A procedure at New York is executed at the close of the day to update the value of mutual funds based on the closing market position of the funds' stock contents. All the other sites subscribe to the mutual fund portfolio table. Now, consider what would happen if only San Francisco and Chicago subscribed to the procedure execution. Neither London nor Tokyo would ever receive the updated mutual fund values! Why? Since the DML within the replicated procedure is not marked for replication, the Replication Agent would only forward the procedure execution log records and NOT the logged mutual fund table modifications. Since neither subscribed to the procedure, they would not receive anything. This is illustrated below:


Figure 41 - Replicated Procedure & Subscriptions (the New York inbound queue contains the exec proc1 record; the outbound queues for Chicago and San Francisco - which subscribe to the procedure - receive the proc execution, while the London and Tokyo outbound queues receive nothing)


Which brings us to the following concept:

Key Concept #21: If replicating a procedure as well as the tables modified by the procedure, any replicate that subscribes to one should also subscribe to the other to avoid data inconsistency.

A notable exception is that if replicating to a data warehouse, the data warehouse may not want to subscribe to a purge or archive procedure executed on the OLTP system. However, there is a gotcha when replicating procedures and tables. If replicating procedures and the dsi_keep_triggers setting is off, database inconsistencies might develop. The reason is evident in the scenario below:

1. At the primary, a replicated procedure is executed. In the procedure, an insert occurs on Table A. Table A's trigger modifies Table B.
2. The procedure is replicated as normal via the Rep Agent to the Replication Server.
3. When applied, the procedure is executed. Because triggers are off, only the insert to Table A occurs.

Preventing this can be done in one of two ways. The first is the obvious: set dsi_keep_triggers to on. However, this could significantly affect throughput. The other, and possibly better, approach is to consider how the triggers got disabled in the first place - via a function string executing the command set triggers off. This can then be countered in the procedure logic via a sequence similar to:
create procedure proc_a
    @param1 datatype [, @paramn datatype]
as
begin
    if proc_role("replication_role") = 1
        set triggers on

    -- dml statements

    if proc_role("replication_role") = 1
        set triggers off

    return 0
end

By checking that the user has replication_role, other users executing the same procedure will not get permission violations. This brings up another key concept about procedure replication:

Key Concept #22: If replicating procedures, special care must be taken to ensure that DML triggered operations within the procedure are also handled - or otherwise you risk an inconsistent database at the replicate.

Procedure Replication & Performance

Now that we have cleared that matter up and we understand that we can replicate procedures and the tables they affect simultaneously, the question is how this affects performance.

The answer, as in all performance questions, is: it depends. Replicating procedures can both improve replication performance as well as degrade it. The former is often referenced in replication design documents and, consequently, will be discussed first.

Reduced Rep Agent & RS Workload

Consider a normal retail bank. At a certain part of the month, the bank updates all of the savings accounts with interest calculated on the average daily balance during that month. This literally can be tens of thousands to hundreds of thousands of records. If replicating the savings account table to regional offices, failover sites, or elsewhere, this would mean the following:

1. The Replication Agent would have to process and send to the Replication Server every individual account record.
2. The account records would have to be saved to the stable device.
3. Each and every account record would be compared to subscriptions for possible distribution.
4. The account records would have to be saved again to the stable device - once for each destination.
5. Each account record would have to be applied as an individual update at each of the replicates.

The impact would be enormous. First, beyond a doubt, the Replication Agent would lag significantly. Secondly, the space requirements and the disk I/O processing time would be nearly insurmountable. Third, the CPU resources required for tens to hundreds of thousands of comparisons are enormous. And lastly, the time it would take to process that many individual updates would probably exceed the required window. How would replicating stored procedures help? That's easy to see. Rather than updating the records via a static SQL statement at the primary, a stored procedure containing the update would be executed instead. If this procedure were replicated, then the Replication Agent would only have to read/transfer a single log record to the Replication Server, which in turn would only have to save/process that single record. The difference could be hours of processing saved - and the difference between a successful replication implementation and one that fails because the replicate can never catch up due to latency caused by excessive replication processing requirements.

Key Concept #23: Any business transaction that impacts a large number of rows is a good candidate for procedure replication, along with very frequent transactions that affect a small set of rows.
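As a hedged sketch of what marking such a procedure for replication might involve (the procedure, parameter, server and database names are hypothetical, and the function replication definition syntax is abbreviated - verify it against the Replication Server Reference Manual for your version):

-- In the primary database (ASE): mark the procedure for replication
sp_setrepproc calc_monthly_interest, 'function'
go

The corresponding function replication definition and subscription would then be created at the Replication Server, roughly along these lines:

create function replication definition calc_monthly_interest_repdef
with primary at PDS.bankdb
with all functions named calc_monthly_interest
(@interest_month datetime)
go

create subscription calc_interest_sub
for calc_monthly_interest_repdef
with replicate at RDS.bankdb
without materialization
go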

Increased Latency & Contention at Replicate

So, if stored procedures can reduce the disk I/O and Replication Server processing, how can replicating a stored procedure negatively affect replication? The answer is two-fold: 1) the latency between the begin at the primary and the commit at the replicate; and 2) the extreme difficulty in achieving concurrency in delivering replicated transactions to the replicate once the replicated procedure begins to be applied. Let's discuss #1. Remember, Replication Server only replicates committed transactions. Now, using our earlier scenario of the savings account interest procedure, let's assume that the procedure takes 4 hours to execute. We would see the following behavior:

1. The procedure begins execution at 8:00pm and implicitly begins a transaction.
2. The Replication Agent forwards the procedure execution record to RS nearly immediately.
3. The RS SQT thread caches the execution record until the procedure completes execution and the completion record is received via the implicit commit.
4. At midnight the procedure completes execution.
5. Within seconds, the Replication Agent has forwarded the commit record to RS and RS has moved the replicated procedure to the Data Server Interface (DSI).
6. The DSI begins executing the procedure at the replicate shortly after midnight.
7. All things being equal, the procedure will complete at the replicate at 4:00am.

Consequently, we have a total of 8 hours from when the process begins until it completes at the replicate, and 4 hours from when it completes at the primary until it completes at the replicate. This timeframe might be acceptable to some businesses. However, what if the procedure took 8 hours to execute? Basically, the replicate would not be caught up for several hours after the business day began, which may not be acceptable for some systems - such as stock trading systems with more real time requirements. An example of this can be illustrated with the following scenario. Let's assume that we have a bank with a sustained 24x7 transaction rate of 20,000tph and that the interest calculation procedure takes 8 hours to run.

For the sake of the example, let's assume that we have the Replication Server tuned to the point that it is delivering 500tpm, or 30,000tph. This is illustrated in the following diagram (each of the lines represents one hour's worth of transactions (20K = 20,000tph)):
Figure 42 - Procedure & Transaction Execution At The Primary (timeline T00-T17: 340,000 transactions in 17 hours, plus the interest calculation procedure, = 20,000tph)


Normally we would be happy as it would appear that we have a 50% surge capacity built into our system and we can go home and sleep through the night. Except that we would probably get woken up at about 4am by the operations staff due to the following problem:
Figure 43 - Procedure & Transaction Execution At The Replicate (timeline T00-T17: 270,000 transactions in 17 hours, plus the interest calculation procedure, = 70,000 transactions behind)


Even at 30,000tph, we are significantly behind - more than 7 hours, in fact. Why? Remember, transactions must be delivered in commit order. Consequently, a full 240,000 transactions must be delivered by the RS before it can send the proc for execution. This delays the procedure from starting until 4 hours after it completes at the primary. Once we are executing the procedure, it must complete before any other transactions can be sent/committed (discussed in the next paragraph). Whatever the cause, we are now 70,000 transactions behind - which sounds not that bad, a mere two hours or so at the 30,000tph rate (2h:20min to be exact). But during those 140 minutes, roughly another 47,000 transactions arrive! Another way to look at it is that the RS has a net gain of only 10,000tph; consequently, being 70,000 transactions behind represents 7 hours before we are caught up. That explains the latency issue - what of the concurrency? Why can't the normal transactions continue to execute at the replicate simultaneously with the procedure execution, the same way they did at the primary? This requires a bit of thinking, but consider this: while the procedure is executing at the primary, concurrent transactions by customers (i.e.

ATM withdrawals) may also be executing in parallel, as illustrated in the first timeline above. Since they would commit far ahead of the interest calculation procedure, they would show up at the replicate within a reasonable amount of time. Assuming this pattern continues even after the procedure completes (i.e. checks clearing from business retailers), as illustrated in the second timeline, the following would happen:

1. The procedure completes at the primary. It is followed by a steady stream of other transactions - possibly even a batch job requiring 3 hours to run.
2. Since RS guarantees commit order at the replicate, RS processes the transactions in commit order and internally forwards them to the DSI thread for execution at the replicate.
3. If only using a single DSI, the follow-up transactions would not even begin until the interest procedure had committed some 8 hours later. If using multiple DSIs and there were no contention, the DSI would have to ensure that the follow-up transactions did not commit first, and would do so by not sending the commit record for the follow-up transactions until the procedure had finished.
4. Due to contention, the replicated batch process may not even begin execution via a parallel DSI until the replicated interest procedure committed.

In addition to the fact that transactions committed shortly after the interest procedure suddenly have an 8 hour latency attached, the question that should come up is "Can the Replication Server catch up?". The answer is: doubtfully prior to the start of the business day. So:

Key Concept #24: Replicated procedures with long execution times may increase latency by delaying transactions from being applied at the replicate. The CPU and disk I/O savings within RS need to be balanced against this before deciding to replicate any particular procedure.

As a result, it may be advisable to actually replicate the row modifications. This could be done by not replicating the procedure, but instead having the procedure cursor through each account. This would be the same as atomic updates, each a separate transaction (after all, there is no reason why Annie Aunt's interest calculation needs to be part of the same transaction as Wally the Walrus' - but whether or not that is how it is done at the primary, at the replicate they would all be part of the same transaction, since the entire procedure would be replicated and applied within the scope of a single transaction). While it may take RS several hours to catch up, it just might be less than the latency incurred by replicating the procedure. Is there a way around this problem without replicating the individual row updates? Possibly. In this particular example, assuming the average daily balance is stored on a daily basis (or in some other form so that changes committed out of order do not affect the final result), a multiple DSI approach could be used at the replicate system, in which the replicated procedure could use its own dedicated connection to the replicates. Consequently, the Replication Server would be able to keep up with the ongoing stream of transactions while concurrently executing the procedure. However, this would only work where having a transaction that committed at the primary after the interest calculation but commits before it at the replicate does not cause a disparity in the balance. More will be discussed about this approach in a later section, after the discussion about Parallel DSIs. The following guidance is provided to determine whether to replicate the procedure or allow the affected tables to replicate as normal. You probably should consider replicating stored procedures when:

OLTP Procs - Frequently executed stored procedures with more than 5 DML operations and fast execution times.

Purge Procs - Purge procedures when one of the targets for replication is a reporting system used for historical trend analysis.

Large Update Procs - Procedures containing mass updates in a single statement, where the individual rows affected, if replicated, would exceed any reasonable setting for the number of locks.

You should consider not replicating the procedure and allowing the affected rows to replicate when:

Cursor Procs - Procedures that process a large set using cursor processing, applying the changes as atomic transactions.

Queue Procedures - Procedures that process sequential lists such as job queues (replicating these could result in inconsistent databases).

Long Running Procs - Procedures that perform a lot of I/O (selects or updates), causing them to have a long runtime (more than a few seconds).

- System Functions in Proc - Procedures that contain calls to getdate(), suser_name(), user_name() or other system functions, which when executed at the replicate by the maintenance user will result in different data values than at the primary.
- Triggers Executed by Proc - Procedures that contain DML operations that in turn invoke normal database triggers, particularly if the connection's dsi_keep_triggers is set to off, disabling trigger execution (this can be corrected by using set triggers on/off within the procedure; however, with a vendor package you may not have the ability to change the source).
- Improper Transaction Management in Proc - The procedure does not implement proper transaction management (discussed earlier), unless it can be corrected to behave properly.

As with all guidance, this is offered as a starting point; you should test your transactions to determine which approach is best for your environment. A sketch of the cursor-based alternative mentioned above follows.
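For the Cursor Procs case, a minimal sketch of the pattern described earlier - not replicating the procedure, but letting it cursor through the accounts so that the individual row changes replicate as small atomic transactions - might look like the following. The table, column and procedure names are hypothetical; in this approach the accounts table (not the procedure) would be marked for replication (e.g. with sp_setreptable).

create procedure calc_monthly_interest @interest_rate decimal(9,6)
as
begin
    declare @acct_id int
    -- cursor over the accounts to be charged interest
    declare acct_cur cursor for
        select acct_id from accounts where acct_status = 'A'
    open acct_cur
    fetch acct_cur into @acct_id
    while @@sqlstatus = 0
    begin
        -- each update is its own implicit transaction, so it replicates as a
        -- small atomic unit instead of one multi-hour transaction at the replicate
        update accounts
            set balance = balance + (balance * @interest_rate / 12.0)
            where acct_id = @acct_id
        fetch acct_cur into @acct_id
    end
    close acct_cur
    deallocate cursor acct_cur
    return 0
end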

Procedures & RPCs vs. Language (DML)

From the very earliest times, we have heard that stored procedures are faster than language batches. This is not always true for reads - but it is certainly true for write operations - and for the same reason in both cases: query optimization. As we all know, stored procedures are optimized at the initial execution and subsequent executions re-use this pre-optimized plan. While this can be a problem for reports and other complex queries that have a lot of flexibility in query search arguments, it can significantly help DML operations. If you think about it, each DML statement sent by RS to the replicate database goes through the same sequence:

1. Command parsing
2. Query compilation/object resolution
3. Query optimization
4. Query execution

It turns out that steps 2 and especially 3 take significantly more time than one would think. While the difference varies by platform and CPU speed, a stored procedure containing a simple insert executes anywhere from 2-3x faster for C code and up to 10x faster for JDBC applications than the individual insert/values statement. The obvious question is how this can be exploited for a DML-centric process such as Replication Server. The answer lies in understanding what constitutes a stored procedure in ASE:

- Traditional stored procedure database objects - executed either as language calls or RPCs
- Fully prepared SQL statements/dynamic SQL - these create a dynamic procedure on the server which is invoked via the RPC interface
- Queries using the ASE statement cache - a query contained in the statement cache is compiled as a dynamic procedure

The first is quite easily understood - we are referring to the usual database objects created using the create procedure T-SQL command. As mentioned, a stored procedure can either be executed as a language call or an RPC call. Most scripts that call stored procedures via isql are using a language call, while inter-server invocations such as SYB_BACKUP...sp_who get executed using the RPC interface (particularly in that case, as it is running against the Sybase Backup Server, which doesn't support a language interface). Fully prepared statements or dynamic SQL are used in very high speed systems with a large number of repeating transactions. For JDBC, this involves setting the connection property DYNAMIC_PREPARE=true, while for CT-Lib applications, the ct_dynamic() statement is used along with ct_param() and a slightly different form of ct_send(). In either case, what happens is that the ASE server creates a dynamic procedure that is executed repeatedly via the RPC interface. A pseudo-code representation of application logic for this might resemble:
stmtID = PrepareSQLStatement("insert into table values (?,?)")
while n <= num_rows
    stmtID.setParam(1, <value[n,1]>)
    stmtID.setParam(2, <value[n,2]>)
    stmtID.execute()
end loop
stmtID.DropSQLStatement()

Note that the language portion of the command is parameterized as it is sent to the server - and that it is only sent to the server one time. Subsequent calls simply set the parameter values and execute the procedure. Obviously, since Replication Server is a compiled application, we can't change the CT-Lib source code to invoke fully prepared statements, so this approach is not open to us (although a future enhancement to Replication Server 15.0 is considering an implementation using dynamic SQL).

Note, however, that unlike language statements, RPC calls cannot be batched - yet early tests suggest that a dynamic SQL implementation within RS, even without command batching, is much faster than language statements with command batching. Additionally, there likely will be restrictions on using custom function strings, for obvious reasons.

The statement cache approach likely will not help either. The ASE statement cache, introduced in ASE 12.5.2, is a combination of the two: since ASE 12.5.2, as each SQL language command is parsed and compiled, an MD5 hash value is created. If the statement cache is enabled, this hash is compared to existing query hashes already in the statement cache. If no match is found, the query is optimized and converted into a dynamic procedure much like in the preceding paragraph. If a match is found, the query literals are used much like parameters to a prepared statement and the pre-optimized version of the query is executed. However, the reason it is stated that this may not help is that in early 12.5.x and 15.0 ASEs, the statement MD5 hash value included the command literals. For example, the following queries would result in two different hashkeys being created:
-- query #1
select * from authors where au_lname = 'Ringer'
-- query #2
select * from authors where au_lname = 'Greene'

The problem with this approach was that statements executed identically, with only a change in the literal values, would still incur the expense of optimization. Additionally, the restrictions on the statement cache largely limited it to update, delete and select statements. Insert ... values() statements were not cacheable, nor were SQL batches such as if/else constructs. As a result of these restrictions, the statement cache in ASE 12.5.x does not benefit Replication Server unless there is an extremely high incidence of the same row being updated. In ASE 15.0.1, a new configuration option, statement cache literal parameterization, was introduced along with the associated session-level setting set literal_autoparam [off | on]. When enabled, the constant literals in query texts are replaced with variables prior to hashing. While this may result in a performance gain for some applications, there are a few considerations to keep in mind:

- Just like stored procedure parameters, queries using a range of values may get a bad query plan - for example, a query with a where clause specifying where date_col between <date 1> and <date 2>.
- Techniques for finding/resolving bad queries may not be able to join strictly on the hashkey, as the hashkey may represent multiple different query arguments.
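A minimal sketch of what enabling this might look like is shown below. The statement cache size value is arbitrary, and the exact server-wide option name should be verified against the sp_configure output for your ASE version; the session-level setting is the one named above.

-- server-wide: allocate some memory to the statement cache (value arbitrary)
sp_configure 'statement cache size', 10000
go
-- per session (e.g. for the RS maintenance user connection):
set literal_autoparam on
go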

However, the restriction on insert/values is still in effect; consequently, the improved statement caching will only help applications with significant update or delete operations. If your system experiences a large number of update or delete operations, this should be considered - if not for the server, certainly for the maintenance user, by altering the rs_usedb function to include the session setting syntax. ASE 15.0.2 is supposed to lift the restriction on caching insert statements, so when it is released, RS users may see a considerable gain in throughput. Consequently, for insert-intensive applications, the only means to exploit this today (through RS 15.0 ESD #1) is to use custom function strings that call stored procedures, creating a stored procedure for each operation (insert, update, delete). You may wish to test using a function string output style of RPC as well as language to determine whether language-based procedures with command batching give you performance gains over the RPC style or vice versa. However, keep in mind that both ASE 15.0.2 and a future release of RS that will implement dynamic SQL/fully prepared statements may eliminate the need for this. As a result, it may be more practical to upgrade to those releases when available vs. converting to function strings and procedure calls. For reference, a sketch of the function-string approach is shown below.
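The sketch consists of an insert procedure at the replicate plus a custom rs_insert function string that invokes it as an RPC. The procedure, table, replication definition and function-string class names are all hypothetical, and a function-string class that permits customization (e.g. one derived from rs_sqlserver_function_class) is assumed to already exist.

-- at the replicate database: one procedure per operation (insert shown)
create procedure ins_authors
    @au_id varchar(11), @au_lname varchar(40), @au_fname varchar(20)
as
begin
    insert into authors (au_id, au_lname, au_fname)
        values (@au_id, @au_lname, @au_fname)
    return 0
end

-- in the Replication Server: replace the generated insert with an RPC call
create function string authors_repdef.rs_insert
for my_derived_class
output rpc
'execute ins_authors
    @au_id = ?au_id!new?,
    @au_lname = ?au_lname!new?,
    @au_fname = ?au_fname!new?'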
Procedure Transaction Control

One of the least understood aspects of stored procedure coding is proper transaction control. It is commonly thought that the following procedure template has proper transaction control and represents good coding style. However, in reality, it demonstrates improper transaction control.

create procedure proc_name <parameter list>
as
begin
    <variable declarations>
    begin transaction tran_name
    insert into table1
    if @@error > 0
        rollback transaction
    insert into table2
    if @@error > 0
        rollback transaction
    insert into table3
    if @@error > 0
        rollback transaction
    commit transaction
end

The problem arises when the procedure is called from within another transaction, as in:
begin transaction tran_1
exec proc_name <parameters>
commit transaction

The reason the problem occurs is the mistaken belief that if nested commit transactions only commit the current nested transaction, then a nested rollback only rolls back to the proper transaction nesting level. Consider the following code:
begin tran tran_1
    <statements>
    begin tran tran_2
        <statements>
        begin tran tran_3
            <statements>
            if @@error > 0
                rollback tran tran_3
        commit tran tran_3
        if @@error > 0
            rollback tran tran_2
    commit tran tran_2
    if @@error > 0
        rollback tran tran_1
commit tran tran_1
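What actually happens at runtime is worth seeing once. In the following minimal illustration (table1 as in the earlier template, values hypothetical), the unnamed rollback in the inner block undoes everything back to the outermost begin tran:

begin tran outer_tran
    insert into table1 values (1)    -- @@trancount = 1
    begin tran inner_tran            -- @@trancount = 2
        insert into table1 values (2)
        rollback tran                -- rolls back to the OUTERMOST begin tran; @@trancount drops to 0
-- at this point BOTH inserts have been undone; a subsequent commit tran would
-- fail because there is no longer an open transaction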

While nested commits do only commit the innermost transaction, application developers need to keep the following rules in mind, particularly regarding rollback transaction statements:

- Rollback transaction without a transaction_name or savepoint_name rolls back a user-defined transaction to the beginning of the outermost transaction.
- Rollback transaction transaction_name rolls back a user-defined transaction to the beginning of the named transaction. Though you can nest transactions, you can roll back only the outermost transaction.
- Rollback transaction savepoint_name rolls a user-defined transaction back to the matching save transaction savepoint_name.

The above bullets are word for word from the Adaptive Server Enterprise Reference Manual. The key sentence ("Though you can nest transactions, you can roll back only the outermost transaction") sums it up quite simply: unless you use transaction savepoints (explicit use of save transaction commands), you can only roll back the outermost transaction. As a result, any rollback transaction encountered automatically rolls back to the outermost transaction unless a savepoint name is specified (this also points to the fact that only outer transactions and savepoints can have transaction names). Consequently, a procedure that attempts to implement transaction management can have undesired behavior during a rollback if it was itself called from within a transaction. This is crucial, as Replication Server always delivers stored procedures within an outer transaction as part of the normal transactional delivery. The second common problem with procedures is that if transaction management is not implemented at all, simply raising an error and returning a non-zero return code does not represent a failed execution. Consider the following common code template:
create procedure my_proc <parameter list>
as
begin
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 '<error message>', <variables>
        return 1
    end
    return 0
end

It often surprises people that if the procedure is marked for replication and an error occurs, it still gets replicated and fails at the replicate, resulting in the DSI thread suspending. The reason is simple: even though an error was raised, the implicit transaction (started by any atomic statement) was not rolled back. Consequently, this leads to the following points:

- Stored procedures that are replicated should always be called from within a transaction, should check to see if they are in a transaction, and should roll back the transaction as appropriate during exception processing.
- Alternatively, stored procedures that are replicated should be implemented as sub-procedures that are called by a parent procedure after local changes have completed successfully, AND the sub-procedure should be called from within a transaction managed by the parent procedure.
- Stored procedures that implement transaction management should ensure a well-behaved model is implemented using appropriate save transaction commands (see below).

The first point is illustrated with the following template:


create procedure my_proc <parameter list>
as
begin
    if @@trancount < 1
    begin
        raiserror 30000 'This procedure can only be called from within an explicit transaction'
        return -1
    end
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 '<error message>', <variables>
        rollback transaction
        return 1
    end
    return 0
end

Notice the two modifications to the previous code: the @@trancount check at the top and the rollback transaction in the error handler. The second point is probably the best implementation for replicated procedures, as it allows minimally logged functions for row determination (the exact details are beyond the scope of this discussion) and ensures the local changes are fully committed before the call to the replicated procedure is even attempted. A sample code fragment would be similar to:
create procedure my_proc <parameter list>
as
begin
    declare @retcode int
    insert into table_1
    if @@error > 0
    begin
        raiserror 30000 '<error message>', <variables>
        return 1
    end
    begin tran my_tran
    exec @retcode = replicated_proc <parameters>
    if @retcode != 0
    begin
        raiserror 30000 'Call to procedure replicated_proc failed'
        rollback transaction
        return 1
    end
    else
        commit tran
    return 0
end

Note that this would roll back an outer transaction as well, if called from within a transaction. Finally, implementing proper transaction control for a stored procedure actually resembles something similar to the following:
create procedure my_proc <parameter list>
as
begin
    declare @began_tran int
    if @@trancount = 0
    begin
        select @began_tran = 1
        begin tran my_tran_or_savepoint
    end
    else
    begin
        select @began_tran = 0
        save tran my_tran_or_savepoint
    end
    <statements>
    if @@error > 0
    begin
        rollback tran my_tran_or_savepoint
        raiserror 30000 'something bad happened message'
        return -1
    end
    if @began_tran = 1
        commit tran
    return 0
end

Again, note the key sections: the @@trancount check and the use of save tran. Since only the outermost transaction actually commits the changes, using nested transactions is a fruitless exercise. A more useful mechanism, as demonstrated, is to implement savepoints at strategic locations that can be rolled back as appropriate. Each procedure, when called, simply needs to determine whether it has been called from within a transaction. If not, it begins a transaction. If it was called within a transaction, it simply implements savepoints to roll back the changes it initiated. However, it would still be the responsibility of the parent procedure to roll back the transaction (by checking the return or error code as appropriate).

Procedures & Grouped Transactions

To understand why this can lead to inconsistencies at the replicate - and, more to the point, seemingly spurious duplicate key errors - you need to consider the impact of transaction batching and error handling. Consider the following SQL batch as if sent from isql:
insert statement_1
insert statement_2
insert statement_3
insert statement_4
insert statement_5
go

If statement 3 fails with an error, statements 4 & 5 still execute as members of the batch. Now, put this in the context of replication transaction grouping, which if issued via isql would resemble the following:
begin transaction
rs_update_threads 2, <value>
insert statement_1
insert statement_2
exec replicated_proc_1
insert statement_3
exec replicated_proc_2
insert statement_4
insert statement_5
rs_get_thread_seq 1
-- end of batch
-- if it succeeded:
rs_update_lastcommit
commit tran
-- if it didn't succeed, disconnect to force a rollback
-- rollback tran

Now, let's suppose that the second call to the replicated proc (exec replicated_proc_2) fails and a normal transaction management model was implemented as discussed earlier vs. a proper implementation. The effect would be that the entire transaction batch would get rolled back to where the transaction began; however, the subsequent inserts (#4 & #5) would still succeed (remember, a rollback does not suspend execution, it merely undoes changes). Fortunately, in one sense, the error raised would cause RS to attempt to roll back and retry the entire transaction group individually. However, since inserts #4 & #5 were executed outside the scope of a transaction, they would not get rolled back by the RS. On retry (after the error was fixed for the replicated proc), upon reaching inserts #4 & #5, both would raise duplicate key errors. Checking the database would reveal the rows already existing, and simply resuming the DSI connection and skipping the transaction would have kept the database consistent - but left a very confused DBA wondering what happened.

Procedures with Select/Into

The latter example probably raised a quick "but... but..." from developers who are quick to state that replicating procedures with select/into is not possible due to "DDL in tran" errors at the replicate system. Very true - if procedure replication is only at the basic level, which typically is not the optimal strategy for procedure replication. While this may seem more appropriately discussed in the primary database section earlier, the transaction wrapping effect of Replication Server has often caused application developers to change the procedure logic at the primary. Case in point: procedures with select/into execute fine at the primary, but fail at the replicate due to "DDL in tran" errors. Many developers are then quick to rewrite the procedure to eliminate the select/into, not only affecting the performance at the replicate but also endangering performance at the primary. So, in a way, it does make sense to discuss it here. The best way to decide what to do with procedures containing select/into is by assessing the number of physical changes actually made to the real tables the procedure modifies and the role of the worktable created in tempdb. Several scenarios are discussed in the following sections. A summary table is included first for ease of reference between the scenarios.


Solution: Replicate tables vs. procedure
Applicability: complex (long run time) row identification; small number of real rows modified

Solution: Work table & subprocedure
Applicability: complex (long run time) row identification; small number of rows in work table; large number of rows in real tables

Solution: Procedure rewrite without select/into
Applicability: row identification easy; work tables contain large row counts; large number of rows modified in real table

Replicate Affected Tables vs. Procedures

This is a classic case of replicating the wrong object. In some cases, the stored procedure may use a large number of temporary tables to identify which rows to modify or add to the real database in a list-paring approach. In such cases, the final number of rows affected in replicated tables is actually fairly small. Consider the following example:
Update all of the tax rates for minority owned business within the tax-free empowerment zone to include the new tax structures.

Since these empowerment zones typically encompass only an area several blocks in size, the number of final rows affected will probably be only a couple dozen. However, the logic to identify the rows may be fairly complicated (i.e. a certain linear distance from an epicenter) and may require culling down the list of prospects using successive temp tables until only the desired rows are left. For example, the first worktable may be a table simply to get a list of businesses and their distance to the epicenter, possibly using the zip code to reduce the initial list evaluated. The second list would be constrained to only those within the desired range that are minority owned. The pseudo code would look something like:
select business_id, minority_owner_ship, (range formula)
    into #temptable_1
    from businesses
    where zip_code in (12345, 12346)

select business_id, minority_owner_ship, distance
    into #temptable_2
    from #temptable_1
    where distance < 1
        and minority_owner_ship > 0.5

update businesses
    set tax_rate = tax_rate - .10
    from #temptable_2 t2, businesses b
    where b.business_id = t2.business_id

Now, let's take a look at what would happen if this logic were in a procedure. The first temporary table creation might take several seconds simply due to the amount of data being processed, and the second may also take several seconds due to the table scan that would be required for the filtering of data from the first temp table. The net effect would be a procedure that requires (just for the sake of discussion) possibly 20 seconds for execution, 19 of which are the two temp table creations. The decision to replicate the rows or the procedure then becomes one of determining whether the average number of rows modified by the procedure takes longer to replicate than the time to execute the procedure at the replicate. For instance, let's say that when executed, the average execution of the procedure is 20 seconds, modifying 72 rows. If it takes 10 seconds to move the 72 rows through Replication Server and another 13 seconds to apply the rows via the DSI, it still may be better to replicate the rows vs. changing the procedure to use logged I/O and permanent worktables, as that might slow down the procedure execution to 35 seconds.

Worktable & Subprocedure Replication

However, in many cases, it is simply too much to replicate the actual rows modified. Take the above example again, only this time, let's assume that the target area contains thousands of businesses. Replicating that many rows would take too long. However, think of the logic in the original procedure at the primary:
Step 1 - Identify the boundaries of the area
Step 2 - Develop list of businesses within the boundaries
Step 3 - Update the businesses' tax rates

Now think about it. Step 1 really needs a bit more logic. In this example, identifying the boundaries as the outer cross streets does not help you identify whether an address is within the boundary unless you employ some form of grid system a la Spatial Query Server (SQS). The real logic would more likely be:
Step 1 - Identify the outer boundaries of the area
Step 2 - Identify the streets within the boundaries
Step 3 - Identify the address range within each street
Step 4 - Develop list of businesses with address between range on each street
Step 5 - Update the businesses' tax rates

Up through step 3, the number of rows is fairly small. Consequently the logic for a stored procedure could be similar to:
(Outer procedure - outer boundaries as parameters)
    Insert list of streets and address range into temp table
    (Inner procedure)
        Update business tax rate where address between range and on street

As a result, you simply need to replicate the worktable containing the street number ranges and the inner procedure. The procedure at the primary then might look like:
create procedure set_tax_rate
    @streetnum_n int, @street_n varchar(50),
    @streetnum_s int, @street_s varchar(50),
    @streetnum_e int, @street_e varchar(50),
    @streetnum_w int, @street_w varchar(50),
    @target_demographic varbinary(255),
    @new_tax_rate decimal(3,3)
as
begin
    -- logic to identify N-S streets in boundary using select/into
    -- logic to identify E-W streets in boundary using select/into
    begin tran
        insert into street_work_table
            select @@spid, streetnum_n, streetnum_s, streetname from #NS_streets
            union all
            select @@spid, streetnum_e, streetnum_w, streetname from #EW_streets
        exec set_tax_rate_sub @@spid, @target_demographic, @new_tax_rate
    commit tran
    return 0
end

create procedure set_tax_rate_sub
    @proc_id int,
    @target_demographic varbinary(255),
    @new_tax_rate decimal(3,3)
as
begin
    update businesses
        set tax_rate = @new_tax_rate
        from businesses b, street_work_table swt
        where swt.streetname = b.streetname
            and b.streetnum between swt.low_streetnum and swt.high_streetnum
            and swt.process_id = @proc_id
            and b.demographics & @target_demographic > 0
    delete street_work_table where process_id = @proc_id
    return 0
end

By replicating the worktable (street_work_table) and the inner procedure (set_tax_rate_sub) instead of the outer procedure, the difficult logic to identify the streets within the boundaries is not performed at the replicate, allowing the use of select/into at the primary database for this logic while reducing the number of rows actually replicated to the replicate system. Note the following considerations:

- The inner procedure performs cleanup on the worktable. This reduces the number of rows replicated, as only the inserts into the worktable get replicated from the primary.
- @@spid is a parameter to the inner procedure and a column in the worktable. The reason for this is that in multi-user situations, you may need to identify which rows in the worktable belong to which user's transaction. Since the spid at the replicate will be the spid of the maintenance user and not the same as at the primary, it must be passed to the subprocedure so that the maintenance user knows which rows to use.
- The inner procedure call and the inserts into the worktable are enclosed in a transaction at the primary. This is due to the simple fact that if the procedure hits an error and aborts, the procedure execution was still successful according to the primary ASE. As a result it would still be replicated and attempted at the replicate. By enclosing the inserts and proc call in a transaction, the whole unit can be rolled back at the primary, resulting in a mini-abort in the RS that purges the rows from the inbound queue.
The last point is fairly important. Any procedure that is replicated should be enclosed in a transaction at the primary. This allows user-defined exits (raiserror, return 1) to be handled correctly, provided that the error handling does a rollback of the transaction. Despite the fact that an error is raised and a negative return status is returned from the procedure, it is still a successful procedure execution according to ASE, and is consequently replicated to all subscribing databases, where the same raiserror would occur, resulting in a suspended DSI.

A crucial performance suggestion for the above is to have the clustered index on the worktable include the spid and one or more of the other main columns as indexed columns. For example, in the above example, the clustered index might include the spid and streetname. Then, if the real data table (businesses) has an index on streetname, the update via join can use the index even if no other SARG is possible (true in the above case). A sketch of such an index is shown after the list below.

While this technique may appear to have limited applicability, in actuality it probably resolves most of the cases in which a select/into is used at the primary database and not all the rows are modified in the target table (establishing the fact that some criteria must exist - replicate the criteria vs. the rows). Situations it is notably applicable for include:

- Area Bounded Criteria - DML involving area boundaries identified via zip codes, area code + phone exchange, countries, regions, etc. A classic example is the "mark all blood collections from regions with an E-Coli outbreak as potentially hazardous" example often used in replication design discussions as a good procedure replication candidate. The list of blood donations would be huge, but the list of collection centers located in those regions is probably very small.
- Specified List Criteria - In certain situations, rather than using a range, a specified list is necessary to prevent unnecessarily updating data inclusive in the range at the replicate (a consolidated system) but not in the primary. For example, a list of personnel names being replicated from a field office to the headquarters. This could include dates, account numbers, top 10 lists, manufacturers, stores, etc.

As well as any other situation in which a fairly small list of criteria exists compared to the rows actually modified.
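A minimal sketch of that index, using the column names from the set_tax_rate example above (the index name is hypothetical):

create clustered index street_work_idx
    on street_work_table (process_id, streetname)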
Procedure Rewrite without Select/Into

This, unfortunately, is the most frequent fallback for developers suddenly faced with the select/into-at-the-replicate problem, and admittedly, sometimes it is necessary. However, this usually requires permanent working tables in which the procedure makes logged inserts/updates/deletes. This should only be used when the identifying criteria is the entire set of rows or a range criteria that is huge in itself. An example is if a procedure is given a range of N-Z as parameters. While it is possible to create a list of 13 characters and attempt the above, the end result is the same: thousands of rows will be changed. A classic case would be calculating the finance charges for a credit card system. In such a situation, even if the load was distributed across every day of the month by using different closing dates, tens of thousands to millions of rows would be updated with each execution of the procedure.

Since most credit cards operate on an average daily balance to calculate the finance charges, the first step would be to get the previous month's balance (hopefully stored in the account table) and subtract any payments (as these always apply to old balances first). This is a bit more difficult than simply taking the average and dividing by the number of days. Consider the following table:

Day        Charge       Balance
Begin      1,000.00     1,000.00
1                       1,000.00
2                       1,000.00
3                       1,000.00
4          50.00        1,050.00
5                       1,050.00
6                       1,050.00
7                       1,050.00
8                       1,050.00
9                       1,050.00
10         75.00        1,125.00
11                      1,125.00
12                      1,125.00
13         150.00       1,275.00
14                      1,275.00
15                      1,275.00
16         125.00       1,400.00
17                      1,400.00
18                      1,400.00
19                      1,400.00
20                      1,400.00
21                      1,400.00
22                      1,400.00
23                      1,400.00
24         500.00       1,900.00
25                      1,900.00
26                      1,900.00
27                      1,900.00
28                      1,900.00
29                      1,900.00
30                      1,900.00
Avg Bal                 1,366.67
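For reference, the average daily balance above works out as (3 x 1,000.00 + 6 x 1,050.00 + 3 x 1,125.00 + 3 x 1,275.00 + 8 x 1,400.00 + 7 x 1,900.00) / 30 = 41,000.00 / 30 = 1,366.67.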

As you can see, there is no way to simply take the sum of the new charges ($900) and get the final answer. As a result, the system needs to first calculate the daily balance for each account and then insert the average daily balance multiplied by some exorbitant interest rate (i.e. 21% for department store cards) for the finance charge. For the sake of argument, let's assume this is done via a series of select/intos (possible with about 3-4 of them - an exercise left for the reader). Obviously, no matter what time the procedure runs, it will run for several hours on a very large row count. Replicating the procedure is a must, as replicating all the row changes at the end of every day (assuming every day is a closing date for 1/30th of the accounts) could be impractical. Consequently, instead of using select/intos to generate the average daily balances, a series of real worktables would have to be used.

Separate Execution Connection

This last example (finance charges on average daily balance) clearly illustrates a problem with replicating stored procedures. At the primary system - assuming no contention at the primary - the finance charges procedure could happily run at the same time as user transactions (assuming the finance charge procedure used a cursor to avoid locking the entire table). However, as described before, in order to guarantee that the transactions are delivered in commit order, the Replication Server applies the transactions serially. Consequently, once the procedure started running at the replicate, it would be several hours before any other transactions could begin. Additionally, at the replicate, the entire update would be within a transaction - if it didn't fail due to exhausting the locks, the net result would be a slow lockdown of the table. This, of course, is extremely unsatisfactory. One way around this is to employ a separate connection strictly for executing this and other business maintenance. In doing so, normal replicated transactions could continue to be applied while the maintenance procedure executed on its own. The method to achieve this is based on multiple (not parallel) DSIs, which is covered later in this section. Needless to say, there are many, many considerations to implementing this, which are covered later; consequently, this should only be used when other methods have failed and procedure replication is really necessary. One of those considerations is the impact on subsequent transactions that used/modified data modified by the maintenance procedure. Due to timing issues with a separate execution connection, it is fully possible that such an update makes it to the replicate first, only to be clobbered by the later execution of the maintenance procedure. One of the other advantages to this approach is that statement and transaction batching could both be turned off. This would allow the procedure at the replicate to contain the select/into, provided that system administrators were willing to accept a manual recovery (similar to system transactions). With both statement and transaction batching off, the following procedure would work.
create procedure proc_w_select @parm1 int
as
begin
    declare @numtrans int
    select @numtrans = @@trancount
    while @@trancount > 0
        commit tran
    -- select into logic
    begin tran
        -- updates to table
    commit tran
    while @@trancount < @numtrans
        begin tran
    return 0
end

This is similar to the mechanism used for system transactions such as DDL or truncate table. In the case of system transactions, Replication Server submits the following:
rs_begin
rs_commit
-- DDL operation
rs_begin
rs_commit

The way this works is that the rs_commit statements update the OQID in the target database. During recovery, only three conditions could exist:

- rs_lastcommit OQID < first rs_commit OQID - In this case, recovery is fairly simple, as the empty transaction prior to the DDL has not yet been applied. Consequently, the RS can simply begin with the transaction prior to the DDL.
- rs_lastcommit OQID >= second rs_commit OQID - Similar to the above, recovery is simple, as this implies that the DDL was successful since the empty transaction that followed it was successful. As a result, Rep Server can begin with the transaction following the one for which the OQID was recorded.
- rs_lastcommit OQID = first rs_commit OQID - Here all bets are off. The reason is that one of two possible situations exists: either 1) the empty transaction succeeded but the DDL was not applied (the replicate ASE crashed in the middle); or 2) both were applied. Since the DDL operation is not within an rs_commit, the OQID is not updated when it finishes. Consequently the administrator has to check the replicate database and make a conscious decision whether or not to apply the system transaction - hence the added execute transaction option to the resume connection command. By specifying execute transaction, the administrator is telling RS to re-apply the system transaction because it never really was applied. If instead it had run but the second rs_commit had not, then simply leaving execute transaction off the resume connection is sufficient.

Accordingly, by committing and re-beginning the transactions at the procedure boundaries, you are not sure whether the proc finished if the OQID is equal to the OQID prior to the proc execution. If it was successful, resume connection DS.DB skip transaction provides similar functionality to leaving off execute transaction for system transactions. However, it is critical that the procedure be fully recoverable - possibly even to the point where it could recover from a previous incomplete run. If the actual data modifications were made outside a transaction, then when a failure occurs during the execution, reapplying the procedure after recovery would result in duplicate data. So, for example, the finance charge procedure would only develop the list of average monthly balances from accounts that did not already have a finance charge for that month.
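In RCL terms, the manual recovery decision just described boils down to one of the following (the server and database names are placeholders):

-- the pending transaction never actually ran at the replicate: re-apply it
resume connection to RDS.rdb execute transaction

-- the transaction did run (verified manually): skip it and move on
resume connection to RDS.rdb skip transaction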


Replication Routes
To Route or Not to Route, That is the Question
One of the key differences between Sybase's Replication Server and competing products is its routing capabilities. In fact, it is the only replication product on the market that supports intermediate routes. Routing was designed into Sybase Replication Server from the outset to support long-haul network environments while providing performance advantages in that environment over non-routed solutions. The goal of this section is to provide the reader with a fundamental understanding of this feature, how it works, its considerations and its performance aspects.

Routing Architectures

Replication routing architecture is not a topic for the uninitiated, as it has significant similarities with messaging/EAI technologies. That's a topic for later. Understanding routing architectures requires an understanding of the basic route types and then the different topologies and the types of problems they were designed to solve.

Route Types

Anyone who has been around Sybase RS for more than a few months knows that there are two different types of routes that Rep Server provides: direct and indirect.

Direct Routes

A direct route implies that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) have a direct logical path between them (logically adjacent). In fact, it is common to have two connections, since routes are unidirectional in Sybase Replication Server. This has more to do with how routes work from an internals perspective, however, and should not be viewed as a limitation. Sybase very easily could have used a single command to construct a bi-directional route; however, it would have posed a problem with indirect routes and the flexibility of having different intermediate sites between two endpoints. The diagram below illustrates two one-directional routes between the primary and replicate servers:
Figure 44 - Two One-Direction Direct Routes between Primary & Replicate
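Creating the pair of one-way routes in Figure 44 would look something like the following - a hedged sketch in which the RS names and the RSI user/password are placeholders:

-- executed on the primary Replication Server (PRS):
create route to RRS
    set username RRS_rsi
    set password RRS_rsi_ps
go
-- executed on the replicate Replication Server (RRS), for the reverse direction:
create route to PRS
    set username PRS_rsi
    set password PRS_rsi_ps
go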


Indirect Routes

An indirect route implies that the Primary Replication Server (PRS) and Replicate Replication Server (RRS) are separated by one or more Intermediate Replication Servers. An intermediate route was illustrated at the beginning of this paper with the following diagram:

Figure 45 - An Example of an Intermediate Route
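In RCL, the only difference for an indirect route like the one in Figure 45 is telling the primary RS to reach the replicate RS through the intermediate - again a hedged sketch with placeholder names, assuming a direct route from PRS to IRS already exists:

-- executed on PRS:
create route to RRS
    set next site IRS
go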


Each of the Replication Servers above first has a direct route to its neighbor and then an indirect route to the replicate. At first glance, some may question the reason for even using intermediate routes, but many of the topologies (as we will see) practically require them.

Route Topologies

Once routing gets implemented, it doesn't take long before the term "topology" starts being discussed. A topology is nothing more than a description of the connections between the different sources & targets. With each topology, certain things are understood (i.e. a hierarchical topology implies a rollup factor) and certain aspects are also immediately known (i.e. bidirectional replication, etc.). There are only a limited number of base topologies; however, large implementations may find that they combine different topologies within their data distribution architecture. Each of the base topologies is discussed in the following sections.

Point-to-Point

A point-to-point topology is characterized by every RS having a direct connection to every other RS. Classic implementations include Remote Standby and Shared Primary (Peer-to-Peer).

Remote Standby

In a typical Warm Standby system, a single Replication Server is used. This restriction is mainly due to the fact that routing is implemented as a connection - and hence an outbound connection. Since a WS only uses the inbound queue, it has been restricted to a single RS. In some environments where the standby system is extremely remote (i.e. 100s of miles away), the connectivity between the RepAgent and the RS becomes a bit of a problem. The reason is that with the longer WANs, not only is the bandwidth lower, but the line quality and other factors also become an issue. Consequently, it may sometimes be advisable to set up a replicated copy in which all the tables are published and subscribed to using standard replication definitions and subscriptions, and to use two replication servers - one local and one remote.

Figure 46 - Example of a Remote Standby


This has some distinct performance advantages:

- Empty begin/commit pairs and other types of non-replicated data get filtered out immediately at the primary
- The transaction log continues to drain as normal and is not impacted by WAN outages
- Other destination systems are not impeded by having transactions first go to the remote site as a normal WS would dictate; instead, they can subscribe at the local node
- It tends to be more resilient to network issues

It also has some very acute disadvantages:

- It doesn't support a logical connection internal to RS
- It doesn't support automated failover
- It has increased latency with respect to RS processing, especially with large transactions

The first point may appear to be fairly minor, but in reality, it can be a real bear to deal with. While it is true that if the system is isolated, this is not a problem, it is equally true that if the system participates in replication to/from other sites, it gets real sticky. The reason is that some of the nuances of a logical connection are not well known. Consider the following scenarios:

Figure 47 - The Standby as a Target Puzzler


Now comes one of those times in this paper where you have to engage your thinking cap: how is the switch affected from Chicago's viewpoint (the question marks in Figure 47)? Remember, the two would be different connections in the same domain - duplicate subscriptions are not the answer.

- Having transactions applied to SF directly could cause database out-of-sync issues. The issue is that NY users can modify the source data, which is later updated by Chicago replicated transactions. But, due to latency and timing, the Chicago replicated updates get to SF first, then the replicated NY changes. The result is that Chicago's transactions would appear to have been lost.
- Using the NY RS as an intermediate route for the SF RS from Chicago (CH -> NY -> SF as a RS route) would not be the answer either. Again, consider the problem posed at the end of the last bullet. The Chicago transaction still has a distinct probability of getting to SF first if the transactions are executed close together.
- So, if we just replicate to NY from Chicago, what happens when NY fails? Some of the Chicago transactions will be stranded in the transaction log while others will be in the queue - the outbound queue in the NY RS, which will not drain since the NY ASE is dead. Potentially others are still stranded in the Chicago RS outbound queue for the route. Simply trying to switch Chicago to SF could result in missing transactions, since the currently active segment in the queue is past those transactions and routing does not forward transactions intact (discussed later in the internals section).

By now you are beginning to see the real purpose behind the logical connection for a WS. While this is a different topic altogether (Warm Standby replication), two of the important aspects of a WS connection are that the transactions sent to the logical pair are routed correctly in the event of a failover, and that transactions are applied to the primary, which in turn re-replicates them to the standby (send warm standby xacts effectively encapsulates send maint xacts to replicate; however, an rs_marker is used to signal when to begin sending all transactions to avoid transactions applied by the other node). Additionally, rs_lastcommit is replicated; consequently, once replicate systems reconnect to the logical pair, they see the last transaction that made it to the pair (hence the strict save interval as well). However, we are digressing deep into a topic that deserves its own discussion. A simpler solution to the problem above is to have Chicago not use a route to NY & SF, but to use a multiple-DSI approach and a different maintenance user (and connection name, due to the domain). Regardless, the point of this entire discussion is that while it may be tempting to set up replicated standbys for more local systems, be absolutely 200% positive that it is the best approach. If performance is the issue, it probably is solvable via other means than this implementation, as it is doubtful that this implementation really will improve performance over a properly tuned WS implementation. The driver for this sort of implementation should be network resilience.

Shared Primary (Peer-to-Peer)

The other classic implementation for point-to-point topologies is a shared primary or peer-to-peer implementation. In a peer-to-peer implementation a distinct model of data ownership is defined - either on different sets of tables, column-wise within tables or row-wise within tables. This type of implementation is often illustrated as:
Figure 48 - Typical Shared Primary/Peer-to-Peer Implementation


This technique is often referred to as data ownership from a replication standpoint, but it implies another concept called application partitioning. In a shared primary implementation, application partitioning is done implicitly at each site by restricting the users from modifying other sites' data. It is important to note that request functions have been used by some customers to modify another site's data by sending the change request to that site - or by having the change request implement an ownership change.

MP Implementations

Another successful use of the shared primary approach that really drives home this point is when the system is divided for load balancing purposes. In a typical environment, the reads (selects) grossly outnumber the writes (DML) and are consequently the driving force when a machine is at capacity. In such a case, a larger machine often is the answer. But what if no larger machines are available? Additionally, a single large machine is a single point of failure and leaves customers exposed. Some customers started using RS from the earliest days to maintain a loose model of a massively parallel system by using peer-to-peer replication. A typical implementation looked like:
Figure 49 - MPP via Load Balancing with RS


This implementation is more or less a cross between an MPP shared-disk approach (Oracle, Microsoft) and an MPP shared-nothing approach (IBM, Sybase). As weird as the above may look, it has some advantages over both models. Interestingly enough, Oracle 9i Real Application Clusters (RAC) enforces application partitioning (forget the marketing hype - read the manuals) and implements block ownership and block transfer. The problem, of course, is that the block transfers are on demand, which slows a cross-node query (hence their own benchmarks do not allow users to read a block they didn't write). Microsoft quite explicitly uses a transaction router to enforce application partitioning. IBM and Sybase (the old Navigation Server/Sybase MPP) split the data among different nodes and used result set merging. For ASE 15.0, Sybase is planning on implementing MPP via a federated database using unioned views. The above implementation has a couple of advantages over RAC/MS (shared-disk) as well as result set merging (shared-nothing):

1. RAC/MS (and each node of a shared-nothing system) has a single copy of the database - and consequently a single point of failure.
2. Queries involving remote data execute substantially quicker, as the data is local.
3. Shared-nothing approaches essentially union data. In some cases an aggregate function across the datasets then becomes an application implementation (i.e. count(*) or sum(amount) across nodes involves summing the individual results vs. unioning the results).
4. Cross-node writes can be handled as request functions or via function strings (i.e. aggregates) to prevent blocking on contentious columns (think balance for a bank account - now consider cross-account transfers). Shared-disk architectures in particular have problems with this, as Distributed Lock Managers are necessary to coordinate and cache coherency resolution is necessary. Shared-nothing architectures have severe problems as well, as this often reverts to a 2PC.
The downside, of course, is that each node is looking at a point-in-time (historical) copy of the data from other nodes, which may not be current. A little-known fact, of course, is that the same is true of Oracle RAC - the blocks are copies from when the transaction began. Probably a little closer in time than with the above, but still a problem. An additional downside is that each node must be able to support the full write load while handling a fraction of the query load. If it cannot support the full write load under any query load, then a shared-primary implementation and pure application partitioning will be necessary, in which only data truly needed at the other nodes is replicated. Incidentally, a fine example of a transaction router is OpenSwitch, although it would be easy to implement one in an application server as well.

Hub & Spoke

Hub & spoke implementations are common where point-to-point implementations are no longer practical due to scalability and management. Consider the common point-to-point scenario described in the last section. It is fine as long as the number of sites is in the 3-4 range and possibly could be extended to 5. However, remember that the number of routes from each site is one less than the total number of sites - in fact, each site manages twice that number due to the unidirectional nature of routes, so for M sites the total number of route endpoints is M*(M-1)*2. For 3 sites, a total of 12 would be needed; 5 sites would require 40. As you can tell, as the numbers grow beyond 5, the number of connections gets to be entertaining. Consequently, a hub & spoke implementation could be used, with a common arbitrator/re-director in the middle.

Figure 50 - Hub & Spoke Implementation


Note that the site in the center lacks a database. The reason for this is that its sole purpose is to facilitate the connections. An astute observer may be quick to point out that logically you still need to create the individual routes as if it were point-to-point, with the only difference in the above being that the hub is specified as the intermediate node. A true statement - however, it does not take into consideration the processing and possibly the disk space that is saved at each site. Every replicated row goes to the same outbound queue, where it is passed to another Replication Server (the hub) which determines the destination(s).

Circular Rings

A circular ring is a topology in which each Replication Server has direct routes only to those adjacent to it. This is largely due to the fact that most communications flow sequentially around the ring, typically in a single direction. A classic example was illustrated earlier in follow-the-sun technical support systems. Such systems typically use globally dispersed corporate centers to avoid having 24-hour shifts locally. For example, Sybase has support centers in Concord, Massachusetts; Chicago, Illinois; Dublin, California; Hong Kong, China; Sydney, Australia; and Maidenhead, England. Additional support staff are distributed to other locations as well (Brazil, Netherlands, etc.), but these represent the main support centers for English-speaking customers. Globally, this can be represented by:


Figure 51 - Sybase's Follow-The-Sun TS Implementation


Just by looking at it, you can discern the ring between the centers. While Sybase's implementation is different, you could picture that as a support case is opened, it is sent to the next site as a precaution. If a handoff is necessary, an ownership change for that case is effected. As soon as the support person at the next site makes any modification, it will replicate, and consequently the next site will have the info.

Geographic Distribution

This logically leads to the next and one of the more common topologies - geographic distribution. The primary reason for this topology used to be the limited bandwidth between the continents. As that has largely been resolved in recent years, the biggest benefit then becomes Replication Server performance, as efficiencies are realized by implementing a system this way. Consider the following topology:

Figure 52 - Possible Geographic Distribution Topology for a Global Corporation


This is where IBM, Oracle and Microsoft lose it. Because of their lack of indirect routes, they must create direct routes from/to every site. In the above illustration, there are ~35 sites, yet the most that any one site has a direct route to is 5. A change to a lookup table is easily distributed to all of the sites. A system that does not have indirect routing would have to create 35x34, or 1,190, connections in order to support replication to/from every site. The amount of processing saved is enormous.

Hierarchical Trees

The above topology is considered a basic one even though it combines elements of others. In it, sites that need to communicate with other local sites have direct routes to those sites. Looked at in a slightly different way, you get the illustration of cascading nodes. As a result, it is very similar to probably one of the most common routing implementations (along with remote standby) - hierarchical. A hierarchical topology is very similar to an index tree for databases in that there is a root node and several levels until the bottom is reached. It is different in that the intermediate levels also represent functional nodes. One of the clearest examples of a hierarchical implementation can be witnessed in a large retail department store chain. We will use a mythical chain of Syb-Mart. Each Syb-Mart store sells the usual clothing, furniture, tools, automotive goods, etc. Some of these items bear the Syb-Mart label while others are national brands. Each store reports its

receipts to a regional office, which in turn feeds to an area office, which in turn feeds to a national headquarters, and finally to corporate headquarters. This hierarchy can be illustrated as follows:

Figure 53 - Syb-Mart Mythical Hierarchical Topology


Both sales and HR information (such as timesheet data, hirings, firings, etc.) would move up the tiers (perhaps using function strings to only apply aggregates at each higher level), while pricing information (sale prices, price increases, etc.) could be replicated down the tiers. One of the difficult concepts to grasp is that each of the tiers need not be simply a roll-up of all the information below. It is often assumed that each of the tiers is a consolidation of the tiers below, perhaps with the addition of some aggregate values. It is true that many of the business objects - products, product SKUs, prices, promotions, and perhaps on-site inventories - may be present in all the tiers, along with individual employee records (such as name, employee id, address, store, etc.). However, the field sites may have a record of each individual transaction (business event), while the higher level tiers would only retain daily/monthly/yearly aggregates. Some HR information, such as individual employee timesheets, might also only be recorded as aggregates at each level, but at the top level each record may be present in detail for payroll purposes. This last example is one that is sometimes missed - detail records going to the top, while intermediate locations only receive aggregates. In fact, it is arguable that all detail records should roll up to the top, if for no other reason than to feed the corporate data warehouse.

The biggest problem with hierarchical tiers is a re-organization in which field sites migrate from one regional center to another. The problem is not the routing, which is trivial to modify, but rather the subscription de-materialization/re-materialization and supporting data elements. For example, in the above illustration, each of the field sites would be similar and somewhat independent of the regional site. The store's current database status regarding past sales, current inventory, etc. would not change. In this case, simply dropping the subscriptions to the previous regional center and adding them to the new regional center (without materialization, of course) may be all that is necessary from the store's perspective. There may be minor additional rows needed at the regional center to handle the new field site (or some removed), but all in all fairly simple. However, HR information is a little different. In the case of HR data, employees would (possibly) no longer be accountable to the original region, and it more than likely would be a security risk to have employee data still resident in a system where no one has a need to know that information anymore. The new regional center would need to know the employee data, of course. This is kind of an interesting paradox in that at some point in the tiers, the personnel would still be under the same area or national umbrella. At whatever levels in between, either bulk or atomic de-materialization and re-materialization would be required. Hierarchical implementations still remain one of the most common, but database administrators need to plan for the capability to re-organize quickly. As soon as a re-organization is announced, they need to review what the original and final physical topologies would resemble and then determine the actions necessary to carry it out. A hedged sketch of the subscription moves involved at a store appears below.
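The sketch assumes a store subscribing to pricing data from its regional center; the subscription, replication definition, and connection names are hypothetical, and the define/activate/validate sequence is the usual no-materialization path since the store's data already exists:

-- at the Replication Server serving the old regional center: drop without purging the store's data
drop subscription store42_prices_sub
    for prices_repdef
    with replicate at STORE42_DS.storedb
    without purge

-- at the Replication Server serving the new regional center: recreate, skipping materialization
define subscription store42_prices_sub
    for prices_repdef
    with replicate at STORE42_DS.storedb
activate subscription store42_prices_sub
    for prices_repdef
    with replicate at STORE42_DS.storedb
validate subscription store42_prices_sub
    for prices_repdef
    with replicate at STORE42_DS.storedb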

Logical Network

For large systems, it may be best to borrow an analogy from the hardware domain and implement a logical network. A logical network is essentially a back-bone of Replication Servers whose sole purpose is to provide efficient routing and ease connection management - similar to the hub-and-spoke approach earlier. It typically resembles the geographic distribution of the business in its topology - usually because corporate bandwidth is allocated from corporate headquarters to the main regional centers (more than likely larger metropolitan areas with the infrastructure in place). Let's consider our Syb-Mart hierarchical example above. Assuming a very wide distribution of stores (one in every friendly neighborhood), consider the following hypothetical map of high-bandwidth networks (maintained by that great monopoly phone system).

(Map legend: Major Metropolitan City; Syb-Mart Regional HQ; High-Bandwidth Network)

Figure 54 - Hypothetical High-Bandwidth Connections


It would make sense to put a Replication Server at each of the metropolitan areas above to implement the back-bone. For example, stores in Charleston, SC technically report to the Eastern Regional HQ in Boston, MA. In a pure hierarchical model, a direct connection would be created between them. Certainly, the network routers from the phone company would take care of physically routing the traffic most effectively, so it may be possible to do so. However, in past years, train crashes in tunnels in Baltimore, brownouts in San Francisco, backhoes in Reston, VA, etc. have disrupted communications - some for days. By using a back-bone with multiple paths, company systems personnel could easily re-route replication along alternate routes. Additionally, each of the major metropolitan centers could function as a collector for all of the stores in its region, reducing network traffic for price changes while ensuring that data flows along the quickest route possible.

Routing Internals

Now that we understand logically how routing can be put to use, let's discuss the internals of how it works.

RS Implementation

Support for routing within the Replication Server is fairly unique. From a source system's perspective, the route is the same as any other destination. However, in moving the data through the system, routes exploit some neat features. Consider the following diagram.


Figure 55 - Replication Server Routing Internal Threading


The path for routing is as follows:

1. The Rep Agent sends the LTL stream to the Rep Agent User thread as normal.
2. The Rep Agent User thread performs normalization and then passes the information to the SQM for storage as usual.
3. The SQM writes the data to the inbound queue.
4. The SQT thread performs transaction sorting as usual.
5. The SQT thread passes the sorted transactions to the DIST thread.
6. The DIST thread passes each transaction to the subscribing site's SQM. If the subscriber is a local database, it sends the data to that database's SQM thread. However, if the subscriber is a remote database, it finds the next RS on the route and sends the data to the SQM for that RS.
7. The outbound SQM for the route writes the data to the outbound queue as normal.
8. The Replication Server Interface (RSI) thread reads the data from the outbound queue via the SQM.
9. The RSI forwards the rows to the remote RS via the RSI User thread in that RS.
10. The RSI User thread sends the data to the DIST thread, which only needs to call the MD module to read the bitmask of destinations and determine the appropriate outbound queues to use.
11. The DIST sends the rows to the SQM of the destination database.
12. The SQM writes the data to the outbound queue.
13. The DSI-S reads the data from the outbound queue (via the SQM) and then sorts the transactions into commit order.
14. The DSI-S performs transaction grouping and submits each group to the DSI-Execs as usual.
15. The DSI-Execs generate the appropriate SQL and apply it to the replicate database.

Consider the following points about the above:

- There will be an SQM and RSI thread for each direct route created from any RS. Consequently, if an RS has 3 direct routes to 3 other RSs, there will be 3 RSI outbound threads and associated SQMs and outbound queues.
- A route does not have an inbound queue. The inbound processing (if you would call it that) simply determines which queues to place the data in - either an outbound queue for a local database or the outbound queue of the route to the next Replication Server.
- The RSI User thread (a type of EXEC thread similar to the RepAgent User thread) merely serves as a connection point.
- The MD is the only module of a Distributor thread necessary. All of the subscription resolution (SRE) and transactional organization (TD) have already been completed at the primary RS.
- If you remember, we stated that a bitmask is used to reflect the destinations. For local databases, this bitmask translates to an outbound queue. For remote databases, a single copy of the message with the bitmask is placed into the RS outbound queue. Hence only a single copy of the message is necessary for each direct route.


Unlike the DSI interface, the RSI interface is non-transactional in nature. For example, it does not make SQT calls and does not base delivery on completed transactions. Instead, it operates much on the same principles as a Replication Agent: it simply passes the row modifications as individual messages to the replicate Replication Servers and tracks recovery on a message-id basis (and consequently, it is the only mechanism in Replication Server in which orphan transactions can happen, mainly due to a data loss in the outbound queue).

A common misconception is that admin quiesce_force_rsi is used to quiesce all RS connections - DSI and RSI. In reality, it only applies to RSI connections, as DSI threads are in a perpetual state of attempting to quiesce. The reason this command exists is that, similar to the RepAgent to RepAgent User thread communications, the RSI thread batches messages to send to remote RSs. In return, the message acknowledgements are sent only on a periodic or as-requested basis. admin quiesce_force_rsi checks to see if the RS is quiescent, the same as admin quiesce_check. However, where admin quiesce_check merely checks to see if RSI acknowledgements have been received, admin quiesce_force_rsi forces all of the RSI threads to send any outstanding messages and then prompt for an acknowledgement.

RSI Configuration Parameters

The following configuration parameters are available for tuning replication routing.

disk_affinity
Default: off
    Specifies an allocation hint for assigning the next partition. Enter the logical name of the partition to which the next segment should be allocated when the current partition is full.

rsi_batch_size
Default: 262,144; Recommended: 4MB if on RS 12.6 ESD #7 or RS 15.0 ESD #1
    The number of bytes sent to another Replication Server before a truncation point is requested. The range is 1024 to 262,144. This works similarly to the Replication Agent's scan_batch_size configuration setting. It normally should not be adjusted downwards unless in a fairly unstable network environment where you want the RSI outbound queue kept trimmed. In RS 12.6 ESD #7 and RS 15.0 ESD #1, the maximum was increased to 128MB.

rsi_fadeout_time
Default: -1
    The number of seconds of idle time before Replication Server closes a connection with a destination Replication Server. The default (-1) specifies that Replication Server will not close the connection. In low-volume routing configurations this may be set higher (i.e. 600 = 10 minutes) to reduce connection processing in the replicate Replication Server.

rsi_packet_size
Default: 2048; Recommended: 8192
    Packet size, in bytes, for communications with other Replication Servers. The range is 1024 to 8192. In high-speed networks, you may want to boost this to 8192. The RSI uses an 8K send buffer to hold pending messages to be sent. When the number of bytes in the buffer would exceed the packet size, the send buffer is flushed to the replicate RS.

rsi_sync_interval
Default: 60
    The number of seconds between RSI synchronization inquiry messages. The Replication Server uses these messages to synchronize the RSI outbound queue with destination Replication Servers. Values must be greater than 0. This is analogous to the scan_batch_size parameter of a Replication Agent, but is measured in seconds instead of rows.

rsi_xact_with_large_msg
Default: shutdown
    Specifies route behavior if a large message is encountered. This parameter is applicable only to direct routes where the site version at the replicate site is 12.1 or earlier. Values are skip and shutdown.

save_interval
Default: 0 minutes
    The number of minutes that the Replication Server saves messages after they have been successfully passed to the destination Replication Server. See the Replication Server Administration Guide Volume 2 for details.
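Where an adjustment is warranted, route-level parameters are generally changed with alter route. The following is a minimal sketch (the route name is hypothetical, and the exact alter route syntax should be verified against the RCL reference for your RS version):

alter route to LONDON_RS set rsi_packet_size to '8192'
go
-- fade out an idle, low-volume route after 10 minutes
alter route to LONDON_RS set rsi_fadeout_time to '600'
go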

As you can see, there are very few adjustments needed to the defaults for routing.

RSI Monitor Counters

Replication Server 12.6 extended the basic counters from 12.1 to the following counters to monitor RSI activity.


BytesSent - Total bytes delivered by an RSI sender thread.
PacketsSent - Total packets sent by an RSI sender thread.
MsgsSent - Total RSI messages sent by an RSI thread. These messages contain the distribute command.
MsgsGetTrunc - Total RSI get truncation messages sent by an RSI thread. This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters.
FadeOuts - Number of times that an RSI thread has been faded out due to inactivity. This count is influenced by the configuration parameter rsi_fadeout_time.
BlockReads - Total number of blocking (SQM_WAIT_C) reads performed by an RSI thread against the SQM thread that manages an RSI queue.
SendPTTimeLast - Time, in 100ths of a second, spent in sending the packet of data to the RRS.
SendPTTimeMax - Maximum time, in 100ths of a second, spent in sending packets of data to the RRS.
SendPTTimeAvg - Average time, in 100ths of a second, spent in sending packets of data to the RRS.

Replication Server 15.0 changed these slightly to:

BytesSent - Bytes delivered by an RSI sender thread.
PacketsSent - Packets sent by an RSI sender thread.
MsgsSent - RSI messages sent by an RSI thread. These messages contain the distribute command.
MsgsGetTrunc - RSI get truncation messages sent by an RSI thread. This count is affected by the rsi_batch_size and rsi_sync_interval configuration parameters.
FadeOuts - Number of times that an RSI thread has been faded out due to inactivity. This count is influenced by the configuration parameter rsi_fadeout_time.
BlockReads - Number of blocking (SQM_WAIT_C) reads performed by an RSI thread against the SQM thread that manages an RSI queue.
SendPTTime - Time, in 100ths of a second, spent in sending packets of data to the RRS.
RSIReadSQMTime - The time taken by an RSI thread to read messages from SQM.

Essentially, other than adding the new counter RSIReadSQMTime, the only other change is in line with the other threads in that SendPTTimeLast/Max/Avg is collapsed into a single counter, SendPTTime. Again, by looking at some of these in comparison with each other, an idea of different performance metrics can be established. For example, comparing PacketsSent and BytesSent gives an idea of the usefulness of changing the rsi_packet_size parameter. Additionally, by comparing with counters from other threads, the ability of the RSI to keep up can be determined (i.e. SQM:CmdsWritten vs. RSI:MsgsSent). If using RS 15.0 and the route seems slow, the last two counters can be used to determine whether it is the network (or downstream RRS) or the outbound queue reading speed that is the largest source of time. One thing to note is that the RSI does not use the SQT library - messages are simply sent in the order they appear in the outbound queue. The problem with this is that the RSI lacks the SQT cache that can help buffer activity when the downstream system is lagging slightly - which may translate into more blocks being read physically than desired. As a consequence, since the RSI includes SQMR logic, the SQMR counters BlocksRead and BlocksReadCached may be helpful in determining why a route may be lagging.
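As a hypothetical worked example of the PacketsSent/BytesSent comparison (the counter values below are invented purely for illustration): if a sample interval showed BytesSent = 52,428,800 and PacketsSent = 25,600, the average packet carries only 2,048 bytes - a hint that the default rsi_packet_size is the limiter and that raising it toward 8192 may be worth testing.

-- average bytes per RSI packet from the sampled counter values
select 52428800 / 25600 as avg_bytes_per_packet    -- = 2048
go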

Routing Performance Advantages

In certain circumstances, a routed connection will perform better than a non-routed connection. Some of these are described below. It is important to note that routes may not out-perform in all circumstances - in fact, a common fallacy is that a route will outperform a normal Warm Standby setup even if the sites are located fairly close together.

SQL Delivery

In some cases, nearly all of the cpu is consumed with processing the inbound stream. As a result, little cpu is available for the DSI connection to generate and apply the SQL. However, since the RS threads are executed at the same priority, the DSI connection ends up getting the same amount of cpu time as the other threads. In this case, often the symptom is a fully caught up outbound queue, but a lagging inbound queue (due to the DIST thread having to wait for access to the outbound queue SQM) or a lagging RepAgent. Prior to RS 12.5/SMP, in these cases it made sense to split the replication processing in half by using a route. Consequently, one cpu could concentrate on the inbound connection, while another cpu (perhaps on the same box) would concentrate on SQL delivery. This is frequently the excuse why some set up their standby systems as remote standbys even when close together. As noted earlier, this has some tremendous puzzlers to solve the minute the standby pair is a target of replication from another system. Additionally, the amount of cpu gained over a normal WS must exceed the cost of the additional cpu used for the DIST thread (typically suspended in WS-only configurations) as well as the extra I/O cost to write to the outbound queue. This is very difficult to substantiate, as some of the highest throughputs measured with Replication Server at customer sites have all been with traditional Warm Standby configurations. Consequently, it might be said that the most appropriate place for a SQL delivery based performance improvement using routing is when the system is a normal replicate database and not a standby.

Distributed Processing

One of the more common implementations in routing environments is using multiple RSs to distribute the processing load when a single RS needs to communicate with a large number of connections. While a single Replication Server can handle dozens of connections, the amount of resources necessary on a single machine would be tremendous. Additionally, prior to RS 12.5/SMP, a single RS could easily be swamped trying to maintain a large number of high volume connections. Consequently, even from the earliest days of version 10.x, customers were implementing multiple replication servers using routing as a way of getting multi-processor performance. In such implementations, generally a single RS was implemented at each source with multiple Replication Servers serving the destinations as necessary. In some cases, this was even implemented between only two nodes - a primary and a replicate. While obvious for remote nodes, it would not appear to be as necessary when both nodes are local. However, in some extremely high volume situations, the inbound processing could fully utilize a cpu. Under these circumstances, when not using the SMP version of RS, it may make sense to offload the DSI processing to another cpu via replication routing. This is particularly true in the case of corporate rollup scenarios in which the DSI's SQT library may be exercised more fully since transactions from different sources may be intermingled. With RS 12.5/SMP, this advantage is totally eliminated for local nodes.
For remote nodes, a route still may be optimal to ensure network resilience.

Network Resilience

One of the biggest advantages of replication routes is the ability to provide network resilience. This capability is directly attributable to the concept of indirect routes. In recent years, there have been a number of incidents that have illustrated how easy it is to disrupt wide-area networks. Not too many years ago, a train crash and resulting fire in a tunnel in Baltimore, Maryland USA disrupted network communications for MCI for several days. Similarly, the World Trade Center disaster on 9/11 left many businesses in Manhattan electronically stranded - and those that routed services through it were equally disadvantaged. By using an indirect route, should a physical network outage occur, replication system administrators can simply re-direct the route over an alternate direct route.
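As a minimal sketch of the above (server names are hypothetical and the syntax should be verified against the RCL reference for your RS version), an indirect route is simply a route whose next site is another Replication Server, and re-directing it after an outage is a matter of changing that next site:

-- create an indirect route from NY_RS to LONDON_RS through CHICAGO_RS
create route to LONDON_RS
    set next site to CHICAGO_RS
go
-- if the Chicago path is lost, re-point the route through an alternate intermediate RS
alter route to LONDON_RS
    set next site to ATLANTA_RS
go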

Routing Performance Tuning

There really is not much to tune for a route. Out of the box, the configuration settings are fairly optimal for most environments, although some recommendations as above are appropriate. An intermediate node in the route really experiences minimal loading outside of the outbound queue for the outgoing route. However, you still shouldn't have an intermediate node attempting to service dozens of direct routes when a more conservative approach would be much more efficient. Consequently, route performance becomes more of a network tuning exercise. If the route is over a very low bandwidth network or is sharing the bandwidth with extremely high bandwidth applications such as video teleconferencing, you can expect very low performance from the route. For most cases, however, a sudden drop in routing throughput will be due to an unexpected network issue such as an outage, DNS errors, or other network related problems. There is one aspect to consider, however, if multiple databases are involved - there is only one RSI for each route. This can lead to IO saturation in some instances. Consider the differences between the following two scenarios:

Figure 56 - A Common Multi-DB Routing Implementation

Figure 57 - A More Optimal Multi-DB Routing Implementation


Why is this more optimal? In the first example, all 12 databases use the same route. This means that 12 DIST threads in one RS are all trying to write to the same outbound queue and a single RSI is trying to send the messages for 12 connections. This may be fine for low volume systems, but for high volume systems, the outbound queue for the RSI connection is likely going to be a source of contention and may become an IO bottleneck as well. In the bottom example, there are 4 routes - the load is split between the 4 routes using 4 outbound queues (one for each route) and 4 RSIs send the messages. Additionally, each of the routes could have disk affinity enabled, reducing the chances for an IO bottleneck on a single device. It might be tempting to think then that New York should have 4 RSs as well. While this may be true simply from a loading perspective, it may not help routing performance considering the direction London to New York. Remember, the route will have a unique DIST thread at the RRS that will be writing directly into the outbound queue for the destination connection. Consequently, as soon as we created 4 routes to London, there are 4 DIST threads - one for each route - in the NY_RS to handle the traffic in reverse.
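A minimal sketch of the disk affinity point (route and partition names are hypothetical; the partitions would have been created beforehand, and the alter route syntax should be verified for your RS version):

-- give each route's outbound queue its own partition to spread the queue I/O
alter route to LONDON_RS1 set disk_affinity to 'route1_part'
go
alter route to LONDON_RS2 set disk_affinity to 'route2_part'
go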

As mentioned, though, the New York RS may be overloaded with the 12 connections. In fact, considering workload distribution and using multiple RSs, the following diagrams depict the bad, better, better-yet, and best architectures for a large multi-database source system:

Figure 58 - Bad - Not a Good Plan

Figure 59 - Not Much Better - But Unfortunately, All Too Common

Figure 60 - Ahhh.Feels Much Better


Figure 61 - The Best Yet!!!


The rationale for the above stems from multiple factors:

- Currently with RS 15.0, an RS can best deal with about 2 high volume connections and a total of 10 connections before latency is impacted due to task switching. While more connections may be doable in low volume situations, this is the optimal range.
- As mentioned above, the division of routes allows load balancing of the IO processing for the route messages.
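Before deciding whether to split the load across additional Replication Servers, it helps to see what a given RS is already servicing; a quick sketch using admin who (output layout varies by version):

admin who
go
-- narrow the view to the DSI scheduler threads - roughly one per replicate connection
admin who, dsi
go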


Parallel DSI Performance


I turned on Parallel DSIs and didn't get much improvement - what happened?
The answer is that if using the default settings, not a whole lot of parallelism is experienced. In order to understand parallel DSIs, a solid foundation in Replication Server internal processing is necessary. This goes beyond just understanding the functions of the internal threads - it also means understanding how the various tuning parameters as well as the types of transactions affect replication behavior, particularly at the DSI. In the following sections, we will discuss the need for parallel DSI, internal threads, tuning parameters, serialization methods, special transaction processing and considerations for replicate database tuning.

Need for Parallel DSI

There are five main bottlenecks in the Replication Server:

1. Replication Agent transaction scan/delivery rate
2. Inbound SQT transaction sorting
3. Distributor thread subscription resolution
4. DSI transaction delivery rate
5. Stable Queue/Device I/O rate

In early 10.x versions of Replication Server, it was noticed that the largest bottleneck in high volume systems was #4 DSI transaction delivery rate. The reason was very simple. At the primary database, performance was achieved by concurrent processes running on multiple engines using a task efficient threading model. On the other hand, at the replicate database, Replication Server was limited to a single process. Consequently, if the aggregate processing at the primary exceeded the processing capability of a single process, the latency would increase dramatically. Much of this time was actually not spent on processing as most replication systems were typically handling simple insert/update/delete statements, but rather the sleep time waiting for the I/O to complete. Consider the following diagram.

[Figure: five high-volume OLTP sources at roughly 100 tpm each (500 tpm aggregate), with balanced work/load in the run/sleep queues at the primary, feeding a single DSI limited to roughly 200 tpm - high sleep time, one cpu busy, the RS inbound queue growing steadily while the outbound queue remains steady]

Figure 62 - Aggregate Primary Transaction Rate vs. Single DSI Delivery Rate
It should be noted that in the above figure, the numbers are fictitious. However, it does illustrate the point of how a single threaded delivery process can quickly become saturated. Early responses to this issue talked around it by attributing this to Replication Server's ability to flatten out peak processing to a more manageable steady-state transaction rate. While this may be appealing to some, organizations with 24x7 processing requirements or those with OLTP during the day and batch loading at night quickly realized that this flattening required a lull time of little or no activity during which replication would catch up. Due to normal information flow, the organizations did not have this time to provide. The obvious solution was to somehow introduce concurrency into the replication delivery. The challenge was to do so without breaking the guarantee of transactional consistency. The result was that in version 11.0, Parallel DSIs were introduced to improve the replication system delivery rates.

Key Concept #25: Replication/DSI throughput is directly proportional to the degree of concurrency within the parallel DSI threads.

Parallel DSI Internals

Earlier in one of the first sections of this paper, we discussed the internal processing of the Replication Server. From this aspect, very little is different for Parallel DSIs; however, considerable skill and knowledge is necessary to understand how these little differences are best used to bring about peak throughput from Replication Server. While this section discusses the internals and configuration/tuning parameters, later sections will focus on the serialization methods as they are key to throughput, as well as on tuning Parallel DSIs.

Parallel DSI Threads

The earlier diagram discussing basic Replication Server internal processing included Parallel DSIs in the illustration (step 11 in the diagram below).
[Figure: Replication Server internals with Parallel DSIs - RepAgent -> Rep Agent User -> SQM -> inbound queue -> SQT -> Distributor (SRE, TD, MD) -> outbound SQM/queue -> DSI-S (SQT) -> parallel DSI-Exec threads -> replicate database, with the RSSD/STS memory pool and the dAIO daemon supporting the stable device (numbered steps 1-12 in the original diagram)]

Figure 63 - Replication Server Internals with Parallel DSIs
While the DSI thread is still responsible for transaction grouping, etc., it is the responsibility of the DSI Executor threads to perform the function string translation, apply the transactions and perform error recovery. Up to 255 Parallel DSI threads can be configured per connection. However, after a certain number of threads, adding more will not increase throughput.

rs_threads processing

As mentioned earlier (and repeatedly), the Replication Server guarantees transactions are applied in the same order at the replicate as at the primary. At first glance, this would seem an impossible task where Parallel DSIs are employed - a long running procedure on DSI 1 could still be executing while DSI 2 might get ahead. To prevent this, Replication Server 12.5 and earlier implemented a synchronization point at the end of every transaction by way of the rs_threads table.
create table rs_threads (
    id      int,            -- thread id
    seq     int,            -- one up, used for detecting rollbacks
    pad1    char(255),      -- padding for rowsize
    pad2    char(255),
    pad3    char(255),
    pad4    char(255)
)
go
create unique clustered index rs_threads_idx on rs_threads(id)
go


-- alternative implementation used on servers with >2KB page size
-- contained in the rs_install_rll.sql script
create table rs_threads (
    id      int,
    seq     int,
    pad1    char(1),
    pad2    char(1),
    pad3    char(1),
    pad4    char(1)
)
lock datarows
go
create unique clustered index rs_threads_idx on rs_threads(id)
go

While rs_threads is still used in later versions of RS (i.e. 12.6 and 15.0), an alternative implementation called "DSI Commit Control" is also available and is discussed in the next section. The rs_threads table is manipulated using the following functions, used only when Parallel DSI is implemented.

rs_initialize_threads - Used during initial connection to set up the rs_threads table. Issued shortly after rs_usedb in the sequence.
rs_update_threads - Used by a thread to lock its row in the rs_threads table to ensure commit order and also to set the sequence number for rollback detection.
rs_get_thread_seq - Used by a thread to determine when to commit by selecting the previous thread's row in rs_threads.
rs_get_thread_seq_noholdlock - Similar to the above, but only used when isolation_level_3 is the serialization method.

To understand how this works, consider an example in which 5 Parallel DSI threads are used. During initial connection processing at recovery, Replication Server will first issue the rs_initialize_threads function immediately after the rs_usedb. This procedure simply performs a delete of all rows (a logged delete vs. truncate table, due to heterogeneous support) and then inserts blank rows for each DSI, initializing the seq value to 0. During processing, when Parallel DSIs are in use, the first statement a DSI issues immediately following the begin transaction for the group is similar to the following:
create procedure rs_update_threads
    @rs_id  int,
    @rs_seq int
as
    update rs_threads set seq = @rs_seq where id = @rs_id
go

Each DSI simply calls the procedure with its thread id (i.e. 1-5 in our example) and the seq value plus one from the last transaction group (the initial call uses a value of 1). Since this update is within the transaction group, it has the effect of locking the thread's row for the duration of the transaction group. Following this, the normal transaction statements within the transaction group are sent as usual. After all the transaction statements have been executed, the DSI then attempts to select the previous thread's row from the rs_threads table using the rs_get_thread_seq function. If the previous thread has not yet committed, then the thread is blocked (due to lock contention) by the update lock on the row held by the previous thread. If the previous thread has committed, then the lock is not held and consequently the current thread can possibly commit as well. Ignoring the effects of serialization method on transaction timing, this could be illustrated as in the below diagram. Note that in each case, each subsequent thread is blocked and waiting on the previous thread's update on rs_threads.


[Figure: timeline of parallel DSI threads - each thread issues rs_begin (BT n), rs_update_threads (UT n), its replicated DML commands (TX 01-20), then rs_get_thread_seq against the previous thread (GT n) and finally rs_commit (CT n); each thread's rs_get_thread_seq call is shown blocked until the previous thread commits]

Legend: BT n = rs_begin for the transaction for thread n; CT n = rs_commit for the transaction for thread n; TX ## = replicated DML transaction ##; UT n = rs_update_threads n; GT n = rs_get_thread_seq n

Figure 64 - Parallel DSI Thread Sequencing Via rs_threads


Anyone who has monitored their system and checked object contention probably thought all of the blocking on rs_threads was a problem. As illustrated above, it is actually deliberate. The theory behind the above is that transactions can acquire locks and execute in parallel, but due to the rs_threads locking mechanism, the transactions are still committed in order (1-20 in the above). After each thread commits, it then requests the next transaction group from the DSI-S. Note that this happens in commit order; consequently, in an ideal situation, the transaction groups will proceed in sequence through the threads. The first question that comes to mind for many is: what happens if one of the threads hits an error and rolls back its transaction? Wouldn't the next thread simply commit? The answer is no. This is where the seq column comes in - and the realization of why rs_get_thread_seq has seq in the name. As each rs_get_thread_seq function call is made, it returns the seq column for the previous thread. This value is simply compared to the previously seen value. If it is equal to the previous value, then an error must have occurred and subsequent transactions need to roll back as well. However, if the seq value is higher than the previous seq value for that thread, then the current thread can commit its transaction.
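For reference, a minimal sketch of what the rs_get_thread_seq call effectively executes against the previous thread's row is shown below - the actual default function string may differ by RS version, but the behavior is the same: the select blocks on the update lock held by the previous thread's rs_update_threads call, and the returned seq value is compared as described above.

declare @rs_id int
select @rs_id = 2        -- hypothetical: thread 3 reading thread 2's row
select seq from rs_threads where id = @rs_id
go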


[Figure: per-thread logic - rs_begin; rs_update_threads n; the replicated transactions; rs_get_thread_seq n-1 (blocked until the previous thread releases its row); if the returned seq is greater than the previous value, commit the transaction; otherwise roll back the transaction and suspend the connection]

Figure 65 - rs_get_thread_seq and seq value comparison logic


It should be emphatically stated that:

1. Blocking on rs_threads is NOT an issue - it is deliberate and precisely used to control the commit order. Threads will block until it is their turn to commit.
2. Deadlocks raised involving rs_threads do not imply that rs_threads is an issue. Instead, they are an indicator that the statement the deadlock surfaced with has contention due to out-of-sequence execution.

To put it simply, rs_threads is NEVER the issue!!! To find the real cause of concern, you can monitor the true contention through monDeadlocks and monOpenObjectActivity as well as by watching monProcessWaits and monLocks - especially if the replicate database is also used by end-users for reporting or if maintenance activities are being performed. Techniques for finding the true causes of deadlocks/contention are discussed below in the section Resolving Parallel DSI Contention.

DSI Commit Control

So, then, if rs_threads is not the issue, why was DSI Commit Control implemented? The rationale stems from several reasons:

1. If there is intra-thread contention, it is handled by causing a deadlock. ASE chooses the deadlock victim according to its own algorithm, which favors longer running tasks - which in this case is probably the task that should have waited. Consequently, the wrong task is often rolled back as the deadlock victim. This adds additional work for the re-submittal of the SQL batches involved.
2. Since RS knows the commit sequence, if contention does occur under DSI Commit Control, only the offending thread and subsequent threads need to be rolled back. The blocked thread and any others up to the blocking thread can continue.
3. The logic for rs_threads is heavily dependent on the ASE locking scheme and consequently does not lend itself to heterogeneous situations.
4. For very short transactions with small or no transaction grouping, the rs_threads activity adds significantly to the IO processing of replication.

As a result, DSI Commit Control was implemented in RS 12.6 as a more internal means of controlling contention detection and resolution between Parallel DSIs. The implementation is as follows:

1. Each thread submits its batch of SQL as usual.
2. After the batch has completed execution, it checks to see if the previous thread has committed. If so, the current thread can simply go ahead and commit.
3. If the previous thread has not committed, the current thread issues the rs_dsi_check_thread_lock function to see if its SPID is blocking another DSI thread.
4. If rs_dsi_check_thread_lock returns a non-zero number, the thread rolls back its transaction.
5. If rs_dsi_check_thread_lock returns 0, it waits dsi_commit_check_locks_intrvl milliseconds and then checks again to see if the previous thread has committed, re-issuing rs_dsi_check_thread_lock if not.
6. Step 5 is repeated dsi_commit_check_locks_max times, after which the batch is rolled back regardless.

This can best be illustrated by the following flow-chart:


[Figure: commit-control flow - execute the SQL; if the previous thread has committed, commit; otherwise issue rs_dsi_check_thread_lock: a result > 0 triggers a rollback/abort, a result of 0 waits dsi_commit_check_locks_intrvl and re-checks, and exceeding dsi_commit_check_locks_max forces the rollback]

Figure 66 - Commit Control Logic Flow


Note that, of course, if the thread is blocked, it does not get out of the first stage (executing SQL) until the contention is resolved. Additionally, note that if the threads commit quickly, there is no delay at all. The first question that might be asked is: how would a thread know the previous thread had committed? Referring back to the earlier diagram, as each thread commits, it sends an acknowledgement to the DSI-S before doing post-transaction clean-up and sending a thread ready message.

Figure 67 Logical View of DSI & DSIEXEC Intercommunications


From the above diagram, you can see that it would be fairly simple for the DSI-S to withhold the Commit message from a subsequent thread until it gets a Committed message from the previous thread. The only issue then is to determine when a later thread is blocking an earlier thread, resulting in an application deadlock - the earlier thread is blocked while the later thread is waiting for it to finish - hence rs_dsi_check_thread_lock.

On the plus side of rs_threads, it distinctly focuses in on the exact threads with contention and execution continues as soon as the contention is lifted. The default function string provided for RS 12.6 is much less specific and in fact may lead to excessive false rollbacks just due to contention between the RS and other processes. This definition is:
alter function string rs_dsi_check_thread_lock
for sqlserver_function_class
output language
'
select count(*) "seq" from master..sysprocesses where blocked = @@spid
'

As noted, this would return a non-zero value whenever the DSI thread was blocking any other user - for example, someone running a report or trying to do table maintenance. Consequently, a slight alteration achieves the desired effect of only counting a block when the DSI thread is blocking another maintenance user transaction:
alter function string rs_dsi_check_thread_lock
for sqlserver_function_class
output language
'
select count(*) "seq" from master..sysprocesses
where blocked = @@spid
and suid=suser_id()    -- added to detect only maintenance user blocks
'

As this statement may get executed extremely frequently, the recommended approach is to actually use a stored procedure and a modified function string definition that calls it such as:
-- procedure modification
-- add to rs_install_primary.sql (rsinspri.sql on NT)
create procedure rs_dsi_check_thread_lock
as
begin
    select count(*) "seq"
    from master..sysprocesses
    where blocked = @@spid
    and suid = suser_id()
    return 0
end
go

-- install in RS
-- function string modification
alter function string rs_dsi_check_thread_lock
for rs_default_function_class
output language
'
exec rs_dsi_check_thread_lock
'
go

The rationale is that this avoids optimizing the above SQL statement every 100 milliseconds or whatever dsi_commit_check_locks_intrvl is set to. One important note: in addition to the modification needed for rs_dsi_check_thread_lock, the default configuration values are likely too high to provide effective throughput as well. The biggest problem is that the default value for dsi_commit_check_locks_intrvl is set to 1000ms, or 1 second. This is likely too long to wait by a full order of magnitude, as any contention will result in the thread waiting 1 second as well as preventing subsequent threads from committing. To understand the magnitude of the problem, consider what would happen if 5 threads were being used and the first thread had a long running transaction. As a result, threads 2-5 would each execute the rs_dsi_check_thread_lock function and wait for 1 second. As soon as thread 1 commits, it still could be up to 1 second later before thread 2 commits due to waiting on dsi_commit_check_locks_intrvl. Note that thread 3 is waiting on thread 2; consequently, depending on the timing of the rs_dsi_check_thread_lock calls, thread 3 could be delayed up to 1 second after thread 2, and so forth. The net result is that the maximum delay will be:
max_delay=(num_dsi_threads-1) * dsi_commit_check_locks_intrvl

So with 5 threads, the max delay at the default settings would be 4 seconds - in a high volume system, several thousand SQL statements could have been executed during this period. As a result, a better starting value for dsi_commit_check_locks_intrvl is likely 100ms or even less. The problem is that this method depends on the speed of materializing the master..sysprocesses virtual table. On replicate systems used for reporting, this could result in a considerable number of rows that then have to be table scanned for the values (virtual tables such as sysprocesses do not support indexing).

There is another problem: false blocking. If an earlier thread acquires a lock and blocks a later thread, this should be expected and is not an issue. However, the statement above would detect that a blocked user existed. Consider the following scenario:

1. Thread #1 starts processing and is executing a larger than average transaction, or one that executes longer than normal due to a replicated procedure or an invoked trigger.
2. Thread #2 completes its transaction; in the process, it acquires locks that block thread #3.

Thread #2 checks the commit status of thread #1 and sees that it isn't ready to commit, so it then issues rs_dsi_check_thread_lock - which returns a non-zero number since thread #3 is blocked. The result is predictable. One might think that this is easily rectified by returning the spid being blocked. However, it is likely that this could be a deadlock chain - such as #2 blocking #3, who is in turn blocking #1. Without knowing all the spids for the previous threads and traversing the full chain, there is no way for a thread to know whether the block is a real problem or not. The net result is a rollback when none is necessary.

Thread Sequencing

As mentioned, the parallel transactions are submitted to each of the threads in order. Now that we understand how they commit in order, it might help to understand how they start in order. The key to thread sequencing is to understand that, based on the dsi_serialization_method, parallel threads can start once the previous thread has reached one of three conditions:

Ready to Commit - In this scenario, subsequent threads can start only when the previous thread has submitted all its transaction batches successfully, received a successful rs_get_thread_seq function and is ready to send the rs_commit function. NOTE: A common misconception is that this implies the previous thread has committed - in reality, it is merely ready to commit.

Started - In this scenario, subsequent threads can start only after the previous thread has already started.

When Ready - In this scenario, threads can start at any point as soon as they are ready. This doesn't change the commit order; it merely allows a thread to start when it is ready vs. waiting for another thread.

This coordination is done by the DSI Scheduler. If you look back at the earlier detailed diagram of the DSI execution flow, each DSIEXEC sends messages back to the DSI-S informing it of the current status of its processing.

Figure 68 Logical View of DSI & DSIEXEC Intercommunications


Based on the above diagram, you can see how commit control would work from an internals perspective - each subsequent thread to be committed simply would not get told to commit (step 11) until the previous thread had successfully committed (step 13). From the perspective of thread sequencing, the thread at the bottom (with no lines to it) could begin executing at the following points:

Ready to Commit - In this scenario, thread #2 would have to wait until the Commit Ready (step 10) message was received by the DSI-S. When the DSI-S got the Commit Ready message from thread #1, it would send the Begin Batch message to thread #2 - assuming it had received a Batch Ready message from thread #2.

Started - In this scenario, thread #2 would only wait until the Batch Began (step 7) message was received by the DSI-S. When the DSI-S got the Batch Began message from thread #1, it would send the Begin Batch message to thread #2 - again, assuming that it had received a Batch Ready message from thread #2.

When Ready - In this scenario, threads can start at any point as soon as they are ready. Consequently, when thread #2 sends its Batch Ready message, the DSI-S immediately replies with Begin Batch.

Note that the batch we are discussing is only the first batch. Subsequent command batches are sent until the thread reaches the end and is ready to commit. The purpose of command batch sequencing is to try to control contention through proper execution order. The basic premise is this: if the first transaction group is allowed to start in its proper order, it will acquire the locks it needs first, and subsequent threads will simply block vs. deadlocking. However, the problem with this theory is that it depends largely on the following factors:

Transaction Group Size - Essentially, how large the transaction group is in number of statements. If the transaction groups are submitted nearly in parallel, the first batch of SQL statements in each thread logically should follow the last from the previous thread. However, they are being executed first, resulting in an overlap in which the vulnerability to a deadlock is raised. The larger the transaction groups, the greater this vulnerability.

Long Running SQL - If a thread executes a long running statement - such as a stored procedure, or if an invoked trigger runs long - the likelihood is that subsequent threads will get ahead of the first thread and most likely be ready to commit (waiting on rs_threads or commit control) by the time the first thread completes the long running statement. As a result, any other statements left to be executed by the first thread increase the vulnerability to a rollback due to a deadlock.

ASE Execution Scheduling - As each statement is executed, it is likely that logical and/or physical IOs will need to be performed. As a result, the SPID for the DSI thread is put to sleep pending the IO and execution moves to the next task on the ASE run queue. When the IO has completed, the thread is woken up and put on the runnable queue for processing. However, it is likely that multiple DSI threads will be waiting for IO concurrently. Note that ASE doesn't know the ideal execution order based on the DSI pattern, so ASE can wake up any one of them in any order, resulting in out-of-order execution.

DSI Transaction Grouping - After each complete execution, the parallel DSI thread needs to get the next batch of transactions from the DSI Scheduler. If insufficient cache or time was spent grouping the transactions, a transaction group may not be available.

Problems in any one of these areas can lead to bursty behavior in which blocking or commit sequencing results in apparent thread inactivity. The goal, then, is understanding how the configuration parameters - especially the serialization method - along with replicate DBMS tuning can minimize periods of inactivity, enabling maximum parallelism for the transaction profile.

Configuration Parameters

There are several configuration parameters that control Parallel DSIs.

batch_begin
Default: on; Recommended: (see text)
    Indicates whether a begin transaction can be sent in the same batch as other commands (such as insert, delete, and so on). While it is unarguable that it should be on for non-parallel DSI and for parallel DSIs using a wait_for_commit serialization method, there is disagreement currently over whether having this enabled for parallel DSI serialization methods such as wait_for_start delays the begin sequencing.

dsi_commit_check_locks_intrvl
Default: 1000ms; Recommended: 50-100ms
    The number of milliseconds (ms) the DSI executor thread waits between executions of the rs_dsi_check_thread_lock function string. Used with parallel DSI. Minimum: 0; Maximum: 86,400,000 ms (24 hours).


dsi_commit_check_locks_logs
Default: 200; Recommended: <100 (see text)
    The number of times the DSI executor thread executes the rs_dsi_check_thread_lock function string before logging a warning message. Used with parallel DSI. This should be set to a value which puts a log warning out after 3-5 seconds, to provide an earlier indication of an issue. To arrive at this value, simply divide 3000 (3 seconds in milliseconds) by dsi_commit_check_locks_intrvl. Likely this will be a number <100. Minimum: 1; Maximum: 1,000,000.

dsi_commit_check_locks_max
Default: 400; Recommended: (see text)
    The maximum number of times a DSI executor thread checks whether it is blocking other transactions in the replicate database before rolling back its transaction and retrying it. Used with parallel DSI. Note that at the default setting of 1000ms for dsi_commit_check_locks_intrvl, the default setting of 400 becomes 400 seconds or 6.667 minutes - which is far, far too long. The checks should terminate in 5-10 seconds or less - the shorter especially for pure DML (insert, update, delete). Again, if we use 10 seconds as our max, to derive the value we would simply divide 10,000ms by dsi_commit_check_locks_intrvl - if 100ms, the answer would be 100. Minimum: 1; Maximum: 1,000,000.

dsi_commit_control
Default: on; Recommended: (see text)
    Specifies whether commit control processing is handled internally by Replication Server using internal tables (on) or externally using the rs_threads system table (off). The recommendation is based on your preference, as both mechanisms have positives and negatives as discussed above.

dsi_isolation_level
Default: DBMS dependent; Recommended: 1
    Specifies the isolation level for transactions. The ANSI standard and Adaptive Server supported values are:
    0 - ensures that data written by one transaction represents the actual data.
    1 - prevents dirty reads and ensures that data written by one transaction represents the actual data.
    2 - prevents nonrepeatable reads and dirty reads, and ensures that data written by one transaction represents the actual data.
    3 - prevents phantom rows, nonrepeatable reads, and dirty reads, and ensures that data written by one transaction represents the actual data.
    Data servers supporting other isolation levels are supported as well through the use of the rs_set_isolation_level function string. Replication Server supports all values for replicate data servers. The default value is the current transaction isolation level for the target data server.

dsi_keep_triggers
Default: on (except standby databases); Recommended: off
    Specifies whether triggers should fire for replicated transactions in the database. Set off to cause Replication Server to set triggers off in the Adaptive Server database, so that triggers do not fire when transactions are executed on the connection. While the book suggests setting this on for all databases except standby databases, the reality is that unless you are doing procedure replication - or are not replicating tables that are strictly populated by triggers - this can safely be set to off.


dsi_large_xact_size
Default: 100; Recommended: 10,000 or 2,147,483,647 (max)
    The number of commands allowed in a transaction before the transaction is considered to be large, for using a single parallel DSI thread. The minimum value is 4. The default is probably far too low for anything other than strictly OLTP systems. While the initial recommendation would be to raise this to 2 billion and thereby prevent this configuration from kicking in at all (as it has little real effect), if the application does have some poorly designed large transactions, setting this to a much higher number than the default might help reduce DSI latency when the DSI is waiting on a commit before it even starts.

dsi_max_xacts_in_group
Default: 20; Recommended: (see text)
    Specifies the maximum number of transactions in a group. Larger numbers may improve data latency at the replicate database. Range of values: 1-100. The reason this is mentioned here at all is because of the impact on parallel DSIs. In non-parallel DSI environments, setting this higher may help throughput. In parallel DSI environments - especially those involving a lot of updates or deletes - this may have to be set considerably lower (i.e. 5-10). A common mistake is setting this to 100 and using a single DSI instead of attempting parallel DSIs and a lower value. While 100 may work in some instances, all too often grouping rules make it difficult to achieve; hence, parallel DSIs are a better approach than increasing this value significantly.

dsi_num_large_xact_threads
Default: 2 if parallel_dsi is set to true; Recommended: 0 or 1 (see text)
    The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads. More than 2 are probably not effective. If dsi_large_xact_size is set to 2 billion, this should be set to 0. If attempting some large transactions, 1 is likely the best setting. See the sub-section on Large Transaction Processing later in this section for details.

dsi_num_threads
Default: 1 if no parallel DSIs; 5 if parallel_dsi is set to true; Recommended: (see text)
    The number of parallel DSI threads to be used. The maximum value is 255. See the section on parallel DSI for the appropriate setting - but it is likely that the default is too low for high performance situations.

dsi_partitioning_rule
Default: none; Recommended: (see text)
    Specifies the partitioning rules (one or more) the DSI uses to partition transactions among available parallel DSI threads. Values are origin, ignore_origin, origin_sessid, time, user, name, and none. See the Replication Server Administration Guide Volume 2 for detailed information. The recommended setting is to leave this set to none unless using parallel DSIs and experiencing more than 1 rollback every 10 seconds. Then try the combination origin_sessid, time.

dsi_serialization_method
Default: wait_for_commit; Recommended: wait_for_start
    The method used to maintain serial consistency between parallel DSI threads when applying transactions to a replicate data server. Values are:
    isolation_level_3 - specifies that transaction isolation level 3 locking is to be used in the replicate data server.
    single_transaction_per_origin - prevents conflicts by allowing only one active transaction from a primary data server.
    wait_for_commit - maintains transaction serialization by instructing the DSI to wait until one transaction is ready to commit before initiating the next transaction.
    none/wait_for_start - assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system.
    no_wait - threads begin as soon as they are ready vs. waiting for previous threads to at least start as with the other settings.
    See the sub-section on Serialization Methods.


dsi_sqt_max_cache_size
Default: 0; Recommended: 4-8MB
    Maximum SQT (Stable Queue Transaction interface) cache memory for the database connection, in bytes. The default, 0, means that the current setting of sqt_max_cache_size is used as the maximum cache size for the connection. This parameter controls the use of parallel DSI threads for applying transactions to a replicate data server. The more DSI threads you plan on using, the more dsi_sqt_max_cache_size you may need.

parallel_dsi
Default: off; Recommended: (see text)
    Provides a shorthand method for configuring parallel DSI threads. A setting of "on" configures these values: dsi_num_threads = 5, dsi_num_large_xact_threads = 2, dsi_serialization_method = "wait_for_commit", dsi_sqt_max_cache_size = 1 million bytes. A setting of "off" configures these parallel DSI values to their defaults. You can set this parameter to "on" and then set individual parallel DSI configuration parameters to fine-tune your configuration.
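As a sketch of how these parameters are typically applied together (the connection name and values are hypothetical starting points based on the recommendations above, not definitive settings):

suspend connection to RDS.rdb
go
alter connection to RDS.rdb set parallel_dsi to 'on'
go
alter connection to RDS.rdb set dsi_num_threads to '10'
go
alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
go
alter connection to RDS.rdb set dsi_commit_check_locks_intrvl to '100'
go
alter connection to RDS.rdb set dsi_sqt_max_cache_size to '4194304'
go
resume connection to RDS.rdb
go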

As illustrated by the single parameter parallel_dsi, many of these work together. Note that parallel_dsi sets several configuration values to what would appear to be fairly low numbers. However, due to the serialization method, these settings are typically the most optimal. More DSI threads will not necessarily improve performance.

Serialization Methods

Key Concept #26: Serialization method has nothing to do with transaction commit order. No matter which serialization method is used, transactions at the replicate are always applied in commit order. However, it does control the timing of transaction delivery with Parallel DSIs in order to reduce contention caused by conflicts between the DSIs.

One of the most difficult concepts to understand is the difference between the serialization methods. The best way to describe this is that the serialization method you choose depends on the amount of contention that you expect between the parallel threads. Some of this you can directly control via the dsi_max_xacts_in_group tuning parameter. The more transactions grouped together, the higher the probability of contention as the degree of parallelism increases, and the higher the probability of contention with other users on the system. This will become more apparent as each of the serialization methods is described in more detail in the following sections.

wait_for_commit

The default setting for dsi_serialization_method is wait_for_commit. This serialization method uses the Ready to Commit transaction sequencing, as it assumes that there will be considerable contention between the parallel transactions. As a result, the next thread's transaction group is not sent until the previous thread's statements have all completed successfully and it is ready to commit. This results in thread timing in which execution is more staggered than parallel, as illustrated below (Figure 69).


As discussed earlier, this timing sequence has limited scalability beyond 3-5 parallel DSI threads. However, it assures that contention between the threads does not result in one rolling back, which would cause all those that follow to roll back as well.

none/wait_for_start

None does not imply that no serialization is used. What it really means is that no (none) contention is expected between the Parallel DSI threads. As a result, the thread transactions are submitted nearly in parallel, based on transaction sequencing on the begin statement (timing is then more a factor of the number of transactions in the closed queue in the SQT cache), with each thread waiting until the previous thread has begun (hence the new name wait_for_start vs. the legacy term none). This looks similar to the illustration below (Figure 70).

However, in certain situations, none could result in an inconsistent database - see the section on isolation_level_3 for details. As the first method being discussed which has a high degree of parallelism, let's take a look at contention and see how it can be reduced in order to reduce the number of rollbacks.

isolation_level_3

Let's first make sure a common misconception is dispelled. When purely replicating DML (i.e. inserts, updates, and deletes), most people would think that isolation_level_3 will be slower than none as a serialization method and will invoke considerably more blocking with non-replication processes at the replicate. This is an absolute falsehood for the following reasons:

1. Since RS delivers all changes inside of a transaction and DML statement locks are held for the duration of the transaction, there is no difference in lock hold times, etc.
2. Since DRI constraints hold their locks until the end of the transaction, the same is true for DRI constraint locks.

As a result, there is no difference between isolation_level_3 and none from a performance perspective; however, isolation_level_3 is the safer, and consequently the Sybase-recommended, setting up through RS 12.5. There is one exception to this statement: when DOL locking is involved (and unfortunately datarows locking is likely needed to support parallel DSIs, so this is likely). In this case, isolation level 3 locking includes some additional locks, namely range and infinity (or next key) locks. Normally, again, in pure DML replication this will likely have minimal if any


Figure 69 Thread timing with dsi_serialization_method = wait_for_commit


Figure 70 Thread timing with dsi_serialization_method = none

impact. However, if triggers are still enabled, or procedure replication is involved, the hold time for these locks can be extended, not only causing contention with reporting users, but also between the different parallel DSI threads.

It should also be noted that isolation_level_3 as a dsi_serialization_method is a bit of an anachronism in RS 15.0. While it is still available to support legacy configurations, the impact is the same as setting the serialization method to wait_for_start and setting dsi_isolation_level to 3.

Serialization method isolation_level_3 is identical to none with the addition that Replication Server first issues set transaction isolation level 3 via the rs_set_isolation_level_3 function. However, as one would expect, this could increase contention between threads dramatically due to select locks being held throughout the transaction, but ONLY if the replicated operations invoke or contain select statements (i.e. replicated procedures). While Replication Server is normally associated with write activity, a considerable amount of read activity could occur in the following:

- Declarative integrity (DRI always holds locks)
- Select statements inside replicated procedures
- Trigger code, if not turned off for the connection
- Custom function strings

- Aggregate calculations, etc.

Consequently, care should be taken when replicating stored procedures, etc., to ensure that isolation_level_3 is truly necessary to guarantee repeatable reads from the perspective of the replicated transactions, and that the extra lock hold time for selects in the procedure will not increase contention. Consider, for example, the normal phantom read problem in which a process scanning the table reads a row, the row is moved as a result of an update, and then the row is re-read. In a normal system, this is simply avoided by having the scanning process invoke isolation level 3 via the set command. However, if you think about it, no one ever mentions having the offending writer invoke isolation level 3. The reason is that it would be unnecessary: once the reader scans the row to be updated, it holds the lock until the read completes, thereby blocking the writer and preventing the problem. In this case, most of Replication Server's transactions will be the writer, so it is probably in the same role as the offending writer in the phantom read scenario; no isolation level three is required.

Of course, the most obvious example of when isolation level 3 is normally thought of is when performing aggregation for data elements that are not aggregated at the primary, and consequently the replicate may have to perform a repeatable read as part of the replication process. This could be a scenario similar to replicating to a DSS system or a denormalized design in which only aggregate rollups are maintained. Even in these cases, however, isolation level 3 may not be necessary, as alternatives exist. Consider the classic case of the aggregate. Let's assume that a bank does not keep the account balance stored in the primary system (possibly because the primary is a local branch and may not have total account visibility). When replicating to the central corporate system, the balance is needed to ensure timely ATM and debit card transactions. Of course, this could be implemented as a repeatable read triggered by the replicated insert, update, delete or whichever DML operation. However, it is totally unnecessary. Because Replication Server has access to the complete before and after images of the row, a function string similar to the following could be constructed:
alter function string <repdef_name>.rs_update
for rs_default_function_class
output language
'update bank_account
    set balance = balance - (?tran_amount!old? - ?tran_amount!new?)
    where <pkeycol> = ?tran_id!new?'

This maintains the aggregate without requiring isolation level three and, much more importantly, without the expensive scan of the table to derive the delta. By exploiting function strings, or by encapsulating the set isolation command within procedure or trigger logic, you may find that you can either avoid using isolation level three or restrict it only to those transactions from the primary that truly need it.

In summary, in addition to the contention increase simply from holding the locks on select statements, a possibly bigger performance issue when isolation level three is required is the extra I/O cost of performing the scans that the repeatable reads involve, all within the scope of the DSI transaction group. Although isolation_level_3 is currently the safest parallel DSI serialization setting, if it is needed to ensure repeatable reads for aggregate or other select-based queries invoked in replicated procedures or triggers, the primary goal should be to see if a function string approach could eliminate the repeatable read condition. Once eliminated, isolation_level_3 can be set safely without any undue impact on performance.

single_transaction_per_origin

Similar to isolation_level_3, single_transaction_per_origin is outdated in RS 15.0. The same effect can be achieved by setting dsi_serialization_method to wait_for_start and setting dsi_partitioning_rule to origin.
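For those on RS 15.0, the legacy methods therefore map onto the newer connection parameters. A minimal sketch of the equivalent settings is shown below; RDS.rdb is a placeholder connection name, the two equivalences are independent (set whichever applies), and the connection is suspended because DSI configuration changes take effect on resume:

    suspend connection to RDS.rdb
    go
    -- legacy isolation_level_3 is equivalent to:
    alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
    go
    alter connection to RDS.rdb set dsi_isolation_level to '3'
    go
    -- legacy single_transaction_per_origin is equivalent to wait_for_start plus:
    alter connection to RDS.rdb set dsi_partitioning_rule to 'origin'
    go
    resume connection to RDS.rdb
    go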

The single_transaction_per_origin serialization method is mainly used for corporate rollup scenarios. Although clearly applicable for corporate rollups, another implementation for which single_transaction_per_origin works well is the shared primary or any other model in which the target replicated database is receiving data from multiple sources.


Figure 71 Corporate Rollup or Shared Primary scenario


In the above example, all the routes between the sites that would normally be present in a shared primary are not illustrated, simply for image clarity. With this serialization method, since the transactions are from different origin databases, there should not be any contention between transactions. For example, stock trades in Chicago, San Francisco, Toronto, Tokyo and London are completely independent of each other; consequently their DML operations would not interfere with each other except in cases of updates to aggregate balances. However, within each site (for example, transactions from Chicago), some significant amount of contention may exist. By only allowing a single transaction per origin, each DSI could simply be processing a different site's transaction; consequently, the transaction timing is similar to none or isolation_level_3 in that the Parallel DSI threads are not staggered waiting for the previous commit. From an internal threads perspective, it would resemble:

Figure 72 Internal threads processing of single_transaction_per_origin


Note that the above diagram suggests that each DSI always handles a separate origin. This is not quite true; it is just an illustration. The real impact of single_transaction_per_origin is that if any origin already has a transaction group in progress on one DSIEXEC thread and another transaction arrives from the same origin, that transaction is applied

as if the serialization method were wait_for_commit instead. However, if the next transaction is from a different origin, it can be applied in parallel. From a performance perspective, single_transaction_per_origin may not have as high a throughput as other methods such as none. Consider the following:

Origin Transaction Balance - single_transaction_per_origin works best in situations where all the sites are applying transactions evenly. In global situations where the normal workday at one location is offset from the other sites, this is not true. Instead, all of the transactions for a period of time come from the same origin and consequently are single threaded.

Single Origin Error - Consider what happens if one of the replicated transactions from one of the sites fails for any reason. All DSI threads are suspended and the queue fills until that one site's transaction is fixed and the connection resumed. This could cause the outbound and inbound queues to rapidly fill, possibly ending up with a primary transaction log suspend.

Origin Transaction Rate - Again, each individual site effectively has a single DSI out of all the parallel DSIs to use. If the source system has a very high transaction volume, the outbound queue will get behind quickly.

Any one of these situations is fairly common and could cause apparent throughput to appear much lower than normal. While the error handling is easily spotted from the Replication Server error log, the source transaction rate or the balance of transactions is extremely difficult to determine on the fly.

no_wait

The dsi_serialization_method of no_wait is similar to wait_for_start except that the threads do not wait for the other threads to start; instead they simply start as soon as they are ready. Remember, with wait_for_start or none, each thread waits to begin its batch until the previous thread begins. The result is a slightly staggered starting sequence, illustrated a few pages ago, similar to the following:


Figure 73 Thread timing with dsi_serialization_method = none


No_wait not only eliminates this slight stagger, it also means that, since a thread can start when ready, it could even start before the previous thread if the previous thread is not ready for any reason (i.e. still converting to SQL). The result could be something like:

Figure 74 Thread timing with dsi_serialization_method = no_wait

Note that the commit order is still maintained. When would you use no_wait vs. wait_for_start? In an insert intensive environment, no_wait may help. However, in an update intensive environment, because the probability of a conflicting update executing ahead of a previous one is even higher than it was under wait_for_start, no_wait could increase the number of parallel failures/rollbacks.

dsi_serialization_method Summary

So, now that the methods are better understood, which one should you use? Consider the following table:

dsi_serialization_method          When to use                                  When not to use
wait_for_commit                   High contention at primary;                  High volume
                                  Low to mid volume
none/wait_for_start               High volume; Insert intensive                Short cycle update/DML (unless
                                  application; Commit consistent               dsi_isolation_level is set to 3);
                                  transactions                                 Not commit consistent transactions
isolation_level_3                 Mid-high volume; Ensure database             High number of selects in procs or
                                  consistency; Low cardinality rollup          function strings; Satisfiable via
                                  with high volume from each;                  before/after image
                                  Short cycle update/DML
single_transaction_per_origin     High cardinality rollup with low             Commit consistent transactions;
                                  volume from each                             Low cardinality rollup with high
                                                                               volume from each
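Whichever row of the table applies, the method is a per-connection setting, and dsi_max_xacts_in_group is the companion knob for the contention it brings. A minimal sketch, again using the hypothetical connection RDS.rdb:

    suspend connection to RDS.rdb
    go
    alter connection to RDS.rdb set dsi_serialization_method to 'wait_for_start'
    go
    -- smaller groups mean fewer potentially conflicting statements per parallel transaction
    alter connection to RDS.rdb set dsi_max_xacts_in_group to '10'
    go
    resume connection to RDS.rdb
    go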

As you can tell, it simply depends on the transaction profile of the source system.

Transaction Execution Sequence

However, there are transaction profiles that must use dsi_serialization_method=isolation_level_3 vs. dsi_serialization_method=none. Alternatively, they can use none/wait_for_start, but they must have dsi_isolation_level set to 3, which has the same effect.

Disappearing Update Problem

Consider the following scenario of atomic transactions at the primary:
-- assume a table similar to:
-- create table tableX (
--     col_1  int          not null,
--     col_2  int          not null,
--     col_3  varchar(255) null,
--     constraint tableX_PK primary key (col_1)
-- )
insert into tableX (col_1, col_2, col_3) values (1, 2, 'this is a test string')
update tableX set col_2 = 5, col_3 = 'dummy row' where col_1 = 1

One would always expect the resulting row (tuple) to be {1, 5, 'dummy row'}; however, it is possible that the result at the replicate could be {1, 2, 'this is a test string'}, as if the update never occurred. The reason for this is the timing of the transactions. If parallel DSIs are used and the dsi_serialization_method of none is selected, the SQL statements are executed OUT OF ORDER but committed in SERIALIZED ORDER. This is a big difference. If the insert is the last SQL statement in one group and the update is the first statement in the next group, the update will physically occur BEFORE the insert. Consider the following picture:


Figure 75 Statement Execution Sequence vs. dsi_serialization_method=none (legend: BT n = rs_begin for thread n; TX ## = replicated DML transaction ##; CT n = rs_commit for thread n; UT n = rs_update_threads for thread n; GT n = rs_get_thread_seq for thread n)


Many of you may already see the problem. The update effectively sees 0 rows affected; consequently, the insert physically occurs later and the values are never updated. But wait... shouldn't the update block the insert?

Locking in ASE

No. As of SQL Server 10.0, Sybase stopped holding these locks unless isolation level 3 is enabled; consequently, the above can happen. Many people state that situations like this are not described in the books, but they are (as is nearly all the material in this paper). Consider the following description from the Replication Server Administration Guide, located in the section describing Parallel DSI serialization methods (in the Performance and Tuning chapter), and in particular the description for none. It reads:
This method assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system. For example, SQL Server version 4.9.2 holds update locks for the duration of the transaction when an update references a nonexistent row. Thus, conflicting updates between transactions are detected by parallel DSI threads as deadlocks. However, SQL Server version 10.0 and later does not hold these locks unless transaction isolation level 3 has been set. For replication to non-Sybase databases, transaction serialization cannot be guaranteed if you choose either the "none" (no coordination) or the "isolation_level_3" method, which may not be supported by non-Sybase databases.

The highlighted section probably makes a lot more sense now to those who read it in the past and wondered. So, in the above illustration, if the dsi_serialization_method were set to isolation_level_3, the update would hold the locks and consequently the insert would block, resulting in a deadlock as discussed in the last section. The result would be the typical rollback and serial application, and all would be fine.

The DOL/RLL Twist

An aspect that caught people by surprise was when this started happening even when using wait_for_commit and DOL tables. In implementing DOL, Sybase ASE engineering introduced several optimizations under isolation levels 1 & 2:

Uncommitted Insert By-Pass - Uncommitted inserts on DOL tables would be bypassed by selects and other DML operations such as update or delete.

Unconflicting Update Return - Select queries could return columns from uncommitted rows being updated if ASE could determine that the columns being selected were not being updated. For example, an update of a particular author's phone number in a DOL table would not block a query returning the same author's address.
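To see the uncommitted insert by-pass for yourself, a sketch of a two-session experiment is shown below. It assumes a datarows-locked copy of the earlier tableX and an ASE version in which the optimization is active (i.e. the server is not booted with -T693):

    -- session 1: leave an insert uncommitted
    begin tran
    insert into tableX (col_1, col_2, col_3) values (1, 2, 'this is a test string')
    -- (do not commit yet)

    -- session 2: at isolation level 1 on a DOL table, the update by-passes the
    -- uncommitted insert and reports 0 rows affected instead of blocking
    update tableX set col_2 = 5, col_3 = 'dummy row' where col_1 = 1
    select @@rowcount    -- 0

    -- session 1: commit; the row arrives, but session 2's update is already lost
    commit tran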

At this time, there is no proof that the unconflicting update return is a cause for concern. However, in the case of the uncommitted insert by-pass, this broadened the vulnerability to the above problem substantially. Instead of the update or delete having to be executed prior to the insert, any subsequent update or delete would by-pass the uncommitted insert and as a result would return 0 rows affected. Additionally, although the window of opportunity was much narrower, because the vulnerability was exposed until the commit was actually executed, the vulnerability was extended to the other serialization methods, with the exception of dsi_serialization_method = isolation_level_3. It should be noted that in ASE 11.9-12.5, this optimization can be monitored with trace flag 694 and disabled with 693. As a result, customers are advised to do one of the following if they find themselves in this situation:

- Use dsi_isolation_level=3 or dsi_serialization_method=isolation_level_3.
- Boot the server with -T693 to disable the locking optimizations. This may be preferred if isolation level 3 leads to increased contention with parallel DSIs.

Isolation Level 3

The reason isolation level 3 does not experience the problem is intrinsic to the ANSI specification's requirement to prevent phantom rows under isolation level 3. In order to prevent phantom rows, a DML operation that affects 0 rows must hold a lock where the row should have been, to prevent another user from inserting a row prior to the commit (and hence a re-read of the table yielding different results, violating isolation level 3). This is enforced in ASE via the following methods:

APL (All Page) Locking - The default locking scheme protects isolation level 3 on tables by retaining an update lock with an index page context. An update lock is a special type of read lock that indicates that the reader may modify the data soon. An update lock allows other shared locks on the page, but does not allow other update or exclusive locks. Since it is not always possible to determine the page location (i.e. the end of the table) for the lock, the lock is placed on the index page. This prevents inserts by blocking the insert from inserting the appropriate index keys.

DOL (Data Only) Locking - To ensure isolation levels 2 & 3, Sybase introduced two new types of locking contexts with ASE 11.9.2: Range and Infinity locks. A Range lock is placed on the next row of a table beyond the rows that qualify for the query. This prevents a user from adding an additional row that would have qualified either at the beginning or end of a range. The Infinity lock is simply a special Range lock that occurs at the very beginning or end of a table.

Consequently, by retaining these locks, the premature execution of the update will cause a deadlock with rs_threads. As described in the Replication Server Performance & Tuning White Paper, this is deliberate, and results in Replication Server rolling back the transactions and re-issuing them in serial vs. parallel. As a result, if isolation level 3 is set, the above situation becomes:

Figure 76 Deadlock instead of disappearing update with isolation level 3 (legend: BT n = rs_begin for thread n; TX ## = replicated DML transaction ##; CT n = rs_commit for thread n; UT n = rs_update_threads for thread n; GT n = rs_get_thread_seq for thread n)


Spurious Duplicate Keys

Although the issue above has gained the most attention as disappearing updates, and Sybase has been able to identify other situations (such as insert/delete) that could occur, one situation that does not apply is a delete followed by an insert in which the insert is executed first due to the same parallel execution that causes the disappearing update problem. Note that this situation differs significantly from the disappearing update problem; it is related only in the aspect of execution order, and while it might be conceived to be a related problem, it in fact is not. This situation could occur when an application deletes and re-inserts rows vs. performing an update, somewhat analogous to Replication Server's autocorrection feature. Note that a deferred update is NOT logged nor replicated as a separate delete/insert pair, and consequently it must be an explicit delete/insert pair submitted by the application. In this case, referring to the previous drawings, if the insert ended up where the update was illustrated above, it would execute (or attempt to) prior to the delete. As the row is already present, this would result in a duplicate key error being raised by the unique index on the primary key columns. Normally, when any error occurs during parallel execution, the Replication Server logs the error along with a warning that a transaction failed in parallel and will be re-executed in serial. This situation differs from the disappearing update problem in several critical areas:

- While the disappearing update problem does not raise an error, the duplicate key insert does in fact raise an error, which causes a rollback of the SQL that is causing the problem.
- Subsequent execution in serial by Replication Server would correctly apply the SQL and the database would not be inconsistent.

Additionally, other than proper transaction management within the application, none of the current proposals for addressing the disappearing update problem would address this issue. Customers witnessing a frequent number of duplicate key errors that appear spurious, in that subsequent execution succeeds, should attempt to resolve the problem by ensuring proper transaction management is in place or by other application controls outside the scope of this issue. One frequent fix is to determine if the system is a Warm Standby and if an approximate numeric (float) column exists; if so, the likely cause of the spurious keys is our old friend the missing repdef/primary key identification.

Estimating Vulnerability

Before eliminating parallel DSIs, you should first assess the vulnerability of your systems. Basically, the above could happen when an update, delete or procedure (containing conditional logic or aggregation) closely follows a previous transaction such that it falls within num_xacts_in_group * num_threads transactions, but in definitely separate transactions. Examples of applications that might be vulnerable include:

- Applications with poor transaction management in which atomic SQL statements, all part of the same logical unit of work, are often executed outside the scope of explicit transactions.
- Applications in which explicit transactions were avoided due to contention.
- Common wizard-based applications, if each screen saves its information to the database individually prior to transitioning to the next screen (assuming the following screen may update the same row of information).
- Middle tier components that perform immediate saving of data as each method is called vs. waiting until the object is fully populated.
- A typical job queue application in which a job is retrieved from the queue very quickly after being created (as is normal, retrieving a job usually entails updating the job status).
- Work table data is replicated and then a procedure that uses the work table data is replicated.

Updates or deletes triggered by an insert would not be a case, as any triggered DML is included in the implicit transaction with the triggering DML. Specifically, you can determine if DOL has exposed your application by booting the replicate dataserver with trace flag 694. In any case, assessing the window of vulnerability finds it extremely small. The conditions that could cause an issue fall into the following categories:

Non-Existent Row - Basically the scenario addressed in the book and illustrated above, characterized by an insert followed closely by DML. The lack of a row doesn't return an error and doesn't hold the lock when the second statement is executed first. Consequently this scenario is always characterized by an insert followed by an update or delete.

Repeatable Read - The typical isolation level three problem as discussed for the isolation_level_3 dsi_serialization_method. Basically any DML operation followed closely by a read (either in a replicated proc or a trigger on a different table).

This leads up to:

Key Concept #27: Parallel DSI serialization does NOT guarantee that transactions are executed in the same order, which could lead to database inconsistencies, particularly with dsi_serialization_method=wait_for_start or none and dsi_isolation_level other than 3.

If you think about it, we are deliberately executing the transactions somewhat out of order to achieve greater parallelism. If we didn't, then we would be executing them in serial fashion (a la wait_for_commit), which does not achieve any real parallelism. Later (in the Multiple DSI section), we will discuss the concept of commit consistent. At this point, suffice it to say that if the transactions are not commit consistent, use isolation_level_3.

Large Transaction Processing

One of the most commonly known and frequently hit problems with Replication Server is processing large transactions. In earlier sections, the impact of large transactions on SQT cache and DIST/SRE processing was discussed. This section takes a close look at how large transactions affect the DSI thread. It should be noted that it is at the DSI that a transaction is defined as large. While a transaction may be large enough to be flushed from the SQT cache, it still can be too small to qualify as a large transaction.

Parallel DSI Tuning

Tuning parallel DSIs for large transactions is a mix of understanding the behavior of large transactions, particularly in relationship to dsi_large_xact_size, and the SQT open queue processing.

DSI Tuning Parameters

There really are only two tuning parameters for large transactions. Both are only applicable to Parallel DSI implementations. The tuning parameters are:

Parameter: dsi_large_xact_size (Default: 100; Recommended: 10,000 or 2,147,483,647 (max))
Definition: The number of commands allowed in a transaction before the transaction is considered to be large and is given a single parallel DSI thread. The minimum value is 4. The default is probably far too low for anything other than strictly OLTP systems. While the initial recommendation would be to raise this to 2 billion and thereby prevent this configuration from kicking in at all (as it has little real effect), if the application does have some poorly designed large transactions, setting this to a much higher number than ordinary might help reduce DSI latency when the DSI is waiting on a commit before it even starts.

Parameter: dsi_num_large_xact_threads (Default: 2 if parallel_dsi is set to true; Recommended: 0 or 1 (see text))
Definition: The number of parallel DSI threads to be reserved for use with large transactions. The maximum value is one less than the value of dsi_num_threads. More than 2 are probably not effective. If dsi_large_xact_size is set to 2 billion, this should be set to 0. If attempting some large transactions, 1 is likely the best setting. See the text in this section for details.
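Both parameters are connection-level settings. A minimal sketch of the "effectively disable large transaction handling" recommendation above, again using the hypothetical connection RDS.rdb:

    suspend connection to RDS.rdb
    go
    -- raise the threshold so that no transaction qualifies as large ...
    alter connection to RDS.rdb set dsi_large_xact_size to '2147483647'
    go
    -- ... and release the threads that would otherwise be reserved for them
    alter connection to RDS.rdb set dsi_num_large_xact_threads to '0'
    go
    resume connection to RDS.rdb
    go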

The key tuning parameter of the two is dsi_large_xact_size. When a transaction exceeds this limit, the DSI processes it as a large transaction. In doing so, the DSI does the following:

1. Allows the transaction to be sent to the replicate without waiting for the commit record to be read.
2. Uses a dedicated large transaction DSI thread.
3. Every dsi_large_xact_size rows, attempts to provide early conflict detection.

An important note is that this is only applicable to Parallel DSI. If Parallel DSI is not used, large transactions are processed normally with no special handling.

Parallel DSI Processing

In addition to beginning to process large transactions before the commit record is seen by the DSI/SQT, if using Parallel DSIs, the Replication Server also processes the large transaction slightly differently during execution. The main differences are:


- DSI/SQT open queue processing (the DSI doesn't wait for the commit to be seen)
- Early conflict detection
- Use of reserved DSI threads set aside for large transactions

SQT Open Queue Processing

The reference manual states that large transactions begin to be applied by the DSI thread before the DSI sees the commit record. Some people misinterpret this to mean that the transaction has yet to be committed at the primary; except in the case of Warm Standby, however, the transaction has not only been committed, but fully forwarded to the Replication Server. Remember, in order for the inbound queue SQT to pass the transaction to the outbound queue, the transaction had to be committed. However, the DSI could start delivering the commands before the DIST has processed all of the commands from the inbound queue, while a Warm Standby system could be delivering SQL commands prior to the command being committed at the primary. This is accomplished by the DSI processing large transactions from the Open queue vs. the more normal Closed queue in the SQT cache. Overall, this can significantly reduce latency as the DSI does not have to wait for the full transaction to be in the queue prior to sending it to the replicate. However, this does have a possible negative effect in Warm Standby systems: a large transaction may be rolled back at the primary and then need to be rolled back at the replicate as well.

How can this happen? Simple. Consider the case of a fairly normal bcp of 100,000 rows into a replicated table (a slow bcp, so row changes are logged). As the row changes are logged, they are forwarded to the Replication Server by the Rep Agent long before the commit is even submitted to the primary system. At the default, after 100 rows have been processed by the Replication Server, the transaction would be labeled as a large transaction. As a result, the DSI would start applying the transaction's row changes immediately without waiting for a commit (in fact, the commit may not even have been submitted to the primary yet). Now, should the bcp fail due to row formatting problems, it will need to be rolled back, not only at the primary but also at the replicate, as the transaction has already been started there.

With such a negative, why is this done? The answer is simple: transaction rollbacks in production systems are extremely rare (or should be!), therefore this issue is much more the exception than the norm. In fact, for normal (non-Warm Standby) replication, the commit had to have been issued at the primary and processed in the inbound queue or it would not have even gotten to the outbound queue. In addition, the benefit of this approach far outweighs the very small amount of risk. Consider the latency impact of waiting until the commit is read in the outbound queue, as illustrated by the following timeline:
Figure 77 Latency in processing large transactions (timeline comparing Large Xactn at PDS, Rep Agent Processing, Inbound SQT Sort, DIST/SRE, Outbound SQT Sort, DSI -> RDS (normal) and DSI -> RDS (large xactn))


If the DSI did not start applying the transaction until the commit was read, several problems would occur. First, as illustrated above, the overall latency of the transaction is extended. In the bottom DSI execution of the transaction (labeled DSI -> RDS (large xactn)), it finishes well before it would if it waited until the transaction was moved to the SQT Closed queue. This is definitely an important benefit for batch processing, to ensure that the batch processing finishes at the replicate prior to the next business day beginning. Consider the above example. If each time unit equaled an hour (although 2 hours for DIST/SRE processing is rather ludicrous) and the transaction began at the primary at 7:00pm, it

would finish at the replicate at 7:00am the next morning using large transaction thread processing. Without it, the transaction would not finish at the replicate until 10:00am, 2 hours into business processing. The latency savings are really evident in Warm Standby. Remember, for Warm Standby, the Standby DSI is reading from the inbound queue's SQT cache. Normal (small) transactions, of course, are not sent to the Standby database until they have committed. However, since a large transaction reads from the SQT Open queue, it is fully possible that the Standby system will start applying the transaction within seconds of it starting at the primary and will commit at nearly the same time. Compare the following timeline with the one above.
Figure 78 Latency in processing large transactions for Warm Standby (timeline comparing Large Xactn at PDS, Rep Agent Processing, Inbound SQT Sort, DSI -> RDS (normal) and DSI -> RDS (large xactn); the large transaction DSI begins after only the dsi_large_xact_size rows scan time)


However, the above will only happen if large transactions run in isolation. The problem is that if a large transaction begins to be applied and another smaller transaction commits prior to the large transaction, the large transaction is rolled back and the smaller concurrent transaction is committed in order. After the smaller transaction commits, the large transaction does not restart from the beginning automatically, but rather waits until the commit is actually received before it is reapplied. This is probably due to the expense of large rollbacks and the aspect that if the rollback occurs once, it is likely to occur again. This behavior is easily evident by performing the following in a Warm Standby configuration (a sketch of the transaction in step 2 follows this list):

1. Configure the DSI connections for parallel DSI using the default parallel_dsi=on setting.
2. Begin a large transaction at the primary (i.e. a 500 row insert into a table within an explicit transaction). At the end of the transaction, place a waitfor delay 00:03:00 immediately prior to the commit.
3. Use a dirty read at the replicate to confirm the large transaction has started.
4. Perform an atomic insert into another table at the primary (allow it to implicitly commit).
5. Use a dirty read at the replicate to confirm the large transaction is rolled back and does not restart until the delay expires and the transaction commits.
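A minimal T-SQL sketch of step 2, assuming a hypothetical replicated table big_tab (id int, padding varchar(100)):

    -- step 2: a deliberately large transaction that is slow to commit
    declare @i int
    select @i = 1
    begin tran large_test
    while @i <= 500
    begin
        insert into big_tab (id, padding) values (@i, replicate('x', 100))
        select @i = @i + 1
    end
    waitfor delay '00:03:00'   -- hold the commit so the rollback/restart can be observed
    commit tran large_test

A dirty read at the standby, such as select count(*) from big_tab at isolation read uncommitted, can then be used for steps 3 and 5.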

As a result, attempts to tune for and allocate large transaction threads will be negated if smaller/other transactions are allowed to run concurrently and commit prior to the large transaction(s). This behavior, coupled with the early conflict detection and other logic implemented in large transaction threads to avoid excessive rollbacks, is a very good reason to avoid the temptation, especially in Warm Standby, to reduce dsi_large_xact_size in the hope of improving throughput and reducing latency.

Key Concept #28: Large transaction DSI handling is intended to reduce the double latency penalty that waiting for a commit record in the outbound queue introduces in normal replication, as well as the latency and switch active timing issues associated with Warm Standby. However, it is nearly only useful when large transactions run in isolation (such as serial batch jobs).

Having said that, large transactions run concurrently (provided they start in commit order), such as concurrent purge routines, may be able to execute without the rollback/wait-for-commit behavior. However, concurrent large transactions may not deliver the desired behavior, as will be discussed in the next section.

Early Conflict Detection

Another factor of large transactions that the dsi_large_xact_size parameter controls is the timing of early conflict detection. This is stated in the Replication Server Administration manual as: After a certain number of rows (specified by the dsi_large_xact_size parameter), the user thread attempts to select the row for the next thread to commit in order to surface conflicting updates. What this really means is the following. During processing of large transactions, every dsi_large_xact_size rows, the DSI thread attempts to select the sequence number of the thread before it. So, for example, for a large transaction of 1,000 statements (i.e. a bcp of 1,000 rows), the Replication Server would insert an rs_get_thread_seq every 100 rows (assuming dsi_large_xact_size is still the default of 100). By doing this, if there is a situation in which the large transaction is blocking the smaller one, a deadlock is caused, thus surfacing the conflict. This is illustrated in the diagram below, in which thread #2 is being blocked by a conflicting insert by thread #3.

Figure 79 Early Conflict Detection with large transactions (legend: BT # = begin transaction for transaction #; UT # = update on rs_threads for thread id #, blocking its own row; ST # = select on rs_threads for thread id #, checking for the previous thread's commit; CT # = commit transaction for transaction #)


The reason for this is the extreme expense of rollbacks and the size of large transactions. To put this in perspective, try a large transaction in any database within an explicit transaction and roll it back vs. allowing it to commit. Although performance varies from version to version of ASE as well as with the transaction itself, a normal transaction may take a full order of magnitude longer to roll back than it takes to fully execute (i.e. a transaction with an execution time of 6 minutes may require an hour to roll back). By surfacing the offending conflict earlier rather than later, the rollback time of the large transaction is reduced. This is crucial, as no other transaction activity is re-initiated until all the rollbacks have completed. Consequently, without the periodic check for contention by selecting from rs_threads every dsi_large_xact_size rows, a large transaction could pay a significantly larger penalty (i.e. 900 rows for the bcp example). This is illustrated in the diagram below, a slight modification of the one above with the intermediate rs_threads selects grayed out.


Figure 80 Possible Rollback Penalty without Early Conflict Detection (same legend as Figure 79; without the intermediate rs_threads selects, the rollback/block penalty range extends across the entire large transaction)


Now then, getting back to the point raised in the previous section: the temptation to reduce dsi_large_xact_size until most transactions qualify, with the goal of reducing latency. To understand why this is a bad idea, consider the following points:

- Large transactions are never grouped. Consequently, this eliminates the benefits of transaction grouping and increases log I/O and rs_lastcommit contention.
- In order to ensure most transactions qualify, dsi_large_xact_size has to be set fairly low (i.e. 10). The problem with this is that every 10 rows, the large DSI threads would block waiting for the other threads to commit. If the average transaction was 20 statements and 5 large transaction threads were used, the first would have all 20 statements executing while the other 4 would execute up to the 10th and block. The lower the ratio of dsi_large_xact_size to average transaction size, the greater the performance degradation. By contrast, a serialization method of none would let all 5 threads execute up to the 20th statement before blocking.
- The serialization between large transaction threads is essentially none up to the point of the first dsi_large_xact_size rows, since we are not waiting for the commits at all (let alone waiting until they are ready to be sent). If the transactions have considerable contention between them, to the extent that wait_for_commit would have been a better serialization method, the large transactions could experience considerable rollbacks and retries.
- After the first dsi_large_xact_size rows, the rs_threads blocking changes the remainder of the large transaction to more of a wait_for_commit serialization.

The last bullet takes a bit of thinking before it can be understood. Let's say we have a novice Replication System Administrator (named Barney) who has diligently read the manuals and taken the class, but didn't test his system with a full transaction load (nothing abnormal here; in fact, it is a rarity, and a shame, these days that few if any large IT organizations stress test their applications or even have such a capability). However, being a daring individual, Barney decides to capitalize on the large transaction advantage of reading from the SQT Open queue and sets dsi_num_threads to 5, dsi_num_large_xact_threads to 4 and finally sets dsi_large_xact_size to 5 (his average number of SQL statements per set from the application, a web order entry system). Now then, let's assume that due to triggered updates for shipping costs, inventory tracking, customer profile updates, etc., the 5 SQL statements expand to a total of 12 statements per transaction (not at all hard). What Barney assumes he is getting looks similar to the following:


Figure 81 Wishful Concurrent Large Transaction DSI Threads (legend: begin/commit transaction; replicated statement; rs_threads select; rs_threads block on seq)


The expectation: everything is done at T05. What Barney actually gets is more like:

Figure 82 Real Life Concurrent Large Transaction DSI Threads (thread 3 blocked by thread 2, thread 4 blocked by thread 3, thread 5 blocked by thread 4; same legend as Figure 81)


This illustrates how the first dsi_large_xact_size rows are similar to a serialization method of none, while the statements after that transition to more of a wait_for_commit. By the way, consider the impact if the last statement in thread 4 conflicts with one of the first rows in thread 5: a rollback at T12. Now, the unbeliever would be quick to say that dsi_large_xact_size could be increased to exactly the number of rows in the transaction (i.e. 12), at which point we really would have the execution timings of the earlier figure. Possibly, but it would be quite hard, as the number of statements in a transaction is not a constant. More importantly, remember that we have now lost transaction grouping, introduced a high probability of contention/rollbacks, and increased the load on rs_lastcommit and the replicate transaction log, all for very little gain in latency for smaller transactions. While not denying that in some very rare instances of Warm Standby, with a perfectly static transaction size and no contention between threads, there is a probability that this type of implementation might help a small amount, the reality is that it is highly improbable, especially given the concurrent-transaction-induced rollback discussed earlier.

Thread Allocation

A little known and undocumented fact is that dsi_num_large_xact_threads are reserved out of dsi_num_threads exclusively for large transactions. That means only 3 threads are available for processing normal transactions if you set the default connection parameter of parallel_dsi to on without adjusting any of the other parameters (parallel_dsi on sets dsi_num_threads to 5 and dsi_num_large_xact_threads to 2, leaving only 3 threads for normal transactions of <100 rows at the default). This can surprise some administrators who, in checking their replicate dataserver,

discover that only a few of the configured threads are active. Combining this with the previous topic yields another key to understanding Parallel DSIs:

Key Concept #29: For most systems, it is extremely doubtful that more than 2 large transaction threads will improve performance. In addition, since large transaction threads are reserved, increasing the number of large transaction threads may require increasing the total number of threads to avoid impacting (small) normal transaction delivery rates.

Maximizing Performance with Parallel DSIs

By now, you have enough information to understand why the default settings for the parallel_dsi connection parameter are what they are in respect to threading, and why this may not be the most optimal. Consider the following review of points from above:

- In keeping with Replication Server's driving philosophy of maximizing resilience, the default serialization method is wait_for_commit, as this minimizes the risk of inter-thread contention causing significant rollbacks.
- When using the wait_for_commit serialization method, only 3 Parallel DSIs will be effective. Using more than this number will not bring any additional benefit.
- For most large transactions, due to the early conflict detection algorithm, no more than 2 large transaction threads will be effective. After this point, no more benefit will be realized, as the next large transaction could reuse the first thread.

However, this may not be even close to optimal, as the assumption is that there will be significant contention between the Parallel DSIs and that the large transactions are significantly larger than the dsi_large_xact_size setting. If this is not true for your application (typically the case), then the default parallel_dsi settings are inadequate. To determine the optimal settings, you need to understand the profile of the transactions you are replicating, eliminate any replication or system induced contention at the replicate, and develop Parallel DSI profiles of settings corresponding to the transaction profile during each part of the business day.

Parallel DSI Contention

The wait_for_start serialization method provides some of the greatest scalability: the more DSIs involved, the higher the throughput. However, it also means a higher probability of contention causing rollback of a significant number of transactions (remember, if one rolls back, the rest do as well). Remember that the threads are already blocked on each other's rows in rs_threads, deliberately, to ensure commit order is maintained. Any contention between threads, then, is more than likely going to cause a deadlock. Consider the following illustration.

Figure 83 Deadlocking between Parallel DSIs with serialization method of none (legend: BT # = begin tran for thread #, marking the beginning of the transaction group; CT # = commit tran for thread #, marking the end of the transaction group; the figure distinguishes normal thread sequencing blocks on rs_threads from inter-thread contention on transactions within the group)

In the example above, two deadlocks exist. Threads 1 & 2 are deadlocked: thread 2 is waiting on thread 1 to commit as normal (via rs_threads), yet thread 2 started processing its update on table B prior to thread 1 (assuming the same row, hence the contention). As a result, #2 is waiting on #1 and #1 is waiting on #2, a classic deadlock. Threads 3 & 4 are similarly deadlocked. Interestingly enough, one of the tables most frequently blamed for deadlocks in replicated environments is the rs_threads table. As you can see, this is rather deliberate. Consequently, deadlocks involving rs_threads should not be viewed as contention issues with rs_threads, but rather as an indication of contention between the transactions the DSIs were applying. An easy way to find the offenders is to turn on the print deadlock information configuration in the dataserver using sp_configure and simply ignore the pages for the object id/table rs_threads.

The biggest problem with this is that once one thread rolls back (the typical response to a deadlock), all the subsequent threads will roll back as well. In order to prevent the contention from continuing and causing the same problems all over again, the Replication Server will retry the remaining transactions serially (one batch at a time) before resuming parallel operations. Obviously, a rollback followed by serial transaction delivery will cause performance degradation if it happens frequently enough. However, a small number of occurrences is probably not a problem. During a benchmark at a customer site, using the default wait_for_commit resulted in the inbound queue rapidly getting one hour behind the primary bcp transaction. Switching to none drained the queue in 30 minutes as well as keeping up with new records. During these 30 minutes, the Replication Server encountered 3 rollbacks per minute; ordinarily excessive, but in this case the serialization method of none was outperforming the default choice. However, at another customer site, a parallel transaction failed every 3-4 seconds and no performance gain was noted in using none over wait_for_commit. As usual, this illustrates the point that no one-size-fits-all approach to performance tuning works and that each situation brings its own unique problem set.

While the book states that "This method assumes that your application is designed to avoid conflicting updates, or that lock protection is built into your database system," this is not as difficult to achieve as you might think. Basically, if you do not have a lot of contention at the primary, then contention at the replicate may be a direct result of system tuning settings at the replicated DBMS and not of the transactions themselves. If the contention is system induced, you first need to determine the type of contention involved. Consider the following matrix of contention types and possible resolutions.

Contention                 Possible Resolution(s)
Last page contention       Change the clustered index, partition the table, or use datarow locking.
Index contention           Use datapage locking or reduce dsi_max_xacts_in_group.
Row contention             Reduce dsi_max_xacts_in_group until the contention is reduced.
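Two of the remedies above translate into one-line ASE commands; a minimal sketch (tableX is the hypothetical table from the earlier example, and print deadlock information is the configuration referred to above):

    -- write the participants of each deadlock to the ASE errorlog; ignore the
    -- rs_threads entries and note which user tables are involved
    sp_configure 'print deadlock information', 1
    go
    -- relieve last page contention on a specific replicate table (per the matrix above)
    alter table tableX lock datarows
    go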

Note that nowhere in the above did we suggest changing the serialization method to wait_for_commit. If the problem is system induced (as compared to the primary), then yes, wait_for_commit will resolve it; however, the impact on throughput can be severe. In almost any system, a serialization method of none should be the goal. Backing off from that goal too quickly when other options exist could have a large impact on the ability of Replication Server to achieve the desired throughput. Keep in mind that even 2 threads running completely in parallel with a serialization of none may be better than 5 or 6 using wait_for_commit.

Understanding Replicated Transaction Profile

In order to determine whether the contention at the replicate (if there is any) is due to replication or schema induced contention, you need to develop a sense of the transaction profile being executed at the primary during each part of the business day. Consider the following fictitious profile:

Transaction             Type    Time Range     Volume (tpd)   Leading Contention Cause
Execute Trade           OLTP    0830 - 1630    500,000        get next trade id
Place Order             OLTP    0700 - 1900    750,000        get next order id
Adjust Stock Price      OLTP    0830 - 1630    625,000        place order
401K Deposit            Batch   0500 - 0700    125,000*       read mutual fund balance
Money Market Deposit    OLTP    0900 - 1700    1,000          central fund update
Money Market Check      Batch   1800 - 2200    750            central fund withdrawal
Close Market            Batch   1700 - 1930    1              isolation level 3 aggregation
Purge Historical        Batch   2300 - 2359    1              Index maintenance

* Normalized for surge occurring on a regular periodic basis

Note that the first two OLTP transactions have a monotonic key contention issue. When replicating these transactions, the id value will already be known; therefore, this will not cause contention at the replicate. Accordingly, we would be most interested in what the second leading cause of contention is; however, we may not be able to determine that, as the first one may be masking it. Also, in the above list of sample transactions, some of the OLTP transactions not only affect individual rows of data representing one type of business object (such as a customer account), but they also affect either an aggregate (central fund balance) or other business object data. The contention could be on the latter. For example, each individual 401K pay deposit affects the individual investor's account. In addition, it also adjusts that particular fund's pool of receipts, which the fund manager uses for reinvestment. It is the activity against the fund data that could be the source of contention, and not the individual account data.

Resolving Parallel DSI Contention

Figure 84 Parallel DSI Contention Monitoring via MDA Tables


In the above set of tables, if the database is only being used by the maintenance user, monOpenObjectActivity can provide a fairly clear indication of which tables are causing contention among the parallel DSIs by monitoring the monOpenObjectActivity.LockWaits column. If the transaction profile is not well understood, the RowsInserted,

RowsDeleted, and RowsUpdated columns also can provide a sense of what is going on at a more table/index level perspective than the RS DSI monitor counters. A few general tips follow.

DSI Partitioning

Having understood where the contention occurs at the primary, you then have to look at where the contention is at the replicate. It is unfortunate, but in the vast majority of cases in which customers have called Sybase Support with Replication Server performance issues, few have bothered to investigate whether and where contention is the cause. This is especially true in Warm Standby scenarios in which the replicate system is only updated by the Replication Server (and a serialization method of none is being attempted). Additionally, in the few cases where the administrators have been brave enough to attempt the none serialization method, as soon as the first error occurs stating that a parallel transaction failed and had to be retried in serial, the immediate response is to switch back to wait_for_commit vs. eliminating the contention or even determining whether that level of contention is acceptable. In one example of the latter, during a bulk load test in a large database, the queue got 1GB behind after 1 hour using wait_for_commit. After switching to none, the queue was fully caught up in 30 minutes. However, during that period, approximately 3 parallel transactions failed per minute and were retried in serial. The trade-off was considered more than acceptable: 90 errors and an empty queue vs. no errors and a 1GB backlog. Just think, though: if you were able to eliminate the contention that caused even 50% of those failures, the number of additional transactions per minute would be at least equivalent to the number of DSIs. For example, in this case, 10 DSIs were in use. This means an extra 15 transactions (3 * 0.50 * 10) could have been applied per minute, or 450 transactions during that time. And this is an extremely low estimate, as we have not included the time it took to reapply the transactions in serial, during which the system could otherwise still have been applying transactions in parallel.

Which brings us back to the point: how can we eliminate the contention at the replicate? The answer is (of course) that it all depends on what the source of the contention is; is it contention introduced as a result of replication, or contention between replication and other users?

Replication Induced Contention

As discussed earlier, replication itself can induce contention, frequently resulting in the decision to use suboptimal Parallel DSI serialization methods. For normal transactions, a serialization method of none will achieve the highest throughput. The goal is to eliminate any replication induced contention that is preventing use of none and then to assess whether the level of parallel transaction retries is acceptable. As discussed earlier, the main cause of contention directly attributable to replication is transaction grouping. Transaction grouping is a good feature; however, at its default of 20 transactions per group, it can frequently lead to contention at the replicate that didn't exist at the primary. The easiest way to resolve this is to simply reduce the dsi_max_xacts_in_group parameter until most of the contention is resolved. A possible strategy is to simply halve dsi_max_xacts_in_group repeatedly until the replication induced contention is nearly eliminated.
While it is theoretically possible to eliminate all replication-induced contention caused by transaction grouping in this manner, there is a definite tradeoff in eliminating transaction grouping and the associated increase in log and I/O activity and a limited acceptance of some contention. This means you will need to be willing to accept some degree of parallel transactions failing and being retried. If you remember, in an earlier session we mentioned that in one system, Replication got 1GB behind using wait_for_commit. By switching to none, Replication Server not only was able to keep up, it was able to fully drain the 1GB backlog in less than 30 minutes. During that time, however, an average of 3 parallel transactions per minute failed and were retried. This was completely acceptable considering the relative gain in performance. Concurrency Induced Contention In a sense, the transaction grouping is a form of concurrency that is causing contention. In addition to transaction grouping, the mere fact that Parallel DSIs are involved means that the individual Parallel DSIs could experience contention between them as well as with other users on the system. Possible areas of contention include: Replication to aggregate rollups in which many source transactions are all attempting to update the same aggregate row (i.e. total sales) in the destination database. DML applied serially at source that is being applied in parallel at replicate in which contention exists. For example, a (slow) bcp at primary does not have any contention. However, if the bcp specified a batch size (using b), then the Replication Server may send the individual batches using Parallel DSIs. The result is last page contention or index contention at the replicate. Replicated transactions that had contention at the primary. Transactions that have contention at the replicate due to the timing of delivery where at the primary no contention existed due to different timings. The timing difference could be the result of Replication Server component availability (i.e. Replication Agent was down) or due to long running transactions at the replicate delaying the first transaction until the conflicting transaction was also ready to go (i.e. a long running procedure at replicate would delay further transactions).

262

Final v2.0.1
How and if this contention could be eliminated depends on the type of contention. For example, where contention exists at index or page level for data tables, but not on the same rows, changing the replicate system to use datapage or datarow locking may bring relief. Finding Contention using MDA Monitoring Tables In ASE 12.5.0.3 and later, Sybase provides system monitoring tables via the Monitoring and Diagnostics API (MDA). As a result, sometimes these are often referred MDA tables, but technically they are known as the "monitoring tables". These tables are actually proxy tables that interface to the MDA via standard Sybase RPC calls. In order to determine where intra-parallel DSI contention is originating, you mainly need to look at five of these tables: Monitoring Table monLocks monProcess monProcessSQLText monSysStatement monSysSQLText Information Recorded Records the current process lock information Records information about currently executing processes Records the SQL for currently executing processes Records previously executed statement statistics Records previously executed SQL statements

The relationship between these tables is depicted below:


SPID = BlockingSPID

monProcess SPID KPID WaitEventID FamilyID BatchID ContextID LineNumber SecondsConnected BlockingSPID DBID EngineNumber Priority Login Application Command NumChildren SecondsWaiting BlockingXLOID DBName EngineGroupName ExecutionClass MasterTransactionID smallint int smallint smallint int int int int smallint int smallint int varchar(30) varchar(30) varchar(30) int int int varchar(30) varchar(30) varchar(30) varchar(255) <pk> <pk> <fk1>

monSysStatement <fk2> SPID KPID DBID ProcedureID PlanID BatchID ContextID LineNumber StartTime EndTime CpuTime WaitTime MemUsageKB PhysicalReads LogicalReads PagesModified PacketsSent PacketsReceived NetworkPacketSize PlansAltered smallint int int int int int int int datetime datetime int int int int int int int int int int <pk,fk> <pk,fk>

monLocks SPID KPID DBID ParentSPID LockID Context ObjectID LockState LockType LockLevel WaitTime PageNumber RowNumber smallint <pk,fk> int <pk,fk> int smallint int int int varchar(20) varchar(20) varchar(30) int int int

SPID = SPID KPID = KPID

SPID = SPID KPID = KPID

<pk> <pk>

SPID = SPID KPID = KPID

monProcessSQLText SPID KPID BatchID LineNumber SequenceInLine SQLText smallint int int int int varchar(255) <pk,fk> <pk,fk> <pk> <pk> <pk>

SPID = KPID = BatchID = LineNumber =

SPID KPID BatchID LineNumber

monSysSQLText SPID KPID BatchID LineNumber SequenceInBatch SQLText smallint int int int int varchar(255) <pk,fk> <pk,fk> <pk,fk> <pk,fk> <pk>

Figure 85 MDA-based Monitoring Tables Useful for Identifying Contention


The difficult aspect of using monitoring tables is remembering which of the tables contain currently executing information and which contain previously executed statements. This is important since once a SQL statement is done executing, the information about that statement will be only available in the monSys* tables vs. monProcess* -

263

Final v2.0.1
however, the statement may still be holding locks. Consequently, if there is contention, the blocked statement will still be executing and in the monProcess* tables, while the statement(s) that caused the contention may be either in monProcess* (if still executing such as long running procedure or if blocked itself) or in monSys* if the statement has finished executing but the transaction has not yet committed. Another aspect is that some of the monitoring tables concerns the fact that some of the tables are meant to be queried to build historical trend systems by multiple users simultaneously. Classic examples of this are monSysStatement and monSysSQLText. The first time you query these tables, it returns all the rows that the pipe contains. Subsequent queries will only return rows that previously have not been returned to your connection. Consequently, if two different users are querying the monSysStatement table at different times, the proper rows will be returned. Note that as mentioned earlier, you should start with the monOpenObjectActivity table (specifically the LockWaits column). If RS is the only user in the system, that may be all that is necessary. But if not, the next step is to enable statement monitoring. With statement monitoring, rather than looking at monLocks, you would actually track monProcessSQLText and monProcess. The technique is fairly simple - rapidly poll the tables (frequently enough to get an idea of the contention). Then by using the monProcess.BlockingSPID column you can identify both the blocked and blocking users along with their SQL statements at the time via monProcessSQLText. You can also review the past statements from monSysSQLText as well as look for statements with WaitTime in monSysStatements as indicators of where contention might exist. Parallel DSI Configuration vs. Actual In a sense, the num_dsi_threads configuration parameter is a limiter or maximum number of DSI threads that the Replication Server can use for any connection. Executing admin who, of course, will list all of the parallel DSI thread processes within the Replication Server. However, a check of sp_who <maint_user> may as few as two connections or may have show some number of connections but monitoring may show that only a few of them are actually active. Basically, after each batch is sent to a thread, the RS checks to see if thread seq #1 is available again. If so, it simply sends the next batch of SQL back to thread seq #1 instead of the next thread in sequence. This phenomena can be controlled loosely by adjusting the dsi_max_xacts_in_group as well as dsi_xact_group_size. If the transactions are fairly fast (i.e. atomic inserts) and both are set fairly small (i.e. 3 and 2048 respectively), by the time the DSI Scheduler dispatches the second batch to the second DSI, the first will be available again and will be reused. By setting them to higher values such as20 and 65536, it may take more time for the larger transaction to commit and the RS may use the full complement of DSI threads configured for. Developing Parallel DSI Profiles Similar to managing named data caches in Adaptive Server Enterprise, you may have to establish DSI profiles to manage replication performance during different periods of activity. Consider the following table of example settings: dsi_serialization _method dsi_num_large_ xact_threads dsi_max_xacts_ in_group 5 30 -1 5 dsi_large_xact_ size 1000 100 1000 100

Profile normal daily activity post-daily processing bcp data load bcp of large text data

None wait_for_commit None wait_for_commit

10 5 5 3

num_threads 1 2 0 2

Developing a similar profile for your replication environment will enable the Replication Server to avoid potentially inhibitive deadlocks and retries during long transactions such as large bcp and high incidence SQL statements typical of post-daily processing routines. For small and large bcp loads, however, remember to use the B option to breakup potentially queue filling bulk loads of data. Key Concept #30: Maximum performance using Parallel DSIs can only be achieved after replication and concurrency caused contention is eliminated and DSI profiles (based on the transaction profile) are developed to minimize contention between Parallel DSIs.

264

Final v2.0.1
Tuning Parallel DSIs with Monitor Counters Tuning parallel DSIs with monitor counters really boils down to maximizing parallelism while decreasing contention. First, lets start by taking a look at what counters are available that might be of use during tuning parallel DSIs Parallel DSI Monitor Counters While some of the same counters are used for parallel DSI tuning as with regular DSI tuning, the object is to see if the aggregate numbers for these counters is higher than with a single DSI. In addition, there are a few counters that relate specifically to Parallel DSI tuning. Overall the counters to watch are listed here (note that for this section we will only be reporting RS 15.0 counters): Monitor Counter DSITranGroupsSent Description Transaction groups sent to the target by a DSI thread. A transaction group can contain at most dsi_max_xacts_in_group transactions. This counter is incremented each time a 'begin' for a grouped transaction is executed. Transactions contained in transaction groups sent by a DSI thread. Transaction groups applied successfully to a target database by a DSI thread. This includes transactions that were successfully committed or rolled back according to their final disposition. Grouped transactions failed by a DSI thread. Depending on error mapping, some transactions may be written into the exceptions log. Transactions in groups sent by a DSI thread that rolled back successfully. This counter is incremented each time a Parallel Transaction must wait because there are no available parallel DSI threads. This counter is incremented each time a Large Parallel Transaction must wait because there are no available parallel DSI threads. Invocations of rs_dsi_check_thread_lock by a DSI thread. This function checks for locks held by a transaction that may cause a deadlock. Number of rs_dsi_check_thread_lock invocations returning true. The function determined the calling thread holds locks required by other threads. A rollback and retry occurred. Number of times transactions exceeded the maximum allowed executions of rs_dsi_check_thread_lock specified by parameter dsi_commit_check_locks_max. A rollback occurred. Transaction groups closed by a DSI thread due to the next tran causing it to exceed dsi_max_xacts_in_group. Time spent by the DSI/S finding a group to dispatch. Time spent by the DSI/S determining if a transaction is special, and executing it if it is. Time spent by the DSI/S dispatching a regular transaction group to a DSI/E. Time spent by the DSI/S dispatching a large transaction group to a DSI/E. This includes time spent finding a large group to dispatch. Number of DSI/E threads put to sleep by the DSI/S prior to loading SQT cache. These DSI/E threads have just completed their transaction. Time spent by the DSI/S putting free DSI/E threads to sleep. Time spent by the DSI/S loading SQT cache. ''Thread Ready'' messages received by a DSI/S thread from its assocaited DSI/E threads.

DSITransUngroupedSent DSITranGroupsSucceeded

DSITransFailed RollbacksInCmdGroup AllThreadsInUse AllLargeThreadsInUse ExecsCheckThrdLock TrueCheckThrdLock

CommitChecksExceeded

GroupsClosedTrans DSIFindRGrpTime DSIPrcSpclTime DSIDisptchRegTime DSIDisptchLrgTime DSIPutToSleep DSIPutToSleepTime DSILoadCacheTime DSIThrdRdyMsg

265

Final v2.0.1

Monitor Counter DSIThrdCmmtMsgTime DSIThrdSRlbkMsgTime DSIThrdRlbkMsgTime

Description Time spent by the DSI/S handling a ''Thread Commit'' message from its associated DSI/E threads. Time spent by the DSI/S handling a ''Thread Single Rollback'' message from its associated DSI/E threads. Time spent by the DSI/S handling a ''Thread Rollback'' message from its associated DSI/E threads.

Some of these have been discussed before - but in the following sections we will be taking a closer look to see how they work in parallel DSI environments. Maximizing Parallelism Obviously, the first step to maximizing parallelism is to use a dsi_serialization_method of wait_for_start and disabling large transactions. Then it becomes a progression of finding the right balance of number of threads, the group size and the partitioning rules to effectively use the parallel threads. The secret is to start with a reasonable number of threads based on the transaction profile and either increase the number of threads and/or adjust the transaction group size to keep all of them busy. A couple of key points: You can have too many threads - in which case some are not being used If the group size is too small, the number of threads you can effectively use will be reduced.

With respect to the first comment, if you remember, the DSI-S will re-use a free thread before using a new/idle thread if a thread becomes free. One key counter to look at for this is the DSI counter AllThreadsInUse. If the counter DSI.AllThreadsInUse=0, then it is unlikely that adding threads will help. However, if this has any value, then looking closer at the load balancing of transaction groups and commands sent on each of the individual threads will give a good idea if adding threads will help. Lets take a look at a high-end trading stress test: PartnWaits 0 0 0 0 0 0 DML Stmts Sec 9 10 10 AllThreads Busy Trans Fail RollBacks Checks Exceeded 0 0 0 0 0 0 Deletes 0 0 0

Trans In Groups

14:56:11 14:56:48 14:57:29 14:58:10 14:58:57 14:59:38

2 643 1796 1817 3212 2056

2 1241 3447 3508 6391 4066

2 637 1803 1814 3207 2039

0 0 0 0 0 0

0 0 0 0 0 0

0 216 562 596 1907 998

0 0 0 0 0 0

0 0 0 0 0 0

For now, we will only look at the AllThreadsBusy column (highlighted). Before the test began, this was 0 - and this actually may be the case under normal circumstances. The key is to look at the value during peak processing. As we can see, processing started around 14:56 and peaked about 14:58. If we look at the load distribution during this period we would see something similar to the following (dsi_num_threads=7; dsi_max_xacts_in_group=2 - due to contention discussed later): DML Stmts 333 370 382 DSIThread Cmds Per Sec

NgTrans

Xact Per Grp

14:56:48 14:56:48 14:56:48

1 2 3

85 93 104

165 183 189

1.9 1.9 1.8

666 739 763

18 20 21

1 0 0

332 370 382

266

Updates

Cmds Applied

Trans Groups

Sample Time

Inserts

RS Threads 0 7 0 0 0 23

Trans Succeed

Cmd Groups

Sample Time

Checks True

Check Locks

Final v2.0.1

DML Stmts 340 361 338 359 970 966 972 1018 975 1004 994 1002 1025 962 995 994 1002 1021 1221 1220 1238 1209 1245 1219 1221 790 805 833 849 799 856 811

14:56:48 14:56:48 14:56:48 14:56:48 14:57:29 14:57:29 14:57:29 14:57:29 14:57:29 14:57:29 14:57:29 14:58:10 14:58:10 14:58:10 14:58:10 14:58:10 14:58:10 14:58:10 14:58:57 14:58:57 14:58:57 14:58:57 14:58:57 14:58:57 14:58:57 14:59:38 14:59:38 14:59:38 14:59:38 14:59:38 14:59:38 14:59:38

4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7

86 91 88 90 253 251 258 262 255 259 258 258 265 250 258 256 262 268 458 463 458 457 462 458 458 287 288 298 302 289 312 286

168 179 167 178 485 483 486 509 486 502 496 502 512 483 498 500 502 511 911 919 915 911 921 910 910 574 576 584 590 578 604 574

1.9 1.9 1.8 1.9 1.9 1.9 1.8 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 1.9 2 2 1.9 1.9 2 1.9 2

679 721 675 718 1940 1932 1944 2036 1950 2008 1989 2005 2052 1926 1990 1994 2004 2042 3040 3056 3069 3031 3084 3039 3038 1938 1954 1996 2026 1955 2060 1955

18 20 18 19 47 47 47 49 47 48 48 48 50 46 48 48 48 49 66 66 66 65 67 66 66 47 47 48 49 47 50 47

0 0 0 0 0 0 0 0 0 0 0 1 2 2 1 1 1 0 599 614 590 613 593 601 595 354 343 329 327 356 347 331

340 361 338 359 970 966 972 1018 975 1004 994 1001 1023 960 994 993 1001 1021 622 606 648 596 652 618 626 436 462 504 522 443 509 480

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Looking at the workload distribution during the peak period, we see that the number of transaction groups/transactions is extremely balanced. This gives us and indication that adding addition threads during this time frame would increase throughput. It is extremely interesting to note that the transaction profile starts as almost exclusively updates and then

DML Stmts Sec 9 10 9 9 23 23 23 24 23 24 24 24 25 23 24 24 24 24 26 26 26 26 27 26 26 19 19 20 20 19 20 19

DSIThread

Cmds Per Sec

NgTrans

Xact Per Grp

Updates

Cmds Applied

Trans Groups

Sample Time

Deletes

Inserts

267

Final v2.0.1
becomes an even balance of inserts/updates. Lets bump the num_threads to 10 and also increase the dsi_max_xacts_in_group to 3 since the load is so evenly balanced and the AllThreadsBusy is so high. This increase is a bit cautious as we are dealing with updates which have been experiencing contention. AllThreadsBu sy Sample Time Cmd Groups Check Locks Checks True RS Threads Fail 0 0 0 9 26 0 5 3 4 5 4 5 3 3 4 5 10 11 8 11 11 10 13 9

19:30:43 19:31:18 19:32:01 19:32:45 19:33:32 19:34:17

6 306 904 2252 2154 16

7 794 2362 6334 6306 16

6 306 900 2253 2128 16

0 0 0 0 0 0

0 0 0 0 0 0

0 43 79 1248 1918 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

We can see that the bulk of the processing was accomplished in ~1.5 minutes (from 19:32 to 19:33:32) and only ~2 minutes overall. This is a bit better than the first run which took about 3 minutes overall and processing was distributed over all three minutes. Looking at one reason why, we see immediately that at peak we were processing 6,300 original transactions each ~40 second interval whereas in the first run we were only accomplishing mostly 3,000 transactions with a peak of 6,300. Looking at the load distribution gives us a better idea why. DML Stmts Sec DSIThread DMLStmts XactPerGr p Cmds Per Sec

NgTrans

Updates

Cmds Applied

Trans Groups

Sample Time

19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:31:18 19:32:01 19:32:01 19:32:01 19:32:01 19:32:01 19:32:01 19:32:01 19:32:01

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8

39 22 32 32 27 44 26 24 27 33 89 98 76 90 101 86 108 76

97 56 82 88 74 107 65 62 73 90 228 253 186 241 255 233 290 206

2.4 2.5 2.5 2.7 2.7 2.4 2.5 2.5 2.7 2.7 2.5 2.5 2.4 2.6 2.5 2.7 2.6 2.7

388 223 328 352 296 421 260 248 292 360 912 1011 744 962 1010 932 1160 824

11 6 9 10 8 12 7 7 8 10 21 23 17 22 23 21 26 19

0 1 0 0 0 6 0 0 0 0 0 0 0 2 1 0 0 0

194 110 164 176 148 200 130 124 146 180 456 506 372 478 504 466 580 412

Deletes

Inserts

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

194 111 164 176 148 206 130 124 146 180 456 506 372 480 505 466 580 412

268

PartnWaits 0 0 0 0 0 0

RollBacks

DSIYields

TransFail

Checks Exceeded

Trans In Groups

Trans Succeed

Final v2.0.1

19:32:01 19:32:01 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:32:45 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32 19:33:32

9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10

97 83 220 224 219 248 229 222 228 216 226 230 209 206 234 206 207 235 209 209 231 206

253 217 619 630 622 640 647 643 644 620 644 658 627 621 648 618 621 651 627 627 645 621

2.6 2.6 2.8 2.8 2.8 2.5 2.8 2.8 2.8 2.8 2.8 2.8 3 3 2.7 3 3 2.7 3 3 2.7 3

1009 859 2472 2518 2484 2558 2577 2567 2564 2470 2575 2622 2486 2465 2570 2448 2471 2587 2493 2484 2561 2441

23 19 56 57 56 58 58 58 58 56 58 59 54 53 55 53 53 56 54 54 55 53

1 0 2 1 3 0 1 1 1 0 0 1 18 16 27 23 19 18 23 18 21 31

503 429 1232 1257 1237 1277 1286 1278 1280 1234 1287 1309 1216 1208 1244 1188 1207 1267 1212 1213 1250 1175

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

504 429 1234 1258 1240 1277 1287 1279 1281 1234 1287 1310 1234 1224 1271 1211 1226 1285 1235 1231 1271 1206

Again, the load is fairly balanced during peak processing. The question is whether or not this was indeed a better configuration. The easiest way to tell the difference (besides aggregating across sample periods) is that during the peak processing, this run is steadily in the high 20s of DML Statements per Second while the first run was primarily in the mid-20s. However, the transaction mix is a bit different as well as the number of inserts vs. updates in the latter part are significantly different. The problem was that during the last part of the processing (when a few inserts were occurring), the number of parallel DSI failures was nearly 1 every 2 seconds. For this application, it turns out the optimal mix was 9 threads and a group size of 3. However, the same application also executes transactions (mainly inserts) against another database. Since the transactions were nearly exclusively inserts, the DSI profile was 20 threads and a group size of 20. Controlling Contention In the above section we made a mention to the fact that parallel DSI contention was nearly 1 every 2 seconds. The most common way to spot contention is to review the errorlog and look for the familiar message stating that a parallel transaction had failed and is being retried serially. However, if looking back over time just using the monitor counters, you may not have access anymore to historical errorlogs. Additionally, even when it is happening, keeping track of all the error messages to determine the relative frequency can be an inexact science. This is spotted by looking at one of two possible sets of counters - depending on whether Commit Control is used or rs_threads. If Commit Control is used, the answer is fairly obvious - simply look for TrueCheckThrdLock and

DML Stmts Sec 11 9 28 28 28 29 29 29 29 28 29 29 26 26 27 26 26 27 26 26 27 26

DSIThread

DMLStmts

XactPerGr p

Cmds Per Sec

NgTrans

Updates

Cmds Applied

Trans Groups

Sample Time

Deletes

Inserts

269

Final v2.0.1
CommitChecksExceeded - which are recorded as ChecksTrue and ChecksExceeded in the spreadsheet below. However, in this case we were not using Commit Control. In this case, remember a bit from our notion of how parallel DSIs communicate (and with some experimentation) we determine that in RS 15.0, the DSI counter DSIThrdRlbkMsgTime (specifically the counter_obs column) will tell us how often the DSI had to rollback transactions due to parallel DSI contention. Repeating the above last runs spreadsheet: Sample Time Cmd Groups Check Locks Checks True All Threads Busy RS Threads Fail 0 0 0 9 26 0 PartnWaits 0 0 0 0 0 0

RollBacks

DSIYields

TransFail

19:30:43 19:31:18 19:32:01 19:32:45 19:33:32 19:34:17

6 306 904 2252 2154 16

7 794 2362 6334 6306 16

6 306 900 2253 2128 16

0 0 0 0 0 0

0 0 0 0 0 0

0 43 79 1248 1918 0

0 0 0 0 0 0

0 0 0 0 0 0

0 0 0 0 0 0

As we can see, once the inserts start, contention immediately spikes. Possible causes include: The inserts may be firing a trigger which can be causing the contention The inserts may be causing conflicting locks (either due to range/infinity locks or similar if isolation level is 3) The updates may have shifted to another table and may be the cause of the contention - possibly even be updates to the same rows (such as updates to aggregate values).

Only further analysis using the MDA tables could tell us what tables are involved in the contention. Note that the key activity here is to try to reduce the contention - the suggested order to use is: 1. 2. 3. 4. first within the DBMS (i.e. change to datarows locking, optimize trigger code, etc.) if this is not possible, to decrease the grouping then try DSI partitioning finally reduce the number of parallel DSIs (as a last resort)

DSI Partitioning In RS 12.6, one of the new features that was added to help control contention was the concept of DSI partitioning. Currently, the way DSI partitioning works is that the DBA can specify the criteria for partitioning among such elements as time, origin, origin session id (aka spid), user, transaction name, etc. During the grouping process, the DSI scheduler compares each transactions partition key to the next. If they are the same, they are processed serially - if possible, grouped within the same transaction group. If they are different, the DSI scheduler assumes that there is no application conflict between the two and allows them to be submitted in parallel. If the transaction group needs to be closed due to group size and the next transaction has the same partition key value, then that thread is executed as if the dsi_serialization_method was wait_for_commit (and subsequent threads are also held until it starts). Note that the goal of this feature was specifically aimed at the case in which RS introduces contention either by executing transactions on different connections that were originally submitted on the same - or simply didnt have any contention at the primary due to the time. As a result, the recommended starting point is none for dsi_partitioning_rule. However, if contention exists and it cant be eliminated and reducing the group size doesnt help, a good starting point is to set the dsi_partitioning_rule to origin_sessid or the compound rule of origin_sessid, time. Once implemented, you will need to carefully monitor the DSI counters PartitioningWaits and in particular the counters for the respective partitioning rule you are using. For example, if using origin_sessid, the counters OSessIDRuleMatchGroup and OSessIDRuleMatchDist will identify how often a transaction was forced to wait (submitted in serial - OSessIDRuleMatchGroup) vs. how often it proceeded in parallel (OSessIDRuleMatchDist). If the parallelism is too low, it might actually be better to reduce the number of parallel threads and try without DSI partitioning. Remember, however, the goal is to reduce the contention. So if by implementing dsi_partitioning_rule = origin_sessid, you see a drop of AllThreadsBusy from 1000 to 500 and PartitionWaits climbs to 250, but the failed

270

Checks Exceeded 0 0 0 0 0 0

Trans In Groups

Trans Succeed

Final v2.0.1
transactions drops from 1-2 per second to 1 every 10 seconds, this is likely a good thing. The final outcome (as always) is best judged by comparing the aggregate throughput rates for the same transaction mix.

271

Final v2.0.1

Text/Image Replication
Okay, just exactly how is Replication Server able to replicate non-logged text/image updates???
The fact that Replication Server is able to do this surprises most people. However, if you think about it the same way that ASE had to provide the capability to insert 2GB of text into a database with a 100MB log Replication Server had to provide support for it AND also be able to insert this same 2GB of text into the replicate without logging it for the same reason. The unfortunate problem is that text/image replication can severely hamper Replication Server performance degrading throughput by 400% or more in some cases. Unfortunately, other than not replicating text, not a lot can be done to speed this process up. Text/Image Datatype Support To understand why not, you need to understand how ASE manages text. This is simply because the current biggest limiter on replicating text is the primary and replicate ASEs themselves. While we are discussing mainly text/image data, remember, this applies to off row java objects as well as these are simply implemented as image storage. Throughout this section, any reference to text datatypes should be treated as any one of the three Large Object (LOB) types. Text/Image Storage From our earliest DBA days, we are taught that text/image data is stored in a series of page chains separate from the main table. This allows an arbitrary length of text to be stored without regard to the data page limitation of 2K (or ~1960 bytes). Each row that has a text value stores a 16-byte value called the text pointer or textptr that points to the where the page chain physically resides on disk. While this is good knowledge, a bit more knowledge is necessary for understanding text replication. Unlike normal data pages with >1900 bytes of storage, each text page can only store 1800 bytes of text. Consequently a 500K chunk of text will require at least 285 pages in a linked page chain for storage. The reason for this is that each text page contains a 64-byte Text Image Page Statistics Area (TIPSA) and a 152-byte Sybase Text Node (st-node) structures located at the bottom of the page.

Page header (32 bytes) Text/image data (1800 bytes)

Head of st-node (152 bytes) TIPSA (64 bytes)


Figure 86 ASE Text Page Storage Format
Typically, a large text block (such as 500K) will be stored in several runs of sequential pages with the run length depending on concurrent I/O activity to the same segment and available contiguous free space. For example, the 285 pages needed to store 500K of text may be arranged in 30 runs of roughly 10 pages each. Prior to ASE 12.0, updating the end of the text chain or reading the chain starting at a particular byte offset (as is required in a sense), meant beginning at the first page and scanning each page of text until the appropriate byte count was reached. As of ASE 12.0, the st-node structure functions similar to the Unix File Systems I-node structure in that in contains a list of the first page in each run and the cumulative byte length of the run. For simplicity sake, consider the following table for a 64K text chunk spread across 4 runs of sequential pages on disk:

273

Final v2.0.1

Page Run (page #s) 8 (300-307) 16 (410-425) 8 (430-437) 5 (500-504)

st-node page 300 410 430 500

byte offset 14400 43200 57600 65536

This allows ASE to rapidly determine which page needs to be read for the required byte offset without having to scan through the chain. Depending on how fragmented the text chain is (i.e. how many runs are used) and the size of the text chain itself, the st-node may require more than 152 bytes. Rather than use the 152 bytes on each page and force ASE to read a significant portion of the text chain simply to read the st-node, the first 152 bytes are stored on the first page while the remainder is stored in its own page chain (hence the slight increase in storage requirements for ASE 12.0 for text data vs. 11.9 and prior systems). It goes without saying, then, that Adaptive Server Enterprise 12.0+ should be considerably faster at replicating text/image data then preceding versions. Thanks to the st_node index, the Replication Agent read of the text chain will be faster and the DSI delivery of text will be faster as neither one will be forced to repeatedly re-read the first pages in the text chain simply to get to the current byte offset where currently reading/writing text. The first page in the chain pointed to by the 16-byte textptr is called the First Text Page or FTP. It is somewhat unique in that when a text chain is updated, it is never deleted (unless the data row is deleted). This is surprising but true and still true when setting the text value explicitly to null still leaves this page allocated simply empty. The textptr is a combination of the page number for the FTP plus a timestamp. The FTP is important to replication because it is on this page that the TIPSA contains a pointer back to the data row it belongs to. So, while the data row contains a textptr to point to the FTP, the FTP contains the Row ID (RID) back to the row. Should the row move (i.e. get a new RID), the FTP TIPSA must be updated. The performance implications of this at the primary server is fairly obvious (consequently, movements of data rows containing text columns should be minimized). The FTP value and TIPSA pointers can be derived using the following SQL:
-- Get the FTP..pretty simple, since it is the first page in the chain and the text pointer in the row -- points to the first page, all we have to do is to retrive the text pointer select [pkey columns], FTP=convert(int,textptr(text_column)) From table Where [conditions] -- Getting the TIPSA and the row from the TIPSA is just a bit harder as straight-forward functions for -- our use are not included in the SQL dialect. Dbcc traceon(3604) Go Dbcc page(dbid, FTP, 2) Go -- look at last 64 bytes, specifically the 6 bytes beginning at offset 1998. The first 4 bytes are -- the page id (depending on platform, the byte order may be reversed) followed by the last 2 bytes -- which are the rowid on the page. For APL tables, you then can do a dbcc page on that page at use -- the row offset table to determine the offset within the page and read the pkey values.

As you can see, determining the FTP is fairly easy, while the TIPSA resembles more of an nonclustered lookup operation which the dataserver internally can handle extremely well. Standard DML Operations Text and image data can be directly manipulated using standard SQL DML Insert/Update/Delete commands. As we also were taught, however, this mode of manipulation logs the text values as they are inserted or updated and is extremely slow. The curious might wonder how a 500K text chunk is logged in a transaction log with a fixed log row size. The answer is that the log will contain the log record for the insert and subsequent log records with up to 450 bytes of text data the final number of log records dependent on the size of the text and the sessions textsize setting (i.e. set textsize 65536). SQL Support for Text/Image In order to speed up text/image updates and retrievals as well as provide the capability to insert text data larger than permissible by the transaction log, Sybase added two other verbs to the Transact SQL dialect readtext and writetext. Both use the textptr and a byte offset as input parameters to determine where to begin read or writing the text chunk. In addition, the writetext command supports a NOLOG parameter which signals that the text chunk is not to be logged in

274

Final v2.0.1
the transaction log. Large amounts of text simply can be inserted or updated through repetitive calls to writetext specifying the byte offset to be where previous writetext would have terminated. Of special consideration from a replication viewpoint is that the primary key for the row to which the text belongs is never mentioned in the writetext function. The textptr is used to specifically identify which text column value is to be changed instead of the more normal where clause structure with primary key values. Hold this thought until the section on Replication Agent processing below. Programming API Support Anyone familiar with Sybase is also familiar (if only in name) with the Open Client programming interface - which is divided into the simple/legacy DB-Lib (Database Library) API interface and the more advanced CT-Lib (Client Library) interface. Using either, standard SQL queries including DML operations can be submitted to the ASE database engine. Of course, this is one way to actually modify the text or image data but as we have all heard, DML is extremely slow at updating text/image and forces us to log the text as well (which may not be supportable). Consequently, both support API calls to read/write text data to ASE very similar to the readtext/writetext functions described above. For example, in CT-Lib, ct_send() is used to issue SQL statements to the dataserver while ct_get_data() and ct_send_data() are used to read/write text respectively. Similar to writetext, ct_send_data supports a parameter specifying whether the text data is to be logged. Note that while we have discussed these functions as if they followed readtext/writetext implementation, in reality, the API functions basically set the stage for the SQL commands instead of the other way around. In any case, similar to write text, the sequence for inserting a text chunk using the CTLIB interface would look similar to:
ct_send() - send the ct_send() - retrieve ct_send_data() send ct_send_data() send ct_send_data() send ct_send_data() send insert statement with dud data for text (init pointer) the row to get the textptr just initd the first text chunk the next text chunk the next text chunk the last text chunk

The number of calls dependent on how large of a temporary buffer the programmer wishes to use to read the text (probably from a file) into memory and pass to the database engine. A somewhat important note is that the smaller the buffer, the more likely the text chain will be fragmented and require multiple series of runs. Of all the methods currently described, the ct_send_data() API interface is the fastest method to insert or update text in a Sybase ASE database. RS Implementation & Internals Now that we now how text is stored and can be manipulated, we can begin applying this knowledge to understand what the issue is with replicating text. sp_setreptable Processing If not the single most common question, the question Why does sp_setreptable take soooo long when executed against tables containing text or image columns? certainly ranks in the top ten questions asked to TSE. The answer is truthfully to fix an oversight that ASE engineering kinda forgot. If you remember from our previous discussion, the FTP contains the RID for the data row in its TIPSA. The idea is that simply by knowing what text chain you were altering, you would also know what row it belongs to. This is somewhat important. If a user chose to use writetext or ct_send_data(), a lock should be put on the parent row to avoid data concurrency issues. However, ASE engineering chose instead to control locking via locking the FTP itself. In that way (lazily) they were protected in that updates to the data row also would require a lock on the FTP (and would block if someone was performing a writetext) and concurrent writetexts would block as well. Unfortunately for Replication Server Engineering, this meant that ASE never maintained the TIPSA data row RID if the RID was never initialized which frequently was the case especially in databases upgraded from previous releases prior to ASE 12.0. In order to support replication, the TIPSA must be initialized with the RID for each data row. Consequently, sp_setreptable contains an embedded function that scans the table and for each data row that contains a valid textptr, it updates the columns FTP TIPSA with the RID. Since a single data row may contain more than one text or image column, this may require more than one write operation. To prevent phantom reads and other similar issues, this is done within the scope of a single transaction, effectively locking the entire table until this process completes. The code block is easily located in sp_setreptable by the line:
if (setrepstatus(@objid, @setrep_flags) != 1)

Unfortunately, as you can imagine, this is NOT a quick process. On a system with 500,000 rows of data containing text data (i.e. 500,000 valid text pointers), it took 5 hours to execute sp_setreptable (effectively 100,000 textptrs/hour usual caveat of your time may vary is applicable). An often used metric is that the time required is the same as that to build a new index (assuming a fairly wide index key so the number of i/os are similar).

275

Final v2.0.1
Key Concept #31: The reason sp_setreptable takes a long time on tables containing text/image columns, is that it must initialize the First Text Pages TIPSA structure to contain the parent rows RID. There is a semi-supported method around this problem provided that pre-existing text values in a database will never be manipulated via writetext or ct_send_data(). That method is to use the legacy sp_setreplicate procedure which does not support text columns and then call sp_setrepcol as normal to set the appropriate mode (i.e. replicate_if_changed). This executes immediately and supports replication of text data manipulated through standard DML operations (insert/update/delete) as well as new text values created with the writetext and ct_send_data methods and slow bcp operations. Replication Agent Processing Now, the nagging question Why on earth is initializing the FTP TIPSA with the RID so critical?? Some may already have guessed. If a user specifies a non-logged writetext operation and only modifies the text data (i.e. no other columns in row changed), then it would be impossible for the Replication Server to determine which row the text belonged to at the replicate. Remember, replicated databases have their own independent allocation routines, consequently, even in Warm Standby, there is no way to guarantee that because a particular text chain starts at page 23456 at the primary that the identical page will be used at the replicate. This is especially true in non-Warm Standby architectures such as shared primary or corporate rollup scenarios in which the page more than likely will be allocated to different purposes (perhaps an OAM page in one, while a text chain in the other). As a result, the Replication Server MUST be able to determine the primary keys for any text column modified. As you could guess, this lot falls to the task of the Replication Agent. While we have used the term NOLOG previously, as those with experience know, in reality, there is no such thing as an unlogged operation in Sybase. Instead, operations are considered minimally logged which means that while the data itself is not logged, the space allocations for the data are logged (required for recovery). In addition to logging the space allocations for text data, the text functions internal within ASE check to see what the replication status is for the text column any time it is updated. If the text column is to be replicated, ASE inserts a log row in the transaction log containing the normal logging information (transaction id, object id, etc.) as well as the textptr. The Replication Agent reads the log record, extracts the textptr and parses the page number for the text chain. Then it simply reads the FTP TIPSA for the RID (itself a combination of a page number and row id) along with table schema information (column names and datatypes as normal) and reads the parent row from the data page. If the text chain was modified with a writetext, the Replication Agent tells the Replication Server what the primary keys were by first sending a rs_datarow_for_writetext function with all of the columns and their values. Key Concept #32: The Replication Agent uses the FTP TIPSA RID to locate the parent row and then constructs a replicated function rs_datarow_for_writetext to send with the text data to identify the row at the replicate. 
In either case text modified via DML or writetext similar to transaction logging of text data, in order to send data to the Replication Server, the Replication Agent must break up the text into multiple chunks and send via multiple rs_writetext append calls. An example of this from a normal logged insert of data is illustrated in the below LTL block (notice the highlighted sections).
distribute @origin_time='Apr 15 1988 10:23:23.001PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000001, @tran_id=0x000000000000000000000001 begin transaction 'Full LTL Test'distribute @origin_time='Apr 15 1988 10:23:23.002PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000002, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_insert yielding after @intcol=1,@smallintcol=1,@tinyintcol=1,@rsaddresscol=1,@decimalcol=.12, @numericcol=2.1,@identitycol=1,@floatcol=3.2,@realcol=2.3,@charcol='first insert', @varcharcol='first insert',@text_col=hastext always_rep, @moneycol=$1.56,@smallmoneycol=$0.56,@datetimecol='4-15-1988 10:23:23.001PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff,@varbinarycol=0x01112233445566778899,@imagecol=hastext rep_if_changed, @bitcol=1 distribute @origin_time='Apr 15 1988 10:23:23.003PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000003, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first last changed with log textlen=30 @text_col=~.!!?This is the text column value. distribute @origin_time='Apr 15 1988 10:23:23.004PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000004,

276

Final v2.0.1

@tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append first changed with log textlen=119 @imagecol=~/!"!gx"3DUfw@4@O@@y@f9($&8~'ui)*7^Cv18*bhP+|p{`"]?>,D *@4 distribute @origin_time='Apr 15 1988 10:23:23.005PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000005, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append @imagecol=~/!!7Ufw@4"@O@@y@f distribute @origin_time='Apr 15 1988 10:23:23.006PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000006, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_writetext append last @imagecol=~/!!B@O@@y@f9($&8~'ui)*7^Cv18*bh distribute @origin_time='Apr 15 1988 10:23:23.007PM', @origin_qid=0x0000000000000000000000000000000000000000000000000000000000000007, @tran_id=0x000000000000000000000001 applied 'ltltest'.rs_update yielding before @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12, @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='first insert', @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=1 after @intcol=1, @smallintcol=1, @tinyintcol=1, @rsaddresscol=1, @decimalcol=.12, @numericcol=2.1, @identitycol=1, @floatcol=3.2, @realcol=2.3, @charcol='updated first insert', @varcharcol='first insert', @text_col=notrep always_rep, @moneycol=$1.56, @smallmoneycol=$0.56, @datetimecol='Apr 15 1988 10:23:23.002PM', @smalldatetimecol='Apr 15 1988 10:23:23.002PM', @binarycol=0xaabbccddeeff, @varbinarycol=0x01112233445566778899, @imagecol=notrep rep_if_changed, @bitcol=0

A couple of points are illustrated above: The base function (insert/update) contains the replication status and also whether or not the column contains data. In the last example, notrep refers to the fact that the text chain is empty. The text replication is passed through a series of rs_writetext append first, append, append, ., append last functions with each specifying the number of bytes.

As you could guess, even when not logging the text, the Replication Agent can simply read the text chain (after all, it already has started to in order to find the RID on the FTP TIPSA). Key Concept #33: Similar to the logging of text data, text data is passed to the Replication Server by chunking the data and making multiple calls until all the text data has been sent to the Replication Server.

Changes in ASE 15.0.1 Because of customer complaints about the impracticality of marking large pre-existing text columns for replication, ASE implemented a different method in ASE 15.0.1 that did not involve updating the TIPSA. Instead, ASE 15.0.1 provides the option of creating an index on the text pointer value in the base table. As a result, when the Replication Agent is scanning the log and sees a textchain allocation, it can perform an internal query of the table via the text pointer index to find the datarow belonging to the text chain. This can be enabled using the following syntax:
-- Warm Standby and MSA syntax with DDL replication sp_reptostandby <db_name> [,'ALL' | 'NONE' | 'L1'] [, 'use_index'] -- Standard table replication marking sp_setreptable <table_name> [, true | false] [, owner_on | owner_off | null] [, use_index] -- Standard text/image column replication marking sp_setrepcol <tab_name> [, column_name] [, do_not_replicate | replicate_if_change | always_replicate] [, use_index]

As you can see, the only difference between these and pre-ASE 15.0.1 systems is the final parameter of use_index (or null if using the pre-ASE 15.0.1 implementation). This implementation has advantages and disadvantages Advantages o The speed of this index creation obviously depends on the size of the table as well as the settings for number of sort buffers and parallel processing. o On extremely large tables, this still is likely to complete in hours vs. days o Read only queries can still execute as create index only uses a shared table lock

277

Final v2.0.1

Disadvantages o On really large tables, more I/Os will need to be performed traversing the index to find the data row where as in the TIPSA method, the page pointer is located on the first index page. o Additional storage space is required to store the text pointer index o Normal DML operations (such as insert, update, deletes) may incur extra processing to maintain the index (except updates when the text column is not modified and the text pointer index would be considered a safe index).

As a result, if expecting a large number of text operations and you can take the upfront cost of the TIPSA method, you may wish to use this instead of the text pointer index. In addition to these considerations, the text/image marking precedence is column table database. As a result, if the database is marked use_index, but a specific table is marked using the TIPSA method, the table has precedence and will use the TIPSA method. RS & DSI Thread Processing As far as Replication Server, text data is handled no differently than any other, except of course, that the DIST thread needs to associate the multitude of rows with the subscription on the DML function (rs_insert) or as designated by the rs_datarow_for_writetext. You may have wondered previously why the rs_datarow_for_writetext didnt simply contain only the primary key columns vs. the entire row. There actually are two reasons: 1) the DBA may have been lazy and not actually identified the primary key (used a unique index instead); and 2) subscriptions on non-primary key searchable columns would be useless. The latter is probably the most important of the two without all of the columns, if a site subscribed to data from the primary based on a searchable column (i.e. state in pubs2..authors), the site would probably never receive any text data. However, by providing all data, the DIST thread can check for searchable columns within the data row to determine the destination for the text values. The bulk of the special handling for text data within the Replication Server is within the DSI thread. First, the DSI thread treats text as a large transaction. In itself, this is not necessarily odd as often text write operations result in a considerable number of rows in the replication queues. However, the biggest difference is how the DSI handles the text from a replicated function standpoint. Replicated Text Functions As we discussed earlier, when a text row is inserted using regular DML statements at the primary, the primary log will contain the insert and multiple inserttext log records. The replication agent, as we saw from above, translates this into the appropriate rs_insert and rs_writetext commands. At the replicate, we are lacking something fairly crucial the textptr. Consequently, the DSI first sends the rs_insert as normal and then follows it with a call to rs_init_textptr typically an update statement for the text column setting it to a temporary string constant. It then follows this with a call to rs_get_textptr to retrieve the textptr for the text chain allocation just created. Once it receives the textptr, the DSI uses the CT-LIB ct_send_data() function to actually perform the text insert. From a timeline perspective, this looks like the below

distribute rs_insert rs_insert distribute rs_writetext append first rs_writetext distribute rs_writetext append rs_writetext distribute rs_writetext append last rs_writetext rs_insert rs_insert rs_init_textptr rs_init_textptr rs_get_textptr rs_get_textptr (textpointer) textpointer) rs_writetext rs_writetext rs_writetext rs_writetext

Figure 87 Sequence of calls for replicating text modified by normal DML.

278

Final v2.0.1
For text inserted at the primary using writetext or ct_send_data, the sequence is little different. As we discussed before, because the textreq function within the ASE engine is able to determine if the text is to be replicated even when a non-logged text operation is performed, ASE will put a log record in the transaction log. The Replication Agent in reading this record, retrieves the RID from the TIPSA and then creates an rs_datarow_for_writetext function. After that, the normal rs_writetext functions are sent to the Replication Server. The DSI simply does the same thing. It first sends the rs_datarow_for_writetext to the replicate. It then is followed by the rs_init_textptr and rs_get_textptr functions as above. The role of rs_datarow_for_writetext is actually two fold. Earlier, we discussed the fact that it is used to determine the subscription destinations for the text data. For rows inserted with writetext operations, it is also used to provide the column values to the rs_init_textptr and rs_get_textptr function strings so the appropriate row for the text can be identified at the replicate and have the textptr initialized. The sequence of calls for replicating text modified by writetext or ct_send_data is illustrated below:

distribute rs_datarow_for_writetext rs_ datarow_for_writetext distribute rs_writetext append first rs_writetext distribute rs_writetext append rs_writetext distribute rs_writetext append last rs_writetext rs_datarow_for_writetext rs_ datarow_for_writetext rs_init_textptr rs_init_textptr rs_get_textptr rs_get_textptr (textpointer) textpointer) rs_writetext rs_writetext rs_writetext rs_writetext

Figure 88 Sequence of calls for replicating text modified by writetext or ct_send_data().


This brings the list of function strings to 4 for handling replicated text. Thankfully, if using the default function classes (rs_sqlserver_function_class or rs_default_function_class), these are generated for you. However, what if you are using your own function class?? If using your own function class, you will not only need to create these four function strings, but you will also need to understand the following: Text function strings have column scope. In other words, you will have to create a series of function strings for each text/image column in the table. If you have 2 text columns, you will need two definitions for rs_get_textptr, etc. The textstatus modifier available for text/image columns in normal rs_insert, rs_update, rs_delete as well as rs_datarow_for_writetext, rs_init_textptr is crucial to avoid allocating text chains when no text data was present at the primary.

In regards to the first bullet, the text function strings for each text column is identified by the column name after the function name. In the following paragraphs, we will be discussing these functions in a little bit more detail. Text Function Strings Consider the pubs2 database. In that database, the blurbs table contains biographies for several of the authors in a column named copy. If we were to create function strings for this table, they might resemble the below:
create function string blurbs.rs_datarow_for_writetext;copy for sqlserver2_function_class output language

Note the name of the column in the function string name definition. As noted earlier, the rs_datarow_for_writetext is sent when a writetext operation was executed at the primary. In the default function string classes, this function is empty for the replicate the rs_get_textptr function is all that will be necessary. However, in the case of a custom function class, you may want to have this function perform something for example insert auditing or trace information into an auditing database.

Typically the next function sent is the rs_init_textptr, which might look like the below:
create function string blurbs.rs_textptr_init;copy for sqlserver2_function_class output language 'update blurbs set copy = "Temporary text to be replaced" where au_id = ?au_id!new?'

This at first appears a little strange. However, remember, we need a valid text pointer before we start using writetext operations, but we haven't sent any text yet: a catch-22 situation. Consequently, we simply use a normal update command to insert some temporary text into the column, knowing that the real text will begin at an offset of 0 and therefore will write over top of it. Note that the examples in the book set the column to a null value. This can be problematic. Although setting a text column to null is supposed to allocate a text chain, in earlier versions of SQL Server there was no guarantee that it would do so (in fact, ~19 bytes of text seemed to be the guideline for System 10.x). In addition, there is a little known (thankfully) option to sp_chgattribute, dealloc_first_txtpg, which asynchronously deallocates text pages with null values. As a result, text replication may fail, as the text pointer may get deallocated before the RS retrieves it, or may get deallocated between the time RS allocates it and the first text characters are sent to the ASE. Any time you get an invalid textpointer error, or a zero rows error for the textpointer, it is a good idea to check the RS commands being sent (using trace on,DSI,DSI_BUF_DUMP) and validate that the text row actually exists and that the table attribute dealloc_first_txtpg is not set. Consequently, to ensure that the text chain is indeed allocated when needed, rather than initializing the textpointer using an update that sets the text column to null, you may want to use an update that sets it to some arbitrary string, as in the rs_textptr_init example above.
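Tying the troubleshooting advice above to concrete commands, a sketch of what to check at the replicate when the invalid textpointer error appears; the au_id value shown is simply a placeholder for whatever key the failing function string referenced:

select textptr(copy), datalength(copy)
from blurbs
where au_id = '486-29-1786'   -- substitute the key value from the failing command
go
sp_help blurbs   -- newer ASE versions should list attributes set via sp_chgattribute, such as dealloc_first_txtpg
go

After the textptr has been initialized, the next function Replication Server sends is the rs_get_textptr function: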
create function string blurbs.rs_get_textptr;copy for sqlserver2_function_class output language 'select copy from blurbs where au_id = ?au_id!new?'

Those who have worked with SQL text functions may be surprised at the lack of a textptr() function call in the output mask, as in 'select textptr(copy) from ...'. This is deliberate. Those familiar with CT-Lib programming know that when a normal select statement without the textptr() function is used, it is the pointer itself that is bound using the ct_bind() and ct_fetch() calls. The textptr() function exists solely so that those using the SQL writetext and readtext commands can pass them a valid textptr. The CT-Lib API essentially has it built in, as it is only with the subsequent ct_get_data() or ct_send_data() calls that the actual text is manipulated. Since Replication Server uses CT-Lib API calls to manipulate text, the textptr() function is unnecessary. Of special note, it is often the lack of a valid textptr (or more than one row matching the function string's where clause) that will cause a Replication Server DSI thread to suspend. If this should happen, check the queue for the proper text functions as well as the RSSD for a fully defined function string class. The error could be transient, but it could also point to database inconsistencies in which the parent row is actually missing. Finally, the text itself is sent using multiple calls to rs_writetext. The rs_writetext function can perform the text insert in three different ways. The first is the more normal writetext equivalent, as in:
create function string blurbs.rs_writetext;copy for rs_sqlserver2_function_class output writetext use primary log

In this example, RS will use ct_send_data() API calls to send the text to the replicate using the same log specification that was used at the primary. While this is the simplest form of the rs_writetext functions, it is probably the most often used, as it allows straightforward text/image replication between two systems that provide ct_send_data() for text manipulation (and, for the same reason, it is one of the biggest problems when replicating through gateways that do not). An alternative is the RPC mechanism, which can be used to replicate text through an Open Server:
create function string blurbs.rs_writetext;copy for gw_function_class output rpc 'execute update_blurbs_copy @copy_chunk = ?copy!new?, @au_id = ?au_id!new?, @last_chunk = ?rs_last_text_chunk!sys?, @writetext_log = ?rs_writetext_log!sys?'

This also could be used to replicate text from a source database to a target in which the text has been split into multiple varchar chunks. Note that in this case, two system variables are used to flag whether this is the last text chunk and whether it was logged at the primary.
The former could be used if the target is buffering the data to ensure uniform record lengths (i.e. 72 characters) and to handle white space properly. When the last chunk is received, the Open Server could simply close the file, or, if the target is a dataserver, it could update the master record with the number of varchar chunks. Note that Replication Server handles splitting the text into chunks of 255 bytes or less, avoiding datatype issues.
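Purely as an illustration of the buffering idea just described, a sketch of a procedure that such an RPC function string could call at the target. The update_blurbs_copy name matches the example above, but the blurbs_chunks staging table, its columns, and the int datatypes chosen for the two system variables are assumptions, not product definitions:

-- hypothetical staging table for the 255-byte chunks
create table blurbs_chunks (
    au_id      varchar(11)  not null,
    chunk_seq  int          not null,
    chunk      varchar(255) not null
)
go

create procedure update_blurbs_copy
    @copy_chunk    varchar(255),
    @au_id         varchar(11),
    @last_chunk    int,
    @writetext_log int
as
begin
    -- append the chunk with the next sequence number for this key
    insert blurbs_chunks (au_id, chunk_seq, chunk)
    select @au_id, isnull(max(chunk_seq), 0) + 1, @copy_chunk
    from blurbs_chunks
    where au_id = @au_id

    -- on the last chunk, an Open Server might close a file; a dataserver might
    -- update a master row with the chunk count. A simple marker row is used here.
    if @last_chunk = 1
        insert blurbs_chunks (au_id, chunk_seq, chunk)
        values (@au_id, 0, 'complete')
end
go

The @writetext_log flag is simply accepted in this sketch; a real implementation might use it to decide whether the target should log the operation. The final method for rs_writetext is in fact to prevent replication, via output none: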
create function string blurbs.rs_writetext;copy for rs_sqlserver2_function_class output none

This disables text replication no matter what the setting of sp_setrepcol.

Text Function Modifiers

The second aspect of text replication that takes some thought is the role of the text variable modifiers. While other columns support the usual old and new modifiers for function strings, as in ?au_lname!new?, text does not support the notion of a before and after image. The main reason for this is that, while the text rows may be logged, unlike normal updates to tables the before image is not logged when text is updated. Additionally, if the primary application opts not to log the text being updated, the after image isn't available from the log either. While it is true that the text does get replicated, so that in a sense an after image does exist, remember that text is replicated in chunks; consequently a single cohesive after image is not available. Even if it were, the functionality would be extremely limited, as function string support for text datatypes is extremely reduced. However, text columns do support two modifiers: new and text_status. Before you jump and say "wait a minute, didn't you just say...", the answer is "sort of". In the previous paragraph, we were referring to old and new as they apply to the before and after images captured from the transaction log. The new text modifier instead refers to the current chunk of text contents, without regard to whether it holds the old or new values. For example, if the column is left at always_replicate and a primary transaction updates a column in the table other than the text column (with minimal column replication not on), then the text column will be replicated. In this scenario, the new chunks are really the old values, which are still the same. The whole purpose of new in this sense is to provide an interface into the text chunks as they are provided through the successive rs_writetext commands. An example of this can be found near the end of the previous section when discussing the RPC mechanism for replicating text to Open Servers (which could then write it to a file). In that example (repeated below), the new variable modifier was used to designate the text chunk string vs. the column's text status.
create function string blurbs.rs_writetext;copy for gw_function_class output rpc 'execute update_blurbs_copy @copy_chunk = ?copy!new?, @au_id = ?au_id!new?, @last_chunk = ?rs_last_text_chunk!sys?, @writetext_log = ?rs_writetext_log!sys?'

For non-RPC/stored procedure mechanisms, text columns also support the text_status variable modifier, which specifies whether the text column actually contains text or not. The values for text_status are:

Hex      Dec   Meaning
0x0000   0     Text field contains a NULL value, and the text pointer has not been initialized.
0x0002   2     Text pointer is initialized.
0x0004   4     Real text data will follow.
0x0008   8     No text data will follow because the text data is not replicated.
0x0010   16    The text data is not replicated but it contains NULL values.

During normal text replication, these modifiers are not necessary. However, if using custom function strings, these status values allow you to customize behavior at the replicate, for example avoiding initializing a text chain when no text exists at the primary. Consider the following:
create function string blurbs_rd.rs_update
for my_custom_function_class
with overwrite
output language
'if ?copy!text_status? = 2 or ?copy!text_status? = 4
    -- textptr initialized, or real text will follow
    insert into text_change_tracking (xactn_id, key_val)
        values (?rs_origin_xactn_id!sys?, ?au_id!new?)
else if ?copy!text_status? = 16
    insert into text_change_tracking (xactn_id, key_val, text_col)
        values (?rs_origin_xactn_id!sys?, ?au_id!new?,
            "text was deleted or set to null at the primary")
-- text_status < 2: no text was modified; text_status = 8: text is not replicated'

The above function string, or one similar, could be used as part of an auditing system that would only allocate a text chain when necessary and would also signal when the primary text chain may have been eliminated by being set to null.

Performance Implications

As mentioned earlier, the throughput for text replication is much, much lower than for non-text data. In fact, during a customer benchmark in which greater than 2.5GB/hr was sustainable for non-text data, only 600MB/hr was sustainable for text data (roughly 4x worse). The reason for this degradation is somewhat apparent from the above discussions.

Replication Agent Processing

It goes without saying that if the text or image data isn't logged, then the Replication Agent has to read it from disk, more than likely via physical reads. While the primary transaction may have only updated several bytes by specifying a single offset in the writetext function, the Replication Agent needs to read the entire text chain. As it reads the text chain, if the original function was a writetext or ct_send_data, it first has to read the row's RID from the first text page (FTP) TIPSA, read the row from the base table and construct the rs_datarow_for_writetext function as well. Then as it begins to scan the text chain, it begins to forward the text chunks to the Replication Server. While reading the text chain, all other Rep Agent activity in the transaction log is effectively paused. In highly concurrent or high volume environments, this could result in the Replication Agent getting significantly behind. As mentioned earlier, it might be better to simply place tables containing text or image data in a separate database and replicate both.

Replication Server Processing

Within the Replication Server itself, replicating text can have performance implications. First, it will more than likely fill the SQT cache and also be the most likely victim of a cache flush, meaning it will have to be read from disk. Consequently, not only will the stable queue I/O be higher due to the large number of rs_writetext records required, but during transaction sorting it is almost guaranteed that it will have to be re-read from disk. The main impact within the Replication Server, however, is at the DSI thread. Consider the following points:

- Text transactions can't be batched.
- The DSI has to get the textptr before the rest of the text can be processed. This requires more network interaction than most other types of commands.
- Each rs_writetext function is sent via successive calls to ct_send_data(). While this is the fastest way to handle text, it is not fast. Consider the fact that in ASE versions prior to ASE 12.0, the database engine would have to scan the text pages to find the byte offset. Consequently, processing a single rs_writetext is slower than an rs_insert or other similar normal DML function.

Net Impact

Replicating text will always be considerably slower than regular data. If not much text is crucial to the application, then replicating text may not have that profound an impact on the rest of the system. However, if a lot of text is expected, then performance could be severely degraded. At this juncture, application developers really have only three choices:
1. Replicate the text and endure the performance degradation.
2. Use custom function strings to construct a list of changed rows and then, asynchronously to replication, have an extraction engine move the text/image data.
3. Don't replicate text/image at all (see the sketch below).
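For the third choice, a minimal sketch using the pubs2 blurbs example: sp_setrepcol controls the replication status of a text/image column at the primary, and the output none function string shown earlier completes the picture at the replicate.

-- at the primary database: do not replicate the copy text column at all
exec sp_setrepcol blurbs, copy, 'do_not_replicate'
go
-- or, to reduce (but not eliminate) text traffic when the text rarely changes:
exec sp_setrepcol blurbs, copy, 'replicate_if_changed'
go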

Which one is best is determined by the business requirements. For most workflow automation systems, the text is irrelevant and therefore simply can be excluded from replication. However, for high availability architectures involving a Warm Standby, text replication is required.


Asynchronous Request Functions


Just exactly why were Asynchronous Request Functions invented, anyway?
It is an even toss-up as to which replication topic is least understood: text replication, Parallel DSIs, or asynchronous request functions. Even those who understand what request functions do often don't understand the impact they can have on replication performance. In this section, we will take a close look at Asynchronous Request Functions and the performance implications of using them.

Purpose

During normal replication, it is impossible for a replicated data item to be re-replicated back to the sender or sent on to other sites (without the old LTM A mode or the current send_maint_xacts_to_replicate configuration for the Replication Agent). However, in some cases this might be necessary. There are many real-life scenarios in which a business unit needs to submit a request to another system and have the results replicated back. While it is always possible to have the first system simply execute a stored procedure that is empty of code as a crude form of messaging, the problem is that the results are not replicated back to the sender. The reason is simple: the procedure would be executed at the target by the maintenance user, whose transactions are filtered out. It is also possible to configure the Replication Agent not to filter out the maintenance user, but that could lead to the endless-loop replication problem. Since we are discussing it, the obvious solution is asynchronous request functions. Sometimes, however, it might not be the obvious answer, as it can get overlooked. In the next couple of sections, we discuss several scenarios of real-life situations in which asynchronous request functions make sense.

Key Concept #34: Asynchronous Request Functions were intended for a replicate system to be able to asynchronously request that the primary perform some changes and then re-replicate those changes back to the replicate.

Web Internal Requests

Let's assume we are working for a large commercial institution such as a bank or a telephone utility company. As part of our customer service (and to stay competitive), we have created a web site for our customers to view online billing/account statements or whatever. However, to protect our main business systems from the ever-present hackers and to ensure adequate performance for internal processes, we have separated the web-supported database from the database used by internal applications (a very, very good idea that is rarely implemented). In addition, to make this site work for us and to reduce the number of customer service calls handled by operators, we would like the customer to be able to change their basic account information (name, mailing address) as well as perform some basic operations (online bill pay, transfer funds). Sounds pretty normal, right? The problem is: how do you handle the name changes, etc.? In some systems, you can't; you have to provide a direct interface to the main business systems. However, with Replication Server, you simply implement each of the customer's actions as request functions, in which the request for a name change, bill payment, or whatever is forwarded to the main business system, processed, and then the results replicated back. You could easily picture this as being something similar to:

(Diagram: an App Server working against a Web Database; Account Requests replicate from the Web Database to the Business Systems database, and Account Transactions replicate back.)

Figure 89 Typical Web/Internal Systems Architecture


In fact, given the way most commercial bank web sites work, this architecture is extremely viable and reduces the risk to mission-critical systems by isolating the main business systems from the load and security risks of web users.

Corporate Change Request

In many large systems, some form of corporate-controlled data exists which can only be updated at the corporate site. A variation of this is a sort of change nomination process in which the change nomination is made to the headquarters and, due to automated rules, the change is made. One example in which this applies is a budget programming system. As lower levels submit their budget requests, the corporate budget is reduced and the budgeted items are replicated back to subscribing sites.
At the headquarters system, rules such as whether or not the amount exceeds certain dollar thresholds based on the type of procurement, etc., could be in place. This scenario is a bit different than most, as the local database would not strictly be executing a request function. More than likely, a local change would be enacted, i.e. a record saved in the database with a 'proposed' status. Once the replicated record is received back from headquarters, it simply overwrites the existing record. In addition, due to the hierarchical nature of most companies, a request from a field office for a substantial funding item may have to be forwarded through intermediates; in effect, the request function is replicated on to other, more senior organizations due to approval authority rules.

(Diagram: Corporate, Regional and Field tiers. Forwarded Budget Requests, Total Expenditures, Approved Requests and Budgeted Amounts flow between Corporate and Regional; Budget Requests & Expenditures flow between Regional and Field.)
Figure 90 Typical Corporate Change Nomination/Request Architecture
Update Anywhere

Whoa! This isn't supposed to be possible with Sybase Replication Server. For years we have been taught the sanctity of data ownership, and woe to the fool who dared to violate those sacred rules, as they would be forever cursed with inconsistent databases. Not. Consider the fact that you and your spouse are both at work, only you happen to be traveling out of the area. Now, picture a bad phone bill (or something similar) in which you both call to change the address, account names or something, but provide slightly different information (i.e. work phone number). The problem is that by being in two different locations and using the same toll-free number, you were probably routed to different call centers with (gasp) different data centers. The fledgling Sybase DBA's answer is "this can't be done." However, keep in mind that the goal is to have all of the databases consistent; which of the two sets of data is the most accurate portrayal of the customer information is somewhat irrelevant. With that in mind, look at the following architecture.


(Diagram: six sites (San Francisco, Chicago, New York, Los Angeles, Dallas, Washington DC) surrounding a central Arbitrator. Request #1 and Request #2 flow in to the Arbitrator; Response A and Response B replicate back out to all sites.)

Figure 91 Update Anywhere Request Function Architecture


No matter what order request 1 or 2 occur in, the databases will all have the same answer. The reason? We are exploiting the commit sequence assurance of Replication Server; in this case, it is the commit sequence of the request functions at the arbitrator. If request #2 commits first, then it will get response A and request #1 will get response B. Since commit order is guaranteed via Replication Server, every site will have the response (A) from request #2 applied ahead of the response (B) from request #1.

Implementation & Internals

Now that we have established some of the reasons why a business might want to use Asynchronous Request Functions, the next thing to consider is how they are implemented. Frequently, another reason administrators don't implement request functions is a lack of understanding of how to set them up. In this section, we will explore this and how the information gets to the Replication Server.

Replicate Database & Rep Agent

Perhaps before discussing what happens internally, a good idea might be to review the steps necessary to create an asynchronous request function.

Implementing Asynchronous Request Functions

In general, the steps are:
1. If not already established, make sure the source database is established as a primary database for replication (i.e. has a Rep Agent, etc.).
2. Create the procedure to function as the asynchronous request function. This could be an empty procedure, or it could have logic to perform local changes (i.e. set a status column to 'pending').
3. Mark the procedure for replication in the normal fashion (sp_setrepproc).
4. Create a replication definition for the procedure, specifying the primary database as the target (or recipient) desired and not the source database actually containing the procedure.
5. Make sure the login names and passwords are in synch between the servers for users who have permission to execute the procedure locally (including those who can perform DML operations if the proc is embedded in a trigger).

6. Ensure that the common logins have permission to execute the procedure at the recipient database.
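A sketch of steps 3, 5 and 6, using hypothetical names throughout: web_user is a login that executes the request procedure at the originating (replicate) site, upd_addr_req is the request procedure there, and hq_upd_addr is the deliver as procedure in the recipient database hq_db:

-- at the originating database: mark the request procedure for replication (step 3)
exec sp_setrepproc upd_addr_req, 'function'
go

-- at the recipient dataserver: matching login name and password (step 5)
exec sp_addlogin web_user, 'same_password_as_originating_site'
go
use hq_db
go
exec sp_adduser web_user
go

-- execute permission on the deliver as procedure for the common login (step 6)
grant execute on hq_upd_addr to web_user
go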

A bit of explanation might be in order for the last three. Regarding step #4, the typical process of replicating a procedure from a primary to a replicate involves creating a replication definition and subscription as normal similar to:

HQ.funding: the procedure my_proc_name exists here.
NY.funding: the procedure hq_my_proc_name exists here.

At the PRS:

create function replication definition my_proc_name
with primary at HQ.funding
deliver as hq_my_proc_name (param list)
searchable parameters (param list)

At the RRS:

create subscription my_proc_name_sub
for my_proc_name
with replicate at NY.funding

Figure 92 Applied (Normal) Procedure Replication Definition Process


This illustrates a normal replicated procedure from HQ to NY. For request functions, the picture changes slightly to:

HQ.funding: the procedure ny_req_my_proc_name exists here.
NY.funding: the procedure ny_my_proc_name exists here.

At the PRS:

create function replication definition ny_my_proc_name
with primary at HQ.funding
deliver as ny_req_my_proc_name (param list)
searchable parameters (param list)

At the RRS: (no subscription)

Figure 93 Asynchronous Request Function Replication Definition Process


In this illustration, NY sends the request function to HQ and the resulting changes are replicated back. Note that in the above example, the with primary at clause specifies the recipient (HQ in this case) and not the source (NY), and that the replication definition was created at the PRS for the recipient. One way to think of it is that an asynchronous request function replication definition functions as both a replication definition and a subscription. A couple of points that many might not consider in implementing request functions:

- A single replicated database can submit request functions to any number of other replicated databases. Think of a shared primary configuration of 3 or more systems: any one of the systems could send a request function to any of the others.
- While a single site can send request functions to any number of sites, a single request function can only be sent to a single recipient site. This restriction is due to the fact that a single procedure needs to have a unique replication definition, and that definition can only specify a single with primary at clause.
- In order to send a request function to another system, a route must exist between the two replicated systems (a minimal sketch of creating one follows).
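A minimal sketch of that last point, assuming the originating site's Replication Server is NY_RS and the recipient's is HQ_RS (both names, and the RSI login convention shown, are hypothetical); the route is created at NY_RS:

create route to HQ_RS
set username HQ_RS_rsi
set password HQ_RS_rsi_ps
go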

Replication Agent Processing

Essentially, there is nothing unique about Replication Agent processing for request functions. As with any stored procedure execution, when a request function procedure is executed, an implicit transaction is begun. While described in general terms in the LTL table located in the Replication Agent section much earlier, the full LTL syntax for begin transaction is:
distribute begin transaction tran name for username/ - encrypted_password

Consequently, the username and encrypted password are packaged into the LTL for the Replication Server. The reason for this is, as you probably guessed, the fact that the Replication Server executes the request function at the destination as the user who executed it at the primary (more on this in the next section). As a result, Replication Agent processing for request functions is identical to the processing for an applied function.

Replication Server Processing

Since the source database processing is identical to applied functions, it is within the Replication Server that all of the magic for request functions happens. This happens in two specific areas: the inbound processing and the DSI processing.

Inbound Processing

As discussed earlier, within the inbound processing of the Replication Server, not much happens as far as row evaluation until the DIST thread. Normally, this involves matching replicated rows with replication definitions, normalizing the columns and checking for subscriptions. In addition, for stored procedure replication definitions, this process also involves determining whether the procedure is an applied or a request function. Remember: the name of a replication definition for a procedure is the same as the procedure name, and, due to the unique naming constraint for replication definitions, there will only be one replication definition with the same name as the procedure. Consequently, determining whether the procedure is a request function is easily achieved simply by checking to see if the primary database for the replication definition is the same as the current source connection (i.e. the connection to which the SQM belongs). If not, then the procedure is a request function. Following the SQM, the DIST/SRE fails to find a subscription and simply needs to read the with primary at clause to determine the primary database that is intended to receive the request function. The DIST/SRE then writes the request function to the outbound queue, marking it as a request function.

DSI Processing

Within the outbound queue processing of a request function, the only difference is in the DSI processing. When a request function is processed by a DSI, the following occurs:

- The DSI-S stops batching commands and submits all commands up to the request function.
- The DSI-E disconnects from the replicate dataserver and reconnects as the username and password from the request function transaction record.
- The DSI-E executes the request function. If more than one request function has been executed in a row by the same user, all are executed individually.
- The DSI-E disconnects from the replicate and reconnects as either the maintenance user or a different user. The latter is applicable when back-to-back request functions are executed by different users at the primary.

Once the request function(s) have been delivered, the DSI resumes normal processing of transactions as the maintenance user until the next request function is encountered.

Recipient Database Processing

The second difference in request function processing takes place at the replicate database. If you remember from our earlier discussion, the Replication Agent filters log records based on the maintenance user name returned from the LTL get maintenance user command. Since the DSI applies the request function by logging in as the same user as at the primary, any modification performed by the request function execution is eligible for replication back out of the recipient database. If the procedure listed in the deliver as clause of the request function replication definition is itself marked for replication, then the procedure invoked by the request function will be replicated as an applied function. If not, then any individual DML statements on tables marked for replication and/or sub-procedures marked for replication will be replicated as normal. A couple of points for consideration:

- The destination of the modifications replicated out of the recipient is not limited to the site that originally made the request function call. Since at this point normal replication processing is in effect, normal subscription resolution specifies which sites receive the modifications made by the request function.
- The deliver as procedure itself (or a sub-procedure) could be a request function, in which case the request is forwarded up the chain while the original request function serves as notification to the immediate supervisory site that the subordinate is making a request.

Key Concept #35: An Asynchronous Request Function will be executed at the recipient by the same user/password combination as the procedure was executed by at the originating site. Because it is not executed by the maintenance user, changes made by the request function are then eligible for replication.

Performance Implications

By now, you have begun to realize some of the power and possibilities of request functions. However, they do have a downside: they degrade replication performance. Consider the following:

- Replication command batching/transaction grouping is effectively terminated when a request function is encountered (largely due to the reconnection issue).


- Replication Server must first disconnect/reconnect as the request function user, establish the database context, execute the procedure, and then disconnect/reconnect as the maintenance user. Ignoring the procedure execution times, the two disconnect/reconnects could consume a considerable amount of time when a large number of request functions are involved.
- In the typical implementation, the request functions at the originator are often empty, while at the recipient there is a sequence of code. Consequently, at the originator, transactions that follow the request function appear to execute immediately. However, at the recipient, they will be delayed until the request function completes execution.

Normally the latter is not much of an issue, but some customers have attempted to use request functions as a means of implementing "replication on demand", in which a replicate periodically executes a request function that, at the primary, flips a replicate_now bit (or something similar). If the number of rows affected is very large, then this procedure's execution could be significantly longer than expected. In summary, request functions will impede replication performance by interrupting the efficient delivery of transactions. Obviously, the degree to which performance is degraded will depend on the number and frequency of the request functions. This should not deter Replication System Administrators from using request functions, however, as they provide a very neat solution to common business problems.
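For completeness, a sketch of the "replication on demand" pattern described above, with all object names hypothetical; keep in mind the caution above that the primary-side execution can run longer than expected if many rows are flagged:

-- at the replicate: an empty request procedure, marked for replication
create procedure refresh_prices_req as
    return
go
exec sp_setrepproc refresh_prices_req, 'function'
go

-- at the primary: the deliver as procedure flips the bit, causing the flagged rows to replicate
create procedure refresh_prices_now as
    update prices set replicate_now = 1 where replicate_now = 0
go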


Multiple DSIs
Multiple DSI or Parallel DSI: which is which, or are they the same?
The answer to this question takes a bit of history. Prior to version 11.0, Parallel DSIs were not available in Replication Server. However, many customers were already hitting the limit of Replication Server capabilities due to the single DSI thread. Accordingly, several different methods of implementing multiple DSIs to the same connection were developed and implemented so widely that it was even taught in Sybase's Advanced Application Design Using Replication Server (MGT-700) course by late 1995 and early 1996. This does not mean the two methods are similar, as there is one very key difference between them: Parallel DSIs guarantee that the transactions at the replicate will be applied in the same order; Multiple DSIs do not, and in fact exploit this to achieve higher throughput.

WARNING: Because the safeguards ensuring commit order are deliberately bypassed, Multiple DSIs are not fully supported by Sybase Technical Support. If you experience product bugs such as stack traces, dropped LTL, etc., then Sybase Technical Support will be able to assist. However, if you experience data loss or inconsistency, then Sybase Technical Support will not be able to assist in troubleshooting.

Concepts & Terminology

Okay, if you've read this far, then the above warning didn't deter you. Before discussing Multiple DSIs, however, a bit of terminology and concepts need to be discussed to ensure we each understand what is being stated. Throughout the rest of this section, the following definitions are used:

Parallel DSI - The internal implementation present in the Replication Server product that uses more than one DSI thread to apply replicated transactions. Transaction commit order is still guaranteed, regardless of the number of threads or the serialization method chosen.

Multiple DSI - A custom implementation in which multiple physical connections are created to the same database, in effect implementing more than one DSI thread. Transaction commit order is not guaranteed and must be controlled by design.

Serialized Transactions - Transactions that must be applied in the same order to guarantee the same database result and business integrity. For example, a deposit followed by a withdrawal: applying these in the opposite order may not yield the same database result, as the withdrawal will probably be rejected due to a lack of sufficient funds.

Commit Consistent - Transactions that, applied in any order, will always yield the same results. For example, transactions at different Point-Of-Sale (POS) checkout counters, or transactions originating from different field locations viewed from the corporate rollup perspective.

Key Concept #36: If using the Multiple DSI approach, you must ensure that your transactions are commit consistent or employ your own synchronization mechanism to enforce proper serialization when necessary.

Performance Benefits

Needless to say, Multiple DSIs can achieve several orders of magnitude higher throughput than Parallel DSIs. One customer processing credit card transactions reported achieving 10,000,000 transactions per hour. If you think this is unrealistic: in late 1995, a U.S. Government monitored test demonstrated a single Replication Server (version 10.5) replicating 4,000,000 transactions per 24-hour period to three destinations, each transaction a stored procedure with typical embedded selects and averaging 10 write operations (40,000,000 write operations total), against SQL Server 10.0 with only 5 DSIs.
That's a total of 12,000,000 replicated procedures and 120,000,000 write operations processed by a single RS in a single day, against a database engine with known performance problems! So 10,000,000 an hour with RS 11.x could be believable. Such exuberance, however, needs to be tempered with the cold reality that in order to achieve this performance, a number of design changes had to be made to facilitate the parallelism, and extensive application testing had to be done to ensure commit consistency. It cannot be overstated: Multiple DSIs can be a lot of work; you have to do the thinking that Replication Server Engineering has done for you with Parallel DSIs.

In order to best understand the performance benefits of Multiple DSIs over Parallel DSIs, you need to look at each of the bottlenecks that exist in Parallel DSIs and see how Multiple DSIs overcome them. While the details will be discussed in greater depth later, the performance benefits from Multiple DSIs stem from the following:

- No Commit Order Enforcement - By itself, this is the source of the biggest performance boost, as transactions in the outbound queue are not delayed due to long-running transactions (i.e. remember the 4 hour procedure execution example) or simply waiting for their turn to commit.
- Not Limited to a Single Replication Server - Multiple DSIs lend themselves extremely well to involving multiple Replication Servers in the process, achieving an MP configuration currently not available within the product itself.
- Independent of Failures - If a transaction fails with Parallel DSI, activity halts even if the transactions that follow it have no dependence on the transaction that failed (i.e. corporate rollups). As a consequence, Multiple DSIs prevent large backlogs in the outbound queue, reducing recovery time from transaction failures.
- Cross-Domain Replication - Parallel DSIs are limited to replicating to destinations within the same replication domain as the primary. Multiple DSIs have no such restriction and in fact extend easily to support large-scale cross-domain replication architectures (a different topic, outside the scope of this paper).

Implementation

While the Sybase Education course MGT-700 taught at least three methods for implementing Multiple DSIs, including altering the system function strings, the method discussed in this section focuses on using multiple maintenance users. The reason for this is the ease and speed of setup and the least impact on existing function definitions (i.e. you don't end up creating a new function class). Implementing Multiple DSIs is a sequence of steps:
1. Implementing multiple physical connections
2. Ensuring recoverability and preventing loss
3. Defining and implementing parallelism controls

Implementing multiple physical connections

The multiple DSI approach uses independent DSI connections for delivery. Due to the unique index on the rs_databases table in the RSSD, the only way to accomplish this is to fake out the Replication Server and make it think it is actually connecting to multiple databases instead of one. Fortunately, this is easy to do. Since Replication Server doesn't check the name of the server it connects to, all we need to do is alias the real dataserver in the Replication Server's interfaces file. For example, let's assume we have an interfaces file similar to the following (Solaris):
CORP_FINANCES
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

Based on our initial design specifications, we decide we need a total of 6 Multiple DSI connections. Given that the first one counts as one, we simply need to alias it five additional times.
CORP_FINANCES
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000
CORP_FINANCES_A
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000
CORP_FINANCES_B
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000
CORP_FINANCES_C
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000
CORP_FINANCES_D
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000
CORP_FINANCES_E
    master tli /dev/tcp \x000224b782f650950000000000000000
    query tli /dev/tcp \x000224b782f650950000000000000000

Once this is complete, the Multiple DSIs can simply be created by creating normal replication connections to CORP_FINANCES.finance_db, CORP_FINANCES_A.finance_db, CORP_FINANCES_B.finance_db, etc. However, before we do this, there is some additional work we will need to do to ensure recoverability (discussed in the next section). To get a clearer picture of what this accomplishes: as we mentioned, Replication Server now thinks it is replicating to n different replicate databases instead of one. Because of this, it creates separate outbound queues and DSI threads to process each connection. The difference between this and Parallel DSIs is illustrated in the following diagrams.
(Diagram: a Rep Agent feeding the Replication Server; a single inbound queue and a single outbound queue on the stable device, with one DSI and several DSI-Exec threads delivering to the replicate database.)
Figure 94 Normal Parallel DSI with Single Outbound Queue & DSI threads
(Diagram: the same inbound processing, but with three aliased connections (DS2_a.my_db, DS2_b.my_db, DS2_c.my_db), each with its own outbound queue, SQM and DSI/DSI-Exec threads, all delivering to the single DS2 my_db database.)
Figure 95 Multiple DSI with Independent Outbound Queues & DSI threads
In the above drawings, only a single Replication Server was shown. However, with Multiple DSIs each of the connections could be from a different Replication Server. Consider the following diagrams: the first is the more normal multiple Replication Server implementation using routing to a single Replication Server, while the second demonstrates Multiple DSIs, one from each Replication Server.


(Diagram: San Francisco, Chicago, New York, London and Tokyo sites, all routing through a single Replication Server to the replicate.)

Figure 96 Multiple Replication Server Implementation without Multiple DSIs


While the RRS could use Parallel DSIs, as we have already discussed, long transactions or other issues could degrade performance. In addition, only a single RSI thread is available between the two Replication Servers involved in the routing. While this is normally sufficient, if a large number of large transactions or text replication is involved, it may also be a bottleneck. Additionally, this has an inherent fault in that if any one of the transactions from any of the source sites fails, all of the sites stop replicating until the transaction is fixed and the DSI is resumed. In contrast, consider a possible Multiple DSI implementation:

(Diagram: the same five sites, but each site's Replication Server maintains its own DSI connection directly to the replicate.)

Figure 97 Multiple Replication Server Implementation Using Multiple DSIs


In this case, each RS could still use Parallel DSIs to overcome performance issues within each connection and, in addition, since they are independent, a failure of one does not cause the others to backlog. A slight twist on the latter ends up with a picture that demonstrates the ability of Multiple DSIs to provide a multiprocessor (MP) implementation.


(Diagram: a Trading System source whose inbound Replication Server routes to several Replication Servers, each maintaining its own DSI connection to the Investments replicate.)
Figure 98 MP Replication Achieved via Multiple DSIs


Note that the above architecture really only helps the outbound processing performance. All subscription resolution, replication definition normalization, etc. is still performed by the single Replication Server servicing the inbound queue. However, for systems with high queue writes, extensive function string utilization or other requirements demonstrating a bottleneck in the outbound processing, the MP approach may be viable.

Ensuring Recoverability and Preventing Loss

While the multiple independent connections do provide a lot more flexibility and performance, they do present a problem: recoverability. The problem is simply this: with a single rs_lastcommit table and commit order guaranteed, Parallel DSIs are assured of restarting from that point and not incurring any lost or duplicate transactions. However, if using Multiple DSIs, the same is not true. Simply because the last record in the rs_lastcommit table refers to transaction id 101 does not mean that transaction 100 was applied successfully or that 102 has not already been applied. Consider the following picture:
rs_lastcommit: tran oqid 41, ...

DS2_a.my_db queue: tran oqids 31, 35, 39, 43, ...
DS2_b.my_db queue: tran oqids 32, 36, 40, 44, ...
DS2_c.my_db queue: tran oqids 33, 37, 41, 45, ...
DS2_d.my_db queue: tran oqids 34, 38, 42, 46, ...

Plausible scenarios:
1 - c committed after a, b & d (long xactn)
2 - a, b, d suspended first
3 - a, b, d rolled back due to deadlocks
Figure 99 Multiple DSIs with Single rs_lastcommit Table
Consider the three scenarios proposed above. In each of the three, you would have no certainty that tran OQID 42 should be next. As a result, it is critical that each Multiple DSI has its own independent set of rs_lastcommit and rs_threads tables as well as the associated procedures (rs_update_lastcommit, etc.). Unfortunately, a DSI connection does not identify itself; consequently there are only three choices available:
1. Use a separate function class for each DSI. Within the class, call altered definitions of rs_update_lastcommit to provide a distinguishable identity. For example, add a parameter that is hardcoded to the DSI connection (i.e. 'A'), or call a variant of the procedure such as rs_update_lastcommit_A.
2. Exploit the ASE permission chain and use separate maintenance users for each DSI. Then create separate rs_lastcommit, etc. tables owned by each specific maintenance user.
3. Multiple maintenance users with changes to the rs_lastcommit table to accommodate connection information and corresponding logic added to rs_update_lastcommit to set column value based on username.

While the first one is obvious (and obviously a lot of work, as maintaining function strings for individual objects could become a burden), the second takes a bit of explanation. The third one is definitely an option and is perhaps the easiest to implement. The problem is that with high volume replication, the single rs_lastcommit table could easily become a source of contention. In addition to rs_lastcommit, a column would have to be added to rs_threads, as it has no distinguishable value either, along with changes to the procedures which manipulate these tables (rs_update_lastcommit, rs_get_thread_seq, etc.). However, it does have the advantage of being able to handle identity columns and other maintenance user actions requiring dbo permissions: while separate maintenance user logins are in fact used, each is aliased as dbo within the database. The modifications to the rs_lastcommit and rs_threads tables (and their corresponding procedures such as rs_update_lastcommit, rs_get_lastcommit, etc.) would be to add a login name column. Since this is system information available through the suser_name() function, the procedure modifications would simply be adding the suser_name() function to the where clause. For example, the original rs_lastcommit table, rs_get_lastcommit and rs_update_lastcommit are as follows:
/* Drop the table, if it exists. */
if exists (select name from sysobjects
        where name = 'rs_lastcommit' and type = 'U')
begin
    drop table rs_lastcommit
end
go
/*
** Create the table.
** We pad each row to be greater than a half page but less than one page
** to avoid lock contention.
*/
create table rs_lastcommit (
    origin            int,
    origin_qid        binary(36),
    secondary_qid     binary(36),
    origin_time       datetime,
    dest_commit_time  datetime,
    pad1              binary(255),
    pad2              binary(255),
    pad3              binary(255),
    pad4              binary(255),
    pad5              binary(4),
    pad6              binary(4),
    pad7              binary(4),
    pad8              binary(4)
)
go
create unique clustered index rs_lastcommit_idx
    on rs_lastcommit(origin)
go

/* Drop the procedure to update the table. */
if exists (select name from sysobjects
        where name = 'rs_update_lastcommit' and type = 'P')
begin
    drop procedure rs_update_lastcommit
end
go
/* Create the procedure to update the table. */
create procedure rs_update_lastcommit
    @origin         int,
    @origin_qid     binary(36),
    @secondary_qid  binary(36),
    @origin_time    datetime
as
    update rs_lastcommit
        set origin_qid = @origin_qid,
            secondary_qid = @secondary_qid,
            origin_time = @origin_time,
            dest_commit_time = getdate()
        where origin = @origin
    if (@@rowcount = 0)
    begin
        insert rs_lastcommit (origin, origin_qid, secondary_qid,
                origin_time, dest_commit_time,
                pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8)
            values (@origin, @origin_qid, @secondary_qid,
                @origin_time, getdate(),
                0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
    end
go
/* Drop the procedure to get the last commit. */
if exists (select name from sysobjects
        where name = 'rs_get_lastcommit' and type = 'P')
begin
    drop procedure rs_get_lastcommit
end
go
/* Create the procedure to get the last commit for all origins. */
create procedure rs_get_lastcommit
as
    select origin, origin_qid, secondary_qid
    from rs_lastcommit
go

Note that the last procedure, rs_get_lastcommit, normally retrieves all of the rows in the rs_lastcommit table. The reason for this is that the oqid is unique to the source system, but if there are multiple sources, as can occur in a corporate rollup scenario, there may be duplicate OQIDs. Consequently, the oqid and the database origin id (from RSSD..rs_databases) are stored together. During recovery, as each transaction is played back, the oqid and origin are used to determine if the row is a duplicate. If using the multiple login/altered rs_lastcommit approach, then you simply need to add a where clause to each of the above procedures and extend the primary key/index constraints. For rs_lastcommit, this becomes (modifications marked with comments):
/* Drop the table, if it exists. */
if exists (select name from sysobjects
        where name = 'rs_lastcommit' and type = 'U')
begin
    drop table rs_lastcommit
end
go
/*
** Create the table.
** We pad each row to be greater than a half page but less than one page
** to avoid lock contention.
*/
-- modify the table to add the maintenance user column
create table rs_lastcommit (
    maint_user        varchar(30),
    origin            int,
    origin_qid        binary(36),
    secondary_qid     binary(36),
    origin_time       datetime,
    dest_commit_time  datetime,
    pad1              binary(255),
    pad2              binary(255),
    pad3              binary(255),
    pad4              binary(255),
    pad5              binary(4),
    pad6              binary(4),
    pad7              binary(4),
    pad8              binary(4)
)
go
-- modify the unique index to include the maintenance user
create unique clustered index rs_lastcommit_idx
    on rs_lastcommit(maint_user, origin)
go

/* Drop the procedure to update the table. */
if exists (select name from sysobjects
        where name = 'rs_update_lastcommit' and type = 'P')
begin
    drop procedure rs_update_lastcommit
end
go
/* Create the procedure to update the table. */
create procedure rs_update_lastcommit
    @origin         int,
    @origin_qid     binary(36),
    @secondary_qid  binary(36),
    @origin_time    datetime
as

    -- add maint_user qualification to the where clause
    update rs_lastcommit
        set origin_qid = @origin_qid,
            secondary_qid = @secondary_qid,
            origin_time = @origin_time,
            dest_commit_time = getdate()
        where origin = @origin
            and maint_user = suser_name()
    if (@@rowcount = 0)
    begin
        -- add the maintenance user login to the insert statement
        insert rs_lastcommit (maint_user, origin, origin_qid, secondary_qid,
                origin_time, dest_commit_time,
                pad1, pad2, pad3, pad4, pad5, pad6, pad7, pad8)
            values (suser_name(), @origin, @origin_qid, @secondary_qid,
                @origin_time, getdate(),
                0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
    end
go
/* Drop the procedure to get the last commit. */
if exists (select name from sysobjects
        where name = 'rs_get_lastcommit' and type = 'P')
begin
    drop procedure rs_get_lastcommit
end
go
/* Create the procedure to get the last commit for all origins. */
create procedure rs_get_lastcommit
as
    -- add the maint_user to the (previously nonexistent) where clause
    select origin, origin_qid, secondary_qid
    from rs_lastcommit
    where maint_user = suser_name()
go

Similar changes will need to be made to the rs_threads table and its associated procedures as well. It is important to avoid changing the procedure parameters. Fortunately, all retrieval and write operations against the rs_lastcommit table are performed through stored procedure calls (similar to an API of sorts). By not changing the procedure parameters, and because all operations occur through the procedures, we do not need to make any changes to the function strings (reducing maintenance considerably). Why this is necessary at all is discussed in the section describing the Multiple DSI/Multiple User implementation. Note that at the same time, we could alter the table definition to use max_rows_per_page or datarows locking and eliminate the row padding (thereby reducing the amount of data logged in the transaction log for rs_lastcommit updates). However, other than the reduction in transaction log activity, this will gain little in the way of performance. It is a useful technique to remember, though, as ASE 12.5 will support larger page sizes (i.e. 16KB vs. 2KB), which invalidates the normal rs_lastcommit padding. So if implementing RS 12.1 or less on ASE 12.5, you may need to modify these tables anyhow. While the third alternative is useful for handling identity columns and is simple to implement, the second alternative may provide slightly greater performance by eliminating any contention on the rs_lastcommit table. By using separate maintenance users, you can exploit the way ASE does object resolution and permission checking. It is a little known (but still documented) fact that when you execute a SQL statement in which the object's ownership is not qualified, ASE will first look for an object of that name owned by the current user (as defined in sysusers). If one is not found, it then searches for one owned by the database owner, dbo. So if fred is a user in the database, there are two tables, 1) fred.authors and 2) dbo.authors, and fred issues 'select * from pubs2..authors', authors will be resolved to fred.authors. On the other hand, if mary issues 'select * from pubs2..authors', since no mary.authors exists, authors will be resolved to dbo.authors. Consequently, by using separate maintenance users and individually owned rs_lastcommit, etc. tables, we have the following:


MaintUser1 -> MaintUser1.rs_lastcommit
MaintUser2 -> MaintUser2.rs_lastcommit
MaintUser3 -> MaintUser3.rs_lastcommit
MaintUser4 -> MaintUser4.rs_lastcommit
MaintUser5 -> MaintUser5.rs_lastcommit

Figure 100 Multiple Maintenance Users with Individual rs_lastcommits


This then addresses the problems in the scenario we discussed earlier and changes the situation to the following:
DS2_a.rs_lastcommit: tran oqid 39, ...    DS2_b.rs_lastcommit: tran oqid 44, ...
DS2_c.rs_lastcommit: tran oqid 41, ...    DS2_d.rs_lastcommit: tran oqid 34, ...

DS2_a.my_db queue: tran oqids 31, 35, 39, 43, ...
DS2_b.my_db queue: tran oqids 32, 36, 40, 44, ...
DS2_c.my_db queue: tran oqids 33, 37, 41, 45, ...
DS2_d.my_db queue: tran oqids 34, 38, 42, 46, ...

Plausible scenarios:
1 - c committed after a, b & d (long xactn)
2 - a, b, d suspended first
3 - a, b, d rolled back due to deadlocks

Figure 101 Multiple DSIs with Multiple rs_lastcommit tables


Now, no matter what the problem, each of the DSIs recovers to the point where it left off.

Key Concept #37: The Multiple DSI approach uses independent DSI connections set up via aliasing the target dataserver.database. However, this leads to a potential recoverability issue with RS system tables that must be handled to prevent data loss or duplicate transactions.

Detailed Instructions for Creating Connections

Now that we know what we need to do to implement the Multiple DSIs and how to ensure recoverability, the next stage is to determine exactly how to achieve it. Basically, it comes down to a modified rs_init approach or performing the steps manually (as may be required for heterogeneous or OpenServer replication support).
steps manually (as may be required for heterogeneous or OpenServer replication support). Each of the below requires the developer to first create the aliases in the interfaces file.

Manual Multiple DSI Creation

Despite what it sounds like, the manual method is fairly easy, but it does require a bit more knowledge about Replication Server. The steps are:

1. Add the maintenance user logins (sp_addlogin). Create as many as you expect to have Multiple DSIs plus a few extra.

2. Grant the maintenance user logins replication_role. Do not give them sa_role. If you do, when in any database, the maintenance user will map to the dbo user vs. the maintenance user desired, consequently incurring the problem with rs_lastcommit.

3. Add the maintenance users to the replicated database. If identity values are used, one may have to be aliased to dbo. If following the first implementation (modifying rs_lastcommit), all may be aliased to dbo.

4. Grant all permissions on tables/procedures to replication_role. While you could grant permissions to individual maintenance users, by granting permissions to the role, you reduce the work necessary to add additional DSI connections later.

5. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary. Alter the copy to include the first maintenance user as owner of all the objects.

6. Use isql to load the script into the replicate database. Repeat for each maintenance user.

7. Create connections from Replication Server to the replicate database (a concrete sketch follows this list). If the database will also be a primary database and data is being replicated back out, pick one of the maintenance users to be the maintenance user and specify the log transfer option:

    create connection to data_server.database
        set error class [to] rs_sqlserver_error_class
        set function string class [to] rs_sqlserver_function_class
        set username [to] maint_user_name
        [set password [to] maint_user_password]
        [set database_param [to] 'value']
        [set security_param [to] 'value']
        [with {log transfer on, dsi_suspended}]
        [as active for logical_ds.logical_db |
         as standby for logical_ds.logical_db [use dump marker]]

8. If the replicate is also a primary, add the maintenance user to Replication Server (create user) and grant the specified maintenance user connect source permission in the Replication Server. For all other maintenance users, alter the connection and set replication off (if desired). Configure the Replication Agent as desired.
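To make step 7 concrete, the following is a minimal sketch of creating one of the aliased connections. The alias DS2_a, the database my_db, and the login maint_user1 are hypothetical names used only for illustration; DS2_a would be an interfaces-file alias pointing at the same physical server as the other DS2_x entries.

    create connection to DS2_a.my_db
        set error class rs_sqlserver_error_class
        set function string class rs_sqlserver_function_class
        set username maint_user1
        set password maint_user1_pwd
    go

    -- repeat for DS2_b.my_db / maint_user2, DS2_c.my_db / maint_user3, etc.
    -- only the one connection that will also act as a primary needs
    -- "with log transfer on" appended, per step 7.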

Modified rs_init Method

The modified rs_init method is the easiest and ensures that all steps are completed (none are accidentally forgotten). It is very similar to the above in results, but with fewer manual steps.

1. Make a copy of $SYBASE/$SYBASE_RS/scripts/rs_install_primary (save it as rs_install_primary_orig). Alter rs_install_primary to include the first maintenance user as owner of all the objects.

2. Run rs_init for the replicate database. Specify the first maintenance user.

3. Repeat steps 1-2 until all maintenance users are created. If using the modified rs_lastcommit approach, you can simply repeat step 2 until done.

4. If identity values are used, one may have to be aliased to dbo (drop the user and add an alias). (Same as above.) Grant all permissions on tables/procedures to replication_role. While you could grant permissions to individual maintenance users, by granting permissions to the role, you reduce the work necessary to add additional DSI connections later.

5. Use sp_config_rep_agent to specify the desired maintenance user name and password for the Replication Agent (see the sketch after this list). Note that all maintenance users have probably been created as Replication Server users. This is not a problem, but can be cleaned up if desired.

6. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. Rename the original back to rs_install_primary. This will prevent problems for future replication installations not involving multiple DSIs.
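For step 5, the following is a minimal sketch of pointing the Replication Agent at one of the maintenance users. The database name my_db and login maint_user1 are hypothetical, the Rep Agent is assumed to already be enabled for the database, and my reading of step 5 is that it refers to the 'rs username' and 'rs password' parameters - verify the exact parameter names against your ASE version's sp_config_rep_agent documentation.

    use my_db
    go
    -- assumes the Rep Agent has already been enabled for this database;
    -- these set the login the Rep Agent uses to connect to Replication Server
    sp_config_rep_agent my_db, 'rs username', 'maint_user1'
    go
    sp_config_rep_agent my_db, 'rs password', 'maint_user1_pwd'
    go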

Single rs_lastcommit with Multiple Maintenance Users

If for maintenance or other reasons you opt not to have multiple rs_lastcommit tables and instead wish to use a single table, you will have to do the following (note this is a variance to either of the above, so replace the above instructions as appropriate):

1. Make a copy of rs_install_primary. Depending on the manual or rs_init method, edit the appropriate file and make the following changes:

    a. Add a column for the maintenance user suid() or suser_name() to all tables and procedure logic. This includes adding the column to tables such as rs_threads. Procedure logic should select suid() or suser_name() for use as column values.

    b. Adjust all unique indexes to include the suid() or suser_name() column.

2. Load the script according to the applicable manual or rs_init instructions above. (See the sketch below for what the modified rs_lastcommit might look like.)
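The following is a minimal sketch, assuming suser_name() is used, of how the rs_lastcommit table and its unique index might look after step 1. It is illustrative only - the actual column list (including the padding columns shipped in rs_install_primary) should be taken from your release's script.

    -- rs_lastcommit with an added maintenance-user column so each
    -- aliased connection tracks its own last-applied OQID
    create table rs_lastcommit (
        origin            int         not null,
        origin_qid        binary(36)  not null,
        secondary_qid     binary(36)  not null,
        origin_time       datetime    not null,
        dest_commit_time  datetime    not null,
        maint_user        varchar(30) not null   -- populated by the rs_ procedures via suser_name()
        -- padding columns from the shipped script omitted for brevity
    )
    go
    create unique clustered index rs_lastcommit_idx
        on rs_lastcommit (origin, maint_user)
    go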

Single rs_lastcommit with Single Maintenance User

This method employs function string modifications and really is only necessary if the developers want job security from maintaining function strings. The steps are basically:

1. Make a copy of rs_install_primary and save it as rs_install_primary_orig. Modify the original as follows:

    a. Add a column for the DSI to each table as well as a parameter to each procedure. This includes tables such as rs_threads and rs_lastcommit and their associated procedures.

    b. Adjust all unique indexes to include the DSI column.

2. Load the script using rs_init as normal. This will create the first connection.

3. Create a function string class for the first DSI (inherit from default). Modify the system functions for rs_get_thread_seq, rs_update_lastcommit, etc. to specify the DSI. Repeat for each DSI.

4. Alter the first connection to use the first DSI's function string class.

5. Create multiple connections from Replication Server to the replicate database for the remaining DSIs using the create connection command. Specify the appropriate function string class for each.

6. Rename the rs_install_primary script to a name such as rs_install_primary_mdsi. Rename the original back to rs_install_primary. This will prevent problems for future replication installations not involving multiple DSIs.

7. Monitor replication definition changes during the lifecycle. Manually adjust function strings if inheritance does not provide appropriate support.

(A sketch of the modified rs_update_lastcommit procedure from step 1 appears below.)
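The following is a minimal sketch, assuming the DSI column added in step 1 is an integer named dsi_id, of how rs_update_lastcommit might be altered. The parameter and column lists are abbreviated; the real procedure body should come from your release's rs_install_primary script.

    -- rs_update_lastcommit with an extra @dsi_id parameter so each DSI
    -- connection maintains its own row in the shared rs_lastcommit table
    create procedure rs_update_lastcommit
        @origin         int,
        @origin_qid     binary(36),
        @secondary_qid  binary(36),
        @origin_time    datetime,
        @dsi_id         int            -- added: identifies the DSI connection
    as
    begin
        update rs_lastcommit
            set origin_qid = @origin_qid,
                secondary_qid = @secondary_qid,
                origin_time = @origin_time,
                dest_commit_time = getdate()
            where origin = @origin
              and dsi_id = @dsi_id
        if (@@rowcount = 0)
            insert into rs_lastcommit (origin, origin_qid, secondary_qid,
                                       origin_time, dest_commit_time, dsi_id)
                values (@origin, @origin_qid, @secondary_qid,
                        @origin_time, getdate(), @dsi_id)
    end
    go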

Defining and Implementing Parallelism Controls

The biggest challenge with Multiple DSIs is to design and implement the parallelism controls in such a way that database consistency is not compromised. The main mechanism for implementing parallelism is through the use of subscriptions, and in particular the subscription where clause (a sketch of mutually exclusive subscriptions appears after the rules below). Each aliased database connection (Multiple DSI) subscribes to different data, either at the object level or through the where clause. As a result, two transactions executed at the primary might be subscribed to by different connections and therefore have a different order of execution at the replicate than they had at the primary. The following rules MUST be followed to ensure database consistency:

1. Parallel transactions must be commit consistent.

2. Serial transactions must use the same DSI connection.

3. If not 1 & 2, you must implement your own synchronization point to enforce serialization.
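As a minimal sketch of rule 1 in practice, the following shows two mutually exclusive subscriptions against hypothetical aliased connections DS2_a.my_db and DS2_b.my_db for a hypothetical replication definition accounts_rd keyed on acct_num; the names and range boundaries are illustrative only.

    -- each aliased connection subscribes to a disjoint account range,
    -- so the two DSIs never apply work for the same account
    create subscription accounts_sub_a
        for accounts_rd
        with replicate at DS2_a.my_db
        where acct_num >= 10000 and acct_num < 20000
    go

    create subscription accounts_sub_b
        for accounts_rd
        with replicate at DS2_b.my_db
        where acct_num >= 20000 and acct_num < 30000
    go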

Parallel Subscription Mechanism

In many cases, this is not as difficult to achieve as you would think. The key, however, is to make sure that the where clause operations for any one connection are mutually exclusive of every other connection's. This can be done via a variety of mechanisms, but is usually determined by two aspects: 1) the number of source systems involved; and 2) the business transaction model.

Single Primary Source

In some cases, a single primary source database provides the bulk of the transactions to the replicate. As a result, it is the transactions from this source database that must be processed in parallel using the Multiple DSIs. In this situation, each of the Multiple DSIs subscribes to different transactions or different data through one of the following mechanisms:

Data Grouping - In this scenario, different DSIs subscribe to a different subset of tables. This is most useful when a single database is used to process several different types of transactions and the transactions affect a certain small number of tables unique to that data. An example of this might be a consolidated database in which multiple stations in a business flow all access the same database. For example, a hospital's outpatient system may have a separate appointment scheduling/check-in desk, triage treatment, lab tests and results, pharmacy, etc. If each group of tables that supports these functions is subscribed to by a different DSI, they will be applied in parallel at the replicate.

Data Partitioning - In this scenario, different DSIs subscribe to different sets of data from the same tables, typically via a range or a discrete list. An example of the former may be a DSI that subscribes to a range such as A-E or account numbers 10000-20000. An example of a discrete list might be a bank in which one DSI subscribes to checking account transactions, another to credit card transactions, etc.

User/Process Partitioning - In this scenario, different DSIs subscribe to data modified by different users. This is most useful in situations where individual user transactions need to be serialized but are independent of each other. Probably one of the more frequently implemented approaches, this includes situations such as retail POS terminals, banking applications, etc.

Transaction Partitioning - In this scenario, different DSIs subscribe to different transactions. Typically implemented in situations involving a lot of procedure-based replication, this allows long batch processes (i.e. interest calculations) to execute independently of other batch processes without either blocking the other through the rs_threads issue.

The first two and the last are fairly easy to implement and typically do not require modification to existing tables. However, the user/process partition might. If the database design incorporates an audit function to record the last user to modify a record and user logins are enforced, then such a column could readily be used as well. However, in today's architectures, users frequently come through a middleware tier (such as a web or app server) and use a common login. As a result, a column may have to be added to the main transaction tables to hold the process id (spid) or a similar value. In many cases, the spid itself could be hard to develop a range on, as load imbalance and range division may be difficult to achieve. For example, a normal call center may start with only a few users at 7:00am, build to 700 concurrent users by 9:00am and then degrade slowly to a trickle from 4:00pm to 6:00pm. If you tried to divide the range of users evenly by spid, you would end up with some DSIs not doing any work for a considerable period (4 hours) of the workday. On the other hand, the column could store the mod() of the spid (i.e. @@spid%10), remembering that the result of mod(n) could be zero through n-1 (i.e. mod(2) yields 0 & 1 as remainders). Note that as of ASE 11.9, global variables are no longer allowed as input parameter defaults to stored procedures.

Multiple Primary Sources

Multiple primary source system situations are extremely common for distributed businesses needing a corporate rollup model. Each of the regional offices would have its own dedicated DSI thread to apply transactions to the corporate database.
As mentioned earlier, this has one very distinct advantage over normal replication in that an erroneous transaction from one source does not stop replication from all the others by suspending the DSI connection. When multiple primary source systems are present, establishing parallel transactions is fairly easy due to the following:

No code/table modifications - Since each source database has its own dedicated DSI, from a replication standpoint, it resembles a 1:1 straightforward replication.

Guaranteed commit consistency - Transactions from one source system are guaranteed commit consistent with respect to all others. This is true even in cases of two-phased commit distributed transactions affecting several of the sources. Since in each case an independent Rep Agent, inbound queue processing and OQIDs are used for the individual components of a 2PC transaction, it would be impossible for even a single Replication Server to reconstruct the transaction into a single transaction for application at the replicate.

Parallel DSI support - While this doesn't appear to add benefit if the multiple DSIs are from a single source, in the case of multiple sources, it can help with large transactions (due to large transaction threads) and medium volume situations through tuning the serialization method (none vs. wait_for_commit), etc.

Handling Serialized Transactions

In single source systems, it is frequent that a small number of transactions still need to be serialized no matter what parallelism strategy you choose. For example, if a bank opts for using the account number, probably 80-90% of the transactions are fine. However, the remaining 10-20% are transactions such as account transfers that need to be serialized. For example, if a typical customer transfers funds from a savings to a checking account and the transaction is split due to the account numbers, the replicate system may be inconsistent for a period of time. While this may not affect some business rules, if an accurate picture of fund balances is necessary, this could cause a problem similar to the

typical isolation level 3/phantom read problems in normal databases. Consequently, after defining the parallelism strategy, a careful review of business transactions needs to be conducted to determine which ones need to be serialized. Once determined, the handling of serialized transactions is pretty simple - simply call a replicated procedure with the parameters. While this may necessitate an application change to call the procedure vs. sending a SQL statement, the benefits in performance at the primary are well worth it. In addition, because it is a replicated procedure, the individual row modifications are not replicated - consequently, the Multiple DSIs that subscribe to those accounts do not receive the change. Instead, another DSI reserved for serialized transactions (it may be more than one DSI depending on design) subscribes to the procedure replication and delivers the proc to the replicate.

The above is a true serialized transaction example. For the most part, serializing the transactions simply means ensuring that all the related ones are forced to use the same DSI. At that stage, the normal Replication Server commit order guarantee ensures that the transactions are serialized with respect to one another. The most common example is to have transactions executed by the same user serialized, or transactions impacting the same account serialized. Consider, for example, a hospital bill containing billable items for Anesthesia and X-ray. As long as the bill invoice number is part of the subscription and the itemization, then by subscribing by invoice, the bill is guaranteed to arrive at the replicate complete and within a single transaction. However, there may not be a single or easily distinguishable set of attributes that can be easily subscribed to for ensuring transaction serialization within the same transaction. If such is the case, then the rs_id column becomes very useful. During processing, the primary database can simply assign an arbitrary transaction number (up to 2 billion before rollover) and store it in an added column similar to the user/spid mod() column described earlier. By using bitmask subscription, the load could be evenly balanced across the available Multiple DSIs.

Serialization Synchronization Point

There may be times when it is impossible to use a single procedure call to replicate a transaction that requires serialization, and the normal parallel DSI serialization is counter to the transaction's requirements. This normally occurs when a logical unit of work is split into multiple physical transactions - possibly even executed by several different users. A classic case - without even parallel DSI - is when the transaction involves a worktable in one database and then a transaction in another database (pending/approved workflow). As another example, a stored procedure call at the primary may generate a work table in one database using a select/into and then call a sub-procedure to further process and insert the rows. Of course, since both transactions originate from two different databases, are read by two different Rep Agents, and are delivered by two different DSI connections, the normal transactional integrity of the transaction is inescapably lost. Similarly, even when user/process id is used for the parallelism strategy, Multiple DSI connections will wreak havoc on transactional integrity and serialization simply because there is no way to guarantee that the transaction from one connection will always arrive after the other.

The question is: is there a way to ensure transactions are serialized? The answer is yes. However, the technique is a bit reminiscent of rs_threads. If you remember, rs_threads imposes a modified dead man's latch to control commit order. A similar mechanism could be constructed to do the same thing through the use of stored procedures or function string coding. The core logic would be:

Latch Create - Basically some way to ensure that the latch was clear to begin with. Unlike rs_threads where the sequence is predictable, in this case it is not, consequently a new latch should be created for each serialized transaction.

Latch Wait - The second and successive transactions, if occurring ahead of the first transaction, need to sense that the first transaction has not taken place and wait.

Latch Set - As each successive transaction begins execution, the transaction needs to set and lock the latch.

Latch Block - Once the previous transactions have begun, the following transactions need to block on the latch so that as soon as the previous transactions commit, they can begin immediately.

Latch Release - When completed, each successive transaction needs to clear its lock on the latch. The last transaction should destroy the latch by deleting the row.

This is fairly simple for two connections, but what if 3 or more are involved? Even more complicated, what if several had a specific sequence for commit? For example, let's consider the classic order entry system in which the following tables need to be updated in order: order_main, order_items, item_inventory, order_queue. Normally, of course, the best approach would be to simply invoke the parallelism based on the spid of the person entering the order. However, for some obscure reason, this site can't do that and wants to divide the parallelism along table lines. So, we would expect 4 DSIs to be involved - one for each of the tables. The answer is we would need a latch table and procedures similar to the following at the replicate:
-- latch table
create table order_latch_table (
    order_number    int     not null,
    latch_sequence  int     not null,
    constraint order_latch_PK primary key (order_number)
) lock datarows
go

-- procedure to set/initialize order latch
create procedure create_order_latch
    @order_number   int,
    @thread_num     rs_id
as
begin
    insert into order_latch_table values (@order_number, 0)
    return (0)
end
go

-- procedure to wait, block and set latch
create procedure set_order_latch
    @order_number   int,
    @thread_seq     int,
    @thread_num     rs_id
as
begin
    declare @cntrow int
    select @cntrow = 0

    -- make sure we are in a transaction so block holds
    if @@trancount = 0
    begin
        rollback transaction
        raiserror 30000 "Procedure must be called from within a transaction"
        return (1)
    end

    -- wait until time to set latch
    while @cntrow = 0
    begin
        waitfor delay "00:00:02"
        select @cntrow = count(*)
            from order_latch_table
            where order_number = @order_number
              and latch_sequence = @thread_seq - 1
            at isolation read uncommitted
    end

    -- block on latch so follow-on execution begins immediately
    -- once previous commits
    update order_latch_table
        set latch_sequence = @thread_seq
        where order_number = @order_number

    -- the only way we got to here is if the latch update worked --
    -- otherwise, we'd still be blocked on the previous update.
    -- In any case, that means we can exit this procedure and allow
    -- the application to perform the serialized update
    return (0)
end
go

-- procedure to clear order latch
create procedure destroy_order_latch
    @order_number   int,
    @thread_num     rs_id
as
begin
    delete order_latch_table where order_number = @order_number
    return (0)
end
go

It is important to note that the procedure body above is for the replicate database. At the primary, the procedure will more than likely have no code in the procedure body as there is no need to perform serialization at the primary (transaction is already doing that). In addition, it is possible to combine the create and set procedures into a single procedure that would first create the latch if it did not already exist. The way this works is very simple - but does require the knowledge of which threads will be applying the transactions. For example, consider the following pseudo-code example:
Begin transaction
    Insert into tableA
    Update tableB
    Insert into tableC
    Insert into tableC
    Update tableB
Commit transaction

Now, assuming tables A-C will use DSI connections 1-3 and need to be applied in a particular order (i.e. A inserts a new financial transaction, while B updates the balance and C is the history table), the transaction at the primary could be changed to:
Begin transaction
    Exec SRV_create_order_latch @order_num, 1
    Insert into tableA
    Exec SRV_set_order_latch @order_num, 1, 2
    Update tableB
    Exec SRV_set_order_latch @order_num, 2, 3
    Insert into tableC
    Insert into tableC
    Exec SRV_set_order_latch @order_num, 3, 2
    Update tableB
    Exec SRV_destroy_order_latch @order_num, 1
Commit transaction

Note that the SRV prefix on the procedures above is to allow the procedure replication definition to be unique vs. other connections. The deliver as name would not be prefaced with the server extension. Also, note that the first set latch is sent using the second DSI. If you think about it, this makes sense as the first statement doesn't have to wait for any order - it should proceed immediately. In addition, the procedure execution calls above could be placed in triggers, reducing the modifications to application logic - although this would require the trigger to set the latch for the next statement, changing the above to:
Begin transaction
    Insert into tableA
        Exec SRV_create_order_latch @order_num, 1
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 2
    Update tableB
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
    Insert into tableC
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
    Insert into tableC
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 2
    Update tableB
        Select @seq_num = latch_sequence from order_latch_table where order_number = @order_num
        Exec SRV_set_order_latch @order_num, @seq_num, 3
Commit transaction

In the above, the indented calls are initiated by the triggers on the previous operation. Note that the above also uses variables for passing the sequence. This is simply due to the fact that the trigger is generic and can't tell how many operations preceded it. As a result, the local version of the latch procedures would have to have some logic added to track the sequence number for the current order number, and each set latch would simply add one to that number.
-- latch table
create table order_latch_table (
    order_number    int     not null,
    latch_sequence  int     not null,
    constraint order_latch_PK primary key (order_number)
) lock datarows
go

-- procedure to set/initialize order latch
create procedure SRV_create_order_latch
    @order_number   int,
    @thread_num     rs_id
as
begin
    insert into order_latch_table values (@order_number, 1)
    return (0)
end
go

-- procedure to wait, block and set latch
create procedure SRV_set_order_latch
    @order_number   int,
    @thread_seq     int,
    @thread_num     rs_id
as
begin
    update order_latch_table
        set latch_sequence = latch_sequence + 1
        where order_number = @order_number
end
go

-- procedure to clear order latch
create procedure SRV_destroy_order_latch
    @order_number   int,
    @thread_num     rs_id
as
begin
    delete order_latch_table where order_number = @order_number
    return (0)
end
go

However, you should also note that the destroy procedure never gets called - it would be impossible from a trigger to know when the transaction has ended. A modification to the replicate version of the rs_lastcommit procedure could perform the clean up at the end of each batch of transactions.

Design/Implementation Issues

In addition to requiring manual implementation for synchronization points, implementing multiple DSIs has other design challenges.

Multiple DSIs & Contention

Because Multiple DSIs mimic the Parallel DSI serialization method none, they could experience considerable contention between the different connections. However, unlike Parallel DSIs, the retry from deadlocking is not the kinder, gentler approach of applying the offending transactions in serial and printing a warning. Instead, the one that was rolled back is simply retried; in this case the order (i.e. thread 2 vs. thread 1) is not known, so the wrong victim may be rolled back and the transaction attempted again and again until the DSI suspends due to exceeding the retries. For example, in a 1995 case study using 5 Multiple DSI connections for a combined 200 tps rate, 30% of the transactions deadlocked at the replicate. Of course, in those days, the number of transactions per group was not controllable and attempts to use the byte size were rather cumbersome. In the final implementation, transaction grouping was simply disabled and the additional I/O cost of rs_lastcommit endured. As a result, it is even more critical to tune the connections similar to the Parallel DSI/dsi_serialization_method=none techniques discussed earlier. Namely:

- Set dsi_max_xacts_in_group to a low number (3 or 5) (see the sketch below)
- Use datapage or datarow locking on the replicate tables
- Change clustered indexes or partition the table to avoid last page contention
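As a minimal sketch of the first of those settings, assuming a hypothetical aliased connection named DS2_a.my_db, the grouping limit could be lowered as follows; the same commands would be repeated for each aliased connection.

    -- keep transaction groups small to reduce cross-connection deadlocks
    suspend connection to DS2_a.my_db
    go
    alter connection to DS2_a.my_db
        set dsi_max_xacts_in_group to '3'
    go
    resume connection to DS2_a.my_db
    go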

Identity Columns & Multiple DSI

As partially discussed before, this could cause a problem. If the parallelism strategy chosen is one based on the table/table subset strategy, then simply aliasing one of the DSI connections to dbo and ensuring that all transactions for that table use that DSI connection is a simple strategy. Parallel DSIs may also have to be implemented for that DSI connection as well. However, if not - for example with the more classic user/process strategy - the real solution is to simply define the identity at the replicate as a numeric vs. an identity. This should not pose a problem as the identity - with the exception of Warm Standby - does not have any valid context in any distributed system. Think about it: if not a Warm Standby, define the context of identity!! It doesn't have any - and in fact, if identities are used at multiple sites (field sites, for example, at a corporate rollup), the value would have to be combined with the site identifier (source server name from rs_source_ds) to ensure problems with duplicate rows do not happen.

Multiple DSIs & Shared Primary

Again, as we mentioned before, you need to consider the problem associated with Multiple DSIs if the replicate is also a primary database. Since the DSI connections use aliased user names, the normal Replication Agent processing for filtering transactions based on maintenance user name will fail - consequently, data distributed from the Multiple DSIs would normally be re-replicated. However, as mentioned, it is extremely simple to disable this by configuring the connection parameter dsi_replication to off. On the other hand, the re-replication of data modifications may be desirable. For instance, in large implementations, the replicate may be an intermediate in the hierarchical tree. Or, it could be viewed as a slight twist on the asynchronous

request functions described earlier. Only in this case, normal table modifications could function as asynchronous requests. For example, the order entry database could insert a row into a message queue table for shipping. At the shipping database, the replicated insert triggers inserts into the pick queue, and the status is replicated back to the order entry system. And so on.

Business Cases

Despite their early implementation as a mechanism to implement parallelism prior to Parallel DSIs, Multiple DSIs still have applicability in most of today's business environments. By now, you may be getting the very correct idea that Multiple DSIs can contribute much more to your replication architecture than just speed. In this section we will take a look at ways that Multiple DSIs can be exploited to get around normal performance bottlenecks as well as to enable interesting business solutions.

Long Transaction Delay

In several of the previous discussions, we illustrated how a long running transaction - whether it be a replicated procedure or several thousand individual statements within a single transaction - can cause severe delays in applying transactions that immediately followed it at the primary. For example, if a replicated procedure requires 4 hours to run, then during the 4 hours that procedure is executing, the outbound queue will be filling with transactions. As was mentioned in one case, this could lead to an unrecoverable state if the transaction volume is high enough that the remaining time in the day is not enough for the Replication Server to catch up. Multiple DSIs can deftly avoid this problem. While in Parallel DSIs the rs_threads table is used to ensure commit order, no such mechanism exists for Multiple DSIs. Consequently, while one DSI connection is busy executing the long transaction, other transactions can continue to be applied through the other DSI connections. This is particularly useful in handling overnight batch jobs. Normal daily activity could use a single DSI connection (it still could use parallel DSIs on that connection though!), while the nightly purge or store close-out procedure would use a separate DSI connection. Consider the following illustration:

(Figure content: an OLTP System feeding a DataWarehouse over separate DSI connections for Batch Interest Payments, Closing Trade Position, Customer Trades, and Mutual Fund Trades)

Figure 102 - Multiple DSI Solution for Batch Processing


The approach is especially useful for those sites where Replication Server is normally able to maintain the transaction volume even during peak processing - but gets behind rapidly due to close of business processing and overnight batch jobs.

Commit Order Delay

Very similarly, large volumes of transactions that are independent of each other end up delaying one another simply due to commit order. Consider the average Wal-Mart on a Friday night, with 20+ lanes of checkout counters. If the transactions are being replicated, transactions from the express lane would have to wait for the others to execute at the replicate and commit in order, even though the transactions are completely independent and commit consistent. Again, because commit consistency is a prerequisite, Multiple DSIs allow this problem to be overcome by allowing such techniques as dedicating a single DSI connection to each checkout counter. Similarly, in many businesses, there are several different business processes involved in the same database. Again, these could use separate DSI connections to avoid being delayed due to a high volume of activity for another business process. Consider the following:

(Figure content: an Airport system feeding Airline Headquarters over separate DSI connections for Flight Departures, Airfreight Shipments, Passenger Ticketing, and Aircraft Servicing Costs)

Figure 103 - Multiple DSI Solution for Separate Business Processes


A flight departure is an extremely time-sensitive piece of information, yet very low volume compared to passenger check-in and ticketing activities. During peak travel times, a flight departure could have to wait for several hundred passenger-related data records to commit at the replicate prior to being received. During peak processing, a delay of 30

minutes would not be tolerable, as this is the required reporting interval for flight following (tracking) that may be required from a business sense (i.e. delay the next connecting flight due to this one leaving 45 minutes late) - or simply for timely notification back at headquarters that a delayed flight has finally taken off.

Contention Control

Another reason for Multiple DSIs is to allow better control of the parallelism and consequently reduce contention by managing transactions explicitly. For example, in normal Parallel DSI, a typical online daemon process (such as a workflow engine) will log in using a specific user id. At the primary, there would be no contention within its transactions simply due to only a single thread of execution. However, with Parallel DSI enabled, considerable contention may occur at the replicate as transactions are indiscriminately split among the different threads. As a result, in the case of aggregates, etc. at the replicate, considerable contention may result. With multiple queuing engines involved, the contention could be considerable. By using Multiple DSIs, all of the transactions for one user (i.e. a queuing engine) could be directed down the same connection - minimizing the contention between the threads. Another example of this is present in high volume OLTP situations such as investment banking, in which a small number of accounts (investment funds) incur a large number of transactions during trading and compete with small transactions from a large user base investing in those funds. However, it also can happen in retail banking from a different perspective. Granted, any single account probably does not get much activity. And when it does, it is dispersed between different transactions over (generally) several hours. However, given the magnitude of the accounts, if even a small percentage of them experience timing-related contention, it could translate to a large contention issue during replication. 1% of 1,000,000 is 10,000 - which is still a large number of transactions to retry when an alternative exists. In the example below, however, every transaction that affected a particular account would use the same connection and as a result would be serialized vs. concurrent and much less likely to experience contention.

(Figure content: a Branch Bank feeding Headquarters over separate DSI connections partitioned by acct_num mod 0, acct_num mod 1, acct_num mod 2, (etc.), plus a dedicated connection for Cross_Acct Transfer transactions)

Figure 104 - Multiple DSI Approach to Managing Contention


One of the advantages to this approach is that, where warranted, Parallel DSIs can still be used. While this is nothing different than other Multiple DSI situations, in this case it takes on a different aspect, as different connections can use different serialization methods. For example, one connection in which considerable contention might exist would use wait_for_commit serialization, while others use none.

Corporate Rollups

One of the most logical places for a Multiple DSI implementation is the corporate rollup. No clearer picture of commit consistency can be found. The problem is that Parallel DSIs are not well equipped to handle corporate rollups. Consider the following:

If one DSI suspends, they all do. Which means they all begin to back up - not just the one with the problem. As a result, the aggregate of transactions in the backup may well exceed possible delivery rates.

Single Replication Server for delivery. While transactions may be routed from several different sources, it places the full load for function string generation and SQL execution on a single process.

Large transaction issues. Basically, as stated before, a system essentially becomes single threaded with a large transaction due to commit order requirements. Given several sites executing large transactions, the end result is that corporate rollups have extreme difficulty completing large transactions in time for normal daily processing.

Limited parallelism. At a maximum, Parallel DSI only supports 20 threads. While this has proven conclusively to be sufficient for extremely high volume at even half of that, with extremely large implementations (such as nation-wide/global retailers), it still can be too few.

Mixed transaction modes. Follow-the-sun type operations limit the benefits of single_transaction_per_source, as the number of sources concurrently performing POS activity may be fairly low while others are performing batch operations. Consequently, establishing Parallel DSI profiles is next to impossible as the different transaction mixes are constantly shifting.

Multiple DSIs can overcome this by involving multiple Replication Servers, limiting connection issues to only that site and allowing large transaction concurrency (within the limits of contention at the replicate, of course). In fact, extremely large-scale implementations can be developed. Consider the following:

(Figure content: Field Offices each maintaining connections both to a Regional Rollup and directly to the Corporate Rollup)

Figure 105 - Large Corporate Rollup Implementation with Multiple DSIs


In the above example, each source maintains its own independent connection to the corporate rollup as well as to the intermediate (regional) rollup. This also allows a field office to easily disconnect from one reporting chain and connect to another simply by changing the route to the corporate rollup as well as the regional rollup and changing the aliased destination to the new reporting chain (note: while this may not require dropping subscriptions, it still may require some form of initialization or materialization at the new intermediate site). While not occurring on a regular basis (hopefully), this reduces the IT workload significantly when re-organizations occur.

Asynchronous Requests

In addition to parallel performance, another performance benefit of Multiple DSIs could be as a substitute for asynchronous request functions. As stated earlier, request functions have the following characteristics:

- Designed to allow changes to be re-replicated back to the originator or other destinations.
- Can incur significant performance degradation in any quantity due to reconnection and transaction grouping rules.
- Require synchronization of accounts and passwords.

Multiple DSIs natively allow the first point but by-pass the last two quite easily. The replicated request functions could simply be implemented as normal procedure replication, with the subscription held by an independent connection to the same database. In this way, transaction grouping for the primary connection is not impeded, and the individual maintenance user eliminates the administrative headache of keeping the accounts synchronized.

Cross Domain Replication

Although a topic better addressed by itself, perhaps one of the more useful applications of Multiple DSIs is as a mechanism to support cross-domain replication. Normally, once a replication system is installed and the replication domain established, merging it with other domains is a difficult task of re-implementing replication for one of the domains. This may be extremely impractical as it disables replication for one of the domains during the process - and is a considerable headache for system developers as well as those on the business end of corporate mergers who need to consider such costs as part of the overall merger costs. The key is that a database could participate in multiple domains simply by being aliased in the other domain, the same way as in the Multiple DSI approach - because in a sense it is simply a twist on Multiple DSIs - and each domain would have a separate connection. Consider the following:

(Figure content: two replication domains - one containing DS1.db1 and DS2.db2, the other containing DS3.db1 and DS4.db2 - with DS1.db1 and DS3.db1 also appearing in the opposite domain under the aliases DS1a.db1 and DS3a.db1)

Figure 106 - Multiple DSI Approach to Cross-Domain Replication


Once the concept of Multiple DSIs is understood, cross-domain replication becomes extremely easy. However, it is not without additional issues that need to be understood and handled appropriately. As this topic is much better addressed on its own, not a lot of detail will be provided; however, consider the following:

Transaction Transformation - Typically the two domains will be involved in different business processes, for example Sales and HR. If integrating the two, the integration may involve considerable function string or stored procedure coding to accommodate the fact that a $5,000 sale in one translates to a $500 commission to a particular employee in the other.

Number of Access Points - If the domains intersect at multiple points, replication of aggregates could cause data inconsistencies as the same change may be replicated twice. This is especially true in hierarchical implementations.

Messaging Support - Replicating between domains may require adding additional tables simply to form the intersection between the two. For example, if Sales and Shipping were in two different domains, replicating the order directly - particularly with the amount of data transformation that may need to take place - may be impractical. Instead, queue or message tables may have to be implemented in which the "new order received" message is enqueued in a more desirable format for replication to the other domain.

While some of this may be new to those who've never had to deal with it, any form of workflow automation in particular involves some new data distribution concepts foreign to - and in direct conflict with - academic teachings. Since cross-domain replication is a very plausible means of beginning to implement workflow, some of these need to be understood. However, it is crucial to establish that cross-domain replication should not be used as a substitute for a real message/event broker system where the need for one is clearly established. Whether in a messaging system or accomplished otherwise (replication), workflow has the following characteristics:

Transaction Division - While an order may be viewed as a single logical unit of work by the Sales organization, due to backorders or product origination, the Shipping department may have several different transactions on record for the same order.

Data Metamorphism - To the Sales system, it was a blue shirt for $39.95 to Mr. Ima Customer. To Shipping, it is a package 2x8x16 weighing 21 ounces going to 111 Main Street, Anytown, USA.

Transaction Consolidation - To Sales, it is an order for Mrs. Smith containing 10 items. To credit authorization, it is a single debit for $120.00 charged to a specific credit card account.

And so forth. Those familiar with Replication Server's function string capabilities know that a lot of different requirements can be met with them. However, as the above points illustrate, cross domain replication may involve an order of magnitude more difficult data transformation rules - spanning multiple records - not supportable by function strings alone. While message tables could be constructed to handle simpler cases, this increases I/O in both systems and may require modifications to existing application procedure logic, etc. Hence the advent and forte of Sybase Real Time Data Services and Unwired Orchestrator.


Integration with EAI


One if by Land, Two if by Sea..
Often, system developers confuse replication and messaging - assuming they are mutually exclusive or that messaging is some higher form of replication that has replaced it. Both assumptions are equally wrong. For good reason: remove the guaranteed commit order processing and provide transaction-level transformations/subscriptions, and Sybase's RS becomes a messaging system. In fact, Sybase's RS is a natural extension to messaging architectures, to the extent that any corporation with an EAI strategy that already owns RS should take a long and serious look at how to integrate RS into their messaging infrastructure (i.e. build an adapter for it). Several years ago, Sybase produced the Sybase Enterprise Event Broker, which did just that - used Replication Server as a means to integrate older applications with messaging systems. Today, SEEB has been replaced with RepConnector (a component in Real Time Data Services), which consequently is the 2nd generation product for replication/messaging integration. The assumption for this section is that the reader is familiar with basic EAI implementations and architectures.

Replication vs. Messaging

Messaging is billed as application-to-application integration, while replication is often viewed as database-to-database integration. The confusion then usually arises as different people will proselytize one solution over another, completely ignorant of the fact that each is an entirely different solution targeted at different needs. In order to straighten this out, let's take a closer look at the characteristics of each solution.

Focus
    Replication Server: Enterprise/Corporate data sharing at the data element level
    EAI Messaging: Enterprise/Internet B2B integration at the message/logical unit of work level

Unit of Delivery
    Replication Server: Transaction composed of individual row modifications
    EAI Messaging: Complete message - essentially an intact logical transaction

Serialization
    Replication Server: Guaranteed - serialization to ensure database consistency
    EAI Messaging: Optional - usually not serialized; the desire is to ensure workflow

Subscription Granularity
    Replication Server: Row/column value
    EAI Messaging: Message type, addressees, content, etc.

Event Triggers
    Replication Server: DML operation/proc execution
    EAI Messaging: Time expiration, message transmission

Schema Transparency
    Replication Server: Row level with limited denormalization - similar data structures
    EAI Messaging: Complete transparency (requires integration server)

Speed/Throughput
    Replication Server: High volume / low-NRT latency
    EAI Messaging: Medium throughput / hours-to-minutes latency

Implementation Complexity
    Replication Server: Low to medium, with singular corporate administration & support
    EAI Messaging: Medium to complex, with coordinated specifications and disjoint administration & support

Application Transparency
    Replication Server: Transparent with isolated issues; primary transaction unaltered (direct to database)
    EAI Messaging: Requires rewrite to form messages; primary transaction is asynchronous and may be extensively delayed

Interfaces
    Replication Server: LTL, SQL, RPC
    EAI Messaging: EDI, XML, proprietary

While the above would seem to suggest that EAI represents a better data distribution mechanism, the real answer is that it depends on your requirements. If you want a simpler implementation with NRT latency and high volume replication to an internal system, Replication Server is probably the better solution. However, if flexibility is key, or if the target

system is not strictly under internal control (i.e. a packaged application or a partner system), EAI is the only choice. In general, EAI extends basic messaging with business level functionality. The following illustrates how EAI extends basic messaging to include business level drivers.

Guaranteed Delivery
    Replication Server: Guaranteed delivery
    EAI Messaging: Guaranteed delivery; non-repudiation (return receipt); delivery failure handling

Message Prioritization
    Replication Server: N/A
    EAI Messaging: Relative priority; time constraints

Perishable Messages
    Replication Server: N/A
    EAI Messaging: Time expiration; subsequent message

Message Security
    Replication Server: Transmission encryption/system authentication via SSL
    EAI Messaging: Sender/user authenticity; privacy

Interface Standards (EDI, XML)
    Replication Server: ANSI SQL
    EAI Messaging: Protocol translation; custom protocol definition

Message Format
    Replication Server: SQL transactions; CML (insert/update/deletes); procedure executions
    EAI Messaging: Message structures

Flexible Event Detection
    Replication Server: Row/column value subscriptions
    EAI Messaging: Failure events (non-events); threshold events; state change events; user requested events; conditions on events

Distribution
    Replication Server: Individual DB connections
    EAI Messaging: Message filters; addressee groups; hierarchical channels; broadcast

Exception Processing
    Replication Server: Definable actions (stop, retry, log); time limit
    EAI Messaging: Corrupted/incomplete detection; rules (expiration, time limit, etc.); actions (retry, log, event)

Now then, let's consider the classic architectures and when each of these solutions might be a better fit.

Scenario: Standby System
    Better fit: RS
    Rationale: Transaction serialization

Scenario: Internal system to packaged application such as PeopleSoft
    Better fit: MSG (possibly both)
    Rationale: Schema transparency, interface specification - possibly use both; if an internal system, use RS to signal the EAI solution

Scenario: Two packaged applications
    Better fit: MSG
    Rationale: Schema transparency, interface specification

Scenario: Corporate Roll-ups/Fan-Out
    Better fit: RS
    Rationale: Little if any translation required (ease of implementation); transaction serialization from individual nodes

Scenario: Shared Primary/Load Balancing
    Better fit: RS
    Rationale: Little if any translation required (ease of implementation); transaction serialization from individual nodes

Scenario: Internal to External (customer/partner)
    Better fit: MSG
    Rationale: Schema transparency, control restrictions, protocol differences

Scenario: Enterprise Workflow
    Better fit: Both (?)
    Rationale: Possibly use RepConnector to integrate RS with EAI - business viewpoint differences drive large schema differences, plus the use of packaged applications (i.e. PeopleSoft Financials)

The real difference between the two, and the need for EAI, is apparent in a workflow environment. While RS supports some basic workflow concepts (request functions, data distribution, etc.), it is hampered by the need for similar data structures or extensive stored procedure interfaces to map the data at each target location. To see how complex workflow situations can get, let's take the simple online or catalog retail example.

Different Databases/Visualization

Within different business units in the workflow, the data is visualized quite differently. Consider the basic premise of a customer ordering a new PC:

Order Processing Database - It's an HP Vectra PC costing $$$ for Mr. Jones along with a fancy new printer.

HR Database - $$$ in sales at 10% commission for Jane Employee.

Shipping Database - It's 3 boxes weighing 70 lbs to Mulberry St.

Obviously, you could conceive of more - Financials, etc. However, the point is that a single transaction, which may be represented as a single record in the Order Processing database (and a single SKU), has different elements of interest to different systems. HR really only cares about the dollar figure and the transaction date for payroll purposes, while Shipping cares nothing about the customer nor the financial aspects of the transaction - in fact the single record becomes three in its systems. Those familiar with replication know it would be a simple task to use function strings and procedure calls to perform this integration from a Replication Server perspective. However, that would require in a sense modifying the application (although this is highly arguable, as adding a few stored procedures that are strictly used as an RS API is no different than message processing).

Different Companies

Additionally, the workflow often requires interaction with external parties such as credit card clearing houses and suppliers (hint: neither buy.com nor amazon.com REALLY has that Pentagon-size inventory). Interaction with external parties has its own set of special issues:

- You still want guaranteed transaction delivery (but the transaction may be changed)
- Mutually untrusted system access
- Complicated by different protocols and structures (EDI 820 messages, fpML messages), etc.

In addition to the external party complexities that Replication Server really can't address, the other aspect of external party interaction is that it often requires a challenge/response message before the workflow can continue. For example, the store needs to debit the credit card and receive an acknowledgement prior to the original message continuing along the path to HR and Shipping.

Different Transactions

Additionally, a single business transaction in a workflow environment may be represented by different transactions at different stages of the workflow. As noted above, some stages of the workflow may become synchronous (i.e. credit card debit) before the workflow can continue. The below list of transaction operations is not couched in the terms of

any one EAI product but is useful when considering the metamorphosis a single business transaction can undergo in a workflow system:

Transaction spawning - A Shipping Request spawns a Stock Order; for example, if the purchase depletes the stock of an item below a threshold, that spawns an automatic re-ordering of the product from the supplier.

Transaction decomposition/division - One order becomes multiple shipments (due to backorder or multiple/independent suppliers). In this sense the order is not complete until each item is complete.

Transaction multiplication - One order drives Accounting, Marketing, and Shipping. In a sense this is multiplication in that for each business transaction, N other messages/transactions will result in various workflow systems.

Transaction state - One order moves from Booked to Recognized Revenue. In this case, one transaction from the order entry system spawns a transaction to the financial system as well as to order fulfillment. In the financial system, the revenue is treated as booked but not credited yet. Once the order has been shipped, the order fulfillment department in a sense issues a response message to the order entry system stating the order is complete. Additionally, the shipping department's response also updates the state of the financial system, causing the credit card to actually be debited as well as changing the state of the revenue to recognized.

The important aspect to keep in mind is that through each of these systems, a transaction identifier is needed to associate the appropriate responses - for retail, this is the order number/item number combination. Additionally, workflow messaging may require challenge/response messaging (as discussed earlier) as well as message merging (merge an airline reservation request, rental car request, and hotel reservation request into a single trip ticket for travelers) over an extended period of time. Consequently, the life span of a message within a messaging system can be appreciable - unlike database replication, in which the message has an extremely short duration (less recovery configuration settings).

Integrating Replication & Messaging

Having seen that the two are distinctly different solutions, the next question that arises is whether they are complementary. In other words, does it make sense to use both solutions simultaneously in an integrated system? The answer is a resounding YES. The single largest benefit of integrating replication and messaging systems when both are needed (i.e. a Warm Standby within a workflow environment) is that legacy applications may be included in the EAI strategy without the cost of re-writing existing 3-tier applications and without the response time impact to front-end systems of adding messaging on to the transaction time. Additionally, existing systems can now have extended functionality added without a major re-write. For example, today we expect an email from any online retailer worthy of the name when our order is shipped. This becomes a simple task for RS, RepConnector and EAServer, as a single column update of the status in the database - via a subscription on the shipment status field - could invoke a component in EAServer to extract the rest of the order details, construct an email message and pass it to the email system for delivery. Similarly, RS could use an RPC to add a job to an OpenServer or EAServer based queuing mechanism vs. having those systems constantly polling a database queue.
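As a minimal sketch of that last idea, a customized function string could turn a status update into an RPC against a queuing mechanism rather than a SQL update. All of the object names (orders_rd, my_rpc_class, enqueue_ship_notice and its parameters) are hypothetical, my_rpc_class is assumed to be a function-string class derived from rs_sqlserver_function_class and already assigned to the connection, and the exact function string syntax should be verified against your RS version's reference manual.

    -- hypothetical: route rs_update for orders_rd through an RPC that
    -- enqueues a "ship notification" job instead of updating a table
    create function string orders_rd.rs_update
        for my_rpc_class
        output rpc
        'execute enqueue_ship_notice
            @order_num   = ?order_num!new?,
            @ship_status = ?ship_status!new?'
    go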
Performance Benefits of Integration

The chief performance benefit of integrating the two solutions comes from the elimination of the cpu/process intensive polling mechanism that is commonly used to integrate existing database systems into a new messaging architecture. Any polling mechanism that attempts to detect database changes outside of scanning the transaction log involves one of several techniques: timestamp tracking, or shadow tables.

Timestamp tracking involves adding a datetime field to every row in the database. This field is then modified with each DML operation. At a simplistic level, the polling mechanism simply selects the rows that have been modified since the last poll period (a sketch of such a polling query follows this list). This technique has a multitude of problems:

1. An isolation level 3 read is required, which could significantly impact contention on the data as the shared/read locks are held pending the read completion. Isolation level 3 is required to avoid row movement (deferred update/primary key change, etc.) from causing a row to be read twice.

2. Deleted rows are missed entirely (they aren't there anymore, so there is no way to detect a modification via the date).

3. Multiple updates to the same row between polling cycles are lost. This could mean the loss of important business data, such as the daily high for a stock price.
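The sketch below shows the shape of the polling query being described; the orders table, its last_mod_time column, and the poll_bookkeeping table are hypothetical and exist only to illustrate the technique and why it suffers the problems above.

    -- typical timestamp-based change poll (illustrative only)
    declare @last_poll datetime, @this_poll datetime
    select @last_poll = last_poll_time from poll_bookkeeping   -- stored from the prior cycle
    select @this_poll = getdate()

    -- isolation level 3 so rows that move (deferred updates, primary key
    -- changes) are not read twice; note the shared locks this holds
    select order_num, ship_status, last_mod_time
        from orders
        where last_mod_time > @last_poll
          and last_mod_time <= @this_poll
        at isolation serializable

    -- deleted rows never appear here, and only the final state of a row
    -- updated several times during the interval is seen
    update poll_bookkeeping set last_poll_time = @this_poll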

The second implementation is a favorite of many integration techniques including heterogeneous Replication Agents where log scanning is not supported. This implementation has a number of considerations (not necessarily problems, but could have system impact):

1. Lack of transactional integrity - each table is treated independently of the parent transaction. Consequently, a transaction tracking table is necessary to tie individual row modifications together into the concept of a transaction. Additionally, each operation (i.e. inserts into different tables) would have to be tracked ordinally to ensure RI was maintained as well as serialization within the transaction.

2. Lack of before/after images - if all that is recorded is the after image, then again there would be issues with deletes; additionally, critical information for updates would be lost. As a result, the shadow table would have to track before/after values for each column.

3. Extensive I/O for distribution. A single insert becomes:

    a. Insert into real table(s)
    b. Insert into shadow table(s)
    c. Insert into transaction tracking table
    d. Distribution mechanism reads transaction tracking table
    e. Distribution mechanism reads shadow table(s)
    f. Distribution mechanism deletes rows from shadow table(s)
    g. Distribution mechanism deletes rows from transaction tracking table

This last consideration may not be that much of a concern on a lightly or medium loaded system. However, if the system is nearing capacity, this activity could bring it to its knees. Additionally, as the distribution mechanism reads or removes records from the shadow tables, it could result in contention with source transactions that are attempting to insert rows. As a consequence - ignoring the cost/development benefits of an integrated solution - integrating Replication Server with a messaging system could achieve greater overall performance & throughput than simply forcing a messaging solution. The key areas of improved performance would be:

Reduced latency for event detection - Replication Agents work in Near-Real Time, whereas a polling agent would have a polling cycle, possibly taking several minutes to detect a change.

Reduced I/O load on the primary system - by scanning directly from the transaction log, the I/O load - and associated CPU load - of timestamp scanning or maintaining shadow tables is eliminated for ASE systems. Shadow tables may still be necessary for heterogeneous systems.

Reduced contention.

The conclusion is fairly straightforward. For any site that has existing applications and does not wish to undertake a massive recoding effort - particularly if the system is already involved in replication (i.e. Warm Standby) - integrating replication with messaging may improve performance & throughput over using both individually and suffering the impacts that a database adapter could inflict.

Messaging Conclusion

This section may have appeared out of context with the rest of this paper. However, it was included to illustrate the classic point that sometimes better performance and throughput is a system-wide consideration, and a shift in architecture may achieve more for overall system performance than merely tweaking RS configuration parameters.

Key Concept #38: A corollary to "You can't tune a bad design" is "A limited architecture may be limiting your business."


Sybase Incorporated
Worldwide Headquarters
One Sybase Drive
Dublin, CA 94568, USA
Tel: 1-800-8-Sybase
www.sybase.com

Copyright © 2000 Sybase, Inc. All rights reserved. Unpublished rights reserved under U.S. copyright laws. Sybase and the Sybase logo are trademarks of Sybase, Inc. All other trademarks are property of their respective owners. ® indicates registration in the United States. Specifications are subject to change without notice. Printed in the U.S.A.
