
Best Practice Recommendation

Subject: Monitoring Replication System


Author(s): Deepak Upadhyay, Sr.DBA, Sybase IT

Reviewer(s): David Burgess, Staff DBA, Sybase IT


Udaya Challapalli, Sr. DBA Manager, Sybase IT
Hema Seshadri, Sr. DBA Manager, Sybase IT

Contributor(s): David Burgess, Staff DBA, Sybase IT

Abstract: The purpose of this document is to provide best practices to monitor a typical
Sybase replication environment.

Table of Contents
1       Introduction
1.1     Definitions
1.1.1   Relational Database
1.1.2   Primary database
1.1.3   Replicate database
1.1.4   Standby database
1.1.5   Replication Server
1.1.6   Primary replication server
1.1.7   Replicate replication server
Best Practice Solution
1.2     General Monitoring
1.2.1   Status monitoring
1.2.2   Errorlog
1.2.3   Disk space
1.2.4   Replication topology
1.3     Performance monitoring
1.3.1   Latency
1.3.2   Throughput
1.3.3   Statistics monitoring (i.e. Monitor counters)
1.4     Alerting/Notification
1.4.1   RSM Event monitoring
1.4.2   Scripts
1.5     General Troubleshooting
1.5.1   Skipping transaction
1.5.2   Dumping stable queue
1.5.3   Disabling secondary truncation point
1.5.4   Enabling secondary truncation point
References


Introduction
The purpose of this document is to list the best practices used to monitor a Sybase replication
system.

1.1 Definitions
1.1.1 Relational Database
A type of database which groups data into related tables; each table has two major
elements (i.e. rows and columns).

1.1.2 Primary database


A database where transactions are originally performed (by end-user/process) and those
transactions are grouped/captured for replication server.

1.1.3 Replicate database


A database which receives replicated transactions from the replication server and applies them
to its own copy of the primary data. A replicate database may or may not be an
exact copy of its primary database.

1.1.4 Standby database


A database which receives replicated transactions from the replication server and applies them
to its own copy of the primary data. A standby database is generally an exact copy
of its primary database.

1.1.5 Replication Server


Sybase Open Client/Server product which performs continuous, asynchronous transfer of
transaction log from a primary to replicate database(s).

1.1.6 Primary replication server


An instance of a replication server which performs continuous, asynchronous transfer of the transaction log
from a primary database to other replication server(s). It is also capable of replicating transactions
directly to replicate databases.

1.1.7 Replicate replication server


An instance of a replication server which generally receives transactions from a primary replication server
and replicates them to replicate database(s).

Best Practice Solution
In this section monitoring is categorized as follows
• General Monitoring
• Performance monitoring
• Alerting/Notification
• General trouble-shooting

We will discuss each of the above categories in detail in the sections below.

1.2 General Monitoring

Monitoring replication is critical. Effective monitoring is the key to maintaining a replication
system, since time is an important factor (i.e. the time to fix any issue). If connections are
suspended for a long time, this can cause:
• The stable device to fill
• The Replication Agent to suspend at the primary site
• The transaction log to fill up at the primary site
• All transactions to suspend/abort (i.e. all activity to stop)

The objective of this category of monitoring is to make sure all components of the replication
system are up and running and to avoid any surprise failure of the system.

Generally, monitoring for this category is divided as follows:

• Status monitoring
• Log (i.e. error log) monitoring
• Space monitoring (i.e. disk space)
• Overall topology (i.e. functioning as it is supposed to be)
o Table schema monitoring - To make sure the schema matches the replication
definitions and the replicate database(s), especially after any application
changes
o Marking replication - All required tables are marked for replication correctly
o Data is in sync between primary and replicate sites

The various components to monitor are listed in the following subsections.

1.2.1 Status monitoring
This section details the various components of a replication system and explains why it is
important to monitor them. The following major components are considered in this section:
• Servers
• Connection/Routes
• Replication queues
• Replication Agents/LTMs
• Replication threads/modules

1.2.1.1 Servers
Generally there are two types of servers:
• Database servers
• Replication servers

1.2.1.1.1 Database servers –


In a replication system, basically three types of database servers are used:
• Primary
• Replicate
• RSSD

Primary – The most important database of any application; from the replication
system's point of view it is one of the critical components to monitor. Whether the server is
up and running can be monitored using many different methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools

Replicate – When used as a standby (i.e. for DR), this is as critical as the primary
database server. A shutdown of replicate site(s) can have a big impact (on the primary site,
i.e. the final application) depending on the configuration of the replication system. Several
key factors (below) define the impact of a down replicate site:

• Time to fix and bring up the replicate site(s)
• Space allocated to the stable device(s) of the replication server(s)
• Complexity of the replication system (i.e. how easy it is to rebuild the replication
system, or how easy it is to remove/add the replicate site from/to the replication system)

Again, whether the server is up and running can be monitored using many different
methods:
• Scripts
• RMS/RSM event monitoring
• Other third-party tools

RSSD – Equally important as the other database servers, since a failure of the RSSD
database server may contribute to a failure of the replication system.

Again, whether the server is up and running can be monitored using many different
methods (a minimal script-based check is sketched after this list):
• Scripts
• RMS/RSM event monitoring
• Other third-party tools
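
A minimal script-based liveness check (a sketch; any lightweight query will do, and <DBSrv>
stands for whichever primary, replicate or RSSD data server is being checked) simply connects
with isql and runs a trivial query:

isql -U<user> -S<DBSrv> -P<pwd>
1> select @@servername, getdate()
2> go

If the connection fails or the query does not return, the monitoring script flags the server as
down. The replication server itself can be checked the same way with a harmless command such
as "admin version".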

1.2.1.1.2 Replication servers –
Usually there is only one replication server, which performs all replication in the
replication system. More complex environments will have multiple types of replication
servers, as mentioned below. The status of the server can be monitored
using scripts or RMS/RSM event monitoring.
ID – Must be up and running at least under the following conditions:
• Adding a new replication server to the replication system
• Adding databases to the replication system
• Adding routes to the replication system
Primary – Responsible for collecting published data. Can impact the business if down
for a long duration (without being monitored).
Intermediate – Routes the flow of data between two replication servers. Monitoring the
intermediate replication server is as important as monitoring the other replication servers.
Replicate – Applies replicated data. Monitoring the replicate replication server is
as important as monitoring the other replication servers.

1.2.1.2 Connections/Routes –
These are critical components of the replication server; their status generally indicates whether
data is replicating smoothly or whether there are issues.
Logical connection – In a warm-standby environment, it is important to verify the current
active and standby connections:

isql –U<user> -S<RepSrv> -P<pwd>


1> admin logical_status
2> go

Logical Connection       Active Connection        State    Standby Connection       State    Controller RS         Operation in Progress  State of Operation  Spid
[278] DBS.DBSfast_LC     [281] PDSDBS1.DBSfast    Active/  [1526] PDSDBS5.DBSfast   Active/  [16777358] PRLDBS1A   None                   None
[283] DBS.DBSudef_LC     [284] PDSDBS1.DBSudef    Active/  [1528] PDSDBS5.DBSudef   Active/  [16777358] PRLDBS1A   None                   None
[375] DBS.DBSuomm_LC     [539] PDSDBS1.DBSuomm    Active/  [1529] PDSDBS5.DBSuomm   Active/  [16777358] PRLDBS1A   None                   None
[376] DBS.DBSvend_LC     [540] PDSDBS1.DBSvend    Active/  [1527] PDSDBS5.DBSvend   Active/  [16777358] PRLDBS1A   None                   None
[543] DBS.u_DBScta_LC    [544] PDSDBS1.u_DBScta   Active/  [1530] PDSDBS5.u_DBScta  Active/  [16777358] PRLDBS1A   None                   None

Physical database connection – Make sure that the database connection is up and
running, especially the replicate database connection(s).


isql –U<user> -S<RepSrv> -P<pwd>
1> admin health
2> go

Mode Quiesce Status


---- ------- ------
NORMAL FALSE HEALTHY

In the output of most of the "admin" commands below, the "State" column is important to
observe. The following table describes the possible values of that column:

State              Description

Active             Actively processing a command.
Awaiting Command   The thread is waiting for a client to send a command.
Awaiting I/O       The thread is waiting for an I/O operation to finish.
Awaiting Message   The thread is waiting for a message from an Open Server message queue.
Awaiting Wakeup    The thread has posted a sleep and is waiting to be awakened.
Connecting         The thread is connecting.
Down               The thread has not started or has terminated.
Getting Lock       The thread is waiting on a mutual exclusion lock.
Inactive           The status of an RSI User thread at the destination of a route when the
                   source Replication Server is not connected to the destination Replication Server.
Initializing       The thread is being initialized.
Suspended          The thread has been suspended by the user.

1> admin who_is_down
2> go

Spid  Name      State      Info
      DSI EXEC  Suspended  414(1) PRSPS3.cons
      DSI       Suspended  414 PRSPS3.cons

1> admin who


2> go

Spid Name State Info


58 DIST Awaiting Wakeup 376 DBS.DBSvend_LC
67 SQT Awaiting Wakeup 376:1 DIST DBS.DBSvend_LC
29 SQM Awaiting Message 376:1 DBS.DBSvend_LC
28 SQM Awaiting Message 376:0 DBS.DBSvend_LC
82 DSI EXEC Awaiting Command 540(1) PDSDBS1.DBSvend
42 DSI Awaiting Message 540 PDSDBS1.DBSvend
8427 REP AGENT Awaiting Command PDSDBS1.DBSvend
8203 DSI EXEC Awaiting Command 1527(1) PDSDBS5.DBSvend
8202 DSI Awaiting Message 1527 PDSDBS5.DBSvend
4646 DSI EXEC Awaiting Command 614(1) PRSDBS1.DBSvend
4645 DSI Awaiting Message 614 PRSDBS1.DBSvend
32 SQM Awaiting Message 614:0 PRSDBS1.DBSvend
38 RSI Awaiting Wakeup PRLRMDBS1
37 SQM Awaiting Message 16777372:0 PRLRMDBS1
86 RSI USER Awaiting Command PRLRMDBS1
54 dSUB Sleeping
15 dCM Awaiting Message
18 dAIO Awaiting Message
62 dREC Sleeping dREC
63 dSTATS Sleeping
1152 USER Active sa
14 dALARM Awaiting Wakeup

Direct/In-direct routes – Based on outbound queue size of the source replication


server, it is critical to monitor the status of the route.

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,rsi
2> go

In addition to the "State" column, the "Locater Sent" and "Locater Deleted" values indicate
(i.e. if they differ) whether there is still data to be processed by the replication server for the RSI.

Spid:            38
State:           Awaiting Wakeup
Info:            PRLRMDBS1
Packets Sent:    2655009
Bytes Sent:      426460847
Blocking Reads:  383262
Locater Sent:    0x0000000000000000000000000000000000000000000000000000000000014c5900320002
Locater Deleted: 0x0000000000000000000000000000000000000000000000000000000000014c5900320002

1> admin who_is_down
2> go

Spid  Name      State  Info
      DSI EXEC  Down   418(1) DTRSPS3.cons

1.2.1.3 Queues –
Inbound – During peak hours, it is important to monitor whether the data is moving through
the queue.
Outbound – Except in a warm-standby-only environment, it is equally important to
monitor the size and movement of the data in the queue.
Materialization – Only important during materialization.

The replication command "admin who,sqm" below produces about seventeen output columns,
and all of them are worth monitoring carefully. The four columns shown below can quickly
provide a brief state of all the replication queues.

When a connection is active but data is observed not to be replicating, check whether the
"Duplicates" column is rising; unique transactions may be incorrectly resolved as
duplicates.

Additionally, the two columns "First Segment.Block" and "Last Seg.Block"
can quickly tell approximately how much data in the queue is still to be processed. For
example, below, the queue (378:1 DBS.DBSglep_LC) has about 46 MB (i.e. 600520 - 600474)
of data to be processed.

Notice: a segment is 1 MB and consists of 64 blocks (i.e. block size = 16K).

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,sqm

2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54

1.2.1.4 Replication Agent –


Sybase Replication Agent – An internal thread in Sybase ASE which scans the transaction log
of the database server. Make sure the agent is collecting marked data, forwarding
transactions to the replication server and, importantly, moving the secondary truncation point.

In order to verify the state of the replication agent, the following stored procedure can be
executed:

isql –U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>


1> sp_help_rep_agent <DBName>,process
2> go

Dbname Spid sleep status retry count last error


DBSallc 19 end of log 0 0

The "sleep status" column (in the above output) shows the current activity of the
replication agent:

Status       Comment

not running  RepAgent is not running.
not active   RepAgent is not in recovery mode.
Initial      RepAgent is initializing in recovery mode.
end of log   RepAgent is in recovery mode and has reached the end of the transaction log.
Unknown      None of the above.

Further, the shown “Spid” of the replication agent can be verified using stored
procedure “sp_who”.

isql –U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>


1> sp_who “19”
2> go

fid  spid  status      loginame  origname  hostname  blk_spid  dbname   cmd        block_xloid
0    19    background  NULL      NULL                0         DBSallc  REP AGENT  0

1> select * from master..syslogshold


2> where dbid = db_id(“DBSallc”)
3> and name = “$replication_truncation_point”
4> go

dbid  Reserved  spid  page   xactid          masterxactid    starttime           name                           xloid
5     0         0     21762  0x000000000000  0x000000000000  Jan 26 2009 9:00AM  $replication_truncation_point  0

LTMs – Used for non-Sybase primary data only; an LTM collects the primary transaction log (i.e.
the delta) and transfers it to the replication server. As with the Sybase Replication Agent,
it is equally important to monitor its errorlog.

1.2.1.5 Replication threads


SQM – Make sure the difference between "First Segment" and "Last Seg" (from
"admin who,sqm") is zero or close to it. As mentioned earlier, the output below shows a
difference of 46 MB of data in the queue.

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,sqm
2> go

Info                         Duplicates  First Segment.Block  Last Seg.Block
16777372:0 PRLRMDBS1         167         294680.39            294680.39
615:0 PRSDBS1.DBSvmst        150         2735.24              2735.24
606:0 PRSDBS1.DBSglep        3           159995.25            159995.25
379:0 DBS.DBSvmst_LC         0           0.1                  0
378:1 DBS.DBSglep_LC         12          600474.19            600520.47
378:0 DBS.DBSglep_LC         0           0.1                  0
201:0 PDSREP1.PRLDBS1B_RSSD  9           8.9                  8.9
201:1 PDSREP1.PRLDBS1B_RSSD  5           212582.54            212582.54

Notice: a segment is 1 MB and consists of 64 blocks (i.e. block size = 16K).

SQT – Using "admin who,sqt", look for any large transaction the SQT is processing and
make sure it is not affecting the replication system. If the "Full" column is often observed
as "1", the SQT cache size is too small. The "Removed" column shows the number of
transactions whose messages have been moved out of the SQT cache (due to their size); if
there are many, or even a single one for a long time, observe other columns, for example
"Open" or "First Trans" (with ST = O and a large number of Cmds).
The "First Trans" column contains information in three parts:

ST: followed by O/C/R/D (Open/Closed/Read/Deleted)
Cmds: followed by the number of SQL commands in the first transaction
<qid>: followed by the exact position of the first transaction (i.e. segment:block:row)

Make sure "Cmds" is changing and <qid> is increasing.


isql –U<user> -S<RepSrv> -P<pwd>
1> admin who,sqt
2> go

Info                              Closed  Read  Open  Trunc  Removed  Full  SQLBlocked  First Trans               Parsed  SQL Reader  Change Oqids  Detect Orphans
209:1 DIST ITDSREP1.ITRLDR1_RSSD  0       0     0     0      0        0     1                                     0       0           0             0
323:1 DIST GIT.nisdb_LC2          0       0     0     0      0        0     1                                     0       0           0             0
210:1 DIST GIT.nisdb_LC           0       0     0     0      0        0     1                                     0       0           0             0
324 DTSGIT1.nisdb                 112     0     0     112    0        0     0           st:C,cmds:3,qid:15:23:0   0       0           0             1
212 DOSNIS1A.nisdb                72      0     0     72     0        0     0           st:C,cmds:3,qid:21:58:0   0       0           0             1
214 DTSGIT1.nisdbdev              112     0     0     112    0        0     0           st:C,cmds:3,qid:435:4:0   0       0           0             1
209 ITDSREP1.ITRLDR1_RSSD         0       0     0     0      0        0     0                                     0       0           0             1

DIST – The "Status" column (from "admin who,dist") provides the current status of
the thread, either "Normal" or "Ignoring". Other useful columns to look at are
"PendingCmds" and "Duplicates".

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,dist
2> go

Info                       PrimarySite  Type  Status  PendingCmds  SqtBlocked  Duplicates  TransProcessed  CmdsProcessed  MaintUserCmds  NoRepDefCmds
200 PDSREP1.PRLDBS1A_RSSD  200          P     Normal  0            1           0           479             1625           0              0
543 DBS.u_DBScta_LC        544          L     Normal  0            1           0           344572          5245341        0              4550199
690 PDSDBS1.logindb        690          P     Normal  0            1           0           813             2439           0              0
376 DBS.DBSvend_LC         540          L     Normal  0            1           0           62246           203586         0              70644
375 DBS.DBSuomm_LC         539          L     Normal  0            1           0           6116            18166          0              0
283 DBS.DBSudef_LC         284          L     Normal  0            1           0           6168            18274          4              0
278 DBS.DBSfast_LC         281          L     Normal  0            1           0           12698           534422         4              503479

DSI – Make sure the DSI thread is up and running; it should not be down for a long
time, in order to avoid a processing backlog.

There are many columns to look at in the "admin who,dsi" output, but the following
few columns can quickly provide the status:

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,dsi
2> go

Status            Info                        Maintenance User      Xacts_skipped  TriggerStatus  ReplStatus
Awaiting Message  618 PRSDBS1.u_DBScta        u_DBScta_maint        0              off            on
Awaiting Message  200 PDSREP1.PRLDBS1A_RSSD   PRLDBS1A_RSSD_maint   0              on             on
Awaiting Message  1530 PDSDBS5.u_DBScta       u_DBScta_maint        0              off            off
Awaiting Message  1529 PDSDBS5.DBSuomm        DBSuomm_maint         0              off            off
Awaiting Message  1528 PDSDBS5.DBSudef        DBSudef_maint         0              off            off
Awaiting Message  544 PDSDBS1.u_DBScta        u_DBScta_maint        0              on             on
Awaiting Message  690 PDSDBS1.logindb         logindb_maint         0              on             on
Awaiting Message  540 PDSDBS1.DBSvend         DBSvend_maint         0              on             on
Awaiting Message  539 PDSDBS1.DBSuomm         DBSuomm_maint         0              on             on
Awaiting Message  284 PDSDBS1.DBSudef         DBSudef_maint         0              on             on
Awaiting Message  281 PDSDBS1.DBSfast         DBSfast_maint         0              on             on
Awaiting Message  691 PRSDBS1.logindb         logindb_maint         20             on             on
Awaiting Message  614 PRSDBS1.DBSvend         DBSvend_maint         0              off            on
Awaiting Message  1526 PDSDBS5.DBSfast        DBSfast_maint         0              off            off
Awaiting Message  1527 PDSDBS5.DBSvend        DBSvend_maint         0              off            off

Other quick commands to check the status of DSIs are as follows

1> admin who_is_down


2> go
.
.
1> admin who_is_up
2> go

RSI – Make sure route’s status is UP.

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who,rsi
2> go

Spid:            38
State:           Awaiting Wakeup
Info:            PRLRMDBS1
Packets Sent:    2655009
Bytes Sent:      426460847
Blocking Reads:  383262
Locater Sent:    0x0000000000000000000000000000000000000000000000000000000000014c5900320002
Locater Deleted: 0x0000000000000000000000000000000000000000000000000000000000014c5900320002

DAEMONS – Make sure to monitor status of daemons dAlarm, dAIO,dSUB and dCM
regularly.

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who
2> go

Spid Name State Info


58 DIST Awaiting Wakeup 376 DBS.DBSvend_LC
67 SQT Awaiting Wakeup 376:1 DIST DBS.DBSvend_LC
29 SQM Awaiting Message 376:1 DBS.DBSvend_LC
28 SQM Awaiting Message 376:0 DBS.DBSvend_LC
82 DSI EXEC Awaiting Command 540(1) PDSDBS1.DBSvend
42 DSI Awaiting Message 540 PDSDBS1.DBSvend
8427 REP AGENT Awaiting Command PDSDBS1.DBSvend
8203 DSI EXEC Awaiting Command 1527(1) PDSDBS5.DBSvend
8202 DSI Awaiting Message 1527 PDSDBS5.DBSvend
4646 DSI EXEC Awaiting Command 614(1) PRSDBS1.DBSvend

4645 DSI Awaiting Message 614 PRSDBS1.DBSvend
32 SQM Awaiting Message 614:0 PRSDBS1.DBSvend
38 RSI Awaiting Wakeup PRLRMDBS1
37 SQM Awaiting Message 16777372:0 PRLRMDBS1
86 RSI USER Awaiting Command PRLRMDBS1
54 dSUB Sleeping
15 dCM Awaiting Message
18 dAIO Awaiting Message
62 dREC Sleeping dREC
63 dSTATS Sleeping
1152 USER Active sa
14 dALARM Awaiting Wakeup

1.2.2 Errorlog
1.2.2.1 Database errorlogs
By default located at $SYBASE/$SYBASE_ASE/install.
Primary database errorlog – Look especially for errors related to the Sybase
Replication Agent and/or any corruption in the primary database.

00:00000:00586:2008/12/13 14:10:13.30 server Started Rep Agent on database, 'DBSglep' (dbid = 21).
02:00000:00586:2008/12/13 14:10:13.45 server Error: 692, Severity: 20, State: 1
02:00000:00586:2008/12/13 14:10:13.45 server Uninitialized logical page '1498656' was read while accessing object '8' in database '21'. Please contact Sybase Technical Support.
02:00000:00586:2008/12/13 14:10:13.45 server Rep Agent Thread for database 'DBSglep' (dbid = 21) terminated abnormally with error. (major 0, minor 92)

Replicate database errorlog – Verify the replicate site is up and running with adequate
resources (i.e. not running out of connections, locks, log space etc.).

00:00000:00006:2007/10/13 16:21:55.88 server Error: 1105, Severity: 17, State: 4
00:00000:00006:2007/10/13 16:21:55.89 server Error: 1105, Severity: 17, State: 4
00:00000:00006:2007/10/13 16:21:55.89 server Can't allocate space for object 'syslogs' in database 'DBSbank' because 'logsegment' segment is full/has no free extents. If you ran out of space in syslogs, dump the transaction log. Otherwise, use ALTER DATABASE to increase the size of the segment.

RSSD database errorlog – A very critical database for the replication server. Look at the
currently available space in all segments, and verify it is up and running with adequate resources
(i.e. not running out of connections, locks etc.).

00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): Received the following error message from the Replication Server: Msg 11060. CT/CS Lib function 'ct_results' failed. Retcode = 0..
00:00000:00068:2008/07/31 16:27:18.07 server Error: 9261, Severity: 20, State: 0
00:00000:00068:2008/07/31 16:27:18.07 server RepAgent(7): This Rep Agent Thread is aborting due to an unrecoverable communications or Replication Server error.
00:00000:00068:2008/07/31 16:27:18.07 server Rep Agent Thread for database 'PRLDBS1D_RSSD' (dbid = 7) terminated abnormally with error. (major 92, minor 61)

1.2.2.2 Replication errorlog –


Use “admin log_name” (as shown below) to find location of the errorlog, where replication
server records informational and error messages.

isql –U<user> -S<RepSrv> -P<pwd>


admin log_name
go

The output will look like this:

Log File Name


-------------
/cis1/PRLDBS1A/log/PRLDBS1A.log

The following message prefixes are important to observe (a simple scan for the serious ones is
sketched after this list):

I - Informational message
W - Warning
E - Error
H - Replication thread died
F - Replication server died due to a serious error
N - Internal error
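
For example, a simple check (a sketch; the log path is the one returned by "admin log_name"
above) can scan the errorlog for the more serious message classes:

grep -E '^[EHFN]\. ' /cis1/PRLDBS1A/log/PRLDBS1A.log

Any match can then be mailed to the DBA team or fed into the alerting mechanisms described
in the Alerting/Notification section.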

Below is a classic error from the replication server errorlog:

E. 2007/01/27 22:45:12. ERROR #1028 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3071)
    Message from server: Message: 2601, State 2, Severity 14 -- 'Attempt to insert duplicate key row in object 'esa_invoice_job' with unique index 'pk_esa_invoice_job''.
I. 2007/01/27 22:45:12. Message from server: Message: 3621, State 0, Severity 10 -- 'Command has been aborted.'.
H. 2007/01/27 22:45:12. THREAD FATAL ERROR #5049 DSI EXEC(1010(1) PDSDBS5.u_DBScta) - dsiqmint.c(3078)
    The DSI thread for database 'PDSDBS5.u_DBScta' is being shutdown. DSI received data server error #2601 which is mapped to STOP_REPLICATION. See logged data server errors for more information. The data server error was caused by output command #1 mapped from input command #2 of the failed transaction.
I. 2007/01/27 22:45:12. The DSI thread for database 'PDSDBS5.u_DBScta' is shutdown.

If a stable device partition has failed (or is not available), look for failure messages in the
errorlog as shown below:

I. 2008/08/22 00:38:36. Embedded database id is '101'.


E. 2008/08/22 00:38:36. ERROR #6078 GLOBAL RS(GLOBAL RS) - sun_svr4.c(139)
Could not open file '/dev/rdsk/c2t0d0s5'. System error 'No such file or
directory(2)'
I. 2008/08/22 00:38:36. Unable to open partition '/dev/rdsk/c2t0d0s5'.
E. 2008/08/22 00:38:37. ERROR #6021 GLOBAL RS(GLOBAL RS) - m/sqmext.c(2086)
Stable queue '101:1' cannot be started. It is on a failed partition 'sq1'.
E. 2008/08/22 00:38:37. ERROR #6034 GLOBAL RS(GLOBAL RS) - m/sqmext.c(1259)
Cannot start the stable queue named '101:1'
W. 2008/08/22 00:38:37. WARNING #6131 GLOBAL RS(GLOBAL RS) - qm/sqmsp.c(2689)
Replication Server has no partitions.
I. 2008/08/22 00:38:38. Replication Server 'ITRLID1' is started.
I. 2008/08/22 00:38:38. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting
E. 2008/08/22 00:38:38. ERROR #30020 DIST(101 ITDSREP1.ITRLID1_RSSD) -
xec/dist.c(1647)
Unable to start distributor thread for queue '101'.
I. 2008/08/22 00:38:38. The distributor for 'ITDSREP1.ITRLID1_RSSD' is shutting
down
I. 2008/08/22 00:38:38. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is
started.
I. 2008/08/22 00:38:38. SQM starting: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 00:39:28. Replication Agent for ITDSREP1.ITRLID1_RSSD connected in
passthru mode.
E. 2008/08/22 00:39:28. ERROR #14023 REP AGENT(ITDSREP1.ITRLID1_RSSD) -
/execint.c(3463)
SQM had an error writing to the inbound-queue.
I. 2008/08/22 00:39:55. Shutting down.

Later, after executing "rebuild queues" to recover the replication server, look for loss
detection messages in the errorlog:

I. 2008/08/22 01:02:00. Partition 'SQ2' is added.


I. 2008/08/22 01:48:06. Partition 'sq1' is in the process of being dropped.
I. 2008/08/22 01:50:26. Rebuild Queues: Starting
I. 2008/08/22 01:50:27. Resetting Replication Agent starting log position for
ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. Shutting down the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:27. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is
shutdown.
I. 2008/08/22 01:50:27. DSI: enabled loss detection for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:1
I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:1. Restarting.
I. 2008/08/22 01:50:27. Rebuild queues: deleting queue 101:0
I. 2008/08/22 01:50:27. SQM starting: 101:1 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. SQM stopping: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:27. Rebuild queues: done rebuilding queue 101:0. Restarting.
I. 2008/08/22 01:50:27. SQM starting: 101:0 ITDSREP1.ITRLID1_RSSD
I. 2008/08/22 01:50:28. Starting DIST for 101:1.
I. 2008/08/22 01:50:28. DIST for 'ITDSREP1.ITRLID1_RSSD' is Starting
I. 2008/08/22 01:50:28. Starting the DSI thread for 'ITDSREP1.ITRLID1_RSSD'.
I. 2008/08/22 01:50:28. The DSI thread for database 'ITDSREP1.ITRLID1_RSSD' is
started.
I. 2008/08/22 01:50:28. Rebuild Queues: Complete

The loss can also be detected by querying RSSD tables as shown below

isql –U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>


1> select dsname,dbname from rs_databases where dbid in ( select
distinct case when origin_lsite_id = 0 then origin_site_id
else origin_lsite_id end from rs_oqid where valid > 0)
2> go

1.2.2.3 LTM errorlog –


In a hybrid replication system, it is important to monitor the LTM process's log for its status and
any other related errors (i.e. errors connecting to the replication server etc.).

1.2.2.4 Dbltm (i.e. Rep agent for ERSSD) –
Make sure to monitor this process if routes are used and the respective replication servers
use an ERSSD.

1.2.3 Disk space


Database segments:
• The primary database's log segment
• All of the replicate database's segments
• All of the RSSD's segments

isql –U<user> -S<DBSrv> -P<pwd> -D<DBName>


sp_helpsegment <SegmentName>
go

segment name status


----------- ---- -----------
2 logsegment 0
device size
------ ----
raw07 320.0MB
free_pages
-----------
163197
table_name index_name indid
---------- ---------- -----------
syslogs syslogs 0
total_size total_pages free_pages used_pages reserved_pages
---------- ----------- ---------- ---------- --------------
320.0MB 163840 163197 643 0

Threshold monitoring can be set up for all required segments of the database, as sketched below.
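
For example, a threshold on the primary database's log segment might be added like this (the
database name, the free-page count and the threshold procedure name sp_thresholdaction are
illustrative; the procedure must implement the desired alert action):

isql -U<user> -S<DBSrv> -P<pwd>
1> use DBSCOMMON
2> go
1> sp_addthreshold DBSCOMMON, logsegment, 16384, sp_thresholdaction
2> go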

Replication stable device – Use “admin disk_space” to monitor all stable devices.

isql –U<user> -S<RepSrv> -P<pwd>


admin disk_space
go

Partition                     Logical  Part.Id  Total Segs  Used Segs  State
/dev/vx/rdsk/sybase2/raw2g14  SQM14    112      2000        0          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g13  SQM13    111      2000        6          ON-LINE//
/dev/vx/rdsk/sybase2/raw2g09  SQM1     110      2000        6          ON-LINE//

File system managed by the operating system (i.e. disk space for errorlogs) – Monitor the
filesystem space used by all servers (i.e. database servers and replication servers).
For example, a Sybase Replication Server installation on Sun Solaris 10 can be
monitored using the simple "df" UNIX command:

hypnos-mis-/cis1/PRLDBS1A/log> df -k $SYBASE
Filesystem kbytes used avail capacity Mounted on
/dev/vx/dsk/sybase1/cis1_fs 10485760 5013734 5139665 50% /cis1

1.2.4 Replication topology


Tools like Sybase Central and Sybase PowerDesigner allow you to generate a graphical
replication topology for an organization. For a large and complex replication system,
regularly monitor the topology of the replication system for any changes made to it.
Such changes include, but are not limited to:
• Enabling data replication between two sites
• Disabling data replication between two sites
• Changing direction of data replication between two sites

1.2.4.1 Monitoring changes to table Schema


The schema of all required databases/tables needs to be verified at least between:
• The primary site and the replicate site(s)
• The primary site and the respective replication definition

This step is especially critical during an application upgrade, when the database schema is
most likely to change.
Many tools/methods can be used to find the schema of a table or the replication definition of
the respective table, including but not limited to:
• Sybase Central
• Sybase Power Designer (for replication requires “Information Liquidity Model”)
• Shell/Perl scripts
• Simple SQL commands (i.e. "sp_help" for the table schema and "rs_helprep" for the
replication definition; see the example after this list)
• Other third party tools
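
For example, to compare a table's schema with its replication definition (a sketch; the table
name attach is borrowed from the data consistency example below), get the schema on the
primary:

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> sp_help attach
2> go

and list the replication definitions recorded in the RSSD:

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>
1> rs_helprep
2> go

(rs_helprep can also be given a specific replication definition name.)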

1.2.4.2 Marking for replication
This is to make sure all required databases or database objects (i.e. mostly user tables)
are marked for replication. The following sample stored procedure can provide brief
information (a usage example follows the procedure):

create procedure sps_check_for_repmrk as
declare @dbnm varchar(100)
select @dbnm = db_name()
if (getdbrepstat() >= 0)
    if exists (select 1 from sysobjects where type = 'U' and sysstat & -32768 = -32768)
        select name from sysobjects where type = 'U' and sysstat & -32768 = -32768
    else
        select "No objects in database " + @dbnm + " are marked for replication"
else
    select "Entire database " + @dbnm + " is marked for replication"
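
Once created in the primary database, the procedure can be executed as follows:

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<DBName>
1> exec sps_check_for_repmrk
2> go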

1.2.4.3 Data consistency


Regularly (weekly, monthly, or at least before important dates such as quarter-end or year-end
closing), make sure data is consistent between the primary and replicate sites.
The Sybase-provided tool "rs_subcmp" (on Windows it is called "subcmp") can be used to find data
inconsistencies between the primary and replicate sites.

For example, in order to find data inconsistencies between a primary site (server name = PDSDBS1,
database name = DBSCOMMON, table name = attach) and its replicate site, create a
configuration file (using "vi" or another editor):

# attach.cfg – This is the file name


# PDSDBS1.DBSCOMMON.dbo.attach with
# PRSDBS1.DBSCOMMON.dbo.attach.
#
PDS = PDSDBS1
RDS = PRSDBS1
PDB = DBSCOMMON
RDB = DBSCOMMON
PTABLE = attach
RTABLE = attach
PSELECT = select
wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp
from attach
order by caller_id, create_date, file_name
RSELECT = select
wijt_location,caller_id,create_date,file_name,description,file_contents,chgstamp
from attach
order by caller_id, create_date, file_name

PUSER = svr_maint
RUSER = svr_maint
PPWD = forget1t
RPWD = forget1t
KEY = caller_id
KEY = create_date
KEY = file_name
RECONCILE = N
VISUAL = Y
NUM_TRIES = 3
WAIT = 10

Then simply use the above configuration file as shown below to find any data inconsistency:

$SYBASE/$SYBASE_REP/bin/rs_subcmp -f attach.cfg

A separate user (i.e. svr_maint) can be created and bound to its own user-defined temporary
database (which can also be bound to a user-defined cache) to avoid resource competition with
the remaining users.

Another option is to use the command line switches of "rs_subcmp". In order to sync the whole
database, it is recommended to create a batch process with a set of "rs_subcmp" commands, one
for every user table in the database. The following script can be used to generate the required
script (i.e. the script which will actually verify/sync using rs_subcmp commands); it may require
a few modifications for the local environment:

isql -Usa –SPDSDBS1 -DDBSCOMMON <<EOF


create table #table_list
(
id int ,
uu int
)
GO
create table #table_def
(
colid tinyint
,name char(30)
)
GO
insert #table_list
select id,uid
from sysobjects
where type='U'
and name not like 'rs_%'
order by name

GO
declare cursor_tabs cursor for
select id,uu from #table_list
GO
declare @dbname varchar(30)
,@tabid int
,@tabname varchar(100)
,@msg varchar(255)
,@pmsg varchar(255)
,@colid tinyint
,@indid int
,@counter int
,@colname varchar(30)
,@uu int
,@uuc varchar(100)
select @tabid = 0
open cursor_tabs
fetch cursor_tabs into @tabid,@uu

while (@@sqlstatus = 0)
begin
select @uuc = user_name(@uu)
setuser @uuc
select @tabname = object_name(@tabid)

insert #table_def
select A.colid, A.name
from syscolumns A
where A.id = @tabid
order by A.colid

select @msg = 'rs_subcmp -SPDSDBS1 -DDBSCOMMON -sPRSDBS1 -dDBSCOMMON -c"select * from ' + @tabname + ' order by '
print @msg

select @indid = min(indid)


from sysindexes
where id = @tabid
and indid > 0
and (status & 2) = 2

select @pmsg=' '


if (@indid <> NULL)
begin
select @counter = 1

while @counter <= 16
begin
select @colname = index_col(@tabname, @indid, @counter)
if (@colname is NULL)
break
if (@counter > 1)
select @pmsg = @pmsg + ", "
select @pmsg = @pmsg + rtrim(@colname)
select @counter = @counter + 1
end
end
else
begin
select @colid = 0
while (select min(colid) from #table_def where colid > @colid) != NULL
begin
select @colid = min(colid) from #table_def where colid > @colid
select @pmsg = @pmsg + convert(varchar(30),name)
from #table_def
where colid = @colid

if (select count(*) from #table_def) > 1


and exists (select * from #table_def where colid>@colid)
begin
select @pmsg = @pmsg + ","
end
end
end

select @msg = @msg + @pmsg + '" -u' + @uuc + ' -U' + @uuc + ' -t' + @tabname + ' -V -k' + @pmsg
print @msg

truncate table #table_def

fetch cursor_tabs into @tabid,@uu


select @uuc = user_name(@uu)
setuser
end
GO
close cursor_tabs
GO
deallocate cursor cursor_tabs
GO
EOF

1.3 Performance monitoring
The performance units to measure/monitor for effective monitoring are described in the following subsections.

1.3.1 Latency
The difference (generally measured in seconds) between "work" done in the primary database and
in the replicate database. Many methods can be used to determine the difference, based on
how exactly the latency is defined. The latency can be calculated for a single transaction, a
batch job or an entire database system.

1.3.1.1 Rs_lastcommit –
The replication server maintains this table (i.e. "rs_lastcommit") in every replicate
database; it stores the most recently committed transaction from each specific
source/primary site. This is not really the best method, since the timings specified in
the table are generally not accurate, and since it reports only the last committed
transaction it is difficult to link it to the respective primary transaction. Also, in a
large, complex environment it is difficult to identify the latency of various batch
processes or single transactions. For example, below is the output from one of the replicate
databases:

isql –U<user> -S<Replicate_DBSrv> -P<pwd> -D<DBName>


select origin_time,dest_commit_time from rs_lastcommit
go

origin_time dest_commit_time
Jan 28 2009 2:02PM Jan 28 2009 2:52PM
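
The same information can be read directly as a number of seconds (a sketch; origin identifies
the source database and, like the two time columns, is a column of rs_lastcommit):

isql -U<user> -S<Replicate_DBSrv> -P<pwd> -D<DBName>
1> select origin, datediff(ss, origin_time, dest_commit_time) from rs_lastcommit
2> go

For the row shown above this yields 3000 seconds (50 minutes), with the accuracy caveats
already mentioned.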

1.3.1.2 Heartbeat –
This is a Sybase Central feature to monitor latency in a replication system. It creates a
replication-enabled table (called "rsm_heartbeat") and modifies the table at a frequent
interval. It presents latency in a nice graphical form. Restrictions include:
• Must use Sybase Central (i.e. must be connected to both primary and replicate)
• The latency measure is good for single-row updates

To configure Heartbeat using Sybase Central, select the database connection (i.e. the
primary database connection for which the heartbeat needs to be configured), right-click and
select "Heartbeats". Complete detailed steps are available in the "Help" section of the RS
plug-in.

1.3.1.3 Manually managed ping/time table(s) –
User-defined table(s) can be created with columns defaulting to the current time on the respective
database servers. Inserts into these tables can be done before/after/during a workload (i.e. based
on the application/batch job), and latency can be derived by comparing the values in the
tables. For example, a table can be defined as follows in the primary database:

isql –U<user> -S<PDBSrv> -P<pwd> -D<DBName>


Create table PRLREP1_timer
( daats_id int,
p_dt datetime default getdate()
)
Go
Create unique clustered index PRLREP1_timer_idx01 on
PRLREP1_timer (daats_id)
Go
Grant all on PRLREP1_timer to public
go

On replicate database create similar table as mentioned below

isql –U<user> -S<RDBSrv> -P<pwd> -D<DBName>

Create table PRLREP1_timer
( daats_id int,
p_dt datetime,
r_dt datetime default getdate()
)
Go
Create unique clustered index PRLREP1_timer_idx01 on
PRLREP1_timer (daats_id)
Go
Grant all on PRLREP1_timer to public
go

Once the replication setup is completed for the table created above, the first two columns will be
replicated from the primary values.

Now, in order to calculate latency, insert values into the table before and after a large
batch of transactions:

isql –U<user> -S<PDBSrv> -P<pwd> -D<DBName>


Insert into PRLREP1_timer (daats_id) select isnull(max(daats_id), 0) + 1 from PRLREP1_timer
Go
/* EXECUTE BATCH PROCESS */
Insert into attach ….
Update attach set …
……
Insert into attach ….
Update attach set …
Go
Insert into PRLREP1_timer (daats_id) select isnull(max(daats_id), 0) + 1 from PRLREP1_timer
go

After the batch load is completed and replicated, following SQL can be used (on
replicate database) to calculate latency

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>

Select datediff(ss, min(p_dt), max(r_dt)) from PRLREP1_timer
go

Clocks for primary and replicate site must be synchronized to measure latency
effectively.
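
Per-marker latency can also be broken out row by row, using the column names from the tables
created above:

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>
1> select daats_id, datediff(ss, p_dt, r_dt) from PRLREP1_timer order by daats_id
2> go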

1.3.1.4 Rs_ticket –
An rs_ticket can be thought of as a message which travels from the primary database to the
replicate database, hopping through the following replication threads:
• EXEC
• DIST
• DSI
At each hop (i.e. each time the traveling message is handled by a particular replication
thread), a timestamp is appended to the message. Once the message arrives at its destination
(i.e. the replicate database), the stored procedure "rs_ticket_report" can be modified to append
a final timestamp and store the entire message in a user-defined table. The user-defined
table can then be used for further analysis (i.e. measuring latency).

The rs_ticket_report function string must be enabled by modifying the replicate


connection

isql –U<user> -S<RepSrv> -P<pwd>


1> alter connection to servername.databasename set
'dsi_rs_ticket_report' to 'on'
2> go
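
The modified rs_ticket_report procedure below stores each ticket in a user-defined results
table, daats_tkt. A minimal definition of that table is sketched here (the column name is an
assumption; any single varchar(255) column will do), with insert permission granted so the
maintenance user can write to it:

isql -U<user> -S<RDBSrv> -P<pwd> -D<DBName>
1> create table daats_tkt (ticket_msg varchar(255))
2> go
1> grant insert on daats_tkt to public
2> go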

On the replicate site, the stored procedure "rs_ticket_report" can be modified as follows:

isql –U<user> -S<RDBSrv> -P<pwd> -D<DBName>


create procedure rs_ticket_report
(@rs_ticket_param varchar(255))
as
begin
set nocount on

declare @new_cmd varchar(255),
        @c_time  datetime,
        @c_secs  numeric(6,3)

select @c_time = getdate()
select @c_secs = datepart(millisecond, @c_time)
select @c_secs = datepart(second, @c_time) + @c_secs/1000
select @new_cmd =
       @rs_ticket_param + ";RDB(" + db_name() + ")="
       + convert(varchar(2), datepart(hour, @c_time))
       + ":" + convert(varchar(2), datepart(minute, @c_time))
       + ":" + convert(varchar(6), @c_secs)

insert daats_tkt values (@new_cmd)
end

On the primary site, execute the following to gather performance-related data:

isql –U<user> -S<PrimaryASE> -P<pwd> -D<PrimaryDB>


1> exec rs_ticket 'BEGIN BATCH PROCESS'
2> EXECUTE BATCH PROCESS
3> exec rs_ticket 'COMPLETE BATCH PROCESS'
4> go

On replicate site ..

isql –U<user> -S<RepASE> -P<pwd> -D<RepDB>


select * from daats_tkt
go

Output should look like…


####################################################################
V=1;H1=BEGIN BATCH PROCESS;PDB(pdsdbs1)=09:51:49.180;
EXEC(29)=09:51:49.0;B(29)=43690;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prs
dbs1)=09:51:55.413
V=1;H1=COMPLETE BATCH PROCESS;PDB(pdsdbs1)=09:51:49.193;
EXEC(29)=09:51:49.0;B(29)=44894;DIST(20)=09:51:52.0;DSI(27)=09:51:55.0;RDB(prs
dbs1)=09:51:55.413
####################################################################
Understanding the output from the “rs_ticket” process

• V – Version number
• H – Header information; the string passed to "rs_ticket" at the primary site
• PDB – Primary database name and the time (from the host clock) rs_ticket was
executed
• EXEC – spid of the user who executed rs_ticket at the primary, and the time the ticket
passed through the EXEC thread
• B – Total bytes received from the Replication Agent, and the spid number. In this case total
bytes received = 43690
• DIST – spid number (shown in "admin who") and the time rs_ticket passed through
the DIST
• DSI – spid number (shown in "admin who") and the time rs_ticket passed through
the DSI
• RDB – Replicate database name and the time rs_ticket_report was called to add the
message to the results table (i.e. daats_tkt)

As shown in the above output, latency can be calculated as the difference between the time the
"rs_ticket" message arrives at the replicate database and the time it was sent from the
primary database, i.e. approximately six seconds (09:51:55.413 - 09:51:49.180 ≈ 6.2 seconds).

1.3.2 Throughput –
Throughput can be calculated by measuring the latency (using the methods defined above) for a
given amount of "work" (for example 1000 transactions, or a total number of bytes transferred).

Many commands can be used to find out how much data (i.e. in bytes) has been processed
through the replication server. For example, in "admin who,sqm" the column "Bytes" shows
the total number of bytes written. First run "admin who,sqm" and look at the Bytes column
for the connection of interest (the output below shows only three columns; the other columns
are not shown):

State Info Bytes


Awaiting Message 16777372:0 PRLRMDBS1 268793702
Awaiting Message 615:0 PRSDBS1.DBSvmst 24000982
Awaiting Message 606:0 PRSDBS1.DBSglep -926256804
Awaiting Message 379:1 DBS.DBSvmst_LC 81859762
Awaiting Message 379:0 DBS.DBSvmst_LC 0
Awaiting Message 378:1 DBS.DBSglep_LC -1344189992
Awaiting Message 378:0 DBS.DBSglep_LC 0
Awaiting Message 248:1 DBS.DBSwact_LC -1605776577
Awaiting Message 248:0 DBS.DBSwact_LC 0
Awaiting Message 236:1 DBS.DBSarpc_LC 68229190
Awaiting Message 236:0 DBS.DBSarpc_LC 0
Awaiting Message 231:1 DBS.DBSallc_LC 120230729
Awaiting Message 231:0 DBS.DBSallc_LC 0
Awaiting Message 201:0 257502
PDSREP1.PRLDBS1B_RSSD
Awaiting Message 201:1 1330086096
PDSREP1.PRLDBS1B_RSSD

Now process the batch load as mentioned in previous sections (i.e. to calculate latency)
and at the completion of batch replication execute “admin who,sqm” again

State Info Bytes


Awaiting Message 16777372:0 PRLRMDBS1 268793702
Awaiting Message 615:0 PRSDBS1.DBSvmst 24002982
Awaiting Message 606:0 PRSDBS1.DBSglep -926256804
Awaiting Message 379:1 DBS.DBSvmst_LC 81859762
Awaiting Message 379:0 DBS.DBSvmst_LC 0
Awaiting Message 378:1 DBS.DBSglep_LC -1344189992
Awaiting Message 378:0 DBS.DBSglep_LC 0
Awaiting Message 248:1 DBS.DBSwact_LC -1605745641

Awaiting Message 248:0 DBS.DBSwact_LC 0
Awaiting Message 236:1 DBS.DBSarpc_LC 68229190
Awaiting Message 236:0 DBS.DBSarpc_LC 0
Awaiting Message 231:1 DBS.DBSallc_LC 120230729
Awaiting Message 231:0 DBS.DBSallc_LC 0
Awaiting Message 201:0 257502
PDSREP1.PRLDBS1B_RSSD
Awaiting Message 201:1 1330088366
PDSREP1.PRLDBS1B_RSSD

As shown, during the batch load a total of 2000 bytes (24002982 - 24000982) were processed
by the replication server for that connection. To calculate throughput, divide the total bytes
transferred (i.e. 2000 bytes) by the total latency (roughly 6 seconds from the previous section),
i.e. about 330 bytes/sec.

Another method to calculate the total bytes transferred (in order to calculate throughput) is
the "admin statistics,SQM,BytesWritten" command. Make sure to reset the counters before
starting the large batch in the primary database by executing "admin statistics,reset", as
sketched below.
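
For example, reusing only the commands named above (a sketch; the exact output layout
depends on the Replication Server version):

isql -U<user> -S<RepSrv> -P<pwd>
1> admin statistics, reset
2> go
(run the batch load on the primary, wait for it to replicate, then)
1> admin statistics, SQM, BytesWritten
2> go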

1.3.3 Statistics monitoring (i.e. Monitor counters) –


Monitors and counters can be used to monitor replication in a more detailed fashion, including
but not limited to:
• Finding "ignored" transactions
• Managing segments
• Finding transaction sizes
• Finding command sizes
• Calculating throughput and latency
• Calculating the read/write rate of the stable device
• Helping to configure replication optimally (i.e. sizing the SQT cache, parallel DSI etc.)

Monitoring counters for replication server in version 12.6 was done using following two
commands

isql –U<user> -S<RepSrv> -P<pwd>


1> admin who
2> go
.
.
1> admin statistics
2> go

Counters provide details related to the following replication modules:
• CM
• DIST
• DSI
• DSIEXEC
• REPAGENT
• RSI
• SQM
• SQT
• STS

The counters can be distinctly identified for each instance (i.e. occurrence) of a module.
Counters for a single-instance module can be identified by the module name
(for example STS, CM). For multi-instance modules they can be identified in the following two ways:
• Module name and instance ID (i.e. LDBID, DBID), for example RSI/DSI-S/DIST
• Module name, instance ID and instance value, for example SQT/DSI-EXEC

Replication monitors and counters can be assigned to one of the following groups based on
their output (i.e. the characteristics of the generated statistics):

• Observers – count the number of occurrences of an event
• Monitors – report their current value
• Counters – collect statistics

Additionally, each replication monitor and counter can have one or more of the following
statuses, which determine how their end results are calculated:
• CNT_SYSMON – these counters can be used by the "admin statistics, sysmon" command
• CNT_MUST_SAMPLE – their results are always in sampled form
• CNT_NO_RESET – cannot be reset (i.e. initialized)
• CNT_DURATION – counters which measure durations
• CNT_KEEP_OLD – counters which keep their current and previous values
• CNF_CONFIGURE – counters which keep the current value of a replication configuration
parameter

1.3.3.1 Replication counters version 12.6 –


Use the stored procedure "rs_helpcounter" (in the related RSSD) to find detailed information on
each counter.
In this version, counters are categorized into the following types:
• Total
• Last
• Max
• Avg

The following methods can be used to monitor the counters.

Using "admin statistics" – A quick and easy way to monitor the replication counters. For
example, simply executing "admin statistics,sysmon" will list all non-intrusive counters in
this category. A detailed explanation of those counters can be obtained by executing
"rs_helpcounter sysmon".
Configuring the replication server to collect counters – This is the recommended method to
monitor the replication counters in order to derive performance and other useful statistics
for the replication system. Below are the high-level steps for the setup.
Start sampling for all types of counters (intrusive/non-intrusive) – The following commands
can be used:
configure replication server set "stat_sampling" to "ON"
admin stats_intrusive_counter,"ON"

Start collecting the counters into the RSSD (make sure to monitor the RSSD database) –
This step will collect the counters into the following RSSD tables at a regular interval:
• rs_statcounters – details about all counters
• rs_statdetail – the collected counter values
• rs_statrun – statistical information for each collection (i.e. flush to the RSSD)

The following commands can be used to set up the collection:

1. configure replication server set "stat_flush_rssd" to "ON"
2. configure replication server set "stat_reset_after_flush" to "ON"
3. configure replication server set "stat_daemon_sleep_time" to "<seconds>"
4. admin statistics, reset (this command is optional, to reset the counters manually at
any point in time)

Filter the collection – Once sampling is started, the replication server collects many counters.
Filters can be added so that only specific counters (for the required replication modules) are
collected into the RSSD tables. The following commands can be used to set up the filters:

1. admin stats_config_module
2. admin stats_config_connection
3. admin stats_config_route
4. admin statistics,flush_status

1.3.3.2 Replication counters version 15.0 –


Unlike version 12.6, in this version counters are not categorized into various types (i.e.
Total, Last, Max, Avg); instead, every counter collects:
• Number of observations
• Total of the observations
• Last observed value
• Max observed value

Also, starting with version 15.0 there are no intrusive counters. From this version,
collecting/monitoring replication monitors/counters can be achieved in very simple steps.
Basically, using the command "admin stats", it is required to define:
• What statistics to collect
• The final destination of the collected counters (i.e. screen/RSSD)
• How long to collect (i.e. the sampling period and number of observations)

After executing the command "admin stats", the user is returned to the replication server's
command prompt, from where the user can exit the session or continue with other work.
Later, "admin stats,status" can be used to view the progress of a previously executed
"admin stats" collection, and "admin stats,cancel" can be used at any time to stop the
collection.

Once the counters are collected, the RSSD tables (mentioned above) can be queried for
further analysis of the replication system. The newly introduced RSSD stored procedure
"rs_dump_stats" dumps all collected counters in a CSV format, which can then be loaded
into an Excel sheet for further analysis (note: an Excel sheet may be limited to about 65K
rows); a usage sketch follows.
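
For example (a sketch; it assumes rs_dump_stats can be run without parameters, and uses
isql's -o option to capture the output into a file for loading into a spreadsheet):

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD> -o rs_stats_dump.csv
1> exec rs_dump_stats
2> go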

1.3.3.3 Sp_sysmon –
Sybase ASE's stored procedure "sp_sysmon" also provides a section specific to the
Replication Agent, with detailed statistical information for each replication agent
configured.
Below is sample output of "sp_sysmon" (the command used to produce such a report is
sketched after this list).

The "Log Scan Activity" section reports, among other things:
• Replicated DDL activity
• CLRs – log records which were partially or fully rolled back
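
A report such as the one below can be produced as follows (a sketch; the one-minute sample
interval is arbitrary, and 'repagent' is the sp_sysmon section name used here to limit the
report to the Replication Agent section):

isql -U<user> -S<PrimaryDBSrv> -P<pwd>
1> sp_sysmon "00:01:00", repagent
2> go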

Replication Agent

-----------------

Replication Agent: DBSCOMMON


Replication Server: PRLDBS1G

per sec per xact count % of total


------------ ------------ ---------- ----------
Log Scan Summary
Log Records Scanned n/a n/a 9 n/a
Log Records Processed n/a n/a 1 n/a

Log Scan Activity


Updates n/a n/a 0 n/a
Inserts n/a n/a 0 n/a
Deletes n/a n/a 0 n/a
Store Procedures n/a n/a 0 n/a
DDL Log Records n/a n/a 0 n/a
Writetext Log Records n/a n/a 0 n/a
Text/Image Log Records n/a n/a 0 n/a
CLRs n/a n/a 0 n/a

In the "Transaction Activity" section, you can verify (approximately) that the total number of
transactions committed plus the total number of transactions aborted equals the total number
of transactions opened.

Transaction Activity
Opened n/a n/a 1 n/a
Commited n/a n/a 1 n/a
Aborted n/a n/a 0 n/a
Prepared n/a n/a 0 n/a
Maintenance User n/a n/a 0 n/a

Log Extension Wait


Count n/a n/a 2 n/a
Amount of time (ms) n/a n/a 14133 n/a
Longest Wait (ms) n/a n/a 14133 n/a
Average Time (ms) n/a n/a 7066.5 n/a

Schema Cache Lookups


Forward Schema
Count n/a n/a 0 n/a
Total Wait (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Time (ms) n/a n/a 0.0 n/a
Backward Schema
Count n/a n/a 0 n/a

Total Wait (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Time (ms) n/a n/a 0.0 n/a

Truncation Point Movement


Moved n/a n/a 0 n/a
Gotten from RS n/a n/a 1 n/a

Connections to Replication Server


Success n/a n/a 0 n/a
Failed n/a n/a 0 n/a

Network Packet Information


Packets Sent n/a n/a 1 n/a
Full Packets Sent n/a n/a 0 n/a
Largest Packet n/a n/a 175 n/a
Amount of Bytes Sent n/a n/a 175 n/a
Average Packet n/a n/a 175.0 n/a

I/O Wait from RS


Count n/a n/a 2 n/a
Amount of Time (ms) n/a n/a 0 n/a
Longest Wait (ms) n/a n/a 0 n/a
Average Wait (ms) n/a n/a 0.0 n/a

--------------------------------------------------------------------------------

1.3.3.4 Measuring Replication Agent –


Various pieces of information relating to how well the replication agent is keeping up can be
measured as follows.
• The beginning of the transaction log can be located in the master..sysdatabases table (column =
logptr):

isql -U<user> -S<PrimaryDBSrv> -P<pwd>


1> select logptr from master..sysdatabases where name =
"DBSCOMMON"
2> go

The output looks like:


logptr
-----------
19844

• The truncation points can be queried from the master..syslogshold table (column = page).

isql -U<user> -S<PrimaryDBSrv> -P<pwd>


1> select page from master..syslogshold
2> go

page
-----------
19845

• The current position of the Replication Agent can be found by executing
"sp_help_rep_agent <dbname>" (column = current marker).

isql -U<user> -S<PrimaryDBSrv> -P<pwd>


1> sp_help_rep_agent "DBSCOMMON"
2> go

Replication Agent Recovery status
 dbname      connect dataserver  connect database  status      rs servername  rs username
 ----------  ------------------  ----------------  ----------  -------------  ------------
 DBSCOMMON   PDSDBS1             DBSCOMMON         not active  PRLDBS1G       PRLDBS1G_ra

Replication Agent Process status
 dbname      spid  sleep status  retry count  last error
 ----------  ----  ------------  -----------  ----------
 DBSCOMMON   18    end of log    0            0

Replication Agent Scan status
 dbname      start marker  end marker  current marker  log recs scanned  oldest transaction
 ----------  ------------  ----------  --------------  ----------------  ------------------
 DBSCOMMON   (19845,18)    (19845,22)  (19845,22)      0                 (-1,0)

• The last page of the log can be calculated using "dbcc
pglinkage(<dbid>,<cur_pg>,0,0,0,1)", where <cur_pg> can be any page in the log
(e.g. the beginning of the log, the primary/secondary truncation point, or the current replication position).
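
A sketch of the invocation ("dbcc traceon(3604)" routes dbcc output to the client session;
<dbid> and <cur_pg> are placeholders for the database id and a known log page):

isql -U<user> -S<PrimaryDBSrv> -P<pwd>

1> dbcc traceon(3604)
2> go
1> dbcc pglinkage(<dbid>, <cur_pg>, 0, 0, 0, 1)
2> go

The output looks like: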

Object ID for pages in this chain = 8.


End of chain reached.
2 pages scanned. Object ID = 8. Last page in scan = 19846.

DBCC execution completed. If DBCC printed error messages, contact a user with
System Administrator (SA) role.

1.3.3.5 Measuring Queue (Inbound/Outbound) –

Use "admin who, sqm" and the difference between "Last Seg.Block" and "Next Read" to see
how the Replication Server is processing its queues. If "Next Read" is greater than or
equal to (>=) "Last Seg.Block", there is nothing left in that queue for the Replication
Server to process.
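
For example (a sketch):

isql -U<user> -S<RepSrv> -P<pwd>

1> admin who, sqm
2> go

Then compare the "Next Read" and "Last Seg.Block" columns for each inbound and outbound queue.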

1.4 Alerting/Notification –
User-defined scripts (to take appropriate action, send email, etc.) can be initiated by configuring
Sybase Replication Server Manager (RSM, via Sybase Central) for specific replication
events/conditions (e.g. DSI DOWN, SERVER DOWN). A similar setup can also be accomplished
using Sybase RMS (Replication Monitoring Services).

[Diagram: generally used alerting/notification methods]

1.4.1 RSM Event monitoring

Sybase Central can be used to instruct an RSM server (which may reside on a remote host) to initiate
user-defined scripts (located on the same host where RSM is running) for specific server
events/conditions. An event is a change that occurs in a replication system managed by a
specific RSM Server.
In order to configure RSM event monitoring:

• Install and configure the RSM server (using the "rsmgen" utility located in
$SYBASE/$SYBASE_RSM/install).
• Using Sybase Central (with the Replication plug-in), connect to the newly configured RSM
server.
• Add the primary, replicate, replication, and RSSD servers to the RSM server using Sybase Central
(make sure to add the RSSD server before adding the Replication Server).
• Then right-click the RSM server and select "Server Events". The next dialog box shows six
different event types (i.e. six different tabs):
o Server events – events specific to a change in state of a monitored server (can be
an ASE or Replication Server). Select "RSM Domain" for the specific "Server". The following
states are available for server-event monitoring:
▪ Active - indicates a server is functioning normally. This option is useful if
you want to send an e-mail or pager message when a server begins
functioning normally after experiencing a problem.
▪ Quiesced - indicates a server is quiesced. If you use RSM to quiesce a
Replication Server, the Replication Server state becomes Suspect rather
than Quiesced because the LTMs are suspended.
▪ Suspect - indicates a server is still running but is experiencing a
problem.
▪ Hung - indicates RSM cannot connect to the server because of a
connection timeout.
▪ Shutdown - indicates the shutdown command was used to shut down a
Replication Server or an LTM.
▪ Dead - indicates a server was shut down using a method other than the
shutdown command; for example, you used the isql command to shut
down a server.
▪ Unknown - indicates RSM cannot communicate with another server
because of a connectivity problem.
▪ Invalid - indicates RSM encountered an error in critical information files,
such as a missing or corrupt stored procedure in the RSSD of a
monitored Replication Server.
Once an event is selected from the above list, select the "Servers" for which the
event needs to be monitored.
o Route events: events specific to changes in the status of a route. Select the "RSM
Domain" and "Replication Server" for the specific route.
o Connection events: events specific to changes in the status of a connection. Select
the "RSM Domain" and "Replication Server" for the specific connection.
o Partition events: events specific to partition thresholds (monitors partitions and
raises an event when a partition's size equals or exceeds a specified threshold)
and partition state changes (raises an event when a partition's state changes to
ONLINE, OFFLINE, or DROPPED). Select the "RSM Domain" and "Replication Server" for
the partition.
o Queue events: events specific to queue thresholds (monitors queues for specified
Replication Servers and raises an event when a queue's size equals or exceeds
the specified threshold) and queue latency (the amount of time that the first block
has remained at the beginning of each stable queue). Select the "RSM Domain" and
"Replication Server" for the queue.
o Database events: events specific to replication latency. Select the "RSM Domain"
and the "database" (select multiple databases using the Ctrl key).
Finally, make sure to provide a "Script Location" for each selected event; a minimal
handler sketch follows this list. The script can, for example:
▪ Send an email notification
▪ Send a page
▪ Write to an error log file
▪ Insert the error into a selected database
▪ Add a partition
▪ Restart a server
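
Below is a minimal sketch of such a handler script, written in the same csh style as the
monitoring scripts in the next section. The script name, log file path, and email address are
placeholders, and it is assumed that RSM passes the event details as command-line arguments;
verify the argument convention for your RSM version.

rsm_event_handler.csh
#!/bin/csh

#################################################################
# Sample RSM event handler (a sketch).
# Register this script as the "Script Location" for an event.
# RSM is assumed to pass the event details as command-line
# arguments; verify this for your RSM version.
#################################################################

set EMAIL_LIST="xxxx@mms.mycingular.com"
set LOGFILE=/var/tmp/rsm_events.log

# Record the event in a local log file
echo "`date` : RSM event : $argv" >> $LOGFILE

# Notify the DBA team by email
echo "RSM event raised: $argv" | mailx -s "RSM event notification" $EMAIL_LIST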

1.4.2 Scripts
Shell or Perl scripts can also be used on their own to monitor the replication system. For
example, the sample scripts below can be used to monitor the primary database server, the
replicate database server, and the Replication Server.

wrap_rep.csh
#!/bin/csh

#################################################################
# This script is a wrapper around the check scripts below.
# It can be placed in a startup script.
# The check_srv and check_rep_comp scripts are assumed to be
# on the PATH.
#################################################################

# Exit if another instance of this wrapper is already running
set ninst = `ps -ef | grep -v grep | grep -c "$0"`
if ($ninst > 1) exit

while (1)
    check_srv PDSDBS1
    check_srv PRSDBS1
    check_rep_comp PRLDBS1A
    sleep 300
end

check_srv.csh
#!/bin/csh

#################################################################
# This script checks that a connection to the ASE / Rep Server
# passed as $1 can be established.
#################################################################

set EMAIL_LIST="xxxx@mms.mycingular.com"
set usr=
set pass=

## Check whether a connection to the server can be established

date > /tmp/$$

isql -U$usr -w132 -S$1 <<EOF >> /tmp/$$
$pass
go
EOF

## If not, send an email and loop until a connection to the
## server can be established

if ("$status" != "0") then
    cat /tmp/$$ | mailx -s "PL Check $1" $EMAIL_LIST
endif

isql -U$usr -w132 -S$1 <<EOF
$pass
go
EOF
while ($status != 0)
    sleep 300
    isql -U$usr -w132 -S$1 <<EOF
$pass
go
EOF
end

check_rep_comp.csh
#!/bin/csh

#################################################################
# This script checks that all components of the Rep Server
# passed as $1 are up and running ("admin health" reports
# SUSPECT if any component is down).
#################################################################

set EMAIL_LIST="xxxx@mms.mycingular.com"
set usr=
set pass=

## Check whether all components of the Rep Server are up and running

check_srv $1
isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF

## If any component is down, send an email with the last 100 lines
## of the Rep Server errorlog, then loop until no component is
## reported as suspect

if ("$status" == "0") then
    tail -100 $SYBASE/REP-12_6/install/PRLDBS1A.log | mailx -s "PL Check RepSrv" $EMAIL_LIST
endif

check_srv $1
isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF
while ($status == 0)
    sleep 300
    check_srv $1
    isql -U$usr -w132 -S$1 <<EOF | grep -i suspect
$pass
admin health
go
EOF
end

1.5 General Troubleshooting

1.5.1 Skipping transaction –

This is needed when the DSI goes down due to a bad transaction (i.e. a transaction that errors out
in the replicate database). In order to continue (i.e. to ignore the current bad transaction), the
connection can be resumed using the following command:

isql -U<user> -S<RepSrv> -P<pwd>


1> resume connection to <replicate_dataserver>.<replicate_db> skip tran
2> go

Replication Server moves the first bad transaction into the exceptions log (located in the RSSD) and
continues (i.e. resumes the connection that was down/suspended) with the next transaction in the queue.
To view the skipped transaction, log into the respective RSSD (the Replication Server command "admin
rssd_name" can be used to find it) and use the "rs_helpexception" stored procedure. Once the
transaction has been reviewed, it can be deleted from the exceptions log using the "rs_delexception"
stored procedure.

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>


1> rs_helpexception
2> go
.
.
-- Look for Xact which was logged most recently (i.e. Xact_id)
.
.
1> rs_helpexception <Xact_id>,v
2> go
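
Once the skipped transaction has been reviewed, it can be removed from the exceptions log in the
same RSSD session (a sketch; "rs_delexception" takes the transaction id reported by
"rs_helpexception"):

1> rs_delexception <Xact_id>
2> go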

1.5.2 Dumping stable queue –

The entire queue, or part of a queue (by providing a particular segment, block, and count), can be
dumped to a file or to the screen using the "sysadmin dump_queue" command (a sketch follows).
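
A sketch of the usage is shown below. The queue number and queue type can be taken from
"admin who, sqm"; the argument order and any optional flags should be verified against the
Replication Reference Manual for your version. "sysadmin dump_file" is used here (with an
illustrative file name) to direct the dump to a file instead of the Replication Server errorlog.

isql -U<user> -S<RepSrv> -P<pwd>

1> sysadmin dump_file, "queue_dump.out"
2> go
1> sysadmin dump_queue, <q_number>, <q_type>, <seg>, <blk>, <cnt>
2> go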

1.5.3 Disabling secondary truncation point –

To avoid the transaction log filling up in the primary database during a replication outage, the
following commands can be executed at the primary database:

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>

1> sp_stop_rep_agent <DBName>
2> go
.
.
1> dbcc settrunc(ltm,ignore)
2> go

1.5.4 Enabling secondary truncation point –

After the replication problem has been fixed, or after the primary database has been refreshed from a
backup, the following commands can be used to re-enable the secondary truncation point:

isql -U<user> -S<RSSDSrv> -P<pwd> -D<RSSD>

1> rs_zeroltm <PrimaryDBSrv>,<DBName>
2> go

isql -U<user> -S<PrimaryDBSrv> -P<pwd> -D<PrimaryDB>

1> dbcc settrunc(ltm,valid)
2> go
.
.
1> sp_start_rep_agent <DBName>
2> go

References
• www.sybase.com
• Replication Reference Manual
• Replication Administration Guide
• Replication Troubleshooting Guide
• Replication Server Heterogeneous Replication Guide
