Вы находитесь на странице: 1из 5

HOW TO TROUBLESHOOT DISTRIBUTED TRANSACTIONS ============================================ You need to resolve a distributed transaction error because of errors received.

But before you try to implement any other steps, please diagnose whether it is or it is not a REAL distributed transaction. STEPS TO TROUBLESHOOT DISTRIBUTED TRANSACTIONS ---------------------------------------------1. Check the alert.log for errors. 2. Make sure that the network is up and all db_links are valid and operational. V$DBLINK AND GV$DBLINK show the db_links currently used by distributed transactions: SVRMGR> desc v$dblink Column Name Null? Type ------------------------------ -------- ---DB_LINK VARCHAR2(128) OWNER_ID NUMBER LOGGED_ON VARCHAR2(3) HETEROGENEOUS VARCHAR2(3) PROTOCOL VARCHAR2(6) OPEN_CURSORS NUMBER IN_TRANSACTION VARCHAR2(3) UPDATE_SENT VARCHAR2(3) COMMIT_POINT_STRENGTH NUMBER SVRMGR> desc gv$dblink Column Name Null? Type ------------------------------ -------- ---INST_ID NUMBER DB_LINK VARCHAR2(128) OWNER_ID NUMBER LOGGED_ON VARCHAR2(3) HETEROGENEOUS VARCHAR2(3) PROTOCOL VARCHAR2(6) OPEN_CURSORS NUMBER IN_TRANSACTION VARCHAR2(3) UPDATE_SENT VARCHAR2(3) COMMIT_POINT_STRENGTH NUMBER 3. Run the following query against DBA_2PC_PENDING: SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID, STATE, MIXED, HOST, COMMIT# FROM DBA_2PC_PENDING WHERE LOCAL_TRAN_ID = '??.'; LOCAL_TRAN_ID is the Local transactionID (number that is given when error is reported). If LOCAL_TRAN_ID = GLOBAL_TRAN_ID, it means that this site is the Global Coordinator, i.e. where the transaction originated from. You can also get LOCAL_TRAN_ID from the alert.log. 4. Run the following query against DBA_2PC_NEIGHBORS view on all Nodes: SELECT LOCAL_TRAN_ID, IN_OUT, DATABASE, INTERFACE FROM DBA_2PC_NEIGHBORS; 5. Get the values for COMMIT_POINT_STRENGTH init.ora parameter on all Nodes. If you are not aware of what other Nodes are participating in the distributed transaction, query the DBA_2PC_NEIGHBORS view. The Node with COMMIT_POINT_STRENGTH that has highest value in init.ora has to commit first. 6. Check the STATE column from DBA_2PC_PENDING. If STATE column states COMMIT, it means the local database has successfully

committed. This also means you most likely do not have to force anything on that Node. Or if other Nodes have not committed and this particular Node happens to be a Global Coordinator - you need to note COMMIT# i.e. SCN number from DBA_2PC_PENDING view, as this is THE SCN# you will need to use if you need to perform COMMIT FORCE. 8. Take the GLOBAL_TRAN_ID and COMMIT# (SCN number from DBA_2PC_PENDING) and compare it to the value on the other nodes. On Nodes where there are no entries for this COMMIT# and transactionID in DBA_2PC_PENDING, RECO process has resolved the problem and we don't have to do anything. Note: RECO is always automatically started whenever Distributed Transactions are enabled. If you issue a UNIX command : ps - ef | grep -i reco and see RECO running, distributed transactions are correctly enabled. If you combine this output with values of the init.ora parameter DISTRIBUTED_TRANSACTONS (which should be higher than 0 for distributed environment), you may safely say that the customer is running in a distributed environment even if this particular customer has not set it up. 9. If the STATE (from DBA_2PC_PENDING view) is PREPARED, then this Node has not completed the transaction. Take GLOBAL_TRAN_ID (from DBA_2PC_PENDING) and COMMIT# (SCN number) and compare it to the value on the other nodes. You need to also check DBA_2PC_NEIGHBORS and see whether there are any other Nodes that reference the same transactionID in their DBA_2PC_PENDING. If you see it referenced, it means there are children and they need to be resolved (either committed or rollback) as well. Please take 'children' i.e. Nodes that other Nodes (not Global Coordinator) into account. Note: for clarification on terminology please see [NOTE:13229.1] 10. If there are no children, it is safe to manually interfere and either commit force or rollback force. For global integrity, use the same COMMIT# (SCN) when you force the transaction. To force the transaction please see SYNTAX NOTES below. 11. After the transaction is forced, make sure that both DBA_2PC_PENDING and DBA_2PC_NEIGHBORS data dictionary views are empty. If for some reason it is necessary to clean those views, use packaged procedure DBMS_TRANSACTION.purge_lost_db_entry. See [NOTE:1012842.102] ORA-2019 ORA-2058 ORA-2068 ORA-2050: FAILED DISTRIBUTED TRANSACTIONS Note: Be very careful with this procedure as it is always more reliable and more consistent to let RECO resolve distributed transaction conflicts. Please attempt to use DBMS_TRANSACTION.purge_lost_db_entry only as a last resort after every other measure was implemented including restarting the database to resolve the locking conflict. 12. If even then there are entries in the views and errors still appear, please check for [BUG:684157]. This is fixed in 8.0.6 and higher. SYNTAX NOTES: ============== COMMIT FORCE command --------------------

Example of syntax where highest committed SCN is 88123887 (from dba_2pc_pending) and local transaction ID is 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> COMMIT FORCE 'your local transactionID on this node', 'highest SCN from already committed site'; SVRMGR> COMMIT FORCE '1.13.5197', '88123887'; ROLLBACK FORCE command ---------------------Example of syntax where local transactionID is 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> ROLLBACK FORCE 'your local transactionID on this node'; SVRMGR> ROLLBACK FORCE '1.13.5197'; PURGING VIEWS: -------------Example of syntax for lost local transactionID 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> Execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY ('1.13.5197'); NOTE: Run as SYS where 1.13.5197 is the local_tran_id Example of syntax for mixed local transaction ID 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> Execute DBMS_TRANSACTION.PURGE_MIXED ('1.13.5197'); NOTE: Run as SYS Run PURGE_MIXED Procedure only if significant reconfiguration has occurred so that automatic recover (RECO) cannot resolve the transaction. Example of appropriate cases when it can be used: - Total loss of the remote database, or reconfiguration of software resulting in loss of two-phase commit capability. SCN RECOVERY STEPS -----------------Complete Recovery: 1. At the down site recover completely if possible (treat as regular recovery). 2. SCN will appear in the alert.log after recover is done on the crashed Node. Incomplete Recovery: 1. If time-based or cancel-based recovery was used on the crashed node, other sites must be placed back to the same point in time for global consistency. Get last SCN from the alert.log of the crashed node. 2. At each node perform a shutdown normal or immediate. 3. Take a cold backup. 4. Restore the control file if necessary. 5. Restore the last backup of all datafiles along with archived redo logs. 6. Choose which tool to use to perform SCN recovery - either Server Manager line mode or RMAN.

7. Connect internal, startup mount, check status from v$datafile to make sure all datafiles are online. Issue 'Alter database datafile '?/?/?' online;', for each datafile with status offline, to bring it online. 8. Issue the following command using the latest SCN from alert.log on the Node that had to be recovered: RECOVER DATABASE UNTIL CHANGE '1.13.5197'; NOTE: If for some reason (e.g.when issuing commit force command) automatic recovery (RECO process) needs to be disabled, use the following command: ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY; (will put RECO to sleep). Make sure to wake RECO up after that: ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY; HOW TO TROUBLESHOOT DISTRIBUTED TRANSACTIONS ============================================ You need to resolve a distributed transaction error because of errors received. But before you try to implement any other steps, please diagnose whether it is or it is not a REAL distributed transaction. STEPS TO TROUBLESHOOT DISTRIBUTED TRANSACTIONS ---------------------------------------------- 1. Check the alert.log for errors. 2. Make sure that the network is up and all db_links are valid and operational. V$DBLINK AND GV$DBLINK show the db_links currently used by distributed transactions: SVRMGR> desc v$dblink Column Name Null? Type ------------------------------ -------- ---- DB_LINK VARCHAR2(128) OWNER_ID NUMBER LOGGED_ON VARCHAR2(3) HETEROGENEOUS VARCHAR2(3) PROTOCOL VARCHAR2(6) OPEN_CURSORS NUMBER IN_TRANSACTION VARCHAR2(3) UPDATE_SENT VARCHAR2(3) COMMIT_POINT_STRENGTH NUMBER SVRMGR> desc gv$dblink Column Name Null? Type ------------------------------ -------- ---- INST_ID NUMBER DB_LINK VARCHAR2(128) OWNER_ID NUMBER LOGGED_ON VARCHAR2(3) HETEROGENEOUS VARCHAR2(3) PROTOCOL VARCHAR2(6) OPEN_CURSORS NUMBER IN_TRANSACTION VARCHAR2(3) UPDATE_SENT VARCHAR2(3) COMMIT_POINT_STRENGTH NUMBER 3. Run the following query against DBA_2PC_PENDING: SELECT LOCAL_TRAN_ID, GLOBAL_TRAN_ID, STATE, MIXED, HOST, COMMIT# FROM DBA_2PC_PENDING WHERE LOCAL_TRAN_ID = '??.'; LOCAL_TRAN_ID is the Local transactionID (number that is given when error is reported). If LOCAL_TRAN_ID = GLOBAL_TRAN_ID, it means that this site is the Global Coordinator, i.e. where the transaction originated from. You can also get LOCAL_TRAN_ID from the alert.log. 4. Run the following query against DBA_2PC_NEIGHBORS view on all Nodes: SELECT LOCAL_TRAN_ID, IN_OUT, DATABASE, INTERFACE FROM DBA_2PC_NEIGHBORS; 5. Get the values for COMMIT_POINT_STRENGTH init.ora parameter on all Nodes. If you are not aware of what other Nodes are participating in the distributed transaction, query the DBA_2PC_NEIGHBORS view. The Node with COMMIT_POINT_STRENGTH that has highest value in init.ora has to commit first. 6. Check the STATE column from DBA_2PC_PENDING. If STATE column states COMMIT, it means the local database has successfully committed. This also means you most likely do not have to force anything on that Node. Or if other Nodes have not committed and this particular Node happens to be a Global Coordinator - you need to note COMMIT# i.e. SCN number from DBA_2PC_PENDING view, as this is THE SCN# you will need to use if you need to perform COMMIT FORCE. 8. Take the GLOBAL_TRAN_ID and COMMIT# (SCN number from DBA_2PC_PENDING) and compare it to the value on the other nodes. On Nodes where there are no entries for this COMMIT# and transactionID in DBA_2PC_PENDING, RECO process has resolved the problem and we don't have to do anything. Note: RECO is always automatically started whenever Distributed Transactions are enabled. If you issue a UNIX command : ps - ef | grep -i reco and see RECO running, distributed transactions are correctly enabled. If you combine this output with values of the init.ora parameter DISTRIBUTED_TRANSACTONS (which should be higher than 0 for distributed environment), you may safely say that the customer is running in a distributed environment even if this particular customer has not set it up. 9. If the STATE (from DBA_2PC_PENDING view) is PREPARED, then this Node has not completed the transaction. Take GLOBAL_TRAN_ID (from DBA_2PC_PENDING) and COMMIT# (SCN number) and compare it to the value on the other nodes. You need to also check DBA_2PC_NEIGHBORS and see whether there are any other Nodes that reference the same transactionID in their DBA_2PC_PENDING. If you see it referenced, it means there are children and they need to be resolved (either committed or rollback) as well. Please take 'children' i.e. Nodes that other Nodes (not Global Coordinator) into account. Note: for clarification on terminology please see [NOTE:13229.1] 10. If there are no children, it is safe to manually interfere and either commit force or rollback force. For global integrity, use the same COMMIT# (SCN) when you force the transaction. To force the transaction please see SYNTAX NOTES below. 11. After the transaction is forced, make sure that both DBA_2PC_PENDING and DBA_2PC_NEIGHBORS data dictionary views are empty. If for some reason it is necessary to clean those views, use packaged procedure DBMS_TRANSACTION.purge_lost_db_entry. See [NOTE:1012842.102] ORA-2019 ORA-2058 ORA-2068 ORA2050: FAILED DISTRIBUTED TRANSACTIONS Note: Be very careful with this procedure as it is always more reliable and more consistent to let RECO resolve distributed transaction conflicts. Please attempt to use DBMS_TRANSACTION.purge_lost_db_entry only as a last resort after every other measure was implemented including restarting the database to resolve the locking conflict. 12. If even then there are entries in the views and errors still appear, please check for [BUG:684157]. This is fixed in 8.0.6 and higher. SYNTAX NOTES: ============== COMMIT FORCE command -------------------- Example of syntax where highest committed SCN is 88123887 (from dba_2pc_pending) and local transaction ID is 1.13.5197 (from either dba_2pc_pending or

alert.log): SVRMGR> COMMIT FORCE 'your local transactionID on this node', 'highest SCN from already committed site'; SVRMGR> COMMIT FORCE '1.13.5197', '88123887'; ROLLBACK FORCE command ---------------------- Example of syntax where local transactionID is 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> ROLLBACK FORCE 'your local transactionID on this node'; SVRMGR> ROLLBACK FORCE '1.13.5197'; PURGING VIEWS: -------------- Example of syntax for lost local transactionID 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> Execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY ('1.13.5197'); NOTE: Run as SYS where 1.13.5197 is the local_tran_id Example of syntax for mixed local transaction ID 1.13.5197 (from either dba_2pc_pending or alert.log): SVRMGR> Execute DBMS_TRANSACTION.PURGE_MIXED ('1.13.5197'); NOTE: Run as SYS Run PURGE_MIXED Procedure only if significant reconfiguration has occurred so that automatic recover (RECO) cannot resolve the transaction. Example of appropriate cases when it can be used: - Total loss of the remote database, or reconfiguration of software resulting in loss of two-phase commit capability. SCN RECOVERY STEPS ------------------ Complete Recovery: 1. At the down site recover completely if possible (treat as regular recovery). 2. SCN will appear in the alert.log after recover is done on the crashed Node. Incomplete Recovery: 1. If time-based or cancel-based recovery was used on the crashed node, other sites must be placed back to the same point in time for global consistency. Get last SCN from the alert.log of the crashed node. 2. At each node perform a shutdown normal or immediate. 3. Take a cold backup. 4. Restore the control file if necessary. 5. Restore the last backup of all datafiles along with archived redo logs. 6. Choose which tool to use to perform SCN recovery - either Server Manager line mode or RMAN. 7. Connect internal, startup mount, check status from v$datafile to make sure all datafiles are online. Issue 'Alter database datafile '?/?/?' online;', for each datafile with status offline, to bring it online. 8. Issue the following command using the latest SCN from alert.log on the Node that had to be recovered: RECOVER DATABASE UNTIL CHANGE '1.13.5197'; NOTE: If for some reason (e.g.when issuing commit force command) automatic recovery (RECO process) needs to be disabled, use the following command: ALTER SYSTEM DISABLE DISTRIBUTED RECOVERY; (will put RECO to sleep). Make sure to wake RECO up after that: ALTER SYSTEM ENABLE DISTRIBUTED RECOVERY; The following command were used to remove the uncommitted transaction. (login using SYS AS SYSDBA) SQL> rollback force '5.33.28445'; Rollback complete. SQL> select state, local_tran_ID, Global_tran_Id from dba_2pc_pending; STATE LOCAL_TRAN_ID ---------------- ---------------------GLOBAL_TRAN_ID -------------------------------------------------------------------------------forced rollback 5.33.28445 1463898948.0000011E043B991F0000000100000D36FEA867CB58F47A9CA3C9C89F11CF23D764C66E72 SQL> Execute DBMS_TRANSACTION.PURGE_LOST_DB_ENTRY ('5.33.28445'); PL/SQL procedure successfully completed. SQL> commit; Commit complete. SQL> select state, local_tran_ID, Global_tran_Id from dba_2pc_pending; no rows selected SQL>

Вам также может понравиться