Troubleshoot: losing control

I was immersed in an Oracle RAC 11gR2 design and implementation project at the facilities of one of my regular customers when I was interrupted by the project manager. He started telling me about a performance problem they were experiencing with the main database, which was still on Oracle RAC 10gR2 and which, as part of the project, would be upgraded to 11gR2.

They had already identified that the main wait event was buffer busy waits and asked me to take a look and see what could be done to solve it. The task seemed simple: it was just a matter of identifying the problematic object. As usual, though, that was only the tip of the iceberg. Do you want to know what happened? Join me in this story, taken from real life.

First round

As indicated by AWR reports, the initial scenario was as follows:

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
buffer busy waits                   116,522     113,002    970   43.1 Concurrenc
log file switch completion           61,227      59,589    973   22.7 Configurat
log file sync                        26,675      18,098    678    6.9     Commit
db file sequential read           2,167,894      17,680      8    6.7   User I/O
enq: CF - contention                 28,209      13,492    478    5.1      Other

. . .

Segments by Buffer Busy Waits
           Tablespace                      Subobject  Obj.          Busy    % of
Owner         Name    Object Name            Name     Type         Waits Capture
---------- ---------- -------------------- ---------- ----- ------------ -------
SYSTEM     TBS_OTROS  DB_AUDIT_LOGON                  TABLE       98,134   85.30
USU_MONI   TOOLS      TEST_MONITOR_DB                 TABLE        7,843    6.82
USU_SIC    TBS_OTROS  REGSERVPROVSIC                  TABLE        4,048    3.52
USU_ACP    TBS_OTROS  USUARIO_DB                      TABLE        3,149    2.74
USU_SIC    TBS_OTROS  REGSERVSIC                      TABLE          863     .75

Clearly the problematic segment was the db_audit_logon table. Every time a connection was established with the database, a row was inserted into that table, but part of the logic also searched for rows within the same table, using a criterion for which there was no index.

This forced a full table scan on every connection, which took longer and longer as the number of rows grew, to the point that it had become unbearable for the end users.

As an immediate measure I recommended creating a new index, which, once implemented, eliminated the problem. Or at least that is what I thought, since I went back to the main project I had been working on when this consultation arrived.
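
For illustration, the fix was along these lines. This is only a minimal sketch, assuming the logon routine searched by a username column; the actual column and index names belonged to the application and are hypothetical here.

-- Hypothetical sketch: index the column used by the logon routine to search
-- db_audit_logon, so each connection no longer triggers a full table scan.
-- Column and index names are assumptions, not the customer's real ones.
CREATE INDEX system.idx_db_audit_logon_usr
    ON system.db_audit_logon (username)
    ONLINE;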

Myth buster

After the buffer busy waits event was resolved, the DBA team went after the next events on the list: log file switch completion and log file sync. Since they happened to be moving the disk groups to a new storage array, and encouraged by the explicit recommendation of the storage vendor, they opted to move the online redo logs to a disk group dedicated exclusively to them, on SSD disks. As an immediate result the log file sync event disappeared and log file switch completion was reduced to a third.

With such good results in hand, or so they thought, they asked me whether it would not be an excellent idea to include in my design for the new Oracle RAC 11gR2 a disk group exclusively for the online redo logs, on SSD disks of course. I saw their joy turn into astonishment when I answered no: SSD disks are not efficient for writes; on the contrary, they tend to get slower and slower when subjected to a sustained high write rate, which is exactly the workload online redo logs generate.

The improvement they observed had nothing to do with the SSDs themselves; it came from the fact that the online redo logs now resided on dedicated disks, able to consume all the available I/O, instead of sharing it, as before, with another 4 TB of data subject to intensive reads and writes. In other words, they could have achieved the same effect, or an even better one, with traditional disks, which also have the advantage of costing much less than their SSD counterparts.
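
For reference, relocating the online redo logs to a dedicated disk group is typically done along these lines. Consider it a sketch only: the disk group name +REDO, the new group numbers, and the sizes are my assumptions, not the customer's actual commands.

-- Create new groups in the dedicated disk group (repeat for each RAC thread).
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 21 ('+REDO') SIZE 800M;
ALTER DATABASE ADD LOGFILE THREAD 1 GROUP 22 ('+REDO') SIZE 800M;

-- Switch until the old groups are INACTIVE and archived, then drop them.
ALTER SYSTEM ARCHIVE LOG CURRENT;
ALTER DATABASE DROP LOGFILE GROUP 11;
ALTER DATABASE DROP LOGFILE GROUP 12;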

Faith moves... wait events?

Despite my explanation, and since they had already seen improvements after moving the online redo logs to SSD disks, they went ahead with the next step of their plan: moving the controlfile as well, hoping to get rid of the enq: CF - contention event. Unfortunately for them the effect was practically nil, so they opted to open a service request with Oracle Support.

The first mistake was to open it in Spanish: the back and forth with translations made progress very slow, and, as usual with first-level support, the first thing they were asked to do was to upgrade the Oracle Database version, since 10gR2 was by then no longer under Premier Support.

The support engineer in charge of the case had no intention of escalating the request until they did so, so my client had no choice but to ask me to put the original project on hold once more and take care of this problem with the utmost urgency, since operations were seriously compromised. I accepted the challenge.

Second round

By then the new scenario was as follows:

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         61,576          42.5
enq: CF - contention                 45,676      19,029    417   13.1 Other
log file switch completion           17,882      17,077    955   11.8 Configurat
control file sequential read     18,658,312      16,081      1   11.1 System I/O
db file sequential read           5,490,434       9,251      2    6.4 User I/O

The enq: CF - contention wait event was new to me, and what was available on My Oracle Support did not point to any solution, so I started by opening a new service request, this time in English, making it clear up front that the situation was extremely critical and that there was no time for upgrades. The first thing they asked was to check with the storage team that there was no problem at that level, considering that the database had been moved to a new storage array.

Of course that was not the problem, since the redo logs were now even on SSD disks, so they moved on to the next check: that not too many log switches were being generated. When we showed them that there were only 2 to 6 per hour, that was ruled out as well.
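
For reference, the hourly log switch rate can be checked with a query along these lines (a sketch against v$log_history, not necessarily the exact query we ran):

-- Log switches per hour over the last day.
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS log_switches
  FROM v$log_history
 WHERE first_time > SYSDATE - 1
 GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
 ORDER BY 1;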

Support then pointed to another possible cause: they had found error messages in the alert.log, suspected a connectivity problem with the physical standby server, and asked us to investigate it with the network team:

Fri Jun 8 06:00:37 2014
Errors in file /u01/admin/db/bdump/db_lns1_2195896.trc:
ORA-03135: connection lost contact
Fri Jun 8 06:00:37 2014
LGWR: I/O error 3135 archiving log 321 to 'dbcntg'
Fri Jun 8 06:03:56 2014
ARC1: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (3113)
ARC1: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
PING[ARC1]: Error 3113 when pinging standby dbcntg.
Fri Jun 8 06:40:30 2014
Errors in file /u01/admin/db/bdump/db_arc1_745792.trc:
ORA-16146: standby destination control file enqueue unavailable
Fri Jun 8 06:40:30 2014
Master background archival failure: 16146

We explained to them that every day at 6 am the standby database service was suspended in order to apply the archived redo logs accumulated during the day, so it was expected that the primary database would lose contact at that time; it was not a connectivity problem. Back to waiting.

Gotcha!

We had been living with the unresolved problem for a few days, and rumors began to spread that users were ready to storm the data center, when Oracle Support finally told us to read note 753893.1 Degraded performance on standby due to waits on controlfile enqueue (do not bother looking for it; it is an internal document, not available to the general public).

This document states that when a large number of archived redo logs are registered in the controlfile, the enq: CF - contention event may appear, and that ideally no more than a thousand log records should be in use. With this new information we investigated our case:

SQL> SELECT type, records_total, records_used 
       FROM v$controlfile_record_section 
      WHERE type='ARCHIVED LOG';

TYPE                         RECORDS_TOTAL RECORDS_USED
---------------------------- ------------- ------------
ARCHIVED LOG                        116912        84612

Bingo! Almost 85 thousand archived redo logs were recorded. How did they get there, considering that the control_file_record_keep_time parameter was set to 1 (one day)? Good question; I confess I do not have the answer. Investigating that was not the goal at the time, as the users were on the verge of collapse, so we simply focused on implementing the solution: resetting the controlfile record section that holds the archived log entries (section 11 in this release) and then re-cataloging the archived redo logs still present in the FRA:

SQL> exec dbms_backup_restore.resetcfilesection(11);

RMAN> catalog start with '+FRA/db/archivelog';

SQL> SELECT type, records_total, records_used 
       FROM v$controlfile_record_section 
      WHERE type='ARCHIVED LOG';

TYPE                         RECORDS_TOTAL RECORDS_USED
---------------------------- ------------- ------------
ARCHIVED LOG                        116912          336

The effect was immediate: the events disappeared completely, both those related to the controlfile and those related to the online redo logs, and the situation returned to normal.

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time                                         32,205          51.0
db file sequential read           5,966,390      11,183      2   17.7 User I/O
SQL*Net break/reset to client         2,597       4,250   1637    6.7 Applicatio 
gc cr grant 2-way                 2,705,303       2,222      1    3.5 Cluster
gc current block 2-way              957,231       1,812      2    2.9 Cluster

Knockout

Of course the case was not closed yet: now there was time to make sure this situation would not happen again, which meant adjusting the MAXLOGHISTORY setting, which controls how large the archived redo log history kept in the controlfile can be. To do this we started by obtaining a script that would allow us to rebuild the controlfile with an appropriate value:

alter database backup controlfile to trace;

CREATE CONTROLFILE REUSE DATABASE "DB" NORESETLOGS FORCE LOGGING ARCHIVELOG
    MAXLOGFILES 51
    MAXLOGMEMBERS 3
    MAXDATAFILES 200
    MAXINSTANCES 8
    MAXLOGHISTORY 29200
LOGFILE
  GROUP 11 '+DG1/db/onlinelog/group_11.439.755028947' 
           SIZE 800M BLOCKSIZE 512,
  GROUP 12 '+DG1/db/onlinelog/group_12.1143.755028957'
           SIZE 800M BLOCKSIZE 512,
  GROUP 13 '+DG1/db/onlinelog/group_13.1170.755028963' 
           SIZE 800M BLOCKSIZE 512,
. . .

We changed the value from 29,200 to 5,000 and, on the scheduled date, stopped the database, ran the controlfile rebuild script, and restarted the services. Since there was also a physical standby, we had to create and transfer a new standby controlfile as well, and with that we could finally say with confidence: case closed!
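
The sequence was roughly as follows; consider it a sketch, with only one instance started for the rebuild and a hypothetical path for the standby controlfile.

-- With the edited script (MAXLOGHISTORY 5000) at hand, on a single instance:
SHUTDOWN IMMEDIATE
STARTUP NOMOUNT
-- Run the edited CREATE CONTROLFILE ... NORESETLOGS ... MAXLOGHISTORY 5000 script,
-- then open the database without resetting the logs.
ALTER DATABASE OPEN;

-- Recreate the standby controlfile and transfer it to the standby site
-- (the path below is hypothetical).
ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/standby_control.ctl';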

Did you find this article interesting? Do you have any questions, or a topic you would like me to cover? Leave me your comments or contact me right now!
