I was immersed in an Oracle RAC 11gR2 design and implementation project at one of my regular customers' facilities when I was interrupted by the project manager. He started telling me about a performance problem they were experiencing with the main database, which was still on Oracle RAC 10gR2 and which, as part of the project, would be upgraded to 11gR2.
They had already identified that the main wait event was buffer busy waits, and they asked me to take a look and see what could be done to solve it. The task seemed simple: it was just a matter of identifying the problematic object. But, as usual, that was only the tip of the iceberg. If you want to know what happened, join me in this interesting story, taken from real life.
First round
As indicated by AWR reports, the initial scenario was as follows:
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
buffer busy waits 116,522 113,002 970 43.1 Concurrenc
log file switch completion 61,227 59,589 973 22.7 Configurat
log file sync 26,675 18,098 678 6.9 Commit
db file sequential read 2,167,894 17,680 8 6.7 User I/O
enq: CF - contention 28,209 13,492 478 5.1 Other
. . .
Segments by Buffer Busy Waits
Tablespace Subobject Obj. Busy % of
Owner Name Object Name Name Type Waits Capture
---------- ---------- -------------------- ---------- ----- ------------ -------
SYSTEM TBS_OTROS DB_AUDIT_LOGON TABLE 98,134 85.30
USU_MONI TOOLS TEST_MONITOR_DB TABLE 7,843 6.82
USU_SIC TBS_OTROS REGSERVPROVSIC TABLE 4,048 3.52
USU_ACP TBS_OTROS USUARIO_DB TABLE 3,149 2.74
USU_SIC TBS_OTROS REGSERVSIC TABLE 863 .75
Clearly the problematic segment was the DB_AUDIT_LOGON table. It turned out that at the moment a connection was established with the database, a row was inserted into that table, but part of the logic also included a search for rows within the same table, using a criterion for which there was no index.
This caused a full table scan on the table, which was taking longer and longer over time, due to the normal growth in the number of rows, to the point that it had become unbearable for the end user.
As an immediate measure I recommended creating a new index which, once implemented, eliminated the problem. Or at least that's what I thought, since I went back to the tasks of the main project I was involved in when this consultation came my way.
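Purely as an illustration, the fix was along these lines; the column name here is hypothetical, since the real search criterion belongs to the customer's application, and building the index online avoids blocking the logon inserts that hit the table on every connection:
-- Hypothetical column: the real one was the unindexed search criterion
-- used by the logon logic. ONLINE avoids blocking the constant inserts.
CREATE INDEX system.ix_db_audit_logon_crit
  ON system.db_audit_logon (username)
  TABLESPACE tbs_otros
  ONLINE;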
Myth buster
It happened that after the buffer busy waits event was resolved, the DBA team started to attack the next events on the list: log file switch completion and log file sync. As they happened to be moving the disk groups to a new storage array, and encouraged by the express recommendation of the storage vendor, they opted to move the online redo logs to a disk group dedicated exclusively to them, on SSD disks. As an immediate result, the log file sync event disappeared and the log file switch completion event was reduced to a third.
With such good results obtained, or so they thought, they asked me whether it wouldn't be an excellent idea to include in my design for the new Oracle RAC 11gR2 a disk group exclusively for online redo logs, and of course on SSD disks. I saw their joyful faces turn to astonishment when I answered no: SSD disks are not efficient for writing; on the contrary, they tend to get slower and slower when subjected to a high write rate, which is exactly the I/O pattern of online redo logs.
The improvement in times they observed was not due to the SSD disks; it was due to the fact that the online redo logs now resided on dedicated disks, able to consume all the available I/O, and no longer shared it, as before, with another 4 TB of data subject to intensive reads and writes. In other words, they could have achieved the same effect, or an even better one, with traditional disks, which incidentally have the additional advantage of a lower cost, since they are much cheaper than their SSD counterparts.
Faith moves... wait events?
Despite the explanation given, and since they had already seen improvements by moving the online redo logs to SSD disks, they went ahead with the next step in their plan: moving the controlfiles as well, hoping to get rid of the enq: CF - contention event. Unfortunately for them the effect was practically null, so they opted to open a service request with Oracle Support.
The first mistake was to open it in Spanish: the back and forth with translations made progress very slow, and, as usual with first-level support, the first thing they were asked to do was to upgrade the Oracle Server version, since 10gR2 was evidently no longer under Premier Support.
The support engineer in charge of the case did not intend to escalate the service request until they did so, so my client had no choice but to ask me to put the initial project on hold once again and take care of this problem as a matter of urgency and priority, since operations were seriously compromised, and so I accepted the challenge.
Second round
Bear in mind that the new scenario was as follows:
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time 61,576 42.5
enq: CF - contention 45,676 19,029 417 13.1 Other
log file switch completion 17,882 17,077 955 11.8 Configurat
control file sequential read 18,658,312 16,081 1 11.1 System I/O
db file sequential read 5,490,434 9,251 2 6.4 User I/O
This wait event was new to me, and what was available on My Oracle Support did not point to any solution, so I started by opening a new service request, this time in English, making it clear up front that the situation was extremely critical and that there was no time for upgrades. This time the first thing they asked was to check with the storage team that there was no problem at that level, considering that the database had been moved to a new storage array.
Of course that was not the problem, since they were even using SSD disks, so they moved on to the next thing: checking that too many log switches were not being generated. When we showed them that there were only 2 to 6 per hour (see the query sketched after the alert log excerpt below), that was ruled out as well, and they pointed to another possible cause: they had found some error messages in the alert.log, suspected a problem with connectivity to the physical standby server, and asked us to investigate it with the network team.
Fri Jun 8 06:00:37 2014
Errors in file /u01/admin/db/bdump/db_lns1_2195896.trc:
ORA-03135: connection lost contact
Fri Jun 8 06:00:37 2014
LGWR: I/O error 3135 archiving log 321 to 'dbcntg'
Fri Jun 8 06:03:56 2014
ARC1: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (3113)
ARC1: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
PING[ARC1]: Error 3113 when pinging standby dbcntg.
Fri Jun 8 06:40:30 2014
Errors in file /u01/admin/db/bdump/db_arc1_745792.trc:
ORA-16146: standby destination control file enqueue unavailable
Fri Jun 8 06:40:30 2014
Master background archival failure: 16146
We explained to them that every day at 6 a.m. the standby database service was suspended in order to apply the archived redo logs accumulated during the day, so the primary database lost the connection at that moment; it was therefore not a connectivity problem. Once again, we had to wait.
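For reference, the log switch rate mentioned above can be verified with a query along these lines; this is just a minimal sketch against v$log_history (on RAC, gv$log_history gives the per-instance picture):
-- Minimal sketch: redo log switches per hour over the last 24 hours
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
  FROM v$log_history
 WHERE first_time > SYSDATE - 1
 GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
 ORDER BY 1;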
Gotcha!
We had already spent a few days with the problem unresolved, and rumors began to spread that users were ready to storm the data center, when Oracle Support told us to read note 753893.1 Degraded performance on standby due to waits on controlfile enqueue (don't bother looking for it, it is an internal document, not available to the general public).
This document states that with a large number of archived redo logs registered in the controlfile, the enq: CF - contention event may occur, and that the idea is not to have more than a thousand such records in use. With this new information we investigated the case:
SQL> SELECT type, records_total, records_used
FROM v$controlfile_record_section
WHERE type='ARCHIVED LOG';
TYPE RECORDS_TOTAL RECORDS_USED
---------------------------- ------------- ------------
ARCHIVED LOG 116912 84612
Bingo! Almost 85 thousand archived redo logs recorded. How did they get there, considering that the control_file_record_keep_time parameter was set to 1 (one day)? Good question; I confess I don't have the answer. Investigating it was not the goal at that moment, with the users on the verge of collapse, so we simply focused on implementing the solution:
SQL> exec dbms_backup_restore.resetcfilesection(11);
RMAN> catalog start with '+FRA/db/archivelog';
SQL> SELECT type, records_total, records_used
FROM v$controlfile_record_section
WHERE type='ARCHIVED LOG';
TYPE RECORDS_TOTAL RECORDS_USED
---------------------------- ------------- ------------
ARCHIVED LOG 116912 336
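A word of caution if you ever face something similar: the call to dbms_backup_restore.resetcfilesection(11) wipes the ARCHIVED LOG record section of the controlfile, which is why it is immediately followed by the RMAN catalog command, so that the archived redo logs still sitting in the FRA get registered again. It does not hurt to confirm afterwards that RMAN can see them, with something along these lines:
RMAN> crosscheck archivelog all;
RMAN> list archivelog all;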
The effect was immediate: the events disappeared completely, both those related to the controlfile and those related to the online redo logs, and the situation returned to normal.
Top 5 Timed Events Avg %Total
~~~~~~~~~~~~~~~~~~ wait Call
Event Waits Time (s) (ms) Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
CPU time 32,205 51.0
db file sequential read 5,966,390 11,183 2 17.7 User I/O
SQL*Net break/reset to client 2,597 4,250 1637 6.7 Applicatio
gc cr grant 2-way 2,705,303 2,222 1 3.5 Cluster
gc current block 2-way 957,231 1,812 2 2.9 Cluster
Knockout
Of course the case was not closed; now there was time to make sure this situation would not happen again, and for that it was necessary to adjust the MAXLOGHISTORY setting, which controls how large the archived redo log history in the controlfile can grow. To do this we started by obtaining a script that would allow us to rebuild the controlfile with an appropriate value:
alter database backup controlfile to trace;
CREATE CONTROLFILE REUSE DATABASE "DB" NORESETLOGS FORCE LOGGING ARCHIVELOG
MAXLOGFILES 51
MAXLOGMEMBERS 3
MAXDATAFILES 200
MAXINSTANCES 8
MAXLOGHISTORY 29200
LOGFILE
GROUP 11 '+DG1/db/onlinelog/group_11.439.755028947'
SIZE 800M BLOCKSIZE 512,
GROUP 12 '+DG1/db/onlinelog/group_12.1143.755028957'
SIZE 800M BLOCKSIZE 512,
GROUP 13 '+DG1/db/onlinelog/group_13.1170.755028963'
SIZE 800M BLOCKSIZE 512,
. . .
We changed the value from 29,200 to 5,000, and on the scheduled date we stopped the database, ran the controlfile rebuild script, and restarted the services. As there was also a physical standby, we had to create and transfer a new standby controlfile as well, and then we could finally say with confidence: case closed!
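For completeness, recreating the standby controlfile boils down to something like the following on the primary, with the resulting file then copied over and put in place on the standby side; the path is just an example:
-- On the primary, once the rebuilt controlfile is in place (example path)
ALTER DATABASE CREATE STANDBY CONTROLFILE AS '/tmp/db_standby.ctl';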