I was in the middle of configuring a couple of Oracle 19c Clusters for a project to upgrade from Oracle Database Server 11.2 to 19c when, after rebooting one of the machines (node2), I noticed that the services were not available. There was no apparent reason for it, since this particular Cluster had been operating normally for several weeks.
The problem
As part of the work to copy the Linux users and groups from the 11.2 Clusters to these new 19c Clusters, a task was accidentally executed that began changing the owner and group of the files in the /u01 filesystem, where all the Oracle Grid Infrastructure and Oracle Database Server binaries reside.
Although the process was terminated almost immediately, the damage was done: the contents of the Grid 19c Oracle Home, as well as those of the Oracle Database Server 19c and 11.2 Oracle Homes, were altered and the Oracle software was rendered inoperative.
[oracle@node2 ~]$ ls -la /u01/app/
total 4
drwxr-xr-x. 5 root oinstall 52 Aug 24 12:46 .
drwxr-xr-x. 3 root oinstall 17 Aug 24 12:35 ..
drwxr-xr-x. 3 haldaemon haldaemon 20 Aug 24 12:35 grid
drwxr-xr-x. 13 oracle oinstall 4096 Aug 25 20:07 oracle
drwxrwx---. 5 oracle oinstall 92 Sep 28 11:25 oraInventory
[oracle@node2 ~]$ ls -la /u01/app/oracle
total 8
drwxr-xr-x. 13 oracle oinstall 4096 Aug 25 20:07 .
drwxr-xr-x. 5 root oinstall 52 Aug 24 12:46 ..
drwxr-xr-x. 3 haldaemon haldaemon 18 Aug 24 16:50 11.2.0
drwxr-xr-x. 3 haldaemon haldaemon 18 Aug 24 12:35 19.0.0
drwxr-xr-x. 5 oracle oinstall 45 Aug 24 19:14 admin
drwxr-x---. 3 oracle oinstall 21 Aug 24 18:34 audit
drwxrwxr-x. 6 oracle oinstall 60 Aug 24 18:12 cfgtoollogs
[oracle@node2 ~]$ crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
First attempt
A search in My Oracle Support turns up this document:
1931142.1 | How to check and fix file permissions on Grid Infrastructure environment |
Although it only applies to the Grid Infrastructure Oracle Home, we have to start somewhere, so we follow the procedure described there, starting by stopping the Grid processes and changing the owner and group of the directories back to their default values: oracle:oinstall.
[root@node2 ~]# crsctl stop crs -f
CRS-2791: Starting shutdown of Oracle High Availability Services-managed resources on 'node2'
CRS-2673: Attempting to stop 'ora.drivers.acfs' on 'node2'
CRS-2677: Stop of 'ora.drivers.acfs' on 'node2' succeeded
CRS-2793: Shutdown of Oracle High Availability Services-managed resources on 'node2' has completed
CRS-4133: Oracle High Availability Services has been stopped.
[root@node2 ~]# chown -R oracle:oinstall /u01/app
[root@node2 ~]# cd $ORACLE_HOME/crs/install/
[root@node2 install]# ./rootcrs.sh -init
Smartmatch is deprecated at /u01/app/grid/19.0.0/grid_1/crs/install/crsupgrade.pm line 6512.
Using configuration parameter file: /u01/app/grid/19.0.0/grid_1/crs/install/crsconfig_params
The log of current session can be found at:
/u01/app/oracle/crsdata/node2/crsconfig/crsinit_node2_2024-09-26_03-45-33PM.log
We then validate the status of the Grid Oracle Home:
[oracle@node2 ~]$ cluvfy comp software -n node2 -verbose
This software is "184" days old. It is a best practice to update the CRS home by downloading and applying the latest release update. Refer to MOS note 756671.1 for more details.
Performing following verification checks ...
Software home: /u01/app/grid/19.0.0/grid_1 ...
Node Name Status Comment
------------ ------------------------ ------------------------
node2 failed 1256 files verified
Software home: /u01/app/grid/19.0.0/grid_1 ...FAILED (PRVG-2031, PRVG-11960, PRVH-0147)
Verification of software was unsuccessful on all the specified nodes.
Failures were encountered during execution of CVU verification request "software".
Software home: /u01/app/grid/19.0.0/grid_1 ...FAILED
node2: PRVG-2031 : Owner of file
"/u01/app/grid/19.0.0/grid_1/lib/liboramysql19.so" did not match the
expected value on node "node2". [Expected = "root(0)" ; Found =
"oracle(54321)"]
node2: PRVG-2031 : Owner of file "/u01/app/grid/19.0.0/grid_1/bin/expdp" did
not match the expected value on node "node2". [Expected = "root(0)" ;
Found = "oracle(54321)"]
node2: PRVG-2031 : Owner of file "/u01/app/grid/19.0.0/grid_1/bin/sqlldr" did
not match the expected value on node "node2". [Expected = "root(0)" ;
Found = "oracle(54321)"]
. . .
node2: PRVG-2031 : Owner of file "/u01/app/grid/19.0.0/grid_1/bin/okdstry" did
not match the expected value on node "node2". [Expected = "root(0)" ;
Found = "oracle(54321)"]
node2: PRVG-2031 : Owner of file "/u01/app/grid/19.0.0/grid_1/bin/okinit" did
not match the expected value on node "node2". [Expected = "root(0)" ;
Found = "oracle(54321)"]
node2: PRVG-2031 : Owner of file "/u01/app/grid/19.0.0/grid_1/bin/oklist" did
not match the expected value on node "node2". [Expected = "root(0)" ;
Found = "oracle(54321)"]
node2: PRVG-2031 : Owner of file
"/u01/app/grid/19.0.0/grid_1/bin/oradaemonagent" did not match the
expected value on node "node2". [Expected = "root(0)" ; Found =
"oracle(54321)"]
The same note indicates that if running rootcrs.sh -init is not enough, the permissions must be fixed manually using the information contained in the crsconfig_fileperms and crsconfig_dirs files, but that is time-consuming and error-prone.
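For reference, that manual approach would look roughly like the sketch below. This is only an illustration: the location of crsconfig_fileperms varies between releases, and the column layout assumed here (path, owner, group, permissions) must be verified against your own copy before running anything.
GRID_HOME=/u01/app/grid/19.0.0/grid_1
PERMS_FILE=$GRID_HOME/crs/utl/$(hostname -s)/crsconfig_fileperms   # location may differ by release
# Assumed layout of each line: <path> <owner> <group> <permissions> -- verify first!
while read -r path owner group perms; do
  [ -e "$path" ] || continue          # skip entries that do not exist on this node
  chown "$owner":"$group" "$path"
  chmod "$perms" "$path"
done < "$PERMS_FILE"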
Fortunately, My Oracle Support has this document:
1515018.1 | Script to capture and restore file permission in a directory (for eg. ORACLE_HOME) |
The note indicates that the first step is to download a Perl script called permission.pl.
The script takes a single parameter, the directory to process, and generates a file with the Linux commands needed to rebuild the owner, group, and permissions of each and every file in that directory!
It sounds like the perfect solution, but is it? Let's see.
Second attempt
We start by running permission.pl on node1, since that node was not affected and is operating without problems.
Note: it must be run as the root user.
[root@node1 ~]# ./permission.pl /u01/app
Following log files are generated
logfile : permission-Thu-Sep-26-16-58-50-2024
Command file : restore-perm-Thu-Sep-26-16-58-50-2024.cmd
Linecount : 147029
It finished in less than 10 seconds, and the “.cmd” file indeed contains the promised commands.
[root@node1 ~]# more restore-perm-Thu-Sep-26-16-58-50-2024.cmd
chown root:oinstall "/u01/app"
chmod 755 "/u01/app"
chown root:oinstall "/u01/app/grid"
chmod 755 "/u01/app/grid"
chown root:oinstall "/u01/app/grid/19.0.0"
chmod 755 "/u01/app/grid/19.0.0"
chown root:oinstall "/u01/app/grid/19.0.0/grid_1"
chmod 755 "/u01/app/grid/19.0.0/grid_1"
chown oracle:oinstall "/u01/app/grid/19.0.0/grid_1/env.ora"
chmod 644 "/u01/app/grid/19.0.0/grid_1/env.ora"
chown root:oinstall "/u01/app/grid/19.0.0/grid_1/root.sh"
chmod 755 "/u01/app/grid/19.0.0/grid_1/root.sh"
chown oracle:oinstall "/u01/app/grid/19.0.0/grid_1/welcome.html"
chmod 640 "/u01/app/grid/19.0.0/grid_1/welcome.html"
chown oracle:oinstall "/u01/app/grid/19.0.0/grid_1/gridSetup.sh"
chmod 750 "/u01/app/grid/19.0.0/grid_1/gridSetup.sh"
chown oracle:oinstall "/u01/app/grid/19.0.0/grid_1/runcluvfy.sh"
chmod 750 "/u01/app/grid/19.0.0/grid_1/runcluvfy.sh"
chown oracle:oinstall "/u01/app/grid/19.0.0/grid_1/gridSetup.rsp"
chmod 644 "/u01/app/grid/19.0.0/grid_1/gridSetup.rsp"
We copy the file to the affected node, but we must take into account that the hostname and the ASM instance name are different there, so we adapt them in the script before executing it.
[root@node1 ~]# sed -i 's/ASM1/ASM2/g' restore-perm-Thu-Sep-26-16-58-50-2024.cmd
[root@node1 ~]# sed -i 's/asm1/asm2/g' restore-perm-Thu-Sep-26-16-58-50-2024.cmd
[root@node1 ~]# sed -i 's/node1/node2/g' restore-perm-Thu-Sep-26-16-58-50-2024.cmd
[root@node2 ~]# sh restore-perm-Thu-Sep-26-16-58-50-2024.cmd
chown: cannot access ‘/u01/app/grid/19.0.0/grid_1/gpnp/gpnp_bcp__2024_8_24_175252’: No such file or directory
chmod: cannot access ‘/u01/app/grid/19.0.0/grid_1/gpnp/gpnp_bcp__2024_8_24_175252’: No such file or directory
chown: cannot access ‘/u01/app/grid/19.0.0/grid_1/gpnp/gpnp_bcp__2024_8_24_175252/stg__node2’: No such file or directory
chmod: cannot access ‘/u01/app/grid/19.0.0/grid_1/gpnp/gpnp_bcp__2024_8_24_175252/stg__node2’: No such file or directory
. . .
chmod: cannot access ‘/u01/app/grid/19.0.0/grid_1/cfgtoollogs/oui/GridSetupActions2024-08-24_05-56-29PM/time2024-08-24_05-56-29PM.log’: No such file or directory
chown: cannot access ‘/u01/app/grid/19.0.0/grid_1/crf/admin/run/crfmond/snode2.pid’: No such file or directory
chmod: cannot access ‘/u01/app/grid/19.0.0/grid_1/crf/admin/run/crfmond/snode2.pid’: No such file or directory
After a few minutes, it finally completes. Although several error messages are displayed, they can be ignored: they refer to log, trace, and backup files that do not exist on node2, and they do not affect the operation of the Oracle software.
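To be sure the failures really are harmless, one way (not part of the MOS note, just a quick sketch) is to capture stderr during the run and check that every path it complains about looks like a log, trace, or backup file:
sh restore-perm-Thu-Sep-26-16-58-50-2024.cmd 2> restore-errors.log
# Any failing path that does NOT match these patterns deserves a closer look;
# extend the pattern list to whatever shows up in your own output.
grep 'cannot access' restore-errors.log | grep -vE 'cfgtoollogs|gpnp_bcp|\.log|\.trc|\.pid'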
We validate the status of the Grid Oracle Home once more:
[oracle@node2 ~]$ cluvfy comp software -n node2 -verbose
This software is "184" days old. It is a best practice to update the CRS home by downloading and applying the latest release update. Refer to MOS note 756671.1 for more details.
Performing following verification checks ...
Software home: /u01/app/grid/19.0.0/grid_1 ...
Node Name Status Comment
------------ ------------------------ ------------------------
node2 passed 1256 files verified
Software home: /u01/app/grid/19.0.0/grid_1 ...PASSED
Verification of software was successful.
CVU operation performed: software
Date: Sep 26, 2024 17:15:11 PM
CVU version: 19.23.0.0.0 (032824x8664)
Clusterware version: 19.0.0.0.0
CVU home: /u01/app/grid/19.0.0/grid_1
Grid home: /u01/app/grid/19.0.0/grid_1
User: oracle
Operating system: Linux5.4.17-2136.330.7.1.el7uek.x86_64
Excellent! Everything seems to be in order, but the litmus test is whether the Clusterware starts and whether the databases dbrac (Oracle Database 11.2) and orcl (Oracle Database 19c) come up.
[root@node2 ~]# crsctl start crs
CRS-4123: Oracle High Availability Services has been started.
[root@node2 ~]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@node2 ~]# olsnodes -s
node1 Active
node2 Active
node3 Active
[root@node2 ~]# ps -ef | grep pmon
oracle 16125 1 0 17:24 ? 00:00:00 asm_pmon_+ASM2
oracle 16380 1 0 17:24 ? 00:00:00 ora_pmon_dbrac1_2
oracle 16503 1 0 17:24 ? 00:00:00 ora_pmon_orcl1_2
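In addition to checking for the pmon processes, the status can be confirmed through Clusterware itself. A quick sketch, with the database names taken from the text above (for the 11.2 database it is better to use the srvctl from its own Oracle Home):
crsctl stat res -t
srvctl status database -d orcl     # 19c database, srvctl from the 19c home
srvctl status database -d dbrac    # 11.2 database, preferably srvctl from the 11.2 home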
Confirmed: everything is operating normally, so we can consider the case closed and continue with the day's tasks.
We did get a scare, but fortunately we found a solution that saved us from having to reinstall the software, which, by the way, is not that complicated if you are used to working with Golden Images.
Don't know what Golden Images are? What are you waiting for? I have a whole series of articles you can start reading right now.