Blog dedicated to Oracle Applications (E-Business Suite) Technology; covers Apps Architecture, Administration and third party bolt-ons to Apps

Tuesday, November 11, 2014

Mystery of java.sql.SQLRecoverableException: IO Error: Socket read timed out during adop/adpatch

While applying the R12.2 upgrade driver, we faced the issue of WFXLoad.class failing in adworker log but showing up as running on adctrl

        Control
Worker  Code      Context            Filename                    Status
------  --------  -----------------  --------------------------  --------------
     1  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     2  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     3  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     4  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     5  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     6  Run       AutoPatch R120 pl                              Wait        
     7  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     8  Run       AutoPatch R120 pl  WFXLoad.class               Running      
     9  Run       AutoPatch R120 pl  WFXLoad.class               Running      
    10  Run       AutoPatch R120 pl                              Wait        

adworker log shows:

Exception in thread "main" java.sql.SQLRecoverableException: IO Error: Socket read timed out
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:482)
        at oracle.jdbc.driver.PhysicalConnection.(PhysicalConnection.java:678)
        at oracle.jdbc.driver.T4CConnection.(T4CConnection.java:238)
        at oracle.jdbc.driver.T4CDriverExtension.getConnection(T4CDriverExtension.java:34)
        at oracle.jdbc.driver.OracleDriver.connect(OracleDriver.java:567)
        at java.sql.DriverManager.getConnection(DriverManager.java:571)
        at java.sql.DriverManager.getConnection(DriverManager.java:215)
        at oracle.apps.ad.worker.AdJavaWorker.getAppsConnection(AdJavaWorker.java:1041)
        at oracle.apps.ad.worker.AdJavaWorker.main(AdJavaWorker.java:276)
Caused by: oracle.net.ns.NetException: Socket read timed out
        at oracle.net.ns.Packet.receive(Packet.java:341)
        at oracle.net.ns.NSProtocol.connect(NSProtocol.java:308)
        at oracle.jdbc.driver.T4CConnection.connect(T4CConnection.java:1222)
        at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:330)
        ... 8 more

This was happening again and again. The DBAs were suspecting network issue, cluster issue, server issue and all the usual suspects.  In Database alert log we saw these errors coming every few seconds:

Fatal NI connect error 12537, connecting to:
 (LOCAL=NO)

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Time: 11-NOV-2014 19:58:19
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12537

TNS-12537: TNS:connection closed
    ns secondary err code: 12560
    nt main err code: 0
    nt secondary err code: 0
    nt OS err code: 0
opiodr aborting process unknown ospid (26388) as a result of ORA-609


We tried changing the parameters in sqlnet.ora and listener.ora as instructed in the article:
Troubleshooting Guide for ORA-12537 / TNS-12537 TNS:Connection Closed (Doc ID 555609.1)

Sqlnet.ora: SQLNET.INBOUND_CONNECT_TIMEOUT=180
Listener.ora: INBOUND_CONNECT_TIMEOUT_listener_name=120

However, the errors continued.  To rule out any issues in network, I also restarted the network service on Linux:

service network restart

One thing which I noticed was the extra amount of time that the connect was taking 4 seconds:

21:17:38 SQL> conn apps/apps
Connected.
21:17:42 SQL> 

Checked from remote app tier and it was same 4 seconds.

Stopped listener and checked on DB server that uses bequeath protocol:

21:18:47 SQL> conn / as sysdba
Connected.
21:18:51 SQL> conn / as sysdba
Connected.

Again it took 4 seconds.

A few days back, I had seen that connect time had increased after turning setting the DB initialization parameter pre_page_sga to true in a different instance.  On a hunch, I checked this and indeed pre_page_sga was set to true.  I set this back to false:

alter system set pre_page_sga=false scope=spfile;
shutdown immediate;
exit
sqlplus /nolog
conn / as sysdba
startup
SQL> set time on
22:09:46 SQL> conn / as sysdba
Connected.
22:09:49 SQL>

The connections were happening instantly.  So I went ahead and resumed the patch after setting:

update fnd_install_processes 
set control_code='W', status='W';

commit;

I restarted the patch and all the workers completed successfully.  And the patch was running significantly faster.  So I did a search on support.oracle.com to substantiate my solution with official Oracle documentation.  I found the following articles:

Slow Connection or ORA-12170 During Connect when PRE_PAGE_SGA init.ora Parameter is Set (Doc ID 289585.1)
Health Check Alert: Consider setting PRE_PAGE_SGA to FALSE (Doc ID 957525.1)

The first article (289585.1) says:
PRE_PAGE_SGA can increase the process startup duration, because every process that starts must access every page in the SGA. This approach can be useful with some applications, but not with all applications. Overhead can be significant if your system frequently creates and destroys processes by, for example, continually logging on and logging off. The advantage that PRE_PAGE_SGA can afford depends on page size.

The second article (957525.1) says:
Having the PRE_PAGE_SGA initialization parameter set to TRUE can significantly increase the time required to establish database connections.

The golden words here are "Overhead can be significant if your system frequently creates and destroys processes by, for example, continually logging on and logging off.".  That is exactly what happens when you do adpatch or adop.

Keep this in mind, whenever you do adpatch or adop, make sure that pre_page_sga is set to false.  It is possible that you may get the error "java.sql.SQLRecoverableException: IO Error: Socket read timed out" if you don't.  Also the patch will run significantly slower if pre_page_sga is set to true.  So set it to false and avoid these issues.


1 comment:

AppsWorld-Ravi Kethavath said...

Thank You Sir ..Its really shown difference in performance !!! We lost 24 hours due to above issue ..

Ravi Kethavath