Blog dedicated to Oracle Applications (E-Business Suite) Technology; covers Apps Architecture, Administration and third party bolt-ons to Apps

Wednesday, May 20, 2009

How to disable automatic SGA tuning in 10g

On many of our production E-Business Suite 11.5.10.2 instances which are on 10.2.0.3 and 10.2.0.4, we have been getting ORA-600 and ORA-7445 errors. In one of the severity one TARs, for these errors, which were occuring when a business critical program was run, Oracle support advised us to disable automatic SGA tuning by setting sga_target=0. Here's an excerpt of the TAR:

ISSUE CLARIFICATION
====================

Multiple errors ORA-7445, ORA-3136 and ORA-4031's


.
ISSUE VERIFICATION
===================

Verified the issue by the trace file

Verified the issue by the alert log file


.
RESEARCH
=========

Point 1: We can see that next to Free memory the KGH: No Access is the biggest memory consumer.

This can been one of two things. The SGA_TARGET is configured too low causing frequent growing and shrinking of the SGA components to accomi
date the needs.

or

There is a known issue where the growing and shrinking is due to no minimums being set for the db_buffer_cache and the Shared_pool_size.

Bug 6528336 Automatic SGA may repeatedly shrink / grow the shared pool
Fixed-Releases: A204 B200

Details:
Alternating frequent shrink and grow of the buffer cache and
shared pool may be seen with automatic memory management enabled
causing various waits in sessions.


.
CAUSE DETERMINATION
====================

Bug 6528336 Automatic SGA may repeatedly shrink / grow the shared pool
Fixed-Releases: A204 B200


CAUSE JUSTIFICATION
====================

Known issue when KGH No Access show as large memory consumer.



.
PROPOSED SOLUTION(S)
======================
List the solution option(s) and/or workarounds here - What will work.

1). Apply the 10.2.0.4.0 patchset to get the full fix for the issue.

- or -

2). Disable ASMM by setting SGA_TARGET=0

- or -

3). Note that even after you have applied a fix, frequent resizing between shared poo
l and buffer cache can still lead to buildup of "KGH: NOACCESS" allocations. The
frequent resizing is an indication that sga_target is too low. Increase sga_tar
get . Also, recommend setting a minimum size for shared pool by setting value f
or parameter shared_pool_size to prevent it being shrunk below that level, thus
reducing chances for ORA-4031 to occur.

- or -

4). alter system set "_memory_broker_stat_interval"=999; --- 999sec between resizes

This will increase the time between resize to atleast 999 seconds.
This will reduce the number of resizes.

_memory_broker_stat_interval is in seconds,By Default it is 30 seconds.
You can set _memory_broker_stat_interval to a larger value







PROPOSED SOLUTION JUSTIFICATION(S)
====================================
Known to resolve the ORA-4031 KGH: No Access.



.
SOLUTION / ACTION PLAN
=======================
To implement the solution, please execute the following steps:


1). Apply the 10.2.0.4.0 patchset to get the full fix for the issue.

- or -

2). Disable ASMM by setting SGA_TARGET=0

- or -

3). Note that even after you have applied a fix, frequent resizing between shared pool and buffer cache can still lead to buildup
of "KGH: NOACCESS" allocations. The frequent resizing is an indication that sga
_target is too low. Increase sga_target . Also, recommend setting a minimum siz
e for shared pool by setting value for parameter shared_pool_size to prevent it
being shrunk below that level, thus reducing chances for ORA-4031 to occur.

- or -

4). alter system set "_memory_broker_stat_interval"=999; --- 999sec between resizes

This will increase the time between resize to atleast 999 seconds.
This will reduce the number of resizes.

_memory_broker_stat_interval is in seconds,By Default it is 30 seconds.
You can set _memory_broker_stat_interval to a larger value


Thanks
Kevin

Hi,

We can disable ASSM by setting sga_target to zero. However that will need us
to manually give the values for:

Database buffer cache (Default pool)
Shared pool
Large pool
Java pool
10gR2 the streams pool

This is as per metalink note 257643.1 which says that the above are set auto by
automatic tuning.

However Metalink Note 216205.1 says that large_pool_size, java_pool_size are
obsolete. Should we set these parameters ?

- Vikram



04-JUN-09 21:02:45 GMT

.
UPDATE
=======


The java_pool_size, and large_pool_size are not obsolete, they are not set manaully when SGA_TARGET is set.

Both are listed in the 10.2. reference guide and are not flagged as obsoleted.


I would suggest it might be easier for you to set minimums fort the shared_pool_size and the db_buffer_cache and leave the SGA_TARGET set.

The choice of all the above solutions is of course up to you.


Thanks
Kevin

We decided to disable automatic SGA tuning and set sga_target=0 and manually set values for db_cache_size, shared_pool_size and java_pool_size.

Section 3.4 of Metalink Note 295626.1 has the details about disabling automatic SGA or ASMM:

3.4 ASMM to Manual

You can revert the ASMM mechanism at any time by setting the SGA_TARGET value to 0.

In this case the current_size will be used by default as shown below:

SQL> alter system set sga_target=0;

System altered.

SQL> select component, current_size/1024/1024 "CURRENT_SIZE", min_size/1024/1024 "MIN_SIZE",
user_specified_size/1024/1024 "USER_SPECIFIED_SIZE", last_oper_type "TYPE" from v$sga_dynamic_components;

COMPONENT CURRENT_SIZE MIN_SIZE USER_SPECIFIED_SIZE TYPE
------------------------- ------------ ---------- ------------------- -------------
shared pool 104 80 104 GROW
<<<<< 104 has become the current value
large pool 8 8 8 STATIC
java pool 48 48 48 STATIC
streams pool 12 0 12 GROW
DEFAULT buffer cache 24 24 24 SHRINK
KEEP buffer cache 0 0 0 STATIC
RECYCLE buffer cache 0 0 0 STATIC
DEFAULT 2K buffer cache 0 0 0 STATIC
DEFAULT 4K buffer cache 0 0 0 STATIC
DEFAULT 8K buffer cache 0 0 0 STATIC
DEFAULT 16K buffer cache 0 0 0 STATIC
DEFAULT 32K buffer cache 0 0 0 STATIC
OSM Buffer Cache 0 0 24 STATIC

If you shutdown the instance now, you will record the CURRENT_SIZE value for all the parameters.

After the restart, the MIN_SIZE values are equal to CURRENT_SIZE.

SQL> select component, current_size/1024/1024 "CURRENT_SIZE",
min_size/1024/1024 "MIN_SIZE",
user_specified_size/1024/1024 "USER_SPECIFIED_SIZE",
last_oper_type "TYPE" from v$sga_dynamic_components;

COMPONENT CURRENT_SIZE MIN_SIZE USER_SPECIFIED_SIZE TYPE
------------------------- ------------ ---------- ------------------
shared pool 104 104 104 STATIC
large pool 8 8 8 STATIC
java pool 48 48 48 STATIC
streams pool 12 12 12 STATIC
DEFAULT buffer cache 24 24 24 STATIC
KEEP buffer cache 0 0 0 STATIC
RECYCLE buffer cache 0 0 0 STATIC
DEFAULT 2K buffer cache 0 0 0 STATIC
DEFAULT 4K buffer cache 0 0 0 STATIC
DEFAULT 8K buffer cache 0 0 0 STATIC
DEFAULT 16K buffer cache 0 0 0 STATIC
DEFAULT 32K buffer cache 0 0 0 STATIC
OSM Buffer Cache 0 0 24 STATIC

It is strongly advised to use an spfile with your your instance as you will record any changes applied.

The ORA-600, ORA-7445 errors have disappeared after disabling automatic SGA tuning.

Tuesday, May 19, 2009

Active Responsibilities log shows Rep-3000 Error

Akhilesh and Makarand called me today. When they ran Active Responsibilities concurrent program, the log showed this error:

ERROR
REP-3000: Internal error starting Oracle Toolkit.

This error was not appearing for Active Users program which completed normal.

Metalink Note 467148.1 describes this problem. The note says that Solaris requires window manager for before the X server is started. We use xfvb as the X server. On checking the server for the windows manager twm, I found that it was not running. After starting the twm process, and then starting xfvb (Xsun), the log was clean. It seems that either the twm process died, or someone killed it by mistake.

Monday, May 18, 2009

Script to get the process which listens on a port number

In Solaris 10, lsof utility doesn't work. Here's a script given by Oracle in Metalink Note 839919.1 to help us:

#!/bin/bash

# Get the process which listens on port

# $1 is the port we are looking for

if [ $# -lt 1 ]
then
echo "Please provide a port number parameter for this script"
echo "e.g. $0 22"
exit
fi

echo "Greping for your port, please be patient (CTRL+C breaks) ... "

for i in `ls /proc`
do
[ -d /proc/$i ] && pfiles $i | grep AF_INET | grep $1
if [ $? -eq 0 ]
then
echo Is owned by pid $i
fi
done

Thursday, May 14, 2009

The Applications File Server could not open the file FNDCPGSC

Anand pinged about this error in concurrent manager logs about Output Post Processor (FNDCOPP):

The Applications File Server could not open the file $APPLCSF/$APPLLOG/FNDCPGSC*txt for read

Starting FNDCPOPP Concurrent Manager : 16-MAY-2009 08:24:59

Found dead process: spid=(999999), cpid=(13200), Service Instance=(5271)

Metalink Note 331942.1 describes this issue. As per Oracle, it is an environment issue and running autoconfig corrects it.

However, we restarted the concurrent managers and it worked fine the second time. Root cause is not yet known.

Wednesday, May 13, 2009

cannot open /etc/mail/local-host-names

Recently I added a new hostname to /etc/mail/local-host-names and bounced sendmail:

svcadm restart sendmail.

But sendmail didn't come up and started logging these errors in /var/adm/messages:

May 13 14:21:50 mailserver1 sendmail[19725]: [ID 801593 mail.crit] NOQUEUE: SYSERR(root): /etc/mail/sendmail.cf: line 80: fileclass: cannot open '/etc/mail/loca
l-host-names': Group writable directory
May 13 14:21:50 mailserver1 sendmail[19725]: [ID 801593 mail.crit] NOQUEUE: SYSERR(root): /etc/mail/sendmail.cf: line 561: fileclass: cannot open '/etc/mail/tru
sted-users': Group writable directory


We engaged unix team and they did something and the error changed to:

May 13 14:33:38 tsgp1403 sendmail[13585]: [ID 801593 mail.alert] n4DHe0KO027983: queueup: cannot create ./tfn4DHe0KO027983, uid=25: File exists
May 13 14:35:34 tsgp1403 sendmail[15873]: [ID 801593 mail.alert] n4DHe0KO027983: queueup: cannot create ./tfn4DHe0KO027983, uid=25: File exists
May 13 14:35:35 tsgp1403 sendmail[15916]: [ID 801593 mail.alert] n4DHe0KO027983: queueup: cannot create ./tfn4DHe0KO027983, uid=25: File exists

Finally they realized that the permissions of /etc were set to 775. Once they changed it, sendmail started.

In a different article on oreillynet, I found the information that sendmail won't work properly if permissions are incorrect on any of these directories:

You have to check the permissions on / and all the subdirectories like /etc, /etc/mail for write permissions. If there is no write permission on / or /etc then give the write permissions and check.

This will set the permissions for every directory sendmail needs to be
non-group writable:

sudo chmod go-w / /etc /etc/mail /usr /var /var/spool /var/spool/mqueue /private

Check all these permissions are same as below.

drwxr-xr-x 45 root sys 3584 Feb 3 10:24 etc
drwxr-xr-x 2 root mail 512 Feb 3 11:54 mail
-rw-r--r-- 1 root bin 153 Sep 30 10:24 Mail.rc
-rw-r--r-- 1 root bin 1201 Sep 30 10:24 aliases
-rw-r--r-- 1 root mail 0 Oct 10 04:58 aliases.dir
-rw-r--r-- 1 root mail 1024 Oct 10 04:58 aliases.pag
-rw-r--r-- 1 root bin 5266 Sep 27 19:07 helpfile
-rw-r--r-- 1 root bin 0 Sep 24 2001 local-host-names
-rw-r--r-- 1 root bin 1829 Sep 30 10:19 mailx.rc
-r--r--r-- 1 root bin 34924 Sep 30 10:24 main.cf
-r--r--r-- 1 root bin 35625 Oct 1 05:09 sendmail.cf
-r--r--r-- 1 root other 35625 Oct 1 05:09
sendmail.cf.pre110615-05
lrwxrwxrwx 1 root root 8 Sep 30 10:24 sendmail.hf ->
helpfile
-r--r--r-- 1 root bin 35625 Sep 30 10:24 subsidiary.cf
-rw-r--r-- 1 root bin 5 Sep 24 2001 trusted-users

Tuesday, May 12, 2009

Check what a session does on OS level

First find the oracle session id of the concurrent program with this sql:

SELECT 'Request id: '||request_id ,
'Trace id: '||oracle_Process_id,
'Trace Flag: '||req.enable_trace,
'Trace Name:
'||dest.value||'/'||lower(dbnm.value)||'_ora_'||oracle_process_id||'.trc',
'Prog. Name: '||prog.user_concurrent_program_name,
'File Name: '||execname.execution_file_name|| execname.subroutine_name ,
'Status : '||decode(phase_code,'R','Running')
||'-'||decode(status_code,'R','Normal'),
'SID Serial: '||ses.sid||','|| ses.serial#,
'Module : '||ses.module
from apps.fnd_concurrent_requests req, v$session ses, v$process proc,
v$parameter dest, v$parameter dbnm, apps.fnd_concurrent_programs_vl prog,
apps.fnd_executables execname
where req.oracle_process_id=proc.spid(+)
and proc.addr = ses.paddr(+)
and dest.name='user_dump_dest'
and dbnm.name='db_name'
and req.concurrent_program_id = prog.concurrent_program_id
and req.program_application_id = prog.application_id
and prog.application_id = execname.application_id
and prog.executable_id=execname.executable_id
and request_id=&request_ID

Get the OS Process id by passing the session id obtained from above query to the query below:

SELECT p.program, p.spid, s.saddr, s.sid, s.serial#, s.username,
s.osuser, s.machine, s.program, s.logon_time, s.status
FROM v$session s, v$process p
WHERE s.paddr = p.addr
AND s.sid = &sid;

The above query will give you the os process id in the column spid. Once you get the os process id, go to unix prompt

truss -p os_process_id

APP-FND-00092: Routine PRINT_OUTPUT cannot find the file

Venkat reported this issue today. Concurrent programs were erroring out with this message:

APP-FND-00092: Routine PRINT_OUTPUT cannot find the file $COMMON_TOP/admin/out/o65905566.out to print.

It turned out that someone had updated the profile option: Concurrent Report:Copies to 1. The value used to be 0.

Once the value was changed back to 0, the error stopped coming.

Monday, May 11, 2009

Concurrent Manager cannot find error description for CONC-System Node

For a new instance, when we failed over through Veritas Cluster, the concurrent manager did not come up and showed these errors:

Starting jupiter2_0516@jupiter2 Internal Concurrent Manager -- shell process ID 25341

logfile=$APPLCSF/$APPLLOG/jupiter2_0516.mgr
PRINTER=noprint
mailto=appesep2
restart=N
diag=N
sleep=60 (default)
pmon=20 (default)
quesiz=1 (default)

The Internal Concurrent Manager has encountered an error.

Review concurrent manager log file for more detailed information. : 16-MAY-2009 09:08:43 -


Shutting down Internal Concurrent Manager : 16-MAY-2009 09:08:43

List of errors encountered:
.............................................................................

_ 1 _
Concurrent Manager cannot find error description for CONC-System Node
Name not Registered

Contact your support representative.
.............................................................................



List of errors encountered:
.............................................................................

_ 1 _
Routine AFPCAL received failure code while parsing or running your
concurrent program CPMGR

Review your concurrent request log file for more detailed information.
Make sure you are passing arguments in the correct format.
.............................................................................

The jupiter2_0516@jupiter2 internal concurrent manager has terminated with status 1 - giving up.


This issue is described in Metalink Note 375813.1 and 456865.1

This occurs if the failover node is not present in fnd_nodes.

You need to do this to correct:

To implement the solution, please execute the following steps:

1. Verify correct node name is registered.
- Login with sysadmin responsibility

- Navigate to Install > Nodes

2. Enter correct name and save the change.

3. Restart concurrent managers

Saturday, May 9, 2009

ALECTC undefined symbol alectc in almctc.o

Bimal pinged me this error where relink of ALECTC failed:

Undefined first referenced
symbol in file
alectc $ALR_TOP/lib/almctc.o
ld: fatal: Symbol referencing errors. No output written to $ALR_TOP/bin/ALECTC
*** Error code 1
make: Fatal error: Command failed for target `$ALR_TOP/bin/ALECTC'
Done with link of alr executable 'ALECTC' on Sat May 9 23:58:47 EDT 2009

Relink of module "ALECTC" failed.
See error messages above (also recorded in log file) for possible
reasons for the failure. Also, please check that the Unix userid
running adrelink has read, write, and execute permissions
on the directory $ALR_TOP/bin,
and that there is sufficient space remaining on the disk partition
containing your Oracle Applications installation.

Done with link of product 'alr' on Sat May 9 23:58:47 EDT 2009


adrelink is exiting with status 1

End of adrelink session
Date/time is Sat May 9 23:58:47 EDT 2009
**********************************************************

Line-wrapping log file for readability ...
Done line-wrapping log file.

Original copy is $APPL_TOP/admin/log/adrelink.lsv
New copy is $APPL_TOP/admin/log/adrelink.log

I checked for the presence of $ALR_TOP/lib/alrmctc.o and found it missing. So, I copied the $ALR_TOP/lib/alrmctc.o from a working instance. I ran the command:

adrelink.sh force=y ranlib=y "ALR ALECTC"

However it failed again with the same error.

Many years ago, I had faced a similar problem and had bookmarked Metalink Note 1009718.6 for this purpose. It helped me again.

I ran this command:

cd $APPL_TOP
find . -name "*.a" -exec nm -r {} \; | grep alectc

Among the result was a line:

[10] | 0| 0|NOTY |GLOB |0 |UNDEF |almctc.o:alectc

I ran this command in a working instance and got many more lines:

./alr/11.5.0/lib/libalr.a[alectc.o]:
[7] | 0| 0|SECT |LOCL |0 |6 |alectc.o:
[2] | 0| 0|SECT |LOCL |0 |2 |alectc.o:
[3] | 0| 0|SECT |LOCL |0 |3 |alectc.o:
[5] | 0| 0|SECT |LOCL |0 |4 |alectc.o:
[6] | 0| 0|SECT |LOCL |0 |5 |alectc.o:
[13] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc.o:afpend
[11] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc.o:alecea
[8] | 0| 264|FUNC |GLOB |0 |2 |alectc.o:alectc
[1] | 0| 0|FILE |LOCL |0 |ABS |alectc.o:alectc.c
[12] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc.o:alefis
[10] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc.o:alesis
[9] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc.o:alzerr
[4] | 0| 4|OBJT |LOCL |0 |3 |alectc.o:rcsid

In the server in which the issue was occuring I did this:

nm libalr.a >text

$ grep alectc text
[10] | 0| 0|NOTY |GLOB |0 |UNDEF |alectc

In working server:

[appes tsgp1100 ]$ grep alectc text
libalr.a[alectc.o]:
[8] | 0| 264|FUNC |GLOB |0 |2 |alectc
[1] | 0| 0|FILE |LOCL |0 |ABS |alectc.c
[appes tsgp1100 ]$

So I compared the size of libalr.a in working and failing servers and found the file to be larger in the working server.

I took a backup of the existing libalr.a file in the failing server and copied the libalr.a from working server.

Then I ran

adrelink.sh force=y ranlib=y "ALR ALECTC"

It succeded this time:

Relinking module 'ALECTC' in product alr ...

make -f $APPL_TOP/admin/out/link_alr_6575.mk $ALR_TOP/bin/ALECTC

Starting link of alr executable 'ALECTC' on Sun May 10 00:23:37 EDT 2009
ld -dy $ORACLE_HOME/lib/SC4.2/crti.o $ORACLE_HOME/lib/SC4.2/crt1.o $ORACLE_HOME/lib/SC4.2/crtn.o -s -o $ALR_TOP/bin/ALECTC $ALR_TOP/lib/almctc.o \
$ALR_TOP/lib/libalr.a $ALR_TOP/lib/alstub.o $APPL_TOP/fnd/11.5.0/lib/libfnd.a -lsql -lclntsh $ORACLE_HOME/lib/nautab.o $ORACLE_HOME/lib/naeet.o $ORACLE_HOME/lib/naect.o $ORACLE_HOME/lib/naedhs.o `cat $ORACLE_HOME/lib/naldflgs` -lnetv2 -lnttcp -lnetwork -lncr -lnetv2 -lnttcp -lnetwork -lclient -lvsn -lcommon -lgeneric -lmm -lnlsrtl3 -lcore4 -lnlsrtl3 -lcore4 -lnlsrtl3 -lnetv2 -lnttcp -lnetwork -lncr -lnetv2 -lnttcp -lnetwork -lclient -lvsn -lcommon -lgeneric -lepc -lnlsrtl3 -lcore4 -lnlsrtl3 -lcore4 -lnlsrtl3 -lclient -lvsn -lcommon -lgeneric -lnlsrtl3 -lcore4 -lnlsrtl3 -lcore4 -lnlsrtl3 `cat $ORACLE_HOME/lib/sysliblist` -R /opt/SUNWcluster/lib:$ORACLE_HOME/lib -Y P,$ORACLE_HOME/lib::$ORACLE_HOME/network/jre11/lib/sparc/native_threads:$APPL_TOP/cz/11.5.0/bin:$ORACLE_HOME/network/jre11/lib/sparc/native_threads:$APPL_TOP/cz/11.5.0/bin:/usr/dt/lib:/usr/openwin/lib:/opt/SUNWcluster/lib:/usr/ccs/lib:/usr/lib -Qy -lc -laio -lm $ORACLE_HOME/rdbms/lib/defopt.o $ORACLE_HOME/rdbms/lib/ssbbded.o
chmod 755 $ALR_TOP/bin/ALECTC
mcs -d $ALR_TOP/bin/ALECTC
Done with link of alr executable 'ALECTC' on Sun May 10 00:23:38 EDT 2009

Done relinking module ALECTC in product alr
Done with link of product 'alr' on Sun May 10 00:23:38 EDT 2009


adrelink is exiting with status 0

End of adrelink session
Date/time is Sun May 10 00:23:38 EDT 2009
**********************************************************

Line-wrapping log file for readability ...
Done line-wrapping log file.

Original copy is $APPL_TOP/admin/log/adrelink.lsv
New copy is $APPL_TOP/admin/log/adrelink.log

Issue resolved.

Must supply a DBC file name

After re-applying TXK Rollup S for an issue with TNS described in a previous post, JSP ping test launched through System Administration responsibility failed with this message:

http://jupiter.justanexample.com/OA_HTML/jsp/fnd/fndping.jsp

Must supply a DBC file name.

$FND_SECURE pointed to $FND_TOP/secure. We had $CONTEXT_NAME.dbc, $TWO_TASK.dbc files in it. The aoljtest also has JSP ping. But those JSP ping gave the result "Test is no longer available".

If I create the URL manually and add the dbc clause:

http://jupiter.justanexample.com/OA_HTML/jsp/fnd/fndping.jsp?dbc=$CONTEXT_NAME.dbc

So now the question is, why is it not creating the URL correctly.

Friday, May 8, 2009

Premature end of script headers FNDWRR.exe

Venkat pinged me today and said that they were unable to view log or output files of concurrent requests in a fresh clone. When you clicked on View Log or View Output button the new page which opened showed Internal server error.

The Apache error_log showed errors like this one:

Premature end of script headers: $COMMON_TOP/html/bin/FNDWRR.exe

Metalink Note 390989.1 talks about applying patch 3288588 and relinking FNDWRR.exe. However we are already on ATG RUP6 which includes this patch. So we just relinked FNDWRR.exe on the web node. However the error persisted.

Akhilesh ran autoconfig on a hunch. The issue was resolved after autoconfig. We still don't know what was causing the issue, but the straight solution is to run autoconfig.

Thursday, May 7, 2009

How many JVMs should we use in E-Business Suite

This question was asked by Guru today:

"How many JVMs should we configure per node ?"

I dug the answer out from a pitch given by Oracle Performance Group in Openworld:

OC4J/JVM
• Use one JVM per 2 CPUs
• No more than one JVM/CPU
• No more than 100 concurrent users per JVM

Wednesday, May 6, 2009

Workflow Mailer Architecture

We had a very strange issue in one of the Production instances today. The mount on which 10.2.0 ORACLE_HOME was present had filled up 100%, but doing a du -sh on it showed that 10.2.0 directory occupied 10GB out of the 27GB allocated. We had no idea what happened to the remaining 17 GB as it wasn't showing up in du -sh output. We shutdown the DB, but two processes would not come down even if we tried to kill them. Bill restarted Sendmail on a hunch, and those two processes came down and we had 17 GB free again. It is still a mystery, what really happened.

To understand this in detail I searched Metalink for Workflow Mailer Architecture and found note 172174.1 which describes the architecture. Here's an extract:

The main component of the Oracle Workflow Notification Mailer is the executable
WFMAIL. This is a server side program that queries the database for any pending
notifications. It then dispatches these notifications by calling sendmail for
UNIX and the MAPI APIs for Microsoft Windows NT. The notification mailer also
queries the local inbox for incoming messages. These messages are validated and
then passed to the database for response processing.

Next the mailer will go through an endless loop, only controlled by the presence
of the shutdown file. There are four stages to this loop:

1. Send Loop
2. Inbox Routing
3. Response Processing
4. Idle

1. Send Loop

The notification mailer above release 2.5.1 uses the SMTP queue.

Each of these WF_NOTIFICATIONS rows is verified that it is in the correct
status and is dispatched to the recipient that is defined within the
WF_NOTIFICATIONS.Recipient_Role column. The details of the recipient are
retrieved. Including the email_address, Territory and Language. If the email
address is blank, the mailer will assume that this is a role and will
attempt to send the notification to each of the members of the role.

Before the physical email is generated, the mailer will issue an ALTER
SESSION command to set the NLS settings to that of the recipient. The mailer
will then use the WF_MAIL.GetShortMessage API to generate the notification
contents.

The format of the notification is governed by the mail templates. These are
located in the System: Mailer (WFMAIL) item type, (viewable in Workflow
Builder). The template selected is governed by the status of the notification.

2. Inbox Routing

During Inbox Routing, the mailer opens the inbox (governed by the ACCOUNT=
configuration parameter in the wfmail.cfg file) and examines each incoming
email in turn. The mailer is specifically looking for incoming emails with
Notification ID specifications. These are strings in the form of:

a) NID[Notifcation ID/access key@node]
b) NID[Notifcation ID/access key@node][2]

Note: b) is used when the DIRECT_RESPONSE=Y configuration option is used.

If this string does not exist within the email, then the email is moved
directly to the discard file and a warning notification is generated and
sent back to the Reply-To address or the From address.

Acceptable emails, ones that contain Notification ID specifications, are
moved to the unprocessed file (UNPROCESS= configuration parameter).

For MAPI, the inbox is a MAPI folder. However, DISCARD, UNPROCESS and
PROCESS folders are implemented as files on the local file system of the
NT server.

Metalink Note 458665.1 gives these steps as the basic flow involved in the Inbound Mailer processing

1. Workflow Notification Mailer's inbound thread reads the response e-mail from IMAP inbox. It then parses the content and identifies the notification id using NID string in the response e-mail and it identifies the response it expects. Using this information it parses the e-mail and finds out the response value. Using this information it builds a XML message under event name oracle.apps.wf.notification.receive event and enqueues to WF_NOTIFICATION_IN.

2. Workflow Inbound Notifications Agent Listener listens to WF_NOTIFICATION_IN agent dispatches event oracle.apps.wf.notification.receive that executes WF_XML.RECEIVE rule function. This rule function parses the response XML and applies the response value to the corresponding notification. It then calls WF_NOTIFICATION.Respond that completes the Notification Activity and the complete flow following the Notification Activity till a blocking activity is encountered.

3. Response Processing

Once the Inbox Routing has finished, then the mailer only has to operate on
the unprocessed file(UNPROCESS= configuration parameter). The Inbox Routing
has ensured that the contents of the unprocessed file are intended for this
Oracle Workflow Instance.

Each incoming email is retrieved from the unprocessed file and passed to the
server side PL/SQL for final verification and applying the contents of the
email to the workflow process identified by the NID specification.

4. Idle

The Idle Sleeps for IDLE= n seconds, default is 30 seconds.

5. To troubleshoot Notification Mailer setup issues, refer to Note 125078.1


B. STARTUP AND SHUTDOWN NOTIFICATION MAILER
--------------------------------------------

The mailer program can be started from command line or through a concurrent
program called "Notification Mailer".

When the notification mailer is started as a concurrent request in Oracle
Applications, the OS process is run by the Applications Manager (APPLMGR) user.

To avoid looping issue as described in Note 77276.1, migrate your Workflow
Mailer to the APPLMGR account as per Note 158818.1

To start it from the application, submit the "Notification Mailer"
concurrent request.

To start it from command line on UNIX, execute the following command:
$FND_TOP/bin/WFMAIL apps/pwd@SID 0 Y $FND_TOP/resource/wfmail.cfg &

To start it from command line on NT, execute the following command:
drive:\orant\bin\wfmlr -f drive:\orant\wf\data\wfmail.cfg

To shutdown, create a file name "shutdown" as specified in the wfmail.cfg file
or terminate the concurrent program. When starting up, the mailer reads the
configuration file to initialize its parameters. It then checks for the shutdown
file, if it exists, the mailer terminates gracefully.


C. WORKFLOW DIRECTORY SERVICES
------------------------------

1. Directory Service is a directory repository that Oracle Workflow uses to
reference users and roles information. Notification Mailer relies on the
workflow directory service to route notifications to user/role.

Oracle Workflow Directory Service consists of the following views:

a) WF_USERS view: references information about all the individuals who may
receive workflow notifications.

b) WF_ROLES view: references information about all the roles in your
organization who may receive workflow notifications.

Note: the role information will override the user information when the
Notification System delivers a notification to the role.

c) WF_USER_ROLES: is an intersection of the users and roles in WF_USERS and
WF_ROLES.

Wfdirhrv.sql that creates the views is automatically installed when installing
Oracle Applications Release 11i. At any time, if you suspect that your Workflow
Directory Service is not being setup correctly, you may run the script to
recreate the views.

Script is located under $FND_TOP/admin/sql

Directory service is mapped to:

a) Oracle Human Resources tables
b) Oracle Application Object Library tables
c) WF_LOCAL tables
d) Various Applications tables

2. Run $FND_TOP/sql/wfdirchk.sql to troubleshoot workflow directory
service problem.

I'll update this post as I learn more.