EM Console
Slowness and Stuck Thread Issue
Let me share my experience with EM (Enterprise Manager)
console which got installed as part of Oracle Fusion Middleware 11g Installation.
Before I start discussing the problem and solution in details, here are some brief about the environment which we were running –
Before I start discussing the problem and solution in details, here are some brief about the environment which we were running –
Item
|
Description
|
Domain
|
1 Admin Server and 22 Managed servers (includes 8 SOA mservers)
|
Machine
|
sun4v, 2 physical T4 servers having 12 logical LDOMs, 30
GB Memory and 4 CPU each LDOM. 2 Managed servers installed on each LDOM just
for 11g application.
|
OS
|
SunOS 5.10 64 bit Installation
|
Database
|
11.2.0.2.0
|
FMW
|
11.1.1.6.
|
1 Problem Statement:
In production when multiple users 15-20 start accessing the
EM console, then it’s start behaving badly, response was too slow, login
was not happening properly, was taking long time around 6-10 minutes, login time in all attempt was not consistent. In few of attempts we were not able to login at all, it was
hanging during page loading process etc etc, for few attempts we were able to login but then
internal links e.g. Dashboard tab page, Instance tab page, Fault and recovery
tab page, most of the link were extremely slow.
At JVM level the threads which was getting initiated for EM
console was getting declared as Stuck threads in AdminServer.out file and after
few hours e.g. 4 hrs or so Admin server was getting crash since we have overload
configuration implemented for all the server in our environment where we have
defined if the Stuck thread counts increase more than 60 then server should
stop itself.
All the time, only one error message which we were able to
see in .out file is this –
<Jul 31,
2012 7:47:42 PM EST> <Error> <WebLogicServer> <BEA-000337>
<[STUCK]
ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for
"635" seconds working on the request "Workmanager: default,
Scheduled=true, Started=true, Started time: 635793 ms[GET
/em/faces/as/as/wlFarmHome?target=Farm_PROD1-SOA-domain&type=oracle_ias_farm&_afrLoop=2532392215126770&_afrWindowMode=0&_afrWindowId=null
HTTP/1.1User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101
Firefox/8.0Accept:
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language:
en-us,en;q=0.5
Accept-Encoding:
gzip, deflate
Accept-Charset:
ISO-8859-1,utf-8;q=0.7,*;q=0.7
Referer:
http://<HostName.domainName>:<PortNumber>/em/faces/as/as/wlFarmHome?target=Farm_PROD1-SOA-domain&type=oracle_ias_farm&_afrLoop=2532193472836150&_afrWindowMode=0&_afrWindowId=null
Cookie:
JSESSIONID=1nfKQXdBkkPY4nNTxfM4ZLtX19BcsbpFy2nQ2KJ01t4KwpR4NQyb!-86916873;
s_pers=%20s_evar41%3D%255B%255B'e%253A21813dm%253Av%253Apers%253Acus%253Ambb%253Aredeemmodem%253A%253Aredeem-mega-modem-short-lnk'%252C'1303350152411'%255D%255D%7C1461202952411%3B%20s_prop19%3Df429da9da84831ca7aea17ae80f486d0%7C1334887942700%3B%20s_visit%3D1%7C1303353770759%3B%20s_prop25%3Dv%253Abus%253Acorporate-offer%7C1322794757110%3B%20s_nr%3D1322792957112%7C1325384957112%3B;
s_vi=[CS]v1|26D7C5CA85149F1B-600001670000530B[CE]; PRX-LB=185106698.36895.0000;
ADMINCONSOLESESSION=JxZXQXYXBnL2lpXCQZLphk56rm2Hj4jLTZx4yPnftNQPvkY4Lv9q!-86916873
Cache-Control:
max-stale=0
Connection:
Keep-Alive
X-BlueCoat-Via:
C540871B14123F46
]",
which is more than the configured time (StuckThreadMaxTime) of "600"
seconds. Stack trace: null>
Above error message was not giving detailed information that
at what steps EM is hanging, we have took several round of thread dump but were
not able to conclude root cause of failure.
We have raised this concern with Oracle, also we took the
help from all other accounts wherever FMW11g got implemented, SR run almost 4
month to resolve this issue, multiple changes we have done in various
deployment windows and at last EM was stable, no login problem, always get
logged in, all the internal tab pages were working fine and fetching the data
in expected time.
2 Solution:
A series of changes has been implemented which I am going to
list out here –
2.1 Change1: Enable the “Data Display Option”
Enable the checkbox for below options-
Disabled fetching of instance and fault matrices count.
Restrict display of instance and fault to the last 30
minute.
Initially
we have reduced time for “restrict
display of instance and fault” from 24 hrs to 30min, later from 30min to 10min,
and then finally 10min to 0min. So during initial login it will not fetch any
records but instance can be pulled out on demand.
2.2 Change2: Increased cache timeout for discovery
During login process in Em console, EM does three things 1)
Authentication 2) Discovery of targets and 3) loading the page.
Discovery of target step was taking too long around 10
minute since Domain was quite big and having a big list of targets.
Oracle note 1423893.1 has been implemented to cache the
discovery result, so the sub sequent login attempt will be fast - Following
given Mbeans attributes has been added into to improve performance –
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_CACHE_AGE
|
28800000
|
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_WAIT_TIME
|
30000
|
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_USE_CACHED_RESULTS
|
True
|
LargeRepository
|
True
|
All timing for mbeans in milliseconds.
After implementing above change, very first user login will
take time after restart of Admin and the discovery result will be stored in
cache and sub sequent login attempt will be fast.
2.3 Change3: increased the PermSize of Admin server JVM
Max PERM size has been increased from 512M to 1024M
MEM_PERM_SIZE_64BIT="-XX:PermSize=1024m”
2.4 Change4: Decrease the frequency of DMS application
EM has one inbuilt application called Dynamic Monitoring
System (DMS) which does the status collection for all the targets wherever DMS
got deployed, if the frequency of DMS stats collection will be too fast then it
makes EM console to slow down. To increase the frequency here are the steps-
Increased below parameters in dms_config.xml file given at
this location –
$MiddlewareHome/fmw/soa11.1.1.6/oracle_common/modules/oracle.dms_11.1.1/server_config
prefetch
intervalSeconds from 15 second to 120,
Discover
intervalSeconds from 180 second to 300 seconds
### snap of last few line of dms_config.xml
<collectorConfiguration>
<prefetch
intervalSeconds="120" removeCycle="2"
isDefault="true"/>
<prefetch
intervalSeconds="300" removeCycle="3"/>
<discover
intervalSeconds="300"/>
<queue
maxSizePerServer="4" timeoutSeconds="240"/>
</collectorConfiguration>
N.B.
Above
highlighted configuration was also suggested by Oracle but its was causing
issue a lot and got rollbacked after some time.
2.5 Change5: Un-Target DMS application from unwanted servers.
In thread dumps which we took we realized that most of
threads are getting stuck at DMS application level and decided to
disabled DMS application for all servers.
In Weblogic Admin Console >> Domain>> Deployment
>> DMS >> Target tab page >> uncheck all servers.
Initially we have disabled the DMS application for entire
SOA cluster and OSB cluster, and observed huge significant improvement in
logging time, instead of taking 2min or 4 min, now EM was hardly taking 30
seconds to login but internal links response time stand same as it was
previously, but login time got improved a lots.
But with this configuration we have notice lots of other new
issues where most of configuration in SOA-Infrastructure was not working, we were
not able to make any configuration changes e.g. under
FARM>>SOA>>SOA-INFRA>>SOA Administration>>Common
Properties, BPEL properties etc.
All the components e.g.
BPEL, Mediator, human workflow, business rule, fault and rejected message
process everything got impacted once we disabled the DMS application for SOA
cluster.
Due to above impact we have to again enable the DMS
application for SOA servers alone.
Since DMS is quite bulky application which does stats collection
from all the targets where it got deployed, so it was advisable to disable DMS
for some of the targets where monitoring was not required to improve login response
in EM console.
Weblogic Admin Console >> Domain>> Deployment
>> DMS >> Target tab page >> check all SOA servers alone.
2.6 Change6: Unicast mode of communication has been implemented for SOA and OSB cluster
Oracle note id 1437883.1
has been followed to implement the same-
If you have large domain, having multiple clusters e.g. Weblogic
cluster, coherence cluster etc then preferred way for communication between all
the members in cluster should be unicast.
In our environment Weblogic clustering was already enabled
in Unicast mode but coherence cluster was using multicast mode and we have
identified this using string which was present in setDomain.env file –
-Dtangosol.coherence.clusteraddress=227.7.7.9
-Dtangosol.coherence.clusterport=9778 -Dtangosol.coherence.log=jdk
Above entries got removed at domain level and for each
managed server new configuration for coherence cluster got added –
e.g. for OSB servers and
SOA servers–
-Dtangosol.coherence.wka1=hostname1
-Dtangosol.coherence.wka2=hostname2
-Dtangosol.coherence.localhost= hostname1-app
-Dtangosol.coherence.localport=hostname1Port
-Dtangosol.coherence.wka1.port=hostname1Port
-Dtangosol.coherence.wka2.port=hostname2Port
Note: You can use same port number for all hostname, also if
you need more logging for coherence component add this line in your
setDomain.env file under JAVA_Option
-Dtangosol.coherence.log=stdout # this will print details logs for members
joining coherence cluster etc.
2.7 Change7: Disabled “BPEL recovery console” option from Dashboard
Once you logged in into EM console, during dashboard page loading,
EM will try to fetch the data from “DLV_MESSAGE” tables for “invoke” and “callback”
activities which take longer time since DLV_ MESSAGE table normally is quite
huge.
In order to disable the same please perform this –
EM
console>>Farm>>soa-infra>>Administration>>System Mbeam
Browser>> Filter>> Type bean name
"oracle.as.soainfra.config:name=soa-infra,*", >> In result
click on "AduitConfig" attribute and change the
"bpelRecoveryStatus=Off", default value for bpelRecoveryStatus is
"on"
N.B.
In above configuration “O” should be capital letter of “Off”,
small ‘o’ in ‘off’ will not work.
2.8 Change8: Added JVM parameter at domain level “-XX:+UseMembar”
In threads dump Oracle has identified that few of threads
concurrently getting locked to one object, this mainly happening because they
observed whenever thread state is getting change its not notifying to others waiting
threads, to get rid from this situation above JVM parameter got added
###Snap of
stack trace which we have noticed in our thread dump
"[STUCK]
ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'"
daemon prio=3 tid=0x00000001067c4800 nid=0x47 waiting on condition
[0xffffffff5b4f9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0xfffffffeca281c30> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at oracle.adf.model.dcframe.DataControlFrameImpl.lock(DataControlFrameImpl.java:482)
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0xfffffffeca281c30> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at oracle.adf.model.dcframe.DataControlFrameImpl.lock(DataControlFrameImpl.java:482)
2.9 Change9: Added JVM parameter at domain level “-Dweblogic.management.disableManagedServerNotifications=true”
Above JVM
parameter has been suggested by Oracle, this parameter will help to reduce JMX
notification which occur between Admin and Managed servers whenever any new
component get added, any state get changes etc.
This configuration got added only for Admin Server Instance,
not at Manage server level, and we have notice huge benefit in EM after this
change, now EM was not at all hanging during login time, internal link were
slow but at least multiple users were able to login into EM.
2.10 Change10: Modified JVM parameter to print more details information in Thread dump at Domain level–
Added
the following Java option at domain level for all servers. This will help to provide
more representative information in thread dumps which help during investigation..
“-XX:+PrintConcurrentLocks”
2.11 Change11: Patch applied in Environment specific to EM Console Slowness Problem
- Patch Number 12972254, download file name “p12972254_1036_Generic”
- Patch Number 14653405, download file name “fmw_diagnostics_14653405”
- Patch Number 14750864, download file name “p14750864_111160_Generic”
- Patch Number 14812586, download file name “p14812586_111160_Generic”
- Patch Number 15853874, download file name “p15853874_111160_Generic”
- Patch Number 15847755, download file name “p15847755_111160_Generic”
N.B.
You might not get above patches on Oracle Support website,
since we got these patches directly uploaded into SR level. Also, please confirm with Oracle Support before deploying any PATCH into your environment, this given information is just for reference and does not be supported by any Vendor.
2.12 Change12: Identification of EM Queries for tunning purpose
Now after doing all sort of tunning at application level,
login problem got resolved but still internal links to instance tab page, fault
and rejected tab page and BPEL recovery console were taking long time to
respond, then we started looking into AWR report and found that couple of
queries are taking huge time around 20 minute as result EM is extremely slow
for internal links and was keep spinning.
We manage to get the EM queries which EM runs in the
background once users click on Instance, dashboard, fault and rejected message
and BPEL recovery console tab pages.
Here are the queries –
2.12.1DASHBOARD - BPEL Recovery console queries
SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM
rnum FROM (SELECT MESSAGE_GUID AS a1,
DLV_TYPE AS a2, CIKEY AS a3, CLUSTER_NODE_ID AS a4, CLUSTER_NODE_KEY AS a5,
COMPONENT_NAME AS a6, COMPONENT_TYPE AS a7, COMPOSITE_LABEL AS a8,
COMPOSITE_NAME AS a9, COMPOSITE_REVISION AS a10, CONV_ID AS a11, DOMAIN_NAME AS
a12, ECID AS a13, EVENT_NAME AS a14, EXT_INT1 AS a15, EXT_STRING1 AS a16,
EXT_STRING2 AS a17, HEADER_PROPERTIES_BIN_FORMAT AS a18, HEADERS_REF_ID AS a19,
OPERATION_NAME AS a20, PARTNER_LINK AS a21, PRIORITY AS a22, PROPERTIES AS a23,
RECEIVE_DATE AS a24, RECOVER_COUNT AS a25, STATE AS a26, TENANT_ID AS a27,
CONV_TYPE AS a28, RES_SUBSCRIBER AS a29 FROM DLV_MESSAGE WHERE
((((COMPONENT_TYPE = ?) AND (STATE IN (?, ?))) AND (RECEIVE_DATE <= ?)) AND
(DLV_TYPE = ?)) ORDER BY RECEIVE_DATE DESC) a WHERE ROWNUM <= ?) WHERE rnum
> ?
bind =>
[bpel, 0, 1, 2012-10-31 12:07:36.044, 2, 1, 0]
SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM
rnum FROM (SELECT MESSAGE_GUID AS a1,
DLV_TYPE AS a2, CIKEY AS a3, CLUSTER_NODE_ID AS a4, CLUSTER_NODE_KEY AS a5,
COMPONENT_NAME AS a6, COMPONENT_TYPE AS a7, COMPOSITE_LABEL AS a8,
COMPOSITE_NAME AS a9, COMPOSITE_REVISION AS a10, CONV_ID AS a11, DOMAIN_NAME AS
a12, ECID AS a13, EVENT_NAME AS a14, EXT_INT1 AS a15, EXT_STRING1 AS a16,
EXT_STRING2 AS a17, HEADER_PROPERTIES_BIN_FORMAT AS a18, HEADERS_REF_ID AS a19,
OPERATION_NAME AS a20, PARTNER_LINK AS a21, PRIORITY AS a22, PROPERTIES AS a23,
RECEIVE_DATE AS a24, RECOVER_COUNT AS a25, STATE AS a26, TENANT_ID AS a27, MASTER_CONV_ID
AS a28 FROM DLV_MESSAGE WHERE ((((COMPONENT_TYPE = ?) AND (STATE IN (?, ?)))
AND (RECEIVE_DATE <= ?)) AND (DLV_TYPE = ?)) ORDER BY RECEIVE_DATE DESC) a
WHERE ROWNUM <= ?) WHERE rnum > ?
bind =>
[bpel, 0, 1, 2012-10-31 12:07:36.044, 1, 1, 0]
2.12.2Dashboard - Recent fault and rejected messages
SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum
from (SELECT f.CIKEY, f.NODE_ID,
f.SCOPE_ID, f.COUNT_ID, f.FAULT_NAME, f.FAULT_TYPE, f.POLICY_NAME,
f.POLICY_VERSION, f.POLICY_CATEGORY, f.POLICY_ECID, f.CREATION_DATE,
f.MODIFY_DATE, f.MESSAGE, wi.LABEL, wi.CREATOR, wi.MODIFIER, wi.STATE, ci.ECID,
ci.CMPST_ID, ci.DOMAIN_NAME, ci.COMPOSITE_NAME, ci.COMPONENT_NAME,
ci.COMPOSITE_REVISION, ci.COMPOSITE_LABEL FROM CUBE_INSTANCE ci, WI_FAULT f
LEFT JOIN WORK_ITEM wi ON f.CIKEY = wi.CIKEY AND f.NODE_ID = wi.NODE_ID AND
f.SCOPE_ID = wi.SCOPE_ID AND f.COUNT_ID = wi.COUNT_ID WHERE ci.CIKEY = f.CIKEY AND ci.COMPONENTTYPE = ? AND
(f.FAULT_TYPE is null OR f.FAULT_TYPE = ?) AND ci.STATE != 9 AND (wi.STATE is null
OR wi.STATE IN (9, 4, 13, 3)) ORDER BY
f.CREATION_DATE DESC) a where ROWNUM
< ? ) where rnum > ?
bind =>
[bpel, {http://schemas.oracle.com/bpel/extension}RuntimeFaultMessage, 5, 0]
2.12.3Dashboard - Recent composite Instance queries –
SELECT /*+ FIRST_ROWS(40) */ ID, CONVERSATION_ID, HAS_ASSOC,
PARENT_ID, UPDATED_TIME, CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5,
TEST_RUN_ID, INDEX3, TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME,
COMPOSITE_DN, SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6, SOURCE_ACTION_NAME,
INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT, BUSINESS_STATUS,
VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM COMPOSITE_INSTANCE
WHERE (CREATED_TIME >= :1 ) ORDER BY CREATED_TIME DESC
We got this query from AWR report and our guess is it’s
being used for “recent instance” section,
2.12.4Instance tab page -
Again our guess is same query which is getting used for
Dashboard tab page instance section is getting used for "Instance"
tab page as well
SELECT /*+ FIRST_ROWS(40) */ ID, CONVERSATION_ID, HAS_ASSOC,
PARENT_ID, UPDATED_TIME, CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5,
TEST_RUN_ID, INDEX3, TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME,
COMPOSITE_DN, SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6,
SOURCE_ACTION_NAME, INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT,
BUSINESS_STATUS, VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM
COMPOSITE_INSTANCE WHERE (CREATED_TIME >= :1 ) ORDER BY CREATED_TIME DESC
2.12.5Instance tab page search query based on name vs title-
SELECT
/*+ FIRST_ROWS(50) */ ID, CONVERSATION_ID, HAS_ASSOC, PARENT_ID, UPDATED_TIME,
CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5, TEST_RUN_ID, INDEX3,
TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME, COMPOSITE_DN,
SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6, SOURCE_ACTION_NAME,
INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT, BUSINESS_STATUS,
VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM COMPOSITE_INSTANCE
WHERE (TITLE LIKE :1 ) ORDER BY CREATED_TIME DESC
2.12.6Fault and Rejected message tab page –
For this tab page we could see one parent query runs which
fetch initial 40 records and then for each records its run another child sql
query-
2.12.6.1 Parent query-
SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum
from (SELECT f.CIKEY, f.NODE_ID,
f.SCOPE_ID, f.COUNT_ID, f.FAULT_NAME, f.FAULT_TYPE, f.POLICY_NAME,
f.POLICY_VERSION, f.POLICY_CATEGORY, f.POLICY_ECID, f.CREATION_DATE,
f.MODIFY_DATE, f.MESSAGE, wi.LABEL, wi.CREATOR, wi.MODIFIER, wi.STATE, ci.ECID,
ci.CMPST_ID, ci.DOMAIN_NAME, ci.COMPOSITE_NAME, ci.COMPONENT_NAME,
ci.COMPOSITE_REVISION, ci.COMPOSITE_LABEL
FROM CUBE_INSTANCE ci, WI_FAULT f LEFT
JOIN WORK_ITEM wi ON f.CIKEY = wi.CIKEY AND f.NODE_ID =
wi.NODE_ID AND f.SCOPE_ID = wi.SCOPE_ID AND f.COUNT_ID = wi.COUNT_ID WHERE
ci.CIKEY = f.CIKEY AND ci.COMPONENTTYPE
= ? AND ci.STATE != 9 AND (wi.STATE is null OR wi.STATE IN (9, 4, 13, 3)) ORDER BY f.CREATION_DATE DESC) a where ROWNUM < ? ) where rnum > ?
bind =>
[bpel, 40, 0]
2.12.6.2 Child query –
SELECT F.CIKEY, F.NODE_ID, F.SCOPE_ID, F.COUNT_ID,
F.FAULT_NAME,
F.FAULT_TYPE,F.CREATION_DATE,F.MODIFY_DATE,
F.MESSAGE, WI.LABEL, WI.CREATOR,WI.MODIFIER,WI.STATE,CI.ECID,CI.CMPST_ID
FROM
WI_FAULT F LEFT JOIN WORK_ITEM WI ON F.CIKEY = WI.CIKEY
AND
F.NODE_ID = WI.NODE_ID AND F.SCOPE_ID = WI.SCOPE_ID AND F.COUNT_ID =
WI.COUNT_ID
LEFT
JOIN CUBE_INSTANCE CI ON F.CIKEY = CI.CIKEY WHERE F.CIKEY = ? AND F.NODE_ID = ?
AND
F.SCOPE_ID = ? AND F.COUNT_ID = ?
bind =>
[2050981, BpInv1, BpSeq3.17, 2]
SELECT F.CIKEY,
F.NODE_ID, F.SCOPE_ID, F.COUNT_ID, F.FAULT_NAME,
F.FAULT_TYPE,F.CREATION_DATE,F.MODIFY_DATE,
F.MESSAGE, WI.LABEL, WI.CREATOR,WI.MODIFIER,WI.STATE,CI.ECID,CI.CMPST_ID
FROM
WI_FAULT F LEFT JOIN WORK_ITEM WI ON F.CIKEY = WI.CIKEY
AND
F.NODE_ID = WI.NODE_ID AND F.SCOPE_ID = WI.SCOPE_ID AND F.COUNT_ID =
WI.COUNT_ID
LEFT
JOIN CUBE_INSTANCE CI ON F.CIKEY = CI.CIKEY WHERE F.CIKEY = ? AND F.NODE_ID = ?
AND F.SCOPE_ID = ? AND F.COUNT_ID = ?
bind =>
[2050969, BpInv6, BpSeq19.39, 2]
2.13 Change13: Created following Index to improve the EM queries execution response time–
Index Creation
|
||
Table Name
|
Column name
|
Index type
|
COMPOSITE_INSTANCE
|
created_time
|
Normal Index
|
CUBE_INSTANCE
|
CPMST_ID
|
Normal Index
|
COMPOSITE_INSTANCE
|
TITLE
STATE
|
Normal Index
Normal Index
|
DLV_MESSAGE
|
STATE
DLV_TYPE
|
Composite Index
|
Change14: Explored direct links to access the EM console Instance tab page, fault and recovery tab page, dashboard tab etc.
If we login into EM console via normal login procedure, it
takes quite a long time to navigate from Dashboard tab page to Instance tab
page etc. To avoid this delay we have provide direct link for accessing these
page based on user need-
2.13.1Instance Tab Page
&type=oracle_soainfra&selectedTab=instancesTab
N.B.
In above string replace hostname, domain name, port number
and target value to prepare the URL for your environment, target name can be
pulled out from this path- right click on “soa-infra(mservername) >> click
to general >> a dialogue box will appear, copy the target name from there
and replace in above string. Refer the above screen shot for path.
2.13.2Fault and Recovery Tab page
&type=oracle_soainfra&selectedTab=faultsTab
2.13.3Dashboard Tab Page
&type=oracle_soainfra&selectedTab=compositeTabs
2.14 Change 15: Manually execution of Stats pack at Database level for SOA_INFRA partition tables.
One more major impacting solution we have adapted to auto schedule
stats collection process for partition tables.
During our analysis of AWR report, we have identified the
long running queries and also gone through
the explain plan of those query and
concluded that SQL optimizer is not working as expected and possible reason
behind that automatic database stats collection process updates the latest
partition of SOA_INFRA tables where as the old partitions SOA_INFRA tables remains
unchanged.
When EM runs the queries it gets two different explain plans
according to the data exist into separate partition tables as result it take
longer time to response.
In order to overcome this situation we have implemented
manual stats collection on Tables level, rather than having stats collection at
partition level for table. A cron job now run every mid night which does table
scan to identify data changes %, if more than 5% data got changed in table it
does the stats collection for that table. SOA_INFRA table list where stats
collection process been schedule is given below-
2.14.1 Table List where DB Manual Stats collection is required -
TABLES PARTITIONED
|
AUDIT_COUNTER
|
AUDIT_DETAILS
|
AUDIT_TRAIL
|
CI_INDEXES
|
COMPOSITE_INSTANCE
|
COMPOSITE_INSTANCE_ASSOC
|
COMPOSITE_INSTANCE_FAULT
|
CUBE_INSTANCE
|
CUBE_SCOPE
|
DLV_MESSAGE
|
DLV_SUBSCRIPTION
|
DOCUMENT_CI_REF
|
DOCUMENT_DLV_MSG_REF
|
HEADERS_PROPERTIES
|
REFERENCE_INSTANCE
|
WI_FAULT
|
WLI_QS_REPORT_ATTRIBUTE
|
WORK_ITEM
|
XML_DOCUMENT
|
2.14.2 Call the below SQL statement to forcefully initiate DB stats collection for a table –
EXEC DBMS_STATS.gather_table_stats('SOAP_SOAINFRA',
'AUDIT_COUNTER',estimate_percent => 5);
Conclusion
Ultimately EM Stuck Thread problem and slowness problem got resolved;
it took around 3 + months in our production environment to do all the changes
in multiple deployment windows, now multiple users are accessing EM console and
able to do their BAU work. Only regular purging is required for SOA_INFRA database
to keep EM console stable in future as well.
Thanks to Oracle support guys and all my colleagues who were
part of this exercise.
A special thanks goes to the Mr. Deepak Arora (Director,
SOA/B2B Fusion Middleware Architects Team, “The A- Team”), Oracle to provide
his supervision and drive this exercise. We got very useful information from him which enabled us to
make environment stable.
Disclaimer
While I try to make effort to ensure that material in this solution
is accurate but such material does in no way constitute the provision of
professional advice to the problem of your solution to your specific
environment, this solution does not guarantee and accept any legal liabilities
whatsoever arising due to the changes what is mention in solution, before
making any suggested changes please take a advise form your Product Vendor and do
through testing in Test environment before moving it to Production.
hello. Very nice article indeed. helped a lot.
ReplyDeleteOne extra question: How many composite instances remain rezident in the database after frequently applied search? I mean...at what database footprint does this article apply? Have you used also partitioning also?
thank you again
Hi. You've said you had this setting and then removed it because of problems:
ReplyDeleteCan you say what those problems were? Thanks.
This comment has been removed by the author.
ReplyDeleteI mean the queueMaxSizePerServer setting.
ReplyDeleteThis comment has been removed by the author.
ReplyDeleteHi, Please can you provide the SR no or KM note where Oracle mentioned about adding JVM parameter at domain level “-Dweblogic.management.disableManagedServerNotifications=true”. This may have associated impact as some notifications from MS will be disabled to Admin server.
ReplyDeleteThanks
hi, I asked Oracle about this and they said the below
Delete"It is impossible to say if the suggestions in the non-Oracle blog will help your particular situation.
I have found some information where using this parameter is suggested but it does have side effects, such as incident logs not being generated properly in the event of a problem occurring. "
Hi Manish,
ReplyDeleteThanks for the blog its very useful.
As you have speciefied in step 2 (2.2 Change2: Increased cache timeout for discovery) to change the cache i was able to find the elements but there no entries like this in any of the element key value.
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_CACHE_AGE
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_WAIT_TIME
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_USE_CACHED_RESULTS
and even those values are non editable please need you help on this.
Regards
Irfan
Really a good post and appreciate for sharing this with all
ReplyDeleteBrilliant! Exactly what we were looking for...Thank you sir!
ReplyDelete