Pages

Wednesday, November 7, 2012

EM Console Slowness and Stuck Thread Issue


EM Console Slowness and Stuck Thread Issue
Let me share my experience with EM (Enterprise Manager) console which got installed as part of Oracle Fusion Middleware 11g Installation.
Before I start discussing the problem and solution in details, here are some brief about the environment which we were running –

Item
Description
Domain
1 Admin Server and 22 Managed servers (includes 8 SOA mservers)
Machine
sun4v, 2 physical T4 servers having 12 logical LDOMs, 30 GB Memory and 4 CPU each LDOM. 2 Managed servers installed on each LDOM just for 11g application.
OS
SunOS 5.10 64 bit Installation
Database
11.2.0.2.0
FMW
11.1.1.6.

1         Problem Statement: 

In production when multiple users 15-20 start accessing the EM console, then it’s start behaving badly, response was too slow, login was not happening properly, was taking long time around 6-10 minutes, login time in all attempt was not consistent. In few of attempts we were not able to login at all, it was hanging during page loading process etc etc, for few attempts we were able to login but then internal links e.g. Dashboard tab page, Instance tab page, Fault and recovery tab page, most of the link were extremely slow.
At JVM level the threads which was getting initiated for EM console was getting declared as Stuck threads in AdminServer.out file and after few hours e.g. 4 hrs or so Admin server was getting crash since we have overload configuration implemented for all the server in our environment where we have defined if the Stuck thread counts increase more than 60 then server should stop itself.
All the time, only one error message which we were able to see in .out file is this –
<Jul 31, 2012 7:47:42 PM EST> <Error> <WebLogicServer> <BEA-000337> <[STUCK] ExecuteThread: '13' for queue: 'weblogic.kernel.Default (self-tuning)' has been busy for "635" seconds working on the request "Workmanager: default, Scheduled=true, Started=true, Started time: 635793 ms[GET /em/faces/as/as/wlFarmHome?target=Farm_PROD1-SOA-domain&type=oracle_ias_farm&_afrLoop=2532392215126770&_afrWindowMode=0&_afrWindowId=null HTTP/1.1User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Referer: http://<HostName.domainName>:<PortNumber>/em/faces/as/as/wlFarmHome?target=Farm_PROD1-SOA-domain&type=oracle_ias_farm&_afrLoop=2532193472836150&_afrWindowMode=0&_afrWindowId=null
Cookie: JSESSIONID=1nfKQXdBkkPY4nNTxfM4ZLtX19BcsbpFy2nQ2KJ01t4KwpR4NQyb!-86916873; s_pers=%20s_evar41%3D%255B%255B'e%253A21813dm%253Av%253Apers%253Acus%253Ambb%253Aredeemmodem%253A%253Aredeem-mega-modem-short-lnk'%252C'1303350152411'%255D%255D%7C1461202952411%3B%20s_prop19%3Df429da9da84831ca7aea17ae80f486d0%7C1334887942700%3B%20s_visit%3D1%7C1303353770759%3B%20s_prop25%3Dv%253Abus%253Acorporate-offer%7C1322794757110%3B%20s_nr%3D1322792957112%7C1325384957112%3B; s_vi=[CS]v1|26D7C5CA85149F1B-600001670000530B[CE]; PRX-LB=185106698.36895.0000; ADMINCONSOLESESSION=JxZXQXYXBnL2lpXCQZLphk56rm2Hj4jLTZx4yPnftNQPvkY4Lv9q!-86916873
Cache-Control: max-stale=0
Connection: Keep-Alive
X-BlueCoat-Via: C540871B14123F46
]", which is more than the configured time (StuckThreadMaxTime) of "600" seconds. Stack trace: null>

Above error message was not giving detailed information that at what steps EM is hanging, we have took several round of thread dump but were not able to conclude root cause of failure.
We have raised this concern with Oracle, also we took the help from all other accounts wherever FMW11g got implemented, SR run almost 4 month to resolve this issue, multiple changes we have done in various deployment windows and at last EM was stable, no login problem, always get logged in, all the internal tab pages were working fine and fetching the data in expected time.

2         Solution:

A series of changes has been implemented which I am going to list out here –

2.1       Change1: Enable the “Data Display Option”

Enable the checkbox for below options-

Disabled fetching of instance and fault matrices count.
Restrict display of instance and fault to the last 30 minute.



Initially we have reduced time for  “restrict display of instance and fault” from 24 hrs to 30min, later from 30min to 10min, and then finally 10min to 0min. So during initial login it will not fetch any records but instance can be pulled out on demand.

2.2       Change2: Increased cache timeout for discovery


During login process in Em console, EM does three things 1) Authentication 2) Discovery of targets and 3) loading the page.
Discovery of target step was taking too long around 10 minute since Domain was quite big and having a big list of targets.
Oracle note 1423893.1 has been implemented to cache the discovery result, so the sub sequent login attempt will be fast - Following given Mbeans attributes has been added into to improve performance – 

oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_CACHE_AGE
28800000
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_WAIT_TIME
30000
oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_USE_CACHED_RESULTS
True
LargeRepository
True

All timing for mbeans in milliseconds.
After implementing above change, very first user login will take time after restart of Admin and the discovery result will be stored in cache and sub sequent login attempt will be fast.




2.3       Change3: increased the PermSize of Admin server JVM

Max PERM size has been increased from 512M to 1024M
MEM_PERM_SIZE_64BIT="-XX:PermSize=1024m”

2.4       Change4: Decrease the frequency of DMS application

EM has one inbuilt application called Dynamic Monitoring System (DMS) which does the status collection for all the targets wherever DMS got deployed, if the frequency of DMS stats collection will be too fast then it makes EM console to slow down. To increase the frequency here are the steps-
Increased below parameters in dms_config.xml file given at this location –
$MiddlewareHome/fmw/soa11.1.1.6/oracle_common/modules/oracle.dms_11.1.1/server_config
prefetch intervalSeconds from 15 second to 120,
Discover intervalSeconds from 180 second to 300 seconds

### snap of last few line of dms_config.xml
  <collectorConfiguration>
    <prefetch intervalSeconds="120" removeCycle="2" isDefault="true"/>
    <prefetch intervalSeconds="300" removeCycle="3"/>
    <discover intervalSeconds="300"/>
    <queue maxSizePerServer="4" timeoutSeconds="240"/>
  </collectorConfiguration> 

N.B.
Above highlighted configuration was also suggested by Oracle but its was causing issue a lot and got rollbacked after some time.

2.5       Change5: Un-Target DMS application from unwanted servers.

In thread dumps which we took we realized that most of threads are getting stuck at DMS application level and decided to disabled DMS application for all servers.
In Weblogic Admin Console >> Domain>> Deployment >> DMS >> Target tab page >> uncheck all servers.
Initially we have disabled the DMS application for entire SOA cluster and OSB cluster, and observed huge significant improvement in logging time, instead of taking 2min or 4 min, now EM was hardly taking 30 seconds to login but internal links response time stand same as it was previously, but login time got improved a lots.
But with this configuration we have notice lots of other new issues where most of configuration in SOA-Infrastructure was not working, we were not able to make any configuration changes e.g. under FARM>>SOA>>SOA-INFRA>>SOA Administration>>Common Properties, BPEL properties etc.
All the components e.g. BPEL, Mediator, human workflow, business rule, fault and rejected message process everything got impacted once we disabled the DMS application for SOA cluster.
Due to above impact we have to again enable the DMS application for SOA servers alone.
Since DMS is quite bulky application which does stats collection from all the targets where it got deployed, so it was advisable to disable DMS for some of the targets where monitoring was not required to improve login response in EM console.
Weblogic Admin Console >> Domain>> Deployment >> DMS >> Target tab page >> check all SOA servers alone.

2.6       Change6: Unicast mode of communication has been implemented for SOA      and OSB cluster

Oracle note id 1437883.1 has been followed to implement the same-
If you have large domain, having multiple clusters e.g. Weblogic cluster, coherence cluster etc then preferred way for communication between all the members in cluster should be unicast.
In our environment Weblogic clustering was already enabled in Unicast mode but coherence cluster was using multicast mode and we have identified this using string which was present in setDomain.env file –
-Dtangosol.coherence.clusteraddress=227.7.7.9 -Dtangosol.coherence.clusterport=9778 -Dtangosol.coherence.log=jdk
Above entries got removed at domain level and for each managed server new configuration for coherence cluster got added –
e.g.  for OSB servers and SOA servers–
-Dtangosol.coherence.wka1=hostname1
-Dtangosol.coherence.wka2=hostname2
-Dtangosol.coherence.localhost= hostname1-app
-Dtangosol.coherence.localport=hostname1Port
-Dtangosol.coherence.wka1.port=hostname1Port
-Dtangosol.coherence.wka2.port=hostname2Port
Note: You can use same port number for all hostname, also if you need more logging for coherence component add this line in your setDomain.env file under JAVA_Option
-Dtangosol.coherence.log=stdout  # this will print details logs for members joining coherence cluster etc.

2.7       Change7: Disabled “BPEL recovery console” option from Dashboard


Once you logged in into EM console, during dashboard page loading, EM will try to fetch the data from “DLV_MESSAGE” tables for “invoke” and “callback” activities which take longer time since DLV_ MESSAGE table normally is quite huge.


In order to disable the same please perform this –
EM console>>Farm>>soa-infra>>Administration>>System Mbeam Browser>> Filter>> Type bean name "oracle.as.soainfra.config:name=soa-infra,*", >> In result click on "AduitConfig" attribute and change the "bpelRecoveryStatus=Off", default value for bpelRecoveryStatus is "on"

N.B.
In above configuration “O” should be capital letter of “Off”, small ‘o’ in ‘off’ will not work.

2.8       Change8: Added JVM parameter at domain level “-XX:+UseMembar”


In threads dump Oracle has identified that few of threads concurrently getting locked to one object, this mainly happening because they observed whenever thread state is getting change its not notifying to others waiting threads, to get rid from this situation above JVM parameter got added
###Snap of stack trace which we have noticed in our thread dump
"[STUCK] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=3 tid=0x00000001067c4800 nid=0x47 waiting on condition [0xffffffff5b4f9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0xfffffffeca281c30> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:867)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1197)
at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:214)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:290)
at oracle.adf.model.dcframe.DataControlFrameImpl.lock(DataControlFrameImpl.java:482)

2.9       Change9: Added JVM parameter at domain level “-Dweblogic.management.disableManagedServerNotifications=true”


Above JVM parameter has been suggested by Oracle, this parameter will help to reduce JMX notification which occur between Admin and Managed servers whenever any new component get added, any state get changes etc.
This configuration got added only for Admin Server Instance, not at Manage server level, and we have notice huge benefit in EM after this change, now EM was not at all hanging during login time, internal link were slow but at least multiple users were able to login into EM.

2.10  Change10: Modified JVM parameter to print more details information in Thread dump at Domain level–

Added the following Java option at domain level for all servers. This will help to provide more representative information in thread dumps which help during investigation..

“-XX:+PrintConcurrentLocks”

2.11  Change11: Patch applied in Environment specific to EM Console Slowness Problem

  1. Patch Number 12972254, download file name “p12972254_1036_Generic”
  2. Patch Number 14653405, download file name “fmw_diagnostics_14653405”
  3. Patch Number 14750864, download file name “p14750864_111160_Generic”
  4. Patch Number 14812586, download file name “p14812586_111160_Generic”
  5. Patch Number 15853874, download file name “p15853874_111160_Generic”
  6. Patch Number 15847755, download file name “p15847755_111160_Generic”
N.B.
You might not get above patches on Oracle Support website, since we got these patches directly uploaded into SR level. Also, please confirm with Oracle Support before deploying any PATCH into your environment, this given information is just for reference and does not be supported by any Vendor.

2.12  Change12: Identification of EM Queries for tunning purpose

Now after doing all sort of tunning at application level, login problem got resolved but still internal links to instance tab page, fault and rejected tab page and BPEL recovery console were taking long time to respond, then we started looking into AWR report and found that couple of queries are taking huge time around 20 minute as result EM is extremely slow for internal links and was keep spinning.
We manage to get the EM queries which EM runs in the background once users click on Instance, dashboard, fault and rejected message and BPEL recovery console tab pages.
Here are the queries –

2.12.1DASHBOARD - BPEL Recovery console queries

SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum  FROM (SELECT MESSAGE_GUID AS a1, DLV_TYPE AS a2, CIKEY AS a3, CLUSTER_NODE_ID AS a4, CLUSTER_NODE_KEY AS a5, COMPONENT_NAME AS a6, COMPONENT_TYPE AS a7, COMPOSITE_LABEL AS a8, COMPOSITE_NAME AS a9, COMPOSITE_REVISION AS a10, CONV_ID AS a11, DOMAIN_NAME AS a12, ECID AS a13, EVENT_NAME AS a14, EXT_INT1 AS a15, EXT_STRING1 AS a16, EXT_STRING2 AS a17, HEADER_PROPERTIES_BIN_FORMAT AS a18, HEADERS_REF_ID AS a19, OPERATION_NAME AS a20, PARTNER_LINK AS a21, PRIORITY AS a22, PROPERTIES AS a23, RECEIVE_DATE AS a24, RECOVER_COUNT AS a25, STATE AS a26, TENANT_ID AS a27, CONV_TYPE AS a28, RES_SUBSCRIBER AS a29 FROM DLV_MESSAGE WHERE ((((COMPONENT_TYPE = ?) AND (STATE IN (?, ?))) AND (RECEIVE_DATE <= ?)) AND (DLV_TYPE = ?)) ORDER BY RECEIVE_DATE DESC) a WHERE ROWNUM <= ?) WHERE rnum > ?
        bind => [bpel, 0, 1, 2012-10-31 12:07:36.044, 2, 1, 0]

SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum  FROM (SELECT MESSAGE_GUID AS a1, DLV_TYPE AS a2, CIKEY AS a3, CLUSTER_NODE_ID AS a4, CLUSTER_NODE_KEY AS a5, COMPONENT_NAME AS a6, COMPONENT_TYPE AS a7, COMPOSITE_LABEL AS a8, COMPOSITE_NAME AS a9, COMPOSITE_REVISION AS a10, CONV_ID AS a11, DOMAIN_NAME AS a12, ECID AS a13, EVENT_NAME AS a14, EXT_INT1 AS a15, EXT_STRING1 AS a16, EXT_STRING2 AS a17, HEADER_PROPERTIES_BIN_FORMAT AS a18, HEADERS_REF_ID AS a19, OPERATION_NAME AS a20, PARTNER_LINK AS a21, PRIORITY AS a22, PROPERTIES AS a23, RECEIVE_DATE AS a24, RECOVER_COUNT AS a25, STATE AS a26, TENANT_ID AS a27, MASTER_CONV_ID AS a28 FROM DLV_MESSAGE WHERE ((((COMPONENT_TYPE = ?) AND (STATE IN (?, ?))) AND (RECEIVE_DATE <= ?)) AND (DLV_TYPE = ?)) ORDER BY RECEIVE_DATE DESC) a WHERE ROWNUM <= ?) WHERE rnum > ?
        bind => [bpel, 0, 1, 2012-10-31 12:07:36.044, 1, 1, 0]

2.12.2Dashboard - Recent fault and rejected messages

SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum from (SELECT  f.CIKEY, f.NODE_ID, f.SCOPE_ID, f.COUNT_ID, f.FAULT_NAME, f.FAULT_TYPE, f.POLICY_NAME, f.POLICY_VERSION, f.POLICY_CATEGORY, f.POLICY_ECID, f.CREATION_DATE, f.MODIFY_DATE, f.MESSAGE, wi.LABEL, wi.CREATOR, wi.MODIFIER, wi.STATE, ci.ECID, ci.CMPST_ID, ci.DOMAIN_NAME, ci.COMPOSITE_NAME, ci.COMPONENT_NAME, ci.COMPOSITE_REVISION, ci.COMPOSITE_LABEL FROM CUBE_INSTANCE ci, WI_FAULT f LEFT JOIN WORK_ITEM wi ON f.CIKEY = wi.CIKEY AND f.NODE_ID = wi.NODE_ID AND f.SCOPE_ID = wi.SCOPE_ID AND f.COUNT_ID = wi.COUNT_ID WHERE ci.CIKEY =  f.CIKEY AND ci.COMPONENTTYPE = ? AND (f.FAULT_TYPE is null OR f.FAULT_TYPE = ?) AND ci.STATE != 9 AND (wi.STATE is null OR wi.STATE IN (9, 4, 13, 3))  ORDER BY f.CREATION_DATE DESC) a  where ROWNUM < ? ) where rnum > ?
        bind => [bpel, {http://schemas.oracle.com/bpel/extension}RuntimeFaultMessage, 5, 0]

2.12.3Dashboard - Recent composite Instance queries –

SELECT /*+ FIRST_ROWS(40) */ ID, CONVERSATION_ID, HAS_ASSOC, PARENT_ID, UPDATED_TIME, CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5, TEST_RUN_ID, INDEX3, TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME, COMPOSITE_DN, SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6, SOURCE_ACTION_NAME, INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT, BUSINESS_STATUS, VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM COMPOSITE_INSTANCE WHERE (CREATED_TIME >= :1 ) ORDER BY CREATED_TIME DESC
We got this query from AWR report and our guess is it’s being used for “recent instance” section,

2.12.4Instance tab page -

Again our guess is same query which is getting used for Dashboard tab page instance section is getting used for "Instance" tab page as well
SELECT /*+ FIRST_ROWS(40) */ ID, CONVERSATION_ID, HAS_ASSOC, PARENT_ID, UPDATED_TIME, CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5, TEST_RUN_ID, INDEX3, TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME, COMPOSITE_DN, SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6, SOURCE_ACTION_NAME, INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT, BUSINESS_STATUS, VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM COMPOSITE_INSTANCE WHERE (CREATED_TIME >= :1 ) ORDER BY CREATED_TIME DESC

2.12.5Instance tab page search query based on name vs title-

SELECT /*+ FIRST_ROWS(50) */ ID, CONVERSATION_ID, HAS_ASSOC, PARENT_ID, UPDATED_TIME, CREATED_BY, ECID, TITLE, TEST_RUN_NAME, INDEX5, TEST_RUN_ID, INDEX3, TEST_SUITE, INDEX1, TEST_CASE, BATCH_INDEX, SOURCE_NAME, COMPOSITE_DN, SOURCE_TYPE, CREATED_TIME, SOURCE_ACTION_TYPE, INDEX6, SOURCE_ACTION_NAME, INDEX2, STATE, BATCH_ID, LIVE_INSTANCES, TAGS, STATE_COUNT, BUSINESS_STATUS, VERSION, INDEX4, PARTITION_DATE, UPDATED_BY, TENANT_ID FROM COMPOSITE_INSTANCE WHERE (TITLE LIKE :1 ) ORDER BY CREATED_TIME DESC

2.12.6Fault and Rejected message tab page –

For this tab page we could see one parent query runs which fetch initial 40 records and then for each records its run another child sql query-

2.12.6.1  Parent query-

SELECT * FROM (SELECT /*+ FIRST_ROWS */ a.*, ROWNUM rnum from (SELECT  f.CIKEY, f.NODE_ID, f.SCOPE_ID, f.COUNT_ID, f.FAULT_NAME, f.FAULT_TYPE, f.POLICY_NAME, f.POLICY_VERSION, f.POLICY_CATEGORY, f.POLICY_ECID, f.CREATION_DATE, f.MODIFY_DATE, f.MESSAGE, wi.LABEL, wi.CREATOR, wi.MODIFIER, wi.STATE, ci.ECID, ci.CMPST_ID, ci.DOMAIN_NAME, ci.COMPOSITE_NAME, ci.COMPONENT_NAME, ci.COMPOSITE_REVISION, ci.COMPOSITE_LABEL
FROM CUBE_INSTANCE ci, WI_FAULT f LEFT
JOIN WORK_ITEM wi ON f.CIKEY = wi.CIKEY AND f.NODE_ID = wi.NODE_ID AND f.SCOPE_ID = wi.SCOPE_ID AND f.COUNT_ID = wi.COUNT_ID WHERE ci.CIKEY =  f.CIKEY AND ci.COMPONENTTYPE = ? AND ci.STATE != 9 AND (wi.STATE is null OR wi.STATE IN (9, 4, 13, 3))  ORDER BY f.CREATION_DATE DESC) a  where ROWNUM < ? ) where rnum > ?
        bind => [bpel, 40, 0]

2.12.6.2  Child query –

SELECT F.CIKEY, F.NODE_ID, F.SCOPE_ID, F.COUNT_ID, F.FAULT_NAME,
                F.FAULT_TYPE,F.CREATION_DATE,F.MODIFY_DATE,
                F.MESSAGE, WI.LABEL, WI.CREATOR,WI.MODIFIER,WI.STATE,CI.ECID,CI.CMPST_ID
                FROM WI_FAULT F LEFT JOIN WORK_ITEM WI ON F.CIKEY = WI.CIKEY
                AND F.NODE_ID = WI.NODE_ID AND F.SCOPE_ID = WI.SCOPE_ID AND F.COUNT_ID = WI.COUNT_ID
                LEFT JOIN CUBE_INSTANCE CI ON F.CIKEY = CI.CIKEY WHERE F.CIKEY = ? AND F.NODE_ID = ?
                AND F.SCOPE_ID = ? AND F.COUNT_ID = ?
        bind => [2050981, BpInv1, BpSeq3.17, 2]

                               
 SELECT F.CIKEY, F.NODE_ID, F.SCOPE_ID, F.COUNT_ID, F.FAULT_NAME,
                F.FAULT_TYPE,F.CREATION_DATE,F.MODIFY_DATE,
                F.MESSAGE, WI.LABEL, WI.CREATOR,WI.MODIFIER,WI.STATE,CI.ECID,CI.CMPST_ID
                FROM WI_FAULT F LEFT JOIN WORK_ITEM WI ON F.CIKEY = WI.CIKEY
                AND F.NODE_ID = WI.NODE_ID AND F.SCOPE_ID = WI.SCOPE_ID AND F.COUNT_ID = WI.COUNT_ID
                LEFT JOIN CUBE_INSTANCE CI ON F.CIKEY = CI.CIKEY WHERE F.CIKEY = ? AND F.NODE_ID = ?
                AND F.SCOPE_ID = ? AND F.COUNT_ID = ?
        bind => [2050969, BpInv6, BpSeq19.39, 2]

2.13  Change13: Created following Index to improve the EM queries execution response time–


Index Creation
Table Name
Column name
Index type
COMPOSITE_INSTANCE
created_time
Normal Index
CUBE_INSTANCE
CPMST_ID
Normal Index
COMPOSITE_INSTANCE
TITLE
STATE
Normal Index
Normal Index
DLV_MESSAGE 
 STATE
DLV_TYPE
 Composite Index

Change14: Explored direct links to access the EM console Instance tab page, fault and recovery tab page, dashboard tab etc.


If we login into EM console via normal login procedure, it takes quite a long time to navigate from Dashboard tab page to Instance tab page etc. To avoid this delay we have provide direct link for accessing these page based on user need-

2.13.1Instance Tab Page

&type=oracle_soainfra&selectedTab=instancesTab
N.B.
In above string replace hostname, domain name, port number and target value to prepare the URL for your environment, target name can be pulled out from this path- right click on “soa-infra(mservername) >> click to general >> a dialogue box will appear, copy the target name from there and replace in above string. Refer the above screen shot for path.

2.13.2Fault and Recovery Tab page

&type=oracle_soainfra&selectedTab=faultsTab

2.13.3Dashboard Tab Page

&type=oracle_soainfra&selectedTab=compositeTabs


2.14     Change 15: Manually execution of Stats pack at Database level for SOA_INFRA partition tables.

One more major impacting solution we have adapted to auto schedule stats collection process for partition tables.
During our analysis of AWR report, we have identified the long running queries and also gone through the explain plan of those query and concluded that SQL optimizer is not working as expected and possible reason behind that automatic database stats collection process updates the latest partition of SOA_INFRA tables where as the old partitions SOA_INFRA tables remains unchanged.
When EM runs the queries it gets two different explain plans according to the data exist into separate partition tables as result it take longer time to response.
In order to overcome this situation we have implemented manual stats collection on Tables level, rather than having stats collection at partition level for table. A cron job now run every mid night which does table scan to identify data changes %, if more than 5% data got changed in table it does the stats collection for that table. SOA_INFRA table list where stats collection process been schedule is given below-

2.14.1     Table List where DB Manual Stats collection is required -

TABLES PARTITIONED
AUDIT_COUNTER
AUDIT_DETAILS
AUDIT_TRAIL
CI_INDEXES
COMPOSITE_INSTANCE
COMPOSITE_INSTANCE_ASSOC
COMPOSITE_INSTANCE_FAULT
CUBE_INSTANCE
CUBE_SCOPE
DLV_MESSAGE
DLV_SUBSCRIPTION
DOCUMENT_CI_REF
DOCUMENT_DLV_MSG_REF
HEADERS_PROPERTIES
REFERENCE_INSTANCE
WI_FAULT
WLI_QS_REPORT_ATTRIBUTE
WORK_ITEM
XML_DOCUMENT


2.14.2     Call the below SQL statement to forcefully initiate DB stats collection for a table –

EXEC DBMS_STATS.gather_table_stats('SOAP_SOAINFRA', 'AUDIT_COUNTER',estimate_percent => 5);

Conclusion

Ultimately EM Stuck Thread problem and slowness problem got resolved; it took around 3 + months in our production environment to do all the changes in multiple deployment windows, now multiple users are accessing EM console and able to do their BAU work. Only regular purging is required for SOA_INFRA database to keep EM console stable in future as well.
Thanks to Oracle support guys and all my colleagues who were part of this exercise.
A special thanks goes to the Mr. Deepak Arora (Director, SOA/B2B Fusion Middleware Architects Team, “The A- Team”), Oracle to provide his supervision and drive this exercise. We got very useful information from him which enabled us to make environment stable.



Disclaimer 




While I try to make effort to ensure that material in this solution is accurate but such material does in no way constitute the provision of professional advice to the problem of your solution to your specific environment, this solution does not guarantee and accept any legal liabilities whatsoever arising due to the changes what is mention in solution, before making any suggested changes please take a advise form your Product Vendor and do through testing in Test environment before moving it to Production.


10 comments:

  1. hello. Very nice article indeed. helped a lot.
    One extra question: How many composite instances remain rezident in the database after frequently applied search? I mean...at what database footprint does this article apply? Have you used also partitioning also?

    thank you again

    ReplyDelete
  2. Hi. You've said you had this setting and then removed it because of problems:



    Can you say what those problems were? Thanks.

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. I mean the queueMaxSizePerServer setting.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Hi, Please can you provide the SR no or KM note where Oracle mentioned about adding JVM parameter at domain level “-Dweblogic.management.disableManagedServerNotifications=true”. This may have associated impact as some notifications from MS will be disabled to Admin server.
    Thanks

    ReplyDelete
    Replies
    1. hi, I asked Oracle about this and they said the below
      "It is impossible to say if the suggestions in the non-Oracle blog will help your particular situation.
      I have found some information where using this parameter is suggested but it does have side effects, such as incident logs not being generated properly in the event of a problem occurring. "

      Delete
  7. Hi Manish,

    Thanks for the blog its very useful.

    As you have speciefied in step 2 (2.2 Change2: Increased cache timeout for discovery) to change the cache i was able to find the elements but there no entries like this in any of the element key value.

    oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_CACHE_AGE

    oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_MAX_WAIT_TIME

    oracle.sysman.emas.discovery.wls.FMW_DISCOVERY_USE_CACHED_RESULTS

    and even those values are non editable please need you help on this.

    Regards
    Irfan



    ReplyDelete
  8. Really a good post and appreciate for sharing this with all

    ReplyDelete
  9. Brilliant! Exactly what we were looking for...Thank you sir!

    ReplyDelete