
10.30.2011

OutOfMemoryError – Oracle Service Bus Alerts

This case study describes the complete steps from root cause analysis to resolution of a sudden OutOfMemoryError problem experienced with Oracle Service Bus 2.6 running on Weblogic Server 9.2.
This case study will also demonstrate how you can combine Thread Dump analysis and Solaris PRSTAT to pinpoint the culprit Thread(s) consuming a large amount of memory on your application Java Heap.

Environment specifications

-         Java EE server: Oracle Service Bus 2.6 & Weblogic Server 9.2
-         OS: Solaris 10
-         JDK: Sun HotSpot 1.5 SR7
-         RDBMS: Oracle 10g
-         Platform type: Enterprise Service Bus

Troubleshooting tools

-         Quest Foglight for Java EE
-         Java VM Thread Dump (for stuck Thread analysis)
-         Solaris PRSTAT (for Thread CPU contributors breakdown)
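
Thread Dump snapshots on Solaris are typically captured with kill -3 <Java Pid> or the JDK jstack utility against the running Weblogic JVM. As a small illustrative sketch only (the class name below is mine, not part of the original troubleshooting setup), the same stack trace data can also be pulled programmatically via java.lang.Thread.getAllStackTraces():

import java.util.Map;

// Minimal sketch: print the stack trace of every live thread from inside the JVM.
// Illustrative only; the dumps in this case study were taken externally against
// the running Weblogic server.
public class ThreadDumpSketch {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            Thread t = entry.getKey();
            // Header roughly similar to a HotSpot thread dump line: name, id, state
            System.out.printf("\"%s\" id=%d state=%s%n", t.getName(), t.getId(), t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("        at " + frame);
            }
            System.out.println();
        }
    }
}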

Problem overview

-         Problem type: OutOfMemoryError

An OutOfMemoryError problem was observed in our Oracle Service Bus Weblogic logs along with major performance degradation.

A preventive restart strategy was then implemented, since a Java Heap memory leak was initially suspected, but it did not prove to be effective.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? No
·        How long has this problem been observed? For several months, with no identified root cause
·        Is the OutOfMemoryError happening suddenly or over time? It was observed that the Java Heap suddenly increases and can then take several hours before failing with an OutOfMemoryError
·        Did a restart of the Weblogic server resolve the problem? No, a preventive restart was attempted but did not prove to be effective

-         Conclusion #1: The sudden increase of the Java Heap could indicate long-running Thread(s) consuming a large amount of memory in a short amount of time

-         Conclusion #2: A Java Heap leak cannot be ruled out at this point

Java Heap monitoring using Foglight

The Quest Foglight monitoring tool was used to monitor the trend of the Oracle Service Bus Java Heap for each of the managed servers. A sudden increase of the Java Heap OldGen space was observed.

This type of memory problem normally does not indicate a linear or long-term memory leak but rather rogue Thread(s) consuming a large amount of memory in a short amount of time. In this scenario, my recommendation is to first perform Thread Dump analysis before jumping too quickly into JVM Heap Dump analysis.
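
Quest Foglight was used here, but a similar OldGen trend can be sampled with the standard java.lang.management API if no commercial monitoring tool is available. The sketch below is a minimal, hypothetical sampler (the class name and the 30-second interval are my own choices), not a replacement for the original Foglight setup:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Minimal sketch: sample every heap memory pool (including the tenured/OldGen
// space) so that a sudden usage jump can be spotted over time.
public class HeapPoolSampler {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() == MemoryType.HEAP) {
                    MemoryUsage usage = pool.getUsage();
                    // getMax() may return -1 if the pool maximum is undefined
                    System.out.printf("%-25s used=%d MB max=%d MB%n",
                            pool.getName(),
                            usage.getUsed() / (1024 * 1024),
                            usage.getMax() / (1024 * 1024));
                }
            }
            System.out.println("----");
            Thread.sleep(30000); // sample every 30 seconds (arbitrary interval)
        }
    }
}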


Thread Dump analysis

Thread Dump snapshots were generated during the sudden Java Heap increase, and the analysis did reveal a primary culprit, as you can see below.

Analysis of the different Thread Dump snapshots revealed a very long elapsed time for this Thread, along with a perfect correlation with our Java Heap increase. As you can see, this Thread is triggered by the ALSB alerting layer and is involved in read/write operations against the Weblogic internal diagnostic File Store.

"pool-2-thread-1" prio=10 tid=0x01952650 nid=0x4c in Object.wait() [0x537fe000..0x537ff8f0]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        at weblogic.common.CompletionRequest.getResult(CompletionRequest.java:109)
        - locked <0xf06ca710> (a weblogic.common.CompletionRequest)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:657)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:630)
        at weblogic.diagnostics.archive.wlstore.PersistentRecordIterator.fill(PersistentRecordIterator.java:87)
        at weblogic.diagnostics.archive.RecordIterator.fetchMore(RecordIterator.java:157)
        at weblogic.diagnostics.archive.RecordIterator.hasNext(RecordIterator.java:130)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.getAlerts(AlertLoggerImpl.java:157)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.updateAlertEvalTime(AlertLoggerImpl.java:140)
        at com.bea.wli.monitoring.alert.AlertLog.updateAlertEvalTime(AlertLog.java:248)
        at com.bea.wli.monitoring.alert.AlertManager._evaluateSingleRule(AlertManager.java:992)
        at com.bea.wli.monitoring.alert.AlertManager.intervalCompleted(AlertManager.java:809)
        at com.bea.wli.monitoring.alert.AlertEvalTask.execute(AlertEvalTask.java:65)
        at com.bea.wli.monitoring.alert.AlertEvalTask.run(AlertEvalTask.java:37)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

Solaris PRSTAT

Solaris PRSTAT was also captured in order to understand the amount of CPU consumed and determine if this Thread was stuck or performing a lot of processing.

As you can see below, Thread id #76 was identified as the primary CPU contributor with 8.6 %.

prstat -L -p <Java Pid>

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
  9116 bea      3109M 2592M sleep   59    0  21:52:59 8.6% java/76
  9116 bea      3109M 2592M sleep   57    0   4:28:05 0.3% java/40
  9116 bea      3109M 2592M sleep   59    0   6:52:02 0.3% java/10774
  9116 bea      3109M 2592M sleep   59    0   6:50:00 0.3% java/84
  9116 bea      3109M 2592M sleep   58    0   4:27:20 0.2% java/43
  9116 bea      3109M 2592M sleep   59    0   7:39:42 0.2% java/41287
  9116 bea      3109M 2592M sleep   59    0   3:41:44 0.2% java/30503
  9116 bea      3109M 2592M sleep   59    0   5:48:32 0.2% java/36116
  9116 bea      3109M 2592M sleep   59    0   6:15:52 0.2% java/36118
  9116 bea      3109M 2592M sleep   59    0   2:44:02 0.2% java/36128
  9116 bea      3109M 2592M sleep   59    0   5:53:50 0.1% java/36111
  9116 bea      3109M 2592M sleep   59    0   4:27:55 0.1% java/55
  9116 bea      3109M 2592M sleep   59    0   9:51:19 0.1% java/23479
  9116 bea      3109M 2592M sleep   59    0   4:57:33 0.1% java/36569
  9116 bea      3109M 2592M sleep   59    0   9:51:08 0.1% java/23477
  9116 bea      3109M 2592M sleep   59    0  10:15:13 0.1% java/4339
  9116 bea      3109M 2592M sleep   59    0  10:13:37 0.1% java/4331
  9116 bea      3109M 2592M sleep   59    0   4:58:37 0.1% java/36571
  9116 bea      3109M 2592M sleep   59    0   3:13:46 0.1% java/41285
  9116 bea      3109M 2592M sleep   59    0   4:27:32 0.1% java/48
  9116 bea      3109M 2592M sleep   59    0   5:25:28 0.1% java/30733

Thread Dump and PRSTAT correlation approach

In order to correlate the PRSTAT data, please follow the steps below:

-         Identify the primary culprit / contributor Thread(s) and locate the Thread Id (decimal format) in the last column
-         Convert the decimal value to hexadecimal, since the native Thread id in a Thread Dump is in hexadecimal format (a small conversion sketch follows the case study below)
-         Search your captured Thread Dump data and locate the native Thread Id >> nid=<Thread Id in hexadecimal>

For our case study:

-         Thread Id #76 was identified as the primary culprit
-         Thread Id #76 was converted to hexadecimal format >> 0x4c
-         Thread Dump data native Thread Id >> nid=0x4c
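
As a quick throwaway sketch of the conversion step (the class below is purely illustrative), the standard Integer.toHexString call performs the decimal-to-hexadecimal mapping:

// Throwaway sketch of the decimal-to-hexadecimal conversion used above.
// The value 76 comes from the prstat LWPID column (java/76); the resulting
// "4c" is what we search for in the Thread Dump as nid=0x4c.
public class ThreadIdConverter {
    public static void main(String[] args) {
        int lwpId = 76; // from prstat output: java/76
        String nid = "0x" + Integer.toHexString(lwpId);
        System.out.println("Search the Thread Dump for: nid=" + nid); // prints nid=0x4c
    }
}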

Such correlation and finding did confirm that this rogue Thread was actively processing (reading/writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.

Root cause

The above analysis, along with the findings from Mark Smith's Oracle Blog article, did confirm the root cause.

Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts, based on the message context in a pipeline, which are sent to an alert destination. Such an alert destination is the actual Weblogic diagnostic File Store, which means this structure will grow over time depending on the volume of Alerts that your OSB application is generating.

It is located under >> <domain_name>/servers/<server_name>/data/store/diagnostics/

In our case, the File Store size was around 800 MB.

Such an increase of the diagnostic File Store size over time leads to an increase in the elapsed time of the Thread involved in the read/write operations, which allocates a large amount of memory on the Java Heap. Such memory cannot be garbage collected until the Thread completes, which leads to the OutOfMemoryError and performance degradation.

Solution

Two actions were required to resolve this problem:

-         Reset the diagnostic File Store by renaming the existing data file and forcing Weblogic to create a fresh one
-         Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level

The reset of the diagnostic File Store brought immediate relief by ensuring short and optimal diagnostic File Store operations, with minimal impact on the Java Heap.

The level of OSB alerting is still in review and will be reduced shortly in order to prevent this problem at the source.

Regular monitoring and purging of the diagnostic File Store will also be implemented in order to prevent such performance degradation going forward, as per Oracle recommendations.
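
As a hedged sketch of what such monitoring could look like (the path, the 200 MB threshold and the class name below are placeholders of my own, not Oracle-provided tooling), a simple size check of the diagnostic File Store directory can be scheduled alongside the purging job:

import java.io.File;

// Minimal sketch: report the on-disk size of the Weblogic diagnostic File Store
// directory so abnormal growth (e.g. the ~800 MB observed in this case study)
// can be flagged early. The path and threshold below are placeholders.
public class DiagnosticStoreSizeCheck {
    private static final long WARN_THRESHOLD_BYTES = 200L * 1024 * 1024; // example threshold: 200 MB

    public static void main(String[] args) {
        File storeDir = new File("/path/to/domain/servers/yourServer/data/store/diagnostics");
        long totalBytes = directorySize(storeDir);
        System.out.println("Diagnostic File Store size: " + (totalBytes / (1024 * 1024)) + " MB");
        if (totalBytes > WARN_THRESHOLD_BYTES) {
            System.out.println("WARNING: diagnostic File Store is growing; review the OSB alerting level and purge policy");
        }
    }

    private static long directorySize(File dir) {
        long size = 0;
        File[] files = dir.listFiles();
        if (files == null) {
            return 0; // not a directory or not readable
        }
        for (File f : files) {
            size += f.isDirectory() ? directorySize(f) : f.length();
        }
        return size;
    }
}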

Conclusion and recommendations

I hope this article has helped you understand how to identify and resolve this Oracle Service Bus problem and appreciate how powerful Thread Dump analysis and the Solaris PRSTAT tool can be to pinpoint this type of OutOfMemoryError problem.

Please do not hesitate to post any comment or question.

4 comments:

Hi P-H,

This is really excellent information to understand Thread Dump analysis with the help of PRSTAT.

Regarding the OSB alerts, I think that to stop this error we can write a WLST script to purge old alerts on a regular basis. Please see below my understanding on this:

If the Alert action is used, then the total number of alerts will keep increasing unless they are purged. If they get to a large enough level (many thousands), this will cause the following issues:

1. The alert dashboard will become unusable. You won’t be able to get to the Extended Alert History and you won’t be able to get to the alert detail page. I have found that the underlying query the console uses will time out due to the amount of data.

2. If the file store is allowed to get large enough (GBs in size), then the respective managed server will take a long time (more than 30 minutes) to start, and that could also be the reason for the OutOfMemory here.

Regards,
Sujit

Hi Pierre,

First of all, thanks a lot for all these nice articles.
They were very helpful for me and the team I work with.
Thanks for the time you took responding to the emails.
Finally, we identified that one of the new projects exposed on OSB was causing the problem.
A split-join aggregates the responses from 3 other backends, but the responses are quite big, SOAP payloads of around 20 MB...
Measuring the heap for such a payload surprisingly showed approximately 800 MB (almost 1 GB) more memory used for just one request.

After that, the GC cleared the memory, but with 3 requests on the proxy, OSB crashed.

Is it normal for a SOAP payload transiting the OSB to have such a huge memory footprint?

Cheers

Cristian.

Thanks Cris for your comments,

OSB is not really designed to handle such big SOAP payloads, especially when using split-join. Adding XQuery manipulation on top of it just makes things worse: it causes extra creation of short-lived objects and forces the executing thread to retain the memory on the Java Heap until the request processing is done (e.g. XQuery can take quite a while to handle transformations for a big payload), ultimately leading to Java Heap depletion and excessive GC.

One of my IT clients had a similar problem with OSB. The solution was to reduce the payload to a reasonable size (~1 MB) and avoid too many XQuery manipulations. In this scenario, they moved some of the XML orchestration & manipulation to a rule engine layer prior to the final OSB consolidation.

Minimizing the OSB data payloads and XQuery manipulations will ensure a scalable environment in the long run.

Regards,
P-H

Logging to a database can be a viable alternative solution.
