java.lang.OutOfMemoryError: Java heap space

Description

During perf/scale testing, when a large number of neutron resources are created and deleted, ODL is killed due to an OOM. Looking at the stdout when the JVM crashed, we see:

Heap dump file created [3089813876 bytes in 19.232 secs]
Uncaught error from thread [opendaylight-cluster-data-shard-dispatcher-144] shutting down JVM since 'akka.jvm-exit-on-fatal-error' is enabled for ActorSystem[opendaylight-cluster-data]
java.lang.OutOfMemoryError: Java heap space
at com.google.common.collect.RegularImmutableMap.createEntryArray(RegularImmutableMap.java:148)
at com.google.common.collect.RegularImmutableMap.<init>(RegularImmutableMap.java:81)
at com.google.common.collect.ImmutableMap.copyOf(ImmutableMap.java:294)
at org.opendaylight.controller.cluster.datastore.persisted.FrontendHistoryMetadata.<init>(FrontendHistoryMetadata.java:40)
at org.opendaylight.controller.cluster.datastore.FrontendHistoryMetadataBuilder.build(FrontendHistoryMetadataBuilder.java:54)
at org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder$$Lambda$431/741495460.apply(Unknown Source)
at com.google.common.collect.Iterators$8.transform(Iterators.java:799)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at java.util.AbstractCollection.toArray(AbstractCollection.java:141)
at com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
at org.opendaylight.controller.cluster.datastore.persisted.FrontendClientMetadata.<init>(FrontendClientMetadata.java:38)
at org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder.build(FrontendClientMetadataBuilder.java:77)
at org.opendaylight.controller.cluster.datastore.FrontendMetadata$$Lambda$430/2026307982.apply(Unknown Source)
at com.google.common.collect.Iterators$8.transform(Iterators.java:799)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at java.util.AbstractCollection.toArray(AbstractCollection.java:141)
at com.google.common.collect.ImmutableList.copyOf(ImmutableList.java:258)
at org.opendaylight.controller.cluster.datastore.persisted.FrontendShardDataTreeSnapshotMetadata.<init>(FrontendShardDataTreeSnapshotMetadata.java:71)
at org.opendaylight.controller.cluster.datastore.FrontendMetadata.toSnapshot(FrontendMetadata.java:72)
at org.opendaylight.controller.cluster.datastore.FrontendMetadata.toSnapshot(FrontendMetadata.java:33)
at org.opendaylight.controller.cluster.datastore.ShardDataTree.takeStateSnapshot(ShardDataTree.java:216)
at org.opendaylight.controller.cluster.datastore.ShardSnapshotCohort.createSnapshot(ShardSnapshotCohort.java:68)
at org.opendaylight.controller.cluster.raft.RaftActorSnapshotMessageSupport.lambda$new$0(RaftActorSnapshotMessageSupport.java:52)
at org.opendaylight.controller.cluster.raft.RaftActorSnapshotMessageSupport$$Lambda$123/1533883683.accept(Unknown Source)
at org.opendaylight.controller.cluster.raft.SnapshotManager$Idle.capture(SnapshotManager.java:295)
at org.opendaylight.controller.cluster.raft.SnapshotManager$Idle.capture(SnapshotManager.java:307)
at org.opendaylight.controller.cluster.raft.SnapshotManager.capture(SnapshotManager.java:91)
at org.opendaylight.controller.cluster.raft.behaviors.Follower.lambda$handleAppendEntries$0(Follower.java:254)
at org.opendaylight.controller.cluster.raft.behaviors.Follower$$Lambda$127/742332312.apply(Unknown Source)
at org.opendaylight.controller.cluster.raft.ReplicatedLogImpl.lambda$appendAndPersist$0(ReplicatedLogImpl.java:111)
at org.opendaylight.controller.cluster.raft.ReplicatedLogImpl$$Lambda$128/559701765.apply(Unknown Source)
at akka.persistence.UntypedPersistentActor$$anonfun$persist$1.apply(PersistentActor.scala:206)
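The trace above shows the OOM being hit while FrontendMetadata.toSnapshot() copies per-client transaction history into immutable maps during a RAFT snapshot; that metadata appears to grow with every transaction that is opened but never submitted or cancelled. As a minimal, hypothetical sketch of the kind of anti-pattern being tracked down in this issue (the class and method names below are invented for illustration; only the Carbon-era DataBroker/WriteTransaction binding API is real):

// Hypothetical illustration only: a handler that opens a write transaction but
// neither submits nor cancels it on an early-return path. Each abandoned
// transaction is never purged from the shard's frontend history metadata,
// which is the structure that blows up in FrontendMetadata.toSnapshot() above.
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;

public class LeakyNeutronHandler {
    private final DataBroker dataBroker;

    public LeakyNeutronHandler(DataBroker dataBroker) {
        this.dataBroker = dataBroker;
    }

    public void onResourceUpdated(boolean valid) {
        WriteTransaction tx = dataBroker.newWriteOnlyTransaction();
        if (!valid) {
            return; // LEAK: tx is neither submitted nor cancelled
        }
        // ... tx.put(...) / tx.delete(...) against the data store ...
        tx.submit();
    }
}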

Environment

None

Attachments

9 attachments:
  • 28 Nov 2017, 04:04 PM
  • 28 Nov 2017, 04:04 PM
  • 28 Nov 2017, 04:04 PM
  • 15 Nov 2017, 04:50 PM
  • 15 Nov 2017, 04:50 PM
  • 15 Nov 2017, 04:50 PM
  • 13 Nov 2017, 06:07 PM
  • 13 Nov 2017, 05:14 PM
  • 13 Nov 2017, 04:51 PM

Activity


Michael Vorburger November 30, 2017 at 12:44 PM

Closing this issue to avoid confusion; carrying on more similar work in different places in OVSDB-435.

Michael Vorburger November 29, 2017 at 3:08 PM

Testing of another scenario ("nova boot") has hit an OOM that looks exactly like this again (another huge Map, 857 MB of 1.6 GB, inside the MD-SAL ShardDataTree), so plugging these TX leaks is likely going to be an ongoing, repetitive effort rather than a one-time action.

Basically, any time we test a new scenario at scale on a path that hasn't been exercised before, if we hit an OOM that shows a blown-up ShardDataTree, we have to run trace:transactions and plug more non-closed TXs - best by using the ManagedTransactionRunner - continuing the repetitive pattern of fixes shown by the patches of the past 2 weeks on https://git.opendaylight.org/gerrit/#/q/topic:NETVIRT-985.
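As a hedged sketch of that fix pattern (a simplified, hypothetical stand-in for the ManagedTransactionRunner mentioned above, not its real API; it only illustrates the invariant being enforced):

// Hypothetical, simplified stand-in for the managed-transaction-runner pattern;
// the real OpenDaylight helper has a different API surface. The invariant is
// that every transaction handed to caller code is either submitted or
// cancelled, so no open TX is left behind to bloat the frontend metadata.
import java.util.function.Consumer;
import com.google.common.util.concurrent.CheckedFuture;
import org.opendaylight.controller.md.sal.binding.api.DataBroker;
import org.opendaylight.controller.md.sal.binding.api.WriteTransaction;
import org.opendaylight.controller.md.sal.common.api.data.TransactionCommitFailedException;

public class SimpleManagedTxRunner {
    private final DataBroker dataBroker;

    public SimpleManagedTxRunner(DataBroker dataBroker) {
        this.dataBroker = dataBroker;
    }

    public CheckedFuture<Void, TransactionCommitFailedException> runWithNewWriteTx(
            Consumer<WriteTransaction> work) {
        WriteTransaction tx = dataBroker.newWriteOnlyTransaction();
        try {
            work.accept(tx);
        } catch (RuntimeException e) {
            tx.cancel(); // never leave an open transaction behind
            throw e;
        }
        return tx.submit();
    }
}

Call sites then become e.g. txRunner.runWithNewWriteTx(tx -> tx.put(...)), and it is no longer possible to forget to close the transaction on an early return or an exception.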

Michael Vorburger November 28, 2017 at 4:08 PM

Everything done here is now merged to master & carbon; nitrogen should get merged in the coming days (stable/nitrogen was re-opened today). The attached Controller[1-3]_open-transactions.txt files, from the latest test with a build including these fixes, clearly show that all the "big" leaks (as in hundreds of open TXs) have been plugged, as expected. Preliminary feedback indicates that they are not hitting an OOM anymore.

 
There ARE still some open TXs, but relatively few compared to where we started; I don't think it's worth following up on them in the short term. In case we ever want to look at this again in the future, see the new attachments.

Michael Vorburger November 17, 2017 at 5:03 PM
Edited

Once everything that is open now, PLUS what will help me in https://lf-opendaylight.atlassian.net/browse/NETVIRT-1000#icft=NETVIRT-1000, is available in stable/carbon, we can retest. (Full disclosure: there are a few minor Tx leaks, showing up with far fewer transactions than the biggies, which we'll fix only on master, not on carbon & nitrogen.)

Michael Vorburger November 15, 2017 at 6:23 PM

Done

Details

Created November 10, 2017 at 6:43 PM
Updated May 8, 2018 at 12:59 PM
Resolved November 30, 2017 at 12:44 PM