Unsynchronous change of ModificationType in ModificationNode

Description

This issue was observed in https://lf-opendaylight.atlassian.net/browse/MDSAL-872 . While removing huge number of devices, it may happen that ModificationNode with original modificationType set to WRITE is misinterpreted as WRITE and failed because: “does not have after-image”.
https://github.com/opendaylight/controller/blob/ddbe32f92adc0ec6d49ca9a248f0a95f509b3b7f/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/persisted/DataTreeCandidateInputOutput.java#L202C1-L209C6
In attachment is provided screenshot from debugging where is called requireDataAfter method because modificationType was WRITE, but in the debug variables is changed to DELETE.

In the AutomaticLifecycleMixin class is changed, modificationType in ModificationNode from WRITE into DELETE.
https://github.com/opendaylight/yangtools/blob/9f46dba51a34ee223f9a537da76010710512c921/data/yang-data-tree-ri/src/main/java/org/opendaylight/yangtools/yang/data/tree/impl/AutomaticLifecycleMixin.java#L95-L97


This issue is irregularly reproducible only in the Calcium branch. In newer branches, this issue seems to be suppressed because of model change, which probably give more time for available-capabilities to change ModificationType before executing writeNode method in DataTreeCandidateInputOutput class.
ModifiedNode{identifier=(urn:opendaylight:netconf-node-topology?revision=2023-11-21)available-capabilities, operation=WRITE, modificationType=WRITE}

Probably the ModificationNode should have DELETE ModificationType from beginning without changing it in AutomaticLifecycleMixin or should notify when this change is done.

Environment

None

Attachments

14
  • 20 Dec 2024, 05:24 PM
  • 20 Dec 2024, 05:24 PM
  • 20 Dec 2024, 05:24 PM
  • 20 Dec 2024, 05:24 PM
  • 20 Dec 2024, 05:23 PM
  • 20 Dec 2024, 05:23 PM
  • 11 Dec 2024, 07:06 AM
  • 11 Dec 2024, 07:06 AM
  • 11 Dec 2024, 07:06 AM
  • 11 Dec 2024, 07:06 AM
  • 11 Dec 2024, 07:06 AM
  • 11 Dec 2024, 07:06 AM

Activity

Show:

Robert Varga January 7, 2025 at 2:20 PM
Edited

https://git.opendaylight.org/gerrit/c/yangtools/+/114780 moves the validate/prepare implementation to InMemoryDataTreeModification and https://git.opendaylight.org/gerrit/c/yangtools/+/114848 introduces some basic locking, which should make the collision less likely. Nevertheless it is not a complete solution, as a newModification() after prepare() may still trigger this bit:

java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.tree.impl.node.MaterializedContainerNode@46bc284a and candidate base org.opendaylight.yangtools.yang.data.tree.impl.node.MaterializedContainerNode@73a6f391 differ. at org.opendaylight.yangtools.yang.data.tree.impl.InMemoryDataTree.commit(InMemoryDataTree.java:164) ~[bundleFile:?] at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:1081) ~[bundleFile:?] at org.opendaylight.controller.cluster.datastore.ShardDataTree.payloadReplicationComplete(ShardDataTree.java:538) ~[bundleFile:?] at org.opendaylight.controller.cluster.datastore.ShardDataTree.applyReplicatedPayload(ShardDataTree.java:445) ~[bundleFile:?] at org.opendaylight.controller.cluster.datastore.Shard.applyState(Shard.java:1013) ~[bundleFile:?]

Robert Varga December 20, 2024 at 11:03 PM
Edited

thanks for the stack traces, now the problem seems to be much clearer.

The shard-side things are exactly as expected: they modify the transaction as part of preparing it to top-of-tree. The ‘2-notification.txt’ part is okay as well – it is modifying an unsealed transaction, as usual.

The other notification thread stack traces are actually problematic, as we are traversing the tree just to establish a sealed modification’s view as a snapshot, to start the next modification (InMemoryDataTreeModification.newModification() on line 211). This is troubling, as:

  1. we are memoizing apply result, which is the source of the problem here

  2. the ModifiedNode can definitely undergo asynchronous modification from shard side, so modificationType is not guaranteed to be stable (and therefore we cannot just take the alternative route of achieving the equivalent through applyToCursor())

We used to have synchronized guards around these operations, but we dropped those a decade ago in https://git.opendaylight.org/gerrit/c/yangtools/+/11126. It is quite curious this is cropping up now, but than again, Java 21 seems to be good at finding concurrency issues (as we have found https://lf-opendaylight.atlassian.net/browse/MDSAL-872 as part of this investigation).

I think transaction activity may be playing a role here, as the newModification() path looks at the current state of the data tree, so if transaction activity is low, the memoization done by it may result in prepare() then just picking up the memoized results, partially masking the race (IIRC).

At any rate, I don’t think this will be a trivial fix, we may need multiple parts to completely deal with the two sides interacting here. Also the yangtools-14 refactor of modification tracking may be playing a role here, but that needs a separate investigation.

Sangwook Ha December 20, 2024 at 5:35 PM

Attached the full stack traces corresponding to each event in the table / screen capture image - this is based on Calcium SR2.

Here we have two transactions taking place in quick succession - one to update connection-status to connecting and remove available-capabilities, etc.; and second one to remove the netconf node.

I believe the steps 1 & 2 in data-notification-dispatcher and 3 & 4 in data-shard-dispatcher are for the first update transaction and the steps 3 & 4 in data-notification-dispatcher are for the second transaction.

Robert Varga December 20, 2024 at 1:29 PM

As for 's analysis – this may occur, but unfortuntely the provided stack traces are incomplete – they show deep innards of a transaction implemention, but do not show the complete context in which they are occurring:

  • The changes in data-notification-dispatcher are most likely normal building up of a transaction including sealing – but that at least is the normal path. After the modification is sealed, though, that thread should not be modifying the structure.

  • The changes in data-shard-dispatcher are probably the result of the transaction being rebased as part of its DataTree.validata()/DataTree.prepare() cycle – which should be fine, as the modification is sealed and hence not owned by the user anymore.

Based on the evidence, though, we may have a subsequent transaction in a transaction chain, looking at its baseline (which is the transaction being committed). For that though, we need the stack traces.

Robert Varga December 20, 2024 at 12:53 PM

So re. the first comment: this is completely normal and is subject to modification algebra, for example:

  • transaction starts

  • the container exists

  • it is then overwritten by an empty container

  • transaction is sealed

since this is a non-presence container, it being empty implies it does not exist – and therefore the WRITE operation results in a DELETE modification type – the result is the same as if the container were explicitly deleted.

What is not normal is that the modification type is observed to change – the dispatch code saw it as WRITE, but then it flips to DELETE.

Done

Details

Assignee

Reporter

Labels

Fix versions

Priority

Created December 4, 2024 at 1:39 PM
Updated January 20, 2025 at 1:21 PM
Resolved January 20, 2025 at 1:21 PM