Unsynchronous change of ModificationType in ModificationNode
Description
Environment
Attachments
- 20 Dec 2024, 05:24 PM
- 20 Dec 2024, 05:24 PM
- 20 Dec 2024, 05:24 PM
- 20 Dec 2024, 05:24 PM
- 20 Dec 2024, 05:23 PM
- 20 Dec 2024, 05:23 PM
- 11 Dec 2024, 07:06 AM
- 11 Dec 2024, 07:06 AM
- 11 Dec 2024, 07:06 AM
- 11 Dec 2024, 07:06 AM
- 11 Dec 2024, 07:06 AM
- 11 Dec 2024, 07:06 AM
Activity
Robert Varga January 7, 2025 at 2:20 PMEdited
https://git.opendaylight.org/gerrit/c/yangtools/+/114780 moves the validate/prepare implementation to InMemoryDataTreeModification and https://git.opendaylight.org/gerrit/c/yangtools/+/114848 introduces some basic locking, which should make the collision less likely. Nevertheless it is not a complete solution, as a newModification() after prepare() may still trigger this bit:
java.lang.IllegalStateException: Store tree org.opendaylight.yangtools.yang.data.tree.impl.node.MaterializedContainerNode@46bc284a and candidate base org.opendaylight.yangtools.yang.data.tree.impl.node.MaterializedContainerNode@73a6f391 differ.
at org.opendaylight.yangtools.yang.data.tree.impl.InMemoryDataTree.commit(InMemoryDataTree.java:164) ~[bundleFile:?]
at org.opendaylight.controller.cluster.datastore.ShardDataTree.finishCommit(ShardDataTree.java:1081) ~[bundleFile:?]
at org.opendaylight.controller.cluster.datastore.ShardDataTree.payloadReplicationComplete(ShardDataTree.java:538) ~[bundleFile:?]
at org.opendaylight.controller.cluster.datastore.ShardDataTree.applyReplicatedPayload(ShardDataTree.java:445) ~[bundleFile:?]
at org.opendaylight.controller.cluster.datastore.Shard.applyState(Shard.java:1013) ~[bundleFile:?]
Robert Varga December 20, 2024 at 11:03 PMEdited
@Sangwook Ha thanks for the stack traces, now the problem seems to be much clearer.
The shard-side things are exactly as expected: they modify the transaction as part of preparing it to top-of-tree. The ‘2-notification.txt’ part is okay as well – it is modifying an unsealed transaction, as usual.
The other notification thread stack traces are actually problematic, as we are traversing the tree just to establish a sealed modification’s view as a snapshot, to start the next modification (InMemoryDataTreeModification.newModification()
on line 211). This is troubling, as:
we are memoizing apply result, which is the source of the problem here
the ModifiedNode can definitely undergo asynchronous modification from shard side, so modificationType is not guaranteed to be stable (and therefore we cannot just take the alternative route of achieving the equivalent through applyToCursor())
We used to have synchronized guards around these operations, but we dropped those a decade ago in https://git.opendaylight.org/gerrit/c/yangtools/+/11126. It is quite curious this is cropping up now, but than again, Java 21 seems to be good at finding concurrency issues (as we have found https://lf-opendaylight.atlassian.net/browse/MDSAL-872 as part of this investigation).
I think transaction activity may be playing a role here, as the newModification() path looks at the current state of the data tree, so if transaction activity is low, the memoization done by it may result in prepare() then just picking up the memoized results, partially masking the race (IIRC).
At any rate, I don’t think this will be a trivial fix, we may need multiple parts to completely deal with the two sides interacting here. Also the yangtools-14 refactor of modification tracking may be playing a role here, but that needs a separate investigation.
Sangwook Ha December 20, 2024 at 5:35 PM
@Robert Varga Attached the full stack traces corresponding to each event in the table / screen capture image - this is based on Calcium SR2.
Here we have two transactions taking place in quick succession - one to update connection-status
to connecting
and remove available-capabilities
, etc.; and second one to remove the netconf node.
I believe the steps 1 & 2 in data-notification-dispatcher
and 3 & 4 in data-shard-dispatcher
are for the first update transaction and the steps 3 & 4 in data-notification-dispatcher
are for the second transaction.
Robert Varga December 20, 2024 at 1:29 PM
As for @Sangwook Ha 's analysis – this may occur, but unfortuntely the provided stack traces are incomplete – they show deep innards of a transaction implemention, but do not show the complete context in which they are occurring:
The changes in data-notification-dispatcher are most likely normal building up of a transaction including sealing – but that at least is the normal path. After the modification is sealed, though, that thread should not be modifying the structure.
The changes in data-shard-dispatcher are probably the result of the transaction being rebased as part of its DataTree.validata()/DataTree.prepare() cycle – which should be fine, as the modification is sealed and hence not owned by the user anymore.
Based on the evidence, though, we may have a subsequent transaction in a transaction chain, looking at its baseline (which is the transaction being committed). For that though, we need the stack traces.
Robert Varga December 20, 2024 at 12:53 PM
So re. the first comment: this is completely normal and is subject to modification algebra, for example:
transaction starts
the container exists
it is then overwritten by an empty container
transaction is sealed
since this is a non-presence container, it being empty implies it does not exist – and therefore the WRITE operation results in a DELETE modification type – the result is the same as if the container were explicitly deleted.
What is not normal is that the modification type is observed to change – the dispatch code saw it as WRITE, but then it flips to DELETE.
This issue was observed in https://lf-opendaylight.atlassian.net/browse/MDSAL-872 . While removing huge number of devices, it may happen that ModificationNode with original modificationType set to WRITE is misinterpreted as WRITE and failed because: “does not have after-image”.
https://github.com/opendaylight/controller/blob/ddbe32f92adc0ec6d49ca9a248f0a95f509b3b7f/opendaylight/md-sal/sal-distributed-datastore/src/main/java/org/opendaylight/controller/cluster/datastore/persisted/DataTreeCandidateInputOutput.java#L202C1-L209C6
In attachment is provided screenshot from debugging where is called
requireDataAfter
method because modificationType was WRITE, but in the debug variables is changed to DELETE.In the AutomaticLifecycleMixin class is changed, modificationType in ModificationNode from WRITE into DELETE.
https://github.com/opendaylight/yangtools/blob/9f46dba51a34ee223f9a537da76010710512c921/data/yang-data-tree-ri/src/main/java/org/opendaylight/yangtools/yang/data/tree/impl/AutomaticLifecycleMixin.java#L95-L97
This issue is irregularly reproducible only in the Calcium branch. In newer branches, this issue seems to be suppressed because of model change, which probably give more time for
available-capabilities
to change ModificationType before executingwriteNode
method in DataTreeCandidateInputOutput class.ModifiedNode{identifier=(urn:opendaylight:netconf-node-topology?revision=2023-11-21)available-capabilities, operation=WRITE, modificationType=WRITE}
Probably the ModificationNode should have DELETE ModificationType from beginning without changing it in AutomaticLifecycleMixin or should notify when this change is done.