Follower can retain more than one copy of the same Snapshot in memory, leading to OOM
Activity

Robert Varga November 6, 2024 at 3:37 PM (edited)
We also have two bits in InstallSnapshot available, so … we kinda can piggy-back an alternative ‘still applying’ protocol on top of that.
I have just noticed the lifecycle on the follower could be improved quite a bit: rather than pulling the contents of the snapshot into memory before sending out ApplySnapshot to ourselves, we could just push the SnapshotTracker state, so that the data stays on disk and we pull it when we go for ApplyState… that way it does not end up lurking in memory. But then again, things could get skewed in a worse way… so that is something we will do separately in https://lf-opendaylight.atlassian.net/browse/CONTROLLER-2135.
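A minimal sketch of that idea, assuming a hypothetical toBackingFile() accessor on SnapshotTracker; the names here are illustrative, not the actual API:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.function.Supplier;

    // Keep only a reference to the snapshot's backing file instead of its bytes;
    // toBackingFile() is a hypothetical accessor on SnapshotTracker.
    final Path snapshotFile = tracker.toBackingFile();

    // Defer the read until ApplyState actually needs the data, so the full
    // snapshot never lurks in the follower's memory while waiting to be applied.
    final Supplier<byte[]> lazyBytes = () -> {
        try {
            return Files.readAllBytes(snapshotFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    };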

Robert Varga November 4, 2024 at 10:32 AM
So the good news is that since https://lf-opendaylight.atlassian.net/browse/CONTROLLER-2058, we have 3 bits unused in the serialization format (raft.messages.IR), which are sent as 0 and ignored on receive – hence we can easily indicate ‘received, not processed’ in a backwards-compatible manner.
With the old serialization format this would require updating receivers: there is a single byte for the ‘success’ boolean, but anything non-zero is interpreted as ‘true’, which means old clients would mis-interpret the new flag.
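A minimal sketch of how those spare bits could carry the new indication in a backwards-compatible way; the bit assignments and names are assumptions, not the actual wire format:

    // Hypothetical flag layout for the reply's flags byte. Bit 0 is the existing
    // 'success' bit; bit 1 is a new 'received, still applying' indication.
    // Receivers that predate the flag send it as 0 and ignore it on receive.
    private static final byte SUCCESS    = 0x01;
    private static final byte INSTALLING = 0x02;

    static byte encodeFlags(final boolean success, final boolean installing) {
        byte flags = 0;
        if (success) {
            flags |= SUCCESS;
        }
        if (installing) {
            flags |= INSTALLING;
        }
        return flags;
    }

    static boolean isInstalling(final byte flags) {
        // Mask only the bit we understand; unknown bits stay ignored.
        return (flags & INSTALLING) != 0;
    }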

Robert Varga November 3, 2024 at 11:23 PM (edited)
So there are a couple of issues here.
1. We really should have an InstallSnapshotReply which indicates the chunk has been received but not installed yet, so that the leader does not re-transmit the last chunk but keeps soliciting installation. Not sure we can do that in InstallSnapshotReply without breaking compatibility.
2. Follower should also remember that it has received the last chunk and has released all the tracker state, but that it still needs to send a reply. Currently that state is in the system, but not reachable by the Follower class: it is the reply that is being sent after applySnapshotCallback completes.
So we need to do 2) first: create an explicit class implementing ApplySnapshot.Callback, holding the response and some metadata (leader/term/index). That should be stored in Follower as normal state and checked when we receive InstallSnapshot: if that ends up being the same install snapshot, we should just ignore it (or reply with the thing in 1)) and wait for the state to be applied. That way, without changing the protocol, Follower will stop doing the vicious loop; the leader will still be re-transmitting the last chunk, but with no ill effects on the follower.
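A minimal sketch of such a class, assuming ApplySnapshot.Callback exposes onSuccess()/onFailure() hooks; the class name, fields and matches() helper are illustrative, not the actual implementation:

    import akka.actor.ActorRef;

    // Illustrative only: retained by Follower while a snapshot is being applied,
    // so a re-sent InstallSnapshot for the same snapshot can be recognized.
    final class PendingInstallCallback implements ApplySnapshot.Callback {
        private final ActorRef leaderActor;
        private final long term;
        private final long lastIncludedIndex;
        private final InstallSnapshotReply reply;

        PendingInstallCallback(final ActorRef leaderActor, final long term,
                final long lastIncludedIndex, final InstallSnapshotReply reply) {
            this.leaderActor = leaderActor;
            this.term = term;
            this.lastIncludedIndex = lastIncludedIndex;
            this.reply = reply;
        }

        // Does an incoming InstallSnapshot refer to the snapshot being applied?
        boolean matches(final InstallSnapshot installSnapshot) {
            return term == installSnapshot.getTerm()
                && lastIncludedIndex == installSnapshot.getLastIncludedIndex();
        }

        @Override
        public void onSuccess() {
            // Snapshot applied and persisted: it is now safe to confirm the last chunk.
            leaderActor.tell(reply, ActorRef.noSender());
        }

        @Override
        public void onFailure() {
            // Left to the real implementation: e.g. let the leader time out and restart.
        }
    }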
Then we can look into how the Leader can know not to re-transmit the last chunk, i.e. 1).
Details
Assignee: Robert Varga
Reporter: Tibor Král
Priority: Highest
There is a bug in the current InstallSnapshot process caused by the delayed reply for the last chunk. The Follower sends the InstallSnapshotReply for the last chunk only after the Snapshot has been applied and persisted; the larger the snapshot, the longer that may take, and in the meantime the Leader can resend the entire Snapshot. Here is the flow leading to the issue:
1. Follower receives the last chunk of the Snapshot.
2. Follower wraps the Snapshot in ApplySnapshot and sends it to RaftActorSnapshotMessageSupport to be applied and persisted. Number of Snapshots in Follower's memory = 1.
3. Follower closes the SnapshotTracker.
4. During this time the Leader awaits the Reply for the last chunk. When the chunk-timeout runs out, the Leader re-sends the last chunk.
5. The Follower receives the last chunk again.
6. Follower creates a new SnapshotTracker (the previous one was closed).
7. Follower tries to add the last chunk to the SnapshotTracker, which ends in InvalidChunkException("expected chunk 1, adding chunk n") — see the sketch after this list.
8. Follower closes the SnapshotTracker and sends InstallSnapshotReply(chunkIndex = -1, success = false).
9. Leader receives InstallSnapshotReply(chunkIndex = -1) and resets its LeaderInstallSnapshotState.
10. Leader starts sending chunks of the same Snapshot again from chunk 1.
11. Follower starts collecting chunks.
12. Follower receives the last chunk of the Snapshot.
13. Follower wraps the Snapshot in ApplySnapshot and sends it to RaftActorSnapshotMessageSupport to be applied and persisted. Number of Snapshots in Follower's memory = 2.
14. If the previous Snapshot is still being applied/persisted, this ApplySnapshot message is simply dropped.
15. After the chunk-timeout the Leader sends the last chunk again and the whole cycle starts over.
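For illustration, a stripped-down sketch of why step 7 fails; the real SnapshotTracker tracks more state (e.g. chunk hashes), so this is a simplification:

    // Simplified: a freshly created tracker expects chunk 1, so the re-sent
    // last chunk n immediately fails the sequence check.
    final class SnapshotTracker {
        private int lastChunkIndex = 0;

        void addChunk(final int chunkIndex, final byte[] chunk) throws InvalidChunkException {
            if (chunkIndex != lastChunkIndex + 1) {
                throw new InvalidChunkException(
                    "expected chunk " + (lastChunkIndex + 1) + ", adding chunk " + chunkIndex);
            }
            lastChunkIndex = chunkIndex;
            // ... buffer the chunk data ...
        }
    }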
And here is the issue. If, for example:
- the GC takes longer to clear the large Snapshot object, or
- the ActorSystem gets overwhelmed and the ApplySnapshot message is stuck in RaftActorSnapshotMessageSupport's mailbox,
the Leader can start sending a third copy of the same Snapshot, and the Follower will attempt to collect it just as before.
A possible solution is to make the Follower reject the InstallSnapshot message from the Leader if a Snapshot is being applied at the same time. If the Follower just drops the InstallSnapshot message and does not reply, the Leader will simply re-send the last chunk after each chunk-timeout elapses. And when the Follower finally finishes the ApplySnapshot, it will send the InstallSnapshotReply for the last chunk and the Leader will close the LeaderInstallSnapshotState as if there was no extra delay.
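A minimal sketch of that check in the follower's InstallSnapshot handling; handleInstallSnapshot, pendingApply and matches() are hypothetical names, not the actual Follower code:

    import akka.actor.ActorRef;

    // Hypothetical guard at the top of the follower's InstallSnapshot handling;
    // pendingApply holds the callback retained while a snapshot is being applied.
    private void handleInstallSnapshot(final ActorRef sender, final InstallSnapshot installSnapshot) {
        if (pendingApply != null && pendingApply.matches(installSnapshot)) {
            // A copy of this snapshot is already being applied/persisted. Do not
            // reply: the leader keeps re-sending the last chunk on chunk-timeout,
            // and we send InstallSnapshotReply once the apply completes, letting
            // the leader close its LeaderInstallSnapshotState normally.
            return;
        }
        // ... existing chunk-tracking logic ...
    }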