Ditch Akka persistence from sal-akka-raft

Description

Our current interplay between persistence and replication is rather flawed: we allow far too many generic things to be plugged in through interfaces which were never designed to handle them.

There are four aspects here:

  1. we need to provide the ability to efficiently send journal entries to remote peers

  2. we need to persist those journal entries to storage, where durability is optional, EXCEPT for ServerConfigurationPayloads, which are always durable

  3. we use a custom journal plugin, based on atomix.io's requirements – which boil down to providing RAFT on top of a rolling set of files. Since Atomix has moved away from Java and now does its thing in Go, we have adopted their code in CONTROLLER-2071

  4. we use a custom snapshot plugin to store snapshots

We use an on-heap cache of entries serialized to byte[] to serve (1), and we use Akka Persistence to provide (2) via a custom hack which overrides the normal durability knobs – all the while plugging (3) and (4) into it.

So we are dealing with 4 layers of indirection:

  • sal-distributed-datastore, talks to

  • sal-akka-raft, talks to

  • AbstractPersistentActor, talks to

  • sal-akka-segmented-journal, talks to

  • atomix-storage (which then talks to Kryo, but we are ditching that in CONTROLLER-2072)

At the end of the day, the serialized view of each journal entry is available from atomix-storage through the familiar monotonically-growing 64-bit index, with the contents held in a (set of) memory-mapped files. When we need to send a stream of entries to a peer, we should be able to just replay them from there, without needing to keep anything at all on heap.
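For illustration, a minimal sketch of such a replay, assuming the atomix-storage API as adopted in CONTROLLER-2071 (SegmentedJournal, JournalReader, Indexed, following the upstream Atomix 3.x layout – exact packages and signatures in our fork may differ) and a byte[] entry type; PeerSink is a hypothetical transport hook, not an existing class:

    import io.atomix.storage.journal.Indexed;
    import io.atomix.storage.journal.JournalReader;
    import io.atomix.storage.journal.SegmentedJournal;

    final class ReplaySketch {
        // Hypothetical sink for AppendEntries payloads; the real transport
        // would live in sal-akka-raft's leader-to-follower handling.
        interface PeerSink {
            void accept(long index, byte[] data);
        }

        // Stream entries [nextIndex, lastIndex] to a peer straight from the
        // memory-mapped journal, never materializing the batch on heap.
        static void replayToPeer(SegmentedJournal<byte[]> journal, long nextIndex,
                long lastIndex, PeerSink sink) {
            try (JournalReader<byte[]> reader = journal.openReader(nextIndex)) {
                while (reader.hasNext()) {
                    Indexed<byte[]> entry = reader.next();
                    if (entry.index() > lastIndex) {
                        break;
                    }
                    sink.accept(entry.index(), entry.entry());
                }
            }
        }
    }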

The second part of the picture is the Akka Persistence protocol – which requires each object to be serialized into a byte[] before anything happens to it. This absolutely sucks when combined with huge datastore writes and G1 GC's treatment of humongous objects (for which we already provide a workaround).

What we really need is a streaming input/output interface, where on one side we have a Payload (say, a DataTree delta in its native form) and on the other side a sequence of binary blocks (i.e. SegmentedJournal entries), with the translation between the two working in terms of DataInput/DataOutput (essentially), without intermediate byte[]s or other buffers holding the entirety of the serialized form.
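A hypothetical shape for that interface (the names here are illustrative, not an existing API):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hypothetical streaming serdes: a payload is written directly into the
    // journal's DataOutput and read back from its DataInput, so no byte[]
    // holding the complete serialized form ever exists.
    interface PayloadSerdes<T> {
        void writeTo(T payload, DataOutput out) throws IOException;

        T readFrom(DataInput in) throws IOException;
    }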

So long story short, this issue is about ditching the use of AbstractPersistentActor and replacing it with an explicit interface closely aligned with what atomix-storage provides, i.e. storage of bytes identified by logical numbers, which we can read again if the need arises.
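Shaped, roughly, like this hypothetical interface (a sketch of the intent, not a committed design):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Hypothetical replacement for AbstractPersistentActor: bytes keyed by a
    // monotonically-increasing 64-bit index, re-readable on demand.
    interface EntryStore {
        @FunctionalInterface
        interface EntryWriter {
            void writeTo(DataOutput out) throws IOException;
        }

        // Stream one entry into storage, returning the index assigned to it.
        long append(EntryWriter entry) throws IOException;

        // Re-open a previously-appended entry, e.g. for recovery or replication.
        DataInput read(long index) throws IOException;

        // Discard all entries up to and including the given index.
        void discardUpTo(long index) throws IOException;
    }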

Activity


Robert Varga March 2, 2023 at 9:35 PM

The flow should look something like this:

  • sal-distributed-datastore defines the set of payloads that are valid for sal-akka-raft's AppendEntries and provides serdes for them

  • sal-akka-raft defines the valid atomix-storage journal entries (see the sketch after this list)

  • sal-akka-raft uses atomix-storage to implement ReplicatedLog

  • sal-akka-raft handles the ServerConfiguration payload ... somehow, but out-of-band; it just needs to be applied at the right position in ReplicatedLog

  • atomix-storage provides flexible storage sync – e.g. for the operational datastore it does not sync data to storage, but relies on the memory-mapped file region (which boils down to page cache memory on Linux) instead of the usual virtual memory
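As referenced above, a hypothetical sketch of that journal entry space (the sealed types and names are illustrative; ServerConfigurationPayload is the existing sal-akka-raft class):

    // Hypothetical journal entry space for sal-akka-raft: regular entries
    // carry opaque datastore payload bytes, while ServerConfiguration is a
    // distinct entry type, applied out-of-band at its ReplicatedLog position.
    sealed interface JournalEntry permits DataEntry, ServerConfigEntry {
        long index();

        long term();
    }

    record DataEntry(long index, long term, byte[] payload) implements JournalEntry {}

    record ServerConfigEntry(long index, long term, ServerConfigurationPayload config)
            implements JournalEntry {}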

Most notably, ReplicatedLog contents are always backed by the SegmentedJournal, i.e. purges of JournalSegmentFile occur when the corresponding entries are trimmed, and any access to their contents (for the purpose of sending AppendEntries) goes through SegmentedJournalReader.
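A sketch of that coupling, again assuming the atomix-storage API from CONTROLLER-2071 (openReader() and compact(); exact names may differ in our fork):

    import io.atomix.storage.journal.JournalReader;
    import io.atomix.storage.journal.SegmentedJournal;

    // Sketch: trimming ReplicatedLog maps directly onto deleting whole journal
    // segment files, and AppendEntries reads go through a journal reader.
    final class JournalBackedLog {
        private final SegmentedJournal<byte[]> journal;

        JournalBackedLog(SegmentedJournal<byte[]> journal) {
            this.journal = journal;
        }

        // Once entries up to trimIndex are covered by a snapshot, compact()
        // drops the segment files containing only trimmed entries.
        void trimTo(long trimIndex) {
            journal.compact(trimIndex);
        }

        // Serving AppendEntries: read straight from the memory-mapped segments.
        JournalReader<byte[]> readFrom(long index) {
            return journal.openReader(index);
        }
    }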

Details

Created March 2, 2023 at 9:16 PM
Updated February 19, 2025 at 7:30 PM