Provide efficient RAFT term storage

Description

Our current organization of persisted data via atomix-storage leaves a lot to be desired, for reasons that are mostly historical: atomix-storage itself, Akka persistence, etc.

At the end of the day, what we need is RAFT journal storage. In a RAFT journal, each entry has 129 bits of metadata, which has intrinsic properties offering optimizations.

Each entry has:

  • a 64bit index, counting either from 1 or 0 (which is a detail), increasing monotonically

  • a 64bit term, counting from 0, non-decreasing (but not necessarily contiguous: a term can be missing from the journal, for example when a leader fails to commit its first entry!)

  • a one-bit commit indicator, which is implied by commitIndex: all entries up to and including commitIndex are committed

  • associated state transition data

atomix-storage uses index for its internal entries and has a provision to provide commitIndex, but it leaves the term information with the state data.

This is a missed optimization opportunity, because in steady-state operation, when there are no leader elections, the term remains constant.

We are currently tracking each entry via:

  • 32bit entry length

  • 32bit entry CRC32

where length == 0 is not allowed.
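For reference, the current fixed 8-byte framing can be sketched roughly as follows. This is a minimal illustration and not atomix-storage code: the class and method names are hypothetical, and both the big-endian byte order and the CRC32 covering only the entry bytes are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch of a fixed header: 32-bit length followed by 32-bit CRC32.
final class FixedHeaderSketch {
    static final int HEADER_SIZE = Integer.BYTES + Integer.BYTES;

    // Frames an entry as [length][crc32][entry bytes]; assumes the CRC covers only the entry bytes.
    static ByteBuffer frame(byte[] entry) {
        if (entry.length == 0) {
            throw new IllegalArgumentException("length == 0 is not allowed");
        }
        CRC32 crc = new CRC32();
        crc.update(entry, 0, entry.length);

        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + entry.length);
        buf.putInt(entry.length);
        buf.putInt((int) crc.getValue());
        buf.put(entry);
        buf.flip();
        return buf;
    }

    // Reads one framed entry back, verifying the length and the checksum.
    static byte[] unframe(ByteBuffer buf) {
        int length = buf.getInt();
        if (length <= 0 || buf.remaining() - Integer.BYTES < length) {
            throw new IllegalStateException("Invalid entry length " + length);
        }
        int expectedCrc = buf.getInt();
        byte[] entry = new byte[length];
        buf.get(entry);

        CRC32 crc = new CRC32();
        crc.update(entry, 0, length);
        if ((int) crc.getValue() != expectedCrc) {
            throw new IllegalStateException("CRC32 mismatch");
        }
        return entry;
    }

    public static void main(String[] args) {
        ByteBuffer framed = frame(new byte[] { 1, 2, 3 });
        System.out.println("framed size: " + framed.remaining());
        System.out.println("entry size: " + unframe(framed).length);
    }
}
```

Note that nothing in this header says anything about the term: to learn it, a reader has to deserialize the entry bytes.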

Perhaps we should be tracking it via:

  • 32bit entry CRC32

  • variable 64bit term increment

  • variable 64bit entry length, perhaps trimmed to 32bits, but future-proof to 64bits

which would lend itself to using WritableObjects.writeTwoLongs() for the second and third items. With the first long almost always being 0, this leads to efficient storage in 1-17 bytes, with the usual case being 3-11 bytes (I think, needs to be checked).
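To make the idea concrete, here is a minimal sketch of packing a term increment and an entry length behind a single descriptor byte. It does not call WritableObjects and does not claim to match its wire format: the class name, the nibble layout (low nibble for the term increment, high nibble for the length) and the big-endian trimming are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical packing: one descriptor byte, then each long trimmed to its significant bytes.
final class TermLengthPairSketch {
    // Bytes needed for the value treated as unsigned: 0 for zero, 8 for negative values.
    static int bytesNeeded(long value) {
        return (64 - Long.numberOfLeadingZeros(value) + 7) / 8;
    }

    static byte[] encode(long termIncrement, long length) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(17);
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            int termBytes = bytesNeeded(termIncrement);
            int lengthBytes = bytesNeeded(length);
            // Assumed layout: high nibble = length byte count, low nibble = term byte count
            out.writeByte(lengthBytes << 4 | termBytes);
            writeTrimmed(out, termIncrement, termBytes);
            writeTrimmed(out, length, lengthBytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns { termIncrement, length }
    static long[] decode(byte[] encoded) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(encoded))) {
            int descriptor = in.readUnsignedByte();
            int termBytes = descriptor & 0x0F;
            int lengthBytes = descriptor >>> 4;
            if (termBytes > 8 || lengthBytes > 8) {
                throw new IOException("Invalid descriptor byte " + descriptor);
            }
            long termIncrement = readTrimmed(in, termBytes);
            long length = readTrimmed(in, lengthBytes);
            return new long[] { termIncrement, length };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the lowest 'count' bytes of the value, most significant byte first.
    private static void writeTrimmed(DataOutputStream out, long value, int count) throws IOException {
        for (int shift = (count - 1) * 8; shift >= 0; shift -= 8) {
            out.writeByte((int) (value >>> shift));
        }
    }

    private static long readTrimmed(DataInputStream in, int count) throws IOException {
        long value = 0;
        for (int i = 0; i < count; i++) {
            value = value << 8 | in.readUnsignedByte();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(encode(0, 100).length); // steady state: term unchanged
        System.out.println(encode(1, 100).length); // first entry of a new term
    }
}
```

Under this particular packing a pair occupies 1-17 bytes, and a steady-state entry (term increment 0) of up to 255 bytes needs only 2. These numbers describe the sketch only; the 3-11 bytes estimate above still needs to be checked against what WritableObjects actually writes.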

If we can achieve this, then we can project RaftEntryMeta from the storage layer, without having to involve entry data serialization code – which we only need when we are replicating/applying entries.
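A rough sketch of such a projection, working purely from decoded headers, might look like the following. Everything here is hypothetical: the Header and Meta records are illustrative stand-ins (Meta is not the real RaftEntryMeta), and the sketch assumes the storage layer knows the index of the first entry and a base term to which the first increment is applied.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical projection of (index, term) pairs from entry headers alone.
final class MetaProjectionSketch {
    // Stand-in for a decoded header: CRC32, term increment and entry length
    record Header(int crc32, long termIncrement, long length) {}

    // Stand-in for the projected metadata; not the real RaftEntryMeta
    record Meta(long index, long term) {}

    // Assumes firstIndex and baseTerm are recorded elsewhere, for example per segment.
    static List<Meta> project(long firstIndex, long baseTerm, List<Header> headers) {
        List<Meta> result = new ArrayList<>(headers.size());
        long index = firstIndex;
        long term = baseTerm;
        for (Header header : headers) {
            // The entry data is never touched: the term is the running sum of increments
            term += header.termIncrement();
            result.add(new Meta(index, term));
            index++;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Header> headers = List.of(
            new Header(0, 0, 10),
            new Header(0, 0, 20),
            new Header(0, 2, 5),   // leader change: the term jumps by two
            new Header(0, 0, 7));
        project(5, 3, headers).forEach(System.out::println);
    }
}
```

The entry length in each header is what would let a reader skip over the entry data to reach the next header, which is why no entry serialization code is involved.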

Based on the 3-11 bytes figure above, this has the potential to save 1-9 bytes overall for each entry. The cost is slightly more complex buffer management code, as we would be dealing with variable-length record headers of 7-24 bytes vs. a fixed 8-byte header (and an implied 8 bytes of payload).

Created June 4, 2024 at 6:01 PM
Updated March 10, 2025 at 12:23 PM