Our current organization of persisted data via atomix-storage leaves a lot to be desired for reasons that are mostly historic: atomix-storage itself, Akka persistence, etc.
At the end of the day, what we need is RAFT journal storage. In a RAFT journal, each entry carries 129 bits of metadata, whose intrinsic properties offer optimization opportunities.
Each entry has:
- a 64-bit index, counting from either 0 or 1 (a detail), increasing monotonically
- a 64-bit term, counting from 0, increasing, though not necessarily contiguously: there may be times when a leader fails to commit its first entry!
- a one-bit commit indicator, which is implied by commitIndex: all entries up to and including commitIndex are committed
- associated state transition data
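To make this concrete, here is a minimal Java sketch of that metadata tuple. The EntryMeta name and shape are illustrative only, not the actual RaftEntryMeta API; the point is that the commit indicator needs no per-entry storage, as it falls out of commitIndex:

    // Illustrative metadata tuple; not the actual RaftEntryMeta API.
    record EntryMeta(long index, long term) {
        // The one-bit commit indicator is not stored per entry: it is
        // implied by the journal-wide commitIndex.
        boolean isCommitted(long commitIndex) {
            return index <= commitIndex;
        }
    }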
atomix-storage uses index for its internal entries and has a provision to provide commitIndex, but it leaves the term information with the state data.
This is a missed optimization opportunity: in steady-state operation, when there are no leader elections, the term remains constant.
We are currently tracking each entry via:
- a 32-bit entry length
- a 32-bit entry CRC32
where length == 0 is not allowed.
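For reference, the current fixed framing amounts to something like the following sketch (illustrative code, not the actual atomix-storage implementation):

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    // Sketch of the current fixed 8-byte header: 32-bit length, 32-bit CRC32.
    final class FixedFraming {
        static ByteBuffer frame(byte[] entry) {
            if (entry.length == 0) {
                throw new IllegalArgumentException("length == 0 is not allowed");
            }
            var crc = new CRC32();
            crc.update(entry);
            return ByteBuffer.allocate(Integer.BYTES * 2 + entry.length)
                .putInt(entry.length)          // 32-bit entry length
                .putInt((int) crc.getValue())  // 32-bit entry CRC32
                .put(entry)
                .flip();
        }
    }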
Perhaps we should be tracking it via:
- a 32-bit entry CRC32
- a variable-length 64-bit term increment
- a variable-length 64-bit entry length, perhaps trimmed to 32 bits, but future-proofed to 64 bits
which would lend itself to using WritableObjects.writeTwoLongs() for the second and third items. With the first long being almost always 0, this leads to efficient storage of 1-17 bytes, with the usual case being 3-11 bytes (I think, this needs to be checked).
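To illustrate the size math, here is a hand-rolled sketch of such an encoding: one header byte carrying two 4-bit byte counts, followed by the minimal big-endian bytes of each value, for a total of 1-17 bytes. This is an assumption about the general shape of the encoding, not the actual WritableObjects wire format:

    import java.io.DataOutput;
    import java.io.IOException;

    // Illustrative variable-length encoding of two longs: one header byte
    // holds two 4-bit byte counts, followed by the minimal big-endian bytes
    // of each value. Total size is 1-17 bytes.
    final class TwoLongCodec {
        static void writeTwoLongs(DataOutput out, long first, long second) throws IOException {
            int firstBytes = byteCount(first);
            int secondBytes = byteCount(second);
            out.writeByte(firstBytes << 4 | secondBytes);
            writeBody(out, first, firstBytes);
            writeBody(out, second, secondBytes);
        }

        private static int byteCount(long value) {
            // 0 encodes in zero bytes, otherwise the minimal number of bytes
            return value == 0 ? 0 : (71 - Long.numberOfLeadingZeros(value)) / 8;
        }

        private static void writeBody(DataOutput out, long value, int bytes) throws IOException {
            for (int shift = (bytes - 1) * 8; shift >= 0; shift -= 8) {
                out.writeByte((int) (value >>> shift));
            }
        }
    }

With the term increment being 0 in steady state, the first body is empty and the pair routinely fits in a handful of bytes.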
If we can achieve this, then we can project RaftEntryMeta from the storage layer, without having to involve entry data serialization code – which we only need when we are replicating/applying entries.
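A hypothetical sketch of that projection, assuming the framing sketched above (the MetaProjection and EntryMeta names are made up for illustration):

    import java.io.DataInput;
    import java.io.IOException;

    // Hypothetical sketch: recover per-entry metadata from headers alone,
    // never touching entry serialization. Assumes the TwoLongCodec framing
    // above; the index is implicit in the entry's position in the journal.
    final class MetaProjection {
        record EntryMeta(long index, long term) { }

        static EntryMeta readNextMeta(DataInput in, EntryMeta prev) throws IOException {
            in.readInt();                          // 32-bit CRC32, not needed for metadata
            int header = in.readUnsignedByte();    // two 4-bit byte counts
            long termIncrement = readBody(in, header >>> 4);
            long length = readBody(in, header & 0xF);
            in.skipBytes((int) length);            // skip the payload entirely

            return new EntryMeta(prev.index() + 1, prev.term() + termIncrement);
        }

        private static long readBody(DataInput in, int bytes) throws IOException {
            long value = 0;
            for (int i = 0; i < bytes; i++) {
                value = value << 8 | in.readUnsignedByte();
            }
            return value;
        }
    }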
Based on the 3-11 bytes figure above, that has the potential of saving 1-9 bytes overall for each entry. The cost is a slight increase in buffer management complexity, as we would be dealing with variable headers of 7-24 bytes vs. a fixed 8-byte header (and the 8 bytes of term implied in the payload).