* [LSF/MM/BPF TOPIC] A block level, active-active replication solution
@ 2026-02-03 15:09 Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
2026-02-13 17:32 ` Bart Van Assche
0 siblings, 2 replies; 8+ messages in thread
From: Haris Iqbal @ 2026-02-03 15:09 UTC (permalink / raw)
To: lsf-pc, linux-block; +Cc: Jia Li
Hi,
We are working on a pair of kernel modules that offer a new
replication solution for the Linux kernel: a block level,
active-active replication solution over RDMA transport.
The existing block level replication solution in the Linux kernel is
DRBD, which is an active-passive solution. The data replication in
DRBD happens through 2 network hops.
An active-active solution can already be built by exporting block
devices over the network, either through NVMeOF or RNBD/RTRS, and
then creating a RAID1 device on top of them. That provides single-hop
replication, but synchronization during a degraded state still goes
through 2 hops.
The proposed solution would provide an active-active single hop
replication, and a single hop synchronization (directly between
storage nodes) in case of a degraded state.
The first kernel module is Reliable Multicast on top of RTRS (RMR),
which uses the existing RTRS kernel module in the RDMA subsystem. RMR
works in a client-server architecture, with the server module residing
on the storage nodes. RMR uses the RTRS transport ULP to guarantee
delivery of IOs to a group of hosts, and also provides data recovery
if one host in the group misses some IOs. The data recovery is handled
by the RMR server module, directly between the storage nodes.
The second one is BRMR, which is a network block device over RMR. BRMR
provides mirroring functionality and supports replacement of disks.
The proposed solution tracks dirty IOs through a dirty map, and has
internal mechanisms to prevent data corruption in case of crashes,
similar to the activity log in DRBD.
We would like to present the idea and internal workings of the
solution, and also discuss design and some benchmarking results
(comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
LSF/MM/BPF. We also want to get feedback, and potentially get more
people involved in the project.
(BRMR/RMR are in-development modules, and we plan to push them to
GitHub or somewhere else before the LSF/MM/BPF summit)
Regards
- Haris
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
@ 2026-02-03 18:01 ` Bart Van Assche
2026-02-03 18:04 ` Haris Iqbal
2026-02-13 17:32 ` Bart Van Assche
1 sibling, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2026-02-03 18:01 UTC (permalink / raw)
To: Haris Iqbal, lsf-pc, linux-block; +Cc: Jia Li
On 2/3/26 7:09 AM, Haris Iqbal wrote:
> We would like to present the idea and internal workings of the
> solution, and also discuss design and some benchmarking results
> (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> LSF/MM/BPF. We also want to get feedback, and potentially get more
> people involved in the project.
Please include data about the time needed to recover after a network
disconnect. Is this time proportional to the size of the replicated
volumes or is this time proportional to the amount of data that has
to be resynchronized?
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 18:01 ` Bart Van Assche
@ 2026-02-03 18:04 ` Haris Iqbal
2026-02-10 13:06 ` Haris Iqbal
0 siblings, 1 reply; 8+ messages in thread
From: Haris Iqbal @ 2026-02-03 18:04 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 3, 2026 at 7:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > We would like to present the idea and internal workings of the
> > solution, and also discuss design and some benchmarking results
> > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > people involved in the project.
>
> Please include data about the time needed to recover after a network
> disconnect. Is this time proportional to the size of the replicated
> volumes or is this time proportional to the amount of data that has
> to be resynchronized?
It is proportional to the amount of data to be resynchronized.
RMR divides the disk space into chunks, which are tracked in a dirty
map when write IOs are missed by a particular leg during an outage.
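A minimal sketch of that bookkeeping (Python for brevity; the chunk
size, helper names, and one-byte-per-chunk layout here are
illustrative assumptions, not taken from the actual RMR code):

```python
SECTOR_SIZE = 512
CHUNK_SIZE = 4 * 1024 * 1024  # illustrative chunk size; the real value may differ

def chunk_of(sector: int) -> int:
    """Map a sector number to the index of the chunk containing it."""
    return sector * SECTOR_SIZE // CHUNK_SIZE

def mark_dirty(dirty_map: bytearray, sector: int) -> None:
    """Record that the chunk containing `sector` missed a write on one leg."""
    dirty_map[chunk_of(sector)] |= 0x01  # one byte per chunk, dirty bit = 0x01

def chunks_to_resync(dirty_map: bytearray) -> list[int]:
    """Only dirty chunks are resynchronized, so recovery time scales with
    the amount of dirtied data rather than the volume size."""
    return [i for i, b in enumerate(dirty_map) if b & 0x01]
```

For example, a write at byte offset 5 MiB dirties only chunk 1, and a
later resync touches that chunk alone.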
>
> Thanks,
>
> Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 18:04 ` Haris Iqbal
@ 2026-02-10 13:06 ` Haris Iqbal
2026-02-10 18:31 ` Bart Van Assche
0 siblings, 1 reply; 8+ messages in thread
From: Haris Iqbal @ 2026-02-10 13:06 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 3, 2026 at 7:04 PM Haris Iqbal <haris.iqbal@ionos.com> wrote:
>
> On Tue, Feb 3, 2026 at 7:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > > We would like to present the idea and internal workings of the
> > > solution, and also discuss design and some benchmarking results
> > > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > > people involved in the project.
> >
> > Please include data about the time needed to recover after a network
> > disconnect. Is this time proportional to the size of the replicated
> > volumes or is this time proportional to the amount of data that has
> > to be resynchronized?
Hi Bart,
We did some quick runs to get the sync time and performance numbers
during syncing.
Descriptions of the labels used below:
"fio initial": numbers for the fio run with a single leg. This run was
used to dirty the data.
"fio during sync": numbers for fio after the broken leg was
reconnected, while the data was syncing.
Data size: amount of data dirtied during the initial fio run.
Sync time: time taken by the sync thread to complete the sync (fio was
also running in parallel).
RMR syncs data through a dedicated sync thread, which runs on the
storage nodes. The thread syncs only a fixed number of chunks at a
time (set to 256 for the runs below); this limit is configurable if
one wants to prioritize IOs from the client.
RMR can also sync a chunk on demand, when an IO arrives for a sector
in that chunk on the degraded storage node.
Both syncing methods can run in parallel.
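The two syncing paths could be sketched roughly like this (a toy
model; the function names, the set-based bookkeeping, and the return
conventions are all hypothetical — the real module operates on bios
inside the kernel):

```python
MAX_INFLIGHT = 256  # matches the per-batch chunk limit used in the runs below

def background_sync_batch(dirty_chunks: set[int], inflight: set[int]) -> list[int]:
    """Dedicated sync-thread path: pick the next chunks to sync, keeping
    at most MAX_INFLIGHT chunks in flight so client IO retains priority."""
    budget = MAX_INFLIGHT - len(inflight)
    picked = sorted(dirty_chunks - inflight)[:budget]
    inflight.update(picked)
    return picked

def on_client_write(chunk: int, dirty_chunks: set[int], inflight: set[int]) -> bool:
    """On-demand path: if a client write lands in a still-dirty chunk,
    that chunk is synced first so the degraded leg stays consistent.
    Returns True when the caller must sync the chunk before the write."""
    if chunk in dirty_chunks and chunk not in inflight:
        inflight.add(chunk)
        return True
    return False
```

Because both paths share the same dirty/in-flight bookkeeping, they
can run concurrently without syncing a chunk twice.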
--
SUMMARY for 10GB:
fio initial: bandwidth=1923MB/s, avg latency=1049.59 usec
fio during sync: bandwidth=442MB/s, avg latency=4707.29 usec
Data Size: 10 GB
Sync Time: 25 seconds
--
SUMMARY for 20GB:
fio initial: bandwidth=1936MB/s, avg latency=1045.68 usec
fio during sync: bandwidth=447MB/s, avg latency=4624.09 usec
Data Size: 20 GB
Sync Time: 48 seconds
--
SUMMARY for 50GB:
fio initial: bandwidth=1736MB/s, avg latency=1176.04 usec
fio during sync: bandwidth=433MB/s, avg latency=4799.21 usec
Data Size: 50 GB
Sync Time: 124 seconds
--
SUMMARY for 75GB:
fio initial: bandwidth=1726MB/s, avg latency=1184.49 usec
fio during sync: bandwidth=452MB/s, avg latency=4579.66 usec
Data Size: 75 GB
Sync Time: 178 seconds
--
SUMMARY for 100GB:
fio initial: bandwidth=1777MB/s, avg latency=1148.89 usec
fio during sync: bandwidth=442MB/s, avg latency=4708.82 usec
Data Size: 100 GB
Sync Time: 243 seconds
The fio profile used for dirtying the data and for running IOs during
the sync:
sudo fio --name=${test_name} \
--filename=${brmr_device} \
--direct=1 \
--rw=write \
--bs=4k \
--size=${size_mb}M \
--numjobs=1 \
--ioengine=libaio \
--iodepth=512 \
--iodepth_batch_submit=128 \
--iodepth_batch_complete_min=1 \
--iodepth_batch_complete_max=128 \
--time_based=0 \
--group_reporting \
--output-format=normal
Server configurations:
Client
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Storage server 1
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Storage server 2
CPU(s): 40
On-line CPU(s) list: 0-39
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
NICs (for all machines)
2x Mellanox Technologies MT27800 Family [ConnectX-5]
>
> It is proportional to the amount of data to be resynchronized.
> RMR divides the disk space into chunks, which are then tracked through
> a dirty map in case write IOs are missed for a particular leg during
> an outage.
>
> >
> > Thanks,
> >
> > Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-10 13:06 ` Haris Iqbal
@ 2026-02-10 18:31 ` Bart Van Assche
2026-02-13 14:13 ` Haris Iqbal
0 siblings, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2026-02-10 18:31 UTC (permalink / raw)
To: Haris Iqbal; +Cc: lsf-pc, linux-block, Jia Li
On 2/10/26 5:06 AM, Haris Iqbal wrote:
> We did some quick runs to get the sync time and performance numbers
> during syncing.
Hi Haris,
Thanks for having shared this data. Instead of performance numbers I'd
like to see more information about the implemented algorithm.
The description at the start of this e-mail thread says "The proposed
solution tracks dirty IOs through a dirty map". How does the dirty map
work? Is there a single bit in that map that tracks whether any of N
logical blocks has been modified? If so, is the worst case behavior that
logical blocks 0, N, 2*N, ... are modified and no other logical blocks
are modified? Does this mean that the worst case behavior involves
resynchronizing N*M logical blocks if only M logical blocks have been
modified?
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-10 18:31 ` Bart Van Assche
@ 2026-02-13 14:13 ` Haris Iqbal
0 siblings, 0 replies; 8+ messages in thread
From: Haris Iqbal @ 2026-02-13 14:13 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 10, 2026 at 7:31 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/10/26 5:06 AM, Haris Iqbal wrote:
> > We did some quick runs to get the sync time and performance numbers
> > during syncing.
>
> Hi Haris,
>
> Thanks for having shared this data. Instead of performance numbers I'd
> like to see more information about the implemented algorithm.
>
> The description at the start of this e-mail thread says "The proposed
> solution tracks dirty IOs through a dirty map". How does the dirty map
> work? Is there a single bit in that map that tracks whether any of N
> logical blocks has been modified? If so, is the worst case behavior that
> logical blocks 0, N, 2*N, ... are modified and no other logical blocks
> are modified? Does this mean that the worst case behavior involves
> resynchronizing N*M logical blocks if only M logical blocks have been
> modified?
Hi Bart,
Currently, a single byte tracks a single chunk. In that byte, a bit
tracks whether that chunk is dirty or not. We use one more bit as a
flag for a special case during disk replacement.
The disk "replace" feature in BRMR allows the user to replace the
backend disk of an active RMR pool. In order to "replace" a backend
disk, the entire disk is marked as dirty. This information then needs
to be communicated to the other storage nodes so that map redundancy
is maintained. If a "replace" happens on an isolated storage node,
this communication is delayed. We use a bit in the dirty map to track
that this information (that certain or all chunks have been marked
dirty) still needs to be communicated to the other storage nodes once
communication is restored.
We would also like to eliminate this flag bit, since it significantly
increases the size of the map; each chunk would then be tracked by a
single bit.
As the project is still in development, we plan to assess this in the future.
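The per-chunk byte described above might look like the following (the
flag names, the second-bit encoding, and the helper functions are
hypothetical illustrations of the description, not the BRMR/RMR code):

```python
DIRTY = 0x01            # chunk must be resynchronized
REPLACE_PENDING = 0x02  # dirtying by a "replace" not yet propagated to peers

def start_replace(dirty_map: bytearray, connected: bool) -> None:
    """Replacing a backend disk marks every chunk dirty; if the node is
    isolated, also remember that the peers still have to be told."""
    flags = DIRTY | (0 if connected else REPLACE_PENDING)
    for i in range(len(dirty_map)):
        dirty_map[i] |= flags

def on_reconnect(dirty_map: bytearray) -> list[int]:
    """Return the chunks whose dirtying must still be communicated to the
    other storage nodes, clearing the pending flag as we go."""
    pending = [i for i, b in enumerate(dirty_map) if b & REPLACE_PENDING]
    for i in pending:
        dirty_map[i] &= ~REPLACE_PENDING
    return pending
```

Dropping the second flag would collapse this byte to the single DIRTY
bit per chunk, shrinking the map eightfold.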
>
> Thanks,
>
> Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
@ 2026-02-13 17:32 ` Bart Van Assche
2026-02-19 10:43 ` Haris Iqbal
1 sibling, 1 reply; 8+ messages in thread
From: Bart Van Assche @ 2026-02-13 17:32 UTC (permalink / raw)
To: Haris Iqbal, lsf-pc, linux-block; +Cc: Jia Li
On 2/3/26 7:09 AM, Haris Iqbal wrote:
> We would like to present the idea and internal workings of the
> solution, and also discuss design and some benchmarking results
> (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> LSF/MM/BPF. We also want to get feedback, and potentially get more
> people involved in the project.
Please prepare for the question why the choice has been made to
implement this functionality as a new kernel module instead of
integrating this functionality in the DRBD kernel driver.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-13 17:32 ` Bart Van Assche
@ 2026-02-19 10:43 ` Haris Iqbal
0 siblings, 0 replies; 8+ messages in thread
From: Haris Iqbal @ 2026-02-19 10:43 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Fri, Feb 13, 2026 at 6:32 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > We would like to present the idea and internal workings of the
> > solution, and also discuss design and some benchmarking results
> > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > people involved in the project.
>
> Please prepare for the question why the choice has been made to
> implement this functionality as a new kernel module instead of
> integrating this functionality in the DRBD kernel driver.
Hi Bart,
In short, there were two main reasons.
1) We wanted to keep the "single-hop replication and syncing" offering
in a generic transport module (RMR), so that it can be re-used by
other modules in the future.
2) We think that DRBD's core abstractions (per-node peer replication,
single lower device) are orthogonal to the "active-active" replication
model.
We can discuss this further during the summit.
I assume when you said to "prepare for the question", you meant the same?
>
> Thanks,
>
> Bart.