* [LSF/MM/BPF TOPIC] A block level, active-active replication solution
@ 2026-02-03 15:09 Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Haris Iqbal @ 2026-02-03 15:09 UTC (permalink / raw)
To: lsf-pc, linux-block; +Cc: Jia Li
Hi,
We are working on a pair of kernel modules which would offer a new
replication solution in the Linux kernel. It would be a block level,
active-active replication solution for RDMA transport.
The existing block level replication solution in the Linux kernel is
DRBD, which is an active-passive solution. The data replication in
DRBD happens through 2 network hops.
One can build an active-active solution by exporting block devices
over the network, either through NVMeOF or RNBD/RTRS, and then
creating a RAID1 device on top of them. That provides single hop
replication, but synchronization during a degraded state still goes
through 2 hops.
The proposed solution would provide an active-active single hop
replication, and a single hop synchronization (directly between
storage nodes) in case of a degraded state.
The first kernel module is Reliable Multicast on top of RTRS (RMR),
which uses the existing RTRS kernel module in the RDMA subsystem. RMR
works in a client-server architecture, with the server module residing
on the storage nodes. RMR uses the transport ULP RTRS to guarantee
delivery of IO to a group of hosts, and also provides data recovery if
one host in the group misses some IOs. The data recovery is handled by
the RMR server module, directly between the storage nodes.
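To make the flow concrete, here is a rough user-space sketch of the delivery/recovery idea described above. The real modules are kernel code; all names, the chunk size, and the structure here are invented for illustration only.

```python
# Illustrative model of RMR's delivery guarantee: a write goes to every
# leg in the group; a leg that misses the IO gets the affected chunk
# recorded in a dirty map, so the server-side recovery path can later
# resync it directly between storage nodes. All names and the chunk
# size are hypothetical -- this is not the kernel implementation.

CHUNK_SIZE_SECTORS = 2048  # assumed chunk size

class ReplicaGroup:
    def __init__(self, legs):
        self.legs = legs                            # leg name -> online?
        self.dirty = {leg: set() for leg in legs}   # per-leg dirty chunks

    def write(self, sector, data):
        chunk = sector // CHUNK_SIZE_SECTORS
        for leg, online in self.legs.items():
            if not online:
                # the leg missed this IO: remember the chunk for resync
                self.dirty[leg].add(chunk)

    def resync(self, leg):
        # recovery between storage nodes: copy each dirty chunk from a
        # healthy leg, then clear the map entries
        chunks = sorted(self.dirty[leg])
        self.dirty[leg].clear()
        return chunks
```

A write issued while a leg is down only dirties the chunks it touches, and `resync()` returns exactly those chunks.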
The second one is BRMR, which is a network block device over RMR. BRMR
provides mirroring functionality and supports replacement of disks.
The proposed solution tracks dirty IOs through a dirty map, and has
internal mechanisms to prevent data corruption in case of crashes,
similar to the activity log in DRBD.
We would like to present the idea and internal workings of the
solution, and also discuss design and some benchmarking results
(comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
LSF/MM/BPF. We also want to get feedback, and potentially get more
people involved in the project.
(BRMR/RMR are in-development modules, and we plan to push them to
GitHub or somewhere else before the LSF/MM/BPF summit)
Regards
- Haris
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
@ 2026-02-03 18:01 ` Bart Van Assche
2026-02-03 18:04 ` Haris Iqbal
2026-02-13 17:32 ` Bart Van Assche
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2026-02-03 18:01 UTC (permalink / raw)
To: Haris Iqbal, lsf-pc, linux-block; +Cc: Jia Li
On 2/3/26 7:09 AM, Haris Iqbal wrote:
> We would like to present the idea and internal workings of the
> solution, and also discuss design and some benchmarking results
> (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> LSF/MM/BPF. We also want to get feedback, and potentially get more
> people involved in the project.
Please include data about the time needed to recover after a network
disconnect. Is this time proportional to the size of the replicated
volumes or is this time proportional to the amount of data that has
to be resynchronized?
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 18:01 ` Bart Van Assche
@ 2026-02-03 18:04 ` Haris Iqbal
2026-02-10 13:06 ` Haris Iqbal
0 siblings, 1 reply; 13+ messages in thread
From: Haris Iqbal @ 2026-02-03 18:04 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 3, 2026 at 7:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > We would like to present the idea and internal workings of the
> > solution, and also discuss design and some benchmarking results
> > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > people involved in the project.
>
> Please include data about the time needed to recover after a network
> disconnect. Is this time proportional to the size of the replicated
> volumes or is this time proportional to the amount of data that has
> to be resynchronized?
It is proportional to the amount of data to be resynchronized.
RMR divides the disk space into chunks, which are then tracked through
a dirty map in case write IOs are missed for a particular leg during
an outage.
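As an illustration of that proportionality, a toy model (the 4 MiB chunk size here is an assumption, not a value from this thread):

```python
# Toy resync-cost model: only chunks dirtied while a leg was out need
# copying, so the work scales with the dirtied data, not with the size
# of the replicated volume. The 4 MiB chunk size is an assumption.

CHUNK = 4 * 1024 * 1024  # hypothetical chunk size in bytes

def chunks_to_resync(write_offsets, write_len):
    """Return the set of chunk indices dirtied by the given writes."""
    dirty = set()
    for off in write_offsets:
        first = off // CHUNK
        last = (off + write_len - 1) // CHUNK
        dirty.update(range(first, last + 1))
    return dirty

# A huge volume, but only three small writes missed during the outage:
dirty = chunks_to_resync([0, 10 * CHUNK, 10 * CHUNK + 100], 4096)
resync_bytes = len(dirty) * CHUNK  # two chunks, regardless of volume size
```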
>
> Thanks,
>
> Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 18:04 ` Haris Iqbal
@ 2026-02-10 13:06 ` Haris Iqbal
2026-02-10 18:31 ` Bart Van Assche
0 siblings, 1 reply; 13+ messages in thread
From: Haris Iqbal @ 2026-02-10 13:06 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 3, 2026 at 7:04 PM Haris Iqbal <haris.iqbal@ionos.com> wrote:
>
> On Tue, Feb 3, 2026 at 7:01 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > > We would like to present the idea and internal workings of the
> > > solution, and also discuss design and some benchmarking results
> > > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > > people involved in the project.
> >
> > Please include data about the time needed to recover after a network
> > disconnect. Is this time proportional to the size of the replicated
> > volumes or is this time proportional to the amount of data that has
> > to be resynchronized?
Hi Bart,
We did some quick runs to get the sync time and performance numbers
during syncing.
Descriptions of the labels used below:
"fio initial": numbers for the fio run with a single leg; this run was
used to dirty the data.
"fio during sync": numbers for fio after the broken leg was
reconnected, while the data was syncing.
Data size: the amount of data dirtied during the initial fio run.
Sync time: the time taken by the sync thread to complete the sync (fio
was running in parallel).
RMR can sync data through a dedicated sync thread, which runs on the
storage nodes. The thread syncs only a fixed number of chunks at a
time (set to 256 for the runs below), which can be configured if one
wants to prioritize IOs from the client.
It can also sync a chunk on demand, when an IO for a sector in that
chunk arrives at the degraded storage node.
Both syncing methods can run in parallel.
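A rough user-space sketch of these two paths (only the in-flight limit of 256 comes from the description above; everything else is invented for illustration):

```python
# Sketch of the two resync paths: a background fill that keeps at most
# MAX_IN_FLIGHT chunks syncing at once (so client IO keeps priority),
# and an on-demand path that pulls a chunk forward when an IO for it
# arrives at the degraded node. Only the 256 limit is from the text.

MAX_IN_FLIGHT = 256

class Resync:
    def __init__(self, dirty_chunks):
        self.pending = set(dirty_chunks)
        self.in_flight = set()

    def background_fill(self):
        # top up the in-flight window from the pending set
        while len(self.in_flight) < MAX_IN_FLIGHT and self.pending:
            self.in_flight.add(self.pending.pop())

    def on_io(self, chunk):
        # on-demand path: an IO hit a still-dirty chunk, sync it now
        if chunk in self.pending:
            self.pending.discard(chunk)
            self.in_flight.add(chunk)

    def complete(self, chunk):
        # a chunk finished syncing; report whether everything is clean
        self.in_flight.discard(chunk)
        return not self.pending and not self.in_flight
```

Both paths feed the same in-flight set, which is why they can naturally run in parallel.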
--
SUMMARY for 10GB:
fio initial: bandwidth=1923MB/s, avg latency=1049.59 usec
fio during sync: bandwidth=442MB/s, avg latency=4707.29 usec
Data Size: 10 GB
Sync Time: 25 seconds
--
SUMMARY for 20GB:
fio initial: bandwidth=1936MB/s, avg latency=1045.68 usec
fio during sync: bandwidth=447MB/s, avg latency=4624.09 usec
Data Size: 20 GB
Sync Time: 48 seconds
--
SUMMARY for 50GB:
fio initial: bandwidth=1736MB/s, avg latency=1176.04 usec
fio during sync: bandwidth=433MB/s, avg latency=4799.21 usec
Data Size: 50 GB
Sync Time: 124 seconds
--
SUMMARY for 75GB:
fio initial: bandwidth=1726MB/s, avg latency=1184.49 usec
fio during sync: bandwidth=452MB/s, avg latency=4579.66 usec
Data Size: 75 GB
Sync Time: 178 seconds
--
SUMMARY for 100GB:
fio initial: bandwidth=1777MB/s, avg latency=1148.89 usec
fio during sync: bandwidth=442MB/s, avg latency=4708.82 usec
Data Size: 100 GB
Sync Time: 243 seconds
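A quick sanity check on these figures (effective resync rate = data size / sync time):

```python
# Sanity check on the figures above: data size divided by sync time
# gives the effective resync rate for each run. The rates cluster near
# 0.4 GB/s, i.e. sync time grows roughly linearly with the dirty data.

runs = [(10, 25), (20, 48), (50, 124), (75, 178), (100, 243)]  # (GB, s)
rates = [gb / secs for gb, secs in runs]
```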
fio profile used for dirtying the data and for running IOs during the sync:
sudo fio --name=${test_name} \
--filename=${brmr_device} \
--direct=1 \
--rw=write \
--bs=4k \
--size=${size_mb}M \
--numjobs=1 \
--ioengine=libaio \
--iodepth=512 \
--iodepth_batch_submit=128 \
--iodepth_batch_complete_min=1 \
--iodepth_batch_complete_max=128 \
--time_based=0 \
--group_reporting \
--output-format=normal
Machine configurations:
Client
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Storage server 1
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Storage server 2
CPU(s): 40
On-line CPU(s) list: 0-39
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
NICs (for all machines)
2x Mellanox Technologies MT27800 Family [ConnectX-5]
>
> It is proportional to the amount of data to be resynchronized.
> RMR divides the disk space into chunks, which are then tracked through
> a dirty map in case write IOs are missed for a particular leg during
> an outage.
>
> >
> > Thanks,
> >
> > Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-10 13:06 ` Haris Iqbal
@ 2026-02-10 18:31 ` Bart Van Assche
2026-02-13 14:13 ` Haris Iqbal
0 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2026-02-10 18:31 UTC (permalink / raw)
To: Haris Iqbal; +Cc: lsf-pc, linux-block, Jia Li
On 2/10/26 5:06 AM, Haris Iqbal wrote:
> We did some quick runs to get the sync time and performance numbers
> during syncing.
Hi Haris,
Thanks for sharing this data. Instead of performance numbers, I'd
like to see more information about the implemented algorithm.
The description at the start of this e-mail thread says "The proposed
solution tracks dirty IOs through a dirty map". How does the dirty map
work? Is there a single bit in that map that tracks whether any of N
logical blocks has been modified? If so, is the worst case behavior that
logical blocks 0, N, 2*N, ... are modified and no other logical blocks
are modified? Does this mean that the worst case behavior involves
resynchronizing N*M logical blocks if only M logical blocks have been
modified?
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-10 18:31 ` Bart Van Assche
@ 2026-02-13 14:13 ` Haris Iqbal
0 siblings, 0 replies; 13+ messages in thread
From: Haris Iqbal @ 2026-02-13 14:13 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 10, 2026 at 7:31 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/10/26 5:06 AM, Haris Iqbal wrote:
> > We did some quick runs to get the sync time and performance numbers
> > during syncing.
>
> Hi Haris,
>
> Thanks for having shared this data. Instead of performance numbers I'd
> like to see more information about the implemented algorithm.
>
> The description at the start of this e-mail thread says "The proposed
> solution tracks dirty IOs through a dirty map". How does the dirty map
> work? Is there a single bit in that map that tracks whether any of N
> logical blocks has been modified? If so, is the worst case behavior that
> logical blocks 0, N, 2*N, ... are modified and no other logical blocks
> are modified? Does this mean that the worst case behavior involves
> resynchronizing N*M logical blocks if only M logical blocks have been
> modified?
Hi Bart,
Currently, a single byte tracks a single chunk. In that byte, one bit
tracks whether the chunk is dirty or not, and we use one more bit as a
flag for a special case during disk replacement.
The disk "replace" feature in BRMR allows the user to replace the
backend disk of an active RMR pool. In order to "replace" a backend
disk, the entire disk is marked as dirty. This information then needs
to be communicated to the other storage nodes so that map redundancy
is maintained. If a "replace" happens on an isolated storage node,
this communication is delayed. We use a bit in the dirty map to track
that such information (some or all chunks have been marked dirty)
still needs to be communicated to the other storage nodes once
communication is restored.
We also want to eliminate this flag bit, as it significantly increases
the size of the map; a single bit would then suffice to track a chunk.
As the project is still in development, we plan to assess this in the future.
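For illustration, the per-chunk byte could be modeled like this. The bit positions and function names are assumptions; only "one dirty bit plus one replace flag per byte" comes from the description above.

```python
# Model of the per-chunk tracking byte: one bit records whether the
# chunk is dirty; a second bit records that a locally initiated
# "replace" (which marked chunks dirty) still has to be announced to
# the peer nodes once connectivity returns. Bit positions are
# assumptions made for this illustration.

DIRTY          = 1 << 0
REPLACE_NOTIFY = 1 << 1

def mark_replace(chunk_byte):
    # disk replacement dirties the chunk and flags the pending broadcast
    return chunk_byte | DIRTY | REPLACE_NOTIFY

def peers_notified(chunk_byte):
    # peers now hold a redundant copy of the map entry; drop the flag
    return chunk_byte & ~REPLACE_NOTIFY

def resynced(chunk_byte):
    # the chunk's data has been copied over; clear the dirty bit
    return chunk_byte & ~DIRTY
```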
>
> Thanks,
>
> Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
@ 2026-02-13 17:32 ` Bart Van Assche
2026-02-19 10:43 ` Haris Iqbal
2026-05-01 19:47 ` [Lsf-pc] " Matthew Wilcox
2026-05-05 9:19 ` Philipp Reisner
3 siblings, 1 reply; 13+ messages in thread
From: Bart Van Assche @ 2026-02-13 17:32 UTC (permalink / raw)
To: Haris Iqbal, lsf-pc, linux-block; +Cc: Jia Li
On 2/3/26 7:09 AM, Haris Iqbal wrote:
> We would like to present the idea and internal workings of the
> solution, and also discuss design and some benchmarking results
> (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> LSF/MM/BPF. We also want to get feedback, and potentially get more
> people involved in the project.
Please prepare for the question why the choice has been made to
implement this functionality as a new kernel module instead of
integrating this functionality in the DRBD kernel driver.
Thanks,
Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-13 17:32 ` Bart Van Assche
@ 2026-02-19 10:43 ` Haris Iqbal
2026-04-29 14:26 ` Haris Iqbal
0 siblings, 1 reply; 13+ messages in thread
From: Haris Iqbal @ 2026-02-19 10:43 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Fri, Feb 13, 2026 at 6:32 PM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > We would like to present the idea and internal workings of the
> > solution, and also discuss design and some benchmarking results
> > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > people involved in the project.
>
> Please prepare for the question why the choice has been made to
> implement this functionality as a new kernel module instead of
> integrating this functionality in the DRBD kernel driver.
Hi Bart,
In short, there were mostly 2 reasons.
1) We wanted to keep the "single-hop replication and syncing" offering
in a generic transport module (RMR), so that it can be re-used by
other modules in the future.
2) We think that DRBD’s core abstractions (per‑node peer replication,
single lower device) are orthogonal to the “active-active replication”
model.
We can discuss this further during the summit.
I assume when you said to "prepare for the question", you meant the same?
>
> Thanks,
>
> Bart.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-19 10:43 ` Haris Iqbal
@ 2026-04-29 14:26 ` Haris Iqbal
0 siblings, 0 replies; 13+ messages in thread
From: Haris Iqbal @ 2026-04-29 14:26 UTC (permalink / raw)
To: Bart Van Assche; +Cc: lsf-pc, linux-block, Jia Li
On Thu, Feb 19, 2026 at 11:43 AM Haris Iqbal <haris.iqbal@ionos.com> wrote:
>
> On Fri, Feb 13, 2026 at 6:32 PM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > On 2/3/26 7:09 AM, Haris Iqbal wrote:
> > > We would like to present the idea and internal workings of the
> > > solution, and also discuss design and some benchmarking results
> > > (comparison with RAID1 over RNBD/NVMeOF devices, or DRBD) during
> > > LSF/MM/BPF. We also want to get feedback, and potentially get more
> > > people involved in the project.
Hello,
As mentioned earlier, we have open-sourced the RMR+BRMR code, and the
documentation is also public:
https://github.com/ionos-cloud/RMR
https://ionos-cloud.github.io/rmr.io/
> >
> > Please prepare for the question why the choice has been made to
> > implement this functionality as a new kernel module instead of
> > integrating this functionality in the DRBD kernel driver.
>
> Hi Bart,
>
>
> In short, it was mostly because of 2 reasons.
> 1) We wanted to keep the "single-hop replication and syncing" offering
> in a generic transport module (RMR), so that it can be re-used by
> other modules in the future.
> 2) We think that DRBD’s core abstractions (per‑node peer replication,
> single lower device) are orthogonal to the “active-active replication”
> model.
>
> We can discuss this further during the summit.
> I assume when you said to "prepare for the question", you meant the same?
>
> >
> > Thanks,
> >
> > Bart.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
2026-02-13 17:32 ` Bart Van Assche
@ 2026-05-01 19:47 ` Matthew Wilcox
2026-05-02 11:41 ` Keith Busch
2026-05-05 9:19 ` Philipp Reisner
3 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2026-05-01 19:47 UTC (permalink / raw)
To: Haris Iqbal; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 03, 2026 at 04:09:59PM +0100, Haris Iqbal via Lsf-pc wrote:
> We are working on a pair of kernel modules which would offer a new
> replication solution in the Linux kernel. It would be a block level,
> active-active replication solution for RDMA transport.
Why is active-active a good idea?
With an active-passive solution, network traffic is directed to the
active node. Over time at some point we get close to saturating the
link and performance drops. At that point, human intervention will
occur and the network link will be upgraded.
With an active-active solution, traffic goes to each node. At some point
each link will be about 75% utilised and we won't see any performance
problems. But then a node goes down and all of a sudden the remaining
node is being hit with 150% of the link capacity. There's no gradual
degradation here; the whole solution just goes down.
Of course it doesn't have to be network capacity either; it could be CPU,
RAM or any other resource needed to service the requests.
Active-active is fragile and I would never recommend such a solution.
I won't be in your session, but I thought it worth raising this point.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-05-01 19:47 ` [Lsf-pc] " Matthew Wilcox
@ 2026-05-02 11:41 ` Keith Busch
2026-05-04 8:24 ` Haris Iqbal
0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2026-05-02 11:41 UTC (permalink / raw)
To: Matthew Wilcox; +Cc: Haris Iqbal, lsf-pc, linux-block, Jia Li
On Fri, May 01, 2026 at 08:47:35PM +0100, Matthew Wilcox wrote:
> On Tue, Feb 03, 2026 at 04:09:59PM +0100, Haris Iqbal via Lsf-pc wrote:
> > We are working on a pair of kernel modules which would offer a new
> > replication solution in the Linux kernel. It would be a block level,
> > active-active replication solution for RDMA transport.
>
> Why is active-active a good idea?
>
> With an active-passive solution, network traffic is directed to the
> active node. Over time at some point we get close to saturating the
> link and performance drops. At that point, human intervention will
> occur and the network link will be upgraded.
>
> With an active-active solution, traffic goes to each node. At some point
> each link will be about 75% utilised and we won't see any performance
> problems. But then a node goes down and all of a sudden the remaining
> node is being hit with 150% of the link capacity. There's no gradual
> degradation here; the whole solution just goes down.
Maybe I'm out of touch with reality, but I could swear active-passive
setups are often configured such that the passive node for one resource
is the active node for another. That would also suffer the same link
capacity issues you're describing.
* Re: [Lsf-pc] [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-05-02 11:41 ` Keith Busch
@ 2026-05-04 8:24 ` Haris Iqbal
0 siblings, 0 replies; 13+ messages in thread
From: Haris Iqbal @ 2026-05-04 8:24 UTC (permalink / raw)
To: Keith Busch, Matthew Wilcox; +Cc: lsf-pc, linux-block, Jia Li
On Sat, May 2, 2026 at 1:41 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Fri, May 01, 2026 at 08:47:35PM +0100, Matthew Wilcox wrote:
> > On Tue, Feb 03, 2026 at 04:09:59PM +0100, Haris Iqbal via Lsf-pc wrote:
> > > We are working on a pair of kernel modules which would offer a new
> > > replication solution in the Linux kernel. It would be a block level,
> > > active-active replication solution for RDMA transport.
> >
> > Why is active-active a good idea?
> >
> > With an active-passive solution, network traffic is directed to the
> > active node. Over time at some point we get close to saturating the
> > link and performance drops. At that point, human intervention will
> > occur and the network link will be upgraded.
> >
> > With an active-active solution, traffic goes to each node. At some point
> > each link will be about 75% utilised and we won't see any performance
> > problems. But then a node goes down and all of a sudden the remaining
> > node is being hit with 150% of the link capacity. There's no gradual
> > degradation here; the whole solution just goes down.
One can always flow-control the traffic so that each node in the
active-active setup serves only as much load as a single node could
handle for both, in case of failure.
Conversely, by restricting the traffic to one node (as in
active-passive) during the happy case (no node failure), which is
actually the major part of such a setup's lifetime, one under-utilizes
the available links and intentionally accepts lower performance.
Another point is that, in an active-active setup, when one node dies,
only the reads double on the surviving node, which would mean roughly
a 40% load increase (assuming a 30/70 write/read split).
>
> Maybe I'm out of touch with reality, but I could swear active-passive
> setups are often configured such that the passive node for one resource
> is the active node for another. That would also suffer the same link
> capacity issues you're describing.
AFAIK, what you are describing is more of a clustered RAID setup,
where multiple clients write to different, defined and protected
sections of the disk.
* Re: [LSF/MM/BPF TOPIC] A block level, active-active replication solution
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
` (2 preceding siblings ...)
2026-05-01 19:47 ` [Lsf-pc] " Matthew Wilcox
@ 2026-05-05 9:19 ` Philipp Reisner
3 siblings, 0 replies; 13+ messages in thread
From: Philipp Reisner @ 2026-05-05 9:19 UTC (permalink / raw)
To: Haris Iqbal; +Cc: lsf-pc, linux-block, Jia Li
On Tue, Feb 03, 2026 at 04:09:59PM +0100, Haris Iqbal wrote:
> Hi,
>
> We are working on a pair of kernel modules which would offer a new
> replication solution in the Linux kernel. It would be a block level,
> active-active replication solution for RDMA transport.
>
> The existing block level replication solution in the Linux kernel is
> DRBD, which is an active-passive solution. The data replication in
> DRBD happens through 2 network hops.
>
>
> An active-active solution which one can build is by exporting block
> devices, either through NVMeOF or RNBD/RTRS, over the network, and
> then creating a raid1 device over it. It would provide a single hop
> replication solution, but the synchronization during a degraded state
> goes through 2 hops.
>
> The proposed solution would provide an active-active single hop
> replication, and a single hop synchronization (directly between
> storage nodes) in case of a degraded state.
[...]
I stumbled across this post because of the newer replies.
I want to point out that we have significantly developed DRBD over the
last 15 years as an out-of-tree module. In the past months, we began
the process of getting all those improvements back into Linux
upstream.
With that, DRBD9 became multi-node. It already does the “active-active
single hop replication” described here. The networking part is now
abstracted into
transport modules. We have one for TCP, one for load balancing across
multiple TCP connections, and one for RDMA.
What you are doing here, in DRBD lingo, is a diskless primary
connected to multiple storage nodes.
Find everything here https://github.com/LINBIT.
The latest edition of what we bring to the upstreaming discussion:
https://github.com/LINBIT/linux-drbd/tree/drbd-next
Philipp
Thread overview: 13+ messages
2026-02-03 15:09 [LSF/MM/BPF TOPIC] A block level, active-active replication solution Haris Iqbal
2026-02-03 18:01 ` Bart Van Assche
2026-02-03 18:04 ` Haris Iqbal
2026-02-10 13:06 ` Haris Iqbal
2026-02-10 18:31 ` Bart Van Assche
2026-02-13 14:13 ` Haris Iqbal
2026-02-13 17:32 ` Bart Van Assche
2026-02-19 10:43 ` Haris Iqbal
2026-04-29 14:26 ` Haris Iqbal
2026-05-01 19:47 ` [Lsf-pc] " Matthew Wilcox
2026-05-02 11:41 ` Keith Busch
2026-05-04 8:24 ` Haris Iqbal
2026-05-05 9:19 ` Philipp Reisner