Linux Btrfs filesystem development
* What mechanisms protect against split brain?
@ 2022-05-29 11:34 Forza
  2022-06-08  2:44 ` Wang Yugui
  0 siblings, 1 reply; 11+ messages in thread
From: Forza @ 2022-05-29 11:34 UTC (permalink / raw)
  To: Btrfs BTRFS

Hi,

Recently there have been some discussions, both here on the mailing list and on #btrfs IRC, about the consequences of mounting one RAID1 mirror as degraded and then later re-introducing the missing device, and also about having the degraded mount option in fstab or on the kernel command line.

So I wonder whether Btrfs has any protective mechanisms against data loss/corruption if a drive is missing for a while and is later re-introduced. There is also the split-brain case, where each mirror might be independently updated and then recombined.

Is there an official recommendation regarding degraded mounts from the kernel command line? I understand the use case, since it allows the system to boot even if a device goes missing or dies after a reboot.

Thanks,
Forza

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-05-29 11:34 What mechanisms protect against split brain? Forza
@ 2022-06-08  2:44 ` Wang Yugui
  2022-06-08 10:15   ` Wang Yugui
  0 siblings, 1 reply; 11+ messages in thread
From: Wang Yugui @ 2022-06-08  2:44 UTC (permalink / raw)
  To: Forza; +Cc: Btrfs BTRFS

Hi,

I tried some tests for this case.

After the missing RAID1 device is re-introduced:
1, mount/read seems to work.
   Checksum-based error detection helps.
   The current pid-based I/O path selection policy may help too:
       preferred_mirror = first + (current->pid % num_stripes);

2, 'btrfs scrub' failed to finish.
    Any advice on how to return to a clean state?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/08

> Hi,
> 
> Recently there have been some discussions, both here on the mailing list and on #btrfs IRC, about the consequences of mounting one RAID1 mirror as degraded and then later re-introducing the missing device, and also about having the degraded mount option in fstab or on the kernel command line.
> 
> So I wonder whether Btrfs has any protective mechanisms against data loss/corruption if a drive is missing for a while and is later re-introduced. There is also the split-brain case, where each mirror might be independently updated and then recombined.
> 
> Is there an official recommendation regarding degraded mounts from the kernel command line? I understand the use case, since it allows the system to boot even if a device goes missing or dies after a reboot.
> 
> Thanks,
> Forza



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08  2:44 ` Wang Yugui
@ 2022-06-08 10:15   ` Wang Yugui
  2022-06-08 10:32     ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Wang Yugui @ 2022-06-08 10:15 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Forza, Btrfs BTRFS

Hi, Forza, Qu Wenruo

I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
*1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u

#!/bin/bash
set -uxe -o pipefail

mnt=/mnt/test
dev1=/dev/vdb1
dev2=/dev/vdb2

  dmesg -C
  mkdir -p $mnt

  mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
  mount $dev1 $mnt
  xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
  sync
  umount $mnt

  btrfs dev scan -u $dev2
  mount -o degraded $dev1 $mnt
  #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
  umount $mnt

  btrfs dev scan
  btrfs dev scan -u $dev1
  mount -o degraded $dev2 $mnt
  #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
  mkdir -p $mnt/branch2; /bin/cp -R /usr/lib64 $mnt/branch2 # more complex than xfs_io
  umount $mnt

  btrfs dev scan
  mount $dev1 $mnt # *1
  ls $mnt

  btrfs balance start --full-balance $mnt # *2
  #btrfs scrub start -B $mnt  # *3
  #btrfs scrub start $mnt; sleep 2; btrfs scrub status $mnt; btrfs scrub start -B $mnt; # *4

  umount $mnt

test result:
we may fail at *1, *2, *3, or *4, with different frequencies.

dmesg output:
1)
[ 1379.124079] BTRFS error (device vdb1): tree level mismatch detected, bytenr=31866880 level expected=1 has=0
[ 1379.127928] BTRFS error (device vdb1): tree level mismatch detected, bytenr=31866880 level expected=1 has=0
[ 1379.132109] BTRFS error (device vdb1: state C): failed to load root csum
[ 1379.137281] BTRFS error (device vdb1: state C): open_ctree failed

2)
[ 2950.467178] BTRFS error (device vdb1): tree first key mismatch detected, bytenr=32342016 parent_transid=9 key expected=(301555712,168,106496) has=(2552,96,5)
[ 2950.471283] BTRFS error (device vdb1): tree first key mismatch detected, bytenr=32342016 parent_transid=9 key expected=(301555712,168,106496) has=(2552,96,5)
[ 2950.479960] BTRFS info (device vdb1): balance: ended with status: -117

So the RAID1 split-brain case is not yet supported by btrfs.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/08

> Hi,
> 
> I tried some tests for this case.
> 
> After the missing RAID1 device is re-introduced:
> 1, mount/read seems to work.
>    Checksum-based error detection helps.
>    The current pid-based I/O path selection policy may help too:
>        preferred_mirror = first + (current->pid % num_stripes);
> 
> 2, 'btrfs scrub' failed to finish.
>     Any advice on how to return to a clean state?
> 
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/08
> 
> > Hi,
> > 
> > Recently there have been some discussions, both here on the mailing list and on #btrfs IRC, about the consequences of mounting one RAID1 mirror as degraded and then later re-introducing the missing device, and also about having the degraded mount option in fstab or on the kernel command line.
> > 
> > So I wonder whether Btrfs has any protective mechanisms against data loss/corruption if a drive is missing for a while and is later re-introduced. There is also the split-brain case, where each mirror might be independently updated and then recombined.
> > 
> > Is there an official recommendation regarding degraded mounts from the kernel command line? I understand the use case, since it allows the system to boot even if a device goes missing or dies after a reboot.
> > 
> > Thanks,
> > Forza
> 



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 10:15   ` Wang Yugui
@ 2022-06-08 10:32     ` Qu Wenruo
  2022-06-08 10:58       ` Wang Yugui
                         ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-06-08 10:32 UTC (permalink / raw)
  To: Wang Yugui, Qu Wenruo; +Cc: Forza, Btrfs BTRFS



On 2022/6/8 18:15, Wang Yugui wrote:
> Hi, Forza, Qu Wenruo
>
> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u

No no no, that is not meant to address split brain; it is mostly to drop
caches on the recovery path to maximize the chance of recovery.

It's not designed to solve the split-brain problem at all; this is just
one case of such a problem.

In fact, the full split-brain case (both copies have the same generation,
but each experienced its own degraded mount) cannot be solved by btrfs
itself at all.

Btrfs can only solve the partial split-brain case (one device has a higher
generation, so btrfs can still determine which copy is the correct one).
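
A quick way to see which copy btrfs will trust in that partial case is to
compare the superblock generation on each device (a sketch only, using the
device names from your script):

  btrfs inspect-internal dump-super /dev/vdb1 | grep '^generation'
  btrfs inspect-internal dump-super /dev/vdb2 | grep '^generation'
  # the device reporting the higher generation holds the newer, trusted copy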

>
> #!/bin/bash
> set -uxe -o pipefail
>
> mnt=/mnt/test
> dev1=/dev/vdb1
> dev2=/dev/vdb2
>
>    dmesg -C
>    mkdir -p $mnt
>
>    mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
>    mount $dev1 $mnt
>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>    sync
>    umount $mnt
>
>    btrfs dev scan -u $dev2
>    mount -o degraded $dev1 $mnt
>    #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>    mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
>    umount $mnt
>
>    btrfs dev scan
>    btrfs dev scan -u $dev1
>    mount -o degraded $dev2 $mnt

Your case is the full split brain case.

Not possible to solve.

In fact, if you don't do the degraded mount on dev2, btrfs is completely
fine to resilver the fs without any problem.
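
Roughly, that resilver path is just (a sketch only, reusing the device names
from the script, and assuming only dev1 ever received degraded writes):

  btrfs dev scan
  mount $dev1 $mnt            # both devices present again; dev2 is merely stale
  btrfs scrub start -B $mnt   # rewrites the stale copies on dev2 from dev1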

Thanks,
Qu
>    #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>    mkdir -p $mnt/branch2; /bin/cp -R /usr/lib64 $mnt/branch2 # more complex than xfs_io
>    umount $mnt
>
>    btrfs dev scan
>    mount $dev1 $mnt # *1
>    ls $mnt
>
>    btrfs balance start --full-balance $mnt # *2
>    #btrfs scrub start -B $mnt  # *3
>    #btrfs scrub start $mnt; sleep 2; btrfs scrub status $mnt; btrfs scrub start -B $mnt; # *4
>
>    umount $mnt
>
> test result:
> we may fail at *1, *2, *3, or *4, with different frequencies.
>
> dmesg output:
> 1)
> [ 1379.124079] BTRFS error (device vdb1): tree level mismatch detected, bytenr=31866880 level expected=1 has=0
> [ 1379.127928] BTRFS error (device vdb1): tree level mismatch detected, bytenr=31866880 level expected=1 has=0
> [ 1379.132109] BTRFS error (device vdb1: state C): failed to load root csum
> [ 1379.137281] BTRFS error (device vdb1: state C): open_ctree failed
>
> 2)
> [ 2950.467178] BTRFS error (device vdb1): tree first key mismatch detected, bytenr=32342016 parent_transid=9 key expected=(301555712,168,106496) has=(2552,96,5)
> [ 2950.471283] BTRFS error (device vdb1): tree first key mismatch detected, bytenr=32342016 parent_transid=9 key expected=(301555712,168,106496) has=(2552,96,5)
> [ 2950.479960] BTRFS info (device vdb1): balance: ended with status: -117
>
> So the RAID1 split-brain case is not yet supported by btrfs.
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/08
>
>> Hi,
>>
>> I tried some tests for this case.
>>
>> After the missing RAID1 device is re-introduced:
>> 1, mount/read seems to work.
>>     Checksum-based error detection helps.
>>     The current pid-based I/O path selection policy may help too:
>>         preferred_mirror = first + (current->pid % num_stripes);
>>
>> 2, 'btrfs scrub' failed to finish.
>>      Any advice on how to return to a clean state?
>>
>> Best Regards
>> Wang Yugui (wangyugui@e16-tech.com)
>> 2022/06/08
>>
>>> Hi,
>>>
>>> Recently there have been some discussions, both here on the mailing list and on #btrfs IRC, about the consequences of mounting one RAID1 mirror as degraded and then later re-introducing the missing device, and also about having the degraded mount option in fstab or on the kernel command line.
>>>
>>> So I wonder whether Btrfs has any protective mechanisms against data loss/corruption if a drive is missing for a while and is later re-introduced. There is also the split-brain case, where each mirror might be independently updated and then recombined.
>>>
>>> Is there an official recommendation regarding degraded mounts from the kernel command line? I understand the use case, since it allows the system to boot even if a device goes missing or dies after a reboot.
>>>
>>> Thanks,
>>> Forza
>>
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 10:32     ` Qu Wenruo
@ 2022-06-08 10:58       ` Wang Yugui
  2022-06-08 11:19         ` Qu Wenruo
  2022-06-08 11:40       ` Austin S. Hemmelgarn
  2022-06-08 14:11       ` Andrei Borzenkov
  2 siblings, 1 reply; 11+ messages in thread
From: Wang Yugui @ 2022-06-08 10:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Forza, Btrfs BTRFS

Hi,

> On 2022/6/8 18:15, Wang Yugui wrote:
> > Hi, Forza, Qu Wenruo
> >
> > I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
> > *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
> 
> No no no, that is not meant to address split brain; it is mostly to drop
> caches on the recovery path to maximize the chance of recovery.
> 
> It's not designed to solve the split-brain problem at all; this is just
> one case of such a problem.
> 
> In fact, the full split-brain case (both copies have the same generation,
> but each experienced its own degraded mount) cannot be solved by btrfs
> itself at all.
> 
> Btrfs can only solve the partial split-brain case (one device has a higher
> generation, so btrfs can still determine which copy is the correct one).
> 
> >
> > #!/bin/bash
> > set -uxe -o pipefail
> >
> > mnt=/mnt/test
> > dev1=/dev/vdb1
> > dev2=/dev/vdb2
> >
> >    dmesg -C
> >    mkdir -p $mnt
> >
> >    mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
> >    mount $dev1 $mnt
> >    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
> >    sync
> >    umount $mnt
> >
> >    btrfs dev scan -u $dev2
> >    mount -o degraded $dev1 $mnt
> >    #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
> >    mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
> >    umount $mnt
> >
> >    btrfs dev scan
> >    btrfs dev scan -u $dev1
> >    mount -o degraded $dev2 $mnt
> 
> Your case is the full split brain case.
> 
> Not possible to solve.
> 
> In fact, if you don't do the degraded mount on dev2, btrfs is completely
> fine to resilver the fs without any problem.

step1: we mark a btrfs RAID1 filesystem that has seen degraded writes as not-clean-RAID1.
step2: in that state, we default to reading copy 0 of RAID1,
	instead of the current pid-based I/O path selection policy
           preferred_mirror = first + (current->pid % num_stripes);

Does this idea seem workable?

Is a degraded RAID1 write almost the same as full split brain?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/08



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 10:58       ` Wang Yugui
@ 2022-06-08 11:19         ` Qu Wenruo
  2022-06-08 11:55           ` Wang Yugui
  0 siblings, 1 reply; 11+ messages in thread
From: Qu Wenruo @ 2022-06-08 11:19 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Forza, Btrfs BTRFS



On 2022/6/8 18:58, Wang Yugui wrote:
> Hi,
>
>> On 2022/6/8 18:15, Wang Yugui wrote:
>>> Hi, Forza, Qu Wenruo
>>>
>>> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
>>> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
>>
>> No no no, that is not meant to address split brain; it is mostly to drop
>> caches on the recovery path to maximize the chance of recovery.
>>
>> It's not designed to solve the split-brain problem at all; this is just
>> one case of such a problem.
>>
>> In fact, the full split-brain case (both copies have the same generation,
>> but each experienced its own degraded mount) cannot be solved by btrfs
>> itself at all.
>>
>> Btrfs can only solve the partial split-brain case (one device has a higher
>> generation, so btrfs can still determine which copy is the correct one).
>>
>>>
>>> #!/bin/bash
>>> set -uxe -o pipefail
>>>
>>> mnt=/mnt/test
>>> dev1=/dev/vdb1
>>> dev2=/dev/vdb2
>>>
>>>     dmesg -C
>>>     mkdir -p $mnt
>>>
>>>     mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
>>>     mount $dev1 $mnt
>>>     xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>>     sync
>>>     umount $mnt
>>>
>>>     btrfs dev scan -u $dev2
>>>     mount -o degraded $dev1 $mnt
>>>     #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>>     mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
>>>     umount $mnt
>>>
>>>     btrfs dev scan
>>>     btrfs dev scan -u $dev1
>>>     mount -o degraded $dev2 $mnt
>>
>> Your case is the full split brain case.
>>
>> Not possible to solve.
>>
>> In fact, if you don't do the degraded mount on dev2, btrfs is completely
>> fine to resilver the fs without any problem.
>
> step1: we mark a btrfs RAID1 filesystem that has seen degraded writes as not-clean-RAID1.

Then when should it be cleared?
After a full scrub, or at some other point?

> step2: in that state, we default to reading copy 0 of RAID1,
> 	instead of the current pid-based I/O path selection policy
>             preferred_mirror = first + (current->pid % num_stripes);

That's feasible, but it still needs an on-disk format change.

Furthermore, this idea can also be implemented in a more generic way: a
write-intent bitmap.

In fact, the DM layer uses this to speed up resilvering and to handle
split-brain cases.

With a write-intent bitmap, every degraded write leaves a record in the
bitmap until it has been properly resilvered.
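
For reference, mdadm exposes the same mechanism for md arrays; this is only an
illustration of the concept (device names made up), nothing btrfs-specific:

  # md keeps a write-intent bitmap so that a returning member only needs the
  # regions marked dirty resynced, instead of a full rebuild
  mdadm --create /dev/md0 --level=1 --raid-devices=2 --bitmap=internal /dev/sdx1 /dev/sdy1
  cat /proc/mdstat    # the array reports a "bitmap:" line while one is active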

Thanks,
Qu

>
> Does this idea seem workable?
>
> Is a degraded RAID1 write almost the same as full split brain?
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/08
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 10:32     ` Qu Wenruo
  2022-06-08 10:58       ` Wang Yugui
@ 2022-06-08 11:40       ` Austin S. Hemmelgarn
  2022-06-08 14:11       ` Andrei Borzenkov
  2 siblings, 0 replies; 11+ messages in thread
From: Austin S. Hemmelgarn @ 2022-06-08 11:40 UTC (permalink / raw)
  To: Wang Yugui, Btrfs BTRFS; +Cc: Forza, Qu Wenruo, Qu Wenruo

On 08/06/2022 06.32, Qu Wenruo wrote:
> In fact, the full split-brain case (both copies have the same generation,
> but each experienced its own degraded mount) cannot be solved by btrfs
> itself at all.
> 
> Btrfs can only solve the partial split-brain case (one device has a higher
> generation, so btrfs can still determine which copy is the correct one).
Of note, this is not unique to BTRFS. The quorum requirement that Ceph
and many other distributed storage systems impose on writes exists
specifically to avoid this type of situation.
> 
>>
>> #!/bin/bash
>> set -uxe -o pipefail
>>
>> mnt=/mnt/test
>> dev1=/dev/vdb1
>> dev2=/dev/vdb2
>>
>>    dmesg -C
>>    mkdir -p $mnt
>>
>>    mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
>>    mount $dev1 $mnt
>>    xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>    sync
>>    umount $mnt
>>
>>    btrfs dev scan -u $dev2
>>    mount -o degraded $dev1 $mnt
>>    #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>    mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex
>> than xfs_io
>>    umount $mnt
>>
>>    btrfs dev scan
>>    btrfs dev scan -u $dev1
>>    mount -o degraded $dev2 $mnt
> 
> Your case is the full split brain case.
> 
> Not possible to solve.
> 
> In fact, if you don't do the degraded mount on dev2, btrfs is completely
> fine to resilver the fs without any problem.
> 
And this, in turn, is why BTRFS refuses to mount degraded without the 
user explicitly asking for it, and why having `degraded` in your mount 
options in `/etc/fstab` (or on the kernel command line) is so dangerous. 
There’s no way for BTRFS (or the block layer for that matter) to 
reliably differentiate between a missing device resulting from a device 
failure and a missing device resulting from other issues, and those 
other issues can easily result in one half of a two-device volume not 
being present for one boot, and the other half not being present on the 
next boot.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 11:19         ` Qu Wenruo
@ 2022-06-08 11:55           ` Wang Yugui
  2022-06-08 11:59             ` Qu Wenruo
  0 siblings, 1 reply; 11+ messages in thread
From: Wang Yugui @ 2022-06-08 11:55 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Forza, Btrfs BTRFS

Hi,

> On 2022/6/8 18:58, Wang Yugui wrote:
> > Hi,
> >
> >> On 2022/6/8 18:15, Wang Yugui wrote:
> >>> Hi, Forza, Qu Wenruo
> >>>
> >>> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
> >>> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
> >>
> >> No no no, that is not meant to address split brain; it is mostly to drop
> >> caches on the recovery path to maximize the chance of recovery.
> >>
> >> It's not designed to solve the split-brain problem at all; this is just
> >> one case of such a problem.
> >>
> >> In fact, the full split-brain case (both copies have the same generation,
> >> but each experienced its own degraded mount) cannot be solved by btrfs
> >> itself at all.
> >>
> >> Btrfs can only solve the partial split-brain case (one device has a higher
> >> generation, so btrfs can still determine which copy is the correct one).
> >>
> >>>
> >>> #!/bin/bash
> >>> set -uxe -o pipefail
> >>>
> >>> mnt=/mnt/test
> >>> dev1=/dev/vdb1
> >>> dev2=/dev/vdb2
> >>>
> >>>     dmesg -C
> >>>     mkdir -p $mnt
> >>>
> >>>     mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
> >>>     mount $dev1 $mnt
> >>>     xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
> >>>     sync
> >>>     umount $mnt
> >>>
> >>>     btrfs dev scan -u $dev2
> >>>     mount -o degraded $dev1 $mnt
> >>>     #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
> >>>     mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
> >>>     umount $mnt
> >>>
> >>>     btrfs dev scan
> >>>     btrfs dev scan -u $dev1
> >>>     mount -o degraded $dev2 $mnt
> >>
> >> Your case is the full split brain case.
> >>
> >> Not possible to solve.
> >>
> >> In fact, if you don't do the degraded mount on dev2, btrfs is completely
> >> fine to resilver the fs without any problem.
> >
> > step1: we mark a btrfs RAID1 filesystem that has seen degraded writes as not-clean-RAID1.
> 
> Then when should it be cleared?
> After a full scrub, or at some other point?

A 'full scrub' or a 'full balance' would be OK.
This is not the normal path, so performance is not critical.

> > step2: in that state, we default to reading copy 0 of RAID1,
> > 	instead of the current pid-based I/O path selection policy
> >             preferred_mirror = first + (current->pid % num_stripes);
> 
> That's feasible, but it still needs an on-disk format change.

Yes, we need to save something on the disks,
but maybe only 1 byte per disk, so maybe no on-disk format change is needed.

> Furthermore, this idea can also be implemented in a more generic way: a
> write-intent bitmap.
> 
> In fact, the DM layer uses this to speed up resilvering and to handle
> split-brain cases.
> 
> With a write-intent bitmap, every degraded write leaves a record in the
> bitmap until it has been properly resilvered.

A write-intent bitmap has a performance cost,
so isn't it a little expensive for RAID1[C34]?

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2022/06/08

> Thanks,
> Qu
> 
> >
> > Does this idea seem workable?
> >
> > Is a degraded RAID1 write almost the same as full split brain?
> >
> > Best Regards
> > Wang Yugui (wangyugui@e16-tech.com)
> > 2022/06/08



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 11:55           ` Wang Yugui
@ 2022-06-08 11:59             ` Qu Wenruo
  0 siblings, 0 replies; 11+ messages in thread
From: Qu Wenruo @ 2022-06-08 11:59 UTC (permalink / raw)
  To: Wang Yugui; +Cc: Forza, Btrfs BTRFS



On 2022/6/8 19:55, Wang Yugui wrote:
> Hi,
>
>> On 2022/6/8 18:58, Wang Yugui wrote:
>>> Hi,
>>>
>>>> On 2022/6/8 18:15, Wang Yugui wrote:
>>>>> Hi, Forza, Qu Wenruo
>>>>>
>>>>> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
>>>>> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
>>>>
>>>> No no no, that is not meant to address split brain; it is mostly to drop
>>>> caches on the recovery path to maximize the chance of recovery.
>>>>
>>>> It's not designed to solve the split-brain problem at all; this is just
>>>> one case of such a problem.
>>>>
>>>> In fact, the full split-brain case (both copies have the same generation,
>>>> but each experienced its own degraded mount) cannot be solved by btrfs
>>>> itself at all.
>>>>
>>>> Btrfs can only solve the partial split-brain case (one device has a higher
>>>> generation, so btrfs can still determine which copy is the correct one).
>>>>
>>>>>
>>>>> #!/bin/bash
>>>>> set -uxe -o pipefail
>>>>>
>>>>> mnt=/mnt/test
>>>>> dev1=/dev/vdb1
>>>>> dev2=/dev/vdb2
>>>>>
>>>>>      dmesg -C
>>>>>      mkdir -p $mnt
>>>>>
>>>>>      mkfs.btrfs -f -m raid1 -d raid1 $dev1 $dev2
>>>>>      mount $dev1 $mnt
>>>>>      xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
>>>>>      sync
>>>>>      umount $mnt
>>>>>
>>>>>      btrfs dev scan -u $dev2
>>>>>      mount -o degraded $dev1 $mnt
>>>>>      #xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
>>>>>      mkdir -p $mnt/branch1; /bin/cp -R /usr/bin $mnt/branch1 # more complex than xfs_io
>>>>>      umount $mnt
>>>>>
>>>>>      btrfs dev scan
>>>>>      btrfs dev scan -u $dev1
>>>>>      mount -o degraded $dev2 $mnt
>>>>
>>>> Your case is the full split brain case.
>>>>
>>>> Not possible to solve.
>>>>
>>>> In fact, if you don't do the degraded mount on dev2, btrfs is completely
>>>> fine to resilver the fs without any problem.
>>>
>>> step1: we mark a btrfs RAID1 filesystem that has seen degraded writes as not-clean-RAID1.
>>
>> Then when should it be cleared?
>> After a full scrub, or at some other point?
>
> A 'full scrub' or a 'full balance' would be OK.
> This is not the normal path, so performance is not critical.
>
>>> step2: in that state, we default to reading copy 0 of RAID1,
>>> 	instead of the current pid-based I/O path selection policy
>>>              preferred_mirror = first + (current->pid % num_stripes);
>>
>> That's feasible, but it still needs an on-disk format change.
>
> Yes, we need to save something on the disks,
> but maybe only 1 byte per disk, so maybe no on-disk format change is needed.
>
>> Furthermore, this idea can also be implemented in a more generic way: a
>> write-intent bitmap.
>>
>> In fact, the DM layer uses this to speed up resilvering and to handle
>> split-brain cases.
>>
>> With a write-intent bitmap, every degraded write leaves a record in the
>> bitmap until it has been properly resilvered.
>
> A write-intent bitmap has a performance cost,
> so isn't it a little expensive for RAID1[C34]?

For regular writes with all devices present, there is no update to the
write-intent tree.

Only when mounted degraded do we start updating the write-intent tree.

And from some testing results, the drop in performance for sequential RW
is mostly negligible, which is why DM-raid always goes with a write-intent bitmap.

Since I'm going to (at least try to) implement the write-intent tree, mostly
for RAID56, it may be a good idea to address the partial split-brain
case as well.

Thanks,
Qu
>
> Best Regards
> Wang Yugui (wangyugui@e16-tech.com)
> 2022/06/08
>
>> Thanks,
>> Qu
>>
>>>
>>> Does this idea seem workable?
>>>
>>> Is a degraded RAID1 write almost the same as full split brain?
>>>
>>> Best Regards
>>> Wang Yugui (wangyugui@e16-tech.com)
>>> 2022/06/08
>
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 10:32     ` Qu Wenruo
  2022-06-08 10:58       ` Wang Yugui
  2022-06-08 11:40       ` Austin S. Hemmelgarn
@ 2022-06-08 14:11       ` Andrei Borzenkov
  2022-06-08 20:22         ` Forza
  2 siblings, 1 reply; 11+ messages in thread
From: Andrei Borzenkov @ 2022-06-08 14:11 UTC (permalink / raw)
  To: Qu Wenruo, Wang Yugui, Qu Wenruo; +Cc: Forza, Btrfs BTRFS

On 08.06.2022 13:32, Qu Wenruo wrote:
> 
> 
> On 2022/6/8 18:15, Wang Yugui wrote:
>> Hi, Forza, Qu Wenruo
>>
>> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
>> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
> 
> No no no, that is not meant to address split brain; it is mostly to drop
> caches on the recovery path to maximize the chance of recovery.
> 
> It's not designed to solve the split-brain problem at all; this is just
> one case of such a problem.
> 
> In fact, the full split-brain case (both copies have the same generation,
> but each experienced its own degraded mount) cannot be solved by btrfs
> itself at all.
> 
> Btrfs can only solve the partial split-brain case (one device has a higher
> generation, so btrfs can still determine which copy is the correct one).
> 

Start with both devices having the same generation number N.

Mount device1 separately, do some writes, device has generation N+1.

Mount device2 separately, do some writes, device has generation N+2.

Applying the changes between N+1 and N+2 to device1 is wrong, because the
content at generation N+1 differs between the two devices.

So there is absolutely no difference between "same generation" and
"higher generation".

The only thing btrfs could do is try to detect this and refuse to
integrate the other device. One suggested, rather radical, approach was to
change the UUID on a degraded mount, but this is probably infeasible in real life.

Removing the missing device from the device list in the superblock (or at least
marking it as permanently missing until replaced) is probably another
option.

And the write-intent log discussed elsewhere in this thread could be used
as well: if btrfs detects a write-intent log on a device, it should refuse
to add that device to an existing filesystem.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: What mechanisms protect against split brain?
  2022-06-08 14:11       ` Andrei Borzenkov
@ 2022-06-08 20:22         ` Forza
  0 siblings, 0 replies; 11+ messages in thread
From: Forza @ 2022-06-08 20:22 UTC (permalink / raw)
  To: Andrei Borzenkov, Qu Wenruo, Wang Yugui, Qu Wenruo,
	Nicholas D Steeves
  Cc: Btrfs BTRFS



---- From: Andrei Borzenkov <arvidjaar@gmail.com> -- Sent: 2022-06-08 - 16:11 ----

> On 08.06.2022 13:32, Qu Wenruo wrote:
>> 
>> 
>> On 2022/6/8 18:15, Wang Yugui wrote:
>>> Hi, Forza, Qu Wenruo
>>>
>>> I wrote a script to test RAID1 split brain, based on Qu's work on raid5 (*1).
>>> *1: https://lore.kernel.org/linux-btrfs/53f7bace2ac75d88ace42dd811d48b7912647301.1654672140.git.wqu@suse.com/T/#u
>> 
>> No no no, that is not meant to address split brain; it is mostly to drop
>> caches on the recovery path to maximize the chance of recovery.
>> 
>> It's not designed to solve the split-brain problem at all; this is just
>> one case of such a problem.
>> 
>> In fact, the full split-brain case (both copies have the same generation,
>> but each experienced its own degraded mount) cannot be solved by btrfs
>> itself at all.
>> 
>> Btrfs can only solve the partial split-brain case (one device has a higher
>> generation, so btrfs can still determine which copy is the correct one).
>> 
> 
> Start with both devices having the same generation number N.
> 
> Mount device1 separately, do some writes, device has generation N+1.
> 
> Mount device2 separately, do some writes, device has generation N+2.
> 
> Applying the changes between N+1 and N+2 to device1 is wrong, because the
> content at generation N+1 differs between the two devices.
> 
> So there is absolutely no difference between "same generation" and
> "higher generation".
> 
> The only thing btrfs could do is try to detect this and refuse to
> integrate the other device. One suggested, rather radical, approach was to
> change the UUID on a degraded mount, but this is probably infeasible in real life.
> 
> Removing the missing device from the device list in the superblock (or at least
> marking it as permanently missing until replaced) is probably another
> option.
> 
> And the write-intent log discussed elsewhere in this thread could be used
> as well: if btrfs detects a write-intent log on a device, it should refuse
> to add that device to an existing filesystem.


Thank you all for the feedback on this topic. 

My takeaway from this is that we need to:
* clearly document that Btrfs cannot handle full split brain
* document best practices for recovering from a temporary device loss
* add features that help btrfs handle both types of split brain in a safe way.

One such mechanism could be to write a simple hash to all devices that are present during a degraded mount. If the old device is later re-attached, the hashes will not match and btrfs should reject that device.

Then a special 'btrfs device add --reattach' command could be developed that would do what Qu said: run a full scrub and correct any differences on the reattached device. This avoids writing all the data again, which is faster and saves TBW.
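
For comparison, the rough manual recovery available today rebuilds the whole
device, which is exactly the rewrite cost such a --reattach could avoid. A
sketch only, assuming $dev1 holds the copy we want to keep and $dev2 (devid 2)
is the stale, diverged copy, as in the script earlier in the thread:

  wipefs -a $dev2                       # discard the diverged copy entirely
  mount -o degraded $dev1 $mnt
  btrfs replace start -f 2 $dev2 $mnt   # rebuild devid 2 onto the wiped device
  btrfs replace status $mnt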

Nicholas suggested a similar approach in another thread https://lore.kernel.org/linux-btrfs/87sfogkwbd.fsf@DigitalMercury.freeddns.org/T/#t

Thanks,
Forza


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-06-08 20:22 UTC | newest]

Thread overview: 11+ messages
2022-05-29 11:34 What mechanisms protect against split brain? Forza
2022-06-08  2:44 ` Wang Yugui
2022-06-08 10:15   ` Wang Yugui
2022-06-08 10:32     ` Qu Wenruo
2022-06-08 10:58       ` Wang Yugui
2022-06-08 11:19         ` Qu Wenruo
2022-06-08 11:55           ` Wang Yugui
2022-06-08 11:59             ` Qu Wenruo
2022-06-08 11:40       ` Austin S. Hemmelgarn
2022-06-08 14:11       ` Andrei Borzenkov
2022-06-08 20:22         ` Forza
