* Using btrfs raid5/6
@ 2024-12-04 3:34 Scoopta
2024-12-04 4:29 ` Andrei Borzenkov
2024-12-04 4:40 ` Qu Wenruo
0 siblings, 2 replies; 28+ messages in thread
From: Scoopta @ 2024-12-04 3:34 UTC (permalink / raw)
To: linux-btrfs
I'm looking to deploy btrfs raid5/6 and have read some of the previous
posts here about doing so "successfully." I want to make sure I
understand the limitations correctly. I'm looking to replace an md+ext4
setup. The data on these drives is replaceable, but obviously I would
rather not have to replace it.
1) use space_cache=v2
2) don't use raid5/6 for metadata
3) run scrubs 1 drive at a time
4) don't expect to use the system in degraded mode
5) there are times when raid5 will make corruption permanent instead of
fixing it - does this matter? As I understand it, md+ext4 can't detect or
fix corruption either, so it's not a loss
6) the write hole exists - as I understand it, md has the same problem
anyway
Are there any other ways I could lose my data? Again, the data IS
replaceable; I'm just trying to understand whether there are any major
advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
about downtime during degraded mode. Additionally, the posts I'm looking
at are from 2020; has any of the above changed since then?
Thanks!
* Re: Using btrfs raid5/6
2024-12-04 3:34 Using btrfs raid5/6 Scoopta
@ 2024-12-04 4:29 ` Andrei Borzenkov
2024-12-04 4:49 ` Scoopta
2024-12-04 4:40 ` Qu Wenruo
1 sibling, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-04 4:29 UTC (permalink / raw)
To: Scoopta, linux-btrfs
04.12.2024 06:34, Scoopta wrote:
> I'm looking to deploy btfs raid5/6 and have read some of the previous
> posts here about doing so "successfully." I want to make sure I
> understand the limitations correctly. I'm looking to replace an md+ext4
> setup. The data on these drives is replaceable but obviously ideally I
> don't want to have to replace it.
>
> 1) use space_cache=v2
>
> 2) don't use raid5/6 for metadata
>
> 3) run scrubs 1 drive at a time
>
> 4) don't expect to use the system in degraded mode
>
> 5) there are times where raid5 will make corruption permanent instead of
> fixing it - does this matter? As I understand it md+ext4 can't detect or
> fix corruption either so it's not a loss
>
> 6) the write hole exists - As I understand it md has that same problem
> anyway
>
Linux MD can use either a write cache (journal) or a partial parity log
to protect against the write hole.
https://docs.kernel.org/driver-api/md/index.html
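The partial parity log idea can be sketched in a few lines (a toy model of the concept only, not md's actual on-disk format; all names here are illustrative):

```python
def xor(*blocks):
    # Byte-wise XOR of any number of equal-length strips.
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

# A stripe of four data strips plus parity.
strips = [b'\x0a', b'\x0b', b'\x0c', b'\x0d']
par = xor(*strips)

# A partial write will modify strip 0. Before touching the disk, the
# PPL records the "partial parity": the XOR of the strips NOT modified.
ppl = xor(*(s for i, s in enumerate(strips) if i != 0))

# Crash mid-write: strip 0 holds the new data, parity was never updated.
strips[0] = b'\xaa'

# Recovery: parity is rebuilt from the log plus the strip's current
# on-disk content, so it matches the data again, whatever landed.
par = xor(ppl, strips[0])
assert par == xor(*strips)  # stripe consistent again; no write hole
```

The log is "out of band" exactly as described above: it costs an extra write per partial-stripe update, but removes the window where parity and data disagree.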
> Are there any other ways I could lose my data? Again the data IS
> replaceable, I'm just trying to understand if there are any major
> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
> about downtime during degraded mode. Additionally the posts I'm looking
> at are from 2020, has any of the above changed since then?
>
> Thanks!
>
>
* Re: Using btrfs raid5/6
2024-12-04 3:34 Using btrfs raid5/6 Scoopta
2024-12-04 4:29 ` Andrei Borzenkov
@ 2024-12-04 4:40 ` Qu Wenruo
2024-12-04 4:50 ` Scoopta
` (2 more replies)
1 sibling, 3 replies; 28+ messages in thread
From: Qu Wenruo @ 2024-12-04 4:40 UTC (permalink / raw)
To: Scoopta, linux-btrfs
On 2024/12/4 14:04, Scoopta wrote:
> I'm looking to deploy btfs raid5/6 and have read some of the previous
> posts here about doing so "successfully." I want to make sure I
> understand the limitations correctly. I'm looking to replace an md+ext4
> setup. The data on these drives is replaceable but obviously ideally I
> don't want to have to replace it.
0) Use a kernel newer than 6.5, at the very least.
That version introduced a more comprehensive check for any RAID56 RMW,
so that it always verifies the checksums and rebuilds data when necessary.
This should mostly solve the write hole problem, and we even have test
cases in fstests already verifying the behavior.
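The verified-RMW behavior described here can be modeled in a few lines (a toy Python sketch, not the kernel code; `rmw_write`, `parity`, and the crc32-as-checksum stand-in are all illustrative assumptions):

```python
from functools import reduce
import zlib  # crc32 stands in for btrfs data checksums (illustrative)

def xor(a, b):
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

def parity(strips):
    # RAID5 parity is the XOR of all data strips.
    return reduce(xor, strips)

def rmw_write(strips, par, csums, idx, new_data):
    """Read-modify-write one strip, verifying checksums first.

    Models the 6.5+ behavior: any strip whose checksum fails is
    rebuilt from parity before the new parity is computed, so a
    stale strip cannot poison the updated parity.
    """
    for i, s in enumerate(strips):
        if zlib.crc32(s) != csums[i]:
            others = [x for j, x in enumerate(strips) if j != i]
            strips[i] = xor(parity(others), par)  # rebuild from parity
    # Incremental parity update, now based on verified data.
    new_par = xor(xor(par, strips[idx]), new_data)
    strips[idx] = new_data
    return strips, new_par
```

Without the checksum pass, the RMW would fold the stale strip into the new parity and make the corruption permanent; with it, the bad strip is repaired first.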
>
> 1) use space_cache=v2
>
> 2) don't use raid5/6 for metadata
>
> 3) run scrubs 1 drive at a time
That should also no longer be the case.
Scrubbing all devices at once wastes some IO, but it should not be that bad.
>
> 4) don't expect to use the system in degraded mode
You still can, thanks to the extra verification in 0).
But after the missing device comes back, always run a scrub on that
device, to be extra safe.
>
> 5) there are times where raid5 will make corruption permanent instead of
> fixing it - does this matter? As I understand it md+ext4 can't detect or
> fix corruption either so it's not a loss
With non-RAID56 metadata and data checksums, it should not cause problems.
But in the no-data-checksum / NOCOW cases, it will cause permanent corruption.
>
> 6) the write hole exists - As I understand it md has that same problem
> anyway
The same as 5).
Thanks,
Qu
>
> Are there any other ways I could lose my data? Again the data IS
> replaceable, I'm just trying to understand if there are any major
> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
> about downtime during degraded mode. Additionally the posts I'm looking
> at are from 2020, has any of the above changed since then?
>
> Thanks!
>
>
* Re: Using btrfs raid5/6
2024-12-04 4:29 ` Andrei Borzenkov
@ 2024-12-04 4:49 ` Scoopta
0 siblings, 0 replies; 28+ messages in thread
From: Scoopta @ 2024-12-04 4:49 UTC (permalink / raw)
To: Andrei Borzenkov, linux-btrfs
Huh, for some reason I was under the impression it had a write hole too.
Not sure where I read that.
On 12/3/24 8:29 PM, Andrei Borzenkov wrote:
> 04.12.2024 06:34, Scoopta wrote:
>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>> posts here about doing so "successfully." I want to make sure I
>> understand the limitations correctly. I'm looking to replace an md+ext4
>> setup. The data on these drives is replaceable but obviously ideally I
>> don't want to have to replace it.
>>
>> 1) use space_cache=v2
>>
>> 2) don't use raid5/6 for metadata
>>
>> 3) run scrubs 1 drive at a time
>>
>> 4) don't expect to use the system in degraded mode
>>
>> 5) there are times where raid5 will make corruption permanent instead of
>> fixing it - does this matter? As I understand it md+ext4 can't detect or
>> fix corruption either so it's not a loss
>>
>> 6) the write hole exists - As I understand it md has that same problem
>> anyway
>>
>
> Linux MD can use either write cache or partial parity log to protect
> against write hole.
>
> https://docs.kernel.org/driver-api/md/index.html
>
>> Are there any other ways I could lose my data? Again the data IS
>> replaceable, I'm just trying to understand if there are any major
>> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
>> about downtime during degraded mode. Additionally the posts I'm looking
>> at are from 2020, has any of the above changed since then?
>>
>> Thanks!
>>
>>
>
* Re: Using btrfs raid5/6
2024-12-04 4:40 ` Qu Wenruo
@ 2024-12-04 4:50 ` Scoopta
2024-12-04 19:17 ` Andrei Borzenkov
2024-12-06 2:03 ` Using btrfs raid5/6 Jonah Sabean
2 siblings, 0 replies; 28+ messages in thread
From: Scoopta @ 2024-12-04 4:50 UTC (permalink / raw)
To: Qu Wenruo, linux-btrfs
Oh wow, glad I asked; it sounds like things have improved quite a bit. I'll
make sure to use a Debian backports kernel so I'm newer than 6.5. Thanks
for the information.
On 12/3/24 8:40 PM, Qu Wenruo wrote:
>
>
> On 2024/12/4 14:04, Scoopta wrote:
>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>> posts here about doing so "successfully." I want to make sure I
>> understand the limitations correctly. I'm looking to replace an
>> md+ext4 setup. The data on these drives is replaceable but obviously
>> ideally I don't want to have to replace it.
>
> 0) Use kernel newer than 6.5 at least.
>
> That version introduced a more comprehensive check for any RAID56 RMW,
> so that it will always verify the checksum and rebuild when necessary.
>
> This should mostly solve the write hole problem, and we even have some
> test cases in the fstests already verifying the behavior.
>
>>
>> 1) use space_cache=v2
>>
>> 2) don't use raid5/6 for metadata
>>
>> 3) run scrubs 1 drive at a time
>
> That's should also no longer be the case.
>
> Although it will waste some IO, but should not be that bad.
>
>>
>> 4) don't expect to use the system in degraded mode
>
> You can still, thanks to the extra verification in 0).
>
> But after the missing device come back, always do a scrub on that
> device, to be extra safe.
>
>>
>> 5) there are times where raid5 will make corruption permanent instead
>> of fixing it - does this matter? As I understand it md+ext4 can't
>> detect or fix corruption either so it's not a loss
>
> With non-RAID56 metadata, and data checksum, it should not cause problem.
>
> But for no-data checksum/ no COW cases, it will cause permanent
> corruption.
>
>>
>> 6) the write hole exists - As I understand it md has that same
>> problem anyway
>
> The same as 5).
>
> Thanks,
> Qu
>
>>
>> Are there any other ways I could lose my data? Again the data IS
>> replaceable, I'm just trying to understand if there are any major
>> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't
>> care about downtime during degraded mode. Additionally the posts I'm
>> looking at are from 2020, has any of the above changed since then?
>>
>> Thanks!
>>
>>
>
* Re: Using btrfs raid5/6
2024-12-04 4:40 ` Qu Wenruo
2024-12-04 4:50 ` Scoopta
@ 2024-12-04 19:17 ` Andrei Borzenkov
2024-12-04 22:34 ` Qu Wenruo
2024-12-06 2:03 ` Using btrfs raid5/6 Jonah Sabean
2 siblings, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-04 19:17 UTC (permalink / raw)
To: Qu Wenruo, Scoopta, linux-btrfs
04.12.2024 07:40, Qu Wenruo wrote:
>
>
> On 2024/12/4 14:04, Scoopta wrote:
>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>> posts here about doing so "successfully." I want to make sure I
>> understand the limitations correctly. I'm looking to replace an md+ext4
>> setup. The data on these drives is replaceable but obviously ideally I
>> don't want to have to replace it.
>
> 0) Use kernel newer than 6.5 at least.
>
> That version introduced a more comprehensive check for any RAID56 RMW,
> so that it will always verify the checksum and rebuild when necessary.
>
> This should mostly solve the write hole problem, and we even have some
> test cases in the fstests already verifying the behavior.
>
The write hole happens when data can *NOT* be rebuilt because data is
inconsistent between different strips of the same stripe. How does btrfs
solve this problem?
It can probably protect against data corruption (by verifying checksums),
but how can it recover the correct content?
* Re: Using btrfs raid5/6
2024-12-04 19:17 ` Andrei Borzenkov
@ 2024-12-04 22:34 ` Qu Wenruo
2024-12-05 16:53 ` Andrei Borzenkov
0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-04 22:34 UTC (permalink / raw)
To: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 2024/12/5 05:47, Andrei Borzenkov wrote:
> 04.12.2024 07:40, Qu Wenruo wrote:
>>
>>
>> On 2024/12/4 14:04, Scoopta wrote:
>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>> posts here about doing so "successfully." I want to make sure I
>>> understand the limitations correctly. I'm looking to replace an md+ext4
>>> setup. The data on these drives is replaceable but obviously ideally I
>>> don't want to have to replace it.
>>
>> 0) Use kernel newer than 6.5 at least.
>>
>> That version introduced a more comprehensive check for any RAID56 RMW,
>> so that it will always verify the checksum and rebuild when necessary.
>>
>> This should mostly solve the write hole problem, and we even have some
>> test cases in the fstests already verifying the behavior.
>>
>
> Write hole happens when data can *NOT* be rebuilt because data is
> inconsistent between different strips of the same stripe. How btrfs
> solves this problem?
An example, please.
>
> It probably can protect against data corruption (by verifying checksum),
> but how can it recover the correct content?
>
* Re: Using btrfs raid5/6
2024-12-04 22:34 ` Qu Wenruo
@ 2024-12-05 16:53 ` Andrei Borzenkov
2024-12-05 20:27 ` Qu Wenruo
0 siblings, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-05 16:53 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Scoopta, linux-btrfs
05.12.2024 01:34, Qu Wenruo wrote:
>
>
> On 2024/12/5 05:47, Andrei Borzenkov wrote:
>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/12/4 14:04, Scoopta wrote:
>>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>>> posts here about doing so "successfully." I want to make sure I
>>>> understand the limitations correctly. I'm looking to replace an md+ext4
>>>> setup. The data on these drives is replaceable but obviously ideally I
>>>> don't want to have to replace it.
>>>
>>> 0) Use kernel newer than 6.5 at least.
>>>
>>> That version introduced a more comprehensive check for any RAID56 RMW,
>>> so that it will always verify the checksum and rebuild when necessary.
>>>
>>> This should mostly solve the write hole problem, and we even have some
>>> test cases in the fstests already verifying the behavior.
>>>
>>
>> Write hole happens when data can *NOT* be rebuilt because data is
>> inconsistent between different strips of the same stripe. How btrfs
>> solves this problem?
>
> An example please.
You start with stripe
A1,B1,C1,D1,P1
You overwrite A1 with A2.
Before you can write P2, the system crashes.
After reboot, D goes missing, so you now have
A2,B1,C1,miss,P1
You cannot reconstruct "miss" because P1 does not match A2. You can
detect that it is corrupted using the checksum, but not infer the correct
data.
MD solves this by either computing extra parity or by buffering the full
stripe before writing it out. In both cases it is something out of band.
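The failed reconstruction in this scenario can be checked with a few lines of XOR arithmetic (a toy model with one-byte strips, purely illustrative):

```python
def xor(*blocks):
    # Byte-wise XOR of any number of equal-length strips.
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

# A consistent 4+1 stripe: P1 covers A1..D1.
A1, B1, C1, D1 = b'\x0a', b'\x0b', b'\x0c', b'\x0d'
P1 = xor(A1, B1, C1, D1)

# A2 overwrites A1 on disk, but the crash happens before P2 is written.
A2 = b'\xaa'

# After reboot the device holding D1 is missing; try to rebuild it:
D_rebuilt = xor(A2, B1, C1, P1)
assert D_rebuilt != D1  # stale parity: D1 is unrecoverable (the write hole)
```

With a consistent stripe, the same XOR recovers D1 exactly; the stale P1 is the whole problem.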
>
>>
>> It probably can protect against data corruption (by verifying checksum),
>> but how can it recover the correct content?
>>
>
* Re: Using btrfs raid5/6
2024-12-05 16:53 ` Andrei Borzenkov
@ 2024-12-05 20:27 ` Qu Wenruo
2024-12-06 3:59 ` Andrei Borzenkov
0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-05 20:27 UTC (permalink / raw)
To: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 2024/12/6 03:23, Andrei Borzenkov wrote:
> 05.12.2024 01:34, Qu Wenruo wrote:
>>
>>
>> On 2024/12/5 05:47, Andrei Borzenkov wrote:
>>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/12/4 14:04, Scoopta wrote:
>>>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>>>> posts here about doing so "successfully." I want to make sure I
>>>>> understand the limitations correctly. I'm looking to replace an
>>>>> md+ext4
>>>>> setup. The data on these drives is replaceable but obviously ideally I
>>>>> don't want to have to replace it.
>>>>
>>>> 0) Use kernel newer than 6.5 at least.
>>>>
>>>> That version introduced a more comprehensive check for any RAID56 RMW,
>>>> so that it will always verify the checksum and rebuild when necessary.
>>>>
>>>> This should mostly solve the write hole problem, and we even have some
>>>> test cases in the fstests already verifying the behavior.
>>>>
>>>
>>> Write hole happens when data can *NOT* be rebuilt because data is
>>> inconsistent between different strips of the same stripe. How btrfs
>>> solves this problem?
>>
>> An example please.
>
> You start with stripe
>
> A1,B1,C1,D1,P1
>
> You overwrite A1 with A2
This already falls into the NOCOW case.
There is no guarantee of data consistency there.
In the COW case, new data is always written into an unused slot, and
after a crash we will only see the old data.
Thanks,
Qu
>
> Before you can write P2, system crashes
>
> After reboot D goes missing, so you now have
>
> A2,B1,C1,miss,P1
>
> You cannot reconstruct "miss" because P1 does not match A2. You can
> detect that it is corrupted using checksum, but not infer the correct data.
>
> MD solves it by either computing the extra parity or by buffering full
> stripe before writing it out. In both cases it is something out of band.
>
>>
>>>
>>> It probably can protect against data corruption (by verifying checksum),
>>> but how can it recover the correct content?
>>>
>>
>
* Re: Using btrfs raid5/6
2024-12-04 4:40 ` Qu Wenruo
2024-12-04 4:50 ` Scoopta
2024-12-04 19:17 ` Andrei Borzenkov
@ 2024-12-06 2:03 ` Jonah Sabean
2024-12-07 20:48 ` Qu Wenruo
2 siblings, 1 reply; 28+ messages in thread
From: Jonah Sabean @ 2024-12-06 2:03 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Scoopta, linux-btrfs
On Wed, Dec 4, 2024 at 12:40 AM Qu Wenruo <wqu@suse.com> wrote:
>
>
>
> On 2024/12/4 14:04, Scoopta wrote:
> > I'm looking to deploy btfs raid5/6 and have read some of the previous
> > posts here about doing so "successfully." I want to make sure I
> > understand the limitations correctly. I'm looking to replace an md+ext4
> > setup. The data on these drives is replaceable but obviously ideally I
> > don't want to have to replace it.
>
> 0) Use kernel newer than 6.5 at least.
>
> That version introduced a more comprehensive check for any RAID56 RMW,
> so that it will always verify the checksum and rebuild when necessary.
>
> This should mostly solve the write hole problem, and we even have some
> test cases in the fstests already verifying the behavior.
>
> >
> > 1) use space_cache=v2
> >
> > 2) don't use raid5/6 for metadata
> >
> > 3) run scrubs 1 drive at a time
>
> That's should also no longer be the case.
>
> Although it will waste some IO, but should not be that bad.
When was this fixed? The last time I tested it, a scrub would have taken a
month or more to complete on an 8-disk raid5 array of mostly full 8 TB
disks at the rate it was going. It was the only thing that kept me from
using it.
>
> >
> > 4) don't expect to use the system in degraded mode
>
> You can still, thanks to the extra verification in 0).
>
> But after the missing device come back, always do a scrub on that
> device, to be extra safe.
>
> >
> > 5) there are times where raid5 will make corruption permanent instead of
> > fixing it - does this matter? As I understand it md+ext4 can't detect or
> > fix corruption either so it's not a loss
>
> With non-RAID56 metadata, and data checksum, it should not cause problem.
>
> But for no-data checksum/ no COW cases, it will cause permanent corruption.
>
> >
> > 6) the write hole exists - As I understand it md has that same problem
> > anyway
>
> The same as 5).
>
> Thanks,
> Qu
>
> >
> > Are there any other ways I could lose my data? Again the data IS
> > replaceable, I'm just trying to understand if there are any major
> > advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
> > about downtime during degraded mode. Additionally the posts I'm looking
> > at are from 2020, has any of the above changed since then?
> >
> > Thanks!
> >
> >
>
>
* Re: Using btrfs raid5/6
2024-12-05 20:27 ` Qu Wenruo
@ 2024-12-06 3:59 ` Andrei Borzenkov
2024-12-06 4:16 ` Qu Wenruo
0 siblings, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-06 3:59 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Scoopta, linux-btrfs
05.12.2024 23:27, Qu Wenruo wrote:
>
>
> On 2024/12/6 03:23, Andrei Borzenkov wrote:
>> 05.12.2024 01:34, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/12/5 05:47, Andrei Borzenkov wrote:
>>>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2024/12/4 14:04, Scoopta wrote:
>>>>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>>>>> posts here about doing so "successfully." I want to make sure I
>>>>>> understand the limitations correctly. I'm looking to replace an
>>>>>> md+ext4
>>>>>> setup. The data on these drives is replaceable but obviously ideally I
>>>>>> don't want to have to replace it.
>>>>>
>>>>> 0) Use kernel newer than 6.5 at least.
>>>>>
>>>>> That version introduced a more comprehensive check for any RAID56 RMW,
>>>>> so that it will always verify the checksum and rebuild when necessary.
>>>>>
>>>>> This should mostly solve the write hole problem, and we even have some
>>>>> test cases in the fstests already verifying the behavior.
>>>>>
>>>>
>>>> Write hole happens when data can *NOT* be rebuilt because data is
>>>> inconsistent between different strips of the same stripe. How btrfs
>>>> solves this problem?
>>>
>>> An example please.
>>
>> You start with stripe
>>
>> A1,B1,C1,D1,P1
>>
>> You overwrite A1 with A2
>
> This already falls into NOCOW case.
>
> No guarantee for data consistency.
>
> For COW cases, the new data are always written into unused slot, and
> after crash we will only see the old data.
>
Do you mean that btrfs only does full stripe writes now? As I recall from
previous discussions, btrfs uses fixed-size stripes and can fill
unused strips. Like:
First write
A1,B1,...,...,P1
Second write
A1,B1,C2,D2,P2
I.e. A1 and B1 do not change, but C2 and D2 are added.
Now, if the parity is not updated before the crash and D gets lost, we have
A1,B1,C2,miss,P1
with exactly the same problem.
It has been discussed multiple times that to fix this, btrfs either has to
use a variable stripe size (basically, always do full stripe writes) or
keep some form of journal for pending updates.
* Re: Using btrfs raid5/6
2024-12-06 3:59 ` Andrei Borzenkov
@ 2024-12-06 4:16 ` Qu Wenruo
2024-12-06 18:10 ` Goffredo Baroncelli
2024-12-07 7:37 ` Andrei Borzenkov
0 siblings, 2 replies; 28+ messages in thread
From: Qu Wenruo @ 2024-12-06 4:16 UTC (permalink / raw)
To: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 2024/12/6 14:29, Andrei Borzenkov wrote:
> 05.12.2024 23:27, Qu Wenruo wrote:
>>
>>
>> On 2024/12/6 03:23, Andrei Borzenkov wrote:
>>> 05.12.2024 01:34, Qu Wenruo wrote:
>>>>
>>>>
>>>> On 2024/12/5 05:47, Andrei Borzenkov wrote:
>>>>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/12/4 14:04, Scoopta wrote:
>>>>>>> I'm looking to deploy btfs raid5/6 and have read some of the
>>>>>>> previous
>>>>>>> posts here about doing so "successfully." I want to make sure I
>>>>>>> understand the limitations correctly. I'm looking to replace an
>>>>>>> md+ext4
>>>>>>> setup. The data on these drives is replaceable but obviously
>>>>>>> ideally I
>>>>>>> don't want to have to replace it.
>>>>>>
>>>>>> 0) Use kernel newer than 6.5 at least.
>>>>>>
>>>>>> That version introduced a more comprehensive check for any RAID56
>>>>>> RMW,
>>>>>> so that it will always verify the checksum and rebuild when
>>>>>> necessary.
>>>>>>
>>>>>> This should mostly solve the write hole problem, and we even have
>>>>>> some
>>>>>> test cases in the fstests already verifying the behavior.
>>>>>>
>>>>>
>>>>> Write hole happens when data can *NOT* be rebuilt because data is
>>>>> inconsistent between different strips of the same stripe. How btrfs
>>>>> solves this problem?
>>>>
>>>> An example please.
>>>
>>> You start with stripe
>>>
>>> A1,B1,C1,D1,P1
>>>
>>> You overwrite A1 with A2
>>
>> This already falls into NOCOW case.
>>
>> No guarantee for data consistency.
>>
>> For COW cases, the new data are always written into unused slot, and
>> after crash we will only see the old data.
>>
>
> Do you mean that btrfs only does full stripe write now? As I recall from
> the previous discussions, btrfs is using fixed size stripes and it can
> fill unused strips. Like
>
> First write
>
> A1,B1,...,...,P1
>
> Second write
>
> A1,B1,C2,D2,P2
>
> I.e. A1 and B1 do not change, but C2 and D2 are added.
>
> Now, if parity is not updated before crash and D gets lost we have
After the crash, C2/D2 are not referenced by anything.
So we won't need to read C2/D2/P2, because it's just unallocated space.
So it is still the wrong example.
Remember, we should be discussing the RMW case; meanwhile your case
doesn't even involve RMW, just a full stripe write.
>
> A1,B1,C2,miss,P1
>
> with exactly the same problem.
>
> It has been discussed multiple times, that to fix it either btrfs has to
> use variable stripe size (basically, always do full stripe write) or
> some form of journal for pending updates.
A correct example would be something like this:
Existing data D1, unused D2, parity P(D1+D2).
Write D2 and update P(D1+D2), then power loss.
Case 0): Power loss after all data and metadata reached disk
Nothing to worry about: the metadata is already updated to see both D1
and D2, everything is fine.
Case 1): Power loss before the metadata reached disk
This means we will only see D1, the old data, and have no idea there is
any D2.
Case 1.0): Both D2 and P(D1+D2) reached disk
Nothing to worry about, again.
Case 1.1): D2 reached disk, P(D1+D2) didn't
We still do not need to worry about anything (if all devices are still
there), because D1 is still correct.
But if the device holding D1 is missing, we cannot recover D1, because
D2 and P(D1+D2) are out of sync.
However, I would argue this is not a simple corruption/power loss; it is
two problems (power loss + missing device), which should count as 2
missing/corrupted sectors in the same vertical stripe.
At least btrfs won't do any writeback to the same vertical stripe at all.
Case 1.2): P(D1+D2) reached disk, D2 didn't
The same as case 1.1).
Case 1.3): Neither D2 nor P(D1+D2) reached disk
The same as case 1.0); even a missing D1 is fine to recover.
So if you believe power loss + missing device counts as a single missing
device, and that it doesn't break the tolerance of RAID5, then you can
count this as a "write hole".
But to me, this is not a single error but two errors (write failure +
missing device), beyond the tolerance of RAID5.
Thanks,
Qu
* Re: Using btrfs raid5/6
2024-12-06 4:16 ` Qu Wenruo
@ 2024-12-06 18:10 ` Goffredo Baroncelli
2024-12-07 7:37 ` Andrei Borzenkov
1 sibling, 0 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2024-12-06 18:10 UTC (permalink / raw)
To: Qu Wenruo, Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 06/12/2024 05.16, Qu Wenruo wrote:
> [...]
> as case 1.0, even missing D1 is fine to recover.
>
>
> So if you believe powerloss + missing device counts as a single device missing, and it doesn't break the tolerance of RAID5, then you can count this as a "write-hole".
>
> But to me, this is not a single error, but two error (write failure + missing device), beyond the tolerance of RAID5.
A "power loss" and a "device loss" can be considered two different failures only if they are independent.
Only in that case is the likelihood of the combination of the two events the product of the likelihoods of each event.
However, if they share a common root cause, they cannot be considered independent, and the likelihood of the event "power loss" + "device loss" is the likelihood of the root cause.
The point is that a "device loss" may be a consequence of a power failure. In my experience (as a hobbyist), a disk disappearing very often happens near a reboot. So the likelihood of a write hole is between the likelihood of a power loss (single failure) and that of a power loss + disk loss (two failures).
However, this is not the real problem. The real problem is that if a scrub is not performed after a power loss, the data and the parity may mismatch. Or rather: the likelihood of a data mismatch is low (because the data is likely unreferenced), but the likelihood of a parity mismatch is a lot higher, because a parity mismatch affects the stripe as a whole that contains the updated data, not only the data that was updated. And this mismatch is not corrected until a scrub.
So the "write hole" happens even if the "power loss" and the "disk loss" events do not happen at the same time. It is enough that no scrub (or read) is performed in the meantime. This is to say that treating the likelihood of the combined "power loss" + "disk failure" event as the product of the two likelihoods is a bit optimistic. If a regular scrub is not performed, the likelihood of the combination is close to the lower of the two likelihoods.
This is not specific to BTRFS; however, MD solved this issue by logging stripe updates.
A more general solution would be to implement logging like MD does. I think that you, Qu, prototyped something like this in the past. BTRFS has the advantage that it can skip this logging when a full stripe is written (if the full stripe is not written, it is not even referenced). MD can't do that.
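The resynchronization role of a scrub described here can be sketched as follows (a simplified model, not the actual btrfs scrub code, which works per device against the csum tree; `scrub_stripe` and the crc32 stand-in are illustrative assumptions):

```python
from functools import reduce
import zlib  # crc32 stands in for btrfs data checksums (illustrative)

def xor(a, b):
    # Byte-wise XOR of two equal-length strips.
    return bytes(x ^ y for x, y in zip(a, b))

def scrub_stripe(strips, par, csums):
    """Scrub one stripe: verify data checksums, then resync parity.

    csums entries may be None for unreferenced strips (nothing to
    verify). Returns (correct_parity, parity_was_stale).
    """
    for s, c in zip(strips, csums):
        if c is not None and zlib.crc32(s) != c:
            raise IOError("data corruption, not just stale parity")
    good = reduce(xor, strips)
    return good, good != par  # caller rewrites parity if stale
```

Until such a pass runs, a stale-parity stripe looks perfectly healthy to every ordinary read, which is exactly the silent loss of redundancy discussed above.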
>
> Thanks,
> Qu
>
BR
Goffredo
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Using btrfs raid5/6
2024-12-06 4:16 ` Qu Wenruo
2024-12-06 18:10 ` Goffredo Baroncelli
@ 2024-12-07 7:37 ` Andrei Borzenkov
2024-12-07 20:26 ` Qu Wenruo
1 sibling, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-07 7:37 UTC (permalink / raw)
To: Qu Wenruo, Qu Wenruo, Scoopta, linux-btrfs
06.12.2024 07:16, Qu Wenruo wrote:
>
>
>> On 2024/12/6 14:29, Andrei Borzenkov wrote:
>> 05.12.2024 23:27, Qu Wenruo wrote:
>>>
>>>
>>>> On 2024/12/6 03:23, Andrei Borzenkov wrote:
>>>> 05.12.2024 01:34, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2024/12/5 05:47, Andrei Borzenkov wrote:
>>>>>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2024/12/4 14:04, Scoopta wrote:
>>>>>>>> I'm looking to deploy btfs raid5/6 and have read some of the
>>>>>>>> previous
>>>>>>>> posts here about doing so "successfully." I want to make sure I
>>>>>>>> understand the limitations correctly. I'm looking to replace an
>>>>>>>> md+ext4
>>>>>>>> setup. The data on these drives is replaceable but obviously
>>>>>>>> ideally I
>>>>>>>> don't want to have to replace it.
>>>>>>>
>>>>>>> 0) Use kernel newer than 6.5 at least.
>>>>>>>
>>>>>>> That version introduced a more comprehensive check for any RAID56
>>>>>>> RMW,
>>>>>>> so that it will always verify the checksum and rebuild when
>>>>>>> necessary.
>>>>>>>
>>>>>>> This should mostly solve the write hole problem, and we even have
>>>>>>> some
>>>>>>> test cases in the fstests already verifying the behavior.
>>>>>>>
>>>>>>
>>>>>> Write hole happens when data can *NOT* be rebuilt because data is
>>>>>> inconsistent between different strips of the same stripe. How btrfs
>>>>>> solves this problem?
>>>>>
>>>>> An example please.
>>>>
>>>> You start with stripe
>>>>
>>>> A1,B1,C1,D1,P1
>>>>
>>>> You overwrite A1 with A2
>>>
>>> This already falls into NOCOW case.
>>>
>>> No guarantee for data consistency.
>>>
>>> For COW cases, the new data are always written into unused slot, and
>>> after crash we will only see the old data.
>>>
>>
>> Do you mean that btrfs only does full stripe write now? As I recall from
>> the previous discussions, btrfs is using fixed size stripes and it can
>> fill unused strips. Like
>>
>> First write
>>
>> A1,B1,...,...,P1
>>
>> Second write
>>
>> A1,B1,C2,D2,P2
>>
>> I.e. A1 and B1 do not change, but C2 and D2 are added.
>>
>> Now, if parity is not updated before crash and D gets lost we have
>
> After crash, C2/D2 is not referenced by anyone.
> So we won't need to read C2/D2/P2 because it's just unallocated space.
>
You do need to read C2/D2 to build parity and to reconstruct any missing
block. The parity no longer matches C2/D2. Whether C2/D2 are actually
referenced by upper layers is irrelevant at the RAID5/6 level.
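The half-filled stripe example shows the sting here: stale parity endangers the old, referenced strips too, not only the unreferenced new ones (toy XOR model, empty slots treated as zeros, all names illustrative):

```python
def xor(*blocks):
    # Byte-wise XOR of any number of equal-length strips.
    out = bytes(len(blocks[0]))
    for b in blocks:
        out = bytes(x ^ y for x, y in zip(out, b))
    return out

ZERO = b'\x00'  # an unused slot reads as zeros in this model

# First write: half-filled stripe, parity P1 is consistent.
A1, B1 = b'\x11', b'\x22'
P1 = xor(A1, B1, ZERO, ZERO)

# Second write fills C2/D2 but crashes before the new parity reaches disk.
C2, D2 = b'\x33', b'\x44'

# Later the device holding B1 (old, *referenced* data) is lost.
# The RAID layer must read C2/D2 to rebuild B1, referenced or not:
B_rebuilt = xor(A1, C2, D2, P1)
assert B_rebuilt != B1  # stale parity silently broke redundancy for B1
```

Had the new parity P2 = A1^B1^C2^D2 reached disk, the same XOR would recover B1 exactly.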
> So still wrong example.
>
It is the right example; you just prefer to ignore the problem.
> Remember we should discuss on the RMW case, meanwhile your case doesn't
> even involve RMW, just a full stripe write.
>
>>
>> A1,B1,C2,miss,P1
>>
>> with exactly the same problem.
>>
>> It has been discussed multiple times, that to fix it either btrfs has to
>> use variable stripe size (basically, always do full stripe write) or
>> some form of journal for pending updates.
>
> If taking a correct example, it would be some like this:
>
> Existing D1 data, unused D2 , P(D1+D2).
> Write D2 and update P(D1+D2), then power loss.
>
> Case 0): Power loss after all data and metadata reached disk
> Nothing to bother, metadata already updated to see both D1 and D2,
> everything is fine.
>
> Case 1): Power loss before metadata reached disk
>
> This means we will only see D1 as the old data, have no idea there is
> any D2.
>
> Case 1.0): both D2 and P(D1+D2) reached disk
> Nothing to bother, again.
>
> Case 1.1): D2 reached disk, P(D1+D2) doesn't
> We still do not need to bother anything (if all devices are still
> there), because D1 is still correct.
>
> But if the device of D1 is missing, we can not recover D1, because D2
> and P(D1+D2) is out of sync.
>
> However I can argue this is not a simple corruption/power loss, it's two
> problems (power loss + missing device), this should count as 2
> missing/corrupted sectors in the same vertical stripe.
>
This is the very definition of the write hole. You are entitled to have
your opinion, but at least do not confuse others by claiming that btrfs
protects against write hole.
It need not be the whole device - it is enough to have a single
unreadable sector, which happens more often (at least with HDDs).
And as already mentioned, the two faults need not happen at the same (or
even a close) time. The data corruption may surface days or months after
the lost write. Sure, you can still wave it off as a double fault - but
whereas a failed disk (or even an unreadable sector) at least gets the
administrator notified in the logs, this is absolutely silent: the
administrator is not even aware that the stripe is no longer redundant,
and so cannot do anything to fix it.
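The failure sequence described above can be condensed into a toy model (hypothetical Python; single-parity XOR only, nothing btrfs-specific):

```python
# Toy RAID5-style stripe: two data strips plus one XOR parity strip.
# A crash loses the parity update; the stripe is silently non-redundant,
# and a later device failure destroys *old*, committed data.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"OLD1"          # committed in an earlier transaction
d2 = b"----"          # unused slot in the same stripe
p = xor(d1, d2)       # parity consistent with the stripe

d2 = b"NEW2"          # new write reaches disk...
                      # ...but the matching parity update is lost in a crash

# Months later the device holding d1 becomes unreadable. Reconstruction
# from the stale parity returns garbage, with no warning in between that
# redundancy had already been lost:
rebuilt_d1 = xor(p, d2)
print(rebuilt_d1 == b"OLD1")   # False
```

With md raid5 the garbage would be returned as-is; the btrfs-specific twist is only in what happens after reconstruction, not in the parity math above.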
> As least btrfs won't do any writeback to the same vertical stripe at all.
>
> Case 1.2): P(D1+D2) reached disk, D2 doesn't
> The same as case 1.1).
>
> Case 1.3): Neither D2 nor P(D1+D2) reached disk
>
> It's the same as case 1.0, even missing D1 is fine to recover.
>
>
> So if you believe powerloss + missing device counts as a single device
> missing, and it doesn't break the tolerance of RAID5, then you can count
> this as a "write-hole".
>
> But to me, this is not a single error, but two error (write failure +
> missing device), beyond the tolerance of RAID5.
>
> Thanks,
> Qu
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-07 7:37 ` Andrei Borzenkov
@ 2024-12-07 20:26 ` Qu Wenruo
2024-12-10 2:34 ` Zygo Blaxell
0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-07 20:26 UTC (permalink / raw)
To: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 2024/12/7 18:07, Andrei Borzenkov wrote:
> 06.12.2024 07:16, Qu Wenruo wrote:
>>
>>
>> 在 2024/12/6 14:29, Andrei Borzenkov 写道:
>>> 05.12.2024 23:27, Qu Wenruo wrote:
>>>>
>>>>
>>>> 在 2024/12/6 03:23, Andrei Borzenkov 写道:
>>>>> 05.12.2024 01:34, Qu Wenruo wrote:
>>>>>>
>>>>>>
>>>>>> 在 2024/12/5 05:47, Andrei Borzenkov 写道:
>>>>>>> 04.12.2024 07:40, Qu Wenruo wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> 在 2024/12/4 14:04, Scoopta 写道:
>>>>>>>>> I'm looking to deploy btfs raid5/6 and have read some of the
>>>>>>>>> previous
>>>>>>>>> posts here about doing so "successfully." I want to make sure I
>>>>>>>>> understand the limitations correctly. I'm looking to replace an
>>>>>>>>> md+ext4
>>>>>>>>> setup. The data on these drives is replaceable but obviously
>>>>>>>>> ideally I
>>>>>>>>> don't want to have to replace it.
>>>>>>>>
>>>>>>>> 0) Use kernel newer than 6.5 at least.
>>>>>>>>
>>>>>>>> That version introduced a more comprehensive check for any RAID56
>>>>>>>> RMW,
>>>>>>>> so that it will always verify the checksum and rebuild when
>>>>>>>> necessary.
>>>>>>>>
>>>>>>>> This should mostly solve the write hole problem, and we even have
>>>>>>>> some
>>>>>>>> test cases in the fstests already verifying the behavior.
>>>>>>>>
>>>>>>>
>>>>>>> Write hole happens when data can *NOT* be rebuilt because data is
>>>>>>> inconsistent between different strips of the same stripe. How btrfs
>>>>>>> solves this problem?
>>>>>>
>>>>>> An example please.
>>>>>
>>>>> You start with stripe
>>>>>
>>>>> A1,B1,C1,D1,P1
>>>>>
>>>>> You overwrite A1 with A2
>>>>
>>>> This already falls into NOCOW case.
>>>>
>>>> No guarantee for data consistency.
>>>>
>>>> For COW cases, the new data are always written into unused slot, and
>>>> after crash we will only see the old data.
>>>>
>>>
>>> Do you mean that btrfs only does full stripe write now? As I recall from
>>> the previous discussions, btrfs is using fixed size stripes and it can
>>> fill unused strips. Like
>>>
>>> First write
>>>
>>> A1,B1,...,...,P1
>>>
>>> Second write
>>>
>>> A1,B1,C2,D2,P2
>>>
>>> I.e. A1 and B1 do not change, but C2 and D2 are added.
>>>
>>> Now, if parity is not updated before crash and D gets lost we have
>>
>> After crash, C2/D2 is not referenced by anyone.
>> So we won't need to read C2/D2/P2 because it's just unallocated space.
>>
>
> You do need to read C2/D2 to build parity and to reconstruct any missing
> block. Parity no more matches C2/D2. Whether C2/D2 are actually
> referenced by upper layers is irrelevant for RAID5/6.
Nope; in that case, whatever garbage is in C2/D2, btrfs simply does not
care.
Just try it yourself.
You can even mkfs without discarding the device, so that btrfs has
garbage in all the unwritten ranges.
Does btrfs then care about that unallocated space or its parity?
No.
Btrfs only cares about full stripes that have at least one referenced
block.
A vertical stripe with no referenced sector is treated as nocsum, i.e.
as long as it can be read, it's fine. If it cannot be read from the disk
(missing device etc.), the rebuilt data is used instead.
Either way, for unused sectors it makes no difference.
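A sketch of the read-path behavior Qu describes (simplified Python; `crc32` stands in for btrfs's per-sector checksums, and `read_sector` is a hypothetical helper, not actual btrfs code):

```python
import zlib

# Only sectors referenced by metadata are checksum-verified on read;
# a vertical stripe with no referenced sector may hold any garbage.
def read_sector(data: bytes, csum: int, referenced: bool) -> bytes:
    if not referenced:
        return data                 # nocsum: any readable data is fine
    if zlib.crc32(data) != csum:
        raise IOError("csum mismatch, rebuild from parity instead")
    return data

good = b"D1 payload"
print(read_sector(good, zlib.crc32(good), referenced=True))    # verified
print(read_sector(b"mkfs leftovers", 0, referenced=False))     # accepted
```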
>
>> So still wrong example.
>>
>
> It is the right example, you just prefer to ignore this problem.
Sure sure, whatever you believe.
Or why not just read the code on how the current RAID56 works?
>
>> Remember we should discuss on the RMW case, meanwhile your case doesn't
>> even involve RMW, just a full stripe write.
>>
>>>
>>> A1,B1,C2,miss,P1
>>>
>>> with exactly the same problem.
>>>
>>> It has been discussed multiple times, that to fix it either btrfs has to
>>> use variable stripe size (basically, always do full stripe write) or
>>> some form of journal for pending updates.
>>
>> If taking a correct example, it would be some like this:
>>
>> Existing D1 data, unused D2 , P(D1+D2).
>> Write D2 and update P(D1+D2), then power loss.
>>
>> Case 0): Power loss after all data and metadata reached disk
>> Nothing to bother, metadata already updated to see both D1 and D2,
>> everything is fine.
>>
>> Case 1): Power loss before metadata reached disk
>>
>> This means we will only see D1 as the old data, have no idea there is
>> any D2.
>>
>> Case 1.0): both D2 and P(D1+D2) reached disk
>> Nothing to bother, again.
>>
>> Case 1.1): D2 reached disk, P(D1+D2) doesn't
>> We still do not need to bother anything (if all devices are still
>> there), because D1 is still correct.
>>
>> But if the device of D1 is missing, we can not recover D1, because D2
>> and P(D1+D2) is out of sync.
>>
>> However I can argue this is not a simple corruption/power loss, it's two
>> problems (power loss + missing device), this should count as 2
>> missing/corrupted sectors in the same vertical stripe.
>>
>
> This is the very definition of the write hole. You are entitled to have
> your opinion, but at least do not confuse others by claiming that btrfs
> protects against write hole.
>
> It need not be the whole device - it is enough to have a single
> unreadable sector which happens more often (at least, with HDD).
>
> And as already mentioned it need not happen at the same (or close) time.
> The data corruption may happen days and months after lost write. Sure,
> you can still wave it off as a double fault - but if in case of failed
> disk (or even unreadable sector) administrator at least gets notified in
> logs, here it is absolutely silent without administrator even being
> aware that this stripe is no more redundant and so administrator cannot
> do anything to fix it.
>
>> As least btrfs won't do any writeback to the same vertical stripe at all.
>>
>> Case 1.2): P(D1+D2) reached disk, D2 doesn't
>> The same as case 1.1).
>>
>> Case 1.3): Neither D2 nor P(D1+D2) reached disk
>>
>> It's the same as case 1.0, even missing D1 is fine to recover.
>>
>>
>> So if you believe powerloss + missing device counts as a single device
>> missing, and it doesn't break the tolerance of RAID5, then you can count
>> this as a "write-hole".
>>
>> But to me, this is not a single error, but two error (write failure +
>> missing device), beyond the tolerance of RAID5.
>>
>> Thanks,
>> Qu
>
* Re: Using btrfs raid5/6
2024-12-06 2:03 ` Using btrfs raid5/6 Jonah Sabean
@ 2024-12-07 20:48 ` Qu Wenruo
2024-12-08 16:31 ` Jonah Sabean
0 siblings, 1 reply; 28+ messages in thread
From: Qu Wenruo @ 2024-12-07 20:48 UTC (permalink / raw)
To: Jonah Sabean, Qu Wenruo; +Cc: Scoopta, linux-btrfs
On 2024/12/6 12:33, Jonah Sabean wrote:
> On Wed, Dec 4, 2024 at 12:40 AM Qu Wenruo <wqu@suse.com> wrote:
>>
>>
>>
>> 在 2024/12/4 14:04, Scoopta 写道:
>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>> posts here about doing so "successfully." I want to make sure I
>>> understand the limitations correctly. I'm looking to replace an md+ext4
>>> setup. The data on these drives is replaceable but obviously ideally I
>>> don't want to have to replace it.
>>
>> 0) Use kernel newer than 6.5 at least.
>>
>> That version introduced a more comprehensive check for any RAID56 RMW,
>> so that it will always verify the checksum and rebuild when necessary.
>>
>> This should mostly solve the write hole problem, and we even have some
>> test cases in the fstests already verifying the behavior.
>>
>>>
>>> 1) use space_cache=v2
>>>
>>> 2) don't use raid5/6 for metadata
>>>
>>> 3) run scrubs 1 drive at a time
>>
>> That's should also no longer be the case.
>>
>> Although it will waste some IO, but should not be that bad.
>
> When was this fixed? Last I tested it took a month or more to complete
> a scrub on an 8 disk raid5 system with 8tb disks mostly full at the
> rate it was going. It was the only thing that kept me from using it.
IIRC it's 6.6 for the scrub speed fix.
Although it still doesn't fully address the extra reads (twice the
data), nor the random reads that parity scrub triggers on the other
devices.
A root fix will need a completely new way to do the scrub (my previous
scrub_fs attempt), but that interface will not handle other profiles
well (it cannot skip large amounts of unused space).
So if your last attempt was on some recent kernel version or the latest
LTS, then I guess the random reads are still killing the performance.
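The read amplification being discussed works out roughly like this (a back-of-envelope model with hypothetical numbers, not measured btrfs behavior):

```python
# During a full scrub, each data block is read once for its own checksum
# and once more for each parity strip whose verification needs it.
def scrub_bytes_read(data_bytes: int, n_parity: int) -> int:
    return data_bytes * (1 + n_parity)

data = 56 * 10**12   # e.g. the data portion of a mostly full 8x8TB raid5
print(scrub_bytes_read(data, 1) // 10**12)   # 112 (TB) for raid5
print(scrub_bytes_read(data, 2) // 10**12)   # 168 (TB) if it were raid6
```

The 1 + n_parity factor is where the "2x for raid5, 3x for raid6" figures later in the thread come from.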
Thanks,
Qu
>
>>
>>>
>>> 4) don't expect to use the system in degraded mode
>>
>> You can still, thanks to the extra verification in 0).
>>
>> But after the missing device come back, always do a scrub on that
>> device, to be extra safe.
>>
>>>
>>> 5) there are times where raid5 will make corruption permanent instead of
>>> fixing it - does this matter? As I understand it md+ext4 can't detect or
>>> fix corruption either so it's not a loss
>>
>> With non-RAID56 metadata, and data checksum, it should not cause problem.
>>
>> But for no-data checksum/ no COW cases, it will cause permanent corruption.
>>
>>>
>>> 6) the write hole exists - As I understand it md has that same problem
>>> anyway
>>
>> The same as 5).
>>
>> Thanks,
>> Qu
>>
>>>
>>> Are there any other ways I could lose my data? Again the data IS
>>> replaceable, I'm just trying to understand if there are any major
>>> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
>>> about downtime during degraded mode. Additionally the posts I'm looking
>>> at are from 2020, has any of the above changed since then?
>>>
>>> Thanks!
>>>
>>>
>>
>>
>
* Re: Using btrfs raid5/6
2024-12-07 20:48 ` Qu Wenruo
@ 2024-12-08 16:31 ` Jonah Sabean
2024-12-08 20:07 ` Qu Wenruo
0 siblings, 1 reply; 28+ messages in thread
From: Jonah Sabean @ 2024-12-08 16:31 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Qu Wenruo, Scoopta, linux-btrfs
On Sat, Dec 7, 2024 at 4:48 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> 在 2024/12/6 12:33, Jonah Sabean 写道:
> > On Wed, Dec 4, 2024 at 12:40 AM Qu Wenruo <wqu@suse.com> wrote:
> >>
> >>
> >>
> >> 在 2024/12/4 14:04, Scoopta 写道:
> >>> I'm looking to deploy btfs raid5/6 and have read some of the previous
> >>> posts here about doing so "successfully." I want to make sure I
> >>> understand the limitations correctly. I'm looking to replace an md+ext4
> >>> setup. The data on these drives is replaceable but obviously ideally I
> >>> don't want to have to replace it.
> >>
> >> 0) Use kernel newer than 6.5 at least.
> >>
> >> That version introduced a more comprehensive check for any RAID56 RMW,
> >> so that it will always verify the checksum and rebuild when necessary.
> >>
> >> This should mostly solve the write hole problem, and we even have some
> >> test cases in the fstests already verifying the behavior.
> >>
> >>>
> >>> 1) use space_cache=v2
> >>>
> >>> 2) don't use raid5/6 for metadata
> >>>
> >>> 3) run scrubs 1 drive at a time
> >>
> >> That's should also no longer be the case.
> >>
> >> Although it will waste some IO, but should not be that bad.
> >
> > When was this fixed? Last I tested it took a month or more to complete
> > a scrub on an 8 disk raid5 system with 8tb disks mostly full at the
> > rate it was going. It was the only thing that kept me from using it.
>
> IIRC it's 6.6 for the scrub speed fix.
>
> Although it still doesn't fully address the extra read (twice of the
> data) nor the random read triggered by parity scrub from other devices.
>
> A root fix will need a completely new way to do the scrub (my previous
> scrub_fs attempt), but that interface will not handle other profiles
> well (can not skip large amount of unused space).
>
> So if your last attempt is using some recent kernel version or the
> latest LTS, then I guess the random read is still breaking the performance.
Thanks for the update! Will your scrub_fs be rebased for raid5/6 in
the near future? It would be nice to be rid of the 2x reads. I suspect
raid6 still results in 3x reads, then?
>
> Thanks,
> Qu
>
> >
> >>
> >>>
> >>> 4) don't expect to use the system in degraded mode
> >>
> >> You can still, thanks to the extra verification in 0).
> >>
> >> But after the missing device come back, always do a scrub on that
> >> device, to be extra safe.
> >>
> >>>
> >>> 5) there are times where raid5 will make corruption permanent instead of
> >>> fixing it - does this matter? As I understand it md+ext4 can't detect or
> >>> fix corruption either so it's not a loss
> >>
> >> With non-RAID56 metadata, and data checksum, it should not cause problem.
> >>
> >> But for no-data checksum/ no COW cases, it will cause permanent corruption.
> >>
> >>>
> >>> 6) the write hole exists - As I understand it md has that same problem
> >>> anyway
> >>
> >> The same as 5).
> >>
> >> Thanks,
> >> Qu
> >>
> >>>
> >>> Are there any other ways I could lose my data? Again the data IS
> >>> replaceable, I'm just trying to understand if there are any major
> >>> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
> >>> about downtime during degraded mode. Additionally the posts I'm looking
> >>> at are from 2020, has any of the above changed since then?
> >>>
> >>> Thanks!
> >>>
> >>>
> >>
> >>
> >
>
* Re: Using btrfs raid5/6
2024-12-08 16:31 ` Jonah Sabean
@ 2024-12-08 20:07 ` Qu Wenruo
0 siblings, 0 replies; 28+ messages in thread
From: Qu Wenruo @ 2024-12-08 20:07 UTC (permalink / raw)
To: Jonah Sabean; +Cc: Qu Wenruo, Scoopta, linux-btrfs
On 2024/12/9 03:01, Jonah Sabean wrote:
> On Sat, Dec 7, 2024 at 4:48 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> 在 2024/12/6 12:33, Jonah Sabean 写道:
>>> On Wed, Dec 4, 2024 at 12:40 AM Qu Wenruo <wqu@suse.com> wrote:
>>>>
>>>>
>>>>
>>>> 在 2024/12/4 14:04, Scoopta 写道:
>>>>> I'm looking to deploy btfs raid5/6 and have read some of the previous
>>>>> posts here about doing so "successfully." I want to make sure I
>>>>> understand the limitations correctly. I'm looking to replace an md+ext4
>>>>> setup. The data on these drives is replaceable but obviously ideally I
>>>>> don't want to have to replace it.
>>>>
>>>> 0) Use kernel newer than 6.5 at least.
>>>>
>>>> That version introduced a more comprehensive check for any RAID56 RMW,
>>>> so that it will always verify the checksum and rebuild when necessary.
>>>>
>>>> This should mostly solve the write hole problem, and we even have some
>>>> test cases in the fstests already verifying the behavior.
>>>>
>>>>>
>>>>> 1) use space_cache=v2
>>>>>
>>>>> 2) don't use raid5/6 for metadata
>>>>>
>>>>> 3) run scrubs 1 drive at a time
>>>>
>>>> That's should also no longer be the case.
>>>>
>>>> Although it will waste some IO, but should not be that bad.
>>>
>>> When was this fixed? Last I tested it took a month or more to complete
>>> a scrub on an 8 disk raid5 system with 8tb disks mostly full at the
>>> rate it was going. It was the only thing that kept me from using it.
>>
>> IIRC it's 6.6 for the scrub speed fix.
>>
>> Although it still doesn't fully address the extra read (twice of the
>> data) nor the random read triggered by parity scrub from other devices.
>>
>> A root fix will need a completely new way to do the scrub (my previous
>> scrub_fs attempt), but that interface will not handle other profiles
>> well (can not skip large amount of unused space).
>>
>> So if your last attempt is using some recent kernel version or the
>> latest LTS, then I guess the random read is still breaking the performance.
>
> Thanks for the update! Will your scrub_fs be rebased for raid5/6 in
> the near future? Would be nice to be rid of the 2x reads. I suspect
> then raid6 results in 3x reads still then?
Yes, for RAID6 it's a 3x read.
Unfortunately that feature has not been under active development for a
while.
But I'll take some time to revive it in the near future.
Thanks,
Qu
>
>
>>
>> Thanks,
>> Qu
>>
>>>
>>>>
>>>>>
>>>>> 4) don't expect to use the system in degraded mode
>>>>
>>>> You can still, thanks to the extra verification in 0).
>>>>
>>>> But after the missing device come back, always do a scrub on that
>>>> device, to be extra safe.
>>>>
>>>>>
>>>>> 5) there are times where raid5 will make corruption permanent instead of
>>>>> fixing it - does this matter? As I understand it md+ext4 can't detect or
>>>>> fix corruption either so it's not a loss
>>>>
>>>> With non-RAID56 metadata, and data checksum, it should not cause problem.
>>>>
>>>> But for no-data checksum/ no COW cases, it will cause permanent corruption.
>>>>
>>>>>
>>>>> 6) the write hole exists - As I understand it md has that same problem
>>>>> anyway
>>>>
>>>> The same as 5).
>>>>
>>>> Thanks,
>>>> Qu
>>>>
>>>>>
>>>>> Are there any other ways I could lose my data? Again the data IS
>>>>> replaceable, I'm just trying to understand if there are any major
>>>>> advantages to using md+btrfs or md+ext4 over btrfs raid5 if I don't care
>>>>> about downtime during degraded mode. Additionally the posts I'm looking
>>>>> at are from 2020, has any of the above changed since then?
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
* Re: Using btrfs raid5/6
2024-12-07 20:26 ` Qu Wenruo
@ 2024-12-10 2:34 ` Zygo Blaxell
2024-12-10 19:36 ` Goffredo Baroncelli
2024-12-21 18:32 ` Proposal for RAID-PN (was Re: Using btrfs raid5/6) Forza
0 siblings, 2 replies; 28+ messages in thread
From: Zygo Blaxell @ 2024-12-10 2:34 UTC (permalink / raw)
To: Qu Wenruo; +Cc: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
> 在 2024/12/7 18:07, Andrei Borzenkov 写道:
> > 06.12.2024 07:16, Qu Wenruo wrote:
> > > 在 2024/12/6 14:29, Andrei Borzenkov 写道:
> > > > 05.12.2024 23:27, Qu Wenruo wrote:
> > > > > 在 2024/12/6 03:23, Andrei Borzenkov 写道:
> > > > > > 05.12.2024 01:34, Qu Wenruo wrote:
> > > > > > > 在 2024/12/5 05:47, Andrei Borzenkov 写道:
> > > > > > > > 04.12.2024 07:40, Qu Wenruo wrote:
> > > > > > > > >
> > > > > > > > > 在 2024/12/4 14:04, Scoopta 写道:
> > > > > > > > > > I'm looking to deploy btfs raid5/6 and have read some of the
> > > > > > > > > > previous
> > > > > > > > > > posts here about doing so "successfully." I want to make sure I
> > > > > > > > > > understand the limitations correctly. I'm looking to replace an
> > > > > > > > > > md+ext4
> > > > > > > > > > setup. The data on these drives is replaceable but obviously
> > > > > > > > > > ideally I
> > > > > > > > > > don't want to have to replace it.
> > > > > > > > >
> > > > > > > > > 0) Use kernel newer than 6.5 at least.
> > > > > > > > >
> > > > > > > > > That version introduced a more comprehensive check for any RAID56
> > > > > > > > > RMW,
> > > > > > > > > so that it will always verify the checksum and rebuild when
> > > > > > > > > necessary.
> > > > > > > > >
> > > > > > > > > This should mostly solve the write hole problem, and we even have
> > > > > > > > > some
> > > > > > > > > test cases in the fstests already verifying the behavior.
> > > > > > > >
> > > > > > > > Write hole happens when data can *NOT* be rebuilt because data is
> > > > > > > > inconsistent between different strips of the same stripe. How btrfs
> > > > > > > > solves this problem?
> > > > > > >
> > > > > > > An example please.
> > > > > >
> > > > > > You start with stripe
> > > > > >
> > > > > > A1,B1,C1,D1,P1
> > > > > >
> > > > > > You overwrite A1 with A2
> > > > >
> > > > > This already falls into NOCOW case.
> > > > >
> > > > > No guarantee for data consistency.
> > > > >
> > > > > For COW cases, the new data are always written into unused slot, and
> > > > > after crash we will only see the old data.
> > > >
> > > > Do you mean that btrfs only does full stripe write now? As I recall from
> > > > the previous discussions, btrfs is using fixed size stripes and it can
> > > > fill unused strips. Like
> > > >
> > > > First write
> > > >
> > > > A1,B1,...,...,P1
> > > >
> > > > Second write
> > > >
> > > > A1,B1,C2,D2,P2
> > > >
> > > > I.e. A1 and B1 do not change, but C2 and D2 are added.
> > > >
> > > > Now, if parity is not updated before crash and D gets lost we have
> > >
> > > After crash, C2/D2 is not referenced by anyone.
> > > So we won't need to read C2/D2/P2 because it's just unallocated space.
> >
> > You do need to read C2/D2 to build parity and to reconstruct any missing
> > block. Parity no more matches C2/D2. Whether C2/D2 are actually
> > referenced by upper layers is irrelevant for RAID5/6.
>
> Nope, in that case whatever garbage is in C2/D2, btrfs just do not care.
>
> Just try it yourself.
>
> You can even mkfs without discarding the device, then btrfs has garbage
> for unwritten ranges.
>
> Then do btrfs care those unallocated space nor their parity?
> No.
>
> Btrfs only cares full stripe that has at least one block being referred.
>
> For vertical stripe that has no sector involved, btrfs treats it as
> nocsum, aka, as long as it can read it's fine. If it can not be read
> from the disk (missing dev etc), just use the rebuild data.
>
> Either way for unused sector it makes no difference.
The assumption Qu made here is that btrfs never writes data blocks to the
same stripe from two or more different transactions, without freeing and
allocating the entire stripe in between. If that assumption were true,
there would be no write hole in the current implementation.
The reality is that btrfs does exactly the opposite, as in Andrei's second
example. This causes potential data loss of the first transaction's
data if the second transaction's write is aborted by a crash. After the
first transaction, the parity and uninitialized data blocks can be used
to recover any data block in the first transaction. When the second
transaction is aborted with some but not all of the blocks updated, the
parity will no longer be usable to reconstruct the data blocks from _any_
part of the stripe, including the first transaction's committed data.
Technically, in this event, the second transaction's data is _also_
lost, but as Qu mentioned above, that data isn't part of a committed
transaction, so the damaged data won't appear in the filesystem after a
crash, corrupted or otherwise.
The potential data loss does not become actual data loss until the stripe
goes into degraded mode, where the out-of-sync parity block is needed to
recover a missing or corrupt data block. If the stripe was already in
degraded mode during the crash, data loss is immediate.
If the drives are all healthy, the parity block can be recomputed
by a scrub, as long as the scrub is completed between a crash and a
drive failure.
If drives are missing or corrupt and parity hasn't been properly updated,
then data block reconstruction cannot occur. btrfs will reject the
reconstructed block when its csum doesn't match, resulting in an
uncorrectable error.
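The recovery outcomes above can be condensed into a toy model (hypothetical Python; XOR single parity, with `crc32` standing in for btrfs csums):

```python
import zlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

d1 = b"txn1"               # committed by the first transaction
d2 = b"free"               # uninitialized block in the same stripe
p = xor(d1, d2)            # parity after the first transaction
csum_d1 = zlib.crc32(d1)

d2 = b"txn2"               # second transaction writes d2, then crashes
                           # before the matching parity update

# All drives healthy: a scrub recomputes parity and closes the hole.
p = xor(d1, d2)
assert xor(p, d2) == d1    # redundancy restored

# Degraded before the scrub (parity still stale): the rebuilt block
# fails its csum, so the error is uncorrectable rather than silent.
p_stale = xor(d1, b"free")
rebuilt = xor(p_stale, d2)
print(zlib.crc32(rebuilt) == csum_d1)   # False -> uncorrectable error
```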
There are several options to fix the write hole:
1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
within a partially filled raid56 stripe, unless the stripe was empty
at the beginning of the current transaction (i.e. multiple RMW writes
are OK, as long as they all disappear in the same crash event). This
ensures a stripe is never written from two separate btrfs transactions,
eliminating the write hole. This option requires an allocator change,
and some rework/optimization of how ordered extents are written out.
It also requires more balances--space within partially filled stripes
isn't usable until every data block within the stripe is freed, and
balances do exactly that.
2. Add a stripe journal. Requires on-disk format change to add the
journal, and recovery code at startup to replay it. It's the industry
standard way to fix the write hole in a traditional raid5 implementation,
so it's the first idea everyone proposes. It's also quite slow if you
don't have dedicated purpose-built hardware for the journal. It's the
only option for closing the write hole on nodatacow files.
3. Add a generic remapping layer for all IO blocks to avoid requiring
RMW cycles. This is the raid-stripe-tree feature, a brute-force approach
that makes RAID profiles possible on ZNS drives. ZNS drives have similar
but much more strict write-ordering constraints than traditional raid56,
so if the raid stripe tree can do raid5 on ZNS, it should be able to
handle CMR easily ("efficiently" is a separate question).
4. Replace the btrfs raid5 profile with something else, and deprecate
the raid5 profile. I'd recommend not considering that option until
after someone delivers a complete, write-hole-free replacement profile,
ready for merging. The existing raid5 is not _that_ hard to fix, we
already have 3 well-understood options, and one of them doesn't require
an on-disk format change.
Option 1 is probably the best one: it doesn't require on-disk format
changes, only changes to the way kernels manage future writes. Ideally,
the implementation includes an optimization to collect small extent writes
and merge them into full-stripe writes, which will make those _much_
faster on raid56. The current implementation does multiple unnecessary
RMW cycles when writing multiple separate data extents to the same
stripe, even when the extents are allocated within a single transaction
and collectively the extents fill the entire stripe.
Option 1 won't fix nodatacow files, but that's only a problem if you
use nodatacow files.
I suspect options 2 and 3 have so much overhead that they are far
slower than option 1, even counting the extra balances option 1 requires.
With option 1, the extra overhead is in a big batch you can run overnight,
while options 2 and 3 impose continuous overhead on writes, and for
option 3, on reads as well.
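Option 1's allocator rule can be sketched as follows (a rough sketch under the stated assumptions; the `Stripe` class and `txn` bookkeeping are hypothetical, not btrfs code):

```python
class Stripe:
    """Option 1 rule: a stripe may receive new extents only if it is
    empty, or if it has only ever been written by the current
    (uncommitted) transaction."""
    def __init__(self, slots: int):
        self.free = slots
        self.born_txn = None          # transaction that first wrote here

    def can_allocate(self, txn: int) -> bool:
        if self.free == 0:
            return False
        return self.born_txn is None or self.born_txn == txn

    def allocate(self, txn: int) -> None:
        assert self.can_allocate(txn)
        if self.born_txn is None:
            self.born_txn = txn
        self.free -= 1

s = Stripe(slots=4)
s.allocate(txn=1)
s.allocate(txn=1)                 # multiple RMWs within one txn: fine,
                                  # they all disappear in the same crash
print(s.can_allocate(txn=2))      # False: would mix two transactions
```

Note the cost mentioned above: once txn 1 commits, the two free slots in `s` stay unusable until a balance empties the whole stripe.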
> > > So still wrong example.
> > >
> >
> > It is the right example, you just prefer to ignore this problem.
>
> Sure sure, whatever you believe.
>
> Or why not just read the code on how the current RAID56 works?
The above is a summary of the state of raid56 when I last read the code
in depth (from circa v6.6), combined with direct experience from running
a small fleet of btrfs raid5 arrays and observing how they behave since
2016, and review of the raid-stripe-tree design docs.
> > > Remember we should discuss on the RMW case, meanwhile your case doesn't
> > > even involve RMW, just a full stripe write.
> > >
> > > >
> > > > A1,B1,C2,miss,P1
> > > >
> > > > with exactly the same problem.
> > > >
> > > > It has been discussed multiple times, that to fix it either btrfs has to
> > > > use variable stripe size (basically, always do full stripe write) or
> > > > some form of journal for pending updates.
> > >
> > > If taking a correct example, it would be some like this:
> > >
> > > Existing D1 data, unused D2 , P(D1+D2).
> > > Write D2 and update P(D1+D2), then power loss.
> > >
> > > Case 0): Power loss after all data and metadata reached disk
> > > Nothing to bother, metadata already updated to see both D1 and D2,
> > > everything is fine.
> > >
> > > Case 1): Power loss before metadata reached disk
> > >
> > > This means we will only see D1 as the old data, have no idea there is
> > > any D2.
> > >
> > > Case 1.0): both D2 and P(D1+D2) reached disk
> > > Nothing to bother, again.
> > >
> > > Case 1.1): D2 reached disk, P(D1+D2) doesn't
> > > We still do not need to bother anything (if all devices are still
> > > there), because D1 is still correct.
> > >
> > > But if the device of D1 is missing, we can not recover D1, because D2
> > > and P(D1+D2) is out of sync.
> > >
> > > However I can argue this is not a simple corruption/power loss, it's two
> > > problems (power loss + missing device), this should count as 2
> > > missing/corrupted sectors in the same vertical stripe.
A raid56 array must still tolerate power failures while it is degraded.
This is table stakes for a modern parity raid implementation.
The raid56 write hole occurs when it is possible for an active stripe
to enter an unrecoverable state. This is an implementation bug, not a
device failure.
Leaving an inactive stripe in a corrupted state after a crash is OK.
Never modifying any active stripe, so they are never corrupted, is OK.
btrfs corrupts active stripes, which is not OK.
Hopefully this is clear.
> > This is the very definition of the write hole. You are entitled to have
> > your opinion, but at least do not confuse others by claiming that btrfs
> > protects against write hole.
> >
> > It need not be the whole device - it is enough to have a single
> > unreadable sector which happens more often (at least, with HDD).
> >
> > And as already mentioned it need not happen at the same (or close) time.
> > The data corruption may happen days and months after lost write. Sure,
> > you can still wave it off as a double fault - but if in case of failed
> > disk (or even unreadable sector) administrator at least gets notified in
> > logs, here it is absolutely silent without administrator even being
> > aware that this stripe is no more redundant and so administrator cannot
> > do anything to fix it.
> >
> > > As least btrfs won't do any writeback to the same vertical stripe at all.
> > >
> > > Case 1.2): P(D1+D2) reached disk, D2 doesn't
> > > The same as case 1.1).
> > >
> > > Case 1.3): Neither D2 nor P(D1+D2) reached disk
> > >
> > > It's the same as case 1.0, even missing D1 is fine to recover.
> > >
> > >
> > > So if you believe powerloss + missing device counts as a single device
> > > missing, and it doesn't break the tolerance of RAID5, then you can count
> > > this as a "write-hole".
> > >
> > > But to me, this is not a single error, but two error (write failure +
> > > missing device), beyond the tolerance of RAID5.
> > >
> > > Thanks,
> > > Qu
> >
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-10 2:34 ` Zygo Blaxell
@ 2024-12-10 19:36 ` Goffredo Baroncelli
2024-12-11 1:47 ` Jonah Sabean
2024-12-11 7:26 ` Zygo Blaxell
2024-12-21 18:32 ` Proposal for RAID-PN (was Re: Using btrfs raid5/6) Forza
1 sibling, 2 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2024-12-10 19:36 UTC (permalink / raw)
To: Zygo Blaxell, Qu Wenruo; +Cc: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 10/12/2024 03.34, Zygo Blaxell wrote:
> On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
Hi Zygo,
thanks for this excellent analysis
[...]
> There's several options to fix the write hole:
>
> 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
> within a partially filled raid56 stripe, unless the stripe was empty
> at the beginning of the current transaction (i.e. multiple RMW writes
> are OK, as long as they all disappear in the same crash event). This
> ensures a stripe is never written from two separate btrfs transactions,
> eliminating the write hole. This option requires an allocator change,
> and some rework/optimization of how ordered extents are written out.
> It also requires more balances--space within partially filled stripes
> isn't usable until every data block within the stripe is freed, and
> balances do exactly that.
My impression is that this solution would degenerate into a lot of wasted space
in the case of small/frequent/sync writes.
> 2. Add a stripe journal. Requires on-disk format change to add the
> journal, and recovery code at startup to replay it. It's the industry
> standard way to fix the write hole in a traditional raid5 implementation,
> so it's the first idea everyone proposes. It's also quite slow if you
> don't have dedicated purpose-built hardware for the journal. It's the
> only option for closing the write hole on nodatacow files.
I am not sure about the "quite slow" part. Only the "small" writes should go into the
journal. The "bigger" writes should go directly to the "fully empty" stripes (where no
RMW cycle is needed).
So a typical transaction would be composed of:
- update the "partially empty" stripes with a RMW cycle, using the data that was stored
in the journal in the *previous* transaction
- update the journal with the new "small" writes
- pass the "big" writes directly through to the "fully empty" stripes
- commit the transaction (w/ journal)
The key point is that the allocator should know whether a stripe is fully empty or
partially empty. A classic block-based raid doesn't have this information. BTRFS does.
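The split between journalled small writes and pass-through big writes described above can be sketched with a toy planner (all names are invented for illustration, block counts stand in for real extents, and the leftover tail of a big write is journalled as a simplification):

```python
STRIPE_DATA = 4  # toy: 4 data blocks per full stripe

def plan_transaction(writes, journal, stripes):
    """Toy planner: route small writes to the journal, big writes to
    fresh fully empty stripes. 'writes' is a list of block counts."""
    for nblocks in writes:
        if nblocks >= STRIPE_DATA:
            # full-stripe writes: no RMW, hence no journal needed
            full, rest = divmod(nblocks, STRIPE_DATA)
            stripes.extend(["full"] * full)
            if rest:
                journal.append(rest)   # journal the sub-stripe tail
        else:
            journal.append(nblocks)    # small write: journal it
    return journal, stripes

journal, stripes = plan_transaction([1, 9, 2, 4], [], [])
# 9 -> 2 full stripes + 1 journalled block; 4 -> 1 full stripe
assert stripes == ["full", "full", "full"]
assert journal == [1, 1, 2]
```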
BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-10 19:36 ` Goffredo Baroncelli
@ 2024-12-11 1:47 ` Jonah Sabean
2024-12-11 7:26 ` Zygo Blaxell
1 sibling, 0 replies; 28+ messages in thread
From: Jonah Sabean @ 2024-12-11 1:47 UTC (permalink / raw)
To: kreijack
Cc: Zygo Blaxell, Qu Wenruo, Andrei Borzenkov, Qu Wenruo, Scoopta,
linux-btrfs
On Tue, Dec 10, 2024 at 3:36 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 10/12/2024 03.34, Zygo Blaxell wrote:
> > On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
>
> Hi Zygo,
>
> thank for this excellent analisys
>
> [...]
>
> > There's several options to fix the write hole:
> >
> > 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
> > within a partially filled raid56 stripe, unless the stripe was empty
> > at the beginning of the current transaction (i.e. multiple RMW writes
> > are OK, as long as they all disappear in the same crash event). This
> > ensures a stripe is never written from two separate btrfs transactions,
> > eliminating the write hole. This option requires an allocator change,
> > and some rework/optimization of how ordered extents are written out.
> > It also requires more balances--space within partially filled stripes
> > isn't usable until every data block within the stripe is freed, and
> > balances do exactly that.
>
> My impression is that this solution would degenerate in a lot of waste space, in case
> of small/frequent/sync writes.
In the case of SSDs, I'd also like to avoid the need to run balance as
much as possible, so I don't think this is a good solution: it just adds
more writes that contribute to wear.
>
> > 2. Add a stripe journal. Requires on-disk format change to add the
> > journal, and recovery code at startup to replay it. It's the industry
> > standard way to fix the write hole in a traditional raid5 implementation,
> > so it's the first idea everyone proposes. It's also quite slow if you
> > don't have dedicated purpose-built hardware for the journal. It's the
> > only option for closing the write hole on nodatacow files.
>
> I am not sure about the "quite slow" part. In the journal only the "small" write should go. The
> "bigger" write should go directly in the "fully empty" stripe (where it is not needed a RMW
> cycle).
>
> So a typical transaction would be composed by:
>
> - update the "partial empty" stripes with a RMW cycle with the data that were stored in the journal
> in the *previous* transaction
> - updated the journal with the new "small" write
> - pass-through the "big write" directly to the "fully empty" stripes
> - commit the transaction (w/journal)
>
> The key point is that the allocator should know if a stripe is fully empty or is partially empty. A classic
> block-based raid doesn't have this information. BTRFS has this information.
>
> BR
>
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-10 19:36 ` Goffredo Baroncelli
2024-12-11 1:47 ` Jonah Sabean
@ 2024-12-11 7:26 ` Zygo Blaxell
2024-12-11 19:39 ` Goffredo Baroncelli
1 sibling, 1 reply; 28+ messages in thread
From: Zygo Blaxell @ 2024-12-11 7:26 UTC (permalink / raw)
To: kreijack; +Cc: Qu Wenruo, Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On Tue, Dec 10, 2024 at 08:36:05PM +0100, Goffredo Baroncelli wrote:
> On 10/12/2024 03.34, Zygo Blaxell wrote:
> > On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
>
> Hi Zygo,
>
> thank for this excellent analisys
>
> [...]
>
> > There's several options to fix the write hole:
> >
> > 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
> > within a partially filled raid56 stripe, unless the stripe was empty
> > at the beginning of the current transaction (i.e. multiple RMW writes
> > are OK, as long as they all disappear in the same crash event). This
> > ensures a stripe is never written from two separate btrfs transactions,
> > eliminating the write hole. This option requires an allocator change,
> > and some rework/optimization of how ordered extents are written out.
> > It also requires more balances--space within partially filled stripes
> > isn't usable until every data block within the stripe is freed, and
> > balances do exactly that.
>
> My impression is that this solution would degenerate in a lot of
> waste space, in case of small/frequent/sync writes.
"A lot" is relative.
If we first coalesce small writes, e.g. put a hook in delalloc or ordered
extent handling that defers small writes until there are enough to fill a
stripe, and only get to partial stripe writes during a commit or a fsync,
then there are far fewer of these than there are now. This would be a big
performance gain for btrfs raid5, without even considering the gains
in robustness.
That would leave nodatacow and small fsyncs as the only remaining users
of the RMW mechanism.
In the "fsync 4K" case, we would use an entire stripe width (N devices *
4K), not an entire stripe (N devices * 64K). So for a 9-disk array that
means we use 32K of space to store 4K of data; however, that's smaller
than the space and IO we're already using for the log tree, and much
smaller than the amount of space we use to update csum, extent, and
free-space trees.
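The arithmetic behind that 9-disk example, spelled out (a sketch assuming raid5 with one parity device, so 8 of the 9 devices carry data; the function name is invented):

```python
# A short fsync consumes one 4K block per data device (a stripe width)
# instead of leaving the blocks free for a full 64K-per-device stripe.

def fsync_4k_overhead(ndisks, block=4096, strip_len=65536):
    data_devs = ndisks - 1                 # raid5: one device holds parity
    stripe_width = data_devs * block       # space consumed per 4K fsync
    full_stripe = data_devs * strip_len    # data capacity of a full stripe
    return stripe_width, full_stripe

width, full = fsync_4k_overhead(9)
assert width == 32 * 1024                  # 32K of space to store 4K of data
assert full == 512 * 1024
```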
In the real world this may not be a problem. On my RAID5 arrays, all
extents smaller than a RAID stripe combined represent only 1-3%
of the total space. Part of this is self-selection: if I have a workload
that hammers the filesystem with 4K fsyncs, I'm not going to run it on
top of raid5 in the stack because the random write performance will suck.
I'll move that thing to a raid1 filesystem, and use raid5 for bulk data
storage. On my raid1 filesystems, small extents are 10-15% of the space.
There are probably half a dozen ways to fix this problem, if it is
a problem. Other filesystems degenerate to raid1 in this case, so they
can lose a lot of space this way. The existing btrfs allocator doesn't
completely fill block groups even without raid5, and requires balance to
get the space back (unless you're willing to burn CPU all day running
the allocator's search algorithm). Nobody is currently running OLTP
workloads on btrfs raid5 because of the risk they'll explode from the
write hole.
Off the top of my head.
We only ever need one partially filled stripe, so we can keep it in
memory and add blocks to it as we get more data to write. For the first
short fsync in a transaction, write a new partially filled stripe to a
new location (this is a full write, no RMW, but some of the data blocks
are empty so they'll be zero-filled) and save the data's location in
the log tree (except for the full write, this is what btrfs does now).
For the next short fsync, write a stripe which contains the first
and second fsync's data, and save the new location of both blocks in
the log tree (since there are only a few of these, they can also be
kept in kernel memory).
For the third short fsync, write a stripe which contains first, second,
and third fsync's data, and save the new location of all 3 blocks in the
log tree. During transaction commit, write out only the last stripe's
location in the metadata (the first two stripes remain free space).
During log replay after a crash, replay all the add/delete operations up
to the last completed fsync. There's never a RMW write here, so nothing
ever gets lost in a write hole. There's never an overwrite of data in
any form, which is better than most journalling writeback schemes where
overwrites are mandatory.
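A toy simulation of the scheme above may make the bookkeeping clearer (all names invented for illustration; a dict stands in for the disk, and the "log" records where each fsynced block currently lives):

```python
# Each short fsync rewrites the growing partial stripe, in full, at a
# brand-new location. No RMW ever happens; superseded stripe locations
# simply become free space at commit.

def short_fsync(disk, log, pending, next_loc, new_block):
    pending.append(new_block)
    disk[next_loc] = tuple(pending)      # full-stripe write, never RMW
    for i, blk in enumerate(pending):
        log[blk] = (next_loc, i)         # blocks relocate on every fsync
    return next_loc + 1

disk, log, pending = {}, {}, []
loc = 0
for blk in ["a", "b", "c"]:              # three short fsyncs in a row
    loc = short_fsync(disk, log, pending, loc, blk)

# Three stripes were written, but only the last is referenced; the
# commit records stripe 2, and stripes 0 and 1 return to free space.
assert set(disk) == {0, 1, 2}
assert all(log[b] == (2, i) for i, b in enumerate("abc"))
```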
I think we can also hack up balance so that it only relocates extents
that begin or end in partially filled/partially empty raid56 stripes,
with some additional heuristics like "don't move a 60M extent to get 4K
free space." That would be much faster than moving around all data that
already occupies full stripes.
> > 2. Add a stripe journal. Requires on-disk format change to add the
> > journal, and recovery code at startup to replay it. It's the industry
> > standard way to fix the write hole in a traditional raid5 implementation,
> > so it's the first idea everyone proposes. It's also quite slow if you
> > don't have dedicated purpose-built hardware for the journal. It's the
> > only option for closing the write hole on nodatacow files.
>
> I am not sure about the "quite slow" part.
"Journal" means writing the same data to two different locations.
That's always slower than writing once to a single location, after all
other optimizations have been done.
bcachefs (grossly oversimplified) writes a raid1 journal, then batches
up and replays the writes as raid5 in the background. That's great if
you have long periods of idle IO bandwidth, but it can suck a lot if
the background thread doesn't keep up. It's functionally the same as
using balance to reclaim space on btrfs.
> In the journal only the "small" write should go. The "bigger" write
> should go directly in the "fully empty" stripe (where it is not needed
> a RMW cycle).
Yes, the best first step is to remove as many RMW operations as possible
in the upper layers of the filesystem. If we don't have RMW, we don't
need to journal or do anything else with it. We also don't have the
severe performance hit that unnecessary RMW does, either.
> So a typical transaction would be composed by:
>
> - update the "partial empty" stripes with a RMW cycle with the data
> that were stored in the journal in the *previous* transaction
>
> - updated the journal with the new "small" write
A naive design of the journal implementation would store PPL blocks
in metadata, as this is more efficient than storing entire stripes or
copies of data. Metadata is necessarily non-parity raid, both to avoid
the write hole and for performance reasons, so it can redundantly store
PPL information.
Error handling with PPL might be challenging: what happens if we need
to correct a corrupt stripe while updating it?
But maybe journalling full stripe writes is a better match for btrfs.
To do a PPL update, the PPL has to be committed first, which would mean it
has to be written out _before_ the transaction that modifies the target
raid5 stripe. That would mean the updated stripe has to be written
during a later transaction than the one that updated it, so it needs to
be stored somewhere until that can happen. Given those requirements,
it might be better to simply write out the whole new stripe to a free
stripe in a raid5 data block group, and write a log tree item to copy
that stripe to the destination location. That starts to sound a lot
like option 1 with the fix above...why overwrite, when you can put data
in the log tree and commit it from there without moving it?
In mdadm-land, PPL burns 30-40% of write performance. We can assume that
will be higher on btrfs.
For me, if the tradeoff is losing 1.5% of a 100 TB filesystem for
unreclaimed free space vs. a 30% performance hit on all writes with
journalling, I'll just delete one day's worth of snapshots to make that
space available, and never consider journalling again.
Journalling isn't so great for nodatacow either: the whole point of
nodatacow is to reduce overhead comparable to plain xfs/ext4 on mdadm,
and journalling puts that overhead back in. I'd expect to see users in
both camps, those who want btrfs to emulate ext4 on mdadm raid5 without
PPL, and those who want btrfs to emulate ext4 on mdadm raid5 with PPL.
So that would add two more options to either mount flags or inode flags.
> - pass-through the "big write" directly to the "fully empty" stripes
>
> - commit the transaction (w/journal)
>
>
> The key point is that the allocator should know if a stripe is fully
> empty or is partially empty. A classic block-based raid doesn't have
> this information. BTRFS has this information.
>
>
>
> BR
>
> G.Baroncelli
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>
>
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-11 7:26 ` Zygo Blaxell
@ 2024-12-11 19:39 ` Goffredo Baroncelli
2024-12-15 7:49 ` Zygo Blaxell
0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2024-12-11 19:39 UTC (permalink / raw)
To: Zygo Blaxell; +Cc: Qu Wenruo, Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 11/12/2024 08.26, Zygo Blaxell wrote:
> On Tue, Dec 10, 2024 at 08:36:05PM +0100, Goffredo Baroncelli wrote:
>> On 10/12/2024 03.34, Zygo Blaxell wrote:
>>> On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
>>
>> Hi Zygo,
>>
>> thank for this excellent analisys
>>
>> [...]
>>
>>> There's several options to fix the write hole:
>>>
>>> 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
>>> within a partially filled raid56 stripe, unless the stripe was empty
>>> at the beginning of the current transaction (i.e. multiple RMW writes
>>> are OK, as long as they all disappear in the same crash event). This
>>> ensures a stripe is never written from two separate btrfs transactions,
>>> eliminating the write hole. This option requires an allocator change,
>>> and some rework/optimization of how ordered extents are written out.
>>> It also requires more balances--space within partially filled stripes
>>> isn't usable until every data block within the stripe is freed, and
>>> balances do exactly that.
>>
>> My impression is that this solution would degenerate in a lot of
>> waste space, in case of small/frequent/sync writes.
>
> "A lot" is relative.
>
> If we first coalesce small writes, e.g. put a hook in delalloc or ordered
> extent handling that defers small writes until there are enough to fill a
> stripe, and only get to partial stripe writes during a commit or a fsync,
> then there far fewer of these then there are now. This would be a big
> performance gain for btrfs raid5, without even considering the gains
> in robustness.
Have you considered the "write in the middle of a file" case? In this case
the old data will leave a hole in the stripe that will not be filled until
a rebalance. My impression is that for some workloads this approach will
degenerate badly, with a high number of half-empty stripes (more below)
[..]
>
> In the real world this may not be a problem. On my RAID5 arrays, all
> extents smaller than a RAID stripe combined represent only 1-3% percent
> of the total space. Part of this is self-selection: if I have a workload
> that hammers the filesystem with 4K fsyncs, I'm not going to run it on
> top of raid5 in the stack because the random write performance will suck.
> I'll move that thing to a raid1 filesystem, and use raid5 for bulk data
> storage. On my raid1 filesystems, small extents are 10-15% of the space.
1-3%? Actually, I expected a lot more. Do you perform regular balances?
> There are probably half a dozen ways to fix this problem, if it is
> a problem. Other filesystems degenerate to raid1 in this case, so they
> can lose a lot of space this way. The existing btrfs allocator doesn't
> completely fill block groups even without raid5, and requires balance to
> get the space back (unless you're willing to burn CPU all day running
> the allocator's search algorithm). Nobody is currently running OLTP
> workloads on btrfs raid5 because of the risk they'll explode from the
> write hole.
The hard part of a filesystem is not sucking for *any* workload, even at the
cost of not being the best for *every* workload.
>
> Off the top of my head.
>
> We only ever need one partially filled stripe, so we can keep it in
> memory and add blocks to it as we get more data to write. For the first
> short fsync in a transaction, write a new partially filled stripe to a
> new location (this is a full write, no RMW, but some of the data blocks
> are empty so they'll be zero-filled) and save the data's location in
> the log tree (except for the full write, this is what btrfs does now).
>
> For the next short fsync, write a stripe which contains the first
> and second fsync's data, and save the new location of both blocks in
> the log tree (since there are only a few of these, they can also be
> kept in kernel memory).
>
> For the third short fsync, write a stripe which contains first, second,
> and third fsync's data, and save the new location of all 3 blocks in the
> log tree. During transaction commit, write out only the last stripe's
> location in the metadata (the first two stripes remain free space).
> During log replay after a crash, replay all the add/delete operations up
> to the last completed fsync. There's never a RMW write here, so nothing
> ever gets lost in a write hole. There's never an overwrite of data in
> any form, which is better than most journalling writeback schemes where
> overwrites are mandatory.
If I understood correctly, to write 3 short data+fsync operations you write
1+2+3=6 data blocks: twice the data, plus 3 fsyncs. This is the same cost as a journal.
>
> I think we can also hack up balance so that it only relocates extents
> that begin or end in partially filled/partially empty raid56 stripes,
> with some additional heuristics like "don't move a 60M extent to get 4K
> free space." That would be much faster than moving around all data that
> already occupies full stripes.
>
>>> 2. Add a stripe journal. Requires on-disk format change to add the
>>> journal, and recovery code at startup to replay it. It's the industry
>>> standard way to fix the write hole in a traditional raid5 implementation,
>>> so it's the first idea everyone proposes. It's also quite slow if you
>>> don't have dedicated purpose-built hardware for the journal. It's the
>>> only option for closing the write hole on nodatacow files.
>>
>> I am not sure about the "quite slow" part.
>
> "Journal" means writing the same data to two different locations.
> That's always slower than writing once to a single location, after all
> other optimizations have been done.
Why do you say you write only once? In the example above, you write the data
twice.
>
> bcachefs (grossly oversimplified) writes a raid1 journal, then batches
> up and replays the writes as raid5 in the background. That's great if
> you have long periods of idle IO bandwidth, but it can suck a lot if
> the background thread doesn't keep up. It's functionally the same as
> using balance to reclaim space on btrfs.
>
My understanding is that, more or less, all the proposed solutions try to
reduce the number of RMW cycles by first putting the data in a journal/log/raid1 bg.
Then the data is moved to a raid5/6 bg. The main difference is how big this
journal is; depending on its size, a "balance" is required more or less frequently.
The current btrfs design is based on the fact that the raid profile is managed at
the block-group level, completely independently of the btree/extent
management. This scaled very well across the different raid models until the
parity-based ones, because the COW model (needed to keep the checksums in
sync with the data) is not enough to keep the parity in sync with the data.
From time to time I have thought about a variable stripe length, where the parity
is stored inside the extent at a fixed offset, this offset depending on the
number of disks and the height of the stripe.
Having the parity inside the extents would solve this part of the problem, until
you need to update the middle of a file. In that case things become complex
enough that it is easier to just write a full stripe.
>> In the journal only the "small" write should go. The "bigger" write
>> should go directly in the "fully empty" stripe (where it is not needed
>> a RMW cycle).
>
> Yes, the best first step is to remove as many RMW operations as possible
> in the upper layers of the filesystem. If we don't have RMW, we don't
> need to journal or do anything else with it. We also don't have the
> severe performance hit that unnecessary RMW does, either.
>
>> So a typical transaction would be composed by:
>>
>> - update the "partial empty" stripes with a RMW cycle with the data
>> that were stored in the journal in the *previous* transaction
>>
>> - updated the journal with the new "small" write
>
> A naive design of the journal implementation would store PPL blocks
> in metadata, as this is more efficient than storing entire stripes or
> copies of data. Metadata is necessarily non-parity raid, both to avoid
> the write hole and for performance reasons, so it can redundantly store
> PPL information.
>
> Error handling with PPL might be challenging: what happens if we need
> to correct a corrupt stripe while updating it?
I think the same thing that happens when you do a standard RMW cycle and hit corruption:
pause the transaction, read the full stripe and the data checksums, try all the
possible combinations of data and parity until you get good data, rewrite the stripe
(going through the journal!), and resume the transaction.
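The recovery loop described here can be sketched for the single-bad-block XOR case (helper names are invented, and real btrfs repair is more involved, also covering raid6 P+Q):

```python
# "Try combinations until the checksums match": for each candidate bad
# data block, reconstruct it from the other blocks plus parity, and
# accept the combination whose blocks all pass their crc32 checks.
import zlib

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

def repair_stripe(data, parity, csums):
    """data: on-disk data blocks; parity: on-disk parity block;
    csums: expected crc32 of each data block. Returns repaired data."""
    for bad in range(len(data)):
        others = [d for i, d in enumerate(data) if i != bad]
        trial = list(data)
        trial[bad] = xor_all(others + [parity])   # rebuild candidate
        if all(zlib.crc32(d) == c for d, c in zip(trial, csums)):
            return trial
    raise IOError("unrecoverable: more than one bad block in stripe")

good = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_all(good)
csums = [zlib.crc32(d) for d in good]
corrupt = [good[0], b"XXXX", good[2]]       # D2 silently corrupted
assert repair_stripe(corrupt, parity, csums) == good
```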
>
> But maybe journalling full stripe writes is a better match for btrfs.
> To do a PPL update, the PPL has to be committed first, which would mean it
> has to be written out _before_ the transaction that modifies the target
> raid5 stripe. That would mean the updated stripe has to be written
> during a later transaction than the one that updated it, so it needs to
> be stored somewhere until that can happen. Given those requirements,
> it might be better to simply write out the whole new stripe to a free
> stripe in a raid5 data block group, and write a log tree item to copy
> that stripe to the destination location. That starts to sound a lot
> like option 1 with the fix above...why overwrite, when you can put data
> in the log tree and commit it from there without moving it?
>
> In mdadm-land, PPL burns 30-40% of write performance. We can assume that
> will be higher on btrfs.
My suspicion is that this happens because md is not in a position to relocate
data to fill a stripe. BTRFS can join different writes and put them together
in a stripe, and when that happens BTRFS can bypass the journal because no
RMW cycle is needed. So I expect BTRFS would perform better.
> For me, if the tradeoff is losing 1.5% of a 100 TB filesystem for
> unreclaimed free space vs. a 30% performance hit on all writes with
> journalling, I'll just delete one day's worth of snapshots to make that
> space available, and never consider journalling again.
>
> Journalling isn't so great for nodatacow either: the whole point of
> nodatacow is to reduce overhead comparable to plain xfs/ext4 on mdadm,
> and journalling puts that overhead back in. I'd expect to see users in
> both camps, those who want btrfs to emulate ext4 on mdadm raid5 without
> PPL, and those who want btrfs to emulate ext4 on mdadm raid5 with PPL.
> So that would add two more options to either mount flags or inode flags.
>
>> - pass-through the "big write" directly to the "fully empty" stripes
>>
>> - commit the transaction (w/journal)
>>
>>
>> The key point is that the allocator should know if a stripe is fully
>> empty or is partially empty. A classic block-based raid doesn't have
>> this information. BTRFS has this information.
>>
>>
>>
>> BR
>>
>> G.Baroncelli
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>>
>>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Using btrfs raid5/6
2024-12-11 19:39 ` Goffredo Baroncelli
@ 2024-12-15 7:49 ` Zygo Blaxell
0 siblings, 0 replies; 28+ messages in thread
From: Zygo Blaxell @ 2024-12-15 7:49 UTC (permalink / raw)
To: Goffredo Baroncelli
Cc: Qu Wenruo, Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On Wed, Dec 11, 2024 at 08:39:04PM +0100, Goffredo Baroncelli wrote:
> On 11/12/2024 08.26, Zygo Blaxell wrote:
> > On Tue, Dec 10, 2024 at 08:36:05PM +0100, Goffredo Baroncelli wrote:
> > > On 10/12/2024 03.34, Zygo Blaxell wrote:
> > > > On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
> > >
> > > Hi Zygo,
> > >
> > > thank for this excellent analisys
> > >
> > > [...]
> > >
> > > > There's several options to fix the write hole:
> > > >
> > > > 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
> > > > within a partially filled raid56 stripe, unless the stripe was empty
> > > > at the beginning of the current transaction (i.e. multiple RMW writes
> > > > are OK, as long as they all disappear in the same crash event). This
> > > > ensures a stripe is never written from two separate btrfs transactions,
> > > > eliminating the write hole. This option requires an allocator change,
> > > > and some rework/optimization of how ordered extents are written out.
> > > > It also requires more balances--space within partially filled stripes
> > > > isn't usable until every data block within the stripe is freed, and
> > > > balances do exactly that.
> > >
> > > My impression is that this solution would degenerate in a lot of
> > > waste space, in case of small/frequent/sync writes.
> >
> > "A lot" is relative.
> >
> > If we first coalesce small writes, e.g. put a hook in delalloc or ordered
> > extent handling that defers small writes until there are enough to fill a
> > stripe, and only get to partial stripe writes during a commit or a fsync,
> > then there far fewer of these then there are now. This would be a big
> > performance gain for btrfs raid5, without even considering the gains
> > in robustness.
>
> Are you considered the "write in the middle of a file case" ? In this case
> the old data will be an hole in the stripe that will not filled until
> a re - balance. My impression is that this approach for some workload will
> degenerate badly with an high number of half empty stripe (more below)
Oof, there are a number of misconceptions to unpack here...
Write to the middle of a file does not leave a hole in the stripe.
It doesn't free the overwritten extent at all, until the last reference
to the last block is removed--and then the entire extent is freed,
which would empty the stripe. This is extent bookending, which can
waste a lot of space on btrfs, e.g.
# compsize vm-raw.bin
Processed 1 file, 1097037 regular extents (1196454 refs), 0 inline.
Type       Perc     Disk Usage   Uncompressed Referenced
TOTAL       97%            31G            32G        25G
none       100%            31G            31G        23G
zlib        56%           2.0M           3.5M       1.0M
zstd        32%           442M           1.3G       1.1G
That's about 25% of the space waiting to be freed by an as-yet-unwritten
garbage collection tool, the extent tree v2 filesystem on-disk format
change, or as a side-effect of `btrfs fi defrag`. I don't care about
1% of the space when I'm already overprovisioning by 25% on btrfs compared
to other filesystems.
If the extent is longer than the raid stripe, then the raid stripe
simply remains on disk, complete with csums that can participate in
error detection and correction.
If the extent is shorter than the stripe, then there is a hole left
behind; however, this is the case _now_ on btrfs. The extent allocator
will try to align allocated extents on 64K boundaries, and will not fill
in any hole smaller than 64K until there are no holes larger than 64K
left in the filesystem (in other words, the allocator will prefer to
allocate a new block group before filling in the smaller holes).
This second allocation pass is _slow_ on large filesystems. Like,
"your disks will crumble to dust before you fill a large filesystem
past 99.5%, because btrfs gets exponentially slower toward the end,
and the filesystem will take thousands of years to fill in the last 0.1%
on a big array" slow.
This is one of the more frustrating aspects of the current situation:
if the allocator simply didn't do the second pass, it 1) would not hang
as it gets too close to full, but could promptly return ENOSPC instead,
and 2) would not have the raid5 write hole today if the cluster size
and alignment was reliably aligned with the stripe size.
> [..]
> >
> > In the real world this may not be a problem. On my RAID5 arrays, all
> > extents smaller than a RAID stripe combined represent only 1-3% percent
> > of the total space. Part of this is self-selection: if I have a workload
> > that hammers the filesystem with 4K fsyncs, I'm not going to run it on
> > top of raid5 in the stack because the random write performance will suck.
> > I'll move that thing to a raid1 filesystem, and use raid5 for bulk data
> > storage. On my raid1 filesystems, small extents are 10-15% of the space.
>
> 1-3% ? In effect I expected a lot more. Do you perform regular balances ?
More misconceptions: balances do not change existing extent lengths.
Balance defragments larger free spaces, but does not fill in
sub-cluster-sized free space holes (indeed, it makes _more_ of them
because the allocator's clustering parameters used during balance are
different from the parameters used for normal writes).
The 1-3% is simple math: if 90% of the filesystem's extents are 32K
long, and 10% of the extents are 32M long, then about 1% of the
filesystem's blocks are in 32K extents (smaller than a 9-drive-wide
raid5 stripe), and the other ~99% of the blocks are in extents much
larger than any raid5 stripe. Real filesystems have a wider variety of
extent sizes, but they end up with similar averages.
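The arithmetic behind that figure can be checked directly, using the example distribution above (the 90/10 split is illustrative, not measured):

```python
# Fraction of filesystem blocks living in small extents, given the
# example distribution: 90% of extents are 32K, 10% are 32M.
small_len, large_len = 32 * 1024, 32 * 1024 * 1024
small_frac, large_frac = 0.90, 0.10

small_bytes = small_frac * small_len    # bytes per "average" extent mix
large_bytes = large_frac * large_len
pct_in_small = 100 * small_bytes / (small_bytes + large_bytes)
print(f"{pct_in_small:.2f}% of blocks are in 32K extents")  # ~0.87%
```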
Some btrfs raid1 filesystems end up with higher percentages of small
writes (I have one with 11% small extents); however, with those higher
percentages come severe performance issues on btrfs. Performance would
get _much_ worse on raid5 because so many writes go through RMW.
> > There are probably half a dozen ways to fix this problem, if it is
> > a problem. Other filesystems degenerate to raid1 in this case, so they
> > can lose a lot of space this way. The existing btrfs allocator doesn't
> > completely fill block groups even without raid5, and requires balance to
> > get the space back (unless you're willing to burn CPU all day running
> > the allocator's search algorithm). Nobody is currently running OLTP
> > workloads on btrfs raid5 because of the risk they'll explode from the
> > write hole.
>
> The hard part of a filesystem is to not suck for *any workload* even at the
> cost to not be the best for *all workloads*.
I don't disagree with that. Why did you think I did?
> > Off the top of my head.
> >
> > We only ever need one partially filled stripe, so we can keep it in
> > memory and add blocks to it as we get more data to write. For the first
> > short fsync in a transaction, write a new partially filled stripe to a
> > new location (this is a full write, no RMW, but some of the data blocks
> > are empty so they'll be zero-filled) and save the data's location in
> > the log tree (except for the full write, this is what btrfs does now).
> >
> > For the next short fsync, write a stripe which contains the first
> > and second fsync's data, and save the new location of both blocks in
> > the log tree (since there are only a few of these, they can also be
> > kept in kernel memory).
> >
> > For the third short fsync, write a stripe which contains first, second,
> > and third fsync's data, and save the new location of all 3 blocks in the
> > log tree. During transaction commit, write out only the last stripe's
> > location in the metadata (the first two stripes remain free space).
> > During log replay after a crash, replay all the add/delete operations up
> > to the last completed fsync. There's never a RMW write here, so nothing
> > ever gets lost in a write hole. There's never an overwrite of data in
> > any form, which is better than most journalling writeback schemes where
> > overwrites are mandatory.
>
> If I understood correctly, to write 3 short data+fsync, you write 1+2+3=6
> blocks of fsync data: 2 times the data and 3 fsyncs. This is the same cost as a journal.
I write 1 unit of data for each of 3 fsyncs, so 1 write per fsync (the
unit is always a full stripe, it just has fewer zero blocks in later
fsyncs because more of the empty blocks are filled in). These writes are
always to the final location of the data (the same approach used by the
btrfs log tree) so there are no further read or write operations as there
would be with a journal. The stripes that are replaced by fsyncs during
the same tree update become free space at the end of the transaction
(again with no additional writes, same as the current log tree).
For a filesystem with metadata on SSD, the journal adds data block writes
to the SSD side. Writing the full stripe goes directly to the HDD side,
without any head movement on spinning drives.
> > I think we can also hack up balance so that it only relocates extents
> > that begin or end in partially filled/partially empty raid56 stripes,
> > with some additional heuristics like "don't move a 60M extent to get 4K
> > free space." That would be much faster than moving around all data that
> > already occupies full stripes.
> >
> > > > 2. Add a stripe journal. Requires on-disk format change to add the
> > > > journal, and recovery code at startup to replay it. It's the industry
> > > > standard way to fix the write hole in a traditional raid5 implementation,
> > > > so it's the first idea everyone proposes. It's also quite slow if you
> > > > don't have dedicated purpose-built hardware for the journal. It's the
> > > > only option for closing the write hole on nodatacow files.
> > >
> > > I am not sure about the "quite slow" part.
> >
> > "Journal" means writing the same data to two different locations.
> > That's always slower than writing once to a single location, after all
> > other optimizations have been done.
>
> Why do you say it is written only once? In the example above, you write the
> data two times.
One data write per fsync, one block per disk:
First write: [ data1 zero zero zero parity ]
Second write: [ data1 data2 zero zero parity ]
The log tree records the first write in the first fsync. In the second
fsync, the log tree atomically deletes the metadata pointers to data1 in
the first write and replaces them with pointers to data1 in the second
write, then adds pointers to data2. The committed tree records only
the metadata pointing to both data blocks in the second write. The first
write becomes free space at the end of the transaction.
If there are two more blocks written with fsync, we write a new stripe:
Third write: [ data1 data2 data3 data4 parity ]
At this point we can throw away the partially filled stripe that we've
been keeping in memory, because there are no more empty blocks to fill in.
We leave the log tree items on disk to be replayed after a crash, and
incorporate the final block locations into the metadata tree on commit.
(Note: in reality, the parity block would be moving around from stripe
to stripe. I left that detail out for clarity).
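The progression above can be sketched with single-parity XOR (a simplified model; as noted, real raid5 rotates the parity block, and the log-tree bookkeeping is more involved than shown here):

```python
from functools import reduce

def parity(blocks):
    """XOR parity over equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

BLK = 4  # tiny blocks for illustration
zero = bytes(BLK)

def stripe(data_blocks, width=4):
    """Full-stripe write: pad with zero blocks, append parity."""
    blocks = list(data_blocks) + [zero] * (width - len(data_blocks))
    return blocks + [parity(blocks)]

d1, d2 = b"AAAA", b"BBBB"
first = stripe([d1])        # [ data1 zero zero zero parity ]
second = stripe([d1, d2])   # [ data1 data2 zero zero parity ]

# Each fsync writes one complete new stripe; the log tree now points
# at `second`, and `first` becomes free space at commit. Because every
# write was a full stripe, losing any one block of `second` is
# recoverable from the remaining blocks:
lost = 1  # pretend the disk holding data2 died
rebuilt = parity(second[:lost] + second[lost + 1:])
print(rebuilt == d2)  # True
```

Note there is never an RMW step in this sketch: no stripe is read back and partially overwritten, which is the property that closes the write hole.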
> > bcachefs (grossly oversimplified) writes a raid1 journal, then batches
> > up and replays the writes as raid5 in the background. That's great if
> > you have long periods of idle IO bandwidth, but it can suck a lot if
> > the background thread doesn't keep up. It's functionally the same as
> > using balance to reclaim space on btrfs.
>
> My understanding is that, more or less, all the proposed solutions try to
> reduce the number of RMW cycles by putting the data in a journal/log/raid1 bg.
Another misconception: journalling in raid5 does not change the number
of RMW cycles. Journalling provides a mechanism to _restart_ RMW cycles
that are interrupted.
RMW cycles are difficult to repair if you don't have a complete copy of
the data before and after the update. If there are csum errors detected
after using a partial parity log block, that combinatorially expands the
number of possible corrections (OK, for raid5 we go from 1 to 2, but for
raid6 we multiply the number of combinations by N). With crc32c csums,
you can run into false positives when recovering blocks because crc32(A
^ corruption) = crc32(A) ^ crc32(corruption). If you're journalling
full block stripes to avoid that complexity, then you might as well
do it the way the log tree does: write a full stripe in free space,
and point the metadata tree to it for recovery.
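The CRC linearity behind those false positives is easy to demonstrate. (This uses Python's zlib.crc32, which is plain CRC32 rather than crc32c, but the property holds for any CRC; for equal-length inputs the identity holds up to the CRC of an all-zero block, a constant introduced by the init/final XOR that the casual statement above omits.)

```python
import os
import zlib

a = os.urandom(4096)            # original block
corruption = os.urandom(4096)   # error pattern XORed into it
corrupted = bytes(x ^ y for x, y in zip(a, corruption))

# CRCs are linear over XOR (up to the constant CRC of all-zeros):
zeros = bytes(len(a))
lhs = zlib.crc32(corrupted)
rhs = zlib.crc32(a) ^ zlib.crc32(corruption) ^ zlib.crc32(zeros)
print(lhs == rhs)  # True for any a and corruption of equal length
```

So a corruption pattern whose CRC contribution cancels out is indistinguishable from the original block during reconstruction, which is the false-positive risk described above.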
> Then the data is moved into a raid5/6 bg. The main difference is how big this
> journal is. Depending on its size, a "balance" is required more or less frequently.
You might be thinking of something like the approach bcachefs used here:
write updates to something raid1, then have a background process copy
that data to raid5 stripes. That can cause problems where the filesystem
ends up being forced to stop allowing more writes because it has fallen
too far behind.
Balances can be run at any time to reclaim free space on a CMR drive.
When using this scheme, the balance must be run to keep the journal from
filling up the filesystem with metadata. On a filesystem with continuous
write load, it will eventually force writes to stop because there's no
metadata space left. It would behave like a SMR drive in that respect.
> The current btrfs design is based on the fact that the raid profile is managed at
> the block-group level, completely independent of the btree/extent
> management. This scaled very well across the different raid models until the
> parity-based ones, because the COW model (needed to keep the checksums in
> sync with the data) is not enough to keep the parity in sync with the data.
This is a correct statement of the cause of the raid5 write hole on btrfs.
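The cause-and-effect can be sketched in a few lines (toy single-parity XOR model; the block contents and layout here are purely illustrative):

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

zero = bytes(4)

# Transaction 1 commits D1 into a partially filled stripe:
d1 = b"old!"
disk = {"D1": d1, "D2": zero, "P": xor(d1, zero)}  # parity in sync

# Transaction 2 starts filling the unused slot with D2, but the crash
# lands after the data write and before the parity write:
disk["D2"] = b"new!"      # reached the disk
# disk["P"] update lost   # <-- the write hole

# Later, the device holding D1 fails. Degraded-mode reconstruction
# XORs the surviving blocks, using the now-stale parity:
rebuilt_d1 = xor(disk["D2"], disk["P"])
print(rebuilt_d1 == d1)   # False: committed data from txn 1 is gone
```

The CoW guarantee protects D2 (it was never referenced by a committed tree), but it does nothing for D1, whose recoverability depended on parity that a later, unrelated transaction invalidated.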
> From time to time I have thought about a variable stripe length, where the
> parity is stored inside the extent at a fixed offset, this offset depending
> on the number of disks and the height of the stripe.
This is the ZFS approach. bcachefs rejected this approach because it causes
short fsyncs to degenerate to raid1, i.e. they take up more space because
the ratio of data to parity shrinks from N:1 to 1:1.
I'm not particularly bothered by that ratio change (apart from anything
else, it's the fastest option for continuous small-write fsync-heavy
workloads on very fast SSDs). What bothers me about this is that we
effectively need a new encoding method, i.e. the parity blocks have to
be represented in the extent tree, and reading and writing them will
be like compressed files. That means inserting and debugging a new
compression-like layer to run on top of raid5 data block groups (can't
do this for metadata block groups without undoing skinny-metadata and
all the changes to metadata behavior that implies).
I want btrfs raid5 write hole fixed *today*, not 5-10 years from now
when a new encoding type finally gets debugged enough to be usable.
> Having the parity inside the extents would solve this part of the problem, until
> you need to update the middle of a file. In that case things become complex
> enough that it is easier to write a full stripe.
Another misconception here: btrfs already has proven mechanisms in
place to handle similar cases.
Recall that btrfs extents are immutable. There is no update in the
middle of a file. btrfs writes the new data to a new extent, then makes
the middle of the file point to the new extent.
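A minimal sketch of that mechanism (a hypothetical model; real btrfs file extent items carry offsets into shared, immutable extents, with considerably more bookkeeping):

```python
# A file is a list of (logical_offset, extent_id, extent_offset, length)
# references into immutable extents. "Overwriting" the middle of the
# file writes a new extent and re-points the middle at it.
extents = {"E1": b"AAAABBBBCCCC"}   # immutable once written
filemap = [(0, "E1", 0, 12)]        # whole 12-byte file -> E1

def read(filemap):
    return b"".join(extents[e][off:off + ln] for _, e, off, ln in filemap)

def overwrite(filemap, pos, data):
    eid = f"E{len(extents) + 1}"
    extents[eid] = data             # new extent; E1 is never touched
    # Split the single E1 reference around [pos, pos + len(data)):
    return [(0, "E1", 0, pos),                      # head still in E1
            (pos, eid, 0, len(data)),               # middle -> new extent
            (pos + len(data), "E1", pos + len(data),
             12 - pos - len(data))]                 # tail still in E1

filemap = overwrite(filemap, 4, b"XXXX")
print(read(filemap))       # b'AAAAXXXXCCCC'
print(extents["E1"])       # unchanged: b'AAAABBBBCCCC'
```

(The hardcoded "E1" and 12-byte length keep the sketch short; the mechanism generalizes to arbitrary reference splitting.)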
> > > In the journal only the "small" write should go. The "bigger" write
> > > should go directly in the "fully empty" stripe (where it is not needed
> > > a RMW cycle).
> >
> > Yes, the best first step is to remove as many RMW operations as possible
> > in the upper layers of the filesystem. If we don't have RMW, we don't
> > need to journal or do anything else with it. We also don't have the
> > severe performance hit that unnecessary RMW does, either.
> >
> > > So a typical transaction would be composed by:
> > >
> > > - update the "partial empty" stripes with a RMW cycle with the data
> > > that were stored in the journal in the *previous* transaction
> > >
> > > - updated the journal with the new "small" write
> >
> > A naive design of the journal implementation would store PPL blocks
> > in metadata, as this is more efficient than storing entire stripes or
> > copies of data. Metadata is necessarily non-parity raid, both to avoid
> > the write hole and for performance reasons, so it can redundantly store
> > PPL information.
> >
> > Error handling with PPL might be challenging: what happens if we need
> > to correct a corrupt stripe while updating it?
>
> I think the same thing that happens when you do a standard RMW cycle with a corruption:
> pause the transaction; read the full stripe and the data checksums; try all the
> possible combinations of data and parity until you get good data; rewrite the stripe
> (going through a journal!); resume the transaction.
All of which is terrible: while under memory pressure, in the middle
of doing some writes, we stop the world to do a read, because we didn't
completely eliminate RMW in btrfs.
That's an unforced error. Let's not do that.
> > But maybe journalling full stripe writes is a better match for btrfs.
> > To do a PPL update, the PPL has to be committed first, which would mean it
> > has to be written out _before_ the transaction that modifies the target
> > raid5 stripe. That would mean the updated stripe has to be written
> > during a later transaction than the one that updated it, so it needs to
> > be stored somewhere until that can happen. Given those requirements,
> > it might be better to simply write out the whole new stripe to a free
> > stripe in a raid5 data block group, and write a log tree item to copy
> > that stripe to the destination location. That starts to sound a lot
> > like option 1 with the fix above...why overwrite, when you can put data
> > in the log tree and commit it from there without moving it?
> >
> > In mdadm-land, PPL burns 30-40% of write performance. We can assume that
> > will be higher on btrfs.
>
> My suspicion is that this happens because md is not in a position to relocate
> data to fill a stripe. BTRFS can join different writes and put them together in a
> stripe, and when this happens BTRFS can bypass the journal because no RMW cycle
> is needed. So I expect BTRFS would perform better.
Well...OK, you're right. It wouldn't be 30-40% of all write performance.
Only nodatacow files and short fsyncs would be subject to the slowdown.
> > For me, if the tradeoff is losing 1.5% of a 100 TB filesystem for
> > unreclaimed free space vs. a 30% performance hit on all writes with
> > journalling, I'll just delete one day's worth of snapshots to make that
> > space available, and never consider journalling again.
> >
> > Journalling isn't so great for nodatacow either: the whole point of
> > nodatacow is to reduce overhead comparable to plain xfs/ext4 on mdadm,
> > and journalling puts that overhead back in. I'd expect to see users in
> > both camps, those who want btrfs to emulate ext4 on mdadm raid5 without
> > PPL, and those who want btrfs to emulate ext4 on mdadm raid5 with PPL.
> > So that would add two more options to either mount flags or inode flags.
> >
> > > - pass-through the "big write" directly to the "fully empty" stripes
> > >
> > > - commit the transaction (w/journal)
> > >
> > >
> > > The key point is that the allocator should know if a stripe is fully
> > > empty or is partially empty. A classic block-based raid doesn't have
> > > this information. BTRFS has this information.
> > >
> > >
> > >
> > > BR
> > >
> > > G.Baroncelli
> > >
> > > --
> > > gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> > > Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
> > >
> > >
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: Proposal for RAID-PN (was Re: Using btrfs raid5/6)
2024-12-10 2:34 ` Zygo Blaxell
2024-12-10 19:36 ` Goffredo Baroncelli
@ 2024-12-21 18:32 ` Forza
2024-12-22 12:00 ` Goffredo Baroncelli
1 sibling, 1 reply; 28+ messages in thread
From: Forza @ 2024-12-21 18:32 UTC (permalink / raw)
To: Zygo Blaxell, Qu Wenruo; +Cc: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
---- From: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> -- Sent: 2024-12-10 - 03:34 ----
> On Sun, Dec 08, 2024 at 06:56:00AM +1030, Qu Wenruo wrote:
>> On 2024/12/7 18:07, Andrei Borzenkov wrote:
>> > On 06.12.2024 07:16, Qu Wenruo wrote:
>> > > On 2024/12/6 14:29, Andrei Borzenkov wrote:
>> > > > On 05.12.2024 23:27, Qu Wenruo wrote:
>> > > > > On 2024/12/6 03:23, Andrei Borzenkov wrote:
>> > > > > > On 05.12.2024 01:34, Qu Wenruo wrote:
>> > > > > > > On 2024/12/5 05:47, Andrei Borzenkov wrote:
>> > > > > > > > On 04.12.2024 07:40, Qu Wenruo wrote:
>> > > > > > > > >
>> > > > > > > > > On 2024/12/4 14:04, Scoopta wrote:
>> > > > > > > > > > I'm looking to deploy btfs raid5/6 and have read some of the
>> > > > > > > > > > previous
>> > > > > > > > > > posts here about doing so "successfully." I want to make sure I
>> > > > > > > > > > understand the limitations correctly. I'm looking to replace an
>> > > > > > > > > > md+ext4
>> > > > > > > > > > setup. The data on these drives is replaceable but obviously
>> > > > > > > > > > ideally I
>> > > > > > > > > > don't want to have to replace it.
>> > > > > > > > >
>> > > > > > > > > 0) Use kernel newer than 6.5 at least.
>> > > > > > > > >
>> > > > > > > > > That version introduced a more comprehensive check for any RAID56
>> > > > > > > > > RMW,
>> > > > > > > > > so that it will always verify the checksum and rebuild when
>> > > > > > > > > necessary.
>> > > > > > > > >
>> > > > > > > > > This should mostly solve the write hole problem, and we even have
>> > > > > > > > > some
>> > > > > > > > > test cases in the fstests already verifying the behavior.
>> > > > > > > >
>> > > > > > > > Write hole happens when data can *NOT* be rebuilt because data is
>> > > > > > > > inconsistent between different strips of the same stripe. How btrfs
>> > > > > > > > solves this problem?
>> > > > > > >
>> > > > > > > An example please.
>> > > > > >
>> > > > > > You start with stripe
>> > > > > >
>> > > > > > A1,B1,C1,D1,P1
>> > > > > >
>> > > > > > You overwrite A1 with A2
>> > > > >
>> > > > > This already falls into NOCOW case.
>> > > > >
>> > > > > No guarantee for data consistency.
>> > > > >
>> > > > > For COW cases, the new data are always written into unused slot, and
>> > > > > after crash we will only see the old data.
>> > > >
>> > > > Do you mean that btrfs only does full stripe write now? As I recall from
>> > > > the previous discussions, btrfs is using fixed size stripes and it can
>> > > > fill unused strips. Like
>> > > >
>> > > > First write
>> > > >
>> > > > A1,B1,...,...,P1
>> > > >
>> > > > Second write
>> > > >
>> > > > A1,B1,C2,D2,P2
>> > > >
>> > > > I.e. A1 and B1 do not change, but C2 and D2 are added.
>> > > >
>> > > > Now, if parity is not updated before crash and D gets lost we have
>> > >
>> > > After crash, C2/D2 is not referenced by anyone.
>> > > So we won't need to read C2/D2/P2 because it's just unallocated space.
>> >
>> > You do need to read C2/D2 to build parity and to reconstruct any missing
>> > block. Parity no more matches C2/D2. Whether C2/D2 are actually
>> > referenced by upper layers is irrelevant for RAID5/6.
>>
>> Nope, in that case whatever garbage is in C2/D2, btrfs just does not care.
>>
>> Just try it yourself.
>>
>> You can even mkfs without discarding the device, then btrfs has garbage
>> for unwritten ranges.
>>
>> Does btrfs then care about that unallocated space or its parity?
>> No.
>>
>> Btrfs only cares about full stripes that have at least one block referenced.
>>
>> For a vertical stripe that has no sector involved, btrfs treats it as
>> nocsum, aka, as long as it can be read it's fine. If it can not be read
>> from the disk (missing dev etc), just use the rebuilt data.
>>
>> Either way, for unused sectors it makes no difference.
>
> The assumption Qu made here is that btrfs never writes data blocks to the
> same stripe from two or more different transactions, without freeing and
> allocating the entire stripe in between. If that assumption were true,
> there would be no write hole in the current implementation.
>
> The reality is that btrfs does exactly the opposite, as in Andrei's second
> example. This causes potential data loss of the first transaction's
> data if the second transaction's write is aborted by a crash. After the
> first transaction, the parity and uninitialized data blocks can be used
> to recover any data block in the first transaction. When the second
> transaction is aborted with some but not all of the blocks updated, the
> parity will no longer be usable to reconstruct the data blocks from _any_
> part of the stripe, including the first transaction's committed data.
>
> Technically, in this event, the second transaction's data is _also_
> lost, but as Qu mentioned above, that data isn't part of a committed
> transaction, so the damaged data won't appear in the filesystem after a
> crash, corrupted or otherwise.
>
> The potential data loss does not become actual data loss until the stripe
> goes into degraded mode, where the out-of-sync parity block is needed to
> recover a missing or corrupt data block. If the stripe was already in
> degraded mode during the crash, data loss is immediate.
>
> If the drives are all healthy, the parity block can be recomputed
> by a scrub, as long as the scrub is completed between a crash and a
> drive failure.
>
> If drives are missing or corrupt and parity hasn't been properly updated,
> then data block reconstruction cannot occur. btrfs will reject the
> reconstructed block when its csum doesn't match, resulting in an
> uncorrectable error.
>
> There are several options to fix the write hole:
>
> 1. Modify btrfs so it behaves the way Qu thinks it does: no allocations
> within a partially filled raid56 stripe, unless the stripe was empty
> at the beginning of the current transaction (i.e. multiple RMW writes
> are OK, as long as they all disappear in the same crash event). This
> ensures a stripe is never written from two separate btrfs transactions,
> eliminating the write hole. This option requires an allocator change,
> and some rework/optimization of how ordered extents are written out.
> It also requires more balances--space within partially filled stripes
> isn't usable until every data block within the stripe is freed, and
> balances do exactly that.
>
> 2. Add a stripe journal. Requires on-disk format change to add the
> journal, and recovery code at startup to replay it. It's the industry
> standard way to fix the write hole in a traditional raid5 implementation,
> so it's the first idea everyone proposes. It's also quite slow if you
> don't have dedicated purpose-built hardware for the journal. It's the
> only option for closing the write hole on nodatacow files.
>
> 3. Add a generic remapping layer for all IO blocks to avoid requiring
> RMW cycles. This is the raid-stripe-tree feature, a brute-force approach
> that makes RAID profiles possible on ZNS drives. ZNS drives have similar
> but much more strict write-ordering constraints than traditional raid56,
> so if the raid stripe tree can do raid5 on ZNS, it should be able to
> handle CMR easily ("efficiently" is a separate question).
>
> 4. Replace the btrfs raid5 profile with something else, and deprecate
> the raid5 profile. I'd recommend not considering that option until
> after someone delivers a complete, write-hole-free replacement profile,
> ready for merging. The existing raid5 is not _that_ hard to fix, we
> already have 3 well-understood options, and one of them doesn't require
> an on-disk format change.
>
>
> Option 1 is probably the best one: it doesn't require on-disk format
> changes, only changes to the way kernels manage future writes. Ideally,
> the implementation includes an optimization to collect small extent writes
> and merge them into full-stripe writes, which will make those _much_
> faster on raid56. The current implementation does multiple unnecessary
> RMW cycles when writing multiple separate data extents to the same
> stripe, even when the extents are allocated within a single transaction
> and collectively the extents fill the entire stripe.
>
> Option 1 won't fix nodatacow files, but that's only a problem if you
> use nodatacow files.
>
> I suspect options 2 and 3 have so much overhead that they are far
> slower than option 1, even counting the extra balances option 1 requires.
> With option 1, the extra overhead is in a big batch you can run overnight,
> while options 2 and 3 impose continuous overhead on writes, and for
> option 3, on reads as well.
>
>> > > So still wrong example.
>> > >
>> >
>> > It is the right example, you just prefer to ignore this problem.
>>
>> Sure sure, whatever you believe.
>>
>> Or why not just read the code on how the current RAID56 works?
>
> The above is a summary of the state of raid56 when I last read the code
> in depth (from circa v6.6), combined with direct experience from running
> a small fleet of btrfs raid5 arrays and observing how they behave since
> 2016, and review of the raid-stripe-tree design docs.
>
>> > > Remember we should discuss on the RMW case, meanwhile your case doesn't
>> > > even involve RMW, just a full stripe write.
>> > >
>> > > >
>> > > > A1,B1,C2,miss,P1
>> > > >
>> > > > with exactly the same problem.
>> > > >
>> > > > It has been discussed multiple times, that to fix it either btrfs has to
>> > > > use variable stripe size (basically, always do full stripe write) or
>> > > > some form of journal for pending updates.
>> > >
>> > > If taking a correct example, it would be some like this:
>> > >
>> > > Existing D1 data, unused D2 , P(D1+D2).
>> > > Write D2 and update P(D1+D2), then power loss.
>> > >
>> > > Case 0): Power loss after all data and metadata reached disk
>> > > Nothing to bother, metadata already updated to see both D1 and D2,
>> > > everything is fine.
>> > >
>> > > Case 1): Power loss before metadata reached disk
>> > >
>> > > This means we will only see D1 as the old data, have no idea there is
>> > > any D2.
>> > >
>> > > Case 1.0): both D2 and P(D1+D2) reached disk
>> > > Nothing to bother, again.
>> > >
>> > > Case 1.1): D2 reached disk, P(D1+D2) doesn't
>> > > We still do not need to bother anything (if all devices are still
>> > > there), because D1 is still correct.
>> > >
>> > > But if the device of D1 is missing, we can not recover D1, because D2
>> > > and P(D1+D2) is out of sync.
>> > >
>> > > However I can argue this is not a simple corruption/power loss; it's two
>> > > problems (power loss + missing device), so this should count as 2
>> > > missing/corrupted sectors in the same vertical stripe.
>
> A raid56 array must still tolerate power failures while it is degraded.
> This is table stakes for a modern parity raid implementation.
>
> The raid56 write hole occurs when it is possible for an active stripe
> to enter an unrecoverable state. This is an implementation bug, not a
> device failure.
>
> Leaving an inactive stripe in a corrupted state after a crash is OK.
> Never modifying any active stripe, so they are never corrupted, is OK.
> btrfs corrupts active stripes, which is not OK.
>
> Hopefully this is clear.
>
>> > This is the very definition of the write hole. You are entitled to have
>> > your opinion, but at least do not confuse others by claiming that btrfs
>> > protects against write hole.
>> >
>> > It need not be the whole device - it is enough to have a single
>> > unreadable sector which happens more often (at least, with HDD).
>> >
>> > And as already mentioned it need not happen at the same (or close) time.
>> > The data corruption may happen days and months after lost write. Sure,
>> > you can still wave it off as a double fault - but if in case of failed
>> > disk (or even unreadable sector) administrator at least gets notified in
>> > logs, here it is absolutely silent without administrator even being
>> > aware that this stripe is no more redundant and so administrator cannot
>> > do anything to fix it.
>> >
>> > > As least btrfs won't do any writeback to the same vertical stripe at all.
>> > >
>> > > Case 1.2): P(D1+D2) reached disk, D2 doesn't
>> > > The same as case 1.1).
>> > >
>> > > Case 1.3): Neither D2 nor P(D1+D2) reached disk
>> > >
>> > > It's the same as case 1.0, even missing D1 is fine to recover.
>> > >
>> > >
>> > > So if you believe powerloss + missing device counts as a single device
>> > > missing, and it doesn't break the tolerance of RAID5, then you can count
>> > > this as a "write-hole".
>> > >
>> > > But to me, this is not a single error, but two errors (write failure +
>> > > missing device), beyond the tolerance of RAID5.
>> > >
>> > > Thanks,
>> > > Qu
>> >
>>
>>
>
Hi,
Thank you for the detailed explanations and suggestions regarding the write hole issues in Btrfs RAID5/6. I would like to contribute to this discussion by proposing an alternative implementation, which I call RAID-PN, an extent-based parity scheme that avoids the write hole while addressing the shortcomings of the current RAID5/6 implementation.
I hope this proposal provides a useful perspective on addressing the write hole and improving RAID performance in Btrfs. I welcome feedback on its feasibility and implementation details.
---
Proposal: RAID-PN
RAID-PN introduces a dynamic parity scheme that uses data sub-extents and parity extents rather than fixed-width stripes. It eliminates RMW cycles, ensures atomic writes, and provides flexible redundancy levels comparable to or exceeding RAID6 and RAID1c4.
Design Overview
1. Non-Striped Data and Parity:
Data extents are divided into sub-extents based on the pool size. Parity extents are calculated for the current data sub-extents and written atomically in the same transaction.
Each parity extent is independent and immutable, ensuring consistency.
Example: A 6-device RAID-P3 setup allocates 3 data sub-extents and 3 parity extents. This configuration achieves 50% space efficiency while tolerating the same number of device failures as RAID1c4, which only achieves 25% efficiency on 6 devices.
2. Avoidance of RMW:
Parity is calculated only for the data sub-extents being written. No previously written data extents or parity extents are read or modified, completely avoiding RMW cycles.
3. Atomicity of Writes:
Both data and parity extents are part of the same transaction. If a crash occurs, uncommitted writes are rolled back, leaving only valid, consistent extents on disk.
4. Dynamic Allocation:
RAID-PN eliminates partially filled stripes by dynamically allocating data sub-extents. Parity extents are calculated only for the allocated sub-extents. This avoids garbage collection and balancing operations required by fixed-stripe designs.
5. Checksummed Parity:
Parity extents are checksummed, allowing verification during scrubbing and recovery.
Addressing Btrfs RAID5/6 Issues
1. Write Hole:
RAID-PN ensures parity and data extents are written atomically and never updated across transactions, inherently avoiding the write hole issue.
2. Degraded Mode Recovery:
Checksummed parity extents ensure reliable recovery from missing or corrupt data, even in degraded mode.
3. Scrubbing and Updates:
Scrubbing validates parity extents against checksums. Inconsistent parity can be recomputed using data sub-extents without relying on crash-free states.
4. Small Writes and Performance:
For writes smaller than the pool size, RAID-PN is less space efficient due to parity overhead (e.g., a 4 KiB write in RAID-P2 requires 1 data sub-extent and 2 parity extents, totaling 12 KiB). However, random small I/O performance is likely better than RAID56 due to the absence of RMW cycles.
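The space-efficiency trade-offs in the proposal can be tallied with simple arithmetic (using the block counts given above):

```python
def efficiency(data_blocks, parity_blocks):
    """Usable fraction of the space consumed by a write."""
    return data_blocks / (data_blocks + parity_blocks)

# Full-width write, 6-device RAID-P3: 3 data + 3 parity sub-extents.
print(f"RAID-P3 full write: {efficiency(3, 3):.0%}")   # 50%
# RAID1c4 stores 4 copies of everything, regardless of device count.
print(f"RAID1c4:            {efficiency(1, 3):.0%}")   # 25%
# Worst-case small write, RAID-P2: one 4 KiB sub-extent + 2 parity,
# i.e. 4 KiB of data consuming 12 KiB on disk.
print(f"RAID-P2 4K write:   {efficiency(1, 2):.0%}")
```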
Comparison to Proposed Fixes
1. Allocator Changes (Option 1):
RAID-PN achieves similar outcomes without requiring garbage collection or balancing operations to reclaim partially filled stripes.
2. Stripe Journal (Option 2):
RAID-PN avoids the need for a stripe journal by writing parity atomically alongside data in a single transaction.
3. RAID-Stripe-Tree (Option 3):
RAID-PN avoids the complexity of a remapping layer, though extent allocator changes are required to handle sub-extents.
4. Replacement Profile (Option 4):
RAID-PN offers a new profile that supports multiple-device redundancy, avoids RMW and journaling, and remains write-hole-free while adhering to Btrfs's CoW principles. I think it provides an interesting alternative, or complement, to RAID56.
Implementation Considerations
RAID-PN requires changes to support sub-extents for data. Parity extents must be tracked and linked to the corresponding data extents/sub-extents.
NoCOW files remain problematic: we need to be able to generate parity data, which runs into the same difficulties as generating csums, so NoCOW files would be unprotected under RAID-PN.
Random small I/O is likely to outperform RAID56 due to the lack of RMW cycles. Large sequential I/O should perform similarly to RAID56.
---
* Re: Proposal for RAID-PN (was Re: Using btrfs raid5/6)
2024-12-21 18:32 ` Proposal for RAID-PN (was Re: Using btrfs raid5/6) Forza
@ 2024-12-22 12:00 ` Goffredo Baroncelli
2024-12-23 7:42 ` Andrei Borzenkov
0 siblings, 1 reply; 28+ messages in thread
From: Goffredo Baroncelli @ 2024-12-22 12:00 UTC (permalink / raw)
To: Forza, Zygo Blaxell, Qu Wenruo
Cc: Andrei Borzenkov, Qu Wenruo, Scoopta, linux-btrfs
On 21/12/2024 19.32, Forza wrote:
>
>
>
>
> Hi,
>
> Thank you for the detailed explanations and suggestions regarding the write hole issues in Btrfs RAID5/6. I would like to contribute to this discussion by proposing an alternative implementation, which I call RAID-PN, an extent-based parity scheme that avoids the write hole while addressing the shortcomings of the current RAID5/6 implementation.
>
>
> I hope this proposal provides a useful perspective on addressing the write hole and improving RAID performance in Btrfs. I welcome feedback on its feasibility and implementation details.
>
>
> ---
>
> Proposal: RAID-PN
>
> RAID-PN introduces a dynamic parity scheme that uses data sub-extents and parity extents rather than fixed-width stripes. It eliminates RMW cycles, ensures atomic writes, and provides flexible redundancy levels comparable to or exceeding RAID6 and RAID1c4.
>
> Design Overview
>
> 1. Non-Striped Data and Parity:
>
> Data extents are divided into sub-extents based on the pool size. Parity extents are calculated for the current data sub-extents and written atomically in the same transaction.
>
> Each parity extent is independent and immutable, ensuring consistency.
>
> Example: A 6-device RAID-P3 setup allocates 3 data sub-extents and 3 parity extents. This configuration achieves 50% space efficiency while tolerating the same number of device failures as RAID1c4, which only achieves 25% efficiency on 6 devices.
>
I see a few technical challenges here:
1) I assume that each sub-stripe is "device" specific. This means the allocator cannot simply find one empty region big enough to host the data; it must find 6 empty regions on different disks.
2) the efficiency depends on the data size. If the extent is one sector in size, the system has to allocate 1 sub-extent for the data and 3 sub-extents for the parity.
3) the number of sub-extents grows by a factor equal to the number of disks
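Point 2 can be made concrete with a quick efficiency calculation (illustrative sketch only):

```python
def space_efficiency(data_blocks: int, parity_count: int) -> float:
    # Fraction of the written blocks that hold user data.
    return data_blocks / (data_blocks + parity_count)

# RAID-P3 on 6 devices: a single-sector extent drops to 25% efficiency,
# while a full 3-sub-extent write reaches the nominal 50%.
assert space_efficiency(1, 3) == 0.25
assert space_efficiency(3, 3) == 0.5
```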
> 2. Avoidance of RMW:
>
> Parity is calculated only for the data sub-extents being written. No previously written data extents or parity extents are read or modified, completely avoiding RMW cycles.
>
>
> 3. Atomicity of Writes:
>
> Both data and parity extents are part of the same transaction. If a crash occurs, uncommitted writes are rolled back, leaving only valid, consistent extents on disk.
>
>
> 4. Dynamic Allocation:
>
> RAID-PN eliminates partially filled stripes by dynamically allocating data sub-extents. Parity extents are calculated only for the allocated sub-extents. This avoids garbage collection and balancing operations required by fixed-stripe designs.
>
>
> 5. Checksummed Parity:
>
> Parity extents are checksummed, allowing verification during scrubbing and recovery.
>
>
>
> Addressing Btrfs RAID5/6 Issues
>
> 1. Write Hole:
>
> RAID-PN ensures parity and data extents are written atomically and never updated across transactions, inherently avoiding the write hole issue.
>
>
> 2. Degraded Mode Recovery:
>
> Checksummed parity extents ensure reliable recovery from missing or corrupt data, even in degraded mode.
>
>
> 3. Scrubbing and Updates:
>
> Scrubbing validates parity extents against checksums. Inconsistent parity can be recomputed using data sub-extents without relying on crash-free states.
>
>
> 4. Small Writes and Performance:
>
> For writes smaller than the pool size, RAID-PN is less space efficient due to parity overhead (e.g., a 4 KiB write in RAID-P2 requires 1 data sub-extent and 2 parity extents, totaling 12 KiB). However, random small I/O performance is likely better than RAID56 due to the absence of RMW cycles.
>
>
>
> Comparison to Proposed Fixes
>
> 1. Allocator Changes (Option 1):
>
> RAID-PN achieves similar outcomes without requiring garbage collection or balancing operations to reclaim partially filled stripes.
>
>
> 2. Stripe Journal (Option 2):
>
> RAID-PN avoids the need for a stripe journal by writing parity atomically alongside data in a single transaction.
>
>
> 3. RAID-Stripe-Tree (Option 3):
>
> RAID-PN avoids the complexity of a remapping layer, though extent allocator changes are required to handle sub-extents.
>
>
> 4. Replacement Profile (Option 4):
>
> RAID-PN offers a new profile that supports multiple-device redundancy, avoids RMW and journaling, and remains write-hole-free while adhering to Btrfs's CoW principles. I think it provides an interesting alternative, or complement, to RAID56.
>
>
5 - let me add another possible implementation:
The parity is stored inside the extent, at fixed offsets. Then the extent is written in a striped profile.
1) map the disks like a striped (non-raid) profile: the first block is the 1st block of the 1st disk, the second block is the 1st block of the 2nd disk... the n-th block is the 1st block of the n-th disk, the (n+1)-th block is the 2nd block of the 1st disk...
2) at the beginning of the extent there are the parity blocks, and the parity blocks are at fixed offsets (every n blocks); in the example below we assume 6 disks and 3 parities
if we have to write 1 data block, the extent will contain
{P11, P21, P31, D1}
if we have to write 2 data block, the extent will contain
{P112, P212, P312, D1, D2}
if we have to write 6 data block, the extent will contain:
{P1135, P2135, P3135, D1, D3, D5,
P1246, P2246, P3246, D2, D4, D6}
if we have to write 12 data block, the extent will contain
{P1159, P2159, P3159, D1, D5, D9,
P12610, P22610, P32610, D2, D6, D10,
P13711, P23711, P33711, D3, D7, D11,
P14812, P24812, P34812, D4, D8, D12}
In this way, the number of extents remains low, and the allocator logic can stay the same.
Rewriting in the middle of an extent would be a mess, but as pointed out by Zygo, this doesn't happen.
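A small sketch of the layout rule implied by the examples above (assuming data block j lands in row j mod rows, so that each row's parity blocks cover exactly that row's data; names are illustrative, not btrfs code):

```python
from math import ceil

def extent_layout(k_data: int, n_disks: int = 6, n_parity: int = 3):
    # Rows map onto the striped profile: each row holds the parity
    # blocks first, then the data blocks those parities protect.
    d_per_row = n_disks - n_parity
    rows = ceil(k_data / d_per_row)
    grid = [[f"P{p + 1}" for p in range(n_parity)] for _ in range(rows)]
    for j in range(k_data):
        grid[j % rows].append(f"D{j + 1}")
    return grid

# Reproduces the 6-data-block example: {P.., D1, D3, D5} {P.., D2, D4, D6}
assert extent_layout(6) == [["P1", "P2", "P3", "D1", "D3", "D5"],
                            ["P1", "P2", "P3", "D2", "D4", "D6"]]
```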
> Implementation Considerations
>
> RAID-PN requires changes to support sub-extents for data. Parity extents must be tracked and linked to the corresponding data extents/sub-extents..
>
>
> NoCOW files remain problematic. We need to be able to generate parity data, which is similarly difficult to generating csum, making NoCOW files unprotected under RAID-PN.
>
>
> Random small I/O is likely to outperform RAID56 due to the lack of RMW cycles. Large sequential I/O should perform similarly to RAID56.
>
> ---
>
>
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Proposal for RAID-PN (was Re: Using btrfs raid5/6)
2024-12-22 12:00 ` Goffredo Baroncelli
@ 2024-12-23 7:42 ` Andrei Borzenkov
2024-12-24 9:31 ` Goffredo Baroncelli
0 siblings, 1 reply; 28+ messages in thread
From: Andrei Borzenkov @ 2024-12-23 7:42 UTC (permalink / raw)
To: kreijack; +Cc: Forza, Zygo Blaxell, Qu Wenruo, Qu Wenruo, Scoopta, linux-btrfs
On Sun, Dec 22, 2024 at 3:00 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>
> On 21/12/2024 19.32, Forza wrote:
> >
> >
> >
> >
> > Hi,
> >
> > Thank you for the detailed explanations and suggestions regarding the write hole issues in Btrfs RAID5/6. I would like to contribute to this discussion by proposing an alternative implementation, which I call RAID-PN, an extent-based parity scheme that avoids the write hole while addressing the shortcomings of the current RAID5/6 implementation.
> >
> >
> > I hope this proposal provides a useful perspective on addressing the write hole and improving RAID performance in Btrfs. I welcome feedback on its feasibility and implementation details.
> >
> >
> > ---
> >
> > Proposal: RAID-PN
> >
> > RAID-PN introduces a dynamic parity scheme that uses data sub-extents and parity extents rather than fixed-width stripes. It eliminates RMW cycles, ensures atomic writes, and provides flexible redundancy levels comparable to or exceeding RAID6 and RAID1c4.
> >
> > Design Overview
> >
> > 1. Non-Striped Data and Parity:
> >
> > Data extents are divided into sub-extents based on the pool size. Parity extents are calculated for the current data sub-extents and written atomically in the same transaction.
> >
> > Each parity extent is independent and immutable, ensuring consistency.
> >
> > Example: A 6-device RAID-P3 setup allocates 3 data sub-extents and 3 parity extents. This configuration achieves 50% space efficiency while tolerating the same number of device failures as RAID1c4, which only achieves 25% efficiency on 6 devices.
Giving something a pretty name does not really explain how it works.
Can you show an example layout?
...
>
> 5 - let me to add another possible implementation:
>
> The parity is stored inside the extent, at fixed offset. Then the extent is written in a striped profile.
>
> 1) map the disks like a striped (non raid) profile: the first block is the first block of the 1st disk, the second block is the 1st block of the 2nd disk... the n block is the 1st block of the n disk, the n+1 block is the 2nd block of the first disk...
>
> 2) in the begin of the extent there are the parity blocks, and the parity blocks are at fixed offset (each n blocks); in the example below we assume 6 disks, and 3 parity
>
> if we have to write 1 data block, the extent will contain
> {P11, P21, P31, D1}
>
What happens with the holes? If they can be filled later, we are back to square one. If not, this is a partial stripe, which was mentioned already.
The challenge with implementing partial stripes is tracking unused (and unusable) space efficiently. The most straightforward implementation is to use one RAID stripe for only one extent. In practice this means always allocating and freeing space in units of a full RAID stripe. This should not require any on-disk format changes and only minimal allocator changes, but it can waste space for small extents. Making the RAID strip size smaller than 64K would improve space utilization at the cost of impacting sequential IO.
Packing multiple small extents into one stripe needs extra metadata to track a still-used stripe after some of its extents are deallocated.
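The waste from the one-extent-per-stripe approach is easy to quantify (sketch only; numbers assume 4 KiB extents and 64 KiB strips as in the text):

```python
def stripe_waste_bytes(extent_bytes: int, n_data_disks: int,
                       strip_size: int = 64 * 1024) -> int:
    # Space lost when every extent is rounded up to whole RAID stripes
    # (counting the data portion of the stripe only, parity excluded).
    stripe_data = n_data_disks * strip_size
    stripes = -(-extent_bytes // stripe_data)  # ceiling division
    return stripes * stripe_data - extent_bytes

# A 4 KiB extent on a 6-disk RAID5 (5 data strips of 64 KiB each)
# wastes 5*64K - 4K = 316 KiB of data capacity.
assert stripe_waste_bytes(4096, 5) == 316 * 1024
```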
> if we have to write 2 data block, the extent will contain
> {P112, P212, P312, D1, D2}
>
> if we have to write 6 data block, the extent will contain:
> {P1135, P2135, P3135, D1, D3, D5,
> P1246, P2246, P3246, D2, D4, D6}
>
> if we have to write 12 data block, the extent will contain
> {P1159, P2159, P3159, D1, D5, D9,
> P12610, P22610, P2610, D2, D6, D10,
> P13711, P23711, P33711, D3, D7, D11,
> P14812, P24812, P34812, D4, D8, D12}
>
> In this way, the number of the extents remain low, and the allocator logic should be the same.
>
> Rewriting in the middle of an extent would be a mess, but as pointed out by Zygo, this doesn't happen.
>
>
> > Implementation Considerations
> >
> > RAID-PN requires changes to support sub-extents for data. Parity extents must be tracked and linked to the corresponding data extents/sub-extents..
> >
> >
> > NoCOW files remain problematic. We need to be able to generate parity data, which is similarly difficult to generating csum, making NoCOW files unprotected under RAID-PN.
> >
> >
> > Random small I/O is likely to outperform RAID56 due to the lack of RMW cycles. Large sequential I/O should perform similarly to RAID56.
> >
> > ---
> >
> >
> >
>
>
> --
> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
* Re: Proposal for RAID-PN (was Re: Using btrfs raid5/6)
2024-12-23 7:42 ` Andrei Borzenkov
@ 2024-12-24 9:31 ` Goffredo Baroncelli
0 siblings, 0 replies; 28+ messages in thread
From: Goffredo Baroncelli @ 2024-12-24 9:31 UTC (permalink / raw)
To: Andrei Borzenkov
Cc: Forza, Zygo Blaxell, Qu Wenruo, Qu Wenruo, Scoopta, linux-btrfs
On 23/12/2024 08.42, Andrei Borzenkov wrote:
> On Sun, Dec 22, 2024 at 3:00 PM Goffredo Baroncelli <kreijack@libero.it> wrote:
>>
>> On 21/12/2024 19.32, Forza wrote:
>>>
>>>
>>>
>>>
>>> Hi,
>>>
>>> Thank you for the detailed explanations and suggestions regarding the write hole issues in Btrfs RAID5/6. I would like to contribute to this discussion by proposing an alternative implementation, which I call RAID-PN, an extent-based parity scheme that avoids the write hole while addressing the shortcomings of the current RAID5/6 implementation.
>>>
>>>
>>> I hope this proposal provides a useful perspective on addressing the write hole and improving RAID performance in Btrfs. I welcome feedback on its feasibility and implementation details.
>>>
>>>
>>> ---
>>>
>>> Proposal: RAID-PN
>>>
>>> RAID-PN introduces a dynamic parity scheme that uses data sub-extents and parity extents rather than fixed-width stripes. It eliminates RMW cycles, ensures atomic writes, and provides flexible redundancy levels comparable to or exceeding RAID6 and RAID1c4.
>>>
>>> Design Overview
>>>
>>> 1. Non-Striped Data and Parity:
>>>
>>> Data extents are divided into sub-extents based on the pool size. Parity extents are calculated for the current data sub-extents and written atomically in the same transaction.
>>>
>>> Each parity extent is independent and immutable, ensuring consistency.
>>>
>>> Example: A 6-device RAID-P3 setup allocates 3 data sub-extents and 3 parity extents. This configuration achieves 50% space efficiency while tolerating the same number of device failures as RAID1c4, which only achieves 25% efficiency on 6 devices.
>
> Giving something a pretty name does not really explain how it works.
> Can you show an example layout?
>
My understanding is that every "strip" of a "stripe" is referenced
by an extent. So if you want to update a stripe, you can rewrite in a CoW
way only the impacted parts (the data strip and the parity stripS).
To me this seems to be what already happens with RST+RAID.
>>
>> 5 - let me to add another possible implementation:
>>
>> The parity is stored inside the extent, at fixed offset. Then the extent is written in a striped profile.
>>
>> 1) map the disks like a striped (non raid) profile: the first block is the first block of the 1st disk, the second block is the 1st block of the 2nd disk... the n block is the 1st block of the n disk, the n+1 block is the 2nd block of the first disk...
>>
>> 2) in the begin of the extent there are the parity blocks, and the parity blocks are at fixed offset (each n blocks); in the example below we assume 6 disks, and 3 parity
>>
>> if we have to write 1 data block, the extent will contain
>> {P11, P21, P31, D1}
>>
>
> What happens with the holes? If they can be filled later, we are back
> on square one. If not, this is a partial stripe that was mentioned
> already.
It is a "standard" extent, which means that it is "written" (or better,
referenced) atomically, so the write hole can't happen. The only difference
is that this kind of extent also hosts the parity blocks.
Assuming a raid 5 with 6 disks, you can have an arrangement like
{P1 D1 D1} {P2 D2 D2
D2 D2 D2} {P3 D3} {P4
D4 D4 D4 D4 D4 P4
D4 D4 D4} ...
The columns are the disks; each {} pair delimits an extent. In this case
I showed extents with:
D1: 2 blocks
D2: 5 blocks
D3: 1 block
D4: 8 blocks
During an extent read, the parity blocks are skipped; if a read error
happens, the extent is fully read, and the missing information is
rebuilt on the basis of the parity blocks. Being an extent, it is
not updated in place.
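For the single-parity case, the repair path described above amounts to an XOR over the surviving blocks (minimal sketch, not btrfs code):

```python
def repair_strip(parity: bytes, strips: list, missing_idx: int) -> bytes:
    # RAID5-style rebuild: the missing data strip is the XOR of the
    # parity block and all surviving data strips of the same row.
    rebuilt = bytearray(parity)
    for i, s in enumerate(strips):
        if i != missing_idx:
            for k, b in enumerate(s):
                rebuilt[k] ^= b
    return bytes(rebuilt)

# Parity is the XOR of all strips, so any single lost strip is recoverable.
strips = [b"\x01\x10", b"\x02\x20", b"\x04\x40"]
parity = bytes(a ^ b ^ c for a, b, c in zip(*strips))
assert repair_strip(parity, strips, 1) == b"\x02\x20"
```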
The exception is NOCOW, for which it is impossible to preserve the
property of being a "light" write while keeping the parity blocks
in sync (unless a journal is used).
Regarding the comparison with the "partial stripe" design (I assume
you are referring to Zygo's email), this design doesn't
coalesce small (unrelated) writes; or at least nothing more than
the usual delalloc extent handling.
> The challenge with implementing partial stripes is tracking unused
> (and unusable) space efficiently. The most straightforward
> implementation - only use RAID stripe for one extent. Practically it
> means always allocating and freeing space in the units of RAID stripe.
> This should not require any on-disk format changes and minimal
> allocator changes but can waste space for small extents. Making RAID
> strip size smaller than 64K will improve space utilization at the cost
> of impacting sequential IO.
The advantage of putting the parity inside the extent is that you get
a variable stripe size, so the problem of wasted space disappears.
However, for small extents you pay the cost of impacted
sequential IO (but does that still matter in a world that is moving
to SSD storage?).
For larger extents, you can have an arrangement which preserves the 64K strip,
as shown below in a 3-disk raid5 layout.
{ P1 D1 D17
P2 D2 D18
P3 D3 D19
[...]
P16 D16 D32
P17 D33 D49
P18 D34 D50
[...]
}
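The mapping in that arrangement can be expressed as a short function (0-based indices; assumes 4 KiB blocks, so a 64 KiB strip holds 16 of them; illustrative only):

```python
def locate_data_block(j: int, blocks_per_strip: int = 16,
                      n_disks: int = 3, n_parity: int = 1):
    # Returns (column, row): column 0 holds parity in every strip
    # group, and each remaining column takes blocks_per_strip
    # consecutive data blocks before the next group starts.
    d_cols = n_disks - n_parity
    group, r = divmod(j, d_cols * blocks_per_strip)
    col = n_parity + r // blocks_per_strip
    row = group * blocks_per_strip + r % blocks_per_strip
    return col, row

# Matches the figure: D1 -> (col 1, row 0), D17 -> (col 2, row 0),
# and D33 starts the second strip group at (col 1, row 16).
assert locate_data_block(0) == (1, 0)
assert locate_data_block(16) == (2, 0)
assert locate_data_block(32) == (1, 16)
```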
>
> Packing multiple small extents in one stripe needs extra metadata to
> track still used stripe after some extents are deallocated.
>
>> if we have to write 2 data block, the extent will contain
>> {P112, P212, P312, D1, D2}
>>
>> if we have to write 6 data block, the extent will contain:
>> {P1135, P2135, P3135, D1, D3, D5,
>> P1246, P2246, P3246, D2, D4, D6}
>>
>> if we have to write 12 data block, the extent will contain
>> {P1159, P2159, P3159, D1, D5, D9,
>> P12610, P22610, P2610, D2, D6, D10,
>> P13711, P23711, P33711, D3, D7, D11,
>> P14812, P24812, P34812, D4, D8, D12}
>>
>> In this way, the number of the extents remain low, and the allocator logic should be the same.
>>
>> Rewriting in the middle of an extent would be a mess, but as pointed out by Zygo, this doesn't happen.
>>
>>
>>> Implementation Considerations
>>>
>>> RAID-PN requires changes to support sub-extents for data. Parity extents must be tracked and linked to the corresponding data extents/sub-extents..
>>>
>>>
>>> NoCOW files remain problematic. We need to be able to generate parity data, which is similarly difficult to generating csum, making NoCOW files unprotected under RAID-PN.
>>>
>>>
>>> Random small I/O is likely to outperform RAID56 due to the lack of RMW cycles. Large sequential I/O should perform similarly to RAID56.
>>>
>>> ---
>>>
>>>
>>>
>>
>>
>> --
>> gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
>> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Thread overview: 28+ messages
2024-12-04 3:34 Using btrfs raid5/6 Scoopta
2024-12-04 4:29 ` Andrei Borzenkov
2024-12-04 4:49 ` Scoopta
2024-12-04 4:40 ` Qu Wenruo
2024-12-04 4:50 ` Scoopta
2024-12-04 19:17 ` Andrei Borzenkov
2024-12-04 22:34 ` Qu Wenruo
2024-12-05 16:53 ` Andrei Borzenkov
2024-12-05 20:27 ` Qu Wenruo
2024-12-06 3:59 ` Andrei Borzenkov
2024-12-06 4:16 ` Qu Wenruo
2024-12-06 18:10 ` Goffredo Baroncelli
2024-12-07 7:37 ` Andrei Borzenkov
2024-12-07 20:26 ` Qu Wenruo
2024-12-10 2:34 ` Zygo Blaxell
2024-12-10 19:36 ` Goffredo Baroncelli
2024-12-11 1:47 ` Jonah Sabean
2024-12-11 7:26 ` Zygo Blaxell
2024-12-11 19:39 ` Goffredo Baroncelli
2024-12-15 7:49 ` Zygo Blaxell
2024-12-21 18:32 ` Proposal for RAID-PN (was Re: Using btrfs raid5/6) Forza
2024-12-22 12:00 ` Goffredo Baroncelli
2024-12-23 7:42 ` Andrei Borzenkov
2024-12-24 9:31 ` Goffredo Baroncelli
2024-12-06 2:03 ` Using btrfs raid5/6 Jonah Sabean
2024-12-07 20:48 ` Qu Wenruo
2024-12-08 16:31 ` Jonah Sabean
2024-12-08 20:07 ` Qu Wenruo