linux-fsdevel.vger.kernel.org archive mirror
* Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
@ 2025-11-28  3:07 Qu Wenruo
  2025-11-28 19:49 ` Goffredo Baroncelli
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-11-28  3:07 UTC (permalink / raw)
  To: linux-btrfs, linux-fsdevel@vger.kernel.org, zfs-devel

Hi,

With the recent bs > ps support for btrfs, I'm wondering if it's 
possible to experiment with a RAIDZ-like solution to the RAID56 
write-hole problem (at least for the data COW case) without a 
traditional journal.

Currently my idea looks like this:

- Fixed and much smaller stripe data length
   Currently the data stripe length is fixed at 64K for all btrfs RAID
   profiles.

   For RAIDZ chunks it will change to 4K (the minimum and the default).

- Force a larger than 4K fs block size (or data io size)
   And that fs block size will determine how many devices we can use
   for a RAIDZ chunk.

   E.g. with a 32K fs block size and 4K stripe length, we can use 8
   devices for data, plus 1 for parity.
   But this also means one has to have at least 9 devices to maintain
   this scheme with a 4K stripe length.
   (More is fine, fewer is not possible.  A small sketch of the
   geometry follows below.)
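
   To make the geometry concrete, here is a minimal user-space sketch
   (not btrfs code; every name in it is made up for illustration)
   deriving the RAIDZ chunk layout from the fs block size and the
   fixed 4K stripe length:

/* Sketch only: derive RAIDZ chunk geometry from the fs block size. */
#include <stdio.h>

#define STRIPE_LEN	4096	/* fixed per-device data stripe length */

struct raidz_geometry {
	unsigned int nr_data;	/* data devices per full stripe */
	unsigned int nr_parity;	/* 1 for RAIDZ1, 2 for RAIDZ2, ... */
	unsigned int nr_devs;	/* minimum devices required */
};

static int raidz_geometry(unsigned int fs_block_size, unsigned int nr_parity,
			  struct raidz_geometry *geo)
{
	/* the fs block size must be a multiple of the stripe length */
	if (!fs_block_size || fs_block_size % STRIPE_LEN)
		return -1;
	geo->nr_data = fs_block_size / STRIPE_LEN;
	geo->nr_parity = nr_parity;
	geo->nr_devs = geo->nr_data + nr_parity;
	return 0;
}

int main(void)
{
	struct raidz_geometry geo;

	/* 32K fs block, single parity: 8 data + 1 parity = 9 devices */
	if (!raidz_geometry(32768, 1, &geo))
		printf("data=%u parity=%u min_devices=%u\n",
		       geo.nr_data, geo.nr_parity, geo.nr_devs);
	return 0;
}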


But there are still some uncertainties that I hope to get feedback on 
before starting to code this.

- Conflicts with raid-stripe-tree and no zoned support
   I know WDC is working on the raid-stripe-tree feature, which will
   support all profiles, including RAID56 for data on zoned devices.

   The feature can also be used without zoned devices,
   although no RAID56 support has been implemented for it so far.

   Would raid-stripe-tree conflict with this new RAIDZ idea, or would
   it be better to just wait for raid-stripe-tree?

- Performance
   With a 4K stripe length, one fs block will be split into 4K writes,
   one per device.

   The initial sequential write will thus be split into a lot of 4K
   sized writes to the underlying disks.

   I'm not sure how much performance impact this will have; maybe it
   can be mitigated with proper blk plugging (see the sketch below)?
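
   For reference, by "blk plug" I just mean the usual blk_start_plug()/
   blk_finish_plug() pattern around the per-device bio submissions.
   A rough kernel-style sketch (not actual btrfs code; everything
   except the block layer API is a made-up name), batching the small
   bios of one write-out so the block layer can merge the consecutive
   per-device 4K writes into larger requests:

/* Kernel-style sketch only, not real btrfs code. */
#include <linux/blkdev.h>
#include <linux/bio.h>

static void raidz_submit_write_bios(struct bio **bios, unsigned int nr_bios)
{
	struct blk_plug plug;
	unsigned int i;

	/*
	 * Hold the plug over the whole batch (ideally a whole delalloc
	 * range, i.e. many full stripes), so consecutive 4K bios to the
	 * same device can be merged before they hit the disk.
	 */
	blk_start_plug(&plug);
	for (i = 0; i < nr_bios; i++)
		submit_bio(bios[i]);
	blk_finish_plug(&plug);
}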

- Larger fs block size or larger IO size
   If the fs block size is larger than the 4K stripe length, the data
   checksum is calculated over the whole fs block, and that will make
   rebuild much harder.

   E.g. the fs block size is 16K, the stripe length is 4K, with 4 data
   stripes and 1 parity stripe.

   If one data stripe is corrupted, the checksum will mismatch for the
   whole 16K, but we don't know which 4K is corrupted, so we have to
   try up to 4 combinations to get a correct rebuild result.

   Apply this to a whole disk and rebuild will take forever...

   On the other hand, this only requires an extra rebuild mechanism for
   RAIDZ chunks.


   The other solution is to introduce another size limit, maybe
   something like data_io_size: e.g. a 16K data_io_size, still with a
   4K fs block size and the same 4K stripe length.

   Every write will then be aligned to that 16K (a single 4K write
   will dirty the whole 16K range), and checksums will be calculated
   for each 4K block.

   Then when reading the 16K we verify every 4K block, so we can
   detect exactly which block is corrupted and repair just that block.

   The cost is the extra space spent on 4x the data checksums, and the
   extra data_io_size related code.  (A small sketch comparing the two
   approaches follows below.)
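
   To illustrate the difference between the two options, here is a toy
   user-space sketch (single parity, XOR only, and a stand-in checksum
   instead of crc32c; none of this is real btrfs code): with one
   checksum per 16K block the rebuild has to guess which 4K stripe is
   bad, while with one checksum per 4K block the bad stripe is located
   directly.

/* Sketch only: 4 data stripes + 1 XOR parity, 16K fs block. */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define STRIPE_LEN	4096
#define NR_DATA		4

static uint32_t toy_csum(const uint8_t *data, size_t len)
{
	uint32_t h = 2166136261u;	/* FNV-1a, stand-in for crc32c */

	while (len--)
		h = (h ^ *data++) * 16777619u;
	return h;
}

/*
 * One checksum over the whole 16K block: try each data stripe as the
 * corrupted one, rebuild it from parity, and keep the combination
 * whose block checksum matches -- up to NR_DATA attempts.
 */
static int rebuild_with_block_csum(uint8_t data[NR_DATA][STRIPE_LEN],
				   const uint8_t parity[STRIPE_LEN],
				   uint32_t block_csum)
{
	uint8_t fixed[STRIPE_LEN], saved[STRIPE_LEN];
	int victim, other, i;

	for (victim = 0; victim < NR_DATA; victim++) {
		memcpy(fixed, parity, STRIPE_LEN);
		for (other = 0; other < NR_DATA; other++) {
			if (other == victim)
				continue;
			for (i = 0; i < STRIPE_LEN; i++)
				fixed[i] ^= data[other][i];
		}
		memcpy(saved, data[victim], STRIPE_LEN);
		memcpy(data[victim], fixed, STRIPE_LEN);
		if (toy_csum((const uint8_t *)data,
			     (size_t)NR_DATA * STRIPE_LEN) == block_csum)
			return victim;		/* repaired in place */
		memcpy(data[victim], saved, STRIPE_LEN);	/* wrong guess */
	}
	return -1;	/* more than one stripe is bad */
}

/*
 * One checksum per 4K block (the data_io_size idea): the mismatching
 * checksum points at the bad stripe directly.
 */
static int find_bad_stripe(uint8_t data[NR_DATA][STRIPE_LEN],
			   const uint32_t csums[NR_DATA])
{
	int i;

	for (i = 0; i < NR_DATA; i++)
		if (toy_csum(data[i], STRIPE_LEN) != csums[i])
			return i;
	return -1;
}

int main(void)
{
	uint8_t data[NR_DATA][STRIPE_LEN], parity[STRIPE_LEN] = { 0 };
	uint32_t per_4k[NR_DATA], block_csum;
	int i, j;

	for (i = 0; i < NR_DATA; i++) {
		for (j = 0; j < STRIPE_LEN; j++) {
			data[i][j] = (uint8_t)(i * 7 + j);
			parity[j] ^= data[i][j];
		}
		per_4k[i] = toy_csum(data[i], STRIPE_LEN);
	}
	block_csum = toy_csum((const uint8_t *)data, sizeof(data));

	data[2][100] ^= 0xff;			/* corrupt stripe 2 */
	printf("per-4K csums point at stripe %d\n",
	       find_bad_stripe(data, per_4k));
	printf("block csum rebuild found stripe %d\n",
	       rebuild_with_block_csum(data, parity, block_csum));
	return 0;
}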


- Way more rigid device number requirement
   Everything must be decided at mkfs time: the stripe length, the fs
   block size/data io size, and the number of devices.

   Sure, one can still add more devices than required, but that will
   just behave like adding more disks to RAID1.
   Each RAIDZ chunk will have a fixed number of devices.

   Furthermore, one can no longer remove devices below the minimum
   number required by the RAIDZ chunks.
   Going with a 16K block size/data io size and 4K stripe length, it
   will always require 5 disks for RAIDZ1,
   unless the end user gets rid of all RAIDZ chunks (e.g. converts
   to regular RAID1* or even SINGLE).

- Larger fs block size/data io size means higher write amplification
   That's the most obvious part; a less obvious one is higher memory
   pressure, and btrfs is already pretty bad at write amplification.

   Currently the page cache relies on large folios to handle the
   bs > ps case, requiring more contiguous physical memory.

   And this limit will not go away even if the end user chooses to get
   rid of all RAIDZ chunks.


So any feedback is appreciated, whether from end users or even from the 
ZFS developers who invented RAIDZ in the first place.

Thanks,
Qu


* Re: Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
  2025-11-28  3:07 Ideas for RAIDZ-like design to solve write-holes, with larger fs block size Qu Wenruo
@ 2025-11-28 19:49 ` Goffredo Baroncelli
  2025-11-28 20:10   ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Goffredo Baroncelli @ 2025-11-28 19:49 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, zfs-devel

On 28/11/2025 04.07, Qu Wenruo wrote:
> Hi,
> 
> With the recent bs > ps support for btrfs, I'm wondering if it's possible to experiment some RAIDZ-like solutions to solve RAID56 write-holes problems (at least for data COW cases) without traditional journal.

More than a RAIDZ-like solution (== a variable stripe size), it seems that you want to use a stripe width equal to the fs block size, so that you can avoid the RMW (read-modify-write) cycles inside a stripe. It is an interesting idea, with a small development cost, which may work quite well when the number of disks * ps matches the fs bs.

In order to reduce some of the downsides, I suggest using a "per chunk" fs-bs.

> 
> Currently my idea looks like this:
> 
> - Fixed and much smaller stripe data length
>    Currently the data stripe length is fixed for all btrfs RAID profiles,
>    64K.
> 
>    But will change to 4K (minimal and default) for RAIDZ chunks.
> 
> - Force a larger than 4K fs block size (or data io size)
>    And that fs block size will determine how many devices we can use for
>    a RAIDZ chunk.
> 
>    E.g. with 32K fs block size, and 4K stripe length, we can use 8
>    devices for data, +1 for parity.
>    But this also means, one has to have at least 9 devices to maintain
>    this scheme with 4K stripe length.
>    (More is fine, less is not possible)
> 
> 
> But there are still some uncertainty that I hope to get some feedback before starting coding on this.
> 
> - Conflicts with raid-stripe-tree and no zoned support
>    I know WDC is working on raid-stripe-tree feature, which will support
>    all profiles including RAID56 for data on zoned devices.
> 
>    And the feature can be used without zoned device.
> 
>    Although there is never RAID56 support implemented so far.
> 
>    Would raid-stripe-tree conflicts with this new RAIDZ idea, or it's
>    better just wait for raid-stripe-tree?
> 
> - Performance
>    If our stripe length is 4K it means one fs block will be split into
>    4K writes into each device.
> 
>    The initial sequential write will be split into a lot of 4K sized
>    random writes into the real disks.
> 
>    Not sure how much performance impact it will have, maybe it can be
>    solved with proper blk plug?
> 
> - Larger fs block size or larger IO size
>    If the fs block size is larger than the 4K stripe length, it means
>    the data checksum is calculated for the whole fs block, and it will
>    make rebuild much harder.
> 
>    E.g. fs block size is 16K, stripe length is 4K, and have 4 data
>    stripes and 1 parity stripe.
> 
>    If one data stripe is corrupted, the checksum will mismatch for the
>    whole 16K, but we don't know which 4K is corrupted, thus has to try
>    4 times to get a correct rebuild result.
> 
>    Apply this to a whole disk, then rebuild will take forever...

I am not sure about that: a checksum failure should be the exception;
a disk failure is more common. In that case the parity should be enough
to rebuild the data correctly, and in most cases the checksum will then match.

> 
>    But this only requires extra rebuild mechanism for RAID chunks.
> 
> 
>    The other solution is to introduce another size limit, maybe something
>    like data_io_size, and for example using 16K data_io_size, and still
>    4K fs block size, with the same 4K stripe length.
> 
>    So that every writes will be aligned to that 16K (a single 4K write
>    will dirty the whole 16K range). And checksum will be calculated for
>    each 4K block.
> 
>    Then reading the 16K we verify every 4K block, and can detect which
>    block is corrupted and just repair that block.
> 
>    The cost will be the extra space spent saving 4x data checksum, and
>    the extra data_io_size related code.

I am not sure about the assumption that the BS must be equal to 4k*(ndisk-1).

This is an upper limit, but you could have a different mapping. E.g. another valid
example is having BS=4k*(ndisk/2-2). But even stranger arrangements
can be made, like:
	ndisk = 7
	BS = 4k*3

so the 2nd stripe is in two different rows:


              D1     D2     D3     D4     D5     D6     D7
            ------ ------ ------ ------ ------ ------ ------
              B1     B1     B1     P1     B2     B2     B2
              P2     B3 ....

What you really need is that:
1) bs = stripe width <= (ndisk - parity-level) * 4k
2) each bs is never updated in the middle (which would reintroduce an RMW cycle)

> 
> 
> - Way more rigid device number requirement
>    Everything must be decided at mkfs time, the stripe length, fs block
>    size/data io size, and number of devices.

As written above, I suggest using a "per chunk" fs-bs.

>    Sure one can still add more devices than required, but it will just
>    behave like more disks with RAID1.
>    Each RAIDZ chunk will have fixed amount of devices.
> 
>    And furthermore, one can no longer remove devices below the minimal
>    amount required by the RAIDZ chunks.
>    If going with 16K blocksize/data io size, 4K stripe length, then it
>    will always require 5 disks for RAIDZ1.
>    Unless the end user gets rid of all RAIDZ chunks (e.g. convert
>    to regular RAID1* or even SINGLE).
> 
> - Larger fs block size/data io size means higher write amplification
>    That's the most obvious part, and may be less obvious higher memory
>    pressure, and btrfs is already pretty bad at write-amplification.
> 

This is true, but you avoid the RMW cycle, which is also expensive.

>    Currently page cache is relying on larger folios to handle those
>    bs > ps cases, requiring more contiguous physical memory space.
> 
>    And this limit will not go away even the end user choose to get
>    rid of all RAIDZ chunks.
> 
> 
> So any feedback is appreciated, no matter from end users, or even ZFS developers who invented RAIDZ in the first place.
> 
> Thanks,
> Qu
> 

Let me add a proposal of my own (which is completely unrelated to yours :-)


Assumptions:

- an extent is never updated (true for BTRFS)
- the example below shows a raid5 case, but it can easily be extended to higher redundancy levels

Nomenclature:
- N = disk count
- stride = number of consecutive blocks on a disk, before jumping to the next disk
- stripe = stride * (N - 1)   # -1 for raid5, -2 for raid6 ...

Design idea:

- the redundancy is put inside the extent (not below it). Think of it like a new kind of compression.

- a new chunk type is created, composed of a sequence of blocks (4k ?) spread over the disks, where the 1st block is disk 1, offset 0; the 2nd block is disk 2, offset 0; ...; the Nth block is disk N, offset 0; the (N+1)th block is placed at disk 1, offset +4K... Like raid0 with a 4k stride.

- option #1 (simpler)

     - when an extent is created, a parity block is stored every (N-1) data blocks; if the extent is shorter than N-1 blocks, a parity block is attached at its end;

              D1     D2     D3     D4
            ------ ------ ------ ------
             E1,0   E1,1   P1,0   E2,0
             E2,1   E2,2   P2,1   E2,3
             E2,4   P2,1   E3,0   E3,1
             E3,2   P3,0   E3,3   E3,4
             E3,5   P3,1   E3,6   E3,7
             E3,8   P3,2   E3,9   E3,10
             P3,3

        Dz      Disk #z
        Ex,y	Extent x, offset y
        Px,y    Parity, extent x, range [y*N...y*N+N-1]


- option #2 (more complex)

     - as above, when an extent is created a parity block is stored every (N-1) data blocks; if the extent is shorter than N-1 blocks, a parity block is attached at its end;
       The idea is that if an extent spans more than one row, the logical blocks can be arranged so that the stride is longer (comparable to the number of rows).
       In this way you can write more *consecutive* 4K blocks at a time (when enough data to write is available). Delayed block allocation is crucial in this case.
       See E2,{0,1} and E3,{0,3}, E3,{4,7}, E3,{8,10}....

              D1     D2     D3     D4
            ------ ------ ------ ------
             E1,0   E1,1   P1,0   E2,0
             E2,1   E2,3   P2,1   E2,4
             E2,2   P2,1   E3,0   E3,4
             E3,8   P3,0   E3,1   E3,5
             E3,9   P3,1   E3,2   E3,6
             E3,10  P3,2   E3,3   E3,7
             P3,3

        Dz      Disk #z
        Ex,y	Extent x, offset y
        Px,y    Parity, extent x, range row related



Pros:
- no updates in the middle of a stripe, so no more RMW cycles
- (option #2 only) for a large write, consecutive blocks can be placed on the same disk
- each block can have its own checksum
- each stripe can have a different raid level
- maximum flexibility to change the number of disks

Cons:
- the scrub logic must be totally redesigned
- the logical block <-> physical block mapping in option #1 is not complex to compute; for option #2, however, it will be ... fun to find a good algorithm (a rough sketch of the option #1 mapping follows below)
- the ratio of data blocks to parity blocks may be very inefficient for small writes
- moving an extent between block groups with a different number of disks would require reallocating the parity blocks inside the extent
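
For reference, here is a rough user-space sketch of the option #1 mapping (the names are made up, it only shows the arithmetic), assuming the chunk is a linear sequence of 4k slots striped over N disks (slot s -> disk s % N, offset (s / N) * 4k) and a parity slot follows every N-1 data slots of an extent:

/* Sketch only: option #1 logical -> physical mapping. */
#include <stdio.h>

#define BLOCK_SIZE	4096ULL

struct phys_block {
	unsigned int disk;		/* 0-based disk index */
	unsigned long long offset;	/* byte offset on that disk */
};

/*
 * Map logical data block 'y' of an extent to its physical location.
 * 'extent_start_slot' is the chunk-global slot of the extent's first
 * block, 'ndisks' the number of disks in the chunk.
 */
static struct phys_block map_extent_block(unsigned long long extent_start_slot,
					  unsigned int ndisks,
					  unsigned long long y)
{
	/* parity slots already interleaved before data block y */
	unsigned long long parity_before = y / (ndisks - 1);
	unsigned long long slot = extent_start_slot + y + parity_before;
	struct phys_block pb = {
		.disk = (unsigned int)(slot % ndisks),
		.offset = (slot / ndisks) * BLOCK_SIZE,
	};

	return pb;
}

int main(void)
{
	/*
	 * Reproduce E3,2 from the option #1 diagram: 4 disks, extent 3
	 * starts at chunk slot 10, logical block 2 -> D1, 4th row.
	 */
	struct phys_block pb = map_extent_block(10, 4, 2);

	printf("disk D%u, offset %llu (row %llu)\n",
	       pb.disk + 1, pb.offset, pb.offset / BLOCK_SIZE);
	return 0;
}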

Best
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
  2025-11-28 19:49 ` Goffredo Baroncelli
@ 2025-11-28 20:10   ` Qu Wenruo
  2025-11-28 20:21     ` Goffredo Baroncelli
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2025-11-28 20:10 UTC (permalink / raw)
  To: kreijack, Qu Wenruo; +Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, zfs-devel



On 2025/11/29 06:19, Goffredo Baroncelli wrote:
> On 28/11/2025 04.07, Qu Wenruo wrote:
>> Hi,
>>
>> With the recent bs > ps support for btrfs, I'm wondering if it's 
>> possible to experiment some RAIDZ-like solutions to solve RAID56 
>> write-holes problems (at least for data COW cases) without traditional 
>> journal.
> 
> More than a RAIDZ-like solution (== a variable stripe size), it seems 
> that you want to use a stripe width equal to the fs block size. So you 
> can avoid the RWM cycles inside a stripe. It is an interesting idea, 
> with a little development cost, which may work quite well when the 
> number of disks * ps matches the fs bs.
> 
> In order to reduce some downside, I suggests to use a "per chunk" fs-bs
> 
>>
>> Currently my idea looks like this:
>>
>> - Fixed and much smaller stripe data length
>>    Currently the data stripe length is fixed for all btrfs RAID profiles,
>>    64K.
>>
>>    But will change to 4K (minimal and default) for RAIDZ chunks.
>>
>> - Force a larger than 4K fs block size (or data io size)
>>    And that fs block size will determine how many devices we can use for
>>    a RAIDZ chunk.
>>
>>    E.g. with 32K fs block size, and 4K stripe length, we can use 8
>>    devices for data, +1 for parity.
>>    But this also means, one has to have at least 9 devices to maintain
>>    this scheme with 4K stripe length.
>>    (More is fine, less is not possible)
>>
>>
>> But there are still some uncertainty that I hope to get some feedback 
>> before starting coding on this.
>>
>> - Conflicts with raid-stripe-tree and no zoned support
>>    I know WDC is working on raid-stripe-tree feature, which will support
>>    all profiles including RAID56 for data on zoned devices.
>>
>>    And the feature can be used without zoned device.
>>
>>    Although there is never RAID56 support implemented so far.
>>
>>    Would raid-stripe-tree conflicts with this new RAIDZ idea, or it's
>>    better just wait for raid-stripe-tree?
>>
>> - Performance
>>    If our stripe length is 4K it means one fs block will be split into
>>    4K writes into each device.
>>
>>    The initial sequential write will be split into a lot of 4K sized
>>    random writes into the real disks.
>>
>>    Not sure how much performance impact it will have, maybe it can be
>>    solved with proper blk plug?
>>
>> - Larger fs block size or larger IO size
>>    If the fs block size is larger than the 4K stripe length, it means
>>    the data checksum is calculated for the whole fs block, and it will
>>    make rebuild much harder.
>>
>>    E.g. fs block size is 16K, stripe length is 4K, and have 4 data
>>    stripes and 1 parity stripe.
>>
>>    If one data stripe is corrupted, the checksum will mismatch for the
>>    whole 16K, but we don't know which 4K is corrupted, thus has to try
>>    4 times to get a correct rebuild result.
>>
>>    Apply this to a whole disk, then rebuild will take forever...
> 
> I am not sure about that: the checksum failure should be an exception.
> A disk failure is more common. But it this case, the parity should be 
> enough
> to rebuild correctly the data and in the most case the checksum will be 
> correct.

Well, there will definitely be some crazy corner cases jumping out of 
the bushes, like someone copying just a super block onto a completely 
blank disk and letting btrfs try to rebuild it.

And that's not even mentioning RAID6...

> 
>>
>>    But this only requires extra rebuild mechanism for RAID chunks.
>>
>>
>>    The other solution is to introduce another size limit, maybe something
>>    like data_io_size, and for example using 16K data_io_size, and still
>>    4K fs block size, with the same 4K stripe length.
>>
>>    So that every writes will be aligned to that 16K (a single 4K write
>>    will dirty the whole 16K range). And checksum will be calculated for
>>    each 4K block.
>>
>>    Then reading the 16K we verify every 4K block, and can detect which
>>    block is corrupted and just repair that block.
>>
>>    The cost will be the extra space spent saving 4x data checksum, and
>>    the extra data_io_size related code.
> 
> I am not sure about the assumption that the BS must be equal to 
> 4k*(ndisk-1).
> 
> This is an upper limit, but you could have different mapping. E.g. 
> another valid
> example is having BS=4k*(ndisk/2-2). But I think that even more strange 
> arrangement
> can be done, like:
>      ndisk = 7
>      BS=4k*3

At least for btrfs the block size must be a power of 2, and the whole 
fs must follow the same block size.
So a 12K block size is not going to exist.

We can slightly adjust the stripe length of each chunk, but I'd rather 
not do so until I have an RFC version.

> 
> so the 2nd stripe is in two different rows:
> 
> 
>               D1     D2     D2     D4     D5     D6     D7
>             ------ ------ ------ ------ ------ ------ ------
>               B1     B1     B1     P1     B2     B2     B2
>               P2     B3 ....
> 
> What you really need is that:
> 1) bs=stripe width <= (ndisk - parity-level)* 4k
> 2) each bs is never updated in the middle (which would create a new RWM 
> cycle)
> 
>>
>>
>> - Way more rigid device number requirement
>>    Everything must be decided at mkfs time, the stripe length, fs block
>>    size/data io size, and number of devices.
> 
> As wrote above, I suggests to use a "per chunk" fs-bs

As mentioned, bs must be per-fs, or writes cannot be guaranteed to be 
bs aligned.

Thanks,
Qu

> 
>>    Sure one can still add more devices than required, but it will just
>>    behave like more disks with RAID1.
>>    Each RAIDZ chunk will have fixed amount of devices.
>>
>>    And furthermore, one can no longer remove devices below the minimal
>>    amount required by the RAIDZ chunks.
>>    If going with 16K blocksize/data io size, 4K stripe length, then it
>>    will always require 5 disks for RAIDZ1.
>>    Unless the end user gets rid of all RAIDZ chunks (e.g. convert
>>    to regular RAID1* or even SINGLE).
>>
>> - Larger fs block size/data io size means higher write amplification
>>    That's the most obvious part, and may be less obvious higher memory
>>    pressure, and btrfs is already pretty bad at write-amplification.
>>
> 
> This is true, but you avoid the RWM cycle which is also expensive.
> 
>>    Currently page cache is relying on larger folios to handle those
>>    bs > ps cases, requiring more contiguous physical memory space.
>>
>>    And this limit will not go away even the end user choose to get
>>    rid of all RAIDZ chunks.
>>
>>
>> So any feedback is appreciated, no matter from end users, or even ZFS 
>> developers who invented RAIDZ in the first place.
>>
>> Thanks,
>> Qu
>>
> 
> Let me to add a "my" proposal (which is completely unrelated to your 
> one :-)
> 
> 
> Assumptions:
> 
> - an extent is never update (true for BTRFS)
> - in the example below it is showed a raid5 case; but it can be easily 
> extend for higher redundancy level
> 
> Nomenclature:
> - N = disks count
> - stride = number of consecutive block in a disk, before jumping to 
> other disks
> - stripe = stride * (N - 1)   # -1 is for raid5, -2 in case of raid6 ...
> 
> Idea design:
> 
> - the redundancy is put inside the extent (and not below). Think it like 
> a new kind of compression.
> 
> - a new chunk type is created composed by a sequence of blocks (4k ?) 
> spread on the disks, where the 1st block is disk1 - offset 0,  2nd 
> block is disk2 - offset 0 .... Nth block is disk N, offset 0, (N+1)th 
> block is placed at disk1, offset +4K.... Like raid 0 with stride 4k.
> 
> - option #1 (simpler)
> 
>      - when an extent is created, every (N-1) blocks a parity block is 
> stored; if the extent is shorter than N-1, a parity block is attached at 
> its end;
> 
>               D1     D2     D2     D4
>             ------ ------ ------ ------
>              E1,0   E1,1   P1,0   E2,0
>              E2,1   E2,2   P2,1   E2,3
>              E2,4   P2,1   E3,0   E3,1
>              E3,2   P3,0   E3,3   E3,4
>              E3,5   P3,1   E3,6   E3,7
>              E3,8   P3,2   E3,9   E3,10
>              P3,3
> 
>         Dz      Disk #z
>         Ex,y    Extent x, offset y
>         Px,y    Parity, extent x, range [y*N...y*N+N-1]
> 
> 
> - option #2 (more complex)
> 
>      - like above when an extent is created, every (N-1) blocks a parity 
> block is stored; if the extent is shorter than N-1, a parity block is 
> attached at its end;
>        The idea is that if an extent spans more than a rows, the logical 
> block can be arranged so the stride may be longer (comparable with the 
> number of the rows).
>        In this way you can write more *consecutive* 4K block a time 
> (when enough data to write is available). In this case is crucial the 
> delayed block allocation.
>        See E2,{0,1} and E3,{0,3},E3,{4,7}, E3,{8,10}....
> 
>               D1     D2     D2     D4
>             ------ ------ ------ ------
>              E1,0   E1,1   P1,0   E2,0
>              E2,1   E2,3   P2,1   E2,4
>              E2,2   P2,1   E3,0   E3,4
>              E3,8   P3,0   E3,1   E3,5
>              E3,9   P3,1   E3,2   E3,6
>              E3,10  P3,2   E3,3   E3,7
>              P3,3
> 
>         Dz      Disk #z
>         Ex,y    Extent x, offset y
>         Px,y    Parity, extent x, range row related
> 
> 
> 
> Pros:
> - no update in the middle of a stripe with so no RWM cycles anymore
> - (option 2 only), in case a large write, consecutive blocks can be 
> arranged in the same disk
> - each block can have its checksum
> - each stripe can have different raid level
> - maximum flexibility to change the number of disks
> 
> Cons:
> - the scrub logic must be totally redesigned
> - the map logical block <-> physical block in option#1 is not complex to 
> compute. However in option#2 it will be ... funny to find a good algorithm.
> - the ratio data-blocks/parity-blocks may be very inefficient for small 
> write.
> - moving an extent between different block groups with different number 
> of disks, would cause to reallocate the parity blocks inside the extent
> 
> Best
> G.Baroncelli
> 



* Re: Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
  2025-11-28 20:10   ` Qu Wenruo
@ 2025-11-28 20:21     ` Goffredo Baroncelli
  2025-11-28 20:30       ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Goffredo Baroncelli @ 2025-11-28 20:21 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo
  Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, zfs-devel

On 28/11/2025 21.10, Qu Wenruo wrote:
[...]

>>>
>>> - Larger fs block size or larger IO size
>>>    If the fs block size is larger than the 4K stripe length, it means
>>>    the data checksum is calculated for the whole fs block, and it will
>>>    make rebuild much harder.
>>>
>>>    E.g. fs block size is 16K, stripe length is 4K, and have 4 data
>>>    stripes and 1 parity stripe.
>>>
>>>    If one data stripe is corrupted, the checksum will mismatch for the
>>>    whole 16K, but we don't know which 4K is corrupted, thus has to try
>>>    4 times to get a correct rebuild result.
>>>
>>>    Apply this to a whole disk, then rebuild will take forever...
>>
>> I am not sure about that: the checksum failure should be an exception.
>> A disk failure is more common. But it this case, the parity should be enough
>> to rebuild correctly the data and in the most case the checksum will be correct.
> 
> Well, there will definitely be some crazy corner cases jumping out of the bush, like someone just copy a super block into a completely blank disk, and let btrfs try to rebuild it.
> 
> And not to mention RAID6...

Increasing the logic complexity increases the number of corner cases where the computation cost explodes. However, for a standard user it should not be problematic.

> 
>>
>>>
>>>    But this only requires extra rebuild mechanism for RAID chunks.
>>>
>>>
>>>    The other solution is to introduce another size limit, maybe something
>>>    like data_io_size, and for example using 16K data_io_size, and still
>>>    4K fs block size, with the same 4K stripe length.
>>>
>>>    So that every writes will be aligned to that 16K (a single 4K write
>>>    will dirty the whole 16K range). And checksum will be calculated for
>>>    each 4K block.
>>>
>>>    Then reading the 16K we verify every 4K block, and can detect which
>>>    block is corrupted and just repair that block.
>>>
>>>    The cost will be the extra space spent saving 4x data checksum, and
>>>    the extra data_io_size related code.
>>
>> I am not sure about the assumption that the BS must be equal to 4k*(ndisk-1).
>>
>> This is an upper limit, but you could have different mapping. E.g. another valid
>> example is having BS=4k*(ndisk/2-2). But I think that even more strange arrangement
>> can be done, like:
>>      ndisk = 7
>>      BS=4k*3
> 
> At least for btrfs block size must be power of 2, and the whole fs must follow the same block size.
> So 12K block size is not going to exist.
> 
> We can slightly adjust the stripe length of each chunk, but I tend to not do so until I got an RFC version.
> 
>>
>> so the 2nd stripe is in two different rows:
>>
>>
>>               D1     D2     D2     D4     D5     D6     D7
>>             ------ ------ ------ ------ ------ ------ ------
>>               B1     B1     B1     P1     B2     B2     B2
>>               P2     B3 ....
>>
>> What you really need is that:
>> 1) bs=stripe width <= (ndisk - parity-level)* 4k
>> 2) each bs is never updated in the middle (which would create a new RWM cycle)
>>
>>>
>>>
>>> - Way more rigid device number requirement
>>>    Everything must be decided at mkfs time, the stripe length, fs block
>>>    size/data io size, and number of devices.
>>
>> As wrote above, I suggests to use a "per chunk" fs-bs
> 
> As mentioned, bs must be per-fs, or writes can not be ensured to be bs aligned.


Ok... try to see it from another POV: maybe it would be enough for the allocator ... to allocate space for extents in a specific multiple of ps?
This, plus the fact that an extent is immutable, should be enough...

> 
> Thanks,
> Qu
> 
>>
>>>    Sure one can still add more devices than required, but it will just
>>>    behave like more disks with RAID1.
>>>    Each RAIDZ chunk will have fixed amount of devices.
>>>
>>>    And furthermore, one can no longer remove devices below the minimal
>>>    amount required by the RAIDZ chunks.
>>>    If going with 16K blocksize/data io size, 4K stripe length, then it
>>>    will always require 5 disks for RAIDZ1.
>>>    Unless the end user gets rid of all RAIDZ chunks (e.g. convert
>>>    to regular RAID1* or even SINGLE).
>>>
>>> - Larger fs block size/data io size means higher write amplification
>>>    That's the most obvious part, and may be less obvious higher memory
>>>    pressure, and btrfs is already pretty bad at write-amplification.
>>>
>>
>> This is true, but you avoid the RWM cycle which is also expensive.
>>
>>>    Currently page cache is relying on larger folios to handle those
>>>    bs > ps cases, requiring more contiguous physical memory space.
>>>
>>>    And this limit will not go away even the end user choose to get
>>>    rid of all RAIDZ chunks.
>>>
>>>
>>> So any feedback is appreciated, no matter from end users, or even ZFS developers who invented RAIDZ in the first place.
>>>
>>> Thanks,
>>> Qu
>>>
>>
>> Let me to add a "my" proposal (which is completely unrelated to your one :-)
>>
>>
>> Assumptions:
>>
>> - an extent is never update (true for BTRFS)
>> - in the example below it is showed a raid5 case; but it can be easily extend for higher redundancy level
>>
>> Nomenclature:
>> - N = disks count
>> - stride = number of consecutive block in a disk, before jumping to other disks
>> - stripe = stride * (N - 1)   # -1 is for raid5, -2 in case of raid6 ...
>>
>> Idea design:
>>
>> - the redundancy is put inside the extent (and not below). Think it like a new kind of compression.
>>
>> - a new chunk type is created composed by a sequence of blocks (4k ?) spread on the disks, where the 1st block is disk1 - offset 0,  2nd block is disk2 - offset 0 .... Nth block is disk N, offset 0, (N+1)th block is placed at disk1, offset +4K.... Like raid 0 with stride 4k.
>>
>> - option #1 (simpler)
>>
>>      - when an extent is created, every (N-1) blocks a parity block is stored; if the extent is shorter than N-1, a parity block is attached at its end;
>>
>>               D1     D2     D2     D4
>>             ------ ------ ------ ------
>>              E1,0   E1,1   P1,0   E2,0
>>              E2,1   E2,2   P2,1   E2,3
>>              E2,4   P2,1   E3,0   E3,1
>>              E3,2   P3,0   E3,3   E3,4
>>              E3,5   P3,1   E3,6   E3,7
>>              E3,8   P3,2   E3,9   E3,10
>>              P3,3
>>
>>         Dz      Disk #z
>>         Ex,y    Extent x, offset y
>>         Px,y    Parity, extent x, range [y*N...y*N+N-1]
>>
>>
>> - option #2 (more complex)
>>
>>      - like above when an extent is created, every (N-1) blocks a parity block is stored; if the extent is shorter than N-1, a parity block is attached at its end;
>>        The idea is that if an extent spans more than a rows, the logical block can be arranged so the stride may be longer (comparable with the number of the rows).
>>        In this way you can write more *consecutive* 4K block a time (when enough data to write is available). In this case is crucial the delayed block allocation.
>>        See E2,{0,1} and E3,{0,3},E3,{4,7}, E3,{8,10}....
>>
>>               D1     D2     D2     D4
>>             ------ ------ ------ ------
>>              E1,0   E1,1   P1,0   E2,0
>>              E2,1   E2,3   P2,1   E2,4
>>              E2,2   P2,1   E3,0   E3,4
>>              E3,8   P3,0   E3,1   E3,5
>>              E3,9   P3,1   E3,2   E3,6
>>              E3,10  P3,2   E3,3   E3,7
>>              P3,3
>>
>>         Dz      Disk #z
>>         Ex,y    Extent x, offset y
>>         Px,y    Parity, extent x, range row related
>>
>>
>>
>> Pros:
>> - no update in the middle of a stripe with so no RWM cycles anymore
>> - (option 2 only), in case a large write, consecutive blocks can be arranged in the same disk
>> - each block can have its checksum
>> - each stripe can have different raid level
>> - maximum flexibility to change the number of disks
>>
>> Cons:
>> - the scrub logic must be totally redesigned
>> - the map logical block <-> physical block in option#1 is not complex to compute. However in option#2 it will be ... funny to find a good algorithm.
>> - the ratio data-blocks/parity-blocks may be very inefficient for small write.
>> - moving an extent between different block groups with different number of disks, would cause to reallocate the parity blocks inside the extent
>>
>> Best
>> G.Baroncelli
>>
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


* Re: Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
  2025-11-28 20:21     ` Goffredo Baroncelli
@ 2025-11-28 20:30       ` Qu Wenruo
  0 siblings, 0 replies; 5+ messages in thread
From: Qu Wenruo @ 2025-11-28 20:30 UTC (permalink / raw)
  To: kreijack, Qu Wenruo; +Cc: linux-fsdevel@vger.kernel.org, linux-btrfs, zfs-devel



On 2025/11/29 06:51, Goffredo Baroncelli wrote:
> On 28/11/2025 21.10, Qu Wenruo wrote:
[...]
>>> As wrote above, I suggests to use a "per chunk" fs-bs
>>
>> As mentioned, bs must be per-fs, or writes can not be ensured to be bs 
>> aligned.
> 
> 
> Ok.. try to see from another POV: may be that it would be enough that 
> the allocator  ... allocates space for extent with a specific multiple 
> of ps ?
> This and the fact that an extent is immutable, should be enough...

Nope, things like relocation can easily break the immutability assumption.

That's why I don't think it's even possible to implement a per-chunk 
stripe length, as that would make an extent that is perfectly bs 
aligned in one chunk become unaligned in another chunk.

Thanks,
Qu

