From: Qu Wenruo <wqu@suse.com>
To: linux-btrfs <linux-btrfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	zfs-devel@list.zfsonlinux.org
Subject: Ideas for RAIDZ-like design to solve write-holes, with larger fs block size
Date: Fri, 28 Nov 2025 13:37:29 +1030
Message-ID: <4c0c1d27-957c-4a6f-9397-47ca321b1805@suse.com>

Hi,

With the recent bs > ps support for btrfs, I'm wondering if it's 
possible to experiment with RAIDZ-like solutions to address the RAID56 
write-hole problem (at least for data COW cases) without a traditional 
journal.

Currently my idea looks like this:

- Fixed and much smaller stripe data length
   Currently the data stripe length is fixed at 64K for all btrfs RAID
   profiles.

   For RAIDZ chunks it would change to 4K (minimal and default).

- Force a larger than 4K fs block size (or data io size)
   That fs block size will determine how many devices we can use for
   a RAIDZ chunk.

   E.g. with a 32K fs block size and a 4K stripe length, we can use 8
   devices for data, +1 for parity.
   But this also means one has to have at least 9 devices to maintain
   this scheme with a 4K stripe length.
   (More is fine, less is not possible; see the small arithmetic
   sketch right after this list.)
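
To make the arithmetic concrete, here is a tiny stand-alone C sketch
(the names are made up for illustration, this is not existing btrfs
code):

  #include <stdio.h>

  int main(void)
  {
      unsigned int stripe_len    = 4096;   /* fixed per-device stripe length */
      unsigned int fs_block_size = 32768;  /* chosen at mkfs time            */
      unsigned int nr_parity     = 1;      /* RAIDZ1-like, one parity device */

      unsigned int data_stripes = fs_block_size / stripe_len;   /* 8 */
      unsigned int min_devices  = data_stripes + nr_parity;     /* 9 */

      printf("data stripes: %u, minimum devices: %u\n",
             data_stripes, min_devices);
      return 0;
  }

With a 16K fs block size the same formula gives 4 + 1 = 5 devices,
which is the number used in the later examples.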


But there are still some uncertainties that I hope to get feedback on 
before starting to code this.

- Conflicts with raid-stripe-tree and no zoned support
   I know WDC is working on the raid-stripe-tree feature, which will
   support all profiles, including RAID56, for data on zoned devices.

   The feature can also be used without zoned devices.

   Although RAID56 support has not been implemented there so far.

   Would raid-stripe-tree conflict with this new RAIDZ idea, or would
   it be better to just wait for raid-stripe-tree?

- Performance
   If the stripe length is 4K, one fs block will be split into 4K
   writes to each device.

   An initially sequential write will thus be split into a lot of
   4K sized random writes to the real disks.

   Not sure how much performance impact this will have; maybe it can
   be mitigated with proper blk plug usage (see the sketch below)?
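
   Purely as a sketch of what I mean by plugging (not existing btrfs
   code; the helper name and the bio array are made up): hold one plug
   across a whole batch of fs blocks, so that the adjacent 4K writes
   each device receives from consecutive fs blocks can be merged by
   the block layer before the plug is released.

  #include <linux/blkdev.h>
  #include <linux/bio.h>

  /* Submit all per-device 4K stripe bios of one extent under one plug. */
  static void submit_raidz_extent(struct bio **bios, int nr_bios)
  {
      struct blk_plug plug;
      int i;

      blk_start_plug(&plug);
      for (i = 0; i < nr_bios; i++)
          submit_bio(bios[i]);
      blk_finish_plug(&plug);
  }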

- Larger fs block size or larger IO size
   If the fs block size is larger than the 4K stripe length, the data
   checksum is calculated over the whole fs block, and that will make
   rebuild much harder.

   E.g. with a 16K fs block size, 4K stripe length, 4 data stripes and
   1 parity stripe:

   If one data stripe is corrupted, the checksum will mismatch for the
   whole 16K, but we don't know which 4K is corrupted, thus we have to
   try up to 4 candidate rebuilds to find the correct result.

   Apply this to a whole disk, and rebuild will take forever...

   But this only requires an extra rebuild mechanism for RAIDZ chunks.


   The other solution is to introduce another size limit, maybe
   something like data_io_size: for example a 16K data_io_size with
   the regular 4K fs block size and the same 4K stripe length.

   Then every write will be aligned to that 16K (a single 4K write
   will dirty the whole 16K range), and a checksum will be calculated
   for each 4K block.

   When reading the 16K we verify every 4K block, so we can detect
   exactly which block is corrupted and repair just that block.

   The cost is the extra space spent on 4x the data checksums, and the
   extra data_io_size related code.
   (A user-space sketch of the rebuild difference follows below.)
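
   Below is a minimal user-space C sketch of the first (whole-block
   checksum) case, assuming simple XOR parity and a placeholder
   checksum instead of crc32c/xxhash; all names are made up for
   illustration.  With per-4K checksums the retry loop disappears,
   because the failing stripe is identified directly.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define STRIPE_LEN   4096          /* per-device stripe length        */
  #define DATA_STRIPES 4             /* 16K fs block / 4K stripe length */

  /* Placeholder checksum standing in for btrfs crc32c/xxhash. */
  static uint32_t csum(const uint8_t *buf, size_t len)
  {
      uint32_t c = 0;

      while (len--)
          c = c * 31 + *buf++;
      return c;
  }

  /* XOR-reconstruct data stripe 'bad' from parity and the other stripes. */
  static void rebuild_stripe(uint8_t stripes[][STRIPE_LEN],
                             const uint8_t *parity, int bad)
  {
      memcpy(stripes[bad], parity, STRIPE_LEN);
      for (int i = 0; i < DATA_STRIPES; i++) {
          if (i == bad)
              continue;
          for (int b = 0; b < STRIPE_LEN; b++)
              stripes[bad][b] ^= stripes[i][b];
      }
  }

  /*
   * With a single checksum over the whole 16K block we do not know
   * which 4K stripe is bad, so every candidate has to be rebuilt and
   * the full block re-verified until one combination matches.
   */
  static int repair_block(uint8_t stripes[][STRIPE_LEN],
                          const uint8_t *parity, uint32_t expected)
  {
      uint8_t saved[STRIPE_LEN];

      for (int cand = 0; cand < DATA_STRIPES; cand++) {
          memcpy(saved, stripes[cand], STRIPE_LEN);
          rebuild_stripe(stripes, parity, cand);
          if (csum((const uint8_t *)stripes,
                   DATA_STRIPES * STRIPE_LEN) == expected)
              return cand;                          /* found the bad stripe */
          memcpy(stripes[cand], saved, STRIPE_LEN); /* undo, try the next   */
      }
      return -1;                                    /* more than one bad    */
  }

  int main(void)
  {
      static uint8_t stripes[DATA_STRIPES][STRIPE_LEN];
      static uint8_t parity[STRIPE_LEN];

      /* Fill the 16K block with some data and compute XOR parity. */
      for (int i = 0; i < DATA_STRIPES; i++)
          for (int b = 0; b < STRIPE_LEN; b++) {
              stripes[i][b] = (uint8_t)(i * 7 + b);
              parity[b] ^= stripes[i][b];
          }

      uint32_t expected = csum((const uint8_t *)stripes,
                               DATA_STRIPES * STRIPE_LEN);

      stripes[2][100] ^= 0xff;          /* silent corruption in stripe 2 */

      printf("repaired data stripe %d\n",
             repair_block(stripes, parity, expected));
      return 0;
  }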


- Way more rigid device number requirement
   Everything must be decided at mkfs time: the stripe length, the fs
   block size / data io size, and the number of devices.

   Sure, one can still add more devices than required, but that will
   just behave like adding more disks to a RAID1 array.
   Each RAIDZ chunk will have a fixed number of devices.

   Furthermore, one can no longer remove devices below the minimum
   amount required by the RAIDZ chunks.
   Going with a 16K block size/data io size and 4K stripe length, it
   will always require 5 disks for RAIDZ1, unless the end user gets
   rid of all RAIDZ chunks (e.g. converts to regular RAID1* or even
   SINGLE).  (A small device-remove check sketch follows below.)
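
   As a tiny illustration of that constraint (names are made up, not
   existing btrfs code), a device-remove check could look like:

  #include <stdbool.h>

  /*
   * Refuse to drop below the minimum device count required by an
   * existing RAIDZ chunk, e.g. 16K / 4K + 1 parity = 5 devices.
   */
  static bool can_remove_device(unsigned int num_devices,
                                unsigned int fs_block_size,
                                unsigned int stripe_len,
                                unsigned int nr_parity)
  {
      unsigned int min_devices = fs_block_size / stripe_len + nr_parity;

      return num_devices - 1 >= min_devices;
  }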

- Larger fs block size/data io size means higher write amplification
   That's the most obvious part; what may be less obvious is the
   higher memory pressure, and btrfs is already pretty bad at write
   amplification.

   Currently the page cache relies on large folios to handle the
   bs > ps cases, requiring more contiguous physical memory.

   And this limitation will not go away even if the end user chooses
   to get rid of all RAIDZ chunks.


So any feedback is appreciated, whether from end users or even from 
the ZFS developers who invented RAIDZ in the first place.

Thanks,
Qu
