Linux Btrfs filesystem development
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Forza <forza@tnonline.net>, Qu Wenruo <wqu@suse.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
Date: Sat, 14 May 2022 06:58:11 +0800	[thread overview]
Message-ID: <aa64c204-2ae7-3a85-73c6-bb5f14b9a3c0@gmx.com> (raw)
In-Reply-To: <b49e00b.dcea448a.180bdfc2a51@tnonline.net>



On 2022/5/13 23:14, Forza wrote:
> Hi,
>
> ---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----
>
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Great to see work being done on the RAID56 parts of Btrfs. :)
>
> I am just a user of btrfs and don't have the full understanding of the internals, but it makes me a little curious that we choose to use journals and RMW instead of a CoW solution to solve the write hole.

In fact, Johannes from WDC is already working on a pure CoW based
solution, called stripe tree.

The idea there is to introduce a new layer of mapping.

With that stripe tree, inside one chunk, the logical bytenr is no longer
directly mapped to a physical location, but can be dynamically mapped to
any physical location inside the chunk range.

So previously, if we have a RAID56 chunk with 3 disks, it looks like this:

     Logical bytenr X		X + 64K		X + 128K
		   |            |		|

Then we have the following on-disk mapping:

   [X, X + 64K):          Devid 1    Physical Y1
   [X + 64K, X + 128K):   Devid 2    Physical Y2
   Parity for above:      Devid 3    Physical Y3

So if we just write 64K into logical bytenr X, we need to read out the
data at logical bytenr [X + 64K, X + 128K), then calculate the parity and
write it into devid 3 at physical offset Y3.
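
To make the fixed-mapping case concrete, here is a minimal sketch in Python
(not btrfs code). The `disks` dict and the function names are hypothetical,
used only for illustration; the one thing taken as given is that RAID5
parity is a plain XOR of the data stripes.

```python
STRIPE_LEN = 64 * 1024  # 64K per-device stripe, as in the example above


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length buffers (RAID5 parity is a plain XOR)."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_write_first_stripe(disks: dict, new_data: bytes) -> None:
    """Sub-stripe write to [X, X + 64K) with a fixed 1:1 mapping.

    `disks` is a hypothetical model: devid -> the 64K block stored at the
    fixed physical offset (Y1, Y2, Y3) of that device.
    """
    other = disks[2]                     # read [X + 64K, X + 128K) from devid 2
    parity = xor_bytes(new_data, other)  # recalculate the parity
    disks[1] = new_data                  # overwrite the data in place at Y1
    disks[3] = parity                    # overwrite the parity in place at Y3


disks = {
    1: bytes(STRIPE_LEN),        # [X, X + 64K), initially zero
    2: b"\x22" * STRIPE_LEN,     # [X + 64K, X + 128K)
    3: b"\x22" * STRIPE_LEN,     # parity of the two blocks above
}
rmw_write_first_stripe(disks, b"\x11" * STRIPE_LEN)
```

Both overwrites land on fixed locations and they are not atomic with respect
to each other, which is where the write hole comes from.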

But with the new stripe tree, we can map [X, X + 64K) to any location
on devid 1.
The same goes for [X + 64K, X + 128K) and the parity.

Then when we write data into logical bytenr [X, X + 64K), we just
find a free 64K range in the stripe tree of devid 1, and check if we have
mapped [X + 64K, X + 128K) in the stripe tree.

a) Mapped

If we have [X + 64K, X + 128K) mapped, then we read that range out,
update our parity stripe, and write the parity stripe into some newer
location (CoW), then free up the old stripe.

b) Not mapped

This means no data has been written into that range, thus it is all
zero. We calculate the parity with all zero, then find a new location for
the parity on devid 3, write the newly calculated parity and insert a
mapping for the new parity location.
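
The two cases can be sketched with a toy model in Python. This is only my
illustration of the idea, not the actual stripe tree design: `Device`,
`StripeTreeModel`, the slot names and the append-only allocator are all
hypothetical, and the real on-disk format may look nothing like this.

```python
STRIPE_LEN = 64 * 1024  # 64K per-device stripe, as in the example above


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class Device:
    """Hypothetical device: a trivial allocator plus a block store."""

    def __init__(self):
        self.blocks = {}      # physical offset -> 64K block
        self.next_free = 0    # next never-used physical offset

    def write_new(self, data: bytes) -> int:
        phys = self.next_free
        self.next_free += STRIPE_LEN
        self.blocks[phys] = data
        return phys

    def free(self, phys: int) -> None:
        del self.blocks[phys]


class StripeTreeModel:
    """Toy model of one 3-disk chunk: two data slots plus one parity slot."""

    def __init__(self):
        self.devs = {1: Device(), 2: Device(), 3: Device()}
        self.mapping = {}     # slot name -> (devid, physical offset)

    def read_slot(self, slot: str) -> bytes:
        if slot not in self.mapping:
            return bytes(STRIPE_LEN)   # case b): not mapped, treat as zero
        devid, phys = self.mapping[slot]
        return self.devs[devid].blocks[phys]

    def _remap(self, slot: str, devid: int, data: bytes) -> None:
        old = self.mapping.get(slot)
        # CoW: write to a new location first, then switch the mapping,
        # and only then free the old location.
        self.mapping[slot] = (devid, self.devs[devid].write_new(data))
        if old is not None:
            self.devs[old[0]].free(old[1])

    def write_d0(self, data: bytes) -> None:
        """Write [X, X + 64K); "d1" stands for [X + 64K, X + 128K)."""
        other = self.read_slot("d1")   # case a) real data, case b) zeros
        self._remap("d0", 1, data)
        self._remap("parity", 3, xor_bytes(data, other))


model = StripeTreeModel()
model.write_d0(b"\x11" * STRIPE_LEN)     # case b): "d1" is not mapped yet
first_parity = model.mapping["parity"]
model.write_d0(b"\x33" * STRIPE_LEN)     # parity moves to a new location
```

The point of the model is that neither the data nor the parity is ever
overwritten in place, so an interrupted write leaves the old mapping intact.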


By this, we in fact decouple the 1:1 mapping for RAID56, and get way
more flexibility.
However, this idea no longer follows the strict rotation of RAID5, so
it's a middle ground between RAID4 and RAID5.


This brilliant idea was introduced mostly to support different chunk
profiles for zoned devices, but Johannes is working on enabling this for
non-zoned devices too.



Then you may ask why I'm still pushing this far more traditional RAID56J
solution. The reasons are:

- Complexity
   The stripe tree is flexible, thus more complex.
   And AFAIK it will affect all chunk types, not only RAID56.
   Thus it can be more challenging.

- Currently relies on zoned unit to split extents/stripes
   Thus I believe Johannes can solve it without any problems.

- I just want a valid way to steal code from dm/md guys :)

Don't get me wrong, I totally believe stripe tree can be the silver
bullet, but it doesn't prevent us from exploring some different (and more
traditional) ways.

>
>
> Since we need on-disk changes to implement it, could it not be better to rethink the raid56 modes and implement a solution with full CoW, such as variable stripe extents etc? It is likely much more work, but could have better performance because it avoids double writes and RMW cycles too.

Well, the journal will have optimizations, e.g. full stripe doesn't need
to journal its data.

I'll learn (steal code) from dm/md to implement the code.


But there are problems inherent to RAID56, affecting dm/md raid56 too.

Like bitrot in one data stripe, while we're writing data into the other
data stripe.
Then RMW will read out the bad data stripe, calculate the parity, and make
the bit rot permanent.

The destructive RMW will not be detected in traditional raid56 with
traditional fs, but can be detected by btrfs.
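
A small Python sketch of the destructive RMW case, under some assumptions:
the stripes are shrunk to 16 bytes for readability, `zlib.crc32` only stands
in for a data checksum (it is not the checksum btrfs uses), and all names
are hypothetical.

```python
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


# Healthy full stripe: two data stripes plus XOR parity.
d0 = b"A" * 16
d1 = b"B" * 16
parity = xor_bytes(d0, d1)
d1_csum = zlib.crc32(d1)       # checksum recorded when d1 was written

# Bitrot silently flips the first byte of d1 on disk.
d1_rotten = b"b" + d1[1:]

# Before any RMW, the good d1 can still be rebuilt from d0 and the parity.
recoverable_before = xor_bytes(d0, parity) == d1

# RMW for a write to d0: read the (rotten) d1 and recalculate the parity.
new_d0 = b"C" * 16
parity = xor_bytes(new_d0, d1_rotten)

# Now the parity matches the rotten data, so rebuilding d1 returns the
# rotten content and the good copy is gone for good.
rebuilt = xor_bytes(new_d0, parity)
recoverable_after = rebuilt == d1

# A filesystem holding a checksum for d1 can at least notice the mismatch.
detected = zlib.crc32(d1_rotten) != d1_csum
```

Without the checksum, the parity and both data stripes look perfectly
consistent after the RMW, so a traditional setup has nothing to compare
against.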

Thus after the RAID56J project, I'll take more time on that destructive
RMW problem.

Thanks,
Qu

>
> Thanks
>
> Forza
>
