From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Johannes Thumshirn <johannes.thumshirn@wdc.com>,
	linux-btrfs@vger.kernel.org
Subject: RAID56 discussion related to RST. (Was "Re: [RFC ONLY 0/8] btrfs: introduce raid-stripe-tree")
Date: Wed, 13 Jul 2022 18:54:40 +0800
Message-ID: <78daa7e4-7c88-d6c0-ccaa-fb148baf7bc8@gmx.com>
In-Reply-To: <cover.1652711187.git.johannes.thumshirn@wdc.com>



On 2022/5/16 22:31, Johannes Thumshirn wrote:
> Introduce a raid-stripe-tree to record writes in a RAID environment.
>
> In essence this adds another address translation layer between the logical
> and the physical addresses in btrfs and is designed to close two gaps. The
> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
> second one is the inability of doing RAID with zoned block devices due to the
> constraints we have with REQ_OP_ZONE_APPEND writes.

Here I want to discuss something related to RAID56 and RST.

One of my long-standing concerns is that P/Q stripes have a higher update
frequency, so with certain transaction commit/data writeback timing,
wouldn't the device storing P/Q stripes run out of space before
the data stripe devices?

One example is like this: we have a 3-disk RAID5, with RST and the zoned
allocator (allocated logical bytenr can only go forward):

	0		32K		64K
Disk 1	|                               | (data stripe)
Disk 2	|                               | (data stripe)
Disk 3	|                               | (parity stripe)

Initially, all the zones on those disks are empty, and their write
pointers are all at the beginning of the zone. (all data)

Then we write 0~4K in the range, and writeback happens immediately (can
be DIO or sync).

We need to write the 0~4K back to disk 1, and update P for that vertical
stripe, right? So we get:

	0		32K		64K
Disk 1	|X                              | (data stripe)
Disk 2	|                               | (data stripe)
Disk 3	|X                              | (parity stripe)

Then we write into the 4~8K range, and sync immediately.

If we go CoW for the P (we have to anyway), what we get is:

	0		32K		64K
Disk 1	|X                              | (data stripe)
Disk 2	|X                              | (data stripe)
Disk 3	|XX                             | (parity stripe)

So now you can see disk 3 (the zone holding parity) has its write
pointer moved 8K forward, but both data stripe zones only have their write
pointers moved 4K forward.

If we go forward like this, always 4K write and sync, we will hit the
following case eventually:

	0		32K		64K
Disk 1	|XXXXXXXXXXXXXXX                | (data stripe)
Disk 2	|XXXXXXXXXXXXXXX                | (data stripe)
Disk 3	|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| (parity stripe)

The extent allocator should still think we have 64K of free space to write,
as we have only really written 64K.

But the zone for the parity stripe is already exhausted.

How could we handle such a case?
As RAID0/1 shouldn't have such a problem at all, the imbalance is purely
caused by the fact that CoWing P/Q will cause a higher write frequency.
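
To make the arithmetic concrete, here is a rough sketch of the scenario in
Python. It is only a toy model, not btrfs code: the names (simulate,
data_wp, parity_wp) are made up for illustration, and it assumes one 64K
zone per disk, 4K writes that are each synced immediately, data writes
alternating between the two data disks as in the diagrams, and one freshly
CoWed 4K parity block per write.

```python
ZONE_SIZE = 64 * 1024   # assumed zone size per disk, matching the diagrams
BLOCK = 4 * 1024        # each write + sync covers one 4K block


def simulate(num_data_disks=2, zone_size=ZONE_SIZE, block=BLOCK):
    """Issue 4K write+sync operations until the parity zone is full.

    Toy model: every write advances one data disk's write pointer by one
    block and, because parity is CoWed, the parity write pointer as well.
    """
    data_wp = [0] * num_data_disks   # write pointer of each data zone
    parity_wp = 0                    # write pointer of the parity zone
    writes = 0
    while parity_wp + block <= zone_size:
        target = writes % num_data_disks   # assumption: alternate data disks
        if data_wp[target] + block > zone_size:
            break                          # a data zone filled up first
        data_wp[target] += block           # the new data block
        parity_wp += block                 # a new copy of P for that stripe
        writes += 1
    return writes, data_wp, parity_wp


writes, data_wp, parity_wp = simulate()
free_data = sum(ZONE_SIZE - wp for wp in data_wp)
print(f"writes issued:        {writes}")
print(f"data write pointers:  {[wp // 1024 for wp in data_wp]} (KiB)")
print(f"parity write pointer: {parity_wp // 1024} KiB of {ZONE_SIZE // 1024} KiB")
print(f"free data space:      {free_data // 1024} KiB")
```

Under these assumptions the parity zone fills after 16 writes, while each
data zone is only half used and 64K of data space still looks free to the
allocator.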

Thanks,
Qu

>
> This is an RFC/PoC only which just shows how the code will look like for a
> zoned RAID1. Its sole purpose is to facilitate design reviews and is not
> intended to be merged yet. Or if merged to be used on an actual file-system.
>
> Johannes Thumshirn (8):
>    btrfs: add raid stripe tree definitions
>    btrfs: move btrfs_io_context to volumes.h
>    btrfs: read raid-stripe-tree from disk
>    btrfs: add boilerplate code to insert raid extent
>    btrfs: add code to delete raid extent
>    btrfs: add code to read raid extent
>    btrfs: zoned: allow zoned RAID1
>    btrfs: add raid stripe tree pretty printer
>
>   fs/btrfs/Makefile               |   2 +-
>   fs/btrfs/ctree.c                |   1 +
>   fs/btrfs/ctree.h                |  29 ++++
>   fs/btrfs/disk-io.c              |  12 ++
>   fs/btrfs/extent-tree.c          |   9 ++
>   fs/btrfs/file.c                 |   1 -
>   fs/btrfs/print-tree.c           |  21 +++
>   fs/btrfs/raid-stripe-tree.c     | 251 ++++++++++++++++++++++++++++++++
>   fs/btrfs/raid-stripe-tree.h     |  39 +++++
>   fs/btrfs/volumes.c              |  44 +++++-
>   fs/btrfs/volumes.h              |  93 ++++++------
>   fs/btrfs/zoned.c                |  39 +++++
>   include/uapi/linux/btrfs.h      |   1 +
>   include/uapi/linux/btrfs_tree.h |  17 +++
>   14 files changed, 509 insertions(+), 50 deletions(-)
>   create mode 100644 fs/btrfs/raid-stripe-tree.c
>   create mode 100644 fs/btrfs/raid-stripe-tree.h
>

