public inbox for linux-btrfs@vger.kernel.org
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>,
	"linux-btrfs@vger.kernel.org" <linux-btrfs@vger.kernel.org>
Subject: Re: RAID56 discussion related to RST. (Was "Re: [RFC ONLY 0/8] btrfs: introduce raid-stripe-tree")
Date: Wed, 13 Jul 2022 20:01:02 +0800
Message-ID: <03630cb7-e637-3375-37c6-d0eb8546c958@gmx.com>
In-Reply-To: <PH0PR04MB74164213B5F136059236B78C9B899@PH0PR04MB7416.namprd04.prod.outlook.com>



On 2022/7/13 19:43, Johannes Thumshirn wrote:
> On 13.07.22 12:54, Qu Wenruo wrote:
>>
>>
>> On 2022/5/16 22:31, Johannes Thumshirn wrote:
>>> Introduce a raid-stripe-tree to record writes in a RAID environment.
>>>
>>> In essence this adds another address translation layer between the logical
>>> and the physical addresses in btrfs and is designed to close two gaps. The
>>> first is the ominous RAID-write-hole we suffer from with RAID5/6 and the
>>> second one is the inability of doing RAID with zoned block devices due to the
>>> constraints we have with REQ_OP_ZONE_APPEND writes.
>>
>> Here I want to discuss something related to RAID56 and RST.
>>
>> One of my long-standing concerns is that P/Q stripes have a higher
>> update frequency. With certain transaction commit/data writeback
>> timing, couldn't that cause the device storing the P/Q stripes to run
>> out of space before the data stripe devices do?
>
> P/Q stripes on a dedicated drive would be RAID4, which we don't have.

I'm just using one block group as an example.

Sure, the next bg can definitely go somewhere else.

But inside one bg, we are still using one zone for the bg, right?
>
>>
>> Here is an example: a 3-disk RAID5, with RST and the zoned allocator
>> (an allocated logical bytenr can only go forward):
>>
>> 	0		32K		64K
>> Disk 1	|                               | (data stripe)
>> Disk 2	|                               | (data stripe)
>> Disk 3	|                               | (parity stripe)
>>
>> And initially, all the zones on those disks are empty, and their write
>> pointers are all at the beginning of the zone. (all data)
>>
>> Then we write to the 0~4K range, and writeback happens immediately
>> (can be DIO or sync).
>>
>> We need to write the 0~4K data back to disk 1, and update P for that
>> vertical stripe, right? So we get:
>>
>> 	0		32K		64K
>> Disk 1	|X                              | (data stripe)
>> Disk 2	|                               | (data stripe)
>> Disk 3	|X                              | (parity stripe)
>>
>> Then we write into the 4~8K range, and sync immediately.
>>
>> If we CoW the P (we have to anyway), what we get is:
>>
>> 	0		32K		64K
>> Disk 1	|X                              | (data stripe)
>> Disk 2	|X                              | (data stripe)
>> Disk 3	|XX                             | (parity stripe)
>>
>> So now, you can see disk 3 (the zone holding parity) has its write
>> pointer moved 8K forward, but each data stripe zone has its write
>> pointer moved only 4K forward.
>>
>> If we go forward like this, always writing 4K and syncing, we will
>> eventually hit the following case:
>>
>> 	0		32K		64K
>> Disk 1	|XXXXXXXXXXXXXXX                | (data stripe)
>> Disk 2	|XXXXXXXXXXXXXXX                | (data stripe)
>> Disk 3	|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX| (parity stripe)
>>
>> The extent allocator should still think we have 64K of free space to
>> write, as we have only really written 64K of data.
>>
>> But the zone for the parity stripe is already exhausted.
>>
>> How could we handle such case?
>> RAID0/1 shouldn't have such a problem at all; the imbalance is purely
>> caused by the fact that CoWing P/Q results in a higher write
>> frequency.
>>
>
> Then a new zone for the parity stripe has to be allocated, and the old one
> gets reclaimed. That's nothing new. Of course there are some gotchas in the extent
> allocator and the active zone management we need to consider, but overall I do
> not see where the blocker is here.

The problem is, we cannot reclaim the existing, now-full parity zone yet.

It still holds the parity for the 32K of data above.

So that zone cannot be reclaimed until both data stripe zones are
reclaimed.

This means we can hit a case where all data stripe zones are in the
above state, and we need twice the amount of parity zones.

And in that case, I'm not sure whether our chunk allocator can handle it
properly, but at the very least our free space estimation is not
accurate.

Thanks,
Qu


