From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Christoph Hellwig <hch@lst.de>
Cc: Chris Mason <clm@fb.com>, Josef Bacik <josef@toxicpanda.com>,
David Sterba <dsterba@suse.com>,
linux-btrfs@vger.kernel.org
Subject: Re: [PATCH 8/8] btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio
Date: Tue, 24 May 2022 16:21:38 +0800
Message-ID: <b78b6c09-eb70-68a7-7e69-e8481378b968@gmx.com>
In-Reply-To: <6047f29e-966d-1bf5-6052-915c1572d07a@gmx.com>
On 2022/5/24 16:04, Qu Wenruo wrote:
>
>
> On 2022/5/24 15:32, Christoph Hellwig wrote:
>> On Mon, May 23, 2022 at 03:46:02PM +0800, Qu Wenruo wrote:
>>>> Because btrfs_repair_io_failure can't handle multiple-page I/O. It
>>>> is also rather cumbersome because it bypasses the normal bio
>>>> mapping. As a follow-on I'd rather move it over to btrfs_map_bio
>>>> with a special flag for the single-mirror parity write rather than that
>>>> hack.
>>>
>>> In fact, so far all callers of btrfs_repair_io_failure() are
>>> always handling things inside one stripe.
>>>
>>> Thus we can easily enhance that function to handle multi-page ranges.
>>>
>>> Although a dedicated btrfs_map_bio() flag seems more generic and
>>> better.
>>
>> I did think of moving btrfs_repair_io_failure over to my new
>> infrastructure in fact, because it seems inherently possible. Just
>> not the highest priority right now.
>>
>>>> Because the whole bio at this point is all the bad sectors. There
>>>> is no point in writing only parts of the bio because that would leave
>>>> corruption on disk.
>>>>
>>>>> The only reason I can think of is, we're still trying to do some
>>>>> "optimization".
>>>>>
>>>>> But all our bio submission is already synchronous, I doubt such
>>>>> "optimization" would make much difference.
>>>>
>>>> Can you explain what you mean here?
>>>
>>> We wait for the read bio anyway, so I doubt the batched write part is
>>> that important.
>>
>> I still don't understand the point. Once we read more than a single
>> page, writing it back as a batch is completely trivial, as shown by
>> this series. Why would we not do it?
>>
>>>
>>> If you really want, I can try to make the write part asynchronous while
>>> still keeping the read part synchronous and easier to read.
>>
>> Asynchronous writes get us back into all the I/O completion handler
>> complexities, which was the whole reason to start on the synchronous
>> repair.
>>
>>> In your current version, the do {} while() loop iterates through all
>>> mirrors.
>>>
>>> But for the following case, we will hit problems thanks to RAID1C3
>>> again:
>>>
>>> Mirror 1 |X|X|X|X|
>>> Mirror 2 |X| |X| |
>>> Mirror 3 | |X| |X|
>>>
>>> We hit mirror 1 initially, thus @initial_mirror is 1.
>>>
>>> Then when we try mirror 2, since the first sector is still bad, we jump
>>> to the next mirror.
>>>
>>> For mirror 3, we fix the first sector only. The 2nd sector is then still
>>> read from mirror 3 and doesn't pass verification.
>>> Now we have no more mirrors, and we still return -EIO.
>>
>> Can you share a test case?
>
> Unfortunately no real test case can work here.
>
> The problem is, the VFS will try to re-read with a smaller block size.
> In that case we fall back to sector-by-sector repair, so even if we get
> some policy terribly wrong, as long as the sector-by-sector behavior is
> fine, it will be hidden.
>
> That's why, when I did my bitmap version, I had to add extra trace events
> to make sure it wasn't the retry, but really my read-repair code doing
> the correct repair, without triggering the re-read.
>
>> The code resets initial_mirror as soon as
>> we make any progress, so that should not happen.
>
> Oh, didn't see that part.
>
> And in that case, yes, it will work for the checkerboard pattern, although
> it is not really any more efficient.
>
> We will do 4 different reads to fix it, no better than sector-by-sector
> repair.
> And worse than the 2 reads of the bitmap version.
>
> But I get your point: it can handle continuous corruption better than
> sector-by-sector, and no worse than the bitmap version in that case.
>
>>
>>> So my points still stand: if we want to do batched handling, either we
>>> go bitmap or we give up.
>>
>> Why? For the very common case of clustered corruption or entirely
>> failing reads it is significantly faster than a simple synchronous
>> read of each sector, and also much better than the existing code.
>> It is also a lot less code than the existing code base, and (maybe
>> I'm biased) a lot more readable.
>
> The problem here is: yes, you can pick the common case, but it comes
> with the burden of the worst cases too.
>
> And as for the readability part, I strongly doubt it.
>
> Things like resetting initial_mirror make the name "initial"
> meaningless.
> And the reset of the length partway through is also very quirky.
In fact, if you didn't do the initial_mirror and length changes (changing
the iterators inside the loop is a big readability disaster, at least to
me) and instead relied on the VFS re-read behavior to fall back to
sector-by-sector reads, I would call that better readability...
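
To make concrete what I mean by the two shapes, here is a small standalone
sketch I just threw together. This is NOT your patch code and NOT my bitmap
patch, only a toy model of the control flow; in the real code the reads per
mirror are batched instead of sector-by-sector, and read_sector_ok() stands
in for "read this sector from this mirror and verify its csum":

/* Toy model only -- not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define NR_SECTORS 4
#define NR_MIRRORS 3

/* 1 = corrupted on that mirror (the RAID1C3 example above). */
static const int bad[NR_MIRRORS][NR_SECTORS] = {
	{ 1, 1, 1, 1 },		/* mirror 1 */
	{ 1, 0, 1, 0 },		/* mirror 2 */
	{ 0, 1, 0, 1 },		/* mirror 3 */
};

static bool read_sector_ok(int mirror, int sector)
{
	return !bad[mirror - 1][sector];
}

static int next_mirror(int mirror)
{
	return mirror % NR_MIRRORS + 1;
}

/* Shape 1: walk the sectors, resetting the "initial" mirror on progress. */
static int repair_reset_iterators(void)
{
	int initial_mirror = 1;			/* mirror of the failed read */
	int mirror = next_mirror(initial_mirror);

	for (int sector = 0; sector < NR_SECTORS; ) {
		if (read_sector_ok(mirror, sector)) {
			sector++;			/* made progress ...      */
			initial_mirror = mirror;	/* ... so "initial" moves */
			continue;
		}
		mirror = next_mirror(mirror);
		if (mirror == initial_mirror)
			return -1;	/* -EIO: every mirror failed this sector */
	}
	return 0;
}

/* Shape 2: keep a bitmap of bad sectors, loop mirrors until it is empty. */
static int repair_bitmap(void)
{
	unsigned long bad_bitmap = (1UL << NR_SECTORS) - 1;

	for (int mirror = 2; mirror <= NR_MIRRORS && bad_bitmap; mirror++)
		for (int sector = 0; sector < NR_SECTORS; sector++)
			if ((bad_bitmap & (1UL << sector)) &&
			    read_sector_ok(mirror, sector))
				bad_bitmap &= ~(1UL << sector);

	return bad_bitmap ? -1 : 0;
}

int main(void)
{
	printf("reset-iterators shape: %d\n", repair_reset_iterators());
	printf("bitmap shape:          %d\n", repair_bitmap());
	return 0;
}

Drop the "initial_mirror = mirror" line from the first function and the
checkerboard case above fails with -EIO, which is exactly my readability
complaint: the loop only terminates correctly because "initial" is quietly
redefined mid-loop. In the second shape the bad bitmap and the mirror number
are the only loop state.
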
Thanks,
Qu
>
> Yes, my bitmap version is super complex, no doubt about that, but the idea
> and code should be very straightforward.
>
> Just loop over all mirrors until the bad bitmap is all zero. No need to reset
> the length or anything else halfway through; the bitmap and the mirror number
> are the only iterators.
>
> And even for the bitmap preallocation failure case, we have the VFS to bail
> us out.
> And code-wise, it's not that much simpler, if you ignore the bitmap
> pre-allocation part...
>
> And for ultimate readability, the sector-by-sector method cannot be
> beaten.
> Thus I'm not particularly a fan of any middle ground here.
>
>>
>> Bitmaps only help you with randomly splattered corruption, which simply
>> is not how SSDs or hard drives actually fail.
>
> But that's a case you have to take into consideration.
>
> Even in cases where a real-world SSD corrupts a big range of data, we
> can still submit a read that crosses the corruption boundary.
>
>>
>>> Such a hacky bandage seems to work at first glance and will pass your new
>>> test cases, but it doesn't do any better than sector-by-sector
>>> waiting.
>>> (Forgot to mention, the new RAID1C3 test case may also be flawed, as any
>>> read on other mirrors will cause read-repair, screwing up our later
>>> retry; thus we must check the pid first before doing any read.)
>>
>> The updated version uses the read-from-mirror loop from btrfs/142
>> that cleverly uses bash internals to not issue the read if it would
>> be done using the wrong mirror. Which also really nicely speeds up
>> the tests, including the existing btrfs/140 and btrfs/141 ones.
>>
> That's wonderful.
>
> However, the smaller problem is still there: we have no way to know whether
> it's the read-repair itself doing its part correctly, or the VFS retry
> saving the day.
>
> But yeah, the btrfs/142 method is much better and should be backported to
> btrfs/140 and btrfs/141.
>
> Thanks,
> Qu