public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <wqu@suse.com>
To: kreijack@inwind.it, Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum"
Date: Tue, 3 Jun 2025 08:25:54 +0930	[thread overview]
Message-ID: <79875f25-26d8-413b-938e-8b0e8fffcb95@suse.com> (raw)
In-Reply-To: <b2eee8fc-ef11-431f-81b2-46f48e087a0d@inwind.it>



在 2025/6/3 02:58, Goffredo Baroncelli 写道:
> On 01/06/2025 00.07, Qu Wenruo wrote:
>>
>>
>> 在 2025/6/1 02:28, Goffredo Baroncelli 写道:
>>> On 15/05/2025 10.00, Qu Wenruo wrote:
> [...]
>>>> We have a long history of data csum mismatch, caused by direct IO and
>>>> buffered being modified during writeback.
>>>>
>>> What about having an ioctl (on a mounted fs) which allow us to read 
>>> data from a file even
>>> if the csum doesn't match ?
>>
>> The problem is again with mirrors.
>>
>> Unless you ask the end user to provide a mirror number, for a with 
>> multiple mirrors, and the mirrors doesn't match each other, the 
>> behavior will be a mess.
>>
>> That's why I'm planning to add something like a mirror priority, with 
>> a new mirror "best" to try use the mirror with the most matches.
>>
>>
>> Despite the mirror number problem (we need to ask the end user for 
>> mirror number), I think it's possible to implement the feature 
>> (mostly) in user space.
>>
>> E.g. combine fiemap result with btrfs-map-logical, then read from the 
>> disk directly.
>>
> This would expose to the user space the complexity of the filesystem; 

Let me be clear, the filesystem is complex.

If there is a csum mismatch, the end user should dig into the rabbit 
hole to find out why, fixing it with some tool should not be the first 
thing they want to do.

> this means having two code path doing the same thing in two subtle 
> different ways.

It's fine to have two code paths doing the same thing.

One for offline and one for online.

> Pay attention to consider also a raid5/6 case with a 
> missing disks.
> 
> I don't know how important is select a valid mirror. Or at least, I 
> don't see a realistic case where a mirror may be better than another one.

RAID1C3, RAID1C4, where one can have 2 or more mirrors matches each 
other but not matching the csum.

> 
> After your consideration, I would like to suggest the following:
> 1) read the data from a mirror, the csum matches, return data
> 2) read the data from a mirror, the csum DOES NOT match:
>     2.1) read the data from another mirror(s), the csum matches,
>      - rebuild the csum
>      - return the data
>     2.2) read the data from another mirror(s), the csum still doesn't 
> match, return error
>     2.3) (NEW, alternatively to 2.2) read the data from another 
> mirror(s), the csum still doesn't match:
>      - take the latest read attempt as valid
>          - rebuild the csum
>          - return the last read attempt
> 
> 2.3) mode should be a enabled by an IOCTL on a specific fd.

Again, we should not even put a complex csum repair into kernel mode and 
allow it to be done online.

Back to the first point, csum mismatch should be investigated, it's not 
something one should hit commonly, and we should not make fixing it 
easier either.


Csum mismatch should be treated as tree-checker errors.
I have not yet seen anyone suggesting there should be some fancy ioctl 
to fix a tree-checker error.

Then there should not be any easy tool to fix csum error online.


> The 
> assumption here is that if the csum doesn't match any mirror may be 
> equally bad/good. But in the IOCTL we could add a parameter to specify 
> the order of the mirrors (still we need a way to specify the order of 
> the mirror in the raid5 case).
> 
>>> I asked that because the problem usually happen on a specific file
>>> and not to an entire filesystem. In this case I think that it would 
>>> be more practical to read the block
>>> using the IOCTL, and then rewrite the block, at the specific offset 
>>> (to update the checksum).
>>
>> I'm fine with the idea of reading from raw data idea (although prefer 
>> to implement it in user space), but not a huge fan to simply re-write 
>> with COW.
>>
>> E.g. the bad csum is still there, can still be exposed by scrub, even 
>> it's not referred by any file anymore.
>>
> 
> If I understood correctly, you are saying that due to the fact that even 
> if an extent is partially referred, all data in the extent is tracked 
> and has his checksum. A rewriting in the middle would generate a new 
> extent with a new checksum, but the old data still exists in the extent 
> with his dedicated (and un-matching ) csum. (I repeated this because it 
> seems to me a technical detail not so obvious but very important)

Yes.


>> So overall, if one hits a csum mismatch, especially after the direct 
>> IO fix, then it should not be treated as "something that can be easily 
>> fixed online", it should be something huge, at least needs some tool 
>> to handle it offline.
>>
> 
> Even tough I agree about the severity (e.g. the corruption happened due 
> to an hw error), I less agree about the fact that this is a 
> justification to not have a simpler interface (form an user interface 
> POV) that don't requires an un-mount/mount of the filesystem.
> 
> Un-mounting a root filesystem is a not so easy operation, so we should 
> provide a different way to access the data.

If you can not trust your hardware, online repair is just going to cause 
more problems.

Again, replace csum error with tree-checker.

Thanks,
Qu

> 
> However I agree that we should allow the user to select a different 
> "mirror", considering that a "mirror" may be any valid combination of 
> disk to rebuild the data (e.g. when we will define the interface, we 
> should consider raid 6).
> 
>> Thanks,
>> Qu
>>
>>>
> 
> BR
> Goffredo
> 


      reply	other threads:[~2025-06-02 22:56 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-15  8:00 [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" Qu Wenruo
2025-05-15  8:00 ` [PATCH v2 1/6] " Qu Wenruo
2025-05-30 11:10   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 2/6] btrfs-progs: fix a bug in btrfs_find_item() Qu Wenruo
2025-05-30 11:11   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 3/6] btrfs-progs: fix-data-checksum: show affected files Qu Wenruo
2025-05-30 11:14   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 4/6] btrfs-progs: fix-data-checksum: introduce interactive mode Qu Wenruo
2025-05-30 11:18   ` David Sterba
2025-05-15  8:00 ` [PATCH v2 5/6] btrfs-progs: fix-data-checksum: update csum items to fix csum mismatch Qu Wenruo
2025-05-30 11:19   ` David Sterba
2025-05-30 22:23     ` Qu Wenruo
2025-05-15  8:00 ` [PATCH v2 6/6] btrfs-progs: fix-data-checksum: introduce -m|--mirror option Qu Wenruo
2025-05-30 11:07 ` [PATCH v2 0/6] btrfs-progs: introduce "btrfs rescue fix-data-checksum" David Sterba
2025-05-31 16:58 ` Goffredo Baroncelli
2025-05-31 22:07   ` Qu Wenruo
2025-06-02 17:28     ` Goffredo Baroncelli
2025-06-02 22:55       ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=79875f25-26d8-413b-938e-8b0e8fffcb95@suse.com \
    --to=wqu@suse.com \
    --cc=kreijack@inwind.it \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=quwenruo.btrfs@gmx.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox