public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: dsterba@suse.cz, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, "hch@infradead.org" <hch@infradead.org>
Subject: Re: [PATCH v2] btrfs: always fallback to buffered write if the inode requires checksum
Date: Fri, 14 Feb 2025 07:02:09 +1030	[thread overview]
Message-ID: <4f75f3d1-3ca4-405c-bf8b-0f07e8a890fa@gmx.com> (raw)
In-Reply-To: <20250213202302.GD5777@twin.jikos.cz>



在 2025/2/14 06:53, David Sterba 写道:
> On Tue, Feb 04, 2025 at 02:00:23PM +1030, Qu Wenruo wrote:
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode none).
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which need
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS believes they can modify the folio under writeback,
>> but btrfs requires no modification, this leads to the false csum
>> mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware to handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered write if the inode requires checksum
>>    This is suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduce the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we can not trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling
>>    back to buffered IO.
>>    At least by this, we avoid the more deadly false data checksum
>>    mismatch error.
>>
>> Suggested-by: hch@infradead.org <hch@infradead.org>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>
> I had this patch in the -rc queue but I think this is a significant
> change in the semantics of DIO so the target is 6.14. A backport to
> older stable tree is possible but we may need a bit longer period before
> this happens. DIO is used for speed in the VMs so falling back to the
> buffered write will likely be noticed.
>

I can prepare a doc update in btrfs-progs, and I totally understand your
concern about the fallback.

However I'd argue that, for VM useage, the root reason for using direct
IO is not really for performance, but to avoid double page cache as the
VM also manage its own page cache, there is no need to double page caching.

Furthermore there are more common cases where btrfs falls back to
buffered IO. The most common one is when the DIO is only 512 aligned but
not btrfs block aligned (normally 4K).
Considering not every DIO users do the proper fs block size detection, I
believe for a lot of cases, DIO of VMs are already falling back to buffered.

Thus I'm still optimistic about the behavior change, especially
considering the extra overhead is just a memcpy, it should not be that
obvious to end users anyway.

Thanks,
Qu

      reply	other threads:[~2025-02-13 20:32 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-04  3:30 [PATCH v2] btrfs: always fallback to buffered write if the inode requires checksum Qu Wenruo
2025-02-04 11:19 ` Filipe Manana
2025-02-05 19:12   ` Filipe Manana
2025-02-05 21:00     ` Qu Wenruo
2025-02-06 14:13 ` hch
2025-02-06 20:32   ` Qu Wenruo
2025-02-13 20:23 ` David Sterba
2025-02-13 20:32   ` Qu Wenruo [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4f75f3d1-3ca4-405c-bf8b-0f07e8a890fa@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=dsterba@suse.cz \
    --cc=hch@infradead.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox