Linux kernel -stable discussions
 help / color / mirror / Atom feed
From: Qu Wenruo <quwenruo.btrfs@gmx.com>
To: Greg KH <gregkh@linuxfoundation.org>, Qu Wenruo <wqu@suse.com>
Cc: linux-btrfs@vger.kernel.org, stable@vger.kernel.org,
	Christoph Hellwig <hch@infradead.org>,
	Filipe Manana <fdmanana@suse.com>,
	David Sterba <dsterba@suse.com>
Subject: Re: [PATCH] btrfs: always fallback to buffered write if the inode requires checksum
Date: Thu, 8 May 2025 16:35:00 +0930	[thread overview]
Message-ID: <d161b5ed-b405-4fce-9efd-91117ba06a1c@gmx.com> (raw)
In-Reply-To: <2025050830-epilepsy-emu-5a5d@gregkh>



在 2025/5/8 15:00, Greg KH 写道:
> On Thu, May 08, 2025 at 10:39:17AM +0930, Qu Wenruo wrote:
>> commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.
>>
>> [BUG]
>> It is a long known bug that VM image on btrfs can lead to data csum
>> mismatch, if the qemu is using direct-io for the image (this is commonly
>> known as cache mode 'none').
>>
>> [CAUSE]
>> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
>> fs is allowed to dirty/modify the folio even if the folio is under
>> writeback (as long as the address space doesn't have AS_STABLE_WRITES
>> flag inherited from the block device).
>>
>> This is a valid optimization to improve the concurrency, and since these
>> filesystems have no extra checksum on data, the content change is not a
>> problem at all.
>>
>> But the final write into the image file is handled by btrfs, which needs
>> the content not to be modified during writeback, or the checksum will
>> not match the data (checksum is calculated before submitting the bio).
>>
>> So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
>> btrfs requires no modification, this leads to the false csum mismatch.
>>
>> This is only a controlled example, there are even cases where
>> multi-thread programs can submit a direct IO write, then another thread
>> modifies the direct IO buffer for whatever reason.
>>
>> For such cases, btrfs has no sane way to detect such cases and leads to
>> false data csum mismatch.
>>
>> [FIX]
>> I have considered the following ideas to solve the problem:
>>
>> - Make direct IO to always skip data checksum
>>    This not only requires a new incompatible flag, as it breaks the
>>    current per-inode NODATASUM flag.
>>    But also requires extra handling for no csum found cases.
>>
>>    And this also reduces our checksum protection.
>>
>> - Let hardware handle all the checksum
>>    AKA, just nodatasum mount option.
>>    That requires trust for hardware (which is not that trustful in a lot
>>    of cases), and it's not generic at all.
>>
>> - Always fallback to buffered write if the inode requires checksum
>>    This was suggested by Christoph, and is the solution utilized by this
>>    patch.
>>
>>    The cost is obvious, the extra buffer copying into page cache, thus it
>>    reduces the performance.
>>    But at least it's still user configurable, if the end user still wants
>>    the zero-copy performance, just set NODATASUM flag for the inode
>>    (which is a common practice for VM images on btrfs).
>>
>>    Since we cannot trust user space programs to keep the buffer
>>    consistent during direct IO, we have no choice but always falling back
>>    to buffered IO.  At least by this, we avoid the more deadly false data
>>    checksum mismatch error.
>>
>> CC: stable@vger.kernel.org # 6.6
>> Suggested-by: Christoph Hellwig <hch@infradead.org>
>> Reviewed-by: Filipe Manana <fdmanana@suse.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> Reviewed-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: David Sterba <dsterba@suse.com>
>> [ Fix a conflict due to the movement of the function. ]
>> ---
>>   fs/btrfs/direct-io.c | 1094 ++++++++++++++++++++++++++++++++++++++++++
> 
> Did you mean to include all of this file in here?
> 
> I see 2 versions of this patch sent, the first one looks "correct", but
> this one is very odd.

My bad, this one is incorrectly including the missing file which doesn't 
yet exists in 6.6.y branch.

I'll send out an updated version.

Thanks,
Qu>
> thanks,
> 
> greg k-h
> 


  reply	other threads:[~2025-05-08  7:05 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-08  1:09 [PATCH 6.6.y] btrfs: always fallback to buffered write if the inode requires checksum Qu Wenruo
2025-05-08  1:09 ` [PATCH] " Qu Wenruo
2025-05-08  5:30   ` Greg KH
2025-05-08  7:05     ` Qu Wenruo [this message]
2025-05-09  1:52   ` Sasha Levin
2025-05-09  1:52 ` [PATCH 6.6.y] " Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d161b5ed-b405-4fce-9efd-91117ba06a1c@gmx.com \
    --to=quwenruo.btrfs@gmx.com \
    --cc=dsterba@suse.com \
    --cc=fdmanana@suse.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=hch@infradead.org \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=wqu@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox