* [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum
@ 2025-05-08 1:01 Qu Wenruo
2025-05-09 1:52 ` Sasha Levin
2025-05-12 14:00 ` Greg KH
0 siblings, 2 replies; 3+ messages in thread
From: Qu Wenruo @ 2025-05-08 1:01 UTC (permalink / raw)
To: linux-btrfs; +Cc: stable, Christoph Hellwig, Filipe Manana, David Sterba
commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
known as cache mode 'none').
[CAUSE]
Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
fs is allowed to dirty/modify the folio even if the folio is under
writeback (as long as the address space doesn't have AS_STABLE_WRITES
flag inherited from the block device).
This is a valid optimization to improve the concurrency, and since these
filesystems have no extra checksum on data, the content change is not a
problem at all.
Bu the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
btrfs requires no modification, this leads to the false csum mismatch.
This is only a controlled example, there are even cases where
multi-thread programs can submit a direct IO write, then another thread
modifies the direct IO buffer for whatever reason.
For such cases, btrfs has no sane way to detect such cases and leads to
false data csum mismatch.
[FIX]
I have considered the following ideas to solve the problem:
- Make direct IO to always skip data checksum
This not only requires a new incompatible flag, as it breaks the
current per-inode NODATASUM flag.
But also requires extra handling for no csum found cases.
And this also reduces our checksum protection.
- Let hardware handle all the checksum
AKA, just nodatasum mount option.
That requires trust for hardware (which is not that trustful in a lot
of cases), and it's not generic at all.
- Always fallback to buffered write if the inode requires checksum
This was suggested by Christoph, and is the solution utilized by this
patch.
The cost is obvious, the extra buffer copying into page cache, thus it
reduces the performance.
But at least it's still user configurable, if the end user still wants
the zero-copy performance, just set NODATASUM flag for the inode
(which is a common practice for VM images on btrfs).
Since we cannot trust user space programs to keep the buffer
consistent during direct IO, we have no choice but always falling back
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/direct-io.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 8567af46e16f..eacbb74bf6bc 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -855,6 +855,22 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
goto buffered;
}
+ /*
+ * We can't control the folios being passed in, applications can write
+ * to them while a direct IO write is in progress. This means the
+ * content might change after we calculated the data checksum.
+ * Therefore we can end up storing a checksum that doesn't match the
+ * persisted data.
+ *
+ * To be extra safe and avoid false data checksum mismatch, if the
+ * inode requires data checksum, just fallback to buffered IO.
+ * For buffered IO we have full control of page cache and can ensure
+ * no one is modifying the content during writeback.
+ */
+ if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
+ btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
+ goto buffered;
+ }
/*
* The iov_iter can be mapped to the same file range we are writing to.
--
2.49.0
^ permalink raw reply related [flat|nested] 3+ messages in thread* Re: [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum
2025-05-08 1:01 [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum Qu Wenruo
@ 2025-05-09 1:52 ` Sasha Levin
2025-05-12 14:00 ` Greg KH
1 sibling, 0 replies; 3+ messages in thread
From: Sasha Levin @ 2025-05-09 1:52 UTC (permalink / raw)
To: stable; +Cc: Qu Wenruo, Sasha Levin
[ Sasha's backport helper bot ]
Hi,
✅ All tests passed successfully. No issues detected.
No action required from the submitter.
The upstream commit SHA1 provided is correct: 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5
Status in newer kernel trees:
6.14.y | Not found
Note: The patch differs from the upstream commit:
---
1: 968f19c5b1b7d ! 1: 9ab4b4f9bed11 btrfs: always fallback to buffered write if the inode requires checksum
@@ Metadata
## Commit message ##
btrfs: always fallback to buffered write if the inode requires checksum
+ commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.
+
[BUG]
It is a long known bug that VM image on btrfs can lead to data csum
mismatch, if the qemu is using direct-io for the image (this is commonly
@@ Commit message
filesystems have no extra checksum on data, the content change is not a
problem at all.
- But the final write into the image file is handled by btrfs, which needs
+ Bu the final write into the image file is handled by btrfs, which needs
the content not to be modified during writeback, or the checksum will
not match the data (checksum is calculated before submitting the bio).
@@ Commit message
to buffered IO. At least by this, we avoid the more deadly false data
checksum mismatch error.
+ CC: stable@vger.kernel.org # 6.12+
Suggested-by: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Results of testing on various branches:
| Branch | Patch Apply | Build Test |
|---------------------------|-------------|------------|
| stable/linux-6.12.y | Success | Success |
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum
2025-05-08 1:01 [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum Qu Wenruo
2025-05-09 1:52 ` Sasha Levin
@ 2025-05-12 14:00 ` Greg KH
1 sibling, 0 replies; 3+ messages in thread
From: Greg KH @ 2025-05-12 14:00 UTC (permalink / raw)
To: Qu Wenruo
Cc: linux-btrfs, stable, Christoph Hellwig, Filipe Manana,
David Sterba
On Thu, May 08, 2025 at 10:31:25AM +0930, Qu Wenruo wrote:
> commit 968f19c5b1b7d5595423b0ac0020cc18dfed8cb5 upstream.
>
> [BUG]
> It is a long known bug that VM image on btrfs can lead to data csum
> mismatch, if the qemu is using direct-io for the image (this is commonly
> known as cache mode 'none').
>
> [CAUSE]
> Inside the VM, if the fs is EXT4 or XFS, or even NTFS from Windows, the
> fs is allowed to dirty/modify the folio even if the folio is under
> writeback (as long as the address space doesn't have AS_STABLE_WRITES
> flag inherited from the block device).
>
> This is a valid optimization to improve the concurrency, and since these
> filesystems have no extra checksum on data, the content change is not a
> problem at all.
>
> Bu the final write into the image file is handled by btrfs, which needs
> the content not to be modified during writeback, or the checksum will
> not match the data (checksum is calculated before submitting the bio).
>
> So EXT4/XFS/NTRFS assume they can modify the folio under writeback, but
> btrfs requires no modification, this leads to the false csum mismatch.
>
> This is only a controlled example, there are even cases where
> multi-thread programs can submit a direct IO write, then another thread
> modifies the direct IO buffer for whatever reason.
>
> For such cases, btrfs has no sane way to detect such cases and leads to
> false data csum mismatch.
>
> [FIX]
> I have considered the following ideas to solve the problem:
>
> - Make direct IO to always skip data checksum
> This not only requires a new incompatible flag, as it breaks the
> current per-inode NODATASUM flag.
> But also requires extra handling for no csum found cases.
>
> And this also reduces our checksum protection.
>
> - Let hardware handle all the checksum
> AKA, just nodatasum mount option.
> That requires trust for hardware (which is not that trustful in a lot
> of cases), and it's not generic at all.
>
> - Always fallback to buffered write if the inode requires checksum
> This was suggested by Christoph, and is the solution utilized by this
> patch.
>
> The cost is obvious, the extra buffer copying into page cache, thus it
> reduces the performance.
> But at least it's still user configurable, if the end user still wants
> the zero-copy performance, just set NODATASUM flag for the inode
> (which is a common practice for VM images on btrfs).
>
> Since we cannot trust user space programs to keep the buffer
> consistent during direct IO, we have no choice but always falling back
> to buffered IO. At least by this, we avoid the more deadly false data
> checksum mismatch error.
>
> CC: stable@vger.kernel.org # 6.12+
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Filipe Manana <fdmanana@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> Reviewed-by: David Sterba <dsterba@suse.com>
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
> fs/btrfs/direct-io.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
What about 6.14.y? It is needed there too, right? We can't take a
patch only for an older branch and not a newer one.
thanks,
greg k-h
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2025-05-12 14:00 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-08 1:01 [PATCH 6.12.y] btrfs: always fallback to buffered write if the inode requires checksum Qu Wenruo
2025-05-09 1:52 ` Sasha Levin
2025-05-12 14:00 ` Greg KH
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox