public inbox for linux-btrfs@vger.kernel.org
* [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
@ 2025-10-30 23:43 Qu Wenruo
  2025-10-31  9:58 ` Filipe Manana
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-10-30 23:43 UTC (permalink / raw)
  To: linux-btrfs

[BACKGROUND]
This is inspired by a recent kernel bug report about direct IO buffer
modification during writeback, which leads to content mismatches
between different RAID1 mirrors.

[CAUSE AND PROBLEMS]
The root cause is exactly the same as explained in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"): we can not trust a direct IO buffer, which can be modified
halfway through writeback.

Unlike the data checksum verification case, if this happens on an inode
without data checksum but whose data has extra mirrors, it will lead to
a stealth data mismatch between the mirrors.

Without data checksum this is much harder to detect.

Furthermore, for RAID56 we can even have data without checksum and data
with checksum mixed inside the same full stripe.

In that case, if the direct IO buffer of the nodatasum part is changed
halfway through, the data with checksum immediately loses its ability to
be recovered, e.g.:

" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback

              0                32K                      64K
  Data 1      |                                         |  Has csum
  Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
  Parity      |                                         |

In the above case, the parity is calculated using data 1 (has csum,
from page cache, won't change during writeback) and the old data 2 (has
no csum, direct IO write).

After the parity is calculated, but before submission to the storage,
the direct IO buffer of data 2 is modified, so the range [0, 32K) of
data 2 now has different content.

Now all data is submitted to the storage, and the fs gets fully synced.

Then the device holding data 1 is lost, and data 1 has to be rebuilt
from data 2 and parity. But since data 2 has some modified content while
the parity was calculated using the old data, the recovered data is not
the same as the original data 1, causing a data checksum mismatch.

[FIX]
Fix this problem by introducing a new helper,
btrfs_has_data_duplication(), to check if there are any data profiles
with duplication. If so, fall back to buffered IO to keep the data
consistent, no matter whether the inode has data checksum or not.

However this is not going to fix all situations, as it's still possible
to race with balance, where the fs gets a new data profile after the
btrfs_has_data_duplication() check.
But this fix should still greatly reduce the window of the original bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/direct-io.c  |  9 +++++++++
 fs/btrfs/space-info.c | 18 ++++++++++++++++++
 fs/btrfs/space-info.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 962fccceffd6..3165543f35bc 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -12,6 +12,7 @@
 #include "volumes.h"
 #include "bio.h"
 #include "ordered-data.h"
+#include "space-info.h"
 
 struct btrfs_dio_data {
 	ssize_t submitted;
@@ -827,6 +828,14 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
+	/*
+	 * If our data profile has duplication (either extra mirrors or RAID56),
+	 * we can not trust the direct IO buffer, the content may change during
+	 * writeback and cause different contents written to different mirrors.
+	 */
+	if (btrfs_has_data_duplication(fs_info))
+		goto buffered;
+
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 63d14b5dfc6c..02b5ebdd3146 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -2203,3 +2203,21 @@ void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len)
 	if (len)
 		btrfs_try_granting_tickets(space_info);
 }
+
+bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_space_info *sinfo = fs_info->data_sinfo;
+	bool ret = false;
+
+	down_write(&sinfo->groups_sem);
+	for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
+		if (!list_empty(&sinfo->block_groups[i]) &&
+		    (btrfs_raid_array[i].ncopies > 1 ||
+		     btrfs_raid_array[i].nparity != 0)) {
+			ret = true;
+			break;
+		}
+	}
+	up_write(&sinfo->groups_sem);
+	return ret;
+}
diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
index a4c2a3c8b388..15b96163a167 100644
--- a/fs/btrfs/space-info.h
+++ b/fs/btrfs/space-info.h
@@ -315,5 +315,6 @@ void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool
 int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
 void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
 void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
+bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info);
 
 #endif /* BTRFS_SPACE_INFO_H */
-- 
2.51.2



* Re: [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
  2025-10-30 23:43 Qu Wenruo
@ 2025-10-31  9:58 ` Filipe Manana
  2025-10-31 10:13   ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: Filipe Manana @ 2025-10-31  9:58 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Thu, Oct 30, 2025 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
>
> [BACKGROUND]
> Inspired by a recent kernel bug report, which is related to direct IO
> buffer modification during writeback, that leads to contents mismatch of
> different RAID1 mirrors.
>
> [CAUSE AND PROBLEMS]
> The root cause is exactly the same explained in commit 968f19c5b1b7
> ("btrfs: always fallback to buffered write if the inode requires
> checksum"), that we can not trust direct IO buffer which can be modified
> halfway during writeback.
>
> Unlike data checksum verification, if this happened on inodes without
> data checksum but has the data has extra mirrors, it will lead to
> stealth data mismatch on different mirrors.
>
> This will be way harder to detect without data checksum.
>
> Furthermore for RAID56, we can even have data without checksum and data
> with checksum mixed inside the same full stripe.
>
> In that case if the direct IO buffer got changed halfway for the
> nodatasum part, the data with checksum immediately lost its ability to
> recover, e.g.:
>
> " " = Good old data or parity calculated using good old data
> "X" = Data modified during writeback
>
>               0                32K                      64K
>   Data 1      |                                         |  Has csum
>   Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
>   Parity      |                                         |
>
> In above case, the parity is calculated using data 1 (has csum, from
> page cache, won't change during writeback), and old data 2 (has no csum,
> direct IO write).
>
> After parity is calculated, but before submission to the storage, direct
> IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
> has a different content.
>
> Now all data is submitted to the storage, and the fs got fully synced.
>
> Then the device of data 1 is lost, has to be rebuilt from data 2 and
> parity. But since the data 2 has some modified data, and the parity is
> calculated using old data, the recovered data is no the same for data 1,
> causing data checksum mismatch.
>
> [FIX]
> Fix this problem by introduce a new helper,

introduce -> introducing

> btrfs_has_data_duplication(), to check if there is any data profiles

is any -> are any

> that have any duplication. If so fallback to buffered IO to keep data
> consistent, no matter if the inode has data checksum or not.
>
> However this is not going to fix all situations, as it's still possible
> to race with balance where the fs got a new data profile after
> btrfs_has_data_duplication() check.
> But this fix should still greatly reduce the window of the original bug.
>
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
>  fs/btrfs/direct-io.c  |  9 +++++++++
>  fs/btrfs/space-info.c | 18 ++++++++++++++++++
>  fs/btrfs/space-info.h |  1 +
>  3 files changed, 28 insertions(+)
>
> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
> index 962fccceffd6..3165543f35bc 100644
> --- a/fs/btrfs/direct-io.c
> +++ b/fs/btrfs/direct-io.c
> @@ -12,6 +12,7 @@
>  #include "volumes.h"
>  #include "bio.h"
>  #include "ordered-data.h"
> +#include "space-info.h"
>
>  struct btrfs_dio_data {
>         ssize_t submitted;
> @@ -827,6 +828,14 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>         if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
>                 ilock_flags |= BTRFS_ILOCK_SHARED;
>
> +       /*
> +        * If our data profile has duplication (either extra mirrors or RAID56),
> +        * we can not trust the direct IO buffer, the content may change during
> +        * writeback and cause different contents written to different mirrors.
> +        */
> +       if (btrfs_has_data_duplication(fs_info))
> +               goto buffered;
> +
>  relock:
>         ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
>         if (ret < 0)
> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
> index 63d14b5dfc6c..02b5ebdd3146 100644
> --- a/fs/btrfs/space-info.c
> +++ b/fs/btrfs/space-info.c
> @@ -2203,3 +2203,21 @@ void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len)
>         if (len)
>                 btrfs_try_granting_tickets(space_info);
>  }
> +
> +bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info)
> +{
> +       struct btrfs_space_info *sinfo = fs_info->data_sinfo;
> +       bool ret = false;
> +
> +       down_write(&sinfo->groups_sem);
> +       for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
> +               if (!list_empty(&sinfo->block_groups[i]) &&
> +                   (btrfs_raid_array[i].ncopies > 1 ||
> +                    btrfs_raid_array[i].nparity != 0)) {
> +                       ret = true;
> +                       break;
> +               }
> +       }
> +       up_write(&sinfo->groups_sem);

This is huge and rather heavy.
Taking the lock in write mode will block concurrent extent
allocations, which take it in read mode.

We could lock in read mode instead, but this can be much simpler by
doing this:

return (btrfs_data_alloc_profile(fs_info) &
        BTRFS_BLOCK_GROUP_PROFILE_MASK) != 0;

Thanks.



> +       return ret;
> +}
> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
> index a4c2a3c8b388..15b96163a167 100644
> --- a/fs/btrfs/space-info.h
> +++ b/fs/btrfs/space-info.h
> @@ -315,5 +315,6 @@ void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool
>  int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
>  void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
>  void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
> +bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info);
>
>  #endif /* BTRFS_SPACE_INFO_H */
> --
> 2.51.2
>
>


* Re: [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
  2025-10-31  9:58 ` Filipe Manana
@ 2025-10-31 10:13   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-10-31 10:13 UTC (permalink / raw)
  To: Filipe Manana, Qu Wenruo; +Cc: linux-btrfs



On 2025/10/31 20:28, Filipe Manana wrote:
> On Thu, Oct 30, 2025 at 11:48 PM Qu Wenruo <wqu@suse.com> wrote:
>>
>> [BACKGROUND]
>> Inspired by a recent kernel bug report, which is related to direct IO
>> buffer modification during writeback, that leads to contents mismatch of
>> different RAID1 mirrors.
>>
>> [CAUSE AND PROBLEMS]
>> The root cause is exactly the same explained in commit 968f19c5b1b7
>> ("btrfs: always fallback to buffered write if the inode requires
>> checksum"), that we can not trust direct IO buffer which can be modified
>> halfway during writeback.
>>
>> Unlike data checksum verification, if this happened on inodes without
>> data checksum but has the data has extra mirrors, it will lead to
>> stealth data mismatch on different mirrors.
>>
>> This will be way harder to detect without data checksum.
>>
>> Furthermore for RAID56, we can even have data without checksum and data
>> with checksum mixed inside the same full stripe.
>>
>> In that case if the direct IO buffer got changed halfway for the
>> nodatasum part, the data with checksum immediately lost its ability to
>> recover, e.g.:
>>
>> " " = Good old data or parity calculated using good old data
>> "X" = Data modified during writeback
>>
>>                0                32K                      64K
>>    Data 1      |                                         |  Has csum
>>    Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
>>    Parity      |                                         |
>>
>> In above case, the parity is calculated using data 1 (has csum, from
>> page cache, won't change during writeback), and old data 2 (has no csum,
>> direct IO write).
>>
>> After parity is calculated, but before submission to the storage, direct
>> IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
>> has a different content.
>>
>> Now all data is submitted to the storage, and the fs got fully synced.
>>
>> Then the device of data 1 is lost, has to be rebuilt from data 2 and
>> parity. But since the data 2 has some modified data, and the parity is
>> calculated using old data, the recovered data is no the same for data 1,
>> causing data checksum mismatch.
>>
>> [FIX]
>> Fix this problem by introduce a new helper,
> 
> introduce -> introducing
> 
>> btrfs_has_data_duplication(), to check if there is any data profiles
> 
> is any -> are any
> 
>> that have any duplication. If so fallback to buffered IO to keep data
>> consistent, no matter if the inode has data checksum or not.
>>
>> However this is not going to fix all situations, as it's still possible
>> to race with balance where the fs got a new data profile after
>> btrfs_has_data_duplication() check.
>> But this fix should still greatly reduce the window of the original bug.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>>   fs/btrfs/direct-io.c  |  9 +++++++++
>>   fs/btrfs/space-info.c | 18 ++++++++++++++++++
>>   fs/btrfs/space-info.h |  1 +
>>   3 files changed, 28 insertions(+)
>>
>> diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
>> index 962fccceffd6..3165543f35bc 100644
>> --- a/fs/btrfs/direct-io.c
>> +++ b/fs/btrfs/direct-io.c
>> @@ -12,6 +12,7 @@
>>   #include "volumes.h"
>>   #include "bio.h"
>>   #include "ordered-data.h"
>> +#include "space-info.h"
>>
>>   struct btrfs_dio_data {
>>          ssize_t submitted;
>> @@ -827,6 +828,14 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
>>          if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
>>                  ilock_flags |= BTRFS_ILOCK_SHARED;
>>
>> +       /*
>> +        * If our data profile has duplication (either extra mirrors or RAID56),
>> +        * we can not trust the direct IO buffer, the content may change during
>> +        * writeback and cause different contents written to different mirrors.
>> +        */
>> +       if (btrfs_has_data_duplication(fs_info))
>> +               goto buffered;
>> +
>>   relock:
>>          ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
>>          if (ret < 0)
>> diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
>> index 63d14b5dfc6c..02b5ebdd3146 100644
>> --- a/fs/btrfs/space-info.c
>> +++ b/fs/btrfs/space-info.c
>> @@ -2203,3 +2203,21 @@ void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len)
>>          if (len)
>>                  btrfs_try_granting_tickets(space_info);
>>   }
>> +
>> +bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info)
>> +{
>> +       struct btrfs_space_info *sinfo = fs_info->data_sinfo;
>> +       bool ret = false;
>> +
>> +       down_write(&sinfo->groups_sem);
>> +       for (int i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
>> +               if (!list_empty(&sinfo->block_groups[i]) &&
>> +                   (btrfs_raid_array[i].ncopies > 1 ||
>> +                    btrfs_raid_array[i].nparity != 0)) {
>> +                       ret = true;
>> +                       break;
>> +               }
>> +       }
>> +       up_write(&sinfo->groups_sem);
> 
> This is huge and rather heavy.
> Taking the lock in write mode will block with concurrent extent
> allocations which take it in read mode.
> 
> We could lock in read mode instead, but this can be much simpler by doing this:
> 
> return (btrfs_data_alloc_profile(fs_info) &
> BTRFS_BLOCK_GROUP_PROFILE_MASK) != 0;

Thanks a lot for the btrfs_data_alloc_profile() helper. I knew the
semaphore is super expensive but didn't find a good solution, and was
even considering implementing a cached version.
With this it's much simpler.

Will go with the helper and check for SINGLE/RAID0.

Thanks,
Qu

> 
> Thanks.
> 
> 
> 
>> +       return ret;
>> +}
>> diff --git a/fs/btrfs/space-info.h b/fs/btrfs/space-info.h
>> index a4c2a3c8b388..15b96163a167 100644
>> --- a/fs/btrfs/space-info.h
>> +++ b/fs/btrfs/space-info.h
>> @@ -315,5 +315,6 @@ void btrfs_set_periodic_reclaim_ready(struct btrfs_space_info *space_info, bool
>>   int btrfs_calc_reclaim_threshold(const struct btrfs_space_info *space_info);
>>   void btrfs_reclaim_sweep(const struct btrfs_fs_info *fs_info);
>>   void btrfs_return_free_space(struct btrfs_space_info *space_info, u64 len);
>> +bool btrfs_has_data_duplication(struct btrfs_fs_info *fs_info);
>>
>>   #endif /* BTRFS_SPACE_INFO_H */
>> --
>> 2.51.2
>>
>>
> 



* [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
@ 2025-10-31 23:52 Qu Wenruo
  2025-11-18 14:56 ` David Sterba
  0 siblings, 1 reply; 6+ messages in thread
From: Qu Wenruo @ 2025-10-31 23:52 UTC (permalink / raw)
  To: linux-btrfs

[BACKGROUND]
This is inspired by a recent kernel bug report about direct IO buffer
modification during writeback, which leads to content mismatches
between different RAID1 mirrors.

[CAUSE AND PROBLEMS]
The root cause is exactly the same as explained in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"): we can not trust a direct IO buffer, which can be modified
halfway through writeback.

Unlike the data checksum verification case, if this happens on an inode
without data checksum but whose data has extra mirrors, it will lead to
a stealth data mismatch between the mirrors.

Without data checksum this is much harder to detect.

Furthermore, for RAID56 we can even have data without checksum and data
with checksum mixed inside the same full stripe.

In that case, if the direct IO buffer of the nodatasum part is changed
halfway through, the data with checksum immediately loses its ability to
be recovered, e.g.:

" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback

              0                32K                      64K
  Data 1      |                                         |  Has csum
  Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
  Parity      |                                         |

In the above case, the parity is calculated using data 1 (has csum,
from page cache, won't change during writeback) and the old data 2 (has
no csum, direct IO write).

After the parity is calculated, but before submission to the storage,
the direct IO buffer of data 2 is modified, so the range [0, 32K) of
data 2 now has different content.

Now all data is submitted to the storage, and the fs gets fully synced.

Then the device holding data 1 is lost, and data 1 has to be rebuilt
from data 2 and parity. But since data 2 has some modified content while
the parity was calculated using the old data, the recovered data is not
the same as the original data 1, causing a data checksum mismatch.

[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO, and the end user is fully responsible for any
race.

However this is not going to fix all situations, as it's still possible
to race with balance, where the fs gets a new data profile after the
data allocation profile check.
But this fix should still greatly reduce the window of the original bug.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Use btrfs_data_alloc_profile() to avoid the expensive rw semaphore in
  the write path; it now only takes a seqlock, which is much lighter
  weight.
---
 fs/btrfs/direct-io.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/direct-io.c b/fs/btrfs/direct-io.c
index 962fccceffd6..6ff186595625 100644
--- a/fs/btrfs/direct-io.c
+++ b/fs/btrfs/direct-io.c
@@ -814,6 +814,8 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	unsigned int ilock_flags = 0;
 	struct iomap_dio *dio;
+	const u64 data_profile = btrfs_data_alloc_profile(fs_info) &
+				 BTRFS_BLOCK_GROUP_PROFILE_MASK;
 
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		ilock_flags |= BTRFS_ILOCK_TRY;
@@ -827,6 +829,16 @@ ssize_t btrfs_direct_write(struct kiocb *iocb, struct iov_iter *from)
 	if (iocb->ki_pos + iov_iter_count(from) <= i_size_read(inode) && IS_NOSEC(inode))
 		ilock_flags |= BTRFS_ILOCK_SHARED;
 
+	/*
+	 * If our data profile has duplication (either extra mirrors or RAID56),
+	 * we can not trust the direct IO buffer, the content may change during
+	 * writeback and cause different contents written to different mirrors.
+	 *
+	 * Thus only RAID0 and SINGLE can go true zero-copy direct IO.
+	 */
+	if (data_profile != BTRFS_BLOCK_GROUP_RAID0 && data_profile != 0)
+		goto buffered;
+
 relock:
 	ret = btrfs_inode_lock(BTRFS_I(inode), ilock_flags);
 	if (ret < 0)
-- 
2.51.2



* Re: [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
  2025-10-31 23:52 [PATCH] btrfs: fallback to buffered IO if the data profile has duplication Qu Wenruo
@ 2025-11-18 14:56 ` David Sterba
  2025-11-18 20:36   ` Qu Wenruo
  0 siblings, 1 reply; 6+ messages in thread
From: David Sterba @ 2025-11-18 14:56 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Sat, Nov 01, 2025 at 10:22:16AM +1030, Qu Wenruo wrote:
> [BACKGROUND]
> Inspired by a recent kernel bug report, which is related to direct IO
> buffer modification during writeback, that leads to contents mismatch of
> different RAID1 mirrors.
> 
> [CAUSE AND PROBLEMS]
> The root cause is exactly the same explained in commit 968f19c5b1b7
> ("btrfs: always fallback to buffered write if the inode requires
> checksum"), that we can not trust direct IO buffer which can be modified
> halfway during writeback.
> 
> Unlike data checksum verification, if this happened on inodes without
> data checksum but has the data has extra mirrors, it will lead to
> stealth data mismatch on different mirrors.
> 
> This will be way harder to detect without data checksum.
> 
> Furthermore for RAID56, we can even have data without checksum and data
> with checksum mixed inside the same full stripe.
> 
> In that case if the direct IO buffer got changed halfway for the
> nodatasum part, the data with checksum immediately lost its ability to
> recover, e.g.:
> 
> " " = Good old data or parity calculated using good old data
> "X" = Data modified during writeback
> 
>               0                32K                      64K
>   Data 1      |                                         |  Has csum
>   Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
>   Parity      |                                         |
> 
> In above case, the parity is calculated using data 1 (has csum, from
> page cache, won't change during writeback), and old data 2 (has no csum,
> direct IO write).
> 
> After parity is calculated, but before submission to the storage, direct
> IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
> has a different content.
> 
> Now all data is submitted to the storage, and the fs got fully synced.
> 
> Then the device of data 1 is lost, has to be rebuilt from data 2 and
> parity. But since the data 2 has some modified data, and the parity is
> calculated using old data, the recovered data is no the same for data 1,
> causing data checksum mismatch.
> 
> [FIX]
> Fix the problem by checking the data allocation profile.
> If our data allocation profile is either RAID0 or SINGLE, we can allow
> true zero-copy direct IO and the end user is fully responsible for any
> race.
> 
> However this is not going to fix all situations, as it's still possible
> to race with balance where the fs got a new data profile after the data
> allocation profile check.
> But this fix should still greatly reduce the window of the original bug.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
> Signed-off-by: Qu Wenruo <wqu@suse.com>

This fixes some cases but also adds another exception/quirky behaviour
to direct IO. At a minimum we should clearly document what is compatible
and how. So far I think we have the direct IO features scattered across
the documentation.


* Re: [PATCH] btrfs: fallback to buffered IO if the data profile has duplication
  2025-11-18 14:56 ` David Sterba
@ 2025-11-18 20:36   ` Qu Wenruo
  0 siblings, 0 replies; 6+ messages in thread
From: Qu Wenruo @ 2025-11-18 20:36 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 2025/11/19 01:26, David Sterba wrote:
> On Sat, Nov 01, 2025 at 10:22:16AM +1030, Qu Wenruo wrote:
>> [BACKGROUND]
>> Inspired by a recent kernel bug report, which is related to direct IO
>> buffer modification during writeback, that leads to contents mismatch of
>> different RAID1 mirrors.
>>
>> [CAUSE AND PROBLEMS]
>> The root cause is exactly the same explained in commit 968f19c5b1b7
>> ("btrfs: always fallback to buffered write if the inode requires
>> checksum"), that we can not trust direct IO buffer which can be modified
>> halfway during writeback.
>>
>> Unlike data checksum verification, if this happened on inodes without
>> data checksum but has the data has extra mirrors, it will lead to
>> stealth data mismatch on different mirrors.
>>
>> This will be way harder to detect without data checksum.
>>
>> Furthermore for RAID56, we can even have data without checksum and data
>> with checksum mixed inside the same full stripe.
>>
>> In that case if the direct IO buffer got changed halfway for the
>> nodatasum part, the data with checksum immediately lost its ability to
>> recover, e.g.:
>>
>> " " = Good old data or parity calculated using good old data
>> "X" = Data modified during writeback
>>
>>                0                32K                      64K
>>    Data 1      |                                         |  Has csum
>>    Data 2      |XXXXXXXXXXXXXXXX                         |  No csum
>>    Parity      |                                         |
>>
>> In above case, the parity is calculated using data 1 (has csum, from
>> page cache, won't change during writeback), and old data 2 (has no csum,
>> direct IO write).
>>
>> After parity is calculated, but before submission to the storage, direct
>> IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
>> has a different content.
>>
>> Now all data is submitted to the storage, and the fs got fully synced.
>>
>> Then the device of data 1 is lost, has to be rebuilt from data 2 and
>> parity. But since the data 2 has some modified data, and the parity is
>> calculated using old data, the recovered data is no the same for data 1,
>> causing data checksum mismatch.
>>
>> [FIX]
>> Fix the problem by checking the data allocation profile.
>> If our data allocation profile is either RAID0 or SINGLE, we can allow
>> true zero-copy direct IO and the end user is fully responsible for any
>> race.
>>
>> However this is not going to fix all situations, as it's still possible
>> to race with balance where the fs got a new data profile after the data
>> allocation profile check.
>> But this fix should still greatly reduce the window of the original bug.
>>
>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=99171
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
> 
> This fixes some cases but also adds another exception/quirky behaviour
> of direct io. At minimum we should document it in a clear way of what is
> compatible and how. So far I think we have the direct io features
> scattered in the documentation.

If all the patches (including that direct IO read fallback patch) got
merged, the final doc would be easy:

   Fall back to buffered IO if:
   - The inode has csum (no matter read or write)
   - The fs has duplication (DUP/RAID1*/RAID5/RAID6)

But the direct IO read patch hasn't received any feedback yet, which
makes the matrix much more complex for now.

Thanks,
Qu

