* [PATCH] btrfs: fix encoded write i_size corruption
@ 2023-04-28 21:02 Boris Burkov
2023-04-29 19:54 ` Josef Bacik
2023-05-02 12:20 ` David Sterba
0 siblings, 2 replies; 3+ messages in thread
From: Boris Burkov @ 2023-04-28 21:02 UTC (permalink / raw)
To: linux-btrfs, kernel-team
We have observed a btrfs filesystem corruption on workloads using
NOHOLES and encoded writes via sendstream v2. The symptom is that a file
appears to be truncated to the end of its last aligned extent, even
though the final unaligned extent and even the file extent and otherwise
correctly updated inode item have been written.
So if we were writing out a 1MiB+X file via 8 128K extents and one
extent of length X, isize would be set to 1MiB, but the ninth extent,
nbyte, etc.. would all appear correct otherwise.
The source of the race is a narrow (one code line..) window in which a
noholes fs has read in an updated isize, but has not yet set a shared
disk_i_size variable to write. Therefore, if two ordered extents run in
parallel (par for the course for receive workloads), the following
sequence can play out: (following "threads" a bit loosely, since there
are callbacks involved for endio but extra threads aren't needed to
cause the issue)
ENC-WR1 (second to last) ENC-WR2 (last)
------- -------
btrfs_do_encoded_write
set isize = 1M
submit bio B1 ending at 1M
endio B1
btrfs_inode_safe_disk_i_size_write
local isize = 1M
falls off a cliff for some reason
btrfs_do_encoded_write
set isize = 1M+X
submit bio B2 ending at 1M+X
endio B2
btrfs_inode_safe_disk_i_size_write
local isize = 1M+X
disk_i_size = 1M+X
disk_i_size = 1M
btrfs_delayed_update_inode
btrfs_delayed_update_inode
And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
writes respect isize and present a corruted file missing its last
extents.
Fix this by holding the inode lock in the noholes case so that a thread
can't sneak in a write to disk_i_size that gets overwritten with an out
of date isize.
Fixes: 41a2ee75aab0290 btrfs: introduce per-inode file extent tree
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/file-item.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 018c711a0bc8..cd4cce9ba443 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -52,13 +52,13 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
u64 start, end, i_size;
int ret;
+ spin_lock(&inode->lock);
i_size = new_i_size ?: i_size_read(&inode->vfs_inode);
if (btrfs_fs_incompat(fs_info, NO_HOLES)) {
inode->disk_i_size = i_size;
- return;
+ goto out_unlock;
}
- spin_lock(&inode->lock);
ret = find_contiguous_extent_bit(&inode->file_extent_tree, 0, &start,
&end, EXTENT_DIRTY);
if (!ret && start == 0)
@@ -66,6 +66,7 @@ void btrfs_inode_safe_disk_i_size_write(struct btrfs_inode *inode, u64 new_i_siz
else
i_size = 0;
inode->disk_i_size = i_size;
+out_unlock:
spin_unlock(&inode->lock);
}
--
2.34.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH] btrfs: fix encoded write i_size corruption
2023-04-28 21:02 [PATCH] btrfs: fix encoded write i_size corruption Boris Burkov
@ 2023-04-29 19:54 ` Josef Bacik
2023-05-02 12:20 ` David Sterba
1 sibling, 0 replies; 3+ messages in thread
From: Josef Bacik @ 2023-04-29 19:54 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Fri, Apr 28, 2023 at 02:02:11PM -0700, Boris Burkov wrote:
> We have observed a btrfs filesystem corruption on workloads using
> NOHOLES and encoded writes via sendstream v2. The symptom is that a file
> appears to be truncated to the end of its last aligned extent, even
> though the final unaligned extent and even the file extent and otherwise
> correctly updated inode item have been written.
>
> So if we were writing out a 1MiB+X file via 8 128K extents and one
> extent of length X, isize would be set to 1MiB, but the ninth extent,
> nbyte, etc.. would all appear correct otherwise.
>
> The source of the race is a narrow (one code line..) window in which a
> noholes fs has read in an updated isize, but has not yet set a shared
> disk_i_size variable to write. Therefore, if two ordered extents run in
> parallel (par for the course for receive workloads), the following
> sequence can play out: (following "threads" a bit loosely, since there
> are callbacks involved for endio but extra threads aren't needed to
> cause the issue)
>
> ENC-WR1 (second to last) ENC-WR2 (last)
> ------- -------
> btrfs_do_encoded_write
> set isize = 1M
> submit bio B1 ending at 1M
> endio B1
> btrfs_inode_safe_disk_i_size_write
> local isize = 1M
> falls off a cliff for some reason
> btrfs_do_encoded_write
> set isize = 1M+X
> submit bio B2 ending at 1M+X
> endio B2
> btrfs_inode_safe_disk_i_size_write
> local isize = 1M+X
> disk_i_size = 1M+X
> disk_i_size = 1M
> btrfs_delayed_update_inode
> btrfs_delayed_update_inode
>
> And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
> writes respect isize and present a corruted file missing its last
> extents.
>
> Fix this by holding the inode lock in the noholes case so that a thread
> can't sneak in a write to disk_i_size that gets overwritten with an out
> of date isize.
>
> Fixes: 41a2ee75aab0290 btrfs: introduce per-inode file extent tree
> Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Thanks,
Josef
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH] btrfs: fix encoded write i_size corruption
2023-04-28 21:02 [PATCH] btrfs: fix encoded write i_size corruption Boris Burkov
2023-04-29 19:54 ` Josef Bacik
@ 2023-05-02 12:20 ` David Sterba
1 sibling, 0 replies; 3+ messages in thread
From: David Sterba @ 2023-05-02 12:20 UTC (permalink / raw)
To: Boris Burkov; +Cc: linux-btrfs, kernel-team
On Fri, Apr 28, 2023 at 02:02:11PM -0700, Boris Burkov wrote:
> We have observed a btrfs filesystem corruption on workloads using
> NOHOLES and encoded writes via sendstream v2. The symptom is that a file
> appears to be truncated to the end of its last aligned extent, even
> though the final unaligned extent and even the file extent and otherwise
> correctly updated inode item have been written.
>
> So if we were writing out a 1MiB+X file via 8 128K extents and one
> extent of length X, isize would be set to 1MiB, but the ninth extent,
> nbyte, etc.. would all appear correct otherwise.
>
> The source of the race is a narrow (one code line..) window in which a
> noholes fs has read in an updated isize, but has not yet set a shared
> disk_i_size variable to write. Therefore, if two ordered extents run in
> parallel (par for the course for receive workloads), the following
> sequence can play out: (following "threads" a bit loosely, since there
> are callbacks involved for endio but extra threads aren't needed to
> cause the issue)
>
> ENC-WR1 (second to last) ENC-WR2 (last)
> ------- -------
> btrfs_do_encoded_write
> set isize = 1M
> submit bio B1 ending at 1M
> endio B1
> btrfs_inode_safe_disk_i_size_write
> local isize = 1M
> falls off a cliff for some reason
> btrfs_do_encoded_write
> set isize = 1M+X
> submit bio B2 ending at 1M+X
> endio B2
> btrfs_inode_safe_disk_i_size_write
> local isize = 1M+X
> disk_i_size = 1M+X
> disk_i_size = 1M
> btrfs_delayed_update_inode
> btrfs_delayed_update_inode
>
> And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
> writes respect isize and present a corruted file missing its last
> extents.
>
> Fix this by holding the inode lock in the noholes case so that a thread
> can't sneak in a write to disk_i_size that gets overwritten with an out
> of date isize.
>
> Fixes: 41a2ee75aab0290 btrfs: introduce per-inode file extent tree
> Signed-off-by: Boris Burkov <boris@bur.io>
Added to misc-next, thanks.
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2023-05-02 12:26 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-04-28 21:02 [PATCH] btrfs: fix encoded write i_size corruption Boris Burkov
2023-04-29 19:54 ` Josef Bacik
2023-05-02 12:20 ` David Sterba
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox