From: Wang Yugui <wangyugui@e16-tech.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: fdmanana@gmail.com, linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: 'ls /mnt/scratch/' freeze(deadlock?) when run xfstest(btrfs/232)
Date: Thu, 22 Apr 2021 09:25:59 +0800 [thread overview]
Message-ID: <20210422092558.F39A.409509F4@e16-tech.com> (raw)
In-Reply-To: <aa9ffab6-02cb-16a0-794c-80a990c4f999@gmx.com>
[-- Attachment #1: Type: text/plain, Size: 4468 bytes --]
Hi,
> On 2021/4/22 上午8:32, Wang Yugui wrote:
> > Hi,
> >
> >>>>> we run xfstest on two server with this patch.
> >>>>> one passed the tests.
> >>>>> but one got a btrfs/232 error.
> >>>>>
> >>>>> btrfs/232 32s ... _check_btrfs_filesystem: filesystem on
> >>>>> /dev/nvme0n1p1is inconsistent
> >>>>> (see /usr/hpc-bio/xfstests/results//btrfs/232.full for details)
> >>>>> ...
> >>>>> [4/7] checking fs roots
> >>>>> root 5 inode 337 errors 400, nbytes wrong
> >>>>> ERROR: errors found in fs roots
> >>>>
> >>>> Ok, that's a different problem caused by something else.
> >>>> It's possible to be due to the recent refactorings for preparation to
> >>>> subpage block size.
> >>>
> >>> This error looks exactly what I have seen during subpage development.
> >>> The subpage bug is caused by incorrect btrfs_invalidatepage() though,
> >>> and not yet merged into misc-next anyway.
> >>>
> >>> I guess it's some error path not clearing extent states correctly, thus
> >>> leaving the inode nbytes accounting wrong.
> >>>
> >>> BTW, the new @in_reclaim_context parameter for start_delalloc_inodes()
> >>> is already in misc-next:
> >>> commit 3d45f221ce627d13e2e6ef3274f06750c84a6542
> >>> Author: Filipe Manana <fdmanana@suse.com>
> >>> Date:?? Wed Dec 2 11:55:58 2020 +0000
> >>>
> >>> ?? btrfs: fix deadlock when cloning inline extent and low on free
> >>> metadata space
> >>>
> >>> We only need to make btrfs_start_delalloc_snapshot() to accept the new
> >>> parameter and pass in_reclaim_context = true for qgroup.
> >>
> >> Strangely, on my subpage branch, with new @in_reclaim_context parameter
> >> added to btrfs_start_delalloc_snapshot(), I can't reproduce the nbytes
> >> mismatch error in 32 runs loop.
> >> I guess one of the refactor around ordered extents and invalidatepage
> >> may fix the problem by accident.
> >>
> >> Mind to test my subpage branch
> >> (https://github.com/adam900710/linux/tree/subpage) with the attached diff?
> >
> > The attached diff( more in_reclaim_context) seems a replacement for
> > https://pastebin.com/raw/U9GUZiEf (less in_reclaim_context).
>
> The attached diff is for subpage branch, as misc-next already has the
> parameter introduced for another bug.
> Thus only a small part is needed for subpage branch.
>
> >
> > so I will firstly test with the attached diff but drop
> > https://pastebin.com/raw/U9GUZiEf.
> >
> > The test of whole subpage branch will be done later.
>
> So far, I also tested the older misc-next branch, and unable to
> reproduce the problem.
> I guess some patch in misc-next has already solved the problem.
>
> If possible it would be better to provide the branch you're on so that
> we could do more tests to pin down the bug.
our branch is based on 5.12-rc8 but with some patch in misc-next.
The patch list added to 5.12-rc8
mmap race:(btrfs-mmap-race.patch)
Subject: [PATCH] btrfs: add a i_mmap_lock to our inode
Subject: [PATCH] btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock
Subject: [PATCH] btrfs: exclude mmaps while doing remap
Subject: [PATCH] btrfs: exclude mmap from happening during all fallocate
Subject: [PATCH] btrfs: fix race between memory mapped writes and fsync
Subject: [PATCH] btrfs: fix race between marking inode needs to be logged and
Subject: [PATCH] btrfs: remove stale comment and logic from
others(btrfs-in-todo.patch):
Subject: [PATCH] btrfs: Allow read-only mount with corrupted extent tree
Subject: [PATCH] btrfs: fix a potential hole-punching failure
Subject: [PATCH] btrfs: fix lockdep warning while mounting sprout fs
Subject: [PATCH] btrfs: fix race between transaction aborts and fsyncs leading
Subject: [PATCH] btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags
Subject: [PATCH] btrfs: return whole extents in fiemap
Subject: [PATCH] btrfs: use percpu_read_positive instead of sum_positive for
Subject: [PATCH] btrfs: zoned: move log tree node allocation out of
Subject: [PATCH] btrfs: fix exhaustion of the system chunk array due to
Subject: [PATCH] btrfs: fix metadata extent leak after failure to create
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
btrfs-add-btree-read-ahead-for-full-send-operations.patch
We try to build a stable btrfs code based on 5.12.
btrfs 5.4/5.10 have some performance problem fixed in 5.12 but
difficult to backport, and we not need the support of subpage and
zoned device that will come into 5.13.
Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2021/04/22
[-- Attachment #2: btrfs-in-todo.patch --]
[-- Type: application/octet-stream, Size: 71994 bytes --]
From f51320504b2560f7c28af2a03961900ea6401ff7 Mon Sep 17 00:00:00 2001
From: <davispuh@gmail.com>
Date: Sun, 21 Mar 2021 23:49:39 +0200
Subject: [PATCH] btrfs: Allow read-only mount with corrupted extent tree
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Currently if there's any corruption at all in extent tree
(eg. even single bit) then mounting will fail with:
"failed to read block groups: -5" (-EIO)
It happens because we immediately abort on first error when
searching in extent tree for block groups.
Now with this patch if `ignorebadroots` option is specified
then we handle such case and continue by removing already
created block groups and creating dummy block groups.
Signed-off-by: D¨¡vis Mos¨¡ns <davispuh@gmail.com>
---
fs/btrfs/block-group.c | 20 ++++++++++++++++++++
fs/btrfs/disk-io.c | 4 ++--
fs/btrfs/disk-io.h | 2 ++
3 files changed, 24 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 744b99d..f3e7ab0 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2098,6 +2098,26 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
ret = check_chunk_block_group_mappings(info);
error:
btrfs_free_path(path);
+
+ if (ret == -EIO && btrfs_test_opt(info, IGNOREBADROOTS)) {
+
+ if (btrfs_super_log_root(info->super_copy) != 0) {
+ btrfs_warn(info, "Ignoring tree-log replay due to extent tree corruption!");
+ btrfs_set_super_log_root(info->super_copy, 0);
+ }
+
+ btrfs_put_block_group_cache(info);
+ btrfs_stop_all_workers(info);
+ btrfs_free_block_groups(info);
+ ret = btrfs_init_workqueues(info, NULL);
+ if (ret)
+ return ret;
+ ret = btrfs_init_space_info(info);
+ if (ret)
+ return ret;
+ return fill_dummy_bgs(info);
+ }
+
return ret;
}
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 41b718c..fe7e9e6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2042,7 +2042,7 @@ static int read_backup_root(struct btrfs_fs_info *fs_info, u8 priority)
}
/* helper to cleanup workers */
-static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
+void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
{
btrfs_destroy_workqueue(fs_info->fixup_workers);
btrfs_destroy_workqueue(fs_info->delalloc_workers);
@@ -2209,7 +2209,7 @@ static void btrfs_init_qgroup(struct btrfs_fs_info *fs_info)
mutex_init(&fs_info->qgroup_rescan_lock);
}
-static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
+int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
struct btrfs_fs_devices *fs_devices)
{
u32 max_active = fs_info->thread_pool_size;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 0e7e952..41348c8 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -139,6 +139,8 @@ int btrfs_get_free_objectid(struct btrfs_root *root, u64 *objectid);
int btrfs_init_root_free_objectid(struct btrfs_root *root);
int __init btrfs_end_io_wq_init(void);
void __cold btrfs_end_io_wq_exit(void);
+void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info);
+int btrfs_init_workqueues(struct btrfs_fs_info *fs_info, struct btrfs_fs_devices *fs_devices);
#ifdef CONFIG_DEBUG_LOCK_ALLOC
void btrfs_set_buffer_lockdep_class(u64 objectid,
--
2.30.2
From 90a1e58ff338f1ff6c4cd56649fb636de9117df1 Mon Sep 17 00:00:00 2001
From: BingJing Chang <bingjingc@synology.com>
Date: Wed, 24 Mar 2021 17:31:10 +0800
Subject: [PATCH] btrfs: fix a potential hole-punching failure
In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existed holes can be skipped by calling
find_first_non_hole() to adjust *start and *len. However, if the given
len is invalid and large, when an EXTENT_MAP_HOLE extent is found, the
*len will not be set to zero because (em->start + em->len) is less than
(*start + *len). Then the ret will be 1 but the *len will not be set to
0. The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with -ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero. The
break-loop condition (ret && !len) will not meet. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with -ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko@synology.com>
Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a
already existed hole.")
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
---
fs/btrfs/file.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e155f0..dccb017 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2735,8 +2735,6 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
extent_info->file_offset += replace_len;
}
- cur_offset = drop_args.drop_end;
-
ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
if (ret)
break;
@@ -2756,7 +2754,9 @@ int btrfs_replace_file_extents(struct inode *inode, struct btrfs_path *path,
BUG_ON(ret); /* shouldn't happen */
trans->block_rsv = rsv;
- if (!extent_info) {
+ cur_offset = drop_args.drop_end;
+ len = end - cur_offset;
+ if (!extent_info && len) {
ret = find_first_non_hole(BTRFS_I(inode), &cur_offset,
&len);
if (unlikely(ret < 0))
--
2.30.2
From 4a822db5614bdf1c257f3b7def68d711f86a19e5 Mon Sep 17 00:00:00 2001
From: Anand Jain <anand.jain@oracle.com>
Date: Thu, 4 Mar 2021 02:10:35 +0800
Subject: [PATCH] btrfs: fix lockdep warning while mounting sprout fs
Following test case reproduces lockdep warning.
Test case:
DEV1=/dev/vdb
DEV2=/dev/vdc
umount /btrfs
run mkfs.btrfs -f $DEV1
run btrfstune -S 1 $DEV1
run mount $DEV1 /btrfs
run btrfs device add $DEV2 /btrfs -f
run umount /btrfs
run mount $DEV2 /btrfs
run umount /btrfs
The warning claims a possible ABBA deadlock between the threads initiated by
[#1] btrfs device add and [#0] the mount.
======================================================
[ 540.743122] WARNING: possible circular locking dependency detected
[ 540.743129] 5.11.0-rc7+ #5 Not tainted
[ 540.743135] ------------------------------------------------------
[ 540.743142] mount/2515 is trying to acquire lock:
[ 540.743149] ffffa0c5544c2ce0 (&fs_devs->device_list_mutex){+.+.}-{4:4}, at: clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.743458] but task is already holding lock:
[ 540.743461] ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743541] which lock already depends on the new lock.
[ 540.743543] the existing dependency chain (in reverse order) is:
[ 540.743546] -> #1 (btrfs-chunk-00){++++}-{4:4}:
[ 540.743566] down_read_nested+0x48/0x2b0
[ 540.743585] __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.743650] btrfs_read_lock_root_node+0x70/0x200 [btrfs]
[ 540.743733] btrfs_search_slot+0x6c6/0xe00 [btrfs]
[ 540.743785] btrfs_update_device+0x83/0x260 [btrfs]
[ 540.743849] btrfs_finish_chunk_alloc+0x13f/0x660 [btrfs] <--- device_list_mutex
[ 540.743911] btrfs_create_pending_block_groups+0x18d/0x3f0 [btrfs]
[ 540.743982] btrfs_commit_transaction+0x86/0x1260 [btrfs]
[ 540.744037] btrfs_init_new_device+0x1600/0x1dd0 [btrfs]
[ 540.744101] btrfs_ioctl+0x1c77/0x24c0 [btrfs]
[ 540.744166] __x64_sys_ioctl+0xe4/0x140
[ 540.744170] do_syscall_64+0x4b/0x80
[ 540.744174] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 540.744180] -> #0 (&fs_devs->device_list_mutex){+.+.}-{4:4}:
[ 540.744184] __lock_acquire+0x155f/0x2360
[ 540.744188] lock_acquire+0x10b/0x5c0
[ 540.744190] __mutex_lock+0xb1/0xf80
[ 540.744193] mutex_lock_nested+0x27/0x30
[ 540.744196] clone_fs_devices+0x6d/0x210 [btrfs]
[ 540.744270] btrfs_read_chunk_tree+0x3c7/0xbb0 [btrfs]
[ 540.744336] open_ctree+0xf6e/0x2074 [btrfs]
[ 540.744406] btrfs_mount_root.cold.72+0x16/0x127 [btrfs]
[ 540.744472] legacy_get_tree+0x38/0x90
[ 540.744475] vfs_get_tree+0x30/0x140
[ 540.744478] fc_mount+0x16/0x60
[ 540.744482] vfs_kern_mount+0x91/0x100
[ 540.744484] btrfs_mount+0x1e6/0x670 [btrfs]
[ 540.744536] legacy_get_tree+0x38/0x90
[ 540.744537] vfs_get_tree+0x30/0x140
[ 540.744539] path_mount+0x8d8/0x1070
[ 540.744541] do_mount+0x8d/0xc0
[ 540.744543] __x64_sys_mount+0x125/0x160
[ 540.744545] do_syscall_64+0x4b/0x80
[ 540.744547] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 540.744551] other info that might help us debug this:
[ 540.744552] Possible unsafe locking scenario:
[ 540.744553] CPU0 CPU1
[ 540.744554] ---- ----
[ 540.744555] lock(btrfs-chunk-00);
[ 540.744557] lock(&fs_devs->device_list_mutex);
[ 540.744560] lock(btrfs-chunk-00);
[ 540.744562] lock(&fs_devs->device_list_mutex);
[ 540.744564]
*** DEADLOCK ***
[ 540.744565] 3 locks held by mount/2515:
[ 540.744567] #0: ffffa0c56bf7a0e0 (&type->s_umount_key#42/1){+.+.}-{4:4}, at: alloc_super.isra.16+0xdf/0x450
[ 540.744574] #1: ffffffffc05a9628 (uuid_mutex){+.+.}-{4:4}, at: btrfs_read_chunk_tree+0x63/0xbb0 [btrfs]
[ 540.744640] #2: ffffa0c54a7932b8 (btrfs-chunk-00){++++}-{4:4}, at: __btrfs_tree_read_lock+0x32/0x200 [btrfs]
[ 540.744708]
stack backtrace:
[ 540.744712] CPU: 2 PID: 2515 Comm: mount Not tainted 5.11.0-rc7+ #5
But the device_list_mutex in clone_fs_devices() is redundant, as explained
below.
Two threads [1] and [2] (below) could lead to clone_fs_device().
[1]
open_ctree <== mount sprout fs
btrfs_read_chunk_tree()
mutex_lock(&uuid_mutex) <== global lock
read_one_dev()
open_seed_devices()
clone_fs_devices() <== seed fs_devices
mutex_lock(&orig->device_list_mutex) <== seed fs_devices
[2]
btrfs_init_new_device() <== sprouting
mutex_lock(&uuid_mutex); <== global lock
btrfs_prepare_sprout()
lockdep_assert_held(&uuid_mutex)
clone_fs_devices(seed_fs_device) <== seed fs_devices
Both of these threads hold uuid_mutex which is sufficient to protect
getting the seed device(s) freed while we are trying to clone it for
sprouting [2] or mounting a sprout [1] (as above). A mounted seed
device can not free/write/replace because it is read-only. An unmounted
seed device can free by btrfs_free_stale_devices(), but it needs uuid_mutex.
So this patch removes the unnecessary device_list_mutex in clone_fs_devices().
And adds a lockdep_assert_held(&uuid_mutex) in clone_fs_devices().
Reported-by: Su Yue <l@damenly.su>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
fs/btrfs/volumes.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bc3b33e..4188edb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -570,6 +570,8 @@ static int btrfs_free_stale_devices(const char *path,
struct btrfs_device *device, *tmp_device;
int ret = 0;
+ lockdep_assert_held(&uuid_mutex);
+
if (path)
ret = -ENOENT;
@@ -1000,11 +1002,12 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
struct btrfs_device *orig_dev;
int ret = 0;
+ lockdep_assert_held(&uuid_mutex);
+
fs_devices = alloc_fs_devices(orig->fsid, NULL);
if (IS_ERR(fs_devices))
return fs_devices;
- mutex_lock(&orig->device_list_mutex);
fs_devices->total_devices = orig->total_devices;
list_for_each_entry(orig_dev, &orig->devices, dev_list) {
@@ -1036,10 +1039,8 @@ static struct btrfs_fs_devices *clone_fs_devices(struct btrfs_fs_devices *orig)
device->fs_devices = fs_devices;
fs_devices->num_devices++;
}
- mutex_unlock(&orig->device_list_mutex);
return fs_devices;
error:
- mutex_unlock(&orig->device_list_mutex);
free_fs_devices(fs_devices);
return ERR_PTR(ret);
}
--
2.30.1
From c5f3d7f7cc894a9cd06082ac1bc58bdd5056ffaf Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Mon, 5 Apr 2021 12:32:16 +0100
Subject: [PATCH] btrfs: fix race between transaction aborts and fsyncs leading
to use-after-free
There is a race between a task aborting a transaction during a commit,
a task doing an fsync and the transaction kthread, which leads to an
use-after-free of the log root tree. When this happens, it results in a
stack trace like the following:
[99678.547335] BTRFS info (device dm-0): forced readonly
[99678.547340] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[99678.547341] BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
[99678.547373] BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
[99678.547533] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[99678.548743] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
[99678.549188] BTRFS error (device dm-0): error writing primary super block to device 1
[99678.551100] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
[99678.551149] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
[99678.551205] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
[99678.551401] BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
[99678.565169] BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
[99678.566132] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[99678.567526] CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
[99678.568531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[99678.569980] RIP: 0010:__mutex_lock+0x139/0xa40
[99678.570556] Code: c0 74 19 (...)
[99678.573752] RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
[99678.574723] RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
[99678.576027] RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
[99678.577314] RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
[99678.578601] R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
[99678.579890] R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
[99678.581282] FS: 00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
[99678.582818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[99678.583771] CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
[99678.584600] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[99678.585425] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[99678.586247] Call Trace:
[99678.586542] ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
[99678.587260] ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
[99678.587930] ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
[99678.588573] btrfs_sync_log+0x7c1/0xf20 [btrfs]
[99678.589222] btrfs_sync_file+0x40c/0x580 [btrfs]
[99678.589947] do_fsync+0x38/0x70
[99678.590514] __x64_sys_fsync+0x10/0x20
[99678.591196] do_syscall_64+0x33/0x80
[99678.591829] entry_SYSCALL_64_after_hwframe+0x44/0xae
[99678.592744] RIP: 0033:0x7fa9142a55c3
[99678.593403] Code: 8b 15 09 (...)
[99678.596777] RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[99678.598143] RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
[99678.599450] RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
[99678.600770] RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
[99678.602067] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
[99678.603380] R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
[99678.604714] Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
[99678.616646] ---[ end trace ee2f1b19327d791d ]---
The steps that lead to this crash are the following:
1) We are at transaction N;
2) We have two tasks with a transaction handle attached to transaction N.
Task A and Task B. Task B is doing an fsync;
3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
into a local variable named 'log_root_tree' at the top of
btrfs_sync_log(). Task B is about to call write_all_supers(), but
before that...
4) Task A calls btrfs_commit_transaction(), and after it sets the
transaction state to TRANS_STATE_COMMIT_START, an error happens before
it waits for the transaction's 'num_writers' counter to reach a value
of 1 (no one else attached to the transaction), so it jumps to the
label "cleanup_transaction";
5) Task A then calls cleanup_transaction(), where it aborts the
transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
setting the ->aborted field of the transaction and the handle to an
errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.
After that, at cleanup_transaction(), it deletes the transaction from
the list of transactions (fs_info->trans_list), sets the transaction
to the state TRANS_STATE_COMMIT_DOING and then waits for the number
of writers to go down to 1, as it's currently 2 (1 for task A and 1
for task B);
6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().
There it sees the list fs_info->trans_list is empty, and then proceeds
into calling btrfs_drop_all_logs(), which frees the log root tree with
a call to btrfs_free_log_root_tree();
7) Task B calls write_all_supers() and, shortly after, under the label
'out_wake_log_root', it deferences the pointer stored in
'log_root_tree', which was already freed in the previous step by the
transaction kthread. This results in a use-after-free leading to a
crash.
Fix this by deleting the transaction from the list of transactions at
cleanup_transaction() only after setting the transaction state to
TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
attached to the transaction to release their transaction handles.
This makes the transaction kthread wait for all the tasks attached to
the transaction to be done with the transaction before dropping the
log roots and doing other cleanups.
Fixes: ef67963dac255b ("btrfs: drop logs when we've aborted a transaction")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/transaction.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index acff6bb..3f2c4ee 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1961,7 +1961,6 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
*/
BUG_ON(list_empty(&cur_trans->list));
- list_del_init(&cur_trans->list);
if (cur_trans == fs_info->running_transaction) {
cur_trans->state = TRANS_STATE_COMMIT_DOING;
spin_unlock(&fs_info->trans_lock);
@@ -1970,6 +1969,17 @@ static void cleanup_transaction(struct btrfs_trans_handle *trans, int err)
spin_lock(&fs_info->trans_lock);
}
+
+ /*
+ * Now that we know no one else is still using the transaction we can
+ * remove the transaction from the list of transactions. This avoids
+ * the transaction kthread from cleaning up the transaction while some
+ * other task is still using it, which could result in a use-after-free
+ * on things like log trees, as it forces the transaction kthread to
+ * wait for this transaction to be cleaned up by us.
+ */
+ list_del_init(&cur_trans->list);
+
spin_unlock(&fs_info->trans_lock);
btrfs_cleanup_one_transaction(trans->transaction, fs_info);
--
2.30.2
From 6dd0a5410f124060bc19d77f1023cefec3102aaa Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 23 Mar 2021 18:39:49 +0000
Subject: [PATCH] btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags
If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
file that has the S_SYNC attribute set, we totally ignore that and do not
durably persist the reflink changes. Since a reflink can change the data
readable from a file (and mtime/ctime, or a file size), it makes sense to
durably persist (fsync) the source and destination files/ranges.
This was previously discussed at:
https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
The recently introduced test case generic/628, from fstests, exercises
these scenarios and currently fails without this change.
So make sure we fsync the source and destination files/ranges when either
of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
just like XFS already does.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/reflink.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 7786c49..89d543f 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -834,6 +834,16 @@ static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
len, remap_flags);
}
+static bool file_sync_write(struct file *file)
+{
+ if (file->f_flags & (__O_SYNC | O_DSYNC))
+ return true;
+ if (IS_SYNC(file_inode(file)))
+ return true;
+
+ return false;
+}
+
loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
struct file *dst_file, loff_t destoff, loff_t len,
unsigned int remap_flags)
@@ -871,5 +881,19 @@ out_unlock:
unlock_two_nondirectories(src_inode, dst_inode);
}
+ /*
+ * If either the source or the destination file was opened with O_SYNC,
+ * O_DSYNC or has the S_SYNC attribute, fsync both the destination and
+ * source files/ranges, so that after a successful return (0) followed
+ * by a power failure results in the reflinked data to be readable from
+ * both files/ranges.
+ */
+ if (ret == 0 && len > 0 && (file_sync_write(src_file) || file_sync_write(dst_file))) {
+ ret = btrfs_sync_file(src_file, off, off + len - 1, 0);
+ if (ret == 0)
+ ret = btrfs_sync_file(dst_file, destoff,
+ destoff + len - 1, 0);
+ }
+
return ret < 0 ? ret : len;
}
--
2.30.2
From 430a22159b16cf7bcc34ec5339f025578645b55a Mon Sep 17 00:00:00 2001
From: Boris Burkov <boris@bur.io>
Date: Tue, 6 Apr 2021 15:31:18 -0700
Subject: [PATCH] btrfs: return whole extents in fiemap
`xfs_io -c 'fiemap <off> <len>' <file>`
can give surprising results on btrfs that differ from xfs.
btrfs spits out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.
Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.
Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630
I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.
after the fix, all the above commands give:
0: [0..7]: 26624..26631
The merging logic is exercised by xfstest generic/483, and I have
written a new xfstest for checking we don't have backwards or
zero-length fiemaps for cases like those above.
Signed-off-by: Boris Burkov <boris@bur.io>
---
fs/btrfs/extent_io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 910769d..50c8e34 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4975,7 +4975,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
u64 start, u64 len)
{
int ret = 0;
- u64 off = start;
+ u64 off = 0;
u64 max = start + len;
u32 flags = 0;
u32 found_type;
--
2.30.2
From ad22469f0eaf3aa400d6c7797b354b41c1bf82ec Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 24 Mar 2021 09:44:21 -0400
Subject: [PATCH] btrfs: use percpu_read_positive instead of sum_positive for
need_preempt
Looking at perf data for a fio workload I noticed that we were spending
a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
need_preemptive_reclaim. This is silly, as we only want to know if we
have more ordered than delalloc to see if we should be counting the
delayed items in our threshold calculation. Change this to
percpu_read_positive() to avoid the overhead.
I ran this through fsperf to validate the changes, obviously the latency
numbers in dbench and fio are quite jittery, so take them as you wish,
but overall the improvements on throughput, iops, and bw are all
positive. Each test was run two times, the given value is the average
of both runs for their respective column.
btrfs ssd normal test results
bufferedrandwrite16g results
metric baseline current diff
==========================================================
write_io_kbytes 16777216 16777216 0.00%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 1.04e+08 1.05e+08 1.12%
read_iops 0 0 0.00%
write_clat_ns_p50 13888 11840 -14.75%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 35008 29312 -16.27%
read_bw_bytes 0 0 0.00%
elapsed 170 167 -1.76%
write_lat_ns_min 4221.50 3762.50 -10.87%
sys_cpu 39.65 35.37 -10.79%
write_lat_ns_max 2.67e+10 2.50e+10 -6.63%
read_lat_ns_min 0 0 0.00%
write_iops 25270.10 25553.43 1.12%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
dbench60 results
metric baseline current diff
==================================================
qpathinfo 11.12 12.73 14.52%
throughput 416.09 445.66 7.11%
flush 3485.63 1887.55 -45.85%
qfileinfo 0.70 1.92 173.86%
ntcreatex 992.60 695.76 -29.91%
qfsinfo 2.43 3.71 52.48%
close 1.67 3.14 88.09%
sfileinfo 66.54 105.20 58.10%
rename 809.23 619.59 -23.43%
find 16.88 15.46 -8.41%
unlink 820.54 670.86 -18.24%
writex 3375.20 2637.91 -21.84%
deltree 386.33 449.98 16.48%
readx 3.43 3.41 -0.60%
mkdir 0.05 0.03 -38.46%
lockx 0.26 0.26 -0.76%
unlockx 0.81 0.32 -60.33%
dio4kbs16threads results
metric baseline current diff
================================================================
write_io_kbytes 5249676 3357150 -36.05%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 89583501.50 57291192.50 -36.05%
read_iops 0 0 0.00%
write_clat_ns_p50 242688 263680 8.65%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 15826944 36732928 132.09%
read_bw_bytes 0 0 0.00%
elapsed 61 61 0.00%
write_lat_ns_min 42704 42095 -1.43%
sys_cpu 5.27 3.45 -34.52%
write_lat_ns_max 7.43e+08 9.27e+08 24.71%
read_lat_ns_min 0 0 0.00%
write_iops 21870.97 13987.11 -36.05%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
randwrite2xram results
metric baseline current diff
================================================================
write_io_kbytes 24831972 28876262 16.29%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 83745273.50 92182192.50 10.07%
read_iops 0 0 0.00%
write_clat_ns_p50 13952 11648 -16.51%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 50176 52992 5.61%
read_bw_bytes 0 0 0.00%
elapsed 314 332 5.73%
write_lat_ns_min 5920.50 5127 -13.40%
sys_cpu 7.82 7.35 -6.07%
write_lat_ns_max 5.27e+10 3.88e+10 -26.44%
read_lat_ns_min 0 0 0.00%
write_iops 20445.62 22505.42 10.07%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
untarfirefox results
metric baseline current diff
==============================================
elapsed 47.41 47.40 -0.03%
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
fs/btrfs/space-info.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c
index 2da6177..2dc674b 100644
--- a/fs/btrfs/space-info.c
+++ b/fs/btrfs/space-info.c
@@ -861,8 +861,8 @@ static bool need_preemptive_reclaim(struct btrfs_fs_info *fs_info,
* of heavy DIO or ordered reservations, preemptive flushing will just
* waste time and cause us to slow down.
*/
- ordered = percpu_counter_sum_positive(&fs_info->ordered_bytes);
- delalloc = percpu_counter_sum_positive(&fs_info->delalloc_bytes);
+ ordered = percpu_counter_read_positive(&fs_info->ordered_bytes);
+ delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);
if (ordered >= delalloc)
used += fs_info->delayed_refs_rsv.reserved +
fs_info->delayed_block_rsv.reserved;
--
2.30.2
From 8c789827084f4130cfda754833e44ee02735d661 Mon Sep 17 00:00:00 2001
From: Naohiro Aota <naohiro.aota@wdc.com>
Date: Wed, 24 Mar 2021 23:23:11 +0900
Subject: [PATCH] btrfs: zoned: move log tree node allocation out of
log_root_tree->log_mutex
Commit 6e37d2459941 ("btrfs: zoned: fix deadlock on log sync") pointed out
a deadlock warning and removed
mutex_{lock,unlock}(&fs_info->tree_root->log_mutex). While it looks like it
always cause a deadlock, we didn't see actual deadlock in fstests runs. The
reason is log_root_tree->log_mutex != fs_info->tree_root->log_mutex, not
taking the same lock. So, the warning was actually a false-positive.
Since btrfs_alloc_log_tree_node() is protected only by
fs_info->tree_root->log_mutex, we can (and should) move the code out of the
lock scope of log_root_tree->log_mutex and silence the warning.
Cc: Filipe Manana <fdmanana@suse.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
---
fs/btrfs/tree-log.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 92a3686..72c4b66 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -3165,20 +3165,22 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
*/
mutex_unlock(&root->log_mutex);
- btrfs_init_log_ctx(&root_log_ctx, NULL);
-
- mutex_lock(&log_root_tree->log_mutex);
-
if (btrfs_is_zoned(fs_info)) {
+ mutex_lock(&fs_info->tree_root->log_mutex);
if (!log_root_tree->node) {
ret = btrfs_alloc_log_tree_node(trans, log_root_tree);
if (ret) {
- mutex_unlock(&log_root_tree->log_mutex);
+ mutex_unlock(&fs_info->tree_log_mutex);
goto out;
}
}
+ mutex_unlock(&fs_info->tree_root->log_mutex);
}
+ btrfs_init_log_ctx(&root_log_ctx, NULL);
+
+ mutex_lock(&log_root_tree->log_mutex);
+
index2 = log_root_tree->log_transid % 2;
list_add_tail(&root_log_ctx.list, &log_root_tree->log_ctxs[index2]);
root_log_ctx.log_transid = log_root_tree->log_transid;
--
2.30.2
From 3845534b3a865e298d79b92a0fc8826576c280e9 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Wed, 31 Mar 2021 11:55:50 +0100
Subject: [PATCH] btrfs: fix exhaustion of the system chunk array due to
concurrent allocations
When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno -EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:
[ 1359.518899] ------------[ cut here ]------------
[ 1359.518980] BTRFS: Transaction aborted (error -27)
[ 1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[ 1359.519152] Modules linked in: (...)
[ 1359.519239] Supported: Yes, External
[ 1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3
[ 1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
[ 1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default)
[ 1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007
[ 1359.519356] CFAR: c00000000013e170 IRQMASK: 0
[ 1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
[ 1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
[ 1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
[ 1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
[ 1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
[ 1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
[ 1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
[ 1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
[ 1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[ 1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
[ 1359.519626] Call Trace:
[ 1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
[ 1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
[ 1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
[ 1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
[ 1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
[ 1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
[ 1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
[ 1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
[ 1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
[ 1359.520037] Instruction dump:
[ 1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
[ 1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
[ 1359.520122] ---[ end trace d6c186e151022e20 ]---
The following steps explain how we can end up in this situation:
1) Task A is at check_system_chunk(), either because it is allocating a
new data or metadata block group, at btrfs_chunk_alloc(), or because
it is removing a block group or turning a block group RO. It does not
matter why;
2) Task A sees that there is not enough free space in the system
space_info object, that is 'left' is < 'thresh'. And at this point
the system space_info has a value of 0 for its 'bytes_may_use'
counter;
3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
a new system block group (chunk) and then reserves 'thresh' bytes in
the chunk block reserve with the call to btrfs_block_rsv_add(). This
changes the chunk block reserve's 'reserved' and 'size' counters by an
amount of 'thresh', and changes the 'bytes_may_use' counter of the
system space_info object from 0 to 'thresh'.
Also during its call to btrfs_alloc_chunk(), we end up increasing the
value of the 'total_bytes' counter of the system space_info object by
8MiB (the size of a system chunk stripe). This happens through the
call chain:
btrfs_alloc_chunk()
create_chunk()
btrfs_make_block_group()
btrfs_update_space_info()
4) After it finishes the first phase of the block group allocation, at
btrfs_chunk_alloc(), task A unlocks the chunk mutex;
5) At this point the new system block group was added to the transaction
handle's list of new block groups, but its block group item, device
items and chunk item were not yet inserted in the extent, device and
chunk trees, respectively. That only happens later when we call
btrfs_finish_chunk_alloc() through a call to
btrfs_create_pending_block_groups();
Note that only when we update the chunk tree, through the call to
btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
of the chunk block reserve as we COW/allocate extent buffers,
through:
btrfs_alloc_tree_block()
btrfs_use_block_rsv()
btrfs_block_rsv_use_bytes()
And the system space_info's 'bytes_may_use' is decremented everytime
we allocate an extent buffer for COW operations on the chunk tree,
through:
btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
btrfs_add_reserved_bytes()
If we end up COWing less chunk btree nodes/leaves than expected, which
is the typical case since the amount of space we reserve is always
pessimistic to account for the worst possible case, we release the
unused space through:
btrfs_create_pending_block_groups()
btrfs_trans_release_chunk_metadata()
btrfs_block_rsv_release()
block_rsv_release_bytes()
btrfs_space_info_free_bytes_may_use()
But before task A gets into btrfs_create_pending_block_groups()...
6) Many other tasks start allocating new block groups through fallocate,
each one does the first phase of block group allocation in a
serialized way, since btrfs_chunk_alloc() takes the chunk mutex
before calling check_system_chunk() and btrfs_alloc_chunk().
However before everyone enters the final phase of the block group
allocation, that is, before calling btrfs_create_pending_block_groups(),
new tasks keep coming to allocate new block groups and while at
check_system_chunk(), the system space_info's 'bytes_may_use' keeps
increasing each time a task reserves space in the chunk block reserve.
This means that eventually some other task can end up not seeing enough
free space in the system space_info and decide to allocate yet another
system chunk.
This may repeat several times if yet more new tasks keep allocating
new block groups before task A, and all the other tasks, finish the
creation of the pending block groups, which is when reserved space
in excess is released. Eventually this can result in exhaustion of
system chunk array in the superblock, with btrfs_add_system_chunk()
returning -EFBIG, resulting later in a transaction abort.
Even when we don't reach the extreme case of exhausting the system
array, most, if not all, unnecessarily created system block groups
end up being unused since when finishing creation of the first
pending system block group, the creation of the following ones end
up not needing to COW nodes/leaves of the chunk tree, so we never
allocate and deallocate from them, resulting in them never being
added to the list of unused block groups - as a consequence they
don't get deleted by the cleaner kthread - the only exceptions are
if we unmount and mount the filesystem again, which adds any unused
block groups to the list of unused block groups, if a scrub is
run, which also adds unused block groups to the unused list, and
under some circumstances when using a zoned filesystem or async
discard, which may also add unused block groups to the unused list.
So fix this by:
*) Tracking the number of reserved bytes for the chunk tree per
transaction, which is the sum of reserved chunk bytes by each
transaction handle currently being used;
*) When there is not enough free space in the system space_info,
if there are other transaction handles which reserved chunk space,
wait for some of them to complete in order to have enough excess
reserved space released, and then try again. Otherwise proceed with
the creation of a new system chunk.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/block-group.c | 58 +++++++++++++++++++++++++++++++++++++++++-
fs/btrfs/transaction.c | 5 ++++
fs/btrfs/transaction.h | 7 +++++
3 files changed, 69 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 744b99d..a7d9e14 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -3269,6 +3269,7 @@ static u64 get_profile_num_devs(struct btrfs_fs_info *fs_info, u64 type)
*/
void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
{
+ struct btrfs_transaction *cur_trans = trans->transaction;
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_space_info *info;
u64 left;
@@ -3283,6 +3284,7 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
lockdep_assert_held(&fs_info->chunk_mutex);
info = btrfs_find_space_info(fs_info, BTRFS_BLOCK_GROUP_SYSTEM);
+again:
spin_lock(&info->lock);
left = info->total_bytes - btrfs_space_info_used(info, true);
spin_unlock(&info->lock);
@@ -3301,6 +3303,58 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
if (left < thresh) {
u64 flags = btrfs_system_alloc_profile(fs_info);
+ u64 reserved = atomic64_read(&cur_trans->chunk_bytes_reserved);
+
+ /*
+ * If there's not available space for the chunk tree (system
+ * space) and there are other tasks that reserved space for
+ * creating a new system block group, wait for them to complete
+ * the creation of their system block group and release excess
+ * reserved space. We do this because:
+ *
+ * *) We can end up allocating more system chunks than necessary
+ * when there are multiple tasks that are concurrently
+ * allocating block groups, which can lead to exhaustion of
+ * the system array in the superblock;
+ *
+ * *) If we allocate extra and unnecessary system block groups,
+ * despite being empty for a long time, and possibly forever,
+ * they end not being added to the list of unused block groups
+ * because that typically happens only when deallocating the
+ * last extent from a block group - which never happens since
+ * we never allocate from them in the first place. The few
+ * exceptions are when mounting a filesystem or running scrub,
+ * which add unused block groups to the list of unused block
+ * groups, to be deleted by the cleaner kthread.
+ * And even when they are added to the list of unused block
+ * groups, it can take a long time until they get deleted,
+ * since the cleaner kthread might be sleeping or busy with
+ * other work (deleting subvolumes, running delayed iputs,
+ * defrag scheduling, etc);
+ *
+ * This is rare in practice, but can happen when too many tasks
+ * are allocating blocks groups in parallel (via fallocate())
+ * and before the one that reserved space for a new system block
+ * group finishes the block group creation and releases the space
+ * reserved in excess (at btrfs_create_pending_block_groups()),
+ * other tasks end up here and see free system space temporarily
+ * not enough for updating the chunk tree.
+ *
+ * We unlock the chunk mutex before waiting for such tasks and
+ * lock it again after the wait, otherwise we would deadlock.
+ * It is safe to do so because allocating a system chunk is the
+ * first thing done while allocating a new block group.
+ */
+ if (reserved > trans->chunk_bytes_reserved) {
+ const u64 min_needed = reserved - thresh;
+
+ mutex_unlock(&fs_info->chunk_mutex);
+ wait_event(cur_trans->chunk_reserve_wait,
+ atomic64_read(&cur_trans->chunk_bytes_reserved) <=
+ min_needed);
+ mutex_lock(&fs_info->chunk_mutex);
+ goto again;
+ }
/*
* Ignore failure to create system chunk. We might end up not
@@ -3315,8 +3369,10 @@ void check_system_chunk(struct btrfs_trans_handle *trans, u64 type)
ret = btrfs_block_rsv_add(fs_info->chunk_root,
&fs_info->chunk_block_rsv,
thresh, BTRFS_RESERVE_NO_FLUSH);
- if (!ret)
+ if (!ret) {
+ atomic64_add(thresh, &cur_trans->chunk_bytes_reserved);
trans->chunk_bytes_reserved += thresh;
+ }
}
}
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 3f2c4ee..d56d3e7 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -260,6 +260,7 @@ static inline int extwriter_counter_read(struct btrfs_transaction *trans)
void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
{
struct btrfs_fs_info *fs_info = trans->fs_info;
+ struct btrfs_transaction *cur_trans = trans->transaction;
if (!trans->chunk_bytes_reserved)
return;
@@ -268,6 +269,8 @@ void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans)
btrfs_block_rsv_release(fs_info, &fs_info->chunk_block_rsv,
trans->chunk_bytes_reserved, NULL);
+ atomic64_sub(trans->chunk_bytes_reserved, &cur_trans->chunk_bytes_reserved);
+ cond_wake_up(&cur_trans->chunk_reserve_wait);
trans->chunk_bytes_reserved = 0;
}
@@ -383,6 +386,8 @@ loop:
spin_lock_init(&cur_trans->dropped_roots_lock);
INIT_LIST_HEAD(&cur_trans->releasing_ebs);
spin_lock_init(&cur_trans->releasing_ebs_lock);
+ atomic64_set(&cur_trans->chunk_bytes_reserved, 0);
+ init_waitqueue_head(&cur_trans->chunk_reserve_wait);
list_add_tail(&cur_trans->list, &fs_info->trans_list);
extent_io_tree_init(fs_info, &cur_trans->dirty_pages,
IO_TREE_TRANS_DIRTY_PAGES, fs_info->btree_inode);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6335716..4c95ca5 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -96,6 +96,13 @@ struct btrfs_transaction {
spinlock_t releasing_ebs_lock;
struct list_head releasing_ebs;
+
+ /*
+ * The number of bytes currently reserved, by all transaction handles
+ * attached to this transaction, for metadata extents of the chunk tree.
+ */
+ atomic64_t chunk_bytes_reserved;
+ wait_queue_head_t chunk_reserve_wait;
};
#define __TRANS_FREEZABLE (1U << 0)
--
2.30.2
From 21dc044f4373fba79f231ea273d9b259279f3e12 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 20 Apr 2021 10:55:12 +0100
Subject: [PATCH] btrfs: fix metadata extent leak after failure to create
subvolume
When creating a subvolume we allocate an extent buffer for its root node
after starting a transaction. We setup a root item for the subvolume that
points to that extent buffer and then attempt to insert the root item into
the root tree - however if that fails, due to -ENOMEM for example, we do
not free the extent buffer previously allocated and we do not abort the
transaction (as at that point we did nothing that can not be undone).
This means that we effectively do not return the metadata extent back to
the free space cache/tree and we leave a delayed reference for it which
causes a metadata extent item to be added to the extent tree, in the next
transaction commit, without having backreferences. When this happens
'btrfs check' reports the following:
$ btrfs check /dev/sdi
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
[1/7] checking root items
[2/7] checking extents
ref mismatch on [30425088 16384] extent item 1, found 0
backref 30425088 root 256 not referenced back 0x564a91c23d70
incorrect global backref count on 30425088 found 1 wanted 0
backpointer mismatch on [30425088 16384]
owner ref check failed [30425088 16384]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 212992 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 131072
total fs tree bytes: 32768
total extent tree bytes: 16384
btree space waste bytes: 124669
file data blocks allocated: 65536
referenced 65536
So fix this by freeing the metadata extent if btrfs_insert_root() returns
an error.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ioctl.c | 18 +++++++++++++++---
1 file changed, 15 insertions(+), 3 deletions(-)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ede09df..e999c51 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -697,8 +697,6 @@ static noinline int create_subvol(struct inode *dir,
btrfs_set_root_otransid(root_item, trans->transid);
btrfs_tree_unlock(leaf);
- free_extent_buffer(leaf);
- leaf = NULL;
btrfs_set_root_dirid(root_item, BTRFS_FIRST_FREE_OBJECTID);
@@ -707,8 +705,22 @@ static noinline int create_subvol(struct inode *dir,
key.type = BTRFS_ROOT_ITEM_KEY;
ret = btrfs_insert_root(trans, fs_info->tree_root, &key,
root_item);
- if (ret)
+ if (ret) {
+ /*
+ * Since we don't abort the transaction in this case, free the
+ * tree block so that we don't leak space and leave the filesytem
+ * in an inconsistent state (an extent item in the extent tree
+ * without backreferences). Also no need to have the tree block
+ * locked since it is not in any tree at this point, so no other
+ * task can find it and use it.
+ */
+ btrfs_free_tree_block(trans, root, leaf, 0, 1);
+ free_extent_buffer(leaf);
goto fail;
+ }
+
+ free_extent_buffer(leaf);
+ leaf = NULL;
key.offset = (u64)-1;
new_root = btrfs_get_new_fs_root(fs_info, objectid, anon_dev);
--
2.30.2
From 09253899214ce75714442a28c68e5f2bd9e3405d Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 20 Apr 2021 10:55:44 +0100
Subject: [PATCH] btrfs: fix race when picking most recent mod log operation
for an old root
Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
rewind of an old root"), fixed a race when we need to rewind the extent
buffer of an old root. It was caused by picking a new mod log operation
for the extent buffer while getting a cloned extent buffer with an outdated
number of items (off by -1), because we cloned the extent buffer without
locking it first.
However there is still another similar race, but in the opposite direction.
The cloned extent buffer has a number of items that does not match the
number of tree mod log operations that are going to be replayed. This is
because right after we got the last (most recent) tree mod log operation to
replay and before locking and cloning the extent buffer, another task adds
a new pointer to the extent buffer, which results in adding a new tree mod
log operation and incrementing the number of items in the extent buffer.
So after cloning we have mismatch between the number of items in the extent
buffer and the number of mod log operations we are going to apply to it.
This results in hitting a BUG_ON() that produces the following stack trace:
[145427.689964][ T4811] ------------[ cut here ]------------
[145427.692498][ T4811] kernel BUG at fs/btrfs/tree-mod-log.c:675!
[145427.694668][ T4811] invalid opcode: 0000 [#1] SMP KASAN PTI
[145427.696379][ T4811] CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G W 5.12.0-7d1efdf501f8-misc-next+ #99
[145427.700221][ T4811] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[145427.703623][ T4811] RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
[145427.706135][ T4811] Code: 05 48 8d 74 10 (...)
[145427.713034][ T4811] RSP: 0018:ffffc90001027090 EFLAGS: 00010293
[145427.714996][ T4811] RAX: 0000000000000000 RBX: ffff8880a8514600 RCX: ffffffffaa9e59b6
[145427.717158][ T4811] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8880a851462c
[145427.720422][ T4811] RBP: ffffc900010270e0 R08: 00000000000000c0 R09: ffffed1004333417
[145427.723835][ T4811] R10: ffff88802199a0b7 R11: ffffed1004333416 R12: 000000000000000e
[145427.727695][ T4811] R13: ffff888135af8748 R14: ffff88818766ff00 R15: ffff8880a851462c
[145427.731636][ T4811] FS: 00007f29acf62700(0000) GS:ffff8881f2200000(0000) knlGS:0000000000000000
[145427.736305][ T4811] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[145427.739587][ T4811] CR2: 00007f0e6013f718 CR3: 000000010d42e003 CR4: 0000000000170ee0
[145427.743573][ T4811] Call Trace:
[145427.745117][ T4811] btrfs_get_old_root+0x16a/0x5c0
[145427.747686][ T4811] ? lock_downgrade+0x400/0x400
[145427.754189][ T4811] btrfs_search_old_slot+0x192/0x520
[145427.758023][ T4811] ? btrfs_search_slot+0x1090/0x1090
[145427.761014][ T4811] ? free_extent_buffer.part.61+0xd7/0x140
[145427.765208][ T4811] ? free_extent_buffer+0x13/0x20
[145427.770042][ T4811] resolve_indirect_refs+0x3e9/0xfc0
[145427.773633][ T4811] ? lock_downgrade+0x400/0x400
[145427.777323][ T4811] ? __kasan_check_read+0x11/0x20
[145427.780539][ T4811] ? add_prelim_ref.part.11+0x150/0x150
[145427.785722][ T4811] ? lock_downgrade+0x400/0x400
[145427.791086][ T4811] ? __kasan_check_read+0x11/0x20
[145427.796266][ T4811] ? lock_acquired+0xbb/0x620
[145427.798764][ T4811] ? __kasan_check_write+0x14/0x20
[145427.801118][ T4811] ? do_raw_spin_unlock+0xa8/0x140
[145427.804491][ T4811] ? rb_insert_color+0x340/0x360
[145427.808066][ T4811] ? prelim_ref_insert+0x12d/0x430
[145427.811889][ T4811] find_parent_nodes+0x5c3/0x1830
[145427.815498][ T4811] ? stack_trace_save+0x87/0xb0
[145427.819210][ T4811] ? resolve_indirect_refs+0xfc0/0xfc0
[145427.823254][ T4811] ? fs_reclaim_acquire+0x67/0xf0
[145427.827220][ T4811] ? __kasan_check_read+0x11/0x20
[145427.829080][ T4811] ? lockdep_hardirqs_on_prepare+0x210/0x210
[145427.831237][ T4811] ? fs_reclaim_acquire+0x67/0xf0
[145427.835061][ T4811] ? __kasan_check_read+0x11/0x20
[145427.836508][ T4811] ? ___might_sleep+0x10f/0x1e0
[145427.841389][ T4811] ? __kasan_kmalloc+0x9d/0xd0
[145427.843054][ T4811] ? trace_hardirqs_on+0x55/0x120
[145427.845533][ T4811] btrfs_find_all_roots_safe+0x142/0x1e0
[145427.847325][ T4811] ? find_parent_nodes+0x1830/0x1830
[145427.849318][ T4811] ? trace_hardirqs_on+0x55/0x120
[145427.851210][ T4811] ? ulist_free+0x1f/0x30
[145427.852809][ T4811] ? btrfs_inode_flags_to_xflags+0x50/0x50
[145427.854654][ T4811] iterate_extent_inodes+0x20e/0x580
[145427.856429][ T4811] ? tree_backref_for_extent+0x230/0x230
[145427.858552][ T4811] ? release_extent_buffer+0x225/0x280
[145427.862789][ T4811] ? read_extent_buffer+0xdd/0x110
[145427.865092][ T4811] ? lock_downgrade+0x400/0x400
[145427.867069][ T4811] ? __kasan_check_read+0x11/0x20
[145427.868585][ T4811] ? lock_acquired+0xbb/0x620
[145427.872309][ T4811] ? __kasan_check_write+0x14/0x20
[145427.873641][ T4811] ? do_raw_spin_unlock+0xa8/0x140
[145427.878150][ T4811] ? _raw_spin_unlock+0x22/0x30
[145427.879355][ T4811] ? release_extent_buffer+0x225/0x280
[145427.881424][ T4811] iterate_inodes_from_logical+0x129/0x170
[145427.884711][ T4811] ? iterate_inodes_from_logical+0x129/0x170
[145427.888124][ T4811] ? btrfs_inode_flags_to_xflags+0x50/0x50
[145427.891553][ T4811] ? iterate_extent_inodes+0x580/0x580
[145427.894531][ T4811] ? __vmalloc_node+0x92/0xb0
[145427.897439][ T4811] ? init_data_container+0x34/0xb0
[145427.900518][ T4811] ? init_data_container+0x34/0xb0
[145427.903705][ T4811] ? kvmalloc_node+0x60/0x80
[145427.906538][ T4811] btrfs_ioctl_logical_to_ino+0x158/0x230
[145427.910125][ T4811] btrfs_ioctl+0x2038/0x4360
[145427.912430][ T4811] ? __kasan_check_write+0x14/0x20
[145427.914061][ T4811] ? mmput+0x3b/0x220
[145427.915380][ T4811] ? btrfs_ioctl_get_supported_features+0x30/0x30
[145427.917512][ T4811] ? __kasan_check_read+0x11/0x20
[145427.919110][ T4811] ? __kasan_check_read+0x11/0x20
[145427.920845][ T4811] ? lock_release+0xc8/0x650
[145427.922227][ T4811] ? __might_fault+0x64/0xd0
[145427.923687][ T4811] ? __kasan_check_read+0x11/0x20
[145427.925222][ T4811] ? lock_downgrade+0x400/0x400
[145427.926729][ T4811] ? lockdep_hardirqs_on_prepare+0x210/0x210
[145427.928496][ T4811] ? lockdep_hardirqs_on_prepare+0x13/0x210
[145427.930396][ T4811] ? _raw_spin_unlock_irqrestore+0x51/0x63
[145427.932123][ T4811] ? __kasan_check_read+0x11/0x20
[145427.933910][ T4811] ? do_vfs_ioctl+0xfc/0x9d0
[145427.935664][ T4811] ? ioctl_file_clone+0xe0/0xe0
[145427.938147][ T4811] ? lock_downgrade+0x400/0x400
[145427.940717][ T4811] ? lockdep_hardirqs_on_prepare+0x210/0x210
[145427.943673][ T4811] ? __kasan_check_read+0x11/0x20
[145427.946249][ T4811] ? lock_release+0xc8/0x650
[145427.948509][ T4811] ? __task_pid_nr_ns+0xd3/0x250
[145427.950946][ T4811] ? __kasan_check_read+0x11/0x20
[145427.953415][ T4811] ? __fget_files+0x160/0x230
[145427.955693][ T4811] ? __fget_light+0xf2/0x110
[145427.957951][ T4811] __x64_sys_ioctl+0xc3/0x100
[145427.961647][ T4811] do_syscall_64+0x37/0x80
[145427.963112][ T4811] entry_SYSCALL_64_after_hwframe+0x44/0xae
[145427.971975][ T4811] RIP: 0033:0x7f29ae85b427
[145427.974101][ T4811] Code: 00 00 90 48 8b (...)
[145427.980483][ T4811] RSP: 002b:00007f29acf5fcf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[145427.983314][ T4811] RAX: ffffffffffffffda RBX: 00007f29acf5ff40 RCX: 00007f29ae85b427
[145427.985963][ T4811] RDX: 00007f29acf5ff48 RSI: 00000000c038943b RDI: 0000000000000003
[145427.988504][ T4811] RBP: 0000000001000000 R08: 0000000000000000 R09: 00007f29acf60120
[145427.991085][ T4811] R10: 00005640d5fc7b00 R11: 0000000000000246 R12: 0000000000000003
[145427.993662][ T4811] R13: 00007f29acf5ff48 R14: 00007f29acf5ff40 R15: 00007f29acf5fef8
[145427.996289][ T4811] Modules linked in:
[145427.997661][ T4811] ---[ end trace 85e5fce078dfbe04 ]---
(gdb) l *(tree_mod_log_rewind+0x3b1)
0xffffffff819e5b21 is in tree_mod_log_rewind (fs/btrfs/tree-mod-log.c:675).
670 * the modification. As we're going backwards, we do the
671 * opposite of each operation here.
672 */
673 switch (tm->op) {
674 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
675 BUG_ON(tm->slot < n);
676 fallthrough;
677 case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_MOVING:
678 case BTRFS_MOD_LOG_KEY_REMOVE:
679 btrfs_set_node_key(eb, &tm->key, tm->slot);
(gdb) quit
The following steps explain in more detail how it happens:
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1.
This is task A;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
ret = btrfs_tree_mod_log_insert_root(root->node, child, true);
3) At btrfs_tree_mod_log_insert_root() we create a tree mod log operation
of type BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING, with a ->logical field
pointing to ebX->start. We only have one item in eb X, so we create
only one tree mod log operation, and store in the "tm_list" array;
4) Then, still at btrfs_tree_mod_log_insert_root(), we create a tree mod
log element of operation type BTRFS_MOD_LOG_ROOT_REPLACE, ->logical set
to ebY->start, ->old_root.logical set to ebX->start, ->old_root.level
set to the level of eb X and ->generation set to the generation of eb X;
5) Then btrfs_tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
"tm_list" as argument. After that, tree_mod_log_free_eb() calls
tree_mod_log_insert(). This inserts the mod log operation of type
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING from step 3 into the rbtree
with a sequence number of 2 (and fs_info->tree_mod_seq set to 2);
6) Then, after inserting the "tm_list" single element into the tree mod
log rbtree, the BTRFS_MOD_LOG_ROOT_REPLACE element is inserted, which
gets the sequence number 3 (and fs_info->tree_mod_seq set to 3);
7) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
8) Later some other task B allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
9) The tree mod log user task calls btrfs_search_old_slot(), which calls
btrfs_get_old_root(), and finally that calls tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
10) The first iteration of the while loop finds the tree mod log element
with sequence number 3, for the logical address of eb Y and of type
BTRFS_MOD_LOG_ROOT_REPLACE;
11) Because the operation type is BTRFS_MOD_LOG_ROOT_REPLACE, we don't
break out of the loop, and set root_logical to point to
tm->old_root.logical, which corresponds to the logical address of
eb X;
12) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and
corresponds to the old slot 0 of eb X (eb X had only 1 item in it
before being freed at step 7);
13) We then break out of the while loop and return the tree mod log
operation of type BTRFS_MOD_LOG_ROOT_REPLACE (eb Y), and not the one
for slot 0 of eb X, to btrfs_get_old_root();
14) At btrfs_get_old_root(), we process the BTRFS_MOD_LOG_ROOT_REPLACE
operation and set "logical" to the logical address of eb X, which was
the old root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
15) But before calling tree_mod_log_search(), task B locks eb X, adds a
key to eb X, which results in adding a tree mod log operation of type
BTRFS_MOD_LOG_KEY_ADD, with a sequence number of 4, to the tree mod
log, and increments the number of items in eb X from 0 to 1.
Now fs_info->tree_mod_seq has a value of 4;
16) Task A then calls tree_mod_log_search(), which returns the most recent
tree mod log operation for eb X, which is the one just added by task B
at the previous step, with a sequence number of 4, a type of
BTRFS_MOD_LOG_KEY_ADD and for slot 0;
17) Before task A locks and clones eb X, task A adds another key to eb X,
which results in adding a new BTRFS_MOD_LOG_KEY_ADD mod log operation,
with a sequence number of 5, for slot 1 of eb X, increments the
number of items in eb X from 1 to 2, and unlocks eb X.
Now fs_info->tree_mod_seq has a value of 5;
18) Task A then locks eb X and clones it. The clone has a value of 2 for
the number of items and the pointer "tm" points to the tree mod log
operation with sequence number 4, not the most recent one with a
sequence number of 5, so there is mismatch between the number of
mod log operations that are going to be applied to the cloned version
of eb X and the number of items in the clone;
19) Task A then calls tree_mod_log_rewind() with the clone of eb X, the
tree mod log operation with sequence number 4 and a type of
BTRFS_MOD_LOG_KEY_ADD, and time_seq == 1;
20) At tree_mod_log_rewind(), we set the local varibale "n" with a value
of 2, which is the number of items in the clone of eb X.
Then in the first iteration of the while loop, we process the mod log
operation with sequence number 4, which is targeted at slot 0 and has
a type of BTRFS_MOD_LOG_KEY_ADD. This results in decrementing "n" from
2 to 1.
Then we pick the next tree mod log operation for eb X, which is the
tree mod log operation with a sequence number of 2, a type of
BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING and for slot 0, it is the one
added in step 5 to the tree mod log tree.
We go back to the top of the loop to process this mod log operation,
and because its slot is 0 and "n" has a value of 1, we hit the BUG_ON:
(...)
switch (tm->op) {
case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Fix this by checking for a more recent tree mod log operation after locking
and cloning the extent buffer of the old root node, and use it as the first
operation to apply to the cloned extent buffer when rewinding it.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210404040732.GZ32440@hungrycats.org/
Fixes: 834328a8493079 ("Btrfs: tree mod log's old roots could still be part of the tree")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/ctree.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 34b929b..f43ce82 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -1365,10 +1365,30 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
"failed to read tree block %llu from get_old_root",
logical);
} else {
+ struct tree_mod_elem *tm2;
+
btrfs_tree_read_lock(old);
eb = btrfs_clone_extent_buffer(old);
+ /*
+ * After the lookup for the most recent tree mod operation
+ * above and before we locked and cloned the extent buffer
+ * 'old', a new tree mod log operation may have been added.
+ * So lookup for a more recent one to make sure the number
+ * of mod log operations we replay is consistent with the
+ * number of items we have in the cloned extent buffer,
+ * otherwise we can hit a BUG_ON when rewinding the extent
+ * buffer.
+ */
+ tm2 = tree_mod_log_search(fs_info, logical, time_seq);
btrfs_tree_read_unlock(old);
free_extent_buffer(old);
+ ASSERT(tm2);
+ ASSERT(tm2 == tm || tm2->seq > tm->seq);
+ if (!tm2 || tm2->seq < tm->seq) {
+ free_extent_buffer(eb);
+ return NULL;
+ }
+ tm = tm2;
}
} else if (old_root) {
eb_root_owner = btrfs_header_owner(eb_root);
--
2.30.2
From 94040e4ccd4df5df91784b598972fa2d57583492 Mon Sep 17 00:00:00 2001
From: wangyugui <wangyugui@e16-tech.com>
Date: Thu, 22 Apr 2021 07:50:31 +0800
Subject: [PATCH] btrfs: bug found by wangyugui,fix-by-Qu
---
fs/btrfs/ctree.h | 3 ++-
fs/btrfs/inode.c | 5 +++--
fs/btrfs/ioctl.c | 2 +-
fs/btrfs/qgroup.c | 2 +-
fs/btrfs/send.c | 4 ++--
5 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 9ae776a..c537ee6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3110,7 +3110,8 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle *trans,
struct btrfs_inode *inode, u64 new_size,
u32 min_type);
-int btrfs_start_delalloc_snapshot(struct btrfs_root *root);
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root,
+ bool in_reclaim_context);
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
bool in_reclaim_context);
int btrfs_set_extent_delalloc(struct btrfs_inode *inode, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a520775..adff9ff 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9674,7 +9674,8 @@ out:
return ret;
}
-int btrfs_start_delalloc_snapshot(struct btrfs_root *root)
+int btrfs_start_delalloc_snapshot(struct btrfs_root *root,
+ bool in_reclaim_context)
{
struct writeback_control wbc = {
.nr_to_write = LONG_MAX,
@@ -9687,7 +9688,7 @@ int btrfs_start_delalloc_snapshot(struct btrfs_root *root)
if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state))
return -EROFS;
- return start_delalloc_inodes(root, &wbc, true, false);
+ return start_delalloc_inodes(root, &wbc, true, in_reclaim_context);
}
int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, long nr,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e8d53fe..1233668 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1034,7 +1034,7 @@ static noinline int btrfs_mksnapshot(const struct path *parent,
*/
btrfs_drew_read_lock(&root->snapshot_lock);
- ret = btrfs_start_delalloc_snapshot(root);
+ ret = btrfs_start_delalloc_snapshot(root, false);
if (ret)
goto out;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index f0b9ef1..2991287 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3579,7 +3579,7 @@ static int try_flush_qgroup(struct btrfs_root *root)
return 0;
}
- ret = btrfs_start_delalloc_snapshot(root);
+ ret = btrfs_start_delalloc_snapshot(root, true);
if (ret < 0)
goto out;
btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 8f32385..8ae8f17 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -7139,7 +7139,7 @@ static int flush_delalloc_roots(struct send_ctx *sctx)
int i;
if (root) {
- ret = btrfs_start_delalloc_snapshot(root);
+ ret = btrfs_start_delalloc_snapshot(root, false);
if (ret)
return ret;
btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
@@ -7147,7 +7147,7 @@ static int flush_delalloc_roots(struct send_ctx *sctx)
for (i = 0; i < sctx->clone_roots_cnt; i++) {
root = sctx->clone_roots[i].root;
- ret = btrfs_start_delalloc_snapshot(root);
+ ret = btrfs_start_delalloc_snapshot(root, false);
if (ret)
return ret;
btrfs_wait_ordered_extents(root, U64_MAX, 0, U64_MAX);
--
2.30.2
[-- Attachment #3: btrfs-mmap-race.patch --]
[-- Type: application/octet-stream, Size: 52332 bytes --]
From deac96c87c2e48c3a111d75cf3b94d3bb14425d0 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 10 Feb 2021 17:14:33 -0500
Subject: [PATCH] btrfs: add a i_mmap_lock to our inode
We need to be able to exclude page_mkwrite from happening concurrently
with certain operations. To facilitate this, add a i_mmap_lock to our
inode, down_read() it in our mkwrite, and add a new ILOCK flag to
indicate that we want to take the i_mmap_lock as well. I used pahole to
check the size of the btrfs_inode, the sizes are as follows
no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)
lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)
We're slightly larger but it doesn't change how many objects we can fit
per page.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/btrfs_inode.h | 1 +
fs/btrfs/ctree.h | 1 +
fs/btrfs/inode.c | 10 ++++++++++
3 files changed, 12 insertions(+)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 28e202e89660..26837c3ca7f6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -220,6 +220,7 @@ struct btrfs_inode {
/* Hook into fs_info->delayed_iputs */
struct list_head delayed_iput;
+ struct rw_semaphore i_mmap_lock;
struct inode vfs_inode;
};
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5776026c532e..751ab9f80e4c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3178,6 +3178,7 @@ extern const struct iomap_dio_ops btrfs_dio_ops;
/* Inode locking type flags, by default the exclusive lock is taken */
#define BTRFS_ILOCK_SHARED (1U << 0)
#define BTRFS_ILOCK_TRY (1U << 1)
+#define BTRFS_ILOCK_MMAP (1U << 2)
int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags);
void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 52dc5f52ea58..0b02d07800db 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -102,6 +102,7 @@ static void __endio_write_update_ordered(struct btrfs_inode *inode,
* BTRFS_ILOCK_SHARED - acquire a shared lock on the inode
* BTRFS_ILOCK_TRY - try to acquire the lock, if fails on first attempt
* return -EAGAIN
+ * BTRFS_ILOCK_MMAP - acquire a write lock on the i_mmap_lock
*/
int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
{
@@ -122,6 +123,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
}
inode_lock(inode);
}
+ if (ilock_flags & BTRFS_ILOCK_MMAP)
+ down_write(&BTRFS_I(inode)->i_mmap_lock);
return 0;
}
@@ -133,6 +136,8 @@ int btrfs_inode_lock(struct inode *inode, unsigned int ilock_flags)
*/
void btrfs_inode_unlock(struct inode *inode, unsigned int ilock_flags)
{
+ if (ilock_flags & BTRFS_ILOCK_MMAP)
+ up_write(&BTRFS_I(inode)->i_mmap_lock);
if (ilock_flags & BTRFS_ILOCK_SHARED)
inode_unlock_shared(inode);
else
@@ -8540,6 +8545,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
ret = VM_FAULT_NOPAGE; /* make the VM retry the fault */
again:
+ down_read(&BTRFS_I(inode)->i_mmap_lock);
lock_page(page);
size = i_size_read(inode);
@@ -8568,6 +8574,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
unlock_extent_cached(io_tree, page_start, page_end,
&cached_state);
unlock_page(page);
+ up_read(&BTRFS_I(inode)->i_mmap_lock);
btrfs_start_ordered_extent(ordered, 1);
btrfs_put_ordered_extent(ordered);
goto again;
@@ -8625,6 +8632,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
+ up_read(&BTRFS_I(inode)->i_mmap_lock);
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
sb_end_pagefault(inode->i_sb);
@@ -8633,6 +8641,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
out_unlock:
unlock_page(page);
+ up_read(&BTRFS_I(inode)->i_mmap_lock);
out:
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE);
btrfs_delalloc_release_space(BTRFS_I(inode), data_reserved, page_start,
@@ -8884,6 +8893,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
INIT_LIST_HEAD(&ei->delalloc_inodes);
INIT_LIST_HEAD(&ei->delayed_iput);
RB_CLEAR_NODE(&ei->rb_node);
+ init_rwsem(&ei->i_mmap_lock);
return inode;
}
--
2.30.2
From dd9559f173439185784744ba2d84b84132456b1d Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 10 Feb 2021 17:14:34 -0500
Subject: [PATCH] btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock
helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/delayed-inode.c | 4 ++--
fs/btrfs/file.c | 18 +++++++++---------
fs/btrfs/ioctl.c | 26 +++++++++++++-------------
fs/btrfs/reflink.c | 4 ++--
fs/btrfs/relocation.c | 4 ++--
5 files changed, 28 insertions(+), 28 deletions(-)
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index a40025dca1ca..1a88f6214ebc 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1573,8 +1573,8 @@ bool btrfs_readdir_get_delayed_items(struct inode *inode,
* We can only do one readdir with delayed items at a time because of
* item->readdir_list.
*/
- inode_unlock_shared(inode);
- inode_lock(inode);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+ btrfs_inode_lock(inode, 0);
mutex_lock(&delayed_node->mutex);
item = __btrfs_first_delayed_insertion_item(delayed_node);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d0cb513d0019..ed629a18e158 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2122,7 +2122,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (ret)
goto out;
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
atomic_inc(&root->log_batch);
@@ -2154,7 +2154,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
*/
ret = start_ordered_ops(inode, start, end);
if (ret) {
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
goto out;
}
@@ -2255,7 +2255,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* file again, but that will end up using the synchronization
* inside btrfs_sync_log to keep things safe.
*/
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
if (ret != BTRFS_NO_LOG_SYNC) {
if (!ret) {
@@ -2285,7 +2285,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
out_release_extents:
btrfs_release_log_ctx_extents(&ctx);
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
goto out;
}
@@ -2867,7 +2867,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (ret)
return ret;
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
ino_size = round_up(inode->i_size, fs_info->sectorsize);
ret = find_first_non_hole(BTRFS_I(inode), &offset, &len);
if (ret < 0)
@@ -2907,7 +2907,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
truncated_block = true;
ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
if (ret) {
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
return ret;
}
}
@@ -3008,7 +3008,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
ret = ret2;
}
}
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
return ret;
}
@@ -3376,7 +3376,7 @@ static long btrfs_fallocate(struct file *file, int mode,
if (mode & FALLOC_FL_ZERO_RANGE) {
ret = btrfs_zero_range(inode, offset, len, mode);
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
return ret;
}
@@ -3486,7 +3486,7 @@ static long btrfs_fallocate(struct file *file, int mode,
unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
&cached_state);
out:
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
/* Let go of our reservation. */
if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e8d53fea4c61..3415a9f06c81 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -226,7 +226,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
if (ret)
return ret;
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
fsflags = btrfs_mask_fsflags_for_type(inode, fsflags);
old_fsflags = btrfs_inode_flags_to_fsflags(binode->flags);
@@ -353,7 +353,7 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
out_end_trans:
btrfs_end_transaction(trans);
out_unlock:
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
mnt_drop_write_file(file);
return ret;
}
@@ -449,7 +449,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
if (ret)
return ret;
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
old_flags = binode->flags;
old_i_flags = inode->i_flags;
@@ -501,7 +501,7 @@ static int btrfs_ioctl_fssetxattr(struct file *file, void __user *arg)
inode->i_flags = old_i_flags;
}
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
mnt_drop_write_file(file);
return ret;
@@ -1014,7 +1014,7 @@ static noinline int btrfs_mksubvol(const struct path *parent,
out_dput:
dput(dentry);
out_unlock:
- inode_unlock(dir);
+ btrfs_inode_unlock(dir, 0);
return error;
}
@@ -1612,7 +1612,7 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
ra_index += cluster;
}
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
if (IS_SWAPFILE(inode)) {
ret = -ETXTBSY;
} else {
@@ -1621,13 +1621,13 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
ret = cluster_pages_for_defrag(inode, pages, i, cluster);
}
if (ret < 0) {
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
goto out_ra;
}
defrag_count += ret;
balance_dirty_pages_ratelimited(inode->i_mapping);
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
if (newer_than) {
if (newer_off == (u64)-1)
@@ -1675,9 +1675,9 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
out_ra:
if (do_compress) {
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
BTRFS_I(inode)->defrag_compress = BTRFS_COMPRESS_NONE;
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
}
if (!file)
kfree(ra);
@@ -3112,9 +3112,9 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
goto out_dput;
}
- inode_lock(inode);
+ btrfs_inode_lock(inode, 0);
err = btrfs_delete_subvolume(dir, dentry);
- inode_unlock(inode);
+ btrfs_inode_unlock(inode, 0);
if (!err) {
fsnotify_rmdir(dir, dentry);
d_delete(dentry);
@@ -3123,7 +3123,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
out_dput:
dput(dentry);
out_unlock_dir:
- inode_unlock(dir);
+ btrfs_inode_unlock(dir, 0);
free_subvol_name:
kfree(subvol_name_ptr);
free_parent:
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index 9ebe95cc355b..d60ea7ea5f27 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -833,7 +833,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
return -EINVAL;
if (same_inode)
- inode_lock(src_inode);
+ btrfs_inode_lock(src_inode, 0);
else
lock_two_nondirectories(src_inode, dst_inode);
@@ -849,7 +849,7 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
out_unlock:
if (same_inode)
- inode_unlock(src_inode);
+ btrfs_inode_unlock(src_inode, 0);
else
unlock_two_nondirectories(src_inode, dst_inode);
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 232d5da7b7be..bf269ee17e68 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -2578,7 +2578,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
return btrfs_end_transaction(trans);
}
- inode_lock(&inode->vfs_inode);
+ btrfs_inode_lock(&inode->vfs_inode, 0);
for (nr = 0; nr < cluster->nr; nr++) {
start = cluster->boundary[nr] - offset;
if (nr + 1 < cluster->nr)
@@ -2596,7 +2596,7 @@ static noinline_for_stack int prealloc_file_extent_cluster(
if (ret)
break;
}
- inode_unlock(&inode->vfs_inode);
+ btrfs_inode_unlock(&inode->vfs_inode, 0);
if (cur_offset < prealloc_end)
btrfs_free_reserved_data_space_noquota(inode->root->fs_info,
--
2.30.2
From d918c442ae122b16330395414ff55ec6bc490236 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 10 Feb 2021 17:14:35 -0500
Subject: [PATCH] btrfs: exclude mmaps while doing remap
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.
There also exists a deadlock, as described by Filipe
"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.
Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both
inodes in the respective file ranges and waits for any ordered extents
in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a
memory mapped write (which does not lock the inode) for one of the
inodes of the clone operation. So now we have a dirty page in one of
the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone
task attempts to start a transaction. There's not enough available
free metadata space, so the async reclaim task is started (if not
running already) and we wait for someone to wake us up on our
reservation ticket;
5) The async reclaim task is not able to release space by any other
means and decides to flush delalloc for the inode of the clone
operation;
6) The workqueue job used to flush the inode blocks when starting
delalloc for the inode, since the file range is currently locked by
the clone task;
7) But the clone task is waiting on its reservation ticket and the async
reclaim task is waiting on the flush job to complete, which can't
progress since the clone task has the file range locked. So unless
some other task is able to release space, for example an ordered
extent for some other inode completes, we have a deadlock between all
these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? lock_release+0x20e/0x4c0
btrfs_clone+0x3e4/0x7e0 [btrfs]
? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
btrfs_clone_files+0xf6/0x150 [btrfs]
btrfs_remap_file_range+0x324/0x3d0 [btrfs]
do_clone_file_range+0xd4/0x1f0
vfs_clone_file_range+0x4d/0x230
? lock_release+0x20e/0x4c0
ioctl_file_clone+0x8f/0xc0
do_vfs_ioctl+0x342/0x750
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix both of these issues by excluding mmaps from happening we are doing
any sort of remap, which prevents this race completely.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/reflink.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index d60ea7ea5f27..6746460fd219 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -604,6 +604,20 @@ static void btrfs_double_extent_lock(struct inode *inode1, u64 loff1,
lock_extent(&BTRFS_I(inode2)->io_tree, loff2, loff2 + len - 1);
}
+static void btrfs_double_mmap_lock(struct inode *inode1, struct inode *inode2)
+{
+ if (inode1 < inode2)
+ swap(inode1, inode2);
+ down_write(&BTRFS_I(inode1)->i_mmap_lock);
+ down_write_nested(&BTRFS_I(inode2)->i_mmap_lock, SINGLE_DEPTH_NESTING);
+}
+
+static void btrfs_double_mmap_unlock(struct inode *inode1, struct inode *inode2)
+{
+ up_write(&BTRFS_I(inode1)->i_mmap_lock);
+ up_write(&BTRFS_I(inode2)->i_mmap_lock);
+}
+
static int btrfs_extent_same_range(struct inode *src, u64 loff, u64 len,
struct inode *dst, u64 dst_loff)
{
@@ -832,10 +846,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
if (remap_flags & ~(REMAP_FILE_DEDUP | REMAP_FILE_ADVISORY))
return -EINVAL;
- if (same_inode)
- btrfs_inode_lock(src_inode, 0);
- else
+ if (same_inode) {
+ btrfs_inode_lock(src_inode, BTRFS_ILOCK_MMAP);
+ } else {
lock_two_nondirectories(src_inode, dst_inode);
+ btrfs_double_mmap_lock(src_inode, dst_inode);
+ }
ret = btrfs_remap_file_range_prep(src_file, off, dst_file, destoff,
&len, remap_flags);
@@ -848,10 +864,12 @@ loff_t btrfs_remap_file_range(struct file *src_file, loff_t off,
ret = btrfs_clone_files(dst_file, src_file, off, len, destoff);
out_unlock:
- if (same_inode)
- btrfs_inode_unlock(src_inode, 0);
- else
+ if (same_inode) {
+ btrfs_inode_unlock(src_inode, BTRFS_ILOCK_MMAP);
+ } else {
+ btrfs_double_mmap_unlock(src_inode, dst_inode);
unlock_two_nondirectories(src_inode, dst_inode);
+ }
return ret < 0 ? ret : len;
}
--
2.30.2
From 492861df03b67e577d8b3cb38aafec4e85ef7326 Mon Sep 17 00:00:00 2001
From: Josef Bacik <josef@toxicpanda.com>
Date: Wed, 10 Feb 2021 17:14:36 -0500
Subject: [PATCH] btrfs: exclude mmap from happening during all fallocate
operations
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/file.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index ed629a18e158..c57a9ecf861e 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2867,7 +2867,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
if (ret)
return ret;
- btrfs_inode_lock(inode, 0);
+ btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
ino_size = round_up(inode->i_size, fs_info->sectorsize);
ret = find_first_non_hole(BTRFS_I(inode), &offset, &len);
if (ret < 0)
@@ -2907,7 +2907,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
truncated_block = true;
ret = btrfs_truncate_block(BTRFS_I(inode), offset, 0, 0);
if (ret) {
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
return ret;
}
}
@@ -3008,7 +3008,7 @@ static int btrfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
ret = ret2;
}
}
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
return ret;
}
@@ -3334,7 +3334,7 @@ static long btrfs_fallocate(struct file *file, int mode,
return ret;
}
- btrfs_inode_lock(inode, 0);
+ btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) {
ret = inode_newsize_ok(inode, offset + len);
@@ -3376,7 +3376,7 @@ static long btrfs_fallocate(struct file *file, int mode,
if (mode & FALLOC_FL_ZERO_RANGE) {
ret = btrfs_zero_range(inode, offset, len, mode);
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
return ret;
}
@@ -3486,7 +3486,7 @@ static long btrfs_fallocate(struct file *file, int mode,
unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
&cached_state);
out:
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
/* Let go of our reservation. */
if (ret != 0 && !(mode & FALLOC_FL_ZERO_RANGE))
btrfs_free_reserved_data_space(BTRFS_I(inode), data_reserved,
--
2.30.2
From 15dc132bbff0c819b2392fc93ff76b2774844b87 Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 23 Feb 2021 12:08:47 +0000
Subject: [PATCH] btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in 0c713cbab6200b ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").
Scenario 2 - logging a version of the file that never existed
This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:
1) We have an inode I with a size of 1M and two file extent items, one
covering an extent with disk_bytenr == X for the file range [0, 512K)
and another one covering another extent with disk_bytenr == Y for the
file range [512K, 1M);
2) A hole is punched for the file range [512K, 1M);
3) Task A starts an fsync of inode I, which has the full sync runtime flag
set. It starts by flushing all existing delalloc, locks the inode (VFS
lock), starts any new delalloc that might have been created before
taking the lock and waits for all ordered extents to complete;
4) Some other task does a memory mapped write for the page corresponding to
the file range [640K, 644K) for example;
5) Task A then logs all items of the inode with the call to
copy_inode_items_to_log();
6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) The ordered extent for the range [640K, 644K) completes and a file
extent item for that range is added to the subvolume tree, pointing
to a 4K extent with a disk_bytenr == Z;
8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
the subvolume tree. It finds two implicit holes:
- one for the file range [512K, 640K)
- one for the file range [644K, 1M)
As a result we end up neither logging a hole for the range [640K, 644K)
nor logging the file extent item with a disk_bytenr == Z.
This means that if we have a power failure and replay the log tree we
end up getting the following file extent layout:
[ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
Which does not corresponding to any layout the file ever had before
the power failure. The only two valid layouts would be:
[ disk_bytenr X ] [ hole ]
0 512K 512K 1M
and
[ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:
1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
in the inode's io tree. This prevents the race but also blocks any reads
during the duration of the fsync, which has a negative impact for many
common workloads;
2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
semaphore was recently added by Josef's patch set:
btrfs: add a i_mmap_lock to our inode
btrfs: cleanup inode_lock/inode_unlock uses
btrfs: exclude mmaps while doing remap
btrfs: exclude mmap from happening during all fallocate operations
and is used to solve races between memory mapped writes and
clone/dedupe/fallocate. This also makes us have the same behaviour we
have regarding other writes (buffered and direct IO) and fsync - block
them while the inode logging is in progress.
This change uses the second approach due to the performance impact of the
first one.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/file.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c57a9ecf861e..4c7049e41fe5 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2122,7 +2122,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (ret)
goto out;
- btrfs_inode_lock(inode, 0);
+ btrfs_inode_lock(inode, BTRFS_ILOCK_MMAP);
atomic_inc(&root->log_batch);
@@ -2135,11 +2135,11 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
&BTRFS_I(inode)->runtime_flags);
/*
- * Before we acquired the inode's lock, someone may have dirtied more
- * pages in the target range. We need to make sure that writeback for
- * any such pages does not start while we are logging the inode, because
- * if it does, any of the following might happen when we are not doing a
- * full inode sync:
+ * Before we acquired the inode's lock and the mmap lock, someone may
+ * have dirtied more pages in the target range. We need to make sure
+ * that writeback for any such pages does not start while we are logging
+ * the inode, because if it does, any of the following might happen when
+ * we are not doing a full inode sync:
*
* 1) We log an extent after its writeback finishes but before its
* checksums are added to the csum tree, leading to -EIO errors
@@ -2154,7 +2154,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
*/
ret = start_ordered_ops(inode, start, end);
if (ret) {
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
goto out;
}
@@ -2255,7 +2255,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* file again, but that will end up using the synchronization
* inside btrfs_sync_log to keep things safe.
*/
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
if (ret != BTRFS_NO_LOG_SYNC) {
if (!ret) {
@@ -2285,7 +2285,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
out_release_extents:
btrfs_release_log_ctx_extents(&ctx);
- btrfs_inode_unlock(inode, 0);
+ btrfs_inode_unlock(inode, BTRFS_ILOCK_MMAP);
goto out;
}
--
2.30.2
From 9f94086543c529fe019cb4ab9ea23ca3e31b28dc Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 23 Feb 2021 12:08:48 +0000
Subject: [PATCH] btrfs: fix race between marking inode needs to be logged and
log syncing
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/btrfs_inode.h | 15 +++++++++++++++
fs/btrfs/file.c | 10 ++--------
fs/btrfs/inode.c | 4 +---
fs/btrfs/transaction.h | 2 +-
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 26837c3ca7f6..8011596e1eb3 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -300,6 +300,21 @@ static inline void btrfs_mod_outstanding_extents(struct btrfs_inode *inode,
mod);
}
+/*
+ * Called every time after doing a buffered, direct IO or memory mapped write.
+ *
+ * This is to ensure that if we write to a file that was previously fsynced in
+ * the current transaction, then try to fsync it again in the same transaction,
+ * we will know that there were changes in the file and that it needs to be
+ * logged.
+ */
+static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
+{
+ spin_lock(&inode->lock);
+ inode->last_sub_trans = inode->root->log_transid;
+ spin_unlock(&inode->lock);
+}
+
static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
{
int ret = 0;
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 4c7049e41fe5..c666d20370c1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2014,14 +2014,8 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
else
num_written = btrfs_buffered_write(iocb, from);
- /*
- * We also have to set last_sub_trans to the current log transid,
- * otherwise subsequent syncs to a file that's been synced in this
- * transaction will appear to have already occurred.
- */
- spin_lock(&inode->lock);
- inode->last_sub_trans = inode->root->log_transid;
- spin_unlock(&inode->lock);
+ btrfs_set_inode_last_sub_trans(inode);
+
if (num_written > 0)
num_written = generic_write_sync(iocb, num_written);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0b02d07800db..ddbac78c4abe 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8627,9 +8627,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
set_page_dirty(page);
SetPageUptodate(page);
- BTRFS_I(inode)->last_trans = fs_info->generation;
- BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
- BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
+ btrfs_set_inode_last_sub_trans(BTRFS_I(inode));
unlock_extent_cached(io_tree, page_start, page_end, &cached_state);
up_read(&BTRFS_I(inode)->i_mmap_lock);
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 6335716e513f..dd7c3eea08ad 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -175,7 +175,7 @@ static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
- inode->last_log_commit = inode->root->last_log_commit;
+ inode->last_log_commit = inode->last_sub_trans - 1;
spin_unlock(&inode->lock);
}
--
2.30.2
From adaaaab326211cf7fbf513600542eb78f4a5f2bd Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Tue, 23 Feb 2021 12:08:49 +0000
Subject: [PATCH] btrfs: remove stale comment and logic from
btrfs_inode_in_log()
Currently btrfs_inode_in_log() checks the list of modified extents of the
inode, and has a comment mentioning why, as it used to be necessary to
make sure if we did something like the following:
mmap write range A
mmap write range B
msync range A (ranged fsync)
msync range B (ranged fsync)
we ended up with both ranges being logged.
If we did not check it, then the second fsync would do nothing because
btrfs_inode_in_log() would return true. This was added in 125c4cf9f37c98
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
test case generic/325 from fstests exercises that scenario.
However, as of commit 487781796d3022 ("btrfs: make fast fsyncs wait only
for writeback"), every ranged fsync is now turned into a full ranged fsync
(operates on the range from 0 to LLONG_MAX), so it is now pointless to
test of emptiness of the list of modified extents, and the comment is
clearly outdated.
So just remove the comment and list emptiness check, while also changing
the function's return type to be a boolean instead of an integer.
In case one day we get support for ranged fsyncs again, it will be easy
to notice the check is necessary again, because it will make generic/325
always fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
---
fs/btrfs/btrfs_inode.h | 17 ++++-------------
1 file changed, 4 insertions(+), 13 deletions(-)
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8011596e1eb3..c652e19ad74e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -315,24 +315,15 @@ static inline void btrfs_set_inode_last_sub_trans(struct btrfs_inode *inode)
spin_unlock(&inode->lock);
}
-static inline int btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
+static inline bool btrfs_inode_in_log(struct btrfs_inode *inode, u64 generation)
{
- int ret = 0;
+ bool ret = false;
spin_lock(&inode->lock);
if (inode->logged_trans == generation &&
inode->last_sub_trans <= inode->last_log_commit &&
- inode->last_sub_trans <= inode->root->last_log_commit) {
- /*
- * After a ranged fsync we might have left some extent maps
- * (that fall outside the fsync's range). So return false
- * here if the list isn't empty, to make sure btrfs_log_inode()
- * will be called and process those extent maps.
- */
- smp_mb();
- if (list_empty(&inode->extent_tree.modified_extents))
- ret = 1;
- }
+ inode->last_sub_trans <= inode->root->last_log_commit)
+ ret = true;
spin_unlock(&inode->lock);
return ret;
}
--
2.30.2
[-- Attachment #4: btrfs-add-btree-read-ahead-for-full-send-operations.patch --]
[-- Type: application/octet-stream, Size: 4051 bytes --]
From 40ddcab1771a11ee26ed1e7438fed15595145c7c Mon Sep 17 00:00:00 2001
From: Filipe Manana <fdmanana@suse.com>
Date: Fri, 26 Feb 2021 15:17:01 +0000
Subject: [PATCH] btrfs: add btree read ahead for full send operations
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.
This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16Gb of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, full send duration:
with $initial_file_count == 200000: 165 seconds
with $initial_file_count == 500000: 407 seconds
After this change, full send duration:
with $initial_file_count == 200000: 149 seconds (-10.2%)
with $initial_file_count == 500000: 353 seconds (-14.2%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
---
fs/btrfs/send.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index f878782..7ff81da 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -6653,6 +6653,7 @@ static int full_send_tree(struct send_ctx *sctx)
path = alloc_path_for_send();
if (!path)
return -ENOMEM;
+ path->reada = READA_FORWARD;
key.objectid = BTRFS_FIRST_FREE_OBJECTID;
key.type = BTRFS_INODE_ITEM_KEY;
--
2.30.1
next prev parent reply other threads:[~2021-04-22 1:26 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-21 0:31 'ls /mnt/scratch/' freeze(deadlock?) when run xfstest(btrfs/232) Wang Yugui
2021-04-21 12:17 ` Wang Yugui
2021-04-21 13:29 ` Filipe Manana
2021-04-21 15:57 ` Wang Yugui
2021-04-21 16:03 ` Filipe Manana
2021-04-21 23:19 ` Qu Wenruo
2021-04-21 23:43 ` Qu Wenruo
2021-04-22 0:32 ` Wang Yugui
2021-04-22 0:57 ` Qu Wenruo
2021-04-22 1:25 ` Wang Yugui [this message]
2021-04-22 4:16 ` Wang Yugui
2021-04-22 11:06 ` Filipe Manana
2021-04-22 10:59 ` Filipe Manana
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20210422092558.F39A.409509F4@e16-tech.com \
--to=wangyugui@e16-tech.com \
--cc=fdmanana@gmail.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox