From: Filipe Manana <fdmanana@kernel.org>
To: Josef Bacik <josef@toxicpanda.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH v2 1/3] btrfs: do not start relocation until in progress drops are done
Date: Mon, 21 Feb 2022 12:38:02 +0000 [thread overview]
Message-ID: <YhOHqtAM/JbfTKdo@debian9.Home> (raw)
In-Reply-To: <78d6f8e496b367fc520549ab00465cbd704ea22f.1645214059.git.josef@toxicpanda.com>
On Fri, Feb 18, 2022 at 02:56:10PM -0500, Josef Bacik wrote:
> We hit a bug with a recovering relocation on mount for one of our file
> systems in production. I reproduced this locally by injecting errors
> into snapshot delete with balance running at the same time. This
> presented as an error while looking up an extent item
>
> ------------[ cut here ]------------
> WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680
> CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8
> RIP: 0010:lookup_inline_extent_backref+0x647/0x680
> RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202
> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000
> RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001
> R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000
> R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000
> FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0
> Call Trace:
> <TASK>
> insert_inline_extent_backref+0x46/0xd0
> __btrfs_inc_extent_ref.isra.0+0x5f/0x200
> ? btrfs_merge_delayed_refs+0x164/0x190
> __btrfs_run_delayed_refs+0x561/0xfa0
> ? btrfs_search_slot+0x7b4/0xb30
> ? btrfs_update_root+0x1a9/0x2c0
> btrfs_run_delayed_refs+0x73/0x1f0
> ? btrfs_update_root+0x1a9/0x2c0
> btrfs_commit_transaction+0x50/0xa50
> ? btrfs_update_reloc_root+0x122/0x220
> prepare_to_merge+0x29f/0x320
> relocate_block_group+0x2b8/0x550
> btrfs_relocate_block_group+0x1a6/0x350
> btrfs_relocate_chunk+0x27/0xe0
> btrfs_balance+0x777/0xe60
> balance_kthread+0x35/0x50
> ? btrfs_balance+0xe60/0xe60
> kthread+0x16b/0x190
> ? set_kthread_struct+0x40/0x40
> ret_from_fork+0x22/0x30
> </TASK>
> ---[ end trace 7ebc95131709d2b0 ]---
>
> Normally snapshot deletion and relocation are excluded from running at
> the same time by the fs_info->cleaner_mutex. However if we had a
> pending balance waiting to get the ->cleaner_mutex, and a snapshot
> deletion was running, and then the box crashed, we would come up in a
> state where we have a half deleted snapshot.
>
> Again, in the normal case the snapshot deletion needs to complete before
> relocation can start, but in this case relocation could very well start
> before the snapshot deletion completes, as we simply add the root to the
> dead roots list and wait for the next time the cleaner runs to clean up
> the snapshot.
>
> Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that
> had a pending drop_progress key. If they do then we know we were in the
> middle of the drop operation and set a flag on the fs_info. Then
> balance can wait until this flag is cleared to start up again.
>
> If there are DEAD_ROOT's that don't have a drop_progress set then we're
> safe to start balance right away as we'll be properly protected by the
> cleaner_mutex.
>
> Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> ---
> fs/btrfs/ctree.h | 13 +++++++++++++
> fs/btrfs/disk-io.c | 10 ++++++++++
> fs/btrfs/extent-tree.c | 7 +++++++
> fs/btrfs/relocation.c | 14 ++++++++++++++
> fs/btrfs/root-tree.c | 18 ++++++++++++++++++
> fs/btrfs/transaction.c | 33 ++++++++++++++++++++++++++++++++-
> fs/btrfs/transaction.h | 1 +
> 7 files changed, 95 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index f870d893d13b..1e238748dc73 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -629,6 +629,9 @@ enum {
> /* Indicate that we want the transaction kthread to commit right now. */
> BTRFS_FS_COMMIT_TRANS,
>
> + /* Indicate we have half completed snapshot deletions pending. */
> + BTRFS_FS_UNFINISHED_DROPS,
> +
> #if BITS_PER_LONG == 32
> /* Indicate if we have error/warn message printed on 32bit systems */
> BTRFS_FS_32BIT_ERROR,
> @@ -1136,8 +1139,18 @@ enum {
> BTRFS_ROOT_QGROUP_FLUSHING,
> /* We started the orphan cleanup for this root. */
> BTRFS_ROOT_ORPHAN_CLEANUP,
> + /* This root has a drop operation that was started previously. */
> + BTRFS_ROOT_UNFINISHED_DROP,
> };
>
> +static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
> +{
> + clear_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags);
> + /* Needs to be here so the clear_bit is seen by the sleeper. */
> + smp_mb__after_atomic();
> + wake_up_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS);
> +}
clear_and_wake_up_bit() does exactly that, we could use it instead.
Otherwise it looks good, thanks.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
> +
> /*
> * Record swapped tree blocks of a subvolume tree for delayed subtree trace
> * code. For detail check comment in fs/btrfs/qgroup.c.
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index b6a81c39d5f4..ed62e81c0b66 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3891,6 +3891,10 @@ int __cold open_ctree(struct super_block *sb, struct btrfs_fs_devices *fs_device
>
> set_bit(BTRFS_FS_OPEN, &fs_info->flags);
>
> + /* Kick the cleaner thread so it'll start deleting snapshots. */
> + if (test_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags))
> + wake_up_process(fs_info->cleaner_kthread);
> +
> clear_oneshot:
> btrfs_clear_oneshot_options(fs_info);
> return 0;
> @@ -4616,6 +4620,12 @@ void __cold close_ctree(struct btrfs_fs_info *fs_info)
> */
> kthread_park(fs_info->cleaner_kthread);
>
> + /*
> + * If we had UNFINISHED_DROPS we could still be processing them, so
> + * clear that bit and wake up relocation so it can stop.
> + */
> + btrfs_wake_unfinished_drop(fs_info);
> +
> /* wait for the qgroup rescan worker to stop */
> btrfs_qgroup_wait_for_completion(fs_info, false);
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 99e550b83794..e94b8f168a85 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5837,6 +5837,13 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int update_ref, int for_reloc)
> kfree(wc);
> btrfs_free_path(path);
> out:
> + /*
> + * We were an unfinished drop root, check to see if there are any
> + * pending, and if not clear and wake up any waiters.
> + */
> + if (!err && test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state))
> + btrfs_maybe_wake_unfinished_drop(fs_info);
> +
> /*
> * So if we need to stop dropping the snapshot for whatever reason we
> * need to make sure to add it back to the dead root list so that we
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index f5465197996d..f528c5283f25 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3960,6 +3960,20 @@ int btrfs_relocate_block_group(struct btrfs_fs_info *fs_info, u64 group_start)
> int rw = 0;
> int err = 0;
>
> + /*
> + * This only gets set if we had a half-deleted snapshot on mount. We
> + * cannot allow relocation to start while we're still trying to clean up
> + * these pending deletions.
> + */
> + ret = wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS,
> + TASK_INTERRUPTIBLE);
> + if (ret)
> + return ret;
> +
> + /* We may have been woken up by close_ctree, so bail if we're closing. */
> + if (btrfs_fs_closing(fs_info))
> + return -EINTR;
> +
> bg = btrfs_lookup_block_group(fs_info, group_start);
> if (!bg)
> return -ENOENT;
> diff --git a/fs/btrfs/root-tree.c b/fs/btrfs/root-tree.c
> index 3d68d2dcd83e..36770bf42d1a 100644
> --- a/fs/btrfs/root-tree.c
> +++ b/fs/btrfs/root-tree.c
> @@ -278,6 +278,24 @@ int btrfs_find_orphan_roots(struct btrfs_fs_info *fs_info)
>
> WARN_ON(!test_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state));
> if (btrfs_root_refs(&root->root_item) == 0) {
> + struct btrfs_key drop_key;
> +
> + btrfs_disk_key_to_cpu(&drop_key,
> + &root->root_item.drop_progress);
> + /*
> + * If we have a non-zero drop_progress then we know we
> + * made it partly through deleting this snapshot, and
> + * thus we need to make sure we block any balance from
> + * happening until this snapshot is completely dropped.
> + */
> + if (drop_key.objectid != 0 || drop_key.type != 0 ||
> + drop_key.offset != 0) {
> + set_bit(BTRFS_FS_UNFINISHED_DROPS,
> + &fs_info->flags);
> + set_bit(BTRFS_ROOT_UNFINISHED_DROP,
> + &root->state);
> + }
> +
> set_bit(BTRFS_ROOT_DEAD_TREE, &root->state);
> btrfs_add_dead_root(root);
> }
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index c6e550fa4d55..dfceee28a149 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1319,6 +1319,32 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans)
> return 0;
> }
>
> +/*
> + * If we had a pending drop we need to see if there are any others left in our
> + * dead roots list, and if not clear our bit and wake any waiters.
> + */
> +void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info)
> +{
> + /*
> + * We put the drop in progress roots at the front of the list, so if the
> + * first entry doesn't have UNFINISHED_DROP set we can wake everybody
> + * up.
> + */
> + spin_lock(&fs_info->trans_lock);
> + if (!list_empty(&fs_info->dead_roots)) {
> + struct btrfs_root *root = list_first_entry(&fs_info->dead_roots,
> + struct btrfs_root,
> + root_list);
> + if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state)) {
> + spin_unlock(&fs_info->trans_lock);
> + return;
> + }
> + }
> + spin_unlock(&fs_info->trans_lock);
> +
> + btrfs_wake_unfinished_drop(fs_info);
> +}
> +
> /*
> * dead roots are old snapshots that need to be deleted. This allocates
> * a dirty root struct and adds it into the list of dead roots that need to
> @@ -1331,7 +1357,12 @@ void btrfs_add_dead_root(struct btrfs_root *root)
> spin_lock(&fs_info->trans_lock);
> if (list_empty(&root->root_list)) {
> btrfs_grab_root(root);
> - list_add_tail(&root->root_list, &fs_info->dead_roots);
> +
> + /* We want to process the partially complete drops first. */
> + if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state))
> + list_add(&root->root_list, &fs_info->dead_roots);
> + else
> + list_add_tail(&root->root_list, &fs_info->dead_roots);
> }
> spin_unlock(&fs_info->trans_lock);
> }
> diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
> index 9402d8d94484..ba8a9826eb37 100644
> --- a/fs/btrfs/transaction.h
> +++ b/fs/btrfs/transaction.h
> @@ -216,6 +216,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid);
>
> void btrfs_add_dead_root(struct btrfs_root *root);
> int btrfs_defrag_root(struct btrfs_root *root);
> +void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info);
> int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
> int btrfs_commit_transaction(struct btrfs_trans_handle *trans);
> void btrfs_commit_transaction_async(struct btrfs_trans_handle *trans);
> --
> 2.26.3
>
next prev parent reply other threads:[~2022-02-21 12:38 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-02-18 19:56 [PATCH v2 0/3] btrfs: fix problem with balance recovery and snap delete Josef Bacik
2022-02-18 19:56 ` [PATCH v2 1/3] btrfs: do not start relocation until in progress drops are done Josef Bacik
2022-02-21 12:38 ` Filipe Manana [this message]
2022-02-24 15:03 ` [btrfs] 0ac06c96a6: BUG:KASAN:use-after-free_in_btrfs_drop_snapshot kernel test robot
2022-02-18 19:56 ` [PATCH v2 2/3] btrfs: use btrfs_fs_info for deleting snapshots and cleaner Josef Bacik
2022-02-21 12:40 ` Filipe Manana
2022-02-18 19:56 ` [PATCH v2 3/3] btrfs: pass btrfs_fs_info to btrfs_recover_relocation Josef Bacik
2022-02-21 12:41 ` Filipe Manana
2022-02-21 16:20 ` [PATCH v2 0/3] btrfs: fix problem with balance recovery and snap delete David Sterba
2022-02-24 14:47 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YhOHqtAM/JbfTKdo@debian9.Home \
--to=fdmanana@kernel.org \
--cc=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox