public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Boris Burkov <boris@bur.io>
To: Filipe Manana <fdmanana@gmail.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH 1/2] btrfs: fix size class loading logic
Date: Tue, 14 Feb 2023 14:12:54 -0800	[thread overview]
Message-ID: <Y+wHZje21c9p6//D@zen> (raw)
In-Reply-To: <CAL3q7H7eKMD44Z1+=Kb-1RFMMeZpAm2fwyO59yeBwCcSOU80Pg@mail.gmail.com>

On Tue, Feb 14, 2023 at 09:47:54PM +0000, Filipe Manana wrote:
> On Wed, Jan 25, 2023 at 9:33 PM Boris Burkov <boris@bur.io> wrote:
> >
> > The original implementation was completely incorrect. It used
> > btrfs_search_slot to make an inexact match, which simply returned >0 to
> > indicate not finding the key.
> >
> > Change it to using search_forward with no transid to actually walk the
> > leaves looking for extent items. Some small tweaks to the key space
> > condition checking in the iteration were also necessary.
> >
> > Finally, since the sampling lookups are of fixed complexity, move them
> > into the main, blocking part of caching a block group, not as a
> > best-effort thing after. This has no effect on total block group caching
> > throughput as there is only one thread anyway, but makes it simpler and
> > reduces weird races where we change the size class simultaneously from
> > an allocation and loading.
> >
> > Signed-off-by: Boris Burkov <boris@bur.io>
> > ---
> >  fs/btrfs/block-group.c | 56 ++++++++++++++++++++++++++++--------------
> >  1 file changed, 37 insertions(+), 19 deletions(-)
> >
> > diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> > index 73e1270b3904..45ccb25c5b1f 100644
> > --- a/fs/btrfs/block-group.c
> > +++ b/fs/btrfs/block-group.c
> > @@ -555,7 +555,8 @@ u64 add_new_free_space(struct btrfs_block_group *block_group, u64 start, u64 end
> >   * Returns: 0 on success, 1 if the search didn't yield a useful item, negative
> >   * error code on error.
> >   */
> > -static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
> > +static int sample_block_group_extent_item(struct btrfs_caching_control *caching_ctl,
> > +                                         struct btrfs_block_group *block_group,
> >                                           int index, int max_index,
> >                                           struct btrfs_key *key)
> >  {
> > @@ -563,17 +564,19 @@ static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
> >         struct btrfs_root *extent_root;
> >         int ret = 0;
> >         u64 search_offset;
> > +       u64 search_end = block_group->start + block_group->length;
> >         struct btrfs_path *path;
> >
> >         ASSERT(index >= 0);
> >         ASSERT(index <= max_index);
> >         ASSERT(max_index > 0);
> > +       lockdep_assert_held(&caching_ctl->mutex);
> > +       lockdep_assert_held_read(&fs_info->commit_root_sem);
> >
> >         path = btrfs_alloc_path();
> >         if (!path)
> >                 return -ENOMEM;
> >
> > -       down_read(&fs_info->commit_root_sem);
> >         extent_root = btrfs_extent_root(fs_info, max_t(u64, block_group->start,
> >                                                        BTRFS_SUPER_INFO_OFFSET));
> >
> > @@ -586,21 +589,36 @@ static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
> >         key->type = BTRFS_EXTENT_ITEM_KEY;
> >         key->offset = 0;
> >
> > -       ret = btrfs_search_slot(NULL, extent_root, key, path, 0, 0);
> > -       if (ret != 0)
> > -               goto out;
> > -       if (key->objectid < block_group->start ||
> > -           key->objectid > block_group->start + block_group->length) {
> > -               ret = 1;
> > -               goto out;
> > -       }
> > -       if (key->type != BTRFS_EXTENT_ITEM_KEY) {
> > -               ret = 1;
> > -               goto out;
> > +       while (1) {
> > +               ret = btrfs_search_forward(extent_root, key, path, 0);
> 
> Boris, this is broken and can result in a deadlock.
> 
> btrfs_search_forward() will always lock the root node, despite the
> path having ->skip_locking and ->search_commit_root set to true (1).
> It's not meant to be used for commit roots, so it always needs to do locking.

Thanks for the catch and explanation. Working on a fix!

> 
> So if another task is COWing a child node of the same root node and
> then needs to wait for block group caching to complete when trying to
> allocate a metadata extent, it deadlocks.
> For example:
> 
> [539604.239315] sysrq: Show Blocked State
> [539604.240133] task:kworker/u16:6   state:D stack:0     pid:2119594
> ppid:2      flags:0x00004000
> [539604.241613] Workqueue: btrfs-cache btrfs_work_helper [btrfs]
> [539604.242673] Call Trace:
> [539604.243129]  <TASK>
> [539604.243925]  __schedule+0x41d/0xee0
> [539604.244797]  ? rcu_read_lock_sched_held+0x12/0x70
> [539604.245399]  ? rwsem_down_read_slowpath+0x185/0x490
> [539604.246111]  schedule+0x5d/0xf0
> [539604.246593]  rwsem_down_read_slowpath+0x2da/0x490
> [539604.247290]  ? rcu_barrier_tasks_trace+0x10/0x20
> [539604.248090]  __down_read_common+0x3d/0x150
> [539604.248702]  down_read_nested+0xc3/0x140
> [539604.249280]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
> [539604.250097]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
> [539604.250915]  btrfs_search_forward+0x59/0x460 [btrfs]
> [539604.251781]  ? btrfs_global_root+0x50/0x70 [btrfs]
> [539604.252476]  caching_thread+0x1be/0x920 [btrfs]
> [539604.253167]  btrfs_work_helper+0xf6/0x400 [btrfs]
> [539604.253848]  process_one_work+0x24f/0x5a0
> [539604.254476]  worker_thread+0x52/0x3b0
> [539604.255166]  ? __pfx_worker_thread+0x10/0x10
> [539604.256047]  kthread+0xf0/0x120
> [539604.256591]  ? __pfx_kthread+0x10/0x10
> [539604.257212]  ret_from_fork+0x29/0x50
> [539604.257822]  </TASK>
> [539604.258233] task:btrfs-transacti state:D stack:0     pid:2236474
> ppid:2      flags:0x00004000
> [539604.259802] Call Trace:
> [539604.260243]  <TASK>
> [539604.260615]  __schedule+0x41d/0xee0
> [539604.261205]  ? rcu_read_lock_sched_held+0x12/0x70
> [539604.262000]  ? rwsem_down_read_slowpath+0x185/0x490
> [539604.262822]  schedule+0x5d/0xf0
> [539604.263374]  rwsem_down_read_slowpath+0x2da/0x490
> [539604.266228]  ? lock_acquire+0x160/0x310
> [539604.266917]  ? rcu_read_lock_sched_held+0x12/0x70
> [539604.267996]  ? lock_contended+0x19e/0x500
> [539604.268720]  __down_read_common+0x3d/0x150
> [539604.269400]  down_read_nested+0xc3/0x140
> [539604.270057]  __btrfs_tree_read_lock+0x24/0x100 [btrfs]
> [539604.271129]  btrfs_read_lock_root_node+0x48/0x60 [btrfs]
> [539604.272372]  btrfs_search_slot+0x143/0xf70 [btrfs]
> [539604.273295]  update_block_group_item+0x9e/0x190 [btrfs]
> [539604.274282]  btrfs_start_dirty_block_groups+0x1c4/0x4f0 [btrfs]
> [539604.275381]  ? __mutex_unlock_slowpath+0x45/0x280
> [539604.276390]  btrfs_commit_transaction+0xee/0xed0 [btrfs]
> [539604.277391]  ? lock_acquire+0x1a4/0x310
> [539604.278080]  ? start_transaction+0xcb/0x6c0 [btrfs]
> [539604.279099]  transaction_kthread+0x142/0x1c0 [btrfs]
> [539604.279996]  ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
> [539604.280673]  kthread+0xf0/0x120
> [539604.281050]  ? __pfx_kthread+0x10/0x10
> [539604.281496]  ret_from_fork+0x29/0x50
> [539604.281966]  </TASK>
> [539604.282255] task:fsstress        state:D stack:0     pid:2236483
> ppid:1      flags:0x00004006
> [539604.283897] Call Trace:
> [539604.284700]  <TASK>
> [539604.285088]  __schedule+0x41d/0xee0
> [539604.285660]  schedule+0x5d/0xf0
> [539604.286175]  btrfs_wait_block_group_cache_progress+0xf2/0x170 [btrfs]
> [539604.287342]  ? __pfx_autoremove_wake_function+0x10/0x10
> [539604.288450]  find_free_extent+0xd93/0x1750 [btrfs]
> [539604.289256]  ? _raw_spin_unlock+0x29/0x50
> [539604.289911]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
> [539604.290843]  btrfs_reserve_extent+0x147/0x290 [btrfs]
> [539604.291943]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
> [539604.292903]  __btrfs_cow_block+0x138/0x580 [btrfs]
> [539604.293773]  btrfs_cow_block+0x10e/0x240 [btrfs]
> [539604.294595]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
> [539604.295585]  btrfs_update_device+0x71/0x1b0 [btrfs]
> [539604.296459]  btrfs_chunk_alloc_add_chunk_item+0xe0/0x340 [btrfs]
> [539604.297489]  btrfs_chunk_alloc+0x1bf/0x490 [btrfs]
> [539604.298335]  find_free_extent+0x6fa/0x1750 [btrfs]
> [539604.299174]  ? _raw_spin_unlock+0x29/0x50
> [539604.299950]  ? btrfs_get_alloc_profile+0x127/0x2a0 [btrfs]
> [539604.300918]  btrfs_reserve_extent+0x147/0x290 [btrfs]
> [539604.301797]  btrfs_alloc_tree_block+0xcb/0x3e0 [btrfs]
> [539604.303017]  ? lock_release+0x224/0x4a0
> [539604.303855]  __btrfs_cow_block+0x138/0x580 [btrfs]
> [539604.304789]  btrfs_cow_block+0x10e/0x240 [btrfs]
> [539604.305611]  btrfs_search_slot+0x7f3/0xf70 [btrfs]
> [539604.306682]  ? btrfs_global_root+0x50/0x70 [btrfs]
> [539604.308198]  lookup_inline_extent_backref+0x17b/0x7a0 [btrfs]
> [539604.309254]  lookup_extent_backref+0x43/0xd0 [btrfs]
> [539604.310122]  __btrfs_free_extent+0xf8/0x810 [btrfs]
> [539604.310874]  ? lock_release+0x224/0x4a0
> [539604.311724]  ? btrfs_merge_delayed_refs+0x17b/0x1d0 [btrfs]
> [539604.313023]  __btrfs_run_delayed_refs+0x2ba/0x1260 [btrfs]
> [539604.314271]  btrfs_run_delayed_refs+0x8f/0x1c0 [btrfs]
> [539604.315445]  ? rcu_read_lock_sched_held+0x12/0x70
> [539604.316706]  btrfs_commit_transaction+0xa2/0xed0 [btrfs]
> [539604.317855]  ? do_raw_spin_unlock+0x4b/0xa0
> [539604.318544]  ? _raw_spin_unlock+0x29/0x50
> [539604.319240]  create_subvol+0x53d/0x6e0 [btrfs]
> [539604.320283]  btrfs_mksubvol+0x4f5/0x590 [btrfs]
> [539604.321220]  __btrfs_ioctl_snap_create+0x11b/0x180 [btrfs]
> [539604.322307]  btrfs_ioctl_snap_create_v2+0xc6/0x150 [btrfs]
> [539604.323295]  btrfs_ioctl+0x9f7/0x33e0 [btrfs]
> [539604.324331]  ? rcu_read_lock_sched_held+0x12/0x70
> [539604.325137]  ? lock_release+0x224/0x4a0
> [539604.325808]  ? __x64_sys_ioctl+0x87/0xc0
> [539604.326467]  __x64_sys_ioctl+0x87/0xc0
> [539604.327109]  do_syscall_64+0x38/0x90
> [539604.327875]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
> [539604.328792] RIP: 0033:0x7f05a7babaeb
> [539604.329378] RSP: 002b:00007ffd0fecc480 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [539604.330561] RAX: ffffffffffffffda RBX: 00007ffd0fecd520 RCX:
> 00007f05a7babaeb
> [539604.335194] RDX: 00007ffd0fecc4e0 RSI: 0000000050009418 RDI:
> 0000000000000004
> [539604.336583] RBP: 0000000000000002 R08: 0000000000000000 R09:
> 000055a66e107480
> [539604.337685] R10: 00007f05a7ac5358 R11: 0000000000000246 R12:
> 0000000000000004
> [539604.339220] R13: 00007ffd0fecc4e0 R14: 0000000000000004 R15:
> 000055a66c4be590
> [539604.341137]  </TASK>
> 
> This needs to use regular btrfs_search_slot() with some skip and stop logic.
> 
> Thanks.
> 
> 
> > +               if (ret != 0)
> > +                       goto out;
> > +               /* Success; sampled an extent item in the block group */
> > +               if (key->type == BTRFS_EXTENT_ITEM_KEY &&
> > +                   key->objectid >= block_group->start &&
> > +                   key->objectid + key->offset <= search_end)
> > +                       goto out;
> > +
> > +               /* We can't possibly find a valid extent item anymore */
> > +               if (key->objectid >= search_end) {
> > +                       ret = 1;
> > +                       break;
> > +               }
> > +               if (key->type < BTRFS_EXTENT_ITEM_KEY)
> > +                       key->type = BTRFS_EXTENT_ITEM_KEY;
> > +               else
> > +                       key->objectid++;
> > +               btrfs_release_path(path);
> > +               up_read(&fs_info->commit_root_sem);
> > +               mutex_unlock(&caching_ctl->mutex);
> > +               cond_resched();
> > +               mutex_lock(&caching_ctl->mutex);
> > +               down_read(&fs_info->commit_root_sem);
> >         }
> >  out:
> > +       lockdep_assert_held(&caching_ctl->mutex);
> > +       lockdep_assert_held_read(&fs_info->commit_root_sem);
> >         btrfs_free_path(path);
> > -       up_read(&fs_info->commit_root_sem);
> >         return ret;
> >  }
> >
> > @@ -638,7 +656,8 @@ static int sample_block_group_extent_item(struct btrfs_block_group *block_group,
> >   *
> >   * Returns: 0 on success, negative error code on error.
> >   */
> > -static int load_block_group_size_class(struct btrfs_block_group *block_group)
> > +static int load_block_group_size_class(struct btrfs_caching_control *caching_ctl,
> > +                                      struct btrfs_block_group *block_group)
> >  {
> >         struct btrfs_key key;
> >         int i;
> > @@ -646,11 +665,11 @@ static int load_block_group_size_class(struct btrfs_block_group *block_group)
> >         enum btrfs_block_group_size_class size_class = BTRFS_BG_SZ_NONE;
> >         int ret;
> >
> > -       if (btrfs_block_group_should_use_size_class(block_group))
> > +       if (!btrfs_block_group_should_use_size_class(block_group))
> >                 return 0;
> >
> >         for (i = 0; i < 5; ++i) {
> > -               ret = sample_block_group_extent_item(block_group, i, 5, &key);
> > +               ret = sample_block_group_extent_item(caching_ctl, block_group, i, 5, &key);
> >                 if (ret < 0)
> >                         goto out;
> >                 if (ret > 0)
> > @@ -812,6 +831,7 @@ static noinline void caching_thread(struct btrfs_work *work)
> >         mutex_lock(&caching_ctl->mutex);
> >         down_read(&fs_info->commit_root_sem);
> >
> > +       load_block_group_size_class(caching_ctl, block_group);
> >         if (btrfs_test_opt(fs_info, SPACE_CACHE)) {
> >                 ret = load_free_space_cache(block_group);
> >                 if (ret == 1) {
> > @@ -867,8 +887,6 @@ static noinline void caching_thread(struct btrfs_work *work)
> >
> >         wake_up(&caching_ctl->wait);
> >
> > -       load_block_group_size_class(block_group);
> > -
> >         btrfs_put_caching_control(caching_ctl);
> >         btrfs_put_block_group(block_group);
> >  }
> > --
> > 2.38.1
> >
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”

  reply	other threads:[~2023-02-14 22:13 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-25 20:50 [PATCH 0/2] btrfs: block group size class load fixes Boris Burkov
2023-01-25 20:50 ` [PATCH 1/2] btrfs: fix size class loading logic Boris Burkov
2023-02-14 21:47   ` Filipe Manana
2023-02-14 22:12     ` Boris Burkov [this message]
2023-01-25 20:50 ` [PATCH 2/2] btrfs: add size class stats to sysfs Boris Burkov
2023-01-27 13:23   ` David Sterba
2023-01-27 21:43     ` Boris Burkov
2023-02-07 15:08       ` David Sterba
2023-01-25 22:05 ` [PATCH 0/2] btrfs: block group size class load fixes David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y+wHZje21c9p6//D@zen \
    --to=boris@bur.io \
    --cc=fdmanana@gmail.com \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox