public inbox for linux-btrfs@vger.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com, willy@infradead.org
Subject: Re: [PATCH v3 1/3] btrfs: convert the buffer_radix to an xarray
Date: Fri, 25 Apr 2025 11:48:44 -0400	[thread overview]
Message-ID: <20250425154844.GB494636@perftesting> (raw)
In-Reply-To: <CAL3q7H6Nbar_o0GVGuTr5BVmCRsDUgAJfnOz-hSi5OEi86jejg@mail.gmail.com>

On Thu, Apr 24, 2025 at 05:07:29PM +0100, Filipe Manana wrote:
> On Thu, Apr 24, 2025 at 4:47 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 04:08:56PM +0100, Filipe Manana wrote:
> > > On Fri, Apr 18, 2025 at 2:58 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > In order to fully utilize xarray tagging to improve writeback we need to
> > > > convert the buffer_radix to a proper xarray.  This conversion is
> > > > relatively straightforward as the radix code uses the xarray underneath.
> > > > Using xarray directly allows for quite a lot less code.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > > ---
> > > >  fs/btrfs/disk-io.c           |  15 ++-
> > > >  fs/btrfs/extent_io.c         | 196 +++++++++++++++--------------------
> > > >  fs/btrfs/fs.h                |   4 +-
> > > >  fs/btrfs/tests/btrfs-tests.c |  27 ++---
> > > >  fs/btrfs/zoned.c             |  16 +--
> > > >  5 files changed, 113 insertions(+), 145 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > > index 59da809b7d57..24c08eb86b7b 100644
> > > > --- a/fs/btrfs/disk-io.c
> > > > +++ b/fs/btrfs/disk-io.c
> > > > @@ -2762,10 +2762,22 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
> > > >         return ret;
> > > >  }
> > > >
> > > > +/*
> > > > + * lockdep gets confused between our buffer_tree which requires IRQ locking
> > > > + * because we modify marks in the IRQ context, and our delayed inode xarray
> > > > + * which doesn't have these requirements. Use a class key so lockdep doesn't get
> > > > + * them mixed up.
> > > > + */
> > > > +static struct lock_class_key buffer_xa_class;
> > > > +
> > > >  void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > > >  {
> > > >         INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
> > > > -       INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
> > > > +
> > > > +       /* Use the same flags as mapping->i_pages. */
> > > > +       xa_init_flags(&fs_info->buffer_tree, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > > > +       lockdep_set_class(&fs_info->buffer_tree.xa_lock, &buffer_xa_class);
> > > > +
> > > >         INIT_LIST_HEAD(&fs_info->trans_list);
> > > >         INIT_LIST_HEAD(&fs_info->dead_roots);
> > > >         INIT_LIST_HEAD(&fs_info->delayed_iputs);
> > > > @@ -2777,7 +2789,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > > >         spin_lock_init(&fs_info->delayed_iput_lock);
> > > >         spin_lock_init(&fs_info->defrag_inodes_lock);
> > > >         spin_lock_init(&fs_info->super_lock);
> > > > -       spin_lock_init(&fs_info->buffer_lock);
> > > >         spin_lock_init(&fs_info->unused_bgs_lock);
> > > >         spin_lock_init(&fs_info->treelog_bg_lock);
> > > >         spin_lock_init(&fs_info->zone_active_bgs_lock);
> > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > > index 6cfd286b8bbc..aa451ad52528 100644
> > > > --- a/fs/btrfs/extent_io.c
> > > > +++ b/fs/btrfs/extent_io.c
> > > > @@ -1893,19 +1893,20 @@ static void set_btree_ioerr(struct extent_buffer *eb)
> > > >   * context.
> > > >   */
> > > >  static struct extent_buffer *find_extent_buffer_nolock(
> > > > -               const struct btrfs_fs_info *fs_info, u64 start)
> > > > +               struct btrfs_fs_info *fs_info, u64 start)
> > > >  {
> > > > +       XA_STATE(xas, &fs_info->buffer_tree, start >> fs_info->sectorsize_bits);
> > > >         struct extent_buffer *eb;
> > > >
> > > >         rcu_read_lock();
> > > > -       eb = radix_tree_lookup(&fs_info->buffer_radix,
> > > > -                              start >> fs_info->sectorsize_bits);
> > > > -       if (eb && atomic_inc_not_zero(&eb->refs)) {
> > > > -               rcu_read_unlock();
> > > > -               return eb;
> > > > -       }
> > > > +       do {
> > > > +               eb = xas_load(&xas);
> > > > +       } while (xas_retry(&xas, eb));
> > >
> > > Why not use the simpler xarray API?
> > > This is basically open coding what xa_load() does, so we can get rid
> > > of this loop and no need to define a XA_STATE above.
> >
> > Because we have to do the atomic_inc_not_zero() under the rcu_lock(), so we have
> > have to open code the retry logic.
> 
> Sure, but xa_load() can still be called while we are under the rcu
> read section, can't it?
> 
> >
> > >
> > > > +
> > > > +       if (eb && !atomic_inc_not_zero(&eb->refs))
> > > > +               eb = NULL;
> > > >         rcu_read_unlock();
> > > > -       return NULL;
> > > > +       return eb;
> > > >  }
> > > >
> > > >  static void end_bbio_meta_write(struct btrfs_bio *bbio)
> > > > @@ -2769,11 +2770,10 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
> > > >
> > > >         if (!btrfs_meta_is_subpage(fs_info)) {
> > > >                 /*
> > > > -                * We do this since we'll remove the pages after we've
> > > > -                * removed the eb from the radix tree, so we could race
> > > > -                * and have this page now attached to the new eb.  So
> > > > -                * only clear folio if it's still connected to
> > > > -                * this eb.
> > > > +                * We do this since we'll remove the pages after we've removed
> > > > +                * the eb from the xarray, so we could race and have this page
> > > > +                * now attached to the new eb.  So only clear folio if it's
> > > > +                * still connected to this eb.
> > > >                  */
> > > >                 if (folio_test_private(folio) && folio_get_private(folio) == eb) {
> > > >                         BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> > > > @@ -2938,9 +2938,9 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
> > > >  {
> > > >         int refs;
> > > >         /*
> > > > -        * The TREE_REF bit is first set when the extent_buffer is added
> > > > -        * to the radix tree. It is also reset, if unset, when a new reference
> > > > -        * is created by find_extent_buffer.
> > > > +        * The TREE_REF bit is first set when the extent_buffer is added to the
> > > > +        * xarray. It is also reset, if unset, when a new reference is created
> > > > +        * by find_extent_buffer.
> > > >          *
> > > >          * It is only cleared in two cases: freeing the last non-tree
> > > >          * reference to the extent_buffer when its STALE bit is set or
> > > > @@ -2952,13 +2952,12 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
> > > >          * conditions between the calls to check_buffer_tree_ref in those
> > > >          * codepaths and clearing TREE_REF in try_release_extent_buffer.
> > > >          *
> > > > -        * The actual lifetime of the extent_buffer in the radix tree is
> > > > -        * adequately protected by the refcount, but the TREE_REF bit and
> > > > -        * its corresponding reference are not. To protect against this
> > > > -        * class of races, we call check_buffer_tree_ref from the codepaths
> > > > -        * which trigger io. Note that once io is initiated, TREE_REF can no
> > > > -        * longer be cleared, so that is the moment at which any such race is
> > > > -        * best fixed.
> > > > +        * The actual lifetime of the extent_buffer in the xarray is adequately
> > > > +        * protected by the refcount, but the TREE_REF bit and its corresponding
> > > > +        * reference are not. To protect against this class of races, we call
> > > > +        * check_buffer_tree_ref from the codepaths which trigger io. Note that
> > > > +        * once io is initiated, TREE_REF can no longer be cleared, so that is
> > > > +        * the moment at which any such race is best fixed.
> > > >          */
> > > >         refs = atomic_read(&eb->refs);
> > > >         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > > > @@ -3022,23 +3021,26 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >                 return ERR_PTR(-ENOMEM);
> > > >         eb->fs_info = fs_info;
> > > >  again:
> > > > -       ret = radix_tree_preload(GFP_NOFS);
> > > > -       if (ret) {
> > > > -               exists = ERR_PTR(ret);
> > > > +       xa_lock_irq(&fs_info->buffer_tree);
> > > > +       exists = __xa_cmpxchg(&fs_info->buffer_tree,
> > > > +                             start >> fs_info->sectorsize_bits, NULL, eb,
> > > > +                             GFP_NOFS);
> > > > +       if (xa_is_err(exists)) {
> > > > +               ret = xa_err(exists);
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > > +               btrfs_release_extent_buffer(eb);
> > > > +               return ERR_PTR(ret);
> > > > +       }
> > > > +       if (exists) {
> > > > +               if (!atomic_inc_not_zero(&exists->refs)) {
> > > > +                       /* The extent buffer is being freed, retry. */
> > > > +                       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +                       goto again;
> > > > +               }
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > >                 goto free_eb;
> > > >         }
> > > > -       spin_lock(&fs_info->buffer_lock);
> > > > -       ret = radix_tree_insert(&fs_info->buffer_radix,
> > > > -                               start >> fs_info->sectorsize_bits, eb);
> > > > -       spin_unlock(&fs_info->buffer_lock);
> > > > -       radix_tree_preload_end();
> > > > -       if (ret == -EEXIST) {
> > > > -               exists = find_extent_buffer(fs_info, start);
> > > > -               if (exists)
> > > > -                       goto free_eb;
> > > > -               else
> > > > -                       goto again;
> > > > -       }
> > > > +       xa_unlock_irq(&fs_info->buffer_tree);
> > > >         check_buffer_tree_ref(eb);
> > > >
> > > >         return eb;
> > > > @@ -3059,9 +3061,9 @@ static struct extent_buffer *grab_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >         lockdep_assert_held(&folio->mapping->i_private_lock);
> > > >
> > > >         /*
> > > > -        * For subpage case, we completely rely on radix tree to ensure we
> > > > -        * don't try to insert two ebs for the same bytenr.  So here we always
> > > > -        * return NULL and just continue.
> > > > +        * For subpage case, we completely rely on xarray to ensure we don't try
> > > > +        * to insert two ebs for the same bytenr.  So here we always return NULL
> > > > +        * and just continue.
> > > >          */
> > > >         if (btrfs_meta_is_subpage(fs_info))
> > > >                 return NULL;
> > > > @@ -3194,7 +3196,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> > > >         /*
> > > >          * To inform we have an extra eb under allocation, so that
> > > >          * detach_extent_buffer_page() won't release the folio private when the
> > > > -        * eb hasn't been inserted into radix tree yet.
> > > > +        * eb hasn't been inserted into the xarray yet.
> > > >          *
> > > >          * The ref will be decreased when the eb releases the page, in
> > > >          * detach_extent_buffer_page().  Thus needs no special handling in the
> > > > @@ -3328,10 +3330,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >
> > > >                 /*
> > > >                  * We can't unlock the pages just yet since the extent buffer
> > > > -                * hasn't been properly inserted in the radix tree, this
> > > > -                * opens a race with btree_release_folio which can free a page
> > > > -                * while we are still filling in all pages for the buffer and
> > > > -                * we could crash.
> > > > +                * hasn't been properly inserted in the xarray, this opens a
> > > > +                * race with btree_release_folio which can free a page while we
> > > > +                * are still filling in all pages for the buffer and we could
> > > > +                * crash.
> > > >                  */
> > > >         }
> > > >         if (uptodate)
> > > > @@ -3340,23 +3342,25 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >         if (page_contig)
> > > >                 eb->addr = folio_address(eb->folios[0]) + offset_in_page(eb->start);
> > > >  again:
> > > > -       ret = radix_tree_preload(GFP_NOFS);
> > > > -       if (ret)
> > > > +       xa_lock_irq(&fs_info->buffer_tree);
> > > > +       existing_eb = __xa_cmpxchg(&fs_info->buffer_tree,
> > > > +                                  start >> fs_info->sectorsize_bits, NULL, eb,
> > > > +                                  GFP_NOFS);
> > > > +       if (xa_is_err(existing_eb)) {
> > > > +               ret = xa_err(existing_eb);
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > >                 goto out;
> > > > -
> > > > -       spin_lock(&fs_info->buffer_lock);
> > > > -       ret = radix_tree_insert(&fs_info->buffer_radix,
> > > > -                               start >> fs_info->sectorsize_bits, eb);
> > > > -       spin_unlock(&fs_info->buffer_lock);
> > > > -       radix_tree_preload_end();
> > > > -       if (ret == -EEXIST) {
> > > > -               ret = 0;
> > > > -               existing_eb = find_extent_buffer(fs_info, start);
> > > > -               if (existing_eb)
> > > > -                       goto out;
> > > > -               else
> > > > -                       goto again;
> > > >         }
> > > > +       if (existing_eb) {
> > > > +               if (!atomic_inc_not_zero(&existing_eb->refs)) {
> > > > +                       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +                       goto again;
> > > > +               }
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > > +               goto out;
> > > > +       }
> > > > +       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +
> > > >         /* add one reference for the tree */
> > > >         check_buffer_tree_ref(eb);
> > > >
> > > > @@ -3426,10 +3430,13 @@ static int release_extent_buffer(struct extent_buffer *eb)
> > > >
> > > >                 spin_unlock(&eb->refs_lock);
> > > >
> > > > -               spin_lock(&fs_info->buffer_lock);
> > > > -               radix_tree_delete_item(&fs_info->buffer_radix,
> > > > -                                      eb->start >> fs_info->sectorsize_bits, eb);
> > > > -               spin_unlock(&fs_info->buffer_lock);
> > > > +               /*
> > > > +                * We're erasing, theoretically there will be no allocations, so
> > > > +                * just use GFP_ATOMIC.
> > > > +                */
> > > > +               xa_cmpxchg_irq(&fs_info->buffer_tree,
> > > > +                              eb->start >> fs_info->sectorsize_bits, eb, NULL,
> > > > +                              GFP_ATOMIC);
> > >
> > > This I don't get. Why are we doing a cmp exchange with NULL and not
> > > removing the entry?
> > > Swapping with NULL doesn't release the entry, as xarray permits
> > > storing NULL values. So after a long time releasing extent buffers we
> > > may be holding on to memory unnecessarily.
> > >
> > > Why not use xa_erase()?
> >
> > Because we may be freeing a buffer that we raced with the insert, so we don't
> > want to replace the winner with NULL.
> >
> > Assume that we allocate two buffers for ->start == 0, we insert one buffer into
> > the xarray, and we free the other.  When we're freeing the other we will not
> > delete the xarray entry because it doesn't match the buffer we're removing.
> 
> Sure, I get that part, that we don't want to delete an entry if it
> doesn't match our eb.
> But if it matches, we store a NULL in it, meaning the xarray entry
> will remain and potentially hold unnecessary memory for a long time
> unless that index gets reused soon or there are  other indexes in the
> region.
> 
> That was (and is my) my concern.
> 
> As far as I can see we never do actual index deletions from the
> xarray, just store NULL values, not allowing the xarray to release
> space.

From what I gathered xa_erase() and storing a NULL is the same thing?

From the documentation:

You can then set entries using xa_store() and get entries using
xa_load().  xa_store() will overwrite any entry with the new entry and
return the previous entry stored at that index.  You can unset entries
using xa_erase() or by setting the entry to ``NULL`` using xa_store().
There is no difference between an entry that has never been stored to
and one that has been erased with xa_erase(); an entry that has most
recently had ``NULL`` stored to it is also equivalent except if the
XArray was initialized with ``XA_FLAGS_ALLOC``.

So this seems to say that if we store a ``NULL`` value it's the same as deleting
it, so we should be fine with this pattern?

Willy am I doing this wrong?  What should I do instead if this is indeed
incorrect?  Do an xa_load(), check the value, then xa_erase() if it matches, all
under the lock?  Thanks,

Josef

  parent reply	other threads:[~2025-04-25 15:48 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-18 13:57 [PATCH v3 0/3] btrfs: simplify extent buffer writeback Josef Bacik
2025-04-18 13:57 ` [PATCH v3 1/3] btrfs: convert the buffer_radix to an xarray Josef Bacik
2025-04-23 15:08   ` Filipe Manana
2025-04-24 15:47     ` Josef Bacik
2025-04-24 16:07       ` Filipe Manana
2025-04-25 13:35         ` Christoph Hellwig
2025-04-25 15:44           ` Josef Bacik
2025-04-25 15:48         ` Josef Bacik [this message]
2025-04-25 15:59           ` Daniel Vacek
2025-04-18 13:57 ` [PATCH v3 2/3] btrfs: set DIRTY and WRITEBACK tags on the buffer_tree Josef Bacik
2025-04-23 15:16   ` Filipe Manana
2025-04-18 13:57 ` [PATCH v3 3/3] btrfs: use buffer radix for extent buffer writeback operations Josef Bacik
2025-04-23 16:00   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250425154844.GB494636@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox