All of lore.kernel.org
 help / color / mirror / Atom feed
From: Josef Bacik <josef@toxicpanda.com>
To: Filipe Manana <fdmanana@kernel.org>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com, willy@infradead.org
Subject: Re: [PATCH v3 1/3] btrfs: convert the buffer_radix to an xarray
Date: Fri, 25 Apr 2025 11:48:44 -0400	[thread overview]
Message-ID: <20250425154844.GB494636@perftesting> (raw)
In-Reply-To: <CAL3q7H6Nbar_o0GVGuTr5BVmCRsDUgAJfnOz-hSi5OEi86jejg@mail.gmail.com>

On Thu, Apr 24, 2025 at 05:07:29PM +0100, Filipe Manana wrote:
> On Thu, Apr 24, 2025 at 4:47 PM Josef Bacik <josef@toxicpanda.com> wrote:
> >
> > On Wed, Apr 23, 2025 at 04:08:56PM +0100, Filipe Manana wrote:
> > > On Fri, Apr 18, 2025 at 2:58 PM Josef Bacik <josef@toxicpanda.com> wrote:
> > > >
> > > > In order to fully utilize xarray tagging to improve writeback we need to
> > > > convert the buffer_radix to a proper xarray.  This conversion is
> > > > relatively straightforward as the radix code uses the xarray underneath.
> > > > Using xarray directly allows for quite a lot less code.
> > > >
> > > > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > > > ---
> > > >  fs/btrfs/disk-io.c           |  15 ++-
> > > >  fs/btrfs/extent_io.c         | 196 +++++++++++++++--------------------
> > > >  fs/btrfs/fs.h                |   4 +-
> > > >  fs/btrfs/tests/btrfs-tests.c |  27 ++---
> > > >  fs/btrfs/zoned.c             |  16 +--
> > > >  5 files changed, 113 insertions(+), 145 deletions(-)
> > > >
> > > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > > > index 59da809b7d57..24c08eb86b7b 100644
> > > > --- a/fs/btrfs/disk-io.c
> > > > +++ b/fs/btrfs/disk-io.c
> > > > @@ -2762,10 +2762,22 @@ static int __cold init_tree_roots(struct btrfs_fs_info *fs_info)
> > > >         return ret;
> > > >  }
> > > >
> > > > +/*
> > > > + * lockdep gets confused between our buffer_tree which requires IRQ locking
> > > > + * because we modify marks in the IRQ context, and our delayed inode xarray
> > > > + * which doesn't have these requirements. Use a class key so lockdep doesn't get
> > > > + * them mixed up.
> > > > + */
> > > > +static struct lock_class_key buffer_xa_class;
> > > > +
> > > >  void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > > >  {
> > > >         INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
> > > > -       INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
> > > > +
> > > > +       /* Use the same flags as mapping->i_pages. */
> > > > +       xa_init_flags(&fs_info->buffer_tree, XA_FLAGS_LOCK_IRQ | XA_FLAGS_ACCOUNT);
> > > > +       lockdep_set_class(&fs_info->buffer_tree.xa_lock, &buffer_xa_class);
> > > > +
> > > >         INIT_LIST_HEAD(&fs_info->trans_list);
> > > >         INIT_LIST_HEAD(&fs_info->dead_roots);
> > > >         INIT_LIST_HEAD(&fs_info->delayed_iputs);
> > > > @@ -2777,7 +2789,6 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
> > > >         spin_lock_init(&fs_info->delayed_iput_lock);
> > > >         spin_lock_init(&fs_info->defrag_inodes_lock);
> > > >         spin_lock_init(&fs_info->super_lock);
> > > > -       spin_lock_init(&fs_info->buffer_lock);
> > > >         spin_lock_init(&fs_info->unused_bgs_lock);
> > > >         spin_lock_init(&fs_info->treelog_bg_lock);
> > > >         spin_lock_init(&fs_info->zone_active_bgs_lock);
> > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > > > index 6cfd286b8bbc..aa451ad52528 100644
> > > > --- a/fs/btrfs/extent_io.c
> > > > +++ b/fs/btrfs/extent_io.c
> > > > @@ -1893,19 +1893,20 @@ static void set_btree_ioerr(struct extent_buffer *eb)
> > > >   * context.
> > > >   */
> > > >  static struct extent_buffer *find_extent_buffer_nolock(
> > > > -               const struct btrfs_fs_info *fs_info, u64 start)
> > > > +               struct btrfs_fs_info *fs_info, u64 start)
> > > >  {
> > > > +       XA_STATE(xas, &fs_info->buffer_tree, start >> fs_info->sectorsize_bits);
> > > >         struct extent_buffer *eb;
> > > >
> > > >         rcu_read_lock();
> > > > -       eb = radix_tree_lookup(&fs_info->buffer_radix,
> > > > -                              start >> fs_info->sectorsize_bits);
> > > > -       if (eb && atomic_inc_not_zero(&eb->refs)) {
> > > > -               rcu_read_unlock();
> > > > -               return eb;
> > > > -       }
> > > > +       do {
> > > > +               eb = xas_load(&xas);
> > > > +       } while (xas_retry(&xas, eb));
> > >
> > > Why not use the simpler xarray API?
> > > This is basically open coding what xa_load() does, so we can get rid
> > > of this loop and no need to define a XA_STATE above.
> >
> > Because we have to do the atomic_inc_not_zero() under the rcu_lock(), so we have
> > have to open code the retry logic.
> 
> Sure, but xa_load() can still be called while we are under the rcu
> read section, can't it?
> 
> >
> > >
> > > > +
> > > > +       if (eb && !atomic_inc_not_zero(&eb->refs))
> > > > +               eb = NULL;
> > > >         rcu_read_unlock();
> > > > -       return NULL;
> > > > +       return eb;
> > > >  }
> > > >
> > > >  static void end_bbio_meta_write(struct btrfs_bio *bbio)
> > > > @@ -2769,11 +2770,10 @@ static void detach_extent_buffer_folio(const struct extent_buffer *eb, struct fo
> > > >
> > > >         if (!btrfs_meta_is_subpage(fs_info)) {
> > > >                 /*
> > > > -                * We do this since we'll remove the pages after we've
> > > > -                * removed the eb from the radix tree, so we could race
> > > > -                * and have this page now attached to the new eb.  So
> > > > -                * only clear folio if it's still connected to
> > > > -                * this eb.
> > > > +                * We do this since we'll remove the pages after we've removed
> > > > +                * the eb from the xarray, so we could race and have this page
> > > > +                * now attached to the new eb.  So only clear folio if it's
> > > > +                * still connected to this eb.
> > > >                  */
> > > >                 if (folio_test_private(folio) && folio_get_private(folio) == eb) {
> > > >                         BUG_ON(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
> > > > @@ -2938,9 +2938,9 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
> > > >  {
> > > >         int refs;
> > > >         /*
> > > > -        * The TREE_REF bit is first set when the extent_buffer is added
> > > > -        * to the radix tree. It is also reset, if unset, when a new reference
> > > > -        * is created by find_extent_buffer.
> > > > +        * The TREE_REF bit is first set when the extent_buffer is added to the
> > > > +        * xarray. It is also reset, if unset, when a new reference is created
> > > > +        * by find_extent_buffer.
> > > >          *
> > > >          * It is only cleared in two cases: freeing the last non-tree
> > > >          * reference to the extent_buffer when its STALE bit is set or
> > > > @@ -2952,13 +2952,12 @@ static void check_buffer_tree_ref(struct extent_buffer *eb)
> > > >          * conditions between the calls to check_buffer_tree_ref in those
> > > >          * codepaths and clearing TREE_REF in try_release_extent_buffer.
> > > >          *
> > > > -        * The actual lifetime of the extent_buffer in the radix tree is
> > > > -        * adequately protected by the refcount, but the TREE_REF bit and
> > > > -        * its corresponding reference are not. To protect against this
> > > > -        * class of races, we call check_buffer_tree_ref from the codepaths
> > > > -        * which trigger io. Note that once io is initiated, TREE_REF can no
> > > > -        * longer be cleared, so that is the moment at which any such race is
> > > > -        * best fixed.
> > > > +        * The actual lifetime of the extent_buffer in the xarray is adequately
> > > > +        * protected by the refcount, but the TREE_REF bit and its corresponding
> > > > +        * reference are not. To protect against this class of races, we call
> > > > +        * check_buffer_tree_ref from the codepaths which trigger io. Note that
> > > > +        * once io is initiated, TREE_REF can no longer be cleared, so that is
> > > > +        * the moment at which any such race is best fixed.
> > > >          */
> > > >         refs = atomic_read(&eb->refs);
> > > >         if (refs >= 2 && test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags))
> > > > @@ -3022,23 +3021,26 @@ struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >                 return ERR_PTR(-ENOMEM);
> > > >         eb->fs_info = fs_info;
> > > >  again:
> > > > -       ret = radix_tree_preload(GFP_NOFS);
> > > > -       if (ret) {
> > > > -               exists = ERR_PTR(ret);
> > > > +       xa_lock_irq(&fs_info->buffer_tree);
> > > > +       exists = __xa_cmpxchg(&fs_info->buffer_tree,
> > > > +                             start >> fs_info->sectorsize_bits, NULL, eb,
> > > > +                             GFP_NOFS);
> > > > +       if (xa_is_err(exists)) {
> > > > +               ret = xa_err(exists);
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > > +               btrfs_release_extent_buffer(eb);
> > > > +               return ERR_PTR(ret);
> > > > +       }
> > > > +       if (exists) {
> > > > +               if (!atomic_inc_not_zero(&exists->refs)) {
> > > > +                       /* The extent buffer is being freed, retry. */
> > > > +                       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +                       goto again;
> > > > +               }
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > >                 goto free_eb;
> > > >         }
> > > > -       spin_lock(&fs_info->buffer_lock);
> > > > -       ret = radix_tree_insert(&fs_info->buffer_radix,
> > > > -                               start >> fs_info->sectorsize_bits, eb);
> > > > -       spin_unlock(&fs_info->buffer_lock);
> > > > -       radix_tree_preload_end();
> > > > -       if (ret == -EEXIST) {
> > > > -               exists = find_extent_buffer(fs_info, start);
> > > > -               if (exists)
> > > > -                       goto free_eb;
> > > > -               else
> > > > -                       goto again;
> > > > -       }
> > > > +       xa_unlock_irq(&fs_info->buffer_tree);
> > > >         check_buffer_tree_ref(eb);
> > > >
> > > >         return eb;
> > > > @@ -3059,9 +3061,9 @@ static struct extent_buffer *grab_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >         lockdep_assert_held(&folio->mapping->i_private_lock);
> > > >
> > > >         /*
> > > > -        * For subpage case, we completely rely on radix tree to ensure we
> > > > -        * don't try to insert two ebs for the same bytenr.  So here we always
> > > > -        * return NULL and just continue.
> > > > +        * For subpage case, we completely rely on xarray to ensure we don't try
> > > > +        * to insert two ebs for the same bytenr.  So here we always return NULL
> > > > +        * and just continue.
> > > >          */
> > > >         if (btrfs_meta_is_subpage(fs_info))
> > > >                 return NULL;
> > > > @@ -3194,7 +3196,7 @@ static int attach_eb_folio_to_filemap(struct extent_buffer *eb, int i,
> > > >         /*
> > > >          * To inform we have an extra eb under allocation, so that
> > > >          * detach_extent_buffer_page() won't release the folio private when the
> > > > -        * eb hasn't been inserted into radix tree yet.
> > > > +        * eb hasn't been inserted into the xarray yet.
> > > >          *
> > > >          * The ref will be decreased when the eb releases the page, in
> > > >          * detach_extent_buffer_page().  Thus needs no special handling in the
> > > > @@ -3328,10 +3330,10 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >
> > > >                 /*
> > > >                  * We can't unlock the pages just yet since the extent buffer
> > > > -                * hasn't been properly inserted in the radix tree, this
> > > > -                * opens a race with btree_release_folio which can free a page
> > > > -                * while we are still filling in all pages for the buffer and
> > > > -                * we could crash.
> > > > +                * hasn't been properly inserted in the xarray, this opens a
> > > > +                * race with btree_release_folio which can free a page while we
> > > > +                * are still filling in all pages for the buffer and we could
> > > > +                * crash.
> > > >                  */
> > > >         }
> > > >         if (uptodate)
> > > > @@ -3340,23 +3342,25 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
> > > >         if (page_contig)
> > > >                 eb->addr = folio_address(eb->folios[0]) + offset_in_page(eb->start);
> > > >  again:
> > > > -       ret = radix_tree_preload(GFP_NOFS);
> > > > -       if (ret)
> > > > +       xa_lock_irq(&fs_info->buffer_tree);
> > > > +       existing_eb = __xa_cmpxchg(&fs_info->buffer_tree,
> > > > +                                  start >> fs_info->sectorsize_bits, NULL, eb,
> > > > +                                  GFP_NOFS);
> > > > +       if (xa_is_err(existing_eb)) {
> > > > +               ret = xa_err(existing_eb);
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > >                 goto out;
> > > > -
> > > > -       spin_lock(&fs_info->buffer_lock);
> > > > -       ret = radix_tree_insert(&fs_info->buffer_radix,
> > > > -                               start >> fs_info->sectorsize_bits, eb);
> > > > -       spin_unlock(&fs_info->buffer_lock);
> > > > -       radix_tree_preload_end();
> > > > -       if (ret == -EEXIST) {
> > > > -               ret = 0;
> > > > -               existing_eb = find_extent_buffer(fs_info, start);
> > > > -               if (existing_eb)
> > > > -                       goto out;
> > > > -               else
> > > > -                       goto again;
> > > >         }
> > > > +       if (existing_eb) {
> > > > +               if (!atomic_inc_not_zero(&existing_eb->refs)) {
> > > > +                       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +                       goto again;
> > > > +               }
> > > > +               xa_unlock_irq(&fs_info->buffer_tree);
> > > > +               goto out;
> > > > +       }
> > > > +       xa_unlock_irq(&fs_info->buffer_tree);
> > > > +
> > > >         /* add one reference for the tree */
> > > >         check_buffer_tree_ref(eb);
> > > >
> > > > @@ -3426,10 +3430,13 @@ static int release_extent_buffer(struct extent_buffer *eb)
> > > >
> > > >                 spin_unlock(&eb->refs_lock);
> > > >
> > > > -               spin_lock(&fs_info->buffer_lock);
> > > > -               radix_tree_delete_item(&fs_info->buffer_radix,
> > > > -                                      eb->start >> fs_info->sectorsize_bits, eb);
> > > > -               spin_unlock(&fs_info->buffer_lock);
> > > > +               /*
> > > > +                * We're erasing, theoretically there will be no allocations, so
> > > > +                * just use GFP_ATOMIC.
> > > > +                */
> > > > +               xa_cmpxchg_irq(&fs_info->buffer_tree,
> > > > +                              eb->start >> fs_info->sectorsize_bits, eb, NULL,
> > > > +                              GFP_ATOMIC);
> > >
> > > This I don't get. Why are we doing a cmp exchange with NULL and not
> > > removing the entry?
> > > Swapping with NULL doesn't release the entry, as xarray permits
> > > storing NULL values. So after a long time releasing extent buffers we
> > > may be holding on to memory unnecessarily.
> > >
> > > Why not use xa_erase()?
> >
> > Because we may be freeing a buffer that we raced with the insert, so we don't
> > want to replace the winner with NULL.
> >
> > Assume that we allocate two buffers for ->start == 0, we insert one buffer into
> > the xarray, and we free the other.  When we're freeing the other we will not
> > delete the xarray entry because it doesn't match the buffer we're removing.
> 
> Sure, I get that part, that we don't want to delete an entry if it
> doesn't match our eb.
> But if it matches, we store a NULL in it, meaning the xarray entry
> will remain and potentially hold unnecessary memory for a long time
> unless that index gets reused soon or there are  other indexes in the
> region.
> 
> That was (and is my) my concern.
> 
> As far as I can see we never do actual index deletions from the
> xarray, just store NULL values, not allowing the xarray to release
> space.

From what I gathered xa_erase() and storing a NULL is the same thing?

From the documentation:

You can then set entries using xa_store() and get entries using
xa_load().  xa_store() will overwrite any entry with the new entry and
return the previous entry stored at that index.  You can unset entries
using xa_erase() or by setting the entry to ``NULL`` using xa_store().
There is no difference between an entry that has never been stored to
and one that has been erased with xa_erase(); an entry that has most
recently had ``NULL`` stored to it is also equivalent except if the
XArray was initialized with ``XA_FLAGS_ALLOC``.

So this seems to say that if we store a ``NULL`` value it's the same as deleting
it, so we should be fine with this pattern?

Willy am I doing this wrong?  What should I do instead if this is indeed
incorrect?  Do an xa_load(), check the value, then xa_erase() if it matches, all
under the lock?  Thanks,

Josef

  parent reply	other threads:[~2025-04-25 15:48 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-18 13:57 [PATCH v3 0/3] btrfs: simplify extent buffer writeback Josef Bacik
2025-04-18 13:57 ` [PATCH v3 1/3] btrfs: convert the buffer_radix to an xarray Josef Bacik
2025-04-23 15:08   ` Filipe Manana
2025-04-24 15:47     ` Josef Bacik
2025-04-24 16:07       ` Filipe Manana
2025-04-25 13:35         ` Christoph Hellwig
2025-04-25 15:44           ` Josef Bacik
2025-04-25 15:48         ` Josef Bacik [this message]
2025-04-25 15:59           ` Daniel Vacek
2025-04-18 13:57 ` [PATCH v3 2/3] btrfs: set DIRTY and WRITEBACK tags on the buffer_tree Josef Bacik
2025-04-23 15:16   ` Filipe Manana
2025-04-18 13:57 ` [PATCH v3 3/3] btrfs: use buffer radix for extent buffer writeback operations Josef Bacik
2025-04-23 16:00   ` Filipe Manana

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250425154844.GB494636@perftesting \
    --to=josef@toxicpanda.com \
    --cc=fdmanana@kernel.org \
    --cc=kernel-team@fb.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.