From: Jeff Layton <jlayton@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Josef Bacik <josef@toxicpanda.com>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
Date: Sat, 03 Aug 2024 06:59:24 -0400
Message-ID: <7ff040d4a0fb1634d3dc9282da014165a347dbb2.camel@kernel.org>
In-Reply-To: <CAGudoHHLcKoG6Y2Zzm34gLrtaXmtuMc=CPcVpVQUaJ1Ysz8EDQ@mail.gmail.com>

On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > 
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, the performance devolved significantly
> > > due to contention for the parent's d_lockref spinlock.
> > > 
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > > 
> > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > 
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > > 
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > > 
> > 
> > This was already tried by me and it unfortunately can reduce performance.
> > 
> 
> Oh wait I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible to get the lock
> being continuously held). Only spinning up to an arbitrary point wins
> some in some tests and loses in others.
> 

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee forward progress either.
Once that happens, a task can still end up spinning on the lock
forever, no?
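
To make the question concrete, here is a rough userspace sketch of the
retry-then-lock shape that lockref_get() has before this patch. It is not
the kernel code: the toy_* names, the test-and-set lock word packed into
the low 32 bits, and the use of C11 atomics are all assumptions made for
illustration. The only point is that the fallback path spins in toy_lock()
with no bound of its own.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct lockref: one 64-bit word where the low
 * 32 bits are a trivial test-and-set lock (0 = unlocked) and the high
 * 32 bits are the reference count. The real layout is arch-dependent.
 */
struct toy_lockref {
	_Atomic uint64_t lock_count;
};

#define TOY_LOCKED	1ULL
#define TOY_COUNT_ONE	(1ULL << 32)

static inline bool toy_unlocked(uint64_t v) { return (uint32_t)v == 0; }
static inline int32_t toy_count(uint64_t v) { return (int32_t)(v >> 32); }

/* Unbounded spin: nothing here limits how long a caller may wait. */
static void toy_lock(struct toy_lockref *r)
{
	uint64_t old = atomic_load_explicit(&r->lock_count,
					    memory_order_relaxed);
	for (;;) {
		if (!toy_unlocked(old)) {
			old = atomic_load_explicit(&r->lock_count,
						   memory_order_relaxed);
			continue;
		}
		if (atomic_compare_exchange_weak_explicit(&r->lock_count,
				&old, old | TOY_LOCKED,
				memory_order_acquire, memory_order_relaxed))
			return;
	}
}

/* Clear the lock half of the word, leaving the count untouched. */
static void toy_unlock(struct toy_lockref *r)
{
	atomic_fetch_and_explicit(&r->lock_count, ~(uint64_t)0xffffffffULL,
				  memory_order_release);
}

static void toy_get(struct toy_lockref *r)
{
	int retry = 100;
	uint64_t old = atomic_load_explicit(&r->lock_count,
					    memory_order_relaxed);

	/* Fast path: bump the count with cmpxchg while the lock is free. */
	while (toy_unlocked(old)) {
		uint64_t next = old + TOY_COUNT_ONE;

		/* On failure, "old" is reloaded with the current value. */
		if (atomic_compare_exchange_strong_explicit(&r->lock_count,
				&old, next,
				memory_order_relaxed, memory_order_relaxed))
			return;
		if (!--retry)
			break;
	}

	/* Slow path: lock held, or 100 failed attempts. */
	toy_lock(r);
	atomic_fetch_add_explicit(&r->lock_count, TOY_COUNT_ONE,
				  memory_order_relaxed);
	toy_unlock(r);
}
```

In this sketch the retry cap only bounds the cmpxchg attempts. Whether the
slow path makes progress depends entirely on the fairness of the lock
being fallen back to, which a toy test-and-set lock does not provide.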

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> .. and even if that's not true
> 2. lockref can be ditched in favor of atomics. apart from some minor
> refactoring this all looks perfectly doable and I have a wip. I will
> try to find the time next week to sort it out
> 

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > Key problem is that in some corner cases the lock can be continuously
> > held and be queued on, making the fast path always fail and making all
> > the spins actively waste time (and notably pull on the cacheline).
> > 
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> > 
> > However, I *suspect* in the case you are optimizing here (open + O_CREAT
> > of an existing file) lockref on the parent can be avoided altogether
> > with some hackery and that's what should be done here.
> > 
> > When it comes to lockref in vfs in general, most uses can be elided with
> > some hackery (see the above thread) which is in early WIP (the LSMs are
> > a massive headache).
> > 
> > For open calls which *do* need to take a real ref the hackery does not
> > help of course.
> > 
> > This is where I think decoupling ref from the lock is the best way
> > forward. For that to work the dentry must hang around after the last
> > unref (already done thanks to RCU and dput even explicitly handles that
> > already!) and there needs to be a way to block new refs atomically --
> > can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> > coming in. I have that as a WIP as well.
> > 
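
A rough userspace illustration of the scheme described above, for anyone
following along. This is not the WIP mentioned in the thread: the toy_ref
names, the -1 sentinel, and the memory orderings are assumptions. It only
shows the general shape of a lock-free count where a cmpxchg from the
0-ref state to a "dead" flag atomically blocks new references.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical sentinel: any negative count means "no new refs". */
#define TOY_REF_DEAD	(-1)

struct toy_ref {
	_Atomic int count;
};

/* Take a reference unless the object has been marked dead. */
static bool toy_ref_get(struct toy_ref *r)
{
	int old = atomic_load_explicit(&r->count, memory_order_relaxed);

	do {
		if (old < 0)
			return false;
		/* On failure, "old" is reloaded and the check repeats. */
	} while (!atomic_compare_exchange_weak_explicit(&r->count, &old,
			old + 1,
			memory_order_acquire, memory_order_relaxed));
	return true;
}

/* Drop a reference and return the new count. */
static int toy_ref_put(struct toy_ref *r)
{
	return atomic_fetch_sub_explicit(&r->count, 1,
					 memory_order_release) - 1;
}

/*
 * Try to block new references. This succeeds only if the count is exactly
 * zero at the moment of the cmpxchg; a concurrent toy_ref_get() that wins
 * the race makes it fail, and the object stays alive.
 */
static bool toy_ref_try_kill(struct toy_ref *r)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&r->count, &expected,
			TOY_REF_DEAD,
			memory_order_acq_rel, memory_order_relaxed);
}
```

The sketch leans on the same precondition named above: the object must
remain valid memory after the last put, so that a late toy_ref_get() can
safely observe the dead state and fail.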
> > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > >  1 file changed, 32 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                     \
> > > -     int retry = 100;                                                        \
> > > -     struct lockref old;                                                     \
> > > -     BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > > -     old.lock_count = READ_ONCE(lockref->lock_count);                        \
> > > -     while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
> > > -             struct lockref new = old;                                       \
> > > -             CODE                                                            \
> > > -             if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > -                                              &old.lock_count,               \
> > > -                                              new.lock_count))) {            \
> > > -                     SUCCESS;                                                \
> > > -             }                                                               \
> > > -             if (!--retry)                                                   \
> > > -                     break;                                                  \
> > > -     }                                                                       \
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                             \
> > > +     struct lockref old;                                                             \
> > > +     BUILD_BUG_ON(sizeof(old) != 8);                                                 \
> > > +     old.lock_count = READ_ONCE(lockref->lock_count);                                \
> > > +     for (;;) {                                                                      \
> > > +             struct lockref new = old;                                               \
> > > +                                                                                     \
> > > +             if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {        \
> > > +                     CODE                                                            \
> > > +                     if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > +                                                      &old.lock_count,               \
> > > +                                                      new.lock_count))) {            \
> > > +                             SUCCESS;                                                \
> > > +                     }                                                               \
> > > +             } else {                                                                \
> > > +                     cpu_relax();                                                    \
> > > +                     old.lock_count = READ_ONCE(lockref->lock_count);                \
> > > +             }                                                                       \
> > > +     }                                                                               \
> > >  } while (0)
> > > 
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >       ,
> > >               return;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     lockref->count++;
> > > -     spin_unlock(&lockref->lock);
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > > 
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > > 
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count--;
> > >               if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 1) {
> > > -             lockref->count--;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > > 
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >       ,
> > >               return new.count;
> > >       );
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >       return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count >= 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > > 
> > > --
> > > 2.45.2
> > > 
> 
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>

