linux-fsdevel.vger.kernel.org archive mirror
From: Jeff Layton <jlayton@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	Andrew Morton <akpm@linux-foundation.org>,
	Josef Bacik <josef@toxicpanda.com>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
Date: Sat, 03 Aug 2024 06:59:24 -0400
Message-ID: <7ff040d4a0fb1634d3dc9282da014165a347dbb2.camel@kernel.org>
In-Reply-To: <CAGudoHHLcKoG6Y2Zzm34gLrtaXmtuMc=CPcVpVQUaJ1Ysz8EDQ@mail.gmail.com>

On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > 
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, the performance degraded significantly
> > > due to contention for the parent's d_lockref spinlock.
> > > 
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > > 
> > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > 
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > > 
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > > 
> > 
> > This was already tried by me and it unfortunately can reduce performance.
> > 
> 
> Oh wait, I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible for the lock to be
> held continuously). Spinning only up to an arbitrary limit wins in
> some tests and loses in others.
> 

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?
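
To make the forward-progress question concrete, here is a minimal userspace sketch of the pre-patch lockref pattern. It is an illustration, not the kernel's code: a lock bit and a 32-bit count share one 64-bit word so the common case is a single compare-and-swap, the fast path gives up after a fixed number of attempts, and the slow path takes the lock. It assumes C11 atomics; the names sketch_*, LOCK_BIT, COUNT_ONE and SKETCH_RETRIES are hypothetical, and a test-and-set bit stands in for the kernel spinlock.

```c
/*
 * Userspace sketch of the pre-patch lockref pattern (not kernel code).
 * Assumptions: C11 atomics; a test-and-set bit stands in for the
 * spinlock; all sketch_* names and constants are hypothetical.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_RETRIES	100			/* same budget as the old loop */
#define LOCK_BIT	((uint64_t)1)		/* bit 0: "lock held" */
#define COUNT_ONE	((uint64_t)1 << 32)	/* count in the high 32 bits */

struct sketch_lockref {
	_Atomic uint64_t lock_count;	/* lock and count in one word */
};

static uint32_t sketch_count(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

static void sketch_lock(struct sketch_lockref *lr)
{
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);

	/* Set the lock bit only if it was clear; the count is preserved. */
	for (;;) {
		if (!(old & LOCK_BIT) &&
		    atomic_compare_exchange_weak_explicit(&lr->lock_count,
				&old, old | LOCK_BIT,
				memory_order_acquire, memory_order_relaxed))
			return;
		old = atomic_load_explicit(&lr->lock_count,
					   memory_order_relaxed);
	}
}

static void sketch_unlock(struct sketch_lockref *lr)
{
	/* Unlocking a lock we do not hold would be a caller bug. */
	assert(atomic_load_explicit(&lr->lock_count,
				    memory_order_relaxed) & LOCK_BIT);
	atomic_fetch_and_explicit(&lr->lock_count, ~LOCK_BIT,
				  memory_order_release);
}

/* Take a reference; returns 1 if the lockless path won, 0 if it fell back. */
static int sketch_get(struct sketch_lockref *lr)
{
	int retry = SKETCH_RETRIES;
	uint64_t old = atomic_load_explicit(&lr->lock_count,
					    memory_order_relaxed);

	/* Fast path: only tried while the lock bit reads as clear. */
	while (!(old & LOCK_BIT)) {
		/* On failure the CAS reloads 'old' with the current value. */
		if (atomic_compare_exchange_weak_explicit(&lr->lock_count,
				&old, old + COUNT_ONE,
				memory_order_relaxed, memory_order_relaxed))
			return 1;
		if (!--retry)
			break;
	}

	/* Slow path: serialize on the lock, then bump the count. */
	sketch_lock(lr);
	atomic_fetch_add_explicit(&lr->lock_count, COUNT_ONE,
				  memory_order_relaxed);
	sketch_unlock(lr);
	return 0;
}
```

One caveat on the design: the test-and-set lock in the sketch is itself unfair, so by itself it says nothing about progress. In the kernel the fallback is spin_lock(), and with queued spinlocks a waiter that reaches it joins a queue; as long as waiters are queued, the lock word does not read as unlocked. A loop that only retries while the word reads as unlocked never joins that queue, which appears to be the "continuously held and queued on" case Mateusz describes.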

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> ... and even if that's not true:
> 2. lockref can be ditched in favor of atomics. Apart from some minor
> refactoring this all looks perfectly doable and I have a WIP. I will
> try to find the time next week to sort it out.
> 

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > The key problem is that in some corner cases the lock can be continuously
> > held and queued on, making the fast path always fail and all the spins
> > actively waste time (and notably pull on the cacheline).
> > 
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> > 
> > However, I *suspect* in the case you are optimizing here (open + O_CREAT
> > of an existing file) lockref on the parent can be avoided altogether
> > with some hackery and that's what should be done here.
> > 
> > When it comes to lockref in vfs in general, most uses can be elided with
> > some hackery (see the above thread) which is in early WIP (the LSMs are
> > a massive headache).
> > 
> > For open calls which *do* need to take a real ref the hackery does not
> > help of course.
> > 
> > This is where I think decoupling the refcount from the lock is the best way
> > forward. For that to work the dentry must hang around after the last
> > unref (already done thanks to RCU, and dput even handles that explicitly)
> > and there needs to be a way to block new refs atomically -- this can be
> > done with a cmpxchg from a 0-ref state to a flag blocking new refs
> > coming in. I have that as a WIP as well.
> > 
> > 
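
The scheme in the quoted paragraph above (a count that no longer lives next to a lock, with new references blocked by a cmpxchg from 0 to a marker) can be sketched in userspace C11 roughly as follows. This is an illustration under stated assumptions, not Mateusz's WIP: the names sketch_ref, ref_get_not_dead, ref_put, ref_try_kill and REF_DEAD are hypothetical, and the object is assumed to remain safe to read after the last put (the role RCU plays for dentries).

```c
/*
 * Sketch of a refcount decoupled from any spinlock (not kernel code).
 * Assumptions: C11 atomics; all names here are hypothetical; the object
 * stays readable after the last put (RCU's job for dentries).
 */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define REF_DEAD INT_MIN	/* hypothetical "no new refs" marker */

struct sketch_ref {
	atomic_int count;
};

/* Take a reference unless the object has been marked dead. */
static bool ref_get_not_dead(struct sketch_ref *r)
{
	int old = atomic_load_explicit(&r->count, memory_order_relaxed);

	/* No overflow check in this sketch: assumes count < INT_MAX. */
	while (old != REF_DEAD) {
		if (atomic_compare_exchange_weak_explicit(&r->count, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
		/* CAS failure reloaded 'old'; re-check the dead marker. */
	}
	return false;
}

/* Drop a reference; returns the new count. */
static int ref_put(struct sketch_ref *r)
{
	int old = atomic_fetch_sub_explicit(&r->count, 1,
					    memory_order_release);

	/* Dropping a ref we do not hold (or on a dead object) is a bug. */
	assert(old > 0);
	return old - 1;
}

/*
 * Teardown: succeeds only if nobody holds a reference. Afterwards every
 * ref_get_not_dead() sees REF_DEAD and fails, so no new refs can appear.
 */
static bool ref_try_kill(struct sketch_ref *r)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&r->count, &expected,
			REF_DEAD, memory_order_acq_rel, memory_order_relaxed);
}
```

The point the sketch illustrates is that teardown no longer needs a lock to exclude new getters: once the CAS from 0 succeeds, every later ref_get_not_dead() observes the marker and fails. It does not address the other dentry state that the lockref's spinlock (d_lock) also protects.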
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > >  1 file changed, 32 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                     \
> > > -     int retry = 100;                                                        \
> > > -     struct lockref old;                                                     \
> > > -     BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > > -     old.lock_count = READ_ONCE(lockref->lock_count);                        \
> > > -     while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
> > > -             struct lockref new = old;                                       \
> > > -             CODE                                                            \
> > > -             if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > -                                              &old.lock_count,               \
> > > -                                              new.lock_count))) {            \
> > > -                     SUCCESS;                                                \
> > > -             }                                                               \
> > > -             if (!--retry)                                                   \
> > > -                     break;                                                  \
> > > -     }                                                                       \
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                             \
> > > +     struct lockref old;                                                             \
> > > +     BUILD_BUG_ON(sizeof(old) != 8);                                                 \
> > > +     old.lock_count = READ_ONCE(lockref->lock_count);                                \
> > > +     for (;;) {                                                                      \
> > > +             struct lockref new = old;                                               \
> > > +                                                                                     \
> > > +             if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {        \
> > > +                     CODE                                                            \
> > > +                     if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > +                                                      &old.lock_count,               \
> > > +                                                      new.lock_count))) {            \
> > > +                             SUCCESS;                                                \
> > > +                     }                                                               \
> > > +             } else {                                                                \
> > > +                     cpu_relax();                                                    \
> > > +                     old.lock_count = READ_ONCE(lockref->lock_count);                \
> > > +             }                                                                       \
> > > +     }                                                                               \
> > >  } while (0)
> > > 
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >       ,
> > >               return;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     lockref->count++;
> > > -     spin_unlock(&lockref->lock);
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > > 
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > > 
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count--;
> > >               if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 1) {
> > > -             lockref->count--;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > > 
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >       ,
> > >               return new.count;
> > >       );
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >       return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count >= 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > > 
> > > --
> > > 2.45.2
> > > 
> 
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>


Thread overview: 16+ messages
2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton
2024-08-03  4:44   ` Mateusz Guzik
2024-08-03  9:09     ` Mateusz Guzik
2024-08-03 10:59       ` Jeff Layton [this message]
2024-08-03 11:21         ` Mateusz Guzik
2024-08-03 11:32           ` Jeff Layton
2024-08-05 11:44             ` Christian Brauner
2024-08-05 12:52               ` Jeff Layton
2024-08-06 11:36                 ` Christian Brauner
2024-08-03 10:55     ` Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner
2024-08-05 11:55   ` Jeff Layton
