From: Jeff Layton <jlayton@kernel.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
Andrew Morton <akpm@linux-foundation.org>,
Josef Bacik <josef@toxicpanda.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
Date: Sat, 03 Aug 2024 06:59:24 -0400
Message-ID: <7ff040d4a0fb1634d3dc9282da014165a347dbb2.camel@kernel.org>
In-Reply-To: <CAGudoHHLcKoG6Y2Zzm34gLrtaXmtuMc=CPcVpVQUaJ1Ysz8EDQ@mail.gmail.com>

On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, the performance devolved significantly
> > > due to contention for the parent's d_lockref spinlock.
> > >
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > >
> > > This patch changes CMPXCHG_LOOP in 2 ways:
> > >
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > >
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > >
> >
> > This was already tried by me and it unfortunately can reduce performance.
> >
>
> Oh wait I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible to get the lock
> being continuously held). Only spinning up to an arbitrary point wins
> some in some tests and loses in others.
>

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?
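
To make the question concrete, here is a minimal user-space sketch (C11 atomics, not kernel code) of the "bounded cmpxchg retries, then take the lock" pattern. The toy_* names and the bit layout (lock flag in the low bit, count in the high 32 bits) are assumptions made up for illustration. The fallback lock is a plain test-and-set spinlock with no fairness, so its slow path can itself spin without bound. Queued kernel spinlocks behave differently, and the sketch does not model that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 is the lock flag, the high 32 bits are the count. */
struct toy_lockref {
	_Atomic uint64_t lock_count;
};

#define TOY_LOCKED  UINT64_C(1)
#define TOY_ONE_REF (UINT64_C(1) << 32)

static inline bool toy_is_locked(uint64_t v) { return (v & TOY_LOCKED) != 0; }
static inline int32_t toy_count(uint64_t v) { return (int32_t)(v >> 32); }

/* Fast path: up to max_retries cmpxchg attempts, abandoned once the lock is seen held. */
static bool toy_get_fast(struct toy_lockref *ref, int max_retries)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	assert(max_retries > 0);
	while (!toy_is_locked(old)) {
		uint64_t next = old + TOY_ONE_REF;

		/* On failure, "old" is reloaded with the current value. */
		if (atomic_compare_exchange_strong_explicit(&ref->lock_count, &old, next,
							    memory_order_relaxed,
							    memory_order_relaxed))
			return true;
		if (--max_retries == 0)
			break;
	}
	return false;
}

/* Unfair test-and-set lock: nothing here bounds how long a caller spins. */
static void toy_lock(struct toy_lockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	for (;;) {
		if (toy_is_locked(old)) {
			old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);
			continue;
		}
		if (atomic_compare_exchange_strong_explicit(&ref->lock_count, &old,
							    old | TOY_LOCKED,
							    memory_order_acquire,
							    memory_order_relaxed))
			return;
	}
}

static void toy_unlock(struct toy_lockref *ref)
{
	atomic_fetch_and_explicit(&ref->lock_count, ~TOY_LOCKED, memory_order_release);
}

/* Bounded fast path first, then the locked slow path. */
static void toy_get(struct toy_lockref *ref)
{
	if (toy_get_fast(ref, 100))
		return;
	toy_lock(ref);
	atomic_fetch_add_explicit(&ref->lock_count, TOY_ONE_REF, memory_order_relaxed);
	toy_unlock(ref);
}
```

In this model the retry limit only bounds the time spent in the fast path. Whether the slow path makes progress depends entirely on the properties of the lock being fallen back to.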

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> .. and even if that's not true
> 2. lockref can be ditched in favor of atomics. apart from some minor
> refactoring this all looks perfectly doable and I have a wip. I will
> try to find the time next week to sort it out
>

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > Key problem is that in some corner cases the lock can be continuously
> > held and be queued on, making the fast path always fail and making all
> > the spins actively waste time (and notably pull on the cacheline).
> >
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> >
> > However, I *suspect* in the case you are optimizing here (open + O_CREAT
> > of an existing file) lockref on the parent can be avoided altogether
> > with some hackery and that's what should be done here.
> >
> > When it comes to lockref in vfs in general, most uses can be elided with
> > some hackery (see the above thread) which is in early WIP (the LSMs are
> > a massive headache).
> >
> > For open calls which *do* need to take a real ref the hackery does not
> > help of course.
> >
> > This is where I think decoupling ref from the lock is the best way
> > forward. For that to work the dentry must hang around after the last
> > unref (already done thanks to RCU and dput even explicitly handles that
> > already!) and there needs to be a way to block new refs atomically --
> > can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> > coming in. I have that as a WIP as well.
> >
> >
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > > lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > > 1 file changed, 32 insertions(+), 53 deletions(-)
> > >
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> > > -	int retry = 100; \
> > > -	struct lockref old; \
> > > -	BUILD_BUG_ON(sizeof(old) != 8); \
> > > -	old.lock_count = READ_ONCE(lockref->lock_count); \
> > > -	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> > > -		struct lockref new = old; \
> > > -		CODE \
> > > -		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
> > > -						 &old.lock_count, \
> > > -						 new.lock_count))) { \
> > > -			SUCCESS; \
> > > -		} \
> > > -		if (!--retry) \
> > > -			break; \
> > > -	} \
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
> > > +	struct lockref old; \
> > > +	BUILD_BUG_ON(sizeof(old) != 8); \
> > > +	old.lock_count = READ_ONCE(lockref->lock_count); \
> > > +	for (;;) { \
> > > +		struct lockref new = old; \
> > > + \
> > > +		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \
> > > +			CODE \
> > > +			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \
> > > +							 &old.lock_count, \
> > > +							 new.lock_count))) { \
> > > +				SUCCESS; \
> > > +			} \
> > > +		} else { \
> > > +			cpu_relax(); \
> > > +			old.lock_count = READ_ONCE(lockref->lock_count); \
> > > +		} \
> > > +	} \
> > >  } while (0)
> > >
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >  	,
> > >  		return;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	lockref->count++;
> > > -	spin_unlock(&lockref->lock);
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > >
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count++;
> > >  		if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >  	,
> > >  		return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count > 0) {
> > > -		lockref->count++;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > >
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count--;
> > >  		if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >  	,
> > >  		return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count > 1) {
> > > -		lockref->count--;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > >
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >  	,
> > >  		return new.count;
> > >  	);
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > >  	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -	int retval;
> > > -
> > >  	CMPXCHG_LOOP(
> > >  		new.count++;
> > >  		if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >  	,
> > >  		return 1;
> > >  	);
> > > -
> > > -	spin_lock(&lockref->lock);
> > > -	retval = 0;
> > > -	if (lockref->count >= 0) {
> > > -		lockref->count++;
> > > -		retval = 1;
> > > -	}
> > > -	spin_unlock(&lockref->lock);
> > > -	return retval;
> > > +	/* should never get here */
> > > +	WARN_ON_ONCE(1);
> > > +	return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > >
> > > --
> > > 2.45.2
> > >
>
>
>
--
Jeff Layton <jlayton@kernel.org>