* [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too
@ 2024-08-02 21:45 Jeff Layton
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton
We've had some reports of i_rwsem contention in some workloads. On an
open with O_CREAT set, the kernel always takes the parent directory's
i_rwsem for write when handling the last component. This patchset is
intended to alleviate some of that
contention, and to generally speed up opens when O_CREAT is set and the
file already exists.
I have a simple benchmark that forks a bunch of processes that then do
O_CREAT opens and closes of a file in the same directory as their peers.
This test runs in about 18s on v6.10 kernels. With this patchset, it
runs in about 10s.
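
A minimal reproducer along these lines might look like the sketch below.
This is only an assumption about the benchmark's shape, not the actual
test program: the process count, iteration count, and names are made up,
and "testdir" is assumed to already exist.

/*
 * Hypothetical reproducer: many forked processes doing O_CREAT opens and
 * closes of files that already exist in one shared directory.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	64
#define NITER	100000

int main(void)
{
	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			char name[64];

			snprintf(name, sizeof(name), "testdir/file-%d", i);
			for (int j = 0; j < NITER; j++) {
				int fd = open(name, O_CREAT | O_RDWR, 0644);

				if (fd < 0) {
					perror("open");
					exit(1);
				}
				close(fd);
			}
			exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
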
The basic idea is to do an opportunistic dcache lookup even in the
O_CREAT case, which allows us to avoid taking the inode_lock altogether
when the file already exists. The last patch does this, but on its own it
turns out to perform worse under heavy contention than simply taking the
i_rwsem for write.
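
In rough terms, the last-component handling for O_CREAT changes along the
lines of the simplified sketch below. This is pseudo-C of the idea only
(error handling and the RCU legitimization details are omitted); the real
code is in patch 4/4.

	/*
	 * Try the lockless dcache lookup even for O_CREAT, and only fall
	 * back to the locked create path when nothing positive is found.
	 */
	if (!(open_flag & O_EXCL)) {
		dentry = lookup_fast(nd);
		if (!IS_ERR_OR_NULL(dentry) && d_is_positive(dentry))
			goto finish_lookup;	/* no inode_lock needed */
	}

	inode_lock(dir->d_inode);	/* existing create-side slow path */
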
The problem is that this moves the contention to the parent's d_lockref,
which rapidly devolves into plain old spinlock contention. So, there is an
earlier patch in the series which changes how lockrefs work to perform
better under heavy contention. Locking primitives aren't my area of
expertise, so consider that one a starting point for discussion.
I also wrote a second testcase that just attempts to create and delete a
file in a loop (trying to hit the pessimal case where the fast lookup
never helps). That test runs in about the same amount of time with both
kernels. A Linux kernel build seems to run in about the same amount of
time both with and without this patchset.
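
The create/delete loop could be as simple as the following sketch (again
an assumption about the test's shape rather than the actual program):

/*
 * Hypothetical worst case: the file never survives between iterations,
 * so the opportunistic lookup can never find a positive dentry.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		int fd = open("testdir/scratch", O_CREAT | O_RDWR, 0644);

		if (fd < 0) {
			perror("open");
			exit(1);
		}
		close(fd);
		unlink("testdir/scratch");
	}
	return 0;
}
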
Many thanks to Josef for his help with this.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (4):
fs: remove comment about d_rcu_to_refcount
fs: add a kerneldoc header over lookup_fast
lockref: rework CMPXCHG_LOOP to handle contention better
fs: try an opportunistic lookup for O_CREAT opens too
fs/dcache.c | 3 ---
fs/namei.c | 57 ++++++++++++++++++++++++++++++++++-----
lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
3 files changed, 82 insertions(+), 63 deletions(-)
---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20240723-openfast-ac49a7b6ade2
Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount 2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton @ 2024-08-02 21:45 ` Jeff Layton 2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton ` (3 subsequent siblings) 4 siblings, 0 replies; 16+ messages in thread From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton This function no longer exists. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/dcache.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 4c144519aa70..474c5205a3db 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -2160,9 +2160,6 @@ static noinline struct dentry *__d_lookup_rcu_op_compare( * without taking d_lock and checking d_seq sequence count against @seq * returned here. * - * A refcount may be taken on the found dentry with the d_rcu_to_refcount - * function. - * * Alternatively, __d_lookup_rcu may be called again to look up the child of * the returned dentry, so long as its parent's seqlock is checked after the * child is looked up. Thus, an interlocking stepping of sequence lock checks -- 2.45.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast 2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton 2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton @ 2024-08-02 21:45 ` Jeff Layton 2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton ` (2 subsequent siblings) 4 siblings, 0 replies; 16+ messages in thread From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton The lookup_fast helper in fs/namei.c has some subtlety in how dentries are returned. Document them. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/namei.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/namei.c b/fs/namei.c index 1e05a0f3f04d..b9bdb8e6214a 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1613,6 +1613,20 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name, } EXPORT_SYMBOL(lookup_one_qstr_excl); +/** + * lookup_fast - do fast lockless (but racy) lookup of a dentry + * @nd: current nameidata + * + * Do a fast, but racy lookup in the dcache for the given dentry, and + * revalidate it. Returns a valid dentry pointer or NULL if one wasn't + * found. On error, an ERR_PTR will be returned. + * + * If this function returns a valid dentry and the walk is no longer + * lazy, the dentry will carry a reference that must later be put. If + * RCU mode is still in force, then this is not the case and the dentry + * must be legitimized before use. If this returns NULL, then the walk + * will no longer be in RCU mode. + */ static struct dentry *lookup_fast(struct nameidata *nd) { struct dentry *dentry, *parent = nd->path.dentry; -- 2.45.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
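
Restated as a caller-side sketch, the contract documented above looks
roughly like this. It is illustrative only: do_slow_lookup() is a
placeholder, while the other helpers are the ones referenced elsewhere in
this series.

	dentry = lookup_fast(nd);
	if (IS_ERR(dentry))
		return ERR_CAST(dentry);	/* hard lookup error */
	if (!dentry) {
		/* cache miss: lookup_fast() already dropped out of RCU mode */
		dentry = do_slow_lookup(nd);
	} else if (nd->flags & LOOKUP_RCU) {
		/* still lazy: no reference is held, legitimize before use */
		if (!try_to_unlazy_next(nd, dentry))
			return ERR_PTR(-ECHILD);
	} else {
		/* not lazy: a reference is held and must be dput() later */
	}
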
* [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton 2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton 2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton @ 2024-08-02 21:45 ` Jeff Layton 2024-08-03 4:44 ` Mateusz Guzik 2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton 2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner 4 siblings, 1 reply; 16+ messages in thread From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton In a later patch, we want to change the open(..., O_CREAT) codepath to avoid taking the inode->i_rwsem for write when the dentry already exists. When we tested that initially, the performance devolved significantly due to contention for the parent's d_lockref spinlock. There are two problems with lockrefs today: First, once any concurrent task takes the spinlock, they all end up taking the spinlock, which is much more costly than a single cmpxchg operation. The second problem is that once any task fails to cmpxchg 100 times, it falls back to the spinlock. The upshot there is that even moderate contention can cause a fallback to serialized spinlocking, which worsens performance. This patch changes CMPXCHG_LOOP in 2 ways: First, change the loop to spin instead of falling back to a locked codepath when the spinlock is held. Once the lock is released, allow the task to continue trying its cmpxchg loop as before instead of taking the lock. Second, don't allow the cmpxchg loop to give up after 100 retries. Just continue infinitely. This greatly reduces contention on the lockref when there are large numbers of concurrent increments and decrements occurring. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- lib/lockref.c | 85 ++++++++++++++++++++++------------------------------------- 1 file changed, 32 insertions(+), 53 deletions(-) diff --git a/lib/lockref.c b/lib/lockref.c index 2afe4c5d8919..b76941043fe9 100644 --- a/lib/lockref.c +++ b/lib/lockref.c @@ -8,22 +8,25 @@ * Note that the "cmpxchg()" reloads the "old" value for the * failure case. 
*/ -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ - int retry = 100; \ - struct lockref old; \ - BUILD_BUG_ON(sizeof(old) != 8); \ - old.lock_count = READ_ONCE(lockref->lock_count); \ - while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ - struct lockref new = old; \ - CODE \ - if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ - &old.lock_count, \ - new.lock_count))) { \ - SUCCESS; \ - } \ - if (!--retry) \ - break; \ - } \ +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ + struct lockref old; \ + BUILD_BUG_ON(sizeof(old) != 8); \ + old.lock_count = READ_ONCE(lockref->lock_count); \ + for (;;) { \ + struct lockref new = old; \ + \ + if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ + CODE \ + if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ + &old.lock_count, \ + new.lock_count))) { \ + SUCCESS; \ + } \ + } else { \ + cpu_relax(); \ + old.lock_count = READ_ONCE(lockref->lock_count); \ + } \ + } \ } while (0) #else @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref) , return; ); - - spin_lock(&lockref->lock); - lockref->count++; - spin_unlock(&lockref->lock); + /* should never get here */ + WARN_ON_ONCE(1); } EXPORT_SYMBOL(lockref_get); @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get); */ int lockref_get_not_zero(struct lockref *lockref) { - int retval; - CMPXCHG_LOOP( new.count++; if (old.count <= 0) @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref) , return 1; ); - - spin_lock(&lockref->lock); - retval = 0; - if (lockref->count > 0) { - lockref->count++; - retval = 1; - } - spin_unlock(&lockref->lock); - return retval; + /* should never get here */ + WARN_ON_ONCE(1); + return -1; } EXPORT_SYMBOL(lockref_get_not_zero); @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero); */ int lockref_put_not_zero(struct lockref *lockref) { - int retval; - CMPXCHG_LOOP( new.count--; if (old.count <= 1) @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref) , return 1; ); - - spin_lock(&lockref->lock); - retval = 0; - if (lockref->count > 1) { - lockref->count--; - retval = 1; - } - spin_unlock(&lockref->lock); - return retval; + /* should never get here */ + WARN_ON_ONCE(1); + return -1; } EXPORT_SYMBOL(lockref_put_not_zero); @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref) , return new.count; ); + /* should never get here */ + WARN_ON_ONCE(1); return -1; } EXPORT_SYMBOL(lockref_put_return); @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead); */ int lockref_get_not_dead(struct lockref *lockref) { - int retval; - CMPXCHG_LOOP( new.count++; if (old.count < 0) @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref) , return 1; ); - - spin_lock(&lockref->lock); - retval = 0; - if (lockref->count >= 0) { - lockref->count++; - retval = 1; - } - spin_unlock(&lockref->lock); - return retval; + /* should never get here */ + WARN_ON_ONCE(1); + return -1; } EXPORT_SYMBOL(lockref_get_not_dead); -- 2.45.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
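
For readers who want to experiment with the reworked loop outside the
kernel, a rough user-space model of its shape using C11 atomics is below.
It only mimics the structure (spin and resample while the lock half of
the word is held, retry the cmpxchg indefinitely instead of falling back
to the lock after 100 tries); the real lockref packs a spinlock and a
count into a single 64-bit word as shown in the patch.

#include <stdatomic.h>
#include <stdint.h>

struct mylockref {
	_Atomic uint64_t lock_count;	/* bit 0: "lock", high 32 bits: count */
};

#define LOCKED_BIT	((uint64_t)1)

static void mylockref_get(struct mylockref *ref)
{
	uint64_t old = atomic_load_explicit(&ref->lock_count,
					    memory_order_relaxed);

	for (;;) {
		if (!(old & LOCKED_BIT)) {
			uint64_t new = old + ((uint64_t)1 << 32);	/* count++ */

			if (atomic_compare_exchange_weak_explicit(&ref->lock_count,
					&old, new, memory_order_relaxed,
					memory_order_relaxed))
				return;
			/* on failure the cmpxchg reloads 'old' for us */
		} else {
			/* lock is held: wait and resample, never fall back
			 * (a real implementation would cpu_relax() here) */
			old = atomic_load_explicit(&ref->lock_count,
						   memory_order_relaxed);
		}
	}
}
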
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton @ 2024-08-03 4:44 ` Mateusz Guzik 2024-08-03 9:09 ` Mateusz Guzik 2024-08-03 10:55 ` Jeff Layton 0 siblings, 2 replies; 16+ messages in thread From: Mateusz Guzik @ 2024-08-03 4:44 UTC (permalink / raw) To: Jeff Layton Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > In a later patch, we want to change the open(..., O_CREAT) codepath to > avoid taking the inode->i_rwsem for write when the dentry already exists. > When we tested that initially, the performance devolved significantly > due to contention for the parent's d_lockref spinlock. > > There are two problems with lockrefs today: First, once any concurrent > task takes the spinlock, they all end up taking the spinlock, which is > much more costly than a single cmpxchg operation. The second problem is > that once any task fails to cmpxchg 100 times, it falls back to the > spinlock. The upshot there is that even moderate contention can cause a > fallback to serialized spinlocking, which worsens performance. > > This patch changes CMPXCHG_LOOP in 2 ways: > > First, change the loop to spin instead of falling back to a locked > codepath when the spinlock is held. Once the lock is released, allow the > task to continue trying its cmpxchg loop as before instead of taking the > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > Just continue infinitely. > > This greatly reduces contention on the lockref when there are large > numbers of concurrent increments and decrements occurring. > This was already tried by me and it unfortunately can reduce performance. Key problem is that in some corner cases the lock can be continuously held and be queued on, making the fast path always fail and making all the spins actively waste time (and notably pull on the cacheline). See this for more details: https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/ However, I *suspect* in the case you are optimizing here (open + O_CREAT of an existing file) lockref on the parent can be avoided altogether with some hackery and that's what should be done here. When it comes to lockref in vfs in general, most uses can be elided with some hackery (see the above thread) which is in early WIP (the LSMs are a massive headache). For open calls which *do* need to take a real ref the hackery does not help of course. This is where I think decoupling ref from the lock is the best way forward. For that to work the dentry must hang around after the last unref (already done thanks to RCU and dput even explicitly handles that already!) and there needs to be a way to block new refs atomically -- can be done with cmpxchg from a 0-ref state to a flag blocking new refs coming in. I have that as a WIP as well. > Signed-off-by: Jeff Layton <jlayton@kernel.org> > --- > lib/lockref.c | 85 ++++++++++++++++++++++------------------------------------- > 1 file changed, 32 insertions(+), 53 deletions(-) > > diff --git a/lib/lockref.c b/lib/lockref.c > index 2afe4c5d8919..b76941043fe9 100644 > --- a/lib/lockref.c > +++ b/lib/lockref.c > @@ -8,22 +8,25 @@ > * Note that the "cmpxchg()" reloads the "old" value for the > * failure case. 
> */ > -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > - int retry = 100; \ > - struct lockref old; \ > - BUILD_BUG_ON(sizeof(old) != 8); \ > - old.lock_count = READ_ONCE(lockref->lock_count); \ > - while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > - struct lockref new = old; \ > - CODE \ > - if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > - &old.lock_count, \ > - new.lock_count))) { \ > - SUCCESS; \ > - } \ > - if (!--retry) \ > - break; \ > - } \ > +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > + struct lockref old; \ > + BUILD_BUG_ON(sizeof(old) != 8); \ > + old.lock_count = READ_ONCE(lockref->lock_count); \ > + for (;;) { \ > + struct lockref new = old; \ > + \ > + if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > + CODE \ > + if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > + &old.lock_count, \ > + new.lock_count))) { \ > + SUCCESS; \ > + } \ > + } else { \ > + cpu_relax(); \ > + old.lock_count = READ_ONCE(lockref->lock_count); \ > + } \ > + } \ > } while (0) > > #else > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref) > , > return; > ); > - > - spin_lock(&lockref->lock); > - lockref->count++; > - spin_unlock(&lockref->lock); > + /* should never get here */ > + WARN_ON_ONCE(1); > } > EXPORT_SYMBOL(lockref_get); > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get); > */ > int lockref_get_not_zero(struct lockref *lockref) > { > - int retval; > - > CMPXCHG_LOOP( > new.count++; > if (old.count <= 0) > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref) > , > return 1; > ); > - > - spin_lock(&lockref->lock); > - retval = 0; > - if (lockref->count > 0) { > - lockref->count++; > - retval = 1; > - } > - spin_unlock(&lockref->lock); > - return retval; > + /* should never get here */ > + WARN_ON_ONCE(1); > + return -1; > } > EXPORT_SYMBOL(lockref_get_not_zero); > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero); > */ > int lockref_put_not_zero(struct lockref *lockref) > { > - int retval; > - > CMPXCHG_LOOP( > new.count--; > if (old.count <= 1) > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref) > , > return 1; > ); > - > - spin_lock(&lockref->lock); > - retval = 0; > - if (lockref->count > 1) { > - lockref->count--; > - retval = 1; > - } > - spin_unlock(&lockref->lock); > - return retval; > + /* should never get here */ > + WARN_ON_ONCE(1); > + return -1; > } > EXPORT_SYMBOL(lockref_put_not_zero); > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref) > , > return new.count; > ); > + /* should never get here */ > + WARN_ON_ONCE(1); > return -1; > } > EXPORT_SYMBOL(lockref_put_return); > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead); > */ > int lockref_get_not_dead(struct lockref *lockref) > { > - int retval; > - > CMPXCHG_LOOP( > new.count++; > if (old.count < 0) > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref) > , > return 1; > ); > - > - spin_lock(&lockref->lock); > - retval = 0; > - if (lockref->count >= 0) { > - lockref->count++; > - retval = 1; > - } > - spin_unlock(&lockref->lock); > - return retval; > + /* should never get here */ > + WARN_ON_ONCE(1); > + return -1; > } > EXPORT_SYMBOL(lockref_get_not_dead); > > -- > 2.45.2 > ^ permalink raw reply [flat|nested] 16+ messages in thread
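
A minimal sketch of the "block new refs atomically" idea mentioned above,
using a plain atomic counter and a sentinel dead state, might look like
the following. The names and the exact state machine are assumptions for
illustration, not the WIP being referred to.

#include <stdatomic.h>
#include <stdbool.h>

#define REF_DEAD	(-1)	/* sentinel: no new references allowed */

struct myref {
	_Atomic int count;
};

/* Take a reference unless the object is already being torn down. */
static bool myref_get_not_dead(struct myref *r)
{
	int old = atomic_load_explicit(&r->count, memory_order_relaxed);

	while (old != REF_DEAD) {
		if (atomic_compare_exchange_weak_explicit(&r->count, &old,
				old + 1, memory_order_acquire,
				memory_order_relaxed))
			return true;
	}
	return false;
}

/* Teardown side: atomically switch 0 -> REF_DEAD to shut the door. */
static bool myref_kill(struct myref *r)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&r->count, &expected,
			REF_DEAD, memory_order_acq_rel, memory_order_relaxed);
}
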
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 4:44 ` Mateusz Guzik @ 2024-08-03 9:09 ` Mateusz Guzik 2024-08-03 10:59 ` Jeff Layton 2024-08-03 10:55 ` Jeff Layton 1 sibling, 1 reply; 16+ messages in thread From: Mateusz Guzik @ 2024-08-03 9:09 UTC (permalink / raw) To: Jeff Layton Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > When we tested that initially, the performance devolved significantly > > due to contention for the parent's d_lockref spinlock. > > > > There are two problems with lockrefs today: First, once any concurrent > > task takes the spinlock, they all end up taking the spinlock, which is > > much more costly than a single cmpxchg operation. The second problem is > > that once any task fails to cmpxchg 100 times, it falls back to the > > spinlock. The upshot there is that even moderate contention can cause a > > fallback to serialized spinlocking, which worsens performance. > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > First, change the loop to spin instead of falling back to a locked > > codepath when the spinlock is held. Once the lock is released, allow the > > task to continue trying its cmpxchg loop as before instead of taking the > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > Just continue infinitely. > > > > This greatly reduces contention on the lockref when there are large > > numbers of concurrent increments and decrements occurring. > > > > This was already tried by me and it unfortunately can reduce performance. > Oh wait I misread the patch based on what I tried there. Spinning indefinitely waiting for the lock to be free is a no-go as it loses the forward progress guarantee (and it is possible to get the lock being continuously held). Only spinning up to an arbitrary point wins some in some tests and loses in others. Either way, as described below, chances are decent that: 1. there is an easy way to not lockref_get/put on the parent if the file is already there, dodging the problem .. and even if that's not true 2. lockref can be ditched in favor of atomics. apart from some minor refactoring this all looks perfectly doable and I have a wip. I will try to find the time next week to sort it out > Key problem is that in some corner cases the lock can be continuously > held and be queued on, making the fast path always fail and making all > the spins actively waste time (and notably pull on the cacheline). > > See this for more details: > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/ > > However, I *suspect* in the case you are optimizing here (open + O_CREAT > of an existing file) lockref on the parent can be avoided altogether > with some hackery and that's what should be done here. > > When it comes to lockref in vfs in general, most uses can be elided with > some hackery (see the above thread) which is in early WIP (the LSMs are > a massive headache). > > For open calls which *do* need to take a real ref the hackery does not > help of course. > > This is where I think decoupling ref from the lock is the best way > forward. 
For that to work the dentry must hang around after the last > unref (already done thanks to RCU and dput even explicitly handles that > already!) and there needs to be a way to block new refs atomically -- > can be done with cmpxchg from a 0-ref state to a flag blocking new refs > coming in. I have that as a WIP as well. > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > lib/lockref.c | 85 ++++++++++++++++++++++------------------------------------- > > 1 file changed, 32 insertions(+), 53 deletions(-) > > > > diff --git a/lib/lockref.c b/lib/lockref.c > > index 2afe4c5d8919..b76941043fe9 100644 > > --- a/lib/lockref.c > > +++ b/lib/lockref.c > > @@ -8,22 +8,25 @@ > > * Note that the "cmpxchg()" reloads the "old" value for the > > * failure case. > > */ > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > - int retry = 100; \ > > - struct lockref old; \ > > - BUILD_BUG_ON(sizeof(old) != 8); \ > > - old.lock_count = READ_ONCE(lockref->lock_count); \ > > - while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > - struct lockref new = old; \ > > - CODE \ > > - if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > - &old.lock_count, \ > > - new.lock_count))) { \ > > - SUCCESS; \ > > - } \ > > - if (!--retry) \ > > - break; \ > > - } \ > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > + struct lockref old; \ > > + BUILD_BUG_ON(sizeof(old) != 8); \ > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > + for (;;) { \ > > + struct lockref new = old; \ > > + \ > > + if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > + CODE \ > > + if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > + &old.lock_count, \ > > + new.lock_count))) { \ > > + SUCCESS; \ > > + } \ > > + } else { \ > > + cpu_relax(); \ > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > + } \ > > + } \ > > } while (0) > > > > #else > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref) > > , > > return; > > ); > > - > > - spin_lock(&lockref->lock); > > - lockref->count++; > > - spin_unlock(&lockref->lock); > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > } > > EXPORT_SYMBOL(lockref_get); > > > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get); > > */ > > int lockref_get_not_zero(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count++; > > if (old.count <= 0) > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count > 0) { > > - lockref->count++; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_get_not_zero); > > > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero); > > */ > > int lockref_put_not_zero(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count--; > > if (old.count <= 1) > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count > 1) { > > - lockref->count--; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_put_not_zero); > > > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref) > > , > > return new.count; > > ); > > + /* 
should never get here */ > > + WARN_ON_ONCE(1); > > return -1; > > } > > EXPORT_SYMBOL(lockref_put_return); > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead); > > */ > > int lockref_get_not_dead(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count++; > > if (old.count < 0) > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count >= 0) { > > - lockref->count++; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_get_not_dead); > > > > -- > > 2.45.2 > > -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 9:09 ` Mateusz Guzik @ 2024-08-03 10:59 ` Jeff Layton 2024-08-03 11:21 ` Mateusz Guzik 0 siblings, 1 reply; 16+ messages in thread From: Jeff Layton @ 2024-08-03 10:59 UTC (permalink / raw) To: Mateusz Guzik Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote: > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > > When we tested that initially, the performance devolved significantly > > > due to contention for the parent's d_lockref spinlock. > > > > > > There are two problems with lockrefs today: First, once any concurrent > > > task takes the spinlock, they all end up taking the spinlock, which is > > > much more costly than a single cmpxchg operation. The second problem is > > > that once any task fails to cmpxchg 100 times, it falls back to the > > > spinlock. The upshot there is that even moderate contention can cause a > > > fallback to serialized spinlocking, which worsens performance. > > > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > > > First, change the loop to spin instead of falling back to a locked > > > codepath when the spinlock is held. Once the lock is released, allow the > > > task to continue trying its cmpxchg loop as before instead of taking the > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > > Just continue infinitely. > > > > > > This greatly reduces contention on the lockref when there are large > > > numbers of concurrent increments and decrements occurring. > > > > > > > This was already tried by me and it unfortunately can reduce performance. > > > > Oh wait I misread the patch based on what I tried there. Spinning > indefinitely waiting for the lock to be free is a no-go as it loses > the forward progress guarantee (and it is possible to get the lock > being continuously held). Only spinning up to an arbitrary point wins > some in some tests and loses in others. > I'm a little confused about the forward progress guarantee here. Does that exist today at all? ISTM that falling back to spin_lock() after a certain number of retries doesn't guarantee any forward progress. You can still just end up spinning on the lock forever once that happens, no? > Either way, as described below, chances are decent that: > 1. there is an easy way to not lockref_get/put on the parent if the > file is already there, dodging the problem > .. and even if that's not true > 2. lockref can be ditched in favor of atomics. apart from some minor > refactoring this all looks perfectly doable and I have a wip. I will > try to find the time next week to sort it out > Like I said in the earlier mail, I don't think we can stay in RCU mode because of the audit_inode call. I'm definitely interested in your WIP though! > > Key problem is that in some corner cases the lock can be continuously > > held and be queued on, making the fast path always fail and making all > > the spins actively waste time (and notably pull on the cacheline). 
> > > > See this for more details: > > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/ > > > > However, I *suspect* in the case you are optimizing here (open + O_CREAT > > of an existing file) lockref on the parent can be avoided altogether > > with some hackery and that's what should be done here. > > > > When it comes to lockref in vfs in general, most uses can be elided with > > some hackery (see the above thread) which is in early WIP (the LSMs are > > a massive headache). > > > > For open calls which *do* need to take a real ref the hackery does not > > help of course. > > > > This is where I think decoupling ref from the lock is the best way > > forward. For that to work the dentry must hang around after the last > > unref (already done thanks to RCU and dput even explicitly handles that > > already!) and there needs to be a way to block new refs atomically -- > > can be done with cmpxchg from a 0-ref state to a flag blocking new refs > > coming in. I have that as a WIP as well. > > > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > > --- > > > lib/lockref.c | 85 ++++++++++++++++++++++------------------------------------- > > > 1 file changed, 32 insertions(+), 53 deletions(-) > > > > > > diff --git a/lib/lockref.c b/lib/lockref.c > > > index 2afe4c5d8919..b76941043fe9 100644 > > > --- a/lib/lockref.c > > > +++ b/lib/lockref.c > > > @@ -8,22 +8,25 @@ > > > * Note that the "cmpxchg()" reloads the "old" value for the > > > * failure case. > > > */ > > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > > - int retry = 100; \ > > > - struct lockref old; \ > > > - BUILD_BUG_ON(sizeof(old) != 8); \ > > > - old.lock_count = READ_ONCE(lockref->lock_count); \ > > > - while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > > - struct lockref new = old; \ > > > - CODE \ > > > - if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > > - &old.lock_count, \ > > > - new.lock_count))) { \ > > > - SUCCESS; \ > > > - } \ > > > - if (!--retry) \ > > > - break; \ > > > - } \ > > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > > + struct lockref old; \ > > > + BUILD_BUG_ON(sizeof(old) != 8); \ > > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > > + for (;;) { \ > > > + struct lockref new = old; \ > > > + \ > > > + if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > > + CODE \ > > > + if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > > + &old.lock_count, \ > > > + new.lock_count))) { \ > > > + SUCCESS; \ > > > + } \ > > > + } else { \ > > > + cpu_relax(); \ > > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > > + } \ > > > + } \ > > > } while (0) > > > > > > #else > > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref) > > > , > > > return; > > > ); > > > - > > > - spin_lock(&lockref->lock); > > > - lockref->count++; > > > - spin_unlock(&lockref->lock); > > > + /* should never get here */ > > > + WARN_ON_ONCE(1); > > > } > > > EXPORT_SYMBOL(lockref_get); > > > > > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get); > > > */ > > > int lockref_get_not_zero(struct lockref *lockref) > > > { > > > - int retval; > > > - > > > CMPXCHG_LOOP( > > > new.count++; > > > if (old.count <= 0) > > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref) > > > , > > > return 1; > > > ); > > > - > > > - spin_lock(&lockref->lock); > > > - retval = 0; > > > - if (lockref->count > 0) { > > > - lockref->count++; > > > - retval = 1; > > > - } > > > - 
spin_unlock(&lockref->lock); > > > - return retval; > > > + /* should never get here */ > > > + WARN_ON_ONCE(1); > > > + return -1; > > > } > > > EXPORT_SYMBOL(lockref_get_not_zero); > > > > > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero); > > > */ > > > int lockref_put_not_zero(struct lockref *lockref) > > > { > > > - int retval; > > > - > > > CMPXCHG_LOOP( > > > new.count--; > > > if (old.count <= 1) > > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref) > > > , > > > return 1; > > > ); > > > - > > > - spin_lock(&lockref->lock); > > > - retval = 0; > > > - if (lockref->count > 1) { > > > - lockref->count--; > > > - retval = 1; > > > - } > > > - spin_unlock(&lockref->lock); > > > - return retval; > > > + /* should never get here */ > > > + WARN_ON_ONCE(1); > > > + return -1; > > > } > > > EXPORT_SYMBOL(lockref_put_not_zero); > > > > > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref) > > > , > > > return new.count; > > > ); > > > + /* should never get here */ > > > + WARN_ON_ONCE(1); > > > return -1; > > > } > > > EXPORT_SYMBOL(lockref_put_return); > > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead); > > > */ > > > int lockref_get_not_dead(struct lockref *lockref) > > > { > > > - int retval; > > > - > > > CMPXCHG_LOOP( > > > new.count++; > > > if (old.count < 0) > > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref) > > > , > > > return 1; > > > ); > > > - > > > - spin_lock(&lockref->lock); > > > - retval = 0; > > > - if (lockref->count >= 0) { > > > - lockref->count++; > > > - retval = 1; > > > - } > > > - spin_unlock(&lockref->lock); > > > - return retval; > > > + /* should never get here */ > > > + WARN_ON_ONCE(1); > > > + return -1; > > > } > > > EXPORT_SYMBOL(lockref_get_not_dead); > > > > > > -- > > > 2.45.2 > > > > > > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 10:59 ` Jeff Layton @ 2024-08-03 11:21 ` Mateusz Guzik 2024-08-03 11:32 ` Jeff Layton 0 siblings, 1 reply; 16+ messages in thread From: Mateusz Guzik @ 2024-08-03 11:21 UTC (permalink / raw) To: Jeff Layton Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote: > > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote: > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > > > When we tested that initially, the performance devolved significantly > > > > due to contention for the parent's d_lockref spinlock. > > > > > > > > There are two problems with lockrefs today: First, once any concurrent > > > > task takes the spinlock, they all end up taking the spinlock, which is > > > > much more costly than a single cmpxchg operation. The second problem is > > > > that once any task fails to cmpxchg 100 times, it falls back to the > > > > spinlock. The upshot there is that even moderate contention can cause a > > > > fallback to serialized spinlocking, which worsens performance. > > > > > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > > > > > First, change the loop to spin instead of falling back to a locked > > > > codepath when the spinlock is held. Once the lock is released, allow the > > > > task to continue trying its cmpxchg loop as before instead of taking the > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > > > Just continue infinitely. > > > > > > > > This greatly reduces contention on the lockref when there are large > > > > numbers of concurrent increments and decrements occurring. > > > > > > > > > > This was already tried by me and it unfortunately can reduce performance. > > > > > > > Oh wait I misread the patch based on what I tried there. Spinning > > indefinitely waiting for the lock to be free is a no-go as it loses > > the forward progress guarantee (and it is possible to get the lock > > being continuously held). Only spinning up to an arbitrary point wins > > some in some tests and loses in others. > > > > I'm a little confused about the forward progress guarantee here. Does > that exist today at all? ISTM that falling back to spin_lock() after a > certain number of retries doesn't guarantee any forward progress. You > can still just end up spinning on the lock forever once that happens, > no? > There is the implicit assumption that everyone holds locks for a finite time. I agree there are no guarantees otherwise if that's what you meant. In this case, since spinlocks are queued, a constant stream of lock holders will make the lock appear taken indefinitely even if they all hold it for a short period. Stock lockref will give up atomics immediately and make sure to change the ref thanks to queueing up. Lockref as proposed in this patch wont be able to do anything as long as the lock trading is taking place. > > Either way, as described below, chances are decent that: > > 1. there is an easy way to not lockref_get/put on the parent if the > > file is already there, dodging the problem > > .. and even if that's not true > > 2. lockref can be ditched in favor of atomics. 
apart from some minor > > refactoring this all looks perfectly doable and I have a wip. I will > > try to find the time next week to sort it out > > > > Like I said in the earlier mail, I don't think we can stay in RCU mode > because of the audit_inode call. I'm definitely interested in your WIP > though! > well audit may be hackable so that it works in rcu most of the time, but that's not something i'm interested in sorting out the lockref situation would definitely help other stuff (notably opening the same file RO). anyhow one idea is to temporarily disable atomic ops with a flag in the counter, a fallback plan is to loosen lockref so that it can do transitions other than 0->1->2 with atomics, even if the lock is held. I have not looked at this in over a month, I'm going to need to refresh my memory on the details, I do remember there was some stuff to massage first. Anyhow, I expect a working WIP some time in the upcoming week. -- Mateusz Guzik <mjguzik gmail.com> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 11:21 ` Mateusz Guzik @ 2024-08-03 11:32 ` Jeff Layton 2024-08-05 11:44 ` Christian Brauner 0 siblings, 1 reply; 16+ messages in thread From: Jeff Layton @ 2024-08-03 11:32 UTC (permalink / raw) To: Mateusz Guzik Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote: > On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote: > > > > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote: > > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote: > > > > > > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > > > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > > > > When we tested that initially, the performance devolved significantly > > > > > due to contention for the parent's d_lockref spinlock. > > > > > > > > > > There are two problems with lockrefs today: First, once any concurrent > > > > > task takes the spinlock, they all end up taking the spinlock, which is > > > > > much more costly than a single cmpxchg operation. The second problem is > > > > > that once any task fails to cmpxchg 100 times, it falls back to the > > > > > spinlock. The upshot there is that even moderate contention can cause a > > > > > fallback to serialized spinlocking, which worsens performance. > > > > > > > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > > > > > > > First, change the loop to spin instead of falling back to a locked > > > > > codepath when the spinlock is held. Once the lock is released, allow the > > > > > task to continue trying its cmpxchg loop as before instead of taking the > > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > > > > Just continue infinitely. > > > > > > > > > > This greatly reduces contention on the lockref when there are large > > > > > numbers of concurrent increments and decrements occurring. > > > > > > > > > > > > > This was already tried by me and it unfortunately can reduce performance. > > > > > > > > > > Oh wait I misread the patch based on what I tried there. Spinning > > > indefinitely waiting for the lock to be free is a no-go as it loses > > > the forward progress guarantee (and it is possible to get the lock > > > being continuously held). Only spinning up to an arbitrary point wins > > > some in some tests and loses in others. > > > > > > > I'm a little confused about the forward progress guarantee here. Does > > that exist today at all? ISTM that falling back to spin_lock() after a > > certain number of retries doesn't guarantee any forward progress. You > > can still just end up spinning on the lock forever once that happens, > > no? > > > > There is the implicit assumption that everyone holds locks for a > finite time. I agree there are no guarantees otherwise if that's what > you meant. > > In this case, since spinlocks are queued, a constant stream of lock > holders will make the lock appear taken indefinitely even if they all > hold it for a short period. > > Stock lockref will give up atomics immediately and make sure to change > the ref thanks to queueing up. > > Lockref as proposed in this patch wont be able to do anything as long > as the lock trading is taking place. > Got it, thanks. 
This spinning is very simplistic, so I could see that you could have one task continually getting shuffled to the end of the queue. > > > Either way, as described below, chances are decent that: > > > 1. there is an easy way to not lockref_get/put on the parent if the > > > file is already there, dodging the problem > > > .. and even if that's not true > > > 2. lockref can be ditched in favor of atomics. apart from some minor > > > refactoring this all looks perfectly doable and I have a wip. I will > > > try to find the time next week to sort it out > > > > > > > Like I said in the earlier mail, I don't think we can stay in RCU mode > > because of the audit_inode call. I'm definitely interested in your WIP > > though! > > > > well audit may be hackable so that it works in rcu most of the time, > but that's not something i'm interested in Audit not my favorite area of the kernel to work in either. I don't see a good way to make it rcu-friendly, but I haven't looked too hard yet either. It would be nice to be able to do some of the auditing under rcu or spinlock. > > sorting out the lockref situation would definitely help other stuff > (notably opening the same file RO). > Indeed. It's clear that the current implementation is a real scalability problem in a lot of situations. > anyhow one idea is to temporarily disable atomic ops with a flag in > the counter, a fallback plan is to loosen lockref so that it can do > transitions other than 0->1->2 with atomics, even if the lock is held. > > I have not looked at this in over a month, I'm going to need to > refresh my memory on the details, I do remember there was some stuff > to massage first. > > Anyhow, I expect a working WIP some time in the upcoming week. > Great, I'll stay tuned. Thanks! -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 11:32 ` Jeff Layton @ 2024-08-05 11:44 ` Christian Brauner 2024-08-05 12:52 ` Jeff Layton 0 siblings, 1 reply; 16+ messages in thread From: Christian Brauner @ 2024-08-05 11:44 UTC (permalink / raw) To: Jeff Layton Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel > Audit not my favorite area of the kernel to work in either. I don't see > a good way to make it rcu-friendly, but I haven't looked too hard yet > either. It would be nice to be able to do some of the auditing under > rcu or spinlock. For audit your main option is to dodge the problem and check whether audit is active and only drop out of rcu if it is. That sidesteps the problem. I'm somewhat certain that a lot of systems don't really have audit active. From a brief look at audit it would be quite involved to make it work just under rcu. Not just because it does various allocation but it also reads fscaps from disk and so on. That's not going to work unless we add a vfs based fscaps cache similar to what we do for acls. I find that very unlikely. ^ permalink raw reply [flat|nested] 16+ messages in thread
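
As a rough illustration of the dodge being suggested here (hypothetical
code, not part of any posted patch; audit_context() returns NULL when the
current task has no audit context attached):

	/*
	 * Only leave RCU-walk when audit would actually record the parent;
	 * otherwise keep the lockless fast path.
	 */
	if (unlikely(audit_context())) {
		if ((nd->flags & LOOKUP_RCU) && !try_to_unlazy(nd))
			return ERR_PTR(-ECHILD);
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
	}
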
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-05 11:44 ` Christian Brauner @ 2024-08-05 12:52 ` Jeff Layton 2024-08-06 11:36 ` Christian Brauner 0 siblings, 1 reply; 16+ messages in thread From: Jeff Layton @ 2024-08-05 12:52 UTC (permalink / raw) To: Christian Brauner Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote: > > Audit not my favorite area of the kernel to work in either. I don't see > > a good way to make it rcu-friendly, but I haven't looked too hard yet > > either. It would be nice to be able to do some of the auditing under > > rcu or spinlock. > > For audit your main option is to dodge the problem and check whether > audit is active and only drop out of rcu if it is. That sidesteps the > problem. I'm somewhat certain that a lot of systems don't really have > audit active. > I did have an earlier version of 4/4 that checked audit_context() and stayed in RCU mode if it comes back NULL. I can resurrect that if you think it's worthwhile. > From a brief look at audit it would be quite involved to make it work > just under rcu. Not just because it does various allocation but it also > reads fscaps from disk and so on. That's not going to work unless we add > a vfs based fscaps cache similar to what we do for acls. I find that > very unlikely. Yeah. It wants to record a lot of (variable-length) information at very inconvenient times. I think we're sort of stuck with it though until someone has a vision on how to do this in a non-blocking way. Handwavy thought: there is some similarity to tracepoints in what audit_inode does, and tracepoints are able to be called in all sorts of contexts. I wonder if we could leverage the same infrastructure somehow? The catch here is that we can't just drop audit records if things go wrong. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-05 12:52 ` Jeff Layton @ 2024-08-06 11:36 ` Christian Brauner 0 siblings, 0 replies; 16+ messages in thread From: Christian Brauner @ 2024-08-06 11:36 UTC (permalink / raw) To: Jeff Layton Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Mon, Aug 05, 2024 at 08:52:28AM GMT, Jeff Layton wrote: > On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote: > > > Audit not my favorite area of the kernel to work in either. I don't see > > > a good way to make it rcu-friendly, but I haven't looked too hard yet > > > either. It would be nice to be able to do some of the auditing under > > > rcu or spinlock. > > > > For audit your main option is to dodge the problem and check whether > > audit is active and only drop out of rcu if it is. That sidesteps the > > problem. I'm somewhat certain that a lot of systems don't really have > > audit active. > > > > I did have an earlier version of 4/4 that checked audit_context() and > stayed in RCU mode if it comes back NULL. I can resurrect that if you > think it's worthwhile. Let's at least see what it looks like. Maybe just use a helper local to fs/namei.c that returns ECHILD if audit is available and 0 otherwise? > > From a brief look at audit it would be quite involved to make it work > > just under rcu. Not just because it does various allocation but it also > > reads fscaps from disk and so on. That's not going to work unless we add > > a vfs based fscaps cache similar to what we do for acls. I find that > > very unlikely. > > Yeah. It wants to record a lot of (variable-length) information at very > inconvenient times. I think we're sort of stuck with it though until > someone has a vision on how to do this in a non-blocking way. > > Handwavy thought: there is some similarity to tracepoints in what > audit_inode does, and tracepoints are able to be called in all sorts of > contexts. I wonder if we could leverage the same infrastructure > somehow? The catch here is that we can't just drop audit records if > things go wrong. I can't say much about the tracepoint idea as I lack the necessary details around their implementation. I think the better way forward is a model with a fastpath and a slowpath. Under RCU audit_inode() returns -ECHILD if it sees that it neeeds to end up doing anything it couldn't do in a non-blocking way and then path lookup can drop out of RCU and call audit_inode() again. I think this wouldn't be extremly terrible. It would amount to adding a flag to audit_inode() AUDIT_MAY_NOT_BLOCK and then on ECHILD audit_inode() gets called again without that flag. Over time if people are interested they could then make more and more stuff available under rcu for audit. ^ permalink raw reply [flat|nested] 16+ messages in thread
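
The fastpath/slowpath model described above would give the call site
roughly the following shape. This is purely hypothetical: audit_inode()
currently returns void and AUDIT_MAY_NOT_BLOCK does not exist; both would
be part of the proposed change.

	if (nd->flags & LOOKUP_RCU) {
		/* fast path: ask audit to record without blocking */
		if (audit_inode(nd->name, dir, AUDIT_INODE_PARENT |
				AUDIT_MAY_NOT_BLOCK) == -ECHILD) {
			if (!try_to_unlazy(nd))
				return ERR_PTR(-ECHILD);
			/* slow path: blocking is allowed now */
			audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
		}
	} else {
		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
	}
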
* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better 2024-08-03 4:44 ` Mateusz Guzik 2024-08-03 9:09 ` Mateusz Guzik @ 2024-08-03 10:55 ` Jeff Layton 1 sibling, 0 replies; 16+ messages in thread From: Jeff Layton @ 2024-08-03 10:55 UTC (permalink / raw) To: Mateusz Guzik Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Sat, 2024-08-03 at 06:44 +0200, Mateusz Guzik wrote: > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote: > > In a later patch, we want to change the open(..., O_CREAT) codepath to > > avoid taking the inode->i_rwsem for write when the dentry already exists. > > When we tested that initially, the performance devolved significantly > > due to contention for the parent's d_lockref spinlock. > > > > There are two problems with lockrefs today: First, once any concurrent > > task takes the spinlock, they all end up taking the spinlock, which is > > much more costly than a single cmpxchg operation. The second problem is > > that once any task fails to cmpxchg 100 times, it falls back to the > > spinlock. The upshot there is that even moderate contention can cause a > > fallback to serialized spinlocking, which worsens performance. > > > > This patch changes CMPXCHG_LOOP in 2 ways: > > > > First, change the loop to spin instead of falling back to a locked > > codepath when the spinlock is held. Once the lock is released, allow the > > task to continue trying its cmpxchg loop as before instead of taking the > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries. > > Just continue infinitely. > > > > This greatly reduces contention on the lockref when there are large > > numbers of concurrent increments and decrements occurring. > > > > This was already tried by me and it unfortunately can reduce performance. > > Key problem is that in some corner cases the lock can be continuously > held and be queued on, making the fast path always fail and making all > the spins actively waste time (and notably pull on the cacheline). > The cacheline contention does seem like a real problem with this approach. > See this for more details: > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/ > > However, I *suspect* in the case you are optimizing here (open + O_CREAT > of an existing file) lockref on the parent can be avoided altogether > with some hackery and that's what should be done here. > Unfortunately I don't think we can in this codepath: -------------------8<---------------------- if (!(open_flag & O_CREAT)) { ... } else { /* create side of things */ if (nd->flags & LOOKUP_RCU) { if (!try_to_unlazy(nd)) return ERR_PTR(-ECHILD); } audit_inode(nd->name, dir, AUDIT_INODE_PARENT); /* trailing slashes? */ if (unlikely(nd->last.name[nd->last.len])) return ERR_PTR(-EISDIR); } -------------------8<---------------------- The problem here is the audit_inode call, which can do a GFP_KERNEL allocation. We can't stay in RCU mode for that, and we need a reference to "dir" (at least with the current way audit works). > When it comes to lockref in vfs in general, most uses can be elided with > some hackery (see the above thread) which is in early WIP (the LSMs are > a massive headache). > > For open calls which *do* need to take a real ref the hackery does not > help of course. > > This is where I think decoupling ref from the lock is the best way > forward. 
For that to work the dentry must hang around after the last > unref (already done thanks to RCU and dput even explicitly handles that > already!) and there needs to be a way to block new refs atomically -- > can be done with cmpxchg from a 0-ref state to a flag blocking new refs > coming in. I have that as a WIP as well. > These both sound very interesting. FWIW, Josef also started looking at decoupling the refcount and lock, but I don't think he's gotten very far yet. I'm happy to help test some of this too if you get to that point. The 4th patch in this RFC series really amps up the contention for the lockref once the i_rwsem isn't being touched. > > > Signed-off-by: Jeff Layton <jlayton@kernel.org> > > --- > > lib/lockref.c | 85 ++++++++++++++++++++++------------------------------------- > > 1 file changed, 32 insertions(+), 53 deletions(-) > > > > diff --git a/lib/lockref.c b/lib/lockref.c > > index 2afe4c5d8919..b76941043fe9 100644 > > --- a/lib/lockref.c > > +++ b/lib/lockref.c > > @@ -8,22 +8,25 @@ > > * Note that the "cmpxchg()" reloads the "old" value for the > > * failure case. > > */ > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > - int retry = 100; \ > > - struct lockref old; \ > > - BUILD_BUG_ON(sizeof(old) != 8); \ > > - old.lock_count = READ_ONCE(lockref->lock_count); \ > > - while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > - struct lockref new = old; \ > > - CODE \ > > - if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > - &old.lock_count, \ > > - new.lock_count))) { \ > > - SUCCESS; \ > > - } \ > > - if (!--retry) \ > > - break; \ > > - } \ > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do { \ > > + struct lockref old; \ > > + BUILD_BUG_ON(sizeof(old) != 8); \ > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > + for (;;) { \ > > + struct lockref new = old; \ > > + \ > > + if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) { \ > > + CODE \ > > + if (likely(try_cmpxchg64_relaxed(&lockref->lock_count, \ > > + &old.lock_count, \ > > + new.lock_count))) { \ > > + SUCCESS; \ > > + } \ > > + } else { \ > > + cpu_relax(); \ > > + old.lock_count = READ_ONCE(lockref->lock_count); \ > > + } \ > > + } \ > > } while (0) > > > > #else > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref) > > , > > return; > > ); > > - > > - spin_lock(&lockref->lock); > > - lockref->count++; > > - spin_unlock(&lockref->lock); > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > } > > EXPORT_SYMBOL(lockref_get); > > > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get); > > */ > > int lockref_get_not_zero(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count++; > > if (old.count <= 0) > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count > 0) { > > - lockref->count++; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_get_not_zero); > > > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero); > > */ > > int lockref_put_not_zero(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count--; > > if (old.count <= 1) > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count > 1) { > > - 
lockref->count--; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_put_not_zero); > > > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref) > > , > > return new.count; > > ); > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > return -1; > > } > > EXPORT_SYMBOL(lockref_put_return); > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead); > > */ > > int lockref_get_not_dead(struct lockref *lockref) > > { > > - int retval; > > - > > CMPXCHG_LOOP( > > new.count++; > > if (old.count < 0) > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref) > > , > > return 1; > > ); > > - > > - spin_lock(&lockref->lock); > > - retval = 0; > > - if (lockref->count >= 0) { > > - lockref->count++; > > - retval = 1; > > - } > > - spin_unlock(&lockref->lock); > > - return retval; > > + /* should never get here */ > > + WARN_ON_ONCE(1); > > + return -1; > > } > > EXPORT_SYMBOL(lockref_get_not_dead); > > > > -- > > 2.45.2 > > -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 16+ messages in thread
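As an aside, the "decouple the ref from the lock" scheme mentioned in
the discussion above can be illustrated with a small standalone sketch.
This is purely illustrative userspace C, not kernel code: the names
(struct dref, DREF_DEAD, dref_get, dref_put, dref_kill) are invented
for this example and do not correspond to any existing or proposed API.
It only shows the shape of the idea: references are taken with a plain
cmpxchg loop, the count is allowed to sit at zero while RCU keeps the
object around, and teardown blocks new references by atomically moving
the count from zero to a dead flag.
-------------------8<----------------------
#include <stdatomic.h>
#include <stdbool.h>

#define DREF_DEAD	(-1L)

struct dref {
	_Atomic long count;
};

/* Take a reference unless teardown has already marked the object dead. */
static bool dref_get(struct dref *r)
{
	long old = atomic_load_explicit(&r->count, memory_order_relaxed);

	do {
		if (old == DREF_DEAD)
			return false;
	} while (!atomic_compare_exchange_weak_explicit(&r->count, &old,
							old + 1,
							memory_order_acquire,
							memory_order_relaxed));
	return true;
}

/* Drop a reference; the count may sit at zero while the object lingers. */
static void dref_put(struct dref *r)
{
	atomic_fetch_sub_explicit(&r->count, 1, memory_order_release);
}

/*
 * Teardown: succeeds only if the count can be moved atomically from
 * zero to DREF_DEAD, which blocks any further dref_get() callers.
 */
static bool dref_kill(struct dref *r)
{
	long expected = 0;

	return atomic_compare_exchange_strong_explicit(&r->count, &expected,
						       DREF_DEAD,
						       memory_order_acq_rel,
						       memory_order_relaxed);
}
-------------------8<----------------------
A real implementation would also need to cope with objects being looked
up again after the count has dropped to zero (as the dcache allows for
dentries today); the sketch above ignores that entirely.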
* [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too 2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton ` (2 preceding siblings ...) 2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton @ 2024-08-02 21:45 ` Jeff Layton 2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner 4 siblings, 0 replies; 16+ messages in thread From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw) To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. I'm sure this was done with the expectation that O_CREAT being set means that we expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists. This patch rearranges the pathwalk-for-open code to also attempt a fast_lookup in the O_CREAT case. Have the code always do a fast_lookup (unless O_EXCL is set), and return that without taking the inode_lock when a positive dentry is found in the O_CREAT codepath. Signed-off-by: Jeff Layton <jlayton@kernel.org> --- fs/namei.c | 43 ++++++++++++++++++++++++++++++++++++------- 1 file changed, 36 insertions(+), 7 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index b9bdb8e6214a..1793ed090314 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3538,7 +3538,7 @@ static const char *open_last_lookups(struct nameidata *nd, struct dentry *dir = nd->path.dentry; int open_flag = op->open_flag; bool got_write = false; - struct dentry *dentry; + struct dentry *dentry = NULL; const char *res; nd->flags |= op->intent; @@ -3549,28 +3549,57 @@ static const char *open_last_lookups(struct nameidata *nd, return handle_dots(nd, nd->last_type); } - if (!(open_flag & O_CREAT)) { - if (nd->last.name[nd->last.len]) + /* + * We _can_ be in RCU mode here. For everything but O_EXCL case, do a + * fast lookup for the dentry first. For O_CREAT case, we are only + * interested in positive dentries. If nothing suitable is found, + * fall back to locked codepath. + */ + if ((open_flag & (O_CREAT | O_EXCL)) != (O_CREAT | O_EXCL)) { + /* Trailing slashes? */ + if (unlikely(nd->last.name[nd->last.len])) nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY; - /* we _can_ be in RCU mode here */ + dentry = lookup_fast(nd); if (IS_ERR(dentry)) return ERR_CAST(dentry); + } + + if (!(open_flag & O_CREAT)) { if (likely(dentry)) goto finish_lookup; if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU)) return ERR_PTR(-ECHILD); } else { - /* create side of things */ + /* If negative dentry was found earlier, + * discard it as we'll need to use the slow path anyway. + */ if (nd->flags & LOOKUP_RCU) { - if (!try_to_unlazy(nd)) + bool unlazied; + + /* discard negative dentry if one was found */ + if (dentry && !dentry->d_inode) + dentry = NULL; + + unlazied = dentry ? try_to_unlazy_next(nd, dentry) : + try_to_unlazy(nd); + if (!unlazied) return ERR_PTR(-ECHILD); + } else if (dentry && !dentry->d_inode) { + /* discard negative dentry if one was found */ + dput(dentry); + dentry = NULL; } audit_inode(nd->name, dir, AUDIT_INODE_PARENT); + /* trailing slashes? 
*/ - if (unlikely(nd->last.name[nd->last.len])) + if (unlikely(nd->last.name[nd->last.len])) { + dput(dentry); return ERR_PTR(-EISDIR); + } + if (dentry) + goto finish_lookup; } if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { -- 2.45.2 ^ permalink raw reply related [flat|nested] 16+ messages in thread
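For context, a minimal userspace sketch of the workload this patch
targets (repeated O_CREAT opens of a file that already exists) could
look like the following. The file name and iteration count are
arbitrary; the point is that every iteration goes down the O_CREAT
path even though nothing is ever created, which is the case the fast
lookup above lets us serve without taking the parent's i_rwsem for
write.
-------------------8<----------------------
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	const char *path = "testfile";	/* pre-created, shared by all workers */

	for (int i = 0; i < 100000; i++) {
		int fd = open(path, O_CREAT | O_WRONLY, 0644);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		close(fd);
	}
	return 0;
}
-------------------8<----------------------
Run several of these in parallel in the same directory and, without the
patch, they serialize on the parent's exclusive inode lock; with it,
the existing dentry is found by lookup_fast() and the lock is skipped.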
* Re: [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too 2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton ` (3 preceding siblings ...) 2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton @ 2024-08-05 10:46 ` Christian Brauner 2024-08-05 11:55 ` Jeff Layton 4 siblings, 1 reply; 16+ messages in thread From: Christian Brauner @ 2024-08-05 10:46 UTC (permalink / raw) To: Jeff Layton Cc: Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel > fs: remove comment about d_rcu_to_refcount > fs: add a kerneldoc header over lookup_fast I took both of these into vfs.misc since they're really unrelated cleanups. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too 2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner @ 2024-08-05 11:55 ` Jeff Layton 0 siblings, 0 replies; 16+ messages in thread From: Jeff Layton @ 2024-08-05 11:55 UTC (permalink / raw) To: Christian Brauner Cc: Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik, linux-fsdevel, linux-kernel On Mon, 2024-08-05 at 12:46 +0200, Christian Brauner wrote: > > fs: remove comment about d_rcu_to_refcount > > fs: add a kerneldoc header over lookup_fast > > I took both of these into vfs.misc since they're really unrelated cleanups. Thanks! I was going to suggest that. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 16+ messages in thread