linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too
@ 2024-08-02 21:45 Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
  Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton

We've had reports of i_rwsem contention in some workloads. On an open
with O_CREAT set, the kernel always takes the parent directory's
i_rwsem for write when handling the last component. This patchset is
intended to alleviate some of that contention, and to generally speed
up opens when O_CREAT is set and the file already exists.

I have a simple benchmark that forks a bunch of processes that then do
O_CREAT opens and closes of a file in the same directory as their peers.
This test runs in about 18s on v6.10 kernels. With this patchset, it
runs in about 10s.
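
For reference, the test is roughly equivalent to the sketch below (a
simplified reconstruction, not the actual benchmark; the process and
iteration counts and the per-worker file naming are illustrative):

-------------------8<----------------------
/*
 * Fork a number of workers that each repeatedly open and close a file
 * with O_CREAT in the same (shared) directory. After the first pass
 * the file already exists, so the open should not need to create it.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROC	64
#define NITER	100000

int main(void)
{
	for (int i = 0; i < NPROC; i++) {
		if (fork() == 0) {
			char name[32];

			snprintf(name, sizeof(name), "file-%d", i);
			for (int j = 0; j < NITER; j++) {
				int fd = open(name, O_CREAT | O_RDWR, 0644);

				if (fd >= 0)
					close(fd);
			}
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}
-------------------8<----------------------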

The basic idea is to do an opportunistic dcache lookup even in the
O_CREAT case, which allows us to avoid taking the inode_lock altogether
when the file already exists. The last patch does this, but on its own
it turns out to perform worse under heavy contention than simply taking
the i_rwsem for write.

The problem is that this moves the contention to the parent's
d_lockref, which rapidly devolves into plain old spinlock contention.
So, there is an earlier patch in the series that changes how lockrefs
work in order to perform better under heavy contention. Locking
primitives aren't my area of expertise, so consider that one a starting
point for discussion.

I also wrote a second testcase that just creates and deletes a file in
a loop, trying to hit the pessimal case where the fast lookup never
helps (a rough sketch of it is below). That test runs in about the same
amount of time on both kernels. A Linux kernel build also takes about
the same amount of time with and without this patchset.
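
The create/delete test is essentially a loop like this (again a
simplified sketch; the iteration count and file name are arbitrary):

-------------------8<----------------------
/*
 * Each pass creates the file and immediately unlinks it, so the
 * opportunistic lookup never finds a positive dentry.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		int fd = open("testfile", O_CREAT | O_RDWR, 0644);

		if (fd >= 0)
			close(fd);
		unlink("testfile");
	}
	return 0;
}
-------------------8<----------------------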

Many thanks to Josef for his help with this.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
Jeff Layton (4):
      fs: remove comment about d_rcu_to_refcount
      fs: add a kerneldoc header over lookup_fast
      lockref: rework CMPXCHG_LOOP to handle contention better
      fs: try an opportunistic lookup for O_CREAT opens too

 fs/dcache.c   |  3 ---
 fs/namei.c    | 57 ++++++++++++++++++++++++++++++++++-----
 lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
 3 files changed, 82 insertions(+), 63 deletions(-)
---
base-commit: 0c3836482481200ead7b416ca80c68a29cfdaabd
change-id: 20240723-openfast-ac49a7b6ade2

Best regards,
-- 
Jeff Layton <jlayton@kernel.org>



* [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount
  2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
@ 2024-08-02 21:45 ` Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
  Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton

This function no longer exists.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/dcache.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 4c144519aa70..474c5205a3db 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2160,9 +2160,6 @@ static noinline struct dentry *__d_lookup_rcu_op_compare(
  * without taking d_lock and checking d_seq sequence count against @seq
  * returned here.
  *
- * A refcount may be taken on the found dentry with the d_rcu_to_refcount
- * function.
- *
  * Alternatively, __d_lookup_rcu may be called again to look up the child of
  * the returned dentry, so long as its parent's seqlock is checked after the
  * child is looked up. Thus, an interlocking stepping of sequence lock checks

-- 
2.45.2



* [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast
  2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton
@ 2024-08-02 21:45 ` Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
  Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton

The lookup_fast helper in fs/namei.c has some subtleties in how
dentries are returned. Document them.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/namei.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index 1e05a0f3f04d..b9bdb8e6214a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1613,6 +1613,20 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 }
 EXPORT_SYMBOL(lookup_one_qstr_excl);
 
+/**
+ * lookup_fast - do fast lockless (but racy) lookup of a dentry
+ * @nd: current nameidata
+ *
+ * Do a fast, but racy lookup in the dcache for the given dentry, and
+ * revalidate it. Returns a valid dentry pointer or NULL if one wasn't
+ * found. On error, an ERR_PTR will be returned.
+ *
+ * If this function returns a valid dentry and the walk is no longer
+ * lazy, the dentry will carry a reference that must later be put. If
+ * RCU mode is still in force, then this is not the case and the dentry
+ * must be legitimized before use. If this returns NULL, then the walk
+ * will no longer be in RCU mode.
+ */
 static struct dentry *lookup_fast(struct nameidata *nd)
 {
 	struct dentry *dentry, *parent = nd->path.dentry;

-- 
2.45.2



* [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton
  2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton
@ 2024-08-02 21:45 ` Jeff Layton
  2024-08-03  4:44   ` Mateusz Guzik
  2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
  2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner
  4 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
  Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton

In a later patch, we want to change the open(..., O_CREAT) codepath to
avoid taking the inode->i_rwsem for write when the dentry already exists.
When we tested that initially, the performance devolved significantly
due to contention for the parent's d_lockref spinlock.

There are two problems with lockrefs today: First, once any concurrent
task takes the spinlock, they all end up taking the spinlock, which is
much more costly than a single cmpxchg operation. The second problem is
that once any task fails to cmpxchg 100 times, it falls back to the
spinlock. The upshot there is that even moderate contention can cause a
fallback to serialized spinlocking, which worsens performance.

This patch changes CMPXCHG_LOOP in 2 ways:

First, change the loop to spin instead of falling back to a locked
codepath when the spinlock is held. Once the lock is released, allow the
task to continue trying its cmpxchg loop as before instead of taking the
lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
Just continue infinitely.

This greatly reduces contention on the lockref when there are large
numbers of concurrent increments and decrements occurring.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
 1 file changed, 32 insertions(+), 53 deletions(-)

diff --git a/lib/lockref.c b/lib/lockref.c
index 2afe4c5d8919..b76941043fe9 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -8,22 +8,25 @@
  * Note that the "cmpxchg()" reloads the "old" value for the
  * failure case.
  */
-#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
-	int retry = 100;							\
-	struct lockref old;							\
-	BUILD_BUG_ON(sizeof(old) != 8);						\
-	old.lock_count = READ_ONCE(lockref->lock_count);			\
-	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {  	\
-		struct lockref new = old;					\
-		CODE								\
-		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
-						 &old.lock_count,		\
-						 new.lock_count))) {		\
-			SUCCESS;						\
-		}								\
-		if (!--retry)							\
-			break;							\
-	}									\
+#define CMPXCHG_LOOP(CODE, SUCCESS) do {						\
+	struct lockref old;								\
+	BUILD_BUG_ON(sizeof(old) != 8);							\
+	old.lock_count = READ_ONCE(lockref->lock_count);				\
+	for (;;) {									\
+		struct lockref new = old;						\
+											\
+		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
+			CODE								\
+			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
+							 &old.lock_count,		\
+							 new.lock_count))) {		\
+				SUCCESS;						\
+			}								\
+		} else {								\
+			cpu_relax();							\
+			old.lock_count = READ_ONCE(lockref->lock_count);		\
+		}									\
+	}										\
 } while (0)
 
 #else
@@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
 	,
 		return;
 	);
-
-	spin_lock(&lockref->lock);
-	lockref->count++;
-	spin_unlock(&lockref->lock);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 }
 EXPORT_SYMBOL(lockref_get);
 
@@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
  */
 int lockref_get_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count <= 0)
@@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_zero);
 
@@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
  */
 int lockref_put_not_zero(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count--;
 		if (old.count <= 1)
@@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count > 1) {
-		lockref->count--;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_put_not_zero);
 
@@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
 	,
 		return new.count;
 	);
+	/* should never get here */
+	WARN_ON_ONCE(1);
 	return -1;
 }
 EXPORT_SYMBOL(lockref_put_return);
@@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
  */
 int lockref_get_not_dead(struct lockref *lockref)
 {
-	int retval;
-
 	CMPXCHG_LOOP(
 		new.count++;
 		if (old.count < 0)
@@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
 	,
 		return 1;
 	);
-
-	spin_lock(&lockref->lock);
-	retval = 0;
-	if (lockref->count >= 0) {
-		lockref->count++;
-		retval = 1;
-	}
-	spin_unlock(&lockref->lock);
-	return retval;
+	/* should never get here */
+	WARN_ON_ONCE(1);
+	return -1;
 }
 EXPORT_SYMBOL(lockref_get_not_dead);

-- 
2.45.2



* [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too
  2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
                   ` (2 preceding siblings ...)
  2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton
@ 2024-08-02 21:45 ` Jeff Layton
  2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-02 21:45 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton
  Cc: Josef Bacik, linux-fsdevel, linux-kernel, Jeff Layton

Today, when opening a file we'll typically do a fast lookup, but if
O_CREAT is set, the kernel always takes the parent's exclusive inode
lock. I'm sure this was done with the expectation that O_CREAT means we
expect to do the create, but that's often not the case: many programs
set O_CREAT even when the file already exists.

This patch rearranges the pathwalk-for-open code to also attempt a fast
lookup (lookup_fast) in the O_CREAT case. Always do the fast lookup
unless O_EXCL is set, and when a positive dentry is found in the
O_CREAT codepath, return it without taking the inode_lock.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/namei.c | 43 ++++++++++++++++++++++++++++++++++++-------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b9bdb8e6214a..1793ed090314 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3538,7 +3538,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 	struct dentry *dir = nd->path.dentry;
 	int open_flag = op->open_flag;
 	bool got_write = false;
-	struct dentry *dentry;
+	struct dentry *dentry = NULL;
 	const char *res;
 
 	nd->flags |= op->intent;
@@ -3549,28 +3549,57 @@ static const char *open_last_lookups(struct nameidata *nd,
 		return handle_dots(nd, nd->last_type);
 	}
 
-	if (!(open_flag & O_CREAT)) {
-		if (nd->last.name[nd->last.len])
+	/*
+	 * We _can_ be in RCU mode here. For everything but O_EXCL case, do a
+	 * fast lookup for the dentry first. For O_CREAT case, we are only
+	 * interested in positive dentries. If nothing suitable is found,
+	 * fall back to locked codepath.
+	 */
+	if ((open_flag & (O_CREAT | O_EXCL)) != (O_CREAT | O_EXCL)) {
+		/* Trailing slashes? */
+		if (unlikely(nd->last.name[nd->last.len]))
 			nd->flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
-		/* we _can_ be in RCU mode here */
+
 		dentry = lookup_fast(nd);
 		if (IS_ERR(dentry))
 			return ERR_CAST(dentry);
+	}
+
+	if (!(open_flag & O_CREAT)) {
 		if (likely(dentry))
 			goto finish_lookup;
 
 		if (WARN_ON_ONCE(nd->flags & LOOKUP_RCU))
 			return ERR_PTR(-ECHILD);
 	} else {
-		/* create side of things */
+		/* If negative dentry was found earlier,
+		 * discard it as we'll need to use the slow path anyway.
+		 */
 		if (nd->flags & LOOKUP_RCU) {
-			if (!try_to_unlazy(nd))
+			bool unlazied;
+
+			/* discard negative dentry if one was found */
+			if (dentry && !dentry->d_inode)
+				dentry = NULL;
+
+			unlazied = dentry ? try_to_unlazy_next(nd, dentry) :
+					    try_to_unlazy(nd);
+			if (!unlazied)
 				return ERR_PTR(-ECHILD);
+		} else if (dentry && !dentry->d_inode) {
+			/* discard negative dentry if one was found */
+			dput(dentry);
+			dentry = NULL;
 		}
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
+
 		/* trailing slashes? */
-		if (unlikely(nd->last.name[nd->last.len]))
+		if (unlikely(nd->last.name[nd->last.len])) {
+			dput(dentry);
 			return ERR_PTR(-EISDIR);
+		}
+		if (dentry)
+			goto finish_lookup;
 	}
 
 	if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) {

-- 
2.45.2



* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton
@ 2024-08-03  4:44   ` Mateusz Guzik
  2024-08-03  9:09     ` Mateusz Guzik
  2024-08-03 10:55     ` Jeff Layton
  0 siblings, 2 replies; 16+ messages in thread
From: Mateusz Guzik @ 2024-08-03  4:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> In a later patch, we want to change the open(..., O_CREAT) codepath to
> avoid taking the inode->i_rwsem for write when the dentry already exists.
> When we tested that initially, the performance devolved significantly
> due to contention for the parent's d_lockref spinlock.
> 
> There are two problems with lockrefs today: First, once any concurrent
> task takes the spinlock, they all end up taking the spinlock, which is
> much more costly than a single cmpxchg operation. The second problem is
> that once any task fails to cmpxchg 100 times, it falls back to the
> spinlock. The upshot there is that even moderate contention can cause a
> fallback to serialized spinlocking, which worsens performance.
> 
> This patch changes CMPXCHG_LOOP in 2 ways:
> 
> First, change the loop to spin instead of falling back to a locked
> codepath when the spinlock is held. Once the lock is released, allow the
> task to continue trying its cmpxchg loop as before instead of taking the
> lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> Just continue infinitely.
> 
> This greatly reduces contention on the lockref when there are large
> numbers of concurrent increments and decrements occurring.
> 

This was already tried by me and it unfortunately can reduce performance.

Key problem is that in some corner cases the lock can be continuously
held and be queued on, making the fast path always fail and making all
the spins actively waste time (and notably pull on the cacheline).

See this for more details:
https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/

However, I *suspect* in the case you are optimizing here (open + O_CREAT
of an existing file) lockref on the parent can be avoided altogether
with some hackery and that's what should be done here.

When it comes to lockref in vfs in general, most uses can be elided with
some hackery (see the above thread) which is in early WIP (the LSMs are
a massive headache).

For open calls which *do* need to take a real ref the hackery does not
help of course.

This is where I think decoupling ref from the lock is the best way
forward. For that to work the dentry must hang around after the last
unref (already done thanks to RCU and dput even explicitly handles that
already!) and there needs to be a way to block new refs atomically --
can be done with cmpxchg from a 0-ref state to a flag blocking new refs
coming in. I have that as a WIP as well.
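
Very roughly, I mean something like the sketch below (hand-wavy and not
the actual WIP; the struct, function names and flag encoding are made
up for illustration):

-------------------8<----------------------
/*
 * Sketch of a refcount decoupled from the lock: refs are taken with
 * cmpxchg, and new refs can be blocked atomically by installing a
 * flag from the 0-ref state. Names/encoding are illustrative only.
 */
#include <linux/atomic.h>
#include <linux/types.h>

#define REF_BLOCKED	(1 << 30)	/* no new refs may be taken */

struct dref {
	atomic_t count;
};

/* Take a reference unless new refs have been blocked. */
static inline bool dref_get_unless_blocked(struct dref *r)
{
	int old = atomic_read(&r->count);

	do {
		if (old & REF_BLOCKED)
			return false;
	} while (!atomic_try_cmpxchg(&r->count, &old, old + 1));
	return true;
}

/* From the 0-ref state, atomically block new refs from coming in. */
static inline bool dref_block_new_refs(struct dref *r)
{
	int expected = 0;

	return atomic_try_cmpxchg(&r->count, &expected, REF_BLOCKED);
}
-------------------8<----------------------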


> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
>  1 file changed, 32 insertions(+), 53 deletions(-)
> 
> diff --git a/lib/lockref.c b/lib/lockref.c
> index 2afe4c5d8919..b76941043fe9 100644
> --- a/lib/lockref.c
> +++ b/lib/lockref.c
> @@ -8,22 +8,25 @@
>   * Note that the "cmpxchg()" reloads the "old" value for the
>   * failure case.
>   */
> -#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
> -	int retry = 100;							\
> -	struct lockref old;							\
> -	BUILD_BUG_ON(sizeof(old) != 8);						\
> -	old.lock_count = READ_ONCE(lockref->lock_count);			\
> -	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {  	\
> -		struct lockref new = old;					\
> -		CODE								\
> -		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
> -						 &old.lock_count,		\
> -						 new.lock_count))) {		\
> -			SUCCESS;						\
> -		}								\
> -		if (!--retry)							\
> -			break;							\
> -	}									\
> +#define CMPXCHG_LOOP(CODE, SUCCESS) do {						\
> +	struct lockref old;								\
> +	BUILD_BUG_ON(sizeof(old) != 8);							\
> +	old.lock_count = READ_ONCE(lockref->lock_count);				\
> +	for (;;) {									\
> +		struct lockref new = old;						\
> +											\
> +		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
> +			CODE								\
> +			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
> +							 &old.lock_count,		\
> +							 new.lock_count))) {		\
> +				SUCCESS;						\
> +			}								\
> +		} else {								\
> +			cpu_relax();							\
> +			old.lock_count = READ_ONCE(lockref->lock_count);		\
> +		}									\
> +	}										\
>  } while (0)
>  
>  #else
> @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
>  	,
>  		return;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	lockref->count++;
> -	spin_unlock(&lockref->lock);
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
>  }
>  EXPORT_SYMBOL(lockref_get);
>  
> @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
>   */
>  int lockref_get_not_zero(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count++;
>  		if (old.count <= 0)
> @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
>  	,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count > 0) {
> -		lockref->count++;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_get_not_zero);
>  
> @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
>   */
>  int lockref_put_not_zero(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count--;
>  		if (old.count <= 1)
> @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
>  	,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count > 1) {
> -		lockref->count--;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_put_not_zero);
>  
> @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
>  	,
>  		return new.count;
>  	);
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
>  	return -1;
>  }
>  EXPORT_SYMBOL(lockref_put_return);
> @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
>   */
>  int lockref_get_not_dead(struct lockref *lockref)
>  {
> -	int retval;
> -
>  	CMPXCHG_LOOP(
>  		new.count++;
>  		if (old.count < 0)
> @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
>  	,
>  		return 1;
>  	);
> -
> -	spin_lock(&lockref->lock);
> -	retval = 0;
> -	if (lockref->count >= 0) {
> -		lockref->count++;
> -		retval = 1;
> -	}
> -	spin_unlock(&lockref->lock);
> -	return retval;
> +	/* should never get here */
> +	WARN_ON_ONCE(1);
> +	return -1;
>  }
>  EXPORT_SYMBOL(lockref_get_not_dead);
> 
> -- 
> 2.45.2
> 


* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03  4:44   ` Mateusz Guzik
@ 2024-08-03  9:09     ` Mateusz Guzik
  2024-08-03 10:59       ` Jeff Layton
  2024-08-03 10:55     ` Jeff Layton
  1 sibling, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2024-08-03  9:09 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > When we tested that initially, the performance devolved significantly
> > due to contention for the parent's d_lockref spinlock.
> >
> > There are two problems with lockrefs today: First, once any concurrent
> > task takes the spinlock, they all end up taking the spinlock, which is
> > much more costly than a single cmpxchg operation. The second problem is
> > that once any task fails to cmpxchg 100 times, it falls back to the
> > spinlock. The upshot there is that even moderate contention can cause a
> > fallback to serialized spinlocking, which worsens performance.
> >
> > This patch changes CMPXCHG_LOOP in 2 ways:
> >
> > First, change the loop to spin instead of falling back to a locked
> > codepath when the spinlock is held. Once the lock is released, allow the
> > task to continue trying its cmpxchg loop as before instead of taking the
> > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > Just continue infinitely.
> >
> > This greatly reduces contention on the lockref when there are large
> > numbers of concurrent increments and decrements occurring.
> >
>
> This was already tried by me and it unfortunately can reduce performance.
>

Oh wait I misread the patch based on what I tried there. Spinning
indefinitely waiting for the lock to be free is a no-go as it loses
the forward progress guarantee (and it is possible to get the lock
being continuously held). Only spinning up to an arbitrary point wins
some in some tests and loses in others.

Either way, as described below, chances are decent that:
1. there is an easy way to not lockref_get/put on the parent if the
file is already there, dodging the problem
.. and even if that's not true
2. lockref can be ditched in favor of atomics. apart from some minor
refactoring this all looks perfectly doable and I have a wip. I will
try to find the time next week to sort it out

> Key problem is that in some corner cases the lock can be continuously
> held and be queued on, making the fast path always fail and making all
> the spins actively waste time (and notably pull on the cacheline).
>
> See this for more details:
> https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
>
> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
>
> When it comes to lockref in vfs in general, most uses can be elided with
> some hackery (see the above thread) which is in early WIP (the LSMs are
> a massive headache).
>
> For open calls which *do* need to take a real ref the hackery does not
> help of course.
>
> This is where I think decoupling ref from the lock is the best way
> forward. For that to work the dentry must hang around after the last
> unref (already done thanks to RCU and dput even explicitly handles that
> already!) and there needs to be a way to block new refs atomically --
> can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> coming in. I have that as a WIP as well.
>
>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> >  1 file changed, 32 insertions(+), 53 deletions(-)
> >
> > diff --git a/lib/lockref.c b/lib/lockref.c
> > index 2afe4c5d8919..b76941043fe9 100644
> > --- a/lib/lockref.c
> > +++ b/lib/lockref.c
> > @@ -8,22 +8,25 @@
> >   * Note that the "cmpxchg()" reloads the "old" value for the
> >   * failure case.
> >   */
> > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                     \
> > -     int retry = 100;                                                        \
> > -     struct lockref old;                                                     \
> > -     BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > -     old.lock_count = READ_ONCE(lockref->lock_count);                        \
> > -     while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
> > -             struct lockref new = old;                                       \
> > -             CODE                                                            \
> > -             if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > -                                              &old.lock_count,               \
> > -                                              new.lock_count))) {            \
> > -                     SUCCESS;                                                \
> > -             }                                                               \
> > -             if (!--retry)                                                   \
> > -                     break;                                                  \
> > -     }                                                                       \
> > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                             \
> > +     struct lockref old;                                                             \
> > +     BUILD_BUG_ON(sizeof(old) != 8);                                                 \
> > +     old.lock_count = READ_ONCE(lockref->lock_count);                                \
> > +     for (;;) {                                                                      \
> > +             struct lockref new = old;                                               \
> > +                                                                                     \
> > +             if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {        \
> > +                     CODE                                                            \
> > +                     if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > +                                                      &old.lock_count,               \
> > +                                                      new.lock_count))) {            \
> > +                             SUCCESS;                                                \
> > +                     }                                                               \
> > +             } else {                                                                \
> > +                     cpu_relax();                                                    \
> > +                     old.lock_count = READ_ONCE(lockref->lock_count);                \
> > +             }                                                                       \
> > +     }                                                                               \
> >  } while (0)
> >
> >  #else
> > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> >       ,
> >               return;
> >       );
> > -
> > -     spin_lock(&lockref->lock);
> > -     lockref->count++;
> > -     spin_unlock(&lockref->lock);
> > +     /* should never get here */
> > +     WARN_ON_ONCE(1);
> >  }
> >  EXPORT_SYMBOL(lockref_get);
> >
> > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> >   */
> >  int lockref_get_not_zero(struct lockref *lockref)
> >  {
> > -     int retval;
> > -
> >       CMPXCHG_LOOP(
> >               new.count++;
> >               if (old.count <= 0)
> > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> >       ,
> >               return 1;
> >       );
> > -
> > -     spin_lock(&lockref->lock);
> > -     retval = 0;
> > -     if (lockref->count > 0) {
> > -             lockref->count++;
> > -             retval = 1;
> > -     }
> > -     spin_unlock(&lockref->lock);
> > -     return retval;
> > +     /* should never get here */
> > +     WARN_ON_ONCE(1);
> > +     return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_get_not_zero);
> >
> > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> >   */
> >  int lockref_put_not_zero(struct lockref *lockref)
> >  {
> > -     int retval;
> > -
> >       CMPXCHG_LOOP(
> >               new.count--;
> >               if (old.count <= 1)
> > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> >       ,
> >               return 1;
> >       );
> > -
> > -     spin_lock(&lockref->lock);
> > -     retval = 0;
> > -     if (lockref->count > 1) {
> > -             lockref->count--;
> > -             retval = 1;
> > -     }
> > -     spin_unlock(&lockref->lock);
> > -     return retval;
> > +     /* should never get here */
> > +     WARN_ON_ONCE(1);
> > +     return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_put_not_zero);
> >
> > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> >       ,
> >               return new.count;
> >       );
> > +     /* should never get here */
> > +     WARN_ON_ONCE(1);
> >       return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_put_return);
> > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> >   */
> >  int lockref_get_not_dead(struct lockref *lockref)
> >  {
> > -     int retval;
> > -
> >       CMPXCHG_LOOP(
> >               new.count++;
> >               if (old.count < 0)
> > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> >       ,
> >               return 1;
> >       );
> > -
> > -     spin_lock(&lockref->lock);
> > -     retval = 0;
> > -     if (lockref->count >= 0) {
> > -             lockref->count++;
> > -             retval = 1;
> > -     }
> > -     spin_unlock(&lockref->lock);
> > -     return retval;
> > +     /* should never get here */
> > +     WARN_ON_ONCE(1);
> > +     return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_get_not_dead);
> >
> > --
> > 2.45.2
> >



-- 
Mateusz Guzik <mjguzik gmail.com>


* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03  4:44   ` Mateusz Guzik
  2024-08-03  9:09     ` Mateusz Guzik
@ 2024-08-03 10:55     ` Jeff Layton
  1 sibling, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-03 10:55 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Sat, 2024-08-03 at 06:44 +0200, Mateusz Guzik wrote:
> On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > When we tested that initially, the performance devolved significantly
> > due to contention for the parent's d_lockref spinlock.
> > 
> > There are two problems with lockrefs today: First, once any concurrent
> > task takes the spinlock, they all end up taking the spinlock, which is
> > much more costly than a single cmpxchg operation. The second problem is
> > that once any task fails to cmpxchg 100 times, it falls back to the
> > spinlock. The upshot there is that even moderate contention can cause a
> > fallback to serialized spinlocking, which worsens performance.
> > 
> > This patch changes CMPXCHG_LOOP in 2 ways:
> > 
> > First, change the loop to spin instead of falling back to a locked
> > codepath when the spinlock is held. Once the lock is released, allow the
> > task to continue trying its cmpxchg loop as before instead of taking the
> > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > Just continue infinitely.
> > 
> > This greatly reduces contention on the lockref when there are large
> > numbers of concurrent increments and decrements occurring.
> > 
> 
> This was already tried by me and it unfortunately can reduce performance.
> 
> Key problem is that in some corner cases the lock can be continuously
> held and be queued on, making the fast path always fail and making all
> the spins actively waste time (and notably pull on the cacheline).
> 

The cacheline contention does seem like a real problem with this
approach.

> See this for more details:
> https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> 
> However, I *suspect* in the case you are optimizing here (open + O_CREAT
> of an existing file) lockref on the parent can be avoided altogether
> with some hackery and that's what should be done here.
> 

Unfortunately I don't think we can in this codepath:

-------------------8<----------------------
	if (!(open_flag & O_CREAT)) {                                        
		...
	} else {
        	/* create side of things */
                if (nd->flags & LOOKUP_RCU) {
                        if (!try_to_unlazy(nd))
                                return ERR_PTR(-ECHILD);
                }
                audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
                /* trailing slashes? */
                if (unlikely(nd->last.name[nd->last.len]))
                        return ERR_PTR(-EISDIR);
        }
-------------------8<----------------------

The problem here is the audit_inode call, which can do a GFP_KERNEL
allocation. We can't stay in RCU mode for that, and we need a reference
to "dir" (at least with the current way audit works).

> When it comes to lockref in vfs in general, most uses can be elided with
> some hackery (see the above thread) which is in early WIP (the LSMs are
> a massive headache).
> 
> For open calls which *do* need to take a real ref the hackery does not
> help of course.
> 
> This is where I think decoupling ref from the lock is the best way
> forward. For that to work the dentry must hang around after the last
> unref (already done thanks to RCU and dput even explicitly handles that
> already!) and there needs to be a way to block new refs atomically --
> can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> coming in. I have that as a WIP as well.
> 

These both sound very interesting. FWIW, Josef also started looking at
decoupling the refcount and lock, but I don't think he's gotten very
far yet.

I'm happy to help test some of this too if you get to that point. The
4th patch in this RFC series really amps up the contention for the
lockref once the i_rwsem isn't being touched.

> 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> >  1 file changed, 32 insertions(+), 53 deletions(-)
> > 
> > diff --git a/lib/lockref.c b/lib/lockref.c
> > index 2afe4c5d8919..b76941043fe9 100644
> > --- a/lib/lockref.c
> > +++ b/lib/lockref.c
> > @@ -8,22 +8,25 @@
> >   * Note that the "cmpxchg()" reloads the "old" value for the
> >   * failure case.
> >   */
> > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {					\
> > -	int retry = 100;							\
> > -	struct lockref old;							\
> > -	BUILD_BUG_ON(sizeof(old) != 8);						\
> > -	old.lock_count = READ_ONCE(lockref->lock_count);			\
> > -	while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {  	\
> > -		struct lockref new = old;					\
> > -		CODE								\
> > -		if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
> > -						 &old.lock_count,		\
> > -						 new.lock_count))) {		\
> > -			SUCCESS;						\
> > -		}								\
> > -		if (!--retry)							\
> > -			break;							\
> > -	}									\
> > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {						\
> > +	struct lockref old;								\
> > +	BUILD_BUG_ON(sizeof(old) != 8);							\
> > +	old.lock_count = READ_ONCE(lockref->lock_count);				\
> > +	for (;;) {									\
> > +		struct lockref new = old;						\
> > +											\
> > +		if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {	\
> > +			CODE								\
> > +			if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,		\
> > +							 &old.lock_count,		\
> > +							 new.lock_count))) {		\
> > +				SUCCESS;						\
> > +			}								\
> > +		} else {								\
> > +			cpu_relax();							\
> > +			old.lock_count = READ_ONCE(lockref->lock_count);		\
> > +		}									\
> > +	}										\
> >  } while (0)
> >  
> >  #else
> > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> >  	,
> >  		return;
> >  	);
> > -
> > -	spin_lock(&lockref->lock);
> > -	lockref->count++;
> > -	spin_unlock(&lockref->lock);
> > +	/* should never get here */
> > +	WARN_ON_ONCE(1);
> >  }
> >  EXPORT_SYMBOL(lockref_get);
> >  
> > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> >   */
> >  int lockref_get_not_zero(struct lockref *lockref)
> >  {
> > -	int retval;
> > -
> >  	CMPXCHG_LOOP(
> >  		new.count++;
> >  		if (old.count <= 0)
> > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> >  	,
> >  		return 1;
> >  	);
> > -
> > -	spin_lock(&lockref->lock);
> > -	retval = 0;
> > -	if (lockref->count > 0) {
> > -		lockref->count++;
> > -		retval = 1;
> > -	}
> > -	spin_unlock(&lockref->lock);
> > -	return retval;
> > +	/* should never get here */
> > +	WARN_ON_ONCE(1);
> > +	return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_get_not_zero);
> >  
> > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> >   */
> >  int lockref_put_not_zero(struct lockref *lockref)
> >  {
> > -	int retval;
> > -
> >  	CMPXCHG_LOOP(
> >  		new.count--;
> >  		if (old.count <= 1)
> > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> >  	,
> >  		return 1;
> >  	);
> > -
> > -	spin_lock(&lockref->lock);
> > -	retval = 0;
> > -	if (lockref->count > 1) {
> > -		lockref->count--;
> > -		retval = 1;
> > -	}
> > -	spin_unlock(&lockref->lock);
> > -	return retval;
> > +	/* should never get here */
> > +	WARN_ON_ONCE(1);
> > +	return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_put_not_zero);
> >  
> > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> >  	,
> >  		return new.count;
> >  	);
> > +	/* should never get here */
> > +	WARN_ON_ONCE(1);
> >  	return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_put_return);
> > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> >   */
> >  int lockref_get_not_dead(struct lockref *lockref)
> >  {
> > -	int retval;
> > -
> >  	CMPXCHG_LOOP(
> >  		new.count++;
> >  		if (old.count < 0)
> > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> >  	,
> >  		return 1;
> >  	);
> > -
> > -	spin_lock(&lockref->lock);
> > -	retval = 0;
> > -	if (lockref->count >= 0) {
> > -		lockref->count++;
> > -		retval = 1;
> > -	}
> > -	spin_unlock(&lockref->lock);
> > -	return retval;
> > +	/* should never get here */
> > +	WARN_ON_ONCE(1);
> > +	return -1;
> >  }
> >  EXPORT_SYMBOL(lockref_get_not_dead);
> > 
> > -- 
> > 2.45.2
> > 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03  9:09     ` Mateusz Guzik
@ 2024-08-03 10:59       ` Jeff Layton
  2024-08-03 11:21         ` Mateusz Guzik
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2024-08-03 10:59 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > 
> > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > When we tested that initially, the performance devolved significantly
> > > due to contention for the parent's d_lockref spinlock.
> > > 
> > > There are two problems with lockrefs today: First, once any concurrent
> > > task takes the spinlock, they all end up taking the spinlock, which is
> > > much more costly than a single cmpxchg operation. The second problem is
> > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > spinlock. The upshot there is that even moderate contention can cause a
> > > fallback to serialized spinlocking, which worsens performance.
> > > 
> > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > 
> > > First, change the loop to spin instead of falling back to a locked
> > > codepath when the spinlock is held. Once the lock is released, allow the
> > > task to continue trying its cmpxchg loop as before instead of taking the
> > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > Just continue infinitely.
> > > 
> > > This greatly reduces contention on the lockref when there are large
> > > numbers of concurrent increments and decrements occurring.
> > > 
> > 
> > This was already tried by me and it unfortunately can reduce performance.
> > 
> 
> Oh wait I misread the patch based on what I tried there. Spinning
> indefinitely waiting for the lock to be free is a no-go as it loses
> the forward progress guarantee (and it is possible to get the lock
> being continuously held). Only spinning up to an arbitrary point wins
> some in some tests and loses in others.
> 

I'm a little confused about the forward progress guarantee here. Does
that exist today at all? ISTM that falling back to spin_lock() after a
certain number of retries doesn't guarantee any forward progress. You
can still just end up spinning on the lock forever once that happens,
no?

> Either way, as described below, chances are decent that:
> 1. there is an easy way to not lockref_get/put on the parent if the
> file is already there, dodging the problem
> .. and even if that's not true
> 2. lockref can be ditched in favor of atomics. apart from some minor
> refactoring this all looks perfectly doable and I have a wip. I will
> try to find the time next week to sort it out
> 

Like I said in the earlier mail, I don't think we can stay in RCU mode
because of the audit_inode call. I'm definitely interested in your WIP
though!

> > Key problem is that in some corner cases the lock can be continuously
> > held and be queued on, making the fast path always fail and making all
> > the spins actively waste time (and notably pull on the cacheline).
> > 
> > See this for more details:
> > https://lore.kernel.org/oe-lkp/lv7ykdnn2nrci3orajf7ev64afxqdw2d65bcpu2mfaqbkvv4ke@hzxat7utjnvx/
> > 
> > However, I *suspect* in the case you are optimizing here (open + O_CREAT
> > of an existing file) lockref on the parent can be avoided altogether
> > with some hackery and that's what should be done here.
> > 
> > When it comes to lockref in vfs in general, most uses can be elided with
> > some hackery (see the above thread) which is in early WIP (the LSMs are
> > a massive headache).
> > 
> > For open calls which *do* need to take a real ref the hackery does not
> > help of course.
> > 
> > This is where I think decoupling ref from the lock is the best way
> > forward. For that to work the dentry must hang around after the last
> > unref (already done thanks to RCU and dput even explicitly handles that
> > already!) and there needs to be a way to block new refs atomically --
> > can be done with cmpxchg from a 0-ref state to a flag blocking new refs
> > coming in. I have that as a WIP as well.
> > 
> > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  lib/lockref.c | 85 ++++++++++++++++++++++-------------------------------------
> > >  1 file changed, 32 insertions(+), 53 deletions(-)
> > > 
> > > diff --git a/lib/lockref.c b/lib/lockref.c
> > > index 2afe4c5d8919..b76941043fe9 100644
> > > --- a/lib/lockref.c
> > > +++ b/lib/lockref.c
> > > @@ -8,22 +8,25 @@
> > >   * Note that the "cmpxchg()" reloads the "old" value for the
> > >   * failure case.
> > >   */
> > > -#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                     \
> > > -     int retry = 100;                                                        \
> > > -     struct lockref old;                                                     \
> > > -     BUILD_BUG_ON(sizeof(old) != 8);                                         \
> > > -     old.lock_count = READ_ONCE(lockref->lock_count);                        \
> > > -     while (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {     \
> > > -             struct lockref new = old;                                       \
> > > -             CODE                                                            \
> > > -             if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > -                                              &old.lock_count,               \
> > > -                                              new.lock_count))) {            \
> > > -                     SUCCESS;                                                \
> > > -             }                                                               \
> > > -             if (!--retry)                                                   \
> > > -                     break;                                                  \
> > > -     }                                                                       \
> > > +#define CMPXCHG_LOOP(CODE, SUCCESS) do {                                             \
> > > +     struct lockref old;                                                             \
> > > +     BUILD_BUG_ON(sizeof(old) != 8);                                                 \
> > > +     old.lock_count = READ_ONCE(lockref->lock_count);                                \
> > > +     for (;;) {                                                                      \
> > > +             struct lockref new = old;                                               \
> > > +                                                                                     \
> > > +             if (likely(arch_spin_value_unlocked(old.lock.rlock.raw_lock))) {        \
> > > +                     CODE                                                            \
> > > +                     if (likely(try_cmpxchg64_relaxed(&lockref->lock_count,          \
> > > +                                                      &old.lock_count,               \
> > > +                                                      new.lock_count))) {            \
> > > +                             SUCCESS;                                                \
> > > +                     }                                                               \
> > > +             } else {                                                                \
> > > +                     cpu_relax();                                                    \
> > > +                     old.lock_count = READ_ONCE(lockref->lock_count);                \
> > > +             }                                                                       \
> > > +     }                                                                               \
> > >  } while (0)
> > > 
> > >  #else
> > > @@ -46,10 +49,8 @@ void lockref_get(struct lockref *lockref)
> > >       ,
> > >               return;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     lockref->count++;
> > > -     spin_unlock(&lockref->lock);
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >  }
> > >  EXPORT_SYMBOL(lockref_get);
> > > 
> > > @@ -60,8 +61,6 @@ EXPORT_SYMBOL(lockref_get);
> > >   */
> > >  int lockref_get_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count <= 0)
> > > @@ -69,15 +68,9 @@ int lockref_get_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_zero);
> > > 
> > > @@ -88,8 +81,6 @@ EXPORT_SYMBOL(lockref_get_not_zero);
> > >   */
> > >  int lockref_put_not_zero(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count--;
> > >               if (old.count <= 1)
> > > @@ -97,15 +88,9 @@ int lockref_put_not_zero(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count > 1) {
> > > -             lockref->count--;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_not_zero);
> > > 
> > > @@ -125,6 +110,8 @@ int lockref_put_return(struct lockref *lockref)
> > >       ,
> > >               return new.count;
> > >       );
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > >       return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_put_return);
> > > @@ -171,8 +158,6 @@ EXPORT_SYMBOL(lockref_mark_dead);
> > >   */
> > >  int lockref_get_not_dead(struct lockref *lockref)
> > >  {
> > > -     int retval;
> > > -
> > >       CMPXCHG_LOOP(
> > >               new.count++;
> > >               if (old.count < 0)
> > > @@ -180,14 +165,8 @@ int lockref_get_not_dead(struct lockref *lockref)
> > >       ,
> > >               return 1;
> > >       );
> > > -
> > > -     spin_lock(&lockref->lock);
> > > -     retval = 0;
> > > -     if (lockref->count >= 0) {
> > > -             lockref->count++;
> > > -             retval = 1;
> > > -     }
> > > -     spin_unlock(&lockref->lock);
> > > -     return retval;
> > > +     /* should never get here */
> > > +     WARN_ON_ONCE(1);
> > > +     return -1;
> > >  }
> > >  EXPORT_SYMBOL(lockref_get_not_dead);
> > > 
> > > --
> > > 2.45.2
> > > 
> 
> 
> 

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03 10:59       ` Jeff Layton
@ 2024-08-03 11:21         ` Mateusz Guzik
  2024-08-03 11:32           ` Jeff Layton
  0 siblings, 1 reply; 16+ messages in thread
From: Mateusz Guzik @ 2024-08-03 11:21 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > When we tested that initially, the performance devolved significantly
> > > > due to contention for the parent's d_lockref spinlock.
> > > >
> > > > There are two problems with lockrefs today: First, once any concurrent
> > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > much more costly than a single cmpxchg operation. The second problem is
> > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > fallback to serialized spinlocking, which worsens performance.
> > > >
> > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > >
> > > > First, change the loop to spin instead of falling back to a locked
> > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > Just continue infinitely.
> > > >
> > > > This greatly reduces contention on the lockref when there are large
> > > > numbers of concurrent increments and decrements occurring.
> > > >
> > >
> > > This was already tried by me and it unfortunately can reduce performance.
> > >
> >
> > Oh wait I misread the patch based on what I tried there. Spinning
> > indefinitely waiting for the lock to be free is a no-go as it loses
> > the forward progress guarantee (and it is possible to get the lock
> > being continuously held). Only spinning up to an arbitrary point wins
> > some in some tests and loses in others.
> >
>
> I'm a little confused about the forward progress guarantee here. Does
> that exist today at all? ISTM that falling back to spin_lock() after a
> certain number of retries doesn't guarantee any forward progress. You
> can still just end up spinning on the lock forever once that happens,
> no?
>

There is the implicit assumption that everyone holds locks for a
finite time. I agree there are no guarantees otherwise if that's what
you meant.

In this case, since spinlocks are queued, a constant stream of lock
holders will make the lock appear taken indefinitely even if they all
hold it for a short period.

Stock lockref gives up on atomics immediately and is still guaranteed
to change the ref, since it queues up on the lock.

Lockref as proposed in this patch won't be able to make any progress as
long as that lock trading is taking place.

> > Either way, as described below, chances are decent that:
> > 1. there is an easy way to not lockref_get/put on the parent if the
> > file is already there, dodging the problem
> > .. and even if that's not true
> > 2. lockref can be ditched in favor of atomics. apart from some minor
> > refactoring this all looks perfectly doable and I have a wip. I will
> > try to find the time next week to sort it out
> >
>
> Like I said in the earlier mail, I don't think we can stay in RCU mode
> because of the audit_inode call. I'm definitely interested in your WIP
> though!
>

Well, audit may be hackable so that it works under RCU most of the
time, but that's not something I'm interested in.

Sorting out the lockref situation would definitely help other stuff
(notably opening the same file RO).

Anyhow, one idea is to temporarily disable atomic ops with a flag in
the counter; a fallback plan is to loosen lockref so that it can do
transitions other than 0->1->2 with atomics, even if the lock is held.
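
The flag idea would look roughly like this (purely hypothetical sketch,
the name and the bit are invented, none of this exists today):

/* hypothetical: reserve a high bit of lockref.count to mean "the
 * cmpxchg fast path is disabled, everyone must take the spinlock" */
#define LOCKREF_NO_FASTPATH     (1 << 30)

static inline bool lockref_fastpath_allowed(const struct lockref *old)
{
        return !(old->count & LOCKREF_NO_FASTPATH);
}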

I have not looked at this in over a month, so I'm going to need to
refresh my memory on the details; I do remember there was some stuff
to massage first.

Anyhow, I expect a working WIP some time in the upcoming week.

--
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03 11:21         ` Mateusz Guzik
@ 2024-08-03 11:32           ` Jeff Layton
  2024-08-05 11:44             ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2024-08-03 11:32 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Sat, 2024-08-03 at 13:21 +0200, Mateusz Guzik wrote:
> On Sat, Aug 3, 2024 at 12:59 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Sat, 2024-08-03 at 11:09 +0200, Mateusz Guzik wrote:
> > > On Sat, Aug 3, 2024 at 6:44 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > > 
> > > > On Fri, Aug 02, 2024 at 05:45:04PM -0400, Jeff Layton wrote:
> > > > > In a later patch, we want to change the open(..., O_CREAT) codepath to
> > > > > avoid taking the inode->i_rwsem for write when the dentry already exists.
> > > > > When we tested that initially, the performance devolved significantly
> > > > > due to contention for the parent's d_lockref spinlock.
> > > > > 
> > > > > There are two problems with lockrefs today: First, once any concurrent
> > > > > task takes the spinlock, they all end up taking the spinlock, which is
> > > > > much more costly than a single cmpxchg operation. The second problem is
> > > > > that once any task fails to cmpxchg 100 times, it falls back to the
> > > > > spinlock. The upshot there is that even moderate contention can cause a
> > > > > fallback to serialized spinlocking, which worsens performance.
> > > > > 
> > > > > This patch changes CMPXCHG_LOOP in 2 ways:
> > > > > 
> > > > > First, change the loop to spin instead of falling back to a locked
> > > > > codepath when the spinlock is held. Once the lock is released, allow the
> > > > > task to continue trying its cmpxchg loop as before instead of taking the
> > > > > lock. Second, don't allow the cmpxchg loop to give up after 100 retries.
> > > > > Just continue infinitely.
> > > > > 
> > > > > This greatly reduces contention on the lockref when there are large
> > > > > numbers of concurrent increments and decrements occurring.
> > > > > 
> > > > 
> > > > This was already tried by me and it unfortunately can reduce performance.
> > > > 
> > > 
> > > Oh wait I misread the patch based on what I tried there. Spinning
> > > indefinitely waiting for the lock to be free is a no-go as it loses
> > > the forward progress guarantee (and it is possible to get the lock
> > > being continuously held). Only spinning up to an arbitrary point wins
> > > some in some tests and loses in others.
> > > 
> > 
> > I'm a little confused about the forward progress guarantee here. Does
> > that exist today at all? ISTM that falling back to spin_lock() after a
> > certain number of retries doesn't guarantee any forward progress. You
> > can still just end up spinning on the lock forever once that happens,
> > no?
> > 
> 
> There is the implicit assumption that everyone holds locks for a
> finite time. I agree there are no guarantees otherwise if that's what
> you meant.
> 
> In this case, since spinlocks are queued, a constant stream of lock
> holders will make the lock appear taken indefinitely even if they all
> hold it for a short period.
> 
> Stock lockref will give up atomics immediately and make sure to change
> the ref thanks to queueing up.
> 
> Lockref as proposed in this patch wont be able to do anything as long
> as the lock trading is taking place.
> 

Got it, thanks. This spinning is very simplistic, so I can see how one
task could continually get shuffled to the back of the queue.

> > > Either way, as described below, chances are decent that:
> > > 1. there is an easy way to not lockref_get/put on the parent if the
> > > file is already there, dodging the problem
> > > .. and even if that's not true
> > > 2. lockref can be ditched in favor of atomics. apart from some minor
> > > refactoring this all looks perfectly doable and I have a wip. I will
> > > try to find the time next week to sort it out
> > > 
> > 
> > Like I said in the earlier mail, I don't think we can stay in RCU mode
> > because of the audit_inode call. I'm definitely interested in your WIP
> > though!
> > 
> 
> well audit may be hackable so that it works in rcu most of the time,
> but that's not something i'm interested in

Audit is not my favorite area of the kernel to work in either. I don't
see a good way to make it RCU-friendly, but I haven't looked too hard
yet either. It would be nice to be able to do some of the auditing
under RCU or a spinlock.

>
> sorting out the lockref situation would definitely help other stuff
> (notably opening the same file RO).
> 

Indeed. It's clear that the current implementation is a real
scalability problem in a lot of situations.

> anyhow one idea is to temporarily disable atomic ops with a flag in
> the counter, a fallback plan is to loosen lockref so that it can do
> transitions other than 0->1->2 with atomics, even if the lock is held.
> 
> I have not looked at this in over a month, I'm going to need to
> refresh my memory on the details, I do remember there was some stuff
> to massage first.
> 
> Anyhow, I expect a working WIP some time in the upcoming week.
> 

Great, I'll stay tuned.

Thanks!
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too
  2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
                   ` (3 preceding siblings ...)
  2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
@ 2024-08-05 10:46 ` Christian Brauner
  2024-08-05 11:55   ` Jeff Layton
  4 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2024-08-05 10:46 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik,
	linux-fsdevel, linux-kernel

>       fs: remove comment about d_rcu_to_refcount
>       fs: add a kerneldoc header over lookup_fast

I took both of these into vfs.misc since they're really unrelated cleanups.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-03 11:32           ` Jeff Layton
@ 2024-08-05 11:44             ` Christian Brauner
  2024-08-05 12:52               ` Jeff Layton
  0 siblings, 1 reply; 16+ messages in thread
From: Christian Brauner @ 2024-08-05 11:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

> Audit not my favorite area of the kernel to work in either. I don't see
> a good way to make it rcu-friendly, but I haven't looked too hard yet
> either. It would be nice to be able to do some of the auditing under
> rcu or spinlock.

For audit, your main option is to dodge the problem: check whether
audit is active and only drop out of RCU if it is. That sidesteps the
issue entirely. I'm somewhat certain that a lot of systems don't really
have audit active.

From a brief look at audit, it would be quite involved to make it work
under RCU alone. Not just because it does various allocations, but it
also reads fscaps from disk and so on. That's not going to work unless
we add a vfs-based fscaps cache similar to what we do for ACLs. I find
that very unlikely.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too
  2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner
@ 2024-08-05 11:55   ` Jeff Layton
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Layton @ 2024-08-05 11:55 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Andrew Morton, Josef Bacik,
	linux-fsdevel, linux-kernel

On Mon, 2024-08-05 at 12:46 +0200, Christian Brauner wrote:
> >       fs: remove comment about d_rcu_to_refcount
> >       fs: add a kerneldoc header over lookup_fast
> 
> I took both of these into vfs.misc since they're really unrelated cleanups.

Thanks! I was going to suggest that.
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-05 11:44             ` Christian Brauner
@ 2024-08-05 12:52               ` Jeff Layton
  2024-08-06 11:36                 ` Christian Brauner
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Layton @ 2024-08-05 12:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote:
> > Audit not my favorite area of the kernel to work in either. I don't see
> > a good way to make it rcu-friendly, but I haven't looked too hard yet
> > either. It would be nice to be able to do some of the auditing under
> > rcu or spinlock.
> 
> For audit your main option is to dodge the problem and check whether
> audit is active and only drop out of rcu if it is. That sidesteps the
> problem. I'm somewhat certain that a lot of systems don't really have
> audit active.
> 

I did have an earlier version of 4/4 that checked audit_context() and
stayed in RCU mode when it came back NULL. I can resurrect that if you
think it's worthwhile.

> From a brief look at audit it would be quite involved to make it work
> just under rcu. Not just because it does various allocation but it also
> reads fscaps from disk and so on. That's not going to work unless we add
> a vfs based fscaps cache similar to what we do for acls. I find that
> very unlikely. 

Yeah. It wants to record a lot of (variable-length) information at very
inconvenient times. I think we're sort of stuck with it though until
someone has a vision on how to do this in a non-blocking way.

Handwavy thought: there is some similarity to tracepoints in what
audit_inode does, and tracepoints can be called in all sorts of
contexts. I wonder if we could leverage the same infrastructure
somehow? The catch here is that we can't just drop audit records if
things go wrong.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better
  2024-08-05 12:52               ` Jeff Layton
@ 2024-08-06 11:36                 ` Christian Brauner
  0 siblings, 0 replies; 16+ messages in thread
From: Christian Brauner @ 2024-08-06 11:36 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Mateusz Guzik, Alexander Viro, Jan Kara, Andrew Morton,
	Josef Bacik, linux-fsdevel, linux-kernel

On Mon, Aug 05, 2024 at 08:52:28AM GMT, Jeff Layton wrote:
> On Mon, 2024-08-05 at 13:44 +0200, Christian Brauner wrote:
> > > Audit not my favorite area of the kernel to work in either. I don't see
> > > a good way to make it rcu-friendly, but I haven't looked too hard yet
> > > either. It would be nice to be able to do some of the auditing under
> > > rcu or spinlock.
> > 
> > For audit your main option is to dodge the problem and check whether
> > audit is active and only drop out of rcu if it is. That sidesteps the
> > problem. I'm somewhat certain that a lot of systems don't really have
> > audit active.
> > 
> 
> I did have an earlier version of 4/4 that checked audit_context() and
> stayed in RCU mode if it comes back NULL. I can resurrect that if you
> think it's worthwhile.

Let's at least see what it looks like. Maybe just use a helper local to
fs/namei.c that returns ECHILD if audit is available and 0 otherwise?
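
Something with this shape, maybe (the helper name is made up;
audit_context() is the accessor mentioned above):

/* fs/namei.c: return -ECHILD if this lookup may need to audit and so
 * cannot stay in RCU-walk, 0 if staying lazy is fine */
static int audit_forces_refwalk(void)
{
        if (audit_context())
                return -ECHILD;
        return 0;
}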

> > From a brief look at audit it would be quite involved to make it work
> > just under rcu. Not just because it does various allocation but it also
> > reads fscaps from disk and so on. That's not going to work unless we add
> > a vfs based fscaps cache similar to what we do for acls. I find that
> > very unlikely. 
> 
> Yeah. It wants to record a lot of (variable-length) information at very
> inconvenient times. I think we're sort of stuck with it though until
> someone has a vision on how to do this in a non-blocking way.
> 
> Handwavy thought: there is some similarity to tracepoints in what
> audit_inode does, and tracepoints are able to be called in all sorts of
> contexts. I wonder if we could leverage the same infrastructure
> somehow? The catch here is that we can't just drop audit records if
> things go wrong.

I can't say much about the tracepoint idea as I lack the necessary
details around their implementation.

I think the better way forward is a model with a fastpath and a
slowpath. Under RCU, audit_inode() returns -ECHILD if it sees that it
would need to do anything it couldn't do in a non-blocking way, and
then path lookup can drop out of RCU and call audit_inode() again.

I think this wouldn't be extremely terrible. It would amount to adding
an AUDIT_MAY_NOT_BLOCK flag to audit_inode(), and then on -ECHILD
audit_inode() gets called again without that flag.
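
In other words, roughly this shape (sketch only: AUDIT_MAY_NOT_BLOCK
does not exist and audit_inode() is a void function today, so this
assumes it grows a flag and a return value):

/* hypothetical caller in fs/namei.c */
static int audit_inode_maybe_lazy(struct nameidata *nd,
                                  struct dentry *dentry,
                                  unsigned int aflags)
{
        int error = audit_inode(nd->name, dentry,
                                aflags | AUDIT_MAY_NOT_BLOCK);

        if (error != -ECHILD)
                return error;

        /* slow path: leave RCU-walk, redo the audit without the flag */
        if (!try_to_unlazy(nd))
                return -ECHILD;
        audit_inode(nd->name, dentry, aflags);
        return 0;
}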

Over time, if people are interested, they could then make more and
more stuff available under RCU for audit.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2024-08-06 11:36 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-02 21:45 [PATCH RFC 0/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 1/4] fs: remove comment about d_rcu_to_refcount Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 2/4] fs: add a kerneldoc header over lookup_fast Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 3/4] lockref: rework CMPXCHG_LOOP to handle contention better Jeff Layton
2024-08-03  4:44   ` Mateusz Guzik
2024-08-03  9:09     ` Mateusz Guzik
2024-08-03 10:59       ` Jeff Layton
2024-08-03 11:21         ` Mateusz Guzik
2024-08-03 11:32           ` Jeff Layton
2024-08-05 11:44             ` Christian Brauner
2024-08-05 12:52               ` Jeff Layton
2024-08-06 11:36                 ` Christian Brauner
2024-08-03 10:55     ` Jeff Layton
2024-08-02 21:45 ` [PATCH RFC 4/4] fs: try an opportunistic lookup for O_CREAT opens too Jeff Layton
2024-08-05 10:46 ` [PATCH RFC 0/4] " Christian Brauner
2024-08-05 11:55   ` Jeff Layton
