linux-kernel.vger.kernel.org archive mirror
* [RFC next v2 0/5] ucount: add rlimit cache for ucount
@ 2025-05-09  7:20 Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 1/5] user_namespace: add children list node Chen Ridong
                   ` (7 more replies)
  0 siblings, 8 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

We have been running the will-it-scale test case signal1 [1], and the
results reveal that the signal-sending system call does not scale
linearly. To investigate this further, we ran a series of tests that
launched varying numbers of Docker containers and monitored the
throughput of each individual container. The detailed results are as
follows:

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |

The data shows a clear trend: as the number of containers increases,
the throughput per container progressively declines. Analysis
identified the root cause of this degradation: the ucounts code
accounts rlimit usage with atomic operations, and when many tasks act
on the same atomic variable, these operations trigger a large number
of cache misses or remote accesses, which ultimately causes the drop
in performance.

Notably, the problem persists even though a new user_namespace is
created when each container starts. This is because all of these
containers share the same parent node, so the rlimit accounting keeps
modifying the same atomic variable.

Currently, when a specific rlimit is incremented by 1 in a child user
namespace, the corresponding rlimit in the parent node must also be
incremented by 1. For example, if the ucounts of a task in docker B
are ucount_b_1, then after incrementing the rlimit of ucount_b_1 by 1,
the rlimit of the parent node, init_ucounts, must also be incremented
by 1, and every step must stay within the limits configured for the
user namespaces along the way. A simplified sketch of this charging
loop follows the diagram below.

	init_user_ns                             init_ucounts
	^                                              ^
	|                        |                     |
	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
	|                        |                     |
	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
					^
					|
					|
					|
					ucount_b_1
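
For reference, the charging path today looks roughly like the sketch
below (modelled on the existing inc_rlimit_get_ucounts(); the extra
ucounts reference handling and the unwind on failure are omitted).
Every signal sent in every container walks this chain and ends with an
atomic add on init_ucounts, which is the shared variable that all
containers contend on:

	/* Simplified sketch, not the actual kernel function. */
	static long rlimit_charge_sketch(struct ucounts *ucounts,
					 enum rlimit_type type)
	{
		struct ucounts *iter;
		long max = LONG_MAX;
		long ret = 0;

		/* Walk from the task's ucounts up to init_ucounts. */
		for (iter = ucounts; iter; iter = iter->ns->ucounts) {
			long new = atomic_long_add_return(1, &iter->rlimit[type]);

			if (new < 0 || new > max)
				return 0;	/* over the limit; the caller unwinds */
			if (iter == ucounts)
				ret = new;
			max = get_userns_rlimit_max(iter->ns, type);
		}
		return ret;
	}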

What is expected is that containers operating in separate namespaces
remain isolated and do not interfere with one another. Regrettably,
the current signal system call fails to provide this level of
isolation.

Proposal:

To address the issue described above, this series proposes a cache
for each namespace's rlimit. With a cache attached to each user
namespace's rlimit, a batch of rlimit charges can be allocated to a
namespace in one go. When resources are plentiful, freed charges do
not have to be returned to the parent node immediately, and as long
as a user namespace still has charges available in its cache, it does
not need to request more from the parent node. A simplified sketch of
the cached fast path follows the diagram below.

	init_user_ns                             init_ucounts
	^                                              ^
	|                        |                     |
	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
	|                        |                     |
	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
			^		^
			|		|
			cache_rlimit--->|
					|
					ucount_b_1
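
The fast path of the cached scheme then looks roughly like the sketch
below (a simplification of the code added in patch 5/5; the names
rlimit_cache, rlimit_charge_cache, UCOUNT_BATCH_SIZE and the try_cache
label come from that patch, and the per-task limit handling and the
unwind paths are omitted here):

	/* Simplified sketch of the cached fast path; see patch 5/5. */
	if (ns != &init_user_ns) {
		if (atomic_dec_return(&ns->rlimit_cache[type]) >= 0)
			/* A pre-charged slot exists: only this ns is touched. */
			return __inc_rlimit_get_ucounts(ucounts, type, 1);

		/* Cache empty: undo the dec, then refill a whole batch upward. */
		atomic_inc(&ns->rlimit_cache[type]);
		if (rlimit_charge_cache(ucounts, type))	/* charges UCOUNT_BATCH_SIZE */
			goto try_cache;			/* retry the fast path */
	}
	/* Otherwise fall back to walking up to init_ucounts as before. */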


The ultimate objective of this solution is to achieve complete isolation
among namespaces. After applying this patch set, the final test results
indicate that in the signal1 test case, the performance does not
deteriorate as the number of containers increases. This effectively meets
the goal of linear scalability.

	| Dockers     |1      |4      |8      |16     |32     |64     |
	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |

Challenges:

When checking the pending signals in the parent node with
'cat /proc/self/status | grep SigQ', the reported value includes the
signal counts cached by the child nodes. As a result, the SigQ value
in the parent node no longer reflects the actual number of pending
signals accurately and instantaneously.

	# cat /proc/self/status | grep SigQ
	SigQ:	16/6187667

TODO:

Add cache for the other rlimits.

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/

Chen Ridong (5):
  user_namespace: add children list node
  usernamespace: make usernamespace rcu safe
  user_namespace: add user_ns iteration helper
  uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
  ucount: add rlimit cache for ucount

 include/linux/user_namespace.h |  23 ++++-
 kernel/signal.c                |   2 +-
 kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
 kernel/user.c                  |   2 +
 kernel/user_namespace.c        |  60 ++++++++++-
 5 files changed, 243 insertions(+), 25 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [RFC next v2 1/5] user_namespace: add children list node
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
@ 2025-05-09  7:20 ` Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 2/5] usernamespace: make usernamespace rcu safe Chen Ridong
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

From: Chen Ridong <chenridong@huawei.com>

Add the 'children' and 'ns_node' fields to the user_namespace structure.
This addition enables the user_namespace to locate all of its nested
child namespaces efficiently.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h | 2 ++
 kernel/user.c                  | 2 ++
 kernel/user_namespace.c        | 4 ++++
 3 files changed, 8 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index a0bb6d012137..7b1e180227c8 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -78,6 +78,8 @@ struct user_namespace {
 	struct uid_gid_map	gid_map;
 	struct uid_gid_map	projid_map;
 	struct user_namespace	*parent;
+	struct list_head	ns_node;
+	struct list_head	children;
 	int			level;
 	kuid_t			owner;
 	kgid_t			group;
diff --git a/kernel/user.c b/kernel/user.c
index f46b1d41163b..3a712a6894fd 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -65,6 +65,8 @@ struct user_namespace init_user_ns = {
 			.nr_extents = 1,
 		},
 	},
+	.ns_node = LIST_HEAD_INIT(init_user_ns.ns_node),
+	.children = LIST_HEAD_INIT(init_user_ns.children),
 	.ns.count = REFCOUNT_INIT(3),
 	.owner = GLOBAL_ROOT_UID,
 	.group = GLOBAL_ROOT_GID,
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 682f40d5632d..b570536934cc 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -135,6 +135,9 @@ int create_user_ns(struct cred *new)
 	ns->level = parent_ns->level + 1;
 	ns->owner = owner;
 	ns->group = group;
+	INIT_LIST_HEAD(&ns->children);
+	INIT_LIST_HEAD(&ns->ns_node);
+	list_add_tail_rcu(&ns->ns_node, &parent_ns->children);
 	INIT_WORK(&ns->work, free_user_ns);
 	for (i = 0; i < UCOUNT_COUNTS; i++) {
 		ns->ucount_max[i] = INT_MAX;
@@ -217,6 +220,7 @@ static void free_user_ns(struct work_struct *work)
 		kfree(ns->binfmt_misc);
 #endif
 		retire_userns_sysctls(ns);
+		list_del_rcu(&ns->ns_node);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
 		kmem_cache_free(user_ns_cachep, ns);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC next v2 2/5] usernamespace: make usernamespace rcu safe
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 1/5] user_namespace: add children list node Chen Ridong
@ 2025-05-09  7:20 ` Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 3/5] user_namespace: add user_ns iteration helper Chen Ridong
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

From: Chen Ridong <chenridong@huawei.com>

To ensure a safe top-down iteration, the user namespace should be made
RCU safe. This way, it is safe to iterate over all the child namespaces
of a root namespace while holding an RCU read lock.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  1 +
 kernel/user_namespace.c        | 12 ++++++++++--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 7b1e180227c8..d84b2703caab 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -80,6 +80,7 @@ struct user_namespace {
 	struct user_namespace	*parent;
 	struct list_head	ns_node;
 	struct list_head	children;
+	struct rcu_head		rcu;
 	int			level;
 	kuid_t			owner;
 	kgid_t			group;
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index b570536934cc..cbe8f96c3e60 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -196,6 +196,15 @@ int unshare_userns(unsigned long unshare_flags, struct cred **new_cred)
 	return err;
 }
 
+static void __free_user_ns(struct rcu_head *p)
+{
+	struct user_namespace *ns =
+		container_of(p, struct user_namespace, rcu);
+
+	list_del_rcu(&ns->ns_node);
+	kmem_cache_free(user_ns_cachep, ns);
+}
+
 static void free_user_ns(struct work_struct *work)
 {
 	struct user_namespace *parent, *ns =
@@ -220,10 +229,9 @@ static void free_user_ns(struct work_struct *work)
 		kfree(ns->binfmt_misc);
 #endif
 		retire_userns_sysctls(ns);
-		list_del_rcu(&ns->ns_node);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
-		kmem_cache_free(user_ns_cachep, ns);
+		call_rcu(&ns->rcu, __free_user_ns);
 		dec_user_namespaces(ucounts);
 		ns = parent;
 	} while (refcount_dec_and_test(&parent->ns.count));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC next v2 3/5] user_namespace: add user_ns iteration helper
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 1/5] user_namespace: add children list node Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 2/5] usernamespace: make usernamespace rcu safe Chen Ridong
@ 2025-05-09  7:20 ` Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 4/5] uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts Chen Ridong
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

From: Chen Ridong <chenridong@huawei.com>

Add a helper function named 'ns_next_child_pre' that performs
a pre-order traversal of a namespace's descendants.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  9 +++++++
 kernel/user_namespace.c        | 44 ++++++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index d84b2703caab..823df9267a4a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -169,6 +169,15 @@ static inline void set_userns_rlimit_max(struct user_namespace *ns,
 	ns->rlimit_max[type] = max <= LONG_MAX ? max : LONG_MAX;
 }
 
+struct user_namespace *ns_next_child(struct user_namespace *pos,
+					   struct user_namespace *parent);
+struct user_namespace *ns_next_child_pre(struct user_namespace *pos,
+						    struct user_namespace *root);
+
+#define ns_for_each_child_pre(pos, ns)				\
+	for ((pos) = ns_next_child_pre(NULL, (ns)); (pos);	\
+	     (pos) = ns_next_child_pre((pos), (ns)))
+
 #ifdef CONFIG_USER_NS
 
 static inline struct user_namespace *get_user_ns(struct user_namespace *ns)
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index cbe8f96c3e60..9a2e77505b97 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -30,6 +30,50 @@ static bool new_idmap_permitted(const struct file *file,
 				struct uid_gid_map *map);
 static void free_user_ns(struct work_struct *work);
 
+struct user_namespace *ns_next_child(struct user_namespace *pos,
+					   struct user_namespace *parent)
+{
+	struct user_namespace *next;
+
+	if (!pos)
+		/* Get the first child of the parent. */
+		next = list_entry_rcu(parent->children.next, struct user_namespace, ns_node);
+	else
+		next = list_entry_rcu(pos->ns_node.next, struct user_namespace, ns_node);
+
+	if (&next->ns_node != &parent->children)
+		return next;
+
+	return NULL;
+}
+
+/* Should be called under rcu_read_lock() */
+struct user_namespace *ns_next_child_pre(struct user_namespace *pos,
+						    struct user_namespace *root)
+{
+	struct user_namespace *next;
+
+
+	/* if first iteration, visit @root */
+	if (!pos)
+		return root;
+
+	/* visit the first child if exists */
+	next = ns_next_child(NULL, pos);
+	if (next)
+		return next;
+
+	/* no child, visit my or the closest ancestor's next ns_node */
+	while (pos != root) {
+		next = ns_next_child(pos, pos->parent);
+		if (next)
+			return next;
+		pos = pos->parent;
+	}
+
+	return NULL;
+}
+
 static struct ucounts *inc_user_namespaces(struct user_namespace *ns, kuid_t uid)
 {
 	return inc_ucount(ns, uid, UCOUNT_USER_NAMESPACES);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC next v2 4/5] uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (2 preceding siblings ...)
  2025-05-09  7:20 ` [RFC next v2 3/5] user_namespace: add user_ns iteration helper Chen Ridong
@ 2025-05-09  7:20 ` Chen Ridong
  2025-05-09  7:20 ` [RFC next v2 5/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

From: Chen Ridong <chenridong@huawei.com>

Factor out the __inc_rlimit_get_ucounts helper. It increments the
rlimit by a variable amount and acquires an additional ucounts
reference when the rlimit count was previously zero.

Correspondingly, factor out the __dec_rlimit_put_ucounts helper, which
releases the ucounts reference when the rlimit count drops back to
zero.

These helpers make the code more concise and serve as a foundation for
the subsequent patches.

Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 kernel/ucount.c | 56 +++++++++++++++++++++++++++++++------------------
 1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 8686e329b8f2..33605e416724 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -276,22 +276,46 @@ bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
 	return (new == 0);
 }
 
+static void __dec_rlimit_put_ucounts(struct ucounts *ucounts,
+				enum rlimit_type type, long v)
+{
+	long dec = atomic_long_sub_return(v, &ucounts->rlimit[type]);
+
+	WARN_ON_ONCE(dec < 0);
+	if (dec == 0)
+		put_ucounts(ucounts);
+}
+
+static long __inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v)
+{
+	long new = atomic_long_add_return(v, &ucounts->rlimit[type]);
+
+	/*
+	 * Grab an extra ucount reference for the caller when
+	 * the rlimit count was previously 0.
+	 */
+	if (new == v && !get_ucounts(ucounts)) {
+		long dec = atomic_long_sub_return(v, &ucounts->rlimit[type]);
+
+		WARN_ON_ONCE(dec < 0);
+		return 0;
+	}
+	return new;
+}
+
 static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
-				struct ucounts *last, enum rlimit_type type)
+				struct ucounts *last, enum rlimit_type type, long v)
 {
 	struct ucounts *iter, *next;
 	for (iter = ucounts; iter != last; iter = next) {
-		long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-		WARN_ON_ONCE(dec < 0);
 		next = iter->ns->ucounts;
-		if (dec == 0)
-			put_ucounts(iter);
+		__dec_rlimit_put_ucounts(iter, type, v);
 	}
 }
 
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 {
-	do_dec_rlimit_put_ucounts(ucounts, NULL, type);
+	do_dec_rlimit_put_ucounts(ucounts, NULL, type, 1);
 }
 
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
@@ -300,30 +324,22 @@ long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
 	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
-	long dec, ret = 0;
+	long ret = 0;
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = atomic_long_add_return(1, &iter->rlimit[type]);
-		if (new < 0 || new > max)
+		long new = __inc_rlimit_get_ucounts(iter, type, 1);
+
+		if (new <= 0 || new > max)
 			goto dec_unwind;
 		if (iter == ucounts)
 			ret = new;
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
-		/*
-		 * Grab an extra ucount reference for the caller when
-		 * the rlimit count was previously 0.
-		 */
-		if (new != 1)
-			continue;
-		if (!get_ucounts(iter))
-			goto dec_unwind;
 	}
 	return ret;
+
 dec_unwind:
-	dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-	WARN_ON_ONCE(dec < 0);
-	do_dec_rlimit_put_ucounts(ucounts, iter, type);
+	do_dec_rlimit_put_ucounts(ucounts, iter, type, 1);
 	return 0;
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [RFC next v2 5/5] ucount: add rlimit cache for ucount
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (3 preceding siblings ...)
  2025-05-09  7:20 ` [RFC next v2 4/5] uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts Chen Ridong
@ 2025-05-09  7:20 ` Chen Ridong
  2025-05-09 20:18 ` [RFC next v2 0/5] " Andrew Morton
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-09  7:20 UTC (permalink / raw)
  To: akpm, paulmck, bigeasy, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees
  Cc: linux-kernel, lujialin4, chenridong

From: Chen Ridong <chenridong@huawei.com>

We have been running the will-it-scale test case signal1 [1], and the
results reveal that the signal-sending system call does not scale
linearly. To investigate this further, we ran a series of tests that
launched varying numbers of Docker containers and monitored the
throughput of each individual container. The detailed results are as
follows:

  | Dockers 	|1	|4	|8	|16	|32	|64	|
  | Throughput 	|380068	|353204	|308948	|306453	|180659	|129152	|

The data shows a clear trend: as the number of containers increases,
the throughput per container progressively declines. Analysis
identified the root cause of this degradation: the ucounts code
accounts rlimit usage with atomic operations, and when many tasks act
on the same atomic variable, these operations trigger a large number
of cache misses or remote accesses, which ultimately causes the drop
in performance.

Notably, the problem persists even though a new user_namespace is
created when each container starts. This is because all of these
containers share the same parent node, so the rlimit accounting keeps
modifying the same atomic variable.

Currently, when a specific rlimit is incremented by 1 in a child user
namespace, the corresponding rlimit in the parent node must also be
incremented by 1. For example, if the ucounts of a task in docker B
are ucount_b_1, then after incrementing the rlimit of ucount_b_1 by 1,
the rlimit of the parent node, init_ucounts, must also be incremented
by 1, and every step must stay within the limits configured for the
user namespaces along the way.

	init_user_ns                             init_ucounts
	^                                              ^
	|                        |                     |
	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
	|                        |                     |
	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
			^		^
			|		|
		  (add) cache_rlimit--->|
					|
					ucount_b_1(user1)

What is expected is that containers operating in separate namespaces
remain isolated and do not interfere with one another. Regrettably,
the current signal system call fails to provide this level of
isolation.

To address the issue described above, this series proposes a cache
for each namespace's rlimit. With a cache attached to each user
namespace's rlimit, a batch of rlimit charges can be allocated to a
namespace in one go. When resources are plentiful, freed charges do
not have to be returned to the parent node immediately, and as long
as a user namespace still has charges available in its cache, it does
not need to request more from the parent node.

The ultimate objective of this solution is to achieve complete isolation
among namespaces. After applying this patch set, the final test results
indicate that in the signal1 test case, the performance does not
deteriorate as the number of containers increases. This effectively meets
the goal of linear scalability.

  | Dockers 	|1	|4	|8	|16	|32	|64	|
  | Throughput 	|381809	|382284	|380640	|383515	|381318	|380120	|

[1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
Signed-off-by: Chen Ridong <chenridong@huawei.com>
---
 include/linux/user_namespace.h |  11 ++-
 kernel/signal.c                |   2 +-
 kernel/ucount.c                | 131 +++++++++++++++++++++++++++++++--
 kernel/user_namespace.c        |   2 +
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 823df9267a4a..30e80d46ab5f 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -113,6 +113,7 @@ struct user_namespace {
 	struct ucounts		*ucounts;
 	long ucount_max[UCOUNT_COUNTS];
 	long rlimit_max[UCOUNT_RLIMIT_COUNTS];
+	atomic_t rlimit_cache[UCOUNT_RLIMIT_COUNTS];
 
 #if IS_ENABLED(CONFIG_BINFMT_MISC)
 	struct binfmt_misc *binfmt_misc;
@@ -139,6 +140,8 @@ void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
 struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
 void put_ucounts(struct ucounts *ucounts);
 
+void rlimit_drain_cache(struct user_namespace *root);
+
 static inline struct ucounts * __must_check get_ucounts(struct ucounts *ucounts)
 {
 	if (rcuref_get(&ucounts->count))
@@ -154,7 +157,7 @@ static inline long get_rlimit_value(struct ucounts *ucounts, enum rlimit_type ty
 long inc_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 bool dec_rlimit_ucounts(struct ucounts *ucounts, enum rlimit_type type, long v);
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit);
+			    bool override_rlimit, long tlimit);
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type);
 bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigned long max);
 
@@ -169,6 +172,12 @@ static inline void set_userns_rlimit_max(struct user_namespace *ns,
 	ns->rlimit_max[type] = max <= LONG_MAX ? max : LONG_MAX;
 }
 
+static inline void init_userns_rlimit_cache(struct user_namespace *ns)
+{
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; ++i)
+		atomic_set(&ns->rlimit_cache[i], 0);
+}
+
 struct user_namespace *ns_next_child(struct user_namespace *pos,
 					   struct user_namespace *parent);
 struct user_namespace *ns_next_child_pre(struct user_namespace *pos,
diff --git a/kernel/signal.c b/kernel/signal.c
index 148082db9a55..e7147fcaa55f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -416,7 +416,7 @@ static struct ucounts *sig_get_ucounts(struct task_struct *t, int sig,
 	rcu_read_lock();
 	ucounts = task_ucounts(t);
 	sigpending = inc_rlimit_get_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING,
-					    override_rlimit);
+					    override_rlimit, task_rlimit(t, RLIMIT_SIGPENDING));
 	rcu_read_unlock();
 	if (!sigpending)
 		return NULL;
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 33605e416724..f29ed2d3b3c8 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -16,6 +16,8 @@ struct ucounts init_ucounts = {
 
 #define UCOUNTS_HASHTABLE_BITS 10
 #define UCOUNTS_HASHTABLE_ENTRIES (1 << UCOUNTS_HASHTABLE_BITS)
+#define UCOUNT_BATCH_SIZE 16
+
 static struct hlist_nulls_head ucounts_hashtable[UCOUNTS_HASHTABLE_ENTRIES] = {
 	[0 ... UCOUNTS_HASHTABLE_ENTRIES - 1] = HLIST_NULLS_HEAD_INIT(0)
 };
@@ -315,24 +317,143 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 
 void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 {
-	do_dec_rlimit_put_ucounts(ucounts, NULL, type, 1);
+	struct user_namespace *ns = ucounts->ns;
+	int cache;
+
+	if (ns != &init_user_ns) {
+		__dec_rlimit_put_ucounts(ucounts, type, 1);
+		cache = atomic_add_return(1, &ns->rlimit_cache[type]);
+		if (cache > UCOUNT_BATCH_SIZE) {
+			cache = atomic_sub_return(UCOUNT_BATCH_SIZE,
+						  &ns->rlimit_cache[type]);
+			if (cache > 0)
+				do_dec_rlimit_put_ucounts(ns->ucounts, NULL,
+							  type, UCOUNT_BATCH_SIZE);
+			else
+				atomic_add(UCOUNT_BATCH_SIZE, &ns->rlimit_cache[type]);
+		}
+	} else {
+		do_dec_rlimit_put_ucounts(ucounts, NULL, type, 1);
+	}
+}
+
+/* Drain the caches under root, return how many cached charges were reclaimed */
+static int rlimit_drain_type_cache(struct user_namespace *root, enum rlimit_type type)
+{
+	struct user_namespace *child;
+	int reclaim_cache = 0;
+
+	rcu_read_lock();
+	ns_for_each_child_pre(child, root) {
+		int cache;
+retry:
+		cache = atomic_read(&child->rlimit_cache[type]);
+		if (cache > 0) {
+			int old = atomic_cmpxchg(&child->rlimit_cache[type], cache, 0);
+
+			if (cache == old) {
+				reclaim_cache += cache;
+				do_dec_rlimit_put_ucounts(child->ucounts, NULL, type, cache);
+			} else {
+				goto retry;
+			}
+		}
+	}
+	rcu_read_unlock();
+	return reclaim_cache;
+}
+
+void rlimit_drain_cache(struct user_namespace *root)
+{
+	for (int i = 0; i < UCOUNT_RLIMIT_COUNTS; i++)
+		rlimit_drain_type_cache(root, i);
+}
+
+static bool rlimit_charge_cache(struct ucounts *ucounts, enum rlimit_type type)
+{
+	struct ucounts *iter;
+	long max = LONG_MAX;
+	long new;
+	struct user_namespace *ns = ucounts->ns;
+
+	for (iter = ns->ucounts; iter; iter = iter->ns->ucounts) {
+		max = get_userns_rlimit_max(iter->ns, type);
+		new = __inc_rlimit_get_ucounts(iter, type, UCOUNT_BATCH_SIZE);
+		if (new <= 0 || new > max)
+			goto dec_unwind;
+	}
+
+	/* charge ok, add the ns's cache */
+	atomic_add_return(UCOUNT_BATCH_SIZE, &ucounts->ns->rlimit_cache[type]);
+	return true;
+
+dec_unwind:
+	do_dec_rlimit_put_ucounts(ns->ucounts, iter, type, UCOUNT_BATCH_SIZE);
+	return false;
 }
 
 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-			    bool override_rlimit)
+			    bool override_rlimit, long tlimit)
 {
 	/* Caller must hold a reference to ucounts */
 	struct ucounts *iter;
 	long max = LONG_MAX;
 	long ret = 0;
+	struct user_namespace *ns = ucounts->ns;
+	bool is_trying = false;
+	bool non_cache = false;
+	long new;
+
+try_cache:
+	/* If the ucounts.ns is not init_user_ns, and it has cache in its ns, consume cache */
+	if (ns != &init_user_ns) {
+		if (atomic_dec_return(&ns->rlimit_cache[type]) >= 0) {
+			new =  __inc_rlimit_get_ucounts(ucounts, type, 1);
+			/*
+			 * If new is below tlimit, return success
+			 * Otherwise, goto non-cache logic. It should keep the
+			 * rlimit below the tlimit as much as possible
+			 */
+			if (new <= tlimit)
+				return new;
+			non_cache = true;
+		}
+		/* Restore the previously incremented value */
+		atomic_inc(&ns->rlimit_cache[type]);
+
+		if (!non_cache && !is_trying &&
+		    rlimit_charge_cache(ucounts, type)) {
+			is_trying = true;
+			goto try_cache;
+		}
+	}
 
 	for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-		long new = __inc_rlimit_get_ucounts(iter, type, 1);
+retry_inc:
+		new = __inc_rlimit_get_ucounts(iter, type, 1);
+
+		/*
+		 * When the 'iter' is equal to 'ucounts', the 'new' value is what will be returned.
+		 *
+		 * Case 1: If the return value is larger than 'tlimit'.
+		 * Case 2: If the 'new' value is larger than the maximum of 'rlimit_max'.
+		 *
+		 * In both cases, we need to drain the cache. This is because when the cache is
+		 * present, the value might exceed the acceptable threshold. However, when the
+		 * cache is removed, the value should fall within the allowed limit.
+		 */
+		if (iter == ucounts)
+			ret = new;
+
+		if ((new > max || ret > tlimit) &&
+			rlimit_drain_type_cache(iter->ns, type) > 0) {
+			__dec_rlimit_put_ucounts(iter, type, 1);
+			goto retry_inc;
+		}
 
 		if (new <= 0 || new > max)
 			goto dec_unwind;
-		if (iter == ucounts)
-			ret = new;
+
 		if (!override_rlimit)
 			max = get_userns_rlimit_max(iter->ns, type);
 	}
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 9a2e77505b97..bc77c9acf426 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -190,6 +190,7 @@ int create_user_ns(struct cred *new)
 	set_userns_rlimit_max(ns, UCOUNT_RLIMIT_MSGQUEUE, rlimit(RLIMIT_MSGQUEUE));
 	set_userns_rlimit_max(ns, UCOUNT_RLIMIT_SIGPENDING, rlimit(RLIMIT_SIGPENDING));
 	set_userns_rlimit_max(ns, UCOUNT_RLIMIT_MEMLOCK, rlimit(RLIMIT_MEMLOCK));
+	init_userns_rlimit_cache(ns);
 	ns->ucounts = ucounts;
 
 	/* Inherit USERNS_SETGROUPS_ALLOWED from our parent */
@@ -273,6 +274,7 @@ static void free_user_ns(struct work_struct *work)
 		kfree(ns->binfmt_misc);
 #endif
 		retire_userns_sysctls(ns);
+		rlimit_drain_cache(ns);
 		key_free_user_ns(ns);
 		ns_free_inum(&ns->ns);
 		call_rcu(&ns->rcu, __free_user_ns);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (4 preceding siblings ...)
  2025-05-09  7:20 ` [RFC next v2 5/5] ucount: add rlimit cache for ucount Chen Ridong
@ 2025-05-09 20:18 ` Andrew Morton
  2025-05-12 10:48   ` Sebastian Andrzej Siewior
  2025-05-15 10:29 ` Christian Brauner
  2025-05-16 11:48 ` Alexey Gladkov
  7 siblings, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2025-05-09 20:18 UTC (permalink / raw)
  To: Chen Ridong
  Cc: paulmck, bigeasy, legion, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4

On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:

> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> 
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.
> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.

Did you consider simply turning that atomic_t counter into a
percpu_counter?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-09 20:18 ` [RFC next v2 0/5] " Andrew Morton
@ 2025-05-12 10:48   ` Sebastian Andrzej Siewior
  2025-05-13  1:48     ` Chen Ridong
  0 siblings, 1 reply; 17+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-05-12 10:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Chen Ridong, paulmck, legion, roman.gushchin, brauner, tglx,
	frederic, peterz, oleg, joel.granados, viro, lorenzo.stoakes,
	avagin, mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4

On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
> On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
> 
> > The will-it-scale test case signal1 [1] has been observed. and the test
> > results reveal that the signal sending system call lacks linearity.
> > To further investigate this issue, we initiated a series of tests by
> > launching varying numbers of dockers and closely monitored the throughput
> > of each individual docker. The detailed test outcomes are presented as
> > follows:
> > 
> > 	| Dockers     |1      |4      |8      |16     |32     |64     |
> > 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> > 
> > The data clearly demonstrates a discernible trend: as the quantity of
> > dockers increases, the throughput per container progressively declines.
> > In-depth analysis has identified the root cause of this performance
> > degradation. The ucouts module conducts statistics on rlimit, which
> > involves a significant number of atomic operations. These atomic
> > operations, when acting on the same variable, trigger a substantial number
> > of cache misses or remote accesses, ultimately resulting in a drop in
> > performance.
> 
> Did you consider simply turning that atomic_t counter into a
> percpu_counter?

That sounds like a smaller change. Also, do these 1…64 docker containers
play signal ping-pong, or is there a real workload behind them?

Sebastian

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-12 10:48   ` Sebastian Andrzej Siewior
@ 2025-05-13  1:48     ` Chen Ridong
  0 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-13  1:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Andrew Morton
  Cc: paulmck, legion, roman.gushchin, brauner, tglx, frederic, peterz,
	oleg, joel.granados, viro, lorenzo.stoakes, avagin, mengensun,
	linux, jlayton, ruanjinjie, kees, linux-kernel, lujialin4,
	Chen Ridong



On 2025/5/12 18:48, Sebastian Andrzej Siewior wrote:
> On 2025-05-09 13:18:49 [-0700], Andrew Morton wrote:
>> On Fri,  9 May 2025 07:20:49 +0000 Chen Ridong <chenridong@huaweicloud.com> wrote:
>>
>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>> results reveal that the signal sending system call lacks linearity.
>>> To further investigate this issue, we initiated a series of tests by
>>> launching varying numbers of dockers and closely monitored the throughput
>>> of each individual docker. The detailed test outcomes are presented as
>>> follows:
>>>
>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>
>>> The data clearly demonstrates a discernible trend: as the quantity of
>>> dockers increases, the throughput per container progressively declines.
>>> In-depth analysis has identified the root cause of this performance
>>> degradation. The ucouts module conducts statistics on rlimit, which
>>> involves a significant number of atomic operations. These atomic
>>> operations, when acting on the same variable, trigger a substantial number
>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>> performance.
>>
>> Did you consider simply turning that atomic_t counter into a
>> percpu_counter?
> 
> That sounds like a smaller change. Also, do these 1…64 docker container
> play signal ping-pong or is there a real workload behind it?
> 
> Sebastian

Hi Andrew and Sebastian,

Thanks for your prompt replies. I'm currently running the
"will-it-scale" test on a 384-core machine configured to run up to 64
Docker containers (the maximum number). Even with just 32 containers,
the throughput drops by 53%, which indicates significant scaling
problems.

Your suggestion about using percpu_counter was spot on. I've since
implemented a demo to benchmark its performance. Here's the code I wrote
for testing:

static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
@@ -281,10 +289,10 @@ static void do_dec_rlimit_put_ucounts(struct ucounts *ucounts,
 {
        struct ucounts *iter, *next;
        for (iter = ucounts; iter != last; iter = next) {
-               long dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-               WARN_ON_ONCE(dec < 0);
+               percpu_counter_sub(&iter->rlimit[type], 1);
+
                next = iter->ns->ucounts;
-               if (dec == 0)
+               if (percpu_counter_compare(&iter->rlimit[type], 0) == 0)
                        put_ucounts(iter);
        }
 }
@@ -295,36 +303,40 @@ void dec_rlimit_put_ucounts(struct ucounts *ucounts, enum rlimit_type type)
 }

 long inc_rlimit_get_ucounts(struct ucounts *ucounts, enum rlimit_type type,
-                           bool override_rlimit)
+                           bool override_rlimit, long limit)
 {
        /* Caller must hold a reference to ucounts */
        struct ucounts *iter;
        long max = LONG_MAX;
-       long dec, ret = 0;
+       long ret = 0;
+
+       if (override_rlimit)
+               limit = LONG_MAX;

        for (iter = ucounts; iter; iter = iter->ns->ucounts) {
-               long new = atomic_long_add_return(1, &iter->rlimit[type]);
-               if (new < 0 || new > max)
+               max = min(limit, max);
+
+               if (!percpu_counter_limited_add(&iter->rlimit[type], max, 1)) {
+                       ret = -1;
                        goto dec_unwind;
-               if (iter == ucounts)
-                       ret = new;
+               }
                if (!override_rlimit)
                        max = get_userns_rlimit_max(iter->ns, type);
                /*
                 * Grab an extra ucount reference for the caller when
                 * the rlimit count was previously 0.
                 */
-               if (new != 1)
+               if (percpu_counter_compare(&iter->rlimit[type], 1) != 0)
                        continue;
-               if (!get_ucounts(iter))
+               if (!get_ucounts(iter)) {
+                       ret = 0;
                        goto dec_unwind;
+               }
        }
-       return ret;
+       return 1;
 dec_unwind:
-       dec = atomic_long_sub_return(1, &iter->rlimit[type]);
-       WARN_ON_ONCE(dec < 0);
        do_dec_rlimit_put_ucounts(ucounts, iter, type);
-       return 0;
+       return ret;
 }

As shown in the demo code, the current implementation takes a ucounts
reference on the initial rlimit increment and releases it when the
rlimit count drops back to zero. This mechanism was introduced by the
following commits:

  fda31c50292a ("signal: avoid double atomic counter increments for user accounting")
  d64696905554 ("Reimplement RLIMIT_SIGPENDING on top of ucounts")
  15bc01effefe ("ucounts: Fix signal ucount refcounting")

This means that every rlimit increment or decrement has to sum all of
the per-CPU rlimit counters, which is expensive.
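
The expensive part is the precise fallback in the comparison helper
itself. Roughly, __percpu_counter_compare() in lib/percpu_counter.c
looks like the following (slightly abridged here, not a verbatim
copy); because we compare the rlimit against small values such as 0
and 1, the cheap approximate read is almost never conclusive and we
fall through to summing every CPU's counter:

/* Abridged sketch of lib/percpu_counter.c behaviour. */
int __percpu_counter_compare(struct percpu_counter *fbc, s64 rhs, s32 batch)
{
	s64 count;

	count = percpu_counter_read(fbc);
	/* Is the rough count sufficient for the comparison? */
	if (abs(count - rhs) > (batch * num_online_cpus()))
		return count > rhs ? 1 : -1;

	/* No: we need the precise count, so sum every CPU's counter. */
	count = percpu_counter_sum(fbc);
	if (count > rhs)
		return 1;
	else if (count < rhs)
		return -1;
	return 0;
}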

Running the demo code in a single Docker container yielded a
throughput of 73,970, significantly lower than expected. Performance
profiling with perf revealed that __percpu_counter_sum is the primary
bottleneck:

+   97.44%     0.27%  signal1_process    [k] entry_SYSCALL_64_after_hwframe
+   97.13%     1.96%  signal1_process    [k] do_syscall_64
+   91.54%     0.00%  signal1_process    [.] 0x00007fb905c8d13f
+   78.66%     0.03%  signal1_process    [k] __percpu_counter_compare
+   78.63%    68.18%  signal1_process    [k] __percpu_counter_sum
+   45.17%     0.37%  signal1_process    [k] syscall_exit_to_user_mode
+   44.95%     0.20%  signal1_process    [k] __x64_sys_tgkill
+   44.51%     0.40%  signal1_process    [k] do_send_specific
+   44.16%     0.07%  signal1_process    [k] arch_do_signal_or_restart
+   43.03%     0.37%  signal1_process    [k] do_send_sig_info
+   42.08%     0.34%  signal1_process    [k] __send_signal_locked
+   40.87%     0.03%  signal1_process    [k] sig_get_ucounts
+   40.74%     0.44%  signal1_process    [k] inc_rlimit_get_ucounts
+   40.55%     0.54%  signal1_process    [k] get_signal
+   39.81%     0.07%  signal1_process    [k] dequeue_signal
+   39.00%     0.07%  signal1_process    [k] __sigqueue_free
+   38.94%     0.27%  signal1_process    [k] do_dec_rlimit_put_ucounts
+    8.60%     8.60%  signal1_process    [k] _find_next_or_bit

However, if we can implement a mechanism that avoids pinning ucounts
for pending signals (eliminating the need for the get/put refcount
operations), the percpu_counter approach should effectively resolve
this scalability issue. I am trying to figure this out; if you have
any ideas, please let me know and I will test them.

Thank you for your guidance on this matter.

Best regards,
Ridong


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (5 preceding siblings ...)
  2025-05-09 20:18 ` [RFC next v2 0/5] " Andrew Morton
@ 2025-05-15 10:29 ` Christian Brauner
  2025-05-15 12:04   ` Chen Ridong
  2025-05-16 11:48 ` Alexey Gladkov
  7 siblings, 1 reply; 17+ messages in thread
From: Christian Brauner @ 2025-05-15 10:29 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, paulmck, bigeasy, legion, roman.gushchin, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4

On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.
> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> 
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.
> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.
> 
> Notably, even though a new user_namespace is created upon docker startup,
> the problem persists. This is because all these dockers share the same
> parent node, meaning that rlimit statistics continuously modify the same
> atomic variable.
> 
> Currently, when incrementing a specific rlimit within a child user
> namespace by 1, the corresponding rlimit in the parent node must also be
> incremented by 1. Specifically, if the ucounts corresponding to a task in
> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
> This operation should be ensured to stay within the limits set for the
> user namespaces.
> 
> 	init_user_ns                             init_ucounts
> 	^                                              ^
> 	|                        |                     |
> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> 	|                        |                     |
> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
> 					^
> 					|
> 					|
> 					|
> 					ucount_b_1
> 
> What is expected is that dockers operating within separate namespaces
> should remain isolated and not interfere with one another. Regrettably,
> the current signal system call fails to achieve this desired level of
> isolation.
> 
> Proposal:
> 
> To address the aforementioned issues, the concept of implementing a cache
> for each namespace's rlimit has been proposed. If a cache is added for
> each user namespace's rlimit, a certain amount of rlimits can be allocated
> to a particular namespace in one go. When resources are abundant, these
> resources do not need to be immediately returned to the parent node. Within
> a user namespace, if there are available values in the cache, there is no
> need to request additional resources from the parent node.
> 
> 	init_user_ns                             init_ucounts
> 	^                                              ^
> 	|                        |                     |
> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> 	|                        |                     |
> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
> 			^		^
> 			|		|
> 			cache_rlimit--->|
> 					|
> 					ucount_b_1
> 
> 
> The ultimate objective of this solution is to achieve complete isolation
> among namespaces. After applying this patch set, the final test results
> indicate that in the signal1 test case, the performance does not
> deteriorate as the number of containers increases. This effectively meets
> the goal of linear scalability.
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
> 
> Challenges:
> 
> When checking the pending signals in the parent node using the command
>  cat /proc/self/status | grep SigQ, the retrieved value includes the
> cached signal counts from its child nodes. As a result, the SigQ value
> in the parent node fails to accurately and instantaneously reflect the
> actual number of pending signals.
> 
> 	# cat /proc/self/status | grep SigQ
> 	SigQ:	16/6187667
> 
> TODO:
> 
> Add cache for the other rlimits.

Woah, I don't think we want to go down that route. That sounds so overly
complex. We should only do that if we absolutely have to. If we can get
away with the percpu counter and some optimizations we might be better
off in the long run.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-15 10:29 ` Christian Brauner
@ 2025-05-15 12:04   ` Chen Ridong
  0 siblings, 0 replies; 17+ messages in thread
From: Chen Ridong @ 2025-05-15 12:04 UTC (permalink / raw)
  To: Christian Brauner
  Cc: akpm, paulmck, bigeasy, legion, roman.gushchin, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4



On 2025/5/15 18:29, Christian Brauner wrote:
> Woah, I don't think we want to go down that route. That sounds so overly
> complex. We should only do that if we absolutely have to. If we can get
> away with the percpu counter and some optimizations we might be better
> off in the long run.

Thank you for your reply, I will send the next version with percpu_counter.

Thanks,
Ridong


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
                   ` (6 preceding siblings ...)
  2025-05-15 10:29 ` Christian Brauner
@ 2025-05-16 11:48 ` Alexey Gladkov
  2025-05-19 13:39   ` Chen Ridong
  7 siblings, 1 reply; 17+ messages in thread
From: Alexey Gladkov @ 2025-05-16 11:48 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, paulmck, bigeasy, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4, Eric Biederman

On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> The will-it-scale test case signal1 [1] has been observed. and the test
> results reveal that the signal sending system call lacks linearity.

The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.

Do you have an example of a closer-to-life scenario where this delay
becomes a bottleneck?

https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

> To further investigate this issue, we initiated a series of tests by
> launching varying numbers of dockers and closely monitored the throughput
> of each individual docker. The detailed test outcomes are presented as
> follows:
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> 
> The data clearly demonstrates a discernible trend: as the quantity of
> dockers increases, the throughput per container progressively declines.
> In-depth analysis has identified the root cause of this performance
> degradation. The ucouts module conducts statistics on rlimit, which
> involves a significant number of atomic operations. These atomic
> operations, when acting on the same variable, trigger a substantial number
> of cache misses or remote accesses, ultimately resulting in a drop in
> performance.
> 
> Notably, even though a new user_namespace is created upon docker startup,
> the problem persists. This is because all these dockers share the same
> parent node, meaning that rlimit statistics continuously modify the same
> atomic variable.
> 
> Currently, when incrementing a specific rlimit within a child user
> namespace by 1, the corresponding rlimit in the parent node must also be
> incremented by 1. Specifically, if the ucounts corresponding to a task in
> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
> This operation should be ensured to stay within the limits set for the
> user namespaces.
> 
> 	init_user_ns                             init_ucounts
> 	^                                              ^
> 	|                        |                     |
> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> 	|                        |                     |
> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
> 					^
> 					|
> 					|
> 					|
> 					ucount_b_1
> 
> What is expected is that dockers operating within separate namespaces
> should remain isolated and not interfere with one another. Regrettably,
> the current signal system call fails to achieve this desired level of
> isolation.
> 
> Proposal:
> 
> To address the aforementioned issues, the concept of implementing a cache
> for each namespace's rlimit has been proposed. If a cache is added for
> each user namespace's rlimit, a certain amount of rlimits can be allocated
> to a particular namespace in one go. When resources are abundant, these
> resources do not need to be immediately returned to the parent node. Within
> a user namespace, if there are available values in the cache, there is no
> need to request additional resources from the parent node.
> 
> 	init_user_ns                             init_ucounts
> 	^                                              ^
> 	|                        |                     |
> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> 	|                        |                     |
> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
> 			^		^
> 			|		|
> 			cache_rlimit--->|
> 					|
> 					ucount_b_1
> 
> 
> The ultimate objective of this solution is to achieve complete isolation
> among namespaces. After applying this patch set, the final test results
> indicate that in the signal1 test case, the performance does not
> deteriorate as the number of containers increases. This effectively meets

> the goal of linear scalability.
> 
> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
> 
> Challenges:
> 
> When checking the pending signals in the parent node using the command
>  cat /proc/self/status | grep SigQ, the retrieved value includes the
> cached signal counts from its child nodes. As a result, the SigQ value
> in the parent node fails to accurately and instantaneously reflect the
> actual number of pending signals.
> 
> 	# cat /proc/self/status | grep SigQ
> 	SigQ:	16/6187667
> 
> TODO:
> 
> Add cache for the other rlimits.
> 
> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
> 
> Chen Ridong (5):
>   user_namespace: add children list node
>   usernamespace: make usernamespace rcu safe
>   user_namespace: add user_ns iteration helper
>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>   ucount: add rlimit cache for ucount
> 
>  include/linux/user_namespace.h |  23 ++++-
>  kernel/signal.c                |   2 +-
>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
>  kernel/user.c                  |   2 +
>  kernel/user_namespace.c        |  60 ++++++++++-
>  5 files changed, 243 insertions(+), 25 deletions(-)
> 
> -- 
> 2.34.1
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-16 11:48 ` Alexey Gladkov
@ 2025-05-19 13:39   ` Chen Ridong
  2025-05-19 16:32     ` Alexey Gladkov
  0 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-05-19 13:39 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: akpm, paulmck, bigeasy, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4, Eric Biederman



On 2025/5/16 19:48, Alexey Gladkov wrote:
> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>> The will-it-scale test case signal1 [1] has been observed. and the test
>> results reveal that the signal sending system call lacks linearity.
> 
> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> 
> Do you have an example of a closer-to-life scenario where this delay
> becomes a bottleneck ?
> 
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> 

Thank you for your prompt reply. Unfortunately, I do not have a
specific real-world scenario.

Motivation:
I plan to use servers with 384 cores, and potentially even more in the
future. Therefore, I am testing these system calls to identify any
scalability bottlenecks that could arise in massively parallel
high-density computing environments.

In addition, we hope that the containers can be isolated as much as
possible to avoid interfering with each other.

Best regards,
Ridong

>> To further investigate this issue, we initiated a series of tests by
>> launching varying numbers of dockers and closely monitored the throughput
>> of each individual docker. The detailed test outcomes are presented as
>> follows:
>>
>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>
>> The data clearly demonstrates a discernible trend: as the quantity of
>> dockers increases, the throughput per container progressively declines.
>> In-depth analysis has identified the root cause of this performance
>> degradation. The ucouts module conducts statistics on rlimit, which
>> involves a significant number of atomic operations. These atomic
>> operations, when acting on the same variable, trigger a substantial number
>> of cache misses or remote accesses, ultimately resulting in a drop in
>> performance.
>>
>> Notably, even though a new user_namespace is created upon docker startup,
>> the problem persists. This is because all these dockers share the same
>> parent node, meaning that rlimit statistics continuously modify the same
>> atomic variable.
>>
>> Currently, when incrementing a specific rlimit within a child user
>> namespace by 1, the corresponding rlimit in the parent node must also be
>> incremented by 1. Specifically, if the ucounts corresponding to a task in
>> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
>> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
>> This operation should be ensured to stay within the limits set for the
>> user namespaces.
>>
>> 	init_user_ns                             init_ucounts
>> 	^                                              ^
>> 	|                        |                     |
>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>> 	|                        |                     |
>> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
>> 					^
>> 					|
>> 					|
>> 					|
>> 					ucount_b_1
>>
>> What is expected is that dockers operating within separate namespaces
>> should remain isolated and not interfere with one another. Regrettably,
>> the current signal system call fails to achieve this desired level of
>> isolation.
>>
>> Proposal:
>>
>> To address the aforementioned issues, the concept of implementing a cache
>> for each namespace's rlimit has been proposed. If a cache is added for
>> each user namespace's rlimit, a certain amount of rlimits can be allocated
>> to a particular namespace in one go. When resources are abundant, these
>> resources do not need to be immediately returned to the parent node. Within
>> a user namespace, if there are available values in the cache, there is no
>> need to request additional resources from the parent node.
>>
>> 	init_user_ns                             init_ucounts
>> 	^                                              ^
>> 	|                        |                     |
>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>> 	|                        |                     |
>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>> 			^		^
>> 			|		|
>> 			cache_rlimit--->|
>> 					|
>> 					ucount_b_1
>>
>>
>> The ultimate objective of this solution is to achieve complete isolation
>> among namespaces. After applying this patch set, the final test results
>> indicate that in the signal1 test case, the performance does not
>> deteriorate as the number of containers increases. This effectively meets
> 
>> the goal of linear scalability.
>>
>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
>>
>> Challenges:
>>
>> When checking the pending signals in the parent node using the command
>>  cat /proc/self/status | grep SigQ, the retrieved value includes the
>> cached signal counts from its child nodes. As a result, the SigQ value
>> in the parent node fails to accurately and instantaneously reflect the
>> actual number of pending signals.
>>
>> 	# cat /proc/self/status | grep SigQ
>> 	SigQ:	16/6187667
>>
>> TODO:
>>
>> Add cache for the other rlimits.
>>
>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
>>
>> Chen Ridong (5):
>>   user_namespace: add children list node
>>   usernamespace: make usernamespace rcu safe
>>   user_namespace: add user_ns iteration helper
>>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>>   ucount: add rlimit cache for ucount
>>
>>  include/linux/user_namespace.h |  23 ++++-
>>  kernel/signal.c                |   2 +-
>>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
>>  kernel/user.c                  |   2 +
>>  kernel/user_namespace.c        |  60 ++++++++++-
>>  5 files changed, 243 insertions(+), 25 deletions(-)
>>
>> -- 
>> 2.34.1
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-19 13:39   ` Chen Ridong
@ 2025-05-19 16:32     ` Alexey Gladkov
  2025-05-21  1:32       ` Chen Ridong
  0 siblings, 1 reply; 17+ messages in thread
From: Alexey Gladkov @ 2025-05-19 16:32 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, paulmck, bigeasy, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4, Eric Biederman

On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
> 
> 
> On 2025/5/16 19:48, Alexey Gladkov wrote:
> > On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> >> The will-it-scale test case signal1 [1] has been observed. and the test
> >> results reveal that the signal sending system call lacks linearity.
> > 
> > The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> > 
> > Do you have an example of a closer-to-life scenario where this delay
> > becomes a bottleneck ?
> > 
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> > 
> 
> Thank you for your prompt reply. Unfortunately, I do not have the
> specific scenario.
> 
> Motivation:
> I plan to use servers with 384 cores, and potentially even more in the
> future. Therefore, I am testing these system calls to identify any
> scalability bottlenecks that could arise in massively parallel
> high-density computing environments.

But it turns out that you're proposing complex changes for something that
is essentially a non-issue. In the real world, applications don't spam
signals and I'm not sure we want to support that scenario.

> In addition, we hope that the containers can be isolated as much as
> possible to avoid interfering with each other.

But that's impossible. Even before the migration to ucounts, some rlimits
(RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
bound to user_struct. In other words, the atomic counter and its
"bottleneck" were already there. We can't remove the counters for those
rlimits, and they will always have some impact.
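
For context, a rough sketch of where these counters lived before the
ucounts migration (a paraphrase; field names are approximate, not an
exact copy of the old include/linux/sched/user.h):

	/* Pre-ucounts era, roughly: one shared atomic per uid, so every
	 * task of that user contended on the same counters. */
	struct user_struct_sketch {
		refcount_t    __count;     /* reference count */
		atomic_t      processes;   /* RLIMIT_NPROC */
		atomic_t      sigpending;  /* RLIMIT_SIGPENDING */
		unsigned long mq_bytes;    /* RLIMIT_MSGQUEUE */
		unsigned long locked_shm;  /* RLIMIT_MEMLOCK (SysV shm) */
	};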

These rlimits are now counted per-namespace. In real life, docker/podman
creates a separate user namespace for each container from init_user_ns.
Usually only one additional counter is added for each rlimit in this way.

All I'm saying is that the "bottleneck" of the atomic counter was there
before and cannot be removed entirely.

> 
> Best regards,
> Ridong
> 
> >> To further investigate this issue, we initiated a series of tests by
> >> launching varying numbers of dockers and closely monitored the throughput
> >> of each individual docker. The detailed test outcomes are presented as
> >> follows:
> >>
> >> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> >> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> >>
> >> The data clearly demonstrates a discernible trend: as the quantity of
> >> dockers increases, the throughput per container progressively declines.
> >> In-depth analysis has identified the root cause of this performance
> >> degradation. The ucouts module conducts statistics on rlimit, which
> >> involves a significant number of atomic operations. These atomic
> >> operations, when acting on the same variable, trigger a substantial number
> >> of cache misses or remote accesses, ultimately resulting in a drop in
> >> performance.
> >>
> >> Notably, even though a new user_namespace is created upon docker startup,
> >> the problem persists. This is because all these dockers share the same
> >> parent node, meaning that rlimit statistics continuously modify the same
> >> atomic variable.
> >>
> >> Currently, when incrementing a specific rlimit within a child user
> >> namespace by 1, the corresponding rlimit in the parent node must also be
> >> incremented by 1. Specifically, if the ucounts corresponding to a task in
> >> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
> >> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
> >> This operation should be ensured to stay within the limits set for the
> >> user namespaces.
> >>
> >> 	init_user_ns                             init_ucounts
> >> 	^                                              ^
> >> 	|                        |                     |
> >> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> >> 	|                        |                     |
> >> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
> >> 					^
> >> 					|
> >> 					|
> >> 					|
> >> 					ucount_b_1
> >>
> >> What is expected is that dockers operating within separate namespaces
> >> should remain isolated and not interfere with one another. Regrettably,
> >> the current signal system call fails to achieve this desired level of
> >> isolation.
> >>
> >> Proposal:
> >>
> >> To address the aforementioned issues, the concept of implementing a cache
> >> for each namespace's rlimit has been proposed. If a cache is added for
> >> each user namespace's rlimit, a certain amount of rlimits can be allocated
> >> to a particular namespace in one go. When resources are abundant, these
> >> resources do not need to be immediately returned to the parent node. Within
> >> a user namespace, if there are available values in the cache, there is no
> >> need to request additional resources from the parent node.
> >>
> >> 	init_user_ns                             init_ucounts
> >> 	^                                              ^
> >> 	|                        |                     |
> >> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> >> 	|                        |                     |
> >> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
> >> 			^		^
> >> 			|		|
> >> 			cache_rlimit--->|
> >> 					|
> >> 					ucount_b_1
> >>
> >>
> >> The ultimate objective of this solution is to achieve complete isolation
> >> among namespaces. After applying this patch set, the final test results
> >> indicate that in the signal1 test case, the performance does not
> >> deteriorate as the number of containers increases. This effectively meets
> > 
> >> the goal of linear scalability.
> >>
> >> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> >> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
> >>
> >> Challenges:
> >>
> >> When checking the pending signals in the parent node using the command
> >>  cat /proc/self/status | grep SigQ, the retrieved value includes the
> >> cached signal counts from its child nodes. As a result, the SigQ value
> >> in the parent node fails to accurately and instantaneously reflect the
> >> actual number of pending signals.
> >>
> >> 	# cat /proc/self/status | grep SigQ
> >> 	SigQ:	16/6187667
> >>
> >> TODO:
> >>
> >> Add cache for the other rlimits.
> >>
> >> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
> >>
> >> Chen Ridong (5):
> >>   user_namespace: add children list node
> >>   usernamespace: make usernamespace rcu safe
> >>   user_namespace: add user_ns iteration helper
> >>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
> >>   ucount: add rlimit cache for ucount
> >>
> >>  include/linux/user_namespace.h |  23 ++++-
> >>  kernel/signal.c                |   2 +-
> >>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
> >>  kernel/user.c                  |   2 +
> >>  kernel/user_namespace.c        |  60 ++++++++++-
> >>  5 files changed, 243 insertions(+), 25 deletions(-)
> >>
> >> -- 
> >> 2.34.1
> >>
> > 
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-19 16:32     ` Alexey Gladkov
@ 2025-05-21  1:32       ` Chen Ridong
  2025-05-21  7:29         ` Alexey Gladkov
  0 siblings, 1 reply; 17+ messages in thread
From: Chen Ridong @ 2025-05-21  1:32 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: akpm, paulmck, bigeasy, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4, Eric Biederman



On 2025/5/20 0:32, Alexey Gladkov wrote:
> On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
>>
>>
>> On 2025/5/16 19:48, Alexey Gladkov wrote:
>>> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
>>>> The will-it-scale test case signal1 [1] has been observed. and the test
>>>> results reveal that the signal sending system call lacks linearity.
>>>
>>> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
>>>
>>> Do you have an example of a closer-to-life scenario where this delay
>>> becomes a bottleneck ?
>>>
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
>>>
>>
>> Thank you for your prompt reply. Unfortunately, I do not have the
>> specific scenario.
>>
>> Motivation:
>> I plan to use servers with 384 cores, and potentially even more in the
>> future. Therefore, I am testing these system calls to identify any
>> scalability bottlenecks that could arise in massively parallel
>> high-density computing environments.
> 
> But it turns out that you're proposing complex changes for something that

Using percpu_counter would not be as complex as this patch set.
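
To illustrate what I mean, a minimal sketch (hypothetical names, not a
real patch): the per-rlimit atomic could become a percpu_counter, with a
cheap approximate check far from the limit and an exact sum only near it.

	#include <linux/percpu_counter.h>

	/* Hypothetical sketch only; "ucounts_sketch" and the helper are
	 * made-up names, and initialization/error handling is omitted. */
	struct ucounts_sketch {
		struct percpu_counter rlimit_sigpending;
		long max_sigpending;
	};

	static bool sketch_inc_sigpending(struct ucounts_sketch *uc)
	{
		percpu_counter_add_batch(&uc->rlimit_sigpending, 1, 32);

		/* Approximate read is enough while far below the limit. */
		if (percpu_counter_read_positive(&uc->rlimit_sigpending) <
		    uc->max_sigpending)
			return true;

		/* Near the limit, fall back to an exact (slow) sum. */
		if (percpu_counter_sum(&uc->rlimit_sigpending) <=
		    uc->max_sigpending)
			return true;

		percpu_counter_add_batch(&uc->rlimit_sigpending, -1, 32);
		return false;
	}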

> is essentially a non-issue. In the real world, applications don't spam
> signals and I'm not sure we want to support that scenario.
> 
>> In addition, we hope that the containers can be isolated as much as
>> possible to avoid interfering with each other.
> 
> But that's impossible. Even before migration to ucounts, some rlimits
> (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> We can't remove the counters for that rlimits and they will have an
> impact.
> 
> These rlimits are now counted per-namespace. In real life, docker/podman
> creates a separate user namespace for each container from init_user_ns.
> Usually only one additional counter is added for each rlimit in this way.
> 
The problem is that whenever an rlimit counter is incremented inside a
Docker container, the counter in init_user_ns must be incremented as
well, even though a separate user namespace has been created. This is
why I believe this issue deserves to be fixed.
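
To make that concrete, the charge path walks every namespace up to
init_user_ns and touches one shared atomic per level. Roughly (a
simplified paraphrase of inc_rlimit_ucounts() in kernel/ucount.c; the
real code also enforces the per-namespace maximums):

	static long sketch_inc_rlimit(struct ucounts *ucounts,
				      enum rlimit_type type)
	{
		struct ucounts *iter;
		long ret = 0;

		for (iter = ucounts; iter; iter = iter->ns->ucounts) {
			long new = atomic_long_add_return(1,
							  &iter->rlimit[type]);

			if (iter == ucounts)
				ret = new;
			/* With N containers under init_user_ns, the last
			 * iteration always hits init_ucounts, which is the
			 * contended cache line in the signal1 test. */
		}
		return ret;
	}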


> All I'm saying is that "bottleneck" with atomic counter was there before
> and can't be removed anywhere.
> 

Yes, it cannot be removed entirely, but maybe we can make it better.

Best regards,
Ridong

>>
>> Best regards,
>> Ridong
>>
>>>> To further investigate this issue, we initiated a series of tests by
>>>> launching varying numbers of dockers and closely monitored the throughput
>>>> of each individual docker. The detailed test outcomes are presented as
>>>> follows:
>>>>
>>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
>>>>
>>>> The data clearly demonstrates a discernible trend: as the quantity of
>>>> dockers increases, the throughput per container progressively declines.
>>>> In-depth analysis has identified the root cause of this performance
>>>> degradation. The ucouts module conducts statistics on rlimit, which
>>>> involves a significant number of atomic operations. These atomic
>>>> operations, when acting on the same variable, trigger a substantial number
>>>> of cache misses or remote accesses, ultimately resulting in a drop in
>>>> performance.
>>>>
>>>> Notably, even though a new user_namespace is created upon docker startup,
>>>> the problem persists. This is because all these dockers share the same
>>>> parent node, meaning that rlimit statistics continuously modify the same
>>>> atomic variable.
>>>>
>>>> Currently, when incrementing a specific rlimit within a child user
>>>> namespace by 1, the corresponding rlimit in the parent node must also be
>>>> incremented by 1. Specifically, if the ucounts corresponding to a task in
>>>> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
>>>> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
>>>> This operation should be ensured to stay within the limits set for the
>>>> user namespaces.
>>>>
>>>> 	init_user_ns                             init_ucounts
>>>> 	^                                              ^
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
>>>> 					^
>>>> 					|
>>>> 					|
>>>> 					|
>>>> 					ucount_b_1
>>>>
>>>> What is expected is that dockers operating within separate namespaces
>>>> should remain isolated and not interfere with one another. Regrettably,
>>>> the current signal system call fails to achieve this desired level of
>>>> isolation.
>>>>
>>>> Proposal:
>>>>
>>>> To address the aforementioned issues, the concept of implementing a cache
>>>> for each namespace's rlimit has been proposed. If a cache is added for
>>>> each user namespace's rlimit, a certain amount of rlimits can be allocated
>>>> to a particular namespace in one go. When resources are abundant, these
>>>> resources do not need to be immediately returned to the parent node. Within
>>>> a user namespace, if there are available values in the cache, there is no
>>>> need to request additional resources from the parent node.
>>>>
>>>> 	init_user_ns                             init_ucounts
>>>> 	^                                              ^
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
>>>> 	|                        |                     |
>>>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
>>>> 			^		^
>>>> 			|		|
>>>> 			cache_rlimit--->|
>>>> 					|
>>>> 					ucount_b_1
>>>>
>>>>
>>>> The ultimate objective of this solution is to achieve complete isolation
>>>> among namespaces. After applying this patch set, the final test results
>>>> indicate that in the signal1 test case, the performance does not
>>>> deteriorate as the number of containers increases. This effectively meets
>>>
>>>> the goal of linear scalability.
>>>>
>>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
>>>> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
>>>>
>>>> Challenges:
>>>>
>>>> When checking the pending signals in the parent node using the command
>>>>  cat /proc/self/status | grep SigQ, the retrieved value includes the
>>>> cached signal counts from its child nodes. As a result, the SigQ value
>>>> in the parent node fails to accurately and instantaneously reflect the
>>>> actual number of pending signals.
>>>>
>>>> 	# cat /proc/self/status | grep SigQ
>>>> 	SigQ:	16/6187667
>>>>
>>>> TODO:
>>>>
>>>> Add cache for the other rlimits.
>>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
>>>>
>>>> Chen Ridong (5):
>>>>   user_namespace: add children list node
>>>>   usernamespace: make usernamespace rcu safe
>>>>   user_namespace: add user_ns iteration helper
>>>>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
>>>>   ucount: add rlimit cache for ucount
>>>>
>>>>  include/linux/user_namespace.h |  23 ++++-
>>>>  kernel/signal.c                |   2 +-
>>>>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
>>>>  kernel/user.c                  |   2 +
>>>>  kernel/user_namespace.c        |  60 ++++++++++-
>>>>  5 files changed, 243 insertions(+), 25 deletions(-)
>>>>
>>>> -- 
>>>> 2.34.1
>>>>
>>>
>>
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-21  1:32       ` Chen Ridong
@ 2025-05-21  7:29         ` Alexey Gladkov
  2025-05-22 22:48           ` Andrei Vagin
  0 siblings, 1 reply; 17+ messages in thread
From: Alexey Gladkov @ 2025-05-21  7:29 UTC (permalink / raw)
  To: Chen Ridong
  Cc: akpm, paulmck, bigeasy, roman.gushchin, brauner, tglx, frederic,
	peterz, oleg, joel.granados, viro, lorenzo.stoakes, avagin,
	mengensun, linux, jlayton, ruanjinjie, kees, linux-kernel,
	lujialin4, Eric Biederman

On Wed, May 21, 2025 at 09:32:12AM +0800, Chen Ridong wrote:
> 
> 
> On 2025/5/20 0:32, Alexey Gladkov wrote:
> > On Mon, May 19, 2025 at 09:39:34PM +0800, Chen Ridong wrote:
> >>
> >>
> >> On 2025/5/16 19:48, Alexey Gladkov wrote:
> >>> On Fri, May 09, 2025 at 07:20:49AM +0000, Chen Ridong wrote:
> >>>> The will-it-scale test case signal1 [1] has been observed. and the test
> >>>> results reveal that the signal sending system call lacks linearity.
> >>>
> >>> The signal1 testcase is pretty synthetic. It sends a signal in a busy loop.
> >>>
> >>> Do you have an example of a closer-to-life scenario where this delay
> >>> becomes a bottleneck ?
> >>>
> >>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c
> >>>
> >>
> >> Thank you for your prompt reply. Unfortunately, I do not have the
> >> specific scenario.
> >>
> >> Motivation:
> >> I plan to use servers with 384 cores, and potentially even more in the
> >> future. Therefore, I am testing these system calls to identify any
> >> scalability bottlenecks that could arise in massively parallel
> >> high-density computing environments.
> > 
> > But it turns out that you're proposing complex changes for something that
> 
> Using percpu_couter is not as complex as this patch set.
> 
> > is essentially a non-issue. In the real world, applications don't spam
> > signals and I'm not sure we want to support that scenario.
> > 
> >> In addition, we hope that the containers can be isolated as much as
> >> possible to avoid interfering with each other.
> > 
> > But that's impossible. Even before migration to ucounts, some rlimits
> > (RLIMIT_MSGQUEUE, RLIMIT_MEMLOCK, RLIMIT_SIGPENDING, RLIMIT_NPROC) were
> > bound to user_struct. I mean, atomic counter and "bottleneck" was there.
> > We can't remove the counters for that rlimits and they will have an
> > impact.
> > 
> > These rlimits are now counted per-namespace. In real life, docker/podman
> > creates a separate user namespace for each container from init_user_ns.
> > Usually only one additional counter is added for each rlimit in this way.
> > 
> The problem is that when increasing an rlimit in Docker, the limit has
> to be increased in the init_user_ns even if a separate user namespace
> has been created. This is why I believe this issue deserves to be fixed.

man setrlimit(2):

  An unprivileged process may set only its soft limit to a value in the
  range from 0 up to the hard limit, and (irreversibly) lower its hard
  limit. A privileged process may make arbitrary changes to either limit
  value.

This is well-documented behavior, and it applies in all user namespaces.
If an unprivileged user has rlimits in init_user_ns, they should not be
exceeded even if that process uses a different user namespace.

So even if you increase rlimits in the new user namespace as root (in the
new userns), that doesn't mean the rlimits in the parent user namespace
cease to apply or should somehow increase. You are still bound by the
limits of the upper user namespace.

I don't see an issue here.
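
For completeness, the setrlimit(2) rule quoted above in userspace terms
(a minimal, stand-alone illustration; nothing namespace-specific):

	#include <stdio.h>
	#include <sys/resource.h>

	int main(void)
	{
		struct rlimit rl;

		if (getrlimit(RLIMIT_SIGPENDING, &rl))
			return 1;

		/* Unprivileged: the soft limit may move anywhere in
		 * [0, hard]; raising the hard limit needs CAP_SYS_RESOURCE. */
		rl.rlim_cur = rl.rlim_max;
		if (setrlimit(RLIMIT_SIGPENDING, &rl))
			perror("setrlimit");

		printf("SIGPENDING soft=%llu hard=%llu\n",
		       (unsigned long long)rl.rlim_cur,
		       (unsigned long long)rl.rlim_max);
		return 0;
	}

And, as said above, the ucount charging in the parent user namespaces
still applies when signals are actually queued, whatever the child
namespace sets for itself.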

> > All I'm saying is that "bottleneck" with atomic counter was there before
> > and can't be removed anywhere.
> > 
> 
> Yes, it can not be removed anywhere, maybe we can make it better.

Yes, we probably can, but we need to have a reason to complicate the code.
And we're still talking about a synthetic test.

> 
> Best regards,
> Ridong
> 
> >>
> >> Best regards,
> >> Ridong
> >>
> >>>> To further investigate this issue, we initiated a series of tests by
> >>>> launching varying numbers of dockers and closely monitored the throughput
> >>>> of each individual docker. The detailed test outcomes are presented as
> >>>> follows:
> >>>>
> >>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> >>>> 	| Throughput  |380068 |353204 |308948 |306453 |180659 |129152 |
> >>>>
> >>>> The data clearly demonstrates a discernible trend: as the quantity of
> >>>> dockers increases, the throughput per container progressively declines.
> >>>> In-depth analysis has identified the root cause of this performance
> >>>> degradation. The ucouts module conducts statistics on rlimit, which
> >>>> involves a significant number of atomic operations. These atomic
> >>>> operations, when acting on the same variable, trigger a substantial number
> >>>> of cache misses or remote accesses, ultimately resulting in a drop in
> >>>> performance.
> >>>>
> >>>> Notably, even though a new user_namespace is created upon docker startup,
> >>>> the problem persists. This is because all these dockers share the same
> >>>> parent node, meaning that rlimit statistics continuously modify the same
> >>>> atomic variable.
> >>>>
> >>>> Currently, when incrementing a specific rlimit within a child user
> >>>> namespace by 1, the corresponding rlimit in the parent node must also be
> >>>> incremented by 1. Specifically, if the ucounts corresponding to a task in
> >>>> Docker B is ucount_b_1, after incrementing the rlimit of ucount_b_1 by 1,
> >>>> the rlimit of the parent node, init_ucounts, must also be incremented by 1.
> >>>> This operation should be ensured to stay within the limits set for the
> >>>> user namespaces.
> >>>>
> >>>> 	init_user_ns                             init_ucounts
> >>>> 	^                                              ^
> >>>> 	|                        |                     |
> >>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> >>>> 	|                        |                     |
> >>>> 	|<---- usr_ns_b(docker B)|usr_ns_a->ucount---->|
> >>>> 					^
> >>>> 					|
> >>>> 					|
> >>>> 					|
> >>>> 					ucount_b_1
> >>>>
> >>>> What is expected is that dockers operating within separate namespaces
> >>>> should remain isolated and not interfere with one another. Regrettably,
> >>>> the current signal system call fails to achieve this desired level of
> >>>> isolation.
> >>>>
> >>>> Proposal:
> >>>>
> >>>> To address the aforementioned issues, the concept of implementing a cache
> >>>> for each namespace's rlimit has been proposed. If a cache is added for
> >>>> each user namespace's rlimit, a certain amount of rlimits can be allocated
> >>>> to a particular namespace in one go. When resources are abundant, these
> >>>> resources do not need to be immediately returned to the parent node. Within
> >>>> a user namespace, if there are available values in the cache, there is no
> >>>> need to request additional resources from the parent node.
> >>>>
> >>>> 	init_user_ns                             init_ucounts
> >>>> 	^                                              ^
> >>>> 	|                        |                     |
> >>>> 	|<---- usr_ns_a(docker A)|usr_ns_a->ucount---->|
> >>>> 	|                        |                     |
> >>>> 	|<---- usr_ns_b(docker B)|usr_ns_b->ucount---->|
> >>>> 			^		^
> >>>> 			|		|
> >>>> 			cache_rlimit--->|
> >>>> 					|
> >>>> 					ucount_b_1
> >>>>
> >>>>
> >>>> The ultimate objective of this solution is to achieve complete isolation
> >>>> among namespaces. After applying this patch set, the final test results
> >>>> indicate that in the signal1 test case, the performance does not
> >>>> deteriorate as the number of containers increases. This effectively meets
> >>>
> >>>> the goal of linear scalability.
> >>>>
> >>>> 	| Dockers     |1      |4      |8      |16     |32     |64     |
> >>>> 	| Throughput  |381809 |382284 |380640 |383515 |381318 |380120 |
> >>>>
> >>>> Challenges:
> >>>>
> >>>> When checking the pending signals in the parent node using the command
> >>>>  cat /proc/self/status | grep SigQ, the retrieved value includes the
> >>>> cached signal counts from its child nodes. As a result, the SigQ value
> >>>> in the parent node fails to accurately and instantaneously reflect the
> >>>> actual number of pending signals.
> >>>>
> >>>> 	# cat /proc/self/status | grep SigQ
> >>>> 	SigQ:	16/6187667
> >>>>
> >>>> TODO:
> >>>>
> >>>> Add cache for the other rlimits.
> >>>>
> >>>> [1] https://github.com/antonblanchard/will-it-scale/blob/master/tests/
> >>>>
> >>>> Chen Ridong (5):
> >>>>   user_namespace: add children list node
> >>>>   usernamespace: make usernamespace rcu safe
> >>>>   user_namespace: add user_ns iteration helper
> >>>>   uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts
> >>>>   ucount: add rlimit cache for ucount
> >>>>
> >>>>  include/linux/user_namespace.h |  23 ++++-
> >>>>  kernel/signal.c                |   2 +-
> >>>>  kernel/ucount.c                | 181 +++++++++++++++++++++++++++++----
> >>>>  kernel/user.c                  |   2 +
> >>>>  kernel/user_namespace.c        |  60 ++++++++++-
> >>>>  5 files changed, 243 insertions(+), 25 deletions(-)
> >>>>
> >>>> -- 
> >>>> 2.34.1
> >>>>
> >>>
> >>
> > 
> 

-- 
Rgrds, legion


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [RFC next v2 0/5] ucount: add rlimit cache for ucount
  2025-05-21  7:29         ` Alexey Gladkov
@ 2025-05-22 22:48           ` Andrei Vagin
  0 siblings, 0 replies; 17+ messages in thread
From: Andrei Vagin @ 2025-05-22 22:48 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Chen Ridong, akpm, paulmck, bigeasy, roman.gushchin, brauner,
	tglx, frederic, peterz, oleg, joel.granados, viro,
	lorenzo.stoakes, mengensun, linux, jlayton, ruanjinjie, kees,
	linux-kernel, lujialin4, Eric Biederman

On Wed, May 21, 2025 at 12:29 AM Alexey Gladkov <legion@kernel.org> wrote:
<....>
>
> > > All I'm saying is that "bottleneck" with atomic counter was there before
> > > and can't be removed anywhere.
> > >
> >
> > Yes, it can not be removed anywhere, maybe we can make it better.
>
> Yes, we probably can, but we need to have a reason to complicate the code.
> And we're still talking about a synthetic test.

I think I have a real use case that will be negatively impacted by this
issue. This involves gVisor with the systrap platform. gVisor is an
application kernel, similar to user-mode Linux. The systrap platform
utilizes seccomp to intercept guest syscalls, meaning each guest syscall
triggers a SIGSYS signal. For some workloads, the signal handling overhead
accounts for over 50% of the total workload execution time.
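
For readers unfamiliar with systrap: the pattern is essentially a
catch-all SECCOMP_RET_TRAP filter, so every intercepted guest syscall is
paid for with a SIGSYS delivery. A minimal stand-alone illustration (not
gVisor code; a SIGSYS handler must already be installed, as gVisor does):

	#include <stddef.h>
	#include <sys/prctl.h>
	#include <linux/filter.h>
	#include <linux/seccomp.h>

	static int install_trap_all(void)
	{
		/* One instruction: every filtered syscall raises SIGSYS. */
		struct sock_filter filter[] = {
			BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		};
		struct sock_fprog prog = {
			.len = (unsigned short)(sizeof(filter) /
						sizeof(filter[0])),
			.filter = filter,
		};

		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
			return -1;
		return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
	}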

However, considering the gVisor problem, I think the solution could be
simpler. Each task could reserve one signal in advance. Then, when a signal
is triggered by an 'exception' (e.g., seccomp or page fault), the kernel
could queue a force signal without incurring a ucount charge. Even
currently, such signals are allocated with the override_rlimit flag set,
meaning they are not subject to standard resource limits.
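
A very rough sketch of that idea (hypothetical code, not a real patch;
helper names and signatures only approximate kernel/signal.c):

	/* Forced "exception" signals (seccomp SIGSYS, faults) would skip
	 * the ucount charge entirely instead of charging it and then
	 * ignoring the limit as today. */
	static struct sigqueue *sketch_sigqueue_alloc(struct task_struct *t,
						      gfp_t gfp,
						      bool override_rlimit)
	{
		struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, gfp);

		if (!q)
			return NULL;

		if (override_rlimit) {
			/* Reserved path: no UCOUNT_RLIMIT_SIGPENDING
			 * charge at all. */
			q->ucounts = NULL;
		} else {
			/* Normal path: charge the pending-signal counter
			 * (signature approximate). */
			q->ucounts = inc_rlimit_get_ucounts(task_ucounts(t),
						UCOUNT_RLIMIT_SIGPENDING);
			if (!q->ucounts) {
				kmem_cache_free(sigqueue_cachep, q);
				return NULL;
			}
		}
		return q;
	}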

Thanks,
Andrei

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-05-22 22:48 UTC | newest]

Thread overview: 17+ messages
2025-05-09  7:20 [RFC next v2 0/5] ucount: add rlimit cache for ucount Chen Ridong
2025-05-09  7:20 ` [RFC next v2 1/5] user_namespace: add children list node Chen Ridong
2025-05-09  7:20 ` [RFC next v2 2/5] usernamespace: make usernamespace rcu safe Chen Ridong
2025-05-09  7:20 ` [RFC next v2 3/5] user_namespace: add user_ns iteration helper Chen Ridong
2025-05-09  7:20 ` [RFC next v2 4/5] uounts: factor out __inc_rlimit_get_ucounts/__dec_rlimit_put_ucounts Chen Ridong
2025-05-09  7:20 ` [RFC next v2 5/5] ucount: add rlimit cache for ucount Chen Ridong
2025-05-09 20:18 ` [RFC next v2 0/5] " Andrew Morton
2025-05-12 10:48   ` Sebastian Andrzej Siewior
2025-05-13  1:48     ` Chen Ridong
2025-05-15 10:29 ` Christian Brauner
2025-05-15 12:04   ` Chen Ridong
2025-05-16 11:48 ` Alexey Gladkov
2025-05-19 13:39   ` Chen Ridong
2025-05-19 16:32     ` Alexey Gladkov
2025-05-21  1:32       ` Chen Ridong
2025-05-21  7:29         ` Alexey Gladkov
2025-05-22 22:48           ` Andrei Vagin
