linux-doc.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Alternative "pid_max" for 32-bit userspace
@ 2025-02-21 17:02 Michal Koutný
  2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
  2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
  0 siblings, 2 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC (permalink / raw)
  To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
	linux-fsdevel, linux-trace-kernel
  Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
	Michal Koutný, Oleg Nesterov

pid_max is sort of a legacy limit (its value and partially the concept
too, given the existence of pids cgroup controller).
It is tempting to make the pid_max value part of a pid namespace to
provide compat environment for 32-bit applications [1].  On the other
hand, it provides yet another mechanism for limitation of task count.
Even without namespacing of pid_max value, the configuration of
conscious limit is confusing for users [2].

This series builds upon the idea of restricting the number (amount) of
tasks by pids controller and ensuring that number (pid) never exceeds
the amount of tasks. This would not currently work out of the box
because next-fit pid allocation would continue to assign numbers (pids)
higher than the actual amount (there would be gaps in the lower range of
the interval). The patch 2/2 implements this idea by extending semantics
of ns_last_pid knob to allow first-fit numbering. (The implementation
has clumsy ifdefery, which can might be dropped since it's too
x86-centric.)

The patch 1/2 is a mere revert to simplify pid_max to one global limit
only.

(I pruned Cc: list from scripts/get_maintainer.pl for better focus, feel
free to bounce as necessary.)

[1] https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
[2] https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/

Michal Koutný (2):
  Revert "pid: allow pid_max to be set per pid namespace"
  pid: Optional first-fit pid allocation

 Documentation/admin-guide/sysctl/kernel.rst |   2 +
 include/linux/pid.h                         |   3 +
 include/linux/pid_namespace.h               |  11 +-
 kernel/pid.c                                | 137 +++-----------------
 kernel/pid_namespace.c                      |  71 +++++-----
 kernel/sysctl.c                             |   9 ++
 kernel/trace/pid_list.c                     |   2 +-
 kernel/trace/trace.h                        |   2 +
 kernel/trace/trace_sched_switch.c           |   2 +-
 9 files changed, 70 insertions(+), 169 deletions(-)


base-commit: 334426094588f8179fe175a09ecc887ff0c75758
-- 
2.48.1


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
  2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
@ 2025-02-21 17:02 ` Michal Koutný
  2025-02-25 17:36   ` Alexander Mikhalitsyn
  2025-03-10  7:32   ` kernel test robot
  2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
  1 sibling, 2 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC (permalink / raw)
  To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
	linux-fsdevel, linux-trace-kernel
  Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
	Michal Koutný, Oleg Nesterov

This reverts commit 7863dcc72d0f4b13a641065670426435448b3d80.

It is already difficult for users to troubleshoot which of multiple pid
limits restricts their workload. I'm afraid making pid_max
per-(hierarchical-)NS will contribute to confusion.
Also, the implementation copies the limit upon creation from
parent, this pattern showed cumbersome with some attributes in legacy
cgroup controllers -- it's subject to race condition between parent's
limit modification and children creation and once copied it must be
changed in the descendant.

This is very similar to what pids.max of a cgroup (already) does that
can be used as an alternative.

Link: https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 include/linux/pid.h               |   3 +
 include/linux/pid_namespace.h     |  10 +--
 kernel/pid.c                      | 125 ++----------------------------
 kernel/pid_namespace.c            |  43 +++-------
 kernel/sysctl.c                   |   9 +++
 kernel/trace/pid_list.c           |   2 +-
 kernel/trace/trace.h              |   2 +
 kernel/trace/trace_sched_switch.c |   2 +-
 8 files changed, 35 insertions(+), 161 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 98837a1ff0f33..fe575fcdb4afa 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -108,6 +108,9 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old);
 extern void transfer_pid(struct task_struct *old, struct task_struct *new,
 			 enum pid_type);
 
+extern int pid_max;
+extern int pid_max_min, pid_max_max;
+
 /*
  * look up a PID in the hash table. Must be called with the tasklist_lock
  * or rcu_read_lock() held.
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7c67a58111998..f9f9931e02d6a 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -30,7 +30,6 @@ struct pid_namespace {
 	struct task_struct *child_reaper;
 	struct kmem_cache *pid_cachep;
 	unsigned int level;
-	int pid_max;
 	struct pid_namespace *parent;
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct fs_pin *bacct;
@@ -39,14 +38,9 @@ struct pid_namespace {
 	struct ucounts *ucounts;
 	int reboot;	/* group exit code if this pidns was rebooted */
 	struct ns_common ns;
-	struct work_struct	work;
-#ifdef CONFIG_SYSCTL
-	struct ctl_table_set	set;
-	struct ctl_table_header *sysctls;
-#if defined(CONFIG_MEMFD_CREATE)
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 	int memfd_noexec_scope;
 #endif
-#endif
 } __randomize_layout;
 
 extern struct pid_namespace init_pid_ns;
@@ -123,8 +117,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
 void pidhash_init(void);
 void pid_idr_init(void);
-int register_pidns_sysctls(struct pid_namespace *pidns);
-void unregister_pidns_sysctls(struct pid_namespace *pidns);
 
 static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
 {
diff --git a/kernel/pid.c b/kernel/pid.c
index 924084713be8b..aa2a7d4da4555 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -61,8 +61,10 @@ struct pid init_struct_pid = {
 	}, }
 };
 
-static int pid_max_min = RESERVED_PIDS + 1;
-static int pid_max_max = PID_MAX_LIMIT;
+int pid_max = PID_MAX_DEFAULT;
+
+int pid_max_min = RESERVED_PIDS + 1;
+int pid_max_max = PID_MAX_LIMIT;
 
 /*
  * PID-map pages start out as NULL, they get allocated upon
@@ -81,7 +83,6 @@ struct pid_namespace init_pid_ns = {
 #ifdef CONFIG_PID_NS
 	.ns.ops = &pidns_operations,
 #endif
-	.pid_max = PID_MAX_DEFAULT,
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 	.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
 #endif
@@ -190,7 +191,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	for (i = ns->level; i >= 0; i--) {
 		int tid = 0;
-		int pid_max = READ_ONCE(tmp->pid_max);
 
 		if (set_tid_size) {
 			tid = set_tid[ns->level - i];
@@ -644,118 +644,17 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
 	return fd;
 }
 
-#ifdef CONFIG_SYSCTL
-static struct ctl_table_set *pid_table_root_lookup(struct ctl_table_root *root)
-{
-	return &task_active_pid_ns(current)->set;
-}
-
-static int set_is_seen(struct ctl_table_set *set)
-{
-	return &task_active_pid_ns(current)->set == set;
-}
-
-static int pid_table_root_permissions(struct ctl_table_header *head,
-				      const struct ctl_table *table)
-{
-	struct pid_namespace *pidns =
-		container_of(head->set, struct pid_namespace, set);
-	int mode = table->mode;
-
-	if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) ||
-	    uid_eq(current_euid(), make_kuid(pidns->user_ns, 0)))
-		mode = (mode & S_IRWXU) >> 6;
-	else if (in_egroup_p(make_kgid(pidns->user_ns, 0)))
-		mode = (mode & S_IRWXG) >> 3;
-	else
-		mode = mode & S_IROTH;
-	return (mode << 6) | (mode << 3) | mode;
-}
-
-static void pid_table_root_set_ownership(struct ctl_table_header *head,
-					 kuid_t *uid, kgid_t *gid)
-{
-	struct pid_namespace *pidns =
-		container_of(head->set, struct pid_namespace, set);
-	kuid_t ns_root_uid;
-	kgid_t ns_root_gid;
-
-	ns_root_uid = make_kuid(pidns->user_ns, 0);
-	if (uid_valid(ns_root_uid))
-		*uid = ns_root_uid;
-
-	ns_root_gid = make_kgid(pidns->user_ns, 0);
-	if (gid_valid(ns_root_gid))
-		*gid = ns_root_gid;
-}
-
-static struct ctl_table_root pid_table_root = {
-	.lookup		= pid_table_root_lookup,
-	.permissions	= pid_table_root_permissions,
-	.set_ownership	= pid_table_root_set_ownership,
-};
-
-static const struct ctl_table pid_table[] = {
-	{
-		.procname	= "pid_max",
-		.data		= &init_pid_ns.pid_max,
-		.maxlen		= sizeof(int),
-		.mode		= 0644,
-		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &pid_max_min,
-		.extra2		= &pid_max_max,
-	},
-};
-#endif
-
-int register_pidns_sysctls(struct pid_namespace *pidns)
-{
-#ifdef CONFIG_SYSCTL
-	struct ctl_table *tbl;
-
-	setup_sysctl_set(&pidns->set, &pid_table_root, set_is_seen);
-
-	tbl = kmemdup(pid_table, sizeof(pid_table), GFP_KERNEL);
-	if (!tbl)
-		return -ENOMEM;
-	tbl->data = &pidns->pid_max;
-	pidns->pid_max = min(pid_max_max, max_t(int, pidns->pid_max,
-			     PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
-
-	pidns->sysctls = __register_sysctl_table(&pidns->set, "kernel", tbl,
-						 ARRAY_SIZE(pid_table));
-	if (!pidns->sysctls) {
-		kfree(tbl);
-		retire_sysctl_set(&pidns->set);
-		return -ENOMEM;
-	}
-#endif
-	return 0;
-}
-
-void unregister_pidns_sysctls(struct pid_namespace *pidns)
-{
-#ifdef CONFIG_SYSCTL
-	const struct ctl_table *tbl;
-
-	tbl = pidns->sysctls->ctl_table_arg;
-	unregister_sysctl_table(pidns->sysctls);
-	retire_sysctl_set(&pidns->set);
-	kfree(tbl);
-#endif
-}
-
 void __init pid_idr_init(void)
 {
 	/* Verify no one has done anything silly: */
 	BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
 
 	/* bump default and minimum pid_max based on number of cpus */
-	init_pid_ns.pid_max = min(pid_max_max, max_t(int, init_pid_ns.pid_max,
-				  PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
+	pid_max = min(pid_max_max, max_t(int, pid_max,
+				PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
 	pid_max_min = max_t(int, pid_max_min,
 				PIDS_PER_CPU_MIN * num_possible_cpus());
-	pr_info("pid_max: default: %u minimum: %u\n", init_pid_ns.pid_max, pid_max_min);
+	pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
 
 	idr_init(&init_pid_ns.idr);
 
@@ -766,16 +665,6 @@ void __init pid_idr_init(void)
 			NULL);
 }
 
-static __init int pid_namespace_sysctl_init(void)
-{
-#ifdef CONFIG_SYSCTL
-	/* "kernel" directory will have already been initialized. */
-	BUG_ON(register_pidns_sysctls(&init_pid_ns));
-#endif
-	return 0;
-}
-subsys_initcall(pid_namespace_sysctl_init);
-
 static struct file *__pidfd_fget(struct task_struct *task, int fd)
 {
 	struct file *file;
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 8f6cfec87555a..0f23285be4f92 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -70,8 +70,6 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
 	dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
 }
 
-static void destroy_pid_namespace_work(struct work_struct *work);
-
 static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
 	struct pid_namespace *parent_pid_ns)
 {
@@ -107,27 +105,17 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 		goto out_free_idr;
 	ns->ns.ops = &pidns_operations;
 
-	ns->pid_max = parent_pid_ns->pid_max;
-	err = register_pidns_sysctls(ns);
-	if (err)
-		goto out_free_inum;
-
 	refcount_set(&ns->ns.count, 1);
 	ns->level = level;
 	ns->parent = get_pid_ns(parent_pid_ns);
 	ns->user_ns = get_user_ns(user_ns);
 	ns->ucounts = ucounts;
 	ns->pid_allocated = PIDNS_ADDING;
-	INIT_WORK(&ns->work, destroy_pid_namespace_work);
-
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
 #endif
-
 	return ns;
 
-out_free_inum:
-	ns_free_inum(&ns->ns);
 out_free_idr:
 	idr_destroy(&ns->idr);
 	kmem_cache_free(pid_ns_cachep, ns);
@@ -149,28 +137,12 @@ static void delayed_free_pidns(struct rcu_head *p)
 
 static void destroy_pid_namespace(struct pid_namespace *ns)
 {
-	unregister_pidns_sysctls(ns);
-
 	ns_free_inum(&ns->ns);
 
 	idr_destroy(&ns->idr);
 	call_rcu(&ns->rcu, delayed_free_pidns);
 }
 
-static void destroy_pid_namespace_work(struct work_struct *work)
-{
-	struct pid_namespace *ns =
-		container_of(work, struct pid_namespace, work);
-
-	do {
-		struct pid_namespace *parent;
-
-		parent = ns->parent;
-		destroy_pid_namespace(ns);
-		ns = parent;
-	} while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count));
-}
-
 struct pid_namespace *copy_pid_ns(unsigned long flags,
 	struct user_namespace *user_ns, struct pid_namespace *old_ns)
 {
@@ -183,8 +155,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
 
 void put_pid_ns(struct pid_namespace *ns)
 {
-	if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count))
-		schedule_work(&ns->work);
+	struct pid_namespace *parent;
+
+	while (ns != &init_pid_ns) {
+		parent = ns->parent;
+		if (!refcount_dec_and_test(&ns->ns.count))
+			break;
+		destroy_pid_namespace(ns);
+		ns = parent;
+	}
 }
 EXPORT_SYMBOL_GPL(put_pid_ns);
 
@@ -295,7 +274,6 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
 	next = idr_get_cursor(&pid_ns->idr) - 1;
 
 	tmp.data = &next;
-	tmp.extra2 = &pid_ns->pid_max;
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
 	if (!ret && write)
 		idr_set_cursor(&pid_ns->idr, next + 1);
@@ -303,6 +281,7 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
 	return ret;
 }
 
+extern int pid_max;
 static const struct ctl_table pid_ns_ctl_table[] = {
 	{
 		.procname = "ns_last_pid",
@@ -310,7 +289,7 @@ static const struct ctl_table pid_ns_ctl_table[] = {
 		.mode = 0666, /* permissions are checked in the handler */
 		.proc_handler = pid_ns_ctl_handler,
 		.extra1 = SYSCTL_ZERO,
-		.extra2 = &init_pid_ns.pid_max,
+		.extra2 = &pid_max,
 	},
 };
 #endif	/* CONFIG_CHECKPOINT_RESTORE */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cb57da499ebb1..bb739608680f2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1803,6 +1803,15 @@ static const struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+	{
+		.procname	= "pid_max",
+		.data		= &pid_max,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &pid_max_min,
+		.extra2		= &pid_max_max,
+	},
 	{
 		.procname	= "panic_on_oops",
 		.data		= &panic_on_oops,
diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
index c62b9b3cfb3d8..4966e6bbdf6f3 100644
--- a/kernel/trace/pid_list.c
+++ b/kernel/trace/pid_list.c
@@ -414,7 +414,7 @@ struct trace_pid_list *trace_pid_list_alloc(void)
 	int i;
 
 	/* According to linux/thread.h, pids can be no bigger that 30 bits */
-	WARN_ON_ONCE(init_pid_ns.pid_max > (1 << 30));
+	WARN_ON_ONCE(pid_max > (1 << 30));
 
 	pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
 	if (!pid_list)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9c21ba45b7af6..46c65402ad7e5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -732,6 +732,8 @@ extern unsigned long tracing_thresh;
 
 /* PID filtering */
 
+extern int pid_max;
+
 bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
 			     pid_t search_pid);
 bool trace_ignore_this_task(struct trace_pid_list *filtered_pids,
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index cb49f7279dc80..573b5d8e8a28e 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -442,7 +442,7 @@ int trace_alloc_tgid_map(void)
 	if (tgid_map)
 		return 0;
 
-	tgid_map_max = init_pid_ns.pid_max;
+	tgid_map_max = pid_max;
 	map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
 		       GFP_KERNEL);
 	if (!map)
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
  2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
@ 2025-02-21 17:02 ` Michal Koutný
  2025-02-22  0:18   ` Andrew Morton
                     ` (2 more replies)
  1 sibling, 3 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC (permalink / raw)
  To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
	linux-fsdevel, linux-trace-kernel
  Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
	Michal Koutný, Oleg Nesterov

Noone would need to use this allocation strategy (it's slower, pid
numbers collide sooner). Its primary purpose are pid namespaces in
conjunction with pids.max cgroup limit which keeps (virtual) pid numbers
below the given limit. This is for 32-bit userspace programs that may
not work well with pid numbers above 65536.

Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  2 ++
 include/linux/pid_namespace.h               |  3 +++
 kernel/pid.c                                | 12 +++++++--
 kernel/pid_namespace.c                      | 28 +++++++++++++++------
 4 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b6464..f5e68d1c8849f 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
 lives in) pid namespace. When selecting a pid for a next task on fork
 kernel tries to allocate a number starting from this one.
 
+When set to -1, first-fit pid numbering is used instead of the next-fit.
+
 
 powersave-nap (PPC only)
 ========================
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index f9f9931e02d6a..10bf66ca78590 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -41,6 +41,9 @@ struct pid_namespace {
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 	int memfd_noexec_scope;
 #endif
+#ifdef CONFIG_IA32_EMULATION
+	bool pid_noncyclic;
+#endif
 } __randomize_layout;
 
 extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index aa2a7d4da4555..e9da1662b8821 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 
 	for (i = ns->level; i >= 0; i--) {
 		int tid = 0;
+		bool pid_noncyclic = 0;
+#ifdef CONFIG_IA32_EMULATION
+		pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
+#endif
 
 		if (set_tid_size) {
 			tid = set_tid[ns->level - i];
@@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 			 * Store a null pointer so find_pid_ns does not find
 			 * a partially initialized PID (see below).
 			 */
-			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
-					      pid_max, GFP_ATOMIC);
+			if (likely(!pid_noncyclic))
+				nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+						      pid_max, GFP_ATOMIC);
+			else
+				nr = idr_alloc(&tmp->idr, NULL, pid_min,
+						      pid_max, GFP_ATOMIC);
 		}
 		spin_unlock_irq(&pidmap_lock);
 		idr_preload_end();
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0f23285be4f92..ceda94a064294 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
 	ns->pid_allocated = PIDNS_ADDING;
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
 	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
+#ifdef CONFIG_IA32_EMULATION
+	ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
 #endif
 	return ns;
 
@@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
 	return;
 }
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
 static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
 {
@@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
 	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
 		return -EPERM;
 
-	next = idr_get_cursor(&pid_ns->idr) - 1;
+	next = -1;
+#ifdef CONFIG_IA32_EMULATION
+	if (!pid_ns->pid_noncyclic)
+#endif
+		next += idr_get_cursor(&pid_ns->idr);
 
 	tmp.data = &next;
 	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
-	if (!ret && write)
-		idr_set_cursor(&pid_ns->idr, next + 1);
+	if (!ret && write) {
+		if (next > -1)
+			idr_set_cursor(&pid_ns->idr, next + 1);
+		else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
+			ret = -EINVAL;
+#ifdef CONFIG_IA32_EMULATION
+		WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
+#endif
+	}
 
 	return ret;
 }
@@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
 		.maxlen = sizeof(int),
 		.mode = 0666, /* permissions are checked in the handler */
 		.proc_handler = pid_ns_ctl_handler,
-		.extra1 = SYSCTL_ZERO,
+		.extra1 = SYSCTL_NEG_ONE,
 		.extra2 = &pid_max,
 	},
 };
-#endif	/* CONFIG_CHECKPOINT_RESTORE */
+#endif	/* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
 
 int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
 {
@@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
 {
 	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
 
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
 	register_sysctl_init("kernel", pid_ns_ctl_table);
 #endif
 
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
@ 2025-02-22  0:18   ` Andrew Morton
  2025-02-22  9:02     ` David Laight
  2025-03-05 15:04     ` Michal Koutný
  2025-02-25 17:30   ` Alexander Mikhalitsyn
  2025-03-06  8:59   ` Christian Brauner
  2 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2025-02-22  0:18 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
	linux-fsdevel, linux-trace-kernel, Jonathan Corbet, Kees Cook,
	Eric W . Biederman, Oleg Nesterov

On Fri, 21 Feb 2025 18:02:49 +0100 Michal Koutný <mkoutny@suse.com> wrote:

> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
>  lives in) pid namespace. When selecting a pid for a next task on fork
>  kernel tries to allocate a number starting from this one.
>  
> +When set to -1, first-fit pid numbering is used instead of the next-fit.
> +

This seems thin.  Is there more we can tell our users?  What are the
visible effects of this?  What are the benefits?  Why would they want
to turn it on?

I mean, there are veritable paragraphs in the changelogs, but just a
single line in the user-facing docs.  Seems there could be more...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-22  0:18   ` Andrew Morton
@ 2025-02-22  9:02     ` David Laight
  2025-03-05 15:01       ` Michal Koutný
  2025-03-05 15:04     ` Michal Koutný
  1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2025-02-22  9:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Koutný, Christian Brauner, Alexander Mikhalitsyn,
	linux-doc, linux-kernel, linux-fsdevel, linux-trace-kernel,
	Jonathan Corbet, Kees Cook, Eric W . Biederman, Oleg Nesterov

On Fri, 21 Feb 2025 16:18:54 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 21 Feb 2025 18:02:49 +0100 Michal Koutný <mkoutny@suse.com> wrote:
> 
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
> >  lives in) pid namespace. When selecting a pid for a next task on fork
> >  kernel tries to allocate a number starting from this one.
> >  
> > +When set to -1, first-fit pid numbering is used instead of the next-fit.
> > +  
> 
> This seems thin.  Is there more we can tell our users?  What are the
> visible effects of this?  What are the benefits?  Why would they want
> to turn it on?
> 
> I mean, there are veritable paragraphs in the changelogs, but just a
> single line in the user-facing docs.  Seems there could be more...

It also seems a good way of being able to predict the next pid and
doing all the 'nasty' things that allows because there is no guard
time on pid reuse.

Both first-fit and next-fit have the same issue.
Picking a random pid is better.

Or pick the pid after finding an empty slot in the 'hash' table.
Then you guarantee O(1) lookup and can easily stop pids being reused
quickly.

	David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
  2025-02-22  0:18   ` Andrew Morton
@ 2025-02-25 17:30   ` Alexander Mikhalitsyn
  2025-03-06  8:59   ` Christian Brauner
  2 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2025-02-25 17:30 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, linux-doc, linux-kernel, linux-fsdevel,
	linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
	Eric W . Biederman, Oleg Nesterov

Am Fr., 21. Feb. 2025 um 18:02 Uhr schrieb Michal Koutný <mkoutny@suse.com>:
>
> Noone would need to use this allocation strategy (it's slower, pid
> numbers collide sooner). Its primary purpose are pid namespaces in
> conjunction with pids.max cgroup limit which keeps (virtual) pid numbers
> below the given limit. This is for 32-bit userspace programs that may
> not work well with pid numbers above 65536.
>
> Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>

Dear Michal,

sorry for such a long delay with reply on your patches.

> ---
>  Documentation/admin-guide/sysctl/kernel.rst |  2 ++
>  include/linux/pid_namespace.h               |  3 +++
>  kernel/pid.c                                | 12 +++++++--
>  kernel/pid_namespace.c                      | 28 +++++++++++++++------
>  4 files changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index a43b78b4b6464..f5e68d1c8849f 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
>  lives in) pid namespace. When selecting a pid for a next task on fork
>  kernel tries to allocate a number starting from this one.
>
> +When set to -1, first-fit pid numbering is used instead of the next-fit.
> +
>
>  powersave-nap (PPC only)
>  ========================
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index f9f9931e02d6a..10bf66ca78590 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -41,6 +41,9 @@ struct pid_namespace {
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>         int memfd_noexec_scope;
>  #endif
> +#ifdef CONFIG_IA32_EMULATION

Unfortunately, this does not work for our use case as it's x86-specific.

In the original cover letter [1] it was written:

>In any case, there are workloads that have expections about how large
>pid numbers they accept. Either for historical reasons or architectural
>reasons. One concreate example is the 32-bit version of Android's bionic
>libc which requires pid numbers less than 65536. There are workloads
>where it is run in a 32-bit container on a 64-bit kernel. If the host

And I have just confirmed with folks from Canonical, who work on Anbox
(Android in container project),
that they use Arm machines (both armhf/arm64). And one of the reasons
to add this feature is to
make legacy 32-bit Android Bionic libc to work [2].

[1] https://lore.kernel.org/all/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
[2] https://android.googlesource.com/platform/bionic.git/+/HEAD/docs/32-bit-abi.md#is-too-small-for-large-pids

Kind regards,
Alex

> +       bool pid_noncyclic;
> +#endif
>  } __randomize_layout;
>
>  extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index aa2a7d4da4555..e9da1662b8821 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
>         for (i = ns->level; i >= 0; i--) {
>                 int tid = 0;
> +               bool pid_noncyclic = 0;
> +#ifdef CONFIG_IA32_EMULATION
> +               pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
> +#endif
>
>                 if (set_tid_size) {
>                         tid = set_tid[ns->level - i];
> @@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>                          * Store a null pointer so find_pid_ns does not find
>                          * a partially initialized PID (see below).
>                          */
> -                       nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> -                                             pid_max, GFP_ATOMIC);
> +                       if (likely(!pid_noncyclic))
> +                               nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> +                                                     pid_max, GFP_ATOMIC);
> +                       else
> +                               nr = idr_alloc(&tmp->idr, NULL, pid_min,
> +                                                     pid_max, GFP_ATOMIC);
>                 }
>                 spin_unlock_irq(&pidmap_lock);
>                 idr_preload_end();
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0f23285be4f92..ceda94a064294 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>         ns->pid_allocated = PIDNS_ADDING;
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>         ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
> +#endif
> +#ifdef CONFIG_IA32_EMULATION
> +       ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
>  #endif
>         return ns;
>
> @@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>         return;
>  }
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
>  static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>                 void *buffer, size_t *lenp, loff_t *ppos)
>  {
> @@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>         if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
>                 return -EPERM;
>
> -       next = idr_get_cursor(&pid_ns->idr) - 1;
> +       next = -1;
> +#ifdef CONFIG_IA32_EMULATION
> +       if (!pid_ns->pid_noncyclic)
> +#endif
> +               next += idr_get_cursor(&pid_ns->idr);
>
>         tmp.data = &next;
>         ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> -       if (!ret && write)
> -               idr_set_cursor(&pid_ns->idr, next + 1);
> +       if (!ret && write) {
> +               if (next > -1)
> +                       idr_set_cursor(&pid_ns->idr, next + 1);
> +               else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
> +                       ret = -EINVAL;
> +#ifdef CONFIG_IA32_EMULATION
> +               WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
> +#endif
> +       }
>
>         return ret;
>  }
> @@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
>                 .maxlen = sizeof(int),
>                 .mode = 0666, /* permissions are checked in the handler */
>                 .proc_handler = pid_ns_ctl_handler,
> -               .extra1 = SYSCTL_ZERO,
> +               .extra1 = SYSCTL_NEG_ONE,
>                 .extra2 = &pid_max,
>         },
>  };
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
>
>  int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  {
> @@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
>  {
>         pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
>         register_sysctl_init("kernel", pid_ns_ctl_table);
>  #endif
>
> --
> 2.48.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
  2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
@ 2025-02-25 17:36   ` Alexander Mikhalitsyn
  2025-03-10  7:32   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2025-02-25 17:36 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Christian Brauner, linux-doc, linux-kernel, linux-fsdevel,
	linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
	Eric W . Biederman, Oleg Nesterov

Am Fr., 21. Feb. 2025 um 18:02 Uhr schrieb Michal Koutný <mkoutny@suse.com>:
>
> This reverts commit 7863dcc72d0f4b13a641065670426435448b3d80.

If we revert this one, then we should also revert a corresponding kselftest:
https://github.com/torvalds/linux/commit/615ab43b838bb982dc234feff75ee9ad35447c5d

>
> It is already difficult for users to troubleshoot which of multiple pid
> limits restricts their workload. I'm afraid making pid_max
> per-(hierarchical-)NS will contribute to confusion.
> Also, the implementation copies the limit upon creation from
> parent, this pattern showed cumbersome with some attributes in legacy
> cgroup controllers -- it's subject to race condition between parent's
> limit modification and children creation and once copied it must be
> changed in the descendant.
>
> This is very similar to what pids.max of a cgroup (already) does that
> can be used as an alternative.
>
> Link: https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>  include/linux/pid.h               |   3 +
>  include/linux/pid_namespace.h     |  10 +--
>  kernel/pid.c                      | 125 ++----------------------------
>  kernel/pid_namespace.c            |  43 +++-------
>  kernel/sysctl.c                   |   9 +++
>  kernel/trace/pid_list.c           |   2 +-
>  kernel/trace/trace.h              |   2 +
>  kernel/trace/trace_sched_switch.c |   2 +-
>  8 files changed, 35 insertions(+), 161 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 98837a1ff0f33..fe575fcdb4afa 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -108,6 +108,9 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old);
>  extern void transfer_pid(struct task_struct *old, struct task_struct *new,
>                          enum pid_type);
>
> +extern int pid_max;
> +extern int pid_max_min, pid_max_max;
> +
>  /*
>   * look up a PID in the hash table. Must be called with the tasklist_lock
>   * or rcu_read_lock() held.
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 7c67a58111998..f9f9931e02d6a 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -30,7 +30,6 @@ struct pid_namespace {
>         struct task_struct *child_reaper;
>         struct kmem_cache *pid_cachep;
>         unsigned int level;
> -       int pid_max;
>         struct pid_namespace *parent;
>  #ifdef CONFIG_BSD_PROCESS_ACCT
>         struct fs_pin *bacct;
> @@ -39,14 +38,9 @@ struct pid_namespace {
>         struct ucounts *ucounts;
>         int reboot;     /* group exit code if this pidns was rebooted */
>         struct ns_common ns;
> -       struct work_struct      work;
> -#ifdef CONFIG_SYSCTL
> -       struct ctl_table_set    set;
> -       struct ctl_table_header *sysctls;
> -#if defined(CONFIG_MEMFD_CREATE)
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>         int memfd_noexec_scope;
>  #endif
> -#endif
>  } __randomize_layout;
>
>  extern struct pid_namespace init_pid_ns;
> @@ -123,8 +117,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
>  void pidhash_init(void);
>  void pid_idr_init(void);
> -int register_pidns_sysctls(struct pid_namespace *pidns);
> -void unregister_pidns_sysctls(struct pid_namespace *pidns);
>
>  static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
>  {
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 924084713be8b..aa2a7d4da4555 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -61,8 +61,10 @@ struct pid init_struct_pid = {
>         }, }
>  };
>
> -static int pid_max_min = RESERVED_PIDS + 1;
> -static int pid_max_max = PID_MAX_LIMIT;
> +int pid_max = PID_MAX_DEFAULT;
> +
> +int pid_max_min = RESERVED_PIDS + 1;
> +int pid_max_max = PID_MAX_LIMIT;
>
>  /*
>   * PID-map pages start out as NULL, they get allocated upon
> @@ -81,7 +83,6 @@ struct pid_namespace init_pid_ns = {
>  #ifdef CONFIG_PID_NS
>         .ns.ops = &pidns_operations,
>  #endif
> -       .pid_max = PID_MAX_DEFAULT,
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>         .memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
>  #endif
> @@ -190,7 +191,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
>         for (i = ns->level; i >= 0; i--) {
>                 int tid = 0;
> -               int pid_max = READ_ONCE(tmp->pid_max);
>
>                 if (set_tid_size) {
>                         tid = set_tid[ns->level - i];
> @@ -644,118 +644,17 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
>         return fd;
>  }
>
> -#ifdef CONFIG_SYSCTL
> -static struct ctl_table_set *pid_table_root_lookup(struct ctl_table_root *root)
> -{
> -       return &task_active_pid_ns(current)->set;
> -}
> -
> -static int set_is_seen(struct ctl_table_set *set)
> -{
> -       return &task_active_pid_ns(current)->set == set;
> -}
> -
> -static int pid_table_root_permissions(struct ctl_table_header *head,
> -                                     const struct ctl_table *table)
> -{
> -       struct pid_namespace *pidns =
> -               container_of(head->set, struct pid_namespace, set);
> -       int mode = table->mode;
> -
> -       if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) ||
> -           uid_eq(current_euid(), make_kuid(pidns->user_ns, 0)))
> -               mode = (mode & S_IRWXU) >> 6;
> -       else if (in_egroup_p(make_kgid(pidns->user_ns, 0)))
> -               mode = (mode & S_IRWXG) >> 3;
> -       else
> -               mode = mode & S_IROTH;
> -       return (mode << 6) | (mode << 3) | mode;
> -}
> -
> -static void pid_table_root_set_ownership(struct ctl_table_header *head,
> -                                        kuid_t *uid, kgid_t *gid)
> -{
> -       struct pid_namespace *pidns =
> -               container_of(head->set, struct pid_namespace, set);
> -       kuid_t ns_root_uid;
> -       kgid_t ns_root_gid;
> -
> -       ns_root_uid = make_kuid(pidns->user_ns, 0);
> -       if (uid_valid(ns_root_uid))
> -               *uid = ns_root_uid;
> -
> -       ns_root_gid = make_kgid(pidns->user_ns, 0);
> -       if (gid_valid(ns_root_gid))
> -               *gid = ns_root_gid;
> -}
> -
> -static struct ctl_table_root pid_table_root = {
> -       .lookup         = pid_table_root_lookup,
> -       .permissions    = pid_table_root_permissions,
> -       .set_ownership  = pid_table_root_set_ownership,
> -};
> -
> -static const struct ctl_table pid_table[] = {
> -       {
> -               .procname       = "pid_max",
> -               .data           = &init_pid_ns.pid_max,
> -               .maxlen         = sizeof(int),
> -               .mode           = 0644,
> -               .proc_handler   = proc_dointvec_minmax,
> -               .extra1         = &pid_max_min,
> -               .extra2         = &pid_max_max,
> -       },
> -};
> -#endif
> -
> -int register_pidns_sysctls(struct pid_namespace *pidns)
> -{
> -#ifdef CONFIG_SYSCTL
> -       struct ctl_table *tbl;
> -
> -       setup_sysctl_set(&pidns->set, &pid_table_root, set_is_seen);
> -
> -       tbl = kmemdup(pid_table, sizeof(pid_table), GFP_KERNEL);
> -       if (!tbl)
> -               return -ENOMEM;
> -       tbl->data = &pidns->pid_max;
> -       pidns->pid_max = min(pid_max_max, max_t(int, pidns->pid_max,
> -                            PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
> -
> -       pidns->sysctls = __register_sysctl_table(&pidns->set, "kernel", tbl,
> -                                                ARRAY_SIZE(pid_table));
> -       if (!pidns->sysctls) {
> -               kfree(tbl);
> -               retire_sysctl_set(&pidns->set);
> -               return -ENOMEM;
> -       }
> -#endif
> -       return 0;
> -}
> -
> -void unregister_pidns_sysctls(struct pid_namespace *pidns)
> -{
> -#ifdef CONFIG_SYSCTL
> -       const struct ctl_table *tbl;
> -
> -       tbl = pidns->sysctls->ctl_table_arg;
> -       unregister_sysctl_table(pidns->sysctls);
> -       retire_sysctl_set(&pidns->set);
> -       kfree(tbl);
> -#endif
> -}
> -
>  void __init pid_idr_init(void)
>  {
>         /* Verify no one has done anything silly: */
>         BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
>
>         /* bump default and minimum pid_max based on number of cpus */
> -       init_pid_ns.pid_max = min(pid_max_max, max_t(int, init_pid_ns.pid_max,
> -                                 PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
> +       pid_max = min(pid_max_max, max_t(int, pid_max,
> +                               PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
>         pid_max_min = max_t(int, pid_max_min,
>                                 PIDS_PER_CPU_MIN * num_possible_cpus());
> -       pr_info("pid_max: default: %u minimum: %u\n", init_pid_ns.pid_max, pid_max_min);
> +       pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>
>         idr_init(&init_pid_ns.idr);
>
> @@ -766,16 +665,6 @@ void __init pid_idr_init(void)
>                         NULL);
>  }
>
> -static __init int pid_namespace_sysctl_init(void)
> -{
> -#ifdef CONFIG_SYSCTL
> -       /* "kernel" directory will have already been initialized. */
> -       BUG_ON(register_pidns_sysctls(&init_pid_ns));
> -#endif
> -       return 0;
> -}
> -subsys_initcall(pid_namespace_sysctl_init);
> -
>  static struct file *__pidfd_fget(struct task_struct *task, int fd)
>  {
>         struct file *file;
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 8f6cfec87555a..0f23285be4f92 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -70,8 +70,6 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
>         dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
>  }
>
> -static void destroy_pid_namespace_work(struct work_struct *work);
> -
>  static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
>         struct pid_namespace *parent_pid_ns)
>  {
> @@ -107,27 +105,17 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>                 goto out_free_idr;
>         ns->ns.ops = &pidns_operations;
>
> -       ns->pid_max = parent_pid_ns->pid_max;
> -       err = register_pidns_sysctls(ns);
> -       if (err)
> -               goto out_free_inum;
> -
>         refcount_set(&ns->ns.count, 1);
>         ns->level = level;
>         ns->parent = get_pid_ns(parent_pid_ns);
>         ns->user_ns = get_user_ns(user_ns);
>         ns->ucounts = ucounts;
>         ns->pid_allocated = PIDNS_ADDING;
> -       INIT_WORK(&ns->work, destroy_pid_namespace_work);
> -
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>         ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
>  #endif
> -
>         return ns;
>
> -out_free_inum:
> -       ns_free_inum(&ns->ns);
>  out_free_idr:
>         idr_destroy(&ns->idr);
>         kmem_cache_free(pid_ns_cachep, ns);
> @@ -149,28 +137,12 @@ static void delayed_free_pidns(struct rcu_head *p)
>
>  static void destroy_pid_namespace(struct pid_namespace *ns)
>  {
> -       unregister_pidns_sysctls(ns);
> -
>         ns_free_inum(&ns->ns);
>
>         idr_destroy(&ns->idr);
>         call_rcu(&ns->rcu, delayed_free_pidns);
>  }
>
> -static void destroy_pid_namespace_work(struct work_struct *work)
> -{
> -       struct pid_namespace *ns =
> -               container_of(work, struct pid_namespace, work);
> -
> -       do {
> -               struct pid_namespace *parent;
> -
> -               parent = ns->parent;
> -               destroy_pid_namespace(ns);
> -               ns = parent;
> -       } while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count));
> -}
> -
>  struct pid_namespace *copy_pid_ns(unsigned long flags,
>         struct user_namespace *user_ns, struct pid_namespace *old_ns)
>  {
> @@ -183,8 +155,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
>
>  void put_pid_ns(struct pid_namespace *ns)
>  {
> -       if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count))
> -               schedule_work(&ns->work);
> +       struct pid_namespace *parent;
> +
> +       while (ns != &init_pid_ns) {
> +               parent = ns->parent;
> +               if (!refcount_dec_and_test(&ns->ns.count))
> +                       break;
> +               destroy_pid_namespace(ns);
> +               ns = parent;
> +       }
>  }
>  EXPORT_SYMBOL_GPL(put_pid_ns);
>
> @@ -295,7 +274,6 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>         next = idr_get_cursor(&pid_ns->idr) - 1;
>
>         tmp.data = &next;
> -       tmp.extra2 = &pid_ns->pid_max;
>         ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
>         if (!ret && write)
>                 idr_set_cursor(&pid_ns->idr, next + 1);
> @@ -303,6 +281,7 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>         return ret;
>  }
>
> +extern int pid_max;
>  static const struct ctl_table pid_ns_ctl_table[] = {
>         {
>                 .procname = "ns_last_pid",
> @@ -310,7 +289,7 @@ static const struct ctl_table pid_ns_ctl_table[] = {
>                 .mode = 0666, /* permissions are checked in the handler */
>                 .proc_handler = pid_ns_ctl_handler,
>                 .extra1 = SYSCTL_ZERO,
> -               .extra2 = &init_pid_ns.pid_max,
> +               .extra2 = &pid_max,
>         },
>  };
>  #endif /* CONFIG_CHECKPOINT_RESTORE */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cb57da499ebb1..bb739608680f2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1803,6 +1803,15 @@ static const struct ctl_table kern_table[] = {
>                 .proc_handler   = proc_dointvec,
>         },
>  #endif
> +       {
> +               .procname       = "pid_max",
> +               .data           = &pid_max,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &pid_max_min,
> +               .extra2         = &pid_max_max,
> +       },
>         {
>                 .procname       = "panic_on_oops",
>                 .data           = &panic_on_oops,
> diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
> index c62b9b3cfb3d8..4966e6bbdf6f3 100644
> --- a/kernel/trace/pid_list.c
> +++ b/kernel/trace/pid_list.c
> @@ -414,7 +414,7 @@ struct trace_pid_list *trace_pid_list_alloc(void)
>         int i;
>
>         /* According to linux/thread.h, pids can be no bigger that 30 bits */
> -       WARN_ON_ONCE(init_pid_ns.pid_max > (1 << 30));
> +       WARN_ON_ONCE(pid_max > (1 << 30));
>
>         pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
>         if (!pid_list)
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9c21ba45b7af6..46c65402ad7e5 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -732,6 +732,8 @@ extern unsigned long tracing_thresh;
>
>  /* PID filtering */
>
> +extern int pid_max;
> +
>  bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
>                              pid_t search_pid);
>  bool trace_ignore_this_task(struct trace_pid_list *filtered_pids,
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index cb49f7279dc80..573b5d8e8a28e 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -442,7 +442,7 @@ int trace_alloc_tgid_map(void)
>         if (tgid_map)
>                 return 0;
>
> -       tgid_map_max = init_pid_ns.pid_max;
> +       tgid_map_max = pid_max;
>         map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
>                        GFP_KERNEL);
>         if (!map)
> --
> 2.48.1
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-22  9:02     ` David Laight
@ 2025-03-05 15:01       ` Michal Koutný
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-05 15:01 UTC (permalink / raw)
  To: David Laight
  Cc: Andrew Morton, Christian Brauner, Alexander Mikhalitsyn,
	linux-doc, linux-kernel, linux-fsdevel, linux-trace-kernel,
	Jonathan Corbet, Kees Cook, Eric W . Biederman, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 612 bytes --]

On Sat, Feb 22, 2025 at 09:02:08AM +0000, David Laight <david.laight.linux@gmail.com> wrote:
> It also seems a good way of being able to predict the next pid and
> doing all the 'nasty' things that allows because there is no guard
> time on pid reuse.

The motivations was not to make guessing next pid more difficult, I'll
update the docs with better explanation.

> Both first-fit and next-fit have the same issue.
> Picking a random pid is better.

I surely don't want to delve into this now. (I acknowledge that having a
possible range specified per pid ns would be useful for such a
randomization.)

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-22  0:18   ` Andrew Morton
  2025-02-22  9:02     ` David Laight
@ 2025-03-05 15:04     ` Michal Koutný
  1 sibling, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-05 15:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
	linux-fsdevel, linux-trace-kernel, Jonathan Corbet, Kees Cook,
	Eric W . Biederman, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 553 bytes --]

Hi.

On Fri, Feb 21, 2025 at 04:18:54PM -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> This seems thin.  Is there more we can tell our users?  What are the
> visible effects of this?  What are the benefits?  Why would they want
> to turn it on?

Thanks for review and comments (also Alexander).

> I mean, there are veritable paragraphs in the changelogs, but just a
> single line in the user-facing docs.  Seems there could be more...

I decided not to fiddle with allocation strategies and disable pid_max
in namespaces by default.

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
  2025-02-22  0:18   ` Andrew Morton
  2025-02-25 17:30   ` Alexander Mikhalitsyn
@ 2025-03-06  8:59   ` Christian Brauner
  2025-03-06  9:09     ` Michal Koutný
  2 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2025-03-06  8:59 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexander Mikhalitsyn, linux-doc, linux-kernel, linux-fsdevel,
	linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
	Eric W . Biederman, Oleg Nesterov

On Fri, Feb 21, 2025 at 06:02:49PM +0100, Michal Koutný wrote:
> Noone would need to use this allocation strategy (it's slower, pid
> numbers collide sooner). Its primary purpose are pid namespaces in
> conjunction with pids.max cgroup limit which keeps (virtual) pid numbers
> below the given limit. This is for 32-bit userspace programs that may
> not work well with pid numbers above 65536.
> 
> Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
>  Documentation/admin-guide/sysctl/kernel.rst |  2 ++
>  include/linux/pid_namespace.h               |  3 +++
>  kernel/pid.c                                | 12 +++++++--
>  kernel/pid_namespace.c                      | 28 +++++++++++++++------
>  4 files changed, 36 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index a43b78b4b6464..f5e68d1c8849f 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
>  lives in) pid namespace. When selecting a pid for a next task on fork
>  kernel tries to allocate a number starting from this one.
>  
> +When set to -1, first-fit pid numbering is used instead of the next-fit.

I strongly disagree with this approach. This is way worse then making
pid_max per pid namespace.

I'm fine if you come up with something else that's purely based on
cgroups somehow and is uniform across 64-bit and 32-bit. Allowing to
change the pid allocation strategy just for 32-bit is not the solution
and not mergable.

> +
>  
>  powersave-nap (PPC only)
>  ========================
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index f9f9931e02d6a..10bf66ca78590 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -41,6 +41,9 @@ struct pid_namespace {
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>  	int memfd_noexec_scope;
>  #endif
> +#ifdef CONFIG_IA32_EMULATION
> +	bool pid_noncyclic;
> +#endif
>  } __randomize_layout;
>  
>  extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index aa2a7d4da4555..e9da1662b8821 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  
>  	for (i = ns->level; i >= 0; i--) {
>  		int tid = 0;
> +		bool pid_noncyclic = 0;
> +#ifdef CONFIG_IA32_EMULATION
> +		pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
> +#endif
>  
>  		if (set_tid_size) {
>  			tid = set_tid[ns->level - i];
> @@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  			 * Store a null pointer so find_pid_ns does not find
>  			 * a partially initialized PID (see below).
>  			 */
> -			nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> -					      pid_max, GFP_ATOMIC);
> +			if (likely(!pid_noncyclic))
> +				nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> +						      pid_max, GFP_ATOMIC);
> +			else
> +				nr = idr_alloc(&tmp->idr, NULL, pid_min,
> +						      pid_max, GFP_ATOMIC);
>  		}
>  		spin_unlock_irq(&pidmap_lock);
>  		idr_preload_end();
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0f23285be4f92..ceda94a064294 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
>  	ns->pid_allocated = PIDNS_ADDING;
>  #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
>  	ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
> +#endif
> +#ifdef CONFIG_IA32_EMULATION
> +	ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
>  #endif
>  	return ns;
>  
> @@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
>  	return;
>  }
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
>  static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>  		void *buffer, size_t *lenp, loff_t *ppos)
>  {
> @@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
>  	if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
>  		return -EPERM;
>  
> -	next = idr_get_cursor(&pid_ns->idr) - 1;
> +	next = -1;
> +#ifdef CONFIG_IA32_EMULATION
> +	if (!pid_ns->pid_noncyclic)
> +#endif
> +		next += idr_get_cursor(&pid_ns->idr);
>  
>  	tmp.data = &next;
>  	ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> -	if (!ret && write)
> -		idr_set_cursor(&pid_ns->idr, next + 1);
> +	if (!ret && write) {
> +		if (next > -1)
> +			idr_set_cursor(&pid_ns->idr, next + 1);
> +		else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
> +			ret = -EINVAL;
> +#ifdef CONFIG_IA32_EMULATION
> +		WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
> +#endif
> +	}
>  
>  	return ret;
>  }
> @@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
>  		.maxlen = sizeof(int),
>  		.mode = 0666, /* permissions are checked in the handler */
>  		.proc_handler = pid_ns_ctl_handler,
> -		.extra1 = SYSCTL_ZERO,
> +		.extra1 = SYSCTL_NEG_ONE,
>  		.extra2 = &pid_max,
>  	},
>  };
> -#endif	/* CONFIG_CHECKPOINT_RESTORE */
> +#endif	/* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
>  
>  int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
>  {
> @@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
>  {
>  	pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>  
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
>  	register_sysctl_init("kernel", pid_ns_ctl_table);
>  #endif
>  
> -- 
> 2.48.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
  2025-03-06  8:59   ` Christian Brauner
@ 2025-03-06  9:09     ` Michal Koutný
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-06  9:09 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Mikhalitsyn, linux-doc, linux-kernel, linux-fsdevel,
	linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
	Eric W . Biederman, Oleg Nesterov

[-- Attachment #1: Type: text/plain, Size: 560 bytes --]

On Thu, Mar 06, 2025 at 09:59:13AM +0100, Christian Brauner <brauner@kernel.org> wrote:
> I strongly disagree with this approach. This is way worse then making
> pid_max per pid namespace.

Thanks for taking the look.

> I'm fine if you come up with something else that's purely based on
> cgroups somehow and is uniform across 64-bit and 32-bit. Allowing to
> change the pid allocation strategy just for 32-bit is not the solution
> and not mergable.

Here's a minimalist correction
https://lore.kernel.org/r/20250305145849.55491-1-mkoutny@suse.com/


Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
  2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
  2025-02-25 17:36   ` Alexander Mikhalitsyn
@ 2025-03-10  7:32   ` kernel test robot
  1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2025-03-10  7:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: oe-lkp, lkp, linux-kernel, linux-fsdevel, linux-trace-kernel,
	Christian Brauner, Alexander Mikhalitsyn, linux-doc,
	Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
	Michal Koutný, Oleg Nesterov, oliver.sang



Hello,

kernel test robot noticed a 23.4% improvement of stress-ng.sigxfsz.ops_per_sec on:


commit: ee2a5c3e36093d0ff5709bc8f21d3793cf55f746 ("[PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"")
url: https://github.com/intel-lab-lkp/linux/commits/Michal-Koutn/Revert-pid-allow-pid_max-to-be-set-per-pid-namespace/20250222-010942
patch link: https://lore.kernel.org/all/20250221170249.890014-2-mkoutny@suse.com/
patch subject: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"

testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:

	nr_threads: 100%
	testtime: 60s
	test: sigxfsz
	cpufreq_governor: performance


In addition to that, the commit also has significant impact on the following tests:

+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.mprotect.ops_per_sec 4.5% improvement                                |
| test machine     | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters  | cpufreq_governor=performance                                                              |
|                  | nr_threads=100%                                                                           |
|                  | test=mprotect                                                                             |
|                  | testtime=60s                                                                              |
+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 15.7% improvement                                  |
| test machine     | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters  | cpufreq_governor=performance                                                              |
|                  | nr_threads=100%                                                                           |
|                  | test=sigrt                                                                                |
|                  | testtime=60s                                                                              |
+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sigbus.ops_per_sec 20.6% improvement                                 |
| test machine     | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters  | cpufreq_governor=performance                                                              |
|                  | nr_threads=100%                                                                           |
|                  | test=sigbus                                                                               |
|                  | testtime=60s                                                                              |
+------------------+-------------------------------------------------------------------------------------------+




Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250310/202503101532.348576bb-lkp@intel.com

=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigxfsz/stress-ng/60s

commit: 
  3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
  ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")

334426094588f817 ee2a5c3e36093d0ff5709bc8f21 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      5.11            +1.3        6.43        mpstat.cpu.all.usr%
      3737 ±  6%     -38.8%       2286 ± 42%  proc-vmstat.numa_hint_faults_local
   1212920 ±  4%     -10.4%    1086901 ±  5%  sched_debug.cpu.avg_idle.max
     35.50 ± 16%     -30.0%      24.83 ± 20%  perf-c2c.DRAM.local
      1517 ±  4%     -46.5%     812.17 ±  3%  perf-c2c.DRAM.remote
      1808 ±  2%     +57.0%       2840        perf-c2c.HITM.local
      1360 ±  5%     -49.9%     680.83 ±  2%  perf-c2c.HITM.remote
      5.22 ±  3%     +19.8%       6.26 ±  7%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
     53.33 ± 15%     +25.0%      66.67 ± 15%  perf-sched.wait_and_delay.count.__cond_resched.vfs_write.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe
    953.83 ±  3%     -16.5%     796.33 ±  7%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      5.21 ±  3%     +20.0%       6.25 ±  7%  perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    163515           +27.8%     208915        stress-ng.sigxfsz.SIGXFSZ_signals_per_sec
 6.668e+08           +23.4%   8.23e+08        stress-ng.sigxfsz.ops
  11113966           +23.4%   13716156        stress-ng.sigxfsz.ops_per_sec
      3623            -1.4%       3573        stress-ng.time.system_time
    163.26           +31.7%     214.98        stress-ng.time.user_time
      0.25           -54.7%       0.12 ±  2%  perf-stat.i.MPKI
 1.125e+10           +22.1%  1.373e+10        perf-stat.i.branch-instructions
      0.54            -0.0        0.50        perf-stat.i.branch-miss-rate%
  59748239           +10.9%   66264440        perf-stat.i.branch-misses
     33.30           -17.9       15.38 ±  2%  perf-stat.i.cache-miss-rate%
  13040640           -45.8%    7066419 ±  2%  perf-stat.i.cache-misses
  39047103           +15.5%   45098530        perf-stat.i.cache-references
      4.39           -18.2%       3.59        perf-stat.i.cpi
     17823           +97.0%      35113        perf-stat.i.cycles-between-cache-misses
 5.144e+10           +22.0%  6.275e+10        perf-stat.i.instructions
      0.23           +21.3%       0.28        perf-stat.i.ipc
      0.25           -55.6%       0.11 ±  2%  perf-stat.overall.MPKI
      0.53            -0.0        0.48        perf-stat.overall.branch-miss-rate%
     33.40           -17.7       15.67 ±  2%  perf-stat.overall.cache-miss-rate%
      4.40           -18.0%       3.60        perf-stat.overall.cpi
     17350           +84.6%      32027 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.23           +22.0%       0.28        perf-stat.overall.ipc
 1.106e+10           +22.1%   1.35e+10        perf-stat.ps.branch-instructions
  58763534           +10.9%   65180843        perf-stat.ps.branch-misses
  12827760           -45.8%    6951883 ±  2%  perf-stat.ps.cache-misses
  38411225           +15.5%   44365626        perf-stat.ps.cache-references
  5.06e+10           +22.0%  6.172e+10        perf-stat.ps.instructions
 3.106e+12           +21.9%  3.787e+12        perf-stat.total.instructions


***************************************************************************************************
lkp-icl-2sp7: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp7/mprotect/stress-ng/60s

commit: 
  3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
  ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")

334426094588f817 ee2a5c3e36093d0ff5709bc8f21 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
     10205 ± 25%     +33.5%      13621 ± 16%  numa-meminfo.node0.KernelStack
      0.02 ± 37%     -37.8%       0.01 ± 13%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.82 ± 32%     -37.7%       0.51 ±  7%  perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
    807.17 ±  5%      -8.5%     738.67 ±  5%  perf-sched.wait_and_delay.count.__cond_resched.down_write.vma_prepare.__split_vma.vma_modify
    433709            +4.9%     454923 ±  5%  proc-vmstat.nr_active_anon
     61940 ±  3%     +31.3%      81315 ± 35%  proc-vmstat.nr_shmem
    433709            +4.9%     454923 ±  5%  proc-vmstat.nr_zone_active_anon
 4.903e+08            +4.5%  5.124e+08        stress-ng.mprotect.ops
   8163833            +4.5%    8533021        stress-ng.mprotect.ops_per_sec
    239.55            +4.7%     250.91        stress-ng.time.user_time
   3960356 ±  7%     -16.0%    3325457        numa-numastat.node0.local_node
   3990670 ±  7%     -16.1%    3348370        numa-numastat.node0.numa_hit
   2608139 ±  6%     +34.5%    3507199 ±  4%  numa-numastat.node1.local_node
   2644058 ±  6%     +34.3%    3550893 ±  4%  numa-numastat.node1.numa_hit
   3986137 ±  7%     -16.0%    3349506        numa-vmstat.node0.numa_hit
   3955823 ±  7%     -15.9%    3326594        numa-vmstat.node0.numa_local
   2639425 ±  6%     +34.6%    3552253 ±  4%  numa-vmstat.node1.numa_hit
   2603506 ±  6%     +34.8%    3508559 ±  4%  numa-vmstat.node1.numa_local
      1.11 ± 20%     -38.9%       0.68 ± 31%  sched_debug.cfs_rq:/.h_nr_queued.stddev
      1.11 ± 19%     -38.6%       0.68 ± 31%  sched_debug.cfs_rq:/.h_nr_runnable.stddev
      5890 ±  6%     -10.7%       5262        sched_debug.cfs_rq:/.runnable_avg.max
      1064 ± 20%     -41.1%     626.67 ± 33%  sched_debug.cfs_rq:/.runnable_avg.stddev
      1151           -12.2%       1010        sched_debug.cpu.clock_task.stddev
      1.11 ± 20%     -39.1%       0.68 ± 32%  sched_debug.cpu.nr_running.stddev
 1.861e+10            +4.5%  1.945e+10        perf-stat.i.branch-instructions
 1.264e+08            +4.1%  1.316e+08        perf-stat.i.branch-misses
  1.45e+08            +5.3%  1.526e+08        perf-stat.i.cache-references
      2.28            -4.3%       2.18        perf-stat.i.cpi
 8.533e+10            +4.5%   8.92e+10        perf-stat.i.instructions
      0.44            +4.5%       0.46        perf-stat.i.ipc
     63.03            +4.5%      65.90        perf-stat.i.metric.K/sec
   4035009            +4.5%    4218051        perf-stat.i.page-faults
      2.29            -4.4%       2.19        perf-stat.overall.cpi
      0.44            +4.6%       0.46        perf-stat.overall.ipc
 1.829e+10            +4.5%  1.912e+10        perf-stat.ps.branch-instructions
 1.242e+08            +4.1%  1.293e+08        perf-stat.ps.branch-misses
 1.424e+08            +5.3%  1.499e+08        perf-stat.ps.cache-references
 8.385e+10            +4.6%  8.767e+10        perf-stat.ps.instructions
   3966080            +4.6%    4146673        perf-stat.ps.page-faults
 5.154e+12            +4.6%  5.389e+12        perf-stat.total.instructions
     36.24            -1.9       34.36 ±  2%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.stress_mprotect_mem
     38.30            -1.7       36.58 ±  2%  perf-profile.calltrace.cycles-pp.stress_mprotect_mem
     14.45 ±  2%      -1.7       12.80 ±  2%  perf-profile.calltrace.cycles-pp.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
     17.12            -1.5       15.58 ±  2%  perf-profile.calltrace.cycles-pp.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
     17.06            -1.5       15.54 ±  2%  perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
     12.44 ±  2%      -1.5       10.92 ±  2%  perf-profile.calltrace.cycles-pp.do_dec_rlimit_put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode
     12.46 ±  2%      -1.5       10.94 ±  2%  perf-profile.calltrace.cycles-pp.__sigqueue_free.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.54 ±  2%      -0.1        0.43 ± 44%  perf-profile.calltrace.cycles-pp.up_read.__bad_area.bad_area_access_error.exc_page_fault.asm_exc_page_fault
      0.84            -0.1        0.75 ±  4%  perf-profile.calltrace.cycles-pp.down_write.__split_vma.vma_modify.vma_modify_flags.mprotect_fixup
      1.60            -0.1        1.51 ±  2%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.stress_sig_handler
      1.59            -0.1        1.51 ±  2%  perf-profile.calltrace.cycles-pp.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_sig_handler
      0.82 ±  3%      -0.1        0.74 ±  2%  perf-profile.calltrace.cycles-pp.sigprocmask.__x64_sys_rt_sigprocmask.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
      1.44            -0.1        1.37 ±  2%  perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_sig_handler
      1.03 ±  2%      -0.1        0.98        perf-profile.calltrace.cycles-pp.__x64_sys_rt_sigprocmask.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
      1.29 ±  2%      -0.1        1.23        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
      0.68 ±  3%      -0.0        0.64 ±  2%  perf-profile.calltrace.cycles-pp.up_write.vma_complete.__split_vma.vma_modify.vma_modify_flags
      0.58 ±  2%      -0.0        0.54 ±  3%  perf-profile.calltrace.cycles-pp.__bad_area.bad_area_access_error.exc_page_fault.asm_exc_page_fault.stress_mprotect_mem
      0.58 ±  2%      -0.0        0.56        perf-profile.calltrace.cycles-pp.fpu__clear_user_states.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.62 ±  3%      +0.1        0.67 ±  2%  perf-profile.calltrace.cycles-pp.mas_prev_slot.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
      1.01            +0.1        1.07        perf-profile.calltrace.cycles-pp.copy_fpstate_to_sigframe.get_sigframe.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart
      1.23            +0.1        1.30 ±  2%  perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.stress_mprotect_mem
      0.84 ±  3%      +0.1        0.91 ±  2%  perf-profile.calltrace.cycles-pp.vma_interval_tree_insert.vma_complete.commit_merge.vma_merge_existing_range.vma_modify
      0.84 ±  2%      +0.1        0.91        perf-profile.calltrace.cycles-pp.mas_preallocate.__split_vma.vma_modify.vma_modify_flags.mprotect_fixup
      1.75 ±  2%      +0.1        1.83        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__mprotect
      0.59 ±  2%      +0.1        0.67 ±  2%  perf-profile.calltrace.cycles-pp.simple_dname.perf_event_mmap_event.perf_event_mmap.mprotect_fixup.do_mprotect_pkey
      2.41 ±  2%      +0.1        2.50        perf-profile.calltrace.cycles-pp.clear_bhb_loop.__mprotect
      1.77            +0.1        1.88        perf-profile.calltrace.cycles-pp.get_sigframe.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode
      2.02            +0.1        2.14        perf-profile.calltrace.cycles-pp.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
      0.98 ± 18%      +0.1        1.10        perf-profile.calltrace.cycles-pp.change_protection_range.mprotect_fixup.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64
      2.57            +0.1        2.70        perf-profile.calltrace.cycles-pp.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
      3.13 ±  3%      +0.2        3.34 ±  2%  perf-profile.calltrace.cycles-pp.asm_exc_page_fault.__mprotect
      0.00            +0.6        0.55 ±  2%  perf-profile.calltrace.cycles-pp.prepend_copy.simple_dname.perf_event_mmap_event.perf_event_mmap.mprotect_fixup
     34.00            +1.1       35.12 ±  2%  perf-profile.calltrace.cycles-pp.mprotect_fixup.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
     46.05            +1.1       47.19        perf-profile.calltrace.cycles-pp.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
     46.28            +1.2       47.43        perf-profile.calltrace.cycles-pp.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
     48.43            +1.2       49.61        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
     48.86            +1.2       50.06        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mprotect
     55.84            +1.6       57.41        perf-profile.calltrace.cycles-pp.__mprotect
     39.48            -1.9       37.62 ±  2%  perf-profile.children.cycles-pp.asm_exc_page_fault
     14.48 ±  2%      -1.6       12.83 ±  2%  perf-profile.children.cycles-pp.get_signal
     18.72            -1.6       17.11        perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
     39.92            -1.6       38.32 ±  2%  perf-profile.children.cycles-pp.stress_mprotect_mem
     18.52            -1.6       16.92        perf-profile.children.cycles-pp.arch_do_signal_or_restart
     12.47 ±  2%      -1.5       10.94 ±  2%  perf-profile.children.cycles-pp.__sigqueue_free
     12.44 ±  2%      -1.5       10.92 ±  2%  perf-profile.children.cycles-pp.do_dec_rlimit_put_ucounts
      5.00            -0.2        4.83 ±  2%  perf-profile.children.cycles-pp.up_write
      0.47 ± 10%      -0.1        0.34 ±  7%  perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
      0.47 ± 10%      -0.1        0.34 ±  7%  perf-profile.children.cycles-pp.hrtimer_interrupt
      1.16 ±  3%      -0.1        1.05        perf-profile.children.cycles-pp.recalc_sigpending
      0.35 ±  7%      -0.1        0.24 ±  6%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.89 ±  6%      -0.1        0.79 ±  5%  perf-profile.children.cycles-pp._raw_spin_lock_irq
      0.34 ±  8%      -0.1        0.24 ±  6%  perf-profile.children.cycles-pp.tick_nohz_handler
      0.86 ±  2%      -0.1        0.78        perf-profile.children.cycles-pp.sigprocmask
      0.28 ± 10%      -0.1        0.21 ±  6%  perf-profile.children.cycles-pp.update_process_times
      1.05 ±  2%      -0.1        0.98        perf-profile.children.cycles-pp.__x64_sys_rt_sigprocmask
      0.30 ±  3%      -0.0        0.26 ±  3%  perf-profile.children.cycles-pp.fpregs_mark_activate
      0.17 ± 10%      -0.0        0.13 ±  6%  perf-profile.children.cycles-pp.sched_tick
      0.47 ±  3%      -0.0        0.43 ±  3%  perf-profile.children.cycles-pp.complete_signal
      0.54 ±  2%      -0.0        0.51 ±  2%  perf-profile.children.cycles-pp.up_read
      0.58 ±  2%      -0.0        0.55 ±  2%  perf-profile.children.cycles-pp.__bad_area
      0.61            -0.0        0.58        perf-profile.children.cycles-pp.fpu__clear_user_states
      0.12 ±  5%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.__get_user_nocheck_4
      0.13 ±  3%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.ima_file_mprotect
      0.22 ±  5%      +0.0        0.24 ±  2%  perf-profile.children.cycles-pp.security_file_mprotect
      0.25 ±  3%      +0.0        0.28 ±  4%  perf-profile.children.cycles-pp.stress_mwc16
      0.18 ±  5%      +0.0        0.20 ±  6%  perf-profile.children.cycles-pp.stress_mwc16modn
      0.34 ±  3%      +0.0        0.37 ±  3%  perf-profile.children.cycles-pp.mas_ascend
      0.12 ±  4%      +0.0        0.15 ±  5%  perf-profile.children.cycles-pp.copy_from_kernel_nofault_allowed
      0.30 ±  8%      +0.0        0.33 ±  2%  perf-profile.children.cycles-pp.rcu_all_qs
      0.26 ±  4%      +0.0        0.29 ±  6%  perf-profile.children.cycles-pp.mas_pop_node
      0.44 ±  2%      +0.0        0.47        perf-profile.children.cycles-pp.vma_set_page_prot
      0.49 ±  3%      +0.0        0.53 ±  3%  perf-profile.children.cycles-pp.save_xstate_epilog
      0.66 ±  2%      +0.0        0.71 ±  2%  perf-profile.children.cycles-pp.native_irq_return_iret
      0.02 ± 99%      +0.1        0.08 ± 11%  perf-profile.children.cycles-pp.anon_vma_clone
      1.27            +0.1        1.33        perf-profile.children.cycles-pp.do_user_addr_fault
      0.84            +0.1        0.90        perf-profile.children.cycles-pp.mas_prev_slot
      1.04            +0.1        1.11        perf-profile.children.cycles-pp.copy_fpstate_to_sigframe
      0.73 ±  7%      +0.1        0.79 ±  2%  perf-profile.children.cycles-pp.__cond_resched
      0.46 ±  3%      +0.1        0.53 ±  2%  perf-profile.children.cycles-pp.copy_from_kernel_nofault
      1.30 ±  2%      +0.1        1.37        perf-profile.children.cycles-pp.entry_SYSCALL_64
      0.50 ±  2%      +0.1        0.58 ±  2%  perf-profile.children.cycles-pp.prepend_copy
      1.68            +0.1        1.75        perf-profile.children.cycles-pp.mas_preallocate
      0.61 ±  3%      +0.1        0.70 ±  3%  perf-profile.children.cycles-pp.simple_dname
      2.77 ±  2%      +0.1        2.87        perf-profile.children.cycles-pp.clear_bhb_loop
      3.27            +0.1        3.37        perf-profile.children.cycles-pp.handle_signal
      1.78            +0.1        1.89        perf-profile.children.cycles-pp.get_sigframe
      2.05            +0.1        2.16        perf-profile.children.cycles-pp.x64_setup_rt_frame
      0.99 ± 18%      +0.1        1.11        perf-profile.children.cycles-pp.change_protection_range
      7.00            +0.2        7.24 ±  2%  perf-profile.children.cycles-pp.vma_prepare
     34.09            +1.1       35.22 ±  2%  perf-profile.children.cycles-pp.mprotect_fixup
     50.17            +1.1       51.31        perf-profile.children.cycles-pp.do_syscall_64
     46.24            +1.2       47.39        perf-profile.children.cycles-pp.do_mprotect_pkey
     46.33            +1.2       47.49        perf-profile.children.cycles-pp.__x64_sys_mprotect
     50.61            +1.2       51.78        perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     55.94            +1.6       57.52        perf-profile.children.cycles-pp.__mprotect
     12.44 ±  2%      -1.5       10.91 ±  2%  perf-profile.self.cycles-pp.do_dec_rlimit_put_ucounts
      4.36            -0.1        4.22 ±  2%  perf-profile.self.cycles-pp.up_write
      1.14 ±  3%      -0.1        1.03        perf-profile.self.cycles-pp.recalc_sigpending
      0.87 ±  6%      -0.1        0.78 ±  5%  perf-profile.self.cycles-pp._raw_spin_lock_irq
      2.83            -0.1        2.75        perf-profile.self.cycles-pp.down_write
      0.28 ±  5%      -0.0        0.23 ±  5%  perf-profile.self.cycles-pp.fpregs_mark_activate
      0.19 ± 10%      -0.0        0.14 ± 12%  perf-profile.self.cycles-pp.__perf_event_header__init_id
      0.40 ±  3%      -0.0        0.36 ±  5%  perf-profile.self.cycles-pp.complete_signal
      0.52 ±  2%      -0.0        0.48 ±  2%  perf-profile.self.cycles-pp.up_read
      0.15 ±  2%      -0.0        0.14 ±  3%  perf-profile.self.cycles-pp.__send_signal_locked
      0.10 ±  4%      -0.0        0.09 ±  4%  perf-profile.self.cycles-pp.__bad_area_nosemaphore
      0.30 ±  3%      +0.0        0.33 ±  4%  perf-profile.self.cycles-pp.mas_ascend
      0.10 ±  5%      +0.0        0.12 ±  5%  perf-profile.self.cycles-pp.do_user_addr_fault
      0.10 ±  4%      +0.0        0.12 ±  3%  perf-profile.self.cycles-pp.copy_from_kernel_nofault_allowed
      0.21 ±  6%      +0.0        0.24 ±  4%  perf-profile.self.cycles-pp.rwsem_down_write_slowpath
      0.40            +0.0        0.43 ±  2%  perf-profile.self.cycles-pp.change_protection_range
      0.44            +0.0        0.47        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.24 ±  3%      +0.0        0.27 ±  6%  perf-profile.self.cycles-pp.mas_pop_node
      0.34 ±  2%      +0.0        0.38 ±  3%  perf-profile.self.cycles-pp.mas_preallocate
      0.37 ±  8%      +0.0        0.41 ±  3%  perf-profile.self.cycles-pp.__cond_resched
      0.72            +0.0        0.76 ±  2%  perf-profile.self.cycles-pp.copy_fpstate_to_sigframe
      0.41            +0.0        0.45 ±  3%  perf-profile.self.cycles-pp.mas_prev_slot
      0.66 ±  2%      +0.0        0.71 ±  2%  perf-profile.self.cycles-pp.native_irq_return_iret
      0.30 ±  4%      +0.0        0.35 ±  2%  perf-profile.self.cycles-pp.copy_from_kernel_nofault
      0.02 ±141%      +0.1        0.08 ± 11%  perf-profile.self.cycles-pp.anon_vma_clone
      1.21 ±  2%      +0.1        1.30 ±  2%  perf-profile.self.cycles-pp.__mprotect
      2.73 ±  2%      +0.1        2.83        perf-profile.self.cycles-pp.clear_bhb_loop
      2.76            +0.1        2.88        perf-profile.self.cycles-pp.do_mprotect_pkey
      3.48 ±  3%      +0.3        3.74 ±  2%  perf-profile.self.cycles-pp.stress_mprotect_mem



***************************************************************************************************
lkp-icl-2sp8: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigrt/stress-ng/60s

commit: 
  3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
  ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")

334426094588f817 ee2a5c3e36093d0ff5709bc8f21 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      1345 ±  9%     -15.8%       1132 ±  5%  perf-c2c.HITM.remote
   5328778           +18.0%    6289475        vmstat.system.cs
    197362            +2.0%     201296        vmstat.system.in
     45.97 ±118%     -85.4%       6.71 ± 55%  perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.shmem_alloc_folio
    582.79 ± 39%     -39.2%     354.28 ± 31%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
      1260 ± 46%     -43.7%     709.74 ± 31%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
     45.97 ±118%     -85.4%       6.71 ± 55%  perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.shmem_alloc_folio
    705.59 ± 50%     -48.9%     360.90 ± 32%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
     83250           -16.0%      69935        stress-ng.sigrt.nanosecs_between_sigqueue_and_sigwaitinfo_completion
 3.362e+08           +15.7%   3.89e+08        stress-ng.sigrt.ops
   5601334           +15.7%    6480915        stress-ng.sigrt.ops_per_sec
  65582158           +17.7%   77176472        stress-ng.time.involuntary_context_switches
      3423            -1.4%       3375        stress-ng.time.system_time
    335.13 ±  2%     +14.5%     383.80 ±  2%  stress-ng.time.user_time
 2.714e+08           +17.4%  3.185e+08        stress-ng.time.voluntary_context_switches
   4202907 ± 15%     -24.2%    3184715 ± 12%  sched_debug.cfs_rq:/.avg_vruntime.max
     82.07 ± 12%    +391.9%     403.68 ± 94%  sched_debug.cfs_rq:/.load_avg.avg
    169.48 ±  8%   +1182.4%       2173 ±115%  sched_debug.cfs_rq:/.load_avg.stddev
   4202907 ± 15%     -24.2%    3184715 ± 12%  sched_debug.cfs_rq:/.min_vruntime.max
      1239 ±  8%     +14.2%       1415 ± 12%  sched_debug.cfs_rq:/.util_avg.max
   2593172           +17.4%    3044316        sched_debug.cpu.nr_switches.avg
   1526897 ±  3%     +66.4%    2540867 ±  2%  sched_debug.cpu.nr_switches.min
    606805           -67.2%     198918 ±  9%  sched_debug.cpu.nr_switches.stddev
 1.902e+10           +14.8%  2.184e+10        perf-stat.i.branch-instructions
  1.42e+08 ±  3%     +16.2%   1.65e+08        perf-stat.i.branch-misses
      6.65 ±  4%      -0.9        5.77 ±  7%  perf-stat.i.cache-miss-rate%
 3.931e+08 ±  9%     +17.1%  4.605e+08 ±  6%  perf-stat.i.cache-references
   5534190           +17.4%    6498045        perf-stat.i.context-switches
      2.71           -14.3%       2.33        perf-stat.i.cpi
 8.694e+10           +14.8%  9.976e+10        perf-stat.i.instructions
      0.39           +14.2%       0.45        perf-stat.i.ipc
     86.53           +17.4%     101.60        perf-stat.i.metric.K/sec
      6.82 ±  5%      -0.9        5.91 ±  9%  perf-stat.overall.cache-miss-rate%
      2.59           -12.9%       2.26        perf-stat.overall.cpi
      0.39           +14.7%       0.44        perf-stat.overall.ipc
 1.871e+10           +14.8%  2.149e+10        perf-stat.ps.branch-instructions
 1.396e+08 ±  3%     +16.2%  1.622e+08        perf-stat.ps.branch-misses
 3.868e+08 ±  9%     +17.1%   4.53e+08 ±  6%  perf-stat.ps.cache-references
   5443676           +17.4%    6391319        perf-stat.ps.context-switches
 8.552e+10           +14.8%  9.813e+10        perf-stat.ps.instructions
 5.251e+12           +14.3%      6e+12        perf-stat.total.instructions



***************************************************************************************************
lkp-icl-2sp8: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
  gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigbus/stress-ng/60s

commit: 
  3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
  ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")

334426094588f817 ee2a5c3e36093d0ff5709bc8f21 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
      7.64            +1.7        9.30        mpstat.cpu.all.usr%
     36.50 ± 16%     -42.9%      20.83 ± 31%  perf-c2c.DRAM.local
      2312 ±  6%     -68.7%     723.17 ±  4%  perf-c2c.DRAM.remote
      3690 ±  3%     +44.9%       5347 ±  6%  perf-c2c.HITM.local
      2155 ±  6%     -71.8%     608.17 ±  4%  perf-c2c.HITM.remote
      4477 ± 69%     -70.3%       1328 ± 35%  proc-vmstat.numa_hint_faults
      2459 ± 11%     -64.8%     866.33 ± 47%  proc-vmstat.numa_hint_faults_local
    140611 ± 21%     -33.6%      93302 ± 45%  proc-vmstat.numa_pte_updates
 7.197e+08           +20.7%  8.685e+08        proc-vmstat.pgfault
 7.201e+08           +20.6%  8.682e+08        stress-ng.sigbus.ops
  12001759           +20.6%   14469786        stress-ng.sigbus.ops_per_sec
      3526            -1.8%       3461        stress-ng.time.system_time
    261.31           +25.4%     327.64        stress-ng.time.user_time
      0.03 ± 55%     -64.6%       0.01 ± 17%  perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.86 ±150%     -90.1%       0.09 ±201%  perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
      0.02 ± 50%     -58.7%       0.01 ± 14%  perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      1.08 ± 18%     -34.1%       0.71 ± 14%  perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
      0.31 ± 72%     -65.9%       0.11 ± 71%  perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
      0.02 ± 10%     -23.4%       0.01 ± 15%  perf-sched.sch_delay.max.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
      1.91 ±218%     -99.2%       0.02 ± 11%  perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      4.00 ± 49%     -71.6%       1.14 ± 56%  perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    261.25 ± 37%    +199.1%     781.43 ± 15%  perf-sched.wait_and_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     81.02 ± 59%    +274.1%     303.13 ± 50%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
      6.60 ±  2%     +16.9%       7.71 ±  3%  perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
    108.83 ± 63%     -81.2%      20.50 ±113%  perf-sched.wait_and_delay.count.devkmsg_read.vfs_read.ksys_read.do_syscall_64
      3107 ±  3%     -12.6%       2714 ±  5%  perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
    124.17 ± 63%     -70.1%      37.17 ± 60%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
    751.00 ±  2%     -17.0%     623.50 ±  2%  perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      1550 ± 31%    +119.7%       3406 ± 19%  perf-sched.wait_and_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
    261.24 ± 37%    +199.1%     781.42 ± 15%  perf-sched.wait_time.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
     80.16 ± 60%    +278.0%     303.05 ± 50%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
      6.59 ±  2%     +17.0%       7.71 ±  3%  perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
      1550 ± 31%    +119.7%       3406 ± 19%  perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
      0.18           -49.0%       0.09 ±  3%  perf-stat.i.MPKI
  1.59e+10           +19.7%  1.903e+10        perf-stat.i.branch-instructions
      0.28            -0.0        0.25        perf-stat.i.branch-miss-rate%
  40989724            +5.3%   43173098 ±  2%  perf-stat.i.branch-misses
     32.63           -15.8       16.81 ±  2%  perf-stat.i.cache-miss-rate%
  12733301 ±  2%     -40.3%    7597041 ±  3%  perf-stat.i.cache-misses
  38933806           +14.5%   44591128        perf-stat.i.cache-references
      3.17           -16.4%       2.65        perf-stat.i.cpi
     18224           +75.2%      31921        perf-stat.i.cycles-between-cache-misses
 7.098e+10           +19.6%  8.489e+10        perf-stat.i.instructions
      0.32           +19.0%       0.38        perf-stat.i.ipc
    184.67           +20.6%     222.65        perf-stat.i.metric.K/sec
  11819123           +20.6%   14249011        perf-stat.i.page-faults
      0.18           -50.1%       0.09 ±  3%  perf-stat.overall.MPKI
      0.26            -0.0        0.23        perf-stat.overall.branch-miss-rate%
     32.70           -15.7       17.04 ±  3%  perf-stat.overall.cache-miss-rate%
      3.19           -16.4%       2.66        perf-stat.overall.cpi
     17772 ±  2%     +67.6%      29795 ±  2%  perf-stat.overall.cycles-between-cache-misses
      0.31           +19.6%       0.38        perf-stat.overall.ipc
 1.564e+10           +19.7%  1.871e+10        perf-stat.ps.branch-instructions
  40314687            +5.4%   42478375 ±  2%  perf-stat.ps.branch-misses
  12525837 ±  2%     -40.3%    7473864 ±  3%  perf-stat.ps.cache-misses
  38300912           +14.5%   43866104        perf-stat.ps.cache-references
 6.982e+10           +19.6%   8.35e+10        perf-stat.ps.instructions
  11626044           +20.6%   14016280        perf-stat.ps.page-faults
 4.284e+12           +19.5%  5.117e+12        perf-stat.total.instructions





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2025-03-10  7:32 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
2025-02-25 17:36   ` Alexander Mikhalitsyn
2025-03-10  7:32   ` kernel test robot
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
2025-02-22  0:18   ` Andrew Morton
2025-02-22  9:02     ` David Laight
2025-03-05 15:01       ` Michal Koutný
2025-03-05 15:04     ` Michal Koutný
2025-02-25 17:30   ` Alexander Mikhalitsyn
2025-03-06  8:59   ` Christian Brauner
2025-03-06  9:09     ` Michal Koutný

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).