* [PATCH 0/2] Alternative "pid_max" for 32-bit userspace
@ 2025-02-21 17:02 Michal Koutný
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
0 siblings, 2 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC
To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
linux-fsdevel, linux-trace-kernel
Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
Michal Koutný, Oleg Nesterov
pid_max is sort of a legacy limit (its value and partially the concept
too, given the existence of pids cgroup controller).
It is tempting to make the pid_max value part of a pid namespace to
provide compat environment for 32-bit applications [1]. On the other
hand, it provides yet another mechanism for limitation of task count.
Even without namespacing the pid_max value, configuring a deliberate
limit is confusing for users [2].
This series builds upon the idea of restricting the number (amount) of
tasks by pids controller and ensuring that number (pid) never exceeds
the amount of tasks. This would not currently work out of the box
because next-fit pid allocation would continue to assign numbers (pids)
higher than the actual amount (there would be gaps in the lower range of
the interval). The patch 2/2 implements this idea by extending semantics
of ns_last_pid knob to allow first-fit numbering. (The implementation
has clumsy ifdefery, which might be dropped since it's too
x86-centric.)
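The difference between the two numbering policies can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions: `struct pid_pool`, `alloc_first_fit()`, `alloc_next_fit()` and `max_pid_seen()` are hypothetical names, the range is a toy one, and none of this is the kernel's IDR code. It models a workload capped at `live` tasks (as pids.max would do) where the oldest task exits before each new fork.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PID_MIN 1
#define PID_MAX 16	/* toy exclusive upper bound, not the kernel's pid_max */

struct pid_pool {
	bool used[PID_MAX];
	int cursor;	/* where the next-fit search resumes */
};

/* First-fit: always hand out the lowest free number. */
static int alloc_first_fit(struct pid_pool *p)
{
	for (int nr = PID_MIN; nr < PID_MAX; nr++) {
		if (!p->used[nr]) {
			p->used[nr] = true;
			return nr;
		}
	}
	return -1;
}

/* Next-fit: resume after the last allocation, wrapping around once. */
static int alloc_next_fit(struct pid_pool *p)
{
	const int span = PID_MAX - PID_MIN;
	int start = p->cursor < PID_MIN ? PID_MIN : p->cursor;

	for (int i = 0; i < span; i++) {
		int nr = PID_MIN + (start - PID_MIN + i) % span;

		if (!p->used[nr]) {
			p->used[nr] = true;
			p->cursor = nr + 1;
			return nr;
		}
	}
	return -1;
}

/*
 * Run `rounds` forks while keeping at most `live` tasks alive (the oldest
 * task exits first). Returns the highest number ever handed out.
 */
static int max_pid_seen(int (*alloc)(struct pid_pool *), int live, int rounds)
{
	struct pid_pool pool;
	int fifo[PID_MAX];	/* ring buffer of live pids, oldest at head */
	int head = 0, count = 0, highest = 0;

	assert(live > 0 && live < PID_MAX);
	memset(&pool, 0, sizeof(pool));

	for (int r = 0; r < rounds; r++) {
		if (count == live) {	/* at the task limit: oldest exits */
			pool.used[fifo[head]] = false;
			head = (head + 1) % PID_MAX;
			count--;
		}
		int nr = alloc(&pool);
		assert(nr > 0);
		fifo[(head + count) % PID_MAX] = nr;
		count++;
		if (nr > highest)
			highest = nr;
	}
	return highest;
}
```

Calling `max_pid_seen(alloc_first_fit, 3, 10)` keeps every number within the task limit, while `max_pid_seen(alloc_next_fit, 3, 10)` keeps climbing and leaves gaps in the lower range, which is the behaviour described above.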
The patch 1/2 is a mere revert to simplify pid_max to one global limit
only.
(I pruned Cc: list from scripts/get_maintainer.pl for better focus, feel
free to bounce as necessary.)
[1] https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
[2] https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/
Michal Koutný (2):
Revert "pid: allow pid_max to be set per pid namespace"
pid: Optional first-fit pid allocation
Documentation/admin-guide/sysctl/kernel.rst | 2 +
include/linux/pid.h | 3 +
include/linux/pid_namespace.h | 11 +-
kernel/pid.c | 137 +++-----------------
kernel/pid_namespace.c | 71 +++++-----
kernel/sysctl.c | 9 ++
kernel/trace/pid_list.c | 2 +-
kernel/trace/trace.h | 2 +
kernel/trace/trace_sched_switch.c | 2 +-
9 files changed, 70 insertions(+), 169 deletions(-)
base-commit: 334426094588f8179fe175a09ecc887ff0c75758
--
2.48.1
* [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
@ 2025-02-21 17:02 ` Michal Koutný
2025-02-25 17:36 ` Alexander Mikhalitsyn
2025-03-10 7:32 ` kernel test robot
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
1 sibling, 2 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC
To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
linux-fsdevel, linux-trace-kernel
Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
Michal Koutný, Oleg Nesterov
This reverts commit 7863dcc72d0f4b13a641065670426435448b3d80.
It is already difficult for users to troubleshoot which of multiple pid
limits restricts their workload. I'm afraid making pid_max
per-(hierarchical-)NS will contribute to confusion.
Also, the implementation copies the limit from the parent upon
creation. This pattern proved cumbersome with some attributes in legacy
cgroup controllers: it is subject to a race between modification of the
parent's limit and creation of children, and once copied, the limit must
be changed in each descendant.
This is very similar to what pids.max of a cgroup already does, which
can be used as an alternative.
Link: https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
include/linux/pid.h | 3 +
include/linux/pid_namespace.h | 10 +--
kernel/pid.c | 125 ++----------------------------
kernel/pid_namespace.c | 43 +++-------
kernel/sysctl.c | 9 +++
kernel/trace/pid_list.c | 2 +-
kernel/trace/trace.h | 2 +
kernel/trace/trace_sched_switch.c | 2 +-
8 files changed, 35 insertions(+), 161 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 98837a1ff0f33..fe575fcdb4afa 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -108,6 +108,9 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old);
extern void transfer_pid(struct task_struct *old, struct task_struct *new,
enum pid_type);
+extern int pid_max;
+extern int pid_max_min, pid_max_max;
+
/*
* look up a PID in the hash table. Must be called with the tasklist_lock
* or rcu_read_lock() held.
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 7c67a58111998..f9f9931e02d6a 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -30,7 +30,6 @@ struct pid_namespace {
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
- int pid_max;
struct pid_namespace *parent;
#ifdef CONFIG_BSD_PROCESS_ACCT
struct fs_pin *bacct;
@@ -39,14 +38,9 @@ struct pid_namespace {
struct ucounts *ucounts;
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
- struct work_struct work;
-#ifdef CONFIG_SYSCTL
- struct ctl_table_set set;
- struct ctl_table_header *sysctls;
-#if defined(CONFIG_MEMFD_CREATE)
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
int memfd_noexec_scope;
#endif
-#endif
} __randomize_layout;
extern struct pid_namespace init_pid_ns;
@@ -123,8 +117,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
void pidhash_init(void);
void pid_idr_init(void);
-int register_pidns_sysctls(struct pid_namespace *pidns);
-void unregister_pidns_sysctls(struct pid_namespace *pidns);
static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
{
diff --git a/kernel/pid.c b/kernel/pid.c
index 924084713be8b..aa2a7d4da4555 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -61,8 +61,10 @@ struct pid init_struct_pid = {
}, }
};
-static int pid_max_min = RESERVED_PIDS + 1;
-static int pid_max_max = PID_MAX_LIMIT;
+int pid_max = PID_MAX_DEFAULT;
+
+int pid_max_min = RESERVED_PIDS + 1;
+int pid_max_max = PID_MAX_LIMIT;
/*
* PID-map pages start out as NULL, they get allocated upon
@@ -81,7 +83,6 @@ struct pid_namespace init_pid_ns = {
#ifdef CONFIG_PID_NS
.ns.ops = &pidns_operations,
#endif
- .pid_max = PID_MAX_DEFAULT,
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
#endif
@@ -190,7 +191,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
for (i = ns->level; i >= 0; i--) {
int tid = 0;
- int pid_max = READ_ONCE(tmp->pid_max);
if (set_tid_size) {
tid = set_tid[ns->level - i];
@@ -644,118 +644,17 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
return fd;
}
-#ifdef CONFIG_SYSCTL
-static struct ctl_table_set *pid_table_root_lookup(struct ctl_table_root *root)
-{
- return &task_active_pid_ns(current)->set;
-}
-
-static int set_is_seen(struct ctl_table_set *set)
-{
- return &task_active_pid_ns(current)->set == set;
-}
-
-static int pid_table_root_permissions(struct ctl_table_header *head,
- const struct ctl_table *table)
-{
- struct pid_namespace *pidns =
- container_of(head->set, struct pid_namespace, set);
- int mode = table->mode;
-
- if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) ||
- uid_eq(current_euid(), make_kuid(pidns->user_ns, 0)))
- mode = (mode & S_IRWXU) >> 6;
- else if (in_egroup_p(make_kgid(pidns->user_ns, 0)))
- mode = (mode & S_IRWXG) >> 3;
- else
- mode = mode & S_IROTH;
- return (mode << 6) | (mode << 3) | mode;
-}
-
-static void pid_table_root_set_ownership(struct ctl_table_header *head,
- kuid_t *uid, kgid_t *gid)
-{
- struct pid_namespace *pidns =
- container_of(head->set, struct pid_namespace, set);
- kuid_t ns_root_uid;
- kgid_t ns_root_gid;
-
- ns_root_uid = make_kuid(pidns->user_ns, 0);
- if (uid_valid(ns_root_uid))
- *uid = ns_root_uid;
-
- ns_root_gid = make_kgid(pidns->user_ns, 0);
- if (gid_valid(ns_root_gid))
- *gid = ns_root_gid;
-}
-
-static struct ctl_table_root pid_table_root = {
- .lookup = pid_table_root_lookup,
- .permissions = pid_table_root_permissions,
- .set_ownership = pid_table_root_set_ownership,
-};
-
-static const struct ctl_table pid_table[] = {
- {
- .procname = "pid_max",
- .data = &init_pid_ns.pid_max,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &pid_max_min,
- .extra2 = &pid_max_max,
- },
-};
-#endif
-
-int register_pidns_sysctls(struct pid_namespace *pidns)
-{
-#ifdef CONFIG_SYSCTL
- struct ctl_table *tbl;
-
- setup_sysctl_set(&pidns->set, &pid_table_root, set_is_seen);
-
- tbl = kmemdup(pid_table, sizeof(pid_table), GFP_KERNEL);
- if (!tbl)
- return -ENOMEM;
- tbl->data = &pidns->pid_max;
- pidns->pid_max = min(pid_max_max, max_t(int, pidns->pid_max,
- PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
-
- pidns->sysctls = __register_sysctl_table(&pidns->set, "kernel", tbl,
- ARRAY_SIZE(pid_table));
- if (!pidns->sysctls) {
- kfree(tbl);
- retire_sysctl_set(&pidns->set);
- return -ENOMEM;
- }
-#endif
- return 0;
-}
-
-void unregister_pidns_sysctls(struct pid_namespace *pidns)
-{
-#ifdef CONFIG_SYSCTL
- const struct ctl_table *tbl;
-
- tbl = pidns->sysctls->ctl_table_arg;
- unregister_sysctl_table(pidns->sysctls);
- retire_sysctl_set(&pidns->set);
- kfree(tbl);
-#endif
-}
-
void __init pid_idr_init(void)
{
/* Verify no one has done anything silly: */
BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
/* bump default and minimum pid_max based on number of cpus */
- init_pid_ns.pid_max = min(pid_max_max, max_t(int, init_pid_ns.pid_max,
- PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
+ pid_max = min(pid_max_max, max_t(int, pid_max,
+ PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
pid_max_min = max_t(int, pid_max_min,
PIDS_PER_CPU_MIN * num_possible_cpus());
- pr_info("pid_max: default: %u minimum: %u\n", init_pid_ns.pid_max, pid_max_min);
+ pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
idr_init(&init_pid_ns.idr);
@@ -766,16 +665,6 @@ void __init pid_idr_init(void)
NULL);
}
-static __init int pid_namespace_sysctl_init(void)
-{
-#ifdef CONFIG_SYSCTL
- /* "kernel" directory will have already been initialized. */
- BUG_ON(register_pidns_sysctls(&init_pid_ns));
-#endif
- return 0;
-}
-subsys_initcall(pid_namespace_sysctl_init);
-
static struct file *__pidfd_fget(struct task_struct *task, int fd)
{
struct file *file;
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 8f6cfec87555a..0f23285be4f92 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -70,8 +70,6 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
}
-static void destroy_pid_namespace_work(struct work_struct *work);
-
static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
struct pid_namespace *parent_pid_ns)
{
@@ -107,27 +105,17 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
goto out_free_idr;
ns->ns.ops = &pidns_operations;
- ns->pid_max = parent_pid_ns->pid_max;
- err = register_pidns_sysctls(ns);
- if (err)
- goto out_free_inum;
-
refcount_set(&ns->ns.count, 1);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;
- INIT_WORK(&ns->work, destroy_pid_namespace_work);
-
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
#endif
-
return ns;
-out_free_inum:
- ns_free_inum(&ns->ns);
out_free_idr:
idr_destroy(&ns->idr);
kmem_cache_free(pid_ns_cachep, ns);
@@ -149,28 +137,12 @@ static void delayed_free_pidns(struct rcu_head *p)
static void destroy_pid_namespace(struct pid_namespace *ns)
{
- unregister_pidns_sysctls(ns);
-
ns_free_inum(&ns->ns);
idr_destroy(&ns->idr);
call_rcu(&ns->rcu, delayed_free_pidns);
}
-static void destroy_pid_namespace_work(struct work_struct *work)
-{
- struct pid_namespace *ns =
- container_of(work, struct pid_namespace, work);
-
- do {
- struct pid_namespace *parent;
-
- parent = ns->parent;
- destroy_pid_namespace(ns);
- ns = parent;
- } while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count));
-}
-
struct pid_namespace *copy_pid_ns(unsigned long flags,
struct user_namespace *user_ns, struct pid_namespace *old_ns)
{
@@ -183,8 +155,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
void put_pid_ns(struct pid_namespace *ns)
{
- if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count))
- schedule_work(&ns->work);
+ struct pid_namespace *parent;
+
+ while (ns != &init_pid_ns) {
+ parent = ns->parent;
+ if (!refcount_dec_and_test(&ns->ns.count))
+ break;
+ destroy_pid_namespace(ns);
+ ns = parent;
+ }
}
EXPORT_SYMBOL_GPL(put_pid_ns);
@@ -295,7 +274,6 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
next = idr_get_cursor(&pid_ns->idr) - 1;
tmp.data = &next;
- tmp.extra2 = &pid_ns->pid_max;
ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
if (!ret && write)
idr_set_cursor(&pid_ns->idr, next + 1);
@@ -303,6 +281,7 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
return ret;
}
+extern int pid_max;
static const struct ctl_table pid_ns_ctl_table[] = {
{
.procname = "ns_last_pid",
@@ -310,7 +289,7 @@ static const struct ctl_table pid_ns_ctl_table[] = {
.mode = 0666, /* permissions are checked in the handler */
.proc_handler = pid_ns_ctl_handler,
.extra1 = SYSCTL_ZERO,
- .extra2 = &init_pid_ns.pid_max,
+ .extra2 = &pid_max,
},
};
#endif /* CONFIG_CHECKPOINT_RESTORE */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index cb57da499ebb1..bb739608680f2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1803,6 +1803,15 @@ static const struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
#endif
+ {
+ .procname = "pid_max",
+ .data = &pid_max,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &pid_max_min,
+ .extra2 = &pid_max_max,
+ },
{
.procname = "panic_on_oops",
.data = &panic_on_oops,
diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
index c62b9b3cfb3d8..4966e6bbdf6f3 100644
--- a/kernel/trace/pid_list.c
+++ b/kernel/trace/pid_list.c
@@ -414,7 +414,7 @@ struct trace_pid_list *trace_pid_list_alloc(void)
int i;
/* According to linux/thread.h, pids can be no bigger that 30 bits */
- WARN_ON_ONCE(init_pid_ns.pid_max > (1 << 30));
+ WARN_ON_ONCE(pid_max > (1 << 30));
pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
if (!pid_list)
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 9c21ba45b7af6..46c65402ad7e5 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -732,6 +732,8 @@ extern unsigned long tracing_thresh;
/* PID filtering */
+extern int pid_max;
+
bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
pid_t search_pid);
bool trace_ignore_this_task(struct trace_pid_list *filtered_pids,
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index cb49f7279dc80..573b5d8e8a28e 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -442,7 +442,7 @@ int trace_alloc_tgid_map(void)
if (tgid_map)
return 0;
- tgid_map_max = init_pid_ns.pid_max;
+ tgid_map_max = pid_max;
map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
GFP_KERNEL);
if (!map)
--
2.48.1
* [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
@ 2025-02-21 17:02 ` Michal Koutný
2025-02-22 0:18 ` Andrew Morton
` (2 more replies)
1 sibling, 3 replies; 12+ messages in thread
From: Michal Koutný @ 2025-02-21 17:02 UTC
To: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
linux-fsdevel, linux-trace-kernel
Cc: Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
Michal Koutný, Oleg Nesterov
No one would normally need to use this allocation strategy (it's slower
and pid numbers collide sooner). Its primary purpose is use in pid
namespaces in conjunction with the pids.max cgroup limit, which keeps
(virtual) pid numbers below the given limit. This is for 32-bit
userspace programs that may not work well with pid numbers above 65536.
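For illustration, a container manager could switch a namespace to first-fit numbering by writing -1 to the ns_last_pid knob. The sketch below is hedged: the -1 semantics exist only with this patch applied, the helper names `write_sysctl_int()` and `read_sysctl_int()` are hypothetical, and the caller is assumed to pass /proc/sys/kernel/ns_last_pid from inside the target pid namespace with sufficient privileges.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Write `value` as decimal text to a sysctl-style file; 0 or -errno. */
static int write_sysctl_int(const char *path, int value)
{
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", value) < 0) {
		fclose(f);
		return -EIO;
	}
	/* The kernel reports a rejected value when the buffer is flushed. */
	if (fclose(f) != 0)
		return -errno;
	return 0;
}

/* Read the integer back; 0 on success, negative error otherwise. */
static int read_sysctl_int(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "r");
	if (!f)
		return -errno;
	ok = fscanf(f, "%d", value) == 1;
	fclose(f);
	return ok ? 0 : -EINVAL;
}
```

With the patch as posted, `write_sysctl_int("/proc/sys/kernel/ns_last_pid", -1)` would be expected to enable first-fit numbering, and a later read would return -1 while that mode is active; on a kernel without the patch the write should fail because -1 is below the accepted minimum.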
Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 2 ++
include/linux/pid_namespace.h | 3 +++
kernel/pid.c | 12 +++++++--
kernel/pid_namespace.c | 28 +++++++++++++++------
4 files changed, 36 insertions(+), 9 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index a43b78b4b6464..f5e68d1c8849f 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
lives in) pid namespace. When selecting a pid for a next task on fork
kernel tries to allocate a number starting from this one.
+When set to -1, first-fit pid numbering is used instead of the next-fit.
+
powersave-nap (PPC only)
========================
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index f9f9931e02d6a..10bf66ca78590 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -41,6 +41,9 @@ struct pid_namespace {
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
int memfd_noexec_scope;
#endif
+#ifdef CONFIG_IA32_EMULATION
+ bool pid_noncyclic;
+#endif
} __randomize_layout;
extern struct pid_namespace init_pid_ns;
diff --git a/kernel/pid.c b/kernel/pid.c
index aa2a7d4da4555..e9da1662b8821 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
for (i = ns->level; i >= 0; i--) {
int tid = 0;
+ bool pid_noncyclic = 0;
+#ifdef CONFIG_IA32_EMULATION
+ pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
+#endif
if (set_tid_size) {
tid = set_tid[ns->level - i];
@@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
* Store a null pointer so find_pid_ns does not find
* a partially initialized PID (see below).
*/
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ if (likely(!pid_noncyclic))
+ nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+ pid_max, GFP_ATOMIC);
+ else
+ nr = idr_alloc(&tmp->idr, NULL, pid_min,
+ pid_max, GFP_ATOMIC);
}
spin_unlock_irq(&pidmap_lock);
idr_preload_end();
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0f23285be4f92..ceda94a064294 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->pid_allocated = PIDNS_ADDING;
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
+#endif
+#ifdef CONFIG_IA32_EMULATION
+ ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
#endif
return ns;
@@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
return;
}
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
@@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
return -EPERM;
- next = idr_get_cursor(&pid_ns->idr) - 1;
+ next = -1;
+#ifdef CONFIG_IA32_EMULATION
+ if (!pid_ns->pid_noncyclic)
+#endif
+ next += idr_get_cursor(&pid_ns->idr);
tmp.data = &next;
ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
- if (!ret && write)
- idr_set_cursor(&pid_ns->idr, next + 1);
+ if (!ret && write) {
+ if (next > -1)
+ idr_set_cursor(&pid_ns->idr, next + 1);
+ else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
+ ret = -EINVAL;
+#ifdef CONFIG_IA32_EMULATION
+ WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
+#endif
+ }
return ret;
}
@@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
.maxlen = sizeof(int),
.mode = 0666, /* permissions are checked in the handler */
.proc_handler = pid_ns_ctl_handler,
- .extra1 = SYSCTL_ZERO,
+ .extra1 = SYSCTL_NEG_ONE,
.extra2 = &pid_max,
},
};
-#endif /* CONFIG_CHECKPOINT_RESTORE */
+#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
{
@@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
{
pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
-#ifdef CONFIG_CHECKPOINT_RESTORE
+#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
register_sysctl_init("kernel", pid_ns_ctl_table);
#endif
--
2.48.1
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
@ 2025-02-22 0:18 ` Andrew Morton
2025-02-22 9:02 ` David Laight
2025-03-05 15:04 ` Michal Koutný
2025-02-25 17:30 ` Alexander Mikhalitsyn
2025-03-06 8:59 ` Christian Brauner
2 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2025-02-22 0:18 UTC
To: Michal Koutný
Cc: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
linux-fsdevel, linux-trace-kernel, Jonathan Corbet, Kees Cook,
Eric W . Biederman, Oleg Nesterov
On Fri, 21 Feb 2025 18:02:49 +0100 Michal Koutný <mkoutny@suse.com> wrote:
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
> lives in) pid namespace. When selecting a pid for a next task on fork
> kernel tries to allocate a number starting from this one.
>
> +When set to -1, first-fit pid numbering is used instead of the next-fit.
> +
This seems thin. Is there more we can tell our users? What are the
visible effects of this? What are the benefits? Why would they want
to turn it on?
I mean, there are veritable paragraphs in the changelogs, but just a
single line in the user-facing docs. Seems there could be more...
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-22 0:18 ` Andrew Morton
@ 2025-02-22 9:02 ` David Laight
2025-03-05 15:01 ` Michal Koutný
2025-03-05 15:04 ` Michal Koutný
1 sibling, 1 reply; 12+ messages in thread
From: David Laight @ 2025-02-22 9:02 UTC
To: Andrew Morton
Cc: Michal Koutný, Christian Brauner, Alexander Mikhalitsyn,
linux-doc, linux-kernel, linux-fsdevel, linux-trace-kernel,
Jonathan Corbet, Kees Cook, Eric W . Biederman, Oleg Nesterov
On Fri, 21 Feb 2025 16:18:54 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Fri, 21 Feb 2025 18:02:49 +0100 Michal Koutný <mkoutny@suse.com> wrote:
>
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
> > lives in) pid namespace. When selecting a pid for a next task on fork
> > kernel tries to allocate a number starting from this one.
> >
> > +When set to -1, first-fit pid numbering is used instead of the next-fit.
> > +
>
> This seems thin. Is there more we can tell our users? What are the
> visible effects of this? What are the benefits? Why would they want
> to turn it on?
>
> I mean, there are veritable paragraphs in the changelogs, but just a
> single line in the user-facing docs. Seems there could be more...
It also seems a good way of being able to predict the next pid and
doing all the 'nasty' things that allows because there is no guard
time on pid reuse.
Both first-fit and next-fit have the same issue.
Picking a random pid is better.
Or pick the pid after finding an empty slot in the 'hash' table.
Then you guarantee O(1) lookup and can easily stop pids being reused
quickly.
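One possible reading of the slot-based idea, as a minimal sketch: the number encodes a table slot plus a per-slot generation, so lookup is a single index operation and a freed slot yields a different number on its next use. Everything here (`struct pid_slot`, `slot_alloc()`, `slot_lookup()`, `slot_free()`, the toy sizes) is an assumption made for illustration, not existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS    8	/* toy table size */
#define GEN_LIMIT 4	/* generations before a slot's numbers repeat */

struct pid_slot {
	bool used;
	int gen;	/* bumped on every free */
};

static struct pid_slot slots[NSLOTS];

/* Take the first free slot; the number encodes slot and generation. */
static int slot_alloc(void)
{
	for (int i = 1; i < NSLOTS; i++) {	/* slot 0 unused: 0 is never a pid */
		if (!slots[i].used) {
			slots[i].used = true;
			return slots[i].gen * NSLOTS + i;
		}
	}
	return -1;
}

/* O(1) lookup: the slot index is pid % NSLOTS. */
static bool slot_lookup(int pid)
{
	const struct pid_slot *s;

	if (pid <= 0)
		return false;
	s = &slots[pid % NSLOTS];
	return s->used && s->gen == pid / NSLOTS;
}

/* Release the slot and advance its generation to delay number reuse. */
static void slot_free(int pid)
{
	struct pid_slot *s = &slots[pid % NSLOTS];

	assert(slot_lookup(pid));
	s->used = false;
	s->gen = (s->gen + 1) % GEN_LIMIT;
}
```

In this toy version the same number comes back only after a slot has been recycled GEN_LIMIT times, which acts as a crude guard against quick reuse; the numbers grow with the generation, though, so it does not by itself keep pids small.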
David
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
2025-02-22 0:18 ` Andrew Morton
@ 2025-02-25 17:30 ` Alexander Mikhalitsyn
2025-03-06 8:59 ` Christian Brauner
2 siblings, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2025-02-25 17:30 UTC
To: Michal Koutný
Cc: Christian Brauner, linux-doc, linux-kernel, linux-fsdevel,
linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
Eric W . Biederman, Oleg Nesterov
On Fri, 21 Feb 2025 at 18:02, Michal Koutný <mkoutny@suse.com> wrote:
>
> No one would need to use this allocation strategy (it's slower, pid
> numbers collide sooner). Its primary purpose is pid namespaces in
> conjunction with pids.max cgroup limit which keeps (virtual) pid numbers
> below the given limit. This is for 32-bit userspace programs that may
> not work well with pid numbers above 65536.
>
> Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
Dear Michal,
sorry for the long delay in replying to your patches.
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 2 ++
> include/linux/pid_namespace.h | 3 +++
> kernel/pid.c | 12 +++++++--
> kernel/pid_namespace.c | 28 +++++++++++++++------
> 4 files changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index a43b78b4b6464..f5e68d1c8849f 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
> lives in) pid namespace. When selecting a pid for a next task on fork
> kernel tries to allocate a number starting from this one.
>
> +When set to -1, first-fit pid numbering is used instead of the next-fit.
> +
>
> powersave-nap (PPC only)
> ========================
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index f9f9931e02d6a..10bf66ca78590 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -41,6 +41,9 @@ struct pid_namespace {
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> int memfd_noexec_scope;
> #endif
> +#ifdef CONFIG_IA32_EMULATION
Unfortunately, this does not work for our use case as it's x86-specific.
In the original cover letter [1] it was written:
>In any case, there are workloads that have expectations about how large
>pid numbers they accept. Either for historical reasons or architectural
>reasons. One concrete example is the 32-bit version of Android's bionic
>libc which requires pid numbers less than 65536. There are workloads
>where it is run in a 32-bit container on a 64-bit kernel. If the host
I have also just confirmed with folks from Canonical who work on Anbox
(an Android-in-container project)
that they use Arm machines (both armhf and arm64). One of the reasons
for adding this feature is to
make the legacy 32-bit Android Bionic libc work [2].
[1] https://lore.kernel.org/all/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
[2] https://android.googlesource.com/platform/bionic.git/+/HEAD/docs/32-bit-abi.md#is-too-small-for-large-pids
Kind regards,
Alex
> + bool pid_noncyclic;
> +#endif
> } __randomize_layout;
>
> extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index aa2a7d4da4555..e9da1662b8821 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> for (i = ns->level; i >= 0; i--) {
> int tid = 0;
> + bool pid_noncyclic = 0;
> +#ifdef CONFIG_IA32_EMULATION
> + pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
> +#endif
>
> if (set_tid_size) {
> tid = set_tid[ns->level - i];
> @@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> * Store a null pointer so find_pid_ns does not find
> * a partially initialized PID (see below).
> */
> - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> - pid_max, GFP_ATOMIC);
> + if (likely(!pid_noncyclic))
> + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> + pid_max, GFP_ATOMIC);
> + else
> + nr = idr_alloc(&tmp->idr, NULL, pid_min,
> + pid_max, GFP_ATOMIC);
> }
> spin_unlock_irq(&pidmap_lock);
> idr_preload_end();
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0f23285be4f92..ceda94a064294 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> ns->pid_allocated = PIDNS_ADDING;
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
> +#endif
> +#ifdef CONFIG_IA32_EMULATION
> + ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
> #endif
> return ns;
>
> @@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> return;
> }
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
> static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> void *buffer, size_t *lenp, loff_t *ppos)
> {
> @@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
> return -EPERM;
>
> - next = idr_get_cursor(&pid_ns->idr) - 1;
> + next = -1;
> +#ifdef CONFIG_IA32_EMULATION
> + if (!pid_ns->pid_noncyclic)
> +#endif
> + next += idr_get_cursor(&pid_ns->idr);
>
> tmp.data = &next;
> ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> - if (!ret && write)
> - idr_set_cursor(&pid_ns->idr, next + 1);
> + if (!ret && write) {
> + if (next > -1)
> + idr_set_cursor(&pid_ns->idr, next + 1);
> + else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
> + ret = -EINVAL;
> +#ifdef CONFIG_IA32_EMULATION
> + WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
> +#endif
> + }
>
> return ret;
> }
> @@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
> .maxlen = sizeof(int),
> .mode = 0666, /* permissions are checked in the handler */
> .proc_handler = pid_ns_ctl_handler,
> - .extra1 = SYSCTL_ZERO,
> + .extra1 = SYSCTL_NEG_ONE,
> .extra2 = &pid_max,
> },
> };
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
>
> int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> {
> @@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
> {
> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
> register_sysctl_init("kernel", pid_ns_ctl_table);
> #endif
>
> --
> 2.48.1
>
* Re: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
@ 2025-02-25 17:36 ` Alexander Mikhalitsyn
2025-03-10 7:32 ` kernel test robot
1 sibling, 0 replies; 12+ messages in thread
From: Alexander Mikhalitsyn @ 2025-02-25 17:36 UTC
To: Michal Koutný
Cc: Christian Brauner, linux-doc, linux-kernel, linux-fsdevel,
linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
Eric W . Biederman, Oleg Nesterov
On Fri, Feb 21, 2025 at 18:02, Michal Koutný <mkoutny@suse.com> wrote:
>
> This reverts commit 7863dcc72d0f4b13a641065670426435448b3d80.
If we revert this one, then we should also revert a corresponding kselftest:
https://github.com/torvalds/linux/commit/615ab43b838bb982dc234feff75ee9ad35447c5d
>
> It is already difficult for users to troubleshoot which of multiple pid
> limits restricts their workload. I'm afraid making pid_max
> per-(hierarchical-)NS will contribute to confusion.
> Also, the implementation copies the limit from the parent upon
> creation. This pattern proved cumbersome with some attributes in legacy
> cgroup controllers: it is subject to a race condition between the
> parent's limit modification and children creation, and once copied, the
> limit must be changed in the descendant.
>
> This is very similar to what pids.max of a cgroup (already) does, which
> can be used as an alternative.
>
> Link: https://lore.kernel.org/r/bnxhqrq7tip6jl2hu6jsvxxogdfii7ugmafbhgsogovrchxfyp@kagotkztqurt/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
> include/linux/pid.h | 3 +
> include/linux/pid_namespace.h | 10 +--
> kernel/pid.c | 125 ++----------------------------
> kernel/pid_namespace.c | 43 +++-------
> kernel/sysctl.c | 9 +++
> kernel/trace/pid_list.c | 2 +-
> kernel/trace/trace.h | 2 +
> kernel/trace/trace_sched_switch.c | 2 +-
> 8 files changed, 35 insertions(+), 161 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 98837a1ff0f33..fe575fcdb4afa 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -108,6 +108,9 @@ extern void exchange_tids(struct task_struct *task, struct task_struct *old);
> extern void transfer_pid(struct task_struct *old, struct task_struct *new,
> enum pid_type);
>
> +extern int pid_max;
> +extern int pid_max_min, pid_max_max;
> +
> /*
> * look up a PID in the hash table. Must be called with the tasklist_lock
> * or rcu_read_lock() held.
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index 7c67a58111998..f9f9931e02d6a 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -30,7 +30,6 @@ struct pid_namespace {
> struct task_struct *child_reaper;
> struct kmem_cache *pid_cachep;
> unsigned int level;
> - int pid_max;
> struct pid_namespace *parent;
> #ifdef CONFIG_BSD_PROCESS_ACCT
> struct fs_pin *bacct;
> @@ -39,14 +38,9 @@ struct pid_namespace {
> struct ucounts *ucounts;
> int reboot; /* group exit code if this pidns was rebooted */
> struct ns_common ns;
> - struct work_struct work;
> -#ifdef CONFIG_SYSCTL
> - struct ctl_table_set set;
> - struct ctl_table_header *sysctls;
> -#if defined(CONFIG_MEMFD_CREATE)
> +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> int memfd_noexec_scope;
> #endif
> -#endif
> } __randomize_layout;
>
> extern struct pid_namespace init_pid_ns;
> @@ -123,8 +117,6 @@ static inline int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> extern struct pid_namespace *task_active_pid_ns(struct task_struct *tsk);
> void pidhash_init(void);
> void pid_idr_init(void);
> -int register_pidns_sysctls(struct pid_namespace *pidns);
> -void unregister_pidns_sysctls(struct pid_namespace *pidns);
>
> static inline bool task_is_in_init_pid_ns(struct task_struct *tsk)
> {
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 924084713be8b..aa2a7d4da4555 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -61,8 +61,10 @@ struct pid init_struct_pid = {
> }, }
> };
>
> -static int pid_max_min = RESERVED_PIDS + 1;
> -static int pid_max_max = PID_MAX_LIMIT;
> +int pid_max = PID_MAX_DEFAULT;
> +
> +int pid_max_min = RESERVED_PIDS + 1;
> +int pid_max_max = PID_MAX_LIMIT;
>
> /*
> * PID-map pages start out as NULL, they get allocated upon
> @@ -81,7 +83,6 @@ struct pid_namespace init_pid_ns = {
> #ifdef CONFIG_PID_NS
> .ns.ops = &pidns_operations,
> #endif
> - .pid_max = PID_MAX_DEFAULT,
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> .memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,
> #endif
> @@ -190,7 +191,6 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> for (i = ns->level; i >= 0; i--) {
> int tid = 0;
> - int pid_max = READ_ONCE(tmp->pid_max);
>
> if (set_tid_size) {
> tid = set_tid[ns->level - i];
> @@ -644,118 +644,17 @@ SYSCALL_DEFINE2(pidfd_open, pid_t, pid, unsigned int, flags)
> return fd;
> }
>
> -#ifdef CONFIG_SYSCTL
> -static struct ctl_table_set *pid_table_root_lookup(struct ctl_table_root *root)
> -{
> - return &task_active_pid_ns(current)->set;
> -}
> -
> -static int set_is_seen(struct ctl_table_set *set)
> -{
> - return &task_active_pid_ns(current)->set == set;
> -}
> -
> -static int pid_table_root_permissions(struct ctl_table_header *head,
> - const struct ctl_table *table)
> -{
> - struct pid_namespace *pidns =
> - container_of(head->set, struct pid_namespace, set);
> - int mode = table->mode;
> -
> - if (ns_capable(pidns->user_ns, CAP_SYS_ADMIN) ||
> - uid_eq(current_euid(), make_kuid(pidns->user_ns, 0)))
> - mode = (mode & S_IRWXU) >> 6;
> - else if (in_egroup_p(make_kgid(pidns->user_ns, 0)))
> - mode = (mode & S_IRWXG) >> 3;
> - else
> - mode = mode & S_IROTH;
> - return (mode << 6) | (mode << 3) | mode;
> -}
> -
> -static void pid_table_root_set_ownership(struct ctl_table_header *head,
> - kuid_t *uid, kgid_t *gid)
> -{
> - struct pid_namespace *pidns =
> - container_of(head->set, struct pid_namespace, set);
> - kuid_t ns_root_uid;
> - kgid_t ns_root_gid;
> -
> - ns_root_uid = make_kuid(pidns->user_ns, 0);
> - if (uid_valid(ns_root_uid))
> - *uid = ns_root_uid;
> -
> - ns_root_gid = make_kgid(pidns->user_ns, 0);
> - if (gid_valid(ns_root_gid))
> - *gid = ns_root_gid;
> -}
> -
> -static struct ctl_table_root pid_table_root = {
> - .lookup = pid_table_root_lookup,
> - .permissions = pid_table_root_permissions,
> - .set_ownership = pid_table_root_set_ownership,
> -};
> -
> -static const struct ctl_table pid_table[] = {
> - {
> - .procname = "pid_max",
> - .data = &init_pid_ns.pid_max,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec_minmax,
> - .extra1 = &pid_max_min,
> - .extra2 = &pid_max_max,
> - },
> -};
> -#endif
> -
> -int register_pidns_sysctls(struct pid_namespace *pidns)
> -{
> -#ifdef CONFIG_SYSCTL
> - struct ctl_table *tbl;
> -
> - setup_sysctl_set(&pidns->set, &pid_table_root, set_is_seen);
> -
> - tbl = kmemdup(pid_table, sizeof(pid_table), GFP_KERNEL);
> - if (!tbl)
> - return -ENOMEM;
> - tbl->data = &pidns->pid_max;
> - pidns->pid_max = min(pid_max_max, max_t(int, pidns->pid_max,
> - PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
> -
> - pidns->sysctls = __register_sysctl_table(&pidns->set, "kernel", tbl,
> - ARRAY_SIZE(pid_table));
> - if (!pidns->sysctls) {
> - kfree(tbl);
> - retire_sysctl_set(&pidns->set);
> - return -ENOMEM;
> - }
> -#endif
> - return 0;
> -}
> -
> -void unregister_pidns_sysctls(struct pid_namespace *pidns)
> -{
> -#ifdef CONFIG_SYSCTL
> - const struct ctl_table *tbl;
> -
> - tbl = pidns->sysctls->ctl_table_arg;
> - unregister_sysctl_table(pidns->sysctls);
> - retire_sysctl_set(&pidns->set);
> - kfree(tbl);
> -#endif
> -}
> -
> void __init pid_idr_init(void)
> {
> /* Verify no one has done anything silly: */
> BUILD_BUG_ON(PID_MAX_LIMIT >= PIDNS_ADDING);
>
> /* bump default and minimum pid_max based on number of cpus */
> - init_pid_ns.pid_max = min(pid_max_max, max_t(int, init_pid_ns.pid_max,
> - PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
> + pid_max = min(pid_max_max, max_t(int, pid_max,
> + PIDS_PER_CPU_DEFAULT * num_possible_cpus()));
> pid_max_min = max_t(int, pid_max_min,
> PIDS_PER_CPU_MIN * num_possible_cpus());
> - pr_info("pid_max: default: %u minimum: %u\n", init_pid_ns.pid_max, pid_max_min);
> + pr_info("pid_max: default: %u minimum: %u\n", pid_max, pid_max_min);
>
> idr_init(&init_pid_ns.idr);
>
> @@ -766,16 +665,6 @@ void __init pid_idr_init(void)
> NULL);
> }
>
> -static __init int pid_namespace_sysctl_init(void)
> -{
> -#ifdef CONFIG_SYSCTL
> - /* "kernel" directory will have already been initialized. */
> - BUG_ON(register_pidns_sysctls(&init_pid_ns));
> -#endif
> - return 0;
> -}
> -subsys_initcall(pid_namespace_sysctl_init);
> -
> static struct file *__pidfd_fget(struct task_struct *task, int fd)
> {
> struct file *file;
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 8f6cfec87555a..0f23285be4f92 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -70,8 +70,6 @@ static void dec_pid_namespaces(struct ucounts *ucounts)
> dec_ucount(ucounts, UCOUNT_PID_NAMESPACES);
> }
>
> -static void destroy_pid_namespace_work(struct work_struct *work);
> -
> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
> struct pid_namespace *parent_pid_ns)
> {
> @@ -107,27 +105,17 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> goto out_free_idr;
> ns->ns.ops = &pidns_operations;
>
> - ns->pid_max = parent_pid_ns->pid_max;
> - err = register_pidns_sysctls(ns);
> - if (err)
> - goto out_free_inum;
> -
> refcount_set(&ns->ns.count, 1);
> ns->level = level;
> ns->parent = get_pid_ns(parent_pid_ns);
> ns->user_ns = get_user_ns(user_ns);
> ns->ucounts = ucounts;
> ns->pid_allocated = PIDNS_ADDING;
> - INIT_WORK(&ns->work, destroy_pid_namespace_work);
> -
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
> #endif
> -
> return ns;
>
> -out_free_inum:
> - ns_free_inum(&ns->ns);
> out_free_idr:
> idr_destroy(&ns->idr);
> kmem_cache_free(pid_ns_cachep, ns);
> @@ -149,28 +137,12 @@ static void delayed_free_pidns(struct rcu_head *p)
>
> static void destroy_pid_namespace(struct pid_namespace *ns)
> {
> - unregister_pidns_sysctls(ns);
> -
> ns_free_inum(&ns->ns);
>
> idr_destroy(&ns->idr);
> call_rcu(&ns->rcu, delayed_free_pidns);
> }
>
> -static void destroy_pid_namespace_work(struct work_struct *work)
> -{
> - struct pid_namespace *ns =
> - container_of(work, struct pid_namespace, work);
> -
> - do {
> - struct pid_namespace *parent;
> -
> - parent = ns->parent;
> - destroy_pid_namespace(ns);
> - ns = parent;
> - } while (ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count));
> -}
> -
> struct pid_namespace *copy_pid_ns(unsigned long flags,
> struct user_namespace *user_ns, struct pid_namespace *old_ns)
> {
> @@ -183,8 +155,15 @@ struct pid_namespace *copy_pid_ns(unsigned long flags,
>
> void put_pid_ns(struct pid_namespace *ns)
> {
> - if (ns && ns != &init_pid_ns && refcount_dec_and_test(&ns->ns.count))
> - schedule_work(&ns->work);
> + struct pid_namespace *parent;
> +
> + while (ns != &init_pid_ns) {
> + parent = ns->parent;
> + if (!refcount_dec_and_test(&ns->ns.count))
> + break;
> + destroy_pid_namespace(ns);
> + ns = parent;
> + }
> }
> EXPORT_SYMBOL_GPL(put_pid_ns);
>
> @@ -295,7 +274,6 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> next = idr_get_cursor(&pid_ns->idr) - 1;
>
> tmp.data = &next;
> - tmp.extra2 = &pid_ns->pid_max;
> ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> if (!ret && write)
> idr_set_cursor(&pid_ns->idr, next + 1);
> @@ -303,6 +281,7 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> return ret;
> }
>
> +extern int pid_max;
> static const struct ctl_table pid_ns_ctl_table[] = {
> {
> .procname = "ns_last_pid",
> @@ -310,7 +289,7 @@ static const struct ctl_table pid_ns_ctl_table[] = {
> .mode = 0666, /* permissions are checked in the handler */
> .proc_handler = pid_ns_ctl_handler,
> .extra1 = SYSCTL_ZERO,
> - .extra2 = &init_pid_ns.pid_max,
> + .extra2 = &pid_max,
> },
> };
> #endif /* CONFIG_CHECKPOINT_RESTORE */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index cb57da499ebb1..bb739608680f2 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1803,6 +1803,15 @@ static const struct ctl_table kern_table[] = {
> .proc_handler = proc_dointvec,
> },
> #endif
> + {
> + .procname = "pid_max",
> + .data = &pid_max,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &pid_max_min,
> + .extra2 = &pid_max_max,
> + },
> {
> .procname = "panic_on_oops",
> .data = &panic_on_oops,
> diff --git a/kernel/trace/pid_list.c b/kernel/trace/pid_list.c
> index c62b9b3cfb3d8..4966e6bbdf6f3 100644
> --- a/kernel/trace/pid_list.c
> +++ b/kernel/trace/pid_list.c
> @@ -414,7 +414,7 @@ struct trace_pid_list *trace_pid_list_alloc(void)
> int i;
>
> /* According to linux/thread.h, pids can be no bigger that 30 bits */
> - WARN_ON_ONCE(init_pid_ns.pid_max > (1 << 30));
> + WARN_ON_ONCE(pid_max > (1 << 30));
>
> pid_list = kzalloc(sizeof(*pid_list), GFP_KERNEL);
> if (!pid_list)
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 9c21ba45b7af6..46c65402ad7e5 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -732,6 +732,8 @@ extern unsigned long tracing_thresh;
>
> /* PID filtering */
>
> +extern int pid_max;
> +
> bool trace_find_filtered_pid(struct trace_pid_list *filtered_pids,
> pid_t search_pid);
> bool trace_ignore_this_task(struct trace_pid_list *filtered_pids,
> diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
> index cb49f7279dc80..573b5d8e8a28e 100644
> --- a/kernel/trace/trace_sched_switch.c
> +++ b/kernel/trace/trace_sched_switch.c
> @@ -442,7 +442,7 @@ int trace_alloc_tgid_map(void)
> if (tgid_map)
> return 0;
>
> - tgid_map_max = init_pid_ns.pid_max;
> + tgid_map_max = pid_max;
> map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
> GFP_KERNEL);
> if (!map)
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-22 9:02 ` David Laight
@ 2025-03-05 15:01 ` Michal Koutný
0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-05 15:01 UTC (permalink / raw)
To: David Laight
Cc: Andrew Morton, Christian Brauner, Alexander Mikhalitsyn,
linux-doc, linux-kernel, linux-fsdevel, linux-trace-kernel,
Jonathan Corbet, Kees Cook, Eric W . Biederman, Oleg Nesterov
[-- Attachment #1: Type: text/plain, Size: 612 bytes --]
On Sat, Feb 22, 2025 at 09:02:08AM +0000, David Laight <david.laight.linux@gmail.com> wrote:
> It also seems a good way of being able to predict the next pid and
> doing all the 'nasty' things that allows because there is no guard
> time on pid reuse.
The motivation was not to make guessing the next pid more difficult; I'll
update the docs with a better explanation.
> Both first-fit and next-fit have the same issue.
> Picking a random pid is better.
I surely don't want to delve into this now. (I acknowledge that having a
possible range specified per pid ns would be useful for such a
randomization.)
Michal
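The difference between the two strategies discussed above can be sketched in
userspace. This is only an illustrative model, not the kernel's IDR code: the
names `pid_space`, `alloc_first_fit`, `alloc_next_fit`, `highest_pid`, the tiny
`PID_MAX` and the `TASK_CAP` constant (standing in for a pids.max-style cap on
live tasks) are all assumptions made for the example. It replays a workload in
which the oldest task exits before each new fork once the cap is reached, and
reports the highest number handed out.

```c
#include <stdbool.h>
#include <string.h>

#define PID_MIN   1
#define PID_MAX   1024  /* hypothetical small pid space, exclusive bound */
#define TASK_CAP  8     /* hypothetical cap on concurrently live tasks */

struct pid_space {
	bool used[PID_MAX];
	int cursor;     /* where the next-fit search starts */
};

/* First-fit: always take the lowest free number. */
static int alloc_first_fit(struct pid_space *ps)
{
	for (int nr = PID_MIN; nr < PID_MAX; nr++) {
		if (!ps->used[nr]) {
			ps->used[nr] = true;
			return nr;
		}
	}
	return -1;
}

/* Next-fit: search from the cursor and wrap around at most once. */
static int alloc_next_fit(struct pid_space *ps)
{
	int span = PID_MAX - PID_MIN;

	for (int i = 0; i < span; i++) {
		int nr = PID_MIN + (ps->cursor - PID_MIN + i) % span;

		if (!ps->used[nr]) {
			ps->used[nr] = true;
			ps->cursor = (nr + 1 == PID_MAX) ? PID_MIN : nr + 1;
			return nr;
		}
	}
	return -1;
}

/*
 * Fork `forks` tasks while keeping at most TASK_CAP alive (the oldest
 * one exits first). Returns the highest number allocated, or -1 if the
 * pid space ran out.
 */
static int highest_pid(bool first_fit, int forks)
{
	struct pid_space ps;
	int live[TASK_CAP];
	int highest = 0;

	memset(&ps, 0, sizeof(ps));
	ps.cursor = PID_MIN;

	for (int f = 0; f < forks; f++) {
		int slot = f % TASK_CAP;
		int nr;

		if (f >= TASK_CAP)
			ps.used[live[slot]] = false; /* oldest task exits */

		nr = first_fit ? alloc_first_fit(&ps) : alloc_next_fit(&ps);
		if (nr < 0)
			return -1;
		live[slot] = nr;
		if (nr > highest)
			highest = nr;
	}
	return highest;
}
```

In this model, `highest_pid(true, 100)` never hands out a number above
`TASK_CAP`, while `highest_pid(false, 100)` keeps climbing with every fork.
Both sequences are fully deterministic, which is the predictability concern
raised in the quoted text.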
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-22 0:18 ` Andrew Morton
2025-02-22 9:02 ` David Laight
@ 2025-03-05 15:04 ` Michal Koutný
1 sibling, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-05 15:04 UTC (permalink / raw)
To: Andrew Morton
Cc: Christian Brauner, Alexander Mikhalitsyn, linux-doc, linux-kernel,
linux-fsdevel, linux-trace-kernel, Jonathan Corbet, Kees Cook,
Eric W . Biederman, Oleg Nesterov
[-- Attachment #1: Type: text/plain, Size: 553 bytes --]
Hi.
On Fri, Feb 21, 2025 at 04:18:54PM -0800, Andrew Morton <akpm@linux-foundation.org> wrote:
> This seems thin. Is there more we can tell our users? What are the
> visible effects of this? What are the benefits? Why would they want
> to turn it on?
Thanks for the review and comments (also to Alexander).
> I mean, there are veritable paragraphs in the changelogs, but just a
> single line in the user-facing docs. Seems there could be more...
I decided not to fiddle with allocation strategies and instead to disable
pid_max in namespaces by default.
Michal
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
2025-02-22 0:18 ` Andrew Morton
2025-02-25 17:30 ` Alexander Mikhalitsyn
@ 2025-03-06 8:59 ` Christian Brauner
2025-03-06 9:09 ` Michal Koutný
2 siblings, 1 reply; 12+ messages in thread
From: Christian Brauner @ 2025-03-06 8:59 UTC (permalink / raw)
To: Michal Koutný
Cc: Alexander Mikhalitsyn, linux-doc, linux-kernel, linux-fsdevel,
linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
Eric W . Biederman, Oleg Nesterov
On Fri, Feb 21, 2025 at 06:02:49PM +0100, Michal Koutný wrote:
> No one would need to use this allocation strategy (it's slower, pid
> numbers collide sooner). Its primary purpose is pid namespaces in
> conjunction with pids.max cgroup limit which keeps (virtual) pid numbers
> below the given limit. This is for 32-bit userspace programs that may
> not work well with pid numbers above 65536.
>
> Link: https://lore.kernel.org/r/20241122132459.135120-1-aleksandr.mikhalitsyn@canonical.com/
> Signed-off-by: Michal Koutný <mkoutny@suse.com>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 2 ++
> include/linux/pid_namespace.h | 3 +++
> kernel/pid.c | 12 +++++++--
> kernel/pid_namespace.c | 28 +++++++++++++++------
> 4 files changed, 36 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index a43b78b4b6464..f5e68d1c8849f 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -1043,6 +1043,8 @@ The last pid allocated in the current (the one task using this sysctl
> lives in) pid namespace. When selecting a pid for a next task on fork
> kernel tries to allocate a number starting from this one.
>
> +When set to -1, first-fit pid numbering is used instead of the next-fit.
I strongly disagree with this approach. This is way worse than making
pid_max per pid namespace.
I'm fine if you come up with something else that's purely based on
cgroups somehow and is uniform across 64-bit and 32-bit. Allowing the
pid allocation strategy to be changed just for 32-bit is not the
solution and not mergeable.
> +
>
> powersave-nap (PPC only)
> ========================
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index f9f9931e02d6a..10bf66ca78590 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -41,6 +41,9 @@ struct pid_namespace {
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> int memfd_noexec_scope;
> #endif
> +#ifdef CONFIG_IA32_EMULATION
> + bool pid_noncyclic;
> +#endif
> } __randomize_layout;
>
> extern struct pid_namespace init_pid_ns;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index aa2a7d4da4555..e9da1662b8821 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -191,6 +191,10 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
> for (i = ns->level; i >= 0; i--) {
> int tid = 0;
> + bool pid_noncyclic = 0;
> +#ifdef CONFIG_IA32_EMULATION
> + pid_noncyclic = READ_ONCE(tmp->pid_noncyclic);
> +#endif
>
> if (set_tid_size) {
> tid = set_tid[ns->level - i];
> @@ -235,8 +239,12 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> * Store a null pointer so find_pid_ns does not find
> * a partially initialized PID (see below).
> */
> - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> - pid_max, GFP_ATOMIC);
> + if (likely(!pid_noncyclic))
> + nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
> + pid_max, GFP_ATOMIC);
> + else
> + nr = idr_alloc(&tmp->idr, NULL, pid_min,
> + pid_max, GFP_ATOMIC);
> }
> spin_unlock_irq(&pidmap_lock);
> idr_preload_end();
> diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
> index 0f23285be4f92..ceda94a064294 100644
> --- a/kernel/pid_namespace.c
> +++ b/kernel/pid_namespace.c
> @@ -113,6 +113,9 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
> ns->pid_allocated = PIDNS_ADDING;
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> ns->memfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);
> +#endif
> +#ifdef CONFIG_IA32_EMULATION
> + ns->pid_noncyclic = READ_ONCE(parent_pid_ns->pid_noncyclic);
> #endif
> return ns;
>
> @@ -260,7 +263,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
> return;
> }
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
> static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> void *buffer, size_t *lenp, loff_t *ppos)
> {
> @@ -271,12 +274,23 @@ static int pid_ns_ctl_handler(const struct ctl_table *table, int write,
> if (write && !checkpoint_restore_ns_capable(pid_ns->user_ns))
> return -EPERM;
>
> - next = idr_get_cursor(&pid_ns->idr) - 1;
> + next = -1;
> +#ifdef CONFIG_IA32_EMULATION
> + if (!pid_ns->pid_noncyclic)
> +#endif
> + next += idr_get_cursor(&pid_ns->idr);
>
> tmp.data = &next;
> ret = proc_dointvec_minmax(&tmp, write, buffer, lenp, ppos);
> - if (!ret && write)
> - idr_set_cursor(&pid_ns->idr, next + 1);
> + if (!ret && write) {
> + if (next > -1)
> + idr_set_cursor(&pid_ns->idr, next + 1);
> + else if (!IS_ENABLED(CONFIG_IA32_EMULATION))
> + ret = -EINVAL;
> +#ifdef CONFIG_IA32_EMULATION
> + WRITE_ONCE(pid_ns->pid_noncyclic, next == -1);
> +#endif
> + }
>
> return ret;
> }
> @@ -288,11 +302,11 @@ static const struct ctl_table pid_ns_ctl_table[] = {
> .maxlen = sizeof(int),
> .mode = 0666, /* permissions are checked in the handler */
> .proc_handler = pid_ns_ctl_handler,
> - .extra1 = SYSCTL_ZERO,
> + .extra1 = SYSCTL_NEG_ONE,
> .extra2 = &pid_max,
> },
> };
> -#endif /* CONFIG_CHECKPOINT_RESTORE */
> +#endif /* CONFIG_CHECKPOINT_RESTORE || CONFIG_IA32_EMULATION */
>
> int reboot_pid_ns(struct pid_namespace *pid_ns, int cmd)
> {
> @@ -449,7 +463,7 @@ static __init int pid_namespaces_init(void)
> {
> pid_ns_cachep = KMEM_CACHE(pid_namespace, SLAB_PANIC | SLAB_ACCOUNT);
>
> -#ifdef CONFIG_CHECKPOINT_RESTORE
> +#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(CONFIG_IA32_EMULATION)
> register_sysctl_init("kernel", pid_ns_ctl_table);
> #endif
>
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] pid: Optional first-fit pid allocation
2025-03-06 8:59 ` Christian Brauner
@ 2025-03-06 9:09 ` Michal Koutný
0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2025-03-06 9:09 UTC (permalink / raw)
To: Christian Brauner
Cc: Alexander Mikhalitsyn, linux-doc, linux-kernel, linux-fsdevel,
linux-trace-kernel, Jonathan Corbet, Kees Cook, Andrew Morton,
Eric W . Biederman, Oleg Nesterov
[-- Attachment #1: Type: text/plain, Size: 560 bytes --]
On Thu, Mar 06, 2025 at 09:59:13AM +0100, Christian Brauner <brauner@kernel.org> wrote:
> I strongly disagree with this approach. This is way worse than making
> pid_max per pid namespace.
Thanks for taking a look.
> I'm fine if you come up with something else that's purely based on
> cgroups somehow and is uniform across 64-bit and 32-bit. Allowing to
> change the pid allocation strategy just for 32-bit is not the solution
> and not mergable.
Here's a minimalist correction:
https://lore.kernel.org/r/20250305145849.55491-1-mkoutny@suse.com/
Michal
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
2025-02-25 17:36 ` Alexander Mikhalitsyn
@ 2025-03-10 7:32 ` kernel test robot
1 sibling, 0 replies; 12+ messages in thread
From: kernel test robot @ 2025-03-10 7:32 UTC (permalink / raw)
To: Michal Koutný
Cc: oe-lkp, lkp, linux-kernel, linux-fsdevel, linux-trace-kernel,
Christian Brauner, Alexander Mikhalitsyn, linux-doc,
Jonathan Corbet, Kees Cook, Andrew Morton, Eric W . Biederman,
Michal Koutný, Oleg Nesterov, oliver.sang
Hello,
kernel test robot noticed a 23.4% improvement of stress-ng.sigxfsz.ops_per_sec on:
commit: ee2a5c3e36093d0ff5709bc8f21d3793cf55f746 ("[PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"")
url: https://github.com/intel-lab-lkp/linux/commits/Michal-Koutn/Revert-pid-allow-pid_max-to-be-set-per-pid-namespace/20250222-010942
patch link: https://lore.kernel.org/all/20250221170249.890014-2-mkoutny@suse.com/
patch subject: [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace"
testcase: stress-ng
config: x86_64-rhel-9.4
compiler: gcc-12
test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
parameters:
nr_threads: 100%
testtime: 60s
test: sigxfsz
cpufreq_governor: performance
In addition to that, the commit also has significant impact on the following tests:
+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.mprotect.ops_per_sec 4.5% improvement |
| test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters | cpufreq_governor=performance |
| | nr_threads=100% |
| | test=mprotect |
| | testtime=60s |
+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sigrt.ops_per_sec 15.7% improvement |
| test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters | cpufreq_governor=performance |
| | nr_threads=100% |
| | test=sigrt |
| | testtime=60s |
+------------------+-------------------------------------------------------------------------------------------+
| testcase: change | stress-ng: stress-ng.sigbus.ops_per_sec 20.6% improvement |
| test machine | 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory |
| test parameters | cpufreq_governor=performance |
| | nr_threads=100% |
| | test=sigbus |
| | testtime=60s |
+------------------+-------------------------------------------------------------------------------------------+
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20250310/202503101532.348576bb-lkp@intel.com
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigxfsz/stress-ng/60s
commit:
3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")
334426094588f817 ee2a5c3e36093d0ff5709bc8f21
---------------- ---------------------------
%stddev %change %stddev
\ | \
5.11 +1.3 6.43 mpstat.cpu.all.usr%
3737 ± 6% -38.8% 2286 ± 42% proc-vmstat.numa_hint_faults_local
1212920 ± 4% -10.4% 1086901 ± 5% sched_debug.cpu.avg_idle.max
35.50 ± 16% -30.0% 24.83 ± 20% perf-c2c.DRAM.local
1517 ± 4% -46.5% 812.17 ± 3% perf-c2c.DRAM.remote
1808 ± 2% +57.0% 2840 perf-c2c.HITM.local
1360 ± 5% -49.9% 680.83 ± 2% perf-c2c.HITM.remote
5.22 ± 3% +19.8% 6.26 ± 7% perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
53.33 ± 15% +25.0% 66.67 ± 15% perf-sched.wait_and_delay.count.__cond_resched.vfs_write.__x64_sys_pwrite64.do_syscall_64.entry_SYSCALL_64_after_hwframe
953.83 ± 3% -16.5% 796.33 ± 7% perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
5.21 ± 3% +20.0% 6.25 ± 7% perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
163515 +27.8% 208915 stress-ng.sigxfsz.SIGXFSZ_signals_per_sec
6.668e+08 +23.4% 8.23e+08 stress-ng.sigxfsz.ops
11113966 +23.4% 13716156 stress-ng.sigxfsz.ops_per_sec
3623 -1.4% 3573 stress-ng.time.system_time
163.26 +31.7% 214.98 stress-ng.time.user_time
0.25 -54.7% 0.12 ± 2% perf-stat.i.MPKI
1.125e+10 +22.1% 1.373e+10 perf-stat.i.branch-instructions
0.54 -0.0 0.50 perf-stat.i.branch-miss-rate%
59748239 +10.9% 66264440 perf-stat.i.branch-misses
33.30 -17.9 15.38 ± 2% perf-stat.i.cache-miss-rate%
13040640 -45.8% 7066419 ± 2% perf-stat.i.cache-misses
39047103 +15.5% 45098530 perf-stat.i.cache-references
4.39 -18.2% 3.59 perf-stat.i.cpi
17823 +97.0% 35113 perf-stat.i.cycles-between-cache-misses
5.144e+10 +22.0% 6.275e+10 perf-stat.i.instructions
0.23 +21.3% 0.28 perf-stat.i.ipc
0.25 -55.6% 0.11 ± 2% perf-stat.overall.MPKI
0.53 -0.0 0.48 perf-stat.overall.branch-miss-rate%
33.40 -17.7 15.67 ± 2% perf-stat.overall.cache-miss-rate%
4.40 -18.0% 3.60 perf-stat.overall.cpi
17350 +84.6% 32027 ± 2% perf-stat.overall.cycles-between-cache-misses
0.23 +22.0% 0.28 perf-stat.overall.ipc
1.106e+10 +22.1% 1.35e+10 perf-stat.ps.branch-instructions
58763534 +10.9% 65180843 perf-stat.ps.branch-misses
12827760 -45.8% 6951883 ± 2% perf-stat.ps.cache-misses
38411225 +15.5% 44365626 perf-stat.ps.cache-references
5.06e+10 +22.0% 6.172e+10 perf-stat.ps.instructions
3.106e+12 +21.9% 3.787e+12 perf-stat.total.instructions
***************************************************************************************************
lkp-icl-2sp7: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp7/mprotect/stress-ng/60s
commit:
3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")
334426094588f817 ee2a5c3e36093d0ff5709bc8f21
---------------- ---------------------------
%stddev %change %stddev
\ | \
10205 ± 25% +33.5% 13621 ± 16% numa-meminfo.node0.KernelStack
0.02 ± 37% -37.8% 0.01 ± 13% perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.82 ± 32% -37.7% 0.51 ± 7% perf-sched.sch_delay.max.ms.schedule_timeout.kcompactd.kthread.ret_from_fork
807.17 ± 5% -8.5% 738.67 ± 5% perf-sched.wait_and_delay.count.__cond_resched.down_write.vma_prepare.__split_vma.vma_modify
433709 +4.9% 454923 ± 5% proc-vmstat.nr_active_anon
61940 ± 3% +31.3% 81315 ± 35% proc-vmstat.nr_shmem
433709 +4.9% 454923 ± 5% proc-vmstat.nr_zone_active_anon
4.903e+08 +4.5% 5.124e+08 stress-ng.mprotect.ops
8163833 +4.5% 8533021 stress-ng.mprotect.ops_per_sec
239.55 +4.7% 250.91 stress-ng.time.user_time
3960356 ± 7% -16.0% 3325457 numa-numastat.node0.local_node
3990670 ± 7% -16.1% 3348370 numa-numastat.node0.numa_hit
2608139 ± 6% +34.5% 3507199 ± 4% numa-numastat.node1.local_node
2644058 ± 6% +34.3% 3550893 ± 4% numa-numastat.node1.numa_hit
3986137 ± 7% -16.0% 3349506 numa-vmstat.node0.numa_hit
3955823 ± 7% -15.9% 3326594 numa-vmstat.node0.numa_local
2639425 ± 6% +34.6% 3552253 ± 4% numa-vmstat.node1.numa_hit
2603506 ± 6% +34.8% 3508559 ± 4% numa-vmstat.node1.numa_local
1.11 ± 20% -38.9% 0.68 ± 31% sched_debug.cfs_rq:/.h_nr_queued.stddev
1.11 ± 19% -38.6% 0.68 ± 31% sched_debug.cfs_rq:/.h_nr_runnable.stddev
5890 ± 6% -10.7% 5262 sched_debug.cfs_rq:/.runnable_avg.max
1064 ± 20% -41.1% 626.67 ± 33% sched_debug.cfs_rq:/.runnable_avg.stddev
1151 -12.2% 1010 sched_debug.cpu.clock_task.stddev
1.11 ± 20% -39.1% 0.68 ± 32% sched_debug.cpu.nr_running.stddev
1.861e+10 +4.5% 1.945e+10 perf-stat.i.branch-instructions
1.264e+08 +4.1% 1.316e+08 perf-stat.i.branch-misses
1.45e+08 +5.3% 1.526e+08 perf-stat.i.cache-references
2.28 -4.3% 2.18 perf-stat.i.cpi
8.533e+10 +4.5% 8.92e+10 perf-stat.i.instructions
0.44 +4.5% 0.46 perf-stat.i.ipc
63.03 +4.5% 65.90 perf-stat.i.metric.K/sec
4035009 +4.5% 4218051 perf-stat.i.page-faults
2.29 -4.4% 2.19 perf-stat.overall.cpi
0.44 +4.6% 0.46 perf-stat.overall.ipc
1.829e+10 +4.5% 1.912e+10 perf-stat.ps.branch-instructions
1.242e+08 +4.1% 1.293e+08 perf-stat.ps.branch-misses
1.424e+08 +5.3% 1.499e+08 perf-stat.ps.cache-references
8.385e+10 +4.6% 8.767e+10 perf-stat.ps.instructions
3966080 +4.6% 4146673 perf-stat.ps.page-faults
5.154e+12 +4.6% 5.389e+12 perf-stat.total.instructions
36.24 -1.9 34.36 ± 2% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.stress_mprotect_mem
38.30 -1.7 36.58 ± 2% perf-profile.calltrace.cycles-pp.stress_mprotect_mem
14.45 ± 2% -1.7 12.80 ± 2% perf-profile.calltrace.cycles-pp.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
17.12 -1.5 15.58 ± 2% perf-profile.calltrace.cycles-pp.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
17.06 -1.5 15.54 ± 2% perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
12.44 ± 2% -1.5 10.92 ± 2% perf-profile.calltrace.cycles-pp.do_dec_rlimit_put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode
12.46 ± 2% -1.5 10.94 ± 2% perf-profile.calltrace.cycles-pp.__sigqueue_free.get_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
0.54 ± 2% -0.1 0.43 ± 44% perf-profile.calltrace.cycles-pp.up_read.__bad_area.bad_area_access_error.exc_page_fault.asm_exc_page_fault
0.84 -0.1 0.75 ± 4% perf-profile.calltrace.cycles-pp.down_write.__split_vma.vma_modify.vma_modify_flags.mprotect_fixup
1.60 -0.1 1.51 ± 2% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.stress_sig_handler
1.59 -0.1 1.51 ± 2% perf-profile.calltrace.cycles-pp.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_sig_handler
0.82 ± 3% -0.1 0.74 ± 2% perf-profile.calltrace.cycles-pp.sigprocmask.__x64_sys_rt_sigprocmask.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
1.44 -0.1 1.37 ± 2% perf-profile.calltrace.cycles-pp.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_sig_handler
1.03 ± 2% -0.1 0.98 perf-profile.calltrace.cycles-pp.__x64_sys_rt_sigprocmask.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
1.29 ± 2% -0.1 1.23 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.pthread_sigmask
0.68 ± 3% -0.0 0.64 ± 2% perf-profile.calltrace.cycles-pp.up_write.vma_complete.__split_vma.vma_modify.vma_modify_flags
0.58 ± 2% -0.0 0.54 ± 3% perf-profile.calltrace.cycles-pp.__bad_area.bad_area_access_error.exc_page_fault.asm_exc_page_fault.stress_mprotect_mem
0.58 ± 2% -0.0 0.56 perf-profile.calltrace.cycles-pp.fpu__clear_user_states.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
0.62 ± 3% +0.1 0.67 ± 2% perf-profile.calltrace.cycles-pp.mas_prev_slot.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.01 +0.1 1.07 perf-profile.calltrace.cycles-pp.copy_fpstate_to_sigframe.get_sigframe.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart
1.23 +0.1 1.30 ± 2% perf-profile.calltrace.cycles-pp.do_user_addr_fault.exc_page_fault.asm_exc_page_fault.stress_mprotect_mem
0.84 ± 3% +0.1 0.91 ± 2% perf-profile.calltrace.cycles-pp.vma_interval_tree_insert.vma_complete.commit_merge.vma_merge_existing_range.vma_modify
0.84 ± 2% +0.1 0.91 perf-profile.calltrace.cycles-pp.mas_preallocate.__split_vma.vma_modify.vma_modify_flags.mprotect_fixup
1.75 ± 2% +0.1 1.83 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__mprotect
0.59 ± 2% +0.1 0.67 ± 2% perf-profile.calltrace.cycles-pp.simple_dname.perf_event_mmap_event.perf_event_mmap.mprotect_fixup.do_mprotect_pkey
2.41 ± 2% +0.1 2.50 perf-profile.calltrace.cycles-pp.clear_bhb_loop.__mprotect
1.77 +0.1 1.88 perf-profile.calltrace.cycles-pp.get_sigframe.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode
2.02 +0.1 2.14 perf-profile.calltrace.cycles-pp.x64_setup_rt_frame.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault
0.98 ± 18% +0.1 1.10 perf-profile.calltrace.cycles-pp.change_protection_range.mprotect_fixup.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64
2.57 +0.1 2.70 perf-profile.calltrace.cycles-pp.handle_signal.arch_do_signal_or_restart.irqentry_exit_to_user_mode.asm_exc_page_fault.stress_mprotect_mem
3.13 ± 3% +0.2 3.34 ± 2% perf-profile.calltrace.cycles-pp.asm_exc_page_fault.__mprotect
0.00 +0.6 0.55 ± 2% perf-profile.calltrace.cycles-pp.prepend_copy.simple_dname.perf_event_mmap_event.perf_event_mmap.mprotect_fixup
34.00 +1.1 35.12 ± 2% perf-profile.calltrace.cycles-pp.mprotect_fixup.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe
46.05 +1.1 47.19 perf-profile.calltrace.cycles-pp.do_mprotect_pkey.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
46.28 +1.2 47.43 perf-profile.calltrace.cycles-pp.__x64_sys_mprotect.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
48.43 +1.2 49.61 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mprotect
48.86 +1.2 50.06 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mprotect
55.84 +1.6 57.41 perf-profile.calltrace.cycles-pp.__mprotect
39.48 -1.9 37.62 ± 2% perf-profile.children.cycles-pp.asm_exc_page_fault
14.48 ± 2% -1.6 12.83 ± 2% perf-profile.children.cycles-pp.get_signal
18.72 -1.6 17.11 perf-profile.children.cycles-pp.irqentry_exit_to_user_mode
39.92 -1.6 38.32 ± 2% perf-profile.children.cycles-pp.stress_mprotect_mem
18.52 -1.6 16.92 perf-profile.children.cycles-pp.arch_do_signal_or_restart
12.47 ± 2% -1.5 10.94 ± 2% perf-profile.children.cycles-pp.__sigqueue_free
12.44 ± 2% -1.5 10.92 ± 2% perf-profile.children.cycles-pp.do_dec_rlimit_put_ucounts
5.00 -0.2 4.83 ± 2% perf-profile.children.cycles-pp.up_write
0.47 ± 10% -0.1 0.34 ± 7% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.47 ± 10% -0.1 0.34 ± 7% perf-profile.children.cycles-pp.hrtimer_interrupt
1.16 ± 3% -0.1 1.05 perf-profile.children.cycles-pp.recalc_sigpending
0.35 ± 7% -0.1 0.24 ± 6% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.89 ± 6% -0.1 0.79 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.34 ± 8% -0.1 0.24 ± 6% perf-profile.children.cycles-pp.tick_nohz_handler
0.86 ± 2% -0.1 0.78 perf-profile.children.cycles-pp.sigprocmask
0.28 ± 10% -0.1 0.21 ± 6% perf-profile.children.cycles-pp.update_process_times
1.05 ± 2% -0.1 0.98 perf-profile.children.cycles-pp.__x64_sys_rt_sigprocmask
0.30 ± 3% -0.0 0.26 ± 3% perf-profile.children.cycles-pp.fpregs_mark_activate
0.17 ± 10% -0.0 0.13 ± 6% perf-profile.children.cycles-pp.sched_tick
0.47 ± 3% -0.0 0.43 ± 3% perf-profile.children.cycles-pp.complete_signal
0.54 ± 2% -0.0 0.51 ± 2% perf-profile.children.cycles-pp.up_read
0.58 ± 2% -0.0 0.55 ± 2% perf-profile.children.cycles-pp.__bad_area
0.61 -0.0 0.58 perf-profile.children.cycles-pp.fpu__clear_user_states
0.12 ± 5% +0.0 0.14 ± 4% perf-profile.children.cycles-pp.__get_user_nocheck_4
0.13 ± 3% +0.0 0.14 ± 3% perf-profile.children.cycles-pp.ima_file_mprotect
0.22 ± 5% +0.0 0.24 ± 2% perf-profile.children.cycles-pp.security_file_mprotect
0.25 ± 3% +0.0 0.28 ± 4% perf-profile.children.cycles-pp.stress_mwc16
0.18 ± 5% +0.0 0.20 ± 6% perf-profile.children.cycles-pp.stress_mwc16modn
0.34 ± 3% +0.0 0.37 ± 3% perf-profile.children.cycles-pp.mas_ascend
0.12 ± 4% +0.0 0.15 ± 5% perf-profile.children.cycles-pp.copy_from_kernel_nofault_allowed
0.30 ± 8% +0.0 0.33 ± 2% perf-profile.children.cycles-pp.rcu_all_qs
0.26 ± 4% +0.0 0.29 ± 6% perf-profile.children.cycles-pp.mas_pop_node
0.44 ± 2% +0.0 0.47 perf-profile.children.cycles-pp.vma_set_page_prot
0.49 ± 3% +0.0 0.53 ± 3% perf-profile.children.cycles-pp.save_xstate_epilog
0.66 ± 2% +0.0 0.71 ± 2% perf-profile.children.cycles-pp.native_irq_return_iret
0.02 ± 99% +0.1 0.08 ± 11% perf-profile.children.cycles-pp.anon_vma_clone
1.27 +0.1 1.33 perf-profile.children.cycles-pp.do_user_addr_fault
0.84 +0.1 0.90 perf-profile.children.cycles-pp.mas_prev_slot
1.04 +0.1 1.11 perf-profile.children.cycles-pp.copy_fpstate_to_sigframe
0.73 ± 7% +0.1 0.79 ± 2% perf-profile.children.cycles-pp.__cond_resched
0.46 ± 3% +0.1 0.53 ± 2% perf-profile.children.cycles-pp.copy_from_kernel_nofault
1.30 ± 2% +0.1 1.37 perf-profile.children.cycles-pp.entry_SYSCALL_64
0.50 ± 2% +0.1 0.58 ± 2% perf-profile.children.cycles-pp.prepend_copy
1.68 +0.1 1.75 perf-profile.children.cycles-pp.mas_preallocate
0.61 ± 3% +0.1 0.70 ± 3% perf-profile.children.cycles-pp.simple_dname
2.77 ± 2% +0.1 2.87 perf-profile.children.cycles-pp.clear_bhb_loop
3.27 +0.1 3.37 perf-profile.children.cycles-pp.handle_signal
1.78 +0.1 1.89 perf-profile.children.cycles-pp.get_sigframe
2.05 +0.1 2.16 perf-profile.children.cycles-pp.x64_setup_rt_frame
0.99 ± 18% +0.1 1.11 perf-profile.children.cycles-pp.change_protection_range
7.00 +0.2 7.24 ± 2% perf-profile.children.cycles-pp.vma_prepare
34.09 +1.1 35.22 ± 2% perf-profile.children.cycles-pp.mprotect_fixup
50.17 +1.1 51.31 perf-profile.children.cycles-pp.do_syscall_64
46.24 +1.2 47.39 perf-profile.children.cycles-pp.do_mprotect_pkey
46.33 +1.2 47.49 perf-profile.children.cycles-pp.__x64_sys_mprotect
50.61 +1.2 51.78 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
55.94 +1.6 57.52 perf-profile.children.cycles-pp.__mprotect
12.44 ± 2% -1.5 10.91 ± 2% perf-profile.self.cycles-pp.do_dec_rlimit_put_ucounts
4.36 -0.1 4.22 ± 2% perf-profile.self.cycles-pp.up_write
1.14 ± 3% -0.1 1.03 perf-profile.self.cycles-pp.recalc_sigpending
0.87 ± 6% -0.1 0.78 ± 5% perf-profile.self.cycles-pp._raw_spin_lock_irq
2.83 -0.1 2.75 perf-profile.self.cycles-pp.down_write
0.28 ± 5% -0.0 0.23 ± 5% perf-profile.self.cycles-pp.fpregs_mark_activate
0.19 ± 10% -0.0 0.14 ± 12% perf-profile.self.cycles-pp.__perf_event_header__init_id
0.40 ± 3% -0.0 0.36 ± 5% perf-profile.self.cycles-pp.complete_signal
0.52 ± 2% -0.0 0.48 ± 2% perf-profile.self.cycles-pp.up_read
0.15 ± 2% -0.0 0.14 ± 3% perf-profile.self.cycles-pp.__send_signal_locked
0.10 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.__bad_area_nosemaphore
0.30 ± 3% +0.0 0.33 ± 4% perf-profile.self.cycles-pp.mas_ascend
0.10 ± 5% +0.0 0.12 ± 5% perf-profile.self.cycles-pp.do_user_addr_fault
0.10 ± 4% +0.0 0.12 ± 3% perf-profile.self.cycles-pp.copy_from_kernel_nofault_allowed
0.21 ± 6% +0.0 0.24 ± 4% perf-profile.self.cycles-pp.rwsem_down_write_slowpath
0.40 +0.0 0.43 ± 2% perf-profile.self.cycles-pp.change_protection_range
0.44 +0.0 0.47 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.24 ± 3% +0.0 0.27 ± 6% perf-profile.self.cycles-pp.mas_pop_node
0.34 ± 2% +0.0 0.38 ± 3% perf-profile.self.cycles-pp.mas_preallocate
0.37 ± 8% +0.0 0.41 ± 3% perf-profile.self.cycles-pp.__cond_resched
0.72 +0.0 0.76 ± 2% perf-profile.self.cycles-pp.copy_fpstate_to_sigframe
0.41 +0.0 0.45 ± 3% perf-profile.self.cycles-pp.mas_prev_slot
0.66 ± 2% +0.0 0.71 ± 2% perf-profile.self.cycles-pp.native_irq_return_iret
0.30 ± 4% +0.0 0.35 ± 2% perf-profile.self.cycles-pp.copy_from_kernel_nofault
0.02 ±141% +0.1 0.08 ± 11% perf-profile.self.cycles-pp.anon_vma_clone
1.21 ± 2% +0.1 1.30 ± 2% perf-profile.self.cycles-pp.__mprotect
2.73 ± 2% +0.1 2.83 perf-profile.self.cycles-pp.clear_bhb_loop
2.76 +0.1 2.88 perf-profile.self.cycles-pp.do_mprotect_pkey
3.48 ± 3% +0.3 3.74 ± 2% perf-profile.self.cycles-pp.stress_mprotect_mem
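A note for readers checking the comparison above by hand: the middle column appears to be a relative change for most rows, and an absolute difference (in percentage points) for rows that are already percentages, such as the perf-profile entries. The Python sketch below recomputes two rows under that assumption. The helper names are illustrative only and are not part of the lkp-tests tooling.

```python
def relative_change_pct(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


def absolute_delta(base: float, new: float) -> float:
    """Plain difference, for rows whose values are already percentages."""
    return new - base


# stress-ng.mprotect.ops_per_sec: 8163833 -> 8533021
ops_change = relative_change_pct(8163833, 8533021)

# perf-profile.calltrace.cycles-pp.__mprotect: 55.84 -> 57.41
profile_delta = absolute_delta(55.84, 57.41)

print(f"{ops_change:+.1f}%")     # +4.5%, as in the %change column
print(f"{profile_delta:+.1f}")   # +1.6, as in the perf-profile row
```

Both results agree with the reported figures after rounding to one decimal place.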
***************************************************************************************************
lkp-icl-2sp8: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigrt/stress-ng/60s
commit:
3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")
334426094588f817 ee2a5c3e36093d0ff5709bc8f21
---------------- ---------------------------
%stddev %change %stddev
\ | \
1345 ± 9% -15.8% 1132 ± 5% perf-c2c.HITM.remote
5328778 +18.0% 6289475 vmstat.system.cs
197362 +2.0% 201296 vmstat.system.in
45.97 ±118% -85.4% 6.71 ± 55% perf-sched.sch_delay.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.shmem_alloc_folio
582.79 ± 39% -39.2% 354.28 ± 31% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
1260 ± 46% -43.7% 709.74 ± 31% perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
45.97 ±118% -85.4% 6.71 ± 55% perf-sched.wait_time.max.ms.__cond_resched.__alloc_frozen_pages_noprof.alloc_pages_mpol.folio_alloc_mpol_noprof.shmem_alloc_folio
705.59 ± 50% -48.9% 360.90 ± 32% perf-sched.wait_time.max.ms.schedule_hrtimeout_range.do_sigtimedwait.isra.0.__x64_sys_rt_sigtimedwait
83250 -16.0% 69935 stress-ng.sigrt.nanosecs_between_sigqueue_and_sigwaitinfo_completion
3.362e+08 +15.7% 3.89e+08 stress-ng.sigrt.ops
5601334 +15.7% 6480915 stress-ng.sigrt.ops_per_sec
65582158 +17.7% 77176472 stress-ng.time.involuntary_context_switches
3423 -1.4% 3375 stress-ng.time.system_time
335.13 ± 2% +14.5% 383.80 ± 2% stress-ng.time.user_time
2.714e+08 +17.4% 3.185e+08 stress-ng.time.voluntary_context_switches
4202907 ± 15% -24.2% 3184715 ± 12% sched_debug.cfs_rq:/.avg_vruntime.max
82.07 ± 12% +391.9% 403.68 ± 94% sched_debug.cfs_rq:/.load_avg.avg
169.48 ± 8% +1182.4% 2173 ±115% sched_debug.cfs_rq:/.load_avg.stddev
4202907 ± 15% -24.2% 3184715 ± 12% sched_debug.cfs_rq:/.min_vruntime.max
1239 ± 8% +14.2% 1415 ± 12% sched_debug.cfs_rq:/.util_avg.max
2593172 +17.4% 3044316 sched_debug.cpu.nr_switches.avg
1526897 ± 3% +66.4% 2540867 ± 2% sched_debug.cpu.nr_switches.min
606805 -67.2% 198918 ± 9% sched_debug.cpu.nr_switches.stddev
1.902e+10 +14.8% 2.184e+10 perf-stat.i.branch-instructions
1.42e+08 ± 3% +16.2% 1.65e+08 perf-stat.i.branch-misses
6.65 ± 4% -0.9 5.77 ± 7% perf-stat.i.cache-miss-rate%
3.931e+08 ± 9% +17.1% 4.605e+08 ± 6% perf-stat.i.cache-references
5534190 +17.4% 6498045 perf-stat.i.context-switches
2.71 -14.3% 2.33 perf-stat.i.cpi
8.694e+10 +14.8% 9.976e+10 perf-stat.i.instructions
0.39 +14.2% 0.45 perf-stat.i.ipc
86.53 +17.4% 101.60 perf-stat.i.metric.K/sec
6.82 ± 5% -0.9 5.91 ± 9% perf-stat.overall.cache-miss-rate%
2.59 -12.9% 2.26 perf-stat.overall.cpi
0.39 +14.7% 0.44 perf-stat.overall.ipc
1.871e+10 +14.8% 2.149e+10 perf-stat.ps.branch-instructions
1.396e+08 ± 3% +16.2% 1.622e+08 perf-stat.ps.branch-misses
3.868e+08 ± 9% +17.1% 4.53e+08 ± 6% perf-stat.ps.cache-references
5443676 +17.4% 6391319 perf-stat.ps.context-switches
8.552e+10 +14.8% 9.813e+10 perf-stat.ps.instructions
5.251e+12 +14.3% 6e+12 perf-stat.total.instructions
***************************************************************************************************
lkp-icl-2sp8: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-9.4/100%/debian-12-x86_64-20240206.cgz/lkp-icl-2sp8/sigbus/stress-ng/60s
commit:
3344260945 ("Merge tag 'for-v6.14-rc' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply")
ee2a5c3e36 ("Revert "pid: allow pid_max to be set per pid namespace"")
334426094588f817 ee2a5c3e36093d0ff5709bc8f21
---------------- ---------------------------
%stddev %change %stddev
\ | \
7.64 +1.7 9.30 mpstat.cpu.all.usr%
36.50 ± 16% -42.9% 20.83 ± 31% perf-c2c.DRAM.local
2312 ± 6% -68.7% 723.17 ± 4% perf-c2c.DRAM.remote
3690 ± 3% +44.9% 5347 ± 6% perf-c2c.HITM.local
2155 ± 6% -71.8% 608.17 ± 4% perf-c2c.HITM.remote
4477 ± 69% -70.3% 1328 ± 35% proc-vmstat.numa_hint_faults
2459 ± 11% -64.8% 866.33 ± 47% proc-vmstat.numa_hint_faults_local
140611 ± 21% -33.6% 93302 ± 45% proc-vmstat.numa_pte_updates
7.197e+08 +20.7% 8.685e+08 proc-vmstat.pgfault
7.201e+08 +20.6% 8.682e+08 stress-ng.sigbus.ops
12001759 +20.6% 14469786 stress-ng.sigbus.ops_per_sec
3526 -1.8% 3461 stress-ng.time.system_time
261.31 +25.4% 327.64 stress-ng.time.user_time
0.03 ± 55% -64.6% 0.01 ± 17% perf-sched.sch_delay.avg.ms.do_wait.kernel_wait4.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.86 ±150% -90.1% 0.09 ±201% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
0.02 ± 50% -58.7% 0.01 ± 14% perf-sched.sch_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.08 ± 18% -34.1% 0.71 ± 14% perf-sched.sch_delay.avg.ms.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
0.31 ± 72% -65.9% 0.11 ± 71% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.02 ± 10% -23.4% 0.01 ± 15% perf-sched.sch_delay.max.ms.rcu_gp_kthread.kthread.ret_from_fork.ret_from_fork_asm
1.91 ±218% -99.2% 0.02 ± 11% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
4.00 ± 49% -71.6% 1.14 ± 56% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
261.25 ± 37% +199.1% 781.43 ± 15% perf-sched.wait_and_delay.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
81.02 ± 59% +274.1% 303.13 ± 50% perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
6.60 ± 2% +16.9% 7.71 ± 3% perf-sched.wait_and_delay.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
108.83 ± 63% -81.2% 20.50 ±113% perf-sched.wait_and_delay.count.devkmsg_read.vfs_read.ksys_read.do_syscall_64
3107 ± 3% -12.6% 2714 ± 5% perf-sched.wait_and_delay.count.irqentry_exit_to_user_mode.asm_exc_page_fault.[unknown]
124.17 ± 63% -70.1% 37.17 ± 60% perf-sched.wait_and_delay.count.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
751.00 ± 2% -17.0% 623.50 ± 2% perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
1550 ± 31% +119.7% 3406 ± 19% perf-sched.wait_and_delay.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
261.24 ± 37% +199.1% 781.42 ± 15% perf-sched.wait_time.avg.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
80.16 ± 60% +278.0% 303.05 ± 50% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range.ep_poll.do_epoll_wait.__x64_sys_epoll_wait
6.59 ± 2% +17.0% 7.71 ± 3% perf-sched.wait_time.avg.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
1550 ± 31% +119.7% 3406 ± 19% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.18 -49.0% 0.09 ± 3% perf-stat.i.MPKI
1.59e+10 +19.7% 1.903e+10 perf-stat.i.branch-instructions
0.28 -0.0 0.25 perf-stat.i.branch-miss-rate%
40989724 +5.3% 43173098 ± 2% perf-stat.i.branch-misses
32.63 -15.8 16.81 ± 2% perf-stat.i.cache-miss-rate%
12733301 ± 2% -40.3% 7597041 ± 3% perf-stat.i.cache-misses
38933806 +14.5% 44591128 perf-stat.i.cache-references
3.17 -16.4% 2.65 perf-stat.i.cpi
18224 +75.2% 31921 perf-stat.i.cycles-between-cache-misses
7.098e+10 +19.6% 8.489e+10 perf-stat.i.instructions
0.32 +19.0% 0.38 perf-stat.i.ipc
184.67 +20.6% 222.65 perf-stat.i.metric.K/sec
11819123 +20.6% 14249011 perf-stat.i.page-faults
0.18 -50.1% 0.09 ± 3% perf-stat.overall.MPKI
0.26 -0.0 0.23 perf-stat.overall.branch-miss-rate%
32.70 -15.7 17.04 ± 3% perf-stat.overall.cache-miss-rate%
3.19 -16.4% 2.66 perf-stat.overall.cpi
17772 ± 2% +67.6% 29795 ± 2% perf-stat.overall.cycles-between-cache-misses
0.31 +19.6% 0.38 perf-stat.overall.ipc
1.564e+10 +19.7% 1.871e+10 perf-stat.ps.branch-instructions
40314687 +5.4% 42478375 ± 2% perf-stat.ps.branch-misses
12525837 ± 2% -40.3% 7473864 ± 3% perf-stat.ps.cache-misses
38300912 +14.5% 43866104 perf-stat.ps.cache-references
6.982e+10 +19.6% 8.35e+10 perf-stat.ps.instructions
11626044 +20.6% 14016280 perf-stat.ps.page-faults
4.284e+12 +19.5% 5.117e+12 perf-stat.total.instructions
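The derived perf-stat rows in the sigbus comparison can be cross-checked from the raw counters. The sketch below assumes the usual definitions (cache miss rate = cache-misses / cache-references, MPKI = cache misses per thousand instructions, IPC = 1 / CPI). Those definitions are an assumption about how the robot computes these rows, not something the report states. The input values are copied from the perf-stat.ps.* and perf-stat.overall.cpi rows.

```python
# Counters copied from the sigbus table: base commit vs. the revert.
runs = {
    "base": {
        "cache_misses": 12525837,
        "cache_refs": 38300912,
        "instructions": 6.982e10,
        "cpi": 3.19,
    },
    "revert": {
        "cache_misses": 7473864,
        "cache_refs": 43866104,
        "instructions": 8.35e10,
        "cpi": 2.66,
    },
}


def derived(run: dict) -> dict:
    """Recompute derived metrics from raw counters (assumed definitions)."""
    return {
        "miss_rate_pct": run["cache_misses"] / run["cache_refs"] * 100.0,
        "mpki": run["cache_misses"] / run["instructions"] * 1000.0,
        "ipc": 1.0 / run["cpi"],
    }


for name, run in runs.items():
    d = derived(run)
    print(f"{name}: miss-rate {d['miss_rate_pct']:.2f}%  "
          f"MPKI {d['mpki']:.2f}  IPC {d['ipc']:.2f}")
```

Under these assumptions the recomputed values match the perf-stat.overall rows: about 32.70% and 17.04% for the cache miss rate, 0.18 and 0.09 for MPKI, and 0.31 and 0.38 for IPC.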
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Thread overview: 12+ messages
2025-02-21 17:02 [PATCH 0/2] Alternative "pid_max" for 32-bit userspace Michal Koutný
2025-02-21 17:02 ` [PATCH 1/2] Revert "pid: allow pid_max to be set per pid namespace" Michal Koutný
2025-02-25 17:36 ` Alexander Mikhalitsyn
2025-03-10 7:32 ` kernel test robot
2025-02-21 17:02 ` [PATCH 2/2] pid: Optional first-fit pid allocation Michal Koutný
2025-02-22 0:18 ` Andrew Morton
2025-02-22 9:02 ` David Laight
2025-03-05 15:01 ` Michal Koutný
2025-03-05 15:04 ` Michal Koutný
2025-02-25 17:30 ` Alexander Mikhalitsyn
2025-03-06 8:59 ` Christian Brauner
2025-03-06 9:09 ` Michal Koutný