* [PATCH v5 10/24] sched/core: Keep tick on non-preferred CPUs until tasks are out
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
Keep the tick enabled on a nohz_full CPU while it is marked as non-preferred.
If no CFS task is running there, stop the tick to save power.
The steal time handling code will call tick_nohz_dep_set_cpu() with
TICK_DEP_BIT_SCHED so that tasks are moved off the nohz_full CPU quickly.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Move it below rt checks. (Sashiko)
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 281715a6e88f..c0391e7897f5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1473,6 +1473,10 @@ bool sched_can_stop_tick(struct rq *rq)
return false;
}
+ /* Keep the tick running until CFS tasks are pushed out */
+ if (!cpu_preferred(rq->cpu) && rq->cfs.h_nr_queued)
+ return false;
+
return true;
}
#endif /* CONFIG_NO_HZ_FULL */
--
2.47.3
* [PATCH v5 09/24] sched/fair: Pull the load on preferred CPU
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
When a CPU is marked as non-preferred, pulling load towards it is
pointless, since the task will be pushed out again at the next tick.
Since load balancing only happens among preferred CPUs, should_we_balance()
will bail out. For NEWIDLE and IDLE balance, however, this bailout can
happen even earlier.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- new patch
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 44a0d9736b67..fda8966d9d87 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -14196,6 +14196,10 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
if (!idle_cpu(balance_cpu))
continue;
+ /* There is no point in pulling the load, just to push it out next */
+ if (!cpu_preferred(balance_cpu))
+ continue;
+
/*
* If this CPU gets work to do, stop the load balancing
* work being done for other CPUs. Next load
@@ -14375,6 +14379,10 @@ static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf)
if (!cpu_active(this_cpu))
return 0;
+ /* Do not pull to a !preferred CPU just to push it out next */
+ if (!cpu_preferred(this_cpu))
+ return 0;
+
/*
* This is OK, because current is on_cpu, which avoids it being picked
* for load-balance and preemption/IRQs are still disabled avoiding
--
2.47.3
* [PATCH v5 08/24] sched/fair: load balance only among preferred CPUs
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
Consider only preferred CPUs for load balance.
With this, load balance will end up choosing a preferred CPU for the pull.
This keeps it from fighting against the push task mechanism, which runs
at tick. It also stops active balance from running on a non-preferred CPU
and pulling load to it.
This means there is no load balancing for tasks that are pinned only to
non-preferred CPUs. They will continue to run where they were running
before the CPUs were marked as non-preferred.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Remove previous cpumask_and (K Prateek Nayak)
kernel/sched/fair.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee1..44a0d9736b67 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13289,7 +13289,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
};
bool need_unlock = false;
- cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
+ /* Spread load among preferred CPUs */
+ cpumask_and(cpus, sched_domain_span(sd), cpu_preferred_mask);
schedstat_inc(sd->lb_count[idle]);
--
2.47.3
* [PATCH v5 07/24] sched/fair: Select preferred CPU at wakeup when possible
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
Update available_idle_cpu() to consider preferred CPUs. This takes care of
a lot of the wakeup decisions, so that they use only preferred CPUs. There
is no need to put explicit checks everywhere.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
kernel/sched/sched.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d009c2529b2..148fe6145f1a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1434,6 +1434,9 @@ static inline bool available_idle_cpu(int cpu)
if (!idle_rq(cpu_rq(cpu)))
return 0;
+ if (!cpu_preferred(cpu))
+ return 0;
+
if (vcpu_is_preempted(cpu))
return 0;
--
2.47.3
* [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
When possible, pick a preferred CPU.
The push task mechanism uses the stopper thread, which calls
select_fallback_rq() and relies on this check to pick only a preferred CPU.
When a task is affined only to non-preferred CPUs, it should continue to
run there. Detect that by checking whether cpus_ptr and cpu_preferred_mask
intersect.
Since is_cpu_allowed() can be called directly or repeatedly from
select_fallback_rq(), task_struct->has_preferred_cpu_state encodes whether
the call comes via select_fallback_rq() and, if so, the cached result.
This avoids N**2 complexity in the rare cases.
The additional O(N) overhead in is_cpu_allowed() applies only when the CPU
is not preferred. So in normal scenarios the overhead is only a bit check.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Do simple encoding of -1,0,1 instead (K Prateek Nayak)
- Make it s8 (K Prateek Nayak)
- Update changelog to address Sashiko's concerns about overhead.
include/linux/sched.h | 1 +
kernel/sched/core.c | 35 +++++++++++++++++++++++++++++++++--
kernel/sched/sched.h | 25 +++++++++++++++++++++++++
3 files changed, 59 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc6ecb3869dd..27dbf676113e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1657,6 +1657,7 @@ struct task_struct {
#ifdef CONFIG_UNWIND_USER
struct unwind_task_info unwind_info;
#endif
+ s8 has_preferred_cpu_state;
/* CPU-specific state of this task: */
struct thread_struct thread;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..281715a6e88f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2500,6 +2500,8 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
*/
static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
+ bool task_check_preferred_cpu;
+
/* When not in the task's cpumask, no point in looking further. */
if (!task_allowed_on_cpu(p, cpu))
return false;
@@ -2508,9 +2510,23 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (is_migration_disabled(p))
return cpu_online(cpu);
+ /*
+ * This is essential to maintain user affinities when preferred
+ * CPUs change. A task pinned on non-preferred CPU should continue
+ * to run there, since this is non-user triggered.
+ *
+ * If CPU is non-preferred and task can run on other CPUs which are
+ * currently preferred, then choose those other CPUs instead.
+ * Overhead is minimal when CPU is preferred.
+ */
+ task_check_preferred_cpu = !cpu_preferred(cpu) && task_has_preferred_cpus(p);
+
/* Non kernel threads are not allowed during either online or offline. */
- if (!(p->flags & PF_KTHREAD))
+ if (!(p->flags & PF_KTHREAD)) {
+ if (task_check_preferred_cpu)
+ return false;
return cpu_active(cpu);
+ }
/* KTHREAD_IS_PER_CPU is always allowed. */
if (kthread_is_per_cpu(p))
@@ -2520,6 +2536,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
if (cpu_dying(cpu))
return false;
+ /* Try a preferred CPU first if possible */
+ if (task_check_preferred_cpu)
+ return false;
+
/* But are allowed during online. */
return cpu_online(cpu);
}
@@ -3549,6 +3569,14 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
enum { cpuset, possible, fail } state = cpuset;
int dest_cpu;
+ /*
+ * Cache the value whether task's affinity spans preferred CPUs.
+ * This helps to avoid repeating the same for each CPU
+ * later in the loop. Encode call to is_cpu_allowed coming
+ * via select_fallback_rq.
+ */
+ p->has_preferred_cpu_state = task_has_preferred_cpus(p) ? 1 : -1;
+
/*
* If the node that the CPU is on has been offlined, cpu_to_node()
* will return -1. There is no CPU on the node, and we should
@@ -3560,7 +3588,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
/* Look for allowed, online CPU in same node. */
for_each_cpu(dest_cpu, nodemask) {
if (is_cpu_allowed(p, dest_cpu))
- return dest_cpu;
+ goto clear_and_return;
}
}
@@ -3604,6 +3632,8 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
}
}
+clear_and_return:
+ p->has_preferred_cpu_state = 0;
return dest_cpu;
}
@@ -4612,6 +4642,7 @@ static void __sched_fork(u64 clone_flags, struct task_struct *p)
init_numa_balancing(clone_flags, p);
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
+ p->has_preferred_cpu_state = 0;
init_sched_mm(p);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..5d009c2529b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4213,4 +4213,29 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)
#include "ext.h"
+/*
+ * has_preferred_cpu_state could have the value cached from
+ * select_fallback_rq. It is set/cleared while holding pi_lock
+ * and irq disabled.
+ *
+ * 1: Cached, and preferred CPUs exist in the task's affinity.
+ * 0: Not cached; needs to be evaluated.
+ * -1: Cached, and no preferred CPU exists in the task's affinity.
+ *
+ * Only affects FAIR task.
+ */
+static inline bool task_has_preferred_cpus(struct task_struct *p)
+{
+ int cached;
+
+ /* Only FAIR tasks honor preferred CPU state */
+ if (unlikely(p->sched_class != &fair_sched_class))
+ return false;
+
+ cached = READ_ONCE(p->has_preferred_cpu_state);
+ if (cached)
+ return cached > 0;
+ else
+ return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
#endif /* _KERNEL_SCHED_SCHED_H */
--
2.47.3
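
The caching scheme described in the changelog of patch 06 can be modelled
in userspace. What follows is only a sketch under stated assumptions, not
the kernel code: CPU masks are plain 64-bit words, the names task_model,
mask_intersects and the model_* helpers are hypothetical, and the
sched_class, kthread, migration-disabled and cpu_active checks are left
out. It shows how the -1/0/1 state lets the fallback loop scan the masks
once instead of once per candidate CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask64_t;

struct task_model {
	cpumask64_t cpus_allowed;        /* models p->cpus_ptr */
	int8_t has_preferred_cpu_state;  /* 1, 0 or -1, as in the patch */
};

/* Counts how often the O(N) mask comparison is performed. */
static unsigned long intersect_calls;

static bool mask_intersects(cpumask64_t a, cpumask64_t b)
{
	intersect_calls++;
	return (a & b) != 0;
}

/* Use the cached answer when present, otherwise compare the masks. */
static bool model_task_has_preferred_cpus(const struct task_model *p,
					  cpumask64_t preferred)
{
	int cached = p->has_preferred_cpu_state;

	if (cached)
		return cached > 0;
	return mask_intersects(p->cpus_allowed, preferred);
}

/* Simplified model of the preferred-CPU part of is_cpu_allowed(). */
static bool model_is_cpu_allowed(const struct task_model *p, int cpu,
				 cpumask64_t preferred)
{
	cpumask64_t bit = (cpumask64_t)1 << cpu;

	if (!(p->cpus_allowed & bit))
		return false;
	/* The mask comparison is only reached for a non-preferred CPU. */
	if (!(preferred & bit) && model_task_has_preferred_cpus(p, preferred))
		return false;
	return true;
}

/* Cache once, walk the CPUs, clear the cache before returning. */
static int model_select_fallback(struct task_model *p, cpumask64_t preferred,
				 int nr_cpus)
{
	int cpu, dest = -1;

	p->has_preferred_cpu_state =
		mask_intersects(p->cpus_allowed, preferred) ? 1 : -1;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (model_is_cpu_allowed(p, cpu, preferred)) {
			dest = cpu;
			break;
		}
	}

	p->has_preferred_cpu_state = 0;
	return dest;
}
```

In this model, a task allowed on CPUs 0-3 with CPUs 2-3 preferred falls
back to CPU 2 after a single mask comparison, while a task pinned to
CPUs 0-1 stays on CPU 0 because no preferred CPU is in its affinity.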
* Re: [PATCH v4] coredump: Add /proc/<pid>/coredump_pre_exit for pre-exit before dumping
From: Lorenzo Stoakes @ 2026-06-25 12:48 UTC
To: Xin Zhao
Cc: brauner, mjguzik, pfalcato, ebiederm, viro, jack, jlayton,
chuck.lever, alex.aring, arnd, keescook, mcgrof, j.granados,
allen.lkml, linux-fsdevel, linux-kernel, linux-arch,
Jonathan Corbet, Andrew Morton, David Hildenbrand, Ingo Molnar,
Peter Zijlstra, Juri Lelli, Vincent Guittot, Liam R. Howlett,
linux-doc, linux-mm
In-Reply-To: <20260624145552.70143-1-jackzxcui1989@163.com>
+cc missing maintainers, lists.
NAK.
This is un-upstreamable for numerous reasons.
The stuff you're doing in mm is broken, wrong and invasive and you've not
even bothered to cc mm people. I'm annoyed by this.
You're also making incredibly silly mistakes at v4 of something that should have
been an RFC.
You don't seem to understand the concept of patch _series_ (break it up into
smaller patches!!!) and you haven't bothered cc'ing maintainers whose subsystems
you're radically altering.
I'm annoyed as you have a history where you were told not to add insane hacks
before ([0], my reply at [1]).
[0]:https://lore.kernel.org/all/20260116042817.3790405-1-jackzxcui1989@163.com/
[1]:https://lore.kernel.org/all/14110b70-19e7-474d-b0dd-ba80e8bed9b0@lucifer.local/
Was I wasting my time there? Am I wasting my time responding now?
And how hard is it to run a simple perl script?
Let me run it for you for _just_ the maintainers:
$ scripts/get_maintainer.pl --nogit --nogit-fallback --nor your_patch.patch
Jonathan Corbet <corbet@lwn.net> (maintainer:DOCUMENTATION)
Alexander Viro <viro@zeniv.linux.org.uk> (maintainer:FILESYSTEMS (VFS and infrastructure))
Christian Brauner <brauner@kernel.org> (maintainer:FILESYSTEMS (VFS and infrastructure))
Andrew Morton <akpm@linux-foundation.org> (maintainer:MEMORY MANAGEMENT - CORE)
David Hildenbrand <david@kernel.org> (maintainer:MEMORY MANAGEMENT - CORE)
Arnd Bergmann <arnd@arndb.de> (maintainer:GENERIC INCLUDE/ASM HEADER FILES)
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
Juri Lelli <juri.lelli@redhat.com> (maintainer:SCHEDULER)
Vincent Guittot <vincent.guittot@linaro.org> (maintainer:SCHEDULER)
Kees Cook <kees@kernel.org> (maintainer:EXEC & BINFMT API, ELF)
"Liam R. Howlett" <liam@infradead.org> (maintainer:MEMORY MAPPING)
Lorenzo Stoakes <ljs@kernel.org> (maintainer:MEMORY MAPPING)
linux-doc@vger.kernel.org (open list:DOCUMENTATION)
linux-kernel@vger.kernel.org (open list)
linux-fsdevel@vger.kernel.org (open list:PROC FILESYSTEM)
linux-mm@kvack.org (open list:MEMORY MANAGEMENT - CORE)
linux-arch@vger.kernel.org (open list:GENERIC INCLUDE/ASM HEADER FILES)
EXEC & BINFMT API, ELF status: Supported
You're missing the majority of these. That's _not OK_.
On Wed, Jun 24, 2026 at 10:55:52PM +0800, Xin Zhao wrote:
> A coredump typically takes some time to complete. If we happen to hold a
> write lock with flock just before triggering the coredump, that write lock
> will not be released during the entire coredump process. As a result,
> other processes attempting to acquire the same write lock may experience
> significant delays. Another typical scenario is that shared memory, such
> as dma-buf, remains occupied and is not released for a long time due to
> core dumps.
>
> To address this, add /proc/<pid>/coredump_pre_exit node so that people can
This is a horrible idea.
> specify which resources they want to release before dumping core. This
> patch implements the early release of two types of resources: flock files
> and file-backed shared memory. Default settings are NOT pre-exit anything.
What, people set this ahead of time? For a dynamic thing like files?
>
> A temporary bit, O_TMPCLOS, is added to mark vma->vm_file->f_flags during
> the execution of the newly introduced exit_mmap_mapped_shared() function.
> In this way, the subsequent exit_files_pre_exit() function does not need
> to find the corresponding vma through the file to check for the VM_SHARED
> attribute, thereby reducing the traversal cost.
This sentence doesn't even make sense?
And also !VM_SHARED means !vma->vm_file so your code would NULL deref if you
didn't check that. But !VM_SHARED VMAs can absolutely be file-backed...
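
The last point can be observed from userspace. The sketch below is only an
illustration, not part of the patch under review: the helper name
private_mapping_is_file_backed and the temporary file path are
hypothetical. It maps a file with MAP_PRIVATE (so the mapping is not
shared), checks that the mapping shows the file's contents, and checks
that a store through the mapping does not reach the file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a MAP_PRIVATE mapping of a temporary file shows the file's
 * contents and a store through the mapping leaves the file unchanged,
 * 0 if either check fails, and -1 on a setup error.
 */
static int private_mapping_is_file_backed(void)
{
	char path[] = "/tmp/private_map_XXXXXX";  /* hypothetical location */
	const char msg[] = "file-backed";
	char readback[sizeof(msg)];
	char *map;
	int fd, ok = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);  /* the open fd keeps the file alive */

	if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg))
		goto out;

	/* Private, i.e. not shared, yet backed by the file behind fd. */
	map = mmap(NULL, sizeof(msg), PROT_READ | PROT_WRITE, MAP_PRIVATE,
		   fd, 0);
	if (map == MAP_FAILED)
		goto out;

	/* The mapping is populated from the file. */
	ok = memcmp(map, msg, sizeof(msg)) == 0;

	/* A store goes to a private copy, not to the file. */
	map[0] = 'F';
	if (pread(fd, readback, sizeof(readback), 0) !=
	    (ssize_t)sizeof(readback))
		ok = -1;
	else
		ok = ok && memcmp(readback, msg, sizeof(msg)) == 0;

	munmap(map, sizeof(msg));
out:
	close(fd);
	return ok;
}
```

On a Linux system this helper is expected to return 1, which is the
userspace view of a mapping that is file-backed without being shared.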
>
> Signed-off-by: Xin Zhao <jackzxcui1989@163.com>
> ---
>
> Change in v4:
> - Christian pointed out that the coredump process will traverse file
> descriptors (fd), so certain fds should not be closed by default.
> Rework the whole feature, add /proc/<pid>/coredump_pre_exit for user
> pre-exit resources selection, default is NOT pre-exit anything.
> - Mateusz suggested that walking the fd table and release the file-lock is
> reasonable. No longer release all the fd(s). Based on user config, only
> the flock fd(s) and the fd(s) correspondent to file-backed shared memory
> will be released at most.
>
> Change in v3:
> - Add comment and commit-log to explain why do the MMF_DUMP_MAPPED_SHARED
> mm_flags_test() check, note that memory mapped files keep their own
> separate references to the files. The case to work around is that early
> unlocking a flock on a file allows other processes to lock and modify
> the mapped data protected by the flock,
> as suggested by Pedro Falcato.
> - Link to v3: https://lore.kernel.org/all/20260619122419.3954581-1-jackzxcui1989@163.com/
>
> Change in v2:
> - Get rid of the implement of adding new fcntl API, the issue does not
> worth inflicting the cost on everyone,
> as suggested by Al Viro.
> - Call exit_files() in coredump_wait(),
> as suggested by Eric W. Biederman.
> Add MMF_DUMP_MAPPED_SHARED mm_flags_test() check to filter cases that
> need to dump file-backed shared memory.
> - Link to v2: https://lore.kernel.org/lkml/20260618150301.3226517-1-jackzxcui1989@163.com/
>
> v1:
> - Link to v1: https://lore.kernel.org/all/20260618030700.2511668-1-jackzxcui1989@163.com/
> ---
> .../admin-guide/kernel-parameters.txt | 5 ++
> Documentation/filesystems/proc.rst | 58 +++++++++-----
> fs/coredump.c | 23 ++++++
> fs/file.c | 46 +++++++++++
> fs/proc/base.c | 78 +++++++++++++++++++
> include/linux/mm.h | 1 +
No.
> include/linux/mm_types.h | 9 +++
No.
> include/linux/sched/task.h | 1 +
> include/uapi/asm-generic/fcntl.h | 4 +
> kernel/fork.c | 12 +++
> mm/mmap.c | 21 +++++
No.
> 11 files changed, 238 insertions(+), 20 deletions(-)
This is a completely insane diffstat for a single patch. Ridiculous.
AND YOU HAVEN'T ADDED A SINGLE TEST.
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index f575d4508..bc6d3859f 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1024,6 +1024,11 @@ Kernel parameters
> /proc/<pid>/coredump_filter.
> See also Documentation/filesystems/proc.rst.
>
> + coredump_pre_exit=
> + [KNL] Change the default value for
> + /proc/<pid>/coredump_pre_exit.
> + See also Documentation/filesystems/proc.rst.
> +
> coresight_cpu_debug.enable
> [ARM,ARM64]
> Format: <bool>
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index db6167bef..6a637d31d 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -39,16 +39,17 @@ fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009
> 3.2 /proc/<pid>/oom_score - Display current oom-killer score
> 3.3 /proc/<pid>/io - Display the IO accounting fields
> 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings
> - 3.5 /proc/<pid>/mountinfo - Information about mounts
> - 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
> - 3.7 /proc/<pid>/task/<tid>/children - Information about task children
> - 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
> - 3.9 /proc/<pid>/map_files - Information about memory mapped files
> - 3.10 /proc/<pid>/timerslack_ns - Task timerslack value
> - 3.11 /proc/<pid>/patch_state - Livepatch patch operation state
> - 3.12 /proc/<pid>/arch_status - Task architecture specific information
> - 3.13 /proc/<pid>/fd - List of symlinks to open files
> - 3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status.
> + 3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> + 3.6 /proc/<pid>/mountinfo - Information about mounts
> + 3.7 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
> + 3.8 /proc/<pid>/task/<tid>/children - Information about task children
> + 3.9 /proc/<pid>/fdinfo/<fd> - Information about opened file
> + 3.10 /proc/<pid>/map_files - Information about memory mapped files
> + 3.11 /proc/<pid>/timerslack_ns - Task timerslack value
> + 3.12 /proc/<pid>/patch_state - Livepatch patch operation state
> + 3.13 /proc/<pid>/arch_status - Task architecture specific information
> + 3.14 /proc/<pid>/fd - List of symlinks to open files
> + 3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status.
>
> 4 Configuring procfs
> 4.1 Mount options
> @@ -1961,7 +1962,24 @@ For example::
> $ echo 0x7 > /proc/self/coredump_filter
> $ ./some_program
>
> -3.5 /proc/<pid>/mountinfo - Information about mounts
> +3.5 /proc/<pid>/coredump_pre_exit - Core dump pre-exit settings
> +---------------------------------------------------------------
> +A coredump typically takes some time to complete. If we happen to hold a write
> +lock with flock just before triggering the coredump, that write lock will not
> +be released during the entire coredump process. As a result, other processes
> +attempting to acquire the same write lock may experience significant delays.
> +Another typical scenario is that shared memory, such as dma-buf, remains
> +occupied and is not released for a long time due to core dumps.
> +
> +/proc/<pid>/coredump_pre_exit allows you to pre-exit some resources before
> +dumping core.
> +
> +The following two types are supported:
> +
> + - (bit 0) flock files
> + - (bit 1) file-backed shared memory
> +
> +3.6 /proc/<pid>/mountinfo - Information about mounts
> --------------------------------------------------------
>
> This file contains lines of the form::
> @@ -2001,7 +2019,7 @@ For more information on mount propagation see:
> Documentation/filesystems/sharedsubtree.rst
>
>
> -3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
> +3.7 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm
> --------------------------------------------------------
> These files provide a method to access a task's comm value. It also allows for
> a task to set its own or one of its thread siblings comm value. The comm value
> @@ -2010,7 +2028,7 @@ then the kernel's TASK_COMM_LEN (currently 16 chars, including the NUL
> terminator) will result in a truncated comm value.
>
>
> -3.7 /proc/<pid>/task/<tid>/children - Information about task children
> +3.8 /proc/<pid>/task/<tid>/children - Information about task children
> -------------------------------------------------------------------------
> This file provides a fast way to retrieve first level children pids
> of a task pointed by <pid>/<tid> pair. The format is a space separated
> @@ -2027,7 +2045,7 @@ pids, so one needs to either stop or freeze processes being inspected
> if precise results are needed.
>
>
> -3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file
> +3.9 /proc/<pid>/fdinfo/<fd> - Information about opened file
> ---------------------------------------------------------------
> This file provides information associated with an opened file. The regular
> files have at least four fields -- 'pos', 'flags', 'mnt_id' and 'ino'.
> @@ -2198,7 +2216,7 @@ VFIO Device files
> where 'vfio-device-syspath' is the sysfs path corresponding to the VFIO device
> file.
>
> -3.9 /proc/<pid>/map_files - Information about memory mapped files
> +3.10 /proc/<pid>/map_files - Information about memory mapped files
> ---------------------------------------------------------------------
> This directory contains symbolic links which represent memory mapped files
> the process is maintaining. Example output::
> @@ -2220,7 +2238,7 @@ time one can open(2) mappings from the listings of two processes and
> comparing their inode numbers to figure out which anonymous memory areas
> are actually shared.
>
> -3.10 /proc/<pid>/timerslack_ns - Task timerslack value
> +3.11 /proc/<pid>/timerslack_ns - Task timerslack value
> ---------------------------------------------------------
> This file provides the value of the task's timerslack value in nanoseconds.
> This value specifies an amount of time that normal timers may be deferred
> @@ -2236,7 +2254,7 @@ Valid values are from 0 - ULLONG_MAX
> An application setting the value must have PTRACE_MODE_ATTACH_FSCREDS level
> permissions on the task specified to change its timerslack_ns value.
>
> -3.11 /proc/<pid>/patch_state - Livepatch patch operation state
> +3.12 /proc/<pid>/patch_state - Livepatch patch operation state
> -----------------------------------------------------------------
> When CONFIG_LIVEPATCH is enabled, this file displays the value of the
> patch state for the task.
> @@ -2253,7 +2271,7 @@ patched. If the patch is being enabled, then the task has already been
> patched. If the patch is being disabled, then the task hasn't been
> unpatched yet.
>
> -3.12 /proc/<pid>/arch_status - task architecture specific status
> +3.13 /proc/<pid>/arch_status - task architecture specific status
> -------------------------------------------------------------------
> When CONFIG_PROC_PID_ARCH_STATUS is enabled, this file displays the
> architecture specific status of the task.
> @@ -2298,7 +2316,7 @@ AVX512_elapsed_ms
> the task is unlikely an AVX512 user, but depends on the workload and the
> scheduling scenario, it also could be a false negative mentioned above.
>
> -3.13 /proc/<pid>/fd - List of symlinks to open files
> +3.14 /proc/<pid>/fd - List of symlinks to open files
> -------------------------------------------------------
> This directory contains symbolic links which represent open files
> the process is maintaining. Example output::
> @@ -2313,7 +2331,7 @@ The number of open files for the process is stored in 'size' member
> of stat() output for /proc/<pid>/fd for fast access.
> -------------------------------------------------------
>
> -3.14 /proc/<pid>/ksm_stat - Information about the process's ksm status
> +3.15 /proc/<pid>/ksm_stat - Information about the process's ksm status
> ----------------------------------------------------------------------
> When CONFIG_KSM is enabled, each process has this file which displays
> the information of ksm merging status.
> diff --git a/fs/coredump.c b/fs/coredump.c
> index bb6fdb1f4..e08a8a6c4 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -521,6 +521,27 @@ static int zap_threads(struct task_struct *tsk,
> return nr;
> }
>
> +static void coredump_pre_exit(void)
> +{
> + struct task_struct *tsk = current;
> + unsigned long flags = __mm_flags_get_dumpable(tsk->mm);
> +
> + if (!likely(flags & MMF_DUMP_PRE_EXIT_MASK))
> + return;
> +
> + /*
> + * Set O_TMPCLOS of file f_flags if file needs to be closed.
> + */
> + if (test_bit(MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED, &flags) &&
> + !test_bit(MMF_DUMP_MAPPED_SHARED, &flags))
> + exit_mmap_mapped_shared(tsk->mm);
What the hell are you doing?
This is not where we unmap VMAs?
This is likely broken in subtle ways.
> +
> + /*
> + * Check O_TMPCLOS of file f_flags to close file and clear it.
> + */
> + exit_files_pre_exit(tsk, mm_flags_test(MMF_DUMP_PRE_EXIT_FLOCK, tsk->mm));
> +}
> +
> static int coredump_wait(int exit_code, struct core_state *core_state)
> {
> struct task_struct *tsk = current;
> @@ -1100,6 +1121,8 @@ static void do_coredump(struct core_name *cn, struct coredump_params *cprm,
> return;
> }
>
> + coredump_pre_exit();
> +
> switch (cn->core_type) {
> case COREDUMP_FILE:
> if (!coredump_file(cn, cprm, binfmt))
> diff --git a/fs/file.c b/fs/file.c
> index 2c81c0b16..a58ffffcc 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -23,6 +23,7 @@
> #include <linux/file_ref.h>
> #include <net/sock.h>
> #include <linux/init_task.h>
> +#include <linux/filelock.h>
>
> #include "internal.h"
>
> @@ -527,6 +528,51 @@ void exit_files(struct task_struct *tsk)
> }
> }
>
> +void exit_files_pre_exit(struct task_struct *tsk, bool checkflock)
> +{
> + struct files_struct *files = tsk->files;
> + struct fdtable *fdt;
> + struct file *file;
> + unsigned int i, j = 0;
> +
> + if (!files)
> + return;
> +
> + fdt = rcu_dereference_raw(files->fdt);
> + for (;;) {
> + unsigned long set;
> +
> + i = j * BITS_PER_LONG;
> + if (i >= fdt->max_fds)
> + break;
> + set = fdt->open_fds[j++];
> + while (set) {
> + if (!(set & 1))
> + goto next_fd;
> + file = fdt->fd[i];
> + if (!file)
> + goto next_fd;
> + if (file->f_flags & O_TMPCLOS) {
> + file->f_flags &= ~O_TMPCLOS;
> + goto close_fd;
> + }
> + if (!checkflock)
> + goto next_fd;
> + if (!vfs_inode_has_locks(file_inode(file)))
> + goto next_fd;
> +
> +close_fd:
> + fdt->fd[i] = NULL;
> + filp_close(file, files);
> + cond_resched();
> +
> +next_fd:
> + i++;
> + set >>= 1;
> + }
> + }
This code hurts my eyes.
> +}
> +
> struct files_struct init_files = {
> .count = ATOMIC_INIT(1),
> .fdt = &init_files.fdtab,
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index d9acfa89c..99b5f219f 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -3026,6 +3026,83 @@ static const struct file_operations proc_coredump_filter_operations = {
> .write = proc_coredump_filter_write,
> .llseek = generic_file_llseek,
> };
> +
No comment, obviously.
> +static ssize_t proc_coredump_pre_exit_read(struct file *file, char __user *buf,
> + size_t count, loff_t *ppos)
> +{
> + struct task_struct *task = get_proc_task(file_inode(file));
> + struct mm_struct *mm;
> + char buffer[PROC_NUMBUF];
> + size_t len;
> + int ret;
> +
> + if (!task)
> + return -ESRCH;
> +
> + ret = 0;
> + mm = get_task_mm(task);
> + if (mm) {
> + unsigned long flags = __mm_flags_get_dumpable(mm);
> +
> + len = snprintf(buffer, sizeof(buffer), "%08lx\n",
> + ((flags & MMF_DUMP_PRE_EXIT_MASK) >>
> + MMF_DUMP_PRE_EXIT_SHIFT));
> + mmput(mm);
> + ret = simple_read_from_buffer(buf, count, ppos, buffer, len);
> + }
> +
> + put_task_struct(task);
> +
> + return ret;
> +}
> +
Yeah who needs a comment...
> +static ssize_t proc_coredump_pre_exit_write(struct file *file,
> + const char __user *buf,
> + size_t count,
> + loff_t *ppos)
> +{
> + struct task_struct *task;
> + struct mm_struct *mm;
> + unsigned int val;
> + int ret;
> + int i;
> + unsigned long mask;
> +
> + ret = kstrtouint_from_user(buf, count, 0, &val);
> + if (ret < 0)
> + return ret;
> +
> + ret = -ESRCH;
> + task = get_proc_task(file_inode(file));
> + if (!task)
> + goto out_no_task;
> +
> + mm = get_task_mm(task);
> + if (!mm)
> + goto out_no_mm;
> + ret = 0;
> +
> + for (i = 0, mask = 1; i < MMF_DUMP_PRE_EXIT_BITS; i++, mask <<= 1) {
What?
> + if (val & mask)
> + mm_flags_set(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> + else
> + mm_flags_clear(i + MMF_DUMP_PRE_EXIT_SHIFT, mm);
> + }
> +
> + mmput(mm);
> + out_no_mm:
> + put_task_struct(task);
> + out_no_task:
> + if (ret < 0)
> + return ret;
> + return count;
> +}
> +
> +static const struct file_operations proc_coredump_pre_exit_operations = {
> + .read = proc_coredump_pre_exit_read,
> + .write = proc_coredump_pre_exit_write,
> + .llseek = generic_file_llseek,
> +};
> #endif
>
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> @@ -3391,6 +3468,7 @@ static const struct pid_entry tgid_base_stuff[] = {
> #endif
> #ifdef CONFIG_ELF_CORE
> REG("coredump_filter", S_IRUGO|S_IWUSR, proc_coredump_filter_operations),
> + REG("coredump_pre_exit", S_IRUGO|S_IWUSR, proc_coredump_pre_exit_operations),
> #endif
> #ifdef CONFIG_TASK_IO_ACCOUNTING
> ONE("io", S_IRUSR, proc_tgid_io_accounting),
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index af23453e9..dfd4717c7 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4066,6 +4066,7 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
> extern int __vm_enough_memory(const struct mm_struct *mm, long pages, int cap_sys_admin);
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> extern void exit_mmap(struct mm_struct *);
> +extern void exit_mmap_mapped_shared(struct mm_struct *mm);
You don't use extern.
> bool mmap_read_lock_maybe_expand(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long addr, bool write);
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index c7db35be6..0555aaf50 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1963,6 +1963,15 @@ enum {
> (BIT(MMF_DUMP_ANON_PRIVATE) | BIT(MMF_DUMP_ANON_SHARED) | \
> BIT(MMF_DUMP_HUGETLB_PRIVATE) | MMF_DUMP_MASK_DEFAULT_ELF)
>
> +/* coredump pre-exit bits */
> +#define MMF_DUMP_PRE_EXIT_FLOCK 11
> +#define MMF_DUMP_PRE_EXIT_FILE_BACKED_SHARED 12
Err do we have space for this?
You really want to add 2 more bits to mm_struct flags for this insanity?
> +
> +#define MMF_DUMP_PRE_EXIT_SHIFT (MMF_DUMPABLE_BITS + MMF_DUMP_FILTER_BITS)
> +#define MMF_DUMP_PRE_EXIT_BITS 2
> +#define MMF_DUMP_PRE_EXIT_MASK \
> + (((1 << MMF_DUMP_PRE_EXIT_BITS) - 1) << MMF_DUMP_PRE_EXIT_SHIFT)
So are these dumpable bits or not? Why are you not just incrementing
MMF_DUMPABLE_BITS?
> +
> #ifdef CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS
> # define MMF_DUMP_MASK_DEFAULT_ELF BIT(MMF_DUMP_ELF_HEADERS)
> #else
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 41ed884cf..b4becbf6c 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -93,6 +93,7 @@ static inline void exit_thread(struct task_struct *tsk)
> extern __noreturn void do_group_exit(int);
>
> extern void exit_files(struct task_struct *);
> +extern void exit_files_pre_exit(struct task_struct *, bool);
> extern void exit_itimers(struct task_struct *);
>
> extern pid_t kernel_clone(struct kernel_clone_args *kargs);
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285..360604d65 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -95,6 +95,10 @@
> #define O_NDELAY O_NONBLOCK
> #endif
>
> +#ifndef O_TMPCLOS
> +#define O_TMPCLOS 0x80000000 /* tag need close, temporarily used */
> +#endif
> +
> #define F_DUPFD 0 /* dup */
> #define F_GETFD 1 /* get close_on_exec */
> #define F_SETFD 2 /* set/clear close_on_exec */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a679b2448..84f1ee7f3 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1030,6 +1030,18 @@ static int __init coredump_filter_setup(char *s)
>
> __setup("coredump_filter=", coredump_filter_setup);
>
> +static unsigned long default_dump_pre_exit;
> +
> +static int __init coredump_pre_exit_setup(char *s)
> +{
> + default_dump_pre_exit =
> + (simple_strtoul(s, NULL, 0) << MMF_DUMP_PRE_EXIT_SHIFT) &
> + MMF_DUMP_PRE_EXIT_MASK;
> + return 1;
> +}
> +
> +__setup("coredump_pre_exit=", coredump_pre_exit_setup);
> +
> #include <linux/init_task.h>
>
> static void mm_init_aio(struct mm_struct *mm)
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 5754d1c36..b955c47c0 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1326,6 +1326,27 @@ void exit_mmap(struct mm_struct *mm)
> vm_unacct_memory(nr_accounted);
> }
>
> +void exit_mmap_mapped_shared(struct mm_struct *mm)
> +{
> + struct vm_area_struct *vma;
> + VMA_ITERATOR(vmi, mm, 0);
> +
> + mmap_write_lock(mm);
> + lru_add_drain();
Why?
> +
> + for_each_vma(vmi, vma) {
Literally every single VMA? Including the gate VMA too?
No VMA locks... so that's already broken.
> + if (vma->vm_flags & VM_HUGETLB)
> + continue;
That's not how you test for hugetlb.
> + if (!(vma->vm_flags & VM_SHARED) || !file_inode(vma->vm_file)->i_nlink)
This isn't how we work with flags any more.
> + continue;
> + vma->vm_file->f_flags |= O_TMPCLOS;
Not sure directly manipulating file flags like this is valid in any way, shape,
or form.
> + do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
This is utterly broken, the outer loop will be invalidated by you removing
these, do_munmap() has its own iterator...
And this is just madly inefficient. Why wouldn't you just loop over the VMAs to
alter flags then unmap the whole range?
But this is also introducing a completely separate, duplicative, version of
exit_mmap().
You're not doing any of what that function does. You're just very inefficiently
unmapping everything?
> + cond_resched();
Of course!
> + }
> +
> + mmap_write_unlock(mm);
And VMAs can be mapped again now?
> +}
> +
> /*
> * Return true if the calling process may expand its vm space by the passed
> * number of pages
> --
> 2.34.1
>
I'm not sure if this idea can be made upstreamable in any way. But this patch,
or anything that looks like it or fundamentally alters mm, is just not
acceptable, sorry.
Lorenzo
^ permalink raw reply
* [PATCH v5 04/24] cpumask: Introduce cpu_preferred_mask
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
This patch:
- Declares and defines cpu_preferred_mask.
- Adds get/set helpers for it.
Bits are set and cleared by the scheduler based on observed steal time values.
A CPU is marked as preferred when it becomes active. Later it may be
marked as non-preferred depending on steal time values, when the
steal monitor is enabled.
Always maintain the design construct that preferred is a subset of active,
i.e. preferred ⊆ active ⊆ online ⊆ present ⊆ possible
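The subset rule above can be illustrated with a minimal userspace sketch. It is not code from this series: mask_t, struct cpu_masks and the helper names are hypothetical stand-ins for struct cpumask and the kernel helpers, and the active check inside cpu_set_preferred() is an assumption added for illustration (the set_cpu_preferred() macro in the patch does not perform it).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct cpumask: one bit per CPU, up to 64 CPUs. */
typedef uint64_t mask_t;

struct cpu_masks {
	mask_t active;
	mask_t preferred;
};

/* True when every bit set in sub is also set in super. */
static bool mask_subset(mask_t sub, mask_t super)
{
	return (sub & ~super) == 0;
}

/* Design check: preferred must always be a subset of active. */
static void check_invariant(const struct cpu_masks *m)
{
	assert(mask_subset(m->preferred, m->active));
}

/* Activation also marks the CPU preferred, as sched_cpu_activate() does. */
static void cpu_activate(struct cpu_masks *m, unsigned int cpu)
{
	m->active |= (mask_t)1 << cpu;
	m->preferred |= (mask_t)1 << cpu;
	check_invariant(m);
}

/* Deactivation clears preferred too, so the subset rule keeps holding. */
static void cpu_deactivate(struct cpu_masks *m, unsigned int cpu)
{
	m->preferred &= ~((mask_t)1 << cpu);
	m->active &= ~((mask_t)1 << cpu);
	check_invariant(m);
}

/*
 * Assumed policy for this sketch only: a CPU can become preferred
 * only while it is active; clearing the bit is always allowed.
 */
static void cpu_set_preferred(struct cpu_masks *m, unsigned int cpu, bool preferred)
{
	mask_t bit = (mask_t)1 << cpu;

	if (preferred && (m->active & bit))
		m->preferred |= bit;
	else
		m->preferred &= ~bit;
	check_invariant(m);
}
```

In the kernel the same subset test would presumably be expressed with cpumask_subset() on the real masks; the sketch only shows why activation and deactivation have to touch both masks together.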
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it macro instead (Yury Norov)
include/linux/cpumask.h | 21 ++++++++++++++++++++-
kernel/cpu.c | 6 ++++++
kernel/sched/core.c | 5 +++++
3 files changed, 31 insertions(+), 1 deletion(-)
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 80211900f373..5a643d608ea6 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -120,12 +120,20 @@ extern struct cpumask __cpu_enabled_mask;
extern struct cpumask __cpu_present_mask;
extern struct cpumask __cpu_active_mask;
extern struct cpumask __cpu_dying_mask;
+
+#ifdef CONFIG_PREFERRED_CPU
+extern struct cpumask __cpu_preferred_mask;
+#else
+#define __cpu_preferred_mask __cpu_active_mask
+#endif
+
#define cpu_possible_mask ((const struct cpumask *)&__cpu_possible_mask)
#define cpu_online_mask ((const struct cpumask *)&__cpu_online_mask)
#define cpu_enabled_mask ((const struct cpumask *)&__cpu_enabled_mask)
#define cpu_present_mask ((const struct cpumask *)&__cpu_present_mask)
#define cpu_active_mask ((const struct cpumask *)&__cpu_active_mask)
#define cpu_dying_mask ((const struct cpumask *)&__cpu_dying_mask)
+#define cpu_preferred_mask ((const struct cpumask *)&__cpu_preferred_mask)
extern atomic_t __num_online_cpus;
extern unsigned int __num_possible_cpus;
@@ -1161,6 +1169,7 @@ void init_cpu_possible(const struct cpumask *src);
#define set_cpu_present(cpu, present) assign_cpu((cpu), &__cpu_present_mask, (present))
#define set_cpu_active(cpu, active) assign_cpu((cpu), &__cpu_active_mask, (active))
#define set_cpu_dying(cpu, dying) assign_cpu((cpu), &__cpu_dying_mask, (dying))
+#define set_cpu_preferred(cpu, preferred) assign_cpu((cpu), &__cpu_preferred_mask, (preferred))
void set_cpu_online(unsigned int cpu, bool online);
void set_cpu_possible(unsigned int cpu, bool possible);
@@ -1256,7 +1265,12 @@ static __always_inline bool cpu_dying(unsigned int cpu)
return cpumask_test_cpu(cpu, cpu_dying_mask);
}
-#else
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+ return cpumask_test_cpu(cpu, cpu_preferred_mask);
+}
+
+#else /* NR_CPUS <= 1 */
#define num_online_cpus() 1U
#define num_possible_cpus() 1U
@@ -1294,6 +1308,11 @@ static __always_inline bool cpu_dying(unsigned int cpu)
return false;
}
+static __always_inline bool cpu_preferred(unsigned int cpu)
+{
+ return cpu == 0;
+}
+
#endif /* NR_CPUS > 1 */
#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
diff --git a/kernel/cpu.c b/kernel/cpu.c
index bc4f7a9ba64e..d623a9c5554a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -3107,6 +3107,11 @@ EXPORT_SYMBOL(__cpu_dying_mask);
atomic_t __num_online_cpus __read_mostly;
EXPORT_SYMBOL(__num_online_cpus);
+#ifdef CONFIG_PREFERRED_CPU
+struct cpumask __cpu_preferred_mask __read_mostly;
+EXPORT_SYMBOL(__cpu_preferred_mask);
+#endif
+
void init_cpu_present(const struct cpumask *src)
{
cpumask_copy(&__cpu_present_mask, src);
@@ -3164,6 +3169,7 @@ void __init boot_cpu_init(void)
/* Mark the boot cpu "present", "online" etc for SMP and UP case */
set_cpu_online(cpu, true);
set_cpu_active(cpu, true);
+ set_cpu_preferred(cpu, true);
set_cpu_present(cpu, true);
set_cpu_possible(cpu, true);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f4530eb543f..9e16946c9d62 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8685,6 +8685,9 @@ int sched_cpu_activate(unsigned int cpu)
*/
sched_set_rq_online(rq, cpu);
+ /* preferred is subset of active and follows its state */
+ set_cpu_preferred(cpu, true);
+
return 0;
}
@@ -8698,6 +8701,8 @@ int sched_cpu_deactivate(unsigned int cpu)
if (ret)
return ret;
+ set_cpu_preferred(cpu, false);
+
/*
* Remove CPU from nohz.idle_cpus_mask to prevent participating in
* load balancing when not active
--
2.47.3
^ permalink raw reply related
* [PATCH v5 03/24] kconfig: Provide PREFERRED_CPU option
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
Introduce a new config named PREFERRED_CPU.
This helps to:
- Avoid code bloat when PREFERRED_CPU=n. In that case preferred
is the same as active.
- Avoid ifdeffery around PREFERRED_CPU in many files.
Since the paravirtualized use case is the main driving force of this
feature, make it default for kernels with PARAVIRT=y.
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Make it depend on instead. (Yury Norov)
- Fix helper indentation (sashiko)
kernel/Kconfig.preempt | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 88c594c6d7fc..b3a543cb44cd 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -192,3 +192,17 @@ config SCHED_CLASS_EXT
For more information:
Documentation/scheduler/sched-ext.rst
https://github.com/sched-ext/scx
+
+config PREFERRED_CPU
+ bool "Dynamic vCPU management based on steal time"
+ depends on PARAVIRT && SMP
+ default y
+ help
+ This feature helps to reduce the steal time in paravirtualised
+ environment, there by reducing vCPU preemption. Reducing vCPU
+ preemption provides improved lock holder preemption and reduces
+ cost of vCPU preemption in the host.
+
+ By default preferred CPUs will be same as active CPUs. Depending
+ on the steal time when steal_monitor driver is enabled,
+ preferred CPUs could become subset of active CPUs.
--
2.47.3
^ permalink raw reply related
* [PATCH v5 02/24] sched/docs: Document cpu_preferred_mask and Preferred CPU concept
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc,
kernel test robot
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
Add documentation for new cpumask called cpu_preferred_mask. This could
help users in understanding what this mask is and the concept behind it.
Document how to enable it and implementation aspects of it.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202606180717.yNM0yb41-lkp@intel.com/
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
v4->v5:
- Change text to reflect new driver info.
- Changes suggested by Randy Dunlap.
- Sashiko nitpicks
Documentation/scheduler/sched-arch.rst | 50 ++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/Documentation/scheduler/sched-arch.rst b/Documentation/scheduler/sched-arch.rst
index ed07efea7d02..8fc56edd8e03 100644
--- a/Documentation/scheduler/sched-arch.rst
+++ b/Documentation/scheduler/sched-arch.rst
@@ -62,6 +62,56 @@ Your cpu_idle routines need to obey the following rules:
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
+Preferred CPUs
+==============
+
+In virtualised environments it is possible to overcommit CPU resources.
+i.e sum of virtual CPU(vCPU) of all VMs is greater than number of physical
+CPUs(pCPU). Under such conditions when all or many VMs have high utilization,
+hypervisor won't be able to satisfy the CPU requirement and has to context
+switch within or across VMs. i.e hypervisor needs to preempt one vCPU to run
+another. This is called vCPU preemption. This is more expensive compared to
+task context switch within a vCPU.
+
+In such cases it is better that combined vCPU ask from all VMs is reduced
+by not using some of the vCPUs in each VM. vCPUs where workload can be safely
+scheduled which won't increase any contention for pCPU are called as
+"Preferred CPUs".
+
+Main design construct is preferred CPUs is always subset of active CPUs.
+In most cases preferred CPUs will be same as active CPUs, when there is pCPU
+contention, Preferred CPUs will reduce based on the amount of steal time.
+When the pCPU contention goes away as indicated by steal time, Preferred CPUs
+will become same as active CPUs again. This is done by loading the
+steal_monitor driver available at drivers/virt/steal_monitor.
+
+For scheduling decisions such as wakeup, pushing the task etc, needs this
+CPU state info. This is maintained in cpu_preferred_mask.
+vCPUs which are not in cpu_preferred_mask should be treated as vCPUs which
+should not be used at this moment provided it doesn't break user affinity.
+
+This is achieved by
+1. Selecting a preferred CPU at wakeup.
+2. Push the task away from non-preferred CPU at tick.
+3. Only select preferred CPUs for load balance.
+
+/sys/devices/system/cpu/preferred prints the current cpu_preferred_mask in
+cpulist format.
+
+Notes:
+1. This feature is available under CONFIG_PREFERRED_CPU. This enables
+ steal_monitor driver. On enabling the driver, CPU preferred state
+ can change based on steal time. With CONFIG_PREFERRED_CPU=n,
+ preferred CPUs is same as active CPUs.
+
+2. This feature works for FAIR class only.
+
+3. A task pinned, which can't be moved to preferred CPUs will continue
+ to run based on its affinity. But no load balancing happens.
+
+4. Decision to use/not use is driven by kernel. Hence it shouldn't
+ break user affinities. One of the main reasons why CPU hotplug
+ or Isolated cpuset partitions was not a solution.
Possible arch/ problems
=======================
--
2.47.3
^ permalink raw reply related
* [PATCH v5 01/24] sched/debug: Remove unused schedstats
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
In-Reply-To: <20260625124648.802832-1-sshegde@linux.ibm.com>
nr_migrations_cold, nr_wakeups_passive and nr_wakeups_idle are not
being updated anywhere. So remove them.
These are per process stats. So updating sched stats version isn't
necessary.
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com>
---
include/linux/sched.h | 3 ---
kernel/sched/debug.c | 3 ---
2 files changed, 6 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 35e6183ef615..fc6ecb3869dd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -550,7 +550,6 @@ struct sched_statistics {
s64 exec_max;
u64 slice_max;
- u64 nr_migrations_cold;
u64 nr_failed_migrations_affine;
u64 nr_failed_migrations_running;
u64 nr_failed_migrations_hot;
@@ -563,8 +562,6 @@ struct sched_statistics {
u64 nr_wakeups_remote;
u64 nr_wakeups_affine;
u64 nr_wakeups_affine_attempts;
- u64 nr_wakeups_passive;
- u64 nr_wakeups_idle;
#ifdef CONFIG_SCHED_CORE
u64 core_forceidle_sum;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 40584b27ea0c..f3a033b34ba0 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1359,7 +1359,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(wait_count);
PN_SCHEDSTAT(iowait_sum);
P_SCHEDSTAT(iowait_count);
- P_SCHEDSTAT(nr_migrations_cold);
P_SCHEDSTAT(nr_failed_migrations_affine);
P_SCHEDSTAT(nr_failed_migrations_running);
P_SCHEDSTAT(nr_failed_migrations_hot);
@@ -1371,8 +1370,6 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
P_SCHEDSTAT(nr_wakeups_remote);
P_SCHEDSTAT(nr_wakeups_affine);
P_SCHEDSTAT(nr_wakeups_affine_attempts);
- P_SCHEDSTAT(nr_wakeups_passive);
- P_SCHEDSTAT(nr_wakeups_idle);
avg_atom = p->se.sum_exec_runtime;
if (nr_switches)
--
2.47.3
^ permalink raw reply related
* [PATCH v5 00/24] sched: Introduce cpu_preferred_mask and steal-driven vCPU backoff
From: Shrikanth Hegde @ 2026-06-25 12:46 UTC (permalink / raw)
To: linux-kernel, mingo, peterz, juri.lelli, vincent.guittot,
yury.norov, kprateek.nayak, iii, corbet
Cc: sshegde, tglx, gregkh, pbonzini, seanjc, vschneid, huschle,
rostedt, dietmar.eggemann, maddy, srikar, hdanton, chleroy,
vineeth, frederic, arighi, pauld, christian.loehle, tj,
tommaso.cucinotta, maz, rafael, rdunlap, kernellwp, linux-doc
Very briefly,
- Maintain set of CPUs which can be used by workload. It is denoted as
cpu_preferred_mask
- Periodically compute the steal time. If steal time is high/low based
on the thresholds, either reduce/increase the preferred CPUs. This is
handled in a new driver called steal_monitor
- If a CPU is marked as non-preferred, push the task running on it if
possible.
- Use this CPU state in wakeup and load balance to ensure tasks run
within preferred CPUs.
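As a rough illustration of the threshold step in the list above, the following sketch grows or shrinks the preferred CPU count by one per sampling interval. It is an assumption-laden model, not the steal_monitor code: struct steal_policy, its field names, the percent units and the one-CPU step size are all hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical thresholds, in percent of steal time. The real driver
 * exposes its knobs as module parameters whose names and defaults are
 * not shown in this cover letter.
 */
struct steal_policy {
	unsigned int high_pct;	/* above this, back off one CPU */
	unsigned int low_pct;	/* below this, bring one CPU back */
};

/*
 * Return the preferred CPU count for the next interval.
 * The result never drops below 1 (preferred cannot be empty) and
 * never exceeds nr_active (preferred is a subset of active).
 */
static unsigned int next_preferred(const struct steal_policy *p,
				   unsigned int steal_pct,
				   unsigned int nr_preferred,
				   unsigned int nr_active)
{
	assert(p->low_pct <= p->high_pct);
	assert(nr_preferred >= 1 && nr_preferred <= nr_active);

	if (steal_pct > p->high_pct && nr_preferred > 1)
		return nr_preferred - 1;
	if (steal_pct < p->low_pct && nr_preferred < nr_active)
		return nr_preferred + 1;

	/* Between the thresholds: hold steady to avoid oscillation. */
	return nr_preferred;
}
```

Keeping a gap between the two thresholds is one plausible way to stop the count from flapping when steal time hovers near a single cutoff.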
For more details on idea, problem statement and performance numbers,
please refer to cover-letter of v2[2] and OSPM talk[1].
*** Please review and provide your feedback!! ***
[1]:https://youtu.be/adxUKFPlOp0
[2] v2: https://lore.kernel.org/all/20260407191950.643549-1-sshegde@linux.ibm.com/#t
[3] v4: https://lore.kernel.org/all/20260617174139.155540-1-sshegde@linux.ibm.com/#t
Thank you very much for the feedback so far. This has helped the code
evolve towards clear abstraction layers and get simpler (hopefully).
Apologies in advance if I have missed any comment.
base commit:
tip/sched/core at c095741713d1 ("sched/fair: Fix newidle vs core-sched")
v4->v5:
- Move the computation of steal time and decide on preferred CPU state
to a driver. Drop those changes in core scheduler. (Yury Norov, K Prateek Nayak)
- A new driver called steal_monitor is added in drivers/virt/ (K Prateek Nayak)
(Please let me know if there is a better place for it. I can move it
there)
- New driver does periodic computation of steal time and
increments/decrements the preferred CPUs.
- Debug knobs can be changed via module parameters. (Yury Norov)
- Default implementation are weak symbols. Archs may override by
providing strong symbols in new respective arch specific file.
- Everything is centered around CONFIG_PREFERRED_CPU. No new config
for new driver. Driver gets added to kernel, but not loaded by
default.
- Load the driver to enable steal_monitor functionality. Unload to
remove the same.
- Make CONFIG_PREFERRED_CPU depend on PARAVIRT && SMP (Yury Norov)
- move set_cpu_preferred to a macro. (Yury Norov)
With CONFIG_PREFERRED_CPU=n it just acts on active CPUs.
It shouldn't alter any functionality.
- Do a simple encoding for has_preferred_cpu_state, which aims to avoid
repeated cpumask_interest in is_cpu_allowed.
(Please let me know if new variable based approach to is_cpu_allowed
should be done instead).
- Move select_fallback_rq above the rq_lock. (sashiko)
- Few documentation nitpicks (Randy Dunlap, sashiko)
- Avoid any decision for is_cpu_allowed for other classes (sashiko)
- Don't pull load towards non-preferred CPUs in idle and newidle
balance. (Inferred when seeing sashiko comments)
- Fix leaking of task_struct in push_work_done (K Prateek Nayak)
- Module parameters aren't checked for sane values. One should know
what they are writing to it. If one writes 0 for interval_ms,
then it gets set to default value again to avoid workqueue lockup.
- Added a few design construct related checks in the periodic work
to ensure any future arch specific implementations follow it.
1. preferred is subset of active.
2. preferred cannot be empty.
- Added Documentation of steal_monitor in Documentation/driver-api/
(Let me know if there is better place for it)
Performance numbers are expected to be the same or slightly better than v2.
With the driver, one major overhead in sched_tick has been removed, i.e.
finding the first housekeeping CPU, which was O(N).
Apologies in advance if any critical information is missing
regarding the new driver, such as policy, documentation or
implementation. Please let me know, and I can make those changes.
I have ensured checkpatch --strict is happy.
Also, I think there should be a MAINTAINERS file entry for the new
driver. I don't see a drivers/virt/* entry.
It could be either a new entry for the driver or a few files under the
SCHEDULER entry.
Let me know if and what I should add. I am a bit cautious about such a
change. I am willing to maintain this driver; other than that
I don't know what else is going to be necessary for it. I don't have
any maintainer experience either :)
PS: Sorry for the long CC list. Please unicast it to me if you want to
be dropped from the CC list.
Shrikanth Hegde (24):
sched/debug: Remove unused schedstats
sched/docs: Document cpu_preferred_mask and Preferred CPU concept
kconfig: Provide PREFERRED_CPU option
cpumask: Introduce cpu_preferred_mask
sysfs: Add preferred CPU file
sched/core: allow only preferred CPUs in is_cpu_allowed
sched/fair: Select preferred CPU at wakeup when possible
sched/fair: load balance only among preferred CPUs
sched/fair: Pull the load on preferred CPU
sched/core: Keep tick on non-preferred CPUs until tasks are out
sched/core: Push current task from non preferred CPU
sched/debug: Add migration stats due to non preferred CPUs
virt/steal_monitor: Add documentation
virt: Introduce steal monitor driver
virt/steal_monitor: Restore to active on module disable
virt/steal_monitor: Define steal_monitor structure
virt/steal_monitor: Add control knobs for handling steal values
virt/steal_monitor: Compute work at regular intervals
virt/steal_monitor: Provide default method to get systemwide steal
time
virt/steal_monitor: Provide default method to inc/dec preferred CPUs
virt/steal_monitor: Provide default method to get num of CPUs for
steal ratio
virt/steal_monitor: Act on steal values at regular intervals
virt/steal_monitor: Add direction control
virt/steal_monitor: Add design check of preferred subset of active
.../ABI/testing/sysfs-devices-system-cpu | 11 ++
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/steal-monitor.rst | 93 ++++++++++++
Documentation/scheduler/sched-arch.rst | 50 +++++++
drivers/base/cpu.c | 8 ++
drivers/virt/Makefile | 1 +
drivers/virt/steal_monitor/Makefile | 14 ++
drivers/virt/steal_monitor/defaults.c | 105 ++++++++++++++
drivers/virt/steal_monitor/sm_core.c | 124 ++++++++++++++++
drivers/virt/steal_monitor/sm_core.h | 32 +++++
include/linux/cpumask.h | 21 ++-
include/linux/sched.h | 5 +-
kernel/Kconfig.preempt | 14 ++
kernel/cpu.c | 6 +
kernel/sched/core.c | 133 +++++++++++++++++-
kernel/sched/debug.c | 4 +-
kernel/sched/fair.c | 11 +-
kernel/sched/sched.h | 36 +++++
18 files changed, 659 insertions(+), 10 deletions(-)
create mode 100644 Documentation/driver-api/steal-monitor.rst
create mode 100644 drivers/virt/steal_monitor/Makefile
create mode 100644 drivers/virt/steal_monitor/defaults.c
create mode 100644 drivers/virt/steal_monitor/sm_core.c
create mode 100644 drivers/virt/steal_monitor/sm_core.h
--
2.47.3
^ permalink raw reply
* Re: [PATCH v7 10/42] KVM: guest_memfd: Ensure pages are not in use before conversion
From: David Hildenbrand (Arm) @ 2026-06-25 12:36 UTC (permalink / raw)
To: Ackerley Tng, Vlastimil Babka (SUSE), aik, andrew.jones,
binbin.wu, brauner, chao.p.peng, ira.weiny, jmattson, jthoughton,
michael.roth, oupton, pankaj.gupta, qperret, rick.p.edgecombe,
rientjes, shivankg, steven.price, tabba, willy, wyihan,
yan.y.zhao, forkloop, pratyush, suzuki.poulose, aneesh.kumar,
liam, Paolo Bonzini, Sean Christopherson, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers,
Jonathan Corbet, Shuah Khan, Shuah Khan, Vishal Annapurve,
Andrew Morton, Chris Li, Kairui Song, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Jason Gunthorpe
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgHM4a66Jx9++6iioQLpFY-KgPvjY5+bg_X97DfSjpXzRQ@mail.gmail.com>
On 6/19/26 02:17, Ackerley Tng wrote:
> "Vlastimil Babka (SUSE)" <vbabka@kernel.org> writes:
>
>> On 5/23/26 02:17, Ackerley Tng via B4 Relay wrote:
>>> From: Ackerley Tng <ackerleytng@google.com>
>>>
>>> When converting memory to private in guest_memfd, it is necessary to ensure
>>> that the pages are not currently being accessed by any other part of the
>>> kernel or userspace to avoid any current user writing to guest private
>>> memory.
>>>
>>> guest_memfd checks for unexpected refcounts to determine whether a page is
>>> still in use. The only expected refcounts after unmapping the range
>>> requested for conversion are those that are held by guest_memfd itself.
>>
>> Is it sufficient to only check, and not also freeze the refcount? (i.e.
>> using folio_ref_freeze()), because without freezing, anything (e.g.
>> compaction's pfn-based scanner) could do a speculative folio_try_get() and
>> the checked refcount becomes stale.
>>
>
> I believe there's no issue here, since the main thing here is to check
> for long-term pins on the folio. Perhaps David can help me verify. :)
I think I raised this in the past as well: ideally, we'd be freezing the
refcount, then, there is no need to worry about any concurrent access.
However, we could really only get additional page references through PFN walkers
(or speculative references), not through page tables or GUP pins, which is what
we care about.
So if we can tolerate a speculative bump+release of a folio reference, likely
we're good.
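The difference between checking and freezing can be sketched in userspace with C11 atomics. This is only a model under stated assumptions: struct fake_folio and the ref_*() helpers are hypothetical, and the freeze is modelled as a compare-and-exchange of the expected count to zero, which is how the discussion above describes folio_ref_freeze() behaving.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: only the reference count is modelled. */
struct fake_folio {
	atomic_int refcount;
};

/* Check only: the answer can be stale as soon as it is returned. */
static bool ref_check(struct fake_folio *f, int expected)
{
	return atomic_load(&f->refcount) == expected;
}

/* Freeze: atomically replace the expected count with 0, or fail. */
static bool ref_freeze(struct fake_folio *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

/* Speculative get: succeeds only while the count is non-zero. */
static bool ref_try_get(struct fake_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Unfreeze: restore the count once the protected section is done. */
static void ref_unfreeze(struct fake_folio *f, int count)
{
	assert(atomic_load(&f->refcount) == 0);
	atomic_store(&f->refcount, count);
}
```

With a plain check, a speculative ref_try_get() can still succeed right afterwards; once the count is frozen at zero it fails until ref_unfreeze(). Whether the conversion path needs that stronger guarantee is exactly the open question in this thread.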
--
Cheers,
David
^ permalink raw reply
* Re: [PATCH] Documentation: landlock: Document fs.resolve_unix audit blocker
From: Günther Noack @ 2026-06-25 12:31 UTC (permalink / raw)
To: Doehyun Baek
Cc: Mickaël Salaün, Jonathan Corbet, Shuah Khan,
Sebastian Andrzej Siewior, linux-security-module, linux-doc,
linux-kernel
In-Reply-To: <20260625092819.1870049-1-doehyunbaek@gmail.com>
On Thu, Jun 25, 2026 at 09:28:19AM +0000, Doehyun Baek wrote:
> The Landlock audit code can emit fs.resolve_unix as a filesystem blocker
> for pathname UNIX socket resolution denials, but the admin guide's blockers
> list did not mention it.
>
> Add the missing blocker name and ABI version to keep the audit
> documentation in sync with the emitted records.
>
> Fixes: ae97330d1bd6 ("landlock: Control pathname UNIX domain socket resolution by path")
> Signed-off-by: Doehyun Baek <doehyunbaek@gmail.com>
> ---
> Documentation/admin-guide/LSM/landlock.rst | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst
> index 314052bbeb0a..8eb85c9381ff 100644
> --- a/Documentation/admin-guide/LSM/landlock.rst
> +++ b/Documentation/admin-guide/LSM/landlock.rst
> @@ -52,6 +52,7 @@ AUDIT_LANDLOCK_ACCESS
> - fs.refer (ABI 2+)
> - fs.truncate (ABI 3+)
> - fs.ioctl_dev (ABI 5+)
> + - fs.resolve_unix (ABI 9+)
>
> **net.*** - Network access rights (ABI 4+):
> - net.bind_tcp - TCP port binding was denied
>
> base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
> --
> 2.43.0
>
Thanks, good catch!
Reviewed-by: Günther Noack <gnoack@google.com>
^ permalink raw reply
* Re: [PATCH] landlock: Documentation wording cleanups
From: Günther Noack @ 2026-06-25 12:29 UTC (permalink / raw)
To: Mickaël Salaün
Cc: linux-doc, linux-security-module, Alejandro Colomar,
Alejandro Colomar
In-Reply-To: <20260516190112.4924-1-gnoack3000@gmail.com>
On Sat, May 16, 2026 at 09:01:12PM +0200, Günther Noack wrote:
> Documentation cleanups suggested by Alejandro Colomar,
> which we have also applied in the man pages.
>
> Link: https://lore.kernel.org/all/agW4yMK6CinJGqXt@devuan/
> Suggested-by: Alejandro Colomar <alx@kernel.org>
> Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> ---
> include/uapi/linux/landlock.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index 10a346e55e95..48c12ddf1108 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -255,16 +255,16 @@ struct landlock_net_port_attr {
> * :manpage:`connect(2)` as well as calls to :manpage:`sendmsg(2)` with an
> * explicit recipient address.
> *
> - * This access right only applies to connections to UNIX server sockets which
> + * This access right applies only to connections to UNIX server sockets which
> * were created outside of the newly created Landlock domain (e.g. from within
> * a parent domain or from an unrestricted process). Newly created UNIX
> * servers within the same Landlock domain continue to be accessible. In this
> * regard, %LANDLOCK_ACCESS_FS_RESOLVE_UNIX has the same semantics as the
> * ``LANDLOCK_SCOPE_*`` flags.
> *
> - * If a resolve attempt is denied, the operation returns an ``EACCES`` error,
> - * in line with other filesystem access rights (but different to denials for
> - * abstract UNIX domain sockets).
> + * If a resolution attempt is denied, the operation returns an ``EACCES``
> + * error, in line with other filesystem access rights (but different to
> + * denials for abstract UNIX domain sockets).
> *
> * This access right is available since the ninth version of the Landlock ABI.
> *
> --
> 2.54.0
>
Friendly ping, Mickaël!
This is only a minor change, but keeps the man pages and kernel docs wording in line.
—Günther
^ permalink raw reply
* Re: [PATCH v9 4/6] mm/memory-failure: add panic option for unrecoverable pages
From: Miaohe Lin @ 2026-06-25 12:22 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Jonathan Corbet, Shuah Khan, Liam R. Howlett, lance.yang,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260609-ecc_panic-v9-4-432a74002e74@debian.org>
On 2026/6/9 18:56, Breno Leitao wrote:
> Add a sysctl panic_on_unrecoverable_memory_failure (disabled by
> default) that triggers a kernel panic when memory_failure()
> encounters pages that cannot be recovered. This provides a clean
> crash with useful debug information rather than allowing silent
> data corruption or a delayed crash at an unrelated code path.
>
> Panic eligibility is intentionally narrow: only MF_MSG_KERNEL with
> result == MF_IGNORED panics. After the previous patch, MF_MSG_KERNEL
> covers PG_reserved pages and the kernel-owned pages promoted from
> get_hwpoison_page() via -ENOTRECOVERABLE (slab, page tables,
> large-kmalloc).
>
> All other action types are excluded:
>
> - MF_MSG_GET_HWPOISON and MF_MSG_KERNEL_HIGH_ORDER can be reached by
> transient refcount races with the page allocator (an in-flight buddy
> allocation has refcount 0 and is no longer on the buddy free list,
> briefly), and panicking on them would risk killing the box for what
> is actually a recoverable userspace page.
>
> - MF_MSG_UNKNOWN means identify_page_state() could not classify the
> page; that is precisely the wrong basis for a panic decision.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.
^ permalink raw reply
* Re: [PATCH v9 2/6] mm/memory-failure: surface unhandlable kernel pages as -ENOTRECOVERABLE
From: Miaohe Lin @ 2026-06-25 12:02 UTC (permalink / raw)
To: Breno Leitao
Cc: linux-mm, linux-kernel, linux-doc, linux-kselftest,
linux-trace-kernel, kernel-team, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, Shuah Khan, Naoya Horiguchi,
Jonathan Corbet, Shuah Khan, Liam R. Howlett, lance.yang,
Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260609-ecc_panic-v9-2-432a74002e74@debian.org>
On 2026/6/9 18:56, Breno Leitao wrote:
> get_any_page() collapses every HWPoisonHandlable() rejection into a
> single -EIO via the __get_hwpoison_page() -> -EBUSY -> shake_page()
> -> retry path. That is correct for the transient case (a userspace
> folio briefly off LRU during migration or compaction, which a later
> shake can drag back), but wrong for stable kernel-owned pages: slab,
> page-table, large-kmalloc and PG_reserved pages will never become
> HWPoisonHandlable(), so the retry loop is wasted work and the final
> -EIO loses the "this is structurally unrecoverable" information.
> memory_failure() then maps -EIO into MF_MSG_GET_HWPOISON, which the
> panic-on-unrecoverable sysctl deliberately does not act on.
>
> Introduce HWPoisonKernelOwned(), a small predicate that positively
> identifies pages the hwpoison handler cannot recover from:
>
> HWPoisonKernelOwned(p, flags) :=
> !(MF_SOFT_OFFLINE && page_has_movable_ops(p)) &&
> (PageReserved(p) ||
> PageSlab(head) || PageTable(head) || PageLargeKmalloc(head))
>
> where head = compound_head(p).
>
> PG_reserved is a per-page flag (PF_NO_COMPOUND) and is tested on the
> page directly. The slab, page-table and large-kmalloc page-type bits
> are only stored on the head page, so those tests resolve the compound
> head first, then re-read compound_head(page) afterwards: a concurrent
> split or compound free that moves head invalidates the just-read flags
> and the loop retries. The lookup still takes no refcount, mirroring
> the rest of get_any_page(); the recheck closes the common split race,
> and a residual free->alloc->free in the same window can only mis-tag
> a genuinely poisoned page, never reclassify a handlable one.
>
> The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors the
> same exception in HWPoisonHandlable(): soft-offline is allowed to
> migrate movable_ops pages even though they are not on the LRU, and
> we must not pre-empt that with an unrecoverable verdict.
>
> The list is intentionally not exhaustive. vmalloc and kernel-stack
> pages, for example, do not carry a page_type bit and would need a
> different oracle; they keep going through the existing retry path
> unchanged. This is the smallest set we can identify with certainty
> by page type.
>
> Wire the helper into the top of get_any_page() to short-circuit
> those pages before the retry loop runs. On a hit, drop the caller's
> MF_COUNT_INCREASED reference (if any) and return -ENOTRECOVERABLE
> straight away. Pages outside the helper's positive list still take
> the existing retry path and return -EIO, leaving operator-visible
> behaviour for those cases unchanged.
>
> Extend the unhandlable-page pr_err() to fire for either errno and
> update the get_hwpoison_page() kerneldoc to document the new return.
>
> memory_failure() still folds every negative return into
> MF_MSG_GET_HWPOISON via its existing "else if (res < 0)" branch, so
> this patch on its own only changes the errno that soft_offline_page()
> can propagate to its callers. A follow-up wires -ENOTRECOVERABLE
> through memory_failure() and reports MF_MSG_KERNEL for the
> unrecoverable cases, which is what the
> panic_on_unrecoverable_memory_failure sysctl observes.
>
> Suggested-by: David Hildenbrand <david@kernel.org>
> Suggested-by: Lance Yang <lance.yang@linux.dev>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> mm/memory-failure.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 58 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f4d3e6e20e13..eed9de387694 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1325,6 +1325,46 @@ static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
> return PageLRU(page) || is_free_buddy_page(page);
> }
>
> +/*
> + * Positive identification of pages the hwpoison handler cannot recover.
> + * These page types are owned by kernel internals (no userspace mapping
> + * to unmap, no file mapping to invalidate, no migration target), so the
> + * shake_page() / retry loop in get_any_page() can never turn them into
> + * something HWPoisonHandlable() will accept. Short-circuit them to
> + * -ENOTRECOVERABLE so callers can panic on operator request instead of
> + * spinning through retries that exit as a transient-looking -EIO.
> + *
> + * The MF_SOFT_OFFLINE / page_has_movable_ops() opt-out mirrors
> + * HWPoisonHandlable(): soft-offline is allowed to migrate movable_ops
> + * pages even though they are not on the LRU.
> + */
> +static inline bool HWPoisonKernelOwned(struct page *page, unsigned long flags)
> +{
> + struct page *head;
> +
> + if ((flags & MF_SOFT_OFFLINE) && page_has_movable_ops(page))
> + return false;
> +
> + /* PG_reserved is a per-page flag, never set on a compound page. */
> + if (PageReserved(page))
> + return true;
> +
> + /*
> + * Page-type bits live only on the head page, so resolve any tail
> + * first. The check takes no refcount; recheck the head afterwards
> + * so a concurrent split or compound free cannot leave us trusting
> + * a stale view. A free->alloc->free in the same window is still
> + * possible but closing it would require taking a reference here.
> + */
> +retry:
> + head = compound_head(page);
> + if (!(PageSlab(head) || PageTable(head) || PageLargeKmalloc(head)))
> + return false;
> + if (head != compound_head(page))
> + goto retry;
Looks good to me, with one comment: should we write the above along the lines of the following?
bool kernel_owned;
retry:
head = compound_head(page);
kernel_owned = PageSlab(head) || PageTable(head) || PageLargeKmalloc(head);
if (head != compound_head(page))
goto retry;
I.e. we should always check whether compound_head has changed, regardless of whether
the page is owned by the kernel, so we can obtain a relatively stable result?
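As a rough illustration of the suggestion above, here is a minimal userspace sketch of the "read head, evaluate, always recheck" pattern. Everything in it is a hypothetical stand-in (struct mock_page, the MOCK_* bits, mock_compound_head(), mock_kernel_owned()); it is not the kernel's struct page or the patch's helper, and it only shows that both the true and the false answer are returned after the head has been confirmed unchanged.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for page-type bits; not the kernel's definitions. */
enum {
	MOCK_SLAB          = 1u << 0,
	MOCK_TABLE         = 1u << 1,
	MOCK_LARGE_KMALLOC = 1u << 2,
};

/* Hypothetical model of a page: a head pointer plus head-only type bits. */
struct mock_page {
	_Atomic(struct mock_page *) head;	/* points to itself on a head page */
	atomic_uint type_bits;			/* only meaningful on the head page */
};

static struct mock_page *mock_compound_head(struct mock_page *page)
{
	return atomic_load(&page->head);
}

static bool mock_kernel_owned(struct mock_page *page)
{
	struct mock_page *head;
	bool kernel_owned;

	do {
		head = mock_compound_head(page);
		kernel_owned = atomic_load(&head->type_bits) &
			       (MOCK_SLAB | MOCK_TABLE | MOCK_LARGE_KMALLOC);
		/*
		 * Recheck unconditionally: a "false" read from a stale head
		 * is no more trustworthy than a "true" one.
		 */
	} while (head != mock_compound_head(page));

	return kernel_owned;
}
```

Like the original, this takes no reference, so it narrows the race window rather than closing it.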
Thanks.
.
^ permalink raw reply
* Re: [PATCH v4 1/5] mm/zswap: Extend shrink_memcg() writeback capability
From: Hao Jia @ 2026-06-25 11:31 UTC (permalink / raw)
To: Yosry Ahmed
Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin, linux-mm,
linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAO9r8zMmnYkXocZ9Fb9DL_rdAHt5xtT_FLMxJD1bHcM3B4wTFw@mail.gmail.com>
On 2026/6/25 00:57, Yosry Ahmed wrote:
>>
>> /*
>> * Scan up to @nr_to_scan pages across the per-node zswap LRUs of @memcg
>> * and write back the reclaimable ones.
>> *
>> * Since the second-chance algorithm rotates referenced entries to the
>> * LRU tail, the per-node scan is capped at the current LRU length so
>> * each entry is scanned at most once per call. It is up to the caller
>> * to handle retries, deciding whether to scan the next memcg to complete
>
> Nit: "whether to scan another memcg to complete.."
Will fix in the next version.
>
>> * the full iteration, or to rescan the current memcg to drain its zswap
>> * entries.
>> *
>> * Return: The number of compressed bytes written back (>= 0), or -ENOENT
>> * if @memcg has writeback disabled, is a zombie cgroup, or has empty
>> * zswap LRUs.
>> */
>> static long shrink_memcg(struct mem_cgroup *memcg, unsigned long nr_to_scan)
>> {
>> struct zswap_shrink_walk_arg walk_arg = {
>> .bytes_written = 0,
>> .encountered_page_in_swapcache = false,
>> };
>> unsigned long nr_remaining = nr_to_scan;
>> int nid;
>>
>> if (!mem_cgroup_zswap_writeback_enabled(memcg))
>> return -ENOENT;
>>
>> /*
>> * Skip zombies because their LRUs are reparented and we would be
>> * reclaiming from the parent instead of the dead memcg.
>> */
>> if (memcg && !mem_cgroup_online(memcg))
>> return -ENOENT;
>>
>> for_each_node_state(nid, N_NORMAL_MEMORY) {
>> unsigned long nr_to_walk;
>>
>> /*
>> * Cap the walk at the current LRU length to ensure each entry is
>> * scanned at most once per call. Referenced entries are rotated
>> * to the tail for a second chance, and this bound prevents them
>> * from being revisited within a single call. Retries are left to
>> * the caller, which can choose to rescan the current memcg or
>> * move on to the next one.
>> */
>
> Nit: Make this more concise since it's already explained above.
>
Will fix in the next version. Thanks a lot for the review!
Thanks,
Hao
> Otherwise this looks good to me, thank you!
>
>> nr_to_walk = min(nr_remaining,
>> list_lru_count_one(&zswap_list_lru, nid, memcg));
>> if (!nr_to_walk)
>> continue;
>>
>> nr_remaining -= nr_to_walk;
>> list_lru_walk_one(&zswap_list_lru, nid, memcg, &shrink_memcg_cb,
>> &walk_arg, &nr_to_walk);
>> /* Return the unused share of the budget to the pool. */
>> nr_remaining += nr_to_walk;
>>
>> if (!nr_remaining)
>> break;
>> }
>>
>> /* Nothing was scanned: every LRU under @memcg was empty. */
>> if (nr_remaining == nr_to_scan)
>> return -ENOENT;
>>
>> return walk_arg.bytes_written;
>> }
>>
^ permalink raw reply
* Re: [PATCH v4 2/5] mm/zswap: Factor writeback loop out of shrink_worker()
From: Hao Jia @ 2026-06-25 11:28 UTC (permalink / raw)
To: Yosry Ahmed
Cc: akpm, tj, hannes, shakeel.butt, mhocko, mkoutny, nphamcs,
chengming.zhou, muchun.song, roman.gushchin, linux-mm,
linux-kernel, linux-doc, Hao Jia
In-Reply-To: <CAO9r8zPSZLaqLXw87V3q4tZa8WD7xCympKqfLMLB+o-++GksJQ@mail.gmail.com>
On 2026/6/25 01:00, Yosry Ahmed wrote:
> On Wed, Jun 24, 2026 at 4:55 AM Hao Jia <jiahao.kernel@gmail.com> wrote:
>>
>>
>>
>> On 2026/6/23 07:36, Yosry Ahmed wrote:
>>
>>
>> Perhaps something like this?
>>
>> struct zswap_shrink_state {
>> int attempts;
>> int failures;
>> bool stop;
>> };
>>
>> static bool zswap_shrink_no_candidate(struct zswap_shrink_state *s)
>> {
>> if (!s->attempts && ++s->failures == MAX_RECLAIM_RETRIES)
>> return true;
>>
>> s->attempts = 0;
>> return false;
>> }
>>
>> static long zswap_shrink_one(struct mem_cgroup *memcg,
>> struct zswap_shrink_state *s)
>> {
>> long shrunk;
>>
>> shrunk = shrink_memcg(memcg, NR_ZSWAP_WB_BATCH);
>> if (shrunk == -ENOENT)
>> return 0;
>>
>> s->attempts++;
>> if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
>> s->stop = true;
>
> Do we need 'stop' or can we just return a value here to indicate that
> we should stop (e.g. -EBUSY)?
>
Perhaps we could return -EAGAIN instead of -EBUSY? This would align with
the semantics of the memory.reclaim interface, which returns -EAGAIN
when it reclaims fewer bytes than requested.
>>
>> return shrunk;
>> }
>>
>> static void shrink_worker(struct work_struct *w)
>> {
>> struct zswap_shrink_state s = {};
>> unsigned long thr;
>>
>> /* Reclaim down to the accept threshold */
>> thr = zswap_accept_thr_pages();
>>
>> while (zswap_total_pages() > thr) {
>> struct mem_cgroup *memcg;
>>
>> cond_resched();
>>
>> memcg = zswap_iter_global();
>> if (!memcg) {
>> if (zswap_shrink_no_candidate(&s))
>> break;
>> continue;
>> }
>>
>> zswap_shrink_one(memcg, &s);
>> /* Drop the extra reference taken by the iterator. */
>> mem_cgroup_put(memcg);
>> if (s.stop)
>> break;
>> }
>> }
>
> I think splitting the shrink/retry logic over 2 functions makes it
> more difficult to follow, so yeah I think fold
> zswap_shrink_no_candidate() into zswap_shrink_one(). Then the callers
> only need to iterate memcgs (depending on the context) and call
> zswap_shrink_one() for each of them.
So, something like this?
/* Track progress of a memcg-tree writeback walk. */
struct zswap_shrink_state {
int attempts;
int failures;
};
/*
* Take one step of a memcg-tree writeback walk driven by the caller's
* iterator, and fold the result into @s, the retry bookkeeping shared
* across steps. @memcg is the iterator's current memcg, or NULL once
* it has wrapped around after a full pass over the tree.
*
* The function returns -EAGAIN to signal the caller to abort the walk
* after encountering the following conditions MAX_RECLAIM_RETRIES times:
* - No writeback-candidate memcgs were found in a memcg tree walk.
* - Shrinking a writeback-candidate memcg failed.
*
* Return: The number of compressed bytes written back (>= 0), or -EAGAIN
* once the retry budget is exhausted and the caller should abort the walk.
*/
static long zswap_shrink_one(struct mem_cgroup *memcg,
struct zswap_shrink_state *s)
{
long shrunk;
/*
* If the iterator has completed a full pass, update the shrink state
* and check whether we should keep going.
*/
if (!memcg) {
/*
* Continue shrinking without incrementing failures if we found
* candidate memcgs in the last tree walk.
*/
if (!s->attempts && ++s->failures == MAX_RECLAIM_RETRIES)
return -EAGAIN;
s->attempts = 0;
return 0;
}
shrunk = shrink_memcg(memcg, NR_ZSWAP_WB_BATCH);
/*
* There are no writeback-candidate pages in the memcg. This is not an
* issue as long as we can find another memcg with pages in zswap. Skip
* this without incrementing attempts and failures.
*/
if (shrunk == -ENOENT)
return 0;
s->attempts++;
if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
return -EAGAIN;
return shrunk;
}
static void shrink_worker(struct work_struct *w)
{
struct zswap_shrink_state s = {};
unsigned long thr;
/* Reclaim down to the accept threshold */
thr = zswap_accept_thr_pages();
while (zswap_total_pages() > thr) {
struct mem_cgroup *memcg;
long ret;
cond_resched();
memcg = zswap_iter_global();
ret = zswap_shrink_one(memcg, &s);
/* drop the extra reference taken by zswap_iter_global() */
mem_cgroup_put(memcg);
if (ret == -EAGAIN)
break;
}
}
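To make the retry bookkeeping above easier to reason about, here is a small userspace model of it. This is only a sketch under assumptions: shrink_step() is a hypothetical name, the shrink_memcg() result is passed in as a plain argument instead of being computed, a NULL memcg is modelled by has_memcg == false, and MAX_RECLAIM_RETRIES is set to 3 purely to keep the example short.

```c
#include <errno.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 3	/* assumed small value for this sketch only */

/* Mirrors the two counters of the proposed zswap_shrink_state. */
struct shrink_state {
	int attempts;	/* candidate memcgs tried during the current tree pass */
	int failures;	/* empty passes plus failed shrinks, across all passes */
};

/*
 * One step of the walk. has_memcg == false models the iterator wrapping
 * around after a full pass; shrunk models what shrink_memcg() returned.
 */
static long shrink_step(bool has_memcg, long shrunk, struct shrink_state *s)
{
	if (!has_memcg) {
		/* A pass that tried no candidate at all counts as a failure. */
		if (!s->attempts && ++s->failures == MAX_RECLAIM_RETRIES)
			return -EAGAIN;
		s->attempts = 0;
		return 0;
	}

	/* No candidate pages here: skip without touching either counter. */
	if (shrunk == -ENOENT)
		return 0;

	s->attempts++;
	if (shrunk <= 0 && ++s->failures == MAX_RECLAIM_RETRIES)
		return -EAGAIN;

	return shrunk;
}
```

In this model, a pass that made at least one attempt resets attempts without adding a failure, while consecutive empty passes use up the retry budget.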
^ permalink raw reply
* Re: [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Jani Nikula @ 2026-06-25 11:05 UTC (permalink / raw)
To: Steven Rostedt, linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>
On Thu, 25 Jun 2026, Steven Rostedt <rostedt@kernel.org> wrote:
> Remove trace_printk.h by creating a trace_controls.h for those places that
> need access to tracing prototypes like tracing_off() and for the places that
> need trace_printk() directly, to have it included directly.
>
> Changes since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/
>
> - Always include trace_controls.h in rcu.h (kernel test robot)
>
> There are other configs that may include tracing_off() in rcu.h besides
> the one that had the include of trace_controls.h. Just always include
> it in that header to be safe.
>
> Steven Rostedt (2):
> tracing: Move non-trace_printk prototypes into trace_controls.h
> tracing: Remove trace_printk.h from kernel.h
>
> ----
> arch/powerpc/kvm/book3s_xics.c | 1 +
> arch/powerpc/xmon/xmon.c | 1 +
> arch/s390/kernel/ipl.c | 1 +
> arch/s390/kernel/machine_kexec.c | 1 +
> drivers/gpu/drm/i915/gt/intel_gtt.h | 1 +
> drivers/gpu/drm/i915/i915_gem.h | 2 ++
For the i915 parts,
Acked-by: Jani Nikula <jani.nikula@intel.com>
for merging via whichever tree.
> drivers/hwtracing/stm/dummy_stm.c | 1 +
> drivers/infiniband/hw/hfi1/trace_dbg.h | 1 +
> drivers/tty/sysrq.c | 1 +
> drivers/usb/early/xhci-dbc.c | 1 +
> fs/ext4/inline.c | 1 +
> include/linux/ftrace.h | 2 ++
> include/linux/kernel.h | 1 -
> include/linux/sunrpc/debug.h | 1 +
> include/linux/trace_controls.h | 54 ++++++++++++++++++++++++++++++++
> include/linux/trace_printk.h | 56 ++--------------------------------
> kernel/debug/debug_core.c | 1 +
> kernel/panic.c | 1 +
> kernel/rcu/rcu.h | 1 +
> kernel/rcu/rcutorture.c | 1 +
> kernel/trace/ring_buffer_benchmark.c | 1 +
> kernel/trace/trace.h | 1 +
> kernel/trace/trace_benchmark.c | 1 +
> lib/sys_info.c | 1 +
> samples/fprobe/fprobe_example.c | 1 +
> samples/ftrace/ftrace-direct-too.c | 1 -
> samples/trace_printk/trace-printk.c | 1 +
> 27 files changed, 82 insertions(+), 55 deletions(-)
> create mode 100644 include/linux/trace_controls.h
--
Jani Nikula, Intel
^ permalink raw reply
* Re: [PATCH v16 04/14] lib: kstrtox: add initial value to _parse_integer_limit()
From: Jonathan Cameron @ 2026-06-25 10:58 UTC (permalink / raw)
To: Rodrigo Alencar
Cc: rodrigo.alencar, linux-kernel, linux-iio, devicetree, linux-doc,
linux, David Lechner, Andy Shevchenko, Lars-Peter Clausen,
Michael Hennerich, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Jonathan Corbet, Andrew Morton, Petr Mladek, Steven Rostedt,
Andy Shevchenko, Rasmus Villemoes, Sergey Senozhatsky, Shuah Khan
In-Reply-To: <ssrckqgv3rqpfgxwpx4ca3m5m2mp3frxs6sd673chvsorhsjiq@63t3eosivg3a>
On Thu, 25 Jun 2026 08:30:07 +0100
Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> On 24/06/26 15:54, Jonathan Cameron wrote:
> > On Sun, 14 Jun 2026 21:00:44 +0100
> > Jonathan Cameron <jic23@kernel.org> wrote:
> >
> > > On Thu, 4 Jun 2026 11:09:33 +0100
> > > Rodrigo Alencar <455.rodrigo.alencar@gmail.com> wrote:
> > >
> > > > On 26/06/04 10:58AM, Rodrigo Alencar via B4 Relay wrote:
> > > > > From: Rodrigo Alencar <rodrigo.alencar@analog.com>
> > > > >
> > > > > Add init parameter to _parse_integer_limit() that defines an initial
> > > > > value for the accumulated result when parsing an 64-bit integer. The
> > > > > new function prototype is adjusted so that the _parse_integer() macros
> > > > > stay consistent allowing for one more argument, which defaults to 0.
> > > >
> > > > ...
> > > >
> > > > > noinline
> > > > > unsigned int _parse_integer_limit(const char *s, unsigned int base, unsigned long long *p,
> > > > > - size_t max_chars)
> > > > > + size_t max_chars, unsigned long long init)
> > > > > {
> > > > > unsigned long long res;
> > > > > unsigned int rv;
> > > > >
> > > > > - res = 0;
> > > > > + res = init;
> > > >
> > > > This might generate conflict, as the code around have changed in linux-next.
> > > > It is an easy fix though.
> > > >
> > > Thanks for the heads up. Hopefully that will all fall out when I rebase testing
> > > on rc1 once that is out.
> > I've done a mid merge cycle rebase as the char-misc branches have merged.
> > So this should be resolved on my testing branch now.
>
> https://lore.kernel.org/oe-kbuild-all/202606250230.etPGuolf-lkp@intel.com/
>
> Apparently, the documentation header now includes parameter descriptions.
> The new one is missing.
I'm snowed under for the next few days so if you have time to spin me a fixup patch
that I can just apply that would be great.
If not I'll get to it next week probably.
Jonathan
>
^ permalink raw reply
* Re: [PATCH v8 24/46] KVM: guest_memfd: Make in-place conversion the default
From: Yan Zhao @ 2026-06-25 10:57 UTC (permalink / raw)
To: Sean Christopherson, Ackerley Tng, aik, andrew.jones, binbin.wu,
brauner, chao.p.peng, david, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Barry Song, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt, Kiryl Shutsemau,
Baoquan He, Jason Gunthorpe, Vlastimil Babka, kvm, linux-kernel,
linux-trace-kernel, linux-doc, linux-kselftest, linux-mm,
linux-coco
In-Reply-To: <ajyJhZcgfYFtGfS2@yzhao56-desk.sh.intel.com>
On Thu, Jun 25, 2026 at 09:51:01AM +0800, Yan Zhao wrote:
> On Wed, Jun 24, 2026 at 05:41:58PM -0700, Sean Christopherson wrote:
> > On Wed, Jun 24, 2026, Ackerley Tng wrote:
> > > Yan Zhao <yan.y.zhao@intel.com> writes:
> > > > With gmem_in_place_conversion=true, userspace can create guest_memfd without the
> > > > MMAP flag. In such cases, shared memory is allocated from different backends.
> > > > This means this module parameter only enables per-gmem memory attribute and does
> > > > not guarantee that gmem in-place conversion will actually occur.
> >
> > KVM module params are pretty much always about what KVM supports, not what is
> > guaranteed to happen.
> >
> > - enable_mmio_caching doesn't guarantee there will actually be MMIO SPTEs,
> > because maybe the guest never accesses emulated MMIO.
> > - enable_pmu doesn't guarantee VMs will get a PMU, because userspace may elect
> > not to advertise one.
> > - and so on and so forth...
> >
> > Yes, there's a small mental jump to get from "KVM supports in-place conversion"
> > to "I need to set memory attributes on the guest_memfd instance, not the VM",
> > but I don't see that as a big hurdle, certainly not in the long term. And once
> > the VMM code is written, I really do think most people are going to care about
> > whether or not KVM supports in-place conversion, not where PRIVATE is tracked.
> Sorry, I just saw this mail after posting my reply in [1].
>
> I'm OK with gmem_in_place_conversion=true just meaning that KVM supports in-place
> conversion, while we can still create VMs with shared memory not from gmem.
Or what about "allow_gmem_in_place_conversion" ?
> Though it still feels a bit odd to require TDX huge pages to depend on
> gmem_in_place_conversion=true when shared memory is not currently allocated from
> gmem, it should become more natural over time once gmem supports in-place
> conversions for huge pages.
>
> [1] https://lore.kernel.org/all/ajyCn0PnFtQK+Nka@yzhao56-desk.sh.intel.com
>
>
> > > > To avoid confusion, could we rename this module parameter to something more
> > > > accurate, such as gmem_memory_attribute?
> > >
> > > I asked Sean about this after getting some fixes off list. Sean said
> > > gmem_in_place_conversion is named for a host admin to use, and something
> > > like gmem_memory_attributes is too much implementation details for the
> > > admin.
> > >
> > > Sean, would you reconsider since Yan also asked? If the admin compiled
> > > the kernel knowing what CONFIG_KVM_VM_MEMORY_ATTRIBUTES means, then the
> > > admin would also be able to use a param like gmem_memory_attributes?
> >
> > No, because it's not all memory attributes, it's very specifically the PRIVATE
> > attribute that will get moved to guest_memfd. I don't want to pick a name that
> > will become stale and confusing when RWX attributes come along. The RWX bits
> > will be per-VM, while PRIVATE will be per-guest_memfd.
^ permalink raw reply
* Re: [PATCH net-next] Documentation: networking: Add a test plan for ethtool pause validation
From: Maxime Chevallier @ 2026-06-25 10:46 UTC (permalink / raw)
To: Andrew Lunn
Cc: Jakub Kicinski, davem, Eric Dumazet, Paolo Abeni, Simon Horman,
Russell King, Heiner Kallweit, Jonathan Corbet, Shuah Khan,
Oleksij Rempel, Vladimir Oltean, Florian Fainelli,
thomas.petazzoni, netdev, linux-kernel, linux-doc
In-Reply-To: <b7de216a-fd1a-42a0-8711-d822a1ad9319@lunn.ch>
Hi Andrew,
On 5/29/26 14:59, Andrew Lunn wrote:
(This discussion was a while ago, but this bit of context should be enough)
> But we also need to consider that for some APIs, we have decided that
> a configuration can be set now, which does not actually apply in our
> current conditions, but it will be stored away for when conditions
> change and it is applicable. The half duplex case could fit that. When
> the link is currently half duplex, you can configure pause, but you
> don't expect it to actually change the current behaviour. It only
> kicks in when the link renegotiates to full duplex sometime in the
> future. We have to also consider this the other way around. The link
> is full duplex and pause is configured by the user. Something happens
> with the LP and the link renegotiates to half duplex. The local end
> should not throw away the configuration, it simply cannot apply it
> given the current situation.
I'm writing the test description for HD with better formatting, so the
HD test wouldn't be about "are we using pause stuff while in HD" as it
doesn't make sense, but rather "do we correctly store the pause settings
aside for later".
I'm realising that we don't really have an API to report the *true* in-use pause
settings. Taking HD as an example:
# ethtool -s eth2 duplex half
[588209.379363] mvpp2 f4000000.ethernet eth2: Link is Up - 100Mbps/Half - flow control off
# ethtool eth2
[...]
Supported pause frame use: Symmetric Receive-only
Advertised pause frame use: Symmetric Receive-only
Link partner advertised pause frame use: Symmetric Receive-only
# ethtool -a eth2
Autonegotiate: on
RX: off
TX: off
RX negotiated: on
TX negotiated: on
Sure, pause and HD don't make sense together. What I find somewhat confusing,
though, is that the only place we have information about the *actual* pause
settings is the "Link is Up" log in dmesg.
Maybe the problem in the above situation is that whoever advertises
half-duplex only modes should also not advertise pause ?
Still, I'm wondering if we should even care about all that actually, HD and
Pause are incompatible, and that's it. If you have any thought on this, let
me know.
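
To make the distinction between "negotiated" and "actually in use" concrete, here is a minimal userspace sketch. It assumes the usual Pause/Asym_Pause resolution rules and that pause is only applied on a full-duplex link; the types and function names (struct pause_adv, struct pause_state, resolve_negotiated(), effective_pause()) are made up for illustration and are not an existing kernel or ethtool API.

```c
#include <stdbool.h>

/* Hypothetical model of the advertised Pause / Asym_Pause bits. */
struct pause_adv {
	bool pause;
	bool asym_pause;
};

/* Hypothetical model of a TX/RX pause result. */
struct pause_state {
	bool tx;
	bool rx;
};

/* Resolve pause from the local and link partner (LP) advertisements. */
static struct pause_state resolve_negotiated(struct pause_adv local,
					     struct pause_adv lp)
{
	struct pause_state res = { false, false };

	if (local.pause && lp.pause) {
		/* Both ends advertise symmetric pause. */
		res.tx = true;
		res.rx = true;
	} else if (local.asym_pause && lp.asym_pause) {
		/* Asymmetric: direction depends on who advertised Pause. */
		res.tx = lp.pause;
		res.rx = local.pause;
	}

	return res;
}

/* The negotiated result is kept, but only applied on a full-duplex link. */
static struct pause_state effective_pause(struct pause_state negotiated,
					  bool full_duplex)
{
	struct pause_state off = { false, false };

	return full_duplex ? negotiated : off;
}
```

With both ends advertising "Symmetric Receive-only" as in the output above, the negotiated result is RX/TX on, while the effective result on a half-duplex link is off, which matches the "flow control off" log line.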
Maxime
^ permalink raw reply
* [PATCH v4 1/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>
From: Steven Rostedt <rostedt@goodmis.org>
Remove the prototypes of the code that is not associated with
trace_printk() from trace_printk.h.
These control functions as well as ftrace_dump() and trace_dump_stack()
are used in cases where things go wrong. The main use case is to do a
trace_dump_stack(); tracing_off(); ftrace_dump(); in a place that detected
that something went wrong, whereas, trace_printk() is added to normal code
during debugging and removed before committing upstream. The dump code is
fine to keep in production.
Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v3: https://patch.msgid.link/20260624081948.147764194@kernel.org
- Move include out of #if statement in rcu.h
kernel test robot found other configs that could require the
control functions in rcu.h. Just always include it in that file.
arch/powerpc/xmon/xmon.c | 1 +
arch/s390/kernel/ipl.c | 1 +
arch/s390/kernel/machine_kexec.c | 1 +
drivers/gpu/drm/i915/i915_gem.h | 1 +
drivers/tty/sysrq.c | 1 +
include/linux/trace_controls.h | 54 ++++++++++++++++++++++++++++++++
include/linux/trace_printk.h | 51 ------------------------------
kernel/debug/debug_core.c | 1 +
kernel/panic.c | 1 +
kernel/rcu/rcu.h | 1 +
kernel/rcu/rcutorture.c | 1 +
kernel/trace/trace.h | 1 +
kernel/trace/trace_benchmark.c | 1 +
lib/sys_info.c | 1 +
14 files changed, 66 insertions(+), 51 deletions(-)
create mode 100644 include/linux/trace_controls.h
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index cb3a3244ae6f..2135f319e0dd 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -27,6 +27,7 @@
#include <linux/highmem.h>
#include <linux/security.h>
#include <linux/debugfs.h>
+#include <linux/trace_controls.h>
#include <asm/ptrace.h>
#include <asm/smp.h>
diff --git a/arch/s390/kernel/ipl.c b/arch/s390/kernel/ipl.c
index 3c346b02ceb9..baac66cc4de4 100644
--- a/arch/s390/kernel/ipl.c
+++ b/arch/s390/kernel/ipl.c
@@ -22,6 +22,7 @@
#include <linux/debug_locks.h>
#include <linux/vmalloc.h>
#include <linux/secure_boot.h>
+#include <linux/trace_controls.h>
#include <asm/asm-extable.h>
#include <asm/machine.h>
#include <asm/diag.h>
diff --git a/arch/s390/kernel/machine_kexec.c b/arch/s390/kernel/machine_kexec.c
index baeb3dcfc1c8..33f9a89eb3ad 100644
--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -12,6 +12,7 @@
#include <linux/delay.h>
#include <linux/reboot.h>
#include <linux/ftrace.h>
+#include <linux/trace_controls.h>
#include <linux/debug_locks.h>
#include <linux/cpufeature.h>
#include <asm/guarded_storage.h>
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 20b3cb29cfff..1da8fb61c09e 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -116,6 +116,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
#endif
#if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
+#include <linux/trace_controls.h>
#define GEM_TRACE(...) trace_printk(__VA_ARGS__)
#define GEM_TRACE_ERR(...) do { \
pr_err(__VA_ARGS__); \
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index c2e4b31b699a..d3f72dc430b8 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -324,6 +324,7 @@ static const struct sysrq_key_op sysrq_showstate_blocked_op = {
};
#ifdef CONFIG_TRACING
+#include <linux/trace_controls.h>
#include <linux/ftrace.h>
static void sysrq_ftrace_dump(u8 key)
diff --git a/include/linux/trace_controls.h b/include/linux/trace_controls.h
new file mode 100644
index 000000000000..995b97e963b4
--- /dev/null
+++ b/include/linux/trace_controls.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TRACE_CONTROLS_H
+#define _LINUX_TRACE_CONTROLS_H
+
+
+/*
+ * General tracing related utility functions - trace_printk(),
+ * tracing_on/tracing_off and tracing_start()/tracing_stop
+ *
+ * Use tracing_on/tracing_off when you want to quickly turn on or off
+ * tracing. It simply enables or disables the recording of the trace events.
+ * This also corresponds to the user space /sys/kernel/tracing/tracing_on
+ * file, which gives a means for the kernel and userspace to interact.
+ * Place a tracing_off() in the kernel where you want tracing to end.
+ * From user space, examine the trace, and then echo 1 > tracing_on
+ * to continue tracing.
+ *
+ * tracing_stop/tracing_start has slightly more overhead. It is used
+ * by things like suspend to ram where disabling the recording of the
+ * trace is not enough, but tracing must actually stop because things
+ * like calling smp_processor_id() may crash the system.
+ *
+ * Most likely, you want to use tracing_on/tracing_off.
+ */
+enum ftrace_dump_mode {
+ DUMP_NONE,
+ DUMP_ALL,
+ DUMP_ORIG,
+ DUMP_PARAM,
+};
+
+#ifdef CONFIG_TRACING
+void tracing_on(void);
+void tracing_off(void);
+int tracing_is_on(void);
+void tracing_snapshot(void);
+void tracing_snapshot_alloc(void);
+void tracing_start(void);
+void tracing_stop(void);
+void trace_dump_stack(int skip);
+void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
+#else
+static inline void tracing_start(void) { }
+static inline void tracing_stop(void) { }
+static inline void tracing_on(void) { }
+static inline void tracing_off(void) { }
+static inline int tracing_is_on(void) { return 0; }
+static inline void tracing_snapshot(void) { }
+static inline void tracing_snapshot_alloc(void) { }
+static inline void trace_dump_stack(int skip) { }
+static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
+#endif
+
+#endif /* _LINUX_TRACE_CONTROLS_H */
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index 3d54f440dccf..a488ea9e9f85 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -7,43 +7,7 @@
#include <linux/stddef.h>
#include <linux/stringify.h>
-/*
- * General tracing related utility functions - trace_printk(),
- * tracing_on/tracing_off and tracing_start()/tracing_stop
- *
- * Use tracing_on/tracing_off when you want to quickly turn on or off
- * tracing. It simply enables or disables the recording of the trace events.
- * This also corresponds to the user space /sys/kernel/tracing/tracing_on
- * file, which gives a means for the kernel and userspace to interact.
- * Place a tracing_off() in the kernel where you want tracing to end.
- * From user space, examine the trace, and then echo 1 > tracing_on
- * to continue tracing.
- *
- * tracing_stop/tracing_start has slightly more overhead. It is used
- * by things like suspend to ram where disabling the recording of the
- * trace is not enough, but tracing must actually stop because things
- * like calling smp_processor_id() may crash the system.
- *
- * Most likely, you want to use tracing_on/tracing_off.
- */
-
-enum ftrace_dump_mode {
- DUMP_NONE,
- DUMP_ALL,
- DUMP_ORIG,
- DUMP_PARAM,
-};
-
#ifdef CONFIG_TRACING
-void tracing_on(void);
-void tracing_off(void);
-int tracing_is_on(void);
-void tracing_snapshot(void);
-void tracing_snapshot_alloc(void);
-
-extern void tracing_start(void);
-extern void tracing_stop(void);
-
static inline __printf(1, 2)
void ____trace_printk_check_format(const char *fmt, ...)
{
@@ -149,8 +113,6 @@ int __trace_printk(unsigned long ip, const char *fmt, ...);
extern int __trace_bputs(unsigned long ip, const char *str);
extern int __trace_puts(unsigned long ip, const char *str);
-extern void trace_dump_stack(int skip);
-
/*
* The double __builtin_constant_p is because gcc will give us an error
* if we try to allocate the static variable to fmt if it is not a
@@ -173,19 +135,7 @@ __ftrace_vbprintk(unsigned long ip, const char *fmt, va_list ap);
extern __printf(2, 0) int
__ftrace_vprintk(unsigned long ip, const char *fmt, va_list ap);
-
-extern void ftrace_dump(enum ftrace_dump_mode oops_dump_mode);
#else
-static inline void tracing_start(void) { }
-static inline void tracing_stop(void) { }
-static inline void trace_dump_stack(int skip) { }
-
-static inline void tracing_on(void) { }
-static inline void tracing_off(void) { }
-static inline int tracing_is_on(void) { return 0; }
-static inline void tracing_snapshot(void) { }
-static inline void tracing_snapshot_alloc(void) { }
-
static inline __printf(1, 2)
int trace_printk(const char *fmt, ...)
{
@@ -196,7 +146,6 @@ ftrace_vprintk(const char *fmt, va_list ap)
{
return 0;
}
-static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
#endif /* CONFIG_TRACING */
#endif
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index b276504c1c6b..f9c83a470c98 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -27,6 +27,7 @@
#define pr_fmt(fmt) "KGDB: " fmt
+#include <linux/trace_controls.h>
#include <linux/pid_namespace.h>
#include <linux/clocksource.h>
#include <linux/serial_core.h>
diff --git a/kernel/panic.c b/kernel/panic.c
index 213725b612aa..1415e910371d 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -9,6 +9,7 @@
* This function is used through-out the kernel (including mm and fs)
* to indicate a major problem.
*/
+#include <linux/trace_controls.h>
#include <linux/debug_locks.h>
#include <linux/sched/debug.h>
#include <linux/interrupt.h>
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index fa6d30ce73d1..735a80df0b30 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -12,6 +12,7 @@
#include <linux/slab.h>
#include <trace/events/rcu.h>
+#include <linux/trace_controls.h>
/*
* Grace-period counter management.
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 882a158ada7b..76bf0184b267 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -39,6 +39,7 @@
#include <linux/srcu.h>
#include <linux/slab.h>
#include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
#include <asm/byteorder.h>
#include <linux/torture.h>
#include <linux/vmalloc.h>
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 80fe152af1dd..2537c33ddd49 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -22,6 +22,7 @@
#include <linux/ctype.h>
#include <linux/once_lite.h>
#include <linux/ftrace_regs.h>
+#include <linux/trace_controls.h>
#include <linux/llist.h>
#include "pid_list.h"
diff --git a/kernel/trace/trace_benchmark.c b/kernel/trace/trace_benchmark.c
index e19c32f2a938..69cc39008c36 100644
--- a/kernel/trace/trace_benchmark.c
+++ b/kernel/trace/trace_benchmark.c
@@ -3,6 +3,7 @@
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/trace_clock.h>
+#include <linux/trace_controls.h>
#define CREATE_TRACE_POINTS
#include "trace_benchmark.h"
diff --git a/lib/sys_info.c b/lib/sys_info.c
index f32a06ec9ed4..e3c9ca05601b 100644
--- a/lib/sys_info.c
+++ b/lib/sys_info.c
@@ -8,6 +8,7 @@
#include <linux/ftrace.h>
#include <linux/nmi.h>
#include <linux/sched/debug.h>
+#include <linux/trace_controls.h>
#include <linux/string.h>
#include <linux/sysctl.h>
--
2.53.0
* [PATCH v4 2/2] tracing: Remove trace_printk.h from kernel.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
In-Reply-To: <20260625104007.041432666@kernel.org>
From: Steven Rostedt <rostedt@goodmis.org>
There have been complaints that having trace_printk.h in kernel.h increases
build time, because any change to it triggers a rebuild of everything that
includes kernel.h. There is also an effort to clean up kernel.h so that it
does not include unneeded header files. Move trace_printk.h out of kernel.h
and include it in the headers and C files that use it.
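Most of the call sites touched here add the include inside the debug-only
branch that defines the trace macro, so non-debug builds never see the
header. A minimal userspace sketch of that pattern follows. It is not kernel
code: MY_DRIVER_TRACE, my_trace() and my_driver_step() are hypothetical
names, and <stdio.h> with printf() stands in for <linux/trace_printk.h> with
trace_printk().

```c
/* Userspace sketch; MY_DRIVER_TRACE, my_trace() and my_driver_step()
 * are hypothetical names made up for this example. */
#ifdef MY_DRIVER_TRACE
/* Only the debug build pulls in the extra header. In the kernel this
 * would be <linux/trace_printk.h> and trace_printk(). */
#include <stdio.h>
#define my_trace(...) printf(__VA_ARGS__)
#else
/* Non-debug build: an empty stub, so no tracing header is needed. */
static inline void my_trace(const char *fmt, ...) { (void)fmt; }
#endif

/* Hypothetical caller: it compiles the same way in both builds. */
static int my_driver_step(int x)
{
	my_trace("step %d\n", x);
	return x + 1;
}
```

Building with -DMY_DRIVER_TRACE selects the printing variant. Without it the
stub is used and the call compiles to nothing.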
Link: https://lore.kernel.org/all/CAHk-=wikCBeVFjVXiY4o-oepdbjAoir5+TcAgtL12c4u1TpZLQ@mail.gmail.com/
Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
arch/powerpc/kvm/book3s_xics.c | 1 +
drivers/gpu/drm/i915/gt/intel_gtt.h | 1 +
drivers/gpu/drm/i915/i915_gem.h | 1 +
drivers/hwtracing/stm/dummy_stm.c | 1 +
drivers/infiniband/hw/hfi1/trace_dbg.h | 1 +
drivers/usb/early/xhci-dbc.c | 1 +
fs/ext4/inline.c | 1 +
include/linux/ftrace.h | 2 ++
include/linux/kernel.h | 1 -
include/linux/sunrpc/debug.h | 1 +
include/linux/trace_printk.h | 5 +++--
kernel/trace/ring_buffer_benchmark.c | 1 +
samples/fprobe/fprobe_example.c | 1 +
samples/ftrace/ftrace-direct-too.c | 1 -
samples/trace_printk/trace-printk.c | 1 +
15 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_xics.c b/arch/powerpc/kvm/book3s_xics.c
index 74a44fa702b0..ef5eb596a56e 100644
--- a/arch/powerpc/kvm/book3s_xics.c
+++ b/arch/powerpc/kvm/book3s_xics.c
@@ -26,6 +26,7 @@
#if 1
#define XICS_DBG(fmt...) do { } while (0)
#else
+#include <linux/trace_printk.h>
#define XICS_DBG(fmt...) trace_printk(fmt)
#endif
diff --git a/drivers/gpu/drm/i915/gt/intel_gtt.h b/drivers/gpu/drm/i915/gt/intel_gtt.h
index b54ee4f25af1..f6f223090760 100644
--- a/drivers/gpu/drm/i915/gt/intel_gtt.h
+++ b/drivers/gpu/drm/i915/gt/intel_gtt.h
@@ -35,6 +35,7 @@
#define I915_GFP_ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN)
#if IS_ENABLED(CONFIG_DRM_I915_TRACE_GTT)
+#include <linux/trace_printk.h>
#define GTT_TRACE(...) trace_printk(__VA_ARGS__)
#else
#define GTT_TRACE(...)
diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 1da8fb61c09e..f490052e8964 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -117,6 +117,7 @@ int i915_gem_open(struct drm_i915_private *i915, struct drm_file *file);
#if IS_ENABLED(CONFIG_DRM_I915_TRACE_GEM)
#include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
#define GEM_TRACE(...) trace_printk(__VA_ARGS__)
#define GEM_TRACE_ERR(...) do { \
pr_err(__VA_ARGS__); \
diff --git a/drivers/hwtracing/stm/dummy_stm.c b/drivers/hwtracing/stm/dummy_stm.c
index 38528ffdc0b3..7c5e48ebfb9f 100644
--- a/drivers/hwtracing/stm/dummy_stm.c
+++ b/drivers/hwtracing/stm/dummy_stm.c
@@ -8,6 +8,7 @@
*/
#undef DEBUG
+#include <linux/trace_printk.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/slab.h>
diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
index 58304b91380f..30df5e246586 100644
--- a/drivers/infiniband/hw/hfi1/trace_dbg.h
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -103,6 +103,7 @@ __hfi1_trace_def(IOCTL);
*/
#ifdef HFI1_EARLY_DBG
+#include <linux/trace_printk.h>
#define hfi1_dbg_early(fmt, ...) \
trace_printk(fmt, ##__VA_ARGS__)
#else
diff --git a/drivers/usb/early/xhci-dbc.c b/drivers/usb/early/xhci-dbc.c
index 41118bba9197..955c73bd601f 100644
--- a/drivers/usb/early/xhci-dbc.c
+++ b/drivers/usb/early/xhci-dbc.c
@@ -30,6 +30,7 @@ static struct xdbc_state xdbc;
static bool early_console_keep;
#ifdef XDBC_TRACE
+#include <linux/trace_printk.h>
#define xdbc_trace trace_printk
#else
static inline void xdbc_trace(const char *fmt, ...) { }
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index 8045e4ff270c..0eff4a0c6a6c 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -934,6 +934,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping,
}
#ifdef INLINE_DIR_DEBUG
+#include <linux/trace_printk.h>
void ext4_show_inline_dir(struct inode *dir, struct buffer_head *bh,
void *inline_start, int inline_size)
{
diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
index 02bc5027523a..b5336a81e619 100644
--- a/include/linux/ftrace.h
+++ b/include/linux/ftrace.h
@@ -8,6 +8,8 @@
#define _LINUX_FTRACE_H
#include <linux/trace_recursion.h>
+#include <linux/trace_controls.h>
+#include <linux/trace_printk.h>
#include <linux/trace_clock.h>
#include <linux/jump_label.h>
#include <linux/kallsyms.h>
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index e5570a16cbb1..e87a40fbd152 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -31,7 +31,6 @@
#include <linux/build_bug.h>
#include <linux/sprintf.h>
#include <linux/static_call_types.h>
-#include <linux/trace_printk.h>
#include <linux/util_macros.h>
#include <linux/wordpart.h>
diff --git a/include/linux/sunrpc/debug.h b/include/linux/sunrpc/debug.h
index ab61bed2f7af..7524f5d82fba 100644
--- a/include/linux/sunrpc/debug.h
+++ b/include/linux/sunrpc/debug.h
@@ -29,6 +29,7 @@ extern unsigned int nlm_debug;
# define ifdebug(fac) if (unlikely(rpc_debug & RPCDBG_##fac))
# if IS_ENABLED(CONFIG_SUNRPC_DEBUG_TRACE)
+# include <linux/trace_printk.h>
# define __sunrpc_printk(fmt, ...) trace_printk(fmt, ##__VA_ARGS__)
# else
# define __sunrpc_printk(fmt, ...) printk(KERN_DEFAULT fmt, ##__VA_ARGS__)
diff --git a/include/linux/trace_printk.h b/include/linux/trace_printk.h
index a488ea9e9f85..74ce4f8995c4 100644
--- a/include/linux/trace_printk.h
+++ b/include/linux/trace_printk.h
@@ -1,11 +1,12 @@
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef _LINUX_TRACE_PRINTK_H
#define _LINUX_TRACE_PRINTK_H
+#if !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO)
-#include <linux/compiler_attributes.h>
#include <linux/instruction_pointer.h>
#include <linux/stddef.h>
#include <linux/stringify.h>
+#include <linux/stdarg.h>
#ifdef CONFIG_TRACING
static inline __printf(1, 2)
@@ -147,5 +148,5 @@ ftrace_vprintk(const char *fmt, va_list ap)
return 0;
}
#endif /* CONFIG_TRACING */
-
+#endif /* !defined(__ASSEMBLY__) && !defined(__GENKSYMS__) && !defined(BUILD_VDSO) */
#endif
diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 593e3b59e42e..2bb25caebb75 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -5,6 +5,7 @@
* Copyright (C) 2009 Steven Rostedt <srostedt@redhat.com>
*/
#include <linux/ring_buffer.h>
+#include <linux/trace_printk.h>
#include <linux/completion.h>
#include <linux/kthread.h>
#include <uapi/linux/sched/types.h>
diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
index bfe98ce826f3..de81b9b4ca7d 100644
--- a/samples/fprobe/fprobe_example.c
+++ b/samples/fprobe/fprobe_example.c
@@ -12,6 +12,7 @@
#define pr_fmt(fmt) "%s: " fmt, __func__
+#include <linux/trace_printk.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/fprobe.h>
diff --git a/samples/ftrace/ftrace-direct-too.c b/samples/ftrace/ftrace-direct-too.c
index bf2411aa6fd7..159190f4103f 100644
--- a/samples/ftrace/ftrace-direct-too.c
+++ b/samples/ftrace/ftrace-direct-too.c
@@ -1,6 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
#include <linux/module.h>
-
#include <linux/mm.h> /* for handle_mm_fault() */
#include <linux/ftrace.h>
#if !defined(CONFIG_ARM64) && !defined(CONFIG_PPC32)
diff --git a/samples/trace_printk/trace-printk.c b/samples/trace_printk/trace-printk.c
index cfc159580263..ff37aeb8523e 100644
--- a/samples/trace_printk/trace-printk.c
+++ b/samples/trace_printk/trace-printk.c
@@ -1,4 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
+#include <linux/trace_printk.h>
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/irq_work.h>
--
2.53.0
* [PATCH v4 0/2] tracing: Move non-trace_printk prototypes into trace_controls.h
From: Steven Rostedt @ 2026-06-25 10:40 UTC (permalink / raw)
To: linux-kernel, linux-trace-kernel
Cc: Masami Hiramatsu, Mark Rutland, Mathieu Desnoyers, Andrew Morton,
Linus Torvalds, Sebastian Andrzej Siewior, John Ogness,
Thomas Gleixner, Peter Zijlstra, Julia Lawall, Yury Norov,
linux-doc, linux-kbuild, linuxppc-dev, dri-devel, linux-stm32,
linux-arm-kernel, linux-rdma, linux-usb, linux-ext4, linux-nfs,
kvm, intel-gfx
Remove trace_printk.h from kernel.h by creating a trace_controls.h header for
the places that need tracing prototypes such as tracing_off(), and by including
trace_printk.h directly in the places that call trace_printk().
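As a rough illustration of a caller that only needs the control functions,
here is a userspace sketch. It is not kernel code: the three stubs are copied
from the CONFIG_TRACING=n half of the new header so that the file compiles on
its own, and check_state() is a hypothetical caller invented for this example.
In the kernel, the stubs would be replaced by #include <linux/trace_controls.h>.

```c
/* Userspace sketch, not kernel code. */

/* Stand-in for <linux/trace_controls.h> with CONFIG_TRACING unset:
 * the control functions collapse to empty inline stubs. */
static inline void tracing_on(void) { }
static inline void tracing_off(void) { }
static inline int tracing_is_on(void) { return 0; }

/* Hypothetical caller: stop recording as soon as a bad state is seen,
 * so the trace buffer still holds the events leading up to it. */
static int check_state(int value)
{
	if (value < 0) {
		tracing_off();
		return -1;
	}
	return 0;
}
```

With CONFIG_TRACING enabled, the same caller would link against the real
tracing_off(), and user space could resume recording with
echo 1 > /sys/kernel/tracing/tracing_on after examining the trace.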
Changes since v3: https://lore.kernel.org/all/20260624081806.120105649@kernel.org/
- Always include trace_controls.h in rcu.h (kernel test robot)
Configs other than the one that guarded the include of trace_controls.h
may also call tracing_off() in rcu.h. Just always include it in that
header to be safe.
Steven Rostedt (2):
tracing: Move non-trace_printk prototypes into trace_controls.h
tracing: Remove trace_printk.h from kernel.h
----
arch/powerpc/kvm/book3s_xics.c | 1 +
arch/powerpc/xmon/xmon.c | 1 +
arch/s390/kernel/ipl.c | 1 +
arch/s390/kernel/machine_kexec.c | 1 +
drivers/gpu/drm/i915/gt/intel_gtt.h | 1 +
drivers/gpu/drm/i915/i915_gem.h | 2 ++
drivers/hwtracing/stm/dummy_stm.c | 1 +
drivers/infiniband/hw/hfi1/trace_dbg.h | 1 +
drivers/tty/sysrq.c | 1 +
drivers/usb/early/xhci-dbc.c | 1 +
fs/ext4/inline.c | 1 +
include/linux/ftrace.h | 2 ++
include/linux/kernel.h | 1 -
include/linux/sunrpc/debug.h | 1 +
include/linux/trace_controls.h | 54 ++++++++++++++++++++++++++++++++
include/linux/trace_printk.h | 56 ++--------------------------------
kernel/debug/debug_core.c | 1 +
kernel/panic.c | 1 +
kernel/rcu/rcu.h | 1 +
kernel/rcu/rcutorture.c | 1 +
kernel/trace/ring_buffer_benchmark.c | 1 +
kernel/trace/trace.h | 1 +
kernel/trace/trace_benchmark.c | 1 +
lib/sys_info.c | 1 +
samples/fprobe/fprobe_example.c | 1 +
samples/ftrace/ftrace-direct-too.c | 1 -
samples/trace_printk/trace-printk.c | 1 +
27 files changed, 82 insertions(+), 55 deletions(-)
create mode 100644 include/linux/trace_controls.h