[v3 PATCH 0/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status

* [v3 PATCH 0/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
@ 2026-01-15 20:54 Aaron Tomlin
  2026-01-15 20:54 ` [v3 PATCH 1/1] " Aaron Tomlin
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Tomlin @ 2026-01-15 20:54 UTC
  To: oleg, akpm, gregkh, david, brauner, mingo
  Cc: neelx, sean, linux-kernel, linux-fsdevel

Hi Oleg, David, Greg, Andrew,

This patch introduces a mechanism to expose the mm_cpumask of a process via
the /proc/[pid]/status interface.

In high-performance and large-scale NUMA environments, diagnosing latency
spikes attributed to Inter-Processor Interrupts (IPIs) can be particularly
challenging. While cpus_allowed describes where a thread may execute, it
does not describe the "memory footprint" - specifically, the set of CPUs
that may hold stale Translation Lookaside Buffer (TLB) entries for the
process.

It is this footprint (mm_cpumask) that dictates the target destination for
TLB flush IPIs. Discrepancies between a process's scheduling affinity and
its memory footprint are a common source of system noise and performance
degradation. By exposing this mask, we provide userspace with the
visibility required to debug these "invisible" sources of latency.
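
To illustrate the kind of userspace check this visibility would enable,
here is a minimal sketch in C. It parses the comma-separated hex format
already used for Cpus_allowed (the kernel's "%*pb" bitmap format) and counts
the CPUs present in one mask but absent from another. The names
parse_cpumask_hex(), count_outside(), struct cpu_bits and the MAX_CPUS limit
are hypothetical and exist only for this example; it also assumes that the
proposed Cpus_active_mm line uses the same format as Cpus_allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_CPUS 1024                 /* assumption: upper bound for this sketch */
#define MASK_WORDS (MAX_CPUS / 32)

/* w[0] holds CPUs 0-31, w[1] holds CPUs 32-63, and so on. */
struct cpu_bits {
	uint32_t w[MASK_WORDS];
};

/*
 * Parse a mask such as "00000003,ffffffff". The text is read from the
 * rightmost hex digit (CPUs 0-3) towards the left; commas are skipped.
 * Returns false on an empty string, a non-hex character, or a set bit
 * at or beyond MAX_CPUS.
 */
static bool parse_cpumask_hex(const char *s, struct cpu_bits *out)
{
	size_t len;
	unsigned int bit = 0;

	assert(s && out);
	memset(out, 0, sizeof(*out));
	len = strlen(s);

	for (size_t i = len; i-- > 0; ) {
		char c = s[i];
		unsigned int v;

		if (c == ',')
			continue;
		if (c >= '0' && c <= '9')
			v = (unsigned int)(c - '0');
		else if (c >= 'a' && c <= 'f')
			v = (unsigned int)(c - 'a') + 10;
		else if (c >= 'A' && c <= 'F')
			v = (unsigned int)(c - 'A') + 10;
		else
			return false;

		if (bit >= MAX_CPUS) {
			if (v)
				return false;
			continue;
		}
		out->w[bit / 32] |= (uint32_t)v << (bit % 32);
		bit += 4;
	}
	return len > 0;
}

/* Count CPUs that are set in 'active' but not in 'allowed'. */
static unsigned int count_outside(const struct cpu_bits *active,
				  const struct cpu_bits *allowed)
{
	unsigned int n = 0;

	for (size_t i = 0; i < MASK_WORDS; i++) {
		uint32_t x = active->w[i] & ~allowed->w[i];

		while (x) {
			x &= x - 1;   /* clear the lowest set bit */
			n++;
		}
	}
	return n;
}
```

Passing the Cpus_active_mm value as 'active' and the Cpus_allowed value as
'allowed' would give the number of CPUs that may still be targeted by TLB
flush IPIs although they lie outside the scheduling affinity.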

These fields are exposed only on architectures that explicitly opt in
via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
mm_cpumask semantics vary significantly across architectures; some
(e.g., x86) actively maintain the mask for coherency, while others may
never clear bits, rendering the data misleading for this specific use
case. x86 is updated to select this feature by default.

For example, on an architecture other than x86, the helper is preprocessed
to an empty stub:

    # make fs/proc/array.i
    # grep task_cpus_active_mm -B 1 -A 3 --max-count 1 fs/proc/array.i
    # 430 "fs/proc/array.c"
    static inline __attribute__((__gnu_inline__)) __attribute__((__unused__)) __attribute__((no_instrument_function)) void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
    {
    }

The implementation reads the mask directly without introducing additional
locks or snapshots. While this implies that the hex mask and list format
could theoretically observe slightly different states on a rapidly
changing system, this "best-effort" approach aligns with the standard
design philosophy of /proc and avoids imposing locking overhead on
critical memory management paths.

Changes since v2 [1]:
 - Introduce a new Kconfig option, ARCH_WANT_PROC_CPUS_ACTIVE_MM. The x86
   architecture now explicitly selects this feature, ensuring that the
   field is only exposed where the mm_cpumask semantics are meaningful for
   TLB coherency (David Hildenbrand)

Changes since v1 [2]:
 - Document new Cpus_active_mm and Cpus_active_mm_list entries in
   /proc/[pid]/status (Oleg Nesterov)

[1]: https://lore.kernel.org/lkml/20251226211407.2252573-1-atomlin@atomlin.com/
[2]: https://lore.kernel.org/lkml/20251217024603.1846651-1-atomlin@atomlin.com/

Aaron Tomlin (1):
  fs/proc: Expose mm_cpumask in /proc/[pid]/status

 Documentation/filesystems/proc.rst |  7 +++++++
 arch/x86/Kconfig                   |  1 +
 fs/proc/Kconfig                    | 14 ++++++++++++++
 fs/proc/array.c                    | 28 +++++++++++++++++++++++++++-
 4 files changed, 49 insertions(+), 1 deletion(-)

-- 
2.51.0

* [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-15 20:54 [v3 PATCH 0/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status Aaron Tomlin
@ 2026-01-15 20:54 ` Aaron Tomlin
  2026-01-15 21:19   ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Tomlin @ 2026-01-15 20:54 UTC
  To: oleg, akpm, gregkh, david, brauner, mingo
  Cc: neelx, sean, linux-kernel, linux-fsdevel

This patch adds two new fields to /proc/[pid]/status, "Cpus_active_mm" and
"Cpus_active_mm_list", which display the mm_cpumask of the process's
memory descriptor in mask and list format respectively. The mm_cpumask is
primarily used for TLB and cache synchronisation.

Exposing this information allows userspace to easily describe the
relationship between CPUs where a memory descriptor is "active" and the
CPUs where the thread is allowed to execute. The primary intent is to
provide visibility into the "memory footprint" across CPUs, which is
invaluable for debugging performance issues related to IPI storms and
TLB shootdowns in large-scale NUMA systems. The CPU affinity sets the
boundary of where a thread may run; the mm_cpumask records where the mm
has actually been active; the two complement each other.

Frequent mm_cpumask changes may indicate instability in placement
policies or excessive task migration overhead.

These fields are exposed only on architectures that explicitly opt in
via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
mm_cpumask semantics vary significantly across architectures; some
(e.g., x86) actively maintain the mask for coherency, while others may
never clear bits, rendering the data misleading for this specific use
case. x86 is updated to select this feature by default.

The implementation reads the mask directly without introducing additional
locks or snapshots. While this implies that the hex mask and list format
could theoretically observe slightly different states on a rapidly
changing system, this "best-effort" approach aligns with the standard
design philosophy of /proc and avoids imposing locking overhead on
critical memory management paths.

Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
 Documentation/filesystems/proc.rst |  7 +++++++
 arch/x86/Kconfig                   |  1 +
 fs/proc/Kconfig                    | 14 ++++++++++++++
 fs/proc/array.c                    | 28 +++++++++++++++++++++++++++-
 4 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 8256e857e2d7..c6ced84c5c68 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -291,12 +291,19 @@ It's slow but very precise.
  SpeculationIndirectBranch   indirect branch speculation mode
  Cpus_allowed                mask of CPUs on which this process may run
  Cpus_allowed_list           Same as previous, but in "list format"
+ Cpus_active_mm              mask of CPUs on which this process has an active
+                             memory context
+ Cpus_active_mm_list         Same as previous, but in "list format"
  Mems_allowed                mask of memory nodes allowed to this process
  Mems_allowed_list           Same as previous, but in "list format"
  voluntary_ctxt_switches     number of voluntary context switches
  nonvoluntary_ctxt_switches  number of non voluntary context switches
  ==========================  ===================================================
 
+Note "Cpus_active_mm" is currently only supported on x86. Its semantics are
+architecture-dependent; on x86, it represents the set of CPUs that may hold
+stale TLB entries for the process and thus require IPI-based TLB shootdowns to
+maintain coherency.
 
 .. table:: Table 1-3: Contents of the statm fields (as of 2.6.8-rc3)
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 80527299f859..f0997791dbdb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
 	select ARCH_WANTS_THP_SWAP		if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+	select ARCH_WANT_PROC_CPUS_ACTIVE_MM
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
 	select CLOCKSOURCE_WATCHDOG
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 6ae966c561e7..952c40cf3baa 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -127,3 +127,17 @@ config PROC_PID_ARCH_STATUS
 config PROC_CPU_RESCTRL
 	def_bool n
 	depends on PROC_FS
+
+config ARCH_WANT_PROC_CPUS_ACTIVE_MM
+	bool
+	depends on PROC_FS
+	help
+	  Selected by architectures that reliably maintain mm_cpumask for TLB
+	  and cache synchronisation and wish to expose it in
+	  /proc/[pid]/status. Exposing this information allows userspace to
+	  easily describe the relationship between CPUs where a memory
+	  descriptor is "active" and the CPUs where the thread is allowed to
+	  execute. The primary intent is to provide visibility into the
+	  "memory footprint" across CPUs, which is invaluable for debugging
+	  performance issues related to IPI storms and TLB shootdowns in
+	  large-scale NUMA systems.
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 42932f88141a..c16aad59e0a7 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -409,6 +409,29 @@ static void task_cpus_allowed(struct seq_file *m, struct task_struct *task)
 		   cpumask_pr_args(&task->cpus_mask));
 }
 
+/**
+ * task_cpus_active_mm - Show the mm_cpumask for a process
+ * @m: The seq_file structure for the /proc/PID/status output
+ * @mm: The memory descriptor of the process
+ *
+ * Prints the set of CPUs, representing the CPU affinity of the process's
+ * active memory context, in both mask and list format. This mask is
+ * primarily used for TLB and cache synchronisation.
+ */
+#ifdef CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM
+static void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
+{
+	seq_printf(m, "Cpus_active_mm:\t%*pb\n",
+		   cpumask_pr_args(mm_cpumask(mm)));
+	seq_printf(m, "Cpus_active_mm_list:\t%*pbl\n",
+		   cpumask_pr_args(mm_cpumask(mm)));
+}
+#else
+static inline void task_cpus_active_mm(struct seq_file *m, struct mm_struct *mm)
+{
+}
+#endif
+
 static inline void task_core_dumping(struct seq_file *m, struct task_struct *task)
 {
 	seq_put_decimal_ull(m, "CoreDumping:\t", !!task->signal->core_state);
@@ -450,12 +473,15 @@ int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
 		task_core_dumping(m, task);
 		task_thp_status(m, mm);
 		task_untag_mask(m, mm);
-		mmput(mm);
 	}
 	task_sig(m, task);
 	task_cap(m, task);
 	task_seccomp(m, task);
 	task_cpus_allowed(m, task);
+	if (mm) {
+		task_cpus_active_mm(m, mm);
+		mmput(mm);
+	}
 	cpuset_task_status_allowed(m, task);
 	task_context_switch_counts(m, task);
 	arch_proc_pid_thread_features(m, task);
-- 
2.51.0


* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-15 20:54 ` [v3 PATCH 1/1] " Aaron Tomlin
@ 2026-01-15 21:19   ` David Hildenbrand (Red Hat)
  2026-01-15 21:23     ` Peter Zijlstra
  2026-01-15 21:39     ` Dave Hansen
  0 siblings, 2 replies; 10+ messages in thread
From: David Hildenbrand (Red Hat) @ 2026-01-15 21:19 UTC
  To: Aaron Tomlin, oleg, akpm, gregkh, brauner, mingo
  Cc: neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, x86@kernel.org

On 1/15/26 21:54, Aaron Tomlin wrote:
> This patch introduces two new fields to /proc/[pid]/status to display the
> set of CPUs, representing the CPU affinity of the process's active
> memory context, in both mask and list format: "Cpus_active_mm" and
> "Cpus_active_mm_list". The mm_cpumask is primarily used for TLB and
> cache synchronisation.
> 
> Exposing this information allows userspace to easily describe the
> relationship between CPUs where a memory descriptor is "active" and the
> CPUs where the thread is allowed to execute. The primary intent is to
> provide visibility into the "memory footprint" across CPUs, which is
> invaluable for debugging performance issues related to IPI storms and
> TLB shootdowns in large-scale NUMA systems. The CPU-affinity sets the
> boundary; the mm_cpumask records the arrival; they complement each
> other.
> 
> Frequent mm_cpumask changes may indicate instability in placement
> policies or excessive task migration overhead.
> 
> These fields are exposed only on architectures that explicitly opt-in
> via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
> mm_cpumask semantics vary significantly across architectures; some
> (e.g., x86) actively maintain the mask for coherency, while others may
> never clear bits, rendering the data misleading for this specific use
> case. x86 is updated to select this feature by default.
> 
> The implementation reads the mask directly without introducing additional
> locks or snapshots. While this implies that the hex mask and list format
> could theoretically observe slightly different states on a rapidly
> changing system, this "best-effort" approach aligns with the standard
> design philosophy of /proc and avoids imposing locking overhead on
> critical memory management paths.


Yes, restricting to architectures that have the expected semantics is 
better.

... but we better get the blessing from x86 folks :)

(CCing the x86 MM folks)


-- 
Cheers

David

* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-15 21:19   ` David Hildenbrand (Red Hat)
@ 2026-01-15 21:23     ` Peter Zijlstra
  2026-01-15 21:39     ` Dave Hansen
  1 sibling, 0 replies; 10+ messages in thread
From: Peter Zijlstra @ 2026-01-15 21:23 UTC
  To: David Hildenbrand (Red Hat)
  Cc: Aaron Tomlin, oleg, akpm, gregkh, brauner, mingo, neelx, sean,
	linux-kernel, linux-fsdevel, Dave Hansen, Andy Lutomirski,
	x86@kernel.org

On Thu, Jan 15, 2026 at 10:19:08PM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/15/26 21:54, Aaron Tomlin wrote:
> > This patch introduces two new fields to /proc/[pid]/status to display the
> > set of CPUs, representing the CPU affinity of the process's active
> > memory context, in both mask and list format: "Cpus_active_mm" and
> > "Cpus_active_mm_list". The mm_cpumask is primarily used for TLB and
> > cache synchronisation.
> > 
> > Exposing this information allows userspace to easily describe the
> > relationship between CPUs where a memory descriptor is "active" and the
> > CPUs where the thread is allowed to execute. The primary intent is to
> > provide visibility into the "memory footprint" across CPUs, which is
> > invaluable for debugging performance issues related to IPI storms and
> > TLB shootdowns in large-scale NUMA systems. The CPU-affinity sets the
> > boundary; the mm_cpumask records the arrival; they complement each
> > other.
> > 
> > Frequent mm_cpumask changes may indicate instability in placement
> > policies or excessive task migration overhead.
> > 
> > These fields are exposed only on architectures that explicitly opt-in
> > via CONFIG_ARCH_WANT_PROC_CPUS_ACTIVE_MM. This is necessary because
> > mm_cpumask semantics vary significantly across architectures; some
> > (e.g., x86) actively maintain the mask for coherency, while others may
> > never clear bits, rendering the data misleading for this specific use
> > case. x86 is updated to select this feature by default.
> > 
> > The implementation reads the mask directly without introducing additional
> > locks or snapshots. While this implies that the hex mask and list format
> > could theoretically observe slightly different states on a rapidly
> > changing system, this "best-effort" approach aligns with the standard
> > design philosophy of /proc and avoids imposing locking overhead on
> > critical memory management paths.
> 
> 
> Yes, restricting to architectures that have the expected semantics is
> better.
> 
> ... but we better get the blessing from x86 folks :)
> 
> (CCing the x86 MM folks)

Yeah, seems like a very bad idea this. mm_cpumask really is an arch
detail you cannot rely on very much. Exposing this to userspace is
terrible.


* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-15 21:19   ` David Hildenbrand (Red Hat)
  2026-01-15 21:23     ` Peter Zijlstra
@ 2026-01-15 21:39     ` Dave Hansen
  2026-01-16  1:53       ` Aaron Tomlin
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2026-01-15 21:39 UTC
  To: David Hildenbrand (Red Hat), Aaron Tomlin, oleg, akpm, gregkh,
	brauner, mingo
  Cc: neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, x86@kernel.org

On 1/15/26 13:19, David Hildenbrand (Red Hat) wrote:
> On 1/15/26 21:54, Aaron Tomlin wrote:
>> This patch introduces two new fields to /proc/[pid]/status to display the
>> set of CPUs, representing the CPU affinity of the process's active
>> memory context, in both mask and list format: "Cpus_active_mm" and
>> "Cpus_active_mm_list". The mm_cpumask is primarily used for TLB and
>> cache synchronisation. 

I don't think this is the kind of thing we want to expose as ABI. It's
too deep of an implementation detail. Any meaning derived from it could
also change on a whim.

For instance, we've changed the rules about when CPUs are put in or
taken out of mm_cpumask() over time. I think the rules might have even
depended on the idle driver that your system was using at one time. I
think Rik also just changed some rules around it in his INVLPGB patches.

I'm not denying how valuable this kind of information might be. I just
don't think it's generally useful enough to justify an ABI that we need
to maintain forever. Tracing seems like a much more appropriate way to
get the data you are after than new ABI.

Can you get the info that you're after with kprobes? Or new tracepoints?

* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-15 21:39     ` Dave Hansen
@ 2026-01-16  1:53       ` Aaron Tomlin
  2026-01-16  2:27         ` Rik van Riel
  2026-01-16  5:08         ` Dave Hansen
  0 siblings, 2 replies; 10+ messages in thread
From: Aaron Tomlin @ 2026-01-16  1:53 UTC
  To: Dave Hansen
  Cc: David Hildenbrand (Red Hat), oleg, akpm, gregkh, brauner, mingo,
	neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, riel, x86@kernel.org

On Thu, Jan 15, 2026 at 01:39:27PM -0800, Dave Hansen wrote:
> I don't think this is the kind of thing we want to expose as ABI. It's
> too deep of an implementation detail. Any meaning derived from it could
> also change on a whim.
> 
> For instance, we've changed the rules about when CPUs are put in or
> taken out of mm_cpumask() over time. I think the rules might have even
> depended on the idle driver that your system was using at one time. I
> think Rik also just changed some rules around it in his INVLPGB patches.
> 
> I'm not denying how valuable this kind of information might be. I just
> don't think it's generally useful enough to justify an ABI that we need
> to maintain forever. Tracing seems like a much more appropriate way to
> get the data you are after than new ABI.
> 
> Can you get the info that you're after with kprobes? Or new tracepoints?

Hi Dave and Peter,

I fully appreciate your concern regarding the exposure of deep
implementation details as stable ABI. I understand that the semantics of
mm_cpumask are fluid. I certainly do not wish to ossify internal logic.

While the static tracepoint trace_tlb_flush is available, the primary
argument for exposing this via /proc/[pid]/status is one of immediacy and
the lack of external dependencies. Having an instantaneous snapshot
available without requiring, e.g., ftrace or eBPF is invaluable for quick
diagnostic checks in production environments.

Based on my reading of arch/x86/mm/tlb.c, the lifecycle of each bit in
mm_cpumask appears to follow this logic:

    1. Schedule on (switch_mm): bit is set.
    2. Schedule off: bit remains set (CPU enters "lazy" mode).
    3. Remote TLB flush (IPI):
       - If running: flush TLB, bit remains set.
       - If lazy (leave_mm): switch to init_mm, bit clearing is deferred.
       - If stale (mm != loaded_mm): bit is cleared immediately
         (effectively the second IPI for a CPU that was previously lazy).
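
The lifecycle above can be written down as a small toy state machine. This
is only a userspace model of the description in this mail, not kernel code:
struct toy_cpu and the toy_*() helpers are hypothetical names, and the model
tracks a single CPU's bit for a single mm.

```c
#include <assert.h>
#include <stdbool.h>

/* Which mm the modelled CPU currently has loaded. */
enum toy_loaded {
	TOY_LOADED_INIT_MM,   /* stands in for init_mm or any other mm */
	TOY_LOADED_THIS_MM,
};

struct toy_cpu {
	bool in_mask;             /* this CPU's bit in the modelled mm_cpumask */
	bool lazy;                /* scheduled off, but the mm is still loaded */
	enum toy_loaded loaded;
};

/* Step 1: a thread of the mm is scheduled on, so the bit is set. */
static void toy_sched_on(struct toy_cpu *c)
{
	assert(c);
	c->in_mask = true;
	c->lazy = false;
	c->loaded = TOY_LOADED_THIS_MM;
}

/* Step 2: the thread is scheduled off; the bit stays, the CPU goes lazy. */
static void toy_sched_off(struct toy_cpu *c)
{
	assert(c);
	if (c->loaded == TOY_LOADED_THIS_MM)
		c->lazy = true;
}

/* Step 3: a remote TLB flush IPI arrives on this CPU. */
static void toy_flush_ipi(struct toy_cpu *c)
{
	assert(c);
	if (!c->in_mask)
		return;                   /* not a flush target in this model */

	if (c->loaded != TOY_LOADED_THIS_MM) {
		c->in_mask = false;       /* stale: bit cleared immediately */
		return;
	}
	if (c->lazy) {
		c->loaded = TOY_LOADED_INIT_MM;  /* lazy: switch away ... */
		c->lazy = false;                 /* ... bit clearing deferred */
		return;
	}
	/* running: flush the TLB, bit remains set */
}
```

Walking one CPU through schedule on, schedule off and two flush IPIs leaves
the bit set after the first IPI and clear after the second, which matches
step 3 as described.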

Would you be amenable to this exposure if it were guarded behind a specific
CONFIG_DEBUG option (e.g., CONFIG_DEBUG_MM_CPUMASK_INFO)? This would
clearly mark it as a diagnostic aid for debugging, allowing educated users
to opt-in to the visibility without implying a permanent guarantee of
semantic stability for general userspace applications.

Please let me know your thoughts. Thank you.

Kind regards,
-- 
Aaron Tomlin


* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-16  1:53       ` Aaron Tomlin
@ 2026-01-16  2:27         ` Rik van Riel
  2026-01-16 14:31           ` Aaron Tomlin
  2026-01-16  5:08         ` Dave Hansen
  1 sibling, 1 reply; 10+ messages in thread
From: Rik van Riel @ 2026-01-16  2:27 UTC
  To: Aaron Tomlin, Dave Hansen
  Cc: David Hildenbrand (Red Hat), oleg, akpm, gregkh, brauner, mingo,
	neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, x86@kernel.org

On Thu, 2026-01-15 at 20:53 -0500, Aaron Tomlin wrote:
> 
> Based on my reading of arch/x86/mm/tlb.c, the lifecycle of each bit
> in
> mm_cpumask appears to follow this logic:
> 
>     1. Schedule on (switch_mm): Bit set.
>     2. Schedule off: Bit remains set (CPU enters "Lazy" mode).
>     3. Remote TLB Flush (IPI):
>        - If Running: Flush TLB, bit remains set.
>        - If lazy (leave_mm): Switch to init_mm, bit clearing is
> deferred.
>        - If stale (mm != loaded_mm): bit is cleared immediately
>          (effectively the second IPI for a CPU that was previously
> lazy).
> 

You're close. When a process uses INVLPGB, no remote TLB
flushing IPIs will get sent, and CPUs never get cleared
from the mm_cpumask.

-- 
All Rights Reversed.

* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-16  1:53       ` Aaron Tomlin
  2026-01-16  2:27         ` Rik van Riel
@ 2026-01-16  5:08         ` Dave Hansen
  2026-01-16 15:42           ` Aaron Tomlin
  1 sibling, 1 reply; 10+ messages in thread
From: Dave Hansen @ 2026-01-16  5:08 UTC
  To: Aaron Tomlin
  Cc: David Hildenbrand (Red Hat), oleg, akpm, gregkh, brauner, mingo,
	neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, riel, x86@kernel.org

On 1/15/26 17:53, Aaron Tomlin wrote:
> Would you be amenable to this exposure if it were guarded behind a specific
> CONFIG_DEBUG option (e.g., CONFIG_DEBUG_MM_CPUMASK_INFO)?

Not really.

ABI behind a config option is still ABI.

* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-16  2:27         ` Rik van Riel
@ 2026-01-16 14:31           ` Aaron Tomlin
  0 siblings, 0 replies; 10+ messages in thread
From: Aaron Tomlin @ 2026-01-16 14:31 UTC
  To: Rik van Riel
  Cc: Dave Hansen, David Hildenbrand (Red Hat), oleg, akpm, gregkh,
	brauner, mingo, neelx, sean, linux-kernel, linux-fsdevel,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, x86@kernel.org

On Thu, Jan 15, 2026 at 09:27:44PM -0500, Rik van Riel wrote:
> On Thu, 2026-01-15 at 20:53 -0500, Aaron Tomlin wrote:
> > 
> > Based on my reading of arch/x86/mm/tlb.c, the lifecycle of each bit
> > in
> > mm_cpumask appears to follow this logic:
> > 
> >     1. Schedule on (switch_mm): Bit set.
> >     2. Schedule off: Bit remains set (CPU enters "Lazy" mode).
> >     3. Remote TLB Flush (IPI):
> >        - If Running: Flush TLB, bit remains set.
> >        - If lazy (leave_mm): Switch to init_mm, bit clearing is
> > deferred.
> >        - If stale (mm != loaded_mm): bit is cleared immediately
> >          (effectively the second IPI for a CPU that was previously
> > lazy).
> > 
> 
> You're close. When a process uses INVLPGB, no remote TLB
> flushing IPIs will get sent, and CPUs never get cleared
> from the mm_cpumask.

Hi Rik,

Not close enough :)

It is good to hear from you, and thank you for the clarification regarding
X86_FEATURE_INVLPGB.

You are quite right; as flush_tlb_func() serves as the sole mechanism for
clearing bits from the mm_cpumask, bypassing IPIs inherently bypasses the
cleanup logic. Consequently, in this scenario, the bit is set upon
scheduling but never cleared, as the hardware-broadcast invalidations
circumvent the software handler responsible for maintaining the mask.


Kind regards,
-- 
Aaron Tomlin

* Re: [v3 PATCH 1/1] fs/proc: Expose mm_cpumask in /proc/[pid]/status
  2026-01-16  5:08         ` Dave Hansen
@ 2026-01-16 15:42           ` Aaron Tomlin
  0 siblings, 0 replies; 10+ messages in thread
From: Aaron Tomlin @ 2026-01-16 15:42 UTC
  To: Dave Hansen
  Cc: David Hildenbrand (Red Hat), oleg, akpm, gregkh, brauner, mingo,
	neelx, sean, linux-kernel, linux-fsdevel, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, riel, x86@kernel.org

On Thu, Jan 15, 2026 at 09:08:58PM -0800, Dave Hansen wrote:
> On 1/15/26 17:53, Aaron Tomlin wrote:
> > Would you be amenable to this exposure if it were guarded behind a specific
> > CONFIG_DEBUG option (e.g., CONFIG_DEBUG_MM_CPUMASK_INFO)?
> 
> Not really.
> 
> ABI behind a config option is still ABI.

Hi Dave,

I understand.

Kind regards,
-- 
Aaron Tomlin
