From: Boqun Feng <boqun@kernel.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Catalin Marinas" <catalin.marinas@arm.com>,
	"Will Deacon" <will@kernel.org>,
	"Jonas Bonn" <jonas@southpole.se>,
	"Stefan Kristiansson" <stefan.kristiansson@saunalahti.fi>,
	"Stafford Horne" <shorne@gmail.com>,
	"Heiko Carstens" <hca@linux.ibm.com>,
	"Vasily Gorbik" <gor@linux.ibm.com>,
	"Alexander Gordeev" <agordeev@linux.ibm.com>,
	"Christian Borntraeger" <borntraeger@linux.ibm.com>,
	"Sven Schnelle" <svens@linux.ibm.com>,
	"Thomas Gleixner" <tglx@kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Borislav Petkov" <bp@alien8.de>,
	"Dave Hansen" <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	"Arnd Bergmann" <arnd@arndb.de>,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Ben Segall" <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	"Valentin Schneider" <vschneid@redhat.com>,
	"K Prateek Nayak" <kprateek.nayak@amd.com>,
	"Waiman Long" <longman@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"Eduard Zingerman" <eddyz87@gmail.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Martin KaFai Lau" <martin.lau@linux.dev>,
	"Kumar Kartikeya Dwivedi" <memxor@gmail.com>,
	"Song Liu" <song@kernel.org>,
	"Yonghong Song" <yonghong.song@linux.dev>,
	"Jiri Olsa" <jolsa@kernel.org>, "Shuah Khan" <shuah@kernel.org>,
	"Miguel Ojeda" <ojeda@kernel.org>, "Gary Guo" <gary@garyguo.net>,
	"Björn Roy Baron" <bjorn3_gh@protonmail.com>,
	"Benno Lossin" <lossin@kernel.org>,
	"Andreas Hindborg" <a.hindborg@kernel.org>,
	"Alice Ryhl" <aliceryhl@google.com>,
	"Trevor Gross" <tmgross@umich.edu>,
	"Danilo Krummrich" <dakr@kernel.org>,
	"Jinjie Ruan" <ruanjinjie@huawei.com>,
	"Lyude Paul" <lyude@redhat.com>, "Thomas Huth" <thuth@redhat.com>,
	"Sohil Mehta" <sohil.mehta@intel.com>,
	"Xin Li (Intel)" <xin@zytor.com>,
	"Pawan Gupta" <pawan.kumar.gupta@linux.intel.com>,
	"Nikunj A Dadhania" <nikunj@amd.com>,
	"Joel Fernandes" <joelagnelf@nvidia.com>,
	"Andy Shevchenko" <andriy.shevchenko@linux.intel.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Yury Norov" <ynorov@nvidia.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>,
	linux-kernel@vger.kernel.org, linux-openrisc@vger.kernel.org,
	linux-s390@vger.kernel.org, linux-arch@vger.kernel.org,
	bpf@vger.kernel.org, linux-kselftest@vger.kernel.org,
	rust-for-linux@vger.kernel.org, "Onur Özkan" <work@onurozkan.dev>,
	"Daniel Almeida" <daniel.almeida@collabora.com>,
	"Boqun Feng" <boqun.feng@gmail.com>
Subject: Re: [PATCH v2 01/12] preempt: Track NMI nesting to separate per-CPU counter
Date: Thu, 4 Jun 2026 05:36:36 -0700
Message-ID: <aiFxVG3epMKAva76@tardis-2.local>
In-Reply-To: <20260526152148.30514-2-boqun@kernel.org>

On Tue, May 26, 2026 at 08:21:37AM -0700, Boqun Feng wrote:
> From: Joel Fernandes <joelagnelf@nvidia.com>
> 
> Move NMI nesting tracking from the preempt_count bits to a separate per-CPU
> counter (nmi_nesting). This is to free up the NMI bits in the preempt_count,
> allowing those bits to be repurposed for other uses.
> 
> Reduce NMI_BITS from 4 to 1, using it only to detect if we're in an NMI.
> The per-CPU counter currently caps nesting at 15.
> 
> [boqun: Solve Steven Rostedt's comment on the BUG_ON() condition]
> 
> Suggested-by: Boqun Feng <boqun.feng@gmail.com>
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> Signed-off-by: Lyude Paul <lyude@redhat.com>
> Signed-off-by: Boqun Feng <boqun@kernel.org>
> Link: https://patch.msgid.link/20260121223933.1568682-3-lyude@redhat.com
> ---
>  include/linux/hardirq.h                        | 17 +++++++++++++----
>  include/linux/preempt.h                        |  9 +++++++--
>  kernel/softirq.c                               |  2 ++
>  tools/testing/selftests/bpf/bpf_experimental.h |  2 +-
>  4 files changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index d57cab4d4c06..1a0360a1000f 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -10,6 +10,8 @@
>  #include <linux/vtime.h>
>  #include <asm/hardirq.h>
>  
> +DECLARE_PER_CPU(unsigned int, nmi_nesting);
> +
>  extern void synchronize_irq(unsigned int irq);
>  extern bool synchronize_hardirq(unsigned int irq);
>  
> @@ -102,14 +104,17 @@ void irq_exit_rcu(void);
>   */
>  
>  /*
> - * nmi_enter() can nest up to 15 times; see NMI_BITS.
> + * nmi_enter() can nest - nesting is tracked in a per-CPU counter.
>   */
>  #define __nmi_enter()						\
>  	do {							\
>  		lockdep_off();					\
>  		arch_nmi_enter();				\
> -		BUG_ON(in_nmi() == NMI_MASK);			\
> -		__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);	\
> +		/* Maximum NMI nesting is 15. */		\
> +		BUG_ON(__this_cpu_read(nmi_nesting) >= 15);	\
> +		__this_cpu_inc(nmi_nesting);			\
> +		__preempt_count_add(HARDIRQ_OFFSET);		\
> +		preempt_count_set(preempt_count() | NMI_MASK);	\
>  	} while (0)
>  
>  #define nmi_enter()						\
> @@ -124,8 +129,12 @@ void irq_exit_rcu(void);
>  
>  #define __nmi_exit()						\
>  	do {							\
> +		unsigned int nesting;				\
>  		BUG_ON(!in_nmi());				\
> -		__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);	\
> +		__preempt_count_sub(HARDIRQ_OFFSET);		\
> +		nesting = __this_cpu_dec_return(nmi_nesting);	\
> +		if (!nesting)					\
> +			__preempt_count_sub(NMI_OFFSET);	\

There is a race here if another NMI arrives after the decrement but
before the "if (!nesting)" check:

  // nmi_nesting == 1
  __nmi_exit():
    ..
    nesting = __this_cpu_dec_return(nmi_nesting); // <- nesting == 0
    <another NMI comes>
      __nmi_enter()
      // nmi_nesting becomes 1
      __nmi_exit():
        nesting = __this_cpu_dec_return(nmi_nesting); // <- nesting == 0
        if (!nesting)
          __preempt_count_sub(NMI_OFFSET);
        // NMI_OFFSET bit is 0
    if (!nesting)
      __preempt_count_sub(NMI_OFFSET); // underflow!
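
To make the interleaving above concrete, here is a minimal user-space
sketch, not kernel code. It models preempt_count and nmi_nesting as plain
variables for one simulated CPU and injects the nested NMI by hand at the
point shown in the trace. The sim_* names and the inject_nested parameter
are hypothetical, the lockdep and arch hooks are left out, and a 32-bit
unsigned int is assumed.

```c
#include <assert.h>

/* Values taken from the patched preempt.h layout (NMI_BITS == 1). */
#define HARDIRQ_OFFSET 0x00010000u
#define NMI_OFFSET     0x00100000u
#define NMI_MASK       0x00100000u

/* Hypothetical stand-ins for the per-CPU state of one simulated CPU. */
static unsigned int sim_preempt_count;
static unsigned int sim_nmi_nesting;

/* Models __nmi_enter() from the patch, without lockdep/arch hooks. */
static void sim_nmi_enter(void)
{
	assert(sim_nmi_nesting < 15);	/* stands in for BUG_ON() */
	sim_nmi_nesting++;
	sim_preempt_count += HARDIRQ_OFFSET;
	sim_preempt_count |= NMI_MASK;
}

/*
 * Models __nmi_exit() as posted in v2. When inject_nested is nonzero,
 * a complete nested NMI runs between the decrement and the check.
 */
static void sim_nmi_exit_v2(int inject_nested)
{
	unsigned int nesting;

	sim_preempt_count -= HARDIRQ_OFFSET;
	nesting = --sim_nmi_nesting;
	if (inject_nested) {
		sim_nmi_enter();
		sim_nmi_exit_v2(0);	/* inner exit clears the NMI bit */
	}
	if (!nesting)
		sim_preempt_count -= NMI_OFFSET;	/* second subtraction */
}

/* Runs the interleaving once and returns the simulated preempt_count. */
static unsigned int sim_race_v2(void)
{
	sim_preempt_count = 0;
	sim_nmi_nesting = 0;
	sim_nmi_enter();
	sim_nmi_exit_v2(1);
	return sim_preempt_count;
}
```

In this model sim_race_v2() returns 0xfff00000 instead of 0: the outer
subtraction borrows from the bits above the NMI bit because the inner
exit has already cleared it.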

I think we need to do:
      
#define __nmi_exit()						\
	do {							\
		unsigned int nesting;				\
		BUG_ON(!in_nmi());				\
		__preempt_count_sub(HARDIRQ_OFFSET);		\
		nesting = __this_cpu_dec_return(nmi_nesting);	\
		if (!nesting)					\
			preempt_count_set(preempt_count() & ~NMI_MASK);	\
		arch_nmi_exit();				\
		lockdep_on();					\
	} while (0)
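
As a sanity check of the proposal, the same kind of user-space sketch can
be run with the subtraction replaced by a mask clear. Again this is only
an illustration under assumptions: the fix_* names and the inject_nested
parameter are hypothetical, one CPU is simulated with plain variables,
and the lockdep and arch hooks are omitted.

```c
#include <assert.h>

/* Values taken from the patched preempt.h layout (NMI_BITS == 1). */
#define HARDIRQ_OFFSET 0x00010000u
#define NMI_MASK       0x00100000u

/* Hypothetical stand-ins for the per-CPU state of one simulated CPU. */
static unsigned int fix_preempt_count;
static unsigned int fix_nmi_nesting;

/* Models __nmi_enter() from the patch, without lockdep/arch hooks. */
static void fix_nmi_enter(void)
{
	assert(fix_nmi_nesting < 15);	/* stands in for BUG_ON() */
	fix_nmi_nesting++;
	fix_preempt_count += HARDIRQ_OFFSET;
	fix_preempt_count |= NMI_MASK;
}

/*
 * Models the proposed __nmi_exit(): clearing NMI_MASK is idempotent,
 * so doing it twice leaves the other bits untouched. When inject_nested
 * is nonzero, a nested NMI runs between the decrement and the check.
 */
static void fix_nmi_exit(int inject_nested)
{
	unsigned int nesting;

	fix_preempt_count -= HARDIRQ_OFFSET;
	nesting = --fix_nmi_nesting;
	if (inject_nested) {
		fix_nmi_enter();
		fix_nmi_exit(0);
	}
	if (!nesting)
		fix_preempt_count &= ~NMI_MASK;
}

/* Runs the same interleaving and returns the simulated preempt_count. */
static unsigned int fix_race(void)
{
	fix_preempt_count = 0;
	fix_nmi_nesting = 0;
	fix_nmi_enter();
	fix_nmi_exit(1);
	return fix_preempt_count;
}
```

With the mask clear, fix_race() returns 0 in this model: both the inner
and the outer exit clear the bit, and the second clear is a no-op.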

@Joel, thoughts?

Similarly, we have this issue in patch #10 as well.

Regards,
Boqun

>  		arch_nmi_exit();				\
>  		lockdep_on();					\
>  	} while (0)
> diff --git a/include/linux/preempt.h b/include/linux/preempt.h
> index d964f965c8ff..586f96688325 100644
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -17,6 +17,8 @@
>   *
>   * - bits 0-7 are the preemption count (max preemption depth: 256)
>   * - bits 8-15 are the softirq count (max # of softirqs: 256)
> + * - bits 16-19 are the hardirq count (max # of hardirqs: 16)
> + * - bit 20 is the NMI flag (no nesting count, tracked separately)
>   *
>   * The hardirq count could in theory be the same as the number of
>   * interrupts in the system, but we run all interrupt handlers with
> @@ -24,16 +26,19 @@
>   * there are a few palaeontologic drivers which reenable interrupts in
>   * the handler, so we need more than one bit here.
>   *
> + * NMI nesting depth is tracked in a separate per-CPU variable
> + * (nmi_nesting) to save bits in preempt_count.
> + *
>   *         PREEMPT_MASK:	0x000000ff
>   *         SOFTIRQ_MASK:	0x0000ff00
>   *         HARDIRQ_MASK:	0x000f0000
> - *             NMI_MASK:	0x00f00000
> + *             NMI_MASK:	0x00100000
>   * PREEMPT_NEED_RESCHED:	0x80000000
>   */
>  #define PREEMPT_BITS	8
>  #define SOFTIRQ_BITS	8
>  #define HARDIRQ_BITS	4
> -#define NMI_BITS	4
> +#define NMI_BITS	1
>  
>  #define PREEMPT_SHIFT	0
>  #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 4425d8dce44b..10af5ed859e7 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -88,6 +88,8 @@ EXPORT_PER_CPU_SYMBOL_GPL(hardirqs_enabled);
>  EXPORT_PER_CPU_SYMBOL_GPL(hardirq_context);
>  #endif
>  
> +DEFINE_PER_CPU(unsigned int, nmi_nesting);
> +
>  /*
>   * SOFTIRQ_OFFSET usage:
>   *
> diff --git a/tools/testing/selftests/bpf/bpf_experimental.h b/tools/testing/selftests/bpf/bpf_experimental.h
> index 2234bd6bc9d3..2d4256ff471f 100644
> --- a/tools/testing/selftests/bpf/bpf_experimental.h
> +++ b/tools/testing/selftests/bpf/bpf_experimental.h
> @@ -449,7 +449,7 @@ extern int bpf_cgroup_read_xattr(struct cgroup *cgroup, const char *name__str,
>  #define PREEMPT_BITS	8
>  #define SOFTIRQ_BITS	8
>  #define HARDIRQ_BITS	4
> -#define NMI_BITS	4
> +#define NMI_BITS	1
>  
>  #define PREEMPT_SHIFT	0
>  #define SOFTIRQ_SHIFT	(PREEMPT_SHIFT + PREEMPT_BITS)
> -- 
> 2.50.1 (Apple Git-155)
> 
