public inbox for rcu@vger.kernel.org
From: Frederic Weisbecker <frederic@kernel.org>
To: Joel Fernandes <joel@joelfernandes.org>
Cc: Paul E McKenney <paulmck@kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>,
	rcu@vger.kernel.org, Neeraj Upadhyay <neeraj.upadhyay@kernel.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Uladzislau Rezki <urezki@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	Lai Jiangshan <jiangshanlai@gmail.com>,
	Zqiang <qiang.zhang@linux.dev>, Shuah Khan <shuah@kernel.org>,
	linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
	Kai Yao <yaokai34@huawei.com>, Tengda Wu <wutengda2@huawei.com>
Subject: Re: [PATCH -next 1/8] rcu: Fix rcu_read_unlock() deadloop due to softirq
Date: Fri, 9 Jan 2026 15:23:03 +0100	[thread overview]
Message-ID: <aWEPRz2QWnDnIO36@localhost.localdomain> (raw)
In-Reply-To: <20260109011256.GA1099041@joelbox2>

On Thu, Jan 08, 2026 at 08:12:56PM -0500, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 1/8/2026 10:25 AM, Frederic Weisbecker wrote:
> > On Wed, Jan 07, 2026 at 08:02:43PM -0500, Joel Fernandes wrote:
> >>
> >>
> >>> On Jan 7, 2026, at 6:15 PM, Frederic Weisbecker <frederic@kernel.org> wrote:
> >>>
> >>> On Thu, Jan 01, 2026 at 11:34:10AM -0500, Joel Fernandes wrote:
> >>>> From: Yao Kai <yaokai34@huawei.com>
> >>>>
> >>>> Commit 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in
> >>>> __rcu_read_unlock()") removed the recursion-protection code from
> >>>> __rcu_read_unlock(). As a result, with ftrace enabled, we can trigger
> >>>> an infinite loop in raise_softirq_irqoff() as follows:
> >>>>
> >>>> WARNING: CPU: 0 PID: 0 at kernel/trace/trace.c:3021 __ftrace_trace_stack.constprop.0+0x172/0x180
> >>>> Modules linked in: my_irq_work(O)
> >>>> CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Tainted: G O 6.18.0-rc7-dirty #23 PREEMPT(full)
> >>>> Tainted: [O]=OOT_MODULE
> >>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
> >>>> RIP: 0010:__ftrace_trace_stack.constprop.0+0x172/0x180
> >>>> RSP: 0018:ffffc900000034a8 EFLAGS: 00010002
> >>>> RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000000000
> >>>> RDX: 0000000000000003 RSI: ffffffff826d7b87 RDI: ffffffff826e9329
> >>>> RBP: 0000000000090009 R08: 0000000000000005 R09: ffffffff82afbc4c
> >>>> R10: 0000000000000008 R11: 0000000000011d7a R12: 0000000000000000
> >>>> R13: ffff888003874100 R14: 0000000000000003 R15: ffff8880038c1054
> >>>> FS:  0000000000000000(0000) GS:ffff8880fa8ea000(0000) knlGS:0000000000000000
> >>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> CR2: 000055b31fa7f540 CR3: 00000000078f4005 CR4: 0000000000770ef0
> >>>> PKRU: 55555554
> >>>> Call Trace:
> >>>> <IRQ>
> >>>> trace_buffer_unlock_commit_regs+0x6d/0x220
> >>>> trace_event_buffer_commit+0x5c/0x260
> >>>> trace_event_raw_event_softirq+0x47/0x80
> >>>> raise_softirq_irqoff+0x6e/0xa0
> >>>> rcu_read_unlock_special+0xb1/0x160
> >>>> unwind_next_frame+0x203/0x9b0
> >>>> __unwind_start+0x15d/0x1c0
> >>>> arch_stack_walk+0x62/0xf0
> >>>> stack_trace_save+0x48/0x70
> >>>> __ftrace_trace_stack.constprop.0+0x144/0x180
> >>>> trace_buffer_unlock_commit_regs+0x6d/0x220
> >>>> trace_event_buffer_commit+0x5c/0x260
> >>>> trace_event_raw_event_softirq+0x47/0x80
> >>>> raise_softirq_irqoff+0x6e/0xa0
> >>>> rcu_read_unlock_special+0xb1/0x160
> >>>> unwind_next_frame+0x203/0x9b0
> >>>> __unwind_start+0x15d/0x1c0
> >>>> arch_stack_walk+0x62/0xf0
> >>>> stack_trace_save+0x48/0x70
> >>>> __ftrace_trace_stack.constprop.0+0x144/0x180
> >>>> trace_buffer_unlock_commit_regs+0x6d/0x220
> >>>> trace_event_buffer_commit+0x5c/0x260
> >>>> trace_event_raw_event_softirq+0x47/0x80
> >>>> raise_softirq_irqoff+0x6e/0xa0
> >>>> rcu_read_unlock_special+0xb1/0x160
> >>>> unwind_next_frame+0x203/0x9b0
> >>>> __unwind_start+0x15d/0x1c0
> >>>> arch_stack_walk+0x62/0xf0
> >>>> stack_trace_save+0x48/0x70
> >>>> __ftrace_trace_stack.constprop.0+0x144/0x180
> >>>> trace_buffer_unlock_commit_regs+0x6d/0x220
> >>>> trace_event_buffer_commit+0x5c/0x260
> >>>> trace_event_raw_event_softirq+0x47/0x80
> >>>> raise_softirq_irqoff+0x6e/0xa0
> >>>> rcu_read_unlock_special+0xb1/0x160
> >>>> __is_insn_slot_addr+0x54/0x70
> >>>> kernel_text_address+0x48/0xc0
> >>>> __kernel_text_address+0xd/0x40
> >>>> unwind_get_return_address+0x1e/0x40
> >>>> arch_stack_walk+0x9c/0xf0
> >>>> stack_trace_save+0x48/0x70
> >>>> __ftrace_trace_stack.constprop.0+0x144/0x180
> >>>> trace_buffer_unlock_commit_regs+0x6d/0x220
> >>>> trace_event_buffer_commit+0x5c/0x260
> >>>> trace_event_raw_event_softirq+0x47/0x80
> >>>> __raise_softirq_irqoff+0x61/0x80
> >>>> __flush_smp_call_function_queue+0x115/0x420
> >>>> __sysvec_call_function_single+0x17/0xb0
> >>>> sysvec_call_function_single+0x8c/0xc0
> >>>> </IRQ>
> >>>>
> >>>> Commit b41642c87716 ("rcu: Fix rcu_read_unlock() deadloop due to IRQ work")
> >>>> fixed the infinite loop in rcu_read_unlock_special() for IRQ work by
> >>>> setting a flag before calling irq_work_queue_on(). Fix the present issue
> >>>> the same way: set the same flag before calling raise_softirq_irqoff(),
> >>>> and rename the flag to defer_qs_pending to make it more generic.
> >>>>
> >>>> Fixes: 5f5fa7ea89dc ("rcu: Don't use negative nesting depth in __rcu_read_unlock()")
> >>>> Reported-by: Tengda Wu <wutengda2@huawei.com>
> >>>> Signed-off-by: Yao Kai <yaokai34@huawei.com>
> >>>> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com>
> >>>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> >>>
> >>> Looks good but, BTW, what happens if rcu_qs() is called
> >>> before rcu_preempt_deferred_qs() had a chance to be called?
> >>
> >> Could you provide an example of when that can happen?
> > 
> > It can happen because rcu_qs() is called before rcu_preempt_deferred_qs()
> > in rcu_softirq_qs(). Inverting the calls could help but IRQs must be disabled
> > to ensure there is no read side between rcu_preempt_deferred_qs() and rcu_qs().
> 
> Ah, the rcu_softirq_qs() path. Indeed, I see what you're saying now. Not sure
> how to trigger it, but good catch. It would delay the reset of the flag.
> 
> > I'm not aware of other ways to trigger that, except perhaps this:
> > 
> > https://lore.kernel.org/rcu/20251230004124.438070-1-joelagnelf@nvidia.com/T/#u
> > 
> > Either we fix those sites and make sure that rcu_preempt_deferred_qs() is always
> > called before rcu_qs() in the same IRQ disabled section (or there are other
> > fields set in ->rcu_read_unlock_special for later clearance). If we do that we
> > must WARN_ON_ONCE(rdp->defer_qs_pending == DEFER_QS_PENDING) in rcu_qs().
> > 
> > Or we reset rdp->defer_qs_pending from rcu_qs(), which sounds more robust.
> 
> If we did that, can the following not happen? I believe I tried that and it
> did not fix the IRQ work recursion. Suppose you have a timer interrupt and an
> IRQ that triggers BPF on exit. Both are pending on the CPU's IRQ controller.
> 
> First the non-timer interrupt does this:
> 
> irq_exit()
>   __irq_exit_rcu()
>     /* in_hardirq() returns false after this */
>     preempt_count_sub(HARDIRQ_OFFSET)
>     tick_irq_exit()
>       tick_nohz_irq_exit()
> 	    tick_nohz_stop_sched_tick()
> 	      trace_tick_stop()  /* a bpf prog is hooked on this trace point */
> 		   __bpf_trace_tick_stop()
> 		      bpf_trace_run2()
> 			    rcu_read_unlock_special()
>                               /* will send an IPI to itself */
> 			      irq_work_queue_on(&rdp->defer_qs_iw, rdp->cpu);
> 
> <timer interrupt runs>
> 
> The timer interrupt runs, and does the clean up that the IRQ work was supposed
> to do.
> 
> <IPI now runs for the IRQ work>
>   ->irq_exit()
>    ... recursion since IRQ work issued again.

If defer_qs_pending is only cleared when rcu_qs() or rcu_report_exp_rdp() is
called, I don't think it can happen, but I could be missing something...

> 
> Maybe it is unlikely to happen, but it feels a bit fragile still.  All it
> takes is one call to rcu_qs() after the IRQ work was queued and before it
> ran, coupled with an RCU reader that somehow always enters the slow-path.

But if rcu_qs() or rcu_report_exp_rdp() has been called, there is no longer any
need to enter rcu_read_unlock_special(), right? Unless the task is still
blocked, but I'm not sure it could recurse...

> 
> > Ah, an alternative is to make rdp::defer_qs_pending a field in union rcu_special,
> > which, sadly, would need to be expanded to a u64.
> 
> I was thinking the most robust approach is something like the following. We
> _have_ to go through the node tree to report a QS; once we "defer the QS",
> there is no other way out, so that path is guaranteed to be taken in order
> to end the GP. So just unconditionally clear the flag there and at all such
> places, something like the following, which passes light rcutorture on all
> scenarios.
> 
> Once we issue an IRQ work or raise a softirq, we don't need to do that again
> for the same CPU until the GP ends.
> 
> (EDIT: actually rcu_disable_urgency_upon_qs() or its call sites might be the
> right place, since it is present at (almost?) all the call sites from which
> we report QS up the node tree.)
> 
> Thoughts? I need to double-check whether IRQ work could ever be required more
> than once during the same GP on the same CPU. I don't think so, though.
> 
> ---8<-----------------------
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index b7c818cabe44..81c3af5d1f67 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -729,6 +729,12 @@ static void rcu_disable_urgency_upon_qs(struct rcu_data *rdp)
>  	}
>  }
>  
> +static void rcu_defer_qs_clear_pending(struct rcu_data *rdp)
> +{
> +	if (READ_ONCE(rdp->defer_qs_pending) == DEFER_QS_PENDING)
> +		WRITE_ONCE(rdp->defer_qs_pending, DEFER_QS_IDLE);
> +}
> +
>  /**
>   * rcu_is_watching - RCU read-side critical sections permitted on current CPU?
>   *
> @@ -2483,6 +2490,8 @@ rcu_report_qs_rdp(struct rcu_data *rdp)
>  		}
>  
>  		rcu_disable_urgency_upon_qs(rdp);
> +		rcu_defer_qs_clear_pending(rdp);
> +
>  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>  		/* ^^^ Released rnp->lock */
>  	}
> @@ -2767,6 +2776,12 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
>  			if (ret > 0) {
>  				mask |= rdp->grpmask;
>  				rcu_disable_urgency_upon_qs(rdp);
> +				/*
> +				 * Clear any stale defer_qs_pending for idle/offline
> +				 * CPUs reporting QS. This can happen if a CPU went
> +				 * idle after raising softirq but before it ran.
> +				 */
> +				rcu_defer_qs_clear_pending(rdp);
>  			}
>  			if (ret < 0)
>  				rsmask |= rdp->grpmask;
> @@ -4373,6 +4388,7 @@ void rcutree_report_cpu_starting(unsigned int cpu)
>  
>  		local_irq_save(flags);
>  		rcu_disable_urgency_upon_qs(rdp);
> +		rcu_defer_qs_clear_pending(rdp);
>  		/* Report QS -after- changing ->qsmaskinitnext! */
>  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>  	} else {
> @@ -4432,6 +4448,7 @@ void rcutree_report_cpu_dead(void)
>  	if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
>  		/* Report quiescent state -before- changing ->qsmaskinitnext! */
>  		rcu_disable_urgency_upon_qs(rdp);
> +		rcu_defer_qs_clear_pending(rdp);
>  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>  		raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  	}
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 96c49c56fc14..7f2af0e45883 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -272,6 +272,10 @@ static void rcu_report_exp_rdp(struct rcu_data *rdp)
>  	raw_spin_lock_irqsave_rcu_node(rnp, flags);
>  	WRITE_ONCE(rdp->cpu_no_qs.b.exp, false);
>  	ASSERT_EXCLUSIVE_WRITER(rdp->cpu_no_qs.b.exp);
> +
> +	/* Expedited QS reported. TODO: what happens if we deferred both exp and normal QS (and viceversa for the other callsites)? */
> +	rcu_defer_qs_clear_pending(rdp);
> +
>  	rcu_report_exp_cpu_mult(rnp, flags, rdp->grpmask, true);
>  }
>  
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index 6c86c7b96c63..d706daea021f 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -487,8 +487,6 @@ rcu_preempt_deferred_qs_irqrestore(struct task_struct *t, unsigned long flags)
>  	union rcu_special special;
>  
>  	rdp = this_cpu_ptr(&rcu_data);
> -	if (rdp->defer_qs_pending == DEFER_QS_PENDING)
> -		rdp->defer_qs_pending = DEFER_QS_IDLE;
>  
>  	/*
>  	 * If RCU core is waiting for this CPU to exit its critical section,
>

Looks like a possible direction.

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
