Re: possible deadlock in smp_call_function_many_cond


From: Thomas Gleixner <tglx@linutronix.de>
To: 白烁冉 <baishuoran@hrbeu.edu.cn>, "Peter Zijlstra" <peterz@infradead.org>
Cc: Kun Hu <huk23@m.fudan.edu.cn>, Jiaji Qin <jjtan24@m.fudan.edu.cn>,
	linux-kernel@vger.kernel.org
Subject: Re: possible deadlock in smp_call_function_many_cond
Date: Sun, 20 Jul 2025 21:38:00 +0200
Message-ID: <877c02vejr.ffs@tglx>
In-Reply-To: <758991c1.13f67.197f9cccf9b.Coremail.baishuoran@hrbeu.edu.cn>

On Fri, Jul 11 2025 at 22:03, 白烁冉 wrote:
> When using our customized Syzkaller to fuzz the latest Linux kernel,
> the following crash (the 122nd) was triggered.
>
> HEAD commit: 6537cfb395f352782918d8ee7b7f10ba2cc3cbf2
> git tree: upstream

That's not the latest kernel.

> Output: https://github.com/pghk13/Kernel-Bug/blob/main/0702_6.14/INFO%3A%20rcu%20detected%20stall%20in%20sys_select/122report.txt
> Kernel config: https://github.com/pghk13/Kernel-Bug/blob/main/0305_6.14rc3/config.txt
> C reproducer: https://github.com/pghk13/Kernel-Bug/blob/main/0702_6.14/INFO%3A%20rcu%20detected%20stall%20in%20sys_select/122repro.c
> Syzlang reproducer: https://github.com/pghk13/Kernel-Bug/blob/main/0702_6.14/INFO%3A%20rcu%20detected%20stall%20in%20sys_select/122repro.txt
>
> Our reproducer mounts a constructed filesystem image.
>
> The error occurred around line 880 of the code, specifically during
> the call to csd_lock_wait. The status of CPU 1 (RCU GP kthread):
> executing the perf_event_open system call, needs to update tracepoint

I can't find a perf_event_open() syscall in the C reproducer. So how is
that supposed to be reproduced?

> calls on all CPUs, and smp_call_function_many_cond is stuck waiting
> for CPU 2 to respond to the IPI.  We have reproduced this issue
> several times on 6.14 again.

Again, not the latest kernel. Please run it against Linus' latest tree
and, if it still triggers, provide proper information on how to
reproduce it. If it does not, you should be able to bisect the fix.

> rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> rcu: 	2-...!: (3 GPs behind) idle=b834/1/0x4000000000000000 softirq=23574/23574 fqs=5
> rcu: 	(detected by 1, t=10502 jiffies, g=19957, q=594 ncpus=4)

So CPU 1 detects an RCU stall on CPU 2

> Sending NMI from CPU 1 to CPUs 2:
> NMI backtrace for cpu 2
> CPU: 2 UID: 0 PID: 9461 Comm: sshd Not tainted 6.14.0 #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> RIP: 0010:__lock_acquire+0x106/0x46b0
> Code: ff df 4c 89 ea 48 c1 ea 03 80 3c 02 00 0f 85 ec 35 00 00 49 8b 45 00 48 3d a0 c7 8a 93 0f 84 29 0f 00 00 44 8b 05 2a dc 74 0c <45> 85 c0 0f 84 ad 06 00 00 48 3d e0 c7 8a 93 0f 84 a1 06 00 00 41
> RSP: 0018:ffffc90000568ac8 EFLAGS: 00000002
> RAX: ffffffff9aab9a20 RBX: 0000000000000000 RCX: 1ffff920000ad16c
> RDX: 1ffffffff35692cf RSI: 0000000000000000 RDI: ffffffff9ab49678
> RBP: ffff8880201aa480 R08: 0000000000000001 R09: 0000000000000001
> R10: 0000000000000001 R11: ffffffff90617d17 R12: 0000000000000000
> R13: ffffffff9ab49678 R14: 0000000000000000 R15: 0000000000000000
> FS:  00007fa644657900(0000) GS:ffff88802b900000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f0fa92178a9 CR3: 0000000000e90000 CR4: 0000000000750ef0
> PKRU: 55555554
> Call Trace:
>  <NMI>
>  </NMI>
>  <IRQ>
>  lock_acquire+0x1b6/0x570
>  _raw_spin_lock_irqsave+0x3d/0x60
>  debug_object_deactivate+0x139/0x390
>  __hrtimer_run_queues+0x416/0xc30
>  hrtimer_interrupt+0x398/0x890
>  __sysvec_apic_timer_interrupt+0x114/0x400
>  sysvec_apic_timer_interrupt+0xa3/0xc0

which handles the timer interrupt. What you cut off in your report is:

[  321.491987][    C2] hrtimer: interrupt took 31336677795 ns

That means the hrtimer interrupt got stuck for more than 31 seconds
(!!!). That warning is only emitted once, so I assume there is something
weird going on with hrtimers and one of their callbacks. But there is no
indication where this comes from.
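
For scale, here is a minimal sketch that converts the nanosecond value in
that warning into seconds. The helper name hrtimer_interrupt_seconds() is
made up for illustration and is not part of any kernel tooling; the input
line is the one quoted above.

```python
import re

# The warning line quoted above, copied verbatim from the report.
LINE = "[  321.491987][    C2] hrtimer: interrupt took 31336677795 ns"


def hrtimer_interrupt_seconds(line: str) -> float:
    """Return the duration from an 'hrtimer: interrupt took N ns' line, in seconds."""
    match = re.search(r"hrtimer: interrupt took (\d+) ns", line)
    if match is None:
        raise ValueError("not an hrtimer interrupt warning")
    # 1 second = 1_000_000_000 nanoseconds
    return int(match.group(1)) / 1_000_000_000


print(f"{hrtimer_interrupt_seconds(LINE):.1f} s")  # prints "31.3 s"
```

A single hrtimer interrupt normally completes in microseconds, so a value
of this size points at the interrupt handler or one of the timer
callbacks not returning.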

Can you enable the hrtimer_expire_entry/exit tracepoints on the kernel
command line and add 'ftrace_dump_on_oops' as well, so that the trace
gets dumped with the rcu stall splat?
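
As a rough sketch of what that could look like, the snippet below
assembles the boot parameters. It assumes the two tracepoints are
addressed through the "timer" event subsystem and enabled with the
trace_event= parameter; the helper name trace_cmdline() is hypothetical.
How the resulting string gets appended to the kernel command line depends
on the boot setup and is not shown.

```python
# Tracepoints named in the request above; "timer" is assumed to be the
# event subsystem they are registered under.
EVENTS = ["timer:hrtimer_expire_entry", "timer:hrtimer_expire_exit"]


def trace_cmdline(events):
    """Build boot parameters that enable the events and dump ftrace on oops."""
    return f"trace_event={','.join(events)} ftrace_dump_on_oops"


# Prints the fragment to append to the existing kernel command line.
print(trace_cmdline(EVENTS))
```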

Thanks,

        tglx
