public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: akpm@linux-foundation.org, brauner@kernel.org,
	linux-kernel@vger.kernel.org, joel.granados@kernel.org,
	kernel-team@meta.com, oleg@redhat.com
Subject: Re: [PATCH] exit: skip IRQ disabled warning during power off
Date: Fri, 4 Apr 2025 08:08:02 -0700	[thread overview]
Message-ID: <Z+/10vJb8YUVVxzg@gmail.com> (raw)
In-Reply-To: <CAGudoHGYLB0GC2x3hr8aX835dQrQnJ5u0Si=Pw35=c_fjVC72A@mail.gmail.com>

On Fri, Apr 04, 2025 at 04:20:53PM +0200, Mateusz Guzik wrote:
> On Fri, Apr 4, 2025 at 2:52 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Mateusz,
> >
> >
> > On Fri, Apr 04, 2025 at 07:40:45AM +0200, Mateusz Guzik wrote:
> > > On Thu, Apr 3, 2025 at 8:01 PM Breno Leitao <leitao@debian.org> wrote:
> > > >
> > > > When the system is shutting down due to pid 1 exiting, which is common
> > > > on virtual machines, a warning message is printed.
> > > >
> > > >         WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0
> > > >
> > > > This occurs because do_exit() is called after kernel_power_off(), which
> > > > disables interrupts. native_machine_shutdown() expliclty disable
> > > > interrupt to avoid receiving the timer interrupt, forcing scheduler load
> > > > balance during the power off phase.
> > > >
> > > > This is the simplified code path:
> > > >
> > > >         kernel_power_off()
> > > >           - native_machine_shutdown()
> > > >                 - local_irq_disable()
> > > >         do_exit()
> > > >
> > > > Modify the warning condition in do_exit() to only trigger the warning if
> > > > the system is not powering off, since it is expected to have the irq
> > > > disabled in that case.
> > > >
> > > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > > ---
> > > >  kernel/exit.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > > index 3485e5fc499e4..97ec4f8bfd98f 100644
> > > > --- a/kernel/exit.c
> > > > +++ b/kernel/exit.c
> > > > @@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
> > > >         struct task_struct *tsk = current;
> > > >         int group_dead;
> > > >
> > > > -       WARN_ON(irqs_disabled());
> > > > +       WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
> > > >
> > > >         synchronize_group_exit(tsk, code);
> > > >
> > > >
> > >
> > > Can you share the backtrace?
> >
> > Sure. Here is the decoded stack from the the latest net-next
> > 0907e7fb35756  ("Add linux-next specific files for 20250117")
> >
> >         [  254.466712] ACPI: PM: Preparing to enter system sleep state S5
> >         [  254.474273] reboot: Power down
> >         [  254.479332] ------------[ cut here ]------------
> >         [  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
> >         [  254.480597] Modules linked in: evdev(E) serio_raw(E) button(E) virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
> >         [  254.483163] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
> >         [  254.483736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >         [  254.484912] RIP: 0010:do_exit (kernel/exit.c:881)
> >         [ 254.485348] Code: 00 00 45 31 f6 f7 c3 00 02 00 00 41 0f 94 c6 48 c7 c7 48 8b 5d 87 44 89 f6 31 d2 31 c9 e8 c8 c5 41 00 f7 c3 00 02 00 00 75 02 <0f> 0b 48 c7 c7 78 8b 5d 87 44 89 f6 31 d2 31 c9 e8 ab c5 41 00 48
> >         All code
> >         ========
> >         0:   00 00                   add    %al,(%rax)
> >         2:   45 31 f6                xor    %r14d,%r14d
> >         5:   f7 c3 00 02 00 00       test   $0x200,%ebx
> >         b:   41 0f 94 c6             sete   %r14b
> >         f:   48 c7 c7 48 8b 5d 87    mov    $0xffffffff875d8b48,%rdi
> >         16:   44 89 f6                mov    %r14d,%esi
> >         19:   31 d2                   xor    %edx,%edx
> >         1b:   31 c9                   xor    %ecx,%ecx
> >         1d:   e8 c8 c5 41 00          call   0x41c5ea
> >         22:   f7 c3 00 02 00 00       test   $0x200,%ebx
> >         28:   75 02                   jne    0x2c
> >         2a:*  0f 0b                   ud2             <-- trapping instruction
> >         2c:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
> >         33:   44 89 f6                mov    %r14d,%esi
> >         36:   31 d2                   xor    %edx,%edx
> >         38:   31 c9                   xor    %ecx,%ecx
> >         3a:   e8 ab c5 41 00          call   0x41c5ea
> >         3f:   48                      rex.W
> >
> >         Code starting with the faulting instruction
> >         ===========================================
> >         0:   0f 0b                   ud2
> >         2:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
> >         9:   44 89 f6                mov    %r14d,%esi
> >         c:   31 d2                   xor    %edx,%edx
> >         e:   31 c9                   xor    %ecx,%ecx
> >         10:   e8 ab c5 41 00          call   0x41c5c0
> >         15:   48                      rex.W
> >         [  254.487377] RSP: 0018:ffa000000001fb80 EFLAGS: 00010046
> >         [  254.487947] RAX: 49739800a6a6bf00 RBX: 0000000000000016 RCX: 0000000000000000
> >         [  254.488735] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff875d8b48
> >         [  254.489545] RBP: ffa000000001fd10 R08: dffffc0000000000 R09: 1ffffffff13312b6
> >         [  254.490406] R10: dffffc0000000000 R11: fffffbfff13312b7 R12: 000000004321fedc
> >         [  254.491330] R13: dffffc0000000000 R14: 0000000000000001 R15: dffffc0000000000
> >         [  254.492200] FS:  00007f658bad6780(0000) GS:ff110004c6000000(0000) knlGS:0000000000000000
> >         [  254.493005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >         [  254.493667] CR2: 00007f7769811000 CR3: 000000010eebe004 CR4: 0000000000771ef0
> >         [  254.494555] PKRU: 55555554
> >         [  254.495006] Call Trace:
> >         [  254.495305]  <TASK>
> >         [  254.495609] ? __warn (kernel/panic.c:242 kernel/panic.c:748)
> >         [  254.496046] ? do_exit (kernel/exit.c:881)
> >         [  254.496675] ? do_exit (kernel/exit.c:881)
> >         [  254.497124] ? report_bug (lib/bug.c:? lib/bug.c:219)
> >         [  254.497564] ? handle_bug (arch/x86/kernel/traps.c:285)
> >         [  254.497984] ? exc_invalid_op (arch/x86/kernel/traps.c:309)
> >         [  254.498394] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
> >         [  254.498874] ? do_exit (kernel/exit.c:881)
> >         [  254.499355] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.499892] ? __rcu_read_unlock (kernel/rcu/tree_plugin.h:445)
> >         [  254.500311] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
> >         [  254.500864] ? __pfx_do_exit (kernel/exit.c:877)
> >         [  254.501302] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.501825] ? native_machine_shutdown (arch/x86/kernel/reboot.c:765)
> >         [  254.502380] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
> >         [  254.502984] __x64_sys_reboot (kernel/reboot.c:?)
> >         [  254.503407] ? __pfx___x64_sys_reboot (kernel/reboot.c:722)
> >         [  254.503952] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.504527] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.505067] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.505599] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.506159] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.506712] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.507263] ? do_syscall_64 (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./include/linux/entry-common.h:198 arch/x86/entry/common.c:79)
> >         [  254.507865] do_syscall_64 (arch/x86/entry/common.c:83)
> >         [  254.508317] ? exc_page_fault (arch/x86/mm/fault.c:1542)
> >         [  254.508781] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> >         [  254.509315] RIP: 0033:0x7f658b904a27
> >         [ 254.509879] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
> >         All code
> >         ========
> >         0:   64 89 01                mov    %eax,%fs:(%rcx)
> >         3:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
> >         7:   c3                      ret
> >         8:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
> >         f:   00 00 00
> >         12:   90                      nop
> >         13:   f3 0f 1e fa             endbr64
> >         17:   89 fa                   mov    %edi,%edx
> >         19:   be 69 19 12 28          mov    $0x28121969,%esi
> >         1e:   bf ad de e1 fe          mov    $0xfee1dead,%edi
> >         23:   b8 a9 00 00 00          mov    $0xa9,%eax
> >         28:   0f 05                   syscall
> >         2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
> >         30:   77 01                   ja     0x33
> >         32:   c3                      ret
> >         33:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43fb
> >         3a:   f7 d8                   neg    %eax
> >         3c:   64 89 02                mov    %eax,%fs:(%rdx)
> >         3f:   b8                      .byte 0xb8
> >
> >         Code starting with the faulting instruction
> >         ===========================================
> >         0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
> >         6:   77 01                   ja     0x9
> >         8:   c3                      ret
> >         9:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43d1
> >         10:   f7 d8                   neg    %eax
> >         12:   64 89 02                mov    %eax,%fs:(%rdx)
> >         15:   b8                      .byte 0xb8
> >         [  254.511720] RSP: 002b:00007ffec6b608c8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
> >         [  254.512535] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f658b904a27
> >         [  254.513332] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
> >         [  254.514243] RBP: 000000000000000a R08: 000055b7f5e58690 R09: 0000000000000000
> >         [  254.515058] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
> >         [  254.515833] R13: 00007ffec6b60c88 R14: 000055b7f5e3828c R15: 000000000000000a
> >         [  254.516702]  </TASK>
> >         [  254.517021] irq event stamp: 1438412
> >         [  254.517510] hardirqs last enabled at (1438411): _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> >         [  254.518554] hardirqs last disabled at (1438412): native_machine_shutdown (arch/x86/kernel/reboot.c:?)
> >         [  254.519466] softirqs last enabled at (1438324): handle_softirqs (./arch/x86/include/asm/preempt.h:26 kernel/softirq.c:408 kernel/softirq.c:589)
> >         [  254.520398] softirqs last disabled at (1438303): __irq_exit_rcu (./arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:664)
> >
> >
> > > Note first thing synchronize_group_exit() is going to do is cycle
> > > through an irq-protected lock, so by the time it unlocks irqs are
> > > enabled again.
> >
> > When pid=1 is being killed, then synchronize_group_exit() will be called
> > with irq enabled (as shown by the warning above), and
> > synchronize_group_exit()->spin_unlock_irq() will restore the interrupt
> > (once it got disabled in spin_lock_irq() pair).
> >
> > On the other side, if irqs are disabled when synchronize_group_exit() is
> > called, then synchronize_group_exit->spin_unlock_irq() will not enable
> > the interrupts, right?
> 
> It will enable interrupts. spin_lock_irq() blindly disables them, but
> in your case they are already off and this bit is a nop.
> spin_unlock_irq() blindly enables them and this is where they are back
> regardless of the initial state.
> 
> Thus even with the buggy caller interrupts are enabled after
> synchronize_group_exit() returns.

In fact this WARN_ON(irqs_disabled()); doesn't mean much, since it
doesn't matter what is the interrupt state just before
synchronize_group_exit, becaue it will disabled it, and enable it
anyway.

That means that WARN_ON() is useless(!?)

> I think this is the only reason there are no further warnings about
> the state when proceeding with exit.
> 
> Preferably this would be fixed so that nobody calls here with irqs off.

Does it really matter what is the interrupt state when calling do_exit()
since it will be enabled anyway?

Thanks!
--breno

  reply	other threads:[~2025-04-04 15:08 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-03 18:01 [PATCH] exit: skip IRQ disabled warning during power off Breno Leitao
2025-04-04  5:40 ` Mateusz Guzik
2025-04-04 12:51   ` Breno Leitao
2025-04-04 14:16     ` Oleg Nesterov
2025-04-04 15:14       ` Breno Leitao
2025-04-04 15:31         ` Oleg Nesterov
2025-05-11  4:43           ` Andrew Morton
2025-04-04 14:20     ` Mateusz Guzik
2025-04-04 15:08       ` Breno Leitao [this message]
2025-04-04 15:24       ` Oleg Nesterov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z+/10vJb8YUVVxzg@gmail.com \
    --to=leitao@debian.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=joel.granados@kernel.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mjguzik@gmail.com \
    --cc=oleg@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox