All of lore.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: David Woodhouse <dwmw2@infradead.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Jason Wang <jasowang@redhat.com>
Cc: "x86@kernel.org" <x86@kernel.org>, hpa <hpa@zytor.com>,
	dyoung <dyoung@redhat.com>, kexec <kexec@lists.infradead.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	eperezma <eperezma@redhat.com>,
	Paolo Bonzini <bonzini@redhat.com>,
	ming.lei@redhat.com, Petr Mladek <pmladek@suse.com>,
	John Ogness <jogness@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
Date: Fri, 13 Dec 2024 01:14:54 +0100	[thread overview]
Message-ID: <874j38a16p.ffs@tglx> (raw)
In-Reply-To: <dd06cd643ee7fa0be08ac3082cff443b8bfbfb58.camel@infradead.org>

On Thu, Dec 12 2024 at 19:19, David Woodhouse wrote:
> On Thu, 2024-12-12 at 19:04 +0100, Thomas Gleixner wrote:
> Build current master (231825b2e1ff here). The config I'm using is at
> http://david.woodhou.se/config-x86-kjump-irqs although I don't think
> there's anything special other than CONFIG_KEXEC_JUMP and enough
> lockdep to trigger the complaints.
>
> I've been running in qemu with the test case shoved into an initrd for
> faster testing, but it works just as well done manually. If it matters,
> the QEMU command line on my Haswell box is
>
>  qemu-system-x86_64 -accel kvm,kernel-irqchip=split -display none \
>    -serial mon:stdio -kernel arch/x86/boot/bzImage -smp 2 -m 2g \
>    -append "console=ttyS0 root=/dev/vda1 no_console_suspend earlyprintk=serial" \
>    -drive file=/var/lib/libvirt/images/fedora.qcow2,if=virtio \
>    -cpu host --no-reboot -nic user,model=virtio 
>
> Probably the only important part of that is the no_console_suspend.

I misread that and somehow thought it does _not_ happen with
no_console_suspend. Duh!

So that made it reproduce and after adding some more lockdep hacks here
is the culprit:

[   15.177945] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
[   15.178980] CPU: 0 UID: 0 PID: 410 Comm: k.1 Not tainted 6.13.0-rc2+ #118
[   15.185613] Call Trace:
[   15.185760]  <TASK>
[   15.187678]  __cond_resched+0x21/0x60
[   15.187923]  down_timeout+0x18/0x60
[   15.188159]  acpi_os_wait_semaphore+0x4c/0x80
[   15.188424]  acpi_ut_acquire_mutex+0x3d/0x100
[   15.188665]  acpi_ns_get_node+0x27/0x60
[   15.188879]  acpi_ns_evaluate+0x1cb/0x2d0
[   15.189106]  acpi_rs_set_srs_method_data+0x156/0x190
[   15.189382]  acpi_pci_link_set+0x11c/0x290
[   15.189618]  irqrouter_resume+0x54/0x60
[   15.189826]  syscore_resume+0x6a/0x200
[   15.190033]  kernel_kexec+0x145/0x1c0
[   15.190239]  __do_sys_reboot+0xeb/0x240
[   15.190561]  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a wakeup and the scheduler
sets the NEED_RESCHED flag.

cond_resched() sees it set and invokes schedule() from a completely bogus
context.

Nothing knowns about the current system state and therefore happily
assumes that everything is cool and shiny.

I also found that kexec jump fails to set system_state = SYSTEM_SUSPEND
before syscore_suspend() and back to SYSTEM_RUNNING after
syscore_resume().  See suspend/hibernate...

But setting this does not solve the problem because NEED_RESCHED still
gets set and cond_resched() is oblivious of the state.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operation
it's just a question of time.

And of course then you need a code path which is invoked after that,
which actually invokes cond_resched().

Adding more debug, there are indeed a couple of ways to set NEED_RESCHED
between invoking syscore_suspend() and returning from syscore_resume().
Some stuff schedules work [un]conditonally.

So the real question is how to deal with the general problem, which
obviously affects suspend as well.

The only thing I came up with so far is the hack below. It cures the
problem for PREEMPT_NONE and PREEMPT_VOLUNTARY. PREEMPT_FULL is not
affected because preemptible() checks whether the preemption count is
zero and interrupts are enabled and cond_resched() not active.

We could check for interrupts enabled in cond_resched() too, but that's
not pretty either.

With that applied the problem goes away, but after a lot of repetitions
of the reproducer in a tight loop the whole machinery stops dead:

[   29.913179] Disabling non-boot CPUs ...
[   29.930328] smpboot: CPU 1 is now offline
[   29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[   29.940588] Enabling non-boot CPUs ...
[   29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[   29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   29.942654] CPU1 is up
[   29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   29.948556] OOM killer enabled.
[   29.948750] Restarting tasks ... done.
Success
[   29.960044] Freezing user space processes
[   29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
[   29.961861] OOM killer disabled.
[   30.102485] ata2: found unknown device (class 0)
[   30.107387] Disabling non-boot CPUs ...

That happens without 'no_console_suspend' on the command line as
well, but that's for tomorrow ...

Thanks,

        tglx
---
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1025,6 +1025,7 @@ int kernel_kexec(void)
 		if (error)
 			goto Enable_cpus;
 		local_irq_disable();
+		system_state = SYSTEM_SUSPEND;
 		error = syscore_suspend();
 		if (error)
 			goto Enable_irqs;
@@ -1054,6 +1055,7 @@ int kernel_kexec(void)
 	if (kexec_image->preserve_context) {
 		syscore_resume();
  Enable_irqs:
+		system_state = SYSTEM_RUNNING;
 		local_irq_enable();
  Enable_cpus:
 		suspend_enable_secondary_cpus();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
 #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
 int __sched __cond_resched(void)
 {
-	if (should_resched(0)) {
+	if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
 		preempt_schedule_common();
 		return 1;
 	}


  reply	other threads:[~2024-12-13  0:15 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-09 14:28 Lockdep warnings on kexec (virtio_blk, hrtimers) David Woodhouse
2024-12-10  1:56 ` Jason Wang
2024-12-11 12:42   ` Stefan Hajnoczi
2024-12-12 11:07     ` David Woodhouse
2024-12-12 13:34       ` Thomas Gleixner
2024-12-12 13:46         ` David Woodhouse
2024-12-12 18:04           ` Thomas Gleixner
2024-12-12 19:19             ` David Woodhouse
2024-12-13  0:14               ` Thomas Gleixner [this message]
2024-12-13  9:31                 ` David Woodhouse
2024-12-13  9:43                   ` David Woodhouse
2024-12-13 10:42                     ` Thomas Gleixner
2024-12-13 11:09                       ` Ming Lei
2024-12-13 11:31                         ` Thomas Gleixner
2024-12-13 11:48                           ` Ming Lei
2024-12-13 13:23                             ` Thomas Gleixner
2024-12-13 14:07                               ` David Woodhouse
2024-12-13 17:05                                 ` Thomas Gleixner
2024-12-13 17:17                                   ` David Woodhouse
2024-12-13 17:48                                     ` Rafael J. Wysocki
2024-12-13 17:32                                   ` Rafael J. Wysocki
2024-12-13 19:06                                     ` Rafael J. Wysocki
2024-12-13 20:16                                       ` David Woodhouse
2024-12-14  9:57                                         ` David Woodhouse
2024-12-16 12:14                                           ` Rafael J. Wysocki
2024-12-13 17:59                                   ` Rafael J. Wysocki
2024-12-13 13:17                           ` David Woodhouse
2024-12-13 11:12                       ` David Woodhouse
2024-12-13 11:33                         ` Ming Lei
2024-12-13 11:20                 ` Peter Zijlstra
2024-12-13 13:13                   ` Thomas Gleixner
2024-12-16 13:20                     ` [PATCH] sched: Prevent rescheduling when interrupts are disabled Thomas Gleixner
2024-12-16 17:41                       ` David Woodhouse
2024-12-12 11:12     ` Lockdep warnings on kexec (virtio_blk, hrtimers) Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874j38a16p.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=bonzini@redhat.com \
    --cc=dwmw2@infradead.org \
    --cc=dyoung@redhat.com \
    --cc=eperezma@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jasowang@redhat.com \
    --cc=jogness@linutronix.de \
    --cc=kexec@lists.infradead.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=mst@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=sgarzare@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.