public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: Thomas Gleixner <tglx@linutronix.de>
To: David Woodhouse <dwmw2@infradead.org>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Jason Wang <jasowang@redhat.com>
Cc: "x86@kernel.org" <x86@kernel.org>, hpa <hpa@zytor.com>,
	dyoung <dyoung@redhat.com>, kexec <kexec@lists.infradead.org>,
	linux-ext4 <linux-ext4@vger.kernel.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Stefano Garzarella <sgarzare@redhat.com>,
	eperezma <eperezma@redhat.com>,
	Paolo Bonzini <bonzini@redhat.com>,
	ming.lei@redhat.com, Petr Mladek <pmladek@suse.com>,
	John Ogness <jogness@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>
Subject: Re: Lockdep warnings on kexec (virtio_blk, hrtimers)
Date: Fri, 13 Dec 2024 01:14:54 +0100	[thread overview]
Message-ID: <874j38a16p.ffs@tglx> (raw)
In-Reply-To: <dd06cd643ee7fa0be08ac3082cff443b8bfbfb58.camel@infradead.org>

On Thu, Dec 12 2024 at 19:19, David Woodhouse wrote:
> On Thu, 2024-12-12 at 19:04 +0100, Thomas Gleixner wrote:
> Build current master (231825b2e1ff here). The config I'm using is at
> http://david.woodhou.se/config-x86-kjump-irqs although I don't think
> there's anything special other than CONFIG_KEXEC_JUMP and enough
> lockdep to trigger the complaints.
>
> I've been running in qemu with the test case shoved into an initrd for
> faster testing, but it works just as well done manually. If it matters,
> the QEMU command line on my Haswell box is
>
>  qemu-system-x86_64 -accel kvm,kernel-irqchip=split -display none \
>    -serial mon:stdio -kernel arch/x86/boot/bzImage -smp 2 -m 2g \
>    -append "console=ttyS0 root=/dev/vda1 no_console_suspend earlyprintk=serial" \
>    -drive file=/var/lib/libvirt/images/fedora.qcow2,if=virtio \
>    -cpu host --no-reboot -nic user,model=virtio 
>
> Probably the only important part of that is the no_console_suspend.

I misread that and somehow thought it does _not_ happen with
no_console_suspend. Duh!

So that made it reproduce and after adding some more lockdep hacks here
is the culprit:

[   15.177945] DEBUG_LOCKS_WARN_ON(suspend_syscore_active)
[   15.178980] CPU: 0 UID: 0 PID: 410 Comm: k.1 Not tainted 6.13.0-rc2+ #118
[   15.185613] Call Trace:
[   15.185760]  <TASK>
[   15.187678]  __cond_resched+0x21/0x60
[   15.187923]  down_timeout+0x18/0x60
[   15.188159]  acpi_os_wait_semaphore+0x4c/0x80
[   15.188424]  acpi_ut_acquire_mutex+0x3d/0x100
[   15.188665]  acpi_ns_get_node+0x27/0x60
[   15.188879]  acpi_ns_evaluate+0x1cb/0x2d0
[   15.189106]  acpi_rs_set_srs_method_data+0x156/0x190
[   15.189382]  acpi_pci_link_set+0x11c/0x290
[   15.189618]  irqrouter_resume+0x54/0x60
[   15.189826]  syscore_resume+0x6a/0x200
[   15.190033]  kernel_kexec+0x145/0x1c0
[   15.190239]  __do_sys_reboot+0xeb/0x240
[   15.190561]  do_syscall_64+0x95/0x180

This is a long standing problem, which probably got more visible with
the recent printk changes. Something does a wakeup and the scheduler
sets the NEED_RESCHED flag.

cond_resched() sees it set and invokes schedule() from a completely bogus
context.

Nothing knowns about the current system state and therefore happily
assumes that everything is cool and shiny.

I also found that kexec jump fails to set system_state = SYSTEM_SUSPEND
before syscore_suspend() and back to SYSTEM_RUNNING after
syscore_resume().  See suspend/hibernate...

But setting this does not solve the problem because NEED_RESCHED still
gets set and cond_resched() is oblivious of the state.

Quite some of the code paths in syscore_suspend()/resume() can result in
triggering a wakeup with the exactly same consequences. They might not
have done so yet, but as they share a lot of code with normal operation
it's just a question of time.

And of course then you need a code path which is invoked after that,
which actually invokes cond_resched().

Adding more debug, there are indeed a couple of ways to set NEED_RESCHED
between invoking syscore_suspend() and returning from syscore_resume().
Some stuff schedules work [un]conditonally.

So the real question is how to deal with the general problem, which
obviously affects suspend as well.

The only thing I came up with so far is the hack below. It cures the
problem for PREEMPT_NONE and PREEMPT_VOLUNTARY. PREEMPT_FULL is not
affected because preemptible() checks whether the preemption count is
zero and interrupts are enabled and cond_resched() not active.

We could check for interrupts enabled in cond_resched() too, but that's
not pretty either.

With that applied the problem goes away, but after a lot of repetitions
of the reproducer in a tight loop the whole machinery stops dead:

[   29.913179] Disabling non-boot CPUs ...
[   29.930328] smpboot: CPU 1 is now offline
[   29.930593] crash hp: kexec_trylock() failed, kdump image may be inaccurate
B[   29.940588] Enabling non-boot CPUs ...
[   29.940856] crash hp: kexec_trylock() failed, kdump image may be inaccurate
[   29.941242] smpboot: Booting Node 0 Processor 1 APIC 0x1
[   29.942654] CPU1 is up
[   29.945856] virtio_blk virtio1: 2/0/0 default/read/poll queues
[   29.948556] OOM killer enabled.
[   29.948750] Restarting tasks ... done.
Success
[   29.960044] Freezing user space processes
[   29.961447] Freezing user space processes completed (elapsed 0.001 seconds)
[   29.961861] OOM killer disabled.
[   30.102485] ata2: found unknown device (class 0)
[   30.107387] Disabling non-boot CPUs ...

That happens without 'no_console_suspend' on the command line as
well, but that's for tomorrow ...

Thanks,

        tglx
---
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1025,6 +1025,7 @@ int kernel_kexec(void)
 		if (error)
 			goto Enable_cpus;
 		local_irq_disable();
+		system_state = SYSTEM_SUSPEND;
 		error = syscore_suspend();
 		if (error)
 			goto Enable_irqs;
@@ -1054,6 +1055,7 @@ int kernel_kexec(void)
 	if (kexec_image->preserve_context) {
 		syscore_resume();
  Enable_irqs:
+		system_state = SYSTEM_RUNNING;
 		local_irq_enable();
  Enable_cpus:
 		suspend_enable_secondary_cpus();
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7276,7 +7276,7 @@ void rt_mutex_setprio(struct task_struct
 #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC)
 int __sched __cond_resched(void)
 {
-	if (should_resched(0)) {
+	if (should_resched(0) && system_state != SYSTEM_SUSPEND) {
 		preempt_schedule_common();
 		return 1;
 	}

  reply	other threads:[~2024-12-13  0:14 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-09 14:28 Lockdep warnings on kexec (virtio_blk, hrtimers) David Woodhouse
2024-12-10  1:56 ` Jason Wang
2024-12-11 12:42   ` Stefan Hajnoczi
2024-12-12 11:07     ` David Woodhouse
2024-12-12 13:34       ` Thomas Gleixner
2024-12-12 13:46         ` David Woodhouse
2024-12-12 18:04           ` Thomas Gleixner
2024-12-12 19:19             ` David Woodhouse
2024-12-13  0:14               ` Thomas Gleixner [this message]
2024-12-13  9:31                 ` David Woodhouse
2024-12-13  9:43                   ` David Woodhouse
2024-12-13 10:42                     ` Thomas Gleixner
2024-12-13 11:09                       ` Ming Lei
2024-12-13 11:31                         ` Thomas Gleixner
2024-12-13 11:48                           ` Ming Lei
2024-12-13 13:23                             ` Thomas Gleixner
2024-12-13 14:07                               ` David Woodhouse
2024-12-13 17:05                                 ` Thomas Gleixner
2024-12-13 17:17                                   ` David Woodhouse
2024-12-13 17:48                                     ` Rafael J. Wysocki
2024-12-13 17:32                                   ` Rafael J. Wysocki
2024-12-13 19:06                                     ` Rafael J. Wysocki
2024-12-13 20:16                                       ` David Woodhouse
2024-12-14  9:57                                         ` David Woodhouse
2024-12-16 12:14                                           ` Rafael J. Wysocki
2024-12-13 17:59                                   ` Rafael J. Wysocki
2024-12-13 13:17                           ` David Woodhouse
2024-12-13 11:12                       ` David Woodhouse
2024-12-13 11:33                         ` Ming Lei
2024-12-13 11:20                 ` Peter Zijlstra
2024-12-13 13:13                   ` Thomas Gleixner
2024-12-16 13:20                     ` [PATCH] sched: Prevent rescheduling when interrupts are disabled Thomas Gleixner
2024-12-16 17:41                       ` David Woodhouse
2024-12-12 11:12     ` Lockdep warnings on kexec (virtio_blk, hrtimers) Ming Lei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=874j38a16p.ffs@tglx \
    --to=tglx@linutronix.de \
    --cc=bonzini@redhat.com \
    --cc=dwmw2@infradead.org \
    --cc=dyoung@redhat.com \
    --cc=eperezma@redhat.com \
    --cc=hpa@zytor.com \
    --cc=jasowang@redhat.com \
    --cc=jogness@linutronix.de \
    --cc=kexec@lists.infradead.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=mst@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pmladek@suse.com \
    --cc=sgarzare@redhat.com \
    --cc=stefanha@redhat.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox