Re: Dovetail/Xenomai 3: Timer tick locking problem

From: Philippe Gerum <rpm@xenomai.org>
To: Florian Bezdeka <florian.bezdeka@siemens.com>
Cc: xenomai@lists.linux.dev, Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: Dovetail/Xenomai 3: Timer tick locking problem
Date: Thu, 06 Jun 2024 10:18:30 +0200
Message-ID: <87wmn2eawl.fsf@xenomai.org>
In-Reply-To: <1a4aa9f3bfb30fe5b5955fea1486a5fb8b040ff2.camel@siemens.com>


Florian Bezdeka <florian.bezdeka@siemens.com> writes:

> Hi all,
>
> I'm searching for the root cause of the following WARNING - followed by
> a complete system hang:
>
>
> [Xenomai] lock 00000000e04e7d2d already unlocked on CPU #3
>           last owner = kernel/xenomai/pipeline/intr.c:26
> (xnintr_core_clock_handler(), CPU #-2)

Hmm, -2 looks pretty bad. The owner field should hold either a valid
CPU number, or -1 (~0) when the lock is free.

> CPU: 3 PID: 31 Comm: ksoftirqd/3 Not tainted 6.1.34-xenomai-1 #1
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
>
>
> Up to now I'm only able to reproduce that with VirtualBox when the PV
> spinlock infrastructure is enabled. It takes ~5 min to stall the system
> by running stress-ng with the --iomix stressor. "nopvspin" to the
> kernel cmdline "solves" this problem for now.
>
> I'm able to stall the same image with the --iomix stressor when running
> on kvm/qemu as well. Obviously there is no warning triggered. I'm using
> "pci=nomsi" on the kernel cmdline to get the same IRQ routing (via
> IOAPIC) as VBox.
>
> While reading all the related code I had some questions that I would
> like to have answered. 
>
>
> First question:
> Broken locking when Xenomai timer tick interrupts an OOB task? 
>
> The call stack into the xnintr_core_clock_handler():
>
> #0  xnintr_host_tick (sched=0xffff88803e8ab060) at kernel/xenomai/pipeline/intr.c:14
> #1  xnintr_core_clock_handler () at kernel/xenomai/pipeline/intr.c:40
> #2  0xffffffff81061fdc in clockevents_handle_event (ced=0xffff88803e89c280) at ./include/linux/clockchips.h:281
> #3  lapic_oob_handler (irq=<optimized out>, dev_id=<optimized out>) at arch/x86/kernel/apic/apic.c:503
> #4  0xffffffff81112ecc in do_oob_irq (desc=desc@entry=0xffff888003929400) at kernel/irq/pipeline.c:933
> #5  0xffffffff8111315a in handle_oob_irq (desc=0xffff888003929400) at kernel/irq/pipeline.c:1036
> #6  handle_oob_irq (desc=0xffff888003929400) at kernel/irq/pipeline.c:991
> #7  0xffffffff811133ff in generic_handle_irq_desc (desc=0xffff888003929400) at ./include/linux/irqdesc.h:161
> #8  generic_pipeline_irq_desc (desc=desc@entry=0xffff888003929400) at kernel/irq/pipeline.c:1141
> #9  0xffffffff81070e78 in arch_handle_irq (regs=regs@entry=0xffffc9000009be38, vector=vector@entry=236 '\354', irq_movable=irq_movable@entry=false) at arch/x86/kernel/irq_pipeline.c:243
> #10 0xffffffff81f48dcb in arch_pipeline_entry (regs=0xffffc9000009be38, vector=236 '\354') at arch/x86/kernel/irq_pipeline.c:291
> #11 0xffffffff8200148a in asm_sysvec_apic_timer_interrupt () at ./arch/x86/include/asm/idtentry.h:760
> #12 0x0000000000000000 in ?? ()
>
> To my understanding this is an OOB IRQ, interrupting any task, OOB
> tasks included. The code in xnintr_core_clock_handler():
>
> 	xnlock_get(&nklock);
> 	xnclock_tick(&nkclock);
> 	xnlock_put(&nklock);
>
> When running over an OOB task that is currently owning nklock, we will
> release the lock unconditionally, leaving the task "unprotected" /
> unsynchronized. Right?

No, the only way for a task to hold the ugly lock (nklock) safely is to
disable IRQs whenever it has to compete with an IRQ handler. So this
scenario is by definition a usage bug on the application/driver side,
not on the infrastructure's. Meanwhile, the _irqsave() variant prevents
a spurious lock release on recursion by setting a special marker in the
saved interrupt flags.

>
> There is no OOB task running on my system yet, so it's likely not my
> original problem.
>
>
> Second question:
> Back to the original problem. If I interpret the warning correctly
> something like
>
>     CPU #2                      CPU #3
>                                 xnlock_get(&nklock)
>     xnlock_get(&nklock)
>     xnlock_put(&nklock)
>                                 xnlock_put(&nklock) (triggers WARN)
>
> must have happened. I think we agree that the infrastructure behind
> xnlock_get() should take care that this scenario can never happen. Any
> ideas what is going on here? Might that be a bug in the VBox PV
> implementation?
>
> The vCPUs should be "parked"/"halted" instead of spinning. On x86 we
> run into kvm_wait() doing a "sti;hlt;" combination.
>
> Once kicked by the current lock holder we should wake up and continue
> after "hlt;". Even if this goes somehow wrong, some checks about
> spurious wakeups are in place.
>
> Might that be a memory "visibility problem", maybe due to a missing
> barrier?

I would address the "-2" weirdness before considering anything else.

-- 
Philippe.
