From: "Łukasz Majewski" <lukma@nabladev.com>
To: Philippe Gerum <rpm@xenomai.org>
Cc: Giulio Moro <giulio@bela.io>, Xenomai <xenomai@lists.linux.dev>
Subject: Re: Unexpected switches to in-band
Date: Mon, 27 Oct 2025 12:05:35 +0100 [thread overview]
Message-ID: <20251027120535.7933c720@wsk> (raw)
In-Reply-To: <87a51djuor.fsf@xenomai.org>
Hi Philippe,
> Hi Łukasz,
>
> Łukasz Majewski <lukma@nabladev.com> writes:
>
> > Hi Philippe,
> >
> >> Giulio Moro <giulio@bela.io> writes:
> >>
> >> > Łukasz Majewski wrote on 20/10/2025 02:47:
> >> >
> >> >> Could you share which version of libevl do you use?
> >> >
> >> > I was using the latest release of libevl that was compatible with
> >> > the kernel UAPI. Sorry I haven't provided more details on the
> >> > issue; I am focusing on building an image around 6.1 for a
> >> > deadline coming up next month so I haven't been able to get into
> >> > tracing yet.
> >> >
> >> > The only additional finding I have so far is that there seems to
> >> > be something on the Linux side that "breaks real-time" which
> >> > affects both linux and evl. In the below list, "evl bad" means
> >> > that the above mentioned ISW are observed. "Linux bad" means
> >> > that I see a disproportionate number of underruns under stress
> >> > on a Linux program running with SCHED_FIFO and priority 95 with
> >> > a period of 360us. I understand "disproportionate" is a very
> >> > subjective term but, to give an idea, over a 10 minutes test I
> >> > get a couple of underruns with "linux good" and I get hundreds
> >> > of underruns with "linux good"
> >> >
> >> > v6.12.y-evl-rebase: evl bad, linux bad
> >> > v6.11.y-evl-rebase: evl bad, linux untested
> >> > v6.10.y-evl-rebase: evl bad, linux untested
> >> > v6.9.y-evl-rebase: evl bad, linux bad
> >> > v6.6.y-evl-rebase: evl bad, linux bad
> >> > v6.3.y-evl-rebase: evl bad, linux bad on startup only
> >> > v6.2.y-evl-rebase: libevl r42, evl good, linux good
> >> > v6.1.y-cip-evl-rebase: libevl master, evl good, linux good
> >> >
> >> > Not sure if this is of any help; I hope to be able to get back on
> >> > this soon. Best,
> >> > Giulio
> >>
> >> If someone could send me the relevant portion of a trace file with
> >> a 'latspot' tracepoint triggered on a latmus run, I could
> >> investigate this issue. I'd need the function tracer active on all
> >> CPUs, with all traces dumped to a single trace file ('evl trace
> >> -ef' should do).
> >
> > Please find tar'ed output for the trace(s).
> >
> > Please, however be aware that - I've fall back to 6.6 (as it is the
> > version in which I can reproduce the issue in the fastest way).
> >
> > Customer also reported, that they can reproduce with their SW stack
> > the issue on 6.1-slts and 6.12, but it takes considerably longer
> > than for 6.6 (in which I can use simple programs to "allocate"
> > memory).
>
> Ok, but 6.6 is definitely unmaintained Dovetail-wise, and has been so
> for several months now.
Yes, I'm aware of it ...
> So although this issue was observed with
> maintained releases too, debugging a current issue on an obsolete code
> base is a fragile process nevertheless.
>
It is fragile, yes - but:
1. I can reproduce it with minutes with fairly easy way
2. I do assume, that when we find the issue on 6.6, then we can
forward port it for 6.12 (and I will be more than happy to switch to
6.12+).
> > I've used pretty standard set of ftrace CONFIG_* options enabled.
> >
> > However, it seems like there is a "hole" around the time when
> > in-band switch has been reported (in dmesg) and in ftrace output.
> >
> > I'm going to do the same with all available tracers enabled.
> >
> > Last but not least, the latspot event is not present in my ftrace
> > output (although I've enabled all the CONFIG_EVL*DEBUG options).
> >
> > Is there any special set of options to required for EVL tracig?
> >
>
> Yes, you need to enable the evl/evl_latspot tracepoint.
> See [1].
>
I've been confused a bit.
With libevl r50 - I've used
evl trace -e -f
(and I've assumed that evl* specific traces will be added).
When I've check:
cat /sys/kernel/debug/tracing/events/evl/evl_latspot/enable
returned '1', so it shall be enabled as well.
However, when I run:
evl trace -eirq it now works as expected.
If I may made an observation - I would expect the following work flow
(according to [1]):
evl trace -eirq
evl trace -d
evl trace -p > foo.txt
However, evl trace -d disables tracing, yes, but also clears the buffer.
(instead of 'evl trace -d' I do use
'echo 0 > /sys/kernel/tracing/tracing_on')
> > Tars with logs:
> > https://nextcloud.swupdate.org/index.php/s/FgiMsHG9xG8frk3
> > https://nextcloud.swupdate.org/index.php/s/XcW75xsQPMXm3zg
>
> Ok, first observation, the logs reveal that we are in an OOM
> situation: the kernel strategy is best-effort there, to keep the
> system in the best possible state while sacrificing processes.
+1
> But
> honestly, although the VM_LOCKED pages are unevictable by definition,
> there are quite a few spots in the mm which might trigger the OOM
> reaper, including the inability to allocate page table information,
> insert new pages and so on. Although all the memory of an oob
> application is committed, with its VMAs populated once libevl has
> issued mlockall(), I genuinely don't know how this fares with an OOM
> situation.
I would expect that, the VM_LOCKED pages (and their entries to PTE)
would be the _last_ ones to be touched (if any).
However, maybe some "optimization" in mm was added to Linux kernel
recently and we do experience the consequences.
>
> Anyway, I still see a common pattern between the two set of traces,
> the unwanted inband switch happens during what seems to be time holes
> (assuming that traces of all CPUs are merged into each log):
Yes, those were traces for all cpus - added to a single file.
>
> [ 285.166640] EVL: timer-responder:754 switching in-band [pid=756,
> excpt=14, user_pc=0x7be4dc59d5fe] <idle>-0 [022] *..1.
> 148.913052: rcu_dyntick: Start 1 0 0x8ec <idle>-0 [014] dN.1.
> 172.761906: rcu_dyntick: End 0 1 0x39c
>
> [ 172.317009] EVL: timer-responder:743 switching in-band [pid=745,
> excpt=14, user_pc=0x7036049465fe] <idle>-0 [022] *..1.
> 148.913052: rcu_dyntick: Start 1 0 0x8ec <idle>-0 [014] dN.1.
> 172.761906: rcu_dyntick: End 0 1 0x39c
>
> A lot can be done in 24 µs on such class of hardware,
For the record. Both Xenomai3 and Xenomai4 have been run on the
identical hardware.
On Xenomai3 there are no issues...
> so either some
> traces are missing,
I will check if I can add more tracing.
> or something happens at hardware level which the
> kernel does not know about, as it may be seen on x86 with some
> uncooperative BIOS (e.g. SMIs, thermal events come to mind).
IIRC - Giulio reported the issue on some ARM system...
> Hopefully
> this is not the case, but then we need to make sure that some traces
> are indeed missing. If such time hole is confirmed though, then the
> issue Giulio is seeing might be different.
>
> [1] https://v4.xenomai.org/core/commands/index.html#evl-trace-command
>
--
Best regards,
Lukasz Majewski
--
Nabla Software Engineering GmbH
HRB 40522 Augsburg
Phone: +49 821 45592596
E-Mail: office@nabladev.com
Geschftsfhrer : Stefano Babic
next prev parent reply other threads:[~2025-10-27 11:05 UTC|newest]
Thread overview: 50+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-09 4:33 Unexpected switches to in-band Giulio Moro
2025-10-09 13:17 ` Łukasz Majewski
2025-10-09 19:05 ` Giulio Moro
2025-10-10 10:24 ` Łukasz Majewski
2025-10-10 12:21 ` Giulio Moro
2025-10-10 13:08 ` Łukasz Majewski
2025-10-11 4:25 ` Giulio Moro
2025-10-11 15:55 ` Philippe Gerum
2025-10-11 16:10 ` Philippe Gerum
2025-10-11 16:47 ` Giulio Moro
2025-10-11 16:56 ` Philippe Gerum
2025-10-11 17:15 ` Philippe Gerum
2025-10-11 19:46 ` Giulio Moro
2025-10-12 8:54 ` Philippe Gerum
2025-10-12 14:44 ` Philippe Gerum
2025-10-20 7:47 ` Łukasz Majewski
2025-10-20 12:46 ` Giulio Moro
2025-10-20 14:01 ` Philippe Gerum
2025-10-21 11:13 ` Łukasz Majewski
2025-10-23 13:54 ` Łukasz Majewski
2025-10-26 20:04 ` Philippe Gerum
2025-10-27 11:05 ` Łukasz Majewski [this message]
2025-10-27 11:35 ` Philippe Gerum
2025-10-27 12:54 ` Łukasz Majewski
2025-10-27 16:25 ` Łukasz Majewski
2025-10-27 18:16 ` Giulio Moro
2025-10-27 22:42 ` Giulio Moro
2025-10-29 9:18 ` Philippe Gerum
2025-10-29 13:51 ` Łukasz Majewski
2025-10-30 12:26 ` Łukasz Majewski
2025-10-30 16:17 ` Philippe Gerum
2025-10-31 15:56 ` Łukasz Majewski
2025-10-31 16:30 ` Philippe Gerum
2025-10-31 17:34 ` Jan Kiszka
2025-10-31 18:09 ` Philippe Gerum
2025-10-31 18:11 ` Philippe Gerum
2025-11-01 11:32 ` Łukasz Majewski
2025-11-03 7:57 ` Florian Bezdeka
2025-11-03 9:29 ` Jan Kiszka
2025-11-01 11:31 ` Łukasz Majewski
2025-10-31 18:13 ` Philippe Gerum
2025-11-01 15:59 ` Łukasz Majewski
2025-11-01 16:33 ` Giulio Moro
2025-11-03 14:06 ` Philippe Gerum
2025-11-04 7:53 ` Łukasz Majewski
2025-11-04 8:19 ` Philippe Gerum
2025-11-03 14:00 ` Philippe Gerum
2025-10-30 16:26 ` Philippe Gerum
2025-10-11 17:43 ` Philippe Gerum
2025-10-11 15:37 ` Philippe Gerum
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20251027120535.7933c720@wsk \
--to=lukma@nabladev.com \
--cc=giulio@bela.io \
--cc=rpm@xenomai.org \
--cc=xenomai@lists.linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.