From: Philippe Gerum
To: Łukasz Majewski
Cc: Giulio Moro, Xenomai
Subject: Re: Unexpected switches to in-band
In-Reply-To: <20251023155439.0170f987@wsk> ("Łukasz Majewski"'s message of
 "Thu, 23 Oct 2025 15:54:39 +0200")
References: <20251009151737.0d03b211@wsk>
 <20676160-4572-d92d-4b33-ff4255946345@bela.io>
 <87qzv9sa9c.fsf@xenomai.org> <87ikgls9kh.fsf@xenomai.org>
 <20251020094705.2ac256f2@wsk>
 <9d2bacac-8d70-f083-e926-21beee2207c2@bela.io>
 <87o6q1ad07.fsf@xenomai.org> <20251023155439.0170f987@wsk>
Date: Sun, 26 Oct 2025 21:04:52 +0100
Message-ID: <87a51djuor.fsf@xenomai.org>
X-Mailing-List: xenomai@lists.linux.dev

Hi Łukasz,

Łukasz Majewski writes:

> Hi Philippe,
>
>> Giulio Moro writes:
>>
>> > Łukasz Majewski wrote on 20/10/2025 02:47:
>> >
>> >> Could you share which version of libevl do you use?
>> >
>> > I was using the latest release of libevl that was compatible with
>> > the kernel UAPI.
>> > Sorry I haven't provided more details on the
>> > issue; I am focusing on building an image around 6.1 for a deadline
>> > coming up next month so I haven't been able to get into tracing yet.
>> >
>> > The only additional finding I have so far is that there seems to be
>> > something on the Linux side that "breaks real-time" which affects
>> > both linux and evl. In the list below, "evl bad" means that the
>> > above mentioned ISW are observed. "Linux bad" means that I see a
>> > disproportionate number of underruns under stress on a Linux program
>> > running with SCHED_FIFO and priority 95 with a period of 360us. I
>> > understand "disproportionate" is a very subjective term but, to give
>> > an idea, over a 10-minute test I get a couple of underruns with
>> > "linux good" and I get hundreds of underruns with "linux bad"
>> >
>> > v6.12.y-evl-rebase: evl bad, linux bad
>> > v6.11.y-evl-rebase: evl bad, linux untested
>> > v6.10.y-evl-rebase: evl bad, linux untested
>> > v6.9.y-evl-rebase: evl bad, linux bad
>> > v6.6.y-evl-rebase: evl bad, linux bad
>> > v6.3.y-evl-rebase: evl bad, linux bad on startup only
>> > v6.2.y-evl-rebase: libevl r42, evl good, linux good
>> > v6.1.y-cip-evl-rebase: libevl master, evl good, linux good
>> >
>> > Not sure if this is of any help; I hope to be able to get back on
>> > this soon. Best,
>> > Giulio
>>
>> If someone could send me the relevant portion of a trace file with a
>> 'latspot' tracepoint triggered on a latmus run, I could investigate
>> this issue. I'd need the function tracer active on all CPUs, with all
>> traces dumped to a single trace file ('evl trace -ef' should do).
>>
>
> Please find tar'ed output for the trace(s).
>
> Please be aware, however, that I've fallen back to 6.6 (as it is the
> version in which I can reproduce the issue in the fastest way).
>
> The customer also reported that they can reproduce the issue with
> their SW stack on 6.1-slts and 6.12, but it takes considerably longer
> than on 6.6 (in which I can use simple programs to "allocate" memory).
>

Ok, but 6.6 is definitely unmaintained Dovetail-wise, and has been so
for several months now. So although this issue was observed with
maintained releases too, debugging a current issue on an obsolete code
base is a fragile process nevertheless.

> I've used a pretty standard set of ftrace CONFIG_* options.
>
> However, it seems like there is a "hole" in the ftrace output around
> the time when the in-band switch was reported (in dmesg).
>
> I'm going to do the same with all available tracers enabled.
>
> Last but not least, the latspot event is not present in my ftrace
> output (although I've enabled all the CONFIG_EVL*DEBUG options).
>
> Is there any special set of options required for EVL tracing?
>

Yes, you need to enable the evl/evl_latspot tracepoint. See [1].

> Tars with logs:
> https://nextcloud.swupdate.org/index.php/s/FgiMsHG9xG8frk3
> https://nextcloud.swupdate.org/index.php/s/XcW75xsQPMXm3zg

Ok, first observation: the logs reveal that we are in an OOM situation.
The kernel strategy is best-effort there, trying to keep the system in
the best possible state while sacrificing processes. But honestly,
although the VM_LOCKED pages are unevictable by definition, there are
quite a few spots in the mm which might trigger the OOM reaper,
including the inability to allocate page table information, insert new
pages and so on. Although all the memory of an oob application is
committed, with its VMAs populated once libevl has issued mlockall(), I
genuinely don't know how this fares in an OOM situation.
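
As a rough way to check from the outside that an application's memory is actually locked before the OOM condition builds up, one could compare the VmLck and VmRSS fields of /proc/<pid>/status. The sketch below is only an illustration: the helper names (parse_vm_fields, locked_summary) and the sample values are made up, and it is not part of libevl or the EVL tooling.

```python
import re

def parse_vm_fields(status_text):
    """Extract the Vm* fields (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        m = re.match(r"(Vm\w+):\s+(\d+)\s+kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

def locked_summary(pid="self"):
    """Return (VmLck, VmRSS) in kB for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        vm = parse_vm_fields(f.read())
    # Kernel threads expose no Vm* lines, hence the defaults.
    return vm.get("VmLck", 0), vm.get("VmRSS", 0)

# Hypothetical excerpt of a status file, for illustration only.
sample_status = """Name:\ttimer-responder
VmSize:\t   10240 kB
VmLck:\t   10240 kB
VmRSS:\t    9800 kB
"""

vm = parse_vm_fields(sample_status)
print(f"VmLck={vm['VmLck']} kB, VmRSS={vm['VmRSS']} kB")
```

A VmLck of 0 on the oob application would mean that nothing is locked at all; a non-zero value only tells that locking is in effect, not how the mm behaves once the OOM killer kicks in.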
Anyway, I still see a common pattern between the two sets of traces:
the unwanted in-band switch happens during what seem to be time holes
(assuming that the traces of all CPUs are merged into each log):

[  285.166640] EVL: timer-responder:754 switching in-band [pid=756, excpt=14, user_pc=0x7be4dc59d5fe]
          <idle>-0       [022] *..1.   148.913052: rcu_dyntick: Start 1 0 0x8ec
          <idle>-0       [014] dN.1.   172.761906: rcu_dyntick: End 0 1 0x39c

[  172.317009] EVL: timer-responder:743 switching in-band [pid=745, excpt=14, user_pc=0x7036049465fe]
          <idle>-0       [022] *..1.   148.913052: rcu_dyntick: Start 1 0 0x8ec
          <idle>-0       [014] dN.1.   172.761906: rcu_dyntick: End 0 1 0x39c

A lot can be done in 24 seconds on this class of hardware, so either
some traces are missing, or something happens at hardware level which
the kernel does not know about, as may be seen on x86 with some
uncooperative BIOS (e.g. SMIs, thermal events come to mind). Hopefully
this is not the case, but then we need to confirm that some traces are
indeed missing. If such a time hole is confirmed though, then the issue
Giulio is seeing might be different.

[1] https://v4.xenomai.org/core/commands/index.html#evl-trace-command

-- 
Philippe.
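
To help confirm whether such time holes exist elsewhere in a merged trace file, a small script can scan consecutive entries and report any gap above a threshold. This is a minimal sketch under several assumptions: the log uses the default ftrace text layout (CPU in brackets, a flags field, then a seconds.microseconds timestamp), entries from all CPUs are already merged in time order, and the names find_time_holes and threshold_s are made up for illustration.

```python
import re

# CPU, flags and timestamp fields of a default-format ftrace line, e.g.
# "<idle>-0 [022] *..1. 148.913052: rcu_dyntick: Start 1 0 0x8ec".
# dmesg lines such as "[  285.166640] EVL: ..." do not match and are skipped.
TRACE_LINE = re.compile(r"\[(\d+)\]\s+\S+\s+(\d+\.\d+):")

def find_time_holes(lines, threshold_s=1.0):
    """Return (prev_ts, next_ts, gap) for consecutive trace entries
    whose timestamps are more than threshold_s seconds apart."""
    holes = []
    prev_ts = None
    for line in lines:
        m = TRACE_LINE.search(line)
        if m is None:
            continue  # not a trace entry (dmesg output, headers, blanks)
        ts = float(m.group(2))
        if prev_ts is not None and ts - prev_ts > threshold_s:
            holes.append((prev_ts, ts, ts - prev_ts))
        prev_ts = ts
    return holes

# The entries quoted in this message, used as sample input.
sample = [
    "[  285.166640] EVL: timer-responder:754 switching in-band "
    "[pid=756, excpt=14, user_pc=0x7be4dc59d5fe]",
    "<idle>-0 [022] *..1. 148.913052: rcu_dyntick: Start 1 0 0x8ec",
    "<idle>-0 [014] dN.1. 172.761906: rcu_dyntick: End 0 1 0x39c",
]

for start, end, gap in find_time_holes(sample):
    print(f"hole: {start:.6f} -> {end:.6f} ({gap:.6f} s)")
```

For the two entries quoted above, this reports a single hole of about 23.85 s between 148.913052 and 172.761906. Running it over the whole file with a lower threshold would show whether the hole around the in-band switch is an isolated one or a recurring pattern.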