From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mx.nabladev.com (mx.nabladev.com [178.251.229.89]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B4FF72836F for ; Sat, 1 Nov 2025 16:00:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=178.251.229.89 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762012811; cv=none; b=Q23efJWT5ArXwq7Vztmqlw3Ewl9xWPVDySZrnvHHG7YPcVyv4K6H2ba/JfuCrCvkz2J/5mJDHhqYIYqyjd9o0b2iUCv2BGzjotu+fJEUbqrSHQrIZSWpLXp07dR+Ew9NCgezyO4SGs+Krfhx+VkWH3tEQ4BmXCqwcT993MC6VRA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762012811; c=relaxed/simple; bh=D8FF2IcwinmcQNgv0uWF8zrjpf16v8gBs45ll8xOmM0=; h=Date:From:To:Cc:Subject:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=pjmLfTXMfNXzIfT3RvIyjhalD52DGOMLBSfZk6JDFRKoOOe+XKfHItMQxJ2kMX5DYQUyp5UcSgRT7wpozfFAQ+j0gNyG/SOk/mlFyNTUQf42GAXHQlhLJdqfefQHmjquQ5T7sfA7vx0kdJ2sun2lV+4V5a7wckBCJyVIN7dkv1k= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=nabladev.com; spf=pass smtp.mailfrom=nabladev.com; dkim=pass (2048-bit key) header.d=nabladev.com header.i=@nabladev.com header.b=hZ3QdUIx; arc=none smtp.client-ip=178.251.229.89 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=nabladev.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=nabladev.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=nabladev.com header.i=@nabladev.com header.b="hZ3QdUIx" Received: from [127.0.0.1] (localhost [127.0.0.1]) by localhost (Mailerdaemon) with ESMTPSA id F31BF1034A4; Sat, 1 Nov 2025 16:59:59 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nabladev.com; s=dkim; t=1762012800; h=from:subject:date:message-id:to:cc:mime-version:content-type: content-transfer-encoding:in-reply-to:references; bh=ED2evM/q+y1MqMQckvTzi0E6/9JjNZ2A+10zJlG3Sa0=; b=hZ3QdUIxXCX0ftkYthjzXdVTLPEMBeNaZt+zUiuz49+SM1EAZX16WrZk+qf1juKTLGj1GU NXwIS6oMofz0BpujMskOYOgg05lYoLnb07a/lWP7MtH1LNwOE3KaWYQYOWCld4k+U6elDB fk/N05ayEPM07jNR25HVI+BsZ4bY8JOsP+cAdEOstUN8e31tk8frIamb2xgpby4NLlPq2n wCpXhP1NycNQgbZ0eHxm1xLnGfoEahOcY5XtLVu356/7JW5Oq1DWVJoVrVHhkEVJ5u4SaI /GDUQeLYRUF2kG4VZMs59w7aiH4Lajd8ymTergDPnMXsZGYayMBL22EcUc0TJQ== Date: Sat, 1 Nov 2025 16:59:57 +0100 From: =?UTF-8?B?xYF1a2Fzeg==?= Majewski To: Philippe Gerum Cc: Giulio Moro , Xenomai Subject: Re: Unexpected switches to in-band Message-ID: <20251101165957.12432bf4@wsk> In-Reply-To: <87bjlnc9u5.fsf@xenomai.org> References: <20251009151737.0d03b211@wsk> <20676160-4572-d92d-4b33-ff4255946345@bela.io> <87qzv9sa9c.fsf@xenomai.org> <87ikgls9kh.fsf@xenomai.org> <20251020094705.2ac256f2@wsk> <9d2bacac-8d70-f083-e926-21beee2207c2@bela.io> <87o6q1ad07.fsf@xenomai.org> <20251023155439.0170f987@wsk> <87a51djuor.fsf@xenomai.org> <20251027120535.7933c720@wsk> <20251027172505.29eecfb2@wsk> <875xbyawx7.fsf@xenomai.org> <20251029145125.71debaab@wsk> <20251030132602.2d1ccfcc@wsk> <87qzukuzy5.fsf@xenomai.org> <20251031165615.2c9f72f5@wsk> <87bjlnc9u5.fsf@xenomai.org> Organization: Nabla X-Mailer: Claws Mail 3.19.0 (GTK+ 2.24.33; x86_64-pc-linux-gnu) Precedence: bulk X-Mailing-List: xenomai@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Last-TLS-Session-Version: TLSv1.3 Hi Philippe, > =C5=81ukasz Majewski writes: >=20 > > Hi Philippe, > > =20 > >> Hi =C5=81ukasz, > >>=20 > >> =C5=81ukasz Majewski writes: > >> =20 > >> >> > Could you get me a trace which goes back up to two secs before > >> >> > the time reported by evl about the inband switch of the timer > >> >> > responder? e.g. =20 > >> > > >> > Why is it 2 seconds? Is it related to latmus "refresh time"? > >> > (which is 1s) > >> > =20 > >>=20 > >> Picking this delay is purely heuristical. Such a long backlog is > >> likely going to include the smoking gun, if any. > >> =20 > >> >> >=20 > >> >> > [ 75.214527] EVL: timer-responder:743 switching in-band > >> >> > [pid=3D745, excpt=3D14, user_pc=3D0x77d08492a5fe] > >> >> >=20 > >> >> > In the above case, I would need a trace going back to > >> >> > 73.214527 or so. =20 > >> >>=20 > >> >> Apparently the ftrace buffer was too small. I will adjust it to > >> >> store more data and share results. =20 > >> > > >> > It is also interesting, that: > >> > > >> > bash-5.2# echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb > >> > bash-5.2# cat /sys/kernel/debug/tracing/buffer_size_kb > >> > 16387 > >> > bash-5.2# evl trace -eirq -f > >> > tracing enabled > >> > bash-5.2# cat /sys/kernel/debug/tracing/buffer_size_kb > >> > 131 > >> > > >> > Reduces the buffer size... > >> > =20 > >>=20 > >> Yeah, the evl-trace script is not that smart. I can see the issue > >> causing this, will fix. > >> =20 > >> > Hence, I've used the "normal" ftrace with enabled evl functions - > >> > the =20 > >>=20 > >> FWIW, EVL relies on the normal ftrace. However, evl-trace fiddles > >> with the sysfs entries manually, and may be a bit naive at times. > >> As Jan pointed out long ago, relying on the trace-cmd front-end > >> would be a better approach than tweaking ftrace's sysfs-based > >> controls manually at the very least. evl-trace adds EVL semantics > >> to the tracing configuration which is handy, but the way it is > >> done today may deserve a revamping, or maybe drop it entirely in > >> favor of trace-cmd if we have enough flexibility there in order to > >> introduce some EVL specifics. > >> =20 > >> > detailed information about the ftrace config is with log files: > >> > https://nextcloud.swupdate.org/index.php/s/czBd3Ydq9B9noZ7 > >> > > >> > > >> > To extract the cpu2 specific data: > >> > bash-5.2# grep -rnE '\[002\]' Xenomai4-ISW-trace-6.6-logs.txt > > >> > cpu2.txt > >> > > >> > Here again the ISW happens in the cpu2 execution "hole": > >> > [ 128.466459] EVL: timer-responder:736 switching in-band > >> > [pid=3D738, excpt=3D14, user_pc=3D0x7588e71985fe] > >> > > >> > Happens for cpu2 between: > >> > > >> > 11832546: -0 [002] *..2. 128.466438: > >> > sched_idle_set_state <-cpuidle_enter_state > >> > 11832547: -0 [002] *..2. 128.466468: > >> > handle_irq_pipelined_prepare <-arch_pipeline_entry > >> > > >> > However, now the log is large enough to inspect what has > >> > happened in the system for ~4s before. > >> > > >> > > >> > With the logs: > >> > timer-responder-738 [002] *.~2. 128.466214: evl_timer_shot: > >> > [proxy-timer/2] at 128.466467 (delay: 253 us, 66918 cycles) > >> > > >> > and after ISW: > >> > > >> > -0 [002] *.~1. 128.466469: evl_timer_shot: > >> > latmus_pulse_handler at 128.467208 (delay: 739 us, 195206 cycles) > >> > > >> > Why there is such increase in "delay" (253 vs 739 us) ?=20 > >> > > >> > It looks like we have only 2us delay as the timer has fired on > >> > 128.466469 vs. expected 128.466467. > >> > =20 > >>=20 > >> Ok, so the issue is still the same: for some reason, the > >> timer-responder thread is taking a (minor) fault when accessing > >> some memory, on its way back from the ioctl(EVL_LATIOC_PULSE) > >> syscall, or right after. This seems to be valid memory though, > >> just not immediately accessible, otherwise we'd have received > >> SEGV, which we did not. =20 > > > > Yes, I also think so - like some "deferred allocation" despite of > > mlockall. > > > > Please find the logs with the below patch applied: > > https://nextcloud.swupdate.org/index.php/s/JXNW5bJrB3fQo5E > > =20 >=20 > Thanks. I need to analyze the fault flags next week to see if I can > make something of these, but I just stumbled on an obviously > significant fact when looking at the multi-CPU view: the memory > compaction daemon is running asynchronously on other cores while the > responder is taking the fault. Compaction is all about moving memory > around to free larger physical regions, so updating PTEs will happen > as a result. >=20 > Moreover, I could not reproduce the issue at hand, but I did not have > CONFIG_COMPACTION enabled. In the meantime, could you try disabling > this on your end, to see if the fault still appears? >=20 CONFIG_COMPACTION=3Dn - it also automatically set CONFIG_MIGRATION=3Dn =46rom the Kconfig dependencies CONFIG_MIGRATION=3Dn -> !PREEMPT_RT [=3Dn] (i.e. it can be only enabled when PREEMPT_RT is NOT) I've just set CONFIG_COMPACTION=3Dn (which also disabled the CONFIG_MIGRATION) and now I cannot reproduce the issue on my setup (i.e. v6.6.) anymore. It looks like those (enabled by default) config options have slipped in silently and introduced the issue... --=20 Best regards, Lukasz Majewski -- Nabla Software Engineering GmbH HRB 40522 Augsburg Phone: +49 821 45592596 E-Mail: office@nabladev.com Geschftsfhrer : Stefano Babic