* Outstanding latency increase in kernel CAN gateway caught by CANlatester daily builds at 2023-10-02
@ 2023-10-02 8:40 Pavel Pisa
2023-10-14 10:57 ` Oliver Hartkopp
0 siblings, 1 reply; 3+ messages in thread
From: Pavel Pisa @ 2023-10-02 8:40 UTC (permalink / raw)
To: linux-can, Oliver Hartkopp, linux-rt-users
Cc: Ondrej Ille, Matěj Vasilevski, Pavel Hronek,
Jiří Novák, Carsten Emde, Jan Altenberg
Hello Oliver and others,
two consecutive daily runs of our CAN latency testing system
https://canbus.pages.fel.cvut.cz/#can-bus-channels-mutual-latency-testing
show an extreme increase in the latency of the kernel CAN gateway under load.
The first run with increased latency (named run-DATE-TIME-KERNEL-OPTIONS) is
run-231002-045216-hist+6.6.0-rc3-rt5-ge31516c1e553+flood-kern-prio-fd-load.json
The previous one, consistent with the daily runs since May, is
run-231001-045220-hist+6.6.0-rc3-rt5-ge31516c1e553+flood-kern-prio-fd-load.json
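As a rough illustration of the run-DATE-TIME-KERNEL-OPTIONS naming, the sketch below splits such a name into its parts. The exact layout is an assumption inferred only from the two names quoted above: the leading "hist" token is treated as an opaque tag, the time field is read as HHMMSS, and the function name `parse_run_name` is hypothetical.

```python
import re
from datetime import datetime

# Assumed layout, inferred only from the two run names quoted in this mail:
#   run-<YYMMDD>-<HHMMSS>-<tag>+<kernel>+<options>[.json]
RUN_NAME = re.compile(
    r"^run-(?P<date>\d{6})-(?P<time>\d{6})-"
    r"(?P<tag>[^+]+)\+(?P<kernel>[^+]+)\+(?P<options>[^.+]+)(?:\.json)?$"
)


def parse_run_name(name):
    """Split a run file name into start time, tag, kernel and option list."""
    m = RUN_NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognized run name: {name!r}")
    return {
        # Time zone is not stated in the name, so a naive datetime is returned.
        "started": datetime.strptime(m["date"] + m["time"], "%y%m%d%H%M%S"),
        "tag": m["tag"],
        "kernel": m["kernel"],
        "options": m["options"].split("-"),
    }


info = parse_run_name(
    "run-231001-045220-hist+6.6.0-rc3-rt5-ge31516c1e553"
    "+flood-kern-prio-fd-load.json"
)
print(info["kernel"], info["options"])
```

Splitting the option string on "-" makes it easy to compare runs that differ in a single option, for example with and without "prio".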
The history of the monitoring of the kernel gateway under load for the latest
RT kernels, built from the "linux-rt-devel/for-kbuild-bot/current-stable"
branch, is at
https://canbus.pages.fel.cvut.cz/can-latester/inspect.html?kernel=rt&load=1&flood=1&fd=1&prio=1&kern=1
Monitoring of the latency when a userspace application forwards data
from one CAN interface to another does not show a similar excess
https://canbus.pages.fel.cvut.cz/can-latester/inspect.html?kernel=rt&load=1&flood=1&fd=1&prio=1&kern=0
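The two inspection links differ only in the "kern" query parameter. A minimal sketch of how such links could be built follows; the meaning attached to each parameter (kern=1 for the kernel gateway, kern=0 for the userspace one) is inferred from the surrounding text, and the helper name `inspect_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://canbus.pages.fel.cvut.cz/can-latester/inspect.html"


def inspect_url(kernel_gateway, kernel="rt", load=True, flood=True,
                fd=True, prio=True):
    """Build an inspect.html link; parameter meanings are assumptions."""
    params = {
        "kernel": kernel,
        "load": int(load),
        "flood": int(flood),
        "fd": int(fd),
        "prio": int(prio),
        "kern": int(kernel_gateway),  # assumed: 1 = kernel GW, 0 = userspace GW
    }
    return BASE + "?" + urlencode(params)


print(inspect_url(kernel_gateway=True))
print(inspect_url(kernel_gateway=False))
```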
Interestingly, the problem does not appear when the priority of the CAN
controller interrupt service routines is not boosted. For the boosted runs,
priority 90 is set for each
irq/[0-9]+-can[0-9]
thread by
chrt -f --pid 90 $pid
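For illustration, the same boost could be sketched in Python as below. This is not the script used on the test boards, only an assumption-laden equivalent: it scans /proc for tasks whose comm matches the pattern above and applies SCHED_FIFO priority 90 through `os.sched_setscheduler`. The helper names are hypothetical, and applying the boost requires root (or CAP_SYS_NICE).

```python
import os
import re

# Thread name pattern quoted above; fullmatch is used here, which is
# stricter than a grep-style substring search.
CAN_IRQ_THREAD = re.compile(r"irq/[0-9]+-can[0-9]")


def is_can_irq_thread(comm):
    """True if a /proc/<pid>/comm string names a CAN IRQ thread."""
    return CAN_IRQ_THREAD.fullmatch(comm.strip()) is not None


def find_can_irq_threads(proc="/proc"):
    """Return the sorted PIDs of all tasks that look like CAN IRQ threads."""
    pids = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "comm")) as f:
                if is_can_irq_thread(f.read()):
                    pids.append(int(entry))
        except OSError:
            continue  # the task exited while we were scanning
    return sorted(pids)


def boost(pids, priority=90):
    """Equivalent of `chrt -f --pid 90 $pid` for each PID (needs privileges)."""
    for pid in pids:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))


if __name__ == "__main__":
    boost(find_can_irq_threads())
```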
The device under test, as well as the message generation and monitoring
system, are MZ_APO boards (AMD/Xilinx Zynq XC7Z010) with the CTU CAN FD IP
core CAN controller configured for 10 ns frame timestamping.
The problem may lie in the configuration of our system, in the CTU CAN FD IP
core driver, or be specific to the Zynq ARM platform. But it is suspicious in
general, because the test system has not been modified for a long time after
its initial tuning. The monitoring system has been running the
6.2.0-rt3-00007-ge3a16816f987 kernel the whole time, and no Rx buffer overflow
has been reported on the tester side for the period covering all the tests in
question.
Please let me know if you have any idea which change between the reported
versions from 2023-10-01 and 2023-10-02 could be the reason for the change.
I plan to keep an eye on the results until the end of the week, and if the
problem persists I will start to investigate further at the beginning of next
week, when I should find a little more time. I am quite busy preparing for a
conference and teaching this week, so I do not expect to find much time.
Best wishes,
Pavel
--
Pavel Pisa
phone: +420 603531357
e-mail: pisa@cmp.felk.cvut.cz
Department of Control Engineering FEE CVUT
Karlovo namesti 13, 121 35, Prague 2
university: http://control.fel.cvut.cz/
personal: http://cmp.felk.cvut.cz/~pisa
social: https://social.kernel.org/ppisa
projects: https://www.openhub.net/accounts/ppisa
CAN related:http://canbus.pages.fel.cvut.cz/
RISC-V education: https://comparch.edu.cvut.cz/
Open Technologies Research Education and Exchange Services
https://gitlab.fel.cvut.cz/otrees/org/-/wikis/home
* Re: Outstanding latency increase in kernel CAN gateway caught by CANlatester daily builds at 2023-10-02
2023-10-02 8:40 Outstanding latency increase in kernel CAN gateway caught by CANlatester daily builds at 2023-10-02 Pavel Pisa
@ 2023-10-14 10:57 ` Oliver Hartkopp
2023-10-18 9:50 ` Pavel Pisa
0 siblings, 1 reply; 3+ messages in thread
From: Oliver Hartkopp @ 2023-10-14 10:57 UTC (permalink / raw)
To: Pavel Pisa, linux-can, linux-rt-users
Cc: Ondrej Ille, Matěj Vasilevski, Pavel Hronek,
Jiří Novák, Carsten Emde, Jan Altenberg
Hello Pavel,
is there any news on this latency issue?
I've not seen any can-gw related changes between 6.2 and 6.6.
The only change for linux/net/can/gw.c is this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a30b2bd01c23a7eeace3a3f82c2817227099805
which intentionally returns an error when the cangw tool is used
incorrectly:
From: Oliver Hartkopp <socketcan@hartkopp.net>
Date: Wed, 25 Jan 2023 06:54:07 +0100
Subject: can: gw: give feedback on missing CGW_FLAGS_CAN_IIF_TX_OK flag
To send CAN traffic back to the incoming interface a special flag has to
be set. When creating a routing job for identical interfaces without this
flag the rule is created but has no effect.
This patch adds an error return value in the case that the CAN interfaces
are identical but the CGW_FLAGS_CAN_IIF_TX_OK flag was not set.
Reported-by: Jannik Hartung <jannik.hartung@tu-bs.de>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Link: https://lore.kernel.org/all/20230125055407.2053-1-socketcan@hartkopp.net
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
---
net/can/gw.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/can/gw.c b/net/can/gw.c
index 23a3d89cad81d..37528826935e7 100644
--- a/net/can/gw.c
+++ b/net/can/gw.c
@@ -1139,6 +1139,13 @@ static int cgw_create_job(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (gwj->dst.dev->type != ARPHRD_CAN)
 		goto out;
 
+	/* is sending the skb back to the incoming interface intended? */
+	if (gwj->src.dev == gwj->dst.dev &&
+	    !(gwj->flags & CGW_FLAGS_CAN_IIF_TX_OK)) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	ASSERT_RTNL();
 
 	err = cgw_register_filter(net, gwj);
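The effect of the added check can be modelled outside the kernel. The sketch below is only a user-space illustration of the condition in the patch, not kernel code: interfaces are represented by plain names, `check_job` is a hypothetical helper, and the numeric value of CGW_FLAGS_CAN_IIF_TX_OK is quoted from memory of include/uapi/linux/can/gw.h and should be treated as an assumption.

```python
import errno

# Assumed value; check include/uapi/linux/can/gw.h before relying on it.
CGW_FLAGS_CAN_IIF_TX_OK = 0x04


def check_job(src_dev, dst_dev, flags):
    """Mirror the patch: refuse a same-interface job without the flag."""
    if src_dev == dst_dev and not (flags & CGW_FLAGS_CAN_IIF_TX_OK):
        return -errno.EINVAL
    return 0


# Routing between two different interfaces is unaffected.
print(check_job("can0", "can1", 0))
# Routing back to the incoming interface now fails unless the flag is set.
print(check_job("can0", "can0", 0))
print(check_job("can0", "can0", CGW_FLAGS_CAN_IIF_TX_OK))
```

A gateway job between two distinct interfaces, as in the latency test setup, never reaches the new error path.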
Please let me know if I can help on this topic.
So far it looks like a RT/configuration problem to me.
Best regards,
Oliver
* Re: Outstanding latency increase in kernel CAN gateway caught by CANlatester daily builds at 2023-10-02
2023-10-14 10:57 ` Oliver Hartkopp
@ 2023-10-18 9:50 ` Pavel Pisa
0 siblings, 0 replies; 3+ messages in thread
From: Pavel Pisa @ 2023-10-18 9:50 UTC (permalink / raw)
To: Oliver Hartkopp
Cc: linux-can, linux-rt-users, Ondrej Ille, Matěj Vasilevski,
Pavel Hronek, Jiří Novák, Carsten Emde,
Jan Altenberg
Hello Oliver,
On Saturday 14 of October 2023 12:57:53 Oliver Hartkopp wrote:
> Hello Pavel,
>
> is there any news on this latency issue?
>
> I've not seen any can-gw related changes between 6.2 and 6.6.
>
> The only change for linux/net/can/gw.c is this patch:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a30b2bd01c23a7eeace3a3f82c2817227099805
I have been keeping an eye on the results daily, and it seems that the run
run-231002-045216-hist+6.6.0-rc3-rt5-ge31516c1e553+flood-kern-prio-fd-load
https://canbus.pages.fel.cvut.cz/can-latester/inspect.html?kernel=rt&load=1&flood=1&fd=1&prio=1&kern=1
really is an outlier. In the end it is a single run; I interpreted the
graph incorrectly, because such an extreme value at the end of the graph
looked like a flat increase covering two consecutive runs.
So the kernel GW averages and maxima follow the previous trend
after this single peak. The peak could be related to some
transitional state in RT development causing a problem
with priorities etc., or it could be the result of some other
problem in the whole setup. I am analyzing lost messages
in some RT runs, which seem to be related more to a problem
in the testing system (setup before the run, FPGA reload, etc.)
that causes a bus error or something similar, with initial
suspicion falling on the monitoring side. But I do not have
a conclusion yet.
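The difference between a single outlying run and a sustained increase can be sketched as follows. This is not how the CAN latester pages evaluate the data; it is a minimal, hypothetical check that compares each run with the median of the preceding normal runs and reports streaks of elevated runs. The window, the factor and the sample values are invented for illustration and are not measurements.

```python
from statistics import median


def elevated_streaks(values, window=5, factor=2.0):
    """Return (start_index, length) for each run of elevated values.

    A value is elevated when it exceeds factor * median of the last
    `window` values regarded as normal. The first `window` values are
    assumed to be normal and only seed the baseline.
    """
    baseline = []
    streaks = []
    start = None
    for i, v in enumerate(values):
        if len(baseline) < window:
            baseline.append(v)
            continue
        if v > factor * median(baseline):
            if start is None:
                start = i
        else:
            if start is not None:
                streaks.append((start, i - start))
                start = None
            baseline = baseline[1:] + [v]  # slide the baseline window
    if start is not None:
        streaks.append((start, len(values) - start))
    return streaks


# Invented daily maxima: one isolated spike, then back to the old level.
print(elevated_streaks([10, 11, 10, 12, 11, 10, 95, 11, 10]))
# Invented daily maxima: a shift that persists over several runs.
print(elevated_streaks([10, 11, 10, 12, 11, 40, 42, 41]))
```

A streak of length 1 corresponds to a single outlier; only a streak spanning several consecutive runs would point to a lasting regression. A spike in the most recent run cannot be classified until later runs arrive.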
The published runs are complete, with no messages lost,
and the statistics/trends seem to be without significant
change since the start of the measurement in May.
The change/increase in the trends before May has a well-understood
reason: we updated the stress testing, included more load sources,
and tuned priorities for the user GW, etc.
In the longer term, the initial setup-testing data from
April should/will be removed or masked from the public data so as
not to invite false conclusions. We will probably start a new series
at the beginning of 2024. We will see how data capacity and viewing
hold up, and how much they slow down, as the data set grows.
If I notice a significant change across several consecutive runs,
I will check it and send information.
In fact, we have already caught one real problem in RT.
Best wishes,
Pavel