Re: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: 김민수 <m.kim@knu.ac.kr>
Cc: linux-rt-users@vger.kernel.org
Subject: Re: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
Date: Thu, 12 Mar 2026 17:15:26 +0100
Message-ID: <20260312161526.p6Kz2Fs1@linutronix.de>
In-Reply-To: <C35885A5-1CAD-4373-927A-CBEB7DF116EA@knu.ac.kr>

On 2026-01-30 14:33:58 [+0900], 김민수 wrote:
> To: linux-rt-users@vger.kernel.org
> Subject: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
> 
> --
> Hi all,
> 
> I'm working on reducing worst-case EtherCAT send-recv round-trip
> latency on PREEMPT_RT 6.8-rt8, targeting under 50μs. I've done
> ftrace analysis over ~50k cycles and tried two kernel-side fixes,
> but the measured tail (50-80μs) exceeds what my trace can account
> for (~31μs). I'd appreciate help identifying what I'm missing.
> 
> == Environment ==
> 
>   - Kernel: 6.8.0-rt8 (PREEMPT_RT)
>   - CPU: Intel, core 3 isolated
>     - isolcpus=domain,managed,3 nohz_full=3 rcu_nocbs=3
>     - intel_pstate=disable idle=poll intel_idle.max_cstate=0
>   - NIC: RTL8168H (r8169 driver), IRQ pinned to CPU3
>     - Coalescing off: rx-usecs=0, rx-frames=1
>   - EtherCAT master: SOEM (Open EtherCATsociety)
>     - SCHED_FIFO 99, pinned to CPU3, 1ms cycle
>     - AF_PACKET raw socket (ETH_P_ECAT)
>     - Measures send-recv with clock_gettime(CLOCK_MONOTONIC)
> 
> == What I've tried ==
> 
> Using ftrace (function_graph + sched_switch) on CPU3:
> 
> 1) ksoftirqd preemption during NAPI poll
> 
>    During rtl8169_poll() → napi_gro_receive() → sock_def_readable()
>    → wake_up(), SOEM (FIFO 99) preempts ksoftirqd/3 (CFS) mid-poll.
>    SOEM then blocks on rt_spin_lock (socket wq_head->lock, held by
>    wake_up) → D state ~2μs → returns to ksoftirqd → preempts again
>    after recv() completes. Measured overhead: 5.5μs mean, up to 25μs.
> 
>    Fix: chrt -f -p 99 ksoftirqd/3
> 
>    Result: D state eliminated, recv() peak improved ~3μs, but
>    max got worse (77μs → 88μs).

Why is ksoftirqd active to begin with? If your CPU is isolated, nothing
should be waking ksoftirqd. Everything network related should happen
within the threaded interrupt.

> 2) TX completion IRQ elimination
> 
>    Each cycle generates two polls: TX IRQ triggers poll#1 (rtl_tx
>    cleanup only, ~1.3μs), then RX IRQ triggers poll#2. Gap: 3-13μs.
> 
>    Fix: Masked TxOK from interrupt enable register.
> 
>    Result: recv() distribution narrowed, but send-recv unchanged
>    (TX poll + gap overlaps with wire delay).
> 
>    Combined: No improvement beyond H1 alone.

I don't follow. You need the TX interrupt to clean up the skb you just
sent. If you don't get it, there will be a watchdog complaining.
The RX interrupt signals that you have a new packet waiting.

> == The discrepancy I can't explain ==
> 
> send-recv breakdown with H1 applied (from sched_switch trace):
> 
>   Component         Mean    Tail (P99/Max)
>   -------------------------------------------
>   send() syscall    ~2μs    stable
>   Wire RTT          ~7μs    stable
>   RX IRQ handling   ~2μs    ~3μs
>   Sched delay       ~1μs    ~10μs
>   Poll start delay  ~1μs    ~13μs
>   RX poll (napi)    ~5μs    ~6μs
>   recv() return     ~1μs    stable
>   -------------------------------------------
>   Trace total       ~19μs   ~31μs
> 
> But SOEM measures send-recv tails of 50-80μs using
> clock_gettime(CLOCK_MONOTONIC) around the send()+recv() pair.
> 
> That's a gap of roughly 19-49μs that my trace doesn't cover.
> 
> My tracing (function_graph on rtl8169_poll + sched_switch events)
> covers the kernel NAPI path, but I suspect there are latency
> sources outside this window — in the send() path, the
> syscall entry/exit, the userspace-to-kernel transitions, or
> somewhere else I haven't instrumented.

If everything, that is the network driver (its interrupt thread) and
your user application (doing send + receive), runs on CPU3, then with
the function tracer enabled (as you have it) looking at CPU3 is enough.
Receiving a packet starts with an interrupt, followed by the wakeup of
the threaded interrupt. Within this thread you read the packet from the
NIC, stuff it into the socket and wake the user application. The
application does recv() + send(), and some time after send() returns
there should be another interrupt for the TX cleanup. At this point,
your packet is gone.

This covers the whole path.

> == Question ==
> 
> What could account for the ~20-50μs gap between what ftrace
> shows in the NAPI/sched path (~31μs worst case) and what the
> application measures end-to-end (50-80μs)?

There should be no gap. There is /sys/kernel/tracing/trace_marker. You
can write to it and what you write will show up in your trace, so you
can map events in your application to events in the kernel trace. You
need to be careful with the clock: the trace uses sched_clock by default
while you use CLOCK_MONOTONIC (but the trace clock can be changed via
/sys/kernel/tracing/trace_clock).

> Are there known latency sources in the PREEMPT_RT network path
> — such as softirq-to-userspace return, AF_PACKET socket
> processing, syscall exit, or something else — that
> function_graph + sched_switch tracing would not capture?

You add the packet to the socket queue and wake the socket. There will
be a wakeup event and so on. In general function_graph covers every
function, so there are no holes.

> I've tried the specific fixes described above. Happy to collect
> additional traces with different tracepoints if someone can
> suggest what to instrument next.
> 
> Thanks,
> 
> —
> Minsu Kim
> Undergraduate Student
> School of Mechanical Engineering & Computer Science
> Kyungpook National University (KNU), Daegu, South Korea

Sebastian