* EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
@ 2026-01-30 5:33 김민수
2026-03-12 16:15 ` Sebastian Andrzej Siewior
2026-03-13 10:15 ` Alexander Dahl
0 siblings, 2 replies; 4+ messages in thread
From: 김민수 @ 2026-01-30 5:33 UTC (permalink / raw)
To: linux-rt-users
Hi all,
I'm working on reducing worst-case EtherCAT send-recv round-trip
latency on PREEMPT_RT 6.8-rt8, targeting under 50μs. I've done
ftrace analysis over ~50k cycles and tried two kernel-side fixes,
but the measured tail (50-80μs) exceeds what my trace can account
for (~31μs). I'd appreciate help identifying what I'm missing.
== Environment ==
- Kernel: 6.8.0-rt8 (PREEMPT_RT)
- CPU: Intel, core 3 isolated
- isolcpus=domain,managed_irq,3 nohz_full=3 rcu_nocbs=3
- intel_pstate=disable idle=poll intel_idle.max_cstate=0
- NIC: RTL8168H (r8169 driver), IRQ pinned to CPU3
- Coalescing off: rx-usecs=0, rx-frames=1
- EtherCAT master: SOEM (Open EtherCAT Society)
- SCHED_FIFO 99, pinned to CPU3, 1ms cycle
- AF_PACKET raw socket (ETH_P_ECAT)
- Measures send-recv with clock_gettime(CLOCK_MONOTONIC)
== What I've tried ==
Using ftrace (function_graph + sched_switch) on CPU3:
1) ksoftirqd preemption during NAPI poll
During rtl8169_poll() → napi_gro_receive() → sock_def_readable()
→ wake_up(), SOEM (FIFO 99) preempts ksoftirqd/3 (CFS) mid-poll.
SOEM then blocks on rt_spin_lock (socket wq_head->lock, held by
wake_up) → D state ~2μs → returns to ksoftirqd → preempts again
after recv() completes. Measured overhead: 5.5μs mean, up to 25μs.
Fix: chrt -f -p 99 ksoftirqd/3
Result: D state eliminated, recv() peak improved ~3μs, but
max got worse (77μs → 88μs).
2) TX completion IRQ elimination
Each cycle generates two polls: TX IRQ triggers poll#1 (rtl_tx
cleanup only, ~1.3μs), then RX IRQ triggers poll#2. Gap: 3-13μs.
Fix: masked TxOK in the interrupt enable register.
Result: recv() distribution narrowed, but send-recv unchanged
(TX poll + gap overlaps with wire delay).
Combined: no improvement beyond fix 1) alone.
== The discrepancy I can't explain ==
send-recv breakdown with fix 1) applied (from sched_switch trace):
Component Mean Tail (P99/Max)
-------------------------------------------
send() syscall ~2μs stable
Wire RTT ~7μs stable
RX IRQ handling ~2μs ~3μs
Sched delay ~1μs ~10μs
Poll start delay ~1μs ~13μs
RX poll (napi) ~5μs ~6μs
recv() return ~1μs stable
-------------------------------------------
Trace total ~19μs ~31μs
But SOEM measures send-recv tails of 50-80μs using
clock_gettime(CLOCK_MONOTONIC) around the send()+recv() pair.
That's a gap of roughly 19-49μs that my trace doesn't cover.
My tracing (function_graph on rtl8169_poll + sched_switch events)
covers the kernel NAPI path, but I suspect there are latency
sources outside this window — in the send() path, the
syscall entry/exit, the userspace-to-kernel transitions, or
somewhere else I haven't instrumented.
== Question ==
What could account for the ~20-50μs gap between what ftrace
shows in the NAPI/sched path (~31μs worst case) and what the
application measures end-to-end (50-80μs)?
Are there known latency sources in the PREEMPT_RT network path
— such as softirq-to-userspace return, AF_PACKET socket
processing, syscall exit, or something else — that
function_graph + sched_switch tracing would not capture?
I've tried the specific fixes described above. Happy to collect
additional traces with different tracepoints if someone can
suggest what to instrument next.
Thanks,
--
Minsu Kim
Undergraduate Student
School of Mechanical Engineering & Computer Science
Kyungpook National University (KNU), Daegu, South Korea
Email: m.kim@knu.ac.kr
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
2026-01-30 5:33 EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance 김민수
@ 2026-03-12 16:15 ` Sebastian Andrzej Siewior
2026-03-13 10:15 ` Alexander Dahl
1 sibling, 0 replies; 4+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-12 16:15 UTC (permalink / raw)
To: 김민수; +Cc: linux-rt-users
On 2026-01-30 14:33:58 [+0900], 김민수 wrote:
> Hi all,
>
> I'm working on reducing worst-case EtherCAT send-recv round-trip
> latency on PREEMPT_RT 6.8-rt8, targeting under 50μs. I've done
> ftrace analysis over ~50k cycles and tried two kernel-side fixes,
> but the measured tail (50-80μs) exceeds what my trace can account
> for (~31μs). I'd appreciate help identifying what I'm missing.
>
> == Environment ==
>
> - Kernel: 6.8.0-rt8 (PREEMPT_RT)
> - CPU: Intel, core 3 isolated
> - isolcpus=domain,managed_irq,3 nohz_full=3 rcu_nocbs=3
> - intel_pstate=disable idle=poll intel_idle.max_cstate=0
> - NIC: RTL8168H (r8169 driver), IRQ pinned to CPU3
> - Coalescing off: rx-usecs=0, rx-frames=1
> - EtherCAT master: SOEM (Open EtherCAT Society)
> - SCHED_FIFO 99, pinned to CPU3, 1ms cycle
> - AF_PACKET raw socket (ETH_P_ECAT)
> - Measures send-recv with clock_gettime(CLOCK_MONOTONIC)
>
> == What I've tried ==
>
> Using ftrace (function_graph + sched_switch) on CPU3:
>
> 1) ksoftirqd preemption during NAPI poll
>
> During rtl8169_poll() → napi_gro_receive() → sock_def_readable()
> → wake_up(), SOEM (FIFO 99) preempts ksoftirqd/3 (CFS) mid-poll.
> SOEM then blocks on rt_spin_lock (socket wq_head->lock, held by
> wake_up) → D state ~2μs → returns to ksoftirqd → preempts again
> after recv() completes. Measured overhead: 5.5μs mean, up to 25μs.
>
> Fix: chrt -f -p 99 ksoftirqd/3
>
> Result: D state eliminated, recv() peak improved ~3μs, but
> max got worse (77μs → 88μs).
Why is ksoftirqd active to begin with? If your CPU is isolated you
shouldn't have anything waking ksoftirqd. Everything network-related
should happen within the threaded IRQ.
> 2) TX completion IRQ elimination
>
> Each cycle generates two polls: TX IRQ triggers poll#1 (rtl_tx
> cleanup only, ~1.3μs), then RX IRQ triggers poll#2. Gap: 3-13μs.
>
> Fix: Masked TxOK from interrupt enable register.
>
> Result: recv() distribution narrowed, but send-recv unchanged
> (TX poll + gap overlaps with wire delay).
>
> Combined: no improvement beyond fix 1) alone.
I don't follow. You need the TX interrupt to clean up the skb you just
sent. If you don't, the netdev watchdog will complain.
The RX interrupt signals that you have a new packet waiting.
> == The discrepancy I can't explain ==
>
> send-recv breakdown with fix 1) applied (from sched_switch trace):
>
> Component Mean Tail (P99/Max)
> -------------------------------------------
> send() syscall ~2μs stable
> Wire RTT ~7μs stable
> RX IRQ handling ~2μs ~3μs
> Sched delay ~1μs ~10μs
> Poll start delay ~1μs ~13μs
> RX poll (napi) ~5μs ~6μs
> recv() return ~1μs stable
> -------------------------------------------
> Trace total ~19μs ~31μs
>
> But SOEM measures send-recv tails of 50-80μs using
> clock_gettime(CLOCK_MONOTONIC) around the send()+recv() pair.
>
> That's a gap of roughly 19-49μs that my trace doesn't cover.
>
> My tracing (function_graph on rtl8169_poll + sched_switch events)
> covers the kernel NAPI path, but I suspect there are latency
> sources outside this window — in the send() path, the
> syscall entry/exit, the userspace-to-kernel transitions, or
> somewhere else I haven't instrumented.
If everything (the network driver's interrupt thread and your user
application doing send + receive) runs on CPU3, then with function
tracing enabled, as you have, looking at CPU3 is enough.
Receiving a packet starts with an interrupt, followed by waking the
threaded interrupt handler. Within this handler you read the packet
from the NIC, stuff it into the socket and wake the user application.
The application then does recv() + send(), and some time after send()
returns there should be another interrupt for the TX cleanup. At that
point, your packet is gone.
This covers the whole path.
> == Question ==
>
> What could account for the ~20-50μs gap between what ftrace
> shows in the NAPI/sched path (~31μs worst case) and what the
> application measures end-to-end (50-80μs)?
There should be no gap. There is /sys/kernel/tracing/trace_marker. You
can write to it and the write will pop up in your trace, so you can
map your application's phases to events in the kernel trace. You need
to be careful with the clock, as the kernel trace uses sched_clock
while you use CLOCK_MONOTONIC (but the tracing clock can be changed
via the trace_clock file, e.g. to mono).
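The application side of that is just a write(2) to the trace_marker
file; a minimal sketch (the "soem" tag and the helper names here are
made up, pick whatever is convenient):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Format one marker line; kept separate so it is easy to test
 * without tracefs. Returns the formatted length. */
static int marker_line(char *buf, size_t len, const char *tag, long cycle)
{
	return snprintf(buf, len, "soem %s cycle=%ld", tag, cycle);
}

static int marker_fd = -1;

/* Open once at startup; needs tracefs mounted and permission. */
static void marker_open(void)
{
	marker_fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
}

/* Emit a marker into the ftrace buffer, e.g. marker("send_enter", i)
 * right before send() and marker("recv_exit", i) after recv(). */
static void marker(const char *tag, long cycle)
{
	char buf[64];
	int n = marker_line(buf, sizeof(buf), tag, cycle);

	if (marker_fd >= 0 && n > 0)
		(void)!write(marker_fd, buf, (size_t)n);
}
```

With the tracing clock switched to mono, these markers line up
directly with the CLOCK_MONOTONIC timestamps the application records.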
> Are there known latency sources in the PREEMPT_RT network path
> — such as softirq-to-userspace return, AF_PACKET socket
> processing, syscall exit, or something else — that
> function_graph + sched_switch tracing would not capture?
You add the packet to the socket queue and wake the socket, so there
will be a wakeup event and so on. In general function_graph covers
every function, so there are no holes.
> I've tried the specific fixes described above. Happy to collect
> additional traces with different tracepoints if someone can
> suggest what to instrument next.
>
> Thanks,
>
> —
> Minsu Kim
> Undergraduate Student
> School of Mechanical Engineering & Computer Science
> Kyungpook National University (KNU), Daegu, South Korea
Sebastian
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
2026-01-30 5:33 EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance 김민수
2026-03-12 16:15 ` Sebastian Andrzej Siewior
@ 2026-03-13 10:15 ` Alexander Dahl
2026-03-16 8:00 ` Stephane ANCELOT
1 sibling, 1 reply; 4+ messages in thread
From: Alexander Dahl @ 2026-03-13 10:15 UTC (permalink / raw)
To: 김민수; +Cc: linux-rt-users
Hei hei,
On Fri, Jan 30, 2026 at 02:33:58PM +0900, 김민수 wrote:
> Hi all,
>
> I'm working on reducing worst-case EtherCAT send-recv round-trip
> latency on PREEMPT_RT 6.8-rt8, targeting under 50μs. I've done
> ftrace analysis over ~50k cycles and tried two kernel-side fixes,
> but the measured tail (50-80μs) exceeds what my trace can account
> for (~31μs). I'd appreciate help identifying what I'm missing.
>
> == Environment ==
>
> - Kernel: 6.8.0-rt8 (PREEMPT_RT)
> - CPU: Intel, core 3 isolated
> - isolcpus=domain,managed_irq,3 nohz_full=3 rcu_nocbs=3
> - intel_pstate=disable idle=poll intel_idle.max_cstate=0
> - NIC: RTL8168H (r8169 driver), IRQ pinned to CPU3
> - Coalescing off: rx-usecs=0, rx-frames=1
> - EtherCAT master: SOEM (Open EtherCAT Society)
> - SCHED_FIFO 99, pinned to CPU3, 1ms cycle
> - AF_PACKET raw socket (ETH_P_ECAT)
> - Measures send-recv with clock_gettime(CLOCK_MONOTONIC)
>
> == What I've tried ==
>
> Using ftrace (function_graph + sched_switch) on CPU3:
>
> 1) ksoftirqd preemption during NAPI poll
>
> During rtl8169_poll() → napi_gro_receive() → sock_def_readable()
> → wake_up(), SOEM (FIFO 99) preempts ksoftirqd/3 (CFS) mid-poll.
> SOEM then blocks on rt_spin_lock (socket wq_head->lock, held by
> wake_up) → D state ~2μs → returns to ksoftirqd → preempts again
> after recv() completes. Measured overhead: 5.5μs mean, up to 25μs.
>
> Fix: chrt -f -p 99 ksoftirqd/3
>
> Result: D state eliminated, recv() peak improved ~3μs, but
> max got worse (77μs → 88μs).
>
> 2) TX completion IRQ elimination
>
> Each cycle generates two polls: TX IRQ triggers poll#1 (rtl_tx
> cleanup only, ~1.3μs), then RX IRQ triggers poll#2. Gap: 3-13μs.
>
> Fix: Masked TxOK from interrupt enable register.
>
> Result: recv() distribution narrowed, but send-recv unchanged
> (TX poll + gap overlaps with wire delay).
>
> Combined: no improvement beyond fix 1) alone.
If you see latencies due to ksoftirqd, I assume you don't use threaded
NAPI? Did you try?
Quoting from my own notes here:
NAPI processing mostly happens in the eth softirq threads, but not
always; especially under higher system load it might end up in
ksoftirqd, which runs with low priority and whose priority is not
recommended to be changed (with RT).
Link: https://lwn.net/Articles/833840/
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fdd2f0e5c64
Link: https://lwn.net/Articles/853289/
What we did:
+ # enable napi threads and set realtime priority
+ echo 1 > /sys/class/net/eth0/threaded
+ sleep 1
+ pgrep napi/eth0 | xargs -n 1 chrt --pid 49
That eliminated latency spikes for us, on completely different
hardware though. ^^
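What that chrt line does can also be done programmatically once the
NAPI thread's pid is known (from pgrep or /proc); a sketch, assuming
SCHED_FIFO (note that chrt without -f actually defaults to SCHED_RR;
set_fifo_prio is a made-up helper name):

```c
#include <sched.h>
#include <sys/types.h>

/* Give a kthread (e.g. napi/eth0, pid found via pgrep) a FIFO
 * priority, like chrt -f --pid does. Rejects out-of-range
 * priorities; returns 0 on success, -1 on error. */
static int set_fifo_prio(pid_t pid, int prio)
{
	struct sched_param sp = { .sched_priority = prio };

	if (prio < sched_get_priority_min(SCHED_FIFO) ||
	    prio > sched_get_priority_max(SCHED_FIFO))
		return -1;
	return sched_setscheduler(pid, SCHED_FIFO, &sp);
}
```

Changing another task's policy needs CAP_SYS_NICE, same as chrt.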
Greets
Alex
^ permalink raw reply [flat|nested] 4+ messages in thread
* RE: EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance
2026-03-13 10:15 ` Alexander Dahl
@ 2026-03-16 8:00 ` Stephane ANCELOT
0 siblings, 0 replies; 4+ messages in thread
From: Stephane ANCELOT @ 2026-03-16 8:00 UTC (permalink / raw)
To: Alexander Dahl, 김민수; +Cc: linux-rt-users@vger.kernel.org
hi,
In the past, on some RTL8139 chips, I had problems where there was a
register to program IRQ delays. I mean the IRQ signal was asserted by
the chip only after this delay.
Check whether there is a similar register on this chip.
Regards
Steph
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-03-16 8:00 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-01-30 5:33 EtherCAT on PREEMPT_RT: send-recv tail ~50-80μs, looking for guidance 김민수
2026-03-12 16:15 ` Sebastian Andrzej Siewior
2026-03-13 10:15 ` Alexander Dahl
2026-03-16 8:00 ` Stephane ANCELOT
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox