public inbox for dev@dpdk.org
From: Stephen Hemminger <stephen@networkplumber.org>
To: Xavier Guillaume <xavier.guillaume@ovhcloud.com>
Cc: <dev@dpdk.org>, <stable@dpdk.org>
Subject: Re: [PATCH v2 3/3] net/af_packet: support jumbo frames
Date: Thu, 12 Mar 2026 09:20:50 -0700
Message-ID: <20260312092050.0a85818b@phoenix.local>
In-Reply-To: <20260312133248.3435717-1-xavier.guillaume@ovhcloud.com>

On Thu, 12 Mar 2026 14:32:48 +0100
Xavier Guillaume <xavier.guillaume@ovhcloud.com> wrote:

> Hi Stephen,
> 
> > I wonder if TPACKET header could go in mbuf headroom.
> > And also, could the copy on receive be avoided?  
> 
> Thank you for your review and the interesting questions. I had not
> considered these angles, so I took some time to look into it.
> 
> As far as I understand, the current RX path copies the packet data
> from the ring frame into an mbuf so that the ring slot can be returned to
> the kernel immediately after the copy. This keeps the ring available
> for new packets regardless of how long the application holds the mbuf.
> 
> Going down the zero-copy route would introduce a strong coupling
> between kernel-managed ring frames and DPDK-managed mbufs: the ring
> slot could not be released until the last reference to the mbuf is
> freed, which risks stalling the ring under any buffering.
> 
> Because of this copy and the resulting decoupling, the TPACKET header
> does not need to be carried into the mbuf at all. It is only read
> for metadata (packet length, VLAN, timestamp) before the frame is
> released back to the kernel.
> 
> In this context, my feeling is that the introduced risks outweigh the
> gains (the memcpy looks relatively small compared to the full kernel
> networking stack af_packet goes through).
> 
> Did I miss something?
> 
> Regards,
> Xavier

Copies matter, especially for larger packets.

I noticed that later kernels support TPACKET_V3 with sendmsg and MSG_ZEROCOPY.
It was added in the 4.18 kernel, so it should be OK; the downside is that it
moves from the ring to a syscall per packet rather than a syscall per burst.

For RX, you are right that it adds complexity.

I did some brainstorming (using AI as a check), and it looks like some mixed
mode might work, where it uses zero copy on Rx until some high watermark is
reached. Something like:


## The design

The receive path becomes:

1. At queue setup, register the entire mmap'd region as an external memory zone that DPDK knows about (via `rte_extmem_register` if needed for IOVA).

2. On each received frame, allocate an mbuf but attach it to the ring frame via `rte_pktmbuf_attach_extbuf` instead of copying. The `shinfo` free callback atomically sets `tp_status = TP_STATUS_KERNEL` to release the frame back to the kernel.

3. Advance `framenum` as normal — the frame stays owned by userspace until the mbuf is freed.
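
As a rough illustration of step 2, here is a minimal model of the release-on-free idea in plain C. It does not use DPDK or kernel headers: `sk_mbuf`, `sk_frame_hdr`, `sk_attach` and `sk_free` are hypothetical stand-ins, and the `SK_TP_STATUS_*` values only mirror the ones in `linux/if_packet.h`. The callback has the same `(addr, opaque)` shape as the extbuf free callback in `rte_mbuf.h`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in values mirroring TP_STATUS_KERNEL / TP_STATUS_USER. */
#define SK_TP_STATUS_KERNEL 0u
#define SK_TP_STATUS_USER   1u

/* Hypothetical stand-in for the per-frame header in the mmap'd ring. */
struct sk_frame_hdr {
    _Atomic uint32_t tp_status;
};

typedef void (*sk_free_cb)(void *addr, void *opaque);

/* Hypothetical stand-in for an mbuf with an attached external buffer. */
struct sk_mbuf {
    void *buf_addr;
    sk_free_cb free_cb;
    void *opaque;
};

/* Free callback: hand the ring frame back to the kernel. */
static void sk_release_frame(void *addr, void *opaque)
{
    struct sk_frame_hdr *hdr = opaque;

    (void)addr;
    atomic_store_explicit(&hdr->tp_status, SK_TP_STATUS_KERNEL,
                          memory_order_release);
}

/* RX path: point the mbuf at the frame payload instead of copying it. */
static void sk_attach(struct sk_mbuf *m, struct sk_frame_hdr *hdr,
                      void *payload)
{
    /* Only frames the kernel has handed to userspace may be attached. */
    assert((atomic_load_explicit(&hdr->tp_status, memory_order_acquire) &
            SK_TP_STATUS_USER) != 0);
    m->buf_addr = payload;
    m->free_cb = sk_release_frame;
    m->opaque = hdr;
}

/* Application side: freeing the mbuf is what releases the ring slot. */
static void sk_free(struct sk_mbuf *m)
{
    if (m->free_cb != NULL)
        m->free_cb(m->buf_addr, m->opaque);
    m->buf_addr = NULL;
    m->free_cb = NULL;
    m->opaque = NULL;
}
```

The point of the model is only the ownership handoff: the frame stays in the userspace state from `sk_attach` until `sk_free`, however long the application keeps the mbuf.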

## The hard part: ring backpressure

This is the real design question. In the copy path, frames are returned to the kernel immediately in the RX loop. With zero-copy, a frame is held until the application frees the mbuf. If the app is slow or holds references (e.g., reassembly, batching into a burst for a worker core), you burn through ring slots fast.

A few options:

- **Large ring** — bump `framecnt` significantly. Memory is cheap and the ring is already mmap'd. For a capture workload this is usually fine.
- **Fallback to copy** — track how many frames are outstanding. When it crosses a watermark (say 75% of the ring), fall back to the memcpy path for new packets so you keep returning frames to the kernel. This is what the AF_XDP PMD does conceptually with its fill ring management.
- **Just drop** — if the ring is exhausted, that's backpressure. The kernel drops packets, which shows up in `tp_drops`. For monitoring/capture workloads this is often acceptable.

The fallback approach is probably the most robust for a general-purpose patch. Something roughly like:

```c
/* threshold: once outstanding frames reach 75% of the ring, fall back to copy */
bool zero_copy = (outstanding_frames < (framecount * 3 / 4));

if (zero_copy) {
    /* assumes one preallocated shinfo per slot: re-arm its refcnt,
     * which drops to 0 when the previous mbuf on this slot is freed */
    rte_mbuf_ext_refcnt_set(shinfo, 1);
    /* attach extbuf pointing into ring frame (resets data_off and data_len) */
    rte_pktmbuf_attach_extbuf(mbuf, pbuf, pbuf_iova, data_len, shinfo);
    rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
    /* do NOT set tp_status = TP_STATUS_KERNEL here; callback does it */
    outstanding_frames++; /* needs to be atomic if mbufs are freed on another lcore */
} else {
    /* copy path as before: the frame goes back to the kernel right away */
    rte_pktmbuf_pkt_len(mbuf) = rte_pktmbuf_data_len(mbuf) = ppd->tp_snaplen;
    memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, ppd->tp_snaplen);
    ppd->tp_status = TP_STATUS_KERNEL;
}
```

The `shinfo` callback would need an atomic decrement on the outstanding counter plus the `tp_status` write. You'd pre-allocate one `rte_mbuf_ext_shared_info` per frame slot at init time, each wired to its corresponding `tpacket2_hdr`.
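
A minimal sketch of that per-slot wiring, again in plain C without DPDK headers. Everything named `demo_*` is hypothetical: `demo_slot` stands in for the opaque pointer carried by each preallocated `rte_mbuf_ext_shared_info`, and `demo_queue` for the RX queue state holding the outstanding counter.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_TP_STATUS_KERNEL 0u

struct demo_queue;

/* One per frame slot, allocated once at init and passed as callback opaque. */
struct demo_slot {
    _Atomic uint32_t *tp_status;   /* points at this frame's status word */
    struct demo_queue *q;          /* owning queue, for the counter */
};

struct demo_queue {
    unsigned int framecount;
    atomic_uint outstanding;       /* frames currently held by mbufs */
    struct demo_slot *slots;
};

/* Wire slot i to status word i. Returns 0 on success, -1 on failure. */
static int demo_queue_init(struct demo_queue *q, _Atomic uint32_t *status,
                           unsigned int framecount)
{
    assert(framecount > 0);
    q->slots = calloc(framecount, sizeof(*q->slots));
    if (q->slots == NULL)
        return -1;
    q->framecount = framecount;
    atomic_init(&q->outstanding, 0);
    for (unsigned int i = 0; i < framecount; i++) {
        q->slots[i].tp_status = &status[i];
        q->slots[i].q = q;
    }
    return 0;
}

/* Watermark test: zero-copy only while under 75% of the ring is held. */
static bool demo_use_zero_copy(struct demo_queue *q)
{
    return atomic_load_explicit(&q->outstanding, memory_order_relaxed) <
           q->framecount * 3 / 4;
}

/* RX path, zero-copy branch: account for a frame handed to an mbuf. */
static void demo_hold(struct demo_slot *slot)
{
    atomic_fetch_add_explicit(&slot->q->outstanding, 1, memory_order_relaxed);
}

/* Free callback: release the frame to the kernel, then drop the count. */
static void demo_free_cb(void *addr, void *opaque)
{
    struct demo_slot *slot = opaque;

    (void)addr;
    atomic_store_explicit(slot->tp_status, DEMO_TP_STATUS_KERNEL,
                          memory_order_release);
    atomic_fetch_sub_explicit(&slot->q->outstanding, 1, memory_order_release);
}
```

Atomics are used because the free can happen on a different core from the RX loop. The memory orderings here are a guess at what is sufficient, not something measured.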

One subtlety: `framenum` advancement is no longer gated on the current frame being released. You're advancing past frames that are still in-flight, so when you wrap around you need to know which frames are actually available.

The existing `tp_status` check at the top of the loop covers frames the kernel still owns, but probably not frames held by the application: the kernel sets `TP_STATUS_USER` when it hands a frame over and, as far as I can tell, nothing clears it until userspace writes `TP_STATUS_KERNEL`. A held frame would therefore still look ready when you come back around. So the loop also needs a per-frame in-flight marker (a flag in the per-slot state, or a bitmap) and must stop when it reaches a frame that is still held, the same way it stops today on a frame owned by the kernel.
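
A small single-threaded model of the wrap-around case may help when checking this. It assumes (worth verifying against the PMD and the kernel) that a frame keeps `TP_STATUS_USER` until userspace writes `TP_STATUS_KERNEL`. Under that assumption a held frame needs its own marker, shown here as a hypothetical `in_flight` array next to the status words. All `wrap_*` names are made up for the example, and real code would need atomics for the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WRAP_TP_STATUS_KERNEL 0u
#define WRAP_TP_STATUS_USER   1u
#define WRAP_FRAMES           4u

/* Hypothetical model of the ring: status words plus a held-frame marker. */
struct wrap_ring {
    uint32_t tp_status[WRAP_FRAMES];
    bool in_flight[WRAP_FRAMES];
    unsigned int framenum;
};

/* True only if the current frame holds new data and is not still held. */
static bool wrap_frame_ready(const struct wrap_ring *r)
{
    unsigned int i = r->framenum;

    if ((r->tp_status[i] & WRAP_TP_STATUS_USER) == 0)
        return false;            /* kernel still owns the frame */
    return !r->in_flight[i];     /* an mbuf may still point into it */
}

/* Zero-copy receive: mark the frame held and advance past it. */
static void wrap_take(struct wrap_ring *r)
{
    assert(wrap_frame_ready(r));
    r->in_flight[r->framenum] = true;
    r->framenum = (r->framenum + 1) % WRAP_FRAMES;
}

/* Free callback: clear the marker first, then release to the kernel. */
static void wrap_release(struct wrap_ring *r, unsigned int i)
{
    r->in_flight[i] = false;
    r->tp_status[i] = WRAP_TP_STATUS_KERNEL;
}
```

In this model, after taking all four frames the status word of frame 0 still reads `WRAP_TP_STATUS_USER`, and only the `in_flight` flag stops the loop from handing the same frame out twice.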
