Re: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stephen Hemminger <stephen@networkplumber.org>
To: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
Cc: <dev@dpdk.org>, <xavier.guillaume@ovhcloud.com>,
	<robin.l.lin@ericsson.com>
Subject: Re: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames
Date: Mon, 23 Mar 2026 16:42:05 -0700	[thread overview]
Message-ID: <20260323164205.0509775e@phoenix.local> (raw)
In-Reply-To: <20260318094726.1284454-1-sriram.yagnaraman@ericsson.com>

On Wed, 18 Mar 2026 10:47:26 +0100
Sriram Yagnaraman <sriram.yagnaraman@ericsson.com> wrote:

> Enable jumbo frame reception with default mbuf data room size by
> chaining multiple mbufs when packet exceeds single mbuf tailroom.
> 
> Scatter Rx is only enabled when RTE_ETH_RX_OFFLOAD_SCATTER is
> requested. Packets are dropped if they exceed single mbuf size
> and scatter is not enabled, or if mbuf allocation fails during
> chaining. Error counter rx_dropped_pkts tracks all drops.
> 
> This allows receiving 9KB jumbo frames using standard 2KB mbufs,
> chaining ~5 segments per jumbo packet.
> 
> Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@ericsson.com>
> ---

In general looks good, but AI review found some things.

Summary:

Thanks for the v5, Sriram. A few comments below.

1) pkt_len is uint16_t but tp_snaplen is __u32

The scatter helper also takes uint16_t data_len, so anything above 64KB silently truncates. For jumbo frames this is fine in practice, but the type mismatch is a latent bug. Consider uint32_t for both, or at least a comment explaining why uint16_t is sufficient.

2) Scatter alloc failure should bump rx_nombuf, not just rx_dropped_pkts

When rte_pktmbuf_alloc() fails for a chained segment inside eth_af_packet_rx_scatter(), only rx_dropped_pkts is incremented. The head mbuf alloc failure correctly bumps rx_nombuf. The scatter case is also a no-mbuf condition and should be reported as such, otherwise operators can't distinguish memory pressure from oversized-packet drops in the stats.

3) Reclaiming headroom in chained segments

The prepend/zero-data_len/zero-pkt_len sequence works but is roundabout. Setting m->data_off = 0 directly would be cleaner and avoid the intermediate state where data_len is non-zero then immediately overwritten.

4) bufs[num_rx] fix looks correct

The change from bufs[i] to bufs[num_rx] fixes a pre-existing gap in the output array when packets are dropped via continue. Good catch.

Overall the structure looks solid — the scatter helper extraction is clean, the goto-based frame release avoids the earlier use-after-free risk, and the queue setup validation is correct.

Full detail:

## Review: [PATCH v5] net/af_packet: add multi-segment mbuf support for jumbo frames

### Error: `eth_af_packet_rx_scatter()` leaks chained mbufs on allocation failure

When `rte_pktmbuf_alloc()` fails inside the `while` loop, the function returns `-1`. The caller then calls `rte_pktmbuf_free(mbuf)` on the head mbuf. However, before the failure, previous iterations of the loop have already linked allocated segments via `m->next`. The question is whether `rte_pktmbuf_free()` walks the chain — it does, so segments linked before the failure *will* be freed. But there's a subtlety: `mbuf->nb_segs` is incremented for each successfully chained segment, and `rte_pktmbuf_free()` uses `nb_segs` to walk the chain. Since `nb_segs` and the `->next` linkage are kept in sync, this is actually correct.

On closer inspection: **no leak here.** Disregard — the caller's `rte_pktmbuf_free(mbuf)` correctly frees the entire chain. (Keeping this analysis transparent so you can verify the reasoning.)

### Error: `pkt_len` declared as `uint16_t` but `tp_snaplen` is `uint32_t` — silent truncation

`ppd->tp_snaplen` is `__u32` in the kernel's `tpacket2_hdr`. The patch declares `pkt_len` as `uint16_t`:

```c
uint16_t pkt_len;
...
pkt_len = ppd->tp_snaplen;
```

For jumbo frames up to 9KB this works, but `tp_snaplen` can be up to 65535 (or larger with cooked sockets). If `tp_snaplen > 65535` this silently truncates, and even values in the 32768–65535 range could cause issues if `pkt_len` is used in signed comparisons. More practically, the scatter helper also takes `uint16_t data_len`, capping the maximum receivable packet at 64KB.

Since this patch specifically targets jumbo frames (9KB), the practical risk is low, but the type should match the source. Consider using `uint32_t` for `pkt_len` and the `data_len` parameter of `eth_af_packet_rx_scatter()`, or at minimum add a bounds check / comment explaining why `uint16_t` is sufficient.

### Error: `rte_pktmbuf_pkt_len(mbuf)` not set in the single-mbuf fast path before VLAN reinsertion

In the patched code, the single-mbuf path sets `data_len` but defers setting `pkt_len` until after the VLAN/timestamp block:

```c
if (pkt_len <= rte_pktmbuf_tailroom(mbuf)) {
    memcpy(rte_pktmbuf_mtod(mbuf, void *), pbuf, pkt_len);
    rte_pktmbuf_data_len(mbuf) = pkt_len;
} else if (...) { ... }

rte_pktmbuf_pkt_len(mbuf) = pkt_len;

/* check for vlan info */
if (ppd->tp_status & TP_STATUS_VLAN_VALID) {
    ...
    if (!pkt_q->vlan_strip && rte_vlan_insert(&mbuf))
```

`rte_vlan_insert()` reads `mbuf->pkt_len` to determine packet length. At that point `pkt_len` has been assigned to `rte_pktmbuf_pkt_len(mbuf)` — wait, looking again, `pkt_len` assignment is on the line *before* the VLAN block. Let me re-read the patch flow:

```
        } else { ... goto release_frame; }

        rte_pktmbuf_pkt_len(mbuf) = pkt_len;   // <-- this line

        /* check for vlan info */
```

Yes, `pkt_len` is set before the VLAN check. This is correct. Disregard.

### Warning: `rx_nombuf` not incremented on scatter allocation failure

In the original code, `rx_nombuf` is incremented when `rte_pktmbuf_alloc()` fails for the head mbuf. In the scatter path, when `rte_pktmbuf_alloc()` fails for a *chained* segment, only `rx_dropped_pkts` is incremented. This is arguably a no-mbuf condition and should also increment `rx_nombuf` so that `rte_eth_stats.rx_nombuf` accurately reflects memory pressure. The current code makes it impossible to distinguish "dropped because too large and no scatter" from "dropped because mbuf pool exhausted during scatter."

Suggested fix: increment `pkt_q->rx_nombuf` instead of (or in addition to) `rx_dropped_pkts` when scatter allocation fails.

### Warning: `rte_pktmbuf_prepend()` return value unchecked

In `eth_af_packet_rx_scatter()`:

```c
rte_pktmbuf_prepend(m, rte_pktmbuf_headroom(m));
m->pkt_len = 0;
m->data_len = 0;
```

`rte_pktmbuf_prepend()` can return `NULL` if the headroom requested exceeds available headroom (shouldn't happen for a freshly allocated mbuf, but it's defensive). More importantly, after `rte_pktmbuf_prepend()`, `data_len` is increased by the headroom amount, but then immediately overwritten to `0`. The `pkt_len` is similarly set to `0`. This sequence works but is fragile — a cleaner approach would be to directly manipulate `m->data_off = 0` to reclaim the headroom, which is what this is effectively doing. Consider:

```c
m->data_off = 0;
```

This is simpler, less error-prone, and avoids the intermediate state where `data_len` contains a non-zero value that is immediately discarded.

### Warning: `scatter_enabled` read from `rxmode.offloads` in `rx_queue_setup`

Per AGENTS.md guidance, reading from `dev->data->dev_conf.rxmode` after configure can become stale. In this case, `rx_queue_setup` is called between `dev_configure` and `dev_start`, so the value should be consistent. However, the canonical pattern in DPDK is to check `dev->data->scattered_rx` or the offloads via `dev->data->dev_conf.rxmode.offloads` (which is acceptable in queue setup since it runs in the configure-to-start window). This is borderline acceptable but worth noting — if `mtu_set` is called after queue setup, the scatter state won't be re-evaluated. The AF_PACKET PMD's `mtu_set` doesn't re-select Rx functions or re-evaluate scatter, which is a pre-existing gap not introduced by this patch.

### Info: `bufs[i]` changed to `bufs[num_rx]` — correct fix for existing bug

The original code used `bufs[i]` which would leave gaps in the output array when packets are dropped (via `continue`). The change to `bufs[num_rx]` is correct and actually fixes a pre-existing bug in the drop path. Good catch in this patch.

### Info: Consider adding `rx_nombuf` tracking for the scatter failure case

As noted above, distinguishing "no mbuf" from "oversized drop" in statistics would be valuable for debugging. Minor suggestion for a follow-up.

next prev parent reply	other threads:[~2026-03-23 23:42 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-12  7:34 [PATCH v3] net/af_packet: add multi-segment mbuf support for jumbo frames Sriram Yagnaraman
2026-03-12 16:06 ` Stephen Hemminger
2026-03-12 16:34 ` Stephen Hemminger
2026-03-13  9:10 ` [PATCH v4] " Sriram Yagnaraman
2026-03-18  9:47   ` [PATCH v5] " Sriram Yagnaraman
2026-03-23 23:42     ` Stephen Hemminger [this message]
2026-03-24 16:29     ` [PATCH v6] net/af_packet: add Rx scatter " Sriram Yagnaraman
2026-03-25 17:31       ` Stephen Hemminger
2026-03-26 13:01         ` Sriram Yagnaraman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260323164205.0509775e@phoenix.local \
    --to=stephen@networkplumber.org \
    --cc=dev@dpdk.org \
    --cc=robin.l.lin@ericsson.com \
    --cc=sriram.yagnaraman@ericsson.com \
    --cc=xavier.guillaume@ovhcloud.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.