Netdev List
* [PATCH net v2 2/2] octeon_ep_vf: add NULL check for napi_build_skb()
From: David Carlier @ 2026-04-09 18:40 UTC (permalink / raw)
  To: netdev
  Cc: vburru, sedara, srasheed, sburla, andrew+netdev, davem, edumazet,
	kuba, pabeni, horms, linux-kernel, stable, David Carlier
In-Reply-To: <20260409184009.930359-1-devnexen@gmail.com>

napi_build_skb() can return NULL on allocation failure. In
__octep_vf_oq_process_rx(), the result is used directly without a NULL
check in both the single-buffer and multi-fragment paths, leading to a
NULL pointer dereference.

Add NULL checks after both napi_build_skb() calls, properly advancing
descriptors and consuming remaining fragments on failure.

Fixes: 1cd3b407977c ("octeon_ep_vf: add Tx/Rx processing and interrupt support")
Cc: stable@vger.kernel.org
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 .../marvell/octeon_ep_vf/octep_vf_rx.c        | 30 +++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/marvell/octeon_ep_vf/octep_vf_rx.c b/drivers/net/ethernet/marvell/octeon_ep_vf/octep_vf_rx.c
index 7bd1b9b8d7f5..d98247408242 100644
--- a/drivers/net/ethernet/marvell/octeon_ep_vf/octep_vf_rx.c
+++ b/drivers/net/ethernet/marvell/octeon_ep_vf/octep_vf_rx.c
@@ -414,10 +414,15 @@ static int __octep_vf_oq_process_rx(struct octep_vf_device *oct,
 			data_offset = OCTEP_VF_OQ_RESP_HW_SIZE;
 			rx_ol_flags = 0;
 		}
-		rx_bytes += buff_info->len;
-
 		if (buff_info->len <= oq->max_single_buffer_size) {
 			skb = napi_build_skb((void *)resp_hw, PAGE_SIZE);
+			if (!skb) {
+				oq->stats->alloc_failures++;
+				desc_used++;
+				read_idx = octep_vf_oq_next_idx(oq, read_idx);
+				continue;
+			}
+			rx_bytes += buff_info->len;
 			skb_reserve(skb, data_offset);
 			skb_put(skb, buff_info->len);
 			desc_used++;
@@ -427,6 +432,27 @@ static int __octep_vf_oq_process_rx(struct octep_vf_device *oct,
 			u16 data_len;
 
 			skb = napi_build_skb((void *)resp_hw, PAGE_SIZE);
+			if (!skb) {
+				oq->stats->alloc_failures++;
+				desc_used++;
+				read_idx = octep_vf_oq_next_idx(oq, read_idx);
+				data_len = buff_info->len - oq->max_single_buffer_size;
+				while (data_len) {
+					dma_unmap_page(oq->dev, oq->desc_ring[read_idx].buffer_ptr,
+						       PAGE_SIZE, DMA_FROM_DEVICE);
+					buff_info = (struct octep_vf_rx_buffer *)
+						    &oq->buff_info[read_idx];
+					buff_info->page = NULL;
+					if (data_len < oq->buffer_size)
+						data_len = 0;
+					else
+						data_len -= oq->buffer_size;
+					desc_used++;
+					read_idx = octep_vf_oq_next_idx(oq, read_idx);
+				}
+				continue;
+			}
+			rx_bytes += buff_info->len;
 			skb_reserve(skb, data_offset);
 			/* Head fragment includes response header(s);
 			 * subsequent fragments contains only data.
-- 
2.53.0
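
The drain loop in the multi-fragment failure path can be modelled in userspace to check how many descriptors one packet consumes. This is only a sketch of the arithmetic shown in the diff above: `descs_for_packet`, `max_single` and `buf_size` are hypothetical names standing in for the driver's `oq->max_single_buffer_size` and `oq->buffer_size`, and no DMA unmapping is modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Number of Rx descriptors one packet occupies in this simplified model.
 * max_single and buf_size are assumed stand-ins for the driver's
 * oq->max_single_buffer_size and oq->buffer_size.
 */
unsigned int descs_for_packet(uint32_t pkt_len, uint32_t max_single,
			      uint32_t buf_size)
{
	unsigned int used = 1;	/* the head descriptor is always consumed */
	uint32_t data_len;

	assert(buf_size > 0);

	/* Single-buffer path: nothing beyond the head descriptor. */
	if (pkt_len <= max_single)
		return used;

	/* Multi-fragment path: same stepping as the loop in the patch. */
	data_len = pkt_len - max_single;
	while (data_len) {
		if (data_len < buf_size)
			data_len = 0;
		else
			data_len -= buf_size;
		used++;
	}
	return used;
}
```

Under this model, a skb allocation failure has to advance the read index by the same count that the success path would have consumed; otherwise the next iteration would treat a fragment buffer as a packet head.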



* [PATCH net v2 0/3] nfc: fix chained TLV parsing and integer underflow vulnerabilities
From: Lekë Hapçiu @ 2026-04-09 18:59 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, linux-nfc, stable, horms,
	Lekë Hapçiu
In-Reply-To: <20260409164129.GO469338@kernel.org>

From: Lekë Hapçiu <framemain@outlook.com>

These three patches fix vulnerabilities in the NFC LLCP and NCI subsystems
that form an exploit chain.  Each bug is independently reachable from an
unauthenticated NFC peer at ~4 cm range; together they create a path from
controlled heap disclosure to heap corruption.

--- Chain summary ---

  [1/3]  nci/ntf.c — nci_store_general_bytes_nfc_dep()
         u8 integer underflow: when the peer's ATR_RES/ATR_REQ length field
         is smaller than NFC_ATR_RES_GT_OFFSET (15) or NFC_ATR_REQ_GT_OFFSET
         (14), the subtraction wraps to a large u8 value.  min_t(__u8, ...)
         clamps to NFC_ATR_RES_GB_MAXSIZE (47), and a 47-byte memcpy reads
         out-of-bounds data into ndev->remote_gb[].  This corrupted buffer
         is subsequently parsed by nfc_llcp_parse_gb_tlv().

  [2/3]  llcp_commands.c — nfc_llcp_parse_gb_tlv() +
                           nfc_llcp_parse_connection_tlv()
         Two bugs in both TLV parsers.  The first, a u8 offset truncation
         causing an infinite loop, was addressed in v1 of this series.
         This version adds the fix identified during review by Simon Horman:
         the `length` byte is read from peer-controlled data with no check
         that the remainder of the array can accommodate `length` more bytes.
         A crafted `length` advances the `tlv` pointer into adjacent kernel
         memory; the next iteration reads tlv[0]/tlv[1] from that location.
         When combined with [1/3], a crafted `length` in the garbage-filled
         remote_gb[] can walk `tlv` past nfc_llcp_local.remote_gb[] and into
         adjacent struct fields, including sdreq_timer.function at ~+176 bytes,
         enabling a kernel pointer disclosure via sock->remote_miu/getsockopt.

  [3/3]  llcp_core.c — nfc_llcp_recv_snl()
         The SNL TLV parsing loop carries the same missing guards as [2/3].
         Additionally: LLCP_TLV_SDREQ accesses tlv[2] and computes
         `service_name_len = length - 1` (u8 underflow to 255 when length==0,
         causing a 255-byte kernel memory scan via strncmp); and
         LLCP_TLV_SDRES accesses tlv[2] and tlv[3] without verifying
         length >= 2.  Unlike the parsers in [2/3], SDREQ/SDRES are processed
         directly without the llcp_tlv_length[] table protection.  A missing
         skb->len guard also allows tlv_len to underflow to ~65534 if
         skb->len < LLCP_HEADER_SIZE.

--- Individual CVSS (AV:A/AC:L/PR:N/UI:N/S:U) ---

  [1/3]  C:H/I:N/A:L  — 6.5
  [2/3]  C:H/I:N/A:L  — 6.5
  [3/3]  C:H/I:N/A:L  — 6.5

--- Chain CVSS ---

  AV:A/AC:H/PR:N/UI:N/S:C/C:H/I:H/A:H — 8.3

  KASLR bypass via [1/3]+[2/3] makes [3/3] reliably exploitable without
  the race-condition timing required against the bugs in isolation.

All patches carry Cc: stable@vger.kernel.org.


Lekë Hapçiu (3):
  nfc: nci: fix u8 underflow in nci_store_general_bytes_nfc_dep
  nfc: llcp: add TLV length bounds checks in parse_gb_tlv and
    parse_connection_tlv
  nfc: llcp: fix TLV parsing OOB and length underflow in
    nfc_llcp_recv_snl

 net/nfc/llcp_commands.c | 22 ++++++++++++++++++++++
 net/nfc/llcp_core.c     | 13 +++++++++++++
 net/nfc/nci/ntf.c       | 22 ++++++++++++++--------
 3 files changed, 49 insertions(+), 8 deletions(-)

-- 
2.51.0



* [PATCH net v2 1/3] nfc: nci: fix u8 underflow in nci_store_general_bytes_nfc_dep
From: Lekë Hapçiu @ 2026-04-09 18:59 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, linux-nfc, stable, horms,
	Lekë Hapçiu
In-Reply-To: <20260409185958.1821242-1-snowwlake@icloud.com>

From: Lekë Hapçiu <framemain@outlook.com>

nci_store_general_bytes_nfc_dep() computes the number of General Bytes
to copy from an ATR_RES or ATR_REQ frame by subtracting a fixed header
offset from the peer-supplied length field:

  ndev->remote_gb_len = min_t(__u8,
      (atr_res_len - NFC_ATR_RES_GT_OFFSET),   /* offset = 15 */
      NFC_ATR_RES_GB_MAXSIZE);

Both length fields are __u8.  When a malicious NFC-DEP target (POLL mode)
or initiator (LISTEN mode) sends an ATR_RES/ATR_REQ whose length field is
smaller than the fixed offset (< 15 or < 14 respectively), the negative
difference wraps to a large value when min_t() casts it to __u8:

  e.g. atr_res_len = 0 -> (u8)(0 - 15) = 241

min_t(__u8, 241, 47) then yields 47, so the subsequent memcpy reads
47 bytes from beyond the end of the valid activation parameter data into
ndev->remote_gb[].  This buffer is later passed to nfc_llcp_parse_gb_tlv()
as a TLV array, feeding directly into the TLV parser hardened by the
companion patch.

Fix: add an explicit lower-bound check on each length field before the
subtraction.  If the length is smaller than the required offset the frame
is malformed; leave remote_gb_len at zero and skip the memcpy.

Both the POLL (atr_res_len / NFC_ATR_RES_GT_OFFSET = 15) and the LISTEN
(atr_req_len / NFC_ATR_REQ_GT_OFFSET = 14) paths are affected; both are
fixed symmetrically.

Reachability: the ATR_RES is sent by an NFC-DEP target during RF
activation, before any authentication or pairing.  The bug is therefore
reachable from any NFC peer within ~4 cm.

Fixes: a99903ec4566 ("NFC: NCI: Handle Target mode activation")
Cc: stable@vger.kernel.org
Signed-off-by: Lekë Hapçiu <framemain@outlook.com>
---
 net/nfc/nci/ntf.c | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

diff --git a/net/nfc/nci/ntf.c b/net/nfc/nci/ntf.c
index c96512bb8..8eb295580 100644
--- a/net/nfc/nci/ntf.c
+++ b/net/nfc/nci/ntf.c
@@ -631,25 +631,31 @@ static int nci_store_general_bytes_nfc_dep(struct nci_dev *ndev,
 	switch (ntf->activation_rf_tech_and_mode) {
 	case NCI_NFC_A_PASSIVE_POLL_MODE:
 	case NCI_NFC_F_PASSIVE_POLL_MODE:
+		if (ntf->activation_params.poll_nfc_dep.atr_res_len <
+		    NFC_ATR_RES_GT_OFFSET)
+			break;
 		ndev->remote_gb_len = min_t(__u8,
-			(ntf->activation_params.poll_nfc_dep.atr_res_len
-						- NFC_ATR_RES_GT_OFFSET),
+			ntf->activation_params.poll_nfc_dep.atr_res_len
+						- NFC_ATR_RES_GT_OFFSET,
 			NFC_ATR_RES_GB_MAXSIZE);
 		memcpy(ndev->remote_gb,
-		       (ntf->activation_params.poll_nfc_dep.atr_res
-						+ NFC_ATR_RES_GT_OFFSET),
+		       ntf->activation_params.poll_nfc_dep.atr_res
+						+ NFC_ATR_RES_GT_OFFSET,
 		       ndev->remote_gb_len);
 		break;
 
 	case NCI_NFC_A_PASSIVE_LISTEN_MODE:
 	case NCI_NFC_F_PASSIVE_LISTEN_MODE:
+		if (ntf->activation_params.listen_nfc_dep.atr_req_len <
+		    NFC_ATR_REQ_GT_OFFSET)
+			break;
 		ndev->remote_gb_len = min_t(__u8,
-			(ntf->activation_params.listen_nfc_dep.atr_req_len
-						- NFC_ATR_REQ_GT_OFFSET),
+			ntf->activation_params.listen_nfc_dep.atr_req_len
+						- NFC_ATR_REQ_GT_OFFSET,
 			NFC_ATR_REQ_GB_MAXSIZE);
 		memcpy(ndev->remote_gb,
-		       (ntf->activation_params.listen_nfc_dep.atr_req
-						+ NFC_ATR_REQ_GT_OFFSET),
+		       ntf->activation_params.listen_nfc_dep.atr_req
+						+ NFC_ATR_REQ_GT_OFFSET,
 		       ndev->remote_gb_len);
 		break;
 
-- 
2.51.0
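
The wrap described in the commit message can be reproduced in userspace. The sketch below is an illustration, not the kernel code: the function names are hypothetical, the constants 15 and 47 are the values quoted in the description, and the explicit cast to `uint8_t` stands in for what `min_t(__u8, ...)` does to its first argument.

```c
#include <assert.h>
#include <stdint.h>

#define ATR_RES_GT_OFFSET  15	/* value quoted in the patch description */
#define ATR_RES_GB_MAXSIZE 47	/* value quoted in the patch description */

/* Pre-patch arithmetic: the difference is truncated to 8 bits, then clamped. */
uint8_t gb_len_unchecked(uint8_t atr_res_len)
{
	uint8_t diff = (uint8_t)(atr_res_len - ATR_RES_GT_OFFSET);

	return diff < ATR_RES_GB_MAXSIZE ? diff : ATR_RES_GB_MAXSIZE;
}

/* Post-patch arithmetic: a too-short frame yields a zero-length copy. */
uint8_t gb_len_checked(uint8_t atr_res_len)
{
	uint8_t len;

	if (atr_res_len < ATR_RES_GT_OFFSET)
		return 0;

	len = gb_len_unchecked(atr_res_len);
	assert(len <= ATR_RES_GB_MAXSIZE);
	return len;
}
```

With `atr_res_len = 0` the unchecked variant returns 47, which is the length the pre-patch memcpy would use; the checked variant returns 0 for every length below the offset.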



* [PATCH net v2 2/3] nfc: llcp: add TLV length bounds checks in parse_gb_tlv and parse_connection_tlv
From: Lekë Hapçiu @ 2026-04-09 18:59 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, linux-nfc, stable, horms,
	Lekë Hapçiu
In-Reply-To: <20260409185958.1821242-1-snowwlake@icloud.com>

From: Lekë Hapçiu <framemain@outlook.com>

v1 of this fix promoted `offset` from u8 to u16 in both TLV parsers,
preventing the infinite loop when a connection TLV array exceeds 255 bytes.
During review, Simon Horman identified two additional issues that the u16
promotion alone does not address.

Issue 1 - truncated TLV header:

  The loop guard `offset < tlv_array_len` is not sufficient to guarantee
  that reading tlv[0] (type) and tlv[1] (length) is safe.  When exactly
  one byte remains (offset == tlv_array_len - 1) the loop body reads
  tlv[1] one byte past the end of the array.

Issue 2 - peer-controlled `length` field:

  `length` is read from peer-supplied frame data and is not checked against
  the remaining array space before advancing `tlv` and `offset`:

    offset += length + 2;   /* always */
    tlv    += length + 2;   /* may now point past buffer end */

  A crafted `length` advances `tlv` past the array boundary; the following
  iteration reads tlv[0]/tlv[1] from adjacent kernel memory.

  For nfc_llcp_parse_gb_tlv() this is particularly impactful: its input is
  &local->remote_gb[3], a field within nfc_llcp_local.  A large `length`
  can walk `tlv` into adjacent struct fields including sdreq_timer and
  sdreq_timeout_work which contain kernel function pointers at approximately
  +176 and +216 bytes past remote_gb[].  The parsed `type` byte at those
  positions may match a recognized TLV type causing the parser to store
  bytes from the function pointer into local->remote_miu, which is
  subsequently readable via getsockopt().

Issue 3 - zero-length TLV value:

  The llcp_tlv8() and llcp_tlv16() accessor helpers read tlv[2] and
  tlv[2..3] respectively.  The outer guard guarantees `length` bytes of
  value are available past the two-byte header, but when length == 0 it
  only guarantees offset+2 <= tlv_array_len (non-strict), leaving tlv[2]
  out of bounds.  Per-type minimum-length checks are required before each
  accessor call.  Note: llcp_tlv8/16 additionally validate against the
  llcp_tlv_length[] table, providing a second safety layer; the per-type
  checks here make the rejection explicit and avoid silent zero-defaults.

Fix: add two loop-level guards inside each parsing loop:

  if (tlv_array_len - offset < 2)            /* need type + length */
      break;
  [read type, length]
  if (tlv_array_len - offset - 2 < length)   /* need length value bytes */
      break;

Both subtractions are safe: the loop condition guarantees offset <
tlv_array_len; the first guard then guarantees the difference is >= 2,
making the second subtraction non-negative.

Add per-type minimum-length checks before each accessor call:
  - tlv8-based (VERSION, LTO, OPT, RW): require length >= 1
  - tlv16-based (MIUX, WKS):            require length >= 2

Reachability: nfc_llcp_parse_connection_tlv() is reached on receipt of a
CONNECT or CC PDU before any connection is established.
nfc_llcp_parse_gb_tlv() is reached during ATR_RES processing.  Both are
triggerable from any NFC peer within ~4 cm with no authentication.

Reported-by: Simon Horman <horms@kernel.org>
Fixes: 7a06e586b9bf ("NFC: Move LLCP receiver window value to socket structure")
Cc: stable@vger.kernel.org
Signed-off-by: Lekë Hapçiu <framemain@outlook.com>
---
 net/nfc/llcp_commands.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/net/nfc/llcp_commands.c b/net/nfc/llcp_commands.c
index 6937dcb3b..7cc237a6d 100644
--- a/net/nfc/llcp_commands.c
+++ b/net/nfc/llcp_commands.c
@@ -202,25 +202,39 @@ int nfc_llcp_parse_gb_tlv(struct nfc_llcp_local *local,
 		return -ENODEV;
 
 	while (offset < tlv_array_len) {
+		if (tlv_array_len - offset < 2)
+			break;
 		type = tlv[0];
 		length = tlv[1];
+		if (tlv_array_len - offset - 2 < length)
+			break;
 
 		pr_debug("type 0x%x length %d\n", type, length);
 
 		switch (type) {
 		case LLCP_TLV_VERSION:
+			if (length < 1)
+				break;
 			local->remote_version = llcp_tlv_version(tlv);
 			break;
 		case LLCP_TLV_MIUX:
+			if (length < 2)
+				break;
 			local->remote_miu = llcp_tlv_miux(tlv) + 128;
 			break;
 		case LLCP_TLV_WKS:
+			if (length < 2)
+				break;
 			local->remote_wks = llcp_tlv_wks(tlv);
 			break;
 		case LLCP_TLV_LTO:
+			if (length < 1)
+				break;
 			local->remote_lto = llcp_tlv_lto(tlv) * 10;
 			break;
 		case LLCP_TLV_OPT:
+			if (length < 1)
+				break;
 			local->remote_opt = llcp_tlv_opt(tlv);
 			break;
 		default:
@@ -253,16 +267,24 @@ int nfc_llcp_parse_connection_tlv(struct nfc_llcp_sock *sock,
 		return -ENOTCONN;
 
 	while (offset < tlv_array_len) {
+		if (tlv_array_len - offset < 2)
+			break;
 		type = tlv[0];
 		length = tlv[1];
+		if (tlv_array_len - offset - 2 < length)
+			break;
 
 		pr_debug("type 0x%x length %d\n", type, length);
 
 		switch (type) {
 		case LLCP_TLV_MIUX:
+			if (length < 2)
+				break;
 			sock->remote_miu = llcp_tlv_miux(tlv) + 128;
 			break;
 		case LLCP_TLV_RW:
+			if (length < 1)
+				break;
 			sock->remote_rw = llcp_tlv_rw(tlv);
 			break;
 		case LLCP_TLV_SN:
-- 
2.51.0
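
The two loop-level guards can be exercised outside the kernel with a small TLV walker. This is a minimal sketch that assumes the same two-byte type/length header as the parsers above; `count_complete_tlvs` is a hypothetical helper that only counts well-formed entries and does not interpret any TLV type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns how many complete TLVs the array holds, stopping at the first
 * entry whose header or value would extend past tlv_array_len.
 */
unsigned int count_complete_tlvs(const uint8_t *tlv, uint16_t tlv_array_len)
{
	uint16_t offset = 0;
	unsigned int count = 0;

	assert(tlv != NULL || tlv_array_len == 0);

	while (offset < tlv_array_len) {
		uint8_t length;

		/* Guard 1: both the type and the length byte must be present. */
		if (tlv_array_len - offset < 2)
			break;
		length = tlv[1];

		/* Guard 2: the value must fit in what remains after the header. */
		if (tlv_array_len - offset - 2 < length)
			break;

		count++;
		offset += length + 2;
		tlv += length + 2;
	}
	return count;
}
```

Because the second guard ensures `offset + 2 + length <= tlv_array_len`, the `u16` offset cannot overflow and `tlv` never advances past the end of the array.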



* [PATCH net v2 3/3] nfc: llcp: fix TLV parsing OOB and length underflow in nfc_llcp_recv_snl
From: Lekë Hapçiu @ 2026-04-09 18:59 UTC (permalink / raw)
  To: netdev
  Cc: davem, edumazet, kuba, pabeni, linux-nfc, stable, horms,
	Lekë Hapçiu
In-Reply-To: <20260409185958.1821242-1-snowwlake@icloud.com>

From: Lekë Hapçiu <framemain@outlook.com>

nfc_llcp_recv_snl() contains four distinct vulnerabilities.

Issue 1 - missing minimum-length guard on skb:

  nfc_llcp_dsap() and nfc_llcp_ssap() access pdu->data[0] and pdu->data[1]
  unconditionally.  The subsequent computation:

    tlv_len = skb->len - LLCP_HEADER_SIZE;   /* LLCP_HEADER_SIZE = 2 */

  truncates to u16.  If skb->len < 2, the unsigned subtraction wraps at
  unsigned int width and the truncation to u16 yields 65534 (skb->len == 0)
  or 65535 (skb->len == 1), causing the while loop to iterate far beyond
  the skb data.  No guard exists at the dispatch path to prevent this.

  Fix: add `if (skb->len < LLCP_HEADER_SIZE) return;` before any skb->data
  access, matching the pattern already used in nfc_llcp_recv_agf().

Issue 2 - missing per-iteration TLV header guard:

  The loop reads tlv[0] and tlv[1] with no prior check that two bytes
  remain.  When one byte remains, tlv[1] is one byte past the array end.

  Fix: `if (tlv_len - offset < 2) break;`

Issue 3 - peer-controlled `length` field advances tlv past skb end:

  `length` (tlv[1]) is added unconditionally to `offset` and `tlv`
  without verifying that `length` bytes of TLV value exist.  A malicious
  peer sets `length` large enough that `offset` remains below `tlv_len` on
  the next iteration while `tlv` points into adjacent kernel heap.

  Fix: `if (tlv_len - offset - 2 < length) break;`

Issue 4 - per-type minimum-length hazards:

  LLCP_TLV_SDREQ: `service_name_len = length - 1` is u8 arithmetic.  When
  length == 0 this wraps to 255, causing a 255-byte kernel memory scan via
  strncmp.  tlv[2] (tid) is also accessed unconditionally.
  Fix: require length >= 1 before the tid/service_name access.

  LLCP_TLV_SDRES: tlv[2] and tlv[3] are accessed without verifying
  length >= 2.  Unlike the GB/connection parsers, SDREQ/SDRES are not
  processed via llcp_tlv8/16, so the llcp_tlv_length[] table provides no
  protection here.
  Fix: require length >= 2 before the tlv[2]/tlv[3] accesses.

  In both cases a `break` from the inner switch falls through to the
  unconditional `offset += length + 2; tlv += length + 2` at the loop
  tail, correctly advancing past the malformed TLV.  The outer two guards
  break from the while loop entirely.

Reachability: SNL PDUs are processed during LLCP service discovery, before
any connection is established, from any NFC peer within ~4 cm with no
authentication or pairing.

Fixes: 19cfe5843e86 ("NFC: Initial SNL support")
Cc: stable@vger.kernel.org
Signed-off-by: Lekë Hapçiu <framemain@outlook.com>
---
 net/nfc/llcp_core.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/net/nfc/llcp_core.c b/net/nfc/llcp_core.c
index db5bc6a87..16acf7c2b 100644
--- a/net/nfc/llcp_core.c
+++ b/net/nfc/llcp_core.c
@@ -1284,6 +1284,11 @@ static void nfc_llcp_recv_snl(struct nfc_llcp_local *local,
 	size_t sdres_tlvs_len;
 	HLIST_HEAD(nl_sdres_list);
 
+	if (skb->len < LLCP_HEADER_SIZE) {
+		pr_err("Malformed SNL PDU\n");
+		return;
+	}
+
 	dsap = nfc_llcp_dsap(skb);
 	ssap = nfc_llcp_ssap(skb);
 
@@ -1300,11 +1305,17 @@ static void nfc_llcp_recv_snl(struct nfc_llcp_local *local,
 	sdres_tlvs_len = 0;
 
 	while (offset < tlv_len) {
+		if (tlv_len - offset < 2)
+			break;
 		type = tlv[0];
 		length = tlv[1];
+		if (tlv_len - offset - 2 < length)
+			break;
 
 		switch (type) {
 		case LLCP_TLV_SDREQ:
+			if (length < 1)
+				break;
 			tid = tlv[2];
 			service_name = (char *) &tlv[3];
 			service_name_len = length - 1;
@@ -1369,6 +1380,8 @@ static void nfc_llcp_recv_snl(struct nfc_llcp_local *local,
 			break;
 
 		case LLCP_TLV_SDRES:
+			if (length < 2)
+				break;
 			mutex_lock(&local->sdreq_lock);
 
 			pr_debug("LLCP_TLV_SDRES: searching tid %d\n", tlv[2]);
-- 
2.51.0
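
The two underflows described in Issue 1 and Issue 4 are plain integer arithmetic and can be checked in userspace. The sketch below is an assumption-laden model, not the kernel function: the helper names are hypothetical, `skb_len` stands in for `skb->len`, and `LLCP_HEADER_SIZE` is taken as 2 as stated in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LLCP_HEADER_SIZE 2	/* value stated in the commit message */

/* Pre-patch: unsigned subtraction, then truncation to 16 bits. */
uint16_t snl_tlv_len_unchecked(unsigned int skb_len)
{
	return (uint16_t)(skb_len - LLCP_HEADER_SIZE);
}

/* Pre-patch SDREQ handling: 8-bit result of length - 1. */
uint8_t sdreq_name_len_unchecked(uint8_t length)
{
	return (uint8_t)(length - 1);
}

/* Post-patch shape: reject short PDUs before computing the TLV length.
 * Returns 0 and stores the length on success, -1 for a malformed PDU.
 */
int snl_tlv_len_checked(unsigned int skb_len, uint16_t *tlv_len)
{
	assert(tlv_len != NULL);

	if (skb_len < LLCP_HEADER_SIZE)
		return -1;

	*tlv_len = (uint16_t)(skb_len - LLCP_HEADER_SIZE);
	return 0;
}
```

An empty PDU makes the unchecked TLV length 65534, and a zero `length` byte makes the unchecked service name length 255; the checked variant refuses the short PDU instead.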



* Re: [PATCH net] netrom: do some basic forms of validation on incoming frames
From: Simon Horman @ 2026-04-09 19:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: netdev, linux-kernel, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, linux-hams, Yizhe Zhuang, stable
In-Reply-To: <2026040730-untagged-groin-bbb7@gregkh>

On Tue, Apr 07, 2026 at 10:45:31AM +0200, Greg Kroah-Hartman wrote:
> There is a lack of much validation of frame size coming from a
> netrom-based device.  While these devices are "trusted" doing some
> sanity checks is good to at least keep the fuzzing tools happy when they
> stumble across this ancient protocol and light up with a range of bug
> reports.
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Simon Horman <horms@kernel.org>
> Cc: linux-hams@vger.kernel.org
> Assisted-by: gregkh_clanker_2000
> Reviewed-by: Yizhe Zhuang <yizhe@darknavy.com>
> Cc: stable <stable@kernel.org>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Hi Greg 2000!

I expect that checking skb->len isn't sufficient here
and pskb_may_pull needs to be used to ensure that
the data is also available in the linear section of the skb.

Also, although I'm all for incremental enhancements,
I do suspect that similar problems exist in the call
chain of these functions.

...


* Re: [PATCH net-next v4] selftests/net: convert so_txtime to drv-net
From: Willem de Bruijn @ 2026-04-09 19:10 UTC (permalink / raw)
  To: Willem de Bruijn, netdev
  Cc: davem, kuba, edumazet, pabeni, horms, linux-kselftest, shuah,
	Willem de Bruijn
In-Reply-To: <20260409164238.661091-1-willemdebruijn.kernel@gmail.com>

Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> In preparation for extending to pacing hardware offload, convert the
> so_txtime.sh test to a drv-net test that can be run against netdevsim
> and real hardware.
> 
> Also update so_txtime.c to not exit on first failure, but run to
> completion and report exit code there. This helps with debugging
> unexpected results, especially when processing multiple packets,
> as in the "reverse_order" testcase.
> 
> Signed-off-by: Willem de Bruijn <willemb@google.com>

> +def main() -> None:
> +    """Boilerplate ksft main."""
> +    with NetDrvEpEnv(__file__) as cfg:
> +        # Record original root qdisc
> +        cmd_obj = cmd((f"tc -j qdisc show dev {cfg.ifname} root"))
> +        qdisc_root = json.loads(cmd_obj.stdout)[0].get("kind", None)
> +
> +        ksft_run([test_so_txtime_mono, test_so_txtime_etf], args=(cfg,))
> +
> +        # Restore original root qdisc. If mq, populate with default_qdisc nodes
> +        if (qdisc_root):

I evidently couldn't resist a touch up after running through pylint.

Unnecessary parentheses. Only a warn. But I can resubmit irrespective
of other concerns.

Again, could add a tc helper (in a separate patch) to hide some of the
open coded ugliness too.

> +            cmd(f"tc qdisc replace dev {cfg.ifname} root {qdisc_root}")
> +    ksft_exit()


* Re: [PATCH v2 net 2/2] net: hamradio: scc: validate bufsize in SIOCSCCSMEM ioctl
From: Joerg Reuter @ 2026-04-09 19:27 UTC (permalink / raw)
  To: Mashiro Chen
  Cc: netdev, andrew+netdev, davem, edumazet, kuba, pabeni, linux-hams,
	linux-kernel, stable
In-Reply-To: <20260409024927.24397-3-mashiro.chen@mailbox.org>

Looks great, thanks!

    73, Joerg

> The SIOCSCCSMEM ioctl copies a scc_mem_config from user space and
> assigns its bufsize field directly to scc->stat.bufsize without any
> range validation:
> 
>   scc->stat.bufsize = memcfg.bufsize;
> 
> If a privileged user (CAP_SYS_RAWIO) sets bufsize to 0, the receive
> interrupt handler later calls dev_alloc_skb(0) and immediately writes
> a KISS type byte via skb_put_u8() into a zero-capacity socket buffer,
> corrupting the adjacent skb_shared_info region.
> 
> Reject bufsize values smaller than 16; this is large enough to hold
> at least one KISS header byte plus useful data.
> 
> Cc: stable@vger.kernel.org
> Cc: linux-hams@vger.kernel.org
Acked-by: Joerg Reuter <jreuter@yaina.de>
> Signed-off-by: Mashiro Chen <mashiro.chen@mailbox.org>
> ---
>  drivers/net/hamradio/scc.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/hamradio/scc.c b/drivers/net/hamradio/scc.c
> index ae5048efde686a..8569db4a71401c 100644
> --- a/drivers/net/hamradio/scc.c
> +++ b/drivers/net/hamradio/scc.c
> @@ -1909,6 +1909,8 @@ static int scc_net_siocdevprivate(struct net_device *dev,
>  			if (!capable(CAP_SYS_RAWIO)) return -EPERM;
>  			if (!arg || copy_from_user(&memcfg, arg, sizeof(memcfg)))
>  				return -EINVAL;
> +			if (memcfg.bufsize < 16)
> +				return -EINVAL;
>  			scc->stat.bufsize   = memcfg.bufsize;
>  			return 0;
>  		
> -- 
> 2.53.0
> 

-- 
Joerg Reuter                                    http://yaina.de/jreuter
And I make my way to where the warm scent of soil fills the evening air. 
Everything is waiting quietly out there....                 (Anne Clark)


* Re: [PATCH] nfc: llcp: fix missing return after LLCP_CLOSED check in recv_hdlc and recv_disc
From: Lekë Hapçiu @ 2026-04-09 19:34 UTC (permalink / raw)
  To: horms
  Cc: netdev, davem, linux-nfc, kuba, krzysztof.kozlowski, stable,
	framemain
In-Reply-To: <20260409164527.GP469338@kernel.org>

Thanks for the pointer. Withdrawing this patch — the existing
submission at:
  https://lore.kernel.org/all/20260408081006.3723-1-qjx1298677004@gmail.com/
covers the same fix.

Lekë


* Re: [PATCH] nfc: nci: fix OOB heap read in nci_core_init_rsp_packet_v1()
From: Lekë Hapçiu @ 2026-04-09 19:37 UTC (permalink / raw)
  To: horms; +Cc: netdev, kuba, davem, edumazet, pabeni, security, framemain
In-Reply-To: <20260408190505.GK469338@kernel.org>

Thanks for the review. I will address all three points across both
patches (v1 and v2) and send them together as a patchset:

  - add an early skb->len >= sizeof(*rsp_1) guard before any struct
    field access
  - compute the rsp_2 pointer using the raw chip-supplied
    num_supported_rf_interfaces (validated against skb->len), not the
    already-capped ndev value
  - investigate whether pskb_may_pull is needed in the NCI receive path
    before these handlers are called

Lekë


* Re: [PATCH net-next v2 3/3] net: mdio: treat PSE EPROBE_DEFER as non-fatal during PHY registration
From: Andrew Lunn @ 2026-04-09 19:54 UTC (permalink / raw)
  To: Russell King (Oracle)
  Cc: Kory Maincent, Carlo Szelinsky, o.rempel, andrew+netdev,
	hkallweit1, kuba, davem, edumazet, pabeni, horms, netdev,
	linux-kernel
In-Reply-To: <adfPAQDiYX6eIjrT@shell.armlinux.org.uk>

On Thu, Apr 09, 2026 at 05:08:33PM +0100, Russell King (Oracle) wrote:
> On Thu, Apr 09, 2026 at 05:34:56PM +0200, Andrew Lunn wrote:
> > I still think we should be deferring probe until we have all the parts
> > available. The question is, how do we actually do that?
> 
> Indeed...
> 
> > We could insist that MACs being used with PSE need to call
> > phylink_connect() in probe, so we can return EPROBE_DEFER. We might
> > actually need a new API method, phylink_connect_probe(). That can call
> > down into phylib, maybe again new API methods, which will not bind
> > genphy, but return EPROBE_DEFER.

I did not say it would be easy...
 
> How would MACs know whether they should call phylink_connect_probe()
> or phylink_connect_phy() ?

It would not. Anybody with a board using PSE would need to modify the
MAC driver to use phylink_connect_probe(), if they have a slow to load
PSE device.

> What do we do about MAC drivers that are a single driver and device,
> but are made up of several network devices (like Marvell PP2) ?

It would need more care, but it should work. You might end up removing
a perfectly good device because the other one is missing its PHY,
which is not ideal, but hopefully you get there in the end.

> We also have network drivers that provide a MDIO bus for a different
> network device, which makes connecting the PHY harder in the probe
> path.

Yes, we would see such setups doing more deferred probing, but again,
they should get there in the end. The most common systems doing this
are using the FEC. Are there any boards using the FEC and a problematic
PSE?

> Lastly, what do we do where a PHY driver hasn't been configured or
> doesn't exist for the PHY?

I was wondering if we can get from the driver core some idea where we
are in the deferred probing window. If we are 2/3 of the way through
the window, fall back to genphy?

I'm not saying we should change all MAC drivers, or recommend new MAC
drivers connect to the PHY in probe. I just want to offer the option
if you have a problematic PSE or PHY, change the MAC driver.

What we have also said in the past, it is the bootloader's problem to
download firmware into the PHY, or PSE, so that it is ready to go by
the time Linux boots. That would also be the simpler solution here.

    Andrew


* Re: [PATCH net-next 1/2] selftests: drv-net: Add ntuple (NFC) flow steering test
From: Dimitri Daskalakis @ 2026-04-09 20:28 UTC (permalink / raw)
  To: Jakub Kicinski, Michael Chan
  Cc: David S . Miller, Andrew Lunn, Eric Dumazet, Paolo Abeni,
	Shuah Khan, Willem de Bruijn, Petr Machata, David Wei,
	Chris J Arges, Carolina Jubran, Dimitri Daskalakis, netdev,
	linux-kselftest
In-Reply-To: <20260409085055.0834111c@kernel.org>



On 4/9/26 8:50 AM, Jakub Kicinski wrote:
> On Tue,  7 Apr 2026 09:49:53 -0700 Dimitri Daskalakis wrote:
>> Add a test for ethtool NFC (ntuple) flow steering rules. The test
>> creates an ntuple rule matching on various flow fields and verifies
>> that traffic is steered to the correct queue.
> Hi Michael, how accurate is the stats refresh timer in bnxt?
> This test is seeing ~10% of flakiness on bnxt, fewer packets
> got counted than we sent. Could be something else but I suspect
> the stats just didn't get refreshed. We give it 25% margin right 
> now.
>
> Dimitri, this skips for some drivers because they don't auto-enable
> ntuple filters. Looks like other selftests have the same check and also
> skip in netdev CI. So probably a separate / follow up task but I think
> we need to add code to enable the filters if they were disabled.

Sounds good, I will follow up with enabling ntuple filters across the
selftests.


* [PATCH v2][next] netfilter: x_tables: Avoid a couple -Wflex-array-member-not-at-end warnings
From: Gustavo A. R. Silva @ 2026-04-09 20:28 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Florian Westphal, Phil Sutter, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: netfilter-devel, coreteam, netdev, linux-kernel,
	Gustavo A. R. Silva, linux-hardening

-Wflex-array-member-not-at-end was introduced in GCC-14, and we are
getting ready to enable it, globally.

struct compat_xt_standard_target and struct compat_xt_error_target are
only used in xt_compat_check_entry_offsets(). Remove these structs and
instead define the same memory layout on the stack via flexible struct
compat_xt_entry_target and DEFINE_RAW_FLEX(). Adjust the rest of the
code accordingly.

With these changes, fix the following warnings:

1 net/netfilter/x_tables.c:816:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]
1 net/netfilter/x_tables.c:811:39: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
---
Changes in v2:
 - Update verdict after (compat_uint_t *)st->data;

v1:
 - Link: https://lore.kernel.org/linux-hardening/adbIKC0cZcK7VcCF@kspp/

 net/netfilter/x_tables.c | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index b39017c80548..746012196d83 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -817,17 +817,6 @@ int xt_compat_match_to_user(const struct xt_entry_match *m,
 }
 EXPORT_SYMBOL_GPL(xt_compat_match_to_user);
 
-/* non-compat version may have padding after verdict */
-struct compat_xt_standard_target {
-	struct compat_xt_entry_target t;
-	compat_uint_t verdict;
-};
-
-struct compat_xt_error_target {
-	struct compat_xt_entry_target t;
-	char errorname[XT_FUNCTION_MAXNAMELEN];
-};
-
 int xt_compat_check_entry_offsets(const void *base, const char *elems,
 				  unsigned int target_offset,
 				  unsigned int next_offset)
@@ -850,18 +839,26 @@ int xt_compat_check_entry_offsets(const void *base, const char *elems,
 		return -EINVAL;
 
 	if (strcmp(t->u.user.name, XT_STANDARD_TARGET) == 0) {
-		const struct compat_xt_standard_target *st = (const void *)t;
+		DEFINE_RAW_FLEX(const struct compat_xt_entry_target, st, data,
+				sizeof(compat_uint_t));
+		compat_uint_t *verdict;
 
-		if (COMPAT_XT_ALIGN(target_offset + sizeof(*st)) != next_offset)
+		st = (const void *)t;
+		verdict = (compat_uint_t *)st->data;
+
+		if (COMPAT_XT_ALIGN(target_offset + __struct_size(st)) !=
+				next_offset)
 			return -EINVAL;
 
-		if (!verdict_ok(st->verdict))
+		if (!verdict_ok(*verdict))
 			return -EINVAL;
 	} else if (strcmp(t->u.user.name, XT_ERROR_TARGET) == 0) {
-		const struct compat_xt_error_target *et = (const void *)t;
+		DEFINE_RAW_FLEX(const struct compat_xt_entry_target, et, data,
+				XT_FUNCTION_MAXNAMELEN);
+		et = (const void *)t;
 
-		if (!error_tg_ok(t->u.target_size, sizeof(*et),
-				 et->errorname, sizeof(et->errorname)))
+		if (!error_tg_ok(t->u.target_size, __struct_size(et),
+				 et->data, __member_size(et->data)))
 			return -EINVAL;
 	}
 
-- 
2.43.0
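
For readers unfamiliar with the pattern, here is a rough userspace sketch of the idea behind the change: reserve a flexible-array header plus a fixed number of trailing bytes as one on-stack object, instead of wrapping the header in a second struct. This is not the kernel's DEFINE_RAW_FLEX() implementation. `struct target_hdr`, `RAW_FLEX_SIZE()`, `ON_STACK_FLEX()` and `roundtrip_verdict()` are hypothetical names invented for illustration, and the header layout is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a header that ends in a flexible array. */
struct target_hdr {
	uint16_t target_size;
	char name[30];
	unsigned char data[];
};

/* Bytes needed for the header plus `count` trailing data bytes. */
#define RAW_FLEX_SIZE(count) (offsetof(struct target_hdr, data) + (count))

/*
 * Reserve header + payload as a single on-stack object. The union keeps
 * the storage suitably aligned for struct target_hdr.
 */
#define ON_STACK_FLEX(name, count)				\
	union {							\
		unsigned char bytes[RAW_FLEX_SIZE(count)];	\
		struct target_hdr obj;				\
	} name##_u = { 0 };					\
	struct target_hdr *name = &name##_u.obj

/* Store a 32-bit verdict in the trailing bytes and read it back. */
static uint32_t roundtrip_verdict(uint32_t verdict)
{
	ON_STACK_FLEX(st, sizeof(uint32_t));
	uint32_t out;

	assert(sizeof(st_u) >= RAW_FLEX_SIZE(sizeof(uint32_t)));
	memcpy(st->data, &verdict, sizeof(verdict));
	memcpy(&out, st->data, sizeof(out));
	return out;
}
```

Because the flexible array stays the last member of the only struct involved, nothing is placed after it, which is the situation -Wflex-array-member-not-at-end complains about.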


^ permalink raw reply related

* [PATCH net 0/2] mlx5 misc fixes 2026-04-09
From: Tariq Toukan @ 2026-04-09 20:28 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Jianbo Liu, Kees Cook, Lama Kayal, Michal Swiatkowski,
	Gal Pressman, Roy Novich, Roi Dayan, Raed Salem, netdev,
	linux-rdma, linux-kernel, Dragos Tatulea

Hi,

This small patchset provides misc bug fixes from Gal to the mlx5 Eth
driver.

Thanks,
Tariq.

Gal Pressman (2):
  net/mlx5e: Fix features not applied during netdev registration
  net/mlx5e: IPsec, fix ASO poll timeout with read_poll_timeout_atomic()

 .../mellanox/mlx5/core/en_accel/ipsec_offload.c      | 12 ++++--------
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c    |  8 ++++++++
 2 files changed, 12 insertions(+), 8 deletions(-)


base-commit: ebe560ea5f54134279356703e73b7f867c89db13
-- 
2.44.0


^ permalink raw reply

* [PATCH net 1/2] net/mlx5e: Fix features not applied during netdev registration
From: Tariq Toukan @ 2026-04-09 20:28 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Jianbo Liu, Kees Cook, Lama Kayal, Michal Swiatkowski,
	Gal Pressman, Roy Novich, Roi Dayan, Raed Salem, netdev,
	linux-rdma, linux-kernel, Dragos Tatulea
In-Reply-To: <20260409202852.158059-1-tariqt@nvidia.com>

From: Gal Pressman <gal@nvidia.com>

mlx5e_fix_features() returns early when the netdevice is not present.
This is correct during profile transitions where priv is cleared, but it
also incorrectly blocks feature fixups during register_netdev(), when
the device is also not yet present.

It is not trivial to distinguish between both cases as we cannot use
priv to carry state, and in both cases reg_state == NETREG_REGISTERED.

Force a netdev features update after register_netdev() completes, where
the device is present and fix_features() can actually work.

This is not a pretty solution, as it results in an additional features
update call (register_netdevice() already calls
__netdev_update_features() internally), but it is the simplest,
cleanest, and most robust way I found to fix this issue after multiple
attempts.

This fixes an issue on systems where CQE compression is enabled by
default: RXHASH remains enabled after registration despite the two
features being mutually exclusive.

Fixes: ab4b01bfdaa6 ("net/mlx5e: Verify dev is present for fix features ndo")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index b6c12460b54a..0b8b44bbcb9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -6756,6 +6756,14 @@ static int _mlx5e_probe(struct auxiliary_device *adev)
 		goto err_resume;
 	}
 
+	/* mlx5e_fix_features() returns early when the device is not present
+	 * to avoid dereferencing cleared priv during profile changes.
+	 * This also causes it to be a no-op during register_netdev(), where
+	 * the device is not yet present.
+	 * Trigger an additional features update that will actually work.
+	 */
+	mlx5e_update_features(netdev);
+
 	mlx5e_dcbnl_init_app(priv);
 	mlx5_core_uplink_netdev_set(mdev, netdev);
 	mlx5e_params_print_info(mdev, &priv->channels.params);
-- 
2.44.0


^ permalink raw reply related

* [PATCH net 2/2] net/mlx5e: IPsec, fix ASO poll timeout with read_poll_timeout_atomic()
From: Tariq Toukan @ 2026-04-09 20:28 UTC (permalink / raw)
  To: Eric Dumazet, Jakub Kicinski, Paolo Abeni, Andrew Lunn,
	David S. Miller
  Cc: Boris Pismenny, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Jianbo Liu, Kees Cook, Lama Kayal, Michal Swiatkowski,
	Gal Pressman, Roy Novich, Roi Dayan, Raed Salem, netdev,
	linux-rdma, linux-kernel, Dragos Tatulea
In-Reply-To: <20260409202852.158059-1-tariqt@nvidia.com>

From: Gal Pressman <gal@nvidia.com>

The do-while poll loop uses jiffies for its timeout:
  expires = jiffies + msecs_to_jiffies(10);

jiffies is sampled at an arbitrary point within the current tick, so the
first partial tick contributes anywhere from a full tick down to nearly
zero real time. For small msecs_to_jiffies() results this is
significant: the effective poll window can be much shorter than the
requested 10ms, and in the worst case the loop exits after a single
iteration (e.g., when HZ=100), well before the device has delivered the
CQE.

Replace the loop with read_poll_timeout_atomic(), which counts elapsed
time via udelay() accounting rather than jiffies, guaranteeing the full
poll window regardless of HZ.

Additionally, read_poll_timeout_atomic() executes the poll operation one
more time after the timeout has expired, giving the CQE a final chance
to be detected. The old do-while loop could exit without a final poll if
the timeout expired during the udelay() between iterations.

Fixes: 76e463f6508b ("net/mlx5e: Overcome slow response for first IPsec ASO WQE")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Jianbo Liu <jianbol@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
---
 .../mellanox/mlx5/core/en_accel/ipsec_offload.c      | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
index 05faad5083d9..145677ce9640 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/ipsec_offload.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /* Copyright (c) 2017, Mellanox Technologies inc. All rights reserved. */
 
+#include <linux/iopoll.h>
+
 #include "mlx5_core.h"
 #include "en.h"
 #include "ipsec.h"
@@ -592,7 +594,6 @@ int mlx5e_ipsec_aso_query(struct mlx5e_ipsec_sa_entry *sa_entry,
 	struct mlx5_wqe_aso_ctrl_seg *ctrl;
 	struct mlx5e_hw_objs *res;
 	struct mlx5_aso_wqe *wqe;
-	unsigned long expires;
 	u8 ds_cnt;
 	int ret;
 
@@ -614,13 +615,8 @@ int mlx5e_ipsec_aso_query(struct mlx5e_ipsec_sa_entry *sa_entry,
 	mlx5e_ipsec_aso_copy(ctrl, data);
 
 	mlx5_aso_post_wqe(aso->aso, false, &wqe->ctrl);
-	expires = jiffies + msecs_to_jiffies(10);
-	do {
-		ret = mlx5_aso_poll_cq(aso->aso, false);
-		if (ret)
-			/* We are in atomic context */
-			udelay(10);
-	} while (ret && time_is_after_jiffies(expires));
+	read_poll_timeout_atomic(mlx5_aso_poll_cq, ret, !ret, 10,
+				 10 * USEC_PER_MSEC, false, aso->aso, false);
 	if (!ret)
 		memcpy(sa_entry->ctx, aso->ctx, MLX5_ST_SZ_BYTES(ipsec_aso));
 	spin_unlock_bh(&aso->lock);
-- 
2.44.0
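
The partial-tick effect described in the commit message can be illustrated with a simplified userspace model. It is only a sketch: it assumes the old loop ends once jiffies has advanced by the rounded-up tick count, and it ignores the udelay(10) granularity. `model_msecs_to_jiffies()` and `effective_window_us()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>

/* Rounds up to whole ticks, as msecs_to_jiffies() does for small values. */
static unsigned long model_msecs_to_jiffies(unsigned int hz, unsigned int ms)
{
	return ((unsigned long)ms * hz + 999) / 1000;
}

/*
 * Real time (in microseconds) until "jiffies" reaches start + timeout,
 * when the start is sampled offset_us into the current tick.
 */
static unsigned long effective_window_us(unsigned int hz, unsigned int ms,
					 unsigned long offset_us)
{
	unsigned long tick_us;

	assert(hz > 0);
	tick_us = 1000000UL / hz;
	assert(offset_us < tick_us);

	/* The first tick only contributes what is left of it. */
	return model_msecs_to_jiffies(hz, ms) * tick_us - offset_us;
}
```

Under this model, effective_window_us(100, 10, 9990) is 10, i.e. a nominal 10ms timeout at HZ=100 can shrink to about one udelay(10) iteration, while effective_window_us(1000, 10, 0) is the full 10000.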


^ permalink raw reply related

* Re: [PATCH] MAINTAINERS: Remove Salil Mehta as HiSilicon HNS3/HNS Ethernet maintainer
From: patchwork-bot+netdevbpf @ 2026-04-09 20:30 UTC (permalink / raw)
  To: Salil Mehta; +Cc: davem, netdev, kuba, salil.mehta, shenjian15, shaojijie
In-Reply-To: <20260409000430.7217-1-salil.mehta@huawei.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu, 9 Apr 2026 01:04:30 +0100 you wrote:
> From: Salil Mehta <salil.mehta@opnsrc.net>
> 
> Closing this chapter and a long wonderful journey with my team, I sign off one
> last time with my Huawei email address. Remove my maintainer entry for the
> HiSilicon HNS and HNS3 10G/100G Ethernet drivers, and add a CREDITS entry for
> my co-authorship and maintenance contributions to these drivers.
> 
> [...]

Here is the summary with links:
  - MAINTAINERS: Remove Salil Mehta as HiSilicon HNS3/HNS Ethernet maintainer
    https://git.kernel.org/netdev/net/c/eb216e422044

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Dave Hansen @ 2026-04-09 20:36 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Pawan Gupta, x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin,
	Josh Poimboeuf, David Kaplan, Sean Christopherson,
	Borislav Petkov, Dave Hansen, Peter Zijlstra, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, KP Singh, Jiri Olsa,
	David S. Miller, David Laight, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, David Ahern, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, John Fastabend, Stanislav Fomichev,
	Hao Luo, Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm,
	Asit Mallick, Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <CALMp9eRfNsghM_RnDXOs=SJYObfPa5A1aOVDZno_zJ=XotfmRw@mail.gmail.com>

On 4/7/26 17:47, Jim Mattson wrote:
> On Tue, Apr 7, 2026 at 4:41 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 4/7/26 16:27, Jim Mattson wrote:
>>> What is your proposed BHI_DIS_S override mechanism, then?
>> Let me make sure I get this right. The desire is to:
>>
>> 1. Have hypervisors lie to guests about the CPU they are running on (for
>>    the benefit of large/diverse migration pools)
>> 2. Have guests be allowed to boot with BHI_DIS_S for performance
>> 3. Have apps in those guests that care about security to opt back in to
>>    BHI_DIS_S for themselves?
> I just want guests on heterogeneous migration pools to properly
> protect themselves from native BHI when running on host kernels at
> least as far back as Linux v6.6.
> 
> To that end, I would be satisfied with using the longer BHB clearing
> sequence when HYPERVISOR is true and BHI_CTRL is false.

If the guests can't get mitigation information from model/family because
the hypervisor is lying (or may lie), then it's on the hypervisor to
figure it out.

I'm not sure we want to just assume that all hypervisors are going to
lie all the time about this.

I kinda think we should just let Pawan's series move forward and then we
can debate the lying hypervisor problem once the series is settled.

^ permalink raw reply

* Re: [PATCH net-next v2 5/5] ethtool: strset: check nla_len overflow
From: Stanislav Fomichev @ 2026-04-09 20:39 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Stanislav Fomichev, Jakub Kicinski, Hangbin Liu, Donald Hunter,
	David S. Miller, Eric Dumazet, Paolo Abeni, Simon Horman, netdev,
	linux-kernel
In-Reply-To: <bb2e7087-aa36-4556-8778-b65d11354779@lunn.ch>

On 04/09, Andrew Lunn wrote:
> > I guess... Should we update ethtool.yaml doc to tell the users to prefer
> > ioctl over netlink for strset-get and mention this new EMSGSIZE?
> 
> No. The ioctl is deprecated. It can still be used for drivers which
> need it, but netlink is the preferred method.

I'm with you on deprecating ioctl and pushing for netlink, but I'm not sure
how we can recommend this specific api call if it consistently can return
EMSGSIZE for some devices? Or am I reading this whole series wrong?

^ permalink raw reply

* Re: [PATCH net-next] net/mlx5: Use dma_wmb() for completion queue doorbell updates
From: Tariq Toukan @ 2026-04-09 20:46 UTC (permalink / raw)
  To: lirongqing, Saeed Mahameed, Leon Romanovsky, Tariq Toukan,
	Mark Bloch, Andrew Lunn, David S . Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, John Fastabend, Stanislav Fomichev,
	Boris Pismenny, Richard Cochran, Cosmin Ratiu, Dragos Tatulea,
	Carolina Jubran, Kees Cook, Akiva Goldberger, Simon Horman,
	netdev, linux-rdma, linux-kernel, bpf
In-Reply-To: <20260402055206.2311-1-lirongqing@baidu.com>



On 02/04/2026 8:52, lirongqing wrote:
> From: Li RongQing <lirongqing@baidu.com>
> 
> dma_wmb() barriers are specifically for ordering writes to DMA
> coherent memory that is accessible to both the CPU and DMA capable
> devices.
> 
> The dma_wmb() barrier is lighter than wmb() on some architectures
> because it only ensures ordering for DMA writes, not for all writes
> including MMIO accesses.
> 
> In the MLX5 driver, completion queue (CQ) doorbell records are
> allocated as DMA coherent memory via mlx5_dma_zalloc_coherent_node().
> The CQ update pattern is:
>    1. Update CQ space (device reads via DMA)
>    2. Update doorbell record (device reads via DMA)
>    3. Memory barrier
>    4. Enable more CQEs
> 
> Since only DMA coherent memory accesses are involved (no MMIO accesses
> follow), can safely use dma_wmb() instead of wmb().
> 
> This change improves performance slightly on architectures where
> dma_wmb() is lighter than wmb().
> 
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---

Hi,

Sorry for the delay.
Thanks for your patch.

The idea looks valid.
This is the kind of patch that should go through intensive testing 
before acceptance. I'm picking it up for internal testing and will update.

PS: I know you have one more patch [1] pending testing. It looks good so 
far, I'll verify and send an update soon.

Regards,
Tariq

[1] 
https://patchwork.kernel.org/project/netdevbpf/patch/20260317003544.2583-1-lirongqing@baidu.com/

>   drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c    | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/en_tx.c     | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c   | 2 +-
>   drivers/net/ethernet/mellanox/mlx5/core/wc.c        | 2 +-
>   7 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
> index 1b76647..7bd6dfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/ptp.c
> @@ -259,7 +259,7 @@ static bool mlx5e_ptp_poll_ts_cq(struct mlx5e_cq *cq, int napi_budget)
>   	mlx5_cqwq_update_db_record(cqwq);
>   
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   
>   	while (metadata_buff_sz > 0)
>   		mlx5e_ptp_metadata_fifo_push(&ptpsq->metadata_freelist,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 80f9fc1..dde8856 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -805,7 +805,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
>   	mlx5_cqwq_update_db_record(&cq->wq);
>   
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   
>   	sq->cc = sqcc;
>   	return (i == MLX5E_TX_CQ_POLL_BUDGET);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 268e208..f17e7f1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -2447,7 +2447,7 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
>   	mlx5_cqwq_update_db_record(cqwq);
>   
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   
>   	return work_done;
>   }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> index 9f02726..7ba319f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
> @@ -849,7 +849,7 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
>   	mlx5_cqwq_update_db_record(&cq->wq);
>   
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   
>   	sq->dma_fifo_cc = dma_fifo_cc;
>   	sq->cc = sqcc;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
> index 1f6bde5..1341874 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/conn.c
> @@ -384,7 +384,7 @@ static inline void mlx5_fpga_conn_cqes(struct mlx5_fpga_conn *conn,
>   
>   	mlx5_fpga_dbg(conn->fdev, "Re-arming CQ with cc# %u\n", conn->cq.wq.cc);
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   	mlx5_fpga_conn_arm_cq(conn);
>   }
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
> index 614cd57..8f7a89a 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/aso.c
> @@ -421,7 +421,7 @@ int mlx5_aso_poll_cq(struct mlx5_aso *aso, bool with_data)
>   	mlx5_cqwq_update_db_record(&cq->wq);
>   
>   	/* ensure cq space is freed before enabling more cqes */
> -	wmb();
> +	dma_wmb();
>   
>   	if (with_data)
>   		aso->cc += MLX5_ASO_WQEBBS_DATA;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wc.c b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> index 7d3d4d7..1afbdd19 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/wc.c
> @@ -314,7 +314,7 @@ static void mlx5_wc_post_nop(struct mlx5_wc_sq *sq, unsigned int *offset,
>   	/* ensure doorbell record is visible to device before ringing the
>   	 * doorbell
>   	 */
> -	wmb();
> +	dma_wmb();
>   
>   	mlx5_iowrite64_copy(sq, mmio_wqe, sizeof(mmio_wqe), *offset);
>   


^ permalink raw reply

* Re: [RFC net PATCH v1] net: pcs: pcs-mtk-lynxi: fix bpi-r3 serdes configuration
From: Daniel Golle @ 2026-04-09 20:55 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Frank Wunderlich, Chester A. Unal, Felix Fietkau,
	Alexander Couzens, Andrew Lunn, Heiner Kallweit, Russell King,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Matthias Brugger, AngeloGioacchino Del Regno, Frank Wunderlich,
	netdev, linux-kernel, linux-arm-kernel, linux-mediatek
In-Reply-To: <20260409164942.wbmwtkpd5d5zibyy@skbuf>

On Thu, Apr 09, 2026 at 07:49:42PM +0300, Vladimir Oltean wrote:
> I notice Arınc, listed by ./scripts/get_maintainer.pl drivers/net/dsa/mt7530.c,
> and Felix, listed by ./scripts/get_maintainer.pl drivers/net/ethernet/mediatek/mtk_eth_soc.c,
> are not on CC. Maybe they have more info.
> 
> Only the switch port has a chance of having a non-zero default polarity
> setting? (coming from the efuse, if I understood this discussion properly)
> https://lore.kernel.org/netdev/C59EED96-3973-4074-A4D8-C264949D447E@linux.dev/
> The GMAC doesn't?

Yes, vendor SDK uses DT mediatek,pnswap{,-rx,-tx} properties only for the
SoC GMACs. For MT7531 there are **no** strap pins deciding the SerDes
polarity, and also no software-way to override the defaults in the vendor
SDK.

However, the MT7531 datasheet quite clearly states:
Register 000050EC QPHY_WRAP_CTRL -- QPHY wrapper control
Reset value: 0x00000501

BIT 1 RX_BIT_POLARITY -- RX bit polarity control
 1'b0: normal
 1'b1: inverted

BIT 0 TX_BIT_POLARITY -- TX bit polarity control (TX default inversed in MT7531)
 1'b0: normal
 1'b1: inverted

Hence the best would be to just assume the documented default in the driver
as well.

A quick register dump using the BPi-R3 confirms that this applies to *both*
SerDes PCS on MT7531A (port 5 and port 6) equally, both read 0x00000501
after reset.
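
As a small illustration of what the documented reset value implies, the following sketch decodes the two polarity bits. The bit positions and the 0x00000501 reset value are the ones quoted from the datasheet above; the macro and function names are made up for this example and are not taken from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; bit positions as quoted from the datasheet. */
#define QPHY_TX_BIT_POLARITY	(1U << 0)
#define QPHY_RX_BIT_POLARITY	(1U << 1)
#define QPHY_WRAP_CTRL_RESET	0x00000501U

static bool qphy_tx_inverted(uint32_t wrap_ctrl)
{
	return (wrap_ctrl & QPHY_TX_BIT_POLARITY) != 0;
}

static bool qphy_rx_inverted(uint32_t wrap_ctrl)
{
	return (wrap_ctrl & QPHY_RX_BIT_POLARITY) != 0;
}

/* Self-check: the reset value means TX inverted, RX normal. */
static void qphy_check_reset_default(void)
{
	assert(qphy_tx_inverted(QPHY_WRAP_CTRL_RESET));
	assert(!qphy_rx_inverted(QPHY_WRAP_CTRL_RESET));
}
```

In other words, 0x501 has bit 0 set and bit 1 clear, so a driver that assumes the documented default would treat TX as inverted and RX as normal unless told otherwise.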

^ permalink raw reply

* Re: [PATCH v9 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
From: Jim Mattson @ 2026-04-09 21:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pawan Gupta, x86, Jon Kohler, Nikolay Borisov, H. Peter Anvin,
	Josh Poimboeuf, David Kaplan, Sean Christopherson,
	Borislav Petkov, Dave Hansen, Peter Zijlstra, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, KP Singh, Jiri Olsa,
	David S. Miller, David Laight, Andy Lutomirski, Thomas Gleixner,
	Ingo Molnar, David Ahern, Martin KaFai Lau, Eduard Zingerman,
	Song Liu, Yonghong Song, John Fastabend, Stanislav Fomichev,
	Hao Luo, Paolo Bonzini, Jonathan Corbet, linux-kernel, kvm,
	Asit Mallick, Tao Zhang, bpf, netdev, linux-doc, chao.gao
In-Reply-To: <410df9f6-69ec-483f-9009-0a9b8c9162a9@intel.com>

On Thu, Apr 9, 2026 at 1:36 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 4/7/26 17:47, Jim Mattson wrote:
> > On Tue, Apr 7, 2026 at 4:41 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 4/7/26 16:27, Jim Mattson wrote:
> >>> What is your proposed BHI_DIS_S override mechanism, then?
> >> Let me make sure I get this right. The desire is to:
> >>
> >> 1. Have hypervisors lie to guests about the CPU they are running on (for
> >>    the benefit of large/diverse migration pools)
> >> 2. Have guests be allowed to boot with BHI_DIS_S for performance
> >> 3. Have apps in those guests that care about security to opt back in to
> >>    BHI_DIS_S for themselves?
> > I just want guests on heterogeneous migration pools to properly
> > protect themselves from native BHI when running on host kernels at
> > least as far back as Linux v6.6.
> >
> > To that end, I would be satisfied with using the longer BHB clearing
> > sequence when HYPERVISOR is true and BHI_CTRL is false.
>
> If the guests can't get mitigation information from model/family because
> the hypervisor is lying (or may lie), then it's on the hypervisor to
> figure it out.
>
> I'm not sure we want to just assume that all hypervisors are going to
> lie all the time about this.

Without any information, that is exactly what we must assume. There is
precedent for this.

In vulnerable_to_its():

        /*
         * If a VMM did not expose ITS_NO, assume that a guest could
         * be running on a vulnerable hardware or may migrate to such
         * hardware.
         */
        if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
                return true;


In cpu_set_bug_bits():

        /*
         * Intel parts with eIBRS are vulnerable to BHI attacks. Parts with
         * BHI_NO still need to use the BHI mitigation to prevent Intra-mode
         * attacks.  When virtualized, eIBRS could be hidden, assume vulnerable.
         */
        if (!cpu_matches(cpu_vuln_whitelist, NO_BHI) &&
            (boot_cpu_has(X86_FEATURE_IBRS_ENHANCED) ||
             boot_cpu_has(X86_FEATURE_HYPERVISOR)))
                setup_force_cpu_bug(X86_BUG_BHI);

...and...

        if (c->x86_vendor == X86_VENDOR_AMD) {
                if (!cpu_has(c, X86_FEATURE_TSA_SQ_NO) ||
                    !cpu_has(c, X86_FEATURE_TSA_L1_NO)) {
                        if (cpu_matches(cpu_vuln_blacklist, TSA) ||
                            /* Enable bug on Zen guests to allow for live migration. */
                            (cpu_has(c, X86_FEATURE_HYPERVISOR) &&
                             cpu_has(c, X86_FEATURE_ZEN)))
                                setup_force_cpu_bug(X86_BUG_TSA);
                }
        }


In check_null_seg_clears_base():

        /*
         * CPUID bit above wasn't set. If this kernel is still running
         * as a HV guest, then the HV has decided not to advertize
         * that CPUID bit for whatever reason. For example, one
         * member of the migration pool might be vulnerable. Which
         * means, the bug is present: set the BUG flag and return.
         */
        if (cpu_has(c, X86_FEATURE_HYPERVISOR)) {
                set_cpu_bug(c, X86_BUG_NULL_SEG);
                return;
        }

The hypervisor could provide more information so that the guest can
determine when it's safe to use the short sequence, but that's just
icing on the cake. The default, out-of-the-box configuration must be
safe.

^ permalink raw reply

* Re: [PATCH net-next 0/7] tcp: restrict rcv_wnd and window_clamp to representable window
From: Simon Baatz @ 2026-04-09 21:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <CANn89iJO5DTWeVVfMYh7y3e9Npsu+FQ_a=W9ZqbMtb_wLeBL7A@mail.gmail.com>

Hi Eric,

On Thu, Apr 09, 2026 at 07:52:03AM -0700, Eric Dumazet wrote:
> On Wed, Apr 8, 2026 at 2:50 PM Simon Baatz via B4 Relay
> <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> >
> > Hi,
> >
> > this series ensures that rcv_wnd and window_clamp do not exceed the
> > maximum window size representable for the connection's window scale
> > factor.
> >
> > This is most visible when TCP window scaling is not used for a
> > connection. In that case, the advertised window is limited to 65535
> > bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
> > large receive buffers are used. The resulting mismatch breaks
> > calculations that depend on the advertised window, such as the ACK
> > decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
> >
> > Similar effects may also occur when window scaling is in use, e.g. if
> > the application dynamically adjusts SO_RCVBUF in unusual ways or when
> > the rmem sysctl parameters change during a connection's lifetime.
> >
> > Summary:
> >
> > - Patch 1 keeps rcv_wnd capped by the (window scale-limited)
> >   window_clamp at connection start.
> > - Patch 3 and 6 ensure that window_clamp is limited to the
> >   representable window when it is updated.
> > - The other patches add packetdrill tests to verify the new behavior.
> >
> > A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
> > loopback) shows a noticeable improvement with window scaling disabled:
> 
> Explain why we should spend time reviewing patches trying to help
> stacks from 2 decades ago,
> risking breaking other usages.
> 
> Almost every time we change the rcvbuf logic, we introduce bugs.

As soon as someone gives me access to a link with a bandwidth delay
product of probably > 500 MB I am happy to provide another set of
benchmarks results:

`./defaults.sh
sysctl -q net.ipv4.tcp_rmem="4096 2147483647 2147483647"`

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,nop,sackOK,nop,wscale 14>
   +0 < . 1:1(0) ack 1 win 32792

   +0 accept(3, ..., ...) = 4

   +0 getsockopt(4, IPPROTO_TCP, 10, [1073725440], [4]) = 0
   +0 < P. 1:65001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 65001 win 65535
   +0 < P. 65001:130001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 130001 win 65535
   +0 < P. 130001:195001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 195001 win 65535
   +0 < P. 195001:260001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 260001 win 65535
   +0 < P. 260001:325001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 325001 win 65535
   +0 < P. 325001:390001(65000) ack 1 win 32792
   +0 > . 1:1(0) ack 390001 win 65535
   +0 getsockopt(4, IPPROTO_TCP, 10, [2113929215], [4]) = 0
+.1 %{ assert tcpi_rcv_wnd <= 1073725440, tcpi_rcv_wnd }%

Fails with:

 AssertionError: 1074511872

on a current kernel.
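
For reference, the 1073725440 bound used in the assertion above is 65535 << 14, the largest window that a 16-bit window field can express with the negotiated scale of 14. A minimal sketch of that arithmetic follows; `max_representable_window()` and `cap_to_representable()` are hypothetical helpers for illustration and are not the functions touched by the series.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_MAX_WSCALE 14U	/* largest shift allowed by RFC 7323 */

/* Largest window a 16-bit window field can express at this scale. */
static uint32_t max_representable_window(unsigned int wscale)
{
	assert(wscale <= TCP_MAX_WSCALE);
	return (uint32_t)65535U << wscale;
}

/* Hypothetical helper: cap a window value to what can be advertised. */
static uint32_t cap_to_representable(uint32_t wnd, unsigned int wscale)
{
	uint32_t max = max_representable_window(wscale);

	return wnd < max ? wnd : max;
}
```

With wscale 14, the observed value 1074511872 exceeds the representable maximum and would be capped to 1073725440. Without window scaling the limit is 65535.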
 
So, I think we should spend time reviewing this because currently we
just pretend to clamp the window at its limits.
 
> Not using window scaling in 2026 and expecting "iperf improvement" is
> quite something!

I wondered if providing these numbers was a good idea and apparently
it wasn't.  I just found the difference to be striking.  The only
thing I wanted to demonstrate is that basing our calculations on
bogus window sizes can have real effects.

> Out of curiosity, which legacy product is stuck in the 20th century?

I have half a dozen of these products "stuck in the 20th century" at
home.  They are called IoT devices, and I find saying that TCP
connections to such devices need not have proper sequence number
acceptability tests according to RFC 9293 quite something.  ;-)

- Simon

-- 
Simon Baatz <gmbnomis@gmail.com>

^ permalink raw reply

* Re: [PATCH net-next 0/7] tcp: restrict rcv_wnd and window_clamp to representable window
From: Eric Dumazet @ 2026-04-09 21:28 UTC (permalink / raw)
  To: Simon Baatz
  Cc: Neal Cardwell, Kuniyuki Iwashima, David S. Miller, David Ahern,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Shuah Khan, netdev,
	linux-kernel, linux-kselftest
In-Reply-To: <adgY-Yn6-OophYXb@gandalf.schnuecks.de>

On Thu, Apr 9, 2026 at 2:24 PM Simon Baatz <gmbnomis@gmail.com> wrote:
>
> Hi Eric,
>
> On Thu, Apr 09, 2026 at 07:52:03AM -0700, Eric Dumazet wrote:
> > On Wed, Apr 8, 2026 at 2:50 PM Simon Baatz via B4 Relay
> > <devnull+gmbnomis.gmail.com@kernel.org> wrote:
> > >
> > > Hi,
> > >
> > > this series ensures that rcv_wnd and window_clamp do not exceed the
> > > maximum window size representable for the connection's window scale
> > > factor.
> > >
> > > This is most visible when TCP window scaling is not used for a
> > > connection. In that case, the advertised window is limited to 65535
> > > bytes, but rcv_wnd or window_clamp can still grow beyond 65535 when
> > > large receive buffers are used. The resulting mismatch breaks
> > > calculations that depend on the advertised window, such as the ACK
> > > decision in __tcp_ack_snd_check(), and can prevent immediate ACKs.
> > >
> > > Similar effects may also occur when window scaling is in use, e.g. if
> > > the application dynamically adjusts SO_RCVBUF in unusual ways or when
> > > the rmem sysctl parameters change during a connection's lifetime.
> > >
> > > Summary:
> > >
> > > - Patch 1 keeps rcv_wnd capped by the (window scale-limited)
> > >   window_clamp at connection start.
> > > - Patch 3 and 6 ensure that window_clamp is limited to the
> > >   representable window when it is updated.
> > > - The other patches add packetdrill tests to verify the new behavior.
> > >
> > > A simple iperf test on a virtme-ng VM (Intel i5-7500, 4 cores,
> > > loopback) shows a noticeable improvement with window scaling disabled:
> >
> > Explain why we should spend time reviewing patches trying to help
> > stacks from 2 decades ago,
> > risking breaking other usages.
> >
> > Almost every time we change the rcvbuf logic, we introduce bugs.
>
> As soon as someone gives me access to a link with a bandwidth delay
> product of probably > 500 MB I am happy to provide another set of
> benchmarks results:
>
> `./defaults.sh
> sysctl -q net.ipv4.tcp_rmem="4096 2147483647 2147483647"`

Please do not do this. Stick to reasonable limits.

You might have missed that we are flooded with bug reports (and buggy patches).
We have very limited time for bugs not proven by real-world conditions.

^ permalink raw reply

* Re: [PATCH 1/2] net: fix skb_ext_total_length() BUILD_BUG_ON with CONFIG_GCOV_PROFILE_ALL
From: Konstantin Khorenko @ 2026-04-09 21:43 UTC (permalink / raw)
  To: Paolo Abeni, David S . Miller, Eric Dumazet, Jakub Kicinski
  Cc: Simon Horman, Thomas Weißschuh, Arnd Bergmann,
	Peter Oberparleiter, Mikhail Zaslonko, netdev, linux-kernel,
	Pavel Tikhomirov, Vasileios Almpanis
In-Reply-To: <4f744383-1dc1-415a-a8da-5fe8f59daa35@redhat.com>

On 4/7/26 09:55, Paolo Abeni wrote:
> On 4/2/26 4:05 PM, Konstantin Khorenko wrote:
...
>>
>> Fixes: 5d21d0a65b57 ("net: generalize calculation of skb extensions length")
>> Fixes: d6e5794b06c0 ("net: avoid build bug in skb extension length calculation")
>>
>> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
> 
> No empty lines in the tags area.

Sure, will fix.

> Also given the commit description, isn't the introduction of the 5th skb
> extension a better fixes tag?

Well, if we did not have 5d21d0a65b57 ("net: generalize calculation of skb extensions length"),
we would not have a problem even after the 5th skb extension was added.

On the other hand, yes, the defect shows up only once the 5th skb extension appears,
so that commit can also be considered the culprit.

I will change the Fixes: tag.

>> Reviewed-by: Thomas Weißschuh <linux@weissschuh.net>
>> ---
>>   net/core/skbuff.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 0e217041958a..47c7f0ab6e84 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5145,7 +5145,7 @@ static const u8 skb_ext_type_len[] = {
>>   #endif
>>   };
>>   
>> -static __always_inline unsigned int skb_ext_total_length(void)
>> +static __always_inline __no_profile unsigned int skb_ext_total_length(void)
>>   {
>>   	unsigned int l = SKB_EXT_CHUNKSIZEOF(struct skb_ext);
>>   	int i;
>> @@ -5159,9 +5159,7 @@ static __always_inline unsigned int skb_ext_total_length(void)
>>   static void skb_extensions_init(void)
>>   {
>>   	BUILD_BUG_ON(SKB_EXT_NUM > 8);
>> -#if !IS_ENABLED(CONFIG_KCOV_INSTRUMENT_ALL)
>>   	BUILD_BUG_ON(skb_ext_total_length() > 255);
>> -#endif
> 
> Sashiko notes that there could be still build breakage:
> 
> https://sashiko.dev/#/patchset/20260402140558.1437002-1-khorenko%40virtuozzo.com
> 
> Could you please double check the above?

Sashiko is great!

The concern about KCOV is valid in theory but doesn't apply in practice. Here's why:

__no_profile (__no_profile_instrument_function__) indeed only prevents GCOV profiling counters
(-fprofile-arcs) from being inserted. It has no effect on KCOV instrumentation
(-fsanitize-coverage=trace-pc), which would require __no_sanitize_coverage instead.

However, KCOV instrumentation does not break constant folding in the first place.
I verified this with a standalone test: a __always_inline function with a loop over a const array 
(mimicking skb_ext_total_length()), compiled with different instrumentation flags:

  * No instrumentation: BUILD_BUG_ON passes (constant folded)
  * GCOV (-fprofile-arcs -ftest-coverage -fno-tree-loop-im): BUILD_BUG_ON fails
  * KCOV (-fsanitize-coverage=trace-pc): BUILD_BUG_ON passes (constant folded)
  * GCOV + atomic (-fprofile-arcs -ftest-coverage -fno-tree-loop-im -fprofile-update=atomic):
    BUILD_BUG_ON fails

The difference is in how GCC instruments code. GCOV inserts global counter increments inside the loop 
body. Combined with -fno-tree-loop-im, these counter operations prevent GCC from proving the loop 
result is a compile-time constant.

KCOV only inserts __sanitizer_cov_trace_pc() callbacks at basic block entries - these are opaque 
function calls that don't participate in value computation, so GCC can still see the loop iterates 
over a const array and fold it.
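
For reference, a minimal sketch of the kind of standalone test described above. It is not kernel code: the array contents, the header size of 1 and the helper names (ext_type_len, ext_total_length, total_is_folded, total_length, check_total_fits_u8) are assumptions chosen for illustration. It uses __builtin_constant_p() to report whether the compiler folded the loop, rather than failing the build the way BUILD_BUG_ON does, so the same file compiles under every flag combination.

```c
#include <assert.h>

/* Hypothetical per-extension sizes standing in for skb_ext_type_len[]. */
static const unsigned char ext_type_len[] = { 2, 4, 1, 3, 5 };

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Mimics skb_ext_total_length(): an always-inline loop over a const array. */
static inline __attribute__((always_inline)) unsigned int ext_total_length(void)
{
	unsigned int l = 1; /* assumed stand-in for the header chunk */
	unsigned int i;

	for (i = 0; i < ARRAY_LEN(ext_type_len); i++)
		l += ext_type_len[i];

	return l;
}

/*
 * Returns 1 if the compiler proved the total to be a compile-time
 * constant in this build, 0 otherwise. The answer depends on the
 * optimization level and on the instrumentation flags in use.
 */
int total_is_folded(void)
{
	return __builtin_constant_p(ext_total_length());
}

/* Runtime value of the total; the same regardless of instrumentation. */
unsigned int total_length(void)
{
	return ext_total_length();
}

/* Runtime analogue of BUILD_BUG_ON(skb_ext_total_length() > 255). */
void check_total_fits_u8(void)
{
	assert(ext_total_length() <= 255);
}
```

Building this file at -O2 with each of the flag sets listed above and comparing what total_is_folded() returns (or inspecting the generated assembly) should show whether a given instrumentation mode defeats constant folding. The outcome may differ between compiler versions, so treat it as a way to reproduce the check, not as a fixed result.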


 > I think a 'noinline' in skb_extensions_init() would address any
 > complains on patch 2/2

Yes, will add "noinline" to be on the safe side.

--
Konstantin Khorenko

^ permalink raw reply

