Linux RDMA and InfiniBand development

Linux RDMA and InfiniBand development
 help / color / mirror / Atom feed

* [PATCH v2] RDMA/irdma: Suppress PF reset on HMC error
From: Seyeong Kim @ 2026-06-19  5:00 UTC (permalink / raw)
  To: linux-rdma
  Cc: Krzysztof Czurylo, Tatyana Nikolova, Jason Gunthorpe,
	Leon Romanovsky, Seyeong Kim

The irdma driver currently issues an unconditional PF reset whenever the
HMC Error interrupt (PFINT_OICR bit 26) fires:

	if (event->reg & IRDMAPFINT_OICR_HMC_ERR_M) {
		ibdev_err(&iwdev->ibdev, "HMC Error\n");
		iwdev->rf->reset = true;
	}

request_reset() issues an IIDC_PFR to ice. In practice a single HMC_ERR
can trigger cascading PF resets, IOMMU faults during teardown, and
teardown of every RDMA connection on the device.

i40e handles the identically-named interrupt by reading
PFHMC_ERRORINFO and PFHMC_ERRORDATA and logging them without touching
device state; see commit 9c010ee0ea5f ("i40e: Suppress HMC error to
Interrupt message level") which removed the reset as "not necessary".
This patch mirrors that handling on irdma.

With this change, repeated HMC_ERR no longer produces a reset storm and
RDMA traffic on the device continues uninterrupted.

Signed-off-by: Seyeong Kim <seyeong.kim@canonical.com>
---
v2:
 - Drop RFC; retested on mainline v7.1-rc4 (NVM 4.51). The patch
   applies unchanged and behaves the same. No functional change.

v1: https://lore.kernel.org/linux-rdma/20260416071541.3899471-1-seyeong.kim@canonical.com/

Notes for reviewers
-------------------

Some details are inferred rather than verified against the E810
datasheet; Intel confirmation would settle them:

1. Register offsets. PFHMC_ERRORINFO (0x00520400) and PFHMC_ERRORDATA
   (0x00520500) are inferred from the same 0x00520000 PFHMC bank as
   PFHMC_PDINV (0x00520300); they have not been verified against the
   datasheet. On v7.1-rc4 both reads returned 0x00000000, which is
   consistent but not a positive confirmation of the offsets.

2. HMC error semantics. The assumption that every E810 HMC_ERR is
   safe to continue past is not datasheet-confirmed. A conditional
   reset branch analogous to the existing PE_CRITERR /
   IRDMA_Q1_RESOURCE_ERR whitelist can be added on top if needed.

3. Test methodology. The interrupt was forced via a /dev/mem bit
   write to PFINT_OICR, which exercises the handler path but does
   not reproduce firmware-triggered HMC errors directly.

Testing details (mainline v7.1-rc4)
-----------------------------------

Tested on:
  Kernel   : 7.1.0-rc4 (mainline)
  Adapter  : Intel E810-XXV for SFP [8086:159b rev02], 2-port
  NVM      : 4.51, fw.mgmt 7.5.4, DDP 1.3.43.0
  Repro    : writel(BIT(26), BAR0 + 0x0016CA00) via /dev/mem
  Workload : ib_write_bw -R -F -q 4 -D 90, 4 QPs, loopback on the
             injected PF; HMC_ERR forced three times during the run

Before the patch, a forced HMC_ERR caused a full PF reset. With an
RDMA workload running, the reset tore down the irdma aux device with a
uverbs file still open and hit a WARNING at uverbs_destroy_ufile_hw
(rdma_core.c:957, via ice_prepare_for_reset -> ice_unplug_aux_dev);
the ib_write_bw run aborted. After the patch, each forced HMC_ERR only
logged a single "HMC Error: errinfo=0x00000000 errdata=0x00000000"
line - no reset, no WARNING - and the run completed (8622 MiB/s over
4 QPs).

The customer report this addresses was on NVM 3.10, where a single
HMC_ERR produced cascading PF resets and DMAR faults rather than the
single reset seen on NVM 4.51. The unconditional reset is the common
cause in both cases.

 drivers/infiniband/hw/irdma/i40iw_hw.c  | 4 +++-
 drivers/infiniband/hw/irdma/icrdma_hw.c | 2 ++
 drivers/infiniband/hw/irdma/icrdma_hw.h | 2 ++
 drivers/infiniband/hw/irdma/icrdma_if.c | 8 ++++++--
 drivers/infiniband/hw/irdma/irdma.h     | 2 ++
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/i40iw_hw.c b/drivers/infiniband/hw/irdma/i40iw_hw.c
index 60c1f2b1811d..8301938b4543 100644
--- a/drivers/infiniband/hw/irdma/i40iw_hw.c
+++ b/drivers/infiniband/hw/irdma/i40iw_hw.c
@@ -29,7 +29,9 @@ static u32 i40iw_regs[IRDMA_MAX_REGS] = {
 	I40E_PFHMC_PDINV,
 	I40E_GLHMC_VFPDINV(0),
 	I40E_GLPE_CRITERR,
-	0xffffffff      /* PFINT_RATEN not used in FPK */
+	0xffffffff,     /* PFINT_RATEN not used in FPK */
+	0xffffffff,     /* PFHMC_ERRORINFO not used in FPK */
+	0xffffffff      /* PFHMC_ERRORDATA not used in FPK */
 };

 static u32 i40iw_stat_offsets[] = {
diff --git a/drivers/infiniband/hw/irdma/icrdma_hw.c b/drivers/infiniband/hw/irdma/icrdma_hw.c
index 32f26284a788..b1f1b5485762 100644
--- a/drivers/infiniband/hw/irdma/icrdma_hw.c
+++ b/drivers/infiniband/hw/irdma/icrdma_hw.c
@@ -29,6 +29,8 @@ static u32 icrdma_regs[IRDMA_MAX_REGS] = {
 	GLHMC_VFPDINV(0),
 	GLPE_CRITERR,
 	GLINT_RATE(0),
+	PFHMC_ERRORINFO,
+	PFHMC_ERRORDATA,
 };

 static u64 icrdma_masks[IRDMA_MAX_MASKS] = {
diff --git a/drivers/infiniband/hw/irdma/icrdma_hw.h b/drivers/infiniband/hw/irdma/icrdma_hw.h
index d97944ab45da..0acdeda1236d 100644
--- a/drivers/infiniband/hw/irdma/icrdma_hw.h
+++ b/drivers/infiniband/hw/irdma/icrdma_hw.h
@@ -40,6 +40,8 @@
 #define GLHMC_VFPDINV(_i)	(0x00528300 + ((_i) * 4)) /* _i=0...31 */
 #define GLPE_CRITERR		0x00534000
 #define GLINT_RATE(_INT)	(0x0015A000 + ((_INT) * 4)) /* _i=0...2047 */ /* Reset Source: CORER */
+#define PFHMC_ERRORINFO		0x00520400
+#define PFHMC_ERRORDATA		0x00520500

 #define ICRDMA_DB_ADDR_OFFSET		(8 * 1024 * 1024 - 64 * 1024)

diff --git a/drivers/infiniband/hw/irdma/icrdma_if.c b/drivers/infiniband/hw/irdma/icrdma_if.c
index 2172a2092e3f..4b451d8482a4 100644
--- a/drivers/infiniband/hw/irdma/icrdma_if.c
+++ b/drivers/infiniband/hw/irdma/icrdma_if.c
@@ -91,8 +91,12 @@ static void icrdma_iidc_event_handler(struct iidc_rdma_core_dev_info *cdev_info,
 			}
 		}
 		if (event->reg & IRDMAPFINT_OICR_HMC_ERR_M) {
-			ibdev_err(&iwdev->ibdev, "HMC Error\n");
-			iwdev->rf->reset = true;
+			u32 hmc_errinfo = readl(iwdev->rf->sc_dev.hw_regs[IRDMA_PFHMC_ERRORINFO]);
+			u32 hmc_errdata = readl(iwdev->rf->sc_dev.hw_regs[IRDMA_PFHMC_ERRORDATA]);
+
+			/* Log diagnostics; do not reset here. */
+			ibdev_warn(&iwdev->ibdev, "HMC Error: errinfo=0x%08x errdata=0x%08x\n",
+				   hmc_errinfo, hmc_errdata);
 		}
 		if (event->reg & IRDMAPFINT_OICR_PE_PUSH_M) {
 			ibdev_err(&iwdev->ibdev, "PE Push Error\n");
diff --git a/drivers/infiniband/hw/irdma/irdma.h b/drivers/infiniband/hw/irdma/irdma.h
index ff938a01d70c..e8cda27d7854 100644
--- a/drivers/infiniband/hw/irdma/irdma.h
+++ b/drivers/infiniband/hw/irdma/irdma.h
@@ -66,6 +66,8 @@ enum irdma_registers {
 	IRDMA_GLHMC_VFPDINV,
 	IRDMA_GLPE_CRITERR,
 	IRDMA_GLINT_RATE,
+	IRDMA_PFHMC_ERRORINFO,
+	IRDMA_PFHMC_ERRORDATA,
 	IRDMA_MAX_REGS, /* Must be last entry */
 };

-- 
2.43.0

^ permalink raw reply related

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-19  3:31 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221106.236699-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:12, Bryam Vargas wrote:
>On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
>> Once we validate the CDC message at the input boundary (as in the
>> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
>> this check becomes unreachable. So I don't think this patch is needed.
>
>This one I'd actually like to keep, and let me walk through why -- I don't think the
>boundary check closes it.
>
>bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
>smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
>diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
>the boundary doesn't bound the sum of the deltas.
>
>The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
>which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
>going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
>message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
>atomic_t negative. I have an A/B for this -- happy to send it along.

Glad to see you A/B test, I think we can decide after we see the real
issue.

>
>So to make this truly unreachable from the boundary check, we'd need to bound
>prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
>lines and race-free against the tasklet, so my preference would be to keep it as a
>backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
>open to that.

Another thing I'd worry about is if this really happens, should we also
abort the connection like what we did in patch #1 ?

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-19  3:26 UTC (permalink / raw)
  To: Bryam Vargas
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <20260618221057.236673-1-hexlabsecurity@proton.me>

On 2026-06-18 22:11:05, Bryam Vargas wrote:
>On Thu, 18 Jun 2026 22:29:20 +0800, Dust Li wrote:
>> once we detect that the peer is misbehaving, I think the right action is
>> to abort the connection and record the event, rather than silently clamp.
>[...]
>>         u32 prod_count = ntohs(cdc->prod.count);
>> ...
>>             cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
>
>Thanks for taking a look, Dust. I'm on board with the direction for net-next --
>aborting and recording a bad CDC is cleaner than clamping something we already know
>we can't trust, and as you say, the clamp just papers over the peer bug. So: minimal
>clamp stays for -stable, and net-next gets the wire-boundary check + abort (through
>abort_work, with an smc_stats counter and a ratelimited warn).

That's greate. Then I think we can move on in this direction.

>
>A few things I ran into on the check itself, though:
>
>- count is __be32, so it wants ntohl() rather than ntohs() -- ntohs() ends up reading
>  the wrong half.

Right

>
>- I'd drop the wrap > 1 tests. wrap is a free-running counter (smc_curs_add does
>  wrap++), so a connection that legitimately wraps its RMB ends up with wrap > 1; and
>  since it's a __be16 read raw, on little-endian wrap==1 already reads as 0x0100 and
>  we'd abort on the very first wrap. I don't think there's a sane upper bound to put
>  on wrap.

Agree

>
>- the check is typed for SMC-R, but the SMC-D path hands a host-order smcd_cdc_msg to
>  smc_cdc_msg_recv() cast as smc_cdc_msg (smc_cdc.c:456), so ntohl/ntohs would
>  double-swap it there. The simplest thing I found is one check on the host cursor
>  right after smc_cdc_msg_to_host(), before the diff/atomic_add block -- that covers
>  SMC-R and SMC-D in one place.

Agree

>
>Minor: >= len rather than > len (count is an offset in [0,len)), and peer_rmbe_size
>is signed so worth guarding. The cons vs peer_rmbe_size bound looks right to me.

No problem

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH net-next v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
From: patchwork-bot+netdevbpf @ 2026-06-19  1:40 UTC (permalink / raw)
  To: Michael Bommarito
  Cc: achender, davem, kuba, pabeni, edumazet, horms, netdev,
	linux-rdma, rds-devel, linux-kernel
In-Reply-To: <20260617023146.2780077-1-michael.bommarito@gmail.com>

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 16 Jun 2026 22:31:46 -0400 you wrote:
> rds_rm_size() handles RDS_CMSG_RDMA_ARGS after only CMSG_OK() and then
> calls rds_rdma_extra_size(), which reads args->local_vec_addr and
> args->nr_local without first checking that cmsg_len covers struct
> rds_rdma_args. The other two RDS_CMSG_RDMA_ARGS consumers already guard
> this: rds_rdma_bytes() in rds_sendmsg() and rds_cmsg_rdma_args() in
> rds_cmsg_send() both reject cmsg_len < CMSG_LEN(sizeof(struct
> rds_rdma_args)). Add the same check to rds_rm_size() so all three RDMA
> args passes are consistent.
> 
> [...]

Here is the summary with links:
  - [net-next,v2] net: rds: check cmsg_len before reading rds_rdma_args in size pass
    https://git.kernel.org/netdev/net/c/e5c00023270e

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* Re: [PATCH net V2 0/3] net/mlx5e: Fix crashes in dynamic per-channel stats and HV VHCA agent
From: Jakub Kicinski @ 2026-06-19  1:14 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Andrew Lunn, David S. Miller, Eric Dumazet, netdev, Paolo Abeni,
	Cosmin Ratiu, Eran Ben Elisha, Feng Liu, Haiyang Zhang,
	Lama Kayal, Leon Romanovsky, linux-kernel, linux-rdma, Mark Bloch,
	Nimrod Oren, Saeed Mahameed
In-Reply-To: <20260617140127.573117-1-tariqt@nvidia.com>

On Wed, 17 Jun 2026 17:01:24 +0300 Tariq Toukan wrote:
> Since per-channel stats were converted to be allocated and published
> lazily at first channel open in commit fa691d0c9c08 ("net/mlx5e:
> Allocate per-channel stats dynamically at first usage"),
> priv->channel_stats[] and priv->stats_nch are filled in
> incrementally during interface bring-up. This opened a window in
> which the various stats readers - most of them reachable from
> userspace via netlink/netdev stats queries - can race with
> mlx5e_open_channel() on another CPU and observe partially
> initialized state. The HV VHCA stats agent, which is created
> before the channels are opened, hits related problems of its own.
> 
> This series by Feng fixes the resulting crashes.

No longer(?) applies:

Applying: net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
Applying: net/mlx5e: Fix HV VHCA stats agent registration race
Applying: net/mlx5e: Fix publication race for priv->channel_stats[]
error: patch failed: drivers/net/ethernet/mellanox/mlx5/core/en_main.c:5533
error: drivers/net/ethernet/mellanox/mlx5/core/en_main.c: patch does not apply
Patch failed at 0003 net/mlx5e: Fix publication race for priv->channel_stats[]
-- 
pw-bot: cr

^ permalink raw reply

* Re: [PATCH v3 3/3] net/smc: bound the send length to the send buffer in smc_tx_sendmsg()
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQX7_9xFI9GSaq5@linux.alibaba.com>

On Fri, 19 Jun 2026 00:08:15 +0800, Dust Li wrote:
> I think this is the same as patch #2.

Same story as 2/3, just on the SMC-D send side: sndbuf_space accumulates
diff_tx = smc_curs_diff(sndbuf_desc->len, tx_curs_fin, cons) from the peer's consumer
cursor, so a cons alternating wrap 0<->1 walks it past sndbuf_desc->len (and negative
over time), and smc_tx_sendmsg's wrap-around write then runs off the end of the
buffer. The boundary count check doesn't bound diff_tx here either, so I'd keep the
same two-line bound. The same A/B covers it.

Bryam

^ permalink raw reply

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQWxQZXzM2J8kaZ@linux.alibaba.com>

On Fri, 19 Jun 2026 00:03:17 +0800, Dust Li wrote:
> Once we validate the CDC message at the input boundary (as in the
> previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
> this check becomes unreachable. So I don't think this patch is needed.

This one I'd actually like to keep, and let me walk through why -- I don't think the
boundary check closes it.

bytes_to_rcv isn't set to a cursor count, it's a running accumulator:
smc_cdc_msg_recv_action does atomic_add(diff_prod, &bytes_to_rcv), where
diff_prod = smc_curs_diff(rmb_desc->len, old, new). So bounding each cursor's count at
the boundary doesn't bound the sum of the deltas.

The differing-wrap branch of smc_curs_diff returns (len - old.count) + new.count,
which is up to 2*len-1 even when both cursors pass count <= len. With len=16, a prod
going (0,0) -> (1,15) gives diff=31, so bytes_to_rcv is already 31 > len after one
message; alternating wrap 0<->1 at count=15 keeps adding ~len and eventually wraps the
atomic_t negative. I have an A/B for this -- happy to send it along.

So to make this truly unreachable from the boundary check, we'd need to bound
prod - cons <= len there, not just the absolute count. The consumer-side clamp is two
lines and race-free against the tasklet, so my preference would be to keep it as a
backstop -- but if you'd rather fold it into a stronger boundary check instead, I'm
open to that.

Bryam

^ permalink raw reply

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Bryam Vargas @ 2026-06-18 22:11 UTC (permalink / raw)
  To: Dust Li
  Cc: Wenjia Zhang, D . Wythe, Sidraya Jayagond, Eric Dumazet,
	David S . Miller, Mahanta Jambigi, Wen Gu, Simon Horman,
	Ursula Braun, Stefan Raspl, Tony Lu, Paolo Abeni, Jakub Kicinski,
	netdev, linux-s390, linux-rdma, linux-kernel
In-Reply-To: <ajQAwBMzCJfO9SM1@linux.alibaba.com>

On Thu, 18 Jun 2026 22:29:20 +0800, Dust Li wrote:
> once we detect that the peer is misbehaving, I think the right action is
> to abort the connection and record the event, rather than silently clamp.
[...]
>         u32 prod_count = ntohs(cdc->prod.count);
> ...
>             cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {

Thanks for taking a look, Dust. I'm on board with the direction for net-next --
aborting and recording a bad CDC is cleaner than clamping something we already know
we can't trust, and as you say, the clamp just papers over the peer bug. So: minimal
clamp stays for -stable, and net-next gets the wire-boundary check + abort (through
abort_work, with an smc_stats counter and a ratelimited warn).

A few things I ran into on the check itself, though:

- count is __be32, so it wants ntohl() rather than ntohs() -- ntohs() ends up reading
  the wrong half.

- I'd drop the wrap > 1 tests. wrap is a free-running counter (smc_curs_add does
  wrap++), so a connection that legitimately wraps its RMB ends up with wrap > 1; and
  since it's a __be16 read raw, on little-endian wrap==1 already reads as 0x0100 and
  we'd abort on the very first wrap. I don't think there's a sane upper bound to put
  on wrap.

- the check is typed for SMC-R, but the SMC-D path hands a host-order smcd_cdc_msg to
  smc_cdc_msg_recv() cast as smc_cdc_msg (smc_cdc.c:456), so ntohl/ntohs would
  double-swap it there. The simplest thing I found is one check on the host cursor
  right after smc_cdc_msg_to_host(), before the diff/atomic_add block -- that covers
  SMC-R and SMC-D in one place.

Minor: >= len rather than > len (count is an offset in [0,len)), and peer_rmbe_size
is signed so worth guarding. The cons vs peer_rmbe_size bound looks right to me.

Happy to spin it whichever way you prefer.

Bryam


^ permalink raw reply

* [PATCH rdma-next 4/4] RDMA/irdma: Add refcounting to user ring MRs
From: Jacob Moroni @ 2026-06-18 20:14 UTC (permalink / raw)
  To: tatyana.e.nikolova, jgg, leon; +Cc: linux-rdma, Jacob Moroni
In-Reply-To: <20260618201458.875740-1-jmoroni@google.com>

Prevent userspace from deregistering the MRs that back QP/CQ/SRQ rings
by bumping the MR's refcount upon association.

Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/hw/irdma/utils.c |  6 ++++
 drivers/infiniband/hw/irdma/verbs.c | 45 +++++++++++++++++++++++++++--
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/utils.c b/drivers/infiniband/hw/irdma/utils.c
index e4037d5ef..290ad02ed 100644
--- a/drivers/infiniband/hw/irdma/utils.c
+++ b/drivers/infiniband/hw/irdma/utils.c
@@ -1168,6 +1168,12 @@ void irdma_free_qp_rsrc(struct irdma_qp *iwqp)
 	iwqp->kqp.dma_mem.va = NULL;
 	kfree(iwqp->kqp.sq_wrid_mem);
 	kfree(iwqp->kqp.rq_wrid_mem);
+
+	if (iwqp->user_mode && iwqp->iwpbl) {
+		struct irdma_mr *iwmr = iwqp->iwpbl->iwmr;
+
+		refcount_dec(&iwmr->user_ring_refs);
+	}
 }
 
 /**
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index dccecff3c..70201e1e2 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -464,6 +464,9 @@ static struct irdma_pbl *irdma_get_pbl(unsigned long va,
 
 	list_for_each_entry (iwpbl, pbl_list, list) {
 		if (iwpbl->user_base == va) {
+			struct irdma_mr *iwmr = iwpbl->iwmr;
+
+			refcount_inc(&iwmr->user_ring_refs);
 			list_del(&iwpbl->list);
 			iwpbl->on_list = false;
 			return iwpbl;
@@ -1881,6 +1884,11 @@ static void irdma_srq_free_rsrc(struct irdma_pci_f *rf, struct irdma_srq *iwsrq)
 		dma_free_coherent(rf->sc_dev.hw->device, iwsrq->kmem.size,
 				  iwsrq->kmem.va, iwsrq->kmem.pa);
 		iwsrq->kmem.va = NULL;
+	} else {
+		/* Not called in any failure path, so iwpbl is valid. */
+		struct irdma_mr *iwmr = iwsrq->iwpbl->iwmr;
+
+		refcount_dec(&iwmr->user_ring_refs);
 	}
 
 	irdma_free_rsrc(rf, rf->allocated_srqs, srq->srq_uk.srq_id);
@@ -1903,6 +1911,21 @@ static void irdma_cq_free_rsrc(struct irdma_pci_f *rf, struct irdma_cq *iwcq)
 				  iwcq->kmem_shadow.size,
 				  iwcq->kmem_shadow.va, iwcq->kmem_shadow.pa);
 		iwcq->kmem_shadow.va = NULL;
+	} else {
+		struct irdma_mr *iwmr;
+
+		/* May be called in a failure path before iwpbl is valid. */
+		if (iwcq->iwpbl) {
+			iwmr = iwcq->iwpbl->iwmr;
+
+			refcount_dec(&iwmr->user_ring_refs);
+		}
+
+		if (iwcq->iwpbl_shadow) {
+			iwmr = iwcq->iwpbl_shadow->iwmr;
+
+			refcount_dec(&iwmr->user_ring_refs);
+		}
 	}
 
 	irdma_free_rsrc(rf, rf->allocated_cqs, cq->cq_uk.cq_id);
@@ -2018,7 +2041,7 @@ static int irdma_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 	struct irdma_modify_cq_info info = {};
 	struct irdma_dma_mem kmem_buf;
 	struct irdma_cq_mr *cqmr_buf;
-	struct irdma_pbl *iwpbl_buf;
+	struct irdma_pbl *iwpbl_buf = NULL;
 	struct irdma_device *iwdev;
 	struct irdma_pci_f *rf;
 	struct irdma_cq_buf *cq_buf = NULL;
@@ -2129,11 +2152,19 @@ static int irdma_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 		goto error;
 
 	spin_lock_irqsave(&iwcq->lock, flags);
-	if (udata)
+	if (udata) {
+		struct irdma_pbl *old_iwpbl = iwcq->iwpbl;
+
 		/* Only update if the resize was successful. Otherwise, HW is
 		 * still pointing to the old PBL.
 		 */
 		iwcq->iwpbl = iwpbl_buf;
+		if (old_iwpbl) {
+			struct irdma_mr *old_iwmr = old_iwpbl->iwmr;
+
+			refcount_dec(&old_iwmr->user_ring_refs);
+		}
+	}
 	if (cq_buf) {
 		cq_buf->kmem_buf = iwcq->kmem;
 		cq_buf->hw = dev->hw;
@@ -2149,6 +2180,11 @@ static int irdma_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 
 	return 0;
 error:
+	if (iwpbl_buf) {
+		struct irdma_mr *iwmr = iwpbl_buf->iwmr;
+
+		refcount_dec(&iwmr->user_ring_refs);
+	}
 	if (!udata) {
 		dma_free_coherent(dev->hw->device, kmem_buf.size, kmem_buf.va,
 				  kmem_buf.pa);
@@ -2425,6 +2461,11 @@ static int irdma_create_srq(struct ib_srq *ibsrq,
 		dma_free_coherent(rf->hw.device, iwsrq->kmem.size,
 				  iwsrq->kmem.va, iwsrq->kmem.pa);
 free_rsrc:
+	if (iwsrq->user_mode && iwsrq->iwpbl) {
+		struct irdma_mr *iwmr = iwsrq->iwpbl->iwmr;
+
+		refcount_dec(&iwmr->user_ring_refs);
+	}
 	irdma_free_rsrc(rf, rf->allocated_srqs, iwsrq->srq_num);
 	return err_code;
 }
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


^ permalink raw reply related

* [PATCH rdma-next 3/4] RDMA/irdma: Add irdma_cq fields to track pbl allocations
From: Jacob Moroni @ 2026-06-18 20:14 UTC (permalink / raw)
  To: tatyana.e.nikolova, jgg, leon; +Cc: linux-rdma, Jacob Moroni
In-Reply-To: <20260618201458.875740-1-jmoroni@google.com>

These fields will be used in a subsequent commit which adds
refcounting to user CQ MRs.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/hw/irdma/verbs.c | 25 +++++++++++++++----------
 drivers/infiniband/hw/irdma/verbs.h |  2 ++
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 2c51684dd..dccecff3c 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2129,6 +2129,11 @@ static int irdma_resize_cq(struct ib_cq *ibcq, unsigned int entries,
 		goto error;
 
 	spin_lock_irqsave(&iwcq->lock, flags);
+	if (udata)
+		/* Only update if the resize was successful. Otherwise, HW is
+		 * still pointing to the old PBL.
+		 */
+		iwcq->iwpbl = iwpbl_buf;
 	if (cq_buf) {
 		cq_buf->kmem_buf = iwcq->kmem;
 		cq_buf->hw = dev->hw;
@@ -2499,6 +2504,8 @@ static int irdma_create_cq(struct ib_cq *ibcq,
 	INIT_LIST_HEAD(&iwcq->resize_list);
 	INIT_LIST_HEAD(&iwcq->cmpl_generated);
 	iwcq->cq_num = cq_num;
+	iwcq->iwpbl = NULL;
+	iwcq->iwpbl_shadow = NULL;
 	info.dev = dev;
 	ukinfo->cq_size = max(entries, 4);
 	ukinfo->cq_id = cq_num;
@@ -2518,8 +2525,6 @@ static int irdma_create_cq(struct ib_cq *ibcq,
 		struct irdma_ucontext *ucontext;
 		struct irdma_create_cq_req req = {};
 		struct irdma_cq_mr *cqmr;
-		struct irdma_pbl *iwpbl;
-		struct irdma_pbl *iwpbl_shadow;
 		struct irdma_cq_mr *cqmr_shadow;
 
 		iwcq->user_mode = true;
@@ -2533,34 +2538,34 @@ static int irdma_create_cq(struct ib_cq *ibcq,
 		}
 
 		spin_lock_irqsave(&ucontext->cq_reg_mem_list_lock, flags);
-		iwpbl = irdma_get_pbl((unsigned long)req.user_cq_buf,
-				      &ucontext->cq_reg_mem_list);
+		iwcq->iwpbl = irdma_get_pbl((unsigned long)req.user_cq_buf,
+					    &ucontext->cq_reg_mem_list);
 		spin_unlock_irqrestore(&ucontext->cq_reg_mem_list_lock, flags);
-		if (!iwpbl) {
+		if (!iwcq->iwpbl) {
 			err_code = -EPROTO;
 			goto cq_free_rsrc;
 		}
 
-		cqmr = &iwpbl->cq_mr;
+		cqmr = &iwcq->iwpbl->cq_mr;
 
 		if (rf->sc_dev.hw_attrs.uk_attrs.feature_flags &
 		    IRDMA_FEATURE_CQ_RESIZE) {
 			spin_lock_irqsave(&ucontext->cq_reg_mem_list_lock, flags);
-			iwpbl_shadow = irdma_get_pbl(
+			iwcq->iwpbl_shadow = irdma_get_pbl(
 					(unsigned long)req.user_shadow_area,
 					&ucontext->cq_reg_mem_list);
 			spin_unlock_irqrestore(&ucontext->cq_reg_mem_list_lock, flags);
 
-			if (!iwpbl_shadow) {
+			if (!iwcq->iwpbl_shadow) {
 				err_code = -EPROTO;
 				goto cq_free_rsrc;
 			}
-			cqmr_shadow = &iwpbl_shadow->cq_mr;
+			cqmr_shadow = &iwcq->iwpbl_shadow->cq_mr;
 			info.shadow_area_pa = cqmr_shadow->cq_pbl.addr;
 		} else {
 			info.shadow_area_pa = cqmr->shadow;
 		}
-		if (iwpbl->pbl_allocated) {
+		if (iwcq->iwpbl->pbl_allocated) {
 			info.virtual_map = true;
 			info.pbl_chunk_size = 1;
 			info.first_pm_pbl_idx = cqmr->cq_pbl.idx;
diff --git a/drivers/infiniband/hw/irdma/verbs.h b/drivers/infiniband/hw/irdma/verbs.h
index fbd487dbe..a1651641e 100644
--- a/drivers/infiniband/hw/irdma/verbs.h
+++ b/drivers/infiniband/hw/irdma/verbs.h
@@ -153,6 +153,8 @@ struct irdma_cq {
 	struct list_head resize_list;
 	struct irdma_cq_poll_info cur_cqe;
 	struct list_head cmpl_generated;
+	struct irdma_pbl *iwpbl;
+	struct irdma_pbl *iwpbl_shadow;
 };
 
 struct irdma_cmpl_gen {
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


^ permalink raw reply related

* [PATCH rdma-next 2/4] RDMA/irdma: Add a refcount to track user ring MR associations
From: Jacob Moroni @ 2026-06-18 20:14 UTC (permalink / raw)
  To: tatyana.e.nikolova, jgg, leon; +Cc: linux-rdma, Jacob Moroni
In-Reply-To: <20260618201458.875740-1-jmoroni@google.com>

User QP/CQ/SRQ rings are registered with the normal reg_mr
mechanism prior to creating the actual QP/CQ/SRQ object. In
order to prevent userspace from deregistering these special MRs
while the child object still exists, a refcount will be used.

This commit adds the refcount and logic to reject a dereg_mr
with active references. Subsequent commits will add logic to
bump this refcount when the user QP/CQ/SRQ objects are created.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/hw/irdma/verbs.c | 21 +++++++++++++++++----
 drivers/infiniband/hw/irdma/verbs.h |  1 +
 2 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index af04dd554..2c51684dd 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -3363,6 +3363,7 @@ static struct irdma_mr *irdma_alloc_iwmr(struct ib_umem *region,
 	if (!iwmr)
 		return ERR_PTR(-ENOMEM);
 
+	refcount_set(&iwmr->user_ring_refs, 1);
 	iwpbl = &iwmr->iwpbl;
 	iwpbl->iwmr = iwmr;
 	iwmr->region = region;
@@ -3923,13 +3924,16 @@ static struct ib_mr *irdma_get_dma_mr(struct ib_pd *pd, int acc)
  * irdma_del_memlist - Deleting pbl list entries for CQ/QP
  * @iwmr: iwmr for IB's user page addresses
  * @ucontext: ptr to user context
+ *
+ * Return: True if the MR is currently in-use by a QP/CQ/SRQ ring.
  */
-static void irdma_del_memlist(struct irdma_mr *iwmr,
+static bool irdma_del_memlist(struct irdma_mr *iwmr,
 			      struct irdma_ucontext *ucontext)
 {
 	struct irdma_pbl *iwpbl = &iwmr->iwpbl;
 	unsigned long flags;
 	spinlock_t *lock;
+	bool in_use = false;
 
 	switch (iwmr->type) {
 	case IRDMA_MEMREG_TYPE_CQ:
@@ -3942,15 +3946,19 @@ static void irdma_del_memlist(struct irdma_mr *iwmr,
 		lock = &ucontext->srq_reg_mem_list_lock;
 		break;
 	default:
-		return;
+		return false;
 	}
 
 	spin_lock_irqsave(lock, flags);
-	if (iwpbl->on_list) {
+	if (!refcount_dec_if_one(&iwmr->user_ring_refs)) {
+		in_use = true;
+	} else if (iwpbl->on_list) {
 		iwpbl->on_list = false;
 		list_del(&iwpbl->list);
 	}
 	spin_unlock_irqrestore(lock, flags);
+
+	return in_use;
 }
 
 /**
@@ -3973,7 +3981,12 @@ static int irdma_dereg_mr(struct ib_mr *ib_mr, struct ib_udata *udata)
 			ucontext = rdma_udata_to_drv_context(udata,
 						struct irdma_ucontext,
 						ibucontext);
-			irdma_del_memlist(iwmr, ucontext);
+
+			/* Do not allow the MR to be unpinned if it is still
+			 * backing a user ring.
+			 */
+			if (irdma_del_memlist(iwmr, ucontext))
+				return -EBUSY;
 		}
 		goto done;
 	}
diff --git a/drivers/infiniband/hw/irdma/verbs.h b/drivers/infiniband/hw/irdma/verbs.h
index 289ebc9b2..fbd487dbe 100644
--- a/drivers/infiniband/hw/irdma/verbs.h
+++ b/drivers/infiniband/hw/irdma/verbs.h
@@ -120,6 +120,7 @@ struct irdma_mr {
 	u64 len;
 	u64 pgaddrmem[IRDMA_MAX_SAVED_PHY_PGADDR];
 	struct irdma_pbl iwpbl;
+	refcount_t user_ring_refs;
 };
 
 struct irdma_srq {
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


^ permalink raw reply related

* [PATCH rdma-next 1/4] RDMA/irdma: Deduplicate the irdma_del_memlist logic
From: Jacob Moroni @ 2026-06-18 20:14 UTC (permalink / raw)
  To: tatyana.e.nikolova, jgg, leon; +Cc: linux-rdma, Jacob Moroni
In-Reply-To: <20260618201458.875740-1-jmoroni@google.com>

Simplify/dedup the irdma_del_memlist logic in preparation for
the QP/CQ/SRQ ring MR refcounting change that will follow in
a subsequent commit.

Signed-off-by: Jacob Moroni <jmoroni@google.com>
---
 drivers/infiniband/hw/irdma/verbs.c | 31 +++++++++++------------------
 1 file changed, 12 insertions(+), 19 deletions(-)

diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index b79f5afe6..af04dd554 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -3929,35 +3929,28 @@ static void irdma_del_memlist(struct irdma_mr *iwmr,
 {
 	struct irdma_pbl *iwpbl = &iwmr->iwpbl;
 	unsigned long flags;
+	spinlock_t *lock;
 
 	switch (iwmr->type) {
 	case IRDMA_MEMREG_TYPE_CQ:
-		spin_lock_irqsave(&ucontext->cq_reg_mem_list_lock, flags);
-		if (iwpbl->on_list) {
-			iwpbl->on_list = false;
-			list_del(&iwpbl->list);
-		}
-		spin_unlock_irqrestore(&ucontext->cq_reg_mem_list_lock, flags);
+		lock = &ucontext->cq_reg_mem_list_lock;
 		break;
 	case IRDMA_MEMREG_TYPE_QP:
-		spin_lock_irqsave(&ucontext->qp_reg_mem_list_lock, flags);
-		if (iwpbl->on_list) {
-			iwpbl->on_list = false;
-			list_del(&iwpbl->list);
-		}
-		spin_unlock_irqrestore(&ucontext->qp_reg_mem_list_lock, flags);
+		lock = &ucontext->qp_reg_mem_list_lock;
 		break;
 	case IRDMA_MEMREG_TYPE_SRQ:
-		spin_lock_irqsave(&ucontext->srq_reg_mem_list_lock, flags);
-		if (iwpbl->on_list) {
-			iwpbl->on_list = false;
-			list_del(&iwpbl->list);
-		}
-		spin_unlock_irqrestore(&ucontext->srq_reg_mem_list_lock, flags);
+		lock = &ucontext->srq_reg_mem_list_lock;
 		break;
 	default:
-		break;
+		return;
+	}
+
+	spin_lock_irqsave(lock, flags);
+	if (iwpbl->on_list) {
+		iwpbl->on_list = false;
+		list_del(&iwpbl->list);
 	}
+	spin_unlock_irqrestore(lock, flags);
 }
 
 /**
-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog


^ permalink raw reply related

* [PATCH rdma-next 0/4] RDMA/irdma: Prevent premature deregistration of user ring MRs
From: Jacob Moroni @ 2026-06-18 20:14 UTC (permalink / raw)
  To: tatyana.e.nikolova, jgg, leon; +Cc: linux-rdma, Jacob Moroni

When a QP/CQ/SRQ is created, a two step process is used where the
buffer is allocated in userspace and explicitly registered with the
normal reg_mr mechanism prior to creating the actual QP/CQ/SRQ object.

Even though these special MRs are internal to the verbs provider,
nothing actually prevents a custom/malicious user application from
manually invoking dereg_mr on these regions. If this occurs, the PBL
is freed and umem is released while the HW may still be accessing it.

Since the core layer is unaware of the relationship between these MRs
and their associated QP/CQ/SRQ objects, refcounting must be performed
in the irdma driver to block any deregistration attempts if the region
is still associated with an active ring object.

Jacob Moroni (4):
  RDMA/irdma: Deduplicate the irdma_del_memlist logic
  RDMA/irdma: Add a refcount to track user ring MR associations
  RDMA/irdma: Add irdma_cq fields to track pbl allocations
  RDMA/irdma: Add refcounting to user ring MRs

 drivers/infiniband/hw/irdma/utils.c |   6 ++
 drivers/infiniband/hw/irdma/verbs.c | 116 ++++++++++++++++++++--------
 drivers/infiniband/hw/irdma/verbs.h |   3 +
 3 files changed, 93 insertions(+), 32 deletions(-)

-- 
2.55.0.rc0.738.g0c8ab3ebcc-goog

^ permalink raw reply

* RE: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Haiyang Zhang @ 2026-06-18 18:38 UTC (permalink / raw)
  To: Dexuan Cui, KY Srinivasan, wei.liu@kernel.org, Dexuan Cui,
	Long Li, andrew+netdev@lunn.ch, davem@davemloft.net,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	Konstantin Taranov, horms@kernel.org, ernis@linux.microsoft.com,
	dipayanroy@linux.microsoft.com, kees@kernel.org,
	jacob.e.keller@intel.com, ssengar@linux.microsoft.com,
	linux-hyperv@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-rdma@vger.kernel.org
  Cc: stable@vger.kernel.org
In-Reply-To: <20260618035029.249361-1-decui@microsoft.com>



> -----Original Message-----
> From: Dexuan Cui <decui@microsoft.com>
> Sent: Wednesday, June 17, 2026 11:50 PM
> To: KY Srinivasan <kys@microsoft.com>; Haiyang Zhang
> <haiyangz@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>;
> andrew+netdev@lunn.ch; davem@davemloft.net; edumazet@google.com;
> kuba@kernel.org; pabeni@redhat.com; Konstantin Taranov
> <kotaranov@microsoft.com>; horms@kernel.org; ernis@linux.microsoft.com;
> dipayanroy@linux.microsoft.com; kees@kernel.org; jacob.e.keller@intel.com;
> ssengar@linux.microsoft.com; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org
> Cc: stable@vger.kernel.org
> Subject: [PATCH net] net: mana: Sync page pool RX frags for CPU
> 
> MANA allocates RX buffers from page pool fragments when frag_count is
> greater than 1. In that case the buffers remain DMA mapped by page pool
> and the RX completion path does not call dma_unmap_single(). As a result,
> the implicit sync-for-CPU normally performed by dma_unmap_single() is
> missing before the packet data is passed to the networking stack.
> 
> This breaks RX on configurations which require explicit DMA syncing, for
> example when booted with swiotlb=force.
> 
> Fix this by recording the page pool page and DMA sync offset when the RX
> buffer is allocated, and syncing the received packet range for CPU access
> before handing the RX buffer to the stack.
> 
> Also validate the packet length reported in the RX CQE before using it as
> a DMA sync length or passing it to skb processing. The CQE is supplied
> by the device and should not be blindly trusted by Confidential VMs.
> 
> Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers
> instead of full pages to improve memory efficiency.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
>  include/net/mana/mana.h                       |  8 +++
>  2 files changed, 57 insertions(+), 12 deletions(-)

Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>



^ permalink raw reply

* Re: [PATCH v3 3/3] net/smc: bound the send length to the send buffer in smc_tx_sendmsg()
From: Dust Li @ 2026-06-18 16:08 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-3-551fa514257e@proton.me>

On 2026-06-14 03:23:32, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>On the SMC-D DMB-merge (nocopy) path, smc_cdc_msg_recv_action() advances
>conn->sndbuf_space from the peer's consumer cursor:
>
>	diff_tx = smc_curs_diff(sndbuf_desc->len, &tx_curs_fin,
>				&local_rx_ctrl.cons);
>	atomic_add(diff_tx, &conn->sndbuf_space);
>
>The consumer cursor is wire-controlled and unvalidated, and
>smc_curs_diff()'s differing-wrap branch can return more than
>sndbuf_desc->len, so a forged cursor drives sndbuf_space past the send
>buffer (and across many CDC messages can overflow the signed counter
>negative).  smc_tx_sendmsg() reads it as the available write space and
>performs a wrap-around copy whose second chunk (copylen - first_chunk,
>written at ring offset 0) is never re-bounded to sndbuf_desc->len,
>writing user data past the send buffer -- a heap out-of-bounds write of
>attacker-influenced length and content.

Hi Bryam,

I think this is the same as patch #2.

Best regards,
Dust


>
>Bound the write space to sndbuf_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= sndbuf_space <= sndbuf_desc->len invariant at the producer,
>race-free against the CDC tasklet that advances sndbuf_space.
>
>Fixes: cc0ab806fc52 ("net/smc: adapt cursor update when sndbuf and peer DMB are merged")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_tx.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
>index 3144b4b1fe29..5916f02060fb 100644
>--- a/net/smc/smc_tx.c
>+++ b/net/smc/smc_tx.c
>@@ -233,6 +233,19 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after smc_tx_wait above */
> 		writespace = atomic_read(&conn->sndbuf_space);
>+		/* sndbuf_space is advanced from the peer's wire-controlled
>+		 * consumer cursor on the SMC-D DMB-merge path; a forged cursor
>+		 * can inflate it past the send buffer, or overflow the signed
>+		 * accumulator to a negative value across many CDC messages
>+		 * (which a plain "> len" check would miss before the size_t
>+		 * cast below turns it huge).  Bound it to the send buffer in
>+		 * either case so the wrap-around write cannot run past
>+		 * sndbuf_desc->len.  This enforces the documented
>+		 * 0 <= sndbuf_space <= sndbuf_desc->len invariant at the
>+		 * producer, race-free against the CDC tasklet.
>+		 */
>+		if (writespace < 0 || writespace > conn->sndbuf_desc->len)
>+			writespace = conn->sndbuf_desc->len;
> 		/* not more than what user space asked for */
> 		copylen = min_t(size_t, send_remaining, writespace);
> 		/* determine start of sndbuf */
>
>-- 
>2.43.0
>

^ permalink raw reply

* Re: [PATCH v3 2/3] net/smc: bound the receive length to the RMB in smc_rx_recvmsg()
From: Dust Li @ 2026-06-18 16:03 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-2-551fa514257e@proton.me>

On 2026-06-14 03:23:31, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>conn->bytes_to_rcv is accumulated in the receive tasklet from the peer's
>producer cursor:
>
>	diff_prod = smc_curs_diff(rmb_desc->len, &prod_old, &prod_new);
>	atomic_add(diff_prod, &conn->bytes_to_rcv);
>
>smc_curs_diff()'s differing-wrap branch returns (size - old.count) +
>new.count, which exceeds rmb_desc->len for a forged producer cursor and
>accumulates across CDC messages, so bytes_to_rcv can grow past the RMB
>(and across many messages can overflow the signed counter negative).
>smc_rx_recvmsg() reads it as the number of readable bytes and performs a
>wrap-around copy whose second chunk (copylen - first_chunk, read from
>ring offset 0) is never re-bounded to rmb_desc->len, reading past the
>RMB into adjacent kernel memory and disclosing it to the peer.

Hi Bryam,

Once we validate the CDC message at the input boundary (as in the
previous patch), bytes_to_rcv can never exceed rmb_desc->len, so
this check becomes unreachable. So I don't think this patch is needed.

Best regards,
Dust


>
>Bound the readable count to rmb_desc->len where it is consumed, treating
>a negative (sign-overflowed) value as out of range too, so the copy
>length can never exceed the ring.  This enforces the documented
>0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer, where it
>is race-free against the producer update that runs in the receive
>tasklet.
>
>Fixes: 952310ccf2d8 ("smc: receive data from RMBE")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_rx.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
>diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
>index c1d9b923938d..f461cf10b085 100644
>--- a/net/smc/smc_rx.c
>+++ b/net/smc/smc_rx.c
>@@ -442,6 +442,18 @@ int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
> 		/* initialize variables for 1st iteration of subsequent loop */
> 		/* could be just 1 byte, even after waiting on data above */
> 		readable = smc_rx_data_available(conn, peeked_bytes);
>+		/* bytes_to_rcv is accumulated from the peer's wire-controlled
>+		 * producer cursor; a forged cursor can drive it past the RMB,
>+		 * or overflow the signed accumulator to a negative value across
>+		 * many CDC messages (which a plain "> len" check would miss
>+		 * before the size_t cast below turns it huge).  Bound it to the
>+		 * RMB in either case so the wrap-around copy cannot run past
>+		 * rmb_desc->len.  This enforces the documented
>+		 * 0 <= bytes_to_rcv <= rmb_desc->len invariant at the consumer,
>+		 * race-free against the producer update in the receive tasklet.
>+		 */
>+		if (readable < 0 || readable > conn->rmb_desc->len)
>+			readable = conn->rmb_desc->len;
> 		splbytes = atomic_read(&conn->splice_pending);
> 		if (!readable || (msg && splbytes)) {
> 			if (splbytes)
>
>-- 
>2.43.0
>

^ permalink raw reply

* Re: [PATCH rdma-next v2 4/6] RDMA/uverbs: Add ioctl method for CQ resize
From: Dennis Dalessandro @ 2026-06-18 15:40 UTC (permalink / raw)
  To: Leon Romanovsky, Jiri Pirko; +Cc: linux-rdma, jgg, mrgolin
In-Reply-To: <20260617110605.GV327369@unreal>

On 6/17/26 7:06 AM, Leon Romanovsky wrote:
> On Mon, Jun 15, 2026 at 10:50:38AM +0200, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@nvidia.com>
>>
>> Resize CQ is currently only reachable through the legacy write()
>> uverbs command (IB_USER_VERBS_CMD_RESIZE_CQ). Add an equivalent modern
>> ioctl method, UVERBS_METHOD_CQ_RESIZE, on the CQ object so the
>> operation is available through the ioctl interface and can carry
>> per-attribute extensions. The handler mirrors the legacy command: it
>> looks up the CQ, calls resize_user_cq() and returns the new cqe count.
>> The legacy write path is left in place for ABI compatibility.
> 
> I have a general question. Do we actually need CQ resizing, given that it is
> rarely implemented and often incorrect in existing drivers? Maybe this is a
> good time to consider deprecating that path.

Can you expand on what you mean by incorrect in existing drivers? I mean 
I see a glaring coding bug in our implementation but I can patch that.

-Denny



^ permalink raw reply

* Re: [GIT PULL] Please pull RDMA subsystem changes
From: pr-tracker-bot @ 2026-06-18 15:22 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Linus Torvalds, linux-rdma, linux-kernel, Leon Romanovsky
In-Reply-To: <20260617201545.GA320356@nvidia.com>

The pull request you sent on Wed, 17 Jun 2026 17:15:45 -0300:

> git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git tags/for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/9e7e6633458362db72427b48effad8d759131c35

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply

* [PATCH net v3] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Doruk Tan Ozturk @ 2026-06-18 14:55 UTC (permalink / raw)
  To: saeedm, leon, tariqt, mbloch, sd, andrew+netdev, davem, edumazet,
	kuba, pabeni
  Cc: borisp, raeds, ehakim, netdev, linux-rdma, linux-kernel, stable,
	horms

When an offloaded MACsec RX SC is deleted, macsec_del_rxsc_ctx() released
the per-SC metadata_dst with metadata_dst_free(), which calls kfree()
unconditionally and ignores the dst reference count. The RX datapath in
mlx5e_macsec_offload_handle_rx_skb() looks up the SC under rcu_read_lock()
via xa_load() and, while still holding only the RCU read lock, takes a
reference with dst_hold() and attaches the dst to the skb with
skb_dst_set().

A reader that has already obtained the rx_sc pointer can therefore race
with the delete path:

  CPU0 (del_rxsc)			CPU1 (rx datapath)
  --------------			------------------
					rcu_read_lock();
					rx_sc = xa_load(...)->rx_sc;
  xa_erase(...);
  metadata_dst_free(rx_sc->md_dst);	/* kfree(), ignores refcount */
					dst_hold(&rx_sc->md_dst->dst); /* UAF */
					skb_dst_set(skb, &rx_sc->md_dst->dst);

metadata_dst_free() frees the object even though the datapath still holds
(or is about to take) a reference, so the subsequent dst_hold() /
skb_dst_set() and the later skb free operate on freed memory.

Fix the owner side by dropping the reference with dst_release() instead of
freeing unconditionally. dst_release() only schedules the RCU-deferred
dst_destroy() once the reference count reaches zero, so a concurrent reader
that still holds a reference keeps the object alive.

Dropping the owner reference is not sufficient on its own: once the owner
reference is the last one, dst_release() drops the count to zero and the
destroy is merely RCU-deferred. A racing reader that runs plain dst_hold()
on that already-dead dst gets rcuref_get() == false but dst_hold() only
WARNs and attaches the dying dst to the skb anyway; the later skb free then
calls dst_release() on an object whose destroy is already scheduled, again
a use-after-free.

Convert the RX datapath to dst_hold_safe(), which returns false (without
warning) when the dst is already dead, and only attach it to the skb when a
reference was successfully taken. When the SC is being deleted the in-flight
packet simply proceeds without the offload metadata_dst: skb_metadata_dst()
returns NULL, the MACsec core sees !is_macsec_md_dst and skips this secy
(rx_uses_md_dst path), which is the correct behaviour for a packet whose SC
is going away.

While reworking the datapath lookup, also guard the two NULL dereferences
on the same path that an automated review (forwarded by Simon Horman)
flagged: xa_load() can return NULL when the fs_id has just been erased, and
mlx5e_macsec_add_rxsc() publishes sc_xarray_element via xa_alloc() before
rx_sc->md_dst is allocated, so a packet carrying a freshly recycled fs_id
can observe a non-NULL rx_sc whose md_dst is still NULL. Check both before
dereferencing.

Note: macsec_del_rxsc_ctx() also kfree()s rx_sc->sc_xarray_element without
an RCU grace period while the same datapath reads it under rcu_read_lock();
that is a separate pre-existing issue and is left to a follow-up patch.

Fixes: b7c9400cbc48 ("net/mlx5e: Implement MACsec Rx data path using MACsec skb_metadata_dst")
Cc: stable@vger.kernel.org
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
---
v3:
 - Also guard the RX-datapath NULL dereferences flagged by the automated
   review: NULL-check the xa_load() result and rx_sc->md_dst before use.
 - Note the unrelated non-RCU kfree(sc_xarray_element) in the delete path
   as a separate follow-up rather than folding it in here.
v2:
 - Convert the RX datapath dst_hold() to dst_hold_safe() so a reader racing
   the SC delete cannot attach a dst whose last reference was just dropped
   (per the automated review forwarded by Simon Horman).
v1: https://lore.kernel.org/netdev/20260615140534.52691-1-doruk@0sec.ai/

 .../net/ethernet/mellanox/mlx5/core/en_accel/macsec.c  | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
index 71b3a05..fb2c64d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
@@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
 	 */
 	list_del_rcu(&rx_sc->rx_sc_list_element);
 	xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
-	metadata_dst_free(rx_sc->md_dst);
+	dst_release(&rx_sc->md_dst->dst);
 	kfree(rx_sc->sc_xarray_element);
 	kfree_rcu_mightsleep(rx_sc);
 }
@@ -1695,10 +1695,10 @@ void mlx5e_macsec_offload_handle_rx_skb(struct net_device *netdev,

 	rcu_read_lock();
 	sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
-	rx_sc = sc_xarray_element->rx_sc;
-	if (rx_sc) {
-		dst_hold(&rx_sc->md_dst->dst);
-		skb_dst_set(skb, &rx_sc->md_dst->dst);
+	rx_sc = sc_xarray_element ? sc_xarray_element->rx_sc : NULL;
+	if (rx_sc && rx_sc->md_dst) {
+		if (dst_hold_safe(&rx_sc->md_dst->dst))
+			skb_dst_set(skb, &rx_sc->md_dst->dst);
 	}

 	rcu_read_unlock();
-- 
2.53.0

^ permalink raw reply related

* Re: [PATCH v3 1/3] net/smc: bound the wire-controlled producer cursor to the RMB
From: Dust Li @ 2026-06-18 14:29 UTC (permalink / raw)
  To: hexlabsecurity, Wenjia Zhang, D. Wythe, Sidraya Jayagond
  Cc: Eric Dumazet, David S. Miller, Mahanta Jambigi, Wen Gu,
	Simon Horman, netdev, Ursula Braun, Stefan Raspl, linux-s390,
	Paolo Abeni, linux-kernel, linux-rdma, Jakub Kicinski, Tony Lu
In-Reply-To: <20260614-b4-disp-edd64be9-v3-1-551fa514257e@proton.me>

On 2026-06-14 03:23:30, Bryam Vargas via B4 Relay wrote:
>From: Bryam Vargas <hexlabsecurity@proton.me>
>
>smc_cdc_cursor_to_host() (SMC-R) and smcd_cdc_msg_to_host() (SMC-D)
>import the peer's producer cursor from the wire into the local
>connection cursor with no upper bound against the receive buffer (RMB).
>The urgent path then uses that count as a raw index:
>
>	base = conn->rmb_desc->cpu_addr + conn->rx_off;
>	conn->urg_rx_byte = *(base + conn->urg_curs.count - 1);
>
>so a peer that advertises a producer cursor past rmb_desc->len reads
>out of bounds of the RMB allocation in the receive tasklet (softirq).
>
>Bound the producer cursor count to rmb_desc->len at the conversion
>boundary, for both SMC-R and SMC-D.  Apply the bound to the producer
>cursor only: the consumer cursor indexes the peer's RMB and is bounded
>by peer_rmbe_size, so clamping it to our rmb_desc->len would
>under-credit peer_rmbe_space and stall transmit to a peer whose RMB is
>larger than ours.
>
>Fixes: de8474eb9d50 ("net/smc: urgent data support")
>Cc: stable@vger.kernel.org
>Signed-off-by: Bryam Vargas <hexlabsecurity@proton.me>
>---
> net/smc/smc_cdc.h | 27 ++++++++++++++++++++++++---
> 1 file changed, 24 insertions(+), 3 deletions(-)
>
>diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
>index 696cc11f2303..ca76ef630356 100644
>--- a/net/smc/smc_cdc.h
>+++ b/net/smc/smc_cdc.h
>@@ -221,7 +221,8 @@ static inline void smc_host_msg_to_cdc(struct smc_cdc_msg *peer,
> 
> static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> 					  union smc_cdc_cursor *peer,
>-					  struct smc_connection *conn)
>+					  struct smc_connection *conn,
>+					  int max_count)
> {
> 	union smc_host_cursor temp, old;
> 	union smc_cdc_cursor net;
>@@ -235,6 +236,15 @@ static inline void smc_cdc_cursor_to_host(union smc_host_cursor *local,
> 	if ((old.wrap == temp.wrap) &&
> 	    (old.count > temp.count))
> 		return;
>+	/* The peer producer cursor is wire-controlled and is later used as a
>+	 * raw index into our RMB by the urgent path; bound its count to the
>+	 * RMB.  max_count == 0 leaves the consumer cursor unbounded here: it
>+	 * indexes the peer's RMB (bounded by peer_rmbe_size, not our
>+	 * rmb_desc->len), so clamping it to rmb_desc->len would under-credit
>+	 * peer_rmbe_space and stall transmit to peers with a larger RMB.
>+	 */
>+	if (max_count && temp.count > max_count)
>+		temp.count = max_count;
> 	smc_curs_copy(local, &temp, conn);
> }
> 
>@@ -246,8 +256,13 @@ static inline void smcr_cdc_msg_to_host(struct smc_host_cdc_msg *local,
> 	local->len = peer->len;
> 	local->seqno = ntohs(peer->seqno);
> 	local->token = ntohl(peer->token);
>-	smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn);
>-	smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn);
>+	/* bound the wire-controlled producer cursor to our RMB (used as a raw
>+	 * index by the urgent path); leave the consumer cursor unbounded -- it
>+	 * indexes the peer's RMB and is bounded by peer_rmbe_size.
>+	 */
>+	smc_cdc_cursor_to_host(&local->prod, &peer->prod, conn,
>+			       conn->rmb_desc->len);
>+	smc_cdc_cursor_to_host(&local->cons, &peer->cons, conn, 0);
> 	local->prod_flags = peer->prod_flags;
> 	local->conn_state_flags = peer->conn_state_flags;
> }
>@@ -260,6 +275,12 @@ static inline void smcd_cdc_msg_to_host(struct smc_host_cdc_msg *local,
> 
> 	temp.wrap = peer->prod.wrap;
> 	temp.count = peer->prod.count;
>+	/* the peer producer cursor is wire-controlled and is used as a raw
>+	 * index into our RMB by the urgent path; bound it to the RMB.  The
>+	 * consumer cursor below indexes the peer's RMB and is left unbounded.
>+	 */
>+	if (temp.count > conn->rmb_desc->len)
>+		temp.count = conn->rmb_desc->len;
> 	smc_curs_copy(&local->prod, &temp, conn);
> 
> 	temp.wrap = peer->cons.wrap;

Hi Bryam,

I agree the issue is real. SMC-R's original design didn't fully
account for misbehaving peers, which is the root cause behind a
number of similar issues we've seen. The good news is that this
class of problem isn't easy to hit in practice, so it isn't
particularly urgent.

On the approach itself: once we detect that the peer is misbehaving,
I think the right action is to abort the connection and record the
event, rather than silently clamp. An invalid CDC means the whole
communication state can no longer be trusted, so continuing on a
clamped value just papers over a peer bug.

I'd suggest we add a dedicated CDC message check, and route any
failure through the existing abort path, maybe something like bellow:

    static bool smc_cdc_msg_check(struct smc_connection *conn,
                                  struct smc_cdc_msg *cdc)
    {
        u32 prod_count = ntohs(cdc->prod.count);
        u32 cons_count = ntohs(cdc->cons.count);

        if (prod_count > conn->rmb_desc->len ||
            cons_count > conn->peer_rmbe_size ||
            cdc->prod.wrap > 1 || cdc->cons.wrap > 1) {
            this_cpu_inc(net->smc.smc_stats->...cdc_inval);
            net_ratelimited_function(pr_warn,
                "smc: invalid CDC from peer (token=%u)\n",
                ntohl(cdc->token));
            return false;
        }
        return true;
    }

For -stable, your current minimal patch is fine. For net-next, though, I'd prefer
the approach above: validate at the wire boundary, abort on violation, and
make the event observable via smc_stats and a ratelimited warning.

Best regards,
Dust


^ permalink raw reply

* Re: [PATCH for-next v3 3/3] RDMA/erdma: Implement erdma_reg_user_mr_dmabuf
From: Jacob Moroni @ 2026-06-18 14:27 UTC (permalink / raw)
  To: Boshi Yu; +Cc: jgg, leon, linux-rdma, chengyou, kaishen
In-Reply-To: <20260618045120.51210-4-boshiyu@linux.alibaba.com>

Reviewed-by: Jacob Moroni <jmoroni@google.com>

^ permalink raw reply

* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-18 14:16 UTC (permalink / raw)
  To: Mahanta Jambigi, D. Wythe, Dust Li, Sidraya Jayagond,
	Wenjia Zhang, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni
  Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
	linux-s390, netdev, linux-kernel, jianhao.xu, runyu.xiao
In-Reply-To: <f7e36176-d00a-4471-94ed-d385e579b43d@linux.ibm.com>

Hi,

Thanks for taking a look.

The exact Lockdep stack I have is from the grounded reproducer, not from
a production SMC setup.  The reproducer keeps the same callback shape:
the close/flush side holds sk_callback_lock and invokes the installed
sk_data_ready callback, which re-enters smc_clcsock_data_ready() and tries
to take sk_callback_lock again.

The relevant Lockdep report is:

  WARNING: possible recursive locking detected
  kworker/u4:3/39 is trying to acquire lock:
    (sk_callback_lock) at smc_clcsock_data_ready+0xa/0x4d

  but task is already holding lock:
    (sk_callback_lock) at smc_close_flush_work+0xc/0x30

  Possible unsafe locking scenario:

        CPU0
        ----
        lock(sk_callback_lock);
        lock(sk_callback_lock);

  *** DEADLOCK ***

  Workqueue: smc_close_wq smc_close_flush_work

  Call Trace:
    dump_stack_lvl
    __lock_acquire
    lock_acquire
    _raw_read_lock_bh
    smc_clcsock_data_ready+0xa/0x4d
    smc_close_flush_work+0x1f/0x30
    process_one_work
    worker_thread
    kthread
    ret_from_fork

The nvmet change I referred to is:

  2fa8961d3a6a ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")

The stable/backport patch I originally used as the reference is:

  1c90f930e7b4 ("nvmet-tcp: fixup hang in nvmet_tcp_listen_data_ready()")

Its commit message says that when the socket is closed while in
TCP_LISTEN, the flush callback can call nvmet_tcp_listen_data_ready()
with sk_callback_lock already held, so nvmet moved the TCP_LISTEN check
before taking sk_callback_lock.

For the TCP_LISTEN check: my reasoning was that smc_clcsock_data_ready()
is installed by smc_listen() on the underlying TCP listen socket and only
queues smc_tcp_listen_work() for the SMC listen/accept path.  Once that
underlying socket is no longer in TCP_LISTEN, there should be no SMC
listen accept work to queue from this callback.  TCP_SYN_RECV and
TCP_ESTABLISHED are not listen-socket states for this callback path, so I
did not intend the callback to queue listen work for those states.

That said, if SMC expects smc_clcsock_data_ready() to handle a non-LISTEN
state during fallback or another transition, then the proposed check is
too strict and I should rework the fix.

Thanks,
Runyu

^ permalink raw reply

* Re: [PATCH net v2] net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete
From: Simon Horman @ 2026-06-18 13:18 UTC (permalink / raw)
  To: doruk
  Cc: Simon Horman, saeedm, leon, tariqt, mbloch, sd, andrew+netdev,
	davem, edumazet, kuba, pabeni, borisp, raeds, ehakim, netdev,
	linux-rdma, linux-kernel, stable
In-Reply-To: <20260617100558.83654-1-doruk@0sec.ai>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev
---
net/mlx5e: macsec: fix use-after-free of metadata_dst on RX SC delete

This commit fixes a use-after-free in the MACsec RX datapath by replacing
a synchronous metadata_dst_free() with dst_release() when an RX SC is
deleted. It also updates the datapath to use dst_hold_safe() to avoid
acquiring a reference to a dying destination entry.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/macsec.c
> @@ -829,7 +829,7 @@ static void macsec_del_rxsc_ctx(struct mlx5e_macsec *macsec, struct mlx5e_macsec
>  	 */
>  	list_del_rcu(&rx_sc->rx_sc_list_element);
>  	xa_erase(&macsec->sc_xarray, rx_sc->sc_xarray_element->fs_id);
> -	metadata_dst_free(rx_sc->md_dst);
> +	dst_release(&rx_sc->md_dst->dst);
>  	kfree(rx_sc->sc_xarray_element);

[Severity: Critical]
This is a pre-existing issue, but is it safe to synchronously call kfree()
on rx_sc->sc_xarray_element here?

The RX datapath accesses the xarray under rcu_read_lock() in
mlx5e_macsec_offload_handle_rx_skb(). If sc_xarray_element is fetched via
xa_load() by a concurrent reader just before xa_erase() runs, won't this
kfree() free the element while the reader is still accessing it, leading to
a use-after-free?

>  	kfree_rcu_mightsleep(rx_sc);
>  }

[ ... ]

> @@ -1697,8 +1697,8 @@ void mlx5e_macsec_offload_handle_rx_skb(struct net_device *netdev,
>  	sc_xarray_element = xa_load(&macsec->sc_xarray, fs_id);
>  	rx_sc = sc_xarray_element->rx_sc;

[Severity: Critical]
This isn't a bug introduced by this patch, but does xa_load() return NULL
if the fs_id has already been erased by xa_erase() in the deletion path?

If packets matched by hardware before the rule was deleted are still
lingering in the software RX queues, xa_load() might return NULL.
Dereferencing sc_xarray_element->rx_sc immediately after could cause a
NULL pointer dereference. Could we add a NULL check here before accessing
sc_xarray_element?

[Severity: Critical]
This is also a pre-existing issue, but is rx_sc fully initialized when it
becomes visible to RCU readers?

In mlx5e_macsec_add_rxsc(), it appears sc_xarray_element is published to
the xarray via xa_alloc() before rx_sc->md_dst is allocated and
initialized:

mlx5e_macsec_add_rxsc() {
    ...
    err = xa_alloc(&macsec->sc_xarray, &sc_xarray_element->fs_id, ...);
    ...
    rx_sc->md_dst = metadata_dst_alloc(0, METADATA_MACSEC, GFP_KERNEL);
    ...
}

Since xa_alloc() assigns the lowest available index, the new fs_id could be
recycled from a recently deleted SC. If a lingering packet with the old
fs_id is processed by the datapath, it could find the newly published
element and attempt to call dst_hold_safe(&rx_sc->md_dst->dst).

Because rx_sc->md_dst is still NULL at that moment, could this cause a NULL
pointer dereference?

>  	if (rx_sc) {
> -		dst_hold(&rx_sc->md_dst->dst);
> -		skb_dst_set(skb, &rx_sc->md_dst->dst);
> +		if (dst_hold_safe(&rx_sc->md_dst->dst))
> +			skb_dst_set(skb, &rx_sc->md_dst->dst);
>  	}
>  
>  	rcu_read_unlock();

^ permalink raw reply

* Re: [PATCH net 3/3] net/mlx5e: TC, skip peer flow cleanup when LAG seq is unavailable
From: Simon Horman @ 2026-06-18 12:48 UTC (permalink / raw)
  To: tariqt
  Cc: Simon Horman, edumazet, kuba, pabeni, andrew+netdev, davem,
	saeedm, mbloch, leon, shayd, ohartoov, edwards, msanalla, phaddad,
	parav, gbayer, kees, moshe, rongweil, jacob.e.keller, netdev,
	linux-rdma, linux-kernel, gal
In-Reply-To: <20260617063204.547427-4-tariqt@nvidia.com>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://sashiko.dev

Simon says: I think the first two issues flagged below warrant
            consideration within the context of this patch.

	    But the third and final issue flagged below as pre-existing
	    seems to be something to consider in the context of
	    possible follow-up.
---
net/mlx5e: TC, skip peer flow cleanup when LAG seq is unavailable

This commit addresses a bad memory access crash by skipping peer flows when
mlx5_lag_get_dev_seq() returns an error. This occurs when the peer isn't in
the LAG or when no device is marked as master.

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index a9001d1c902ff..c6e6534a5e23b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -2155,6 +2155,9 @@ static void mlx5e_tc_del_fdb_peers_flow(struct mlx5e_tc_flow *flow)
>  	devcom = flow->priv->mdev->priv.eswitch->devcom;
>  	mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
>  		i = mlx5_lag_get_dev_seq(peer_esw->dev);
> +		if (i < 0)
> +			continue;
> +

[Severity: Critical]
Does skipping the remainder of this loop prevent the flow from being
unlinked from the doubly-linked list?

By using continue here in mlx5e_tc_del_fdb_peers_flow(), we skip calling
mlx5e_tc_del_fdb_peer_flow(), which means flow->peer[i] is never
unlinked from the esw->offloads.peer_flows[original_i] list.

If the flow is subsequently freed via mlx5e_flow_put(), will this leave a
freed list node resident in the active list and guarantee a use-after-free
list corruption during the next list traversal?

>  		mlx5e_tc_del_fdb_peer_flow(flow, i);
>  	}
>  }

[ ... ]

> @@ -5526,6 +5529,9 @@ void mlx5e_tc_clean_fdb_peer_flows(struct mlx5_eswitch *esw)
>  
>  	mlx5_devcom_for_each_peer_entry(devcom, peer_esw, pos) {
>  		i = mlx5_lag_get_dev_seq(peer_esw->dev);
> +		if (i < 0)
> +			continue;
> +

[Severity: Critical]
Similarly to the above, does this skip unlinking during flow cleanup when
the LAG sequence is unavailable?

>  		list_for_each_entry_safe(flow, tmp, &esw->offloads.peer_flows[i], peer[i])
>  			mlx5e_tc_del_fdb_peers_flow(flow);
>  	}

[Severity: Critical]
This isn't a bug introduced by this patch, but is there a similar missing
bounds check during flow addition in mlx5e_tc_add_fdb_peer_flow() in this
same file?

If a flow is added concurrently with LAG teardown, mlx5_lag_get_dev_seq()
can return -ENOENT (-2):

drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:mlx5e_tc_add_fdb_peer_flow() {
    ...
    int i = mlx5_lag_get_dev_seq(peer_esw->dev);
    ...
    list_add_tail(&flow->peer[i], &esw->offloads.peer_flows[i]);
    ...
}

Does this result in an out-of-bounds write to a negative array index?

^ permalink raw reply

* Re: [PATCH net] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Mahanta Jambigi @ 2026-06-18  6:24 UTC (permalink / raw)
  To: Runyu Xiao, D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Tony Lu, Wen Gu, Simon Horman, Karsten Graul, linux-rdma,
	linux-s390, netdev, linux-kernel, jianhao.xu, stable
In-Reply-To: <20260617152855.1039151-1-runyu.xiao@seu.edu.cn>



On 17/06/26 8:58 pm, Runyu Xiao wrote:
> smc_listen() installs smc_clcsock_data_ready() as the underlying TCP
> listen socket's sk_data_ready callback.  smc_clcsock_data_ready() then
> immediately takes sk_callback_lock before looking up the SMC listener and
> queuing smc_tcp_listen_work().
> 
> That is unsafe once the TCP listen socket is leaving TCP_LISTEN.  The TCP
> close/flush path can run the installed sk_data_ready callback with
> sk_callback_lock already held, so entering smc_clcsock_data_ready() again
> tries to take the same rwlock recursively in the same thread.  The nvmet

Could you provide me the exact call stack showing recursive lock? Also
help me with the nvmet commit details.

> TCP listener had to make the same state check before taking
> sk_callback_lock for this reason.
> 
> This issue was found by our static analysis tool and then manually
> reviewed against the current tree.
> 
> The grounded PoC kept the SMC listen callback installation path:
> 
>   smc_listen()
>   smc_clcsock_replace_cb()
>   sk_data_ready = smc_clcsock_data_ready()
> 
> It then modeled the close/flush carrier that invokes the installed
> sk_data_ready callback while sk_callback_lock is already held.  Lockdep
> reported the same-thread recursive acquisition:
> 
>   WARNING: possible recursive locking detected
>   smc_clcsock_data_ready+0xa/0x4d [vuln_msv]
>   smc_close_flush_work+0x1f/0x30 [vuln_msv]
>   *** DEADLOCK ***
> 
> Return before taking sk_callback_lock when the underlying TCP socket is no
> longer in TCP_LISTEN.  In that state there is no listen accept work to
> queue for SMC, and avoiding the callback lock mirrors the fix used by the
> TCP nvmet listener.
> 
> Fixes: 0558226cebee ("net/smc: Fix slab-out-of-bounds issue in fallback")
> Cc: stable@vger.kernel.org
> Signed-off-by: Runyu Xiao <runyu.xiao@seu.edu.cn>
> ---
>  net/smc/af_smc.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 6421c2e1c84d..1af4e3c333ff 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -2631,6 +2631,9 @@ static void smc_clcsock_data_ready(struct sock *listen_clcsock)
>  {
>  	struct smc_sock *lsmc;
>  
> +	if (READ_ONCE(listen_clcsock->sk_state) != TCP_LISTEN)

Is *TCP_LISTEN* check sufficient? What about *TCP_SYN_RECV* or
*TCP_ESTABLISHED*?

> +		return;
> +
>  	read_lock_bh(&listen_clcsock->sk_callback_lock);
>  	lsmc = smc_clcsock_user_data(listen_clcsock);
>  	if (!lsmc)


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox