* [PATCH] RDMA/bng_re: return a timeout when firmware responses stall
From: Pengpeng Hou @ 2026-06-25 0:36 UTC (permalink / raw)
To: Siva Reddy Kallam
Cc: pengpeng, Jason Gunthorpe, Leon Romanovsky, linux-rdma,
linux-kernel
__wait_for_resp() documents that it returns a non-zero error when a
firmware command does not complete, and bng_re_rcfw_send_message() already
marks the firmware as stalled when the helper returns -ENODEV.
However, the helper ignores wait_event_timeout() expiry. If the response
slot remains in use after the timeout and after the polled CREQ service
attempt, the loop starts another full timeout period and can repeat
forever.
Return -ENODEV after a timed out wait that still has no response. The
existing caller then marks FIRMWARE_STALL_DETECTED and returns
-ETIMEDOUT to the command issuer.
Signed-off-by: Pengpeng Hou <pengpeng@iscas.ac.cn>
---
drivers/infiniband/hw/bng_re/bng_fw.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/drivers/infiniband/hw/bng_re/bng_fw.c b/drivers/infiniband/hw/bng_re/bng_fw.c
index 50156c300..ab6a2d2e9 100644
--- a/drivers/infiniband/hw/bng_re/bng_fw.c
+++ b/drivers/infiniband/hw/bng_re/bng_fw.c
@@ -401,14 +401,15 @@ static int __wait_for_resp(struct bng_re_rcfw *rcfw, u16 cookie)
{
struct bng_re_cmdq_ctx *cmdq;
struct bng_re_crsqe *crsqe;
+ unsigned long time_left;
cmdq = &rcfw->cmdq;
crsqe = &rcfw->crsqe_tbl[cookie];
do {
- wait_event_timeout(cmdq->waitq,
- !crsqe->is_in_used,
- secs_to_jiffies(rcfw->max_timeout));
+ time_left = wait_event_timeout(cmdq->waitq,
+ !crsqe->is_in_used,
+ secs_to_jiffies(rcfw->max_timeout));
if (!crsqe->is_in_used)
return 0;
@@ -417,6 +418,9 @@ static int __wait_for_resp(struct bng_re_rcfw *rcfw, u16 cookie)
if (!crsqe->is_in_used)
return 0;
+
+ if (!time_left)
+ return -ENODEV;
} while (true);
};
--
2.50.1 (Apple Git-155)
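The control flow of the patched wait loop can be sketched in userspace as follows. This is a minimal illustration, not the driver code: struct resp_slot, sim_wait_timeout() and the will_reply knob are hypothetical stand-ins, and the only property borrowed from the kernel is that wait_event_timeout() returns 0 when the timeout expires with the condition still false.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a command response slot. */
struct resp_slot {
	bool in_use;     /* true while the command is outstanding */
	bool will_reply; /* simulation knob: does firmware ever answer? */
};

/*
 * Simulated timed wait. Like wait_event_timeout(), it returns 0 when the
 * timeout expired with the condition still false, and non-zero otherwise.
 */
static unsigned long sim_wait_timeout(struct resp_slot *slot)
{
	if (slot->will_reply)
		slot->in_use = false;
	return slot->in_use ? 0 : 1;
}

/* Bounded wait: one expired timeout with no response ends the loop. */
static int wait_for_resp(struct resp_slot *slot)
{
	unsigned long time_left;

	do {
		time_left = sim_wait_timeout(slot);
		if (!slot->in_use)
			return 0;

		/* The driver polls the CREQ here and rechecks the slot. */
		if (!slot->in_use)
			return 0;

		/* Timed out and still no response: report it to the caller. */
		if (!time_left)
			return -ENODEV;
	} while (true);
}
```

Calling wait_for_resp() on a slot whose will_reply is false returns -ENODEV after a single simulated timeout instead of looping again, which is the behavior the commit message describes.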
* RE: [EXTERNAL] Re: [PATCH net] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:50 UTC (permalink / raw)
To: Simon Horman
Cc: KY Srinivasan, Haiyang Zhang, wei.liu@kernel.org, Long Li,
andrew+netdev@lunn.ch, davem@davemloft.net, edumazet@google.com,
kuba@kernel.org, pabeni@redhat.com, Konstantin Taranov,
ernis@linux.microsoft.com, dipayanroy@linux.microsoft.com,
kees@kernel.org, jacob.e.keller@intel.com,
ssengar@linux.microsoft.com, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, stable@vger.kernel.org
In-Reply-To: <20260619090514.GT827683@horms.kernel.org>
> From: Simon Horman <horms@kernel.org>
> Sent: Friday, June 19, 2026 2:05 AM
> > ...
> > Also validate the packet length reported in the RX CQE before using it as
> > a DMA sync length or passing it to skb processing. The CQE is supplied
> > by the device and should not be blindly trusted by Confidential VMs.
>
> I think this last part warrants being split out into a separate patch.
Sorry for the late reply. I split v1 into two patches in v2, which I just posted:
https://lwn.net/ml/linux-kernel/20260624222605.1794719-1-decui@microsoft.com/
Thanks,
Dexuan
* [PATCH net v2 1/2] net: mana: Sync page pool RX frags for CPU
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>
MANA allocates RX buffers from page pool fragments when frag_count is
greater than 1. In that case the buffers remain DMA mapped by page pool
and the RX completion path does not call dma_unmap_single(). As a result,
the implicit sync-for-CPU normally performed by dma_unmap_single() is
missing before the packet data is passed to the networking stack.
This breaks RX on configurations which require explicit DMA syncing, for
example when booted with swiotlb=force.
Fix this by recording the page pool page and DMA sync offset when the RX
buffer is allocated, and syncing the received packet range for CPU access
before handing the RX buffer to the stack.
Fixes: 730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead of full pages to improve memory efficiency.")
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1:
v1 is split into two patches in v2.
Add Haiyang's Reviewed-by.
drivers/net/ethernet/microsoft/mana/mana_en.c | 39 +++++++++++++++----
include/net/mana/mana.h | 8 ++++
2 files changed, 40 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index c9b1df1ed109..1875bffd82b7 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2044,12 +2044,16 @@ static void mana_rx_skb(void *buf_va, bool from_pool,
}
static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
- dma_addr_t *da, bool *from_pool)
+ dma_addr_t *da, bool *from_pool,
+ struct page **pp_page, u32 *dma_sync_offset)
{
struct page *page;
u32 offset;
void *va;
+
*from_pool = false;
+ *pp_page = NULL;
+ *dma_sync_offset = 0;
/* Don't use fragments for jumbo frames or XDP where it's 1 fragment
* per page.
@@ -2087,31 +2091,47 @@ static void *mana_get_rxfrag(struct mana_rxq *rxq, struct device *dev,
va = page_to_virt(page) + offset;
*da = page_pool_get_dma_addr(page) + offset + rxq->headroom;
*from_pool = true;
+ *pp_page = page;
+ *dma_sync_offset = offset + rxq->headroom;
return va;
}
/* Allocate frag for rx buffer, and save the old buf */
static void mana_refill_rx_oob(struct device *dev, struct mana_rxq *rxq,
- struct mana_recv_buf_oob *rxoob, void **old_buf,
- bool *old_fp)
+ struct mana_recv_buf_oob *rxoob, u32 pktlen,
+ void **old_buf, bool *old_fp)
{
+ struct page *pp_page;
+ u32 dma_sync_offset;
bool from_pool;
dma_addr_t da;
void *va;
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return;
- if (!rxoob->from_pool || rxq->frag_count == 1)
+ if (!rxoob->from_pool || rxq->frag_count == 1) {
dma_unmap_single(dev, rxoob->sgl[0].address, rxq->datasize,
DMA_FROM_DEVICE);
+ } else {
+ /* The page pool maps the whole page and only syncs for device
+ * automatically (PP_FLAG_DMA_SYNC_DEV). Sync the received bytes
+ * for the CPU before they are read: this is required if DMA
+ * is incoherent or bounce buffers are used.
+ */
+ page_pool_dma_sync_for_cpu(rxq->page_pool, rxoob->pp_page,
+ rxoob->dma_sync_offset, pktlen);
+ }
*old_buf = rxoob->buf_va;
*old_fp = rxoob->from_pool;
rxoob->buf_va = va;
rxoob->sgl[0].address = da;
rxoob->from_pool = from_pool;
+ rxoob->pp_page = pp_page;
+ rxoob->dma_sync_offset = dma_sync_offset;
}
static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
@@ -2170,7 +2190,7 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, &old_buf, &old_fp);
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
/* Unsuccessful refill will have old_buf == NULL.
* In this case, mana_rx_skb() will drop the packet.
@@ -2566,6 +2586,8 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
struct mana_rxq *rxq, struct device *dev)
{
struct mana_port_context *mpc = netdev_priv(rxq->ndev);
+ struct page *pp_page = NULL;
+ u32 dma_sync_offset = 0;
bool from_pool = false;
dma_addr_t da;
void *va;
@@ -2573,13 +2595,16 @@ static int mana_fill_rx_oob(struct mana_recv_buf_oob *rx_oob, u32 mem_key,
if (mpc->rxbufs_pre)
va = mana_get_rxbuf_pre(rxq, &da);
else
- va = mana_get_rxfrag(rxq, dev, &da, &from_pool);
+ va = mana_get_rxfrag(rxq, dev, &da, &from_pool, &pp_page,
+ &dma_sync_offset);
if (!va)
return -ENOMEM;
rx_oob->buf_va = va;
rx_oob->from_pool = from_pool;
+ rx_oob->pp_page = pp_page;
+ rx_oob->dma_sync_offset = dma_sync_offset;
rx_oob->sgl[0].address = da;
rx_oob->sgl[0].size = rxq->datasize;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 8f721cd4e4a7..4111b93169d2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -305,6 +305,14 @@ struct mana_recv_buf_oob {
void *buf_va;
bool from_pool; /* allocated from a page pool */
+ /* head page of the page_pool fragment; valid only when
+ * from_pool && frag_count > 1.
+ */
+ struct page *pp_page;
+ /* Fragment offset plus rxq->headroom, passed to
+ * page_pool_dma_sync_for_cpu().
+ */
+ u32 dma_sync_offset;
/* SGL of the buffer going to be sent as part of the work request. */
u32 num_sge;
--
2.34.1
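Why the CPU-side sync matters with bounce buffering can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct mapped_page, sync_range_for_cpu() and rx_sync_offset() are hypothetical names, and the "bounce" array merely models the copy that a real sync-for-CPU would perform when swiotlb is in use.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 4096u

/* Hypothetical model of one page mapped through a bounce buffer. */
struct mapped_page {
	uint8_t cpu[PAGE_SZ];    /* memory the CPU reads */
	uint8_t bounce[PAGE_SZ]; /* memory the device writes */
};

/*
 * Model of a sync-for-CPU over a sub-range: copy only the bytes the device
 * reported, from the bounce copy into the CPU-visible page.
 * Returns 0 on success, -1 if the range does not fit in the page.
 */
static int sync_range_for_cpu(struct mapped_page *p, uint32_t offset,
			      uint32_t len)
{
	if (offset > PAGE_SZ || len > PAGE_SZ - offset)
		return -1;
	memcpy(p->cpu + offset, p->bounce + offset, len);
	return 0;
}

/* Sync offset as recorded by the patch: fragment offset plus headroom. */
static uint32_t rx_sync_offset(uint32_t frag_offset, uint32_t headroom)
{
	return frag_offset + headroom;
}
```

In this model, bytes written into bounce stay invisible in cpu until sync_range_for_cpu() is called with the recorded offset and the packet length, which mirrors the missing step the patch adds for page pool fragments.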
* [PATCH net v2 2/2] net: mana: Validate the packet length reported by the NIC
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
Cc: stable
In-Reply-To: <20260624222605.1794719-1-decui@microsoft.com>
Validate the packet length reported in the RX CQE before using it as a DMA
sync length or passing it to skb processing. The CQE is supplied by the
NIC device and should not be blindly trusted.
Cc: stable@vger.kernel.org
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1:
v1 is split into two patches in v2.
Add Haiyang's Reviewed-by.
drivers/net/ethernet/microsoft/mana/mana_en.c | 24 +++++++++++++++----
1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 1875bffd82b7..0b44c51ae6ec 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2190,12 +2190,26 @@ static void mana_process_rx_cqe(struct mana_rxq *rxq, struct mana_cq *cq,
rxbuf_oob = &rxq->rx_oobs[curr];
WARN_ON_ONCE(rxbuf_oob->wqe_inf.wqe_size_in_bu != 1);
- mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen, &old_buf, &old_fp);
+ if (unlikely(pktlen > rxq->datasize)) {
+ /* Increase it even if mana_rx_skb() isn't called. */
+ rxq->rx_cq.work_done++;
- /* Unsuccessful refill will have old_buf == NULL.
- * In this case, mana_rx_skb() will drop the packet.
- */
- mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ ++ndev->stats.rx_dropped;
+ netdev_warn_once(ndev,
+ "Dropped oversized RX packet: len=%u, datasize=%u\n",
+ pktlen, rxq->datasize);
+
+ /* Reuse the RX buffer since rxbuf_oob is unchanged. */
+ } else {
+
+ mana_refill_rx_oob(dev, rxq, rxbuf_oob, pktlen,
+ &old_buf, &old_fp);
+
+ /* Unsuccessful refill will have old_buf == NULL.
+ * In this case, mana_rx_skb() will drop the packet.
+ */
+ mana_rx_skb(old_buf, old_fp, oob, rxq, i);
+ }
mana_move_wq_tail(rxq->gdma_rq,
rxbuf_oob->wqe_inf.wqe_size_in_bu);
--
2.34.1
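The length check in this patch follows a common pattern: compare a device-supplied length against the buffer size before using it anywhere. A minimal userspace sketch, assuming a hypothetical struct rx_stats and rx_accept() helper that are not part of the driver:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, loosely modelled on netdev stats. */
struct rx_stats {
	uint64_t rx_dropped;
	uint64_t delivered;
};

/*
 * Accept a packet only if the reported length fits the RX buffer.
 * An oversized length is counted as a drop and the buffer is left for reuse.
 */
static bool rx_accept(uint32_t pktlen, uint32_t datasize,
		      struct rx_stats *stats)
{
	if (pktlen > datasize) {
		stats->rx_dropped++;
		return false;
	}
	stats->delivered++;
	return true;
}
```

A length equal to datasize is still accepted, matching the strict "greater than" comparison in the hunk above.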
* [PATCH net v2 0/2] Fix MANA RX with bounce buffering
From: Dexuan Cui @ 2026-06-24 22:26 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, kotaranov, horms, ernis, dipayanroy, kees,
jacob.e.keller, ssengar, linux-hyperv, netdev, linux-kernel,
linux-rdma
With swiotlb=force, the MANA NIC fails to work properly due to commit
730ff06d3f5c ("net: mana: Use page pool fragments for RX buffers instead
of full pages to improve memory efficiency.")
Dipayaan tried to fix this by avoiding page pool frags when bounce
buffering is in use [1][2]. However, that is not a clean solution: no
other NIC drivers need to explicitly check whether bounce buffering is
in use. It is also not good for throughput, since
dma_map_single()/dma_unmap_single() are then called for each incoming
packet.
In fact, page pool frags can still be used with the standard MTU of
1500: all we need is to add page_pool_dma_sync_for_cpu() before the CPU
reads the incoming packet, so I implemented that in v1 [3].
As Simon suggested [4], this version splits v1 into two patches:
Patch 1 adds page_pool_dma_sync_for_cpu().
Patch 2 validates the packet length reported by the NIC.
There is no functional difference between v1 and v2, so I am keeping
Haiyang's Reviewed-by tag in v2.
Please review. Thanks!
Note that, with jumbo MTU and XDP, page pool frags are not used, and
dma_map_single()/dma_unmap_single() are still called for each incoming
packet, causing poor throughput with swiotlb=force; see
mana_get_rxbuf_cfg() and mana_refill_rx_oob() -> mana_get_rxfrag().
The jumbo MTU/XDP issue will be addressed later since that needs more
consideration if we want to use page pool with PP_FLAG_DMA_MAP there:
e.g., for XDP, the received packet can be transmitted in place, i.e. the
same RX buffer can be used as a TX buffer:
mana_rx_skb() -> mana_xdp_tx() -> mana_start_xmit() -> mana_map_skb().
In mana_create_page_pool(), we may have to set pprm.dma_dir to
DMA_BIDIRECTIONAL if XDP is in use:
pprm.dma_dir = mana_xdp_get(mpc) ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
In the case of XDP, the next issue is that mana_rx_skb() -> ... ->
mana_map_skb() appears to call dma_map_single() on an RX buffer allocated
from a page pool created with PP_FLAG_DMA_MAP, which seems incorrect.
Any thoughts?
[1] https://lore.kernel.org/all/ae91hyrLf4n23XE6@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/#r
[2] https://lore.kernel.org/all/ae9pxvJfkAZYfKMf@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net/
[3] https://lore.kernel.org/all/20260618035029.249361-1-decui@microsoft.com/
[4] https://lore.kernel.org/all/20260619090514.GT827683@horms.kernel.org/
Dexuan Cui (2):
net: mana: Sync page pool RX frags for CPU
net: mana: Validate the packet length reported by the NIC
drivers/net/ethernet/microsoft/mana/mana_en.c | 61 +++++++++++++++----
include/net/mana/mana.h | 8 +++
2 files changed, 58 insertions(+), 11 deletions(-)
--
2.34.1
* [PATCH net] net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
From: Jakub Kicinski @ 2026-06-24 19:04 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
Breno Leitao, joshwash, hramamurthy, anthony.l.nguyen,
przemyslaw.kitszel, saeedm, tariqt, mbloch, leon, alexanderduyck,
kernel-team, kys, haiyangz, wei.liu, decui, longli, jordanrhee,
jacob.e.keller, nktgrg, debarghyak, mohsin.bashr, ernis, sdf, gal,
linux-rdma, linux-hyperv
Breno reports the following splats on mlx5:
RTNL: assertion failed at net/core/dev.c (2241)
WARNING: net/core/dev.c:2241 at netif_state_change+0xed/0x130, CPU#5: ethtool/1335
RIP: 0010:netif_state_change+0xf9/0x130
Call Trace:
<TASK>
__linkwatch_sync_dev+0xea/0x120
ethtool_op_get_link+0xe/0x20
__ethtool_get_link+0x26/0x40
linkstate_prepare_data+0x51/0x200
ethnl_default_doit+0x213/0x470
genl_family_rcv_msg_doit+0xdd/0x110
Looks like I missed ethtool_op_get_link() trying to sync linkwatch,
which needs rtnl_lock. Not all drivers do this - bnxt doesn't,
it just returns the link state, so add an opt-in bit.
Reported-by: Breno Leitao <leitao@debian.org>
Fixes: 45079e00133e ("net: ethtool: optionally skip rtnl_lock on Netlink path for GET ops")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: joshwash@google.com
CC: hramamurthy@google.com
CC: anthony.l.nguyen@intel.com
CC: przemyslaw.kitszel@intel.com
CC: saeedm@nvidia.com
CC: tariqt@nvidia.com
CC: mbloch@nvidia.com
CC: leon@kernel.org
CC: alexanderduyck@fb.com
CC: kernel-team@meta.com
CC: kys@microsoft.com
CC: haiyangz@microsoft.com
CC: wei.liu@kernel.org
CC: decui@microsoft.com
CC: longli@microsoft.com
CC: jordanrhee@google.com
CC: jacob.e.keller@intel.com
CC: nktgrg@google.com
CC: debarghyak@google.com
CC: leitao@debian.org
CC: mohsin.bashr@gmail.com
CC: ernis@linux.microsoft.com
CC: sdf@fomichev.me
CC: gal@nvidia.com
CC: linux-rdma@vger.kernel.org
CC: linux-hyperv@vger.kernel.org
---
include/linux/ethtool.h | 2 ++
net/ethtool/common.h | 4 ++++
drivers/net/ethernet/google/gve/gve_ethtool.c | 3 ++-
drivers/net/ethernet/intel/iavf/iavf_ethtool.c | 1 +
drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/en_rep.c | 3 ++-
drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c | 4 +++-
drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c | 3 ++-
drivers/net/ethernet/microsoft/mana/mana_ethtool.c | 3 ++-
9 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index 1b834e2a522e..5d491a98265e 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -942,6 +942,7 @@ struct kernel_ethtool_ts_info {
#define ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM BIT(5)
#define ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM BIT(6)
#define ETHTOOL_OP_NEEDS_RTNL_RSS BIT(7)
+#define ETHTOOL_OP_NEEDS_RTNL_GLINK BIT(8)
/**
* struct ethtool_ops - optional netdev operations
@@ -978,6 +979,7 @@ struct kernel_ethtool_ts_info {
* - phylink helpers (note that phydev is currently unsupported!)
* - netdev_update_features()
* - netif_set_real_num_tx_queues()
+ * - ethtool_op_get_link() (syncs link watch under rtnl_lock)
*
* @get_drvinfo: Report driver/device information. Modern drivers no
* longer have to implement this callback. Most fields are
diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index 2b3847f00801..4e5356e26f40 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -113,6 +113,8 @@ ethtool_nl_msg_needs_rtnl(const struct net_device *dev, u8 cmd)
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM;
case ETHTOOL_MSG_RSS_SET:
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+ case ETHTOOL_MSG_LINKSTATE_GET:
+ return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
case ETHTOOL_MSG_TSCONFIG_GET:
case ETHTOOL_MSG_TSCONFIG_SET:
/* tsconfig calls ndos (ndo_hwtstamp_set/get), not ethtool ops.
@@ -159,6 +161,8 @@ ethtool_ioctl_needs_rtnl(const struct net_device *dev, u32 ethcmd)
case ETHTOOL_SRXFH:
case ETHTOOL_SRXFHINDIR:
return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_RSS;
+ case ETHTOOL_GLINK:
+ return ops->op_needs_rtnl & ETHTOOL_OP_NEEDS_RTNL_GLINK;
}
return false;
}
diff --git a/drivers/net/ethernet/google/gve/gve_ethtool.c b/drivers/net/ethernet/google/gve/gve_ethtool.c
index 7cc22916852f..8199738ba979 100644
--- a/drivers/net/ethernet/google/gve/gve_ethtool.c
+++ b/drivers/net/ethernet/google/gve/gve_ethtool.c
@@ -984,7 +984,8 @@ const struct ethtool_ops gve_ethtool_ops = {
.supported_ring_params = ETHTOOL_RING_USE_TCP_DATA_SPLIT |
ETHTOOL_RING_USE_RX_BUF_LEN,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = gve_get_drvinfo,
.get_strings = gve_get_strings,
.get_sset_count = gve_get_sset_count,
diff --git a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
index a615d599b88e..e7cf12eaa268 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_ethtool.c
@@ -1855,6 +1855,7 @@ static const struct ethtool_ops iavf_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.supported_input_xfrm = RXH_XFRM_SYM_XOR,
+ .op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = iavf_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ringparam = iavf_get_ringparam,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index 2f5b626ba33f..112926d07634 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -2721,7 +2721,8 @@ const struct ethtool_ops mlx5e_ethtool_ops = {
.rxfh_max_num_contexts = MLX5E_MAX_NUM_RSS,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
- ETHTOOL_OP_NEEDS_RTNL_SPFLAGS,
+ ETHTOOL_OP_NEEDS_RTNL_SPFLAGS |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE |
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
index 1a8a19f980d3..c8b76d301c92 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rep.c
@@ -419,7 +419,8 @@ static const struct ethtool_ops mlx5e_rep_ethtool_ops = {
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5e_rep_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = mlx5e_rep_get_strings,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
index 9b3b32408c64..01ddc3def9ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ethtool.c
@@ -286,7 +286,8 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
ETHTOOL_COALESCE_MAX_FRAMES |
ETHTOOL_COALESCE_USE_ADAPTIVE,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5i_get_drvinfo,
.get_strings = mlx5i_get_strings,
.get_sset_count = mlx5i_get_sset_count,
@@ -309,6 +310,7 @@ const struct ethtool_ops mlx5i_ethtool_ops = {
};
const struct ethtool_ops mlx5i_pkey_ethtool_ops = {
+ .op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = mlx5i_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_ts_info = mlx5i_get_ts_info,
diff --git a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
index cb34fc166ef9..0e47088ec44b 100644
--- a/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
+++ b/drivers/net/ethernet/meta/fbnic/fbnic_ethtool.c
@@ -2024,7 +2024,8 @@ static const struct ethtool_ops fbnic_ethtool_ops = {
ETHTOOL_OP_NEEDS_RTNL_GPAUSEPARAM |
ETHTOOL_OP_NEEDS_RTNL_SPAUSEPARAM |
ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_drvinfo = fbnic_get_drvinfo,
.get_regs_len = fbnic_get_regs_len,
.get_regs = fbnic_get_regs,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 94e658d07a27..881df597d7f9 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -597,7 +597,8 @@ static int mana_get_link_ksettings(struct net_device *ndev,
const struct ethtool_ops mana_ethtool_ops = {
.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
.op_needs_rtnl = ETHTOOL_OP_NEEDS_RTNL_SCHANNELS |
- ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM,
+ ETHTOOL_OP_NEEDS_RTNL_SRINGPARAM |
+ ETHTOOL_OP_NEEDS_RTNL_GLINK,
.get_ethtool_stats = mana_get_ethtool_stats,
.get_sset_count = mana_get_sset_count,
.get_strings = mana_get_strings,
--
2.54.0
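The opt-in mechanism used above is a per-driver bitmask consulted per command. A minimal userspace sketch follows; the macro, enum and struct names are hypothetical simplifications, and only the bit positions (7 and 8) are taken from the ethtool.h hunk.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Bit positions taken from the ethtool.h hunk; names shortened. */
#define OP_NEEDS_RTNL_RSS   BIT(7)
#define OP_NEEDS_RTNL_GLINK BIT(8)

/* Hypothetical, reduced command set. */
enum cmd { CMD_RSS_SET, CMD_LINKSTATE_GET, CMD_OTHER };

/* Hypothetical, reduced ops structure holding only the opt-in mask. */
struct ops {
	uint32_t op_needs_rtnl;
};

/* Decide whether the core must take rtnl_lock before calling the driver. */
static bool cmd_needs_rtnl(const struct ops *ops, enum cmd cmd)
{
	switch (cmd) {
	case CMD_RSS_SET:
		return (ops->op_needs_rtnl & OP_NEEDS_RTNL_RSS) != 0;
	case CMD_LINKSTATE_GET:
		return (ops->op_needs_rtnl & OP_NEEDS_RTNL_GLINK) != 0;
	default:
		return false;
	}
}
```

A driver that sets the GLINK bit keeps the lock for link-state queries, while one that leaves it clear (as the commit message says bnxt can) is called without it.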
* Re: [PATCH rdma v1 1/2] RDMA/zrdma: Add basic framework for ZTE Dinghai Ethernet Protocol Driver for RDMA
From: Julian Braha @ 2026-06-24 17:41 UTC (permalink / raw)
To: zhang.yanze, jgg, leon
Cc: linux-kernel, linux-rdma, wei.quan, han.junyang, ran.ming,
han.chengfei
In-Reply-To: <20260624164852120pLCX6txujHU8n4GMakGbe@zte.com.cn>
Hi Yanze,
On 6/24/26 09:48, zhang.yanze@zte.com.cn wrote:
> +++ b/drivers/infiniband/hw/zrdma/Kconfig
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config INFINIBAND_ZRDMA
> + tristate "ZTE Ethernet Protocol Driver for RDMA"
> + depends on INFINIBAND
> + help
> + Say Y or M here to enable support for the ZTE DingHai (ZXDH) Ethernet
> + Protocol Driver for RDMA. This driver provides RDMA over Converged
> + Ethernet (RoCE) functionality for ZTE DingHai network adapters.
> + If you choose to build this driver as a module, it will be built as
> + a module named zrdma.
You've got a duplicate dependency on INFINIBAND for your
INFINIBAND_ZRDMA config option. There's already an
'if INFINIBAND .. endif' that wraps the kconfig file import in
drivers/infiniband/Kconfig
- Julian Braha
* [PATCH for-next v2 2/2] RDMA/bnxt_re: Add uverbs object handle path for CQ/SRQ toggle page
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
sriharsha.basavapatna, Selvin Xavier, Jason Gunthorpe
In-Reply-To: <20260624223927.521882-1-selvin.xavier@broadcom.com>
The current GET_TOGGLE_MEM ioctl requires the caller to supply
a type enum and a raw hardware queue ID (RES_ID). The kernel
looks up the CQ or SRQ by that ID without verifying that the
caller owns the resource.
Add a new, preferred code path that accepts standard uverbs
object handles (BNXT_RE_TOGGLE_MEM_CQ_HANDLE /
BNXT_RE_TOGGLE_MEM_SRQ_HANDLE) instead.
Only newer rdma-core versions support this path. Capability is
negotiated during context creation using the req mask
(BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT) and resp
mask (BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT).
The existing TYPE + RES_ID path is retained for backward
compatibility with older rdma-core.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 7 +-
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 1 +
drivers/infiniband/hw/bnxt_re/uapi.c | 99 +++++++++++++++---------
include/uapi/rdma/bnxt_re-abi.h | 4 +
4 files changed, 74 insertions(+), 37 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index d1eebd7b56f4..423c8f3184bb 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -4846,7 +4846,8 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
rc = ib_copy_validate_udata_in_cm(
udata, ureq, comp_mask,
BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT |
- BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT);
+ BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT |
+ BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT);
if (rc)
goto cfail;
if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT) {
@@ -4859,6 +4860,10 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
if (resp.mode == BNXT_QPLIB_WQE_MODE_VARIABLE)
uctx->cmask |= BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED;
}
+ if (ureq.comp_mask & BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT) {
+ uctx->cmask |= BNXT_RE_UCNTX_CAP_TOGGLE_MEM_UOBJ;
+ resp.comp_mask |= BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT;
+ }
}
xa_init(&uctx->cq_xa);
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 76f407cd3435..85e594a25448 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -193,6 +193,7 @@ static inline u16 bnxt_re_get_rwqe_size(int nsge)
enum {
BNXT_RE_UCNTX_CAP_POW2_DISABLED = 0x1ULL,
BNXT_RE_UCNTX_CAP_VAR_WQE_ENABLED = 0x2ULL,
+ BNXT_RE_UCNTX_CAP_TOGGLE_MEM_UOBJ = 0x4ULL,
};
static inline u32 bnxt_re_init_depth(u32 ent, u32 max,
diff --git a/drivers/infiniband/hw/bnxt_re/uapi.c b/drivers/infiniband/hw/bnxt_re/uapi.c
index 7e2acd0933f7..45dcaa49d6a8 100644
--- a/drivers/infiniband/hw/bnxt_re/uapi.c
+++ b/drivers/infiniband/hw/bnxt_re/uapi.c
@@ -216,57 +216,76 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
{
struct ib_uobject *uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_HANDLE);
enum bnxt_re_mmap_flag mmap_flag = BNXT_RE_MMAP_TOGGLE_PAGE;
- enum bnxt_re_get_toggle_mem_type res_type;
struct bnxt_re_user_mmap_entry *entry;
struct bnxt_re_ucontext *uctx;
struct ib_ucontext *ib_uctx;
+ struct ib_uobject *res_uobj;
u32 length = PAGE_SIZE;
u64 mem_offset;
u32 offset = 0;
u64 addr = 0;
- u32 res_id;
int err;
ib_uctx = ib_uverbs_get_ucontext(attrs);
if (IS_ERR(ib_uctx))
return PTR_ERR(ib_uctx);
- err = uverbs_get_const(&res_type, attrs, BNXT_RE_TOGGLE_MEM_TYPE);
- if (err)
- return err;
-
uctx = container_of(ib_uctx, struct bnxt_re_ucontext, ib_uctx);
- err = uverbs_copy_from(&res_id, attrs, BNXT_RE_TOGGLE_MEM_RES_ID);
- if (err)
- return err;
-
- switch (res_type) {
- case BNXT_RE_CQ_TOGGLE_MEM:
- struct bnxt_re_cq *cq;
+ res_uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_CQ_HANDLE);
+ if (!IS_ERR(res_uobj)) {
+ struct bnxt_re_cq *cq =
+ container_of((struct ib_cq *)res_uobj->object,
+ struct bnxt_re_cq, ib_cq);
+
+ addr = (u64)cq->uctx_cq_page;
+ } else {
+ res_uobj = uverbs_attr_get_uobject(attrs, BNXT_RE_TOGGLE_MEM_SRQ_HANDLE);
+ if (!IS_ERR(res_uobj)) {
+ struct bnxt_re_srq *srq =
+ container_of((struct ib_srq *)res_uobj->object,
+ struct bnxt_re_srq, ib_srq);
- xa_lock(&uctx->cq_xa);
- cq = xa_load(&uctx->cq_xa, res_id);
- if (cq)
- addr = (u64)cq->uctx_cq_page;
- xa_unlock(&uctx->cq_xa);
- if (!addr)
- return -EINVAL;
- break;
- case BNXT_RE_SRQ_TOGGLE_MEM:
- struct bnxt_re_srq *srq;
-
- xa_lock(&uctx->srq_xa);
- srq = xa_load(&uctx->srq_xa, res_id);
- if (srq)
addr = (u64)srq->uctx_srq_page;
- xa_unlock(&uctx->srq_xa);
- if (!addr)
- return -EINVAL;
- break;
- default:
- return -EOPNOTSUPP;
+ } else {
+ /*
+ * Legacy path: old libbnxt_re sends TYPE + RES_ID.
+ * Look up the CQ or SRQ in the per-context XArray
+ */
+ enum bnxt_re_get_toggle_mem_type res_type;
+ u32 res_id;
+
+ err = uverbs_get_const(&res_type, attrs,
+ BNXT_RE_TOGGLE_MEM_TYPE);
+ if (err)
+ return err;
+ err = uverbs_copy_from(&res_id, attrs,
+ BNXT_RE_TOGGLE_MEM_RES_ID);
+ if (err)
+ return err;
+
+ if (res_type == BNXT_RE_CQ_TOGGLE_MEM) {
+ struct bnxt_re_cq *cq;
+
+ xa_lock(&uctx->cq_xa);
+ cq = xa_load(&uctx->cq_xa, res_id);
+ if (cq)
+ addr = (u64)cq->uctx_cq_page;
+ xa_unlock(&uctx->cq_xa);
+ } else if (res_type == BNXT_RE_SRQ_TOGGLE_MEM) {
+ struct bnxt_re_srq *srq;
+
+ xa_lock(&uctx->srq_xa);
+ srq = xa_load(&uctx->srq_xa, res_id);
+ if (srq)
+ addr = (u64)srq->uctx_srq_page;
+ xa_unlock(&uctx->srq_xa);
+ }
+ }
}
+ if (!addr)
+ return -EOPNOTSUPP;
+
entry = bnxt_re_mmap_entry_insert(uctx, addr, mmap_flag, &mem_offset);
if (!entry)
return -ENOMEM;
@@ -308,10 +327,10 @@ DECLARE_UVERBS_NAMED_METHOD(BNXT_RE_METHOD_GET_TOGGLE_MEM,
UA_MANDATORY),
UVERBS_ATTR_CONST_IN(BNXT_RE_TOGGLE_MEM_TYPE,
enum bnxt_re_get_toggle_mem_type,
- UA_MANDATORY),
+ UA_OPTIONAL),
UVERBS_ATTR_PTR_IN(BNXT_RE_TOGGLE_MEM_RES_ID,
UVERBS_ATTR_TYPE(u32),
- UA_MANDATORY),
+ UA_OPTIONAL),
UVERBS_ATTR_PTR_OUT(BNXT_RE_TOGGLE_MEM_MMAP_PAGE,
UVERBS_ATTR_TYPE(u64),
UA_MANDATORY),
@@ -320,7 +339,15 @@ DECLARE_UVERBS_NAMED_METHOD(BNXT_RE_METHOD_GET_TOGGLE_MEM,
UA_MANDATORY),
UVERBS_ATTR_PTR_OUT(BNXT_RE_TOGGLE_MEM_MMAP_LENGTH,
UVERBS_ATTR_TYPE(u32),
- UA_MANDATORY));
+ UA_MANDATORY),
+ UVERBS_ATTR_IDR(BNXT_RE_TOGGLE_MEM_CQ_HANDLE,
+ UVERBS_OBJECT_CQ,
+ UVERBS_ACCESS_READ,
+ UA_OPTIONAL),
+ UVERBS_ATTR_IDR(BNXT_RE_TOGGLE_MEM_SRQ_HANDLE,
+ UVERBS_OBJECT_SRQ,
+ UVERBS_ACCESS_READ,
+ UA_OPTIONAL));
DECLARE_UVERBS_NAMED_METHOD_DESTROY(BNXT_RE_METHOD_RELEASE_TOGGLE_MEM,
UVERBS_ATTR_IDR(BNXT_RE_RELEASE_TOGGLE_MEM_HANDLE,
diff --git a/include/uapi/rdma/bnxt_re-abi.h b/include/uapi/rdma/bnxt_re-abi.h
index a4599d7b736a..a6cfd68ed8f5 100644
--- a/include/uapi/rdma/bnxt_re-abi.h
+++ b/include/uapi/rdma/bnxt_re-abi.h
@@ -57,6 +57,7 @@ enum {
BNXT_RE_UCNTX_CMASK_POW2_DISABLED = 0x10ULL,
BNXT_RE_UCNTX_CMASK_MSN_TABLE_ENABLED = 0x40,
BNXT_RE_UCNTX_CMASK_QP_RATE_LIMIT_ENABLED = 0x80ULL,
+ BNXT_RE_UCNTX_CMASK_TOGGLE_MEM_UOBJ_SUPPORT = 0x400000ULL,
};
enum bnxt_re_wqe_mode {
@@ -68,6 +69,7 @@ enum bnxt_re_wqe_mode {
enum {
BNXT_RE_COMP_MASK_REQ_UCNTX_POW2_SUPPORT = 0x01,
BNXT_RE_COMP_MASK_REQ_UCNTX_VAR_WQE_SUPPORT = 0x02,
+ BNXT_RE_COMP_MASK_REQ_UCNTX_TOGGLE_MEM_UOBJ_SUPPORT = 0x20,
};
struct bnxt_re_uctx_req {
@@ -218,6 +220,8 @@ enum bnxt_re_var_toggle_mem_attrs {
BNXT_RE_TOGGLE_MEM_MMAP_PAGE,
BNXT_RE_TOGGLE_MEM_MMAP_OFFSET,
BNXT_RE_TOGGLE_MEM_MMAP_LENGTH,
+ BNXT_RE_TOGGLE_MEM_CQ_HANDLE,
+ BNXT_RE_TOGGLE_MEM_SRQ_HANDLE,
};
enum bnxt_re_toggle_mem_attrs {
--
2.39.3
^ permalink raw reply related
* [PATCH for-next v2 1/2] RDMA/bnxt_re: Replace per-device hash tables with per-context XArray
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
sriharsha.basavapatna, Selvin Xavier
In-Reply-To: <20260624223927.521882-1-selvin.xavier@broadcom.com>
The CQ and SRQ hash tables (cq_hash, srq_hash) on struct bnxt_re_dev
were used exclusively to look up a toggle-page pointer from a
user-space-supplied hardware queue ID in the GET_TOGGLE_MEM
ioctl handler. This approach has a couple of problems. First,
because the tables are per-device, any user can look up another
user's CQ or SRQ by guessing the hardware queue ID. Second,
concurrent add and remove operations on the hash table are not
protected by any lock, leaving a race window.
The correct fix is to retrieve the CQ and SRQ objects via the uverbs
object handle, which gives built-in ownership verification and reference
pinning for the duration of the ioctl. That is added in the next patch of
this series.
To maintain backward compatibility with older rdma-core versions that
do not send a uverbs object handle, the driver must continue to support
the existing TYPE + RES_ID lookup path. This patch replaces the per-device
hash tables with per-ucontext XArrays (cq_xa and srq_xa on struct
bnxt_re_ucontext), which narrows the lookup scope to the calling context,
eliminating the cross-user visibility. It also takes the XArray lock around
lookup and erase for synchronization.
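The scoping argument above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct sk_ctx, sk_ctx_store() and sk_ctx_load() are hypothetical stand-ins for a per-ucontext XArray, a fixed-size array replaces the XArray, and the locking that the patch adds is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_IDS 8u /* hypothetical table size, for the sketch only */

/* Hypothetical stand-in for one user context's id -> toggle page table. */
struct sk_ctx {
	void *toggle_page[SK_MAX_IDS];
};

/* Record a page under a queue ID in the given context only. */
static int sk_ctx_store(struct sk_ctx *ctx, uint32_t id, void *page)
{
	if (id >= SK_MAX_IDS)
		return -1;
	ctx->toggle_page[id] = page;
	return 0;
}

/* Look up a queue ID; only entries stored in this context are visible. */
static void *sk_ctx_load(const struct sk_ctx *ctx, uint32_t id)
{
	return id < SK_MAX_IDS ? ctx->toggle_page[id] : NULL;
}
```

With one table per context, an ID stored through one context is not found through another, which is the property the per-device hash tables could not provide.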
The GET_TOGGLE_MEM ioctl handler is updated to call xa_load()
in place of the now-removed bnxt_re_search_for_cq()/
bnxt_re_search_for_srq() helpers. No ABI changes are required.
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
---
drivers/infiniband/hw/bnxt_re/bnxt_re.h | 6 --
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 84 ++++++++++++++++++------
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 4 +-
drivers/infiniband/hw/bnxt_re/main.c | 5 --
drivers/infiniband/hw/bnxt_re/uapi.c | 55 ++++------------
5 files changed, 80 insertions(+), 74 deletions(-)
diff --git a/drivers/infiniband/hw/bnxt_re/bnxt_re.h b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
index 3a7ce4729fcf..a43e678151d3 100644
--- a/drivers/infiniband/hw/bnxt_re/bnxt_re.h
+++ b/drivers/infiniband/hw/bnxt_re/bnxt_re.h
@@ -41,7 +41,6 @@
#define __BNXT_RE_H__
#include <rdma/uverbs_ioctl.h>
#include "hw_counters.h"
-#include <linux/hashtable.h>
#define ROCE_DRV_MODULE_NAME "bnxt_re"
#define BNXT_RE_DESC "Broadcom NetXtreme-C/E RoCE Driver"
@@ -158,9 +157,6 @@ struct bnxt_re_nq_record {
struct mutex load_lock;
};
-#define MAX_CQ_HASH_BITS (16)
-#define MAX_SRQ_HASH_BITS (16)
-
static inline bool bnxt_re_chip_gen_p7(u16 chip_num)
{
return (chip_num == CHIP_NUM_58818 ||
@@ -215,8 +211,6 @@ struct bnxt_re_dev {
struct bnxt_re_pacing pacing;
struct work_struct dbq_fifo_check_work;
struct delayed_work dbq_pacing_work;
- DECLARE_HASHTABLE(cq_hash, MAX_CQ_HASH_BITS);
- DECLARE_HASHTABLE(srq_hash, MAX_SRQ_HASH_BITS);
struct dentry *dbg_root;
struct dentry *qp_debugfs;
unsigned long event_bitmap;
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 565762529007..d1eebd7b56f4 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -38,6 +38,7 @@
#include <linux/interrupt.h>
#include <linux/types.h>
+#include <linux/xarray.h>
#include <linux/pci.h>
#include <linux/netdevice.h>
#include <linux/if_ether.h>
@@ -51,8 +52,6 @@
#include <rdma/ib_cache.h>
#include <rdma/ib_pma.h>
#include <rdma/uverbs_ioctl.h>
-#include <linux/hashtable.h>
-
#include "roce_hsi.h"
#include "qplib_res.h"
#include "qplib_sp.h"
@@ -2152,11 +2151,19 @@ int bnxt_re_destroy_srq(struct ib_srq *ib_srq, struct ib_udata *udata)
if (ret)
return ret;
- if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
- hash_del(&srq->hash_entry);
bnxt_qplib_destroy_srq(&rdev->qplib_res, qplib_srq);
- if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
- free_page((unsigned long)srq->uctx_srq_page);
+ if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
+ struct bnxt_re_ucontext *uctx =
+ rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
+
+ if (uctx) {
+ /* similar to cq, use __xa_erase() with the lock already held */
+ xa_lock(&uctx->srq_xa);
+ __xa_erase(&uctx->srq_xa, srq->qplib_srq.id);
+ xa_unlock(&uctx->srq_xa);
+ free_page((unsigned long)srq->uctx_srq_page);
+ }
+ }
ib_umem_release(srq->umem);
atomic_dec(&rdev->stats.res.srq_count);
return ib_respond_empty_udata(udata);
@@ -2263,20 +2270,21 @@ int bnxt_re_create_srq(struct ib_srq *ib_srq,
resp.srqid = srq->qplib_srq.id;
if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
- hash_add(rdev->srq_hash, &srq->hash_entry, srq->qplib_srq.id);
srq->uctx_srq_page = (void *)get_zeroed_page(GFP_KERNEL);
if (!srq->uctx_srq_page) {
rc = -ENOMEM;
- goto fail;
+ goto fail_destroy_srq;
+ }
+ if (xa_is_err(xa_store(&uctx->srq_xa, srq->qplib_srq.id,
+ srq, GFP_KERNEL))) {
+ rc = -ENOMEM;
+ goto fail_free_toggle;
}
resp.comp_mask |= BNXT_RE_SRQ_TOGGLE_PAGE_SUPPORT;
}
rc = ib_respond_udata(udata, resp);
- if (rc) {
- bnxt_qplib_destroy_srq(&rdev->qplib_res,
- &srq->qplib_srq);
- goto fail;
- }
+ if (rc)
+ goto fail_respond;
}
active_srqs = atomic_inc_return(&rdev->stats.res.srq_count);
if (active_srqs > rdev->stats.res.srq_watermark)
@@ -2285,6 +2293,16 @@ int bnxt_re_create_srq(struct ib_srq *ib_srq,
return 0;
+fail_respond:
+ if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT) {
+ xa_lock(&uctx->srq_xa);
+ __xa_erase(&uctx->srq_xa, srq->qplib_srq.id);
+ xa_unlock(&uctx->srq_xa);
+ }
+fail_free_toggle:
+ free_page((unsigned long)srq->uctx_srq_page);
+fail_destroy_srq:
+ bnxt_qplib_destroy_srq(&rdev->qplib_res, &srq->qplib_srq);
fail:
ib_umem_release(srq->umem);
exit:
@@ -3475,11 +3493,24 @@ int bnxt_re_destroy_cq(struct ib_cq *ib_cq, struct ib_udata *udata)
if (ret)
return ret;
- if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT)
- hash_del(&cq->hash_entry);
bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
- if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT)
- free_page((unsigned long)cq->uctx_cq_page);
+ if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
+ struct bnxt_re_ucontext *uctx =
+ rdma_udata_to_drv_context(udata, struct bnxt_re_ucontext, ib_uctx);
+
+ if (uctx) {
+ /*
+ * Hold xa_lock across the erase so that GET_TOGGLE_MEM's
+ * xa_lock + xa_load + dereference region is atomic with respect
+ * to removal. xa_erase() would re-acquire the same lock and
+ * deadlock; use __xa_erase() with the lock already held.
+ */
+ xa_lock(&uctx->cq_xa);
+ __xa_erase(&uctx->cq_xa, cq->qplib_cq.id);
+ xa_unlock(&uctx->cq_xa);
+ free_page((unsigned long)cq->uctx_cq_page);
+ }
+ }
bnxt_re_put_nq(rdev, nq);
@@ -3554,14 +3585,15 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
spin_lock_init(&cq->cq_lock);
if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
- hash_add(rdev->cq_hash, &cq->hash_entry, cq->qplib_cq.id);
- /* Allocate a page */
cq->uctx_cq_page = (void *)get_zeroed_page(GFP_KERNEL);
if (!cq->uctx_cq_page) {
rc = -ENOMEM;
goto destroy_cq;
}
-
+ if (xa_is_err(xa_store(&uctx->cq_xa, cq->qplib_cq.id, cq, GFP_KERNEL))) {
+ rc = -ENOMEM;
+ goto free_toggle_page;
+ }
resp.comp_mask |= BNXT_RE_CQ_TOGGLE_PAGE_SUPPORT;
}
resp.cqid = cq->qplib_cq.id;
@@ -3574,6 +3606,12 @@ int bnxt_re_create_user_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *att
return 0;
free_mem:
+ if (cctx->modes.toggle_bits & BNXT_QPLIB_CQ_TOGGLE_BIT) {
+ xa_lock(&uctx->cq_xa);
+ __xa_erase(&uctx->cq_xa, cq->qplib_cq.id);
+ xa_unlock(&uctx->cq_xa);
+ }
+free_toggle_page:
free_page((unsigned long)cq->uctx_cq_page);
destroy_cq:
bnxt_qplib_destroy_cq(&rdev->qplib_res, &cq->qplib_cq);
@@ -4823,6 +4861,9 @@ int bnxt_re_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *udata)
}
}
+ xa_init(&uctx->cq_xa);
+ xa_init(&uctx->srq_xa);
+
rc = ib_respond_udata(udata, resp);
if (rc)
goto cfail;
@@ -4848,6 +4889,9 @@ void bnxt_re_dealloc_ucontext(struct ib_ucontext *ib_uctx)
if (uctx->shpg)
free_page((unsigned long)uctx->shpg);
+ xa_destroy(&uctx->cq_xa);
+ xa_destroy(&uctx->srq_xa);
+
if (uctx->dpi.dbr) {
/* Free DPI only if this is the first PD allocated by the
* application and mark the context dpi as NULL
diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.h b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
index 22bf81668cfb..76f407cd3435 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.h
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.h
@@ -78,7 +78,6 @@ struct bnxt_re_srq {
struct ib_umem *umem;
spinlock_t lock; /* protect srq */
void *uctx_srq_page;
- struct hlist_node hash_entry;
};
struct bnxt_re_qp {
@@ -113,7 +112,6 @@ struct bnxt_re_cq {
struct ib_umem *resize_umem;
int resize_cqe;
void *uctx_cq_page;
- struct hlist_node hash_entry;
};
struct bnxt_re_mr {
@@ -147,6 +145,8 @@ struct bnxt_re_ucontext {
void *shpg;
spinlock_t sh_lock; /* protect shpg */
struct rdma_user_mmap_entry *shpage_mmap;
+ struct xarray cq_xa; /* cqid → bnxt_re_cq, for toggle page lookup */
+ struct xarray srq_xa; /* srqid → bnxt_re_srq, for toggle page lookup */
u64 cmask;
};
diff --git a/drivers/infiniband/hw/bnxt_re/main.c b/drivers/infiniband/hw/bnxt_re/main.c
index d25fdc458120..637f023b18ac 100644
--- a/drivers/infiniband/hw/bnxt_re/main.c
+++ b/drivers/infiniband/hw/bnxt_re/main.c
@@ -54,7 +54,6 @@
#include <rdma/ib_user_verbs.h>
#include <rdma/ib_umem.h>
#include <rdma/ib_addr.h>
-#include <linux/hashtable.h>
#include <linux/bnxt/ulp.h>
#include "roce_hsi.h"
@@ -2337,10 +2336,6 @@ static int bnxt_re_dev_init(struct bnxt_re_dev *rdev, u8 op_type)
if (!(rdev->qplib_res.en_dev->flags & BNXT_EN_FLAG_ROCE_VF_RES_MGMT))
bnxt_re_vf_res_config(rdev);
}
- hash_init(rdev->cq_hash);
- if (rdev->chip_ctx->modes.toggle_bits & BNXT_QPLIB_SRQ_TOGGLE_BIT)
- hash_init(rdev->srq_hash);
-
bnxt_re_debugfs_add_pdev(rdev);
bnxt_re_init_dcb_wq(rdev);
diff --git a/drivers/infiniband/hw/bnxt_re/uapi.c b/drivers/infiniband/hw/bnxt_re/uapi.c
index 263238a6e4cd..7e2acd0933f7 100644
--- a/drivers/infiniband/hw/bnxt_re/uapi.c
+++ b/drivers/infiniband/hw/bnxt_re/uapi.c
@@ -22,32 +22,6 @@
#include "bnxt_re.h"
#include "ib_verbs.h"
-static struct bnxt_re_cq *bnxt_re_search_for_cq(struct bnxt_re_dev *rdev, u32 cq_id)
-{
- struct bnxt_re_cq *cq = NULL, *tmp_cq;
-
- hash_for_each_possible(rdev->cq_hash, tmp_cq, hash_entry, cq_id) {
- if (tmp_cq->qplib_cq.id == cq_id) {
- cq = tmp_cq;
- break;
- }
- }
- return cq;
-}
-
-static struct bnxt_re_srq *bnxt_re_search_for_srq(struct bnxt_re_dev *rdev, u32 srq_id)
-{
- struct bnxt_re_srq *srq = NULL, *tmp_srq;
-
- hash_for_each_possible(rdev->srq_hash, tmp_srq, hash_entry, srq_id) {
- if (tmp_srq->qplib_srq.id == srq_id) {
- srq = tmp_srq;
- break;
- }
- }
- return srq;
-}
-
static int UVERBS_HANDLER(BNXT_RE_METHOD_NOTIFY_DRV)(struct uverbs_attr_bundle *attrs)
{
struct bnxt_re_ucontext *uctx;
@@ -246,10 +220,7 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
struct bnxt_re_user_mmap_entry *entry;
struct bnxt_re_ucontext *uctx;
struct ib_ucontext *ib_uctx;
- struct bnxt_re_dev *rdev;
- struct bnxt_re_srq *srq;
u32 length = PAGE_SIZE;
- struct bnxt_re_cq *cq;
u64 mem_offset;
u32 offset = 0;
u64 addr = 0;
@@ -265,31 +236,33 @@ static int UVERBS_HANDLER(BNXT_RE_METHOD_GET_TOGGLE_MEM)(struct uverbs_attr_bund
return err;
uctx = container_of(ib_uctx, struct bnxt_re_ucontext, ib_uctx);
- rdev = uctx->rdev;
err = uverbs_copy_from(&res_id, attrs, BNXT_RE_TOGGLE_MEM_RES_ID);
if (err)
return err;
switch (res_type) {
case BNXT_RE_CQ_TOGGLE_MEM:
- cq = bnxt_re_search_for_cq(rdev, res_id);
- if (!cq)
- return -EINVAL;
+ struct bnxt_re_cq *cq;
- addr = (u64)cq->uctx_cq_page;
+ xa_lock(&uctx->cq_xa);
+ cq = xa_load(&uctx->cq_xa, res_id);
+ if (cq)
+ addr = (u64)cq->uctx_cq_page;
+ xa_unlock(&uctx->cq_xa);
if (!addr)
- return -EOPNOTSUPP;
+ return -EINVAL;
break;
case BNXT_RE_SRQ_TOGGLE_MEM:
- srq = bnxt_re_search_for_srq(rdev, res_id);
- if (!srq)
- return -EINVAL;
+ struct bnxt_re_srq *srq;
- addr = (u64)srq->uctx_srq_page;
+ xa_lock(&uctx->srq_xa);
+ srq = xa_load(&uctx->srq_xa, res_id);
+ if (srq)
+ addr = (u64)srq->uctx_srq_page;
+ xa_unlock(&uctx->srq_xa);
if (!addr)
- return -EOPNOTSUPP;
+ return -EINVAL;
break;
-
default:
return -EOPNOTSUPP;
}
--
2.39.3
^ permalink raw reply related
* [PATCH for-next v2 0/2] RDMA/bnxt_re: Update the toggle page handling of CQ and SRQ
From: Selvin Xavier @ 2026-06-24 22:39 UTC (permalink / raw)
To: leon, jgg
Cc: linux-rdma, andrew.gospodarek, kalesh-anakkur.purayil,
sriharsha.basavapatna, Selvin Xavier
Based on the suggestion from Jason
(https://patchwork.kernel.org/project/linux-rdma/patch/20260615224751.232802-5-selvin.xavier@broadcom.com/),
this series adds a uverbs object handle to retrieve the CQ and SRQ structures
when getting the toggle mem. To keep working with older rdma-core, the
existing lookup code is retained with modifications.
The rdma-core pull request is here: https://github.com/linux-rdma/rdma-core/pull/1761
Please review and apply the series.
Thanks,
Selvin Xavier
v1->v2 :
- Fix the error cleanup for SRQ and CQ create paths
- Fix a synchronization issue for the legacy path which can cause a
UAF
Selvin Xavier (2):
RDMA/bnxt_re: Replace per-device hash tables with per-context XArray
RDMA/bnxt_re: Add uverbs object handle path for CQ/SRQ toggle page
drivers/infiniband/hw/bnxt_re/bnxt_re.h | 6 --
drivers/infiniband/hw/bnxt_re/ib_verbs.c | 91 +++++++++++++----
drivers/infiniband/hw/bnxt_re/ib_verbs.h | 5 +-
drivers/infiniband/hw/bnxt_re/main.c | 5 -
drivers/infiniband/hw/bnxt_re/uapi.c | 124 +++++++++++------------
include/uapi/rdma/bnxt_re-abi.h | 4 +
6 files changed, 139 insertions(+), 96 deletions(-)
--
2.39.3
^ permalink raw reply
* [PATCH] RDMA/irdma: Prevent overflows in memory contiguity checks
From: Aleksandrova Alyona @ 2026-06-24 14:48 UTC (permalink / raw)
To: Krzysztof Czurylo, Tatyana Nikolova
Cc: Jason Gunthorpe, Leon Romanovsky, Mustafa Ismail, Shiraz Saleem,
linux-rdma, linux-kernel, lvc-project
irdma_check_mem_contiguous() and irdma_check_mr_contiguous() verify that
PBL entries describe physically contiguous memory ranges.
Both functions calculate byte offsets using 32-bit operands. For example,
with 4 KiB pages, pg_size * pg_idx overflows 32-bit arithmetic when
pg_idx reaches 1048576. In the level-2 check, PBLE_PER_PAGE is 512, so
i * pg_size * PBLE_PER_PAGE overflows when i reaches 2048.
These values are reachable in the driver. For MRs, palloc->total_cnt
comes from iwmr->page_cnt, which is calculated by
ib_umem_num_dma_blocks(). The MR size is limited by IRDMA_MAX_MR_SIZE,
so a 4 GiB MR with 4 KiB pages can reach page_cnt of 1048576. PBLE
resources do not exclude this value either: for gen3, the limit is based
on avail_sds * MAX_PBLE_PER_SD, and MAX_PBLE_PER_SD is 0x40000, so 4 SDs
are enough for 1048576 PBLEs.
Cast one operand to u64 before the multiplications so that the offset
calculations are performed in 64-bit arithmetic.
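The wrap-around described above can be reproduced in user space. The following is a minimal sketch, not driver code: the sk_* helper names are hypothetical, uint32_t and uint64_t stand in for the kernel's u32 and u64, and SK_PBLE_PER_PAGE uses the value 512 quoted above.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PBLE_PER_PAGE 512u /* value quoted in the commit message */

/* Old form: the product is computed in 32 bits and only then widened. */
static uint64_t sk_off32(uint32_t pg_size, uint32_t pg_idx)
{
	return pg_size * pg_idx;
}

/* Fixed form: one operand is widened first, so the multiply is 64-bit. */
static uint64_t sk_off64(uint32_t pg_size, uint32_t pg_idx)
{
	return (uint64_t)pg_size * pg_idx;
}

/* Level-2 check, old form: the whole chain is evaluated in 32 bits. */
static uint64_t sk_leaf_off32(uint32_t i, uint32_t pg_size)
{
	return i * pg_size * SK_PBLE_PER_PAGE;
}

/* Level-2 check, fixed form: the cast on i widens the whole chain. */
static uint64_t sk_leaf_off64(uint32_t i, uint32_t pg_size)
{
	return (uint64_t)i * pg_size * SK_PBLE_PER_PAGE;
}
```

With pg_size = 4096 and pg_idx = 1048576 the 32-bit form yields 0 while the 64-bit form yields 4294967296, so the old comparison would be made against a wrapped offset.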
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Fixes: b48c24c2d710 ("RDMA/irdma: Implement device supported verb APIs")
Signed-off-by: Aleksandrova Alyona <aga@itb.spb.ru>
---
drivers/infiniband/hw/irdma/verbs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/infiniband/hw/irdma/verbs.c b/drivers/infiniband/hw/irdma/verbs.c
index 17086048d2d7..ab55f674cb63 100644
--- a/drivers/infiniband/hw/irdma/verbs.c
+++ b/drivers/infiniband/hw/irdma/verbs.c
@@ -2819,7 +2819,7 @@ static bool irdma_check_mem_contiguous(u64 *arr, u32 npages, u32 pg_size)
u32 pg_idx;
for (pg_idx = 0; pg_idx < npages; pg_idx++) {
- if ((*arr + (pg_size * pg_idx)) != arr[pg_idx])
+ if ((*arr + ((u64)pg_size * pg_idx)) != arr[pg_idx])
return false;
}
@@ -2852,7 +2852,7 @@ static bool irdma_check_mr_contiguous(struct irdma_pble_alloc *palloc,
for (i = 0; i < lvl2->leaf_cnt; i++, leaf++) {
arr = leaf->addr;
- if ((*start_addr + (i * pg_size * PBLE_PER_PAGE)) != *arr)
+ if ((*start_addr + ((u64)i * pg_size * PBLE_PER_PAGE)) != *arr)
return false;
ret = irdma_check_mem_contiguous(arr, leaf->cnt, pg_size);
if (!ret)
--
2.26.2
^ permalink raw reply related
* Re: [PATCH] RDMA/siw: publish QP after initialization
From: Bernard Metzler @ 2026-06-24 14:16 UTC (permalink / raw)
To: Ruoyu Wang, Jason Gunthorpe, Leon Romanovsky; +Cc: linux-rdma, linux-kernel
In-Reply-To: <20260620155306.78919-1-ruoyuw560@gmail.com>
On 20.06.2026 17:53, Ruoyu Wang wrote:
> siw_create_qp() allocates a QP number before the queues, CQ pointers,
> state, completion, and device list entry are ready. A QPN lookup can
> therefore reach a QP that is still being constructed if the object is
> published at allocation time.
>
> Reserve the QPN with an empty XArray entry first. Publish the QP object
> only after the kernel-visible QP state is initialized and just before
> siw_create_qp() returns it to the caller.
>
> Fixes: f29dd55b0236 ("rdma/siw: queue pair methods")
> Signed-off-by: Ruoyu Wang <ruoyuw560@gmail.com>
> ---
> drivers/infiniband/sw/siw/siw.h | 1 +
> drivers/infiniband/sw/siw/siw_qp.c | 26 ++++++++++++++++++--------
> drivers/infiniband/sw/siw/siw_verbs.c | 12 +++++++++++-
> 3 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/infiniband/sw/siw/siw.h b/drivers/infiniband/sw/siw/siw.h
> index f5fd71717b80..ade7c96135c2 100644
> --- a/drivers/infiniband/sw/siw/siw.h
> +++ b/drivers/infiniband/sw/siw/siw.h
> @@ -511,6 +511,7 @@ void siw_send_terminate(struct siw_qp *qp);
> void siw_qp_get_ref(struct ib_qp *qp);
> void siw_qp_put_ref(struct ib_qp *qp);
> int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp);
> +int siw_qp_publish(struct siw_qp *qp);
> void siw_free_qp(struct kref *ref);
>
> void siw_init_terminate(struct siw_qp *qp, enum term_elayer layer,
> diff --git a/drivers/infiniband/sw/siw/siw_qp.c b/drivers/infiniband/sw/siw/siw_qp.c
> index bb780e3904a2..1a9135d9a2a7 100644
> --- a/drivers/infiniband/sw/siw/siw_qp.c
> +++ b/drivers/infiniband/sw/siw/siw_qp.c
> @@ -1281,15 +1281,25 @@ void siw_rq_flush(struct siw_qp *qp)
>
> int siw_qp_add(struct siw_device *sdev, struct siw_qp *qp)
> {
> - int rv = xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, qp, xa_limit_32b,
> - GFP_KERNEL);
> + qp->sdev = sdev;
>
> - if (!rv) {
> - kref_init(&qp->ref);
> - qp->sdev = sdev;
> - siw_dbg_qp(qp, "new QP\n");
> - }
> - return rv;
> + return xa_alloc(&sdev->qp_xa, &qp->base_qp.qp_num, NULL,
> + xa_limit_32b, GFP_KERNEL);
> +}
> +
> +int siw_qp_publish(struct siw_qp *qp)
> +{
> + void *old;
> +
> + kref_init(&qp->ref);
> +
> + old = xa_store(&qp->sdev->qp_xa, qp_id(qp), qp, GFP_KERNEL);
> + if (xa_is_err(old))
> + return xa_err(old);
> +
> + siw_dbg_qp(qp, "new QP\n");
> +
> + return 0;
> }
>
> void siw_free_qp(struct kref *ref)
> diff --git a/drivers/infiniband/sw/siw/siw_verbs.c b/drivers/infiniband/sw/siw/siw_verbs.c
> index 1e1d262a4ae2..71bc0cc59e3d 100644
> --- a/drivers/infiniband/sw/siw/siw_verbs.c
> +++ b/drivers/infiniband/sw/siw/siw_verbs.c
> @@ -482,14 +482,24 @@ int siw_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
> goto err_out_xa;
> }
> INIT_LIST_HEAD(&qp->devq);
> + init_completion(&qp->qp_free);
> +
> spin_lock_irqsave(&sdev->lock, flags);
> list_add_tail(&qp->devq, &sdev->qp_list);
> spin_unlock_irqrestore(&sdev->lock, flags);
>
> - init_completion(&qp->qp_free);
> + rv = siw_qp_publish(qp);
To avoid this transient visibility of a not-yet-initialized
QP - can't we just move siw_qp_add() to the end of the
siw_create_qp() function?
> + if (rv)
> + goto err_out_list;
>
> return 0;
>
> +err_out_list:
> + spin_lock_irqsave(&sdev->lock, flags);
> + list_del(&qp->devq);
> + spin_unlock_irqrestore(&sdev->lock, flags);
> +
> + siw_put_tx_cpu(qp->tx_cpu);
> err_out_xa:
> xa_erase(&sdev->qp_xa, qp_id(qp));
> if (uctx) {
^ permalink raw reply
* [recipe build #4056915] of ~linux-rdma rdma-core-daily in xenial: Dependency wait
From: noreply @ 2026-06-24 13:03 UTC (permalink / raw)
To: Linux RDMA
* State: Dependency wait
* Recipe: linux-rdma/rdma-core-daily
* Archive: ~linux-rdma/ubuntu/rdma-core-daily
* Distroseries: xenial
* Duration: 3 minutes
* Build Log: https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4056915/+files/buildlog.txt.gz
* Upload Log:
* Builder: https://launchpad.net/builders/lcy02-amd64-063
--
https://launchpad.net/~linux-rdma/+archive/ubuntu/rdma-core-daily/+recipebuild/4056915
Your team Linux RDMA is the requester of the build.
^ permalink raw reply
* Re: [PATCH net v2] net/smc: avoid recursive sk_callback_lock in listen data_ready
From: Runyu Xiao @ 2026-06-24 10:37 UTC (permalink / raw)
To: XIAO WU
Cc: D. Wythe, Dust Li, Sidraya Jayagond, Wenjia Zhang,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Mahanta Jambigi, Tony Lu, Wen Gu, Simon Horman, Karsten Graul,
linux-rdma, linux-s390, netdev, linux-kernel, jianhao.xu
In-Reply-To: <tencent_BD4B709F8D16281265EDBC0DC9EFC8758808@qq.com>
Hi Xiao,
> the error path in smc_listen() does not restore icsk_af_ops when
> kernel_listen() fails
Thanks, this looks like a real error-path bug. I will prepare it as a
separate fix for smc_listen() rather than folding it into this
sk_callback_lock patch.
Runyu
^ permalink raw reply
* [PATCH rdma v1 2/2] RDMA/zrdma: Add hardware config code and improve driver init flow
From: zhang.yanze @ 2026-06-24 8:50 UTC (permalink / raw)
To: jgg, leon
Cc: linux-kernel, linux-rdma, zhang.yanze, wei.quan, han.junyang,
ran.ming, han.chengfei
In-Reply-To: <202606241644572798vwNbZEAtt3c2IHcJUtCs@zte.com.cn>
From: Yanze Zhang <zhang.yanze@zte.com.cn>
Add hardware config code and improve driver init flow
- Add TX/RX/IO register configuration logic to new zrdma_hw.c
- Add zrdma_defs.h for common bitfield masks and constants
- Add zrdma_uk.h for shared enum definitions
- Update Makefile to build zrdma_hw.o as part of the zrdma module
- Call zxdh_ctrl_init_hw() in probe and add proper cleanup on error
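As a rough illustration of how the TX DDR access register value is assembled from the bitfield masks, here is a user-space sketch. It rests on assumptions: SK_GENMASK_ULL() and SK_FIELD_PREP() are simplified stand-ins for the kernel's GENMASK_ULL() and FIELD_PREP() (without their compile-time checks), sk_tx_ddr_cfg() is a hypothetical helper, and the field values used with it are arbitrary, not the real ZXDH_INDICATE_* or ZXDH_AXID_* constants. Only the bit positions come from zrdma_defs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for GENMASK_ULL(h, l): bits l..h set. */
#define SK_GENMASK_ULL(h, l) \
	(((~0ULL) << (l)) & ((~0ULL) >> (63 - (h))))

/*
 * Simplified stand-in for FIELD_PREP(): multiplying by the mask's lowest
 * set bit shifts the value into place, then the mask drops excess bits.
 */
#define SK_FIELD_PREP(mask, val) \
	((((uint64_t)(val)) * ((mask) & (~(mask) + 1))) & (mask))

/* Bit positions as defined in zrdma_defs.h. */
#define SK_TX_CACHE_ID       SK_GENMASK_ULL(1, 0)
#define SK_TX_INDICATE_ID    SK_GENMASK_ULL(3, 2)
#define SK_TX_AXI_ID         SK_GENMASK_ULL(6, 4)
#define SK_TX_WAY_PARTITION  SK_GENMASK_ULL(9, 7)

/* Hypothetical helper: pack the four fields into one register value. */
static uint32_t sk_tx_ddr_cfg(uint32_t cache_id, uint32_t indicate_id,
			      uint32_t axi_id, uint32_t way)
{
	return (uint32_t)(SK_FIELD_PREP(SK_TX_CACHE_ID, cache_id) |
			  SK_FIELD_PREP(SK_TX_INDICATE_ID, indicate_id) |
			  SK_FIELD_PREP(SK_TX_AXI_ID, axi_id) |
			  SK_FIELD_PREP(SK_TX_WAY_PARTITION, way));
}
```

For example, sk_tx_ddr_cfg(1, 2, 5, 0) packs to 0x59: cache ID 1 in bits 1:0, indicate ID 2 in bits 3:2 and AXI ID 5 in bits 6:4.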
Signed-off-by: Yanze Zhang <zhang.yanze@zte.com.cn>
---
drivers/infiniband/hw/zrdma/Makefile | 4 +-
drivers/infiniband/hw/zrdma/zrdma_defs.h | 37 ++++++
drivers/infiniband/hw/zrdma/zrdma_hw.c | 135 ++++++++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_hw.h | 156 +++++++++++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_main.c | 14 ++
drivers/infiniband/hw/zrdma/zrdma_main.h | 1 +
drivers/infiniband/hw/zrdma/zrdma_mem.h | 2 +
drivers/infiniband/hw/zrdma/zrdma_uk.h | 18 +++
8 files changed, 366 insertions(+), 1 deletion(-)
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_defs.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_hw.c
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_hw.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_uk.h
diff --git a/drivers/infiniband/hw/zrdma/Makefile b/drivers/infiniband/hw/zrdma/Makefile
index b5297543e393..128bf98fd731 100644
--- a/drivers/infiniband/hw/zrdma/Makefile
+++ b/drivers/infiniband/hw/zrdma/Makefile
@@ -1,4 +1,6 @@
# SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
obj-$(CONFIG_INFINIBAND_ZRDMA) += zrdma.o
-zrdma-y := zrdma_main.o
+zrdma-y := \
+ zrdma_hw.o \
+ zrdma_main.o
diff --git a/drivers/infiniband/hw/zrdma/zrdma_defs.h b/drivers/infiniband/hw/zrdma/zrdma_defs.h
new file mode 100644
index 000000000000..512a426281e6
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_defs.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_DEFS_H
+#define ZRDMA_DEFS_H
+
+#include <linux/bitfield.h>
+#include <rdma/ib_verbs.h>
+
+/* RDMA TX DDR Access REG Masks */
+#define ZXDH_TX_CACHE_ID GENMASK_ULL(1, 0)
+#define ZXDH_TX_INDICATE_ID GENMASK_ULL(3, 2)
+#define ZXDH_TX_AXI_ID GENMASK_ULL(6, 4)
+#define ZXDH_TX_WAY_PARTITION GENMASK_ULL(9, 7)
+
+/* RDMA RX DDR Access REG Masks */
+#define ZXDH_RX_CACHE_ID GENMASK_ULL(1, 0)
+#define ZXDH_RX_INDICATE_ID GENMASK_ULL(3, 2)
+#define ZXDH_RX_AXI_ID GENMASK_ULL(6, 4)
+#define ZXDH_RX_WAY_PARTITION GENMASK_ULL(9, 7)
+
+/* RDMA IO REG Masks */
+#define ZXDH_IOTABLE2_SID GENMASK_ULL(5, 0)
+
+#define ZXDH_IOTABLE4_EPID GENMASK_ULL(14, 11)
+#define ZXDH_IOTABLE4_VFID GENMASK_ULL(10, 3)
+#define ZXDH_IOTABLE4_PFID GENMASK_ULL(2, 0)
+
+#define ZXDH_IOTABLE7_PFID GENMASK_ULL(4, 2)
+#define ZXDH_IOTABLE7_EPID GENMASK_ULL(8, 5)
+
+#define RDMARX_MAX_MSG_SIZE 0x80000000
+
+#endif /* ZRDMA_DEFS_H */
diff --git a/drivers/infiniband/hw/zrdma/zrdma_hw.c b/drivers/infiniband/hw/zrdma/zrdma_hw.c
new file mode 100644
index 000000000000..38eeb1f21d0b
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_hw.c
@@ -0,0 +1,135 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#include "zrdma_hw.h"
+#include "zrdma_ctrl.h"
+#include "zrdma_mem.h"
+
+static void zxdh_config_tx_regs(struct zxdh_sc_dev *dev)
+{
+ u32 temp;
+
+ temp = FIELD_PREP(ZXDH_TX_CACHE_ID, 0) |
+ FIELD_PREP(ZXDH_TX_INDICATE_ID, ZXDH_INDICATE_HOST_NOSMMU) |
+ FIELD_PREP(ZXDH_TX_AXI_ID, (ZXDH_AXID_HOST_EP0 + dev->ep_id)) |
+ FIELD_PREP(ZXDH_TX_WAY_PARTITION, 0);
+
+ writel(temp, dev->hw->hw_addr + RDMATX_ACK_SQWQE_PARA_CFG);
+ writel(temp, dev->hw->hw_addr + RDMATX_ACK_DDR_PARA_CFG);
+ writel(temp, dev->hw->hw_addr + RDMATX_DB_SQWQE_ID_CFG);
+ writel(temp, dev->hw->hw_addr + RDMATX_SQWQE_PARA_CFG);
+ writel(temp, dev->hw->hw_addr + RDMATX_PAYLOAD_PARA_CFG);
+
+ if (dev->hmc_use_dpu_ddr) {
+ temp = FIELD_PREP(ZXDH_TX_CACHE_ID, dev->cache_id) |
+ FIELD_PREP(ZXDH_TX_INDICATE_ID, ZXDH_INDICATE_DPU_DDR) |
+ FIELD_PREP(ZXDH_TX_AXI_ID,
+ (ZXDH_AXID_HOST_EP0 + dev->ep_id)) |
+ FIELD_PREP(ZXDH_TX_WAY_PARTITION, 0);
+ } else {
+ temp = FIELD_PREP(ZXDH_TX_CACHE_ID, dev->cache_id) |
+ FIELD_PREP(ZXDH_TX_INDICATE_ID,
+ ZXDH_INDICATE_HOST_SMMU) |
+ FIELD_PREP(ZXDH_TX_AXI_ID,
+ (ZXDH_AXID_HOST_EP0 + dev->ep_id)) |
+ FIELD_PREP(ZXDH_TX_WAY_PARTITION, 0);
+ }
+ writel(temp, dev->hw->hw_addr + C_HMC_MRTE_TX2);
+ writel(temp, dev->hw->hw_addr + C_HMC_PBLEMR_TX2);
+ writel((ZXDH_AXID_HOST_EP0 + dev->ep_id),
+ dev->hw->hw_addr + RDMATX_HOSTID_CFG);
+}
+
+static void zxdh_config_rx_regs(struct zxdh_sc_dev *dev)
+{
+ u32 temp;
+
+ temp = FIELD_PREP(ZXDH_RX_CACHE_ID, 0) |
+ FIELD_PREP(ZXDH_RX_INDICATE_ID, ZXDH_INDICATE_HOST_NOSMMU) |
+ FIELD_PREP(ZXDH_RX_AXI_ID, (ZXDH_AXID_HOST_EP0 + dev->ep_id)) |
+ FIELD_PREP(ZXDH_RX_WAY_PARTITION, 0);
+
+ writel(temp, dev->hw->hw_addr + RDMARX_PLD_WR_AXIID_RAM);
+ writel(temp, dev->hw->hw_addr + RDMARX_RQ_AXI_RAM);
+ writel(temp, dev->hw->hw_addr + RDMARX_SRQ_AXI_RAM);
+ writel(temp, dev->hw->hw_addr + RDMARX_ACK_RQDB_AXI_RAM);
+ writel(temp, dev->hw->hw_addr + RDMARX_CQ_CQE_AXI_INFO_RAM);
+ writel(temp, dev->hw->hw_addr + RDMARX_CQ_DBSA_AXI_INFO_RAM);
+ writel(dev->hmc_fn_id,
+ dev->hw->hw_addr + RDMARX_MUL_CACHE_CFG_SIDN_RAM);
+ writel((ZXDH_AXID_HOST_EP0 + dev->ep_id),
+ dev->hw->hw_addr + RDMARX_MUL_COPY_QPN_INDICATE);
+ writel(RDMARX_MAX_MSG_SIZE,
+ dev->hw->hw_addr + RDMARX_VHCA_MAX_SIZE_RAM);
+}
+
+static void zxdh_config_io_regs(struct zxdh_sc_dev *dev)
+{
+ u32 temp0, temp1, temp2;
+ struct zxdh_pci_f *rf = container_of(dev, struct zxdh_pci_f, sc_dev);
+
+ temp0 = FIELD_PREP(ZXDH_IOTABLE2_SID, dev->hmc_fn_id);
+ writel(temp0, dev->hw->hw_addr + C_RDMAIO_TABLE2);
+
+ temp1 = FIELD_PREP(ZXDH_IOTABLE4_EPID,
+ (ZXDH_HOST_EP0_ID + dev->ep_id)) |
+ FIELD_PREP(ZXDH_IOTABLE4_VFID, dev->vf_id) |
+ FIELD_PREP(ZXDH_IOTABLE4_PFID, rf->pf_id);
+ writel(temp1, dev->hw->hw_addr + C_RDMAIO_TABLE4);
+
+ temp0 = 0x10000;
+ writel(temp0, dev->hw->hw_addr + C_RDMAIO_TABLE3);
+ for (temp0 = 0; temp0 < 32; temp0++) {
+ if (temp0 < ZXDH_RW_PAYLOAD || temp0 == ZXDH_QPC_OBJ_ID) {
+ writel(0, dev->hw->hw_addr +
+ (C_RDMAIO_TABLE5_0 + (temp0 * 4)));
+ } else {
+ temp2 = (rf->ftype == 0) ? 0 : ZXDH_TABLE5_VF_EN;
+ writel(temp2, dev->hw->hw_addr + (C_RDMAIO_TABLE5_0 +
+ (temp0 * 4)));
+ }
+ }
+
+ if (rf->ftype == 0) {
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_0);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_1);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_2);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_3);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_4);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_5);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_6);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_7);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_8);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_9);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_10);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_11);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_12);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_13);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_14);
+ writel(0, dev->hw->hw_addr + C_RDMAIO_TABLE6_15);
+
+ temp2 = FIELD_PREP(ZXDH_IOTABLE7_PFID, rf->pf_id) |
+ FIELD_PREP(ZXDH_IOTABLE7_EPID,
+ (ZXDH_HOST_EP0_ID + rf->ep_id));
+ writel(temp2, dev->hw->hw_addr + C_RDMAIO_TABLE7);
+ }
+}
+
+static void zxdh_config_hw_regs(struct zxdh_sc_dev *dev)
+{
+ zxdh_config_tx_regs(dev);
+ zxdh_config_rx_regs(dev);
+ zxdh_config_io_regs(dev);
+}
+
+int zxdh_ctrl_init_hw(struct zxdh_pci_f *rf)
+{
+ struct zxdh_sc_dev *dev = &rf->sc_dev;
+
+ zxdh_config_hw_regs(dev);
+
+ return 0;
+}
diff --git a/drivers/infiniband/hw/zrdma/zrdma_hw.h b/drivers/infiniband/hw/zrdma/zrdma_hw.h
new file mode 100644
index 000000000000..bfbe20461900
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_hw.h
@@ -0,0 +1,156 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_HW_H
+#define ZRDMA_HW_H
+#include "zrdma_main.h"
+
+/**************** ZTE RDMA Registers ***************/
+#define C_RDMA_RX_VHCA_REG_IDX (0x3400)
+#define C_RDMA_TX_VHCA_REG_IDX (0x5800)
+#define C_RDMA_IO_VHCA_REG_IDX (0x6000)
+#define C_RDMA_IO_SIDN_REG_IDX (0x7800)
+
+/* rdmatx_ack_recv_vhca_pfvf */
+#define RDMATX_ACK_SQWQE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x001u)
+#define RDMATX_ACK_DDR_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x004u)
+#define RDMATX_ACK_PCI_MAX_MRTE_INDEX_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x005u)
+
+/* rdmatx_doorbell_mgr_vhca_pfvf */
+#define RDMATX_DB_PBLE_ID_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x100u)
+#define RDMATX_DB_SQWQE_ID_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x103u)
+#define RDMATX_QPN_BASEQPN_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x109u)
+#define RDMATX_QPN_CONTEXT_ID_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x10Au)
+#define RDMATX_QUEUE_VHCA_FLAG (C_RDMA_TX_VHCA_REG_IDX + 0x112u)
+
+/* rdmatx_wqe_parse_vhca_pfvf */
+#define RDMATX_SQWQE_PBLE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x300u)
+#define RDMATX_SQWQE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x301u)
+#define RDMATX_AH_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x302u)
+#define RDMATX_LOCAL_MRTE_PARENT_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x303u)
+#define RDMATX_LOCAL_MRTE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x304u)
+#define RDMATX_SGETRAN_MRTE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x305u)
+#define RDMATX_SGETRAN_PBLE_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x306u)
+#define RDMATX_PAYLOAD_PARA_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x307u)
+#define RDMATX_HOSTID_CFG (C_RDMA_TX_VHCA_REG_IDX + 0x308u)
+
+/******************************MRTE***************************/
+#define C_HMC_MRTE_RX1 (C_RDMA_RX_VHCA_REG_IDX + 0x0c0u)
+#define C_HMC_MRTE_RX2 (C_RDMA_RX_VHCA_REG_IDX + 0x183u)
+
+#define C_HMC_MRTE_TX1 (C_RDMA_TX_VHCA_REG_IDX + 0x002u)
+#define C_HMC_MRTE_TX2 (C_RDMA_TX_VHCA_REG_IDX + 0x305u)
+#define C_HMC_MRTE_TX3 (C_RDMA_TX_VHCA_REG_IDX + 0x304u)
+
+#define C_HMC_MRTE_RDMAIO_INDICATE (C_RDMA_IO_VHCA_REG_IDX + 0x012u)
+#define C_HMC_MRTE_RDMAIO_BASE_LOW (C_RDMA_IO_VHCA_REG_IDX + 0x004u)
+#define C_HMC_MRTE_RDMAIO_BASE_HIGH (C_RDMA_IO_VHCA_REG_IDX + 0x005u)
+
+/* rdmarx_pkt_proc_pfvf */
+#define RDMARX_MUL_CACHE_CFG_SIDN_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x006u)
+#define RDMARX_MUL_COPY_QPN_INDICATE (C_RDMA_RX_VHCA_REG_IDX + 0x085u)
+
+/* rdma_rdmarx_hd_cache_top_pfvf */
+#define RDMARX_LIST_CACHE_BASE_QPN (C_RDMA_RX_VHCA_REG_IDX + 0x0A3u)
+#define RDMARX_PLD_WR_AXIID_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x0C1u)
+
+/* rdmarx_prifield_check_pfvf */
+#define RDMARX_VHCA_MAX_SIZE_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x100u)
+
+/* rdmarx_ceq_pfvf */
+#define C_CEQ_CEQE_AXI_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x1A0u)
+#define C_CEQ_RPBLE_AXI_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x1A1u)
+#define C_CEQ_LPBLE_AXI_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x1A2u)
+#define C_CEQ_INT_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x1A3u)
+#define RDMARX_ACK_RQDB_AXI_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x180u)
+
+/* rdmarx_completion_queue_pfvf */
+#define RDMARX_CQ_CQN_BASE_OFFSET_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x160u)
+#define RDMARX_CQ_CQE_AXI_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x165u)
+#define RDMARX_CQ_DBSA_AXI_INFO_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x166u)
+
+/* rdmarx_completion_queue_pf */
+#define RDMARX_RQ_AXI_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x140u)
+#define RDMARX_SRQ_AXI_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x142u)
+#define RDMARX_PCI_MAX_MRTE_INDEX_RAM (C_RDMA_RX_VHCA_REG_IDX + 0x143u)
+
+/****** IO Module Register ******/
+#define C_RDMAIO_TABLE2 (C_RDMA_IO_VHCA_REG_IDX + 0x018u)
+#define C_RDMAIO_TABLE4 (C_RDMA_IO_VHCA_REG_IDX + 0x020u)
+#define C_RDMAIO_TABLE3 (C_RDMA_IO_VHCA_REG_IDX + 0x01cu)
+
+#define C_RDMAIO_TABLE5_0 (C_RDMA_IO_VHCA_REG_IDX + 0x024u)
+#define C_RDMAIO_TABLE5_1 (C_RDMA_IO_VHCA_REG_IDX + 0x025u)
+#define C_RDMAIO_TABLE5_2 (C_RDMA_IO_VHCA_REG_IDX + 0x026u)
+#define C_RDMAIO_TABLE5_3 (C_RDMA_IO_VHCA_REG_IDX + 0x027u)
+
+#define C_RDMAIO_TABLE5_4 (C_RDMA_IO_VHCA_REG_IDX + 0x028u)
+#define C_RDMAIO_TABLE5_5 (C_RDMA_IO_VHCA_REG_IDX + 0x029u)
+#define C_RDMAIO_TABLE5_6 (C_RDMA_IO_VHCA_REG_IDX + 0x02au)
+#define C_RDMAIO_TABLE5_7 (C_RDMA_IO_VHCA_REG_IDX + 0x02bu)
+
+#define C_RDMAIO_TABLE5_8 (C_RDMA_IO_VHCA_REG_IDX + 0x02cu)
+#define C_RDMAIO_TABLE5_9 (C_RDMA_IO_VHCA_REG_IDX + 0x02du)
+#define C_RDMAIO_TABLE5_10 (C_RDMA_IO_VHCA_REG_IDX + 0x02eu)
+#define C_RDMAIO_TABLE5_11 (C_RDMA_IO_VHCA_REG_IDX + 0x02fu)
+
+#define C_RDMAIO_TABLE5_12 (C_RDMA_IO_VHCA_REG_IDX + 0x030u)
+#define C_RDMAIO_TABLE5_13 (C_RDMA_IO_VHCA_REG_IDX + 0x031u)
+#define C_RDMAIO_TABLE5_14 (C_RDMA_IO_VHCA_REG_IDX + 0x032u)
+#define C_RDMAIO_TABLE5_15 (C_RDMA_IO_VHCA_REG_IDX + 0x033u)
+
+#define C_RDMAIO_TABLE5_16 (C_RDMA_IO_VHCA_REG_IDX + 0x034u)
+#define C_RDMAIO_TABLE5_17 (C_RDMA_IO_VHCA_REG_IDX + 0x035u)
+#define C_RDMAIO_TABLE5_18 (C_RDMA_IO_VHCA_REG_IDX + 0x036u)
+#define C_RDMAIO_TABLE5_19 (C_RDMA_IO_VHCA_REG_IDX + 0x037u)
+
+#define C_RDMAIO_TABLE5_20 (C_RDMA_IO_VHCA_REG_IDX + 0x038u)
+#define C_RDMAIO_TABLE5_21 (C_RDMA_IO_VHCA_REG_IDX + 0x039u)
+#define C_RDMAIO_TABLE5_22 (C_RDMA_IO_VHCA_REG_IDX + 0x03au)
+#define C_RDMAIO_TABLE5_23 (C_RDMA_IO_VHCA_REG_IDX + 0x03bu)
+
+#define C_RDMAIO_TABLE5_24 (C_RDMA_IO_VHCA_REG_IDX + 0x03cu)
+#define C_RDMAIO_TABLE5_25 (C_RDMA_IO_VHCA_REG_IDX + 0x03du)
+#define C_RDMAIO_TABLE5_26 (C_RDMA_IO_VHCA_REG_IDX + 0x03eu)
+#define C_RDMAIO_TABLE5_27 (C_RDMA_IO_VHCA_REG_IDX + 0x03fu)
+
+#define C_RDMAIO_TABLE5_28 (C_RDMA_IO_VHCA_REG_IDX + 0x040u)
+#define C_RDMAIO_TABLE5_29 (C_RDMA_IO_VHCA_REG_IDX + 0x041u)
+#define C_RDMAIO_TABLE5_30 (C_RDMA_IO_VHCA_REG_IDX + 0x042u)
+#define C_RDMAIO_TABLE5_31 (C_RDMA_IO_VHCA_REG_IDX + 0x043u)
+
+#define C_RDMAIO_TABLE7 (C_RDMA_IO_SIDN_REG_IDX + 0x000u)
+#define C_RDMAIO_TABLE6_0 (C_RDMA_IO_SIDN_REG_IDX + 0x004u)
+#define C_RDMAIO_TABLE6_1 (C_RDMA_IO_SIDN_REG_IDX + 0x005u)
+#define C_RDMAIO_TABLE6_2 (C_RDMA_IO_SIDN_REG_IDX + 0x006u)
+#define C_RDMAIO_TABLE6_3 (C_RDMA_IO_SIDN_REG_IDX + 0x007u)
+#define C_RDMAIO_TABLE6_4 (C_RDMA_IO_SIDN_REG_IDX + 0x008u)
+#define C_RDMAIO_TABLE6_5 (C_RDMA_IO_SIDN_REG_IDX + 0x009u)
+#define C_RDMAIO_TABLE6_6 (C_RDMA_IO_SIDN_REG_IDX + 0x00au)
+#define C_RDMAIO_TABLE6_7 (C_RDMA_IO_SIDN_REG_IDX + 0x00bu)
+#define C_RDMAIO_TABLE6_8 (C_RDMA_IO_SIDN_REG_IDX + 0x00cu)
+#define C_RDMAIO_TABLE6_9 (C_RDMA_IO_SIDN_REG_IDX + 0x00du)
+#define C_RDMAIO_TABLE6_10 (C_RDMA_IO_SIDN_REG_IDX + 0x00eu)
+#define C_RDMAIO_TABLE6_11 (C_RDMA_IO_SIDN_REG_IDX + 0x00fu)
+#define C_RDMAIO_TABLE6_12 (C_RDMA_IO_SIDN_REG_IDX + 0x010u)
+#define C_RDMAIO_TABLE6_13 (C_RDMA_IO_SIDN_REG_IDX + 0x011u)
+#define C_RDMAIO_TABLE6_14 (C_RDMA_IO_SIDN_REG_IDX + 0x012u)
+#define C_RDMAIO_TABLE6_15 (C_RDMA_IO_SIDN_REG_IDX + 0x013u)
+
+/**************************PF HMC REGISTER *************************/
+/******************************PBLEMR***************************/
+
+#define C_HMC_PBLEMR_RX1 (C_RDMA_RX_VHCA_REG_IDX + 0x0c2u)
+#define C_HMC_PBLEMR_RX2 (C_RDMA_RX_VHCA_REG_IDX + 0x184u)
+
+#define C_HMC_PBLEMR_TX1 (C_RDMA_TX_VHCA_REG_IDX + 0x003u)
+#define C_HMC_PBLEMR_TX2 (C_RDMA_TX_VHCA_REG_IDX + 0x306u)
+
+#define C_HMC_PBLEMR_RDMAIO_INDICATE (C_RDMA_IO_VHCA_REG_IDX + 0x010u)
+#define C_HMC_PBLEMR_RDMAIO_BASE_LOW (C_RDMA_IO_VHCA_REG_IDX + 0x000u)
+#define C_HMC_PBLEMR_RDMAIO_BASE_HIGH (C_RDMA_IO_VHCA_REG_IDX + 0x001u)
+
+#endif /* ZRDMA_HW_H */
diff --git a/drivers/infiniband/hw/zrdma/zrdma_main.c b/drivers/infiniband/hw/zrdma/zrdma_main.c
index e06ad176e6d8..116ae37ea281 100644
--- a/drivers/infiniband/hw/zrdma/zrdma_main.c
+++ b/drivers/infiniband/hw/zrdma/zrdma_main.c
@@ -69,6 +69,7 @@ static int zxdh_probe(struct auxiliary_device *aux_dev,
struct zxdh_handler *hdl;
struct zxdh_device *zdev;
struct zxdh_pci_f *rf;
+ int err;
zdev = ib_alloc_device(zxdh_device, ibdev);
if (!zdev)
@@ -94,11 +95,23 @@ static int zxdh_probe(struct auxiliary_device *aux_dev,
zdev->hdl = hdl;
zdev->netdev_speed = SPEED_UNKNOWN;
+ rf = zdev->rf;
+ err = zxdh_ctrl_init_hw(rf);
+ if (err)
+ goto err_ctrl_init;
+
zxdh_add_handler(hdl);
dev_set_drvdata(&aux_dev->dev, zdev);
return 0;
+
+err_ctrl_init:
+ kfree(hdl);
+ kfree(zdev->rf);
+ ib_dealloc_device(&zdev->ibdev);
+
+ return err;
}
static void zxdh_remove(struct auxiliary_device *aux_dev)
@@ -111,6 +124,7 @@ static void zxdh_remove(struct auxiliary_device *aux_dev)
}
zxdh_del_handler(zdev->hdl);
+ kfree(zdev->hdl);
kfree(zdev->rf);
ib_dealloc_device(&zdev->ibdev);
}
diff --git a/drivers/infiniband/hw/zrdma/zrdma_main.h b/drivers/infiniband/hw/zrdma/zrdma_main.h
index 2eb2fc802f5c..5ba18669fa14 100644
--- a/drivers/infiniband/hw/zrdma/zrdma_main.h
+++ b/drivers/infiniband/hw/zrdma/zrdma_main.h
@@ -10,6 +10,7 @@
#include <linux/auxiliary_bus.h>
#include "zrdma_type.h"
#include "zrdma_ctrl.h"
+#include "zrdma_uk.h"
#define ZXDH_PF_NAME "dinghai10e"
#define ZXDH_RDMA_DEV_NAME "rdma_aux"
diff --git a/drivers/infiniband/hw/zrdma/zrdma_mem.h b/drivers/infiniband/hw/zrdma/zrdma_mem.h
index b49df6d51d19..c81cd5ee2d31 100644
--- a/drivers/infiniband/hw/zrdma/zrdma_mem.h
+++ b/drivers/infiniband/hw/zrdma/zrdma_mem.h
@@ -7,6 +7,8 @@
#ifndef ZRDMA_MEM_H
#define ZRDMA_MEM_H
+#include "zrdma_defs.h"
+
#define ZXDH_TABLE5_VF_EN 0x04
#define ZXDH_HMC_MAX_SD_COUNT 8192
diff --git a/drivers/infiniband/hw/zrdma/zrdma_uk.h b/drivers/infiniband/hw/zrdma/zrdma_uk.h
new file mode 100644
index 000000000000..d21755a521e3
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_uk.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_UK_H
+#define ZRDMA_UK_H
+
+enum zxdh_host_epid {
+ ZXDH_HOST_EP0_ID = 5,
+ ZXDH_HOST_EP1_ID = 6,
+ ZXDH_HOST_EP2_ID = 7,
+ ZXDH_HOST_EP3_ID = 8,
+ ZXDH_HOST_EP4_ID = 9,
+};
+
+#endif /* ZRDMA_UK_H */
--
2.27.0
* [PATCH rdma v1 1/2] RDMA/zrdma: Add basic framework for ZTE Dinghai Ethernet Protocol Driver for RDMA
From: zhang.yanze @ 2026-06-24 8:48 UTC (permalink / raw)
To: jgg, leon
Cc: linux-kernel, linux-rdma, zhang.yanze, wei.quan, han.junyang,
ran.ming, han.chengfei
In-Reply-To: <202606241644572798vwNbZEAtt3c2IHcJUtCs@zte.com.cn>
From: Yanze Zhang <zhang.yanze@zte.com.cn>
Add basic framework for ZTE DingHai Ethernet Protocol Driver for RDMA,
including Kconfig/Makefile build support and auxiliary device probe/remove
skeleton.
Signed-off-by: Yanze Zhang <zhang.yanze@zte.com.cn>
---
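A minimal user-space sketch of how zxdh_fill_device_info() in this patch
splits vport_id into its fields. The three macros are copied from
zrdma_main.h below; struct vport_fields, decode_vport_id() and
encode_vport_id() are hypothetical helpers added only for illustration.
How the function-type bit maps onto PF/VF is an assumption and is not
confirmed by the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from zrdma_main.h in this patch. */
#define ZXDH_FUNC_TYPE(vport_id) (((vport_id) >> 11) & 0x1)
#define ZXDH_PF_ID(vport_id) (((vport_id) >> 8) & 0x7)
#define ZXDH_EP_ID(vport_id) (((vport_id) >> 12) & 0x7)

/* Hypothetical container for the decoded fields (not part of the driver). */
struct vport_fields {
	uint8_t ftype;	/* bit 11 */
	uint8_t pf_id;	/* bits 10:8 */
	uint8_t ep_id;	/* bits 14:12 */
};

/* Decode the three fields the probe path reads from vport_id. */
static struct vport_fields decode_vport_id(uint16_t vport_id)
{
	struct vport_fields f = {
		.ftype = ZXDH_FUNC_TYPE(vport_id),
		.pf_id = ZXDH_PF_ID(vport_id),
		.ep_id = ZXDH_EP_ID(vport_id),
	};

	return f;
}

/* Hypothetical inverse, handy for building test values. */
static uint16_t encode_vport_id(uint8_t ep_id, uint8_t ftype, uint8_t pf_id)
{
	assert(ep_id <= 7 && ftype <= 1 && pf_id <= 7);
	return (uint16_t)((ep_id << 12) | (ftype << 11) | (pf_id << 8));
}
```

With this layout, a vport_id of 0x3D00 decodes to ep_id 3, ftype 1 and
pf_id 5. Bits 7:0 and bit 15 are not touched by any of the three macros.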
MAINTAINERS | 6 +
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/hw/Makefile | 1 +
drivers/infiniband/hw/zrdma/Kconfig | 10 +
drivers/infiniband/hw/zrdma/Makefile | 4 +
drivers/infiniband/hw/zrdma/zrdma_ctrl.h | 248 +++++++++++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_main.c | 142 +++++++++++++
drivers/infiniband/hw/zrdma/zrdma_main.h | 139 +++++++++++++
drivers/infiniband/hw/zrdma/zrdma_mem.h | 103 ++++++++++
drivers/infiniband/hw/zrdma/zrdma_type.h | 110 ++++++++++
10 files changed, 764 insertions(+)
create mode 100644 drivers/infiniband/hw/zrdma/Kconfig
create mode 100644 drivers/infiniband/hw/zrdma/Makefile
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_ctrl.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_main.c
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_main.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_mem.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_type.h
diff --git a/MAINTAINERS b/MAINTAINERS
index aed5a1103de7..a784ce5b99fa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -29483,6 +29483,12 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound.git
F: sound/hda/codecs/senarytech.c
+ZTE DINGHAI ZRDMA DRIVER
+M: Yanze Zhang <zhang.yanze@zte.com.cn>
+L: linux-rdma@vger.kernel.org
+S: Supported
+F: drivers/infiniband/hw/zrdma/
+
THE REST
M: Linus Torvalds <torvalds@linux-foundation.org>
L: linux-kernel@vger.kernel.org
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 086195758a8a..419dbf742f38 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -100,6 +100,7 @@ source "drivers/infiniband/hw/ocrdma/Kconfig"
source "drivers/infiniband/hw/qedr/Kconfig"
source "drivers/infiniband/hw/usnic/Kconfig"
source "drivers/infiniband/hw/vmw_pvrdma/Kconfig"
+source "drivers/infiniband/hw/zrdma/Kconfig"
source "drivers/infiniband/sw/rdmavt/Kconfig"
endif # !UML
source "drivers/infiniband/sw/rxe/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index c42b22ac3303..a86b867c56d4 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_INFINIBAND_BNXT_RE) += bnxt_re/
obj-$(CONFIG_INFINIBAND_BNG_RE) += bng_re/
obj-$(CONFIG_INFINIBAND_ERDMA) += erdma/
obj-$(CONFIG_INFINIBAND_IONIC) += ionic/
+obj-$(CONFIG_INFINIBAND_ZRDMA) += zrdma/
diff --git a/drivers/infiniband/hw/zrdma/Kconfig b/drivers/infiniband/hw/zrdma/Kconfig
new file mode 100644
index 000000000000..316be707331f
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config INFINIBAND_ZRDMA
+ tristate "ZTE Ethernet Protocol Driver for RDMA"
+ depends on INFINIBAND
+ help
+ Say Y or M here to enable support for the ZTE DingHai (ZXDH) Ethernet
+ Protocol Driver for RDMA. This driver provides RDMA over Converged
+ Ethernet (RoCE) functionality for ZTE DingHai network adapters.
+ If you choose to build this driver as a module, it will be built as
+ a module named zrdma.
diff --git a/drivers/infiniband/hw/zrdma/Makefile b/drivers/infiniband/hw/zrdma/Makefile
new file mode 100644
index 000000000000..b5297543e393
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
+
+obj-$(CONFIG_INFINIBAND_ZRDMA) += zrdma.o
+zrdma-y := zrdma_main.o
diff --git a/drivers/infiniband/hw/zrdma/zrdma_ctrl.h b/drivers/infiniband/hw/zrdma/zrdma_ctrl.h
new file mode 100644
index 000000000000..1078884de36a
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_ctrl.h
@@ -0,0 +1,248 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_CTRL_H
+#define ZRDMA_CTRL_H
+
+#include "zrdma_mem.h"
+
+#define ZXDH_HW_SCHEDULE_OFF 0
+#define ZXDH_HW_SCHEDULE_ON 1
+
+enum zxdh_cqp_op_type {
+ ZXDH_OP_CEQ_DESTROY = 1,
+ ZXDH_OP_AEQ_DESTROY = 2,
+ ZXDH_OP_CEQ_CREATE = 5,
+ ZXDH_OP_QP_MODIFY = 8,
+ ZXDH_OP_CQ_CREATE = 10,
+ ZXDH_OP_CQ_DESTROY = 11,
+ ZXDH_OP_QP_CREATE = 12,
+ ZXDH_OP_QP_DESTROY = 13,
+ ZXDH_OP_ALLOC_STAG = 14,
+ ZXDH_OP_MR_REG_NON_SHARED = 15,
+ ZXDH_OP_DEALLOC_STAG = 16,
+ ZXDH_OP_MW_ALLOC = 17,
+ ZXDH_OP_QP_FLUSH_WQES = 18,
+ ZXDH_OP_REQ_CMDS = 28,
+ ZXDH_OP_CMPL_CMDS = 29,
+ ZXDH_OP_AH_CREATE = 30,
+ ZXDH_OP_AH_DESTROY = 32,
+ ZXDH_OP_CONFIG_PTE_TAB = 51,
+ ZXDH_OP_CONFIG_PBLE_TAB = 53,
+ ZXDH_OP_CONFIG_MAILBOX = 54,
+ ZXDH_OP_DMA_WRITE = 55,
+ ZXDH_OP_DMA_WRITE32 = 56,
+ ZXDH_OP_DMA_READ = 58,
+ ZXDH_OP_DMA_READ_USE_CQE = 59,
+ ZXDH_OP_QUERY_QPC = 60,
+ ZXDH_OP_QUERY_CQC = 61,
+ ZXDH_OP_QUERY_SRQC = 62,
+ ZXDH_OP_QUERY_CEQC = 63,
+ ZXDH_OP_QUERY_AEQC = 64,
+ ZXDH_OP_SRQ_CREATE = 65,
+ ZXDH_OP_SRQ_DESTROY = 66,
+ ZXDH_OP_SRQ_MODIFY = 67,
+ ZXDH_OP_QUERY_MKEY = 68,
+ ZXDH_OP_CQ_MODIFY_MODERATION = 69,
+ ZXDH_OP_QUERY_HW_OBJECT_INFO = 71,
+ ZXDH_OP_CQ_MODIFY_OVERFLOW_CHECK_EN = 72,
+ /* Must be last entry */
+ ZXDH_MAX_CQP_OPS = 73,
+};
+
+struct zxdh_hw {
+ u32 __iomem *hw_addr;
+ u32 __iomem *pci_hw_addr;
+ struct device *device;
+ struct zxdh_hmc_info hmc;
+};
+
+struct zxdh_uk_attrs {
+ u64 feature_flags;
+ u32 max_hw_wq_frags;
+ u32 max_hw_read_sges;
+ u32 max_hw_inline;
+ u32 max_hw_srq_quanta;
+ u32 max_hw_rq_quanta;
+ u32 max_hw_wq_quanta;
+ u32 min_hw_cq_size;
+ u32 max_hw_cq_size;
+ u16 max_hw_sq_chunk;
+ u32 max_hw_srq_wr;
+ u8 hw_rev;
+};
+
+struct zxdh_hw_attrs {
+ struct zxdh_uk_attrs uk_attrs;
+ u64 max_hw_outbound_msg_size;
+ u64 max_hw_inbound_msg_size;
+ u64 max_mr_size;
+ u32 min_hw_qp_id;
+ u32 min_hw_aeq_size;
+ u32 max_hw_aeq_size;
+ u32 min_hw_ceq_size;
+ u32 max_hw_ceq_size;
+ u32 max_hw_device_pages;
+ u32 max_hw_vf_fpm_id;
+ u32 first_hw_vf_fpm_id;
+ u32 max_hw_ird;
+ u32 max_hw_ord;
+ u32 max_hw_wqes;
+ u32 max_hw_pds;
+ u32 max_qp_wr;
+ u32 max_srq_wr;
+ u32 max_pe_ready_count;
+ u32 max_done_count;
+ u32 max_sleep_count;
+ u32 max_cqp_compl_wait_time_ms;
+ u16 max_stat_inst;
+ u16 max_stat_idx;
+ u32 cqp_timeout_threshold;
+ u8 skip_hw;
+};
+
+struct zxdh_sc_dev {
+ struct list_head cqp_cmd_head;
+ spinlock_t cqp_lock; /* Protects CQP command queue submission */
+ struct zxdh_dma_mem clear_dpu_mem;
+ struct zxdh_dma_mem nof_clear_dpu_mem;
+ u64 pte_l2d_startpa;
+ u32 pte_l2d_len;
+ struct zxdh_hw *hw;
+ u8 __iomem *db_addr;
+ u32 __iomem *wqe_alloc_db;
+ u32 __iomem *cq_arm_db;
+ u32 __iomem *cqp_db;
+ u32 __iomem *cq_ack_db;
+ u64 cqp_cmd_stats[ZXDH_MAX_CQP_OPS];
+ struct zxdh_hw_attrs hw_attrs;
+ struct zxdh_hmc_info *hmc_info;
+ struct zxdh_sc_cqp *cqp;
+ struct zxdh_sc_cq *ccq;
+ u32 max_ceqs;
+ u32 base_qpn;
+ u32 base_cqn;
+ u32 base_srqn;
+ u32 base_ceqn;
+ u32 max_qp;
+ u32 max_cq;
+ u32 max_srq;
+ u16 num_vfs;
+ u16 active_vfs_num;
+ u32 hmc_fn_id;
+ u16 vf_id;
+ u16 vhca_id;
+ u16 vhca_id_pf;
+ u16 cache_id;
+ u8 ep_id;
+ u8 hmc_epid;
+ u16 ird_size;
+ u32 total_vhca;
+ u16 vhca_gqp_start;
+ u16 vhca_gqp_cnt;
+ u16 vhca_8k_index_start;
+ u16 vhca_8k_index_cnt;
+ u16 vhca_ud_gqp;
+ u16 vhca_ud_8k_index;
+ u64 nof_ioq_ddr_addr;
+ u8 chip_version;
+ u64 l2d_pt_addr;
+ u32 l2d_pt_l2_offset;
+ u32 l2_pt_num;
+ u32 l3_pt_num;
+ u8 ceq_valid : 1;
+ u8 privileged : 1;
+ u8 hmc_use_dpu_ddr : 1;
+ u8 np_mode_low_lat : 1;
+ struct mutex vchnl_mutex; /* protects virtual channel operations */
+ u8 driver_load;
+};
+
+struct cqp_info {
+ union {
+ struct {
+ struct zxdh_sc_cq *cq;
+ u64 scratch;
+ } cq_create;
+
+ struct {
+ struct zxdh_sc_cq *cq;
+ u64 scratch;
+ } cq_destroy;
+ } u;
+};
+
+struct cqp_cmds_info {
+ struct list_head cqp_cmd_entry;
+ u8 cqp_cmd;
+ u8 post_sq;
+ struct cqp_info in;
+};
+
+struct zxdh_cqp_err_info {
+ u16 maj;
+ u16 min;
+ const char *desc;
+};
+
+struct zxdh_cqp_compl_info {
+ u64 op_ret_val;
+ u16 maj_err_code;
+ u16 min_err_code;
+ bool error;
+ u8 op_code;
+ __le64 addrbuf[5];
+};
+
+struct zxdh_cqp_request {
+ struct cqp_cmds_info info;
+ wait_queue_head_t waitq;
+ struct list_head list;
+ refcount_t refcnt;
+ void (*callback_fcn)(struct zxdh_cqp_request *cqp_request);
+ void *param;
+ struct zxdh_cqp_compl_info compl_info;
+ u8 waiting : 1;
+ u8 request_done : 1;
+ u8 dynamic : 1;
+};
+
+struct zxdh_sc_cqp {
+ u64 sq_pa;
+ u64 host_ctx_pa;
+ void *back_cqp;
+ struct zxdh_sc_dev *dev;
+ struct zxdh_ring sq_ring;
+ struct zxdh_cqp_quanta *sq_base;
+ __le64 *host_ctx;
+ u64 *scratch_array;
+ u32 sq_size;
+ u32 hw_sq_size;
+ u8 polarity;
+ u8 state_cfg : 1;
+};
+
+struct zxdh_cqp {
+ struct zxdh_sc_cqp sc_cqp;
+ spinlock_t req_lock; /* protect CQP request list */
+ spinlock_t compl_lock; /* protect CQP completion processing */
+ wait_queue_head_t waitq;
+ wait_queue_head_t remove_wq;
+ struct zxdh_dma_mem sq;
+ struct zxdh_dma_mem host_ctx;
+ u64 *scratch_array;
+ struct zxdh_cqp_request *cqp_requests;
+ struct list_head cqp_avail_reqs;
+ struct list_head cqp_pending_reqs;
+};
+
+struct zxdh_ccq {
+ struct zxdh_sc_cq sc_cq;
+ struct zxdh_dma_mem mem_cq;
+ struct zxdh_dma_mem shadow_area;
+};
+
+#endif
diff --git a/drivers/infiniband/hw/zrdma/zrdma_main.c b/drivers/infiniband/hw/zrdma/zrdma_main.c
new file mode 100644
index 000000000000..e06ad176e6d8
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_main.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#include <linux/module.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/types.h>
+#include <linux/netdevice.h>
+#include <linux/workqueue.h>
+#include <rdma/ib_verbs.h>
+
+#include "zrdma_main.h"
+
+MODULE_ALIAS("zrdma");
+MODULE_AUTHOR("Yanze Zhang <zhang.yanze@zte.com.cn>");
+MODULE_DESCRIPTION("ZTE Ethernet Protocol Driver for RDMA");
+MODULE_LICENSE("Dual BSD/GPL");
+
+LIST_HEAD(zxdh_handlers);
+DEFINE_SPINLOCK(zxdh_handler_lock);
+
+static void zxdh_add_handler(struct zxdh_handler *hdl)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&zxdh_handler_lock, flags);
+ list_add(&hdl->list, &zxdh_handlers);
+ spin_unlock_irqrestore(&zxdh_handler_lock, flags);
+}
+
+static void zxdh_del_handler(struct zxdh_handler *hdl)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&zxdh_handler_lock, flags);
+ list_del(&hdl->list);
+ spin_unlock_irqrestore(&zxdh_handler_lock, flags);
+}
+
+static void zxdh_fill_device_info(struct zxdh_device *zdev,
+ struct zxdh_core_dev_info *zdev_info)
+{
+ struct zxdh_pci_f *rf = zdev->rf;
+
+ rf->ftype = ZXDH_FUNC_TYPE(zdev_info->vport_id);
+ rf->pf_id = ZXDH_PF_ID(zdev_info->vport_id);
+ rf->sc_dev.ep_id = ZXDH_EP_ID(zdev_info->vport_id);
+ rf->ep_id = rf->sc_dev.ep_id;
+ rf->sc_dev.driver_load = true;
+ rf->zdev_info = zdev_info;
+ rf->pcidev = zdev_info->pdev;
+ rf->hw.pci_hw_addr = zdev_info->hw_addr;
+ rf->zdev = zdev;
+
+ INIT_LIST_HEAD(&zdev->ah_list);
+ mutex_init(&zdev->ah_list_lock);
+ zdev->netdev = zdev_info->netdev;
+ zdev->source_netdev = zdev_info->netdev;
+ zdev->init_state = INITIAL_STATE;
+}
+
+static int zxdh_probe(struct auxiliary_device *aux_dev,
+ const struct auxiliary_device_id *id)
+{
+ struct zxdh_auxiliary_dev *zxdh_adev =
+ container_of(aux_dev, struct zxdh_auxiliary_dev, adev);
+ struct zxdh_handler *hdl;
+ struct zxdh_device *zdev;
+ struct zxdh_pci_f *rf;
+
+ zdev = ib_alloc_device(zxdh_device, ibdev);
+ if (!zdev)
+ return -ENOMEM;
+
+ zdev->zxdh_adev = zxdh_adev;
+ zdev->rf = kzalloc_obj(*zdev->rf);
+ if (!zdev->rf) {
+ ib_dealloc_device(&zdev->ibdev);
+ return -ENOMEM;
+ }
+
+ zxdh_fill_device_info(zdev, zxdh_adev->zxdh_info);
+
+ hdl = kzalloc_obj(*hdl);
+ if (!hdl) {
+ kfree(zdev->rf);
+ ib_dealloc_device(&zdev->ibdev);
+ return -ENOMEM;
+ }
+
+ hdl->zdev = zdev;
+ zdev->hdl = hdl;
+ zdev->netdev_speed = SPEED_UNKNOWN;
+
+ zxdh_add_handler(hdl);
+
+ dev_set_drvdata(&aux_dev->dev, zdev);
+
+ return 0;
+}
+
+static void zxdh_remove(struct auxiliary_device *aux_dev)
+{
+ struct zxdh_device *zdev = dev_get_drvdata(&aux_dev->dev);
+
+ if (!zdev) {
+ dev_err(&aux_dev->dev, "%s: zdev is NULL\n", __func__);
+ return;
+ }
+
+ zxdh_del_handler(zdev->hdl);
+ kfree(zdev->rf);
+ ib_dealloc_device(&zdev->ibdev);
+}
+
+static void zxdh_shutdown(struct auxiliary_device *aux_dev)
+{
+ zxdh_remove(aux_dev);
+}
+
+static const struct auxiliary_device_id zxdh_auxiliary_id_table[] = {
+ {
+ .name = ZXDH_PF_NAME "." ZXDH_RDMA_DEV_NAME,
+ },
+ {},
+};
+
+MODULE_DEVICE_TABLE(auxiliary, zxdh_auxiliary_id_table);
+
+static struct auxiliary_driver zxdh_auxiliary_drv = {
+ .driver = {
+ .name = ZXDH_RDMA_DEV_NAME,
+ },
+ .id_table = zxdh_auxiliary_id_table,
+ .probe = zxdh_probe,
+ .remove = zxdh_remove,
+ .shutdown = zxdh_shutdown,
+};
+
+module_auxiliary_driver(zxdh_auxiliary_drv);
diff --git a/drivers/infiniband/hw/zrdma/zrdma_main.h b/drivers/infiniband/hw/zrdma/zrdma_main.h
new file mode 100644
index 000000000000..2eb2fc802f5c
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_main.h
@@ -0,0 +1,139 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_MAIN_H
+#define ZRDMA_MAIN_H
+
+#include <linux/auxiliary_bus.h>
+#include "zrdma_type.h"
+#include "zrdma_ctrl.h"
+
+#define ZXDH_PF_NAME "dinghai10e"
+#define ZXDH_RDMA_DEV_NAME "rdma_aux"
+
+#define ZXDH_FUNC_TYPE(vport_id) (((vport_id) >> 11) & 0x1)
+#define ZXDH_PF_ID(vport_id) (((vport_id) >> 8) & 0x7)
+#define ZXDH_EP_ID(vport_id) (((vport_id) >> 12) & 0x7)
+
+enum init_completion_state {
+ INVALID_STATE = 0,
+ INITIAL_STATE,
+ CQP_CREATED,
+ SMMU_PAGETABLE_INITIALIZED,
+ DATA_CAP_CREATED,
+ HMC_OBJS_CREATED,
+ HW_RSRC_INITIALIZED,
+ CQP_QP_CREATED,
+ AEQ_CREATED,
+ CCQ_CREATED,
+ CEQ0_CREATED,
+ CEQS_CREATED,
+ PBLE_CHUNK_MEM,
+};
+
+struct zxdh_fw_compat {
+ u8 module_id;
+ u8 major;
+ u8 fw_minor;
+ u8 drv_minor;
+ u16 patch;
+ u16 rsv;
+} __packed;
+
+struct zxdh_handler {
+ struct list_head list;
+ struct zxdh_device *zdev;
+};
+
+struct zxdh_pci_f {
+ u8 reset : 1;
+ u8 rsrc_created : 1;
+ u8 ftype : 1;
+ u8 max_rdma_vfs;
+ u8 *hmc_info_mem;
+ u8 *mem_rsrc;
+ u8 pf_id;
+ u8 vf_id;
+ u8 ep_id;
+ u32 msix_count;
+ u32 max_mr;
+ u32 max_qp;
+ u32 max_cq;
+ u32 max_ah;
+ u32 next_ah;
+ u32 max_pd;
+ u32 next_qp;
+ u32 next_cq;
+ u32 next_pd;
+ u32 next_mr;
+ u32 max_mr_size;
+ u32 max_cqe;
+ u32 used_pds;
+ u32 used_cqs;
+ u32 used_mrs;
+ u32 used_qps;
+ u32 max_srq;
+ u32 next_srq;
+ u32 used_srqs;
+ u32 max_pri;
+ u16 max_8k_idx;
+ u32 *qp_cnt_8k_idxs;
+ u32 ceqs_count;
+ u64 base_bar_offset;
+
+ unsigned long *allocated_qps;
+ unsigned long *allocated_cqs;
+ unsigned long *allocated_mrs;
+ unsigned long *allocated_pds;
+ unsigned long *allocated_mcgs;
+ unsigned long *allocated_ahs;
+ unsigned long *allocated_srqs;
+ unsigned long *allocated_pris;
+ unsigned long *allocated_8k_idx;
+
+ enum init_completion_state init_state;
+ struct zxdh_sc_dev sc_dev;
+ struct zxdh_handler *hdl;
+ struct pci_dev *pcidev;
+ struct zxdh_core_dev_info *zdev_info;
+ struct zxdh_hw hw;
+ struct zxdh_cqp cqp;
+ struct zxdh_ccq ccq;
+ struct zxdh_dma_mem cqp_host_ctx;
+ spinlock_t rsrc_lock; /* protects hardware resource array access */
+ struct workqueue_struct *cqp_cmpl_wq;
+ struct work_struct cqp_cmpl_work;
+ struct zxdh_device *zdev;
+ u8 mcode_type;
+ u16 pcie_id;
+ void *dh_dev;
+};
+
+struct zxdh_device {
+ struct ib_device ibdev;
+ const struct uverbs_object_tree_def *driver_trees[6];
+ struct zxdh_pci_f *rf;
+ struct net_device *netdev;
+ struct net_device *source_netdev;
+ struct zxdh_handler *hdl;
+ struct workqueue_struct *cleanup_wq;
+ struct list_head ah_list;
+ struct mutex ah_list_lock; /* protects ah_list operations */
+ u32 ah_list_cnt;
+ u32 ah_list_hwm;
+ u32 vendor_id;
+ u32 vendor_part_id;
+ u32 device_cap_flags;
+ enum init_completion_state init_state;
+ wait_queue_head_t suspend_wq;
+ u32 netdev_speed;
+ struct zxdh_fw_compat fw_ver;
+ struct zxdh_auxiliary_dev *zxdh_adev;
+};
+
+int zxdh_ctrl_init_hw(struct zxdh_pci_f *rf);
+
+#endif /* ZRDMA_MAIN_H */
diff --git a/drivers/infiniband/hw/zrdma/zrdma_mem.h b/drivers/infiniband/hw/zrdma/zrdma_mem.h
new file mode 100644
index 000000000000..b49df6d51d19
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_mem.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_MEM_H
+#define ZRDMA_MEM_H
+
+#define ZXDH_TABLE5_VF_EN 0x04
+
+#define ZXDH_HMC_MAX_SD_COUNT 8192
+
+enum zxdh_indicate_id {
+ ZXDH_INDICATE_L2D = 0,
+ ZXDH_INDICATE_DPU_DDR = ZXDH_INDICATE_L2D,
+ ZXDH_INDICATE_REGISTER = ZXDH_INDICATE_L2D,
+ ZXDH_INDICATE_RESERVED = 1,
+ ZXDH_INDICATE_HOST_NOSMMU = 2,
+ ZXDH_INDICATE_HOST_SMMU = 3,
+};
+
+enum zxdh_axid_type {
+ ZXDH_AXID_L2D,
+ ZXDH_AXID_DPUDDR,
+ ZXDH_AXID_HOST_EP0,
+ ZXDH_AXID_HOST_EP1,
+ ZXDH_AXID_HOST_EP2,
+ ZXDH_AXID_HOST_EP3,
+ ZXDH_AXID_HOST_EP4,
+};
+
+enum zxdh_object_id {
+ ZXDH_PBLE_MR_OBJ_ID = 0,
+ ZXDH_PBLE_QUEUE_OBJ_ID = 1,
+ ZXDH_MR_OBJ_ID = 2,
+ ZXDH_AH_OBJ_ID = 3,
+ ZXDH_IRD_OBJ_ID = 4,
+ ZXDH_TX_WINDOW_OBJ_ID = 5,
+ ZXDH_SRQC_OBJ_ID = 6,
+ ZXDH_CQC_OBJ_ID = 7,
+ ZXDH_MG_PAYLOAD_OBJ_ID = 8,
+ ZXDH_MG_OBJ_ID = 9,
+ ZXDH_RW_PAYLOAD = 10,
+ ZXDH_SQ = 11,
+ ZXDH_SQ_SHADOW_AREA = 12,
+ ZXDH_RQ = 13,
+ ZXDH_RQ_SHADOW_AREA = 14,
+ ZXDH_SRQP = 15,
+ ZXDH_SRQ = 16,
+ ZXDH_SRQ_SHADOW_AREA = 17,
+ ZXDH_CQ = 18,
+ ZXDH_CQ_SHADOW_AREA = 19,
+ ZXDH_CEQ = 20,
+ ZXDH_AEQ = 21,
+ ZXDH_MG_QPN = 22,
+ ZXDH_CPU_DDR = 24,
+ ZXDH_QPC_OBJ_ID = 29,
+ ZXDH_DMA_OBJ_ID = 30,
+ ZXDH_L2D_OBJ_ID = 31,
+ ZXDH_REG_OBJ_ID = ZXDH_L2D_OBJ_ID,
+};
+
+struct zxdh_dma_mem {
+ void *va;
+ dma_addr_t pa;
+ u32 size;
+} __packed;
+
+struct zxdh_virt_mem {
+ void *va;
+ u32 size;
+} __packed;
+
+struct zxdh_hmc_sd_entry {
+ bool valid;
+ struct zxdh_dma_mem addr;
+ struct zxdh_dma_mem addr_hardware;
+};
+
+struct zxdh_hmc_sd_table {
+ struct zxdh_virt_mem addr;
+ u32 sd_cnt;
+ struct zxdh_hmc_sd_entry *sd_entry;
+};
+
+struct zxdh_hmc_info {
+ u32 signature;
+ u8 hmc_fn_id;
+ u32 pble_hmc_index;
+ u32 pble_mr_hmc_index;
+ u32 hmc_entry_total;
+ u32 hmc_first_entry_pble;
+ u32 hmc_first_entry_pble_mr;
+ struct zxdh_hmc_obj_info *hmc_obj;
+ struct zxdh_virt_mem hmc_obj_virt_mem;
+ struct zxdh_hmc_sd_table sd_table;
+ u16 sd_indexes[ZXDH_HMC_MAX_SD_COUNT];
+ u8 pble_mr_cachid;
+ u8 pble_ird_cachid;
+};
+
+#endif
diff --git a/drivers/infiniband/hw/zrdma/zrdma_type.h b/drivers/infiniband/hw/zrdma/zrdma_type.h
new file mode 100644
index 000000000000..b5def5803d1f
--- /dev/null
+++ b/drivers/infiniband/hw/zrdma/zrdma_type.h
@@ -0,0 +1,110 @@
+/* SPDX-License-Identifier: GPL-2.0-only
+ *
+ * ZTE DingHai Rdma driver
+ * Copyright (c) 2022-2026, ZTE Corporation
+ */
+
+#ifndef ZRDMA_TYPE_H
+#define ZRDMA_TYPE_H
+
+#include "zrdma_mem.h"
+
+#define ZXDH_CQP_WQE_SIZE 8
+#define ZXDH_CQE_SIZE 8
+
+struct zxdh_ring {
+ u32 head;
+ u32 tail;
+ u32 size;
+};
+
+struct zxdh_cqp_quanta {
+ __le64 elem[ZXDH_CQP_WQE_SIZE];
+};
+
+struct zxdh_cqe {
+ __le64 buf[ZXDH_CQE_SIZE];
+};
+
+struct zxdh_cq_uk {
+ struct zxdh_cqe *cq_base;
+ u32 __iomem *cqe_alloc_db;
+ u32 __iomem *cq_ack_db;
+ __le64 *shadow_area;
+ u32 cq_id;
+ u32 cq_size;
+ u32 cq_log_size;
+ u32 cqe_rd_cnt;
+ bool valid_cq;
+ struct zxdh_ring cq_ring;
+ u8 polarity;
+ u8 armed : 1;
+ u8 cqe_size;
+};
+
+struct zxdh_sc_cq {
+ struct zxdh_cq_uk cq_uk;
+ u64 cq_pa;
+ u64 shadow_area_pa;
+ struct zxdh_sc_dev *dev;
+ void *pbl_list;
+ void *back_cq;
+ u32 ceq_id;
+ u32 ceq_index;
+ u32 shadow_read_threshold;
+ u8 pbl_chunk_size;
+ u8 cq_type;
+ u8 tph_val;
+ u32 first_pm_pbl_idx;
+ u8 ceqe_mask : 1;
+ u8 virtual_map : 1;
+ u8 ceq_id_valid : 1;
+ u8 tph_en;
+ u8 cq_st;
+ u16 is_in_list_cnt;
+ u16 cq_max;
+ u16 cq_period;
+ u8 scqe_break_moderation_en : 1;
+ u8 cq_overflow_locked_flag : 1;
+};
+
+struct zxdh_ver_info {
+ u16 major;
+ u16 minor;
+ u64 support;
+};
+
+enum zxdh_function_type {
+ ZXDH_FUNCTION_TYPE_PF,
+ ZXDH_FUNCTION_TYPE_VF,
+};
+
+struct zxdh_core_dev_info {
+ struct pci_dev *pdev;
+ struct auxiliary_device *adev;
+ u32 __iomem *hw_addr;
+ struct zxdh_ver_info ver;
+ void *auxiliary_priv;
+ enum zxdh_function_type ftype;
+ u16 vport_id;
+ u16 slot_id;
+ struct net_device *netdev;
+ struct msix_entry msix_entries;
+ u16 msix_count;
+ void *dh_dev;
+};
+
+struct zxdh_rdma_if {
+ void *(*get_rdma_netdev)(void *dh_dev);
+};
+
+struct zxdh_auxiliary_dev {
+ struct auxiliary_device adev;
+ struct zxdh_core_dev_info *zxdh_info;
+ struct zxdh_rdma_if *rdma_ops;
+ void *ops;
+ void *parent;
+ void *auxiliary_ops[18];
+};
+
+#endif /* ZRDMA_TYPE_H */
--
2.27.0
* [PATCH rdma v1 0/2] Add ZTE DingHai Ethernet Protocol Driver for RDMA
From: zhang.yanze @ 2026-06-24 8:44 UTC (permalink / raw)
To: jgg, leon
Cc: linux-kernel, linux-rdma, zhang.yanze, wei.quan, han.junyang,
ran.ming, han.chengfei
From: Yanze Zhang <zhang.yanze@zte.com.cn>
This series adds initial RoCEv2 support for the ZTE ZXDH RNIC,
a PCIe Gen5 adapter.
The driver implements the core IB verbs (QP/CQ/MR/SRQ/AH) with an
asynchronous CQP command path and DCQCN congestion control. It uses the
auxiliary bus framework, following hns/bnxt_re patterns. It does not
include Ethernet network data plane support. Perftest shows ~395 Gb/s
write bandwidth at a 4 KB message size over RoCEv2 RC/UD QPs.
Yanze Zhang (2):
RDMA/zrdma: Add ZTE Dinghai Ethernet Protocol Driver for RDMA
RDMA/zrdma: Add hardware config code and improve driver init flow
MAINTAINERS | 6 +
drivers/infiniband/Kconfig | 1 +
drivers/infiniband/hw/Makefile | 1 +
drivers/infiniband/hw/zrdma/Kconfig | 10 +
drivers/infiniband/hw/zrdma/Makefile | 6 +
drivers/infiniband/hw/zrdma/zrdma_ctrl.h | 248 +++++++++++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_defs.h | 37 ++++
drivers/infiniband/hw/zrdma/zrdma_hw.c | 135 ++++++++++++
drivers/infiniband/hw/zrdma/zrdma_hw.h | 156 ++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_main.c | 156 ++++++++++++++
drivers/infiniband/hw/zrdma/zrdma_main.h | 140 +++++++++++++
drivers/infiniband/hw/zrdma/zrdma_mem.h | 105 ++++++++++
drivers/infiniband/hw/zrdma/zrdma_type.h | 110 ++++++++++
drivers/infiniband/hw/zrdma/zrdma_uk.h | 18 ++
14 files changed, 1129 insertions(+)
create mode 100644 drivers/infiniband/hw/zrdma/Kconfig
create mode 100644 drivers/infiniband/hw/zrdma/Makefile
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_ctrl.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_defs.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_hw.c
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_hw.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_main.c
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_main.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_mem.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_type.h
create mode 100644 drivers/infiniband/hw/zrdma/zrdma_uk.h
--
2.27.0
* Re: [PATCH] RDMA/core: Fix memory leak in __ib_create_cq() on invalid cqe
From: Kalesh Anakkur Purayil @ 2026-06-24 8:38 UTC (permalink / raw)
To: Chenguang Zhao
Cc: jgg, leon, edwards, mbloch, michaelgur, msanalla, ohartoov, jiri,
linux-rdma
In-Reply-To: <20260624025949.306783-1-zhaochenguang@kylinos.cn>
On Wed, Jun 24, 2026 at 8:30 AM Chenguang Zhao <zhaochenguang@kylinos.cn> wrote:
>
> Free the allocated CQ object when cqe is zero before returning
> -EINVAL.
>
> Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
> ---
> drivers/infiniband/core/verbs.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> index 3b613b57e269..d6b2eb820061 100644
> --- a/drivers/infiniband/core/verbs.c
> +++ b/drivers/infiniband/core/verbs.c
> @@ -2200,8 +2200,10 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
> if (!cq)
> return ERR_PTR(-ENOMEM);
>
> - if (WARN_ON_ONCE(!cq_attr->cqe))
> + if (WARN_ON_ONCE(!cq_attr->cqe)) {
> + kfree(cq);
> return ERR_PTR(-EINVAL);
> + }
[Kalesh] I think you can move this check to the beginning of the function.
Also, please add a Fixes tag.
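A minimal userspace sketch of the reordering suggested above, assuming the cqe check needs nothing from the allocated object. The model_* names and the allocation counter are hypothetical stand-ins for illustration, not the ib_core structures or the final patch:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real ib_core structs. */
struct model_cq_attr { unsigned int cqe; };
struct model_cq { unsigned int cqe; };

/* Counts live allocations so a leak would show up as a non-zero value. */
static int model_live_allocs;

/*
 * Validate the caller-supplied attribute before allocating, so the
 * error path has nothing to free. Returns 0 on success, -errno on failure.
 */
static int model_create_cq(const struct model_cq_attr *attr,
			   struct model_cq **out)
{
	struct model_cq *cq;

	assert(attr != NULL && out != NULL);

	*out = NULL;
	if (!attr->cqe)
		return -EINVAL;	/* nothing allocated yet, so nothing leaks */

	cq = calloc(1, sizeof(*cq));
	if (!cq)
		return -ENOMEM;
	model_live_allocs++;

	cq->cqe = attr->cqe;
	*out = cq;
	return 0;
}

/* Release a CQ created by model_create_cq(); NULL is accepted. */
static void model_destroy_cq(struct model_cq *cq)
{
	if (!cq)
		return;
	free(cq);
	model_live_allocs--;
}
```

With the check placed first, the -EINVAL path needs no kfree() at all, which is the point of the suggestion.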
>
> cq->device = device;
> cq->comp_handler = comp_handler;
> --
> 2.25.1
>
>
--
Regards,
Kalesh AP
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5509 bytes --]
^ permalink raw reply
* Re: [PATCH net V3 1/3] net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
From: Tariq Toukan @ 2026-06-24 7:53 UTC (permalink / raw)
To: Simon Horman, tariqt
Cc: andrew+netdev, davem, edumazet, kuba, netdev, pabeni, cratiu,
eranbe, feliu, haiyangz, lkayal, leon, linux-kernel, linux-rdma,
mbloch, noren, saeedm, gal, alazar, cjubran, kees, eranbe, saeedm
In-Reply-To: <20260623104624.1073738-3-horms@kernel.org>
On 23/06/2026 13:46, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> Full review at: https://sashiko.dev
>
> Simon says: I think this issue can be addressed by reversing the order
> that patches 1/3 and 2/3 appear in this patch-set.
>
> ---
> net/mlx5e: Fix HV VHCA stats zero-sized buffer allocation
>
> This commit fixes a zero-sized buffer allocation by using max_nch instead of
> stats_nch when allocating the buffer for HV VHCA stats.
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> index 195863b2c013e..06cbd49d4e98e 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/hv_vhca_stats.c
>> @@ -122,7 +128,7 @@ static void mlx5e_hv_vhca_stats_cleanup(struct mlx5_hv_vhca_agent *agent)
>>
>> void mlx5e_hv_vhca_stats_create(struct mlx5e_priv *priv)
>> {
>> - int buf_len = mlx5e_hv_vhca_stats_buf_size(priv);
>> + int buf_len = mlx5e_hv_vhca_stats_buf_max_size(priv);
>> struct mlx5_hv_vhca_agent *agent;
>>
>> priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
>
> [Severity: High]
> Does this change introduce a potential double-free regression on memory
> allocation failure?
>
> When mlx5e_hv_vhca_stats_destroy() cleans up the stats agent, it frees the
> agent but does not set priv->stats_agent.agent to NULL.
>
Good catch on the double-free, thank you.
Swapping the patches has its own problem: patch 2/3 also modifies
mlx5e_hv_vhca_stats_create(), so applying the race fix first would leave
an intermediate tree that still has the zero-sized buffer bug, which is
exactly the condition the race fix builds on top of.
Instead, I think we'll make patch 1/3 self-contained by including the
NULL assignments in mlx5e_hv_vhca_stats_destroy():
	mlx5_hv_vhca_agent_destroy(priv->stats_agent.agent);
	priv->stats_agent.agent = NULL;
	kvfree(priv->stats_agent.buf);
	priv->stats_agent.buf = NULL;
This closes the double-free window in the same patch that opens it, with
no ordering dependency on patch 2/3.
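To illustrate why the NULL assignments matter, here is a small userspace model of the create/destroy sequence from the review. All model_* names are hypothetical, the buffer and agent sizes are arbitrary, and the sketch assumes the destroy helper tolerates a NULL agent; it is not the mlx5 code:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model of priv->stats_agent; not the mlx5 structs. */
struct model_stats_agent {
	void *agent;
	void *buf;
};

/* How many times a non-NULL agent was actually freed. */
static int model_agent_frees;

/* Assumption: destroying a NULL agent is a no-op. */
static void model_agent_destroy(void *agent)
{
	if (!agent)
		return;
	free(agent);
	model_agent_frees++;
}

/*
 * Mirrors the early return discussed in the review: when the buffer
 * allocation fails, sa->agent is left untouched.
 */
static int model_stats_create(struct model_stats_agent *sa, int fail_buf_alloc)
{
	assert(sa != NULL);

	sa->buf = fail_buf_alloc ? NULL : calloc(1, 64);
	if (!sa->buf)
		return -1;

	sa->agent = calloc(1, 16);
	if (!sa->agent) {
		free(sa->buf);
		sa->buf = NULL;
		return -1;
	}
	return 0;
}

/* Clearing both pointers makes a repeated destroy harmless. */
static void model_stats_destroy(struct model_stats_agent *sa)
{
	model_agent_destroy(sa->agent);
	sa->agent = NULL;	/* a later destroy sees NULL, not a stale pointer */
	free(sa->buf);		/* free(NULL) is a no-op */
	sa->buf = NULL;
}
```

In this model, a create/destroy cycle followed by a failed create and a second destroy frees the agent exactly once, because the second destroy finds NULL instead of the stale pointer.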
> Before this patch, kvzalloc() was called with 0 and returned ZERO_SIZE_PTR,
> so it never returned NULL. It always proceeded to call
> mlx5_hv_vhca_agent_create() which overwrote priv->stats_agent.agent.
>
> Now that the buffer size is non-zero, if kvzalloc() fails it will return
> NULL and take the early return path in mlx5e_hv_vhca_stats_create():
>
> priv->stats_agent.buf = kvzalloc(buf_len, GFP_KERNEL);
> if (!priv->stats_agent.buf)
> return;
>
> This early return bypasses overwriting the dangling priv->stats_agent.agent.
> If a subsequent device detach occurs, mlx5e_hv_vhca_stats_destroy() could
> dereference and free this stale pointer again.
>
> I noticed this was later addressed upstream in commit e600849cc1e0
> ("net/mlx5e: Fix HV VHCA stats agent registration race").
>
^ permalink raw reply
* Re: [PATCH v9 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
From: Zhiping Zhang @ 2026-06-24 5:24 UTC (permalink / raw)
To: Alex Williamson
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Bjorn Helgaas, kvm, linux-rdma, linux-pci,
dri-devel
In-Reply-To: <20260623121736.4d9e38b9@shazbot.org>
On Tue, Jun 23, 2026 at 11:17 AM Alex Williamson <alex@shazbot.org> wrote:
>
> On Mon, 22 Jun 2026 11:41:36 -0700
> Zhiping Zhang <zhipingz@meta.com> wrote:
>
> > Implement dma-buf get_pci_tph for vfio-pci exported dma-bufs and add
> > VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
> > for a VFIO-owned device.
> >
> > 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> > uAPI carries both with explicit validity flags, and get_pci_tph()
> > returns the value matching the importer's requested namespace or
> > -EOPNOTSUPP.
> >
> > Publish and read the TPH descriptor under dmabuf->resv, matching the
> > locking used for other importer-visible dma-buf state. The SET ioctl
> > takes dma_resv_lock_interruptible(), while the callback runs under
> > DMA-buf's asserted resv lock.
> >
> > Reject requests the device cannot consume as a completer:
> > pcie_tph_completer_type() must report at least
> > PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
> > PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Make PROBE follow the same hardware
> > gate so the feature only probes as supported when the device can really
> > consume it.
> >
> > Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> > ---
> > drivers/vfio/pci/vfio_pci_core.c | 3 +
> > drivers/vfio/pci/vfio_pci_dmabuf.c | 97 +++++++++++++++++++++++++++++-
> > drivers/vfio/pci/vfio_pci_priv.h | 12 ++++
> > include/uapi/linux/vfio.h | 37 ++++++++++++
> > 4 files changed, 148 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> > index a28f1e99362c..c7d6902bc61b 100644
> > --- a/drivers/vfio/pci/vfio_pci_core.c
> > +++ b/drivers/vfio/pci/vfio_pci_core.c
> > @@ -1572,6 +1572,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> > return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> > case VFIO_DEVICE_FEATURE_DMA_BUF:
> > return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> > + case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> > + return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> > + argsz);
> > default:
> > return -ENOTTY;
> > }
> > diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > index c16f460c01d6..d6f5dd321000 100644
> > --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> > +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> > @@ -3,6 +3,7 @@
> > */
> > #include <linux/dma-buf-mapping.h>
> > #include <linux/pci-p2pdma.h>
> > +#include <linux/pci-tph.h>
> > #include <linux/dma-resv.h>
> >
> > #include "vfio_pci_priv.h"
> > @@ -19,7 +20,14 @@ struct vfio_pci_dma_buf {
> > u32 nr_ranges;
> > struct kref kref;
> > struct completion comp;
> > - u8 revoked : 1;
> > +
> > + /* Protected by dmabuf->resv. */
>
> Nit, it would be more accurate to say:
>
> /*
> * Updates protected by dmabuf->resv, @revoked additionally
> * protected by memory_lock.
> */
>
> revoked also has an unprotected read, but it's previously existing and
> benign, and likely just needs a READ_ONCE() annotation.
>
Agreed, I'll update the comment and add READ_ONCE() as well.
> > + u16 tph_st_ext;
> > + u8 tph_st;
> > + u8 revoked:1;
> > + u8 tph_st_valid:1;
> > + u8 tph_st_ext_valid:1;
> > + u8 tph_ph:2;
> > };
> >
> > static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> > @@ -69,6 +77,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> > return ret;
> > }
> >
> > +static int vfio_pci_dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
> > + u16 *steering_tag, u8 *ph)
> > +{
> > + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> > +
> > + dma_resv_assert_held(dmabuf->resv);
> > +
> > + if (extended) {
> > + if (!priv->tph_st_ext_valid)
> > + return -EOPNOTSUPP;
> > + *steering_tag = priv->tph_st_ext;
> > + } else {
> > + if (!priv->tph_st_valid)
> > + return -EOPNOTSUPP;
> > + *steering_tag = priv->tph_st;
> > + }
> > + *ph = priv->tph_ph;
> > + return 0;
> > +}
> > +
> > static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> > struct sg_table *sgt,
> > enum dma_data_direction dir)
> > @@ -101,6 +129,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
> >
> > static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> > .attach = vfio_pci_dma_buf_attach,
> > + .get_pci_tph = vfio_pci_dma_buf_get_pci_tph,
> > .map_dma_buf = vfio_pci_dma_buf_map,
> > .unmap_dma_buf = vfio_pci_dma_buf_unmap,
> > .release = vfio_pci_dma_buf_release,
> > @@ -333,6 +362,72 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> > return ret;
> > }
> >
> > +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> > + u32 flags,
> > + struct vfio_device_feature_dma_buf_tph __user *arg,
> > + size_t argsz)
> > +{
> > + struct vfio_device_feature_dma_buf_tph set_tph;
> > + struct vfio_pci_dma_buf *priv;
> > + struct dma_buf *dmabuf;
> > + u8 comp;
> > + int ret;
> > +
> > + comp = pcie_tph_completer_type(vdev->pdev);
> > + if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> > + return -EOPNOTSUPP;
> > +
> > + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> > + sizeof(set_tph));
> > + if (ret != 1)
> > + return ret;
> > +
> > + if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> > + return -EFAULT;
> > +
> > + if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> > + return -EINVAL;
> > +
> > + if (set_tph.ph & ~0x3)
> > + return -EINVAL;
> > +
> > + if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
> > + comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
> > + return -EOPNOTSUPP;
> > +
> > + dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> > + if (IS_ERR(dmabuf))
> > + return PTR_ERR(dmabuf);
> > +
> > + if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> > + ret = -EINVAL;
> > + goto out_put;
> > + }
> > +
> > + priv = dmabuf->priv;
> > + if (priv->vdev != vdev) {
> > + ret = -EINVAL;
> > + goto out_put;
> > + }
> > +
> > + ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
> > + if (ret)
> > + goto out_put;
> > +
> > + priv->tph_st = set_tph.steering_tag;
> > + priv->tph_st_ext = set_tph.steering_tag_ext;
> > + priv->tph_ph = set_tph.ph;
> > + priv->tph_st_valid = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
> > + priv->tph_st_ext_valid =
> > + !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
> > + dma_resv_unlock(dmabuf->resv);
> > + ret = 0;
> > +
> > +out_put:
> > + dma_buf_put(dmabuf);
> > + return ret;
> > +}
> > +
> > void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> > {
> > struct vfio_pci_dma_buf *priv;
> > diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> > index fca9d0dfac90..c58f369be4b3 100644
> > --- a/drivers/vfio/pci/vfio_pci_priv.h
> > +++ b/drivers/vfio/pci/vfio_pci_priv.h
> > @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> > int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> > struct vfio_device_feature_dma_buf __user *arg,
> > size_t argsz);
> > +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> > + u32 flags,
> > + struct vfio_device_feature_dma_buf_tph __user *arg,
> > + size_t argsz);
> > void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> > void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> > #else
> > @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> > {
> > return -ENOTTY;
> > }
> > +
> > +static inline int
> > +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> > + struct vfio_device_feature_dma_buf_tph __user *arg,
> > + size_t argsz)
> > +{
> > + return -ENOTTY;
> > +}
> > static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> > {
> > }
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 5de618a3a5ee..2d30ba43e2cf 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
> > */
> > #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
> >
> > +/**
> > + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> > + * with a vfio-exported dma-buf. The dma-buf must have been created by
> > + * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
> > + * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
> > + * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
> > + * report the Extended TPH Completer encoding. Otherwise the ioctl
> > + * returns -EOPNOTSUPP.
> > + *
> > + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> > + *
> > + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> > + * distinct namespaces. Userspace supplies whichever values are valid and sets
> > + * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
> > + * an importer requests one namespace and receives the matching value.
> > + *
> > + * @flags == 0 marks any previously published ST / Extended-ST as invalid
> > + * for future PCI TPH queries on this dma-buf.
>
> I think we should avoid "published" as it still suggests we're somehow
> able to invalidate what was previously reported and consumed. It's
> offset by the trailing clause here, but that's absent below.
>
> Also, we're noting @flags == 0 as if it's a special case, but we really
> need the clarity that if either flag bit is not set, the corresponding
> field is marked invalid for future queries. Perhaps something like:
>
> * @flags is the authoritative validity for each namespace: when
> * VFIO_DMA_BUF_TPH_ST is set, @steering_tag becomes the valid 8-bit ST; when
> * VFIO_DMA_BUF_TPH_ST_EXT is set, @steering_tag_ext becomes the valid 16-bit
> * Extended ST. A namespace whose bit is clear is marked invalid and
> * reported as unsupported to importers requesting it.
> *
> * Each SET fully replaces the dma-buf's TPH state: any namespace not selected
> * in @flags is left invalid, so @flags == 0 marks both ST and Extended ST
> * invalid. This only affects TPH queries made after the SET completes; an
> * importer that has already retrieved a value is unaffected. Userspace must
> * therefore configure TPH before handing the dma-buf fd to an importer.
>
Thanks, I'll use your suggested wording here.
> > + *
> > + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
>
> Slight inconsistency here: this should be @ph. It might also help to
> silence the Sashiko warning to note:
>
> * Undefined @flags and @ph bits must always be zero.
>
Yes, I'll fix @ph and add the undefined-bits-must-be-zero wording.
> > + *
> > + * Userspace must publish TPH before handing the dma-buf fd to an importer.
> > + * Calling SET again replaces the published values.
>
> The above suggestion is meant to replace this as well.
>
> > + */
> > +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
>
> There are several series in-flight contending for device feature
> indexes, so this should really go in through the vfio tree to reduce
> the risk of duplicates. We also still need acks from the relevant
> maintainers for PCI, dma-buf, and mlx5 before this can be queued for
> v7.3. Thanks,
>
> Alex
>
Sure!
Thanks,
Zhiping
^ permalink raw reply
* blktests failures with v7.1 kernel
From: Shin'ichiro Kawasaki @ 2026-06-24 5:04 UTC (permalink / raw)
To: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
linux-scsi@vger.kernel.org, nbd, linux-rdma
Hi all,
I ran the latest blktests (git hash: 5a62429536b1) with the v7.1 kernel. I
observed 9 failures listed below. Comparing with the previous report for the
v7.1-rc1 kernel [1], 2 failures are avoided (nvme/045, scsi/002) and 3 failures
are new (block/005, nvme/062, nvme/063). As always, your help with fixes will be
appreciated.
[1] https://lore.kernel.org/linux-block/afB5syZbUrppgsDQ@shinmob/
List of failures
================
#1: block/005 (new)
#2: nvme/005 (tcp transport)
#3: nvme/058 (fc transport)(kmemleak)
#4: nvme/060 (rdma transport)
#5: nvme/061 (rdma transport, siw driver)(kmemleak)
#6: nvme/061 (fc transport)
#7: nvme/062 (tcp transport)(new)
#8: nvme/063 (tcp transport)(new)
#9: nbd/002
Failure description
===================
#1: block/005 (new)
I found the test case block/005 failed under some conditions due to
concurrent writes to a sysfs IO scheduler attribute. The failure was
discussed and a fix patch candidate is available [2].
[2] https://lore.kernel.org/linux-block/20260623013238.642052-1-shinichiro.kawasaki@wdc.com/
#2: nvme/005 (tcp transport)
The test case nvme/005 fails for tcp transport due to the lockdep WARN
related to the three locks q->q_usage_counter, q->elevator_lock and
set->srcu. Refer to [1] for the details of the failure. There are two causes
of the WARN and two fixes are required. One fix by Chaitanya is already in
the kernel v7.1. The other fix is queued for v7.2-rc1 [3].
[3] https://lore.kernel.org/linux-nvme/20260604023208.388157-1-shinichiro.kawasaki@wdc.com/
#3: nvme/058 (fc transport)(kmemleak)
When the kernel is built with CONFIG_DEBUG_KMEMLEAK, the test case sometimes
triggers a kmemleak report. With the v7.1-rc1 kernel, this test case caused a
hang, but the hang is no longer observed with the v7.1 kernel, so the kmemleak
is easier to recreate now. The memory leak report with the v7.1 kernel looks
similar to those I reported for the v7.0-rc1 kernel [4].
[4] https://lore.kernel.org/linux-block/aZ_-cH8euZLySxdD@shinmob/
#4: nvme/060 (rdma transport)
When the test case is repeated for rdma transport around 50 times, the test
case fails. There are two failure symptoms, and neither looks like a
kernel-side problem. I posted blktests-side fix candidate patches [5].
[5] https://lore.kernel.org/linux-nvme/20260619013329.558580-1-shinichiro.kawasaki@wdc.com/
#5: nvme/061 (rdma transport, siw driver)(kmemleak)
When the test case nvme/061 is repeated twice for the rdma transport and the
siw driver on the kernel with CONFIG_DEBUG_KMEMLEAK enabled, it causes
kmemleak that is detected at the beginning of the 2nd run. Refer to the
nvme/061 failure report for v6.19 kernel [6].
[6] https://lore.kernel.org/linux-block/aY7ZBfMjVIhe_wh3@shinmob/
#6: nvme/061 (fc transport)
When the test case nvme/061 is repeated around 50 times for the fc
transport, the test process fails after Oops and KASAN null-ptr-deref.
Refer to the report for the v7.0-rc1 kernel [4].
#7: nvme/062 (tcp transport)(new)
The test case nvme/062 fails for tcp transport due to the lockdep WARN
related to the three locks fs_reclaim, set->srcu and sk_lock-AF_INET-NVME.
q->elevator_lock and q->q_usage_counter are also recorded in the lockdep
splat [7].
I ran nvme/062 on v7.1-rc1 kernel, and I observed it failed with lockdep
WARN. In the past, I did not observe the failure of this test case because
lockdep had been disabled due to the lockdep WARN at nvme/005. Now that
nvme/005 no longer reports a lockdep WARN, I see it at nvme/062 instead.
When I applied the fix patch for lockdep WARN at nvme/005 [3], the symptoms
of the lockdep WARN changed [8]. With the patch, the three locks
kernfs_rwsem, sparse_irq_lock and kernfs_supers_rwsem caused the WARN. The
fix patch candidate for block/005 [2] did not affect the failure of
nvme/062.
#8: nvme/063 (tcp transport)(new)
The test case nvme/063 fails for tcp transport due to the lockdep WARN
related to the three locks set->srcu, q->q_usage_counter(io) and
q->elevator_lock [9].
I had reported the failure of this test case on v7.1-rc1 kernel together
with nvme/005, assuming the failures of nvme/005 and nvme/063 would have a
single cause. But even with the fix for nvme/005 [3] applied, I still observe
the failure of nvme/063. Therefore, this nvme/063 failure is a different
problem from the nvme/005 failure. The fix patch candidate for block/005 [2]
did not affect the failure of nvme/063 either.
#9: nbd/002
The test case nbd/002 fails due to the lockdep WARN related to
sk_lock-AF_INET6, cmd->lock and nsock->txlock. The lockdep WARN of this test
case has been reported since v6.18-rc1 kernel [10]. Eric Dumazet posted a
fix patch and it is queued for v7.2-rc1 [11]. I confirmed the patch avoids
the failure. Thanks!
[10] https://lore.kernel.org/linux-block/ynmi72x5wt5ooljjafebhcarit3pvu6axkslqenikb2p5txe57@ldytqa2t4i2x/
[11] https://lore.kernel.org/linux-block/20260613042619.1108126-1-edumazet@google.com/
[7] nvme/062 dmesg on v7.1 kernel
[ 271.544567] [ T1351] run blktests nvme/062 at 2026-06-24 10:47:29
[ 271.810359] [ T1746] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 271.848916] [ T1749] nvmet: Allow non-TLS connections while TLS1.3 is enabled
[ 271.869077] [ T1752] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 272.059895] [ T358] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 272.067526] [ T1759] nvme nvme5: creating 4 I/O queues.
[ 272.073127] [ T1759] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 272.085957] [ T1759] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 272.648494] [ T1813] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[ 272.847561] [ T1822] ======================================================
[ 272.848150] [ T1822] WARNING: possible circular locking dependency detected
[ 272.848775] [ T1822] 7.1.0 #5 Not tainted
[ 272.849107] [ T1822] ------------------------------------------------------
[ 272.849704] [ T1822] tlshd/1822 is trying to acquire lock:
[ 272.850161] [ T1822] ffffffff9b8ccda0 (fs_reclaim){+.+.}-{0:0}, at: __kmalloc_cache_noprof+0x58/0x720
[ 272.850908] [ T1822]
but task is already holding lock:
[ 272.851497] [ T1822] ffff8881220ad058 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: do_tcp_setsockopt+0x499/0x26a0
[ 272.852360] [ T1822]
which lock already depends on the new lock.
[ 272.853170] [ T1822]
the existing dependency chain (in reverse order) is:
[ 272.853857] [ T1822]
-> #4 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 272.854483] [ T1822] lock_sock_nested+0x32/0xf0
[ 272.854883] [ T1822] tcp_sendmsg+0x1c/0x50
[ 272.855250] [ T1822] sock_sendmsg+0x305/0x3c0
[ 272.855588] [ T1822] nvme_tcp_try_send_cmd_pdu+0x630/0xcf0 [nvme_tcp]
[ 272.856098] [ T1822] nvme_tcp_try_send+0x1ef/0xa60 [nvme_tcp]
[ 272.856608] [ T1822] nvme_tcp_queue_rq+0xfa3/0x19e0 [nvme_tcp]
[ 272.857676] [ T1822] blk_mq_dispatch_rq_list+0x3e0/0x2420
[ 272.858698] [ T1822] __blk_mq_sched_dispatch_requests+0x20a/0x15d0
[ 272.859747] [ T1822] blk_mq_sched_dispatch_requests+0xab/0x150
[ 272.860785] [ T1822] blk_mq_run_work_fn+0x135/0x2e0
[ 272.861720] [ T1822] process_one_work+0x8b6/0x1650
[ 272.862619] [ T1822] worker_thread+0x5fd/0xfe0
[ 272.863473] [ T1822] kthread+0x36a/0x460
[ 272.864299] [ T1822] ret_from_fork+0x655/0x9d0
[ 272.865117] [ T1822] ret_from_fork_asm+0x1a/0x30
[ 272.865963] [ T1822]
-> #3 (set->srcu){.+.+}-{0:0}:
[ 272.867350] [ T1822] __synchronize_srcu+0xe1/0x2f0
[ 272.868170] [ T1822] elevator_switch+0x2bd/0x670
[ 272.868993] [ T1822] elevator_change+0x2e7/0x500
[ 272.869805] [ T1822] elevator_set_none+0xaa/0xf0
[ 272.870623] [ T1822] blk_unregister_queue+0x15e/0x2e0
[ 272.871459] [ T1822] __del_gendisk+0x28f/0xaa0
[ 272.872280] [ T1822] del_gendisk+0x11a/0x1c0
[ 272.873074] [ T1822] nvme_ns_remove+0x331/0x940 [nvme_core]
[ 272.873989] [ T1822] nvme_remove_namespaces+0x289/0x3f0 [nvme_core]
[ 272.874941] [ T1822] nvme_do_delete_ctrl+0xf6/0x160 [nvme_core]
[ 272.875811] [ T1822] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
[ 272.876761] [ T1822] nvme_sysfs_delete+0xba/0xe0 [nvme_core]
[ 272.877636] [ T1822] kernfs_fop_write_iter+0x3d6/0x5e0
[ 272.878441] [ T1822] vfs_write+0x4b3/0xf70
[ 272.879157] [ T1822] ksys_write+0x112/0x250
[ 272.879908] [ T1822] do_syscall_64+0xdf/0x790
[ 272.880698] [ T1822] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 272.881499] [ T1822]
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
[ 272.882720] [ T1822] __mutex_lock+0x1ae/0x2650
[ 272.883397] [ T1822] elevator_change+0x197/0x500
[ 272.884139] [ T1822] elv_iosched_store+0x38f/0x430
[ 272.884926] [ T1822] queue_attr_store+0x25f/0x3e0
[ 272.885685] [ T1822] kernfs_fop_write_iter+0x3d6/0x5e0
[ 272.886433] [ T1822] vfs_write+0x4b3/0xf70
[ 272.887112] [ T1822] ksys_write+0x112/0x250
[ 272.887838] [ T1822] do_syscall_64+0xdf/0x790
[ 272.888534] [ T1822] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 272.889303] [ T1822]
-> #1 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 272.890567] [ T1822] blk_alloc_queue+0x605/0x7a0
[ 272.891320] [ T1822] blk_mq_alloc_queue+0x168/0x270
[ 272.892058] [ T1822] scsi_alloc_sdev+0x86c/0xcd0
[ 272.892808] [ T1822] scsi_probe_and_add_lun+0x601/0xc50
[ 272.893568] [ T1822] __scsi_add_device+0x233/0x280
[ 272.894309] [ T1822] ata_scsi_scan_host+0x137/0x3a0
[ 272.895032] [ T1822] async_run_entry_fn+0x93/0x550
[ 272.895766] [ T1822] process_one_work+0x8b6/0x1650
[ 272.896503] [ T1822] worker_thread+0x5fd/0xfe0
[ 272.897199] [ T1822] kthread+0x36a/0x460
[ 272.897824] [ T1822] ret_from_fork+0x655/0x9d0
[ 272.898528] [ T1822] ret_from_fork_asm+0x1a/0x30
[ 272.899251] [ T1822]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 272.900443] [ T1822] __lock_acquire+0xe06/0x22e0
[ 272.901136] [ T1822] lock_acquire+0x1a5/0x330
[ 272.901839] [ T1822] fs_reclaim_acquire+0xd5/0x120
[ 272.902588] [ T1822] __kmalloc_cache_noprof+0x58/0x720
[ 272.903333] [ T1822] __request_module+0x253/0x610
[ 272.904050] [ T1822] tcp_set_ulp+0x395/0x5e0
[ 272.904769] [ T1822] do_tcp_setsockopt+0x4a9/0x26a0
[ 272.905492] [ T1822] do_sock_setsockopt+0x163/0x3b0
[ 272.906235] [ T1822] __sys_setsockopt+0xe0/0x150
[ 272.906965] [ T1822] __x64_sys_setsockopt+0xb9/0x180
[ 272.907734] [ T1822] do_syscall_64+0xdf/0x790
[ 272.908368] [ T1822] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 272.909049] [ T1822]
other info that might help us debug this:
[ 272.910750] [ T1822] Chain exists of:
fs_reclaim --> set->srcu --> sk_lock-AF_INET-NVME
[ 272.912667] [ T1822] Possible unsafe locking scenario:
[ 272.913915] [ T1822] CPU0 CPU1
[ 272.914569] [ T1822] ---- ----
[ 272.915244] [ T1822] lock(sk_lock-AF_INET-NVME);
[ 272.915921] [ T1822] lock(set->srcu);
[ 272.916742] [ T1822] lock(sk_lock-AF_INET-NVME);
[ 272.917634] [ T1822] lock(fs_reclaim);
[ 272.918266] [ T1822]
*** DEADLOCK ***
[ 272.919861] [ T1822] 1 lock held by tlshd/1822:
[ 272.920527] [ T1822] #0: ffff8881220ad058 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: do_tcp_setsockopt+0x499/0x26a0
[ 272.921589] [ T1822]
stack backtrace:
[ 272.922693] [ T1822] CPU: 2 UID: 0 PID: 1822 Comm: tlshd Not tainted 7.1.0 #5 PREEMPT(full)
[ 272.922696] [ T1822] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-10.fc44 06/10/2025
[ 272.922698] [ T1822] Call Trace:
[ 272.922701] [ T1822] <TASK>
[ 272.922703] [ T1822] dump_stack_lvl+0x6a/0x90
[ 272.922708] [ T1822] print_circular_bug.cold+0x189/0x1eb
[ 272.922713] [ T1822] check_noncircular+0x173/0x1a0
[ 272.922717] [ T1822] __lock_acquire+0xe06/0x22e0
[ 272.922720] [ T1822] lock_acquire+0x1a5/0x330
[ 272.922722] [ T1822] ? __kmalloc_cache_noprof+0x58/0x720
[ 272.922724] [ T1822] ? do_raw_spin_lock+0x12d/0x280
[ 272.922727] [ T1822] ? __request_module+0x253/0x610
[ 272.922728] [ T1822] fs_reclaim_acquire+0xd5/0x120
[ 272.922731] [ T1822] ? __kmalloc_cache_noprof+0x58/0x720
[ 272.922733] [ T1822] __kmalloc_cache_noprof+0x58/0x720
[ 272.922735] [ T1822] ? lockdep_hardirqs_on+0x8c/0x130
[ 272.922738] [ T1822] ? rcu_is_watching+0x11/0xb0
[ 272.922742] [ T1822] ? tcp_set_ulp+0x395/0x5e0
[ 272.922744] [ T1822] __request_module+0x253/0x610
[ 272.922747] [ T1822] ? __pfx___request_module+0x10/0x10
[ 272.922750] [ T1822] ? lock_acquire+0x1a5/0x330
[ 272.922751] [ T1822] ? rcu_is_watching+0x11/0xb0
[ 272.922753] [ T1822] ? cap_capable+0x1b7/0x3b0
[ 272.922756] [ T1822] ? lock_acquire+0x1b5/0x330
[ 272.922757] [ T1822] ? find_held_lock+0x2b/0x80
[ 272.922761] [ T1822] ? tcp_set_ulp+0x374/0x5e0
[ 272.922763] [ T1822] tcp_set_ulp+0x395/0x5e0
[ 272.922766] [ T1822] do_tcp_setsockopt+0x4a9/0x26a0
[ 272.922769] [ T1822] ? __pfx_do_tcp_setsockopt+0x10/0x10
[ 272.922771] [ T1822] ? __lock_acquire+0x3d2/0x22e0
[ 272.922772] [ T1822] ? lock_acquire+0x1a5/0x330
[ 272.922774] [ T1822] ? folio_add_file_rmap_ptes+0x7b6/0xa90
[ 272.922779] [ T1822] ? do_raw_spin_lock+0x12d/0x280
[ 272.922780] [ T1822] ? percpu_counter_add_batch+0x89/0x280
[ 272.922785] [ T1822] ? __pfx_selinux_netlbl_socket_setsockopt+0x10/0x10
[ 272.922788] [ T1822] ? find_held_lock+0x2b/0x80
[ 272.922792] [ T1822] do_sock_setsockopt+0x163/0x3b0
[ 272.922795] [ T1822] ? __pfx_do_sock_setsockopt+0x10/0x10
[ 272.922798] [ T1822] __sys_setsockopt+0xe0/0x150
[ 272.922802] [ T1822] __x64_sys_setsockopt+0xb9/0x180
[ 272.922804] [ T1822] ? rcu_read_unlock+0x17/0x60
[ 272.922807] [ T1822] ? lock_release+0x1b5/0x340
[ 272.922809] [ T1822] do_syscall_64+0xdf/0x790
[ 272.922811] [ T1822] ? rcu_read_unlock+0x1c/0x60
[ 272.922813] [ T1822] ? do_fault+0x8fc/0x13c0
[ 272.922815] [ T1822] ? rcu_read_unlock+0x17/0x60
[ 272.922817] [ T1822] ? lock_release+0x1b5/0x340
[ 272.922819] [ T1822] ? __handle_mm_fault+0x10ef/0x1d60
[ 272.922822] [ T1822] ? __lock_acquire+0x3d2/0x22e0
[ 272.922824] [ T1822] ? __pfx___css_rstat_updated+0x10/0x10
[ 272.922829] [ T1822] ? lock_acquire+0x1a5/0x330
[ 272.922831] [ T1822] ? count_memcg_events_mm.constprop.0+0x22/0x130
[ 272.922833] [ T1822] ? rcu_is_watching+0x11/0xb0
[ 272.922835] [ T1822] ? count_memcg_events+0x107/0x4e0
[ 272.922839] [ T1822] ? find_held_lock+0x2b/0x80
[ 272.922841] [ T1822] ? rcu_read_unlock+0x17/0x60
[ 272.922843] [ T1822] ? lock_release+0x1b5/0x340
[ 272.922845] [ T1822] ? find_held_lock+0x2b/0x80
[ 272.922847] [ T1822] ? exc_page_fault+0x94/0x140
[ 272.922849] [ T1822] ? lock_release+0x1b5/0x340
[ 272.922851] [ T1822] ? rcu_is_watching+0x11/0xb0
[ 272.922857] [ T1822] ? trace_hardirqs_on+0x14/0x1b0
[ 272.922860] [ T1822] ? preempt_count_add+0x7f/0x190
[ 272.922864] [ T1822] ? do_syscall_64+0x5d/0x790
[ 272.922865] [ T1822] ? do_syscall_64+0x8d/0x790
[ 272.922867] [ T1822] ? irqentry_exit+0xfc/0x790
[ 272.922869] [ T1822] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 272.922871] [ T1822] RIP: 0033:0x7fa53a2da75e
[ 272.922875] [ T1822] Code: 55 48 63 c9 48 63 ff 45 89 c9 48 89 e5 48 83 ec 08 6a 2c e8 54 69 f7 ff c9 c3 66 90 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 0a c3 66 0f 1f 84 00 00 00 00 00 48 8b 15 61
[ 272.922877] [ T1822] RSP: 002b:00007ffc6461e8e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000036
[ 272.922880] [ T1822] RAX: ffffffffffffffda RBX: 000055d6e364a750 RCX: 00007fa53a2da75e
[ 272.922881] [ T1822] RDX: 000000000000001f RSI: 0000000000000006 RDI: 0000000000000005
[ 272.922882] [ T1822] RBP: 00007ffc6461e930 R08: 0000000000000004 R09: 0000000000000070
[ 272.922883] [ T1822] R10: 000055d6b7b7be2a R11: 0000000000000206 R12: 000055d6e3657c30
[ 272.922884] [ T1822] R13: 00007ffc6461e904 R14: 00007ffc6461e990 R15: 000055d6e364a7c0
[ 272.922889] [ T1822] </TASK>
[ 273.088107] [ T296] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 273.100521] [ T1820] nvme nvme5: creating 4 I/O queues.
[ 273.143469] [ T1820] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 273.147971] [ T1820] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 273.463094] [ T1890] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[ 273.581465] [ T1903] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 273.611642] [ T1909] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 273.697895] [ T1916] nvme_tcp: queue 0: failed to receive icresp, error -104
[ 273.800385] [ T358] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 273.810247] [ T1926] nvme nvme5: creating 4 I/O queues.
[ 273.844828] [ T1926] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 273.850998] [ T1926] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 274.163114] [ T1985] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[8] nvme/062 dmesg on v7.1 kernel + nvme-tcp lockdep fix patch
[ 327.536320] [ T1023] run blktests nvme/062 at 2026-06-24 12:43:52
[ 327.852000] [ T1119] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 327.896673] [ T1122] nvmet: Allow non-TLS connections while TLS1.3 is enabled
[ 327.913286] [ T1125] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 328.118861] [ T350] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 328.129891] [ T1132] nvme nvme5: creating 4 I/O queues.
[ 328.141710] [ T1132] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 328.149625] [ T1132] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 329.072473] [ T1185] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[ 329.463408] [ T167] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 329.481208] [ T1192] nvme nvme5: creating 4 I/O queues.
[ 329.562530] [ T1192] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 329.580181] [ T1192] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 330.583680] [ T1261] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[ 330.852275] [ T1274] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 330.914316] [ T1280] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 331.109290] [ T1287] nvme_tcp: queue 0: failed to receive icresp, error -104
[ 331.117766] [ T1287] ======================================================
[ 331.118639] [ T1287] WARNING: possible circular locking dependency detected
[ 331.119510] [ T1287] 7.1.0+ #407 Not tainted
[ 331.120120] [ T1287] ------------------------------------------------------
[ 331.120993] [ T1287] nvme/1287 is trying to acquire lock:
[ 331.121627] [ T1287] ffff888100923180 (&root->kernfs_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x53/0x160
[ 331.122934] [ T1287]
but task is already holding lock:
[ 331.123841] [ T1287] ffff8881009232a0 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x4b/0x160
[ 331.125143] [ T1287]
which lock already depends on the new lock.
[ 331.126388] [ T1287]
the existing dependency chain (in reverse order) is:
[ 331.127475] [ T1287]
-> #9 (&root->kernfs_supers_rwsem){++++}-{4:4}:
[ 331.128528] [ T1287] down_read+0xa6/0x4c0
[ 331.129147] [ T1287] kernfs_remove_by_name_ns+0x4b/0x160
[ 331.129900] [ T1287] remove_files+0x8d/0x1b0
[ 331.130474] [ T1287] sysfs_remove_group+0x78/0x170
[ 331.131164] [ T1287] sysfs_remove_groups+0x63/0xd0
[ 331.132307] [ T1287] __kobject_del+0x7d/0x1e0
[ 331.133391] [ T1287] kobject_del+0x34/0x60
[ 331.134437] [ T1287] free_desc+0x184/0x1a0
[ 331.135485] [ T1287] irq_free_descs+0x4d/0x70
[ 331.136554] [ T1287] msi_domain_free_locked.part.0+0x492/0x690
[ 331.137838] [ T1287] msi_domain_free_irqs_all_locked+0xe9/0x140
[ 331.139056] [ T1287] pci_free_msi_irqs+0x12/0x90
[ 331.140124] [ T1287] pci_disable_msix+0xab/0xf0
[ 331.141173] [ T1287] pci_free_irq_vectors+0x12/0xe0
[ 331.142299] [ T1287] nvme_setup_io_queues+0x5d6/0x16c0 [nvme]
[ 331.143485] [ T1287] nvme_probe.cold+0x30f/0x65a [nvme]
[ 331.144607] [ T1287] local_pci_probe+0xdf/0x190
[ 331.145620] [ T1287] pci_call_probe+0x160/0x6d0
[ 331.146635] [ T1287] pci_device_probe+0x179/0x2f0
[ 331.147660] [ T1287] really_probe+0x1ed/0x900
[ 331.148641] [ T1287] __driver_probe_device+0x1d2/0x420
[ 331.149699] [ T1287] driver_probe_device+0x4a/0x120
[ 331.150719] [ T1287] __driver_attach_async_helper+0x10b/0x280
[ 331.151827] [ T1287] async_run_entry_fn+0x93/0x550
[ 331.152796] [ T1287] process_one_work+0x8b2/0x1640
[ 331.153764] [ T1287] worker_thread+0x5fd/0xfe0
[ 331.154708] [ T1287] kthread+0x367/0x460
[ 331.155581] [ T1287] ret_from_fork+0x655/0x9d0
[ 331.156502] [ T1287] ret_from_fork_asm+0x1a/0x30
[ 331.157434] [ T1287]
-> #8 (sparse_irq_lock){+.+.}-{4:4}:
[ 331.158932] [ T1287] __mutex_lock+0x1ae/0x2640
[ 331.159820] [ T1287] cpuhp_bringup_ap+0x52/0x950
[ 331.160725] [ T1287] cpuhp_invoke_callback+0x2d1/0x12e0
[ 331.161725] [ T1287] __cpuhp_invoke_callback_range+0xb6/0x1e0
[ 331.162771] [ T1287] _cpu_up+0x2eb/0x6d0
[ 331.163604] [ T1287] cpu_up+0x111/0x190
[ 331.164436] [ T1287] cpuhp_bringup_mask+0xd3/0x110
[ 331.165358] [ T1287] smp_init+0x27/0xe0
[ 331.166183] [ T1287] kernel_init_freeable+0x442/0x710
[ 331.167120] [ T1287] kernel_init+0x18/0x150
[ 331.167951] [ T1287] ret_from_fork+0x655/0x9d0
[ 331.168804] [ T1287] ret_from_fork_asm+0x1a/0x30
[ 331.169665] [ T1287]
-> #7 (cpu_hotplug_lock){++++}-{0:0}:
[ 331.171064] [ T1287] cpus_read_lock+0x3c/0xe0
[ 331.171902] [ T1287] static_key_disable+0x12/0x30
[ 331.172773] [ T1287] inet_hash+0xf3/0xd00
[ 331.173571] [ T1287] inet_csk_listen_start+0x350/0x440
[ 331.174508] [ T1287] __inet_listen_sk+0x191/0x650
[ 331.175390] [ T1287] inet_listen+0x9a/0xe0
[ 331.176203] [ T1287] __sys_listen+0x85/0x100
[ 331.177018] [ T1287] __x64_sys_listen+0x4e/0x90
[ 331.177860] [ T1287] do_syscall_64+0xdf/0x790
[ 331.178686] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.179686] [ T1287]
-> #6 (sk_lock-AF_INET){+.+.}-{0:0}:
[ 331.181060] [ T1287] lock_sock_nested+0x32/0xf0
[ 331.181906] [ T1287] tls_sw_sendmsg+0x1b4/0x23b0 [tls]
[ 331.182832] [ T1287] sock_sendmsg+0x305/0x3c0
[ 331.183661] [ T1287] nvmet_tcp_try_recv_pdu+0x150d/0x1d10 [nvmet_tcp]
[ 331.184742] [ T1287] nvmet_tcp_io_work+0x122/0x2420 [nvmet_tcp]
[ 331.185769] [ T1287] process_one_work+0x8b2/0x1640
[ 331.186665] [ T1287] worker_thread+0x5fd/0xfe0
[ 331.187515] [ T1287] kthread+0x367/0x460
[ 331.188312] [ T1287] ret_from_fork+0x655/0x9d0
[ 331.189171] [ T1287] ret_from_fork_asm+0x1a/0x30
[ 331.190030] [ T1287]
-> #5 (&ctx->tx_lock){+.+.}-{4:4}:
[ 331.191388] [ T1287] __mutex_lock+0x1ae/0x2640
[ 331.192245] [ T1287] tls_sw_sendmsg+0x130/0x23b0 [tls]
[ 331.193187] [ T1287] sock_sendmsg+0x305/0x3c0
[ 331.194015] [ T1287] nvme_tcp_try_send_cmd_pdu+0x630/0xcf0 [nvme_tcp]
[ 331.195097] [ T1287] nvme_tcp_try_send+0x1ef/0xa60 [nvme_tcp]
[ 331.196106] [ T1287] nvme_tcp_queue_rq+0xfa3/0x19e0 [nvme_tcp]
[ 331.197123] [ T1287] blk_mq_dispatch_rq_list+0x3e0/0x2420
[ 331.198088] [ T1287] __blk_mq_sched_dispatch_requests+0x20a/0x15d0
[ 331.199145] [ T1287] blk_mq_sched_dispatch_requests+0xa7/0x140
[ 331.200166] [ T1287] blk_mq_run_work_fn+0x135/0x2e0
[ 331.201075] [ T1287] process_one_work+0x8b2/0x1640
[ 331.201962] [ T1287] worker_thread+0x5fd/0xfe0
[ 331.202808] [ T1287] kthread+0x367/0x460
[ 331.203593] [ T1287] ret_from_fork+0x655/0x9d0
[ 331.204458] [ T1287] ret_from_fork_asm+0x1a/0x30
[ 331.205336] [ T1287]
-> #4 (set->srcu){.+.+}-{0:0}:
[ 331.206663] [ T1287] __synchronize_srcu+0xe1/0x2f0
[ 331.207558] [ T1287] elevator_switch+0x2bd/0x670
[ 331.208462] [ T1287] elevator_change+0x2e7/0x500
[ 331.209346] [ T1287] elevator_set_none+0xaa/0xf0
[ 331.210226] [ T1287] blk_unregister_queue+0x15e/0x2e0
[ 331.211157] [ T1287] __del_gendisk+0x28b/0xaa0
[ 331.212007] [ T1287] del_gendisk+0x11a/0x1c0
[ 331.212834] [ T1287] nvme_ns_remove+0x331/0x940 [nvme_core]
[ 331.213828] [ T1287] nvme_remove_namespaces+0x289/0x3f0 [nvme_core]
[ 331.214896] [ T1287] nvme_do_delete_ctrl+0xf6/0x160 [nvme_core]
[ 331.215931] [ T1287] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
[ 331.217003] [ T1287] nvme_sysfs_delete+0xb7/0xe0 [nvme_core]
[ 331.218011] [ T1287] kernfs_fop_write_iter+0x3d6/0x5e0
[ 331.218942] [ T1287] vfs_write+0x4b3/0xf70
[ 331.219759] [ T1287] ksys_write+0x112/0x250
[ 331.220594] [ T1287] do_syscall_64+0xdf/0x790
[ 331.221456] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.222474] [ T1287]
-> #3 (&q->elevator_lock){+.+.}-{4:4}:
[ 331.223909] [ T1287] __mutex_lock+0x1ae/0x2640
[ 331.224764] [ T1287] elevator_change+0x197/0x500
[ 331.225653] [ T1287] elv_iosched_store+0x38f/0x430
[ 331.226557] [ T1287] queue_attr_store+0x25f/0x3e0
[ 331.227458] [ T1287] kernfs_fop_write_iter+0x3d6/0x5e0
[ 331.228406] [ T1287] vfs_write+0x4b3/0xf70
[ 331.229232] [ T1287] ksys_write+0x112/0x250
[ 331.230068] [ T1287] do_syscall_64+0xdf/0x790
[ 331.230905] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.231904] [ T1287]
-> #2 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 331.233380] [ T1287] blk_alloc_queue+0x605/0x7a0
[ 331.234276] [ T1287] blk_mq_alloc_queue+0x168/0x270
[ 331.235196] [ T1287] scsi_alloc_sdev+0x86c/0xcd0
[ 331.236087] [ T1287] scsi_probe_and_add_lun+0x601/0xc50
[ 331.237034] [ T1287] __scsi_add_device+0x233/0x280
[ 331.237926] [ T1287] ata_scsi_scan_host+0x137/0x3a0
[ 331.238826] [ T1287] async_run_entry_fn+0x93/0x550
[ 331.239717] [ T1287] process_one_work+0x8b2/0x1640
[ 331.240618] [ T1287] worker_thread+0x5fd/0xfe0
[ 331.241485] [ T1287] kthread+0x367/0x460
[ 331.242291] [ T1287] ret_from_fork+0x655/0x9d0
[ 331.243165] [ T1287] ret_from_fork_asm+0x1a/0x30
[ 331.244029] [ T1287]
-> #1 (fs_reclaim){+.+.}-{0:0}:
[ 331.245366] [ T1287] fs_reclaim_acquire+0xd5/0x120
[ 331.246266] [ T1287] kmem_cache_alloc_lru_noprof+0x52/0x6c0
[ 331.247257] [ T1287] alloc_inode+0x9d/0x1e0
[ 331.248093] [ T1287] iget_locked+0x19d/0x630
[ 331.248923] [ T1287] kernfs_get_inode+0x42/0x440
[ 331.249791] [ T1287] kernfs_get_tree+0x5d0/0xbd0
[ 331.250662] [ T1287] sysfs_get_tree+0x3f/0x140
[ 331.251523] [ T1287] vfs_get_tree+0x87/0x2f0
[ 331.252367] [ T1287] fc_mount+0x16/0x220
[ 331.253172] [ T1287] path_mount+0x854/0x1d10
[ 331.253998] [ T1287] __x64_sys_mount+0x208/0x270
[ 331.254860] [ T1287] do_syscall_64+0xdf/0x790
[ 331.255696] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.256706] [ T1287]
-> #0 (&root->kernfs_rwsem){++++}-{4:4}:
[ 331.258135] [ T1287] __lock_acquire+0xe06/0x22e0
[ 331.259001] [ T1287] lock_acquire+0x1a5/0x330
[ 331.259836] [ T1287] down_write+0x8c/0x1f0
[ 331.260639] [ T1287] kernfs_remove_by_name_ns+0x53/0x160
[ 331.261596] [ T1287] sysfs_unmerge_group+0xd5/0x160
[ 331.262508] [ T1287] dev_pm_qos_hide_latency_tolerance+0x1f/0x60
[ 331.263555] [ T1287] nvme_uninit_ctrl+0x8f/0x110 [nvme_core]
[ 331.264573] [ T1287] nvme_tcp_create_ctrl+0x887/0xc20 [nvme_tcp]
[ 331.265625] [ T1287] nvmf_dev_write+0x40b/0x830 [nvme_fabrics]
[ 331.266651] [ T1287] vfs_write+0x1cc/0xf70
[ 331.267472] [ T1287] ksys_write+0x112/0x250
[ 331.268308] [ T1287] do_syscall_64+0xdf/0x790
[ 331.269164] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.270178] [ T1287]
other info that might help us debug this:
[ 331.272111] [ T1287] Chain exists of:
&root->kernfs_rwsem --> sparse_irq_lock --> &root->kernfs_supers_rwsem
[ 331.274536] [ T1287] Possible unsafe locking scenario:
[ 331.275924] [ T1287] CPU0 CPU1
[ 331.276806] [ T1287] ---- ----
[ 331.277684] [ T1287] rlock(&root->kernfs_supers_rwsem);
[ 331.278585] [ T1287] lock(sparse_irq_lock);
[ 331.279668] [ T1287] lock(&root->kernfs_supers_rwsem);
[ 331.280853] [ T1287] lock(&root->kernfs_rwsem);
[ 331.281672] [ T1287]
*** DEADLOCK ***
[ 331.283370] [ T1287] 3 locks held by nvme/1287:
[ 331.284208] [ T1287] #0: ffffffffc19e8260 (nvmf_dev_mutex){+.+.}-{4:4}, at: nvmf_dev_write+0x82/0x830 [nvme_fabrics]
[ 331.285743] [ T1287] #1: ffffffff87d02ce0 (dev_pm_qos_sysfs_mtx){+.+.}-{4:4}, at: dev_pm_qos_hide_latency_tolerance+0x17/0x60
[ 331.287381] [ T1287] #2: ffff8881009232a0 (&root->kernfs_supers_rwsem){++++}-{4:4}, at: kernfs_remove_by_name_ns+0x4b/0x160
[ 331.288984] [ T1287]
stack backtrace:
[ 331.290235] [ T1287] CPU: 3 UID: 0 PID: 1287 Comm: nvme Not tainted 7.1.0+ #407 PREEMPT(full)
[ 331.290239] [ T1287] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-10.fc44 06/10/2025
[ 331.290243] [ T1287] Call Trace:
[ 331.290245] [ T1287] <TASK>
[ 331.290246] [ T1287] dump_stack_lvl+0x6a/0x90
[ 331.290253] [ T1287] print_circular_bug.cold+0x189/0x1eb
[ 331.290257] [ T1287] check_noncircular+0x173/0x1a0
[ 331.290261] [ T1287] __lock_acquire+0xe06/0x22e0
[ 331.290265] [ T1287] lock_acquire+0x1a5/0x330
[ 331.290267] [ T1287] ? kernfs_remove_by_name_ns+0x53/0x160
[ 331.290270] [ T1287] ? __pfx___might_resched+0x10/0x10
[ 331.290274] [ T1287] down_write+0x8c/0x1f0
[ 331.290276] [ T1287] ? kernfs_remove_by_name_ns+0x53/0x160
[ 331.290279] [ T1287] ? __pfx_down_write+0x10/0x10
[ 331.290280] [ T1287] ? kernfs_root+0xac/0x1b0
[ 331.290282] [ T1287] ? lock_release+0x1b5/0x340
[ 331.290285] [ T1287] kernfs_remove_by_name_ns+0x53/0x160
[ 331.290288] [ T1287] sysfs_unmerge_group+0xd5/0x160
[ 331.290291] [ T1287] dev_pm_qos_hide_latency_tolerance+0x1f/0x60
[ 331.290294] [ T1287] nvme_uninit_ctrl+0x8f/0x110 [nvme_core]
[ 331.290312] [ T1287] nvme_tcp_create_ctrl+0x887/0xc20 [nvme_tcp]
[ 331.290317] [ T1287] ? nvmf_dev_write+0x2ff/0x830 [nvme_fabrics]
[ 331.290324] [ T1287] nvmf_dev_write+0x40b/0x830 [nvme_fabrics]
[ 331.290329] [ T1287] vfs_write+0x1cc/0xf70
[ 331.290333] [ T1287] ? __pfx_vfs_write+0x10/0x10
[ 331.290338] [ T1287] ksys_write+0x112/0x250
[ 331.290341] [ T1287] ? __pfx_ksys_write+0x10/0x10
[ 331.290343] [ T1287] ? kasan_quarantine_put+0x12e/0x260
[ 331.290346] [ T1287] ? kasan_quarantine_put+0x12e/0x260
[ 331.290348] [ T1287] do_syscall_64+0xdf/0x790
[ 331.290352] [ T1287] ? do_sys_openat2+0xfd/0x170
[ 331.290354] [ T1287] ? __pfx_do_sys_openat2+0x10/0x10
[ 331.290356] [ T1287] ? lock_is_held_type+0xf6/0x1b0
[ 331.290359] [ T1287] ? rcu_is_watching+0x11/0xb0
[ 331.290362] [ T1287] ? trace_hardirqs_on+0x14/0x1b0
[ 331.290364] [ T1287] ? lockdep_hardirqs_on+0x8c/0x130
[ 331.290366] [ T1287] ? __call_rcu_common.constprop.0+0x4af/0x1190
[ 331.290370] [ T1287] ? __x64_sys_openat+0x10a/0x210
[ 331.290372] [ T1287] ? __pfx___call_rcu_common.constprop.0+0x10/0x10
[ 331.290375] [ T1287] ? __pfx___x64_sys_openat+0x10/0x10
[ 331.290378] [ T1287] ? rcu_is_watching+0x11/0xb0
[ 331.290380] [ T1287] ? do_syscall_64+0x1ec/0x790
[ 331.290382] [ T1287] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 331.290384] [ T1287] ? lockdep_hardirqs_on+0x8c/0x130
[ 331.290386] [ T1287] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.290388] [ T1287] ? do_syscall_64+0x20a/0x790
[ 331.290391] [ T1287] ? fput_close_sync+0xda/0x1b0
[ 331.290393] [ T1287] ? __pfx_fput_close_sync+0x10/0x10
[ 331.290395] [ T1287] ? do_raw_spin_unlock+0x55/0x230
[ 331.290398] [ T1287] ? rcu_is_watching+0x11/0xb0
[ 331.290401] [ T1287] ? do_syscall_64+0x1ec/0x790
[ 331.290403] [ T1287] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 331.290405] [ T1287] ? lockdep_hardirqs_on+0x8c/0x130
[ 331.290407] [ T1287] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.290409] [ T1287] ? do_syscall_64+0x20a/0x790
[ 331.290410] [ T1287] ? count_memcg_events_mm.constprop.0+0x22/0x130
[ 331.290414] [ T1287] ? rcu_is_watching+0x11/0xb0
[ 331.290417] [ T1287] ? count_memcg_events+0x107/0x4e0
[ 331.290419] [ T1287] ? find_held_lock+0x2b/0x80
[ 331.290422] [ T1287] ? rcu_read_unlock+0x17/0x60
[ 331.290425] [ T1287] ? lock_release+0x1b5/0x340
[ 331.290427] [ T1287] ? find_held_lock+0x2b/0x80
[ 331.290430] [ T1287] ? exc_page_fault+0x94/0x140
[ 331.290432] [ T1287] ? lock_release+0x1b5/0x340
[ 331.290435] [ T1287] ? rcu_is_watching+0x11/0xb0
[ 331.290438] [ T1287] ? trace_hardirqs_on+0x14/0x1b0
[ 331.290439] [ T1287] ? preempt_count_add+0x7f/0x190
[ 331.290442] [ T1287] ? do_syscall_64+0x5d/0x790
[ 331.290444] [ T1287] ? do_syscall_64+0x8d/0x790
[ 331.290446] [ T1287] ? irqentry_exit+0xfc/0x790
[ 331.290449] [ T1287] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 331.290451] [ T1287] RIP: 0033:0x7f407046277e
[ 331.290456] [ T1287] Code: 4d 89 d8 e8 d4 bc 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 13 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 331.290458] [ T1287] RSP: 002b:00007ffc375d7ae0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 331.290463] [ T1287] RAX: ffffffffffffffda RBX: 000000003c2f7680 RCX: 00007f407046277e
[ 331.290464] [ T1287] RDX: 00000000000000cf RSI: 000000003c2f7680 RDI: 0000000000000003
[ 331.290466] [ T1287] RBP: 00007ffc375d7af0 R08: 0000000000000000 R09: 0000000000000000
[ 331.290467] [ T1287] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
[ 331.290468] [ T1287] R13: 000000003c2f5540 R14: 00007f4070638818 R15: 0000000000000001
[ 331.290472] [ T1287] </TASK>
[ 331.336870] [ C0] clocksource: Watchdog remote CPU 3 read timed out
[ 331.494199] [ T373] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 331.504462] [ T1298] nvme nvme5: creating 4 I/O queues.
[ 331.546425] [ T1298] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 331.552151] [ T1298] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 332.171342] [ T1363] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[9] nvme/063 dmesg on v7.1 kernel
[ 353.490641] [ T1285] run blktests nvme/063 at 2026-06-24 10:54:40
[ 353.790326] [ T1754] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 353.823845] [ T1757] nvmet: Allow non-TLS connections while TLS1.3 is enabled
[ 353.835314] [ T1760] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 354.034947] [ T458] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[ 354.053192] [ T358] nvme nvme5: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[ 354.055130] [ T1770] nvme nvme5: qid 0: authenticated
[ 354.163171] [ T11] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 354.173339] [ T1770] nvme nvme5: creating 4 I/O queues.
[ 354.234059] [ T1770] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 354.244089] [ T1770] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 354.714629] [ T1846] nvme nvme5: resetting controller
[ 354.730193] [ T295] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[ 354.738053] [ T49] nvme nvme5: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[ 354.739399] [ T361] nvme nvme5: qid 0: authenticated
[ 354.770783] [ T295] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 354.776061] [ T361] nvme nvme5: creating 4 I/O queues.
[ 354.848673] [ T361] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 354.933728] [ T1861] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
[ 354.954040] [ T1861] ======================================================
[ 354.956228] [ T1861] WARNING: possible circular locking dependency detected
[ 354.958462] [ T1861] 7.1.0 #5 Not tainted
[ 354.959747] [ T1861] ------------------------------------------------------
[ 354.961969] [ T1861] nvme/1861 is trying to acquire lock:
[ 354.963758] [ T1861] ffff88812a332518 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0xc1/0x2f0
[ 354.966758] [ T1861]
but task is already holding lock:
[ 354.969092] [ T1861] ffff88810d596bf8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x197/0x500
[ 354.971880] [ T1861]
which lock already depends on the new lock.
[ 354.976664] [ T1861]
the existing dependency chain (in reverse order) is:
[ 354.980056] [ T1861]
-> #5 (&q->elevator_lock){+.+.}-{4:4}:
[ 354.982450] [ T1861] __mutex_lock+0x1ae/0x2650
[ 354.983578] [ T1861] elevator_change+0x197/0x500
[ 354.984623] [ T1861] elv_iosched_store+0x38f/0x430
[ 354.985737] [ T1861] queue_attr_store+0x25f/0x3e0
[ 354.986901] [ T1861] kernfs_fop_write_iter+0x3d6/0x5e0
[ 354.988242] [ T1861] vfs_write+0x4b3/0xf70
[ 354.989283] [ T1861] ksys_write+0x112/0x250
[ 354.990311] [ T1861] do_syscall_64+0xdf/0x790
[ 354.991481] [ T1861] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 354.992718] [ T1861]
-> #4 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 354.994760] [ T1861] blk_alloc_queue+0x605/0x7a0
[ 354.995936] [ T1861] blk_mq_alloc_queue+0x168/0x270
[ 354.997058] [ T1861] scsi_alloc_sdev+0x86c/0xcd0
[ 354.998151] [ T1861] scsi_probe_and_add_lun+0x601/0xc50
[ 354.999296] [ T1861] __scsi_add_device+0x233/0x280
[ 355.000391] [ T1861] ata_scsi_scan_host+0x137/0x3a0
[ 355.001434] [ T1861] async_run_entry_fn+0x93/0x550
[ 355.002475] [ T1861] process_one_work+0x8b6/0x1650
[ 355.003571] [ T1861] worker_thread+0x5fd/0xfe0
[ 355.004559] [ T1861] kthread+0x36a/0x460
[ 355.005471] [ T1861] ret_from_fork+0x655/0x9d0
[ 355.006445] [ T1861] ret_from_fork_asm+0x1a/0x30
[ 355.007454] [ T1861]
-> #3 (fs_reclaim){+.+.}-{0:0}:
[ 355.009116] [ T1861] fs_reclaim_acquire+0xd5/0x120
[ 355.010151] [ T1861] __kmalloc_cache_noprof+0x58/0x720
[ 355.011270] [ T1861] __request_module+0x253/0x610
[ 355.012285] [ T1861] tcp_set_ulp+0x395/0x5e0
[ 355.013306] [ T1861] do_tcp_setsockopt+0x4a9/0x26a0
[ 355.014296] [ T1861] do_sock_setsockopt+0x163/0x3b0
[ 355.015198] [ T1861] __sys_setsockopt+0xe0/0x150
[ 355.016140] [ T1861] __x64_sys_setsockopt+0xb9/0x180
[ 355.017098] [ T1861] do_syscall_64+0xdf/0x790
[ 355.018038] [ T1861] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.019093] [ T1861]
-> #2 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 355.020508] [ T1861] lock_sock_nested+0x32/0xf0
[ 355.021427] [ T1861] tls_sw_sendmsg+0x1b4/0x23b0 [tls]
[ 355.022442] [ T1861] sock_sendmsg+0x305/0x3c0
[ 355.023332] [ T1861] nvme_tcp_init_connection+0x3d8/0x970 [nvme_tcp]
[ 355.024658] [ T1861] nvme_tcp_alloc_queue+0xf92/0x1ba0 [nvme_tcp]
[ 355.025873] [ T1861] nvme_tcp_alloc_admin_queue+0xff/0x440 [nvme_tcp]
[ 355.027046] [ T1861] nvme_tcp_setup_ctrl+0x188/0x8a0 [nvme_tcp]
[ 355.028079] [ T1861] nvme_tcp_create_ctrl+0x874/0xc20 [nvme_tcp]
[ 355.029186] [ T1861] nvmf_dev_write+0x40b/0x830 [nvme_fabrics]
[ 355.030272] [ T1861] vfs_write+0x1cc/0xf70
[ 355.031137] [ T1861] ksys_write+0x112/0x250
[ 355.032031] [ T1861] do_syscall_64+0xdf/0x790
[ 355.032919] [ T1861] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.033977] [ T1861]
-> #1 (&ctx->tx_lock){+.+.}-{4:4}:
[ 355.035415] [ T1861] __mutex_lock+0x1ae/0x2650
[ 355.036296] [ T1861] tls_sw_sendmsg+0x130/0x23b0 [tls]
[ 355.037287] [ T1861] sock_sendmsg+0x305/0x3c0
[ 355.038171] [ T1861] nvme_tcp_try_send_cmd_pdu+0x630/0xcf0 [nvme_tcp]
[ 355.039293] [ T1861] nvme_tcp_try_send+0x1ef/0xa60 [nvme_tcp]
[ 355.040362] [ T1861] nvme_tcp_queue_rq+0xfa3/0x19e0 [nvme_tcp]
[ 355.041470] [ T1861] blk_mq_dispatch_rq_list+0x3e0/0x2420
[ 355.042456] [ T1861] __blk_mq_sched_dispatch_requests+0x20a/0x15d0
[ 355.043618] [ T1861] blk_mq_sched_dispatch_requests+0xab/0x150
[ 355.044725] [ T1861] blk_mq_run_work_fn+0x135/0x2e0
[ 355.045739] [ T1861] process_one_work+0x8b6/0x1650
[ 355.046753] [ T1861] worker_thread+0x5fd/0xfe0
[ 355.047663] [ T1861] kthread+0x36a/0x460
[ 355.048507] [ T1861] ret_from_fork+0x655/0x9d0
[ 355.049433] [ T1861] ret_from_fork_asm+0x1a/0x30
[ 355.050354] [ T1861]
-> #0 (set->srcu){.+.+}-{0:0}:
[ 355.051837] [ T1861] __lock_acquire+0xe06/0x22e0
[ 355.052584] [ T1861] lock_sync+0xbf/0x120
[ 355.053245] [ T1861] __synchronize_srcu+0xe1/0x2f0
[ 355.053967] [ T1861] elevator_switch+0x2bd/0x670
[ 355.054671] [ T1861] elevator_change+0x2e7/0x500
[ 355.055360] [ T1861] elevator_set_none+0xaa/0xf0
[ 355.056129] [ T1861] blk_unregister_queue+0x15e/0x2e0
[ 355.056911] [ T1861] __del_gendisk+0x28f/0xaa0
[ 355.057530] [ T1861] del_gendisk+0x11a/0x1c0
[ 355.058166] [ T1861] nvme_ns_remove+0x331/0x940 [nvme_core]
[ 355.058897] [ T1861] nvme_remove_namespaces+0x289/0x3f0 [nvme_core]
[ 355.059623] [ T1861] nvme_do_delete_ctrl+0xf6/0x160 [nvme_core]
[ 355.060683] [ T1861] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
[ 355.061487] [ T1861] nvme_sysfs_delete+0xba/0xe0 [nvme_core]
[ 355.062274] [ T1861] kernfs_fop_write_iter+0x3d6/0x5e0
[ 355.063033] [ T1861] vfs_write+0x4b3/0xf70
[ 355.063694] [ T1861] ksys_write+0x112/0x250
[ 355.064372] [ T1861] do_syscall_64+0xdf/0x790
[ 355.065092] [ T1861] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.065912] [ T1861]
other info that might help us debug this:
[ 355.067898] [ T1861] Chain exists of:
set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock
[ 355.070120] [ T1861] Possible unsafe locking scenario:
[ 355.071461] [ T1861] CPU0 CPU1
[ 355.072232] [ T1861] ---- ----
[ 355.073014] [ T1861] lock(&q->elevator_lock);
[ 355.073692] [ T1861] lock(&q->q_usage_counter(io));
[ 355.074592] [ T1861] lock(&q->elevator_lock);
[ 355.075498] [ T1861] sync(set->srcu);
[ 355.076133] [ T1861]
*** DEADLOCK ***
[ 355.077737] [ T1861] 5 locks held by nvme/1861:
[ 355.078437] [ T1861] #0: ffff88812a9a6410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x112/0x250
[ 355.079384] [ T1861] #1: ffff888135ef9480 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x257/0x5e0
[ 355.080388] [ T1861] #2: ffff88810a849008 (kn->active#144){++++}-{0:0}, at: sysfs_remove_file_self+0x61/0xb0
[ 355.081392] [ T1861] #3: ffff88810641c1c8 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x112/0x1c0
[ 355.082436] [ T1861] #4: ffff88810d596bf8 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0x197/0x500
[ 355.083442] [ T1861]
stack backtrace:
[ 355.084570] [ T1861] CPU: 3 UID: 0 PID: 1861 Comm: nvme Not tainted 7.1.0 #5 PREEMPT(full)
[ 355.084573] [ T1861] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-10.fc44 06/10/2025
[ 355.084576] [ T1861] Call Trace:
[ 355.084578] [ T1861] <TASK>
[ 355.084580] [ T1861] dump_stack_lvl+0x6a/0x90
[ 355.084586] [ T1861] print_circular_bug.cold+0x189/0x1eb
[ 355.084590] [ T1861] check_noncircular+0x173/0x1a0
[ 355.084595] [ T1861] __lock_acquire+0xe06/0x22e0
[ 355.084598] [ T1861] lock_sync+0xbf/0x120
[ 355.084600] [ T1861] ? __synchronize_srcu+0xc1/0x2f0
[ 355.084602] [ T1861] ? __synchronize_srcu+0xc1/0x2f0
[ 355.084604] [ T1861] __synchronize_srcu+0xe1/0x2f0
[ 355.084606] [ T1861] ? __pfx___synchronize_srcu+0x10/0x10
[ 355.084608] [ T1861] ? lock_acquire+0x1a5/0x330
[ 355.084611] [ T1861] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 355.084614] [ T1861] ? synchronize_srcu+0xae/0x3f0
[ 355.084616] [ T1861] elevator_switch+0x2bd/0x670
[ 355.084619] [ T1861] elevator_change+0x2e7/0x500
[ 355.084622] [ T1861] elevator_set_none+0xaa/0xf0
[ 355.084624] [ T1861] ? __pfx_elevator_set_none+0x10/0x10
[ 355.084626] [ T1861] ? kobject_put+0x62/0x530
[ 355.084631] [ T1861] blk_unregister_queue+0x15e/0x2e0
[ 355.084633] [ T1861] __del_gendisk+0x28f/0xaa0
[ 355.084635] [ T1861] ? down_read+0xbd/0x4d0
[ 355.084637] [ T1861] ? down_read+0x148/0x4d0
[ 355.084638] [ T1861] ? __pfx___del_gendisk+0x10/0x10
[ 355.084641] [ T1861] ? __pfx_down_read+0x10/0x10
[ 355.084642] [ T1861] ? up_write+0x201/0x540
[ 355.084645] [ T1861] ? up_write+0x2ad/0x540
[ 355.084647] [ T1861] del_gendisk+0x11a/0x1c0
[ 355.084649] [ T1861] nvme_ns_remove+0x331/0x940 [nvme_core]
[ 355.084666] [ T1861] nvme_remove_namespaces+0x289/0x3f0 [nvme_core]
[ 355.084681] [ T1861] ? __pfx_nvme_remove_namespaces+0x10/0x10 [nvme_core]
[ 355.084693] [ T1861] nvme_do_delete_ctrl+0xf6/0x160 [nvme_core]
[ 355.084705] [ T1861] nvme_delete_ctrl_sync.cold+0x8/0xd [nvme_core]
[ 355.084717] [ T1861] nvme_sysfs_delete+0xba/0xe0 [nvme_core]
[ 355.084729] [ T1861] ? __pfx_sysfs_kf_write+0x10/0x10
[ 355.084732] [ T1861] kernfs_fop_write_iter+0x3d6/0x5e0
[ 355.084735] [ T1861] ? __pfx_kernfs_fop_write_iter+0x10/0x10
[ 355.084736] [ T1861] vfs_write+0x4b3/0xf70
[ 355.084739] [ T1861] ? __pfx_vfs_write+0x10/0x10
[ 355.084741] [ T1861] ? rcu_is_watching+0x11/0xb0
[ 355.084745] [ T1861] ? kmem_cache_free+0x163/0x6b0
[ 355.084749] [ T1861] ksys_write+0x112/0x250
[ 355.084751] [ T1861] ? __pfx_ksys_write+0x10/0x10
[ 355.084753] [ T1861] ? __x64_sys_close+0x87/0xf0
[ 355.084756] [ T1861] do_syscall_64+0xdf/0x790
[ 355.084758] [ T1861] ? __x64_sys_openat+0x10a/0x210
[ 355.084760] [ T1861] ? __pfx___x64_sys_openat+0x10/0x10
[ 355.084762] [ T1861] ? rcu_is_watching+0x11/0xb0
[ 355.084764] [ T1861] ? do_syscall_64+0x1ec/0x790
[ 355.084766] [ T1861] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 355.084769] [ T1861] ? lockdep_hardirqs_on+0x8c/0x130
[ 355.084772] [ T1861] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.084773] [ T1861] ? do_syscall_64+0x20a/0x790
[ 355.084776] [ T1861] ? lock_is_held_type+0xf6/0x1b0
[ 355.084778] [ T1861] ? rcu_is_watching+0x11/0xb0
[ 355.084780] [ T1861] ? trace_hardirqs_on+0x14/0x1b0
[ 355.084781] [ T1861] ? lockdep_hardirqs_on+0x8c/0x130
[ 355.084783] [ T1861] ? __call_rcu_common.constprop.0+0x4af/0x11a0
[ 355.084785] [ T1861] ? __call_rcu_common.constprop.0+0x4af/0x11a0
[ 355.084787] [ T1861] ? __pfx___call_rcu_common.constprop.0+0x10/0x10
[ 355.084791] [ T1861] ? fput_close_sync+0xda/0x1b0
[ 355.084792] [ T1861] ? kmem_cache_free+0x47e/0x6b0
[ 355.084795] [ T1861] ? fput_close_sync+0xda/0x1b0
[ 355.084796] [ T1861] ? __pfx_fput_close_sync+0x10/0x10
[ 355.084798] [ T1861] ? do_raw_spin_unlock+0x55/0x230
[ 355.084800] [ T1861] ? rcu_is_watching+0x11/0xb0
[ 355.084802] [ T1861] ? do_syscall_64+0x1ec/0x790
[ 355.084803] [ T1861] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 355.084805] [ T1861] ? lockdep_hardirqs_on+0x8c/0x130
[ 355.084807] [ T1861] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.084808] [ T1861] ? do_syscall_64+0x20a/0x790
[ 355.084809] [ T1861] ? do_syscall_64+0x1ec/0x790
[ 355.084811] [ T1861] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 355.084812] [ T1861] ? lockdep_hardirqs_on+0x8c/0x130
[ 355.084814] [ T1861] ? entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.084815] [ T1861] ? do_syscall_64+0x20a/0x790
[ 355.084817] [ T1861] ? trace_hardirqs_on_prepare+0x14c/0x1a0
[ 355.084818] [ T1861] ? rcu_is_watching+0x11/0xb0
[ 355.084820] [ T1861] ? trace_hardirqs_on+0x14/0x1b0
[ 355.084822] [ T1861] ? preempt_count_add+0x7f/0x190
[ 355.084825] [ T1861] ? do_syscall_64+0x5d/0x790
[ 355.084826] [ T1861] ? do_syscall_64+0x8d/0x790
[ 355.084828] [ T1861] ? irqentry_exit+0xfc/0x790
[ 355.084830] [ T1861] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 355.084832] [ T1861] RIP: 0033:0x7fe310bc408e
[ 355.084836] [ T1861] Code: 4d 89 d8 e8 94 bd 00 00 4c 8b 5d f8 41 8b 93 08 03 00 00 59 5e 48 83 f8 fc 74 11 c9 c3 0f 1f 80 00 00 00 00 48 8b 45 10 0f 05 <c9> c3 83 e2 39 83 fa 08 75 e7 e8 03 ff ff ff 0f 1f 00 f3 0f 1e fa
[ 355.084838] [ T1861] RSP: 002b:00007ffe4bbbbdf0 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 355.084841] [ T1861] RAX: ffffffffffffffda RBX: 00007fe310d8a566 RCX: 00007fe310bc408e
[ 355.084842] [ T1861] RDX: 0000000000000001 RSI: 00007fe310d8a566 RDI: 0000000000000003
[ 355.084843] [ T1861] RBP: 00007ffe4bbbbe00 R08: 0000000000000000 R09: 0000000000000000
[ 355.084844] [ T1861] R10: 0000000000000000 R11: 0000000000000202 R12: 000000001ddfd910
[ 355.084845] [ T1861] R13: 00007ffe4bbbe5e6 R14: 000000001ddfd720 R15: 000000001ddffb50
[ 355.084848] [ T1861] </TASK>
[ 355.363487] [ T295] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[ 355.381067] [ T49] nvme nvme5: qid 0: authenticated with hash hmac(sha384) dhgroup ffdhe3072
[ 355.385933] [ T1873] nvme nvme5: qid 0: authenticated
[ 355.413432] [ T11] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349, TLS.
[ 355.431349] [ T1873] nvme nvme5: creating 4 I/O queues.
[ 355.473722] [ T1873] nvme nvme5: mapped 4/0/0 default/read/poll queues.
[ 355.479399] [ T1873] nvme nvme5: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 355.956596] [ T1931] nvme nvme5: Removing ctrl: NQN "blktests-subsystem-1"
^ permalink raw reply
* [PATCH] RDMA/core: Fix memory leak in __ib_create_cq() on invalid cqe
From: Chenguang Zhao @ 2026-06-24 2:59 UTC (permalink / raw)
To: jgg, leon
Cc: edwards, mbloch, michaelgur, msanalla, ohartoov, jiri,
kalesh-anakkur.purayil, linux-rdma, Chenguang Zhao
Free the allocated CQ object when cqe is zero before returning
-EINVAL.
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
drivers/infiniband/core/verbs.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 3b613b57e269..d6b2eb820061 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -2200,8 +2200,10 @@ struct ib_cq *__ib_create_cq(struct ib_device *device,
if (!cq)
return ERR_PTR(-ENOMEM);
- if (WARN_ON_ONCE(!cq_attr->cqe))
+ if (WARN_ON_ONCE(!cq_attr->cqe)) {
+ kfree(cq);
return ERR_PTR(-EINVAL);
+ }
cq->device = device;
cq->comp_handler = comp_handler;
--
2.25.1
^ permalink raw reply related
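The error path fixed by the patch above can be illustrated with a minimal userspace sketch. It is not the kernel code: `demo_cq`, `demo_cq_attr` and `demo_create_cq()` are hypothetical stand-ins, and `calloc()`/`free()` take the place of the kernel allocation and the `kfree(cq)` that the patch adds. The point is only that any failure after the allocation has to release the object before returning.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the CQ object and its init attributes. */
struct demo_cq {
	int cqe;
};

struct demo_cq_attr {
	int cqe;
};

/*
 * Returns 0 and stores the new object in *out, or a negative errno.
 * Every failure after the allocation frees the object first.
 */
int demo_create_cq(const struct demo_cq_attr *attr, struct demo_cq **out)
{
	struct demo_cq *cq = calloc(1, sizeof(*cq));

	if (!cq)
		return -ENOMEM;

	if (!attr->cqe) {
		free(cq);	/* plays the role of the kfree(cq) added by the patch */
		return -EINVAL;
	}

	cq->cqe = attr->cqe;
	*out = cq;
	return 0;
}
```

An equivalent shape, not what the patch does, would be to validate `cqe` before allocating, so that the failure path has nothing to free.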
* Re: [PATCH v9 3/4] vfio/pci: implement get_pci_tph and DMA_BUF_TPH feature
From: Alex Williamson @ 2026-06-23 18:17 UTC (permalink / raw)
To: Zhiping Zhang
Cc: Jason Gunthorpe, Leon Romanovsky, Michael Guralnik, Sumit Semwal,
Christian Konig, Bjorn Helgaas, kvm, linux-rdma, linux-pci,
dri-devel, alex
In-Reply-To: <20260622184211.2229399-4-zhipingz@meta.com>
On Mon, 22 Jun 2026 11:41:36 -0700
Zhiping Zhang <zhipingz@meta.com> wrote:
> Implement dma-buf get_pci_tph for vfio-pci exported dma-bufs and add
> VFIO_DEVICE_FEATURE_DMA_BUF_TPH so userspace can publish TPH metadata
> for a VFIO-owned device.
>
> 8-bit ST and 16-bit Extended ST are distinct PCIe TPH namespaces; the
> uAPI carries both with explicit validity flags, and get_pci_tph()
> returns the value matching the importer's requested namespace or
> -EOPNOTSUPP.
>
> Publish and read the TPH descriptor under dmabuf->resv, matching the
> locking used for other importer-visible dma-buf state. The SET ioctl
> takes dma_resv_lock_interruptible(), while the callback runs under
> DMA-buf's asserted resv lock.
>
> Reject requests the device cannot consume as a completer:
> pcie_tph_completer_type() must report at least
> PCI_EXP_DEVCAP2_TPH_COMP_TPH_ONLY, and Extended ST requires
> PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH. Make PROBE follow the same hardware
> gate so the feature only probes as supported when the device can really
> consume it.
>
> Signed-off-by: Zhiping Zhang <zhipingz@meta.com>
> ---
> drivers/vfio/pci/vfio_pci_core.c | 3 +
> drivers/vfio/pci/vfio_pci_dmabuf.c | 97 +++++++++++++++++++++++++++++-
> drivers/vfio/pci/vfio_pci_priv.h | 12 ++++
> include/uapi/linux/vfio.h | 37 ++++++++++++
> 4 files changed, 148 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
> index a28f1e99362c..c7d6902bc61b 100644
> --- a/drivers/vfio/pci/vfio_pci_core.c
> +++ b/drivers/vfio/pci/vfio_pci_core.c
> @@ -1572,6 +1572,9 @@ int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
> return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
> case VFIO_DEVICE_FEATURE_DMA_BUF:
> return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
> + case VFIO_DEVICE_FEATURE_DMA_BUF_TPH:
> + return vfio_pci_core_feature_dma_buf_tph(vdev, flags, arg,
> + argsz);
> default:
> return -ENOTTY;
> }
> diff --git a/drivers/vfio/pci/vfio_pci_dmabuf.c b/drivers/vfio/pci/vfio_pci_dmabuf.c
> index c16f460c01d6..d6f5dd321000 100644
> --- a/drivers/vfio/pci/vfio_pci_dmabuf.c
> +++ b/drivers/vfio/pci/vfio_pci_dmabuf.c
> @@ -3,6 +3,7 @@
> */
> #include <linux/dma-buf-mapping.h>
> #include <linux/pci-p2pdma.h>
> +#include <linux/pci-tph.h>
> #include <linux/dma-resv.h>
>
> #include "vfio_pci_priv.h"
> @@ -19,7 +20,14 @@ struct vfio_pci_dma_buf {
> u32 nr_ranges;
> struct kref kref;
> struct completion comp;
> - u8 revoked : 1;
> +
> + /* Protected by dmabuf->resv. */
Nit, it would be more accurate to say:
/*
* Updates protected by dmabuf->resv, @revoked additionally
* protected by memory_lock.
*/
@revoked also has an unprotected read, but it is pre-existing and
benign, and likely just needs a READ_ONCE() annotation.
> + u16 tph_st_ext;
> + u8 tph_st;
> + u8 revoked:1;
> + u8 tph_st_valid:1;
> + u8 tph_st_ext_valid:1;
> + u8 tph_ph:2;
> };
>
> static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
> @@ -69,6 +77,26 @@ vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
> return ret;
> }
>
> +static int vfio_pci_dma_buf_get_pci_tph(struct dma_buf *dmabuf, bool extended,
> + u16 *steering_tag, u8 *ph)
> +{
> + struct vfio_pci_dma_buf *priv = dmabuf->priv;
> +
> + dma_resv_assert_held(dmabuf->resv);
> +
> + if (extended) {
> + if (!priv->tph_st_ext_valid)
> + return -EOPNOTSUPP;
> + *steering_tag = priv->tph_st_ext;
> + } else {
> + if (!priv->tph_st_valid)
> + return -EOPNOTSUPP;
> + *steering_tag = priv->tph_st;
> + }
> + *ph = priv->tph_ph;
> + return 0;
> +}
> +
> static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
> struct sg_table *sgt,
> enum dma_data_direction dir)
> @@ -101,6 +129,7 @@ static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
>
> static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
> .attach = vfio_pci_dma_buf_attach,
> + .get_pci_tph = vfio_pci_dma_buf_get_pci_tph,
> .map_dma_buf = vfio_pci_dma_buf_map,
> .unmap_dma_buf = vfio_pci_dma_buf_unmap,
> .release = vfio_pci_dma_buf_release,
> @@ -333,6 +362,72 @@ int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> return ret;
> }
>
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> + u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz)
> +{
> + struct vfio_device_feature_dma_buf_tph set_tph;
> + struct vfio_pci_dma_buf *priv;
> + struct dma_buf *dmabuf;
> + u8 comp;
> + int ret;
> +
> + comp = pcie_tph_completer_type(vdev->pdev);
> + if (comp == PCI_EXP_DEVCAP2_TPH_COMP_NONE)
> + return -EOPNOTSUPP;
> +
> + ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET,
> + sizeof(set_tph));
> + if (ret != 1)
> + return ret;
> +
> + if (copy_from_user(&set_tph, arg, sizeof(set_tph)))
> + return -EFAULT;
> +
> + if (set_tph.flags & ~(VFIO_DMA_BUF_TPH_ST | VFIO_DMA_BUF_TPH_ST_EXT))
> + return -EINVAL;
> +
> + if (set_tph.ph & ~0x3)
> + return -EINVAL;
> +
> + if ((set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT) &&
> + comp != PCI_EXP_DEVCAP2_TPH_COMP_EXT_TPH)
> + return -EOPNOTSUPP;
> +
> + dmabuf = dma_buf_get(set_tph.dmabuf_fd);
> + if (IS_ERR(dmabuf))
> + return PTR_ERR(dmabuf);
> +
> + if (dmabuf->ops != &vfio_pci_dmabuf_ops) {
> + ret = -EINVAL;
> + goto out_put;
> + }
> +
> + priv = dmabuf->priv;
> + if (priv->vdev != vdev) {
> + ret = -EINVAL;
> + goto out_put;
> + }
> +
> + ret = dma_resv_lock_interruptible(dmabuf->resv, NULL);
> + if (ret)
> + goto out_put;
> +
> + priv->tph_st = set_tph.steering_tag;
> + priv->tph_st_ext = set_tph.steering_tag_ext;
> + priv->tph_ph = set_tph.ph;
> + priv->tph_st_valid = !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST);
> + priv->tph_st_ext_valid =
> + !!(set_tph.flags & VFIO_DMA_BUF_TPH_ST_EXT);
> + dma_resv_unlock(dmabuf->resv);
> + ret = 0;
> +
> +out_put:
> + dma_buf_put(dmabuf);
> + return ret;
> +}
> +
> void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
> {
> struct vfio_pci_dma_buf *priv;
> diff --git a/drivers/vfio/pci/vfio_pci_priv.h b/drivers/vfio/pci/vfio_pci_priv.h
> index fca9d0dfac90..c58f369be4b3 100644
> --- a/drivers/vfio/pci/vfio_pci_priv.h
> +++ b/drivers/vfio/pci/vfio_pci_priv.h
> @@ -118,6 +118,10 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> struct vfio_device_feature_dma_buf __user *arg,
> size_t argsz);
> +int vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev,
> + u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz);
> void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
> void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
> #else
> @@ -128,6 +132,14 @@ vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
> {
> return -ENOTTY;
> }
> +
> +static inline int
> +vfio_pci_core_feature_dma_buf_tph(struct vfio_pci_core_device *vdev, u32 flags,
> + struct vfio_device_feature_dma_buf_tph __user *arg,
> + size_t argsz)
> +{
> + return -ENOTTY;
> +}
> static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
> {
> }
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 5de618a3a5ee..2d30ba43e2cf 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -1534,6 +1534,43 @@ struct vfio_device_feature_dma_buf {
> */
> #define VFIO_DEVICE_FEATURE_MIG_PRECOPY_INFOv2 12
>
> +/**
> + * Upon VFIO_DEVICE_FEATURE_SET associate TPH (TLP Processing Hints) metadata
> + * with a vfio-exported dma-buf. The dma-buf must have been created by
> + * VFIO_DEVICE_FEATURE_DMA_BUF on this device, and the device must report
> + * TPH Completer support in Device Capabilities 2 (bits 13:12); requests
> + * carrying VFIO_DMA_BUF_TPH_ST_EXT additionally require the device to
> + * report the Extended TPH Completer encoding. Otherwise the ioctl
> + * returns -EOPNOTSUPP.
> + *
> + * dmabuf_fd is the file descriptor returned by VFIO_DEVICE_FEATURE_DMA_BUF.
> + *
> + * 8-bit ST (steering_tag) and 16-bit Extended ST (steering_tag_ext) are
> + * distinct namespaces. Userspace supplies whichever values are valid and sets
> + * the matching VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT bits in @flags;
> + * an importer requests one namespace and receives the matching value.
> + *
> + * @flags == 0 marks any previously published ST / Extended-ST as invalid
> + * for future PCI TPH queries on this dma-buf.
I think we should avoid "published" as it still suggests we're somehow
able to invalidate what was previously reported and consumed. It's
offset by the trailing clause here, but that's absent below.
Also, we're noting @flags == 0 as if it's a special case, but we really
need the clarity that if either flag bit is not set, the corresponding
field is marked invalid for future queries. Perhaps something like:
* @flags is the authoritative validity for each namespace: when
* VFIO_DMA_BUF_TPH_ST is set, @steering_tag becomes the valid 8-bit ST; when
* VFIO_DMA_BUF_TPH_ST_EXT is set, @steering_tag_ext becomes the valid 16-bit
* Extended ST. A namespace whose bit is clear is marked invalid and
* reported as unsupported to importers requesting it.
*
* Each SET fully replaces the dma-buf's TPH state: any namespace not selected
* in @flags is left invalid, so @flags == 0 marks both ST and Extended ST
* invalid. This only affects TPH queries made after the SET completes; an
* importer that has already retrieved a value is unaffected. Userspace must
* therefore configure TPH before handing the dma-buf fd to an importer.
> + *
> + * ph is the 2-bit TLP Processing Hint and must be in the range [0, 3].
Slight inconsistency here: this should be @ph. It might also help
silence the Sashiko warning to note:
* Undefined @flags and @ph bits must always be zero.
> + *
> + * Userspace must publish TPH before handing the dma-buf fd to an importer.
> + * Calling SET again replaces the published values.
The above suggestion is meant to replace this as well.
> + */
> +#define VFIO_DEVICE_FEATURE_DMA_BUF_TPH 13
There are several series in-flight contending for device feature
indexes, so this should really go in through the vfio tree to reduce
the risk of duplicates. We also still need acks from the relevant
maintainers for PCI, dma-buf, and mlx5 before this can be queued for
v7.3. Thanks,
Alex
> +
> +#define VFIO_DMA_BUF_TPH_ST (1 << 0)
> +#define VFIO_DMA_BUF_TPH_ST_EXT (1 << 1)
> +
> +struct vfio_device_feature_dma_buf_tph {
> + __s32 dmabuf_fd;
> + __u32 flags;
> + __u16 steering_tag_ext;
> + __u8 steering_tag;
> + __u8 ph;
> +};
> +
> /* -------- API for Type1 VFIO IOMMU -------- */
>
> /**
^ permalink raw reply
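The SET semantics proposed in the review above (each SET fully replaces the TPH state, and @flags alone decides which namespace is valid) can be modelled with a small userspace sketch. All `demo_*` and `DEMO_*` names are hypothetical, the dma-buf reservation lock and the device capability checks from the patch are deliberately left out, and this is only an illustration of the intended behaviour, not the vfio implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of VFIO_DMA_BUF_TPH_ST / VFIO_DMA_BUF_TPH_ST_EXT. */
#define DEMO_TPH_ST     (1u << 0)
#define DEMO_TPH_ST_EXT (1u << 1)

struct demo_tph_state {
	uint16_t st_ext;
	uint8_t st;
	uint8_t ph;
	bool st_valid;
	bool st_ext_valid;
};

/* SET: undefined flag or ph bits are rejected; state is fully replaced. */
int demo_tph_set(struct demo_tph_state *s, uint32_t flags,
		 uint8_t st, uint16_t st_ext, uint8_t ph)
{
	if (flags & ~(DEMO_TPH_ST | DEMO_TPH_ST_EXT))
		return -EINVAL;
	if (ph & ~0x3)
		return -EINVAL;

	s->st = st;
	s->st_ext = st_ext;
	s->ph = ph;
	/* A namespace whose bit is clear becomes invalid for later queries. */
	s->st_valid = !!(flags & DEMO_TPH_ST);
	s->st_ext_valid = !!(flags & DEMO_TPH_ST_EXT);
	return 0;
}

/* Query: return the value for the requested namespace, or -EOPNOTSUPP. */
int demo_tph_get(const struct demo_tph_state *s, bool extended,
		 uint16_t *steering_tag, uint8_t *ph)
{
	if (extended) {
		if (!s->st_ext_valid)
			return -EOPNOTSUPP;
		*steering_tag = s->st_ext;
	} else {
		if (!s->st_valid)
			return -EOPNOTSUPP;
		*steering_tag = s->st;
	}
	*ph = s->ph;
	return 0;
}
```

In this model a second SET carrying only `DEMO_TPH_ST_EXT` leaves the 8-bit ST invalid, and a SET with `flags == 0` invalidates both namespaces. A value that a caller already read through `demo_tph_get()` is unaffected, which is why the review asks userspace to configure TPH before handing the fd to an importer.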
* Re: [PATCH net] net/mlx5e: Use sender devcom for MPV master-up
From: manjunath.b.patil @ 2026-06-23 17:51 UTC (permalink / raw)
To: Tariq Toukan, Saeed Mahameed, Mark Bloch, Leon Romanovsky, netdev
Cc: Andrew Lunn, David S . Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Patrisious Haddad, linux-rdma, linux-kernel, stable
In-Reply-To: <293db0b4-f308-469e-99c1-ef1b57d41451@nvidia.com>
On 6/22/26 2:01 AM, Tariq Toukan wrote:
>
>
> On 10/06/2026 20:39, Manjunath Patil wrote:
>> After PCIe DPC recovery, mlx5 reloads the affected functions and
>> replays multiport affiliation events. In the reported failure, the
>> first relevant device error was:
>>
>> pcieport 0000:10:01.1: DPC: containment event
>> pcieport 0000:10:01.1: PCIe Bus Error: severity=Uncorrected (Fatal)
>> pcieport 0000:10:01.1: [ 5] SDES (First)
>>
>> mlx5 recovered the PCI functions and resumed 0000:11:00.1. During
>> that resume, RDMA multiport binding replayed
>> MLX5_DRIVER_EVENT_AFFILIATION_DONE and mlx5e sent
>> MPV_DEVCOM_MASTER_UP. The host then panicked with:
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000010
>> RIP: mlx5_devcom_comp_set_ready+0x5/0x40 [mlx5_core]
>> RDI: 0000000000000000
>>
>> Call trace included:
>>
>> mlx5_devcom_comp_set_ready
>> mlx5e_devcom_event_mpv
>> mlx5_devcom_send_event
>> mlx5_ib_bind_slave_port
>> mlx5r_mp_probe
>> mlx5_pci_resume
>>
>> MPV devcom registration publishes mlx5e private data to the component
>> peer list before mlx5e_devcom_init_mpv() stores the returned component
>> device in priv->devcom. A concurrent master-up event can therefore
>> reach a peer whose private data is visible but whose priv->devcom
>> backpointer is still NULL.
>>
>> MPV_DEVCOM_MASTER_UP already carries the sender/master mlx5e private
>> data as event_data. The ready bit is stored on the shared devcom
>> component, not on an individual peer. Use the sender devcom when
>> marking the MPV component ready.
>>
>> This preserves the readiness transition while avoiding a NULL
>> dereference of the peer devcom pointer during affiliation replay after
>> PCI error recovery.
>>
>> Fixes: bf11485f8419 ("net/mlx5: Register mlx5e priv to devcom in MPV mode")
>> Assisted-by: Codex:gpt-5
>> Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
>> Cc: stable@vger.kernel.org # 6.7+
>> ---
>
> Thanks for your patch and sorry for the late response.
>
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index 8f2b3abe0092..f7ff20b97e8c 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -211,11 +211,14 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
>> static int mlx5e_devcom_event_mpv(int event, void *my_data, void *event_data)
>> {
>> - struct mlx5e_priv *slave_priv = my_data;
>> + struct mlx5e_priv *master_priv = event_data;
>
> makes sense.
>
>> switch (event) {
>> case MPV_DEVCOM_MASTER_UP:
>> - mlx5_devcom_comp_set_ready(slave_priv->devcom, true);
>> + if (!master_priv || !master_priv->devcom)
>> + return -EINVAL;
>
> is this currently possible? or just being defensive?
> if this return is unreachable I'd drop it.
Yes, the check is only defensive. For MPV_DEVCOM_MASTER_UP, event_data
is passed from mlx5e_devcom_init_mpv() after priv->devcom has been
assigned, so it should not be reachable in the valid path.
Please feel free to drop the check while applying. If you prefer a v2,
let me know and I will send one.
Thanks,
Manjunath
>
>> +
>> + mlx5_devcom_comp_set_ready(master_priv->devcom, true);
>> break;
>> case MPV_DEVCOM_MASTER_DOWN:
>> /* no need for comp set ready false since we unregister after
>
^ permalink raw reply
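The race described in the commit message above can be reduced to a small userspace model. The types and names below (`demo_devcom`, `demo_priv`, `demo_event_mpv()`) are hypothetical and only mimic the shape of the handler: the receiving peer's `devcom` backpointer may still be NULL, while the sender passed as `event_data` already holds a pointer to the shared component. The NULL check is kept as in the posted patch, although the thread notes it is only defensive. Returning 0 for other events is an assumption of this sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared component: the ready bit lives here, not per peer. */
struct demo_devcom {
	bool ready;
};

/* Hypothetical stand-in for the per-device private data. */
struct demo_priv {
	struct demo_devcom *devcom;	/* may still be NULL on a new peer */
};

enum demo_event {
	DEMO_MASTER_UP,
	DEMO_MASTER_DOWN,
};

/* my_data is the receiving peer, event_data is the sender (master). */
int demo_event_mpv(int event, void *my_data, void *event_data)
{
	struct demo_priv *master_priv = event_data;

	(void)my_data;	/* the peer's backpointer is never dereferenced */

	switch (event) {
	case DEMO_MASTER_UP:
		if (!master_priv || !master_priv->devcom)
			return -EINVAL;
		/* Mark the shared component ready through the sender. */
		master_priv->devcom->ready = true;
		break;
	case DEMO_MASTER_DOWN:
		break;
	}

	return 0;
}
```

Because the ready bit belongs to the shared component, setting it through the sender's pointer has the same effect as setting it through the peer's, without depending on when the peer's backpointer is stored.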
* Please backport f6b079629bec ("RDMA/bnxt_re: zero shared page before exposing to userspace") to stable
From: pomzm67 @ 2026-06-23 14:17 UTC (permalink / raw)
To: stable
Cc: selvin.xavier, kalesh-anakkur.purayil, jgg, leon, eddie.wai,
somnath.kotur, sriharsha.basavapatna, devesh.sharma, dledford,
gregkh, linux-rdma, linux-kernel, Lord Ulf Henrik Holmberg
From: Lord Ulf Henrik Holmberg <henrik.holmberg@defensify.se>
Hi,
Could the following upstream commit be queued for the active stable
trees? It does not carry a Cc: stable tag and does not appear to have
been picked up by AUTOSEL.
commit f6b079629becfa977f9c51fe53ad2e6dcc55ef44
("RDMA/bnxt_re: zero shared page before exposing to userspace")
It fixes a kernel-memory information leak: bnxt_re_alloc_ucontext()
allocates uctx->shpg with __get_free_page() (no __GFP_ZERO) and then
maps the whole page into userspace via vm_insert_page() under
BNXT_RE_MMAP_SH_PAGE. The driver writes only 4 bytes (a u32 AVID) to
the page, so the remaining 4092 bytes of stale kernel memory are
exposed to any user with access to /dev/infiniband/uverbsX (typically
rdma group membership).
It carries:
Fixes: 1ac5a4047975 ("RDMA/bnxt_re: Add bnxt_re RoCE driver")
so every kernel from 4.10 onwards is affected. Please apply to the
6.6, 6.12 and 7.0 stable trees (and any other active trees you deem
appropriate).
Thanks,
Lord Ulf Henrik Holmberg (Defensify)
^ permalink raw reply
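The leak described in the request above can be sketched in userspace. This is not the driver code, and the upstream commit may implement the zeroing differently. `demo_alloc_stale_page()` is a hypothetical helper that simulates a page returned without `__GFP_ZERO` by filling it with a recognisable pattern, and a 4096-byte page is assumed, matching the 4 + 4092 byte split mentioned in the message.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_PAGE_SIZE 4096	/* assumed page size */

/* Simulates an allocation that still holds stale data (pattern 0xA5). */
unsigned char *demo_alloc_stale_page(void)
{
	unsigned char *page = malloc(DEMO_PAGE_SIZE);

	if (page)
		memset(page, 0xA5, DEMO_PAGE_SIZE);
	return page;
}

/* Zero the whole page first, then write the single 4-byte value. */
unsigned char *demo_alloc_shared_page(uint32_t avid)
{
	unsigned char *page = demo_alloc_stale_page();

	if (!page)
		return NULL;
	memset(page, 0, DEMO_PAGE_SIZE);
	memcpy(page, &avid, sizeof(avid));
	return page;
}

/* Counts non-zero bytes after the first 4, i.e. what a reader could see. */
size_t demo_count_stale_bytes(const unsigned char *page)
{
	size_t n = 0;

	for (size_t i = sizeof(uint32_t); i < DEMO_PAGE_SIZE; i++)
		if (page[i] != 0)
			n++;
	return n;
}
```

Writing only the 4-byte value into a page from `demo_alloc_stale_page()` leaves all 4092 remaining bytes visible to whoever maps the page, while a page from `demo_alloc_shared_page()` exposes none.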