* [PATCH for-rc 0/3] RDMA/hns: Misc fixes
From: Junxian Huang @ 2026-05-20 5:57 UTC
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang
This patchset contains several fixes for hns.
Junxian Huang (1):
  RDMA/hns: Fix memory leak of bonding resource

Lianfa Weng (2):
  RDMA/hns: Fix warning in poll cq direct mode
  RDMA/hns: Fix log flood after cmd_mbox failure

 drivers/infiniband/hw/hns/hns_roce_cq.c    |  6 +++---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 18 +++++++++---------
 drivers/infiniband/hw/hns/hns_roce_main.c  |  6 ++++--
 drivers/infiniband/hw/hns/hns_roce_mr.c    |  6 +++---
 drivers/infiniband/hw/hns/hns_roce_srq.c   |  2 +-
 5 files changed, 20 insertions(+), 18 deletions(-)

--
2.33.0
* [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource
From: Junxian Huang @ 2026-05-20 5:57 UTC
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang

In a corner case of concurrent driver removal and driver reset,
bonding resource is first released in hns_roce_hw_v2_exit() during
driver removal, and then is allocated again in hns_roce_register_device()
during driver reset. This leads to memory leak because the release
timing has already passed. This may also lead to a kernel panic
as below because of the leaked notifier callback:

Call trace:
 0xffffa20fccc04978 (P)
 raw_notifier_call_chain+0x20/0x38
 call_netdevice_notifiers_info+0x60/0xb8
 netdev_lower_state_changed+0x4c/0xb8

Bonding resource allocation and release should occur only during
driver init and removal, so don't do the allocation during reset.

Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
 drivers/infiniband/hw/hns/hns_roce_main.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index c17ff5347a01..a7308a3c586e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -795,6 +795,7 @@ static const struct ib_device_ops hns_roce_dev_restrack_ops = {
 
 static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
 {
+	struct hns_roce_v2_priv *priv = hr_dev->priv;
 	struct hns_roce_ib_iboe *iboe = NULL;
 	struct device *dev = hr_dev->dev;
 	struct ib_device *ib_dev = NULL;
@@ -838,7 +839,8 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
 
 	dma_set_max_seg_size(dev, SZ_2G);
 
-	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND) {
+	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND &&
+	    priv->handle->rinfo.reset_state != HNS_ROCE_STATE_RST_INIT) {
 		ret = hns_roce_alloc_bond_grp(hr_dev);
 		if (ret) {
 			dev_err(dev, "failed to alloc bond_grp for bus %u, ret = %d\n",
--
2.33.0
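Note on the condition added in patch 1/3: it can be modelled in isolation as shown here. This is a minimal userspace sketch, not driver code. `struct dev_model`, `enum rst_state`, `register_device()` and `driver_remove()` are hypothetical stand-ins for the hns structures and paths named in the commit message, and the allocation is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the reset state checked by the patch. */
enum rst_state { RST_NONE, RST_INIT };

/* Hypothetical model of the device; not the real struct hns_roce_dev. */
struct dev_model {
	bool cap_bond;              /* models HNS_ROCE_CAP_FLAG_BOND being set */
	enum rst_state reset_state; /* models rinfo.reset_state */
	int bond_allocs;            /* outstanding bonding allocations */
};

/* Mirrors the patched condition: allocate only when not re-initialising
 * as part of a reset. */
static bool should_alloc_bond_grp(const struct dev_model *d)
{
	return d->cap_bond && d->reset_state != RST_INIT;
}

/* Models the registration path, which runs at init and again after reset. */
static void register_device(struct dev_model *d)
{
	if (should_alloc_bond_grp(d))
		d->bond_allocs++;
}

/* Models the single release point in the removal path. */
static void driver_remove(struct dev_model *d)
{
	d->bond_allocs = 0;
}
```

In this model, a registration that happens with `reset_state == RST_INIT` after `driver_remove()` has run leaves `bond_allocs` at zero, which is the leak the commit message describes being avoided.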
* Re: [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource
From: Jason Gunthorpe @ 2026-05-25 14:38 UTC
To: Junxian Huang; +Cc: leon, linux-rdma, linuxarm, tangchengchang

On Wed, May 20, 2026 at 01:57:57PM +0800, Junxian Huang wrote:
> In a corner case of concurrent driver removal and driver reset,
> bonding resource is first released in hns_roce_hw_v2_exit() during
> driver removal, and then is allocated again in hns_roce_register_device()
> during driver reset. This leads to memory leak because the release
> timing has already passed. This may also lead to a kernel panic
> as below because of the leaked notifier callback:
>
> Call trace:
>  0xffffa20fccc04978 (P)
>  raw_notifier_call_chain+0x20/0x38
>  call_netdevice_notifiers_info+0x60/0xb8
>  netdev_lower_state_changed+0x4c/0xb8
>
> Bonding resource allocation and release should occur only during
> driver init and removal, so don't do the allocation during reset.
>
> Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources")
> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
> ---
>  drivers/infiniband/hw/hns/hns_roce_main.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
> index c17ff5347a01..a7308a3c586e 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> @@ -795,6 +795,7 @@ static const struct ib_device_ops hns_roce_dev_restrack_ops = {
>
>  static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
>  {
> +	struct hns_roce_v2_priv *priv = hr_dev->priv;
>  	struct hns_roce_ib_iboe *iboe = NULL;
>  	struct device *dev = hr_dev->dev;
>  	struct ib_device *ib_dev = NULL;
> @@ -838,7 +839,8 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
>
>  	dma_set_max_seg_size(dev, SZ_2G);
>
> -	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND) {
> +	if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND &&
> +	    priv->handle->rinfo.reset_state != HNS_ROCE_STATE_RST_INIT) {
>  		ret = hns_roce_alloc_bond_grp(hr_dev);
>  		if (ret) {
>  			dev_err(dev, "failed to alloc bond_grp for bus %u, ret = %d\n",

The sashiko comments about inverted teardown seem pretty reasonable?

https://sashiko.dev/#/patchset/20260520055759.2354037-1-huangjunxian6%40hisilic

It would be better to fix it that way instead of sprinkling this around.

The other comments seem less interesting.

Jason
* [PATCH for-rc 2/3] RDMA/hns: Fix warning in poll cq direct mode
From: Junxian Huang @ 2026-05-20 5:57 UTC
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang

From: Lianfa Weng <wenglianfa@huawei.com>

CQs allocated by ib_alloc_cq() always have a comp_handler. Though in
direct mode this handler is never expected to be called, it is still
called when the driver is reset, triggering the following WARN_ONCE():

Call trace:
 ib_cq_completion_direct+0x38/0x60
 hns_roce_cq_completion+0x54/0x90 [hns_roce_hw_v2]
 hns_roce_handle_device_err+0x1c8/0x340 [hns_roce_hw_v2]
 hns_roce_hw_v2_uninit_instance.constprop.0+0x34/0x70 [hns_roce_hw_v2]
 hns_roce_hw_v2_reset_notify+0xc4/0xe0 [hns_roce_hw_v2]
 hclge_notify_roce_client+0x60/0xbc [hclge]
 hclge_reset_rebuild+0x48/0x34c [hclge]
 hclge_reset_subtask+0xcc/0xec [hclge]
 hclge_reset_service_task+0x80/0x160 [hclge]
 hclge_service_task+0x50/0x80 [hclge]
 process_one_work+0x1cc/0x4d0
 worker_thread+0x154/0x414
 kthread+0x104/0x144
 ret_from_fork+0x10/0x18

Fixes: f295e4cece5c ("RDMA/hns: Delete unnecessary callback functions for cq")
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
 drivers/infiniband/hw/hns/hns_roce_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index a7308a3c586e..2b71c2b30bc8 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -1114,7 +1114,7 @@ static void check_and_get_armed_cq(struct list_head *cq_list, struct ib_cq *cq)
 	unsigned long flags;
 
 	spin_lock_irqsave(&hr_cq->lock, flags);
-	if (cq->comp_handler) {
+	if (cq->comp_handler && hr_cq->ib_cq.poll_ctx != IB_POLL_DIRECT) {
 		if (!hr_cq->is_armed) {
 			hr_cq->is_armed = 1;
 			list_add_tail(&hr_cq->node, cq_list);
--
2.33.0
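Note on the check changed in patch 2/3: the decision it makes can be written as a small predicate. This is a userspace sketch under stated assumptions, not the driver's code. `struct cq_model`, `enum poll_ctx`, `noop_handler()` and `arm_if_eligible()` are hypothetical names; only the shape of the condition (a handler is present and the CQ is not in direct poll mode) is taken from the diff.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the poll contexts; the real enum has more values. */
enum poll_ctx { POLL_DIRECT, POLL_SOFTIRQ, POLL_WORKQUEUE };

/* Hypothetical model of a CQ, keeping only the fields the check reads. */
struct cq_model {
	void (*comp_handler)(struct cq_model *cq);
	enum poll_ctx poll_ctx;
	bool is_armed;
};

/* Placeholder handler so a CQ can have a non-NULL comp_handler. */
static void noop_handler(struct cq_model *cq)
{
	(void)cq;
}

/* The patched condition: a handler alone is not enough, because CQs in
 * direct mode have one too but must never have it invoked. */
static bool cq_wants_completion_event(const struct cq_model *cq)
{
	return cq->comp_handler != NULL && cq->poll_ctx != POLL_DIRECT;
}

/* Returns true when the caller should queue this CQ for a completion
 * callback; arms each eligible CQ at most once. */
static bool arm_if_eligible(struct cq_model *cq)
{
	if (!cq_wants_completion_event(cq) || cq->is_armed)
		return false;
	cq->is_armed = true;
	return true;
}
```

With this predicate a direct-mode CQ is never placed on the list during the reset path, so its handler is not called and the warning in the trace above is not reached.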
* [PATCH for-rc 3/3] RDMA/hns: Fix log flood after cmd_mbox failure
From: Junxian Huang @ 2026-05-20 5:57 UTC
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang

From: Lianfa Weng <wenglianfa@huawei.com>

hns_roce_cmd_mbox() is the command interface between driver and
hardware. When hardware is abnormal, the unlimited error printings
after hns_roce_cmd_mbox() failure will cause log flood and even
system crash.

Replace ibdev_err() and ibdev_warn() with their ratelimited versions
in the error handling path after hns_roce_cmd_mbox() (and its wrappers
hns_roce_create_hw_ctx/hns_roce_destroy_hw_ctx) fails.

Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
 drivers/infiniband/hw/hns/hns_roce_cq.c    |  6 +++---
 drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 18 +++++++++---------
 drivers/infiniband/hw/hns/hns_roce_mr.c    |  6 +++---
 drivers/infiniband/hw/hns/hns_roce_srq.c   |  2 +-
 4 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 24de651f735e..1dd0efb5620d 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -174,9 +174,9 @@ static int hns_roce_create_cqc(struct hns_roce_dev *hr_dev,
 	ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_CQC,
 				     hr_cq->cqn);
 	if (ret)
-		ibdev_err(ibdev,
-			  "failed to send create cmd for CQ(0x%lx), ret = %d.\n",
-			  hr_cq->cqn, ret);
+		ibdev_err_ratelimited(ibdev,
+				      "failed to send create cmd for CQ(0x%lx), ret = %d.\n",
+				      hr_cq->cqn, ret);
 
 	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 4afd7d6ae3ca..332a4816f2ca 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -6193,9 +6193,9 @@ static int hns_roce_v2_modify_srq(struct ib_srq *ibsrq,
 				HNS_ROCE_CMD_MODIFY_SRQC, srq->srqn);
 		hns_roce_free_cmd_mailbox(hr_dev, mailbox);
 		if (ret)
-			ibdev_err(&hr_dev->ib_dev,
-				  "failed to handle cmd of modifying SRQ, ret = %d.\n",
-				  ret);
+			ibdev_err_ratelimited(&hr_dev->ib_dev,
+					      "failed to handle cmd of modifying SRQ, ret = %d.\n",
+					      ret);
 	}
 
 out:
@@ -6221,9 +6221,9 @@ static int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr)
 	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma,
 				HNS_ROCE_CMD_QUERY_SRQC, srq->srqn);
 	if (ret) {
-		ibdev_err(&hr_dev->ib_dev,
-			  "failed to process cmd of querying SRQ, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "failed to process cmd of querying SRQ, ret = %d.\n",
+				      ret);
 		goto out;
 	}
 
@@ -6329,9 +6329,9 @@ static int hns_roce_v2_query_mpt(struct hns_roce_dev *hr_dev, u32 key,
 	ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, HNS_ROCE_CMD_QUERY_MPT,
 				key_to_hw_index(key));
 	if (ret) {
-		ibdev_err(&hr_dev->ib_dev,
-			  "failed to process cmd when querying MPT, ret = %d.\n",
-			  ret);
+		ibdev_err_ratelimited(&hr_dev->ib_dev,
+				      "failed to process cmd when querying MPT, ret = %d.\n",
+				      ret);
 		goto err_mailbox;
 	}
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 896af1828a38..e8a9e7d8f267 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -173,7 +173,7 @@ static int hns_roce_mr_enable(struct hns_roce_dev *hr_dev,
 	ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT,
 				     mtpt_idx & (hr_dev->caps.num_mtpts - 1));
 	if (ret) {
-		dev_err(dev, "failed to create mpt, ret = %d.\n", ret);
+		dev_err_ratelimited(dev, "failed to create mpt, ret = %d.\n", ret);
 		goto err_page;
 	}
 
@@ -315,7 +315,7 @@ struct ib_mr *hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start,
 	ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_MPT,
 				      mtpt_idx);
 	if (ret)
-		ibdev_warn(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
+		ibdev_warn_ratelimited(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
 
 	mr->enabled = 0;
 	mr->iova = virt_addr;
@@ -346,7 +346,7 @@ struct ib_mr *hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start,
 	ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT,
 				     mtpt_idx);
 	if (ret) {
-		ibdev_err(ib_dev, "failed to create MPT, ret = %d.\n", ret);
+		ibdev_err_ratelimited(ib_dev, "failed to create MPT, ret = %d.\n", ret);
 		goto free_cmd_mbox;
 	}
 
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index 8644c3916367..00552a08f21a 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -103,7 +103,7 @@ static int hns_roce_create_srqc(struct hns_roce_dev *hr_dev,
 	ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_SRQ,
 				     srq->srqn);
 	if (ret)
-		ibdev_err(ibdev, "failed to config SRQC, ret = %d.\n", ret);
+		ibdev_err_ratelimited(ibdev, "failed to config SRQC, ret = %d.\n", ret);
 
 err_mbox:
 	hns_roce_free_cmd_mailbox(hr_dev, mailbox);
--
2.33.0
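Note on patch 3/3: the effect of switching to the `_ratelimited` helpers can be illustrated with a small interval-and-burst limiter. This is a userspace sketch of the general technique only. `struct ratelimit` and `ratelimit_allow()` are hypothetical names, time is passed in as a plain tick count, and the kernel's own limiter differs in detail (for example it takes a lock and uses jiffies).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state: at most `burst` messages per `interval` ticks. */
struct ratelimit {
	uint64_t interval;    /* window length, in caller-defined ticks */
	unsigned int burst;   /* messages allowed within one window */
	uint64_t begin;       /* tick at which the current window started */
	unsigned int printed; /* messages let through in this window */
	unsigned int missed;  /* messages suppressed in this window */
};

/* Returns true if a message may be emitted at time `now`. */
static bool ratelimit_allow(struct ratelimit *rl, uint64_t now)
{
	/* Start a fresh window once the previous one has elapsed. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	/* Over budget: drop the message but remember that it happened. */
	rl->missed++;
	return false;
}
```

Under this scheme a mailbox that fails on every command produces at most `burst` log lines per window instead of one line per failure, which is the behaviour the commit message asks for.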
* Re: [PATCH for-rc 0/3] RDMA/hns: Misc fixes
From: Jason Gunthorpe @ 2026-05-25 14:39 UTC
To: Junxian Huang; +Cc: leon, linux-rdma, linuxarm, tangchengchang

On Wed, May 20, 2026 at 01:57:56PM +0800, Junxian Huang wrote:
> This patchset contains several fixes for hns.
>
> Lianfa Weng (2):
>   RDMA/hns: Fix warning in poll cq direct mode
>   RDMA/hns: Fix log flood after cmd_mbox failure

I picked up these two into for-next

Thanks,
Jason