* [PATCH for-rc 0/3] RDMA/hns: Misc fixes
@ 2026-05-20 5:57 Junxian Huang
2026-05-20 5:57 ` [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource Junxian Huang
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Junxian Huang @ 2026-05-20 5:57 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang
This patchset contains several fixes for hns.
Junxian Huang (1):
RDMA/hns: Fix memory leak of bonding resource
Lianfa Weng (2):
RDMA/hns: Fix warning in poll cq direct mode
RDMA/hns: Fix log flood after cmd_mbox failure
drivers/infiniband/hw/hns/hns_roce_cq.c | 6 +++---
drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 18 +++++++++---------
drivers/infiniband/hw/hns/hns_roce_main.c | 6 ++++--
drivers/infiniband/hw/hns/hns_roce_mr.c | 6 +++---
drivers/infiniband/hw/hns/hns_roce_srq.c | 2 +-
5 files changed, 20 insertions(+), 18 deletions(-)
--
2.33.0
* [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource
2026-05-20 5:57 [PATCH for-rc 0/3] RDMA/hns: Misc fixes Junxian Huang
@ 2026-05-20 5:57 ` Junxian Huang
2026-05-25 14:38 ` Jason Gunthorpe
2026-05-20 5:57 ` [PATCH for-rc 2/3] RDMA/hns: Fix warning in poll cq direct mode Junxian Huang
` (2 subsequent siblings)
3 siblings, 1 reply; 6+ messages in thread
From: Junxian Huang @ 2026-05-20 5:57 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang
In a corner case of concurrent driver removal and driver reset,
bonding resource is first released in hns_roce_hw_v2_exit() during
driver removal, and then is allocated again in hns_roce_register_device()
during driver reset. This leads to memory leak because the release
timing has already passed. This may also lead to a kernel panic
as below because of the leaked notifier callback:
Call trace:
0xffffa20fccc04978 (P)
raw_notifier_call_chain+0x20/0x38
call_netdevice_notifiers_info+0x60/0xb8
netdev_lower_state_changed+0x4c/0xb8
Bonding resource allocation and release should occur only during
driver init and removal, so don't do the allocation during reset.
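The effect of the guard can be illustrated with a small userspace model. This is only a sketch: every model_* and MODEL_* name below is hypothetical and stands in for driver state, and the counter merely tracks how many bond groups are outstanding. It assumes, as the patch does, that re-registration during reset can be told apart by the reset state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for driver state; none of these are the driver's own names. */
enum model_reset_state {
	MODEL_STATE_NON_RST,
	MODEL_STATE_RST_INIT,	/* re-registration is running as part of a reset */
};

struct model_dev {
	bool cap_bond;				/* device advertises bonding support */
	enum model_reset_state reset_state;
	int bond_grps;				/* bond groups currently allocated */
};

/* Mirrors the patched condition: allocate only outside the reset path. */
static int model_register_device(struct model_dev *dev)
{
	assert(dev);
	if (dev->cap_bond && dev->reset_state != MODEL_STATE_RST_INIT)
		dev->bond_grps++;
	return 0;
}

/* Models the single release point reached during driver removal. */
static void model_hw_exit(struct model_dev *dev)
{
	assert(dev);
	dev->bond_grps = 0;
}
```

In this model, calling model_register_device() after model_hw_exit() with the state set to MODEL_STATE_RST_INIT leaves bond_grps at zero, so nothing is allocated once the release point has passed.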
Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources")
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
drivers/infiniband/hw/hns/hns_roce_main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index c17ff5347a01..a7308a3c586e 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -795,6 +795,7 @@ static const struct ib_device_ops hns_roce_dev_restrack_ops = {
static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
{
+ struct hns_roce_v2_priv *priv = hr_dev->priv;
struct hns_roce_ib_iboe *iboe = NULL;
struct device *dev = hr_dev->dev;
struct ib_device *ib_dev = NULL;
@@ -838,7 +839,8 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
dma_set_max_seg_size(dev, SZ_2G);
- if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND) {
+ if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND &&
+ priv->handle->rinfo.reset_state != HNS_ROCE_STATE_RST_INIT) {
ret = hns_roce_alloc_bond_grp(hr_dev);
if (ret) {
dev_err(dev, "failed to alloc bond_grp for bus %u, ret = %d\n",
--
2.33.0
* [PATCH for-rc 2/3] RDMA/hns: Fix warning in poll cq direct mode
2026-05-20 5:57 [PATCH for-rc 0/3] RDMA/hns: Misc fixes Junxian Huang
2026-05-20 5:57 ` [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource Junxian Huang
@ 2026-05-20 5:57 ` Junxian Huang
2026-05-20 5:57 ` [PATCH for-rc 3/3] RDMA/hns: Fix log flood after cmd_mbox failure Junxian Huang
2026-05-25 14:39 ` [PATCH for-rc 0/3] RDMA/hns: Misc fixes Jason Gunthorpe
3 siblings, 0 replies; 6+ messages in thread
From: Junxian Huang @ 2026-05-20 5:57 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang
From: Lianfa Weng <wenglianfa@huawei.com>
CQs allocated by ib_alloc_cq() always have a comp_handler. Though
in direct mode this handler is never expected to be called, it
is still called when the driver is reset, triggering the following
WARN_ONCE():
Call trace:
ib_cq_completion_direct+0x38/0x60
hns_roce_cq_completion+0x54/0x90 [hns_roce_hw_v2]
hns_roce_handle_device_err+0x1c8/0x340 [hns_roce_hw_v2]
hns_roce_hw_v2_uninit_instance.constprop.0+0x34/0x70 [hns_roce_hw_v2]
hns_roce_hw_v2_reset_notify+0xc4/0xe0 [hns_roce_hw_v2]
hclge_notify_roce_client+0x60/0xbc [hclge]
hclge_reset_rebuild+0x48/0x34c [hclge]
hclge_reset_subtask+0xcc/0xec [hclge]
hclge_reset_service_task+0x80/0x160 [hclge]
hclge_service_task+0x50/0x80 [hclge]
process_one_work+0x1cc/0x4d0
worker_thread+0x154/0x414
kthread+0x104/0x144
ret_from_fork+0x10/0x18
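The filtering that avoids this warning can be sketched in userspace as follows. All model_* and MODEL_* names are hypothetical; the enum only loosely mirrors the polling contexts of the RDMA core and its ordering is arbitrary. The function is a simplified picture of deciding whether a CQ should be queued for a completion callback on device error, not the driver's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical polling contexts; the ordering here is arbitrary. */
enum model_poll_ctx {
	MODEL_POLL_SOFTIRQ,
	MODEL_POLL_WORKQUEUE,
	MODEL_POLL_DIRECT,	/* consumer polls the CQ itself, no callback expected */
};

struct model_cq {
	void (*comp_handler)(struct model_cq *cq);
	enum model_poll_ctx poll_ctx;
	int is_armed;
};

/* Placeholder handler so that test CQs can carry a non-NULL callback. */
static void model_handler(struct model_cq *cq)
{
	(void)cq;
}

/*
 * Returns 1 if the CQ was newly marked for a completion callback, 0 if it
 * was skipped (no handler, direct mode, or already armed).
 */
static int model_check_and_arm(struct model_cq *cq)
{
	assert(cq);
	if (!cq->comp_handler || cq->poll_ctx == MODEL_POLL_DIRECT)
		return 0;
	if (cq->is_armed)
		return 0;
	cq->is_armed = 1;
	return 1;
}
```

A direct-mode CQ is skipped even though it carries a handler, which is the case the WARN_ONCE() above complains about.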
Fixes: f295e4cece5c ("RDMA/hns: Delete unnecessary callback functions for cq")
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
drivers/infiniband/hw/hns/hns_roce_main.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
index a7308a3c586e..2b71c2b30bc8 100644
--- a/drivers/infiniband/hw/hns/hns_roce_main.c
+++ b/drivers/infiniband/hw/hns/hns_roce_main.c
@@ -1114,7 +1114,7 @@ static void check_and_get_armed_cq(struct list_head *cq_list, struct ib_cq *cq)
unsigned long flags;
spin_lock_irqsave(&hr_cq->lock, flags);
- if (cq->comp_handler) {
+ if (cq->comp_handler && hr_cq->ib_cq.poll_ctx != IB_POLL_DIRECT) {
if (!hr_cq->is_armed) {
hr_cq->is_armed = 1;
list_add_tail(&hr_cq->node, cq_list);
--
2.33.0
* [PATCH for-rc 3/3] RDMA/hns: Fix log flood after cmd_mbox failure
2026-05-20 5:57 [PATCH for-rc 0/3] RDMA/hns: Misc fixes Junxian Huang
2026-05-20 5:57 ` [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource Junxian Huang
2026-05-20 5:57 ` [PATCH for-rc 2/3] RDMA/hns: Fix warning in poll cq direct mode Junxian Huang
@ 2026-05-20 5:57 ` Junxian Huang
2026-05-25 14:39 ` [PATCH for-rc 0/3] RDMA/hns: Misc fixes Jason Gunthorpe
3 siblings, 0 replies; 6+ messages in thread
From: Junxian Huang @ 2026-05-20 5:57 UTC (permalink / raw)
To: jgg, leon; +Cc: linux-rdma, linuxarm, huangjunxian6, tangchengchang
From: Lianfa Weng <wenglianfa@huawei.com>
hns_roce_cmd_mbox() is the command interface between driver and
hardware. When hardware is abnormal, the unlimited error printings
after hns_roce_cmd_mbox() failure will cause log flood and even
system crash.
Replace ibdev_err() and ibdev_warn() with their ratelimited versions
in the error handling path after hns_roce_cmd_mbox() (and its wrappers
hns_roce_create_hw_ctx/hns_roce_destroy_hw_ctx) fails.
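For readers unfamiliar with rate-limited printing, the idea is roughly a burst budget per time window. The userspace sketch below is a simplified model, not the kernel's implementation: all model_* names are hypothetical, time is passed in explicitly as abstract ticks, and the window and burst values are chosen by the caller.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical model of a rate-limit state: at most `burst` messages per window. */
struct model_ratelimit {
	long interval;	/* window length in abstract ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* start of the current window */
	int started;	/* non-zero once the first window has begun */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed so far */
};

/* Returns 1 if a message may be emitted at time `now`, 0 if it is suppressed. */
static int model_ratelimit_ok(struct model_ratelimit *rs, long now)
{
	assert(rs);
	/* Open a new window on first use or once the old one has expired. */
	if (!rs->started || now - rs->begin >= rs->interval) {
		rs->started = 1;
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return 1;
	}
	rs->missed++;
	return 0;
}

/* Reports `failures` identical mailbox errors at time `now`; returns how many were printed. */
static int model_report_mbox_failures(struct model_ratelimit *rs, long now,
				      int failures, int ret)
{
	int emitted = 0;

	for (int i = 0; i < failures; i++) {
		if (model_ratelimit_ok(rs, now)) {
			fprintf(stderr, "failed to send cmd, ret = %d.\n", ret);
			emitted++;
		}
	}
	return emitted;
}
```

With a burst of 10 and a window of 5 ticks, 25 back-to-back failures produce 10 log lines and 15 suppressed ones, and printing resumes once the window expires.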
Fixes: 9a4435375cd1 ("IB/hns: Add driver files for hns RoCE driver")
Signed-off-by: Lianfa Weng <wenglianfa@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
---
drivers/infiniband/hw/hns/hns_roce_cq.c | 6 +++---
drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 18 +++++++++---------
drivers/infiniband/hw/hns/hns_roce_mr.c | 6 +++---
drivers/infiniband/hw/hns/hns_roce_srq.c | 2 +-
4 files changed, 16 insertions(+), 16 deletions(-)
diff --git a/drivers/infiniband/hw/hns/hns_roce_cq.c b/drivers/infiniband/hw/hns/hns_roce_cq.c
index 24de651f735e..1dd0efb5620d 100644
--- a/drivers/infiniband/hw/hns/hns_roce_cq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_cq.c
@@ -174,9 +174,9 @@ static int hns_roce_create_cqc(struct hns_roce_dev *hr_dev,
ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_CQC,
hr_cq->cqn);
if (ret)
- ibdev_err(ibdev,
- "failed to send create cmd for CQ(0x%lx), ret = %d.\n",
- hr_cq->cqn, ret);
+ ibdev_err_ratelimited(ibdev,
+ "failed to send create cmd for CQ(0x%lx), ret = %d.\n",
+ hr_cq->cqn, ret);
hns_roce_free_cmd_mailbox(hr_dev, mailbox);
diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
index 4afd7d6ae3ca..332a4816f2ca 100644
--- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
+++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
@@ -6193,9 +6193,9 @@ static int hns_roce_v2_modify_srq(struct ib_srq *ibsrq,
HNS_ROCE_CMD_MODIFY_SRQC, srq->srqn);
hns_roce_free_cmd_mailbox(hr_dev, mailbox);
if (ret)
- ibdev_err(&hr_dev->ib_dev,
- "failed to handle cmd of modifying SRQ, ret = %d.\n",
- ret);
+ ibdev_err_ratelimited(&hr_dev->ib_dev,
+ "failed to handle cmd of modifying SRQ, ret = %d.\n",
+ ret);
}
out:
@@ -6221,9 +6221,9 @@ static int hns_roce_v2_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr)
ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma,
HNS_ROCE_CMD_QUERY_SRQC, srq->srqn);
if (ret) {
- ibdev_err(&hr_dev->ib_dev,
- "failed to process cmd of querying SRQ, ret = %d.\n",
- ret);
+ ibdev_err_ratelimited(&hr_dev->ib_dev,
+ "failed to process cmd of querying SRQ, ret = %d.\n",
+ ret);
goto out;
}
@@ -6329,9 +6329,9 @@ static int hns_roce_v2_query_mpt(struct hns_roce_dev *hr_dev, u32 key,
ret = hns_roce_cmd_mbox(hr_dev, 0, mailbox->dma, HNS_ROCE_CMD_QUERY_MPT,
key_to_hw_index(key));
if (ret) {
- ibdev_err(&hr_dev->ib_dev,
- "failed to process cmd when querying MPT, ret = %d.\n",
- ret);
+ ibdev_err_ratelimited(&hr_dev->ib_dev,
+ "failed to process cmd when querying MPT, ret = %d.\n",
+ ret);
goto err_mailbox;
}
diff --git a/drivers/infiniband/hw/hns/hns_roce_mr.c b/drivers/infiniband/hw/hns/hns_roce_mr.c
index 896af1828a38..e8a9e7d8f267 100644
--- a/drivers/infiniband/hw/hns/hns_roce_mr.c
+++ b/drivers/infiniband/hw/hns/hns_roce_mr.c
@@ -173,7 +173,7 @@ static int hns_roce_mr_enable(struct hns_roce_dev *hr_dev,
ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT,
mtpt_idx & (hr_dev->caps.num_mtpts - 1));
if (ret) {
- dev_err(dev, "failed to create mpt, ret = %d.\n", ret);
+ dev_err_ratelimited(dev, "failed to create mpt, ret = %d.\n", ret);
goto err_page;
}
@@ -315,7 +315,7 @@ struct ib_mr *hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start,
ret = hns_roce_destroy_hw_ctx(hr_dev, HNS_ROCE_CMD_DESTROY_MPT,
mtpt_idx);
if (ret)
- ibdev_warn(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
+ ibdev_warn_ratelimited(ib_dev, "failed to destroy MPT, ret = %d.\n", ret);
mr->enabled = 0;
mr->iova = virt_addr;
@@ -346,7 +346,7 @@ struct ib_mr *hns_roce_rereg_user_mr(struct ib_mr *ibmr, int flags, u64 start,
ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_MPT,
mtpt_idx);
if (ret) {
- ibdev_err(ib_dev, "failed to create MPT, ret = %d.\n", ret);
+ ibdev_err_ratelimited(ib_dev, "failed to create MPT, ret = %d.\n", ret);
goto free_cmd_mbox;
}
diff --git a/drivers/infiniband/hw/hns/hns_roce_srq.c b/drivers/infiniband/hw/hns/hns_roce_srq.c
index 8644c3916367..00552a08f21a 100644
--- a/drivers/infiniband/hw/hns/hns_roce_srq.c
+++ b/drivers/infiniband/hw/hns/hns_roce_srq.c
@@ -103,7 +103,7 @@ static int hns_roce_create_srqc(struct hns_roce_dev *hr_dev,
ret = hns_roce_create_hw_ctx(hr_dev, mailbox, HNS_ROCE_CMD_CREATE_SRQ,
srq->srqn);
if (ret)
- ibdev_err(ibdev, "failed to config SRQC, ret = %d.\n", ret);
+ ibdev_err_ratelimited(ibdev, "failed to config SRQC, ret = %d.\n", ret);
err_mbox:
hns_roce_free_cmd_mailbox(hr_dev, mailbox);
--
2.33.0
* Re: [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource
2026-05-20 5:57 ` [PATCH for-rc 1/3] RDMA/hns: Fix memory leak of bonding resource Junxian Huang
@ 2026-05-25 14:38 ` Jason Gunthorpe
0 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-05-25 14:38 UTC (permalink / raw)
To: Junxian Huang; +Cc: leon, linux-rdma, linuxarm, tangchengchang
On Wed, May 20, 2026 at 01:57:57PM +0800, Junxian Huang wrote:
> In a corner case of concurrent driver removal and driver reset,
> bonding resource is first released in hns_roce_hw_v2_exit() during
> driver removal, and then is allocated again in hns_roce_register_device()
> during driver reset. This leads to memory leak because the release
> timing has already passed. This may also lead to a kernel panic
> as below because of the leaked notifier callback:
>
> Call trace:
> 0xffffa20fccc04978 (P)
> raw_notifier_call_chain+0x20/0x38
> call_netdevice_notifiers_info+0x60/0xb8
> netdev_lower_state_changed+0x4c/0xb8
>
> Bonding resource allocation and release should occur only during
> driver init and removal, so don't do the allocation during reset.
>
> Fixes: b37ad2e290fc ("RDMA/hns: Initialize bonding resources")
> Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
> ---
> drivers/infiniband/hw/hns/hns_roce_main.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/hw/hns/hns_roce_main.c b/drivers/infiniband/hw/hns/hns_roce_main.c
> index c17ff5347a01..a7308a3c586e 100644
> --- a/drivers/infiniband/hw/hns/hns_roce_main.c
> +++ b/drivers/infiniband/hw/hns/hns_roce_main.c
> @@ -795,6 +795,7 @@ static const struct ib_device_ops hns_roce_dev_restrack_ops = {
>
> static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
> {
> + struct hns_roce_v2_priv *priv = hr_dev->priv;
> struct hns_roce_ib_iboe *iboe = NULL;
> struct device *dev = hr_dev->dev;
> struct ib_device *ib_dev = NULL;
> @@ -838,7 +839,8 @@ static int hns_roce_register_device(struct hns_roce_dev *hr_dev)
>
> dma_set_max_seg_size(dev, SZ_2G);
>
> - if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND) {
> + if (hr_dev->caps.flags & HNS_ROCE_CAP_FLAG_BOND &&
> + priv->handle->rinfo.reset_state != HNS_ROCE_STATE_RST_INIT) {
> ret = hns_roce_alloc_bond_grp(hr_dev);
> if (ret) {
> dev_err(dev, "failed to alloc bond_grp for bus %u, ret = %d\n",
The sashiko comments about inverted teardown seem pretty reasonable?
https://sashiko.dev/#/patchset/20260520055759.2354037-1-huangjunxian6%40hisilic
It would be better to fix it that way instead of sprinkling this
around.
The other comments seem less interesting.
Jason
* Re: [PATCH for-rc 0/3] RDMA/hns: Misc fixes
2026-05-20 5:57 [PATCH for-rc 0/3] RDMA/hns: Misc fixes Junxian Huang
` (2 preceding siblings ...)
2026-05-20 5:57 ` [PATCH for-rc 3/3] RDMA/hns: Fix log flood after cmd_mbox failure Junxian Huang
@ 2026-05-25 14:39 ` Jason Gunthorpe
3 siblings, 0 replies; 6+ messages in thread
From: Jason Gunthorpe @ 2026-05-25 14:39 UTC (permalink / raw)
To: Junxian Huang; +Cc: leon, linux-rdma, linuxarm, tangchengchang
On Wed, May 20, 2026 at 01:57:56PM +0800, Junxian Huang wrote:
> This patchset contains several fixes for hns.
>
> Lianfa Weng (2):
> RDMA/hns: Fix warning in poll cq direct mode
> RDMA/hns: Fix log flood after cmd_mbox failure
I picked up these two into for-next
Thanks,
Jason