From: Chenguang Zhao <chenguang.zhao@linux.dev>
To: jgg@ziepe.ca, leon@kernel.org
Cc: edwards@nvidia.com, michaelgur@nvidia.com,
vdumitrescu@nvidia.com, jiri@resnulli.us,
chenguang.zhao@linux.dev, linux-rdma@vger.kernel.org,
Chenguang Zhao <zhaochenguang@kylinos.cn>
Subject: [PATCH] RDMA/core: quiesce CQ polling before device shutdown on reboot
Date: Thu, 2 Jul 2026 15:34:22 +0800 [thread overview]
Message-ID: <20260702073422.279820-1-chenguang.zhao@linux.dev> (raw)
From: Chenguang Zhao <zhaochenguang@kylinos.cn>
When executing reboot -f in an NFS over RDMA environment, there is a 5%
probability of a crash occurring. The call trace is as follows:
[ 610.937741] Unable to handle kernel paging request at virtual address ffff3773a3e867f8
[ 610.938895] ib_srpt received unrecognized IB event 8
[ 610.952435] Mem abort info:
[ 610.977383] ESR = 0x96000005
[ 610.984027] Exception class = DABT (current EL), IL = 32 bits
[ 610.993574] SET = 0, FnV = 0
[ 611.000197] EA = 0, S1PTW = 0
[ 611.006851] Data abort info:
[ 611.013183] ISV = 0, ISS = 0x00000005
[ 611.020435] CM = 0, WnR = 0
[ 611.026766] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 00000000489bb83a
[ 611.040104] [ffff3773a3e867f8] pgd=0000000000000000, pud=0000000000000000
[ 611.053421] Internal error: Oops: 96000005 [#1] SMP
[ 611.061809] Modules linked in: rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver enfs(OE) iptable_filter nfsv3(OE) nfs_acl(OE) nfs(OE) lockd(OE) grace fscache(OE) rfkill vfat fat rpcrdma(OE) sunrpc(OE) rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_cm iw_cm ib_umad libiscsi scsi_transport_iscsi ib_ipoib ib_cm ipmi_ssif aes_ce_blk crypto_simd cryptd mlx5_ib aes_ce_cipher crct10dif_ce ib_uverbs ofpart cmdlinepart ghash_ce ses ib_core sha2_ce hi_sfc sha256_arm64 enclosure sha1_ce joydev sbsa_gwdt ipmi_si ipmi_devintf ipmi_msghandler mtd spi_dw_mmio sch_fq_codel ip_tables realtek hns3 mlx5_core megaraid_sas hclge nfit mlxfw nvme hisi_sas_v3_hw hnae3 libnvdimm hisi_sas_main nvme_core host_edma_drv dm_mirror dm_region_hash dm_log
[ 611.162725] Process kworker/70:1H (pid: 778, stack limit = 0x00000000fdde5e06)
[ 611.177485] CPU: 70 PID: 778 Comm: kworker/70:1H Kdump: loaded Tainted: G OE 4.19.90-52.40.v2207.ky10.aarch64 #4
[ 611.196574] Source Version: 2c067295fb9b1b9b1b6693820c85b8ba78e29114
[ 611.207038] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDUA, BIOS 1.89 05/20/2022
[ 611.223343] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[ 611.233098] pstate: 20c00009 (nzCv daif +PAN +UAO)
[ 611.242075] pc : __ib_process_cq+0x88/0xe8 [ib_core]
[ 611.251215] lr : __ib_process_cq+0x90/0xe8 [ib_core]
[ 611.260316] sp : ffffa04001e2bd20
[ 611.267776] x29: ffffa04001e2bd20 x28: ffff8040a0f6dc40
[ 611.277259] x27: 0000000000000040 x26: ffff8040a0f6dc00
[ 611.286723] x25: 0000000000010000 x24: 0000000000000010
[ 611.296109] x23: 0000000000000000 x22: 000000000000000a
[ 611.305385] x21: ffff8040a0f62000 x20: ffff8040a0f6de80
[ 611.314578] x19: ffff8040a0f6dcc0 x18: 0000000000000001
[ 611.323685] x17: 0000fffe7b3ae660 x16: ffff2b4528d7c368
[ 611.332767] x15: 0000000000000001 x14: ffff2b45299da9f8
[ 611.341778] x13: 0000000000000000 x12: 0000000000000000
[ 611.350689] x11: 00000000ffffffff x10: 0000000000000000
[ 611.359482] x9 : 000000000000000a x8 : ffff8040abe80000
[ 611.368173] x7 : 0000000000000005 x6 : 000000000000000a
[ 611.376760] x5 : 000000000000174b x4 : 000000000000000a
[ 611.385224] x3 : ffffa04001e2bd1c x2 : ffff3773a3e867f8
[ 611.393611] x1 : ffff8040a0f6dcc0 x0 : ffff8040a0f62000
[ 611.401910] Call trace:
[ 611.407252] __ib_process_cq+0x88/0xe8 [ib_core]
[ 611.414725] ib_cq_poll_work+0x34/0xa0 [ib_core]
[ 611.422103] process_one_work+0x1fc/0x4a8
[ 611.428793] worker_thread+0x50/0x4d8
[ 611.435047] kthread+0x134/0x138
[ 611.440779] ret_from_fork+0x10/0x18
[ 611.446770] Code: f9400262 aa1303e1 aa1503e0 b40002a2 (f9400042)
[ 611.455330] SMP: stopping secondary CPUs
[ 611.467959] Starting crashdump kernel...
On forced reboot (reboot -f), mlx5 may enter shutdown while ib-comp-wq
still polls live CQs, causing use-after-free in wr_cqe->done(). Drain
completion workqueues from the kernel reboot path before device_shutdown()
via a function pointer hook, and guard CQ processing during shutdown.
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
With NFS over RDMA (rpcrdma + mlx5), reboot -f can Oops in
ib_cq_poll_work -> __ib_process_cq when calling wr_cqe->done(),
often after mlx5_core: Shutdown was called.
reboot -f skips orderly shutdown and goes directly to:
kernel_restart_prepare() -> device_shutdown() -> mlx5 shutdown
Upper layers may still hold live CQs, while ib-comp-wq keeps
polling — a use-after-free race.
Normal reboot usually works because userspace has already
torn down RDMA and called ib_free_cq().
This fixes the crash, but putting InfiniBand-specific logic
into the generic reboot path does not feel ideal. Is this
the right place, or is there a better pattern — e.g. a hook
inside ib_core, driver-level quiesce, or a generic
pre-device_shutdown() mechanism?
drivers/infiniband/core/cq.c | 47 +++++++++++++++++++++++++++++++-
drivers/infiniband/core/device.c | 3 ++
include/rdma/ib_verbs.h | 3 ++
kernel/reboot.c | 11 ++++++++
4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 3d7b6cddd131..b64905bafeaf 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -9,6 +9,13 @@
#include "core_priv.h"
#include <trace/events/rdma_core.h>
+
+static atomic_t ib_cq_drain;
+
+static bool ib_cq_draining(void)
+{
+ return atomic_read(&ib_cq_drain) || system_state != SYSTEM_RUNNING;
+}
/* Max size for shared CQ, may require tuning */
#define IB_MAX_SHARED_CQ_SZ 4096U
@@ -96,6 +103,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
trace_cq_process(cq);
+ if (ib_cq_draining())
+ return 0;
+
/*
* budget might be (-1) if the caller does not
* want to bound this call, thus we need unsigned
@@ -106,6 +116,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
for (i = 0; i < n; i++) {
struct ib_wc *wc = &wcs[i];
+ if (ib_cq_draining())
+ return completed;
+
if (wc->wr_cqe)
wc->wr_cqe->done(cq, wc);
else
@@ -157,7 +170,8 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
completed = __ib_process_cq(cq, budget, cq->wc, IB_POLL_BATCH);
if (completed < budget) {
irq_poll_complete(&cq->iop);
- if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
+ if (!ib_cq_draining() &&
+ ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
trace_cq_reschedule(cq);
irq_poll_sched(&cq->iop);
}
@@ -171,6 +185,9 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
static void ib_cq_completion_softirq(struct ib_cq *cq, void *private)
{
+ if (ib_cq_draining())
+ return;
+
trace_cq_schedule(cq);
irq_poll_sched(&cq->iop);
}
@@ -180,8 +197,14 @@ static void ib_cq_poll_work(struct work_struct *work)
struct ib_cq *cq = container_of(work, struct ib_cq, work);
int completed;
+ if (ib_cq_draining())
+ return;
+
completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, cq->wc,
IB_POLL_BATCH);
+ if (ib_cq_draining())
+ return;
+
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
queue_work(cq->comp_wq, &cq->work);
@@ -191,6 +214,9 @@ static void ib_cq_poll_work(struct work_struct *work)
static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
{
+ if (ib_cq_draining())
+ return;
+
trace_cq_schedule(cq);
queue_work(cq->comp_wq, &cq->work);
}
@@ -359,6 +385,25 @@ void ib_free_cq(struct ib_cq *cq)
}
EXPORT_SYMBOL(ib_free_cq);
+/**
+ * ib_drain_completion_queues - Quiesce CQ polling before device shutdown
+ *
+ * Called from the kernel reboot/poweroff path immediately before
+ * device_shutdown(), while RDMA upper layers may still hold live CQs.
+ * Stops new CQ work from being queued and waits for in-flight
+ * ib_cq_poll_work handlers to finish.
+ */
+void ib_drain_completion_queues(void)
+{
+ atomic_set(&ib_cq_drain, 1);
+
+ if (ib_comp_wq)
+ flush_workqueue(ib_comp_wq);
+ if (ib_comp_unbound_wq)
+ flush_workqueue(ib_comp_unbound_wq);
+}
+EXPORT_SYMBOL_GPL(ib_drain_completion_queues);
+
void ib_cq_pool_cleanup(struct ib_device *dev)
{
struct ib_cq *cq, *n;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b8193e077a74..601c58abd184 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -3105,6 +3105,8 @@ static int __init ib_core_init(void)
register_netdevice_notifier(&nb_netdevice);
+ ib_drain_completion_queues_fn = ib_drain_completion_queues;
+
return 0;
err_parent:
@@ -3134,6 +3136,7 @@ static int __init ib_core_init(void)
static void __exit ib_core_cleanup(void)
{
+ ib_drain_completion_queues_fn = NULL;
unregister_netdevice_notifier(&nb_netdevice);
roce_gid_mgmt_cleanup();
rdma_nl_unregister(RDMA_NL_LS);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 794746de8db0..c0b6d1cca598 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -61,6 +61,9 @@ extern struct workqueue_struct *ib_wq;
extern struct workqueue_struct *ib_comp_wq;
extern struct workqueue_struct *ib_comp_unbound_wq;
+extern void (*ib_drain_completion_queues_fn)(void);
+void ib_drain_completion_queues(void);
+
struct ib_ucq_object;
__printf(2, 3) __cold
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 695c33e75efd..889e77960153 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -81,6 +81,9 @@ void __weak (*pm_power_off)(void);
*/
static BLOCKING_NOTIFIER_HEAD(reboot_notifier_list);
+void (*ib_drain_completion_queues_fn)(void);
+EXPORT_SYMBOL_GPL(ib_drain_completion_queues_fn);
+
/**
* emergency_restart - reboot the system
*
@@ -102,6 +105,10 @@ void kernel_restart_prepare(char *cmd)
blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
system_state = SYSTEM_RESTART;
usermodehelper_disable();
+#if IS_ENABLED(CONFIG_INFINIBAND)
+ if (ib_drain_completion_queues_fn)
+ ib_drain_completion_queues_fn();
+#endif
device_shutdown();
}
@@ -305,6 +312,10 @@ static void kernel_shutdown_prepare(enum system_states state)
(state == SYSTEM_HALT) ? SYS_HALT : SYS_POWER_OFF, NULL);
system_state = state;
usermodehelper_disable();
+#if IS_ENABLED(CONFIG_INFINIBAND)
+ if (ib_drain_completion_queues_fn)
+ ib_drain_completion_queues_fn();
+#endif
device_shutdown();
}
/**
--
2.25.1
next reply other threads:[~2026-07-02 7:34 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-02 7:34 Chenguang Zhao [this message]
2026-07-02 17:40 ` [PATCH] RDMA/core: quiesce CQ polling before device shutdown on reboot Jason Gunthorpe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260702073422.279820-1-chenguang.zhao@linux.dev \
--to=chenguang.zhao@linux.dev \
--cc=edwards@nvidia.com \
--cc=jgg@ziepe.ca \
--cc=jiri@resnulli.us \
--cc=leon@kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=michaelgur@nvidia.com \
--cc=vdumitrescu@nvidia.com \
--cc=zhaochenguang@kylinos.cn \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox