* [PATCH] RDMA/core: quiesce CQ polling before device shutdown on reboot
From: Chenguang Zhao @ 2026-07-02 7:34 UTC
To: jgg, leon
Cc: edwards, michaelgur, vdumitrescu, jiri, chenguang.zhao,
linux-rdma, Chenguang Zhao
From: Chenguang Zhao <zhaochenguang@kylinos.cn>
When executing reboot -f in an NFS over RDMA environment, the kernel
crashes in about 5% of attempts. The call trace is as follows:
[ 610.937741] Unable to handle kernel paging request at virtual address ffff3773a3e867f8
[ 610.938895] ib_srpt received unrecognized IB event 8
[ 610.952435] Mem abort info:
[ 610.977383] ESR = 0x96000005
[ 610.984027] Exception class = DABT (current EL), IL = 32 bits
[ 610.993574] SET = 0, FnV = 0
[ 611.000197] EA = 0, S1PTW = 0
[ 611.006851] Data abort info:
[ 611.013183] ISV = 0, ISS = 0x00000005
[ 611.020435] CM = 0, WnR = 0
[ 611.026766] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 00000000489bb83a
[ 611.040104] [ffff3773a3e867f8] pgd=0000000000000000, pud=0000000000000000
[ 611.053421] Internal error: Oops: 96000005 [#1] SMP
[ 611.061809] Modules linked in: rpcsec_gss_krb5(OE) auth_rpcgss(OE) nfsv4(OE) dns_resolver enfs(OE) iptable_filter nfsv3(OE) nfs_acl(OE) nfs(OE) lockd(OE) grace fscache(OE) rfkill vfat fat rpcrdma(OE) sunrpc(OE) rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_cm iw_cm ib_umad libiscsi scsi_transport_iscsi ib_ipoib ib_cm ipmi_ssif aes_ce_blk crypto_simd cryptd mlx5_ib aes_ce_cipher crct10dif_ce ib_uverbs ofpart cmdlinepart ghash_ce ses ib_core sha2_ce hi_sfc sha256_arm64 enclosure sha1_ce joydev sbsa_gwdt ipmi_si ipmi_devintf ipmi_msghandler mtd spi_dw_mmio sch_fq_codel ip_tables realtek hns3 mlx5_core megaraid_sas hclge nfit mlxfw nvme hisi_sas_v3_hw hnae3 libnvdimm hisi_sas_main nvme_core host_edma_drv dm_mirror dm_region_hash dm_log
[ 611.162725] Process kworker/70:1H (pid: 778, stack limit = 0x00000000fdde5e06)
[ 611.177485] CPU: 70 PID: 778 Comm: kworker/70:1H Kdump: loaded Tainted: G OE 4.19.90-52.40.v2207.ky10.aarch64 #4
[ 611.196574] Source Version: 2c067295fb9b1b9b1b6693820c85b8ba78e29114
[ 611.207038] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDUA, BIOS 1.89 05/20/2022
[ 611.223343] Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
[ 611.233098] pstate: 20c00009 (nzCv daif +PAN +UAO)
[ 611.242075] pc : __ib_process_cq+0x88/0xe8 [ib_core]
[ 611.251215] lr : __ib_process_cq+0x90/0xe8 [ib_core]
[ 611.260316] sp : ffffa04001e2bd20
[ 611.267776] x29: ffffa04001e2bd20 x28: ffff8040a0f6dc40
[ 611.277259] x27: 0000000000000040 x26: ffff8040a0f6dc00
[ 611.286723] x25: 0000000000010000 x24: 0000000000000010
[ 611.296109] x23: 0000000000000000 x22: 000000000000000a
[ 611.305385] x21: ffff8040a0f62000 x20: ffff8040a0f6de80
[ 611.314578] x19: ffff8040a0f6dcc0 x18: 0000000000000001
[ 611.323685] x17: 0000fffe7b3ae660 x16: ffff2b4528d7c368
[ 611.332767] x15: 0000000000000001 x14: ffff2b45299da9f8
[ 611.341778] x13: 0000000000000000 x12: 0000000000000000
[ 611.350689] x11: 00000000ffffffff x10: 0000000000000000
[ 611.359482] x9 : 000000000000000a x8 : ffff8040abe80000
[ 611.368173] x7 : 0000000000000005 x6 : 000000000000000a
[ 611.376760] x5 : 000000000000174b x4 : 000000000000000a
[ 611.385224] x3 : ffffa04001e2bd1c x2 : ffff3773a3e867f8
[ 611.393611] x1 : ffff8040a0f6dcc0 x0 : ffff8040a0f62000
[ 611.401910] Call trace:
[ 611.407252] __ib_process_cq+0x88/0xe8 [ib_core]
[ 611.414725] ib_cq_poll_work+0x34/0xa0 [ib_core]
[ 611.422103] process_one_work+0x1fc/0x4a8
[ 611.428793] worker_thread+0x50/0x4d8
[ 611.435047] kthread+0x134/0x138
[ 611.440779] ret_from_fork+0x10/0x18
[ 611.446770] Code: f9400262 aa1303e1 aa1503e0 b40002a2 (f9400042)
[ 611.455330] SMP: stopping secondary CPUs
[ 611.467959] Starting crashdump kernel...
On forced reboot (reboot -f), mlx5 may enter shutdown while ib-comp-wq
still polls live CQs, causing use-after-free in wr_cqe->done(). Drain
completion workqueues from the kernel reboot path before device_shutdown()
via a function pointer hook, and guard CQ processing during shutdown.
Signed-off-by: Chenguang Zhao <zhaochenguang@kylinos.cn>
---
With NFS over RDMA (rpcrdma + mlx5), reboot -f can Oops in
ib_cq_poll_work -> __ib_process_cq when calling wr_cqe->done(),
often after "mlx5_core: Shutdown was called" is logged.

reboot -f skips orderly shutdown and goes directly to:

  kernel_restart_prepare() -> device_shutdown() -> mlx5 shutdown

Upper layers may still hold live CQs while ib-comp-wq keeps
polling them, which is a use-after-free race.

Normal reboot usually works because userspace has already
torn down RDMA, so ib_free_cq() has run by then.

This fixes the crash, but putting InfiniBand-specific logic
into the generic reboot path does not feel ideal. Is this
the right place, or is there a better pattern, e.g. a hook
inside ib_core, a driver-level quiesce, or a generic
pre-device_shutdown() mechanism?
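
As a reading aid only, the guard pattern the diff adds to
__ib_process_cq() can be modelled in plain userspace C. This is a
minimal sketch, not kernel code: every model_* name is hypothetical,
struct model_cqe merely stands in for struct ib_cqe, and the flag plays
the role of ib_cq_drain. The model is single-threaded and does not
reproduce flush_workqueue() or the system_state check.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ib_cqe: a completion with a done() callback. */
struct model_cqe {
	void (*done)(struct model_cqe *cqe);
	int calls;	/* how many times done() has run */
};

/* Plays the role of ib_cq_drain in the patch. */
static atomic_int model_cq_drain;

static bool model_cq_draining(void)
{
	return atomic_load(&model_cq_drain) != 0;
}

/* Counts invocations so a caller can observe which completions ran. */
static void model_done(struct model_cqe *cqe)
{
	cqe->calls++;
}

/*
 * Mirrors the two guards added to __ib_process_cq(): one before any
 * processing and one before each done() call. Returns the number of
 * completions handled.
 */
static int model_process_cq(struct model_cqe **wcs, int n)
{
	int completed = 0;

	if (model_cq_draining())
		return 0;

	for (int i = 0; i < n; i++) {
		if (model_cq_draining())
			return completed;
		if (wcs[i] && wcs[i]->done)
			wcs[i]->done(wcs[i]);
		completed++;
	}
	return completed;
}

/* Sets the flag only; the patch additionally flushes both workqueues. */
static void model_drain(void)
{
	atomic_store(&model_cq_drain, 1);
}
```

In this model, completions handed to model_process_cq() after
model_drain() are never delivered. The flag check and the done() call
are not one atomic step, so a handler that has already passed the check
can still be running; that is why the patch follows atomic_set() with
flush_workqueue() rather than relying on the flag alone.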
drivers/infiniband/core/cq.c | 47 +++++++++++++++++++++++++++++++-
drivers/infiniband/core/device.c | 3 ++
include/rdma/ib_verbs.h | 3 ++
kernel/reboot.c | 11 ++++++++
4 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/core/cq.c b/drivers/infiniband/core/cq.c
index 3d7b6cddd131..b64905bafeaf 100644
--- a/drivers/infiniband/core/cq.c
+++ b/drivers/infiniband/core/cq.c
@@ -9,6 +9,13 @@
#include "core_priv.h"
#include <trace/events/rdma_core.h>
+
+static atomic_t ib_cq_drain;
+
+static bool ib_cq_draining(void)
+{
+ return atomic_read(&ib_cq_drain) || system_state != SYSTEM_RUNNING;
+}
/* Max size for shared CQ, may require tuning */
#define IB_MAX_SHARED_CQ_SZ 4096U
@@ -96,6 +103,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
trace_cq_process(cq);
+ if (ib_cq_draining())
+ return 0;
+
/*
* budget might be (-1) if the caller does not
* want to bound this call, thus we need unsigned
@@ -106,6 +116,9 @@ static int __ib_process_cq(struct ib_cq *cq, int budget, struct ib_wc *wcs,
for (i = 0; i < n; i++) {
struct ib_wc *wc = &wcs[i];
+ if (ib_cq_draining())
+ return completed;
+
if (wc->wr_cqe)
wc->wr_cqe->done(cq, wc);
else
@@ -157,7 +170,8 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
completed = __ib_process_cq(cq, budget, cq->wc, IB_POLL_BATCH);
if (completed < budget) {
irq_poll_complete(&cq->iop);
- if (ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
+ if (!ib_cq_draining() &&
+ ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0) {
trace_cq_reschedule(cq);
irq_poll_sched(&cq->iop);
}
@@ -171,6 +185,9 @@ static int ib_poll_handler(struct irq_poll *iop, int budget)
static void ib_cq_completion_softirq(struct ib_cq *cq, void *private)
{
+ if (ib_cq_draining())
+ return;
+
trace_cq_schedule(cq);
irq_poll_sched(&cq->iop);
}
@@ -180,8 +197,14 @@ static void ib_cq_poll_work(struct work_struct *work)
struct ib_cq *cq = container_of(work, struct ib_cq, work);
int completed;
+ if (ib_cq_draining())
+ return;
+
completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, cq->wc,
IB_POLL_BATCH);
+ if (ib_cq_draining())
+ return;
+
if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
queue_work(cq->comp_wq, &cq->work);
@@ -191,6 +214,9 @@ static void ib_cq_poll_work(struct work_struct *work)
static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
{
+ if (ib_cq_draining())
+ return;
+
trace_cq_schedule(cq);
queue_work(cq->comp_wq, &cq->work);
}
@@ -359,6 +385,25 @@ void ib_free_cq(struct ib_cq *cq)
}
EXPORT_SYMBOL(ib_free_cq);
+/**
+ * ib_drain_completion_queues - Quiesce CQ polling before device shutdown
+ *
+ * Called from the kernel reboot/poweroff path immediately before
+ * device_shutdown(), while RDMA upper layers may still hold live CQs.
+ * Stops new CQ work from being queued and waits for in-flight
+ * ib_cq_poll_work handlers to finish.
+ */
+void ib_drain_completion_queues(void)
+{
+ atomic_set(&ib_cq_drain, 1);
+
+ if (ib_comp_wq)
+ flush_workqueue(ib_comp_wq);
+ if (ib_comp_unbound_wq)
+ flush_workqueue(ib_comp_unbound_wq);
+}
+EXPORT_SYMBOL_GPL(ib_drain_completion_queues);
+
void ib_cq_pool_cleanup(struct ib_device *dev)
{
struct ib_cq *cq, *n;
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b8193e077a74..601c58abd184 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -3105,6 +3105,8 @@ static int __init ib_core_init(void)
register_netdevice_notifier(&nb_netdevice);
+ ib_drain_completion_queues_fn = ib_drain_completion_queues;
+
return 0;
err_parent:
@@ -3134,6 +3136,7 @@ static int __init ib_core_init(void)
static void __exit ib_core_cleanup(void)
{
+ ib_drain_completion_queues_fn = NULL;
unregister_netdevice_notifier(&nb_netdevice);
roce_gid_mgmt_cleanup();
rdma_nl_unregister(RDMA_NL_LS);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 794746de8db0..c0b6d1cca598 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -61,6 +61,9 @@ extern struct workqueue_struct *ib_wq;
extern struct workqueue_struct *ib_comp_wq;
extern struct workqueue_struct *ib_comp_unbound_wq;
+extern void (*ib_drain_completion_queues_fn)(void);
+void ib_drain_completion_queues(void);
+
struct ib_ucq_object;
__printf(2, 3) __cold
diff --git a/kernel/reboot.c b/kernel/reboot.c
index 695c33e75efd..889e77960153 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -81,6 +81,9 @@ void __weak (*pm_power_off)(void);
*/
static BLOCKING_NOTIFIER_HEAD(reboot_notifier_list);
+void (*ib_drain_completion_queues_fn)(void);
+EXPORT_SYMBOL_GPL(ib_drain_completion_queues_fn);
+
/**
* emergency_restart - reboot the system
*
@@ -102,6 +105,10 @@ void kernel_restart_prepare(char *cmd)
blocking_notifier_call_chain(&reboot_notifier_list, SYS_RESTART, cmd);
system_state = SYSTEM_RESTART;
usermodehelper_disable();
+#if IS_ENABLED(CONFIG_INFINIBAND)
+ if (ib_drain_completion_queues_fn)
+ ib_drain_completion_queues_fn();
+#endif
device_shutdown();
}
@@ -305,6 +312,10 @@ static void kernel_shutdown_prepare(enum system_states state)
(state == SYSTEM_HALT) ? SYS_HALT : SYS_POWER_OFF, NULL);
system_state = state;
usermodehelper_disable();
+#if IS_ENABLED(CONFIG_INFINIBAND)
+ if (ib_drain_completion_queues_fn)
+ ib_drain_completion_queues_fn();
+#endif
device_shutdown();
}
/**
--
2.25.1
* Re: [PATCH] RDMA/core: quiesce CQ polling before device shutdown on reboot
From: Jason Gunthorpe @ 2026-07-02 17:40 UTC
To: Chenguang Zhao
Cc: leon, edwards, michaelgur, vdumitrescu, jiri, linux-rdma,
Chenguang Zhao
On Thu, Jul 02, 2026 at 03:34:22PM +0800, Chenguang Zhao wrote:
> On forced reboot (reboot -f), mlx5 may enter shutdown while ib-comp-wq
> still polls live CQs, causing use-after-free in wr_cqe->done(). Drain
> completion workqueues from the kernel reboot path before device_shutdown()
> via a function pointer hook, and guard CQ processing during shutdown.
I think this is a bug in mlx5, it should not be destroying things
other parts of the kernel depend on during its shutdown. The real purpose
of shutdown is to DMA quiesce the device so the kdump kernel can recover
it and use it as the kdump NIC.
Jason