* Re: [PATCH 6.6] RDMA/rxe: Fix "trying to register non-static key in rxe_qp_do_cleanup" bug
2026-06-05 16:55 [PATCH 6.6] RDMA/rxe: Fix "trying to register non-static key in rxe_qp_do_cleanup" bug Vladislav Nikolaev
From: yanjun.zhu @ 2026-06-05 20:23 UTC
To: Vladislav Nikolaev, stable, Greg Kroah-Hartman, Zhu Yanjun
Cc: Zhu Yanjun, Doug Ledford, Jason Gunthorpe, Haggai Eran,
Kamal Heib, Amir Vadai, Moni Shoua, Yonatan Cohen,
Leon Romanovsky, linux-rdma, linux-kernel, Zhu Yanjun,
lvc-project, syzbot+4edb496c3cad6e953a31
On 6/5/26 9:55 AM, Vladislav Nikolaev wrote:
> From: Zhu Yanjun <yanjun.zhu@linux.dev>
>
> commit 1c7eec4d5f3b39cdea2153abaebf1b7229a47072 upstream.
>
> Call Trace:
> <TASK>
> __dump_stack lib/dump_stack.c:94 [inline]
> dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:120
> assign_lock_key kernel/locking/lockdep.c:986 [inline]
> register_lock_class+0x4a3/0x4c0 kernel/locking/lockdep.c:1300
> __lock_acquire+0x99/0x1ba0 kernel/locking/lockdep.c:5110
> lock_acquire kernel/locking/lockdep.c:5866 [inline]
> lock_acquire+0x179/0x350 kernel/locking/lockdep.c:5823
> __timer_delete_sync+0x152/0x1b0 kernel/time/timer.c:1644
> rxe_qp_do_cleanup+0x5c3/0x7e0 drivers/infiniband/sw/rxe/rxe_qp.c:815
> execute_in_process_context+0x3a/0x160 kernel/workqueue.c:4596
> __rxe_cleanup+0x267/0x3c0 drivers/infiniband/sw/rxe/rxe_pool.c:232
> rxe_create_qp+0x3f7/0x5f0 drivers/infiniband/sw/rxe/rxe_verbs.c:604
> create_qp+0x62d/0xa80 drivers/infiniband/core/verbs.c:1250
> ib_create_qp_kernel+0x9f/0x310 drivers/infiniband/core/verbs.c:1361
> ib_create_qp include/rdma/ib_verbs.h:3803 [inline]
> rdma_create_qp+0x10c/0x340 drivers/infiniband/core/cma.c:1144
> rds_ib_setup_qp+0xc86/0x19a0 net/rds/ib_cm.c:600
> rds_ib_cm_initiate_connect+0x1e8/0x3d0 net/rds/ib_cm.c:944
> rds_rdma_cm_event_handler_cmn+0x61f/0x8c0 net/rds/rdma_transport.c:109
> cma_cm_event_handler+0x94/0x300 drivers/infiniband/core/cma.c:2184
> cma_work_handler+0x15b/0x230 drivers/infiniband/core/cma.c:3042
> process_one_work+0x9cc/0x1b70 kernel/workqueue.c:3238
> process_scheduled_works kernel/workqueue.c:3319 [inline]
> worker_thread+0x6c8/0xf10 kernel/workqueue.c:3400
> kthread+0x3c2/0x780 kernel/kthread.c:464
> ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:153
> ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
> </TASK>
>
> The root cause is as below:
>
> In the function rxe_create_qp, the function rxe_qp_from_init is called
> to create the qp. If rxe_qp_from_init fails, rxe_cleanup is called to
> release all the allocated resources, including the timers:
> retrans_timer and rnr_nak_timer.
>
> The function rxe_qp_from_init calls the function rxe_qp_init_req to
> initialize the timers: retrans_timer and rnr_nak_timer.
>
> But these timers are initialized at the end of rxe_qp_init_req.
> If an error occurs before these timers are initialized, the cleanup
> path deletes timers that were never set up, and lockdep reports the
> problem shown in the call trace above.
>
> The solution is to check whether these timers are initialized or not.
> If these timers are not initialized, ignore these timers.
>
> Fixes: 8700e3e7c485 ("Soft RoCE driver")
> Reported-by: syzbot+4edb496c3cad6e953a31@syzkaller.appspotmail.com
> Closes: https://syzkaller.appspot.com/bug?extid=4edb496c3cad6e953a31
> Signed-off-by: Zhu Yanjun <yanjun.zhu@linux.dev>
> Link: https://patch.msgid.link/20250419080741.1515231-1-yanjun.zhu@linux.dev
> Signed-off-by: Leon Romanovsky <leon@kernel.org>
> [ Vladislav: keep del_timer_sync() because linux-6.6.y has not renamed it
> to timer_delete_sync() yet. The actual fix is unchanged: check the timer
> .function fields before deleting the timers. ]
> Signed-off-by: Vladislav Nikolaev <vlad102nikolaev@gmail.com>
> ---
> Backport of upstream commit 1c7eec4d5f3b to linux-6.6.y.
> drivers/infiniband/sw/rxe/rxe_qp.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 287fc8b8f5ba..8426c261c263 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -817,7 +817,12 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
> spin_unlock_irqrestore(&qp->state_lock, flags);
> qp->qp_timeout_jiffies = 0;
>
> - if (qp_type(qp) == IB_QPT_RC) {
> + /* In the function timer_setup, .function is initialized. If .function
> + * is NULL, it indicates the function timer_setup is not called, the
> + * timer is not initialized. Or else, the timer is initialized.
> + */
> + if (qp_type(qp) == IB_QPT_RC && qp->retrans_timer.function &&
> + qp->rnr_nak_timer.function) {
> del_timer_sync(&qp->retrans_timer);
> del_timer_sync(&qp->rnr_nak_timer);
> }
Thanks a lot.
Sashiko:
This isn't a bug introduced by this patch, but does this teardown sequence
leave a window for the timer to be illegally re-armed?
rxe_qp_do_cleanup() deletes the timers here before the asynchronous tasks
(like the completer task) are fully stopped by rxe_cleanup_task() just below
this block.
If rxe_completer() is already executing and has passed the qp->valid check
before it was cleared, del_timer_sync() will return immediately as the timer
isn't pending.
Then, rxe_completer() can process an incoming RNR NAK and reach
COMPST_RNR_RETRY, where it calls mod_timer(&qp->rnr_nak_timer, ...) without
holding the state_lock.
When the cleanup task unblocks and finishes, ib_destroy_qp_user() frees the
qp memory. Later, the newly armed rnr_nak_timer fires, and the
rnr_nak_timer() callback attempts to acquire the freed qp->state_lock,
resulting in a use-after-free.
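The re-arm window described above can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not a proposed fix: the rearm_* names are hypothetical, and the "shutdown" variant only mimics the documented behaviour of the kernel's timer_shutdown_sync(), after which attempts to arm the timer are ignored. Whether that call, or stopping the tasks before deleting the timers, is the right approach for rxe is not settled in this thread.

```c
#include <stdbool.h>

/* Hypothetical timer model: just the two bits this example needs. */
struct rearm_timer {
	bool pending;
	bool shutdown;
};

/* Models mod_timer(): arms the timer unless it has been shut down. */
static bool rearm_mod_timer(struct rearm_timer *t)
{
	if (t->shutdown)
		return false;
	t->pending = true;
	return true;
}

/* Models del_timer_sync(): cancels, but a later arm still succeeds. */
static void rearm_del_timer_sync(struct rearm_timer *t)
{
	t->pending = false;
}

/* Models timer_shutdown_sync(): cancels and refuses later arming. */
static void rearm_timer_shutdown_sync(struct rearm_timer *t)
{
	t->pending = false;
	t->shutdown = true;
}
```

With the plain delete, a late rearm_mod_timer() leaves the timer pending after teardown, which is the window the review points at. With the shutdown variant the late arm is refused.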
Additionally, if a timer fires concurrently with teardown while the refcount
is already 0, it invokes rxe_sched_task(). The underlying rxe_get() fails
silently on the 0-refcount, but the task is still queued. When the task
finishes, it calls rxe_put(), triggering a refcount_t underflow.
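The refcount imbalance in the last paragraph can also be modelled in a few lines. This is a rough sketch with hypothetical ref_* names and a plain int counter. The real refcount_t warns and saturates instead of going negative, so the -1 here only marks the unbalanced put.

```c
#include <stdbool.h>

/* Hypothetical reference counter; a plain int stands in for refcount_t. */
struct ref_obj {
	int count;
};

/* Models a get-unless-zero: fails once the count has reached 0. */
static bool ref_get_unless_zero(struct ref_obj *r)
{
	if (r->count == 0)
		return false;
	r->count++;
	return true;
}

/* Models the put done when a queued task finishes. */
static void ref_task_done(struct ref_obj *r)
{
	r->count--;
}

/* Queues the task even if the get failed: the pattern the review flags. */
static bool ref_sched_unchecked(struct ref_obj *r)
{
	(void)ref_get_unless_zero(r);
	return true;
}

/* Queues the task only if a reference was actually taken. */
static bool ref_sched_checked(struct ref_obj *r)
{
	return ref_get_unless_zero(r);
}
```

Starting from a count of 0, the unchecked variant queues the task and the completion put has no matching get. The checked variant never queues the task, so the count stays at 0.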
I do not think these issues are caused by this commit.
I am fine with this patch.
Zhu Yanjun