public inbox for linux-kernel@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>,
	Leon Romanovsky <leon@kernel.org>,
	Sasha Levin <sashal@kernel.org>,
	markzhang@nvidia.com, linux-rdma@vger.kernel.org
Subject: [PATCH AUTOSEL 5.10 27/31] RDMA/cm: add timeout to cm_destroy_id wait
Date: Fri, 29 Mar 2024 08:48:44 -0400
Message-ID: <20240329124903.3093161-27-sashal@kernel.org>
In-Reply-To: <20240329124903.3093161-1-sashal@kernel.org>

From: Manjunath Patil <manjunath.b.patil@oracle.com>

[ Upstream commit 96d9cbe2f2ff7abde021bac75eafaceabe9a51fa ]

Add a timeout to the wait in cm_destroy_id() so that userspace can trigger
data collection that helps analyze why destroying the cm_id is delayed.

The new function is marked noinline so that dtrace/eBPF programs can hook
on to it. Existing behavior is unchanged, except that a probe-able
function is now called at every timeout interval.

We have seen cases where CM messages get stuck in the MAD layer (due to
either a software bug or a faulty HCA), leaving the cm_id stuck in the
following call stack. This patch helps resolve such issues faster.

kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
...
	Call Trace:
	__schedule+0x2bc/0x895
	schedule+0x36/0x7c
	schedule_timeout+0x1f6/0x31f
	? __slab_free+0x19c/0x2ba
	wait_for_completion+0x12b/0x18a
	? wake_up_q+0x80/0x73
	cm_destroy_id+0x345/0x610 [ib_cm]
	ib_destroy_cm_id+0x10/0x20 [ib_cm]
	rdma_destroy_id+0xa8/0x300 [rdma_cm]
	ucma_destroy_id+0x13e/0x190 [rdma_ucm]
	ucma_write+0xe0/0x160 [rdma_ucm]
	__vfs_write+0x3a/0x16d
	vfs_write+0xb2/0x1a1
	? syscall_trace_enter+0x1ce/0x2b8
	SyS_write+0x5c/0xd3
	do_syscall_64+0x79/0x1b9
	entry_SYSCALL_64_after_hwframe+0x16d/0x0

Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/infiniband/core/cm.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index db1a25fbe2fa9..2a30b25c5e7e5 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -33,6 +33,7 @@ MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("InfiniBand CM");
 MODULE_LICENSE("Dual BSD/GPL");
 
+#define CM_DESTROY_ID_WAIT_TIMEOUT 10000 /* msecs */
 static const char * const ibcm_rej_reason_strs[] = {
 	[IB_CM_REJ_NO_QP]			= "no QP",
 	[IB_CM_REJ_NO_EEC]			= "no EEC",
@@ -1056,10 +1057,20 @@ static void cm_reset_to_idle(struct cm_id_private *cm_id_priv)
 	}
 }
 
+static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id)
+{
+	struct cm_id_private *cm_id_priv;
+
+	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+	pr_err("%s: cm_id=%p timed out. state=%d refcnt=%d\n", __func__,
+	       cm_id, cm_id->state, refcount_read(&cm_id_priv->refcount));
+}
+
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_work *work;
+	int ret;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 	spin_lock_irq(&cm_id_priv->lock);
@@ -1171,7 +1182,14 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 
 	xa_erase(&cm.local_id_table, cm_local_id(cm_id->local_id));
 	cm_deref_id(cm_id_priv);
-	wait_for_completion(&cm_id_priv->comp);
+	do {
+		ret = wait_for_completion_timeout(&cm_id_priv->comp,
+						  msecs_to_jiffies(
+						  CM_DESTROY_ID_WAIT_TIMEOUT));
+		if (!ret) /* timeout happened */
+			cm_destroy_id_wait_timeout(cm_id);
+	} while (!ret);
+
 	while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
 		cm_free_work(work);
 
-- 
2.43.0
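
For readers who want to see the control flow outside the kernel, the retry
loop added to cm_destroy_id() can be modelled in userspace. The following is
only a minimal sketch under stated assumptions, not the kernel code: every
demo_* name is hypothetical, a pipe polled with poll(2) stands in for
struct completion, the 10000 ms interval is shortened to 20 ms, and the
"stuck" condition is cleared artificially by the demo itself after a chosen
number of timeouts.

```c
#include <assert.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* The patch uses 10000 ms; shortened here so the demo finishes quickly. */
#define DEMO_WAIT_TIMEOUT_MS 20

/* Hypothetical stand-in for struct completion: the read end of the pipe
 * becomes readable once the completion has been signalled. */
struct demo_completion {
	int fds[2];
};

static void demo_complete(struct demo_completion *c)
{
	char token = 1;
	ssize_t n = write(c->fds[1], &token, 1);

	assert(n == 1);
}

/* Like wait_for_completion_timeout(): 0 on timeout, nonzero once completed. */
static int demo_wait_timeout(struct demo_completion *c, int timeout_ms)
{
	struct pollfd pfd = { .fd = c->fds[0], .events = POLLIN };
	int rc = poll(&pfd, 1, timeout_ms);

	assert(rc >= 0);
	return rc;
}

/* Kept out of line so it stays a distinct symbol that a tracer can probe,
 * mirroring the role of cm_destroy_id_wait_timeout() in the patch. */
static __attribute__((noinline)) void demo_wait_timeout_hook(int nr_timeouts)
{
	fprintf(stderr, "%s: still waiting after %d timeout(s)\n",
		__func__, nr_timeouts);
}

/* Waits until completion, calling the hook once per expired interval.
 * stuck_intervals is a demo-only knob: the completion is signalled after
 * that many timeouts. Returns the number of timeouts seen, or -1 if the
 * pipe could not be created. */
int demo_destroy_wait(int stuck_intervals)
{
	struct demo_completion c;
	int nr_timeouts = 0;
	int ret;

	if (pipe(c.fds) != 0)
		return -1;
	if (stuck_intervals <= 0)
		demo_complete(&c); /* nothing is stuck: complete up front */

	do {
		ret = demo_wait_timeout(&c, DEMO_WAIT_TIMEOUT_MS);
		if (!ret) { /* timeout happened */
			demo_wait_timeout_hook(++nr_timeouts);
			if (nr_timeouts >= stuck_intervals)
				demo_complete(&c); /* simulate the release */
		}
	} while (!ret);

	close(c.fds[0]);
	close(c.fds[1]);
	return nr_timeouts;
}
```

Calling demo_destroy_wait(3) from a main() reports three timeouts on stderr
before returning. As in the patch, the wait itself never gives up; the
timeout only creates a periodic point at which a probe or log line can fire.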

