From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Manjunath Patil <manjunath.b.patil@oracle.com>,
Leon Romanovsky <leon@kernel.org>,
Sasha Levin <sashal@kernel.org>,
markzhang@nvidia.com, linux-rdma@vger.kernel.org
Subject: [PATCH AUTOSEL 6.6 65/75] RDMA/cm: add timeout to cm_destroy_id wait
Date: Fri, 29 Mar 2024 08:42:46 -0400
Message-ID: <20240329124330.3089520-65-sashal@kernel.org>
In-Reply-To: <20240329124330.3089520-1-sashal@kernel.org>
From: Manjunath Patil <manjunath.b.patil@oracle.com>
[ Upstream commit 96d9cbe2f2ff7abde021bac75eafaceabe9a51fa ]
Add a timeout to the cm_destroy_id wait, so that userspace can trigger
any data collection that would help in analyzing the cause of a delay in
destroying the cm_id.

The new noinline function lets dtrace/eBPF programs hook on to it.
Existing functionality is unchanged, except that a new probe-able
function is called at every timeout interval.

We have seen cases where CM messages get stuck in the MAD layer (due to
either a software bug or a faulty HCA), leaving the cm_id stuck in the
following call stack. This patch helps in resolving such issues faster.

kernel: ... INFO: task XXXX:56778 blocked for more than 120 seconds.
...
Call Trace:
__schedule+0x2bc/0x895
schedule+0x36/0x7c
schedule_timeout+0x1f6/0x31f
? __slab_free+0x19c/0x2ba
wait_for_completion+0x12b/0x18a
? wake_up_q+0x80/0x73
cm_destroy_id+0x345/0x610 [ib_cm]
ib_destroy_cm_id+0x10/0x20 [ib_cm]
rdma_destroy_id+0xa8/0x300 [rdma_cm]
ucma_destroy_id+0x13e/0x190 [rdma_ucm]
ucma_write+0xe0/0x160 [rdma_ucm]
__vfs_write+0x3a/0x16d
vfs_write+0xb2/0x1a1
? syscall_trace_enter+0x1ce/0x2b8
SyS_write+0x5c/0xd3
do_syscall_64+0x79/0x1b9
entry_SYSCALL_64_after_hwframe+0x16d/0x0
Signed-off-by: Manjunath Patil <manjunath.b.patil@oracle.com>
Link: https://lore.kernel.org/r/20240309063323.458102-1-manjunath.b.patil@oracle.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 drivers/infiniband/core/cm.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index ff58058aeadca..bf0df6ee4f785 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -34,6 +34,7 @@ MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("InfiniBand CM");
 MODULE_LICENSE("Dual BSD/GPL");
 
+#define CM_DESTROY_ID_WAIT_TIMEOUT 10000 /* msecs */
 static const char * const ibcm_rej_reason_strs[] = {
 	[IB_CM_REJ_NO_QP] = "no QP",
 	[IB_CM_REJ_NO_EEC] = "no EEC",
@@ -1025,10 +1026,20 @@ static void cm_reset_to_idle(struct cm_id_private *cm_id_priv)
 	}
 }
 
+static noinline void cm_destroy_id_wait_timeout(struct ib_cm_id *cm_id)
+{
+	struct cm_id_private *cm_id_priv;
+
+	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
+	pr_err("%s: cm_id=%p timed out. state=%d refcnt=%d\n", __func__,
+	       cm_id, cm_id->state, refcount_read(&cm_id_priv->refcount));
+}
+
 static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 {
 	struct cm_id_private *cm_id_priv;
 	struct cm_work *work;
+	int ret;
 
 	cm_id_priv = container_of(cm_id, struct cm_id_private, id);
 	spin_lock_irq(&cm_id_priv->lock);
@@ -1135,7 +1146,14 @@ static void cm_destroy_id(struct ib_cm_id *cm_id, int err)
 
 	xa_erase(&cm.local_id_table, cm_local_id(cm_id->local_id));
 	cm_deref_id(cm_id_priv);
-	wait_for_completion(&cm_id_priv->comp);
+	do {
+		ret = wait_for_completion_timeout(&cm_id_priv->comp,
+						  msecs_to_jiffies(
+						  CM_DESTROY_ID_WAIT_TIMEOUT));
+		if (!ret) /* timeout happened */
+			cm_destroy_id_wait_timeout(cm_id);
+	} while (!ret);
+
 	while ((work = cm_dequeue_work(cm_id_priv)) != NULL)
 		cm_free_work(work);
 
--
2.43.0