From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Aring
Date: Wed, 29 Sep 2021 13:10:54 -0400
Subject: [Cluster-devel] [PATCH dlm/next] fs: dlm: avoid ls_waiters_mutex circular lock dependency warning
Message-ID: <20210929171054.2623204-1-aahringo@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

This patch documents a very specific DLM behaviour that we rely on to
avoid a deadlock. The behaviour concerns the state in which new DLM
operations and the processing of DLM messages are stopped. While DLM is
in this state, a certain group of functions cannot run, e.g. the
central locking logic stages of fs/dlm/lock.c. To check for this state,
a new assert DLM_ASSERT_OPS_LOCKED() is introduced which verifies that
the required locks are held wherever this state is required.

One function that requires this state is dlm_recover_waiters_pre().
Without it, a deadlock could occur because ls_waiters_mutex and the
res_mutex of a resource are taken in a different order than during
normal lock operations. This reverse lock order cannot actually happen
while dlm_recover_waiters_pre() runs, because the locks that stop all
"lock operations" are held, which is exactly what
DLM_ASSERT_OPS_LOCKED() checks.

This patch therefore takes ls_waiters_mutex in a different lockclass in
dlm_recover_waiters_pre() to avoid a lockdep warning about a possible
circular lock dependency. Conceptually it is the same lockclass, but as
far as I understand this is the only way to tell lockdep to treat the
lock call in dlm_recover_waiters_pre() differently from the other
ls_waiters_mutex lock calls.
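For illustration only (not part of the patch), here is a minimal
stand-alone sketch of how a lockdep subclass separates two acquisitions
of the same mutex; the module and the names demo_a, demo_b and
lockdep_subclass_demo are made up:

  #include <linux/module.h>
  #include <linux/mutex.h>

  static DEFINE_MUTEX(demo_a);
  static DEFINE_MUTEX(demo_b);

  static int __init lockdep_subclass_demo_init(void)
  {
          /* First order: a -> b, recorded by lockdep as a dependency. */
          mutex_lock(&demo_a);
          mutex_lock(&demo_b);
          mutex_unlock(&demo_b);
          mutex_unlock(&demo_a);

          /* Reverse order: b -> a. With a plain mutex_lock(&demo_a)
           * here lockdep would report a possible circular locking
           * dependency. Taking demo_a with subclass 1 registers the
           * acquisition under a separate lockdep class, so it is not
           * chained with the subclass-0 acquisition above.
           */
          mutex_lock(&demo_b);
          mutex_lock_nested(&demo_a, 1);
          mutex_unlock(&demo_a);
          mutex_unlock(&demo_b);

          return 0;
  }

  static void __exit lockdep_subclass_demo_exit(void)
  {
  }

  module_init(lockdep_subclass_demo_init);
  module_exit(lockdep_subclass_demo_exit);
  MODULE_LICENSE("GPL");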
The patch should avoid the following warning:

[ 619.855891] ======================================================
[ 619.856858] WARNING: possible circular locking dependency detected
[ 619.857865] 5.14.0-1.el9.x86_64+debug #1 Not tainted
[ 619.858646] ------------------------------------------------------
[ 619.859646] dlm_recoverd/3961 is trying to acquire lock:
[ 619.860478] ffff888019dcd628 (&r->res_mutex){+.+.}-{3:3}, at: _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.861999]
[ 619.861999] but task is already holding lock:
[ 619.862933] ffff88800ee901a8 (&ls->ls_waiters_mutex){+.+.}-{3:3}, at: dlm_recover_waiters_pre+0x72/0xc80 [dlm]
[ 619.864529]
[ 619.864529] which lock already depends on the new lock.
[ 619.864529]
[ 619.865837]
[ 619.865837] the existing dependency chain (in reverse order) is:
[ 619.866993]
[ 619.866993] -> #1 (&ls->ls_waiters_mutex){+.+.}-{3:3}:
[ 619.868088] __lock_acquire+0xb72/0x1870
[ 619.868861] lock_acquire+0x1ca/0x570
[ 619.869554] __mutex_lock+0x14c/0x1170
[ 619.870283] add_to_waiters+0x6a/0x500 [dlm]
[ 619.871047] _request_lock+0x39f/0x9f0 [dlm]
[ 619.871860] request_lock.part.0+0x1ae/0x220 [dlm]
[ 619.872713] dlm_user_request+0x237/0x5a0 [dlm]
[ 619.873555] device_user_lock+0x42c/0x660 [dlm]
[ 619.874366] device_write+0x5ff/0x8d0 [dlm]
[ 619.875116] vfs_write+0x1c7/0x850
[ 619.875762] ksys_write+0xf9/0x1d0
[ 619.876385] do_syscall_64+0x3b/0x90
[ 619.877034] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 619.877972]
[ 619.877972] -> #0 (&r->res_mutex){+.+.}-{3:3}:
[ 619.878931] check_prev_add+0x15e/0x20f0
[ 619.879699] validate_chain+0xaba/0xde0
[ 619.880404] __lock_acquire+0xb72/0x1870
[ 619.881100] lock_acquire+0x1ca/0x570
[ 619.881823] __mutex_lock+0x14c/0x1170
[ 619.882506] _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.883365] dlm_recover_waiters_pre+0x6e8/0xc80 [dlm]
[ 619.884262] ls_recover.isra.0+0x517/0x1090 [dlm]
[ 619.885087] dlm_recoverd+0x348/0x430 [dlm]
[ 619.885844] kthread+0x329/0x3e0
[ 619.886456] ret_from_fork+0x22/0x30
[ 619.887113]
[ 619.887113] other info that might help us debug this:
[ 619.887113]
[ 619.888376] Possible unsafe locking scenario:
[ 619.888376]
[ 619.889359]        CPU0                    CPU1
[ 619.890064]        ----                    ----
[ 619.890775]   lock(&ls->ls_waiters_mutex);
[ 619.891436]                                lock(&r->res_mutex);
[ 619.892378]                                lock(&ls->ls_waiters_mutex);
[ 619.893436]   lock(&r->res_mutex);
[ 619.893991]
[ 619.893991]  *** DEADLOCK ***
[ 619.893991]
[ 619.894930] 3 locks held by dlm_recoverd/3961:
[ 619.895647] #0: ffff88800ee90d78 (&ls->ls_in_recovery){++++}-{3:3}, at: dlm_recoverd+0x1d1/0x430 [dlm]
[ 619.897173] #1: ffff88800ee90c68 (&ls->ls_recoverd_active){+.+.}-{3:3}, at: ls_recover.isra.0+0xf9/0x1090 [dlm]
[ 619.898759] #2: ffff88800ee901a8 (&ls->ls_waiters_mutex){+.+.}-{3:3}, at: dlm_recover_waiters_pre+0x72/0xc80 [dlm]
[ 619.900439]
[ 619.900439] stack backtrace:
[ 619.901145] CPU: 1 PID: 3961 Comm: dlm_recoverd Not tainted 5.14.0-1.el9.x86_64+debug #1
[ 619.902461] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 619.903390] Call Trace:
[ 619.903808] dump_stack_lvl+0x57/0x7d
[ 619.904493] check_noncircular+0x26a/0x310
[ 619.905155] ? print_circular_bug+0x1f0/0x1f0
[ 619.905839] ? alloc_chain_hlocks+0x1de/0x530
[ 619.906528] check_prev_add+0x15e/0x20f0
[ 619.907155] validate_chain+0xaba/0xde0
[ 619.907787] ? check_prev_add+0x20f0/0x20f0
[ 619.908489] __lock_acquire+0xb72/0x1870
[ 619.909147] lock_acquire+0x1ca/0x570
[ 619.909730] ? _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.910554] ? rcu_read_unlock+0x40/0x40
[ 619.911183] ? __lock_acquired+0x1d2/0x8c0
[ 619.911826] ? dlm_recoverd+0x348/0x430 [dlm]
[ 619.912541] __mutex_lock+0x14c/0x1170
[ 619.913160] ? _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.913997] ? _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.914838] ? mutex_lock_io_nested+0xfc0/0xfc0
[ 619.915552] ? dlm_recover_waiters_pre+0x72/0xc80 [dlm]
[ 619.916380] ? io_schedule_timeout+0x150/0x150
[ 619.917072] ? mutex_lock_io_nested+0xfc0/0xfc0
[ 619.917833] ? lockdep_hardirqs_on_prepare.part.0+0x19a/0x350
[ 619.918738] ? _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.919568] _receive_unlock_reply+0x78/0x600 [dlm]
[ 619.920352] dlm_recover_waiters_pre+0x6e8/0xc80 [dlm]
[ 619.921186] ls_recover.isra.0+0x517/0x1090 [dlm]
[ 619.921941] ? dlm_clear_toss+0x280/0x280 [dlm]
[ 619.922666] ? dlm_recoverd+0x33d/0x430 [dlm]
[ 619.923384] dlm_recoverd+0x348/0x430 [dlm]
[ 619.924053] ? ls_recover.isra.0+0x1090/0x1090 [dlm]
[ 619.924896] kthread+0x329/0x3e0
[ 619.925422] ? _raw_spin_unlock_irq+0x24/0x30
[ 619.926100] ? set_kthread_struct+0x100/0x100
[ 619.926788] ret_from_fork+0x22/0x30
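Read side by side, the two chains above amount to the following orders
(a simplified sketch, not literal DLM code; r stands for the rsb of the
lkb in question):

  /* chain #1, a normal lock operation via dlm_user_request():        */
  mutex_lock(&r->res_mutex);           /* rsb already locked ...      */
  mutex_lock(&ls->ls_waiters_mutex);   /* ... when add_to_waiters() runs */

  /* chain #0, recovery via dlm_recover_waiters_pre():                */
  mutex_lock(&ls->ls_waiters_mutex);   /* dlm_recover_waiters_pre()   */
  mutex_lock(&r->res_mutex);           /* _receive_unlock_reply()     */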
Signed-off-by: Alexander Aring
---
 fs/dlm/dlm_internal.h | 22 ++++++++++++++++++++++
 fs/dlm/lock.c         | 12 +++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index 49cf83e04c80..3fb610dfea19 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -746,6 +746,28 @@ static inline int dlm_no_directory(struct dlm_ls *ls)
 	return test_bit(LSFL_NODIR, &ls->ls_flags);
 }
 
+#ifdef CONFIG_PROVE_LOCKING
+static inline int dlm_assert_ops_unlocked(struct dlm_ls *ls)
+{
+	int ret = 0;
+
+	/* will block all new DLM-API from user */
+	ret |= !rwsem_is_locked(&ls->ls_in_recovery);
+	/* assumes this flag was set while ls_recv_active was locked.
+	 * This makes sure no receive is being processed and all
+	 * further receives are queued onto the requestqueue.
+	 */
+	ret |= !dlm_locking_stopped(ls);
+
+	return ret;
+}
+#define DLM_ASSERT_OPS_LOCKED(ls) \
+	DLM_ASSERT(dlm_assert_ops_unlocked(ls) == 0, \
+		   printk("Locking active in ls: %s\n", ls->ls_name);)
+#else
+#define DLM_ASSERT_OPS_LOCKED(ls)
+#endif /* CONFIG_PROVE_LOCKING */
+
 int dlm_netlink_init(void);
 void dlm_netlink_exit(void);
 void dlm_timeout_warn(struct dlm_lkb *lkb);
diff --git a/fs/dlm/lock.c b/fs/dlm/lock.c
index c502c065d007..4231edb3c614 100644
--- a/fs/dlm/lock.c
+++ b/fs/dlm/lock.c
@@ -5125,7 +5125,17 @@ void dlm_recover_waiters_pre(struct dlm_ls *ls)
 	if (!ms_stub)
 		return;
 
-	mutex_lock(&ls->ls_waiters_mutex);
+	/* Use mutex_lock_nested() with a different lockclass here to avoid
+	 * a false positive report about a possible circular lock dependency.
+	 * Conceptually the lockclass should be the same, but this is how we
+	 * silence the false positive. The report is about ls_waiters_mutex
+	 * and an lkb's res_mutex being taken in a different order here than
+	 * during lock operations. However, lock operations cannot run in
+	 * parallel with dlm_recover_waiters_pre() because certain locks are
+	 * held, which is what DLM_ASSERT_OPS_LOCKED(ls) checks.
+	 */
+	DLM_ASSERT_OPS_LOCKED(ls);
+	mutex_lock_nested(&ls->ls_waiters_mutex, 1);
 
 	list_for_each_entry_safe(lkb, safe, &ls->ls_waiters, lkb_wait_reply) {
 
-- 
2.27.0