From mboxrd@z Thu Jan  1 00:00:00 1970
From: Alexander Aring <aahringo@redhat.com>
Date: Mon,  7 Mar 2022 09:40:48 -0500
Subject: [Cluster-devel] [PATCH dlm/next] fs: dlm: move some midcomms
 WARN_ON to BUG
Message-ID: <20220307144048.2451280-1-aahringo@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Recently those warnings were triggered on gfs2 handling by calling
dlm API which runs into a BUG() because -ENOBUFS had beedn returned.
Those cases which has a WARN_ON() are not related to any memory failure
but should never happen. It's because we reset a midcomms state and the
dlm api still tries to transmit something which should be prevented by
dlm application layer handling e.g. locks.

Call trace of warning was:

[14003.162881] Call Trace:
[14003.162883]  [<000003ff80796d70>] dlm_midcomms_get_mhandle+0x170/0x1f0 [dlm]
[14003.162892] ([<000003ff80796d6c>] dlm_midcomms_get_mhandle+0x16c/0x1f0 [dlm])
[14003.162901]  [<000003ff80787366>] create_message+0x56/0x100 [dlm]
[14003.162909]  [<000003ff8078849c>] send_common+0x7c/0x130 [dlm]
[14003.162928]  [<000003ff8078b50c>] _convert_lock+0x3c/0x140 [dlm]
[14003.162936]  [<000003ff8078b698>] convert_lock+0x88/0xd0 [dlm]
[14003.162944]  [<000003ff80790008>] dlm_lock+0x158/0x1b0 [dlm]
[14003.162952]  [<000003ff807ff4c6>] gdlm_lock+0x1f6/0x2f0 [gfs2]
[14003.162997]  [<000003ff807d96c8>] do_xmote+0x1f8/0x440 [gfs2]
[14003.163008]  [<000003ff807d9d88>] gfs2_glock_nq+0x88/0x130 [gfs2]
[14003.163020]  [<000003ff807fac92>] gfs2_statfs_sync+0x52/0x180 [gfs2]
[14003.163031]  [<000003ff807f2b70>] gfs2_quotad+0xc0/0x360 [gfs2]
[14003.163043]  [<0000000050527cfc>] kthread+0x17c/0x190
[14003.163061]  [<00000000504af5dc>] __ret_from_fork+0x3c/0x60
[14003.163064]  [<0000000050d6df4a>] ret_from_fork+0xa/0x30

Call trace of BUG() was:

 #0 [8026be60] __machine_kexec at 504c09ee
 #1 [8026bea0] pcpu_delegate at 504c389e
 #2 [380004ab8b0] smp_call_ipl_cpu at 504c4b66
 #3 [380004ab8d0] __crash_kexec at 505c488a
 #4 [380004ab9a8] panic at 50d58682
 #5 [380004aba48] die at 504c1b28
 #6 [380004abab0] __do_pgm_check at 50d60966
 #7 [380004abb00] pgm_check_handler at 50d6e088
 PSW:  0704c00180000000 000003ff807d97e6 (do_xmote+790 [gfs2])
 GPRS: c0000000ffffbfff 0000000000000027 0000000000000067 00000000ffffbfff
       00000380004ab798 00000380004ab790 000003ff807f2b70 000003ff80810df0
       0000000086115000 00000380004abd98 0000000000000001 0000000083ef9540
       0000000081421500 0000000000000400 000003ff807d97e2 00000380004abc60
 #0 [380004abcb8] gfs2_glock_nq at 3ff807d9d88 [gfs2]
 #1 [380004abcf0] gfs2_statfs_sync at 3ff807fac92 [gfs2]
 #2 [380004abd88] gfs2_quotad at 3ff807f2b70 [gfs2]
 #3 [380004abe18] kthread at 50527cfc
 #4 [380004abe70] __ret_from_fork at 504af5dc
 #5 [380004abea0] ret_from_fork at 50d6df4a

A vmcore file was captured when BUG() on gfs2 level was being triggered.
After analyzing vmcore I had no issues found and specific lock states like
"ls->ls_in_recovery" was in write state, so the above call trace should
never occur. There is a small cap between the WARN_ON() call trace and the
BUG() in gfs2 call so the vmcore file cannot be trusted because the
specific lock states could be different in the call trace of WARN_ON().

To be prepared next time and having an accurate vmcore file we move the
WARN_ON() to BUG(). The problem was probably related to a corosync error
where dlm_controld log showed the following errors multiple times:

Feb 24 12:12:40 4008 cpg_dispatch error 2

This could end in a nondeterministic behaviour in the upcall/downcall
mechanism of fencing/new config (recovery) handling. The reasons for
those errors are still unknown.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
---
 fs/dlm/midcomms.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/midcomms.c b/fs/dlm/midcomms.c
index 3635e42b0669..46bd1d84c7b8 100644
--- a/fs/dlm/midcomms.c
+++ b/fs/dlm/midcomms.c
@@ -1110,7 +1110,7 @@ struct dlm_mhandle *dlm_midcomms_get_mhandle(int nodeid, int len,
 		break;
 	default:
 		dlm_free_mhandle(mh);
-		WARN_ON(1);
+		BUG();
 		goto err;
 	}
 
@@ -1153,7 +1153,7 @@ void dlm_midcomms_commit_mhandle(struct dlm_mhandle *mh)
 		break;
 	default:
 		srcu_read_unlock(&nodes_srcu, mh->idx);
-		WARN_ON(1);
+		BUG();
 		break;
 	}
 }
-- 
2.31.1