From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexander Aring
Date: Mon, 7 Mar 2022 09:40:48 -0500
Subject: [Cluster-devel] [PATCH dlm/next] fs: dlm: move some midcomms WARN_ON to BUG
Message-ID: <20220307144048.2451280-1-aahringo@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Recently these warnings were triggered when gfs2 called the dlm API,
and gfs2 then ran into a BUG() because -ENOBUFS had been returned. The
cases guarded by WARN_ON() are not related to any memory failure; they
should simply never happen. They occur when we reset a midcomms state
while the dlm API still tries to transmit something, which should be
prevented by dlm application layer handling, e.g. locks.

Call trace of warning was:

[14003.162881] Call Trace:
[14003.162883] [<000003ff80796d70>] dlm_midcomms_get_mhandle+0x170/0x1f0 [dlm]
[14003.162892] ([<000003ff80796d6c>] dlm_midcomms_get_mhandle+0x16c/0x1f0 [dlm])
[14003.162901] [<000003ff80787366>] create_message+0x56/0x100 [dlm]
[14003.162909] [<000003ff8078849c>] send_common+0x7c/0x130 [dlm]
[14003.162928] [<000003ff8078b50c>] _convert_lock+0x3c/0x140 [dlm]
[14003.162936] [<000003ff8078b698>] convert_lock+0x88/0xd0 [dlm]
[14003.162944] [<000003ff80790008>] dlm_lock+0x158/0x1b0 [dlm]
[14003.162952] [<000003ff807ff4c6>] gdlm_lock+0x1f6/0x2f0 [gfs2]
[14003.162997] [<000003ff807d96c8>] do_xmote+0x1f8/0x440 [gfs2]
[14003.163008] [<000003ff807d9d88>] gfs2_glock_nq+0x88/0x130 [gfs2]
[14003.163020] [<000003ff807fac92>] gfs2_statfs_sync+0x52/0x180 [gfs2]
[14003.163031] [<000003ff807f2b70>] gfs2_quotad+0xc0/0x360 [gfs2]
[14003.163043] [<0000000050527cfc>] kthread+0x17c/0x190
[14003.163061] [<00000000504af5dc>] __ret_from_fork+0x3c/0x60
[14003.163064] [<0000000050d6df4a>] ret_from_fork+0xa/0x30

Call trace of BUG() was:

#0 [8026be60] __machine_kexec at 504c09ee
#1 [8026bea0] pcpu_delegate at 504c389e
#2 [380004ab8b0] smp_call_ipl_cpu at 504c4b66
#3 [380004ab8d0] __crash_kexec at 505c488a
#4 [380004ab9a8] panic at 50d58682
#5 [380004aba48] die at 504c1b28
#6 [380004abab0] __do_pgm_check at 50d60966
#7 [380004abb00] pgm_check_handler at 50d6e088
PSW: 0704c00180000000 000003ff807d97e6 (do_xmote+790 [gfs2])
GPRS: c0000000ffffbfff 0000000000000027 0000000000000067 00000000ffffbfff
      00000380004ab798 00000380004ab790 000003ff807f2b70 000003ff80810df0
      0000000086115000 00000380004abd98 0000000000000001 0000000083ef9540
      0000000081421500 0000000000000400 000003ff807d97e2 00000380004abc60
#0 [380004abcb8] gfs2_glock_nq at 3ff807d9d88 [gfs2]
#1 [380004abcf0] gfs2_statfs_sync at 3ff807fac92 [gfs2]
#2 [380004abd88] gfs2_quotad at 3ff807f2b70 [gfs2]
#3 [380004abe18] kthread at 50527cfc
#4 [380004abe70] __ret_from_fork at 504af5dc
#5 [380004abea0] ret_from_fork at 50d6df4a

A vmcore file was captured when the BUG() at the gfs2 level was
triggered. After analyzing the vmcore I found no issues, and specific
lock states like "ls->ls_in_recovery" were in write state, so the above
call trace should never occur. However, there is a small gap between
the WARN_ON() call trace and the BUG() in gfs2, so the vmcore file
cannot be trusted: the specific lock states could have been different
at the time of the WARN_ON() call trace. To be prepared next time and
to have an accurate vmcore file, we move the WARN_ON() to BUG().

The problem was probably related to a corosync error, where the
dlm_controld log showed the following error multiple times:

Feb 24 12:12:40 4008 cpg_dispatch error 2

This could result in nondeterministic behaviour in the upcall/downcall
mechanism of fencing/new config (recovery) handling. The reasons for
those errors are still unknown.

Signed-off-by: Alexander Aring
---
 fs/dlm/midcomms.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/dlm/midcomms.c b/fs/dlm/midcomms.c
index 3635e42b0669..46bd1d84c7b8 100644
--- a/fs/dlm/midcomms.c
+++ b/fs/dlm/midcomms.c
@@ -1110,7 +1110,7 @@ struct dlm_mhandle *dlm_midcomms_get_mhandle(int nodeid, int len,
 		break;
 	default:
 		dlm_free_mhandle(mh);
-		WARN_ON(1);
+		BUG();
 		goto err;
 	}
 
@@ -1153,7 +1153,7 @@ void dlm_midcomms_commit_mhandle(struct dlm_mhandle *mh)
 		break;
 	default:
 		srcu_read_unlock(&nodes_srcu, mh->idx);
-		WARN_ON(1);
+		BUG();
 		break;
 	}
 }
-- 
2.31.1
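
As an illustration of the reasoning in the commit message (not part of
the patch), the sketch below is a userspace analogue in C of the two
default-case behaviours. All names in it (sketch_state, SKETCH_WARN,
SKETCH_BUG, get_handle_warn, get_handle_bug) are hypothetical stand-ins
that do not exist in fs/dlm. The kernel's WARN_ON() and BUG() are only
approximated here by a message on stderr and by abort(). The sketch
assumes nothing more than that the patched functions switch on some
state value and treat unknown values as impossible.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical states; these are NOT the real midcomms definitions. */
enum sketch_state { SKETCH_ESTABLISHED, SKETCH_RESET };

/* Rough userspace analogue of WARN_ON(1): report and keep running. */
#define SKETCH_WARN() \
	fprintf(stderr, "warning at %s:%d\n", __FILE__, __LINE__)

/* Rough userspace analogue of BUG(): report and stop immediately, so a
 * core dump reflects the state at the moment of detection. */
#define SKETCH_BUG() \
	do { \
		fprintf(stderr, "bug at %s:%d\n", __FILE__, __LINE__); \
		abort(); \
	} while (0)

/* Old behaviour: warn, return an error, let the caller carry on. */
int get_handle_warn(enum sketch_state st)
{
	switch (st) {
	case SKETCH_ESTABLISHED:
		return 0;
	default:
		SKETCH_WARN();
		return -1; /* a later crash may see different state */
	}
}

/* New behaviour: stop at the first sign of the impossible state. */
int get_handle_bug(enum sketch_state st)
{
	switch (st) {
	case SKETCH_ESTABLISHED:
		return 0;
	default:
		SKETCH_BUG(); /* does not return */
	}
	return -1; /* not reached; keeps compilers quiet */
}
```

Calling get_handle_warn(SKETCH_RESET) prints a warning and returns -1,
so the caller keeps running and any crash dump taken later shows later
state. Calling get_handle_bug(SKETCH_RESET) ends in abort() at the
point of detection, which is the property the patch wants for the
vmcore.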