From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bob Peterson
Date: Mon, 5 Oct 2015 12:15:56 -0400 (EDT)
Subject: [Cluster-devel] GFS2 deadlock
In-Reply-To:
References:
Message-ID: <673421950.40526151.1444061756793.JavaMail.zimbra@redhat.com>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

----- Original Message -----
> We've just run into a deadlock.
>
> It seems very similar to the one referenced in commit
> 44ad37d69b2cc421d5b5c7ad7fed16230685b092
>
> is it possible that fs/gfs2/export.c:gfs2_get_dentry()
>
> 140 inode = gfs2_ilookup(sb, inum->no_addr, 0);
>
> should be:
>
> 140 inode = gfs2_ilookup(sb, inum->no_addr, 1);
>
> ?
>
> I have a dump if more information would help.
>
> same inode:
> this is gfs2_inode->i_iopen_gh->gh_gl
> G: s:SH n:5/3157699 f:DIqob t:SH d:UN/104484397000 a:0 v:0 r:3 m:200
> H: s:SH f:EH e:0 p:24919 [nfsd] gfs2_inode_lookup+0x10e/0x210 [gfs2]
>
> this is gfs2_inode->i_gl
> G: s:EX n:2/3157699 f:yIqob t:EX d:EX/0 a:0 v:0 r:4 m:200
> H: s:EX f:H e:0 p:24920 [nfsd] gfs2_evict_inode+0x124/0x400 [gfs2]
> I: n:81596/51738265 t:8 f:0x00 d:0x00000000 s:500
>
> This is doing SEQ/PUTFH/GETATTR:
>
> crash> bt
> PID: 24919 TASK: ffff881f9e11d160 CPU: 32 COMMAND: "nfsd"
> #0 [ffff883f62443950] __schedule at ffffffff8165aaf4
> #1 [ffff883f624439a0] schedule at ffffffff8165b1a7
> #2 [ffff883f624439a8] __wait_on_freeing_inode at ffffffff811fbe1c
> #3 [ffff883f62443a30] find_inode at ffffffff811fbed1
> #4 [ffff883f62443a80] ilookup5_nowait at ffffffff811fbf61
> #5 [ffff883f62443ab0] ilookup5 at ffffffff811fcb33
> #6 [ffff883f62443ad0] gfs2_ilookup at ffffffffa080d1db [gfs2]
> #7 [ffff883f62443af0] gfs2_get_dentry at ffffffffa0806a11 [gfs2]
> #8 [ffff883f62443b10] gfs2_fh_to_dentry at ffffffffa0806b2c [gfs2]
> #9 [ffff883f62443b30] exportfs_decode_fh at ffffffff81262ef2
> #10 [ffff883f62443ca0] fh_verify at ffffffffa057e977 [nfsd]
> #11 [ffff883f62443d20] nfsd4_putfh at ffffffffa058ce6d [nfsd]
> #12 [ffff883f62443d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
> #13 [ffff883f62443db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
> #14 [ffff883f62443df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
> #15 [ffff883f62443e60] svc_process at ffffffffa01a2f53 [sunrpc]
> #16 [ffff883f62443e90] nfsd at ffffffffa057a98f [nfsd]
> #17 [ffff883f62443ec0] kthread at ffffffff81096919
> #18 [ffff883f62443f50] ret_from_fork at ffffffff8165f3a2
>
> This is doing SEQ/PUTFH/REMOVE:
>
> crash> bt
> PID: 24920 TASK: ffff881febf843d0 CPU: 32 COMMAND: "nfsd"
> #0 [ffff883f62447a00] __schedule at ffffffff8165aaf4
> #1 [ffff883f62447a50] schedule at ffffffff8165b1a7
> #2 [ffff883f62447a58] bit_wait at ffffffff8165b9bc
> #3 [ffff883f62447a70] bit_wait at ffffffff8165b9bc
> #4 [ffff883f62447a80] __wait_on_bit at ffffffff8165b645
> #5 [ffff883f62447ad0] out_of_line_wait_on_bit at ffffffff8165b6e2
> #6 [ffff883f62447b40] gfs2_glock_dq_wait at ffffffffa07ff4f3 [gfs2]
> #7 [ffff883f62447b60] gfs2_evict_inode at ffffffffa0818111 [gfs2]
> #8 [ffff883f62447bf0] evict at ffffffff811fc9eb
> #9 [ffff883f62447c20] iput at ffffffff811fd34b
> #10 [ffff883f62447c50] d_delete at ffffffff811f8c58
> #11 [ffff883f62447c80] vfs_unlink at ffffffff811ee8f9
> #12 [ffff883f62447cd0] nfsd_unlink at ffffffffa0580dcf [nfsd]
> #13 [ffff883f62447d10] nfsd4_remove at ffffffffa058debd [nfsd]
> #14 [ffff883f62447d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
> #15 [ffff883f62447db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
> #16 [ffff883f62447df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
> #17 [ffff883f62447e60] svc_process at ffffffffa01a2f53 [sunrpc]
> #18 [ffff883f62447e90] nfsd at ffffffffa057a98f [nfsd]
> #19 [ffff883f62447ec0] kthread at ffffffff81096919
> #20 [ffff883f62447f50] ret_from_fork at ffffffff8165f3a2
>
> Thanks,
>
> Andy
>
> --
> Andrew W. Elble
> aweits at discipline.rit.edu
> Infrastructure Engineer, Communications Technical Lead
> Rochester Institute of Technology
> PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912

Hi Andy,

Can you tell me how you recreated this problem? Seems like a test we
should automate and check regularly in our regression testing.

At any rate, the nfs code path is the only one that calls gfs2_ilookup
with non_block set to 0. So if we do that, we might as well get rid of
the parameter entirely. I suspect your problem goes deeper than this,
and I'd like to understand the problem in more detail.

At any rate, you're right: my latest set of patches will hopefully
eliminate the problem and allow for a smoother transition from unlinked
to deleted. If there's still a problem, I want to know about it and
recreate it as soon as possible.

Regards,

Bob Peterson
Red Hat File Systems