* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 15:34 UTC (permalink / raw)
To: cluster-devel.redhat.com
We've just run into a deadlock.
It seems very similar to the one referenced in commit
44ad37d69b2cc421d5b5c7ad7fed16230685b092.
Is it possible that this call in gfs2_get_dentry() (fs/gfs2/export.c, line 140)
	inode = gfs2_ilookup(sb, inum->no_addr, 0);
should be:
	inode = gfs2_ilookup(sb, inum->no_addr, 1);
?
I have a dump if more information would help.
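To make the question concrete, here is a minimal userspace sketch of the two behaviours, assuming that a nonzero non_block makes the lookup skip an inode that is being freed instead of waiting for it (the first trace below shows the waiting case). Everything prefixed toy_ is hypothetical and is not GFS2 code; the real lookup sleeps where this model returns TOY_WOULD_SLEEP.

```c
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; none of these names exist in GFS2. */
enum toy_state { TOY_LIVE, TOY_FREEING };

struct toy_inode {
    unsigned long long no_addr;
    enum toy_state state;
};

enum toy_result { TOY_FOUND, TOY_NOT_FOUND, TOY_SKIPPED, TOY_WOULD_SLEEP };

/*
 * Model of an inode-cache lookup with a non_block switch.
 * non_block == 0: a matching inode that is being freed makes the caller
 *                 wait (reported as TOY_WOULD_SLEEP rather than sleeping).
 * non_block != 0: a matching inode that is being freed is skipped at once.
 */
enum toy_result toy_ilookup(const struct toy_inode *cache, size_t n,
                            unsigned long long no_addr, int non_block,
                            const struct toy_inode **out)
{
    *out = NULL;
    for (size_t i = 0; i < n; i++) {
        if (cache[i].no_addr != no_addr)
            continue;
        if (cache[i].state == TOY_FREEING)
            return non_block ? TOY_SKIPPED : TOY_WOULD_SLEEP;
        *out = &cache[i];
        return TOY_FOUND;
    }
    return TOY_NOT_FOUND;
}
```

In this model, a caller passing 0 for an inode that another task is evicting would block until eviction finishes, while a caller passing 1 gets an immediate answer and can decide how to report the stale handle.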
Both glocks below belong to the same inode.
This is gfs2_inode->i_iopen_gh->gh_gl:
G: s:SH n:5/3157699 f:DIqob t:SH d:UN/104484397000 a:0 v:0 r:3 m:200
H: s:SH f:EH e:0 p:24919 [nfsd] gfs2_inode_lookup+0x10e/0x210 [gfs2]
This is gfs2_inode->i_gl:
G: s:EX n:2/3157699 f:yIqob t:EX d:EX/0 a:0 v:0 r:4 m:200
H: s:EX f:H e:0 p:24920 [nfsd] gfs2_evict_inode+0x124/0x400 [gfs2]
I: n:81596/51738265 t:8 f:0x00 d:0x00000000 s:500
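The two glock lines can be tied to the inode line mechanically. The sketch below assumes the dump prints the glock number in hex (n:type/number) and the inode's n: field as two decimal numbers, the second being the block address; under that assumption 0x3157699 equals 51738265, which matches the I: line. The struct and function names are made up for illustration, and the flag descriptions are an assumption based on the kernel's gfs2-glocks documentation; only a handful of flags are covered.

```c
#include <stdio.h>

struct glock_line {
    char state[3];               /* e.g. "SH", "EX" */
    unsigned int type;           /* 2 and 5 in the dump above */
    unsigned long long number;   /* assumed to be printed in hex */
    char flags[32];              /* one character per flag */
    char target[3];
};

/* Parse the leading fields of a "G:" line. Returns 0 on success, -1 otherwise. */
int parse_glock_line(const char *line, struct glock_line *g)
{
    int n = sscanf(line, "G: s:%2s n:%u/%llx f:%31s t:%2s",
                   g->state, &g->type, &g->number, g->flags, g->target);
    return n == 5 ? 0 : -1;
}

/* Parse the n: field of an "I:" line (both numbers assumed decimal). */
int parse_inode_line(const char *line, unsigned long long *formal_ino,
                     unsigned long long *no_addr)
{
    return sscanf(line, " I: n:%llu/%llu", formal_ino, no_addr) == 2 ? 0 : -1;
}

/* Describe a few flag characters; unknown characters map to "?". */
const char *glock_flag_name(char c)
{
    switch (c) {
    case 'D': return "demote";
    case 'I': return "initial";
    case 'q': return "queued";
    case 'o': return "object attached";
    case 'b': return "blocking";
    case 'y': return "dirty";
    default:  return "?";
    }
}
```

Feeding the first G: line and the I: line through these helpers yields the same block address for both, which is a quick way to confirm that two glocks in a large dump refer to one inode.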
This is doing SEQ/PUTFH/GETATTR:
crash> bt
PID: 24919 TASK: ffff881f9e11d160 CPU: 32 COMMAND: "nfsd"
#0 [ffff883f62443950] __schedule at ffffffff8165aaf4
#1 [ffff883f624439a0] schedule at ffffffff8165b1a7
#2 [ffff883f624439a8] __wait_on_freeing_inode at ffffffff811fbe1c
#3 [ffff883f62443a30] find_inode at ffffffff811fbed1
#4 [ffff883f62443a80] ilookup5_nowait at ffffffff811fbf61
#5 [ffff883f62443ab0] ilookup5 at ffffffff811fcb33
#6 [ffff883f62443ad0] gfs2_ilookup at ffffffffa080d1db [gfs2]
#7 [ffff883f62443af0] gfs2_get_dentry at ffffffffa0806a11 [gfs2]
#8 [ffff883f62443b10] gfs2_fh_to_dentry at ffffffffa0806b2c [gfs2]
#9 [ffff883f62443b30] exportfs_decode_fh at ffffffff81262ef2
#10 [ffff883f62443ca0] fh_verify at ffffffffa057e977 [nfsd]
#11 [ffff883f62443d20] nfsd4_putfh at ffffffffa058ce6d [nfsd]
#12 [ffff883f62443d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
#13 [ffff883f62443db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
#14 [ffff883f62443df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
#15 [ffff883f62443e60] svc_process at ffffffffa01a2f53 [sunrpc]
#16 [ffff883f62443e90] nfsd at ffffffffa057a98f [nfsd]
#17 [ffff883f62443ec0] kthread at ffffffff81096919
#18 [ffff883f62443f50] ret_from_fork at ffffffff8165f3a2
This is doing SEQ/PUTFH/REMOVE:
crash> bt
PID: 24920 TASK: ffff881febf843d0 CPU: 32 COMMAND: "nfsd"
#0 [ffff883f62447a00] __schedule at ffffffff8165aaf4
#1 [ffff883f62447a50] schedule at ffffffff8165b1a7
#2 [ffff883f62447a58] bit_wait at ffffffff8165b9bc
#3 [ffff883f62447a70] bit_wait at ffffffff8165b9bc
#4 [ffff883f62447a80] __wait_on_bit at ffffffff8165b645
#5 [ffff883f62447ad0] out_of_line_wait_on_bit at ffffffff8165b6e2
#6 [ffff883f62447b40] gfs2_glock_dq_wait at ffffffffa07ff4f3 [gfs2]
#7 [ffff883f62447b60] gfs2_evict_inode at ffffffffa0818111 [gfs2]
#8 [ffff883f62447bf0] evict at ffffffff811fc9eb
#9 [ffff883f62447c20] iput at ffffffff811fd34b
#10 [ffff883f62447c50] d_delete at ffffffff811f8c58
#11 [ffff883f62447c80] vfs_unlink at ffffffff811ee8f9
#12 [ffff883f62447cd0] nfsd_unlink at ffffffffa0580dcf [nfsd]
#13 [ffff883f62447d10] nfsd4_remove at ffffffffa058debd [nfsd]
#14 [ffff883f62447d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
#15 [ffff883f62447db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
#16 [ffff883f62447df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
#17 [ffff883f62447e60] svc_process at ffffffffa01a2f53 [sunrpc]
#18 [ffff883f62447e90] nfsd at ffffffffa057a98f [nfsd]
#19 [ffff883f62447ec0] kthread at ffffffff81096919
#20 [ffff883f62447f50] ret_from_fork at ffffffff8165f3a2
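Read together, the two traces suggest a wait-for cycle. This is an interpretation of the dump, not something confirmed in it: PID 24919 sleeps in __wait_on_freeing_inode on the inode that PID 24920 is evicting, while PID 24920 sleeps in gfs2_glock_dq_wait and the iopen glock above lists PID 24919 as its holder. The sketch below encodes those two assumed edges and checks for a cycle; the types and function names are hypothetical.

```c
#include <stddef.h>

struct wait_edge {
    int waiter;   /* PID that is blocked */
    int owner;    /* PID it appears to be waiting on */
};

/* Return the PID that `pid` waits on, or -1 if it has no outgoing edge. */
static int waits_on(const struct wait_edge *edges, size_t n, int pid)
{
    for (size_t i = 0; i < n; i++)
        if (edges[i].waiter == pid)
            return edges[i].owner;
    return -1;
}

/*
 * Follow wait-for edges from `start`. Returns 1 if the walk comes back to
 * `start` (a cycle), 0 otherwise. Each PID is assumed to have at most one
 * outgoing edge, so n steps are enough to decide.
 */
int in_wait_cycle(const struct wait_edge *edges, size_t n, int start)
{
    int pid = start;
    for (size_t step = 0; step < n; step++) {
        pid = waits_on(edges, n, pid);
        if (pid < 0)
            return 0;
        if (pid == start)
            return 1;
    }
    return 0;
}
```

With the edges {24919, 24920} and {24920, 24919}, the walk from either PID returns to its starting point, which is the shape a deadlock between these two nfsd threads would take.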
Thanks,
Andy
--
Andrew W. Elble
aweits at discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912
* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 16:03 UTC (permalink / raw)
To: cluster-devel.redhat.com
...I'm guessing I should be trying Bob's latest patch series.
:-)
Thanks,
Andy
* [Cluster-devel] GFS2 deadlock
From: Bob Peterson @ 2015-10-05 16:15 UTC (permalink / raw)
To: cluster-devel.redhat.com
Hi Andy,
Can you tell me how you recreated this problem? Seems like a test
we should automate and check regularly in our regression testing.
At any rate, the nfs code path is the only one that calls gfs2_ilookup
with non_block set to 0. So if we do that, we might as well get rid
of the parameter entirely. I suspect your problem goes deeper than
this, and I'd like to understand the problem in more detail.
At any rate, you're right: my latest set of patches will hopefully
eliminate the problem and allow for a smoother transition from unlinked
to deleted. If there's still a problem, I want to know about it and
recreate it as soon as possible.
Regards,
Bob Peterson
Red Hat File Systems
* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 17:10 UTC (permalink / raw)
To: cluster-devel.redhat.com
> Hi Andy,
>
> Can you tell me how you recreated this problem? Seems like a test
> we should automate and check regularly in our regression testing.
I'd love to - except the environment that generated it is somewhat
beyond my control (shared hosting for ~4000 websites).
The filename seems to indicate it was a cache file for mod_custom/Joomla?
Unfortunately, my regular packet capture of the NFS side of things was
not running when this happened. Hopefully it will be running if this
happens again.
We only ran into this after roughly a week of testing, so it might be a while...
> At any rate, the nfs code path is the only one that calls gfs2_ilookup
> with non_block set to 0. So if we do that, we might as well get rid
> of the parameter entirely. I suspect your problem goes deeper than
> this, and I'd like to understand the problem in more detail.
>
> At any rate, you're right: my latest set of patches will hopefully
> eliminate the problem and allow for a smoother transition from unlinked
> to deleted. If there's still a problem, I want to know about it and
> recreate it as soon as possible.
I've rebased your patches on 4.1.10, and we'll be staging them into the
environment here today/tomorrow. I've added a '-' flag for
GLF_INODE_DELETING in show_glock_flags() in trace_gfs2.h.
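For readers unfamiliar with how a one-character flag ends up in glock output, here is a rough userspace model of a flag-to-character table with a '-' entry added. It is only a sketch: the bit positions are invented, the toy_ names are hypothetical, and the real show_glock_flags() in trace_gfs2.h is a tracing macro rather than a function like this.

```c
#include <stddef.h>

/* Bit positions are illustrative only, not the kernel's actual values. */
enum { TOY_GLF_DEMOTE = 0, TOY_GLF_DIRTY = 1, TOY_GLF_INODE_DELETING = 2 };

struct flag_char {
    unsigned long mask;
    char c;
};

static const struct flag_char toy_flag_table[] = {
    { 1UL << TOY_GLF_DEMOTE,         'D' },
    { 1UL << TOY_GLF_DIRTY,          'y' },
    { 1UL << TOY_GLF_INODE_DELETING, '-' },  /* the newly added entry */
};

/* Write one character per set flag into buf (capacity len); returns buf. */
char *toy_show_flags(unsigned long flags, char *buf, size_t len)
{
    size_t pos = 0;
    size_t count = sizeof(toy_flag_table) / sizeof(toy_flag_table[0]);

    if (len == 0)
        return buf;
    for (size_t i = 0; i < count; i++)
        if ((flags & toy_flag_table[i].mask) && pos + 1 < len)
            buf[pos++] = toy_flag_table[i].c;
    buf[pos] = '\0';
    return buf;
}
```

A glock with the demote and deleting bits set would render as "D-" in this model, so the new state is visible at a glance in trace output.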
Thanks,
Andy