* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 15:34 UTC
  To: cluster-devel.redhat.com

We've just run into a deadlock.

It seems very similar to the one referenced in commit
44ad37d69b2cc421d5b5c7ad7fed16230685b092.

Is it possible that line 140 of fs/gfs2/export.c, in gfs2_get_dentry():

         inode = gfs2_ilookup(sb, inum->no_addr, 0);

should be:

         inode = gfs2_ilookup(sb, inum->no_addr, 1);

?

I have a dump if more information would help.

Both glock dumps below are for the same inode.
This is gfs2_inode->i_iopen_gh->gh_gl:
G:  s:SH n:5/3157699 f:DIqob t:SH d:UN/104484397000 a:0 v:0 r:3 m:200
 H: s:SH f:EH e:0 p:24919 [nfsd] gfs2_inode_lookup+0x10e/0x210 [gfs2]

This is gfs2_inode->i_gl:
G:  s:EX n:2/3157699 f:yIqob t:EX d:EX/0 a:0 v:0 r:4 m:200
 H: s:EX f:H e:0 p:24920 [nfsd] gfs2_evict_inode+0x124/0x400 [gfs2]
  I: n:81596/51738265 t:8 f:0x00 d:0x00000000 s:500

This nfsd thread is doing SEQ/PUTFH/GETATTR:

crash> bt
PID: 24919  TASK: ffff881f9e11d160  CPU: 32  COMMAND: "nfsd"
 #0 [ffff883f62443950] __schedule at ffffffff8165aaf4
 #1 [ffff883f624439a0] schedule at ffffffff8165b1a7
 #2 [ffff883f624439a8] __wait_on_freeing_inode at ffffffff811fbe1c
 #3 [ffff883f62443a30] find_inode at ffffffff811fbed1
 #4 [ffff883f62443a80] ilookup5_nowait at ffffffff811fbf61
 #5 [ffff883f62443ab0] ilookup5 at ffffffff811fcb33
 #6 [ffff883f62443ad0] gfs2_ilookup at ffffffffa080d1db [gfs2]
 #7 [ffff883f62443af0] gfs2_get_dentry at ffffffffa0806a11 [gfs2]
 #8 [ffff883f62443b10] gfs2_fh_to_dentry at ffffffffa0806b2c [gfs2]
 #9 [ffff883f62443b30] exportfs_decode_fh at ffffffff81262ef2
#10 [ffff883f62443ca0] fh_verify at ffffffffa057e977 [nfsd]
#11 [ffff883f62443d20] nfsd4_putfh at ffffffffa058ce6d [nfsd]
#12 [ffff883f62443d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
#13 [ffff883f62443db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
#14 [ffff883f62443df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
#15 [ffff883f62443e60] svc_process at ffffffffa01a2f53 [sunrpc]
#16 [ffff883f62443e90] nfsd at ffffffffa057a98f [nfsd]
#17 [ffff883f62443ec0] kthread at ffffffff81096919
#18 [ffff883f62443f50] ret_from_fork at ffffffff8165f3a2

This nfsd thread is doing SEQ/PUTFH/REMOVE:

crash> bt
PID: 24920  TASK: ffff881febf843d0  CPU: 32  COMMAND: "nfsd"
 #0 [ffff883f62447a00] __schedule at ffffffff8165aaf4
 #1 [ffff883f62447a50] schedule at ffffffff8165b1a7
 #2 [ffff883f62447a58] bit_wait at ffffffff8165b9bc
 #3 [ffff883f62447a70] bit_wait at ffffffff8165b9bc
 #4 [ffff883f62447a80] __wait_on_bit at ffffffff8165b645
 #5 [ffff883f62447ad0] out_of_line_wait_on_bit at ffffffff8165b6e2
 #6 [ffff883f62447b40] gfs2_glock_dq_wait at ffffffffa07ff4f3 [gfs2]
 #7 [ffff883f62447b60] gfs2_evict_inode at ffffffffa0818111 [gfs2]
 #8 [ffff883f62447bf0] evict at ffffffff811fc9eb
 #9 [ffff883f62447c20] iput at ffffffff811fd34b
#10 [ffff883f62447c50] d_delete at ffffffff811f8c58
#11 [ffff883f62447c80] vfs_unlink at ffffffff811ee8f9
#12 [ffff883f62447cd0] nfsd_unlink at ffffffffa0580dcf [nfsd]
#13 [ffff883f62447d10] nfsd4_remove at ffffffffa058debd [nfsd]
#14 [ffff883f62447d50] nfsd4_proc_compound at ffffffffa058ed57 [nfsd]
#15 [ffff883f62447db0] nfsd_dispatch at ffffffffa057af83 [nfsd]
#16 [ffff883f62447df0] svc_process_common at ffffffffa01a2bb0 [sunrpc]
#17 [ffff883f62447e60] svc_process at ffffffffa01a2f53 [sunrpc]
#18 [ffff883f62447e90] nfsd at ffffffffa057a98f [nfsd]
#19 [ffff883f62447ec0] kthread at ffffffff81096919
#20 [ffff883f62447f50] ret_from_fork at ffffffff8165f3a2

Thanks,

Andy

-- 
Andrew W. Elble
aweits at discipline.rit.edu
Infrastructure Engineer, Communications Technical Lead
Rochester Institute of Technology
PGP: BFAD 8461 4CCF DC95 DA2C B0EB 965B 082E 863E C912




* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 16:03 UTC
  To: cluster-devel.redhat.com


...I'm guessing I should be trying Bob's latest patch series.

:-)

Thanks,

Andy





* [Cluster-devel] GFS2 deadlock
From: Bob Peterson @ 2015-10-05 16:15 UTC
  To: cluster-devel.redhat.com

----- Original Message -----
> We've just run into a deadlock.
> 
> It seems very similar to the one referenced in commit
> 44ad37d69b2cc421d5b5c7ad7fed16230685b092
> 
> is it possible that fs/gfs2/export.c:gfs2_get_dentry()
> 
> 140          inode = gfs2_ilookup(sb, inum->no_addr, 0);
> 
> should be:
> 
> 140          inode = gfs2_ilookup(sb, inum->no_addr, 1);
> 
> ?
> 
> I have a dump if more information would help.
> 
> [glock dumps, backtraces and signature snipped; see the original message]

Hi Andy,

Can you tell me how you recreated this problem? Seems like a test
we should automate and check regularly in our regression testing.

At any rate, the nfs code path is the only one that calls gfs2_ilookup
with non_block set to 0. So if we do that, we might as well get rid
of the parameter entirely. I suspect your problem goes deeper than
this, and I'd like to understand the problem in more detail.

At any rate, you're right: my latest set of patches will hopefully
eliminate the problem and allow for a smoother transition from unlinked
to deleted. If there's still a problem, I want to know about it and
recreate it as soon as possible.

Regards,

Bob Peterson
Red Hat File Systems




* [Cluster-devel] GFS2 deadlock
From: Andrew W Elble @ 2015-10-05 17:10 UTC
  To: cluster-devel.redhat.com


> Hi Andy,
>
> Can you tell me how you recreated this problem? Seems like a test
> we should automate and check regularly in our regression testing.

I'd love to, except the environment that generated it is somewhat
beyond my control (shared hosting for ~4000 websites).

The filename seems to indicate it was a cache file for mod_custom/Joomla.

Unfortunately, my regular packet capture of the NFS side of things was
not running when this happened. Hopefully it will be running if this
happens again.

We only ran into this after roughly a week of testing, so it might be a
while...

> At any rate, the nfs code path is the only one that calls gfs2_ilookup
> with non_block set to 0. So if we do that, we might as well get rid
> of the parameter entirely. I suspect your problem goes deeper than
> this, and I'd like to understand the problem in more detail.
>
> At any rate, you're right: my latest set of patches will hopefully
> eliminate the problem and allow for a smoother transition from unlinked
> to deleted. If there's still a problem, I want to know about it and
> recreate it as soon as possible.

I've rebased your patches on 4.1.10, and we'll be staging them into the
environment here today or tomorrow. I've also added a '-' flag character
for GLF_INODE_DELETING to show_glock_flags() in trace_gfs2.h.

Thanks,

Andy



