linux-fsdevel.vger.kernel.org archive mirror
* [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
@ 2012-03-27 21:08 Jan Kara
  2012-03-28  2:38 ` Al Viro
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2012-03-27 21:08 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-fsdevel, linux-mm

  Hello,

  maybe the name of this topic could be "How hard should be life of
filesystems?" but that's kind of broad topic and suggests too much of
bikeshedding. I'd like to concentrate on concrete possible pain points
between filesystems & VFS (possibly writeback or even generally MM).
Lately, I myself have come across the two issues in $SUBJECT:
1) dropping of last file reference can happen from munmap() and in that
   case mmap_sem will be held when ->release() is called. Even more it
   could be held when ->evict_inode() is called to delete inode because
   inode was unlinked.
2) since flusher thread takes inode reference when writing inode out, the
   last inode reference can be dropped from flusher thread. Thus inode may
   get deleted in the flusher thread context. This does not seem that
   problematic on its own but if we realize progress of memory reclaim
   depends (at least from a longterm perspective) on flusher thread making
   progress, things start looking a bit uncertain. Even more so when we
   would like to avoid ->writepage() calls from reclaim and let flusher thread
   do the work instead. That would then require filesystems to carefully
   design their ->evict_inode() routines so that things are not
   deadlockable.

  Both these issues should be avoidable (we can postpone fput() after we
drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
flusher thread) but obviously there's some cost in the complexity of generic
layer. So the question is, is it worth it?

Certainly we can also discuss other pain points if people come with them.
We should have enough know-how in place to be able to tell which changes
are reasonably possible and which are not...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
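
The first mitigation mentioned above (postponing the final fput() until
mmap_sem has been dropped) can be sketched in userspace C. This is a minimal
model, not kernel code and not something posted in this thread: every model_*
name is hypothetical, and a plain boolean stands in for mmap_sem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model_file {
    int refcount;               /* references held on this file */
    bool released;              /* set once the release work has run */
    bool released_under_lock;   /* records the situation we want to avoid */
    struct model_file *next_deferred;
};

struct model_mm {
    bool mmap_lock_held;          /* stands in for mmap_sem */
    struct model_file *deferred;  /* files whose last put was postponed */
};

/* Stand-in for the filesystem's ->release() work. */
void model_release(struct model_mm *mm, struct model_file *f)
{
    f->released = true;
    f->released_under_lock = mm->mmap_lock_held;
}

/* Drop a reference; if it is the last one and the lock is held, postpone. */
void model_fput(struct model_mm *mm, struct model_file *f)
{
    assert(f->refcount > 0);
    if (--f->refcount > 0)
        return;
    if (mm->mmap_lock_held) {
        f->next_deferred = mm->deferred;
        mm->deferred = f;
        return;
    }
    model_release(mm, f);
}

/* Unlock first, then run the postponed release work with no lock held. */
void model_unlock_and_flush(struct model_mm *mm)
{
    mm->mmap_lock_held = false;
    while (mm->deferred) {
        struct model_file *f = mm->deferred;
        mm->deferred = f->next_deferred;
        f->next_deferred = NULL;
        model_release(mm, f);
    }
}

/* Model of munmap(): the mapping's reference is dropped under the lock. */
void model_munmap(struct model_mm *mm, struct model_file *f)
{
    mm->mmap_lock_held = true;
    model_fput(mm, f);
    model_unlock_and_flush(mm);
}
```

Calling model_munmap() on a file whose only reference belongs to the mapping
leaves released set and released_under_lock clear: the release work still
runs, but only after the lock is gone. The extra deferred list is the kind of
generic-layer complexity the message asks about.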


* Re: [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-27 21:08 [TOPIC] Last iput() from flusher thread, last fput() from munmap() Jan Kara
@ 2012-03-28  2:38 ` Al Viro
  2012-03-28  4:45   ` Dave Chinner
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Al Viro @ 2012-03-28  2:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: lsf-pc, linux-fsdevel, linux-mm

On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
>   Hello,
> 
>   maybe the name of this topic could be "How hard should be life of
> filesystems?" but that's kind of broad topic and suggests too much of
> bikeshedding. I'd like to concentrate on concrete possible pain points
> between filesystems & VFS (possibly writeback or even generally MM).
> Lately, I myself have come across the two issues in $SUBJECT:
> 1) dropping of last file reference can happen from munmap() and in that
>    case mmap_sem will be held when ->release() is called. Even more it
>    could be held when ->evict_inode() is called to delete inode because
>    inode was unlinked.

Yes, it can.

> 2) since flusher thread takes inode reference when writing inode out, the
>    last inode reference can be dropped from flusher thread. Thus inode may
>    get deleted in the flusher thread context. This does not seem that
>    problematic on its own but if we realize progress of memory reclaim
>    depends (at least from a longterm perspective) on flusher thread making
>    progress, things start looking a bit uncertain. Even more so when we
>    would like to avoid ->writepage() calls from reclaim and let flusher thread
>    do the work instead. That would then require filesystems to carefully
>    design their ->evict_inode() routines so that things are not
>    deadlockable.

You mean "use GFP_NOIO for allocations when holding fs-internal locks"?

>   Both these issues should be avoidable (we can postpone fput() after we
> drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> flusher thread) but obviously there's some cost in the complexity of generic
> layer. So the question is, is it worth it?

I don't think it is.  ->i_mutex in ->release() is never needed; existing
cases are racy and dropping preallocation that way is simply wrong.  And
->evict_inode() is a non-issue, since it has no reason whatsoever to take
*any* locks in mutex - the damn thing is called when nobody has references
to struct inode anymore.  Deadlocks with flusher... that's what NOIO and
NOFS are for.

As for the IMA issues...  We probably ought to use a separate mutex for
xattr and relying on ->i_mutex for its internal locking is unconvincing,
to put it mildly...
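
The allocation-context rule referred to above can be illustrated with a small
userspace model. It assumes a simplified picture in which reclaim calls back
into the filesystem only when the allocation allows it. The MODEL_* flags are
hypothetical stand-ins; they do not have the values or the full semantics of
the kernel's GFP_KERNEL, GFP_NOFS and GFP_NOIO.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model; not the kernel's GFP values. */
enum model_gfp_bits {
    MODEL_ALLOW_IO = 1 << 0,  /* reclaim may start block I/O */
    MODEL_ALLOW_FS = 1 << 1   /* reclaim may call into the filesystem */
};

#define MODEL_GFP_KERNEL (MODEL_ALLOW_IO | MODEL_ALLOW_FS)
#define MODEL_GFP_NOFS   (MODEL_ALLOW_IO)
#define MODEL_GFP_NOIO   (0)

struct model_fs {
    bool lock_held;      /* a filesystem-internal lock */
    int would_deadlock;  /* times reclaim needed a lock we already hold */
    int io_started;      /* times reclaim was allowed to start I/O */
};

/* Reclaim path entered from an allocation under memory pressure. */
void model_reclaim(struct model_fs *fs, int gfp)
{
    if (gfp & MODEL_ALLOW_FS) {
        /* Reclaim re-enters the filesystem and wants its lock. */
        if (fs->lock_held)
            fs->would_deadlock++;  /* real code would block here forever */
    }
    if (gfp & MODEL_ALLOW_IO)
        fs->io_started++;
}

/* Allocation that always hits memory pressure, to exercise reclaim. */
void model_alloc_under_pressure(struct model_fs *fs, int gfp)
{
    model_reclaim(fs, gfp);
}

/* Filesystem code allocating while it holds its own lock. */
int model_alloc_with_lock_held(int gfp)
{
    struct model_fs fs = { .lock_held = true };

    model_alloc_under_pressure(&fs, gfp);
    return fs.would_deadlock;
}
```

In this model only the MODEL_GFP_KERNEL allocation reports a would-be
deadlock; the NOFS and NOIO variants keep reclaim out of the filesystem. The
model says nothing about dependencies that are not expressed through
allocation flags.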


* Re: [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-28  2:38 ` Al Viro
@ 2012-03-28  4:45   ` Dave Chinner
  2012-03-28  9:04   ` Steven Whitehouse
  2012-03-28 12:10   ` Jan Kara
  2 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2012-03-28  4:45 UTC (permalink / raw)
  To: Al Viro; +Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm

On Wed, Mar 28, 2012 at 03:38:52AM +0100, Al Viro wrote:
> On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   maybe the name of this topic could be "How hard should be life of
> > filesystems?" but that's kind of broad topic and suggests too much of
> > bikeshedding. I'd like to concentrate on concrete possible pain points
> > between filesystems & VFS (possibly writeback or even generally MM).
> > Lately, I myself have come across the two issues in $SUBJECT:
> > 1) dropping of last file reference can happen from munmap() and in that
> >    case mmap_sem will be held when ->release() is called. Even more it
> >    could be held when ->evict_inode() is called to delete inode because
> >    inode was unlinked.
> 
> Yes, it can.
> 
> > 2) since flusher thread takes inode reference when writing inode out, the
> >    last inode reference can be dropped from flusher thread. Thus inode may
> >    get deleted in the flusher thread context. This does not seem that
> >    problematic on its own but if we realize progress of memory reclaim
> >    depends (at least from a longterm perspective) on flusher thread making
> >    progress, things start looking a bit uncertain. Even more so when we
> >    would like to avoid ->writepage() calls from reclaim and let flusher thread
> >    do the work instead. That would then require filesystems to carefully
> >    design their ->evict_inode() routines so that things are not
> >    deadlockable.
> 
> You mean "use GFP_NOIO for allocations when holding fs-internal locks"?
> 
> >   Both these issues should be avoidable (we can postpone fput() after we
> > drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> > flusher thread) but obviously there's some cost in the complexity of generic
> > layer. So the question is, is it worth it?
> 
> I don't think it is.  ->i_mutex in ->release() is never needed; existing
> cases are racy and dropping preallocation that way is simply wrong.

The alternative to using ->release is ....?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-28  2:38 ` Al Viro
  2012-03-28  4:45   ` Dave Chinner
@ 2012-03-28  9:04   ` Steven Whitehouse
  2012-03-28 11:54     ` [Lsf-pc] " Jan Kara
  2012-03-28 12:10   ` Jan Kara
  2 siblings, 1 reply; 7+ messages in thread
From: Steven Whitehouse @ 2012-03-28  9:04 UTC (permalink / raw)
  To: Al Viro; +Cc: Jan Kara, lsf-pc, linux-fsdevel, linux-mm

Hi,

On Wed, 2012-03-28 at 03:38 +0100, Al Viro wrote:
> On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   maybe the name of this topic could be "How hard should be life of
> > filesystems?" but that's kind of broad topic and suggests too much of
> > bikeshedding. I'd like to concentrate on concrete possible pain points
> > between filesystems & VFS (possibly writeback or even generally MM).
> > Lately, I myself have come across the two issues in $SUBJECT:
> > 1) dropping of last file reference can happen from munmap() and in that
> >    case mmap_sem will be held when ->release() is called. Even more it
> >    could be held when ->evict_inode() is called to delete inode because
> >    inode was unlinked.
> 
> Yes, it can.
> 
> > 2) since flusher thread takes inode reference when writing inode out, the
> >    last inode reference can be dropped from flusher thread. Thus inode may
> >    get deleted in the flusher thread context. This does not seem that
> >    problematic on its own but if we realize progress of memory reclaim
> >    depends (at least from a longterm perspective) on flusher thread making
> >    progress, things start looking a bit uncertain. Even more so when we
> >    would like to avoid ->writepage() calls from reclaim and let flusher thread
> >    do the work instead. That would then require filesystems to carefully
> >    design their ->evict_inode() routines so that things are not
> >    deadlockable.
> 
> You mean "use GFP_NOIO for allocations when holding fs-internal locks"?
> 
> >   Both these issues should be avoidable (we can postpone fput() after we
> > drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> > flusher thread) but obviously there's some cost in the complexity of generic
> > layer. So the question is, is it worth it?
> 
> I don't think it is.  ->i_mutex in ->release() is never needed; existing
> cases are racy and dropping preallocation that way is simply wrong.  And
> ->evict_inode() is a non-issue, since it has no reason whatsoever to take
> *any* locks in mutex - the damn thing is called when nobody has references
> to struct inode anymore.  Deadlocks with flusher... that's what NOIO and
> NOFS are for.
> 
For cluster filesystems, we have to take locks (cluster wide) in
->evict_inode() in order to establish for certain whether we are the
last opener of the inode. Just because there are no references on the
local node, doesn't mean that a remote node doesn't hold the file open
still.

We do always use GFP_NOFS when allocating memory while holding such
locks, so I'm not quite sure from the above whether or not that will be
an issue,

Steve.
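
The "last opener" check described above can be sketched as a userspace model.
It is an assumption-heavy illustration, not GFS2 or OCFS2 code: the per-node
open counts, the boolean cluster lock and all model_* names are hypothetical,
and a real implementation would have to message other nodes over the network.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NODES 3

/* Hypothetical cluster-wide state for one unlinked inode. */
struct model_cluster_inode {
    int open_count[MODEL_NODES];  /* opens of this inode on each node */
    bool cluster_lock_held;       /* stands in for a cluster-wide lock */
    bool deallocated;             /* set when on-disk space is freed */
};

/* Only meaningful while the cluster-wide lock is held. */
bool model_open_elsewhere(const struct model_cluster_inode *ci, int self)
{
    assert(ci->cluster_lock_held);
    for (int n = 0; n < MODEL_NODES; n++) {
        if (n != self && ci->open_count[n] > 0)
            return true;
    }
    return false;
}

/* Model of eviction on node `self`: no local references remain. */
void model_evict(struct model_cluster_inode *ci, int self)
{
    assert(ci->open_count[self] == 0);

    ci->cluster_lock_held = true;
    /* Free the inode only if no other node still holds it open. */
    if (!model_open_elsewhere(ci, self))
        ci->deallocated = true;
    ci->cluster_lock_held = false;
}
```

With one remote opener still present, model_evict() on the local node leaves
the inode allocated; once the remote count drops to zero, eviction on that
node frees it. The point of the model is that the decision cannot be made from
local reference counts alone.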


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [Lsf-pc] [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-28  9:04   ` Steven Whitehouse
@ 2012-03-28 11:54     ` Jan Kara
  2012-03-28 14:07       ` Steven Whitehouse
  0 siblings, 1 reply; 7+ messages in thread
From: Jan Kara @ 2012-03-28 11:54 UTC (permalink / raw)
  To: Steven Whitehouse; +Cc: Al Viro, linux-fsdevel, linux-mm, lsf-pc, Jan Kara

  Hi,

On Wed 28-03-12 10:04:15, Steven Whitehouse wrote:
> On Wed, 2012-03-28 at 03:38 +0100, Al Viro wrote:
> > On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
> > >   Hello,
> > > 
> > >   maybe the name of this topic could be "How hard should be life of
> > > filesystems?" but that's kind of broad topic and suggests too much of
> > > bikeshedding. I'd like to concentrate on concrete possible pain points
> > > between filesystems & VFS (possibly writeback or even generally MM).
> > > > Lately, I myself have come across the two issues in $SUBJECT:
> > > 1) dropping of last file reference can happen from munmap() and in that
> > >    case mmap_sem will be held when ->release() is called. Even more it
> > >    could be held when ->evict_inode() is called to delete inode because
> > >    inode was unlinked.
> > 
> > Yes, it can.
> > 
> > > 2) since flusher thread takes inode reference when writing inode out, the
> > >    last inode reference can be dropped from flusher thread. Thus inode may
> > >    get deleted in the flusher thread context. This does not seem that
> > >    problematic on its own but if we realize progress of memory reclaim
> > >    depends (at least from a longterm perspective) on flusher thread making
> > >    progress, things start looking a bit uncertain. Even more so when we
> > > >    would like to avoid ->writepage() calls from reclaim and let flusher thread
> > >    do the work instead. That would then require filesystems to carefully
> > >    design their ->evict_inode() routines so that things are not
> > >    deadlockable.
> > 
> > You mean "use GFP_NOIO for allocations when holding fs-internal locks"?
> > 
> > >   Both these issues should be avoidable (we can postpone fput() after we
> > > drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> > > flusher thread) but obviously there's some cost in the complexity of generic
> > > layer. So the question is, is it worth it?
> > 
> > > I don't think it is.  ->i_mutex in ->release() is never needed; existing
> > cases are racy and dropping preallocation that way is simply wrong.  And
> > ->evict_inode() is a non-issue, since it has no reason whatsoever to take
> > *any* locks in mutex - the damn thing is called when nobody has references
> > to struct inode anymore.  Deadlocks with flusher... that's what NOIO and
> > NOFS are for.
> > 
> For cluster filesystems, we have to take locks (cluster wide) in
> ->evict_inode() in order to establish for certain whether we are the
> last opener of the inode. Just because there are no references on the
> local node, doesn't mean that a remote node doesn't hold the file open
> still.
> 
> We do always use GFP_NOFS when allocating memory while holding such
> locks, so I'm not quite sure from the above whether or not that will be
> an issue,
  Yeah, but you have to use networking to communicate with other nodes
about locks and this creates another interesting dependency.

Currently, everything seems to work out just fine and I don't say I know
about a particular deadlock. I just say that the dependencies are so
complex that I don't know whether things will work OK e.g. if we change
page reclaim to offload more to flusher thread. And that's what I feel
uneasy about.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [Lsf-pc] [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-28  2:38 ` Al Viro
  2012-03-28  4:45   ` Dave Chinner
  2012-03-28  9:04   ` Steven Whitehouse
@ 2012-03-28 12:10   ` Jan Kara
  2 siblings, 0 replies; 7+ messages in thread
From: Jan Kara @ 2012-03-28 12:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Jan Kara, linux-fsdevel, linux-mm, lsf-pc

On Wed 28-03-12 03:38:52, Al Viro wrote:
> On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
> >   Hello,
> > 
> >   maybe the name of this topic could be "How hard should be life of
> > filesystems?" but that's kind of broad topic and suggests too much of
> > bikeshedding. I'd like to concentrate on concrete possible pain points
> > between filesystems & VFS (possibly writeback or even generally MM).
> > Lately, I myself have come across the two issues in $SUBJECT:
> > 1) dropping of last file reference can happen from munmap() and in that
> >    case mmap_sem will be held when ->release() is called. Even more it
> >    could be held when ->evict_inode() is called to delete inode because
> >    inode was unlinked.
> 
> Yes, it can.
> 
> > 2) since flusher thread takes inode reference when writing inode out, the
> >    last inode reference can be dropped from flusher thread. Thus inode may
> >    get deleted in the flusher thread context. This does not seem that
> >    problematic on its own but if we realize progress of memory reclaim
> >    depends (at least from a longterm perspective) on flusher thread making
> >    progress, things start looking a bit uncertain. Even more so when we
> >    would like to avoid ->writepage() calls from reclaim and let flusher thread
> >    do the work instead. That would then require filesystems to carefully
> >    design their ->evict_inode() routines so that things are not
> >    deadlockable.
> 
> You mean "use GFP_NOIO for allocations when holding fs-internal locks"?
  Well, but in ->evict_inode filesystem isn't necessarily holding any
internal locks it knows about. So it should be perfectly fine doing
GFP_KERNEL allocation. But if ->evict_inode is called from flusher thread
and we do GFP_KERNEL allocation, things start to be a bit uncertain IMHO.

> >   Both these issues should be avoidable (we can postpone fput() after we
> > drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> > flusher thread) but obviously there's some cost in the complexity of generic
> > layer. So the question is, is it worth it?
> 
> I don't think it is.  ->i_mutex in ->release() is never needed; existing
> cases are racy and dropping preallocation that way is simply wrong.
  Yes. And my point really is, if fs developers get this often wrong,
shouldn't we change the interface so that it's harder to get it wrong? In
this particular case it shouldn't be a big burden on VFS.

> And ->evict_inode() is a non-issue, since it has no reason whatsoever to
> take *any* locks in mutex - the damn thing is called when nobody has
> references to struct inode anymore.
  As Steven pointed out, at least clustered filesystems need to do complex
synchronization in ->evict_inode(). I think OCFS2 offloads some of this
stuff to separate kernel thread to avoid deadlocks (at least the obvious
ones which you can hit during testing / which lockdep can catch).

> Deadlocks with flusher... that's what NOIO and NOFS are for.
> 
> As for the IMA issues...  We probably ought to use a separate mutex for
> xattr and relying on ->i_mutex for its internal locking is unconvincing,
> to put it mildly...
  Agreed.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
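
The offloading idea mentioned above can be sketched as a hand-off: the flusher
never performs the final iput() itself but passes its last reference to a
separate worker. This is a single-threaded userspace model under stated
assumptions, not OCFS2's implementation and not a proposed VFS change; all
model_* names are hypothetical and the "worker" is just a function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: none of these names are kernel APIs. */
struct model_inode {
    int refcount;
    bool evicted;             /* set once eviction work has run */
    bool evicted_in_flusher;  /* records the situation we want to avoid */
    struct model_inode *next;
};

struct model_worker {
    struct model_inode *pending;  /* inodes whose last put was handed off */
};

/* Stand-in for the filesystem's ->evict_inode() work. */
void model_evict_inode(struct model_inode *inode, bool in_flusher)
{
    inode->evicted = true;
    inode->evicted_in_flusher = in_flusher;
}

/* Put used by the flusher: never evicts here, hands the last ref off. */
void model_iput_from_flusher(struct model_worker *w, struct model_inode *inode)
{
    assert(inode->refcount > 0);
    if (inode->refcount > 1) {
        inode->refcount--;
        return;
    }
    /* Keep the last reference alive and transfer it to the worker. */
    inode->next = w->pending;
    w->pending = inode;
}

/* Worker context: drops the transferred references and evicts. */
void model_worker_run(struct model_worker *w)
{
    while (w->pending) {
        struct model_inode *inode = w->pending;

        w->pending = inode->next;
        inode->next = NULL;
        if (--inode->refcount == 0)
            model_evict_inode(inode, false);
    }
}
```

After model_iput_from_flusher() on an inode with a single reference, the inode
is still alive and queued; eviction happens only in model_worker_run(), so
flusher progress in this model does not depend on the eviction work.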


* Re: [Lsf-pc] [TOPIC] Last iput() from flusher thread, last fput() from munmap()...
  2012-03-28 11:54     ` [Lsf-pc] " Jan Kara
@ 2012-03-28 14:07       ` Steven Whitehouse
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Whitehouse @ 2012-03-28 14:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: Al Viro, linux-fsdevel, linux-mm, lsf-pc

Hi,

On Wed, 2012-03-28 at 13:54 +0200, Jan Kara wrote:
> Hi,
> 
> On Wed 28-03-12 10:04:15, Steven Whitehouse wrote:
> > On Wed, 2012-03-28 at 03:38 +0100, Al Viro wrote:
> > > On Tue, Mar 27, 2012 at 11:08:58PM +0200, Jan Kara wrote:
> > > >   Hello,
> > > > 
> > > >   maybe the name of this topic could be "How hard should be life of
> > > > filesystems?" but that's kind of broad topic and suggests too much of
> > > > bikeshedding. I'd like to concentrate on concrete possible pain points
> > > > between filesystems & VFS (possibly writeback or even generally MM).
> > > > > Lately, I myself have come across the two issues in $SUBJECT:
> > > > 1) dropping of last file reference can happen from munmap() and in that
> > > >    case mmap_sem will be held when ->release() is called. Even more it
> > > >    could be held when ->evict_inode() is called to delete inode because
> > > >    inode was unlinked.
> > > 
> > > Yes, it can.
> > > 
> > > > 2) since flusher thread takes inode reference when writing inode out, the
> > > >    last inode reference can be dropped from flusher thread. Thus inode may
> > > >    get deleted in the flusher thread context. This does not seem that
> > > >    problematic on its own but if we realize progress of memory reclaim
> > > >    depends (at least from a longterm perspective) on flusher thread making
> > > >    progress, things start looking a bit uncertain. Even more so when we
> > > > >    would like to avoid ->writepage() calls from reclaim and let flusher thread
> > > >    do the work instead. That would then require filesystems to carefully
> > > >    design their ->evict_inode() routines so that things are not
> > > >    deadlockable.
> > > 
> > > You mean "use GFP_NOIO for allocations when holding fs-internal locks"?
> > > 
> > > >   Both these issues should be avoidable (we can postpone fput() after we
> > > > drop mmap_sem; we can tweak inode refcounting to avoid last iput() from
> > > > flusher thread) but obviously there's some cost in the complexity of generic
> > > > layer. So the question is, is it worth it?
> > > 
> > > > I don't think it is.  ->i_mutex in ->release() is never needed; existing
> > > cases are racy and dropping preallocation that way is simply wrong.  And
> > > ->evict_inode() is a non-issue, since it has no reason whatsoever to take
> > > *any* locks in mutex - the damn thing is called when nobody has references
> > > to struct inode anymore.  Deadlocks with flusher... that's what NOIO and
> > > NOFS are for.
> > > 
> > For cluster filesystems, we have to take locks (cluster wide) in
> > ->evict_inode() in order to establish for certain whether we are the
> > last opener of the inode. Just because there are no references on the
> > local node, doesn't mean that a remote node doesn't hold the file open
> > still.
> > 
> > We do always use GFP_NOFS when allocating memory while holding such
> > locks, so I'm not quite sure from the above whether or not that will be
> > an issue,
>   Yeah, but you have to use networking to communicate with other nodes
> about locks and this creates another interesting dependency.
> 
> Currently, everything seems to work out just fine and I don't say I know
> about a particular deadlock. I just say that the dependencies are so
> complex that I don't know whether things will work OK e.g. if we change
> page reclaim to offload more to flusher thread. And that's what I feel
> uneasy about.
> 
> 								Honza

Yes, I agree. I've certainly seen some issues with this code path in
GFS2 in the past though, so making it more robust in this way seems to
be a good plan to me,

Steve.

