Re: [fuse-devel] fuse_lowlevel_notify_inval

All of lore.kernel.org
 help / color / mirror / Atom feed

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
       [not found] ` <878vqpg7zq.fsf@tucsk.pomaz.szeredi.hu>
@ 2011-08-24 19:33   ` Sage Weil
  2013-04-18  0:43     ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2011-08-24 19:33 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: fuse-devel, ceph-devel

On Fri, 19 Aug 2011, Miklos Szeredi wrote:
> Sage Weil <sage@newdream.net> writes:
> 
> > Hi,
> >
> > I just tried using fuse_lowlevel_notify_inval_inode for the ceph fuse 
> > client and ran into a deadlock.  This translates into a call to 
> > invalidate_inode_pages2(), which will lock each page in the address_space.  
> > I end up with a process stuck on
> >
> > [<ffffffff81106a9e>] sleep_on_page+0xe/0x20
> > [<ffffffff81106a87>] __lock_page+0x67/0x70
> > [<ffffffff81114023>] invalidate_inode_pages2_range+0x373/0x390
> > [<ffffffff81260815>] fuse_reverse_inval_inode+0x75/0x90
> > [<ffffffff812589c3>] fuse_dev_do_write+0x8d3/0xae0
> > [<ffffffff81258c3c>] fuse_dev_write+0x6c/0x70
> > [<ffffffff8115d563>] do_sync_readv_writev+0xd3/0x110
> > [<ffffffff8115e3c4>] do_readv_writev+0xd4/0x1e0
> > [<ffffffff8115e518>] vfs_writev+0x48/0x60
> > [<ffffffff8115e651>] sys_writev+0x51/0xc0
> > [<ffffffff815cae02>] system_call_fastpath+0x16/0x1b
> >
> > I assume this is due to a racing write(2) or something.  Has anyone else 
> > seen this?
> 
> Fuse's write function locks the pages being written to.  So yes, doing a
> fuse_lowlevel_notify_inval_inode() on the same file from the write call
> will reliably deadlock.
> 
> > Would invalidate_mapping_pages() make more sense here?  Locked pages (due 
> > to writers) would be skipped, but that seems sane enough to me for a 
> > concurrent write(2) and invalidate callback. 
> 
> What exactly is the purpuse of invalidating the page cache in write?

I took a closer look at my logs and it looks this is what's happening:

 - cfuse: we get a server callback message, take a mutex
 - kernel/fuse: a write starts, locks pages
 - cfuse: we call fuse_lowlevel_notify_inval_inode()
 - cfuse: the write call (or something that preceeds it in the queue) 
   blocks on the mutex
 -> deadlock.. neither the write nor invalidate can complete.

So basically I can't hold any locks during the invalidate call, so that I 
can be sure that the write will complete and we don't deadlock.  That's a 
little inconvenient: I can't use the lock to order the invalidation with 
respect to any other operations (say, a subsequent read(2) that shouldn't 
see stale data) because I have no idea whether a write(2) may have been 
started on the kernel side and may be working it's way through the fuse 
channel.

On the other hand, doing invalidate_mapping_pages() means I may leave 
partially stale data in the page cache that is still marked Uptodate if 
a racing write only overwrites part of a page.

Anyway, clearly fuse is doing the right thing here.  I just need to push 
this to another thread to do it properly from my end.  That complicates 
things a bit, but it's doable.

Thanks!
sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

* fuse_lowlevel_notify_inval_inode deadlock
  2011-08-24 19:33   ` [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock Sage Weil
@ 2013-04-18  0:43     ` Sage Weil
       [not found]       ` <alpine.DEB.2.00.1304171730380.23728-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-04-18  0:43 UTC (permalink / raw)
  To: fuse-devel; +Cc: ceph-devel, greg.farnum

We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time 
on the read side:

- ceph-fuse queues an invalidate (in a separate thread)
- kernel initiates a read
- invalidate blocks in kernel, waiting on a page lock
- the read blocks in ceph-fuse

Now, assuming we're reading the stack traces properly, this is more or 
less what we see with writes, except with reads, and the obvious "don't 
block the read" would resolve it.

But!  If that is the only way to avoid deadlock, I'm afraid it is 
difficult to implement reliable cache invalidation at all.  The reason we 
are invalidating is because the server told us to: we are no longer 
allowed to do reads and cached data is invalid.  The obvious approach is 
to 

1- stop processing new reads
2- let in-progress reads complete
3- invalidate the cache
4- ack to server

...but that will deadlock as above, as any new read will lock pages before 
blcoking.  If we don't block, then the read may repopulate pages we just 
invalidated.  We could

1- invalidate
2- if any reads happened while we were invalidating, goto 1
3- ack

but then we risk starvation and livelock.

How do other people solve this problem?  It seems like another upcall that 
would let you block new reads (and/or writes) from starting while the 
invalidate is in progress would do the trick, but I'm not convinced I'm 
not missing something much simpler.

sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <alpine.DEB.2.00.1304171730380.23728-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>]

* Re: fuse_lowlevel_notify_inval_inode deadlock
       [not found]       ` <alpine.DEB.2.00.1304171730380.23728-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2013-04-18  3:33         ` Anand Avati
  2013-04-18  4:45           ` [fuse-devel] " Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Anand Avati @ 2013-04-18  3:33 UTC (permalink / raw)
  To: Sage Weil
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	greg.farnum-4GqslpFJ+cxBDgjK7y7TUQ,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, this time
> on the read side:
>
> - ceph-fuse queues an invalidate (in a separate thread)
> - kernel initiates a read
> - invalidate blocks in kernel, waiting on a page lock
> - the read blocks in ceph-fuse
>
> Now, assuming we're reading the stack traces properly, this is more or
> less what we see with writes, except with reads, and the obvious "don't
> block the read" would resolve it.
>
> But!  If that is the only way to avoid deadlock, I'm afraid it is
> difficult to implement reliable cache invalidation at all.  The reason we
> are invalidating is because the server told us to: we are no longer
> allowed to do reads and cached data is invalid.  The obvious approach is
> to
>
> 1- stop processing new reads
> 2- let in-progress reads complete
> 3- invalidate the cache
> 4- ack to server
>
> ...but that will deadlock as above, as any new read will lock pages before
> blcoking.  If we don't block, then the read may repopulate pages we just
> invalidated.  We could
>
> 1- invalidate
> 2- if any reads happened while we were invalidating, goto 1
> 3- ack
>
> but then we risk starvation and livelock.
>
> How do other people solve this problem?  It seems like another upcall that
> would let you block new reads (and/or writes) from starting while the
> invalidate is in progress would do the trick, but I'm not convinced I'm
> not missing something much simpler.
>

Do you really need to call fuse_lowlevel_notify_inval_inode() while still
holding the mutex in cfuse? It should be sufficient if you -

0 - Receive inval request from server
1 - mutex_lock() in cfuse
2 - invalidate cfuse cache
3 - mutex_unlock() in cfuse
4 - fuse_lowlevel_notify_inval_inode()
5 - ack to server

The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
mutex boundaries looks unnecessary and self-imposed. In-progress reads
which took the page lock before fuse_lowlevel_notify_inval_inode() would
either read data cached in cfuse (in case they reached the cache before 1),
or get sent over to server as though data was never cached. There wouldn't
be a livelock either. Did I miss something?

Avati
------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
  2013-04-18  3:33         ` Anand Avati
@ 2013-04-18  4:45           ` Sage Weil
       [not found]             ` <alpine.DEB.2.00.1304172140060.626-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-04-18  4:45 UTC (permalink / raw)
  To: Anand Avati; +Cc: fuse-devel@lists.sourceforge.net, greg.farnum, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3254 bytes --]

On Wed, 17 Apr 2013, Anand Avati wrote:
> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@inktank.com> wrote:
>       We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
>       this time
>       on the read side:
> 
>       - ceph-fuse queues an invalidate (in a separate thread)
>       - kernel initiates a read
>       - invalidate blocks in kernel, waiting on a page lock
>       - the read blocks in ceph-fuse
> 
>       Now, assuming we're reading the stack traces properly, this is
>       more or
>       less what we see with writes, except with reads, and the obvious
>       "don't
>       block the read" would resolve it.
> 
>       But!  If that is the only way to avoid deadlock, I'm afraid it
>       is
>       difficult to implement reliable cache invalidation at all.  The
>       reason we
>       are invalidating is because the server told us to: we are no
>       longer
>       allowed to do reads and cached data is invalid.  The obvious
>       approach is
>       to
> 
>       1- stop processing new reads
>       2- let in-progress reads complete
>       3- invalidate the cache
>       4- ack to server
> 
>       ...but that will deadlock as above, as any new read will lock
>       pages before
>       blcoking.  If we don't block, then the read may repopulate pages
>       we just
>       invalidated.  We could
> 
>       1- invalidate
>       2- if any reads happened while we were invalidating, goto 1
>       3- ack
> 
>       but then we risk starvation and livelock.
> 
>       How do other people solve this problem?  It seems like another
>       upcall that
>       would let you block new reads (and/or writes) from starting
>       while the
>       invalidate is in progress would do the trick, but I'm not
>       convinced I'm
>       not missing something much simpler.
> 
> 
> Do you really need to call fuse_lowlevel_notify_inval_inode() while still
> holding the mutex in cfuse? It should be sufficient if you -
> 
> 0 - Receive inval request from server
> 1 - mutex_lock() in cfuse
> 2 - invalidate cfuse cache
> 3 - mutex_unlock() in cfuse
> 4 - fuse_lowlevel_notify_inval_inode()
> 5 - ack to server
> 
> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
> mutex boundaries looks unnecessary and self-imposed. In-progress reads which
> took the page lock before fuse_lowlevel_notify_inval_inode() would either
> read data cached in cfuse (in case they reached the cache before 1), or get
> sent over to server as though data was never cached. There wouldn't be a
> livelock either. Did I miss something?

It's the concurrent reads I'm concerned about:

3.5 - read(2) is called, locks some pages, and sends a message through the 
fuse connection

3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it, 
repopulating a region of the page cache it possibly just partially 
invalidated (rendering the invalidate a failure), or block, possibly 
preventing the invalidate from ever completing.

4.2 - invalidate either completes (having possibly missed some just-read 
pages), or deadlocks on a locked page (depending on whether we blocked the 
read above)

sage

^ permalink raw reply	[flat|nested] 9+ messages in thread

[parent not found: <alpine.DEB.2.00.1304172140060.626-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>]

* Re: fuse_lowlevel_notify_inval_inode deadlock
       [not found]             ` <alpine.DEB.2.00.1304172140060.626-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
@ 2013-04-18  6:26               ` Anand Avati
  2013-04-18 17:05                 ` [fuse-devel] " Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Anand Avati @ 2013-04-18  6:26 UTC (permalink / raw)
  To: Sage Weil
  Cc: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org,
	greg.farnum-4GqslpFJ+cxBDgjK7y7TUQ,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:

> On Wed, 17 Apr 2013, Anand Avati wrote:
> > On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> wrote:
> >       We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
> >       this time
> >       on the read side:
> >
> >       - ceph-fuse queues an invalidate (in a separate thread)
> >       - kernel initiates a read
> >       - invalidate blocks in kernel, waiting on a page lock
> >       - the read blocks in ceph-fuse
> >
> >       Now, assuming we're reading the stack traces properly, this is
> >       more or
> >       less what we see with writes, except with reads, and the obvious
> >       "don't
> >       block the read" would resolve it.
> >
> >       But!  If that is the only way to avoid deadlock, I'm afraid it
> >       is
> >       difficult to implement reliable cache invalidation at all.  The
> >       reason we
> >       are invalidating is because the server told us to: we are no
> >       longer
> >       allowed to do reads and cached data is invalid.  The obvious
> >       approach is
> >       to
> >
> >       1- stop processing new reads
> >       2- let in-progress reads complete
> >       3- invalidate the cache
> >       4- ack to server
> >
> >       ...but that will deadlock as above, as any new read will lock
> >       pages before
> >       blcoking.  If we don't block, then the read may repopulate pages
> >       we just
> >       invalidated.  We could
> >
> >       1- invalidate
> >       2- if any reads happened while we were invalidating, goto 1
> >       3- ack
> >
> >       but then we risk starvation and livelock.
> >
> >       How do other people solve this problem?  It seems like another
> >       upcall that
> >       would let you block new reads (and/or writes) from starting
> >       while the
> >       invalidate is in progress would do the trick, but I'm not
> >       convinced I'm
> >       not missing something much simpler.
> >
> >
> > Do you really need to call fuse_lowlevel_notify_inval_inode() while still
> > holding the mutex in cfuse? It should be sufficient if you -
> >
> > 0 - Receive inval request from server
> > 1 - mutex_lock() in cfuse
> > 2 - invalidate cfuse cache
> > 3 - mutex_unlock() in cfuse
> > 4 - fuse_lowlevel_notify_inval_inode()
> > 5 - ack to server
> >
> > The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
> > mutex boundaries looks unnecessary and self-imposed. In-progress reads
> which
> > took the page lock before fuse_lowlevel_notify_inval_inode() would either
> > read data cached in cfuse (in case they reached the cache before 1), or
> get
> > sent over to server as though data was never cached. There wouldn't be a
> > livelock either. Did I miss something?
>
> It's the concurrent reads I'm concerned about:
>
> 3.5 - read(2) is called, locks some pages, and sends a message through the
> fuse connection
>
> 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
> repopulating a region of the page cache it possibly just partially
> invalidated (rendering the invalidate a failure),


I think the problem lies here, that handling the read before 5 is returning
old data (which effectively renders the invalidation a failure). Step 0
needs to guarantee that new data about to be written is already staged in
the server and made available for read. However the write request itself
needs to be blocked from completing till step 5 from all other clients
completes.


> or block, possibly
> preventing the invalidate from ever completing.
>

Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
server should be prepared to serve staged data even before Step 0.
Invalidating might be *delayed* till the in-progress reads finishes. But
that only delays the completion of write(), but no deadlocks anywhere. Hope
I din't miss something :-)


> 4.2 - invalidate either completes (having possibly missed some just-read
> pages),


Pre-staging would prevent this.


> or deadlocks on a locked page (depending on whether we blocked the
> read above)
>

Again, not sure where or why the read() is getting blocked?


Another approach to the problem - when a second client opens the file in
write mode, change this client's (and all other clients') file into
direct_io mode, bypassing the page cache (and also disable all ceph-fuse
caching). This would require a new reverse invalidation method to toggle
the inode mode as direct or buffered, and check for this inode flag (along
with file->f_flags for O_DIRECT) in fuse_file_aio_write() to take the
direct_write() code path. Instead of blocking write completion till all
other clients invalidate their cached pages, blocking a second
open-for-write till all other clients toggle their inode flag for direct_io
seems like a simpler approach to reason with. This is the approach I have
in mind for gluster (handle everything in open(), just not worth having a
complex per-IO invalidation protocol).

Avati
------------------------------------------------------------------------------
Precog is a next-generation analytics platform capable of advanced
analytics on semi-structured data. The platform includes APIs for building
apps and a phenomenal toolset for data science. Developers can use
our toolset for easy data analysis & visualization. Get a free account!
http://www2.precog.com/precogplatform/slashdotnewsletter

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
  2013-04-18  6:26               ` Anand Avati
@ 2013-04-18 17:05                 ` Sage Weil
  2013-04-18 18:46                   ` Anand Avati
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-04-18 17:05 UTC (permalink / raw)
  To: Anand Avati; +Cc: fuse-devel@lists.sourceforge.net, greg.farnum, ceph-devel

On Wed, 17 Apr 2013, Anand Avati wrote:
> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil <sage@inktank.com> wrote:
> 
> > On Wed, 17 Apr 2013, Anand Avati wrote:
> > > On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@inktank.com> wrote:
> > >       We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
> > >       this time
> > >       on the read side:
> > >
> > >       - ceph-fuse queues an invalidate (in a separate thread)
> > >       - kernel initiates a read
> > >       - invalidate blocks in kernel, waiting on a page lock
> > >       - the read blocks in ceph-fuse
> > >
> > >       Now, assuming we're reading the stack traces properly, this is
> > >       more or
> > >       less what we see with writes, except with reads, and the obvious
> > >       "don't
> > >       block the read" would resolve it.
> > >
> > >       But!  If that is the only way to avoid deadlock, I'm afraid it
> > >       is
> > >       difficult to implement reliable cache invalidation at all.  The
> > >       reason we
> > >       are invalidating is because the server told us to: we are no
> > >       longer
> > >       allowed to do reads and cached data is invalid.  The obvious
> > >       approach is
> > >       to
> > >
> > >       1- stop processing new reads
> > >       2- let in-progress reads complete
> > >       3- invalidate the cache
> > >       4- ack to server
> > >
> > >       ...but that will deadlock as above, as any new read will lock
> > >       pages before
> > >       blcoking.  If we don't block, then the read may repopulate pages
> > >       we just
> > >       invalidated.  We could
> > >
> > >       1- invalidate
> > >       2- if any reads happened while we were invalidating, goto 1
> > >       3- ack
> > >
> > >       but then we risk starvation and livelock.
> > >
> > >       How do other people solve this problem?  It seems like another
> > >       upcall that
> > >       would let you block new reads (and/or writes) from starting
> > >       while the
> > >       invalidate is in progress would do the trick, but I'm not
> > >       convinced I'm
> > >       not missing something much simpler.
> > >
> > >
> > > Do you really need to call fuse_lowlevel_notify_inval_inode() while still
> > > holding the mutex in cfuse? It should be sufficient if you -
> > >
> > > 0 - Receive inval request from server
> > > 1 - mutex_lock() in cfuse
> > > 2 - invalidate cfuse cache
> > > 3 - mutex_unlock() in cfuse
> > > 4 - fuse_lowlevel_notify_inval_inode()
> > > 5 - ack to server
> > >
> > > The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
> > > mutex boundaries looks unnecessary and self-imposed. In-progress reads
> > which
> > > took the page lock before fuse_lowlevel_notify_inval_inode() would either
> > > read data cached in cfuse (in case they reached the cache before 1), or
> > get
> > > sent over to server as though data was never cached. There wouldn't be a
> > > livelock either. Did I miss something?
> >
> > It's the concurrent reads I'm concerned about:
> >
> > 3.5 - read(2) is called, locks some pages, and sends a message through the
> > fuse connection
> >
> > 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
> > repopulating a region of the page cache it possibly just partially
> > invalidated (rendering the invalidate a failure),
> 
> I think the problem lies here, that handling the read before 5 is returning
> old data (which effectively renders the invalidation a failure). Step 0
> needs to guarantee that new data about to be written is already staged in
> the server and made available for read. However the write request itself
> needs to be blocked from completing till step 5 from all other clients
> completes.

It sounds like you're thinking of a weaker consistency model.  The mds is 
telling us to stop reads and invalidate our cache *before* anybody is 
allowed to write. At the end of this process, we ack, we should have an 
empty page cache for this file and any reads should be blocked until 
further notice.

> > or block, possibly preventing the invalidate from ever completing.
> >
> 
> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
> server should be prepared to serve staged data even before Step 0.
> Invalidating might be *delayed* till the in-progress reads finishes. But
> that only delays the completion of write(), but no deadlocks anywhere. Hope
> I din't miss something :-)

You can ignore the mutex_lock stuff; I don't think it's necessary to see 
the issue.  Simply consider a racing read (that has pages locked) and 
invalidate (that is walking through the address_space mapping locking and 
discarding pages).  We can't block the read without risking deadlock with 
the invalidate, and we can't continue with the read without making the 
invalidate unsuccessful/unreliable.  We can actually do reads at this 
point from the ceph client vs server perspective since we haven't acked 
the revocation yet.. but with the fuse vs ceph-fuse interaction we are 
choosing betweeen deadlock or potential livelock (if we do the read and 
then try the invalidate a second time).

sage


> 
> 
> > 4.2 - invalidate either completes (having possibly missed some just-read
> > pages),
> 
> 
> Pre-staging would prevent this.
> 
> 
> > or deadlocks on a locked page (depending on whether we blocked the
> > read above)
> >
> 
> Again, not sure where or why the read() is getting blocked?
> 
> 
> Another approach to the problem - when a second client opens the file in
> write mode, change this client's (and all other clients') file into
> direct_io mode, bypassing the page cache (and also disable all ceph-fuse
> caching). This would require a new reverse invalidation method to toggle
> the inode mode as direct or buffered, and check for this inode flag (along
> with file->f_flags for O_DIRECT) in fuse_file_aio_write() to take the
> direct_write() code path. Instead of blocking write completion till all
> other clients invalidate their cached pages, blocking a second
> open-for-write till all other clients toggle their inode flag for direct_io
> seems like a simpler approach to reason with. This is the approach I have
> in mind for gluster (handle everything in open(), just not worth having a
> complex per-IO invalidation protocol).
> 
> Avati
> ------------------------------------------------------------------------------
> Precog is a next-generation analytics platform capable of advanced
> analytics on semi-structured data. The platform includes APIs for building
> apps and a phenomenal toolset for data science. Developers can use
> our toolset for easy data analysis & visualization. Get a free account!
> http://www2.precog.com/precogplatform/slashdotnewsletter
> _______________________________________________
> fuse-devel mailing list
> fuse-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/fuse-devel
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
  2013-04-18 17:05                 ` [fuse-devel] " Sage Weil
@ 2013-04-18 18:46                   ` Anand Avati
  2013-04-18 19:12                     ` Sage Weil
  0 siblings, 1 reply; 9+ messages in thread
From: Anand Avati @ 2013-04-18 18:46 UTC (permalink / raw)
  To: Sage Weil
  Cc: Anand Avati, fuse-devel@lists.sourceforge.net, greg.farnum,
	ceph-devel


On Apr 18, 2013, at 10:05 AM, Sage Weil <sage@inktank.com> wrote:

> On Wed, 17 Apr 2013, Anand Avati wrote:
>> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil <sage@inktank.com> wrote:
>> 
>>> On Wed, 17 Apr 2013, Anand Avati wrote:
>>>> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@inktank.com> wrote:
>>>>      We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
>>>>      this time
>>>>      on the read side:
>>>> 
>>>>      - ceph-fuse queues an invalidate (in a separate thread)
>>>>      - kernel initiates a read
>>>>      - invalidate blocks in kernel, waiting on a page lock
>>>>      - the read blocks in ceph-fuse
>>>> 
>>>>      Now, assuming we're reading the stack traces properly, this is
>>>>      more or
>>>>      less what we see with writes, except with reads, and the obvious
>>>>      "don't
>>>>      block the read" would resolve it.
>>>> 
>>>>      But!  If that is the only way to avoid deadlock, I'm afraid it
>>>>      is
>>>>      difficult to implement reliable cache invalidation at all.  The
>>>>      reason we
>>>>      are invalidating is because the server told us to: we are no
>>>>      longer
>>>>      allowed to do reads and cached data is invalid.  The obvious
>>>>      approach is
>>>>      to
>>>> 
>>>>      1- stop processing new reads
>>>>      2- let in-progress reads complete
>>>>      3- invalidate the cache
>>>>      4- ack to server
>>>> 
>>>>      ...but that will deadlock as above, as any new read will lock
>>>>      pages before
>>>>      blcoking.  If we don't block, then the read may repopulate pages
>>>>      we just
>>>>      invalidated.  We could
>>>> 
>>>>      1- invalidate
>>>>      2- if any reads happened while we were invalidating, goto 1
>>>>      3- ack
>>>> 
>>>>      but then we risk starvation and livelock.
>>>> 
>>>>      How do other people solve this problem?  It seems like another
>>>>      upcall that
>>>>      would let you block new reads (and/or writes) from starting
>>>>      while the
>>>>      invalidate is in progress would do the trick, but I'm not
>>>>      convinced I'm
>>>>      not missing something much simpler.
>>>> 
>>>> 
>>>> Do you really need to call fuse_lowlevel_notify_inval_inode() while still
>>>> holding the mutex in cfuse? It should be sufficient if you -
>>>> 
>>>> 0 - Receive inval request from server
>>>> 1 - mutex_lock() in cfuse
>>>> 2 - invalidate cfuse cache
>>>> 3 - mutex_unlock() in cfuse
>>>> 4 - fuse_lowlevel_notify_inval_inode()
>>>> 5 - ack to server
>>>> 
>>>> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
>>>> mutex boundaries looks unnecessary and self-imposed. In-progress reads
>>> which
>>>> took the page lock before fuse_lowlevel_notify_inval_inode() would either
>>>> read data cached in cfuse (in case they reached the cache before 1), or
>>> get
>>>> sent over to server as though data was never cached. There wouldn't be a
>>>> livelock either. Did I miss something?
>>> 
>>> It's the concurrent reads I'm concerned about:
>>> 
>>> 3.5 - read(2) is called, locks some pages, and sends a message through the
>>> fuse connection
>>> 
>>> 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
>>> repopulating a region of the page cache it possibly just partially
>>> invalidated (rendering the invalidate a failure),
>> 
>> I think the problem lies here, that handling the read before 5 is returning
>> old data (which effectively renders the invalidation a failure). Step 0
>> needs to guarantee that new data about to be written is already staged in
>> the server and made available for read. However the write request itself
>> needs to be blocked from completing till step 5 from all other clients
>> completes.
> 
> It sounds like you're thinking of a weaker consistency model.  The mds is 
> telling us to stop reads and invalidate our cache *before* anybody is 
> allowed to write. At the end of this process, we ack, we should have an 
> empty page cache for this file and any reads should be blocked until 
> further notice.

Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.

>>> or block, possibly preventing the invalidate from ever completing.
>>> 
>> 
>> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
>> server should be prepared to serve staged data even before Step 0.
>> Invalidating might be *delayed* till the in-progress reads finishes. But
>> that only delays the completion of write(), but no deadlocks anywhere. Hope
>> I din't miss something :-)
> 
> You can ignore the mutex_lock stuff; I don't think it's necessary to see 
> the issue.  Simply consider a racing read (that has pages locked) and 
> invalidate (that is walking through the address_space mapping locking and 
> discarding pages).  We can't block the read without risking deadlock with 
> the invalidate, and we can't continue with the read without making the 
> invalidate unsuccessful/unreliable.  We can actually do reads at this 
> point from the ceph client vs server perspective since we haven't acked 
> the revocation yet.. but with the fuse vs ceph-fuse interaction we are 
> choosing betweeen deadlock or potential livelock (if we do the read and 
> then try the invalidate a second time).

For the consistency model you are looking at - (halt all new reads, clear all page cache, perform write) this sure seems to be a deadlock vs live-lock situation.   Why isn't your consistency requirements a problem for the ceph in-kernel client? What extra machinery does it have which FUSE does not? I supposed you implement the "halt new reads" in f_op->[aio_]read even before taking page locks?

ceph-fuse is always processing "parts" of an application read() call (sometimes the "part" could be large enough to match the full read() request, but not always) and can apply consistency enforcement to only individual parts. For e.g, a few cached pages could satisfy first half of a read request and second half might be the only part ceph-fuse even gets to witness at all. If invalidate() is called after the first half is copied from page cache to user buffer, you still have an inconsistent read. Similarly a request split up because of the fuse channel size limitation can result in the two halfs of the reads landing before and after a single write spanning the full read region.

Enabling direct_io mode bypasses the page cache and solves the "half-cache-copy" problem, but still is subject to the split-reads problem. I suppose implementing a "cork" in fuse_file_aio_read() is the only safe solution?

Avati



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
  2013-04-18 18:46                   ` Anand Avati
@ 2013-04-18 19:12                     ` Sage Weil
  2013-04-18 19:18                       ` Anand Avati
  0 siblings, 1 reply; 9+ messages in thread
From: Sage Weil @ 2013-04-18 19:12 UTC (permalink / raw)
  To: Anand Avati
  Cc: Anand Avati, fuse-devel@lists.sourceforge.net, greg.farnum,
	ceph-devel

On Thu, 18 Apr 2013, Anand Avati wrote:
> On Apr 18, 2013, at 10:05 AM, Sage Weil <sage@inktank.com> wrote:
> > On Wed, 17 Apr 2013, Anand Avati wrote:
> >> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil <sage@inktank.com> wrote:
> >> 
> >>> On Wed, 17 Apr 2013, Anand Avati wrote:
> >>>> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil <sage@inktank.com> wrote:
> >>>>      We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
> >>>>      this time
> >>>>      on the read side:
> >>>> 
> >>>>      - ceph-fuse queues an invalidate (in a separate thread)
> >>>>      - kernel initiates a read
> >>>>      - invalidate blocks in kernel, waiting on a page lock
> >>>>      - the read blocks in ceph-fuse
> >>>> 
> >>>>      Now, assuming we're reading the stack traces properly, this is
> >>>>      more or
> >>>>      less what we see with writes, except with reads, and the obvious
> >>>>      "don't
> >>>>      block the read" would resolve it.
> >>>> 
> >>>>      But!  If that is the only way to avoid deadlock, I'm afraid it
> >>>>      is
> >>>>      difficult to implement reliable cache invalidation at all.  The
> >>>>      reason we
> >>>>      are invalidating is because the server told us to: we are no
> >>>>      longer
> >>>>      allowed to do reads and cached data is invalid.  The obvious
> >>>>      approach is
> >>>>      to
> >>>> 
> >>>>      1- stop processing new reads
> >>>>      2- let in-progress reads complete
> >>>>      3- invalidate the cache
> >>>>      4- ack to server
> >>>> 
> >>>>      ...but that will deadlock as above, as any new read will lock
> >>>>      pages before
> >>>>      blcoking.  If we don't block, then the read may repopulate pages
> >>>>      we just
> >>>>      invalidated.  We could
> >>>> 
> >>>>      1- invalidate
> >>>>      2- if any reads happened while we were invalidating, goto 1
> >>>>      3- ack
> >>>> 
> >>>>      but then we risk starvation and livelock.
> >>>> 
> >>>>      How do other people solve this problem?  It seems like another
> >>>>      upcall that
> >>>>      would let you block new reads (and/or writes) from starting
> >>>>      while the
> >>>>      invalidate is in progress would do the trick, but I'm not
> >>>>      convinced I'm
> >>>>      not missing something much simpler.
> >>>> 
> >>>> 
> >>>> Do you really need to call fuse_lowlevel_notify_inval_inode() while still
> >>>> holding the mutex in cfuse? It should be sufficient if you -
> >>>> 
> >>>> 0 - Receive inval request from server
> >>>> 1 - mutex_lock() in cfuse
> >>>> 2 - invalidate cfuse cache
> >>>> 3 - mutex_unlock() in cfuse
> >>>> 4 - fuse_lowlevel_notify_inval_inode()
> >>>> 5 - ack to server
> >>>> 
> >>>> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
> >>>> mutex boundaries looks unnecessary and self-imposed. In-progress reads
> >>> which
> >>>> took the page lock before fuse_lowlevel_notify_inval_inode() would either
> >>>> read data cached in cfuse (in case they reached the cache before 1), or
> >>> get
> >>>> sent over to server as though data was never cached. There wouldn't be a
> >>>> livelock either. Did I miss something?
> >>> 
> >>> It's the concurrent reads I'm concerned about:
> >>> 
> >>> 3.5 - read(2) is called, locks some pages, and sends a message through the
> >>> fuse connection
> >>> 
> >>> 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
> >>> repopulating a region of the page cache it possibly just partially
> >>> invalidated (rendering the invalidate a failure),
> >> 
> >> I think the problem lies here, that handling the read before 5 is returning
> >> old data (which effectively renders the invalidation a failure). Step 0
> >> needs to guarantee that new data about to be written is already staged in
> >> the server and made available for read. However the write request itself
> >> needs to be blocked from completing till step 5 from all other clients
> >> completes.
> > 
> > It sounds like you're thinking of a weaker consistency model.  The mds is 
> > telling us to stop reads and invalidate our cache *before* anybody is 
> > allowed to write. At the end of this process, we ack, we should have an 
> > empty page cache for this file and any reads should be blocked until 
> > further notice.
> 
> Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.
> 
> >>> or block, possibly preventing the invalidate from ever completing.
> >>> 
> >> 
> >> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
> >> server should be prepared to serve staged data even before Step 0.
> >> Invalidating might be *delayed* till the in-progress reads finishes. But
> >> that only delays the completion of write(), but no deadlocks anywhere. Hope
> >> I din't miss something :-)
> > 
> > You can ignore the mutex_lock stuff; I don't think it's necessary to see 
> > the issue.  Simply consider a racing read (that has pages locked) and 
> > invalidate (that is walking through the address_space mapping locking and 
> > discarding pages).  We can't block the read without risking deadlock with 
> > the invalidate, and we can't continue with the read without making the 
> > invalidate unsuccessful/unreliable.  We can actually do reads at this 
> > point from the ceph client vs server perspective since we haven't acked 
> > the revocation yet.. but with the fuse vs ceph-fuse interaction we are 
> > choosing betweeen deadlock or potential livelock (if we do the read and 
> > then try the invalidate a second time).
> 
> For the consistency model you are looking at - (halt all new reads, 
> clear all page cache, perform write) this sure seems to be a deadlock vs 
> live-lock situation.  Why isn't your consistency requirements a problem 
> for the ceph in-kernel client? What extra machinery does it have which 
> FUSE does not? I supposed you implement the "halt new reads" in 
> f_op->[aio_]read even before taking page locks?

Exactly.

I think to get the proper semantics in the fuse case (and also to solve 
the striping problem you describe below), there would need to be an upcall 
to quiece read (and probably something similar to write, depending on how 
the writeback stuff works out) before doing the invalidate.

What does gluster do in this case?  Are they using direct_io mode?  Or a 
different consistency model?

sage


> ceph-fuse is always processing "parts" of an application read() call 
> (sometimes the "part" could be large enough to match the full read() 
> request, but not always) and can apply consistency enforcement to only 
> individual parts. For e.g, a few cached pages could satisfy first half 
> of a read request and second half might be the only part ceph-fuse even 
> gets to witness at all. If invalidate() is called after the first half 
> is copied from page cache to user buffer, you still have an inconsistent 
> read. Similarly a request split up because of the fuse channel size 
> limitation can result in the two halfs of the reads landing before and 
> after a single write spanning the full read region.
> 
> Enabling direct_io mode bypasses the page cache and solves the 
> "half-cache-copy" problem, but still is subject to the split-reads 
> problem. I suppose implementing a "cork" in fuse_file_aio_read() is the 
> only safe solution?
> 
> Avati
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
  2013-04-18 19:12                     ` Sage Weil
@ 2013-04-18 19:18                       ` Anand Avati
  0 siblings, 0 replies; 9+ messages in thread
From: Anand Avati @ 2013-04-18 19:18 UTC (permalink / raw)
  To: Sage Weil
  Cc: Anand Avati, fuse-devel@lists.sourceforge.net, greg.farnum,
	ceph-devel

On 04/18/2013 12:12 PM, Sage Weil wrote:
> On Thu, 18 Apr 2013, Anand Avati wrote:
>> On Apr 18, 2013, at 10:05 AM, Sage Weil<sage@inktank.com>  wrote:
>>> On Wed, 17 Apr 2013, Anand Avati wrote:
>>>> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil<sage@inktank.com>  wrote:
>>>>
>>>>> On Wed, 17 Apr 2013, Anand Avati wrote:
>>>>>> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil<sage@inktank.com>  wrote:
>>>>>>       We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
>>>>>>       this time
>>>>>>       on the read side:
>>>>>>
>>>>>>       - ceph-fuse queues an invalidate (in a separate thread)
>>>>>>       - kernel initiates a read
>>>>>>       - invalidate blocks in kernel, waiting on a page lock
>>>>>>       - the read blocks in ceph-fuse
>>>>>>
>>>>>>       Now, assuming we're reading the stack traces properly, this is
>>>>>>       more or
>>>>>>       less what we see with writes, except with reads, and the obvious
>>>>>>       "don't
>>>>>>       block the read" would resolve it.
>>>>>>
>>>>>>       But!  If that is the only way to avoid deadlock, I'm afraid it
>>>>>>       is
>>>>>>       difficult to implement reliable cache invalidation at all.  The
>>>>>>       reason we
>>>>>>       are invalidating is because the server told us to: we are no
>>>>>>       longer
>>>>>>       allowed to do reads and cached data is invalid.  The obvious
>>>>>>       approach is
>>>>>>       to
>>>>>>
>>>>>>       1- stop processing new reads
>>>>>>       2- let in-progress reads complete
>>>>>>       3- invalidate the cache
>>>>>>       4- ack to server
>>>>>>
>>>>>>       ...but that will deadlock as above, as any new read will lock
>>>>>>       pages before
>>>>>>       blcoking.  If we don't block, then the read may repopulate pages
>>>>>>       we just
>>>>>>       invalidated.  We could
>>>>>>
>>>>>>       1- invalidate
>>>>>>       2- if any reads happened while we were invalidating, goto 1
>>>>>>       3- ack
>>>>>>
>>>>>>       but then we risk starvation and livelock.
>>>>>>
>>>>>>       How do other people solve this problem?  It seems like another
>>>>>>       upcall that
>>>>>>       would let you block new reads (and/or writes) from starting
>>>>>>       while the
>>>>>>       invalidate is in progress would do the trick, but I'm not
>>>>>>       convinced I'm
>>>>>>       not missing something much simpler.
>>>>>>
>>>>>>
>>>>>> Do you really need to call fuse_lowlevel_notify_inval_inode() while still
>>>>>> holding the mutex in cfuse? It should be sufficient if you -
>>>>>>
>>>>>> 0 - Receive inval request from server
>>>>>> 1 - mutex_lock() in cfuse
>>>>>> 2 - invalidate cfuse cache
>>>>>> 3 - mutex_unlock() in cfuse
>>>>>> 4 - fuse_lowlevel_notify_inval_inode()
>>>>>> 5 - ack to server
>>>>>>
>>>>>> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
>>>>>> mutex boundaries looks unnecessary and self-imposed. In-progress reads
>>>>> which
>>>>>> took the page lock before fuse_lowlevel_notify_inval_inode() would either
>>>>>> read data cached in cfuse (in case they reached the cache before 1), or
>>>>> get
>>>>>> sent over to server as though data was never cached. There wouldn't be a
>>>>>> livelock either. Did I miss something?
>>>>>
>>>>> It's the concurrent reads I'm concerned about:
>>>>>
>>>>> 3.5 - read(2) is called, locks some pages, and sends a message through the
>>>>> fuse connection
>>>>>
>>>>> 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
>>>>> repopulating a region of the page cache it possibly just partially
>>>>> invalidated (rendering the invalidate a failure),
>>>>
>>>> I think the problem lies here, that handling the read before 5 is returning
>>>> old data (which effectively renders the invalidation a failure). Step 0
>>>> needs to guarantee that new data about to be written is already staged in
>>>> the server and made available for read. However the write request itself
>>>> needs to be blocked from completing till step 5 from all other clients
>>>> completes.
>>>
>>> It sounds like you're thinking of a weaker consistency model.  The mds is
>>> telling us to stop reads and invalidate our cache *before* anybody is
>>> allowed to write. At the end of this process, we ack, we should have an
>>> empty page cache for this file and any reads should be blocked until
>>> further notice.
>>
>> Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.
>>
>>>>> or block, possibly preventing the invalidate from ever completing.
>>>>>
>>>>
>>>> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
>>>> server should be prepared to serve staged data even before Step 0.
>>>> Invalidating might be *delayed* till the in-progress reads finishes. But
>>>> that only delays the completion of write(), but no deadlocks anywhere. Hope
>>>> I din't miss something :-)
>>>
>>> You can ignore the mutex_lock stuff; I don't think it's necessary to see
>>> the issue.  Simply consider a racing read (that has pages locked) and
>>> invalidate (that is walking through the address_space mapping locking and
>>> discarding pages).  We can't block the read without risking deadlock with
>>> the invalidate, and we can't continue with the read without making the
>>> invalidate unsuccessful/unreliable.  We can actually do reads at this
>>> point from the ceph client vs server perspective since we haven't acked
>>> the revocation yet.. but with the fuse vs ceph-fuse interaction we are
>>> choosing betweeen deadlock or potential livelock (if we do the read and
>>> then try the invalidate a second time).
>>
>> For the consistency model you are looking at - (halt all new reads,
>> clear all page cache, perform write) this sure seems to be a deadlock vs
>> live-lock situation.  Why isn't your consistency requirements a problem
>> for the ceph in-kernel client? What extra machinery does it have which
>> FUSE does not? I supposed you implement the "halt new reads" in
>> f_op->[aio_]read even before taking page locks?
>
> Exactly.
>
> I think to get the proper semantics in the fuse case (and also to solve
> the striping problem you describe below), there would need to be an upcall
> to quiece read (and probably something similar to write, depending on how
> the writeback stuff works out) before doing the invalidate.
>
> What does gluster do in this case?  Are they using direct_io mode?  Or a
> different consistency model?


Currently we offer only close-to-open consistency like NFSv3. For 
tighter requirements we honor posix locks and uncache the locked 
regions, with a combination of direct_io. Concurrent reader/writer of 
the same file/record is just not the use case gluster is focussing on.

Avati


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2013-04-18 19:27 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.64.1108171208490.11873@cobra.newdream.net>
     [not found] ` <878vqpg7zq.fsf@tucsk.pomaz.szeredi.hu>
2011-08-24 19:33   ` [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock Sage Weil
2013-04-18  0:43     ` Sage Weil
     [not found]       ` <alpine.DEB.2.00.1304171730380.23728-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-04-18  3:33         ` Anand Avati
2013-04-18  4:45           ` [fuse-devel] " Sage Weil
     [not found]             ` <alpine.DEB.2.00.1304172140060.626-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-04-18  6:26               ` Anand Avati
2013-04-18 17:05                 ` [fuse-devel] " Sage Weil
2013-04-18 18:46                   ` Anand Avati
2013-04-18 19:12                     ` Sage Weil
2013-04-18 19:18                       ` Anand Avati

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.