Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Anand Avati <avati@redhat.com>
To: Sage Weil <sage@inktank.com>
Cc: Anand Avati <anand.avati@gmail.com>,
	"fuse-devel@lists.sourceforge.net"
	<fuse-devel@lists.sourceforge.net>,
	greg.farnum@inktank.com, ceph-devel@vger.kernel.org
Subject: Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock
Date: Thu, 18 Apr 2013 12:18:17 -0700	[thread overview]
Message-ID: <517046F9.4060307@redhat.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1304181208100.1128@cobra.newdream.net>

On 04/18/2013 12:12 PM, Sage Weil wrote:
> On Thu, 18 Apr 2013, Anand Avati wrote:
>> On Apr 18, 2013, at 10:05 AM, Sage Weil<sage@inktank.com>  wrote:
>>> On Wed, 17 Apr 2013, Anand Avati wrote:
>>>> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil<sage@inktank.com>  wrote:
>>>>
>>>>> On Wed, 17 Apr 2013, Anand Avati wrote:
>>>>>> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil<sage@inktank.com>  wrote:
>>>>>>       We've hit a new deadlock with fuse_lowlevel_notify_inval_inode,
>>>>>>       this time
>>>>>>       on the read side:
>>>>>>
>>>>>>       - ceph-fuse queues an invalidate (in a separate thread)
>>>>>>       - kernel initiates a read
>>>>>>       - invalidate blocks in kernel, waiting on a page lock
>>>>>>       - the read blocks in ceph-fuse
>>>>>>
>>>>>>       Now, assuming we're reading the stack traces properly, this is
>>>>>>       more or
>>>>>>       less what we see with writes, except with reads, and the obvious
>>>>>>       "don't
>>>>>>       block the read" would resolve it.
>>>>>>
>>>>>>       But!  If that is the only way to avoid deadlock, I'm afraid it
>>>>>>       is
>>>>>>       difficult to implement reliable cache invalidation at all.  The
>>>>>>       reason we
>>>>>>       are invalidating is because the server told us to: we are no
>>>>>>       longer
>>>>>>       allowed to do reads and cached data is invalid.  The obvious
>>>>>>       approach is
>>>>>>       to
>>>>>>
>>>>>>       1- stop processing new reads
>>>>>>       2- let in-progress reads complete
>>>>>>       3- invalidate the cache
>>>>>>       4- ack to server
>>>>>>
>>>>>>       ...but that will deadlock as above, as any new read will lock
>>>>>>       pages before
>>>>>>       blcoking.  If we don't block, then the read may repopulate pages
>>>>>>       we just
>>>>>>       invalidated.  We could
>>>>>>
>>>>>>       1- invalidate
>>>>>>       2- if any reads happened while we were invalidating, goto 1
>>>>>>       3- ack
>>>>>>
>>>>>>       but then we risk starvation and livelock.
>>>>>>
>>>>>>       How do other people solve this problem?  It seems like another
>>>>>>       upcall that
>>>>>>       would let you block new reads (and/or writes) from starting
>>>>>>       while the
>>>>>>       invalidate is in progress would do the trick, but I'm not
>>>>>>       convinced I'm
>>>>>>       not missing something much simpler.
>>>>>>
>>>>>>
>>>>>> Do you really need to call fuse_lowlevel_notify_inval_inode() while still
>>>>>> holding the mutex in cfuse? It should be sufficient if you -
>>>>>>
>>>>>> 0 - Receive inval request from server
>>>>>> 1 - mutex_lock() in cfuse
>>>>>> 2 - invalidate cfuse cache
>>>>>> 3 - mutex_unlock() in cfuse
>>>>>> 4 - fuse_lowlevel_notify_inval_inode()
>>>>>> 5 - ack to server
>>>>>>
>>>>>> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the
>>>>>> mutex boundaries looks unnecessary and self-imposed. In-progress reads
>>>>> which
>>>>>> took the page lock before fuse_lowlevel_notify_inval_inode() would either
>>>>>> read data cached in cfuse (in case they reached the cache before 1), or
>>>>> get
>>>>>> sent over to server as though data was never cached. There wouldn't be a
>>>>>> livelock either. Did I miss something?
>>>>>
>>>>> It's the concurrent reads I'm concerned about:
>>>>>
>>>>> 3.5 - read(2) is called, locks some pages, and sends a message through the
>>>>> fuse connection
>>>>>
>>>>> 3.9 or 4.1 - ceph-fuse gets the reads request.  It can either handle it,
>>>>> repopulating a region of the page cache it possibly just partially
>>>>> invalidated (rendering the invalidate a failure),
>>>>
>>>> I think the problem lies here, that handling the read before 5 is returning
>>>> old data (which effectively renders the invalidation a failure). Step 0
>>>> needs to guarantee that new data about to be written is already staged in
>>>> the server and made available for read. However the write request itself
>>>> needs to be blocked from completing till step 5 from all other clients
>>>> completes.
>>>
>>> It sounds like you're thinking of a weaker consistency model.  The mds is
>>> telling us to stop reads and invalidate our cache *before* anybody is
>>> allowed to write. At the end of this process, we ack, we should have an
>>> empty page cache for this file and any reads should be blocked until
>>> further notice.
>>
>> Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version.
>>
>>>>> or block, possibly preventing the invalidate from ever completing.
>>>>>
>>>>
>>>> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the
>>>> server should be prepared to serve staged data even before Step 0.
>>>> Invalidating might be *delayed* till the in-progress reads finishes. But
>>>> that only delays the completion of write(), but no deadlocks anywhere. Hope
>>>> I din't miss something :-)
>>>
>>> You can ignore the mutex_lock stuff; I don't think it's necessary to see
>>> the issue.  Simply consider a racing read (that has pages locked) and
>>> invalidate (that is walking through the address_space mapping locking and
>>> discarding pages).  We can't block the read without risking deadlock with
>>> the invalidate, and we can't continue with the read without making the
>>> invalidate unsuccessful/unreliable.  We can actually do reads at this
>>> point from the ceph client vs server perspective since we haven't acked
>>> the revocation yet.. but with the fuse vs ceph-fuse interaction we are
>>> choosing betweeen deadlock or potential livelock (if we do the read and
>>> then try the invalidate a second time).
>>
>> For the consistency model you are looking at - (halt all new reads,
>> clear all page cache, perform write) this sure seems to be a deadlock vs
>> live-lock situation.  Why isn't your consistency requirements a problem
>> for the ceph in-kernel client? What extra machinery does it have which
>> FUSE does not? I supposed you implement the "halt new reads" in
>> f_op->[aio_]read even before taking page locks?
>
> Exactly.
>
> I think to get the proper semantics in the fuse case (and also to solve
> the striping problem you describe below), there would need to be an upcall
> to quiece read (and probably something similar to write, depending on how
> the writeback stuff works out) before doing the invalidate.
>
> What does gluster do in this case?  Are they using direct_io mode?  Or a
> different consistency model?


Currently we offer only close-to-open consistency like NFSv3. For 
tighter requirements we honor posix locks and uncache the locked 
regions, with a combination of direct_io. Concurrent reader/writer of 
the same file/record is just not the use case gluster is focussing on.

Avati

     prev parent reply	other threads:[~2013-04-18 19:27 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <Pine.LNX.4.64.1108171208490.11873@cobra.newdream.net>
     [not found] ` <878vqpg7zq.fsf@tucsk.pomaz.szeredi.hu>
2011-08-24 19:33   ` [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock Sage Weil
2013-04-18  0:43     ` Sage Weil
     [not found]       ` <alpine.DEB.2.00.1304171730380.23728-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-04-18  3:33         ` Anand Avati
2013-04-18  4:45           ` [fuse-devel] " Sage Weil
     [not found]             ` <alpine.DEB.2.00.1304172140060.626-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
2013-04-18  6:26               ` Anand Avati
2013-04-18 17:05                 ` [fuse-devel] " Sage Weil
2013-04-18 18:46                   ` Anand Avati
2013-04-18 19:12                     ` Sage Weil
2013-04-18 19:18                       ` Anand Avati [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=517046F9.4060307@redhat.com \
    --to=avati@redhat.com \
    --cc=anand.avati@gmail.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=fuse-devel@lists.sourceforge.net \
    --cc=greg.farnum@inktank.com \
    --cc=sage@inktank.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.