From mboxrd@z Thu Jan 1 00:00:00 1970 From: Anand Avati Subject: Re: [fuse-devel] fuse_lowlevel_notify_inval_inode deadlock Date: Thu, 18 Apr 2013 12:18:17 -0700 Message-ID: <517046F9.4060307@redhat.com> References: <878vqpg7zq.fsf@tucsk.pomaz.szeredi.hu> <846B62AA-126D-41D7-9507-F44213AF07B9@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Return-path: Received: from mx1.redhat.com ([209.132.183.28]:21576 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752975Ab3DRT1p (ORCPT ); Thu, 18 Apr 2013 15:27:45 -0400 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Sage Weil Cc: Anand Avati , "fuse-devel@lists.sourceforge.net" , greg.farnum@inktank.com, ceph-devel@vger.kernel.org On 04/18/2013 12:12 PM, Sage Weil wrote: > On Thu, 18 Apr 2013, Anand Avati wrote: >> On Apr 18, 2013, at 10:05 AM, Sage Weil wrote: >>> On Wed, 17 Apr 2013, Anand Avati wrote: >>>> On Wed, Apr 17, 2013 at 9:45 PM, Sage Weil wrote: >>>> >>>>> On Wed, 17 Apr 2013, Anand Avati wrote: >>>>>> On Wed, Apr 17, 2013 at 5:43 PM, Sage Weil wrote: >>>>>> We've hit a new deadlock with fuse_lowlevel_notify_inval_inode, >>>>>> this time >>>>>> on the read side: >>>>>> >>>>>> - ceph-fuse queues an invalidate (in a separate thread) >>>>>> - kernel initiates a read >>>>>> - invalidate blocks in kernel, waiting on a page lock >>>>>> - the read blocks in ceph-fuse >>>>>> >>>>>> Now, assuming we're reading the stack traces properly, this is >>>>>> more or >>>>>> less what we see with writes, except with reads, and the obvious >>>>>> "don't >>>>>> block the read" would resolve it. >>>>>> >>>>>> But! If that is the only way to avoid deadlock, I'm afraid it >>>>>> is >>>>>> difficult to implement reliable cache invalidation at all. The >>>>>> reason we >>>>>> are invalidating is because the server told us to: we are no >>>>>> longer >>>>>> allowed to do reads and cached data is invalid. The obvious >>>>>> approach is >>>>>> to >>>>>> >>>>>> 1- stop processing new reads >>>>>> 2- let in-progress reads complete >>>>>> 3- invalidate the cache >>>>>> 4- ack to server >>>>>> >>>>>> ...but that will deadlock as above, as any new read will lock >>>>>> pages before >>>>>> blcoking. If we don't block, then the read may repopulate pages >>>>>> we just >>>>>> invalidated. We could >>>>>> >>>>>> 1- invalidate >>>>>> 2- if any reads happened while we were invalidating, goto 1 >>>>>> 3- ack >>>>>> >>>>>> but then we risk starvation and livelock. >>>>>> >>>>>> How do other people solve this problem? It seems like another >>>>>> upcall that >>>>>> would let you block new reads (and/or writes) from starting >>>>>> while the >>>>>> invalidate is in progress would do the trick, but I'm not >>>>>> convinced I'm >>>>>> not missing something much simpler. >>>>>> >>>>>> >>>>>> Do you really need to call fuse_lowlevel_notify_inval_inode() while still >>>>>> holding the mutex in cfuse? It should be sufficient if you - >>>>>> >>>>>> 0 - Receive inval request from server >>>>>> 1 - mutex_lock() in cfuse >>>>>> 2 - invalidate cfuse cache >>>>>> 3 - mutex_unlock() in cfuse >>>>>> 4 - fuse_lowlevel_notify_inval_inode() >>>>>> 5 - ack to server >>>>>> >>>>>> The only necessary ordering seems to be 0->[2,4]->5. Placing 4 within the >>>>>> mutex boundaries looks unnecessary and self-imposed. In-progress reads >>>>> which >>>>>> took the page lock before fuse_lowlevel_notify_inval_inode() would either >>>>>> read data cached in cfuse (in case they reached the cache before 1), or >>>>> get >>>>>> sent over to server as though data was never cached. There wouldn't be a >>>>>> livelock either. Did I miss something? >>>>> >>>>> It's the concurrent reads I'm concerned about: >>>>> >>>>> 3.5 - read(2) is called, locks some pages, and sends a message through the >>>>> fuse connection >>>>> >>>>> 3.9 or 4.1 - ceph-fuse gets the reads request. It can either handle it, >>>>> repopulating a region of the page cache it possibly just partially >>>>> invalidated (rendering the invalidate a failure), >>>> >>>> I think the problem lies here, that handling the read before 5 is returning >>>> old data (which effectively renders the invalidation a failure). Step 0 >>>> needs to guarantee that new data about to be written is already staged in >>>> the server and made available for read. However the write request itself >>>> needs to be blocked from completing till step 5 from all other clients >>>> completes. >>> >>> It sounds like you're thinking of a weaker consistency model. The mds is >>> telling us to stop reads and invalidate our cache *before* anybody is >>> allowed to write. At the end of this process, we ack, we should have an >>> empty page cache for this file and any reads should be blocked until >>> further notice. >> >> Yes, the consistency model I was talking about is weaker than "block new reads, purge all cache, wait until further notified". If you are striping data and if the read request spans multiple stripes, then I guess you do need a stricter version. >> >>>>> or block, possibly preventing the invalidate from ever completing. >>>>> >>>> >>>> Hmm, where would it block? Din't we mutex_unlock() in 3 already? and the >>>> server should be prepared to serve staged data even before Step 0. >>>> Invalidating might be *delayed* till the in-progress reads finishes. But >>>> that only delays the completion of write(), but no deadlocks anywhere. Hope >>>> I din't miss something :-) >>> >>> You can ignore the mutex_lock stuff; I don't think it's necessary to see >>> the issue. Simply consider a racing read (that has pages locked) and >>> invalidate (that is walking through the address_space mapping locking and >>> discarding pages). We can't block the read without risking deadlock with >>> the invalidate, and we can't continue with the read without making the >>> invalidate unsuccessful/unreliable. We can actually do reads at this >>> point from the ceph client vs server perspective since we haven't acked >>> the revocation yet.. but with the fuse vs ceph-fuse interaction we are >>> choosing betweeen deadlock or potential livelock (if we do the read and >>> then try the invalidate a second time). >> >> For the consistency model you are looking at - (halt all new reads, >> clear all page cache, perform write) this sure seems to be a deadlock vs >> live-lock situation. Why isn't your consistency requirements a problem >> for the ceph in-kernel client? What extra machinery does it have which >> FUSE does not? I supposed you implement the "halt new reads" in >> f_op->[aio_]read even before taking page locks? > > Exactly. > > I think to get the proper semantics in the fuse case (and also to solve > the striping problem you describe below), there would need to be an upcall > to quiece read (and probably something similar to write, depending on how > the writeback stuff works out) before doing the invalidate. > > What does gluster do in this case? Are they using direct_io mode? Or a > different consistency model? Currently we offer only close-to-open consistency like NFSv3. For tighter requirements we honor posix locks and uncache the locked regions, with a combination of direct_io. Concurrent reader/writer of the same file/record is just not the use case gluster is focussing on. Avati