From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Yan, Zheng" Subject: Re: [PATCH 3/3] ceph: fix vmtruncate deadlock Date: Tue, 26 Feb 2013 19:05:45 +0800 Message-ID: <512C9709.2030200@intel.com> References: <1361498899-15831-1-git-send-email-zheng.z.yan@intel.com> <1361498899-15831-4-git-send-email-zheng.z.yan@intel.com> <51284624.9050007@intel.com> <512C1AB1.1090307@intel.com> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from mga09.intel.com ([134.134.136.24]:34667 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753754Ab3BZLFt (ORCPT ); Tue, 26 Feb 2013 06:05:49 -0500 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Sage Weil Cc: Gregory Farnum , "ceph-devel@vger.kernel.org" On 02/26/2013 01:00 PM, Sage Weil wrote: > On Tue, 26 Feb 2013, Yan, Zheng wrote: >>> It looks to me like truncates can get queued for later, so that's n= ot the case? >>> And how could the client receive a truncate while in the middle of >>> writing? Either it's got the write caps (in which case nobody else = can >>> truncate), or it shouldn't be writing...I suppose if somebody enabl= es >>> lazyIO that would do it, but again in that case I imagine we'd rath= er >>> do the truncate before writing: >> >> Commit 22cddde104 make buffered write get/put caps for individual pa= ge. So the MDS can >> revoke the write caps in the middle of write. >> >>> >>>> I think it's better to do vmtruncate after write finishes. >>> What if you're writing past the truncate point? Then your written d= ata >>> would be truncated off even though it should have been a file >>> extension. >>> >> I was wrong, I thought the order of write and truncate is not import= ant.=20 >> Now I'm worry about the correctness of commit 22cddde104, we probabl= y=20 >> should revert that commit. >=20 > I'm sorry I haven't had time to dig into this thread. Do you think w= e=20 > should revert that commit for this merge window? No, that commit fixes a i_size update bug, it is more important than th= e deadlock. But I think the eventual fix is revert that commit and also remove the = optimization that drops Fw caps early (to avoid slow cap revocation caused by balanc= e_dirty_pages) >=20 > My gut feeling is that the whole locking scheme here needs to be rewo= rked. =20 > I suspect where we'll end up is something more like XFS, where there = is an=20 > explicit truncate (or other) mutex that is separate from i_mutex. Th= e=20 > fact that we were relying on i_mutex serialization from the VFS was=20 > fragile and (as we can see) ultimately a flawed approach. And the lo= cking=20 > and races between writes, mmap, and truncation in the page cache code= are=20 > fragile and annoying. The only good thing going for us is that fsx i= s=20 > pretty good at stress testing the code. :) If we want to keep write operation atomic, write and vmtruncate need to= be protected by a mutex. I don't think introducing a new mutex makes thing= simple. >=20 >>> Oh, you're right about the locking (I think, anyway?). However, i_s= ize=20 >>> doesn't protect us ? we might for instance have truncated to zero a= nd=20 >>> then back up to the original file size. :( But that doesn't look li= ke=20 >>> it's handled correctly anyway (racy) ? am I misreading that badly o= r=20 >>> is the infrastructure in fact broken in that case? >> >> You are right. probably we can fix the race by disallowing both read= and=20 >> write when there is pending truncate. >=20 > FWIW, I think the high-level strategy before should still be basicall= y=20 > sound: we can queue a truncation without blocking, and any write path= will=20 > apply a pending truncation under the appropriate lock. But I think i= t's=20 > going to mean carefully documenting the locking requirements for the=20 > various paths and working out a new locking structure. >=20 > Yan, is this something you are able to put some time into? I would l= ike=20 > to work on it, but it's going to be hard for me to find the time in t= he=20 > next several weeks to get to it. OK=EF=BC=8C I will work on it. Regards Yan, Zheng -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html