All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Yan, Zheng" <zheng.z.yan@intel.com>
To: Sage Weil <sage@inktank.com>
Cc: Gregory Farnum <greg@inktank.com>,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: [PATCH 3/3] ceph: fix vmtruncate deadlock
Date: Tue, 26 Feb 2013 19:05:45 +0800	[thread overview]
Message-ID: <512C9709.2030200@intel.com> (raw)
In-Reply-To: <alpine.DEB.2.00.1302252051300.6167@cobra.newdream.net>

On 02/26/2013 01:00 PM, Sage Weil wrote:
> On Tue, 26 Feb 2013, Yan, Zheng wrote:
>>> It looks to me like truncates can get queued for later, so that's not the case?
>>> And how could the client receive a truncate while in the middle of
>>> writing? Either it's got the write caps (in which case nobody else can
>>> truncate), or it shouldn't be writing...I suppose if somebody enables
>>> lazyIO that would do it, but again in that case I imagine we'd rather
>>> do the truncate before writing:
>>
>> Commit 22cddde104 make buffered write get/put caps for individual page. So the MDS can
>> revoke the write caps in the middle of write.
>>
>>>
>>>> I think it's better to do vmtruncate after write finishes.
>>> What if you're writing past the truncate point? Then your written data
>>> would be truncated off even though it should have been a file
>>> extension.
>>>
>> I was wrong, I thought the order of write and truncate is not important. 
>> Now I'm worry about the correctness of commit 22cddde104, we probably 
>> should revert that commit.
> 
> I'm sorry I haven't had time to dig into this thread.  Do you think we 
> should revert that commit for this merge window?

No, that commit fixes a i_size update bug, it is more important than the deadlock.
But I think the eventual fix is revert that commit and also remove the optimization
that drops Fw caps early (to avoid slow cap revocation caused by balance_dirty_pages)

> 
> My gut feeling is that the whole locking scheme here needs to be reworked.  
> I suspect where we'll end up is something more like XFS, where there is an 
> explicit truncate (or other) mutex that is separate from i_mutex.  The 
> fact that we were relying on i_mutex serialization from the VFS was 
> fragile and (as we can see) ultimately a flawed approach.  And the locking 
> and races between writes, mmap, and truncation in the page cache code are 
> fragile and annoying.  The only good thing going for us is that fsx is 
> pretty good at stress testing the code.  :)

If we want to keep write operation atomic, write and vmtruncate need to be
protected by a mutex. I don't think introducing a new mutex makes thing simple.

> 
>>> Oh, you're right about the locking (I think, anyway?). However, i_size 
>>> doesn't protect us ? we might for instance have truncated to zero and 
>>> then back up to the original file size. :( But that doesn't look like 
>>> it's handled correctly anyway (racy) ? am I misreading that badly or 
>>> is the infrastructure in fact broken in that case?
>>
>> You are right. probably we can fix the race by disallowing both read and 
>> write when there is pending truncate.
> 
> FWIW, I think the high-level strategy before should still be basically 
> sound: we can queue a truncation without blocking, and any write path will 
> apply a pending truncation under the appropriate lock.  But I think it's 
> going to mean carefully documenting the locking requirements for the 
> various paths and working out a new locking structure.
> 
> Yan, is this something you are able to put some time into?  I would like 
> to work on it, but it's going to be hard for me to find the time in the 
> next several weeks to get to it.

OK, I will work on it.

Regards
Yan, Zheng

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2013-02-26 11:05 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-22  2:08 [PATCH 0/3] ceph: misc fixes Yan, Zheng
2013-02-22  2:08 ` [PATCH 1/3] ceph: fix LSSNAP regression Yan, Zheng
2013-02-22  2:08 ` [PATCH 2/3] ceph: queue cap release when trimming cap Yan, Zheng
2013-02-22  2:08 ` [PATCH 3/3] ceph: fix vmtruncate deadlock Yan, Zheng
2013-02-22 18:54   ` Gregory Farnum
2013-02-23  4:31     ` Yan, Zheng
2013-02-26  0:01       ` Gregory Farnum
2013-02-26  2:15         ` Yan, Zheng
2013-02-26  5:00           ` Sage Weil
2013-02-26 11:05             ` Yan, Zheng [this message]
2013-02-26 18:19         ` Gregory Farnum

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=512C9709.2030200@intel.com \
    --to=zheng.z.yan@intel.com \
    --cc=ceph-devel@vger.kernel.org \
    --cc=greg@inktank.com \
    --cc=sage@inktank.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.