From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
lsf-pc@lists.linux-foundation.org, linux-ext4@vger.kernel.org
Subject: Re: [LSF/MM TOPIC] Mapping range locking
Date: Fri, 25 Jan 2013 11:01:44 +0800 [thread overview]
Message-ID: <20130125030144.GA11145@gmail.com> (raw)
In-Reply-To: <20130124112607.GA21818@quack.suse.cz>
[Cc linux-ext4 mailing list because we are planning to add extent-level
locking in ext4]
On Thu, Jan 24, 2013 at 12:26:07PM +0100, Jan Kara wrote:
> Hello,
>
> I'd like to discuss idea of using range locking to serialize IO to a
> range of pages. I have POC patches implementing the locking and converting
> ext3 to use it. They pass xfstests and I plan to post them once I gather
> some basic performace data (how much does the range locking cost us).
I'd be interested to discuss this topic. Currently I am working on
extent status tree for ext4 filesystem. The final goal of extent status
tree is the implementation of extent-level locking (a range locking),
which makes us be able to do parallel writes when different extents are
manipulated.
Regards,
- Zheng
>
> Now to the details of the idea. There are several different motivations for
> implementing mapping range locking:
> a) Punch hole is currently racy wrt mmap (page can be faulted in in the
> punched range after page cache has been invalidated) leading to nasty
> results as fs corruption (we can end up writing to already freed block),
> user exposure of uninitialized data, etc. To fix this we need some new
> mechanism of serializing hole punching and page faults.
> b) There is an uncomfortable number of mechanisms serializing various paths
> manipulating pagecache and data underlying it. We have i_mutex, page lock,
> checks for page beyond EOF in pagefault code, i_dio_count for direct IO.
> Different pairs of operations are serialized by different mechanisms and
> not all the cases are covered. Case (a) above is likely the worst but DIO
> vs buffered IO isn't ideal either (we provide only limited consistency).
> The range locking should somewhat simplify serialization of pagecache
> operations. So i_dio_count can be removed completely, i_mutex to certain
> extent (we still need something for things like timestamp updates,
> possibly for i_size changes although those can be dealt with I think).
> c) i_mutex doesn't allow any paralellism of operations using it and some
> filesystems workaround this for specific cases (e.g. DIO reads). Using
> range locking allows for concurrent operations (e.g. writes, DIO) on
> different parts of the file. Of course, range locking itself isn't
> enough to make the parallelism possible. Filesystems still have to
> somehow deal with the concurrency when manipulating inode allocation
> data. But the range locking at least provides a common VFS mechanism for
> serialization VFS itself needs and it's upto each filesystem to
> serialize more if it needs to.
>
> How it works:
>
> General idea is that range lock for range x-y prevents creation of pages in
> that range.
>
> In practice this means:
> All read paths adding page to page cache and grab_cache_page_write_begin()
> first take range lock for the index, then insert locked page, and finally
> unlock the range. See below on why buffered IO uses range locks on per-page
> basis.
>
> DIO gets range lock at the moment it submits bio for the range covering
> pages in the bio. Then pagecache is truncated and bio submitted. Range lock
> is unlocked once bio is completed.
>
> Punch hole for range x-y takes range lock for the range before truncating
> page cache and the lock is released after filesystem blocks for the range
> are freed.
>
> Truncate to size x is equivalent to punch hole for the range x - ~0UL.
>
> The reason why we take the range lock for buffered IO on per-page basis and
> for DIO for each bio separately is lock ordering with mmap_sem. Page faults
> need to instantiate page under mmap_sem. That establishes mmap_sem > range
> lock. Buffered IO takes mmap_sem when prefaulting pages so we cannot hold
> range lock at that moment. Similarly get_user_pages() in DIO code takes
> mmap_sem so we have be sure not to hold range lock when calling that.
>
> Honza
>
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
prev parent reply other threads:[~2013-01-25 2:47 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-01-24 11:26 [LSF/MM TOPIC] Mapping range locking Jan Kara
2013-01-25 3:01 ` Zheng Liu [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130125030144.GA11145@gmail.com \
--to=gnehzuil.liu@gmail.com \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=lsf-pc@lists.linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).