Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Nick Piggin <npiggin@suse.de>
To: Jan Kara <jack@suse.cz>
Cc: LKML <linux-kernel@vger.kernel.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
Date: Fri, 26 Jun 2009 14:55:05 +0200	[thread overview]
Message-ID: <20090626125505.GD11450@wotan.suse.de> (raw)
In-Reply-To: <20090626122141.GB32125@duck.suse.cz>

On Fri, Jun 26, 2009 at 02:21:41PM +0200, Jan Kara wrote:
> > Now I may be speaking too soon. It might trun out that my fix is
> > complex as well, but let me just give you an RFC and we can discuss.
>   I've looked at your patch and it's definitely a worthwhile cleanup.
> But it's not quite enough for what I need. I'll try to describe the issue
> as I'm aware of it and possible solutions and maybe you'll have some idea
> about a better solution.
>   PROBLEM: We have a file 'f' of length OLD_ISIZE mmapped upto some offset
> OFF (multiple of pagesize). The problem generally happens when OLD_ISIZE <
> OFF and subsequent filesystem operation (it need not be just truncate() but
> also a plain write() creating a hole - this is the main reason why the
> patch is much more complicated than I'd like) increases file size to
> NEW_ISIZE.
>   The first decision: mmap() documentation says: "The effect of changing
> the size of the underlying file of a mapping on the pages that correspond
> to added or removed regions of the file is  unspecified." So according to
> this it would be perfectly fine if we just discarded all the data beyond
> OLD_ISIZE written via that particular mmap(). It would be even technically
> doable - vma would store minimum i_size which was ever seen at page_mkwrite
> time, page_mkwrite will allocate buffers only upto vma->i_size, we silently
> discard all unmapped dirty buffers. But I also see two problems with this:
>   1) So far we were much nicer to the user and when the file size has been
> increased, user could happily write data via old mmap upto new file size.
> I'd almost bet some people rely on this especially in the case
> truncate-down, truncate-up, write via mmap.
>   2) It's kind of fragile to discard dirty unmapped buffers without a
> warning.
> 
>   So I took the decision that data written via mmap() should be stored
> on disk properly if they were written inside i_size at the time the write
> via mmap() happened. This decision basically dictates, that you have to do
> some magic for the page containing OLD_ISIZE at the time file size is going
> to be extended. All kinds of magic I could think of required taking
> PageLock of the page containing OLD_ISIZE and that makes locking for the
> write case interesting:
>   1) write creating a hole has to update i_size inside PageLock for the
> page it writes to (to avoid races with block_write_full_page()).
>   2) we have to take PageLock for the page containing OLD_ISIZE sometime
> before i_size gets updated to do our magic -> this has to happen in
> write_begin().
> 
>   Now about the magic: There are two things I could imagine we do:
> a) when the page containing OLD_ISIZE is dirty, try to allocate blocks
> under it as needed for NEW_ISIZE - but the errors when this fails will
> return to the process doing truncate / creating hole with write which is
> kind of strange. Also we maybe allocate blocks unnecessarily because user
> may never actually write to the page containing OLD_ISIZE again...
> b) we writeout the page, writeprotect it and let page_mkwrite() do it's
> work when it's called again. This is nice from the theoretical POV but
> gets a bit messy - we have to make sure page_mkwrite() for that page
> doesn't proceed until we update the i_size. Which is why I introduced that
> inode bit-lock. We cannot use e.g. i_mutex for that because it ranks above
> mmap_sem which is held when we enter page_mkwrite.
> 
>   So if you have any idea how to better solve this, you are welcome ;).

Ah thanks, the write(2) case I missed. That does get complex to
do with the page lock.

I agree with the semantics you are aiming for, and I agree we should
not try to allocate blocks when extending i_size.

We actually could update i_size after dropping the page lock in
these paths. That would give a window where we can page_mkclean
the old partial page before the i_size update.

However this does actually require that we remove the partial-page
zeroing that writepage does. I think it does it in order to attempt
to write zeroes into the fs even if the app does mmaped writes
past i_size... but it is pretty dumb anyway really because the
behaviour is undefined anyway so there is no problem if weird
stuff gets written there (it should be zeroed out when extending
the file anyway), and also there is nothing to prevent races of
subsequent mmapped writes before the DMA completes.

But once we remove that, then we can solve it with the page lock
I think.

next prev parent reply	other threads:[~2009-06-26 12:55 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-06-15 17:59 [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
2009-06-15 17:59 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
2009-06-17 10:28   ` Nick Piggin
2009-06-17 11:49     ` Jan Kara
2009-06-15 17:59 ` [PATCH 02/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
2009-06-25 16:17   ` Nick Piggin
2009-06-25 16:43     ` Nick Piggin
2009-06-25 17:47     ` Christoph Hellwig
2009-06-26  8:42       ` Nick Piggin
2009-06-30 17:37         ` Christoph Hellwig
2009-07-02  7:22           ` Nick Piggin
2009-07-04 15:18             ` Christoph Hellwig
2009-07-06  9:08               ` Nick Piggin
2009-07-06 10:35                 ` Christoph Hellwig
2009-07-06 11:49                   ` Nick Piggin
2009-06-26 12:21     ` Jan Kara
2009-06-26 12:55       ` Nick Piggin [this message]
2009-06-26 16:08         ` Jan Kara
2009-06-29  5:54           ` Nick Piggin
2009-06-15 17:59 ` [PATCH 03/11] ext2: Allocate space for mmaped file on page fault Jan Kara
2009-06-15 17:59 ` [PATCH 04/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize Jan Kara
2009-06-15 17:59 ` [PATCH 05/11] ext3: Allocate space for mmaped file on page fault Jan Kara
2009-06-15 17:59 ` [PATCH 06/11] vfs: Implement generic per-cpu counters for delayed allocation Jan Kara
2009-06-15 17:59 ` [PATCH 07/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
2009-06-17 10:35   ` Nick Piggin
2009-06-17 12:05     ` Jan Kara
2009-06-17 13:53       ` Nick Piggin
2009-06-18 12:00         ` Theodore Tso
2009-06-18 11:51   ` OGAWA Hirofumi
2009-06-15 17:59 ` [PATCH 08/11] fs: Don't clear dirty bits in block_write_full_page() Jan Kara
2009-06-15 17:59 ` [PATCH 09/11] vfs: Export wakeup_pdflush Jan Kara
2009-06-15 17:59 ` [PATCH 10/11] ext3: Implement delayed allocation on page_mkwrite time Jan Kara
2009-06-15 18:02 ` [PATCH 0/10] Fix page_mkwrite() for blocksize < pagesize (version 3) Jan Kara
2009-06-15 18:17 ` Aneesh Kumar K.V
2009-06-16 10:28   ` Jan Kara
2009-06-16 14:34 ` Christoph Hellwig
2009-06-16 14:42   ` Jan Kara
2009-06-30 17:44     ` Christoph Hellwig
2009-07-01 10:29       ` Aneesh Kumar K.V

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20090626125505.GD11450@wotan.suse.de \
    --to=npiggin@suse.de \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).