public inbox for linux-ext4@vger.kernel.org
 help / color / mirror / Atom feed
From: Goswin von Brederlow <goswin-v-b@web.de>
To: Jan Kara <jack@suse.cz>
Cc: Goswin von Brederlow <goswin-v-b@web.de>,
	Pavel Machek <pavel@ucw.cz>, LKML <linux-kernel@vger.kernel.org>,
	npiggin@suse.de, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize
Date: Mon, 01 Jun 2009 17:35:49 +0200	[thread overview]
Message-ID: <87zlcsdl0q.fsf@frosties.localdomain> (raw)
In-Reply-To: <20090601150218.GI14373@duck.suse.cz> (Jan Kara's message of "Mon, 1 Jun 2009 17:02:18 +0200")

Jan Kara <jack@suse.cz> writes:

> On Mon 01-06-09 16:46:28, Goswin von Brederlow wrote:
>> Jan Kara <jack@suse.cz> writes:
>> > On Mon 01-06-09 13:33:08, Goswin von Brederlow wrote:
>> >> Jan Kara <jack@suse.cz> writes:
>> >> 
>> >> > On Sat 30-05-09 13:23:24, Pavel Machek wrote:
>> >> >> Hi!
>> >> >> 
>> >> >> > On filesystems where blocksize < pagesize the situation is more complicated.
>> >> >> > Think for example that blocksize = 1024, pagesize = 4096 and a process does:
>> >> >> >   ftruncate(fd, 0);
>> >> >> >   pwrite(fd, buf, 1024, 0);
>> >> >> >   map = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);
>> >> >> >   map[0] = 'a';  ----> page_mkwrite() for index 0 is called
>> >> >> >   ftruncate(fd, 10000); /* or even pwrite(fd, buf, 1, 10000) */
>> >> >> >   fsync(fd); ----> writepage() for index 0 is called
>> >> >> > 
>> >> >> > At the moment page_mkwrite() is called, filesystem can allocate only one block
>> >> >> > for the page because i_size == 1024. Otherwise it would create blocks beyond
>> >> >> > i_size which is generally undesirable. But later at writepage() time, we would
>> >> >> > like to have blocks allocated for the whole page (and in principle we have to
>> >> >> > allocate them because user could have filled the page with data after the
>> >> >> > second ftruncate()). This patch introduces a framework which allows filesystems
>> >> >> > to handle this with a reasonable effort.
>> >> >> 
>> >> >> What happens when you do above sequence on today's kernels? Oops? 3000
>> >> >> bytes of random junk in file? ...?
>> >> >   Depends on the filesystem. For example on ext4, you'll see a WARN_ON and the data
>> >> > won't be written. Some filesystems may just try to map blocks and possibly
>> >> > hit deadlock or something like that. Filesystems like ext2 / ext3 /
>> >> > reiserfs generally don't care because so far they allocate blocks on writepage
>> >> > time (which has the problem that you can write data via mmap and kernel
>> >> > will later discard them because it hits ENOSPC or quota limit). That's
>> >> > actually what I was trying to fix originally.
>> >> >
>> >> > 										Honza
>> >> 
>> >> man mmap:
>> >>        A file is mapped in multiples of the page size.  For a file that is not
>> >>        a  multiple  of  the  page  size,  the  remaining memory is zeroed when
>> >>        mapped, and writes to that region are not written out to the file.  The
>> >>        effect  of changing the size of the underlying file of a mapping on the
>> >>        pages that correspond to added  or  removed  regions  of  the  file  is
>> >>        unspecified.
>> >> 
>> >> Whatever happens happens. The above code is just wrong, as in
>> >> unspecified behaviour.
>> >> What happens if you ftruncate() before mmap()?
>> >   OK, I admit I didn't realize mmap() has so weak requirements. Doing mmap
>> > after ftruncate() should work fine because before you write via that new
>> > mmap page_mkwrite() will be called anyway.
>> 
>> But the ftruncate would only allocate a block at position 10000. The
>> file still has a big hole from 1024-4095.
>   ftruncate() actually allocates no blocks. It just updates file size (at
> least for most filesystems). The hole is created as you write.
>
>> >   So what we could alternatively do is that we just discard dirty bits from
>> > buffers that don't have underlying blocks allocated. That would satisfy the
>> > specification as well. But I have to say I'm a bit afraid of discarding
>> > dirty bits like that. Also we'd have to handle the case where someone does
>> > mremap() after ftruncate().
>> >   What other memory management people think?
>> 
>> As said above the file still has a big hole after ftruncate. So not
>> having underlying blocks allocated can't be the deciding factor.
>   I'm not sure I understand here. Do you mean that we should not decide
> about discarding dirty bits depending on whether the buffers have
> underlying blocks or not? In my opinion that should be correct option
> because from what the man page says, user is not guaranteed what happens
> when the file size is extended to 10000 and he tries to write from offset
> 1024 (old i_size) further... Anyway, I'm not defending doing that :-) I'm
> just trying to understand what you mean.

I mean that the file can lack underlying blocks below its old
i_size. So that can't be the deciding factor.

>> If possible I would make ftruncate() after mmap() work just like
>> ftruncate() before mmap(). That is write any dirty page completly up
>> to the current filesize. Allocate disk blocks for the file as needed
>> (you need to do that anyway). Wouldn't it be more work to remember the
>   This is the thing my patches try to achieve :). At truncate time (or
> generally just before i_size is extended), we check the last page and
> propagate dirty bits to buffers inside old i_size (and clear the ones
> beyond old i_size). We also writeprotect the page so that if someone tries
> to write to it via mmap in future, we get page fault, page_mkwrite() is
> called and a filesystem allocates all blocks it needs with the new i_size.

Sounds right.

>> filesize at the time of the mmap() to limit updates to that than using
>> the current file size?
>   Yes, that would be more work. But I never intended to do this...

MfG
        Goswin

  reply	other threads:[~2009-06-01 15:35 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-05-27 13:00 [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize Jan Kara
2009-05-27 13:00 ` [PATCH 01/11] ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() Jan Kara
2009-05-27 13:00 ` [PATCH 02/11] ext4: Get rid of extend_disksize parameter of ext4_get_blocks_handle() Jan Kara
2009-05-27 13:01 ` [PATCH 03/11] vfs: Add better VFS support for page_mkwrite when blocksize < pagesize Jan Kara
2009-05-27 16:00   ` Josef Bacik
2009-05-27 16:45     ` Aneesh Kumar K.V
2009-05-27 17:06       ` Josef Bacik
2009-05-28 13:03   ` Josef Bacik
2009-05-28 13:10     ` Aneesh Kumar K.V
2009-05-30 11:23   ` Pavel Machek
2009-06-01  9:44     ` Jan Kara
2009-06-01 11:33       ` Goswin von Brederlow
2009-06-01 14:00         ` Jan Kara
2009-06-01 14:46           ` Goswin von Brederlow
2009-06-01 15:02             ` Jan Kara
2009-06-01 15:35               ` Goswin von Brederlow [this message]
2009-05-27 13:01 ` [PATCH 04/11] ext2: Allocate space for mmaped file on page fault Jan Kara
2009-05-27 13:01 ` [PATCH 05/11] ext4: Make sure blocks are properly allocated under mmaped page even when blocksize < pagesize Jan Kara
2009-05-27 14:30   ` Theodore Tso
2009-05-27 14:52     ` Jan Kara
2009-06-04 14:09   ` Theodore Tso
2009-05-27 13:01 ` [PATCH 06/11] ext3: Allocate space for mmaped file on page fault Jan Kara
2009-05-27 13:01 ` [PATCH 07/11] vfs: Implement generic per-cpu counters for delayed allocation Jan Kara
2009-05-27 13:01 ` [PATCH 08/11] vfs: Unmap underlying metadata of new data buffers only when buffer is mapped Jan Kara
2009-05-27 15:35   ` Aneesh Kumar K.V
2009-05-28  9:44     ` Jan Kara
2009-05-28 10:15       ` Aneesh Kumar K.V
2009-05-28 13:50         ` Jan Kara
2009-05-27 13:01 ` [PATCH 09/11] fs: Don't clear dirty bits in block_write_full_page() Jan Kara
2009-05-27 13:01 ` [PATCH 10/11] vfs: Export wakeup_pdflush Jan Kara
2009-05-27 13:01 ` [PATCH 11/11] ext3: Implement delayed allocation on page_mkwrite time Jan Kara
2009-05-27 14:23 ` [PATCH 0/11] Fix page_mkwrite() for blocksize < pagesize Theodore Tso
2009-05-27 14:59   ` Jan Kara
2009-06-04 17:11     ` Theodore Tso
2009-06-05 23:23       ` Jan Kara
2009-05-27 15:33 ` Aneesh Kumar K.V
2009-05-28  9:36   ` Jan Kara

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87zlcsdl0q.fsf@frosties.localdomain \
    --to=goswin-v-b@web.de \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=npiggin@suse.de \
    --cc=pavel@ucw.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox