linux-ext4.vger.kernel.org archive mirror
From: Andy Lutomirski <luto@amacapital.net>
To: Dave Chinner <david@fromorbit.com>
Cc: Andi Kleen <ak@linux.intel.com>, Theodore Ts'o <tytso@mit.edu>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	xfs@oss.sgi.com, Dave Hansen <dave.hansen@intel.com>,
	Linux FS Devel <linux-fsdevel@vger.kernel.org>,
	Jan Kara <jack@suse.cz>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	Tim Chen <tim.c.chen@linux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)
Date: Thu, 15 Aug 2013 14:43:09 -0700
Message-ID: <CALCETrV7F-47_nRx1AVFqeF8sNoREutbo3kf78ddBLvKKmFCzg@mail.gmail.com>
In-Reply-To: <20130815213725.GT6023@dastard>

On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> I didn't think of that at all.
>>
>> If userspace does:
>>
>> ptr = mmap(...);
>> ptr[0] = 1;
>> sleep(1);
>> ptr[0] = 2;
>> sleep(1);
>> munmap();
>>
>> Then current kernels will mark the inode changed on (only) the ptr[0]
>> = 1 line.  My patches will instead mark the inode changed when munmap
>> is called (or after ptr[0] = 2 if writepages gets called for any
>> reason).
>>
>> I'm not sure which is better.  POSIX actually requires my behavior
>> (which is most irrelevant).
>
> Not by my reading of it. Posix states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> posix requirement....

It says "between a write reference to the mapped region and the next
call to msync()."  Most write references don't cause page faults.

>
>> My behavior also means that, if an NFS
>> client reads and caches the file between the two writes, then it will
>> eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately. So not
> informing the filesystem that the file data has been changed is
> going to cause problems.

We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification.  See
below...

>
>> The current behavior, on
>> the other hand, means that a single pass of mmapped writes through the
>> file will update the times much faster.
>>
>> I could arrange for the first page fault to *also* update times when
>> the FS is exported or if a particular mount option is set.  (The ext4
>> change to request the new behavior is all of four lines, and it's easy
>> to adjust.)
>
> What does "first page fault" mean?

The first write to the page triggers a page fault and marks the page
writable.  The second write to the page (assuming no writeback happens
in the meantime) does not trigger a page fault or notify the kernel
in any way.
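
A minimal userspace sketch of the sequence above, for anyone who wants
to watch the timestamps themselves.  The function name
mmap_write_twice(), the helper show_mtime(), and the 4096-byte mapping
length are illustrative assumptions, not part of any patch in this
thread.  What gets printed depends on the kernel and filesystem, so
no expected output is shown.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define MAP_LEN 4096 /* assumed mapping length; one page on many systems */

/* Hypothetical helper: print the file's current mtime (whole seconds). */
static void show_mtime(int fd, const char *label)
{
    struct stat st;

    if (fstat(fd, &st) == 0)
        printf("%-20s mtime = %lld\n", label, (long long)st.st_mtime);
}

/*
 * Write twice to the same page of a MAP_SHARED file mapping and print
 * the mtime after each step.  Returns byte 0 of the file as read back
 * with pread(), or -1 on error.
 */
int mmap_write_twice(const char *path)
{
    unsigned char byte = 0;
    unsigned char *ptr;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (ftruncate(fd, MAP_LEN) != 0) {
        close(fd);
        return -1;
    }

    ptr = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    show_mtime(fd, "after mmap");

    ptr[0] = 1;                 /* first write to the page: faults */
    show_mtime(fd, "after first write");

    sleep(1);

    ptr[0] = 2;                 /* same page, normally no new fault */
    show_mtime(fd, "after second write");

    if (msync(ptr, MAP_LEN, MS_SYNC) != 0) {
        munmap(ptr, MAP_LEN);
        close(fd);
        return -1;
    }
    show_mtime(fd, "after msync");

    munmap(ptr, MAP_LEN);

    if (pread(fd, &byte, 1, 0) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return byte;
}
```

Call it with a path on the filesystem under test, e.g.
mmap_write_twice("/mnt/test/demo.dat"), and compare the "after first
write" and "after second write" lines across kernels.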


In current kernels, this chain of events goes wrong:

 - Server goes down
 - Server comes up
 - Userspace on server calls mmap and writes something
 - Client reconnects and invalidates its cache
 - Userspace on server writes something else *to the same page*

The client will never notice the second write, because that write
won't update any inode state.  With my patches, the client will notice
as soon as the server starts writeback.

So I think that there are cases where my changes make things better
and cases where they make things worse.

--Andy
