From: "J. Bruce Fields" <bfields@fieldses.org>
To: Dave Chinner <david@fromorbit.com>
Cc: "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
Theodore Ts'o <tytso@mit.edu>,
Dave Hansen <dave.hansen@linux.intel.com>,
LKML <linux-kernel@vger.kernel.org>,
xfs@oss.sgi.com, Dave Hansen <dave.hansen@intel.com>,
Andi Kleen <ak@linux.intel.com>,
Linux FS Devel <linux-fsdevel@vger.kernel.org>,
Jan Kara <jack@suse.cz>, Andy Lutomirski <luto@amacapital.net>,
Tim Chen <tim.c.chen@linux.intel.com>
Subject: Re: page fault scalability (ext3, ext4, xfs)
Date: Mon, 19 Aug 2013 18:17:16 -0400
Message-ID: <20130819221716.GA17869@fieldses.org>
In-Reply-To: <20130815060149.GP6023@dastard>

On Thu, Aug 15, 2013 at 04:01:49PM +1000, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> > > cost of the unwritten->written conversion.
> > >> >
> > >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> > this part until writeback?
> > >>
> > >> Part of the work has to be done at write time because we need to
> > >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> problems). The unwritten->written conversion does happen at writeback
> > >> (as does the actual block allocation if we are doing delayed
> > >> allocation).
> > >>
> > >> The point is that if the goal is to measure page fault scalability, we
> > >> shouldn't have this other stuff happening at the same time as the page
> > >> fault workload.
> > >
> > > Sure, but the real problem is not the block mapping or allocation
> > > path - even if the test is changed to take that out of the picture,
> > > we still have timestamp updates being done on every single page
> > > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > > and have nanosecond granularity, so every page fault is resulting in
> > > a transaction to update the timestamp of the file being modified.
> >
> > I have (unmergeable) patches to fix this:
> >
> > http://comments.gmane.org/gmane.linux.kernel.mm/92476
>
> The big problem with this approach is that not doing the
> timestamp update on page faults is going to break the inode change
> version counting because for ext4, btrfs and XFS it takes a
> transaction to bump that counter. NFS needs to know the moment a
> file is changed in memory, not when it is written to disk.

I don't think the in-memory updates of the data and the version have to
be completely atomic, if that's what you mean.

> Also, NFS
> requires the change to the counter to be persistent over server
> failures, so it needs to be changed as part of a transaction....

I'm not sure those two updates have to be a single atomic transaction on
disk, either.

(Though the reboot cases are more complicated; I may not have thought
them through.)

(By the way, I wonder what happens if we reuse a change attribute value
after a crash? There's probably a (hard to hit) bug there.)
--b.
>
> IOWs, fixing the "filesystems need a transaction on each page_mkwrite
> call" problem isn't as simple as changing how timestamps are
> updated.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com