* Re: page fault scalability (ext3, ext4, xfs)
@ 2013-08-15 7:11 ` Dave Chinner
0 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2013-08-15 7:11 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Theodore Ts'o, Dave Hansen, Dave Hansen, Linux FS Devel, xfs,
linux-ext4@vger.kernel.org, Jan Kara, LKML, Tim Chen, Andi Kleen
On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> >> > > cost of the unwritten->written conversion.
> >> >> >
> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> >> > this part until writeback?
> >> >>
> >> >> Part of the work has to be done at write time because we need to
> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> >> problems). The unwritten->written conversion does happen at writeback
> >> >> (as does the actual block allocation if we are doing delayed
> >> >> allocation).
> >> >>
> >> >> The point is that if the goal is to measure page fault scalability, we
> >> >> shouldn't have this other stuff happening at the same time as the page
> >> >> fault workload.
> >> >
> >> > Sure, but the real problem is not the block mapping or allocation
> >> > path - even if the test is changed to take that out of the picture,
> >> > we still have timestamp updates being done on every single page
> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > and have nanosecond granularity, so every page fault is resulting in
> >> > a transaction to update the timestamp of the file being modified.
> >>
> >> I have (unmergeable) patches to fix this:
> >>
> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >
> > The big problem with this approach is that not doing the
> > timestamp update on page faults is going to break the inode change
> > version counting because for ext4, btrfs and XFS it takes a
> > transaction to bump that counter. NFS needs to know the moment a
> > file is changed in memory, not when it is written to disk. Also, NFS
> > requires the change to the counter to be persistent over server
> > failures, so it needs to be changed as part of a transaction....
>
> I've been running a kernel that has the file_update_time call
> commented out for over a year now, and the only problem I've seen is
> that the timestamp doesn't get updated :)
>
> I think I must be misunderstanding you (or vice versa). I'm currently
Yup, you are.
> redoing the patches, and this time I'll do it for just the mm core and
> ext4. The only change I'm proposing to ext4's page_mkwrite is to
> remove the file_update_time call.
Right. Where does that end up? All the way down in
ext4_mark_iloc_dirty(), and that does:
        if (IS_I_VERSION(inode))
                inode_inc_iversion(inode);
The XFS transaction code is the same - deep inside it where an inode
is marked as dirty in the transaction, it bumps the same counter and
adds it to the transaction.
If a filesystem is providing an i_version value, then NFS uses it to
determine whether client side caches are still consistent with the
server state. If the filesystem does not provide an i_version, then
NFS falls back to checking c/mtime for changes. If files on the
server are being modified without either the timestamps or i_version
changing, then it's likely that there will be problems with client
side cache consistency....
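As a rough illustration of that check, here is a userspace sketch in C. It is not NFS client code: struct toy_attrs and toy_cache_is_valid() are hypothetical names made up for this example, and the real revalidation logic is more involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Attributes a server might report for a file (hypothetical layout). */
struct toy_attrs {
	bool     has_change_attr;  /* filesystem provides an i_version value */
	uint64_t change_attr;      /* i_version as seen by the client */
	int64_t  ctime_ns;
	int64_t  mtime_ns;
};

/* Returns true if the client's cached copy can still be trusted. */
bool toy_cache_is_valid(const struct toy_attrs *cached,
			const struct toy_attrs *fresh)
{
	/* Preferred path: compare the change counter. */
	if (cached->has_change_attr && fresh->has_change_attr)
		return cached->change_attr == fresh->change_attr;

	/* No i_version available: fall back to comparing c/mtime. */
	return cached->ctime_ns == fresh->ctime_ns &&
	       cached->mtime_ns == fresh->mtime_ns;
}
```

In this model, a file whose contents change while neither the counter nor the timestamps move still passes the check, which is exactly the stale-cache case described above.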
> Instead, ext4 will call
> file_update_time on munmap, exit, MS_ASYNC, and at the end of
> writepages. Unless I'm missing something, there's no need to
> unconditionally start a transaction on page_mkwrite (and there had
> better not be, because file_update_time won't start a transaction if
> the time doesn't change).
Right, there's no unconditional need for a transaction except if the
filesystem is providing the inode version change feature for NFS.
ext4, btrfs and XFS all do this unconditionally, and so therefore
those filesystems have a need for an inode change transaction on
every page fault, just like they do for every write(2) call.
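To make the cost concrete, here is a toy userspace model in C of that decision. It assumes, as described above, that a transaction is needed whenever the timestamp changes or the filesystem maintains i_version. struct toy_file and toy_write_fault() are hypothetical names, not kernel interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_file {
	bool          has_i_version;  /* stands in for IS_I_VERSION(inode) */
	uint64_t      i_version;      /* change counter */
	int64_t       mtime_ns;       /* nanosecond-granularity timestamp */
	unsigned long transactions;   /* transactions started so far */
};

/* Called once per write fault (or write(2) call) in this model. */
void toy_write_fault(struct toy_file *f, int64_t now_ns)
{
	bool time_changed = (f->mtime_ns != now_ns);

	/* Without i_version, an unchanged timestamp means no transaction. */
	if (!time_changed && !f->has_i_version)
		return;

	f->mtime_ns = now_ns;
	if (f->has_i_version)
		f->i_version++;   /* the counter must move on every change */
	f->transactions++;
}
```

Calling toy_write_fault() repeatedly with the same now_ns starts one transaction in total when has_i_version is false, but one per call when it is true.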
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 7:11 ` Dave Chinner
(?)
@ 2013-08-15 7:45 ` Jan Kara
2013-08-15 21:28 ` Dave Chinner
-1 siblings, 1 reply; 67+ messages in thread
From: Jan Kara @ 2013-08-15 7:45 UTC (permalink / raw)
To: Dave Chinner
Cc: Andy Lutomirski, Theodore Ts'o, Dave Hansen, Dave Hansen,
Linux FS Devel, xfs, linux-ext4@vger.kernel.org, Jan Kara, LKML,
Tim Chen, Andi Kleen
On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> >> > > cost of the unwritten->written conversion.
> > >> >> >
> > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> >> > this part until writeback?
> > >> >>
> > >> >> Part of the work has to be done at write time because we need to
> > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> >> problems). The unwritten->written conversion does happen at writeback
> > >> >> (as does the actual block allocation if we are doing delayed
> > >> >> allocation).
> > >> >>
> > >> >> The point is that if the goal is to measure page fault scalability, we
> > >> >> shouldn't have this other stuff happening at the same time as the page
> > >> >> fault workload.
> > >> >
> > >> > Sure, but the real problem is not the block mapping or allocation
> > >> > path - even if the test is changed to take that out of the picture,
> > >> > we still have timestamp updates being done on every single page
> > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > >> > and have nanosecond granularity, so every page fault is resulting in
> > >> > a transaction to update the timestamp of the file being modified.
> > >>
> > >> I have (unmergeable) patches to fix this:
> > >>
> > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > >
> > > The big problem with this approach is that not doing the
> > > timestamp update on page faults is going to break the inode change
> > > version counting because for ext4, btrfs and XFS it takes a
> > > transaction to bump that counter. NFS needs to know the moment a
> > > file is changed in memory, not when it is written to disk. Also, NFS
> > > requires the change to the counter to be persistent over server
> > > failures, so it needs to be changed as part of a transaction....
> >
> > I've been running a kernel that has the file_update_time call
> > commented out for over a year now, and the only problem I've seen is
> > that the timestamp doesn't get updated :)
> >
> > I think I must be misunderstanding you (or vice versa). I'm currently
>
> Yup, you are.
>
> > redoing the patches, and this time I'll do it for just the mm core and
> > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> > remove the file_update_time call.
>
> Right. Where does that end up? All the way down in
> ext4_mark_iloc_dirty(), and that does:
>
> if (IS_I_VERSION(inode))
> inode_inc_iversion(inode);
>
> The XFS transaction code is the same - deep inside it where an inode
> is marked as dirty in the transaction, it bumps the same counter and
> adds it to the transaction.
Yeah, I'd just add that ext4 maintains i_version only if it has been
mounted with i_version mount option. But then NFS server would depend on
c/mtime update so it won't help you much - you still should update at least
one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
exported, you could avoid this relatively expensive dance and defer things
as Andy suggests.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 7:45 ` Jan Kara
@ 2013-08-15 21:28 ` Dave Chinner
0 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2013-08-15 21:28 UTC (permalink / raw)
To: Jan Kara
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o, Dave Hansen, LKML,
xfs, Dave Hansen, Andi Kleen, Linux FS Devel, Andy Lutomirski,
Tim Chen
On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > > >> >> > > cost of the unwritten->written conversion.
> > > >> >> >
> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > > >> >> > this part until writeback?
> > > >> >>
> > > >> >> Part of the work has to be done at write time because we need to
> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > > >> >> problems). The unwritten->written conversion does happen at writeback
> > > >> >> (as does the actual block allocation if we are doing delayed
> > > >> >> allocation).
> > > >> >>
> > > >> >> The point is that if the goal is to measure page fault scalability, we
> > > >> >> shouldn't have this other stuff happening at the same time as the page
> > > >> >> fault workload.
> > > >> >
> > > >> > Sure, but the real problem is not the block mapping or allocation
> > > >> > path - even if the test is changed to take that out of the picture,
> > > >> > we still have timestamp updates being done on every single page
> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > > >> > and have nanosecond granularity, so every page fault is resulting in
> > > >> > a transaction to update the timestamp of the file being modified.
> > > >>
> > > >> I have (unmergeable) patches to fix this:
> > > >>
> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > > >
> > > > The big problem with this approach is that not doing the
> > > > timestamp update on page faults is going to break the inode change
> > > > version counting because for ext4, btrfs and XFS it takes a
> > > > transaction to bump that counter. NFS needs to know the moment a
> > > > file is changed in memory, not when it is written to disk. Also, NFS
> > > > requires the change to the counter to be persistent over server
> > > > failures, so it needs to be changed as part of a transaction....
> > >
> > > I've been running a kernel that has the file_update_time call
> > > commented out for over a year now, and the only problem I've seen is
> > > that the timestamp doesn't get updated :)
> > >
> > > I think I must be misunderstanding you (or vice versa). I'm currently
> >
> > Yup, you are.
> >
> > > redoing the patches, and this time I'll do it for just the mm core and
> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> > > remove the file_update_time call.
> >
> > Right. Where does that end up? All the way down in
> > ext4_mark_iloc_dirty(), and that does:
> >
> > if (IS_I_VERSION(inode))
> > inode_inc_iversion(inode);
> >
> > The XFS transaction code is the same - deep inside it where an inode
> > is marked as dirty in the transaction, it bumps the same counter and
> > adds it to the transaction.
> Yeah, I'd just add that ext4 maintains i_version only if it has been
> mounted with i_version mount option. But then NFS server would depend on
> c/mtime update so it won't help you much - you still should update at least
> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
> exported, you could avoid this relatively expensive dance and defer things
> as Andy suggests.
The problem with "not exported, don't update" is that files can be
modified on server startup (e.g. after a crash) or in short
maintenance periods when the NFS service is down. When the server is
started back up, the change number needs to indicate the file has
been modified so that clients reconnecting to the server see the
change.
IOWs, even if the NFS server is not up or the filesystem not
exported we still need to update change counts whenever a file
changes if we are going to tell the NFS server that we keep them...
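A small userspace sketch in C of the failure mode, under the assumption that clients revalidate purely by comparing a persistent change count. All names here (toy_server_file, toy_local_write, toy_client_revalidates) are hypothetical and exist only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_server_file {
	uint64_t change_count;  /* persistent counter reported to NFS clients */
	char     data;          /* one byte of file content, for brevity */
};

/*
 * Local modification on the server. count_always models "bump the
 * counter on every change, exported or not"; clearing it models the
 * "not exported, don't update" shortcut.
 */
void toy_local_write(struct toy_server_file *f, char byte,
		     bool exported, bool count_always)
{
	f->data = byte;
	if (exported || count_always)
		f->change_count++;
}

/* A reconnecting client drops its cache only if the counter moved. */
bool toy_client_revalidates(uint64_t cached_count,
			    const struct toy_server_file *f)
{
	return cached_count != f->change_count;
}
```

With count_always false, a write made while exported is false leaves change_count untouched, so toy_client_revalidates() returns false and the client keeps serving the old data.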
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:28 ` Dave Chinner
(?)
@ 2013-08-15 21:31 ` Andy Lutomirski
2013-08-15 21:39 ` Dave Chinner
-1 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-15 21:31 UTC (permalink / raw)
To: Dave Chinner
Cc: Jan Kara, Theodore Ts'o, Dave Hansen, Dave Hansen,
Linux FS Devel, xfs, linux-ext4@vger.kernel.org, LKML, Tim Chen,
Andi Kleen
On Thu, Aug 15, 2013 at 2:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
>> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
>> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> > > >> >> > > cost of the unwritten->written conversion.
>> > > >> >> >
>> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> > > >> >> > this part until writeback?
>> > > >> >>
>> > > >> >> Part of the work has to be done at write time because we need to
>> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > > >> >> problems). The unwritten->written conversion does happen at writeback
>> > > >> >> (as does the actual block allocation if we are doing delayed
>> > > >> >> allocation).
>> > > >> >>
>> > > >> >> The point is that if the goal is to measure page fault scalability, we
>> > > >> >> shouldn't have this other stuff happening at the same time as the page
>> > > >> >> fault workload.
>> > > >> >
>> > > >> > Sure, but the real problem is not the block mapping or allocation
>> > > >> > path - even if the test is changed to take that out of the picture,
>> > > >> > we still have timestamp updates being done on every single page
>> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > > >> > and have nanosecond granularity, so every page fault is resulting in
>> > > >> > a transaction to update the timestamp of the file being modified.
>> > > >>
>> > > >> I have (unmergeable) patches to fix this:
>> > > >>
>> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> > > >
>> > > > The big problem with this approach is that not doing the
>> > > > timestamp update on page faults is going to break the inode change
>> > > > version counting because for ext4, btrfs and XFS it takes a
>> > > > transaction to bump that counter. NFS needs to know the moment a
>> > > > file is changed in memory, not when it is written to disk. Also, NFS
>> > > > requires the change to the counter to be persistent over server
>> > > > failures, so it needs to be changed as part of a transaction....
>> > >
>> > > I've been running a kernel that has the file_update_time call
>> > > commented out for over a year now, and the only problem I've seen is
>> > > that the timestamp doesn't get updated :)
>> > >
>> > > I think I must be misunderstanding you (or vice versa). I'm currently
>> >
>> > Yup, you are.
>> >
>> > > redoing the patches, and this time I'll do it for just the mm core and
>> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
>> > > remove the file_update_time call.
>> >
>> > Right. Where does that end up? All the way down in
>> > ext4_mark_iloc_dirty(), and that does:
>> >
>> > if (IS_I_VERSION(inode))
>> > inode_inc_iversion(inode);
>> >
>> > The XFS transaction code is the same - deep inside it where an inode
>> > is marked as dirty in the transaction, it bumps the same counter and
>> > adds it to the transaction.
>> Yeah, I'd just add that ext4 maintains i_version only if it has been
>> mounted with i_version mount option. But then NFS server would depend on
>> c/mtime update so it won't help you much - you still should update at least
>> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
>> exported, you could avoid this relatively expensive dance and defer things
>> as Andy suggests.
>
> The problem with "not exported, don't update" is that files can be
> modified on server startup (e.g. after a crash) or in short
> maintenance periods when the NFS service is down. When the server is
> started back up, the change number needs to indicate the file has
> been modified so that clients reconnecting to the server see the
> change.
>
> IOWs, even if the NFS server is not up or the filesystem not
> exported we still need to update change counts whenever a file
> changes if we are going to tell the NFS server that we keep them...
This will keep working as long as the clients are willing to wait for
writeback (or msync, munmap, or exit) on the server.
--Andy
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:31 ` Andy Lutomirski
@ 2013-08-15 21:39 ` Dave Chinner
0 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2013-08-15 21:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Jan Kara, Theodore Ts'o, Dave Hansen, Dave Hansen,
Linux FS Devel, xfs, linux-ext4@vger.kernel.org, LKML, Tim Chen,
Andi Kleen
On Thu, Aug 15, 2013 at 02:31:14PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 2:28 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Aug 15, 2013 at 09:45:31AM +0200, Jan Kara wrote:
> >> On Thu 15-08-13 17:11:42, Dave Chinner wrote:
> >> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> > > On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > > > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> > > >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > > >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> > > >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > > >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > > >> >> > > cost of the unwritten->written conversion.
> >> > > >> >> >
> >> > > >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > > >> >> > this part until writeback?
> >> > > >> >>
> >> > > >> >> Part of the work has to be done at write time because we need to
> >> > > >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> > > >> >> problems). The unwritten->written conversion does happen at writeback
> >> > > >> >> (as does the actual block allocation if we are doing delayed
> >> > > >> >> allocation).
> >> > > >> >>
> >> > > >> >> The point is that if the goal is to measure page fault scalability, we
> >> > > >> >> shouldn't have this other stuff happening at the same time as the page
> >> > > >> >> fault workload.
> >> > > >> >
> >> > > >> > Sure, but the real problem is not the block mapping or allocation
> >> > > >> > path - even if the test is changed to take that out of the picture,
> >> > > >> > we still have timestamp updates being done on every single page
> >> > > >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > > >> > and have nanosecond granularity, so every page fault is resulting in
> >> > > >> > a transaction to update the timestamp of the file being modified.
> >> > > >>
> >> > > >> I have (unmergeable) patches to fix this:
> >> > > >>
> >> > > >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> > > >
> >> > > > The big problem with this approach is that not doing the
> >> > > > timestamp update on page faults is going to break the inode change
> >> > > > version counting because for ext4, btrfs and XFS it takes a
> >> > > > transaction to bump that counter. NFS needs to know the moment a
> >> > > > file is changed in memory, not when it is written to disk. Also, NFS
> >> > > > requires the change to the counter to be persistent over server
> >> > > > failures, so it needs to be changed as part of a transaction....
> >> > >
> >> > > I've been running a kernel that has the file_update_time call
> >> > > commented out for over a year now, and the only problem I've seen is
> >> > > that the timestamp doesn't get updated :)
> >> > >
> >> > > I think I must be misunderstanding you (or vice versa). I'm currently
> >> >
> >> > Yup, you are.
> >> >
> >> > > redoing the patches, and this time I'll do it for just the mm core and
> >> > > ext4. The only change I'm proposing to ext4's page_mkwrite is to
> >> > > remove the file_update_time call.
> >> >
> >> > Right. Where does that end up? All the way down in
> >> > ext4_mark_iloc_dirty(), and that does:
> >> >
> >> > if (IS_I_VERSION(inode))
> >> > inode_inc_iversion(inode);
> >> >
> >> > The XFS transaction code is the same - deep inside it where an inode
> >> > is marked as dirty in the transaction, it bumps the same counter and
> >> > adds it to the transaction.
> >> Yeah, I'd just add that ext4 maintains i_version only if it has been
> >> mounted with i_version mount option. But then NFS server would depend on
> >> c/mtime update so it won't help you much - you still should update at least
> >> one of i_version, ctime, mtime on page fault. OTOH if the filesystem isn't
> >> exported, you could avoid this relatively expensive dance and defer things
> >> as Andy suggests.
> >
> > The problem with "not exported, don't update" is that files can be
> > modified on server startup (e.g. after a crash) or in short
> > maintenance periods when the NFS service is down. When the server is
> > started back up, the change number needs to indicate the file has
> > been modified so that clients reconnecting to the server see the
> > change.
> >
> > IOWs, even if the NFS server is not up or the filesystem not
> > exported we still need to update change counts whenever a file
> > changes if we are going to tell the NFS server that we keep them...
>
> This will keep working as long as the clients are willing to wait for
> writeback (or msync, munmap, or exit) on the server.
I don't follow you - what will keep working? If we don't record
changes while the filesystem is not exported, then NFS clients can't
determine if files have changed while the server was down for a
period....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:28 ` Dave Chinner
@ 2013-08-19 23:23 ` David Lang
-1 siblings, 0 replies; 67+ messages in thread
From: David Lang @ 2013-08-19 23:23 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o, Dave Hansen, LKML,
Andy Lutomirski, Dave Hansen, Andi Kleen, Linux FS Devel,
Jan Kara, xfs, Tim Chen
On Fri, 16 Aug 2013, Dave Chinner wrote:
> The problem with "not exported, don't update" is that files can be
> modified on server startup (e.g. after a crash) or in short
> maintenance periods when the NFS service is down. When the server is
> started back up, the change number needs to indicate the file has
> been modified so that clients reconnecting to the server see the
> change.
>
> IOWs, even if the NFS server is not up or the filesystem not
> exported we still need to update change counts whenever a file
> changes if we are going to tell the NFS server that we keep them...
This sounds like you need something more like relctime rather than noctime,
something that updates the time in RAM, but doesn't insist on flushing it to
disk immediately, updating when convenient or when the file is closed.
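Roughly what I have in mind, as a userspace sketch in C rather than anything resembling real kernel code. The struct and function names are made up for illustration, and "disk" here is just a second field.

```c
#include <stdint.h>

struct toy_lazy_inode {
	int64_t       ram_ctime_ns;   /* updated on every change */
	int64_t       disk_ctime_ns;  /* what would survive a crash */
	unsigned long disk_writes;    /* how often we actually hit the disk */
};

/* Every modification only touches the in-memory timestamp. */
void toy_touch(struct toy_lazy_inode *inode, int64_t now_ns)
{
	inode->ram_ctime_ns = now_ns;
}

/* On close (or whenever convenient), push the timestamp out once. */
void toy_close(struct toy_lazy_inode *inode)
{
	if (inode->disk_ctime_ns == inode->ram_ctime_ns)
		return;  /* nothing new to flush */
	inode->disk_ctime_ns = inode->ram_ctime_ns;
	inode->disk_writes++;
}
```

Any number of toy_touch() calls followed by one toy_close() costs a single disk write in this model. The obvious trade-off is that a crash before the flush loses the in-memory value.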
David Lang
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-19 23:23 ` David Lang
(?)
@ 2013-08-19 23:31 ` Andy Lutomirski
-1 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-19 23:31 UTC (permalink / raw)
To: David Lang
Cc: Dave Chinner, Jan Kara, Theodore Ts'o, Dave Hansen,
Dave Hansen, Linux FS Devel, xfs, linux-ext4@vger.kernel.org,
LKML, Tim Chen, Andi Kleen
On Mon, Aug 19, 2013 at 4:23 PM, David Lang <david@lang.hm> wrote:
> On Fri, 16 Aug 2013, Dave Chinner wrote:
>
>> The problem with "not exported, don't update" is that files can be
>> modified on server startup (e.g. after a crash) or in short
>> maintenance periods when the NFS service is down. When the server is
>> started back up, the change number needs to indicate the file has
>> been modified so that clients reconnecting to the server see the
>> change.
>>
>> IOWs, even if the NFS server is not up or the filesystem not
>> exported we still need to update change counts whenever a file
>> changes if we are going to tell the NFS server that we keep them...
>
>
> This sounds like you need something more like relctime rather than noctime,
> something that updates the time in RAM but doesn't insist on flushing it to
> disk immediately, updating when convenient or when the file is closed.
>
> David Lang
I guess my patches could be extended to do this. In their current
form, when a pte dirty bit is transferred to a page (via page_mkclean
or unmap), the address_space is marked as needing a cmtime update. I
could add a mode in which even the normal write syscall path sets that
bit instead of immediately updating the timestamp. This could be a
nice speedup to non-mmap writers.
To avoid breaking things, things like fsync would need to force a
cmtime flush -- I doubt it would be okay for write; fsync; write;
fsync to leave the timestamp matching the first write.
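As a rough illustration of the deferred mode described above, here is a
toy userspace model in C. It is only a sketch under stated assumptions:
struct toy_inode, toy_write(), toy_flush_cmtime(), toy_fsync() and the
logical clock are all invented for this example and are not taken from
the patches or from any kernel API. The point it shows is that a write
only sets a "pending" flag, and fsync must flush that flag so that
write; fsync; write; fsync ends with the timestamp of the second write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an inode plus its address_space (hypothetical names). */
struct toy_inode {
    uint64_t now;            /* logical clock, advanced on every write */
    uint64_t cmtime;         /* timestamp that stat() would report */
    bool     cmtime_pending; /* "needs a cmtime update" flag */
};

/* write(): note that the file changed, but defer the timestamp update. */
void toy_write(struct toy_inode *ino)
{
    ino->now++;
    ino->cmtime_pending = true;
}

/* Apply a pending update; a real design would call this from several places. */
void toy_flush_cmtime(struct toy_inode *ino)
{
    if (ino->cmtime_pending) {
        ino->cmtime = ino->now;
        ino->cmtime_pending = false;
    }
}

/* fsync(): must force the flush before reporting the data as stable. */
void toy_fsync(struct toy_inode *ino)
{
    toy_flush_cmtime(ino);
    /* data writeback would happen here */
}
```

In this model the visible timestamp is stale between a write and the
next flush, which is exactly the window the rest of the thread argues
about.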
I'd rather get comments on the current form of my patches and maybe
get them merged before looking at even more far-reaching extensions,
though.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 7:11 ` Dave Chinner
@ 2013-08-15 15:17 ` Andy Lutomirski
-1 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-15 15:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> >> >> > > cost of the unwritten->written conversion.
>> >> >> >
>> >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> >> >> > this part until writeback?
>> >> >>
>> >> >> Part of the work has to be done at write time because we need to
>> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> >> >> problems). The unwritten->written conversion does happen at writeback
>> >> >> (as does the actual block allocation if we are doing delayed
>> >> >> allocation).
>> >> >>
>> >> >> The point is that if the goal is to measure page fault scalability, we
>> >> >> shouldn't have this other stuff happening at the same time as the page
>> >> >> fault workload.
>> >> >
>> >> > Sure, but the real problem is not the block mapping or allocation
>> >> > path - even if the test is changed to take that out of the picture,
>> >> > we still have timestamp updates being done on every single page
>> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> >> > and have nanosecond granularity, so every page fault is resulting in
>> >> > a transaction to update the timestamp of the file being modified.
>> >>
>> >> I have (unmergeable) patches to fix this:
>> >>
>> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> >
>> > The big problem with this approach is that not doing the
>> > timestamp update on page faults is going to break the inode change
>> > version counting because for ext4, btrfs and XFS it takes a
>> > transaction to bump that counter. NFS needs to know the moment a
>> > file is changed in memory, not when it is written to disk. Also, NFS
>> > requires the change to the counter to be persistent over server
>> > failures, so it needs to be changed as part of a transaction....
>>
>> I've been running a kernel that has the file_update_time call
>> commented out for over a year now, and the only problem I've seen is
>> that the timestamp doesn't get updated :)
>>
[...]
> If a filesystem is providing an i_version value, then NFS uses it to
> determine whether client side caches are still consistent with the
> server state. If the filesystem does not provide an i_version, then
> NFS falls back to checking c/mtime for changes. If files on the
> server are being modified without either the timestamps or i_version
> changing, then it's likely that there will be problems with client
> side cache consistency....
I didn't think of that at all.
If userspace does:
ptr = mmap(...);
ptr[0] = 1;
sleep(1);
ptr[0] = 2;
sleep(1);
munmap();
Then current kernels will mark the inode changed on (only) the ptr[0]
= 1 line. My patches will instead mark the inode changed when munmap
is called (or after ptr[0] = 2 if writepages gets called for any
reason).
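For anyone who wants to watch this from userspace, a self-contained
version of the snippet above might look like the following. It is a
sketch, not a benchmark: the function name mmap_mtime_demo(), the
helper show_mtime() and the /tmp scratch path are arbitrary choices
made for this example. It prints the file's mtime after each step so
the point at which the kernel updates it can be observed; what it
prints depends on the kernel and on the filesystem backing /tmp.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print the file's current mtime with nanosecond resolution. */
static int show_mtime(int fd, const char *label)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return -1;
    printf("%-20s mtime = %lld.%09ld\n", label,
           (long long)st.st_mtim.tv_sec, (long)st.st_mtim.tv_nsec);
    return 0;
}

/*
 * Run the two-store sequence against a scratch file.
 * Returns 0 on success, -1 if any system call fails.
 */
int mmap_mtime_demo(unsigned int pause_seconds)
{
    char path[] = "/tmp/mmap_mtime_XXXXXX";
    long page = sysconf(_SC_PAGESIZE);
    char *ptr;
    int rc = -1;
    int fd;

    if (page <= 0)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                   /* file lives until fd is closed */
    if (ftruncate(fd, page) != 0)
        goto out;

    ptr = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED)
        goto out;

    show_mtime(fd, "before any store:");
    ptr[0] = 1;                     /* first store to the page: write fault */
    show_mtime(fd, "after ptr[0] = 1:");
    sleep(pause_seconds);
    ptr[0] = 2;                     /* page already writable: usually no fault */
    show_mtime(fd, "after ptr[0] = 2:");
    sleep(pause_seconds);

    rc = munmap(ptr, (size_t)page);
    if (rc == 0)
        rc = show_mtime(fd, "after munmap:");
out:
    close(fd);
    return rc;
}
```

Calling mmap_mtime_demo(1) from main() reproduces the sequence with the
one-second sleeps from the original snippet.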
I'm not sure which is better. POSIX actually requires my behavior
(which is mostly irrelevant). My behavior also means that, if an NFS
client reads and caches the file between the two writes, then it will
eventually find out that the data is stale. The current behavior, on
the other hand, means that a single pass of mmapped writes through the
file will update the times much faster.
I could arrange for the first page fault to *also* update times when
the FS is exported or if a particular mount option is set. (The ext4
change to request the new behavior is all of four lines, and it's easy
to adjust.)
I'll send patches later today. I want to get msync(MS_ASYNC) working
and pound on them a bit first.
--Andy
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 15:17 ` Andy Lutomirski
@ 2013-08-15 21:37 ` Dave Chinner
-1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2013-08-15 21:37 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> >> >> > > cost of the unwritten->written conversion.
> >> >> >> >
> >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> >> >> > this part until writeback?
> >> >> >>
> >> >> >> Part of the work has to be done at write time because we need to
> >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> >> >> problems). The unwritten->written conversion does happen at writeback
> >> >> >> (as does the actual block allocation if we are doing delayed
> >> >> >> allocation).
> >> >> >>
> >> >> >> The point is that if the goal is to measure page fault scalability, we
> >> >> >> shouldn't have this other stuff happening at the same time as the page
> >> >> >> fault workload.
> >> >> >
> >> >> > Sure, but the real problem is not the block mapping or allocation
> >> >> > path - even if the test is changed to take that out of the picture,
> >> >> > we still have timestamp updates being done on every single page
> >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> >> > and have nanosecond granularity, so every page fault is resulting in
> >> >> > a transaction to update the timestamp of the file being modified.
> >> >>
> >> >> I have (unmergeable) patches to fix this:
> >> >>
> >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> >
> >> > The big problem with this approach is that not doing the
> >> > timestamp update on page faults is going to break the inode change
> >> > version counting because for ext4, btrfs and XFS it takes a
> >> > transaction to bump that counter. NFS needs to know the moment a
> >> > file is changed in memory, not when it is written to disk. Also, NFS
> >> > requires the change to the counter to be persistent over server
> >> > failures, so it needs to be changed as part of a transaction....
> >>
> >> I've been running a kernel that has the file_update_time call
> >> commented out for over a year now, and the only problem I've seen is
> >> that the timestamp doesn't get updated :)
> >>
>
> [...]
>
> > If a filesystem is providing an i_version value, then NFS uses it to
> > determine whether client side caches are still consistent with the
> > server state. If the filesystem does not provide an i_version, then
> > NFS falls back to checking c/mtime for changes. If files on the
> > server are being modified without either the timestamps or i_version
> > changing, then it's likely that there will be problems with client
> > side cache consistency....
>
> I didn't think of that at all.
>
> If userspace does:
>
> ptr = mmap(...);
> ptr[0] = 1;
> sleep(1);
> ptr[0] = 2;
> sleep(1);
> munmap();
>
> Then current kernels will mark the inode changed on (only) the ptr[0]
> = 1 line. My patches will instead mark the inode changed when munmap
> is called (or after ptr[0] = 2 if writepages gets called for any
> reason).
>
> I'm not sure which is better. POSIX actually requires my behavior
> (which is mostly irrelevant).
Not by my reading of it. POSIX states that c/mtime needs to be
updated between the first access and the next msync() call. We
update mtime on the first access, and so therefore we conform to the
POSIX requirement....
> My behavior also means that, if an NFS
> client reads and caches the file between the two writes, then it will
> eventually find out that the data is stale.
"eventually" is very different behaviour to the current behaviour.
My understanding is that NFS v4 delegations require the underlying
filesystem to bump the version count on *any* modification made to
the file so that delegations can be recalled appropriately. So not
informing the filesystem that the file data has been changed is
going to cause problems.
> The current behavior, on
> the other hand, means that a single pass of mmapped writes through the
> file will update the times much faster.
>
> I could arrange for the first page fault to *also* update times when
> the FS is exported or if a particular mount option is set. (The ext4
> change to request the new behavior is all of four lines, and it's easy
> to adjust.)
What does "first page fault" mean?
Cheers,
Dave
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:37 ` Dave Chinner
@ 2013-08-15 21:43 ` Andy Lutomirski
-1 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-15 21:43 UTC (permalink / raw)
To: Dave Chinner
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> I didn't think of that at all.
>>
>> If userspace does:
>>
>> ptr = mmap(...);
>> ptr[0] = 1;
>> sleep(1);
>> ptr[0] = 2;
>> sleep(1);
>> munmap();
>>
>> Then current kernels will mark the inode changed on (only) the ptr[0]
>> = 1 line. My patches will instead mark the inode changed when munmap
>> is called (or after ptr[0] = 2 if writepages gets called for any
>> reason).
>>
>> I'm not sure which is better. POSIX actually requires my behavior
>> (which is mostly irrelevant).
>
> Not by my reading of it. POSIX states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> POSIX requirement....
It says "between a write reference to the mapped region and the next
call to msync()." Most write references don't cause page faults.
>
>> My behavior also means that, if an NFS
>> client reads and caches the file between the two writes, then it will
>> eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately. So not
> informing the filesystem that the file data has been changed is
> going to cause problems.
We don't do that right now (and we can't without utterly destroying
performance) because we don't trap on every modification. See
below...
>
>> The current behavior, on
>> the other hand, means that a single pass of mmapped writes through the
>> file will update the times much faster.
>>
>> I could arrange for the first page fault to *also* update times when
>> the FS is exported or if a particular mount option is set. (The ext4
>> change to request the new behavior is all of four lines, and it's easy
>> to adjust.)
>
> What does "first page fault" mean?
The first write to the page triggers a page fault and marks the page
writable. The second write to the page (assuming no writeback happens
in the meantime) does not trigger a page fault or notify the kernel
in any way.
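The fault behaviour described above can be modelled in a few lines of
C. This is a toy state machine, not kernel code: struct toy_page,
toy_store() and toy_writeback() are hypothetical names, and the model
assumes that writeback cleans the page and write-protects it again so
that the next store faults. It counts how many times the kernel (and
therefore the filesystem) would be notified.

```c
#include <stdbool.h>

/* Toy model of one file-backed page in a shared mapping (hypothetical). */
struct toy_page {
    bool     writable; /* PTE currently allows stores */
    bool     dirty;    /* page has data not yet written back */
    unsigned faults;   /* number of write faults seen by the kernel */
};

/* A userspace store to the page. */
void toy_store(struct toy_page *pg)
{
    if (!pg->writable) {
        /* Write fault: the point where the filesystem can update times. */
        pg->faults++;
        pg->writable = true;
    }
    /* Stores to an already-writable page are invisible to the kernel. */
    pg->dirty = true;
}

/* Writeback: clean the page and write-protect it again. */
void toy_writeback(struct toy_page *pg)
{
    if (pg->dirty) {
        pg->dirty = false;
        pg->writable = false;
    }
}
```

Two stores in a row give one fault; a store, a writeback and another
store give two. Whether writeback has happened between two stores is
therefore what decides if the second one is visible to the filesystem.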
In current kernels, this chain of events won't work:
- Server goes down
- Server comes up
- Userspace on server calls mmap and writes something
- Client reconnects and invalidates its cache
- Userspace on server writes something else *to the same page*
The client will never notice the second write, because it won't update
any inode state. With my patches, the client will notice as soon as the
server starts writeback.
So I think that there are cases where my changes make things better
and cases where they make things worse.
--Andy
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:43 ` Andy Lutomirski
@ 2013-08-15 22:18 ` Dave Chinner
-1 siblings, 0 replies; 67+ messages in thread
From: Dave Chinner @ 2013-08-15 22:18 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
> <david@fromorbit.com> wrote:
> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> My behavior also means that, if an NFS
> >> client reads and caches the file between the two writes, then it will
> >> eventually find out that the data is stale.
> >
> > "eventually" is very different behaviour to the current behaviour.
> >
> > My understanding is that NFS v4 delegations require the underlying
> > filesystem to bump the version count on *any* modification made to
> > the file so that delegations can be recalled appropriately. So not
> > informing the filesystem that the file data has been changed is
> > going to cause problems.
>
> We don't do that right now (and we can't without utterly destroying
> performance) because we don't trap on every modification. See
> below...
We don't trap every mmap modification. We trap every modification
that the filesystem is informed about. That includes a c/mtime
update on every write page fault. It's as fine grained as we can get
without introducing serious performance killing overhead.
And nobody has made any compelling argument that what we do now is
problematic - all we've got is a microbenchmark that doesn't quite scale
linearly because filesystem updates through a global filesystem
structure (the journal) don't scale linearly.
> >> The current behavior, on
> >> the other hand, means that a single pass of mmapped writes through the
> >> file will update the times much faster.
> >>
> >> I could arrange for the first page fault to *also* update times when
> >> the FS is exported or if a particular mount option is set. (The ext4
> >> change to request the new behavior is all of four lines, and it's easy
> >> to adjust.)
> >
> > What does "first page fault" mean?
>
> The first write to the page triggers a page fault and marks the page
> writable. The second write to the page (assuming no writeback happens
> in the meantime) does not trigger a page fault or notify the kernel
> in any way.
IIUC, what you are saying is that you'll maintain the current behaviour
(i.e. clean->dirty does a timestamp update) if the filesystem
requires it? So the default behaviour of any filesystem that
supports NFSv4 is going to behave as it does now?
If that's the case, why bother changing anything as nfsv4 is the
default version that the kernel uses? (I'm playing devil's advocate
here).
> In current kernels, this chain of events won't work:
>
> - Server goes down
> - Server comes up
> - Userspace on server calls mmap and writes something
> - Client reconnects and invalidates its cache
> - Userspace on server writes something else *to the same page*
>
> The client will never notice the second write, because it won't update
> any inode state.
That's wrong. The server wrote the dirty page before the client
reconnected, therefore it got marked clean. The second write to the
server page marks it dirty again, causing page_mkwrite to be
called, thereby updating the timestamp/i_version field. So, the NFS
client will notice the second change on the server, and it will
notice it immediately after the second access has occurred, not some
time later when:
> With my patches, the client will notice as soon as the
> server starts writeback.
Your patches introduce a 30+ second window where a file can be dirty
on the server but the NFS server doesn't know about it and can't
tell the clients about it because i_version doesn't get bumped until
writeback.....
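To put a number on that window for a given machine, the periodic
writeback tunables can be read from procfs. The sketch below assumes
that the "30+ seconds" above refers to the usual defaults of
vm.dirty_expire_centisecs (3000) and vm.dirty_writeback_centisecs
(500); that link is an assumption of this example, as are the helper
names. The bound it computes is rough: it ignores memory pressure,
explicit sync calls and anything else that starts writeback earlier.

```c
#include <stdio.h>

/* Read one integer from a sysctl file; returns -1 if it cannot be read. */
long read_centisecs(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* The tunables are expressed in hundredths of a second. */
double centisecs_to_seconds(long centisecs)
{
    return (double)centisecs / 100.0;
}

/*
 * Rough upper bound on how long dirty data waits for periodic writeback:
 * it must first exceed the expiry age, then wait for the next flusher wakeup.
 */
double worst_case_dirty_seconds(long expire_cs, long writeback_cs)
{
    return centisecs_to_seconds(expire_cs + writeback_cs);
}
```

On a Linux system the two values would come from
read_centisecs("/proc/sys/vm/dirty_expire_centisecs") and
read_centisecs("/proc/sys/vm/dirty_writeback_centisecs").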
> So I think that there are cases where my changes make things better
> and cases where they make things worse.
Right, and the issue is that there are important use cases that we
have to support in default configurations for which it makes things
worse.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 22:18 ` Dave Chinner
(?)
@ 2013-08-15 22:26 ` Andy Lutomirski
2013-08-16 0:14 ` Dave Chinner
-1 siblings, 1 reply; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-15 22:26 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Dave Hansen, Dave Hansen, Linux FS Devel, xfs,
linux-ext4@vger.kernel.org, Jan Kara, LKML, Tim Chen, Andi Kleen
On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> <david@fromorbit.com> wrote:
>> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> >> My behavior also means that, if an NFS
>> >> client reads and caches the file between the two writes, then it will
>> >> eventually find out that the data is stale.
>> >
>> > "eventually" is very different behaviour to the current behaviour.
>> >
>> > My understanding is that NFS v4 delegations require the underlying
>> > filesystem to bump the version count on *any* modification made to
>> > the file so that delegations can be recalled appropriately. So not
>> > informing the filesystem that the file data has been changed is
>> > going to cause problems.
>>
>> We don't do that right now (and we can't without utterly destroying
>> performance) because we don't trap on every modification. See
>> below...
>
> We don't trap every mmap modification. We trap every modification
> that the filesystem is informed about. That includes a c/mtime
> update on every write page fault. It's as fine grained as we can get
> without introducing serious performance killing overhead.
>
> And nobody has made any compelling argument that what we do now is
> problematic - all we've got is a microbenchmark that doesn't quite scale
> linearly because filesystem updates through a global filesystem
> structure (the journal) don't scale linearly.
I don't personally care about scaling. I care about sleeping in write
faults, and starting journal transactions sleeps, and this is an
absolute show-stopper for me. (It's a real-time latency problem, not
a throughput or scalability thing.)
>
>> >> The current behavior, on
>> >> the other hand, means that a single pass of mmapped writes through the
>> >> file will update the times much faster.
>> >>
>> >> I could arrange for the first page fault to *also* update times when
>> >> the FS is exported or if a particular mount option is set. (The ext4
>> >> change to request the new behavior is all of four lines, and it's easy
>> >> to adjust.)
>> >
>> > What does "first page fault" mean?
>>
>> The first write to the page triggers a page fault and marks the page
>> writable. The second write to the page (assuming no writeback happens
>> in the mean time) does not trigger a page fault or notify the kernel
>> in any way.
>
> IIUC, what you are saying is that you'll maintain the current behaviour
> (i.e. clean->dirty does a timestamp update) if the filesystem
> requires it? So the default behaviour of any filesystem that
> supports NFSv4 is going to behave as it does now?
>
> If that's the case, why bother changing anything, as NFSv4 is the
> default version that the kernel uses? (I'm playing devil's advocate
> here).
Because the performance sucks right now. I'd like to fix it without
breaking things, and I think I can fix it while actually improving the
semantics.
>
>> In current kernels, this chain of events won't work:
>>
>> - Server goes down
>> - Server comes up
>> - Userspace on server calls mmap and writes something
>> - Client reconnects and invalidates its cache
>> - Userspace on server writes something else *to the same page*
>>
>> The client will never notice the second write, because it won't update
>> any inode state.
>
> That's wrong. The server wrote the dirty page before the client
> reconnected, therefore it got marked clean.
Why would it write the dirty page? Is the client's NFSv4 request
forcing the server to scan for dirty ptes or pages? If so, can you
point me to that code? I can probably make it work deterministically.
> The second write to the
> server page marks it dirty again, causing page_mkwrite to be
> called, thereby updating the timestamp/i_version field. So, the NFS
> client will notice the second change on the server, and it will
> notice it immediately after the second access has occurred, not some
> time later when:
>
>> With my patches, the client will as soon as the
>> server starts writeback.
>
> Your patches introduce a 30+ second window where a file can be dirty
> on the server but the NFS server doesn't know about it and can't
> tell the clients about it because i_version doesn't get bumped until
> writeback.....
I claim that there's an infinite window right now, and that 30 seconds
is therefore an improvement.
--Andy
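
A rough way to look at the latency concern from userspace is to time the
first store to each page of a shared file mapping, since each of those
stores takes a write fault. The sketch below is only an illustration under
assumptions: the helper name worst_first_store_ns and the /tmp scratch path
are made up here, and if /tmp is tmpfs no journal is involved at all, so the
file would have to live on the filesystem of interest (ext4, XFS) for the
number to say anything about sleeping in journal transactions.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: map a fresh temporary file MAP_SHARED, store one
 * byte to each page, and return the slowest single store in nanoseconds.
 * Every store here is the first write to its page, so every store takes
 * a write fault. Returns -1 if the setup fails.
 */
static int64_t worst_first_store_ns(size_t npages)
{
	char path[] = "/tmp/wfault-XXXXXX";	/* assumed scratch location */
	long psz = sysconf(_SC_PAGESIZE);
	size_t len = npages * (size_t)psz;
	int64_t worst = 0;
	char *map;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)len) != 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}
	for (size_t i = 0; i < npages; i++) {
		volatile char *p = map + i * (size_t)psz;
		int64_t t0 = now_ns();
		int64_t dt;

		*p = 1;			/* first store to this page */
		dt = now_ns() - t0;
		if (dt > worst)
			worst = dt;
	}
	munmap(map, len);
	close(fd);
	return worst;
}
```

Comparing worst_first_store_ns(1024) on an idle system and under heavy
journal load would be one crude way to see the effect; a single worst-case
figure hides a lot, and a histogram of per-store times would show more.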
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 22:26 ` Andy Lutomirski
@ 2013-08-16 0:14 ` Dave Chinner
2013-08-16 0:21 ` Andy Lutomirski
0 siblings, 1 reply; 67+ messages in thread
From: Dave Chinner @ 2013-08-16 0:14 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Theodore Ts'o, Dave Hansen, Dave Hansen, Linux FS Devel, xfs,
linux-ext4@vger.kernel.org, Jan Kara, LKML, Tim Chen, Andi Kleen
On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
> >> <david@fromorbit.com> wrote:
> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> >> My behavior also means that, if an NFS
> >> >> client reads and caches the file between the two writes, then it will
> >> >> eventually find out that the data is stale.
> >> >
> >> > "eventually" is very different behaviour to the current behaviour.
> >> >
> >> > My understanding is that NFS v4 delegations require the underlying
> >> > filesystem to bump the version count on *any* modification made to
> >> > the file so that delegations can be recalled appropriately. So not
> >> > informing the filesystem that the file data has been changed is
> >> > going to cause problems.
> >>
> >> We don't do that right now (and we can't without utterly destroying
> >> performance) because we don't trap on every modification. See
> >> below...
> >
> > We don't trap every mmap modification. We trap every modification
> > that the filesystem is informed about. That includes a c/mtime
> > update on every write page fault. It's as fine grained as we can get
> > without introducing serious performance killing overhead.
> >
> > And nobody has made any compelling argument that what we do now is
> > problematic - all we've got is a microbenchmark that doesn't quite scale
> > linearly because filesystem updates through a global filesystem
> > structure (the journal) don't scale linearly.
>
> I don't personally care about scaling. I care about sleeping in write
> faults, and starting journal transactions sleeps, and this is an
> absolute show-stopper for me. (It's a real-time latency problem, not
> a throughput or scalability thing.)
Different problem, then. And one that does actually have a solution
that is already implemented but not exposed to userspace -
O_NOCMTIME. i.e. we actually support turning off c/mtime updates on
a per file basis - the XFS open-by-handle interface sets this flag
by default on files opened that way.....
Expose that to open/fcntl and your problem is solved without
impacting anyone else or default behaviours of filesystems.
> >> In current kernels, this chain of events won't work:
> >>
> >> - Server goes down
> >> - Server comes up
> >> - Userspace on server calls mmap and writes something
> >> - Client reconnects and invalidates its cache
> >> - Userspace on server writes something else *to the same page*
> >>
> >> The client will never notice the second write, because it won't update
> >> any inode state.
> >
> > That's wrong. The server wrote the dirty page before the client
> > reconnected, therefore it got marked clean.
>
> Why would it write the dirty page?
Terminology mismatch - you said it "writes something", not "dirties
the page". So, it's easy to take that as "does writeback" as opposed
to "dirties memory".
As to what would write it? Memory pressure, a user running sync,
ENOSPC conditions, all sorts of things that you can't control. You
cannot rely on writeback only happening periodically and therefore
being predictable and deterministic.
> > The second write to the
> > server page marks it dirty again, causing page_mkwrite to be
> > called, thereby updating the timestamp/i_version field. So, the NFS
> > client will notice the second change on the server, and it will
> > notice it immediately after the second access has occurred, not some
> > time later when:
> >
> >> With my patches, the client will as soon as the
> >> server starts writeback.
> >
> > Your patches introduce a 30+ second window where a file can be dirty
> > on the server but the NFS server doesn't know about it and can't
> > tell the clients about it because i_version doesn't get bumped until
> > writeback.....
>
> I claim that there's an infinite window right now, and that 30 seconds
> is therefore an improvement.
You're talking about after the second change is made. I'm talking
about the difference in behaviour after the *initial change* is
made. Your changes will result in the client not doing an
invalidation because timestamps don't get changed for 30s with your
patches. That's the problem - the first change of a file needs to
bump the i_version immediately, not in 30s time.
That's why delaying timestamp updates doesn't fix the scalability
problem that was reported. It might fix a different problem, but it
doesn't void the *requirement* that filesystems need to do
transactional updates during page faults....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-16 0:14 ` Dave Chinner
@ 2013-08-16 0:21 ` Andy Lutomirski
0 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-16 0:21 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Dave Hansen, Dave Hansen, Linux FS Devel, xfs,
linux-ext4@vger.kernel.org, Jan Kara, LKML, Tim Chen, Andi Kleen
On Thu, Aug 15, 2013 at 5:14 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Aug 15, 2013 at 03:26:09PM -0700, Andy Lutomirski wrote:
>> On Thu, Aug 15, 2013 at 3:18 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Aug 15, 2013 at 02:43:09PM -0700, Andy Lutomirski wrote:
>> >> On Thu, Aug 15, 2013 at 2:37 PM, Dave Chinner
>> >> <david@fromorbit.com> wrote:
>> >> > On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>
>> >> In current kernels, this chain of events won't work:
>> >>
>> >> - Server goes down
>> >> - Server comes up
>> >> - Userspace on server calls mmap and writes something
>> >> - Client reconnects and invalidates its cache
>> >> - Userspace on server writes something else *to the same page*
>> >>
>> >> The client will never notice the second write, because it won't update
>> >> any inode state.
>> >
>> > That's wrong. The server wrote the dirty page before the client
>> > reconnected, therefore it got marked clean.
>>
>> Why would it write the dirty page?
>
> Terminology mismatch - you said it "writes something", not "dirties
> the page". So, it's easy to take that as "does writeback" as opposed
> to "dirties memory".
When I say "writes something" I mean literally performs a store to
memory. That is:
ptr[offset] = value;
In my example, the client will *never* catch up.
>
>> > The second write to the
>> > server page marks it dirty again, causing page_mkwrite to be
>> > called, thereby updating the timestamp/i_version field. So, the NFS
>> > client will notice the second change on the server, and it will
>> > notice it immediately after the second access has occurred, not some
>> > time later when:
>> >
>> >> With my patches, the client will as soon as the
>> >> server starts writeback.
>> >
>> > Your patches introduce a 30+ second window where a file can be dirty
>> > on the server but the NFS server doesn't know about it and can't
>> > tell the clients about it because i_version doesn't get bumped until
>> > writeback.....
>>
>> I claim that there's an infinite window right now, and that 30 seconds
>> is therefore an improvement.
>
> You're talking about after the second change is made. I'm talking
> about the difference in behaviour after the *initial change* is
> made. Your changes will result in the client not doing an
> invalidation because timestamps don't get changed for 30s with your
> patches. That's the problem - the first change of a file needs to
> bump the i_version immediately, not in 30s time.
>
> That's why delaying timestamp updates doesn't fix the scalability
> problem that was reported. It might fix a different problem, but it
> doesn't void the *requirement* that filesystems need to do
> transactional updates during page faults....
>
And this is why I'm unconvinced that your requirement is sensible.
It's attempting to make sure that every mmapped write results in some
kind of FS update, but it actually only results in an FS update
*before* the *first* mmapped write after writeback. It's racy as
hell.
My approach is slow but not racy.
--Andy
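
The disagreement above comes down to when a store through a shared mapping
becomes visible in the inode. A minimal sketch for observing that from
userspace follows, assuming a filesystem whose page_mkwrite path updates
timestamps at fault time; the helper name second_store_moves_mtime and the
/tmp scratch path are hypothetical. It stores to a page, optionally forces
writeback with msync(MS_SYNC), stores to the same page again, and reports
whether st_mtim changed across the second store.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper. Returns 1 if mtime moved across the second store
 * to an already-written page, 0 if it did not, -1 on setup failure.
 * If writeback_between is nonzero, msync(MS_SYNC) is called after the
 * first store so the page is clean again before the second store.
 */
static int second_store_moves_mtime(int writeback_between)
{
	char path[] = "/tmp/mkwrite-XXXXXX";	/* assumed scratch location */
	struct timespec pause = { 0, 20 * 1000 * 1000 };	/* 20 ms */
	long psz = sysconf(_SC_PAGESIZE);
	struct stat before, after;
	volatile char *p;
	char *map;
	int ret = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, (off_t)psz) != 0)
		goto out_close;
	map = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE, MAP_SHARED,
		   fd, 0);
	if (map == MAP_FAILED)
		goto out_close;
	p = map;

	p[0] = 1;			/* first store: clean -> dirty */
	if (writeback_between && msync(map, (size_t)psz, MS_SYNC) != 0)
		goto out_unmap;
	if (fstat(fd, &before) != 0)
		goto out_unmap;

	nanosleep(&pause, NULL);	/* let the coarse clock advance */
	p[0] = 2;			/* second store, same page */
	if (fstat(fd, &after) != 0)
		goto out_unmap;

	ret = before.st_mtim.tv_sec != after.st_mtim.tv_sec ||
	      before.st_mtim.tv_nsec != after.st_mtim.tv_nsec;
out_unmap:
	munmap(map, (size_t)psz);
out_close:
	close(fd);
	return ret;
}
```

With writeback_between = 0 one would expect 0 on the kernels discussed
here, unless background writeback happens to run inside the 20 ms pause;
with writeback_between = 1 one would expect 1, because the page was cleaned
and the second store faults again. Results on tmpfs or on filesystems with
different timestamp handling may differ.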
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-15 21:37 ` Dave Chinner
@ 2013-08-16 22:02 ` J. Bruce Fields
-1 siblings, 0 replies; 67+ messages in thread
From: J. Bruce Fields @ 2013-08-16 22:02 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-ext4@vger.kernel.org, Theodore Ts'o, Dave Hansen, LKML,
xfs, Dave Hansen, Andi Kleen, Linux FS Devel, Jan Kara,
Andy Lutomirski, Tim Chen
On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> > >> >> >> > > cost of the unwritten->written conversion.
> > >> >> >> >
> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> > >> >> >> > this part until writeback?
> > >> >> >>
> > >> >> >> Part of the work has to be done at write time because we need to
> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> > >> >> >> problems). The unwritten->written conversion does happen at writeback
> > >> >> >> (as does the actual block allocation if we are doing delayed
> > >> >> >> allocation).
> > >> >> >>
> > >> >> >> The point is that if the goal is to measure page fault scalability, we
> > >> >> >> shouldn't have this other stuff happening as the same time as the page
> > >> >> >> fault workload.
> > >> >> >
> > >> >> > Sure, but the real problem is not the block mapping or allocation
> > >> >> > path - even if the test is changed to take that out of the picture,
> > >> >> > we still have timestamp updates being done on every single page
> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> > >> >> > and have nanosecond granularity, so every page fault is resulting in
> > >> >> > a transaction to update the timestamp of the file being modified.
> > >> >>
> > >> >> I have (unmergeable) patches to fix this:
> > >> >>
> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> > >> >
> > >> > The big problem with this approach is that not doing the
> > >> > timestamp update on page faults is going to break the inode change
> > >> > version counting because for ext4, btrfs and XFS it takes a
> > >> > transaction to bump that counter. NFS needs to know the moment a
> > >> > file is changed in memory, not when it is written to disk. Also, NFS
> > >> > requires the change to the counter to be persistent over server
> > >> > failures, so it needs to be changed as part of a transaction....
> > >>
> > >> I've been running a kernel that has the file_update_time call
> > >> commented out for over a year now, and the only problem I've seen is
> > >> that the timestamp doesn't get updated :)
> > >>
> >
> > [...]
> >
> > > If a filesystem is providing an i_version value, then NFS uses it to
> > > determine whether client side caches are still consistent with the
> > > server state. If the filesystem does not provide an i_version, then
> > > NFS falls back to checking c/mtime for changes. If files on the
> > > server are being modified without either the timestamps or i_version
> > > changing, then it's likely that there will be problems with client
> > > side cache consistency....
> >
> > I didn't think of that at all.
> >
> > If userspace does:
> >
> > ptr = mmap(...);
> > ptr[0] = 1;
> > sleep(1);
> > ptr[0] = 2;
> > sleep(1);
> > munmap();
> >
> > Then current kernels will mark the inode changed on (only) the ptr[0]
> > = 1 line. My patches will instead mark the inode changed when munmap
> > is called (or after ptr[0] = 2 if writepages gets called for any
> > reason).
> >
> > I'm not sure which is better. POSIX actually requires my behavior
> > (which is mostly irrelevant).
>
> Not by my reading of it. Posix states that c/mtime needs to be
> updated between the first access and the next msync() call. We
> update mtime on the first access, and so therefore we conform to the
> posix requirement....
>
> > My behavior also means that, if an NFS
> > client reads and caches the file between the two writes, then it will
> > eventually find out that the data is stale.
>
> "eventually" is very different behaviour to the current behaviour.
>
> My understanding is that NFS v4 delegations require the underlying
> filesystem to bump the version count on *any* modification made to
> the file so that delegations can be recalled appropriately.
Delegations at least shouldn't be an issue here: they're recalled on the
open.
--b.
> So not
> informing the filesystem that the file data has been changed is
> going to cause problems.
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-16 22:02 ` J. Bruce Fields
@ 2013-08-16 23:18 ` Andy Lutomirski
-1 siblings, 0 replies; 67+ messages in thread
From: Andy Lutomirski @ 2013-08-16 23:18 UTC (permalink / raw)
To: J. Bruce Fields
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Fri, Aug 16, 2013 at 3:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
>> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
>> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
>> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
>> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
>> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
>> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
>> > >> >> >> > > cost of the unwritten->written conversion.
>> > >> >> >> >
>> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
>> > >> >> >> > this part until writeback?
>> > >> >> >>
>> > >> >> >> Part of the work has to be done at write time because we need to
>> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
>> > >> >> >> problems). The unwritten->written conversion does happen at writeback
>> > >> >> >> (as does the actual block allocation if we are doing delayed
>> > >> >> >> allocation).
>> > >> >> >>
>> > >> >> >> The point is that if the goal is to measure page fault scalability, we
>> > >> >> >> shouldn't have this other stuff happening as the same time as the page
>> > >> >> >> fault workload.
>> > >> >> >
>> > >> >> > Sure, but the real problem is not the block mapping or allocation
>> > >> >> > path - even if the test is changed to take that out of the picture,
>> > >> >> > we still have timestamp updates being done on every single page
>> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
>> > >> >> > and have nanosecond granularity, so every page fault is resulting in
>> > >> >> > a transaction to update the timestamp of the file being modified.
>> > >> >>
>> > >> >> I have (unmergeable) patches to fix this:
>> > >> >>
>> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
>> > >> >
>> > >> > The big problem with this approach is that not doing the
>> > >> > timestamp update on page faults is going to break the inode change
>> > >> > version counting because for ext4, btrfs and XFS it takes a
>> > >> > transaction to bump that counter. NFS needs to know the moment a
>> > >> > file is changed in memory, not when it is written to disk. Also, NFS
>> > >> > requires the change to the counter to be persistent over server
>> > >> > failures, so it needs to be changed as part of a transaction....
>> > >>
>> > >> I've been running a kernel that has the file_update_time call
>> > >> commented out for over a year now, and the only problem I've seen is
>> > >> that the timestamp doesn't get updated :)
>> > >>
>> >
>> > [...]
>> >
>> > > If a filesystem is providing an i_version value, then NFS uses it to
>> > > determine whether client side caches are still consistent with the
>> > > server state. If the filesystem does not provide an i_version, then
>> > > NFS falls back to checking c/mtime for changes. If files on the
>> > > server are being modified without either the timestamps or i_version
>> > > changing, then it's likely that there will be problems with client
>> > > side cache consistency....
>> >
>> > I didn't think of that at all.
>> >
>> > If userspace does:
>> >
>> > ptr = mmap(...);
>> > ptr[0] = 1;
>> > sleep(1);
>> > ptr[0] = 2;
>> > sleep(1);
>> > munmap();
>> >
>> > Then current kernels will mark the inode changed on (only) the ptr[0]
>> > = 1 line. My patches will instead mark the inode changed when munmap
>> > is called (or after ptr[0] = 2 if writepages gets called for any
>> > reason).
>> >
>> > I'm not sure which is better. POSIX actually requires my behavior
>> > (which is mostly irrelevant).
>>
>> Not by my reading of it. POSIX states that c/mtime needs to be
>> updated between the first access and the next msync() call. We
>> update mtime on the first access, and so therefore we conform to the
>> POSIX requirement....
>>
>> > My behavior also means that, if an NFS
>> > client reads and caches the file between the two writes, then it will
>> > eventually find out that the data is stale.
>>
>> "eventually" is very different behaviour to the current behaviour.
>>
>> My understanding is that NFS v4 delegations require the underlying
>> filesystem to bump the version count on *any* modification made to
>> the file so that delegations can be recalled appropriately.
>
> Delegations at least shouldn't be an issue here: they're recalled on the
> > open.

Can you translate that into clueless-non-NFS-expert? :)

Anyway, I'm sending patches in a sec. Dave (Hansen), want to test? I
played with will-it-scale a bit, but I don't really know what I'm
doing.

--Andy
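For anyone who wants to see where their own kernel lands on the quoted
mmap() sequence, here is a minimal userspace sketch. It is only a probe,
not part of any patch in this thread: probe_mmap_mtime(), mtime_of() and
ts_cmp() are names made up for illustration, and the result will depend
on the kernel version and the filesystem (tmpfs, for instance, may not
behave like ext4 or XFS).

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: return the file's current mtime (zero on error). */
static struct timespec mtime_of(int fd)
{
	struct stat st;
	struct timespec zero = { 0, 0 };

	return fstat(fd, &st) == 0 ? st.st_mtim : zero;
}

/* Hypothetical helper: <0, 0 or >0 as a is earlier, equal or later than b. */
int ts_cmp(struct timespec a, struct timespec b)
{
	if (a.tv_sec != b.tv_sec)
		return a.tv_sec < b.tv_sec ? -1 : 1;
	if (a.tv_nsec != b.tv_nsec)
		return a.tv_nsec < b.tv_nsec ? -1 : 1;
	return 0;
}

/*
 * Replay the quoted sequence on a one-page file and sample mtime:
 *   out[0] before any store     out[1] after ptr[0] = 1
 *   out[2] after ptr[0] = 2     out[3] after msync() + munmap()
 * pause_s is the sleep between steps (the quoted example uses 1).
 * Returns 0 on success, -1 on failure.
 */
int probe_mmap_mtime(const char *path, unsigned int pause_s,
		     struct timespec out[4])
{
	long page = sysconf(_SC_PAGESIZE);
	char *ptr;
	int fd;

	if (page <= 0)
		return -1;
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page) != 0) {
		close(fd);
		return -1;
	}
	ptr = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		close(fd);
		return -1;
	}

	out[0] = mtime_of(fd);
	sleep(pause_s);
	ptr[0] = 1;		/* first store: takes a write fault */
	out[1] = mtime_of(fd);
	sleep(pause_s);
	ptr[0] = 2;		/* page is normally still writable: no fault */
	out[2] = mtime_of(fd);
	sleep(pause_s);
	msync(ptr, page, MS_SYNC);
	munmap(ptr, page);
	out[3] = mtime_of(fd);

	close(fd);
	return 0;
}
```

With pause_s = 1, comparing out[1], out[2] and out[3] against out[0]
with ts_cmp() shows which step, if any, moved mtime. From the discussion
above one would expect current kernels to move it at out[1] and not
again at out[2], but that is an expectation, not something this sketch
guarantees.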
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
^ permalink raw reply [flat|nested] 67+ messages in thread
* Re: page fault scalability (ext3, ext4, xfs)
2013-08-16 23:18 ` Andy Lutomirski
@ 2013-08-18 20:17 ` J. Bruce Fields
-1 siblings, 0 replies; 67+ messages in thread
From: J. Bruce Fields @ 2013-08-18 20:17 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Andi Kleen, Theodore Ts'o, Dave Hansen, LKML, xfs,
Dave Hansen, Linux FS Devel, Jan Kara, linux-ext4@vger.kernel.org,
Tim Chen
On Fri, Aug 16, 2013 at 04:18:33PM -0700, Andy Lutomirski wrote:
> On Fri, Aug 16, 2013 at 3:02 PM, J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Fri, Aug 16, 2013 at 07:37:25AM +1000, Dave Chinner wrote:
> >> On Thu, Aug 15, 2013 at 08:17:18AM -0700, Andy Lutomirski wrote:
> >> > On Thu, Aug 15, 2013 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> > > On Wed, Aug 14, 2013 at 11:14:37PM -0700, Andy Lutomirski wrote:
> >> > >> On Wed, Aug 14, 2013 at 11:01 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > >> > On Wed, Aug 14, 2013 at 09:32:13PM -0700, Andy Lutomirski wrote:
> >> > >> >> On Wed, Aug 14, 2013 at 7:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > >> >> > On Wed, Aug 14, 2013 at 09:11:01PM -0400, Theodore Ts'o wrote:
> >> > >> >> >> On Wed, Aug 14, 2013 at 04:38:12PM -0700, Andy Lutomirski wrote:
> >> > >> >> >> > > It would be better to write zeros to it, so we aren't measuring the
> >> > >> >> >> > > cost of the unwritten->written conversion.
> >> > >> >> >> >
> >> > >> >> >> > At the risk of beating a dead horse, how hard would it be to defer
> >> > >> >> >> > this part until writeback?
> >> > >> >> >>
> >> > >> >> >> Part of the work has to be done at write time because we need to
> >> > >> >> >> update allocation statistics (i.e., so that we don't have ENOSPC
> >> > >> >> >> problems). The unwritten->written conversion does happen at writeback
> >> > >> >> >> (as does the actual block allocation if we are doing delayed
> >> > >> >> >> allocation).
> >> > >> >> >>
> >> > >> >> >> The point is that if the goal is to measure page fault scalability, we
> >> > >> >> >> shouldn't have this other stuff happening at the same time as the page
> >> > >> >> >> fault workload.
> >> > >> >> >
> >> > >> >> > Sure, but the real problem is not the block mapping or allocation
> >> > >> >> > path - even if the test is changed to take that out of the picture,
> >> > >> >> > we still have timestamp updates being done on every single page
> >> > >> >> > fault. ext4, XFS and btrfs all do transactional timestamp updates
> >> > >> >> > and have nanosecond granularity, so every page fault is resulting in
> >> > >> >> > a transaction to update the timestamp of the file being modified.
> >> > >> >>
> >> > >> >> I have (unmergeable) patches to fix this:
> >> > >> >>
> >> > >> >> http://comments.gmane.org/gmane.linux.kernel.mm/92476
> >> > >> >
> >> > >> > The big problem with this approach is that not doing the
> >> > >> > timestamp update on page faults is going to break the inode change
> >> > >> > version counting because for ext4, btrfs and XFS it takes a
> >> > >> > transaction to bump that counter. NFS needs to know the moment a
> >> > >> > file is changed in memory, not when it is written to disk. Also, NFS
> >> > >> > requires the change to the counter to be persistent over server
> >> > >> > failures, so it needs to be changed as part of a transaction....
> >> > >>
> >> > >> I've been running a kernel that has the file_update_time call
> >> > >> commented out for over a year now, and the only problem I've seen is
> >> > >> that the timestamp doesn't get updated :)
> >> > >>
> >> >
> >> > [...]
> >> >
> >> > > If a filesystem is providing an i_version value, then NFS uses it to
> >> > > determine whether client side caches are still consistent with the
> >> > > server state. If the filesystem does not provide an i_version, then
> >> > > NFS falls back to checking c/mtime for changes. If files on the
> >> > > server are being modified without either the timestamps or i_version
> >> > > changing, then it's likely that there will be problems with client
> >> > > side cache consistency....
> >> >
> >> > I didn't think of that at all.
> >> >
> >> > If userspace does:
> >> >
> >> > ptr = mmap(...);
> >> > ptr[0] = 1;
> >> > sleep(1);
> >> > ptr[0] = 2;
> >> > sleep(1);
> >> > munmap();
> >> >
> >> > Then current kernels will mark the inode changed on (only) the ptr[0]
> >> > = 1 line. My patches will instead mark the inode changed when munmap
> >> > is called (or after ptr[0] = 2 if writepages gets called for any
> >> > reason).
> >> >
> >> > I'm not sure which is better. POSIX actually requires my behavior
> >> > (which is mostly irrelevant).
> >>
> >> Not by my reading of it. POSIX states that c/mtime needs to be
> >> updated between the first access and the next msync() call. We
> >> update mtime on the first access, and so therefore we conform to the
> >> POSIX requirement....
> >>
> >> > My behavior also means that, if an NFS
> >> > client reads and caches the file between the two writes, then it will
> >> > eventually find out that the data is stale.
> >>
> >> "eventually" is very different behaviour to the current behaviour.
> >>
> >> My understanding is that NFS v4 delegations require the underlying
> >> filesystem to bump the version count on *any* modification made to
> >> the file so that delegations can be recalled appropriately.
> >
> > Delegations at least shouldn't be an issue here: they're recalled on the
> > open.
>
> Can you translate that into clueless-non-NFS-expert? :)

An NFS "delegation" is roughly the same thing as what's called a "lease"
by the Linux VFS or an "OpLock" in SMB. It's a lock that is recalled
from the holder on certain conflicting operations. (Basically a way to
tell a client "you're the only one using this file, feel free to cache
it until I tell you otherwise".)

Delegations are recalled on conflicting opens, so by the time you get to
IO there shouldn't be any. I don't think they're really relevant to
this discussion.

--b.
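To make the "lease" comparison concrete: Linux exposes VFS leases to
userspace through fcntl(F_SETLEASE). The sketch below just takes a read
lease, asks the kernel what is held, and drops it again.
try_read_lease() is a made-up name, and whether the lease is granted
depends on the filesystem and on file ownership, so a refusal is
reported rather than treated as a bug. It says nothing about how the NFS
server implements delegations.

```c
#define _GNU_SOURCE		/* F_SETLEASE / F_GETLEASE are Linux-specific */
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: try to take a read lease on path.
 * Returns the lease type reported by F_GETLEASE (F_RDLCK on success),
 * or -1 if the file could not be opened or the lease was refused.
 */
int try_read_lease(const char *path)
{
	int held;
	/* A read lease can only be placed on a read-only descriptor. */
	int fd = open(path, O_RDONLY | O_CREAT, 0600);

	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETLEASE, F_RDLCK) != 0) {
		close(fd);
		return -1;
	}

	/*
	 * While the lease is held, a conflicting open() or truncate() by
	 * another process makes the kernel notify the holder (SIGIO by
	 * default) and gives it a limited time to release or downgrade.
	 */
	held = fcntl(fd, F_GETLEASE);

	fcntl(fd, F_SETLEASE, F_UNLCK);	/* give the lease back */
	close(fd);
	return held;
}
```

The point of the analogy is the recall on a conflicting open: the holder
is told before anyone else gets to do IO on the file.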
>
> Anyway, I'm sending patches in a sec. Dave (Hansen), want to test? I
> played with will-it-scale a bit, but I don't really know what I'm
> doing.
>
> --Andy
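The cache-consistency rule quoted above (compare i_version if the
filesystem provides one, otherwise fall back to c/mtime) can be written
down in a few lines of C. This is a toy model, not NFS client or server
code: struct attr_snapshot and cache_still_valid() are invented for
illustration. It shows why a write that changes neither the counter nor
the timestamps is invisible to the client.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical snapshot of the attributes a client remembers per file. */
struct attr_snapshot {
	bool has_version;	/* does the filesystem provide i_version? */
	uint64_t version;	/* change counter, valid if has_version */
	struct timespec ctime;
	struct timespec mtime;
};

static bool same_ts(struct timespec a, struct timespec b)
{
	return a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec;
}

/*
 * Return true if the client would keep trusting its cached data.
 * With a change counter, only the counter is compared; without one,
 * ctime and mtime must both be unchanged.
 */
bool cache_still_valid(const struct attr_snapshot *cached,
		       const struct attr_snapshot *server)
{
	if (cached->has_version && server->has_version)
		return cached->version == server->version;
	return same_ts(cached->ctime, server->ctime) &&
	       same_ts(cached->mtime, server->mtime);
}
```

If a file is modified through a mapping and the server-side attributes
are identical before and after, cache_still_valid() returns true and the
client keeps using stale data, which is the concern raised above.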
^ permalink raw reply [flat|nested] 67+ messages in thread