* nfs performance delta between filesystems
@ 2010-01-22 17:54 Emmanuel Florac
2010-01-22 18:38 ` bpm
0 siblings, 1 reply; 7+ messages in thread
From: Emmanuel Florac @ 2010-01-22 17:54 UTC (permalink / raw)
To: xfs
Here is an interesting mystery to solve: XFS is always much faster
locally than all other filesystems, but noticeably slower through NFS.
I'm benchmarking a RAID-10 setup with various filesystems: ext3, reiserfs
(3.6), JFS and XFS. I'm interested strictly in continuous sequential
writes (data recording) through NFS (basic gigabit Ethernet), so I'm
benchmarking with
dd if=/dev/zero of=/mnt/whatever/file bs=1M
following the throughput with iostat.
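For reference, the measurement can be sketched as a small script. This is a scaled-down stand-in (64 MiB to a temporary file, so it is harmless to run anywhere); on the real setup the output path would be the local or NFS mount under test, with iostat running alongside in another shell.

```shell
# Scaled-down sketch of the sequential-write benchmark: stream zeros to a
# file and report the byte count written. The temp file stands in for the
# mount point under test.
OUT=$(mktemp)
dd if=/dev/zero of="$OUT" bs=1M count=64 2>/dev/null
SIZE=$(stat -c%s "$OUT")
echo "wrote $SIZE bytes"
rm -f "$OUT"
```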
Setup: Linux 2.6.27.21, software RAID 10 (64K stripe), 1GB RAM, 64-bit
kernel, 4 Savvio 500GB disks, Phenom II, Intel Pro/1000 network.
All the following tests have been run several times each, to even out
the results.
Locally, xfs mounted this way
/dev/md2 on /mnt/raid type xfs (rw,noatime,nobarrier)
is far faster than all the other filesystems:
~# dd if=/dev/zero of=/mnt/raid/testdd bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0655 s, 186 MB/s
ext3, JFS and reiserfs all score about the same, around 150 MB/s.
I'm exporting the mountpoint through NFS:
/mnt/raid *(fsid=0,rw,async,no_root_squash,no_subtree_check)
and mounting it with no options (or just "tcp") on the client. Then the
results are quite different. XFS gives a bit less than 90 MB/s (not
bad):
~$ sudo dd if=/dev/zero of=/mnt/temp/testnfsdd1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 24.1755 s, 88.8 MB/s
But... reiserfs gives ~92 MB/s, and ext3 95 MB/s. Both are
consistently better through NFS, while they're consistently much
slower locally... How come?
--
------------------------------------------------------------------------
Emmanuel Florac | Direction technique
| Intellique
| <eflorac@intellique.com>
| +33 1 78 94 84 02
------------------------------------------------------------------------
_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: nfs performance delta between filesystems
From: bpm @ 2010-01-22 18:38 UTC (permalink / raw)
To: Emmanuel Florac; +Cc: xfs
Hey Emmanuel,
I did some research on this in April last year on an old, old kernel.
One of the codepaths I flagged:
nfsd_create
  write_inode_now
    __sync_single_inode
      write_inode
        xfs_fs_write_inode
          xfs_inode_flush
            xfs_iflush
There were small gains to be had by reordering the sync of the parent and
child syncs where the two inodes were in the same cluster. The larger
problem seemed to be that we're not treating the log as stable storage.
By calling write_inode_now we've written the changes to the log first
and then gone and also written them out to the inode.
nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
kernel I'm looking at). I have a patchset that changes
this to an fsync so we force the log and call it good. I'll be happy to
dust it off if someone hasn't already addressed this situation.
-Ben
* Re: nfs performance delta between filesystems
From: Emmanuel Florac @ 2010-01-22 20:46 UTC (permalink / raw)
To: bpm; +Cc: xfs
On Fri, 22 Jan 2010 12:38:48 -0600, you wrote:
> There were small gains to be had by reordering the sync of the parent
> and child syncs where the two inodes were in the same cluster. The
> larger problem seemed to be that we're not treating the log as stable
> storage. By calling write_inode_now we've written the changes to the
> log first and then gone and also written them out to the inode.
I thought that using nobarrier would prevent the log operations from
being synced to disk too eagerly? Could I improve this behaviour by
using an external log on very fast storage (a ramdisk, for instance)?
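For what it's worth, an external log is chosen at mkfs and mount time. A sketch of how that would look, with device names and sizes as placeholders (note that a volatile log device like a ramdisk defeats the log's role as stable storage, so the filesystem would be unrecoverable after a power failure):

```shell
# Hypothetical external-log setup; /dev/md2, /dev/ram0 and the size are
# placeholders, not values from the thread. Requires root, destroys data
# on /dev/md2. A ramdisk log is lost on reboot or power failure.
mkfs.xfs -f -l logdev=/dev/ram0,size=128m /dev/md2
mount -o noatime,nobarrier,logdev=/dev/ram0 /dev/md2 /mnt/raid
```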
> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at). I have a patchset that changes
> this to an fsync so we force the log and call it good. I'll be happy
> to dust it off if someone hasn't already addressed this situation.
Well, if you think this patchset can be applied to a current kernel
without too much hacking, I'm ready to give it a try :)
* Re: nfs performance delta between filesystems
From: Dave Chinner @ 2010-01-23 12:30 UTC (permalink / raw)
To: bpm; +Cc: xfs
On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> Hey Emmanuel,
>
> I did some research on this in April last year on an old, old kernel.
> One of the codepaths I flagged:
>
> nfsd_create
> write_inode_now
> __sync_single_inode
> write_inode
> xfs_fs_write_inode
> xfs_inode_flush
> xfs_iflush
>
> There were small gains to be had by reordering the sync of the parent and
> child syncs where the two inodes were in the same cluster. The larger
> problem seemed to be that we're not treating the log as stable storage.
> By calling write_inode_now we've written the changes to the log first
> and then gone and also written them out to the inode.
Pretty much right, but there are historical reasons for that
behaviour. The ->write_inode() path is the only
method for the higher layers to say "write this inode to disk".
That's how XFS has been treating it for a long time - as a command
to _physically_ write a dirty inode some time after it was first
changed and the transaction is already on disk.
Unfortunately, NFS uses the same call as a method for saying
"commit this changed inode to disk immediately", which is a
different semantic from the way the sync code uses it, and physical
inode IO really hurts here.
> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at). I have a patchset that changes
> this to an fsync so we force the log and call it good. I'll be happy to
> dust it off if someone hasn't already addressed this situation.
The delayed write inode flushing patchset I'm finalising does this.
We now have reliable tracking of dirty inodes in XFS and a method
for efficient physical writeback, so we no longer need to rely on
->write_inode to tell us to write inodes to disk. Hence the patchset
turns the inode write into an xfs_fsync() if it is a sync write, or
a delayed write if it is async. I'm hoping to have that ready for
.34 inclusion sometime next week...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: nfs performance delta between filesystems
From: Christoph Hellwig @ 2010-01-25 15:04 UTC (permalink / raw)
To: bpm; +Cc: xfs
On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> Hey Emmanuel,
>
> I did some research on this in April last year on an old, old kernel.
> One of the codepaths I flagged:
>
> nfsd_create
> write_inode_now
> __sync_single_inode
> write_inode
> xfs_fs_write_inode
> xfs_inode_flush
> xfs_iflush
>
> There were small gains to be had by reordering the sync of the parent and
> child syncs where the two inodes were in the same cluster. The larger
> problem seemed to be that we're not treating the log as stable storage.
> By calling write_inode_now we've written the changes to the log first
> and then gone and also written them out to the inode.
>
> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at). I have a patchset that changes
> this to an fsync so we force the log and call it good. I'll be happy to
> dust it off if someone hasn't already addressed this situation.
Dave and I had had some discussion about this when going through his
inode writeback changes. Changing to ->fsync might indeed be the
easiest option, but on the other hand I'm really trying to get rid of
the special case of ->fsync without a file argument in the VFS, as it
complicates stackable filesystem layers and also creates a rather
annoying and under- or undocumented assumption that filesystems that
need the file pointer can't be NFS exported. One option, if we want to
keep these semantics, is to add a new export operation just for
synchronization in NFS.
But given that the current use case in NFS is to pair one write_inode
call with one actual VFS operation, it might be better to just
automatically turn on the wsync mount option in XFS. We'd need a hook
from NFSD into the filesystem to implement this, but I've been looking
into adding that anyway to allow for checking other parameters, like
the file handle size, against filesystem limitations. Any chance you
could run your tests against a wsync filesystem?
But all this affects metadata performance, and only for sync exports,
while the OP does a simple dd, which is streaming data I/O and uses the
(extremely unsafe) async export option that disables the write_inode
calls.
* Re: nfs performance delta between filesystems
From: bpm @ 2010-01-25 20:28 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: xfs
On Mon, Jan 25, 2010 at 10:04:10AM -0500, Christoph Hellwig wrote:
> On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> > Hey Emmanuel,
> >
> > I did some research on this in April last year on an old, old kernel.
> > One of the codepaths I flagged:
> >
> > nfsd_create
> > write_inode_now
> > __sync_single_inode
> > write_inode
> > xfs_fs_write_inode
> > xfs_inode_flush
> > xfs_iflush
> >
> > There were small gains to be had by reordering the sync of the parent and
> > child syncs where the two inodes were in the same cluster. The larger
> > problem seemed to be that we're not treating the log as stable storage.
> > By calling write_inode_now we've written the changes to the log first
> > and then gone and also written them out to the inode.
> >
> > nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> > kernel I'm looking at). I have a patchset that changes
> > this to an fsync so we force the log and call it good. I'll be happy to
> > dust it off if someone hasn't already addressed this situation.
>
> Dave and I had had some discussion about this when going through his
> inode writeback changes. Changing to ->fsync might indeed be the
> easiest option, but on the other hand I'm really trying to get rid of
> the special case of ->fsync without a file argument in the VFS as it
> complicates stackable filesystem layers and also creates a rather
> annoying and under/un documented assumtion that filesystem that need
> the file pointer can't be NFS exported. One option if we want to
> keep these semantics is to add a new export operation just for
> synchronizations things in NFS.
>
> But given that the current use case in NFS is to pair one write_inode
> call with one actual VFS operation it might be better to just
> automatically turn on the wsync mount option in XFS, we'd need a hook
> from NFSD into the filesystem to implement this, but I've been looking
> into adding this anyway to allow for checking other paramters like the
> file handle size against filesystem limitations. Any chance you
> could run your tests against a wsync filesystem?
The original tests were done with the wsync mount option. I'm not
really sure that it was necessary. Test case was "tar -xvf
ImageMagick.tar". 'fdatasync' represents whether the export option
controlling usage of write_inode_now vs fsync was set.
internal log, no wsync, no fdatasync:  2m48.632s  2m59.676s  2m42.450s
internal log, wsync, no fdatasync:     3m1.320s   3m10.961s  2m53.560s
internal log, wsync, fdatasync:        1m40.191s  1m38.780s  1m35.758s
external log, no wsync, no fdatasync:  1m37.069s  1m37.850s  1m38.303s
external log, wsync, no fdatasync:     1m48.948s  1m47.804s  1m50.219s
external log, wsync, fdatasync:        1m19.265s  1m19.129s  1m19.635s
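The tar-extraction test above is a metadata-heavy workload (many small file creates, each of which hits the nfsd_create path when run over NFS). A scaled-down stand-in for the shape of that workload, run in a temp directory rather than on the filesystems under test:

```shell
# Approximate the shape of "tar -xvf ImageMagick.tar": create many small
# files, then count them. On a real run this would be timed (e.g. with
# time(1)) on an XFS mount with and without -o wsync and with the
# fdatasync export behaviour toggled.
DIR=$(mktemp -d)
i=0
while [ "$i" -lt 200 ]; do
    echo data > "$DIR/file$i"
    i=$((i+1))
done
COUNT=$(ls "$DIR" | wc -l)
echo "created $COUNT files"
rm -rf "$DIR"
```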
> But all this affects metadata performance, and only for sync exports,
> while the OP does a simple dd which is streaming data I/O and uses the
> (extremly unsafe) async export operation that disables the write_inode
> calls.
Right. This might not apply to Emmanuel's problem. I've been wondering
whether a recent change to not hold the inode mutex over the sync helps
in the streaming I/O case. Any idea?
Anyway, looks like Dave's patchset addresses this.
-Ben
* Re: nfs performance delta between filesystems
From: Christoph Hellwig @ 2010-01-25 20:40 UTC (permalink / raw)
To: bpm; +Cc: Christoph Hellwig, xfs
On Mon, Jan 25, 2010 at 02:28:39PM -0600, bpm@sgi.com wrote:
> The original tests were done with the wsync mount option. I'm not
> really sure that it was necessary. Test case was "tar -xvf
> ImageMagick.tar". 'fdatasync' represents whether the export option
> controlling usage of write_inode_now vs fsync was set.
Ok. Btw, you need to call ->fsync with fdatasync = 0 for NFS as it
also wants to catch non-data changes to the inode. Doesn't matter
for XFS as we currently always force a full fsync, but I'm going to
change that soon.
> internal log, no wsync, no fdatasync
> 2m48.632s 2m59.676s 2m42.450s
>
> internal log, wsync, no fdatasync
> 3m1.320s 3m10.961s 2m53.560s
>
> internal log, wsync, fdatasync
> 1m40.191s 1m38.780s 1m35.758s
The wsync case always still includes either the ->fsync or write_inode
call, right? If we use wsync we shouldn't need either in theory as
the transactions already commit synchronously.
Anyway, given the massive improvements of ->fsync vs write_inode you
really should post that patch to the NFS list for discussion ASAP.
> > But all this affects metadata performance, and only for sync exports,
> > while the OP does a simple dd which is streaming data I/O and uses the
> > (extremly unsafe) async export operation that disables the write_inode
> > calls.
>
> Right. This might not apply to Emmanuel's problem. I've been wondering
> if a recent change to not hold the inode mutex over the sync helps in
> the streaming io case. Any idea?
It should help a bit. I'm not sure it can cause that much of a
difference for such a simple single-threaded workload. Emmanuel, is
there any chance you could try the latest 2.6.32-stable kernel or even
2.6.33-rc as those changes are included there?
Thread overview: 7+ messages
2010-01-22 17:54 nfs performance delta between filesystems Emmanuel Florac
2010-01-22 18:38 ` bpm
2010-01-22 20:46 ` Emmanuel Florac
2010-01-23 12:30 ` Dave Chinner
2010-01-25 15:04 ` Christoph Hellwig
2010-01-25 20:28 ` bpm
2010-01-25 20:40 ` Christoph Hellwig