public inbox for linux-xfs@vger.kernel.org
* nfs performance delta between filesystems
From: Emmanuel Florac @ 2010-01-22 17:54 UTC (permalink / raw)
  To: xfs


Here is an interesting mystery to solve: XFS is always much faster
locally than all the other filesystems, yet noticeably slower through NFS.

I'm benchmarking a RAID-10 setup with various filesystems: ext3, reiserfs
(3.6), jfs and xfs. I'm interested strictly in continuous sequential
writes (data recording) through NFS (plain 1 Gb Ethernet), so I'm
benchmarking with

dd if=/dev/zero of=/mnt/whatever/file bs=1M

following the throughput with iostat.
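For reference, the whole measurement can be scripted; this is a minimal sketch (the helper name and default arguments are mine, not from the thread), sampling per-device throughput with iostat while dd streams zeros:

```shell
#!/bin/sh
# run_seq_write: stream COUNT MiB of zeros to TARGET with dd, optionally
# sampling per-device throughput once per second with iostat (if installed).
run_seq_write() {
    target=$1; count=$2; dev=${3:-}
    if [ -n "$dev" ] && command -v iostat >/dev/null 2>&1; then
        iostat -m "$dev" 1 >"$target.iostat" &   # background sampler
        pid=$!
    fi
    dd if=/dev/zero of="$target" bs=1M count="$count"
    status=$?
    [ -n "${pid:-}" ] && kill "$pid" 2>/dev/null
    return $status
}

# e.g.: run_seq_write /mnt/raid/testdd 4096 md2
```

dd prints the average rate at the end; the iostat log shows whether the throughput was steady or bursty over the run.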

Setup: Linux 2.6.27.21, software RAID-10 (64K stripe), 1 GB RAM, 64-bit
kernel, 4 Savvio 500 GB disks, Phenom II, Intel Pro/1000 NIC.

All of the following tests were run several times each, to even out the
results.

Locally, xfs mounted this way:

/dev/md2 on /mnt/raid type xfs (rw,noatime,nobarrier)

is way faster than all the other filesystems:

~# dd if=/dev/zero of=/mnt/raid/testdd bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 23.0655 s, 186 MB/s

ext3, jfs and reiserfs all score about the same, around 150 MB/s.

I'm exporting the mountpoint through NFS:

/mnt/raid    *(fsid=0,rw,async,no_root_squash,no_subtree_check)

And mounted it with no options (or just "tcp") on the client. The
results are then quite different: XFS gives a bit under 90 MB/s (not
bad):

 ~$ sudo dd if=/dev/zero of=/mnt/temp/testnfsdd1 bs=1M count=2048
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 24.1755 s, 88.8 MB/s

But... reiserfs gives ~92 MB/s, and ext3 95 MB/s. Both are consistently
faster through NFS, while they're consistently much slower locally...
How come?

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* Re: nfs performance delta between filesystems
From: bpm @ 2010-01-22 18:38 UTC (permalink / raw)
  To: Emmanuel Florac; +Cc: xfs

Hey Emmanuel,

I did some research on this in April last year on an old, old kernel.
One of the codepaths I flagged:

nfsd_create
  write_inode_now
    __sync_single_inode
      write_inode
        xfs_fs_write_inode
          xfs_inode_flush
            xfs_iflush

There were small gains to be had by reordering the parent and child
inode syncs where the two inodes were in the same cluster.  The larger
problem seemed to be that we're not treating the log as stable storage:
by calling write_inode_now we've written the changes to the log first,
and then gone and also written them out to the inode.

nfsd_create, nfsd_link, and nfsd_setattr all do this (or did in the old
kernel I'm looking at).  I have a patchset that changes this to an
fsync, so we force the log and call it good.  I'll be happy to dust it
off if someone hasn't already addressed this.

-Ben


* Re: nfs performance delta between filesystems
From: Emmanuel Florac @ 2010-01-22 20:46 UTC (permalink / raw)
  To: bpm; +Cc: xfs

On Fri, 22 Jan 2010 12:38:48 -0600, you wrote:

> There were small gains to be had by reordering the sync of the parent
> and child syncs where the two inodes were in the same cluster.  The
> larger problem seemed to be that we're not treating the log as stable
> storage. By calling write_inode_now we've written the changes to the
> log first and then gone and also written them out to the inode.  

I thought that using nobarrier would prevent the log operations from
being synced to disk too eagerly? Could I improve this behaviour further
by using an external log on very fast storage (a ramdisk...)?
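For what it's worth, an external log has to be set up at mkfs/mount time. A rough sketch, with /dev/fast_log standing in for the actual log device (note that a ramdisk log would be unsafe: the log is exactly what XFS treats as stable storage, so losing it on power failure defeats the point):

```shell
# Create the filesystem with an external log on a separate device
# (/dev/fast_log is a placeholder), then mount pointing at that log.
mkfs.xfs -l logdev=/dev/fast_log,size=128m /dev/md2
mount -o noatime,logdev=/dev/fast_log /dev/md2 /mnt/raid
```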

> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at).  I have a patchset that changes
> this to an fsync so we force the log and call it good.  I'll be happy
> to dust it off if someone hasn't already addressed this situation.

Well, if you think this patchset can be applied to a current kernel
without too much hacking I'm ready to give it a try :)

-- 
------------------------------------------------------------------------
Emmanuel Florac     |   Direction technique
                    |   Intellique
                    |	<eflorac@intellique.com>
                    |   +33 1 78 94 84 02
------------------------------------------------------------------------


* Re: nfs performance delta between filesystems
From: Dave Chinner @ 2010-01-23 12:30 UTC (permalink / raw)
  To: bpm; +Cc: xfs

On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> Hey Emmanuel,
> 
> I did some research on this in April last year on an old, old kernel.
> One of the codepaths I flagged:
> 
> nfsd_create
>   write_inode_now
>     __sync_single_inode
>       write_inode
>         xfs_fs_write_inode
> 	  xfs_inode_flush
> 	    xfs_iflush
> 
> There were small gains to be had by reordering the sync of the parent and
> child syncs where the two inodes were in the same cluster.  The larger
> problem seemed to be that we're not treating the log as stable storage.
> By calling write_inode_now we've written the changes to the log first
> and then gone and also written them out to the inode.  

Pretty much right, but there are historical reasons for that
behaviour. The ->write_inode() path is the only
method for the higher layers to say "write this inode to disk".
That's how XFS has been treating it for a long time - as a command
to _physically_ write a dirty inode some time after it was first
changed and the transaction is already on disk.

Unfortunately, NFS is using the same call as a method for saying
"commit this changed inode to disk immediately", which is a
different semantic from the way the sync code uses it, and physical
inode IO really hurts here.

> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at).  I have a patchset that changes
> this to an fsync so we force the log and call it good.  I'll be happy to
> dust it off if someone hasn't already addressed this situation.

The delayed write inode flushing patchset I'm finalising does this.
We now have reliable tracking of dirty inodes in XFS and a method
for efficient physical writeback, so we no longer need to rely on
->write_inode to tell us to write inodes to disk. Hence the patchset
turns the inode write into an xfs_fsync() if it is a sync write, or
a delayed write if it is async.  I'm hoping to have that ready for
.34 inclusion sometime next week...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: nfs performance delta between filesystems
From: Christoph Hellwig @ 2010-01-25 15:04 UTC (permalink / raw)
  To: bpm; +Cc: xfs

On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> Hey Emmanuel,
> 
> I did some research on this in April last year on an old, old kernel.
> One of the codepaths I flagged:
> 
> nfsd_create
>   write_inode_now
>     __sync_single_inode
>       write_inode
>         xfs_fs_write_inode
> 	  xfs_inode_flush
> 	    xfs_iflush
> 
> There were small gains to be had by reordering the sync of the parent and
> child syncs where the two inodes were in the same cluster.  The larger
> problem seemed to be that we're not treating the log as stable storage.
> By calling write_inode_now we've written the changes to the log first
> and then gone and also written them out to the inode.  
> 
> nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> kernel I'm looking at).  I have a patchset that changes
> this to an fsync so we force the log and call it good.  I'll be happy to
> dust it off if someone hasn't already addressed this situation.

Dave and I had some discussion about this when going through his
inode writeback changes.  Changing to ->fsync might indeed be the
easiest option, but on the other hand I'm really trying to get rid of
the special case of ->fsync without a file argument in the VFS, as it
complicates stackable filesystem layers and also creates a rather
annoying and under- or undocumented assumption that filesystems that
need the file pointer can't be NFS exported.  One option, if we want
to keep these semantics, is to add a new export operation just for
synchronization in NFS.

But given that the current use case in NFS is to pair one write_inode
call with one actual VFS operation, it might be better to just
automatically turn on the wsync mount option in XFS.  We'd need a hook
from NFSD into the filesystem to implement this, but I've been looking
into adding one anyway, to allow checking other parameters like the
file handle size against filesystem limitations.  Any chance you
could run your tests against a wsync filesystem?
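The suggested re-test would just be a remount with wsync before re-running the workload; a sketch, reusing the device and mountpoint from Emmanuel's setup:

```shell
# Remount with synchronous metadata transactions, then re-apply the
# NFS exports so clients see the wsync-backed filesystem.
umount /mnt/raid
mount -o noatime,wsync /dev/md2 /mnt/raid
exportfs -ra
```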

But all this affects metadata performance, and only for sync exports,
while the OP does a simple dd, which is streaming data I/O and uses the
(extremely unsafe) async export option that disables the write_inode
calls.


* Re: nfs performance delta between filesystems
From: bpm @ 2010-01-25 20:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Mon, Jan 25, 2010 at 10:04:10AM -0500, Christoph Hellwig wrote:
> On Fri, Jan 22, 2010 at 12:38:48PM -0600, bpm@sgi.com wrote:
> > Hey Emmanuel,
> > 
> > I did some research on this in April last year on an old, old kernel.
> > One of the codepaths I flagged:
> > 
> > nfsd_create
> >   write_inode_now
> >     __sync_single_inode
> >       write_inode
> >         xfs_fs_write_inode
> > 	  xfs_inode_flush
> > 	    xfs_iflush
> > 
> > There were small gains to be had by reordering the sync of the parent and
> > child syncs where the two inodes were in the same cluster.  The larger
> > problem seemed to be that we're not treating the log as stable storage.
> > By calling write_inode_now we've written the changes to the log first
> > and then gone and also written them out to the inode.  
> > 
> > nfsd_create, nfsd_link, and nfsd_setattr all do this (or do in the old
> > kernel I'm looking at).  I have a patchset that changes
> > this to an fsync so we force the log and call it good.  I'll be happy to
> > dust it off if someone hasn't already addressed this situation.
> 
> Dave and I had had some discussion about this when going through his
> inode writeback changes.  Changing to ->fsync might indeed be the
> easiest option, but on the other hand I'm really trying to get rid of
> the special case of ->fsync without a file argument in the VFS as it
> complicates stackable filesystem layers and also creates a rather
> annoying and under- or undocumented assumption that filesystems that
> need the file pointer can't be NFS exported.  One option if we want to
> keep these semantics is to add a new export operation just for
> synchronization in NFS.
> 
> But given that the current use case in NFS is to pair one write_inode
> call with one actual VFS operation it might be better to just
> automatically turn on the wsync mount option in XFS, we'd need a hook
> from NFSD into the filesystem to implement this, but I've been looking
> into adding this anyway to allow for checking other parameters like the
> file handle size against filesystem limitations.  Any chance you
> could run your tests against a wsync filesystem?

The original tests were done with the wsync mount option; I'm not
really sure it was necessary.  The test case was "tar -xvf
ImageMagick.tar".  'fdatasync' indicates whether the export option
controlling the use of write_inode_now vs fsync was set.

configuration                           run 1        run 2        run 3
internal log, no wsync, no fdatasync    2m48.632s    2m59.676s    2m42.450s
internal log, wsync,    no fdatasync    3m1.320s     3m10.961s    2m53.560s
internal log, wsync,    fdatasync       1m40.191s    1m38.780s    1m35.758s
external log, no wsync, no fdatasync    1m37.069s    1m37.850s    1m38.303s
external log, wsync,    no fdatasync    1m48.948s    1m47.804s    1m50.219s
external log, wsync,    fdatasync       1m19.265s    1m19.129s    1m19.635s
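For reproducibility, the three timings per configuration presumably come from a loop along these lines (a sketch; the helper name is mine, not from the thread):

```shell
# time_untar: extract TARBALL into DIR three times, timing each run and
# flushing dirty data between runs so one run's writeback doesn't
# overlap the next.
time_untar() {
    tarball=$1; dir=$2
    for i in 1 2 3; do
        rm -rf "$dir"/*
        sync
        time tar -xf "$tarball" -C "$dir"
    done
}

# e.g.: time_untar ImageMagick.tar /mnt/temp
```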

> But all this affects metadata performance, and only for sync exports,
> while the OP does a simple dd which is streaming data I/O and uses the
> (extremely unsafe) async export option that disables the write_inode
> calls.

Right.  This might not apply to Emmanuel's problem.  I've been wondering
whether a recent change to not hold the inode mutex over the sync helps
in the streaming I/O case.  Any idea?

Anyway, looks like Dave's patchset addresses this.

-Ben


* Re: nfs performance delta between filesystems
From: Christoph Hellwig @ 2010-01-25 20:40 UTC (permalink / raw)
  To: bpm; +Cc: Christoph Hellwig, xfs

On Mon, Jan 25, 2010 at 02:28:39PM -0600, bpm@sgi.com wrote:
> The original tests were done with the wsync mount option.  I'm not
> really sure that it was necessary.  Test case was "tar -xvf
> ImageMagick.tar".  'fdatasync' represents whether the export option
> controlling usage of write_inode_now vs fsync was set.

Ok.  Btw, you need to call ->fsync with fdatasync = 0 for NFS, as it
also wants to catch non-data changes to the inode.  That doesn't matter
for XFS, as we currently always force a full fsync, but I'm going to
change that soon.

> internal log, no wsync, no fdatasync
> 2m48.632s       2m59.676s       2m42.450s
> 
> internal log, wsync, no fdatasync
> 3m1.320s        3m10.961s       2m53.560s
> 
> internal log, wsync, fdatasync
> 1m40.191s       1m38.780s       1m35.758s

The wsync case still always includes either the ->fsync or the
write_inode call, right?  If we use wsync we shouldn't need either, in
theory, as the transactions already commit synchronously.

Anyway, given the massive improvement of ->fsync over write_inode, you
really should post that patch to the NFS list for discussion ASAP.


> > But all this affects metadata performance, and only for sync exports,
> > while the OP does a simple dd which is streaming data I/O and uses the
> > (extremely unsafe) async export option that disables the write_inode
> > calls.
> 
> Right.  This might not apply to Emmanuel's problem.  I've been wondering
> if a recent change to not hold the inode mutex over the sync helps in
> the streaming io case.  Any idea?

It should help a bit, but I'm not sure it can make that much of a
difference for such a simple single-threaded workload.  Emmanuel, is
there any chance you could try the latest 2.6.32-stable kernel, or even
2.6.33-rc, as those changes are included there?

