From: Greg Banks <gnb@sgi.com>
To: Olaf Kirch <okir@suse.de>
Cc: nfs@lists.sourceforge.net
Subject: Re: nfsd write throughput
Date: Tue, 3 Aug 2004 12:10:18 +1000
Message-ID: <20040803021018.GG5581@sgi.com>
In-Reply-To: <20040802162448.GB21365@suse.de>

On Mon, Aug 02, 2004 at 06:24:49PM +0200, Olaf Kirch wrote:
> @@ -810,6 +811,22 @@
>  		}
>  		last_ino = inode->i_ino;
>  		last_dev = inode->i_sb->s_dev;
> +	} else if (err >= 0 && !stable) {
> +		/* If we've been writing several pages, schedule them
> +		 * for the disk immediately. The client may be streaming
> +		 * and we don't want to hang on a huge journal sync when the
> +		 * commit comes in
> +		 */
> +		struct address_space	*mapping;
> +
> +		/* This assumes a minimum page size of 1K, and will issue
> +		 * a filemap_flushfast call every 64 pages written by the
> +		 * client. */
> +		if ((cnt & 1023) == 0
> +		 && ((offset / cnt) & 63) == 0
> +		 && (mapping = inode->i_mapping) != NULL
> +		 && !bdi_write_congested(mapping->backing_dev_info))
> +			filemap_flushfast(mapping);
>  	}
>  
>  	dprintk("nfsd: write complete err=%d\n", err);

Olaf, I think this patch has problems.

First, the way the v3 server is supposed to work is that normal page
cache pressure pushes pages from unstable writes to disk before the
COMMIT call arrives from the client.  The best way to achieve this
for a dedicated NFS server box is to tune the pdflush parameters
to be more aggressive about writing back dirty pages, e.g. by lowering
the following in /proc/sys/vm: dirty_background_ratio, dirty_ratio,
dirty_writeback_centisecs, and dirty_expire_centisecs.  I have to
admit I've not tried this yet on 2.6, but the equivalent on 2.4 has
been generally useful.
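
Before lowering any of these, it helps to record what the server is
currently using.  The following is a minimal user-space sketch that
only reads the four tunables named above.  The helper names
read_vm_tunable() and print_writeback_tunables() are made up for this
illustration, and the directory is passed in (normally /proc/sys/vm)
rather than hard-coded.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer tunable such as "dirty_ratio" from a sysctl directory.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed.
 * (Hypothetical helper, not part of any kernel or libc API.) */
int read_vm_tunable(const char *dir, const char *name, long *value)
{
	char path[256];
	FILE *fp;
	int n;

	assert(dir && name && value);
	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	n = fscanf(fp, "%ld", value);
	fclose(fp);
	return n == 1 ? 0 : -1;
}

/* Print the four writeback tunables; returns how many could be read. */
int print_writeback_tunables(const char *dir)
{
	static const char *const names[] = {
		"dirty_background_ratio", "dirty_ratio",
		"dirty_writeback_centisecs", "dirty_expire_centisecs",
	};
	int found = 0;

	for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
		long v;

		if (read_vm_tunable(dir, names[i], &v) == 0) {
			printf("%s = %ld\n", names[i], v);
			found++;
		}
	}
	return found;
}
```

Calling print_writeback_tunables("/proc/sys/vm") before and after a
tuning change gives a simple record of what was altered.  Changing the
values themselves needs root and is not shown here.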

I think another useful approach would be to write back pages which
have been written by NFS unstable writes at a faster rate than pages
written by local applications, i.e. add a new /proc/sys/vm/ sysctl like
nfs_dirty_writeback_centisecs and a per-page flag.  With a separate
sysctl the default value can be smaller, so that you get the desired
behaviour for NFS pages without the sysadmin having to do page cache
tuning or perturb the behaviour of local IO.

The justification for this approach is that data in such pages is
most likely also stored in the clients' page caches.  Recent IRIX
releases do this, and I have an open bug to implement something like
that in Linux.
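
To make the proposal concrete, here is a small user-space model of the
decision it implies.  Everything in it is hypothetical: struct
model_page, its nfs_unstable field, and page_due_for_writeback() are
invented for this sketch and do not exist in the kernel.  Treating the
sysctl as a simple per-page age limit is also a simplifying assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a dirty page; not a kernel structure. */
struct model_page {
	unsigned long dirtied_at;	/* when the page was dirtied, in centisecs */
	bool nfs_unstable;		/* the proposed per-page flag */
};

/* The existing interval plus the proposed NFS-specific one. */
struct writeback_tunables {
	unsigned long dirty_writeback_centisecs;
	unsigned long nfs_dirty_writeback_centisecs;	/* proposed, smaller default */
};

/* A page is due once its age reaches the interval that applies to it:
 * the NFS interval for pages dirtied by unstable writes, the ordinary
 * interval for everything else. */
bool page_due_for_writeback(const struct model_page *pg,
			    const struct writeback_tunables *t,
			    unsigned long now)
{
	unsigned long limit;

	assert(pg && t && now >= pg->dirtied_at);
	limit = pg->nfs_unstable ? t->nfs_dirty_writeback_centisecs
				 : t->dirty_writeback_centisecs;
	return now - pg->dirtied_at >= limit;
}
```

With the ordinary interval at 500 and the NFS interval at 100, a page
dirtied by an unstable write becomes due after one second while a
locally dirtied page of the same age is left alone, which is the
separation the extra sysctl is meant to provide.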

Second, I have several problems with the heuristics for choosing when
to call filemap_flushfast().
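
To see which WRITE calls satisfy the test, the arithmetic part of the
quoted condition can be lifted into user space.  This is only a sketch:
flush_heuristic_fires() and count_flushes_sequential() are names made
up here, and the i_mapping and bdi_write_congested() checks from the
patch are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arithmetic part of the quoted condition:
 *   (cnt & 1023) == 0 && ((offset / cnt) & 63) == 0
 * The mapping and congestion checks are not modelled. */
bool flush_heuristic_fires(uint64_t offset, uint32_t cnt)
{
	assert(cnt > 0);	/* this model refuses to divide by zero */
	return (cnt & 1023) == 0 && ((offset / cnt) & 63) == 0;
}

/* Count how many of n back-to-back WRITEs of cnt bytes, starting at
 * offset 0, satisfy the condition. */
unsigned count_flushes_sequential(uint32_t cnt, unsigned n)
{
	unsigned fired = 0;

	for (unsigned i = 0; i < n; i++)
		if (flush_heuristic_fires((uint64_t)i * cnt, cnt))
			fired++;
	return fired;
}
```

Read literally, sequential 32K WRITEs starting at offset 0 satisfy the
test once every 64 calls, and 1024-byte WRITEs at offsets that are
multiples of 64 KB satisfy it on every call.  What each firing then
pushes to disk depends on filemap_flushfast(), which is not modelled
here.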

For example, imagine the disk backend is a hardware RAID5 with a
stripe size of 128K or greater and the client is doing streaming
32K WRITE calls.  With your patch, every second WRITE call will now
try to write half a RAID stripe unit, requiring the RAID controller
to read the other half to update parity, which will significantly
hurt performance.  Similar bad things happen if the client is doing
strided or random writes of 1024 B at offsets which are multiples
of 64 KB.
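
The cost of a partial write can be put in numbers with a simplified
model.  The sketch below treats "stripe" as the unit over which parity
is computed and assumes the controller must read whatever part of each
touched stripe the write does not cover.  Real controllers may take
other routes to the new parity, and both function names are invented
for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if [offset, offset + len) covers only whole stripes, so parity
 * can be computed from the new data alone. */
bool is_full_stripe_write(uint64_t offset, uint64_t len, uint64_t stripe)
{
	assert(stripe > 0);
	return len > 0 && offset % stripe == 0 && len % stripe == 0;
}

/* Bytes of the touched stripes that the write does NOT cover.  In this
 * model that is the amount the controller has to read back before it
 * can update parity. */
uint64_t uncovered_bytes(uint64_t offset, uint64_t len, uint64_t stripe)
{
	uint64_t start, end;

	assert(stripe > 0);
	if (len == 0)
		return 0;
	start = offset - offset % stripe;			/* round down */
	end = (offset + len + stripe - 1) / stripe * stripe;	/* round up */
	return (end - start) - len;
}
```

For a 128K stripe, a 64K write at offset 0 leaves 64K uncovered, which
is the "other half" case described above, and a 1024-byte write at a
64 KB offset leaves 130048 bytes uncovered.  A 128K write aligned on a
stripe boundary leaves nothing to read.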

If the disk writes are being pushed by the normal page cache
mechanisms, then the normal page cache and filesystem write clustering
have at least some chance (and the necessary state, e.g. XFS is aware
of the hardware RAID stripe parameters) of constructing writes of an
appropriate size.  Whether the page cache and fs actually do the right
thing is another matter, but that's where the responsibility lies.


Greg.
-- 
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.


