From: jehan procaccia <jehan.procaccia@int-evry.fr>
To: Neil Brown <neilb@cse.unsw.edu.au>
Cc: "Lever, Charles" <Charles.Lever@netapp.com>, nfs@lists.sourceforge.net
Subject: Re: async vs. sync
Date: Thu, 25 Nov 2004 00:14:36 +0100
Message-ID: <41A515DC.7010408@int-evry.fr>
In-Reply-To: <16805.2572.79895.275921@cse.unsw.edu.au>
Neil Brown wrote:
>On Wednesday November 24, jehan.procaccia@int-evry.fr wrote:
>
>
>>However now the tar extraction goes very fast but stops for 1 or 2 seconds and
>>restarts fast -> there are some hangs. Here with a 16MB journal I got 15
>>hangs of 1-2 seconds; with a 128 MB journal I get only 3 hangs, but they last 4 or
>>5 seconds. I checked on the nfs server with iostat at the moment of a hang,
>>and disk utilisation goes from a few % to 316 % in the example
>>below (for the 128 MB journal, within the 4-second hangs it goes to 4700 % !)
>>Device:         rrqm/s wrqm/s   r/s    w/s  rsec/s  wsec/s  rkB/s   wkB/s avgrq-sz avgqu-sz await svctm %util
>>/dev/emcpowerl2   0.00 150.67 97.33 224.00  768.00 3018.67 384.00 1509.33    11.78    33.33 19.79  9.83 316.00
>>
>>Maybe it hangs because the journal commits on the SP ! ?
>>
>>
>>
>
>It hangs because of some clumsy code in ext3 that no-one has bothered
>to fix yet - I had a look once but it was a little beyond the time I
>had to spare.
>
>When information is written to the journal, it stays in memory as well
>and is eventually written out to the main filesystem using normal
>lazy-flushing mechanisms (data is pushed out either due to memory
>pressure or because it has been idle for too long).
>When ext3 wants to add information to the head of the journal, it
>needs to clean up the tail to make space.
>If it finds that the data that was written to the tail is already
>safe in the main filesystem, it just frees up some of the tail and
>starts using it for a new head.
>HOWEVER, if it finds that the data in the tail hasn't made it to the
>main filesystem, it flushes *ALL* of the data in the journal out to
>the main filesystem. (It should only flush some fraction or fixed
>number of blocks or something). This flushing causes a very
>noticeable pause. The larger the journal, the less often the flush is
>needed, but the longer the flush lasts for.
>
>There are two ways to avoid this pause. One I have tested and works
>well. The other only just occurred to me and I haven't tried.
>
>The untested one involves making the journal larger than main memory.
>If it is that large, then memory pressure should flush out journal
>blocks before the journal wraps back to them, and so the flush should
>never happen. However such a large journal may cause other problems
>(slow replay) as mentioned in my other email.
>
>The way that works is to adjust the "bdflush" parameters so that data
>is flushed to disk more quickly. The default is to flush data once it
>is 30 seconds old. If you reduce that to 5 seconds, the problem goes
>away.
>
>For 2.4, I put
>vm.bdflush = 30 500 0 0 100 500 60 20 0
>
>in my /etc/sysctl.conf, which is equivalent to running
> echo 30 500 0 0 100 500 60 20 0 > /proc/sys/vm/bdflush
>
>
$ uname -r
2.4.21-4.ELsmp
here's what I had before setting the above:
$ cat /proc/sys/vm/bdflush
50 500 0 0 500 3000 80 50 0
Now indeed the pauses seem to be shorter (I saw 12 instead of 15, and
they lasted less than 1 s).
[root@arvouin Nfs-test]# time tar xvfz
/usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real 1m22.504s
user 0m0.898s
sys 0m2.846s
On a 128MB journal it's even better; I don't see any pauses (I had at
least 3 of 4-5 seconds each before).
[root@arvouin Nfs-test]# time tar xvfz
/usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real 0m25.038s
user 0m0.914s
sys 0m2.477s
Very good :-)
Just for the record, so that I'm sure how I got this performance, here
are the server's export options (data=journal in /etc/fstab for that FS !):
$ cat /var/lib/nfs/xtab
/mnt/emcpowerm1
arvouin.int-evry.fr(rw,sync,no_wdelay,hide,nocrossmnt,secure,no_root_squash,no_all_squash,subtree_check,secure_locks,no_acl,mapping=identity,anonuid=-2,anongid=-2)
and client mount option
[root@arvouin Nfs-test]# cat /proc/mounts
cobra3:/mnt/emcpowerm1 /mnt/cobra3extjournal nfs
rw,v3,rsize=8192,wsize=8192,hard,tcp,lock,addr=cobra3 0 0
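For anyone trying to reproduce this setup, here is a sketch of how an ext3
filesystem with an external journal and data=journal is typically created
(the device names below are placeholders, not the actual devices from this
thread):

```shell
# Dedicate a small block device to hold the ext3 journal
# (placeholder device names; adapt to your hardware).
mke2fs -O journal_dev /dev/sdb1

# Create the data filesystem, pointing it at the external journal.
mke2fs -j -J device=/dev/sdb1 /dev/sdc1

# Mount with full data journalling, e.g. via an /etc/fstab line:
#   /dev/sdc1  /mnt/export  ext3  data=journal  0 2
mount -o data=journal /dev/sdc1 /mnt/export
```

The journal size is chosen at mke2fs time (-J size=...), which is where the
16MB vs. 128MB comparison above comes from.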
To be sure of the improvement from the "hack" on /proc/sys/vm/bdflush, I
set it back to the original values:
$ echo 50 500 0 0 500 3000 80 50 0 > /proc/sys/vm/bdflush
and dynamically (without unmounting or remounting anything on either side) tested again:
[root@arvouin Nfs-test]# time tar xvfz
/usr/src/redhat/SOURCES/httpd-2.0.51.tar.gz
real 1m19.655s
user 0m0.860s
sys 0m2.612s
The time is longer and the pauses are worse than I thought -> 3 pauses of
approximately 10 to 15 seconds each !
So it seems to be very good advice to echo 30 500 0 0 100 500 60 20 0
> /proc/sys/vm/bdflush :-)
However, this is a global setting; will it disturb other devices ?
What does each figure mean here ? Why were they set to a non-optimal
value in the 1st place ?
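For what it's worth, my reading of Documentation/sysctl/vm.txt in the 2.4
tree gives the following meaning for the nine fields. Treat this as an
annotated sketch, not authoritative: field names and exact semantics vary
between 2.4 releases.

```shell
# Annotated sketch of the nine /proc/sys/vm/bdflush fields in 2.4,
# using Neil's suggested values:  30 500 0 0 100 500 60 20 0
#
#  1. nfract        30   % of dirty buffers that wakes bdflush
#  2. ndirty       500   max buffers written per bdflush wakeup
#  3. (unused)       0
#  4. (unused)       0
#  5. interval     100   jiffies between kupdated runs (1 s at HZ=100)
#  6. age_buffer   500   jiffies before a dirty buffer is flushed
#                        (5 s here, vs. 3000 = 30 s in the default)
#  7. nfract_sync   60   % dirty at which writers block synchronously
#  8. nfract_stop   20   % dirty at which bdflush stops flushing
#  9. (unused)       0
echo 30 500 0 0 100 500 60 20 0 > /proc/sys/vm/bdflush
```

On this reading, only field 6 (age_buffer, 3000 -> 500 jiffies) and field 5
(interval) were lowered meaningfully, which matches Neil's "30 seconds down
to 5" description; the other changes look like side adjustments.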
PS: a different optimisation:
I've read that "the maximum block size is
defined by the value of the kernel constant *NFSSVC_MAXBLKSIZE*,
found in the Linux kernel source file ./include/linux/nfsd/const.h".
Is there a way to change my actual 8K buffer size to 32K without
recompiling the kernel ?
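My understanding (hedged; I have not verified this on 2.4.21) is that the
client can request larger sizes at mount time, but the server clamps them
to its compiled-in NFSSVC_MAXBLKSIZE, so a request like the following only
helps if the server was built with a larger maximum:

```shell
# Request 32K transfers from the client side; the server silently
# clamps to its compiled-in NFSSVC_MAXBLKSIZE if that is smaller.
mount -t nfs -o rsize=32768,wsize=32768,tcp,hard \
    cobra3:/mnt/emcpowerm1 /mnt/cobra3extjournal

# Check which rsize/wsize were actually negotiated:
grep cobra3 /proc/mounts
```

If /proc/mounts still shows rsize=8192,wsize=8192 after such a mount, the
8K limit is on the server side and a kernel rebuild would be needed.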
thanks.
>For 2.6, I assume you would
> echo 500 > /proc/sys/vm/dirty_expire_centisecs
>but I haven't tested this.
>
>
>
>
>>Well, finally, is this safer in terms of performances to externalize
>>journal than using async export ?
>>
>>
>
>Absolutely, providing you trust the hardware that you are storing your
>journal on.
>An external journal is perfectly safe.
>async export is not.
>
>NeilBrown
>
>